
At movingimage, an enterprise video platform leader in Europe, our mission is to revolutionize how companies use video. Our engineering teams improve our platform every day as we continue to innovate. We prioritize stability, reliability, and performance, but we found that we tended to overprovision the Kubernetes containers running in our Microsoft Azure cloud environment.

We started to ask ourselves if we really needed all the resources we allocated, and if we were introducing stability issues by under-allocating resources for some containers in our production environment. We didn’t have all the answers, nor did we have a reliable way to understand resource usage and requirements within containers.

Our video content management system needs high availability and performance. However, we were facing issues that took too long to diagnose and fix. While we had some open source infrastructure monitoring tools, they didn’t help us truly understand how our software was running. When our open source stack detected an issue, it didn’t have the telemetry data or analysis to help pinpoint problems in the software. Often, the only option for the engineering team was to manually review application logs—a tedious and time-consuming effort. 

An overview of movingimage's offerings.

From oversized VMs to overprovisioned containers 

When the company moved to Microsoft Azure, teams migrated our applications using a lift-and-shift approach. Some of the virtual machines (VMs) in our on-premises environment were massively oversized, so when the teams migrated the applications, they similarly oversized the containers in Microsoft Azure. Over time, as the number of containers grew, so did the tendency to overprovision. For example, we provisioned five cores for a Node.js application. Because Node.js runs application code on a single thread, we were typically wasting four of those cores per container.

At the same time, we knew there were probably some containers that were undersized, resulting in potential stability and performance issues. We weren’t simply trying to make everything as small as possible; we wanted to give each container a meaningful size based on insights into real-world resource consumption.
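Spotting those candidates comes down to comparing what a container actually uses against what it requests. As a rough sketch rather than our exact dashboard query, an NRQL query along these lines ranks containers by how much of their requested CPU they really consume (K8sContainerSample and its attributes come from New Relic’s Kubernetes integration; the cluster name 'k8s-production' is a hypothetical placeholder):

    // Share of requested CPU each container actually used over the past week
    SELECT average(cpuUsedCores) / latest(cpuRequestedCores) * 100
      AS 'CPU used vs. requested (%)'
    FROM K8sContainerSample
    WHERE clusterName = 'k8s-production'
    FACET containerName
    SINCE 1 week ago LIMIT 50

Containers idling in the low single digits are strong rightsizing candidates, while anything consistently near 100% may be undersized instead.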

Creating a container dashboard

When I joined the company as DevSecOps team lead, one of the first things I asked was whether we could deploy an observability platform. I got approval to trial New Relic in our environment. Suddenly we had visibility down to the code level, with application metrics that could inform data-driven decisions. We could now respond immediately to issues impacting customers, and predict and resolve problems before they arose.

I’d previously used New Relic custom dashboards to go beyond cloud-level optimization and drill into metrics within individual Kubernetes containers. Taking the same approach, I created a new dashboard and used New Relic Query Language (NRQL) to show consumption metrics for our three movingimage environments: production, quality assurance (QA), and development, each running approximately 150 containers.
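The widgets behind that dashboard are plain NRQL. As an illustrative sketch, assuming the attribute names from New Relic’s Kubernetes integration (they can vary by integration version), a per-container consumption table across clusters looks roughly like this:

    // Average and peak CPU per container, alongside its request and limit
    SELECT average(cpuUsedCores) AS 'Avg CPU (cores)',
           max(cpuUsedCores) AS 'Peak CPU (cores)',
           latest(cpuRequestedCores) AS 'Requested (cores)',
           latest(cpuLimitCores) AS 'Limit (cores)'
    FROM K8sContainerSample
    FACET clusterName, containerName
    SINCE 1 week ago LIMIT MAX

Faceting by clusterName keeps production, QA, and development side by side in a single widget.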

movingimage uses this dashboard to evaluate infrastructure data. It helps the team achieve significant cost savings by providing visibility into which containers are good or bad candidates for rightsizing, what size each container should be, and possible pitfalls.

We’re a video company, so we often use video to explain new things. I made a video for my colleagues explaining how the dashboard works and how they could interpret and use the data. I described the metrics New Relic tracks, what they mean, and the consequences of making changes based on the optimization insights in the dashboard. When our CEO and CTO saw the video, they asked if we could implement the optimizations. They saw the potential for significant cost savings.

Our engineering team first tested our optimization hypotheses. Then we provided each development team with rightsizing recommendations for the applications it owned.
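Memory got the same treatment as CPU. As a hedged sketch of the kind of query behind those recommendations, assuming each team’s applications run in their own namespace and using the memory attributes from the same Kubernetes integration:

    // Peak memory working set vs. configured limit, grouped by namespace
    SELECT max(memoryWorkingSetBytes) / 1048576 AS 'Peak working set (MiB)',
           latest(memoryLimitBytes) / 1048576 AS 'Limit (MiB)'
    FROM K8sContainerSample
    WHERE clusterName = 'k8s-production'
    FACET namespaceName, containerName
    SINCE 1 week ago LIMIT 100

A large, persistent gap between peak working set and limit signals memory that can safely be handed back; a narrow gap warns against trimming further.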

Cutting wasted resources and costs in half 

Before we started the container optimization initiative, we were running between 40 and 50 nodes in production in the Microsoft Azure environment. Now, we run between 15 and 30 nodes in production, averaging approximately 20 nodes. By rightsizing each container environment, we cut the number of nodes—and correspondingly, compute costs—in half. Better yet, customers noticed the difference. When we rightsized containers, we also identified and corrected environments that were under-provisioned, which improved stability and performance. 

We did the same thing in our QA cluster, going from 40 to 60 nodes on an average day down to fewer than 20: a roughly 60% reduction in allocated cloud compute resources and costs.

This, combined with our faster incident response and troubleshooting, led to customers telling us, “Whatever you did this past year, it’s a big improvement.” Customers were suddenly speaking positively about our evolution. Part of that evolution is becoming more data-driven across the company. New Relic has been and continues to be an amazing resource in helping us realize true value.