Language can be a barrier to good communication. So can opposing interests. But as the saying goes, “Numbers don’t lie.” You can’t dispute what hard data tells you, whether you like it or not.
In our business at 27Global, we rely on data every day to break down communication barriers between our geographically distributed development teams, between our dev and site reliability engineering (SRE) teams, and between us and our clients. When communication breaks down, productivity and quality suffer. That’s why data is at the heart of our SRE organization. At 27Global, SRE is both a business pillar, used internally to help us ensure high performance and quality in our development projects, and a service delivered to clients so they can monitor and respond to issues in their production workloads.
The SRE function is mission-critical for us and our clients, and having complete, accurate telemetry data is essential. The challenge is how to get that data without imposing a lot of extra work on our engineers, who need to focus on delivering great products. Meeting that challenge requires rigorous observability and automation across the stack.
Why observability across the stack with programmability?
Observability, the ability to analyze and troubleshoot problems easily across your entire software stack, gives us a single version of the truth to share with our global development teams. But it’s not just for our developers. It also gives us visibility and insight into production workloads, which often expose performance issues such as slow queries, resource contention, and queuing bottlenecks that might not show up in a test environment and could be virtually impossible for our development team to replicate.
Observability includes application performance monitoring, infrastructure monitoring, log analytics, digital experience monitoring—everything, everywhere—on prem, in the cloud, virtualized, containerized, monolithic, microservices, you name it. Instrumenting to collect and visualize data from all these places should be done in a programmatic way if you don’t want to bog down your engineers with manually building all those alerts. You want to give them APIs so they can program instrumentation into the software and make alerting automatic. Simply put, this reduces toil, and reducing toil for our teams means they can be more efficient and put their time into meaningful work instead of low-level tasks.
How do we do this at 27Global? We use New Relic One, which is a powerful observability platform with programmability built in. Let’s dig into some of the specific things we’re doing with New Relic to get the data we need, to not only break down communication barriers but also to improve productivity and product quality, and deliver real value to our SRE customers.
Automation using observability as code
We use Terraform as our main tool for provisioning and managing infrastructure as code. We also use Ansible to automate application builds. Historically, our cloud engineers had to add monitoring to the infrastructure manually, and our developers had to build monitoring into the applications. Now, using the Terraform Provider from New Relic, we take advantage of APIs to automatically add monitoring to our applications and infrastructure. This is what we call observability as code. For example, when we’re standing up a new Kubernetes pod, we can write Ansible scripts that inject New Relic's APM agent into that environment. Then we can just layer variables on top. That gives us a template we can use over and over again instead of reinventing the wheel each time.
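To make the "template plus variables" idea concrete, here is a minimal sketch in plain Python. It is not our Terraform or Ansible code, and it does not call any New Relic API. The field names (`alert_policy`, `conditions`, `tags`) and the `render_monitoring` helper are hypothetical, chosen only to show how one baseline definition can be reused across services.

```python
from copy import deepcopy

# Hypothetical baseline monitoring template. The field names are illustrative
# only and do not correspond to a real New Relic or Terraform schema.
BASE_TEMPLATE = {
    "alert_policy": "default-policy",
    "conditions": {
        "error_rate": {"threshold": 5.0, "duration_minutes": 5},
        "response_time": {"threshold": 1.5, "duration_minutes": 10},
    },
    "tags": {"managed_by": "automation"},
}


def render_monitoring(template, variables):
    """Return a new config: the template with per-service variables layered on top.

    Nested dictionaries are merged key by key; any other value in `variables`
    replaces the template value. The template itself is never modified.
    """
    result = deepcopy(template)
    for key, value in variables.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = render_monitoring(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Only the values that differ from the baseline need to be supplied.
checkout = render_monitoring(
    BASE_TEMPLATE,
    {
        "alert_policy": "checkout-policy",
        "conditions": {"error_rate": {"threshold": 1.0}},
        "tags": {"service": "checkout"},
    },
)
```

Each new service supplies only the values that differ, so the baseline stays in one place and every environment inherits later changes to it.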
From a DevOps perspective, automating in this way—with observability programmed into the environment—means we can move much faster. And as previously mentioned, it reduces toil, which no developer enjoys. It minimizes the amount of time they have to spend on boring work so they can focus on the interesting stuff.
Building dashboards to improve communication across global teams
By instrumenting applications and infrastructure, we’re able to collect a lot of meaningful data. With data, communication problems go out the window—no more excuses. There are several different ways we use data to improve communications, one of which is with dashboards.
We built end-of-day dashboards for our development teams. Like most modern IT shops, we have multiple teams across different physical locations, time zones, and continents. Using a common set of data between our distributed teams is a great way to facilitate clear communications. Rolling up daily operational data into dashboards makes the transition from one team to the other much smoother. It supports a follow-the-sun model like ours. Instead of trying to convey information manually to multiple people, everybody has one place to go. They have the same view of the exact same data, which allows us to have much more continuity as projects are handed off at the end of each day.
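As a rough illustration of what "rolling up daily operational data" can mean, the sketch below summarizes a day's events into per-service counts and a list of incidents that are still open at handoff. The `Event` record and its fields are assumptions made for this example. In practice the data would come from the observability platform rather than a hard-coded list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical operational event; the fields are illustrative."""
    service: str
    kind: str       # e.g. "deploy", "alert", "incident"
    resolved: bool


def end_of_day_summary(events):
    """Roll a day's events up into per-service counts plus still-open incidents."""
    counts = {}
    for event in events:
        counts.setdefault(event.service, Counter())[event.kind] += 1
    open_incidents = [
        e for e in events if e.kind == "incident" and not e.resolved
    ]
    return {"counts": counts, "open_incidents": open_incidents}


todays_events = [
    Event("checkout", "deploy", True),
    Event("checkout", "incident", False),
    Event("search", "alert", True),
    Event("search", "incident", True),
]

summary = end_of_day_summary(todays_events)
for service, kinds in sorted(summary["counts"].items()):
    print(service, dict(kinds))
for item in summary["open_incidents"]:
    print("OPEN:", item.service, item.kind)
```

The team picking up the work sees the same counts and the same open items as the team signing off, which is the point of a shared end-of-day view.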
Dashboards also overcome communication barriers that can exist between development and SRE teams. Just as we expect our developers to write deployable code, we also strive to give our developers visibility into impacts on site reliability. This visibility provides the dev team with more context around the SRE team’s needs. It’s the same for the SRE side—they can better understand developer activities. Data is a way for both sides to meet in the middle with a common set of metrics that brings teams together to have the exact same conversation, which is a huge improvement.
Increasing productivity and reducing mean time to resolution (MTTR)
One of the great things about an observability platform such as New Relic is that it provides a single view into your environment. We can “wow” a client within 30 seconds, just by showing them a Kubernetes cluster with all the services, nodes, and namespaces. We can give clients instant reports on uptime, performance, and digital user experience, whereas in the past it would take us four hours or longer to gather all that information.
Our internal productivity has also increased. In fact, we’ve cut the time to stand up a new project in half, and that time continues to decrease as we combine full-stack observability with automation.
Time to resolution is faster, too. For example, one of our clients had a database problem that we identified through New Relic. For months, the client had application performance problems, but nobody knew why until we brought in that visibility from New Relic. Digging into the logs from New Relic, we were able to troubleshoot a bad query and turn around a fix the next day. The client saw an immediate performance improvement for the application.
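The kind of log analysis described here can be sketched as grouping query log records by statement and ranking them by average duration. This is an illustration only: the record shape (`query`, `duration_ms`), the sample data, and the `slowest_queries` helper are hypothetical, not the client's actual logs or a New Relic feature.

```python
from collections import defaultdict
from statistics import mean


def slowest_queries(records, top_n=3):
    """Rank query statements by average duration, slowest first.

    Returns a list of (query, average_ms, call_count) tuples.
    """
    durations = defaultdict(list)
    for record in records:
        durations[record["query"]].append(record["duration_ms"])
    ranked = sorted(
        ((mean(values), len(values), query) for query, values in durations.items()),
        reverse=True,
    )
    return [(query, avg, count) for avg, count, query in ranked[:top_n]]


# Hypothetical log records with parameterized statements.
sample_logs = [
    {"query": "SELECT * FROM orders WHERE status = ?", "duration_ms": 2400},
    {"query": "SELECT id FROM users WHERE email = ?", "duration_ms": 12},
    {"query": "SELECT * FROM orders WHERE status = ?", "duration_ms": 2600},
    {"query": "UPDATE carts SET total = ? WHERE id = ?", "duration_ms": 35},
    {"query": "SELECT id FROM users WHERE email = ?", "duration_ms": 8},
]

for query, avg_ms, calls in slowest_queries(sample_logs):
    print(f"{avg_ms:8.1f} ms  x{calls}  {query}")
```

Ranking by average rather than by a single worst call helps separate a consistently bad query from a one-off spike.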
This example gets to the key benefit of our SRE service—finding and solving application and infrastructure problems quickly. We want to be able to detect and respond to our clients’ issues before they have any impact on their customers. That’s what we’re able to do using New Relic.
No communication struggles with data-centric strategy
As you can see, full-stack observability and automation have a huge impact on the performance and quality of our products and services at 27Global. We don’t struggle with communications—we have the data to back up our claims. We can accurately measure key operational attributes of applications and infrastructure in real time, gain valuable insights, and share those insights instantly. A programmable observability platform allows us to get much more useful information without a lot of extra work. That’s something any DevOps team can appreciate.
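The views expressed on this blog are those of the author and do not necessarily reflect the official views of New Relic K.K. This blog may also contain links to external sites. New Relic makes no guarantees regarding the content of those linked sites.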