Our Commitment to Reliability and Incident Learning

Our customers rely on New Relic on a daily basis so they can understand and optimize the performance of their most critical applications and infrastructure. New Relic One provides customers with real-time insights into the performance of their infrastructure, cloud resources, containers, and applications by aggregating telemetry data in one place and delivering actionable insights to customers. 

On Thursday, July 29, we experienced a series of service interruptions that made data collection and alerting unusable for many customers in our US region for multiple hours. What started as an isolated technology failure in our Kafka systems was ultimately exacerbated by some of the very automation and redundancy protocols that, in normal operations, keep our systems humming at scale. We know that you, our customers around the world, rely on New Relic to provide the visibility you need to run your own products and services. We understand that this incident affected not just the service you received but your business and teams: many of you chose to halt your own deployments and kept your teams on high alert, even late into the night, until we restored our service. 

We are very sorry and are deeply committed to regaining your trust.

Over the last week, our engineering teams have been engaged in our own retrospective process, analyzing what happened, understanding why it happened, and discussing how we can do better. We looked not only at the sequence of events, but also at the technology, tools, and processes involved. We are committed to sharing details about how we build and operate our own systems, including when we fall short. In this post, we explain how some of the systems that run New Relic work, and in this case, how they didn't. 

Cell architecture overview

Our Telemetry Data Platform, also known as NRDB, is a cluster of distributed services built from the ground up for time-series telemetry data. A typical large query runs on thousands of CPU cores, scanning trillions of data points to return answers to customers in milliseconds. We operate at massive scale: our Telemetry Data Platform currently ingests more than three billion data points per minute, and that scale continues to grow rapidly. 

NRDB uses a cell-based architecture, an approach that allows us to scale our systems by replicating them as a whole (“cells”) with isolation between each cell. This architecture provides benefits in both being able to scale our platform with customer needs and improving reliability through fault isolation.

At a high level, in each NRDB cell, we operate: 

  • A Kafka cluster
  • An internet-facing ingestion service
  • A set of data pipeline services that process ingested traffic
  • An NRDB cluster
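
As a mental model only (the names and fields here are hypothetical, not New Relic's actual configuration), a cell can be pictured as a small record bundling these four components:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Cell:
    """Hypothetical model of one NRDB cell and its core components."""
    name: str
    kafka_brokers: List[str]      # the Kafka cluster at the heart of the cell
    ingest_endpoint: str          # internet-facing ingestion service
    pipeline_services: List[str]  # data pipeline services that process ingested traffic
    nrdb_nodes: List[str]         # the NRDB storage/query cluster
    healthy: bool = True

# Illustrative instance; all names are made up.
cell = Cell(
    name="us-cell-01",
    kafka_brokers=["kafka-1", "kafka-2", "kafka-3"],
    ingest_endpoint="https://ingest.us-cell-01.example",
    pipeline_services=["router", "normalizer", "alert-evaluator"],
    nrdb_nodes=["nrdb-1", "nrdb-2"],
)
```

Replicating this whole record, rather than scaling each component individually, is what makes the cell the unit of both scale and fault isolation.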

Benefits of a cell architecture

Using fault-isolated cells enables us to operate dozens of NRDB clusters concurrently while reducing the impact of failures in any one cell. As the number of cells grows, the blast radius (or customer impact) of a failure continues to decline. As an example, with only two cells in an architecture, an incident has the potential to impact 50% of customers. With 10 cells, an incident has the potential to impact 10% of customers, and so on. Cells are added incrementally as additional capacity is needed. 
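
The blast-radius arithmetic above is easy to sketch (a toy model that assumes customers are spread evenly across identical cells):

```python
def blast_radius(num_cells: int) -> float:
    """Fraction of customers potentially impacted when a single cell fails,
    assuming customers are distributed evenly across cells."""
    if num_cells < 1:
        raise ValueError("need at least one cell")
    return 1.0 / num_cells

for n in (2, 10, 50):
    print(f"{n} cells -> {blast_radius(n):.0%} potential impact")
```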

Another benefit of the architecture is elasticity. We can dynamically respond to unexpected traffic spikes by shifting production traffic on the fly from unhealthy cells to healthy cells, enabling us to keep serving traffic while scaling up cells and/or adding new cells. 

Managing complexity at scale 

While a cell-based architecture offers many benefits, including scale, speed, and flexibility, like many modern architectures it also presents new challenges. To manage the complexity, we take advantage of many tools and practices, including Kafka and infrastructure automation. As noted above, this service interruption was the result of an isolated technology failure that was exacerbated by some of the very automation and redundancy protocols we rely on in normal operations to keep our systems reliable and performant.

Kafka

New Relic's architecture makes extensive use of Apache Kafka for streaming and processing data. At the heart of each cell is a cluster of message brokers running Apache Kafka.

In normal operation, Kafka provides a high-throughput data bus, which carries all customer data through each stage of our processing, evaluation, and storage pipeline. If there is a problem or slow-down in one phase of the pipeline, Kafka also acts as a buffer, retaining a certain amount of data, allowing us to queue data instead of discarding it. 

Although the Kafka clusters are highly redundant to prevent data loss, a broker can occasionally become unresponsive without being removed from the cluster, stalling the data in the pipeline behind it. When this happens, the broker needs to be restarted—or in some cases rebuilt by removing the broker from the cluster, replacing it with a new instance, and then reconnecting that new instance to the existing storage. The services are designed so that this happens automatically when a broker is detected to be unresponsive, but sometimes the automatic remediation fails and the process needs to be triggered manually.
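
That remediation flow can be sketched roughly as follows, with stand-in health checks instead of real broker probes (everything here is illustrative, not New Relic's actual tooling):

```python
def is_responsive(broker: dict) -> bool:
    # Stand-in health check; a real system would probe the broker over the network.
    return broker["responsive"]

def restart(broker: dict) -> bool:
    # Simulated restart: fails if the broker's state is too damaged to recover.
    if broker.get("needs_rebuild"):
        return False
    broker["responsive"] = True
    return True

def rebuild(broker: dict) -> None:
    # Simulated rebuild: remove the broker, replace the instance,
    # and reconnect the replacement to the existing storage.
    broker["needs_rebuild"] = False
    broker["responsive"] = True

def remediate(broker: dict) -> str:
    """Automatic remediation: try a restart first; rebuild if that fails."""
    if is_responsive(broker):
        return "healthy"
    if restart(broker):
        return "restarted"
    rebuild(broker)
    return "rebuilt"
```

The key property is the escalation order: a cheap restart is attempted before the more disruptive rebuild, and either path ends with the broker back in service.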

Infrastructure as code 

We follow infrastructure as code as a core principle, allowing us to provision resources and manage configurations across many different tiers of our environment using the same practices and tools that we use for our software services.

During normal operations, when we build cells, deploy software to cells, or change cell infrastructure, we use standard CI/CD pipelines to slowly roll out the changes to one, some, then eventually all cells. We can also tweak some aspects of particular cells independently, although we try to keep cells of the same type as consistent as possible. 
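
The staged rollout described above can be sketched as a simple batching function (the stage sizes and cell names here are invented for illustration):

```python
def rollout_stages(cells, stage_sizes=(1, 3)):
    """Yield successive groups of cells: one canary, a small batch,
    then everything that remains."""
    remaining = list(cells)
    for size in stage_sizes:
        batch, remaining = remaining[:size], remaining[size:]
        if batch:
            yield batch
    if remaining:
        yield remaining

stages = list(rollout_stages([f"cell-{i}" for i in range(1, 8)]))
# stages -> [['cell-1'], ['cell-2', 'cell-3', 'cell-4'],
#            ['cell-5', 'cell-6', 'cell-7']]
```

A failure in an early stage halts the rollout before the change reaches the bulk of the fleet, which is what keeps ordinary deployments from circumventing cell isolation.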

However, when a team is responding to a service problem, we want to act fast, especially when it comes to scaling up. So we also have a set of tools that are used during incident response, allowing our engineers to take quick action to fix production issues. These tools are built on the same core infrastructure as our normal operations.

Breaking down the incident

On the morning of July 29, we saw a Kafka broker in one of our cells become unresponsive. Primed by similar recent issues in which the automatic system had not triggered correctly, our engineering team initiated a manual restart. In actuality, the automated process had already kicked in at the same time. While both tasks ultimately finished safely, the conflicting processes meant that the cluster spent longer than anticipated in a degraded state, and fed confusion about the true cause of the issue.

While the cluster was in the degraded state, a backlog of unprocessed data began to accumulate. To ensure older data would not be lost before it could be processed once the cell recovered, our responding engineers used one of our incident response tools to extend data retention in Kafka. Unbeknownst to the team at the time, the retention change set up a situation in which the Kafka brokers would slowly run out of disk space over the next few hours. 

This ultimately happened because of a confluence of two factors: 1) a larger-than-usual change had been applied to the retention settings, and 2) in past incidents, even when large retention extensions had been applied, troubleshooting had completed, data processing had resumed, and the emergency measures had been reverted long before any downstream issues emerged. This time, uncertainty over the root cause and a longer-than-usual recovery led the team to keep the emergency measures in place while they continued to analyze the situation.
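
A back-of-envelope calculation shows why the extended retention only became dangerous hours later (the numbers below are invented for illustration; they are not our actual ingest rates or disk sizes):

```python
def hours_until_full(free_gb: float, growth_gb_per_hour: float) -> float:
    """When retention is extended, data that would normally be purged
    accumulates, so disk usage grows at roughly the ingest rate."""
    if growth_gb_per_hour <= 0:
        return float("inf")
    return free_gb / growth_gb_per_hour

# e.g. 2 TB of headroom filling at 400 GB/hour exhausts the disk in 5 hours
print(hours_until_full(2000, 400))
```

In a short incident the emergency measure is reverted well inside that window; in a long one, the same measure quietly counts down to a second, larger failure.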

As mentioned above, we use infrastructure automation to ensure consistency across cells. Our tooling allows for per-cell tweaks, but we typically want to keep Kafka data retention consistent across the environment. So when we extended data retention for the affected cell, the tool applied the change to all cells, not just the one where the original problem existed.
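
The scoping problem can be sketched in a few lines: when a setting is keyed by cell type rather than by cell name, a change meant for one cell silently fans out to every cell of that type (the data model here is hypothetical):

```python
cells = {
    "cell-1": {"type": "nrdb", "retention_hours": 8},
    "cell-2": {"type": "nrdb", "retention_hours": 8},
    "cell-3": {"type": "nrdb", "retention_hours": 8},
}

def extend_retention_by_type(cells: dict, cell_type: str, hours: int) -> None:
    """What the tool effectively did: the setting was scoped to a cell type,
    so changing it touched every cell of that type."""
    for cfg in cells.values():
        if cfg["type"] == cell_type:
            cfg["retention_hours"] = hours

def extend_retention_for_cell(cells: dict, cell_name: str, hours: int) -> None:
    """The safer, cell-scoped alternative that preserves fault isolation."""
    cells[cell_name]["retention_hours"] = hours

# Intending to fix only cell-1, the type-scoped call changes all three cells.
extend_retention_by_type(cells, "nrdb", 24)
```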

Eventually the disk growth caused more brokers to fail, first in a few cells, then in more and more, resulting in a widespread disruption to our data processing systems. Even after the disk space issue was resolved, service recovery was slow because of the widespread nature of the Kafka problems. 

Failing safety layers

Mistakes happen, but how did this one mistake turn into such a huge disruption? To understand that, let's take another look at the above sequence of events and the safety mechanisms that failed.

1. The tool itself allowed an unsafe change to be applied. Like many emergency tools, this one was very simple, with limited dependencies, and was designed to do its job even in unexpected situations. But in this case it was too simple: it did not warn users that a substantial retention increase could be dangerous.

2. An engineer made a mistake in the settings that were used. Engineers are human, service incidents like this are complex, high-stress situations, and sometimes mistakes happen. This was compounded by the fact that the tool was routinely used in any situation where data was at risk of being lost, and had already been used dozens of times without a problem. That made it seem "safe," even if it wasn't.

3. The reason it seemed safe was that troubleshooting typically concluded quickly enough that the extended retention settings never reached a critical level. In this case the team needed more time to investigate the root issue, allowing the backlog to keep growing.

4. When disk space started to run low, our monitoring systems fired an alert, warning the team about the problem. But because that alert coincided with several other systems firing unrelated warnings, it was missed. 

5. As a last line of safety, the isolation provided by cells should have prevented this error from becoming a more widespread problem. However, retention changes were circumventing this isolation, because they applied automatically to all cells of the same type. This allowed a problem in a single fault-isolated cell to “break out” and impact many cells. 

Looking ahead

While we rely on our cell architecture, Kafka, and infrastructure as code to deliver fault-tolerant scalable systems in our normal operations, the issue on July 29 highlighted how these same tools and techniques can fail us in unexpected circumstances. 

A few key learnings:

  • Cell isolation is our strongest protection against widespread disruption. We need to ensure the isolation is respected not just during normal operations but also by emergency tools.
  • Emergency response tools need to be safe to use even in chaotic situations without compromising their ability to function.
  • Human attention is a finite resource, and incident response processes and alerts need to be continually evaluated to ensure engineer attention is directed at the most pressing signal.

We sincerely apologize for the stress this service incident put on your teams and your businesses. We know that our system reliability is critical to maintaining the health and performance of your own services, and that you count on us to always be available. We are committed to continuously evaluating and improving our technology, tooling, and processes to ensure that we continue to deliver world-class services so you can focus on delivering differentiated innovation to your own customers. We hope this public retrospective was useful and that you’re able to learn from our experience.