New Relic’s incident response process details the steps we take if an emergency impacts our system in a way that affects our customers’ experience with our platform, such as a UI crash or loss of data. But what if a team inside New Relic discovers a problem that isn’t necessarily an emergency but still has the potential to impact our customers? The New Relic distributed tracing team recently used the Andon system to resolve a bug that impacted the accuracy of traces. What follows is the story of our experience and nine key takeaways we uncovered when going "Andon red" for the first time.

How we use Andon at New Relic

Andon was originally developed as a quality-control component of Toyota’s manufacturing process. Under Andon, when workers discover a problem in process or quality, they can activate an alarm to notify management and other teams and to pause scheduled work until the problem is resolved.

At New Relic, if a team is blocked by circumstances beyond its control, they’re empowered to set an Andon status—green for OK, yellow for warning, or red for emergency—that is visible to all other teams via a dedicated Andon Slack channel. If a team sets any status other than green, they’re signaling for help. They may need increased attention from other teams, access to outside resources, or permission to alter process and break cross-team dependencies until their issue is resolved. If a team sets a non-green status, they’re focused strictly on resolving their issue and have slowed (yellow) or paused (red) work on any other scheduled projects.

Each team manages its own Andon process. When a team changes its Andon status, they must identify the impact of their problem, request the help they need, and keep stakeholders updated until the issue is resolved.

Before I dive into lessons the distributed tracing team learned from its first use of Andon, here’s some quick background on the bug that kicked it off for us—we called it “the span count bug.”

Meet the span count bug

In late 2018, the distributed tracing team became aware that some traces were being displayed incorrectly on the trace list page. In some cases, a trace reported fewer spans or services than it really contained, even though the underlying Span and Transaction events—from which we stitch together traces—were correct. Viewing the details of one of these traces showed the correct number of spans, which made the discrepancy obvious. The distributed tracing team didn’t like how this affected the integrity of our data, so we engaged the Andon process and went straight to red.

Like most of the New Relic ingest pipeline, the distributed tracing pipeline is a series of services connected by Apache Kafka queues. Step one, then, was to figure out which service was at fault. We knew some small percentage of index records were being lost while we aggregated span events. Working backwards from our data storage layer, we added metrics to each service so we could use New Relic to verify that the number of trace indexes it produced matched the number coming in (minus any indexes dropped for valid reasons). Before long, we found our culprit: a service imaginatively named “The Trace Indexer.”
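
To make that concrete, here is a minimal sketch of the kind of per-service counters we mean, assuming the New Relic Java agent’s NewRelic.incrementCounter API; the class and metric names are invented for illustration and aren’t the ones we actually shipped.

```java
import com.newrelic.api.agent.NewRelic;

// Hypothetical helper: each service in the pipeline bumps one of these counters
// for every trace index record it consumes, emits, or intentionally drops.
// Charting indexesIn against indexesOut + indexesDropped per service shows
// exactly where records go missing.
public class PipelineCounters {
    private static final String IN      = "Custom/TraceIndexer/indexesIn";
    private static final String OUT     = "Custom/TraceIndexer/indexesOut";
    private static final String DROPPED = "Custom/TraceIndexer/indexesDropped";

    public static void recordConsumed() { NewRelic.incrementCounter(IN); }
    public static void recordProduced() { NewRelic.incrementCounter(OUT); }
    public static void recordDropped()  { NewRelic.incrementCounter(DROPPED); }
}
```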

The Trace Indexer uses Kafka Streams to aggregate individual Span events into traces. To store in-progress aggregations, it uses a RocksDB instance with a cache layer on top. Our New Relic Insights dashboard showed low disk I/O, which meant RocksDB wasn’t writing to disk as frequently as we expected. Instead, the cache layer was holding everything, and when the cache filled up, it evicted in-progress index records, which is why some traces under-reported their spans and services.
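
For readers who haven’t used Kafka Streams, here is a heavily simplified sketch of an aggregation topology in this style. It is not the Trace Indexer’s real code: the topic names, serdes, and string-concatenation “aggregation” are placeholders. It only shows the shape of the thing, a groupByKey-and-aggregate pipeline whose in-progress state lives in a RocksDB-backed store with the Streams record cache in front of it.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class TraceIndexerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "trace-indexer-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Span events arrive keyed by trace ID (values here are just raw payloads).
        builder.stream("spans", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // Fold each trace's spans into an index record. The in-progress
               // aggregate is held in a RocksDB-backed state store, and the
               // Streams record cache sits in front of that store.
               .aggregate(
                   () -> "",
                   (traceId, span, index) -> index + span + "\n",
                   Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("trace-index-store")
                               .withKeySerde(Serdes.String())
                               .withValueSerde(Serdes.String()))
               .toStream()
               .to("trace-indexes", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

By default, Kafka Streams materializes a store like this in RocksDB on local disk, which is why unexpectedly low disk I/O was the tell that records were piling up in the cache instead.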

To fix this, we resized the cache so it would no longer evict records before aggregations were complete. As a permanent fix, we’re pursuing a Kafka Streams upgrade.
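
For context, the record cache that sits in front of Kafka Streams state stores is sized with the cache.max.bytes.buffering setting, so the resize looked roughly like the sketch below; the numbers here are illustrative rather than our production values.

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class CacheTuningSketch {
    // Illustrative values only; the point is to give the record cache enough
    // headroom that in-progress trace aggregations aren't evicted before they
    // are complete.
    static Properties cacheTuning() {
        Properties props = new Properties();
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 512L * 1024 * 1024);
        // A shorter commit interval also bounds how long records live only in the cache.
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10_000L);
        return props;
    }
}
```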

To complete the Andon process, we held a (blameless) retrospective. We reviewed the details of the span count bug and discussed what did and didn’t work for us the first time we went Andon red. Here are nine key takeaways from our experience.

Nine takeaways from going Andon red

1. Declare your priorities. With Andon in place, the entire team aligned on the central issue as priority one; we swarmed on it to the exclusion of almost everything else. Additionally, Andon brought us together; there is no personal pressure to identify or accept blame, which is always a worry when you declare an emergency.

2. Engage outside support early and often. It’s not uncommon for a team to feel like an emergency belongs only to them, but in modern software architectures—like what we run at New Relic—there are usually upstream or downstream dependencies. Andon is a clear way to alert other teams that you have an issue and may need additional resources. For example, in this case, we engaged developers from our Kafka Streams provider.

3. Increase communication inside and outside your team. Some parts of the Andon process that helped us the most were also the most basic. For example, we held daily status meetings specific to the span count bug, and we created a “living” status document that we all contributed to. These clear communication channels not only helped us expose our progress and findings to each other and to our stakeholders, but they became a valuable resource to help determine our next steps.

4. Prioritize your troubleshooting steps. After we went red, we first pursued the avenues of investigation that required the least time, and we time-boxed efforts that didn’t yield results. We ranked our troubleshooting steps by likelihood of resolution versus effort, and we made sure they were documented and visible to everyone inside and outside our team.

5. Accept a higher (though not infinite!) risk tolerance. When you go Andon red, the world speeds up. Moving quickly often means accepting more risk. In our case, we frequently deployed configuration changes and upgrades to external libraries in an effort to get more clues about the source of the bug. We didn't abandon our usual processes (for example, code reviews and staging deploys), but for any proposed action, we asked ourselves, "What is the chance things will go wrong? And how much 'wrong' can we accept?" Establish how much risk you're comfortable with, but have a clear plan for rolling back changes or forging ahead.

6. Be willing to look at big changes. As we worked through the span count bug, we established a “background thread” exploring how we could switch from Kafka Streams to Apache Flink, which led to a working prototype. This gave us confidence that even if we discovered an intractable problem that required an architectural change, we had a Plan B lined up.

7. Use data. When working through the span count bug, we developed queries that could tell us how many traces were affected at any point in time, and we were confident the issue was limited to less than 1% of total traces in the system. Once we had a hypothesis to test, though, we created and deployed metrics in our services so we could base our findings on real data. These metrics have remained useful even after the issue was resolved; I can look over my shoulder at a dashboard that continually checks the last 24 hours of distributed tracing data, and we now alert on it as well.

8. When troubleshooting, document evidence that supports your conclusions. When we were trying to determine which service in our pipeline was the cause of the span count bug, we had to go back a few times and re-validate our results—because we kept forgetting to write things down. It’s crucial to document all findings, no matter how trivial or minor they may seem, because everyone is moving quickly, and you may forget what information led to a particular conclusion.

9. Avoid duplicating your efforts. Splitting up work is usually the fastest way to get to a resolution, but our retrospective revealed that multiple people had reviewed the same code for a particular service at different times—and had all reached the same conclusions. Did each team member really need to review the same code? Most likely not. On the other hand, we realized that once we identified the culprit service, we should've “mobbed” that code review together.

It's all about making the best use of Andon

Sometimes you need to put scheduled work on hold so you can tackle issues before they turn into full-blown emergencies—especially when those issues degrade your customers’ satisfaction—and Andon provides a clear mechanism for doing just that.

Of course, knowing when to go Andon red is just as important as what you do after you raise the alarm. Nobody likes having to declare an emergency or halt production, but having an established Andon system in your organization can help get you through your emergencies in an orderly and timely manner.