This is an update to a post that originally ran in June 2019.

February 27, 2017 was a memorable day at New Relic. It was the day the Amazon S3 US-East-1 region went offline—for 14 hours.

The outage forced New Relic to organize one of the most extensive incident responses in our history. We threw everything we had into understanding the full impact on our customer-facing systems and then into formulating a plan to restore our systems and our ability to deploy.

By any measure, this was a painful experience. Yet our response to the Amazon outage was also a huge win for our New Relic Emergency Response Force (NERF): a team of highly capable and experienced volunteers who coordinate responses to our most difficult incidents.

5 keys to elevating your IC game

NERFs take over one of the most challenging aspects of incident response: the role of Incident Commander (IC). At New Relic, every incident response team includes an IC, and some of our best incident commanders are on the volunteer NERF rotation.

It’s great to have highly skilled and well-prepared ICs for big incidents. (The Amazon incident was one of the very few in New Relic’s history for which we appointed multiple ICs to coordinate a response.) But every incident at New Relic—not just the big ones—benefits from good incident command. Much of that value is the result of lessons we’ve learned about training ICs to handle the role’s unique challenges and responsibilities.

What follows are five of our top practices for making great incident commanders.

Before you begin: incident command in context

New Relic’s incident response process, like so many of our processes, was born in a DevOps environment. That means, for example, that all our engineers are on call for their services; there’s no Ops wall to throw problems over. We’re big enough and complex enough that we need some process, but we work hard to keep it to just enough process rather than process for process’ sake.

This philosophy also informs New Relic’s position that every developer who responds to an incident should be capable of serving as an incident commander. If an incident turns out to be especially severe or challenging, we may bring a more experienced IC into the process, and for our worst incidents we automatically page the on-call NERF. But for most incident responses, our imperative is to equip every New Relic engineer with the tools, skills, and confidence to perform the duties of an IC.
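As a rough illustration of that paging rule, here is a minimal Python sketch. The `Severity` scale, the `PagePlan` type, and the `plan_pages` function are hypothetical names invented for this example; they are not New Relic's actual tooling or severity levels, which this post does not describe.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Severity(IntEnum):
    # Hypothetical three-level scale, assumed for illustration only.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class PagePlan:
    incident_commander: str       # who runs the incident
    paged: Tuple[str, ...] = ()   # anyone paged in addition to the responder


def plan_pages(severity: Severity, first_responder: str, on_call_nerf: str) -> PagePlan:
    """Decide who acts as IC and who gets paged for a new incident."""
    if severity >= Severity.CRITICAL:
        # Worst incidents: page the on-call NERF automatically,
        # and let them take over incident command.
        return PagePlan(incident_commander=on_call_nerf, paged=(on_call_nerf,))
    # Everything else: the responding engineer is expected to act as IC.
    # Bringing in a more experienced IC later is a human judgment call,
    # so it is deliberately not modeled here.
    return PagePlan(incident_commander=first_responder)
```

The point of the sketch is the default: the responding engineer is the IC unless the incident is severe enough to trigger an automatic page.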

It’s also useful to review New Relic’s incident response processes to gain context for understanding our approach to incident command. This is a topic that we have covered extensively in previous posts, most notably a deep dive into our on-call and incident response procedures.

We also suggest reviewing New Relic’s approach to blameless retrospectives and learning about the best practices that we apply to our retrospectives and related activities. Incident postmortems are a critical part of our incident response process; they’re also (as we’ll discuss below) essential to reminding ICs that we have their backs when they make hard decisions, even when those decisions turn out to be wrong.

A quick reminder: Why none of this is easy …

As we dig into these best practices, it’s also useful to review the three defining traits of an incident response scenario. They’re the reasons why the IC role is often so stressful and why incidents themselves can be so volatile. They’re an important reminder of why effective incident commanders are such a valuable resource.

  • Incident response is a high-stakes event. Outcomes matter, and worst-case incidents may pose an existential threat to a business. Customers that can’t access your software may simply leave, or you may find yourself in violation of contractual SLAs.
  • Incident response is a high-cadence event. In other words, it’s a race against the clock. During an incident you’ll likely have worried customers filing support tickets or actively watching for status updates and resolutions. Losing that race can mean losing everything.
  • Incident response involves groups of people. When you bring people into a high-stakes, high-cadence situation, stress runs high.

One way or another, everything we discuss here is intended to address these three traits and the problems they create.

Incident commander training and empowerment: 5 New Relic best practices

1. Good incident commanders view coordination as their most important and urgent task

Incidents are pressure cookers: They’re chaotic, extremely dynamic, and often unpredictable. They’re complex; many involve more moving parts than one person could possibly keep track of. In some cases, information is scarce and highly unreliable; in others, a team is deluged with multiple flows of real-time information.

In an environment like this, ICs who view themselves as “deciders” or who think they have all of the answers are doomed—and they’re more likely to amplify panic than to contain it.

Successful ICs focus instead on coordination. They work to identify and recruit the right people, with the right knowledge and skills, to formulate an effective team response. They ensure that all of the players have what they need to do their jobs; they minimize friction and promote clear communication.

As a coordinator, the IC is the calm at the center of a storm—an antidote to panic and to reactive thinking. In practice, this means:

  • Focusing on asking the right questions—not on knowing the answers.
  • Ensuring that constructive ideas aren’t drowned out or overlooked.
  • Questioning and challenging ideas to assess their merit.
  • Pushing back against groupthink and reactive thinking.
  • Leaving troubleshooting to other team members—but supporting the troubleshooting process.

If you remember one thing from this post, here it is: Successful ICs focus on coordination.

2. ICs control the flow of emotions, information, and analysis

New Relic ICs use what we refer to as the “three flows” to keep teams calm, focused, and prepared to work:

The flow of emotion. Incidents are breeding grounds for panic and reactive behavior. Recognizing panic responses and guiding people out of them is the IC’s top priority.

Pay attention to the emotions of incident participants, including those you’re communicating with remotely. The sooner you recognize a shift into reactive mode, the sooner you can act to pull them back towards a calm, focused state of mind.

The flow of information. This is largely about understanding your participants: Who is in the room? What do they already know, and what do they not know that they care about?

The IC’s role here involves listening, filtering, and acting on what’s meaningful. Do you need to page another team? Is there a domain expert who can solve a thorny problem? Does an engineer who just joined the incident response understand the current status and how they can help? Did you discover something new about the incident that may be important to communicate to customers? Has it been a while since an engineer, who agreed to perform a critical task, gave a status report? When ICs view themselves as conduits—dedicated to getting the right information to the right people—solutions tend to appear more quickly.

The flow of analysis. Sometimes you get an incident where you know exactly what’s wrong, and you can focus mostly on implementing a fix.

But mostly, you’ll get the other kind of incident—like the one where an engineer decided to see what happens when you run a query with 65 consecutive wildcards. (Now we know: bad things happen. True story!)

Such incidents can be scary, but they’re also really valuable. They’re opportunities for ICs to find out, in real time, where their mental models of a system don’t align with reality, or with those of their colleagues, for that matter.

3. Successful incident commanders are masters of incident context

Context is very important when your main job involves coordination. It is the fuel that powers your ability to make connections, identify useful resources, and spot gaps in a team’s knowledge and capabilities.

There are three areas where it’s especially useful for an IC to improve their grasp of context:

Fluency in your organization’s technical and human systems. This includes understanding general system architecture, how things fit together, and what parts of the system are under the most strain at a given time.

For example, an IC running an incident that is limited to their team’s services needs to know the general architecture, function, and immediate dependencies of the services.

A NERF running a large, multi-team incident needs a general understanding of the broader product architecture. At that scale, the IC doesn’t need a deep technical understanding of the systems involved in an incident so much as an awareness of how services might fit together.

The IC should also understand the organization: How roles and teams are defined, how to reach people, and which people and teams need to be involved based on what’s happening.

Familiarity with their organization’s incident response process. We don’t expect our ICs to memorize every detail or every line of documentation; an experienced IC can achieve the same goal by developing “muscle memory” of the basic incident lifecycle. It certainly helps, however, if an IC keeps the relevant process docs at their fingertips.

Understanding an organization’s priorities, culture, and way of working. A successful incident response focuses on practical solutions that stay within an organization’s usual practices and capabilities. The further an IC strays from these core capabilities, the harder it will get to organize and sustain a response.

4. Understand that training is paramount—but an eye for talent is useful, too

The last thing you want to do is discourage an IC because they “lack talent.” With the right training, pretty much anyone can learn to be a good IC, and perhaps even a great one.

Still, it’s great to encourage ICs who possess certain traits. These folks may absorb training faster and retain more of what they learn, and they may possess the right emotional traits to combat panic and to function well in stressful or chaotic settings.

When it comes to training incident commanders, some telltale signs of a “natural” include:

A talent for achieving technical fluency. In particular, an IC needs a broad technical vocabulary so that they understand the conversation that’s happening in the room. They also need well-calibrated technical knowledge: an accurate sense of what they know and what they don’t know.

A talent for self-regulation. You can’t regulate the flow of emotions in the room if you can’t regulate your own emotional and intellectual responses. This is what medical professionals refer to as “clinical detachment,” and the more intense an incident response gets, the more valuable this ability becomes.

A natural enthusiasm for the job. Good ICs relish the challenges they’ll encounter during an incident response. They focus more on the thrill of a successful response than on the possibility of failure. But they also accept the reality that they won’t win every incident response battle—and they’re OK with that.

Always keep one thing in mind: Successful organizations also work very hard to make the IC role as attractive as possible. Celebrate successful incident resolutions—and the ICs who coordinated them. At the same time, build and nurture a blameless culture within your development teams, and ensure that ICs are never penalized for making tough decisions or for stepping into challenging incident command opportunities.

5. Practice, practice, practice!

Practice is, by far, the best way for novice ICs to build their skills as well as their confidence. The more realistic the practice session gets, the more impact it will have on the novice IC.

New Relic’s approach to practice for ICs relies on two related kinds of simulations: game days and adversarial game days. The first of these events tests an IC’s response to a pre-defined incident response scenario; the second, which evolved from chaos engineering methodologies, uses a selected “malicious actor” to step up the intensity and the possibilities for unexpected mischief. It’s also fairly easy to adapt these exercises to assess not just what ICs do during an incident but also how they respond and perform under pressure.

New Relic also encourages new ICs to “shadow” the IC role. This involves paging two team members during an incident: an experienced IC and the IC in training. Both people will participate in the incident. Wherever possible, the “shadow” is given room to perform IC duties, with the experienced IC listening in to provide guidance, tips, and reminders as needed. This practice can work well both at the individual team level and for the NERF role.
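The shadowing arrangement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CommandPair`, `assign_command`, and `people_to_page` are hypothetical names, and the sketch assumes the trainee always leads when one is available, whereas in practice that is decided "wherever possible" by the people involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CommandPair:
    acting_ic: str                 # performs the IC duties
    advisor: Optional[str] = None  # experienced IC listening in, if any


def assign_command(experienced_ic: str, trainee: Optional[str] = None) -> CommandPair:
    """Pair an IC in training with an experienced IC for one incident."""
    if trainee is None:
        # No shadow available: the experienced IC simply runs the incident.
        return CommandPair(acting_ic=experienced_ic)
    # Assumption: the shadow leads, and the experienced IC advises.
    return CommandPair(acting_ic=trainee, advisor=experienced_ic)


def people_to_page(pair: CommandPair) -> List[str]:
    """Both members of the pair are paged and join the incident."""
    return [pair.acting_ic] + ([pair.advisor] if pair.advisor else [])
```

Keeping the advisor on the page list is the key detail: the trainee gets real practice, but never without experienced backup on the call.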

Strong incident commanders enable effective incident management

As we’ve described in previous posts, the recipe for good incident management includes a lot of ingredients. Probably none of them are as important as having a confident, calm, and well-trained incident commander on the job. And nothing is as important to creating effective incident commanders as an organization that recognizes the IC’s critical role—and invests the resources to train, empower, and recognize great ICs.