Incident management goes hand in hand with on-call rotations. What’s an incident? It’s any case in which a system behaves in an unexpected way that might negatively impact customers (or partners or employees).
A core competency within the “you build it, you own it” DevOps approach, incident management is often given short shrift, with teams losing interest once an issue is resolved. Organizations without effective incident management often end up “firefighting” with ad hoc organization, methods, and communications. When something blows up, everyone scrambles to work out a plan to solve the problem.
There’s a much better way to approach incidents, one that not only minimizes the duration and frequency of outages, but also gives responsible engineers the support they need to respond efficiently and effectively.
Creating an effective incident management process
1. Define severities: Severities reflect the potential impact on customers and determine how much support will be needed. For example, at New Relic we use a scale of 1 to 5 for severities:
- Level 5 does not impact customers and may be used to raise awareness about an issue.
- Level 4 involves minor bugs or minor data lags that affect, but don’t hinder, customers.
- Level 3 is for major data lags or unavailable features.
- Levels 2 and 1 are serious incidents that cause outages.
2. Instrument your services: Every service should have monitoring and alerting for proactive incident reporting. The goal is to discover incidents before customers do, avoiding worst-case scenarios where irritated customers call support or post comments on social media. With proactive incident reporting, you can respond to and resolve incidents as quickly as possible.
3. Define responder roles: At New Relic, team members from engineering and support fill the following roles during an incident:
- Incident commander (drives resolutions)
- Tech lead (diagnoses and fixes)
- Communications lead (keeps everyone informed)
- Communications manager (coordinates emergency communication strategy)
- Incident liaison (interacts with support and the business for severity 1s)
- Emergency commander (optional for severity 1s)
- Engineering manager (manages the post-incident process)
4. Create a game plan: This is the series of tasks by role that covers everything that happens throughout the lifecycle of an incident, including declaring an incident, setting the severity, determining the appropriate tech leads to contact, debugging and fixing the issue, managing the flow of communications, handing off responsibilities, ending the incident, and conducting a retrospective.
5. Implement appropriate tools and automation to support the entire process: From monitoring and alerts to dashboards and incident tracking, automating the process is critical to keeping the appropriate team members informed and on task, and to executing the game plan efficiently.
6. Conduct retrospectives: Require your teams to conduct a retrospective within one or two days of the incident. Emphasize that the retrospective is blameless and should focus instead on uncovering the true root causes of a problem.
7. Implement a Don’t Repeat Incidents (DRI) policy: If a service issue impacts your customers, then it’s time to identify and pay down technical debt. A DRI policy says that your team stops any new work on that service until the root cause of the issue has been fixed or mitigated.
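To make the severity scale, responder roles, and DRI policy above a little more concrete, here is a minimal Python sketch of how they might be encoded in tooling. It is an illustration only, not New Relic's implementation: the names `Severity`, `BASE_ROLES`, `roles_for`, and `new_work_blocked` are hypothetical, and the sketch assumes that the five roles not tied to severity 1 are staffed on every incident and that "impacts your customers" means severity 4 or more serious.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity scale from step 1; a lower number is more serious."""
    SEV1 = 1  # serious incident that causes an outage
    SEV2 = 2  # serious incident that causes an outage
    SEV3 = 3  # major data lag or unavailable feature
    SEV4 = 4  # minor bug or minor data lag; affects but doesn't hinder customers
    SEV5 = 5  # no customer impact; used to raise awareness of an issue


# Roles from step 3. Assumption: these five are filled on every incident.
BASE_ROLES = (
    "incident commander",
    "tech lead",
    "communications lead",
    "communications manager",
    "engineering manager",
)


def impacts_customers(severity: Severity) -> bool:
    """Assumption: only level 5 is free of customer impact."""
    return severity <= Severity.SEV4


def causes_outage(severity: Severity) -> bool:
    """Levels 2 and 1 are the ones described as causing outages."""
    return severity <= Severity.SEV2


def roles_for(severity: Severity, include_emergency_commander: bool = False) -> list[str]:
    """Return the roles to staff; severity 1 adds the liaison and, optionally,
    the emergency commander."""
    roles = list(BASE_ROLES)
    if severity == Severity.SEV1:
        roles.append("incident liaison")
        if include_emergency_commander:
            roles.append("emergency commander")
    return roles


def new_work_blocked(severity: Severity, root_cause_addressed: bool) -> bool:
    """DRI policy from step 7: pause new work on the service while a
    customer-impacting incident's root cause is neither fixed nor mitigated."""
    return impacts_customers(severity) and not root_cause_addressed
```

For example, `roles_for(Severity.SEV1, include_emergency_commander=True)` returns the five base roles plus the incident liaison and emergency commander, while `new_work_blocked(Severity.SEV5, False)` is `False` because a level 5 incident does not impact customers.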