I like to start every retrospective I lead with Norm Kerth’s Prime Directive, which sets the tone and reiterates our purpose:
"Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand."
It’s a difficult truth, but complex software systems are bound to fail. While often problematic, these inevitable incidents also offer us valuable information about the shortcomings in our systems. Learning from failures, and in turn improving and tuning our approach, can lead to quicker recovery times when failure strikes again.
But to take advantage of that information, we have to look at our processes in an honest light to determine which of them are working well, and which ones lack value. Blameless retrospectives are a great tool that we can use to bring honesty to our reviews, to openly determine what works well, and to eliminate what doesn’t work.
Things change quickly in the software world, so it’s important to continually iterate on our processes, and adjust them to fit our current and future needs.
What is a blameless retrospective?
The use of Agile-based methodologies in software development—a practice which varies from company to company in terms of how strictly it is applied—has led to a rise in the use of retrospectives. These are facilitated meetings that take place at predetermined points in the production lifecycle. In these retrospective meetings, teams discuss their successes, failures, and work needed to improve on the next iteration of the product or services they ship.
Teams also hold retrospectives to discuss outages and incidents. And in these cases, so-called “blameless retrospectives” are becoming increasingly popular. Blameless retrospectives include engineers, managers, and project managers from the teams involved in an incident, as well as other stakeholders and other impacted people or teams. The purpose of the blameless retrospective is for all parties to reach an understanding of the situation that led to an incident, to understand the resulting incident response, and to find gaps or points for improvement in the teams’ processes—all with the goal of avoiding, or at least mitigating, a recurrence of the problem.
As the name suggests, a blameless retrospective is designed to identify improvements, not to point fingers. Participants are encouraged to ask questions, provide context and clarity, and offer suggestions. The final product of a blameless retrospective is a specific set of action items, with clear ownership, aimed at improving the incident response process.
Why you should hold blameless retrospectives
If your team makes a mistake and ships some code into production that brings down a handful of services, would you rather be able to state the facts about the situation and have supportive individuals offer constructive insights, or would you prefer to have the mistake lead to punitive measures? Blameless retrospectives play a critical role in creating a culture of psychological safety, accountability, and continuous learning.
Humans make mistakes. But when we frame mistakes as learning opportunities and create safe environments in which to discuss them, we’re more likely to speak freely and to understand how mistakes happened—and to learn what we can do to prevent similar mistakes in the future. If we create environments in which employees don’t feel safe to speak up, offer suggestions, or provide additional details, we’re doing those employees a great disservice. We’re also less likely to get to the root of a problem and to keep it from happening again.
If Bill from marketing brings down a critical database with an inefficient query, for example, wouldn’t you rather make Bill feel comfortable enough to explain what he was doing before things went sideways? Wouldn’t you rather spend time and energy to determine why Bill was able to bring your database down and how you can prevent that from happening again, instead of making sure everyone knows that he messed up?
Companies spend a lot of time and money to hire and train experienced and knowledgeable employees, and for the most part, those employees work hard to do the right thing.
How we do blameless retrospectives at New Relic
Inspired by Etsy, we make blameless retrospectives part of our incident response process at New Relic. After an incident, we aim to complete a retrospective within two business days; while it’s important to act when information related to an incident is still fresh, we also want to give impacted engineers time to decompress.
Typically, a site reliability engineer (SRE), site reliability champion (SRC), or the incident commander facilitates the retrospective. The facilitator is responsible for keeping the meeting on task, ensuring that the tone remains blameless and respectful, and gathering the data the business needs to document the incident. The facilitator acts as an investigator: They ask questions and probe for details, and they redirect the conversation back to its intended purpose if participants become charged or veer off topic.
Between the incident and the retrospective, the facilitator completes some prep work to ensure a smooth and effective retrospective. Usually, this involves creating a crowd-sourced document to gather as much detail about the incident as possible. The facilitator acts as a liaison between the teams involved in the incident, and requests contributions from individuals to provide additional details where needed. The information they collect typically includes:
- A timeline of events surrounding the incident
- The triggering event or root cause of the incident
- Any customer impact
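To make this concrete, here is a minimal sketch of what such a crowd-sourced prep document might look like. The headings mirror the three items above plus the discussion sections mentioned later; the incident details and timestamps are purely hypothetical, not an actual New Relic template:

```
# Retrospective prep: [incident name]

## Timeline of events (all times UTC)
- 14:02  Alert fired: elevated error rate on checkout service
- 14:05  Incident declared; on-call engineer paged
- 14:31  Bad config change identified and rolled back
- 14:40  Error rates back to baseline; incident resolved

## Triggering event / root cause
- [e.g., config change deployed without a canary stage]

## Customer impact
- [duration, affected features, number or percentage of customers]

## What went well / what could have gone better
- [filled in by participants before and during the retro]

## Action items (owner, due date)
- [assigned before the retro ends]
```

Keeping the structure this simple lowers the barrier for contributors to drop in details asynchronously before the meeting.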
Things move quickly in software, so it’s not uncommon for us to uncover additional details between the failure and the retro. Often, we update details and make corrections on the fly. Because of this, we always assign a scribe to take notes and track action items during the meeting. During the retro, we walk through the document (or slide deck) we’ve created, and we examine the details and data we’ve collected. As we move through the facts, we review what went well and what could have gone better, and we discuss ways to improve our processes.
We don’t end a retrospective until we’ve created and assigned action items to address areas of improvement. These action items may lead to technical changes, organizational changes, or both. Finally, we conclude every retrospective by thanking everyone for their time and participation.
Questions to consider during a blameless retrospective
Asking and answering these seven questions helps us to uncover the root causes of an incident and gaps in our processes—without assigning blame:
How were we notified about the problem?
Did we hear about the problem from our support team, in an angry tweet, or via an alert? The answer to this question can help us uncover gaps in our monitoring and alerting strategies. We never want our customers to know we have a problem before we do.
Was the right team paged?
Did the teams with the right knowledge and access to resolve an issue get paged as early as possible, or did a dependent team have to page them?
Could we have discovered the problem sooner?
How quickly did we discover that things were going awry, and can we reduce the time required to do this? Perhaps during our discussions in the retro, we can discover new symptoms or warning signs that we can use to detect problems more quickly.
Was the information we needed to resolve the incident easily accessible?
If it wasn’t, we need to improve our documentation and runbooks.
Did we get lucky? Could things have been worse?
Maybe something went well but had the potential to go poorly. If there are things we can do to mitigate those issues, this is the time to start that process. Seemingly lucky situations that have the potential for catastrophe include:
- Finding out there was only one person with the full access or knowledge to resolve the issue.
- Having an incident occur during a low-traffic period for our customers. If a service doesn’t perform well during low traffic, how catastrophic would an incident be during peak traffic?
- Discovering an anomaly while working on an unrelated task and raising an alarm. If a human had to spot an anomaly and raise the alarm manually, we need to evaluate and improve our alerting strategy.
Where did we have humans doing work where computers should have done it?
What work on the system can we automate to reduce toil or manual intervention? Is the effort needed to automate something worth the time, based on how often this manual work is likely to recur?
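That cost/benefit question can be reduced to simple break-even arithmetic. The sketch below, with hypothetical numbers, estimates how many months it takes for an automation effort to pay for itself; the function name and inputs are illustrative, not part of any formal process:

```python
def automation_break_even(manual_minutes: float,
                          occurrences_per_month: float,
                          automation_hours: float) -> float:
    """Return the number of months until automating a manual task
    pays back the effort spent building the automation."""
    # Hours of toil eliminated each month by the automation
    hours_saved_per_month = manual_minutes * occurrences_per_month / 60.0
    return automation_hours / hours_saved_per_month

# Hypothetical example: a 30-minute manual failover performed about
# 4 times a month, versus an estimated 8 hours to automate it.
months = automation_break_even(manual_minutes=30,
                               occurrences_per_month=4,
                               automation_hours=8)
print(months)  # 4.0 -- the automation pays for itself in 4 months
```

The estimate is rough by design: if the break-even point lands well within the expected lifetime of the task, automation is probably worth it; if it lands years out for a rare task, the manual runbook may be the better investment.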
What did we set out to do? And what actually happened?
If the incident was caused by a change or deployment to our production environment, we might ask: What pattern, if any, did we follow to deploy the change? Did we have a plan for how to handle any failures? If so, did we follow the plan? In some cases, we may find that we need to revise our deployment plan, or do more rigorous testing or smaller canary deployments.
Improve, improve, improve
Running blameless retrospectives is about more than avoiding the act of assigning blame. It’s about creating a culture of blamelessness and a work environment in which incidents are opportunities—allowing you to examine and improve your processes and reliability. It’s not realistic to think that you’ll never have incidents, and it’s not beneficial to point fingers when you do have them. Think of incidents as opportunities for teams to evolve, and then work to understand them without blame—and you'll give yourself a powerful new way to learn, to improve, and to minimize the impact of mistakes in the future.