What is root cause analysis?

Root cause analysis is a method for identifying the breakdowns in existing processes and systems that contribute to adverse incidents, so that those processes and systems can be improved.

When effectively implemented, a root cause analysis can:

  • Identify factors that contributed to an adverse event or near-miss so that measures can be put in place to address contributing factors.
  • Prevent incidents from happening again in the future.
  • Improve customer experience.
  • Reduce the costs associated with risk.

Conducting a root cause analysis is an in-depth process. It consists of finding the teams involved in resolving the adverse event, reviewing valid notes taken during the incident, and reviewing data from the monitoring tools in place to determine mean time to detect (MTTD) and mean time to resolve (MTTR). The goal is to make the response more efficient and to minimize or eliminate the risk of the event recurring.
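
As a rough illustration of the two metrics mentioned above, the following Python sketch computes MTTD and MTTR from a list of incident records. The `Incident` class, its field names, and the sample timestamps are assumptions made for this example; it also assumes both metrics are measured from the moment the incident began, which is only one of several definitions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the fault actually began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between the start of an incident and its detection."""
    gaps = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average gap between the start of an incident and its resolution."""
    gaps = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 5),
             datetime(2024, 1, 8, 10, 45)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 15),
             datetime(2024, 1, 9, 15, 15)),
]

print("MTTD:", mean_time_to_detect(incidents))
print("MTTR:", mean_time_to_resolve(incidents))
```

In practice the timestamps would come from your monitoring and incident tooling rather than being typed in by hand.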

Team involvement, sharing details of the adverse event, and effective communication all contribute to a positive outcome.

Root cause analysis: nuts and bolts

Root cause analysis involves comprehensive investigation, assessment, evaluation, and correction, all of which feature in the services that New Relic provides. Many investigative procedures can be used to carry out a root cause analysis, such as those described below:

  • Identify the problem: The first step upon recognizing an issue is to define a problem statement and the symptoms (for example, a machinery malfunction, a failed or faulty process, or human error). Once that’s done, it’s important to isolate any suspected contributing factors to contain the problem while you try to uncover the root cause.
  • Collect data: Once the problem is identified, compile as much data as possible, including incident reports, evidence in the form of screenshots and logs, and interviews with anyone involved with the issue. Using this data, you can determine the sequence of events, and especially any adverse events that led to the problem, as well as the systems that were involved, how long the problem occurred and the overall impact.
  • Determine root cause: The root cause analysis team conducts a brainstorming session using techniques such as Fishbone diagrams, Pareto charts, and other tools to ascertain the root cause. The root cause analysis manager moderates the meeting, which should be collaborative and blameless.
  • Implement the solution: The root cause may point to one or more solutions, and the root cause analysis team has to determine which fix is best and when it should be delivered. Once the solution is implemented, it must be monitored to ensure it’s effective. This process is more formally called root cause corrective action.
  • Document actions: A critical part of root cause analysis is preventing the problem in question from reoccurring. Documenting the problem and its resolution so teams can reference it in the future is essential. The root cause analysis team can also include recommendations for physical or process improvements as well as preventative actions in the documentation.
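
The "Collect data" step above can be sketched in a few lines of Python: evidence from several sources is merged into one chronological timeline, from which the sequence of events and the duration of the problem can be read. The `Event` class, the source labels, and the sample entries are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One piece of collected evidence; the fields are illustrative."""
    timestamp: datetime
    source: str        # for example "log", "alert", or "interview"
    description: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge evidence from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


def problem_duration(timeline: list[Event]) -> timedelta:
    """Time between the first and the last recorded event."""
    if not timeline:
        return timedelta(0)
    return timeline[-1].timestamp - timeline[0].timestamp


# Made-up evidence for illustration only.
logs = [
    Event(datetime(2024, 3, 4, 9, 2), "log", "error rate starts rising"),
    Event(datetime(2024, 3, 4, 9, 40), "log", "error rate back to normal"),
]
alerts = [Event(datetime(2024, 3, 4, 9, 7), "alert", "latency alert fired")]
interviews = [
    Event(datetime(2024, 3, 4, 8, 58), "interview", "config change applied"),
]

timeline = build_timeline(logs, alerts, interviews)
for event in timeline:
    print(event.timestamp.time(), event.source, event.description)
print("Duration:", problem_duration(timeline))
```

Sorting by timestamp is the simplest possible ordering; real evidence often needs clock-skew and time-zone corrections first.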

Five steps to conduct a root cause analysis

1

Properly define the problem using SMART rules to ensure you have identified the problem correctly:

  • Specific
  • Measurable
  • Action-oriented
  • Realistic
  • Time-constrained

2

Confirm the problem is accurately identified based on data and not perceptions.

3

Take immediate action to resolve the problem temporarily.

4

Find the underlying root cause of the problem and take corrective action to prevent the problem from recurring in the future.

5

Record the corrective action in your standard procedures to prevent the problem from happening again.

Providing a temporary fix until you figure out a permanent solution is ideal, as long as you have a plan for resolving the problem at a later date. After completing a root cause analysis, it’s important to hold a postmortem call. The call helps mitigate repeat incidents by bringing teams together to communicate lessons learned and plan how to prevent the incident from happening again.

Root cause analysis frameworks and methodologies

1

Fishbone diagram:
Also called a cause-and-effect diagram (or an Ishikawa diagram, after its creator, Kaoru Ishikawa), a fishbone diagram is a visual method for root cause analysis that organizes cause-and-effect relationships into categories.

2

Pareto analysis: (also known as the “80-20 rule”)
The Pareto Principle states that roughly 80 percent of problems can be traced back to 20 percent of causes. Pareto analysis identifies the problem areas or tasks that will have the biggest payoff. Used during a root cause analysis, it helps identify the most significant errors, which are usually caused by a few problems that can then be targeted for correction. The following are the phases of Pareto analysis:

  • Phase I: Identification of causes of defects
  • Phase II: Collection of sample data
  • Phase III: Graphical representation of results
  • Phase IV: Interpreting the graphed results
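
The four phases above can be approximated with a short Python sketch. It counts how often each cause was observed, orders the causes by frequency, computes the cumulative percentage, and reports the few causes that account for a chosen share of the problems. The cause names, the counts, and the 80 percent threshold are assumptions for illustration, and a text bar chart stands in for a proper Pareto chart.

```python
from collections import Counter


def pareto_table(causes: list[str]) -> list[tuple[str, int, float]]:
    """Return (cause, count, cumulative percent) rows, most frequent first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


def vital_few(rows: list[tuple[str, int, float]],
              threshold: float = 80.0) -> list[str]:
    """Causes needed for the cumulative share to reach the threshold."""
    selected = []
    for cause, _count, cumulative in rows:
        selected.append(cause)
        if cumulative >= threshold:
            break
    return selected


# Phases I and II: made-up sample of observed defect causes.
observed = (["timeout"] * 50 + ["bad config"] * 30 + ["disk full"] * 10
            + ["dns"] * 6 + ["other"] * 4)

# Phase III: a crude text rendering of the chart.
rows = pareto_table(observed)
for cause, count, cumulative in rows:
    print(f"{cause:<12}{'#' * (count // 2):<26}{cumulative:>6}%")

# Phase IV: interpret the result by picking the causes to target first.
print("Target first:", vital_few(rows))
```

With this sample, two of the five causes account for 80 percent of the observations, which is the pattern the analysis is designed to surface.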

3

Five whys technique: (associated with Gemba Gembutsu, a Japanese phrase meaning roughly “the actual place and the actual thing”)
The five whys is a simple problem-solving technique that helps to get to the root of a problem quickly. The five whys strategy involves looking at any problem and drilling down by asking: "Why?" or "What caused this problem?" While you want clear and concise answers, you want to avoid answers that are too simple and overlook important details. Start with the problem statement and ask why it occurred. Typically, the answer to the first "why" should prompt another "why" and the answer to the second "why" will prompt another and so on. Repeat the steps until you have asked at least five whys, hence the name “five whys.” This technique can help you to quickly determine the root cause of a problem.
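
One way to keep a five whys session honest is to write the chain down as you go. The Python sketch below records each answer in order and treats the deepest answer as the current candidate root cause. The `FiveWhys` class and the example incident are hypothetical and are meant only to show the shape of the technique.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FiveWhys:
    """Records a chain of answers to repeated "why?" questions."""
    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to "why?" asked of the previous statement."""
        self.answers.append(answer)

    def root_cause(self) -> Optional[str]:
        """The deepest answer so far, or None if nothing has been asked."""
        return self.answers[-1] if self.answers else None

    def render(self) -> str:
        """Indented text view of the chain, one level per "why?"."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why #{depth}? {answer}")
        return "\n".join(lines)


# Made-up example chain for illustration only.
analysis = FiveWhys("Checkout API returned errors for 20 minutes")
analysis.ask_why("The database connection pool was exhausted")
analysis.ask_why("Slow queries held connections open")
analysis.ask_why("An index on the orders table was missing")
analysis.ask_why("A schema migration dropped the index")
analysis.ask_why("Migrations are not reviewed before release")

print(analysis.render())
print("Candidate root cause:", analysis.root_cause())
```

Note that the chain ends at a process gap rather than at the first technical symptom, which is the kind of answer the technique is meant to reach.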

Root cause analysis gotchas: lessons learned

While every root cause analysis is unique and the approach and techniques used to arrive at a resolution vary, a common set of best practices and lessons can be derived from the outcomes. The following are some pitfalls to avoid and practices to follow along the way:

  • Incomplete or insufficient definition of the problem statement.
  • Focusing on the wrong things or signals. Think long-term and avoid a near-sighted approach.
  • Stopping at the first sight of a symptom or cause and not exploring other avenues/possibilities or digging deeper.
  • Pick your battles. Not everything can be investigated right away. Focus on high-impact, high-consequence incidents.
  • Get relevant teams involved and start a root cause analysis as quickly as possible to prevent critical data loss or overlap with competing priorities.
  • Data gathering for root cause analysis is difficult and time-consuming. Taking shortcuts or relying on untested hypotheses will render the process ineffective or lead to incorrect conclusions. Using the collaboration feature in New Relic will help break down data silos and enable teams to look at data in the context of other platform UI experiences (incidents, alerts, and dashboards).
  • Summarize findings and corrections, build an executive report with recommendations and share with the broader organization.
  • Track recommendations to completion. The ultimate goal of an effective root cause analysis is to make sure that incidents are never repeated. Most recommendations are long-term and will require significant changes or course correction. Tracking progress and holding management accountable with frequent status updates and communication is an effective way to accomplish this goal.
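
Tracking recommendations to completion, the last item above, can start as something very small. The Python sketch below lists open recommendations that are past their due date and reports the overall completion rate, which could feed a regular status update. The `Recommendation` class, the owners, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Recommendation:
    """A follow-up action from a root cause analysis (illustrative fields)."""
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(recommendations: list[Recommendation],
            today: date) -> list[Recommendation]:
    """Open recommendations whose due date has passed, oldest first."""
    late = [r for r in recommendations if not r.done and r.due < today]
    return sorted(late, key=lambda r: r.due)


def completion_rate(recommendations: list[Recommendation]) -> float:
    """Fraction of recommendations marked done (0.0 for an empty list)."""
    if not recommendations:
        return 0.0
    return sum(r.done for r in recommendations) / len(recommendations)


# Made-up follow-up actions for illustration only.
actions = [
    Recommendation("Add review step for migrations", "platform",
                   date(2024, 4, 1), done=True),
    Recommendation("Alert on connection pool usage", "sre",
                   date(2024, 4, 15)),
    Recommendation("Load-test checkout service", "checkout",
                   date(2024, 6, 1)),
    Recommendation("Update runbook", "sre", date(2024, 4, 10)),
]

for item in overdue(actions, today=date(2024, 5, 1)):
    print(f"OVERDUE since {item.due}: {item.title} ({item.owner})")
print(f"Completion rate: {completion_rate(actions):.0%}")
```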

Best practices: post-incident improvements

To make the root cause analysis process easier, here are a few things to remember:

  • If multiple teams deploy simultaneously, keep detailed logs of what each team is doing and when, so you can trace the root cause easily. If it’s hard to keep track of multiple deployments, consider scheduling them at different times. The New Relic change tracking feature allows you to capture deployment changes in any part of the system and use them to contextualize performance data and help resolve issues more quickly.
  • After the incident, hold a postmortem call to discuss what happened and what you can implement as a long-term solution to prevent it from happening again. Also, update your standard operating procedures so that the operations team knows what to do if the same issue happens again.
  • Continuously review your real-time notifications and active alerts. Turn off those that are not important so that you do not miss an important alert when there is a critical issue.
  • Understand the monitoring tools you use during incidents, and keep training on them, so you can resolve issues quickly.
  • Provide a central channel for team communication during incidents whether it’s Slack, Zoom, or a bridge call.
  • Use custom dashboards to track adverse events over time.
  • In spite of your best root cause analysis efforts, if upper management isn’t committed to implementing the corrective action process or doesn’t take it seriously, your root cause analysis will not be effective.
  • The keys to a successful root cause analysis are focusing on continuous improvement, sharing what you’ve learned from the incident with the broader organization, learning from past mistakes, and running exercises with other teams on how to respond if the incident happens again.
  • No one is to blame; root cause analysis is a process in which multiple teams come together and communicate to create a defined process that makes issues easier to resolve and prevents them from recurring.
  • New Relic is an all-in-one observability platform that eliminates blind spots, removes team/data silos, and provides complete visibility into your tech stack in a single unified experience that ultimately helps customers solve interesting business and technical challenges more effectively.
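
The first practice in the list above, keeping track of what each team deployed and when, pays off when an incident starts and you need a short list of suspects. The Python sketch below filters a deployment log down to the changes made shortly before the incident began. It is plain Python and does not use the New Relic change tracking feature or any New Relic API; the `Deployment` class, the one-hour window, and the sample log are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    """One entry in a hypothetical deployment log."""
    team: str
    service: str
    deployed_at: datetime


def suspect_deployments(deployments: list[Deployment],
                        incident_start: datetime,
                        window: timedelta = timedelta(hours=1),
                        ) -> list[Deployment]:
    """Deployments made within `window` before the incident, newest first."""
    earliest = incident_start - window
    candidates = [d for d in deployments
                  if earliest <= d.deployed_at <= incident_start]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


# Made-up deployment log for illustration only.
log = [
    Deployment("payments", "billing-api", datetime(2024, 5, 6, 12, 10)),
    Deployment("search", "indexer", datetime(2024, 5, 6, 13, 40)),
    Deployment("checkout", "cart-service", datetime(2024, 5, 6, 13, 55)),
    Deployment("search", "indexer", datetime(2024, 5, 6, 14, 20)),
]

incident_start = datetime(2024, 5, 6, 14, 0)
for d in suspect_deployments(log, incident_start):
    print(d.deployed_at.time(), d.team, d.service)
```

A deployment close to the start of an incident is only a lead, not a conclusion; the window should be widened if the first candidates are ruled out.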

Conclusion

When organizations build, deploy, and run high-performance, high-throughput systems, there’s always a probability of failures in computing workloads. What’s critical is to have pre-formulated strategies to handle such breakdowns when they occur and restore operations quickly. Applying these best practices and lessons learned, coupled with full-stack observability tools like New Relic, will enable you to drive faster resolution, continue to improve operational efficiency, and prevent incidents from recurring.