The world of SRE and DevOps teams is all about fast responses. The ability to quickly diagnose and resolve a problem can mean a difference of thousands of dollars or clicks. Effective SRE teams push for continuous improvement in measuring and managing every aspect of their on-call cycle (detection, understanding, and resolution) to keep their incident response time as short as possible. However, few teams have taken advantage of the inherent relationships among these stages, which are the key to smarter, more efficient incident response.
Common KPIs used to measure the effectiveness of an SRE team are mean time to detect (MTTD), mean time to understand (MTTU), and mean time to resolution (MTTR). Imagine the SRE cycle as a circle whose area represents your total cost. The longer each of those stages takes, the larger the area.
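To make the three KPIs concrete, here is a minimal Python sketch of how they could be computed from incident timestamps. The `Incident` record and its field names are hypothetical, and the sketch assumes each KPI covers only its own stage of the cycle; many teams instead measure MTTR from the moment the problem starts.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative assumptions.
    started: datetime     # when the problem began
    detected: datetime    # when an alert fired
    understood: datetime  # when the cause was identified
    resolved: datetime    # when service was restored


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean([d.total_seconds() for d in durations]) / 60


def sre_kpis(incidents):
    """Return per-stage averages (in minutes) for a list of incidents."""
    return {
        "MTTD": mean_minutes([i.detected - i.started for i in incidents]),
        "MTTU": mean_minutes([i.understood - i.detected for i in incidents]),
        "MTTR": mean_minutes([i.resolved - i.understood for i in incidents]),
    }
```

Because the stages are measured back to back, shaving minutes off any one of them shrinks the total time a customer is affected.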
Traditional approaches to improving operations efficiency include hiring more engineers, configuring more tools, and training your existing engineers to understand your system better.
However, each of these options addresses only one of the key KPIs, and in some cases it can make the others worse.
Event intelligence and automatic correlations
As the complexity of production systems grows with the introduction of more tools and new technology, DevOps and SRE teams need a more sustainable solution for incident management. That’s where event intelligence and automatic correlations come in.
Every step of the SRE process, and each corresponding KPI, is closely tied to the others. So why not use a tool that takes advantage of this relationship to improve all three together? With New Relic AI—an intelligent platform that automatically discovers correlations in event data across your full stack—each small improvement to one step in the cycle positively impacts the others. Let’s check out an example.
With New Relic AI’s decisions feature, you can create customized logic based on your knowledge of your production system. In this example, a spike in the volume of low-priority incidents for an application indicates a larger underlying problem. The priority of the automatically correlated issue will increase, so it gets attention sooner, and your MTTD just got shorter.
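The idea behind such a rule can be sketched in a few lines of Python. This is not New Relic AI's decision syntax or API; the `IncidentEvent` type, the `escalate_on_spike` function, and the default window and threshold are all hypothetical values chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentEvent:
    # Hypothetical event shape, not a New Relic data type.
    app: str
    priority: str     # e.g. "low", "medium", "high"
    timestamp: float  # seconds since the epoch


def escalate_on_spike(events, now, window_s=300, threshold=5):
    """Return {app: "high"} for every app whose count of low-priority
    events in the last `window_s` seconds reaches `threshold`."""
    recent_low = Counter(
        e.app
        for e in events
        if e.priority == "low" and 0 <= now - e.timestamp <= window_s
    )
    return {app: "high" for app, count in recent_low.items() if count >= threshold}
```

A team would tune the window and threshold to its own traffic; the point is that knowledge such as "many small alerts on this app usually mean something bigger" becomes an explicit, testable rule.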
When an SRE receives a notification about this issue and checks it out in their incident management tool, they’ll immediately notice some relationships between the events. Correlated alerts are shown together, and an “Issue Log” with details about how the issue has developed over time is included.
New Relic AI uses an automatic natural language processing (NLP) algorithm to choose a smart title and analysis summary for the incident so you can understand what’s going on quickly. All the information you need is right in front of you, decreasing the amount of digging required to investigate the issue and shortening MTTU.
Finally, using a powerful machine learning model that learns from historical incident data, New Relic AI provides suggested responders for each incident. If the on-call SRE is stuck on the issue or needs more context to make informed troubleshooting decisions, they can check out the suggested responder. The SRE can then choose to contact that team member or search for documentation that person may have authored.
These easily accessible, continually improving recommendations will help get knowledge to the right people quickly, decreasing MTTR and minimizing production impact for your customers.
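As a rough illustration of how historical incident data can point to the right person, the Python sketch below ranks past resolvers by how often they fixed incidents on the same component. It is a simple frequency count standing in for the machine learning model described above, not New Relic AI's actual method; the `suggest_responders` function and the shape of `history` are assumptions.

```python
from collections import Counter


def suggest_responders(history, component, top_n=2):
    """Rank people by how many past incidents on `component` they resolved.

    `history` is a hypothetical list of (component, resolver) pairs.
    """
    counts = Counter(
        resolver for comp, resolver in history if comp == component
    )
    return [name for name, _ in counts.most_common(top_n)]
```

Every resolved incident adds another pair to the history, which is one simple way recommendations like these can improve over time.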
Minimizing customer impact
So, to recap, customizing decisions will lead to faster and smarter detection. The enriched context of correlated issues will result in speedier understanding. And increased focus using a suggested responder will lead to the right information and, ultimately, faster resolution.
New Relic AI is an AIOps solution for busy SRE and DevOps teams. The solution uses the relationships between the stages of the SRE cycle to leverage your team’s knowledge for more efficient incident response.
Curious about the impact event intelligence and automatic correlations can have on your team’s KPIs? Learn more about New Relic AI.