The world of SRE and DevOps teams is all about fast responses. Diagnosing and resolving a problem quickly can be the difference between keeping and losing thousands of dollars or clicks. Effective SRE teams push for continuous improvement in measuring and managing every stage of their on-call cycle (detection, understanding, and resolution) to keep their incident response time as short as possible. However, few teams take advantage of the inherent relationships between these stages, which are the key to smarter, more efficient incident response.
Common KPIs used to measure the effectiveness of an SRE team are mean time to detect (MTTD), mean time to understand (MTTU), and mean time to resolution (MTTR). Imagine the SRE cycle as a circle whose area represents your total cost: the longer each of those stages takes, the larger the area.
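To make these KPIs concrete, here is a minimal Python sketch that computes them from incident timestamps. The `Incident` fields and the sample data are hypothetical, and the sketch assumes each KPI is measured over its own stage of the cycle. Teams that measure MTTR from the start of the outage would use `resolved_at - started_at` instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # All four timestamps are hypothetical fields used for illustration.
    started_at: datetime     # when the problem began
    detected_at: datetime    # when an alert fired
    understood_at: datetime  # when the cause was identified
    resolved_at: datetime    # when service was restored


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def sre_kpis(incidents):
    """Return MTTD, MTTU, and MTTR in minutes, one stage per KPI (an assumption)."""
    return {
        "MTTD": mean_minutes([i.detected_at - i.started_at for i in incidents]),
        "MTTU": mean_minutes([i.understood_at - i.detected_at for i in incidents]),
        "MTTR": mean_minutes([i.resolved_at - i.understood_at for i in incidents]),
    }


# Two made-up incidents that both start at noon.
t0 = datetime(2020, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=25), t0 + timedelta(minutes=55)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=25), t0 + timedelta(minutes=35)),
]
kpis = sre_kpis(incidents)
print(kpis)  # MTTD 10.0, MTTU 15.0, MTTR 20.0 (minutes)
```

Under this convention, the three values add up to the average time from onset to resolution, which matches the idea of the stages together making up the total cost.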
Traditional approaches to improving operations efficiency include hiring more engineers, configuring more tools, and training your existing engineers to understand your system better.
However, each of these options addresses only one of the key KPIs. In some cases, improving one KPI this way can even make the others worse.
Event intelligence and automatic correlations
As the complexity of production systems grows with the introduction of more tools and new technology, DevOps and SRE teams need a more sustainable solution for incident management. That’s where event intelligence and automatic correlations come in.
Every step of the SRE process, and each corresponding KPI, is closely tied to the others. So why not use a tool that takes advantage of this relationship to improve all three together? With New Relic AI—an intelligent platform that automatically discovers correlations in event data across your full stack—each small improvement to one step in the cycle positively impacts the others. Let’s check out an example.
With New Relic AI’s decisions feature, you can create customized logic based on your knowledge of your production system. In this example, a spike in the volume of low-priority incidents for an application indicates a larger underlying problem. The decision raises the priority of the automatically correlated issue, so the larger problem is detected sooner and your MTTD gets shorter.
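The logic behind a decision like this can be sketched in a few lines of Python. This is not the New Relic AI decisions API. The incident dictionary keys, the window, and the threshold are all hypothetical, and the sketch only illustrates the idea of escalating when low-priority incidents spike for one application.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical values; tune them to your own production system.
WINDOW = timedelta(minutes=10)
SPIKE_THRESHOLD = 5


def issue_priorities(incidents, now):
    """Map each app with recent low-priority incidents to an issue priority.

    Each incident is a dict with hypothetical keys "app", "priority",
    and "opened_at". An app whose recent low-priority incident count
    reaches SPIKE_THRESHOLD is escalated to "high".
    """
    recent_low = Counter(
        inc["app"]
        for inc in incidents
        if inc["priority"] == "low"
        and timedelta(0) <= now - inc["opened_at"] <= WINDOW
    )
    return {
        app: ("high" if count >= SPIKE_THRESHOLD else "low")
        for app, count in recent_low.items()
    }


now = datetime(2020, 1, 1, 12, 0)
incidents = [
    # Six low-priority incidents for "checkout" in the last six minutes.
    {"app": "checkout", "priority": "low", "opened_at": now - timedelta(minutes=m)}
    for m in range(6)
] + [
    {"app": "search", "priority": "low", "opened_at": now - timedelta(minutes=2)},
    {"app": "search", "priority": "low", "opened_at": now - timedelta(hours=1)},
]
priorities = issue_priorities(incidents, now)
print(priorities)  # {'checkout': 'high', 'search': 'low'}
```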
When an SRE receives a notification about this issue and checks it out in their incident management tool, they’ll immediately notice some relationships between the events. Correlated alerts are shown together, and an “Issue Log” with details about how the issue has developed over time is included.
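As a rough illustration of how related alerts can end up grouped into one issue with a time-ordered log, the sketch below correlates alerts that share an application and arrive close together. It is a deliberately simple stand-in and not how New Relic AI correlates events. The alert keys and the `GAP` value are assumptions.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)  # hypothetical: maximum gap between related alerts


def correlate(alerts):
    """Group alerts into issues: same app, each alert within GAP of the previous one.

    Each alert is a dict with hypothetical keys "app", "message", and "at".
    Returns a list of issues; each issue is a list of alerts ordered by time,
    which doubles as a simple issue log.
    """
    issues = []
    latest_issue = {}  # app -> the most recent issue opened for that app
    for alert in sorted(alerts, key=lambda a: a["at"]):
        issue = latest_issue.get(alert["app"])
        if issue is not None and alert["at"] - issue[-1]["at"] <= GAP:
            issue.append(alert)  # close enough in time: same issue
        else:
            issue = [alert]      # otherwise start a new issue
            issues.append(issue)
            latest_issue[alert["app"]] = issue
    return issues


t0 = datetime(2020, 1, 1, 12, 0)
alerts = [
    {"app": "checkout", "message": "error rate up", "at": t0},
    {"app": "checkout", "message": "latency up", "at": t0 + timedelta(minutes=3)},
    {"app": "search", "message": "queue depth up", "at": t0 + timedelta(minutes=4)},
    {"app": "checkout", "message": "error rate up", "at": t0 + timedelta(minutes=20)},
]
issues = correlate(alerts)
for issue in issues:
    print(issue[0]["app"], [a["message"] for a in issue])
```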
New Relic AI uses an automatic natural language processing (NLP) algorithm to choose a smart title and analysis summary for the incident so you can understand what’s going on quickly. All the information you need is right in front of you, which reduces the digging required to investigate the issue and shortens MTTU.
Finally, using a powerful machine learning model that learns from historical incident data, New Relic AI provides suggested responders for each incident. If the on-call SRE is stuck on the issue or needs more context to make informed troubleshooting decisions, they can check out the suggested responder. The SRE can then choose to contact that team member or search for documentation that person may have authored.
These easily accessible, continually improving recommendations will help get knowledge to the right people quickly, decreasing MTTR and minimizing production impact for your customers.
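The idea of learning responder suggestions from historical incidents can be illustrated with a much simpler technique than a machine learning model: counting who resolved past incidents for the same application. The sketch below uses that frequency count as a stand-in. The `(app, resolver)` record format and the names are hypothetical.

```python
from collections import Counter


def suggest_responders(history, app, top_n=2):
    """Rank people by how many past incidents they resolved for the given app.

    `history` is a list of (app, resolver) pairs, a hypothetical record format.
    This is a plain frequency count, not a trained model.
    """
    counts = Counter(resolver for past_app, resolver in history if past_app == app)
    return [person for person, _ in counts.most_common(top_n)]


history = [
    ("checkout", "ana"),
    ("checkout", "ben"),
    ("checkout", "ana"),
    ("search", "cy"),
]
suggested = suggest_responders(history, "checkout")
print(suggested)  # ['ana', 'ben']
```

As more incidents are recorded, the ranking shifts toward whoever has resolved the most similar problems, which is the "continually improving" property in miniature.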
Minimizing customer impact
So, to recap, customizing decisions will lead to faster and smarter detection. The enriched context of correlated issues will result in speedier understanding. And increased focus using a suggested responder will lead to the right information and, ultimately, faster resolution.
New Relic AI is an AIOps solution for busy SRE and DevOps teams. The solution uses the relationships between each stage of the SRE cycle to leverage your team’s knowledge for more efficient incident response.
Curious about the impact event intelligence and automatic correlations can have on your team’s KPIs? Learn more about New Relic AI.
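The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.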