The nine best practices above can help you adopt and internalize an approach to MTTR based on the principles of incident resolution and availability. And the New Relic platform can be a key to successfully adopting this approach.
The New Relic platform offers monitoring, alerting, incident diagnosis, and other capabilities that contribute directly to faster, smarter, more efficient incident resolution, driving significant improvements in MTTR and other performance metrics.
Tools that keep your response team fresh, focused, and efficient
Your ability to alert the right people quickly and efficiently, using accurate and actionable performance insights, can make or break your incident response strategy. New Relic’s full-stack, programmatic alerting capabilities put these tools, and more, at team members’ fingertips.
By defining alert conditions based on the results of custom New Relic Query Language (NRQL) queries, for example, your team can evolve alerts tied to specific, high-load system calls. Performance issues at these points can provide leading indicators of a problem even before it impacts production applications—giving your team the opportunity to find and fix problems before they lead to downtime, lost revenue, and customer complaints. Plus, features such as outlier detection and incident context use applied intelligence to spot problems faster and to suggest where to start an incident investigation.
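To make the idea of a leading-indicator alert concrete, here is a minimal Python sketch of a condition that fires only when a signal stays above a threshold for several consecutive samples. It is an illustration, not New Relic's implementation: the `AlertCondition` class, the `first_violation` helper, and the sample durations are all hypothetical, and in practice the signal would come from the results of an NRQL query rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AlertCondition:
    """Hypothetical condition: fire after `window` consecutive breaches."""
    name: str
    threshold: float  # e.g. average call duration in seconds
    window: int       # consecutive samples that must exceed the threshold


def first_violation(condition: AlertCondition,
                    samples: Sequence[float]) -> Optional[int]:
    """Return the index of the sample at which the condition fires, or None."""
    streak = 0
    for i, value in enumerate(samples):
        # A single sample at or below the threshold resets the streak,
        # so isolated spikes do not page anyone.
        streak = streak + 1 if value > condition.threshold else 0
        if streak >= condition.window:
            return i
    return None


# Made-up per-minute averages for a high-load system call, in seconds.
durations = [0.21, 0.24, 0.55, 0.23, 0.61, 0.66, 0.72, 0.80]
slow_call = AlertCondition(name="slow system call", threshold=0.5, window=3)
fired_at = first_violation(slow_call, durations)
```

Requiring a sustained breach rather than a single spike is one simple way to catch a degrading call early while keeping notification noise down.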
New Relic Alerts also helps to prevent alert fatigue—a growing problem for incident-response teams operating in microservices environments. Flexible alert policies and notification-channel options give teams greater control over the flow of alert-incident data, while minimizing "noise" due to redundant alert conditions.
Tools that assess end-to-end system performance
In addition, operations teams can use New Relic Synthetics to close a critical blind spot for many DevOps teams: monitoring and understanding end-to-end system behavior. Synthetics gives organizations a range of options to measure endpoint performance—from sending a simple ping command to in-depth monitors that run scripts to simulate complex scenarios. Synthetics also supports the use of containerized private minions to monitor internal sites and expand geographic coverage—raising the bar on security, cloud-readiness, and flexibility.
Tools that add user experience insights
In many cases, a fast and successful incident resolution requires the ability to adopt a user's point of view—whether to understand how an incident impacts user experience or to assess the impact of user interactions. New Relic Browser achieves this goal by offering deep visibility and insight into how users are interacting with an application or website. Browser goes far beyond page-load timing to address the entire life cycle of a page—from individual session performance and AJAX requests to JavaScript errors and monitoring of single-page application architectures.
Browser also helps responders understand the role that geography plays during an incident, for example by filtering performance metrics and Apdex scores by global region or state, and by maintaining URL-segment whitelists and domain-specific blocking or monitoring.
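As a rough illustration of what comparing Apdex scores by region involves, the Python sketch below applies the standard Apdex formula (satisfied samples plus half of the tolerating samples, divided by all samples) to response times grouped by region. The function names, the 0.5-second threshold, and the sample data are assumptions made for this example; they are not drawn from New Relic Browser.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def apdex(response_times: List[float], t: float) -> Optional[float]:
    """Standard Apdex: satisfied <= t, tolerating <= 4t, the rest frustrated."""
    if not response_times:
        return None
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)


def apdex_by_region(samples: Iterable[Tuple[str, float]],
                    t: float) -> Dict[str, Optional[float]]:
    """Group (region, seconds) samples by region and score each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for region, seconds in samples:
        grouped[region].append(seconds)
    return {region: apdex(times, t) for region, times in grouped.items()}


# Made-up page-load times in seconds, tagged with a hypothetical region.
samples = [
    ("EU", 0.3), ("EU", 0.4), ("EU", 1.2), ("EU", 3.0),
    ("US", 0.2), ("US", 0.5),
]
scores = apdex_by_region(samples, t=0.5)
```

A gap between regional scores like the one in this sample data is the kind of signal that can point responders toward a regional cause, such as a network path or a CDN edge, rather than an application-wide fault.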
Tools that support informed and aware incident resolution
Incident resolution in a modern microservices environment is always a data-driven process. The challenge is ensuring that teams get the right data to make timely and accurate decisions during an incident, and to assess and improve their response during a retrospective.
New Relic Insights is connected to every product in the New Relic platform, giving DevOps teams confidence that they have the right data to make informed incident-response decisions. Along with many other benefits, Insights dashboards give stakeholders a shared visual language to understand the scale, origins, and impact of an incident; shared views into baselines that define healthy vs. unhealthy systems; and real-time visibility into the impact and effectiveness of a team's incident response activities.
Insights is also essential to closing the loop on the incident resolution process—providing the right context to conduct a retrospective, to identify and assess additional incident follow-up activities, and to support and direct team education and cross-training.
Tools that combat complexity and simplify troubleshooting
Finally, New Relic helps organizations to deal with the growing complexity of modern, distributed microservices environments. Complexity is the price we pay for reaping the benefits of microservices, but it's also a major barrier to creating a fast and efficient incident resolution process.
New Relic Infrastructure's Kubernetes cluster explorer is a prime example of how New Relic helps to give a team clarity and visibility into highly complex systems, even at massive scale. Kubernetes cluster explorer provides a multi-dimensional representation of a Kubernetes cluster that lets you zoom into your namespaces, deployments, nodes, pods, containers, and applications. The cluster explorer lets you retrieve the data and metadata of these elements easily and understand how they are related, with the help of highly intuitive visualization tools.
Also, by moving effortlessly between high-level and highly detailed views, Kubernetes cluster explorer gives every stakeholder in the process a single, shared point of reference for troubleshooting and understanding the health of a cluster. This can accelerate your resolution process by getting everybody on the same page, and eliminating needless finger-pointing and miscommunication.
The distributed tracing capabilities in New Relic APM also help to combat complexity: in this case, the problems that arise when tracing the cause of latency and other performance issues in distributed application architectures. Distributed tracing allows a team to trace the path of a request as it travels across a complex system; it reveals the latency of components along that path; and it shows which component in the path is creating a bottleneck.
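The bottleneck analysis described above can be sketched in a few lines of Python. This is a simplified illustration, not how New Relic APM computes it: the `Span` structure, the service names, and the durations are hypothetical, and the sketch assumes that each span's children run one after another rather than concurrently. It attributes to each span its "self time" (its duration minus the time spent in its direct children) and reports the span with the largest value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, minimal span record for one hop in a request path."""
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Duration of each span minus the total duration of its direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        span.span_id: span.duration_ms - child_total.get(span.span_id, 0.0)
        for span in spans
    }


def bottleneck(spans: List[Span]) -> Span:
    """Return the span that accounts for the most time on its own."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span.span_id])


# Made-up trace: web -> checkout -> (inventory, then payments).
trace = [
    Span("a", None, "web", 480.0),
    Span("b", "a", "checkout", 430.0),
    Span("c", "b", "inventory", 60.0),
    Span("d", "b", "payments", 340.0),
]
slowest = bottleneck(trace)
```

Ranking by self time rather than total duration keeps the root span, which always has the longest total duration, from masking the component that is actually slow.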
Distributed tracing also leverages the intelligence built into the New Relic platform—using tools like anomalous span detection, trace charts, and custom queries of distributed trace data to help you isolate, diagnose and troubleshoot problems quickly and with confidence.
Taking the right approach to MTTR can be a complex and challenging task—that's the reality of working with modern application architectures. But with all of these capabilities, and many others, the New Relic platform is an essential tool to help you to implement a faster, smoother, and more reliable incident-resolution process.