MTTR, or mean time to resolution, is one of the most widely used metrics in the systems reliability toolbox. Paradoxically, it is also one of the most misunderstood: many developers and operations teams lack a clear view of how to define MTTR, how to use it, and how to improve it in a consistent and sustainable way.
As organizations increasingly rely on software to run their businesses, these disconnects around MTTR aren't just inconvenient. They threaten the bottom line by disrupting the digital customer experience, and they add significant cost, risk, and complexity to the software development process.
The key to avoiding these problems is to adopt a progressive approach to defining and applying MTTR. That approach combines three things: comprehensive instrumentation and monitoring, a robust and reliable incident-response process, and a team that understands how and why to use MTTR to maximize application availability and performance. To help you do that, New Relic has collected 10 best practices for reducing MTTR the right way, all in the context of building a healthy incident-response strategy. We’ll also explain how the New Relic platform supports a DevOps team’s MTTR-reduction goals.