Developers and engineers often use observability to solve three key business and technical challenges: reducing downtime, reducing latency, and improving efficiency.

Outage frequency, mean time to detection (MTTD), and mean time to resolution (MTTR) are common service-level metrics used in security and IT incident management.
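To make these three metrics concrete, here is a minimal sketch of how they could be computed from an incident log. The `Incident` record and its field names are hypothetical, and the sketch assumes resolution time is measured from detection rather than from the start of the outage; organizations define these metrics differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; field names are illustrative only.
    started: datetime   # when the outage began
    detected: datetime  # when the outage was first noticed
    resolved: datetime  # when service was restored


def outages_per_week(incidents, period_days):
    """Outage frequency over an observation window of period_days."""
    return len(incidents) / (period_days / 7)


def mttd_minutes(incidents):
    """Mean time to detection: average of (detected - started)."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents):
    """Mean time to resolution.

    Assumption: measured from detection to resolution, so that
    downtime per outage is MTTD + MTTR.
    """
    return mean((i.resolved - i.detected).total_seconds() / 60 for i in incidents)


# Two made-up incidents, one week apart (not survey data).
t0 = datetime(2024, 1, 1, 9, 0)
t1 = t0 + timedelta(days=7)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=50)),
    Incident(t1, t1 + timedelta(minutes=30), t1 + timedelta(minutes=120)),
]

print(outages_per_week(incidents, period_days=14))  # 1.0
print(mttd_minutes(incidents))                      # 20.0
print(mttr_minutes(incidents))                      # 65.0
```

With these made-up incidents, detection takes 10 and 30 minutes (mean 20) and resolution takes 40 and 90 minutes after detection (mean 65).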

In this section, we review service-level metric benchmarks, including outage frequency, MTTD, and MTTR by business impact level, change in MTTR since adopting an observability solution, which observability capabilities predict a faster MTTD/MTTR, and the average hourly and annual outage cost.

Service-level metrics highlights:

  • 32% experienced high-business-impact outages once per week or more
  • 44% took 30+ minutes to detect high-business-impact outages
  • 60% said it takes 30+ minutes to resolve high-business-impact outages
  • 61% spent $100K+ per hour of downtime for critical business app outages
  • 65% improved MTTR since adopting observability

The survey results found that observability improves service-level metrics, with those who had full-stack observability experiencing fewer outages, a faster MTTD and MTTR, and lower outage costs.

Full-stack observability: fewer outages, faster MTTD, faster MTTR, lower outage costs, higher ROI

Outage frequency

How often are outages occurring that affect customers and end users? Survey results showed that:

  • Outages still happen fairly frequently, but the number of respondents who said they happen once per week or more decreased year-over-year (YoY) by 36% for high-business-impact outages, 52% for medium-business-impact outages, and 63% for low-business-impact outages.
  • Low-business-impact outages happen the most frequently (53% noted once per week or more).
  • High-business-impact outages happen the least frequently (62% noted two to three times per month or fewer), but almost a third (32%) still experience them once per week or more, and 13% still experience them once a day or more.
Business impact          MOST FREQUENT OUTAGES      LEAST FREQUENT OUTAGES
                         (once per week or more)    (2–3 times per month or fewer)
High business impact     31.9%                      61.6%
Medium business impact   41.4%                      54.3%
Low business impact      52.6%                      43.6%

Given how frequently outages occur, the findings on how often organizations learn about them through manual effort and incident tickets are noteworthy.

Like last year, organizations with full-stack observability (by the report’s definition) consistently have fewer outages than organizations without full-stack observability.

LEAST FREQUENT OUTAGES (2–3 times per month or fewer)

Business impact          WITH full-stack observability   WITHOUT full-stack observability
High business impact     66.9%                           59.0%
Medium business impact   60.8%                           51.1%
Low business impact      48.2%                           41.3%
The data once again support a strong association between full-stack observability and less frequent outages. This suggests that the increase in full-stack observability is a likely reason for the less frequent outages.

Role insight
Practitioners were the most likely to say they experience high-business-impact outages once per week or more (34%). Non-executive managers were the most likely to say they experience them 2–3 times per month or fewer (74%).

Regional insight
Asia Pacific had the most frequent high-business-impact outages (41% said once per week or more), while North America had the least frequent (75% said 2–3 times per month or fewer).

Industry insight
The energy/utilities industry had the most frequent high-business-impact outages (40% said once per week or more), followed by retail/consumer (36%). Nonprofits had the least frequent (77% said 2–3 times per month or fewer), followed by government (70%).

Outage frequency by high, medium, and low business impact

Mean time to detection (MTTD)

When we looked at the mean time to detect an outage, a common service-level metric used in security and IT incident management, the survey results showed:

  • The most commonly cited MTTD across all business impact levels was 5–30 minutes.
  • High-business-impact outages tend to take the longest to detect, with 44% saying it takes 30+ minutes and 21% saying it takes at least 60 minutes.
  • MTTD improved in general YoY across all impact levels—for example, it was 8% faster YoY for high-business-impact outages.
Outage type                      FASTEST MTTD                          SLOWEST MTTD
                                 (less than or equal to 30 minutes)    (more than 30 minutes)
High-business-impact outages     48.3%                                 43.5%
Medium-business-impact outages   50.9%                                 42.7%
Low-business-impact outages      60.4%                                 33.6%

Respondents with full-stack observability (based on our definition) were once again more likely to experience the fastest MTTD (30 minutes or less). They also saw the most MTTD improvement. For example, those with full-stack observability were 19% more likely to detect high-business-impact outages in 30 minutes or less, compared to those without full-stack observability.

FASTEST MTTD (30 minutes or less)

Outage type                      WITH full-stack observability   WITHOUT full-stack observability
High-business-impact outages     54.0%                           45.5%
Medium-business-impact outages   54.4%                           49.2%
Low-business-impact outages      65.7%                           57.8%
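As a rough check on how a figure like "19% more likely" can be derived from the percentages above, the sketch below assumes it is the relative difference between the two shares (the "with" share divided by the "without" share, minus one). That reading reproduces the 19% quoted for high-business-impact outages, but the report does not spell out its calculation, and the function name `relative_lift` is ours.

```python
def relative_lift(with_pct, without_pct):
    """Percent by which with_pct exceeds without_pct, relative to without_pct.

    Assumption: "X% more likely" means a ratio of shares, not a
    difference in percentage points.
    """
    return (with_pct / without_pct - 1) * 100


# Share reporting an MTTD of 30 minutes or less: (with, without) full-stack observability.
fastest_mttd = {
    "high": (54.0, 45.5),
    "medium": (54.4, 49.2),
    "low": (65.7, 57.8),
}

for level, (with_fso, without_fso) in fastest_mttd.items():
    print(f"{level}: {relative_lift(with_fso, without_fso):.0f}% more likely")
```

For the high-business-impact row this gives 54.0 / 45.5 ≈ 1.19, or about 19% more likely, whereas the gap in percentage points is 8.5.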

In addition, those with more capabilities deployed had a faster MTTD. For example, respondents who said their organization has 5+ observability capabilities currently deployed were 40% more likely to detect high-business-impact outages in 30 minutes or less, compared to those with 1–4 deployed.

The data support a strong association between a faster MTTD and having full-stack observability and/or five or more observability capabilities deployed. These results imply that investing in observability pays off with better business outcomes (in this case, a faster MTTD).
21% said it takes at least 60 minutes to detect high-business-impact outages

Role insight
ITDMs were more likely to say they have a faster MTTD—51% said it takes less than 30 minutes to detect high-business-impact outages compared to 47% for practitioners.

Regional insight
Asia Pacific had the fastest MTTD for low-business-impact outages (65% said ≤30 minutes), while North America had the fastest MTTD for medium- and high-business-impact outages (~60% said ≤30 minutes).

Organization size insight
Large organizations had a slightly faster MTTD for high-business-impact outages (50% said ≤30 minutes) compared to small (47%) and midsize (46%) organizations.

Industry insight
Education had the fastest MTTD for high-business-impact outages (61% said ≤30 minutes), followed by healthcare/pharma (58%). Nonprofits had the slowest MTTD for high-business-impact outages (69% said 30+ minutes), followed by retail/consumer (55%).

MTTD by high-, medium-, and low-business-impact outages

Mean time to resolution (MTTR)

We see similar patterns with MTTR, another common service-level metric used in security and IT incident management:

  • The majority had an MTTR of at least 30 minutes across all business impact levels.
  • High-business-impact outages tend to take the longest to resolve, with 60% saying it takes more than 30 minutes and 34% saying it takes more than an hour to resolve.
  • MTTR improved in general YoY; for example, it was 26% faster YoY for high-business-impact outages.
Outage type                      FASTEST MTTR                          SLOWEST MTTR
                                 (less than or equal to 30 minutes)    (more than 30 minutes)
High-business-impact outages     30.4%                                 60.2%
Medium-business-impact outages   35.6%                                 57.6%
Low-business-impact outages      46.1%                                 48.0%

Respondents with full-stack observability (based on our definition) were once again more likely to experience the fastest MTTR (30 minutes or less). They also saw the most MTTR improvement YoY. For example, those with full-stack observability were 18% more likely to resolve high-business-impact outages in 30 minutes or less compared to those without full-stack observability.

FASTEST MTTR (30 minutes or less)

Outage type                      WITH full-stack observability   WITHOUT full-stack observability
High-business-impact outages     34.0%                           28.7%
Medium-business-impact outages   36.3%                           35.3%
Low-business-impact outages      47.7%                           45.2%

And those with five or more capabilities deployed had a faster MTTR. For example, respondents who said their organization has 5+ observability capabilities currently deployed were 42% more likely to resolve high-business-impact outages in 30 minutes or less compared to those with 1–4 deployed.

The data support a strong association between a faster MTTR and having full-stack observability and/or five or more observability capabilities deployed. These results imply that investing in observability pays off with better business outcomes (in this case, a faster MTTR).
34% took 60+ minutes to resolve high-business-impact outages

Role insight
ITDMs were more likely than practitioners to say it takes more than 30 minutes to resolve outages.

Regional insight
Those surveyed in Asia Pacific were more likely to say they resolve low- and medium-business-impact outages in 30 minutes or less. Those surveyed in Europe and North America were slightly more likely to say they resolve high-business-impact outages in 30 minutes or less.

Industry insight
Education had the fastest MTTR for high-business-impact outages (42% said ≤30 minutes), followed by retail/consumer (33%). Nonprofits had the slowest MTTR for high-business-impact outages (69% said 30+ minutes), followed by financial services/insurance (66%).

MTTR by high-, medium-, and low-business-impact outages

Total downtime

The frequency of outages and the time it takes to detect and resolve them, as noted above, add up to considerable downtime for organizations. The data show that:

  • The median annual downtime was 23 hours.
  • Those with a mature observability practice experienced 15 hours of downtime per year on average compared to 23 for those whose organizations aren’t as mature.
  • Those that had achieved full-stack observability experienced 20 hours of downtime per year on average compared to 26 for those whose organizations hadn’t achieved full-stack observability.
These results mean those with a mature observability practice experience 34% less downtime per year on average than those with less mature observability practices. In addition, those with full-stack observability experience 25% less downtime per year on average than those without full-stack observability. These findings suggest that observability can help reduce downtime.

Regional insight
Asia Pacific had the highest median annual downtime (37 hours) compared to Europe (26 hours) and North America (12 hours).

Organization size insight
Large organizations had the highest median annual downtime (26 hours) compared to midsize (23 hours) and small (16 hours) organizations.

Industry insight
The energy/utilities industry had the highest median annual downtime (37 hours). Conversely, government and services/consulting organizations had the lowest median annual downtime (both 15 hours).

Outage cost

This year, we wanted to see how much revenue critical business application outages cost organizations per hour of downtime, on average. We also estimated the annual outage cost based on high-business-impact outage frequency, total downtime (MTTD and MTTR), and hourly outage cost.
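The estimate described above can be sketched as a simple product of the three inputs. This is only an illustration under stated assumptions: the function name, the parameter names, and the input values are hypothetical, and the report does not say how bucketed survey answers were converted to numbers.

```python
def annual_outage_cost(outages_per_year, mttd_minutes, mttr_minutes, hourly_cost):
    """Estimate the annual cost of outages for one organization.

    Assumption: downtime per outage is detection time plus resolution
    time, and cost scales linearly with hours of downtime.
    """
    downtime_hours_per_outage = (mttd_minutes + mttr_minutes) / 60
    annual_downtime_hours = outages_per_year * downtime_hours_per_outage
    return annual_downtime_hours * hourly_cost


# Hypothetical respondent (not survey data): one high-business-impact
# outage per month, 30 minutes to detect, 60 minutes to resolve,
# $250,000 per hour of downtime.
cost = annual_outage_cost(
    outages_per_year=12,
    mttd_minutes=30,
    mttr_minutes=60,
    hourly_cost=250_000,
)
print(f"${cost:,.0f}")  # $4,500,000
```

In this example the organization accumulates 18 hours of downtime per year (12 outages at 1.5 hours each), which at $250,000 per hour comes to $4.5 million.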

Hourly outage cost

Regarding how much critical business app outages cost on average per hour of downtime, we found that:

  • Three in five (61%) said they cost at least $100,000, 32% said at least $500,000, and 21% said at least $1 million per hour of downtime. 
  • A quarter (25%) said they cost less than $100,000 per hour of downtime.
  • Notably, 12% said they weren’t sure how much these outages cost.

In addition, the cost of outages is higher for those who don't have full-stack observability or a mature observability practice. For example, 42% of those from organizations with full-stack observability or a mature observability practice (by the report’s definitions) said critical business app outages cost less than $250,000 per hour of downtime compared to 35% of those without full-stack observability, and 37% of those without a mature observability practice.

These results indicate that outages are expensive, but having full-stack observability and/or a mature observability practice helps reduce the cost.
Average revenue cost for critical business application outages per hour of downtime
61% said outages cost $100K+ per hour of downtime

Role insight
Executives and practitioners were more likely to say outages cost $500K+, while non-executive managers were more likely to say they cost less than $100K. Unsurprisingly, practitioners were more likely to say they’re not sure (14%) compared to ITDMs (7%).

Regional insight
North American respondents were more likely to say outages cost $100K or less (36%) and that they’re not sure (20%), while those in Asia Pacific and Europe were more likely to say $500K+ (38% and 35%, respectively).

Organization size insight
Large organizations were more likely to say outages cost $500K+ (38%) compared to those from small (17%) and midsize (25%).

Industry insight
Those from the energy/utilities industry were the most likely to say outages cost $500K+ (52%), followed by nonprofits (46%).

Annual outage cost

Across all respondents who provided a high-business-impact outage frequency, outage time (MTTD and MTTR), and outage cost, the median annual cost of high-business-impact outages was $7.75 million.

Respondents from organizations with full-stack observability (by the report’s definition) had median outage costs of $6.17 million per year compared to $9.83 million per year for those without full-stack observability. That’s a cost savings of $3.66 million per year.

These results mean those with full-stack observability experience a median outage cost that’s 37% lower than the median outage cost for those without full-stack observability. This finding further reinforces the theme that full-stack observability has a positive effect on an organization’s bottom line.
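The savings and percentage figures follow directly from the two medians quoted above. A short worked check (variable names are ours):

```python
# Median annual outage cost in millions of dollars, as quoted in the text.
without_fso = 9.83  # without full-stack observability
with_fso = 6.17     # with full-stack observability

savings = without_fso - with_fso
pct_lower = savings / without_fso * 100

print(f"${savings:.2f}M saved, {pct_lower:.0f}% lower")  # $3.66M saved, 37% lower
```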

Median annual outage cost without full-stack observability: $9.83M

Median annual outage cost with full-stack observability: $6.17M (37% lower)

Regional insight
Asia Pacific had the highest median annual outage cost by a wide margin ($19.07M) compared to Europe ($8.42M) and North America ($1.20M).

Organization size insight
Large organizations had the highest median annual outage cost by a wide margin ($12.04M) compared to midsize ($4.63M) and small ($1.84M) organizations.

Industry insight
The energy/utilities industry had the highest median annual outage cost ($34.31M), followed by nonprofits ($27.87M). Conversely, government had the lowest median annual outage cost ($1.31M).

MTTR change

We wanted to know how respondents thought their organization’s MTTR for outages had changed since adopting an observability solution. We found that:

  • About two-thirds (65%) said their MTTR had improved to some extent, including 31% who said it improved by 25% or more.
  • Only 16% said their MTTR had worsened to some extent.
  • Just 14% said it remained the same.
MTTR change since adopting an observability solution

Several factors were correlated with a greater likelihood of the best MTTR improvements, including:

  • Deploying five or more observability capabilities: 68% compared to 40% for those with zero and 45% for those with 1–4 capabilities deployed.
  • Employing five or more observability practice characteristics: 69% compared to 61% for those with 1–4 characteristics employed.
  • Having a mature observability practice (by the report’s definition): 68% compared to 64% without (56% more likely to experience MTTR improvements of 25%+).
  • Achieving full-stack observability (by the report’s definition): 68% compared to 63% without (27% more likely to experience MTTR improvements of 25%+).
  • Spending at least $100,000 per year on observability: 67% compared to 62% who spend less than $100,000 per year and 17% who spend nothing (the more they spend, the more likely they were to see an improvement).
MTTR improvement since adopting observability by annual observability spend

Those who said they get the most value from observability ($2.5 million or more) were more likely to say their MTTR has improved since adopting it.

These results indicate a clear connection between observability investment and improved MTTR.
65% said MTTR improved since adopting observability

Summary of MTTR change since adopting an observability solution

Regional insight
Asia Pacific was the least likely to say MTTR improved (61%) compared to Europe (68%) and North America (67%).

Organization size insight
Large organizations were the most likely to see improved MTTR since adopting observability (68%) compared to small (59%) and midsize (60%).

Industry insight
Nonprofits were the most likely to see improved MTTR since adopting observability (79%), followed by energy/utilities (78%) and healthcare/pharma (76%).

Predictors of MTTD/MTTR by capability

In addition, the data show a positive association between certain capabilities (including log management, Kubernetes monitoring, alerts, infrastructure monitoring, error tracking, dashboards, and mobile monitoring) and a faster MTTD/MTTR (less than 30 minutes). Of those capabilities, log management is statistically significant at the 5% significance level.

Predictors of an MTTD/MTTR of less than 30 minutes by capability currently deployed