Errors don’t just show up as red lights on a dashboard. They show up as failed checkouts, stalled requests, or users dropping off at the worst possible moment. Knowing an error exists is easy. Knowing why it’s happening is where the real work begins.

In modern, distributed systems, that work is harder than ever. Services depend on other services, deployments happen constantly, and small failures ripple into larger ones. While teams scramble to piece things together, customers are already feeling the impact, and the business is losing money by the minute. For years, the industry used $5,600 per minute as the average cost of an outage, a figure Gartner suggested back in 2014. More recent data suggests these costs are rising steadily, with enterprise downtime now averaging as much as $14,056 per minute and large organizations seeing losses of up to $23,750 per minute. These numbers add up fast. An eight-hour global outage could cost a company millions in revenue alone, even before you factor in productivity losses, angry users, or brand damage. We all remember the CrowdStrike fiasco. The good news? With the right observability practices and tools, teams can shift from chasing symptoms to tracing solutions.

In this blog, we explain how a structured approach to error analysis helps teams cut downtime and resolve problems faster. We outline a practical framework and describe how tools such as New Relic can be used to put it into practice, reducing MTTR and protecting both users and the business.

Common failure patterns

Even when teams know what’s wrong, they often fall into predictable traps that slow down resolution. These common failure patterns include:

  • Chasing symptoms, not causes
    Most incidents start with a visible symptom: a spike in error rates, a service timing out, or a user-facing 500. However, teams often treat these error signals in isolation instead of tracing the underlying failure patterns. Alerts go off, but there's no immediate path to a fix, leaving teams firefighting surface issues. This narrow focus creates alert noise and distracts from the real root causes.
  • Data overload and noise
    Many teams still cling to “collect everything” monitoring, capturing every metric, log, and trace. This strategy may sound like a safety net, but it backfires quickly. Today’s systems produce too much telemetry, drowning teams in noise and inflating costs. Valuable signals get buried, and engineers spend more time filtering than fixing, increasing MTTR instead of reducing it.
  • Siloed dashboards and disconnected views
    Monitoring tools often exist in isolation: APM in one pane, logs in another, dashboards built per team. When systems fail, errors become puzzles spread across multiple screens. Engineers are forced into context switching, manually correlating data instead of focusing on resolution. This fragmentation is one of the biggest drivers of slow incident response.
  • Alert fatigue and desensitization
    When monitoring systems fire off constant alerts, many of them low-priority or even false positives, the signal loses credibility. Engineers tune them out, and critical incidents risk being missed altogether. This “alert fatigue” is so common that the Google SRE Book explicitly warns against it, recommending symptom-based alerting with clear prioritization.
  • Missing change context
    The majority of outages are tied to recent changes such as deployments, configuration updates, or infrastructure changes. Without a clear link between those events and system behavior, teams waste time chasing blind alleys. DORA’s State of DevOps reports highlight that elite teams recover faster because they tightly couple change tracking with incident response. Missing this context means engineers are often “debugging in the dark.”

These patterns persist because traditional monitoring tools were never designed to connect the dots across systems. Breaking them requires moving beyond traditional monitoring and putting errors, metrics, events, logs and traces all in one workflow. That’s where observability comes in.

Observability ≠ monitoring: Understanding the difference

Traditional monitoring is built around logs and metrics, and it is fundamentally reactive and siloed. Logs give you a snapshot of what happened at a specific point in time, but they are often unstructured and tied to a single process or service; metrics aggregate numbers over windows of time, often hiding the granular context that leads to an error. When an alert fires, say a spike in 5xx responses, you instantly know something is wrong, yet you still need to hunt across multiple dashboards, pull up loads of logs, and manually correlate timestamps to discover whether the failure originates from your own code, a downstream dependency, or a configuration change.

Observability, on the other hand, transforms that experience by weaving together traces, structured logs, metrics, events, and context into a single, query‑able fabric. Distributed tracing stitches the journey of an individual request across all services it touches, pinpointing exactly where latency or failure occurs. Context‑aware logs are automatically linked to those trace spans via identifiers such as trace.id, allowing engineers to jump from a high‑level error alert straight into the precise log line that caused it, all without manual copy‑paste or cross‑screen navigation. By ingesting deployment events and configuration changes into the same data layer, observability gives you an instant view of whether a recent change aligns with the onset of symptoms. In short, while monitoring tells you something is wrong, observability lets you answer why it’s happening, thereby turning a noisy symptom‑driven loop into a focused root‑cause investigation.
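
To make that log-to-trace linkage concrete, here is a minimal sketch of how a service might attach the active trace context to its log output. It assumes the OpenTelemetry Java API and SLF4J with MDC; the trace.id and span.id key names follow the logs-in-context convention, and in many setups an APM agent or log forwarder injects them automatically.

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class TraceAwareLogging {
    private static final Logger log = LoggerFactory.getLogger(TraceAwareLogging.class);

    // Hypothetical handler: enrich every log line with the current trace context
    // so an error alert can be followed straight to the exact log entry.
    public void handleGetAds(String productId) {
        SpanContext ctx = Span.current().getSpanContext();
        MDC.put("trace.id", ctx.getTraceId());
        MDC.put("span.id", ctx.getSpanId());
        try {
            log.info("Fetching ads for product {}", productId);
            // ... business logic ...
        } catch (RuntimeException e) {
            // Because trace.id is in the MDC, this log line can be joined to the
            // full distributed trace without any manual correlation.
            log.error("Failed to fetch ads for product {}", productId, e);
            throw e;
        } finally {
            MDC.remove("trace.id");
            MDC.remove("span.id");
        }
    }
}
```

The MDC values only appear if the log pattern or JSON encoder includes them, but once they do, every error log line carries the identifier needed to pivot straight into the corresponding trace.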

Tracing the errors: A step‑by‑step root cause analysis (RCA) playbook

When an incident surfaces, it’s tempting to jump straight into a “fix‑it” mindset. Instead, treat every error as a breadcrumb that leads you deeper into the system’s fabric: Symptom → Signal → System → Root Cause. Each step narrows the scope of the investigation until the actual cause is revealed.

  1. Capture the Symptom

Start with what the user or alert reported: an error spike, a latency surge, a 5xx response. Record the exact timestamp, affected endpoint, and any accompanying alert message. This snapshot anchors everything that follows; if it's off by even a few seconds, you'll be chasing the wrong event.

  2. Translate to a Signal

Ask “What is this symptom telling me?” and turn the raw observation into a measurable signal, such as an error rate, a latency percentile, or throughput. For example, ask:

  • How many requests failed per minute?
  • Which request duration percentiles have spiked?
  • Has the number of incoming requests changed?

Plot these metrics over time and look for the moment the anomaly began. The signal is what will let you correlate across services.
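
As an illustration of what a measurable signal looks like in code, here is a minimal sketch that records these three signals per endpoint so they can be plotted over time. It assumes Micrometer as the metrics library and a hypothetical /product/{id} endpoint; any metrics API your platform ingests would work the same way.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class RequestSignals {
    private final Timer requestTimer;   // throughput plus latency percentiles
    private final Counter errorCounter; // failed requests per interval

    public RequestSignals(MeterRegistry registry) {
        this.requestTimer = Timer.builder("http.server.requests")
                .tag("endpoint", "/product/{id}")     // hypothetical endpoint
                .publishPercentiles(0.5, 0.95, 0.99)  // the percentiles to watch for spikes
                .register(registry);
        this.errorCounter = Counter.builder("http.server.errors")
                .tag("endpoint", "/product/{id}")
                .register(registry);
    }

    public void record(Duration latency, boolean failed) {
        requestTimer.record(latency); // request count over time = throughput
        if (failed) {
            errorCounter.increment(); // failures over time = error rate numerator
        }
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        RequestSignals signals = new RequestSignals(registry);
        signals.record(Duration.ofMillis(120), false);
        signals.record(Duration.ofMillis(2500), true); // a slow, failing request
    }
}
```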

  3. Isolate the system under investigation

At this point, you know something went wrong but not where. Use your telemetry to narrow down the scope:

  • Filter by service or host that first shows abnormal metrics.
  • Look for downstream dependencies (databases, caches, third‑party APIs) whose own metrics have shifted.

The goal is to reduce the universe of possibilities to a handful of candidate systems.

  4. Drill into the system with traces and logs

Once you’ve pinned down a service, it’s time to see where inside that service the problem lives:

  • Trace a representative request: Pick a failing transaction and follow it end-to-end. The first span where latency jumps or an exception is thrown usually points to the immediate failure point.
  • Inspect logs linked to that trace: Look for stack traces, error messages, or custom attributes (e.g., db.queryTime). Logs give you the exact code line or external call that failed.
  • Check dependency health at that moment: If a span shows an external API call, verify whether that service’s own metrics were abnormal at the same timestamp.

If the trace stops abruptly after calling an external system, the fault likely lies outside your service. Conversely, if the error originates within your code, you’ll see it in the log and stack trace.
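
To show why a trace can pinpoint the failure this precisely, here is a minimal sketch of instrumenting a dependency call with the OpenTelemetry Java API. The class and method names and the db.queryTime attribute are hypothetical; the point is that a failing call produces an errored span with the exception attached, which is exactly what the trace view surfaces.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

import java.util.List;

public class AdRepository {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("ad-service");

    // Hypothetical method: wraps a database query in its own span so the trace
    // shows exactly which dependency call failed and how long it took.
    public List<String> findAdsForProduct(String productId) {
        Span span = tracer.spanBuilder("db.findAdsForProduct").startSpan();
        long start = System.nanoTime();
        try (Scope ignored = span.makeCurrent()) {
            return queryDatabase(productId); // assumed data-access call
        } catch (RuntimeException e) {
            // The span is marked as errored and carries the exception, so it
            // shows up flagged in the distributed trace view.
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.setAttribute("db.queryTime", (System.nanoTime() - start) / 1_000_000); // ms
            span.end();
        }
    }

    private List<String> queryDatabase(String productId) {
        // Placeholder for the real data access; imagine it failing when the pool is exhausted.
        throw new IllegalStateException("SQL connection failed with status {Too many connections}");
    }
}
```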

  5. Correlate with external events

External changes can be the silent drivers of failure. Align your incident timeline with any events that could have affected the system, such as:

  • Deployment windows: Did a new version go live on the same day? Even if the deployment was to a different service, changes in contract or data format can ripple downstream.
  • Configuration changes: A recent tweak to a timeout value, connection pool size, or feature flag may have introduced instability.
  • Infrastructure events: Cloud provider maintenance, network latency spikes, or DNS updates often surface as intermittent errors.

If an external event lines up with the first anomaly in your trace, you’ve likely found the root cause’s origin point.
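
Change correlation only works if those events land in the same data layer as your telemetry. As a minimal sketch, assuming the New Relic Java agent is installed, a deploy hook could record the change as a custom event like the one below; New Relic's change tracking feature is the more complete route, and the class and attribute names here are hypothetical.

```java
import com.newrelic.api.agent.NewRelic;

import java.util.HashMap;
import java.util.Map;

public class DeploymentMarker {
    // Hypothetical hook, called from a CI/CD step, that records the change as a
    // custom event so it can be lined up against error and latency signals.
    public static void recordDeployment(String service, String version, String changelog) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("service", service);
        attributes.put("version", version);
        attributes.put("changelog", changelog);
        attributes.put("timestampMs", System.currentTimeMillis());
        NewRelic.recordCustomEvent("Deployment", attributes);
    }

    public static void main(String[] args) {
        recordDeployment("ad-service", "1.42.0", "Reduced SQL connection pool size");
    }
}
```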

Once you've identified a possible root cause, confirm it by reproducing the behavior (if safe) or by applying a temporary fix, such as rolling back to the previous version or adjusting configuration values, depending on the suspected cause. If the symptom resolves, your hypothesis stands; if not, iterate back to step 4 with the new data. By following this sequence, you turn a noisy incident into a clear, repeatable investigative process. The result is faster MTTR and fewer repeated outages, all while keeping the analysis focused and efficient.

This framework is tool-agnostic, but it becomes far more powerful when applied within an observability platform like New Relic.

Root cause analysis in New Relic

To illustrate how the RCA playbook maps onto real telemetry, consider the following scenario in New Relic. You receive an alert on the Ad Service, reporting that its error rate has exceeded your defined threshold. That alert is the symptom that something is wrong, and it might look something like this:

Because the alert is tied to the Ad Service, the failing system is already clear. You navigate to the problematic service in New Relic. Since, in this example, the Ad Service is an APM service, the investigation begins on its summary page, where application-level telemetry such as error rate, response time, and throughput is visible.

On this page, you see that the error chart shows a sustained failure rate of around 20% with recurring spikes. Response times climb during those same intervals, while throughput stays steady, ruling out a traffic surge. The symptom is confirmed: the service is returning errors under normal load.

From here, you click on the Errors chart which takes you to the Errors Inbox of the service. Here, you’ll find all the error groups from your service, as shown in the following image. 

In this particular case, you see two dominant error classes:

  • SQL connection failed with status {Too many connections}
  • TimeoutException: context timed out

Let’s take a look at the first error group. You can open the occurrence details view by simply clicking on it. You’ll see the Error group metrics, occurrences, Stack trace, Distributed trace, and other details on this page.

In most cases, the stack trace points directly to the failing code. For example, in the above image, the stack trace points to AdServiceImpl.getAds, and the error is linked to a failing request trace (GET /product/{id}). This narrows the investigation to a specific code path inside the service. You can take it one step further and open the Distributed Trace from this view.

The distributed trace shows the request’s journey across six different services, with the Ad Service span flagged in red. It also indicates where the exception bubbled up in the call stack. Combined with the earlier error message, it’s clear the Ad Service is exhausting its SQL connections, and that failure is rippling out to any product page trying to load ads.

The Logs confirm the runtime failure:

Because logs, traces, and pod metadata are automatically correlated, there’s no manual searching. The investigation has cleanly progressed from symptom, to signal, to system, and finally root cause: the Ad Service cannot open new SQL connections because the pool is saturated.

Now that you know the root cause of the problem, you can finally move to remediation. 
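
What remediation looks like depends on your stack. As a purely illustrative sketch, if the Ad Service managed its database connections with a pool such as HikariCP (an assumption, not something the trace tells us), addressing the saturation might mean right-sizing the pool and failing fast instead of letting waiters pile up:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class AdServiceDataSource {
    // Illustrative settings only; the right values depend on the database's
    // max_connections, the number of service replicas, and observed query latency.
    public static HikariDataSource create(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername(user);
        config.setPassword(password);
        config.setMaximumPoolSize(20);            // cap concurrent connections per instance
        config.setMinimumIdle(5);                 // keep a few warm connections
        config.setConnectionTimeout(3_000);       // fail fast (ms) rather than queueing forever
        config.setMaxLifetime(1_800_000);         // recycle connections every 30 minutes
        config.setLeakDetectionThreshold(10_000); // log connections held longer than 10 seconds
        return new HikariDataSource(config);
    }
}
```

Whether the lasting fix is a larger pool, plugging a connection leak, or raising the database's connection limit is exactly the decision the trace and log evidence above should inform.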

Conclusion

Incidents will always happen, but long outages don’t have to. A structured approach to error analysis can turn hours of guesswork into a repeatable, evidence-driven process. But this structured approach only works when supported by observability tools such as New Relic. It isn’t about collecting more data, but about surfacing the right data at the right time. New Relic connects alerts, traces, logs, and change events all in one workflow so teams can move faster from detection to resolution. The outcome is reduced MTTR, fewer outages, stronger customer trust, and greater confidence in reliability across both engineering and the business.
