In my observability workshops with enterprise customers, the conversation often shifts from "what and how do we instrument?" to "how do we manage the alert noise created by all this telemetry?"

I see this constantly: great telemetry, but a broken response process. You have successfully instrumented your stack, but now your SREs are suffering from alert fatigue, your L1 responders are overwhelmed with context-less tickets, and your MTTR (Mean Time to Resolution) is stuck. This usually happens because the process architecture hasn't caught up with the technology architecture.

To solve this, I walk teams through the Alert Lifecycle Reference Architecture. This blueprint maps the flow of an incident through three critical domains: Knowledge, Action, and Record. It clarifies exactly who should be doing what, and where. Many organizations have their own models, some similar to this one and some different; this is just one example we can use to get the conversation going.

To fix this, we need to look at how an incident actually travels from a spiked metric to a resolution.

The Three Domains: Knowledge, Action, and Record

The top half of the architecture focuses on the tools: the systems necessary to the process. We've designated these the System of Knowledge, the System of Action, and the System of Record.

1. System of Knowledge (Intelligence)

This is your Observability Layer (e.g., New Relic). Its primary jobs are Detection & Intelligence.

  • It prioritizes the completeness of raw telemetry.
  • It is responsible for correlation, AI-based anomaly detection, and the initial definition of alert conditions.
  • The Golden Rule: I always tell my customers that if it isn't in the data, then as far as the business is concerned, it never happened.
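To make the Knowledge layer concrete, here is what the query behind a NRQL alert condition might look like. The service name and signal are hypothetical, chosen only for illustration:

```sql
// Hypothetical NRQL alert condition query: watch p95 transaction
// latency for a checkout service, broken out per host.
SELECT percentile(duration, 95)
FROM Transaction
WHERE appName = 'checkout-service'
FACET host
```

In New Relic, the threshold itself (for example, "above 2 seconds for 5 consecutive minutes") is configured on the alert condition rather than written into the query.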

2. System of Action

An effective alerting strategy is one of the most important assets of any successful team. With New Relic, you can ensure that the right members of your team get the alerts they need as quickly as possible, while filtering out expected behavior to minimize alert fatigue.

This is your Notification & Response Engine. Its primary jobs are Aggregation, Notification, and Routing.

  • It prioritizes connectivity and speed.
  • It enriches the alert payload and ensures it reaches the right human.
  • It may aggregate related alerts to reduce noise.
  • It acts as the orchestrator, ensuring that the right workflows trigger automatically so teams aren't scrambling to find phone numbers.
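As a minimal sketch of what enrichment and routing mean in practice, the following Python snippet attaches a runbook link and an owning team to an incoming alert. The payload shape, team map, and URLs are all hypothetical; this is not New Relic's actual webhook schema:

```python
# Minimal sketch of a System of Action: enrich an alert payload and
# pick a notification route. Field names and mappings are hypothetical.

RUNBOOKS = {
    "checkout-service": "https://wiki.example.com/runbooks/checkout",
}

ON_CALL = {
    "checkout-service": "team-payments",
}

def enrich_and_route(alert: dict) -> dict:
    """Attach runbook and owning team so the responder gets context, not just a page."""
    service = alert.get("service", "unknown")
    return {
        **alert,
        "runbook": RUNBOOKS.get(service, "https://wiki.example.com/runbooks/default"),
        "notify": ON_CALL.get(service, "team-sre-catchall"),
    }

alert = {"service": "checkout-service", "condition": "p95 latency > 2s", "severity": "critical"}
routed = enrich_and_route(alert)
print(routed["notify"])   # team-payments
print(routed["runbook"])  # https://wiki.example.com/runbooks/checkout
```

The design point is that the enriched payload, not the raw alert, is what reaches the human: the responder should never have to go hunting for the runbook or the owning team.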

3. System of Record (Authoritative History)

This is your Incident Management system. Its primary jobs are Reporting and Governance.

  • It prioritizes data integrity and auditability.
  • This is where SLAs are tracked and where the Root Cause Analysis (RCA) lives.
  • It is the authoritative source for "business truth"—reporting on impact and executive visibility.
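To illustrate the reporting role, here is a minimal sketch of computing MTTR from incident history. The record shape and sample timestamps are hypothetical:

```python
# Minimal sketch of System of Record reporting: compute MTTR from
# incident history. The record shape is hypothetical.
from datetime import datetime, timedelta

incidents = [
    {"opened": datetime(2024, 1, 10, 14, 0), "resolved": datetime(2024, 1, 10, 14, 45)},
    {"opened": datetime(2024, 1, 12, 9, 30), "resolved": datetime(2024, 1, 12, 10, 45)},
]

def mean_time_to_resolution(records: list[dict]) -> timedelta:
    """Average of (resolved - opened) across closed incidents."""
    durations = [r["resolved"] - r["opened"] for r in records]
    return sum(durations, timedelta()) / len(durations)

print(mean_time_to_resolution(incidents))  # 1:00:00
```

The integrity requirement matters here: a number like this is only defensible in an SLA review if the open and resolve timestamps were recorded authoritatively, not reconstructed after the fact.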

4. New Relic as Unified System

When I present this architecture to customers, the instinctive reaction is to reach for a three-tool stack: New Relic for observability, PagerDuty for on-call routing, and ServiceNow for incident records. That is a perfectly valid deployment pattern, and for organizations with deeply embedded ITSM workflows, it may be the right call.

But it is worth pausing on an assumption embedded in that model—that physical separation of tools is required to achieve logical separation of concerns. It is not. New Relic is architected to serve all three roles natively, on a single data plane, without compromising the distinct purpose each system is meant to serve.

The architectural advantage is what I call data gravity. When detection, routing, and record-keeping all draw from the same underlying store (New Relic Database, or NRDB), you eliminate the fragile ETL pipelines, delayed syncs, and context loss that plague multi-tool stacks. Your post-incident query runs against the same raw telemetry that triggered the alert in the first place. Nothing is approximated; nothing is lost in translation.
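For example, a post-incident review query can replay the exact incident window against the same raw telemetry in NRDB. The service name and time window below are hypothetical:

```sql
// Hypothetical post-incident query: the same raw telemetry that fired
// the alert, replayed over the incident window.
SELECT percentile(duration, 95)
FROM Transaction
WHERE appName = 'checkout-service'
SINCE '2024-01-10 14:00:00' UNTIL '2024-01-10 15:00:00'
TIMESERIES 1 minute
```

Because this runs against the original events rather than a synced copy, the RCA is reasoning over exactly what the alert condition saw.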

SystemPrimary JobNew Relic Capability
System of KnowledgeDetection & Intelligence

APM, Infrastructure, Logs, Browser, Synthetics, Distributed Tracing, AI anomaly detection, New Relic Lookout, custom NRQL alert conditions

System of ActionAggregation, Notification & RoutingAlert Policies, Workflow Engine, Notification Channels (Slack, PagerDuty, email, webhooks), Muting Rules, alert enrichment via NerdGraph
System of RecordReporting & GovernanceIssues & Incidents feed, incident timeline history, SLO/SLA dashboards, NerdGraph queryable incident data, Workloads for business context grouping.

Deploying New Relic in this unified mode does not eliminate the need for the architectural discipline described throughout this post. It sharpens it. You still need SREs owning the knowledge layer—defining alert conditions, thresholds, and SLOs. You still need your Workflow configuration treated as a first-class engineering artifact, reviewed and versioned like code. And you still need the incident record treated as an audit trail, not a scratch pad.

The difference is that your teams are navigating one platform, not three context switches. When an L2 responder pulls up an incident, the correlated traces, the triggering metric, and the full incident timeline are in the same pane of glass. When the SRE closes the loop in RCA, the runbook update and the alert tuning happen in the same environment where the anomaly was first detected.

Physical unification does not mean organizational consolidation. Keep your governance boundaries. Keep your team ownership lines. Keep the logical separation of concerns that makes this architecture work. What you shed is the operational drag of synchronizing three independent systems that were never designed to talk to each other cleanly.

For organizations that already have PagerDuty or ServiceNow deeply embedded, New Relic integrates with both and can serve as the System of Knowledge while delegating Action and Record to those platforms. For organizations earlier in their tooling journey, or those rationalizing their vendor footprint, the unified model deserves serious consideration.

The architecture does not prescribe a tool count. It prescribes a clear-cut separation of concerns.

The Human Layer: Roles and Escalation

 

A good culture can work around broken tooling, but the opposite rarely holds true.

Technology moves the data, but people solve the problems. The bottom half of the architecture defines the Escalation Ladder — but a ladder without defined rungs is just a pole. The goal of this structure is not simply to create tiers; it is to establish clear ownership of each system, explicit criteria for each handoff, and a cultural contract that protects your most expensive engineering resources from noise they should never see.

Two failure modes collapse this architecture faster than any tooling gap. The first is L3 bypass — an executive or product manager calls an engineer directly, short-circuiting the ladder and teaching the organization that the process doesn't apply under pressure. The second is IC role collapse — the Incident Commander gets pulled into technical problem-solving and abandons the coordination function, leaving no one managing the war room. Both failures are cultural, not technical, and both are preventable if roles are defined before an incident, not during one.

The table below maps each role to their system ownership, secondary touchpoints, and the specific function they serve in the lifecycle.

SRE / DevOps
  • Primary system: System of Knowledge; secondary: System of Record
  • Defines instrumentation, SLOs, and alert conditions. Owns the quality of detection. The SRE's most critical function runs after an incident closes — translating RCA findings into tuned alerts, updated runbooks, and improved coverage. Configuration is table stakes; post-incident learning is where the system improves.

L1 Responder / NOC
  • Primary system: System of Action; secondary: System of Knowledge
  • Monitors the incident console and handles high-volume, low-complexity events using runbooks. Their secondary touchpoint with the System of Knowledge is read-only — dashboards and alert context inform triage, but L1 does not modify alert conditions. Escalation trigger: runbook exhausted, issue unresolved within the defined SLA window, or confirmed customer impact.

L2 Triage Team
  • Primary system: System of Action; secondary: System of Record
  • Cross-functional responders who perform deeper analysis — restarting services, rolling back non-critical configs, or confirming false positives. They update the System of Record with triage findings and own the escalation decision to L3. They are the final gate. Escalation trigger: confirmed novel issue, suspected code-level root cause, or customer-facing SEV1 impact requiring L3 domain expertise.

L3 Engineering Owners
  • Primary system: System of Knowledge; secondary: System of Record
  • Domain experts who own the technology stack. Brought in only when deep expertise is required — a code change, hotfix, or architectural diagnosis. Post-resolution, they produce the technical RCA artifact that feeds back to the SRE layer. Frequent L3 pages are a signal that L1/L2 runbooks are overdue for an update.

Incident Commander (IC)
  • Primary system: System of Record; secondary: System of Action
  • The IC owns the process, not the fix. During a live incident, they hold the System of Record — managing severity classification, stakeholder communications, and escalation decisions. Their secondary touchpoint with the System of Action is supervisory: monitoring routing, confirming the right teams are engaged, and adjusting severity thresholds as the incident evolves. The IC role should be rostered, trained, and practiced before it is needed. Define your SEV1/SEV2 invocation thresholds in advance — the worst time to debate whether to invoke an IC is while the incident is active.

SREs and L3 engineers are anchored to the System of Knowledge because their value is in understanding signals. L1 and L2 responders are anchored to the System of Action because their value is in executing responses. The IC is the only role anchored to the System of Record during a live incident — because someone must maintain the authoritative account of what is happening, in real time, even while others are heads-down in the fix.
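The escalation triggers described above can be sketched as a simple decision function. The signal names and the flat boolean model are hypothetical simplifications; real triage involves judgment these flags cannot capture:

```python
# Minimal sketch of the escalation ladder's handoff criteria.
# Signal names are hypothetical simplifications of the triggers
# defined for L1 and L2 in the role table.

def next_tier(current: str, runbook_exhausted: bool = False,
              sla_breached: bool = False, customer_impact: bool = False,
              novel_or_code_level: bool = False) -> str:
    """Return the tier that should own the incident next."""
    if current == "L1" and (runbook_exhausted or sla_breached or customer_impact):
        return "L2"
    if current == "L2" and (novel_or_code_level or customer_impact):
        return "L3"
    return current  # no trigger fired: the incident stays where it is

print(next_tier("L1", runbook_exhausted=True))    # L2
print(next_tier("L2", novel_or_code_level=True))  # L3
print(next_tier("L1"))                            # L1
```

Note what the function deliberately cannot express: there is no path that skips a tier. Encoding the ladder this way is one guard against the L3 bypass failure mode.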

Glossary of Key Terms

RCA (Root Cause Analysis): A structured review conducted after an incident is resolved. The objective is to identify the underlying cause — whether in code, process, or infrastructure — and produce action items that prevent recurrence. Blame assignment is out of scope; system improvement is the only goal.

System of Record: The authoritative source of truth for all incident history and artifacts. This system prioritizes data integrity and auditability, and is the basis for SLA compliance reporting and executive-level incident reviews.

System of Action: The engine responsible for routing notifications, triggering workflows, and ensuring the right team is engaged at the right time. It prioritizes speed and connectivity, translating a detected signal into a coordinated human response.

System of Knowledge: The observability layer — New Relic in this architecture — that collects, correlates, and analyzes telemetry to detect anomalies and trigger alerts. It is the source of detection and the foundation on which alert conditions, SLOs, and AI-driven anomaly baselines are defined.

Incident Commander (IC): The individual responsible for overall coordination during a major incident. They manage severity classification, stakeholder communications, and escalation decisions — allowing technical responders to stay focused on the fix rather than fielding status requests.

Runbook: A documented set of procedures for diagnosing and resolving known issue patterns. L1 and L2 teams use runbooks to handle high-frequency, low-complexity events without escalating to L3. Runbooks should be treated as living documents, updated after every relevant RCA.

NOC (Network Operations Center): A centralized function — staffed in-person or remotely — where L1 administrators monitor and respond to infrastructure and network events. In this architecture, the NOC team is the primary operator of the System of Action's incident console.

Data Gravity: The tendency for applications and processes to consolidate around large volumes of data rather than pulling that data to a separate system. In observability, it argues for keeping detection, routing, and incident records on the same underlying data store — preserving full telemetry context through the entire incident lifecycle.

NRDB (New Relic Database): New Relic's time-series and event data store, capable of ingesting and querying billions of data points at sub-second response times. When New Relic serves as a unified platform, NRDB is the shared substrate that makes physical consolidation of the three systems possible without sacrificing query performance or data fidelity.
