In my work with enterprise customers, the same pattern surfaces regularly: all the infrastructure dashboards are green — CPU is low, memory is fine — and yet customer support tickets are piling up. I call this the Observability Gap. The monitoring is healthy; the service is not.
The disconnect happens because most organizations measure components in isolation rather than the relationships between them. A slow database query is invisible on a CPU chart. A degraded CDN edge is invisible on an application health dashboard. The only way to close the gap is to measure what the user experiences at every tier of the stack — and to express those measurements as formal commitments.
A Service Level Management (SLM) Reference Architecture provides that structure. It is a prescriptive map for defining the right Service Level Indicators (SLIs) and Service Level Objectives (SLOs) at each layer of your system — from the infrastructure data plane to the business metrics your executives track quarterly. The goal is simple: technical health should translate directly and visibly into business outcomes.
The sections that follow are drawn from the architecture I use in deep-dive workshops. Each layer includes concrete SLO examples you can use as starting baselines, along with the context for why each measurement matters.
Layer 1 — The Experience Layer: Measuring Perception
At the top of the stack, we measure what the user actually feels. A 200 OK response is not a user experience; it is a protocol handshake. True experience measurement requires tracking interactivity and visual stability under real-world conditions — including on mobile networks, on low-end devices, and at 3am when your real users are asleep.
Architect's Note: Synthetic monitoring fills the gaps that real-user monitoring cannot. Critical journeys — Login, Checkout, Account Creation — must be exercised continuously by synthetic tests even when traffic is low. Your SLAs do not pause overnight.
| SLI | Example Objective | Notes |
| --- | --- | --- |
| Interactivity (INP) | 90% of page loads have an Interaction to Next Paint (INP) < 200ms | INP replaced FID as the Core Web Vitals interactivity metric in March 2024. If your SLOs still reference FID, update them. |
| Visual Stability (CLS) | 95% of sessions have a Cumulative Layout Shift (CLS) score < 0.1 | CLS spikes are frequently caused by late-loading ads or fonts. Track by page template, not just site-wide. |
| Mobile Crash-Free Rate | 99.9% of mobile sessions are crash-free | Segment iOS and Android separately. Platform-specific regressions are common after OS version releases. |
| ANR Rate | App Not Responding events < 1% of sessions | ANRs are distinct from crashes — the app is unresponsive, not terminated. Both require their own SLOs. |
| Critical Journey Latency | Login and Checkout complete in < 2s for 95% of attempts | Define per-journey, not as a site average. A slow checkout matters more than a slow About page. |
| Rage Click Rate | Rage clicks detected in < 1% of sessions | A leading indicator of UX failure. Rising rage clicks often precede support ticket spikes by hours. |
| Ajax Error Rate | Clicks triggering Ajax errors < 1% of sessions | Filter to user-initiated interactions only. Background polling errors skew the metric. |
| Synthetic Journey Success | 99.9% of synthetic critical journey executions succeed | Run from multiple geographic locations. A localized CDN failure may only appear in regional synthetic results. |
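Most of the objectives above follow the same shape: the fraction of events that meet a "good" threshold, compared against a target. As a minimal sketch of that evaluation, here is a hypothetical helper; the INP sample values are illustrative, and in practice these measurements come from your real-user monitoring data.

```python
# Sketch: evaluating a ratio-style Experience Layer SLI.
# Sample values are illustrative, not real RUM data.

def sli_good_ratio(samples, threshold):
    """Fraction of measurements that meet the 'good' threshold."""
    good = sum(1 for v in samples if v < threshold)
    return good / len(samples)

# Hypothetical INP measurements in milliseconds.
inp_samples = [120, 95, 310, 180, 150, 600, 90, 175, 140, 220]

ratio = sli_good_ratio(inp_samples, threshold=200)
target = 0.90  # SLO: 90% of page loads have INP < 200ms

print(f"INP good ratio: {ratio:.0%} (target {target:.0%})")
print("SLO met" if ratio >= target else "SLO breached")
```

The same function works unchanged for crash-free rate, Ajax error rate, or synthetic journey success; only the samples and the threshold change.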
Layer 2 — The Gatekeeper Layer: Edge & Control
Before traffic reaches your backend, it passes through CDNs, load balancers, and DNS infrastructure. This is the first point where latency is introduced and the first point where failures are silent from the application's perspective. A slow edge is invisible to your APM tool, but it is never invisible to your users.
SLIs at this layer should focus on global latency, connection health, and cache efficiency. A well-tuned edge is also your primary cost control lever: every cache miss is a backend request you paid for twice.
| SLI | Example Objective | Notes |
| --- | --- | --- |
| Time to First Byte (TTFB) | 95% of CDN-served requests have a TTFB < 100ms | Measure at the CDN edge, not the origin. A low origin TTFB with a high edge TTFB indicates a caching or routing problem. |
| Cache Hit Ratio | Cache hit ratio remains above 85% | Dropping below this threshold puts direct backend load pressure on the origin and increases latency variability. |
| SSL Handshake Latency | 99% of SSL handshakes complete in < 50ms | Spikes here are often the first signal of certificate renewal issues or TLS configuration drift. |
| TCP Retransmission Rate | TCP retransmission rate < 1% for active sessions | Elevated retransmissions indicate network congestion or hardware degradation upstream of the application layer. |
| Packet Loss Rate | Packet loss across ingress interfaces < 0.1% | Even small packet loss forces retransmissions that compound latency, regardless of application performance. |
| Edge-to-Origin RTT | Round Trip Time from edge to origin < 50ms for 95% of requests | Ensures the mid-tier network path between CDN and backend is not the hidden bottleneck. |
| DNS Resolution Latency | 99% of DNS resolution requests complete in < 50ms | DNS failures are total failures. Monitor resolution time and NXDOMAIN error rates independently. |
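The cache hit ratio objective reduces to simple counter arithmetic, and it is worth making the cost framing explicit: every miss is an origin request. A minimal sketch with illustrative counter values; real hit/miss counts would come from your CDN's logs or metrics API.

```python
# Sketch: cache hit ratio SLI from edge counters.
# Counter values are illustrative.

def cache_hit_ratio(hits, misses):
    """Fraction of edge requests served from cache."""
    total = hits + misses
    return hits / total if total else 0.0

hits, misses = 870_000, 130_000
ratio = cache_hit_ratio(hits, misses)

print(f"Cache hit ratio: {ratio:.1%}")
print(f"Origin requests caused by misses: {misses}")
print("SLO met" if ratio >= 0.85 else "SLO breached: origin under load pressure")
```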
Layer 3 — The Service Domain: Business Logic
This is where your application runs — microservices, monoliths, or serverless functions. The common measurement mistake at this layer is tracking generic health metrics: average response time, overall error rate. Those numbers hide the failures that matter.
The discipline here is measuring critical business functions explicitly, not as a side effect of broad instrumentation. The goal is to turn general metrics, which are at best a point of interest, into signals that demand attention the moment they surface as a non-compliant SLO.
| SLI | Example Objective | Notes |
| --- | --- | --- |
| Service Error Rate (RED Method) | 99% of requests to [service] complete without error | Define per service, per endpoint class. Exclude health checks, readiness probes, and scheduled housekeeping tasks from the denominator. |
| Service Latency (RED Method) | P95 latency for [service] requests < [baseline]ms | Set the baseline from observed P95, not an arbitrary target. Alert on deviation from baseline, not breach of a fixed threshold. |
| Critical Method Success Rate | /process_order completes successfully 99.9% of the time | Instrument at the method level for business-critical paths. Service-level SLOs will not surface a degraded checkout within a healthy service. |
| External Dependency Success Rate | 99.9% success rate for calls to external Payment Gateway | You own your dependency SLOs even when you don't own the dependency. Track and report them separately from internal service SLOs. |
| Serverless Cold Start Latency | 90% of function invocations complete cold start in < 500ms | Cold starts are not random — they are predictable under traffic patterns. Track cold start rate (not just latency) as a capacity signal. |
| Core API Response Time | Search API returns results in < 2s for 95% of requests | High-value APIs warrant individual SLOs. Aggregating Search with Auth with Admin into one service SLO obscures the user-facing impact. |
Architect's Note: The RED Method — Rate, Errors, Duration — defines three distinct measurement dimensions. Error rate and latency require separate SLOs with separate error budgets. A service can be fast and broken, or slow and reliable. Conflating the two into a single objective makes the budget uninterpretable.
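The note above can be made concrete: evaluate the error-rate and latency objectives independently, so that a "fast and broken" service fails exactly one of them. A minimal sketch with hypothetical request records and thresholds:

```python
# Sketch: independent RED-style SLO checks. Request data is
# illustrative; a "fast and broken" service passes the latency
# objective while failing the error objective.

def evaluate_red(requests, error_slo=0.99, p95_budget_ms=300):
    total = len(requests)
    ok = [r for r in requests if not r["error"]]
    success_ratio = len(ok) / total

    latencies = sorted(r["latency_ms"] for r in requests)
    p95 = latencies[int(0.95 * (total - 1))]  # nearest-rank approximation

    return {
        "error_slo_met": success_ratio >= error_slo,
        "latency_slo_met": p95 <= p95_budget_ms,
        "success_ratio": success_ratio,
        "p95_ms": p95,
    }

# Fast and broken: every request returns in 40ms, but 1 in 10 errors.
requests = [{"latency_ms": 40, "error": i % 10 == 0} for i in range(100)]
print(evaluate_red(requests))  # latency objective met, error objective breached
```

A single blended objective would average these two failure modes together; keeping them separate is what makes each error budget interpretable.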
Layer 4 — The Foundation: Infrastructure & Data Plane
Infrastructure SLOs serve one purpose: confirming that the platform has the capacity and health to serve the application running on it. The goal is not to monitor CPU for its own sake — it is to detect saturation and replication degradation before they manifest as service-layer failures.
Intent-based SLIs at this layer ask "can this platform serve the load it's receiving?" rather than "is this server busy?"
| SLI | Example Objective | Notes |
| --- | --- | --- |
| Pod Scheduling Latency (Kubernetes) | 99% of pods enter Running state within 10 seconds of scheduling | Scheduling latency is an early indicator of cluster resource pressure. Alert before the queue backs up. |
| Replication Lag | Read replica lag < 1 second 99% of the time | Replica lag above threshold means read traffic may receive stale data. This is a data correctness risk, not just a performance risk. |
| Disk IOPS Saturation | Disk IOPS utilization < 85% | IOPS saturation causes non-linear latency spikes. 85% is not a comfortable threshold; it is a warning line. |
| Node Memory Pressure | < 1% of Kubernetes nodes in memory pressure state at any time | Node memory pressure triggers pod evictions, which surface as service restarts. Track the infrastructure signal, not just the application symptom. |
| Database Connection Pool Utilization | Connection pool utilization < 80% for 99% of 5-minute intervals | Pool exhaustion causes application timeouts. This metric should be tracked per service, not just per database instance. |
Layer 5 — Business Outcomes
Business outcome metrics are the ultimate test of the architecture. If Layers 1 through 4 are genuinely healthy and these metrics are degraded, the problem is a product or commercial issue, not a platform issue. But if these metrics degrade while the technical layers only appear healthy, you have an instrumentation gap: something in the stack is failing silently.
These measures require direct input from Product Managers, Business Analysts, and in some cases executive stakeholders. Most digital businesses already track them on a monthly or quarterly cadence. Observability tooling surfaces them in real time, which changes the character of the conversation from retrospective review to operational response.
| SLI | Example Objective | Notes |
| --- | --- | --- |
| Order Velocity | Order volume does not drop > 20% below the hour-of-day baseline | Use time-of-day baselines, not absolute thresholds. A 20% drop at 2pm is very different from a 20% drop at 2am. |
| Conversion Rate | Cart-to-order conversion rate stays within 1 standard deviation of the 7-day moving average | Absolute conversion targets vary enormously by industry. Deviation from your own baseline is a more reliable signal than any benchmark. |
| Cart Abandonment Rate | Cart abandonment rate does not exceed the 7-day rolling average by more than 10% | Industry average abandonment is ~70%. An absolute SLO at that level is not protective. Define relative to your own baseline. |
| Average Cart Value | Average daily cart value within 1 standard deviation of seasonal average | Seasonal adjustment is required. A pre-holiday drop in cart value is a signal; a post-holiday drop is expected. |
| Revenue Velocity | Hourly revenue does not fall > 15% below the day-of-week baseline | Requires integration with transaction data. Increasingly achievable via New Relic custom events or APM transaction attributes. |
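Several of the objectives above are deviation-from-baseline SLIs rather than fixed thresholds. A minimal sketch of that pattern, using hypothetical daily conversion rates; the seven-day history stands in for whatever baseline window your business analysts choose.

```python
# Sketch: deviation-from-baseline SLI, as used for the
# conversion-rate objective. Daily rates are illustrative.
from statistics import mean, stdev

def within_baseline(history, today, n_sigma=1.0):
    """Is today's value within n_sigma standard deviations
    of the trailing baseline?"""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) <= n_sigma * sigma, mu, sigma

# Seven days of cart-to-order conversion rates (hypothetical).
last_7_days = [0.031, 0.029, 0.033, 0.030, 0.032, 0.028, 0.031]

ok, mu, sigma = within_baseline(last_7_days, today=0.022)
status = "ok" if ok else "ALERT: outside baseline"
print(f"baseline {mu:.4f} +/- {sigma:.4f}, today 0.0220 -> {status}")
```

The same comparison, with a seasonal or day-of-week baseline substituted for the moving average, covers the cart value and revenue velocity objectives as well.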
The Human Element: Ownership and the Service Level Control Plane
An SLO without an owner is a metric. Ownership is what makes it a commitment. The following table maps each role to the specific dimension of SLM they are accountable for, their secondary touchpoints in the process, and the failure mode that emerges when the role is absent or unfilled.
| Role | Owns (SLM) | Secondary Touchpoint | Responsibilities & Failure Mode if Absent |
| --- | --- | --- | --- |
| Product Manager | SLO Target Definition | Business Outcomes Layer | Defines what "good enough" means for the customer. Sets the target threshold — the 2s for Search, the 99.9% for Checkout — based on business context and user research. Failure mode if absent: SLOs default to engineering intuition. Targets may be technically convenient rather than user-meaningful, and the business has no formal stake in reliability outcomes. |
| SRE / DevOps | SLI Implementation & Error Budget | Service & Infrastructure Layers | Translates PM-defined targets into instrumented SLIs. Owns the error budget calculation and is responsible for alerting on burn rate, not just threshold breach. Manages SLO configuration in New Relic. Failure mode if absent: Targets exist on paper but are not measured. There is no early warning before SLO breach, and no error budget to negotiate reliability-versus-velocity trade-offs. |
| Platform & Application Engineers | Telemetry Coverage | Experience & Gatekeeper Layers | Subject matter experts on what telemetry is available, where it lives, and what it means. Guide the SRE to the right data sources for each SLI — particularly at the edge, mobile, and infrastructure layers where instrumentation is non-standard. Failure mode if absent: SLIs are built on the telemetry that is easy to find, not the telemetry that is most representative. Blind spots accumulate at the layers farthest from standard APM coverage. |
| Business Analyst / Finance | Business Outcomes Layer | SLO Review Cadence | Connects observability data to business reporting. Validates that the business outcome SLIs reflect the metrics used in board and executive reporting. Drives the periodic review that keeps SLO targets aligned with commercial context. Failure mode if absent: The business outcomes layer becomes an engineering estimate of business health rather than a verified signal. Reliability investments cannot be tied to revenue impact. |
The Error Budget: The Negotiating Currency of Reliability
The error budget is the quantified gap between perfection and your SLO target. If your SLO is 99.9% availability over a 30-day window, your error budget is 43.2 minutes of allowable downtime in that period. It is not the error rate; it is the remaining allowance.
The error budget matters because it converts a binary "are we meeting SLO?" question into a continuous management signal. An SRE who can see that 60% of the monthly error budget was consumed in the last 72 hours has actionable information — the pace of consumption, the rate of burn, and a clear trigger for changing behavior before the SLO is breached.
When the budget is healthy, engineering teams have license to ship faster. When the budget is depleted, the correct response is a reliability sprint, not an argument about whether the SLO was a good target. The budget makes the trade-off explicit and removes it from the realm of opinion.
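The arithmetic behind the budget and its burn rate is straightforward. A minimal sketch for a 99.9% SLO over a 30-day window; the consumed-downtime and elapsed-days figures are illustrative.

```python
# Sketch: error budget and burn-rate arithmetic.
# Consumed downtime and elapsed days are illustrative.

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def error_budget_minutes(slo, window_minutes=WINDOW_MINUTES):
    """Allowable downtime implied by an availability SLO."""
    return (1 - slo) * window_minutes

def burn_rate(consumed_min, budget_min, elapsed_days, window_days=30):
    """Consumption pace relative to a uniform spend over the window.
    1.0 means on pace to exhaust the budget exactly at window end."""
    return (consumed_min / budget_min) / (elapsed_days / window_days)

budget = error_budget_minutes(0.999)  # 43.2 minutes
consumed = 26.0                       # downtime so far (illustrative)

print(f"Budget: {budget:.1f} min; consumed: {consumed / budget:.0%}")
print(f"Burn rate over last 3 days: {burn_rate(consumed, budget, 3):.1f}x")
```

A burn rate of 6x sustained for three days is exactly the scenario described above: the SLO has not yet been breached, but the pace of consumption is the actionable signal.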
Closing: From Green Dashboards to Real Accountability
The layered architecture described in this post exists to answer one question that a dashboard full of green tiles cannot: is the business actually working?
The SLO examples in each section are starting points, not prescriptions. The right targets for your organization depend on your user expectations, your architecture, and the commercial context your business analysts and product managers bring to the table. What matters is that targets exist, are owned, and are connected across the stack.
Green dashboards tell you that components are running. A well-structured SLM architecture tells you whether the digital business is healthy.
Glossary of Key Terms
SLI (Service Level Indicator)
A specific, quantitative measurement of service behavior — for example, the percentage of requests that complete within 300ms, or the percentage of sessions that are crash-free. An SLI is the raw metric; it has no target attached to it.
SLO (Service Level Objective)
A target applied to an SLI over a defined time window. An SLO states: "this SLI must meet or exceed this threshold for this percentage of the measurement period." SLOs are internal commitments. They should be set more conservatively than any external SLAs to preserve an operational buffer.
SLA (Service Level Agreement)
A contractual commitment to a customer or business stakeholder, typically with financial or legal consequences for breach. SLAs should be derived from SLOs, not the reverse. If your SLA equals your SLO, you have no room to detect degradation before a contract violation.
Error Budget
The quantified allowance for failure implied by an SLO. If an SLO requires 99.9% availability over 30 days, the error budget is 43.2 minutes of allowable downtime. The error budget is not the error rate — it is the remaining capacity for failure before the SLO is breached. Teams track error budget burn rate as an early warning signal.
RED Method
A framework for instrumenting microservices across three dimensions: Rate (requests per second), Errors (failed requests as a percentage of total), and Duration (latency distribution, typically P50/P95/P99). Each dimension requires its own SLO and its own error budget. Combining error rate and latency into a single SLO obscures both.
Core Web Vitals
Google's standardized set of user experience metrics: Largest Contentful Paint (LCP) for load performance, Interaction to Next Paint (INP) for responsiveness, and Cumulative Layout Shift (CLS) for visual stability. INP replaced First Input Delay (FID) as the interactivity metric in March 2024. Core Web Vitals inform search ranking and are the recommended baseline for Experience Layer SLIs.
Observability Gap
The condition in which infrastructure monitoring reports healthy while user-facing services are degraded. Typically caused by measuring component health (CPU, memory) rather than service behavior (latency, error rate, conversion). SLM is the discipline that closes the gap by connecting technical signals to user and business outcomes.
Service Level Control Plane
The organizational layer — roles, processes, and review cadences — that governs SLM. Technology implements the measurement; the control plane ensures that targets are set with business input, owned by named individuals, reviewed on a regular cadence, and revised when architecture or user expectations change.
The opinions expressed in this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub ( discus.newrelic.com ) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.