In my work with enterprise customers, the same pattern surfaces regularly: all the infrastructure dashboards are green — CPU is low, memory is fine — and yet customer support tickets are piling up. I call this the Observability Gap. The monitoring is healthy; the service is not.

The disconnect happens because most organizations measure components in isolation rather than the relationships between them. A slow database query is invisible on a CPU chart. A degraded CDN edge is invisible on an application health dashboard. The only way to close the gap is to measure what the user experiences at every tier of the stack — and to express those measurements as formal commitments.

A Service Level Management (SLM) Reference Architecture provides that structure. It is a prescriptive map for defining the right Service Level Indicators (SLIs) and Service Level Objectives (SLOs) at each layer of your system — from the infrastructure data plane to the business metrics your executives track quarterly. The goal is simple: technical health should translate directly and visibly into business outcomes.

The sections that follow are drawn from the architecture I use in deep-dive workshops. Each layer includes concrete SLO examples you can use as starting baselines, along with the context for why each measurement matters.

Layer 1 — The Experience Layer: Measuring Perception

At the top of the stack, we measure what the user actually feels. A 200 OK response is not a user experience; it is a protocol handshake. True experience measurement requires tracking interactivity and visual stability under real-world conditions — including on mobile networks, on low-end devices, and at 3am when your real users are asleep.

Architect's Note: Synthetic monitoring fills the gaps that real-user monitoring cannot cover. Critical journeys — Login, Checkout, Account Creation — must be exercised continuously by synthetic tests even when traffic is low. Your SLAs do not pause overnight.

| SLI | Example Objective | Notes |
|---|---|---|
| Interactivity (INP) | 90% of page loads have an Interaction to Next Paint (INP) < 200ms | INP replaced FID as the Core Web Vitals interactivity metric in March 2024. If your SLOs still reference FID, update them. |
| Visual Stability (CLS) | 95% of sessions have a Cumulative Layout Shift (CLS) score < 0.1 | CLS spikes are frequently caused by late-loading ads or fonts. Track by page template, not just site-wide. |
| Mobile Crash-Free Rate | 99.9% of mobile sessions are crash-free | Segment iOS and Android separately. Platform-specific regressions are common after OS version releases. |
| ANR Rate | App Not Responding events < 1% of sessions | ANRs are distinct from crashes — the app is unresponsive, not terminated. Both require their own SLOs. |
| Critical Journey Latency | Login and Checkout complete in < 2s for 95% of attempts | Define per-journey, not as a site average. A slow checkout matters more than a slow About page. |
| Rage Click Rate | Rage clicks detected in < 1% of sessions | A leading indicator of UX failure. Rising rage clicks often precede support ticket spikes by hours. |
| Ajax Error Rate | Clicks triggering Ajax errors < 1% of sessions | Filter to user-initiated interactions only. Background polling errors skew the metric. |
| Synthetic Journey Success | 99.9% of synthetic critical journey executions succeed | Run from multiple geographic locations. A localized CDN failure may only appear in regional synthetic results. |
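To make the latency objectives above concrete, here is a minimal sketch of how a critical-journey SLI could be computed from raw attempt durations. The function and sample data are illustrative, not drawn from any particular monitoring product:

```python
def journey_latency_sli(durations_ms: list[float], threshold_ms: float = 2000.0) -> float:
    """Fraction of journey attempts that completed under the latency threshold."""
    if not durations_ms:
        return 1.0  # no attempts in the window: treat as compliant (a judgment call)
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return good / len(durations_ms)

# Illustrative: 100 checkout attempts, 4 of them slower than 2 seconds.
samples = [800.0] * 96 + [3500.0] * 4
sli = journey_latency_sli(samples)
slo_target = 0.95  # "complete in < 2s for 95% of attempts"
print(f"SLI = {sli:.3f}, compliant = {sli >= slo_target}")  # SLI = 0.960, compliant = True
```

The same function runs per journey — one evaluation for Login, another for Checkout — which is exactly the per-journey discipline the table calls for.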

Layer 2 — The Gatekeeper Layer: Edge & Control

Before traffic reaches your backend, it passes through CDNs, load balancers, and DNS infrastructure. This is the first point where latency is introduced and the first point where failures are silent from the application's perspective. A slow edge is invisible to your APM tool and invisible to your users — until it isn't.

SLIs at this layer should focus on global latency, connection health, and cache efficiency. A well-tuned edge is also your primary cost control lever: every cache miss is a backend request you paid for twice.

| SLI | Example Objective | Notes |
|---|---|---|
| Time to First Byte (TTFB) | 95% of CDN-served requests have a TTFB < 100ms | Measure at the CDN edge, not the origin. A low origin TTFB with a high edge TTFB indicates a caching or routing problem. |
| Cache Hit Ratio | Cache hit ratio remains above 85% | Dropping below this threshold puts direct backend load pressure on the origin and increases latency variability. |
| SSL Handshake Latency | 99% of SSL handshakes complete in < 50ms | Spikes here are often the first signal of certificate renewal issues or TLS configuration drift. |
| TCP Retransmission Rate | TCP retransmission rate < 1% for active sessions | Elevated retransmissions indicate network congestion or hardware degradation upstream of the application layer. |
| Packet Loss Rate | Packet loss across ingress interfaces < 0.1% | Even small packet loss forces retransmissions that compound latency, regardless of application performance. |
| Edge-to-Origin RTT | Round Trip Time from edge to origin < 50ms for 95% of requests | Ensures the mid-tier network path between CDN and backend is not the hidden bottleneck. |
| DNS Resolution Latency | 99% of DNS resolution requests complete in < 50ms | DNS failures are total failures. Monitor resolution time and NXDOMAIN error rates independently. |
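As a sketch of the cache efficiency check, the snippet below evaluates a hit ratio against the 85% objective. The counters would come from your CDN's stats or log export; the function itself is hypothetical:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Cache hit ratio as a fraction of all edge requests."""
    total = hits + misses
    return hits / total if total else 1.0  # empty window: nothing to penalize

# Illustrative counters for one evaluation window.
ratio = cache_hit_ratio(hits=915_000, misses=85_000)
print(f"hit ratio = {ratio:.1%}")  # 91.5%
if ratio < 0.85:
    print("SLO breach: every miss is a backend request you paid for twice")
```

Tracking the ratio per cache rule or path prefix, rather than as one global number, makes it far easier to see which content class is eroding the budget.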

Layer 3 — The Service Domain: Business Logic

This is where your application runs — microservices, monoliths, or serverless functions. The common measurement mistake at this layer is tracking generic health metrics: average response time, overall error rate. Those numbers hide the failures that matter.

The discipline here is measuring critical business functions explicitly, not as a side effect of broad instrumentation. The goal is to convert general metrics, which are at best a point of interest, into signals that demand attention the moment an SLO becomes non-compliant.

| SLI | Example Objective | Notes |
|---|---|---|
| Service Error Rate (RED Method) | 99% of requests to [service] complete without error | Define per service, per endpoint class. Exclude health checks, readiness probes, and scheduled housekeeping tasks from the denominator. |
| Service Latency (RED Method) | P95 latency for [service] requests < [baseline]ms | Set the baseline from observed P95, not an arbitrary target. Alert on deviation from baseline, not breach of a fixed threshold. |
| Critical Method Success Rate | /process_order completes successfully 99.9% of the time | Instrument at the method level for business-critical paths. Service-level SLOs will not surface a degraded checkout within a healthy service. |
| External Dependency Success Rate | 99.9% success rate for calls to external Payment Gateway | You own your dependency SLOs even when you don't own the dependency. Track and report them separately from internal service SLOs. |
| Serverless Cold Start Latency | 90% of function invocations complete cold start in < 500ms | Cold starts are not random — they are predictable under traffic patterns. Track cold start rate (not just latency) as a capacity signal. |
| Core API Response Time | Search API returns results in < 2s for 95% of requests | High-value APIs warrant individual SLOs. Aggregating Search with Auth with Admin into one service SLO obscures the user-facing impact. |

Architect's Note: The RED Method — Rate, Errors, Duration — defines three distinct measurement dimensions. Error rate and latency require separate SLOs with separate error budgets. A service can be fast and broken, or slow and reliable. Conflating the two into a single objective makes the budget uninterpretable.
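To show why the two dimensions need separate objectives, here is a small illustrative sketch that computes the Errors and Duration SLIs independently from the same request records. The record shape is an assumption for the example, not a standard format:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    ok: bool

def error_sli(reqs: list[Request]) -> float:
    """Errors dimension: fraction of requests that succeeded."""
    return sum(r.ok for r in reqs) / len(reqs)

def duration_p95(reqs: list[Request]) -> float:
    """Duration dimension: P95 latency, computed independently of success."""
    latencies = sorted(r.duration_ms for r in reqs)
    idx = max(0, int(0.95 * len(latencies)) - 1)
    return latencies[idx]

# A service can be fast and broken: every request errors, each in 5ms.
fast_broken = [Request(duration_ms=5.0, ok=False) for _ in range(100)]
print(error_sli(fast_broken))      # 0.0 -> error-rate SLO breached
print(duration_p95(fast_broken))   # 5.0 -> latency SLO looks perfectly healthy
```

A single blended objective would average these two signals into something uninterpretable; two SLOs, two error budgets, two clear answers.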

Layer 4 — The Foundation: Infrastructure & Data Plane

Infrastructure SLOs serve one purpose: confirming that the platform has the capacity and health to serve the application running on it. The goal is not to monitor CPU for its own sake — it is to detect saturation and replication degradation before they manifest as service-layer failures.

Intent-based SLIs at this layer ask "can this platform serve the load it's receiving?" rather than "is this server busy?"

| SLI | Example Objective | Notes |
|---|---|---|
| Pod Scheduling Latency (Kubernetes) | 99% of pods enter Running state within 10 seconds of scheduling | Scheduling latency is an early indicator of cluster resource pressure. Alert before the queue backs up. |
| Replication Lag | Read replica lag < 1 second 99% of the time | Replica lag above threshold means read traffic may receive stale data. This is a data correctness risk, not just a performance risk. |
| Disk IOPS Saturation | Disk IOPS utilization < 85% | IOPS saturation causes non-linear latency spikes. 85% is not a comfortable threshold; it is a warning line. |
| Node Memory Pressure | < 1% of Kubernetes nodes in memory pressure state at any time | Node memory pressure triggers pod evictions, which surface as service restarts. Track the infrastructure signal, not just the application symptom. |
| Database Connection Pool Utilization | Connection pool utilization < 80% for 99% of 5-minute intervals | Pool exhaustion causes application timeouts. This metric should be tracked per service, not just per database instance. |
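The connection pool objective above ("< 80% for 99% of 5-minute intervals") is an interval-compliance SLO. A rough sketch of how it might be evaluated, assuming one utilization value per 5-minute interval; the sample numbers are invented:

```python
def interval_compliance(interval_utilizations: list[float], limit: float = 0.80) -> float:
    """Fraction of 5-minute intervals whose pool utilization stayed under the limit."""
    good = sum(1 for u in interval_utilizations if u < limit)
    return good / len(interval_utilizations)

# One day = 288 five-minute intervals; suppose 3 of them exceeded 80%.
day = [0.55] * 285 + [0.92] * 3
compliance = interval_compliance(day)
print(f"compliance = {compliance:.4f}, SLO met = {compliance >= 0.99}")
# compliance = 0.9896 -> just three bad intervals in a day already breaches a 99% objective
```

The tightness of this math is the point: interval-based objectives surface short saturation bursts that a daily average would completely absorb.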

Layer 5 — Business Outcomes

Business outcome metrics are the ultimate test of the architecture. If Layers 1 through 4 are healthy and these metrics are degraded, the problem is a product or commercial issue, not a platform issue. If these metrics degrade while technical layers appear healthy, you have an instrumentation gap — something in the stack is failing silently.

These measures require direct input from Product Managers, Business Analysts, and in some cases executive stakeholders. Most digital businesses already track them on a monthly or quarterly cadence. Observability tooling surfaces them in real time, which changes the character of the conversation from retrospective review to operational response.

| SLI | Example Objective | Notes |
|---|---|---|
| Order Velocity | Order volume does not drop > 20% below the hour-of-day baseline | Use time-of-day baselines, not absolute thresholds. A 20% drop at 2pm is very different from a 20% drop at 2am. |
| Conversion Rate | Cart-to-order conversion rate stays within 1 standard deviation of the 7-day moving average | Absolute conversion targets vary enormously by industry. Deviation from your own baseline is a more reliable signal than any benchmark. |
| Cart Abandonment Rate | Cart abandonment rate does not exceed the 7-day rolling average by more than 10% | Industry average abandonment is ~70%. An absolute SLO at that level is not protective. Define relative to your own baseline. |
| Average Cart Value | Average daily cart value within 1 standard deviation of seasonal average | Seasonal adjustment is required. A pre-holiday drop in cart value is a signal; a post-holiday drop is expected. |
| Revenue Velocity | Hourly revenue does not fall > 15% below the day-of-week baseline | Requires integration with transaction data. Increasingly achievable via New Relic custom events or APM transaction attributes. |
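The Order Velocity objective can be sketched as a simple baseline-relative check. The baseline lookup and the sample numbers below are illustrative; in practice the baseline would come from historical hour-of-day aggregates:

```python
def order_velocity_breach(current_orders: int, baseline_for_hour: float,
                          max_drop: float = 0.20) -> bool:
    """True if order volume fell more than max_drop below the hour-of-day baseline."""
    if baseline_for_hour <= 0:
        return False  # no baseline yet: a drop cannot be judged
    drop = (baseline_for_hour - current_orders) / baseline_for_hour
    return drop > max_drop

# 2pm baseline is 1,000 orders/hour; we observed 760 this hour.
print(order_velocity_breach(760, 1000.0))  # True: a 24% drop against the 2pm baseline
# The same absolute count against a 2am baseline of 300 is growth, not a breach.
print(order_velocity_breach(760, 300.0))   # False
```

The two calls with identical order counts but opposite verdicts are exactly why the table insists on time-of-day baselines rather than absolute thresholds.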

The Human Element: Ownership and the Service Level Control Plane

An SLO without an owner is a metric. Ownership is what makes it a commitment. The following table maps each role to the specific dimension of SLM they are accountable for, their secondary touchpoints in the process, and the failure mode that emerges when the role is absent or unfilled.

| Role | Owns (SLM) | Secondary Touchpoint | Responsibilities & Failure Mode if Absent |
|---|---|---|---|
| Product Manager | SLO Target Definition | Business Outcomes Layer | Defines what "good enough" means for the customer. Sets the target threshold — the 2s for Search, the 99.9% for Checkout — based on business context and user research. Failure mode if absent: SLOs default to engineering intuition. Targets may be technically convenient rather than user-meaningful, and the business has no formal stake in reliability outcomes. |
| SRE / DevOps | SLI Implementation & Error Budget | Service & Infrastructure Layers | Translates PM-defined targets into instrumented SLIs. Owns the error budget calculation and is responsible for alerting on burn rate, not just threshold breach. Manages SLO configuration in New Relic. Failure mode if absent: Targets exist on paper but are not measured. There is no early warning before SLO breach, and no error budget to negotiate reliability-versus-velocity trade-offs. |
| Platform & Application Engineers | Telemetry Coverage | Experience & Gatekeeper Layers | Subject matter experts on what telemetry is available, where it lives, and what it means. Guide the SRE to the right data sources for each SLI — particularly at the edge, mobile, and infrastructure layers where instrumentation is non-standard. Failure mode if absent: SLIs are built on the telemetry that is easy to find, not the telemetry that is most representative. Blind spots accumulate at the layers farthest from standard APM coverage. |
| Business Analyst / Finance | Business Outcomes Layer | SLO Review Cadence | Connects observability data to business reporting. Validates that the business outcome SLIs reflect the metrics used in board and executive reporting. Drives the periodic review that keeps SLO targets aligned with commercial context. Failure mode if absent: The business outcomes layer becomes an engineering estimate of business health rather than a verified signal. Reliability investments cannot be tied to revenue impact. |

The Error Budget: The Negotiating Currency of Reliability

The error budget is the quantified gap between perfection and your SLO target. If your SLO is 99.9% availability over a 30-day window, your error budget is 43.2 minutes of allowable downtime in that period. It is not the error rate; it is the remaining allowance.

The error budget matters because it converts a binary "are we meeting SLO?" question into a continuous management signal. An SRE who can see that 60% of the monthly error budget was consumed in the last 72 hours has actionable information — the pace of consumption, the rate of burn, and a clear trigger for changing behaviour before the SLO is breached.

When the budget is healthy, engineering teams have license to ship faster. When the budget is depleted, the correct response is a reliability sprint, not an argument about whether the SLO was a good target. The budget makes the trade-off explicit and removes it from the realm of opinion.
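The arithmetic is worth seeing in code. Below is a minimal sketch of the budget and burn-rate calculations for the 99.9%/30-day example above; the window fractions are illustrative:

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window = 43,200 minutes

def error_budget_minutes(slo_target: float) -> float:
    """Total allowable downtime in the window for a given availability SLO."""
    return WINDOW_MINUTES * (1.0 - slo_target)

def burn_rate(budget_consumed_fraction: float, window_elapsed_fraction: float) -> float:
    """How fast the budget is burning relative to an even spend; > 1.0 is too fast."""
    return budget_consumed_fraction / window_elapsed_fraction

budget = error_budget_minutes(0.999)
print(f"99.9% over 30 days -> {budget:.1f} minutes of budget")  # 43.2 minutes

# 60% of the budget consumed in the last 72 hours, i.e. 10% of the window:
rate = burn_rate(0.60, 72 / (30 * 24))
print(f"burn rate = {rate:.1f}x")  # 6.0x: on this pace, breach arrives long before month end
```

A burn rate of 1.0x means the budget will be exactly exhausted at the end of the window; multi-window burn-rate alerts (fast burn over hours, slow burn over days) are the standard way to turn this into paging policy.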

Closing: From Green Dashboards to Real Accountability

The layered architecture described in this post exists to answer one question that a dashboard full of green tiles cannot: is the business actually working?

The SLO examples in each section are starting points, not prescriptions. The right targets for your organization depend on your user expectations, your architecture, and the commercial context your business analysts and product managers bring to the table. What matters is that targets exist, are owned, and are connected across the stack.

Green dashboards tell you that components are running. A well-structured SLM architecture tells you whether the digital business is healthy.

Glossary of Key Terms

SLI (Service Level Indicator)

A specific, quantitative measurement of service behavior — for example, the percentage of requests that complete within 300ms, or the percentage of sessions that are crash-free. An SLI is the raw metric; it has no target attached to it.

SLO (Service Level Objective)

A target applied to an SLI over a defined time window. An SLO states: "this SLI must meet or exceed this threshold for this percentage of the measurement period." SLOs are internal commitments. They should be set more conservatively than any external SLAs to preserve an operational buffer.

SLA (Service Level Agreement)

A contractual commitment to a customer or business stakeholder, typically with financial or legal consequences for breach. SLAs should be derived from SLOs, not the reverse. If your SLA equals your SLO, you have no room to detect degradation before a contract violation.

Error Budget

The quantified allowance for failure implied by an SLO. If an SLO requires 99.9% availability over a 30-day window, the error budget is 43.2 minutes of allowable downtime. The error budget is not the error rate — it is the remaining capacity for failure before the SLO is breached. Teams track error budget burn rate as an early warning signal.

RED Method

A framework for instrumenting microservices across three dimensions: Rate (requests per second), Errors (failed requests as a percentage of total), and Duration (latency distribution, typically P50/P95/P99). Each dimension requires its own SLO and its own error budget. Combining error rate and latency into a single SLO obscures both.

Core Web Vitals

Google's standardized set of user experience metrics: Largest Contentful Paint (LCP) for load performance, Interaction to Next Paint (INP) for responsiveness, and Cumulative Layout Shift (CLS) for visual stability. INP replaced First Input Delay (FID) as the interactivity metric in March 2024. Core Web Vitals inform search ranking and are the recommended baseline for Experience Layer SLIs.

Observability Gap

The condition in which infrastructure monitoring reports healthy while user-facing services are degraded. Typically caused by measuring component health (CPU, memory) rather than service behavior (latency, error rate, conversion). SLM is the discipline that closes the gap by connecting technical signals to user and business outcomes.

Service Level Control Plane

The organizational layer — roles, processes, and review cadences — that governs SLM. Technology implements the measurement; the control plane ensures that targets are set with business input, owned by named individuals, reviewed on a regular cadence, and revised when architecture or user expectations change.
