At New Relic, defining and setting service level indicators (SLIs) and service level objectives (SLOs) is an increasingly important aspect of our site reliability engineering (SRE) practice. It’s not news that SLIs and SLOs are an important part of high-functioning reliability practices, but planning how to apply them within the context of a real-world, complex modern software architecture can be challenging, especially figuring out what to measure and how to measure it.
In this blog post, we’ll use a highly simplified version of the New Relic architecture to walk you through some concrete, practical examples of how our reliability team defines and measures SLIs and SLOs for our own modern software platform.
How we define SLI and SLO
It’s easy to get lost in a fog of acronyms, so before we dig in, here is a quick and easy definition of how service level indicators (SLIs) and service level objectives (SLOs) are related to service level agreements (SLAs):
|SLI|SLO|SLA|
|---|---|---|
|X should be true...|Y portion of the time,|or else.|
When we apply this definition to availability, here are examples:
- SLIs are the key measurements to determine the availability of a system.
- SLOs are goals we set for how much availability we expect out of a system.
- SLAs are the legal contracts that explain what happens if the system doesn’t meet its SLO.
SLIs exist to help engineering teams make better decisions. Your SLO performance is critical information to have when you're making decisions about how hard and fast you can push your systems. SLOs are also important data points for other engineers when they're making assumptions about their dependencies on your service or system. Lastly, your larger organization should use your SLIs and SLOs to make informed decisions about investment levels and about balancing reliability work against engineering velocity.
Set SLIs and SLOs against system boundaries
When we look at the internals of a modern software platform, the level of complexity can be daunting (to say the least). Platforms often comprise hundreds, if not thousands, of unique components, including databases, service nodes, load balancers, message queues, and so on. Establishing customer-facing SLIs and SLOs such as general availability or uptime for each component may not be feasible.
That’s why we recommend focusing on SLIs and SLOs at system boundaries, rather than for individual components. Platforms tend to have far fewer system boundaries than individual components, and SLI/SLO data taken from system boundaries is also more valuable. This data is useful to the engineers maintaining the system, to the customers of the system, and to business decision-makers.
A system boundary is a point at which one or more components expose functionality to external customers. For example, in the New Relic platform, we have a login service that represents the functionality for a user to authenticate a set of credentials using an API.
It’s likely the login service has several internal components—service nodes, a database, and a read-only database replica. But these internal components don't represent system boundaries because we're not exposing them to the customer. Instead, this group of components acts in concert to expose the capabilities of the login service.
Using this idea of system boundaries, we can think of our simplified New Relic example as a set of logical units (or tiers)—a UI system, a service tier (which includes the login service), two separate data systems, and an ingest system—rather than as a tangle of individual components. And, of course, we have one more system boundary, which is the boundary between all of these services as a whole and our customers.
Focusing on system boundary SLIs lets us capture the value of these critical system measurements, allowing us to significantly simplify the measurements we need to implement.
Establish a baseline for service boundaries with one click
Deciding where to start defining your service boundaries and what “reliability” looks like for your team, system, and customers is a daunting task. Service level management in New Relic offers an easy solution, because you can find your baseline of reliability and then customize SLIs and SLOs from there.
For example, New Relic identifies the most common SLIs for a given service, most often some measurement of availability and latency, and scans the historical data from a service to determine the best initial setup. Across the platform, you’ll find ways to automatically set up SLIs, like we’ll show here, or you can manually create them with NRQL queries.
If you’re looking for a one-click setup to establish a baseline for SLIs and SLOs in New Relic, just follow these steps:
- Log in to New Relic and select APM from the navigation menu at the top.
- Select the service entity where you’d like to establish SLIs.
- Then, on the left-hand menu, scroll down to Reports and select the Service Levels option.
- You should now see the Service Levels screen for that service.
From here, you can simply select the Add baseline service level objectives button and let New Relic define your SLIs and SLOs. Don’t forget to check out the service level management documentation to see how New Relic service levels work and find more ways to customize your SLIs and SLOs.
SLI + SLO, a simple recipe
You can apply the concepts of SLI, SLO, and system boundaries to the different components that make up your modern platform. And although the specifics of how to apply those concepts will vary based on the type of component, at New Relic we use the same general recipe in each case:
- Identify the system boundaries within our platform.
- Identify the customer-facing capabilities that exist at each system boundary.
- Articulate a plain-language definition of what it means for each service or function to be available.
- Define one or more SLIs for that definition.
- Start measuring to get a baseline.
- Define an SLO for each metric and track how we perform against it.
- Iterate and refine our system, and fine tune the SLOs over time.
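As a rough illustration of the recipe above, each capability at a system boundary can be written down as a small record that pairs the plain-language definition with its SLI and SLO. This is only a sketch: the `ServiceLevel` class, its field names, and the sample values are hypothetical and are not part of any New Relic API. The sample record borrows the ingest example discussed later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    """One customer-facing capability at a system boundary (hypothetical record)."""

    boundary: str       # logical tier, for example "ingest"
    capability: str     # what the boundary exposes to its customers
    definition: str     # plain-language meaning of "available"
    sli: str            # the measurement that stands in for that definition
    slo_target: float   # fraction of the time the SLI must hold, e.g. 0.999

    def is_met(self, measured: float) -> bool:
        """Return True when the measured SLI ratio meets or beats the target."""
        return measured >= self.slo_target


ingest = ServiceLevel(
    boundary="ingest",
    capability="accept customer data",
    definition="If I send my data in the right format, it is accepted and processed.",
    sli="share of well-formed payloads that get a 200 OK response",
    slo_target=0.999,
)

print(ingest.is_met(0.9995))  # True
print(ingest.is_met(0.9980))  # False
```

Keeping the plain-language definition next to the number makes it easier to revisit the SLO target later without losing track of what the SLI is meant to represent.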
Each system boundary has a unique set of functionality and dependencies to consider. Let’s take a closer look at how these considerations shape how we define our SLIs and SLOs for each tier.
The functionality of services drives SLIs
Part of the availability definition for our platform means that it can ingest data from our customers and route it to the right place so that other systems can consume it. We're dealing here with two distinct processes—ingest and routing—so we need an SLO and SLI for each.
It’s critical that we start with plain-language definitions of what “availability” for each of these services means to our customers using the system. In the case of the ingest functionality, the customers in question are the end-users of our system—the folks sending us their data. In this case, the definition of availability might look like, “If I send my data to New Relic in the right format, it’ll be accepted and processed.”
Then, we can use that plain-language definition to determine which metric best corresponds to how it defines availability. The best metric here is probably the number of HTTP POST requests containing incoming data that are accepted with 200 OK status responses. Phrasing this in terms of an SLO, we might say that “99.9% of well-formed payloads get 200 OK status responses.”
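A minimal sketch of how that ingest SLI could be computed from request records follows. The `IngestRequest` type and the sample data are assumptions made for illustration; in practice the counts would come from whatever telemetry the ingest tier already emits.

```python
from typing import Iterable, NamedTuple


class IngestRequest(NamedTuple):
    well_formed: bool  # payload passed format validation (hypothetical flag)
    status: int        # HTTP status code returned to the sender


def ingest_sli(requests: Iterable[IngestRequest]) -> float:
    """Fraction of well-formed payloads that received a 200 OK response."""
    eligible = [r for r in requests if r.well_formed]
    if not eligible:
        return 1.0  # no eligible traffic: treat the window as meeting the SLI
    good = sum(1 for r in eligible if r.status == 200)
    return good / len(eligible)


sample = (
    [IngestRequest(True, 200)] * 998
    + [IngestRequest(True, 503)] * 2
    + [IngestRequest(False, 400)] * 50  # malformed payloads are excluded
)

sli = ingest_sli(sample)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {sli >= 0.999}")
```

Note that malformed payloads are left out of the denominator, because the plain-language definition only makes a promise about data sent in the right format.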
A plain-language definition for the data routing functionality might look like, “Incoming messages are available for other systems to consume off our message bus without delay.” With that definition, we might define the SLI and SLO as, “99.95% of incoming messages are available for other systems to consume off our message bus within 500 milliseconds.” To measure this SLO, we can compare the ingest timestamp on each message to the timestamp of when that message became available on the message bus.
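The timestamp comparison can be sketched as below. The pair layout (ingest time, bus-available time, both in milliseconds) and the sample values are assumptions for illustration only.

```python
from typing import Iterable, Tuple


def routing_sli(
    messages: Iterable[Tuple[float, float]], threshold_ms: float = 500.0
) -> float:
    """Fraction of messages that reached the bus within threshold_ms of ingest.

    Each message is an (ingest_ts_ms, bus_ts_ms) pair.
    """
    total = 0
    good = 0
    for ingest_ts_ms, bus_ts_ms in messages:
        total += 1
        if bus_ts_ms - ingest_ts_ms <= threshold_ms:
            good += 1
    return good / total if total else 1.0


# 1,999 messages routed in 120 ms and one that took 900 ms.
sample = [(0.0, 120.0)] * 1999 + [(0.0, 900.0)]

sli = routing_sli(sample)
print(f"SLI = {sli:.4f}, meets 99.95% SLO: {sli >= 0.9995}")
```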
OK, great! We now have an SLO for each service. In practice, though, we worry less about the SLO than we do about the SLI, because SLO numbers are easy to adjust. We might want to adjust an SLO number for various business reasons. For example, we might start out with a lower SLO for a less mature system and increase the SLO over time as the system matures. That’s why we say it’s important for the desired functionality of a service to drive the SLI.
SLIs are broad proxies for availability
The data ingested by our platform is stored in one of our main data tiers. For New Relic this is NRDB, our proprietary database cluster. In plain-language terms, NRDB is working properly if we can rapidly insert data into the system, and customers can query their data back out.
Under the hood, NRDB is a massive, distributed system with thousands of nodes and different worker types, and we monitor it to track metrics like memory usage, garbage collection time, data durability and availability, and events scanned per second. But at the system boundary level, we can just look at insert latency and query response times as proxies for those classes of underlying errors.
When we set an SLI for query response times, we’re not going to look at averages, because averages lie. But we also don’t want to look at the 99.9th percentile, because those are probably going to be the weird worst-case scenario queries. Instead, we focus on the 95th or 99th percentile, because that gives us insight into the experience of the vast majority of our customers without focusing too much on the outliers.
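To make the point about averages concrete, here is a small sketch with made-up latencies. The `percentile` helper uses the nearest-rank method, which is one common definition among several; the sample distribution is invented for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 90 fast queries, 9 slow ones, and a single pathological outlier.
latencies_ms = [100] * 90 + [2_000] * 9 + [60_000]

print("mean:", mean(latencies_ms))            # 870
print("p50: ", percentile(latencies_ms, 50))  # 100
print("p95: ", percentile(latencies_ms, 95))  # 2000
print("p99: ", percentile(latencies_ms, 99))  # 2000
```

In this sample the mean (870 ms) describes neither the typical query (100 ms) nor the slow tail (2,000 ms), while the 95th and 99th percentiles capture the slow tail without being dominated by the single 60-second outlier.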
At this point, we can configure an alert condition to trigger if we miss our query response time SLI. That lets us track how often the alert fires, which in turn tells us how often we satisfy our SLI; in other words, how much of the time we are available. We definitely don’t want to use this alert to wake people up in the middle of the night (the threshold for paging should be higher), but it’s an easy way to track our performance for SLO bookkeeping.
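One way to do that bookkeeping is to add up the time the alert condition spent in violation during a reporting window. The sketch below assumes the violations are available as non-overlapping (opened, closed) timestamp pairs; that format, the function name, and the sample dates are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def availability(
    window_start: datetime,
    window_end: datetime,
    violations: Iterable[Tuple[datetime, datetime]],
) -> float:
    """Fraction of the window during which the SLI alert was not in violation.

    Violations are (opened, closed) pairs and are assumed not to overlap.
    """
    window_s = (window_end - window_start).total_seconds()
    bad_s = 0.0
    for opened, closed in violations:
        # Clip each violation to the reporting window before counting it.
        start = max(opened, window_start)
        end = min(closed, window_end)
        if end > start:
            bad_s += (end - start).total_seconds()
    return 1 - bad_s / window_s


start = datetime(2024, 1, 1)
end = start + timedelta(days=30)
violations = [
    (datetime(2024, 1, 5, 3, 0), datetime(2024, 1, 5, 3, 20)),     # 20 minutes
    (datetime(2024, 1, 18, 14, 0), datetime(2024, 1, 18, 14, 10)),  # 10 minutes
]

result = availability(start, end, violations)
print(f"availability = {result:.5f}, meets 99.9% SLO: {result >= 0.999}")
```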
We articulate these SLIs and SLOs so our customers know what to expect when they query their data. In fact, we can combine several SLOs into one customer-friendly measure of reliability:
We started with these two SLOs:
- 99.95% of well-formed queries will receive well-formed responses.
- 99.9% of queries will be answered in less than 1000ms.
We combined them into this measure of reliability: 99.9% of well-formed queries will receive well-formed responses in less than 1000ms.
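Measured directly, the combined SLI counts a query as good only when both conditions hold at once. The following sketch assumes a hypothetical `QueryResult` record and invented sample data.

```python
from typing import Iterable, NamedTuple


class QueryResult(NamedTuple):
    well_formed_query: bool     # the customer's query was valid
    well_formed_response: bool  # we returned a valid response
    duration_ms: float          # time taken to answer


def combined_sli(results: Iterable[QueryResult], max_ms: float = 1000.0) -> float:
    """Fraction of well-formed queries answered correctly in less than max_ms."""
    eligible = [r for r in results if r.well_formed_query]
    if not eligible:
        return 1.0
    good = sum(
        1 for r in eligible if r.well_formed_response and r.duration_ms < max_ms
    )
    return good / len(eligible)


sample = (
    [QueryResult(True, True, 180.0)] * 9990
    + [QueryResult(True, False, 150.0)] * 4   # fast but broken responses
    + [QueryResult(True, True, 2500.0)] * 6   # correct but too slow
    + [QueryResult(False, False, 20.0)] * 25  # malformed queries are excluded
)

sli = combined_sli(sample)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {sli >= 0.999}")
```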
Measure customer experience to understand SLO/SLIs for UIs
We assign one service to our UI tier, and we expect it to be fast and error-free. But to measure UI performance, we have to change our perspective. Until now, our reliability concerns were server-centric, but with the UI tier we want to measure customer experience and how the front end affects it. We have to set multiple SLIs for the UI.
For page load time, for example, we use the 95th or 99th percentile load time rather than the average. Additionally, we set different SLIs for different geographies. But for modern web applications, page-load time is only one SLI to consider.
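A per-geography page load SLI might be evaluated along these lines. The region names, targets, and samples are invented for illustration, and the 95th percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def page_load_report(
    samples: Iterable[Tuple[str, float]], targets_ms: Dict[str, float]
) -> Dict[str, Tuple[float, bool]]:
    """Map each region to (p95 load time, whether it meets that region's target)."""
    by_region = defaultdict(list)
    for region, load_ms in samples:
        by_region[region].append(load_ms)
    return {
        region: (p95(loads), p95(loads) <= targets_ms[region])
        for region, loads in by_region.items()
    }


samples = (
    [("us", 800.0)] * 19 + [("us", 1500.0)]
    + [("apac", 1200.0)] * 18 + [("apac", 3000.0)] * 2
)
targets_ms = {"us": 1000.0, "apac": 2000.0}  # hypothetical per-region targets

report = page_load_report(samples, targets_ms)
for region, (value, ok) in report.items():
    print(region, value, ok)
```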
Hard dependencies require higher SLOs: the network tier
So far, we’ve explained how we define SLIs and SLOs for different services in our platform, but now we’re going to address a critical part of our core infrastructure, the network. These are our most important SLIs and SLOs because they set the foundation for our entire platform. The networking tier is a hard dependency for all of our services.
For our network, we defined three capabilities: We need connectivity between availability zones (AZs), connectivity between racks within an AZ, and load-balanced endpoints that expose services both internally and externally. We need a higher SLO for these capabilities.
With these layers of dependencies come potential failure scenarios:
- If something goes wrong in the UI tier, it’s an isolated failure that should be easy for us to recover from.
- If our service goes down, the UI is affected—but we can implement some UI caching to reduce that impact.
- If the data infrastructure goes down, the service tier and UI also go down, and the UI can’t recover until both the data tier and service tier come back online.
- If the network goes down, everything goes down, and we need recovery time before the system is back online. Because systems don’t come back the instant a dependency recovers, our mean time to recovery (MTTR) increases.
In general, we assume we'll lose roughly one nine of availability at each level. If we expect an SLO of 99.9% availability for services running on the network tier, we set an SLO of 99.99% availability for the network itself.
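The arithmetic behind that rule of thumb can be sketched as follows. The calculation assumes hard, serial dependencies with independent failures, which is a simplification, and the function names are hypothetical.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window of `days`."""
    return (1 - slo) * days * 24 * 60


def serial_availability(*tiers: float) -> float:
    """Best-case availability of a stack in which every tier is a hard dependency.

    Assumes failures are independent, so the availabilities multiply.
    """
    result = 1.0
    for tier in tiers:
        result *= tier
    return result


network_slo = 0.9999
service_slo = 0.999

print(f"network budget: {allowed_downtime_minutes(network_slo):.2f} min per 30 days")
print(f"service budget: {allowed_downtime_minutes(service_slo):.2f} min per 30 days")
print(f"service on top of network: {serial_availability(network_slo, service_slo):.5f}")
```

Under these assumptions, a 99.99% network leaves the service tier nearly all of its own 43.2-minute monthly budget. If the network were only held to 99.9%, it could consume that entire budget by itself.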
It’s difficult to implement graceful degradation scenarios against hard infrastructure outages, so we invest in reliability at these infrastructure layers and set higher SLOs. This practice is one of the most important things we can do for the overall health of our platform.
One last overall check
After defining SLIs and SLOs for the services that deliver our overall platform, we have a clear picture of where our reliability hotspots are. And our engineering teams have a solid basis to understand, prioritize, and defend their reliability decisions.
We still need to implement one last SLI and SLO check: We need to measure our end-to-end customer experience.
To do this, we run a synthetic monitoring script in New Relic that represents a simple end-to-end customer workflow. It sends a piece of data to our platform and then logs in and queries for that specific data. If we detect any significant discrepancy between the performance of this script and the expectations set in our SLOs, we know we need to revisit our SLI methodology.
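The shape of such a check is sketched below in Python. This is not New Relic's synthetic scripting API: the `send`, `login`, and `query` callables are hypothetical stand-ins for real platform calls, and an in-memory fake is used here so the example runs without external dependencies.

```python
import time
import uuid
from typing import Callable, Optional


def end_to_end_check(
    send: Callable[[str], None],
    login: Callable[[], str],
    query: Callable[[str, str], bool],
    timeout_s: float = 5.0,
    poll_s: float = 0.1,
) -> Optional[float]:
    """Send a uniquely tagged record, log in, and poll until it is queryable.

    Returns the elapsed seconds on success, or None if the record did not
    become visible before the timeout.
    """
    marker = str(uuid.uuid4())  # unique value so we query for this exact record
    started = time.monotonic()
    send(marker)
    session = login()
    while time.monotonic() - started <= timeout_s:
        if query(session, marker):
            return time.monotonic() - started
        time.sleep(poll_s)
    return None


# In-memory stand-in for the platform, for demonstration only.
store = set()
elapsed = end_to_end_check(
    send=store.add,
    login=lambda: "fake-session",
    query=lambda session, marker: marker in store,
)
print("workflow succeeded" if elapsed is not None else "workflow timed out")
```

The elapsed time returned by a check like this can then be compared with the latency expectations set in the individual SLOs.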
Six things to remember
In closing, we encourage you to remember these six points when it comes to setting SLIs and SLOs:
- Define SLIs and SLOs for specific capabilities at system boundaries.
- Each logical instance of a system (for example, a database shard) gets its own SLO.
- Combine SLIs for a given service into a single SLO.
- Document and share your SLI/SLO contracts.
- Assume that both your SLOs and SLIs will evolve over time.
- Stay engaged—SLOs represent an ongoing commitment.
It takes a while to build a good reliability practice, but no matter how much time and effort you invest, we strongly believe that you can’t build resilient and reliable software architectures without clear definitions of the demands and availability you’re setting for your systems.
An earlier version of this blog post was published in October 2018, adapted from a talk given at FutureStack18 titled, “SLOs and SLIs In The Real World: A Deep Dive.” Matthew Flaming, former vice president of site reliability at New Relic, contributed to this post.