Service levels describe services provided to users within a given period of time, in measurable terms. Service level objectives (SLOs) are the goals set for the availability expected out of a system. Service level indicators (SLIs) are the key measurements and metrics to determine the availability of a system. Service level agreements (SLAs) are the legal contracts that explain what is agreed upon and what happens if systems don’t meet SLOs.
For example, an SLO for a web application might be that videos must start playing in less than 2 seconds, 99% of the time during a one-week period. The SLI measures the proportion of videos on the website that start playing in less than 2 seconds. The SLA includes this SLO and any other SLOs agreed upon by the customer and the service provider, the scope of services that will be covered, and the SLIs, which are the metrics that will be used to measure performance.
Site reliability engineering (SRE) has popularized best practices for maintaining uptime and reliability of distributed systems, focused on the way to measure performance and reliability of services. Google published Site Reliability Engineering: How Google Runs Production Systems in March 2016, describing a framework for modeling, selecting, and analyzing metrics, starting with service level objectives.
So how do SLOs, SLIs, and SLAs relate to each other and to ways to manage service levels that your users expect? Let’s look at each in more detail.
What are SLOs?
SLOs are the goals you set for how much availability you expect out of your system, expressed as a percentage over a period of time.
Service level objectives help teams collaborate on a shared meaning of “availability” and “uptime.” You use SLOs as the standard against which you measure reliability and availability. In the earlier example, the SLO states that videos in the web application must start playing in less than 2 seconds, 99% of the time over a one-week period.
What are SLIs?
SLIs are the quantitative measurements of how users experience the availability of a system. They represent a proportion of successful outputs for a level of service, expressed as a percentage.
Service level indicators are defined in relation to SLOs, and they provide real-time signals about system reliability. SLIs can measure the proportion of requests that were faster than a threshold or the proportion of records coming into a pipeline that result in the correct value coming out. In the earlier example, the SLI measures the proportion of videos on the website that start playing in less than 2 seconds. Comparing that measurement with the SLO target tells you how far you are from the objective.
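As a rough illustration of the video example, the sketch below computes an SLI as the percentage of playback starts that came in under a threshold, then compares it with the 99% target. The function name `latency_sli`, the constant `SLO_TARGET_PCT`, and the sample measurements are all hypothetical; in practice the measurements would come from your telemetry over the SLO period.

```python
from typing import Sequence


def latency_sli(start_times_s: Sequence[float], threshold_s: float = 2.0) -> float:
    """Return the percentage of events that started in less than threshold_s seconds."""
    if not start_times_s:
        # With no measurements the proportion is undefined, so fail loudly.
        raise ValueError("no measurements in the window")
    good = sum(1 for t in start_times_s if t < threshold_s)
    return 100.0 * good / len(start_times_s)


# Target from the example SLO: 99% of videos start in under 2 seconds.
SLO_TARGET_PCT = 99.0

# Hypothetical video start times in seconds (illustrative data only).
samples = [0.8, 1.2, 1.9, 2.4, 0.5, 1.1, 3.0, 1.7, 0.9, 1.3]

sli = latency_sli(samples)
gap = sli - SLO_TARGET_PCT  # negative means the objective is being missed
print(f"SLI: {sli:.1f}%  SLO: {SLO_TARGET_PCT:.1f}%  gap: {gap:+.1f} points")
```

With this sample, 8 of the 10 starts are under 2 seconds, so the SLI is 80% and the service is 19 percentage points short of the objective.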
What are SLAs?
SLAs define the level of service your customers expect when they use your service.
These service level agreements are contracts between service providers and their customers that document what services the provider will furnish and define the service standards the provider is obligated to meet. SLAs also describe the remedies or penalties that apply when SLO commitments are broken.
For the earlier example, the SLA will include all the SLOs for the web application, as well as the scope of services that will be covered, and all the SLIs, which are the metrics that will be used to measure performance against the SLOs. The agreement also sets out the responsibilities of both the service provider and the customer.
Who uses service levels, SLOs, SLIs, and SLAs?
SRE teams, reliability engineers, and cross-functional teams often struggle to define and measure service “reliability.” Cross-functional teams need to create an aggregated, comprehensive view of important metrics for all aspects of a service or system so they can easily measure uptime and performance.
Service levels come into play to help SRE teams and reliability engineers identify critical components of their applications and infrastructure. In particular, they need to know when one or more components expose functionality to external customers. We call these intersection points system boundaries. System boundaries are where site reliability engineers need to apply service level indicators and objectives to their metrics in order to tell the real story of system performance and reliability.
It takes a lot of effort and thought to establish service boundaries and determine which metrics need to be SLIs and what the SLO compliance requirements should be. This complexity often results in teams abandoning the effort altogether. Reliability engineers and SRE teams need accurate, customized SLIs and SLOs based on historical system performance so they can quickly set a baseline for availability and uptime across their entire stack, for all of their teams.
While SRE teams and reliability engineers aren’t always responsible for managing service levels, it often falls within their purview. By tracking SLIs and tying them to SLOs, you can set goals around the performance of a system. Google’s SRE book defines the four golden signals of monitoring as latency, traffic, errors, and saturation. So, for example, you could look at an API call and track the proportion of its requests that succeed (the SLI) against the percentage of requests (the SLO, for example 95%) that need to succeed for customers to have a good experience.
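The API example can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes that any HTTP status code below 500 counts as a successful request, which is a choice each team has to make for itself, and the function names and sample data are hypothetical.

```python
from typing import Sequence


def success_rate_sli(status_codes: Sequence[int]) -> float:
    """Return the percentage of requests that did not fail with a server error."""
    if not status_codes:
        raise ValueError("no requests in the window")
    # Assumption: only 5xx responses count as failures for this SLI.
    good = sum(1 for code in status_codes if code < 500)
    return 100.0 * good / len(status_codes)


def meets_slo(sli_pct: float, slo_pct: float = 95.0) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli_pct >= slo_pct


# Hypothetical window of 100 requests: 96 succeeded, 4 hit server errors.
codes = [200] * 96 + [500] * 3 + [503] * 1

sli = success_rate_sli(codes)
print(f"SLI: {sli:.1f}%  meets 95% SLO: {meets_slo(sli)}")
```

Here the SLI is 96%, which satisfies the 95% objective with one percentage point to spare.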
SRE teams often set strict SLOs on critical components within their applications and services to better understand how strict an SLA they can agree to with customers. From there, the team can apply error budgets, the amount of unreliability an SLO permits over its period, to understand how quickly they must resolve issues to stay compliant with their SLOs. Service levels allow teams to aggregate metrics and create a transparent view of uptime, performance, and reliability across the entire organization. At a glance, business leaders can use service levels to monitor compliance across multiple teams, applications, and services to gain a comprehensive understanding of their systems’ health.
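An error budget is the share of requests (or time) that an SLO allows to fail: for a 95% SLO, 5% of requests in the period. The sketch below shows one way to compute it from request counts. The class and function names and the traffic figures are hypothetical, and it assumes a request-based SLO rather than a time-based one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_failures: float  # failures the SLO tolerates in the period
    actual_failures: int     # failures observed so far

    @property
    def remaining(self) -> float:
        """Failures still available before the SLO is breached (negative if overspent)."""
        return self.allowed_failures - self.actual_failures

    @property
    def consumed_fraction(self) -> float:
        """Fraction of the budget already spent; above 1.0 means the SLO is breached."""
        if self.allowed_failures == 0:
            # A 100% SLO leaves no budget: any failure overspends it.
            return float("inf") if self.actual_failures else 0.0
        return self.actual_failures / self.allowed_failures


def error_budget(total_requests: int, failed_requests: int, slo_pct: float) -> ErrorBudget:
    """Build the error budget for a request-based SLO over one period."""
    allowed = total_requests * (100.0 - slo_pct) / 100.0
    return ErrorBudget(allowed_failures=allowed, actual_failures=failed_requests)


# Hypothetical period: 1,000,000 requests, 20,000 failures, 95% SLO.
budget = error_budget(total_requests=1_000_000, failed_requests=20_000, slo_pct=95.0)
print(f"allowed: {budget.allowed_failures:.0f}  remaining: {budget.remaining:.0f}  "
      f"consumed: {budget.consumed_fraction:.0%}")
```

In this example the SLO tolerates 50,000 failed requests, 20,000 have occurred, and 40% of the budget is spent. A team watching this number can decide how urgently an issue needs to be resolved before the remaining budget runs out.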
What is service level management?
Service level management means making sure that the processes and operational agreements behind the level of service you provide to customers are appropriate. It includes monitoring and reporting on service levels, setting and adjusting SLOs, determining SLIs, making sure you are meeting SLAs, and holding customer reviews.
The central focus is a shared meaning of “availability” across teams, expressed in your SLOs and captured in the SLAs with your customers. To make sure your business meets or exceeds these service level agreements, cross-functional teams need to manage internal SLOs.
Benefits of service level management
Implementing SLO best practices across teams isn’t easy. You need the right data to define a shared language across teams.
Reliability engineers need to quickly set a baseline for availability and uptime across their full stack and team. You need SLOs and SLIs to determine service boundaries and a unified, transparent view of service reliability to better comply with customer-facing SLAs. You need to be able to report on reliability and SLO compliance metrics and error budgets so you can make improvements across your environment.
When you have good practices for SLIs, SLOs, and SLAs, and a platform for your service level management, you’ll see these benefits:
- Easy setup: Automatically establish a baseline of performance and reliability for any service with a one-click setup and recommendations and customizations provided in a simple, guided flow.
- Define reliability across teams: Avoid arduous alignment processes with SLO and SLI recommendations that help you determine service boundaries. Set reliability benchmarks automatically based on recent performance metrics in any entity.
- Iterate and improve: With full-stack context and automation through open-source infrastructure-as-code tools like Terraform, teams have insight into how specific nodes or services impact system reliability and can quickly take control over their performance. Custom views for both service owners and business leaders drive operational efficiency and lead to better reporting, alerting, and incident management processes.
- Standardize reliability: Cross-organizational teams have a unified, transparent view of service reliability, and can better comply with customer-facing SLAs. SLO compliance metrics and error budgets give organizations a way to report on reliability and implement changes across applications, infrastructure, and teams in a cohesive fashion.
For more tips, read our blog posts, Best practices for setting SLOs and SLIs for modern, complex systems and Introducing service level management.
Get started with service level management. Try New Relic.
The best way to learn more about service level management and observability is to get hands-on experience with an observability solution. Sign up for New Relic. Your free account includes 100 GB/month of free data ingest, one free full-access user, and unlimited free basic users. Then explore the service level management documentation. And learn how New Relic can recommend SLIs and SLOs based on historical system performance.