Service levels describe services provided to users within a given period of time, in measurable terms.
- Service level objectives (SLOs) are the goals set for the availability expected out of a system.
- Service level indicators (SLIs) are the key measurements and metrics to determine the availability of a system.
- Service level agreements (SLAs) are the legal contracts that explain what is agreed upon and what happens if systems don’t meet SLOs.
For example, an SLO for a web application might be that videos must start playing in less than 2 seconds, 99% of the time during a one-week period. The SLI measures the proportion of videos on the website that start playing in less than 2 seconds. The SLA includes this SLO and any other SLOs agreed upon by the customer and the service provider, the scope of services that will be covered, and the SLIs, which are the metrics used to measure performance.
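The video example can be sketched in a few lines of Python. This is only an illustration: the sample start times, the constant names, and the `video_start_sli` function are assumptions made up for this sketch, not measurements or an API from any real system.

```python
# Hypothetical sample: seconds each video took to start playing in one week.
start_times_s = [0.8, 1.2, 1.9, 2.4, 0.6, 1.1, 1.7, 3.0, 0.9, 1.4]

THRESHOLD_S = 2.0  # "start playing in less than 2 seconds"
SLO_TARGET = 0.99  # "99% of the time"


def video_start_sli(samples, threshold_s):
    """Return the fraction of videos that started in under threshold_s seconds."""
    if not samples:
        raise ValueError("no samples in the measurement window")
    good = sum(1 for t in samples if t < threshold_s)
    return good / len(samples)


sli = video_start_sli(start_times_s, THRESHOLD_S)
slo_met = sli >= SLO_TARGET
print(f"SLI = {sli:.1%}, SLO target = {SLO_TARGET:.0%}, met = {slo_met}")
```

With this made-up sample, 8 of 10 videos start in time, so the SLI falls well short of the 99% objective.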
Site reliability engineering (SRE) has popularized best practices for maintaining the uptime and reliability of distributed systems, with a focus on how to measure the performance and reliability of services. Google’s ebook, Site Reliability Engineering: How Google Runs Production Systems, describes a framework for modeling, selecting, and analyzing metrics, starting with SLOs.
SLO, SLI and SLA: What's the difference?
SLOs, SLIs, and SLAs differ in many ways, but together they form a coherent stack designed to help ensure effective operations and satisfied customers. We’ll look at each in more detail, but this table offers an overview of each of these three pillars:
Aspect | SLA | SLO | SLI |
---|---|---|---|
Purpose | Establish a level of service quality agreed upon by the customer and provider. | Identify the minimum levels of performance, availability, and other qualities (for example, recoverability) that will satisfy the SLAs. | Measure and report the specific parameters that show the system’s ability to meet each SLO. |
Examples | Will deliver: 99.99% uptime; two hour resolution time. Minimum 12-hour recovery from data loss. Failure to perform: Payment credits per unit of time. | Response time less than or equal to 300ms; error rate less than 2%; 3 copies of data. | Average response time = 250.1ms. Uptime percentage = 98.9% |
Typical influencers | Customer, business group, legal department | System architect, system integrator, reliability engineering team | Reliability engineering team |
When to use | Paid services | Both free and paid services | Whenever a service must be measured against established SLOs |
Focus | Scope, metrics, legal and financial consequences | Specific targets to satisfy SLAs | Actual data to assess performance |
Flexibility | Less flexible. Requires agreement amongst multiple parties: service providers, legal teams, and clients | More flexible. Objectives can be updated according to technological and service capabilities and requirements | Most flexible. Indicators can be adapted according to evolution of technologies, such as new instrumentation and machine learning practices. |
An SLA is typically created by the business and legal teams, working with the customer (for specific contracts). Providers also present general SLAs for their services, such as cloud service providers for their different instance types. SLAs can include multiple SLOs, the scope of services that will be covered, and the SLIs used to measure performance.
SLOs, SLIs, and SLAs help ensure a quality of service from a set of technologies. All influencers of SLOs, SLIs, and SLAs should be consulted to help be sure goals are attainable and customers are satisfied.
What are service level objectives (SLOs)?
SLOs are the goals you set for how much availability you expect out of your system, expressed as a percentage over a period of time.
SLOs help teams collaborate on a shared meaning of “availability” and “uptime.” You use SLOs as a standard to measure your reliability and availability. As described in the earlier example, an SLO might state that videos in the web application must start playing in less than 2 seconds, 99% of the time over a one-week period.
Examples of SLOs
As mentioned previously, SLOs serve as a bridge between technical metrics and the broader service level agreements (SLAs) agreed upon with customers. Let’s take a look at some more examples.
Uptime/Availability SLOs
- 99.9% uptime over a 30-day window.
- Less than 0.1% of requests fail due to system errors in any given week.
Latency SLOs
- 95% of web page loads complete within 2 seconds.
- 99% of API requests return within 300 milliseconds.
Error rate SLOs
- Fewer than 0.05% of all transactions result in an error.
- Less than 1% of database writes fail.
Throughput SLOs
- The system can handle 10,000 requests per second during peak times.
- Data ingestion rates of 5TB per day without degradation.
Capacity and usage SLOs
- Disk usage on critical systems remains below 80% at all times.
- No more than 70% of total RAM usage on any service instance.
Data integrity and consistency SLOs
- Data replication across clusters completes within 5 minutes.
- Less than 0.01% data inconsistency between primary and secondary storage systems.
Durability SLOs
- 99.9999999% (nine 9s) durability of data over a year.
- Successful backup restoration 99.5% of the time.
Change management and deployment SLOs
- 98% of deployments occur without rollback.
- 99% of changes result in no unplanned outages.
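An availability target such as "99.9% uptime over a 30-day window" from the examples above translates directly into an amount of tolerable downtime. The following is a minimal sketch of that arithmetic; the function name `allowed_downtime` is a hypothetical helper chosen for this illustration.

```python
from datetime import timedelta


def allowed_downtime(slo_target, window):
    """Downtime a window can absorb while still meeting an availability target."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    return window * (1 - slo_target)


# 99.9% uptime over a 30-day window leaves roughly 43 minutes of downtime.
budget = allowed_downtime(0.999, timedelta(days=30))
print(f"allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Running the same calculation for 99.99% shows why each extra nine is costly: the allowance shrinks by a factor of ten.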
How to set SLOs
Setting the right SLOs is a strategic process. When done correctly, it improves service reliability and creates a better customer experience.
- Understand your users’ expectations and needs. You’ll want to engage with all stakeholders, including but not limited to customers and internal teams to gain insight into what’s critical to your application’s performance and reliability.
- Analyze the historical performance of your system to understand its current behavior and identify any recurring issues or areas of concern. This information will allow you to set specific, measurable indicators that truly represent the service’s health, like latency, error rate, or uptime.
- Define your target objectives once these indicators are in place. The targets should be both challenging and achievable, and they should align with your broader business goals.
Remember, SLOs should be reviewed and potentially adjusted periodically to reflect changes in user expectations, system behavior, or business priorities. Additionally, it’s essential to strike a balance: while high reliability is crucial, over-stringent SLOs can impede agility and innovation. Collaborative tools and observability platforms like New Relic can aid in continuously monitoring and adjusting SLOs as your system and business evolve.
How can you balance between setting aggressive SLOs and realistic ones?
Striking a balance involves understanding user expectations and the technical capabilities of your system. It's crucial to involve stakeholders from both the business and technical sides to set SLOs that are challenging yet feasible.
What happens if SLOs are consistently not met?
If SLOs are consistently not met, it may indicate underlying issues in the service. Teams should conduct root cause analysis to identify problems and work on improvements. For SLAs, missing SLOs might result in penalties or other consequences defined in the agreement.
What are service level indicators (SLIs)?
SLIs are the quantitative measurements of how users experience the availability of a system. They represent a proportion of successful outputs for a level of service, expressed as a percentage.
SLIs are described in relation to SLOs, but SLIs provide real-time signals into system reliability. SLIs can measure the proportion of requests that were faster than a threshold or the proportion of records coming into a pipeline that result in the correct value coming out. As described in the earlier example, the SLI measures the proportion of videos on the website that start playing in less than 2 seconds. An SLI tells you how far you are from the objective in the SLO.
Examples of SLIs
SLIs serve as the foundation upon which SLOs and SLAs are based. Let’s look at some examples.
Availability/Uptime
- Percentage of successful requests vs. total requests.
- Ratio of system uptime to the total time period.
Latency
- Time taken for an API request to return a response.
- Time taken for a web page to load for the end user.
Throughput
- Number of requests handled per second.
- Volume of data processed within a specific time frame.
Error rate
- Percentage of failed requests vs. total requests.
- Number of 4xx or 5xx HTTP status codes returned.
Saturation
- Percentage of resource utilization, such as CPU or RAM.
- Amount of used storage relative to the total available storage.
Coverage
- Percentage of users who receive a new feature update within a given time frame.
- Ratio of cached responses vs. total responses delivered.
Freshness
- Age of the data being read relative to when it was written.
- Time taken for data replication across multiple databases or systems.
Capacity
- Maximum number of users or sessions the system can handle simultaneously.
- Maximum data volume the system can handle without degradation.
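Several of the SLIs listed above can be derived from the same request data. The sketch below assumes a simplified, hypothetical request log and treats only 5xx responses as failures; both the `Request` type and that definition of failure are assumptions for illustration, and a real service would choose its own definition of a "good" request.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to return a response


# Hypothetical request log for one measurement window.
request_log = [
    Request(200, 120), Request(200, 250), Request(500, 900),
    Request(200, 310), Request(404, 80), Request(200, 140),
    Request(200, 295), Request(503, 1200), Request(200, 60),
    Request(200, 180),
]


def _fraction(count, total):
    """Guard against an empty window before dividing."""
    if total == 0:
        raise ValueError("no requests in the measurement window")
    return count / total


def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    return _fraction(sum(1 for r in requests if r.status < 500), len(requests))


def error_rate_sli(requests):
    """Fraction of requests that failed with a server error (5xx)."""
    return _fraction(sum(1 for r in requests if r.status >= 500), len(requests))


def latency_sli(requests, threshold_ms):
    """Fraction of requests that returned within threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return _fraction(fast, len(requests))


print(f"availability: {availability_sli(request_log):.0%}")
print(f"error rate:   {error_rate_sli(request_log):.0%}")
print(f"<= 300 ms:    {latency_sli(request_log, 300):.0%}")
```

Expressing each indicator as a ratio of good events to total events keeps them directly comparable to percentage-based SLOs.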
How do you choose appropriate SLIs for a service?
SLIs should be chosen based on what matters most to users/customers. Common SLIs include latency, error rates, throughput, and availability. It's essential to understand user expectations and business priorities.
How do you measure SLIs accurately?
Accurate measurement often requires implementing monitoring and logging systems. Use tools that capture relevant data points and provide insights into SLIs. Regularly validate and calibrate measurement systems to ensure accuracy.
What are service level agreements (SLAs)?
SLAs define the level of service your customers expect when they use your service.
These service level agreements are contracts between service providers and their customers that document what services the provider will furnish and define the service standards the provider is obligated to meet. SLAs also describe the remedies or penalties that apply when SLO commitments are broken.
For the earlier example, the SLA will include all the SLOs for the web application, as well as the scope of services that will be covered, and all the SLIs, which are the metrics that will be used to measure performance against the SLOs. The agreement also includes both the responsibilities of the service provider and the customer.
Examples of SLAs
SLAs vary from one company and customer to the next and according to the service being provided. Here are a few links to example SLAs from business leaders.
- AWS general SLAs
- HP Enterprise SLA for security services
- Verizon Business SLA for Internet Dedicated Services
There are many other good examples of existing SLAs from leading companies that can be evaluated when creating your own SLAs.
How to create a good SLA
Creating an SLA requires multidisciplinary input from a variety of stakeholders, including the customer, the provider’s legal counsel, the business unit, and the reliability team. Since this is a legally binding agreement, items in the SLA should be discussed thoroughly and frankly among all members of the team.
Because SLAs are legal agreements, they are comprehensive documents. Thus, SLAs can include the following topics:
- An overview, including introductory information, definitions of legal terms, scope of the agreement, purpose, review period, and contractual parameters.
- The service agreement itself, identifying the qualities of service the customer can expect, the objectives to deliver those qualities, and the metrics monitored and tracked. Important terms will identify types of problems that might be experienced and response times to fix them.
- Exceptions and limitations will state any items or events that are excluded from the agreement. For example, response times might not include delays caused by waiting on the customer.
- Responsibilities identify who will do what to meet the SLA, including both the provider’s and customer’s responses to address issues.
- Service availability will define when service personnel are available to respond, what type of service will be available (on-site, phone support, online/chat support, etc.), and other aspects that affect response times.
- References and a glossary will help define meanings of terms, in addition to the legal terms included in the overview.
- Pricing will be included, which may cover a range of services.
- Remedies will identify the compensation provided to the customer should SLAs not be met. These are often described in terms of credits.
- An appendix might be included to cover other topics.
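Remedies expressed as credits are often tiered by how far measured uptime fell below the commitment. The sketch below shows one way such a schedule could be evaluated. The tier boundaries and credit percentages are invented for illustration and are not taken from any real provider's SLA.

```python
# Hypothetical schedule: (minimum measured uptime, credit as a fraction of the fee).
# Tiers are ordered from the highest uptime floor to the lowest.
CREDIT_TIERS = [
    (0.9999, 0.00),  # commitment met: no credit
    (0.999, 0.10),
    (0.99, 0.25),
    (0.0, 0.50),
]


def service_credit(measured_uptime, monthly_fee):
    """Return the credit owed for a billing period under the schedule above."""
    if not 0 <= measured_uptime <= 1:
        raise ValueError("measured_uptime must be a fraction between 0 and 1")
    for floor, credit_fraction in CREDIT_TIERS:
        if measured_uptime >= floor:
            return monthly_fee * credit_fraction
    raise AssertionError("unreachable: the last tier has a floor of 0.0")


print(service_credit(0.9995, 1000))  # falls in the 10% tier
```

Keeping the schedule as data rather than branching logic makes it easier to review against the wording of the signed agreement.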
What happens if an SLA is breached?
Since SLAs are legally binding contracts, there are consequences for one or both parties failing to meet their obligations. These remedies should be defined in the SLA. If a breach is suspected, there first needs to be clear, frequent, and professional communications between all the parties involved, from the customer through the reliability engineering team.
The first rule of managing an SLA breach is not blaming people, but blaming a problem. Solve the problem and you avoid a crisis.
Who uses service levels, SLOs, SLIs, and SLAs?
Cross-functional teams—including legal, business, and reliability engineering—rely on service levels, SLOs, SLIs, and SLAs to define and deliver quality service. On the other hand, customers see the SLA as the promises made by the provider regarding the level of service to be expected. This combination of stakeholders makes defining the service level stack demanding. Stakeholders often struggle to define and measure service “reliability.”
Service levels allow teams to aggregate metrics and create a transparent view of uptime, performance, and reliability across the entire organization. At a glance, business leaders can use service levels to monitor compliance across multiple teams, applications, services, etc. to gain a comprehensive understanding of their system’s health.
Service levels help SRE teams and reliability engineers identify critical components of their applications and infrastructure. They need to know when one or more components expose functionality to external customers. These intersection points are called system boundaries.
When to use SLIs
Use SLIs whenever there is a need to quantify a service against established SLOs. If you set an objective, you’ll need to be able to validate it to demonstrate performance. With that in mind, reliability engineers and SRE teams need accurate SLIs based on historical system performance so they can set baseline SLOs for availability and uptime across their entire stack.
System boundaries are where SREs need to apply SLIs and goals for their metrics in order to tell the real story of system performance and reliability.
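Deriving a baseline SLO from historical SLIs, as described above, can start from something very simple. The sketch below takes the worst daily value seen in a made-up history and rounds it down. Both the data and the "worst observed value" rule are assumptions for illustration; teams may prefer a percentile or a longer history.

```python
import math

# Hypothetical daily availability SLIs (fraction of successful requests).
daily_sli = [0.9991, 0.9987, 0.9995, 0.9979, 0.9993, 0.9990, 0.9988]


def baseline_slo(history, decimals=3):
    """Suggest a starting target: the worst observed SLI, rounded down."""
    if not history:
        raise ValueError("need at least one historical SLI value")
    scale = 10 ** decimals
    return math.floor(min(history) * scale) / scale


suggested = baseline_slo(daily_sli)
print(f"suggested baseline SLO: {suggested:.1%}")
```

A baseline chosen this way is one the system has already met on every day in the sample, which makes it a conservative starting point to tighten later.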
When to use SLOs
SLOs can be applied whenever there is a need to achieve a desired service level from a system. From small businesses to enterprise operations, minimum service levels are usually expected. SLOs define how you’ll achieve that quality of service.
SRE teams often set strict SLOs on critical components within their applications and services to better understand how strict an SLA they can agree to with customers. From here, the team can apply error budgets, the amount of unreliability an SLO permits over its time window, to understand how quickly they must resolve issues in order to stay compliant with their SLOs.
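An error budget is the complement of the SLO target: a 99.9% objective leaves 0.1% of events free to fail. The sketch below computes the budget, what remains of it, and a burn rate, defined here as the observed error rate divided by the allowed error rate. The function name and the sample numbers are assumptions for illustration.

```python
def error_budget(slo_target, total_events, bad_events):
    """Return (allowed_bad, remaining, burn_rate) for an event-based SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    budget_fraction = 1 - slo_target              # share of events allowed to fail
    allowed_bad = budget_fraction * total_events  # the error budget, in events
    remaining = allowed_bad - bad_events          # negative once the SLO is missed
    burn_rate = (bad_events / total_events) / budget_fraction
    return allowed_bad, remaining, burn_rate


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% objective.
allowed, remaining, burn = error_budget(0.999, 1_000_000, 400)
print(f"budget: {allowed:.0f} failures, remaining: {remaining:.0f}, burn rate: {burn:.2f}")
```

A burn rate above 1 means errors are arriving faster than the objective allows, so the budget would run out before the window ends if nothing changes.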
When to use SLAs
Always use SLAs with paying customers. Some SLAs will be unique to the customer and depend on the services needed. General SLAs, such as for cloud or other IT services, define what customers can expect from the provider’s systems. Failure to provide SLAs and require customers to agree to them creates ambiguity, which can lead to customer dissatisfaction and legal challenges.
Challenges of SLAs, SLOs, and SLIs
Awareness around the challenges of SLAs, SLOs, and SLIs can help you create more effective service levels. Here are a few challenges to proactively address for each.
SLO challenges
- Select metrics wisely: Metrics (SLIs) need to align with the business goal (SLO) and support the customer expectations captured in the SLA. Choosing the right metrics is therefore critical, and it can be a challenge.
- Find balance: Defining a balanced SLO can be challenging. Define SLOs that can be measured, and don’t spend time on SLOs that lack well-defined SLIs to show whether they are achievable. Conversely, SLOs that are too easy to meet might not set you apart from your competition.
- Keep up with external dependencies: Stay on top of any third-party services that your service reliability depends on. If an external service fails, it might reduce the ability to comply with the SLO—even if the internal components work perfectly.
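The effect of external dependencies can be estimated with simple arithmetic. The sketch below assumes every dependency is required for each request and that failures are independent, in which case availabilities multiply. The service names and figures are hypothetical, and real dependencies are rarely fully independent.

```python
import math

# Hypothetical availabilities for a service and two third-party dependencies.
availabilities = {
    "own_service": 0.9995,
    "payment_provider": 0.999,
    "auth_provider": 0.9999,
}


def serial_availability(values):
    """Composite availability when every component must be up (independent failures)."""
    return math.prod(values)


composite = serial_availability(availabilities.values())
print(f"composite availability: {composite:.4%}")
print(f"meets a 99.9% SLO: {composite >= 0.999}")
```

Even though each component meets or exceeds 99.9% on its own, the combined figure falls below it, which is why third-party reliability belongs in the SLO discussion.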
SLI challenges
- Too many metrics: Don’t bombard your reliability team with so much data that measurement becomes complicated. Weigh the value of each metric against the effort your team will spend to monitor, interpret, and maintain it.
- Difficult-to-measure metrics: Some performance metrics are hard to measure accurately, including user engagement, real-time application latency, and user satisfaction. Automated methods, such as machine learning or other AI tools, may help you define and measure these metrics more accurately.
SLA challenges
- Not including all stakeholders: Collaborate, collaborate, collaborate. Take the time necessary to define and understand the what, why, and how of delivering on your SLAs. Your SLAs define your relationship with your customer. Neglecting to include all stakeholders, from reliability engineering to the customer, when defining your SLAs can lead to unrealistic SLOs and failure to deliver the expected quality of service.
- Adapting to customer wants and new technologies: Technological evolution is happening rapidly, making it tough to keep up with new tools for reliability engineering. The same holds true for changing customer needs, which can demand frequent adjustment and renegotiation.
- Costs: The balance between cost and benefit is always a struggle. SLAs must be thought out from all perspectives, and that means investing human resources across cross-functional teams. Skimping on this area can lead to more costly litigation—or worse, losing the trust of your customers.
What is service level management?
Service level management means ensuring that all of your processes and operational agreements for the level of your services provided to customers are appropriate. It includes monitoring and reporting on service levels, setting and adjusting SLOs, determining SLIs, making sure you are meeting SLAs, and holding customer reviews.
The central focus is a shared meaning of “availability” across teams, expressed in your SLOs and captured in the SLAs with your customers. To make sure your business is meeting or exceeding these service level agreements, it’s important for cross-functional teams to manage internal SLOs.
Benefits of service level management
Implementing SLO best practices across teams isn’t easy. You need the right data to define a shared language across teams.
Reliability engineers need to quickly set a baseline for availability and uptime across their full stack and team. You need SLOs and SLIs to determine service boundaries and a unified, transparent view of service reliability to better comply with customer-facing SLAs. You need to be able to report on reliability and SLO compliance metrics and error budgets so you can make improvements across your environment.
When you have good practices for SLIs, SLOs, and SLAs, and a platform for your service level management, you’ll see these benefits:
- Easy setup: Automatically establish a baseline of performance and reliability for any service with a one-click setup and recommendations and customizations provided in a simple, guided flow.
- Define reliability across teams: Avoid arduous alignment processes with SLO and SLI recommendations that help you determine service boundaries. Set reliability benchmarks automatically based on recent performance metrics in any entity.
- Iterate and improve: With full-stack context and automation through open-source infrastructure-as-code tools like Terraform, teams have insight into how specific nodes or services impact system reliability and can quickly take control over their performance. Custom views for both service owners and business leaders drive operational efficiency and lead to better reporting, alerting, and incident management processes.
- Standardize reliability: Cross-organizational teams have a unified, transparent view of service reliability, and can better comply with customer-facing SLAs, avoiding SLA breaches. SLO compliance metrics and error budgets give organizations a way to report on reliability and implement changes across applications, infrastructure, and teams in a cohesive fashion.
For more tips, read our blog posts, Best practices for setting SLOs and SLIs for modern, complex systems and Introducing service level management.
Next steps
Get started with service level management. Try New Relic.
The best way to learn more about service level management and observability is to get hands-on experience with an observability solution. Sign up for New Relic. Your free account includes 100 GB/month of free data ingest, one free full-access user, and unlimited free basic users.