
If you’re like me, you’ve spent most of your career working with IT operations teams. You’ve watched them invest lots of hard work trying to meet the expectations of the business, but they’ve come away with limited success. The business continually bashes IT for providing poor service, while IT struggles to meet seemingly nebulous expectations with limited resources. The major problem here is the fundamental disconnect over how IT and the business each measure success.

IT is responsible for sharing limited resources (such as CPU, memory, and disk) between business functions, so it measures consumption. IT then uses those metrics to recognize when a resource is close to exhaustion, which helps it avoid problems and keep costs low. The business, on the other hand, needs responsive and error-free services, so it measures success in terms of speed and quality. The result is two teams with drastically different definitions of success.

Practically, this means that there’s lots of tension between IT and the business. Here’s a real-world example: A customer of ours was continually being bashed by the business because “the system is always slow.” Over time, they had added tools to collect thousands of consumption metrics and tried to create correlation rules that would somehow show when the system was slow. What they ended up with was a mess: a huge collection infrastructure gathering metrics at sub-second intervals, alerts that triggered 24x7, and no easy way to understand what was truly going on.

They weren’t getting anywhere because they weren’t measuring the right things: their resource-oriented monitoring strategy gave them an incomplete picture.

If you want a simpler and more responsive observability practice, tighter alignment with the business, and faster paths to improvement, you should focus on service-level metrics instead.

Here I’m going to introduce you to service-level indicators (SLIs) and service-level objectives (SLOs), and then I’ll show you how to set your SLOs.

Service-level indicators

The textbook definition of an SLI is: “A carefully defined quantitative indicator of some aspect of the level of service that is provided.” In other words, an SLI is a metric measuring one thing that shows how well your IT service is performing. Extending this definition a bit, I’d say that it must be relevant to the delivered service and should be simple and easy to understand. Put differently, when an SLI goes wrong, there must be some business impact, such as an outage or poor user experience.

Remember, the business expects speed and quality, so you need to choose SLIs (metrics) that measure those things, such as:

  • Latency/response time
  • Error rate/quality
  • Availability 
  • Uptime

Note: Yes, there is a distinction between uptime and availability. For now, a quick web search will turn up plenty of explanations.

And here are some potential SLI choices that you shouldn’t use because they don’t directly correlate to business impact:

  • CPU, disk, memory consumption
  • Cache hit rate
  • Garbage collection time

Again, the main difference between a good and bad SLI is the metric’s relevance to service delivery. A high error rate or slow response time affects service delivery. High CPU utilization might impact service delivery, but the relationship between CPU and service performance is harder to establish. This is why IT teams that measure resource consumption struggle.

The key here is to pick a metric for your SLI that is clearly and unambiguously related to service delivery and is simple and easy to communicate to non-technical people. That will resolve the disconnect, making things easier for everyone involved.

Service-level objectives

An SLO is simply a goal that you set for your SLIs. First, you identify your SLIs. Then, by setting thresholds for each SLI, you create your SLOs.

SLOs should be easy for even non-technical stakeholders to understand. Stand-alone resource consumption metrics, such as CPU utilization, don’t tell you if something is performing well or not; they require interpretation by a subject-matter expert (SME). Identifying business-impacting SLIs, setting SLOs, and properly presenting them means that the consumers of those SLOs don’t have to ask if the number is good or bad. Interpretation is intuitive: the answer is “good” or “not good.” As a bonus, it’s easy to use SLOs to measure improvement.


The best way to present your SLOs in a way that meets the above requirements (intuitive and concise) is as a percentage. Don’t use averages; they hide all sorts of things you need to see.

There is one other value to using percentages: they implicitly handle statistical outliers and aggregate business impact. There will always be slow transactions and errors, but you don’t want to trip an alert whenever they happen. You only want to trip an alert when there are enough to have an impact.
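To make the point about averages concrete, here is a minimal sketch in Python. The response times and the 500-millisecond threshold are made-up illustration values, and `percent_within` is a hypothetical helper, not part of any monitoring product.

```python
from statistics import mean

# Hypothetical sample: 85 fast transactions and 15 very slow ones (milliseconds).
response_times_ms = [200] * 85 + [1500] * 15

THRESHOLD_MS = 500  # assumed "fast enough" threshold for this illustration


def percent_within(values, threshold):
    """Return the percentage of values at or below the threshold."""
    within = sum(1 for v in values if v <= threshold)
    return 100.0 * within / len(values)


average_ms = mean(response_times_ms)
fast_percent = percent_within(response_times_ms, THRESHOLD_MS)

# The average sits below the threshold and looks healthy, while the
# percentage shows that 15 out of every 100 transactions were slow.
print(f"average: {average_ms} ms")
print(f"within {THRESHOLD_MS} ms: {fast_percent}%")
```

In this sample the average comes out under the threshold even though a meaningful share of users had a slow experience, which is exactly what the percentage exposes.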

Here are some examples of well-chosen SLOs properly communicated as percentages:

  • 95% of transactions should have a response time of 500 milliseconds or less.
  • 99% of transactions should be error-free.
  • The application should have 99.9% uptime during working hours.

As opposed to:

  • Transactions should have a response time of 750 milliseconds or less.
  • Transactions should average fewer than 100 errors per hour.

Best Practice: Combine your SLIs into a single SLO where possible. For example, 99% of login transactions should complete in under two seconds and be error-free.
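As a rough sketch of how a combined SLO like this could be evaluated, the Python below counts a transaction as "good" only if it is both fast and error-free. The `Transaction` record, its field names, and the sample data are assumptions made for illustration; real telemetry will look different.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    # Hypothetical record shape; field names are assumptions for this sketch.
    name: str
    duration_s: float
    error: bool


def is_good(txn, max_duration_s=2.0):
    """A transaction is good only if it is error-free AND under the time limit."""
    return (not txn.error) and txn.duration_s < max_duration_s


def slo_attainment(transactions, name):
    """Return the percentage of good transactions with the given name, or None if there are none."""
    relevant = [t for t in transactions if t.name == name]
    if not relevant:
        return None
    good = sum(1 for t in relevant if is_good(t))
    return 100.0 * good / len(relevant)


sample = [
    Transaction("login", 0.4, False),
    Transaction("login", 2.5, False),          # too slow
    Transaction("login", 0.3, True),           # fast, but failed
    Transaction("login", 1.1, False),
    Transaction("health_check", 0.01, False),  # not a login, so ignored
]

attainment = slo_attainment(sample, "login")
slo_met = attainment >= 99.0  # target taken from the example SLO above
print(f"login SLO attainment: {attainment}% (met: {slo_met})")
```

Combining both conditions in one predicate keeps the result a single number that answers "good" or "not good" for the service as a whole.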


Setting your SLOs

If the business or IT management has already set SLOs for you, then you’ll want to use those. If they haven’t, I recommend using an iterative approach as follows:

  1. Identify the service you want to set SLOs for.
  2. Identify the service’s key transactions. Many services have transactions, such as health checks, that should not contribute to performance SLOs.
  3. Identify service and transaction SLIs.
  4. For each SLI, create a baseline SLO using the 95th percentile. (Don’t use averages, as they hide outliers, and you’ll end up with noisy alerts.)
  5. Set SLO violation alerts.
  6. Periodically review alert KPIs and service performance to ensure that your SLOs are relevant and help drive improvement.
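
Steps 4 and 5 of the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the historical latencies are synthetic, the nearest-rank method is one of several ways to compute a percentile, and `nearest_rank_percentile` and `violates_slo` are hypothetical helper names.

```python
import math


def nearest_rank_percentile(values, percentile):
    """Return the given percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def violates_slo(values, threshold_ms, target_percent=95.0):
    """Return True if fewer than target_percent of values are at or below the threshold."""
    within = sum(1 for v in values if v <= threshold_ms)
    return 100.0 * within / len(values) < target_percent


# Step 4: derive a baseline threshold from historical response times.
# Synthetic history: 10, 20, ..., 1000 milliseconds (100 samples).
history_ms = list(range(10, 1010, 10))
baseline_ms = nearest_rank_percentile(history_ms, 95)
print(f"baseline SLO: 95% of transactions within {baseline_ms} ms")

# Step 5: check a recent window of measurements against the baseline SLO.
recent_ms = [100] * 90 + [2000] * 10  # hypothetical window where 10% are slow
if violates_slo(recent_ms, baseline_ms):
    print("SLO violation: raise an alert")
```

A baseline derived this way only describes how the service behaves today. The periodic review in step 6 is where you decide whether that threshold is actually good enough for the business.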

Chapter 4 of the Google SRE book is an excellent resource with much more depth on setting your SLOs. This post will get you started, but you should read that chapter when you get some time.

The end result

Establishing SLIs and SLOs will result in a simpler and more responsive observability practice, tighter alignment with the business, and a faster path to improvement. It’s easy to get started: practice this on one service and see how well it works.