Error budget and service levels best practices

Your business and software stack run on countless services, and as the saying goes, if everything is a priority, then nothing is a priority. While you should monitor everything, there are significant benefits to focusing extra attention on your business-critical services. Traditionally, this has been a tough problem for DevOps teams and SREs, but it's easy with New Relic service level management (SLM). SLM gives you a way to identify service boundaries and monitor the health of your most critical systems with service level indicators (SLIs) and service level objectives (SLOs).

To make things even easier, we’ve added error budget and burn rate alerting into service level management! Error budgets and burn rates help you quickly see when business-critical services are experiencing service degradations or failures, often before customers even notice a problem. You can automate alert thresholds and set up alerts for error budgets and burn rates. These enhancements allow you to alert on critical metrics related to your service levels, helping you reduce downtime and achieve your SLOs.

If you’re already taking advantage of service level management and are eager to set up error budget and burn rate alerts, jump to How to apply alerts to error budgets and burn rates with New Relic to get started. Otherwise, read on to learn some best practices for service level management.

What is an error budget?

An error budget represents how many “bad” events you can afford over an SLO period. These “bad” events could be defined as metrics falling below certain thresholds, critical transaction failures or errors, or any custom event you determine to be detrimental. It’s essentially the inverse of an SLO: a 99.9% SLO, for example, leaves an error budget of 0.1% of events.

By definition, if you spend your error budget at a constant rate that uses it up exactly at the end of the SLO period, your burn rate equals one. A burn rate above one is unsustainable: at that pace, you'll exhaust your error budget before the SLO period ends. At a burn rate of two, for example, the budget is gone halfway through the period.
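The two definitions above can be written as a few lines of arithmetic. This is a minimal sketch, not New Relic's implementation: the function names are hypothetical, and it assumes the SLO is expressed as a fraction of good events (for example, 0.999 for 99.9%).

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events that may be bad, e.g. 0.999 leaves about 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed bad-event ratio divided by the ratio the budget allows.

    A result of 1.0 means the budget is being spent exactly on pace;
    anything above 1.0 exhausts it before the SLO period ends.
    """
    if total_events <= 0:
        return 0.0  # no traffic observed, so nothing was burned
    return (bad_events / total_events) / error_budget(slo_target)


# 10 bad events out of 10,000 against a 99.9% SLO: approximately 1.0
on_pace = burn_rate(10, 10_000, 0.999)
# 30 bad events out of 10,000 against the same SLO: approximately 3.0
too_fast = burn_rate(30, 10_000, 0.999)
```

Because the results are floating-point ratios, compare them with a tolerance rather than with strict equality.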

Reducing alert fatigue comes down to eliminating noise, identifying areas for actionable alerts, and providing context to those alerts faster. Error budgets offer a method for more efficient alerting, allowing you to reduce alert fatigue by configuring your SLOs to only alert you when the burn rate is above one for a sustained period of time.
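To illustrate the idea of alerting only on a sustained burn rate, here is a small sketch. It assumes you already have a series of burn-rate samples taken at regular intervals; the function name, the sample values, and the choice of three consecutive samples are all hypothetical, not New Relic defaults.

```python
from typing import Sequence


def sustained_burn(
    burn_rates: Sequence[float],
    min_consecutive: int,
    threshold: float = 1.0,
) -> bool:
    """Return True only if the most recent `min_consecutive` samples
    all exceed `threshold`, so a single short spike does not alert."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(burn_rates) < min_consecutive:
        return False  # not enough history to call the burn sustained
    recent = burn_rates[-min_consecutive:]
    return all(rate > threshold for rate in recent)


# A brief spike that recovers does not trigger an alert.
spike = sustained_burn([0.5, 3.0, 0.8, 0.9], min_consecutive=3)
# Three consecutive samples above one do.
sustained = sustained_burn([0.5, 1.2, 1.5, 2.0], min_consecutive=3)
```

Requiring several consecutive samples trades a little detection delay for fewer noisy notifications, which is the balance this section describes.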

Understanding error budget policies

Error budget policies are a critical component of a site reliability engineering (SRE) approach and play a pivotal role in maintaining a balance between system reliability and innovation. These policies define acceptable error thresholds for a service or system, essentially quantifying how much unreliability a team can tolerate before it impacts the user experience.

By setting clear error budget policies, organizations can achieve several important objectives.

Error budget policies encourage a culture of accountability and shared responsibility among development and operations teams.
They provide a framework for prioritizing reliability work over new feature development when the budget is at risk, so that system stability is not compromised.
They support decision-making by offering a concrete metric for evaluating the trade-offs between innovation and reliability enhancements.

Ultimately, error budget policies help organizations strike the right balance between innovation and maintaining high service reliability, keeping customers satisfied, and ensuring compliance with service level objectives (SLOs).

Establish a mature SLI and SLO alerting strategy

By building a mature alerting strategy for SLIs, SLOs, error budgets, and burn rates, you can detect and resolve issues sooner to help avoid missing internal SLOs and your customers’ SLAs. We'll show you how to do this using New Relic. First, identify your business-critical applications and services and roll them up into SLIs and SLOs with the one-click setup in New Relic. Then, optimize your alerts based on the best practices described in the How to apply alerts to error budgets and burn rates with New Relic section. When you optimize your alerts this way, you’ll be able to immediately analyze your performance and make informed decisions about where you need to invest resources to meet your business objectives.

Service level management allows SRE and DevOps teams to proactively establish processes that speed up their ability to write code, push to production, and identify bugs or outages quickly, often before customers ever experience an issue. These enhanced alerts for error budgets and burn rates notify you of customer-impacting problems faster, so you can take action to help your organization meet SLOs and SLAs. In addition to these strategies, it's also crucial to conduct thorough incident postmortems when errors occur. This helps in understanding the root causes of service level breaches and in formulating more effective response strategies. Learn more about conducting effective incident postmortems.

Make sure you avoid alert fatigue!

When you implement service levels properly, you’ll be able to design alert policies that make sense for your teams. As a byproduct, you can prioritize the notifications that relate to customer-impacting issues, which reduces overall noise in your incident management lifecycle and drives clarity and focus. New Relic service level management can lead to better customer and business outcomes, and it can also improve the quality of life for SRE and DevOps teams by reducing alert fatigue.

A few error budget best practices

These practices aim to help organizations effectively implement error budgeting as part of their SRE practices, ensuring a balance between reliability and innovation while meeting user expectations and business objectives. Here are some critical best practices for error budgeting:

Define service level objectives (SLOs): SLOs define the target level of reliability or performance for a service. These are typically expressed as a percentage, such as the share of time the service is up or the share of requests served within a response-time threshold. SLOs should be realistic and based on user expectations and business requirements.
Monitor and measure service metrics: Collect and monitor relevant metrics to track the performance and reliability of the service against the defined SLOs. These metrics could include uptime, latency, error rates, and user satisfaction.
Use error budgets to drive prioritization: Error budgets should inform prioritization decisions. When error budgets are being consumed too quickly, prioritize stability and reliability efforts over new feature development. Conversely, when error budgets are underutilized, teams can focus on innovation and feature development.
Automate error budget calculations: Implement automated systems to track error budgets and provide real-time visibility into their consumption. This automation can help teams react quickly to system reliability changes and make informed decisions about prioritization.
Regularly review and adjust SLOs and error budgets: As business requirements, user expectations, and system characteristics evolve, regularly review and adjust SLOs and error budgets accordingly. This ensures that they remain relevant and aligned with the organization's goals.
Learn from exceeding error budgets: When error budgets are exceeded, conduct thorough post-incident reviews (PIRs) to understand the root causes and identify areas for improvement. Use these learnings to prevent similar incidents in the future and improve overall system reliability.
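The practices on automating error budget calculations and using them to drive prioritization can be sketched in a few lines. This is an illustrative example only: the `BudgetStatus` type, the `budget_status` function, the 75% caution threshold, and the recommendation strings are assumptions made for this sketch, and real thresholds would come from your own error budget policy. It also assumes you can estimate the total number of events expected over the SLO period.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_bad_events: float   # bad events the SLO period can absorb
    consumed_fraction: float    # share of the budget already spent
    recommendation: str         # hypothetical policy outcome


def budget_status(
    expected_total_events: int,
    bad_events_so_far: int,
    slo_target: float,
    caution_at: float = 0.75,   # assumed policy threshold
    freeze_at: float = 1.0,     # budget fully spent
) -> BudgetStatus:
    """Compare bad events so far with the budget for the whole period."""
    allowed = expected_total_events * (1.0 - slo_target)
    consumed = bad_events_so_far / allowed if allowed > 0 else 0.0
    if consumed >= freeze_at:
        recommendation = "prioritize reliability work"
    elif consumed >= caution_at:
        recommendation = "slow feature rollout"
    else:
        recommendation = "continue feature development"
    return BudgetStatus(allowed, consumed, recommendation)


# 1,000,000 expected events at a 99.9% SLO allow roughly 1,000 bad events.
status = budget_status(1_000_000, 400, 0.999)
print(status.recommendation)
```

Keeping the thresholds as parameters makes it easy to revisit them during the regular SLO and error budget reviews described above.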

How to apply alerts to error budgets and burn rates with New Relic

Your team can set up multiple alerts on SLI- and SLO-related performance degradations to detect incidents quickly and resolve them before they affect your customers. Let's dive into how you can set up alerts for error budgets, fast burn rates, and SLI attainment with New Relic.

Configuring alerts

1. Select Alert in the top right corner of the service level details page to open up the alert configuration menu.

2. From the alert configuration menu, select one of these: Fast-burn rate, Error budget consumption, or SLO compliance.

3. Follow the guided setup to create alert rules for fast burn rates, error budgets, and SLO compliance:

Error budgets: When you set service level objectives, you can configure error budget alerts to notify you when your remaining error budget falls below a certain threshold. These notifications tell you when your service has consumed enough of its budget that your team needs to take action, and they indicate that incidents with a high business impact are occurring. When these alerts are triggered, you can prioritize them and engage the proper teams to start diagnosing the source of the problem.
Fast burn rate: Fast burn rate alerts warn you of a sudden, large change in consumption that, if uncorrected, will exhaust your error budget quickly. We’ve incorporated Google's best practice of alerting when 2% of the SLO error budget is consumed within one hour. This means that, if triggered, a service would consume its entire error budget in 50 hours if left unattended. But if you want to configure your alert differently, you have that flexibility. During alert setup, you can customize the consumption percentages and time windows based on your needs and preferences.
SLO compliance: Use this alert type when you want to be alerted when your SLI is below its SLO for longer than a set period of time. For more tips and tricks for setting up alerts based on SLI attainment, see our documentation.

4. On the alert configuration page, you’ll find recommended, pre-configured alert thresholds based on the historical performance of the entities related to your service levels.
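The fast-burn default described in step 3 (2% of the budget consumed within one hour) can be checked with a little arithmetic. The sketch below uses hypothetical function names and, for the burn-rate conversion, assumes a 30-day SLO period; substitute your own period length.

```python
def hours_to_exhaustion(budget_fraction: float, window_hours: float) -> float:
    """Hours until the whole budget is gone if `budget_fraction` of it
    keeps being consumed every `window_hours`."""
    if budget_fraction <= 0:
        raise ValueError("budget_fraction must be positive")
    return window_hours / budget_fraction


def equivalent_burn_rate(
    budget_fraction: float,
    window_hours: float,
    slo_period_hours: float,
) -> float:
    """Burn rate implied by consuming `budget_fraction` of the budget in
    `window_hours`, relative to an SLO period of `slo_period_hours`."""
    return budget_fraction * slo_period_hours / window_hours


# 2% per hour exhausts the budget in about 50 hours.
remaining_hours = hours_to_exhaustion(0.02, 1)

# Assumption: a 30-day SLO period (720 hours) gives a burn rate of about 14.4.
fast_burn = equivalent_burn_rate(0.02, 1, 30 * 24)
```

If you customize the consumption percentage or time window during alert setup, the same two functions show how quickly the new threshold would exhaust the budget.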

Get started setting up service levels today

1. Log in to New Relic and select All Capabilities at the top of the left-hand navigation menu.

2. Select Service Levels.

If you’ve already configured SLIs and SLOs, select any service level.
If you haven’t configured SLIs and SLOs, select Add a service level and follow the detailed instructions in our Create and edit SLIs and SLOs documentation.

3. To add an alert on the selected service level, click on Alert and follow the in-product guided setup or review the instructions in our Alerting on service levels documentation (which also contains example alert configurations).

Next steps

Service level management is available with a full platform user license. Try it for free today at newrelic.com/signup. Your free account includes 100 GB/month of data ingest, one full platform user, and unlimited basic users.

Read the service level management documentation to learn more about how service levels work and how to customize your SLIs and SLOs.

For a quick demo, watch this video.

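Dan Holloran

Dan Holloran is a product marketing manager at New Relic. He has seven years of experience in content marketing, four of them in DevOps, where he has focused on incident management, observability, and engineering best practices. In his free time, Dan enjoys fly fishing, writing, and learning new things.

John Withers, Director of Product Marketing

John Withers is a director of product marketing at New Relic and a dog lover.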

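The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.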
