5 Common Sources of Alert Fatigue for SRE and DevOps Teams

Published Jan 23, 2020. 3 min read

If you’ve ever been an on-call SRE, you’re familiar with alert fatigue: the burned-out feeling that creeps in after responding to alert after alert from tons of services and tools across your stack. Not only is this phenomenon exhausting, but constant pages also limit your ability to focus on other work, even if you’re simply clicking “acknowledge” (“acking”). Research has shown that people can lose up to 40% of their productive time to brief context switches. Many of the alerts behind these never-ending streams of pages are neither urgent nor important, and don’t require any human action.

So, where are they coming from?

Here are five sources of noise that can create alert fatigue and distract your on-call DevOps or SRE team from the real issues that need attention in your production system.

Irrelevant alerts

Unused services, decommissioned projects, and issues that other teams are already handling all generate noise. These alerts are frequent enough to be annoying, but turning them off at their source often doesn’t seem worth the legwork. The notifications come from all kinds of tools in your production system and tend to get acked quickly and then ignored, since there usually isn’t an actionable issue behind them.

Low-priority alerts

Some noisemakers indicate problems that may eventually need to be addressed, but are low on the current priority list. Keeping these alerts configured can be a useful reminder to investigate or address the root cause of the issues eventually, but in the short-term, they’re probably not adding value.

Flapping alerts

Acking flapping issues can feel like playing whack-a-mole. Flapping alerts fire and resolve over and over as a condition hovers around its threshold. They’re a good indicator of a growing problem in your system, but they can be a distraction when you’re trying to problem-solve, sometimes prompting SREs to silence pages or blindly ack incoming issues. Unrelated issues can get lost in piles of flapping notifications, which puts your team’s ability to notice important problems at risk.
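One way to make flapping visible is to count how often an alert changes state in a short period. The sketch below is only an illustration under assumed names: the AlertEvent type, the "open"/"closed" states, and the 30-minute window and four-transition threshold are all hypothetical and would need tuning for a real system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AlertEvent:
    alert_key: str       # hypothetical identifier, e.g. "api:latency"
    state: str           # assumed to be "open" or "closed"
    timestamp: datetime


def is_flapping(events, window=timedelta(minutes=30), max_transitions=4):
    """Return True if one alert changed state at least `max_transitions`
    times within any span no longer than `window`.

    `events` is assumed to hold the history of a single alert key.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)

    # Timestamps at which the state differs from the previous event.
    transitions = [
        cur.timestamp
        for prev, cur in zip(ordered, ordered[1:])
        if cur.state != prev.state
    ]

    # Any run of `max_transitions` consecutive transitions that fits
    # inside one window counts as flapping.
    for i in range(len(transitions) - max_transitions + 1):
        if transitions[i + max_transitions - 1] - transitions[i] <= window:
            return True
    return False
```

An alert flagged this way could be collapsed into a single "this check is flapping" notification instead of paging on every open and close, so unrelated issues stay visible.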

Duplicate alerts

Duplicate alerts are similar to flapping alerts, but they’re more a symptom of redundant monitoring configuration than of an underlying production issue. They can be another source of pager fatigue: you’re aware of the problem after the first notification, so additional alerts telling you that it’s still there only add frustration.
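A common way to cut duplicates is to fingerprint each alert and suppress repeats for a while after the first page. The following is a minimal sketch, assuming alerts arrive as dictionaries with hypothetical "service" and "check" fields; the 15-minute suppression period is likewise an arbitrary choice for illustration.

```python
from datetime import datetime, timedelta


class Deduplicator:
    """Suppress repeat pages for the same problem within a time period."""

    def __init__(self, suppress_for=timedelta(minutes=15)):
        self.suppress_for = suppress_for
        self._last_paged = {}  # fingerprint -> time the last page was sent

    @staticmethod
    def fingerprint(alert):
        # Assumption: service plus check name identify "the same problem".
        return (alert["service"], alert["check"])

    def should_page(self, alert, now):
        key = self.fingerprint(alert)
        last = self._last_paged.get(key)
        if last is not None and now - last < self.suppress_for:
            return False  # already paged for this problem recently
        self._last_paged[key] = now
        return True
```

Calling should_page for each incoming alert lets the first notification through and drops the repeats until the suppression period has passed. What belongs in the fingerprint is the real design decision: too broad and distinct problems get merged, too narrow and duplicates slip through.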

Correlated alerts

These are the toughest but possibly most important sources of noise to identify. Correlated alerts are separate notifications, often from different tools or layers of your stack, that stem from the same underlying issue. Getting to the root cause is much faster when you have the full context of an issue’s impact across your stack. Without that context, you can end up in rabbit holes of investigation and troubleshooting that aren’t worth your time.
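To make the idea of correlation concrete, here is a deliberately naive heuristic: group alerts that name the same component and arrive close together in time. This is a sketch under stated assumptions, not how any particular product's correlation engine works; the "component" and "timestamp" fields and the five-minute window are hypothetical.

```python
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts that share a component and arrive within `window`
    of the most recent alert already in the group.

    Each alert is assumed to be a dict with "component" and "timestamp".
    Returns a list of groups, each a list of alerts in time order.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for group in groups:
            last = group[-1]
            same_component = alert["component"] == last["component"]
            close_in_time = alert["timestamp"] - last["timestamp"] <= window
            if same_component and close_in_time:
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one.
            groups.append([alert])
    return groups
```

Each resulting group can then be presented as one incident with all of its related alerts attached. Real systems usually need richer signals than a shared tag, such as service dependencies or alert text similarity.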

Take a quick scroll through your team’s pages from the past day or week and think about each one. How many fit into one of these categories? Noisy pages like these create distractions, build frustration, and hide real problems, and as the complexity of modern production systems continues to grow, the volume will only increase.

Cure alert fatigue with the right solution

Implementing an AIOps platform, like New Relic AI, can help you tackle alert noise across your stack and create a continuously improving, streamlined system for correlating and prioritizing incidents. New Relic AI is powered by many layers of machine learning-driven filters and logic, including a correlation engine that looks for all of these sources of noise. It also adapts over time to provide more relevant alerts, reducing pager fatigue and empowering your team to stay focused on important issues. Learn more about New Relic AI (currently in private beta) today.

By Guy Fighel

Guy Fighel is general manager of applied intelligence and group vice president of product engineering at New Relic. He leads AIOps product and engineering at New Relic and is responsible for the company’s overall artificial intelligence and machine learning strategy. Guy was co-founder and CTO of SignifAI, an event intelligence company that New Relic acquired in 2019.

The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and are not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
