If you’ve ever been an on-call SRE, you’re familiar with alert fatigue: the burned-out feeling that creeps in after responding to alert after alert from the many services and tools across your stack. Not only is this phenomenon exhausting, but constant pages also limit your ability to focus on other work, even if you’re simply clicking “acknowledge” (“acking”). Research has shown that people lose up to 40% of their productive time to brief context switches. Many of the alerts behind these never-ending streams of pages are neither urgent nor important, and don’t require any human action.

So, where are they coming from?

Here are five sources of noise that can create alert fatigue and distract your on-call DevOps or SRE team from the real issues that need attention in your production system.

Irrelevant alerts

Unused services, decommissioned projects, and issues that other teams are actively handling are sources of noise prevalent enough to be annoying, but not always worth the legwork of turning the alerts off at their source. These notifications come from all kinds of tools in your production system and tend to get acked quickly but largely ignored, since there usually isn’t an actionable issue underneath.

Low-priority alerts

Some noisemakers indicate problems that may eventually need to be addressed but are low on the current priority list. Keeping these alerts configured can be a useful reminder to eventually investigate or address the root cause, but in the short term, they’re probably not adding value.

Flapping alerts

Flapping alerts open and close repeatedly as a metric oscillates around its threshold, so acking them can feel like playing whack-a-mole. These alerts are a good indicator of a growing problem in your system, but they become a distraction when you’re trying to problem-solve, sometimes prompting SREs to silence pages or blindly ack incoming issues. Unrelated issues can get lost in piles of flapping notifications, putting your team’s ability to notice important problems at risk.
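
To make the idea concrete, here’s a rough sketch of how a flap detector might work: count how many times an alert switches between firing and resolved inside a recent window, and flag it once that count crosses a threshold. The event format, window, and threshold below are illustrative assumptions, not any particular tool’s behavior.

    from datetime import datetime, timedelta

    def is_flapping(events, window=timedelta(minutes=30), max_transitions=4):
        # events: (timestamp, state) tuples sorted by time; state is "open" or "closed"
        if not events:
            return False
        cutoff = events[-1][0] - window
        recent = [state for ts, state in events if ts >= cutoff]
        # Count how many times the alert changed state inside the window
        transitions = sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)
        return transitions >= max_transitions

    now = datetime(2024, 1, 1, 12, 0)
    history = [
        (now - timedelta(minutes=25), "open"),
        (now - timedelta(minutes=22), "closed"),
        (now - timedelta(minutes=18), "open"),
        (now - timedelta(minutes=15), "closed"),
        (now - timedelta(minutes=5), "open"),
    ]
    print(is_flapping(history))  # True: four state changes in the last half hour

An alert flagged as flapping like this can be rolled up into a single notification, or routed to a dashboard instead of a pager, until someone has time to fix the underlying threshold or resource issue.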

Duplicate alerts

Similar to flapping alerts, but more a symptom of redundant monitoring configuration than of an underlying production issue, duplicate alerts are another source of pager fatigue. You’re aware of the problem after the first notification, so additional alerts telling you that it’s still there just add frustration.
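
One common way to tame duplicates is to fingerprint each alert by the fields that identify “the same problem” and suppress repeats seen within a short window. The sketch below assumes a simple dict-based alert format and an illustrative 15-minute window; which fields belong in the fingerprint depends on your monitoring setup.

    from datetime import timedelta

    def fingerprint(alert):
        # The fields that define "the same problem" vary by setup; source,
        # condition, and affected entity are a reasonable starting point.
        return (alert["source"], alert["condition"], alert["entity"])

    def dedupe(alerts, window=timedelta(minutes=15)):
        last_paged = {}  # fingerprint -> timestamp of the last alert we paged on
        paged = []
        for alert in sorted(alerts, key=lambda a: a["timestamp"]):
            fp = fingerprint(alert)
            prev = last_paged.get(fp)
            if prev is None or alert["timestamp"] - prev > window:
                paged.append(alert)              # new problem, or quiet long enough
                last_paged[fp] = alert["timestamp"]
            # otherwise: same fingerprint within the window, drop the page
        return paged

Updating the timestamp only when a page is actually sent means a persistent problem still re-pages every 15 minutes rather than going completely silent, which is usually the safer trade-off.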

Correlated alerts

These are the toughest, but possibly the most important, sources of noise to identify. When a single underlying incident triggers alerts from multiple services, each page looks like a separate problem until you connect them. Getting to the root cause is far faster with the full context of the issue’s impact across your stack, and missing that context can lead you down rabbit holes of investigation and troubleshooting that aren’t worth your time.
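
Real correlation engines lean on topology, metadata, and historical co-occurrence, but even a simple time-based grouping shows the idea: alerts that fire within a few minutes of each other are likely symptoms of the same incident and can be presented together. The field names and five-minute window below are illustrative assumptions.

    from datetime import timedelta

    def correlate_by_time(alerts, window=timedelta(minutes=5)):
        # Group alerts into candidate incidents: an alert joins the current group
        # if it fired within `window` of the previous alert; otherwise a new group starts.
        incidents = []
        current = []
        for alert in sorted(alerts, key=lambda a: a["timestamp"]):
            if current and alert["timestamp"] - current[-1]["timestamp"] > window:
                incidents.append(current)
                current = []
            current.append(alert)
        if current:
            incidents.append(current)
        return incidents

With grouping like this, a database latency alert, an upstream API error-rate alert, and a checkout timeout alert that all fire within the same few minutes land in one candidate incident, so the on-call engineer sees one problem with context instead of three seemingly unrelated pages.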

Take a quick scroll through your team’s pages from the past day or week and think about each one. How many fit into one of these categories? Noisy pages like these create distractions, build frustration, and hide real problems, and as the complexity of modern production systems continues to grow, the volume will only increase.

Cure alert fatigue with the right solution

Implementing an AIOps platform like New Relic AI can help you tackle alert noise across your stack and create a continuously improving, streamlined system for correlating and prioritizing incidents. New Relic AI is powered by layers of machine learning-driven filters and logic, including a correlation engine that looks for all of these sources of noise and adapts over time to deliver more relevant alerts, reducing pager fatigue and keeping your team focused on important issues. Learn more about New Relic AI (currently in private beta) today.