If you’ve ever been an on-call SRE, you’re familiar with alert fatigue: the burned-out feeling that creeps in after responding to alert after alert from the many services and tools across your stack. Not only is this phenomenon exhausting, but constant pages also limit your ability to focus on other work, even if you’re simply clicking “acknowledge” (“acking”). Research has shown that people can lose up to 40% of their productive time to brief context switches. And many of the alerts behind these never-ending streams of pages are neither urgent nor important, and don’t require any human action.
So, where are they coming from?
Here are five sources of noise that can create alert fatigue and distract your on-call DevOps or SRE team from the real issues that need attention in your production system.
Irrelevant alerts
Unused services, decommissioned projects, and issues already being handled by other teams are sources of noise that are prevalent enough to be annoying but not always worth the legwork of turning the alerts off at their source. These notifications come from all kinds of tools in your production system and tend to get acked quickly but largely ignored, since there usually isn’t an actionable underlying issue.
Low-priority alerts
Some noisemakers indicate problems that may eventually need to be addressed but are low on the current priority list. Keeping these alerts configured can be a useful reminder to investigate or address the root cause eventually, but in the short term, they’re probably not adding value.
Flapping alerts
Acking flapping issues can feel like playing whack-a-mole. These alerts can be a good indicator of a growing problem in your system, but they’re a distraction when you’re trying to problem-solve, sometimes prompting SREs to silence pages or blindly ack incoming issues. Unrelated issues can get lost in piles of flapping notifications, putting your team’s ability to notice important problems at risk.
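One common way to tame flapping alerts is to count how often an alert changes state inside a sliding window and hold off on paging once the count crosses a threshold. The sketch below is a minimal illustration of that idea, not how any particular monitoring tool implements it; the alert key, window size, and threshold are assumptions.

```python
from collections import deque
from time import time


class FlapDetector:
    """Illustrative flap detection: count firing/resolved transitions per
    alert key inside a sliding window and flag the alert as flapping once
    the number of transitions crosses a threshold."""

    def __init__(self, window_seconds=600, max_transitions=4):
        self.window_seconds = window_seconds
        self.max_transitions = max_transitions
        self._transitions = {}  # alert_key -> deque of transition timestamps

    def record_transition(self, alert_key, now=None):
        now = now or time()
        history = self._transitions.setdefault(alert_key, deque())
        history.append(now)
        # Drop transitions that have aged out of the window.
        while history and now - history[0] > self.window_seconds:
            history.popleft()

    def is_flapping(self, alert_key):
        return len(self._transitions.get(alert_key, ())) >= self.max_transitions


detector = FlapDetector()
for _ in range(5):  # the same check fires and resolves over and over
    detector.record_transition("disk-usage:host-42")
print(detector.is_flapping("disk-usage:host-42"))  # True -> page once, not five times
```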
Duplicate alerts
Similar to flapping alerts, but more a symptom of redundant monitoring configuration than an underlying production issue, duplicate alerts can be another source of pager fatigue. You’re aware of the problem after the first notification, so additional alerts letting you know that it’s still there can add frustration.
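As a hedged illustration of deduplication, one approach is to fingerprint each alert from the fields that identify the underlying problem and drop repeats seen within a suppression window. The field names and window length below are assumptions made for the sketch, not a schema from any specific tool.

```python
import hashlib
from time import time

SUPPRESSION_WINDOW = 300  # seconds; illustrative value
_last_seen = {}           # fingerprint -> timestamp of last notification sent


def fingerprint(alert):
    """Hash the fields that identify the underlying problem (assumed schema)."""
    key = "|".join([alert["monitor"], alert["resource"], alert["condition"]])
    return hashlib.sha256(key.encode()).hexdigest()


def should_notify(alert, now=None):
    """Return True only for the first alert with this fingerprint in the window."""
    now = now or time()
    fp = fingerprint(alert)
    last = _last_seen.get(fp)
    if last is not None and now - last < SUPPRESSION_WINDOW:
        return False  # duplicate: you already know about this problem
    _last_seen[fp] = now
    return True


alert = {"monitor": "cpu", "resource": "api-gateway", "condition": "p95 > 90%"}
print(should_notify(alert))  # True  -> page
print(should_notify(alert))  # False -> suppressed duplicate
```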
Correlated alerts
These are the toughest but possibly the most important sources of noise to identify. Getting to the root cause of an issue is far faster when you have the full context of its impact across your stack, and missing that context can lead you down rabbit holes of investigation and troubleshooting that aren’t worth your time.
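To make the idea of correlation concrete, the rough sketch below groups alerts that arrive close together in time and share a tag (for example, a service or cluster) into a single incident. The grouping rule and alert fields are assumptions for illustration only, not a description of any specific correlation engine.

```python
from dataclasses import dataclass, field
from typing import List

CORRELATION_WINDOW = 120  # seconds; illustrative value


@dataclass
class Incident:
    tag: str
    alerts: List[dict] = field(default_factory=list)

    @property
    def last_seen(self):
        return max(a["timestamp"] for a in self.alerts)


def correlate(alerts):
    """Group alerts that share a tag and arrive within the window (sketch only)."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for incident in incidents:
            if (incident.tag == alert["tag"]
                    and alert["timestamp"] - incident.last_seen <= CORRELATION_WINDOW):
                incident.alerts.append(alert)
                break
        else:
            incidents.append(Incident(tag=alert["tag"], alerts=[alert]))
    return incidents


raw = [
    {"tag": "checkout-service", "name": "latency high", "timestamp": 0},
    {"tag": "checkout-service", "name": "error rate high", "timestamp": 45},
    {"tag": "billing-db", "name": "replication lag", "timestamp": 400},
]
print(len(correlate(raw)))  # 2 incidents instead of 3 separate pages
```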
Take a quick scroll through your team’s pages from the past day or week and think about each one. How many fit into one of these categories? Noisy pages like these create distractions, build frustration, and hide real problems, and as the complexity of modern production systems continues to grow, the volume will only increase.
Cure alert fatigue with the right solution
Implementing an AIOps platform like New Relic AI can help you tackle alert noise across your stack and create a continuously improving, streamlined system for correlating and prioritizing incidents. New Relic AI is powered by multiple layers of machine learning-driven filters and logic: its correlation engine looks for all of these sources of noise and adapts over time to deliver more relevant alerts, reducing pager fatigue and empowering your team to stay focused on the issues that matter. Learn more about New Relic AI (currently in private beta) today.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on those sites.