The demands on your alerting practices have only increased with the shifts in modern software practices. Orchestrated container environments, microservices architectures, serverless, and cloud-based infrastructures are very different approaches to building and managing software than traditional monoliths running in static on-premise data centers. Not surprisingly, monitoring and alerting have had to evolve to address the new challenges presented by these modern systems. Observability, which has grown in popularity, encompasses many of the more sophisticated practices, tools, and data used to understand and operate these more complex systems effectively.
Organizations can find alerting to be an inherently difficult practice due to structural and competing forces, such as:
- Sensitivity. Overly sensitive systems cause excessive false positive alerts, while less sensitive systems can miss issues and have false negatives. Determining the correct alerting threshold requires ongoing tuning and refinement.
- Fatigue. The common approach to sensitivity is for teams to be more conservative when they set up alerts, but this results in a more sensitive and noisy alerting system. If teams encounter too many false positives, they will begin to ignore alerts and miss real issues, defeating the purpose of an alerting system.
- Maintenance. Systems grow and evolve quickly, but teams are often slow to update their alerting policies. This leads to an alerting strategy that is simultaneously filled with outdated policy deadwood and gaps where teams aren’t providing coverage for newer changes in their systems.
- Fragmented information. Many teams use multiple different systems to manage alerts across increasingly complex technology stacks, which means that the information needed to diagnose and troubleshoot a problem may be spread across multiple tools.
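The sensitivity trade-off described above can be illustrated with a small sketch. It assumes a simple rule in which an alert fires only when a metric stays above a threshold for a number of consecutive samples; the function name `breaches`, the sample values, and the threshold are all hypothetical and are not taken from any particular alerting product.

```python
from typing import Sequence


def breaches(samples: Sequence[float], threshold: float, sustained: int) -> bool:
    """Return True if `sustained` consecutive samples exceed `threshold`."""
    run = 0  # length of the current streak of samples above the threshold
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= sustained:
            return True
    return False


# Hypothetical CPU percentages: one isolated spike, then a sustained climb.
cpu = [42, 97, 51, 48, 93, 95, 96, 50]

fires_on_spike = breaches(cpu, threshold=90, sustained=1)      # very sensitive
fires_on_sustained = breaches(cpu, threshold=90, sustained=3)  # tolerates spikes
fires_on_long_run = breaches(cpu, threshold=90, sustained=4)   # may miss issues

print(fires_on_spike, fires_on_sustained, fires_on_long_run)
```

Raising `sustained` filters out the isolated spike (fewer false positives), but raising it too far also hides the real three-sample climb (a false negative), which is why this kind of setting usually needs ongoing tuning.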
Changes in technology
Rapid changes in modern technology stacks are demanding different approaches to alerting; for example:
- Resources are ephemeral. Tracking resource metrics can be difficult when resources readily appear and disappear on demand. For example, a container orchestration tool like Kubernetes can restart a container that becomes unresponsive while its CPU is saturated (for instance, after failed health checks) and then bring up a new one. Measuring CPU saturation in a single container then becomes much less important than reporting on the pattern of this CPU saturation behavior.
- Systems should scale dynamically. In a modern DevOps world, systems scale up and down quickly. If you have 5 hosts now, you may have 20 hosts an hour later, and just 10 in the following hour. Are your alerting policies dynamically added to the newly created hosts and adjusted for the hosts you’ve removed? Static alert policies are useless in an environment under constant change.
- Services are abstracted. Cloud infrastructure providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) are increasingly abstracting services and taking on the operational responsibilities that used to be left to Operations teams. Because of this, teams need alerting higher up the stack.
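One way to keep policies in step with a fleet that scales up and down is to target hosts by label rather than by name. The following is a minimal sketch of that idea, not the behavior of any specific tool; the `Host` and `Policy` classes, the label keys, and the host names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Host:
    name: str
    labels: Dict[str, str]


@dataclass
class Policy:
    name: str
    selector: Dict[str, str]  # every key/value here must match a host's labels

    def applies_to(self, host: Host) -> bool:
        return all(host.labels.get(k) == v for k, v in self.selector.items())


def covered_hosts(policy: Policy, hosts: List[Host]) -> List[str]:
    """Names of the hosts the policy currently applies to."""
    return [h.name for h in hosts if policy.applies_to(h)]


def make_web_fleet(count: int) -> List[Host]:
    """Build a hypothetical fleet of `count` web hosts plus one database host."""
    fleet = [Host(f"web-{i}", {"role": "web", "env": "prod"}) for i in range(count)]
    fleet.append(Host("db-0", {"role": "db", "env": "prod"}))
    return fleet


web_latency = Policy("web-latency", {"role": "web", "env": "prod"})

# The same policy definition covers the fleet at 5, 20, and 10 web hosts.
for size in (5, 20, 10):
    print(size, len(covered_hosts(web_latency, make_web_fleet(size))))
```

Because the policy is evaluated against whatever hosts exist at the moment, nothing has to be edited when hosts are added or removed.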
System outcomes vs. root causes
In Google’s Site Reliability Engineering book, the company presents the case for distinguishing symptoms from causes as part of an observability strategy. This reflects a necessary shift toward using observable outcomes to infer how well the internals of a system are running. Only when a symptom tells you something is wrong is it necessary to peel back the cover to see what the cause may be. In mature, static systems with known failure modes, cause-based alerting made sense because teams could identify and understand key bottlenecks. However, with more ephemeral, dynamic, and abstracted systems, new failure modes appear as systems continually change, and identifying specific infrastructure “causes” in advance has become far less valuable. Understanding the final symptoms and outcomes is the practical benchmark to measure against.
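As a rough illustration of symptom-based alerting, the sketch below alerts on the error ratio that users actually experience across the whole service, regardless of which instance is responsible. The function names, the per-instance request counts, and the 1% threshold are assumptions chosen for the example.

```python
from typing import Iterable, Tuple


def error_ratio(windows: Iterable[Tuple[int, int]]) -> float:
    """Aggregate (total_requests, failed_requests) pairs into one ratio."""
    total = 0
    failed = 0
    for total_requests, failed_requests in windows:
        total += total_requests
        failed += failed_requests
    return failed / total if total else 0.0


def symptom_alert(windows: Iterable[Tuple[int, int]],
                  max_error_ratio: float = 0.01) -> bool:
    """Fire when the service-wide error ratio exceeds the allowed ratio."""
    return error_ratio(windows) > max_error_ratio


# Hypothetical per-instance counts for one evaluation window.
healthy = [(1000, 2), (1200, 3), (800, 1)]    # 6 failures in 3000 requests
degraded = [(1000, 2), (1200, 3), (800, 40)]  # 45 failures in 3000 requests

print(symptom_alert(healthy), symptom_alert(degraded))
```

The alert says nothing about why requests are failing; it only establishes that users are affected, which is the signal to start investigating causes.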
Tracking system outcomes in practice
So what does it mean to “track symptoms”? While it’s easy to focus only on the number of 9s in your 99.999% uptime SLA, that doesn’t capture the actual outcomes your systems generate for your customers and your business. Being able to connect “10 minutes of outage” to “1,200 lost orders” and “$24,000 in revenue loss” provides a much more strategic measure of how your systems impact your business. From there, drilling into your underlying services and systems can help you attribute that cost to its source, such as the database crash that led to the outage.
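The arithmetic behind that example can be sketched as follows. The order rate (120 orders per minute) and average order value ($20) are assumptions back-derived from the figures above, and the sketch further assumes orders arrive at a steady rate during the outage.

```python
from typing import Tuple


def outage_impact(outage_minutes: float,
                  orders_per_minute: float,
                  avg_order_value: float) -> Tuple[float, float]:
    """Estimate lost orders and lost revenue for an outage of a given length."""
    lost_orders = outage_minutes * orders_per_minute
    lost_revenue = lost_orders * avg_order_value
    return lost_orders, lost_revenue


# Assumed inputs: 1,200 orders / 10 minutes = 120 per minute,
# and $24,000 / 1,200 orders = $20 per order.
lost_orders, lost_revenue = outage_impact(
    outage_minutes=10, orders_per_minute=120, avg_order_value=20.0
)

print(f"{lost_orders:,.0f} lost orders, ${lost_revenue:,.0f} in lost revenue")
```

In practice the order rate and order value would come from business metrics for the affected time window rather than from fixed constants.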