Proactive alerting strategies enable you to respond to problems before they affect your customers. A great place to start with alerting is with your team’s SLOs. In fact, you can group SLOs together logically to provide an overall Boolean indicator of whether your cloud-based service is meeting expectations or not—for example, “95% of requests complete within 250 ms and service availability is 99.99%”—and then set an alert against that indicator.
By breaking down the quantitative performance metrics of a cloud-based service or technology, you can identify the most appropriate alert type for each metric. For instance, you could set an alert to notify on-call responders if web transaction times exceed half a second, or if error rates surpass 0.20%.
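As a minimal sketch of both ideas, here is how a composite Boolean SLO indicator and per-metric threshold alerts might be evaluated together. The metric names, snapshot values, and data structure are illustrative assumptions, not tied to any particular monitoring product:

```python
from dataclasses import dataclass


@dataclass
class MetricSnapshot:
    """Latest aggregated values from your monitoring backend (illustrative)."""
    p95_latency_ms: float     # 95th-percentile request latency
    availability_pct: float   # rolling service availability
    avg_transaction_s: float  # average web transaction time
    error_rate_pct: float     # percentage of failed requests


def slo_indicator(m: MetricSnapshot) -> bool:
    # Composite Boolean SLO from the example above: "95% of requests
    # complete within 250 ms and service availability is 99.99%".
    return m.p95_latency_ms <= 250 and m.availability_pct >= 99.99


def threshold_alerts(m: MetricSnapshot) -> list[str]:
    # Per-metric alert conditions for on-call notification.
    alerts = []
    if m.avg_transaction_s > 0.5:  # transactions slower than half a second
        alerts.append(f"Slow transactions: {m.avg_transaction_s:.2f} s")
    if m.error_rate_pct > 0.20:    # error rate above 0.20%
        alerts.append(f"Elevated error rate: {m.error_rate_pct:.2f}%")
    return alerts


snapshot = MetricSnapshot(240.0, 99.995, 0.7, 0.05)
fired = threshold_alerts(snapshot)
if not slo_indicator(snapshot) or fired:
    print("ALERT:", fired or ["Composite SLO violated"])
```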
For a simple alerting framework, consider the following table:
| Question | Metrics and KPIs |
|---|---|
| Are we open for business? | Use synthetic monitoring to set up automated pings and alert on availability. |
| How's our underlying infrastructure? | Manage and troubleshoot your hosts and containers with infrastructure-based monitoring. |
| How's the health of our application? | Use real end-user metrics to understand your application's backend performance. Use metric and trace data from open source tools, and display that information alongside all the other systems and services data you're managing. |
| How do I troubleshoot a system error? | Use logs or distributed tracing to search for and investigate the root cause across your applications and infrastructure. |
| How's the overall quality of our application? | Use an Apdex score to quickly assess an application's quality (see the sketch below the table). |
| How are our customers doing? | Monitor front-end and mobile user experiences. |
| How's our overall business doing? | Focus on key transactions within an application and tie them to expected business outcomes to correlate application and business performance. |
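To make the Apdex row concrete: Apdex is a standard index computed as (satisfied + tolerating / 2) / total, where responses faster than a chosen threshold T count as satisfied and those between T and 4T as tolerating. A minimal sketch, with T = 0.5 s as an arbitrary example:

```python
def apdex(response_times_s: list[float], t: float = 0.5) -> float:
    """Standard Apdex score: (satisfied + tolerating / 2) / total samples.

    Responses faster than T are "satisfied"; between T and 4T, "tolerating";
    anything slower is "frustrated". T = 0.5 s here is an arbitrary choice.
    """
    satisfied = sum(1 for r in response_times_s if r <= t)
    tolerating = sum(1 for r in response_times_s if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times_s)


print(apdex([0.2, 0.4, 0.9, 2.5]))  # (2 + 0.5) / 4 = 0.625
```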
Alerting without the proper broadcasting methods leaves you vulnerable. Your alerting strategy should include a notification channel to ensure the appropriate teams are notified if your application or architecture encounters issues.
We recommend that you first send alerts to a group chat channel (for example, in Slack) rather than paging on-call staff directly (for example, through PagerDuty). Avoid alert fatigue by evaluating alerts in real time for several weeks to understand which alerts indicate important or problematic issues. Those are the alerts that warrant waking someone up.
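Slack's incoming webhooks, for instance, accept a JSON payload with a `text` field. A minimal sketch of routing an alert to a shared channel this way; the webhook URL and the message are placeholders:

```python
import json
import urllib.request

# Placeholder: substitute your channel's real incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def notify_channel(message: str) -> None:
    """Post an alert into a shared channel via a Slack incoming webhook."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


notify_channel("Error rate 0.35% exceeded the 0.20% threshold on checkout-service")
```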
Make sure that communications during critical incidents take place in easily accessible and highly visible channels. A group chat room dedicated to incident communication is often a great choice. This allows all stakeholders to participate in or observe conversations, and it provides a chronology of notifications, key decisions, and actions for postmortem analysis.
Automating simple or repeatable incident response tasks increases efficiency and minimizes the impact of incidents. With proper automation in place, you can disable or isolate faulty application components as soon as an alert threshold is reached, rather than after a notification has been issued.
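A minimal sketch of that pattern, assuming a hypothetical `checkout-service` and a stand-in `isolate_component` function in place of your real tooling (a load-balancer API call, a feature-flag toggle, and so on):

```python
ERROR_RATE_THRESHOLD_PCT = 0.20


def isolate_component(name: str) -> None:
    # Stand-in for a real call that drains or removes the faulty component.
    print(f"Removing {name} from the load-balancer pool")


def on_metric_sample(component: str, error_rate_pct: float) -> None:
    # Act the moment the threshold is crossed, before anyone is paged;
    # the notification (for example, the webhook sketch above) follows.
    if error_rate_pct > ERROR_RATE_THRESHOLD_PCT:
        isolate_component(component)
        print(f"Auto-isolated {component}: error rate {error_rate_pct:.2f}%")


on_metric_sample("checkout-service", 0.35)
```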
Finally, after the incident has been resolved, key stakeholders and participants must capture accurate and thorough documentation of the incident.
At a minimum, we recommend that the documentation for the incident retrospective includes:
- A root-cause analysis
- A chronology and summary of remediation steps, their results, and whether each was successful
- Recommendations for system or feature improvements to prevent a recurrence
- Recommendations for process and communication improvements