
I was just finishing work, birds were chirping, and the sun was low on the horizon when a small red circle appeared at the top right corner of my Slack feed. When I clicked it, a monster jumped out: our entire Incident Intelligence product was offline. This was especially embarrassing because we focus on building and providing tools to detect these kinds of incidents quickly. It's our responsibility to be reliable for our customers. We can't be offline because then our customers won't know when they are offline.

So what exactly happened? An engineer deleted a subscriber resource in Google Cloud Platform thinking it was only used for testing, but it was actually being used in production. Mistakes do happen, but we can’t have that kind of downtime.

We took several measures to make sure this issue would never happen again. Here’s how we leveraged our synthetics solution in combination with New Relic Alerts to set up a smoke test to continuously verify that data flows through our complex Incident Intelligence pipeline from start to finish.

Our system does the following:

  • Gets incident data from New Relic and third-party sources such as PagerDuty, Grafana, and Prometheus.
  • Attempts to correlate events based on time, context, and topology under one root cause, known as an issue.
  • Enriches this issue with additional helpful information used for problem-solving.
  • Sends this vastly improved issue back to the team in ServiceNow, PagerDuty, Slack, and other third-party platforms.

This is a simplified representation of our system:

Image: a simplified representation of our system, from input to output.

Smoke testing the Incident Intelligence pipeline

According to Wikipedia, the term “smoke testing” likely originated in the plumbing industry: plumbers would use real smoke to discover leaks and cracks in pipe systems. In our case, there are many moving parts in our system that are constantly changing, and we needed a smoke test that would tell us when any of them had a problem. So we decided to add a heartbeat test that regularly checks that data is flowing happily from our input to our outputs.

We set up the test to do the following:

  • Use a Synthetic monitor to periodically insert a message (called IINTSmokeTestStart) into New Relic Metrics (a sketch of this monitor appears right after this list).
Image: the NRQL alert condition that triggers an alert when it detects a log entry in New Relic Metrics.
  • Use an NRQL condition to trigger a new alert for each log entry of this kind. Incident Intelligence receives these alerts as inputs. The image above shows our alert condition, while the image below shows the periodic signal that triggers new alerts.
Image: the periodic signal that triggers new alerts.
  • Configure a webhook destination endpoint in Incident Intelligence that reacts to these periodic signals by writing another record (called IINTSmokeTestEnd) to New Relic Metrics using the Metrics API. The image below shows the webhook and the destination endpoint: New Relic's Metrics API.
Image: the webhook key and destination endpoint.
  • Add a final NRQL alert policy that triggers if the IINTSmokeTestEnd message doesn’t arrive. That means the pipeline is down, and we need an on-call engineer to address it immediately.
Image: the NRQL alert policy that triggers if the message doesn't arrive.
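
To make the first step concrete, here's roughly what the start of the heartbeat looks like. Scripted synthetic monitors run Node.js JavaScript, so treat the TypeScript below as a minimal sketch of the request rather than our actual monitor script; it assumes the marker is recorded as the metric name, and the attribute name, the environment variable, and the error handling are illustrative assumptions.

```typescript
// Minimal sketch of the "start" half of the heartbeat: post an
// IINTSmokeTestStart metric to the Metrics API on every monitor run.
// Scripted monitors actually run Node.js JavaScript; this TypeScript
// version shows the equivalent request. The attribute and the
// environment variable are illustrative, not from the real monitor.
const METRIC_API_URL = "https://metric-api.newrelic.com/metric/v1";
const INSERT_KEY = process.env.NEW_RELIC_INSERT_KEY ?? "";

async function sendStartHeartbeat(): Promise<void> {
  const payload = [
    {
      metrics: [
        {
          name: "IINTSmokeTestStart",
          type: "gauge",
          value: 1,
          timestamp: Date.now(),
          attributes: { "smokeTest.stage": "start" }, // illustrative attribute
        },
      ],
    },
  ];

  const response = await fetch(METRIC_API_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json", "Api-Key": INSERT_KEY },
    body: JSON.stringify(payload),
  });

  if (!response.ok) {
    // A failed monitor run is itself a useful signal: the probe is broken.
    throw new Error(`Metric API returned ${response.status}`);
  }
}

sendStartHeartbeat().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

The NRQL condition that consumes this signal can then be something along the lines of SELECT count(*) FROM Metric WHERE metricName = 'IINTSmokeTestStart', with a threshold that opens a new alert for every data point. The webhook at the other end of the pipeline posts a nearly identical payload named IINTSmokeTestEnd; that one is configured as the destination's payload in Incident Intelligence rather than as code we run.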

Here’s a summary of what happens now that our heartbeat test is in place:

  1. We continuously generate data that activates our pipeline end-to-end.
  2. Every minute, we push incidents into the beginning of our pipeline; in our case, these are simulated incidents that serve as our input data.
  3. Every five minutes, the end of our pipeline uses a webhook to log a message using the Metrics API. If this message isn’t received, we get an alert notifying us that something is wrong (the sketch below shows what that final check boils down to).
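
That final alert boils down to a single question: did an IINTSmokeTestEnd metric arrive in the last few minutes? In production this check lives entirely inside an NRQL alert condition, but the sketch below expresses the same check as a NerdGraph query you could run by hand. The account ID, the key handling, and the exact NRQL text are assumptions for illustration, not the real condition.

```typescript
// Illustrative only: the "did the heartbeat make it through?" check that
// the final NRQL alert condition performs, expressed as a NerdGraph query.
// Account ID, key handling, and the exact NRQL are assumptions.
const NERDGRAPH_URL = "https://api.newrelic.com/graphql";
const ACCOUNT_ID = 1234567; // placeholder account ID
const USER_API_KEY = process.env.NEW_RELIC_USER_KEY ?? "";

// Roughly the shape of the query behind the "pipeline is down" condition.
const END_CHECK_NRQL =
  "SELECT count(*) FROM Metric WHERE metricName = 'IINTSmokeTestEnd' SINCE 5 minutes ago";

async function pipelineLooksHealthy(): Promise<boolean> {
  const query = `{
    actor {
      account(id: ${ACCOUNT_ID}) {
        nrql(query: "${END_CHECK_NRQL}") {
          results
        }
      }
    }
  }`;

  const response = await fetch(NERDGRAPH_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json", "API-Key": USER_API_KEY },
    body: JSON.stringify({ query }),
  });

  const body = await response.json();
  const count = body?.data?.actor?.account?.nrql?.results?.[0]?.count ?? 0;

  // Zero end markers in the window means the heartbeat never made it all
  // the way through the pipeline; that's exactly the case the alert catches.
  return count > 0;
}

pipelineLooksHealthy().then((healthy) =>
  console.log(healthy ? "Pipeline heartbeat OK" : "No IINTSmokeTestEnd seen in 5 minutes"),
);
```

Again, this isn't code we actually run; the point is that the whole end-to-end health check reduces to one NRQL query over a short window, and if that query keeps returning zero, the alert fires and the on-call engineer gets paged.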

We can now sleep comfortably knowing we have one more layer of protection for our customers.