How We Created a Heartbeat Test for Monitoring Incident Intelligence

Published Jul 29, 2021 (3 min read)

I was just finishing work, birds were chirping, and the sun was low on the horizon when a small red circle appeared at the top right corner of my Slack feed. When I clicked it, a monster jumped out: our entire Incident Intelligence product was offline. This was especially embarrassing because we focus on building and providing tools to detect these kinds of incidents quickly. It's our responsibility to be reliable for our customers. We can't be offline because then our customers won't know when they are offline.

So what exactly happened? An engineer deleted a subscriber resource in Google Cloud Platform thinking it was only used for testing, but it was actually being used in production. Mistakes do happen, but we can’t have that kind of downtime.

We took several measures to make sure this issue would never happen again. Here’s how we leveraged our synthetics solution in combination with New Relic Alerts to set up a smoke test to continuously verify that data flows through our complex Incident Intelligence pipeline from start to finish.

Our system does the following:

Gets incident data from New Relic and third-party sources such as PagerDuty, Grafana, and Prometheus.
Attempts to correlate events based on time, context, and topology under one root cause, known as an issue.
Enriches this issue with additional helpful information used for problem-solving.
Sends this vastly improved issue back to the team in ServiceNow, PagerDuty, Slack, and other third-party platforms.

This is a simplified representation of our system:

Simplified representation shows process from input to output.

Smoke testing the Incident Intelligence pipeline

According to Wikipedia, the term “smoke testing” likely originated in the plumbing industry, where plumbers would use real smoke to discover leaks and cracks in pipe systems. In our case, our system has many moving parts that are constantly changing. We needed a smoke test that would tell us whenever any part of the system stopped working. We decided to add a heartbeat test that regularly checks that data is flowing happily from our inputs to our outputs.

We set up the test to do the following:

Use a Synthetic monitor to periodically insert a message (called IINTSmokeTestStart) to New Relic Metrics.

An NRQL alert definition that triggers an alert when it detects a log entry in New Relic metrics.

Use an NRQL condition to trigger a new alert for each log entry of this kind. Incident Intelligence receives these alerts as inputs. The image above shows our alert condition. Meanwhile, the image below shows the periodic signal that triggers new alerts.

Periodic signal that triggers new alerts.

Configure a webhook destination endpoint in Incident Intelligence that reacts to these periodic signals by writing another record (called IINTSmokeTestEnd) to New Relic Metrics using the Metrics API. The image below shows the webhook and the destination endpoint: New Relic's Metrics API.

Image shows the webhook key and destination endpoint.

Add a final NRQL alert policy that triggers if the IINTSmokeTestEnd message doesn’t arrive. A missing message means the pipeline is down and we need an on-call engineer to address it immediately.

NRQL alert policy that triggers if message doesn't arrive.
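
To make the start and end markers more concrete, here is a minimal sketch of what the two records might look like as Metric API payloads. It is not the code we run: synthetic scripted monitors are written in JavaScript, and Python is used here only to illustrate the payload shape. The gauge type, the value of 1, the `source` attribute, the helper names, and the placeholder key are all assumptions made for this example.

```python
import json
import time
import urllib.request

# Assumption: the US Metric API ingest endpoint. Other regions use a different host.
METRIC_API_URL = "https://metric-api.newrelic.com/metric/v1"


def build_marker_payload(metric_name, timestamp_ms=None, attributes=None):
    """Return a Metric API payload holding a single gauge data point."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # current time in milliseconds
    return [
        {
            "metrics": [
                {
                    "name": metric_name,
                    "type": "gauge",
                    "value": 1,  # illustrative: only the marker's arrival matters
                    "timestamp": timestamp_ms,
                    "attributes": dict(attributes or {}),
                }
            ]
        }
    ]


def build_request(payload, api_key):
    """Prepare (but do not send) the HTTP POST that would carry the payload."""
    return urllib.request.Request(
        METRIC_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Api-Key": api_key},
        method="POST",
    )


# Hypothetical attribute used to tag the synthetic traffic.
tags = {"source": "heartbeat-smoke-test"}

start_marker = build_marker_payload("IINTSmokeTestStart", 1627545600000, tags)
end_marker = build_marker_payload("IINTSmokeTestEnd", 1627545660000, tags)

# Building the request has no side effects; sending it would be
# urllib.request.urlopen(start_request) with a real key.
start_request = build_request(start_marker, api_key="YOUR_LICENSE_KEY")
print(json.dumps(start_marker, indent=2))
```

In this sketch the synthetic monitor would send the start marker, and the webhook at the far end of the pipeline would send the end marker, so the two payloads differ only in name and timestamp.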

Here’s a summary of what happens now that our heartbeat test is in place:

We continuously generate data that activates our pipeline end-to-end.
Every minute, we push incidents to the beginning of our pipeline. In our case, we are simulating incidents for our input data.
Every five minutes, the end of our pipeline uses a webhook to log a message using the Metrics API. If this message isn’t received, we get an alert notifying us that something is wrong.
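
The "alert if the message isn't received" rule is a dead man's switch: silence, rather than an error, is the failure signal. In our setup that rule lives in an NRQL alert policy, but the underlying logic can be sketched in a few lines. The 11-minute threshold and the function name below are hypothetical, chosen only to tolerate one late five-minute heartbeat.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: two five-minute heartbeat intervals plus a minute of slack.
MAX_SILENCE = timedelta(minutes=11)


def heartbeat_is_missing(end_marker_times, now, max_silence=MAX_SILENCE):
    """Return True when no IINTSmokeTestEnd marker arrived within max_silence."""
    if not end_marker_times:
        return True  # no marker at all counts as a missing heartbeat
    return now - max(end_marker_times) > max_silence


now = datetime(2021, 7, 29, 12, 0, tzinfo=timezone.utc)

# Markers arriving every five minutes, the latest two minutes ago.
healthy = [now - timedelta(minutes=m) for m in (12, 7, 2)]
# Markers that stopped arriving 22 minutes ago.
stalled = [now - timedelta(minutes=m) for m in (32, 27, 22)]

print(heartbeat_is_missing(healthy, now))  # False
print(heartbeat_is_missing(stalled, now))  # True
```

The useful property of this design is that it catches failures anywhere between input and output, including ones like the deleted subscriber that produce no error of their own.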

We can now sleep comfortably knowing we have one more layer of protection for our customers.

Next steps

Get started with Incident Intelligence. If you’re not using New Relic One yet, then request a demo or sign up for a free trial today.

By Shy Peleg, Director of Software Engineering, Applied Intelligence

Shy is a Director of Software Engineering at New Relic for Applied Intelligence. Shy started his career at the Israeli Intelligence Corps as a software developer and team leader. He continued working in cybersecurity while completing a BSc in Mathematics and Philosophy. He then served four years as a founder and CTO for the twenty-person startup OMGWhen, which created a global search and personalization engine for leisure events. Next, Shy worked at Wix.com as a frontend staff engineer. He completed an MBA at IE University in Madrid before joining New Relic.

The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.
