Determining the root cause of an incident has become more complex as engineers increasingly rely on distributed microservices to power their applications. Zebrium Root Cause as a Service, together with New Relic, helps automate the process of finding the root cause of issues in your logs. When a software or infrastructure problem occurs, Zebrium quickly and accurately finds the root cause indicators so your teams don’t need to manually search through all their logs. With the Zebrium integration for New Relic, you can see root cause details on any New Relic dashboard page, so you can leverage your telemetry data to help identify and troubleshoot issues faster.

To make it easier to get started quickly, the Zebrium quickstart in New Relic Instant Observability (I/O) provides a pre-built dashboard that includes incident detections, root cause details, and deep links to the Zebrium user interface. Here’s an example of the dashboard in New Relic.

Any engineer can get started for free with New Relic and use the feature as part of the more than 470 integrations available with New Relic’s observability platform. When you sign up, you get 100GB per month of free data ingest, one full platform user, and unlimited basic users, queries, dashboards, and alerts.

With the Zebrium integration in place, you can find the root causes of problems more quickly, reducing the burden on your SREs and DevOps teams when solving complex incidents. You get the following key benefits:

  • Never dig through logs again. When a problem occurs, automatically see log lines related to the issue in your New Relic dashboards. This can reduce downtime and cut resolution time from hours to minutes.
  • No manual training or rules needed. Initial setup takes about 15 minutes, and Zebrium Root Cause as a Service starts finding root cause indicators in your log data accurately within 24 hours. You don’t need to set up machine learning (ML) training manually.
  • Accurately find root causes with confidence. A recent third-party customer case study validated that Zebrium can automatically find the root cause in over 95% of incidents.

Identify root causes at a glance with the Zebrium integration

Let’s take a look at how the integration can help you identify root causes without needing to hunt through logs manually. The dashboard in the next image shows metrics for an online shopping app on a Kubernetes cluster. The two time-series metrics charts at the top of the dashboard show that an outage occurred.

New Relic dashboard with time-series metrics charts and Zebrium root cause charts.

The Zebrium Root Cause Finder chart on the bottom left shows a vertical bar at the time the problem occurred, indicating that Zebrium’s machine learning model has detected a potential issue. The vertical bar lines up with the drop in metrics. Meanwhile, the Zebrium Root Cause Reports chart on the bottom right shows the details of the Zebrium event. You can hover over the Zebrium event to see a natural language processing (NLP) summary and word cloud that can help you determine the root cause of the issue. Selecting the Deep Link URL opens the full root cause report in the Zebrium user interface.

Select the Deep Link URL in New Relic to see more details about the root cause report in Zebrium.

As shown in the previous image, you can dive into the Zebrium UI for the root cause report. Approximately one million log lines were generated during the incident time window. The report highlights 46 lines from 7 different services, and no manual model training or rules were required to find these log lines. A quick review of the report reveals:

  • An English-language NLP summary that gives a good sense of the problem: “The Chaos Monkey was trying to create an order.”
  • A word cloud showing “pod-network-corruption” and some other relevant words.
  • Log lines that provide descriptions of the root cause.

Zebrium uses statistical machine learning to analyze logs in real time. The machine learning model is based on the process a skilled engineer uses to search through logs when troubleshooting. In simple terms, this typically involves searching for errors around the time an issue occurs and then searching all logs around that time period to find any rare or unexpected events that explain the problem. 
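To make that two-step search concrete, here is a minimal Python sketch of the manual process described above: find error lines, then look for rare log lines around the same time. It is only an illustration of the idea, not Zebrium's actual model. The function names, the `"error"` keyword match, the five-minute window, and the sample log lines are all assumptions made up for this example.

```python
import re
from collections import Counter
from datetime import datetime, timedelta


def template(message):
    """Collapse numbers and hex ids so similar lines share one key (hypothetical helper)."""
    return re.sub(r"\b(0x[0-9a-f]+|\d+)\b", "<*>", message.lower())


def rare_events_near_errors(logs, window=timedelta(minutes=5), max_count=1):
    """Return (timestamp, message) pairs that are rare and close to an error line.

    logs is a list of (datetime, str) tuples. A line counts as rare when its
    template appears at most max_count times in the whole list.
    """
    counts = Counter(template(msg) for _, msg in logs)
    # Step 1: find when errors occurred (a naive keyword match for this sketch).
    error_times = [ts for ts, msg in logs if "error" in msg.lower()]
    hits = []
    for ts, msg in logs:
        # Step 2: keep lines near an error whose template is unusual.
        near_error = any(abs(ts - et) <= window for et in error_times)
        if near_error and counts[template(msg)] <= max_count:
            hits.append((ts, msg))
    return hits


# Made-up sample data: 100 routine lines plus two unusual ones.
start = datetime(2023, 1, 1, 12, 0, 0)
sample = [(start + timedelta(seconds=i), f"request {i} served") for i in range(100)]
sample.append((start + timedelta(seconds=50), "chaos experiment started: network corruption"))
sample.append((start + timedelta(seconds=52), "ERROR: order service timeout"))

for ts, msg in rare_events_near_errors(sample):
    print(ts.isoformat(), msg)
```

The routine "request N served" lines collapse into a single frequent template, so only the two unusual lines are returned. A production system would need far more than keyword matching and simple counts, which is the gap the machine learning model is meant to fill.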

How to get started with automated root cause identification in minutes

To start viewing root causes from Zebrium events automatically in New Relic, follow these steps. The whole setup process should take about 15 minutes.

  1. Set up your accounts. You need to have New Relic and Zebrium accounts to use the integration. You can get started for free.
  2. Install a log collector to stream your logs to Zebrium. Read more about the supported collectors and how to install them.
  3. Retrieve your New Relic API key. Go to API keys in New Relic. Select Create a key. Choose your account and note the Account Id. Select Ingest - License as the Key type. Enter a name and notes for the API key, then select Create a key. Select the three dots next to your new key and select Copy key to save it for later.
  4. Enable the integration in Zebrium. Go to the User menu in Zebrium, select Integrations & Collectors, and select New Relic under the Observability Dashboards section. Select the General tab and enter an integration name. Select the Deployment for the integration, select Service Group(s), and then select the Send Detections tab. Select Enabled, and enter the Account Id and API key from the previous step. Then select Save.
  5. Visualize your data with the quickstart. After you’ve enabled the integration to send detections from Zebrium to New Relic, install the quickstart to get a curated dashboard. Go to the Zebrium Root Cause as a Service quickstart in New Relic I/O, select the Install Now button, and follow the guided installation process.

After you are finished, you automatically get a curated sample dashboard in New Relic where you can view detections and their root causes. You can also copy the sample charts into any other New Relic dashboard for a customized view.
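If you are curious how an account ID and an Ingest - License key are typically used together, the following Python sketch builds (but does not send) a request to the New Relic Event API. It is not a description of how the Zebrium integration sends detections. The event type `ExampleDetection`, its attributes, the account ID, and the key value are placeholders invented for this example, and the URL shown is the US endpoint, so accounts in other regions would need a different host.

```python
import json
import urllib.request


def build_event_request(account_id, license_key, events):
    """Build a POST request for the New Relic Event API without sending it."""
    url = f"https://insights-collector.newrelic.com/v1/accounts/{account_id}/events"
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Api-Key": license_key},
        method="POST",
    )


# Placeholder values: replace with your own account ID and license key.
request = build_event_request(
    account_id="1234567",
    license_key="YOUR_LICENSE_KEY",
    events=[{"eventType": "ExampleDetection", "summary": "hypothetical detection"}],
)
print(request.get_method(), request.full_url)
# To actually send it, you would call urllib.request.urlopen(request).
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and headers before any data leaves your machine.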

Get a pre-built dashboard by installing the Zebrium Root Cause as a Service quickstart from New Relic I/O.

The next time you have to deal with a P1 incident, you can automatically see what happened directly in your New Relic dashboards. Zebrium Root Cause as a Service can be set up in minutes and starts finding root cause indicators accurately within 24 hours, without any rules or manual training. Find the root cause faster and never dig through logs again.