New Relic Now Start training on Intelligent Observability February 25th.
Save your seat.
現在、このページは英語版のみです。

As observability practitioners, we naturally have questions that only telemetry data can really answer. When looking at a system, it’s useful to be able to take both a high-level view (for example, how is the checkout service performing) and also be able to answer fine-grained questions (for example, which customers are experiencing slow checkout responses). I like to call this process “zooming in” as we look at high-level data and then drill into specifics. 

Using New Relic Query Language (NRQL) on top of MELT data (metrics, events, logs, and traces collected from New Relic agents, OpenTelemetry, Prometheus, and many more), we can quickly gain insights into running systems. Using the power of a single query language, we can take a high-level view, and then further refine the query to look at a particular angle of the data.

Let’s dig into an example system to show how we can gather key insights using this technique.

High level views

Today I’m going to be using tracing telemetry that’s generated from an OpenTelemetry application as the basis of my analysis. The application is an ecommerce application that has a frontend service and some backend microservices, as shown in the image below. 

The application performance monitoring (APM) pages give a high-level overview focused on the service, but we can get to more specific insights by writing our own NRQL query on the same data that powers the charts in APM.

Digging deeper with the query builder and NRQL

To start digging into the running service, we want to get an overview of the traces. I could start with a basic query like this to get a sense of what trace summary data exists for the frontend service using the query builder:

FROM DistributedTraceSummary SELECT * 
WHERE root.entity.name = 'frontend-otel'

SELECT * queries, as shown in the following image, give a good overview of which attributes might exist by selecting the most recent 100 rows of data. I see that duration.ms is an attribute that could tell me how long these frontend transactions took in milliseconds. 

When plotting latency data, it’s preferable to look at the distribution of values rather than something like the average. I’ll pick the percentiles 50, 90, 95, and 99 to look at in addition to the number of traces over the last day of data:

FROM DistributedTraceSummary SELECT percentile(duration.ms, 50, 90, 95, 99), count(*) 
WHERE root.entity.name = 'frontend-otel'
SINCE 1 day ago

The resulting chart shows percentiles for duration.ms and trace count over the last day. 

This is a decent high-level summary. There are a lot of transactions happening. This could be a good target for an overview dashboard, but I might want to go a bit deeper.

Zooming into the data

In our example application, we have a summary of the frontend service but we don’t know how the service’s dependencies are functioning. 

The first things to find are the other services that contribute to the latency of the frontend service. To get to this level of detail, we’ll use the tracing Span event, which is one of the “atoms” of a distributed trace.

To get to the collection of services that are involved in serving the frontend, we can get a set of sample traces and then find the service names in the Span data. This is accomplished using the following NRQL:

FROM Span SELECT uniques(service.name) 
WHERE trace.id in (FROM DistributedTraceSummary SELECT trace.id  WHERE root.entity.name = 'frontend-otel' LIMIT MAX) 

The following image shows a breakdown of services involved in serving the frontend.

You’ll notice that this contains all the services in the call chain, not just the immediate set of services. This is good information for any sort of latency investigation. 

We can gather some high-level stats on this set of spans to see which ones might be the slowest in terms of latency that might be impacting the frontend. I can use FACET clause to group spans by their service name:

FROM Span SELECT percentile(duration.ms, 99), count(*)
WHERE trace.id in (FROM DistributedTraceSummary SELECT trace.id  WHERE root.entity.name = 'frontend-otel' LIMIT MAX) 
FACET service.name

The following image shows the duration distribution of each service.

This query result tells us that the checkout service is a key service to watch. From our knowledge of the system, it’s also business critical, since the checkout is where the ecommerce site makes its money. 

Let’s zoom back out and just look at spans for the checkout service to see if there’s additional insight on what might be the most latency:

FROM Span SELECT *
WHERE service.name = 'checkoutservice-otel'

The above query isolates spans from the checkout service, as shown in the following image.

I noticed that the name attribute has the endpoint within the service that’s being called. Let’s look at those using a FACET.

FROM Span SELECT percentile(duration.ms, 95, 99)
WHERE service.name = 'checkoutservice-otel'
FACET name

This query shows the transaction durations by endpoint within the checkout service.

Before going further, these names can be cleaned up a bit to just include the part after the “oteldemo.” prefix along with TIMESERIES MAX so that the latency percentiles are shown over time:

WITH aparse(name, 'oteldemo.*') as sanitizedName
FROM Span SELECT percentile(duration.ms, 95, 99)
WHERE service.name = 'checkoutservice-otel'
FACET sanitizedName TIMESERIES MAX

With simplified names and percentile shown over time, the above query makes it easy to spot endpoints with notable duration trends.

From this chart it’s clear that “PlaceOrder” stands out from the bunch. Let’s scope our queries to just this endpoint and look at a longer time range with just the p95:

WITH aparse(name, 'oteldemo.*') as sanitizedName
FROM Span SELECT percentile(duration.ms, 95)
WHERE service.name = 'checkoutservice-otel' and sanitizedName = 'CheckoutService/PlaceOrder'
TIMESERIES MAX since 1 week ago

The query returns the trend for the 'PlaceOrder' endpoint over a week, as shown in the following image.

This gives us a good interesting signal. It looks like there was a latency spike on October 18 at 5:30PM. I can use the query builder to zoom into that particular timeframe:

To really know why this might be happening, the specific slow Spans will have the details. Let’s just look at spans that have a duration greater than 1 second (1000 milliseconds):

WITH aparse(name, 'oteldemo.*') as sanitizedName
FROM Span SELECT duration.ms, *
WHERE service.name = 'checkoutservice-otel' AND sanitizedName = 'CheckoutService/PlaceOrder' AND duration.ms > 1000
since '2024-10-18 01:25:00+0530' UNTIL '2024-10-18 09:52:00+0530'

This query isolates spans with duration exceeding 1 second, allowing you to inspect possible bottlenecks in more detail.

From here there might be a signal in the order amount or other information that could provide a new theory on why this call might have been slow and what to do about it within the service code. 

Conclusion

Being able to navigate through data, zooming in and out to gain a deeper understanding, is a core skill of observability practitioners, one that helps to understand how a system behaves and identify where problems might exist. The ability to start with a small snippet of NRQL to get a high-level overview of the data, and then zoom in to specifics using the important attributes of your data, is one of the powerful yet delightful aspects of exploring data with New Relic.