Errors are defined, captured, and presented differently in New Relic application performance monitoring (APM) services and OpenTelemetry APM services. Customers who are migrating to OpenTelemetry and comparing the APM and OpenTelemetry summary pages will notice that the error rate graphs display different values for the same service. This blog post explains why.

What is an error?

This sounds like a simple question. A typical explanation might be: any transaction that has an unhandled exception is an error.  

Error rate is the percentage of transactions that result in an error during a particular time window. For example, if during a specific period of time your application handles 1,000 transactions, and 50 of them have unhandled exceptions, you have an error rate of 50/1000, or 5%. However, errors are defined and interpreted differently by OpenTelemetry, and therefore the error rate will differ too. Fundamentally, this is because OpenTelemetry does not have the notion of transactions, nor any of the additional logic our agents use to capture and count errors.  

First, let’s look at how New Relic handles errors for our APM agents, and then compare that with OpenTelemetry.

New Relic APM agents

At New Relic, a transaction is defined as one logical unit of work in a software application. Specifically, it refers to the function calls and method calls that make up that unit of work. For APM, it will often refer to a web transaction, which represents activity that starts when the application receives a web request and ends when the response is sent.

Learn more about transaction types and subtypes here.

Transaction error logic

We record only one error per transaction. Even if there are multiple errors within the unit of work, the transaction counts as a single error for the purposes of error rate. The recorded error is chosen in order of precedence from the following types of errors:

  • NoticeError API
  • Exception observed by instrumentation
  • Web transaction with a status code >= 400

Once an error is defined, the TransactionError event captures details like exception type and stack trace. New Relic APM also supports the notions of “ignored” and “expected” errors via agent configuration.
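The precedence rule above can be sketched in a few lines of Python. This is only an illustration, not the agents' actual implementation: the source labels, the ObservedError type, and the pick_transaction_error helper are hypothetical names, and the tie-breaking rule (the first error observed wins among errors of equal precedence) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the three error types, highest precedence first.
PRECEDENCE = ["notice_error_api", "instrumented_exception", "http_status"]


@dataclass
class ObservedError:
    source: str   # one of the labels in PRECEDENCE
    message: str


def pick_transaction_error(errors: List[ObservedError]) -> Optional[ObservedError]:
    """Return the single error to record for a transaction, or None."""
    if not errors:
        return None
    # min() returns the first minimal item, so among errors of equal
    # precedence the one observed first is kept (an assumption of this sketch).
    return min(errors, key=lambda e: PRECEDENCE.index(e.source))


observed = [
    ObservedError("http_status", "HTTP 500"),
    ObservedError("instrumented_exception", "ValueError in method B"),
    ObservedError("instrumented_exception", "KeyError in method D"),
]

chosen = pick_transaction_error(observed)
# A transaction contributes at most one error to the error count.
error_count = 0 if chosen is None else 1
print(chosen.message, error_count)
```

Here three errors are observed, but only the exception from method B is recorded, and the transaction adds one to the error count.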

New Relic example

In this transaction, there are four instrumented methods. Methods B and D both have errors; however, the transaction records only the first error and its details, based on precedence.

When an unexpected error is recorded, the error count is incremented and recorded as a metric by the New Relic APM agent (apm.service.error.count). The error rate on the APM Summary Page is calculated using this metric, as shown in the screenshot below. The calculation and visualization of error rate does not factor in the TransactionError event, as it is sampled and would be a limited dataset.

Sample NRQL query:

SELECT sum(apm.service.error.count['count']) / count(apm.service.transaction.duration) AS 'Web errors' FROM Metric WHERE (entity.guid = 'foo') AND (transactionType = 'Web') LIMIT MAX SINCE 30 MINUTES AGO TIMESERIES

To read more about managing errors for New Relic agents, review our documentation.

OpenTelemetry

There are two ways that errors are defined in order to drive various parts of our UIs from OpenTelemetry data:

  1. OpenTelemetry metrics, which are used for the error rate chart on the OpenTelemetry APM summary page
  2. Transactions defined from spans, which are used for the errors inbox

Errors from metrics (OpenTelemetry APM summary)

In the New Relic OpenTelemetry summary page, you have the option to toggle between metrics or spans.  

Metrics: A key difference between APM and OpenTelemetry is that the OpenTelemetry HTTP metrics specification does not define an error count metric. For the OpenTelemetry APM experience in New Relic, the error rate chart uses the duration metric http.server.request.duration or rpc.server.duration and counts data points with a status code >= 500 as errors. This means that the error rate from metrics is restricted to HTTP calls.

Sample NRQL query:

SELECT filter(count(http.server.request.duration), WHERE numeric(http.status_code) >= 500 OR numeric(http.response.status_code) >= 500)/count(http.server.request.duration) as 'Error rate for all errors' FROM Metric WHERE (entity.guid = 'foo') AND (http.server.request.duration IS NOT NULL) SINCE 30 minutes ago TIMESERIES
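To make the metric-based calculation concrete, here is a minimal Python sketch of the same logic, assuming the duration metric has already been reduced to request counts per status code. The DurationPoint type and the metric_error_rate helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DurationPoint:
    status_code: int  # value of http.response.status_code (or http.status_code)
    count: int        # number of requests recorded with that status code


def metric_error_rate(points: List[DurationPoint]) -> float:
    """Share of recorded requests whose status code is >= 500."""
    total = sum(p.count for p in points)
    errors = sum(p.count for p in points if p.status_code >= 500)
    return errors / total if total else 0.0


points = [
    DurationPoint(200, 900),
    DurationPoint(404, 60),
    DurationPoint(500, 30),
    DurationPoint(503, 10),
]

rate = metric_error_rate(points)
print(f"{rate:.1%}")
```

In this made-up sample, only the 40 responses with a 5xx status count as errors, giving an error rate of 4%. The 60 responses with a 404 status are not counted, whereas under the APM agent rule described earlier (a web transaction with a status code >= 400), they could be counted as errors unless they are configured as ignored or expected.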

Spans: When the error rate chart is derived from spans, every OpenTelemetry span with a kind of server or consumer and a status code of ERROR is counted as an error. This means that the error rate from spans is protocol agnostic.

Sample NRQL query:

SELECT filter(count(*), WHERE otel.status_code = 'ERROR')/count(*) as 'Error rate for all errors' FROM Span WHERE (entity.guid = 'foo') AND (span.kind LIKE 'server' OR span.kind LIKE 'consumer' OR kind LIKE 'server' OR kind LIKE 'consumer') SINCE 30 minutes ago TIMESERIES
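The span-based calculation can be sketched the same way. This is an illustration under simplifying assumptions, not New Relic's implementation: the Span type and the span_error_rate helper are hypothetical, and span kinds are assumed to be lowercase strings that match exactly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    kind: str         # e.g. "server", "consumer", "client", "internal"
    status_code: str  # "UNSET", "OK", or "ERROR"


# Only entry-point spans take part in the calculation.
ENTRY_KINDS = {"server", "consumer"}


def span_error_rate(spans: List[Span]) -> float:
    """Share of server/consumer spans whose status code is ERROR."""
    entry_spans = [s for s in spans if s.kind in ENTRY_KINDS]
    if not entry_spans:
        return 0.0
    errors = sum(1 for s in entry_spans if s.status_code == "ERROR")
    return errors / len(entry_spans)


spans = [
    Span("server", "ERROR"),
    Span("server", "OK"),
    Span("server", "UNSET"),
    Span("consumer", "ERROR"),
    Span("client", "ERROR"),    # not an entry point, so not counted
    Span("internal", "ERROR"),  # not an entry point, so not counted
]

rate = span_error_rate(spans)
print(f"{rate:.0%}")
```

Two of the four entry-point spans have a status code of ERROR, so the rate is 50%. The client and internal spans are left out of both the numerator and the denominator.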

Errors from spans (errors inbox)

OpenTelemetry does not have a concept of a transaction, but it does have spans, and spans represent operations within a transaction. New Relic relies on SpanKind for mapping trace data to our concept of a transaction. A SpanKind of server or consumer is used to identify the entry point of a process. In other words, these are spans that are either root spans or child spans of a remote process.

In addition to the lack of a definition of a transaction, OpenTelemetry does not include an explicit error rate metric.

In order to bridge the gap between New Relic and OpenTelemetry, transactions are defined by a span of kind server, with child spans making up the sub-operations of the transaction.

In this definition of a transaction, the transaction is only considered an error if the root span of kind server has a status code of ERROR. Child spans with a status code of ERROR do not change this on their own: if the root span doesn’t have a status code of ERROR, the transaction isn’t counted as an error.
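As a rough sketch of this rule, the check below looks only at the entry span of kind server and ignores the status of its children. The Span type and the transaction_is_error helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    kind: str
    status_code: str  # "UNSET", "OK", or "ERROR"
    children: List["Span"] = field(default_factory=list)


def transaction_is_error(root: Span) -> bool:
    """A transaction is an error only if its server root span has status ERROR."""
    # Child spans are deliberately not inspected.
    return root.kind == "server" and root.status_code == "ERROR"


failing_child = Span("method B", "internal", "ERROR")

# The child failed, but the root span was never marked as ERROR.
handled = Span("method A", "server", "UNSET", [failing_child])

# The error reached the root span.
unhandled = Span("method A", "server", "ERROR", [failing_child])

print(transaction_is_error(handled), transaction_is_error(unhandled))
```

The first transaction is not counted as an error even though one of its child spans failed. The second one is, because the root span itself carries the ERROR status.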

OpenTelemetry example

In this example, services are represented as boxes containing circles (spans). Service A is calling Service B, and Service B calls Service C.

Within Service B, there are multiple instrumented methods, which result in multiple spans being captured. Method A is of kind server, which makes it the entry point for this service, and it’s used to define the concept of a transaction to populate the APM UI. Within this abstraction, there are multiple spans that have a status code of ERROR.

The root span has an error, so the transaction is considered an error, and is displayed in the errors inbox.

Summary

There is no direct apples-to-apples comparison for error rate between New Relic and OpenTelemetry since the models are fundamentally different. When moving between instrumentation methods, it’s important to re-establish the error baselines of your service, and leverage those for your alert conditions, service level objectives, and dashboards.