OpenTelemetry metrics is well on its way to general availability, and you’ll want to understand this signal before you make it part of your team’s observability strategy. Currently, you can collect some application runtime metrics out of the box with several language SDKs, and you can use the host metrics receiver to generate metrics about your host system. 

If you want to generate and collect metrics beyond those, you’ll need to learn about metric instruments, types, and their use cases, including what you need to consider when choosing one. For example, you might want to know the number of active users of your application so you can better understand customer behavior.

To help frame these concepts, you can follow along with examples in our fork of the OpenTelemetry Community Demo application, which is an online shop selling a range of tools for stargazing. 

If you're familiar with general metrics concepts, feel free to move ahead to the metric instruments section. 

What is a metric? 

Before we get into OpenTelemetry metrics, let’s start with metrics in general. A metric is a measurement of a service that is captured at runtime. You can aggregate these measurements further to identify trends and patterns over time. Here are standard application and resource utilization metrics that are generally important when you’re developing apps: 

  • Throughput
  • Response time/latency
  • Error rate
  • CPU utilization
  • Memory utilization 

Getting a little more complex, you might be interested in custom metrics to better understand your app and user behavior, or to track specific key performance indicators (KPIs). For example, if you’re the owner of the astronomy online store, here are some example custom metrics you might want to collect:

  • Total number of checkouts
  • Latency of search results being returned
  • The distribution of orders by size 
  • Number of abandoned shopping carts per day 

Why are metrics useful?

More specifically, you might be wondering why metrics are useful for observability in general, and what characteristics make them more useful than logs or traces. 

All three signals—metrics, logs, and traces—are useful for monitoring the overall health and performance of your application. You can use them for span data or, more commonly, powering data visualizations. 

But here’s where metrics really shine:

  • Data volume reduction: Exporting and analyzing measurements individually can be expensive. By aggregating measurements, you can reduce your overall data volume while still gaining insight from the data. 
  • Alerts: Metrics form the basis of service level indicators (SLIs), which measure the performance of an application. You use the indicators to set service level objectives (SLOs) that teams use to calculate their error budgets. Metrics make a big difference if you use them to create alerts for breached SLOs. 

Overview of OpenTelemetry metrics concepts

To help you understand the metric instruments available with OpenTelemetry, let’s review a few important concepts at a high level. 

Instrumentation

Instrumentation refers to adding code to your application to collect telemetry data. In metrics, this involves adding code to your application to measure specific operations or events, such as HTTP requests, database queries, or function execution times.

Metric instruments

OpenTelemetry provides several types of metric instruments for capturing different kinds of data. Common metric instruments include counters, up-down counters, histograms, and gauges. Each instrument type is suited for different use cases and provides different insights into your application's behavior.

Labels/Tags 

Labels or tags are key-value pairs associated with metrics that provide additional context or dimensions for the data. In OpenTelemetry, they are called attributes. For example, you might add attributes to a metric representing HTTP request latency to differentiate between endpoints or response status codes.

Exporters

Once metrics are collected, they need to be exported to a monitoring backend for storage, visualization, and analysis. OpenTelemetry provides exporters for various monitoring systems. These exporters allow you to integrate OpenTelemetry with your existing monitoring infrastructure.

Sampling

Sampling is selecting a subset of data to collect and export, which helps manage the volume of telemetry data generated by your application. In OpenTelemetry, sampling strategies such as probabilistic sampling, rate-based sampling, and tail-based sampling apply to traces rather than metrics. For metrics, you manage data volume through aggregation instead.

Meter provider

A Meter Provider is responsible for creating and managing Meter instances in OpenTelemetry. It serves as the entry point for interacting with the Metrics API and allows applications to create and configure meters for collecting metric data.

Meter

A Meter is an abstraction representing a source of metrics in OpenTelemetry. It provides methods for creating different types of metric instruments, such as counters, up-down counters, histograms, and gauges, which you then use to record measurements. Meters are created and managed by Meter Providers.

API

The OpenTelemetry metrics API defines a set of interfaces and classes for working with metrics in OpenTelemetry. It provides methods for creating meters, recording metric data, and working with metric instruments and labels/tags. The Metrics API abstracts away the details of the underlying telemetry implementation, allowing applications to work with metrics in a consistent and portable manner.

SDK

The OpenTelemetry metrics SDK is an implementation of the Metrics API that provides the underlying functionality for collecting, processing, and exporting metric data in OpenTelemetry. It includes components for instrumenting applications, recording metric data, performing aggregation, and exporting data to monitoring backends. The Metrics SDK is responsible for integrating with the underlying telemetry infrastructure and handling the lifecycle of metric data collection and processing.

Now some key mathematical concepts:

Aggregation

Aggregation is the process of combining multiple measurements into one metric point. For example, let's say you have a set of 30 measurements, each representing a daily total number of telescopes sold. You could aggregate these totals to produce a single number, which would tell you how many of those telescopes you sold in the given time period (30 days).
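
To make the telescope example concrete, here is a minimal sketch in plain Python. The sales figures and the names `daily_telescopes_sold` and `aggregate_sum` are made up for illustration, and this is not how an OpenTelemetry SDK implements aggregation internally; it only shows the idea of collapsing many measurements into one metric point.

```python
# Hypothetical data: 30 measurements, one daily total per day.
daily_telescopes_sold = [
    3, 5, 2, 4, 6, 1, 0, 7, 3, 2,
    4, 5, 3, 2, 6, 4, 1, 3, 5, 2,
    0, 4, 6, 3, 2, 5, 1, 4, 3, 2,
]


def aggregate_sum(measurements):
    """Combine many measurements into a single metric point (a sum)."""
    return sum(measurements)


# One number now describes the whole 30-day period.
total_sold = aggregate_sum(daily_telescopes_sold)
print(total_sold)  # 98
```

Exporting the single value instead of all 30 measurements is also what makes the data volume reduction described earlier possible.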

Temporality

The notion of temporality dictates how you aggregate. It relates to whether the reported values of additive quantities—values that are summed together—incorporate previous measurements or not. There are two types of temporality:

  • Cumulative temporality indicates that each exported value includes all previous measurements. Another way to look at cumulative temporality is that the start time is always the same. If your application restarts, the value resets to 0 and the start time becomes the time of the restart. 
  • Delta temporality indicates that measurements are reset each time they’re exported, which means you’re seeing the change in a measurement instead of the absolute value. Another way to look at delta temporality is that it has a constantly moving start time. 
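
The difference between the two temporalities is easier to see with numbers. The sketch below is plain Python with hypothetical values; `to_cumulative` and `to_delta` are illustrative helper names, not OpenTelemetry functions. It assumes the series starts at 0 and that the application does not restart.

```python
from itertools import accumulate

# Hypothetical number of telescopes sold during each export interval.
sold_per_interval = [4, 0, 3, 5]


def to_cumulative(deltas):
    """Cumulative temporality: each point is the total since a fixed start time."""
    return list(accumulate(deltas))


def to_delta(cumulative):
    """Delta temporality: each point is the change since the previous export."""
    previous = 0
    deltas = []
    for value in cumulative:
        deltas.append(value - previous)
        previous = value
    return deltas


cumulative_points = to_cumulative(sold_per_interval)
delta_points = to_delta(cumulative_points)

print(cumulative_points)  # [4, 4, 7, 12]
print(delta_points)       # [4, 0, 3, 5]
```

Both series describe the same sales. The only difference is whether each exported point carries the history with it.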

Monotonicity 

There are two kinds of values related to monotonicity:

  • Monotonic refers to a value that is always increasing. For example, your total number of telescopes sold over time is monotonic. 
  • Non-monotonic refers to a value that can both increase and decrease over time. For example, the number of telescopes sold from day to day will likely fluctuate, so the value is non-monotonic. (Although as a business owner, you’d certainly like for this to be a monotonic sum!) 
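
As a quick illustration, the check below treats a series as monotonic when it never decreases from one point to the next. That "never decreases" reading is an assumption of this sketch, and the function name and data are hypothetical.

```python
def is_monotonic(values):
    """Return True if the series never decreases from one point to the next."""
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


running_total_sold = [4, 4, 7, 12]  # total telescopes sold so far
sold_each_day = [4, 0, 3, 5]        # telescopes sold on each individual day

print(is_monotonic(running_total_sold))  # True
print(is_monotonic(sold_each_day))       # False
```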

Now, let’s take a look at the metric types that result from aggregation: 

Sum

A sum is an addition of values. A sum can have a temporality of either:

  • Cumulative (The value keeps accumulating from one export to the next.)
  • Delta (The value resets to 0 after each export.)

Histogram

A histogram is a distribution of data consisting of buckets and counts of instances within those buckets. In OpenTelemetry, the term histogram refers to both an instrument type as well as an aggregation, and there are two types of histograms that are supported:

  • Explicit bucket histograms have buckets that are explicitly defined during initialization. 
  • Exponential histograms also have buckets and bucket counts, but the bucket boundaries are computed based on an exponential scale. Learn more at OpenTelemetry exponential histograms.
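
Here is a minimal sketch of an explicit bucket histogram in plain Python. The boundaries and latency values are hypothetical, `record_all` is an illustrative name, and the sketch assumes each boundary is an inclusive upper bound. It leaves out the sum, count, min, and max that a real histogram data point also carries.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries in milliseconds, fixed at initialization.
BOUNDARIES_MS = [100, 250, 500, 1000]


def record_all(latencies_ms, boundaries):
    """Count measurements per bucket; each boundary is an inclusive upper bound."""
    counts = [0] * (len(boundaries) + 1)  # the last bucket catches everything above
    for latency in latencies_ms:
        counts[bisect_left(boundaries, latency)] += 1
    return counts


search_latencies_ms = [42, 100, 180, 320, 760, 1500, 90]
bucket_counts = record_all(search_latencies_ms, BOUNDARIES_MS)

# Buckets: <=100, (100, 250], (250, 500], (500, 1000], >1000
print(bucket_counts)  # [3, 1, 1, 1, 1]
```

Seven measurements become five counts, and you can still tell how many searches were fast and how many were slow.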

Last value

A last value aggregation reports only the most recent measurement. Temporality does not matter here. Since you're always just sending the last value, it doesn’t matter if you reset the state or not. 

If you're looking for a more in-depth guide to some of these concepts, see Understand and query high cardinality metrics in the New Relic documentation.

Why use OpenTelemetry for metrics

I’m going to answer this question by talking about the design goals of OpenTelemetry: 

  • To provide the ability to correlate metrics to other signals. For example, you can correlate metrics to traces via exemplars, and enrich metrics attributes with baggage and context.
  • To provide a path for OpenCensus users to migrate to OpenTelemetry. This was part of the original goal when OpenCensus and OpenTracing were merged to create OpenTelemetry back in 2019.
  • To work with existing metrics instrumentation protocols and standards, with the minimum goal being to provide full support for Prometheus and StatsD.

The biggest benefit is that OpenTelemetry grants you freedom from vendor lock-in. You can instrument your applications once, and then send your telemetry to the backends of your choice.

How to choose instruments in OpenTelemetry

You use an instrument to report measurements. In OpenTelemetry, there are six metric instruments, and each has an aggregation strategy (also called an aggregation) that reflects the intended use of the measurements it reports. The instrument type you select determines how the measurements are aggregated, and ultimately the type of metric that is exported, which affects the way you can query and analyze it. 

So how do you choose the right metric instrument? Let’s look at it from another angle: different aggregations support different modes of analysis. For example, maybe you want to analyze the latency of search results when your customers search for a product on your site. You’d want the measurements in a format that gives you insight. A sum of these measurements doesn't make sense, because a total of all response times tells you nothing useful. A histogram, on the other hand, shows you the distribution of search response times, so you’d select an instrument that produces a histogram. 

Here's a brief framework for how to select an instrument:

  • How do you want to analyze the data?
  • Does the measurement need to be done synchronously? 
    • When you use a synchronous instrument, an instance of the instrument is called when the event that you're measuring occurs.
    • In contrast, an asynchronous instrument only records a measurement once per collection interval, through a callback that you register.
    • Whether to use one or the other boils down to convenience: Is it easier for you to access the data at the point of instrumentation, or would you rather have it reported on a specified interval? 
  • Are the values that the instrument records monotonic?
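
The synchronous and asynchronous styles can be sketched with two toy classes. These are not the OpenTelemetry API: `ToyCounter`, `ToyObservableGauge`, and the `collect` method are hypothetical stand-ins written only to show who calls whom and when.

```python
class ToyCounter:
    """Synchronous: the application calls add() at the moment the event happens."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        if amount < 0:
            raise ValueError("a counter is monotonic and cannot decrease")
        self.total += amount


class ToyObservableGauge:
    """Asynchronous: a callback is invoked only when a collection cycle runs."""

    def __init__(self, callback):
        self._callback = callback

    def collect(self):
        return self._callback()


# Synchronous: three checkout events, each recorded as it occurs.
checkouts = ToyCounter()
for _ in range(3):
    checkouts.add(1)

# Asynchronous: the current queue size is read only at collection time.
queue = ["order-1", "order-2"]  # hypothetical work queue
queue_size = ToyObservableGauge(lambda: len(queue))

first_reading = queue_size.collect()   # collection cycle 1
queue.append("order-3")                # state changes between collections
second_reading = queue_size.collect()  # collection cycle 2

print(checkouts.total, first_reading, second_reading)  # 3 2 3
```

With the synchronous counter, your code does the work on every event. With the asynchronous gauge, nothing happens until a collection cycle asks for the current value.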

To help you decide which instrument type to use, compare the properties and examples for each instrument: 

Counter

  • Synchronous: Yes
  • Additive: Yes
  • Monotonic: Yes
  • Default aggregation: Sum
  • Example measurements:
    • Number of bytes sent
    • Total orders processed
    • Total checkouts
  • Use when:
    • You want to count things and compute the rate at which things happen.
    • The sum of the things is more meaningful than the individual values.

Up down counter

  • Synchronous: Yes
  • Additive: Yes
  • Monotonic: No
  • Default aggregation: Sum
  • Example measurements:
    • Number of open connections
    • Size of a queue (for example, a work queue)
  • Use when:
    • You don’t want to analyze the change over time. This instrument is suitable for monitoring quantities that go up and down during a request, such as total active requests, queue size, and memory in use.

Histogram

  • Synchronous: Yes
  • Additive: No
  • Monotonic: No
  • Default aggregation: Explicit bucket histogram
  • Example measurements:
    • HTTP server response times
    • Client duration
    • Request rate
    • Latency of the search results that are returned
  • Use when:
    • You want to analyze the distribution of measurements (for example, to evaluate SLAs and identify trends).
    • You want to compute min, max, and average response time.
    • You need a heatmap or percentiles.

Async counter

  • Synchronous: No
  • Additive: Yes
  • Monotonic: Yes
  • Default aggregation: Sum
  • Example measurements:
    • CPU time
    • Cache hits and misses
  • Use when:
    • Monotonic sums are unnecessary or expensive on a per-request basis, such as when a system call is required.

Async up down counter

  • Synchronous: No
  • Additive: Yes
  • Monotonic: No
  • Default aggregation: Sum
  • Example measurements:
    • Memory utilization
    • Process heap size
    • Number of active shards
  • Use when:
    • You want to report measurements that record the rise and fall of sums.

Gauge

  • Synchronous: No
  • Additive: No
  • Monotonic: No
  • Default aggregation: Last value
  • Example measurements:
    • CPU utilization
    • Temperature of hardware at this point in time
  • Use when:
    • You’re reporting data that's not useful to aggregate across dimensions, and you have access to measurements asynchronously.

Note: While the OpenTelemetry API provides a default aggregation for each instrument, you can override it using the Views API, which I won't detail here because this is just a 101. 

Considerations

Before you begin implementing metrics, there are a couple of things to take into consideration. Let’s start by reviewing the following two concepts in the context of observability:

Dimensions

A dimension refers to an attribute associated with the metrics. For example, if you’re using an instrument to count the number of customers in your telescope shop, you might also want to record information about the customers, such as their location. You'd add this information as a dimension on the measurement.

Dimensions are useful because you can use them to aggregate your data in different ways, as well as to filter on your data. 
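
As a rough sketch of what aggregating by a dimension means, the plain Python below groups hypothetical customer measurements by an attribute. The data, the attribute keys, and the `sum_by` helper are all made up for illustration; in practice your backend's query language does this grouping for you.

```python
from collections import defaultdict

# Hypothetical measurements: (customer count, attributes recorded with it).
measurements = [
    (1, {"country": "US", "city": "Austin"}),
    (1, {"country": "US", "city": "Portland"}),
    (1, {"country": "DE", "city": "Berlin"}),
    (1, {"country": "US", "city": "Austin"}),
]


def sum_by(measurements, dimension):
    """Aggregate measurement values, grouped by one dimension."""
    totals = defaultdict(int)
    for value, attributes in measurements:
        totals[attributes[dimension]] += value
    return dict(totals)


customers_by_country = sum_by(measurements, "country")
customers_by_city = sum_by(measurements, "city")

print(customers_by_country)  # {'US': 3, 'DE': 1}
print(customers_by_city)     # {'Austin': 2, 'Portland': 1, 'Berlin': 1}
```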

Cardinality

Metric cardinality refers to the number of unique values that a dimension takes on a metric. Using the example of capturing the locations of our customers, let’s say you're collecting the country for each customer.

If your customers happen to be from the same one or two countries, that would be low cardinality. If you collect their city instead, and they are from many different cities, that would result in high cardinality, because the number of unique values has increased. Then, imagine that your app is inundated with traffic because you ran a sale, and now you have customers from all over the world purchasing telescopes from your site. This sudden surge in unique values, and the extra load it puts on the system, is sometimes called a cardinality explosion. 
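
One way to get a feel for cardinality is to count the unique combinations of attribute values you would be sending. The sketch below uses hypothetical customer data and an illustrative `cardinality` helper. It assumes every unique combination produces its own time series, which is how many backends count it.

```python
def cardinality(attribute_sets, keys):
    """Count the unique combinations of values for the given attribute keys."""
    return len({tuple(attrs[key] for key in keys) for attrs in attribute_sets})


# Hypothetical attributes recorded for five customers.
customers = [
    {"country": "US", "city": "Austin"},
    {"country": "US", "city": "Portland"},
    {"country": "US", "city": "Austin"},
    {"country": "DE", "city": "Berlin"},
    {"country": "DE", "city": "Hamburg"},
]

country_cardinality = cardinality(customers, ["country"])
city_cardinality = cardinality(customers, ["country", "city"])

print(country_cardinality)  # 2
print(city_cardinality)     # 4
```

Adding the city doubles the number of series in this tiny example. With real traffic, the gap between the two grows much faster.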

When collecting telemetry, managing cardinality is typically a concern. One of the challenges of cardinality is increased storage cost. Additionally, some backends, including New Relic, impose cardinality limits, which might result in your data getting dropped. Read more in our documentation on how to understand and query high cardinality metrics.