OpenTelemetry metrics are well on their way to general availability, and you’ll want to understand this signal to implement it as part of your team’s observability strategy. Currently, you can collect some application runtime metrics out of the box with several language SDKs, and you can use the host metrics receiver to generate metrics about your host system.
If you want to generate and collect metrics beyond those, you’ll need to learn about metric instruments, types, and their use cases, including what you need to consider when choosing one. For example, you might want to know the number of active users of your application so you can better understand customer behavior.
To help frame these concepts, you can follow along with examples in our fork of the OpenTelemetry Community Demo application, which is an online shop selling a range of tools for stargazing.
If you're familiar with general metrics concepts, feel free to move ahead to the metric instruments section.
What is a metric?
Before we get started with OpenTelemetry metrics, let’s start with metrics in general. A metric is simply a measurement of a service that is captured at runtime. You can aggregate these measurements further to identify trends and patterns over time. Here are standard application and resource utilization metrics that are generally important when you’re developing apps:
- Throughput
- Response time/latency
- Error rate
- CPU utilization
- Memory utilization
Getting a little more complex, you might be interested in custom metrics to better understand your app and user behavior, or to track specific key performance indicators (KPIs). For example, if you’re the owner of the astronomy online store, here are some example custom metrics you might want to collect:
- Total number of checkouts
- Latency of search results being returned
- The distribution of orders by size
- Number of abandoned shopping carts per day
Why are metrics useful?
More specifically, you might be wondering why metrics are useful for observability in general, and what characteristics make them more useful than logs or traces.
All three signals (metrics, logs, and traces) are useful for monitoring the overall health and performance of your application. You can query them directly or, more commonly, use them to power data visualizations.
But here’s where data from metrics really shines:
- Data volume reduction: Exporting and analyzing measurements individually can be expensive. By aggregating measurements, you can reduce your overall data volume while still gaining insight from the data.
- Alerts: Metrics form the basis of service level indicators (SLIs), which measure the performance of an application. You use the indicators to set service level objectives (SLOs) that teams use to calculate their error budgets. Metrics make a big difference if you use them to create alerts for breached SLOs.
Overview of OpenTelemetry metrics concepts
To help you understand the metric instruments available with OpenTelemetry, let’s review a few important concepts at a high level.
Instrumentation
Instrumentation refers to adding code to your application to collect telemetry data. In metrics, this involves adding code to your application to measure specific operations or events, such as HTTP requests, database queries, or function execution times.
Metric instruments
OpenTelemetry provides several types of metric instruments for capturing different kinds of data. Common metric instruments include counters, up down counters, histograms, and gauges. Each instrument type is suited for different use cases and provides different insights into your application's behavior.
Labels/Tags
Labels or tags, which OpenTelemetry calls attributes, are key-value pairs associated with metrics that provide additional context or dimensions for the data. For example, you might add attributes to a metric representing HTTP request latency to differentiate between endpoints or response status codes.
Exporters
Once metrics are collected, they need to be exported to a monitoring backend for storage, visualization, and analysis. OpenTelemetry provides exporters for various monitoring systems. These exporters allow you to integrate OpenTelemetry with your existing monitoring infrastructure.
Sampling
Sampling is selecting a subset of data to collect and export, which helps manage the volume of telemetry data generated by your application. In OpenTelemetry, sampling strategies such as probabilistic sampling and tail-based sampling apply mainly to traces. Metrics typically control data volume through aggregation instead.
Meter provider
A Meter Provider is responsible for creating and managing Meter instances in OpenTelemetry. It serves as the entry point for interacting with the Metrics API and allows applications to create and configure meters for collecting metric data.
Meter
A Meter is an abstraction representing a source of metrics in OpenTelemetry. It provides methods for creating different types of metric instruments, such as counters, up down counters, histograms, and gauges, which you then use to record measurements. Meters are typically created and managed by Meter Providers.
API
The OpenTelemetry metrics API defines a set of interfaces and classes for working with metrics in OpenTelemetry. It provides methods for creating meters, recording metric data, and working with metric instruments and labels/tags. The Metrics API abstracts away the details of the underlying telemetry implementation, allowing applications to work with metrics in a consistent and portable manner.
SDK
The OpenTelemetry metrics SDK is an implementation of the Metrics API that provides the underlying functionality for collecting, processing, and exporting metric data in OpenTelemetry. It includes components for instrumenting applications, recording metric data, performing aggregation, and exporting data to monitoring backends. The Metrics SDK is responsible for integrating with the underlying telemetry infrastructure and handling the lifecycle of metric data collection and processing.
Now some key mathematical concepts:
Aggregation
Aggregation is the process of combining multiple measurements into one metric point. For example, let's say you have a set of 30 measurements, each representing a daily total number of telescopes sold. You could aggregate these totals to produce a single number, which would tell you how many of those telescopes you sold in the given time period (30 days).
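As a minimal sketch of that example, the Python below combines 30 daily totals into one metric point. The sales figures and the `aggregate_sum` helper are made up for illustration and aren't part of any OpenTelemetry API.

```python
# Hypothetical daily telescope sales for a 30-day period (made-up numbers).
daily_totals = [3, 5, 2, 4, 6, 1] * 5  # 30 measurements


def aggregate_sum(measurements):
    """Combine many measurements into a single metric point."""
    return {"count": len(measurements), "sum": sum(measurements)}


point = aggregate_sum(daily_totals)
print(point)  # {'count': 30, 'sum': 105}
```

The 30 individual measurements are gone after aggregation. Only the count and the total remain, which is where the data volume reduction comes from.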
Temporality
The notion of temporality dictates how you aggregate. It relates to whether the reported values of additive quantities—values that are summed together—incorporate previous measurements or not. There are two types of temporality:
- Cumulative temporality indicates that measurements are accumulated when exported. Another way to look at cumulative temporality is that the start time is always the same. If your application restarts, it would reset to 0 and the start time would begin from the time of the application restart.
- Delta temporality indicates that measurements are reset each time they’re exported, which means you’re seeing the change in a measurement instead of the absolute value. Another way to look at delta temporality is that it has a constantly moving start time.
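To make the difference concrete, here's a small sketch that converts the same series between the two temporalities. The `to_cumulative` and `to_delta` helpers are hypothetical names for this example, and the sketch assumes the application never restarts, so there are no resets to handle.

```python
from itertools import accumulate


def to_cumulative(deltas):
    """Delta -> cumulative: each point is the running total since a fixed start time."""
    return list(accumulate(deltas))


def to_delta(cumulative):
    """Cumulative -> delta: each point is the change since the previous export."""
    previous = 0
    deltas = []
    for value in cumulative:
        deltas.append(value - previous)
        previous = value
    return deltas


# Telescopes sold during each export interval (made-up numbers).
deltas = [4, 0, 7, 2]
cumulative = to_cumulative(deltas)

print(cumulative)            # [4, 4, 11, 13]
print(to_delta(cumulative))  # [4, 0, 7, 2]
```

Both series describe the same sales. The cumulative points share one start time, while each delta point covers only its own interval.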
Monotonicity
There are two kinds of values related to monotonicity:
- Monotonic refers to a value that never decreases. For example, your total number of telescopes sold over time is monotonic.
- Non-monotonic refers to a value that can both increase and decrease over time. For example, the number of telescopes sold from day to day will likely fluctuate, so the value is non-monotonic. (Although as a business owner, you’d certainly like for this to be a monotonic sum!)
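A quick way to check the two telescope examples is to compare each point with the one before it. This sketch treats "monotonic" as "never decreases", and both the `is_monotonic` helper and the numbers are invented for illustration.

```python
def is_monotonic(values):
    """Return True if the series never decreases from one point to the next."""
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


sold_per_day = [3, 5, 0, 4, 8]   # fluctuates from day to day
total_sold = [3, 8, 8, 12, 20]   # running total of the same sales

print(is_monotonic(total_sold))    # True
print(is_monotonic(sold_per_day))  # False
```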
Now, let’s take a look at the metric types that result from aggregation:
Sum
A sum is an addition of values. A sum can have a temporality of either:
- Cumulative (It never resets.)
- Delta (It can reset and bring the state back to 0.)
Histogram
A histogram is a distribution of data consisting of buckets and counts of instances within those buckets. In OpenTelemetry, the term histogram refers to both an instrument type as well as an aggregation, and there are two types of histograms that are supported:
- Explicit bucket histograms have buckets that are explicitly defined during initialization.
- Exponential histograms also have buckets and bucket counts, but the bucket boundaries are computed based on an exponential scale. Learn more at OpenTelemetry exponential histograms.
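Here's a rough sketch of how an explicit bucket histogram turns raw measurements into bucket counts. The boundaries and latencies are made up, the `explicit_bucket_histogram` helper is hypothetical, and the sketch assumes each boundary is the inclusive upper bound of its bucket, which is my understanding of the OpenTelemetry data model.

```python
from bisect import bisect_left


def explicit_bucket_histogram(values, boundaries):
    """Count values per bucket; each boundary is the inclusive upper bound of a bucket."""
    # One extra bucket catches everything above the last boundary.
    counts = [0] * (len(boundaries) + 1)
    for value in values:
        counts[bisect_left(boundaries, value)] += 1
    return counts


boundaries = [50, 100, 250, 500]  # milliseconds, defined up front
latencies_ms = [12, 48, 50, 75, 120, 240, 260, 900]

counts = explicit_bucket_histogram(latencies_ms, boundaries)
print(counts)  # [3, 1, 2, 1, 1]
```

Eight measurements collapse into five counts. You lose the exact values, but you keep the shape of the distribution.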
Last value
A last value aggregation keeps only the most recent measurement. Temporality does not matter here. Since you're always just sending the last value, it doesn’t matter if you reset the state or not.
If you're looking for a more in-depth guide to some of these concepts, see Understand and query high cardinality metrics in the New Relic documentation.
Why use OpenTelemetry for metrics
I’m going to answer this question by talking about the design goals of OpenTelemetry:
- To provide the ability to correlate metrics to other signals. For example, you can correlate metrics to traces via exemplars, and enrich metrics attributes with baggage and context.
- To provide a path for OpenCensus users to migrate to OpenTelemetry. This was part of the original goal when OpenCensus and OpenTracing were merged to create OpenTelemetry back in 2019.
- To work with existing metrics instrumentation protocols and standards, with the minimum goal being to provide full support for Prometheus and StatsD.
The biggest benefit is that OpenTelemetry grants you freedom from vendor lock-in. You can instrument your applications once, and then send your telemetry to the backends of your choice.
How to choose instruments in OpenTelemetry
You use an instrument to report measurements. In OpenTelemetry, there are six metric instruments, and each has an aggregation strategy (also called an aggregation) that reflects the intended use of the measurements it reports. The instrument type you select determines how the measurements are aggregated, and ultimately the type of metric that is exported, which affects the way you can query and analyze it.
So how do you choose the right metric instrument? Let’s look at it from another angle: different aggregations support different modes of analysis. For example, maybe you want to analyze the latency of search results being returned when your customers are searching for a product on your site. You’d want a format for the measurements to be useful for you to obtain insight. In this case, a sum of these measurements doesn't make sense, because you can’t figure out anything useful from that value. You’d want a histogram, so you can see a distribution of search response times. So, you’d want to select an instrument that will produce a histogram.
Here's a brief framework for how to select an instrument:
- How do you want to analyze the data?
- Does the measurement need to be done synchronously?
- When you use a synchronous instrument, an instance of the instrument is called when the event that you're measuring occurs.
- In contrast, an asynchronous instrument registers a callback, which records a measurement only once per collection interval.
- Whether to use one or the other boils down to convenience: Is it easier for you to access the data at the point of instrumentation, or would you rather have it reported on a specified interval?
- Are the values that the instrument records monotonic?
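The synchronous versus asynchronous question can be sketched with two toy classes. `SyncCounter` and `AsyncGauge` are hypothetical stand-ins written for this example, not the OpenTelemetry API. They only show where the measurement happens in each style.

```python
class SyncCounter:
    """Hypothetical synchronous instrument: your code calls add() when the event occurs."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount


class AsyncGauge:
    """Hypothetical asynchronous instrument: a callback is invoked once per collection."""

    def __init__(self, callback):
        self._callback = callback

    def collect(self):
        return self._callback()


checkouts = SyncCounter()
cart = ["telescope", "lens"]


def handle_checkout():
    checkouts.add(1)  # recorded inline, at the moment of the event


# The gauge reads the current cart size only when a collection happens.
items_in_cart = AsyncGauge(lambda: len(cart))

handle_checkout()
handle_checkout()
cart.append("tripod")
observed = items_in_cart.collect()

print(checkouts.total)  # 2
print(observed)         # 3
```

The counter sees every checkout as it happens. The gauge never sees the intermediate cart sizes, only the value at collection time.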
To help you decide which instrument type to use, take a look at this table, which includes properties and examples for each instrument:
| Instrument | Synchronous | Additive | Monotonic | Default aggregation | Example measurements | Use when… |
| --- | --- | --- | --- | --- | --- | --- |
| Counter | ✅ | ✅ | ✅ | Sum | | |
| Up down counter | ✅ | ✅ | ❌ | Sum | | |
| Histogram | ✅ | ❌ | ❌ | Explicit bucket histogram | | |
| Async counter | ❌ | ✅ | ✅ | Sum | | |
| Async up down counter | ❌ | ✅ | ❌ | Sum | | |
| Gauge | ❌ | ❌ | ❌ | Last value | | |
Note: While the OpenTelemetry SDK provides a default aggregation for each instrument, you can override it using views, which I won't detail here because this is just a 101.
Considerations
Before you begin implementing metrics, there are a couple of things to take into consideration. Let’s start by reviewing the following two concepts in the context of observability:
Dimensions
A dimension refers to an attribute associated with the metrics. For example, if you’re using an instrument to count the number of customers in your telescope shop, you might also want to record information about the customers, such as their location. You'd add this information as a dimension on the measurement.
Dimensions are useful because you can use them to aggregate your data in different ways, as well as to filter on your data.
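As a small illustration, the sketch below attaches dimensions to each measurement and then groups and filters on them. The attribute names (`country`, `product`), the data, and the `sum_by` helper are all assumptions made up for this example.

```python
from collections import defaultdict

# Each measurement is (value, dimensions). All values here are made up.
measurements = [
    (1, {"country": "US", "product": "telescope"}),
    (1, {"country": "DE", "product": "telescope"}),
    (1, {"country": "US", "product": "lens"}),
    (1, {"country": "US", "product": "telescope"}),
]


def sum_by(measurements, key):
    """Aggregate measurement values, grouped by one dimension."""
    totals = defaultdict(int)
    for value, attributes in measurements:
        totals[attributes[key]] += value
    return dict(totals)


by_country = sum_by(measurements, "country")
by_product = sum_by(measurements, "product")

# Filtering: keep only the measurements from one country.
us_only = [m for m in measurements if m[1]["country"] == "US"]

print(by_country)    # {'US': 3, 'DE': 1}
print(by_product)    # {'telescope': 3, 'lens': 1}
print(len(us_only))  # 3
```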
Cardinality
Metric cardinality refers to the number of unique combinations of dimension values recorded on a metric. Using the example of capturing the locations of our customers, let’s say you're collecting the country for each customer.
If your customers happen to be from the same one or two countries, that would be low cardinality. If you collect their city instead, and they are from many different cities, that would result in high cardinality, because the number of unique values has increased. Then, imagine that your app is inundated with traffic because you ran a sale, and now you have customers from all over the world purchasing telescopes from your site. This would sharply increase the number of unique values, and with it the load on the system, which is sometimes called a cardinality explosion.
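One way to see this is to count the distinct attribute combinations on a metric, since each combination typically becomes its own series to store. The `cardinality` helper and the customer data below are hypothetical.

```python
def cardinality(attribute_sets):
    """Count the distinct attribute combinations seen on a metric."""
    return len({tuple(sorted(attrs.items())) for attrs in attribute_sets})


# Recording only the country: few unique combinations.
by_country = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]

# Recording the city as well: more unique combinations for similar traffic.
by_city = [
    {"country": "US", "city": "Portland"},
    {"country": "US", "city": "Austin"},
    {"country": "DE", "city": "Berlin"},
    {"country": "US", "city": "Portland"},
]

print(cardinality(by_country))  # 2
print(cardinality(by_city))     # 3
```

With real traffic the gap grows quickly, because every new city adds another combination while the list of countries stays short.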
When collecting telemetry, managing cardinality is typically a concern. One of the challenges of cardinality is increased storage cost. Additionally, some backends, including New Relic, impose cardinality limits, which might result in your data getting dropped. Read more in our documentation on how to understand and query high cardinality metrics.
Next steps
Now that you’ve got a basic foundation about metrics in OpenTelemetry and how to choose an instrument, you’re ready to collect some metric data!
To learn how to implement application runtime metrics and create an instrument in your OpenTelemetry SDK, check out our manual instrumentation tutorial for Java in the New Relic docs. (More languages are in progress!)
To learn more about the power of exponential histograms, check out our blog post on OpenTelemetry exponential histograms.
If you'd like more hands-on work learning about OpenTelemetry with New Relic, check out OpenTelemetry: an open source data collection standard, a 90-minute self-paced course with New Relic University.
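The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.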