
Distributed tracing illuminates the intricate workings of an application's infrastructure, offering a lens into the minutiae of requests and transactions. By consolidating a record of events across the breadth of a distributed system, traces become indispensable for diagnosing root causes and identifying performance anomalies. Yet, their rich detail, while invaluable initially, presents sustainability challenges for long-term monitoring due to escalating costs and diminishing returns over time.

Enter the realm of metrics, where the aggregation of data over time provides a comprehensive snapshot of system health and performance, all while being significantly more cost-effective for prolonged observation. This shift not only promises efficiency but also ensures a sustained, broader perspective on system dynamics.

Imagine transforming trace data into enduring, insightful metrics: a transformation that combines the immediate, detailed insight of tracing with the long-term, aggregated perspective of metrics.

This blog post guides you through the process of converting traces into metrics using OpenTelemetry, demonstrating a straightforward path to achieving long-lasting, valuable metrics from your trace data.

Problems better solved using metrics

Costs for your observability solution are dependent on the vendors you use and the scale of your infrastructure, but it’s no secret that observability can get expensive. For a comprehensive observability solution, you need distributed tracing to understand how individual requests and transactions flow through your infrastructure, metrics to capture the health and performance of your systems over time, and logs to provide detailed, contextual information about specific events and errors within your systems. 

Traces, being highly granular and detailed, demand considerable storage space and computational resources for processing and analysis. By aggregating this data into metrics, organizations can significantly reduce the volume of data stored and processed, thereby lowering infrastructure costs. 


For example, ingesting 3,000 traces with 7 spans each generated 0.0176 GB of ingest. In comparison, the transformed metrics sent by the Collector for the same number of traces resulted in only 0.0095 GB of ingest, nearly a 50% decrease. Try benchmarking your ingest with this repo.

Capacity planning and resource allocation

Trace data can be used to inform capacity planning by identifying bottlenecks and understanding dependencies. You can identify where resources might be insufficient and predict how changes in one area might affect others. For effective capacity planning, metrics provide invaluable insights into usage trends and resource consumption, facilitating informed decisions regarding scaling and resource allocation. This foresight helps in optimizing costs and ensuring that the system can handle future demands.

Alerting and anomaly detection

Metrics are essential for setting up efficient alerting systems. By establishing thresholds based on metric data, teams can be promptly notified of anomalies or deviations from normal behavior, allowing for swift corrective actions. It’s easy to set a threshold for CPU usage or response time. Traces, being highly detailed and specific to individual requests, do not lend themselves as easily to creating effective thresholds for alerting.

Identifying performance bottlenecks

With traces, you get visibility into individual transactions or requests, which doesn’t translate into an understanding of overall utilization patterns without significant aggregation and analysis. Metrics provide both granular views (for example, per-service or per-container metrics) and aggregate views (for example, total CPU usage across a cluster). This flexibility lets you drill down into specific areas of concern or maintain a broad overview of system health.

Transforming traces into metrics via the OpenTelemetry Collector

OpenTelemetry is a robust, open-source observability framework equipped with a toolkit for generating, gathering, and managing traces, metrics, and logs. A core component of the framework is the OpenTelemetry Collector, a versatile tool that simplifies the orchestration of data pipelines for these telemetry types. After you collect metrics, traces, and logs via OTel agents like the Collector, you can transform the data via an extract, transform, and load (ETL) pipeline. These pipelines are built from four kinds of components: receivers, processors, exporters, and connectors. Today we’ll be talking about the span metrics processor, the span metrics connector, and what benefits the span metrics connector provides over the processor.
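As a rough illustration of how these pieces fit together, here’s a minimal trace pipeline that receives OTLP data, batches it, and exports it over OTLP/HTTP. The backend endpoint is a placeholder, not part of the demo discussed below.

receivers:
  otlp:                        # accepts traces, metrics, and logs over OTLP
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                    # groups telemetry into batches before export

exporters:
  otlphttp:
    endpoint: https://otlp.example.com:4318   # placeholder backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]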

Span metrics processor

The span metrics processor was initially devised to aggregate metrics from span data in OpenTelemetry traces, aiming to bridge trace telemetry with metric analysis. However, its design, which combined the OpenTelemetry data model with the Prometheus metric and attribute naming conventions, inadvertently compromised its ability to remain agnostic to exporter logic. This approach clashed with OpenTelemetry's core mission of standardizing telemetry data across different tools and platforms. Recognizing these challenges, the OpenTelemetry community has since deprecated the span metrics processor in favor of the span metrics connector.

So what is a connector?

A connector sends telemetry data between different Collector pipelines by connecting them: it acts as an exporter to one pipeline and a receiver to another. Another benefit of the connector component is that it simplifies the Collector configuration by joining two telemetry pipelines; with processors alone, you would have to manually configure additional receivers and exporters to wire two pipelines together.
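As a minimal sketch of that dual role, here’s the forward connector (which also appears in the tail-sampling section later) bridging two trace pipelines: it’s listed as an exporter in the first pipeline and as a receiver in the second. The pipeline names are arbitrary, and the otlp, batch, and otlphttp components are assumed to be configured as in the earlier snippet.

connectors:
  forward:

service:
  pipelines:
    traces/raw:
      receivers: [otlp]
      exporters: [forward]     # the connector is this pipeline's exporter...
    traces/processed:
      receivers: [forward]     # ...and this pipeline's receiver
      processors: [batch]
      exporters: [otlphttp]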

Span metrics connector

The span metrics connector is a port of the span metrics processor, created to address some of the issues found in the processor. The improvements include aligning naming conventions with the OTel data model, supporting OTel exponential histograms, and generating metric resource scopes that correspond to the spans’ resource scopes, which means more metrics are generated now.
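For example, to have the connector emit exponential histograms instead of explicit-bucket ones, the histogram block can be switched like this. This is a minimal sketch; the max_size value is just illustrative.

connectors:
  spanmetrics:
    histogram:
      exponential:         # emit OTel exponential histograms
        max_size: 160      # illustrative cap on the number of buckets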

Configuring the Collector

The basic setup for the span metrics connector is pretty straightforward. Below, you’ll find the configuration used in a demo I created for the span metrics connector. Once you’ve configured the connector itself, you need to include it as an exporter for the traces pipeline and a receiver for the metrics pipeline. This configuration sends both the original tracing data and the metrics transformed from that data to an observability backend, which in this case is New Relic.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp:
    endpoint: https://otlp.nr-data.net:4318
    headers:
      api-key: YOUR_NEW_RELIC_LICENSE_KEY   # replace with your own license key

connectors:
  spanmetrics:
    histogram:
      explicit:                # explicit-bucket histograms with the default buckets
    dimensions:                # extra span attributes attached to the generated metrics
      - name: http.method
        default: "GET"
      - name: http.status_code

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp, spanmetrics]   # the connector consumes the trace stream here...
    metrics:
      receivers: [spanmetrics]             # ...and feeds the transformed metrics in here
      exporters: [otlphttp]

You can find an example of this configuration here.

Sampling

If you’re using distributed tracing, there’s a high chance you’re also using some form of sampling. Whether you’re using head sampling, tail sampling, or both, it’s important to understand how to maximize the value of this process.

Implementing span metrics connector with head-based sampling

The OpenTelemetry SDK’s default sampler is a composite of two samplers: the ParentBased sampler, which requires a parameter specifying which sampler to use for root spans, and the AlwaysOn sampler, which fills that role by default. As its name implies, the AlwaysOn sampler samples every root span, and the ParentBased sampler then follows the parent’s decision for child spans, so every trace is recorded. With this default, you will get metrics from all traces sent to the Collector when using the span metrics connector. Adjusting the default sampler configuration will affect the accuracy of the metrics transformed from the sampled trace data. If you want metrics from all your traces and still want to sample your trace data, you can move sampling into the Collector with the probabilistic sampler processor. This processor, by default, samples tracing data at a configured percentage based on trace ID hashing.

You can find an example of this configuration here.
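One way to wire this up (the linked demo may differ in its details) is to feed the same OTLP receiver into two trace pipelines, so the span metrics connector sees every span while only a percentage of traces reaches the backend:

processors:
  probabilistic_sampler:
    sampling_percentage: 25    # keep roughly 25% of traces, chosen by trace ID hashing

service:
  pipelines:
    traces/unsampled:          # every trace feeds the connector, so the metrics stay complete
      receivers: [otlp]
      exporters: [spanmetrics]
    traces/sampled:            # only the sampled traces are exported to the backend
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlphttp]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlphttp]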

Implementing span metrics connector with tail-based sampling

Tail sampling gives you the option to sample your traces based on specific criteria derived from different parts of a trace. Using this technique, you can transform all your traces into metrics and then sample only the interesting traces. This provides the most control over cost, but that control comes with added difficulty in implementing and operating tail sampling. In the demo provided, we make use of another connector: the forward connector, which forwards data from one pipeline to another. This lets us generate metrics from all traces and then forward the traces to a second tracing pipeline, where our sampling policies are applied before exporting to an observability backend.

You can find an example of this configuration here.
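Here’s a rough sketch of that layout; the pipeline names and the single error-based policy are illustrative, and the linked demo may differ. The span metrics connector consumes the unsampled trace stream, while the forward connector hands the same stream to a second trace pipeline where the tail_sampling processor applies its policies.

connectors:
  spanmetrics: {}
  forward: {}

processors:
  tail_sampling:
    decision_wait: 10s             # wait for the whole trace before making a decision
    policies:
      - name: keep-errors          # illustrative policy: keep traces that contain errors
        type: status_code
        status_code:
          status_codes: [ERROR]

service:
  pipelines:
    traces/in:                     # all traces arrive here and generate metrics
      receivers: [otlp]
      exporters: [spanmetrics, forward]
    traces/sampled:                # sampling policies are applied before export
      receivers: [forward]
      processors: [tail_sampling]
      exporters: [otlphttp]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlphttp]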

Conclusion

The transformation of distributed traces into metrics offers a powerful method for enhancing observability and monitoring strategies. This practice not only provides a more efficient way to manage telemetry data but also ensures that organizations can maintain high levels of system performance and reliability. Adopting this approach can lead to significant cost savings and a better understanding of system performance, ultimately improving service quality and customer satisfaction.

When it comes to your observability strategy, distributed traces and metrics both play an important role. Learning how to leverage their unique strengths and weaknesses will help you get the most value from your telemetry data.