Distributed tracing is a powerful diagnostic tool for hybrid and microservices-based environments, because you can investigate performance issues from one place. A distributed trace consolidates records of events that take place across components of a distributed system. 

In this article, you'll learn:

  • What distributed tracing is, and how to use it
  • The structure of distributed traces, including spans and transactions, and examples in New Relic
  • How to pass trace context between services, including the W3C Trace Context Standard
  • The pros and cons of head-based and tail-based trace sampling

What is distributed tracing?

A distributed trace consolidates records of events that take place across components of a distributed system. These events are triggered by a single operation, such as clicking a button on a website, and they cross process, network, and security boundaries. To gain an intuitive understanding of distributed tracing, let’s define each term:

  • Distributed refers to distributed systems, which consist of independent components that communicate through requests to form an application. 
  • Tracing refers to traces, which track the end-to-end path of a request as it travels from service to service.

Distributed tracing is an essential part of a unified application performance monitoring (APM) platform. It provides real-time visibility into the health and performance of your entire application stack when you integrate it with other observability tools, such as metrics, logs, and alerts. Distributed tracing provides two core pieces of information:

  • The path a service request takes across a distributed system
  • The time spent to complete each service request

When you’re monitoring microservices-based architectures, distributed tracing helps pinpoint where failures occur and what causes poor performance. Here's an illustration of how distributed tracing works in New Relic:

Thinking about how to use distributed tracing in the real world? While the traces themselves contain all the relevant data for conducting root cause analysis, tracing tools differentiate themselves based on their capabilities for:

  • Ease of deployment and instrumentation
  • Visualization and querying
  • Configuration and flexibility

Why is distributed tracing important?

Distributed tracing is crucial for understanding the behavior of complex, distributed systems. It provides insights into how requests flow through different components, helping to diagnose and troubleshoot issues related to latency, errors, and dependencies.

How does distributed tracing work?

Distributed tracing works by assigning a unique identifier to a request and propagating this identifier across different services involved in processing the request. Each service records information about the request, creating a trace that can be visualized and analyzed.
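As a minimal sketch of that idea, the Python snippet below assigns one identifier at the entry point and passes it along with every downstream call. The service names, the CALLS graph, and the handle and start_request helpers are hypothetical and exist only for illustration; real tracing agents carry the identifier in request headers rather than as a function argument.

```python
import uuid

# Hypothetical call graph: each service calls the services listed for it.
CALLS = {
    "frontend": ["checkout"],
    "checkout": ["inventory", "payments"],
    "inventory": [],
    "payments": [],
}

def handle(service, trace_id, records):
    """Record this service's work under the shared trace ID, then call downstream."""
    records.append({"trace_id": trace_id, "service": service})
    for downstream in CALLS[service]:
        handle(downstream, trace_id, records)  # the ID travels with the request

def start_request(entry_service):
    """Assign the unique identifier once, at the entry point of the request."""
    trace_id = uuid.uuid4().hex
    records = []
    handle(entry_service, trace_id, records)
    return trace_id, records

trace_id, records = start_request("frontend")
for record in records:
    print(record["trace_id"][:8], record["service"])
```

Because every record shares the same identifier, the records can be grouped afterward into a single trace, whichever service produced them.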

Benefits of distributed tracing

Distributed tracing offers substantial benefits for understanding and optimizing distributed systems. Here are some of the most notable ones: 

  1. Enhanced visibility:

Distributed tracing provides unparalleled visibility into the inner workings of your application. Tracing a request's journey across various services gives you insights into the performance, dependencies, and interactions between different components.

  2. Faster issue resolution:

Distributed tracing allows for rapid identification and resolution when performance issues or errors occur. Pinpointing bottlenecks and understanding the flow of requests helps developers address issues more efficiently, reducing downtime and improving user experience.

  3. Optimized resource utilization:

Organizations can optimize resource allocation when they clearly understand how services interact and where resources are being used. This leads to better scalability, efficient resource utilization, and cost savings in cloud-based environments.

  4. Effective debugging:

Distributed tracing acts as a powerful debugging tool. Developers can trace a specific request or transaction, analyze the associated logs, and quickly identify the source of errors or unexpected behaviors, streamlining the debugging process.

  5. Performance monitoring and trend analysis:

Collecting and analyzing traces over time allows teams to monitor application performance trends. This proactive approach identifies potential issues before they impact users, enabling continuous improvement and optimization.

  6. Improved user experience:

Ultimately, the benefits of distributed tracing contribute to an improved user experience. Faster response times, reduced errors, and seamless interactions increase customer satisfaction and engagement.

Distributed tracing vs. logging

Understanding the nuances between distributed tracing and logging is vital for effective application monitoring and troubleshooting. Distributed tracing focuses on visualizing the end-to-end journey of requests within a distributed system. It provides a holistic view, aiding in identifying performance bottlenecks and dependencies. On the other hand, logging offers a comprehensive record of application events, errors, and activities, serving purposes such as debugging and compliance.

While distributed tracing is instrumental for high-level performance insights and troubleshooting, logging is essential for detailed analysis and forensic examination of specific incidents. Distributed tracing provides a transaction-level perspective, capturing the flow of requests, while logging offers granularity at the level of individual events. Integrating both practices lets organizations monitor and optimize their applications holistically, balancing high-level overviews with detailed diagnostics.

The structure of distributed traces

In New Relic, distributed traces gather three types of data:

  • A span is a named, timed operation that represents a piece of the workflow. Examples of span operations include datastore queries, browser-side interactions, method-level time tracking, calls to other services, and also Lambda functions. For example, in an HTTP service, you might want a span created at the beginning of an HTTP request and completed when the HTTP server returns a response. Span attributes contain important information about the operation such as duration and host data.
  • A transaction is a logical unit of work in a software application, such as HTTP requests, SQL queries, background processes, message queue activity, and so on. In New Relic, the transaction event includes information about the app, database calls, the duration of the transaction, and any errors that occur.
  • Contextual metadata shows calculations about a trace and the relationships between its spans. It also shows the duration of traces, all entities that are part of a trace, the number of entities that are part of a trace, the trace's start time in milliseconds, as well as the parent/child IDs that represent all of the span relationships within a trace.

More about spans

A span in a distributed trace represents an individual unit of work and the time a service spends processing a request. Traces encapsulate spans in a tree-like structure: more than one child span can belong to a parent span.

To understand spans in distributed tracing, you’ll need to know these concepts:

  • Trace duration is a trace's total duration, determined by the length of time from the start of the earliest span to the completion of the last span.
  • A process entry span is the first span in the execution of a logical piece of code, such as a backend service or Lambda function.
  • A process exit span is a span that is either the parent of an entry span or an external call, identified by attributes prefixed with http. or db.
  • An in-process span represents an internal method call or function that is neither an entry span nor an exit span.
  • A client span represents a call to another entity or external dependency. Currently, there are two client span types. First, datastore client spans have attributes prefixed with db., and second, external client spans have attributes prefixed with http. or have a child span in another process.
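To make the tree structure and the trace duration definition concrete, here is a small Python sketch. The Span class, its field names, and the sample span names and timings are assumptions made for illustration; they are not New Relic's data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: int
    end_ms: int

def trace_duration_ms(spans):
    """Time from the start of the earliest span to the completion of the last span."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)

def children_of(spans, parent_id):
    """Return the spans whose parent is the given span ID."""
    return [s for s in spans if s.parent_id == parent_id]

def print_tree(spans, parent_id=None, depth=0):
    """Print the spans as an indented tree, following parent/child IDs."""
    for span in children_of(spans, parent_id):
        print("  " * depth + f"{span.name} ({span.end_ms - span.start_ms} ms)")
        print_tree(spans, span.span_id, depth + 1)

# Hypothetical trace: one entry span with a datastore call and an external call.
spans = [
    Span("a", None, "GET /checkout", 0, 120),
    Span("b", "a", "db.query orders", 10, 40),
    Span("c", "a", "http.call payments", 45, 110),
    Span("d", "c", "payments: charge card", 50, 105),
]

print_tree(spans)
print("trace duration:", trace_duration_ms(spans), "ms")
```

In this sample, the root span has two children, and the span that calls the payments service is the parent of a span in another process.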

Here’s an example from How trace data is structured in the New Relic docs:

More about transactions

A transaction is a logical unit of work in a software application. Specifically, it refers to the function calls and method calls that make up that unit of work. In the context of application performance monitoring, it often refers to a web transaction that represents activity starting from when the application receives a web request to when the response is sent.

In her blog post explaining distributed tracing, Erika Arnold describes three main ways distributed tracing uses transactions:

  • Analyzing transactions: Tracing monitors transactions that take place throughout the system to gain insights into its performance. Each transaction plays a role in performance, and underperforming services have a knock-on effect on the rest of the services. 
  • Recording transactions: Tracing helps keep track of lots of transactions. Tracing context that comes into a service with a request is propagated to other processes and attached to transaction data. With this context, you can stitch the transactions together later. Since the industry shift from monolith applications to microservices, it’s becoming increasingly important to track transactions across process boundaries where you can’t install APM agents.
  • Describing transactions: Tracing helps measure transactions, providing information such as what transactions took place and how long they lasted.

Passing trace context between services

Trace context refers to a set of HTTP headers in New Relic that propagate data from one service to another, to compose end-to-end traces. Monitoring agents add these HTTP headers to a service's outbound requests. HTTP headers identify software traces and carry identifying information as they travel through various networks, processes, and security systems. The data these headers carry includes:

  • Each trace span has a guid attribute. The guid of the last span within the process is sent with the outgoing request, so that the first segment of work in the receiving service can add this guid as the parentId attribute.
  • The parent type is the source of the trace header, such as mobile, browser, or Ruby app. This becomes the parent.type attribute on the transaction triggered by the request.
  • The timestamp is the UNIX timestamp in milliseconds when the payload was created.
  • The traceId is the unique ID used to identify a single request as it crosses inter-process boundaries and intra-process boundaries. This ID helps link spans in a distributed trace. 
  • The transactionId is the unique identifier for the transaction event.
  • The priority is a randomly generated priority ranking value that helps determine which data is sampled when sampling limits are reached. 
  • The sampled boolean value tells the agent if traced data should be collected for the request. Transactions sampled for a full trace are given a true value for the sampled attribute, which propagates downstream to signal every other APM agent the trace touches to collect spans. These downstream spans are also given a true value for the sampled attribute.
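The following Python sketch shows how the fields listed above could fit together when one service calls another. The dictionary layout, key names, and helper functions are hypothetical and do not reproduce New Relic's actual header encoding; the point is only that the caller's guid becomes the receiver's parentId, while traceId, priority, and sampled are carried forward unchanged.

```python
import time
import uuid

def new_id():
    """Generate a short random identifier (illustrative only)."""
    return uuid.uuid4().hex[:16]

def build_outbound_payload(trace_id, transaction_id, last_span_guid,
                           priority, sampled, parent_type="App"):
    """Assemble the data a caller would attach to an outgoing request."""
    return {
        "guid": last_span_guid,                 # last span in the calling process
        "parentType": parent_type,              # source of the header, e.g. mobile or browser
        "timestamp": int(time.time() * 1000),   # UNIX time in milliseconds
        "traceId": trace_id,
        "transactionId": transaction_id,
        "priority": priority,
        "sampled": sampled,
    }

def start_span_from_payload(payload):
    """First segment of work in the receiving service."""
    return {
        "guid": new_id(),
        "parentId": payload["guid"],     # links this span to the caller's span
        "traceId": payload["traceId"],   # same trace across process boundaries
        "priority": payload["priority"],
        "sampled": payload["sampled"],   # honor the upstream sampling decision
    }

payload = build_outbound_payload(new_id(), new_id(), new_id(),
                                 priority=0.73, sampled=True)
child = start_span_from_payload(payload)
print(child["parentId"] == payload["guid"], child["traceId"] == payload["traceId"])
```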

Using the W3C Trace Context standard in a distributed environment

What if you’re using multiple tools in your environment? When trace context isn’t standardized, your traces can’t be correlated or propagated when they cross boundaries between different tools from different vendors. If you’re using a distributed environment with multiple middleware services and cloud platforms, this problem is critical. The W3C Trace Context standard defines a “universally agreed-upon format for the exchange of trace context propagation data.” 

The standard improves interoperability by providing:

  • a unique identifier for individual traces and requests.
  • an agreed-upon mechanism to forward vendor-specific trace data and avoid broken traces when multiple tracing tools participate in a single transaction.
  • an industry standard that intermediary layers (like APIs), platforms, and hardware providers can support.

To adhere to this standard, tracing tools must propagate the traceparent and tracestate headers so that traces aren’t broken. New Relic agents that support W3C Trace Context send and receive these two required headers. These agents also send and receive the proprietary header used by earlier New Relic agents. The trace context headers supported by New Relic include:

  • W3C traceparent identifies the entire trace (trace ID) and the calling service (span ID). The traceparent header describes the position of the incoming request in its trace graph in a portable, fixed-length format. Every tracing tool must properly set traceparent even when it only relies on vendor-specific information in tracestate.
  • W3C tracestate carries vendor-specific information and tracks where a trace has been. The tracestate header extends traceparent with vendor-specific data represented by a set of name/value pairs. Storing information in tracestate is optional.
  • The New Relic proprietary header is the original, proprietary header that’s used to maintain backward compatibility with prior New Relic agents.
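For reference, a version 00 traceparent value has four dash-separated, lowercase hexadecimal fields: a 2-character version, a 32-character trace ID, a 16-character parent (span) ID, and 2 characters of trace flags, where the lowest bit marks the request as sampled. The Python sketch below parses and rebuilds these headers. The function names are hypothetical, and the parser is simplified: it accepts only version 00 and does not enforce every rule in the specification, such as tracestate size limits.

```python
import re

# version - trace-id - parent-id - trace-flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def parse_traceparent(header):
    """Split a version 00 traceparent header into its fields."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    _version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace ID or parent ID is invalid")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def format_traceparent(trace_id, span_id, sampled):
    """Build the header for an outgoing request made by the span span_id."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_tracestate(header):
    """Return the vendor entries as (key, value) pairs, in header order."""
    pairs = []
    for member in header.split(","):
        member = member.strip()
        if member:
            key, _, value = member.partition("=")
            pairs.append((key, value))
    return pairs

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
# The trace ID is kept; the parent ID is replaced with the current span's ID.
print(format_traceparent(ctx["trace_id"], "b7ad6b7169203331", ctx["sampled"]))
print(parse_tracestate("congo=t61rcWkgMzE,rojo=00f067aa0ba902b7"))
```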

Here’s an example scenario from How trace context is passed between applications in the New Relic docs that shows the flow when a request touches an OpenTelemetry tracer, a New Relic agent that uses W3C Trace Context standard, and an older New Relic agent before the W3C Trace Context standard.

Distributed tracing diagram that shows the flow of headers when a request touches three different agent types

Trace sampling: Head-based and tail-based

Trace sampling is a technique used in distributed tracing to reduce the amount of trace data that is collected and stored. Sampling the trace data reduces the overhead associated with distributed tracing and provides a representative sample of the system’s performance. There are two trace sampling methods: head-based and tail-based.

Head-based sampling

Head-based sampling randomly selects traces for collection and storage at the beginning (the head) of the trace. Use it to capture a representative sample of activity while avoiding storage and performance issues. The trace origin, which is the first service monitored in a distributed trace, chooses requests at random to be traced. This decision propagates to downstream services touched by that request, making all the spans in the trace available in the tracing tool.
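A minimal Python sketch of this flow follows. The function names, the service list, and the 10% sample rate are assumptions chosen for illustration; the essential point is that the decision is made once at the trace origin and every downstream service honors it rather than deciding again.

```python
import random

def decide_at_trace_origin(sample_rate, rng=random):
    """Trace origin: choose at random, before any downstream work happens."""
    return rng.random() < sample_rate

def process_request(services, sampled):
    """Each service records a span only if the propagated decision says so."""
    recorded = []
    for service in services:
        if sampled:
            recorded.append(service)
    return recorded

services = ["frontend", "checkout", "payments"]
sampled = decide_at_trace_origin(sample_rate=0.1)
print(sampled, process_request(services, sampled))
```

A sampled request yields spans from all three services, and an unsampled request yields none, so a trace is never partially collected.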

Head-based sampling also includes adaptive sampling, a technique where APM agents adapt the limit on the number of transactions collected based on changes in transaction throughput. If the limit is 10 traces per minute, the agent spreads out the collection of these 10 traces over a minute to get a representative sample over time. The rate responds to changes in transaction throughput, so if the previous minute had 100 transactions, the agent would anticipate a similar number of transactions and select 1 out of every 10 transactions to be traced.
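The arithmetic in that example can be sketched in Python as follows. This is an illustration only, not the algorithm New Relic agents use: the function names are hypothetical, and the sketch picks every Nth transaction deterministically so the result is easy to follow, whereas a real agent selects at random.

```python
import math

def sampling_interval(target_per_minute, previous_minute_count):
    """Return N such that tracing 1 of every N transactions meets the target."""
    if previous_minute_count <= target_per_minute:
        return 1  # low throughput: every transaction fits under the limit
    return math.ceil(previous_minute_count / target_per_minute)

def pick_transactions(transaction_count, interval, limit):
    """Pick every interval-th transaction index, never exceeding the limit."""
    return list(range(0, transaction_count, interval))[:limit]

# Limit of 10 traces per minute, 100 transactions seen in the previous minute.
interval = sampling_interval(target_per_minute=10, previous_minute_count=100)
picked = pick_transactions(transaction_count=100, interval=interval, limit=10)
print(interval, picked)
```

If throughput spikes during the current minute, the limit still caps the number of traces collected. The interval is then recomputed from the new count for the following minute.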

Tail-based sampling

Unlike head-based sampling, tail-based sampling makes trace retention decisions after all the spans in a trace have arrived, at the tail end.
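As a rough Python sketch, a tail-based sampler groups spans by trace, waits until a trace is complete, and then decides whether to keep it. The retention rule used here (keep traces that contain an error or exceed a latency threshold), the 500 ms threshold, and the span dictionary layout are all assumptions for illustration; real tail-based samplers support many other policies.

```python
from collections import defaultdict

def keep_trace(spans, slow_ms=500):
    """Decide on a completed trace: keep it if it has an error or is slow."""
    has_error = any(span["error"] for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration > slow_ms

def tail_sample(all_spans, slow_ms=500):
    """Group spans by trace ID, then apply the retention rule to each trace."""
    by_trace = defaultdict(list)
    for span in all_spans:
        by_trace[span["trace_id"]].append(span)
    return {tid: spans for tid, spans in by_trace.items()
            if keep_trace(spans, slow_ms)}

# Hypothetical spans from three completed traces.
all_spans = [
    {"trace_id": "t1", "start_ms": 0, "end_ms": 80, "error": False},
    {"trace_id": "t1", "start_ms": 10, "end_ms": 60, "error": False},
    {"trace_id": "t2", "start_ms": 0, "end_ms": 90, "error": False},
    {"trace_id": "t2", "start_ms": 20, "end_ms": 70, "error": True},
    {"trace_id": "t3", "start_ms": 0, "end_ms": 900, "error": False},
]

kept = tail_sample(all_spans)
print(sorted(kept))
```

Here the fast, error-free trace t1 is dropped, while t2 (which contains an error) and t3 (which is slow) are retained.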

Pros and cons of head-based vs tail-based sampling

 

Pros of head-based sampling

  • Works well for applications with lower transaction throughput
  • Fast and simple to get up and running
  • Appropriate for blended monolith and microservice environments
  • Little-to-no impact on application performance
  • A low-cost solution for sending tracing data to third-party tools
  • Statistical sampling provides adequate transparency into the distributed system

Pros of tail-based sampling

  • Observes and analyzes 100% of traces
  • Samples traces after they are fully completed
  • Visualizes traces with errors or uncharacteristic slowness more quickly

Cons of head-based sampling

  • Traces are sampled randomly
  • Sampling happens before a trace has fully completed its path through many services, so there is no way to know upfront which trace may encounter an issue
  • In high-throughput systems, traces with errors or unusual latency might be sampled out and missed

Cons of tail-based sampling

  • May require additional gateways, proxies, and satellites to run sampling software
  • Requires work to manage and scale third-party software in some cases
  • Incurs additional costs for transmitting and storing more data

 

Choosing a distributed tracing tool

Selecting the right distributed tracing tool is paramount for achieving clear visibility into application performance. A distributed tracing tool should offer more than just data collection; it should empower engineers by transforming raw data into actionable insights. When choosing such a tool, consider the following:

  • Transparency in performance claims: Ensure the tool provides accurate and honest insights about its performance capabilities.
  • Straightforward pricing: Look for clear, no-fluff pricing models that align with your usage needs and offer real value.
  • Actionable insights: Choose a tool that turns data into practical, empowering insights, facilitating informed decision-making.
  • Comprehensive language support: The tool should support a wide range of programming languages to accommodate diverse development environments.
  • Scalability: Opt for a solution that can efficiently scale with your infrastructure as it grows and evolves.
  • Real-time visualization: It's crucial to have a tool that offers real-time visualizations of your system’s health for immediate insights.
  • Alignment with team needs: The tool should resonate with your team's unique operational requirements and foster an environment conducive to learning and innovation.
  • Streamlined troubleshooting: Prioritize tools that simplify the process of identifying and resolving performance issues.
  • Continuous learning and optimization: Look for features that encourage ongoing system optimization and learning opportunities for the team.