Why Distributed Tracing Is Essential for APM

Reduce MTTR in complex distributed application environments

Cutting through the complexity

Modern software environments and architectures like microservices have the potential to accelerate application development. But in many organizations, software engineering teams face a complex environment, which makes it difficult to diagnose and resolve performance issues and errors before they impact reliability and the customer experience.

Microservices environments can include dozens to hundreds of services, making it hard to determine request paths and diagnose issues. And application performance monitoring (APM) burdens only increase with orchestration, automation, and CI/CD for frequent software deployments. Without the proper monitoring instrumentation, organizations risk their teams having to search for answers in distributed systems repeatedly, which increases mean time to resolution (MTTR) and takes time from innovative software development.

Observability cuts through software complexity and provides end-to-end visibility that enables teams to solve problems faster, work smarter, and create better digital experiences for their customers. Observability creates context and actionable insights by, among other things, combining four essential types of observability data: metrics, events, logs, and traces (MELT).

Traces—more precisely, distributed traces—are essential for software teams that have transitioned (or are considering a move) to the cloud and have adopted microservices architectures. That’s because distributed tracing is the best way to understand quickly what happens to requests as they transit through the microservices that make up distributed applications.

Business leaders, DevOps engineers, product owners, site reliability engineers (SREs), software team leaders, or other stakeholders can use distributed tracing to find bottlenecks or errors and gain an edge with faster troubleshooting.

Tracing the path through distributed systems

Distributed tracing is now table stakes for operating and monitoring modern application environments. When teams monitor software and system performance for observability, tracing is a way to monitor and analyze requests as they propagate through a distributed environment and hop from service to service. 

Distributed tracing is the ability to trace a solution to track and observe service requests as they flow through distributed systems by collecting data as the requests go from one service to another. The trace data helps teams understand the flow of requests through the microservices environment and pinpoint where failures or performance issues occur in the system—and why.

When teams instrument systems for distributed tracing, all transactions generate trace telemetry, from the frontend user to the backend database calls. For example, when customers click on a cart to make a purchase in an e-commerce application, that request travels through several distinct frontend and backend services across multiple containers, serverless environments, virtual machines, different cloud providers, on-premises (on-prem), or any combination of these. The request might include the inventory service to ensure there is inventory available, payment service, and shipping service. And ultimately the request completes and comes back to the user. Every time a request hops from one service to another, it emits a span with tracing telemetry. Once the request finishes, spans are stitched together to create a complete trace of the request’s journey through the system.

Scatter chart and waterfall visualization showing how much time each request took on each step across application services

When to use traces

In general, distributed tracing is the best way for DevOps, operations, software, and SRE teams to get answers to specific questions quickly in environments where the software is distributed or relies on serverless architectures. As soon as a request involves a handful of microservices, having a way to see how all the different services are working together is essential.

Trace data provides context for what is happening across the application as a whole and between services and entities. If there were only raw events for each service in isolation, there would be no way to reconstruct a single chain between services for a particular transaction.

Because applications often call multiple other applications depending on the task they’re trying to accomplish, they also often process data in parallel. So the call chain can be inconsistent and timing unreliable for correlation. The only way to ensure a consistent call chain is to pass trace context between each service to identify a single transaction uniquely through the entire chain. 

This means teams should use distributed tracing to get answers to questions such as: 

  • What is the health of the services that make up a distributed system? 
  • What is the root cause of errors and defects within a distributed system?
  • Where are performance bottlenecks that could impact the customer experience? 
  • Which services have problematic or inefficient code that teams should prioritize for optimization?

How traces work

Stiched-together traces form special events called spans, which help track a causal chain through a microservices ecosystem for a single transaction. To accomplish spans, each service passes correlation identifiers, known as trace context, to each other. This trace context is used to add attributes to the span.

Example of a distributed trace composed of the spans in a credit card transaction
Timestamp EventType TraceID SpanID ParentID ServiceID Duration
Timestamp11/8/2022 15:34:23 EventTypeSpan TraceID2ec68b32 SpanIDaaa111 ParentID  ServiceIDVending Machine Duration23
Timestamp11/8/2022 15:34:22 EventTypeSpan TraceID2ec68b32 SpanIDbbb111 ParentIDaaa111 ServiceIDVending Machine Backend Duration18
Timestamp11/8/2022 15:34:20 EventTypeSpan TraceID2ec68b32 SpanIDccc111 ParentIDbbb111 ServiceIDCredit Card Company Duration15
Timestamp11/8/2022 15:34:19 EventTypeSpan TraceID2ec68b32 SpanIDddd111 ParentIDccc111 ServiceIDIssuing Bank Duration3

In the table above, the timestamp and duration data shows the credit card company has the slowest service in the transaction with 12 of the 23 seconds—more than half the time for this entire trace.

Connecting the dots

As organizations began moving to distributed applications, they quickly realized they needed a way to have visibility into individual microservices in isolation and the entire request flow. This migration is why distributed tracing became a best practice for gaining needed visibility into what was happening. And combining traces with the other three essential types of telemetry data—metrics, events, and logs—gives teams a complete picture of their software environment and performance for end-to-end observability. 

Distributed tracing also requires trace context. This requirement means assigning a unique ID to each request, assigning a unique ID to each step in a trace, encoding this contextual information, and passing (or propagating) the encoded context from one service to the next as the request makes its way through an application environment. This process lets the distributed tracing tool correlate each step of a trace, in the correct order, along with other necessary information to monitor and track performance.

A single trace typically captures data about: 

  • Spans (service name, operation name, duration, and other metadata) 
  • Errors
  • Duration of important operations within each service (such as internal method calls and functions)
  • Custom attributes

Why organizations need distributed tracing

As new technologies and practices—cloud, microservices, containers, serverless functions, DevOps, site reliability engineering, and more—increase velocity and reduce the friction of getting software from code to production, they also introduce new challenges:

  • More points of failure within the application stack
  • Increased MTTR due to the complexity of the application environment
  • Less time for teams to innovate because they need more time to diagnose problems

For example, a slow-running request might impact the experience of a set of customers. That request is distributed across multiple microservices and serverless functions. Several teams own and monitor the various services involved in the request, and none have reported any performance issues with their microservices. Without a way to view the performance of the entire request across the different services, it’s nearly impossible to pinpoint where and why the high latency is occurring and which team should address the issue. As part of an end-to-end observability strategy, distributed tracing addresses the challenges of modern application environments.

By deeply understanding the performance of every service—both upstream and downstream—software teams can more effectively and quickly:

  • Identify and resolve issues to minimize the impact on the customer experience and business outcomes.
  • Measure overall system health and understand the effect of changes on the customer experience.
  • Prioritize high-value areas for improvement to optimize the digital customer experiences.
  • Innovate continuously with confidence to outperform the competition.

Gaining visibility into the data pipeline

Distributed tracing requires the reporting and processing of tracing telemetry. The volume of trace data can grow exponentially over time as the volume of requests increases and as teams deploy more microservices within the environment. 

For this reason, many organizations use data sampling to manage the complexity and cost associated with transmitting trace activity. Ideally, the sampled data represents the characteristics of the larger data population. 

Software teams need the flexibility to choose head- or tail-based sampling to meet the monitoring requirements for each application.

Efficient head-based sampling

Head-based sampling collects and stores trace data randomly while the root (first) span is processed to track and analyze what happens to a transaction across all of the services it touches. Typically, head-based sampling happens within the agent responsible for collecting trace telemetry by randomly selecting which traces to sample for analysis. The sampling decisions occur before traces are complete. Because there is no way to know which trace might encounter an issue, teams might miss traces that contain unusually slow processes or errors. 

Head-based sampling works well to provide an overall statistical sampling of requests through a distributed system. It does a good job of catching traces with errors or latency in applications with a lower volume of transactions and environments with a mix of monolith and microservices-based architectures. Head-based sampling is an efficient way to sample a vast amount of trace data in real time, and there is little to no impact on application performance.

Actionable traces with tail-based sampling

Distributed tracing with tail-based sampling helps software teams troubleshoot issues in highly distributed, high-volume systems where teams must observe all trace telemetry and sample the traces that contain errors or unusual latency. Tail-based sampling collects all information about that trace when it’s complete.

Tail-based sampling is less of a nice to have and more of a requirement when teams need the highest level of granularity for troubleshooting. 

Some organizations need their distributed tracing tool to observe and analyze every span—every hop between services—and surface the most action-able traces for troubleshooting because downtime could cost millions of dollars, especially during peak events. 

For example, an organization with an average span load of three million spans per minute sees spikes of 300 million spans per minute when it launches a new product. Traditional head-based sampling is inadequate for this type of organization with a high transaction volume.

Not every trace is equal. To choose the best sampling method, teams should evaluate based on the use case and cost-over-benefit analysis and consider the monitoring needs of each application.

Head-based sampling

Tail-based sampling

Analysis and visualization

Collecting trace data is a waste of time if software teams don’t have an easy way to analyze and visualize the data across complex architectures. A comprehensive observability platform allows teams to see all their telemetry and business data in one place. It also provides the context they need to derive meaning, take the right action quickly, and work with the data in meaningful ways. 

A distributed trace visualization ideally has a tree-like structure. The visualization should include child spans that refer to one parent span and lets teams see which spans have high latency and errors within a trace. It also helps teams understand the exact error details and what services are slow, with detailed attributes to find issues and fix them quickly.

Observability vendors like New Relic use this visualization structure for troubleshooting and analysis.

New Relic distributed tracing

Addressing the management burden

Troubleshooting distributed systems is a classic needle-in-a-haystack problem, and instrumenting systems for tracing then collecting and visualizing the data can be labor-intensive and complex to implement. Fully managed software-as-a-service (SaaS) solutions allow teams to eliminate the burden of deploying, managing, and scaling third-party gateways or satellites for data collection.

The New Relic observability platform makes it easy to instrument applications with a single agent deployment for almost any programming language and framework. Teams can also use open-source tools and open instrumentation standards to instrument environments. OpenTelemetry is considered the standard for open-source instrumentation and telemetry collection.

The New Relic platform also offers a fully managed tail-based sampling service that observes and analyzes 100% of spans across a distributed system and provides visualizations for traces with errors or unusual latency, so teams can quickly identify and troubleshoot issues. 

The platform observes every span and provides metrics, error data, and essential traces in a single view. It provides critical insights by saving the most actionable data to the New Relic platform. The result is unparalleled visibility into distributed systems, enabling teams to understand the impact of downstream latency or errors with detailed metrics and then drill down to the saved trace data for the most relevant traces. 

Distributed tracing is included with New Relic APM, with low-latency and low-cost data transfer from New Relic agents, instrumentation within serverless functions, or any other data source, including third-party instrumentation.

Heads or tails? You don’t need to flip a coin

New Relic offers flexible options for distributed tracing so teams can make head- or tail-based sampling decisions at the application level. For critical applications where teams need to observe and analyze every trace, they can select tail-based sampling without worrying about managing sampling infrastructure. 

New Relic is the only observability vendor that gives software teams the flexibility to select distributed tracing with head-based sampling or fully managed tail-based sampling. With less to manage, there’s more room for innovation and gaining a competitive advantage.

The New Relic observability platform incorporates log management, APM, distributed tracing, infrastructure monitoring, serverless monitoring, mobile monitoring, browser monitoring, synthetic monitoring, Kubernetes monitoring, and more.