
OpenTelemetry has gained popularity recently, specifically around the new set of practices it introduces to build modern observability solutions. It provides both the standards and the tools for software development and site reliability engineering (SRE) communities to bring observability to the forefront of designing and building applications. But adopting OpenTelemetry at scale can be a daunting task, mainly because it is relatively new and in the early stages of its evolution.

Of course, OpenTelemetry has many advantages that make it easier for applications to emit telemetry natively, as a feature provided by the software libraries themselves. Software developers and SREs who build observability into their code proactively get access to the signals they need to understand the state of their workloads. This observability is essential throughout the software development lifecycle, whether you are detecting performance issues in the early stages or running business-critical workloads in production.

This blog post explores some key considerations to help you build an approach to support your organization’s OpenTelemetry initiatives.

Let’s begin by looking at the current state of OpenTelemetry.

The current state of OpenTelemetry

OpenTelemetry is a complex project with many moving parts. The OpenTelemetry project implemented a structure with special interest groups (SIGs) for language implementations and working groups that contribute to specific capabilities such as distributed tracing, metrics, and logging. The developers within each SIG have autonomy and ownership over that SIG’s part of OpenTelemetry. The joint work of all SIGs aggregates to deliver on the OpenTelemetry project goals.

Understanding the status of core components of OpenTelemetry is vital to scope your project expectations. For the available implementations, generally look for a status of stable before running in production environments, because stable components don’t allow backward-incompatible changes. Limit components in the experimental stage to non-production activities such as proofs of concept, unless you’re willing to respond to breaking changes. You can learn more about versioning and stability in the opentelemetry-specification on GitHub.

There are three key considerations to look for when evaluating OpenTelemetry for your project:

  • The status of signal/telemetry type, such as the API specification.
  • The status of the SDK for the programming languages used by the application.
  • Support for the protocol to ship the signal/telemetry from the source to its destination.

At the time of writing this post, tracing is the most mature of the OpenTelemetry signals. This table provides an overview of the current status of tracing, metrics, and logging for the OpenTelemetry API, SDK, and protocol:

            Tracing   Metrics   Logging
 API        Stable    Stable    N/A
 SDK        Stable    Mixed*    Draft
 Protocol   Stable    Stable    Beta**

Source: https://opentelemetry.io/status/
Definition of Statuses: https://opentelemetry.io/docs/reference/specification/document-status/ 
*Mixed status means that a majority of the components are stable, but that some components of the SDK specification are not stable, such as exemplars.
**Expected to receive Stable status soon:  https://github.com/open-telemetry/opentelemetry-proto/pull/376

Further, each language client SDK has its own status, and there is wide variation in the maturity of individual language clients. For example, don’t assume tracing is universally stable across all SDKs (as of this publication date, three of eleven SDKs were not marked as stable for tracing). The same goes for metrics: the metrics specification being declared stable doesn’t mean that the OpenTelemetry clients themselves are stable, feature complete, and ready for critical workloads. In the near future, we’ll see metrics capabilities implemented in various language SDKs, and some language communities will cut a stable release. A stable designation will be a signal for early majority adopters that metrics are ready to be used in production, because it provides a commitment to backward compatibility and a guaranteed length of support after the next major release.

This table gives an overview of the current status of tracing, metrics, and logging for the OpenTelemetry language client SDKs of Java, .NET, and Go:

         Tracing   Metrics         Logging
 Java    Stable    Experimental*   Beta
 .NET    Stable    Stable          Beta
 Go      Stable    Experimental    No support

More Details: https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md 
*Expected to receive Stable status soon. Currently in a release candidate state.

Outlook for OpenTelemetry

According to an OpenTelemetry Metrics Roadmap blog post from March 2021, the metrics API/SDK specifications were planned to reach stable status by November 30, 2021. At present, metrics is marked as mixed status, an indication that the majority of the specification has stabilized. The notable exceptions are exemplars and exponential histograms, which may still undergo changes that could introduce backward-compatibility issues.

There is no specific roadmap available from the OpenTelemetry project at this time. However, considering where the most activity is happening, metrics support across the various languages is expected to move to stable status soon, and logs, the final data signal, will follow. It’s difficult to predict how quickly these statuses will change because it depends on the individual SIGs implementing the specifications. In parallel, the still-limited auto-instrumentation capabilities continue to mature across the Java and .NET ecosystems.

After the work to implement metrics, adding support for logging shouldn’t be much of a heavy lift. The logs specifications are comparatively well defined: the logging data model recently became stable, and experimental reference implementations have been available for some language clients for months. I would expect the logs SDK and protocol to reach stable status sometime in the second half of the year, assuming no delays in Cloud Native Computing Foundation (CNCF) project resourcing.

OpenTelemetry adoption considerations

In addition to achieving stable status for metrics and logs, simplifying OpenTelemetry deployment may become one of the project's focus areas. Today, the learning curve for adopting OpenTelemetry at scale is steep because of the many unknowns around implementation and support. In the next sections, I'll go through some considerations to help with adoption.

To set up OpenTelemetry for success, you can start by reviewing some of the limitations.
 

Understand OpenTelemetry limitations

It’s important to be aware of what you can achieve with the OpenTelemetry implementation, and where there are limitations. Here are some tips to consider:

  • Understanding OpenTelemetry feature support will help with instrumentation planning. Let's say you're planning to instrument a Python application to enable distributed tracing, metrics, and logs. After reviewing the current state, you might conclude that most of the distributed tracing specs are supported and stable, while metrics and logs have limited capability and are still in experimental and draft status. Going through the supported specs will help you make the right instrumentation decisions.
  • Organizational fit is not just about the technology. The success of any serious OpenTelemetry adoption at scale needs mindshare across functional teams. You need to evaluate where OpenTelemetry fits in your observability strategy and identify all the functional groups critical for the project's success. Your rollout plan must be supported by enablement on OpenTelemetry concepts and on the implementation standards you set for your organization.
  • OpenTelemetry isn’t an observability platform. Don't confuse OpenTelemetry with an observability platform such as New Relic. OpenTelemetry provides instrumentation standards for generating signals such as traces, metrics, and logs, but it isn’t an observability solution by itself. It doesn’t include visualizations, alerts, queries, or storage capabilities. Simply put, OpenTelemetry is all "what" and no "why"! To provide business value to your organization, you need to send the raw data to an observability platform to be analyzed.

OpenTelemetry instrumentation methodology

There are mainly two ways to instrument applications in OpenTelemetry: automatic and manual instrumentation. 

Before we compare the two, let's look at various types of instrumentation libraries available in OpenTelemetry: 

  • Core: These repositories contain the core language libraries, such as opentelemetry-java, which holds the implementation of the API and SDK for Java and can be used for all manual instrumentation activities.
  • Instrumentation: These repositories add automatic instrumentation capabilities on top of what you get with the Core repositories. An example is opentelemetry-java-instrumentation.
  • Contrib: This repository covers any additional helpful libraries and standalone OpenTelemetry-based utilities that don't fit in the scope of the Core and Instrumentation repositories. Java Contrib is an example that contains libraries for generating signals for AWS X-Ray and JMX Metrics.
    Note: Not every language will follow the Core, Instrumentation, and Contrib structure for instrumentation libraries. Certain languages such as Ruby offer a single repository for all instrumentation libraries. 
  • Distribution: A distribution is built on top of the OpenTelemetry repositories with some customizations. It may repackage OpenTelemetry with vendor-specific configuration (pure), add capabilities (plus), or remove them (minus) relative to what's available upstream. The Vendor support page includes a list of distributions from various products.

This table describes the differences between manual and auto instrumentation methods. As of the current state of OpenTelemetry, you may have to rely on manual instrumentation for rich telemetry until auto instrumentation capabilities mature.

                                 Manual instrumentation                                        Auto instrumentation
 Level of effort                 High (developers add code to generate telemetry signals)      Low (no code changes)
 Learning curve                  Long (requires deep understanding of the application code)    Short (basic application knowledge is enough)
 Telemetry signals               Traces, metrics, and logs*                                    Tracing only**
 Programming language support    All major programming languages                               Limited*** (popular languages such as Java and .NET)

*Depends on the programming language
**Java OTel agent includes metrics and logs auto instrumentation
***Mostly in Experimental status
Note: Another category is library instrumentation. Projects like opentelemetry-java-instrumentation publish artifacts that instrument popular libraries and frameworks without bytecode manipulation. Documentation: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/standalone-library-instrumentation.md
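
To make the manual approach concrete, here’s a minimal Java sketch that uses the OpenTelemetry API to create and enrich a span around a unit of work. The tracer name, span name, and attribute key are hypothetical placeholders, not names from any particular application; it assumes an SDK has already been registered globally.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutHandler {
    // Obtain a tracer from the globally registered OpenTelemetry instance.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.checkout"); // hypothetical scope name

    public void processOrder(String orderId) {
        // Start a span that represents this unit of work.
        Span span = tracer.spanBuilder("process-order").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId); // hypothetical attribute
            // ... business logic goes here ...
        } catch (RuntimeException e) {
            // Record the failure so it's visible on the trace.
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```

With auto instrumentation, by contrast, spans like this are created for supported frameworks without touching application code.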

Architecture considerations

Now let’s look at the client architecture. In this figure, Service A is instrumented with the OpenTelemetry API, and the OpenTelemetry SDK is configured with an exporter to send telemetry to a backend:

OpenTelemetry SDKs have capabilities for sampling trace spans, aggregating metrics, and processors you can use to enrich the telemetry data. You can configure the exporter to use a supported protocol such as OTLP to ship the collected telemetry data to its destination. In a basic implementation, the exporter works well for transmitting telemetry directly to the backend of your choice.
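
As a rough illustration of that basic setup, here’s a minimal Java sketch (assuming the opentelemetry-java SDK and OTLP exporter artifacts are on the classpath) that wires a tracer provider with a batch span processor and an OTLP/gRPC exporter. The localhost endpoint is a placeholder for whatever collector or backend you actually run.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtelSetup {
    public static OpenTelemetry init() {
        // Exporter that ships spans over OTLP/gRPC. The endpoint is an assumption:
        // a local collector or any OTLP-capable backend on the default port.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://localhost:4317")
            .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            // Batch spans in memory before export to reduce network overhead.
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

        // Register globally so instrumented code can obtain tracers via GlobalOpenTelemetry.
        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}
```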

However, a centralized telemetry processing and data pipeline layer makes it easier to scale these capabilities across complex or large deployments, for example, use cases that require exporting telemetry data to multiple backends or that have large volumes of telemetry. Here is where the OpenTelemetry collector comes in.

You’ll have some architecture decisions to make depending on what you’re trying to achieve with your OpenTelemetry initiative. The OpenTelemetry Collector has two main operational modes: agent and gateway. As an agent, the Collector runs with the application or on the same host and collects local telemetry. As a gateway, the Collector functions as a standalone data pipeline that can receive, process, and export telemetry data. It has receivers that accept telemetry data from several sources and formats, such as Prometheus and OTLP. Processors then filter and enrich the telemetry data before it’s sent through an exporter to the backend of your choice. Multiple exporters can run in parallel to fan out data to multiple backends.

As you are learning to use the Collector, it may help to keep things simple and deploy the collector as an agent first. If you’re deploying as a data pipeline or gateway, you might want to investigate processors that enable filtering sensitive data such as PII, adding attributes (tags), and tail-based sampling for traces. 

If you put it all together for a scenario where multiple applications have various instrumentation sources and you use New Relic as your observability platform, it might look like this diagram:

Choosing the right backend for your telemetry data is critical. Using multiple siloed tools to monitor telemetry data creates a disconnected view of the truth and adds toil when you troubleshoot issues. You need a platform that makes it easy to ingest telemetry data from various sources, without worrying about ingestion limits, security, or storage, and that has the intelligence to connect all of your telemetry data. Because New Relic connects all your telemetry data on one platform, including data from various open source solutions (OSS), you get end-to-end visibility, easy scalability, and better performance to detect and resolve issues faster.

New Relic provides an OTLP endpoint, which makes it easy to send OpenTelemetry data directly from the SDK exporter or the OpenTelemetry collector without installing a custom exporter or any other New Relic-specific code. Immediately available curated views and easily created alerts provide actionable insight by aggregating and correlating signals from all your telemetry.
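
For example, pointing the same OTLP exporter at New Relic instead of a local collector is only a configuration change. The endpoint and header below are assumptions based on New Relic’s documented OTLP ingest (otlp.nr-data.net with an api-key header carrying your license key); confirm the current values in the New Relic docs before relying on them.

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;

public class NewRelicExporterConfig {
    public static OtlpGrpcSpanExporter build(String licenseKey) {
        // Endpoint and header name are assumptions; check New Relic's OTLP docs.
        return OtlpGrpcSpanExporter.builder()
            .setEndpoint("https://otlp.nr-data.net:4317")
            .addHeader("api-key", licenseKey)
            .build();
    }
}
```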

Sampling methodology

The previous section briefly described sampling. Here are a few ways you can configure sampling for your OpenTelemetry data:

  • OpenTelemetry client SDK: You can configure sampling in the OpenTelemetry client SDK. There are four built-in samplers: AlwaysOn, AlwaysOff, TraceIdRatioBased, and ParentBased. You can find more information on how these techniques work at Processing and Exporting Data, and there's a small configuration sketch after this list. For projects that need scalability, sampling at each instance of client instrumentation can add significant overhead in implementation and governance.
  • OpenTelemetry collector: As mentioned earlier, the OpenTelemetry Collector has two deployment modes: agent and gateway. With a gateway deployment, you can enable tail-based sampling, which simplifies the implementation because you configure sampling at the aggregation point. However, you still need to configure sampling for each collector instance in your environment.
  • New Relic Infinite Tracing: In addition to the previous options, with New Relic as your backend you can enable tail-based sampling using New Relic Infinite Tracing. This fully managed distributed tracing service observes 100% of your application traces and, once configured, automatically samples them. Since there’s no infrastructure to manage, you don’t need to staff or operate anything extra to scale on demand.
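
Referring back to the first option above, here’s a minimal, hedged Java sketch of the built-in SDK samplers. The 10% ratio is an arbitrary example value; the right combination depends entirely on your traffic volume and cost goals.

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class SamplingExamples {
    public static SdkTracerProvider build() {
        Sampler alwaysOn  = Sampler.alwaysOn();               // record every trace
        Sampler alwaysOff = Sampler.alwaysOff();              // record nothing
        Sampler ratio     = Sampler.traceIdRatioBased(0.10);  // ~10% of traces (example value)

        // Respect the parent span's sampling decision; fall back to the ratio sampler for root spans.
        Sampler parentBased = Sampler.parentBased(ratio);

        return SdkTracerProvider.builder()
            .setSampler(parentBased)
            .build();
    }
}
```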

Why use New Relic for OpenTelemetry

The New Relic platform is built to ingest all telemetry types at scale, collecting and contextualizing data from any source. The platform simplifies exploration and correlation, and its machine learning-powered analysis supports observability across your system. These capabilities make New Relic well suited to the instrumentation standards and practices that OpenTelemetry offers. Here are some of the reasons why New Relic should be the backend for OpenTelemetry:

  • Platform breadth: The New Relic platform includes support for user experience monitoring, APM, AIOps, developer tools, and all telemetry types: metrics, events, logs, and traces.
  • Flexibility: OpenTelemetry SDKs vary significantly in maturity across languages and frameworks. With New Relic you can choose the best instrumentation source for the technologies in your environment. You have the choice to use OpenTelemetry instrumentation along with New Relic agents and integrations. For example, you might have some services instrumented with OpenTelemetry and others with the New Relic APM agent. There is interoperability between New Relic and OpenTelemetry tracing. That way, you get full telemetry coverage and no blind spots.
  • Pricing: Consumption-based pricing for New Relic gives cost control to software and DevOps teams. 
  • Compliance: Enterprise features such as RBAC and compliance certifications, including ISO 27001, FedRAMP, HIPAA/HITRUST, and TISAX, ensure that you don’t face compliance-related hurdles in your journey.
  • OpenTelemetry project contributions: New Relic is at the forefront of contributing to the OpenTelemetry CNCF project. New Relic has an upstream team providing dedicated engineering resources to advance the OpenTelemetry project. The community recognizes these engineers for their technical expertise and meaningful contributions.

Build a solid plan for adopting OpenTelemetry

There could be several reasons you are looking to deploy OpenTelemetry in your environment. This blog post covered a lot, but OpenTelemetry is still evolving, and no list is exhaustive. 

Here are a few tips that you can use now to help you to build a solid plan for adopting OpenTelemetry:

  • Your project goals shouldn’t be limited to the instrumentation and generation of telemetry. They should begin with a focus on observability, that is, the capabilities you’d like to unlock with the telemetry data, for example, building observability-driven development practices to identify performance issues early in the software development lifecycle.
  • Develop project objectives working with all key stakeholders and document them in a project charter. Keep a limited scope: break down your long-term vision into multiple smaller projects.
  • Build a list of solid test use cases, especially if you’re looking to migrate from an existing instrumentation solution. 
  • Avoid building telemetry data silos by sending metrics, traces, and logs to separate tools.
  • Understand the current limitations of OpenTelemetry and take advantage of mixed instrumentation with a platform such as New Relic.