Ideally, you should be using distributed tracing to trace requests through your system, but Kafka decouples producers and consumers, which means there are no direct transactions to trace between them. Kafka also uses asynchronous processes, which have implicit, not explicit, dependencies. That makes it challenging to understand how your microservices are working together.
However, it is possible to monitor your Kafka clusters with distributed tracing and OpenTelemetry. You can then analyze and visualize your traces in an open-source distributed tracing tool like Jaeger or a full observability platform like New Relic. In this post, I will leverage a simple application to show how you can achieve this.
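To make the idea concrete: tracing can bridge the producer/consumer gap because trace context travels with each message, typically as a W3C `traceparent` header on the Kafka record. The sketch below only illustrates that mechanism with the Java standard library. The `TraceContextSketch` class and its helpers are hypothetical names, a plain `Map` stands in for the record headers, and in a real setup the OpenTelemetry instrumentation does this for you.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class TraceContextSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Hypothetical holder for the two IDs that identify a span within a trace. */
    public static final class TraceContext {
        public final String traceId; // 32 lowercase hex characters
        public final String spanId;  // 16 lowercase hex characters

        public TraceContext(String traceId, String spanId) {
            this.traceId = traceId;
            this.spanId = spanId;
        }
    }

    /** Returns byteCount random bytes encoded as lowercase hex. */
    static String randomHex(int byteCount) {
        byte[] bytes = new byte[byteCount];
        RANDOM.nextBytes(bytes);
        StringBuilder hex = new StringBuilder(byteCount * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Starts a new trace, as the producer would for a request with no incoming context. */
    public static TraceContext newRootContext() {
        return new TraceContext(randomHex(16), randomHex(8));
    }

    /** Producer side: writes "00-<trace-id>-<span-id>-01" (version, IDs, sampled flag). */
    public static void inject(TraceContext context, Map<String, String> headers) {
        headers.put("traceparent", "00-" + context.traceId + "-" + context.spanId + "-01");
    }

    /** Consumer side: reads the header back; returns null if it is missing or malformed. */
    public static TraceContext extract(Map<String, String> headers) {
        String value = headers.get("traceparent");
        if (value == null || !value.matches("00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}")) {
            return null;
        }
        String[] parts = value.split("-");
        return new TraceContext(parts[1], parts[2]);
    }

    /** The consumer's span keeps the trace ID and gets a fresh span ID. */
    public static TraceContext childOf(TraceContext parent) {
        return new TraceContext(parent.traceId, randomHex(8));
    }

    public static void main(String[] args) {
        Map<String, String> recordHeaders = new HashMap<>(); // stand-in for Kafka record headers
        TraceContext producerSpan = newRootContext();
        inject(producerSpan, recordHeaders);

        TraceContext remoteParent = extract(recordHeaders);
        TraceContext consumerSpan = childOf(remoteParent);
        System.out.println("traceparent   = " + recordHeaders.get("traceparent"));
        System.out.println("same trace ID = " + consumerSpan.traceId.equals(producerSpan.traceId));
    }
}
```

Because both sides share the trace ID, a backend can stitch the producer and consumer spans into one trace even though the two services never call each other directly.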
Design Considerations & Guidelines
OpenTelemetry typically comes in two flavors:

When I talk about these flavors, I typically use the analogy above. You can either buy a ready-made cake and enjoy it or buy all the ingredients and make the cake yourself. With OpenTelemetry, the approach is very similar and the flavors are:
- Zero-code instrumentation: in this approach, you will use an OpenTelemetry agent and attach it to your application at startup time. This agent will then do its magic and automatically (without any source code changes) provide a lot of telemetry signals (metrics, traces and logs) and insights into your application.
  - Pros:
    - Getting started quickly
    - No source code changes
  - Cons:
    - Limited customization
    - Depth of visibility into your application may be limited
- Manual instrumentation: this option requires you to add dependencies and packages to your source code, which you then need to manage as part of your regular software development lifecycle (SDLC). In return, it lets you be more specific and customize your instrumentation. You can easily add custom metrics, traces, and attributes to your telemetry.
  - Pros:
    - Way more flexible with customizing telemetry
    - Easily able to add, remove and tweak the depth of your instrumentation
  - Cons:
    - Dependencies in your source code
    - More effort to implement
Sample application
The sample application (available in this public GitHub repository) that I am using in this blog is based on this high-level architecture:

It contains these components:
- kafka-java-producer: a Java Spring Boot application that produces messages into a Kafka topic
- Kafka broker
- kafka-java-consumer: a Java Spring Boot application that subscribes to a Kafka topic and reads messages from it. This component also makes calls to an external REST API service (that is not in our control)
- kafka-java-service: a downstream Java Spring Boot application that is being called from the consumer service
Zero-code instrumentation
Let’s start with zero-code instrumentation, aka automatic instrumentation.
Configuration
Each of the services contains a `run.sh` script to get the service up and running. The script looks like this:

The key line in this is the first one. Here we are defining the JAVA_TOOL_OPTIONS and configuring the `-javaagent` to point to the location of the OpenTelemetry Java agent.
The next three lines configure how we want to deal with the different telemetry signals. In our case, I define the traces, metrics and logs to be exported via the OpenTelemetry Protocol (OTLP).
There are three additional environment variables that are quite important to configure:
- OTEL_EXPORTER_OTLP_ENDPOINT: the target system we want to export the data to, i.e. our telemetry backend. In my case, this is New Relic, so I configure New Relic’s native OTLP endpoint.
- OTEL_EXPORTER_OTLP_HEADERS: the exporter endpoint above is a publicly reachable API, so requests need to be authenticated with an API key. In the case of New Relic, this is a New Relic license key.
- OTEL_SERVICE_NAME: ideally, we want to give the service a meaningful name, so that New Relic can create an appropriate entity from it.
This is basically all we need to configure. Everything else is handled by the OpenTelemetry Java agent. No need to change anything in our source code.
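For readers who want to see the settings side by side, here is a minimal sketch in Java that assembles the same environment variables and hands them to a child process. It is not the author's `run.sh`. The class and method names, the agent jar path, the `NEW_RELIC_LICENSE_KEY` variable, and the launch command are all assumptions for illustration. The endpoint shown is, to my knowledge, New Relic's US OTLP endpoint, and `api-key` is the header name New Relic expects, but check both against your account's region and documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AgentEnvSketch {

    /** Builds the environment variables that a run.sh-style launcher would export. */
    public static Map<String, String> agentEnvironment(String agentJar, String serviceName,
                                                       String endpoint, String licenseKey) {
        Map<String, String> env = new LinkedHashMap<>();
        // Attach the OpenTelemetry Java agent to every JVM started with this environment.
        env.put("JAVA_TOOL_OPTIONS", "-javaagent:" + agentJar);
        // Export all three signals via OTLP.
        env.put("OTEL_TRACES_EXPORTER", "otlp");
        env.put("OTEL_METRICS_EXPORTER", "otlp");
        env.put("OTEL_LOGS_EXPORTER", "otlp");
        // Where to send the data, how to authenticate, and how to name the service.
        env.put("OTEL_EXPORTER_OTLP_ENDPOINT", endpoint);
        env.put("OTEL_EXPORTER_OTLP_HEADERS", "api-key=" + licenseKey);
        env.put("OTEL_SERVICE_NAME", serviceName);
        return env;
    }

    /** Prepares (but does not start) a child process that inherits these settings. */
    public static ProcessBuilder launcher(Map<String, String> env, String... command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().putAll(env);
        builder.inheritIO();
        return builder;
    }

    public static void main(String[] args) {
        // Assumed values: adjust the jar path, endpoint region, and key source to your setup.
        String licenseKey = System.getenv().getOrDefault("NEW_RELIC_LICENSE_KEY", "<your-license-key>");
        Map<String, String> env = agentEnvironment(
                "./opentelemetry-javaagent.jar",
                "kafka-java-producer",
                "https://otlp.nr-data.net",
                licenseKey);

        // Print the configuration, masking the secret header value.
        env.forEach((key, value) ->
                System.out.println(key + "=" + (key.endsWith("_HEADERS") ? "api-key=***" : value)));

        // Hypothetical launch command; uncomment and adapt to start the service:
        // launcher(env, "./mvnw", "spring-boot:run").start();
    }
}
```

Keeping the license key in an environment variable rather than in the script itself avoids committing a secret to the repository.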
Observability
Let’s see what level of visibility into the services we can achieve from zero-code instrumentation.
When navigating to my New Relic account, I can see all services reporting into separate entities.

Let’s start by exploring the kafka-java-producer service.
The Summary view offers a great overview of all the most important telemetry and metrics I should be focusing on.

As part of this blog, I am mostly interested in the Distributed Tracing section, so let’s dive deeper into this area.

Looking at a single trace lets me see in detail how long that specific trace took to execute and where the time was spent.

We also automatically draw an Entity map of all the different services involved in a given trace.


What is interesting here is the span that says “Uninstrumented time”. This is code in the consumer where the agent was not able to capture some more detailed information about what is going on in its internal methods.
This already shows the limits of zero-code instrumentation. By default, the agent does not instrument every method in your source code. By design, it stops at a certain level, so getting deeper visibility into your code requires additional instrumentation.
Manual instrumentation
In the previous section, you saw how zero-code instrumentation has some limits when it comes to visibility into your application. This is exactly where manual instrumentation comes into play.
Configuration
I have configured the same application, but this time, no agent at all is configured when starting the application.

I simply use the Maven wrapper to run the application.
The other configuration details are then part of my application.properties:

These properties are then used in my Spring Boot application code to define the configuration for OpenTelemetry for traces, metrics and logs.

Observability
Before I jump into the details of how I implemented some manual instrumentation, let’s have a look at the result first.

Do you notice how the span, which previously was called out with “Uninstrumented time”, now shows much more detailed information? I now can see these additional spans:
- ExecuteLongRunningTask
- WhyTheHeckDoWeSleepHere
- SomeTinyTask
- AnotherShortRunningTask
The one that says “WhyTheHeckDoWeSleepHere” seems to be taking the most time. No wonder, as the name suggests 😉.
Let’s have a look at the source code to reveal the manual instrumentation I put in place.

In the method named ExecuteLongRunningTask, I have created a new span on the current tracer by using the spanBuilder() method.
In addition to that, you may also notice that - just for the fun of it - I created another span called “WhyTheHeckDoWeSleepHere” that contains an artificial unit of work or rather a sleep instruction on the current thread.
Using the OpenTelemetry SDK in this way allows me to be much more specific about the insights and details I get from my application and source code. But, as you can imagine, it also comes with the caveat that I need to maintain some dependencies and custom code in my source code.
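To illustrate why nested spans turn one opaque block into a readable breakdown, here is a dependency-free sketch. It is not the OpenTelemetry API and not the sample application's code: `MiniSpan` and `startSpan` are hypothetical stand-ins, the parent/child nesting of the span names is my assumption, and the sketch is single-threaded only. With the real OpenTelemetry Java API, the same pattern is typically written with `tracer.spanBuilder(name).startSpan()`, `span.makeCurrent()` and `span.end()`.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedSpanSketch {
    /** Spans in the order they finished (children finish before their parent). */
    public static final List<MiniSpan> FINISHED = new ArrayList<>();

    /** The span that is currently active; new spans become its children. */
    private static MiniSpan current;

    /** Hypothetical stand-in for a span: a name, a parent, and a measured duration. */
    public static final class MiniSpan implements AutoCloseable {
        public final String name;
        public final MiniSpan parent;
        private final long startNanos = System.nanoTime();
        private long durationNanos = -1;

        MiniSpan(String name, MiniSpan parent) {
            this.name = name;
            this.parent = parent;
        }

        /** Ends the span: records its duration and makes the parent active again. */
        @Override
        public void close() {
            durationNanos = System.nanoTime() - startNanos;
            current = parent;
            FINISHED.add(this);
        }

        public long durationMillis() {
            return durationNanos / 1_000_000;
        }
    }

    /** Starts a span as a child of whatever span is active right now. */
    public static MiniSpan startSpan(String name) {
        MiniSpan span = new MiniSpan(name, current);
        current = span;
        return span;
    }

    /** Mirrors the span names from the post; the work inside is made up. */
    public static void executeLongRunningTask() {
        try (MiniSpan task = startSpan("ExecuteLongRunningTask")) {
            try (MiniSpan sleep = startSpan("WhyTheHeckDoWeSleepHere")) {
                Thread.sleep(50); // artificial unit of work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            try (MiniSpan tiny = startSpan("SomeTinyTask")) {
                Math.sqrt(42.0); // trivially short work
            }
        }
    }

    public static void main(String[] args) {
        executeLongRunningTask();
        for (MiniSpan span : FINISHED) {
            String parentName = span.parent == null ? "(root)" : span.parent.name;
            System.out.println(span.name + " parent=" + parentName + " ms=" + span.durationMillis());
        }
    }
}
```

Each child reports its own duration, so the time that the parent alone would show as one block is attributed to the step that actually consumed it.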
Conclusion
I hope I was able to show you how easy it can be to leverage OpenTelemetry to get insights into your application and services. We looked into zero-code instrumentation, which gets you started without any code changes, although the level of detail may be limited. We then looked into manual instrumentation, which lets you be more specific and customize the instrumentation, at the cost of a little more effort to get started.
I encourage you to have a look into OpenTelemetry and its fascinating capabilities. Let me know your thoughts and please get in touch if you have any questions or need further information.
Happy coding!
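The views expressed in this blog are those of the author and do not necessarily reflect the official views of New Relic K.K. This blog may also contain links to external sites. New Relic makes no guarantees regarding the content of those linked sites.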