As more teams adopt or maintain open source stacks, OpenTelemetry, or OTel for short, has become the de facto open standard for generating and collecting telemetry. While OpenTelemetry is most commonly used for observing applications, you can also use it to observe your infrastructure, such as your Kubernetes cluster!

This blog post will teach you how to monitor your Kubernetes cluster with just OpenTelemetry. You can send the generated telemetry to our backend, but you will have to query your data and build custom dashboards yourself, which you can do by following our documentation here.

What does OpenTelemetry provide that other monitoring tools don’t? The two biggest benefits are that OpenTelemetry lets you easily change where your data is exported, which prevents vendor lock-in, and that it gives you a standardized way to collect and process that data across different environments.

Here’s what this blog post will cover:

  • What it means to observe Kubernetes, and the data that matters
  • The OpenTelemetry tools for monitoring Kubernetes:
    • Collector
    • Operator
    • Receivers, processors, and exporter components
  • An example of how to put it all together by building data pipelines

Observing Kubernetes

What does it mean to observe your Kubernetes cluster? Simply put, it means being able to ensure that the applications running on Kubernetes remain healthy. It means being able to quickly identify performance issues, such as pod failures and high CPU usage, which decreases your MTTR (mean time to resolution), maximizes your application uptime, and keeps your users happy.

What does it mean in practice?

  • Consuming change events, such as pod created and destroyed
  • Consuming cluster state metrics exposed by kube-state-metrics in Prometheus format
  • Monitoring the system metrics of the hosts that comprise the cluster
  • Consuming the logs from Kubernetes core services
  • Monitoring all the application workloads running on the cluster

There’s a lot of telemetry that Kubernetes exposes, including metrics, events, and logs for different objects, as well as data from workloads. Collecting the right data is important for obtaining end-to-end visibility of your Kubernetes cluster, since you need it for creating dashboards, setting up alerts, and gaining accurate insight into your Kubernetes services, applications, and infrastructure.

Let’s look at some of the metrics that are vital to collect and understand when it comes to monitoring Kubernetes:

| Kubernetes component | Insights | Example metrics |
|---|---|---|
| Node | Individual node performance and resource usage | Memory, CPU, disk or processor overload, readiness, network availability and usage |
| Pod | Pod resource usage and operation | Availability, CPU, network usage, memory |
| Cluster | Cluster state | Failed and successful pods, container resource information, replicaset pod information |
| Control plane | API server availability and functionality, cluster condition and operation | Request latency, error rate, response time, cluster health, disk usage |
| Container | Individual container performance and resource usage | Restarts, memory, CPU, network usage |

The OpenTelemetry collector

The OpenTelemetry Collector is a highly configurable data processing system that we’ll need for monitoring Kubernetes. Implementing a Collector isn’t strictly necessary if you’re only observing an application, although it provides several benefits, including offloading additional telemetry processing from your application. It collects data from multiple sources and enables you to enrich and transform that data using components called processors. It is available in two distributions:

  • Core, the main distribution
  • Contrib, which contains components not available in the core distribution

Below is a basic architecture of the Collector, showing its main components that access telemetry data:

  • Receivers are how telemetry gets into the Collector
  • Processors are how data gets transformed – the order in which you declare them matters! 
  • Exporters are how you forward data to your backend(s)
  • Connectors allow you to connect two pipelines
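
As a quick illustration of that last point, here’s a hedged sketch of how a connector bridges two pipelines, using the spanmetrics connector from the contrib distribution (the pipeline contents here are illustrative, not a complete configuration):

    connectors:
      spanmetrics: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          # The connector is listed as an exporter in the pipeline that feeds it...
          exporters: [otlp, spanmetrics]
        metrics:
          # ...and as a receiver in the pipeline that consumes the metrics it produces
          receivers: [spanmetrics]
          exporters: [otlp]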

There are three primary deployment patterns for the Collector:

  • No Collector – you send your telemetry from the SDK directly to your backend
  • Agent, or DaemonSet – the simplest setup, where an instance of the Collector runs alongside the application on the same host (e.g., as a sidecar container or a DaemonSet)
  • Gateway, or Deployment – a more complex setup, with one or more Collector instances running as standalone services (e.g., a Deployment in Kubernetes), usually one per cluster, data center, or region

Setting up a Collector is fairly straightforward, although it can get quite complicated as you scale and have to consider load balancing and setting up additional Collector instances. You configure each component in a YAML file, and each component has to be enabled in the appropriate data pipelines in the same config file under a section called service. We will take a closer look at this below. 

For monitoring Kubernetes, OpenTelemetry documentation covers using two installations of the Collector: 

  1. A DaemonSet, which collects telemetry emitted by services, as well as logs and metrics for nodes, pods, and containers
  2. A Deployment, which collects cluster-level metrics and events

You can install the Collector using either the OpenTelemetry Collector Helm Chart or the OpenTelemetry Operator, which we cover below.
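
If you go the Helm chart route, a minimal values.yaml sketch for the OpenTelemetry Collector Helm chart might look like the following. Treat the value and preset names as assumptions to verify against the chart’s documentation, since they can vary between chart versions:

    # values.yaml (sketch) for the OpenTelemetry Collector Helm chart
    mode: daemonset
    image:
      repository: otel/opentelemetry-collector-contrib
    presets:
      # Convenience switches that wire up common Kubernetes components for you
      kubernetesAttributes:
        enabled: true
      kubeletMetrics:
        enabled: true
      logsCollection:
        enabled: true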

Pay attention to the order of processors when you’re building your data pipelines, as data flows through them in the order you declare them. For example, you wouldn’t want to spend time transforming spans that may be filtered out, so do any sampling before transforming your trace data.
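
For instance, a traces pipeline that samples before transforming might be sketched like this (the probabilistic_sampler and transform processors come from the contrib distribution; the sampling percentage is just an example value):

    processors:
      probabilistic_sampler:
        sampling_percentage: 25
      # ...other processors defined here...
    service:
      pipelines:
        traces:
          receivers: [otlp]
          # Sample first so later processors only spend time on spans that will be kept
          processors: [memory_limiter, probabilistic_sampler, transform, batch]
          exporters: [otlp]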

OpenTelemetry Operator

The OpenTelemetry Operator is an implementation of a Kubernetes operator: a software extension to Kubernetes that manages applications and their components via custom resources. The OpenTelemetry Operator manages the Collector as well as auto-instrumentation of workloads.

Helm is used for managing Kubernetes applications, and the OpenTelemetry Helm charts let you manage the installation of both the Collector and the Operator. To configure the Operator, you’ll use a YAML file that contains the custom resource definition, or CRD, for your Collector; this is where we will define and enable the OpenTelemetry components that we’ll learn about in the next section.
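
To give you a sense of the shape before we dive into the components, here’s a minimal sketch of what such a Collector CRD can look like (the name, mode, and debug exporter are illustrative, and field layouts may differ between operator versions); a fuller DaemonSet example appears at the end of this post:

    apiVersion: opentelemetry.io/v1alpha1
    kind: OpenTelemetryCollector
    metadata:
      name: gateway            # illustrative name
    spec:
      mode: deployment         # run the Collector as a Kubernetes Deployment
      config:
        receivers:
          otlp:
            protocols:
              grpc: {}
        exporters:
          debug: {}
        service:
          pipelines:
            traces:
              receivers: [otlp]
              exporters: [debug]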

OpenTelemetry Collector components for Kubernetes

Earlier, we talked about what monitoring Kubernetes means and what it looks like in practice. Now, we’ll learn about the different components we’ll use in the Collector, some of which are specifically for Kubernetes, and some of which are generally recommended (depending on what you want to do with your data). 

In practice, you’ll configure these components in the CRD, and then enable them in your Collector data pipelines. You’ll learn how to do the last step in the next section. 

Receivers

These are the components that are responsible for getting data into the Collector. 

| Component | What it does |
|---|---|
| Kubernetes Cluster Receiver | Collects cluster-level metrics and entity events, such as k8s.container.cpu_limit, k8s.container.restarts, k8s.daemonset.desired_scheduled_nodes, k8s.job.failed_pods, and k8s.replicaset.available |
| Kubeletstats Receiver | Collects pod, node, and container metrics from the kubelet API server, such as container.cpu.time, container.memory.available, k8s.node.cpu.utilization, k8s.node.memory.available, and k8s.pod.network.errors |
| Kubernetes Events Receiver | Receives change (new or updated) events from the cluster |
| Kubernetes Objects Receiver | Collects objects, such as events |
| Prometheus Receiver | Scrapes kube-state-metrics |
| Host Metrics Receiver | Scrapes system metrics from the hosts that make up the cluster |
| Filelog Receiver | Collects Kubernetes core and application logs written to stdout/stderr |
| OTLP Receiver | Collects application traces, metrics, and logs |
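
To give a sense of how these receivers are configured, here’s a hedged sketch of the Kubeletstats, Host Metrics, and Filelog receiver blocks; the K8S_NODE_NAME environment variable and the log path are assumptions that depend on how your Collector DaemonSet is deployed:

    receivers:
      kubeletstats:
        collection_interval: 20s
        auth_type: serviceAccount
        # Assumes the daemonset injects the node name as K8S_NODE_NAME
        endpoint: ${env:K8S_NODE_NAME}:10250
        insecure_skip_verify: true
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu:
          memory:
          filesystem:
          network:
      filelog:
        # Container logs as written to the node by the kubelet
        include:
          - /var/log/pods/*/*/*.log
        start_at: beginning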

Processors

We use these components to transform our data in some way, whether via adding attributes or filtering, or some other modification. 

| Component | What it does |
|---|---|
| Kubernetes Attributes Processor | One of the most important components for monitoring Kubernetes with OpenTelemetry: it adds Kubernetes context, enabling you to correlate application telemetry with your Kubernetes telemetry |
| Memory Limiter Processor | Limits the amount of memory the Collector can use, in order to prevent out-of-memory issues |
| Batch Processor | Batches your metrics, spans, and logs to compress the data and decrease the number of outgoing connections needed to export it |
| Resource Processor | Modifies resource attributes |
| Resource Detection Processor | Detects resource information from the host, and can append to or override resource values in telemetry data |
| Transform Processor | Lets you customize your data by configuring multiple context statements for your metrics, spans, and logs |
| Metrics Transform Processor | Lets you rename metrics and modify them by adding, renaming, or deleting label keys and values |

Exporters

These components route your data to the backend(s) of your choice. 

| Component | What it does |
|---|---|
| OTLP Exporter | Exports data via gRPC using the OpenTelemetry Protocol (OTLP) format |
| Logging Exporter | Exports data to the console. It will be deprecated in September 2024 in favor of the debug exporter |
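
Swapping the logging exporter for its replacement is a small change in the exporters section; a minimal sketch:

    exporters:
      # The debug exporter replaces the deprecated logging exporter
      debug:
        verbosity: detailed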

Building our data pipelines

A data pipeline enables you to collect, process, and route data from any source to one or more destinations. Defining components in the CRD alone isn’t enough; you also have to enable them in the service section by building data pipelines for each of your telemetry signals. This is the final piece of monitoring Kubernetes with OpenTelemetry.

Each telemetry pipeline consists of a set of receivers, processors (if applicable), and exporters. You can use each component in more than one pipeline, depending on what you want to do with your telemetry. In the Collector config YAML file, or in this case, our Collector CRD, this is what your data pipelines may look like:

    service:
      extensions: [health_check, zpages]
      pipelines:
        metrics:
          receivers:
            - otlp
            - prometheus
            - k8s_cluster
            - k8sobjects
          processors:
            - memory_limiter
            - k8sattributes
            - batch
          exporters: [otlp]
        traces:
          receivers: [otlp]
          processors:
            - memory_limiter
            - k8sattributes
            - batch
          exporters: [otlp]
        logs:
          receivers: [otlp]
          processors:
            - k8sattributes
            - batch
          exporters: [otlp, logging]

Here’s a diagram illustrating what the pipelines configured above look like:

Refer to both the YAML configuration and the diagram above to see the data pipelines at work: 

  1. The data is received in the Collector:
    1. Traces and logs are received by the OTLP receiver
    2. Metrics are received by the OTLP, Prometheus, k8s_cluster, and k8sobjects receivers
  2. The data then makes its way to the processors, which process the data in the order you’ve enabled them:
    1. Traces and metrics go through the memory_limiter processor first, then get transformed by the k8sattributes processor, and finally get batched by the batch processor
    2. Logs go straight to the k8sattributes processor (we didn’t enable memory_limiter for this signal) and are then batched
  3. Finally, the data is routed to an observability backend by exporters:
    1. Traces and metrics are routed via the OTLP exporter
    2. Logs are routed via the OTLP and logging exporters

Here is an example Collector CRD using some of the components we covered earlier – keep in mind that the order of processors is important, as it dictates the order in which data is processed:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: daemonset
spec:
  mode: daemonset
  hostNetwork: true
  serviceAccount: otel-collector-daemonset
  config:
    extensions:
      health_check: {}
      zpages:
        endpoint: 0.0.0.0:55679
    receivers:
      otlp:
        protocols:
          grpc:
          http:
            cors:
              allowed_origins:
                - "http://*"
                - "https://*"
      k8s_cluster:
        node_conditions_to_report:
          - Ready
          - MemoryPressure    
      k8sobjects:
        auth_type: serviceAccount
        objects:
          - name: pods
            mode: pull
            label_selector: environment in (production),tier in (frontend)
            field_selector: status.phase=Running
            interval: 15m
          - name: events
            mode: watch
            group: events.k8s.io
            namespaces: [default]
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 5s
              static_configs:
                - targets: ['0.0.0.0:8888']
    exporters:
      otlp:
        endpoint: "otlp.nr-data.net:4317"
        tls:
          insecure: false
        headers:
          api-key: ${NEW_RELIC_LICENSE_KEY}
      logging:
        verbosity: detailed
    processors:
      memory_limiter:
        check_interval: 10s
        limit_percentage: 50
        spike_limit_percentage: 30
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        filter:
          node_from_env_var: KUBE_NODE_NAME
        extract:
          metadata:
           - k8s.pod.name
           - k8s.pod.uid
           - k8s.deployment.name
           - k8s.namespace.name
           - k8s.node.name
           - k8s.pod.start_time
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
          - sources:
              - from: resource_attribute
                name: k8s.pod.uid
          - sources:
              - from: connection
      batch:
        send_batch_size: 1000
        send_batch_max_size: 1000
        timeout: 10s
      resource:
        attributes:
          - key: host.id
            from_attribute: host.name
            action: upsert
          - key: k8s.cluster.name
            value: otel-operator-demo
            action: insert
          - key: service.instance.id
            from_attribute: k8s.pod.uid
            action: insert
      resourcedetection:
        detectors: [env]
      transform:
        trace_statements:
          - context: span
            statements:
              - truncate_all(attributes, 4095)
              - truncate_all(resource.attributes, 4095)
      metricstransform:
        transforms:
          - include: duration
            action: update
            new_name: http.server.duration
    service:
      extensions: [health_check, zpages]
      pipelines:
        metrics:
          receivers:
            - otlp
            - prometheus
            - k8s_cluster
            - k8sobjects
          processors:
            - memory_limiter
            - resourcedetection
            - resource
            - k8sattributes
            - batch
            - metricstransform
          exporters: [otlp]
        traces:
          receivers: [otlp]
          processors:
            - memory_limiter
            - resourcedetection
            - resource
            - k8sattributes
            - batch
            - transform
          exporters: [otlp]
        logs:
          receivers: [otlp]
          processors:
            - resourcedetection
            - resource
            - k8sattributes
            - batch
          exporters: [otlp, logging]
  env:
    - name: NEW_RELIC_LICENSE_KEY
      valueFrom:
        secretKeyRef:
          name: newrelic-key-secret
          key: new_relic_license_key