As more teams pivot to or prefer to maintain open source stacks, OpenTelemetry (OTel for short) has become the de facto open standard for instrumentation and for generating and collecting telemetry. While OpenTelemetry is still most commonly used to observe applications, you can also use it to observe your infrastructure, such as your Kubernetes cluster!
This blog post will teach you how to monitor your Kubernetes cluster using just OpenTelemetry. You can send the generated telemetry to our backend, but you'll need to query your data and build custom dashboards yourself, which you can do by following our documentation here.
What does OpenTelemetry provide that other monitoring tools don't? The two biggest advantages are that OpenTelemetry lets you easily change where your data is exported, which prevents vendor lock-in, and that it provides a standardized way to collect and process that data across different environments.
Here’s what this blog post will cover:
- What it means to observe Kubernetes, and the data that matters
- The OpenTelemetry tools for monitoring Kubernetes:
- Collector
- Operator
- Receivers, processors, and exporter components
- An example of how to put it all together by building data pipelines
Observing Kubernetes
What does it mean to observe your Kubernetes cluster? Simply put, it means being able to ensure that the applications running on Kubernetes remain healthy. It means quickly identifying performance issues, such as pod failures and high CPU usage, which in turn decreases your MTTR (mean time to resolution), maximizes your application uptime, and keeps your users happy.
What does it mean in practice?
- Consuming change events, such as pods being created and destroyed
- Consuming cluster state metrics from kube-state-metrics using the Prometheus receiver
- Monitoring the system metrics of the hosts that comprise the cluster
- Consuming the logs from Kubernetes core services
- Monitoring all the application workloads running on the cluster
There’s a lot of telemetry that Kubernetes exposes, including metrics, events, and logs for different objects, as well as data from workloads. Collecting the right data is important for obtaining end-to-end visibility of your Kubernetes cluster, since you need it for creating dashboards, setting up alerts, and gaining accurate insight into your Kubernetes services, applications, and infrastructure.
Let’s look at some of the metrics that are vital to collect and understand when it comes to monitoring Kubernetes:
Kubernetes component | Insights | Example metrics
---|---|---
Node | Individual node performance and resource usage | Memory, CPU, disk or processor overload, readiness, network availability and usage
Pod | Pod resource usage and operation | Availability, CPU, network usage, memory
Cluster | Cluster state | Failed and successful pods, container resource information, replicaset pod information
Control plane | API server availability and functionality, etcd cluster condition and operation | Request latency, error rate, response time, cluster health, disk usage
Container | Individual container performance and resource usage | Restarts, memory, CPU, network usage
The OpenTelemetry collector
The OpenTelemetry Collector is a highly configurable data processing system that we’ll need for monitoring Kubernetes. Implementing a Collector isn’t necessary if you’re only observing an application, although it provides several benefits, including offloading the burden of additional telemetry processing from your application. It collects data from multiple sources and enables you to decorate that data using components called processors. It is available in two distributions: core, which ships a minimal set of components, and contrib, which adds a large set of community-contributed components, including most of the Kubernetes-specific ones covered below.
Below is a basic architecture of the Collector, showing its main components that access telemetry data:
- Receivers are how telemetry gets into the Collector
- Processors are how data gets transformed – the order in which you enable them in a pipeline matters!
- Exporters are how you forward data to your backend(s)
- Connectors allow you to connect two pipelines, acting as an exporter in one and a receiver in another (see the configuration sketch after this list)
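To make those roles concrete, here is a minimal sketch of a Collector configuration. The endpoint is illustrative, and the spanmetrics connector is included purely to show how a connector bridges two pipelines – components are declared at the top level, but only become active once they’re referenced under service:
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # illustrative endpoint – point this at your backend

connectors:
  spanmetrics: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, spanmetrics]   # the connector acts as an exporter here...
    metrics:
      receivers: [otlp, spanmetrics]   # ...and as a receiver here
      processors: [batch]
      exporters: [otlp]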
There are three primary deployment patterns for the Collector:
- No collector – you can send your telemetry from the SDK directly to your backend
- Agent, or DaemonSet – this is the simplest setup, where an instance of the Collector runs alongside the application on the same host (e.g., as a sidecar container or a DaemonSet)
- Gateway, or Deployment – this is more complex, involving one or more instances of the Collector running as standalone services (e.g., a deployment in Kubernetes), usually per cluster, data center, or region
Setting up a Collector is fairly straightforward, although it can get quite complicated as you scale and have to consider load balancing and setting up additional Collector instances. You configure each component in a YAML file, and each component has to be enabled in the appropriate data pipelines in the same config file under a section called service. We will take a closer look at this below.
For monitoring Kubernetes, OpenTelemetry documentation covers using two installations of the Collector:
- DaemonSet, which collects telemetry emitted by services, as well as logs and metrics for nodes, pods, and containers
- Deployment, which collects cluster-level metrics and events
You can install the Collector using either the OpenTelemetry Collector Helm Chart or the OpenTelemetry Operator, which we cover below.
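If you go the Helm chart route, the opentelemetry-collector chart ships presets that wire up several of the Kubernetes components described below for you. Here is a minimal sketch of a values file, assuming the preset names from the chart at the time of writing – check the chart’s documentation for your version:
# values.yaml for the opentelemetry-collector Helm chart
mode: daemonset

presets:
  kubernetesAttributes:
    enabled: true   # adds the k8sattributes processor
  kubeletMetrics:
    enabled: true   # adds the kubeletstats receiver
  hostMetrics:
    enabled: true   # adds the hostmetrics receiver
  logsCollection:
    enabled: true   # adds the filelog receiver for container logs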
Pay attention to the order of processors when you’re building your data pipelines, as data flows through them sequentially. For example, since you wouldn’t want to process spans that may be filtered out, you would want to do any sampling before transforming your trace data, as in the sketch below.
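Here is a minimal sketch of a traces pipeline that samples before it transforms; the probabilistic_sampler settings and the endpoint are illustrative assumptions:
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
  probabilistic_sampler:
    sampling_percentage: 25   # keep roughly a quarter of spans (illustrative)
  transform:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 4095)
  batch:

exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # illustrative

service:
  pipelines:
    traces:
      receivers: [otlp]
      # sampling runs first, so the transform only sees spans that were kept
      processors: [memory_limiter, probabilistic_sampler, transform, batch]
      exporters: [otlp]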
OpenTelemetry Operator
This is an implementation of a Kubernetes operator, a software extension to Kubernetes that manages applications and their components via custom resources. The OpenTelemetry Operator manages the Collector as well as auto-instrumentation of workloads.
Helm is used for managing Kubernetes applications. Using the OpenTelemetry Helm charts allows you to manage the installation of the Collector and the operator. To configure the operator, you’ll use a YAML file that will contain the custom resource definition, or CRD, for your Collector; it is where we will define and enable the OpenTelemetry components that we’ll learn about in the next section.
OpenTelemetry Collector components for Kubernetes
Earlier, we talked about what monitoring Kubernetes means and what it looks like in practice. Now, we’ll learn about the different components we’ll use in the Collector, some of which are specifically for Kubernetes, and some of which are generally recommended (depending on what you want to do with your data).
In practice, you’ll configure these components in the CRD, and then enable them in your Collector data pipelines. You’ll learn how to do the last step in the next section.
Receivers
These are the components that are responsible for getting data into the Collector.
Component | What it does
---|---
Kubernetes Cluster Receiver | Collects cluster-level metrics and entity events from the Kubernetes API server
Kubeletstats Receiver | Collects pod, node, and container metrics from the API server on a kubelet
Kubernetes Events Receiver | Receives change (new or updated) events from the cluster
Kubernetes Objects Receiver | Collects objects, such as events, from the Kubernetes API server
Prometheus Receiver | Scrapes metrics from Prometheus endpoints, such as kube-state-metrics
Host Metrics Receiver | Scrapes system metrics from the hosts that make up the cluster
File Log Receiver | Collects Kubernetes core and application logs written to stdout/stderr
OTLP Receiver | Collects application traces, metrics, and logs
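Several of these receivers don’t appear in the full example later in this post, so here is a hedged sketch of how the kubeletstats, hostmetrics, and filelog receivers are commonly configured. The intervals and log paths are assumptions, and K8S_NODE_NAME is assumed to be injected into the Collector pod via the downward API:
receivers:
  kubeletstats:
    collection_interval: 20s
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"
    insecure_skip_verify: true
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # avoid collecting the Collector's own logs
      - /var/log/pods/*/otel-collector/*.log
    start_at: end
    include_file_path: true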
Processors
We use these components to transform our data in some way, whether via adding attributes or filtering, or some other modification.
Component | What it does
---|---
Kubernetes attributes processor | One of the most important components for monitoring Kubernetes with OpenTelemetry, as it enables you to correlate application telemetry with your Kubernetes telemetry by adding Kubernetes context
Memory limiter processor | Limits the amount of memory that can be used in order to prevent out-of-memory issues
Batch processor | Batches your metrics, spans, and logs to compress the data and decrease the number of outgoing connections needed to export the data
Resource processor | Modifies resource attributes
Resource detection processor | Detects resource information from the host, and can append or override the resource value in telemetry data
Transform processor | Enables you to customize your data by allowing you to configure multiple context statements for your metrics, spans, and logs
Metrics transform processor | Enables you to rename metrics, and modify them by adding, renaming, or deleting label keys and values
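As an example, the full configuration later in this post only uses the resource detection processor with the env detector. Here is a hedged sketch showing an additional detector enabled – the detector list and timeout are assumptions you’d tailor to where your cluster runs:
processors:
  resourcedetection:
    detectors: [env, system]   # cloud-specific detectors such as eks, ec2, or gcp are also available
    timeout: 2s
    override: false            # don't overwrite resource attributes that are already set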
Exporters
These components route your data to the backend(s) of your choice.
Component | What it does
---|---
OTLP exporter | Exports data via gRPC using the OpenTelemetry Protocol (OTLP) format
Logging exporter | Exports data to the console. It will be deprecated in September 2024 in favor of the debug exporter
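If your Collector version has already deprecated the logging exporter, the equivalent debug exporter configuration looks like this:
exporters:
  debug:
    verbosity: detailed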
Building our data pipelines
A data pipeline enables you to collect, process, and route data from any source to one or more destinations. Defining components in the CRD alone won’t work; you also have to enable them in the service section by building data pipelines for each of your telemetry signals. This is the final piece of monitoring Kubernetes with OpenTelemetry.
Each telemetry pipeline consists of a set of receivers, processors (if applicable), and exporters. You can use each component in more than one pipeline, depending on what you want to do with your telemetry. In the Collector config YAML file, or in this case, our Collector CRD, this is what your data pipelines may look like:
service:
  extensions: [health_check, zpages]
  pipelines:
    metrics:
      receivers:
        - otlp
        - prometheus
        - k8s_cluster
      processors:
        - memory_limiter
        - k8sattributes
        - batch
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - batch
      exporters: [otlp]
    logs:
      receivers: [otlp, k8sobjects]
      processors:
        - k8sattributes
        - batch
      exporters: [otlp, logging]
Here’s a diagram to illustrate what the above configured pipelines look like:
Refer to both the YAML configuration and the diagram above to see the data pipelines at work:
- The data is received in the Collector:
  - Traces are received by the OTLP receiver
  - Logs are received by the OTLP and k8sobjects receivers
  - Metrics are received by the OTLP, Prometheus, and k8s_cluster receivers
- The data then makes its way to the processors, which run in the order you’ve enabled them:
  - Traces and metrics go through the memory_limiter processor first, then get enriched by the k8sattributes processor, and finally get batched by the batch processor
  - Logs skip the memory_limiter processor (we didn’t enable it for this signal) and go straight to the k8sattributes processor before getting batched
- Finally, the data is routed to an observability backend by the exporters:
  - Traces and metrics are routed via the OTLP exporter
  - Logs are routed via the OTLP and logging exporters
Here is an example Collector CRD using some of the components we covered earlier – keep in mind that the order of processors is important, as it dictates the order in which data is processed:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: daemonset
spec:
  mode: daemonset
  hostNetwork: true
  serviceAccount: otel-collector-daemonset
  config: |
    extensions:
      health_check: {}
      zpages:
        endpoint: 0.0.0.0:55679
    receivers:
      otlp:
        protocols:
          grpc:
          http:
            cors:
              allowed_origins:
                - "http://*"
                - "https://*"
      k8s_cluster:
        node_conditions_to_report:
          - Ready
          - MemoryPressure
      k8sobjects:
        auth_type: serviceAccount
        objects:
          - name: pods
            mode: pull
            label_selector: environment in (production),tier in (frontend)
            field_selector: status.phase=Running
            interval: 15m
          - name: events
            mode: watch
            group: events.k8s.io
            namespaces: [default]
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 5s
              static_configs:
                - targets: ['0.0.0.0:8888']
    exporters:
      otlp:
        endpoint: "otlp.nr-data.net:4317"
        tls:
          insecure: false
        headers:
          api-key: ${NEW_RELIC_LICENSE_KEY}
      logging:
        verbosity: detailed
    processors:
      memory_limiter:
        check_interval: 10s
        limit_percentage: 50
        spike_limit_percentage: 30
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        filter:
          node_from_env_var: KUBE_NODE_NAME
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.pod.start_time
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
          - sources:
              - from: resource_attribute
                name: k8s.pod.uid
          - sources:
              - from: connection
      batch:
        send_batch_size: 1000
        send_batch_max_size: 1000
        timeout: 10s
      resource:
        attributes:
          - key: host.id
            from_attribute: host.name
            action: upsert
          - key: k8s.cluster.name
            value: otel-operator-demo
            action: insert
          - key: service.instance.id
            from_attribute: k8s.pod.uid
            action: insert
      resourcedetection:
        detectors: [env]
      transform:
        trace_statements:
          - context: span
            statements:
              - truncate_all(attributes, 4095)
              - truncate_all(resource.attributes, 4095)
      metricstransform:
        transforms:
          - include: duration
            action: update
            new_name: http.server.duration
    service:
      extensions: [health_check, zpages]
      pipelines:
        metrics:
          receivers:
            - otlp
            - prometheus
            - k8s_cluster
          processors:
            - memory_limiter
            - resourcedetection
            - resource
            - k8sattributes
            - batch
            - metricstransform
          exporters: [otlp]
        traces:
          receivers: [otlp]
          processors:
            - memory_limiter
            - resourcedetection
            - resource
            - k8sattributes
            - batch
            - transform
          exporters: [otlp]
        logs:
          receivers: [otlp, k8sobjects]
          processors:
            - resourcedetection
            - resource
            - k8sattributes
            - batch
          exporters: [otlp, logging]
  env:
    - name: NEW_RELIC_LICENSE_KEY
      valueFrom:
        secretKeyRef:
          name: newrelic-key-secret
          key: new_relic_license_key
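The CRD above covers the daemonset installation. For the deployment installation that collects cluster-level metrics and events (the second installation described earlier), a hedged sketch might look like the following; the name, replica count, service account, and pipeline choices are assumptions you would adapt, and the service account needs the appropriate RBAC permissions:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: deployment
spec:
  mode: deployment
  replicas: 1   # a single replica avoids collecting duplicate cluster-level metrics and events
  serviceAccount: otel-collector-deployment
  config: |
    receivers:
      k8s_cluster:
        collection_interval: 10s
      k8sobjects:
        auth_type: serviceAccount
        objects:
          - name: events
            mode: watch
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: "otlp.nr-data.net:4317"
        headers:
          api-key: ${NEW_RELIC_LICENSE_KEY}
    service:
      pipelines:
        metrics:
          receivers: [k8s_cluster]
          processors: [batch]
          exporters: [otlp]
        logs:
          receivers: [k8sobjects]
          processors: [batch]
          exporters: [otlp]
  env:
    - name: NEW_RELIC_LICENSE_KEY
      valueFrom:
        secretKeyRef:
          name: newrelic-key-secret
          key: new_relic_license_key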
Next steps
If you are already using OpenTelemetry for your services, you can learn how to link OpenTelemetry-instrumented applications to Kubernetes. You can also learn more about monitoring Kubernetes with OpenTelemetry here.