As more teams pivot to or prefer to maintain open source stacks, OpenTelemetry (OTel for short) has become the de facto open standard for instrumentation and for generating and collecting telemetry. While OpenTelemetry is still most commonly used to observe applications, you can also use it to observe your infrastructure, such as your Kubernetes cluster!
This blog post will teach you how to monitor your Kubernetes cluster using just OpenTelemetry. You can send the generated telemetry to our backend, but you'll need to query your data and build custom dashboards yourself, which you can do by following our documentation here.
What does OpenTelemetry provide that other monitoring tools don't? The two biggest advantages are that OpenTelemetry lets you easily change where your data is exported, which prevents vendor lock-in, and that it provides a standardized way to collect and process that data across different environments.
Here’s what this blog post will cover:
- What it means to observe Kubernetes, and the data that matters
- The OpenTelemetry tools for monitoring Kubernetes:
- Collector
- Operator
- Receivers, processors, and exporter components
- An example of how to put it all together by building data pipelines
Observing Kubernetes
What does it mean to observe your Kubernetes cluster? Simply put, it means being able to ensure that the applications running on Kubernetes remain healthy. It means quickly identifying performance issues, such as pod failures and high CPU usage, which in turn decreases your MTTR (mean time to resolution), maximizes your application uptime, and keeps your users happy.
What does it mean in practice?
- Consuming change events, such as pods being created and destroyed
- Consuming cluster state metrics from kube-state-metrics using the Prometheus receiver
- Monitoring the system metrics of the hosts that comprise the cluster
- Consuming the logs from Kubernetes core services
- Monitoring all the application workloads running on the cluster
There’s a lot of telemetry that Kubernetes exposes, including metrics, events, and logs for different objects, as well as data from workloads. Collecting the right data is important for obtaining end-to-end visibility of your Kubernetes cluster, since you need it for creating dashboards, setting up alerts, and gaining accurate insight into your Kubernetes services, applications, and infrastructure.
Let’s look at some of the metrics that are vital to collect and understand when it comes to monitoring Kubernetes:
Kubernetes component | Insights | Example metrics
---|---|---
Node | Individual node performance and resource usage | Memory, CPU, disk or processor overload, readiness, network availability and usage
Pod | Pod resource usage and operation | Availability, CPU, network usage, memory
Cluster | Cluster state | Failed and successful pods, container resource information, replicaset pod information
Control plane | API server availability and functionality, etcd cluster condition and operation | Request latency, error rate, response time, cluster health, disk usage
Container | Individual container performance and resource usage | Restarts, memory, CPU, network usage
The OpenTelemetry collector
The OpenTelemetry Collector is a highly configurable data processing system that we’ll need for monitoring Kubernetes. Implementing a Collector isn’t necessary if you’re only observing an application, although it provides several benefits, including offloading the burden of additional telemetry processing from your application. It collects data from multiple sources and enables you to decorate that data using components called processors. It is available in two distributions: core, which ships a minimal set of components, and contrib, which adds a large set of community-contributed components, including most of the Kubernetes-specific ones covered below.
Below is a basic architecture of the Collector, showing its main components that access telemetry data:
- Receivers are how telemetry gets into the Collector
- Processors are how data gets transformed – the order in which you enable them in a pipeline matters!
- Exporters are how you forward data to your backend(s)
- Connectors allow you to connect two pipelines, acting as an exporter in one and a receiver in another (see the configuration sketch after this list)
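To make those roles concrete, here is a minimal sketch of a Collector configuration. The endpoint is illustrative, and the spanmetrics connector is included purely to show how a connector bridges two pipelines – components are declared at the top level, but only become active once they’re referenced under service:
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # illustrative endpoint – point this at your backend

connectors:
  spanmetrics: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp, spanmetrics]   # the connector acts as an exporter here...
    metrics:
      receivers: [otlp, spanmetrics]   # ...and as a receiver here
      processors: [batch]
      exporters: [otlp]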
There are three primary deployment patterns for the Collector:
- No collector – you can send your telemetry from the SDK directly to your backend
- Agent, or DaemonSet – this is the simplest setup, where an instance of the Collector runs alongside the application on the same host (e.g., as a sidecar container or a DaemonSet)
- Gateway, or Deployment – this is more complex, involving one or more instances of the Collector running as standalone services (e.g., a deployment in Kubernetes), usually per cluster, data center, or region
Setting up a Collector is fairly straightforward, although it can get quite complicated as you scale and have to consider load balancing and setting up additional Collector instances. You configure each component in a YAML file, and each component has to be enabled in the appropriate data pipelines in the same config file under a section called service. We will take a closer look at this below.
For monitoring Kubernetes, OpenTelemetry documentation covers using two installations of the Collector:
- DaemonSet, which collects telemetry emitted by services, as well as logs and metrics for nodes, pods, and containers
- Deployment, which collects cluster-level metrics and events
You can install the Collector using either the OpenTelemetry Collector Helm Chart or the OpenTelemetry Operator, which we cover below.
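If you go the Helm chart route, the opentelemetry-collector chart ships presets that wire up several of the Kubernetes components described below for you. Here is a minimal sketch of a values file, assuming the preset names from the chart at the time of writing – check the chart’s documentation for your version:
# values.yaml for the opentelemetry-collector Helm chart
mode: daemonset

presets:
  kubernetesAttributes:
    enabled: true   # adds the k8sattributes processor
  kubeletMetrics:
    enabled: true   # adds the kubeletstats receiver
  hostMetrics:
    enabled: true   # adds the hostmetrics receiver
  logsCollection:
    enabled: true   # adds the filelog receiver for container logs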
Pay attention to the order of processors when you’re building your data pipelines, as data flows through them sequentially. For example, since you wouldn’t want to process spans that may be filtered out, you would want to do any sampling before transforming your trace data, as in the sketch below.
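Here is a minimal sketch of a traces pipeline that samples before it transforms; the probabilistic_sampler settings and the endpoint are illustrative assumptions:
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
  probabilistic_sampler:
    sampling_percentage: 25   # keep roughly a quarter of spans (illustrative)
  transform:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 4095)
  batch:

exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # illustrative

service:
  pipelines:
    traces:
      receivers: [otlp]
      # sampling runs first, so the transform only sees spans that were kept
      processors: [memory_limiter, probabilistic_sampler, transform, batch]
      exporters: [otlp]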
OpenTelemetry Operator
This is an implementation of a Kubernetes operator, a software extension to Kubernetes that manages applications and their components via custom resources. The OpenTelemetry Operator manages the Collector as well as auto-instrumentation of workloads.
Helm is used for managing Kubernetes applications. Using the OpenTelemetry Helm charts allows you to manage the installation of the Collector and the operator. To configure the operator, you’ll use a YAML file that will contain the custom resource definition, or CRD, for your Collector; it is where we will define and enable the OpenTelemetry components that we’ll learn about in the next section.
OpenTelemetry Collector components for Kubernetes
Earlier, we talked about what monitoring Kubernetes means and what it looks like in practice. Now, we’ll learn about the different components we’ll use in the Collector, some of which are specifically for Kubernetes, and some of which are generally recommended (depending on what you want to do with your data).
In practice, you’ll configure these components in the CRD, and then enable them in your Collector data pipelines. You’ll learn how to do the last step in the next section.
Receivers
These are the components that are responsible for getting data into the Collector.
Component | What it does
---|---
Kubernetes Cluster Receiver | Collects cluster-level metrics and entity events from the Kubernetes API server
Kubeletstats Receiver | Collects pod, node, and container metrics from the API server on a kubelet
Kubernetes Events Receiver | Receives change (new or updated) events from the cluster
Kubernetes Objects Receiver | Collects objects, such as events, from the Kubernetes API server
Prometheus Receiver | Scrapes metrics from Prometheus endpoints, such as kube-state-metrics
Host Metrics Receiver | Scrapes system metrics from the hosts that make up the cluster
File Log Receiver | Collects Kubernetes core and application logs written to stdout/stderr
OTLP Receiver | Collects application traces, metrics, and logs
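Several of these receivers don’t appear in the full example later in this post, so here is a hedged sketch of how the kubeletstats, hostmetrics, and filelog receivers are commonly configured. The intervals and log paths are assumptions, and K8S_NODE_NAME is assumed to be injected into the Collector pod via the downward API:
receivers:
  kubeletstats:
    collection_interval: 20s
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"
    insecure_skip_verify: true
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # avoid collecting the Collector's own logs
      - /var/log/pods/*/otel-collector/*.log
    start_at: end
    include_file_path: true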
Processors
We use these components to transform our data in some way, whether via adding attributes or filtering, or some other modification.
Component | What it does
---|---
Kubernetes attributes processor | One of the most important components for monitoring Kubernetes with OpenTelemetry, as it enables you to correlate application telemetry with your Kubernetes telemetry by adding Kubernetes context
Memory limiter processor | Limits the amount of memory that can be used in order to prevent out-of-memory issues
Batch processor | Batches your metrics, spans, and logs to compress the data and decrease the number of outgoing connections needed to export the data
Resource processor | Modifies resource attributes
Resource detection processor | Detects resource information from the host, and can append or override the resource value in telemetry data
Transform processor | Enables you to customize your data by allowing you to configure multiple context statements for your metrics, spans, and logs
Metrics transform processor | Enables you to rename metrics, and modify them by adding, renaming, or deleting label keys and values
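As an example, the full configuration later in this post only uses the resource detection processor with the env detector. Here is a hedged sketch showing an additional detector enabled – the detector list and timeout are assumptions you’d tailor to where your cluster runs:
processors:
  resourcedetection:
    detectors: [env, system]   # cloud-specific detectors such as eks, ec2, or gcp are also available
    timeout: 2s
    override: false            # don't overwrite resource attributes that are already set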
Exporters
These components route your data to the backend(s) of your choice.
Component | What it does
---|---
OTLP exporter | Exports data via gRPC using the OpenTelemetry Protocol (OTLP) format
Logging exporter | Exports data to the console. It will be deprecated in September 2024 in favor of the debug exporter
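If your Collector version has already deprecated the logging exporter, the equivalent debug exporter configuration looks like this:
exporters:
  debug:
    verbosity: detailed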
Building our data pipelines
A data pipeline enables you to collect, process, and route data from any source to one or more destinations. Defining components in the CRD alone won’t work; you also have to enable them in the service section by building data pipelines for each of your telemetry signals. This is the final piece of monitoring Kubernetes with OpenTelemetry.
Each telemetry pipeline consists of a set of receivers, processors (if applicable), and exporters. You can use each component in more than one pipeline, depending on what you want to do with your telemetry. In the Collector config YAML file, or in this case, our Collector CRD, this is what your data pipelines may look like:
service:
  extensions: [health_check, zpages]
  pipelines:
    metrics:
      receivers:
        - otlp
        - prometheus
        - k8s_cluster
      processors:
        - memory_limiter
        - k8sattributes
        - batch
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - batch
      exporters: [otlp]
    logs:
      receivers: [otlp, k8sobjects]
      processors:
        - k8sattributes
        - batch
      exporters: [otlp, logging]
Here’s a diagram to illustrate what the above configured pipelines look like:
Refer to both the YAML configuration and the diagram above to see the data pipelines at work:
- The data is received in the Collector:
  - Traces are received by the OTLP receiver
  - Logs are received by the OTLP and k8sobjects receivers
  - Metrics are received by the OTLP, Prometheus, and k8s_cluster receivers
- The data then makes its way to the processors, which run in the order you’ve enabled them:
  - Traces and metrics go through the memory_limiter processor first, then get enriched by the k8sattributes processor, and finally get batched by the batch processor
  - Logs skip the memory_limiter processor (we didn’t enable it for this signal) and go straight to the k8sattributes processor before getting batched
- Finally, the data is routed to an observability backend by the exporters:
  - Traces and metrics are routed via the OTLP exporter
  - Logs are routed via the OTLP and logging exporters
Here is an example Collector CRD using some of the components we covered earlier – keep in mind that the order of processors is important, as it dictates the order in which data is processed:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: daemonset
spec:
  mode: daemonset
  hostNetwork: true
  serviceAccount: otel-collector-daemonset
  config: |
    extensions:
      health_check: {}
      zpages:
        endpoint: 0.0.0.0:55679
    receivers:
      otlp:
        protocols:
          grpc:
          http:
            cors:
              allowed_origins:
                - "http://*"
                - "https://*"
      k8s_cluster:
        node_conditions_to_report:
          - Ready
          - MemoryPressure
      k8sobjects:
        auth_type: serviceAccount
        objects:
          - name: pods
            mode: pull
            label_selector: environment in (production),tier in (frontend)
            field_selector: status.phase=Running
            interval: 15m
          - name: events
            mode: watch
            group: events.k8s.io
            namespaces: [default]
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 5s
              static_configs:
                - targets: ['0.0.0.0:8888']
    exporters:
      otlp:
        endpoint: "otlp.nr-data.net:4317"
        tls:
          insecure: false
        headers:
          api-key: ${NEW_RELIC_LICENSE_KEY}
      logging:
        verbosity: detailed
    processors:
      memory_limiter:
        check_interval: 10s
        limit_percentage: 50
        spike_limit_percentage: 30
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        filter:
          node_from_env_var: KUBE_NODE_NAME
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.pod.start_time
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
          - sources:
              - from: resource_attribute
                name: k8s.pod.uid
          - sources:
              - from: connection
      batch:
        send_batch_size: 1000
        send_batch_max_size: 1000
        timeout: 10s
      resource:
        attributes:
          - key: host.id
            from_attribute: host.name
            action: upsert
          - key: k8s.cluster.name
            value: otel-operator-demo
            action: insert
          - key: service.instance.id
            from_attribute: k8s.pod.uid
            action: insert
      resourcedetection:
        detectors: [env]
      transform:
        trace_statements:
          - context: span
            statements:
              - truncate_all(attributes, 4095)
              - truncate_all(resource.attributes, 4095)
      metricstransform:
        transforms:
          - include: duration
            action: update
            new_name: http.server.duration
    service:
      extensions: [health_check, zpages]
      pipelines:
        metrics:
          receivers:
            - otlp
            - prometheus
            - k8s_cluster
          processors:
            - memory_limiter
            - resourcedetection
            - resource
            - k8sattributes
            - batch
            - metricstransform
          exporters: [otlp]
        traces:
          receivers: [otlp]
          processors:
            - memory_limiter
            - resourcedetection
            - resource
            - k8sattributes
            - batch
            - transform
          exporters: [otlp]
        logs:
          receivers: [otlp, k8sobjects]
          processors:
            - resourcedetection
            - resource
            - k8sattributes
            - batch
          exporters: [otlp, logging]
  env:
    - name: NEW_RELIC_LICENSE_KEY
      valueFrom:
        secretKeyRef:
          name: newrelic-key-secret
          key: new_relic_license_key
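The CRD above covers the daemonset installation. For the deployment installation that collects cluster-level metrics and events (the second installation described earlier), a hedged sketch might look like the following; the name, replica count, service account, and pipeline choices are assumptions you would adapt, and the service account needs the appropriate RBAC permissions:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: deployment
spec:
  mode: deployment
  replicas: 1   # a single replica avoids collecting duplicate cluster-level metrics and events
  serviceAccount: otel-collector-deployment
  config: |
    receivers:
      k8s_cluster:
        collection_interval: 10s
      k8sobjects:
        auth_type: serviceAccount
        objects:
          - name: events
            mode: watch
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: "otlp.nr-data.net:4317"
        headers:
          api-key: ${NEW_RELIC_LICENSE_KEY}
    service:
      pipelines:
        metrics:
          receivers: [k8s_cluster]
          processors: [batch]
          exporters: [otlp]
        logs:
          receivers: [k8sobjects]
          processors: [batch]
          exporters: [otlp]
  env:
    - name: NEW_RELIC_LICENSE_KEY
      valueFrom:
        secretKeyRef:
          name: newrelic-key-secret
          key: new_relic_license_key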
Next steps
If you are already using OpenTelemetry for your services, you can learn how to link OpenTelemetry-instrumented applications to Kubernetes. You can also learn more about monitoring Kubernetes with OpenTelemetry here.