Introduction

Before Kubernetes took over the world, cluster administrators, DevOps engineers, application developers, and operations teams had to perform many manual tasks in order to schedule, deploy, and manage their containerized applications. The rise of the Kubernetes container orchestration platform has altered many of these responsibilities.

Kubernetes makes it easy to deploy and operate applications in a microservice architecture. It does so by creating an abstraction layer on top of a group of hosts, so that development teams can deploy their applications and let Kubernetes manage them:

  • Controlling resource consumption by application or team

  • Evenly spreading application load across a host infrastructure

  • Automatically load balancing requests across the different instances of an application

  • Monitoring resource consumption and resource limits to automatically stop applications from consuming too many resources, and then restarting them

  • Moving an application instance from one host to another if there is a shortage of resources in a host, or if the host dies

  • Automatically leveraging additional resources made available when a new host is added to the cluster

  • Easily performing canary deployments and rollbacks

But such capabilities also give teams new things to worry about. For example:

  • There are a lot more layers to monitor.

  • The ephemeral and dynamic nature of Kubernetes makes it a lot more complex to troubleshoot.

  • Automatic scheduling of pods can cause capacity issues, especially if you’re not monitoring resource availability.

  • Until recently, monitoring your applications required aligning to your organization’s monitoring practices, installing language agents, instrumenting each app’s code, and redeploying each application.

In effect, while Kubernetes solves old problems, it can also create new ones. Specifically, adopting containers and container orchestration requires teams to rethink and adapt their monitoring strategies to account for the new infrastructure layers introduced in a distributed Kubernetes environment. 

With that in mind, we designed this guide to highlight the fundamentals of what you need to know to effectively monitor Kubernetes deployments with New Relic One and our latest innovation, Auto-telemetry with Pixie. Pixie gives you instant Kubernetes observability without the need to manually instrument your code or install language agents. This guide outlines some best practices for monitoring Kubernetes in general, and provides detailed advice for how to do so with the New Relic One platform.

Whether you’re a Kubernetes cluster admin, an application developer, an infrastructure engineer, or a DevOps practitioner, by the end of this guide you will be able to use New Relic and Auto-telemetry with Pixie to get instant Kubernetes observability. As a result, you’ll know how to monitor the health and capacity of Kubernetes resources, debug applications running in your clusters, correlate events in Kubernetes with contextual insights to help you troubleshoot issues, and understand how to track end-user experience from your apps.

Getting Started with Kubernetes and New Relic

To effectively monitor Kubernetes deployments, New Relic gives you visibility into your Kubernetes clusters and workloads in minutes, whether your clusters are hosted on-premises or in the cloud.

 

Instant Kubernetes observability: Auto-telemetry with Pixie

Until recently, monitoring the performance of your Kubernetes clusters and the workloads running in them required installing multiple integrations and language agents. This wasn’t easy, and application monitoring, in particular, required manually instrumenting your applications, updating code, and redeploying those apps. But with our acquisition of Pixie Labs, there’s a faster and easier way.

Now you get instant visibility into your Kubernetes clusters and workloads in just minutes, without installing language agents or updating your code using Auto-telemetry with Pixie. Pixie is a Kubernetes-native, in-cluster observability solution that automatically harvests telemetry data using eBPF. Pixie data flows directly into New Relic’s Telemetry Data Platform, giving you scalable data retention, advanced correlation, intelligent alerting, and powerful visualizations.

 

Kubernetes integrations

New Relic and Pixie work with Kubernetes clusters hosted on-premises or in the cloud, including the following:

  • Amazon Elastic Kubernetes Service (Amazon EKS) provides Kubernetes as a managed service on AWS. It helps make deploying, managing, and scaling containerized applications on Kubernetes easier.

  • Google Kubernetes Engine (GKE) provides an environment for deploying, managing, and scaling your containerized applications using Google-supplied infrastructure.

  • Microsoft Azure Kubernetes Service (AKS) manages your hosted Kubernetes environment, making it easier to deploy and manage containerized applications without container orchestration expertise. It also eliminates the burden of ongoing operations and maintenance by provisioning, upgrading, and scaling resources on demand, without taking your applications offline.

  • Red Hat OpenShift provides developers with an integrated development environment (IDE) for building and deploying Docker-formatted containers, and then managing them with Kubernetes.

  • Pivotal Container Service (PKS) provides the infrastructure and resources to reliably deploy and run containerized workloads across private and public clouds.

 

Deploying instrumentation

New Relic’s Kubernetes solution consists of multiple components that work together to give you end-to-end observability across your clusters. You have the flexibility to deploy only the components you prefer, but to achieve full observability, you’ll want to install the complete package.

 

Kubernetes infrastructure: System-level metrics for nodes, pods, namespaces, and containers
Kubernetes events: Kubernetes events happening inside your clusters
Prometheus metrics: Metrics exposed by Prometheus-compatible endpoints
Kubernetes logs: Logs for the Kubernetes control plane and associated pods
Application performance: Code-level insights with stack traces and errors, and distributed traces
Network performance monitoring: Domain names, DNS, network mapping, TCP, and network flow graphs

 

To deploy the instrumentation solutions from the table above, we offer two methods: Guided install and manual setup. See below to determine which method is best for your needs.

 

Guided install (recommended for most users): Simple and intuitive setup flow. The guided install uses a Helm chart or manifest to instrument your cluster and workloads for you.
Manual setup: Provides advanced options and additional flexibility beyond the guided install process.

 

Guided install

Guided install makes deploying Kubernetes instrumentation fast and simple. To do so, simply choose Add more data, select Guided install, and choose Kubernetes. Select Auto-telemetry with Pixie for code-level APM visibility without having to install language agents.

 

Manual setup

If you wish to manually install each piece of our Kubernetes solution, follow the instructions at Install the Kubernetes integration using Helm.

APM: Auto-telemetry with Pixie or installing New Relic language agents

Auto-telemetry with Pixie means that you no longer need to install language agents to get code-level insights for application performance management (APM) for apps running in Kubernetes. To deploy Pixie, choose the Guided install method.

If you do want to install language agents, you need to instrument your application with the Kubernetes Downward API. We created a sample app to demonstrate how this works in a Node.js application—fork this repo for your own use (our Monitoring Application Performance in Kubernetes blog post explains how to add this type of Kubernetes metadata to APM-monitored application transactions).

Kubernetes integration

Installing the Kubernetes integration is simple. Follow the instructions in Kubernetes integration: install and configure.

Prometheus integration

Installing the Prometheus OpenMetrics integration within a Kubernetes cluster is as easy as changing two variables in a manifest and deploying it in the cluster. Follow the instructions in Send Prometheus metric data to New Relic.

Logs

New Relic offers a Fluent Bit output plugin to enable New Relic Logs for Kubernetes to collect cluster log data. After downloading the plugin, you can deploy it as a Helm chart or manually through the command line, as described in the documentation.

Explore Your Data with Kubernetes Cluster Explorer

New Relic’s Kubernetes cluster explorer provides a multi-dimensional representation of a Kubernetes cluster from which you can explore your namespaces, deployments, nodes, pods, containers, and applications. With the cluster explorer, you will be able to easily retrieve the data and metadata of these elements, and understand how they are related.

 

The Kubernetes cluster explorer in New Relic One.

 

From the Kubernetes cluster explorer, you can:

  • Select the cluster you want to explore

  • Filter by namespace or deployment

  • Select specific pods or nodes for status details

 

The cluster explorer has two main parts:

1. A visual display of the status of a cluster, up to 24 nodes. Within the visual display, the cluster explorer shows the nodes that have the most issues in a series of four concentric rings:

  • The outer ring shows the nodes of the cluster, with each node displaying performance metrics for CPU, memory, and storage.

  • The second ring displays the distribution and status of the non-alerting pods associated with that node.

  • The third ring displays pods that are on alert and may have health issues even if they are still running.

  • Finally, the innermost ring displays pods that are pending or that Kubernetes is unable to run.

You can select any pod to see its details, such as namespace, deployment, its containers, alert status, CPU usage, memory usage, and more.

2. The cluster explorer node table displays all the nodes of the selected cluster/namespace/deployments, and can be sorted according to node name, node status, pod, pod status, container, CPU% vs. Limit and MEM% vs. Limit.

 

Benefits of monitoring with the cluster explorer

Cluster explorer expands the Kubernetes monitoring capabilities already built into the New Relic One platform. Use the cluster explorer’s advanced capabilities to filter, sort, and search for Kubernetes entities, so you can better understand the relationships and dependencies within an environment. The default data visualizations of your cluster provide a fast and intuitive path to getting answers and understanding your Kubernetes environments, so you can contain the complexity associated with running Kubernetes at scale.

When your team adopts cluster explorer, you can expect improved performance and consistency, and quicker resolutions when troubleshooting errors. Our platform can help you ensure that your clusters are running as expected or quickly detect performance issues within your cluster—even before they have a noticeable impact on your customers.

How to Build a Comprehensive Kubernetes Observability Strategy

We recommend that Kubernetes observability begins with these seven steps:

  1. Visualize your services

  2. Monitor cluster health and capacity

  3. Correlate Kubernetes events with cluster health

  4. Understand APM correlations

  5. Integrate Prometheus metrics

  6. Monitor logs in context

  7. Understand end-user experience

 

1. Visualize your services

When working in a Kubernetes environment, it can be difficult to untangle the dependencies between applications and infrastructure, or to drill down into and navigate all of the entities (containers, pods, nodes, deployments, namespaces, and so on) that may be involved in a troubleshooting effort. You need to observe performance and dependencies across the entire Kubernetes environment.

You should be able to visualize key parts of your services, including:

  • The structure of your application and its dependencies 

  • The interactions between various microservices

How New Relic helps

The cluster explorer provides a multi-dimensional representation of a Kubernetes cluster that allows you to drill down into Kubernetes data and metadata in a high-fidelity, curated UI that simplifies complex environments. Your teams can use cluster explorer to troubleshoot failures, bottlenecks, and other abnormal behavior across your Kubernetes environments more quickly.

Suggested alerting

When deploying the New Relic Kubernetes integration for the first time in an account, a default set of alert conditions is deployed to the account. The alert policy is configured without a notification channel to avoid unwanted alerts.

You can customize the alert conditions' thresholds to your environment and update the alert policy to send notifications. For more, see the New Relic Infrastructure alerts documentation.

 

2. Monitor cluster health and capacity

Kubernetes environments vary from deployment to deployment, but they all have a handful of key components, resources, and potential errors in common. The following sections introduce best practices, including tips for how to use New Relic and alerts, for monitoring the health and capacity of any Kubernetes environment:

  • Track cluster resource usage
  • Monitor node resource consumption
  • Monitor for missing pods
  • Find pods that aren’t running
  • Troubleshoot container restarts
  • Track container resource usage
  • Monitor storage volumes
  • Monitor the control plane

 

Track cluster resource usage

When you administer clusters, you need enough usable resources in your cluster to avoid running into issues when scheduling pods or deploying containers. If you don’t have enough capacity to meet the minimum resource requirements of all your containers, scale up your nodes’ capacity or add more nodes to distribute the workload.

You should know:

  • What percentage of cluster resources you’re using at any given time 

  • If your clusters are over- or under-provisioned

  • How much demand you’ve placed on your systems
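As a rough illustration of the first question above, the sketch below aggregates CPU and memory usage across nodes into cluster-wide percentages. The `NodeUsage` type, its field names, and the sample values are assumptions made for this example; they are not the schema of any New Relic or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    # Hypothetical record of one node's usage; field names are illustrative.
    name: str
    cpu_cores_used: float
    cpu_cores_allocatable: float
    memory_bytes_used: int
    memory_bytes_allocatable: int


def cluster_utilization(nodes):
    """Return (cpu_percent, memory_percent) aggregated across all nodes."""
    total_cpu = sum(n.cpu_cores_allocatable for n in nodes)
    total_mem = sum(n.memory_bytes_allocatable for n in nodes)
    if total_cpu == 0 or total_mem == 0:
        raise ValueError("cluster reports no allocatable resources")
    cpu_pct = 100.0 * sum(n.cpu_cores_used for n in nodes) / total_cpu
    mem_pct = 100.0 * sum(n.memory_bytes_used for n in nodes) / total_mem
    return cpu_pct, mem_pct


# Made-up sample data for a two-node cluster.
GIB = 1024 ** 3
sample_nodes = [
    NodeUsage("node-a", 3.0, 4.0, 12 * GIB, 16 * GIB),
    NodeUsage("node-b", 1.0, 4.0, 4 * GIB, 16 * GIB),
]
cpu_pct, mem_pct = cluster_utilization(sample_nodes)
print(f"CPU: {cpu_pct:.0f}%  memory: {mem_pct:.0f}%")
```

Aggregating before dividing, rather than averaging per-node percentages, keeps large and small nodes weighted by their actual capacity.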

How New Relic helps

Our Kubernetes integration monitors and tracks aggregated core and memory usage across all nodes in your cluster. This allows you to meet resource requirements for optimal application performance.

 

The New Relic Infrastructure default dashboard for core and memory usage.

 

Suggested alerting

Set alerts on the cores and memory usage of the hosts in your cluster.

 

Monitor node resource consumption

Beyond simply keeping track of nodes in your cluster, you need to monitor the CPU, memory, and disk usage for Kubernetes nodes (workers and masters) to ensure all nodes in your cluster are healthy.

Use this data to ensure:

  • You have enough nodes in your cluster

  • The resource allocations to existing nodes are sufficient for deployed applications

  • You’re not hitting resource limits

  • etcd is healthy

How New Relic helps

New Relic tracks resource consumption (used cores and memory) for each Kubernetes node. This lets you track the number of network requests sent across containers on different nodes within a distributed service. You can also track resource metrics for all containers on a specific node, regardless of which service they belong to:

 

The New Relic Infrastructure default dashboard to monitor Node Resource Consumption. 

 

Always ensure your current deployment has sufficient resources to scale. You don’t want new node deployments blocked by lack of resources.

Suggested alerting

Set alerts so you’ll be notified if hosts stop reporting or if a node’s CPU or memory usage crosses a desired threshold.

 

Monitor for missing pods

From time to time, you may find your cluster is missing a pod. A pod can go missing if the engineers did not provide sufficient resources when they scheduled it. The pod may have never started; it could be in a restart loop; or it might be missing because of an error in its configuration.

To make sure Kubernetes does its job properly, you need to confirm the health and availability of pod deployments. A pod deployment defines the number of instances that need to be present for each pod, including backup instances. (In Kubernetes, this is referred to as a ReplicaSet.) Sometimes the number of pods is not specified in the Replicas field of a deployment. Even when it is, Kubernetes decides whether it can run another instance based on the resources the administrator has defined.

 

forbidden: exceeded quota: compute-resources, requested: pods=1, used: pods=1, limited: pods=1
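The arithmetic behind the quota message above can be sketched as follows: a request is rejected when the amount already used plus the amount requested exceeds the limit. The function name and the dictionary shapes are assumptions for illustration only, not the admission logic Kubernetes actually runs.

```python
def quota_allows(requested, used, hard):
    """Return (allowed, resources_over_quota) for one creation request."""
    over = [
        name
        for name, amount in requested.items()
        if name in hard and used.get(name, 0) + amount > hard[name]
    ]
    return (not over, over)


# Values taken from the message above: one pod in use, a limit of one pod.
allowed, over = quota_allows(
    requested={"pods": 1}, used={"pods": 1}, hard={"pods": 1}
)
print(allowed, over)
```

With these values the request is refused because a second pod would exceed the limit of one.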

 

How New Relic helps

New Relic makes it easier to avoid this issue by showing you the resource limitations of the cluster.

If you don’t have enough resources to schedule a pod, add more container instances to the cluster or exchange a container instance for one with the appropriate amount of resources. In general, you can use the New Relic Kubernetes integration to monitor for missing pods and immediately identify deployments that require attention. This often creates an opportunity to resolve resource or configuration issues before they affect application availability or performance.

 

The New Relic Infrastructure default dashboard to monitor missing pods by deployment.

 

Suggested alerting

Set an alert for when a deployment’s missing pods value rises above a certain threshold for a certain period. If the number of available pods for a deployment falls below the number of pods you specified when you created the deployment, the alert will trigger. The alert will be applied to each deployment that matches the filters you set.
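The condition described above can be sketched in two parts: compute the missing pods per deployment, then check that the value has stayed above the threshold for several consecutive samples. The field names `desired` and `available`, and the sample deployments, are assumptions for this example rather than the integration's actual attribute names.

```python
def missing_pods(deployments):
    """Map deployment name -> number of desired pods that are not available."""
    result = {}
    for name, status in deployments.items():
        missing = status["desired"] - status["available"]
        if missing > 0:
            result[name] = missing
    return result


def breaches_for(samples, threshold, min_consecutive):
    """True if the most recent `min_consecutive` samples all exceed threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Made-up sample: "cart" wants four pods but only two are available.
sample = {
    "checkout": {"desired": 3, "available": 3},
    "cart": {"desired": 4, "available": 2},
}
print(missing_pods(sample))
print(breaches_for([0, 1, 2, 2], threshold=0, min_consecutive=3))
```

Requiring consecutive breaches avoids alerting on the brief gap that occurs during a normal rolling update.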

 

Find pods that aren’t running

Kubernetes dynamically schedules pods into the cluster; if you have resource issues or configuration errors, scheduling will likely fail. If a pod isn’t running or even scheduled, then there’s likely an issue with either the pod or the cluster, or with your entire Kubernetes deployment.

When you see that pods aren’t running, you’ll want to know:

  • If there are any pods in a restart loop

  • How often requests are failing

  • If there are resource issues or configuration errors

  • If a pod was terminated

How New Relic helps

As noted, if you have resource issues or configuration errors, Kubernetes may not be able to schedule the pods. In such cases, you want to check the health of your deployments, and identify configuration errors or resource issues.

With the New Relic Infrastructure Kubernetes integration (deployed automatically through Guided install), you can use default deployment data to discover and track pods that may not be running and sort them by cluster and namespace.

 

The New Relic Infrastructure default dashboard to monitor pods by cluster or namespace. 

 

Additionally, you can investigate the root causes of terminated pods with the terminated pods metric. For example, if a pod’s application memory reaches the memory limit set on its containers, the pod is terminated by the out-of-memory (OOM) killer. In such cases, New Relic exposes the reason for pod termination.

 

New Relic Dashboard displaying a list of pods that have been terminated

 

Suggested alerting

Set alerts on the status of your pods; alerts should trigger when a pod has a status of “Failed,” “Pending,” or “Unknown” for the period of time you specify.
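A minimal sketch of that rule: flag any pod that has stayed in one of the three statuses for at least a grace period. The record shape (`name`, `phase`, `since`) and the five-minute grace period are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

UNHEALTHY_PHASES = {"Failed", "Pending", "Unknown"}


def pods_to_alert(pods, now, grace):
    """Return names of pods stuck in an unhealthy phase for at least `grace`."""
    return [
        pod["name"]
        for pod in pods
        if pod["phase"] in UNHEALTHY_PHASES and now - pod["since"] >= grace
    ]


# Made-up sample: web-3 is Pending too, but only for 30 seconds.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
pods = [
    {"name": "web-1", "phase": "Running", "since": now - timedelta(hours=2)},
    {"name": "web-2", "phase": "Pending", "since": now - timedelta(minutes=10)},
    {"name": "web-3", "phase": "Pending", "since": now - timedelta(seconds=30)},
]
print(pods_to_alert(pods, now, grace=timedelta(minutes=5)))
```

The grace period matters because every pod passes through Pending briefly while it is being scheduled and its images are pulled.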

 

Troubleshoot container restarts

Under normal conditions, containers should not restart. Container restarts are a sign that you’re likely hitting a memory limit in your containers. Restarts can also indicate an issue with either the container itself or its host. Additionally, because of the way Kubernetes schedules containers, troubleshooting container resource issues can be difficult because Kubernetes will restart—if not terminate—containers when they hit their limits.

Monitoring the container restarts helps you understand:

  • If any containers are in a restart loop

  • How many container restarts occurred in X amount of time

  • Why containers are restarting

How New Relic helps

A running count of container restarts is part of the default container data New Relic gathers with the Kubernetes integration.

 

The New Relic Infrastructure default dashboard to monitor container restarts.

 

Suggested alerting

Set alerts on the number of Kubernetes container restarts. An alert gives you immediate, useful notifications; route it to a channel that does not page you, so that routine restarts do not interrupt your sleep.
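Because the restart figure is a running count, answering "how many restarts in X amount of time" means taking the increase across a window rather than the raw value. The sketch below assumes samples arrive as `(timestamp_seconds, restart_count)` pairs sorted by time, and that a drop in the count means the container was recreated and its counter reset; both are assumptions for this example.

```python
def restarts_in_window(samples, window_start):
    """Sum the increase in a running restart counter since `window_start`."""
    in_window = [count for ts, count in samples if ts >= window_start]
    increase = 0
    for prev, cur in zip(in_window, in_window[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter went backwards: treat it as a reset and count from zero.
            increase += cur
    return increase


# Made-up samples, one per minute; the counter resets at t=180.
samples = [(0, 1), (60, 1), (120, 3), (180, 0), (240, 2)]
print(restarts_in_window(samples, window_start=60))
```

Comparing this increase against a threshold, instead of the absolute count, keeps a long-lived container with a few old restarts from alerting forever.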

 

Track container resource usage

Monitoring container resource use helps you ensure that containers and applications remain healthy. For example, if a container hits its limit for memory usage, the kubelet agent might terminate it.

When monitoring container resource use, you need to know:

  • If your containers are hitting resource limits and affecting the performance of their applications

  • If there are spikes in resource consumption

  • If there is a pattern to the distribution of errors per container

How New Relic helps

First, identify the minimum amount of CPU and memory a container requires to run—which needs to be guaranteed by the cluster—and monitor those resources with New Relic.

Second, monitor container resource limits. A limit is the maximum amount of a resource that the container is allowed to consume. In Kubernetes, resource limits are unbounded by default.
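One way to act on these two steps is to express usage as a percentage of the limit and to treat a missing limit as its own case, since an unbounded container can never be "near" a limit. The function name, the 80% warning level, and the sample containers are assumptions for illustration.

```python
def limit_status(used, limit, warn_pct=80.0):
    """Classify one resource reading against its configured limit."""
    if limit is None:
        return "no-limit"  # unbounded: nothing caps this container's usage
    used_pct = 100.0 * used / limit
    return "near-limit" if used_pct >= warn_pct else "ok"


# Made-up memory readings; "sidecar" has no limit configured.
MIB = 1024 ** 2
containers = [
    {"name": "api", "memory_used": 450 * MIB, "memory_limit": 512 * MIB},
    {"name": "worker", "memory_used": 100 * MIB, "memory_limit": 512 * MIB},
    {"name": "sidecar", "memory_used": 30 * MIB, "memory_limit": None},
]
report = {
    c["name"]: limit_status(c["memory_used"], c["memory_limit"])
    for c in containers
}
print(report)
```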

 

The New Relic Infrastructure default dashboard to monitor container memory usage. 

 

The New Relic Infrastructure default dashboard to monitor container CPU usage.

 

This type of monitoring can help proactively resolve resource usage issues before they affect your application.

Suggested alerting

Set alerts on container CPU and memory usage and on limits for those metrics.

 

Monitor storage volumes

You need to avoid data loss or application crashes that result from running out of space on your storage volumes.

In Kubernetes, storage volumes are allocated to pods and possess the same lifecycle as the pod; in other words, if a container is restarted, the volume is unaffected, but if a pod is terminated, the volume is destroyed with the pod. This works well for stateless applications or batch processing where the data doesn’t outlive a transaction.

Persistent volumes, on the other hand, are used for stateful applications and when the data must be preserved beyond the lifespan of a pod. Persistent volumes are well suited for database instances or messaging queues.

To monitor Kubernetes volumes, you need to:

  • Ensure 1) your application has enough disk space, and 2) your pods don’t run out of space.

  • View volume usage, and adjust either the amount of data generated by the application or the size of the volume (according to usage).

  • Identify persistent volumes, and apply a different alert threshold or notification for these volumes, which likely hold important application data.
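The last point above, applying a stricter threshold to persistent volumes, can be sketched as follows. The record shape and the 90% and 75% thresholds are assumptions chosen for the example, not recommended values.

```python
def volume_alerts(volumes, ephemeral_pct=90.0, persistent_pct=75.0):
    """Return (name, used_percent) for volumes at or above their threshold."""
    alerts = []
    for vol in volumes:
        used_pct = 100.0 * vol["used_bytes"] / vol["capacity_bytes"]
        threshold = persistent_pct if vol["persistent"] else ephemeral_pct
        if used_pct >= threshold:
            alerts.append((vol["name"], round(used_pct, 1)))
    return alerts


# Made-up sample: both volumes are 80% full, but only the persistent one alerts.
GIB = 1024 ** 3
volumes = [
    {"name": "cache", "persistent": False,
     "used_bytes": 8 * GIB, "capacity_bytes": 10 * GIB},
    {"name": "db-data", "persistent": True,
     "used_bytes": 80 * GIB, "capacity_bytes": 100 * GIB},
]
print(volume_alerts(volumes))
```

The lower threshold for persistent volumes buys more time to resize before data that must survive the pod is at risk.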

How New Relic helps

Monitor and alert on disk volume issues, especially for persistent volumes, whose data must remain available to stateful applications and must not be destroyed when a specific pod is rescheduled or recreated (for example, when a container image is updated to a new version).

By monitoring your Kubernetes volumes with New Relic One, you can set alerts to be informed as soon as a volume reaches a certain threshold—a proactive approach to limiting issues with application performance or availability.

 

The New Relic Infrastructure default dashboard to monitor Kubernetes storage volumes.

 

Suggested alerting

Set alerts on available bytes, capacity, and node usage in your cluster. 

 

Monitor the control plane

The control plane ensures the cluster’s current state matches the desired state by automatically starting or restarting containers and scaling the number of replicas of a given application. The control plane maintains a record of all of the Kubernetes objects in the cluster and runs continuous control loops to manage those objects’ state.

Monitoring the control plane helps Kubernetes operators know the health status of control plane components, so they can react proactively before issues impact services and end users.

The Kubernetes integration monitors and collects metrics from the following control plane components:

etcd

This is where the current and desired state of your cluster is stored, including information about all pods, deployments, services, secrets, etc. This is the only place where Kubernetes stores its information.

To monitor etcd, you’ll want to track:

  • Leader existence and change rate

  • Committed, applied, pending, and failed proposals

  • gRPC performance

Suggested alerting

Set alerts to be notified if pending or failed proposals reach inappropriate thresholds.
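To make the etcd checks concrete, the sketch below parses a fragment of Prometheus text-format output and applies three simple rules. The metric names are the ones etcd exposes as far as I am aware; treat them, the sample values, and the thresholds as assumptions to verify against your etcd version.

```python
def parse_metrics(text):
    """Parse unlabelled samples from Prometheus text exposition format."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blanks, comments, malformed lines, and labelled samples.
        if len(parts) < 2 or parts[0].startswith("#") or "{" in parts[0]:
            continue
        values[parts[0]] = float(parts[1])
    return values


def etcd_alerts(metrics, previous_failed_total, max_pending=5):
    """Return a list of alert strings for one scrape of etcd metrics."""
    alerts = []
    if metrics.get("etcd_server_has_leader", 0.0) != 1.0:
        alerts.append("no leader")
    if metrics.get("etcd_server_proposals_pending", 0.0) > max_pending:
        alerts.append("pending proposals high")
    failed = metrics.get("etcd_server_proposals_failed_total", 0.0)
    if failed > previous_failed_total:
        alerts.append("failed proposals increased")
    return alerts


# Made-up scrape output for illustration.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
etcd_server_proposals_pending 12
etcd_server_proposals_failed_total 3
"""
print(etcd_alerts(parse_metrics(SAMPLE), previous_failed_total=3))
```

Failed proposals are a cumulative counter, so the check compares against the previous scrape rather than against a fixed number.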

 

API server

The central RESTful HTTP API handles all requests coming from users, nodes, control plane components, and automation. The API server handles authentication, authorization, validation of all objects, and is responsible for storing said objects in etcd. It’s the only component that talks with etcd.

To monitor the API server, you’ll want to track:

  • Rate and number of HTTP requests

  • Rate and number of apiserver requests

Suggested alerting

Set alerts to trigger if the rate or number of HTTP requests crosses a desired threshold.

 

Scheduler

The scheduler is responsible for assigning newly created pods to a worker node that is capable of running said pod. To do so, the scheduler updates the pod definition through the API server.

The scheduler takes several factors into consideration when selecting a worker node, such as requested CPU/memory vs. what’s available on the node.

To monitor the scheduler, you’ll want to track:

  • Rate, number, and latency of HTTP requests

  • Scheduling latency

  • Scheduling attempts by result

  • End-to-end scheduling latency (sum of scheduling) 

Suggested alerting

Set alerts to trigger if the rate or number of HTTP requests crosses a desired threshold.

 

Controller manager

This is where all the controllers run. Controllers, like the scheduler, use the “watch” capabilities of the API server to be notified of state changes. When notified, they work to get the actual cluster state to the desired state. For example, if we create a new object that creates Y number of pods, the associated controller is the one in charge of bringing the current cluster state of X pods to Y number of pods.

To monitor the controller manager, you’ll want to track:

  • The depth of the work queue

  • The number of retries handled by the work queue

Suggested alerting

Set alerts to trigger if requests to the work queue exceed a maximum threshold.

 

3. Correlate Kubernetes events with cluster health

By monitoring Kubernetes events, you can correlate the status of your Kubernetes cluster and objects with Kubernetes events for faster troubleshooting and issue resolution. If you run complex Kubernetes environments or don't have command-line access to your cluster, Kubernetes events provide the insights you need to understand what’s happening inside your cluster.

For example, let’s say you have a pod that doesn’t get properly scheduled and won’t start because the node it’s assigned to doesn’t have enough memory allocated. In this case, the node can’t accommodate the pod, so the pod stays in pending status, but no other metrics or metadata provide deeper insight into the issue. With Kubernetes events, you’d get a clear message:

 

FailedScheduling [...]  0 nodes are available: Insufficient memory

 

If you’re managing a Kubernetes deployment, or developing on top of one, you need:

  • Visibility into the Kubernetes events for each cluster

  • Visibility into the Kubernetes events related to specific objects, such as pods or nodes

  • Alerting on Kubernetes events

How New Relic helps

As described in the previous example, Kubernetes events provide additional, contextual information that is not provided by metrics and metadata. When using Kubernetes events alongside the cluster explorer, you get a holistic view of the health of your platform.

 

Access Kubernetes events from the cluster explorer Events tab.

 

When troubleshooting an issue in a pod, Kubernetes events more readily point toward root causes with useful context. New Relic also layers each event with useful details, so you can determine if an event affects several pods or nodes in a cluster, such as when a ReplicaSet is scaled or when a StatefulSet creates a new pod.

You can query Kubernetes events with New Relic chart builder, or view them from the cluster explorer.

Suggested alerting

Set alerts for specific types of events on objects and resources in your cluster. For example, New Relic can send alerts if an expected autoscaling action doesn’t occur.
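As a sketch of alerting on specific event types, the function below counts Warning events by reason, optionally restricted to one kind of object. The dictionaries mirror the `type`, `reason`, and `involvedObject` fields of a Kubernetes Event; the sample events and object names are made up for illustration.

```python
from collections import Counter


def warning_reasons(events, kind=None):
    """Count Warning events by reason, optionally for one object kind."""
    counts = Counter()
    for event in events:
        if event["type"] != "Warning":
            continue
        if kind is not None and event["involvedObject"]["kind"] != kind:
            continue
        counts[event["reason"]] += 1
    return counts


# Made-up events for illustration.
events = [
    {"type": "Warning", "reason": "FailedScheduling",
     "involvedObject": {"kind": "Pod", "name": "web-1"}},
    {"type": "Warning", "reason": "FailedScheduling",
     "involvedObject": {"kind": "Pod", "name": "web-2"}},
    {"type": "Normal", "reason": "ScalingReplicaSet",
     "involvedObject": {"kind": "Deployment", "name": "web"}},
    {"type": "Warning", "reason": "NodeNotReady",
     "involvedObject": {"kind": "Node", "name": "node-a"}},
]
print(warning_reasons(events, kind="Pod"))
```

Grouping by reason shows at a glance whether one problem, such as a scheduling failure, is affecting several objects at once.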

 

4. Understand APM correlations

A key benefit of Kubernetes is that it decouples your application and its business logic from the specific details of its runtime environment. That means if you ever have to shift the underlying infrastructure to a new Linux version, for example, you won’t have to completely rewrite the application code.

When monitoring applications managed by an orchestration layer, being able to relate an application error trace, for instance, to the container, pod, or host it’s running in can be very useful for debugging or troubleshooting.

At the application layer, you need to monitor the performance and availability of applications running inside your Kubernetes cluster. You do that by tracking such metrics as request rate, throughput, and error rate.

New Relic APM lets you add custom attributes, and that metadata is available in transaction traces gathered from your application. You can create custom attributes to collect information about the exact Kubernetes node, pod, or namespace where a transaction occurred.
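One common way to obtain this metadata is to have the Kubernetes Downward API expose it to the container as environment variables, then read those variables at startup. The sketch below only builds the attribute dictionary; the environment variable names and attribute keys are assumptions that must match whatever your pod spec defines, and each pair would then be passed to whichever custom-attribute call your APM agent provides.

```python
import os

# Assumed environment variable names; they must match the names chosen
# in the pod spec's Downward API `env` entries.
ENV_TO_ATTRIBUTE = {
    "K8S_NODE_NAME": "k8s.nodeName",
    "K8S_POD_NAME": "k8s.podName",
    "K8S_POD_NAMESPACE": "k8s.namespace",
}


def kubernetes_attributes(environ=None):
    """Build custom attributes from whichever variables are present."""
    if environ is None:
        environ = os.environ
    return {
        attribute: environ[variable]
        for variable, attribute in ENV_TO_ATTRIBUTE.items()
        if variable in environ
    }


# Example with an explicit mapping instead of the real process environment.
example_env = {"K8S_POD_NAME": "web-1", "K8S_POD_NAMESPACE": "shop"}
print(kubernetes_attributes(example_env))
```

Skipping variables that are absent lets the same code run unchanged outside a cluster, where none of them are set.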

The following sections introduce key parts of your Kubernetes-hosted applications to monitor:

  • Monitor application health
  • Prevent errors

 

Monitor application health

When you run applications in Kubernetes, the containers the apps run in often move around throughout your cluster as instances scale up or down. This scheduling happens automatically in Kubernetes, but could affect your application’s performance or availability. If you’re an application developer, being able to correlate Kubernetes objects to applications is important for debugging and troubleshooting.

You’ll want to know:

  • Which applications are associated with which clusters

  • How many transactions are happening within a given pod

  • The service latency or throughput for production applications

How New Relic helps

To monitor transaction traces in Kubernetes, you need a code-centric view of your applications. You need to correlate each application with the container, pod, or host it’s running in. You also need to identify pod-specific performance issues for any application’s workload.

 

New Relic dashboard displaying data and graphs in the pod details view

Use the pod details view in the cluster explorer to analyze the performance of applications running in that pod.

 

Knowing the names of the pod and node where the error occurred can speed your troubleshooting. Visibility into transaction traces quickly highlights any abnormalities in your Kubernetes-hosted application.

Additionally, we give you the ability to inspect the distributed traces for any application running in your cluster. If you click on an individual span in a distributed trace, you can quickly see the relevant Kubernetes attributes for that application. For example, you can find out which pod, cluster, and deployment an individual span belongs to.

 

New Relic dashboard displaying distributed tracing details

New Relic distributed tracing captures details about traces from your applications running in Kubernetes.

 

New Relic distributed tracing provides automated anomaly detection to identify slow spans and bottlenecks. You should also set alerts on key transactions and communications with third-party APIs.

To learn about how to gain visibility into transaction traces in Kubernetes, see the blog post, Monitoring Application Performance in Kubernetes.

Suggested alerting

Set up alerts for all applications running in production. Specifically, you’ll want to alert on API service requests, transactions, service latencies, uptime, and throughput, sending alerts when any of these metrics crosses the thresholds you define (for example, when latency rises above a limit or throughput falls below one).
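Alert thresholds have a direction: latency is a problem when it is too high, and throughput when it is too low. The following Python sketch shows that logic in isolation. The metric names and limits are made up for illustration, and this is not how New Relic alert conditions are implemented.

```python
def breached(metrics, thresholds):
    """Return the names of metrics that crossed their threshold.

    thresholds maps a metric name to (direction, limit), where direction
    is "above" (alert when value > limit) or "below" (alert when value < limit).
    """
    alerts = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue  # no data for this metric; nothing to compare
        if direction == "above" and value > limit:
            alerts.append(name)
        elif direction == "below" and value < limit:
            alerts.append(name)
    return alerts
```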

 

Prevent errors

If a single pod or particular pod IP starts failing or throwing errors, you need to troubleshoot before those errors harm your cluster or application. When something goes wrong, zero in on root causes as quickly as possible.

You’ll want to know:

  • In which namespace/host/pod did a transaction fail

  • If your app is performing as expected in all pods

  • The performance of application X running on pod Y

How New Relic helps

New Relic One gives a code-centric view of the applications running inside your cluster and helps you monitor your Kubernetes-hosted applications for performance outliers and track down errors.

 

Pie chart displaying pod errors

APM Error Profiles automatically notices if errors are occurring within the same pods and from the same pod IP addresses.

 

Suggested alerting

Set alerts to track error rates for any applications running in production environments in Kubernetes.

 

5. Integrate Prometheus metrics

Prometheus is an open-source toolkit that provides monitoring and alerting for services and applications running in containers, and it’s widely used to collect metrics data from Kubernetes environments. In fact, Prometheus’ scheme for exposing metrics has become the de facto standard for Kubernetes.

Prometheus uses a pull-based system to collect multidimensional time series metrics from services over HTTP endpoints, instead of relying on services to push metrics out to Prometheus. Because of this pull-based system, third parties, such as New Relic, can build integrations that work with Prometheus’ metric exporters to gather valuable data for storage and visualization.
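To make the pull model concrete, here is a minimal Python sketch that parses the plain-text format a Prometheus-compatible endpoint returns when it is scraped. It is a simplified illustration, not a complete parser: it skips comment lines, ignores optional timestamps, and does not unescape label values.

```python
import re

# Matches sample lines such as: http_requests_total{method="GET",code="200"} 1027
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches individual label pairs such as: method="GET"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels, value) tuples from scraped text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```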

How New Relic helps

The New Relic Prometheus OpenMetrics Integration collects telemetry data from the many services (such as Traefik, Envoy, and etcd) that expose metrics in a format compatible with Prometheus. In fact, with this integration you’ll be able to monitor key aspects of your Kubernetes environments, such as etcd performance and health metrics, Kubernetes horizontal pod autoscaler (HPA) capacity, and node readiness.

The integration supports both Docker and Kubernetes, using Prometheus version 2. 

After you install the integration for Docker or Kubernetes, you can begin building queries to track and visualize your Prometheus data in New Relic One. When troubleshooting issues in your Kubernetes clusters, the metrics collected by this integration are viewable alongside those gathered natively by Pixie and other New Relic integrations.

See the New Relic docs for more on compatibility and requirements, installation options, data limits, configuration, metric queries, troubleshooting, metric transformation, and more. 

 

Examples of using Prometheus data in New Relic 

There are any number of ways to use Prometheus data in New Relic, but consider the following use cases:

 

Monitoring etcd

Etcd is a key-value data store that’s essential for running Kubernetes clusters. Prometheus pulls metrics from etcd, so to ensure your clusters are healthy, use the Prometheus OpenMetrics Integration to monitor etcd server, disk, and network metrics such as:

  • etcd_server_has_leader
  • etcd_server_proposals_failed_total
  • etcd_network_peer_sent_bytes_total
  • etcd_disk_wal_fsync_duration_seconds
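As a sketch of how two of these metrics might be interpreted, the following Python function compares two consecutive scrapes of one etcd member. It assumes each scrape has already been reduced to a dictionary of metric name to value, and it treats any increase in failed proposals as worth a warning; both choices are illustrative.

```python
def etcd_warnings(previous, current):
    """Compare two scrapes (dicts of metric name -> value) and list concerns."""
    warnings = []

    # etcd_server_has_leader is a gauge: 1 when the member has a leader, else 0.
    if current.get("etcd_server_has_leader", 0) != 1:
        warnings.append("member reports no leader")

    # etcd_server_proposals_failed_total is a counter, so look at its growth.
    failed = (
        current.get("etcd_server_proposals_failed_total", 0)
        - previous.get("etcd_server_proposals_failed_total", 0)
    )
    if failed > 0:  # a negative delta would indicate a counter reset; ignore it
        warnings.append(f"{failed:g} proposals failed since the previous scrape")

    return warnings
```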

 

Kubernetes Horizontal Pod Autoscaler (HPA)

HPA automatically scales your Kubernetes deployment based on limits you configure. After installing the Prometheus OpenMetrics Integration, you can use the following query in the New Relic One query builder to build a dashboard widget and monitor the remaining HPA capacity.

FROM Metric SELECT latest(kube_hpa_status_current_replicas), latest(kube_hpa_spec_max_replicas) WHERE clusterName = '<YOUR CLUSTER NAME>' FACET hpa

 

New Relic One query builder tab displaying HPA capacity
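The query above returns the current and maximum replica counts for each HPA. The remaining capacity is the difference between the two. As a small sketch of that arithmetic, assuming you have the two values per HPA in plain dictionaries (the HPA names here are made up):

```python
def hpa_headroom(current_replicas, max_replicas):
    """Return remaining replicas and the used fraction of capacity per HPA."""
    report = {}
    for hpa, maximum in max_replicas.items():
        current = current_replicas.get(hpa, 0)
        report[hpa] = {
            "remaining": max(maximum - current, 0),
            "utilization": current / maximum if maximum else 0.0,
        }
    return report


print(hpa_headroom({"web": 8, "api": 2}, {"web": 10, "api": 2}))
```

An HPA with zero remaining replicas can no longer scale out, so that is a natural value to alert on.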

 

Node readiness

In Kubernetes, a node is marked ‘ready’ when it can accept workloads (pods). If a node is having issues, Kubernetes will label it as ‘not ready.’ To create an alert condition for this scenario using the integration, run the following query:

FROM Metric SELECT latest(kube_node_status_condition) WHERE condition = 'Ready' AND status = 'true' AND clusterName = '<YOUR CLUSTER NAME>' FACET nodeName

 

Dashboard showing an alert condition being created

Create an alert condition for node status.
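For the node readiness query, the value for each node is 1 when the node is ready and 0 when it is not, so an alert should fire when the value drops below 1. A minimal Python sketch of that check, assuming the per-node results are available as a dictionary (the node names are hypothetical):

```python
def not_ready_nodes(ready_by_node):
    """Return the sorted names of nodes whose Ready condition value is not 1."""
    return sorted(node for node, value in ready_by_node.items() if value != 1)


print(not_ready_nodes({"node-a": 1, "node-b": 0, "node-c": 1}))
```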

 

6. Monitor logs in context

Available in the Kubernetes cluster explorer, New Relic Logs provides a near-instant search with full contextual log information. And when you configure logs in context, you can correlate those log messages with application, infrastructure, Kubernetes, and event data.

For example, you can easily correlate application log messages with a related distributed trace in New Relic APM. New Relic appends trace IDs to the corresponding application logs and automatically filters these logs when you view them from the distributed tracing UI. With all of this data in a single tool, you can get to the root cause of issues more quickly, narrowing down from all of your logs to the exact log lines you need to identify and resolve a problem.

This gives you end-to-end visibility, as well as a level of depth and detail that simply isn’t available when you work with siloed sources of log data.
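To illustrate the general idea of stamping log lines with a trace ID, here is a minimal Python sketch using the standard `logging` module. It is not New Relic's implementation: the `get_trace_id` callback and the `trace.id` field name are assumptions made for the example. In practice, your agent or logging integration supplies the current trace ID.

```python
import json
import logging


class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id  # hypothetical callback returning the ID

    def filter(self, record):
        record.trace_id = self._get_trace_id()
        return True  # never drop the record; only enrich it


class JsonFormatter(logging.Formatter):
    """Render records as JSON so the trace ID is a searchable field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace.id": getattr(record, "trace_id", None),
        })
```

Attach both to a handler with `handler.addFilter(...)` and `handler.setFormatter(...)`. Every line the handler writes then carries the ID you can search on.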

 

New Relic One Kubernetes Cluster Explorer rerouting to New Relic Logs dashboard

New Relic Logs collects log data from your clusters.

 

7. Understand end-user experience

If you order an item from a delivery service, and it arrives at your house broken or late, do you really care what part of the delivery process broke? Whether it was the fault of the manufacturer, distributor, or the delivery service, the end result is equally annoying.

The same logic applies to companies hosting apps in Kubernetes: if a customer navigates to their website and a page doesn't load or takes too long, the customer isn’t interested in the technical reasons why. That’s why it’s not enough to track your own systems’ performance; it’s also essential to monitor the front-end performance of your applications to understand what your customers are actually experiencing.

Even though your application is running in Kubernetes, you can still identify and track the key indicators of customer experience, and clarify how the mobile or browser performance of your application is affecting business.

How New Relic helps

When you first migrate to Kubernetes, you can set up a pre-migration baseline to compare your frontend application’s average load time before and after the migration. You can use the same strategies as you would for application performance to gain insight into key indicators, such as response time and errors for mobile application and browser performance. It’s also imperative to monitor load time and availability to ensure customer satisfaction. New Relic Browser and New Relic Mobile are built to give you that crucial view into your users’ experiences.
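The baseline comparison comes down to simple arithmetic: average the load times before and after the migration and look at the relative change. A minimal Python sketch, assuming you have exported page load samples in seconds (the sample values are invented):

```python
from statistics import mean


def load_time_change(before_seconds, after_seconds):
    """Compare average load time before and after a migration."""
    baseline = mean(before_seconds)
    current = mean(after_seconds)
    return {
        "baseline": baseline,
        "current": current,
        # Positive means pages got slower; negative means they got faster.
        "change_percent": (current - baseline) / baseline * 100,
    }


print(load_time_change([2.0, 2.0], [2.5, 2.5]))
```

Averages can hide slow outliers, so you may want to compare percentiles in the same way.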

 

New Relic browser overview page displaying three different graphs

The New Relic Browser overview page shows a summary of browser performance for that app.

 

New Relic mobile dashboard displaying four different graphs

Quickly view crash occurrences, app launches, and more with New Relic Mobile.

 

Additionally, developers and operators both need to understand the availability of any Kubernetes-hosted service, often from locations all around the world. New Relic Synthetics is designed to track application availability and performance from a wide variety of locations.

New Relic brings together business-level information and performance data in one place. This helps teams across development, operations, product, and customer support identify potential areas for improvement in your products and find better ways to debug errors that may affect your customers.
 

Suggested alerting

New Relic Mobile Alerts:

  • Mobile network error rate and response time, to ensure you’re notified about problems on your most critical endpoints

New Relic Browser Alerts: 

  • Browser session count drop to indicate availability issues 

  • Browser JavaScript error rate or error count

  • Browser interaction duration

New Relic Synthetics Alerts:

  • Synthetics error rate or error count

  • Synthetics response times

  • Synthetics ping checks 

Scaling Kubernetes With Success: A Real-World Example

As you begin your Kubernetes journey, it may help to understand how another organization’s approach to monitoring enabled them to be successful with Kubernetes.   

Since its inception in 1997, Phlexglobal has been helping life sciences companies streamline clinical trials by enabling them to take charge of their trial master file (TMF), that is, the data repository for all documentation related to a clinical trial.

CHALLENGES

  • Scaling, monitoring and managing the performance of Phlexglobal's trial master file (TMF) platform while migrating workloads onto Kubernetes 

  • Ensuring this platform is well maintained, which is critical not only to prove compliance with industry and government regulations, but also to facilitate and improve collaboration among a clinical trial’s many partners

SOLUTION

The team needed a tool to facilitate an agile organization with specific needs from the development and IT operations teams, so Phlexglobal looked to New Relic to get a system-wide view into performance that would enable proactive monitoring. Explore their full monitoring story to appreciate the impact and results.

More Perfect Software

Try New Relic One today and start building better, more resilient software experiences.