As Kubernetes environments continue to scale, they also get more complex, making it harder to monitor their performance and health. Our updated Kubernetes integration (v3) significantly reduces CPU and memory usage, adds support for external control planes such as Rancher, and improves log messages to help you identify issues. For instance, in big clusters, the new integration can be configured to request 80% less memory than before. That’s a huge improvement.

In this post, I’ll cover the benefits of our new integration as well as some of the challenges we faced with our upgrade to v3.

Why should you upgrade your Kubernetes integration?

If you’re using an older version of our Kubernetes integration (or even if you’re new to monitoring Kubernetes with New Relic), you’ll get a lot of benefits by using v3.

  • Reduced memory footprint: Get up to 80% reduction in big clusters, thanks to an improved kube-state-metrics (KSM) scraping component.
  • Improved troubleshooting: Triage bugs and fix issues more easily with enhanced logs and a clearer process lifecycle.
  • More configurable: Use three individually configurable components, including config files that provide more granular settings for each data source.
  • Scrape external control planes: Scrape metrics from components outside your clusters.
  • More flexible scraping intervals: Dial up or dial down data ingest to suit your needs.

How to upgrade your Kubernetes integration

If you’re using v2 and ready to upgrade now, you just need to run this command:

helm upgrade --install my-installation newrelic/nri-bundle

You don’t need to update your configuration, dashboards, or alerts. For details, see the migration guide.

Improving our Kubernetes integration behind the scenes

When we were updating the integration, there were a lot of factors to keep in mind. We wanted to keep the new integration compatible with v2 for existing customers, deliver valuable enhancements, and make the upgrade process as easy as possible. We also needed to continuously test the new implementation while the integration’s architecture was changing.

With the upgrade to v3, we decided to fully leverage the Kubernetes deployment model and the capabilities that we hadn’t been using yet. To explain what this means, I’m going to get a bit technical.

To fully unlock Kubernetes capabilities, we made these improvements:

  • Moved scraping tasks to separate components to allow the integration to make smarter decisions at scheduling and deployment time, not at runtime.
  • Switched to a sidecar pattern to move integration and agent processes to different containers, ensuring that the integration follows Kubernetes best practices.
  • Added functionality to scrape external control planes to add support for solutions such as Rancher and managed apiServers.
  • Reduced complexity in the cache system by leveraging Kubernetes informers.
  • Added support for complex data structures in integration configurations such as lists and maps.
  • Removed hardcoded scraping interval for control planes so you can manually configure the interval to fit your applications’ needs.

The next sections dive into the details of each of these improvements.

Decoupled component configuration

In v2, to avoid data duplication when scraping KSM and control plane components from a DaemonSet, the integration performed locality-based leader election: the infrastructure DaemonSet changed its behavior depending on which pods were running on the same node.

When designing the new architecture, we divided the different integration tasks into different components to be scheduled as different workloads. Each component has its own separate configuration, resources, annotations, logs, and so on. Moreover, the behavior of the components no longer changes depending on which pods are running on the same node. This allows the integration to make smarter decisions at scheduling and deployment time, rather than at runtime, as described in the next section.

V3 includes a DaemonSet that scrapes the kubelet process running on each node, plus separate deployments that scrape kube-state-metrics and the control plane components’ /metrics endpoints. Note that, depending on your configuration, the control plane component can also be deployed as a DaemonSet.
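To give a feel for this, here’s a minimal values.yaml sketch showing how the three scrapers can be toggled and tuned independently. The key names follow the layout of the nri-kubernetes v3 chart but are illustrative; check the chart’s documentation for the exact schema in your version.

  # Illustrative sketch only: each scraper is its own workload with its own settings.
  # Key names approximate the nri-kubernetes v3 chart; verify against the chart docs.
  kubelet:
    enabled: true        # DaemonSet that scrapes the kubelet on every node
  ksm:
    enabled: true        # Deployment that scrapes kube-state-metrics
    resources:
      limits:
        memory: 300Mi    # resources can now be tuned per component
  controlPlane:
    enabled: true        # Deployment (or DaemonSet, depending on config) for control plane /metrics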

In version 3, increased complexity in the architecture allowed us to reduce the codebase size and improve the user experience.

Ultimately, optimizing the integration’s architecture allowed us to reduce the codebase size, which improves the Kubernetes integration experience in New Relic.

Using a sidecar pattern to remove unnecessary processes

In v2, there were multiple processes per container: the New Relic infrastructure agent triggered the execution of the nri-kubernetes integration as a separate process. This approach was a problem because only failures in a container’s main process surface and trigger a pod restart.

Since the integration was not the container’s main process, a configuration or execution error didn’t surface and there was no automatic pod restart, making it difficult to detect issues.

The new implementation in v3 follows a sidecar pattern: the nri-kubernetes scraper runs in one container, and the infrastructure agent runs in a separate sidecar container. Having a single process per container gives Kubernetes control over the lifecycle of the integration process, so when an error causes the integration to fail, the failure surfaces and the pod is restarted.
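Conceptually, the resulting pod looks something like the sketch below (simplified, with hypothetical names and image tags); in practice, the Helm chart renders the real manifests for you.

  # Simplified sketch of the sidecar layout; names and image tags are illustrative.
  apiVersion: v1
  kind: Pod
  metadata:
    name: nri-kubernetes-example    # hypothetical name
  spec:
    containers:
      - name: kubelet-scraper       # the nri-kubernetes integration is this container's main process
        image: newrelic/nri-kubernetes:3         # illustrative image tag
      - name: agent                 # the New Relic infrastructure agent runs as a sidecar
        image: newrelic/infrastructure:latest    # illustrative image tag

Because each process is the main process of its own container, a crash in the scraper is visible to Kubernetes and handled by the pod’s restart policy.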

Added functionality to scrape external control planes

With v2, scraping external control planes wasn’t possible, even if you specified a URL to scrape.

With our updated Kubernetes integration v3, you can specify a static endpoint for an external control plane, and if you need to avoid data duplication, deploy the control plane component as a Deployment rather than as a DaemonSet.

Scraping the AWS API server control plane in version 3.

You can also specify different authentication methods for each component, including bearer tokens and Mutual Transport Layer Security (MTLS).

And thanks to autodiscovery and other defaults we’ve added, the default configuration supports several Kubernetes flavors out of the box, including kubeadm, kind, and minikube. For flavors the defaults don’t cover, you can configure the integration to use any selector, URL, or authentication method.
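As a rough sketch, pointing the control plane scraper at an external endpoint could look like the following. The keys mirror the shape of the v3 chart’s values but are illustrative, and the endpoint and secret names are hypothetical.

  # Illustrative sketch: scraping an external etcd endpoint with mTLS.
  # Key names approximate the nri-kubernetes v3 values schema; verify before use.
  controlPlane:
    enabled: true
    kind: Deployment                # use a Deployment to avoid duplicated data from a static endpoint
    config:
      etcd:
        staticEndpoint:
          url: https://my-external-etcd:2379   # hypothetical external control plane endpoint
          insecureSkipVerify: false
          auth:
            type: mTLS
            mtls:
              secretName: etcd-client-certs    # hypothetical secret holding the client cert and key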

From environment variables to YAML configuration

In v2, the only way to configure the integration was through environment variables. This made complex scenarios tedious to support because you were limited to flat key=value settings. Since the configuration was a flat list, we had to add prefixes to indicate which group a variable belonged to, which created confusion whenever we needed to add a new parameter to a nested configuration group and led to variables such as DiscoveryCacheTTL, APIServerCacheTTL, APIServerCacheK8SVersionTTL, and APIServerCacheK8SVersionTTLJitter.

The new integration leverages a YAML config file that supports more complex data structures, such as lists and maps, and lets you configure most of the integration’s parameters. At first glance, the new values.yaml for the Helm chart looks more complex than the previous configuration, but it follows conventions commonly used in Helm charts, and the added structure is what makes the more granular, per-component settings possible.
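For example, settings that previously had to be encoded as prefixed environment variables can now be expressed as nested maps. This is an illustrative sketch of the idea, not the chart’s exact schema:

  # Illustrative sketch: nested maps replace prefixed, flat environment variables.
  # Key names are examples only; see the chart's values.yaml for the real options.
  discovery:
    cache:
      ttl: 1h              # previously a flat variable such as DiscoveryCacheTTL
  apiServer:
    cache:
      ttl: 5m              # previously APIServerCacheTTL
      k8sVersion:
        ttl: 3h            # previously APIServerCacheK8SVersionTTL
        ttlJitter: 50      # previously APIServerCacheK8SVersionTTLJitter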

Finally, if you use Helm as the installation method, the instances are automatically restarted on configuration changes so that the new config file is reloaded.

Removed hardcoded scraping interval config

In v2, the scraping interval was hardcoded to 15s, and you might have wished that it was configurable.

Now it is! We’ve exposed this parameter so you can set the interval anywhere between 10 and 40 seconds; by default, it’s 15s. Selecting lowDataMode automatically changes the interval to 30s, lowering the amount of data fetched.
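With Helm, that could look something like the snippet below; the key names are illustrative, so check the chart documentation for the exact path in your version.

  # Illustrative sketch: adjusting the scraping interval through Helm values.
  common:
    config:
      interval: 25s    # any value between 10s and 40s; the default is 15s
  # Alternatively, enabling low data mode switches the interval to 30s:
  # lowDataMode: true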

How we continuously tested the new architecture

For v3, we added a set of integration tests that leverage our improved discovery and modular approach to feed static, pre-generated data to each component. 

We improved end-to-end tests with a new approach decoupled from the implementation itself. The tests install a chart in a Kubernetes environment and check to see if all entities are created and all metrics contained in the Kubernetes spec files are reported. These tests are not reliant on a specific architecture, and they test the whole pipeline, so we can run them with any version or implementation of the integration. We were able to run these tests throughout our update from v2 to v3, which led us to discover (and fix) some existing bugs.

E2E tests don't rely on a specific architecture. They test the whole pipeline as a black box.

How we reduced breaking changes

To make it easier for you to upgrade to v3, we added a compatibility layer to the Helm chart that maps old values to the new ones. This way, you get all the benefits of the latest version of nri-kubernetes without the headaches of breaking changes or configuration rewrites.
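As a simplified illustration of the idea (not the chart’s actual code or key names), a compatibility layer lets a legacy flat value keep working by translating it into its new nested equivalent:

  # Illustrative only: a v2-style flat value and the v3 shape it would map to.
  # The real mappings are defined by the chart's compatibility layer.
  kubeStateMetricsUrl: http://my-ksm:8080/metrics   # hypothetical legacy value
  # ...is translated by the chart into something like:
  # ksm:
  #   config:
  #     staticUrl: http://my-ksm:8080/metrics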

A compatibility layer is in place to support version 2 values.