New Relic runs one of the largest Kafka implementations in the world and operates hundreds of services that produce to and/or consume from Kafka. Early on, the Streaming SRE Team invested in the foundations of a reliable Kafka environment for serving New Relic's customers. Kafka's reliability remains essential: Kafka lag can directly translate into delays or drops in telemetry ingest, which can lead to customer alerts being missed or delayed.
The team's charter is dedicated to Kafka operations, and a major focus has been the creation of a custom Nerdpack for Kafka observability. This highly customized New Relic Nerdpack, rich in custom metrics, became an indispensable tool, shared internally with over 50 teams that rely on Kafka services. The value derived from these operational insights was so profound that it directly spurred the development of customer-facing Kafka observability functionality.
The primary motivation for building this extensive Kafka observability was to overcome the blind spots experienced during incidents. Without granular data, diagnosing root causes and quickly identifying recurring problems was a significant challenge. The objective was to "layer on lots and lots of observability" to understand Kafka behavior comprehensively—before, during, and after incidents.
The custom Kafka Nerdpack provides deep insights into a wide array of metrics. The Streaming SRE Team uses these insights to:
- Alert on Kafka Lag to Maintain Ingest Integrity: The team's most critical use of New Relic is assuring telemetry ingest integrity by alerting on Kafka metrics such as consumer lag. Kafka lag directly correlates with delays or drops in telemetry ingest, which can lead to critical customer alerts being missed or delayed. This poses a significant business risk, as customers rely on timely alerts for their own operational awareness. Comprehensive Kafka lag alerts also inform ingest scaling and performance optimization.
- Maximize Responsiveness: New Relic enables the team to be highly responsive to Kafka processing issues, facilitating rapid remediation and minimizing customer impact.
- Understand Kafka Client Behavior: Identify misconfigurations, overloaded buffers, and stalled clients.
- Monitor Server-Side Health: Track broker performance and resource utilization.
- Observe Request Patterns: Analyze changes in client request patterns to anticipate and mitigate potential issues.
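The lag alerting described above can be sketched in a few lines. In production the team alerts on metrics reported to New Relic, but the underlying check amounts to comparing committed consumer offsets against log-end offsets per partition. Everything below (function names, the threshold value, the sample offsets) is illustrative, not New Relic's actual implementation:

```python
# Illustrative sketch: compute per-partition consumer lag and flag
# partitions that breach an alert threshold. In a real deployment the
# offsets would come from the Kafka consumer/admin API; here they are
# passed in as plain dicts. All names and values are hypothetical.

def compute_lag(end_offsets, committed_offsets):
    """Lag per partition = log-end offset minus committed consumer offset."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

def partitions_over_threshold(lag, threshold):
    """Return the partitions whose lag breaches the alert threshold."""
    return sorted(p for p, l in lag.items() if l >= threshold)

# Example: partition 2 has fallen 50,000 messages behind.
end = {0: 1_000, 1: 2_000, 2: 150_000}
committed = {0: 990, 1: 2_000, 2: 100_000}
lag = compute_lag(end, committed)
print(partitions_over_threshold(lag, threshold=10_000))  # → [2]
```

A real alert would fire on this condition sustained over a time window, rather than on a single sample, to avoid paging on transient spikes.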
The implementation of Kafka observability has had a profound impact on New Relic's operational efficiency and reliability:
- Dramatic Reduction in Troubleshooting Time: With comprehensive observability data at their fingertips, the Streaming SRE Team can diagnose Kafka client incidents in minutes, often keeping total impact time to minutes rather than hours. This contrasts sharply with the hour or more such investigations could take without detailed insights.
- Nerdpacks as Dynamic Runbooks: A key innovation championed by this SRE team is the use of New Relic Nerdpacks as dynamic runbooks. These custom applications integrate textual instructions with live query results and visualizations. For instance, Kafka pipeline stalling issues can be diagnosed with views in the Nerdlet, which also automatically generates the command needed to extend data retention, transforming a multi-step manual process into a single copy-paste action. This significantly reduces context switching and accelerates resolution.
- Executive-Level Insight: Directors and executives at New Relic utilize the Kafka observability Nerdlet to quickly assess the overall status of lag across entire Kafka clusters or environments, providing a high-level view of ingestion performance and scalability.
- Intelligent Auto-scaling for Optimized Performance and Cost: The Streaming SRE Team has developed sophisticated auto-scaling tools that use both New Relic telemetry and custom metrics. For instance, they use New Relic CPU metrics to dynamically scale Kubernetes resources up or down based on traffic demand. This allows the team to effectively manage surges in ingested traffic by scaling up to burn down lag and then downscaling during low-traffic periods. This dynamic auto-scaling prevents over-provisioning of resources, ensuring cost efficiency while maintaining the capacity to handle fluctuating workloads.
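The auto-scaling decision described above can be sketched as a simple control function: choose a replica count from observed CPU utilization (as would be queried from New Relic) and clamp it to a safe range. The thresholds, bounds, and function name below are invented for illustration; the team's actual tooling drives Kubernetes from New Relic telemetry and custom metrics:

```python
# Illustrative sketch of the scale-up / scale-down decision: size consumer
# replicas from observed CPU utilization, clamped to a min/max range.
# Target utilization and bounds are hypothetical placeholders.

def desired_replicas(current, cpu_utilization, target=0.6, lo=2, hi=50):
    """Proportional scaling toward a target CPU utilization."""
    if cpu_utilization <= 0:
        return lo
    scaled = round(current * cpu_utilization / target)
    return max(lo, min(hi, scaled))

# A surge at 90% CPU on 10 replicas scales up to burn down lag;
# a quiet period at 20% CPU scales back down toward the floor.
print(desired_replicas(10, 0.9))  # → 15
print(desired_replicas(10, 0.2))  # → 3
```

This proportional form mirrors the Kubernetes Horizontal Pod Autoscaler's basic algorithm; the clamping keeps the system within provisioned capacity during surges and avoids over-provisioning when traffic subsides.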
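The copy-paste runbook action described earlier (auto-generating the command to extend data retention during a pipeline stall) can be approximated as a small command generator. The `kafka-configs.sh` flags are standard Apache Kafka tooling; the topic name, bootstrap server, and retention period are hypothetical placeholders, and this is a sketch rather than New Relic's actual Nerdlet code:

```python
# Illustrative sketch of a dynamic-runbook helper: given the topic a
# stalled pipeline reads from, emit the ready-to-paste Kafka CLI command
# that extends its retention. Bootstrap server and retention period are
# hypothetical placeholders.

def retention_command(topic, retention_hours, bootstrap="kafka-broker:9092"):
    """Build the kafka-configs.sh invocation that sets retention.ms."""
    retention_ms = retention_hours * 60 * 60 * 1000
    return (
        f"kafka-configs.sh --bootstrap-server {bootstrap} "
        f"--entity-type topics --entity-name {topic} "
        f"--alter --add-config retention.ms={retention_ms}"
    )

print(retention_command("ingest.metrics", retention_hours=48))
```

Embedding the generated command next to the live lag charts is what turns the Nerdlet into a runbook: the operator diagnoses and remediates from a single view instead of switching between dashboards, docs, and a terminal.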