New Relic's Networking Team monitors its global network environment, which includes hundreds of cells with Kubernetes clusters that connect to multiple cloud environments. To achieve comprehensive visibility, SREs and network engineers used New Relic libraries to develop code that is deployed in every cluster to collect key network telemetry.

The primary motivation for building this extensive network observability was to give customers a better understanding of the network and to build trust. The goal is to empower other teams to self-serve, so that the network is no longer treated as the first suspect when troubleshooting.

The custom network dashboards provide deep insights into a wide array of metrics, including:

  • Network Performance: Monitoring bandwidth, packet loss, jitter, latency, and path utilization.
  • Infrastructure Health: Using the infrastructure agent with Amazon and Azure connectors to get information from those platforms and ingest it into New Relic.
  • Connectivity Validation: Using a custom script that pings from one location to another to confirm connectivity.
  • Cost Optimization: Monitoring an egress network address translation (NAT) service that lets traffic exit a cloud provider’s network at a significantly lower price point, and watching for unexpected cost spikes.
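
As an illustration of the connectivity validation item above, here is a minimal Python sketch of a ping-based check. It is not New Relic's actual script, which the source does not describe. The names `PingResult`, `parse_ping_output`, and `check_connectivity` are hypothetical, and the sketch assumes a Linux or macOS `ping` binary that accepts `-c` and prints the usual summary lines. Sending the result to New Relic is left out.

```python
import re
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class PingResult:
    """Summary of one ping run (hypothetical structure for this sketch)."""
    target: str
    packet_loss_pct: float
    avg_latency_ms: Optional[float]

    @property
    def reachable(self) -> bool:
        # Treat the target as reachable if at least one reply came back.
        return self.packet_loss_pct < 100.0


# Matches e.g. "0% packet loss" (Linux) or "0.0% packet loss" (macOS).
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
# Matches the average in "min/avg/max/... = 10.1/12.3/15.0/2.0 ms".
_RTT_RE = re.compile(r"= [\d.]+/([\d.]+)/")


def parse_ping_output(target: str, output: str) -> PingResult:
    """Extract packet loss and average latency from ping's summary text."""
    loss_match = _LOSS_RE.search(output)
    # If no summary line is found, assume the check failed entirely.
    loss = float(loss_match.group(1)) if loss_match else 100.0
    rtt_match = _RTT_RE.search(output)
    avg = float(rtt_match.group(1)) if rtt_match else None
    return PingResult(target, loss, avg)


def check_connectivity(target: str, count: int = 3, timeout_s: int = 10) -> PingResult:
    """Run the system ping against target and summarize the result."""
    try:
        proc = subprocess.run(
            ["ping", "-c", str(count), target],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # No ping binary, or the run hung: report the target as unreachable.
        return PingResult(target, 100.0, None)
    return parse_ping_output(target, proc.stdout)
```

A scheduled job in each location could call `check_connectivity` for every remote address it should be able to reach and report `packet_loss_pct` and `avg_latency_ms` as metrics. Keeping the parsing separate from the subprocess call means the parser can be tested against captured output without any network access.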

The implementation of this network observability has had a profound impact on New Relic's operational efficiency and reliability:

  • Dramatic Reduction in Troubleshooting Time: The implementation has reduced the number of pages the Network Team receives. In one example, network observability identified a routing issue in which traffic was failing over to an undersized backup solution because of a missing static route. This allowed New Relic networking teams to quickly remediate the issue and later implement an active-active setup for cloud providers’ routes to balance traffic and prevent saturation.
  • Proactive Identification of Misconfigurations: By identifying issues like missing static routes, New Relic optimizes resource usage and significantly enhances system reliability, leading to cost efficiencies.
  • Dynamic Runbooks: The goal is to empower other teams to self-serve, so that the network is no longer treated as the first suspect when troubleshooting.
  • Executive-Level Insight: The team uses New Relic to optimize costs by monitoring an egress NAT service. They also watch for unexpected cost spikes and help other teams identify and resolve issues that drive up data traffic charges unnecessarily.