One of the key tenets of running a reliable system is to measure what matters. This means measuring both what is important to the business and what is important for troubleshooting problems. New Relic measures what matters by using its own platform, including our agents, our service level objective (SLO) product, and our alerting product.
APM Agents at New Relic
Engineering teams run the application performance monitoring (APM) agent for their services, and all hardware is typically monitored with the New Relic infrastructure agent. This creates a consistent set of metrics, so any engineer can move between teams or services and still understand the core health metrics.
New Relic’s APM agents are carefully tuned to surface the key metrics needed to notice and diagnose service problems: HTTP response time, database call times, and external HTTP call times.
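For illustration, these defaults can be charted directly with NRQL; the sketch below queries standard APM Transaction event attributes (duration, databaseDuration, and externalDuration) to compare the three call times by application:

SELECT average(duration), average(databaseDuration), average(externalDuration) FROM Transaction FACET appName SINCE 30 minutes ago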
Additionally, while dedicated agent teams frequently add new instrumentation to our agents, the engineering teams that use the agents to monitor their own infrastructure, databases, and streaming services have also contributed instrumentation. A great example is the current Kafka instrumentation in the Java agent, originally created as a Java agent extension by a New Relic team that operates many core Kafka streaming services. After being adopted by many internal teams, the instrumentation was eventually incorporated into the APM product. The Kafka UI, which displays these metrics, was also jointly created by an APM product team and our key Kafka and Streaming Services teams.
Effective Instrumentation
Using Dashboards and Nerdpacks to Eliminate Context Switching
Teams at New Relic are trained to think about observability data during development. In addition to the default instrumentation provided by our agents, engineering teams can create additional custom instrumentation. Some teams send custom instrumentation to New Relic using APM agents; others have built libraries that send telemetry directly to our public metric, event, log, and trace endpoints. Examples of custom instrumentation include an event for every query to the New Relic database (NRDB) and an event for every initial APM agent connection to New Relic. Teams then display these custom events in dashboards or custom Nerdpacks (custom applications) that integrate textual instructions with live query results and visualizations.
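As a sketch of how one of these custom events might be charted on a dashboard, the NRQL below assumes a hypothetical NrdbQuery event with queryDurationMs and queryType attributes (the real internal event and attribute names may differ):

SELECT percentile(queryDurationMs, 95) FROM NrdbQuery FACET queryType SINCE 1 hour ago TIMESERIES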
For instance, Kafka pipeline stalling issues can be diagnosed with views in a custom Nerdlet, which also automatically generates the command needed to extend data retention, transforming a multi-step manual process into a single copy-paste action. This significantly reduces context switching and accelerates resolution.
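A view like that might be driven by a consumer lag query along these lines; this sketch assumes the KafkaOffsetSample event and consumer.lag attribute reported by New Relic’s Kafka on-host integration, and internal pipelines may use different event names:

SELECT max(`consumer.lag`) FROM KafkaOffsetSample FACET consumerGroup, topic SINCE 30 minutes ago TIMESERIES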
Service Level Objectives
Achieving Business Objectives with SLOs
SLOs are important because they define and measure the customer experience over a set time window, and they are critical for balancing reliability work against new functionality. At New Relic, we require all teams to maintain an internal set of SLOs. Before enforcing SLOs across the company, we found that our telemetry data was extremely rich but tuned for troubleshooting rather than for measuring customer experience. So we created an SLO bar-raiser program that helped teams create customer-focused SLOs with values that reflected the reality at the time.
Using these measurements, we were able to share with the business the cost to run at the current SLO and the work required to increase the SLO. Teams who have benefited the most from SLOs watch their SLOs daily and take corrective action when necessary. These teams have not only improved customer experience, but also significantly reduced their pager load.
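As a concrete sketch of what a customer-focused SLI might look like in NRQL, the query below computes the percentage of non-error transactions over a one-week window; the appName value is a placeholder, and actual SLO definitions vary by team:

SELECT percentage(count(*), WHERE error IS false) AS 'Success rate' FROM Transaction WHERE appName = 'My Service' SINCE 1 week ago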
Alerting and Terraform
How to Automate Proactive Insight
At New Relic, our teams use New Relic Alerting to get notified of system anomalies, which allows engineers to take action quickly to mitigate any potential degradation. Most teams use Terraform to create and maintain their alerts in version control, and most also use facet alerts so that alerts are created automatically for any new cell or environment. An example facet alert on hostname is shown below:
SELECT latest(etcdServerProposalsFailedRate) FROM K8sEtcdSample WHERE clusterName = 'my-cluster' FACET hostname
To ensure teams have the right alerts, New Relic provides a set of recommended alerts in our Engineering Standards. These include alerts for Out of Memory (OOM) kills, pods waiting, Kafka lag, error rates, and more. Leaders then build dashboards of their teams’ alerts (since each alert is recorded in NRDB) and review trends weekly.
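For example, an OOM-kill alert might be backed by a NRQL condition along the lines of the sketch below, which assumes the K8sContainerSample event and reason attribute reported by the Kubernetes integration (exact attribute names can vary by integration version):

SELECT count(*) FROM K8sContainerSample WHERE reason = 'OOMKilled' FACET clusterName, podName SINCE 10 minutes ago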