Are you interested in learning best practices for application performance monitoring (APM) and site reliability engineering (SRE)? In this post, you’ll learn about Google’s four golden signals, why monitoring them is essential, and how you can automatically generate dashboards in New Relic for these important metrics.
What are the golden signals?
There are four golden signals when it comes to application performance monitoring:
- Latency measures how long it takes to complete requests.
- Traffic tells you the number of requests you’re handling.
- Errors are the total number of requests that fail.
- Saturation represents the total load your application is putting on the system and is generally represented as a percentage.
Golden signals are the gold standard when it comes to monitoring a web application’s metrics. Regardless of whether you have an established APM tool or are just getting started with monitoring, monitoring the golden signals allows you to quickly see an overview of the health of your application. While there are many other performance metrics worth monitoring, the golden signals cover the essentials.
Latency is the time between when a request is made and when it is completed. The lower the latency, the better. The longer it takes for a user to load a page or make another request, the more likely that user is to abandon an application for a competitor. Measuring the average latency of requests can give you a bird’s eye view of a web application’s performance, but it can also be misleading if you don’t drill down further. The average request may be completed very quickly, but it’s more important to focus on a web application’s slowest requests.
If 95 percent of the requests have minimal latency but 5 percent of the requests are painfully slow for users, just looking at the average will mask potential problems with an application—especially if the slow pages are important, high-traffic pages like landing pages or pages for user signups. That’s why you should be looking not just at the average but also at pages in a specific percentile for latency. The 95th percentile is a good starting point, but the exact number will vary depending on your application’s needs. Later in this post, you’ll learn how to use New Relic to measure golden signals by percentile.
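To make the gap between the average and a percentile concrete, here is a minimal Python sketch. The sample latencies and the `percentile` helper (a simple nearest-rank method) are illustrative assumptions, not how New Relic computes percentiles.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 90 fast requests (0.1 s) and 10 slow ones (4 s).
latencies = [0.1] * 90 + [4.0] * 10

mean_latency = statistics.mean(latencies)   # dominated by the fast requests
median_latency = percentile(latencies, 50)  # the typical request
p95_latency = percentile(latencies, 95)     # what the slowest users experience

print(f"mean={mean_latency:.2f}s median={median_latency}s p95={p95_latency}s")
```

With this sample, the mean stays under half a second while the 95th percentile sits at four seconds, which is the kind of problem an average alone would hide.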
While we typically think of traffic as the number of site users, in this case, traffic measures the demand users as a whole put on an application. For web applications, you can measure the total number of HTTP requests per second. Additionally, you can look at traffic by page or resource, giving you insight into which of your pages are most successful or need work. Measuring traffic is a key part of understanding and fine-tuning the user experience.
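As a rough illustration of both views of traffic, the following sketch computes an overall request rate and a per-page breakdown. The request log, its format, and the window length are hypothetical; in practice these numbers would come from your monitoring tool rather than hand-built lists.

```python
from collections import Counter

# Hypothetical request log: (seconds since the window started, path requested).
request_log = [
    (0.2, "/"),
    (0.9, "/signup"),
    (1.4, "/"),
    (2.1, "/pricing"),
    (2.8, "/"),
    (3.5, "/signup"),
]
WINDOW_SECONDS = 4  # assumed length of the observation window


def requests_per_second(log, window_seconds):
    """Overall demand: total requests divided by the window length."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(log) / window_seconds


def requests_by_path(log):
    """Per-page demand: how many requests each path received."""
    return Counter(path for _, path in log)


overall_rate = requests_per_second(request_log, WINDOW_SECONDS)
per_path = requests_by_path(request_log)

print(f"{overall_rate} requests/second")
for path, count in per_path.most_common():
    print(f"{path}: {count}")
```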
On a fundamental level, APM is about finding errors and fixing them before they affect users. But what exactly constitutes an error? Some errors are obvious: for instance, you should be tracking all HTTP requests that return 500 status codes, meaning there was an internal server error. Other errors are harder to catch. Users might make a request that returns a 200 OK status code, but if the request doesn’t return the right content, that should be an error, too. You might also want to set up error policies based on specific service-level objectives.
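One way to capture both kinds of error is to classify each response by status code and by content. The sketch below is only an assumption about how such a check could look: the `Response` type, the `required_text` content check, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str


def is_error(response, required_text):
    """Treat server errors and 'successful' responses with wrong content as errors."""
    if response.status >= 500:
        return True  # obvious failure: the server reported an error
    if response.status == 200 and required_text not in response.body:
        return True  # silent failure: 200 OK but the expected content is missing
    return False


def error_rate(responses, required_text):
    """Percentage of responses that count as errors."""
    if not responses:
        return 0.0
    failures = sum(is_error(r, required_text) for r in responses)
    return 100 * failures / len(responses)


# Hypothetical responses from a page that should always contain "Welcome".
sample = [
    Response(200, "Welcome"),
    Response(500, ""),
    Response(200, "Something went wrong"),
    Response(200, "Welcome back"),
]
rate = error_rate(sample, "Welcome")
print(f"error rate: {rate}%")
```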
Saturation measures the percentage of the system that you are using. If a web application is approaching 100 percent saturation, performance degradation is likely and your users will be negatively impacted. On the other hand, if saturation is consistently at 50 percent, you might be over-provisioning and paying too much for services you aren’t using. By measuring the saturation of a web application, you get insights on how to optimize the services you’re using.
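The reasoning above can be sketched as a small check on saturation samples. The 90 percent and 50 percent cutoffs are illustrative assumptions drawn from the examples in this paragraph, not recommended values, and the function names are hypothetical.

```python
def saturation_percent(used, capacity):
    """Share of a resource in use, expressed as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100 * used / capacity


def assess_saturation(samples, high=90.0, low=50.0):
    """Classify a series of saturation percentages.

    high and low are assumed cutoffs: any sample at or above `high`
    suggests a risk of degradation, and a series that never rises above
    `low` suggests the system may be over-provisioned.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    peak = max(samples)
    if peak >= high:
        return "at risk"
    if peak <= low:
        return "possibly over-provisioned"
    return "ok"


# Hypothetical example: 6 GB of memory in use on an 8 GB host.
memory_saturation = saturation_percent(6, 8)
print(memory_saturation, assess_saturation([memory_saturation]))
```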
Monitoring golden signals with New Relic
If you don’t already have an APM solution or are interested in trying New Relic, you can sign up for a free account. After you have signed up, you can instrument your applications with guided install. Instrumentation is the process of setting up monitoring for an application. When you run the guided install, New Relic’s infrastructure agent checks your environment and recommends which applications you should instrument.
Using a pre-built golden signals dashboard
You can install the Golden Signals for Web Servers quickstart to immediately begin monitoring golden signals in your application. Even better, this quickstart has prebuilt alerts on each of the golden signals so you can notify your teams when your application’s performance is suboptimal. Go to Golden Signals for Web Servers and select the + Install quickstart button, then follow the onscreen instructions to complete the installation.
You can also access this quickstart and other pre-built dashboards by logging in to New Relic, selecting Dashboards in the upper right corner, and then selecting + Create a dashboard. You have the option to either Browse pre-built dashboards or Create a new dashboard.
You can browse for other pre-built dashboards, including quickstarts, that will help you expand your application monitoring capabilities immediately.
After you are finished installing Golden Signals for Web Servers, you will have a golden signals dashboard.
The dashboard includes the following metrics:
- Response Time corresponds to latency. It’s the total time it takes for a response to complete.
- Throughput measures user traffic. For New Relic APM, throughput is measured as requests per minute.
- Error % measures the error rate in your application.
- Average CPU Usage and Memory Usage cover the utilization of resources in your application, monitoring the degree of saturation your system has.
Using pre-built alerts for golden signals
Each of the metrics in Golden Signals for Web Servers comes with a prebuilt alert, which you can see by going to Golden Signals for Web Servers and selecting the Alerts tab. They are also summarized here. Each alert fires when its condition is met over a five-minute period.
- CPU Usage: Alerts when CPU usage goes above 90%.
- Errors: Alerts when 10% of transactions end with an error.
- Memory Usage: Alerts when the memory usage exceeds 90%.
- Response Time: Alerts when average transaction duration is above five seconds.
- Throughput: Alerts when there are fewer than five transactions over a five-minute period.
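To see how the five conditions fit together, here is a hedged sketch that applies the thresholds listed above to a summary of one five-minute window. It is not New Relic's implementation: the `WindowSummary` type, its field names, and the choice to read "10% of transactions" as "10% or more" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowSummary:
    """Hypothetical summary of one five-minute window for an application."""
    cpu_percent: float
    memory_percent: float
    error_percent: float
    avg_duration_seconds: float
    transaction_count: int


def triggered_alerts(window):
    """Return the names of the alerts whose conditions hold for this window."""
    alerts = []
    if window.cpu_percent > 90:
        alerts.append("CPU Usage")
    if window.error_percent >= 10:  # assumed reading of "10% of transactions"
        alerts.append("Errors")
    if window.memory_percent > 90:
        alerts.append("Memory Usage")
    if window.avg_duration_seconds > 5:
        alerts.append("Response Time")
    if window.transaction_count < 5:
        alerts.append("Throughput")
    return alerts


healthy = WindowSummary(35.0, 60.0, 0.5, 0.8, 1200)
busy = WindowSummary(95.0, 50.0, 2.0, 1.2, 300)
failing = WindowSummary(40.0, 95.0, 12.0, 6.0, 3)

for name, window in [("healthy", healthy), ("busy", busy), ("failing", failing)]:
    print(name, triggered_alerts(window))
```

Note that the throughput alert works in the opposite direction from the others: it fires when traffic is unexpectedly low, which can indicate that requests are no longer reaching the application.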
You can set up notification channels to send alerts to your team via email, Slack, PagerDuty, or other channels that work best for your workflow.
Viewing golden signals in New Relic Lookout
You can also get a high-level view of the golden signals for all of your applications in New Relic Lookout by selecting More > Lookout in the upper navbar. Then select the Saved Views button in the upper right corner of the screen to pull up the Change view pane. You can select Your system > Application golden signals or you can select a range of golden signal views from Other views such as views for browser or mobile application golden signals. The nice thing about New Relic Lookout is that it shows you all of your application’s services so you don’t have to switch between views. You can learn more about New Relic Lookout here.
Customizing golden signals dashboards
Finally, you can also create workloads in New Relic, which allow you to group and monitor entities for teams or sets of responsibilities. Whenever you create a workload, dashboards for golden signals are automatically generated.
If you are interested in diving deeper into customizing those dashboards, New Relic offers a video that shows you how to configure golden signals charts in workloads.
If you are using an APM tool to monitor your application (and you should be), it’s important to monitor the four golden signals: latency, traffic, errors, and saturation. Together, these signals give you important insights into the health of your application and help you be more proactive about detecting and resolving issues before they affect your users.