While DevOps may seem like a simple integration of two functional roles, measuring the results of a newly established DevOps initiative can be challenging. Essentially, you’re forming new teams with new responsibilities, which can increase risk and frustration if things aren’t well managed.
This is why it’s vital to have end-to-end visibility into your applications, infrastructure, reliability, and team health in your DevOps environment. Sharing key performance metrics with all stakeholders in your digital business lets everyone monitor your DevOps efforts and prove success at every stage. Leaders benefit from knowing that everyone is aligned and moving toward the same goals. And shared insights help teammates collaborate more easily and quickly.
DevOps teams work tirelessly to catch problems quickly, ideally before they manifest and affect customers. They do so by tracking and monitoring a number of key application performance and infrastructure metrics. These metrics should be relevant to both software developers and operations engineers in DevOps organizations.
Apdex
Apdex is a measurement of your customers' satisfaction with the response time of web applications and services. Specifically, it measures the ratio of satisfactory response times to unsatisfactory—or worse—response times. The response time begins when a customer makes a request and ends when the request is complete.
To measure Apdex for an application, you first define a response-time threshold, T, based on performance baselines for your application.
Apdex tracks three response counts at three different levels:
Satisfied: The response time is less than or equal to T.
Tolerating: The response time is greater than T and less than or equal to 4T.
Frustrated: The response time is greater than 4T.
For example, if T is 1.2 seconds and a response completes in 0.5 seconds, the user is considered satisfied. Responses between 1.2 and 4.8 seconds fall into the tolerating range, and responses greater than 4.8 seconds frustrate the user.
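These counts roll up into a single score using the standard Apdex formula: (satisfied + tolerating/2) / total. A minimal sketch with hypothetical response times:

```python
def apdex_score(response_times, t):
    """Compute an Apdex score from raw response times (in seconds)
    and a target threshold t, using the standard Apdex formula:
    (satisfied + tolerating / 2) / total."""
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    total = len(response_times)
    return (satisfied + tolerating / 2) / total

# With T = 1.2s: two satisfied, one tolerating, one frustrated
print(apdex_score([0.5, 1.0, 3.0, 6.0], 1.2))  # → 0.625
```

A score of 1.0 means every user was satisfied; scores approaching 0 mean most users were frustrated.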
Average response time
Response time is the amount of time it takes an application to process a transaction. This metric is a good indicator of how your customers are experiencing your website. It’s important to test this metric in multiple locations and with the most important types of interactions your users have with your website or app (for example, a login request will have different response times than a file download, and you want to make sure both are within acceptable ranges for that interaction).
In New Relic, the default overview page in APM shows the average response time for all your applications as part of the web transactions time chart. Additionally, you could write an NRQL query to create a New Relic Insights widget to track average response time for individual applications.
If you want to know your application's average response time over a specified time period, you could use the New Relic API Explorer (v2).
CPU percentage usage
CPU usage is a critical measurement for tracking the availability of your application. In an on-premise environment, as CPU usage rises, your app is likely to experience some degradation, which could lead to customer-experience issues. However, for general cloud usage, if you’re not maximizing CPU usage, you’re likely not taking advantage of resources you’re paying for. In New Relic APM, CPU percentage usage measures aggregate CPU usage of all instances of your app or service on a given server. In New Relic Infrastructure, it’s a measurement of CPU percentage usage by host or process. The CPU percentage usage metric is gathered by default and displayed in Infrastructure in a host performance chart.
You can also use the New Relic REST API (v2) to get the average CPU usage for your application on a single host.
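As a quick illustration of the aggregate measurement described above, here is a minimal sketch that averages hypothetical per-instance CPU samples into a single host-level figure:

```python
def aggregate_cpu_percent(per_instance_cpu):
    """Aggregate CPU percentage usage across all instances of an
    app or service on a host (simple average of per-instance samples).
    The sample values here are hypothetical."""
    return sum(per_instance_cpu) / len(per_instance_cpu)

# Three instances of the same service on one host
print(aggregate_cpu_percent([40.0, 55.0, 25.0]))  # → 40.0
```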
Error rate
In general, consider any unhandled exception to be an error. An error-rate metric tracks the percentage of transactions that result in an error during a particular time window. For example, if during a specific period of time your application handles 1,000 transactions, and 50 of them have unhandled exceptions, you have an error rate of 50/1000, or 5%.
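The example above reduces to a one-line calculation:

```python
def error_rate(error_count, total_transactions):
    """Percentage of transactions that ended in an unhandled
    exception during an observation window."""
    if total_transactions == 0:
        return 0.0
    return 100.0 * error_count / total_transactions

# 50 unhandled exceptions out of 1,000 transactions
print(error_rate(50, 1000))  # → 5.0
```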
Load average
Use this metric to measure the average number of system processes, threads, or tasks that are waiting and ready for the CPU. Monitoring the load average can help you understand if your system is overloaded, or if you have processes that are consuming too many resources. With New Relic Infrastructure, you can track load average in 1-, 5-, or 15-minute intervals. The data appears on the Infrastructure hosts page.
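A common rule of thumb is to interpret the load average relative to the host's CPU count; a sketch of that check (the "busy"/"overloaded" thresholds are illustrative, not a standard):

```python
import os

def load_status(load_avg, cpu_count=None):
    """Rough rule of thumb: a load average at or below the number of
    CPUs means the run queue is keeping up; sustained values above
    that suggest processes are waiting for CPU time. The 2x cutoff
    for "overloaded" is an arbitrary illustrative threshold."""
    cpus = cpu_count or os.cpu_count() or 1
    if load_avg <= cpus:
        return "ok"
    if load_avg <= 2 * cpus:
        return "busy"
    return "overloaded"

print(load_status(1.5, cpu_count=4))   # → ok
print(load_status(10.0, cpu_count=4))  # → overloaded
```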
Memory percentage usage
Using too much memory on a host can lead to poor application performance, while using too little memory on a consistent basis might mean that you’re under-utilizing expensive resources, especially in the cloud.
Memory percentage usage, then, measures the amount of used memory as a percentage of the total memory available on each host in your infrastructure. The memory percentage usage metric is gathered by default and illustrated in Infrastructure in a host performance chart.
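A minimal sketch of that calculation, with hypothetical byte counts:

```python
def memory_percent_used(used_bytes, total_bytes):
    """Memory percentage usage: used bytes as a share of
    total memory bytes on a host."""
    return 100.0 * used_bytes / total_bytes

# 6 GiB used of 16 GiB total
print(round(memory_percent_used(6 * 2**30, 16 * 2**30), 1))  # → 37.5
```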
Use the New Relic REST API (v2) to obtain the average memory usage for your application on a single host.
Throughput
Throughput is a measurement of user activity for a monitored application. In New Relic APM, throughput tracks requests per minute (RPM) made against your application. Tracking throughput can help you determine, for instance, whether a new feature, enhancement, or architectural change affects how your application handles requests.
Use the New Relic REST API (v2) to obtain the average throughput for your app, including web application and non-web application throughput. These same values appear in the Throughput chart on your app's overview page in APM.
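To make the RPM measurement concrete, here is a sketch that computes throughput from a list of hypothetical request timestamps over an observation window:

```python
def requests_per_minute(request_timestamps, window_start, window_end):
    """Throughput as requests per minute over an observation window.
    Timestamps and window bounds are epoch seconds."""
    in_window = [t for t in request_timestamps
                 if window_start <= t < window_end]
    minutes = (window_end - window_start) / 60.0
    return len(in_window) / minutes

# 300 requests spread over a 2-minute window → 150 RPM
stamps = [i * 0.4 for i in range(300)]
print(requests_per_minute(stamps, 0, 120))  # → 150.0
```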
DevOps teams can track system reliability, quality, and overall health using a few key metrics. In DevOps organizations, site reliability engineers, operations engineers, software developers, project managers, and engineering leadership will all find value in these measurements.
Defect rate
Defect-rate metrics track the number of issues or bugs reported against your software in production, as well as issues that arise during deployments of your software. Those issues could be infrastructure-, application-, mobile app-, or browser-based. These defects are typically tracked in the form of bug tickets or support tickets.
You can integrate New Relic with a “bug-tracking” system—such as Atlassian JIRA, Lighthouse, or Pivotal Tracker—to quickly create tickets, issues, and stories about performance issues you discover with New Relic.
Mean time to detection (MTTD)
This metric tracks the amount of time between the start of an issue and the detection of the issue, ideally at which point some action is taken. DevOps teams should work to keep their MTTD as short as possible. With proper instrumentation, alerting, and notification channels in place across your teams, you’ll be able to more quickly respond to any error detection.
Note that MTTD does not include the time needed to actually fix the issue.
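The calculation itself is straightforward; a minimal sketch with hypothetical incident timestamps (note the fix time is deliberately excluded):

```python
from datetime import datetime

def mean_time_to_detection(incidents):
    """Average gap, in seconds, between when issues started and when
    they were detected. Each incident is a (started_at, detected_at)
    pair of datetimes."""
    gaps = [(detected - started).total_seconds()
            for started, detected in incidents]
    return sum(gaps) / len(gaps)

incidents = [
    (datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 5)),
    (datetime(2023, 1, 2, 14, 0), datetime(2023, 1, 2, 14, 15)),
]
print(mean_time_to_detection(incidents) / 60)  # → 10.0 (minutes)
```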
Use New Relic Alerts and set your conditions and alert policies so developers, operations personnel, and software owners can remain up to date and know to take immediate action should any issues occur. Alerts are especially valuable when paired with notification solutions like Slack or PagerDuty, which can assist in communicating about detected errors and preventing future issues.
Mean time to recovery (MTTR)
MTTR tracks the average time it takes to repair a failed component in your system, from the moment the failure is detected until the point at which the system is operating normally again. Use this metric to measure and improve communication mechanisms in your recovery process. When you have direct communication channels, fixes can be identified, tested, validated, and deployed more quickly, minimizing system downtime.
For instance, New Relic Infrastructure gathers real-time metrics to help reduce your MTTR by connecting changes in host performance to configuration changes in your infrastructure. Set New Relic alerts against the Infrastructure metrics you gather to learn about any potential issues before they impact your systems.
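As with MTTD, the metric itself is a simple average; a sketch with hypothetical incidents, measured from detection to recovery as defined above:

```python
from datetime import datetime

def mean_time_to_recovery(incidents):
    """Average time, in seconds, from failure detection until the
    system is operating normally again. Each incident is a
    (detected_at, recovered_at) pair of datetimes."""
    durations = [(recovered - detected).total_seconds()
                 for detected, recovered in incidents]
    return sum(durations) / len(durations)

incidents = [
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 9, 30)),
    (datetime(2023, 3, 5, 22, 0), datetime(2023, 3, 5, 23, 30)),
]
print(mean_time_to_recovery(incidents) / 3600)  # → 1.0 (hours)
```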
Service-level agreements (SLAs)
Whether you’re managing a single development team or an entire organization, SLAs are the (sometimes legally binding) contract between you and your users or customers.
Several of the metrics discussed in this guide should be built into your SLAs, including Apdex and average response time. New Relic APM provides SLA reports that track application downtime and trends over time to help you better understand your application performance. You can also get SLA reports for key transactions in APM and for Synthetics monitors.
Service-level objectives (SLOs)
SLOs are goals your teams set about what you—and your customers—can expect from your system in terms of availability, performance, error rates, and anything else you agree upon measuring. Your SLO targets should reflect what your team actually commits to supporting, what your organization actually commits to supporting, and what you actually can support based on technical reality. An example SLO for a team that provides an API service might state that it accepts 99.99% of well-formed payloads.
Of course, your SLOs can change over time. For example, if you have an immature system, you may want to start with relatively modest SLOs and increase them over time.
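One practical way to reason about an availability SLO is as a downtime budget (often called an error budget); a sketch of that conversion:

```python
def downtime_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed downtime in a window for a given
    availability SLO percentage (the 'error budget')."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100.0)

# A 99.99% SLO over 30 days leaves roughly 4.3 minutes of budget
print(round(downtime_budget_minutes(99.99), 2))  # → 4.32
```

Framing the SLO this way makes it easy to see how much room a modest target leaves: at 99.9%, the same 30-day window allows about 43 minutes of downtime.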
New Relic is a great way for your teams to measure basic application and infrastructure health metrics to set SLOs covering CPU percentage usage, availability, error rate, average response time, and more.
Successful DevOps organizations don’t just track technical metrics, they also look at measurements of team health and performance. These measurements are of particular interest to software developers, operations engineers, project managers, and engineering leadership in DevOps organizations.
Number of commits
By tracking the commits a team makes to an artifact in a development lifecycle before that artifact can be deployed to production, this metric can serve as an indicator of team velocity and code quality. Too many commits—or, conversely, not enough commits—could mean the team members are not properly managing a project. For example, a high number of commits could mean that team members don’t have clear direction for solving a problem, so they’re hacking away to find a resolution. Too few commits could mean they’re distracted by other obligations or even toil.
In New Relic, you can use the Insights API to create a custom event for each git commit, track them with an NRQL query, and display the results on a dashboard. Additionally, you can track commits by recording deployments using the New Relic REST API v2.
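If you are collecting commit events yourself, a minimal sketch of the per-day tally (with hypothetical commit dates) looks like this:

```python
from collections import Counter
from datetime import date

def commits_per_day(commit_dates):
    """Count commits per calendar day, to help spot unusually high
    or low commit volume during a development cycle."""
    return Counter(commit_dates)

days = [date(2023, 5, 1), date(2023, 5, 1), date(2023, 5, 2)]
print(commits_per_day(days)[date(2023, 5, 1)])  # → 2
```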
Deployment time and deployment frequency
Rapid iteration and continuous delivery—basically, how long it takes to deploy your software, and how often you deploy it—are often seen as the key proxy measurements of DevOps success. DevOps experts like Gene Kim, co-author of the DevOps Handbook, believe that these metrics highly correlate to positive outcomes for DevOps organizations.
Using the New Relic REST API v2, you can record new deployments, retrieve a list of past deployments, and delete past deployments. Some agents also have agent-specific methods to record deployments automatically. After recording deployments, you can view them in the APM Deployments page and in the Recent Events list on the Overview page. New Relic APM's Deployments page lists recent deployments, their date and time, and their impact on your end users' and app servers' Apdex scores, response times, throughput, and errors. You can view and drill down into the details, use search and sort options, hide or delete a deployment's error information, share it with others, or file a ticket about it.
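Once you have a record of deployments, deployment frequency is a simple calculation; a sketch with hypothetical deployment dates:

```python
from datetime import date

def deployments_per_week(deploy_dates):
    """Deployment frequency: deployments per week across the span
    from the first to the last recorded deployment."""
    span_days = (max(deploy_dates) - min(deploy_dates)).days or 1
    return len(deploy_dates) / (span_days / 7.0)

deploys = [date(2023, 6, 1), date(2023, 6, 5), date(2023, 6, 8),
           date(2023, 6, 12), date(2023, 6, 15)]
print(round(deployments_per_week(deploys), 1))  # → 2.5
```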
Iteration length
Use this metric to track the amount of time between development cycles during the execution of a project. In modern agile workflows, development cycles typically run one to two weeks, with each cycle (or sprint) punctuated by planning and retrospectives. (The team may or may not have a shippable artifact after each sprint.) Tracking iteration length can help you better understand changes in project scope, team velocity and workloads, and your ability to adapt to changes as a project evolves.
Passed/failed unit tests
A unit is the smallest component of your software that you can test. Track the number of unit tests that pass or fail during a development cycle for an indication of whether your teams are writing well-designed code.
For example, if you use PHPUnit to manage and run your unit tests, the New Relic PHP agent can automatically capture the test summary results and send them to Insights as an event where you can query and visualize test data at a glance.
Another approach would be to use a NRQL query to create a dashboard widget to track passed/failed unit tests.
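Whichever collection method you use, the headline number is a pass rate; a minimal sketch with hypothetical test counts:

```python
def unit_test_pass_rate(passed, failed):
    """Share of unit tests that passed in a development cycle,
    as a percentage."""
    total = passed + failed
    return 100.0 * passed / total if total else 0.0

# 190 passing tests, 10 failing tests
print(unit_test_pass_rate(190, 10))  # → 95.0
```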
Project lead time
Also known as mean time to change (MTTC), this metric captures the amount of time that passes between the inception of a project and the actual production deployment of that project’s artifact. This can help measure your team's ability to adapt to change as the business evolves.
One way to track project lead time is to use the Insights API to create a custom event for each git commit, and then create a dashboard widget using an NRQL query.
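Given those commit and deployment events, the lead-time calculation itself is a single subtraction; a sketch with hypothetical dates:

```python
from datetime import datetime

def project_lead_time_days(first_commit, deployed_at):
    """Project lead time (MTTC): elapsed days between a project's
    first commit and its production deployment."""
    return (deployed_at - first_commit).total_seconds() / 86400

print(project_lead_time_days(datetime(2023, 4, 3),
                             datetime(2023, 4, 17)))  # → 14.0
```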