This post is a follow-up to my recent discussion of What Is the Expected Distribution of Website Response Times?
It's an all-too-common scenario: you’ve just deployed a change to your web application (or maybe are observing the deployment in real time) when you realize that performance has taken a hit. Maybe your baseline alerts have fired or Radar has flagged a regression in performance. Or maybe—worst case scenario—you’re hearing it directly from your customers: your page response times have spiked.
You’ve got to find the culprit quickly, but where do you begin?
You might start by looking at transaction traces in New Relic APM. These are detailed waterfall diagrams of individual web transactions continuously sampled in real time, based on a sampling rate and the transaction latency. On both the application overview and transaction overview pages you’ll find a list of recent transaction traces. Some are examples of the worst performing transactions while others represent more typical transaction response times.
If you’re concerned about slow transactions, it makes sense to look at the traces for the slowest transactions, right?
Maybe not. I’m going to show you why you want to start with your typical—not your slowest—transactions to get to the root of the problem.
Step away from that slow transaction trace
Here’s what’s likely to happen—you look at one trace and immediately see a problem: a single query or external call that took 10 seconds when it normally takes less than 50 milliseconds.
There’s your problem, right? A call to an external service took 42 seconds! But why is it so slow? Nothing in the code has changed. The service call doesn’t appear unusual and there’s no indication the service was having problems during this period. Most important, the call normally takes less than 25 milliseconds, according to the transaction breakdown.
Many of the other transaction traces show the same problem—a stalled call to the same external service. Surely that’s the issue then?
To find out, you need to focus on investigating why that external service is so slow. But if you look at the service in question, you quickly discover there’s nothing wrong. There is no change in the performance profile or the error rate. Throughput and response time are consistent with normal operations. There are a few outliers, of course, but relative to the number of properly handled requests, these are just noise. There’s nothing more to do here.
The truth is, you haven’t yet found the cause of the slowdown. In fact, you’re looking in the wrong place.
Slow transaction traces give you visibility into the worst performing transactions—the outliers. If you want to understand what’s causing a degradation in your typical transaction response times, then studying your outliers won’t help. They are not representative of typical performance. In my previous post—What Is the Expected Distribution of Website Response Times?—I showed how the expected distribution of response times closely follows an Erlang distribution. In this situation, outliers are an inevitable consequence of transactional applications essentially being large queuing networks. No matter how fast your application is responding, there will always be some occurrences of pathologically slow transactions.
You can see this in a typical application histogram. This one shows the 97% fastest transactions for a web application:
Slow transaction traces typically fall in the range of the slowest 0.1% of requests; they don’t even show up in the histogram above. A histogram that includes 99.8% of requests will look like the one below. The transaction traces for this population are in the range of 3-5 seconds and are so far away from the center of mass of the population that they barely register:
When you look at your slow transaction traces you’ll find that, in nearly all cases, there is one segment that is consuming almost all of the extra time. I took samples of 4,000 anonymized slow transaction traces and found that in about half of them the slowest segment accounted for more than 96% of the total transaction time. Everything else behaved more typically.
In the sample data, database operations accounted for 25% of the slow segments. External calls were another 18%. The remainder were spread across many different component types. Database and external calls involve network operations and remote servers, so they are quite likely to stall out in a very small number of cases. The traces may still indicate a problem that needs to be addressed, but they aren’t going to show you where website response times are being spent overall. Therefore, they’re not a likely indicator of significant performance regression.
How do we know this? From looking at the data.
We’ve already seen how slow transaction traces are usually triggered by a single segment. I compared this segment to the typical performance profile in my sample of 4,000 transaction traces and found that, on average, the transaction trace’s slowest segment was 600 times longer than the typical latency of that segment. These super slow calls in transaction traces are truly pathological cases.
Move along, nothing to see here
As shown in our example, even if the stalled external calls showing up in slow transactions have a specific cause (transient network errors, say, or retry loops), those few outliers are not enough to significantly degrade the overall average time for that remote call. Focusing on them is nothing more than a red herring in the search for the true cause of a performance regression.
Should you avoid looking at transaction traces altogether in these situations? Not necessarily.
You can still look at traces of more typical transactions instead of just the slowest ones. Typically, some 25% of transactions fall into the range of typical transaction performance; another 25% are within twice the typical latency. When you look at the traces available for a particular transaction, you will often see a mix of typical as well as slow transactions (as long as the typical performance is at or above the transaction trace threshold).
The key is to not always reach for the slowest transaction trace. Instead, study the typical cases—the traces with latency closest to typical.
A better place to start
The truth is, you’re better off studying the breakdown of average segment times in the transaction overview. That’s where you will see any consistently poor-performing segments. Meanwhile, the historical charts will let you see segments that may have regressed over time:
A better approach to finding what’s broken is to first narrow down the possible causes, starting from the breakdown chart in the application overview page and then drilling into the transaction overviews and studying the breakdown charts and tables there.
When to use slow transaction traces
Despite these points, New Relic captures slow transaction traces for a reason. Sometimes, you do care about tracking the outliers. One of the most popular customer-centric metrics for analyzing website performance is the 99th percentile.
Apdex scores and average response times are good ways to characterize typical behavior, but they don’t capture the experience of those customers experiencing worst-case performance. Tracking the 99th percentile shows you are concerned about all your customers—including the outliers.
That’s why it makes sense to analyze the outliers. While optimizing them may not significantly impact your overall performance profile, it could improve the dispersion, moving outliers inward and lowering the 99th percentile for response times.
I’ve already pointed out that slow transaction traces generally blame a single segment for the slowness. Among these traces, about 8% show evidence of an N+1 anti-pattern, in which database or external calls are invoked more than 20 times for data that should be pre-fetched with a single call. Fixing these problems may not change the overall performance profile, but could have a big impact on the 1% of users experiencing the worst performance. And at scale, even 1% can be a significant number.
Transaction traces can provide a deep and detailed look into the behavior of web applications. But when it comes to traces of slow transactions, it’s important to understand exactly what they represent. Tuning the outliers won’t dramatically change the overall performance profile of your application, but it can improve the user experience for those transactions in the 99th percentile. After all, we care about every one of our customers, not just the “average” ones.