When you’re trying to optimize your website’s performance, one question comes up a lot: What is the expected distribution of response times for a web application? Understanding that distribution is critical if you want to find the best ways to reduce the number of long-tail outliers, the requests that take significantly longer to load than the rest.

So, what is that expected distribution? We all know it’s not a Normal (Gaussian) distribution. Typically, website response times look something like this:

[Figure: right-skewed long-tail distribution]

This distribution is generally described as a right-skewed long-tail distribution. It looks like it has a clear underlying function. But what is that function?

A mathematical mystery

Whenever I see articles on webpage response-time distributions, they all seem vague about what exactly this distribution is. A search on Stack Exchange suggests there is no real consensus in our collective intelligence. Often, anything that looks like a skewed distribution is lumped into the Log Normal family, which describes a variable whose logarithm follows a Gaussian distribution.

That’s not a new approach. Back in 2001, David M. Ciemiewicz published “What Do You ‘Mean’? Revisiting Statistics for Web Response Time Measurements,” an article that described the advantages of using Log Normal distributions as an approximation. These distributions offer some nice properties that can make it easier to estimate quantiles. But while the fit is good, it’s hardly exact:

[Figure: log normal distribution]
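To make the quantile point concrete, here is a minimal sketch of how a Log Normal approximation might be used to estimate percentiles. It is not taken from the Ciemiewicz article; the sample data, the parameter values, and the `lognormal_quantile` helper are all assumptions made up for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def lognormal_quantile(samples, q):
    """Estimate the q-th quantile by fitting a normal distribution to log(samples)."""
    logs = [math.log(s) for s in samples]
    mu, sigma = fmean(logs), stdev(logs)
    # Quantile on the log scale, mapped back to the original scale.
    return math.exp(NormalDist(mu, sigma).inv_cdf(q))


random.seed(42)
# Hypothetical response times in milliseconds, generated here from a
# log-normal purely so the example has data to work with.
response_ms = [random.lognormvariate(5.0, 0.5) for _ in range(10_000)]

p50 = lognormal_quantile(response_ms, 0.50)
p95 = lognormal_quantile(response_ms, 0.95)
print(f"estimated median: {p50:.1f} ms, estimated 95th percentile: {p95:.1f} ms")
```

Because the synthetic data really is log-normal, the estimates land close to the true values. Real response times will only follow the model approximately, so the tail estimates should be treated with more caution.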

New Relic Senior Site Reliability Engineer David Copeland explains that website response time distributions can be modeled with an Erlang distribution. An Erlang distribution is the special case of a Gamma distribution in which the shape parameter is a positive integer. Gamma distributions have two parameters: shape and rate.
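As a rough illustration of those two parameters, the sketch below evaluates the standard Erlang density and checks numerically that it integrates to about 1 and has a mean of shape divided by rate. The parameter values (shape 3, rate 0.02 per millisecond) are arbitrary assumptions, not figures from any measured application.

```python
import math


def erlang_pdf(x, shape, rate):
    """Density of an Erlang distribution; shape must be a positive integer."""
    if x < 0:
        return 0.0
    return (rate ** shape) * (x ** (shape - 1)) * math.exp(-rate * x) / math.factorial(shape - 1)


SHAPE = 3      # assumed for illustration
RATE = 0.02    # assumed, per millisecond
STEP = 0.5     # grid spacing in milliseconds

# Simple Riemann sum over a range wide enough to cover nearly all of the mass.
grid = [i * STEP for i in range(int(4000 / STEP) + 1)]
total_probability = sum(erlang_pdf(x, SHAPE, RATE) * STEP for x in grid)
mean_ms = sum(x * erlang_pdf(x, SHAPE, RATE) * STEP for x in grid)

print(f"total probability: {total_probability:.4f}")
print(f"mean: {mean_ms:.1f} ms (shape / rate = {SHAPE / RATE:.1f} ms)")
```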

Here is an Erlang distribution overlay:

[Figure: gamma (Erlang) distribution overlay]

Clearly a much better fit, but why? According to Wikipedia:

The Erlang distribution was developed by A. K. Erlang to examine the number of telephone calls which might be made at the same time to the operators of the switching stations. This work on telephone traffic engineering has been expanded to consider waiting times in queueing systems in general.

Queuing systems

Application servers and multi-tier apps that are transactional in nature are also examples of queuing systems, in which every request gets routed through a network of resources—CPU, disk, network, and so on. Every transaction competes for access to these resources and inevitably spends time “queuing” for service at each resource. The overall latency is the summation of the time spent waiting for each resource in addition to the time being serviced.
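One way to see why summing time across resources leads toward an Erlang shape is a small simulation. The sketch below makes a strong simplifying assumption: each request passes through a fixed number of stages, and the time spent at each stage is an independent exponential draw with the same rate. Under that assumption the total is Erlang distributed with shape equal to the number of stages. The stage count and rate are hypothetical, and a real system will be messier.

```python
import random
from statistics import fmean, median, pvariance


def simulate_latency(stages, rate, rng):
    """Total latency as the sum of independent exponential stage times."""
    return sum(rng.expovariate(rate) for _ in range(stages))


STAGES = 3     # hypothetical number of resources visited per request
RATE = 0.02    # hypothetical per-stage rate, per millisecond

rng = random.Random(7)
latencies = [simulate_latency(STAGES, RATE, rng) for _ in range(50_000)]

sample_mean = fmean(latencies)
sample_var = pvariance(latencies)
sample_median = median(latencies)

# For an Erlang distribution, mean = shape / rate and variance = shape / rate**2.
print(f"mean: {sample_mean:.1f} ms (theory {STAGES / RATE:.1f})")
print(f"variance: {sample_var:.0f} (theory {STAGES / RATE ** 2:.0f})")
print(f"median: {sample_median:.1f} ms")
```

The median comes out below the mean, which is the right skew you would expect from a long-tailed distribution.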

For the queuing laws that govern the behavior of a telecom network to apply to a transaction processing system, a few key assumptions must hold. In particular, the requests must arrive at random intervals, completely independent of each other, which is known as a Poisson process.
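A quick way to get a feel for the Poisson assumption is to generate arrivals with independent exponential gaps and count how many land in each one-second window. For a Poisson process, the variance of those counts should be close to their mean. This is only a sketch on synthetic data; the arrival rate and duration are assumptions chosen for illustration.

```python
import random
from statistics import fmean, pvariance

RATE_PER_SEC = 20.0   # hypothetical average arrival rate
DURATION_SEC = 2000   # hypothetical observation window

rng = random.Random(11)
counts = [0] * DURATION_SEC

# In a Poisson process, gaps between consecutive arrivals are
# independent exponential random variables.
t = rng.expovariate(RATE_PER_SEC)
while t < DURATION_SEC:
    counts[int(t)] += 1
    t += rng.expovariate(RATE_PER_SEC)

mean_count = fmean(counts)
var_count = pvariance(counts)
dispersion = var_count / mean_count  # close to 1 for Poisson arrivals

print(f"mean per second: {mean_count:.2f}, variance: {var_count:.2f}, ratio: {dispersion:.2f}")
```

Running the same check on real request logs is one rough way to judge how far traffic departs from the assumption: a ratio well above 1 suggests bursty arrivals.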

To the extent a response time distribution does not fit neatly into an Erlang distribution, it may be because the request intervals do not follow a Poisson process. More likely, though, the distribution reflects a mixture of different classes of requests, each with its own distinct resource demands. For example, you might have two classes: one with cached results, the other with cache misses. Often this results in a multi-modal histogram with two or more peaks representing the different classes.

[Figure: multi-modal histogram]

In the case of a multi-modal distribution, what you are actually looking at is the summation of multiple distinct distributions, all sharing the same underlying function but with different parameters. When you disambiguate the underlying modes, it looks like this:

[Figure: disambiguated multi-modal distribution]

In this case, each distinct color represents a different web transaction.
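The cache example can be sketched in a few lines. The code below mixes two hypothetical classes of requests, cache hits and cache misses, each drawn from an Erlang distribution with the same shape but a different rate, then bins the combined latencies into a histogram. All of the numbers (the 60/40 split, the shape, the rates, the bin width) are assumptions picked to make the two peaks easy to see.

```python
import random
from statistics import fmean


def erlang_sample(rng, shape, rate):
    """Draw one Erlang value as a sum of `shape` exponential draws."""
    return sum(rng.expovariate(rate) for _ in range(shape))


HIT_FRACTION = 0.6   # hypothetical share of requests served from cache
SHAPE = 4            # same underlying function for both classes
HIT_RATE = 0.2       # hypothetical, per millisecond (mean 20 ms)
MISS_RATE = 0.02     # hypothetical, per millisecond (mean 200 ms)
BIN_MS = 10

rng = random.Random(3)
samples = []
for _ in range(20_000):
    if rng.random() < HIT_FRACTION:
        samples.append(("hit", erlang_sample(rng, SHAPE, HIT_RATE)))
    else:
        samples.append(("miss", erlang_sample(rng, SHAPE, MISS_RATE)))

# Combined histogram: bin start (ms) -> number of requests.
histogram = {}
for _, latency in samples:
    bin_start = int(latency // BIN_MS) * BIN_MS
    histogram[bin_start] = histogram.get(bin_start, 0) + 1

# Disambiguating by class recovers the two separate distributions.
hit_mean = fmean(lat for label, lat in samples if label == "hit")
miss_mean = fmean(lat for label, lat in samples if label == "miss")

for bin_start in (10, 60, 150):
    print(f"{bin_start:>3}-{bin_start + BIN_MS} ms: {histogram.get(bin_start, 0)} requests")
print(f"hit mean: {hit_mean:.1f} ms, miss mean: {miss_mean:.1f} ms")
```

The combined histogram shows a tall peak around 10 to 20 ms, a dip around 60 ms, and a second, lower peak around 150 ms. Splitting by class, as the last two lines do, is the same kind of disambiguation shown in the figure above.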

Why does it matter?

Understanding the expected distribution of website response times can help us better estimate and interpret key parameters like the mean, median, and variance. It can also allow construction of more accurate models for load testing and capacity planning. Most important, it can help displace the confusion around application latency with a stronger sense of understanding and expectation. And that can go a long way toward helping us better discern the unexpected and anomalous cases, and thus more effectively target optimizations of those edge cases.