Many companies rely on application performance monitoring (APM) tools to alert them when things go wrong in their apps that could affect the customer experience. But for those alerts to be effective, they need to be accurate; that is, they need to alert only on true performance issues that have a tangible user and business impact.

If you set your alert threshold too low, you’re going to be flooded with false alerts, causing the ops team to lose confidence in the alerting system. If you set your alert threshold too high, IT ops could miss relevant performance degradation, resulting in a poor user experience.

Here at New Relic, we use Apdex to address this complex alerting problem. Apdex is an industry standard for measuring user satisfaction with the response time of an application or service. Compared to traditional metrics such as average response time, which can be skewed by a few very long responses, Apdex provides better insight into how satisfied users are by measuring app response time against a set threshold, called Apdex T.

How does Apdex work?

The Apdex method converts many measurements into one number on a uniform scale of 0 to 1 (0 = no users satisfied, 1 = all users satisfied). The resulting Apdex score is a numerical measure of user satisfaction with the performance of enterprise applications. This metric can be used to report on any source of end-user performance measurements for which a performance objective has been defined.

Apdex T is the central value for Apdex: it is the response time threshold that separates satisfied users from the rest. Transactions that complete in T or less count as “satisfied,” those that take between T and 4T count as “tolerating,” and anything slower than 4T counts as “frustrated.” You can define Apdex T values for each application, with separate values for app server and end-user browser performance. You can also define individual Apdex T thresholds for key transactions.
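The scoring formula itself is simple: Apdex = (satisfied count + half the tolerating count) / total count. Here’s a minimal R sketch (illustrative only, not New Relic’s implementation) that computes a score from a vector of response times and a threshold:

apdex.score <- function(durations, t) {
   satisfied  <- sum(durations <= t)                      # at or under T
   tolerating <- sum(durations > t & durations <= 4 * t)  # between T and 4T
   (satisfied + 0.5 * tolerating) / length(durations)
}

# Example: with T = 0.5s, two satisfied, two tolerating, one frustrated response
apdex.score(c(0.2, 0.4, 0.7, 1.3, 3.0), t=0.5)   # 0.6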

Why do I need to override the default T threshold?

Setting a good T value will give you fidelity in your Apdex score, so that as performance shifts, the score shifts appropriately as well. This is important for detecting changes in the performance profile that can impact user experience positively or negatively. Suffice it to say it’s helpful to have an Apdex chart with a signal that isn’t pegged at 1 or 0.

When an application starts reporting data for the first time, New Relic chooses a default T value of 500 ms for most agents (Python uses 100 ms). But it’s not really a one-size-fits-all value. The default produces a score with a good range for only about 20% of applications. The rest need to override the default to get a good score.

What should I set T to?

If you have an app that has been running for a while in a steady state and you feel you have a good baseline for acceptable performance, you can start by setting your Apdex threshold to give you a baseline Apdex score of 0.95. To do that, get the 90th percentile response time and set it as your Apdex T. (As we explain below, the 90th percentile typically corresponds to a score of about 0.95.)

Here’s the NRQL query to do so:

SELECT percentile(duration, 90) FROM Transaction WHERE appId=NNNNN SINCE 24 hours ago

Now, 0.95 is pretty high. Want to show more headroom for improvement? Try a target score of 0.92, or 0.90. Use the percentile value from this table that corresponds to the score you want.

How we came up with these values

We wanted to find a good “rule of thumb” for setting Apdex T without having to do trial and error, successively adjusting the value until the score looked right. Once an application has been running for a while it generates summary statistics that can be used to estimate the T threshold.

The most accurate method we discovered was to build an HDR (high dynamic range) histogram of the historical data and perform a simple search, optimizing the error between the target score and actual score using a range of threshold values. This was a complicated calculation to do manually, so we wanted to find a simpler way.
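For reference, here’s a rough sketch of that kind of search against raw data (illustrative only; in practice this has to run against a histogram of historical data for every application, which is exactly the work we wanted to avoid):

# Scan candidate T values and keep the one whose empirical score is closest to the target
best.t <- function(durations, score.target=0.95) {
   empirical.score <- function(t) {
      (sum(durations <= t) + 0.5 * sum(durations > t & durations <= 4 * t)) / length(durations)
   }
   # Candidate thresholds: a fine grid of upper percentiles of the observed data
   candidates <- quantile(durations, probs=seq(0.50, 0.999, by=0.001))
   errors <- sapply(candidates, function(t) abs(empirical.score(t) - score.target))
   unname(candidates[which.min(errors)])
}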

A long time ago we wrote about using the mean as a T value. But the mean isn’t effective if you want a consistent score: it can be heavily skewed by outliers, resulting in a wide range of T values and less predictable scores. Percentiles seemed like they would be more robust. But was that actually the case, and which percentile should you use? We began to investigate.

Using a model distribution for response times

We decided to use a model distribution to represent a typical probability density for web transaction response times. A number of different distributions provide a reasonable theoretical fit for response time data. The log-normal distribution is one useful option because it’s simple to work with. But we wanted to go for accuracy, not simplicity, so we chose an Erlang distribution, which gave us a better approximation for a system modeled as a queuing network with a Poisson arrival process. We used R’s gamma distribution functions (such as pgamma() and qgamma()) to calculate probabilities and quantiles; the gamma distribution is equivalent to an Erlang distribution when the shape parameter is a positive integer.

Here’s an example distribution:

[Chart: example model response time distribution]

This is a good approximation for what histograms look like for web transactions.
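If you want to see a similar curve yourself, plotting the gamma density in R takes only a couple of lines (the shape and scale values here are purely illustrative):

# Plot a model (Erlang/gamma) response time density
x <- seq(0, 3, by=0.01)
plot(x, dgamma(x, shape=2, scale=0.2), type="l",
   xlab="Response time (seconds)", ylab="Density")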

Calculating the Apdex score

First, we created a function to calculate the Apdex score and bucket counts. Bucket values are normalized so that N = 1, where N would normally be the total number of requests.

apdex <- function(t, shape, scale) {
   # Cumulative probabilities at T and 4T: P(duration <= T) and P(duration <= 4T)
   buckets <- pgamma(c(t, 4*t), shape=shape, scale=scale)
   # Score = satisfied fraction + half of the tolerating fraction
   list(score=buckets[1] + (0.5 * (buckets[2] - buckets[1])),
      s=buckets[1],                 # satisfied:  duration <= T
      t=buckets[2] - buckets[1],    # tolerating: T < duration <= 4T
      f=1 - buckets[2])             # frustrated: duration > 4T
}

Using this function, we can chart a simulated response time distribution and show the resulting score in a histogram/rug plot combination.
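Here’s a rough sketch of that kind of plot, built on the apdex() function above; the shape and scale parameters are illustrative rather than fitted to any real application:

# Simulate response times from the model distribution and try the 90th percentile as T
set.seed(42)
shape <- 2
scale <- 0.2
samples <- rgamma(5000, shape=shape, scale=scale)
t <- quantile(samples, 0.90)                 # candidate Apdex T
result <- apdex(t, shape=shape, scale=scale)

hist(samples, breaks=100, xlab="Response time (seconds)",
   main=sprintf("Apdex = %.3f at T = %.2fs", result$score, t))
rug(samples)
abline(v=c(t, 4 * t), lty=2)                 # satisfied / tolerating / frustrated boundaries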

Based on this, the 90th percentile should give us an Apdex score of about 0.95. So we decided to test it out.

We conducted an experiment on 500 randomly chosen production applications, using each app’s 90th percentile as the estimate for T and shooting for a score of 0.95. In 45% of cases the resulting score was 0.95; the mean score was 0.938, with a 95% interval of 0.91 to 0.96.

Here’s the actual distribution:

[Chart: distribution of the resulting Apdex scores across the 500 test applications]

Now that we know that using the Erlang distribution to estimate T for a target Apdex score works pretty well, what if we want to set a score other than 0.95?

Let’s determine the ideal percentile values to use for a set of target scores. We’ll use the optimize() function to perform a search minimizing the error between the actual and target Apdex score.

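Here’s a sketch of what that search can look like, reusing the model-based apdex() function from above (again, illustrative rather than the exact code we ran):

# For a target score, find the percentile whose quantile, used as T, hits that score
percentile.for.score <- function(score.target, shape, scale) {
   err <- function(p) {
      t <- qgamma(p, shape=shape, scale=scale)             # candidate T at percentile p
      (apdex(t, shape=shape, scale=scale)$score - score.target)^2
   }
   optimize(err, interval=c(0.5, 0.999))$minimum
}

percentile.for.score(0.95, shape=2, scale=0.2)   # roughly 0.90 for this model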

The resulting table (shown earlier in this post) shows the target Apdex score and the percentile value that yields a T threshold that meets that score.

Now that all that explanation is out of the way…

Ready to set your T threshold?

From the New Relic UI, go to Settings > Application. You will then see a field to change your Apdex T setting (note: you will need to have admin permissions).

Conclusion

From a business perspective, the ideal way to set a T threshold is to choose a value that represents your business requirements in terms of a threshold for “tolerable” latency.

From an ops perspective, the ideal way to set the T threshold is to pick a distinct value for each application that results in consistent scores across your application portfolio so that you can monitor significant movement over time. The easiest way to do this is to use a percentile value that corresponds to your target Apdex score.

So keep the table above handy, along with the NRQL query, for the next time you see a flat Apdex chart in your application overview page.

Additional resources:

White paper: Improve Alerting Accuracy: New Relic’s Apdex-Driven Approach Honed by Big Data