Over the last few years, the New Relic customer experience has been tied to inspected count (IC), our internal unit of measure for calculating the “cost” of a query. When customers perform an action, whether by running a query or loading a UI page, that action inspects a certain number of data points in the New Relic database (NRDB). In 2017, we introduced IC limits, which capped the cumulative cost of inspected customer data points over a 15-minute time range.

Recently, we announced that we’ve removed these limits for all New Relic customers. This means no more dropped queries due to IC limits, no 15-minute wait time for limits to reset, a doubling of our query capacity for both data options, and no rejected queries once new limits are reached.

Understanding why we were able to do this requires understanding a bit about what's changed under the hood of NRDB in the past few years.

What hasn't changed

There are a few relevant things that haven't changed about NRDB over this time.

The first is the massive gap between "normal" NRDB queries and the largest ones. Normal queries (at the median) look at up to a few million data points and take less than 100ms to execute. At the other extreme, the 99.99th percentile of queries look at tens of billions of data points and take tens of seconds to execute. It may seem silly to spend much time thinking about these extreme outliers, but in a massively multi-tenant system like NRDB, these 1-in-10,000 events are happening constantly.
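To make the "happening constantly" point concrete, here is a small back-of-the-envelope sketch. The query rate used in the example is a hypothetical number chosen for illustration, not a published NRDB figure.

```python
def outliers_per_minute(queries_per_second: float, one_in: int = 10_000) -> float:
    """Expected number of 1-in-`one_in` queries arriving each minute."""
    return queries_per_second * 60 / one_in


# Hypothetical load: at 1,000 queries per second, a 1-in-10,000 query
# shows up about six times every minute.
rate = outliers_per_minute(1_000)
```

At any meaningful multi-tenant query rate, the 99.99th percentile is not a rare event; it is part of the steady-state workload.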

The other unchanged (at least as far as this topic is concerned) aspect of the system is how we store data on disk. As we receive data from our customers, it’s written into what we refer to as "archive files." Archive files store a single type of data, for a single customer, and are bounded in both file size and the time range of data they contain. When a query is executed, archive files are essentially the atomic unit of work. We can distribute different archive files to different nodes to be queried in parallel, but any single archive file must be entirely processed on a single node.
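The "atomic unit of work" idea can be illustrated with a minimal sketch. The `ArchiveFile` fields and the greedy largest-first assignment below are assumptions made for illustration; the article does not describe how NRDB actually schedules files. The only property taken from the text is that each file lands on exactly one node while different files can go to different nodes.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveFile:
    # Hypothetical fields: one data type, one customer, bounded size.
    account_id: int
    data_type: str
    size_bytes: int


def assign_files(files, node_names):
    """Assign each file to exactly one node, largest files first,
    always picking the node with the least work so far."""
    heap = [(0, name) for name in node_names]  # (assigned bytes, node)
    heapq.heapify(heap)
    plan = {name: [] for name in node_names}
    for f in sorted(files, key=lambda f: f.size_bytes, reverse=True):
        load, name = heapq.heappop(heap)
        plan[name].append(f)  # the whole file goes to this one node
        heapq.heappush(heap, (load + f.size_bytes, name))
    return plan


files = [ArchiveFile(1, "Transaction", s) for s in (50, 30, 20, 10)]
plan = assign_files(files, ["node-a", "node-b"])
```

Because a file is never split, the slowest node in a plan like this bounds the query's latency, which is one reason bounding file size matters.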

NRDB in 2017

In 2017, NRDB employed a traditional data center architecture with large quantities of bare-metal hardware in a relatively static configuration. Our "query worker" compute nodes combined several thousand individual Java processes into a single massive shared cluster organized as a fairly standard consistent hash ring.
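For readers unfamiliar with the term, a textbook consistent hash ring can be sketched as follows. This is a generic illustration, not NRDB's implementation (which ran as Java processes); the class name, the use of MD5, and the virtual-node count are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed at `vnodes` pseudo-random points on the ring.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


ring = HashRing(["worker-1", "worker-2", "worker-3"])
owner = ring.node_for("archive-file-42")  # hypothetical work-unit key
```

The useful property is that removing a worker only remaps the keys that worker owned; everything else stays where it was.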

This approach is conceptually simple and relatively simple to operate, and it has fairly nice economics as well. Normal queries required only a small fraction of the resources of the cluster, but occasional large queries could draw on a much larger pool of resources to run much more quickly than would have been possible if CPUs were strictly allocated to single tenants.

While this works well for large one-off queries, it opens the door to significant issues of fairness across tenants. Although one user running one very large query didn't generally have noticeable effects on anyone else, running many such queries could significantly impact other users in the absence of guardrails. This consideration led us to introduce inspected count limits. IC limits allowed us to notice a tenant who was likely to be causing a degraded experience for other users and slow down their query rate a bit (by rejecting a fraction of their queries) to ensure fair resource utilization.
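A limiter of this general shape can be sketched as below. Only the 15-minute window and the idea of rejecting a fraction of queries come from the description above; the class name, the limit value, and the rule that the rejection fraction grows linearly with the overage are assumptions for illustration.

```python
import random
from collections import deque

WINDOW_SECONDS = 15 * 60  # the 15-minute range described above


class InspectedCountLimiter:
    """Per-tenant sliding-window limiter (illustrative only)."""

    def __init__(self, limit, rng=None):
        self.limit = limit
        self.events = deque()  # (timestamp, inspected_count)
        self.total = 0
        self.rng = rng or random.Random()

    def _expire(self, now):
        # Drop usage that has aged out of the 15-minute window.
        while self.events and self.events[0][0] <= now - WINDOW_SECONDS:
            _, ic = self.events.popleft()
            self.total -= ic

    def reject_probability(self, now):
        # Assumed policy: 0 under the limit, rising linearly to 1 at 2x.
        self._expire(now)
        if self.total <= self.limit:
            return 0.0
        return min(1.0, (self.total - self.limit) / self.limit)

    def admit(self, now):
        return self.rng.random() >= self.reject_probability(now)

    def record(self, now, inspected_count):
        self._expire(now)
        self.events.append((now, inspected_count))
        self.total += inspected_count


limiter = InspectedCountLimiter(limit=100)
limiter.record(now=0, inspected_count=80)
limiter.record(now=10, inspected_count=70)
```

The weakness of any scheme like this is visible in the sketch: the limit looks only at one tenant's cumulative cost, not at whether the system is actually under pressure.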

NRDB in 2024

Meanwhile, many things have changed. Most fundamentally, we've migrated NRDB to the public cloud and to a cellular architecture. Rather than one massive shared query compute cluster, we now run dozens of isolated query compute clusters with the ability to rapidly and dynamically redistribute workloads across them. Hand in hand with this new architecture, we've also implemented new near-real-time feedback mechanisms to understand the query workloads of individual users.

As a result, it's no longer possible for a single tenant to starve other customers of compute resources. When one user has an abnormal surge in usage that might saturate one cluster, we can detect that and route other customers' workloads to other clusters so they continue to run without any degradation. Meanwhile, the first customer's queries continue to run, but possibly a bit slower than normal if they’ve capped out their allocated capacity.
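The routing behavior described here can be sketched in a few lines. Everything in this example is hypothetical: the `Cluster` type, the 0.9 saturation threshold, and the "heaviest tenant stays put, everyone else moves to the least-utilized cluster" rule are illustrative assumptions, not a description of the actual feedback mechanisms.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    capacity: float
    load_by_tenant: dict = field(default_factory=dict)

    @property
    def utilization(self) -> float:
        return sum(self.load_by_tenant.values()) / self.capacity


def route(tenant, home, clusters, saturation=0.9):
    """Pick the cluster that should serve this tenant's next query."""
    if home.utilization < saturation:
        return home
    heaviest = max(home.load_by_tenant, key=home.load_by_tenant.get)
    if tenant == heaviest:
        # The surging tenant keeps running here, possibly a bit slower.
        return home
    # Everyone else is moved away from the saturated cluster.
    return min(clusters, key=lambda c: c.utilization)


a = Cluster("cell-a", 100, {"tenant-1": 95, "tenant-2": 2})
b = Cluster("cell-b", 100, {"tenant-3": 10})
target = route("tenant-2", home=a, clusters=[a, b])
```

In this sketch, `tenant-1` saturates `cell-a` and stays there, while `tenant-2` is routed to `cell-b` and is unaffected by the surge.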

The upshot of all this is that we observed two things. First, our query availability and performance service level objectives improved by two orders of magnitude. Second, the times when our existing inspected count limiting kicked in no longer correlated with other measures of system health. In other words, we could now provide a better experience for all of our customers without imposing IC limits on any of them.

Does this mean there aren't any limits on NRDB queries anymore? Well, no, not exactly. But those limits are now much more focused, nuanced, and responsive, and we expect that most users will never notice them.