Many software-as-a-service (SaaS) businesses use a multi-tenant architecture to keep costs and operational overhead manageable. A multi-tenant architecture is one in which a single instance of the software handles the workloads of multiple tenants. In practice, a tenant is often a customer, but it could be any other boundary that makes sense for keeping workloads partitioned. Multi-tenant architectures have operational benefits, such as fewer service instances to support and maintain, but they also come with a number of challenges. Adequate observability for understanding tenant usage and diagnosing operational issues is critical to operating these systems successfully.

What is hard about operating a multi-tenant system?

Noisy neighbor problem

Due to the shared nature of multi-tenant systems, there are times when the behavior of one tenant degrades the overall performance of the system and thus affects other tenants. While there are design techniques that can mitigate these problems, they will still happen. Being able to diagnose these issues quickly is paramount to restoring service for the other tenants and meeting service-level agreement (SLA) goals.

Common issues that manifest in this way include a tenant sending abnormally large amounts of traffic, tenant operations triggering an unusual error condition, and other unusual tenant behaviors that consume resources in an unexpected way. Being able to facet telemetry data by a tenant identifier allows the operator to quickly identify this type of problem and take action.

For example, consider a server application that’s experiencing thread pool exhaustion due to slow requests tying up threads. In the New Relic APM experience you might see something like this:

You see that there are a lot of slow requests and few runnable threads.

What’s causing this issue?

Solution

  1. Add custom instrumentation to your application. One approach is to add custom attributes on APM Transaction events using the New Relic Java agent, which would look something like this:
@Override
@Trace(async = true)
public void onMessage(@Nonnull final KafkaMessage<ByteBuffer, ByteBuffer> message) throws Exception {

    final DataBatch dataBatch = deserializer.deserialize(topicName, message.value());
    // Tag the current transaction so it can be faceted by tenant and batch size in NRQL.
    NewRelic.addCustomParameter("tenantId", dataBatch.getTenantId());
    NewRelic.addCustomParameter("batchSize", dataBatch.getBatchSize());

    // ... process the batch ...
}

You can find more information on adding custom attributes to telemetry here.

  2. Deploy the updated application.

  3. Explore the data in the query builder or another New Relic capability that allows viewing and/or filtering by custom attributes, such as the errors inbox and the distributed tracing UI.

  4. Optionally, create dashboards for your custom queries so that they are easy to refer back to in the future. A best practice is to create dashboards such as this and link to them from service runbooks. You can find more information on building dashboards here.

Now that we have the instrumentation we need, we can write a query such as the following and see that a particular tenant is sending large batches of work that are tying up lots of threads and preventing our application from being as responsive as it should be.

FROM Transaction
SELECT average(batchSize)
WHERE appName = 'batch-processor-service (production)'
    AND name = 'OtherTransaction/Custom/com.example.package.DataProcessor/process'
FACET tenantId 
SINCE 1 hour ago TIMESERIES EXTRAPOLATE

Creating metrics that track operations by tenant is also a possibility. A key benefit of metrics is that they are not sampled the way events are, so the results of certain types of queries are more accurate. Our APM and mobile agents support the creation of custom metrics. You can find documentation on that here. Another option is to use the New Relic metrics API to create dimensional metrics with the tenant ID and any other important tenant attributes as dimensions.
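As a rough sketch of the dimensional-metric option, the following Java builds a JSON request body in the shape the Metric API accepts, with the tenant ID attached as an attribute. The class name, method names, and the metric name used in the usage note are hypothetical, and actually sending the body (an HTTP POST authenticated with your API key) is left out.

```java
import java.util.Locale;

// Hypothetical helper: builds a Metric API request body for one "count" data point.
final class TenantMetricPayload {

    // Minimal escaping of backslashes and double quotes only.
    // Real code should use a JSON library instead.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // name:        metric name (an assumption, pick your own)
    // tenantId:    becomes the "tenantId" dimension on the metric
    // value:       number of items counted during the interval
    // timestampMs: start of the interval, Unix epoch milliseconds
    // intervalMs:  length of the interval in milliseconds
    static String countMetric(String name, String tenantId, long value,
                              long timestampMs, long intervalMs) {
        return String.format(Locale.ROOT,
            "[{\"metrics\":[{\"name\":\"%s\",\"type\":\"count\",\"value\":%d,"
            + "\"timestamp\":%d,\"interval.ms\":%d,"
            + "\"attributes\":{\"tenantId\":\"%s\"}}]}]",
            escape(name), value, timestampMs, intervalMs, escape(tenantId));
    }
}
```

Calling `TenantMetricPayload.countMetric("batch.items.processed", "tenant-42", 500L, System.currentTimeMillis(), 60_000L)` would produce one data point that can later be faceted by `tenantId`.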

An important thing to consider when creating metrics is the limit on the cardinality of the dataset you'll be creating. Cardinality is generally defined as the number of unique elements in a set, which in the context of this blog post means that there is a limit on the number of tenant IDs you can create metrics for. If this is an issue, one approach is to create metrics only for tenants that exceed some usage threshold. To better understand the concepts related to metric cardinality, you can find more information here. To see the current cardinality limits, consult this documentation.
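One way to apply a threshold like this is sketched below, assuming a simple in-process counter: tenants whose running total stays at or below the threshold are reported under a shared "other" dimension value, and only heavier tenants get their own. The class `TenantDimensionGate` and its method are hypothetical names, not part of any New Relic library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper: keeps metric cardinality bounded by giving only
// high-volume tenants their own dimension value.
final class TenantDimensionGate {
    static final String OTHER = "other";

    private final long threshold;
    private final Map<String, LongAdder> totals = new ConcurrentHashMap<>();

    TenantDimensionGate(long threshold) {
        this.threshold = threshold;
    }

    // Records `units` of work for the tenant and returns the dimension value
    // to attach to the metric: the real tenant ID once the tenant's running
    // total exceeds the threshold, otherwise the shared "other" bucket.
    String record(String tenantId, long units) {
        LongAdder total = totals.computeIfAbsent(tenantId, id -> new LongAdder());
        total.add(units);
        return total.sum() > threshold ? tenantId : OTHER;
    }
}
```

The totals here never reset, which keeps the sketch short. A production version would more likely track usage per time window and cap the number of tenants that can be promoted out of the "other" bucket.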

You can read more about the noisy neighbor problem here.

Aside: diagnosing end-user issues

In the same vein, our browser, mobile, and APM capabilities support adding user IDs to telemetry, so that issues caused by individual users can be tracked in the same way we diagnosed issues with tenant usage of the system above. For example, the browser agent API's setUserId function can be used to add the end user's identifier to telemetry data such as JavaScriptError events. This event attribute then powers features such as the errors inbox's ability to track the user impact of errors.

Diagnosing the source of errors

Now that you've added instrumentation to track tenant IDs, you can use the New Relic errors inbox to determine whether tenant behavior is a source of errors in your system. The errors inbox is a powerful tool for triaging errors in your application. One very useful feature is attribute profiles, which you can access by clicking the View profiles button when viewing error group details. A profile like this indicates that a single tenant is causing a disproportionate number of errors and warrants further investigation:

On the other hand, when you see a very high percentage of "Other", you can rule out tenant behavior as the source of your errors.

Understanding resource utilization

Which tenants are consuming resources and driving the cost of delivering your service? With the rise in popularity of container orchestration systems such as Amazon Elastic Container Service and Kubernetes, autoscaling configurations are becoming much more common for scaling up and down as application load fluctuates. However, in the cloud, scaling has a financial cost, so it's important to be able to diagnose which customer behaviors are driving the cost of scale-out events. With this information, you can put appropriate system limits in place and/or adjust your business model to ensure that your cost of doing business is in line with the revenue you are generating.

Using the techniques described above, we can instrument our services and then build queries that help us assess customer usage patterns. For example, a query such as the following will help us diagnose where spikes in throughput to a particular set of query API endpoints are coming from:

FROM Transaction 
SELECT count(*) WHERE appName = 'my-graphql-api (production)' 
   AND name LIKE 'WebTransaction/GraphQL/QUERY/%'
FACET tenantId TIMESERIES EXTRAPOLATE 
SINCE 1 hour ago

The following query will show us how much of our batch processing resources in another service can be attributed to each customer.

FROM Transaction
SELECT sum(batchSize) as 'Batch Items Processed'
WHERE appName = 'batch-processor-service (production)'
    AND name = 'OtherTransaction/Custom/com.example.package.DataProcessor/process'
FACET tenantId 
SINCE 1 day ago LIMIT 100 TIMESERIES EXTRAPOLATE

A best practice is to implement limits on tenant usage, both to mitigate noisy neighbor issues and to control costs in areas where the business is not using a consumption pricing model. The telemetry above will prove very useful for tuning those limits and assessing their impact on customer usage.
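As one illustration of a per-tenant limit, here is a minimal token-bucket sketch. It is an assumption about how such a limit might be built, not a description of any New Relic feature, and `TenantRateLimiter` and its parameters are hypothetical names. The current time is passed in explicitly so the behavior is easy to reason about and test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-tenant token bucket: each tenant may burst up to
// `capacity` requests and then sustain `refillPerSecond` requests per second.
final class TenantRateLimiter {

    private static final class Bucket {
        double tokens;
        long lastRefillNanos;

        Bucket(double tokens, long nowNanos) {
            this.tokens = tokens;
            this.lastRefillNanos = nowNanos;
        }
    }

    private final double capacity;
    private final double refillPerSecond;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    TenantRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    // Returns true if the tenant may proceed at time `nowNanos`
    // (for example, a value taken from System.nanoTime()).
    boolean tryAcquire(String tenantId, long nowNanos) {
        Bucket bucket = buckets.computeIfAbsent(tenantId, id -> new Bucket(capacity, nowNanos));
        synchronized (bucket) {
            // Refill in proportion to elapsed time, ignoring clock regressions.
            double elapsedSeconds = Math.max(0L, nowNanos - bucket.lastRefillNanos) / 1_000_000_000.0;
            bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * refillPerSecond);
            bucket.lastRefillNanos = nowNanos;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```

Because each tenant has its own bucket, one tenant exhausting its allowance does not affect the others. Tagging rejected requests with the tenant ID, in the same way as the custom attributes shown earlier, would let you see how often each limit is being hit.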

Conclusion

In this article, we've seen simple but powerful techniques for enriching our telemetry data and using New Relic to gain insight into how multi-tenant software systems behave. The ability to facet telemetry by a tenant ID is an indispensable tool for operators debugging noisy neighbor issues or trying to understand customer usage patterns. With these insights, operators can minimize the impact of incidents and improve the system so that it's more performant, resilient, and cost-effective to operate. New Relic provides powerful tools to capture and analyze tenant data. In fact, we use them daily to run our own business.