In modern software environments, like those built on scalable microservices architectures, hitting capacity limits is a common cause of production-level incidents. It’s also, arguably, a type of incident teams can often prevent through proactive planning.

At New Relic, for example, our platform is made up of services written and maintained by more than 50 engineering teams, and capacity planning is a mandate for every one of them: we can’t afford for our real-time data platform to hit capacity limits. The first time through, each team spends several days focused on the analysis and development work needed to model their capacity needs. Once they have their capacity models in place, the ongoing practice of planning occupies, at most, a few hours a quarter. That time investment is more than worth it if it prevents just one incident per year.

To help make the process as smooth and repeatable as possible, the New Relic site reliability engineering team publishes a “capacity planning how-to guide” to walk teams through the process of capacity planning in software development. This post was adapted from that guide. You can also read our docs tutorial about using New Relic to prepare for a peak demand event.

What is capacity planning?

Simply put, capacity planning is the work teams do to make sure their services have enough spare capacity to handle any likely increases in workload, and enough buffer capacity to absorb normal workload spikes, between planning iterations.

During the capacity-planning process, teams answer these four questions:

  1. How much free capacity currently exists in each of our services?
  2. How much capacity buffer do we need for each of our services?
  3. How much workload growth do we expect between now and our next capacity-planning iteration, factoring in both natural customer-driven growth and new product features?
  4. How much capacity do we need to add to each of our services so that we’ll still have our targeted free capacity buffer after any expected workload growth?

The answers to those four questions—along with the architectures and uses of the services—help determine the methodology our teams use to calculate their capacity needs.

Who does capacity planning in software development?

Capacity planning in software development is a collaborative effort that typically involves multiple stakeholders, including developers, project managers, product owners, DevOps engineers, IT operations, and data analysts. At New Relic, for example, capacity planning is a mandate for every one of our more than 50 engineering teams.

What is the difference between short-term and long-term capacity planning?

Short-term capacity planning focuses on immediate resource needs and typically covers a timeframe that spans weeks to a few months. It mainly focuses on current demands, managing fluctuations in workloads, and ensuring day-to-day operations run smoothly.

Long-term capacity planning takes a more extended view, looking months or even years down the line. It involves strategic planning to accommodate growth, scale infrastructure, and align organizational capabilities with long-term business goals.

What are the benefits of capacity planning in software development?

There are several benefits to capacity planning in software development, including efficient resource utilization, cost savings, enhanced system performance, scalability, informed decision-making, risk mitigation, and reduced downtime.

Calculating capacity

We use three common methodologies to calculate how much free capacity exists for a given service:

  1. Service-starvation analysis
  2. Load-generation analysis
  3. Static-resource analysis

It’s important to note that each component of a service tier (for example, application host, load balancer, or database instances) requires separate capacity analysis.

Service-starvation analysis

Service starvation involves reducing the number of service instances available to a service tier until the service begins to falter under a given workload. The amount of resource “starvation” that’s possible without causing the service to fail represents the free capacity in the service tier.

For example, a team has 10 deployed instances of service x, which together handle 10K requests per minute (RPM) in a production environment. The team finds that it’s able to reduce the number of instances of service x to 8 and still support the same workload.

This tells the team two things:

  1. A single service instance is able to handle a max of 1.25K RPM (in other words, 10K RPM divided by 8 instances).
  2. The service tier normally has 20% free capacity: Two “free” instances equals 20% of the service tier.

Of course, this scenario assumes that the service tier supports a steady-state workload of 10K RPM; if the workload is spiky, there may actually be less (or more) than 20% free capacity across the 10 service instances.
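
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under the same steady-state assumption; the function name `starvation_analysis` and its parameters are hypothetical and not part of any New Relic tooling.

```python
def starvation_analysis(total_instances, min_instances, workload_rpm):
    """Estimate per-instance capacity and free capacity from a starvation test.

    total_instances: instances normally deployed in the service tier
    min_instances:   fewest instances that still handled the workload
    workload_rpm:    steady-state workload during the test, in requests per minute
    """
    if not 0 < min_instances <= total_instances:
        raise ValueError("min_instances must be between 1 and total_instances")
    # Each remaining instance carried an equal share of the workload.
    per_instance_max_rpm = workload_rpm / min_instances
    # Instances that could be removed represent the tier's free capacity.
    free_capacity = (total_instances - min_instances) / total_instances
    return per_instance_max_rpm, free_capacity


# Numbers from the example above: 10 instances, 8 needed, 10K RPM.
per_instance, free = starvation_analysis(10, 8, 10_000)
print(f"Max per instance: {per_instance:.0f} RPM")  # 1250 RPM
print(f"Free capacity: {free:.0%}")                 # 20%
```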

Load-generation analysis

Load generation is effectively the inverse of service starvation. Rather than scaling down a service tier to the point of failure, you generate synthetic loads on your services until they reach the point of failure.

The amount of synthetic workload that you were able to successfully process, expressed as a percentage of your normal workload, represents the free capacity in your service tier.
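
As a rough sketch of that calculation, the snippet below assumes you know the normal workload and the highest combined (normal plus synthetic) workload the tier sustained before failing. The function name and the 12.5K RPM figure are hypothetical. It reports headroom both relative to normal workload, as described above, and as a share of total capacity, which is the form used when free capacity is expressed as a percentage of overall capacity.

```python
def load_generation_headroom(normal_rpm, max_sustained_rpm):
    """Return (headroom relative to normal workload, free share of total capacity)."""
    if normal_rpm <= 0 or max_sustained_rpm < normal_rpm:
        raise ValueError("expected 0 < normal_rpm <= max_sustained_rpm")
    # Synthetic workload that was processed on top of the normal workload.
    extra_rpm = max_sustained_rpm - normal_rpm
    headroom_vs_normal = extra_rpm / normal_rpm
    free_share_of_capacity = extra_rpm / max_sustained_rpm
    return headroom_vs_normal, free_share_of_capacity


# Hypothetical test: 10K RPM of normal traffic, failure just above 12.5K RPM.
headroom, free_share = load_generation_headroom(10_000, 12_500)
print(f"Headroom vs. normal workload: {headroom:.0%}")    # 25%
print(f"Free share of total capacity: {free_share:.0%}")  # 20%
```

The two percentages differ because they use different denominators, so it helps to record which one a team is reporting.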

Static-resource analysis

This approach involves identifying the most constrained computational resource for a given service tier (typically, CPU, memory, disk space, or network I/O) and determining what percentage of that resource is available to the service as it is currently deployed.

Although this can be a quick way to estimate free capacity in a service, there are a few important gotchas:

  • Some services have dramatically different resource consumption profiles at different points in their lifecycle (for example, in startup mode versus normal operation).
  • It may be necessary to look at an application’s internals to determine free memory. For example, an application may allocate its maximum configured memory at startup time even if it's not using that memory.
  • Resources in a network interface controller (NIC) or switch typically reach saturation at a throughput rate lower than the maximum advertised by manufacturers. Because of this, it’s important to benchmark the actual maximum possible throughput rather than relying on the manufacturer’s specs.
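
The basic static-resource calculation can be sketched as follows. The `ResourceUsage` class, the resource names, and all of the figures are hypothetical, and the `limit` values are assumed to be benchmarked maximums rather than advertised specs, in line with the last gotcha above.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    name: str
    used: float   # current consumption at the observed workload
    limit: float  # benchmarked usable maximum, not the advertised spec

    @property
    def free_fraction(self):
        return 1 - self.used / self.limit


def most_constrained(resources):
    """Return the resource with the smallest fraction still available."""
    return min(resources, key=lambda r: r.free_fraction)


# Hypothetical measurements for one service tier.
tier = [
    ResourceUsage("cpu_cores", used=6.0, limit=10.0),
    ResourceUsage("memory_gb", used=24.0, limit=32.0),
    ResourceUsage("disk_gb", used=40.0, limit=100.0),
    ResourceUsage("network_gbps", used=2.0, limit=8.0),
]

bottleneck = most_constrained(tier)
print(f"{bottleneck.name}: {bottleneck.free_fraction:.0%} free")  # memory_gb: 25% free
```

In this made-up tier, memory is the most constrained resource, so the tier's free capacity would be estimated at 25%.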

No matter which methodology you choose, experiment during both peak and non-peak workload periods to get an accurate understanding of what the service can handle.

Now, let’s look at how to apply these methodologies in a capacity-planning exercise.

Capacity planning in software development

Our capacity planning comprises five main steps. Teams work through these steps and calculate their capacity needs in a template, an example of which is included below.

  1. List your services, and calculate each service’s free capacity using one of the available methodologies discussed above. Free capacity is generally best expressed as a percentage of overall capacity; for example, “This service tier has 20% free capacity.”
  2. Determine the safest minimum amount of free capacity you need for each service. Free capacity is your buffer against unexpected workload spikes, server outages, or performance regressions in your code.
    • Typically, we recommend a minimum of 30% free capacity for each service.
    • In all cases, teams should scale their services to at least n+2—they should be able to lose two instances and still support the service tier’s workload.
  3. Determine when you will next review your capacity needs, and hold that date. You should review services that are mature and experiencing typical growth quarterly, and review new services or those that are experiencing rapid growth monthly.
  4. Project the percentage of workload growth that your service is likely to experience before your next capacity review meeting. Base this projection on historical trend data and any known changes—such as new product features or architectural changes—that may impact this growth.
  5. Calculate how much capacity you’ll need to add to your service before your next capacity review, so that you can maintain your target free capacity and support your expected growth.
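
Step 2 combines two rules: a minimum percentage and the n+2 requirement. One way to reconcile them, sketched below, is to take whichever rule demands the larger buffer. The function name and defaults are hypothetical, and reading n+2 as "two instances' worth of the tier" is an assumption made for illustration.

```python
def target_free_capacity(instances, min_free_fraction=0.30, spare_instances=2):
    """Return the free-capacity fraction that satisfies both buffer rules.

    min_free_fraction: recommended minimum buffer (30% in the guidance above)
    spare_instances:   instances the tier must be able to lose (n+2 -> 2)
    """
    if instances <= spare_instances:
        raise ValueError("tier is too small to lose the spare instances")
    # Fraction of the tier represented by the spare instances.
    spare_fraction = spare_instances / instances
    return max(min_free_fraction, spare_fraction)


large_tier = target_free_capacity(60)  # 2 of 60 is ~3%, so the 30% rule wins
small_tier = target_free_capacity(5)   # 2 of 5 is 40%, so the n+2 rule wins
print(f"60 instances: {large_tier:.0%}")  # 30%
print(f"5 instances: {small_tier:.0%}")   # 40%
```

For small tiers the n+2 rule dominates, which is why a percentage target alone can leave a small service under-protected.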

Capacity planning template

Record the results of your calculations in a template, and make the information accessible to all stakeholders in the larger engineering organization (for example, site reliability engineers, engineering managers, and product owners). This sample template covers capacity planning for a Java-based service:

| Item | Label | Example |
|------|-------|---------|
| A | Service name/component | Collector/Java SVC instances |
| B | Today’s date | 4/1/2019 |
| C | Scheduled date for next capacity planning exercise | 7/1/2019 |
| D | Methodology used | Static-resource analysis |
| E | Current service tier size (# of hosts or container instances) | 60 |
| F | Current cores per service instance | 10 |
| G | Current storage per service instance | 100GB |
| H | Determined free capacity (as a percentage) | 20% |
| I | Current utilization. Formula: E - (E * H) | 60 - (60 * .2) = 48 |
| J | Target free capacity (as a percentage; should represent at least two free instances, or n+2) | 30% |
| K | Expected workload growth until date of C | 15% |
| L | Capacity needed to service minimum planned workload. Formula: I + (I * K) | 48 + (48 * .15) = 55.2 |
| M | Capacity needed to maintain target free capacity. Formula: L + (L * J) | 55.2 + (55.2 * .3) = 71.76 |
| N | Additional capacity to be added. Formula: roundup(M - E) | roundup(71.76 - 60) = 12 |
| O | Additional cores needed. Formula: N * F | 12 * 10 = 120 |
| P | Additional storage needed. Formula: N * G | 12 * 100GB = 1.2TB |
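
If you'd rather compute the template than fill it in by hand, the formulas for items I through P translate directly into code. The sketch below follows the template's formulas as written; the function name, the dictionary keys, and the choice to floor the result at zero instances are assumptions made for illustration.

```python
import math


def capacity_plan(tier_size, free_capacity, target_free_capacity,
                  expected_growth, cores_per_instance, storage_gb_per_instance):
    """Apply the template's formulas for items I, L, M, N, O, and P."""
    utilization = tier_size - (tier_size * free_capacity)             # I = E - (E * H)
    planned = utilization + (utilization * expected_growth)           # L = I + (I * K)
    with_buffer = planned + (planned * target_free_capacity)          # M = L + (L * J)
    # N = roundup(M - E); never negative in this sketch (an assumption).
    additional = max(math.ceil(with_buffer - tier_size), 0)
    return {
        "I_current_utilization": utilization,
        "L_planned_workload": planned,
        "M_with_target_buffer": with_buffer,
        "N_additional_instances": additional,
        "O_additional_cores": additional * cores_per_instance,             # N * F
        "P_additional_storage_gb": additional * storage_gb_per_instance,   # N * G
    }


# Values from the sample template: E=60, H=20%, J=30%, K=15%, F=10, G=100GB.
plan = capacity_plan(60, 0.20, 0.30, 0.15, 10, 100)
for item, value in plan.items():
    print(f"{item}: {value:g}")
```

With the sample inputs, this reproduces the template's results: 12 additional instances, 120 additional cores, and 1,200GB (1.2TB) of additional storage.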

Capacity planning strategies

There are a few different capacity planning strategies, each with a different focus. Here are the four main strategies that teams can leverage for effective capacity planning.

Lag strategy

The lag strategy involves deliberately delaying certain aspects of the project to align with the team's capacity. This approach acknowledges that not all tasks or features need to be executed at the same time. By purposely lagging behind in some areas, teams can ensure a smoother workflow, preventing burnout and maintaining consistent productivity. This strategy is particularly useful when facing resource constraints or unexpected challenges.

Lead strategy

Conversely, the lead strategy is all about prioritizing critical tasks or features that align with the team's strengths and capacity. By leading with high-priority items, teams can ensure that crucial aspects of the project are addressed promptly, reducing the risk of delays. This proactive approach allows teams to leverage their strengths and manage resources efficiently, creating a solid foundation for subsequent phases of the project.

Match strategy

The match strategy revolves around aligning the pace of development with the team's sustainable capacity. It involves carefully matching the complexity and volume of tasks with the team's available resources. This strategy promotes a balanced workload, preventing both overcommitment and underutilization of resources. Matching the team's capacity with the demands of the project fosters a sustainable and predictable development pace.

Adjustment strategy

The adjustment strategy emphasizes flexibility and adaptability in response to changing circumstances. This approach involves regularly reassessing the initial capacity plan and making necessary adjustments based on evolving project requirements, external factors, or unexpected challenges. The adjustment strategy ensures that the team remains responsive to dynamic situations, allowing for continuous optimization of capacity planning efforts.

Iterate on the capacity plan

After teams establish a regular cadence of capacity planning, it’s necessary to iterate on the data they collect. Future decisions about capacity should be informed by the difference between the forecasted capacity and the capacity they actually need.

Teams should ask such questions as: What accounted for any differences? Did services grow as expected? Or did growth slow? Were there architectural changes or new features to account for?

These questions can uncover whether teams' forecasts accounted appropriately for organic growth, or whether something else drove the difference.
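
A simple way to start that comparison is to measure how far actual growth landed from the forecast over the planning period. The sketch below is illustrative only; the function name and the workload figures are hypothetical.

```python
def forecast_error(forecast_growth, start_workload, end_workload):
    """Return (actual growth, actual minus forecast) for one planning period.

    forecast_growth: growth projected at the last review (0.15 means 15%)
    start_workload:  workload measured at the last review
    end_workload:    workload measured at this review
    """
    if start_workload <= 0:
        raise ValueError("start_workload must be positive")
    actual_growth = (end_workload - start_workload) / start_workload
    return actual_growth, actual_growth - forecast_growth


# Hypothetical period: 15% growth was forecast, and workload went from 10K to 12K RPM.
actual, error = forecast_error(0.15, 10_000, 12_000)
print(f"Actual growth: {actual:.0%}")    # 20%
print(f"Forecast error: {error:+.0%}")   # +5%
```

A positive error means the service grew faster than planned and consumed part of its buffer; a consistently negative error suggests the tier may be over-provisioned.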

In rare cases, teams may struggle to properly calculate their capacity needs. When that happens, we advise teams to plan time to reduce their capacity needs or work to make their services more efficient.

We also encourage teams to set up proactive alerting (for example, for out-of-memory kills or CPU throttling) in case actual growth exceeds their forecasts and their services hit capacity limits before the next review.

Of course, even the best capacity planning efforts won’t necessarily prevent all related production incidents in modern software environments. But careful, consistent, iterative, and proactive planning can go a long way towards minimizing capacity-related incidents. And that makes it totally worth the effort.