Reality has a habit of being more real than we expect. This is particularly true in the world of software—what looks like perfection on a whiteboard or in a test environment often turns out to be … well, less than perfect when it’s handling the unpredictability of a production workload. That’s why blue-green deploys, feature flags (or toggles), and canary deploys have become such essential tools for modern software teams: They allow teams to roll back changes quickly—or at least to minimize customer impact—if deploys go wrong.

At New Relic, we believe canary deploys provide the most effective means for us to deploy new services reliably and in the least disruptive manner to our customers. In a canary deploy, you roll out new functionality for the first time on a subset of “canary” nodes across horizontally scaled services. You can distribute traffic to the canary nodes randomly or based on predefined criteria (for example, send canary traffic to users you’ve identified as willing early adopters or to a specific set of infrastructure). As you verify that the canary isn’t causing problems, you roll out the change to more instances until all users have access to the new code. Deploying a small number of canary instances makes it significantly easier to roll back changes should the new code cause issues for your users or infrastructure.
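Here's a minimal sketch of that idea, assuming a hypothetical routing layer in front of a horizontally scaled service. The names (`CANARY_FRACTION`, `EARLY_ADOPTERS`, `route`) are illustrative only, not New Relic's implementation:

```python
import random

# Hypothetical routing sketch: send a small, configurable fraction of
# traffic to canary instances, either at random or based on a
# predefined criterion such as opted-in early adopters.
CANARY_FRACTION = 0.05                    # start small; raise it as confidence grows
EARLY_ADOPTERS = {"acct-42", "acct-77"}   # illustrative opt-in list

def route(account_id: str) -> str:
    """Return which instance pool should handle this request."""
    # Criteria-based routing: known early adopters always see the canary.
    if account_id in EARLY_ADOPTERS:
        return "canary"
    # Random distribution for everyone else.
    return "canary" if random.random() < CANARY_FRACTION else "stable"

if __name__ == "__main__":
    sample = [route(f"acct-{i}") for i in range(1000)]
    print(f"canary share: {sample.count('canary') / len(sample):.1%}")
```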

But what makes for an effective canary deploy? Here, we’ll look at some of the best practices we’ve evolved at New Relic for using canaries to move quickly while minimizing disruption to our users and our platform.

Considerations for effective canary deploys

In his talk “Canarying Well: Lessons Learned from Canarying Large Populations” from SREcon18, Štěpán Davidovič, a site reliability engineer from Google, defined what he calls “the triangle of canarying.”

The key to effective canary deploys, Štěpán explains, is finding the right balance between three different concerns:

  • Canary time: Slower canary rollouts mean better data but reduced velocity.
  • Canary size: Using a larger set of canary instances provides a better representation of the overall population but also increases the impact of bad canaries.
  • Metric selection: Using a larger set of metrics to evaluate canary vs. non-canary instances can increase the predictive accuracy of a canary deploy, but it can also introduce noise and false signals into the evaluation.
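One way to keep these trade-offs visible is to capture all three knobs in a single, reviewable canary plan. The sketch below is purely illustrative; the field names are assumptions, not a New Relic or Google API:

```python
from dataclasses import dataclass, field

@dataclass
class CanaryPlan:
    """Illustrative record of the three knobs in the 'triangle of canarying'."""
    duration_hours: float    # canary time: longer = better data, slower rollout
    instance_count: int      # canary size: bigger = more representative, larger blast radius
    metrics: list[str] = field(default_factory=lambda: [
        "error_rate", "response_time", "throughput",   # core comparison metrics
    ])

plan = CanaryPlan(duration_hours=4, instance_count=5)
print(plan)
```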

With these concerns in mind, here are nine best practices we rely on at New Relic for using canaries.

9 best practices for canary deployments

1. Always use canaries with your deploys. Avoid thinking “this change is small enough that we don’t need to canary”—this is almost always the wrong call. If you can't use canaries to test changes to a service, consider that service high risk, and direct reliability work so the service can be deployed with canaries. We ask teams to come up with an appropriate canary strategy that fits all of their deploys rather than making these decisions on a case-by-case basis.

2. Time canary deploys with your traffic cycles. If you see regular workload peaks and troughs, time your canary period to begin before peak traffic and cover a portion of the peak traffic period.
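As a rough sketch, assuming a single, known daily peak hour (the hour, lead time, and duration below are placeholders), the canary window can be derived so it starts ahead of the peak and covers part of it:

```python
from datetime import datetime, timedelta, timezone

# Illustrative values: a service that peaks daily at 14:00 UTC, with a
# canary window that starts 2 hours before the peak and covers part of it.
DAILY_PEAK_HOUR_UTC = 14
LEAD_TIME = timedelta(hours=2)
CANARY_DURATION = timedelta(hours=4)

def next_canary_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the next canary window that brackets the peak."""
    peak = now.replace(hour=DAILY_PEAK_HOUR_UTC, minute=0, second=0, microsecond=0)
    if peak - LEAD_TIME < now:          # today's window already started; use tomorrow's peak
        peak += timedelta(days=1)
    start = peak - LEAD_TIME
    return start, start + CANARY_DURATION

if __name__ == "__main__":
    start, end = next_canary_window(datetime.now(timezone.utc))
    print(f"deploy canaries at {start:%Y-%m-%d %H:%M} UTC, evaluate until {end:%H:%M} UTC")
```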

3. Avoid focusing canaries on outliers in your service pool or workload. If specific portions of your workload represent ongoing outlier cases in the profile of your service, make sure that your selected canary instances don’t focus disproportionately on those outliers. Canaries should cover the major modes/types of workload handled by your service. For example, if a database service has one or two shards that are write-heavy while the remainder are read-heavy, there should be at least one canary for each of these workload modes.
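A minimal sketch of what "at least one canary per workload mode" could look like in selection logic, assuming each shard or instance can be tagged with its dominant workload (the shard names and tags here are made up):

```python
from collections import defaultdict

# Hypothetical shard -> workload-mode mapping for a database service.
SHARD_MODES = {
    "shard-01": "write-heavy",
    "shard-02": "read-heavy",
    "shard-03": "read-heavy",
    "shard-04": "read-heavy",
}

def pick_canaries(shard_modes: dict[str, str]) -> list[str]:
    """Pick at least one canary from every workload mode the service handles."""
    by_mode = defaultdict(list)
    for shard, mode in shard_modes.items():
        by_mode[mode].append(shard)
    # One canary per mode keeps coverage without over-weighting outliers.
    return [sorted(shards)[0] for _, shards in sorted(by_mode.items())]

print(pick_canaries(SHARD_MODES))   # ['shard-02', 'shard-01']
```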

4. For critical services, deploy more canaries and monitor them for longer periods. This point seems obvious, but canaries deployed for critical services should have longer lives than canaries for non-critical services. Longer canary durations make it easier to detect issues like slow memory or resource leaks, and they expose the canary to more of the natural variation in your workload. For highly critical services, we recommend canary durations of 4 to 24 hours. For all other services, we recommend durations of at least 1 hour. Similarly, you should deploy more canaries for your critical services.
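One way to make that policy explicit is a simple criticality-to-canary mapping. The durations below come from the recommendations above; the tier names and canary counts are placeholders, not a prescribed standard:

```python
# Illustrative mapping of service criticality to canary policy. Durations
# follow the recommendations above (4-24 hours for highly critical services,
# at least 1 hour otherwise); tier names and canary counts are placeholders.
CANARY_POLICY = {
    "critical":     {"min_duration_hours": 4, "max_duration_hours": 24, "min_canaries": 3},
    "non_critical": {"min_duration_hours": 1, "max_duration_hours": 4,  "min_canaries": 1},
}

def policy_for(tier: str) -> dict:
    """Look up the canary policy for a service tier, defaulting to non-critical."""
    return CANARY_POLICY.get(tier, CANARY_POLICY["non_critical"])

print(policy_for("critical"))
```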

5. How you evaluate your canaries should not be a spur-of-the-moment or ad hoc decision. Define and document the metrics you’ll use to evaluate canary vs. non-canary instances. In most cases, error rates, response and transaction time, and throughput are the right core metrics to measure. Add additional metrics as needed, based on the specific profile of your service.
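Here's a sketch of what such a documented, repeatable comparison might look like, assuming you can already pull aggregate metrics for the canary and non-canary pools from your monitoring system. The thresholds and metric names are placeholders:

```python
# Placeholder thresholds for comparing canary vs. non-canary instances on
# the core metrics named above. In practice these would come from a
# documented evaluation policy, not be chosen ad hoc at deploy time.
THRESHOLDS = {
    "error_rate":       1.25,   # canary may be at most 25% worse
    "response_time_ms": 1.10,   # at most 10% slower
    "throughput":       0.90,   # must sustain at least 90% of baseline
}

def evaluate(canary: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return a list of failed checks; an empty list means the canary passes."""
    failures = []
    for metric, limit in THRESHOLDS.items():
        ratio = canary[metric] / baseline[metric]
        worse = ratio > limit if metric != "throughput" else ratio < limit
        if worse:
            failures.append(f"{metric}: canary/baseline ratio {ratio:.2f} vs. limit {limit}")
    return failures

canary   = {"error_rate": 0.012, "response_time_ms": 210.0, "throughput": 980.0}
baseline = {"error_rate": 0.010, "response_time_ms": 200.0, "throughput": 1000.0}
print(evaluate(canary, baseline) or "canary looks healthy")
```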

6. Your canary process should cover 5% to 10% of your service’s workload. For example, if a service tier normally comprises 100 instances, you should canary on 5 to 10 of those instances. A canary that covers only 1% to 2% of the workload is more likely to miss or underweight important cases; a canary that covers more than 10% of the workload may have too much impact if it doesn’t work as expected.
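The arithmetic is simple enough to encode as a guardrail. This is a minimal sketch of the 5% to 10% rule; the function name and the decision to round up are assumptions:

```python
import math

def canary_count(fleet_size: int, fraction: float = 0.05) -> int:
    """Size the canary set at 5-10% of the fleet, never less than one instance."""
    if not 0.05 <= fraction <= 0.10:
        raise ValueError("canary fraction should stay between 5% and 10%")
    return max(1, math.ceil(fleet_size * fraction))

# A 100-instance tier gets 5-10 canaries; a 12-instance tier still gets at least 1.
print(canary_count(100))         # 5
print(canary_count(100, 0.10))   # 10
print(canary_count(12))          # 1
```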

7. Your canary population should be small enough that complete failure of the initial canary set will not overwhelm remaining service nodes. The whole point of using canaries is that newly deployed changes may be less stable than code that’s currently in production. If canaries fail, or you need to quickly roll them back, you don’t want that to create a cascading failure in the rest of the service tier. Generally this means you need to have free capacity in the service tier that’s at least equal to the percentage of workload that will be handled by the initial canary set.
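A sketch of that capacity check, assuming you know the tier's current headroom as a fraction of total capacity (the function and parameter names are illustrative):

```python
def safe_to_canary(fleet_size: int, canaries: int, free_capacity_fraction: float) -> bool:
    """True if the non-canary nodes could absorb the canary workload outright.

    free_capacity_fraction is the headroom in the tier (e.g. 0.15 = 15% idle),
    which should be at least the share of workload the canary set will handle.
    """
    canary_share = canaries / fleet_size
    return free_capacity_fraction >= canary_share

# A 100-instance tier with 8 canaries needs at least 8% headroom.
print(safe_to_canary(fleet_size=100, canaries=8, free_capacity_fraction=0.15))  # True
print(safe_to_canary(fleet_size=100, canaries=8, free_capacity_fraction=0.05))  # False
```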

8. When practical, canary processes should include a canary population of more than one instance. Some service tiers are small enough that you can satisfy practices #6 and #7 with a single canary instance. For critical services that are scaled wider, though, using more than one canary helps reduce the chance that you’re evaluating your canaries on the basis of an outlier workload or host configuration.

9. If you use feature flags or blue-green deploys, use them in conjunction with canaries. When you use a feature flag, you typically deploy the new code path to all instances of a service, and the flag controls whether that new code path is used. Unfortunately, if you introduce the new code to your entire service tier and that code path proves to be problematic (for example, a regression ties up threads or other resources), the entire service will be impacted. In other words, you’re not guaranteed to have a bulkhead between your feature-flagged code and the rest of the service.

When you use blue-green deploys, flipping an entire service tier to a new release can have massive impact even if your rollback mechanisms are fast: your service’s dependencies may take time to recover, caches may need to be warmed, and you may face a thundering herd when the service restarts.

With that in mind, use canaries to predict the likelihood of issues for feature flags or blue-green deploys.
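One way to combine the two is to let the canary set act as the bulkhead for a flagged code path: even with the flag on, only canary instances run the new code until the canary has been judged healthy. The flag lookup, hostnames, and flag name below are hypothetical, not a specific feature-flag library:

```python
import os

# Hypothetical combination of a feature flag with canary membership: even
# when the flag is on, the new code path runs only on canary instances
# until the canary has been evaluated. All names here are illustrative.
CANARY_INSTANCES = {"app-host-03", "app-host-07"}

def flag_enabled(flag_name: str) -> bool:
    """Stand-in for a real feature-flag lookup (an env var here, a flag service in practice)."""
    return os.environ.get(f"FLAG_{flag_name.upper()}", "off") == "on"

def use_new_code_path(flag_name: str, hostname: str, canary_passed: bool) -> bool:
    """Gate the new code behind both the flag and the canary bulkhead."""
    if not flag_enabled(flag_name):
        return False
    # Before the canary is judged healthy, only canary instances run the new path.
    return canary_passed or hostname in CANARY_INSTANCES

os.environ["FLAG_FAST_INDEXER"] = "on"   # simulate turning the flag on
print(use_new_code_path("fast_indexer", "app-host-03", canary_passed=False))  # True  (canary host)
print(use_new_code_path("fast_indexer", "app-host-12", canary_passed=False))  # False (non-canary host, canary not yet judged)
```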

Define the strategies that work for you

These best practices make sense for us at New Relic given our current scale, complexity, and emphasis on reliability. Of course, the specifics of what makes sense for your organization may vary, but canary deploys are an amazing tool to help software teams move fast with confidence. Still, like any other strategy, the effectiveness of canaries depends on good planning and design, and these best practices should get you on the right path to helpful, workable canary deploys.