The Circuit Breaker Pattern Is A Great Tool (When Used Appropriately)

At some point in your life, you’ve almost certainly had to flip a circuit breaker in your home or apartment. Essentially an automated electrical switch, a circuit breaker protects an electrical circuit and "trips" open during surges, physically stopping the flow of electricity before it can cause serious damage.

In software, the circuit breaker pattern follows the same approach, and I urge you to check out Martin Fowler’s description for a detailed explanation. Developers can use a circuit breaker to prevent a resource dependency (typically a downstream HTTP service or database) from becoming overloaded. The circuit trips open automatically based on configured settings, like elevated response time, timeouts, or other errors, and then automatically closes (again, based on configurations such as elapsed time or some other trigger), ideally after the dependency has recovered. In some cases, this circuit breaker pattern can help you reduce overall downtime if you allow the dependencies to recover on their own before you start hammering on them.
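To make the open-and-close mechanics concrete, here is a minimal sketch of a breaker that trips after a number of consecutive failures and lets a trial call through once a timeout has elapsed. The class name, parameter names, and defaults are illustrative assumptions rather than any particular library's API, and real implementations usually add thread safety and richer triggers.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # All names and defaults here are hypothetical, chosen for illustration.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds the circuit stays open
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hitting the struggling dependency.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: let one trial call through (often called "half-open").
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_orders, customer_id)`, where `fetch_orders` stands in for whatever function talks to the downstream service. Even in this small sketch there are already two values to tune: the failure threshold and the reset timeout.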

In this post, I’ll lay out a few considerations to help you decide when it’s appropriate—or not—to use this pattern.

Circuit breakers require careful tuning

Circuit breakers are an example of self-healing software techniques that are fantastic for building resilient systems. They’re also hard to get right, which is why we shouldn’t use them thoughtlessly.

The pattern is extremely alluring: It can prevent thundering herds, "OutOfMemoryError death spirals," and all manner of problems that come from temporarily overloaded systems. And using this pattern seems easy enough: Simply stick a circuit breaker library in front of your API calls, and get ready for self-healing software. Such promises are hard to resist.

However, I assert that implementing a circuit breaker pattern requires careful tuning:

  • You have to tune the threshold (error rate or throughput) at which the circuit opens.
  • You have to tune the time for which the circuit stays open, and possibly tune other triggers that allow the circuit to close.
  • You need to tune these settings on a per-environment and per-endpoint basis at the very least—and, in some cases, possibly even on a per-request basis if requests to the same endpoint are not all equal.
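One way to picture the tuning surface is as a settings table keyed by environment and endpoint. The sketch below is only an illustration: the endpoint names, environments, and numbers are hypothetical, and real values should come from measuring each dependency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerSettings:
    error_rate_threshold: float  # fraction of failed calls that opens the circuit
    open_seconds: float          # how long the circuit stays open before retrying


# Hypothetical fallback used when no specific tuning exists.
DEFAULTS = BreakerSettings(error_rate_threshold=0.5, open_seconds=30.0)

# Hypothetical per-environment, per-endpoint tunings.
SETTINGS = {
    ("production", "GET /orders"): BreakerSettings(0.25, 10.0),
    ("production", "POST /reports"): BreakerSettings(0.10, 60.0),
    ("staging", "GET /orders"): BreakerSettings(0.75, 5.0),
}


def settings_for(environment, endpoint):
    """Return the tuned settings for an endpoint, or the defaults if none exist."""
    return SETTINGS.get((environment, endpoint), DEFAULTS)
```

Every row in a table like this is a value someone has to measure, justify, and revisit, which is where the maintenance cost comes from.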

Most important, though, you have to maintain these tunings as both your service and the underlying dependencies evolve. To be more direct: Proactive tuning of circuit breakers is critical.

A poorly tuned circuit breaker is a problem

Here are four issues you may encounter if you don’t tune your circuit breaker correctly and regularly (in some cases, you could see these even if you have tuned it):

  1. Your service provides less throughput than it should be able to provide.
  2. You prolong downtime: Requests keep failing for as long as the circuit stays open, even if the underlying resource has already recovered.
  3. You create bursty, on-and-off request patterns that downstream service owners may find hard to interpret.
  4. You get paged for errors that wouldn't have happened if there were no circuit breaker.

That is to say, naively applying a circuit breaker can easily cause more problems than it solves.
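To put a rough number on the second issue, consider a deliberately simplified model: a purely time-based breaker that opens at time zero and rejects every call until its open period ends. The function name and the model itself are assumptions made for illustration.

```python
def extra_downtime(open_seconds, recovery_seconds):
    """Seconds of avoidable failures after the dependency has already recovered.

    Simplifying assumption: the breaker opens at t=0 and rejects every call
    until open_seconds have elapsed, regardless of the dependency's health.
    """
    return max(0.0, float(open_seconds) - float(recovery_seconds))
```

Under this model, a breaker that stays open for 60 seconds while the dependency recovers in 5 adds 55 seconds of failures that would not otherwise have happened. If the dependency takes longer to recover than the breaker stays open, the function returns zero, and the problem shifts to the breaker closing too early.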

Three questions to ask before deploying the circuit breaker pattern

Lest anyone accuse me of harsh criticism, let me make it clear that I think circuit breakers can be quite useful—if used strategically. Rather than liberally sprinkling circuit breakers over all your API calls, consider the following questions:

  1. Will a circuit breaker enhance your service?
    • Is it possible for a dependency downstream of your service to become overloaded? If not, you probably don’t need a circuit breaker.
    • What will happen to the dependency if it gets overloaded? Will reducing requests help it recover?
    • Can you define a heuristic for when your circuit breaker should open and close? Should it close based on elapsed time, or some other trigger?
    • Do you have errors that are highly correlated—in other words, if you see one, you're likely to see many of the same—and would it be better to fail fast in this scenario?
  2. Can you deploy a circuit breaker?
    • Do you have a way to accurately measure what load your downstream dependency can and can't handle? Can you tune your circuit breaker? Do you know how the “knobs” work?
    • What monitoring do you need to verify that your circuit breaker is working as expected?
    • Can you override the circuit breaker in an emergency if it stops working as expected?
  3. Can you afford the cost of maintaining a circuit breaker?
    • Will you take the time to re-measure your dependency's limit as part of your routine capacity planning process, and then tune your circuit breaker accordingly?
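The monitoring and emergency-override questions can be sketched as a thin wrapper around whatever decides that the circuit is open. The class name, the override values, and the counter names below are hypothetical. In practice the override would likely be driven by a feature flag or configuration system, and the counters would be exported to your monitoring tool.

```python
from collections import Counter


class OverridableGate:
    """Wraps a breaker's open/closed decision with a manual override and counters."""

    def __init__(self, is_open):
        self.is_open = is_open    # callable returning True when the breaker is open
        self.override = None      # None, "force_closed", or "force_open"
        self.metrics = Counter()  # counts to export to a monitoring system

    def allow_request(self):
        if self.override == "force_closed":
            allowed = True        # operator says: always let traffic through
        elif self.override == "force_open":
            allowed = False       # operator says: shed all traffic
        else:
            allowed = not self.is_open()
        self.metrics["allowed" if allowed else "rejected"] += 1
        return allowed
```

Watching the ratio of rejected to allowed requests over time is one way to notice a breaker that trips more often than the dependency's actual health warrants.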

In the end, a circuit breaker may very well be the right choice for your service and your customers, but don’t jump to that conclusion without analyzing the benefits versus the costs.

And, finally, if you're also having trouble with your real-world circuit breaker tripping, here's a pro tip: Stop plugging your entire cryptocurrency mining fleet into one circuit.