Netflix isn’t just the home of bingeable TV shows streamed over the Internet. It also gave birth, out of necessity, to the discipline of chaos engineering.
While the term may sound like an oxymoron or the title of a bad science-fiction movie, it’s actually an increasingly popular approach to improving the resiliency of complex, modern technology architectures.
This post is intended to help explain exactly what chaos engineering is and how it’s used. But first, a quick history lesson can help put chaos engineering into perspective.
Over the years, Netflix evolved its infrastructure to support increasingly complex and resource-hungry activities, especially as its customer base grew to 100 million users in more than 190 countries. The company’s original rental and streaming services ran on racked on-premises servers, but this created a single point of failure and other issues. Famously, in August 2008, corruption in a major database caused a three-day outage during which Netflix could not ship any DVDs. In response, Netflix engineers set out to find an alternative architecture, and in 2011, they migrated the company’s monolithic on-premises stack to a distributed cloud-based architecture running on Amazon Web Services.
This new, distributed architecture, composed of hundreds of microservices, removed that single point of failure. But it also introduced new types of complexity that required significantly more reliable and fault-tolerant systems. It was at this point that Netflix’s engineering teams learned a critical lesson: Avoid failure by failing constantly.
A new use for chaos
To do this, Netflix engineers created Chaos Monkey, a tool they could use to proactively cause failures in random places at random intervals throughout their systems. More specifically, as stated by the tool’s maintainers on GitHub, “Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment.” With Chaos Monkey, engineers quickly come to learn if the services they’re building are robust and resilient enough to tolerate unplanned failures.
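As a rough illustration of the idea, the sketch below models random termination against an in-memory "fleet." It is a toy model only, not Chaos Monkey's actual selection logic or API: the names `pick_victims` and `run_chaos_round`, the instance names, and the per-instance probability are all assumptions made for this example.

```python
import random


def pick_victims(instances, probability, rng=None):
    """Return the instances a chaos round would terminate.

    Each instance is selected independently with the given probability.
    Hypothetical helper for illustration, not part of Chaos Monkey.
    """
    rng = rng or random.Random()
    return [name for name in instances if rng.random() < probability]


def run_chaos_round(fleet, probability, rng=None):
    """Remove the selected instances from the fleet and report them."""
    victims = pick_victims(sorted(fleet), probability, rng)
    for name in victims:
        # Stand-in for a real "terminate instance" call to a cloud provider.
        fleet.discard(name)
    return victims


# Hypothetical fleet; a seeded generator keeps the run repeatable.
fleet = {"api-1", "api-2", "api-3", "worker-1", "worker-2"}
terminated = run_chaos_round(fleet, probability=0.3, rng=random.Random(42))
print("terminated:", terminated)
print("still running:", sorted(fleet))
```

The interesting part is what happens next: whether the services that depended on the terminated instances keep working without anyone stepping in.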
And with the advent of Chaos Monkey, a new discipline was born: chaos engineering, described as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
In 2012, Netflix released Chaos Monkey under an open source license. Today, numerous companies, from Google to Amazon to IBM to Nike, all practice some form of chaos engineering to improve the reliability of their modern architectures. Netflix has even extended its chaos-engineering toolset to include an entire “Simian Army,” with which it attacks its own systems.
Chaos engineering: not really all that chaotic
Kolton Andrus, CEO of chaos engineering startup Gremlin, who worked at both Google and Netflix, suggests thinking of chaos engineering as a flu shot. It may seem crazy to deliberately infuse something harmful into your body in hopes of preventing a future illness, but this approach also works with distributed cloud-based systems, Andrus said. Chaos engineering involves carefully injecting harm into systems to test the systems’ response to it. This allows companies to prepare and practice for outages, and to minimize the effects of downtime before it occurs.
The operative word here is carefully. It’s a mistake to think of chaos engineering as actually chaotic. In fact, very few such tests are random. Instead, chaos engineering involves thoughtful, planned, and controlled experiments designed to demonstrate how your systems behave in the face of failure.
“Of all the chaos engineering experiments that I have conducted with customers over the last year, I can probably count just one or two that have had a random quota to them,” Russ Miles, founder and CEO of ChaosIQ.io, a European chaos engineering platform, said in an interview. “Most of them are very careful, very controlled, proper experiments. It really has nothing to do with randomness, unless randomness is the thing you're trying to test for.”
Minimizing the blast area
Tom Petrocelli, a research fellow at Amalgam Insights, said in an interview that one key chaos engineering best practice is to “minimize the blast area. That means minimizing the effects on the business—not necessarily on the technology.”
“Yes, you want to discover the holes in your technology’s resilience,” Petrocelli stated, “but you want to do so in a way that doesn’t damage business operations.”
To make sure they don’t muck up the business, Petrocelli advised engineering teams to “plan meticulously” for chaos engineering work. If you’re lucky, he said, something will go wrong that you didn't expect to go wrong, which is actually considered a success in the world of chaos engineering.
With that in mind, Petrocelli said, it’s critical to make sure that you have the right team in place to fix anything that might break. “Don’t mess with Kubernetes containers if all your Kubernetes engineers are at an offsite meeting,” he warned.
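One way to picture "minimizing the blast area" in code is a pair of guards: one that caps how many targets an experiment may touch, and one that signals an abort when a business metric falls too far. This is only a sketch under stated assumptions. The function names, the host names, the 25 percent cap, and the 5 percent drop threshold are hypothetical and do not come from Petrocelli or from any particular tool.

```python
import math


def limit_blast_area(candidates, max_fraction):
    """Cap the number of targets a single experiment may touch."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    # Always allow at least one target so the experiment can still run.
    limit = max(1, math.floor(len(candidates) * max_fraction))
    return candidates[:limit]


def should_abort(baseline, current, max_drop):
    """Return True when the metric fell more than max_drop below baseline.

    max_drop is a fraction of the baseline, e.g. 0.05 for 5 percent.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline > max_drop


# Hypothetical fleet of 20 hosts; only a quarter may be disturbed.
hosts = [f"host-{n}" for n in range(20)]
targets = limit_blast_area(hosts, max_fraction=0.25)
print("targets:", targets)

# Hypothetical business metric (orders per minute) before and during the test.
print("abort?", should_abort(baseline=1000, current=940, max_drop=0.05))
```

Note that the abort check watches a business metric rather than a technical one, in line with the advice to limit the effect on the business first.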
Not just testing: experiments to generate knowledge
Casey Rosenthal, a former engineering manager on Netflix’s Chaos Team, made it clear in a DZone Q&A that chaos engineering is not just a case of testing systems. Testing looks for a binary output. Did something pass a specific challenge? Yes or no? Chaos engineering, on the other hand, is a formal method to generate new knowledge. Precisely because modern software systems are often too complex for anyone to fully understand them, engineers perform experiments to reveal more about the systems. Testing is still critical, said Rosenthal in the Q&A, but chaos engineering should complement traditional testing.
Chaos engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments often follow four steps:
1. Define and measure your system’s “steady state.” Start by pinpointing metrics that indicate, in real time, that your system is working the way it should. Netflix uses the rate at which customers press the play button on a video streaming device as its steady state, calling this “streams per second.” Note that this is more a business metric than a technical one; in fact, in chaos engineering, business metrics are often more useful than technical metrics, as they’re better suited to measuring customer experience or operations.
2. Create a hypothesis. As with any experiment, you need a hypothesis to test. Because you’re trying to disrupt the usual running of your system—the steady state—your hypothesis will be something like, “When we do X, there should be no change in the steady state of this system.” Why phrase things that way? Because if you have a reasonable expectation that a particular action on your part will change the steady state of a system, then the first thing you should do is fix the system so the action will not have that effect. Your chaos engineering activities should involve real experiments, involving real unknowns.
“Chaos engineering is not for the type of incident that is fairly predictable, covered by runbooks, that you know you have to automate but just haven’t gotten around to it yet,” said Beth Long, DevOps solutions strategist—and former site reliability engineer—at New Relic. “You need it for the kinds of things that arise from the nature of complexity itself. Where everyone piles into Slack and scratches their heads because they don’t know what to think.”
3. Simulate what could happen in the real world. In the O'Reilly book, Chaos Engineering: Building Confidence in System Behavior through Experiments, the Netflix architects of chaos engineering, Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri, suggest a number of chaos engineering experiments:
- Simulate the failure of a datacenter
- Force system clocks to become out of sync
- Execute a routine in driver code that emulates I/O errors
- Induce latency between services
- Randomly cause functions to throw exceptions
In general, you want to simulate scenarios that have the potential to make your systems become unavailable or cause their performance to degrade. Ask yourself, “What could go wrong?” and then simulate that. Be sure to prioritize potential errors, too. “When you have really complex systems, it's very easy to get downstream effects that you didn't anticipate, which is one of the things that chaos engineering is looking to find,” said Petrocelli. “So the more complex the system is and the more vital it is, the more likely a candidate it is for chaos engineering.”
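Two of the experiments listed above, inducing latency and randomly throwing exceptions, can be approximated inside a single process with a wrapper around a function call. The following is a minimal sketch, not how Netflix or any chaos engineering tool implements fault injection: `inject_faults`, `InjectedFault`, and `fetch_recommendations` are hypothetical names, and the failure rate and delay are arbitrary.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Marks a failure that was injected deliberately."""


def inject_faults(failure_rate=0.0, latency_seconds=0.0, rng=None):
    """Decorator that adds a delay and random failures to a function."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                time.sleep(latency_seconds)  # simulated network latency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical service call: roughly 20% of calls fail, each takes 1 ms longer.
@inject_faults(failure_rate=0.2, latency_seconds=0.001, rng=random.Random(7))
def fetch_recommendations(user_id):
    return [f"title-{user_id}-{n}" for n in range(3)]


failures = 0
for _ in range(50):
    try:
        fetch_recommendations(1)
    except InjectedFault:
        failures += 1
print(f"{failures} of 50 calls failed")
```

The useful observation is how the callers behave: whether they retry, fall back to a default, or pass the failure downstream.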
4. Prove or disprove your hypothesis. Compare your steady-state metrics to those you gather after injecting the disturbance into your system. If you find differences in the measurements, your chaos engineering experiment has succeeded—you can now go on to strengthen and prepare your system so a similar incident in the real world doesn’t cause problems. Alternatively, if you find that your steady state remains steady, you may walk away with a higher degree of trust in that part of your system.
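The four steps can be sketched as a small experiment harness. This is an illustrative outline under stated assumptions, not a real chaos engineering framework: `run_experiment`, `ExperimentResult`, the 5 percent tolerance, and the simulated cache-backed system are all invented for the example, and the "streams per second" values are made-up numbers.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ExperimentResult:
    baseline: float
    observed: float
    deviation: float
    hypothesis_held: bool


def run_experiment(measure, inject, rollback, samples=5, tolerance=0.05):
    """Measure steady state, inject a fault, measure again, then compare."""
    baseline = mean(measure() for _ in range(samples))  # step 1: steady state
    if baseline == 0:
        raise ValueError("baseline metric must be non-zero")
    inject()  # step 3: introduce the disturbance
    try:
        observed = mean(measure() for _ in range(samples))
    finally:
        rollback()  # always restore the system, even if measuring fails
    deviation = abs(observed - baseline) / baseline
    # Steps 2 and 4: the hypothesis is "no meaningful change in steady state".
    return ExperimentResult(baseline, observed, deviation, deviation <= tolerance)


# Simulated system: throughput drops when the (hypothetical) cache is down.
system = {"cache_up": True}


def streams_per_second():
    return 1000.0 if system["cache_up"] else 700.0


def kill_cache():
    system["cache_up"] = False


def restore_cache():
    system["cache_up"] = True


result = run_experiment(streams_per_second, kill_cache, restore_cache)
print(result)
```

In this simulated run the metric drops by 30 percent, so the hypothesis does not hold. By the logic above, that counts as a successful experiment, because it exposes a dependency worth hardening.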
Don’t break your system—learn about it and improve it
“Chaos engineering is not about breaking things per se—it's never been about breaking things—but about learning,” said Miles. “You're trying to introduce a learning loop for the team, and the way that humans in groups absorb information best is through experience.”
Of course, you can (and do) learn from actual outages, but that’s very painful, noted Miles. “Chaos engineering gives you the chance to do these ‘pre-mortems’ that are within your control.”
Chaos engineering also taps the brains of the people who know a complex system best. According to Long, “the more interesting chaos engineering experiments are based not on important-but-obvious hypotheses like, ‘If this rack fails, that service should increase latency but remain available,’ but on hypotheses that you're unlikely to think up without a strong intuitive understanding of the system and any recent incidents. The chaos engineering process helps convert that expert intuition into explicit, testable hypotheses, exposing valuable information not easily derived just from looking, as an outsider, at the system itself.”
There are many toolkits available, said Matthew Fellows, principal consultant at DIUS, a consulting firm based in Melbourne, Australia, which has run chaos engineering projects for clients. (Check out this curated list of chaos engineering resources on GitHub: https://github.com/dastergon/awesome-chaos-engineering.) “Go ahead, get Chaos Monkey, and use it to blow up one of your instances,” Fellows suggested. “It’s pretty scary if you’ve never done it before, but definitely worth it.”