When was the last time your engineering organization had a conversation about “resilience”? Have any of your teams, whether developers, site reliability engineers (SREs), or product managers, defined what “resilience” should mean to your organization? We use this term a lot in modern software, usually to point at a set of properties that explain how our systems survive and thrive.
Take New Relic as an example. It’s been ten years since Lew Cirne founded the company, and in that time we’ve gone from a complicated system to a complex one. We started as a scrappy little Ruby on Rails monitoring application, but now we’re a truly complex distributed system. With complexity comes inescapable uncertainty, and a host of new challenges and opportunities. So, how do we thrive and grow and stay on the leading edge?
Our resilience is the key, allowing us to navigate all of that.
Exploring more closely how we use the term “resilience” can help us think more productively about our systems in general and about reliability in particular.
Four concepts of resilience
In “Four concepts for resilience and the implications for the future of resilience engineering,” Dr. David Woods identifies four ways we often talk about resilience.
The first two concepts:
Rebound: Dr. Woods defines this as the ability of a system to recover from a trauma. We most often use this sense of resilience when we look at how our systems recover from specific outages.
Graceful extensibility: This concept refers to our system’s ability to stretch and grow in response to surprises. In this sense, microservices are typically more “resilient” than monoliths; that is, they’re more “extensible.”
For this post, though, I’m going to focus on Dr. Woods’ two remaining concepts:
Robustness: According to Dr. Woods, “robustness” is the word we most commonly use interchangeably with “resilience.” Robustness is about more than our system’s ability to rebound from a single shock; it’s about how our systems absorb a wider and wider range of disturbances without breaking. So when we ask if our systems are “resilient” to particular kinds of failure, we’re really asking if they’re “robust” enough to handle those failures.
Think of a bridge. A good bridge can withstand stress from wind and weather; it can support constant traffic; it can lose a certain number of supports and still maintain its integrity, say, across a canyon. The builders of the bridge anticipated any number of disturbances, and then engineered the system to be able to withstand those disturbances within certain tolerances.
Robustness is important. We engineer our systems to be able to withstand traffic spikes. We architect them so they can scale horizontally. We build in n+2 redundancy so we can lose multiple hosts without causing severe problems for ourselves or our customers.
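To make the n+2 idea concrete, here is a minimal sketch in Python. The function names, the per-host capacity, and the traffic figures are all hypothetical and chosen only for illustration; real capacity planning involves far more than a single number per host.

```python
import math


def hosts_to_provision(peak_load: float, capacity_per_host: float, spare: int = 2) -> int:
    """Hosts needed to serve peak load, plus `spare` extra hosts (n+2 by default)."""
    if peak_load < 0 or capacity_per_host <= 0 or spare < 0:
        raise ValueError("load and spare must be non-negative, capacity must be positive")
    needed = math.ceil(peak_load / capacity_per_host)  # this is "n"
    return needed + spare


def survives_failures(provisioned: int, failed: int,
                      peak_load: float, capacity_per_host: float) -> bool:
    """True if the hosts still running can carry the peak load."""
    remaining = max(provisioned - failed, 0)
    return remaining * capacity_per_host >= peak_load


# Hypothetical service: 9,000 requests/second at peak, 1,000 per host.
fleet = hosts_to_provision(peak_load=9000, capacity_per_host=1000)
print(fleet)                                       # size of the n+2 fleet
print(survives_failures(fleet, 2, 9000, 1000))     # losing two hosts
print(survives_failures(fleet, 3, 9000, 1000))     # losing three hosts
```

In this sketch the fleet tolerates exactly the number of simultaneous host losses it was sized for, which is the bridge-style robustness described above: good within anticipated tolerances, and silent about anything outside them.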
Adaptive capacity: This concept of resilience points to our system’s ability to adapt and grow in response to dramatic changes in its environment. “Adaptive capacity” encompasses more than how we respond to failure; it also includes how well we’re able to take advantage of new opportunities and respond to unexpected challenges. This concept of resilience is the most relevant to complex distributed systems.
Resilience as “adaptive capacity” takes us out of the more deterministic, industrial model into the world of complex ecosystems.
In other words, while a bridge can be robust to many stresses over time, it can’t adapt if the landscape itself changes. But a city can, and many are doing just that: low-lying coastal cities are working to become more resilient to climate change and rising sea levels. Many cities are also taking advantage of new opportunities to instrument themselves, using data from citizens’ smartphones to improve life for residents.
Measuring the robustness and adaptive capacity of resilient reliability
At the end of New Relic’s first year, we had fewer than a dozen customers using just one product. Today, the New Relic Platform is integral to the operation of thousands of software systems around the world, many of which are critical to the operation of modern society in industries like finance, transportation, healthcare, and insurance.
Our customers have one thing in common: they all want more nines. They want constant “uptime.” From a marketing perspective, this sounds great—“We give you more nines.” But if you’re skeptical, like me, you might be wondering, more nines of what? What does “more nines” actually mean?
The real question we’re asking here is this: how do we measure the robustness of our reliability in a way that’s useful?
A simple uptime measurement might be useful for small systems, but it’s next to useless in a complex system like the one we have at New Relic. What exactly needs to be “up”? We have a range of products, each composed of multiple subsystems, fed by thousands of agents pumping data into multiple ingest pipelines that process the data into multiple data stores, which are in turn accessed from a variety of interfaces and APIs.
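For reference, this is what a single “nines” figure says when taken at face value. The sketch below is plain arithmetic, not anything specific to New Relic; the function name and the 30-day period are assumptions made for illustration.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_30_DAYS) -> float:
    """Downtime implied by one availability ratio over the given period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} available -> about {minutes:.1f} minutes down per 30 days")
```

Notice what the number leaves out: it treats the whole system as one thing that is either up or down, and says nothing about which product, which pipeline, or which customers were affected.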
The problem is that reliability isn’t binary. Complex systems like ours aren’t “up” or “down.” Reliability is more like a topography. A landscape. There are mountains of technical debt and valleys of high reliability, and happy plains where everything runs smoothly until one day a random pull request rips a chasm in the ground.
Of course, just because it’s tricky doesn’t mean we can give up on measuring reliability. We get to draw the boundaries, so we look for boundaries that make sense and give us a meaningful view of system health.
In other words, we look for ways to make our software landscape adaptive and robust.
One widely accepted way of measuring reliability comes from Google, as outlined in the Site Reliability Engineering book: each service has service level indicators (SLIs) and service level objectives (SLOs) that measure the health of that service. This approach gives a nuanced, granular view of system health.
At New Relic, we currently use a measure of system health called Qdex. This is an availability measurement based on our incident data. It takes into account three things:
- How many customers were impacted by an outage this month?
- How severe was the impact?
- How long did the impact last?
From this data, we calculate a weighted availability measurement that allows us to compare how well products and teams are performing in a given month.
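The exact Qdex calculation isn’t spelled out here, so the following is only a hypothetical sketch of how the three inputs above could be combined into one weighted availability number. The `Incident` fields, the 0-to-1 severity scale, and the formula itself are assumptions for illustration, not New Relic’s actual method.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    customers_impacted: int   # how many customers were affected
    severity: float           # hypothetical weight: 0.0 (negligible) to 1.0 (total outage)
    duration_minutes: float   # how long the impact lasted


def weighted_availability(incidents: Iterable[Incident],
                          total_customers: int,
                          period_minutes: float) -> float:
    """Illustrative score: 1.0 minus the customer- and severity-weighted downtime share."""
    if total_customers <= 0 or period_minutes <= 0:
        raise ValueError("total_customers and period_minutes must be positive")
    lost_minutes = 0.0
    for incident in incidents:
        share = min(incident.customers_impacted / total_customers, 1.0)
        lost_minutes += share * incident.severity * incident.duration_minutes
    return max(0.0, 1.0 - lost_minutes / period_minutes)


# A made-up month for a product with 1,000 customers.
month = [
    Incident(customers_impacted=100, severity=1.0, duration_minutes=60),
    Incident(customers_impacted=500, severity=0.2, duration_minutes=120),
]
score = weighted_availability(month, total_customers=1000, period_minutes=30 * 24 * 60)
print(f"{score:.4%}")
```

Under these assumptions, a short, severe outage for a few customers and a long, mild degradation for many can cost a similar amount, which is the kind of comparison a single uptime figure cannot express.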
Whether you use formal SLIs and SLOs for all your services, or a more homegrown measurement like our Qdex, your reliability measurements should focus on customer impact and should motivate teams to focus on improving both the robustness and the adaptive capacity of their services in ways that keep the customer foremost in their minds.
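For teams going the SLI and SLO route, a minimal sketch of the arithmetic might look like the following. The request counts and the 99.9% target are invented, and the function names are hypothetical; the “error budget” framing comes from the Site Reliability Engineering book rather than from anything New Relic-specific.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the fraction of events that were good."""
    if total_events < 0 or not 0 <= good_events <= max(total_events, 0):
        raise ValueError("need 0 <= good_events <= total_events")
    return 1.0 if total_events == 0 else good_events / total_events


def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget left; negative means the SLO was missed."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    if not 0 <= good_events <= total_events:
        raise ValueError("need 0 <= good_events <= total_events")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
print(f"SLI: {sli(999_500, 1_000_000):.4%}")
print(f"Budget left: {error_budget_remaining(999_500, 1_000_000, 0.999):.0%}")
```

A budget that stays nearly full month after month is worth noticing too, since it suggests there is room to take more risk.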
Practical steps for robustness and adaptive capacity in resilient reliability
So what are some practical ways we can apply robustness and adaptive capacity to the reliability space?
1. We can fail small
Reliability is a customer-driven idea. We don’t improve reliability so that we can get a better reliability score for the month; we improve reliability to improve our customers’ experiences.
When we fail small, we don’t necessarily reduce all risk, but we do reduce the number of customers a risk is likely to impact. We reduce the severity of the impact to those customers, and we make it so the impact doesn’t last as long.
When we fail small, we prioritize the work that will protect our customers. This might mean, for example, that before you tackle bigger reliability challenges for an unstable frontend service, you update your 404 pages with a login so that during outages, users can still sign in. Or you architect your data ingest to be most robust for the pathways that carry most of the traffic before you fortify the pathways that are edge cases.
2. We can fail smart
Our systems are complex, filled with uncertainty. We’ll never be able to eliminate that uncertainty, but we can choose where to dig. That’s where incidents come in: incidents are our dowsing rods. They’re magical little events that stop us in our tracks and point us at the spots where digging under the surface will be most productive.
Incidents are not distractions. If we continue to think of incidents as distractions from our “real work,” we’re rejecting an incredibly rich source of insight. We maximize our “return on incidents” by using incidents to dowse for uncertainty and to find new ways to adapt.
3. We can fail more (maybe)
We’ll always, all of us, have incidents. Our production software systems will always suffer outages, whether large or small. And there will always be variations in the nature and frequency of the problems we encounter in our systems.
This brings us to an interesting tension between robustness and adaptive capacity. We have to be robust enough to survive, but if we fixate on a particular point in time—if we become too robust—we become less flexible. We actually reduce our capacity to adapt to new, surprising situations.
In that case, we need to fail more. We need to use some of the capital we’ve built up to experiment and push our boundaries. This is another way of saying: don’t over-invest in reliability. When our services become highly robust, maybe we’ve created some space to experiment and grow. Are we leaving excess reliability on the table? Maybe we can take more risks and invest in our long-term flexibility and adaptive capacity so we can address the challenging work we’ll face in the coming years.
Robust and adaptive resilience through time
Resilience is about a lot more than how a system responds to an immediate threat.
Incidents are moments in time when our systems’ resilience (or lack of it) becomes most visible. In a complex distributed software system like New Relic, resilience grows or dies depending on many factors: hardware resource availability, system architecture and constraints, successes and failures of on-call rotations and incident response processes, strong or poor engineering habits, workable deployment mechanisms, and so forth.
Some of the things we’ve done at New Relic to improve our robust and adaptive resilience include containerizing our databases, releasing teams from operational burdens by moving their deployment workflows into container fabric (our container orchestration platform), and laying the groundwork to launch an entire instance of New Relic in Europe.
We also gain resilience from our internal team health assessments, which increase the ability of our teams to do the right work and to do it well. We gain resilience based on who we hire, how we onboard them, and how we train and educate our employees. We gain resilience from the psychological safety of a blameless culture and from the superior performance that emerges from fostering inclusivity and equity.
Our sources of resilience stretch back ten years. They live, today, in every person at New Relic.
This post was strongly influenced by the work New Relic is doing with the SNAFU Catchers.