If you want to learn the ins and outs of site reliability engineering, it makes sense to learn from the folks actually doing the job. And that’s exactly what this post is all about: spending time with Liz Fong-Jones, staff SRE at Google, and New Relic’s own vice president of software engineering, Matthew Flaming.
Liz has worked up and down the stack since joining Google in 2008, and is an amazing fount of knowledge on reliability practices and the burgeoning SRE role. Matthew started doing DevOps before it had a name, and currently focuses on reliability from both a technical and cultural perspective.
The two experts shared the stage at FutureStack: New York, where they discussed the SRE role and a bunch of related topics, including the critical relationship between SRE and scalability, the many ways to measure reliability, best practices for building an SRE function, and much more.
The importance of site reliability engineering
New Relic is very focused on SRE practice, Matthew notes, “because we feel it’s maybe the purest distillation of DevOps principles into a particular role. And it’s something we try very hard to empower both internally in the organization and for our customers.”
Liz, meanwhile, explains Google’s distinction between SREs and other software engineers: While everyone takes responsibility for their code and the reliability of their production systems, SREs are charged with developing a particularly specialized depth of understanding of how different systems work together, how they fail, how they can be improved, how they can best be designed and monitored—expertise that they must then share with their counterparts who are more focused on product development.
You can watch the entire video at the end of this post, or read on for some of the most useful highlights.
The two axes of scaling
According to Matthew, there are two types, or “axes,” of scale that companies must plan for. The first axis is workload—the number of physical hosts or virtual machines and other resources that must be able to grow efficiently in concert with the services that run on them. The second axis is complexity, in terms of the number of service dependencies and the growth of the organization itself. Fundamentally, site reliability engineering is about enabling both forms of scalability.
Liz supports that idea based on lessons learned in her early days at Google, circa 2009, working on the Bigtable database service. Back then there was a lot of manual effort involved in running Bigtable as a shared service throughout the company. The footprint was still relatively small, but it quickly became apparent that it would have trouble scaling. As Liz notes, the non-SRE (or non-DevOps) solution would have been, “Let’s hire some more sysadmins to help handle this torrent of incoming tickets.” But that approach is just throwing more people at a problem, not actually solving it for long-term scalability.
Scalability, Liz says, typically boils down to automation: “The evolution over 10 years of ‘how do we go from handling dozens of footprints to hundreds or thousands of footprints?’ is ‘Let's automate.’”
Resource management, self-service provisioning, and other areas are important to scaling, too—optimizing automation is a key part of the SRE discipline. With Bigtable, Liz says, this involved giving infrastructure ownership back to teams so that they had visibility into their workloads and could allocate and manage their own resources more effectively, rather than just firing off requests to an ops team. The shift made those client teams happier, too, because they were more empowered to self-govern their infrastructure.
Small team + big scale = automation
In many situations, relatively small SRE teams must support enormous scale. (This should be of particular interest to teams that don’t have Google’s resources.) Fortunately, Liz and Matthew agree that you don’t need to be Google-sized to benefit from an SRE practice. In fact, Google itself made a conscious choice to maintain a relatively small, central SRE team that could support dozens of software engineering teams across the company.
As noted, the key to making it all work is focusing on automation. As Liz puts it, the goal is not to solve problems by throwing more people at them, “but instead making sure that we have enough capacity so that we’re not overwhelmed doing tickets all the time … and instead carve off time for automation. And then design our way out of situations where even the automation looks like it’s brittle and creating problems.”
According to Liz, doing this well requires standardizing on processes and tooling: “One SRE team is going to have a really difficult time supporting fifty different software engineering teams if they’re each doing their own separate thing and they’re each using separate tooling.… The approach that we’ve relied on instead is trying to formalize what the best practices are. How can we reach as many of our product development software engineering colleagues as possible using those best practices?”
The importance of SLOs, SLIs, and measuring reliability
Service level objectives, or SLOs, can be equally important to site reliability engineering success. As Liz puts it, an SLO is essentially a representation of the appropriate level of reliability for a particular service; it will vary from application to application. Without SLOs, the SRE team is essentially throwing darts at a board with no target.
But Liz notes that, by itself, an SLO doesn’t necessarily define what reliability is. That’s where service level indicators (SLIs) come in. These are performance metrics that represent some facet of the business. For example, this could be something like “The fraction of user queries that are successfully completed within 200 milliseconds without error.”
Defining the two is critical for every system component. “The SLO is the part where we talk about number of nines, like 99.9% availability,” Liz says. “But the SLI refers to things like, in this individual minute or across this number of months, how many queries were successfully served?”
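To make the distinction concrete, here is a minimal Python sketch of the example SLI quoted above (the fraction of queries completed within 200 milliseconds without error) checked against a 99.9% SLO. The `Query` record, the function names, and the sample traffic are all hypothetical and chosen only for illustration; this is not how Google or New Relic computes these values.

```python
from dataclasses import dataclass


@dataclass
class Query:
    latency_ms: float  # time taken to serve the query
    ok: bool           # True if the query completed without error


def sli_good_fraction(queries, latency_threshold_ms=200.0):
    """SLI: fraction of queries served without error within the threshold."""
    if not queries:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(1 for q in queries if q.ok and q.latency_ms <= latency_threshold_ms)
    return good / len(queries)


def meets_slo(sli, slo_target=0.999):
    """SLO check: is the measured SLI at or above the agreed target?"""
    return sli is not None and sli >= slo_target


# Hypothetical window of 1,000 queries: 997 fast successes, 2 slow, 1 error.
window = (
    [Query(latency_ms=120.0, ok=True)] * 997
    + [Query(latency_ms=450.0, ok=True)] * 2
    + [Query(latency_ms=80.0, ok=False)]
)
sli = sli_good_fraction(window)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")
```

In this made-up window the SLI is 0.997, which falls short of a 99.9% target but would satisfy a 99% one. The SLI is the measurement; the SLO is the threshold the business and engineers agree to hold it against.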
Then there are the necessity and inherent challenges of measuring reliability, including metrics like mean time between failures (MTBF), mean time to repair (MTTR), and mean time to detect (MTTD). Ultimately, what’s critical is for organizations to develop their own “risk matrix,” which defines answers to questions like, What’s the frequency of this event? What fraction of the users does it affect? And then how long is the outage in terms of MTTD and MTTR? “You multiply all those numbers out,” Liz says, “and you look at all the possible kinds of outages you have either seen or expect to see.”
That becomes a powerful tool for prioritizing issues and risks that will have a quantifiable impact on your SLOs, while downshifting on issues that may not be especially urgent.
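One way to read “multiply all those numbers out” is sketched below in Python: each risk gets a frequency, a fraction of users affected, and a duration, and the product is an expected number of “bad minutes” per year that can be ranked. The `Risk` class, the risk names, and every number are invented for illustration, and the sketch assumes MTTR is counted from the moment of detection, so outage duration is MTTD plus MTTR. Treat it as one possible interpretation, not the exact matrix Liz describes.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    events_per_year: float  # how often this kind of outage is expected
    user_fraction: float    # fraction of users affected, from 0.0 to 1.0
    mttd_minutes: float     # mean time to detect
    mttr_minutes: float     # mean time to repair (assumed to start at detection)

    def expected_bad_minutes_per_year(self):
        # frequency x fraction of users x outage duration
        duration = self.mttd_minutes + self.mttr_minutes
        return self.events_per_year * self.user_fraction * duration


def rank_risks(risks):
    """Sort risks so the largest expected impact comes first."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes_per_year(), reverse=True)


# Hypothetical entries; real values would come from incident history.
risks = [
    Risk("slow memory leak", events_per_year=4, user_fraction=0.05,
         mttd_minutes=60, mttr_minutes=30),
    Risk("regional outage", events_per_year=1, user_fraction=1.0,
         mttd_minutes=10, mttr_minutes=110),
    Risk("bad config push", events_per_year=12, user_fraction=0.5,
         mttd_minutes=5, mttr_minutes=25),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: {risk.expected_bad_minutes_per_year():.0f} bad minutes/year")
```

With these invented numbers, the frequent but partial config push outranks the rare full outage, which is the kind of non-obvious ordering a risk matrix is meant to surface.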
If you’re struggling with where to start, Liz suggests looking at your reliability target, or SLO. Everything else grows from there. “You can’t reason about any of this stuff without having a reliability target in mind that the business agrees to and that your engineers agree to,” Liz explains. “You need to either decide, ‘Okay, this is going to be a short-term issue. We know what we need to do [and] it’ll be fixed in a month. [Or] let’s ignore anything except for catastrophic failures.’ Or … if users are happy and your service is 99% available instead of 99.9% available, maybe that’s where you should set your SLO.”
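The gap between 99% and 99.9% is easier to reason about as time. The short Python sketch below converts an availability target into the minutes of full unavailability it tolerates; the 30-day window and the function name are assumptions made for the example, not something specified in the talk.

```python
def allowed_bad_minutes(slo_target, window_days=30):
    """Minutes of full unavailability an availability SLO tolerates in the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.99, 0.999):
    minutes = allowed_bad_minutes(target)
    print(f"{target:.1%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

Over 30 days, 99% leaves roughly 432 minutes of room while 99.9% leaves roughly 43, so the extra nine cuts the tolerated downtime by a factor of ten.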
How organizations can set SREs up for success
The SRE function often looks different from organization to organization—it might be its own department and management chain in one company, and a distributed team and reporting structure in another, for example—and that’s just fine.
Regardless, Liz stresses the importance of implementing the right incentives and rewards to foster excellence in the pursuit of reliability. If you say you value simplicity but your goals and incentives actually reward people for writing huge, complex systems that are a pain to maintain, you’re doing it wrong, and you’re likely to end up with a huge, complex system that is a pain to maintain.
“Make sure that you have a job ladder that really rewards thinking about what things make the product excellent from a reliability perspective,” she notes. It’s also critical to have a community of practice, people who care deeply about reliability and share best practices with each other—that’s what really creates a culture of reliability.
Communication, humility, and trust
As usual, once the right organizational structure and incentives are in place, it all comes down to execution. Success as an SRE depends on a variety of skills and traits; be sure to check out our related post on the 7 Habits of Highly Successful Site Reliability Engineers for more.
In the meantime, Liz and Matthew tap into something fundamentally important to SRE success: You can always teach technical skills, but you can’t necessarily impart equally essential qualities like empathy and curiosity. “The fundamental element that makes SRE work,” Liz says, “is that ability to work across multiple teams, whether it be with your pure SRE teams or with product development and software engineering teams.” To put it simply, “It’s really about communication, humility, and trust.”