If you want to be a site reliability engineer—or SRE, for short—it would certainly help to know exactly what people in that role actually do.
Yet the answer to that simple question can be a bit fuzzy, even for veteran SREs, because the definition of the term is still so malleable. That’s actually a good thing, as we’ll get to in a moment. But it also means the world of SRE can seem a bit mysterious to the untrained eye.
So, just as we recently unearthed the “secrets” of DevOps, we’re here to do the same with the related field of site reliability engineering.
Whether you’re looking for your first role in SRE or are a grizzled veteran of the reliability trenches in search of a fresh perspective, you’ll find not only the history of the term, but also commonly accepted best practices and some fundamentals of SRE success gleaned from New Relic’s own SRE journey. These 10 “secrets” should be common knowledge—but too often aren’t.
1. Credit for the term “SRE” goes to Google’s Ben Treynor Sloss
Site Reliability Engineering, or SRE, was introduced into the tech lexicon by Benjamin Treynor Sloss, VP of engineering at Google. That’s kind of a big job. As Sloss’ LinkedIn profile says: “If Google ever stops working, it’s my fault.”
This is more than just an interesting bit of tech trivia, according to New Relic Site Reliability Engineer Jason Qualman. The thinking behind the term, which was coined in large part to describe Google’s approach to operating its production systems, remains fundamental to SRE practices today.
“It’s important to recognize and heed how Sloss thought about the SRE role when he came up with it. It is a software engineer’s approach to doing operations,” Jason says. “Using code to enhance and automate your operational toil is one of the key tenets of SRE work, and that cannot be repeated enough.”
2. A team of Google engineers wrote the book on SRE—literally
Given the global scope of Google’s SaaS products and cloud platform, its SREs know a thing or three about reliability at scale. That last word, scale, is key to SRE: it’s not just about improving the reliability of a system today, but about making it better as it morphs and grows over time.
So, if a group of Google engineers writes a book about SRE, you should probably read it. (You can do so for free online.) Site Reliability Engineering offers an in-depth look at the role and its practices. Yes, it does so from the Google point of view, and how Google does SRE isn’t necessarily how your company should do it, but the book remains the foundational tome for everyone from newbies to experienced SREs.
3. SRE is very much what you make of it
It’s important to understand—especially as the SRE role proliferates—that both the job title and the practices involved can (and should) be very specific to a particular organization.
“Like its sibling term ‘DevOps,’ the definition of ‘Site Reliability Engineering’ can get squishy,” says New Relic Software Engineer Beth Long. This is likely to remain true for the foreseeable future, Beth adds.
In fact, it’s a positive characteristic of the field. The SRE role is not a one-size-fits-all situation. “I think this is inherent to the nature of SRE,” Beth explains. “It’s a cross-disciplinary role, and it’s deeply connected to shifts in how we think about the way humans connect to the systems they’re building.”
So while Google’s book is recommended reading, Jason points out that it’s not a turnkey approach for everyone to mimic. “Trying to adopt Google’s model for your company is an easy pitfall if you don’t deal in the specific hardware and software combination that Google does,” he warns.
The growth of SRE at New Relic is a great example. Jason and Beth note that the team has tried various formations of the SRE function—such as embedded SREs, a centralized SRE team, on-call SREs, and other iterations—to find out what works best for New Relic.
They recommend a similar approach in your own organization. “Experiment, be agile, and use the role in the way that works best for your team,” Jason advises.
See also: Defining Modern Software Roles—SREs at New Relic, by VP of Site Reliability Matthew Flaming
4. Site Reliability Champions can help make SRE work for you
New Relic’s own Site Reliability Champion (SRC) role offers an example of how to refine the job to meet specific challenges.
“SRCs are one solution to the challenge of having autonomous teams who own their own services,” Beth explains. “That approach is fantastic, but it leaves the organization vulnerable to missing the forest for the trees. SRCs are the scouts on the hill, looking out at the bigger picture and helping guide those autonomous teams to move in the same direction.”
5. Automation is fundamental to SRE
While definitions and implementations of SRE vary, one word is intrinsic to just about all of them: automation.
As Jason explained in our post on the 7 Habits of Highly Effective Site Reliability Engineers, top SREs seize upon every opportunity to automate: “A lot of this role is thinking about inefficient and time-consuming things people are doing and putting a stop to them as soon as possible. Instead of kicking a can down the road on manual work, you’re saying, ‘I’m going to take the time to automate this right now and stop anyone else from having to do this painful thing.’”
6. There’s no standard set of SRE tools—but you should standardize anyway
There’s no single, uniform SRE toolset. But most experts agree that any organization looking to build out an SRE function should define for itself what tools it will use. Alongside automation, standardization of both tools and processes is crucial to scalability, repeatability, and other important goals. As Google staff SRE Liz Fong-Jones said during New Relic’s FutureStack 2017 New York event, standardization is one of the key strategies that enables relatively small teams of SREs to support larger product teams (much larger, in Google’s case).
As Liz explained, “One SRE team is going to have a really difficult time supporting 50 different software engineering teams if they’re each doing their own separate thing and they’re each using separate tooling.”
7. SRE isn’t just for tech companies
Don’t be fooled into thinking SRE is only for cloud-native and SaaS companies. Just as DevOps culture has permeated a wide range of industries, the site reliability engineering role is expanding well beyond the tech industry.
It is simply a sign of the times. As New Relic is fond of saying, every company is a software company these days. That’s reflected in the spread of SRE and the SRE job market. A recent Glassdoor search for site reliability engineer jobs turned up open positions at companies such as Delta Air Lines and Owens Corning alongside the likes of eBay and Adobe.
8. SRE excellence requires experience
Beth offers a perspective that might become more common as more companies add and iterate on the SRE function: the role is best served by software pros with a few miles on their odometers. That doesn’t mean less-experienced folks can’t immediately bring the SRE mindset to the services they build and maintain; rather, it’s a reflection that the complexity and size of modern systems are easier to tame if you’ve been around the block a few times.
“SRE is akin to an architect role in that you can’t really take on true SRE at the very start of your software engineering career,” Beth says. “You can certainly step onto the path, but effective SRE requires a combination of depth and breadth, a certain fluency, that you don’t get without putting in serious time on the ground, particularly with systems of significant scale and complexity.”
9. SRE is as much a philosophy as a skill set
For site reliability engineering, the word “mindset” is key. Being an effective SRE is as much about how you think as it is about your technical skills.
Depending on how it’s defined, the SRE role requires a mix of development and operations skills. But creating a functional site reliability engineer takes more than telling a software developer or systems engineer to read Google’s book.
“The greater challenge is to help people cultivate a new way of approaching the process of building systems,” Beth says. She points to this equation from Krishelle Hardson-Hurley’s Hacker Noon article about SRE: “Site Reliability Engineer = Software Engineer + Systems Enthusiast.”
“That equation captures the direction we’re trying to head here at New Relic, which is to say SREs should be well grounded in traditional software engineering practices and tools, but also have a knack for looking at the system holistically and understanding how to move the system towards reliability—or even better, towards resilience and graceful extensibility,” Beth says.
10. SRE should be a catalyst for change
No matter how you define and implement SRE in your company, the role and the practices it embodies should have a cascading effect.
“Let your SREs act as catalysts for those who want to bring reliability to their team,” Jason advises. “At New Relic, our setup allows for some teams to have embedded SREs, and others to have SREs or SRCs available on demand. But that doesn’t stop any team who wants to develop with reliability in mind from doing so on their own.”
Indeed, while SREs have reliability written into their job title and responsibilities, it can and should be everyone’s mission. Jason points to a monthly community session that New Relic’s reliability team holds. Anyone, whether an SRE or not, can attend and ask questions or present on any reliability topic.
“This way, we can spread the ideas and culture of reliability without needing to manage the logistics of which team is most in need of an SRE. We can even implement and manage SRE principles without needing the help of someone with the job title,” he says. In fact, he jokes, “The main mission of an SRE is to automate themselves out of a job.”