Some software jobs seem as old as software itself. Terms like “programmer” (or its more contemporary version, “developer”) don’t need much explaining these days. Sure, the day-to-day job responsibilities and other specifics have evolved over time and might vary from one organization to the next. Ultimately, though, software developers (or people with comparable titles, like software engineer) are still people who build things with code, even if the tools and languages they use have come a long way.
But that kind of fundamental definition isn’t quite as neatly packaged for newer roles like DevOps engineer or, in particular, Site Reliability Engineer (SRE). That’s true even at New Relic: just because the SRE role has become a critically important internal function doesn’t mean everyone agrees on what, exactly, an SRE does.
“We had an internal reliability event a couple weeks ago, and the one consistent piece of feedback was that everyone has different takes on what exactly ‘site reliability engineering’ means,” says Beth Long, software engineer, site engineering, at New Relic. “It’s definitely a concept with a lot of gray areas!”
That’s actually a good thing, as it speaks to the adaptability of the SRE role and how it can be implemented productively in various parts of an organization. And the lack of a widely agreed-upon definition doesn’t mean we can’t try to deliver one, but hold that thought for a moment.
With help from Beth and her team member Jason Qualman, site reliability engineer at New Relic, we’re here to explore the history of the SRE role, its accelerating growth and importance in many organizations, and how it works here at New Relic.
The history of SRE: “Hope is not a strategy”
The phrase “site reliability engineering” is credited to Benjamin Treynor Sloss, vice president of engineering at Google. Sloss joined Google in 2003 and was tasked with building a team to help ensure the health of Google’s production systems at scale—no small task. Sloss himself has defined SRE as “what happens when you ask a software engineer to design an operations function.”
Members of Sloss’ team literally wrote the book, “Site Reliability Engineering”; the “hope is not a strategy” quote above comes from Sloss’ own introduction to it and is credited, presumably tongue in cheek, as a “traditional SRE saying.”
Sloss begins his intro with the particular challenge SRE intends to meet: “It is a truth universally acknowledged that systems do not run themselves. How, then, should a system—particularly a complex computing system that operates at a large scale—be run?”
Here’s Google’s answer to that question, at least the brief version: hire software engineers to do the work usually handled in traditional organizations by IT operations. “Our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins,” Sloss writes.
You could think of SRE as Google’s homegrown approach to DevOps. Obviously, the need to efficiently manage large, complex systems at scale is not unique to Google—even if the web giant epitomizes that need. You don’t have to be operating software services at Google scale to require the SRE function, which is why the role has continued to proliferate, not just in tech companies but in many other businesses where software is increasingly critical.
The proliferation of the SRE
No matter how you define it, the SRE role is clearly expanding rapidly in many companies. A recent job search for “Site Reliability Engineer” on Glassdoor produced more than 210,000 open positions!
Tech firms are certainly well represented; it’s like a who’s who list, actually, with everyone from Apple to Twitter to Dropbox and many more hiring for SRE positions. But you’ll also see plenty of other bellwether companies (such as GE and Chase) and industries (including entertainment and education) there, too. It’s another sign of modern software’s impact on long-standing industries like real estate (Zillow is hiring for an SRE position) and television (so is Hulu).
While the job description and day-to-day tasks of an SRE may vary from company to company (and even within a company, as we’ll address in a moment), Jason notes that just about any large software organization now has some version of the SRE function. And forward-thinking companies with a DevOps or DevOps-ish culture have probably had SREs for quite some time.
That’s starting to expand into companies of all shapes and sizes. “My impression is that there’s a slow trickle-down to smaller companies,” Beth says. “Google and Netflix and Amazon and Heroku—these companies have had SREs for a long time because they have the resources and the scale that demand it. You’re starting to see that role appear in smaller companies where they realize—‘Oh, we need someone to play this role’—and they’re starting to hear about it more.”
While there’s no one-size-fits-all definition of what a Site Reliability Engineer does, Jason points out that there is a mindset that unifies them all: automate everything. That’s probably the best-known, most visible trait of the SRE, Jason says. “If you think about any industry, how did they get better, faster, and more efficient? It was probably ‘automate it.’”
As a result, folks with the right mix of talent and experience for the SRE role are increasingly in demand. Last year TechCrunch asked, “Are site reliability engineers the next data scientists?” Comparing the SRE role to perhaps the sexiest job title in tech speaks volumes. And earlier this year, LinkedIn named SRE as the most promising job in tech for 2017!
The role of the SRE at New Relic
The Site Engineering department at New Relic was created in 2014, at a time, Beth recalls, when New Relic was beginning both to grow rapidly and to address the attendant challenges of stability and reliability. But the SRE role really took off in 2015.
At New Relic, Jason says, the role began with heavy-duty ops personnel slowly beginning to apply more software-based approaches to their work, rather than the Google approach of tasking software engineers with building out operations functions from scratch.
“That’s kind of where we are today. We have a lot of operational people working under the title of SRE who are really trying to bring more automation and self-healing to the site engineering side of the world,” Jason says. If the Google method was essentially to take a software-based approach to reliable and automated infrastructure, New Relic worked to add more software-engineering-based approaches and skill sets to its existing operations practices and personnel.
Today, New Relic SREs work not only within the Site Engineering group, but outside it as well, embedded in product and platform teams. So as Beth mentioned earlier, you won’t find one single, uniform list of day-to-day job responsibilities for a New Relic SRE. But they’re all working toward a common goal and with a shared mindset about software and infrastructure.
“How can they bring reliability to everything that they are doing?” Jason asks. “They’re thinking, How can I make this withstand failure? How can I reduce the toil involved in maintaining this system? How can I take a painful, manual process and automate it so that we’re not wasting human time on it? The overarching goal is that anything I’m doing, any of my goals, I’m bringing reliability to them and I’m spearheading that for the rest of my team, too.”
That latter part is key to the idea of New Relic SREs being embedded into product teams. They might be taking on software engineering tasks, but they’re also charged with helping to grow automation-first practices that reduce toil and improve reliability. That is the fundamental goal: greater reliability with less manual intervention as a system scales.