The day-to-day responsibilities of developers and operations engineers keep evolving as high-growth companies look for new ways to improve stability and reliability through automation-first practices. Because these companies need to reduce downtime, with less manual intervention, as their systems scale, a new role is taking shape in many organizations: the site reliability engineer (SRE).
The phrase “site reliability engineering” is credited to Benjamin Treynor Sloss, vice president of engineering at Google. Sloss joined Google in 2003 and was tasked with building a team to help ensure the health of Google’s production systems at scale, which is no small task. According to Sloss, site reliability engineering is “what happens when you ask a software engineer to design an operations function.” The SRE role is cross-functional, taking on responsibilities traditionally siloed in development, operations, and other IT groups.
The proliferation of the SRE
No matter how you define it, the SRE role is clearly expanding into more and more companies. A job search for “Site Reliability Engineer” on Glassdoor produced more than 61,600 open positions at the time of this writing. Tech firms are certainly well represented: companies from Adobe to GitHub to Spotify (and many more) are all hiring SREs. But you’ll also see other bellwether companies (such as GE, Chase, Walmart, and McGraw-Hill Education) and industries (including entertainment and education) seeking SRE practitioners, too.
And it’s no surprise that companies of all shapes and sizes are starting to adopt the role. “My impression is that there’s a slow trickle-down to smaller companies,” says Beth Long, a software engineer with New Relic’s Reliability Engineering team. “Google and Netflix and Amazon and Heroku—these companies have had SREs for a long time because they have the resources and the scale that demand it. You’re starting to see that role appear in smaller companies where they realize ‘Oh, we need someone to play this role.’”
As a result, folks with the right mix of talent and experience for the SRE role are increasingly in demand. Not too long ago, TechCrunch asked, “Are site reliability engineers the next data scientists?” And last year LinkedIn named SRE one of the most promising jobs in tech.
From Google to the rest of the world
Sloss’ team literally wrote the book on site reliability engineering. So if you’re wondering what a great modern SRE practice should look like in a DevOps world, the Google Site Reliability Engineering book is a fantastic point of reference.
In it, Sloss writes, “It is a truth universally acknowledged that systems do not run themselves. How, then, should a system—particularly a complex computing system that operates at a large scale—be run?”
Google’s answer has been to hire software engineers to do the work usually handled in traditional organizations by IT operations folks. “Our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins,” explains Sloss.
Starting the SRE journey
While job descriptions and day-to-day tasks for SREs vary from company to company, the utility of the role is quickly becoming apparent to the software organizations that have adopted it.
So where does that leave you?
Whether you’re still figuring out how to create a site reliability practice at your company or you’re trying to improve the processes and habits of an existing SRE team, the more you know about the subject the better—especially since what may work for a massive company like Google may not always work for a small or mid-sized outfit. To that end, this ebook shares the philosophies, habits, and tools of successful SREs, along with New Relic’s own definition, guidelines, and expectations for the role.