For those of us who spend a lot of time thinking about what a great modern SRE practice should look like in a DevOps world, the Site Reliability Engineering book serves as a fantastic point of reference. Written by members of Google's SRE team, the book shares a compelling glimpse of how they scale and operate their cloud platform and SaaS products.
But what about SRE practices at companies that aren’t the size of Google? For all that's been written about reliability practices, it's surprisingly hard to find specific, detailed descriptions of the day-to-day role SREs play in other engineering organizations. Most descriptions on the internet contain relatively vague phrases, like "SREs combine software engineering and operational skillsets," and "SREs automate all the things."
Of course, some companies have great, robust internal descriptions of how their SREs support teams in their engineering organizations—but even there, "SRE" is often used as a catchall term meaning "operations engineers, or engineers who support infrastructure components, or who write code but also spend a lot of time doing other things." For a long time, this was how we used the term at New Relic. We knew roughly what our SREs were supposed to be doing, but disagreement about the specifics—like how much manual operational toil it was acceptable for an SRE to take on, or how SREs should engage with teams in architecture discussions—sometimes made it challenging for our SREs to prioritize the most high-leverage, high-value work.
Working toward clarity and consensus
The process of creating our own SRE role description took time and involved the input of a variety of stakeholders—from individual SREs to executive leadership. This was a worthwhile investment: the exercise helped us clarify and shape a shared understanding of
- Why we have SREs at New Relic.
- The vision for our SRE team.
- How SREs can most effectively contribute to the future of our platform.
This clarity also gave our SREs and their managers tools for calibrating expectations, identifying failures, and targeting success.
Our experience suggests that engineering organizations can benefit by creating clarity and consensus around what they expect from their SREs. To support that effort, we want to share our internal definition of the SRE role at New Relic, pulled from our engineering organization’s process documentation.
SREs at New Relic operate in two different contexts:
- Some are part of “pure” SRE teams that work to build and support our core internal platform, such as our container fabric clusters (our in-house container orchestration and runtime platform) and networking systems.
- Others partner with product engineering teams as domain experts in reliability, tooling, and scaling areas.
In both cases, the same fundamental role description applies. Similarly, the same description applies to all title levels of our SRE practice, although the focus and scope of work naturally change as our SREs increase in seniority.
So, here is our internal SRE role description:
The SRE role at New Relic
SREs at New Relic are engineers who focus on, and are recognized primarily for, improving the reliability of systems in the New Relic platform. From a business perspective, the goal of the work that SREs do is to build and maintain our customers’ trust, and to allow the business to scale by steadily decreasing the per-service and per-host operational overhead of our global platform.
At a high level, SREs make this happen by
- Championing reliability best practices.
- Guiding designs and processes with an eye toward resilience and low toil.
- Reducing technical complexity and sprawl.
- Driving the usage of tooling and common components.
- Implementing software and tooling to improve resilience and automate operations.
In some cases, SREs perform manual operations work (toil), but this kind of work is a tax on SREs that detracts from their core mission; it is not the reason why we have SREs. Necessary toil should be shared by an entire team rather than handed off to an SRE and should be a trigger for the team to automate that work.
|Type of Work||Examples||Notes|
|Learn and enhance New Relic operational and reliability best practices (e.g., HA, capacity planning, SLOs, incident response) and work with teams to adopt those practices.||
|Stay current with the overall New Relic platform architecture and with the current state of, and top risks in, their teams’ “neighborhood” in production.||We expect all SREs to be familiar with the dependencies and underlying infrastructure of the systems they work with.|
|Build, or help teams adopt, core shared internal platform components.||
|Improve the monitoring and observability of the New Relic platform.||We encourage SREs to actively use and extend existing New Relic products whenever it’s possible and effective to do so, and to influence Product Management to implement necessary features when it’s not.|
|Work with teams to design and implement automation, tooling, and application code to improve reliability and reduce toil.||
|Mentor less senior SREs and grow the SRE community and practice at New Relic.||
|Perform task-based operational work (toil) required to unblock teams with operational needs where automated or self-service solutions do not yet exist for those teams.||
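The table lists SLOs among the reliability practices SREs help teams adopt. As a rough illustration of what that can mean in practice, here is a minimal sketch of an error budget calculation. It is not New Relic's internal tooling; the `ErrorBudget` class, its field names, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: tracks how much of an SLO's error budget is left."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the SLO window
    failed_requests: int   # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed


# Example with made-up numbers: a 99.9% target over one million requests.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Allowed failures: {budget.allowed_failures:.0f}")
print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

A team might use a number like this to decide when to pause feature work in favor of reliability work, though the thresholds and the response are choices each organization has to make for itself.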
Set your SREs up for success
Although this SRE role description works well for us at New Relic, it may not be right for other engineering organizations. Regardless, we hope it provides a useful example and helps clarify the tremendous value a great SRE practice can bring to your organization. More important, by developing their own guidelines, companies can set their SREs up for success and advance the collective understanding of the key role the SRE practice will play as it matures to support the ever-increasing complexity of our computing platforms.