We hear a lot about the promise of DevOps these days, about how shifts in culture, tooling, process, and monitoring can transform how teams ship software. I work on a New Relic DevOps team, and I believe in that promise.
But what does it mean to say that I work on a DevOps team? Recently, my manager explained in a blog post (How to Manage a DevOps Team: Q&A With the Manager of New Relic Mobile Team) what it’s like to manage a DevOps team. But what is it like to be an individual contributor on such a team? What’s a typical day like for a DevOps pro, or in my case, a site reliability engineer (SRE)?
This post—a mash-up of real events that occurred over several days—is designed to answer that question, but I can tell you one thing that’s always true: The days move pretty fast.
A quick morning
8:00-8:30: This is my time to take care of housekeeping. The office is still pretty quiet—it usually is at this time—so I make some coffee, check my email, and get into the groove of the day.
I’m an SRE on the New Relic Mobile team. We build and ship New Relic’s Mobile APM product used by mobile application developers to monitor their Android and iOS apps. Specifically, as an SRE, I’m responsible for the reliability of our product. Along with my team, I make sure the code that we ship is reliable, that the canaries we deploy behave as we expect them to, that our deployment pipeline is clear and easy to use, and that key parts of our pipeline (for example, Kafka, Amazon Simple Queue Service (SQS), and Amazon S3) are working as expected.
Before I head off to the first meeting, I check our ops dashboard to make sure that no system reported an out-of-memory (OOM) error overnight or became starved for resources. I also check that our Kafka topics are flowing as expected. We have built a series of New Relic Insights dashboards that cover all of our services as well as our locations in the United States and Europe. Today, I also review the reliability report for New Relic that shows the last three months of uptime across all teams and all products.
9:00-9:45: First up is the Container Linux (CoreOS) community of practice (CoP) meeting. At New Relic, we use communities of practice to connect people from around the company who want to understand how we engineer our products. Lately, the CoreOS CoP has been discussing the work we’re doing to shift our container infrastructure from Docker to CoreOS.
Today, the topic of discussion is a deep dive into the nitty-gritty of how CPU allocation works inside our container framework.
9:45-12:00: OK, time to get into the thick of it. We’re in the final stages of an upgrade of our Apache Kafka clusters; specifically, we’re going from version 0.8 to 0.10, and it’s been a bit of a pain. Dependencies changed between versions, code changed, and configuration objects were no longer compatible. New configuration options exposed in Kafka 0.10 have changed all the defaults we set in version 0.8. It has taken a ton of work to reconfigure our producers. And reconfiguring the consumers hasn’t been a walk in the park either. We have to bring each one down, upgrade it, reconfigure it, bring it back up, and hope it picks up any lag created while it was absent from the cluster.
It’s enough to make an engineer ready for lunch.
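For anyone who hasn't wrestled with consumer lag, here is a minimal sketch of the idea behind "picking up the lag" after a consumer comes back. It is illustrative only and not our actual tooling: the function names are hypothetical, the offsets are plain dictionaries rather than values fetched from a real Kafka client, and treating a partition with no committed offset as starting at 0 is an assumption made for simplicity.

```python
from typing import Dict


def consumer_lag(log_end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: newest offset in the log minus the
    consumer group's last committed offset.

    Both arguments map partition number -> offset. A partition with no
    committed offset is treated as committed at 0 (an assumption).
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end_offset - committed, 0)
    return lag


def is_catching_up(lag_before: Dict[int, int],
                   lag_after: Dict[int, int]) -> bool:
    """True if total lag shrank between two samples taken some time apart."""
    return sum(lag_after.values()) < sum(lag_before.values())
```

Sampling the lag twice, a minute or so apart, and comparing the totals is one simple way to tell whether a restarted consumer is actually recovering or just treading water.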
12:00-13:00: I hit the food cart pod outside our Portland office—the headquarters of New Relic engineering—for some takeout.
Back in the office, I meet up with another SRE, and we eat while working on a side project. She wants to improve some tooling for troubleshooting a rather messy monolithic service. The service is an interdependent tangle of Java and Python and Go code, and to work on it, you have to check out various repos and link it all together. The tooling makes it pretty easy to deal with most problems in the service, but getting the tooling to run reliably on developers’ machines has been tricky. We finally decide to containerize the entire tooling ecosystem so that anyone who needs to troubleshoot problems can access a trusted version of the tools.
We’re DevOps heroes.
As I’m headed back to my desk, I get a Slack message from my manager: “Emergency MMF meeting at 1.”
An even quicker afternoon
13:00-13:30: In my manager’s blog post, he talked about how our team is T-shaped, so we all share a certain amount of skill and expertise. I may be an SRE focused primarily on the backend of our product, but I find out in this MMF planning meeting that we’ve got a week to ship a new UI component, and we’re down one frontend developer who is on vacation, so I should expect some frontend UI work coming my way. While I largely consider this a pain in my neck, I know it’s going to help me better understand how we design and build the frontend of our app. I’m enhancing my “T.”
This is an expedited MMF, which means that we’ll all drop what we’re doing and dig in as best we can, where we can.
13:30-14:30: I can’t get to work on that new feature just yet. Since I’m the team’s SRE, I have to attend the quarterly capacity planning meeting with our Site Reliability Champions. During this meeting, we review current resource utilization of all of our microservices across all of our environments. We plan a project for the next quarter and make sure that the team responsible for purchasing servers is able to get us those resources.
14:30-16:00: For the next 90 minutes, I pair with our frontend engineer on the new MMF. We have to fix a buggy time picker, which requires reaching deep into the app’s React code.
16:00-16:30: I have to step away from the MMF work and represent my team at the quarterly risk matrix meeting. In these sessions, we meet with a New Relic architect, and we identify any impending risks and their expected impacts within the portions of the New Relic ecosystem we own. We then brainstorm ways to remove or mitigate those risks. Today, I report that my team has rolled out a pair of new services that we have to integrate into the risk matrix. One of the services helps us abstract away direct dependencies on several core databases. We cache our records and stream updates via watchers to our cache, which should help us reduce any risks associated with database outages.
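The caching idea is easier to see in code. What follows is a minimal sketch under stated assumptions, not the service we built: `RecordSource` is a hypothetical in-memory stand-in for a database that can notify watchers of changes, and `WatchedCache` is a hypothetical cache that stays current by listening to those notifications instead of querying the database on every read.

```python
from typing import Callable, Dict, List, Optional

# A watcher receives (key, new_value); new_value is None for a delete.
Watcher = Callable[[str, Optional[str]], None]


class RecordSource:
    """Hypothetical stand-in for a core database that emits change events."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}
        self._watchers: List[Watcher] = []
        self.available = True  # flip to False to simulate an outage

    def watch(self, callback: Watcher) -> None:
        self._watchers.append(callback)

    def write(self, key: str, value: Optional[str]) -> None:
        if value is None:
            self._rows.pop(key, None)
        else:
            self._rows[key] = value
        # Stream the update to every registered watcher.
        for notify in self._watchers:
            notify(key, value)

    def read(self, key: str) -> Optional[str]:
        if not self.available:
            raise ConnectionError("database unavailable")
        return self._rows.get(key)


class WatchedCache:
    """Cache kept up to date by change events, so reads never hit the source."""

    def __init__(self, source: RecordSource) -> None:
        self._records: Dict[str, str] = {}
        source.watch(self._on_change)

    def _on_change(self, key: str, value: Optional[str]) -> None:
        if value is None:
            self._records.pop(key, None)
        else:
            self._records[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._records.get(key)
```

In this sketch, a record written before an outage is still readable from the cache while `source.available` is `False`, which is the risk reduction described above. A real service would also need an initial load of existing records and a plan for missed events, both of which are left out here.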
16:30-17:00: Back to that MMF...
17:00-17:30: Before I leave the office, I take one last look at the situation with our Kafka upgrade. We’ve been deploying our new consumer fleet, but consumers running versions 0.8 and 0.10 can’t coexist. So to deploy the new fleet, we have to completely shut down all of our existing consumers and deploy the new ones. We get the fleet deployed, but the consumption levels are far too low to keep up with traffic. We quickly revert the changes and avoid an incident, but tomorrow I’ll have to figure out why the consumption was below expected levels.
So, that was a fast day! But that’s kind of the point.
Sometimes, I admit, it would be nice to work on just one problem all day—and occasionally, I do just that. But DevOps can’t succeed if we all have such myopic viewpoints. New Relic moves fast because our customers move fast. Beyond speed, we need versatility and careful planning and full awareness of our systems. I may have spent my day running around, but it was in the service of helping my team ship the best product we can.