So you just finished putting out another fire (nice work!). Clearly, you know what you’re doing, but that’s not to say there isn’t an opportunity to boost your ops superpowers even more.
At New Relic, we’re constantly thinking about how to improve our internal operations and deployment processes, while trying to learn and incorporate best practices from other companies as well. Whether it’s figuring out how to use New Relic to monitor New Relic, scaling our infrastructure to meet growth, effectively managing the incident lifecycle, or optimizing our deployment processes, we’ve pondered many an ops question. As a result, we’ve gathered a ton of best practices and lessons learned, all of which are collected in this Greatest Hits compilation.
From articles and videos to ebook excerpts, you’ll find all sorts of awesome content to help you do operations better. Start from the top of this page and scroll down, or check out the table of contents on the left and go straight to what interests you. Like what you see? Feel free to click on the links within each chapter to read the source material in its entirety.
Building a System That Never Stops
It’s a problem that every successful software organization faces at some point: how to evolve systems under load to meet growing demands and changing requirements, without dropping data or causing service interruptions.
This is something that New Relic’s Site Engineering team is all too familiar with. From our humble beginnings, we’ve now grown our data capacity to accept more than 16 million requests and power more than 3 billion queries every minute. On the organizational side, the New Relic Software Analytics Cloud contains more than 200 different services, accessing more than 2.5 petabytes of SSD storage—all built and maintained by more than 25 different engineering teams.
That kind of growth took a lot of doing, so at our FutureStack15 conference, New Relic’s Engineering VP Matthew Flaming and Chief Architect and VP of Engineering Nic Benders were excited to share some of the lessons we’ve learned along the way. Nic and Matthew go into deep technical detail in their presentation—you can either watch the video of their entire talk below or scan through the highlights that follow.
The New Relic Lesson Plan
Lesson 1: NOTHING lasts forever—Don’t get cocky; attachment to existing processes can get in the way of critical thinking (time code 6:18).
Lesson 2: Run EXPERIMENTS—We pride ourselves on having a culture of experimentation, even though experiments don’t always work (7:16).
Lesson 3: SYNCHRONOUS CALLS are going to be a problem. Sometimes, the issue isn’t that you don’t have enough databases (9:32).
Lesson 4: Master the ROLLOUT—From “incremental rollouts” to “dark deploys,” the most important part of your technology may be how it is delivered (14:28).
Lesson 5: NEW TECH = NEW CHALLENGES—But also new opportunities (17:27).
Lesson 6: Use the right WORKLOAD DISTRIBUTION—Active management vs. Random (19:22).
Lesson 7: Technology enables CULTURE—A tool can be more than just a way to connect components; for example, it can also change the way your business works (22:22).
Lesson 8: Software Architecture: THE BIG PICTURE—Organic growth eventually leads to a breaking point, which requires a paradigm shift (23:36).
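To make Lesson 4 a little more concrete, here is a minimal Python sketch of an incremental rollout and a dark deploy. It is an illustration only, not a description of how New Relic ships code: the function names, the "new-path" feature label, and the hash-based bucketing scheme are all assumptions made for the example.

```python
import hashlib


def rollout_bucket(account_id: str, feature: str) -> int:
    """Map an account to a stable bucket in the range 0..99 (hypothetical scheme)."""
    key = f"{feature}:{account_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_enabled(account_id: str, feature: str, percent: int) -> bool:
    """True if this account falls inside the current rollout percentage."""
    return rollout_bucket(account_id, feature) < percent


def serve(account_id, payload, old_path, new_path, percent=0, dark=False):
    """Route one request through the old or the new code path."""
    if dark:
        # Dark deploy: exercise the new path under real traffic,
        # but discard its result and always answer from the old path.
        try:
            new_path(payload)
        except Exception:
            pass  # a real system would log this; it must not reach the user
        return old_path(payload)
    # Incremental rollout: only accounts inside the percentage see the new path.
    if is_enabled(account_id, "new-path", percent):
        return new_path(payload)
    return old_path(payload)
```

Because the bucket is derived from a hash rather than a random draw, an account that sees the new path at 10 percent keeps seeing it at 50 percent, which keeps the experience consistent as the rollout widens.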
If It Touches Production, It is Production
We all want to build amazing software, so shouldn’t our tools, our jobs, and everything we do to get there be done with the same care and feeling?
That’s why at New Relic, we treat operations like software. We operationalize our development teams to build, maintain, and scale a unified polyglot environment. By following the mantra of “automate everything,” we treat all tasks, tools, and processes the same way we do our products, with a well-defined lifecycle. And in our eyes, that’s what makes a modern DevOps team.
“Modern DevOps is really about changing your culture. Changing the way that you think about operations, because operations IS the product.” - Dana Lawson, Director of Site Engineering, New Relic
As explained by New Relic’s Director of Engineering Dana Lawson, a modern DevOps operation is best defined by the following characteristics:
- Software maturity
- Self-healing systems
- Many experts and low tribal knowledge
- Small incremental changes continuously
- Strong foundation of knowledge and documentation (dynamic and/or curated)
But how do you get to this point? Find out in Dana’s FutureStack15 talk below, and get more insights and best practices that can help you successfully take your platform into the future.
To Boost DevOps, Try ChatOps
Excerpted from the blog post by Stevan Arychuk of New Relic
When people talk about DevOps, terms like “automation,” “collaboration,” and “tools” always seem to dominate the discussion. So it should be no surprise that a paradigm combining all of these traits into a single concept has surfaced as a new-and-better way for modern teams to communicate and collaborate. Think real-time collaborative group chat powered by bots that help with sharing information, plus integrated notifications from other tools. Put it all together and innovative teams are now doing conversation-driven development and operations—this is the way of ChatOps.
By creating a new communication channel that automates common tasks and makes it easy to distribute real-time information, ChatOps can improve collaboration to help teams shorten feedback loops, enabling them to move faster and be more productive.
Picture this: An app crashes and triggers an alert, which notifies the on-call engineer responsible for support. That engineer replies in the chatroom that she is addressing it, and asks for assistance or other information if needed. Working with other members of the chatroom, she identifies the bug, creates a fix, tests it, and then pushes it to production. That resolves the issue and the alert is closed. Every step of this scenario can be captured via chat, with most of it automated. See this internal example of a similar scenario at New Relic:
The screenshot shows the team has integrated notifications from New Relic Alerts into the HipChat room. We try to make integrations like this as simple as possible because they are important and can trigger both action and discussion. Most teams have their own room where they talk about their area of focus and field requests from other teams. Again, the idea here is that groups of people gather in a location where they can discuss and collaborate, and others can easily find them if there are issues or questions.
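If you want a feel for how the alert-to-chat flow described above hangs together, here is a small, self-contained Python sketch. It does not call HipChat or New Relic Alerts: ChatRoom, Alert, and AlertBot are hypothetical stand-ins that keep everything in memory, so treat it as a model of the idea rather than a working integration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChatRoom:
    """Hypothetical in-memory stand-in for a team chat room."""
    name: str
    transcript: List[str] = field(default_factory=list)

    def post(self, author: str, text: str) -> None:
        # Every message is kept, so the room doubles as an incident record.
        self.transcript.append(f"[{author}] {text}")


@dataclass
class Alert:
    """Hypothetical alert raised by a monitoring tool."""
    app: str
    condition: str
    is_open: bool = True


class AlertBot:
    """Relays alert state changes into a chat room."""

    def __init__(self, room: ChatRoom) -> None:
        self.room = room

    def notify(self, alert: Alert) -> None:
        self.room.post("alert-bot", f"OPENED {alert.app}: {alert.condition}")

    def close(self, alert: Alert) -> None:
        alert.is_open = False
        self.room.post("alert-bot", f"CLOSED {alert.app}: {alert.condition}")


if __name__ == "__main__":
    room = ChatRoom("site-engineering")
    bot = AlertBot(room)
    alert = Alert(app="checkout", condition="error rate above threshold")

    bot.notify(alert)
    room.post("on-call", "I'm on it. Can someone pull the last deploy diff?")
    bot.close(alert)

    for line in room.transcript:
        print(line)
```

The point of the sketch is the transcript: bot notifications and human replies land in the same place, so anyone joining the room later can reconstruct what happened.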
Want to learn more about ChatOps and how to get started? Read the rest of Stevan’s blog post and start chatting away.
Incident Lifecycle at New Relic (Step 1: Don’t Panic!)
Speaking of efficiently managing incident lifecycles, the Site Engineering team at New Relic has spent a lot of time and energy thinking about and tuning how we deal with issues in production. As a result, we’ve learned quite a few tricks and best practices along the way, which New Relic Product Manager Nate Heinrich shared at New Relic’s FutureStack15. He covered everything from how teams at New Relic work together, to the tools and “routines” they use to create a low-stress environment.
According to Nate, you need to invest in three key areas to achieve an awesome incident lifecycle:
- Culture (pre-incident): These are the behaviors and beliefs about building, deploying, and running software.
- Routines (incident): These are the habits and built-in neural pathways that allow you to spend your cognitive abilities on solving the problem at hand, not running the process.
- Priority (post-incident): These are the activities and action items performed after resolution.
Watch the video below to learn more about these focus areas and how you can achieve a high-quality incident lifecycle.
DevOps Without Measurement Is a Fail
The DevOps movement continues to gather speed, and, according to many, it’s about time. After all, fostering collaboration and transparency across the entire delivery process has been shown to help everyone get great work done quickly. That means faster delivery of software, fewer defects, faster resolution of problems, and better allocation of limited resources.
However, faster development of better software is not the end of the story, nor is it the chief reason for implementing DevOps in the first place. For your DevOps efforts to be a true success, you need to show more than how you made nice between dev and ops to get better results. You need to demonstrate that what you do has a positive impact on the business, regardless of what you call the changes you make and the culture you build.
The key ingredient of measuring DevOps success? You guessed it: data. You can’t know for sure how you’re doing unless you measure the right things and manage your DevOps operation to continuously keep key performance indicators in the right balance.
Based on the framework we introduced in our ebook, “DevOps Without Measurement Is a Fail,” the graphic below offers a quick snapshot of what types of metrics you should be tracking.
Depending on how much progress you’ve already made in achieving your company’s goals, you’ll need to decide which metrics are most important to track at this time. If you’re not yet tracking some or any of them, it’s time to get started! You can do that by establishing a baseline for each metric and then monitoring it to make sure it is moving in the right direction, whether that means increasing or decreasing.
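As a rough illustration of the baseline idea, the Python sketch below records a baseline for each metric along with the direction it should move, then checks a current reading against it. The metric names and numbers are made up for the example and are not taken from the ebook's framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A tracked metric with its baseline and desired direction (illustrative)."""
    name: str
    baseline: float
    higher_is_better: bool

    def on_track(self, current: float) -> bool:
        # A metric is on track if it has held steady or moved the right way.
        if self.higher_is_better:
            return current >= self.baseline
        return current <= self.baseline


# Hypothetical metrics and baselines, chosen only for the example.
metrics = [
    Metric("deploys_per_week", baseline=5, higher_is_better=True),
    Metric("defects_per_release", baseline=8, higher_is_better=False),
    Metric("minutes_to_resolve", baseline=45, higher_is_better=False),
]

# Hypothetical current readings.
current = {"deploys_per_week": 9, "defects_per_release": 6, "minutes_to_resolve": 60}

for m in metrics:
    status = "on track" if m.on_track(current[m.name]) else "needs attention"
    print(f"{m.name}: baseline={m.baseline}, current={current[m.name]} ({status})")
```

Recording the desired direction next to the baseline matters because "up" is good news for some metrics and bad news for others.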
Click on the "Download PDF" button on the left to learn how New Relic can help improve your DevOps efforts.
For more helpful articles and advice, be sure to visit the New Relic blog.