In today’s software landscape, organizations large and small are under constant pressure to modernize their practices—to break down their monoliths, automate their pipelines, and reduce overall toil. To get there, most shift to a DevOps practice, but it’s a rare team that can complete this journey on its own.
After all, there is no single recipe for a smooth transition to DevOps. Aligning a traditionally siloed organization requires a mix of cultural, procedural, and technological changes. But if you’re careful and use a pragmatic approach that fits your business needs and goals, you’ll see success in the end.
At New Relic, we’ve taken our own DevOps journey. We started as a small company, running a monolithic Ruby application, but our growth and success (more customers, more data, and the need to deliver features more quickly) forced us to revisit our application architecture and how we deliver software. We now operate with more than 50 DevOps engineering teams managing over 300 containerized microservices, to which they deploy changes 20 to 70 times a day.
During this journey, we learned that a successful DevOps transition consists of three phases: Prepare, Activate, and Optimize.
Regardless of where you are in your transition to DevOps—whether you’re just getting started, have seen success with pilot projects, or are well underway with a full DevOps transformation—this ebook is for you. Read on to learn more about our prescriptive steps for mapping your path to DevOps success: understand what your organization has achieved, where it sits today, and how to make progress in your DevOps journey.
DevOps is a software development methodology that removes the barriers between software development teams (Dev) and information technology operations (Ops). The methodology encompasses changes to an organization’s processes, culture, and mindset to shorten the software development life cycle—even as teams deliver features, fixes, and updates more frequently, with shorter feedback loops.
Phase 1: Prepare
So, you’re ready to embrace DevOps, but where do you start? This phase is about establishing basic visibility into your applications—preparing for the procedural and cultural changes that a healthy DevOps practice requires. In this phase you will
- Establish performance objectives and baselines for your applications.
- Set up proactive alerting against those baselines.
Gather performance statistics and remediate applications
A key step on the DevOps journey is the gathering of metrics and performance statistics to help you diagnose and resolve throughput bottlenecks, transaction errors, and similar issues. These remediations are critical; they will contribute to a more stable environment in which to establish your DevOps practices. (They’ll also lead to a better customer experience!)
What should you measure? In Google’s ebook Site Reliability Engineering: How Google Runs Production Systems, the authors suggest using the Four Golden Signals to measure and improve user-facing applications:
- Latency: How responsive is your application?
- Traffic: How many requests or sessions are aimed at your application?
- Errors: How much traffic is unable to fulfill its requests or sessions?
- Saturation: How are resources being stressed to meet the demands of your application?
By focusing on metrics like the Four Golden Signals, you’ll get proof of measurable improvements that you can share throughout your organization to gain momentum on your DevOps journey.
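As a rough illustration of how the four signals can be derived from raw data, here is a minimal Python sketch. The `Request` record, the `golden_signals` function, and the use of CPU utilization as a stand-in for saturation are all assumptions made for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize the Four Golden Signals for one time window.

    cpu_utilization is a 0..1 fraction used here as a simple proxy for saturation.
    """
    if not requests:
        return {"latency_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_ms": mean(r.duration_ms for r in requests),   # Latency
        "traffic_rps": len(requests) / window_seconds,          # Traffic
        "error_rate": failed / len(requests),                   # Errors
        "saturation": cpu_utilization,                          # Saturation
    }


# Four hypothetical requests observed over a two-second window.
sample = [Request(120, True), Request(80, True), Request(400, False), Request(200, True)]
signals = golden_signals(sample, window_seconds=2, cpu_utilization=0.65)
print(signals)
```

In practice you would likely track a latency percentile as well as the mean, because a handful of slow requests can hide behind a healthy average.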
Set service-level objectives for application performance
As you prepare your applications, you also need to set clear and measurable objectives. These will enable your teams to build the skills and motivations required to perform cross-team work in a true DevOps environment.
Service-level objectives (SLOs) articulate what successful reliability looks like. SLOs are also a powerful mechanism for codifying the goals of your DevOps team and helping the team to achieve greater velocity.
An SLO is an agreed-upon means of measuring the performance of an application or of a microservice within an application. The SLO defines a target value for a specified quantitative measure, which is called the service-level indicator (SLI); for example:
- (SLI) = average response time
  (SLO) = should be less than 200 ms
- (SLI) = 95% of requests
  (SLO) = should complete within 250 ms
- (SLI) = availability of the service
  (SLO) = should be 99.99%
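To make the three example objectives above concrete, the following Python sketch computes each SLI from sample data and checks it against its SLO. The function names, the nearest-rank percentile method, and the sample numbers are assumptions chosen for illustration; your monitoring tool may calculate percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slos(durations_ms, successful, total):
    """Return {objective: (sli_value, met)} for the three example objectives."""
    avg = sum(durations_ms) / len(durations_ms)
    p95 = percentile(durations_ms, 95)
    availability = successful / total
    return {
        "average response time < 200 ms": (avg, avg < 200),
        "95% of requests within 250 ms": (p95, p95 <= 250),
        "availability >= 99.99%": (availability, availability >= 0.9999),
    }


# Twenty hypothetical response times and a hypothetical request count.
durations = [100] * 18 + [240, 900]
report = evaluate_slos(durations, successful=99_995, total=100_000)
for objective, (value, met) in report.items():
    print(f"{objective}: value={value} met={met}")
```

Note that the single 900 ms outlier does not break the 95th-percentile objective here, which is one reason percentile-based SLIs are often easier to reason about than averages.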
Effectively setting SLOs and SLIs for a modern, complex system is a multi-step process that includes:
- Identifying system boundaries within your application
- Defining the capabilities of each system
- Measuring performance baselines
- Defining an SLI and applying an SLO for each capability, and then iterating on those settings over time
A good DevOps team uses its SLIs as key performance indicators (KPIs) to ensure its service meets customer expectations. Further, measuring the current state of your service or application’s reliability provides clear visibility into your DevOps progress. Doing so also allows your teams to focus on resolving meaningful performance gaps as you assess future optimization efforts.
Set and tune alerts to better understand system performance
Proactive DevOps teams establish effective “alerting” strategies that respond to problems before they affect customers. A great place to start with alerting is with your team’s SLOs. In fact, you can group SLOs together logically to provide an overall boolean indicator of whether your service is meeting expectations or not—for example, “95% of requests complete within 250 ms AND service availability is 99.99%”—and then set an alert against that indicator.
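One way to picture such a grouped indicator is as a logical AND over individual SLO checks. The sketch below assumes each check has already been reduced to a true/false result; the function names and the sample values are hypothetical.

```python
def service_meets_expectations(slo_results):
    """Combine individual SLO checks into one boolean indicator (logical AND)."""
    return all(slo_results.values())


def failing_slos(slo_results):
    """List the SLOs that are currently violated, for use in the alert message."""
    return sorted(name for name, met in slo_results.items() if not met)


# Hypothetical results for the two objectives named in the text.
current = {
    "95% of requests complete within 250 ms": True,
    "service availability is 99.99%": False,
}

if not service_meets_expectations(current):
    print("ALERT: SLOs violated:", ", ".join(failing_slos(current)))
```

Reporting which objective failed alongside the single boolean keeps the alert simple to configure while still telling the responder where to look first.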
By breaking down the quantitative performance metrics of a service or application, your DevOps team can identify the most appropriate alert type for each metric. For instance, the team could set an alert to notify on-call responders if web transaction times go above half a second, or if the error rate goes higher than 0.20%.
For a simple alerting framework, consider the following table:
A focused set of alerts will not only surface true performance degradations to which a DevOps team should respond, but it will also decrease the number of end user-reported incidents. This approach also helps to support DevOps team morale by combating alert fatigue and instilling confidence that rapid, small-scale deployments won’t increase the risk of unnecessary alarms.
Phase 2: Activate
As your DevOps team matures, it will steadily increase the speed and rate of deployments. This, in turn, makes it more important to improve a team’s visibility into its processes.
In this phase of the DevOps journey, your teams will
- Create dashboards to share insights into their work.
- Track how changes affect application and infrastructure health and performance.
- Create an incident-response process and learn from incidents.
- Establish measurements for delivering code quickly and reliably.
Create shared insights on reliability issues and business goals within teams
An important DevOps tenet concerns collaboration within teams—including a shared understanding of what work is happening, when, and where. Dashboards enable such collaboration by helping teams align with business goals, and by giving teams insights into how an application’s performance impacts the larger business.
When issues arise, DevOps teams can use dashboards to focus troubleshooting efforts on a manageable number of endpoints and service layers, reducing the time to detection or resolution. Team dashboards also give DevOps teams a single view with which to visualize the SLIs and KPIs for their applications.
Fostering collaboration in this manner also mitigates the risk of friction. Teams, for example, can use dashboards during stand-ups to guide the day’s work. They can also use business performance dashboards as a single source of truth for broader observation about your business as a whole.
Dashboards also help ensure that your DevOps teams’ release and maintenance processes become more predictable, even as they gain confidence in their ability to deploy faster and more frequently; for example, by providing a single, shared source of component status updates.
Understand how changes affect your application and infrastructure
DevOps is a cultural shift that moves your teams toward more frequent but less risky code and infrastructure changes. By properly instrumenting your application, you’ll be able to integrate measurement of the development process (via team dashboards) with deployment markers and infrastructure monitoring that immediately reveal the impact of any changes and minimize the effort needed to troubleshoot service degradations. Capturing tangible, measurable metrics from before and after each change will allow your DevOps teams to optimize changes in isolation, reduce the risk that changes will impact other ongoing work, and increase infrastructure agility while shortening product feature cycles.
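To illustrate the before-and-after comparison, here is a minimal Python sketch that compares a metric in equal time windows on either side of a deployment marker. The function name, the 30-minute window, and the sample values are assumptions made for this example.

```python
from datetime import datetime, timedelta
from statistics import mean


def compare_around_deploy(samples, deployed_at, window=timedelta(minutes=30)):
    """Compare mean metric values in equal windows before and after a deployment.

    samples: iterable of (timestamp, value) pairs, for example response time in ms.
    Returns (before_mean, after_mean, relative_change), or None if either window is empty.
    relative_change is None when the before mean is zero.
    """
    samples = list(samples)
    before = [v for t, v in samples if deployed_at - window <= t < deployed_at]
    after = [v for t, v in samples if deployed_at <= t < deployed_at + window]
    if not before or not after:
        return None
    b, a = mean(before), mean(after)
    change = (a - b) / b if b else None
    return b, a, change


# Hypothetical response times (ms) around a noon deployment.
deploy = datetime(2024, 1, 1, 12, 0)
response_times = [
    (datetime(2024, 1, 1, 11, 40), 100),
    (datetime(2024, 1, 1, 11, 50), 120),
    (datetime(2024, 1, 1, 12, 5), 150),
    (datetime(2024, 1, 1, 12, 15), 180),
]
result = compare_around_deploy(response_times, deploy)
print(result)
```

A simple comparison like this will not account for normal traffic patterns, so treat a large change as a prompt to investigate rather than as proof that the deployment caused it.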
Consider a few examples of metrics that support these goals:
- Mean Time to Resolution (MTTR) tells everyone how quickly the organization is recovering from problems, on average. Too many incidents that run too long can threaten your business, so there’s always pressure to resolve incidents faster. Use high MTTR as a prompt to dig into reliability challenges, but don’t attach too much significance to “mean time” measurements in isolation or you’ll motivate unhealthy behaviors and undermine success.
Find out more about reducing MTTR the right way in our best practices for effective incident resolution.
- Deployment status tells you about the health of your deployments. A common DevOps performance indicator is the number of deployments within a set time span, reflecting the fact that more deployments usually mean smaller changes, reduced risk, and more opportunities to experiment. Deployment status, however, is just as important. After all, having more deployments is not an improvement if most of them fail!
- Unit tests tell you about the health of your codebase and enable your development teams to achieve quick wins.
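As a sketch of how the first two metrics in this list might be calculated, the Python below derives MTTR from incident open and resolve times and a success rate from deployment statuses. The record shapes and the status strings are assumptions for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """incidents: list of (opened_at, resolved_at) datetimes. Returns a timedelta."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def deployment_success_rate(statuses):
    """statuses: list of strings such as "success" or "failed"."""
    return sum(1 for s in statuses if s == "success") / len(statuses)


# Two hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]
mttr = mean_time_to_resolution(incidents)
success_rate = deployment_success_rate(["success", "success", "failed", "success"])
print(mttr, success_rate)
```

Because MTTR is an average, a single long incident can dominate it, which is one reason not to read too much into the number in isolation.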
Create a common incident response process
DevOps organizations need a well-defined incident response process to share across all engineering teams and functions. Your DevOps teams need a predictable framework and process to respond to incidents more efficiently and to minimize the overall business impact of incidents.
Beyond adopting a DevOps model, some incident response best practices include:
- Balancing autonomy and accountability. A successful on-call process depends on the composition of the team, the services they manage, and the team’s collective knowledge of the services. This is where team autonomy comes into play; for example, allowing each DevOps team to create its own on-call system, which should reflect the needs and capabilities of the team.
- Tracking and measuring on-call performance. It’s useful to track on-call metrics at the individual engineer, team, and group levels; for example:
  - The total number of pages per engineer
  - The number of hours during which an engineer was paged
  - The number of off-hours pages received (those that occur outside of normal business hours)
- Developing a system to assess incident severity. Effective incident response begins with a system to rank incidents based on their severity, usually measured in terms of customer impact. Each incident level should involve a specific protocol for managing the response, and for communicating with internal and external customers.
- Defining and assigning response team roles. A good incident response team should have, among other roles, an incident commander, a tech lead, and a communications lead—each with clearly defined authority and duties.
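The three on-call metrics listed above can be tallied from a simple log of pages. The following Python sketch assumes each page is recorded as an (engineer, timestamp) pair and that business hours run from 09:00 to 17:00 on weekdays; both the record shape and the schedule are hypothetical, so adjust them to match your own on-call system.

```python
from collections import Counter
from datetime import datetime


def is_off_hours(ts, start_hour=9, end_hour=17):
    """Treat weekends and anything outside 09:00-17:00 as off-hours (an assumed schedule)."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def on_call_summary(pages):
    """pages: list of (engineer, timestamp) pairs. Returns per-engineer metrics."""
    total = Counter()
    off_hours = Counter()
    paged_hours = {}
    for engineer, ts in pages:
        total[engineer] += 1
        if is_off_hours(ts):
            off_hours[engineer] += 1
        # Count each distinct clock hour in which the engineer was paged once.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        paged_hours.setdefault(engineer, set()).add(hour)
    return {
        e: {"pages": total[e],
            "hours_paged": len(paged_hours[e]),
            "off_hours_pages": off_hours[e]}
        for e in total
    }


# Hypothetical pages: two in the same weekday hour, one overnight, one on a Saturday.
pages = [
    ("ana", datetime(2024, 3, 4, 10, 5)),
    ("ana", datetime(2024, 3, 4, 10, 40)),
    ("ana", datetime(2024, 3, 5, 2, 15)),
    ("raj", datetime(2024, 3, 9, 14, 0)),
]
summary = on_call_summary(pages)
print(summary)
```

The same per-engineer figures can be summed to produce team-level and group-level views.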
Your incident response process and framework should be clear, consistent, and repeatable. A successful incident response process will also help to reduce alert fatigue and improve your DevOps teams’ morale, even as it reduces the risk that an incident will degrade the customer experience.
Learn from incidents and stop recycling problems
Every incident provides your teams an opportunity to learn, improve, and grow—and to avoid recycling the same problems over and over.
Incidents, for example, often point to important vulnerabilities in your systems, making them a valuable starting point for reliability efforts. Create a process for learning from incidents, and encourage your teams to improve existing KPIs and incident response patterns and to adapt when new challenges surface. The goal is to learn first, then fix things.
After resolving an incident, key stakeholders and participants must capture accurate and thorough documentation of the incident. The preferred way to do this is to hold a blameless retrospective that focuses on constructive learning and improvement, not punishment or blame.
During retrospectives, thoroughly document the discussion, including:
- An analysis of all factors that contributed to the incident
- A chronology and summary of remediation steps and their result, whether successful or not
- Recommendations for system or feature improvements to prevent a recurrence
- Recommendations for process and communication improvements
Store retrospective reports in a highly visible, searchable repository, such as a shared drive folder or wiki.
Measure ability to deliver code frequently and reliably
Another central tenet of DevOps involves building a process in which your teams can move dozens of code commits per day from a source code repository, through the build-and-test process, and into production deployment—without impacting quality or adding risk to the development cycle.
High-functioning DevOps teams instrument their pipelines to do precisely this, pushing changes to production more frequently and with lower risk.
Here are four best practices for measuring your team’s code pipeline:
- Start with source-code management. Capturing time-stamped state changes to your pipeline is critical for analyzing your pipeline performance and especially for troubleshooting source code errors.
- Recorded state changes are useful metrics for process improvements. When you push a change, you want to capture its status: Was it successful? If it failed, why? Such data often feeds metrics that a DevOps team can track against internal goals and process assessments; for example, increasing deployment frequency or build quality.
- Report the results of unit tests, and then go deeper. Unit test results are an obvious source of output to send to New Relic. Pass/fail results give you a handle for assessing real-time pipeline performance, and they’re also useful for assessing and improving a development team’s growth and progress over the longer term.
- Learn more from successful deployments than failed ones. Using deployment markers in conjunction with application performance data allows you to trace a performance degradation back to the exact change that caused it, and teams can also configure alerts for real-time notification when such correlations occur.
When your DevOps team instruments its code pipeline, it can prioritize reliability work by identifying services with frequent deployment failures or gaps in test coverage, ensuring that it doesn’t sacrifice quality in the pursuit of velocity.
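As a minimal sketch of what recorded state changes make possible, the Python below stores pipeline events and uses them to find services with frequent deployment failures. The `PipelineEvent` shape, the stage and status names, and the 25% threshold are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PipelineEvent:
    """One time-stamped state change in the pipeline (hypothetical shape)."""
    service: str
    stage: str      # for example "build", "test", or "deploy"
    status: str     # "success" or "failed"
    at: datetime


def deploy_failure_rates(events):
    """Share of failed deploy events per service."""
    counts = defaultdict(lambda: [0, 0])  # service -> [failed, total]
    for e in events:
        if e.stage == "deploy":
            counts[e.service][1] += 1
            if e.status == "failed":
                counts[e.service][0] += 1
    return {svc: failed / total for svc, (failed, total) in counts.items()}


def needs_reliability_work(events, threshold=0.25):
    """Services whose deploy failure rate exceeds the (assumed) threshold."""
    return sorted(s for s, rate in deploy_failure_rates(events).items() if rate > threshold)


# Hypothetical events for two services.
events = [
    PipelineEvent("checkout", "deploy", "success", datetime(2024, 1, 8, 9, 0)),
    PipelineEvent("checkout", "deploy", "failed", datetime(2024, 1, 8, 11, 0)),
    PipelineEvent("checkout", "deploy", "failed", datetime(2024, 1, 9, 10, 0)),
    PipelineEvent("checkout", "deploy", "success", datetime(2024, 1, 9, 15, 0)),
    PipelineEvent("search", "test", "failed", datetime(2024, 1, 8, 9, 30)),
    PipelineEvent("search", "deploy", "success", datetime(2024, 1, 8, 10, 0)),
    PipelineEvent("search", "deploy", "success", datetime(2024, 1, 9, 10, 0)),
]
rates = deploy_failure_rates(events)
flagged = needs_reliability_work(events)
print(rates, flagged)
```

Because every event carries a timestamp, the same records could also feed deployment-frequency or stage-duration calculations.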
Phase 3: Optimize
At this point, you’ve completed the first two phases of your DevOps transition and are starting to see success within your teams. Now is the time to level up the rest of the engineering organization—demonstrating and delivering the full business value of the DevOps operating model.
In this final phase, you’ll focus on optimizing your DevOps teams. You’ll
- Resolve dependency risks within your applications.
- Measure and iterate on customer experience.
- Improve infrastructure resource allocation.
- Automate your instrumentation.
- Create a cross-functional operations review to track your success and identify areas for improvement.
Resolve application dependency risks
Successfully scaling DevOps practices across an engineering organization requires a robust understanding of dependencies across application teams and related services. A microservices architecture, for example, likely involves dozens, if not hundreds, of services that make requests to one another, and your DevOps teams must understand how to mitigate risky up- and downstream dependencies in these complex environments.
Building visibility into critical dependencies improves collaboration across teams—reducing outages and supporting more consistent performance.
Working from both frontend and backend services, begin by creating an action plan to reduce dependency risks and achieve your SLOs. As you do so, keep these four principles in mind:
- Understand your risk tolerance. It’s helpful to have a clear picture of your tolerance for risk, which ideally should be informed by your SLOs. Use alert policies to monitor the dependencies that most strongly affect your ability to meet your SLOs.
- Minimize dependencies. Removing unnecessary complexity is an important way to ensure you have a maintainable system that meets your customers’ expectations.
- Localize dependencies. When your teams write code, encourage them to package together functions that depend on each other whenever possible.
- Stabilize dependencies. When dependencies are unavoidable, mitigate risks by ensuring dependencies point to modules that are the least likely to change or are easier to substitute.
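A useful first step toward applying these principles is to map which services depend on which. The Python sketch below assumes a hand-maintained dictionary of direct calls between hypothetical services; it lists everything a service depends on, directly or indirectly, and counts how many services call each dependency.

```python
def transitive_dependencies(graph, service):
    """Every service that `service` depends on, directly or indirectly.

    graph maps a service name to the list of services it calls directly.
    """
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def fan_in(graph):
    """How many services call each dependency directly; high counts mark shared risk."""
    counts = {}
    for deps in graph.values():
        for dep in deps:
            counts[dep] = counts.get(dep, 0) + 1
    return counts


# Hypothetical service map.
graph = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "auth"],
    "search": ["auth"],
    "payments": ["auth"],
}
web_deps = transitive_dependencies(graph, "web")
callers = fan_in(graph)
print(sorted(web_deps), callers)
```

In this example the `auth` service is called by three others, so it would be a natural candidate for closer monitoring and for extra care when it changes.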
After you complete your action plan, monitor the results. Your SLOs should reveal whether your efforts to resolve dependency risks are paying off.
Improve customer experience
An efficient, well-functioning DevOps culture enables organizations to make rapid, frequent releases and product changes. Such environments also enable teams to share data about the customer experience with other stakeholders, including your customer service, support, sales, and marketing teams.
A single point of reference, such as a dashboard, brings business-level information together with performance data and makes it all easy to share across teams. When deciding how or with whom to share your dashboards, consider the following questions:
- Which teams are responsible for applications that have high levels of end-user interaction?
- What non-engineering teams could benefit from this information?
  a. Customer support: Could customer issues be resolved faster?
  b. Product/engineering: Could product make more informed roadmap decisions?
  c. Customer success: Can this data be used to make customers more successful?
  d. What other teams can benefit from end-user analysis that includes performance metrics?
A clear understanding of what creates successful customer experience will help your DevOps teams drive greater efficiencies in their work efforts and, in turn, deliver greater productivity.
Optimize infrastructure resource allocation
No DevOps transformation is complete until you’ve optimized your infrastructure resources to operate more efficiently without degrading application performance. Whether you’re in the cloud or on premises, better utilization of your resources is key. You need the ability to scale, but you shouldn’t pay for resources you don’t need.
Often, this comes down to a decision between downsizing and consolidating resources. In most cases, it’s more cost-effective to consolidate applications onto larger hosts than it is to downsize host count and run fewer applications on smaller hosts.
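To get a feel for where consolidation might pay off, you can start from peak utilization figures per host. The Python sketch below is a rough, hypothetical estimate: the 40% and 70% cutoffs, the host names, and the assumption that all hosts are the same size are illustrative choices, not recommendations.

```python
import math


def consolidation_candidates(hosts, max_peak=0.40):
    """Hosts whose peak CPU and memory use both stay under max_peak (an assumed cutoff)."""
    return sorted(name for name, (cpu, mem) in hosts.items()
                  if cpu < max_peak and mem < max_peak)


def hosts_needed(hosts, target=0.70):
    """Rough lower bound on same-sized hosts needed if load were packed to `target` utilization."""
    cpu = sum(c for c, _ in hosts.values())
    mem = sum(m for _, m in hosts.values())
    return max(1, math.ceil(max(cpu, mem) / target))


# Hypothetical peak (CPU, memory) utilization per host, as fractions of capacity.
hosts = {
    "app-1": (0.20, 0.30),
    "app-2": (0.15, 0.25),
    "app-3": (0.75, 0.60),
    "app-4": (0.10, 0.35),
}
candidates = consolidation_candidates(hosts)
needed = hosts_needed(hosts)
print(candidates, needed)
```

An estimate like this ignores whether peaks coincide and how applications interact, so validate any consolidation against your alerts and SLOs before and after the change.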
Containerization also figures very prominently in most DevOps teams’ optimization efforts. Container orchestration platforms like Kubernetes and Amazon Elastic Container Service (ECS) provide an efficient means for managing compute resources, handling the distribution of container instances based on the available capacity within host clusters.
Beyond such infrastructure changes, DevOps teams can also use proactive alerting and team dashboards to ensure efficient usage of infrastructure resources, while still knowing they’ll quickly detect any impact on customer experience.
Automate your instrumentation
Moving fast in DevOps means reducing toil. Visibility should be the default for DevOps teams, not a burden. As a team’s application scales, it becomes increasingly important (and increasingly complicated) to effectively monitor the entire software lifecycle, from code development through build and deploy to alerting.
Whenever possible, your DevOps teams should automate tasks with CLIs and reduce toil as their development ecosystem grows by replacing manual instrumentation with an automated setup.
Create a cross-functional operations review
The ultimate goal of DevOps is the ability for teams to deliver stable services in a timely manner that meets customer expectations. To ensure your DevOps teams can continually meet this goal over time, create a cross-functional team with which you hold regular operations reviews. The best cross-functional teams have broad representation, including:
- Product owners, engineering managers, and technical leads
- Individual contributors from DevOps teams that develop applications, work with service delivery, and maintain your ecosystem
- Representatives from business operations, marketing, and support
Use the time with your cross-functional team to ensure that your service-delivery process is strongly integrated with customer expectations. Work on a weekly or bi-weekly basis to identify how and where technical improvements meet customer expectations, or find new ways to ensure they do. (Ideally, you’d track your service record against the SLOs established by your DevOps teams.) Dashboards are a great way to home in on specific time periods and metrics so you can gain a quantified understanding of whether you’re meeting service-delivery objectives or need to set specific actions to improve delivery.
Modern software practices, like those described in this ebook, can lead your teams to faster feature delivery, fewer incidents, and more experimentation. Forward-thinking organizations that have made the leap to a DevOps operating model are already using the gains to separate themselves from their competitors. They’ve eliminated silos, streamlined their tools and processes, and improved communication channels to break through the barriers to DevOps adoption.
So, is your organization ready to become a lean, mean DevOps machine? We can show you how to get started—and how to stay focused on success.
Our prescriptive Guide to Measuring DevOps Success—written by New Relic solutions engineers and DevOps experts—walks you through each phase (Prepare, Activate, and Optimize) and details how you can use New Relic to track every step, be deliberate about every decision, and increase the odds of your DevOps success.
Our prescription is flexible enough to be customized based on your maturity and specific needs. By collecting data about all stages of your growth, you’ll have invaluable guardrails to better understand how your DevOps efforts impact your overall business every step of the way.