Over the past few years, I have had the opportunity to work with several customers on implementing observability for their infrastructure and applications in day-to-day operations. The true test of such an implementation is how the teams use the data when responding to incidents during the peak season.
For ecommerce companies, the peak season usually represents the biggest revenue opportunity of the year. The application and the technology stack it runs on are the means to capture that revenue and to grow customer loyalty in an increasingly competitive landscape.
The engineering teams that support the application stack not only need to see what’s happening; they also need to isolate the cause of an issue quickly, mitigate it, and prioritize their effort based on the business impact.
In this blog post, you will learn about the following:
- Understanding peak readiness
- Common mistakes when preparing for peak readiness
- The step-by-step planning process for peak readiness
For more detailed information, please refer to the full document linked at the end of this blog post.
Understanding peak readiness
Being available is not the same as being performant.
Peak ready, at a minimum, means that your application is continuously available to users under heavy load. However, there is much more involved in peak readiness than just application availability.
The application availability and performance should align with your company’s business objectives. After all, the purpose of your application is to provide a means to accomplish a set of business objectives.
Everyone, from management to engineers, needs to be able to see the current health of the business, especially under heavy request load or during system issues (whether software or hardware), and to see how those conditions affect the business and the customer experience. The ability to relate application performance to business impact is what determines the success of your peak readiness preparation.
Common mistakes when preparing for peak readiness
The following are common mistakes I've observed.
Mistake #1: Starting too late
The first common mistake is starting too late. Teams often begin peak readiness activities only one to two months before the peak season. With competing priorities and limited resources, a late start means that only a subset of tasks gets completed, leaving gaps and unknowns to chance.
Mistake #2: Not connecting application performance to business objectives
The team focuses on application and infrastructure data from a purely technical perspective, without understanding how degradations and errors affect the revenue stream or customer satisfaction. Without a clear understanding of the business impact, issues can easily be misprioritized.
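To make this concrete, here is a minimal sketch of ranking open issues by a rough estimate of revenue at risk rather than by technical severity alone. The `Issue` fields, the formula, and the sample numbers are illustrative assumptions, not figures from any real system; your own business data would replace them.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical description of an open issue and its business context."""
    name: str
    affected_requests_per_min: float  # requests hitting the degraded path
    failure_rate: float               # fraction of those requests that fail (0..1)
    conversion_rate: float            # fraction of requests that normally end in an order
    avg_order_value: float            # average order value, in your currency


def revenue_at_risk_per_min(issue: Issue) -> float:
    """Rough estimate of revenue lost per minute while the issue persists."""
    failed = issue.affected_requests_per_min * issue.failure_rate
    return failed * issue.conversion_rate * issue.avg_order_value


def prioritize(issues: list[Issue]) -> list[Issue]:
    """Order issues so the one with the largest estimated business impact comes first."""
    return sorted(issues, key=revenue_at_risk_per_min, reverse=True)


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    issues = [
        Issue("search latency", 1200, 0.02, 0.03, 80.0),
        Issue("checkout errors", 150, 0.30, 0.60, 80.0),
    ]
    for issue in prioritize(issues):
        print(f"{issue.name}: ~{revenue_at_risk_per_min(issue):.2f} per minute at risk")
```

In this made-up example, the checkout issue touches far fewer requests than the search issue but ranks first, because those requests sit much closer to the revenue.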
Mistake #3: Information silos
Modern microservice architectures and continuous integration/continuous deployment (CI/CD) practices often create siloed engineering teams. Naturally, each team understands the ins and outs of what it is responsible for, but knows less about the dependencies, and even less about how its application's performance affects the overall business process. This lack of big-picture understanding impedes information sharing and slows down triage during an incident.
Mistake #4: Not including peak readiness in development
When peak readiness activities span only the production environment, teams lose the opportunity to learn the application's normal behavior and to identify weak links, key indicators, effective alerts, and potential mitigation actions while developing and testing in non-production environments. What is observed in non-production environments can make the final release more resilient to peak demand.
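As one illustration of learning normal behavior before production, the sketch below derives a candidate alert threshold from response-time samples gathered during a non-production load test. The mean-plus-k-standard-deviations rule, the function name, and the sample values are assumptions chosen for simplicity; they are a starting point for discussion, not a recommended alerting policy.

```python
import statistics


def baseline_threshold(samples: list[float], k: float = 3.0) -> float:
    """Suggest an alert threshold as mean + k sample standard deviations."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return statistics.mean(samples) + k * statistics.stdev(samples)


if __name__ == "__main__":
    # Hypothetical response times (ms) observed during a load test.
    load_test_ms = [100, 110, 90, 105, 95]
    print(f"candidate threshold: {baseline_threshold(load_test_ms):.1f} ms")
```

A threshold derived this way would still need to be reviewed against production traffic, since non-production load rarely matches peak conditions exactly.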
How to properly plan for peak readiness
The following sections provide recommendations to guide and prepare you for peak events. The process is broken down into the following eight phases:
Phase 1: Review and planning
In this phase, the tasks are to identify the key members, the executive sponsor and the business objectives that are related to this project. It's also important to review the scope, objectives, measurements, and communication plan.
Phase 2: Identifying critical business processes
The next step is to identify and describe the critical business processes that capture revenue or have crucial implications for customer satisfaction and loyalty. These may be identified based on typical request fulfillment data flows.
Phase 3: Review data collection
The goal of this phase is to review the data required to support, at a minimum, both the performance data and the business data. Note that some level of custom instrumentation is often required for business data, such as the shopping cart value. When reviewing the available data, the team should consider metrics, events, logs, and traces (MELT), not just metrics and events. Here's a list of New Relic platform capabilities and resources available for leveling up your observability maturity:
- Outdated agent detection (using Agent Groundskeeper)
- Using logs in context to speed up mean time to resolution (MTTR) when troubleshooting application issues
- Troubleshooting with distributed tracing to identify slowness or errors across multiple services
- Improve your website and see web vitals with browser data
- Analyze and troubleshoot crashes with mobile data
- See your infrastructure health with infrastructure data
- See K8s and app performance in context
- Monitor your cloud services in AWS, GCP and Azure with infrastructure integrations
- Using workloads to focus on the things that matter to you
- Using errors inbox to help operations see errors in one place and collaborate
- Using Vulnerability Management to stay on top of critical vulnerabilities
- Follow alert quality management process to improve alert effectiveness
- Leverage service levels for service level management
- Automate dashboards and alerts using Terraform
- See the changes due to deployment or significant event with change tracking
- Using New Relic logs to reduce MTTR and troubleshoot faster
- Allow developers to see performance in their IDE with New Relic CodeStream
- Improve observability in general with observability maturity guides
- Monitor AWS Lambda, Azure Functions, and Google Cloud Functions
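The custom instrumentation for business data mentioned in this phase, such as the shopping cart value, could be sketched as follows. This is a minimal illustration under stated assumptions: the event name `CheckoutAttempt`, the attribute names, and the item fields are hypothetical, and the `record` callable stands in for whatever your agent provides for recording custom events.

```python
from typing import Callable, Iterable, Mapping

# A recorder takes an event type and a dict of attributes.
Recorder = Callable[[str, dict], None]


def cart_attributes(items: Iterable[Mapping[str, float]]) -> dict:
    """Summarize a cart into business attributes worth attaching to telemetry."""
    items = list(items)
    total = sum(item["unit_price"] * item["quantity"] for item in items)
    return {
        "cartValue": round(total, 2),
        "itemCount": sum(int(item["quantity"]) for item in items),
    }


def print_recorder(event_type: str, params: dict) -> None:
    """Fallback recorder that only prints; replace with your agent's API."""
    print(event_type, params)


def record_checkout(items: Iterable[Mapping[str, float]],
                    record: Recorder = print_recorder) -> dict:
    """Record one custom event carrying the cart's business attributes."""
    attrs = cart_attributes(items)
    record("CheckoutAttempt", attrs)  # hypothetical event name
    return attrs


if __name__ == "__main__":
    cart = [
        {"unit_price": 19.99, "quantity": 2},
        {"unit_price": 5.00, "quantity": 1},
    ]
    record_checkout(cart)
```

With the New Relic Python agent, `newrelic.agent.record_custom_event` accepts an event type and a dictionary of attributes, so it could likely be passed as `record`; check the agent documentation for your version before relying on that. Keeping the recorder injectable also makes the business logic easy to test in non-production environments.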
Phase 4: Production implementation
The objective of this phase is to implement, in the production environment, the tasks identified in the previous phase. Each application team is a working unit. This may include deploying additional agents, updating agents, adding custom instrumentation, and configuring alert conditions, notification workflows, remedial actions, and the required dashboards.
Phase 5: Team review and joint operations
The focus of this phase is to share the results across teams, which is especially important for teams that depend on one another. The goal is for neighboring teams to understand how their own application may be affected when something happens to another application.
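One way to support this kind of cross-team review is to write down which services depend on which, and then ask which applications would be affected if a given service failed. The sketch below does that with a simple breadth-first walk over an inverted dependency map. The service names and the shape of the `depends_on` mapping are hypothetical; in practice the map would come from your own architecture knowledge or tracing data.

```python
from collections import deque


def impacted_services(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each service, which services call it?
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in impacted and caller != failed:
                impacted.add(caller)
                queue.append(caller)
    return impacted


if __name__ == "__main__":
    # Hypothetical dependency map: service -> services it calls.
    depends_on = {
        "storefront": ["cart", "search"],
        "cart": ["inventory", "pricing"],
        "checkout": ["cart", "payments"],
        "search": [],
    }
    print(sorted(impacted_services(depends_on, "inventory")))
```

Walking through a few "what if this fails" scenarios like this in a joint session gives each team a concrete list of neighbors to talk to before the peak season.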
Phase 6: Final adjustments
This phase comes shortly before the code freeze, when only critical bug fixes can be deployed. This means no new New Relic deployments either. However, it is another excellent opportunity to continue fine-tuning alerts and dashboards, because these activities have no impact on the application whatsoever and the development team has more availability.
Phase 7: Freeze period
This is game time. But if you've done the work in previous phases, there shouldn't be any surprises. You can still add/modify/delete alerts and dashboards as needed because those changes will not interfere with the applications.
Phase 8: Lessons learned
After the peak season is over, as in any good project management practice, it's time to conduct lessons-learned sessions.
First, the key members of the peak readiness team and the application leaders and architects review the team's overall performance in responding to the issues that occurred. Second, they discuss known environment changes for the next season, such as new software releases, new technology stacks, or new business processes, and the associated impact on observability.
Peak readiness is a project, not just a task. It requires step-by-step execution from the non-production to the production environment. It should involve people from development, operations, and management in the planning, preparation, and execution. It should also clearly connect the business data with the performance and operations data so that the work stays aligned with the business objectives.
When done correctly, peak season will be a much less stressful time for everyone involved.
Please see the full document on peak readiness for more details and examples.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.