In the past few years, I have had the opportunity to work with several customers on implementing observability for infrastructure and applications in day-to-day operations. One critical moment that shows the true success of the implementation is how the teams use the data when responding to incidents during the peak season.

For ecommerce companies, the peak season usually represents the biggest revenue opportunity of the year. The application and the technology stack it runs on are the means to capture revenue and to grow customer loyalty in an increasingly competitive landscape.

The teams of engineers that support the application stacks not only need to see what's happening, they also need to be able to isolate the cause of an issue quickly, mitigate it, and prioritize the effort based on business impact.

In this blog, you will learn the following:

  • Understanding peak readiness
  • The common mistakes when preparing for peak readiness
  • The step-by-step planning process for peak readiness

For more detailed information, please refer to the full document link at the end of this blog.

Understanding peak readiness

Availability is not the same as performance.

Peak ready, at a minimum, means that your application is continuously available to users under heavy load. However, there is much more involved in peak readiness than just application availability.

The application availability and performance should align with your company’s business objectives. After all, the purpose of your application is to provide a means to accomplish a set of business objectives. 

Everyone, from management to engineers, needs to be able to see the current health of the business, especially under heavy request load or during system issues (software or hardware), and to see what impact those conditions have on the business and on the customer experience. The ability to relate application performance to business impact is what determines the success of your peak readiness preparation.

Common mistakes when preparing for peak readiness

The following are common mistakes I've observed.

Mistake #1: Starting too late

The first common mistake is starting too late. Teams often begin peak readiness activities only one to two months before the peak season. With competing priorities and limited resources, starting that late means only a subset of tasks gets completed, leaving gaps and unknowns to chance.

Mistake #2: Not connecting application performance to business objectives

The team focuses only on application and infrastructure data from a technical perspective and does not understand how degradations and errors impact the revenue stream or customer satisfaction. Without a clear understanding of the business impact, the team can misjudge which issues matter most and prioritize them incorrectly.
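
To make the prioritization point concrete, here is a minimal sketch of ranking open issues by estimated revenue at risk rather than by raw error rate. The `Incident` fields, the sample numbers, and the simple formula (error rate × traffic × average revenue per request) are all illustrative assumptions, not figures or methods from any real system.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    error_rate: float           # fraction of failing requests, 0.0 to 1.0
    requests_per_minute: float  # traffic through the affected business process
    revenue_per_request: float  # assumed average revenue per successful request

    def revenue_at_risk_per_minute(self) -> float:
        # Rough estimate: failed requests per minute times revenue per request.
        return self.error_rate * self.requests_per_minute * self.revenue_per_request


def prioritize(incidents):
    """Return incidents sorted by estimated revenue at risk, highest first."""
    return sorted(incidents, key=lambda i: i.revenue_at_risk_per_minute(), reverse=True)


# Hypothetical example data.
incidents = [
    Incident("search latency", error_rate=0.20, requests_per_minute=1000, revenue_per_request=0.05),
    Incident("checkout errors", error_rate=0.02, requests_per_minute=300, revenue_per_request=40.0),
]
ranked = prioritize(incidents)
for incident in ranked:
    print(f"{incident.name}: {incident.revenue_at_risk_per_minute():.2f} per minute at risk")
```

With these assumed numbers, the checkout issue ranks first even though its error rate is ten times lower than the search issue's, which is the kind of ordering a purely technical view would miss.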

Mistake #3: Information silos

The modern application architecture of microservices, and the continuous integration/continuous deployment (CI/CD) practice, often create siloed engineering teams. Naturally, each team understands the ins and outs of what it is responsible for, but less about the dependencies, and even less about how its application performance affects the overall business process. This lack of big-picture understanding impedes how the teams share information and slows down triage during an incident.

Mistake #4: Not including peak readiness in development

When peak readiness activities span only the production environment, teams lose the opportunity to learn the application's normal behavior and to identify weak links, key indicators, effective alerts, and potential mitigation actions while developing and testing in non-production environments. What is observed in non-production environments can make the final release more resilient to peak demand.

How to properly plan for peak readiness

The following sections provide recommendations to guide and prepare you for peak events. The process is broken down into the following eight phases:

Phase 1: Review and planning

In this phase, the tasks are to identify the key members, the executive sponsor and the business objectives that are related to this project. It's also important to review the scope, objectives, measurements, and communication plan. 

Phase 2: Identifying critical business processes

The next step is to identify and describe critical business processes that capture revenue or have crucial implications on customer satisfaction and loyalty. These may be identified based on typical request fulfillment data flows.

Phase 3: Review data collection

The goal for this phase is to review the data required to support, at a minimum, both the performance data and the business data. Note that some level of custom instrumentation is often required for business data, such as the shopping cart value. When reviewing the available data, the team should consider metrics, events, logs, and traces (MELT), not just metrics and events. The New Relic platform offers capabilities and resources for leveling up observability maturity.

Phase 4: Production implementation

In this phase, the objective is to implement the tasks that are identified in the previous phase, in the production environment. The working unit is each application team. This may include additional agent deployment, agent update, custom instrumentation, configuring alert conditions, notification workflows, remedial actions, and required dashboards. 
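
As an illustration of the custom instrumentation mentioned above, the sketch below shows one way to put business fields (such as the shopping cart value) and performance fields in the same event record, so a dashboard or query can relate the two. It deliberately does not call any agent API: the event name, field names, and helper functions are hypothetical, and in practice the record would be handed to whatever custom-event or custom-attribute call your agent provides.

```python
import time


def build_checkout_event(order_id, cart_value, item_count, duration_ms, succeeded):
    """Combine business and performance fields in a single event record."""
    return {
        "eventType": "CheckoutAttempt",            # hypothetical event name
        "timestamp": int(time.time()),
        "orderId": order_id,
        "cartValue": round(float(cart_value), 2),  # business data
        "itemCount": int(item_count),              # business data
        "durationMs": float(duration_ms),          # performance data
        "succeeded": bool(succeeded),
    }


def failed_cart_value(events):
    """Sum the cart value of checkout attempts that did not succeed."""
    return sum(e["cartValue"] for e in events if not e["succeeded"])


# Hypothetical sample events.
events = [
    build_checkout_event("A-1", 120.50, 3, 850.0, succeeded=False),
    build_checkout_event("A-2", 64.99, 1, 310.0, succeeded=True),
    build_checkout_event("A-3", 35.25, 2, 1200.0, succeeded=False),
]
lost = failed_cart_value(events)
print(f"Cart value in failed checkouts: {lost:.2f}")
```

Keeping the business value on the same record as the timing and outcome is what lets a single query answer "how much revenue was in the checkouts that failed or were slow."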

Phase 5: Team review and joint operations

The focus of this phase is to share the results across the teams. This is especially important among teams that depend on one another. The goal is for neighboring teams to have a good understanding of what impact a problem in another application may have on their own application.
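
One way to make these cross-team dependencies explicit during the review is to write the service map down and compute who is affected when a given service degrades. The sketch below assumes a small, hypothetical service map (`DEPENDS_ON`); real maps would come from your own architecture or tracing data.

```python
from collections import deque

# Hypothetical service map: each key calls the services in its list.
DEPENDS_ON = {
    "storefront": ["search", "cart"],
    "cart": ["pricing", "inventory"],
    "checkout": ["cart", "payment"],
    "search": ["inventory"],
    "pricing": [],
    "inventory": [],
    "payment": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: service -> set of services that call it.
    callers = {}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    # Breadth-first walk upstream from the failed service.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


print(sorted(impacted_by("inventory", DEPENDS_ON)))
print(sorted(impacted_by("payment", DEPENDS_ON)))
```

In this assumed map, an inventory problem reaches four other services while a payment problem reaches only checkout, which tells the teams who needs to be in the room for each kind of incident.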

Phase 6: Final adjustments

This phase comes shortly before the code freeze, during which only critical bug fixes can be deployed. This means no new New Relic deployments either. However, it is another excellent opportunity to continue fine-tuning alerts and dashboards, because these activities have no impact on the application and the development team has better availability.
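
A simple way to approach alert fine-tuning is to replay recorded data against candidate settings and count how often each one would have fired. The sketch below is a generic illustration, not how any particular alerting product evaluates conditions: the per-minute error-rate samples, the threshold, and the "N consecutive samples above threshold" rule are all assumptions.

```python
def count_firings(samples, threshold, consecutive):
    """Count how many times an alert would open over recorded samples.

    The alert opens once `consecutive` samples in a row exceed `threshold`
    and closes again at the first sample at or below the threshold.
    """
    firings, streak, is_open = 0, 0, False
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive and not is_open:
            firings += 1
            is_open = True
        elif streak == 0:
            is_open = False
    return firings


# Hypothetical per-minute error rates recorded during a load test.
error_rates = [0.01, 0.06, 0.02, 0.07, 0.08, 0.09, 0.01, 0.06, 0.07, 0.02]

for n in (1, 2, 3):
    fired = count_firings(error_rates, threshold=0.05, consecutive=n)
    print(f"consecutive={n}: alert would have opened {fired} time(s)")
```

Comparing the counts across settings helps the team choose a condition that catches the sustained degradation without paging on every brief spike.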

Phase 7: Freeze period

This is game time. But if you've done the work in previous phases, there shouldn't be any surprises. You can still add/modify/delete alerts and dashboards as needed because those changes will not interfere with the applications. 

Phase 8: Lessons learned

After the peak season is over, as in any good project management practice, it's time to conduct lessons-learned sessions.

First, the key members of the peak readiness team and the application leaders/architects review the team's overall performance in responding to the issues that occurred. Second, the group discusses known environment changes for the next season, such as new software releases, new technology stacks, or new business processes, and their associated impact on observability.

Summary

Peak readiness is a project, not just a task. It requires step-by-step execution from the non-production to the production environment. It should involve people from development, operations, and management in the planning, preparation, and execution. It should also clearly connect business data with performance or operations data to ensure the effort is aligned with the business objectives.

When done correctly, peak season will be a much less stressful time for everyone involved. 

Please see the full document on peak readiness for more details and examples.