On October 19 and 20, 2025, the digital landscape, heavily reliant on Amazon Web Services (AWS), experienced a severe disruption due to an outage in the AWS Northern Virginia (us-east-1) region. The incident began around 11:49 PM PDT on October 19, lasted more than 15 hours, and impacted more than 140 AWS services.

The Domino Effect

A single DNS failure affecting the DynamoDB API endpoint in us-east-1 triggered cascading failures across these services. The repercussions were widespread, affecting AWS's own offerings such as Alexa and Amazon.com, major customers such as Snapchat, PayPal’s Venmo, and Reddit, and widely used tools including Docker and Zoom.
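
To make the failure mode concrete, here is a minimal sketch of the kind of external DNS probe that would surface a resolution failure for the DynamoDB regional endpoint early. The endpoint name is the public regional one; the alerting hand-off is just a print statement standing in for whatever tooling you use.

```python
import socket
import time

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # public regional DynamoDB endpoint

def resolves(hostname: str) -> bool:
    """Return True if the hostname currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False

if __name__ == "__main__":
    while True:
        if not resolves(ENDPOINT):
            print(f"ALERT: DNS resolution failed for {ENDPOINT}")  # hand off to your alerting tool
        time.sleep(60)  # probe once a minute
```

A check like this is no substitute for full observability, but it illustrates how an independent, outside-in probe can flag a dependency failure before downstream symptoms pile up.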

Many internal AWS services depend on DynamoDB to store critical data, so the initial DNS failure triggered a cascade of secondary disruptions:

  1. EC2 Launch Issues: Although the DNS issue was resolved around 2:24 AM PDT on October 20, a new problem surfaced in the internal EC2 subsystem responsible for launching instances. Because that subsystem relies on DynamoDB, attempts to launch new instances frequently failed with “Insufficient Capacity” errors (see the retry sketch after this list).
  2. Network Connectivity Problems: While working on the EC2 issue, AWS discovered that health checks for Network Load Balancers were failing. This led to widespread network connectivity issues across multiple services, including DynamoDB, SQS, and Amazon Connect.
  3. Mitigation Efforts and Backlogs: To contain the cascading failures, AWS temporarily throttled certain operations, such as new EC2 instance launches, SQS polling via Lambda Event Source Mappings, and asynchronous Lambda invocations. While this helped stabilize core services, it created backlogs in systems like AWS Config, Redshift, and Amazon Connect, which required several hours to fully process even after service recovery.
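
As an illustration of how client code can cope with the second failure mode above, here is a hedged sketch (not AWS guidance) that retries instance launches with exponential backoff when they fail with the InsufficientInstanceCapacity error code. The AMI ID and instance type are placeholders.

```python
import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_backoff(ami_id: str, instance_type: str, max_attempts: int = 5):
    """Try to launch one instance, backing off when capacity errors occur."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ec2.run_instances(
                ImageId=ami_id, InstanceType=instance_type, MinCount=1, MaxCount=1
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                raise  # unrelated failure: surface it instead of retrying
            time.sleep(2 ** attempt)  # exponential backoff; consider another AZ or type too
    raise RuntimeError("Capacity still unavailable after retries")

# Example (placeholder values): launch_with_backoff("ami-0123456789abcdef0", "m5.large")
```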

These events illustrate how critical interdependencies within the AWS ecosystem can amplify the impact of a single failure. Read more on the AWS Health Dashboard and in the Service Report.

The Business Impact: Outage Costs and Why Observability Matters

For AWS customers and organizations heavily dependent on cloud platforms and services, multi-hour outages spread across multiple AWS services can have a significant business impact.

The high cost of failure

The Observability Forecast 2025 highlights the staggering financial impact of outages. An outage of an application, a platform, or even a global SaaS offering can cost organizations a median of $2.2 million per hour, or roughly $37,000 per minute.

Implementing Full-Stack Observability (FSO) directly enhances resilience and mitigates this risk. The Forecast further reveals that organizations utilizing FSO can significantly reduce outage costs to $1 million per hour, compared to $2 million per hour for those without it.
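
To put those figures in day-to-day terms, here is a back-of-the-envelope calculation using the median hourly cost quoted above and an outage duration in line with this incident. Your own cost per hour will differ and should be substituted.

```python
# Median hourly outage cost quoted by the Observability Forecast 2025 (above).
MEDIAN_COST_PER_HOUR = 2_200_000  # USD

def outage_cost(hours: float, cost_per_hour: float = MEDIAN_COST_PER_HOUR) -> float:
    """Rough cost of an outage of the given duration at the given hourly rate."""
    return hours * cost_per_hour

print(f"Per minute: ${MEDIAN_COST_PER_HOUR / 60:,.0f}")   # about $36,667
print(f"A 15-hour incident: ${outage_cost(15):,.0f}")     # about $33,000,000
```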

Eliminating reactive firefighting

AWS outages, particularly those requiring throttling actions, place a significant burden on engineering teams. On-call engineers, DevOps personnel, and SREs dedicate approximately 33% of their collective time to addressing service disruptions and resolving the issues and incidents they cause.

This is precisely where observability shifts the paradigm:

  • Faster Detection: Organizations that implement observability tools like New Relic for their Full-Stack Observability (FSO) achieve faster detection of critical outages. On average, their Mean Time to Detect (MTTD) is 28 minutes, compared to 35 minutes for those without FSO solutions.
  • AI-powered Responses and Automated RCA: Given the inherent complexity of modern distributed systems, human operators often find themselves overwhelmed, making artificial intelligence (AI) an indispensable asset. This reality is reflected among executives and IT leaders, who identify AI-assisted troubleshooting (38%) and automatic root cause analysis (RCA) (33%) as crucial capabilities. These AI-driven approaches are seen as vital for accelerating incident resolution and significantly limiting the fallout from major events, such as the AWS cascade.
  • End-to-End Tracing: Distributed tracing is a crucial tool for preventing and resolving outages, offering a way to track requests as they move across interconnected back-end services. This end-to-end visibility is vital: when a problem arises in a back-end service, such as a database failure, distributed tracing helps engineers pinpoint exactly which services are degrading the customer experience through slow page loads or errors. In turn, back-end engineers can clearly see how infrastructure issues are directly impacting their customers (a tracing sketch follows this list).
  • Alert Correlation: Observability tools like New Relic streamline incident management by intelligently grouping related alerts. This reduces noise and accelerates root cause identification by uncovering correlation patterns tied to specific incident scenarios. Such functionality is essential for navigating the complexities of multi-component failures.
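
To ground the tracing point, here is a minimal sketch using the OpenTelemetry Python API (an open standard New Relic can ingest). The service name, span names, and the DynamoDB helper are illustrative placeholders, and exporter setup is omitted.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # illustrative service name

def fetch_cart_from_dynamodb(user_id: str) -> dict:
    # Stand-in for a real DynamoDB GetItem call (e.g. via boto3).
    return {"user": user_id, "items": []}

def load_cart(user_id: str) -> dict:
    # Parent span ties the front-end request to all downstream work.
    with tracer.start_as_current_span("GET /cart") as span:
        span.set_attribute("user.id", user_id)
        # Child span around the dependency that actually failed in this outage.
        with tracer.start_as_current_span("dynamodb.GetItem") as db_span:
            db_span.set_attribute("db.system", "dynamodb")
            try:
                return fetch_cart_from_dynamodb(user_id)
            except Exception as exc:
                db_span.record_exception(exc)  # the failing hop is visible in the trace
                raise

print(load_cart("user-123"))
```

With spans structured this way, a DynamoDB failure shows up as an errored child span inside the front-end request, so both front-end and back-end teams see the same degraded hop.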

Validation of Recovery

While observability tools help reduce MTTD, it is just as important to track Mean Time to Resolution (MTTR), which calls for active monitoring to confirm that everything is genuinely back to normal.

While the AWS Health Dashboard may mark an event as "resolved," services frequently still contend with backlogs. These often stem from SQS queues, background processes triggering Lambda functions, or other third-party dependencies. Observability, therefore, provides the critical empirical evidence needed to confirm that service quality has genuinely returned to normal.
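
As one concrete piece of that evidence, here is a hedged sketch that samples an SQS queue's approximate depth to confirm a backlog is actually draining after the dashboard turns green. The queue URL and sampling interval are assumptions.

```python
import time
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # placeholder

def backlog_depth(queue_url: str) -> int:
    """Approximate number of messages still waiting in the queue."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

if __name__ == "__main__":
    previous = backlog_depth(QUEUE_URL)
    while previous > 0:
        time.sleep(300)  # sample every five minutes
        current = backlog_depth(QUEUE_URL)
        print(f"Backlog: {current} messages ({previous - current} drained in 5 minutes)")
        previous = current
```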

  • Confirmed Uptime and Reliability: Observability confirms the application is meeting its core business goal of system uptime and reliability.
  • Synthetic Monitoring: Synthetic monitoring allows teams to run continuous checks to confirm that application endpoints are responding correctly post-recovery (a sample check follows this list).
  • Measuring Resolution Success: Observability practices like monitoring DORA metrics and the golden signals (latency, traffic, errors, and saturation) help confirm improvements in MTTD and MTTR following recovery efforts and procedural changes.
  • Change Tracking: Correlating deployments and configuration changes with shifts in telemetry is critical because it supports advanced automation, such as AI-assisted remediation actions like rollbacks or configuration updates.
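
Here is what a bare-bones synthetic check along the lines of the second bullet might look like. The URLs, latency budget, and exit-code convention are assumptions you would replace with your own monitors.

```python
import time
import requests

CHECKS = {
    "homepage": "https://example.com/",
    "checkout API": "https://example.com/api/checkout/health",
}
LATENCY_BUDGET_SECONDS = 2.0  # illustrative budget per endpoint

def run_checks() -> bool:
    healthy = True
    for name, url in CHECKS.items():
        started = time.monotonic()
        try:
            response = requests.get(url, timeout=10)
            elapsed = time.monotonic() - started
            ok = response.status_code == 200 and elapsed <= LATENCY_BUDGET_SECONDS
        except requests.RequestException:
            ok, elapsed = False, time.monotonic() - started
        healthy &= ok
        print(f"{name}: {'OK' if ok else 'FAIL'} ({elapsed:.2f}s)")
    return healthy

if __name__ == "__main__":
    raise SystemExit(0 if run_checks() else 1)
```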

It’s an AWS Outage: What Can You Do?

Under the shared responsibility model, AWS is responsible for resolving the incident and restoring service functionality for customers using AWS services. But what can we, as customers, proactively do? Our involvement extends beyond merely monitoring the AWS Health Dashboard.

When it comes to handling service disruptions, it's not enough to simply have a Disaster Recovery (DR) strategy, a multi-region setup, or even sophisticated region failover systems in place. The most critical first step is achieving clear visibility into which services are actually impacted during an incident. This foundational awareness must come before any recovery plan can be effectively executed.

In modern cloud environments, architectures are often built using AWS services as interconnected Lego blocks. This complexity is magnified in microservices and distributed systems, where AWS services are repeatedly leveraged across the architecture, creating a web of dependencies that can be difficult to untangle during an outage. Without real-time visibility, identifying the root cause and the full scope of the impact becomes a significant challenge.

  • Identifying the Impacted Services in Your Stack: The affected AWS service may have impacted your entire system or platform, or only a small component of it. Observability provides the clarity you need to identify which services were hit, so you can address issues efficiently (see the sketch after this list).
  • Monitor Golden Signals: Track the golden signals within the failover environment to verify its stability and performance, confirming that the disaster recovery (DR) strategy is operating as intended.
  • Quantifying Revenue Loss: Observability extends to business outcomes. The New Relic Pathpoint application allows customers to visualize the customer journey and quantify the financial impact on business metrics, including the potential revenue lost for each minute of downtime.
  • Alerts and Dashboards: Leverage your unified alerts view to quickly pinpoint all services affected by an AWS failure, and promptly inform dependent teams to establish comprehensive situational awareness. Use a centralized dashboard to view the health and key metrics of the application or platform.
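
For the first bullet, here is a hedged sketch that pulls open AWS Health events for the affected region so you can cross-reference them with the AWS services your stack actually uses. Note that the AWS Health API requires a Business, Enterprise On-Ramp, or Enterprise Support plan.

```python
import boto3

# The AWS Health API is a global service served from us-east-1.
health = boto3.client("health", region_name="us-east-1")

def open_events_in_region(region: str = "us-east-1"):
    """List AWS Health events that are still open for the given region."""
    response = health.describe_events(
        filter={"regions": [region], "eventStatusCodes": ["open"]}
    )
    return response["events"]

if __name__ == "__main__":
    for event in open_events_in_region():
        print(event["service"], event["eventTypeCode"], event["startTime"])
```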

Daniela Miao (Co-founder and CTO at Momento) shares that their alerting detected the downtime a full 17 minutes before the AWS Health Dashboard updated its status, enabling their on-call engineer to respond immediately. Choosing the right parameters for your alerts is crucial for proactive incident management and can significantly reduce the impact of potential outages.
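
To illustrate why those parameters matter, here is a small, self-contained sketch showing how the evaluation window alone changes detection time for the same error spike. The threshold, window sizes, and sample data are illustrative, not New Relic defaults.

```python
from collections import deque

THRESHOLD = 0.05       # alert when more than 5% of requests error
WINDOW_MINUTES = 5     # sliding evaluation window

def should_alert(error_rates, window=WINDOW_MINUTES, threshold=THRESHOLD):
    """Fire once every sample in the sliding window breaches the threshold."""
    recent = deque(maxlen=window)
    for minute, rate in enumerate(error_rates):
        recent.append(rate)
        if len(recent) == window and min(recent) > threshold:
            return minute  # first minute at which the alert fires
    return None

# One sample per minute: healthy traffic, then an outage beginning at minute 10.
samples = [0.01] * 10 + [0.40] * 30
print(should_alert(samples))             # fires at minute 14 with a 5-minute window
print(should_alert(samples, window=15))  # fires at minute 24 with a 15-minute window
```

The same spike is caught ten minutes earlier simply by shortening the window, which is exactly the kind of tuning that separates a 17-minute head start from finding out via the status page.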

Wrapping Up

During an AWS outage, there’s little within our control to immediately say, “Let’s deploy the fixes!” It’s also not the right time to reassess your disaster recovery (DR) strategy or re-evaluate the architecture itself, especially if you’re already following a multi-region setup, as this outage specifically impacted us-east-1. Instead, the focus should be on leveraging observability tools like New Relic to make sense of telemetry data (MELTX) and quickly identify what is affected and where.

While such outages can cause significant business disruption, as engineers our priority is understanding what’s broken, which parts are impacted, and to what extent. Observability platforms like New Relic provide full-stack visibility across your architecture, from front-end applications and APM to databases and infrastructure. This includes not just VMs, containers, and Kubernetes clusters, but also insights into your AWS environment and your cloud provider’s health. In critical moments like these, having a comprehensive observability strategy is essential for maintaining awareness and minimizing impact.
