On October 19th and 20th, 2025, much of the digital landscape that relies on Amazon Web Services (AWS) experienced a severe disruption due to an outage in the AWS Northern Virginia (us-east-1) region. The incident began around 11:49 PM PDT on October 19th, lasted over 15 hours, and impacted more than 140 AWS services.
In this blog we’ll explore what happened, the impact it had, and what we can learn from the incident.
The domino effect
A single DNS failure affecting the DynamoDB API endpoint in us-east-1 triggered a cascading failure across those services. AWS details the disruption in its incident report.
Many internal AWS services depend on DynamoDB to store critical data, so the initial DNS failure triggered a cascade of secondary disruptions:
- EC2 Launch Issues: Although the DNS issue was resolved around 2:24 AM PDT on October 20, a new problem arose in EC2’s internal subsystem responsible for launching instances. That subsystem’s reliance on DynamoDB caused new instance launches to fail, often with “Insufficient Capacity” errors.
- Network Connectivity Problems: While working on the EC2 issue, AWS discovered that health checks for Network Load Balancers were failing. This led to widespread network connectivity issues across multiple services, including DynamoDB, SQS, and Amazon Connect.
- Mitigation Efforts and Backlogs: To contain the cascading failures, AWS temporarily throttled certain operations, such as new EC2 instance launches, SQS polling via Lambda Event Source Mappings, and asynchronous Lambda invocations. While this helped stabilize core services, it created backlogs in systems like AWS Config, Redshift, and Amazon Connect, which required several hours to fully process even after service recovery.
This “domino effect” illustrates how critical interdependencies within the AWS ecosystem can amplify the impact of a single failure; you can read more on the AWS Health Dashboard and in the AWS service report. The sketch below shows how such an endpoint failure surfaces to a client application.
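To make the failure mode concrete, here is a minimal, hypothetical sketch in Python using boto3. It shows how a DNS or endpoint failure on the DynamoDB API surfaces to a client application, and how client-side retry and timeout settings plus a fallback path can keep that failure contained rather than letting it ripple outward. The table name, key, and fallback behavior are placeholders; this is not a description of how AWS’s internal services are built.

```python
import boto3
import botocore.exceptions
from botocore.config import Config

# Client-side resiliency settings: adaptive retries and short timeouts keep a
# failing dependency from tying up application threads while it is unreachable.
resilient_config = Config(
    retries={"max_attempts": 5, "mode": "adaptive"},
    connect_timeout=2,
    read_timeout=5,
)

dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=resilient_config)


def get_order(order_id: str):
    """Read an item, degrading gracefully if the DynamoDB endpoint is unreachable."""
    try:
        return dynamodb.get_item(
            TableName="orders",  # placeholder table name
            Key={"order_id": {"S": order_id}},
        )
    except botocore.exceptions.EndpointConnectionError:
        # Raised when the endpoint (e.g. dynamodb.us-east-1.amazonaws.com) cannot
        # be resolved or reached -- the failure mode at the root of this outage.
        # Fall back to a cache, a queue, or a degraded response instead of letting
        # the exception propagate through every dependent service.
        return None
```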
The business impact, and why observability matters
The impact of the outage was widespread, affecting AWS's own offerings like Alexa and Amazon.com, major clients such as Snapchat, PayPal’s Venmo, and Reddit, and even critical utility tools including Docker and Zoom. For AWS customers and organizations like these that depend heavily on cloud platforms and services, a multi-hour outage spanning dozens of AWS services has a significant business impact.
The Observability Forecast 2025 highlights the staggering financial impact of outages. An application, platform, or even global SaaS outage can cost organizations a median of $2.2 million per hour, or roughly $37,000 per minute. While it is too early to calculate specific numbers for this most recent outage, at over 15 hours long it is safe to assume the losses were major.
The Forecast further reveals that organizations utilizing Full-Stack Observability (FSO) can significantly reduce outage costs to $1 million per hour due to enhanced resilience and mitigated risks.
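To put those medians in context for an incident of this length, here is a back-of-the-envelope calculation using only the figures above. The 15-hour duration comes from the incident timeline, and the result is an illustration rather than a measurement of this outage’s actual cost.

```python
# Illustrative only: applies the Observability Forecast 2025 median hourly costs
# quoted above to the roughly 15-hour duration of this incident.
OUTAGE_HOURS = 15
COST_PER_HOUR = 2_200_000            # median outage cost per hour
COST_PER_HOUR_WITH_FSO = 1_000_000   # median outage cost per hour with FSO

print(f"Without FSO: ~${OUTAGE_HOURS * COST_PER_HOUR:,}")           # ~$33,000,000
print(f"With FSO:    ~${OUTAGE_HOURS * COST_PER_HOUR_WITH_FSO:,}")  # ~$15,000,000
```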
The business impact is not just about dollars and cents. Outages like this one, particularly when throttling actions are involved, place a significant burden on engineering teams: on-call engineers, DevOps personnel, and SREs dedicate approximately 33% of their collective time to addressing such service disruptions and the issues and incidents they generate.
This is precisely where observability shifts the paradigm:
- Faster Detection: Organizations that implement observability tools like New Relic for their Full-Stack Observability (FSO) achieve faster detection of critical outages. On average, their Mean Time to Detect (MTTD) is 28 minutes, compared to 35 minutes for those without FSO solutions.
- AI-powered Responses and Automated RCA: Given the inherent complexity of modern distributed systems, human operators often find themselves overwhelmed, making artificial intelligence (AI) an indispensable asset. This reality is reflected among executives and IT leaders, who identify AI-assisted troubleshooting (38%) and automatic root cause analysis (RCA) (33%) as crucial capabilities. These AI-driven approaches are seen as vital for accelerating incident resolution and significantly limiting the fallout from major events, such as the AWS cascade.
- End-to-End Tracing: Distributed tracing tracks transaction requests as they move across interconnected back-end services, providing end-to-end visibility. When a problem arises in a back-end service, such as a database failure, tracing helps engineers pinpoint exactly which services are degrading the customer experience through slow page loads or errors, and back-end engineers can see how infrastructure issues directly impact their customers (a minimal tracing sketch follows this list).
- Alert Correlation: Observability tools like New Relic streamline incident management by intelligently grouping related alerts. This reduces noise and accelerates root cause identification by uncovering correlation patterns tied to specific incident scenarios. Such functionality is essential for navigating the complexities of multi-component failures.
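As a minimal illustration of end-to-end tracing, the hypothetical sketch below uses the OpenTelemetry Python API to wrap a customer-facing request and its DynamoDB call in parent and child spans, so a failure in the database layer is attributed to the transaction it degrades. The service name, span names, and the `load_order_from_dynamodb` helper are placeholders, and exporter setup is omitted.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

# In a real deployment the tracer provider is configured to export spans to your
# observability backend (for example via OTLP); that setup is omitted here.
tracer = trace.get_tracer("checkout-service")  # placeholder service name


def handle_checkout(order_id: str):
    # Parent span: the customer-facing transaction.
    with tracer.start_as_current_span("POST /checkout") as span:
        span.set_attribute("order.id", order_id)
        try:
            # Child span: the back-end dependency. If DynamoDB is failing, this
            # span records the error, and the trace shows exactly which
            # customer-facing request it degraded.
            with tracer.start_as_current_span("dynamodb.get_item") as db_span:
                db_span.set_attribute("aws.region", "us-east-1")
                return load_order_from_dynamodb(order_id)  # hypothetical helper
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```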
Validation of Recovery
While observability tools help reduce MTTD, it is just as important to track Mean Time to Resolution (MTTR), which means actively monitoring to confirm that everything is genuinely back to normal.
Even after the AWS Health Dashboard marks an event as "resolved," services frequently still contend with backlogs, often stemming from SQS queues, background processes triggering Lambda functions, or other third-party dependencies. Observability provides the empirical evidence needed to confirm that service quality has genuinely returned to normal.
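For example, rather than assuming a queue backlog has drained, you can verify it empirically. The sketch below is a minimal, hypothetical example using boto3 with a placeholder queue URL: it polls the queue's approximate depth until it falls back below a normal threshold.

```python
import time

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"  # placeholder


def wait_for_backlog_to_drain(threshold: int = 100, poll_seconds: int = 60) -> int:
    """Poll the approximate queue depth until it drops below a normal level."""
    while True:
        attrs = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL,
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
        print(f"Current backlog: {depth} messages")
        if depth < threshold:
            return depth  # backlog has drained to a normal level
        time.sleep(poll_seconds)
```

Beyond backlog checks, several other observability practices confirm that recovery is real: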
- Confirmed Uptime and Reliability: Observability confirms the application is meeting its core business goal of system uptime and reliability.
- Synthetic Monitoring: Synthetic monitoring lets teams run continuous checks to ensure application endpoints respond correctly post-recovery (see the sketch after this list).
- Measuring Resolution Success: Observability practices like monitoring DORA metrics and the golden signals (latency, traffic, errors, and saturation) help confirm improvements in MTTD and MTTR following recovery efforts and procedural changes.
- Change Tracking: Recording deployments and configuration changes alongside telemetry shows whether a remediation actually took effect, and it supports advanced automation features such as AI-assisted remediation actions like rollbacks or configuration updates.
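The synthetic check mentioned above can be as simple as the following sketch, which assumes the `requests` library and a placeholder health endpoint: it verifies both the status code and the response latency on a fixed interval. A hosted synthetic monitor does essentially the same thing, on a schedule and from multiple locations.

```python
import time

import requests

ENDPOINT = "https://api.example.com/health"  # placeholder endpoint
LATENCY_BUDGET_SECONDS = 1.0


def synthetic_check() -> bool:
    """Return True if the endpoint responds successfully within its latency budget."""
    start = time.monotonic()
    try:
        response = requests.get(ENDPOINT, timeout=5)
    except requests.RequestException as exc:
        print(f"Check failed: {exc}")
        return False
    elapsed = time.monotonic() - start
    healthy = response.status_code == 200 and elapsed <= LATENCY_BUDGET_SECONDS
    print(f"status={response.status_code} latency={elapsed:.2f}s healthy={healthy}")
    return healthy


if __name__ == "__main__":
    # Run a check every 30 seconds; several consecutive passes suggest recovery held.
    while True:
        synthetic_check()
        time.sleep(30)
```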
It's an AWS outage… What can we do?
Under the shared responsibility model, AWS is responsible for resolving and restoring service functionality for customers using AWS services. But what can we, as customers, proactively do to prepare for the next one and mitigate risk?
Our involvement extends beyond merely monitoring the AWS Health Dashboards. When it comes to handling service disruptions, it's not enough to simply have a Disaster Recovery (DR) strategy, a multi-region setup, or even sophisticated region failover systems in place.
The most critical first step is achieving clear visibility into which services are actually impacted during an incident. This foundational awareness must come before any recovery plan can be effectively executed. In modern cloud environments, architectures are often built using AWS services as interconnected Lego blocks. This complexity is magnified in microservices and distributed systems, where AWS services are repeatedly leveraged across the architecture, creating a web of dependencies that can be difficult to untangle during an outage. Without real-time visibility, identifying the root cause and the full scope of the impact becomes a significant challenge.
Observability tools play a critical role in achieving real-time visibility:
- Identifying the Impacted Services for Your Stack: The affected AWS service could impact your entire system or platform, or just a small component of it. Observability provides the clarity you need to identify which services have been impacted, so you can address issues efficiently (a minimal query sketch follows this list).
- Monitor Golden Signals: Track the golden signals in the failover environment to confirm its stability and performance, validating that the disaster recovery (DR) strategy is working as intended.
- Quantifying Revenue Loss: Observability extends to business outcomes. The New Relic Pathpoint application lets customers visualize the customer journey and quantify the financial impact on business metrics, including the potential revenue lost for each minute of downtime.
- Alerts and Dashboards: Use a unified alerts view to quickly pinpoint every service affected by an AWS failure, promptly inform dependent teams to establish situational awareness, and consult centralized dashboards to see the health and key metrics of the application or platform at a glance.
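As a concrete example of the first point, the hypothetical sketch below queries New Relic's NerdGraph API for the error rate of each instrumented service over the last 30 minutes, one quick way to see which parts of your stack an AWS incident is actually touching. The account ID, API key, and NRQL query are placeholders to adapt to your own environment.

```python
import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
API_KEY = "NRAK-..."   # placeholder user API key
ACCOUNT_ID = 1234567   # placeholder account ID

# NRQL: error rate per instrumented service over the last 30 minutes.
nrql = (
    "SELECT percentage(count(*), WHERE error IS true) "
    "FROM Transaction FACET appName SINCE 30 minutes ago"
)

graphql_query = f"""
{{
  actor {{
    account(id: {ACCOUNT_ID}) {{
      nrql(query: "{nrql}") {{
        results
      }}
    }}
  }}
}}
"""

response = requests.post(
    NERDGRAPH_URL,
    headers={"API-Key": API_KEY, "Content-Type": "application/json"},
    json={"query": graphql_query},
)
response.raise_for_status()

results = response.json()["data"]["actor"]["account"]["nrql"]["results"]
for row in results:
    # Each row contains the faceted service name plus its computed error rate.
    print(row)
```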
Using New Relic to detect the outage
At New Relic, we use AWS for our own workloads. Since the majority of our platform runs outside of the us-east-1 region, our core functionalities remained largely unaffected during the AWS outage on October 20th. This means that our data ingest, storage, query, alerting, and the New Relic UI were all operational.
However, some of our workloads were impacted. These included Synthetics, AWS Cloud Monitoring, Infinite Tracing, mobile symbolication, and events related to New Relic consumption (such as NrConsumption and NRMTDConsumption). Synthetics, Cloud Monitoring, and Infinite Tracing are designed to run across multiple regions, so they were only partially affected. In contrast, mobile symbolication and consumption events have a specific dependency on us-east-1.
Because we monitor the New Relic platform with New Relic itself, we detected the issue as soon as it began: at 11:57 PM PDT, alerts fired for a service using DynamoDB in us-east-1, allowing us to identify the incoming errors immediately.
Upon receiving alerts, we actively monitored the triggered incidents to assess the outage's impact. Although the effect on the New Relic platform was minor, we continued monitoring closely so that our customers could resolve any associated issues.
Wrapping Up
During an AWS outage, you rarely have enough information to confidently say, “let’s deploy the fixes!” Nor is it the right time to reassess your disaster recovery (DR) strategy or re-evaluate the architecture itself, especially if you already run a multi-region setup, since this outage specifically impacted us-east-1.
Instead, the focus should be on leveraging observability tools like New Relic to make sense of telemetry data (MELT: metrics, events, logs, and traces) and quickly identify affected services. These tools provide full-stack visibility across your architecture, from front-end applications and APM to databases and infrastructure, including not just VMs, containers, and Kubernetes clusters but also your AWS environment and your cloud provider’s health.
Outages like this one can cause significant business disruption. As engineers, our first priority is understanding: what’s broken, what parts are impacted, and to what extent. In critical moments like these, having a comprehensive observability strategy is essential for maintaining awareness and minimizing impact.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.