Infrastructure & Cloud

AWS US-EAST-1 Outage (October 2025): Root Cause, Timeline, and Lessons


Updated: AWS Post-Mortem Released

AWS published their official post-mortem 3 days after the incident. This article has been updated with their findings, including the DynamoDB DNS race condition root cause and the EC2 cascade mechanism.

On October 20, 2025, at approximately 3:11 AM ET, Amazon Web Services experienced one of its most significant outages in recent history. For roughly 15 hours, a critical failure in the US-EAST-1 region (Northern Virginia) cascaded across the internet, taking down thousands of websites and applications that power our daily digital lives.

Snapchat went dark for 375 million daily users. Fortnite and Roblox became unplayable for millions of gamers. Ring doorbells stopped recording. McDonald's mobile orders failed. United Airlines booking systems stuttered. Even the British government's tax website (HMRC) became inaccessible.

The scale was staggering: Downdetector received over 6.5 million reports spanning more than 1,000 services globally. This wasn't just an AWS problem—it exposed the fragility of our cloud-dependent world.

What Happened: The Technical Breakdown

AWS released their official post-mortem 3 days after the incident. Here's what actually happened:

Root Cause (Critical): A race condition in DynamoDB's internal DNS management system. Two DNS Enactor processes ran concurrently, and a stale plan check allowed an old plan to overwrite a newer one, which was then deleted by cleanup automation, wiping all DynamoDB DNS records.

Affected Services (Critical): 14 AWS services including EC2, DynamoDB, SQS, Amazon Connect, Lambda, and S3. DynamoDB was offline for ~3 hours; the EC2 cascade lasted 12+ hours.

Duration (High): Approximately 14 hours total. DynamoDB DNS was restored in ~3 hours, but EC2's DropletWorkflow Manager entered congestive collapse, requiring 5+ more hours for network state recovery.

Geographic Scope (High): Primary impact in US-EAST-1 (N. Virginia), but global services were affected due to control plane and cross-region service dependencies.

The Chain Reaction: DNS Enactor Race Condition

DynamoDB's DNS is managed by three components: a DNS Planner that monitors load balancer health and creates weighted traffic plans, and DNS Enactors (3 instances across 3 AZs) that update Route 53. The system relies on eventual consistency between parallel enactors. Here's how it broke:

  1. DNS Enactor #1 experienced unusual delays → It verified plan freshness at the start of processing, but by the time it finished updating endpoints, that check was stale
  2. DNS Planner accelerated plan generation → Newer plans were created and applied while Enactor #1 was still working on the old one
  3. Enactor #1's delayed write finally landed → The stale freshness check let the old plan overwrite the newer one
  4. DNS Enactor #2's cleanup automation flagged that old plan as stale and deleted it → All IP addresses for dynamodb.us-east-1.amazonaws.com were wiped from Route 53
  5. EC2 DropletWorkflow Manager (DWFM) cascaded → DWFM stores state in DynamoDB; queries timed out, droplet leases expired, servers were marked unavailable, and the system entered congestive collapse
  6. Network Manager backlog compounded the outage → Network state propagation latencies caused an additional 5+ hours of recovery time
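The failure pattern at the heart of this chain is a classic check-then-act race: the freshness check and the write are not atomic. The toy Python sketch below is not AWS's actual code, and every name in it is illustrative; it simply shows how a delayed writer that validated its plan "at the start" can clobber a newer plan, after which a cleanup pass deletes what it believes is an obsolete record.

```python
import threading
import time

# Toy model of the shared DNS record: which plan is applied and its IPs (illustrative names).
route53_record = {"plan": 2, "ips": ["10.0.0.1", "10.0.0.2"]}  # the *newer* plan is already live
lock = threading.Lock()

def slow_enactor(stale_plan_id, stale_ips):
    """Enactor #1: validated its plan before a long stall, then writes blindly."""
    time.sleep(0.1)  # the unusual delay -- the earlier freshness check is now stale
    with lock:
        route53_record.update({"plan": stale_plan_id, "ips": stale_ips})  # old overwrites new

def cleanup(latest_plan_id):
    """Enactor #2's cleanup: removes records belonging to plans older than the latest."""
    time.sleep(0.2)  # runs after the stale write has landed
    with lock:
        if route53_record["plan"] < latest_plan_id:
            route53_record["ips"] = []  # record emptied -- the endpoint no longer resolves

t1 = threading.Thread(target=slow_enactor, args=(1, ["10.9.9.9"]))
t2 = threading.Thread(target=cleanup, args=(2,))
t1.start(); t2.start(); t1.join(); t2.join()
print(route53_record)  # {'plan': 1, 'ips': []} -- an empty record, i.e. the outage
```

The standard fix is to make the write conditional on the exact version that was validated (a compare-and-swap rather than a blind update), which is presumably what AWS's committed "stronger safeguards before applying plans" refers to.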

The Global Impact

The outage didn't just affect tech companies—it rippled through every sector of the digital economy:

  • Snapchat (Social Media): 375M daily users
  • Roblox (Gaming): 70M+ daily users
  • Fortnite (Gaming): Millions affected
  • Ring (Smart Home): Global disruption
  • McDonald's App (Food & Retail): Order systems down
  • United Airlines (Travel): Booking systems affected
  • Robinhood (Finance): Trading disrupted
  • Bank of Scotland (Banking): Service interruptions

Why Did Global Services Fail from a Single Region?

This is the critical question. Many affected services were deployed across multiple regions. So why did they still fail?

  1. Control Plane Dependencies: AWS's global control plane (IAM, Route 53, CloudFormation) has critical infrastructure in US-EAST-1. Even healthy regions couldn't perform certain operations.
  2. Single-Region Data Stores: Many multi-region applications kept their primary databases (DynamoDB, RDS) in US-EAST-1 only, making them single points of failure.
  3. Configuration and Secrets: Applications in healthy regions couldn't start or scale because they relied on AWS Secrets Manager or Parameter Store in US-EAST-1.
  4. Async Processing Bottlenecks: SQS queues and Lambda functions are often centralized in US-EAST-1 for cost optimization, creating hidden dependencies.
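A practical first step is a dependency inventory: enumerate where your stateful resources actually live. The boto3 sketch below is a minimal illustration (it assumes standard AWS credentials are configured and only samples DynamoDB tables and SQS queues), but the same pattern extends to RDS, Secrets Manager, and Parameter Store.

```python
import boto3

# Regions you believe you run in -- adjust to your own footprint (illustrative list).
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def audit_region(region):
    """Count stateful resources per region to spot single-region concentration."""
    dynamodb = boto3.client("dynamodb", region_name=region)
    sqs = boto3.client("sqs", region_name=region)
    tables = dynamodb.list_tables().get("TableNames", [])
    queues = sqs.list_queues().get("QueueUrls", [])
    return {"region": region, "dynamodb_tables": len(tables), "sqs_queues": len(queues)}

if __name__ == "__main__":
    for region in REGIONS:
        print(audit_region(region))
    # If everything clusters in us-east-1, your "multi-region" app still has a
    # single-region data plane.
```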

What Engineering Teams Should Learn

This outage is a masterclass in distributed systems failure modes. Here are the critical lessons:

1. Multi-Region Architecture is Non-Negotiable

Services running only in US-EAST-1 had zero availability. Multi-region deployments with active-active or active-passive failover could have maintained partial service.

Action: Design for multi-region from day one, even if it seems expensive. The cost of downtime far exceeds infrastructure costs.
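As a sketch of what active-passive failover can look like at the DNS layer, the snippet below creates a pair of Route 53 failover records with boto3. The hosted zone ID, health check ID, and domain names are placeholders; treat this as an outline under those assumptions rather than a drop-in configuration.

```python
import boto3

route53 = boto3.client("route53")

# Placeholders -- substitute your own hosted zone, health check, and per-region endpoints.
HOSTED_ZONE_ID = "Z__EXAMPLE__"
PRIMARY_HEALTH_CHECK_ID = "hc-primary-example"

def failover_record(set_id, role, target, health_check_id=None):
    record = {
        "Name": "api.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": set_id,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("use1", "PRIMARY", "api-use1.example.com", PRIMARY_HEALTH_CHECK_ID),
        failover_record("usw2", "SECONDARY", "api-usw2.example.com"),
    ]},
)
```

One reason this pattern holds up: health-check-driven failover runs in Route 53's globally distributed data plane, so it can still fire while the Route 53 control plane (anchored in US-EAST-1) is impaired; you just can't create or edit the records during such an outage, which is why they need to exist beforehand.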

2. Don't Put All Dependencies in One Region

Many services with multi-region deployments still failed because their DynamoDB databases, SQS queues, or Lambda functions were only in US-EAST-1.

Action: Map ALL dependencies. Ensure critical data stores and async processing exist in multiple regions with replication.
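For DynamoDB specifically, one way to remove the single-region data store is a global table replica. The sketch below (table name and regions are placeholders) adds a us-west-2 replica to an existing table; whether this fits depends on your consistency and cost requirements.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Note: global tables (version 2019.11.21) require DynamoDB Streams
# (NEW_AND_OLD_IMAGES) to be enabled on the table before adding replicas.
dynamodb.update_table(
    TableName="orders",  # illustrative table name
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)

# Replication status is visible on the source table's description.
replicas = dynamodb.describe_table(TableName="orders")["Table"].get("Replicas", [])
for replica in replicas:
    print(replica["RegionName"], replica["ReplicaStatus"])  # CREATING -> ACTIVE
```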

3. Test Your Disaster Recovery Plan

Having a DR plan on paper is worthless if you've never actually failed over. The outage exposed companies with untested recovery procedures.

Action: Run quarterly chaos engineering exercises. Kill US-EAST-1 deliberately and measure your actual recovery time.
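A lightweight way to start is a game-day script that simulates the primary region being unreachable and measures how long your clients take to recover. The sketch below is deliberately simple and uses hypothetical per-region endpoints; real exercises should also cover data failover, not just request routing.

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoints -- replace with your own per-region entry points.
PRIMARY = "https://api-use1.example.com/health"
SECONDARY = "https://api-usw2.example.com/health"

def fetch(url, timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status if resp.status == 200 else None
    except (urllib.error.URLError, TimeoutError):
        return None

def request_with_failover(primary, secondary):
    """Prefer the primary region; fall back to the secondary if it is unreachable."""
    return fetch(primary) or fetch(secondary)

if __name__ == "__main__":
    # Game day: black-hole PRIMARY (e.g. at the firewall or via a bogus hostname)
    # and measure how long requests take to land on the secondary.
    start = time.monotonic()
    status = request_with_failover(PRIMARY, SECONDARY)
    print(f"served via fallback path: status={status}, latency={time.monotonic() - start:.1f}s")
```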

4. Control Plane vs Data Plane Awareness

Even services in healthy regions failed because AWS's control plane operations (DNS, IAM, CloudFormation) depended on US-EAST-1 infrastructure.

Action: Understand which AWS services have regional vs global control planes. Design systems to operate during control plane outages.
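One concrete pattern here is keeping the data plane runnable when control-plane calls fail by caching last-known-good configuration locally. The sketch below wraps a Secrets Manager lookup (the secret ID and cache path are illustrative) with a file-backed fallback, so a control-plane outage degrades you to slightly stale config instead of a failed startup.

```python
import json
import pathlib

import boto3
from botocore.exceptions import BotoCoreError, ClientError

CACHE = pathlib.Path("/var/cache/app/last_known_good_secret.json")  # illustrative path

def get_secret(secret_id="app/config"):
    """Fetch a secret, falling back to the last-known-good copy on disk."""
    client = boto3.client("secretsmanager")
    try:
        value = json.loads(client.get_secret_value(SecretId=secret_id)["SecretString"])
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        CACHE.write_text(json.dumps(value))  # refresh the local copy on every success
        return value
    except (BotoCoreError, ClientError):
        if CACHE.exists():
            return json.loads(CACHE.read_text())  # degrade to stale-but-working config
        raise  # no cached copy: fail loudly rather than guess
```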

5. Monitoring and Observability Must Be External

Many companies couldn't access their own monitoring dashboards because they were hosted on AWS infrastructure that was down.

Action: Use external observability tools (DataDog, New Relic, external status pages) that don't depend on your primary cloud provider.
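Even a single cron-driven probe running outside your primary provider catches the "our dashboards are down too" failure mode. A minimal sketch, assuming a hypothetical webhook on an externally hosted status page:

```python
import json
import urllib.request

ENDPOINTS = ["https://api.example.com/health", "https://app.example.com/health"]
STATUS_WEBHOOK = "https://status.example-elsewhere.com/hooks/heartbeat"  # hosted off-AWS

def probe(url):
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except Exception:
        return False

results = {url: probe(url) for url in ENDPOINTS}
payload = json.dumps(results).encode()
# Report results to the externally hosted status page, healthy or not.
urllib.request.urlopen(
    urllib.request.Request(STATUS_WEBHOOK, data=payload,
                           headers={"Content-Type": "application/json"}),
    timeout=3,
)
```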

6. Communication Plans for Extended Outages

Companies struggled to communicate with customers during the outage because their status pages, email systems, and notification services were down.

Action: Maintain status pages and communication channels on separate infrastructure (different cloud provider or on-prem).

Why US-EAST-1 Outages Are Especially Catastrophic

US-EAST-1 (Northern Virginia) isn't just another AWS region—it's special:

  • Oldest AWS Region: Launched in 2006, it has the most mature service footprint, and new features typically launch here first.
  • Default Region: Many AWS services default to US-EAST-1 in SDKs and the console, leading to accidental dependencies (see the sketch at the end of this section).
  • Control Plane Hub: Global AWS services (CloudFront, Route53, IAM) have critical infrastructure here.
  • Largest Deployment: Estimated to host 30-40% of all AWS workloads globally.
  • Cost Optimized: Lowest pricing, incentivizing companies to centralize here despite risks.

This combination makes US-EAST-1 outages uniquely impactful. When this region fails, the internet notices.
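The "default region" trap is easy to fall into because several AWS behaviors quietly resolve to US-EAST-1 unless you say otherwise. A small sketch of making the choice explicit with boto3 (bucket name and regions are illustrative):

```python
import boto3

# Be explicit about regions instead of relying on ambient configuration.
# S3 buckets created without a LocationConstraint land in us-east-1, and the
# legacy global STS endpoint is also served out of us-east-1.
session = boto3.session.Session(region_name="eu-west-1")

s3 = session.client("s3")
s3.create_bucket(
    Bucket="example-bucket-name-123",  # illustrative bucket name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Pin STS to a regional endpoint rather than the us-east-1-backed global one.
sts = session.client("sts", endpoint_url="https://sts.eu-west-1.amazonaws.com")
print(sts.get_caller_identity()["Account"])
```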

What AWS Said: The Official Post-Mortem

AWS published their post-mortem just 3 days after the incident, significantly faster than their 4-month delay after the 2023 outage. Here's what they confirmed:

Confirmed Root Cause

The outage was caused by a race condition in an internal DynamoDB microservice that automates DNS record management for DynamoDB's regional cells. The DNS Enactor verified plan freshness once at the start of processing, but by the time it finished updating endpoints (after unusual delays), that check was stale. This allowed an old plan to overwrite a newer one, which cleanup automation then deleted, effectively erasing all DNS records for DynamoDB in US-EAST-1.

The cascade into EC2 was caused by the DropletWorkflow Manager (DWFM), which stores state in DynamoDB. When DynamoDB queries timed out, DWFM marked servers as unavailable, triggering congestive collapse as the system tried to reattempt lease renewals all at once.
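The "congestive collapse" dynamic is worth internalizing: when every client retries on the same schedule, the recovery traffic itself keeps the dependency saturated. A generic mitigation (not what AWS did here, just the standard pattern) is capped exponential backoff with jitter, so retries spread out instead of arriving in synchronized waves:

```python
import random
import time

def call_with_backoff(operation, max_attempts=8, base=0.2, cap=30.0):
    """Retry with capped exponential backoff and full jitter to avoid retry storms."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap, so
            # thousands of clients don't all renew their leases at the same instant.
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))

# Hypothetical usage: lease renewal against a flaky dependency.
# call_with_backoff(lambda: renew_lease(droplet_id))  # renew_lease/droplet_id are illustrative
```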

AWS Committed Fixes

  1. Stronger safeguards in the DynamoDB DNS Enactor before applying plans (preventing stale overwrites)
  2. Velocity controls for Network Load Balancers to prevent large-scale failovers (the general pattern is sketched after this list)
  3. A new scale-testing suite for recovery workflows
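"Velocity controls" is a general pattern: cap how much change automation may apply per unit of time so a misbehaving workflow cannot fail over, or delete, everything at once. A minimal token-bucket sketch of the idea (illustrative only, not AWS's implementation):

```python
import time

class VelocityLimiter:
    """Allow at most `rate` changes per second, with a small burst budget."""
    def __init__(self, rate=1.0, burst=5):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller must defer the change (e.g. a failover or record deletion)

limiter = VelocityLimiter(rate=0.5, burst=2)  # at most one change every 2s on average
for change in ["failover-az1", "failover-az2", "failover-az3"]:
    print(change, "applied" if limiter.allow() else "deferred")
```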

Notably, the post-mortem did not explain why DNS Enactor #1 slowed down in the first place, nor the design rationale behind the aggressive DNS record deletion behavior.

The Bigger Picture: Cloud Dependency Risk

This outage exposes a fundamental tension in modern software architecture:

Cloud providers promise five-nines reliability (99.999% uptime), but achieving that requires architectural discipline that most companies don't implement. Multi-region deployments are expensive and complex. Many startups and even mature companies accept the risk of single-region deployment to move faster and reduce costs.

The result? We've created a world where a DNS automation failure in a single region in Northern Virginia can disable critical services globally—from emergency Ring cameras to banking apps to government tax systems.

As an industry, we need to have honest conversations about acceptable risk, true cost of downtime, and realistic expectations for cloud reliability.

Stay Informed: Subscribe to Tech Upkeep

This is exactly the kind of critical infrastructure news that product engineers need to know about—but often miss until it's too late.

Tech Upkeep curates breaking infrastructure incidents, post-mortems from companies like AWS, Netflix, and Google, and the best engineering blog content—delivered to your inbox every Tuesday and Friday.

  • Breaking outage analysis like this one
  • AWS, Google, Azure post-mortem reports explained
  • Production engineering lessons from Netflix, Uber, Airbnb
  • System design patterns to prevent failures

Catch the next major infrastructure incident before it catches you.

Final Thoughts

The October 20, 2025 AWS outage will likely be studied in computer science courses for years to come. It's a stark reminder that even the world's most sophisticated cloud infrastructure has failure modes we don't fully understand until they manifest.

For engineering teams, this incident should trigger honest architectural reviews. Ask yourself:

  • Could our service survive a US-EAST-1 outage?
  • Have we actually tested our disaster recovery procedures?
  • Do we have dependencies we don't know about?
  • Is our monitoring independent of our infrastructure?

The best time to fix these issues is before the next major outage—not during it.

Last updated: April 11, 2026. Updated with AWS's official post-mortem findings.


Written by Benjamin Loh, curator of Tech Upkeep