AWS US-EAST-1 Outage (October 2025): Root Cause, Timeline, and Lessons
Updated: AWS Post-Mortem Released
AWS published their official post-mortem 3 days after the incident. This article has been updated with their findings, including the DynamoDB DNS race condition root cause and the EC2 cascade mechanism.
On October 20, 2025, at approximately 3:11 AM ET, Amazon Web Services experienced one of its most significant outages in recent history. For roughly 14 hours, a critical failure in the US-EAST-1 region (Northern Virginia) cascaded across the internet, taking down thousands of websites and applications that power our daily digital lives.
Snapchat went dark for 375 million daily users. Fortnite and Roblox became unplayable for millions of gamers. Ring doorbells stopped recording. McDonald's mobile orders failed. United Airlines booking systems stuttered. Even the British government's tax website (HMRC) became inaccessible.
The scale was staggering: Downdetector received over 6.5 million reports spanning more than 1,000 services globally. This wasn't just an AWS problem—it exposed the fragility of our cloud-dependent world.
What Happened: The Technical Breakdown
Here's what actually happened, according to the official post-mortem:
Root Cause
Critical: A race condition in DynamoDB's internal DNS management system. Two DNS Enactor processes ran concurrently, and a stale freshness check allowed an old plan to overwrite a newer one, which was then deleted by cleanup automation, wiping all DynamoDB DNS records.
Affected Services
Critical: 14 AWS services including EC2, DynamoDB, SQS, Amazon Connect, Lambda, and S3. DynamoDB was offline for ~3 hours; the EC2 cascade lasted 12+ hours.
Duration
High: Approximately 14 hours total. DynamoDB DNS was restored in ~3 hours, but EC2's DropletWorkflow Manager entered congestive collapse, requiring 5+ more hours for network state recovery.
Geographic Scope
High: Primary impact in US-EAST-1 (N. Virginia), but global services were affected due to control plane dependencies and cross-region service dependencies.
The Chain Reaction: DNS Enactor Race Condition
DynamoDB's DNS is managed by three components: a DNS Planner that monitors load balancer health and creates weighted traffic plans, and DNS Enactors (3 instances across 3 AZs) that update Route 53. The system relies on eventual consistency between parallel enactors. Here's how it broke:
- DNS Enactor #1 experienced unusual delays → It verified plan freshness at the start of processing, but by the time it finished updating endpoints, that check was stale
- DNS Planner accelerated plan generation → New plans were created while Enactor #1 was still working on an old one
- DNS Enactor #2 detected the stale plan and deleted it → Cleanup automation wiped all IP addresses for dynamodb.us-east-1.amazonaws.com from Route 53
- EC2 DropletWorkflow Manager (DWFM) cascaded → DWFM stores state in DynamoDB; queries timed out, droplet leases expired, servers were marked unavailable, and the system entered congestive collapse
- Network Manager backlog compounded the outage → Network state propagation latencies caused an additional 5+ hours of recovery time
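The failure sequence above is a classic check-then-act race. The sketch below is a toy reconstruction, not AWS's actual code: the class, endpoint name, and plan versions are illustrative. It shows how a freshness check performed once, before a slow apply step, lets an old plan clobber a newer one and then get garbage-collected:

```python
class DnsStore:
    """Toy stand-in for the Route 53 records of one endpoint."""
    def __init__(self):
        self.applied_version = 0
        self.records = {}

    def apply(self, version, ips):
        self.applied_version = version
        self.records["dynamodb.us-east-1"] = ips

store = DnsStore()

# Enactor #1 picks up plan v1 and checks freshness ONCE, up front.
fresh_check_v1 = 1 > store.applied_version          # True right now

# While Enactor #1 is stuck in "unusual delays", Enactor #2
# applies the newer plan v2. The world has moved on.
store.apply(2, ["10.0.2.1"])

# Enactor #1 finally finishes. Its freshness check is stale, but it
# never re-checks, so it overwrites v2 with the old v1.
if fresh_check_v1:
    store.apply(1, ["10.0.1.1"])

# Cleanup automation deletes plans older than the newest one (v2) --
# which now includes v1, the plan actually applied. Records are wiped.
if store.applied_version < 2:
    store.records.clear()

print(store.records)   # {} -> dynamodb.us-east-1 resolves to nothing
```

AWS's committed fix, stronger safeguards before applying plans, amounts to re-validating the plan version atomically at apply time (a compare-and-set), which would turn the stale overwrite above into a no-op.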
The Global Impact
The outage didn't just affect tech companies—it rippled through every sector of the digital economy:
- Snapchat (Social Media): 375M daily users
- Roblox (Gaming): 70M+ daily users
- Fortnite (Gaming): millions affected
- Ring (Smart Home): global disruption
- McDonald's App (Food & Retail): order systems down
- United Airlines (Travel): booking systems affected
- Robinhood (Finance): trading disrupted
- Bank of Scotland (Banking): service interruptions
Why Did Global Services Fail from a Single Region?
This is the critical question. Many affected services were deployed across multiple regions, yet they still failed — because their critical dependencies were not. DynamoDB tables, SQS queues, and AWS's own control plane operations remained anchored in US-EAST-1, so a multi-region frontend collapsed along with its single-region backbone.
What Engineering Teams Should Learn
This outage is a masterclass in distributed systems failure modes. Here are the critical lessons:
Multi-Region Architecture is Non-Negotiable
Services running only in US-EAST-1 had zero availability. Multi-region deployments with active-active or active-passive failover could have maintained partial service.
Action: Design for multi-region from day one, even if it seems expensive. The cost of downtime far exceeds infrastructure costs.
Don't Put All Dependencies in One Region
Many services with multi-region deployments still failed because their DynamoDB databases, SQS queues, or Lambda functions were only in US-EAST-1.
Action: Map ALL dependencies. Ensure critical data stores and async processing exist in multiple regions with replication.
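One lightweight way to act on this: keep a machine-readable inventory of every dependency and the regions it actually lives in, and flag single-region ones in CI. The inventory below is hypothetical; in practice you would generate it from IaC state or AWS Config rather than by hand:

```python
# Hypothetical dependency inventory for one service.
DEPENDENCIES = {
    "orders-table": {"type": "dynamodb", "regions": {"us-east-1"}},
    "events-queue": {"type": "sqs",      "regions": {"us-east-1"}},
    "img-bucket":   {"type": "s3",       "regions": {"us-east-1", "us-west-2"}},
}

def single_region_risks(deps):
    """Every dependency that disappears if one region goes dark."""
    return sorted(name for name, d in deps.items() if len(d["regions"]) == 1)

print(single_region_risks(DEPENDENCIES))     # ['events-queue', 'orders-table']
```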
Test Your Disaster Recovery Plan
Having a DR plan on paper is worthless if you've never actually failed over. The outage exposed companies with untested recovery procedures.
Action: Run quarterly chaos engineering exercises. Kill US-EAST-1 deliberately and measure your actual recovery time.
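A minimal version of such a drill, assuming a client that tries regions in priority order; the region list and the `call` stub are stand-ins for real service endpoints:

```python
import time

REGIONS = ["us-east-1", "us-west-2"]   # hypothetical deployment order
DOWN = set()                           # regions the drill has "killed"

def call(region):
    """Stand-in for a real service call to one regional endpoint."""
    if region in DOWN:
        raise ConnectionError(f"{region} unreachable")
    return f"ok from {region}"

def resilient_call():
    # Try regions in priority order; fail over on connection errors.
    for region in REGIONS:
        try:
            return call(region)
        except ConnectionError:
            continue
    raise RuntimeError("all regions down")

# The drill: deliberately take us-east-1 offline and time the failover.
DOWN.add("us-east-1")
start = time.perf_counter()
result = resilient_call()
elapsed_ms = (time.perf_counter() - start) * 1000
print(result)                          # ok from us-west-2
```

The number that matters is the measured recovery time under real load, not the happy-path latency of a stub like this one.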
Control Plane vs Data Plane Awareness
Even services in healthy regions failed because AWS's control plane operations (DNS, IAM, CloudFormation) depended on US-EAST-1 infrastructure.
Action: Understand which AWS services have regional vs global control planes. Design systems to operate during control plane outages.
Monitoring and Observability Must Be External
Many companies couldn't access their own monitoring dashboards because they were hosted on AWS infrastructure that was down.
Action: Use external observability tools (DataDog, New Relic, external status pages) that don't depend on your primary cloud provider.
Communication Plans for Extended Outages
Companies struggled to communicate with customers during the outage because their status pages, email systems, and notification services were down.
Action: Maintain status pages and communication channels on separate infrastructure (different cloud provider or on-prem).
Why US-EAST-1 Outages Are Especially Catastrophic
US-EAST-1 (Northern Virginia) isn't just another AWS region—it's special:
- Oldest AWS Region: Launched in 2006, it has the most mature services, and features launch here first.
- Default Region: Many AWS SDKs and the console default to US-EAST-1, leading to accidental dependencies.
- Control Plane Hub: Global AWS services (CloudFront, Route 53, IAM) have critical infrastructure here.
- Largest Deployment: Estimated to host 30-40% of all AWS workloads globally.
- Cost Optimized: Among the lowest pricing, incentivizing companies to centralize here despite the risk.
This combination makes US-EAST-1 outages uniquely impactful. When this region fails, the internet notices.
What AWS Said: The Official Post-Mortem
AWS published their post-mortem just 3 days after the incident, significantly faster than their 4-month delay after the 2023 outage. Here's what they confirmed:
Confirmed Root Cause
The outage was caused by a race condition in an internal DynamoDB microservice that automates DNS record management for DynamoDB's regional cells. The DNS Enactor verified plan freshness once at the start of processing, but by the time it finished updating endpoints (after unusual delays), that check was stale. This allowed an old plan to overwrite a newer one, which cleanup automation then deleted, effectively erasing all DNS records for DynamoDB in US-EAST-1.
The cascade into EC2 was caused by the DropletWorkflow Manager (DWFM), which stores state in DynamoDB. When DynamoDB queries timed out, DWFM marked servers as unavailable, triggering congestive collapse as the system tried to reattempt lease renewals all at once.
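That "all at once" retry pattern is what turns a recovered dependency into a fresh overload. The standard antidote is exponential backoff with jitter. A quick simulation of the difference jitter makes (the worker counts and backoff parameters are illustrative, not from the post-mortem):

```python
import random

def herd_size(n_workers, jitter, attempt=4, base=0.5, cap=60.0):
    """Max number of workers whose retry lands in the same 1 s window."""
    delay = min(cap, base * 2 ** attempt)   # 8.0 s at attempt 4
    times = [random.uniform(0, delay) if jitter else delay
             for _ in range(n_workers)]
    buckets = {}
    for t in times:
        buckets[int(t)] = buckets.get(int(t), 0) + 1
    return max(buckets.values())

print(herd_size(1000, jitter=False))   # 1000: every lease renewal fires at once
print(herd_size(1000, jitter=True))    # far smaller: spread over the 8 s window
```

Without jitter, every DWFM-style worker whose lease expired at the same moment retries at the same moment, too — exactly the synchronized stampede that keeps a system in congestive collapse.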
AWS Committed Fixes
- Stronger safeguards in the DynamoDB DNS Enactor before applying plans (preventing stale overwrites)
- Velocity controls for Network Load Balancers to prevent large-scale failovers
- A new scale-testing suite for recovery workflows
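"Velocity controls" here presumably means rate-limiting how fast automated failovers may proceed; the post-mortem does not give details, so the token bucket below is a generic sketch of the idea rather than AWS's mechanism:

```python
import time

class VelocityControl:
    """Generic token bucket capping failover operations per time window.

    A sketch of 'velocity controls' in the abstract -- the class name,
    limits, and semantics are assumptions, not AWS's implementation.
    """
    def __init__(self, max_ops, per_seconds):
        self.capacity = max_ops
        self.tokens = float(max_ops)
        self.rate = max_ops / per_seconds
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed, then try to spend one.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # throttle: defer this failover step

vc = VelocityControl(max_ops=3, per_seconds=60)
print([vc.allow() for _ in range(5)])   # [True, True, True, False, False]
```

The point of such a limiter is that a burst of simultaneous failovers gets smeared out over time instead of amplifying the original fault.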
Notably, the post-mortem did not explain why DNS Enactor #1 slowed down in the first place, nor the design rationale behind the aggressive DNS record deletion behavior.
The Bigger Picture: Cloud Dependency Risk
This outage exposes a fundamental tension in modern software architecture:
Cloud providers promise five-nines reliability (99.999% uptime), but achieving that requires architectural discipline that most companies don't implement. Multi-region deployments are expensive and complex. Many startups and even mature companies accept the risk of single-region deployment to move faster and reduce costs.
The result? We've created a world where a networking issue in a single data center in Northern Virginia can disable critical services globally—from emergency Ring cameras to banking apps to government tax systems.
As an industry, we need to have honest conversations about acceptable risk, true cost of downtime, and realistic expectations for cloud reliability.
Stay Informed: Subscribe to Tech Upkeep
This is exactly the kind of critical infrastructure news that product engineers need to know about—but often miss until it's too late.
Tech Upkeep curates breaking infrastructure incidents, post-mortems from companies like AWS, Netflix, and Google, and the best engineering blog content—delivered to your inbox every Tuesday and Friday.
- Breaking outage analysis like this one
- AWS, Google, Azure post-mortem reports explained
- Production engineering lessons from Netflix, Uber, Airbnb
- System design patterns to prevent failures
Catch the next major infrastructure incident before it catches you.
Final Thoughts
The October 20, 2025 AWS outage will likely be studied in computer science courses for years to come. It's a stark reminder that even the world's most sophisticated cloud infrastructure has failure modes we don't fully understand until they manifest.
For engineering teams, this incident should trigger honest architectural reviews. Ask yourself:
- Could our service survive a US-EAST-1 outage?
- Have we actually tested our disaster recovery procedures?
- Do we have dependencies we don't know about?
- Is our monitoring independent of our infrastructure?
The best time to fix these issues is before the next major outage—not during it.
Last updated: April 11, 2026. Updated with AWS's official post-mortem findings.
Written by Benjamin Loh, curator of Tech Upkeep