Infrastructure & Cloud

AWS US-EAST-1 Outage (October 2025): Root Cause, Timeline, and Lessons


Updated: AWS Post-Mortem Released

AWS published their official post-mortem 3 days after the incident. This article has been updated with their findings, including the DynamoDB DNS race condition root cause and the EC2 cascade mechanism.

On October 20, 2025, at approximately 3:11 AM ET, Amazon Web Services experienced one of its most significant outages in recent history. For roughly 15 hours, a critical failure in the US-EAST-1 region (Northern Virginia) cascaded across the internet, taking down thousands of websites and applications that power our daily digital lives.

Snapchat went dark for 375 million daily users. Fortnite and Roblox became unplayable for millions of gamers. Ring doorbells stopped recording. McDonald's mobile orders failed. United Airlines booking systems stuttered. Even the British government's tax website (HMRC) became inaccessible.

The scale was staggering: Downdetector received over 6.5 million reports spanning more than 1,000 services globally. This wasn't just an AWS problem—it exposed the fragility of our cloud-dependent world.

What Happened: The Technical Breakdown

AWS released their official post-mortem 3 days after the incident. Here's what actually happened:

Root Cause (Critical): A race condition in DynamoDB's internal DNS management system. Two DNS Enactor processes ran concurrently, and a stale plan check allowed an old plan to overwrite a newer one, which was then deleted by cleanup automation, wiping all DynamoDB DNS records.

Affected Services (Critical): 14 AWS services including EC2, DynamoDB, SQS, Amazon Connect, Lambda, and S3. DynamoDB was offline for ~3 hours; the EC2 cascade lasted 12+ hours.

Duration (High): Approximately 14 hours total. DynamoDB DNS was restored in ~3 hours, but EC2's DropletWorkflow Manager entered congestive collapse, requiring 5+ more hours for network state recovery.

Geographic Scope (High): Primary impact in US-EAST-1 (N. Virginia), but global services were affected due to control plane and cross-region service dependencies.

The Chain Reaction: DNS Enactor Race Condition

DynamoDB's DNS is managed by three components: a DNS Planner that monitors load balancer health and creates weighted traffic plans, and DNS Enactors (3 instances across 3 AZs) that update Route 53. The system relies on eventual consistency between parallel enactors. Here's how it broke:

  1. DNS Enactor #1 experienced unusual delays → It verified plan freshness at the start of processing, but by the time it finished updating endpoints, that check was stale
  2. DNS Planner accelerated plan generation → Newer plans were created and applied while Enactor #1 was still working on the old one
  3. Enactor #1's delayed write finally landed → The stale freshness check let the old plan overwrite the newer one
  4. DNS Enactor #2's cleanup automation flagged that old plan as stale and deleted it → All IP addresses for dynamodb.us-east-1.amazonaws.com were wiped from Route 53
  5. EC2 DropletWorkflow Manager (DWFM) cascaded → DWFM stores state in DynamoDB; queries timed out, droplet leases expired, servers were marked unavailable, and the system entered congestive collapse
  6. Network Manager backlog compounded the outage → Network state propagation latencies caused an additional 5+ hours of recovery time
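The failure pattern at the heart of this chain is a classic check-then-act race: the freshness check and the write are not atomic. The toy Python sketch below is not AWS's actual code, and every name in it is illustrative; it simply shows how a delayed writer that validated its plan "at the start" can clobber a newer plan, after which a cleanup pass deletes what it believes is an obsolete record.

```python
import threading
import time

# Toy model of the shared DNS record: which plan is applied and its IPs (illustrative names).
route53_record = {"plan": 2, "ips": ["10.0.0.1", "10.0.0.2"]}  # the *newer* plan is already live
lock = threading.Lock()

def slow_enactor(stale_plan_id, stale_ips):
    """Enactor #1: validated its plan before a long stall, then writes blindly."""
    time.sleep(0.1)  # the unusual delay -- the earlier freshness check is now stale
    with lock:
        route53_record.update({"plan": stale_plan_id, "ips": stale_ips})  # old overwrites new

def cleanup(latest_plan_id):
    """Enactor #2's cleanup: removes records belonging to plans older than the latest."""
    time.sleep(0.2)  # runs after the stale write has landed
    with lock:
        if route53_record["plan"] < latest_plan_id:
            route53_record["ips"] = []  # record emptied -- the endpoint no longer resolves

t1 = threading.Thread(target=slow_enactor, args=(1, ["10.9.9.9"]))
t2 = threading.Thread(target=cleanup, args=(2,))
t1.start(); t2.start(); t1.join(); t2.join()
print(route53_record)  # {'plan': 1, 'ips': []} -- an empty record, i.e. the outage
```

The standard fix is to make the write conditional on the exact version that was validated (a compare-and-swap rather than a blind update), which is presumably what AWS's committed "stronger safeguards before applying plans" refers to.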

The Global Impact

The outage didn't just affect tech companies—it rippled through every sector of the digital economy:

  • Snapchat (Social Media): 375M daily users
  • Roblox (Gaming): 70M+ daily users
  • Fortnite (Gaming): Millions affected
  • Ring (Smart Home): Global disruption
  • McDonald's App (Food & Retail): Order systems down
  • United Airlines (Travel): Booking systems affected
  • Robinhood (Finance): Trading disrupted
  • Bank of Scotland (Banking): Service interruptions

Why Did Global Services Fail from a Single Region?

This is the critical question. Many affected services were deployed across multiple regions. So why did they still fail?

  1. Control Plane Dependencies: AWS's global control plane (IAM, Route 53, CloudFormation) has critical infrastructure in US-EAST-1. Even healthy regions couldn't perform certain operations.
  2. Single-Region Data Stores: Many multi-region applications kept their primary databases (DynamoDB, RDS) in US-EAST-1 only, making them single points of failure.
  3. Configuration and Secrets: Applications in healthy regions couldn't start or scale because they relied on AWS Secrets Manager or Parameter Store in US-EAST-1.
  4. Async Processing Bottlenecks: SQS queues and Lambda functions are often centralized in US-EAST-1 for cost optimization, creating hidden dependencies.
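A practical first step is a dependency inventory: enumerate where your stateful resources actually live. The boto3 sketch below is a minimal illustration (it assumes standard AWS credentials are configured and only samples DynamoDB tables and SQS queues), but the same pattern extends to RDS, Secrets Manager, and Parameter Store.

```python
import boto3

# Regions you believe you run in -- adjust to your own footprint (illustrative list).
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def audit_region(region):
    """Count stateful resources per region to spot single-region concentration."""
    dynamodb = boto3.client("dynamodb", region_name=region)
    sqs = boto3.client("sqs", region_name=region)
    tables = dynamodb.list_tables().get("TableNames", [])
    queues = sqs.list_queues().get("QueueUrls", [])
    return {"region": region, "dynamodb_tables": len(tables), "sqs_queues": len(queues)}

if __name__ == "__main__":
    for region in REGIONS:
        print(audit_region(region))
    # If everything clusters in us-east-1, your "multi-region" app still has a
    # single-region data plane.
```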

What Engineering Teams Should Learn

This outage is a masterclass in distributed systems failure modes. Here are the critical lessons:

1. Multi-Region Architecture is Non-Negotiable

Services running only in US-EAST-1 had zero availability. Multi-region deployments with active-active or active-passive failover could have maintained partial service.

Action: Design for multi-region from day one, even if it seems expensive. The cost of downtime far exceeds infrastructure costs.
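As a sketch of what active-passive failover can look like at the DNS layer, the snippet below creates a pair of Route 53 failover records with boto3. The hosted zone ID, health check ID, and domain names are placeholders; treat this as an outline under those assumptions rather than a drop-in configuration.

```python
import boto3

route53 = boto3.client("route53")

# Placeholders -- substitute your own hosted zone, health check, and per-region endpoints.
HOSTED_ZONE_ID = "Z__EXAMPLE__"
PRIMARY_HEALTH_CHECK_ID = "hc-primary-example"

def failover_record(set_id, role, target, health_check_id=None):
    record = {
        "Name": "api.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": set_id,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("use1", "PRIMARY", "api-use1.example.com", PRIMARY_HEALTH_CHECK_ID),
        failover_record("usw2", "SECONDARY", "api-usw2.example.com"),
    ]},
)
```

One reason this pattern holds up: health-check-driven failover runs in Route 53's globally distributed data plane, so it can still fire while the Route 53 control plane (anchored in US-EAST-1) is impaired; you just can't create or edit the records during such an outage, which is why they need to exist beforehand.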

2. Don't Put All Dependencies in One Region

Many services with multi-region deployments still failed because their DynamoDB databases, SQS queues, or Lambda functions were only in US-EAST-1.

Action: Map ALL dependencies. Ensure critical data stores and async processing exist in multiple regions with replication.
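For DynamoDB specifically, one way to remove the single-region data store is a global table replica. The sketch below (table name and regions are placeholders) adds a us-west-2 replica to an existing table; whether this fits depends on your consistency and cost requirements.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Note: global tables (version 2019.11.21) require DynamoDB Streams
# (NEW_AND_OLD_IMAGES) to be enabled on the table before adding replicas.
dynamodb.update_table(
    TableName="orders",  # illustrative table name
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)

# Replication status is visible on the source table's description.
replicas = dynamodb.describe_table(TableName="orders")["Table"].get("Replicas", [])
for replica in replicas:
    print(replica["RegionName"], replica["ReplicaStatus"])  # CREATING -> ACTIVE
```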

3. Test Your Disaster Recovery Plan

Having a DR plan on paper is worthless if you've never actually failed over. The outage exposed companies with untested recovery procedures.

Action: Run quarterly chaos engineering exercises. Kill US-EAST-1 deliberately and measure your actual recovery time.
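A lightweight way to start is a game-day script that simulates the primary region being unreachable and measures how long your clients take to recover. The sketch below is deliberately simple and uses hypothetical per-region endpoints; real exercises should also cover data failover, not just request routing.

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoints -- replace with your own per-region entry points.
PRIMARY = "https://api-use1.example.com/health"
SECONDARY = "https://api-usw2.example.com/health"

def fetch(url, timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status if resp.status == 200 else None
    except (urllib.error.URLError, TimeoutError):
        return None

def request_with_failover(primary, secondary):
    """Prefer the primary region; fall back to the secondary if it is unreachable."""
    return fetch(primary) or fetch(secondary)

if __name__ == "__main__":
    # Game day: black-hole PRIMARY (e.g. at the firewall or via a bogus hostname)
    # and measure how long requests take to land on the secondary.
    start = time.monotonic()
    status = request_with_failover(PRIMARY, SECONDARY)
    print(f"served via fallback path: status={status}, latency={time.monotonic() - start:.1f}s")
```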

4. Control Plane vs Data Plane Awareness

Even services in healthy regions failed because AWS's control plane operations (DNS, IAM, CloudFormation) depended on US-EAST-1 infrastructure.

Action: Understand which AWS services have regional vs global control planes. Design systems to operate during control plane outages.
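One concrete pattern here is keeping the data plane runnable when control-plane calls fail by caching last-known-good configuration locally. The sketch below wraps a Secrets Manager lookup (the secret ID and cache path are illustrative) with a file-backed fallback, so a control-plane outage degrades you to slightly stale config instead of a failed startup.

```python
import json
import pathlib

import boto3
from botocore.exceptions import BotoCoreError, ClientError

CACHE = pathlib.Path("/var/cache/app/last_known_good_secret.json")  # illustrative path

def get_secret(secret_id="app/config"):
    """Fetch a secret, falling back to the last-known-good copy on disk."""
    client = boto3.client("secretsmanager")
    try:
        value = json.loads(client.get_secret_value(SecretId=secret_id)["SecretString"])
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        CACHE.write_text(json.dumps(value))  # refresh the local copy on every success
        return value
    except (BotoCoreError, ClientError):
        if CACHE.exists():
            return json.loads(CACHE.read_text())  # degrade to stale-but-working config
        raise  # no cached copy: fail loudly rather than guess
```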

5. Monitoring and Observability Must Be External

Many companies couldn't access their own monitoring dashboards because they were hosted on AWS infrastructure that was down.

Action: Use external observability tools (DataDog, New Relic, external status pages) that don't depend on your primary cloud provider.
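Even a single cron-driven probe running outside your primary provider catches the "our dashboards are down too" failure mode. A minimal sketch, assuming a hypothetical webhook on an externally hosted status page:

```python
import json
import urllib.request

ENDPOINTS = ["https://api.example.com/health", "https://app.example.com/health"]
STATUS_WEBHOOK = "https://status.example-elsewhere.com/hooks/heartbeat"  # hosted off-AWS

def probe(url):
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except Exception:
        return False

results = {url: probe(url) for url in ENDPOINTS}
payload = json.dumps(results).encode()
# Report results to the externally hosted status page, healthy or not.
urllib.request.urlopen(
    urllib.request.Request(STATUS_WEBHOOK, data=payload,
                           headers={"Content-Type": "application/json"}),
    timeout=3,
)
```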

6. Communication Plans for Extended Outages

Companies struggled to communicate with customers during the outage because their status pages, email systems, and notification services were down.

Action: Maintain status pages and communication channels on separate infrastructure (different cloud provider or on-prem).

Why US-EAST-1 Outages Are Especially Catastrophic

US-EAST-1 (Northern Virginia) isn't just another AWS region—it's special:

  • Oldest AWS Region: Launched in 2006, it has the most mature service footprint, and new features typically launch here first.
  • Default Region: Many AWS services default to US-EAST-1 in SDKs and the console, leading to accidental dependencies (see the sketch at the end of this section).
  • Control Plane Hub: Global AWS services (CloudFront, Route53, IAM) have critical infrastructure here.
  • Largest Deployment: Estimated to host 30-40% of all AWS workloads globally.
  • Cost Optimized: Lowest pricing, incentivizing companies to centralize here despite risks.

This combination makes US-EAST-1 outages uniquely impactful. When this region fails, the internet notices.
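The "default region" trap is easy to fall into because several AWS behaviors quietly resolve to US-EAST-1 unless you say otherwise. A small sketch of making the choice explicit with boto3 (bucket name and regions are illustrative):

```python
import boto3

# Be explicit about regions instead of relying on ambient configuration.
# S3 buckets created without a LocationConstraint land in us-east-1, and the
# legacy global STS endpoint is also served out of us-east-1.
session = boto3.session.Session(region_name="eu-west-1")

s3 = session.client("s3")
s3.create_bucket(
    Bucket="example-bucket-name-123",  # illustrative bucket name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Pin STS to a regional endpoint rather than the us-east-1-backed global one.
sts = session.client("sts", endpoint_url="https://sts.eu-west-1.amazonaws.com")
print(sts.get_caller_identity()["Account"])
```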

What AWS Said: The Official Post-Mortem

AWS published their post-mortem just 3 days after the incident, significantly faster than their 4-month delay after the 2023 outage. Here's what they confirmed:

Confirmed Root Cause

The outage was caused by a race condition in an internal DynamoDB microservice that automates DNS record management for DynamoDB's regional cells. The DNS Enactor verified plan freshness once at the start of processing, but by the time it finished updating endpoints (after unusual delays), that check was stale. This allowed an old plan to overwrite a newer one, which cleanup automation then deleted, effectively erasing all DNS records for DynamoDB in US-EAST-1.

The cascade into EC2 was caused by the DropletWorkflow Manager (DWFM), which stores state in DynamoDB. When DynamoDB queries timed out, DWFM marked servers as unavailable, triggering congestive collapse as the system tried to reattempt lease renewals all at once.
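The "congestive collapse" dynamic is worth internalizing: when every client retries on the same schedule, the recovery traffic itself keeps the dependency saturated. A generic mitigation (not what AWS did here, just the standard pattern) is capped exponential backoff with jitter, so retries spread out instead of arriving in synchronized waves:

```python
import random
import time

def call_with_backoff(operation, max_attempts=8, base=0.2, cap=30.0):
    """Retry with capped exponential backoff and full jitter to avoid retry storms."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap, so
            # thousands of clients don't all renew their leases at the same instant.
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))

# Hypothetical usage: lease renewal against a flaky dependency.
# call_with_backoff(lambda: renew_lease(droplet_id))  # renew_lease/droplet_id are illustrative
```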

AWS Committed Fixes

  1. Stronger safeguards in the DynamoDB DNS Enactor before applying plans (preventing stale overwrites)
  2. Velocity controls for Network Load Balancers to prevent large-scale failovers (the general pattern is sketched after this list)
  3. A new scale-testing suite for recovery workflows
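"Velocity controls" is a general pattern: cap how much change automation may apply per unit of time so a misbehaving workflow cannot fail over, or delete, everything at once. A minimal token-bucket sketch of the idea (illustrative only, not AWS's implementation):

```python
import time

class VelocityLimiter:
    """Allow at most `rate` changes per second, with a small burst budget."""
    def __init__(self, rate=1.0, burst=5):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller must defer the change (e.g. a failover or record deletion)

limiter = VelocityLimiter(rate=0.5, burst=2)  # at most one change every 2s on average
for change in ["failover-az1", "failover-az2", "failover-az3"]:
    print(change, "applied" if limiter.allow() else "deferred")
```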

Notably, the post-mortem did not explain why DNS Enactor #1 slowed down in the first place, nor the design rationale behind the aggressive DNS record deletion behavior.

The Bigger Picture: Cloud Dependency Risk

This outage exposes a fundamental tension in modern software architecture:

Cloud providers promise five-nines reliability (99.999% uptime), but achieving that requires architectural discipline that most companies don't implement. Multi-region deployments are expensive and complex. Many startups and even mature companies accept the risk of single-region deployment to move faster and reduce costs.

The result? We've created a world where a DNS automation failure in a single region in Northern Virginia can disable critical services globally—from emergency Ring cameras to banking apps to government tax systems.

As an industry, we need to have honest conversations about acceptable risk, true cost of downtime, and realistic expectations for cloud reliability.

Stay Informed: Subscribe to Tech Upkeep

This is exactly the kind of critical infrastructure news that product engineers need to know about—but often miss until it's too late.

Tech Upkeep curates breaking infrastructure incidents, post-mortems from companies like AWS, Netflix, and Google, and the best engineering blog content—delivered to your inbox every Tuesday and Friday.

  • Breaking outage analysis like this one
  • AWS, Google, Azure post-mortem reports explained
  • Production engineering lessons from Netflix, Uber, Airbnb
  • System design patterns to prevent failures

Catch the next major infrastructure incident before it catches you.

Final Thoughts

The October 20, 2025 AWS outage will likely be studied in computer science courses for years to come. It's a stark reminder that even the world's most sophisticated cloud infrastructure has failure modes we don't fully understand until they manifest.

For engineering teams, this incident should trigger honest architectural reviews. Ask yourself:

  • Could our service survive a US-EAST-1 outage?
  • Have we actually tested our disaster recovery procedures?
  • Do we have dependencies we don't know about?
  • Is our monitoring independent of our infrastructure?

The best time to fix these issues is before the next major outage—not during it.

Last updated: April 11, 2026. Updated with AWS's official post-mortem findings.


Written by Benjamin Loh, curator of Tech Upkeep