Infrastructure & Reliability · 14 min read

The Year the Internet Kept Breaking: 2025's Major Cloud Outages and the AI Factor

AI demand made headlines, but the outages that defined 2025 were rooted in dull realities: control planes that ship faster than their safeguards, redundant systems without coordination, and recovery playbooks that create second incidents. I spent the week after the AWS outage auditing our own kill-switch latency—and found gaps I wish I'd caught earlier. This post captures the root causes and the signals worth carrying into your own roadmap.

Before diving into the timelines, align on what actually mattered this year: ignore the hype and follow the constraints.

  • Hours of disruption: 31+. Combined multi-region downtime across the four largest incidents.
  • Revenue impact: $60–100B. Parametrix and insurance-industry estimates across all affected services.
  • Config-driven incidents: 100%. Every root cause traced to control-plane changes, not external attackers.

How 2025 actually unfolded

Four incidents dominated the year. Each started as a routine change, escalated because safety nets lagged, and lingered because recovery plans weren’t designed for thundering herds. Track them chronologically to see the shared playbook.

2025 Cloud Outage Timeline

June 12, 2025 · 3 hours

Google Cloud: Null Pointer Disaster

Impact: 50+ services including Gmail, Spotify, Discord
Root Cause: Unprotected feature flag with null pointer bug
Read Official Post-Mortem →
October 19–20, 2025 · 14+ hours

AWS: DynamoDB DNS Race Condition

Impact: DynamoDB, EC2, Lambda, ECS/EKS
Root Cause: DNS Enactor collision deleted all endpoint IPs
Read Official Post-Mortem →
October 29–30, 2025 · 8.5 hours

Azure: Async Processing Bug

Impact: Microsoft 365, Teams, Azure Portal
Root Cause: Incompatible metadata across build versions
Read Official Post-Mortem →
November 18, 2025 · 5h 46m

Cloudflare: Hardcoded Limit Disaster

Impact: 12.5M websites, X, ChatGPT
Root Cause: 200-feature hardcoded limit exceeded
Read Official Post-Mortem →

Total 2025 Impact

  • Combined downtime: 31+ hours
  • Est. financial loss: $60–100B
  • Services affected: 100+
  • Root cause: config errors

What failed at each provider (and why it mattered)

Google Cloud · 3h of widespread API failures

Quota policy enforcement code shipped dark on May 29, then triggered mid-June when blank fields hit the control plane. Crash loops starved the us-central1 databases that coordinated restarts. Gmail, Spotify, and Discord all went down—not because of load, but because a null pointer escaped review.

Root cause

Kill switches existed but propagated slower than the failure, so every restart re-triggered the same null pointer.

Why it matters

Treat restart storms as incidents. Ship kill switches and blast-radius limits alongside new control-plane code.
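As a sketch of what "kill switch plus safe defaults" can look like (the names, flag wiring, and policy shape here are illustrative, not Google's actual code): the new enforcement path checks its kill switch before doing anything else, and treats blank fields as a reason to skip the check rather than crash the binary.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("quota-policy")

@dataclass
class QuotaPolicy:
    project_id: Optional[str]  # may arrive blank from an upstream writer
    limit: Optional[int]

def enforce_quota(policy: QuotaPolicy, kill_switch_enabled: bool) -> bool:
    """Return True if the request should be admitted."""
    if kill_switch_enabled:
        # Operators can neutralize the new code path without a redeploy.
        return True
    if policy.project_id is None or policy.limit is None:
        # The blank-field case: skip enforcement and log, instead of crashing
        # and dragging the whole process into a restart loop.
        log.warning("quota policy missing fields, skipping enforcement: %r", policy)
        return True
    return policy.limit > 0
```

The point of the fallback branch is that a bad record degrades one check, not the service; the kill switch only helps if it propagates faster than the failure it guards.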

AWS · 14h of cascading impact

Two "DNS Enactor" processes tried to publish the same plan; the older plan won and cleanup logic deleted every IP for DynamoDB regional endpoints. The irony? Those redundant Enactors were supposed to prevent exactly this kind of failure. Instead, they raced each other and both lost.

Root cause

Redundant controllers shared mutable state without coordination, turning safety into a race condition.

Why it matters

Redundancy without orchestration creates single points of failure. DNS changes need locking, auditing, and dry-run parity with production.
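A minimal sketch of the coordination that was missing, using an in-process lock and a monotonic plan version as stand-ins for whatever lock service and plan format you actually run: a stale enactor's plan is rejected instead of clobbering newer state, and a plan that would leave an endpoint with zero IPs never ships.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class DnsPlan:
    version: int                   # monotonically increasing plan generation
    records: Dict[str, List[str]]  # endpoint -> IP addresses

class PlanApplier:
    """Serializes plan application so a stale enactor can never win the race."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._applied_version = -1

    def apply(self, plan: DnsPlan) -> bool:
        with self._lock:  # one writer at a time; a lock service in real life
            if plan.version <= self._applied_version:
                # Stale plan: reject it instead of letting its cleanup logic
                # delete records a newer plan already published.
                return False
            if any(not ips for ips in plan.records.values()):
                # Hard invariant: never publish an endpoint with zero IPs.
                raise ValueError("plan would leave an endpoint with no addresses")
            self._applied_version = plan.version
            # ...push plan.records to the DNS backend here...
            return True
```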

Azure · 8.5h of global degradation

Customer configs built in two different toolchains produced metadata that passed syntax checks but crashed asynchronous workers at every edge location.

Root cause

Validation only covered synchronous submission, not the async pipeline that actually executes the config.

Why it matters

Replay configs through the same asynchronous workers you run in production. Deterministic validation > "looks good" linting.
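One way to phrase that as code, assuming you can import (or, as here, stub) the very apply step your async workers execute; all names are hypothetical. The submission gate calls the real code path in dry-run mode, so incompatible metadata fails at submit time instead of at every edge location.

```python
SUPPORTED_BUILDS = {"v1", "v2"}  # assumption: builds the edge workers understand

def apply_config(config: dict, dry_run: bool = False) -> None:
    """Stand-in for the async worker's real apply step. The point is that
    validation imports and runs THIS function, not a parallel re-implementation."""
    build = config["metadata"]["build"]  # raises KeyError on missing metadata
    if build not in SUPPORTED_BUILDS:
        raise ValueError(f"incompatible build metadata: {build!r}")
    if not dry_run:
        pass  # ...apply to the edge fleet here...

def validate_by_replay(candidate: dict) -> None:
    """Submission-time gate: execute the same code path the async pipeline will,
    so anything that would crash a worker hours later fails right here."""
    apply_config(candidate, dry_run=True)

# Example: a config built by the wrong toolchain is rejected at submit time.
# validate_by_replay({"metadata": {"build": "v3"}})  -> ValueError
```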

Cloudflare · 5h 46m of rolling 5xx errors

A routine permission change let ClickHouse duplicate metadata. The Bot Management file doubled, exceeded a hardcoded 200-feature limit, and proxies panicked. X (Twitter) and ChatGPT went dark—two of the most visible services on the internet, felled by a number someone once decided was 'big enough.'

Root cause

Size limits were hardcoded, unobserved, and treated as "won't happen."

Why it matters

Track limit utilization like any other SLO. When you breach a limit, degrade gracefully instead of panicking processes.
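A rough sketch of that idea, with made-up names and thresholds: utilization against the limit is logged as a metric on every reload, a warning fires well before the ceiling, and an oversized file keeps the last known-good version in place instead of crashing the proxy.

```python
import logging
from typing import List

log = logging.getLogger("bot-mgmt")

MAX_FEATURES = 200        # the hard limit, now explicit and observable
ALERT_UTILIZATION = 0.8   # page before the ceiling, not after

_current_features: List[str] = []   # last known-good feature set

def load_features(new_features: List[str]) -> List[str]:
    """Swap in a new feature file, degrading gracefully if it exceeds the limit."""
    global _current_features
    utilization = len(new_features) / MAX_FEATURES
    log.info("feature_limit_utilization=%.2f", utilization)  # export as an SLO metric
    if utilization >= ALERT_UTILIZATION:
        log.warning("feature file at %.0f%% of its hard limit", utilization * 100)
    if len(new_features) > MAX_FEATURES:
        # Oversized file: keep serving the last good version instead of panicking.
        log.error("rejecting oversized feature file (%d > %d)",
                  len(new_features), MAX_FEATURES)
        return _current_features
    _current_features = new_features
    return _current_features
```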

Root-cause patterns worth stealing

After reading every post-mortem twice, I kept seeing the same three patterns. I've started using them as a checklist before any control-plane deploy.

Configuration drift is the default state

Each outage began with a "routine" change that bypassed meaningful review because it looked safe in isolation.

Diff and lint every control-plane change the same way you treat application code, and require rollback budgets before rollout.
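For example, a pre-deploy lint step might look something like this; the blast-radius budget and the list of high-risk keys are assumptions you'd tune for your own control plane, not a standard.

```python
from typing import Dict, List, Optional

HIGH_RISK_KEYS = {"dns", "quota", "routing"}   # assumption: your riskiest surfaces
BLAST_RADIUS_BUDGET = 5                        # assumption: max keys per rollout

def lint_change(live: Dict[str, object], proposed: Dict[str, object],
                rollback_plan: Optional[str]) -> List[str]:
    """Treat a control-plane change like application code: diff it, bound it,
    and refuse to roll it out without a rollback plan."""
    errors: List[str] = []
    changed = {k for k in set(live) | set(proposed) if live.get(k) != proposed.get(k)}

    if not rollback_plan:
        errors.append("no rollback plan attached to this change")
    if len(changed) > BLAST_RADIUS_BUDGET:
        errors.append(f"change touches {len(changed)} keys; split it up")
    for key in sorted(changed & HIGH_RISK_KEYS):
        errors.append(f"'{key}' is high-risk and requires a staged, per-region rollout")
    return errors
```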

Safety nets assume synchronous worlds

Azure, Google, and Cloudflare all had validation, but it measured inputs—not how async workers and caches would behave hours later.

Shift testing closer to execution. If an async pipeline will touch it, your validation should too.
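Concretely, that can be as small as a test that pushes a representative config through the same async coroutine the pipeline runs, instead of stopping at the submission API. This is an illustrative harness with made-up names, not any provider's real pipeline.

```python
import asyncio

async def edge_worker(queue: "asyncio.Queue[dict]") -> None:
    """Assumption: this is the same coroutine the production pipeline runs."""
    cfg = await queue.get()
    if "metadata" not in cfg:
        raise ValueError("worker cannot execute a config without metadata")

def test_config_survives_async_pipeline() -> None:
    """A submission-time syntax check would pass this config either way;
    running the real worker coroutine is what catches execution failures."""
    async def scenario() -> None:
        queue: asyncio.Queue = asyncio.Queue()
        await queue.put({"metadata": {"build": "v2"}, "rules": []})
        await edge_worker(queue)
    asyncio.run(scenario())  # raises if the async path cannot execute the config
```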

Recovery can be a second incident

Thundering herds and manual restarts turned short outages into all-day events at AWS and Google.

Budget for reconnection. Rate-limit clients, stage restarts, and automate partial rollbacks so humans aren’t improvising under load.
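The client-side half of that budget is well known but still rare in practice: exponential backoff with full jitter, so a recovering dependency sees a spread-out trickle instead of a thundering herd. A minimal sketch:

```python
import random
import time
from typing import Callable

def reconnect_with_jitter(connect: Callable[[], None], max_attempts: int = 8) -> None:
    """Exponential backoff with full jitter: the client-side half of a recovery
    budget, so a fleet does not hammer a dependency the moment it comes back."""
    for attempt in range(max_attempts):
        try:
            connect()
            return
        except ConnectionError:
            ceiling = min(60.0, 2.0 ** attempt)        # cap the backoff window
            time.sleep(random.uniform(0.0, ceiling))   # full jitter spreads the herd
    raise RuntimeError("dependency still unavailable after staged retries")
```

Server-side staging (bringing cohorts of clients back in waves) complements this; neither works alone when the whole fleet reconnects at once.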

Separate the valuable signals from the noise

Worth your attention
  • Vendor post-mortems that quantify blast radius, recovery time, and concrete fixes.
  • Evidence of config isolation, policy linting, or kill-switch telemetry that can be reused internally.
  • Details about how control planes were instrumented (or not) to catch regressions before customers felt them.
  • Numbers about reconnection load, queue depth, or other recovery bottlenecks you can benchmark against.
Skip the distractions
  • Vague statements about "AI demand" with zero data about what actually failed.
  • Hero narratives that credit war rooms but skip the bugs and missing guardrails.
  • One-off vendor promises like "we added more capacity" with no mention of process changes.
  • Any summary that blames rare coincidence instead of describing the config path that broke.

What reliability teams should do next

  1. Measure kill-switch latency

    List every feature flag or kill switch tied to control-plane code and record how long it takes to propagate globally. If the number exceeds five minutes, it's effectively useless. A minimal measurement sketch follows this list.

  2. Rehearse config rollbacks with load

    Run simulated rollbacks while synthetic clients hammer your APIs. If state rebuild floods a shared database, fix it now—not during an incident.

  3. Score dependencies by reconnection load

    Quantify how many clients, workers, or devices reconnect at once when a dependency recovers. Use that score to phase restarts.

  4. Instrument the control plane first

    Add alerts for config size, plan version skew, and queue depth. The application tier can’t save you if the control plane is blind.

  5. Pilot a small second provider

    Not full multi-cloud—just enough critical workload to keep the lights on. Use it as leverage and as a training ground for failover procedures.
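For step 1, the measurement itself is simple once you decide to take it. In this sketch, `flip_flag`, `read_flag_in_region`, and `regions` are placeholders for whatever flag service and region list you actually have; the point is to record the number rather than guess it.

```python
import time
from typing import Callable, Iterable, Optional

def measure_kill_switch_latency(flip_flag: Callable[[bool], None],
                                read_flag_in_region: Callable[[str], Optional[bool]],
                                regions: Iterable[str],
                                timeout_s: float = 600.0) -> float:
    """Flip a kill switch, then time how long until every region observes it."""
    flip_flag(False)                      # disable the guarded code path everywhere
    start = time.monotonic()
    pending = set(regions)
    while pending and time.monotonic() - start < timeout_s:
        # A region leaves the pending set once it reports the new value.
        pending = {r for r in pending if read_flag_in_region(r) is not False}
        time.sleep(1.0)
    if pending:
        raise RuntimeError(f"kill switch not visible in {sorted(pending)} after {timeout_s}s")
    return time.monotonic() - start
```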

The uncomfortable pattern: Every outage confirmed the same thing—the control plane was blind until customers yelled. I've shipped features that looked urgent at the time but added zero observability to the systems that actually keep the lights on. These incidents were a wake-up call. Instrument the boring systems first; feature launches can wait.

Written by Tech Upkeep

November 21, 2025