AWS US-East-1 Outage: Lessons Learned

December 20, 2021

On December 7, AWS experienced a significant outage in us-east-1 that lasted several hours. The impact was widespread: Disney+, Slack, Instacart, McDonald’s app, and countless other services were affected. Even AWS’s own status page and support systems were impaired.

Here’s what happened and what we can learn from it.

What Happened

The Timeline

timeline:
  7:30_am_pst:
    - Network issues begin in us-east-1
    - API errors start appearing
    - Services begin degrading

  8:00_am_pst:
    - Major services impacted
    - Customer reports increase
    - AWS status page shows issues (delayed)

  10:30_am_pst:
    - Primary issue identified
    - Recovery begins
    - Some services restoring

  1:00_pm_pst:
    - Most services recovered
    - Some residual issues

  4:30_pm_pst:
    - Full recovery confirmed
    - Core network impairment lasted ~7 hours; residual backlogs cleared later

The Root Cause

root_cause:
  location: Internal network in us-east-1
  issue: Automated scaling activity on the internal network

  details:
    - An automated scaling activity triggered an unexpected surge of connection attempts
    - Devices linking the internal network to the main AWS network were overwhelmed
    - Congestion and retries cascaded into wider failures
    - Control plane and internal monitoring were impacted

  affected:
    - EC2 (API and console)
    - ECS
    - Lambda
    - DynamoDB
    - CloudWatch
    - Many other services

Why It Was So Widespread

impact_analysis:
  control_plane_dependency:
    - Services depend on shared control plane
    - When the control plane fails, management of every service fails
    - Can't provision, can't modify, can't diagnose

  us_east_1_concentration:
    - Many services default to us-east-1
    - Global services often anchor there
    - Largest AWS region

  status_page_failure:
    - Status page updates depended on impaired us-east-1 tooling
    - Customers couldn't get timely information
    - Created a communication void

  cascading_effects:
    - Dependencies between services
    - Retry storms
    - Backlog accumulation

What We Saw

Customer Impact

customer_symptoms:
  api_errors:
    - 5xx errors from AWS APIs
    - Throttling and timeouts
    - "Service Unavailable"

  data_plane_mostly_ok:
    - Existing EC2 instances still ran
    - Existing ECS tasks continued
    - But couldn't scale, deploy, or modify

  console_unavailable:
    - Couldn't log into console
    - Couldn't see resources
    - Couldn't diagnose issues

Observability Blindness

monitoring_impact:
  cloudwatch_issues:
    - CloudWatch metrics delayed/missing
    - Alarms not firing
    - Dashboards incomplete

  api_monitoring:
    - Synthetic tests failing
    - But unclear whether the app or AWS was at fault

  external_monitoring:
    - More reliable during the outage
    - Confirmed the outage from an outside-in perspective

Lessons Learned

Multi-Region Isn’t Optional

multi_region_importance:
  control_plane:
    - Critical management tooling must work across regions
    - Deployment systems should be resilient
    - Monitoring should be distributed

  data_plane:
    - Failover capability at minimum
    - Active-active if budget allows
    - DNS-based failover is the simplest starting point

  considerations:
    - Cost vs. resilience trade-off
    - Data replication complexity
    - Application architecture requirements

Don’t Depend Solely on Cloud Provider Monitoring

monitoring_diversity:
  internal_monitoring:
    - May fail with the region
    - CloudWatch is of limited use during an AWS outage

  external_monitoring:
    - Synthetic monitoring from outside cloud
    - Third-party providers (Datadog, New Relic)
    - Simple external health checks

  recommendation:
    - Health checks from outside AWS
    - External uptime monitoring
    - Multi-provider observability stack
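
To make this concrete, here's a minimal sketch of the kind of external check we mean: a loop that probes a public health endpoint from somewhere outside AWS and alerts after a few consecutive failures. The endpoint URL, interval, and thresholds are placeholders; a third-party synthetic monitoring product gives you the same thing with less upkeep.

# external_healthcheck.py - run from outside AWS (another cloud, a VPS, or a
# third-party synthetic monitoring provider) so it keeps working when the
# region's own monitoring is impaired. URL and thresholds are placeholders.
import time
import urllib.request

HEALTH_URL = "https://api.example.com/health"   # hypothetical endpoint
TIMEOUT_SECONDS = 5
FAILURES_BEFORE_ALERT = 3
CHECK_INTERVAL_SECONDS = 30


def check_once() -> bool:
    """Return True if the endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def main() -> None:
    consecutive_failures = 0
    while True:
        if check_once():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_ALERT:
                # Page via a channel that does not depend on the failing
                # region (e.g., a secondary provider's webhook or SMS gateway).
                print("ALERT: health check failing from outside AWS")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()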

Design for Control Plane Failures

control_plane_resilience:
  keep_running:
    - Existing workloads should survive
    - Graceful degradation
    - Don't crash-loop on failed API calls

  avoid_dependency:
    - Don't call AWS APIs in request path
    - Cache configuration locally
    - Have fallback modes

  stateless:
    - Stateless services survive better
    - Can restart independently
    - Don't need control plane to run
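
As a sketch of the "cache configuration locally" idea: the snippet below refreshes app config from SSM Parameter Store in the background but always serves requests from a local file, falling back to hard-coded defaults. The parameter name and cache path are hypothetical; the point is that a control-plane outage degrades freshness, not availability.

# config_cache.py - refresh config from SSM Parameter Store in the background,
# but always serve from a local cache so the request path never blocks on an
# AWS control-plane call. Parameter name, path, and defaults are hypothetical.
import json
import pathlib

import boto3
from botocore.exceptions import BotoCoreError, ClientError

PARAM_NAME = "/myapp/config"                      # hypothetical parameter
CACHE_FILE = pathlib.Path("/var/cache/myapp/config.json")
DEFAULTS = {"feature_x_enabled": False}           # safe fallback mode

ssm = boto3.client("ssm")


def refresh_cache() -> None:
    """Called periodically (cron, background thread) - NOT per request."""
    try:
        resp = ssm.get_parameter(Name=PARAM_NAME)
        CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
        CACHE_FILE.write_text(resp["Parameter"]["Value"])
    except (BotoCoreError, ClientError):
        # Control plane unreachable: keep serving whatever we cached last.
        pass


def load_config() -> dict:
    """Used in the request path: local file first, hard-coded defaults last."""
    try:
        return json.loads(CACHE_FILE.read_text())
    except (OSError, json.JSONDecodeError):
        return DEFAULTS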

DNS and Global Services

dns_considerations:
  route53:
    - DNS resolution (data plane) held up; record changes (control plane) were impaired
    - Consider a secondary DNS provider
    - Pre-create failover records and health checks so no control-plane change is needed mid-incident (see the sketch below)

  global_accelerator:
    - Can help with regional failover
    - But check its dependencies

  external_dns:
    - Having backup DNS provider is wise
    - Cloudflare, Google Cloud DNS, etc.
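
Here's a hedged boto3 sketch of the "pre-create failover records" advice: a Route 53 health check on the primary region plus PRIMARY/SECONDARY CNAME records, created ahead of time. The zone ID and hostnames are placeholders. Because Route 53's control plane anchors in us-east-1, the goal is for failover to happen in the data plane (the health check flips the answer) with no record change needed mid-incident.

# route53_failover.py - pre-create DNS failover ahead of time, so no Route 53
# control-plane change is needed during an incident (the data plane keeps
# answering queries). Zone ID and hostnames are hypothetical.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z123EXAMPLE"                     # hypothetical
RECORD_NAME = "app.example.com"
PRIMARY_TARGET = "app-us-east-1.example.com"       # e.g., us-east-1 ALB CNAME
SECONDARY_TARGET = "app-us-west-2.example.com"     # e.g., us-west-2 ALB CNAME

# Health check against the primary region; if it fails, Route 53 starts
# serving the SECONDARY record automatically.
hc = route53.create_health_check(
    CallerReference="app-primary-hc-1",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": PRIMARY_TARGET,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)


def failover_record(identifier, role, target, health_check_id=None):
    rrset = {
        "Name": RECORD_NAME,
        "Type": "CNAME",
        "TTL": 60,                     # low TTL so failover takes effect quickly
        "SetIdentifier": identifier,
        "Failover": role,              # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}


route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            failover_record("primary", "PRIMARY", PRIMARY_TARGET,
                            hc["HealthCheck"]["Id"]),
            failover_record("secondary", "SECONDARY", SECONDARY_TARGET),
        ]
    },
)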

Architecture Patterns

Active-Active Multi-Region

active_active:
  architecture:
    - Traffic to multiple regions simultaneously
    - Data replicated between regions
    - Can serve from either

  benefits:
    - Automatic failover
    - Better latency for global users
    - True resilience

  challenges:
    - Data replication complexity
    - Conflict resolution
    - Cost (2x or more)

  when_to_use:
    - Mission-critical applications
    - Global user base
    - Budget allows
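
One application-level piece of active-active, sketched below under the assumption of a DynamoDB global table named "sessions" replicated to both regions (the table name and regions are illustrative): read and write against the local replica, and fall back to the other region's replica if the local one is erroring.

# active_active_client.py - region-local reads/writes with a cross-region
# fallback, assuming a DynamoDB global table named "sessions" replicated to
# both regions. Table name and regions are hypothetical.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

LOCAL_REGION = "us-east-1"
FALLBACK_REGION = "us-west-2"
TABLE_NAME = "sessions"

tables = {
    region: boto3.resource("dynamodb", region_name=region).Table(TABLE_NAME)
    for region in (LOCAL_REGION, FALLBACK_REGION)
}


def get_session(session_id: str) -> dict | None:
    """Read from the local replica; fall back to the other region on failure."""
    for region in (LOCAL_REGION, FALLBACK_REGION):
        try:
            return tables[region].get_item(Key={"id": session_id}).get("Item")
        except (BotoCoreError, ClientError):
            continue  # local region impaired; try the other replica
    return None


def put_session(session_id: str, data: dict) -> bool:
    """Write to the local replica; global tables replicate asynchronously."""
    for region in (LOCAL_REGION, FALLBACK_REGION):
        try:
            tables[region].put_item(Item={"id": session_id, **data})
            return True
        except (BotoCoreError, ClientError):
            continue  # accept the risk of a conflicting write in the other region
    return False

Writes that land in different regions are reconciled by DynamoDB global tables' last-writer-wins behavior, which is exactly the "conflict resolution" trade-off listed above; applications that can't tolerate that need a different data strategy.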

Active-Passive (Pilot Light)

active_passive:
  architecture:
    - Primary region serves traffic
    - Secondary region minimal (pilot light)
    - Scale up and failover when needed

  benefits:
    - Lower cost than active-active
    - Simpler data management
    - Still provides resilience

  challenges:
    - Failover time (minutes to hours)
    - Data may be slightly stale
    - Need to test regularly

  when_to_use:
    - Moderate resilience needs
    - Cost-conscious
    - RTO of minutes to hours is acceptable
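
A pilot-light failover is ultimately a runbook, and the mechanical steps can be scripted. The sketch below (every name, ID, and number is hypothetical) scales a standby-region Auto Scaling group up from pilot-light size, then repoints a DNS record at the standby load balancer.

# pilot_light_failover.py - scripted failover for an active-passive setup:
# scale the standby region's fleet up from pilot-light size, then point DNS
# at it. All names, IDs, and sizes are hypothetical.
import boto3

STANDBY_REGION = "us-west-2"
ASG_NAME = "app-standby-asg"                  # hypothetical pilot-light ASG
HOSTED_ZONE_ID = "Z123EXAMPLE"
RECORD_NAME = "app.example.com"
STANDBY_TARGET = "app-us-west-2.example.com"  # standby ALB CNAME

# 1. Scale the standby fleet from pilot-light size up to production size.
autoscaling = boto3.client("autoscaling", region_name=STANDBY_REGION)
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    MinSize=4,
    DesiredCapacity=8,
    MaxSize=16,
)

# 2. Point DNS at the standby region (a low TTL keeps switchover time short).
route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": STANDBY_TARGET}],
            },
        }]
    },
)

One caveat: this path needs a Route 53 record change, and Route 53's control plane anchors in us-east-1, which is one more argument for the pre-created, health-check-driven failover shown earlier. Keep the script (and the credentials it needs) somewhere that doesn't live in the primary region.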

Cross-Region Considerations

cross_region_checklist:
  data:
    - [ ] Database replication configured
    - [ ] Replication lag monitored
    - [ ] Backup region data tested

  networking:
    - [ ] VPC peering or Transit Gateway
    - [ ] Route53 health checks configured
    - [ ] Failover DNS ready

  deployment:
    - [ ] CI/CD can deploy to multiple regions
    - [ ] Artifacts available in all regions
    - [ ] Configuration synchronized

  monitoring:
    - [ ] External synthetic monitoring
    - [ ] Cross-region dashboards
    - [ ] Alerts for regional issues

  testing:
    - [ ] Regular failover drills
    - [ ] Chaos engineering for regional failures
    - [ ] Documented runbooks
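
For the "replication lag monitored" item, a small check like the one below can run from (and query CloudWatch in) the secondary region, so it keeps working even if the primary region's monitoring is degraded. The replica identifier and threshold are made up; AWS/RDS ReplicaLag is the standard metric for RDS cross-region read replicas.

# replication_lag_check.py - alert if the cross-region RDS read replica is
# falling behind. Queries CloudWatch in the SECONDARY region so it still works
# if the primary region's monitoring is impaired. Names are hypothetical.
from datetime import datetime, timedelta, timezone

import boto3

SECONDARY_REGION = "us-west-2"
REPLICA_ID = "app-db-replica-usw2"   # hypothetical cross-region read replica
MAX_LAG_SECONDS = 300                # alert if replica is >5 minutes behind

cloudwatch = boto3.client("cloudwatch", region_name=SECONDARY_REGION)

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": REPLICA_ID}],
    StartTime=now - timedelta(minutes=10),
    EndTime=now,
    Period=60,
    Statistics=["Average"],
)

datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
if not datapoints:
    print("ALERT: no ReplicaLag datapoints - replica or monitoring may be down")
elif datapoints[-1]["Average"] > MAX_LAG_SECONDS:
    print(f"ALERT: replica lag {datapoints[-1]['Average']:.0f}s exceeds threshold")
else:
    print(f"OK: replica lag {datapoints[-1]['Average']:.0f}s")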

Status Page Strategy

status_page_learning:
  problem:
    - Tooling for updating the AWS status page ran in us-east-1
    - Updates stalled during the outage
    - Customers couldn't get timely information

  solution:
    - Status page outside primary region
    - Or outside primary provider
    - Multiple communication channels

  for_your_company:
    - Don't host status page on your infra
    - Use SaaS (Statuspage, Instatus)
    - Or host in different region/provider
    - Have Twitter/other backup

Key Takeaways

Every major outage is a reminder: the cloud is someone else's computer, and computers fail. Plan for regional failure, monitor from outside your provider, and make sure the workloads you already have running can survive when the control plane can't help them.