On December 7, 2021, AWS experienced a significant outage in its us-east-1 region that lasted several hours. The impact was widespread: Disney+, Slack, Instacart, the McDonald’s app, and countless other services were affected. Even AWS’s own status page and support systems were impaired.
Here’s what happened and what we can learn from it.
What Happened
The Timeline
timeline:
  7:30_am_pst:
    - Network issues begin in us-east-1
    - API errors start appearing
    - Services begin degrading
  8:00_am:
    - Major services impacted
    - Customer reports increase
    - AWS status page shows issues (delayed)
  10:30_am:
    - Primary issue identified
    - Recovery begins
    - Some services restoring
  1:00_pm:
    - Most services recovered
    - Some residual issues
  4:30_pm:
    - Full recovery confirmed
    - Total impact: ~7 hours of major disruption, ~9 hours to full recovery
The Root Cause
root_cause:
  location: Internal network in us-east-1
  issue: Automated scaling activity on the internal network
  details:
    - Unexpected surge of connection activity from internal clients
    - Network devices overwhelmed
    - Cascading failures
    - Control plane impacted
  affected:
    - EC2 (API and console)
    - ECS
    - Lambda
    - DynamoDB
    - CloudWatch
    - Many other services
Why It Was So Widespread
impact_analysis:
  control_plane_dependency:
    - Services depend on a shared control plane
    - When the control plane fails, the ability to manage everything fails with it
    - Can't provision, can't modify, can't diagnose
  us_east_1_concentration:
    - Many workloads default to us-east-1
    - Global services (e.g., IAM, CloudFront, Route 53) anchor their control planes there
    - Largest and oldest AWS region
  status_page_failure:
    - Status page tooling itself depended on us-east-1
    - Customers couldn't get timely information
    - Created a communication void
  cascading_effects:
    - Dependencies between services
    - Retry storms amplified the load (see the sketch below)
    - Backlog accumulation once recovery began
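Retry storms are one of the few amplifiers that application code can control directly. Below is a minimal, illustrative sketch (plain Python, no AWS SDK) of client-side retries with full-jitter exponential backoff plus a retry budget; the class and function names are hypothetical, not from any library.

import random
import time

class RetryBudget:
    """Caps the fraction of calls that may be retries, so a regional outage
    does not turn into a self-inflicted retry storm."""

    def __init__(self, max_retry_ratio=0.1):
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def allow_retry(self):
        self.requests += 1
        return (self.retries / self.requests) < self.max_retry_ratio

    def record_retry(self):
        self.retries += 1

def call_with_backoff(operation, budget, max_attempts=4, base_delay=0.2, max_delay=10.0):
    """Retry a flaky call with capped, full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1 or not budget.allow_retry():
                raise  # give up instead of hammering a degraded dependency
            budget.record_retry()
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter spreads retries out

The budget matters as much as the backoff: once a region is degraded, shedding load helps it recover faster than persistent retries do.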
What We Saw
Customer Impact
customer_symptoms:
  api_errors:
    - 5xx errors from AWS APIs
    - Throttling and timeouts
    - "Service Unavailable"
  data_plane_mostly_ok:
    - Existing EC2 instances still ran
    - Existing ECS tasks continued
    - But couldn't scale, deploy, or modify
  console_unavailable:
    - Couldn't log into the console
    - Couldn't see resources
    - Couldn't diagnose issues
Observability Blindness
monitoring_impact:
  cloudwatch_issues:
    - CloudWatch metrics delayed or missing
    - Alarms not firing
    - Dashboards incomplete
  api_monitoring:
    - Synthetic tests failing
    - But unclear whether the app or AWS was at fault
  external_monitoring:
    - More reliable during the outage
    - Confirmed what customers were actually experiencing from outside
Lessons Learned
Multi-Region Isn’t Optional
multi_region_importance:
  control_plane:
    - Critical management tooling must work cross-region
    - Deployment systems should be resilient
    - Monitoring should be distributed
  data_plane:
    - At minimum, a tested failover capability
    - DNS-based failover as the floor
    - Active-active if budget allows
  considerations:
    - Cost vs. resilience trade-off
    - Data replication complexity
    - Application architecture requirements
Don’t Depend Solely on Cloud Provider Monitoring
monitoring_diversity:
  internal_monitoring:
    - May fail along with the region
    - CloudWatch is of limited use during an AWS outage
  external_monitoring:
    - Synthetic monitoring from outside the cloud
    - Third-party providers (Datadog, New Relic)
    - Simple external health checks
  recommendation:
    - Health checks from outside AWS (sketch below)
    - External uptime monitoring
    - Multi-provider observability stack
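As one concrete illustration of outside-in monitoring, the sketch below polls public endpoints from a host that is not in AWS and fires a webhook on failure. The URLs and webhook are placeholders, and the only dependency assumed is the third-party requests library.

import time
import requests  # pip install requests

# Placeholder endpoints and alert webhook; replace with your own.
ENDPOINTS = [
    "https://app.example.com/healthz",
    "https://api.example.com/healthz",
]
ALERT_WEBHOOK = "https://hooks.example.com/notify"

def check(url, timeout=5):
    """Return True if the endpoint answers 2xx within the timeout."""
    try:
        return 200 <= requests.get(url, timeout=timeout).status_code < 300
    except requests.RequestException:
        return False

def alert(message):
    """Best-effort notification; never let alerting itself raise."""
    try:
        requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=5)
    except requests.RequestException:
        pass

if __name__ == "__main__":
    while True:
        for url in ENDPOINTS:
            if not check(url):
                alert(f"External health check failed for {url}")
        time.sleep(60)

Run this, or a managed synthetic check from a provider like Datadog, from at least one network outside your primary cloud.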
Design for Control Plane Failures
control_plane_resilience:
  keep_running:
    - Existing workloads should survive
    - Graceful degradation
    - Don't crash-loop on failed API calls
  avoid_dependency:
    - Don't call AWS APIs in the request path
    - Cache configuration locally (sketch below)
    - Have fallback modes
  stateless:
    - Stateless services survive better
    - They can restart independently
    - They don't need the control plane to run
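One way to keep AWS APIs out of the request path is to refresh configuration in the background and fall back to the last copy written to disk when the control plane is unreachable. A minimal sketch, assuming configuration lives in an SSM parameter; the parameter name and cache path are illustrative.

import json
import threading
import boto3
from botocore.exceptions import BotoCoreError, ClientError

PARAM_NAME = "/myapp/feature-flags"         # illustrative parameter name
CACHE_PATH = "/var/cache/myapp/flags.json"  # last known good copy on disk
REFRESH_SECONDS = 300

_config = {}

def _load_cached():
    """Fall back to the last config we managed to persist locally."""
    try:
        with open(CACHE_PATH) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def _refresh():
    """Background refresh: the request path never calls AWS directly."""
    global _config
    try:
        ssm = boto3.client("ssm")
        value = ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]
        _config = json.loads(value)
        with open(CACHE_PATH, "w") as f:
            json.dump(_config, f)
    except (BotoCoreError, ClientError, ValueError, OSError):
        # Control plane unavailable or bad data: keep serving the cached copy.
        if not _config:
            _config = _load_cached()
    threading.Timer(REFRESH_SECONDS, _refresh).start()

def get_config():
    """Called from the request path; never blocks on an AWS API call."""
    return _config or _load_cached()

_refresh()  # prime the cache at startup and schedule periodic refreshes

The request path only ever reads from memory or local disk, so a control plane outage degrades configuration freshness rather than availability.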
DNS and Global Services
dns_considerations:
  route53:
    - DNS resolution (the data plane) is generally resilient, but verify
    - Consider a secondary DNS provider
    - Pre-create failover records and health checks; don't rely on mid-outage changes (sketch below)
  global_accelerator:
    - Can help with regional failover
    - But check its own dependencies
  external_dns:
    - A backup DNS provider is wise
    - Cloudflare, Google Cloud DNS, etc.
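For DNS-based failover, the records and health checks can be created ahead of time so no Route 53 control-plane call is needed mid-incident. A hedged boto3 sketch; the hosted zone ID, domain, and regional endpoints are placeholders.

import uuid
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z123EXAMPLE"                   # placeholder hosted zone
DOMAIN = "app.example.com"
PRIMARY_DNS = "primary.us-east-1.example.com"    # e.g. an ALB in us-east-1
SECONDARY_DNS = "standby.us-west-2.example.com"  # e.g. an ALB in us-west-2

# Health check that watches the primary region's endpoint.
health_check_id = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": PRIMARY_DNS,
        "ResourcePath": "/healthz",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def failover_record(set_id, role, target, check_id=None):
    record = {
        "Name": DOMAIN,
        "Type": "CNAME",
        "TTL": 60,                 # short TTL so failover takes effect quickly
        "SetIdentifier": set_id,
        "Failover": role,          # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if check_id:
        record["HealthCheckId"] = check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("primary", "PRIMARY", PRIMARY_DNS, health_check_id),
        failover_record("secondary", "SECONDARY", SECONDARY_DNS),
    ]},
)

Because failover is driven by the health check, traffic shifts to the secondary record without anyone needing to call the Route 53 API during the incident.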
Architecture Patterns
Active-Active Multi-Region
active_active:
  architecture:
    - Traffic to multiple regions simultaneously
    - Data replicated between regions
    - Can serve from either region (client sketch below)
  benefits:
    - Automatic failover
    - Better latency for global users
    - True resilience
  challenges:
    - Data replication complexity
    - Conflict resolution
    - Cost (2x or more)
  when_to_use:
    - Mission-critical applications
    - Global user base
    - Budget allows
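Active-active also has a client-side component: callers can prefer one regional endpoint and fall back to the other when it misbehaves. A minimal sketch assuming two hypothetical regional API endpoints and the requests library.

import requests  # pip install requests

# Placeholder regional endpoints for the same service.
REGIONAL_ENDPOINTS = [
    "https://api.us-east-1.example.com",
    "https://api.us-west-2.example.com",
]

def get_with_regional_fallback(path, timeout=3):
    """Try each regional endpoint in preference order; return the first usable response."""
    last_error = None
    for base in REGIONAL_ENDPOINTS:
        try:
            resp = requests.get(f"{base}{path}", timeout=timeout)
            if resp.status_code < 500:
                return resp          # 2xx-4xx: the region answered; use it
        except requests.RequestException as exc:
            last_error = exc         # network error or timeout: try the next region
    raise RuntimeError(f"All regions failed for {path}") from last_error

# Example: reads work from either region because the data is replicated.
# print(get_with_regional_fallback("/v1/orders/123").json())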
Active-Passive (Pilot Light)
active_passive:
  architecture:
    - Primary region serves traffic
    - Secondary region kept minimal (pilot light)
    - Scale up and fail over when needed (runbook sketch below)
  benefits:
    - Lower cost than active-active
    - Simpler data management
    - Still provides resilience
  challenges:
    - Failover time (minutes to hours)
    - Data may be slightly stale
    - Must be tested regularly
  when_to_use:
    - Moderate resilience needs
    - Cost-conscious
    - An RTO of minutes to hours is acceptable
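A pilot-light failover is typically a short, rehearsed runbook: give the standby region real capacity, promote the replicated data store, and let DNS shift traffic. A hedged boto3 sketch of the first two steps; the region, Auto Scaling group, and replica names are placeholders.

import boto3

STANDBY_REGION = "us-west-2"           # placeholder standby region
ASG_NAME = "myapp-web-standby"         # placeholder Auto Scaling group
REPLICA_ID = "myapp-db-replica-usw2"   # placeholder RDS read replica

def fail_over_to_standby():
    # 1. Turn the pilot light up: give the standby ASG real capacity.
    autoscaling = boto3.client("autoscaling", region_name=STANDBY_REGION)
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME,
        MinSize=3,
        DesiredCapacity=6,
        MaxSize=12,
    )

    # 2. Promote the cross-region read replica to a standalone primary.
    rds = boto3.client("rds", region_name=STANDBY_REGION)
    rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)

    # 3. DNS: with pre-created Route 53 failover records and health checks,
    #    traffic shifts once the primary's health check goes unhealthy.

if __name__ == "__main__":
    fail_over_to_standby()

Note that the runbook talks only to the standby region's endpoints, so it does not depend on the impaired region's control plane; the DNS step can be the pre-created failover records shown earlier.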
Cross-Region Considerations
cross_region_checklist:
  data:
    - [ ] Database replication configured
    - [ ] Replication lag monitored
    - [ ] Backup region data tested
  networking:
    - [ ] VPC peering or Transit Gateway
    - [ ] Route53 health checks configured
    - [ ] Failover DNS ready
  deployment:
    - [ ] CI/CD can deploy to multiple regions
    - [ ] Artifacts available in all regions (sketch below)
    - [ ] Configuration synchronized
  monitoring:
    - [ ] External synthetic monitoring
    - [ ] Cross-region dashboards
    - [ ] Alerts for regional issues
  testing:
    - [ ] Regular failover drills
    - [ ] Chaos engineering for regional failures
    - [ ] Documented runbooks
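For the "Artifacts available in all regions" item, one lightweight approach is to copy each release artifact into per-region buckets at publish time, so a standby-region deploy never reaches back into the impaired region. A sketch with boto3; bucket names and the artifact key are placeholders, and S3 cross-region replication rules are the managed alternative.

import boto3

PRIMARY_BUCKET = "myapp-artifacts-us-east-1"      # placeholder bucket names
REGIONAL_BUCKETS = {
    "us-west-2": "myapp-artifacts-us-west-2",
    "eu-west-1": "myapp-artifacts-eu-west-1",
}

def replicate_artifact(key):
    """Copy one release artifact from the primary bucket to every regional bucket."""
    for region, bucket in REGIONAL_BUCKETS.items():
        s3 = boto3.client("s3", region_name=region)
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": PRIMARY_BUCKET, "Key": key},
        )
        print(f"copied s3://{PRIMARY_BUCKET}/{key} -> s3://{bucket}/{key}")

if __name__ == "__main__":
    replicate_artifact("releases/myapp-1.42.0.tar.gz")  # illustrative key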
Status Page Strategy
status_page_learning:
  problem:
    - AWS's status page tooling was tied to us-east-1
    - Updates lagged during the outage
    - Customers couldn't get timely information
  solution:
    - Host the status page outside your primary region
    - Or outside your primary provider entirely
    - Maintain multiple communication channels
  for_your_company:
    - Don't host the status page on your own infrastructure
    - Use a SaaS product (Statuspage, Instatus)
    - Or host it in a different region/provider
    - Keep Twitter or another channel as a backup
Key Takeaways
- us-east-1 is not special, but it is the default for many workloads; don’t over-concentrate there
- Control plane failures are different from data plane failures
- Multi-region isn’t just nice-to-have for critical services
- External monitoring is essential—don’t rely solely on cloud provider
- Design applications to survive without API access to cloud provider
- Status pages must be hosted separately from main infrastructure
- Regular disaster recovery testing validates your architecture
- This won’t be the last major outage—build for resilience
- Even AWS has failures—no provider is immune
- Trade-offs exist: cost, complexity, resilience—choose consciously
Every major outage is a reminder: the cloud is someone else’s computer, and computers fail.