Every system has failure modes waiting to be discovered. The question is whether you find them in controlled experiments or during real incidents at 3 AM. Chaos engineering is the practice of intentionally introducing failures to build confidence in system resilience.
Here’s how to implement chaos engineering effectively.
Why Chaos Engineering
The Confidence Problem
We build systems with resilience in mind:
- Auto-scaling
- Redundancy
- Timeouts and retries
- Circuit breakers
- Failover
But do they actually work?
Theory: "If the database fails, we'll failover to the replica"
Reality: "The failover script hasn't been run in 8 months and has a bug"
Theory: "Retry logic will handle transient failures"
Reality: "Retry storms overwhelm the service during partial outages"
Theory: "We can handle losing an availability zone"
Reality: "Session state isn't replicated, users lose their carts"
From Testing to Experimentation
Traditional testing verifies known behaviors. Chaos engineering discovers unknown behaviors.
Testing:
Input → Expected Output
"Given X, assert Y"
Chaos Engineering:
Inject Failure → Observe System Behavior
"What happens when X fails?"
Principles
The Scientific Method
Chaos engineering follows the scientific method:
Hypothesis: “If database replica fails, traffic will failover within 30 seconds with no user impact”
Experiment: Kill the database replica
Observe: Monitor failover time, error rates, user experience
Learn: Did hypothesis hold? What surprised us?
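One lightweight way to keep experiments honest is to write the hypothesis and abort criteria down as data before anything is injected. A minimal sketch in Python; the ChaosExperiment structure and its field names are illustrative, not part of any particular chaos tool:

# Minimal experiment record: forces the hypothesis, blast radius, and
# abort criteria to be written down before any failure is injected.
# (Illustrative structure only, not tied to a specific chaos tool.)
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str            # e.g. "replica failover completes in < 30s"
    injection: str             # what will be broken, and how
    blast_radius: str          # scope: one pod, one AZ, one service...
    duration_seconds: int
    abort_criteria: str        # e.g. "error rate > 5% for 1 minute"
    observations: list = field(default_factory=list)

    def record(self, note: str) -> None:
        """Capture what actually happened, surprises included."""
        self.observations.append(note)

replica_failover = ChaosExperiment(
    name="kill-database-replica",
    hypothesis="Traffic fails over within 30 seconds with no user impact",
    injection="Terminate one read replica",
    blast_radius="One database cluster",
    duration_seconds=120,
    abort_criteria="Error rate > 5%",
)

Filled in after the run, the same record doubles as the post-experiment write-up.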
Start in Production
Staging doesn’t match production:
- Different scale
- Different traffic patterns
- Different configurations
- Different dependencies
Chaos in production finds real issues. Start small, build controls.
Minimize Blast Radius
Progressive approach:
1. Run in staging first (learn tooling)
2. Run in production during low traffic
3. Limit scope (one service, one zone)
4. Have abort mechanisms (see the abort-guard sketch after this list)
5. Monitor continuously
6. Expand scope as confidence grows
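To make step 4 concrete, here is a minimal abort-guard sketch in Python. get_error_rate() and stop_injection() are hypothetical stand-ins for whatever your monitoring stack and chaos tooling actually provide:

# Abort guard: watch a key metric while the failure is active and stop
# the experiment the moment the abort threshold is crossed.
# get_error_rate() and stop_injection() are hypothetical hooks into your
# own monitoring and chaos tooling.
import time

ERROR_RATE_ABORT_THRESHOLD = 0.05   # abort if > 5% of requests fail
CHECK_INTERVAL_SECONDS = 10
MAX_DURATION_SECONDS = 120

def run_with_abort_guard(get_error_rate, stop_injection) -> bool:
    """Return True if the experiment ran to completion, False if aborted."""
    deadline = time.monotonic() + MAX_DURATION_SECONDS
    while time.monotonic() < deadline:
        error_rate = get_error_rate()
        if error_rate > ERROR_RATE_ABORT_THRESHOLD:
            stop_injection()                  # roll back immediately
            print(f"Aborted: error rate {error_rate:.1%} over threshold")
            return False
        time.sleep(CHECK_INTERVAL_SECONDS)
    stop_injection()                          # normal end of experiment
    return True

In practice, get_error_rate() would query whatever backs your dashboards and stop_injection() would call your chaos tool's stop or rollback command.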
Automate Experiments
Manual chaos is better than none. Automated chaos catches regressions.
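A natural way to automate this is to run small experiments as scheduled tests, so a resilience regression fails a pipeline instead of surfacing in an incident. A hedged sketch using pytest, where chaos_helpers, kill_one_replica(), and measure_failover_seconds() are hypothetical wrappers around your chaos tooling and metrics:

# Scheduled chaos-as-a-test: if failover gets slower or starts dropping
# requests, this test fails in CI instead of during a real outage.
import pytest

# Hypothetical helpers wrapping your chaos tooling and metrics system:
from chaos_helpers import kill_one_replica, measure_failover_seconds

@pytest.mark.chaos   # custom marker so these only run from the scheduled chaos pipeline
def test_replica_failover_stays_fast():
    kill_one_replica()
    failover_seconds = measure_failover_seconds()
    assert failover_seconds < 30, (
        f"Failover took {failover_seconds}s, hypothesis was < 30 seconds"
    )

Run it on a schedule against a staging or canary environment first, then widen the scope as the principles above suggest.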
Starting Chaos Engineering
Prerequisites
Before chaos, you need:
Observability:
- Metrics (request rate, error rate, latency)
- Logs (correlated, searchable)
- Traces (request flow visibility)
- Dashboards (system health at a glance)
Runbooks:
- Known failure modes documented
- Response procedures defined
- Escalation paths clear
Culture:
- Blameless post-mortems
- Learning from failure
- Leadership buy-in
First Experiments
Start simple:
## Starter Chaos Experiments
### 1. Kill a service instance (code sketch after this list)
- Hypothesis: Load balancer routes around it
- Inject: Terminate one pod/instance
- Observe: Error rate, latency, user impact
### 2. Add network latency
- Hypothesis: Timeouts trigger, graceful degradation
- Inject: 500ms latency to downstream service
- Observe: Error rate, timeout triggers, cascade effects
### 3. Fill disk
- Hypothesis: Alerts fire, no data loss
- Inject: Fill disk to 95%
- Observe: Alerting, application behavior
### 4. Exhaust CPU
- Hypothesis: Auto-scaling triggers, requests queue
- Inject: CPU stress on instances
- Observe: Scale-up time, queue depth, errors
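As a starting point for experiment 1, here is a sketch using the official Kubernetes Python client (the kubernetes package). It assumes kubeconfig access and that the target pods carry the app=api label used in the manifests later in this post:

# Starter experiment 1: terminate one pod behind a service and watch
# what the load balancer and error-rate dashboards do next.
# Assumes the `kubernetes` Python client, kubeconfig credentials, and
# pods labelled app=api in the target namespace.
import random
from kubernetes import client, config

NAMESPACE = "default"
LABEL_SELECTOR = "app=api"

def kill_one_pod() -> str:
    config.load_kube_config()                # or load_incluster_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    victim = random.choice(pods.items)       # pick one instance at random
    core.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
    return victim.metadata.name

if __name__ == "__main__":
    print(f"Killed pod: {kill_one_pod()}")

Watch the error-rate and latency dashboards while it runs; if the load balancer is doing its job, users should never notice.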
Failure Injection Techniques
Application Level
# Feature-flag-controlled chaos inside an async request handler.
# chaos_flags and ServiceUnavailable come from the application's own
# feature-flag system and web framework.
import asyncio
import random

if chaos_flags.enabled("random_latency"):
    delay_ms = random.uniform(0, 500)   # add 0-500 ms of latency
    await asyncio.sleep(delay_ms / 1000)

if chaos_flags.enabled("random_errors"):
    if random.random() < 0.01:          # fail roughly 1% of requests
        raise ServiceUnavailable("Injected chaos")
Network Level
# Add 100ms latency (±20ms jitter) with tc (traffic control)
tc qdisc add dev eth0 root netem delay 100ms 20ms
# Or inject 1% packet loss (remove the previous qdisc first; one root qdisc per interface)
tc qdisc add dev eth0 root netem loss 1%
# Clean up: remove the netem qdisc to end the experiment
tc qdisc del dev eth0 root netem
# Block outbound PostgreSQL traffic (port 5432) with iptables
iptables -A OUTPUT -p tcp --dport 5432 -j DROP
# Clean up: delete the rule to restore connectivity
iptables -D OUTPUT -p tcp --dport 5432 -j DROP
Infrastructure Level
# Kubernetes pod deletion
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: api
  scheduler:
    cron: "*/5 * * * *"
Chaos Tools
Chaos Monkey (Netflix):
- Randomly terminates instances
- Simple, proven approach
Gremlin:
- Commercial platform
- Wide range of attacks
- Safety controls built-in
Chaos Mesh (Kubernetes):
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: api
  delay:
    latency: "200ms"
    correlation: "100"
    jitter: "50ms"
  duration: "5m"
LitmusChaos (Kubernetes):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine
spec:
  appinfo:
    appns: default
    applabel: app=api
  chaosServiceAccount: litmus
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_LATENCY
              value: "200"
Experiment Catalog
Service Failures
### Kill primary database
Hypothesis: Replica promotes, app reconnects
Blast radius: One database cluster
Duration: 2 minutes
Abort criteria: Error rate > 5%
### DNS failure
Hypothesis: Cached responses serve, graceful degradation
Blast radius: DNS resolution for one service
Duration: 1 minute
Abort criteria: Full service outage
### Certificate expiration (simulated)
Hypothesis: Alerts fire before expiry
Blast radius: Non-production environment
Duration: N/A (monitoring check)
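For the certificate check, the "experiment" can simply be a monitoring probe. A small sketch using Python's standard library ssl and socket modules (the host name is a placeholder):

# Certificate-expiry probe: the hypothesis is that alerting fires well
# before expiry, so the probe's job is simply to report days remaining.
# HOST is a placeholder; point it at your own endpoint.
import socket
import ssl
import time

HOST = "example.com"
WARN_DAYS = 21

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

remaining = days_until_cert_expiry(HOST)
print(f"{HOST}: certificate expires in {remaining:.1f} days")
if remaining < WARN_DAYS:
    print("ALERT: certificate is close to expiry")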
Resource Exhaustion
### Memory pressure
Hypothesis: OOM killer targets the right process, service recovers
Blast radius: One pod
Duration: Until OOM or 5 minutes
Abort: Manual intervention
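A memory-pressure injection can be as blunt as a process that keeps allocating until the kernel's OOM killer steps in. A sketch, to be run inside the target pod or cgroup rather than on a shared host you care about:

# Memory hog: allocate 100 MB per second until the OOM killer (or the
# 5-minute cap) ends the experiment. Run it inside the target pod/cgroup
# so the blast radius stays at one instance.
import time

CHUNK_BYTES = 100 * 1024 * 1024   # 100 MB per allocation
MAX_SECONDS = 300                 # hard stop if OOM never triggers

hoard = []
start = time.monotonic()
while time.monotonic() - start < MAX_SECONDS:
    hoard.append(bytearray(CHUNK_BYTES))  # zero-initialised, so pages are really committed
    time.sleep(1)
print("Time cap reached without OOM; releasing memory")

The interesting part is what gets OOM-killed: the hog itself, or a neighbour you did not intend to sacrifice.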
### Connection pool exhaustion
Hypothesis: Requests queue, no crashes, eventual recovery
Blast radius: One service
Duration: 3 minutes
Abort: Queue depth > 10000
### Thread/goroutine leak
Hypothesis: Monitoring alerts, graceful restart
Blast radius: One instance
Duration: Until alert or 10 minutes
Network Partitions
### Availability zone isolation
Hypothesis: Cross-AZ traffic routes, stateful services consistent
Blast radius: One AZ in staging
Duration: 5 minutes
Abort: Data inconsistency detected
### Downstream service unreachable
Hypothesis: Circuit breaker trips, fallback serves
Blast radius: Traffic to one downstream
Duration: 2 minutes
Abort: Cascade to other services
Running Experiments Safely
Pre-Experiment Checklist
## Before Running Chaos
- [ ] Hypothesis documented
- [ ] Metrics dashboards ready
- [ ] Rollback procedure known
- [ ] Abort criteria defined
- [ ] Communication sent (if production)
- [ ] On-call team aware
- [ ] Time window appropriate (not during peak)
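Checklists like this are easy to skip under time pressure, so some teams encode them as a pre-flight gate the chaos runner refuses to pass. A trivial sketch; the keys mirror the checklist above and are otherwise illustrative:

# Pre-flight gate: refuse to start injection until every checklist item
# has been explicitly confirmed. Keys mirror the checklist above.
PREFLIGHT = {
    "hypothesis_documented": True,
    "dashboards_ready": True,
    "rollback_procedure_known": True,
    "abort_criteria_defined": True,
    "communication_sent": False,      # still outstanding in this example
    "oncall_aware": True,
    "time_window_ok": True,
}

def assert_preflight(checklist: dict) -> None:
    missing = [item for item, done in checklist.items() if not done]
    if missing:
        raise RuntimeError(f"Chaos experiment blocked; incomplete: {missing}")

assert_preflight(PREFLIGHT)   # raises until every item is confirmed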
During Experiment
## Monitoring During Chaos
Watch for:
- Error rates (should stay below threshold)
- Latency (expected degradation?)
- Cascading failures (unexpected services affected?)
- Alerts firing (did we catch them?)
Ready to abort:
- One-click/command to stop
- Rollback procedure ready
- Incident response if needed
Post-Experiment
## After Chaos Experiment
### Findings:
1. What worked as expected?
2. What surprised us?
3. What broke unexpectedly?
### Actions:
1. Bugs to fix
2. Monitoring gaps
3. Runbook updates
4. Follow-up experiments
### Documentation:
- Update experiment results
- Share learnings with team
Game Days
Larger-scale chaos exercises:
## Quarterly Game Day
### Scope:
Simulate major regional failure
### Participants:
- SRE team (running the scenario)
- On-call engineers (responding)
- Leadership (observing)
### Scenario:
1. 10:00 - Inject: Primary database becomes unreachable
2. Observe: Alerting, response, failover
3. 10:15 - Inject: Cache cluster degraded
4. Observe: Performance impact, degradation handling
5. 10:30 - Recovery begins
6. 11:00 - Full recovery, debrief
### Evaluation:
- Time to detect
- Time to respond
- Communication effectiveness
- Recovery completeness
Key Takeaways
- Chaos engineering discovers unknown failures before they become incidents
- Start with observability; you can’t learn from chaos you can’t see
- Begin with simple experiments (kill a pod) and increase complexity
- Run in production (carefully) because staging doesn’t reflect reality
- Minimize blast radius; have abort mechanisms ready
- Document hypotheses and learn from both confirmed and failed hypotheses
- Automate experiments to catch regressions continuously
- Game days test human response, not just system behavior
- Chaos engineering builds confidence; it shouldn’t create incidents
The goal isn’t to break things—it’s to learn how systems behave under stress and improve them systematically.