Chaos Engineering: Building Confidence in System Resilience

June 8, 2020

Every system has failure modes waiting to be discovered. The question is whether you find them in controlled experiments or during real incidents at 3 AM. Chaos engineering is the practice of intentionally introducing failures to build confidence in system resilience.

Here’s how to implement chaos engineering effectively.

Why Chaos Engineering

The Confidence Problem

We build systems with resilience in mind: database failover, retry logic, multi-AZ redundancy.

But do they actually work?

Theory: "If the database fails, we'll failover to the replica"
Reality: "The failover script hasn't been run in 8 months and has a bug"

Theory: "Retry logic will handle transient failures"
Reality: "Retry storms overwhelm the service during partial outages"

Theory: "We can handle losing an availability zone"
Reality: "Session state isn't replicated, users lose their carts"

From Testing to Experimentation

Traditional testing verifies known behaviors. Chaos engineering discovers unknown behaviors.

Testing:
  Input → Expected Output
  "Given X, assert Y"

Chaos Engineering:
  Inject Failure → Observe System Behavior
  "What happens when X fails?"

Principles

The Scientific Method

Chaos engineering follows the scientific method:

  1. Hypothesis: “If database replica fails, traffic will failover within 30 seconds with no user impact”

  2. Experiment: Kill the database replica

  3. Observe: Monitor failover time, error rates, user experience

  4. Learn: Did hypothesis hold? What surprised us?
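
One way to keep experiments honest is to carry the hypothesis alongside the injection and observation steps. A minimal sketch, assuming a home-grown harness rather than any particular tool (the class and field names are illustrative):

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ChaosExperiment:
    hypothesis: str                       # 1. what we expect to happen
    inject: Callable[[], None]            # 2. the failure we introduce
    observe: Callable[[], dict]           # 3. the metrics we collect
    findings: list = field(default_factory=list)  # 4. what we learned

    def run(self) -> dict:
        print(f"Hypothesis: {self.hypothesis}")
        self.inject()
        result = self.observe()
        self.findings.append(result)      # record what actually happened
        return result

# The callables would wrap real tooling (kubectl, tc, a Prometheus query); stubs here.
experiment = ChaosExperiment(
    hypothesis="If a database replica fails, traffic fails over within 30s with no user impact",
    inject=lambda: print("killing replica (stubbed)"),
    observe=lambda: {"failover_seconds": 22, "error_rate": 0.0},
)
print(experiment.run())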

Start in Production

Staging doesn’t match production: traffic patterns, data volumes, configuration, and dependencies all differ.

Chaos in production finds real issues. Start small, build controls.

Minimize Blast Radius

Progressive approach:
1. Run in staging first (learn tooling)
2. Run in production during low traffic
3. Limit scope (one service, one zone)
4. Have abort mechanisms
5. Monitor continuously
6. Expand scope as confidence grows
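
As a concrete example of limiting scope, an experiment can refuse to touch more than a small fraction of the fleet. The 10% cap and instance names below are assumptions, not a prescription:

import math
import random

def pick_targets(instances: list[str], max_fraction: float = 0.10) -> list[str]:
    """Choose a small random subset so the experiment can never hit the whole fleet."""
    budget = max(1, math.floor(len(instances) * max_fraction))
    return random.sample(instances, budget)

instances = [f"api-{i}" for i in range(20)]
print(pick_targets(instances))  # e.g. ['api-7', 'api-13'] -- at most 2 of 20 instances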

Automate Experiments

Manual chaos is better than none. Automated chaos catches regressions.

Starting Chaos Engineering

Prerequisites

Before chaos, you need:

Observability: metrics, dashboards, and alerting good enough to see an injected failure’s impact in real time; if you can’t measure the blast, you can’t bound it.

Runbooks: documented recovery procedures, so an experiment that goes sideways becomes a routine response rather than an improvised scramble.

Culture: blameless handling of findings, plus agreement from the team and leadership that controlled failure is worth the learning.

First Experiments

Start simple; the first experiment is sketched in code after this list:

## Starter Chaos Experiments

### 1. Kill a service instance
- Hypothesis: Load balancer routes around it
- Inject: Terminate one pod/instance
- Observe: Error rate, latency, user impact

### 2. Add network latency
- Hypothesis: Timeouts trigger, graceful degradation
- Inject: 500ms latency to downstream service
- Observe: Error rate, timeout triggers, cascade effects

### 3. Fill disk
- Hypothesis: Alerts fire, no data loss
- Inject: Fill disk to 95%
- Observe: Alerting, application behavior

### 4. Exhaust CPU
- Hypothesis: Auto-scaling triggers, requests queue
- Inject: CPU stress on instances
- Observe: Scale-up time, queue depth, errors
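
Here is a sketch of the first experiment using the official Kubernetes Python client and a Prometheus query. The namespace, label selector, Prometheus address, and PromQL expression are assumptions; swap in your own, and run it against a non-critical service first:

import time
import requests
from kubernetes import client, config

PROMETHEUS = "http://prometheus:9090"  # assumed address
ERROR_RATE = 'sum(rate(http_requests_total{status=~"5.."}[1m]))'  # assumed metric

def error_rate() -> float:
    """Read the current 5xx rate from Prometheus (0.0 if the query returns nothing)."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": ERROR_RATE})
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def kill_one_pod(namespace: str = "default", label: str = "app=api") -> str:
    """Terminate a single matching pod and return its name."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pod = v1.list_namespaced_pod(namespace, label_selector=label).items[0]
    v1.delete_namespaced_pod(pod.metadata.name, namespace)
    return pod.metadata.name

baseline = error_rate()
victim = kill_one_pod()
print(f"Killed {victim}; baseline error rate {baseline:.4f}")
for _ in range(12):  # observe for ~2 minutes
    time.sleep(10)
    print(f"error rate: {error_rate():.4f}")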

Failure Injection Techniques

Application Level

# Feature flag controlled chaos (inside an async request handler)
# chaos_flags is your feature-flag client and ServiceUnavailable your framework's
# 503 exception; both are stand-ins here.
import asyncio
import random

if chaos_flags.enabled("random_latency"):
    delay_ms = random.uniform(0, 500)  # inject 0-500ms of latency
    await asyncio.sleep(delay_ms / 1000)

if chaos_flags.enabled("random_errors"):
    if random.random() < 0.01:  # fail 1% of requests
        raise ServiceUnavailable("Injected chaos")
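
The chaos_flags object above stands in for whatever feature-flag system you already use; a minimal environment-variable-backed placeholder could look like this (class and variable names are hypothetical):

import os

class EnvChaosFlags:
    """Enable chaos behaviors via env vars, e.g. CHAOS_RANDOM_LATENCY=1."""
    def enabled(self, name: str) -> bool:
        return os.environ.get(f"CHAOS_{name.upper()}") == "1"

chaos_flags = EnvChaosFlags()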

Network Level

# Using tc (traffic control) to add latency (100ms with 20ms jitter)
tc qdisc add dev eth0 root netem delay 100ms 20ms

# Packet loss (only one root netem qdisc can be attached at a time)
tc qdisc add dev eth0 root netem loss 1%

# Remove the netem qdisc when the experiment ends
tc qdisc del dev eth0 root netem

# Using iptables to block outbound traffic to Postgres (port 5432)
iptables -A OUTPUT -p tcp --dport 5432 -j DROP

# Remove the block when the experiment ends
iptables -D OUTPUT -p tcp --dport 5432 -j DROP
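
When these commands are driven from scripts, wrapping the injection in a context manager makes cleanup hard to forget. A minimal sketch, assuming root privileges and the same interface and netem parameters as above:

import subprocess
from contextlib import contextmanager

@contextmanager
def injected_latency(interface: str = "eth0", delay: str = "100ms", jitter: str = "20ms"):
    """Add netem latency for the duration of the block, then always remove it."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", delay, jitter],
        check=True,
    )
    try:
        yield
    finally:
        subprocess.run(["tc", "qdisc", "del", "dev", interface, "root", "netem"], check=True)

# Usage: the latency exists only inside the with-block.
# with injected_latency():
#     observe_the_system()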

Infrastructure Level

# Kubernetes pod deletion
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: api
  scheduler:
    cron: "*/5 * * * *"

Chaos Tools

Chaos Monkey (Netflix): the original chaos tool; randomly terminates instances so services must tolerate instance loss as a matter of course.

Gremlin: a commercial platform with a catalog of failure types (resource, network, state) and built-in safety controls and scheduling.

Chaos Mesh (Kubernetes): a CNCF project that expresses experiments as Kubernetes custom resources, such as this NetworkChaos that injects 200ms of delay:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: api
  delay:
    latency: "200ms"
    correlation: "100"
    jitter: "50ms"
  duration: "5m"

LitmusChaos (Kubernetes): another CNCF project; experiments come from a reusable hub and are wired to an application through a ChaosEngine resource:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine
spec:
  appinfo:
    appns: default
    applabel: app=api
  chaosServiceAccount: litmus
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_LATENCY
              value: "200"

Experiment Catalog

Service Failures

### Kill primary database
Hypothesis: Replica promotes, app reconnects
Blast radius: One database cluster
Duration: 2 minutes
Abort criteria: Error rate > 5%

### DNS failure
Hypothesis: Cached responses serve, graceful degradation
Blast radius: DNS resolution for one service
Duration: 1 minute
Abort criteria: Full service outage

### Certificate expiration (simulated)
Hypothesis: Alerts fire before expiry
Blast radius: Non-production environment
Duration: N/A (monitoring check)

Resource Exhaustion

### Memory pressure
Hypothesis: OOM killer targets right process, service recovers
Blast radius: One pod
Duration: Until OOM or 5 minutes
Abort: Manual intervention

### Connection pool exhaustion
Hypothesis: Requests queue, no crashes, eventual recovery
Blast radius: One service
Duration: 3 minutes
Abort: Queue depth > 10000

### Thread/goroutine leak
Hypothesis: Monitoring alerts, graceful restart
Blast radius: One instance
Duration: Until alert or 10 minutes

Network Partitions

### Availability zone isolation
Hypothesis: Cross-AZ traffic routes, stateful services consistent
Blast radius: One AZ in staging
Duration: 5 minutes
Abort: Data inconsistency detected

### Downstream service unreachable
Hypothesis: Circuit breaker trips, fallback serves
Blast radius: Traffic to one downstream
Duration: 2 minutes
Abort: Cascade to other services
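
The last experiment presumes a circuit breaker and a fallback path exist. For reference, a minimal breaker might behave like the sketch below; the thresholds and the callable fallback are illustrative, not a specific library's API:

import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; try again after `reset_seconds`."""
    def __init__(self, max_failures: int = 5, reset_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at and time.time() - self.opened_at < self.reset_seconds:
            return fallback()                 # breaker open: serve the fallback
        try:
            result = fn()
            self.failures, self.opened_at = 0, None
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            return fallback()

# During the experiment, watch how quickly the breaker trips and whether the
# fallback response is acceptable to users.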

Running Experiments Safely

Pre-Experiment Checklist

## Before Running Chaos

- [ ] Hypothesis documented
- [ ] Metrics dashboards ready
- [ ] Rollback procedure known
- [ ] Abort criteria defined
- [ ] Communication sent (if production)
- [ ] On-call team aware
- [ ] Time window appropriate (not during peak)

During Experiment

## Monitoring During Chaos

Watch for:
- Error rates (should stay below threshold)
- Latency (expected degradation?)
- Cascading failures (unexpected services affected?)
- Alert firing (did we catch it?)

Ready to abort:
- One-click/command to stop
- Rollback procedure ready
- Incident response if needed
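
Abort criteria are easier to honor when a watcher enforces them automatically. Here is a sketch of such a guard; error_rate and abort_experiment are stand-ins for your own metric query and stop mechanism:

import time

def watch_and_abort(error_rate, abort_experiment, threshold: float = 0.05,
                    duration_s: int = 300, interval_s: int = 10) -> bool:
    """Poll the error rate for the experiment window; abort if the threshold is crossed."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        rate = error_rate()
        if rate > threshold:
            abort_experiment()
            print(f"Aborted: error rate {rate:.2%} exceeded {threshold:.2%}")
            return False
        time.sleep(interval_s)
    return True  # ran to completion within the abort criteria

# Usage with stubs:
# watch_and_abort(error_rate=lambda: 0.01, abort_experiment=lambda: None, duration_s=30)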

Post-Experiment

## After Chaos Experiment

### Findings:
1. What worked as expected?
2. What surprised us?
3. What broke unexpectedly?

### Actions:
1. Bugs to fix
2. Monitoring gaps
3. Runbook updates
4. Follow-up experiments

### Documentation:
- Update experiment results
- Share learnings with team

Game Days

Larger-scale chaos exercises:

## Quarterly Game Day

### Scope:
Simulate major regional failure

### Participants:
- SRE team (running)
- On-call engineers (responding)
- Leadership (observing)

### Scenario:
1. 10:00 - Inject: Primary database becomes unreachable
2. Observe: Alerting, response, failover
3. 10:15 - Inject: Cache cluster degraded
4. Observe: Performance impact, degradation handling
5. 10:30 - Recovery begins
6. 11:00 - Full recovery, debrief

### Evaluation:
- Time to detect
- Time to respond
- Communication effectiveness
- Recovery completeness

Key Takeaways

The goal isn’t to break things—it’s to learn how systems behave under stress and improve them systematically.