Chaos Engineering for Mortals

August 21, 2017

Netflix Chaos Monkey randomly kills production instances. This sounds terrifying—and for unprepared organizations, it would be. But controlled failure injection reveals weaknesses before real failures do.

You don’t need Netflix’s scale or a dedicated chaos team to benefit. Start small, grow capability over time, and build confidence in your system’s resilience.

What Chaos Engineering Is

Chaos engineering is empirical testing of system behavior under adverse conditions. You hypothesize how the system should behave when something fails, inject that failure, and observe whether reality matches expectations.

The process:

  1. Define steady state: What does “working” look like? Identify the metrics that indicate health.
  2. Hypothesize: “If X fails, the system should continue working with Y degradation.”
  3. Inject failure: Actually cause X to fail.
  4. Observe: Did the system behave as hypothesized?
  5. Learn: What gaps exist? What needs improvement?

This isn’t random destruction—it’s scientific experimentation about resilience.
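
A minimal sketch of that process in shell, assuming a service with an HTTP health endpoint behind a supervisor that restarts it (the URL and process name are placeholders):

# 1. Steady state: the health endpoint answers successfully.
curl -sf http://myapp.internal/health && echo "steady state OK"

# 2. Hypothesis: if one app process dies, the supervisor restarts it
#    and health checks recover within seconds.

# 3. Inject the failure.
pkill -9 -f myapp

# 4. Observe: poll until the endpoint recovers and note how long it takes.
until curl -sf http://myapp.internal/health; do sleep 1; done
echo "recovered"

# 5. Learn: compare observed behavior against the hypothesis.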

Starting Simple

Game Days

Before automated failure injection, practice manual exercises:

Tabletop exercises: Gather the team, describe a failure scenario, walk through response without actually breaking anything.

“It’s 2 PM on Tuesday. The primary database server becomes unreachable. What happens? Who does what? How long until recovery?”

Tabletops reveal gaps in knowledge, documentation, and process without risk.

Controlled failovers: During low-traffic periods with engineers ready, trigger failover scenarios manually:
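
For example, here is a minimal sketch of a database failover drill, assuming a PostgreSQL primary with a streaming replica (hostnames, the data directory, and the health URL are placeholders):

# On the primary host: simulate losing the primary
systemctl stop postgresql

# On the replica host: promote the standby to primary
pg_ctl promote -D /var/lib/postgresql/data

# Verify the application reconnects and still serves traffic
curl -sf http://app.internal/health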

These exercises build muscle memory for real incidents.

Single-Service Experiments

Start chaos experiments with a single, non-critical service:

  1. Choose a service with good observability
  2. Define steady state (request success rate, latency)
  3. Inject simple failure (restart, network delay)
  4. Observe impact on steady state
  5. Document findings

Graduate to more complex experiments as confidence grows.

Staging First

Run experiments in staging before production. Staging experiments are safer and still valuable—they reveal issues without customer impact.

Production experiments are the ultimate test, but staging experiments catch many problems.

Types of Failures to Inject

Process Failures

Kill processes and observe recovery:
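
For example, a hedged sketch assuming a systemd-managed service (myapp and the health URL are placeholders):

# Kill the main process, then watch whether and how fast it comes back
kill -9 "$(systemctl show -p MainPID --value myapp)"
watch -n1 'systemctl is-active myapp; curl -sf http://localhost:8080/health && echo healthy'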

What happens when the process dies? Does it restart? How long? Do health checks detect it?

Network Failures

Network problems are common in distributed systems:

Latency injection: Add delay to network calls. Does the caller handle slow responses gracefully?

Packet loss: Drop some packets. Does retry logic work?

DNS failure: Make DNS resolution fail. Does caching help? How does the system degrade?

Partition: Isolate components from each other. Does the system handle split-brain scenarios?
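
Several of these can be reproduced with standard Linux tooling. A hedged sketch (eth0 and the peer address are placeholders; run as root and clean up afterwards):

# Drop 5% of outgoing packets
tc qdisc add dev eth0 root netem loss 5%

# Partition this host from one specific peer
iptables -A OUTPUT -d 10.0.0.7 -j DROP

# Undo both when the experiment ends
tc qdisc del dev eth0 root netem
iptables -D OUTPUT -d 10.0.0.7 -j DROP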

Resource Exhaustion

Consume resources until limits are hit:

CPU saturation: What happens when CPU is fully utilized?

Memory pressure: How does the system behave under memory constraints?

Disk space: Fill the disk. Does the application handle disk-full gracefully?

Connection exhaustion: Consume all database or HTTP connections.
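
Memory pressure, like the CPU example later in this post, can be generated with the stress tool (a hedged sketch; size the allocation to the host):

# Two workers, each repeatedly allocating and touching 1 GiB, for 60 seconds
stress --vm 2 --vm-bytes 1G --timeout 60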

Dependency Failures

Services depend on other services. What happens when dependencies fail?

Dependency unavailable: Make a dependency unreachable.

Dependency slow: Make a dependency respond slowly.

Dependency errors: Make a dependency return error responses.
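
A low-tech way to simulate an unavailable dependency from a single host is a firewall rule (a hedged sketch; the address and port are placeholders, run as root):

# Block outbound traffic to the dependency
iptables -A OUTPUT -p tcp -d 10.0.0.5 --dport 443 -j DROP

# Restore connectivity when the experiment ends
iptables -D OUTPUT -p tcp -d 10.0.0.5 --dport 443 -j DROP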

Clock Skew

Time-related bugs are insidious:

Clock drift: Make system clocks diverge.

Clock jump: Jump clocks forward or backward.
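
On a single Linux host with systemd and GNU date, a clock jump can be simulated directly (a hedged sketch; disable time sync first or NTP will immediately correct it):

# Stop automatic time synchronization
timedatectl set-ntp false

# Jump the clock two hours into the future
date -s "$(date -d '+2 hours')"

# Re-enable synchronization to restore the correct time
timedatectl set-ntp true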

Configuration Errors

Misconfiguration is a common failure cause:

Invalid configuration: Inject bad config values.

Missing configuration: Remove required configuration.
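
One way to practice both is to break a single value, observe, and restore (a hedged sketch; the file path, key, and unit name are placeholders):

# Back up, inject an invalid value, and restart
cp /etc/myapp/config.yml /etc/myapp/config.yml.bak
sed -i 's/^timeout_ms: .*/timeout_ms: not-a-number/' /etc/myapp/config.yml
systemctl restart myapp

# Does the service fail fast with a clear error, or start and misbehave later?

# Restore the original configuration
mv /etc/myapp/config.yml.bak /etc/myapp/config.yml
systemctl restart myapp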

Tools and Techniques

Manual Injection

Start with simple manual techniques:

# Kill a process
kill -9 $PID

# Add network latency
tc qdisc add dev eth0 root netem delay 100ms

# Fill disk
dd if=/dev/zero of=/tmp/fill bs=1M count=10000

# Consume CPU
stress --cpu 4 --timeout 60
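
Whatever you inject, know the corresponding cleanup commands before you start, for example:

# Remove the injected latency and reclaim the disk space
tc qdisc del dev eth0 root netem
rm /tmp/fill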

Manual injection gives full control and requires no tooling investment.

Chaos Monkey and Simians

Netflix’s Simian Army includes tools for various failure types. Open source versions are available.

Chaos Monkey: Randomly terminates instances.

Latency Monkey: Injects latency into RESTful services.

Chaos Kong: Simulates region failures.

Gremlin

A commercial chaos engineering platform that packages common failure injections behind a managed interface.

Gremlin reduces the tooling investment required to start.

Chaos Toolkit

Open source chaos engineering framework, extensible with plugins for various systems (Kubernetes, AWS, etc.). Experiments are declared in JSON:

{
  "title": "Database failure experiment",
  "description": "Verify graceful degradation when database is unavailable",
  "steady-state-hypothesis": {
    "title": "Application responds to requests",
    "probes": [
      {
        "type": "probe",
        "name": "app-responds",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://app/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "stop-database",
      "provider": {
        "type": "process",
        "path": "docker",
        "arguments": ["stop", "database"]
      }
    }
  ]
}
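
Assuming the CLI is installed (for example via pip install chaostoolkit), an experiment file like the one above is executed with:

chaos run experiment.json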

Litmus (Kubernetes)

Chaos engineering for Kubernetes. Experiments are defined as custom resources, for example a ChaosEngine that deletes pods:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
spec:
  appinfo:
    appns: default
    applabel: "app=myapp"
  experiments:
    - name: pod-delete
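
Assuming the Litmus operator and the pod-delete experiment definition are already installed in the cluster, the engine is applied like any other resource (the filename is a placeholder):

kubectl apply -f pod-delete-experiment.yaml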

Safety Practices

Chaos engineering involves deliberately breaking things. Safety is essential.

Blast Radius Control

Limit the scope of experiments: target a single instance or a small fraction of traffic rather than the whole fleet, and prefer non-critical services until confidence grows.

Rollback Capability

Ensure you can undo experiments quickly: know the exact commands that stop each injection before you start, and abort immediately if impact exceeds expectations.

Communication

Inform relevant parties: on-call engineers, support teams, and anyone watching dashboards should know an experiment is running so planned failures aren’t mistaken for real incidents.

Progressive Expansion

Grow experiment scope gradually:

  1. Staging environment
  2. Production, single instance
  3. Production, multiple instances
  4. Production, during higher traffic
  5. Production, automated/continuous

Each stage builds confidence for the next.

Building a Practice

Start with Learning

Initial experiments reveal unknowns: gaps in monitoring, documentation, and recovery procedures.

Learning is the goal, not proving the system is perfect.

Make It Routine

Regular chaos experiments build organizational muscle. Schedule them on a recurring cadence rather than running them as one-off events.

Routine practice makes failure response automatic.

Connect to Improvement

Experiments should drive improvements: record every gap found, fix it, and re-run the experiment to confirm the fix holds.

Chaos engineering without follow-through is just chaos.

Key Takeaways

  1. Chaos engineering is hypothesis-driven experimentation about resilience, not random destruction.
  2. Start small: tabletop exercises, controlled failovers, and single-service experiments in staging.
  3. Control the blast radius, have rollback ready, and communicate before every experiment.
  4. Expand scope gradually and turn every finding into a concrete improvement.