Netflix's Chaos Monkey randomly kills production instances. This sounds terrifying—and for unprepared organizations, it would be. But controlled failure injection reveals weaknesses before real failures do.
You don’t need Netflix’s scale or a dedicated chaos team to benefit. Start small, grow capability over time, and build confidence in your system’s resilience.
What Chaos Engineering Is
Chaos engineering is empirical testing of system behavior under adverse conditions. You hypothesize how the system should behave when something fails, inject that failure, and observe whether reality matches expectations.
The process:
- Define steady state: What does “working” look like? Identify the metrics that indicate health.
- Hypothesize: “If X fails, the system should continue working with Y degradation.”
- Inject failure: Actually cause X to fail.
- Observe: Did the system behave as hypothesized?
- Learn: What gaps exist? What needs improvement?
This isn’t random destruction—it’s scientific experimentation about resilience.
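As a rough, minimal sketch of that loop in shell: assume a hypothetical service with a health endpoint at http://app.internal/health and an instance running as the container payment-service-1, with the hypothesis that losing one instance keeps the success rate at or above 99 out of 100 requests.
# Steady state: count successful responses out of 100 health-check requests
check() {
  ok=0
  for i in $(seq 100); do
    code=$(curl -s -o /dev/null -w "%{http_code}" http://app.internal/health)
    [ "$code" = "200" ] && ok=$((ok + 1))
  done
  echo "$ok"
}
echo "Baseline successes: $(check)/100"
# Inject the failure (hypothesis: success rate stays >= 99/100)
docker stop payment-service-1
# Observe while the failure is active, then restore and compare the numbers
echo "Successes during failure: $(check)/100"
docker start payment-service-1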
Starting Simple
Game Days
Before automated failure injection, practice manual exercises:
Tabletop exercises: Gather the team, describe a failure scenario, walk through response without actually breaking anything.
“It’s 2 PM on Tuesday. The primary database server becomes unreachable. What happens? Who does what? How long until recovery?”
Tabletops reveal gaps in knowledge, documentation, and process without risk.
Controlled failovers: During low-traffic periods, with engineers standing by, trigger failover scenarios manually (a sketch follows this list):
- Promote database replica
- Redirect traffic to backup region
- Restart services to test recovery
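For example, a manual failover drill against a PostgreSQL replica might look roughly like the following; the hostnames, data directory, and the mechanism for repointing applications are placeholders for whatever your environment actually uses.
# 1. Confirm the replica is caught up before failing over
psql -h db-replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
# 2. Simulate losing the primary, then promote the replica
ssh db-primary "systemctl stop postgresql"
ssh db-replica "pg_ctl promote -D /var/lib/postgresql/data"
# 3. Repoint applications at the new primary (DNS, service discovery, or config change)
# 4. Verify application health checks and record the total time to recovery
Time each step; the gap between “replica promoted” and “applications healthy again” is usually where the surprises hide.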
These exercises build muscle memory for real incidents.
Single-Service Experiments
Start chaos experiments with a single, non-critical service:
- Choose a service with good observability
- Define steady state (request success rate, latency)
- Inject simple failure (restart, network delay)
- Observe impact on steady state
- Document findings
Graduate to more complex experiments as confidence grows.
Staging First
Run experiments in staging before production. Staging experiments are safer and still valuable—they reveal issues without customer impact.
Production experiments are the ultimate test, but staging experiments catch many problems.
Types of Failures to Inject
Process Failures
Kill processes and observe recovery:
- Application processes
- Supporting services (sidecars, agents)
- System processes (logging, monitoring)
What happens when the process dies? Does it restart automatically? How long does recovery take? Do health checks detect the failure?
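A minimal sketch of that check, assuming the process runs under a hypothetical systemd unit called myapp and exposes a health endpoint on localhost:8080:
# Kill the process abruptly and time how long until health checks pass again
start=$(date +%s)
pkill -9 -f myapp
until curl -sf -o /dev/null http://localhost:8080/health; do
  sleep 1
done
echo "Recovered in $(( $(date +%s) - start )) seconds"
# Confirm the supervisor actually restarted it rather than the old process lingering
systemctl status myapp --no-pager | grep Active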
Network Failures
Network problems are common in distributed systems:
Latency injection: Add delay to network calls. Does the caller handle slow responses gracefully?
Packet loss: Drop some packets. Does retry logic work?
DNS failure: Make DNS resolution fail. Does caching help? How does the system degrade?
Partition: Isolate components from each other. Does the system handle split-brain scenarios?
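On a Linux host, tc and iptables can produce all of these conditions; run as root, treat the interface name and addresses below as placeholders, and always undo the changes when the experiment ends.
# Packet loss: drop 5% of outgoing packets on eth0
tc qdisc add dev eth0 root netem loss 5%
# DNS failure: block outbound DNS queries
iptables -A OUTPUT -p udp --dport 53 -j DROP
# Partition: cut this host off from one specific peer
iptables -A OUTPUT -d 10.0.0.20 -j DROP
# Undo everything afterwards
tc qdisc del dev eth0 root netem
iptables -D OUTPUT -p udp --dport 53 -j DROP
iptables -D OUTPUT -d 10.0.0.20 -j DROP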
Resource Exhaustion
Consume resources until limits are hit:
CPU saturation: What happens when CPU is fully utilized?
Memory pressure: How does the system behave under memory constraints?
Disk space: Fill the disk. Does the application handle disk-full gracefully?
Connection exhaustion: Consume all database or HTTP connections.
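Two rough sketches: memory pressure via the stress tool, and database connection exhaustion by holding idle sessions open (db-host is a placeholder, and the session count should sit just below your configured connection limit).
# Memory pressure: two workers allocating 1 GB each for 60 seconds
stress --vm 2 --vm-bytes 1G --timeout 60
# Connection exhaustion: hold open many idle database sessions
for i in $(seq 1 95); do
  psql -h db-host -c "SELECT pg_sleep(300);" &
done
wait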
Dependency Failures
Services depend on other services. What happens when dependencies fail?
Dependency unavailable: Make a dependency unreachable.
Dependency slow: Delay responses from a dependency.
Dependency returning errors: Have a dependency return errors instead of successful responses.
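From the calling service's host, iptables gives a quick approximation of the first and last cases: DROP makes the dependency hang until timeouts fire, while REJECT makes connections fail immediately. The address and port are placeholders.
# Dependency unavailable: packets silently disappear, callers hit timeouts
iptables -A OUTPUT -p tcp -d 10.0.0.20 --dport 5432 -j DROP
# Fast failure instead: connections are reset immediately
iptables -A OUTPUT -p tcp -d 10.0.0.20 --dport 5432 -j REJECT --reject-with tcp-reset
# Remove whichever rule you added when the experiment ends
iptables -D OUTPUT -p tcp -d 10.0.0.20 --dport 5432 -j DROP
Slow dependencies can reuse the tc latency technique from the network section; for true application-level errors (a dependency returning HTTP 500s, say), a fault-injection proxy or a service mesh feature such as Istio's HTTP abort fault is a better fit.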
Clock Skew
Time-related bugs are insidious:
Clock drift: Let clocks on different hosts diverge gradually.
Clock jump: Move a clock abruptly forward or backward.
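On Linux, a clock jump can be produced by setting the system time directly, and per-process skew can be simulated with the faketime wrapper from the libfaketime package without touching the host clock; myapp and the offsets are illustrative.
# Clock jump: move the host clock two hours ahead (disable NTP sync first)
timedatectl set-ntp false
date -s "$(date -d '+2 hours')"
# Per-process skew: run one service five minutes in the future
faketime '+5m' ./myapp
# Restore normal timekeeping afterwards
timedatectl set-ntp true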
Configuration Errors
Misconfiguration is a common failure cause:
Invalid configuration: Inject bad config values.
Missing configuration: Remove required configuration.
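A low-tech sketch of both cases, assuming a service that reads a config file at /etc/myapp/config.yaml (a placeholder path): back the file up, break or remove a value, restart, and watch whether the service fails loudly, falls back to defaults, or limps along silently.
# Back up the real configuration first
cp /etc/myapp/config.yaml /etc/myapp/config.yaml.bak
# Invalid configuration: point the service at a port that does not exist
sed -i 's/port: 5432/port: 65535/' /etc/myapp/config.yaml
systemctl restart myapp
# Missing configuration: delete a required key entirely
sed -i '/api_key:/d' /etc/myapp/config.yaml
systemctl restart myapp
# Observe startup logs and health checks, then restore
cp /etc/myapp/config.yaml.bak /etc/myapp/config.yaml
systemctl restart myapp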
Tools and Techniques
Manual Injection
Start with simple manual techniques:
# Kill a process
kill -9 $PID
# Add network latency (undo with: tc qdisc del dev eth0 root netem)
tc qdisc add dev eth0 root netem delay 100ms
# Fill disk (clean up with: rm /tmp/fill)
dd if=/dev/zero of=/tmp/fill bs=1M count=10000
# Consume CPU
stress --cpu 4 --timeout 60
Manual injection gives full control and requires no tooling investment.
Chaos Monkey and the Simian Army
Netflix’s Simian Army includes tools for various failure types. Open source versions are available.
Chaos Monkey: Randomly terminates instances.
Latency Monkey: Injects latency into RESTful services.
Chaos Kong: Simulates region failures.
Gremlin
Commercial chaos engineering platform. Provides:
- Web UI for experiment definition
- Many attack types
- Safety controls
- Reporting and analysis
Gremlin reduces the tooling investment required to start.
Chaos Toolkit
Open source chaos engineering framework. Extensible with plugins for various systems (Kubernetes, AWS, etc.).
{
  "title": "Database failure experiment",
  "description": "Verify graceful degradation when database is unavailable",
  "steady-state-hypothesis": {
    "title": "Application responds to requests",
    "probes": [
      {
        "type": "probe",
        "name": "app-responds",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://app/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "stop-database",
      "provider": {
        "type": "process",
        "path": "docker",
        "arguments": ["stop", "database"]
      }
    }
  ]
}
Litmus (Kubernetes)
Chaos engineering for Kubernetes. Provides:
- Chaos experiments as Kubernetes resources
- Pod, node, and network experiments
- Integration with CI/CD pipelines
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
spec:
  appinfo:
    appns: default
    applabel: "app=myapp"
  experiments:
    - name: pod-delete
Safety Practices
Chaos engineering involves deliberately breaking things. Safety is essential.
Blast Radius Control
Limit the scope of experiments:
- Start with single instances, not entire services
- Run during low-traffic periods
- Have kill switches to stop experiments instantly
- Monitor closely during experiments
Rollback Capability
Ensure you can undo experiments quickly (a watchdog sketch follows this list):
- Automated rollback triggers
- Clear procedures for manual rollback
- Tested recovery processes
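An automated trigger can be as simple as a watchdog that polls the error rate during an experiment and rolls back when a threshold is crossed; the metrics endpoint, threshold, and the two scripts it calls are placeholders for your own kill switch and recovery procedure.
# Watchdog: roll back the experiment if the error rate exceeds 5%
THRESHOLD=5
while true; do
  error_rate=$(curl -s http://metrics.internal/error_rate_percent)
  if [ "${error_rate%.*}" -gt "$THRESHOLD" ]; then
    echo "Error rate ${error_rate}% exceeded ${THRESHOLD}% - rolling back"
    ./abort_experiment.sh   # kill switch: stops the failure injection
    ./restore_service.sh    # tested recovery procedure
    break
  fi
  sleep 10
done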
Communication
Inform relevant parties:
- Announce experiments to teams that might be affected
- Have incident response ready
- Document experiments and findings
Progressive Expansion
Grow experiment scope gradually:
- Staging environment
- Production, single instance
- Production, multiple instances
- Production, during higher traffic
- Production, automated/continuous
Each stage builds confidence for the next.
Building a Practice
Start with Learning
Initial experiments reveal unknowns:
- How does the system actually behave under failure?
- What assumptions are wrong?
- What gaps exist in monitoring, alerting, recovery?
Learning is the goal, not proving the system is perfect.
Make It Routine
Regular chaos experiments build organizational muscle:
- Weekly or monthly game days
- Automated experiments in CI/CD (a minimal pipeline sketch follows)
- Continuous chaos in production (when ready)
Routine practice makes failure response automatic.
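As referenced above, automating an experiment in CI can start as a single pipeline step that runs the Chaos Toolkit CLI against staging; chaos run exits non-zero when the steady-state hypothesis is violated, which fails the build. The experiment file path here is illustrative.
# CI step: install the CLI and run the experiment against staging
pip install chaostoolkit
chaos run experiments/database-failure.json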
Connect to Improvement
Experiments should drive improvements:
- Findings become tickets
- Tickets become fixes
- Fixed issues are re-tested
Chaos engineering without follow-through is just chaos.
Key Takeaways
- Chaos engineering is empirical testing of resilience, not random destruction
- Start with tabletop exercises and manual controlled failures
- Progress from staging to production, single instances to broader scope
- Inject failures at multiple levels: process, network, resources, dependencies
- Use available tools (manual, Chaos Monkey, Gremlin, Chaos Toolkit) based on needs
- Maintain safety through blast radius control, rollback capability, and communication
- Make chaos engineering routine and connect findings to concrete improvements