Netflix's Chaos Monkey randomly kills production instances. This sounds terrifying—and for unprepared organizations, it would be. But controlled failure injection reveals weaknesses before real failures do.
You don’t need Netflix’s scale or a dedicated chaos team to benefit. Start small, grow capability over time, and build confidence in your system’s resilience.
What Chaos Engineering Is
Chaos engineering is empirical testing of system behavior under adverse conditions. You hypothesize how the system should behave when something fails, inject that failure, and observe whether reality matches expectations.
The process:
- Define steady state: What does “working” look like? Identify the metrics that indicate health.
- Hypothesize: “If X fails, the system should continue working with Y degradation.”
- Inject failure: Actually cause X to fail.
- Observe: Did the system behave as hypothesized?
- Learn: What gaps exist? What needs improvement?
This isn’t random destruction—it’s scientific experimentation about resilience.
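As a rough, minimal sketch of that loop in shell: assume a hypothetical service with a health endpoint at http://app.internal/health and an instance running as the container payment-service-1, with the hypothesis that losing one instance keeps the success rate at or above 99 out of 100 requests.
# Steady state: count successful responses out of 100 health-check requests
check() {
  ok=0
  for i in $(seq 100); do
    code=$(curl -s -o /dev/null -w "%{http_code}" http://app.internal/health)
    [ "$code" = "200" ] && ok=$((ok + 1))
  done
  echo "$ok"
}
echo "Baseline successes: $(check)/100"
# Inject the failure (hypothesis: success rate stays >= 99/100)
docker stop payment-service-1
# Observe while the failure is active, then restore and compare the numbers
echo "Successes during failure: $(check)/100"
docker start payment-service-1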
Starting Simple
Game Days
Before automated failure injection, practice manual exercises:
Tabletop exercises: Gather the team, describe a failure scenario, walk through response without actually breaking anything.
“It’s 2 PM on Tuesday. The primary database server becomes unreachable. What happens? Who does what? How long until recovery?”
Tabletops reveal gaps in knowledge, documentation, and process without risk.
Controlled failovers: During low-traffic periods, with engineers standing by, trigger failover scenarios manually (a sketch follows this list):
- Promote database replica
- Redirect traffic to backup region
- Restart services to test recovery
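For example, a manual failover drill against a PostgreSQL replica might look roughly like the following; the hostnames, data directory, and the mechanism for repointing applications are placeholders for whatever your environment actually uses.
# 1. Confirm the replica is caught up before failing over
psql -h db-replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
# 2. Simulate losing the primary, then promote the replica
ssh db-primary "systemctl stop postgresql"
ssh db-replica "pg_ctl promote -D /var/lib/postgresql/data"
# 3. Repoint applications at the new primary (DNS, service discovery, or config change)
# 4. Verify application health checks and record the total time to recovery
Time each step; the gap between “replica promoted” and “applications healthy again” is usually where the surprises hide.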
These exercises build muscle memory for real incidents.
Single-Service Experiments
Start chaos experiments with a single, non-critical service:
- Choose a service with good observability
- Define steady state (request success rate, latency)
- Inject simple failure (restart, network delay)
- Observe impact on steady state
- Document findings
Graduate to more complex experiments as confidence grows.
Staging First
Run experiments in staging before production. Staging experiments are safer and still valuable—they reveal issues without customer impact.
Production experiments are the ultimate test, but staging experiments catch many problems.
Types of Failures to Inject
Process Failures
Kill processes and observe recovery:
- Application processes
- Supporting services (sidecars, agents)
- System processes (logging, monitoring)
What happens when the process dies? Does it restart automatically? How long does recovery take? Do health checks detect the failure?
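A minimal sketch of that check, assuming the process runs under a hypothetical systemd unit called myapp and exposes a health endpoint on localhost:8080:
# Kill the process abruptly and time how long until health checks pass again
start=$(date +%s)
pkill -9 -f myapp
until curl -sf -o /dev/null http://localhost:8080/health; do
  sleep 1
done
echo "Recovered in $(( $(date +%s) - start )) seconds"
# Confirm the supervisor actually restarted it rather than the old process lingering
systemctl status myapp --no-pager | grep Active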
Network Failures
Network problems are common in distributed systems:
Latency injection: Add delay to network calls. Does the caller handle slow responses gracefully?
Packet loss: Drop some packets. Does retry logic work?
DNS failure: Make DNS resolution fail. Does caching help? How does the system degrade?
Partition: Isolate components from each other. Does the system handle split-brain scenarios?
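On a Linux host, tc and iptables can produce all of these conditions; run as root, treat the interface name and addresses below as placeholders, and always undo the changes when the experiment ends.
# Packet loss: drop 5% of outgoing packets on eth0
tc qdisc add dev eth0 root netem loss 5%
# DNS failure: block outbound DNS queries
iptables -A OUTPUT -p udp --dport 53 -j DROP
# Partition: cut this host off from one specific peer
iptables -A OUTPUT -d 10.0.0.20 -j DROP
# Undo everything afterwards
tc qdisc del dev eth0 root netem
iptables -D OUTPUT -p udp --dport 53 -j DROP
iptables -D OUTPUT -d 10.0.0.20 -j DROP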
Resource Exhaustion
Consume resources until limits are hit:
CPU saturation: What happens when CPU is fully utilized?
Memory pressure: How does the system behave under memory constraints?
Disk space: Fill the disk. Does the application handle disk-full gracefully?
Connection exhaustion: Consume all database or HTTP connections.
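Two rough sketches: memory pressure via the stress tool, and database connection exhaustion by holding idle sessions open (db-host is a placeholder, and the session count should sit just below your configured connection limit).
# Memory pressure: two workers allocating 1 GB each for 60 seconds
stress --vm 2 --vm-bytes 1G --timeout 60
# Connection exhaustion: hold open many idle database sessions
for i in $(seq 1 95); do
  psql -h db-host -c "SELECT pg_sleep(300);" &
done
wait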
Dependency Failures
Services depend on other services. What happens when dependencies fail?
Dependency unavailable: Make a dependency unreachable.
Dependency slow: Delay responses from a dependency.
Dependency returning errors: Have a dependency return errors instead of successful responses.
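From the calling service's host, iptables gives a quick approximation of the first and last cases: DROP makes the dependency hang until timeouts fire, while REJECT makes connections fail immediately. The address and port are placeholders.
# Dependency unavailable: packets silently disappear, callers hit timeouts
iptables -A OUTPUT -p tcp -d 10.0.0.20 --dport 5432 -j DROP
# Fast failure instead: connections are reset immediately
iptables -A OUTPUT -p tcp -d 10.0.0.20 --dport 5432 -j REJECT --reject-with tcp-reset
# Remove whichever rule you added when the experiment ends
iptables -D OUTPUT -p tcp -d 10.0.0.20 --dport 5432 -j DROP
Slow dependencies can reuse the tc latency technique from the network section; for true application-level errors (a dependency returning HTTP 500s, say), a fault-injection proxy or a service mesh feature such as Istio's HTTP abort fault is a better fit.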
Clock Skew
Time-related bugs are insidious:
Clock drift: Let clocks on different hosts diverge gradually.
Clock jump: Move a clock abruptly forward or backward.
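On Linux, a clock jump can be produced by setting the system time directly, and per-process skew can be simulated with the faketime wrapper from the libfaketime package without touching the host clock; myapp and the offsets are illustrative.
# Clock jump: move the host clock two hours ahead (disable NTP sync first)
timedatectl set-ntp false
date -s "$(date -d '+2 hours')"
# Per-process skew: run one service five minutes in the future
faketime '+5m' ./myapp
# Restore normal timekeeping afterwards
timedatectl set-ntp true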
Configuration Errors
Misconfiguration is a common failure cause:
Invalid configuration: Inject bad config values.
Missing configuration: Remove required configuration.
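A low-tech sketch of both cases, assuming a service that reads a config file at /etc/myapp/config.yaml (a placeholder path): back the file up, break or remove a value, restart, and watch whether the service fails loudly, falls back to defaults, or limps along silently.
# Back up the real configuration first
cp /etc/myapp/config.yaml /etc/myapp/config.yaml.bak
# Invalid configuration: point the service at a port that does not exist
sed -i 's/port: 5432/port: 65535/' /etc/myapp/config.yaml
systemctl restart myapp
# Missing configuration: delete a required key entirely
sed -i '/api_key:/d' /etc/myapp/config.yaml
systemctl restart myapp
# Observe startup logs and health checks, then restore
cp /etc/myapp/config.yaml.bak /etc/myapp/config.yaml
systemctl restart myapp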
Tools and Techniques
Manual Injection
Start with simple manual techniques:
# Kill a process
kill -9 $PID
# Add network latency (undo with: tc qdisc del dev eth0 root netem)
tc qdisc add dev eth0 root netem delay 100ms
# Fill disk (clean up with: rm /tmp/fill)
dd if=/dev/zero of=/tmp/fill bs=1M count=10000
# Consume CPU
stress --cpu 4 --timeout 60
Manual injection gives full control and requires no tooling investment.
Chaos Monkey and the Simian Army
Netflix’s Simian Army includes tools for various failure types. Open source versions are available.
Chaos Monkey: Randomly terminates instances.
Latency Monkey: Injects latency into RESTful services.
Chaos Kong: Simulates region failures.
Gremlin
Commercial chaos engineering platform. Provides:
- Web UI for experiment definition
- Many attack types
- Safety controls
- Reporting and analysis
Gremlin reduces the tooling investment required to start.
Chaos Toolkit
Open source chaos engineering framework. Extensible with plugins for various systems (Kubernetes, AWS, etc.).
{
  "title": "Database failure experiment",
  "description": "Verify graceful degradation when database is unavailable",
  "steady-state-hypothesis": {
    "title": "Application responds to requests",
    "probes": [
      {
        "type": "probe",
        "name": "app-responds",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://app/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "stop-database",
      "provider": {
        "type": "process",
        "path": "docker",
        "arguments": ["stop", "database"]
      }
    }
  ]
}
Litmus (Kubernetes)
Chaos engineering for Kubernetes. Provides:
- Chaos experiments as Kubernetes resources
- Pod, node, and network experiments
- Integration with CI/CD pipelines
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-experiment
spec:
  appinfo:
    appns: default
    applabel: "app=myapp"
  experiments:
    - name: pod-delete
Safety Practices
Chaos engineering involves deliberately breaking things. Safety is essential.
Blast Radius Control
Limit the scope of experiments:
- Start with single instances, not entire services
- Run during low-traffic periods
- Have kill switches to stop experiments instantly
- Monitor closely during experiments
Rollback Capability
Ensure you can undo experiments quickly (a watchdog sketch follows this list):
- Automated rollback triggers
- Clear procedures for manual rollback
- Tested recovery processes
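An automated trigger can be as simple as a watchdog that polls the error rate during an experiment and rolls back when a threshold is crossed; the metrics endpoint, threshold, and the two scripts it calls are placeholders for your own kill switch and recovery procedure.
# Watchdog: roll back the experiment if the error rate exceeds 5%
THRESHOLD=5
while true; do
  error_rate=$(curl -s http://metrics.internal/error_rate_percent)
  if [ "${error_rate%.*}" -gt "$THRESHOLD" ]; then
    echo "Error rate ${error_rate}% exceeded ${THRESHOLD}% - rolling back"
    ./abort_experiment.sh   # kill switch: stops the failure injection
    ./restore_service.sh    # tested recovery procedure
    break
  fi
  sleep 10
done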
Communication
Inform relevant parties:
- Announce experiments to teams that might be affected
- Have incident response ready
- Document experiments and findings
Progressive Expansion
Grow experiment scope gradually:
- Staging environment
- Production, single instance
- Production, multiple instances
- Production, during higher traffic
- Production, automated/continuous
Each stage builds confidence for the next.
Building a Practice
Start with Learning
Initial experiments reveal unknowns:
- How does the system actually behave under failure?
- What assumptions are wrong?
- What gaps exist in monitoring, alerting, recovery?
Learning is the goal, not proving the system is perfect.
Make It Routine
Regular chaos experiments build organizational muscle:
- Weekly or monthly game days
- Automated experiments in CI/CD (a minimal pipeline sketch follows)
- Continuous chaos in production (when ready)
Routine practice makes failure response automatic.
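As referenced above, automating an experiment in CI can start as a single pipeline step that runs the Chaos Toolkit CLI against staging; chaos run exits non-zero when the steady-state hypothesis is violated, which fails the build. The experiment file path here is illustrative.
# CI step: install the CLI and run the experiment against staging
pip install chaostoolkit
chaos run experiments/database-failure.json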
Connect to Improvement
Experiments should drive improvements:
- Findings become tickets
- Tickets become fixes
- Fixed issues are re-tested
Chaos engineering without follow-through is just chaos.
Key Takeaways
- Chaos engineering is empirical testing of resilience, not random destruction
- Start with tabletop exercises and manual controlled failures
- Progress from staging to production, single instances to broader scope
- Inject failures at multiple levels: process, network, resources, dependencies
- Use available tools (manual, Chaos Monkey, Gremlin, Chaos Toolkit) based on needs
- Maintain safety through blast radius control, rollback capability, and communication
- Make chaos engineering routine and connect findings to concrete improvements