Distributed systems introduce failure modes that don’t exist in monolithic applications. Networks partition. Services become unavailable. Latency spikes unpredictably. Messages arrive out of order or not at all.
You can’t prevent these failures, but you can design systems that handle them gracefully.
Embrace Failure
The Fallacies of Distributed Computing
These assumptions are false but commonly made:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Design for the reality: networks fail, latency varies, and things break.
Failure Is Normal
In large systems, something is always failing:
- With 1,000 servers at 99.9% uptime each, expect roughly one to be down at any given moment (1000 × 0.001 = 1)
- Network switches fail
- Disks fail
- Processes crash
- Dependencies become unavailable
Design for continuous partial failure, not rare complete failure.
Patterns for Reliability
Timeouts
Always set timeouts. Never wait forever:
# Bad - waits forever
response = requests.get(url)
# Good - explicit timeout
response = requests.get(url, timeout=5)
Timeout strategy:
- Connection timeout: How long to establish connection (short, ~1s)
- Read timeout: How long to wait for response (depends on operation)
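With the requests library, the two can be set separately as a (connect, read) tuple; the values below are illustrative, not prescriptive:
import requests

# ~1s to establish the connection, up to 10s to wait for the response (url as above)
response = requests.get(url, timeout=(1, 10))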
Retries with Backoff
Transient failures often resolve with retry:
import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except RetryableError:  # application-defined transient-failure exception
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus jitter: roughly 1-2s, 2-3s, 4-5s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
Key principles:
- Exponential backoff prevents thundering herd
- Jitter spreads out retry attempts
- Retry only idempotent operations
- Set maximum retry attempts
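For example, wrapping an idempotent read in the helper above (fetch_user and user_id are hypothetical):
# Idempotent read wrapped in the retry helper; fetch_user is hypothetical
user = retry_with_backoff(lambda: fetch_user(user_id), max_retries=5, base_delay=0.5)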
Circuit Breakers
Stop calling failing services:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = 'closed'
        self.last_failure_time = None

    def call(self, func):
        if self.state == 'open':
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'half-open'  # allow a test call through
            else:
                raise CircuitOpenError()
        try:
            result = func()
            if self.state == 'half-open':
                self.state = 'closed'  # test call succeeded; resume normal operation
                self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = 'open'
            raise
States:
- Closed: Normal operation, tracking failures
- Open: Failing fast, not calling service
- Half-open: Allowing test calls to check recovery
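A minimal sketch of putting the breaker in front of a dependency call (inventory_client.get_stock is a hypothetical client method):
inventory_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def get_stock(item_id):
    try:
        # All calls to the inventory service go through the breaker
        return inventory_breaker.call(lambda: inventory_client.get_stock(item_id))
    except CircuitOpenError:
        # Fail fast while the breaker is open instead of piling on load
        return None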
Bulkheads
Isolate failures to prevent cascade:
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different dependencies
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=10)
shipping_pool = ThreadPoolExecutor(max_workers=5)

# Payment service slowdown doesn't exhaust resources for others
payment_future = payment_pool.submit(call_payment_service)
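To keep the isolation effective, callers should also bound how long they wait on a pool. A sketch, assuming a hypothetical PaymentUnavailable error type:
from concurrent.futures import TimeoutError as FutureTimeout

def checkout():
    payment_future = payment_pool.submit(call_payment_service)
    try:
        # Bound the wait so a slow payment service can't stall the caller
        return payment_future.result(timeout=2)  # illustrative timeout
    except FutureTimeout:
        raise PaymentUnavailable()  # hypothetical application-level error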
Bulkhead patterns:
- Separate thread pools
- Separate connection pools
- Separate instances
- Microservice isolation
Fallbacks
Provide degraded functionality when dependencies fail:
def get_recommendations(user_id):
    try:
        return recommendation_service.get(user_id)
    except ServiceUnavailable:
        # Fallback to popular items
        return get_popular_items()
Fallback strategies:
- Cached data (stale but available)
- Default values
- Degraded functionality
- Alternative service
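A sketch of the stale-cache variant, assuming a cache client with get/set like the one used in the idempotency example below:
def get_recommendations(user_id):
    try:
        recs = recommendation_service.get(user_id)
        cache.set(f'recs:{user_id}', recs, ttl=3600)  # refresh cache on success
        return recs
    except ServiceUnavailable:
        stale = cache.get(f'recs:{user_id}')  # stale but available
        return stale if stale is not None else get_popular_items()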
Idempotency
Make operations safe to retry:
def create_order(order_id, items):
    # Check if already processed
    if order_exists(order_id):
        return get_order(order_id)
    # Process order
    order = Order(order_id, items)
    order.save()
    return order
Idempotency key pattern:
@app.route('/orders', methods=['POST'])
def create_order():
    idempotency_key = request.headers.get('Idempotency-Key')
    # Check cache for previous result
    cached = cache.get(f'idempotency:{idempotency_key}')
    if cached:
        return cached
    # Process order
    result = process_order(request.json)
    # Cache result
    cache.set(f'idempotency:{idempotency_key}', result, ttl=86400)
    return result
Consistency Patterns
Eventual Consistency
Accept that data won’t be immediately consistent:
Write to Primary → Replicate to Replicas → Eventually Consistent
Design for it:
- Tolerate stale reads
- Conflict resolution strategies
- Reconciliation processes
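As one example of a conflict resolution strategy, a last-write-wins merge keeps whichever replica version carries the newer timestamp (a deliberate simplification: it drops the older write and assumes reasonably synchronized clocks; many systems use version vectors instead):
def resolve_conflict(version_a, version_b):
    # Last-write-wins: keep the version that was written later
    return version_a if version_a.timestamp >= version_b.timestamp else version_b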
Saga Pattern
Coordinate distributed transactions without distributed locks:
class OrderSaga:
    def execute(self):
        try:
            # Step 1: Reserve inventory
            reservation_id = inventory_service.reserve(self.items)
            # Step 2: Process payment
            payment_id = payment_service.charge(self.customer, self.total)
            # Step 3: Confirm order
            order_id = order_service.create(self.order)
            return order_id
        except PaymentFailed:
            # Compensate: release inventory
            inventory_service.release(reservation_id)
            raise
        except OrderCreationFailed:
            # Compensate: refund and release
            payment_service.refund(payment_id)
            inventory_service.release(reservation_id)
            raise
Each step has a compensating action for rollback.
Event Sourcing for Reliability
Store events, derive state:
# Instead of updating state
user.balance = user.balance + amount

# Store event
events.append(CreditApplied(user_id, amount, timestamp))

# Derive state from events
def get_balance(user_id):
    balance = 0
    for event in get_events(user_id):
        if isinstance(event, CreditApplied):
            balance += event.amount
        elif isinstance(event, DebitApplied):
            balance -= event.amount
    return balance
Events are immutable facts, which makes state easier to recover and audit.
Operational Reliability
Health Checks
Deep health checks that verify dependencies:
@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }
    healthy = all(checks.values())
    status_code = 200 if healthy else 503
    return jsonify({
        'status': 'healthy' if healthy else 'unhealthy',
        'checks': checks
    }), status_code
Health check types:
- Liveness: Is the process alive?
- Readiness: Can it handle traffic?
- Deep: Are dependencies healthy?
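A sketch of exposing liveness and readiness separately, so an orchestrator can restart a dead process without also pulling a briefly degraded one out of rotation (the check_* helpers are the same hypothetical ones used above):
@app.route('/health/live')
def liveness():
    # The process is up and can serve a trivial request
    return jsonify({'status': 'alive'}), 200

@app.route('/health/ready')
def readiness():
    # Only advertise readiness when critical dependencies respond
    ready = check_database() and check_cache()
    status = 200 if ready else 503
    return jsonify({'status': 'ready' if ready else 'not ready'}), status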
Graceful Degradation
Handle overload gracefully:
@app.route('/api/search')
def search():
    if rate_limiter.is_overloaded():
        # Return cached/default results
        return get_cached_results()
    if request_queue.is_full():
        # Shed load
        return jsonify({'error': 'overloaded'}), 503
    return perform_search(request.args)
Load Shedding
Drop requests when overwhelmed rather than failing completely:
from queue import Full, Queue

class LoadShedder:
    def __init__(self, max_queue_size=100):
        self.queue = Queue(max_queue_size)

    def process(self, request):
        try:
            self.queue.put_nowait(request)
        except Full:
            # Shed load - reject immediately
            raise ServiceOverloaded()
Better to reject some requests than fail all.
Backpressure
Slow down producers when consumers are overwhelmed:
# Producer respects backpressure; the queue must be bounded, e.g. asyncio.Queue(maxsize=100)
async def produce(queue):
    while True:
        item = await get_item()
        await queue.put(item)  # Blocks when the bounded queue is full

# Consumer processes at its own pace
async def consume(queue):
    while True:
        item = await queue.get()
        await process(item)
Testing Reliability
Chaos Engineering
Intentionally inject failures:
import random

# Chaos middleware (WSGI): fail a small fraction of requests
class ChaosMiddleware:
    def __init__(self, app, failure_rate=0.01):
        self.app = app
        self.failure_rate = failure_rate

    def __call__(self, environ, start_response):
        if random.random() < self.failure_rate:
            start_response('500 Internal Server Error', [])
            return [b'Chaos!']
        return self.app(environ, start_response)
Chaos experiments:
- Kill instances randomly
- Inject latency
- Partition networks
- Exhaust resources
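Latency injection fits the same middleware pattern; the rate and delay below are illustrative:
import random
import time

class LatencyChaosMiddleware:
    def __init__(self, app, inject_rate=0.05, max_delay_seconds=2.0):
        self.app = app
        self.inject_rate = inject_rate
        self.max_delay_seconds = max_delay_seconds

    def __call__(self, environ, start_response):
        if random.random() < self.inject_rate:
            # Delay a small fraction of requests to surface timeout bugs
            time.sleep(random.uniform(0, self.max_delay_seconds))
        return self.app(environ, start_response)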
Failure Mode Testing
Test specific failure scenarios:
def test_database_failure():
    with mock.patch('db.query', side_effect=DatabaseError):
        response = client.get('/users/123')
        assert response.status_code == 503
        assert 'fallback' in response.json

def test_timeout():
    with mock.patch('service.call', side_effect=Timeout):
        response = client.get('/data')
        assert response.status_code == 504
Load Testing
Test behavior under stress:
- Normal load
- Peak load
- Overload
- Spike patterns
Measure degradation, not just success/failure.
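A minimal sketch of measuring degradation rather than a pass/fail count, using a thread pool to generate concurrent requests (the URL, concurrency, and totals are illustrative):
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def timed_request(url):
    start = time.monotonic()
    try:
        ok = requests.get(url, timeout=5).status_code < 500
    except requests.RequestException:
        ok = False
    return ok, time.monotonic() - start

def load_test(url='http://localhost:8000/api/search', concurrency=50, total=500):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_request, [url] * total))
    latencies = sorted(latency for _, latency in results)
    p95 = latencies[int(len(latencies) * 0.95)]
    error_rate = 1 - sum(ok for ok, _ in results) / len(results)
    print(f'p95={p95:.3f}s median={statistics.median(latencies):.3f}s errors={error_rate:.1%}')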
Key Takeaways
- Embrace failure as normal; design for continuous partial failure
- Always use timeouts; never wait forever
- Implement retries with exponential backoff and jitter
- Use circuit breakers to fail fast when dependencies are down
- Isolate failures with bulkheads (separate pools, services)
- Provide fallbacks for degraded operation
- Make operations idempotent for safe retries
- Use sagas for distributed transactions; each step has compensating action
- Implement deep health checks that verify dependencies
- Shed load gracefully; rejecting some requests is better than failing all
- Test failure modes explicitly; use chaos engineering
Reliable distributed systems aren’t accident-proof—they’re accident-tolerant. Design for graceful degradation, not perfect operation.