Distributed systems introduce failure modes that don’t exist in monolithic applications. Networks partition. Services become unavailable. Latency spikes unpredictably. Messages arrive out of order or not at all.
You can’t prevent these failures, but you can design systems that handle them gracefully.
Embrace Failure
The Fallacies of Distributed Computing
These assumptions are false but commonly made:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Design for the reality: networks fail, latency varies, and things break.
Failure Is Normal
In large systems, something is always failing:
- With 1,000 servers at 99.9% uptime each, expect roughly one to be down at any given moment (1000 × 0.001 = 1)
- Network switches fail
- Disks fail
- Processes crash
- Dependencies become unavailable
Design for continuous partial failure, not rare complete failure.
Patterns for Reliability
Timeouts
Always set timeouts. Never wait forever:
# Bad - waits forever
response = requests.get(url)
# Good - explicit timeout
response = requests.get(url, timeout=5)
Timeout strategy:
- Connection timeout: How long to establish connection (short, ~1s)
- Read timeout: How long to wait for response (depends on operation)
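With the requests library, the two can be set separately as a (connect, read) tuple; the values below are illustrative, not prescriptive:
import requests

# ~1s to establish the connection, up to 10s to wait for the response (url as above)
response = requests.get(url, timeout=(1, 10))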
Retries with Backoff
Transient failures often resolve with retry:
import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except RetryableError:  # application-defined transient-failure exception
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus jitter: roughly 1-2s, 2-3s, 4-5s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
Key principles:
- Exponential backoff prevents thundering herd
- Jitter spreads out retry attempts
- Retry only idempotent operations
- Set maximum retry attempts
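For example, wrapping an idempotent read in the helper above (fetch_user and user_id are hypothetical):
# Idempotent read wrapped in the retry helper; fetch_user is hypothetical
user = retry_with_backoff(lambda: fetch_user(user_id), max_retries=5, base_delay=0.5)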
Circuit Breakers
Stop calling failing services:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = 'closed'
        self.last_failure_time = None

    def call(self, func):
        if self.state == 'open':
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'half-open'  # allow a test call through
            else:
                raise CircuitOpenError()
        try:
            result = func()
            if self.state == 'half-open':
                self.state = 'closed'  # test call succeeded; resume normal operation
                self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = 'open'
            raise
States:
- Closed: Normal operation, tracking failures
- Open: Failing fast, not calling service
- Half-open: Allowing test calls to check recovery
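A minimal sketch of putting the breaker in front of a dependency call (inventory_client.get_stock is a hypothetical client method):
inventory_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def get_stock(item_id):
    try:
        # All calls to the inventory service go through the breaker
        return inventory_breaker.call(lambda: inventory_client.get_stock(item_id))
    except CircuitOpenError:
        # Fail fast while the breaker is open instead of piling on load
        return None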
Bulkheads
Isolate failures to prevent cascade:
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different dependencies
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=10)
shipping_pool = ThreadPoolExecutor(max_workers=5)

# Payment service slowdown doesn't exhaust resources for others
payment_future = payment_pool.submit(call_payment_service)
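To keep the isolation effective, callers should also bound how long they wait on a pool. A sketch, assuming a hypothetical PaymentUnavailable error type:
from concurrent.futures import TimeoutError as FutureTimeout

def checkout():
    payment_future = payment_pool.submit(call_payment_service)
    try:
        # Bound the wait so a slow payment service can't stall the caller
        return payment_future.result(timeout=2)  # illustrative timeout
    except FutureTimeout:
        raise PaymentUnavailable()  # hypothetical application-level error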
Bulkhead patterns:
- Separate thread pools
- Separate connection pools
- Separate instances
- Microservice isolation
Fallbacks
Provide degraded functionality when dependencies fail:
def get_recommendations(user_id):
    try:
        return recommendation_service.get(user_id)
    except ServiceUnavailable:
        # Fallback to popular items
        return get_popular_items()
Fallback strategies:
- Cached data (stale but available)
- Default values
- Degraded functionality
- Alternative service
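A sketch of the stale-cache variant, assuming a cache client with get/set like the one used in the idempotency example below:
def get_recommendations(user_id):
    try:
        recs = recommendation_service.get(user_id)
        cache.set(f'recs:{user_id}', recs, ttl=3600)  # refresh cache on success
        return recs
    except ServiceUnavailable:
        stale = cache.get(f'recs:{user_id}')  # stale but available
        return stale if stale is not None else get_popular_items()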
Idempotency
Make operations safe to retry:
def create_order(order_id, items):
    # Check if already processed
    if order_exists(order_id):
        return get_order(order_id)
    # Process order
    order = Order(order_id, items)
    order.save()
    return order
Idempotency key pattern:
@app.route('/orders', methods=['POST'])
def create_order():
    idempotency_key = request.headers.get('Idempotency-Key')
    # Check cache for previous result
    cached = cache.get(f'idempotency:{idempotency_key}')
    if cached:
        return cached
    # Process order
    result = process_order(request.json)
    # Cache result
    cache.set(f'idempotency:{idempotency_key}', result, ttl=86400)
    return result
Consistency Patterns
Eventual Consistency
Accept that data won’t be immediately consistent:
Write to Primary → Replicate to Replicas → Eventually Consistent
Design for it:
- Tolerate stale reads
- Conflict resolution strategies
- Reconciliation processes
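As one example of a conflict resolution strategy, a last-write-wins merge keeps whichever replica version carries the newer timestamp (a deliberate simplification: it drops the older write and assumes reasonably synchronized clocks; many systems use version vectors instead):
def resolve_conflict(version_a, version_b):
    # Last-write-wins: keep the version that was written later
    return version_a if version_a.timestamp >= version_b.timestamp else version_b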
Saga Pattern
Coordinate distributed transactions without distributed locks:
class OrderSaga:
    def execute(self):
        try:
            # Step 1: Reserve inventory
            reservation_id = inventory_service.reserve(self.items)
            # Step 2: Process payment
            payment_id = payment_service.charge(self.customer, self.total)
            # Step 3: Confirm order
            order_id = order_service.create(self.order)
            return order_id
        except PaymentFailed:
            # Compensate: release inventory
            inventory_service.release(reservation_id)
            raise
        except OrderCreationFailed:
            # Compensate: refund and release
            payment_service.refund(payment_id)
            inventory_service.release(reservation_id)
            raise
Each step has a compensating action for rollback.
Event Sourcing for Reliability
Store events, derive state:
# Instead of updating state
user.balance = user.balance + amount

# Store event
events.append(CreditApplied(user_id, amount, timestamp))

# Derive state from events
def get_balance(user_id):
    balance = 0
    for event in get_events(user_id):
        if isinstance(event, CreditApplied):
            balance += event.amount
        elif isinstance(event, DebitApplied):
            balance -= event.amount
    return balance
Events are immutable facts, which makes state easier to recover and audit.
Operational Reliability
Health Checks
Deep health checks that verify dependencies:
@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }
    healthy = all(checks.values())
    status_code = 200 if healthy else 503
    return jsonify({
        'status': 'healthy' if healthy else 'unhealthy',
        'checks': checks
    }), status_code
Health check types:
- Liveness: Is the process alive?
- Readiness: Can it handle traffic?
- Deep: Are dependencies healthy?
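A sketch of exposing liveness and readiness separately, so an orchestrator can restart a dead process without also pulling a briefly degraded one out of rotation (the check_* helpers are the same hypothetical ones used above):
@app.route('/health/live')
def liveness():
    # The process is up and can serve a trivial request
    return jsonify({'status': 'alive'}), 200

@app.route('/health/ready')
def readiness():
    # Only advertise readiness when critical dependencies respond
    ready = check_database() and check_cache()
    status = 200 if ready else 503
    return jsonify({'status': 'ready' if ready else 'not ready'}), status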
Graceful Degradation
Handle overload gracefully:
@app.route('/api/search')
def search():
    if rate_limiter.is_overloaded():
        # Return cached/default results
        return get_cached_results()
    if request_queue.is_full():
        # Shed load
        return jsonify({'error': 'overloaded'}), 503
    return perform_search(request.args)
Load Shedding
Drop requests when overwhelmed rather than failing completely:
from queue import Full, Queue

class LoadShedder:
    def __init__(self, max_queue_size=100):
        self.queue = Queue(max_queue_size)

    def process(self, request):
        try:
            self.queue.put_nowait(request)
        except Full:
            # Shed load - reject immediately
            raise ServiceOverloaded()
Better to reject some requests than fail all.
Backpressure
Slow down producers when consumers are overwhelmed:
# Producer respects backpressure; the queue must be bounded, e.g. asyncio.Queue(maxsize=100)
async def produce(queue):
    while True:
        item = await get_item()
        await queue.put(item)  # Blocks when the bounded queue is full

# Consumer processes at its own pace
async def consume(queue):
    while True:
        item = await queue.get()
        await process(item)
Testing Reliability
Chaos Engineering
Intentionally inject failures:
import random

# Chaos middleware (WSGI): fail a small fraction of requests
class ChaosMiddleware:
    def __init__(self, app, failure_rate=0.01):
        self.app = app
        self.failure_rate = failure_rate

    def __call__(self, environ, start_response):
        if random.random() < self.failure_rate:
            start_response('500 Internal Server Error', [])
            return [b'Chaos!']
        return self.app(environ, start_response)
Chaos experiments:
- Kill instances randomly
- Inject latency
- Partition networks
- Exhaust resources
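Latency injection fits the same middleware pattern; the rate and delay below are illustrative:
import random
import time

class LatencyChaosMiddleware:
    def __init__(self, app, inject_rate=0.05, max_delay_seconds=2.0):
        self.app = app
        self.inject_rate = inject_rate
        self.max_delay_seconds = max_delay_seconds

    def __call__(self, environ, start_response):
        if random.random() < self.inject_rate:
            # Delay a small fraction of requests to surface timeout bugs
            time.sleep(random.uniform(0, self.max_delay_seconds))
        return self.app(environ, start_response)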
Failure Mode Testing
Test specific failure scenarios:
def test_database_failure():
    with mock.patch('db.query', side_effect=DatabaseError):
        response = client.get('/users/123')
        assert response.status_code == 503
        assert 'fallback' in response.json

def test_timeout():
    with mock.patch('service.call', side_effect=Timeout):
        response = client.get('/data')
        assert response.status_code == 504
Load Testing
Test behavior under stress:
- Normal load
- Peak load
- Overload
- Spike patterns
Measure degradation, not just success/failure.
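A minimal sketch of measuring degradation rather than a pass/fail count, using a thread pool to generate concurrent requests (the URL, concurrency, and totals are illustrative):
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def timed_request(url):
    start = time.monotonic()
    try:
        ok = requests.get(url, timeout=5).status_code < 500
    except requests.RequestException:
        ok = False
    return ok, time.monotonic() - start

def load_test(url='http://localhost:8000/api/search', concurrency=50, total=500):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_request, [url] * total))
    latencies = sorted(latency for _, latency in results)
    p95 = latencies[int(len(latencies) * 0.95)]
    error_rate = 1 - sum(ok for ok, _ in results) / len(results)
    print(f'p95={p95:.3f}s median={statistics.median(latencies):.3f}s errors={error_rate:.1%}')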
Key Takeaways
- Embrace failure as normal; design for continuous partial failure
- Always use timeouts; never wait forever
- Implement retries with exponential backoff and jitter
- Use circuit breakers to fail fast when dependencies are down
- Isolate failures with bulkheads (separate pools, services)
- Provide fallbacks for degraded operation
- Make operations idempotent for safe retries
- Use sagas for distributed transactions; each step has compensating action
- Implement deep health checks that verify dependencies
- Shed load gracefully; rejecting some requests is better than failing all
- Test failure modes explicitly; use chaos engineering
Reliable distributed systems aren’t accident-proof—they’re accident-tolerant. Design for graceful degradation, not perfect operation.