Building Reliable Distributed Systems

September 17, 2018

Distributed systems introduce failure modes that don’t exist in monolithic applications. Networks partition. Services become unavailable. Latency spikes unpredictably. Messages arrive out of order or not at all.

You can’t prevent these failures. You can design systems that handle them gracefully.

Embrace Failure

The Fallacies of Distributed Computing

These assumptions are false but commonly made:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Design for the reality: networks fail, latency varies, and things break.

Failure Is Normal

In large systems, something is always failing: disks die, deploys go bad, dependencies slow down, network links flap.

Design for continuous partial failure, not rare complete failure.

Patterns for Reliability

Timeouts

Always set timeouts. Never wait forever:

# Bad - waits forever
response = requests.get(url)

# Good - explicit timeout
response = requests.get(url, timeout=5)

Timeout strategy:

  - Set a timeout on every network call; never rely on library defaults.
  - Base values on observed latency percentiles (e.g., p99 plus headroom).
  - Keep downstream timeouts shorter than upstream ones so failures surface where they can be handled.
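
requests also accepts a (connect, read) tuple, which splits the budget between connecting and waiting for data. A quick sketch (url and handle_timeout are placeholders):

import requests

try:
    # Fail fast if we can't connect (3s); allow longer for the response (10s)
    response = requests.get(url, timeout=(3, 10))
except requests.exceptions.Timeout:
    # A timeout is a failure like any other: log it, fall back, or retry
    handle_timeout()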

Retries with Backoff

Transient failures often resolve with retry:

import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except RetryableError:
            # Out of attempts - surface the failure to the caller
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus jitter so retries don't synchronize
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Key principles:

  - Retry only errors that are actually transient (timeouts, 503s), never validation failures.
  - Back off exponentially and add jitter so clients don't retry in lockstep.
  - Cap attempts; unbounded retries turn one failure into a retry storm.
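
For example, you might wrap an HTTP call so only timeouts count as retryable. A sketch; RetryableError is whatever your code treats as transient:

import requests

class RetryableError(Exception):
    pass

def fetch_profile():
    try:
        # Timeouts are transient - worth retrying
        return requests.get('https://api.example.com/profile', timeout=5)
    except requests.exceptions.Timeout:
        raise RetryableError()

response = retry_with_backoff(fetch_profile)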

Circuit Breakers

Stop calling failing services:

import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = 'closed'
        self.last_failure_time = None

    def call(self, func):
        if self.state == 'open':
            # After the cooldown, allow one trial request through
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'half-open'
            else:
                raise CircuitOpenError()

        try:
            result = func()
            if self.state == 'half-open':
                # Trial request succeeded - close the circuit
                self.state = 'closed'
                self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = 'open'
            raise

States:

  - closed: requests flow normally; failures are counted.
  - open: calls fail fast with CircuitOpenError, giving the service time to recover.
  - half-open: after recovery_timeout, one trial call is allowed; success closes the circuit, failure reopens it.
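
Wiring a dependency through the breaker is a one-liner per call; a sketch with a hypothetical payment_client:

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def charge(customer, amount):
    try:
        # Every call to the dependency goes through the breaker
        return breaker.call(lambda: payment_client.charge(customer, amount))
    except CircuitOpenError:
        # Fail fast - the payment service gets room to recover
        raise ServiceUnavailable('payments temporarily disabled')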

Bulkheads

Isolate failures to prevent cascade:

from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different dependencies
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=10)
shipping_pool = ThreadPoolExecutor(max_workers=5)

# A payment service slowdown can exhaust only its own pool,
# not the threads serving inventory or shipping
payment_future = payment_pool.submit(call_payment_service)

Bulkhead patterns:

  - Separate thread or connection pools per dependency.
  - Per-endpoint or per-client concurrency limits.
  - Process or instance isolation for critical workloads.
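
The same isolation works without threads; a minimal asyncio sketch, where call_inventory_service is hypothetical:

import asyncio

# At most 10 in-flight calls to inventory; excess callers wait
# here instead of exhausting resources shared with other work
inventory_limit = asyncio.Semaphore(10)

async def get_inventory(item_id):
    async with inventory_limit:
        return await call_inventory_service(item_id)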

Fallbacks

Provide degraded functionality when dependencies fail:

def get_recommendations(user_id):
    try:
        return recommendation_service.get(user_id)
    except ServiceUnavailable:
        # Fallback to popular items
        return get_popular_items()

Fallback strategies:

  - Serve cached or stale data.
  - Return sensible defaults (popular items, an empty list).
  - Degrade the feature rather than fail the whole request.
  - Queue the work for later processing.
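
Serving stale data is often the most useful of these. A sketch that remembers the last good response, assuming a cache client like the one in the idempotency example below:

def get_recommendations_cached(user_id):
    try:
        result = recommendation_service.get(user_id)
        # Keep a long-lived copy for use during outages
        cache.set(f'recs:{user_id}', result, ttl=86400)
        return result
    except ServiceUnavailable:
        # Stale recommendations beat no recommendations
        return cache.get(f'recs:{user_id}') or get_popular_items()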

Idempotency

Make operations safe to retry:

def create_order(order_id, items):
    # Already processed (e.g., a retried request) - return the existing order
    if order_exists(order_id):
        return get_order(order_id)

    # First time seeing this order_id - process it
    # (a unique constraint on order_id closes the check-then-save race)
    order = Order(order_id, items)
    order.save()
    return order

Idempotency key pattern:

@app.route('/orders', methods=['POST'])
def create_order():
    idempotency_key = request.headers.get('Idempotency-Key')
    if idempotency_key is None:
        return jsonify({'error': 'Idempotency-Key header required'}), 400

    # A repeated key means a retried request - return the previous result
    cached = cache.get(f'idempotency:{idempotency_key}')
    if cached:
        return cached

    # First time seeing this key - process the order
    result = process_order(request.json)

    # Remember the result so retries are harmless (24h window)
    cache.set(f'idempotency:{idempotency_key}', result, ttl=86400)
    return result
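
On the client side, generate the key once per logical operation and reuse it on every retry; a sketch (the URL and payload are placeholders):

import uuid

import requests

# One key per logical order - resending with the same key is safe
key = str(uuid.uuid4())
response = requests.post(
    'https://api.example.com/orders',
    json={'items': [{'sku': 'A1', 'qty': 2}]},
    headers={'Idempotency-Key': key},
    timeout=5,
)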

Consistency Patterns

Eventual Consistency

Accept that data won’t be immediately consistent:

Write to Primary → Replicate to Replicas → Eventually Consistent

Design for it:

  - Give writers a way to read their own writes (sticky sessions or read-from-primary after a write).
  - Version or timestamp records so stale data is detectable.
  - Surface staleness in the UI rather than pretending data is current.
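
One way to get read-your-writes on top of async replication is to pin recent writers to the primary. A minimal sketch, assuming hypothetical primary and replica clients and a replication-lag budget:

import time

last_write = {}  # user_id -> time of their most recent write

def save_profile(user_id, profile):
    primary.save(user_id, profile)
    last_write[user_id] = time.time()

def load_profile(user_id, lag_budget=2.0):
    # Recent writers read from the primary so they see their own
    # writes; everyone else can tolerate replica lag
    if time.time() - last_write.get(user_id, 0) < lag_budget:
        return primary.load(user_id)
    return replica.load(user_id)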

Saga Pattern

Coordinate distributed transactions without distributed locks:

class OrderSaga:
    def __init__(self, customer, items, total, order):
        self.customer = customer
        self.items = items
        self.total = total
        self.order = order

    def execute(self):
        try:
            # Step 1: Reserve inventory
            reservation_id = inventory_service.reserve(self.items)

            # Step 2: Process payment
            payment_id = payment_service.charge(self.customer, self.total)

            # Step 3: Confirm order
            order_id = order_service.create(self.order)

            return order_id

        except PaymentFailed:
            # Compensate: release inventory
            inventory_service.release(reservation_id)
            raise

        except OrderCreationFailed:
            # Compensate: refund and release
            payment_service.refund(payment_id)
            inventory_service.release(reservation_id)
            raise

Each step has a compensating action for rollback.

Event Sourcing for Reliability

Store events, derive state:

# Instead of updating state
user.balance = user.balance + amount

# Store event
events.append(CreditApplied(user_id, amount, timestamp))

# Derive state from events
def get_balance(user_id):
    balance = 0
    for event in get_events(user_id):
        if isinstance(event, CreditApplied):
            balance += event.amount
        elif isinstance(event, DebitApplied):
            balance -= event.amount
    return balance

Events are immutable facts, which makes state easy to rebuild, recover, and audit.

Operational Reliability

Health Checks

Health checks should verify dependencies, not just that the process is up:

@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }

    healthy = all(checks.values())
    status_code = 200 if healthy else 503

    return jsonify({
        'status': 'healthy' if healthy else 'unhealthy',
        'checks': checks
    }), status_code

Health check types:

  - Liveness: is the process up? Failing means restart it.
  - Readiness: can it serve traffic? Failing means remove it from rotation.
  - Deep: are dependencies reachable? Use carefully - a flaky dependency can knock out otherwise-healthy instances.
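
Splitting liveness from readiness keeps a dependency outage from triggering needless restarts; a sketch in the same Flask style:

@app.route('/live')
def liveness():
    # The process is up and answering - check nothing else
    return jsonify({'status': 'alive'}), 200

@app.route('/ready')
def readiness():
    # Gate traffic only on dependencies we can't serve without
    ready = check_database()
    status = 200 if ready else 503
    return jsonify({'status': 'ready' if ready else 'not ready'}), status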

Graceful Degradation

Handle overload gracefully:

@app.route('/api/search')
def search():
    if rate_limiter.is_overloaded():
        # Return cached/default results
        return get_cached_results()

    if request_queue.is_full():
        # Shed load
        return jsonify({'error': 'overloaded'}), 503

    return perform_search(request.args)

Load Shedding

Drop requests when overwhelmed rather than failing completely:

from queue import Full, Queue

class ServiceOverloaded(Exception):
    pass

class LoadShedder:
    def __init__(self, max_queue_size=100):
        self.queue = Queue(max_queue_size)

    def process(self, request):
        try:
            # Bounded queue caps how much work we accept
            self.queue.put_nowait(request)
        except Full:
            # Shed load - reject immediately instead of queueing forever
            raise ServiceOverloaded()

Better to reject some requests than fail all.

Backpressure

Slow down producers when consumers are overwhelmed:

import asyncio

# A bounded queue is what creates backpressure
queue = asyncio.Queue(maxsize=100)

# Producer respects backpressure
async def produce(queue):
    while True:
        item = await get_item()
        await queue.put(item)  # Suspends while the queue is full

# Consumer processes at its own pace
async def consume(queue):
    while True:
        item = await queue.get()
        await process(item)

Testing Reliability

Chaos Engineering

Intentionally inject failures:

import random

# WSGI middleware that fails a configurable fraction of requests
class ChaosMiddleware:
    def __init__(self, app, failure_rate=0.01):
        self.app = app
        self.failure_rate = failure_rate

    def __call__(self, environ, start_response):
        if random.random() < self.failure_rate:
            start_response('500 Internal Server Error',
                           [('Content-Type', 'text/plain')])
            return [b'Chaos!']
        return self.app(environ, start_response)

Chaos experiments:

  - Kill instances and verify traffic reroutes.
  - Inject latency and confirm timeouts and fallbacks engage.
  - Partition the network between services.
  - Fail a dependency and check that circuit breakers open.
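
Latency injection fits the same middleware shape:

import random
import time

# Delays a fraction of requests to exercise timeout and fallback paths
class LatencyMiddleware:
    def __init__(self, app, delay_rate=0.05, max_delay=3.0):
        self.app = app
        self.delay_rate = delay_rate
        self.max_delay = max_delay

    def __call__(self, environ, start_response):
        if random.random() < self.delay_rate:
            time.sleep(random.uniform(0, self.max_delay))
        return self.app(environ, start_response)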

Failure Mode Testing

Test specific failure scenarios:

from unittest import mock

def test_database_failure():
    # Database errors should produce a 503 with a fallback payload
    with mock.patch('db.query', side_effect=DatabaseError):
        response = client.get('/users/123')
        assert response.status_code == 503
        assert 'fallback' in response.json

def test_timeout():
    # A downstream timeout should surface as a 504, not hang
    with mock.patch('service.call', side_effect=Timeout):
        response = client.get('/data')
        assert response.status_code == 504

Load Testing

Test behavior under stress:

  - Ramp load past expected peak, then past the breaking point.
  - Watch latency percentiles (p50/p99), error rates, and queue depths, not just averages.
  - Verify the system recovers once load drops.

Measure degradation, not just success/failure.
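
Real load tests belong in a dedicated tool, but the idea fits in a few lines; a sketch that hammers a placeholder URL and reports percentiles:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

def timed_request(url):
    start = time.time()
    try:
        ok = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    return ok, time.time() - start

def load_test(url, concurrency=50, total=1000):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_request(url), range(total)))
    latencies = sorted(latency for _, latency in results)
    errors = sum(1 for ok, _ in results if not ok)
    # Report percentiles and error rate, not just pass/fail
    print(f'p50={latencies[len(latencies) // 2]:.3f}s '
          f'p99={latencies[int(len(latencies) * 0.99)]:.3f}s '
          f'errors={errors}/{total}')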

Key Takeaways

Reliable distributed systems aren’t accident-proof—they’re accident-tolerant. Design for graceful degradation, not perfect operation.