Designing Systems That Fail Gracefully

May 6, 2019

Failure is not exceptional—it’s inevitable. Networks partition, services crash, databases become unavailable. The question isn’t whether your system will face failures, but how it behaves when they occur.

Systems that fail gracefully degrade partially rather than completely. They continue providing value even when components break.

Principles of Graceful Degradation

Expect Failure

Design assuming every dependency will eventually fail:

Every network call → will time out
Every service → will become unavailable
Every database → will become unreachable
Every disk → will fill up
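One way to bake that assumption into code is a small guard that gives every dependency call a hard time budget and a fallback. This is an illustrative sketch, not a library API; `guarded` and the pool size are made up for the example:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def guarded(call, fallback, timeout=2.0):
    # Submit to a pool so we can enforce a wall-clock budget even when
    # the underlying call has no timeout of its own.
    future = _pool.submit(call)
    try:
        return future.result(timeout=timeout)
    except Exception:
        # Timed out or raised. Note the timed-out call keeps running in
        # the background thread, so real code should also bound or
        # cancel the underlying work.
        return fallback()
```

Every dependency call then goes through `guarded(...)`, so a hung service costs at most the budget, never the whole request.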

Fail Fast, Recover Gracefully

Detect failures quickly, recover automatically:

// Circuit breaker pattern
if breaker.IsOpen() {
    return fallbackResponse() // Fail fast
}

result, err := callService()
if err != nil {
    breaker.RecordFailure()
    return fallbackResponse() // Graceful degradation
}

breaker.RecordSuccess()
return result

Partial Availability Over Total Failure

When one component fails, others continue:

Homepage Request:
├── User Profile → ✓ Show profile
├── Recommendations → ✗ Show popular items (fallback)
├── Notifications → ✗ Hide notifications section
└── Result: Partial page loads (better than error page)
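The page above can be assembled widget by widget so that one failing section never takes down the whole response. A minimal sketch; the fetchers here are hypothetical stand-ins for real service calls:

```python
def load_widget(fetch, fallback=lambda: None):
    """Fetch one page section; on failure, degrade that section only."""
    try:
        return fetch()
    except Exception:
        return fallback()

# Hypothetical fetchers standing in for real service clients:
def get_profile(user_id):
    return {"name": "user-" + str(user_id)}

def get_recommendations(user_id):
    raise RuntimeError("recommendation service down")

def render_homepage(user_id):
    return {
        "profile": load_widget(lambda: get_profile(user_id)),
        # Recommendations are down, so fall back to popular items
        "recommendations": load_widget(
            lambda: get_recommendations(user_id),
            fallback=lambda: ["popular-1", "popular-2"],
        ),
    }
```

A `None` section simply renders as hidden, which is the "partial page loads" outcome from the diagram.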

Patterns for Graceful Failure

Circuit Breakers

Stop calling failing services:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, func, fallback):
        if self.state == "OPEN":
            if self.should_attempt_reset():
                # Half-open: let one trial call through
                return self.try_call(func, fallback)
            return fallback()

        return self.try_call(func, fallback)

    def try_call(self, func, fallback):
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            return fallback()

    def should_attempt_reset(self):
        return time.time() - self.last_failure_time >= self.reset_timeout

    def on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "OPEN"

Bulkheads

Isolate failures to prevent cascade:

from concurrent.futures import ThreadPoolExecutor
import logging

# Separate thread pools per dependency
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=10)
notifications_pool = ThreadPoolExecutor(max_workers=5)

# Payment service issues don't exhaust resources for others
def process_order(order):
    payment_future = payment_pool.submit(process_payment, order)
    inventory_future = inventory_pool.submit(reserve_inventory, order)
    # Notifications can fail without blocking the order
    try:
        notifications_pool.submit(send_notification, order)
    except RuntimeError:
        # Note: Python's stdlib executor queues excess work rather than
        # rejecting it; a bounded queue is needed for true rejection
        logging.warning("Notification pool unavailable, skipping")
    return payment_future.result(), inventory_future.result()

Fallback Strategies

Provide alternatives when primary fails:

def get_user_recommendations(user_id):
    try:
        # Primary: Personalized recommendations
        return recommendation_service.get_personalized(user_id)
    except ServiceUnavailable:
        try:
            # Fallback 1: Cached recommendations
            return cache.get(f"recommendations:{user_id}")
        except CacheMiss:
            # Fallback 2: Popular items
            return get_popular_items()

Timeouts and Deadlines

Never wait forever:

# Per-request deadline shared by every downstream call
def handle_request(request):
    deadline = time.time() + 5  # 5-second budget for the whole request

    user = get_user(request.user_id, deadline=deadline)
    if time.time() > deadline:
        return partial_response(user)

    orders = get_orders(request.user_id, deadline=deadline)
    if time.time() > deadline:
        return partial_response(user, orders)

    # Full response only if still within the deadline
    recommendations = get_recommendations(request.user_id, deadline=deadline)
    return full_response(user, orders, recommendations)

Retries with Backoff

Transient failures often recover:

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Load Shedding

Reject excess load to protect the system:

class LoadShedder:
    def __init__(self, max_concurrent=100):
        self.semaphore = threading.Semaphore(max_concurrent)

    def process(self, request):
        if not self.semaphore.acquire(blocking=False):
            raise ServiceOverloaded()

        try:
            return handle_request(request)
        finally:
            self.semaphore.release()

Feature-Level Degradation

Feature Flags for Graceful Degradation

def homepage():
    content = get_essential_content()  # Always needed

    if feature_flags.is_enabled("recommendations"):
        try:
            content.recommendations = get_recommendations()
        except ServiceError:
            feature_flags.disable_temporarily("recommendations")

    if feature_flags.is_enabled("notifications"):
        content.notifications = get_notifications()

    return content

Priority-Based Features

# Critical features must work
CRITICAL = ["login", "checkout", "order_status"]
# Important but degradable
IMPORTANT = ["recommendations", "reviews", "notifications"]
# Nice to have
OPTIONAL = ["social_share", "wishlists", "recently_viewed"]

def handle_request(request):
    if system_under_stress():
        disable_features(OPTIONAL)
        if system_severely_stressed():
            disable_features(IMPORTANT)

Database Resilience

Read Replicas for Availability

def get_user(user_id):
    try:
        return primary_db.get_user(user_id)
    except DatabaseUnavailable:
        # Fall back to potentially stale replica
        return replica_db.get_user(user_id)

Write Queue for Eventual Consistency

def update_user(user_id, data):
    try:
        primary_db.update_user(user_id, data)
    except DatabaseUnavailable:
        # Queue for later processing
        write_queue.enqueue("update_user", user_id, data)
        return AcceptedForProcessing()

Caching for Resilience

Stale-While-Revalidate

def get_product(product_id):
    cached = cache.get(product_id)

    if cached and not cached.is_stale():
        return cached.value

    try:
        fresh = product_service.get(product_id)
        cache.set(product_id, fresh, ttl=3600)
        return fresh
    except ServiceUnavailable:
        if cached:
            # Return stale data rather than error
            return cached.value
        raise

Cache Stampede Prevention

def get_with_lock(key, fetch_func, ttl=3600):
    value = cache.get(key)
    if value is not None:  # don't treat cached falsy values as misses
        return value

    lock = cache.acquire_lock(f"lock:{key}", timeout=10)
    if not lock:
        # Another process is fetching, wait briefly
        time.sleep(0.1)
        return cache.get(key) or fetch_func()

    try:
        value = fetch_func()
        cache.set(key, value, ttl)
        return value
    finally:
        lock.release()

Testing Failure Modes

Chaos Engineering

Intentionally inject failures:

# Chaos Mesh experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  duration: 5m

Failure Injection in Tests

def test_handles_payment_failure():
    with mock.patch('payment.process', side_effect=Timeout):
        response = client.post('/checkout', order_data)

        # Order still created, payment queued
        assert response.status_code == 202
        assert response.json['status'] == 'pending_payment'

Key Takeaways

Graceful degradation isn’t optional—it’s the difference between partial service and complete outage.