Failure is not exceptional—it’s inevitable. Networks partition, services crash, databases become unavailable. The question isn’t whether your system will face failures, but how it behaves when they occur.
Systems that fail gracefully degrade partially rather than completely. They continue providing value even when components break.
Principles of Graceful Degradation
Expect Failure
Design assuming every dependency will eventually fail:
Every network call → will timeout
Every service → will become unavailable
Every database → will become unreachable
Every disk → will fill up
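In practice, that assumption shows up as defensive defaults on every outbound call. A minimal sketch (the call_with_timeout helper and ServiceUnavailable type are illustrative names, not from any particular library):

import requests

class ServiceUnavailable(Exception):
    """Raised when a dependency times out or is unreachable."""

# Hypothetical helper: every outbound call carries an explicit timeout,
# so a hung dependency can never hang the caller.
def call_with_timeout(url, timeout_seconds=2.0):
    try:
        response = requests.get(url, timeout=timeout_seconds)
        response.raise_for_status()
        return response.json()
    except (requests.Timeout, requests.ConnectionError) as exc:
        raise ServiceUnavailable(url) from exc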
Fail Fast, Recover Gracefully
Detect failures quickly, recover automatically:
// Circuit breaker pattern
if breaker.IsOpen() {
    return fallbackResponse() // Fail fast
}

result, err := callService()
if err != nil {
    breaker.RecordFailure()
    return fallbackResponse() // Graceful degradation
}
breaker.RecordSuccess()
return result
Partial Availability Over Total Failure
When one component fails, others continue:
Homepage Request:
├── User Profile → ✓ Show profile
├── Recommendations → ✗ Show popular items (fallback)
├── Notifications → ✗ Hide notifications section
└── Result: Partial page loads (better than error page)
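A minimal sketch of that request flow, assuming hypothetical get_profile, get_recommendations, get_notifications, and get_popular_items helpers that raise ServiceUnavailable on failure:

def render_homepage(user_id):
    page = {"profile": get_profile(user_id)}  # critical: let failures propagate

    try:
        page["recommendations"] = get_recommendations(user_id)
    except ServiceUnavailable:
        page["recommendations"] = get_popular_items()  # fallback to popular items

    try:
        page["notifications"] = get_notifications(user_id)
    except ServiceUnavailable:
        page["notifications"] = None  # hide the notifications section

    return page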
Patterns for Graceful Failure
Circuit Breakers
Stop calling failing services:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, func, fallback):
        if self.state == "OPEN":
            if self.should_attempt_reset():
                return self.try_call(func, fallback)
            return fallback()
        return self.try_call(func, fallback)

    def try_call(self, func, fallback):
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            return fallback()

    def should_attempt_reset(self):
        # After the reset timeout, allow one trial call (the "half-open" state)
        return time.time() - self.last_failure_time >= self.reset_timeout

    def on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "OPEN"
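Wrapping a dependency call then looks like this (recommendation_service and get_popular_items are the same illustrative names used in the fallback example further down):

breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30)

def recommendations_for(user_id):
    return breaker.call(
        lambda: recommendation_service.get_personalized(user_id),
        fallback=get_popular_items,
    )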
Bulkheads
Isolate failures to prevent cascade:
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools per dependency
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=10)
notifications_pool = ThreadPoolExecutor(max_workers=5)

# Payment service issues can't exhaust the workers other dependencies need
def process_order(order):
    payment_future = payment_pool.submit(process_payment, order)
    inventory_future = inventory_pool.submit(reserve_inventory, order)

    # Notifications can fail without blocking the order
    try:
        notifications_pool.submit(send_notification, order)
    except RuntimeError:
        # Python's executor raises RuntimeError once shut down; there is no
        # rejection-on-saturation as in Java's RejectedExecutionError
        log.warning("Notification pool unavailable, skipping")

    return payment_future.result(), inventory_future.result()
Fallback Strategies
Provide alternatives when primary fails:
def get_user_recommendations(user_id):
    try:
        # Primary: personalized recommendations
        return recommendation_service.get_personalized(user_id)
    except ServiceUnavailable:
        try:
            # Fallback 1: cached recommendations
            # (assumes a cache client that raises CacheMiss when the key is absent)
            return cache.get(f"recommendations:{user_id}")
        except CacheMiss:
            # Fallback 2: popular items
            return get_popular_items()
Timeouts and Deadlines
Never wait forever:
import time

# Per-request deadline
def handle_request(request):
    deadline = time.time() + 5  # 5 second deadline

    user = get_user(request.user_id, deadline=deadline)
    if time.time() > deadline:
        return partial_response(user)

    orders = get_orders(request.user_id, deadline=deadline)
    if time.time() > deadline:
        return partial_response(user, orders)

    # Full response only if still within the deadline
    recommendations = get_recommendations(request.user_id, deadline=deadline)
    return full_response(user, orders, recommendations)
Retries with Backoff
Transient failures often recover:
import random
import time

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter to avoid synchronized retries
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
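Usage is a one-liner; charge_order here is a hypothetical call that raises TransientError on recoverable failures:

receipt = retry_with_backoff(lambda: charge_order(order_id), max_retries=3)

The jitter term matters: without it, clients that failed together retry together and re-overload the dependency they are trying to spare.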
Load Shedding
Reject excess load to protect the system:
import threading

class LoadShedder:
    def __init__(self, max_concurrent=100):
        self.semaphore = threading.Semaphore(max_concurrent)

    def process(self, request):
        # Reject immediately rather than queueing behind saturated workers
        if not self.semaphore.acquire(blocking=False):
            raise ServiceOverloaded()
        try:
            return handle_request(request)
        finally:
            self.semaphore.release()
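A sketch of wiring it into a request handler (ServiceOverloaded and the Response type are illustrative, not from a specific framework):

shedder = LoadShedder(max_concurrent=100)

def serve(request):
    try:
        return shedder.process(request)
    except ServiceOverloaded:
        # A fast 503 with Retry-After is kinder than a slow timeout
        return Response(status=503, headers={"Retry-After": "1"})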
Feature-Level Degradation
Feature Flags for Graceful Degradation
def homepage():
    content = get_essential_content()  # Always needed

    if feature_flags.is_enabled("recommendations"):
        try:
            content.recommendations = get_recommendations()
        except ServiceError:
            feature_flags.disable_temporarily("recommendations")

    if feature_flags.is_enabled("notifications"):
        content.notifications = get_notifications()

    return content
Priority-Based Features
# Critical features must work
CRITICAL = ["login", "checkout", "order_status"]

# Important but degradable
IMPORTANT = ["recommendations", "reviews", "notifications"]

# Nice to have
OPTIONAL = ["social_share", "wishlists", "recently_viewed"]

def handle_request(request):
    if system_under_stress():
        disable_features(OPTIONAL)
    if system_severely_stressed():
        disable_features(IMPORTANT)
Database Resilience
Read Replicas for Availability
def get_user(user_id):
    try:
        return primary_db.get_user(user_id)
    except DatabaseUnavailable:
        # Fall back to a potentially stale replica
        return replica_db.get_user(user_id)
Write Queue for Eventual Consistency
def update_user(user_id, data):
    try:
        primary_db.update_user(user_id, data)
    except DatabaseUnavailable:
        # Queue for later processing
        write_queue.enqueue("update_user", user_id, data)
        return AcceptedForProcessing()
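The other half of this pattern is a consumer that drains the queue once the primary recovers. A sketch, assuming the same hypothetical write_queue and primary_db objects:

def drain_write_queue():
    while True:
        job = write_queue.dequeue()  # assumed to return None when the queue is empty
        if job is None:
            return
        operation, user_id, data = job
        try:
            if operation == "update_user":
                primary_db.update_user(user_id, data)
        except DatabaseUnavailable:
            # Primary is still down: put the job back and try again later
            write_queue.enqueue(operation, user_id, data)
            return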
Caching for Resilience
Stale-While-Revalidate
def get_product(product_id):
    cached = cache.get(product_id)
    if cached and not cached.is_stale():
        return cached.value
    try:
        fresh = product_service.get(product_id)
        cache.set(product_id, fresh, ttl=3600)
        return fresh
    except ServiceUnavailable:
        if cached:
            # Return stale data rather than an error
            return cached.value
        raise
Cache Stampede Prevention
def get_with_lock(key, fetch_func, ttl=3600):
    value = cache.get(key)
    if value:
        return value

    lock = cache.acquire_lock(f"lock:{key}", timeout=10)
    if not lock:
        # Another process is fetching; wait briefly, then fall back to fetching
        time.sleep(0.1)
        return cache.get(key) or fetch_func()

    try:
        value = fetch_func()
        cache.set(key, value, ttl)
        return value
    finally:
        lock.release()
Testing Failure Modes
Chaos Engineering
Intentionally inject failures:
# Chaos Mesh experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  duration: 5m
Failure Injection in Tests
def test_handles_payment_failure():
    with mock.patch('payment.process', side_effect=Timeout):
        response = client.post('/checkout', order_data)

        # Order still created, payment queued
        assert response.status_code == 202
        assert response.json['status'] == 'pending_payment'
Key Takeaways
- Design expecting every dependency to fail
- Use circuit breakers to fail fast and recover gracefully
- Implement bulkheads to isolate failures
- Provide fallbacks at every level
- Always set timeouts; never wait forever
- Retry with exponential backoff for transient failures
- Shed load to protect the system from overload
- Use feature flags for runtime degradation control
- Cache aggressively with stale-while-revalidate
- Test failure modes with chaos engineering
Graceful degradation isn’t optional—it’s the difference between partial service and complete outage.