Building Resilient Systems: Lessons from Production Failures

July 18, 2016

Systems fail. Networks partition. Services crash. Databases corrupt data. Hardware dies. The question isn’t whether your system will experience failures; it’s whether it will survive them gracefully.

Resilient systems continue functioning, perhaps in degraded mode, when components fail. Fragile systems cascade from a single component failure into a complete outage. The difference lies in design: how you architect for failure, not just for success.

These lessons come from production failures—ours and others’. Each failure taught something about building systems that survive the unexpected.

Expect Failure

The first principle of resilient design: failures will occur. Design for failure from the start, not as an afterthought.

Everything Fails Eventually

Your database will go down. Not “might”—will. The question is when and how your system responds.

Pretending these failures won’t happen doesn’t make them less likely. Planning for them does make their impact smaller.

Design for Degradation

When a component fails, what happens? Resilient systems degrade gracefully:

Identify which components are truly critical (no workaround exists) versus which can degrade (reduced functionality is acceptable). Design fallback behaviors for the latter.
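
A small wrapper makes the fallback explicit. This is only a sketch; get_personalized_recommendations and get_popular_items (in the comment) are hypothetical stand-ins for a degradable dependency and its cheaper substitute:

def with_fallback(primary, fallback):
    """Call primary; if it fails, degrade to fallback instead of erroring out."""
    def guarded(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:  # in practice, catch the dependency's specific errors
            # Degraded mode: reduced functionality beats a failed page.
            return fallback()
    return guarded

# recommendations = with_fallback(get_personalized_recommendations, get_popular_items)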

Isolation Patterns

Failures spread when components are tightly coupled. Isolation patterns contain failures within boundaries.

Bulkheads

In ships, bulkheads are watertight compartments. A hull breach floods one compartment, not the entire ship. Software bulkheads work similarly.

Separate critical paths from non-critical paths:

When the recommendation service overwhelms its database connection pool, it shouldn’t affect the checkout service’s connections.
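
One way to express this in code is to give each dependency its own worker pool, so exhausting one cannot starve the other. A minimal sketch using Python’s concurrent.futures; the pool sizes are illustrative, and process_checkout / fetch_recommendations are hypothetical stand-ins:

from concurrent.futures import ThreadPoolExecutor

# Each dependency gets its own pool. A flood of slow recommendation
# calls can fill its pool without touching checkout's workers.
checkout_pool = ThreadPoolExecutor(max_workers=20)
recommendations_pool = ThreadPoolExecutor(max_workers=5)

def handle_checkout(order):
    return checkout_pool.submit(process_checkout, order)

def handle_recommendations(user_id):
    return recommendations_pool.submit(fetch_recommendations, user_id)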

Circuit Breakers

When a dependency fails, continuing to call it is harmful. Requests pile up, timeouts accumulate, and your system degrades waiting for a dead service.

Circuit breakers stop calling failing dependencies:

  1. Closed state: Requests flow normally. Track failure rates.
  2. Open state: After threshold failures, stop calling the dependency. Return fallback responses immediately.
  3. Half-open state: Periodically allow test requests. If they succeed, close the circuit. If they fail, stay open.

Circuit breakers prevent cascade failures and give failing services time to recover without load.

import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected immediately."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to wait before retrying
        self.state = "closed"
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._should_attempt_recovery():
                self.state = "half-open"  # let one test request through
            else:
                raise CircuitOpenError()

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _should_attempt_recovery(self):
        return time.time() - self.last_failure_time >= self.recovery_timeout

    def _on_success(self):
        # Any success closes the circuit and resets the failure count.
        self.failure_count = 0
        self.state = "closed"

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
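
Usage might look like this, with fetch_recommendations standing in for a call to a flaky dependency:

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def load_recommendations(user_id):
    try:
        return breaker.call(fetch_recommendations, user_id)
    except CircuitOpenError:
        return []  # serve an empty list while the dependency recovers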

Timeouts

Operations without timeouts can hang indefinitely. A slow dependency makes your service slow; enough slow requests exhaust your capacity.

Set timeouts on every external call:
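
For HTTP calls this can be as simple as passing a timeout. A sketch using the requests library; the endpoint is made up and the connect/read values are only starting points:

import requests

def fetch_profile(user_id):
    # Fail fast instead of hanging: 3 seconds to connect, 10 to read.
    response = requests.get(
        "https://profile.internal/users/{}".format(user_id),
        timeout=(3, 10),
    )
    response.raise_for_status()
    return response.json()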

Timeout values require tuning. Too short: false failures under load. Too long: slow failure detection. Start with generous timeouts, then tighten based on observed latency distributions.

Rate Limiting

Protect your system from overload—whether from traffic spikes, misbehaving clients, or denial of service.

Rate limits prevent any single source from consuming excessive resources:

When rate limited, return clear error responses (429 Too Many Requests) with information about when to retry.
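
A token bucket is one common way to implement this. The sketch below is in-process and per-client; real deployments usually keep the counters in a shared store so every instance sees the same limits:

import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with 429 Too Many Requests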

Redundancy and Replication

Single points of failure are resilience killers. Redundancy ensures component failure doesn’t cause system failure.

Stateless Services

Stateless services are trivially redundant. Any instance can handle any request. If one instance fails, others continue serving.

Keep services stateless by pushing state out of the process: store sessions and user data in shared external stores rather than in instance memory or on local disk.
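
For example, session data can live in a shared store so any instance can serve any request. A sketch assuming a Redis server on localhost; the key prefix and TTL are illustrative:

import json
import redis

sessions = redis.Redis(host="localhost", port=6379)

def save_session(session_id, data, ttl_seconds=3600):
    # Any instance can write the session...
    sessions.setex("session:" + session_id, ttl_seconds, json.dumps(data))

def load_session(session_id):
    # ...and any other instance can read it back, even after a failover.
    raw = sessions.get("session:" + session_id)
    return json.loads(raw.decode("utf-8")) if raw else None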

Data Replication

Databases are harder to make redundant because they hold state. Replication copies data across multiple nodes, either synchronously (a write isn’t acknowledged until replicas confirm it) or asynchronously (replicas catch up after the fact).

Choose a replication strategy based on durability requirements. Financial transactions need synchronous replication; activity logs might tolerate async.

Geographic Distribution

Regional outages happen. Truly resilient systems span multiple regions:

Geographic distribution adds complexity and latency but survives disasters that take out entire data centers.

Testing Resilience

You can’t know if resilience mechanisms work without testing them. Testing in production is often the only way to know for sure.

Chaos Engineering

Netflix popularized chaos engineering: deliberately injecting failures to test system resilience.

Start small: inject failures in non-production environments. As confidence grows, test in production during low-traffic periods with engineers ready to respond.
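
Even a tiny fault injector exercises your fallbacks. The sketch below occasionally makes a wrapped call slow or fail; the rates are arbitrary and this is nowhere near a full chaos tool:

import random
import time

def chaotic(func, error_rate=0.05, slow_rate=0.05, delay_seconds=2):
    """Wrap a dependency call and occasionally make it slow or fail."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < error_rate:
            raise RuntimeError("injected failure")
        if roll < error_rate + slow_rate:
            time.sleep(delay_seconds)  # injected latency
        return func(*args, **kwargs)
    return wrapper

# In a test environment: fetch_inventory = chaotic(fetch_inventory)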

Game Days

Scheduled exercises where teams practice incident response. Inject realistic failures and respond as if they were real incidents.

Game days reveal gaps in tooling, documentation, and coordination that only show up under pressure.

Regular practice builds muscle memory for real incidents.

Failure Mode Analysis

For each component, document how it can fail, how you would detect that failure, what the impact on users would be, and how the system or an operator responds.

This analysis reveals gaps before production incidents expose them.

Recovery Patterns

When failures occur, recovery should be fast and safe.

Automated Recovery

Systems should heal themselves when possible: restart crashed processes, replace unhealthy instances, retry transient errors, and fail over automatically.

Automated recovery reduces incident duration and eliminates manual intervention for common failures.
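
Retrying transient failures with exponential backoff is one of the simplest forms of self-healing. A sketch; which exceptions count as transient is application-specific, and the builtins below are placeholders:

import random
import time

def retry(func, attempts=5, base_delay=0.5,
          transient=(ConnectionError, TimeoutError)):
    """Retry a flaky operation, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except transient:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            # Backoff plus jitter avoids synchronized retry storms.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))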

Rollback Capability

Every deployment should be reversible. When a deployment causes problems, the fastest safe response is usually to roll back first and investigate afterwards.

Practice rollback regularly. Untested rollback procedures fail when needed.

Incremental Recovery

After major failures, recover incrementally: restore the most critical functionality first, then bring the remaining services back gradually while watching for signs of strain.

Rushing recovery often causes secondary failures.

Observability

You can’t respond to failures you can’t detect. Observability provides visibility into system behavior.

Monitoring

Track metrics that indicate health: error rates, latency (including tail latency, not just averages), throughput, and saturation.

Alert on symptoms (users experiencing errors) rather than causes (high CPU). Users don’t care about CPU; they care about errors.
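
As a toy illustration of a symptom-based check, this tracks the error rate over a recent window of requests rather than any machine-level metric; the window size and threshold are arbitrary:

from collections import deque

class ErrorRateMonitor:
    def __init__(self, window=1000, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def record(self, failed):
        self.outcomes.append(failed)

    def should_alert(self):
        if not self.outcomes:
            return False
        error_rate = sum(self.outcomes) / float(len(self.outcomes))
        return error_rate > self.threshold  # users are seeing errors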

Logging

Structured logs enable investigation: log events as key-value records (request ID, user ID, operation, duration, outcome) rather than free-form strings, so they can be searched and correlated across services.
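
A minimal sketch of JSON-formatted log lines; the field names are illustrative, and most services would lean on a logging library that does this for them:

import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

def log_event(event, **fields):
    # One JSON object per line: easy to parse, filter, and aggregate.
    fields.update({"event": event, "timestamp": time.time()})
    logger.info(json.dumps(fields))

# log_event("order_placed", request_id="abc123", user_id=42, duration_ms=87)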

Distributed Tracing

In microservices, a single user request may touch dozens of services. Distributed tracing shows the full request path: which services were called, in what order, and where the time went.

Without tracing, debugging distributed systems is guesswork.
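
The core mechanism is propagating a trace identifier with every hop so each service can log it. The sketch below passes one as an HTTP header; the header name is made up, and real systems use dedicated tracing tooling rather than hand-rolled IDs:

import uuid
import requests

TRACE_HEADER = "X-Trace-Id"  # illustrative header name

def call_downstream(url, trace_id=None, **kwargs):
    # Reuse the incoming trace ID if there is one, otherwise start a trace.
    trace_id = trace_id or uuid.uuid4().hex
    headers = kwargs.pop("headers", {})
    headers[TRACE_HEADER] = trace_id
    # Every service logs this ID, so one request can be followed end to end.
    return requests.get(url, headers=headers, **kwargs)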

Cultural Elements

Resilience isn’t only technical. Organizational culture affects how systems respond to failure.

Blameless Postmortems

After incidents, understand what happened and improve—without blaming individuals. Blame prevents learning; people hide mistakes rather than analyzing them.

Focus on systems: what allowed this failure to happen? How do we prevent it? What can we improve?

On-Call Practices

Sustainable on-call enables effective incident response: rotations large enough to allow rest, alert volume low enough that pages stay meaningful, and time to recover after incidents.

Exhausted engineers make poor decisions during incidents.

Incident Communication

Clear communication during incidents keeps users and stakeholders informed: regular status updates, a single place to check for the current state, and honesty about what is and isn’t known yet.

Good communication reduces support burden and maintains trust.

Key Takeaways