Zero Downtime Deployments: Patterns and Practices

October 21, 2019

Zero downtime deployments used to be a luxury. Now they’re expected. Users don’t tolerate maintenance windows, and competitive pressure demands continuous delivery. But achieving true zero downtime requires careful orchestration across application code, databases, and infrastructure.

Here’s how to deploy without interruption.

Why Downtime Happens

Common Causes

Application restarts: while the old process stops and the new one boots, there is a window where nothing is serving requests.

Database changes: schema migrations can lock tables or break compatibility with code that is still running.

Traffic management: load balancers may keep routing to instances that are restarting or already terminated.

Dependencies: upgrading a shared service or library can force dependent services to restart at the same time.

Deployment Patterns

Rolling Deployments

Deploy new version incrementally:

# Kubernetes rolling update
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 25%

Process:

  1. Start new instance
  2. Wait for health check
  3. Begin sending traffic
  4. Terminate old instance
  5. Repeat until complete

Key settings: maxUnavailable: 0 keeps capacity at 100% for the whole rollout, while maxSurge: 25% lets Kubernetes temporarily run extra instances so the rollout still makes progress.
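
You can watch a rollout and revert it if something looks wrong; a quick sketch using standard kubectl commands, assuming a Deployment named api:

# Blocks until the rollout succeeds or fails
kubectl rollout status deployment/api

# Revert to the previous ReplicaSet
kubectl rollout undo deployment/api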

Blue-Green Deployments

Run two complete environments:

┌─────────────────────────────────────────────┐
│                Load Balancer                │
│                      │                      │
│         ┌────────────┴────────────┐         │
│         ▼                         ▼         │
│  ┌─────────────┐           ┌─────────────┐  │
│  │    Blue     │           │    Green    │  │
│  │   (v1.0)    │           │   (v1.1)    │  │
│  │   ACTIVE    │           │   STANDBY   │  │
│  └─────────────┘           └─────────────┘  │
└─────────────────────────────────────────────┘

Process:

  1. Deploy new version to standby (Green)
  2. Test Green environment
  3. Switch traffic from Blue to Green
  4. Keep Blue running for rollback
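
In Kubernetes, the traffic switch in step 3 can be as simple as repointing a Service's label selector; a minimal sketch, assuming both Deployments label their pods with a version key:

# Service sends traffic to whichever version the selector names
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: green   # was "blue"; changing this flips all traffic
  ports:
  - port: 80
    targetPort: 8080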

Advantages: instant cutover, trivial rollback (switch back to Blue), and a full production-like environment to verify before the switch.

Disadvantages: double the infrastructure cost while both environments run, and the database is usually shared between them, so schema changes still need the care described below.

Canary Deployments

Gradual traffic shifting:

# Istio canary
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
  - api
  http:
  - route:
    - destination:
        host: api
        subset: v1
      weight: 95
    - destination:
        host: api
        subset: v2
      weight: 5

Process:

  1. Deploy new version alongside old
  2. Send small percentage of traffic to new
  3. Monitor metrics (errors, latency)
  4. Gradually increase traffic
  5. Complete migration or rollback

Monitoring criteria: error rate relative to the stable version, latency percentiles (p50/p99), and key business metrics such as checkout or signup completion. Promote only if the canary matches or beats the baseline.

Database Migrations

The Challenge

Database schema changes are the hardest part of zero downtime:

-- This locks the table while every row is rewritten (column type change)
ALTER TABLE users MODIFY COLUMN id BIGINT;

-- This breaks old code
ALTER TABLE users DROP COLUMN legacy_field;

Expand-Contract Pattern

Phase 1: Expand (backward compatible)

-- Add new column (nullable or with default)
ALTER TABLE users ADD COLUMN phone VARCHAR(20) DEFAULT NULL;

Old code continues to work and simply ignores the new column.

Phase 2: Migrate

-- Backfill data if needed
UPDATE users SET phone = legacy_phone WHERE phone IS NULL;

The application is then updated to use the new column.
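
On large tables, run the backfill in small batches so no single transaction holds locks for long. A MySQL-style sketch (PostgreSQL would need a primary-key subquery instead of LIMIT):

-- Repeat until the statement affects 0 rows
UPDATE users
SET phone = legacy_phone
WHERE phone IS NULL
LIMIT 1000;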

Phase 3: Contract (cleanup)

-- After all instances updated
ALTER TABLE users DROP COLUMN legacy_phone;

Safe Migration Practices

Add columns safely:

-- Safe: nullable
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

-- Safe: with default
ALTER TABLE users ADD COLUMN active BOOLEAN DEFAULT true;

-- Unsafe: NOT NULL without default
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NOT NULL;
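
One caveat: on older engines (MySQL before 8.0, PostgreSQL before 11), adding a column with a default rewrites the entire table, so "safe with default" depends on your database version.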

Remove columns safely:

# Step 1: Stop writing to column
# Deploy code that doesn't write to column

# Step 2: Stop reading from column
# Deploy code that doesn't read column

# Step 3: Remove column
# ALTER TABLE users DROP COLUMN old_field;

Rename columns safely:

-- Don't rename. Instead:
-- 1. Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- 2. Backfill
UPDATE users SET full_name = name;

-- 3. Update code to write both, read new

-- 4. Update code to only use new

-- 5. Drop old
ALTER TABLE users DROP COLUMN name;
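
The dual-write in step 3 might look like this in application code; a sketch assuming a hypothetical ORM-style User model:

# Keep the old column in sync while not-yet-updated instances still read it
def set_user_name(user, name):
    user.name = name        # old column, still read by old code
    user.full_name = name   # new column, the future source of truth
    user.save()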

Online Schema Migrations

For large tables, use online migration tools:

gh-ost (GitHub):

gh-ost \
  --host=db.example.com \
  --database=myapp \
  --table=users \
  --alter="ADD COLUMN phone VARCHAR(20)" \
  --execute

pt-online-schema-change (Percona):

pt-online-schema-change \
  --alter="ADD COLUMN phone VARCHAR(20)" \
  D=myapp,t=users \
  --execute

These tools copy the table to a shadow copy with the new schema, backfill rows in the background while capturing ongoing changes, then atomically swap the tables, avoiding long locks.

Application Patterns

Graceful Shutdown

Handle in-flight requests:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    router := http.NewServeMux() // register handlers here
    server := &http.Server{Addr: ":8080", Handler: router}

    go func() {
        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for termination signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    <-quit

    // Graceful shutdown with timeout: stop accepting new connections,
    // then let in-flight requests finish
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := server.Shutdown(ctx); err != nil {
        log.Printf("Server forced to shutdown: %v", err)
    }
}
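
Make sure the shutdown timeout fits inside the platform's grace period: on Kubernetes, terminationGracePeriodSeconds must exceed the 30-second timeout above (plus any preStop delay), or the pod is killed mid-drain.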

Health Checks

Separate liveness and readiness:

# Kubernetes probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

The corresponding handlers:

// Liveness: is the process healthy?
func healthzHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
}

// Readiness: can it accept traffic?
func readyHandler(w http.ResponseWriter, r *http.Request) {
    if !database.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    if !cache.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}
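
Readiness and graceful shutdown work best together: when the process receives SIGTERM, start failing the readiness probe so the orchestrator stops routing traffic before connections are cut. A minimal sketch:

package main

import (
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
)

var shuttingDown atomic.Bool

func readyHandler(w http.ResponseWriter, r *http.Request) {
    // Report not-ready as soon as shutdown begins
    if shuttingDown.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM)
    go func() {
        <-quit
        shuttingDown.Store(true) // readiness now fails; traffic drains away
    }()

    http.HandleFunc("/ready", readyHandler)
    http.ListenAndServe(":8080", nil)
}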

Connection Draining

Ensure requests complete before shutdown:

# Kubernetes termination
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]

The sleep gives the load balancer time to stop sending traffic before the pod terminates.

Backward Compatibility

Code must handle both old and new data:

# Handle missing field gracefully
def get_user_phone(user):
    return getattr(user, 'phone', None) or user.legacy_phone

# Handle API version differences
def parse_request(data):
    if 'user_id' in data:
        return data['user_id']
    elif 'userId' in data:  # Old format
        return data['userId']
    else:
        raise ValueError("Missing user identifier")

Infrastructure Patterns

Load Balancer Configuration

Connection draining:

# AWS ALB
target_group:
  deregistration_delay: 30  # seconds to drain connections
  slow_start: 30  # seconds to ramp up new targets

DNS Considerations

For DNS-based routing:

# Before migration
example.com  A  1.2.3.4  TTL=300

# Lower TTL in advance
example.com  A  1.2.3.4  TTL=60

# Make change
example.com  A  5.6.7.8  TTL=60

# After stable, raise TTL
example.com  A  5.6.7.8  TTL=3600
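
Note that some resolvers and clients cache records longer than the TTL allows, so keep the old address serving until traffic to it tails off completely.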

Feature Flags

Decouple deployment from release:

# Deploy code, control activation separately
if feature_flags.is_enabled('new_checkout', user_id=user.id):
    return new_checkout_flow(cart)
else:
    return legacy_checkout_flow(cart)

Benefits: code ships dark and is activated later, a misbehaving feature can be disabled instantly without a redeploy, and rollout can target a percentage of users or specific cohorts.
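
Percentage rollouts are usually implemented by bucketing users deterministically, so a given user stays consistently in or out of the feature. A minimal sketch of what a call like is_enabled might do under the hood (hypothetical implementation):

import hashlib

def is_enabled(flag, user_id, rollout_percent):
    # Hash flag+user into a stable bucket 0-99; the same user always
    # lands in the same bucket, so their experience doesn't flicker
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Roll the feature out to 5% of users
print(is_enabled("new_checkout", user_id=42, rollout_percent=5))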

Monitoring Zero Downtime

Key Metrics

# Monitor during deployment
error_rate:
  threshold: < 0.1%
  alert_if_exceeded: true

latency_p99:
  threshold: < 500ms
  alert_if_exceeded: true

active_connections:
  watch_for: drops to zero

request_rate:
  watch_for: sudden drops
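
With Prometheus, the error-rate check might be expressed as a query like this; a sketch assuming the conventional http_requests_total counter labeled by status code:

# Fraction of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))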

Deployment Monitoring

Track deployment progress:

# Deployment events
deployment_started{version="1.2.3"}
pods_updated{version="1.2.3", count=5}
deployment_completed{version="1.2.3", duration_seconds=120}

# Or failure
deployment_rolled_back{version="1.2.3", reason="error_rate"}

Automated Rollback

# Kubernetes with Flagger
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api
spec:
  analysis:
    threshold: 5  # max failed checks before rollback
    metrics:
    - name: request-success-rate
      threshold: 99
    - name: request-duration
      threshold: 500
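
Here threshold: 5 means the canary is rolled back after five failed metric checks; the built-in request-success-rate metric is a percentage and request-duration is measured in milliseconds.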

Checklist

Before deployment: confirm migrations are backward compatible, probes and health checks are configured, and the rollback path has been tested.

During deployment: watch error rate, latency, and request volume; verify old and new versions are serving side by side without errors.

After deployment: confirm metrics have stabilized, clean up stale feature-flag branches, and run the contract phase of any pending migrations.

Key Takeaways

Zero downtime deployment is achievable with the right patterns and tooling: incremental rollout, backward-compatible database migrations, graceful shutdown, and monitoring that can trigger automatic rollback. The investment pays off in reliability and deployment confidence.