Zero Downtime Deployments: Patterns and Practices

October 21, 2019

Zero downtime deployments used to be a luxury. Now they’re expected. Users don’t tolerate maintenance windows, and competitive pressure demands continuous delivery. But achieving true zero downtime requires careful orchestration across application code, databases, and infrastructure.

Here’s how to deploy without interruption.

Why Downtime Happens

Common Causes

Application restarts: while the old process stops and the new one boots, there is a window where nothing is serving requests.

Database changes: schema migrations can lock tables or break compatibility with code that is still running.

Traffic management: load balancers may keep routing to instances that are restarting or already terminated.

Dependencies: upgrading a shared service or library can force dependent services to restart at the same time.

Deployment Patterns

Rolling Deployments

Deploy new version incrementally:

# Kubernetes rolling update
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 25%

Process:

  1. Start new instance
  2. Wait for health check
  3. Begin sending traffic
  4. Terminate old instance
  5. Repeat until complete

Key settings: maxUnavailable: 0 keeps capacity at 100% for the whole rollout, while maxSurge: 25% lets Kubernetes temporarily run extra instances so the rollout still makes progress.
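
You can watch a rollout and revert it if something looks wrong; a quick sketch using standard kubectl commands, assuming a Deployment named api:

# Blocks until the rollout succeeds or fails
kubectl rollout status deployment/api

# Revert to the previous ReplicaSet
kubectl rollout undo deployment/api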

Blue-Green Deployments

Run two complete environments:

┌─────────────────────────────────────────────┐
│                Load Balancer                │
│                      │                      │
│         ┌────────────┴────────────┐         │
│         ▼                         ▼         │
│  ┌─────────────┐           ┌─────────────┐  │
│  │    Blue     │           │    Green    │  │
│  │   (v1.0)    │           │   (v1.1)    │  │
│  │   ACTIVE    │           │   STANDBY   │  │
│  └─────────────┘           └─────────────┘  │
└─────────────────────────────────────────────┘

Process:

  1. Deploy new version to standby (Green)
  2. Test Green environment
  3. Switch traffic from Blue to Green
  4. Keep Blue running for rollback
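
In Kubernetes, the traffic switch in step 3 can be as simple as repointing a Service's label selector; a minimal sketch, assuming both Deployments label their pods with a version key:

# Service sends traffic to whichever version the selector names
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: green   # was "blue"; changing this flips all traffic
  ports:
  - port: 80
    targetPort: 8080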

Advantages: instant cutover, trivial rollback (switch back to Blue), and a full production-like environment to verify before the switch.

Disadvantages: double the infrastructure cost while both environments run, and the database is usually shared between them, so schema changes still need the care described below.

Canary Deployments

Gradual traffic shifting:

# Istio canary
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
  - api
  http:
  - route:
    - destination:
        host: api
        subset: v1
      weight: 95
    - destination:
        host: api
        subset: v2
      weight: 5

Process:

  1. Deploy new version alongside old
  2. Send small percentage of traffic to new
  3. Monitor metrics (errors, latency)
  4. Gradually increase traffic
  5. Complete migration or rollback

Monitoring criteria: error rate relative to the stable version, latency percentiles (p50/p99), and key business metrics such as checkout or signup completion. Promote only if the canary matches or beats the baseline.

Database Migrations

The Challenge

Database schema changes are the hardest part of zero downtime:

-- This locks the table while every row is rewritten (column type change)
ALTER TABLE users MODIFY COLUMN id BIGINT;

-- This breaks old code
ALTER TABLE users DROP COLUMN legacy_field;

Expand-Contract Pattern

Phase 1: Expand (backward compatible)

-- Add new column (nullable or with default)
ALTER TABLE users ADD COLUMN phone VARCHAR(20) DEFAULT NULL;

Old code continues to work and simply ignores the new column.

Phase 2: Migrate

-- Backfill data if needed
UPDATE users SET phone = legacy_phone WHERE phone IS NULL;

The application is then updated to use the new column.
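
On large tables, run the backfill in small batches so no single transaction holds locks for long. A MySQL-style sketch (PostgreSQL would need a primary-key subquery instead of LIMIT):

-- Repeat until the statement affects 0 rows
UPDATE users
SET phone = legacy_phone
WHERE phone IS NULL
LIMIT 1000;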

Phase 3: Contract (cleanup)

-- After all instances updated
ALTER TABLE users DROP COLUMN legacy_phone;

Safe Migration Practices

Add columns safely:

-- Safe: nullable
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

-- Safe: with default
ALTER TABLE users ADD COLUMN active BOOLEAN DEFAULT true;

-- Unsafe: NOT NULL without default
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NOT NULL;
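
One caveat: on older engines (MySQL before 8.0, PostgreSQL before 11), adding a column with a default rewrites the entire table, so "safe with default" depends on your database version.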

Remove columns safely:

# Step 1: Stop writing to column
# Deploy code that doesn't write to column

# Step 2: Stop reading from column
# Deploy code that doesn't read column

# Step 3: Remove column
# ALTER TABLE users DROP COLUMN old_field;

Rename columns safely:

-- Don't rename. Instead:
-- 1. Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- 2. Backfill
UPDATE users SET full_name = name;

-- 3. Update code to write both, read new

-- 4. Update code to only use new

-- 5. Drop old
ALTER TABLE users DROP COLUMN name;
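
The dual-write in step 3 might look like this in application code; a sketch assuming a hypothetical ORM-style User model:

# Keep the old column in sync while not-yet-updated instances still read it
def set_user_name(user, name):
    user.name = name        # old column, still read by old code
    user.full_name = name   # new column, the future source of truth
    user.save()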

Online Schema Migrations

For large tables, use online migration tools:

gh-ost (GitHub):

gh-ost \
  --host=db.example.com \
  --database=myapp \
  --table=users \
  --alter="ADD COLUMN phone VARCHAR(20)" \
  --execute

pt-online-schema-change (Percona):

pt-online-schema-change \
  --alter="ADD COLUMN phone VARCHAR(20)" \
  D=myapp,t=users \
  --execute

These tools copy the table to a shadow copy with the new schema, backfill rows in the background while capturing ongoing changes, then atomically swap the tables, avoiding long locks.

Application Patterns

Graceful Shutdown

Handle in-flight requests:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    router := http.NewServeMux() // register handlers here
    server := &http.Server{Addr: ":8080", Handler: router}

    go func() {
        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for termination signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    <-quit

    // Graceful shutdown with timeout: stop accepting new connections,
    // then let in-flight requests finish
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := server.Shutdown(ctx); err != nil {
        log.Printf("Server forced to shutdown: %v", err)
    }
}
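
Make sure the shutdown timeout fits inside the platform's grace period: on Kubernetes, terminationGracePeriodSeconds must exceed the 30-second timeout above (plus any preStop delay), or the pod is killed mid-drain.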

Health Checks

Separate liveness and readiness:

# Kubernetes probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

The corresponding handlers:

// Liveness: is the process healthy?
func healthzHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
}

// Readiness: can it accept traffic?
func readyHandler(w http.ResponseWriter, r *http.Request) {
    if !database.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    if !cache.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}
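
Readiness and graceful shutdown work best together: when the process receives SIGTERM, start failing the readiness probe so the orchestrator stops routing traffic before connections are cut. A minimal sketch:

package main

import (
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
)

var shuttingDown atomic.Bool

func readyHandler(w http.ResponseWriter, r *http.Request) {
    // Report not-ready as soon as shutdown begins
    if shuttingDown.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM)
    go func() {
        <-quit
        shuttingDown.Store(true) // readiness now fails; traffic drains away
    }()

    http.HandleFunc("/ready", readyHandler)
    http.ListenAndServe(":8080", nil)
}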

Connection Draining

Ensure requests complete before shutdown:

# Kubernetes termination
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]

The sleep gives the load balancer time to stop sending traffic before the pod terminates.

Backward Compatibility

Code must handle both old and new data:

# Handle missing field gracefully
def get_user_phone(user):
    return getattr(user, 'phone', None) or user.legacy_phone

# Handle API version differences
def parse_request(data):
    if 'user_id' in data:
        return data['user_id']
    elif 'userId' in data:  # Old format
        return data['userId']
    else:
        raise ValueError("Missing user identifier")

Infrastructure Patterns

Load Balancer Configuration

Connection draining:

# AWS ALB
target_group:
  deregistration_delay: 30  # seconds to drain connections
  slow_start: 30  # seconds to ramp up new targets

DNS Considerations

For DNS-based routing:

# Before migration
example.com  A  1.2.3.4  TTL=300

# Lower TTL in advance
example.com  A  1.2.3.4  TTL=60

# Make change
example.com  A  5.6.7.8  TTL=60

# After stable, raise TTL
example.com  A  5.6.7.8  TTL=3600
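
Note that some resolvers and clients cache records longer than the TTL allows, so keep the old address serving until traffic to it tails off completely.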

Feature Flags

Decouple deployment from release:

# Deploy code, control activation separately
if feature_flags.is_enabled('new_checkout', user_id=user.id):
    return new_checkout_flow(cart)
else:
    return legacy_checkout_flow(cart)

Benefits: code ships dark and is activated later, a misbehaving feature can be disabled instantly without a redeploy, and rollout can target a percentage of users or specific cohorts.
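
Percentage rollouts are usually implemented by bucketing users deterministically, so a given user stays consistently in or out of the feature. A minimal sketch of what a call like is_enabled might do under the hood (hypothetical implementation):

import hashlib

def is_enabled(flag, user_id, rollout_percent):
    # Hash flag+user into a stable bucket 0-99; the same user always
    # lands in the same bucket, so their experience doesn't flicker
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Roll the feature out to 5% of users
print(is_enabled("new_checkout", user_id=42, rollout_percent=5))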

Monitoring Zero Downtime

Key Metrics

# Monitor during deployment
error_rate:
  threshold: < 0.1%
  alert_if_exceeded: true

latency_p99:
  threshold: < 500ms
  alert_if_exceeded: true

active_connections:
  watch_for: drops to zero

request_rate:
  watch_for: sudden drops
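
With Prometheus, the error-rate check might be expressed as a query like this; a sketch assuming the conventional http_requests_total counter labeled by status code:

# Fraction of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))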

Deployment Monitoring

Track deployment progress:

# Deployment events
deployment_started{version="1.2.3"}
pods_updated{version="1.2.3", count=5}
deployment_completed{version="1.2.3", duration_seconds=120}

# Or failure
deployment_rolled_back{version="1.2.3", reason="error_rate"}

Automated Rollback

# Kubernetes with Flagger
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api
spec:
  analysis:
    threshold: 5  # max failed checks before rollback
    metrics:
    - name: request-success-rate
      threshold: 99
    - name: request-duration
      threshold: 500
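
Here threshold: 5 means the canary is rolled back after five failed metric checks; the built-in request-success-rate metric is a percentage and request-duration is measured in milliseconds.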

Checklist

Before deployment: confirm migrations are backward compatible, probes and health checks are configured, and the rollback path has been tested.

During deployment: watch error rate, latency, and request volume; verify old and new versions are serving side by side without errors.

After deployment: confirm metrics have stabilized, clean up stale feature-flag branches, and run the contract phase of any pending migrations.

Key Takeaways

Zero downtime deployment is achievable with the right patterns and tooling: incremental rollout, backward-compatible database migrations, graceful shutdown, and monitoring that can trigger automatic rollback. The investment pays off in reliability and deployment confidence.