Zero downtime deployments used to be a luxury. Now they’re expected. Users don’t tolerate maintenance windows, and competitive pressure demands continuous delivery. But achieving true zero downtime requires careful orchestration across application code, databases, and infrastructure.
Here’s how to deploy without interruption.
Why Downtime Happens
Common Causes
Application restarts:
- Old instances terminated before new ones are ready
- Health checks not configured properly
- Startup time exceeds termination grace period
Database changes:
- Schema migrations lock tables
- Breaking changes to data formats
- Connection pool exhaustion during migration
Traffic management:
- Load balancer not updated
- DNS propagation delays
- Connection draining not configured
Dependencies:
- Downstream services unavailable
- Configuration not propagated
- Secrets rotation issues
Deployment Patterns
Rolling Deployments
Deploy new version incrementally:
# Kubernetes rolling update
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 25%
Process:
- Start new instance
- Wait for health check
- Begin sending traffic
- Terminate old instance
- Repeat until complete
Key settings:
- maxUnavailable: 0 (never reduce capacity)
- Proper readiness probes
- Sufficient replicas
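If a deployment script needs to block until the rollout completes, the same conditions that kubectl rollout status checks can be read through the API. A minimal sketch using the official Kubernetes Python client; the deployment name and namespace are placeholders:

import time
from kubernetes import client, config

def wait_for_rollout(name, namespace="default", timeout=300):
    # Load kubeconfig (use config.load_incluster_config() inside a cluster)
    config.load_kube_config()
    apps = client.AppsV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment(name, namespace)
        desired = dep.spec.replicas or 0
        status = dep.status
        done = (
            (status.observed_generation or 0) >= dep.metadata.generation
            and (status.updated_replicas or 0) == desired
            and (status.available_replicas or 0) == desired
        )
        if done:
            return True
        time.sleep(5)
    return False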
Blue-Green Deployments
Run two complete environments:
┌─────────────────────────────────────────────────┐
│                  Load Balancer                  │
│                        │                        │
│            ┌───────────┴───────────┐            │
│            ▼                       ▼            │
│     ┌─────────────┐         ┌─────────────┐     │
│     │    Blue     │         │    Green    │     │
│     │   (v1.0)    │         │   (v1.1)    │     │
│     │   ACTIVE    │         │   STANDBY   │     │
│     └─────────────┘         └─────────────┘     │
└─────────────────────────────────────────────────┘
Process:
- Deploy new version to standby (Green)
- Test Green environment
- Switch traffic from Blue to Green
- Keep Blue running for rollback
Advantages:
- Instant rollback (switch back)
- Full testing in the production environment before the switch
- Clean separation
Disadvantages:
- Double infrastructure cost
- Database complexity (shared vs. separate)
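The cutover itself (step 3 above) is often a single load balancer change. A sketch of a blue-green switch on AWS with boto3, assuming an ALB listener that forwards to one target group per environment; the ARNs are placeholders:

import boto3

# Placeholder ARNs - substitute your own listener and target groups
LISTENER_ARN = "arn:aws:elasticloadbalancing:region:account:listener/app/example/abc/def"
GREEN_TG_ARN = "arn:aws:elasticloadbalancing:region:account:targetgroup/green/123"

elbv2 = boto3.client("elbv2")

# Point the listener's default action at the Green target group.
# Rolling back is the same call with the Blue target group's ARN.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": GREEN_TG_ARN}],
)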
Canary Deployments
Gradual traffic shifting:
# Istio canary
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
  - api
  http:
  - route:
    - destination:
        host: api
        subset: v1
      weight: 95
    - destination:
        host: api
        subset: v2
      weight: 5
Process:
- Deploy new version alongside old
- Send small percentage of traffic to new
- Monitor metrics (errors, latency)
- Gradually increase traffic
- Complete migration or rollback
Monitoring criteria:
- Error rate not increasing
- Latency within bounds
- No unusual patterns in logs
- Business metrics stable
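Outside a service mesh, the same progression can be scripted. A sketch of the control loop; set_canary_weight and canary_healthy are hypothetical callables that update the routing rule and evaluate the criteria above:

import time

STEPS = [5, 10, 25, 50, 100]   # traffic percentages for the canary

def run_canary(set_canary_weight, canary_healthy, soak_seconds=300):
    for weight in STEPS:
        set_canary_weight(weight)
        time.sleep(soak_seconds)      # let traffic and metrics accumulate
        if not canary_healthy():
            set_canary_weight(0)      # roll back: all traffic to the stable version
            return False
    return True                       # canary successfully took 100% of traffic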
Database Migrations
The Challenge
Database schema changes are the hardest part of zero downtime:
-- This rewrites the whole table and locks it while it runs
ALTER TABLE users ALTER COLUMN id TYPE BIGINT;
-- This breaks old code
ALTER TABLE users DROP COLUMN legacy_field;
Expand-Contract Pattern
Phase 1: Expand (backward compatible)
-- Add new column (nullable or with default)
ALTER TABLE users ADD COLUMN phone VARCHAR(20) DEFAULT NULL;
Old code continues to work; it simply ignores the new column.
Phase 2: Migrate
-- Backfill data if needed
UPDATE users SET phone = legacy_phone WHERE phone IS NULL;
Application updated to use new column.
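On large tables, run the backfill in small batches so each transaction stays short and row locks are released quickly. A sketch assuming PostgreSQL and psycopg2; the batch size and pause are illustrative:

import time
import psycopg2

conn = psycopg2.connect("dbname=myapp")
conn.autocommit = True   # commit after every batch

BATCH_SIZE = 1000
with conn.cursor() as cur:
    while True:
        cur.execute(
            """
            UPDATE users SET phone = legacy_phone
            WHERE id IN (
                SELECT id FROM users
                WHERE phone IS NULL AND legacy_phone IS NOT NULL
                LIMIT %s
            )
            """,
            (BATCH_SIZE,),
        )
        if cur.rowcount == 0:
            break                # nothing left to backfill
        time.sleep(0.1)          # give concurrent transactions room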
Phase 3: Contract (cleanup)
-- After all instances updated
ALTER TABLE users DROP COLUMN legacy_phone;
Safe Migration Practices
Add columns safely:
-- Safe: nullable or with default
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Safe: with default
ALTER TABLE users ADD COLUMN active BOOLEAN DEFAULT true;
-- Unsafe: NOT NULL without default
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NOT NULL;
Remove columns safely:
# Step 1: Stop writing to column
# Deploy code that doesn't write to column
# Step 2: Stop reading from column
# Deploy code that doesn't read column
# Step 3: Remove column
# ALTER TABLE users DROP COLUMN old_field;
Rename columns safely:
-- Don't rename. Instead:
-- 1. Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- 2. Backfill
UPDATE users SET full_name = name;
-- 3. Update code to write both, read new
-- 4. Update code to only use new
-- 5. Drop old
ALTER TABLE users DROP COLUMN name;
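Steps 3 and 4 live in application code. A sketch of the transitional data access layer during the rename, using a plain DB-API cursor; the function names are hypothetical:

# Transition phase: write both columns, read only the new one
def save_user_name(cursor, user_id, full_name):
    cursor.execute(
        "UPDATE users SET full_name = %s, name = %s WHERE id = %s",
        (full_name, full_name, user_id),
    )

def get_user_name(cursor, user_id):
    cursor.execute("SELECT full_name FROM users WHERE id = %s", (user_id,))
    row = cursor.fetchone()
    return row[0] if row else None

Once every instance runs code like this, the old column has no readers or writers left and can be dropped.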
Online Schema Migrations
For large tables, use online migration tools:
gh-ost (GitHub):
gh-ost \
--host=db.example.com \
--database=myapp \
--table=users \
--alter="ADD COLUMN phone VARCHAR(20)" \
--execute
pt-online-schema-change (Percona):
pt-online-schema-change \
--alter="ADD COLUMN phone VARCHAR(20)" \
D=myapp,t=users \
--execute
These tools copy rows into a shadow table and then swap it in with a quick rename, avoiding long table locks.
Application Patterns
Graceful Shutdown
Handle in-flight requests:
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    router := http.NewServeMux() // stand-in for your real router
    server := &http.Server{Addr: ":8080", Handler: router}

    // Serve in a goroutine so main can block on the signal below
    go func() {
        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for termination signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    <-quit

    // Graceful shutdown with timeout: stop accepting new connections,
    // let in-flight requests finish, then exit
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := server.Shutdown(ctx); err != nil {
        log.Printf("Server forced to shutdown: %v", err)
    }
}
Health Checks
Separate liveness and readiness:
# Kubernetes probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
// Liveness: is the process healthy?
func healthzHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
}

// Readiness: can it accept traffic?
func readyHandler(w http.ResponseWriter, r *http.Request) {
    if !database.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    if !cache.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}
Connection Draining
Ensure requests complete before shutdown:
# Kubernetes termination
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]
The sleep gives the load balancer time to stop sending traffic before the pod terminates.
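Inside the application, the readiness endpoint can also start failing as soon as SIGTERM arrives, so the pod is removed from service endpoints before shutdown begins. A minimal sketch, assuming Flask and a single process that receives the signal directly:

import signal
from flask import Flask

app = Flask(__name__)
ready = True

@app.route("/ready")
def ready_check():
    # Fail the readiness probe once shutdown has been requested
    return ("ok", 200) if ready else ("shutting down", 503)

def handle_sigterm(signum, frame):
    global ready
    ready = False   # the actual graceful shutdown still happens elsewhere

signal.signal(signal.SIGTERM, handle_sigterm)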
Backward Compatibility
Code must handle both old and new data:
# Handle missing field gracefully
def get_user_phone(user):
    return getattr(user, 'phone', None) or user.legacy_phone

# Handle API version differences
def parse_request(data):
    if 'user_id' in data:
        return data['user_id']
    elif 'userId' in data:  # Old format
        return data['userId']
    else:
        raise ValueError("Missing user identifier")
Infrastructure Patterns
Load Balancer Configuration
Connection draining:
# AWS ALB
target_group:
  deregistration_delay: 30   # seconds to drain connections
  slow_start: 30             # seconds to ramp up new targets
DNS Considerations
For DNS-based routing:
- Use low TTLs before changes
- Account for client caching
- Consider intermediate caches
# Before migration
example.com A 1.2.3.4 TTL=300
# Lower TTL in advance
example.com A 1.2.3.4 TTL=60
# Make change
example.com A 5.6.7.8 TTL=60
# After stable, raise TTL
example.com A 5.6.7.8 TTL=3600
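Before the cutover, it is worth confirming that resolvers actually see the lowered TTL. A quick check with dnspython (an assumed dependency; example.com stands in for the real record):

import dns.resolver

# Shows the remaining TTL as seen by the local resolver's cache
answer = dns.resolver.resolve("example.com", "A")
print("TTL:", answer.rrset.ttl)
for record in answer:
    print("A:", record.address)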
Feature Flags
Decouple deployment from release:
# Deploy code, control activation separately
if feature_flags.is_enabled('new_checkout', user_id=user.id):
    return new_checkout_flow(cart)
else:
    return legacy_checkout_flow(cart)
Benefits:
- Deploy anytime, activate when ready
- Gradual rollout by percentage
- Instant rollback (disable flag)
- A/B testing built-in
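Percentage rollout usually works by hashing the user ID into a stable bucket, so each user consistently sees the same variant as the percentage grows. A sketch of that logic; the in-memory flag store is hypothetical and would normally come from a flag service or config:

import hashlib

ROLLOUT = {"new_checkout": 5}   # flag name -> rollout percentage (0-100)

def is_enabled(flag, user_id):
    percentage = ROLLOUT.get(flag, 0)
    # Stable bucket in [0, 100) derived from the flag name and user id
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage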
Monitoring Zero Downtime
Key Metrics
# Monitor during deployment
error_rate:
  threshold: < 0.1%
  alert_if_exceeded: true

latency_p99:
  threshold: < 500ms
  alert_if_exceeded: true

active_connections:
  watch_for: drops to zero

request_rate:
  watch_for: sudden drops
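These checks can run automatically during the rollout. A sketch that queries Prometheus over its HTTP API and applies the thresholds above; the Prometheus URL and metric names are assumptions, so substitute your own:

import requests

PROMETHEUS = "http://prometheus:9090/api/v1/query"   # assumed address

def query(promql):
    resp = requests.get(PROMETHEUS, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

error_rate = query(
    'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
)
latency_p99 = query(
    'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
)

if error_rate > 0.001 or latency_p99 > 0.5:   # 0.1% errors, 500ms p99
    print("Thresholds exceeded - roll back")
else:
    print("Deployment healthy")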
Deployment Monitoring
Track deployment progress:
# Deployment events
deployment_started{version="1.2.3"}
pods_updated{version="1.2.3", count=5}
deployment_completed{version="1.2.3", duration_seconds=120}
# Or failure
deployment_rolled_back{version="1.2.3", reason="error_rate"}
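A short-lived deploy job can record these events by pushing a labeled counter to a Prometheus Pushgateway. A sketch with prometheus_client (an assumed dependency; the gateway address and label values are illustrative):

from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
events = Counter(
    "deployment_events_total",
    "Deployment lifecycle events",
    ["version", "event"],
    registry=registry,
)

# Record an event, e.g. when the rollout finishes (or rolls back)
events.labels(version="1.2.3", event="completed").inc()
push_to_gateway("pushgateway:9091", job="deploy", registry=registry)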
Automated Rollback
# Kubernetes with Flagger
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api
spec:
  analysis:
    threshold: 5   # max failed checks before rollback
    metrics:
    - name: request-success-rate
      threshold: 99
    - name: request-duration
      threshold: 500
Checklist
Before deployment:
- Database migrations are backward compatible
- New code handles old data formats
- Health checks are properly configured
- Graceful shutdown is implemented
- Rollback plan is documented
During deployment:
- Monitoring dashboards open
- Error rates stable
- Latency within bounds
- No connection drops
After deployment:
- All pods running new version
- Metrics stable for 15+ minutes
- Old resources cleaned up
- Document any issues for next time
Key Takeaways
- Zero downtime requires coordination across application, database, and infrastructure
- Use rolling deployments with maxUnavailable: 0 for basic zero downtime
- Blue-green deployments enable instant rollback at the cost of double infrastructure
- Canary deployments allow gradual rollout with automatic rollback
- Database migrations need expand-contract pattern for zero downtime
- Never remove or rename columns directly; always use multi-phase approach
- Implement graceful shutdown to complete in-flight requests
- Separate liveness probes (is it alive?) from readiness probes (can it serve?)
- Feature flags decouple deployment from release
- Monitor error rates, latency, and request rates during deployment
Zero downtime deployment is achievable with proper patterns and tooling. The investment pays off in reliability and deployment confidence.