Database Migration Strategies for Zero Downtime

February 21, 2022

Database migrations are among the riskiest operations in software engineering. Unlike application code, a database change can't simply be rolled back by redeploying an old version. A bad migration can bring down production, corrupt data, or lock tables for hours. Migrating with zero downtime requires specific patterns.

Here’s how to migrate databases safely.

The Risk Landscape

What Can Go Wrong

migration_risks:
  locks:
    - Schema changes lock tables
    - Reads and writes blocked
    - Application timeouts
    - Cascade to dependent services

  data_loss:
    - Destructive changes
    - Incorrect transformations
    - Race conditions

  incompatibility:
    - New schema, old code
    - Old schema, new code
    - Rollback impossible

  performance:
    - Large data copies
    - Index rebuilds
    - Vacuum/analyze storms

Safe Migration Patterns

Expand and Contract

expand_contract_pattern:
  principle: |
    Never make breaking changes directly.
    Add new things, migrate data, then remove old things.

  phases:
    expand:
      - Add new column/table
      - Support both old and new
      - Deploy application that writes to both

    migrate:
      - Backfill data
      - Verify consistency
      - Switch reads to new

    contract:
      - Remove old code paths
      - Remove old column/table

Example: Renaming a Column

-- WRONG: Breaking change
ALTER TABLE users RENAME COLUMN name TO full_name;

-- RIGHT: Expand and contract

-- Phase 1: Expand (add new column)
ALTER TABLE users ADD COLUMN full_name TEXT;

-- Phase 2: Migrate (backfill and dual-write)
-- Application writes to both columns
-- (on large tables, batch this update; see Backfill Patterns below)
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Phase 3: Contract (remove old)
-- After all code uses full_name:
ALTER TABLE users DROP COLUMN name;
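
Before switching reads to the new column, the migrate phase calls for a consistency check. A minimal sketch of such a check with psycopg2 (the connection string is an assumption):

# Hypothetical consistency check before switching reads to full_name
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions

with conn.cursor() as cur:
    # Rows where the old and new columns disagree (NULL-safe comparison)
    cur.execute("""
        SELECT count(*)
        FROM users
        WHERE full_name IS DISTINCT FROM name
    """)
    mismatches = cur.fetchone()[0]

if mismatches == 0:
    print("Backfill consistent; safe to switch reads to full_name")
else:
    print(f"{mismatches} rows still diverge; keep dual-writing and re-run the backfill")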

Example: Splitting a Table

-- Original: users table with address embedded
-- Target: separate addresses table

-- Phase 1: Create new table
CREATE TABLE addresses (
    id SERIAL PRIMARY KEY,
    user_id INTEGER UNIQUE REFERENCES users(id),  -- UNIQUE makes re-running the backfill safe
    street TEXT,
    city TEXT,
    country TEXT
);

-- Phase 2: Backfill (batched; a driver loop is sketched after this example)
INSERT INTO addresses (user_id, street, city, country)
SELECT id, street, city, country
FROM users
WHERE id BETWEEN $1 AND $2
ON CONFLICT DO NOTHING;

-- Phase 3: Application reads from addresses
-- Phase 4: Remove columns from users
ALTER TABLE users
    DROP COLUMN street,
    DROP COLUMN city,
    DROP COLUMN country;
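
The $1 and $2 placeholders in the phase 2 backfill are meant to be driven by a batching loop. One possible driver, sketched with psycopg2 (batch size, sleep, and connection string are assumptions):

# Hypothetical driver for the range-batched backfill above
import time
import psycopg2

BATCH = 10_000
conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions

with conn.cursor() as cur:
    cur.execute("SELECT min(id), max(id) FROM users")
    lo, hi = cur.fetchone()  # assumes users is non-empty

for start in range(lo, hi + 1, BATCH):
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO addresses (user_id, street, city, country)
            SELECT id, street, city, country
            FROM users
            WHERE id BETWEEN %s AND %s
            ON CONFLICT DO NOTHING
        """, (start, start + BATCH - 1))
    conn.commit()      # Keep each batch in its own small transaction
    time.sleep(0.1)    # Give other queries a chance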

Non-Blocking Operations

PostgreSQL Specifics

-- BLOCKING: Standard index creation
CREATE INDEX idx_users_email ON users(email);
-- Locks table for duration!

-- NON-BLOCKING: Concurrent index
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Doesn't block reads or writes, but takes longer and can fail

-- A failed concurrent build leaves an INVALID index behind;
-- drop it before retrying:
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;

-- BLOCKING: Adding NOT NULL with default
ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'active';
-- Rewrites the entire table on PostgreSQL 10 and older!

-- NON-BLOCKING approach:
-- PostgreSQL 11+: a non-volatile default no longer forces a rewrite
ALTER TABLE users ADD COLUMN status TEXT DEFAULT 'active';
-- Later, once every row has a value:
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
-- Note: SET NOT NULL still scans the table under an exclusive lock;
-- on PostgreSQL 12+ a pre-validated CHECK (status IS NOT NULL) constraint lets it skip the scan

Lock Timeout Strategy

-- Set aggressive lock timeout
SET lock_timeout = '5s';

-- Attempt migration
ALTER TABLE users ADD COLUMN new_col TEXT;

-- If timeout, retry later
-- Don't block production traffic

# Retry logic for migrations (a sketch assuming psycopg2 and an open connection)
import logging
import time

from psycopg2 import errors

logger = logging.getLogger(__name__)

class MigrationFailed(Exception):
    pass

def safe_migrate(connection, migration_func, max_retries=5):
    for attempt in range(max_retries):
        try:
            with connection.cursor() as cursor:
                cursor.execute("SET lock_timeout = '5s'")
                migration_func(cursor)
                connection.commit()
                return True
        except errors.LockNotAvailable:
            connection.rollback()  # Clear the aborted transaction before retrying
            logger.warning(f"Migration locked, attempt {attempt + 1}")
            time.sleep(30)  # Wait and retry

    raise MigrationFailed("Could not acquire lock")
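
With that wrapper, each DDL statement becomes a retried call. A usage sketch (the column name is illustrative, and the connection is assumed to be open):

# Example usage of safe_migrate
safe_migrate(connection, lambda cur: cur.execute(
    "ALTER TABLE users ADD COLUMN new_col TEXT"
))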

Backfill Patterns

Batched Updates

# DON'T: Update all at once
cursor.execute("UPDATE users SET full_name = name")
# One huge, long-running transaction that row-locks every row

# DO: Batch updates
BATCH_SIZE = 1000

while True:
    cursor.execute("""
        UPDATE users
        SET full_name = name
        WHERE id IN (
            SELECT id FROM users
            WHERE full_name IS NULL
            LIMIT %s
        )
        RETURNING id
    """, [BATCH_SIZE])

    updated = cursor.fetchall()
    connection.commit()

    if len(updated) < BATCH_SIZE:
        break

    time.sleep(0.1)  # Give other queries a chance

Online Backfill with pt-online-schema-change

# For MySQL: pt-online-schema-change
pt-online-schema-change \
    --alter "ADD COLUMN full_name VARCHAR(255)" \
    --execute \
    --chunk-size=1000 \
    --max-load="Threads_running=25" \
    D=mydb,t=users

gh-ost for MySQL

# GitHub's gh-ost
gh-ost \
    --user="root" \
    --password="***" \
    --host=db.example.com \
    --database="mydb" \
    --table="users" \
    --alter="ADD COLUMN full_name VARCHAR(255)" \
    --execute

Application Compatibility

Forward and Backward Compatible Code

# Code must work with both old and new schema
# (sketch of an ORM-style model; the BaseModel providing save() is assumed)

# During migration period:
class User(BaseModel):
    def get_name(self):
        # Try new column first, fall back to old
        if getattr(self, 'full_name', None):
            return self.full_name
        return self.name

    def save(self):
        # Write to both columns during the dual-write window
        self.full_name = self.full_name or self.name
        super().save()

Database Views for Compatibility

-- Create view that presents old interface
CREATE VIEW users_v1 AS
SELECT
    id,
    full_name AS name,  -- Map new to old
    email,
    created_at
FROM users;

-- Old code uses view, new code uses table
-- Remove view when old code is gone

Migration Workflow

Pre-Migration

pre_migration_checklist:
  planning:
    - [ ] Document the migration steps
    - [ ] Identify locking operations
    - [ ] Plan rollback strategy
    - [ ] Estimate data volume and time (see the sketch after this checklist)

  testing:
    - [ ] Test on production-like data
    - [ ] Measure duration
    - [ ] Verify application compatibility
    - [ ] Test rollback

  preparation:
    - [ ] Schedule maintenance window (if needed)
    - [ ] Notify stakeholders
    - [ ] Ensure backup is current
    - [ ] Prepare monitoring
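
For the estimate and measurement items, a rough number can come from timing one representative batch on a production-like copy and extrapolating. A sketch (batch size, table, and connection string are assumptions):

# Hypothetical duration estimate: time one batch, extrapolate to the whole table
import time
import psycopg2

BATCH = 1_000
conn = psycopg2.connect("dbname=mydb_staging")  # ideally a production-like copy

with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM users WHERE full_name IS NULL")
    remaining = cur.fetchone()[0]

    start = time.monotonic()
    cur.execute("""
        UPDATE users SET full_name = name
        WHERE id IN (SELECT id FROM users WHERE full_name IS NULL LIMIT %s)
    """, (BATCH,))
    per_batch = time.monotonic() - start

conn.rollback()  # Discard the trial batch; this run was only a measurement

batches = max(1, remaining // BATCH)
print(f"~{remaining} rows, ~{per_batch:.2f}s per {BATCH}-row batch, "
      f"~{batches * per_batch / 60:.1f} minutes total")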

During Migration

during_migration:
  monitoring:
    - Watch lock wait times
    - Monitor replication lag
    - Track query latency
    - Watch error rates

  checkpoints:
    - Verify each step completed
    - Check data consistency
    - Validate application behavior

  abort_criteria:
    - Lock timeout exceeded X times
    - Replication lag > threshold (see the check sketched below)
    - Error rate increase
    - Data inconsistency detected
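
The replication-lag and lock-wait abort criteria can be polled between migration steps. A minimal sketch against PostgreSQL's statistics views (threshold and connection string are assumptions):

# Hypothetical abort-criteria check between migration steps (PostgreSQL 10+)
import psycopg2

LAG_THRESHOLD_S = 10
conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
conn.autocommit = True  # read-only polling; no transaction needed

def should_abort():
    with conn.cursor() as cur:
        # Worst replay lag across replicas
        cur.execute("""
            SELECT coalesce(max(extract(epoch FROM replay_lag)), 0)
            FROM pg_stat_replication
        """)
        lag = cur.fetchone()[0]

        # Sessions currently waiting on locks
        cur.execute("SELECT count(*) FROM pg_locks WHERE NOT granted")
        waiting = cur.fetchone()[0]

    return lag > LAG_THRESHOLD_S or waiting > 0

if should_abort():
    print("Abort criteria met: pause the migration and investigate")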

Post-Migration

post_migration:
  verification:
    - [ ] Data integrity checks
    - [ ] Application functionality
    - [ ] Performance metrics
    - [ ] Replication status

  cleanup:
    - [ ] Remove temporary objects
    - [ ] Update documentation
    - [ ] Archive migration scripts

  monitoring:
    - Continue monitoring for 24-48 hours
    - Watch for delayed issues

Rollback Strategies

Design for Rollback

rollback_planning:
  additive_changes:
    migration: Add new column
    rollback: Just deploy old code (column ignored)

  data_changes:
    migration: Backfill transformation
    rollback: Reverse transformation or restore from backup

  destructive_changes:
    migration: Drop column
    rollback: NOT POSSIBLE - this is why we wait

Rollback Checklist

before_destructive_operations:
  - [ ] All code using new schema deployed
  - [ ] No rollback to old code planned
  - [ ] Sufficient time passed (usually 1+ weeks)
  - [ ] Backup of dropped data exists
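
For the last item, one lightweight way to keep a backup of data about to be destroyed is to snapshot it into a side table just before the DROP. A sketch (the backup table name is an assumption):

# Hypothetical backup of a column about to be dropped
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
with conn.cursor() as cur:
    # Snapshot the old column so the DROP can still be undone manually
    cur.execute("CREATE TABLE users_name_backup AS SELECT id, name FROM users")
    cur.execute("ALTER TABLE users DROP COLUMN name")
conn.commit()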

Key Takeaways

Database migrations require discipline and patience. Use expand-and-contract instead of in-place breaking changes, prefer non-blocking operations and batched backfills, keep application code compatible with both schemas during the transition, and postpone destructive steps until a rollback is no longer plausible. Rushing any of these steps leads to outages.