Database Migration Strategies for Zero Downtime

February 21, 2022

Database migrations are among the riskiest operations in software engineering. Unlike application code, a database change can't simply be rolled back by redeploying an old version. A bad migration can bring down production, corrupt data, or lock tables for hours. Migrating with zero downtime requires specific patterns.

Here’s how to migrate databases safely.

The Risk Landscape

What Can Go Wrong

migration_risks:
  locks:
    - Schema changes lock tables
    - Reads and writes blocked
    - Application timeouts
    - Cascade to dependent services

  data_loss:
    - Destructive changes
    - Incorrect transformations
    - Race conditions

  incompatibility:
    - New schema, old code
    - Old schema, new code
    - Rollback impossible

  performance:
    - Large data copies
    - Index rebuilds
    - Vacuum/analyze storms

Safe Migration Patterns

Expand and Contract

expand_contract_pattern:
  principle: |
    Never make breaking changes directly.
    Add new things, migrate data, then remove old things.

  phases:
    expand:
      - Add new column/table
      - Support both old and new
      - Deploy application that writes to both

    migrate:
      - Backfill data
      - Verify consistency
      - Switch reads to new

    contract:
      - Remove old code paths
      - Remove old column/table

Example: Renaming a Column

-- WRONG: Breaking change
ALTER TABLE users RENAME COLUMN name TO full_name;

-- RIGHT: Expand and contract

-- Phase 1: Expand (add new column)
ALTER TABLE users ADD COLUMN full_name TEXT;

-- Phase 2: Migrate (backfill and dual-write)
-- Application writes to both columns
-- (on large tables, batch this update; see Backfill Patterns below)
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Phase 3: Contract (remove old)
-- After all code uses full_name:
ALTER TABLE users DROP COLUMN name;
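
Before switching reads to the new column, the migrate phase calls for a consistency check. A minimal sketch of such a check with psycopg2 (the connection string is an assumption):

# Hypothetical consistency check before switching reads to full_name
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions

with conn.cursor() as cur:
    # Rows where the old and new columns disagree (NULL-safe comparison)
    cur.execute("""
        SELECT count(*)
        FROM users
        WHERE full_name IS DISTINCT FROM name
    """)
    mismatches = cur.fetchone()[0]

if mismatches == 0:
    print("Backfill consistent; safe to switch reads to full_name")
else:
    print(f"{mismatches} rows still diverge; keep dual-writing and re-run the backfill")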

Example: Splitting a Table

-- Original: users table with address embedded
-- Target: separate addresses table

-- Phase 1: Create new table
CREATE TABLE addresses (
    id SERIAL PRIMARY KEY,
    user_id INTEGER UNIQUE REFERENCES users(id),  -- UNIQUE makes re-running the backfill safe
    street TEXT,
    city TEXT,
    country TEXT
);

-- Phase 2: Backfill (batched; a driver loop is sketched after this example)
INSERT INTO addresses (user_id, street, city, country)
SELECT id, street, city, country
FROM users
WHERE id BETWEEN $1 AND $2
ON CONFLICT DO NOTHING;

-- Phase 3: Application reads from addresses
-- Phase 4: Remove columns from users
ALTER TABLE users
    DROP COLUMN street,
    DROP COLUMN city,
    DROP COLUMN country;
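
The $1 and $2 placeholders in the phase 2 backfill are meant to be driven by a batching loop. One possible driver, sketched with psycopg2 (batch size, sleep, and connection string are assumptions):

# Hypothetical driver for the range-batched backfill above
import time
import psycopg2

BATCH = 10_000
conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions

with conn.cursor() as cur:
    cur.execute("SELECT min(id), max(id) FROM users")
    lo, hi = cur.fetchone()  # assumes users is non-empty

for start in range(lo, hi + 1, BATCH):
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO addresses (user_id, street, city, country)
            SELECT id, street, city, country
            FROM users
            WHERE id BETWEEN %s AND %s
            ON CONFLICT DO NOTHING
        """, (start, start + BATCH - 1))
    conn.commit()      # Keep each batch in its own small transaction
    time.sleep(0.1)    # Give other queries a chance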

Non-Blocking Operations

PostgreSQL Specifics

-- BLOCKING: Standard index creation
CREATE INDEX idx_users_email ON users(email);
-- Locks table for duration!

-- NON-BLOCKING: Concurrent index
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Doesn't block reads or writes, but takes longer and can fail

-- A failed concurrent build leaves an INVALID index behind;
-- drop it before retrying:
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;

-- BLOCKING: Adding NOT NULL with default
ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'active';
-- Rewrites the entire table on PostgreSQL 10 and older!

-- NON-BLOCKING approach:
-- PostgreSQL 11+: a non-volatile default no longer forces a rewrite
ALTER TABLE users ADD COLUMN status TEXT DEFAULT 'active';
-- Later, once every row has a value:
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
-- Note: SET NOT NULL still scans the table under an exclusive lock;
-- on PostgreSQL 12+ a pre-validated CHECK (status IS NOT NULL) constraint lets it skip the scan

Lock Timeout Strategy

-- Set aggressive lock timeout
SET lock_timeout = '5s';

-- Attempt migration
ALTER TABLE users ADD COLUMN new_col TEXT;

-- If timeout, retry later
-- Don't block production traffic

# Retry logic for migrations (a sketch assuming psycopg2 and an open connection)
import logging
import time

from psycopg2 import errors

logger = logging.getLogger(__name__)

class MigrationFailed(Exception):
    pass

def safe_migrate(connection, migration_func, max_retries=5):
    for attempt in range(max_retries):
        try:
            with connection.cursor() as cursor:
                cursor.execute("SET lock_timeout = '5s'")
                migration_func(cursor)
                connection.commit()
                return True
        except errors.LockNotAvailable:
            connection.rollback()  # Clear the aborted transaction before retrying
            logger.warning(f"Migration locked, attempt {attempt + 1}")
            time.sleep(30)  # Wait and retry

    raise MigrationFailed("Could not acquire lock")
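
With that wrapper, each DDL statement becomes a retried call. A usage sketch (the column name is illustrative, and the connection is assumed to be open):

# Example usage of safe_migrate
safe_migrate(connection, lambda cur: cur.execute(
    "ALTER TABLE users ADD COLUMN new_col TEXT"
))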

Backfill Patterns

Batched Updates

# DON'T: Update all at once
cursor.execute("UPDATE users SET full_name = name")
# One huge, long-running transaction that row-locks every row

# DO: Batch updates
BATCH_SIZE = 1000

while True:
    cursor.execute("""
        UPDATE users
        SET full_name = name
        WHERE id IN (
            SELECT id FROM users
            WHERE full_name IS NULL
            LIMIT %s
        )
        RETURNING id
    """, [BATCH_SIZE])

    updated = cursor.fetchall()
    connection.commit()

    if len(updated) < BATCH_SIZE:
        break

    time.sleep(0.1)  # Give other queries a chance

Online Backfill with pt-online-schema-change

# For MySQL: pt-online-schema-change
pt-online-schema-change \
    --alter "ADD COLUMN full_name VARCHAR(255)" \
    --execute \
    --chunk-size=1000 \
    --max-load="Threads_running=25" \
    D=mydb,t=users

gh-ost for MySQL

# GitHub's gh-ost
gh-ost \
    --user="root" \
    --password="***" \
    --host=db.example.com \
    --database="mydb" \
    --table="users" \
    --alter="ADD COLUMN full_name VARCHAR(255)" \
    --execute

Application Compatibility

Forward and Backward Compatible Code

# Code must work with both old and new schema
# (sketch of an ORM-style model; the BaseModel providing save() is assumed)

# During migration period:
class User(BaseModel):
    def get_name(self):
        # Try new column first, fall back to old
        if getattr(self, 'full_name', None):
            return self.full_name
        return self.name

    def save(self):
        # Write to both columns during the dual-write window
        self.full_name = self.full_name or self.name
        super().save()

Database Views for Compatibility

-- Create view that presents old interface
CREATE VIEW users_v1 AS
SELECT
    id,
    full_name AS name,  -- Map new to old
    email,
    created_at
FROM users;

-- Old code uses view, new code uses table
-- Remove view when old code is gone

Migration Workflow

Pre-Migration

pre_migration_checklist:
  planning:
    - [ ] Document the migration steps
    - [ ] Identify locking operations
    - [ ] Plan rollback strategy
    - [ ] Estimate data volume and time (see the sketch after this checklist)

  testing:
    - [ ] Test on production-like data
    - [ ] Measure duration
    - [ ] Verify application compatibility
    - [ ] Test rollback

  preparation:
    - [ ] Schedule maintenance window (if needed)
    - [ ] Notify stakeholders
    - [ ] Ensure backup is current
    - [ ] Prepare monitoring
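
For the estimate and measurement items, a rough number can come from timing one representative batch on a production-like copy and extrapolating. A sketch (batch size, table, and connection string are assumptions):

# Hypothetical duration estimate: time one batch, extrapolate to the whole table
import time
import psycopg2

BATCH = 1_000
conn = psycopg2.connect("dbname=mydb_staging")  # ideally a production-like copy

with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM users WHERE full_name IS NULL")
    remaining = cur.fetchone()[0]

    start = time.monotonic()
    cur.execute("""
        UPDATE users SET full_name = name
        WHERE id IN (SELECT id FROM users WHERE full_name IS NULL LIMIT %s)
    """, (BATCH,))
    per_batch = time.monotonic() - start

conn.rollback()  # Discard the trial batch; this run was only a measurement

batches = max(1, remaining // BATCH)
print(f"~{remaining} rows, ~{per_batch:.2f}s per {BATCH}-row batch, "
      f"~{batches * per_batch / 60:.1f} minutes total")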

During Migration

during_migration:
  monitoring:
    - Watch lock wait times
    - Monitor replication lag
    - Track query latency
    - Watch error rates

  checkpoints:
    - Verify each step completed
    - Check data consistency
    - Validate application behavior

  abort_criteria:
    - Lock timeout exceeded X times
    - Replication lag > threshold (see the check sketched below)
    - Error rate increase
    - Data inconsistency detected
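
The replication-lag and lock-wait abort criteria can be polled between migration steps. A minimal sketch against PostgreSQL's statistics views (threshold and connection string are assumptions):

# Hypothetical abort-criteria check between migration steps (PostgreSQL 10+)
import psycopg2

LAG_THRESHOLD_S = 10
conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
conn.autocommit = True  # read-only polling; no transaction needed

def should_abort():
    with conn.cursor() as cur:
        # Worst replay lag across replicas
        cur.execute("""
            SELECT coalesce(max(extract(epoch FROM replay_lag)), 0)
            FROM pg_stat_replication
        """)
        lag = cur.fetchone()[0]

        # Sessions currently waiting on locks
        cur.execute("SELECT count(*) FROM pg_locks WHERE NOT granted")
        waiting = cur.fetchone()[0]

    return lag > LAG_THRESHOLD_S or waiting > 0

if should_abort():
    print("Abort criteria met: pause the migration and investigate")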

Post-Migration

post_migration:
  verification:
    - [ ] Data integrity checks
    - [ ] Application functionality
    - [ ] Performance metrics
    - [ ] Replication status

  cleanup:
    - [ ] Remove temporary objects
    - [ ] Update documentation
    - [ ] Archive migration scripts

  monitoring:
    - Continue monitoring for 24-48 hours
    - Watch for delayed issues

Rollback Strategies

Design for Rollback

rollback_planning:
  additive_changes:
    migration: Add new column
    rollback: Just deploy old code (column ignored)

  data_changes:
    migration: Backfill transformation
    rollback: Reverse transformation or restore from backup

  destructive_changes:
    migration: Drop column
    rollback: NOT POSSIBLE - this is why we wait

Rollback Checklist

before_destructive_operations:
  - [ ] All code using new schema deployed
  - [ ] No rollback to old code planned
  - [ ] Sufficient time passed (usually 1+ weeks)
  - [ ] Backup of dropped data exists
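
For the last item, one lightweight way to keep a backup of data about to be destroyed is to snapshot it into a side table just before the DROP. A sketch (the backup table name is an assumption):

# Hypothetical backup of a column about to be dropped
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
with conn.cursor() as cur:
    # Snapshot the old column so the DROP can still be undone manually
    cur.execute("CREATE TABLE users_name_backup AS SELECT id, name FROM users")
    cur.execute("ALTER TABLE users DROP COLUMN name")
conn.commit()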

Key Takeaways

Database migrations require discipline and patience. Use expand-and-contract instead of in-place breaking changes, prefer non-blocking operations and batched backfills, keep application code compatible with both schemas during the transition, and postpone destructive steps until a rollback is no longer plausible. Rushing any of these steps leads to outages.