Database Migrations Without Downtime

August 15, 2016

In the early days, database migrations were simple. Take the application offline, run the migration, bring the application back up. Users understood maintenance windows. Downtime was acceptable.

Those days are over. Users expect 24/7 availability. Competitors don’t have maintenance windows. Downtime costs money and trust.

Zero-downtime migrations are achievable, but they require discipline. Every schema change must be backward compatible with running application code. Migrations must be non-blocking. Rollbacks must be possible. Here’s how.

The Core Principle

Zero-downtime migrations follow one principle: at every moment during migration, the database schema must be compatible with both the old and new application code.

This means you can’t simply rename a column and update code simultaneously. There is always a window where either the old code queries a column that no longer exists, or the new code queries a column that doesn’t exist yet; either way, requests fail.

Instead, migrations happen in multiple phases, each maintaining compatibility.

The Expand-Contract Pattern

Most schema changes follow the expand-contract pattern:

Expand: Add new schema elements alongside existing ones. The database has both old and new structures. Old code uses old structures; new code can use either.

Migrate: Update application code to use new structures. Deploy this code across all instances.

Contract: Remove old schema elements that no code references.

Each phase is a separate deployment. Each maintains compatibility. If problems occur, you can stop between phases without data inconsistency.

Adding a Column

Adding a nullable column is the simplest migration:

-- Expand: Add the column (non-blocking in most databases)
ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);

New code can use the column immediately. Old code ignores it. No contract phase is needed, since old code isn’t affected by new columns (assuming it doesn’t rely on positional INSERT ... VALUES statements, which break when the column count changes).

Adding a Non-Nullable Column

Non-nullable columns require more care:

-- Expand: Add as nullable first
ALTER TABLE users ADD COLUMN status VARCHAR(20);

-- Migrate: Deploy code that writes status for new and updated rows,
-- then backfill the existing rows
UPDATE users SET status = 'active' WHERE status IS NULL;

-- Contract: Add the constraint once every row has a value
ALTER TABLE users ALTER COLUMN status SET NOT NULL;

The backfill might be slow for large tables. Run it in batches to avoid long-held locks and one huge transaction, repeating the statement until it updates zero rows:

UPDATE users SET status = 'active'
WHERE id IN (SELECT id FROM users WHERE status IS NULL LIMIT 1000);

Removing a Column

Never drop a column that running code might reference:

-- First: Deploy code that doesn't reference the column
-- Wait: Ensure all application instances are updated
-- Then: Drop the column
ALTER TABLE users DROP COLUMN legacy_field;

If old code instances remain running and reference the dropped column, they’ll fail. Wait until all instances run code that doesn’t need the column.

Renaming a Column

Renaming requires multiple phases:

-- Expand: Add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Migrate: Deploy code that writes to both email and email_address,
-- still reading from email (so no write is missed during the backfill)

-- Migrate: Backfill data (in batches for large tables)
UPDATE users SET email_address = email WHERE email_address IS NULL;

-- Migrate: Deploy code that reads from and writes to only email_address
-- Contract: Drop old column
ALTER TABLE users DROP COLUMN email;

During the transition, application code must handle both columns. This is temporary complexity for safe migration.
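
As an illustration of that transition, here is a minimal sketch of the dual-write phase in application code. It assumes the same generic db helper used in the lazy-migration example later in this post; the helper, its call style, and the function names are illustrative, not a specific library.

def update_user_email(user_id, new_email):
    # Dual-write: keep both columns in sync while old and new code coexist
    db.execute(
        "UPDATE users SET email = %s, email_address = %s WHERE id = %s",
        new_email, new_email, user_id)

def get_user_email(user_id):
    # Prefer the new column, fall back to the old one until backfill finishes
    user = db.query(
        "SELECT email, email_address FROM users WHERE id = %s", user_id)
    return user.email_address or user.email

Once the backfill is complete and every instance runs this code, the next deploy can drop the fallback and use email_address exclusively.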

Changing Column Type

Type changes are similar to renames:

-- Expand: Add new column with new type
ALTER TABLE products ADD COLUMN price_cents BIGINT;

-- Migrate: Deploy code that writes both price and price_cents,
-- then backfill the existing rows (in batches for large tables)
UPDATE products SET price_cents = price * 100 WHERE price_cents IS NULL;

-- Migrate: Update application to read and write only price_cents
-- Contract: Drop old column
ALTER TABLE products DROP COLUMN price;

Handling Large Tables

The approaches above work for small tables. Large tables (millions of rows) require additional care because migrations can lock tables or overwhelm the database.

Batched Updates

Never update millions of rows in one transaction:

-- Bad: Updates every row in one huge transaction, holding its row locks until the end
UPDATE users SET new_field = computed_value;

-- Good: Process in batches, one transaction per batch. Note that COMMIT
-- inside a DO block requires a recent PostgreSQL version; on older releases,
-- drive the loop from a script instead (see the client-side sketch below).
DO $$
DECLARE
  batch_size INT := 1000;
  processed INT := 0;
BEGIN
  LOOP
    UPDATE users
    SET new_field = computed_value
    WHERE id IN (
      SELECT id FROM users
      WHERE new_field IS NULL
      LIMIT batch_size
    );

    GET DIAGNOSTICS processed = ROW_COUNT;
    EXIT WHEN processed = 0;

    COMMIT;
    PERFORM pg_sleep(0.1);  -- Brief pause to reduce load
  END LOOP;
END $$;
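
If your PostgreSQL version can’t commit inside a DO block, the same loop can live in a small script so that each batch commits on its own. Below is a sketch using psycopg2; the DSN, the new_field column, and the computed_value expression are placeholders for your own schema.

import time
import psycopg2

def backfill_in_batches(dsn, batch_size=1000):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # each UPDATE commits as its own transaction
    cur = conn.cursor()
    while True:
        cur.execute(
            "UPDATE users SET new_field = computed_value "
            "WHERE id IN (SELECT id FROM users "
            "             WHERE new_field IS NULL LIMIT %s)",
            (batch_size,))
        if cur.rowcount == 0:  # nothing left to migrate
            break
        time.sleep(0.1)  # brief pause to reduce load
    conn.close()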

Online Schema Change Tools

For large MySQL tables, tools like gh-ost and pt-online-schema-change perform schema changes without blocking operations:

  1. Create shadow table with new schema
  2. Copy data to shadow table in batches
  3. Capture changes to original table during copy
  4. Atomically swap tables when copy completes

These tools enable schema changes on tables with billions of rows while the application runs normally.
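
For reference, a typical pt-online-schema-change invocation looks roughly like this (database and table names are illustrative, connection options are omitted, and you should check the tool’s documentation for current flags):

pt-online-schema-change --alter "ADD COLUMN status VARCHAR(20)" D=mydb,t=users --execute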

PostgreSQL performs many schema changes as near-instant metadata-only operations, but tools like pgloader and careful manual processes help with large data migrations.

Lazy Migration

For some changes, migrate data lazily rather than all at once:

def get_user_preference(user_id):
    user = db.query("SELECT old_pref, new_pref FROM users WHERE id = %s", user_id)

    if user.new_pref is not None:
        return user.new_pref

    # Migrate on read
    migrated_value = migrate_preference(user.old_pref)
    db.execute("UPDATE users SET new_pref = %s WHERE id = %s",
               migrated_value, user_id)

    return migrated_value

Data migrates as it’s accessed. Combined with background batch migration, this ensures all data eventually migrates without a single blocking operation.
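
A sketch of that background pass, reusing the hypothetical db helper and migrate_preference function from above (the query_all helper returning a list of rows is also an assumption):

def migrate_remaining_preferences(batch_size=1000):
    # Sweep rows the read path hasn't touched yet, one batch at a time
    while True:
        rows = db.query_all(
            "SELECT id, old_pref FROM users "
            "WHERE new_pref IS NULL LIMIT %s", batch_size)
        if not rows:
            break
        for row in rows:
            db.execute(
                "UPDATE users SET new_pref = %s WHERE id = %s",
                migrate_preference(row.old_pref), row.id)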

Index Changes

Index creation can lock tables. Most modern databases support concurrent index creation:

-- PostgreSQL: CREATE INDEX CONCURRENTLY builds the index without blocking writes
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

-- MySQL 5.6+: Online DDL for many index operations
ALTER TABLE users ADD INDEX idx_email (email), ALGORITHM=INPLACE, LOCK=NONE;

Concurrent index creation takes longer but doesn’t block reads or writes.

Dropping indexes is generally safe, but first verify that no queries depend on the index; in PostgreSQL, pg_stat_user_indexes shows how often each index is actually scanned. Dropping an index that query plans rely on can cause severe performance degradation.

Foreign Keys

Adding foreign keys to existing tables requires care:

-- This validates every existing row while holding locks, which can be slow
ALTER TABLE orders ADD CONSTRAINT fk_user
FOREIGN KEY (user_id) REFERENCES users(id);

-- PostgreSQL: Add NOT VALID, then validate separately
ALTER TABLE orders ADD CONSTRAINT fk_user
FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;

-- Later, validate without blocking
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;

A constraint added as NOT VALID is enforced for new and updated rows immediately. VALIDATE CONSTRAINT then checks existing rows without blocking concurrent reads and writes.

Testing Migrations

Test migrations before running in production:

Test Against Production-Like Data

Development databases with 100 rows don’t reveal problems that appear with 100 million rows. Test migrations against databases with production-like data volume and distribution.

Time Migrations

Measure how long migrations take on production-sized data. A migration that takes 5 seconds in development might take 5 hours in production.

Test Rollback

Verify that rollback works:

  1. Run migration
  2. Deploy new code
  3. Simulate problem
  4. Rollback code
  5. Rollback migration
  6. Verify system functions correctly

If rollback doesn’t work in testing, it won’t work during a production incident.
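
A minimal sketch covering the schema half of such a drill (steps 1 and 5), using the middle_name migration from earlier and psycopg2 against a disposable test database; the DSN and the structure are assumptions, not part of any particular test framework.

import psycopg2

APPLY = "ALTER TABLE users ADD COLUMN middle_name VARCHAR(100)"
REVERT = "ALTER TABLE users DROP COLUMN middle_name"

def column_exists(cur, table, column):
    cur.execute(
        "SELECT 1 FROM information_schema.columns "
        "WHERE table_name = %s AND column_name = %s", (table, column))
    return cur.fetchone() is not None

def rollback_drill(dsn):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # run DDL immediately, as it would run in production
    cur = conn.cursor()

    cur.execute(APPLY)                                    # run the migration
    assert column_exists(cur, "users", "middle_name")

    cur.execute(REVERT)                                   # roll the migration back
    assert not column_exists(cur, "users", "middle_name")
    conn.close()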

Operational Practices

Migration Review

Review migrations like code:

Deployment Order

Coordinate migration and code deployment:

  1. Run expand migrations before code deployment
  2. Deploy code that handles both schemas
  3. Run contract migrations after all code instances are updated

Automate this coordination where possible; manual coordination is error-prone.

Monitor During Migration

Watch database metrics during migration:

Be ready to stop migration if metrics degrade unacceptably.

Key Takeaways