In the early days, database migrations were simple. Take the application offline, run the migration, bring the application back up. Users understood maintenance windows. Downtime was acceptable.
Those days are over. Users expect 24/7 availability. Competitors don’t have maintenance windows. Downtime costs money and trust.
Zero-downtime migrations are achievable, but they require discipline. Every schema change must be backward compatible with running application code. Migrations must be non-blocking. Rollbacks must be possible. Here’s how.
The Core Principle
Zero-downtime migrations follow one principle: at every moment during migration, the database schema must be compatible with both the old and new application code.
This means you can’t simply rename a column and update code simultaneously. If the rename runs first, old code still querying the old name fails; if the new code deploys first, it queries a column that doesn’t exist yet and fails.
Instead, migrations happen in multiple phases, each maintaining compatibility.
The Expand-Contract Pattern
Most schema changes follow the expand-contract pattern:
Expand: Add new schema elements alongside existing ones. The database has both old and new structures. Old code uses old structures; new code can use either.
Migrate: Update application code to use new structures. Deploy this code across all instances.
Contract: Remove old schema elements that no code references.
Each phase is a separate deployment. Each maintains compatibility. If problems occur, you can stop between phases without data inconsistency.
Adding a Column
Adding a nullable column is the simplest migration:
-- Expand: Add the column (non-blocking in most databases)
ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);
New code can use the column immediately. Old code ignores it. No contract phase needed since old code isn’t affected by new columns.
Adding a Non-Nullable Column
Non-nullable columns require more care:
-- Expand: Add as nullable first
ALTER TABLE users ADD COLUMN status VARCHAR(20);
-- Migrate: Backfill existing rows
UPDATE users SET status = 'active' WHERE status IS NULL;
-- Contract: Add constraint after all rows have values
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
The backfill might be slow for large tables. Run it in batches, repeating the statement until no rows remain, to avoid long-held locks and one huge transaction:
UPDATE users SET status = 'active'
WHERE id IN (SELECT id FROM users WHERE status IS NULL LIMIT 1000);
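A small driver script can repeat that statement until nothing is left to update. A sketch, assuming psycopg2 as the client library and autocommit so each batch commits on its own:

import time

import psycopg2  # assumed client library; any driver exposing rowcount works

conn = psycopg2.connect("dbname=app")  # placeholder connection string
conn.autocommit = True  # each batch is its own transaction

BATCH_SQL = """
    UPDATE users SET status = 'active'
    WHERE id IN (SELECT id FROM users WHERE status IS NULL LIMIT 1000)
"""

with conn.cursor() as cur:
    while True:
        cur.execute(BATCH_SQL)
        if cur.rowcount == 0:  # no rows left to backfill
            break
        time.sleep(0.1)  # brief pause to reduce load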
Removing a Column
Never drop a column that running code might reference:
-- First: Deploy code that doesn't reference the column
-- Wait: Ensure all application instances are updated
-- Then: Drop the column
ALTER TABLE users DROP COLUMN legacy_field;
If old code instances remain running and reference the dropped column, they’ll fail. Wait until all instances run code that doesn’t need the column.
Renaming a Column
Renaming requires multiple phases:
-- Expand: Add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
-- Expand: Copy data (in batches for large tables)
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- Migrate: Deploy code that reads from new column, writes to both
-- (Application writes to both email and email_address)
-- Migrate: Deploy code that only uses new column
-- Contract: Drop old column
ALTER TABLE users DROP COLUMN email;
During the transition, application code must handle both columns. This is temporary complexity for safe migration.
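A sketch of the dual-write phase, assuming a thin db wrapper like the one used in the lazy migration example later in this article:

def update_email(user_id, email):
    # Dual write: keep the old and new columns in sync during the transition
    db.execute(
        "UPDATE users SET email = %s, email_address = %s WHERE id = %s",
        email, email, user_id,
    )

def get_email(user_id):
    row = db.query("SELECT email, email_address FROM users WHERE id = %s", user_id)
    # Prefer the new column; fall back to the old one until the backfill completes
    return row.email_address if row.email_address is not None else row.email

Once the backfill is complete and every instance runs this code, a later deployment can drop the fallback and use only email_address.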
Changing Column Type
Type changes are similar to renames:
-- Expand: Add new column with new type
ALTER TABLE products ADD COLUMN price_cents BIGINT;
-- Migrate: Backfill data
UPDATE products SET price_cents = price * 100 WHERE price_cents IS NULL;
-- Migrate: Update application to use new column
-- Contract: Drop old column
ALTER TABLE products DROP COLUMN price;
Handling Large Tables
The approaches above work for small tables. Large tables (millions of rows) require additional care because migrations can lock tables or overwhelm the database.
Batched Updates
Never update millions of rows in one transaction:
-- Bad: Locks table, creates huge transaction
UPDATE users SET new_field = computed_value;
-- Good: Process in batches (PostgreSQL; COMMIT inside a DO block requires
-- version 11+ and that the block is not run inside an outer transaction)
DO $$
DECLARE
  batch_size INT := 1000;
  processed  INT := 0;
BEGIN
  LOOP
    -- new_field / computed_value are placeholders for the real column and expression
    UPDATE users
    SET new_field = computed_value
    WHERE id IN (
      SELECT id FROM users
      WHERE new_field IS NULL
      LIMIT batch_size
    );
    GET DIAGNOSTICS processed = ROW_COUNT;
    EXIT WHEN processed = 0;
    COMMIT;                  -- Keep each batch in its own transaction
    PERFORM pg_sleep(0.1);   -- Brief pause to reduce load
  END LOOP;
END $$;
Online Schema Change Tools
For large MySQL tables, tools like gh-ost and pt-online-schema-change perform schema changes without blocking operations:
- Create shadow table with new schema
- Copy data to shadow table in batches
- Capture changes to original table during copy
- Atomically swap tables when copy completes
These tools enable schema changes on tables with billions of rows while the application runs normally.
PostgreSQL performs many schema changes as fast, metadata-only operations (they still take a brief exclusive lock), but large data migrations still benefit from tools like pgloader and careful, batched manual processes.
Lazy Migration
For some changes, migrate data lazily rather than all at once:
def get_user_preference(user_id):
    # db.query / db.execute: thin wrappers around the database driver
    user = db.query("SELECT old_pref, new_pref FROM users WHERE id = %s", user_id)
    if user.new_pref is not None:
        return user.new_pref
    # Migrate on read
    migrated_value = migrate_preference(user.old_pref)
    db.execute("UPDATE users SET new_pref = %s WHERE id = %s",
               migrated_value, user_id)
    return migrated_value
Data migrates as it’s accessed. Combined with background batch migration, this ensures all data eventually migrates without a single blocking operation.
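The background half of that strategy can be a simple batched sweep over rows the lazy path hasn’t reached yet. A sketch, reusing migrate_preference and assuming a db.query_all helper that returns all matching rows:

import time

def migrate_remaining_preferences(batch_size=1000):
    while True:
        rows = db.query_all(
            "SELECT id, old_pref FROM users WHERE new_pref IS NULL LIMIT %s",
            batch_size,
        )
        if not rows:
            break
        for row in rows:
            # The extra new_pref IS NULL check avoids racing the lazy read path
            db.execute(
                "UPDATE users SET new_pref = %s WHERE id = %s AND new_pref IS NULL",
                migrate_preference(row.old_pref), row.id,
            )
        time.sleep(0.1)  # brief pause to reduce load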
Index Changes
Index creation can lock tables. Most modern databases support concurrent index creation:
-- PostgreSQL: CREATE INDEX CONCURRENTLY doesn't block reads or writes
-- (it cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- MySQL 5.6+: Online DDL for many index operations
ALTER TABLE users ADD INDEX idx_email (email), ALGORITHM=INPLACE, LOCK=NONE;
Concurrent index creation takes longer but doesn’t block reads or writes.
Dropping indexes is generally safe, but first verify that no queries depend on the index: dropping one that query plans rely on can cause severe performance degradation.
Foreign Keys
Adding foreign keys to existing tables requires care:
-- This validates all existing rows while holding locks, which can be slow on large tables
ALTER TABLE orders ADD CONSTRAINT fk_user
FOREIGN KEY (user_id) REFERENCES users(id);
-- PostgreSQL: Add NOT VALID, then validate separately
ALTER TABLE orders ADD CONSTRAINT fk_user
FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;
-- Later, validate without blocking
ALTER TABLE orders VALIDATE CONSTRAINT fk_user;
A constraint added with NOT VALID is enforced for new and updated rows immediately. VALIDATE CONSTRAINT then checks existing rows without blocking concurrent operations.
Testing Migrations
Test migrations before running them in production:
Test Against Production-Like Data
Development databases with 100 rows don’t reveal problems that appear with 100 million rows. Test migrations against databases with production-like:
- Data volume
- Data distribution
- Concurrent load
Time Migrations
Measure how long migrations take on production-sized data. A migration that takes 5 seconds in development might take 5 hours in production.
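One way to get that number is to run the migration against a restored, production-sized copy and time it. A rough sketch; the migrate command and database name are placeholders for whatever migration tool you use:

import subprocess
import time

start = time.monotonic()
# Placeholder command: run your migration tool against a production-sized copy
subprocess.run(["migrate", "--database", "staging_copy", "up"], check=True)
elapsed = time.monotonic() - start
print(f"Migration took {elapsed:.1f} seconds on production-sized data")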
Test Rollback
Verify that rollback works:
- Run migration
- Deploy new code
- Simulate problem
- Roll back code
- Roll back migration
- Verify system functions correctly
If rollback doesn’t work in testing, it won’t work during a production incident.
Operational Practices
Migration Review
Review migrations like code:
- Does this migration block operations?
- Is it backward compatible with running code?
- What happens if we need to roll back?
- How long will it take on production data?
Deployment Order
Coordinate migration and code deployment:
- Run expand migrations before code deployment
- Deploy code that handles both schemas
- Run contract migrations after all code instances are updated
Automate this coordination where possible; manual coordination is error-prone.
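As a sketch of what that automation can look like, with run_migrations, deploy_application, running_versions, and wait_until standing in for whatever your deployment tooling provides:

def deploy_release(release):
    # 1. Expand: additive, backward-compatible migrations run first
    run_migrations(release.expand_migrations)

    # 2. Deploy code that works with both the old and new schema
    deploy_application(release.version)

    # 3. Wait until no instance of the old version is still running
    wait_until(lambda: running_versions() == {release.version})

    # 4. Contract: remove schema elements nothing references anymore
    run_migrations(release.contract_migrations)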
Monitor During Migration
Watch database metrics during migration:
- Query latency
- Lock waits
- Replication lag
- Connection pool utilization
Be ready to pause or stop the migration if metrics degrade unacceptably.
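A backfill driver can check a couple of these metrics between batches and pause when they degrade. A sketch against PostgreSQL’s statistics views, with placeholder thresholds and psycopg2 as the assumed client library:

import time

import psycopg2  # assumed client library

MAX_LOCK_WAITERS = 10        # placeholder thresholds; tune for your system
MAX_REPLICA_LAG_SECONDS = 5

conn = psycopg2.connect("dbname=app")  # placeholder connection string
conn.autocommit = True  # fresh statistics on every check

def migration_is_safe():
    with conn.cursor() as cur:
        # Backends currently waiting on a lock
        cur.execute("SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock'")
        lock_waiters = cur.fetchone()[0]
        # Worst replication lag across replicas (pg_stat_replication.replay_lag, PostgreSQL 10+)
        cur.execute(
            "SELECT coalesce(extract(epoch FROM max(replay_lag)), 0) FROM pg_stat_replication"
        )
        replica_lag = cur.fetchone()[0]
    return lock_waiters < MAX_LOCK_WAITERS and replica_lag < MAX_REPLICA_LAG_SECONDS

# Between batches: wait until the database looks healthy before continuing
while not migration_is_safe():
    time.sleep(5)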
Key Takeaways
- Zero-downtime migrations require backward compatibility at every step
- Follow the expand-contract pattern: add new, migrate, remove old
- Never remove or rename schema elements that running code references
- Use batched updates for large tables; consider online schema change tools
- Create indexes concurrently to avoid blocking
- Test migrations against production-like data volumes
- Review migrations for blocking operations and rollback capability