Database migrations are among the riskiest operations in software engineering. Unlike application code, they can’t easily be rolled back. A bad migration can bring down production, corrupt data, or lock tables for hours. Zero-downtime migrations require specific patterns.
Here’s how to migrate databases safely.
The Risk Landscape
What Can Go Wrong
migration_risks:
  locks:
    - Schema changes lock tables
    - Reads and writes blocked
    - Application timeouts
    - Cascade to dependent services
  data_loss:
    - Destructive changes
    - Incorrect transformations
    - Race conditions
  incompatibility:
    - New schema, old code
    - Old schema, new code
    - Rollback impossible
  performance:
    - Large data copies
    - Index rebuilds
    - Vacuum/analyze storms
Safe Migration Patterns
Expand and Contract
expand_contract_pattern:
  principle: |
    Never make breaking changes directly.
    Add new things, migrate data, then remove old things.
  phases:
    expand:
      - Add new column/table
      - Support both old and new
      - Deploy application that writes to both
    migrate:
      - Backfill data
      - Verify consistency
      - Switch reads to new
    contract:
      - Remove old code paths
      - Remove old column/table
Example: Renaming a Column
-- WRONG: Breaking change
ALTER TABLE users RENAME COLUMN name TO full_name;
-- RIGHT: Expand and contract
-- Phase 1: Expand (add new column)
ALTER TABLE users ADD COLUMN full_name TEXT;
-- Phase 2: Migrate (backfill and dual-write)
-- Application writes to both columns
UPDATE users SET full_name = name WHERE full_name IS NULL;  -- batch this on large tables (see Backfill Patterns)
-- Phase 3: Contract (remove old)
-- After all code uses full_name:
ALTER TABLE users DROP COLUMN name;
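Before actually running the Phase 3 DROP, it helps to confirm that the backfill is complete and that the two columns agree. A minimal sketch of that check, assuming psycopg2 and a connection string passed in by the caller (both assumptions, not part of the original example):

# Sketch: verify the rename backfill before dropping the old column.
# Assumes psycopg2; table and column names follow the example above.
import psycopg2

def verify_rename_backfill(conninfo):
    with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
        # Rows where the new column is missing or disagrees with the old one
        cur.execute("""
            SELECT count(*) FROM users
            WHERE full_name IS DISTINCT FROM name
        """)
        mismatches = cur.fetchone()[0]
    if mismatches:
        print(f"{mismatches} rows still differ; keep dual-writing and backfilling")
        return False
    return True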
Example: Splitting a Table
-- Original: users table with address embedded
-- Target: separate addresses table
-- Phase 1: Create new table
CREATE TABLE addresses (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    street TEXT,
    city TEXT,
    country TEXT
);
-- Phase 2: Backfill (batched)
INSERT INTO addresses (user_id, street, city, country)
SELECT id, street, city, country
FROM users
WHERE id BETWEEN $1 AND $2
ON CONFLICT DO NOTHING;
-- Phase 3: Application reads from addresses
-- Phase 4: Remove columns from users
ALTER TABLE users
DROP COLUMN street,
DROP COLUMN city,
DROP COLUMN country;
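The Phase 2 INSERT above is parameterized by an id range ($1 and $2), so something has to walk those ranges in small chunks. A minimal sketch of that driver loop, assuming psycopg2 and that users.id is an integer primary key (assumptions for illustration, not from the original):

# Sketch: drive the Phase 2 backfill in id-range batches.
# Assumes psycopg2; BATCH size is arbitrary, tune it for your workload.
import time
import psycopg2

BATCH = 10000

def backfill_addresses(conninfo):
    with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
        cur.execute("SELECT coalesce(min(id), 0), coalesce(max(id), 0) FROM users")
        lo, hi = cur.fetchone()
        for start in range(lo, hi + 1, BATCH):
            cur.execute("""
                INSERT INTO addresses (user_id, street, city, country)
                SELECT id, street, city, country
                FROM users
                WHERE id BETWEEN %s AND %s
                ON CONFLICT DO NOTHING
            """, (start, start + BATCH - 1))
            conn.commit()      # keep each transaction small
            time.sleep(0.1)    # yield to foreground traffic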
Non-Blocking Operations
PostgreSQL Specifics
-- BLOCKING: Standard index creation
CREATE INDEX idx_users_email ON users(email);
-- Blocks writes to the table for the duration!
-- NON-BLOCKING: Concurrent index
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Doesn't block writes, but is slower, can fail,
-- and cannot run inside a transaction block
-- If the concurrent build fails, it leaves an INVALID index behind:
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;
-- Then retry
-- BLOCKING: Adding NOT NULL with a default
ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'active';
-- Rewrites the entire table on PostgreSQL 10 and earlier!
-- NON-BLOCKING approach:
-- PostgreSQL 11+: a constant default no longer causes a rewrite
ALTER TABLE users ADD COLUMN status TEXT DEFAULT 'active';
-- Later, after backfilling any remaining NULLs:
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
-- Note: SET NOT NULL still scans the table under an exclusive lock;
-- on PostgreSQL 12+ you can skip the scan by first adding and validating
-- a CHECK (status IS NOT NULL) constraint.
Lock Timeout Strategy
-- Set aggressive lock timeout
SET lock_timeout = '5s';
-- Attempt migration
ALTER TABLE users ADD COLUMN new_col TEXT;
-- If timeout, retry later
-- Don't block production traffic
# Retry logic for migrations: set a short lock_timeout so we never block
# production traffic. Assumes psycopg2 (lock_timeout expiry raises errors.LockNotAvailable).
import logging
import time

from psycopg2 import errors

logger = logging.getLogger(__name__)

class MigrationFailed(Exception):
    pass

def safe_migrate(connection, migration_func, max_retries=5):
    for attempt in range(max_retries):
        try:
            with connection.cursor() as cursor:
                cursor.execute("SET lock_timeout = '5s'")
                migration_func(cursor)
            connection.commit()
            return True
        except errors.LockNotAvailable:
            connection.rollback()
            logger.warning(f"Migration locked, attempt {attempt + 1}")
            time.sleep(30)  # Wait and retry
    raise MigrationFailed("Could not acquire lock")
Backfill Patterns
Batched Updates
# DON'T: Update all rows at once
cursor.execute("UPDATE users SET full_name = name")
# One huge transaction: long-held row locks, bloat, and replication lag

# DO: Batch the updates
BATCH_SIZE = 1000
while True:
    cursor.execute("""
        UPDATE users
        SET full_name = name
        WHERE id IN (
            SELECT id FROM users
            WHERE full_name IS NULL
            LIMIT %s
        )
        RETURNING id
    """, [BATCH_SIZE])
    updated = cursor.fetchall()
    connection.commit()
    if len(updated) < BATCH_SIZE:
        break
    time.sleep(0.1)  # Give other queries a chance
Online Schema Changes with pt-online-schema-change
# For MySQL: pt-online-schema-change
pt-online-schema-change \
  --alter "ADD COLUMN full_name VARCHAR(255)" \
  --execute \
  --chunk-size=1000 \
  --max-load="Threads_running=25" \
  D=mydb,t=users
gh-ost for MySQL
# GitHub's gh-ost
gh-ost \
  --user="root" \
  --password="***" \
  --host=db.example.com \
  --database="mydb" \
  --table="users" \
  --alter="ADD COLUMN full_name VARCHAR(255)" \
  --execute
Application Compatibility
Forward and Backward Compatible Code
# Code must work with both the old and the new schema.
# During the migration window (sketch of an ORM model; the base Model class
# providing save() is assumed):
class User(Model):
    def get_name(self):
        # Try the new column first, fall back to the old one
        if getattr(self, 'full_name', None):
            return self.full_name
        return self.name

    def save(self, *args, **kwargs):
        # Dual-write: keep both columns populated during the migration
        self.full_name = self.full_name or self.name
        super().save(*args, **kwargs)
Database Views for Compatibility
-- Create view that presents old interface
CREATE VIEW users_v1 AS
SELECT
    id,
    full_name AS name,  -- Map new to old
    email,
    created_at
FROM users;
-- Old code uses view, new code uses table
-- Remove view when old code is gone
Migration Workflow
Pre-Migration
pre_migration_checklist:
  planning:
    - [ ] Document the migration steps
    - [ ] Identify locking operations
    - [ ] Plan rollback strategy
    - [ ] Estimate data volume and time
  testing:
    - [ ] Test on production-like data
    - [ ] Measure duration (see the timing sketch below)
    - [ ] Verify application compatibility
    - [ ] Test rollback
  preparation:
    - [ ] Schedule maintenance window (if needed)
    - [ ] Notify stakeholders
    - [ ] Ensure backup is current
    - [ ] Prepare monitoring
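One way to satisfy the "measure duration" item is to rehearse the migration against a restored, production-like snapshot and time each statement. A minimal sketch, assuming psycopg2; the staging DSN and the statement list are hypothetical placeholders:

# Sketch: time each migration statement against a staging copy of production.
# STAGING_DSN and STATEMENTS are placeholders for illustration.
import time
import psycopg2

STAGING_DSN = "dbname=staging"
STATEMENTS = [
    "ALTER TABLE users ADD COLUMN full_name TEXT",
    "CREATE INDEX CONCURRENTLY idx_users_full_name ON users (full_name)",
]

def rehearse():
    conn = psycopg2.connect(STAGING_DSN)
    conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
    with conn.cursor() as cur:
        for sql in STATEMENTS:
            started = time.monotonic()
            cur.execute(sql)
            print(f"{time.monotonic() - started:8.1f}s  {sql}")
    conn.close()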
During Migration
during_migration:
  monitoring:
    - Watch lock wait times (see the query sketch below)
    - Monitor replication lag
    - Track query latency
    - Watch error rates
  checkpoints:
    - Verify each step completed
    - Check data consistency
    - Validate application behavior
  abort_criteria:
    - Lock timeout exceeded X times
    - Replication lag > threshold
    - Error rate increase
    - Data inconsistency detected
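The lock-wait and replication-lag checks above can be polled with ordinary catalog queries. A minimal sketch, assuming psycopg2 and a PostgreSQL 10+ primary; the lag threshold is an illustrative value, not from the original text:

# Sketch: poll lock waits and replication lag while a migration runs.
# Assumes psycopg2; the pg_stat_replication columns shown require PostgreSQL 10+.
import psycopg2

def check_health(conninfo, max_lag_bytes=50_000_000):
    with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
        # Sessions currently waiting on a lock
        cur.execute("""
            SELECT count(*) FROM pg_stat_activity
            WHERE wait_event_type = 'Lock'
        """)
        waiting = cur.fetchone()[0]
        # Worst replication lag, in bytes, across all standbys
        cur.execute("""
            SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
            FROM pg_stat_replication
        """)
        lag = cur.fetchone()[0]
    ok = waiting == 0 and lag < max_lag_bytes
    if not ok:
        print(f"abort check: {waiting} sessions waiting on locks, {lag} bytes of lag")
    return ok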
Post-Migration
post_migration:
  verification:
    - [ ] Data integrity checks (see the sketch below)
    - [ ] Application functionality
    - [ ] Performance metrics
    - [ ] Replication status
  cleanup:
    - [ ] Remove temporary objects
    - [ ] Update documentation
    - [ ] Archive migration scripts
  monitoring:
    - Continue monitoring for 24-48 hours
    - Watch for delayed issues
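For the split-table example earlier, the data-integrity item can start with a simple count comparison between source and target, run before the old columns are dropped in the contract phase. A minimal sketch, assuming psycopg2:

# Sketch: integrity check for the users -> addresses split.
# Assumes psycopg2; run it while users still has the address columns.
import psycopg2

def verify_address_split(conninfo):
    with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM users WHERE street IS NOT NULL")
        expected = cur.fetchone()[0]
        cur.execute("SELECT count(DISTINCT user_id) FROM addresses")
        actual = cur.fetchone()[0]
    if expected != actual:
        print(f"mismatch: {expected} users with addresses, {actual} migrated")
        return False
    return True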
Rollback Strategies
Design for Rollback
rollback_planning:
  additive_changes:
    migration: Add new column
    rollback: Just deploy old code (column ignored)
  data_changes:
    migration: Backfill transformation
    rollback: Reverse transformation or restore from backup
  destructive_changes:
    migration: Drop column
    rollback: NOT POSSIBLE - this is why we wait
Rollback Checklist
before_destructive_operations:
  - [ ] All code using new schema deployed
  - [ ] No rollback to old code planned
  - [ ] Sufficient time passed (usually 1+ weeks)
  - [ ] Backup of dropped data exists (see the archive sketch below)
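For the "backup of dropped data" item, one inexpensive option is to copy the doomed column into an archive table (or dump it to a file) immediately before the destructive step. A minimal sketch, assuming psycopg2; the archive table name is hypothetical:

# Sketch: archive a column's data right before dropping it.
# users_name_archive is a hypothetical name; a pg_dump of the table also works.
import psycopg2

def archive_then_drop(conninfo):
    with psycopg2.connect(conninfo) as conn, conn.cursor() as cur:
        # Keep a copy of the data the DROP COLUMN is about to destroy
        cur.execute("""
            CREATE TABLE users_name_archive AS
            SELECT id, name FROM users
        """)
        cur.execute("ALTER TABLE users DROP COLUMN name")
    # The connection context manager commits both statements together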
Key Takeaways
- Database migrations are high-risk; plan carefully
- Use expand and contract pattern for breaking changes
- Never rename or remove until all code is migrated
- Use CREATE INDEX CONCURRENTLY in PostgreSQL
- Set lock timeouts to avoid blocking production
- Backfill in batches, not single transactions
- Test migrations on production-like data
- Application code must be forward/backward compatible
- Design for rollback before migrating
- Wait before destructive operations (drop column, etc.)
- Monitor throughout and have abort criteria
Database migrations require discipline and patience. Rushing leads to outages.