Databases are often the most critical and fragile component of any system. Stateless services can be restarted and scaled easily; databases hold state that must survive failures. Database reliability engineering applies SRE principles specifically to data stores.
Here’s how to keep databases reliable.
The Database Reliability Challenge
Why Databases Are Different
database_challenges:
state:
- Can't just restart and hope
- Data must survive failures
- Recovery is complex
performance:
- Query patterns matter
- Indexes need maintenance
- Connection limits exist
scaling:
- Vertical limits exist
- Horizontal requires sharding
- Replication adds complexity
changes:
- Schema migrations are risky
- Can't easily roll back data
- Changes affect performance
Reliability Fundamentals
Replication
# PostgreSQL streaming replication
replication_config:
primary:
wal_level: replica
max_wal_senders: 10
synchronous_standby_names: 'standby1'
standby:
primary_conninfo: 'host=primary port=5432 user=replicator'
hot_standby: on
replication_modes:
async:
pros: Lower latency, no blocking
cons: Potential data loss on failover
rpo: Seconds to minutes
sync:
pros: Zero data loss (RPO=0)
cons: Higher latency, blocking on standby failure
recommendation: Use for critical data
semi_sync:
pros: Balance of durability and performance
cons: More complex configuration
implementation: Wait for at least one standby
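Whichever mode you choose, verify replication from the database itself rather than assuming the configuration took effect. A minimal check using the standard pg_stat_replication view (PostgreSQL 10+ function names):
-- On the primary: confirm standbys are attached and measure per-standby lag in bytes
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- On a standby: approximate replay lag in seconds (appears inflated if the primary is idle)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;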
Backup Strategy
backup_strategy:
full_backups:
frequency: Weekly
retention: 4 weeks
method: pg_basebackup or cloud snapshot
incremental:
frequency: Daily
retention: 2 weeks
method: WAL archiving
wal_archiving:
purpose: Point-in-time recovery
target: S3 or dedicated backup storage
retention: 2 weeks minimum
testing:
frequency: Monthly
process: Restore to test environment
validation: Query sample data
# PostgreSQL backup script
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Base backup
pg_basebackup -h localhost -U replicator \
-D "$BACKUP_DIR/base" \
-Ft -z -P
# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://backups/postgres/$(date +%Y%m%d)/"
# Continuous WAL archiving is configured in postgresql.conf, not in this script:
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://backups/postgres/wal/%f'
Point-in-Time Recovery
# PITR restoration
# 1. Stop PostgreSQL
# 2. Clear data directory
# 3. Restore base backup
# 4. Configure recovery, then start PostgreSQL; it replays WAL to the target and promotes
# recovery.conf (PG < 12), or postgresql.conf plus an empty recovery.signal file (PG >= 12)
restore_command = 'aws s3 cp s3://backups/postgres/wal/%f %p'
recovery_target_time = '2021-08-09 14:30:00 UTC'
recovery_target_action = 'promote'
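Once PostgreSQL finishes replaying WAL, verify that recovery ended where you expected. A brief check; the orders table and created_at column are placeholders for your own data:
SELECT pg_is_in_recovery();        -- false once recovery_target_action = 'promote' has taken effect
SELECT pg_last_wal_replay_lsn();   -- last WAL location applied during recovery
-- Spot-check that data stops at the intended recovery target time
SELECT max(created_at) FROM orders;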
Monitoring
Key Metrics
database_metrics:
availability:
- Connection success rate
- Replication lag
- Failover time
performance:
- Query latency (p50, p95, p99)
- Queries per second
- Lock wait time
- Buffer cache hit ratio
capacity:
- Connection count / limit
- Disk usage and growth
- CPU utilization
- Memory usage
health:
- Replication status
- Vacuum progress
- Dead tuple count
- Index bloat
Prometheus Queries
# PostgreSQL Exporter metrics
queries:
connection_utilization:
query: pg_stat_activity_count / pg_settings_max_connections
alert_threshold: 0.8
replication_lag:
query: pg_replication_lag_seconds
alert_threshold: 30s
slow_queries:
query: rate(pg_stat_statements_seconds_total[5m])
alert_threshold: varies
cache_hit_ratio:
query: |
pg_stat_database_blks_hit /
(pg_stat_database_blks_hit + pg_stat_database_blks_read)
alert_threshold: 0.95
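These exporter metrics ultimately come from the statistics views, so the same checks can be run ad hoc in plain SQL, which is useful when validating what the exporter reports:
-- Connection utilization straight from the database
SELECT count(*)::float / current_setting('max_connections')::int AS connection_utilization
FROM pg_stat_activity;
-- Buffer cache hit ratio per database
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database
WHERE blks_hit + blks_read > 0;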
Alerting
# Prometheus alerts
groups:
- name: database
rules:
- alert: DatabaseDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
- alert: ReplicationLag
expr: pg_replication_lag_seconds > 30
for: 5m
labels:
severity: warning
- alert: ConnectionsNearLimit
expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
- alert: DiskSpaceLow
expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 0.85
for: 10m
labels:
severity: warning
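To see which databases are actually consuming the disk behind a DiskSpaceLow alert, a quick catalog query helps:
-- Database size on disk, largest first
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;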
Operational Practices
Connection Management
// Connection pooling with PgBouncer or application-level
// Application-level pooling with database/sql; the "postgres" driver comes from lib/pq or pgx's stdlib adapter
import (
    "context"
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq" // registers the "postgres" driver
)

func openPool(connStr string) (*sql.DB, error) {
    db, err := sql.Open("postgres", connStr)
    if err != nil {
        return nil, err
    }

    // Configure pool
    db.SetMaxOpenConns(25)                 // Match PgBouncer or direct limit
    db.SetMaxIdleConns(25)                 // Keep connections ready
    db.SetConnMaxLifetime(5 * time.Minute) // Refresh connections

    // Health check
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if err := db.PingContext(ctx); err != nil {
        log.Printf("database health check failed: %v", err)
        db.Close()
        return nil, err
    }
    return db, nil
}
# PgBouncer configuration
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
server_idle_timeout = 600
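PgBouncer also exposes an admin console for inspecting pool health. Connect to the virtual pgbouncer database (by default on port 6432, as a user listed in admin_users) and run its SHOW commands:
SHOW POOLS;   -- active and waiting clients, plus server connections, per database/user pair
SHOW STATS;   -- request counts, bytes transferred, and average query/transaction times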
Schema Migrations
migration_best_practices:
testing:
- Test on production-like data volume
- Measure migration duration
- Have rollback plan
online_migrations:
- Add columns as nullable first
- Backfill in batches
- Add NOT NULL constraint after
- Use CREATE INDEX CONCURRENTLY
safety:
- Wrap in transaction where possible
- Set lock_timeout
- Monitor for blocking
- Have abort plan
-- Safe migration pattern
BEGIN;
-- Set a lock timeout so the ALTER fails fast instead of queueing behind long-running transactions
SET lock_timeout = '5s';
-- Add nullable column (fast, no rewrite)
ALTER TABLE orders ADD COLUMN status_new text;
COMMIT;
-- Backfill in batches (outside the transaction); repeat until no rows remain, as sketched after this migration
UPDATE orders
SET status_new = status
WHERE id IN (SELECT id FROM orders WHERE status_new IS NULL LIMIT 10000);
-- Later: add constraint, remove old column
ALTER TABLE orders ALTER COLUMN status_new SET NOT NULL;
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_new TO status;
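The single UPDATE above shows one batch; in practice the backfill is repeated until every row is populated. One way to do that is sketched below with a PL/pgSQL procedure (PostgreSQL 11+, which allows COMMIT inside procedures); the procedure and index names are illustrative:
CREATE PROCEDURE backfill_orders_status() LANGUAGE plpgsql AS $$
DECLARE
  updated integer;
BEGIN
  LOOP
    UPDATE orders SET status_new = status
    WHERE id IN (SELECT id FROM orders WHERE status_new IS NULL LIMIT 10000);
    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;
    COMMIT;  -- release locks between batches and let autovacuum keep up
  END LOOP;
END $$;
CALL backfill_orders_status();
-- If the new column needs an index, build it without blocking writes
-- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_orders_status_new ON orders (status_new);
On large tables, recent PostgreSQL versions can also skip the full-table scan for SET NOT NULL when a validated CHECK (status_new IS NOT NULL) constraint already exists.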
Vacuum and Maintenance
vacuum_strategy:
autovacuum:
enabled: true
settings:
autovacuum_vacuum_scale_factor: 0.1
autovacuum_analyze_scale_factor: 0.05
autovacuum_vacuum_cost_delay: 2ms
manual_vacuum:
when: After bulk operations
command: VACUUM (VERBOSE, ANALYZE) table_name;
monitoring:
- Dead tuple count
- Last vacuum time
- Vacuum progress
- Transaction ID wraparound
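The statistics views expose these monitoring points directly, so they are easy to spot-check by hand. A couple of sample queries:
-- Tables with the most dead tuples, and when they were last vacuumed
SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
-- Transaction ID age per database (wraparound pressure grows toward autovacuum_freeze_max_age)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;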
Query Performance
-- Identify slow queries
SELECT
query,
calls,
total_exec_time / 1000 as total_seconds,
mean_exec_time as avg_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- Find missing indexes
SELECT
schemaname || '.' || relname as table,
seq_scan,
seq_tup_read,
idx_scan,
seq_tup_read / seq_scan as avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
-- Check index usage
SELECT
schemaname || '.' || indexrelname as index,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;
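Once a suspect query surfaces, inspect its plan. EXPLAIN with ANALYZE and BUFFERS shows actual row counts and whether data came from cache or disk; the table and columns below are placeholders:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42 ORDER BY created_at DESC LIMIT 10;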
Failure Scenarios
Failover
failover_scenarios:
primary_failure:
detection: Health check timeout (10-30s)
action: Promote standby
verification: Confirm writes work
recovery: Add new standby, repair old primary
split_brain:
prevention: Fencing, leader election
detection: Multiple primaries detected
action: Stop all writes, investigate
data_corruption:
detection: Checksum failures, query errors
action: Restore from backup to point before corruption
prevention: enable data checksums (initdb --data-checksums)
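During an incident, the fastest way to establish each node's role, and a basic split-brain sanity check, is to ask every node directly:
-- Exactly one node should return false (the writable primary)
SELECT pg_is_in_recovery();
-- The timeline ID increments on promotion, which helps identify which node carries the newer history
SELECT timeline_id FROM pg_control_checkpoint();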
Runbook Example
# Database failover runbook
name: Primary Database Failure
triggers:
- Primary unreachable for > 30s
- Primary reports critical errors
steps:
- name: Verify failure
actions:
- Check primary from multiple locations
- Review monitoring dashboards
- Check network connectivity
- name: Promote standby
actions:
- pg_ctl promote -D /var/lib/postgresql/data
- Verify new primary accepts writes
- Update DNS or connection pooler
- name: Notify
actions:
- Page on-call DBA
- Update status page
- Create incident
- name: Recover
actions:
- Investigate original primary
- Rebuild as standby if possible
- Verify replication working
rollback:
- If promotion fails, restore from backup
- If data issues, stop writes and investigate
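For the "verify new primary accepts writes" step, a small write probe is usually enough. A sketch, with a hypothetical probe table name:
SELECT pg_is_in_recovery();                              -- must return false on the promoted node
CREATE TABLE IF NOT EXISTS failover_probe (ts timestamptz);
INSERT INTO failover_probe VALUES (now());
SELECT max(ts) FROM failover_probe;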
Key Takeaways
- Databases require special reliability practices due to state
- Implement replication: async for performance, sync for durability
- Backup strategy includes full backups, WAL archiving, and tested restores
- Monitor connections, replication lag, performance, and capacity
- Use connection pooling to manage connection limits
- Safe schema migrations: add nullable, backfill, then constrain
- Regular maintenance: vacuum, analyze, index maintenance
- Document and practice failover procedures
- Set appropriate timeouts and alerts
- Test recovery regularly—untested backups aren’t backups
Database reliability requires ongoing attention: build these practices into routine operations rather than improvising them during an incident.