Database Reliability Engineering

August 9, 2021

Databases are often the most critical and fragile component of any system. Stateless services can be restarted and scaled easily; databases hold state that must survive failures. Database reliability engineering applies SRE principles specifically to data stores.

Here’s how to keep databases reliable.

The Database Reliability Challenge

Why Databases Are Different

database_challenges:
  state:
    - Can't just restart and hope
    - Data must survive failures
    - Recovery is complex

  performance:
    - Query patterns matter
    - Indexes need maintenance
    - Connection limits exist

  scaling:
    - Vertical limits exist
    - Horizontal requires sharding
    - Replication adds complexity

  changes:
    - Schema migrations are risky
    - Can't easily roll back data
    - Changes affect performance

Reliability Fundamentals

Replication

# PostgreSQL streaming replication
replication_config:
  primary:
    wal_level: replica
    max_wal_senders: 10
    synchronous_standby_names: 'standby1'  # set only for sync/semi-sync; leave empty for async

  standby:
    primary_conninfo: 'host=primary port=5432 user=replicator'
    hot_standby: on
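
The primary_conninfo above assumes a replicator role exists on the primary. A minimal sketch of creating it, plus an optional physical replication slot so the primary retains WAL for a disconnected standby (the password is a placeholder):

-- On the primary
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';
SELECT pg_create_physical_replication_slot('standby1');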

replication_modes:
  async:
    pros: Lower latency, no blocking
    cons: Potential data loss on failover
    rpo: Seconds to minutes

  sync:
    pros: Zero data loss (RPO=0)
    cons: Higher latency, blocking on standby failure
    recommendation: Use for critical data

  semi_sync:
    pros: Balance of durability and performance
    cons: More complex configuration
    implementation: Wait for at least one standby
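
Whichever mode you choose, verify it from the primary; pg_stat_replication shows each connected standby and whether PostgreSQL treats it as synchronous:

-- Connected standbys and their sync status (run on the primary)
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;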

Backup Strategy

backup_strategy:
  full_backups:
    frequency: Weekly
    retention: 4 weeks
    method: pg_basebackup or cloud snapshot

  incremental:
    frequency: Daily
    retention: 2 weeks
    method: WAL archiving

  wal_archiving:
    purpose: Point-in-time recovery
    target: S3 or dedicated backup storage
    retention: 2 weeks minimum

  testing:
    frequency: Monthly
    process: Restore to test environment
    validation: Query sample data

#!/bin/bash
# PostgreSQL backup script
set -euo pipefail

BACKUP_DIR="/backups/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Base backup
pg_basebackup -h localhost -U replicator \
  -D "$BACKUP_DIR/base" \
  -Ft -z -P

# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://backups/postgres/$(date +%Y%m%d)/"

# WAL archiving runs continuously and is configured in postgresql.conf, not in this script:
# archive_command = 'aws s3 cp %p s3://backups/postgres/wal/%f'
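
The strategy above calls for monthly restore tests that query sample data. A sketch of that validation step; orders and updated_at are placeholders for a table you expect to contain recent rows:

-- Run against the restored test instance, never production
SELECT count(*) AS row_count,
       max(updated_at) AS newest_row
FROM orders;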

Point-in-Time Recovery

# PITR restoration
# 1. Stop PostgreSQL
# 2. Clear data directory
# 3. Restore base backup
# 4. Configure recovery (on PG >= 12, also create an empty recovery.signal file)

# recovery.conf (PG < 12) or postgresql.conf (PG >= 12)
restore_command = 'aws s3 cp s3://backups/postgres/wal/%f %p'
recovery_target_time = '2021-08-09 14:30:00 UTC'
recovery_target_action = 'promote'
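
After the server finishes replaying WAL, confirm how far recovery progressed and whether the instance has been promoted; these are built-in functions:

-- On the restored instance: recovery status and replay position
SELECT pg_is_in_recovery(),
       pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp();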

Monitoring

Key Metrics

database_metrics:
  availability:
    - Connection success rate
    - Replication lag
    - Failover time

  performance:
    - Query latency (p50, p95, p99)
    - Queries per second
    - Lock wait time
    - Buffer cache hit ratio

  capacity:
    - Connection count / limit
    - Disk usage and growth
    - CPU utilization
    - Memory usage

  health:
    - Replication status
    - Vacuum progress
    - Dead tuple count
    - Index bloat
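
Most of these come from an exporter, but they can also be pulled ad hoc from the statistics views. A sketch for connection utilization and cache hit ratio:

-- Connection count vs. configured limit
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;

-- Buffer cache hit ratio per database
SELECT datname,
       blks_hit::float / nullif(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database;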

Prometheus Queries

# PostgreSQL Exporter metrics
queries:
  connection_utilization:
    query: pg_stat_activity_count / pg_settings_max_connections
    alert_threshold: 0.8

  replication_lag_bytes:
    query: pg_replication_slots_wal_status
    alert_threshold: 100MB

  slow_queries:
    query: rate(pg_stat_statements_seconds_total[5m])
    alert_threshold: varies

  cache_hit_ratio:
    query: |
      pg_stat_database_blks_hit /
      (pg_stat_database_blks_hit + pg_stat_database_blks_read)
    alert_threshold: below 0.95

Alerting

# Prometheus alerts
groups:
  - name: database
    rules:
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical

      - alert: ReplicationLag
        expr: pg_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning

      - alert: ConnectionsNearLimit
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        expr: pg_database_size_bytes / node_filesystem_size_bytes > 0.85
        for: 10m
        labels:
          severity: warning

Operational Practices

Connection Management

// Connection pooling with PgBouncer or application-level

// Application-level pooling (database/sql; assumes a registered driver,
// e.g. _ "github.com/lib/pq", and connStr holding your connection string)
db, err := sql.Open("postgres", connStr)
if err != nil {
    log.Fatal(err)
}

// Configure pool
db.SetMaxOpenConns(25)                 // Match PgBouncer or direct limit
db.SetMaxIdleConns(25)                 // Keep connections ready
db.SetConnMaxLifetime(5 * time.Minute) // Refresh connections periodically

// Health check
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := db.PingContext(ctx); err != nil {
    log.Printf("database health check failed: %v", err)
}

# PgBouncer configuration
[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
server_idle_timeout = 600
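
For visibility into the pooler itself, PgBouncer exposes an admin console (connect with psql to the special pgbouncer database as an admin user); two of its built-in commands show pool saturation and traffic:

-- Issued against the PgBouncer admin console, not the real database
SHOW POOLS;
SHOW STATS;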

Schema Migrations

migration_best_practices:
  testing:
    - Test on production-like data volume
    - Measure migration duration
    - Have rollback plan

  online_migrations:
    - Add columns as nullable first
    - Backfill in batches
    - Add NOT NULL constraint after
    - Use CREATE INDEX CONCURRENTLY

  safety:
    - Wrap in transaction where possible
    - Set lock_timeout
    - Monitor for blocking
    - Have abort plan

-- Safe migration pattern
BEGIN;

-- Set a lock timeout so the ALTER fails fast instead of waiting behind long-held locks
SET lock_timeout = '5s';

-- Add nullable column (fast, no rewrite)
ALTER TABLE orders ADD COLUMN status_new text;

COMMIT;

-- Backfill in batches (outside a transaction); repeat until no rows are updated
UPDATE orders
SET status_new = status
WHERE id IN (SELECT id FROM orders WHERE status_new IS NULL LIMIT 10000);

-- Later: add constraint, remove old column
ALTER TABLE orders ALTER COLUMN status_new SET NOT NULL;
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_new TO status;
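
The best-practices list above also calls for CREATE INDEX CONCURRENTLY when adding indexes; a sketch, with an illustrative index name and column. It cannot run inside a transaction block, and if it fails it leaves an INVALID index behind that must be dropped and rebuilt:

-- Build the index without taking a long write-blocking lock
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);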

Vacuum and Maintenance

vacuum_strategy:
  autovacuum:
    enabled: true
    settings:
      autovacuum_vacuum_scale_factor: 0.1
      autovacuum_analyze_scale_factor: 0.05
      autovacuum_vacuum_cost_delay: 2ms

  manual_vacuum:
    when: After bulk operations
    command: VACUUM (VERBOSE, ANALYZE) table_name;

  monitoring:
    - Dead tuple count
    - Last vacuum time
    - Vacuum progress
    - Transaction ID wraparound
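
To see whether autovacuum is keeping up, check dead tuples and last vacuum times per table, and watch transaction ID age for wraparound risk:

-- Tables with the most dead tuples and their last (auto)vacuum times
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Transaction ID age per database (wraparound occurs near ~2 billion)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;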

Query Performance

-- Identify slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL < 13 the columns are total_time / mean_time)
SELECT
    query,
    calls,
    total_exec_time / 1000 as total_seconds,
    mean_exec_time as avg_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Find missing indexes
SELECT
    schemaname || '.' || relname as table,
    seq_scan,
    seq_tup_read,
    idx_scan,
    seq_tup_read / seq_scan as avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Check index usage
SELECT
    schemaname || '.' || indexrelname as index,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;
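
Once a suspect query surfaces, look at its actual plan; the table and predicate here are placeholders, and since ANALYZE executes the statement, wrap data-modifying queries in a transaction you roll back:

-- Inspect the actual plan, timing, and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';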

Failure Scenarios

Failover

failover_scenarios:
  primary_failure:
    detection: Health check timeout (10-30s)
    action: Promote standby
    verification: Confirm writes work
    recovery: Add new standby, repair old primary

  split_brain:
    prevention: Fencing, leader election
    detection: Multiple primaries detected
    action: Stop all writes, investigate

  data_corruption:
    detection: Checksum failures, query errors
    action: Restore from backup to point before corruption
    prevention: enable data checksums (initdb --data-checksums)
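
During failover or a suspected split brain, the fastest role check is to ask each node directly whether it is in recovery (true means standby, false means it accepts writes):

-- Run on every node; exactly one should return false
SELECT pg_is_in_recovery();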

Runbook Example

# Database failover runbook
name: Primary Database Failure
triggers:
  - Primary unreachable for > 30s
  - Primary reports critical errors

steps:
  - name: Verify failure
    actions:
      - Check primary from multiple locations
      - Review monitoring dashboards
      - Check network connectivity

  - name: Promote standby
    actions:
      - pg_ctl promote -D /var/lib/postgresql/data
      - Verify new primary accepts writes
      - Update DNS or connection pooler

  - name: Notify
    actions:
      - Page on-call DBA
      - Update status page
      - Create incident

  - name: Recover
    actions:
      - Investigate original primary
      - Rebuild as standby if possible
      - Verify replication working

rollback:
  - If promotion fails, restore from backup
  - If data issues, stop writes and investigate

Key Takeaways

Database reliability requires ongoing attention: replicate and back up, test your restores, watch replication lag and capacity, treat schema changes as risky deployments, and rehearse failover before you need it. Build these practices into day-to-day operations rather than bolting them on after an incident.