Database replication seems straightforward: keep copies of data on multiple servers. The reality is nuanced. Different replication patterns have different tradeoffs in consistency, availability, performance, and complexity.
Here’s how to think about database replication.
Why Replicate
High Availability
Primary fails → Replica promoted → Minimal downtime
┌─────────┐ ┌─────────┐
│ Primary │ ──WAL──►│ Replica │
│ (RW) │ │ (RO) │
└─────────┘ └─────────┘
│ │
▼ ▼
Fails Becomes Primary
Without replication, a single server failure means downtime. With replication, failover can happen in seconds.
Read Scaling
┌─────────────────────────────────────────┐
│ Load Balancer │
│ (routes reads to replicas) │
└─────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Primary │ │ Replica │ │ Replica │
│ (RW) │ │ (RO) │ │ (RO) │
└─────────┘ └─────────┘ └─────────┘
One server handles writes; multiple servers handle reads.
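A minimal sketch of the routing decision the load balancer (or the application's data layer) makes; the Node class and names here are placeholders, not a real driver API:
import itertools

class Node:
    """Stand-in for a database connection; illustrative only."""
    def __init__(self, name, read_only):
        self.name = name
        self.read_only = read_only

primary = Node("primary", read_only=False)
replicas = [Node("replica-1", read_only=True), Node("replica-2", read_only=True)]
read_pool = itertools.cycle(replicas)

def route(is_write: bool) -> Node:
    # Writes always go to the primary; reads rotate across the replicas
    return primary if is_write else next(read_pool)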
Geographic Distribution
US Region EU Region
┌─────────┐ ┌─────────┐
│ Primary │ ───WAL────► │ Replica │
│ (NYC) │ │ (FRA) │
└─────────┘ └─────────┘
Users read from nearby replicas for lower latency.
Replication Modes
Asynchronous Replication
Primary doesn’t wait for replicas to confirm:
Primary: COMMIT → Return success → Send to replica (background)
Characteristics:
- Fast commits
- No added write latency
- Possible data loss on failover
- Replication lag possible
# PostgreSQL async replication
# In postgresql.conf on the primary
wal_level = replica
max_wal_senders = 3
# In pg_hba.conf on the primary
host replication replicator replica_ip/32 md5
# On the replica: primary_conninfo goes in postgresql.conf plus an empty
# standby.signal file (PostgreSQL 12+); older versions use recovery.conf
primary_conninfo = 'host=primary_ip user=replicator'
When to use:
- Read scaling where slight staleness is acceptable
- Geographic replicas where latency makes sync impractical
- Backup replicas not used for failover
Synchronous Replication
Primary waits for at least one replica:
Primary: COMMIT → Wait for replica ACK → Return success
Characteristics:
- No data loss on failover
- Added write latency
- Replica failure can block writes
- Stronger consistency
# PostgreSQL sync replication
# In postgresql.conf on the primary
synchronous_commit = on
synchronous_standby_names = 'replica1'
When to use:
- Financial transactions
- Regulatory requirements
- When data loss is unacceptable
Semi-Synchronous
Hybrid approach:
# PostgreSQL: quorum-based
synchronous_standby_names = 'ANY 1 (replica1, replica2, replica3)'
# Commit succeeds once ANY 1 of the 3 listed replicas confirms
Better availability than full sync, better durability than async.
Replication Topologies
Primary-Replica (Master-Slave)
┌─────────┐
│ Primary │
│ (RW) │
└────┬────┘
┌────┴────┐
▼ ▼
┌─────────┐ ┌─────────┐
│ Replica │ │ Replica │
│ (RO) │ │ (RO) │
└─────────┘ └─────────┘
Simple and common. One writer, multiple readers.
Cascading Replication
┌─────────┐
│ Primary │
└────┬────┘
▼
┌─────────┐
│ Replica │ (also replicates)
└────┬────┘
▼
┌─────────┐
│ Replica │
└─────────┘
Reduces load on primary. Useful for many replicas or geographic distribution.
# PostgreSQL cascading replication
# The intermediate replica needs:
hot_standby = on
max_wal_senders = 3
# Downstream replicas point primary_conninfo at the intermediate replica
Multi-Primary (Multi-Master)
┌─────────┐ ┌─────────┐
│ Primary │◄───►│ Primary │
│ US │ │ EU │
└─────────┘ └─────────┘
Both nodes accept writes. Conflict resolution required.
Challenges:
- Conflict resolution complexity
- Eventual consistency
- Split-brain prevention
Technologies:
- MySQL Group Replication
- PostgreSQL BDR (Bi-Directional Replication)
- CockroachDB, TiDB (distributed SQL)
Ring Replication
┌─────────┐
│ Node 1 │
└────┬────┘
│
┌────▼────┐
│ Node 2 │
└────┬────┘
│
┌────▼────┐
│ Node 3 │───► Node 1
└─────────┘
Each node replicates to the next. Simple but fragile (one failure breaks the ring).
Conflict Resolution
For multi-primary setups:
Last-Writer-Wins (LWW)
-- Concurrent updates
Node A: UPDATE users SET email='a@new.com' WHERE id=1; -- ts: 10:00:01
Node B: UPDATE users SET email='b@new.com' WHERE id=1; -- ts: 10:00:02
-- Node B's change wins (later timestamp)
Simple but can lose data.
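A sketch of the idea in code, assuming each replica stores a (value, timestamp) pair per field and merges by keeping the later timestamp; the class is illustrative, not any particular database's implementation:
class LWWRegister:
    """Last-writer-wins register: the write with the later timestamp survives."""
    def __init__(self):
        self.value = None
        self.ts = 0.0

    def set(self, value, ts):
        if ts >= self.ts:
            self.value, self.ts = value, ts

    def merge(self, other):
        # The concurrent write with the later timestamp silently wins
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts

# Two nodes update the same field concurrently, then sync
a, b = LWWRegister(), LWWRegister()
a.set("a@new.com", ts=1.0)
b.set("b@new.com", ts=2.0)
a.merge(b)
b.merge(a)
assert a.value == b.value == "b@new.com"   # node A's write is lost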
Merge/Custom Resolution
-- Application-defined merge (the accounts table here is illustrative)
CREATE OR REPLACE FUNCTION resolve_conflict(old_row accounts, new_row accounts)
RETURNS accounts AS $$
BEGIN
    -- Merge strategy: keep the higher balance
    IF new_row.balance > old_row.balance THEN
        RETURN new_row;
    ELSE
        RETURN old_row;
    END IF;
END;
$$ LANGUAGE plpgsql;
CRDT-Based
Conflict-free Replicated Data Types:
# G-Counter (grow-only counter)
# Each node maintains its own count
class GCounter:
    def __init__(self, node_id):
        self.counts = {}
        self.node_id = node_id

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
No conflicts by design, but limited data types.
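For example, two nodes can increment independently and exchange state in any order; after merging, both converge on the same total (using the GCounter class above):
a = GCounter("node-a")
b = GCounter("node-b")

a.increment()
a.increment()        # node A has counted 2
b.increment()        # node B has counted 1

a.merge(b)           # merge is commutative and idempotent
b.merge(a)

assert a.value() == b.value() == 3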
Failover Strategies
Manual Failover
# 1. Verify the primary is really down (avoid split brain)
# 2. Fence the old primary so it cannot accept writes
# 3. Promote the replica (run on the replica while it is running)
pg_ctl promote -D /var/lib/postgresql/data
# 4. Update application connection strings
# 5. Rebuild the old primary as a replica (pg_rewind or a fresh base backup)
Safe but slow (minutes to hours).
Automated Failover
# Patroni configuration
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      parameters:
        synchronous_commit: "on"
Tools: Patroni, repmgr, pg_auto_failover (PostgreSQL); MHA, Orchestrator (MySQL). These are typically paired with HAProxy or PgBouncer so clients follow the new primary.
Faster (seconds) but requires careful configuration.
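To see why maximum_lag_on_failover matters, here is a rough sketch (not Patroni's actual code) of the decision an automated failover manager makes before promoting a candidate; replica objects with a replayed_lsn attribute are an assumption for illustration:
MAX_LAG_BYTES = 1_048_576   # mirrors maximum_lag_on_failover above

def pick_failover_candidate(replicas, last_known_primary_lsn):
    """Promote the most caught-up replica, but only if its lag is acceptable."""
    best = max(replicas, key=lambda r: r.replayed_lsn, default=None)
    if best is None:
        return None
    lag_bytes = last_known_primary_lsn - best.replayed_lsn
    if lag_bytes > MAX_LAG_BYTES:
        return None   # too far behind: refuse automatic promotion, page an operator
    return best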
Proxy-Based
┌─────────────────────────────────────────┐
│ PgBouncer │
│ (routes clients to the current primary) │
└─────────────────────────────────────────┘
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ Primary │ │ Replica │
└─────────┘ └─────────┘
Application connects to proxy; proxy handles routing.
Read-Your-Writes Consistency
With async replication, users might not see their own writes:
User writes → Primary
User reads → Replica (hasn't received write yet)
Solutions:
Session Affinity
Route user to same replica:
def get_connection(user_id, is_write):
    if is_write:
        return primary_connection
    else:
        # Hash the user id so the same user always reads from the same replica
        replica = replicas[hash(user_id) % len(replicas)]
        return replica
Read-from-Primary After Write
def get_connection(user_id, is_write):
    if is_write:
        cache.set(f"recent_write:{user_id}", True, ttl=5)
        return primary_connection
    if cache.get(f"recent_write:{user_id}"):
        return primary_connection  # Read from primary briefly
    return random.choice(replicas)
GTID Tracking
def write_data(data):
    result = primary.execute("INSERT ... RETURNING *")
    gtid = primary.get_gtid()
    return result, gtid

def read_data(user_id, min_gtid=None):
    if min_gtid:
        # Wait for replica to catch up
        replica.wait_for_gtid(min_gtid, timeout=1)
    return replica.execute("SELECT ...")
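In PostgreSQL the equivalent uses WAL positions (LSNs) rather than GTIDs. A rough sketch using psycopg2, assuming primary_conn and replica_conn are open connections; pg_current_wal_lsn(), pg_last_wal_replay_lsn(), and pg_wal_lsn_diff() are the built-in functions involved:
import time

def write_and_get_lsn(primary_conn, sql, params):
    # Run the write on the primary and capture the WAL position it produced
    with primary_conn.cursor() as cur:
        cur.execute(sql, params)
        cur.execute("SELECT pg_current_wal_lsn()")
        return cur.fetchone()[0]

def wait_for_lsn(replica_conn, target_lsn, timeout=1.0):
    # Poll until the replica has replayed past target_lsn, or give up
    deadline = time.monotonic() + timeout
    with replica_conn.cursor() as cur:
        while time.monotonic() < deadline:
            cur.execute(
                "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s::pg_lsn) >= 0",
                (target_lsn,),
            )
            if cur.fetchone()[0]:
                return True
            time.sleep(0.05)
    return False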
Monitoring Replication
Key Metrics
-- PostgreSQL replication lag (run on the primary)
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag
FROM pg_stat_replication;
-- Lag in seconds, approximate (run on the replica)
SELECT
    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
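The lag query is easy to wrap in a small check that feeds whatever alerting pipeline you use; a sketch with psycopg2, using the same thresholds as the alert rules below:
import psycopg2

WARN_SECONDS, CRIT_SECONDS = 30, 300   # match the alert thresholds below

def check_replica_lag(replica_dsn):
    # Run against the replica; returns (lag_seconds, severity)
    with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        lag = cur.fetchone()[0]
    if lag is None:
        return None, "unknown"   # not a standby, or nothing replayed yet
    if lag > CRIT_SECONDS:
        return lag, "critical"
    if lag > WARN_SECONDS:
        return lag, "warning"
    return lag, "ok"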
Alerts
alerts:
  - name: ReplicationLagHigh
    condition: replication_lag_seconds > 30
    severity: warning
  - name: ReplicationLagCritical
    condition: replication_lag_seconds > 300
    severity: critical
  - name: ReplicaDisconnected
    condition: replica_state != 'streaming'
    severity: critical
Key Takeaways
- Async replication is faster but risks data loss; sync is safer but slower
- Primary-replica is simple and sufficient for most read scaling
- Multi-primary adds complexity; use only when geographically necessary
- Automated failover requires careful configuration and testing
- Read-your-writes consistency needs explicit handling with async replication
- Monitor replication lag continuously; stale reads cause subtle bugs
- Choose replication mode based on durability requirements, not convenience
- Test failover regularly; untested failover often fails
Replication isn’t free. Understand the tradeoffs and choose patterns that match your actual requirements.