Sharding—partitioning data across multiple database instances—is often mentioned as a scaling solution. But it’s also one of the most complex architectural decisions you can make. Most applications don’t need sharding, and many that implement it do so incorrectly.
Here’s how to know when you need sharding and how to do it right.
When You Actually Need Sharding
Signs You Might Need Sharding
Data size exceeds single instance capacity:
- Dataset larger than practical for one server
- Backup and restore times unacceptable
- Storage costs for single large instance prohibitive
Write throughput exceeds capacity:
- Write operations saturating disk I/O
- Replication lag growing unbounded
- Lock contention on write-heavy tables
Read replicas aren’t enough:
- Reads scale with replicas, writes don’t
- Write-heavy workloads can’t use read replicas
- Replica lag creates consistency issues
Signs You Don’t Need Sharding
You haven’t optimized your current setup:
- Missing indexes
- Inefficient queries
- Wrong instance size
- No connection pooling
- No caching layer
Your data fits on one server:
- Modern servers handle terabytes
- NVMe SSDs keep single-node I/O fast even at terabyte scale
- Managed databases auto-scale storage
Read replicas would solve your problem:
- Read-heavy workloads scale with replicas
- Much simpler than sharding
You have fewer than roughly 100M rows:
- Properly indexed tables handle hundreds of millions of rows
- Sharding adds complexity that outweighs benefits
Sharding Strategies
Horizontal Sharding (Range-Based)
Partition data by ranges of a shard key:
Shard 1: user_id 1-1,000,000
Shard 2: user_id 1,000,001-2,000,000
Shard 3: user_id 2,000,001-3,000,000
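A minimal range lookup, assuming an in-application range table (the ranges and shard names below are illustrative):

# Illustrative range table: (upper bound of user_id range, shard name)
RANGES = [
    (1_000_000, "shard_1"),
    (2_000_000, "shard_2"),
    (3_000_000, "shard_3"),
]

def get_shard_by_range(user_id):
    # Return the first shard whose upper bound covers this user_id.
    for upper_bound, shard in RANGES:
        if user_id <= upper_bound:
            return shard
    raise ValueError(f"user_id {user_id} is outside all known ranges")

# get_shard_by_range(1_500_000) → "shard_2"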
Pros:
- Efficient range queries
- Easy to understand
- Predictable data location
Cons:
- Hotspots if distribution uneven
- Rebalancing is complex
- New ranges require new shards
Hash-Based Sharding
Hash the shard key to determine location:
def get_shard(user_id, num_shards):
    # In CPython, hash() of a small int is the int itself, so this is plain modulo.
    return hash(user_id) % num_shards

# With num_shards=4:
# user_id=12345 → shard 1
# user_id=67890 → shard 2
Pros:
- Even distribution
- No hotspots (with good hash function)
- Simple shard selection
Cons:
- Range queries require all shards
- Adding shards requires data movement
- Hash function choice matters
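The data-movement con is worth quantifying: with modulo placement, changing the shard count remaps most keys, which is exactly the problem consistent hashing (next) addresses. A rough sketch:

# Rough sketch: how many keys change shards when going from 4 to 5 shards?
moved = sum(1 for user_id in range(100_000)
            if user_id % 4 != user_id % 5)
print(f"{moved / 100_000:.0%} of keys would move")  # roughly 80%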
Consistent Hashing
Distribute shards on a ring, minimize rebalancing:
import bisect
import hashlib

def stable_hash(value):
    # Python's built-in hash() is randomized per process for strings,
    # so use a stable hash for ring placement.
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, nodes, virtual_nodes=100):
        self.ring = {}
        for node in nodes:
            for i in range(virtual_nodes):
                key = stable_hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key):
        # Walk clockwise to the first ring position at or after the key's hash.
        hash_key = stable_hash(key)
        idx = bisect.bisect_left(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around the ring
        return self.ring[self.sorted_keys[idx]]
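A rough way to see the rebalancing benefit, using the class above (node names are illustrative): adding a fourth node moves only the keys that now belong to it, roughly a quarter of them, instead of the ~80% that naive modulo would remap.

# Illustrative check: how many of 10,000 keys move when a fourth node joins?
ring_3 = ConsistentHash(["db1", "db2", "db3"])
ring_4 = ConsistentHash(["db1", "db2", "db3", "db4"])
moved = sum(1 for i in range(10_000)
            if ring_3.get_node(f"user:{i}") != ring_4.get_node(f"user:{i}"))
print(f"{moved / 10_000:.0%} of keys moved")  # roughly 25% with even load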
Pros:
- Adding/removing shards moves minimal data
- Better load balancing
- Industry proven
Cons:
- More complex implementation
- Still can’t do efficient range queries
Directory-Based Sharding
Lookup table maps keys to shards:
CREATE TABLE shard_directory (
    key_value VARCHAR PRIMARY KEY,
    shard_id  INT NOT NULL
);

-- Lookup: which shard has user_12345?
SELECT shard_id FROM shard_directory WHERE key_value = 'user_12345';
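The extra lookup is usually softened by caching directory entries in the application. A minimal sketch, assuming a hypothetical directory_db connection handle:

import functools

@functools.lru_cache(maxsize=100_000)
def lookup_shard(key_value):
    # Cache directory hits in-process (clear the cache when records move).
    row = directory_db.execute(
        "SELECT shard_id FROM shard_directory WHERE key_value = ?", (key_value,)
    ).fetchone()
    return row[0] if row else None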
Pros:
- Complete flexibility
- Can move individual records
- Handles any distribution pattern
Cons:
- Directory is a bottleneck
- Extra lookup per query
- Directory needs high availability
Choosing a Shard Key
The shard key determines everything. Choose poorly and you’ll regret it.
Good Shard Keys
Properties:
- High cardinality (many unique values)
- Even distribution
- Commonly used in queries
- Stable (doesn’t change)
Examples:
- User ID for user-specific data
- Tenant ID for multi-tenant applications
- Order ID for order data
Bad Shard Keys
Low cardinality:
# Bad - only a few values
shard_key = country_code # 200 countries, uneven distribution
shard_key = status # 5 statuses
Hotspots:
# Bad - new data clusters on one shard
shard_key = created_date # Today's shard gets all writes
Mutable:
# Bad - changing key requires data movement
shard_key = email # Users change email
Implementation Patterns
Application-Level Sharding
Application determines shard routing:
class ShardedRepository:
    def __init__(self, shards):
        self.shards = shards  # dict of shard_id → connection

    def get_shard(self, user_id):
        shard_id = hash(user_id) % len(self.shards)
        return self.shards[shard_id]

    def get_user(self, user_id):
        shard = self.get_shard(user_id)
        return shard.execute("SELECT * FROM users WHERE id = ?", (user_id,))

    def create_user(self, user):
        shard = self.get_shard(user.id)
        shard.execute("INSERT INTO users VALUES (...)", user)
Pros:
- Full control over routing
- No proxy latency
- Flexible query patterns
Cons:
- Sharding logic in application
- Cross-shard operations complex
- All applications need shard awareness
Proxy-Based Sharding
Proxy routes queries to appropriate shard:
Application → Proxy (Vitess, ProxySQL) → Shards
-- Application writes normal SQL
SELECT * FROM users WHERE id = 12345;
-- Proxy rewrites and routes to correct shard
-- Shard 2: SELECT * FROM users WHERE id = 12345;
Pros:
- Application sees single database
- Centralized routing logic
- Easier migration path
Cons:
- Proxy is another component to operate
- Adds latency
- Proxy limitations on query types
Database-Native Sharding
Some databases have built-in sharding:
- CockroachDB: Automatic sharding
- TiDB: MySQL-compatible distributed database
- Citus: PostgreSQL extension for sharding
- Vitess: MySQL sharding middleware
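Citus, for example, turns a regular PostgreSQL table into a distributed one with a single function call (the table and shard key here are illustrative):

-- Citus: shard the orders table by user_id across worker nodes
CREATE TABLE orders (id bigserial, user_id bigint, status text, total numeric);
SELECT create_distributed_table('orders', 'user_id');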
Pros:
- Vendor-supported
- Less custom code
- Often handles cross-shard queries
Cons:
- Vendor lock-in
- May not fit specific requirements
- Performance characteristics differ
Cross-Shard Challenges
Cross-Shard Queries
Queries spanning shards are expensive:
-- This hits ALL shards
SELECT * FROM orders WHERE status = 'pending';

# Scatter-gather pattern
results = []
for shard in shards:
    results += shard.query("SELECT * FROM orders WHERE status = 'pending'")
Mitigations:
- Include shard key in all queries
- Denormalize data to avoid joins
- Use global tables for reference data
- Accept eventual consistency for analytics
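The first mitigation matters most: when the shard key appears in the query, the router can send it to exactly one shard instead of scattering it:

-- Routed to a single shard, because user_id is the shard key
SELECT * FROM orders WHERE user_id = 12345 AND status = 'pending';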
Cross-Shard Joins
Joins across shards are very expensive:
-- Bad: requires cross-shard join
SELECT orders.*, users.name
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.status = 'pending';
Solutions:
- Denormalize: store user.name in orders table
- Application-level join: fetch separately, join in code
- Avoid: design schemas to not need cross-shard joins
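The application-level join is usually a few lines of code. A sketch, assuming hypothetical repository methods for the two tables:

def pending_orders_with_names(order_repo, user_repo):
    # Fetch orders first, then batch-fetch the referenced users and join in memory.
    orders = order_repo.find_by_status("pending")               # may scatter-gather
    users = user_repo.get_many({o["user_id"] for o in orders})  # routed by shard key
    users_by_id = {u["id"]: u for u in users}
    return [
        {**order, "user_name": users_by_id[order["user_id"]]["name"]}
        for order in orders
        if order["user_id"] in users_by_id
    ]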
Cross-Shard Transactions
Distributed transactions are complex:
# Dangerous: cross-shard transaction
with transaction():
    shard1.debit(account_a, 100)
    shard2.credit(account_b, 100)
# What if shard2 fails after shard1 commits?
Options:
- Avoid: keep related data on same shard
- Saga pattern: compensating transactions
- Two-phase commit: complex, impacts performance
- Accept eventual consistency where possible
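A minimal sketch of the saga approach for the transfer above: each step has a compensating action that runs if a later step fails (the debit/credit methods are illustrative, and a production saga also persists its progress so it can retry or compensate after a crash):

def transfer(shard1, shard2, account_a, account_b, amount):
    shard1.debit(account_a, amount)
    try:
        shard2.credit(account_b, amount)
    except Exception:
        shard1.credit(account_a, amount)  # compensating transaction (refund)
        raise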
Operations
Rebalancing
Moving data between shards:
def rebalance_shard(old_shards, new_shards):
    # 1. Add new shards
    # 2. Start copying data to new locations
    # 3. Enable dual-writes
    # 4. Backfill remaining data
    # 5. Verify consistency
    # 6. Switch reads to new shards
    # 7. Stop writes to old locations
    # 8. Cleanup
    ...
This is complex. Plan for it before you need it.
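Step 3 is usually the part that needs application support. A rough sketch of a dual-write wrapper, assuming the old and new routers share a read/write interface:

class DualWriteRouter:
    # During rebalancing, writes go to both topologies; reads stay on the
    # old topology until the backfill has been verified.
    def __init__(self, old_router, new_router):
        self.old = old_router
        self.new = new_router

    def write(self, key, row):
        self.old.write(key, row)
        self.new.write(key, row)  # failures here must be logged and reconciled

    def read(self, key):
        return self.old.read(key)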
Monitoring
Track per-shard:
- Query volume and latency
- Data size
- Connection count
- Replication lag (if applicable)
Alert on:
- Shard imbalance
- Hot shards
- Connection exhaustion
- Unusual query patterns
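A simple imbalance check can run on a schedule. A sketch, assuming each shard connection can report a row count (the query and threshold are illustrative):

def check_shard_balance(shards, threshold=1.5):
    # Flag shards whose row count exceeds the fleet average by the given factor.
    counts = {
        shard_id: conn.execute("SELECT count(*) FROM orders").fetchone()[0]
        for shard_id, conn in shards.items()
    }
    average = sum(counts.values()) / len(counts)
    return [shard_id for shard_id, n in counts.items() if n > threshold * average]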
Backup and Recovery
- Backup all shards consistently (or accept point-in-time differences)
- Test recovery of individual shards
- Document shard topology
- Automate shard provisioning
Key Takeaways
- Sharding is complex; exhaust simpler options first (optimization, caching, read replicas)
- Choose shard key carefully: high cardinality, even distribution, stable, commonly queried
- Hash-based sharding provides even distribution; consistent hashing minimizes rebalancing
- Cross-shard queries and joins are expensive; design to minimize them
- Cross-shard transactions are very complex; avoid or use eventual consistency
- Consider proxy-based or database-native sharding to reduce application complexity
- Plan for rebalancing before you need it
- Monitor per-shard metrics to detect hotspots early
Sharding enables massive scale, but at significant complexity cost. Most applications don’t need it. If you do, invest in doing it right.