Database Sharding: When and How

August 6, 2018

Sharding—partitioning data across multiple database instances—is often mentioned as a scaling solution. But it’s also one of the most complex architectural decisions you can make. Most applications don’t need sharding, and many that implement it do so incorrectly.

Here’s how to know when you need sharding and how to do it right.

When You Actually Need Sharding

Signs You Might Need Sharding

Data size exceeds single instance capacity:

Write throughput exceeds capacity:

Read replicas aren’t enough:

Signs You Don’t Need Sharding

You haven’t optimized your current setup:

Your data fits on one server:

Read replicas would solve your problem:

You have less than 100M rows:

Sharding Strategies

Horizontal Sharding (Range-Based)

Partition data by ranges of a shard key:

Shard 1: user_id 1-1,000,000
Shard 2: user_id 1,000,001-2,000,000
Shard 3: user_id 2,000,001-3,000,000

Pros:

Cons:

Hash-Based Sharding

Hash the shard key to determine location:

def get_shard(user_id, num_shards):
    return hash(user_id) % num_shards

# user_id=12345 → shard 2
# user_id=67890 → shard 0

Pros:

Cons:

Consistent Hashing

Distribute shards on a ring, minimize rebalancing:

class ConsistentHash:
    def __init__(self, nodes, virtual_nodes=100):
        self.ring = {}
        for node in nodes:
            for i in range(virtual_nodes):
                key = hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def get_node(self, key):
        hash_key = hash(key)
        for ring_key in self.sorted_keys:
            if hash_key <= ring_key:
                return self.ring[ring_key]
        return self.ring[self.sorted_keys[0]]

Pros:

Cons:

Directory-Based Sharding

Lookup table maps keys to shards:

CREATE TABLE shard_directory (
    key_value VARCHAR PRIMARY KEY,
    shard_id INT NOT NULL
);

-- Lookup: Which shard has user_12345?
SELECT shard_id FROM shard_directory WHERE key_value = 'user_12345';

Pros:

Cons:

Choosing a Shard Key

The shard key determines everything. Choose poorly and you’ll regret it.

Good Shard Keys

Properties:

Examples:

Bad Shard Keys

Low cardinality:

# Bad - only a few values
shard_key = country_code  # 200 countries, uneven distribution
shard_key = status        # 5 statuses

Hotspots:

# Bad - new data clusters on one shard
shard_key = created_date  # Today's shard gets all writes

Mutable:

# Bad - changing key requires data movement
shard_key = email  # Users change email

Implementation Patterns

Application-Level Sharding

Application determines shard routing:

class ShardedRepository:
    def __init__(self, shards):
        self.shards = shards  # Dict of shard_id → connection

    def get_shard(self, user_id):
        shard_id = hash(user_id) % len(self.shards)
        return self.shards[shard_id]

    def get_user(self, user_id):
        shard = self.get_shard(user_id)
        return shard.execute("SELECT * FROM users WHERE id = ?", user_id)

    def create_user(self, user):
        shard = self.get_shard(user.id)
        shard.execute("INSERT INTO users VALUES (...)", user)

Pros:

Cons:

Proxy-Based Sharding

Proxy routes queries to appropriate shard:

Application → Proxy (Vitess, ProxySQL) → Shards
-- Application writes normal SQL
SELECT * FROM users WHERE id = 12345;

-- Proxy rewrites and routes to correct shard
-- Shard 2: SELECT * FROM users WHERE id = 12345;

Pros:

Cons:

Database-Native Sharding

Some databases have built-in sharding:

Pros:

Cons:

Cross-Shard Challenges

Cross-Shard Queries

Queries spanning shards are expensive:

-- This hits ALL shards
SELECT * FROM orders WHERE status = 'pending';

-- Scatter-gather pattern
for shard in shards:
    results += shard.query("SELECT * FROM orders WHERE status = 'pending'")

Mitigations:

Cross-Shard Joins

Joins across shards are very expensive:

-- Bad: requires cross-shard join
SELECT orders.*, users.name
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.status = 'pending';

Solutions:

Cross-Shard Transactions

Distributed transactions are complex:

# Dangerous: cross-shard transaction
with transaction():
    shard1.debit(account_a, 100)
    shard2.credit(account_b, 100)
    # What if shard2 fails after shard1 commits?

Options:

Operations

Rebalancing

Moving data between shards:

def rebalance_shard(old_shards, new_shards):
    # 1. Add new shards
    # 2. Start copying data to new locations
    # 3. Enable dual-writes
    # 4. Backfill remaining data
    # 5. Verify consistency
    # 6. Switch reads to new shards
    # 7. Stop writes to old locations
    # 8. Cleanup

This is complex. Plan for it before you need it.

Monitoring

Track per-shard:

Alert on:

Backup and Recovery

Key Takeaways

Sharding enables massive scale, but at significant complexity cost. Most applications don’t need it. If you do, invest in doing it right.