Imagine you're running a distributed cache with 4 nodes and millions of keys. Everything is working great. Then traffic spikes and you add a 5th node. With naive modulo hashing, almost every key gets remapped to a different node: you've just invalidated nearly your entire cache in one shot, and the resulting flood of misses hits your database all at once (a thundering herd). Consistent hashing was invented to avoid triggering this.
The Problem with Naive Hashing
With modulo hashing, a key maps to node hash(key) % N where N is the number of nodes. When N changes, most keys change their target node:
| Change | Keys Remapped | Impact |
|---|---|---|
| 4 nodes → 5 nodes | ~80% | Cache invalidated, DB flooded |
| 4 nodes → 3 nodes (node failure) | ~75% | Cache cold, potential outage |
| 10 nodes → 11 nodes | ~91% | Worse as cluster grows |
| Consistent hashing (any change) | ~K/N only | Only affected keys move |
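The modulo-hashing percentages in the table are easy to verify with a quick simulation. A minimal sketch, using key values as stand-ins for hash values (the node counts are the ones from the table):

```python
def remap_fraction(old_n: int, new_n: int, keys: int = 100_000) -> float:
    """Fraction of keys whose target node changes when the cluster resizes
    under naive modulo placement."""
    moved = sum(1 for k in range(keys) if k % old_n != k % new_n)
    return moved / keys

print(remap_fraction(4, 5))    # 0.8   -> ~80% of keys move
print(remap_fraction(4, 3))    # ~0.75 -> ~75% move
print(remap_fraction(10, 11))  # ~0.91 -> worse as the cluster grows
```

A key stays put only when its value gives the same remainder under both node counts, which happens for roughly 1/max(old_n, new_n) of keys.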
For a cache, remapped keys mean cache misses. Every miss hits the database — a sudden flood of queries that may overwhelm it. Systems have gone down because of this cascade.
The Ring: How Consistent Hashing Works
Consistent hashing maps both keys and nodes onto a circular ring of integers (typically 0 to 2³²−1 ≈ 4.3 billion). The algorithm:
- Hash each node to a position on the ring using its ID or IP address.
- To find which node owns a key, hash the key and walk clockwise until you hit a node.
- That node is responsible for storing the key.
*Hash ring with 4 nodes (blue) and 4 keys (orange). Each key routes clockwise to the next node.*
When you add a node (say Node E at position 45), it takes over only the keys in the arc between its predecessor (Node A at position 0) and itself — keys that previously routed clockwise past 45 to the next node. All other keys are unaffected. On average, only K/N keys need to move (K = total keys, N = nodes after the change).
Example: 1,000,000 keys across 10 nodes. Add node 11: with modulo hashing ~910,000 keys remap. With consistent hashing: only ~90,909 keys move. Same math, vastly different impact.
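The arithmetic behind that example, spelled out as a quick check:

```python
K = 1_000_000  # total keys
N_new = 11     # nodes after the addition

# Modulo hashing: roughly (1 - 1/N_new) of all keys land on a new node
print(round(K * (1 - 1 / N_new)))  # 909091

# Consistent hashing: only the keys the new node takes over move
print(K // N_new)  # 90909
```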
Python Implementation
Here's a working consistent hash ring implementation. Production libraries such as uhashring and ketama-style Memcached clients work the same way internally:
```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=None, replicas=150):
        """
        replicas: number of virtual nodes per physical node.
        Higher = better distribution, more memory.
        """
        self.replicas = replicas
        self.ring = {}          # hash position -> node name
        self.sorted_keys = []   # sorted list of hash positions
        for node in (nodes or []):
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.replicas):
            # Each virtual node gets a unique key: "node:0", "node:1", ...
            vnode_key = f"{node}:{i}"
            h = self._hash(vnode_key)
            self.ring[h] = node
            bisect.insort(self.sorted_keys, h)

    def remove_node(self, node: str):
        for i in range(self.replicas):
            h = self._hash(f"{node}:{i}")
            del self.ring[h]
            self.sorted_keys.remove(h)

    def get_node(self, key: str):
        """Walk clockwise to find the node responsible for key."""
        if not self.ring:
            return None
        h = self._hash(key)
        # First position >= h; modulo wraps around past the top of the ring
        idx = bisect.bisect_left(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]


# Usage
ring = ConsistentHashRing(nodes=["node-1", "node-2", "node-3"], replicas=150)

# Route keys
print(ring.get_node("user:1001"))    # e.g. "node-2"
print(ring.get_node("session:abc"))  # e.g. "node-1"

# Add a node -- minimal remapping
ring.add_node("node-4")
print(ring.get_node("user:1001"))    # may or may not change
```
The key design decisions mirror production practice: replicas=150 gives reasonable distribution without excessive memory. MD5 is used for its cheap, uniform output, not for security (a non-cryptographic hash like xxHash works just as well). The bisect module gives O(log n) lookups on the sorted ring.
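To see the minimal-remapping property concretely, count how many keys change owner when a node joins. A standalone sketch (it inlines a minimal ring so it runs on its own; node names and key counts are arbitrary):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=150):
    # hash position -> physical node, plus sorted positions for bisect
    ring = {_hash(f"{n}:{i}"): n for n in nodes for i in range(replicas)}
    return ring, sorted(ring)

def owner(ring, positions, key: str) -> str:
    idx = bisect.bisect_left(positions, _hash(key)) % len(positions)
    return ring[positions[idx]]

keys = [f"user:{i}" for i in range(10_000)]

ring3, pos3 = build_ring(["node-1", "node-2", "node-3"])
ring4, pos4 = build_ring(["node-1", "node-2", "node-3", "node-4"])

before = {k: owner(ring3, pos3, k) for k in keys}
moved = sum(1 for k in keys if owner(ring4, pos4, k) != before[k])
print(f"{moved} of {len(keys)} keys moved")  # roughly K/N = 2,500
```

Only keys landing on node-4's new arcs move; everything else keeps its owner, so the count comes out near K/N rather than the ~80% a modulo scheme would produce.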
Virtual Nodes: Solving the Hot Spot Problem
Basic consistent hashing has a load distribution flaw. If nodes land unevenly on the ring by chance, some nodes own much larger arcs than others:
*Virtual nodes dramatically smooth out load distribution across physical nodes.*
With 150 virtual nodes per physical node, the law of large numbers kicks in and each physical node ends up owning roughly equal arcs of the ring.
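The smoothing effect is measurable: compare how evenly the ring divides among physical nodes with and without virtual nodes. A standalone sketch (3 nodes and MD5 are arbitrary choices):

```python
import hashlib

RING_SIZE = 2**128  # MD5's output space stands in for the ring

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def arc_shares(nodes, replicas):
    """Fraction of the ring each physical node owns with `replicas` vnodes each."""
    points = sorted((_hash(f"{n}:{i}"), n) for n in nodes for i in range(replicas))
    owned = dict.fromkeys(nodes, 0)
    # Each point owns the arc back to its predecessor (wrapping around the ring)
    for (prev_pos, _), (pos, node) in zip([points[-1]] + points[:-1], points):
        owned[node] += (pos - prev_pos) % RING_SIZE
    return {n: owned[n] / RING_SIZE for n in nodes}

nodes = ["node-1", "node-2", "node-3"]
for replicas in (1, 150):
    shares = arc_shares(nodes, replicas)
    ratio = max(shares.values()) / min(shares.values())
    print(replicas, {n: round(s, 3) for n, s in shares.items()}, round(ratio, 2))
```

With a single point per node, one node typically owns several times the arc of another; at 150 vnodes the max/min ratio collapses toward 1.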
Replication with the Ring
For fault tolerance, data is replicated to the next N nodes clockwise (N = replication factor). In Cassandra with RF=3:
```sql
-- Cassandra replication
CREATE KEYSPACE my_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,  -- 3 replicas in US-East
  'eu-west': 2   -- 2 replicas in EU-West
};
```

```python
# Consistency level per query -- in the DataStax Python driver it is set
# on the statement (assumes a connected `session`)
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

stmt = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,  # wait for a majority of replicas
)
session.execute(stmt, [user_id])
```
When a node fails, its replicas cover the gap transparently. When it rejoins, `nodetool repair` syncs missed writes.
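The "next N nodes clockwise" rule can be sketched directly on the ring. One subtlety: with virtual nodes, the walk must skip vnodes belonging to physical nodes already chosen, or a node could hold two copies of the same key. A standalone sketch (names and counts are arbitrary):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=150):
    ring = {_hash(f"{n}:{i}"): n for n in nodes for i in range(replicas)}
    return ring, sorted(ring)

def preference_list(ring, positions, key: str, rf: int):
    """First `rf` distinct physical nodes clockwise from the key's position.
    Assumes rf <= number of physical nodes, or the walk never finishes."""
    result = []
    idx = bisect.bisect_left(positions, _hash(key))
    while len(result) < rf:
        node = ring[positions[idx % len(positions)]]
        if node not in result:  # skip vnodes of already-chosen nodes
            result.append(node)
        idx += 1
    return result

ring, pos = build_ring(["node-1", "node-2", "node-3", "node-4"])
print(preference_list(ring, pos, "user:1001", rf=3))  # 3 distinct nodes
```

The first node in the list is the key's primary owner; the rest are its replicas, which is why a single node failure leaves the data reachable.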
Real-World Usage
Amazon DynamoDB
DynamoDB hashes the partition key to determine storage placement. This is why partition key choice is critical:
```python
# Good: high cardinality -> even distribution
partition_key = "user_id"  # millions of unique values

# Bad: low cardinality -> hot partition
partition_key = "status"   # only "active" / "inactive"

# Bad: time-based -> hot partition (all writes hit latest)
partition_key = "date"     # 2024-01-15
```
Apache Cassandra
Cassandra's token ring is visible with `nodetool ring`. Adding a node triggers streaming:
```shell
# Check ring layout
nodetool ring

# When adding node-5:
#  1. node-5 joins, gets token positions
#  2. existing nodes stream ~K/N keys to node-5
#  3. nodetool repair syncs remaining state
nodetool repair
```
Redis Cluster
Redis Cluster uses 16,384 fixed hash slots instead of a continuous ring. Keys map to slots via `CRC16(key) % 16384`, and each node owns a set of slots. Strictly speaking this is not consistent hashing — the slot-to-node mapping is explicit rather than derived from a ring — but it achieves the same property: topology changes migrate whole slots between nodes, so only the keys in migrated slots move.
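The slot computation is simple enough to reproduce. Redis uses the CRC16-XMODEM variant (polynomial 0x1021, initial value 0); this sketch omits Redis's `{hash tag}` rule, which hashes only the tagged substring so related keys can share a slot:

```python
def crc16(data: bytes) -> int:
    """CRC16-XMODEM (poly 0x1021, init 0), the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    return crc16(key.encode()) % 16384

print(crc16(b"123456789"))     # 12739 (0x31C3, the standard check value)
print(hash_slot("user:1001"))  # a slot in [0, 16383]
```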
| System | Approach | Virtual Nodes | Default Replicas |
|---|---|---|---|
| Apache Cassandra | Token ring + vnodes | 256 per node (16 in 4.0+) | RF=1 (configurable) |
| Amazon DynamoDB | Internal ring (proprietary) | Yes (details hidden) | 3 (across AZs) |
| Redis Cluster | 16,384 hash slots | Slot ranges | 1 (+ replicas) |
| Memcached (client-side) | Ring in client lib | 100–200 typical | None (cache only) |
Limitations
- Non-uniform load even with vnodes — mitigated but not eliminated. Real-world variance ~5%.
- Hotspots from skewed access patterns — consistent hashing distributes keys evenly, not necessarily requests. If 90% of traffic hits "user:VIP", no algorithm helps.
- Range queries are incompatible — sequential keys land on random nodes. HBase and Bigtable use range-based partitioning instead, trading off add/remove efficiency for range scan performance.
- Memory overhead from vnodes — 256 vnodes × 1000 nodes = 256K ring entries to maintain. Not a problem in practice, but worth knowing.
Takeaways
- Modulo hashing is simple but catastrophic when cluster size changes.
- Consistent hashing limits remapping to K/N keys per topology change.
- Virtual nodes fix uneven load distribution — use ≥100 per node.
- Partition key design in DynamoDB is directly a consistent hashing problem.
- The algorithm originates from David Karger's 1997 MIT paper — 27 years old and still powering global-scale infrastructure.