Imagine you're running a distributed cache with 4 nodes and millions of keys. Everything is working great. Then traffic spikes and you add a 5th node. With naive modulo hashing, almost every key gets remapped to a different node: you've just invalidated nearly your entire cache in one shot, and the resulting flood of misses hits your database all at once (a thundering herd). Consistent hashing was invented to avoid triggering this.
The Problem with Naive Hashing
With modulo hashing, a key maps to node hash(key) % N where N is the number of nodes. When N changes, most keys change their target node:
| Change | Keys Remapped | Impact |
|---|---|---|
| 4 nodes → 5 nodes | ~80% | Cache invalidated, DB flooded |
| 4 nodes → 3 nodes (node failure) | ~75% | Cache cold, potential outage |
| 10 nodes → 11 nodes | ~91% | Worse as cluster grows |
| Consistent hashing (any change) | ~K/N only | Only affected keys move |
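The modulo-hashing percentages in the table are easy to verify with a quick simulation. A minimal sketch, using key values as stand-ins for hash values (the node counts are the ones from the table):

```python
def remap_fraction(old_n: int, new_n: int, keys: int = 100_000) -> float:
    """Fraction of keys whose target node changes when the cluster resizes
    under naive modulo placement."""
    moved = sum(1 for k in range(keys) if k % old_n != k % new_n)
    return moved / keys

print(remap_fraction(4, 5))    # 0.8   -> ~80% of keys move
print(remap_fraction(4, 3))    # ~0.75 -> ~75% move
print(remap_fraction(10, 11))  # ~0.91 -> worse as the cluster grows
```

A key stays put only when its value gives the same remainder under both node counts, which happens for roughly 1/max(old_n, new_n) of keys.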
For a cache, remapped keys mean cache misses. Every miss hits the database — a sudden flood of queries that may overwhelm it. Systems have gone down because of this cascade.
The Ring: How Consistent Hashing Works
Consistent hashing maps both keys and nodes onto a circular ring of integers (typically 0 to 2³²−1 ≈ 4.3 billion). The algorithm:
- Hash each node to a position on the ring using its ID or IP address.
- To find which node owns a key, hash the key and walk clockwise until you hit a node.
- That node is responsible for storing the key.
*Hash ring with 4 nodes (blue) and 4 keys (orange). Each key routes clockwise to the next node.*
When you add a node (say Node E at position 45), it takes over only the keys in the arc between its predecessor (Node A at position 0) and itself — keys that previously routed clockwise past 45 to the next node. All other keys are unaffected. On average, only K/N keys need to move (K = total keys, N = nodes after the change).
Example: 1,000,000 keys across 10 nodes. Add node 11: with modulo hashing ~910,000 keys remap. With consistent hashing: only ~90,909 keys move. Same math, vastly different impact.
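The arithmetic behind that example, spelled out as a quick check:

```python
K = 1_000_000  # total keys
N_new = 11     # nodes after the addition

# Modulo hashing: roughly (1 - 1/N_new) of all keys land on a new node
print(round(K * (1 - 1 / N_new)))  # 909091

# Consistent hashing: only the keys the new node takes over move
print(K // N_new)  # 90909
```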
Python Implementation
Here's a working consistent hash ring implementation. Production libraries such as uhashring and ketama-style Memcached clients work the same way internally:
```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=None, replicas=150):
        """
        replicas: number of virtual nodes per physical node.
        Higher = better distribution, more memory.
        """
        self.replicas = replicas
        self.ring = {}          # hash position -> node name
        self.sorted_keys = []   # sorted list of hash positions
        for node in (nodes or []):
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.replicas):
            # Each virtual node gets a unique key: "node:0", "node:1", ...
            vnode_key = f"{node}:{i}"
            h = self._hash(vnode_key)
            self.ring[h] = node
            bisect.insort(self.sorted_keys, h)

    def remove_node(self, node: str):
        for i in range(self.replicas):
            h = self._hash(f"{node}:{i}")
            del self.ring[h]
            self.sorted_keys.remove(h)

    def get_node(self, key: str):
        """Walk clockwise to find the node responsible for key."""
        if not self.ring:
            return None
        h = self._hash(key)
        # First position >= h; modulo wraps around past the top of the ring
        idx = bisect.bisect_left(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]


# Usage
ring = ConsistentHashRing(nodes=["node-1", "node-2", "node-3"], replicas=150)

# Route keys
print(ring.get_node("user:1001"))    # e.g. "node-2"
print(ring.get_node("session:abc"))  # e.g. "node-1"

# Add a node -- minimal remapping
ring.add_node("node-4")
print(ring.get_node("user:1001"))    # may or may not change
```
The key design decisions mirror production practice: replicas=150 gives reasonable distribution without excessive memory. MD5 is used for its cheap, uniform output, not for security (a non-cryptographic hash like xxHash works just as well). The bisect module gives O(log n) lookups on the sorted ring.
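To see the minimal-remapping property concretely, count how many keys change owner when a node joins. A standalone sketch (it inlines a minimal ring so it runs on its own; node names and key counts are arbitrary):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=150):
    # hash position -> physical node, plus sorted positions for bisect
    ring = {_hash(f"{n}:{i}"): n for n in nodes for i in range(replicas)}
    return ring, sorted(ring)

def owner(ring, positions, key: str) -> str:
    idx = bisect.bisect_left(positions, _hash(key)) % len(positions)
    return ring[positions[idx]]

keys = [f"user:{i}" for i in range(10_000)]

ring3, pos3 = build_ring(["node-1", "node-2", "node-3"])
ring4, pos4 = build_ring(["node-1", "node-2", "node-3", "node-4"])

before = {k: owner(ring3, pos3, k) for k in keys}
moved = sum(1 for k in keys if owner(ring4, pos4, k) != before[k])
print(f"{moved} of {len(keys)} keys moved")  # roughly K/N = 2,500
```

Only keys landing on node-4's new arcs move; everything else keeps its owner, so the count comes out near K/N rather than the ~80% a modulo scheme would produce.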
Virtual Nodes: Solving the Hot Spot Problem
Basic consistent hashing has a load distribution flaw. If nodes land unevenly on the ring by chance, some nodes own much larger arcs than others:
*Virtual nodes dramatically smooth out load distribution across physical nodes.*
With 150 virtual nodes per physical node, the law of large numbers kicks in and each physical node ends up owning roughly equal arcs of the ring.
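The smoothing effect is measurable: compare how evenly the ring divides among physical nodes with and without virtual nodes. A standalone sketch (3 nodes and MD5 are arbitrary choices):

```python
import hashlib

RING_SIZE = 2**128  # MD5's output space stands in for the ring

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def arc_shares(nodes, replicas):
    """Fraction of the ring each physical node owns with `replicas` vnodes each."""
    points = sorted((_hash(f"{n}:{i}"), n) for n in nodes for i in range(replicas))
    owned = dict.fromkeys(nodes, 0)
    # Each point owns the arc back to its predecessor (wrapping around the ring)
    for (prev_pos, _), (pos, node) in zip([points[-1]] + points[:-1], points):
        owned[node] += (pos - prev_pos) % RING_SIZE
    return {n: owned[n] / RING_SIZE for n in nodes}

nodes = ["node-1", "node-2", "node-3"]
for replicas in (1, 150):
    shares = arc_shares(nodes, replicas)
    ratio = max(shares.values()) / min(shares.values())
    print(replicas, {n: round(s, 3) for n, s in shares.items()}, round(ratio, 2))
```

With a single point per node, one node typically owns several times the arc of another; at 150 vnodes the max/min ratio collapses toward 1.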
Replication with the Ring
For fault tolerance, data is replicated to the next N nodes clockwise (N = replication factor). In Cassandra with RF=3:
```sql
-- Cassandra replication
CREATE KEYSPACE my_app WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,  -- 3 replicas in US-East
  'eu-west': 2   -- 2 replicas in EU-West
};
```

```python
# Consistency level per query -- in the DataStax Python driver it is set
# on the statement (assumes a connected `session`)
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

stmt = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,  # wait for a majority of replicas
)
session.execute(stmt, [user_id])
```
When a node fails, its replicas cover the gap transparently. When it rejoins, `nodetool repair` syncs missed writes.
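The "next N nodes clockwise" rule can be sketched directly on the ring. One subtlety: with virtual nodes, the walk must skip vnodes belonging to physical nodes already chosen, or a node could hold two copies of the same key. A standalone sketch (names and counts are arbitrary):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=150):
    ring = {_hash(f"{n}:{i}"): n for n in nodes for i in range(replicas)}
    return ring, sorted(ring)

def preference_list(ring, positions, key: str, rf: int):
    """First `rf` distinct physical nodes clockwise from the key's position.
    Assumes rf <= number of physical nodes, or the walk never finishes."""
    result = []
    idx = bisect.bisect_left(positions, _hash(key))
    while len(result) < rf:
        node = ring[positions[idx % len(positions)]]
        if node not in result:  # skip vnodes of already-chosen nodes
            result.append(node)
        idx += 1
    return result

ring, pos = build_ring(["node-1", "node-2", "node-3", "node-4"])
print(preference_list(ring, pos, "user:1001", rf=3))  # 3 distinct nodes
```

The first node in the list is the key's primary owner; the rest are its replicas, which is why a single node failure leaves the data reachable.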
Real-World Usage
Amazon DynamoDB
DynamoDB hashes the partition key to determine storage placement. This is why partition key choice is critical:
```python
# Good: high cardinality -> even distribution
partition_key = "user_id"  # millions of unique values

# Bad: low cardinality -> hot partition
partition_key = "status"   # only "active" / "inactive"

# Bad: time-based -> hot partition (all writes hit latest)
partition_key = "date"     # 2024-01-15
```
Apache Cassandra
Cassandra's token ring is visible with `nodetool ring`. Adding a node triggers streaming:
```shell
# Check ring layout
nodetool ring

# When adding node-5:
#  1. node-5 joins, gets token positions
#  2. existing nodes stream ~K/N keys to node-5
#  3. nodetool repair syncs remaining state
nodetool repair
```
Redis Cluster
Redis Cluster uses 16,384 fixed hash slots instead of a continuous ring. Keys map to slots via `CRC16(key) % 16384`, and each node owns a set of slots. Strictly speaking this is not consistent hashing — the slot-to-node mapping is explicit rather than derived from a ring — but it achieves the same property: topology changes migrate whole slots between nodes, so only the keys in migrated slots move.
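The slot computation is simple enough to reproduce. Redis uses the CRC16-XMODEM variant (polynomial 0x1021, initial value 0); this sketch omits Redis's `{hash tag}` rule, which hashes only the tagged substring so related keys can share a slot:

```python
def crc16(data: bytes) -> int:
    """CRC16-XMODEM (poly 0x1021, init 0), the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    return crc16(key.encode()) % 16384

print(crc16(b"123456789"))     # 12739 (0x31C3, the standard check value)
print(hash_slot("user:1001"))  # a slot in [0, 16383]
```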
| System | Approach | Virtual Nodes | Default Replicas |
|---|---|---|---|
| Apache Cassandra | Token ring + vnodes | 256 per node (16 in 4.0+) | RF=1 (configurable) |
| Amazon DynamoDB | Internal ring (proprietary) | Yes (details hidden) | 3 (across AZs) |
| Redis Cluster | 16,384 hash slots | Slot ranges | 1 (+ replicas) |
| Memcached (client-side) | Ring in client lib | 100–200 typical | None (cache only) |
Limitations
- Non-uniform load even with vnodes — mitigated but not eliminated. Real-world variance ~5%.
- Hotspots from skewed access patterns — consistent hashing distributes keys evenly, not necessarily requests. If 90% of traffic hits "user:VIP", no algorithm helps.
- Range queries are incompatible — sequential keys land on random nodes. HBase and Bigtable use range-based partitioning instead, trading off add/remove efficiency for range scan performance.
- Memory overhead from vnodes — 256 vnodes × 1000 nodes = 256K ring entries to maintain. Not a problem in practice, but worth knowing.
Takeaways
- Modulo hashing is simple but catastrophic when cluster size changes.
- Consistent hashing limits remapping to K/N keys per topology change.
- Virtual nodes fix uneven load distribution — use ≥100 per node.
- Partition key design in DynamoDB is directly a consistent hashing problem.
- The algorithm originates from David Karger's 1997 MIT paper — 27 years old and still powering global-scale infrastructure.