One question.
Picture this: you're on call. 3am. Your app is hammered — 50,000 concurrent users. One server is melting. The other two are barely touched. And the load balancer just keeps sending traffic to the melting one, mechanically, because nobody told it to do anything smarter.
That's not a hypothetical. That's what happens when you pick the wrong algorithm and don't think about it again. The load balancer is usually the first thing that meets user traffic and the last thing engineers actually read the docs on. Let's fix that.
The load balancer is the gatekeeper. Everything flows through it. It forwards your request to a server, that server processes it, and the response comes back. The whole round trip happens before you finish blinking. But the decision about which server to send it to? That's the interesting part — and there's more than one way to do it.
Before we get to algorithms — the full flow
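A lot of people think the load balancer is just a dumb router. It's not. It's simultaneously running a routing algorithm and health-checking every server behind it. These two things work together. End to end, a request travels like this: the client connects to the load balancer, the load balancer looks at the servers currently passing health checks, the algorithm picks one of them, the request is forwarded to that server, and the response comes back to the client.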
The algorithm only matters for healthy servers. A server that's failing health checks doesn't exist to the algorithm — it's already been yanked from the pool. We'll come back to health checks separately because they deserve their own spotlight.
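To make that division of labor concrete, here is a minimal sketch, not any particular load balancer's implementation. The `Server` class, its `healthy` flag, and `handle_request` are hypothetical names for illustration, and a background health checker (not shown) is assumed to flip `healthy` on and off.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    name: str
    healthy: bool = True  # assumed to be updated by a background health checker


def handle_request(servers: List[Server],
                   pick: Callable[[List[Server]], Server]) -> str:
    """Filter out unhealthy servers first, then let the algorithm choose."""
    pool = [s for s in servers if s.healthy]
    if not pool:
        raise RuntimeError("no healthy servers available")
    return pick(pool).name


def first_in_pool(pool: List[Server]) -> Server:
    """A trivial stand-in algorithm: always take the first healthy server."""
    return pool[0]


servers = [Server("s1"), Server("s2", healthy=False), Server("s3")]
chosen = handle_request(servers, first_in_pool)  # "s1"; s2 is never considered
```

The point of the shape is that `pick` never sees `s2`. Any algorithm in the rest of this post could be dropped in as `pick` without knowing health checks exist.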
Round Robin — deal requests like cards
You have 3 servers. Request 1 goes to Server 1. Request 2 goes to Server 2. Request 3 goes to Server 3. Request 4 goes back to Server 1. Repeat forever.
That's it. That's Round Robin. There's no cleverness here — just a counter that wraps around. And for a huge class of workloads, that's exactly what you want. Stateless REST APIs, rendering services, auth endpoints — if every request is roughly the same cost and your servers are roughly the same size, Round Robin is perfectly appropriate and blazing fast. The load balancer doesn't need to track anything except a single integer.
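If you want to see just how little state that is, here is a rough sketch. The `RoundRobin` class is a made-up name for illustration and it ignores thread safety, which a real load balancer would have to handle.

```python
from typing import List


class RoundRobin:
    """Cycle through servers using a single wrapping counter."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = servers
        self.counter = 0  # the only state the balancer keeps

    def next_server(self) -> str:
        server = self.servers[self.counter % len(self.servers)]
        self.counter += 1
        return server


rr = RoundRobin(["Server 1", "Server 2", "Server 3"])
picks = [rr.next_server() for _ in range(4)]
# picks == ["Server 1", "Server 2", "Server 3", "Server 1"]
```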
Where it falls apart: uneven servers. Suppose your fleet has a beefy 32-core machine and a leftover 8-core from two years ago. Round Robin will give them equal turns. The 8-core gets crushed while the 32-core sits at 20% CPU. That's when you reach for Weighted Round Robin:
# NGINX weighted round-robin
upstream backend {
    server s1.internal weight=4;  # 32-core, gets 4/6 of traffic
    server s2.internal weight=1;  # 8-core, gets 1/6
    server s3.internal weight=1;
}
Weight it proportional to capacity. If a server is twice as powerful, give it twice the weight. Most NGINX and HAProxy setups I've seen in the wild get this wrong — they spin up heterogeneous fleets and forget to tune the weights, then wonder why one node's CPU is always pegged.
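For intuition about what a weighted scheduler does with those numbers, here is one possible approach, often called smooth weighted round robin, sketched in Python. This is an illustration under my own assumptions, not a claim about how any specific proxy implements weights internally; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterator


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Yield server names forever, in proportion to their weights."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    while True:
        # Every server earns credit equal to its weight on each pick.
        for name, weight in weights.items():
            current[name] += weight
        # The server with the most credit wins; ties go to the first one listed.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


# Same weights as the NGINX example above: 4, 1, 1.
picker = weighted_round_robin({"s1": 4, "s2": 1, "s3": 1})
first_six = [next(picker) for _ in range(6)]
share = Counter(first_six)
# first_six == ["s1", "s1", "s2", "s1", "s3", "s1"]
```

Out of every six picks, `s1` gets four and the other two get one each, and the heavy server's turns are interleaved rather than bunched together.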
Least Connections — route to whoever has the most breathing room
Round Robin doesn't know — and doesn't care — how long each request takes. If Server 2 is in the middle of processing 100 file uploads that are each 10 seconds long, Round Robin will happily pile another request on top. It's "Server 2's turn."
Least Connections is different. The load balancer watches how many active connections each server currently has. When a new request arrives, it goes to whoever has the smallest number. The intuition: fewer active connections means more available capacity right now.
This is the right algorithm for WebSocket connections, video streams, long database queries: anything where requests have wildly different lifetimes. Connection counts can change thousands of times per second in a busy system, so the bookkeeping has to be cheap. Implementations typically keep a per-server counter updated with atomic operations rather than taking a lock on every request.
# HAProxy — leastconn
backend app_servers
    balance leastconn
    server web01 10.0.1.1:8080 check
    server web02 10.0.1.2:8080 check
    server web03 10.0.1.3:8080 check
One thing to know: if all connections are very short (sub-millisecond REST calls), Least Connections degenerates toward Round Robin in practice. The connection counts are so uniformly distributed that it barely matters which one "has fewer." Save Least Connections for genuinely long-lived connections.
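The selection logic itself is tiny. Here is a single-threaded sketch, assuming a hypothetical `LeastConnections` class with `acquire` and `release` methods; a production balancer would need thread-safe counters and would combine this with health checks.

```python
from typing import Dict, List


class LeastConnections:
    """Track active connections per server and pick the least loaded one."""

    def __init__(self, servers: List[str]) -> None:
        self.active: Dict[str, int] = {name: 0 for name in servers}

    def acquire(self) -> str:
        # min() returns the first server with the smallest count
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        # Call this when the connection closes
        self.active[server] -= 1


lb = LeastConnections(["web01", "web02", "web03"])
a = lb.acquire()  # web01
b = lb.acquire()  # web02
c = lb.acquire()  # web03
lb.release(b)     # web02's connection finishes early
d = lb.acquire()  # web02 again, since it now has the fewest connections
```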
IP Hashing — the same user always hits the same server
Here's a failure mode that catches people off guard the first time. Your app stores the user's shopping cart in memory — not in Redis, just in the server process. Round Robin routes the user's first request to Server 2. Their cart is now in Server 2's RAM. Second request? Round Robin sends it to Server 3. Cart is gone. User is confused. You're getting a 1-star review.
IP Hashing sidesteps this by making the routing decision deterministic: hash the client's IP address, modulo the number of servers, always land on the same one.
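# The math behind it
import zlib

def select_server(client_ip, servers):
    # Use a stable hash such as CRC32 or MD5. Python's built-in hash()
    # is salted per process for strings, so it would not be repeatable.
    hash_value = zlib.crc32(client_ip.encode("utf-8"))
    server_index = hash_value % len(servers)
    return servers[server_index]

# Illustrative hash values (not real CRC32 output), servers = [A, B, C]:
# 192.168.1.42 → hash → 7345892 → 7345892 % 3 = 2 → Server C
# 10.0.0.5 → hash → 1829374 → 1829374 % 3 = 1 → Server B
# Same IP tomorrow, next week → same server, while the pool is unchanged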
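Every request from the same IP lands on the same server, as long as the pool itself doesn't change. Add or remove a server and the modulo changes, so many clients get remapped to a different server and lose their in-memory state. And because the key is the IP address, every user behind the same NAT or corporate proxy hashes to the same server.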
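One property of hash-modulo routing is worth seeing with your own eyes: the mapping is stable only while the pool size is stable. The sketch below uses made-up client IPs and CRC32 as the stable hash (an assumption on my part; any repeatable hash would do) and counts how many clients land on a different server after one server is removed.

```python
import zlib
from typing import List


def select_server(client_ip: str, servers: List[str]) -> str:
    """Map an IP to a server with a stable hash (CRC32) modulo the pool size."""
    return servers[zlib.crc32(client_ip.encode("ascii")) % len(servers)]


servers = ["A", "B", "C"]
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # made-up client IPs

before = {ip: select_server(ip, servers) for ip in ips}
again = {ip: select_server(ip, servers) for ip in ips}  # same pool, same answers

# Drop server C and see how many clients now land somewhere new.
after = {ip: select_server(ip, servers[:2]) for ip in ips}
moved = sum(1 for ip in ips if before[ip] != after[ip])
print(f"{moved} of {len(ips)} clients changed servers")
```

Run it and you'll see that far more clients move than just the ones that were on the removed server, which is one more reason to treat IP hashing as a stopgap.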
Health checks — the feature nobody thinks about until production is down
Every few seconds, the load balancer sends a small HTTP request to every server, something like GET /health. If the server replies with 200 OK within the timeout window, it stays in the pool. If it doesn't (crash, overload, deadlock, anything), it gets pulled from the pool once it has missed enough checks in a row. No engineer has to wake up and manually reroute traffic. It just happens.
This is the feature that turns "a cluster of servers" into "a resilient system." Without it, you could have the world's most sophisticated routing algorithm and still send 30% of traffic to a server that's been silently OOM-killed.
There's an important subtlety here: shallow vs. deep health checks. A shallow check just verifies the port is open — the process is running but might be completely stuck waiting on a database lock. A deep check actually pings your database, your cache, your message queue, and returns 503 if anything is broken. Here's what that looks like in practice:
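# NGINX Plus health check config
upstream backend {
    zone backend 64k;  # shared memory zone, needed for active health checks
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
}
server {
    location / {
        proxy_pass http://backend;
        # health_check belongs in the location that proxies to the upstream
        health_check interval=5s fails=3 passes=2 uri=/health;
        # fails=3: mark DOWN only after 3 misses in a row (avoids flapping)
        # passes=2: require 2 clean checks before re-adding to pool
    }
}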
# Your /health endpoint — make it actually check things
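from flask import Flask, jsonify

app = Flask(__name__)
# db and redis_client are assumed to be created elsewhere in your app

@app.route('/health')
def health_check():
    try:
        db.execute('SELECT 1')  # real DB check
        redis_client.ping()     # real cache check
        return jsonify(status='ok'), 200
    except Exception:
        return jsonify(status='degraded'), 503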
The fails=3 / passes=2 thresholds are deliberate. You don't want one blip to pull a server — that causes cascading load spikes on the remaining nodes. And you don't want a recovering server to rejoin on its first successful check — it might still be warming up.
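The threshold logic is a small state machine, and it can help to see it written out. This is a sketch of the idea only, with a hypothetical `HealthTracker` class; it is not how NGINX Plus implements it internally.

```python
class HealthTracker:
    """Mark a server DOWN after `fails` misses in a row, UP after `passes` successes in a row."""

    def __init__(self, fails: int = 3, passes: int = 2) -> None:
        self.fails = fails
        self.passes = passes
        self.up = True
        self.streak = 0  # consecutive results that disagree with the current state

    def record(self, check_ok: bool) -> bool:
        """Feed in one health check result; return whether the server is in the pool."""
        if check_ok == self.up:
            self.streak = 0  # result agrees with the current state, reset the streak
        else:
            self.streak += 1
            needed = self.passes if check_ok else self.fails
            if self.streak >= needed:
                self.up = check_ok  # enough evidence, flip the state
                self.streak = 0
        return self.up


tracker = HealthTracker(fails=3, passes=2)
checks = [False, True, False, False, False, True, True]
in_pool = [tracker.record(ok) for ok in checks]
# in_pool == [True, True, True, True, False, False, True]
```

The single failed check at the start never pulls the server, three failures in a row do, and it takes two clean checks to get back in.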
Layer 4 vs. Layer 7 — this distinction actually matters
Load balancers can operate at different levels of the network stack, and this affects what they can see and what they can do.
Layer 4 — Transport
- Sees IP addresses and TCP ports only
- Cannot read HTTP headers, URLs, or cookies
- Faster — no deep packet inspection
- Used for TCP/UDP services: databases, game servers, DNS
- AWS NLB, HAProxy in TCP mode
Layer 7 — Application
- Reads full HTTP request: headers, URL, cookies, body
- Can route /api/* to one pool, /images/* to another
- Handles TLS termination so servers don't have to
- Enables cookie-based sticky sessions
- AWS ALB, NGINX, Traefik, Envoy
For almost any web application, you want Layer 7. The routing flexibility alone is worth it — you can send WebSocket traffic to a dedicated pool, route /admin to IP-restricted nodes, and A/B test by inspecting headers. Layer 4 is for raw TCP performance where every microsecond of latency counts and you can't afford to read the payload.
# AWS ALB listener rules — Layer 7 content routing
# Route based on path
/api/* → api-target-group
/ws/* → websocket-target-group (longer idle timeout)
/static/* → s3-origin or CDN
# Route canary traffic by header
If X-Beta-User: true → canary-target-group
else → prod-target-group
# Route by hostname
api.myapp.com → api cluster
admin.myapp.com → admin cluster (with WAF + IP restriction)
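Those rules boil down to a first-match-wins decision over the parts of the request that only a Layer 7 balancer can see. Here is a rough Python sketch of that decision. The rule order, the `route` function, and the pool names `api-cluster`, `admin-cluster`, and `static-origin` are my own assumptions for illustration, and header names are assumed to arrive with the exact casing shown.

```python
from typing import Dict


def route(host: str, path: str, headers: Dict[str, str]) -> str:
    """Return the name of the target pool for a request; the first matching rule wins."""
    # Hostname rules
    if host == "admin.myapp.com":
        return "admin-cluster"
    if host == "api.myapp.com":
        return "api-cluster"
    # Path rules
    if path.startswith("/api/"):
        return "api-target-group"
    if path.startswith("/ws/"):
        return "websocket-target-group"
    if path.startswith("/static/"):
        return "static-origin"
    # Header rule for canary traffic
    if headers.get("X-Beta-User") == "true":
        return "canary-target-group"
    return "prod-target-group"


pool = route("www.myapp.com", "/api/users", {})  # "api-target-group"
```

None of these branches is possible at Layer 4, where the balancer never parses the host, path, or headers.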
The real solution to session state
IP hashing and cookie stickiness solve the symptom, not the disease. If your session state lives in server memory, you've built a fragile system. The server goes down, the session is gone. You auto-scale, the new servers don't have old sessions. You do a deployment, all sessions die.
The proper fix is to move session state outside the server entirely:
# Move sessions to Redis — now any server can handle any user
import json
import redis
session_store = redis.Redis(host='cache.internal', port=6379)
# On login
session_store.setex(f"session:{token}", 3600, json.dumps(user_data))
# On request
raw = session_store.get(f"session:{token}")  # bytes, or None if expired
user_data = json.loads(raw) if raw is not None else None
# Now you can use Round Robin freely.
# Any server can serve any user.
# Scale up, scale down, restart — sessions survive.
Once you do this, load balancing becomes much simpler. Round Robin with health checks covers the large majority of use cases. IP hashing and sticky sessions are escape hatches for systems you can't refactor, not goals to aim for.
How real systems actually do it
| System | Load Balancer | Algorithm | Why |
|---|---|---|---|
| Netflix API | AWS ALB + Eureka | Round Robin | Stateless microservices, uniform nodes |
| Discord (voice) | Custom L4 + L7 | Least Connections | Long-lived WebSocket/UDP sessions vary wildly in cost |
| GitHub | HAProxy | Least Connections | Git clone of a 10GB repo vs. a simple API call — not the same |
| Cloudflare | Unimog (custom) | Weighted RR + Anycast | Global L4, every nanosecond matters |
| Kubernetes services | kube-proxy (iptables) | Random / Round Robin | Service mesh handles smarter routing above this |
Notice that Least Connections appears whenever requests have uneven cost. Discord's voice channels and GitHub's git operations both have the same property: a quick request and a slow one look identical when they arrive, but one of them is a 10-second upload that holds its connection open the whole time. Counting open connections is how the algorithm sees that.
Which one to pick
If you're starting fresh: Round Robin with deep health checks. Get your session state into Redis. Make every server interchangeable. That's the architecture that lets you scale horizontally without headaches.
If requests have wildly different processing times — uploads, video, WebSockets, long database queries — switch to Least Connections. You'll immediately see better load distribution, especially during traffic spikes.
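A toy simulation shows the effect. The traffic pattern here is deliberately contrived (one request per tick, every third one slow) so that Round Robin keeps handing the slow requests to the same server. The function names and the tick-based model are assumptions for illustration, not a benchmark.

```python
from typing import Callable, List


def simulate(durations: List[int], n_servers: int,
             pick: Callable[[int, List[int]], int]) -> int:
    """Return the peak number of simultaneous connections on any one server."""
    conns: List[List[int]] = [[] for _ in range(n_servers)]  # end times per server
    peak = 0
    for t, duration in enumerate(durations):
        # Drop connections that have finished by tick t
        conns = [[end for end in c if end > t] for c in conns]
        counts = [len(c) for c in conns]
        target = pick(t, counts)
        conns[target].append(t + duration)
        peak = max(peak, max(len(c) for c in conns))
    return peak


def round_robin(t: int, counts: List[int]) -> int:
    return t % len(counts)


def least_connections(t: int, counts: List[int]) -> int:
    return counts.index(min(counts))  # ties go to the lowest index


# One request per tick; every third request takes 9 ticks, the rest take 1.
durations = [9, 1, 1] * 10
peak_rr = simulate(durations, 3, round_robin)        # 3
peak_lc = simulate(durations, 3, least_connections)  # 2
```

With Round Robin, one server ends up holding three slow connections at once while the other two sit nearly idle. Least Connections spreads the slow requests around, so no server ever holds more than two connections.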
IP hashing is for when you inherit a system you can't change and sessions are baked into server memory. It's a band-aid. Cookie-based stickiness at Layer 7 (NGINX Plus, ALB) is a better band-aid: it keys on the individual browser session rather than the IP, so it doesn't lump every user behind the same NAT onto one server.
And health checks — always, always deep health checks. Shallow TCP pings lie. Your database can be completely locked and the port still accepts connections. Check what actually matters.