How a Load Balancer Decides Which Server Gets Your Request

Millions of requests. Dozens of servers.
One question.

Who handles this one?

Picture this: you're on call. 3am. Your app is hammered — 50,000 concurrent users. One server is melting. The other two are barely touched. And the load balancer just keeps sending traffic to the melting one, mechanically, because nobody told it to do anything smarter.

That's not a hypothetical. That's what happens when you pick the wrong algorithm and don't think about it again. The load balancer is usually the first thing that meets user traffic and the last thing engineers actually read the docs on. Let's fix that.

The Basic Picture — Traffic In, Distributed Out

The load balancer is the gatekeeper. Everything flows through it. It forwards your request to a server, that server processes it, and the response comes back. The whole round trip happens before you finish blinking. But the decision about which server to send it to? That's the interesting part — and there's more than one way to do it.

Before we get to algorithms — the full flow

A lot of people think the load balancer is just a dumb router. It's not. It's simultaneously running a routing algorithm and health-checking every server behind it. These two things work together. Here's how a request actually travels end-to-end:

flowchart TD
    User(["User Request"])
    LB["Load Balancer\nreceives request"]
    HC{"Health Check\nPassed?"}
    Algo["Apply routing\nalgorithm"]
    RR["Round Robin\n→ next server in list"]
    LC["Least Connections\n→ server with fewest active"]
    IH["IP Hashing\n→ hash(IP) mod N"]
    Fwd["Forward request\nto chosen server"]
    Srv(["Server handles\nrequest"])
    Dead["Mark server DOWN\nRemove from pool"]
    Retry["Try next\nhealthy server"]

    User --> LB
    LB --> HC
    HC -- "Yes" --> Algo
    HC -- "No" --> Dead
    Dead --> Retry
    Retry --> HC
    Algo --> RR
    Algo --> LC
    Algo --> IH
    RR --> Fwd
    LC --> Fwd
    IH --> Fwd
    Fwd --> Srv

The algorithm only matters for healthy servers. A server that's failing health checks doesn't exist to the algorithm — it's already been yanked from the pool. We'll come back to health checks separately because they deserve their own spotlight.

1 Method One

Round Robin — deal requests like cards

You have 3 servers. Request 1 goes to Server 1. Request 2 goes to Server 2. Request 3 goes to Server 3. Request 4 goes back to Server 1. Repeat forever.

That's it. That's Round Robin. There's no cleverness here — just a counter that wraps around. And for a huge class of workloads, that's exactly what you want. Stateless REST APIs, rendering services, auth endpoints — if every request is roughly the same cost and your servers are roughly the same size, Round Robin is perfectly appropriate and blazing fast. The load balancer doesn't need to track anything except a single integer.
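To make the "single integer" point concrete, here is a minimal sketch of that wrapping counter. The class name and server names are made up for illustration, and a real balancer would do this in a thread-safe way, which this sketch does not attempt.

```python
class RoundRobinBalancer:
    """Hands out servers in a fixed rotation using one wrapping counter."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self.servers = list(servers)
        self._next = 0  # the only state the balancer keeps

    def pick(self):
        server = self.servers[self._next]
        # Advance and wrap around to the start of the list.
        self._next = (self._next + 1) % len(self.servers)
        return server


balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
picks = [balancer.pick() for _ in range(4)]
print(picks)  # ['server-1', 'server-2', 'server-3', 'server-1']
```

Request 4 lands back on server-1, exactly as in the walkthrough above.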

Where it falls apart: uneven servers. Suppose your fleet has a beefy 32-core machine and a leftover 8-core from two years ago. Round Robin will give them equal turns. The 8-core gets crushed while the 32-core sits at 20% CPU. That's when you reach for Weighted Round Robin:

# NGINX weighted round-robin
upstream backend {
    server s1.internal weight=4;   # 32-core, gets 4/6 of traffic
    server s2.internal weight=1;   # 8-core, gets 1/6
    server s3.internal weight=1;
}

Weight it proportional to capacity. If a server is twice as powerful, give it twice the weight. Most NGINX and HAProxy setups I've seen in the wild get this wrong — they spin up heterogeneous fleets and forget to tune the weights, then wonder why one node's CPU is always pegged.
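If you're curious how weights turn into an actual request order, here is a rough sketch of one common approach, often called smooth weighted round robin. Treat it as an illustration rather than a description of what any particular balancer does internally; the class name is hypothetical and the weights mirror the config above.

```python
class SmoothWeightedRoundRobin:
    """Spreads picks in proportion to weight without long runs on one server."""

    def __init__(self, weights):
        # weights: mapping of server name -> positive integer weight
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}
        self.total = sum(self.weights.values())

    def pick(self):
        # Every server earns credit equal to its weight...
        for name, weight in self.weights.items():
            self.current[name] += weight
        # ...the server with the most credit wins (ties go to the first listed)...
        chosen = max(self.current, key=self.current.get)
        # ...and pays back the total, so the others catch up.
        self.current[chosen] -= self.total
        return chosen


wrr = SmoothWeightedRoundRobin(
    {"s1.internal": 4, "s2.internal": 1, "s3.internal": 1}
)
sequence = [wrr.pick() for _ in range(6)]
counts = {name: sequence.count(name) for name in wrr.weights}
print(counts)  # {'s1.internal': 4, 's2.internal': 1, 's3.internal': 1}
```

Over every six requests the big machine gets four and the small ones get one each, and the small servers' turns are interleaved instead of bunched at the end.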

2 Method Two

Least Connections — route to whoever has the most breathing room

Round Robin doesn't know — and doesn't care — how long each request takes. If Server 2 is in the middle of processing 100 file uploads that are each 10 seconds long, Round Robin will happily pile another request on top. It's "Server 2's turn."

Least Connections is different. The load balancer watches how many active connections each server currently has. When a new request arrives, it goes to whoever has the smallest number. The intuition: fewer active connections means more available capacity right now.
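Here is a small sketch of that bookkeeping. The class, the server names, and the starting counts are all illustrative, and the sketch is single-threaded; a real balancer has to update these counters safely under concurrency.

```python
class LeastConnectionsBalancer:
    """Routes each new request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {name: 0 for name in servers}

    def acquire(self):
        # Pick the server with the smallest count (ties go to the first listed).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when a request finishes.
        if self.active[server] > 0:
            self.active[server] -= 1


lb = LeastConnectionsBalancer(["A", "B", "C"])
lb.active.update({"A": 2, "B": 8, "C": 5})  # pretend traffic is already flowing

first_three = [lb.acquire() for _ in range(3)]  # A has the most room
for _ in range(6):
    lb.release("B")                              # six of B's requests finish
next_pick = lb.acquire()                         # now B is the least loaded
print(first_three, next_pick)  # ['A', 'A', 'A'] B
```

The balancer never looks at CPU or request size. It only counts what is still open, which is usually a good enough proxy when request lifetimes vary a lot.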

This is the right algorithm for WebSocket connections, video streams, long database queries: anything where requests have wildly different lifetimes. Connection counts change thousands of times per second in a busy system, so balancers generally keep them in cheap per-server counters (atomic increments and decrements) rather than behind heavy locks.

# HAProxy — leastconn
backend app_servers
    balance leastconn
    server web01 10.0.1.1:8080 check
    server web02 10.0.1.2:8080 check
    server web03 10.0.1.3:8080 check

One thing to know: if all connections are very short (sub-millisecond REST calls), Least Connections degenerates toward Round Robin in practice. The connection counts are so uniformly distributed that it barely matters which one "has fewer." Save Least Connections for genuinely long-lived connections.

3 Method Three

IP Hashing — the same user always hits the same server

Here's a failure mode that catches people off guard the first time. Your app stores the user's shopping cart in memory — not in Redis, just in the server process. Round Robin routes the user's first request to Server 2. Their cart is now in Server 2's RAM. Second request? Round Robin sends it to Server 3. Cart is gone. User is confused. You're getting a 1-star review.

IP Hashing sidesteps this by making the routing decision deterministic: hash the client's IP address, modulo the number of servers, always land on the same one.

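# The math behind it
import zlib

def select_server(client_ip, servers):
    # Use a stable hash such as CRC32. Python's built-in hash() is salted
    # per process for strings, so its result changes across restarts.
    hash_value = zlib.crc32(client_ip.encode("utf-8"))
    server_index = hash_value % len(servers)
    return servers[server_index]

# Illustrative hash values (not real CRC32 output), with servers = [A, B, C]:
# 192.168.1.42  →  hash → 7345892  →  7345892 % 3 = 2  →  Server C
# 10.0.0.5      →  hash → 1829374  →  1829374 % 3 = 1  →  Server B
# Same IP tomorrow, next week → same server, as long as the pool size doesn't change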

IP Hashing — Same IP, Same Server, Every Time

192.168.1.42 → hash → 7,345,892 % 3 = 2 → Server C

10.0.0.5 → hash → 1,829,374 % 3 = 1 → Server B

172.16.0.200 → hash → 4,927,183 % 3 = 1 → Server B

Every request from the same IP always lands on the same server.

The trap: If 5,000 employees in a company all browse your site through the same corporate NAT gateway, they all share one public IP. IP hashing sends every single one of them to the same server, which now carries all 5,000 users while the other servers sit idle. Mobile users have the opposite problem: their IP changes as they move between networks, so they can land on a different server mid-session and lose their state. That's why production systems usually prefer cookie-based stickiness at Layer 7, which pins each browser rather than each IP and sidesteps both problems.
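To show why cookies avoid the NAT trap, here is a rough sketch of cookie-based stickiness. Everything in it is an assumption for illustration: the cookie name lb_server, the route function, and the plain dict standing in for request cookies. Real balancers typically store an opaque or signed identifier rather than the raw server name.

```python
import itertools

SERVERS = ["server-a", "server-b", "server-c"]
COOKIE_NAME = "lb_server"  # hypothetical cookie name
_rotation = itertools.cycle(SERVERS)


def route(cookies, healthy=frozenset(SERVERS)):
    """Return (server, cookies_to_set) for one request."""
    if not any(s in healthy for s in SERVERS):
        raise RuntimeError("no healthy servers")
    pinned = cookies.get(COOKIE_NAME)
    if pinned in healthy:
        return pinned, {}  # returning visitor: honor the pin
    # New visitor, or the pinned server is gone: take the next healthy
    # server in rotation and ask the browser to remember it.
    server = next(s for s in _rotation if s in healthy)
    return server, {COOKIE_NAME: server}


# Two users behind the same NAT arrive without cookies.
alice_server, alice_cookies = route({})
bob_server, bob_cookies = route({})
# Alice comes back with her cookie and lands on the same server.
alice_again, _ = route(alice_cookies)
print(alice_server, bob_server, alice_again)  # server-a server-b server-a
```

The client IP never enters the decision, so two people sharing one public address still spread across the pool, and a phone that hops networks keeps its pin.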

4 What makes it all work

Health checks — the feature nobody thinks about until production is down

Every few seconds, the load balancer sends a small HTTP request to every server, something like GET /health. If the server replies with 200 OK within the timeout window, it stays in the pool. If it stops replying (crash, overload, deadlock, anything), it gets pulled from the pool once it has missed the configured number of checks. No engineer has to wake up and manually reroute traffic. It just happens.

This is the feature that turns "a cluster of servers" into "a resilient system." Without it, you could have the world's most sophisticated routing algorithm and still send 30% of traffic to a server that's been silently OOM-killed.

There's an important subtlety here: shallow vs. deep health checks. A shallow check just verifies the port is open — the process is running but might be completely stuck waiting on a database lock. A deep check actually pings your database, your cache, your message queue, and returns 503 if anything is broken. Here's what that looks like in practice:

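# NGINX Plus health check config
upstream backend {
    zone backend 64k;            # shared memory zone, needed for active health checks
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
}

server {
    location / {
        proxy_pass http://backend;
        # health_check goes in the location that proxies to the upstream
        health_check interval=5s fails=3 passes=2 uri=/health;
        # fails=3: mark DOWN only after 3 misses in a row (avoids flapping)
        # passes=2: require 2 clean checks before re-adding to pool
    }
}

# Your /health endpoint: make it actually check things
# (Flask; db and redis_client are assumed to be your app's existing clients)
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health_check():
    try:
        db.execute('SELECT 1')           # real DB check
        redis_client.ping()              # real cache check
        return jsonify(status='ok'), 200
    except Exception:
        return jsonify(status='degraded'), 503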

The fails=3 / passes=2 thresholds are deliberate. You don't want one blip to pull a server — that causes cascading load spikes on the remaining nodes. And you don't want a recovering server to rejoin on its first successful check — it might still be warming up.
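The threshold logic is easy to get subtly wrong, so here is a minimal sketch of it as a tiny state machine. The HealthTracker class is hypothetical and only models the counting; it is not how NGINX Plus implements it.

```python
class HealthTracker:
    """Flips a server's state only after enough consecutive contrary checks."""

    def __init__(self, fails=3, passes=2):
        self.fails = fails      # consecutive failures needed to mark DOWN
        self.passes = passes    # consecutive successes needed to mark UP
        self.healthy = True
        self._streak = 0        # consecutive results that contradict the state

    def record(self, check_ok):
        if check_ok == self.healthy:
            self._streak = 0    # result agrees with current state: reset
            return self.healthy
        self._streak += 1
        threshold = self.fails if self.healthy else self.passes
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker(fails=3, passes=2)
results = [False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in results]
print(states)  # [True, True, True, True, False, False, True]
```

The first lone failure is ignored as a blip, three misses in a row pull the server, and it only rejoins after two clean checks.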

Layer 4 vs. Layer 7 — this distinction actually matters

Load balancers can operate at different levels of the network stack, and this affects what they can see and what they can do.

Layer 4 — Transport

Sees IP addresses and TCP ports only
Cannot read HTTP headers, URLs, or cookies
Faster — no deep packet inspection
Used for TCP/UDP services: databases, game servers, DNS
AWS NLB, HAProxy in TCP mode

Layer 7 — Application

Reads full HTTP request: headers, URL, cookies, body
Can route /api/* to one pool, /images/* to another
Handles TLS termination so servers don't have to
Enables cookie-based sticky sessions
AWS ALB, NGINX, Traefik, Envoy

For almost any web application, you want Layer 7. The routing flexibility alone is worth it — you can send WebSocket traffic to a dedicated pool, route /admin to IP-restricted nodes, and A/B test by inspecting headers. Layer 4 is for raw TCP performance where every microsecond of latency counts and you can't afford to read the payload.

# Layer 7 content routing, sketched as ALB-style listener rules (pseudocode, not literal ALB syntax)
# Route based on path
/api/*           → api-target-group
/ws/*            → websocket-target-group   (longer idle timeout)
/static/*        → static-target-group      (or offload to S3 behind a CDN)

# Route canary traffic by header
If X-Beta-User: true  → canary-target-group
else                  → prod-target-group

# Route by hostname
api.myapp.com    → api cluster
admin.myapp.com  → admin cluster (with WAF + IP restriction)
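Those rules boil down to a chain of checks on the parsed request. Here is a rough sketch of that decision as a plain function. The function name, the rule order, and the api-cluster and admin-cluster pool names are assumptions for illustration; a real Layer 7 balancer evaluates rules by configured priority.

```python
def choose_pool(host, path, headers):
    """Pick a backend pool from the request's hostname, path, and headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in headers.items()}

    # Hostname rules first (assumed to have the highest priority).
    if host == "admin.myapp.com":
        return "admin-cluster"
    if host == "api.myapp.com":
        return "api-cluster"

    # Canary users are identified by a header.
    if headers.get("x-beta-user") == "true":
        return "canary-target-group"

    # Path-based rules.
    if path.startswith("/api/"):
        return "api-target-group"
    if path.startswith("/ws/"):
        return "websocket-target-group"

    return "prod-target-group"


print(choose_pool("www.myapp.com", "/api/users", {}))               # api-target-group
print(choose_pool("www.myapp.com", "/", {"X-Beta-User": "true"}))   # canary-target-group
print(choose_pool("admin.myapp.com", "/dashboard", {}))             # admin-cluster
```

None of this is possible at Layer 4, because the balancer never parses the hostname, path, or headers there.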

The real solution to session state

IP hashing and cookie stickiness solve the symptom, not the disease. If your session state lives in server memory, you've built a fragile system. The server goes down, the session is gone. You auto-scale, the new servers don't have old sessions. You do a deployment, all sessions die.

The proper fix is to move session state outside the server entirely:

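# Move sessions to Redis: now any server can handle any user
# (helper names below are illustrative)
import json
import redis

session_store = redis.Redis(host='cache.internal', port=6379)

# On login
def save_session(token, user_data):
    # Store as JSON and expire after 1 hour
    session_store.setex(f"session:{token}", 3600, json.dumps(user_data))

# On request
def load_session(token):
    raw = session_store.get(f"session:{token}")   # bytes, or None if missing or expired
    return json.loads(raw) if raw is not None else None

# Now you can use Round Robin freely.
# Any server can serve any user.
# Scale up, scale down, restart: sessions survive.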

Once you do this, load balancing becomes trivially simple. Round Robin with health checks covers 95% of use cases. IP hashing and sticky sessions are escape hatches for systems you can't refactor, not goals to aim for.

How real systems actually do it

System	Load Balancer	Algorithm	Why
Netflix API	AWS ALB + Eureka	Round Robin	Stateless microservices, uniform nodes
Discord (voice)	Custom L4 + L7	Least Connections	Long-lived WebSocket/UDP sessions vary wildly in cost
GitHub	HAProxy	Least Connections	Git clone of a 10GB repo vs. a simple API call — not the same
Cloudflare	Unimog (custom)	Weighted RR + Anycast	Global L4, every nanosecond matters
Kubernetes services	kube-proxy (iptables)	Random / Round Robin	Service mesh handles smarter routing above this

Notice that Least Connections appears whenever requests have uneven cost. Discord's voice channels and GitHub's git operations share the same property: to a simple rotation like Round Robin, a quick request and a slow one look identical, but one of them is a 10-second upload. Counting open connections is how the balancer sees the difference.

Which one to pick

If you're starting fresh: Round Robin with deep health checks. Get your session state into Redis. Make every server interchangeable. That's the architecture that lets you scale horizontally without headaches.

If requests have wildly different processing times — uploads, video, WebSockets, long database queries — switch to Least Connections. You'll immediately see better load distribution, especially during traffic spikes.

IP hashing is for when you inherit a system you can't change and sessions are baked into server memory. It's a band-aid. Cookie-based stickiness at Layer 7 (NGINX Plus, ALB) is a better band-aid that doesn't break behind NATs.

And health checks — always, always deep health checks. Shallow TCP pings lie. Your database can be completely locked and the port still accepts connections. Check what actually matters.

Srikanth's Portfolio
