Every massive system you use today — Instagram, Twitter, Uber — started on a single server. The path from that first machine to handling millions of requests per minute is not one big leap. It is a sequence of specific, well-understood engineering decisions, each triggered by hitting a concrete bottleneck.
This post walks through every stage. Real numbers, real trade-offs, real code.
## Stage 0 — The Single Server (0 → ~1,000 RPM)
Every product starts here. One machine runs everything: the web server, the application code, and the database. Deployment is an SSH session and a git pull.
A typical $20/month VPS handles around 50–200 requests/second for simple read-heavy apps (~3,000–12,000 RPM). The bottleneck hits when:
- CPU spikes to 100% under concurrent requests
- The database and app compete for the same RAM
- A single bad query locks the database and stalls everything
A basic Node.js/Express server at this stage looks like this:
```javascript
const express = require('express');
const { Pool } = require('pg');

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/posts', async (req, res) => {
  // Every request hits the DB — no cache, no optimisation
  const { rows } = await db.query(
    'SELECT * FROM posts ORDER BY created_at DESC LIMIT 20'
  );
  res.json(rows);
});

app.listen(3000);
```
What breaks first: the database. Every request queries it. No caching. As traffic grows, DB connections pile up and query latency climbs from 5ms to 500ms.
## Stage 1 — Vertical Scaling (1K → ~5,000 RPM)
The cheapest first move: upgrade the machine. More CPU cores, more RAM, faster SSD. No code changes, no architecture changes.
| Spec | Before | After | Approx Cost |
|---|---|---|---|
| CPU | 2 vCPU | 16 vCPU | $160/mo |
| RAM | 4 GB | 64 GB | included |
| Disk | 50 GB HDD | 500 GB NVMe SSD | included |
| Network | 1 Gbps | 10 Gbps | included |
Vertical scaling has a hard ceiling — the largest EC2 instance (u-24tb1.metal) has 448 vCPUs and 24TB RAM and costs ~$218/hour. Beyond that, you must scale horizontally. More importantly: vertical scaling gives you zero redundancy. One hardware failure takes your entire service down.
Rule of thumb: Vertical scale first (fast, zero risk), then horizontal scale when you hit the ceiling or need redundancy. Do not prematurely distribute.
## Stage 2 — Separate the Database (5K → ~15,000 RPM)
Split the app and database onto separate machines. The database gets dedicated CPU, RAM, and I/O. The app server is no longer competing with it.
Now the database can be tuned independently — buffer pool size, connection limits, vacuum settings — without touching the app. This separation alone typically doubles or triples throughput.
```ini
# postgresql.conf — tuning for a dedicated 64GB DB server
shared_buffers = 16GB                # 25% of RAM
effective_cache_size = 48GB          # 75% of RAM
work_mem = 64MB                      # per sort/hash operation
max_connections = 200                # app uses a connection pool
wal_buffers = 64MB
checkpoint_completion_target = 0.9
random_page_cost = 1.1               # SSD — lower than default 4.0
```
## Stage 3 — Load Balancer + Multiple App Servers (15K → ~60,000 RPM)
One app server has run out of CPU headroom. The solution: run multiple identical app servers behind a load balancer that distributes incoming requests across all of them.
Nginx load balancer config:
```nginx
# nginx.conf — load balancer
upstream app_servers {
    least_conn;        # route to server with fewest active connections
    server 10.0.1.10:3000 weight=1;
    server 10.0.1.11:3000 weight=1;
    server 10.0.1.12:3000 weight=1;
    keepalive 64;      # reuse upstream connections
}

server {
    listen 80;

    location / {
        proxy_pass http://app_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }
}
```
Now you have redundancy. If one app server dies, the load balancer stops sending traffic to it. Users never notice. But a new problem emerges: the database is now the single bottleneck for all three app servers.
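The "stops sending traffic" part depends on health checks. Open-source Nginx does this passively out of the box with the `max_fails` and `fail_timeout` parameters on each `server` line (active probing requires NGINX Plus or a third-party module). A sketch of how the upstream block above would be extended:

```nginx
# Passive health checks: after 3 failed attempts within fail_timeout,
# a server is marked down for 30s and traffic shifts to the others.
upstream app_servers {
    least_conn;
    server 10.0.1.10:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:3000 max_fails=3 fail_timeout=30s;
    keepalive 64;
}
```

Failures here mean connection errors or timeouts on real requests, so a dying server is ejected within seconds of misbehaving — no separate monitoring loop required.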
## Stage 4 — Caching with Redis (60K → ~200,000 RPM)
Most web traffic is read-heavy. The same data is fetched repeatedly — home feeds, product listings, user profiles. Hitting PostgreSQL for every one of these is wasteful. Redis (an in-memory key-value store) answers these reads in under 1ms, compared to 5–50ms for a database query.
Cache-aside pattern in Node.js:
```javascript
const redis = require('redis');

const client = redis.createClient({ url: process.env.REDIS_URL });
await client.connect(); // node-redis v4+: must connect before issuing commands

app.get('/posts', async (req, res) => {
  const cacheKey = 'posts:homepage';

  // 1. Try cache first
  const cached = await client.get(cacheKey);
  if (cached) {
    return res.json(JSON.parse(cached)); // ~0.5ms
  }

  // 2. Cache miss — query DB
  const { rows } = await db.query(
    'SELECT * FROM posts ORDER BY created_at DESC LIMIT 20'
  );

  // 3. Store in cache for 60 seconds
  await client.setEx(cacheKey, 60, JSON.stringify(rows));
  res.json(rows); // ~15ms DB query, cached for next 60s
});
```
A well-tuned Redis cache achieves 95–99% cache hit rates for read-heavy workloads. At 99% hit rate, your database only processes 1 in 100 requests — effectively a 100x reduction in DB load.
| Cache Strategy | When to Use | How It Stays Fresh |
|---|---|---|
| Cache-Aside (Lazy) | Read-heavy, tolerate slight staleness | TTL expiry |
| Write-Through | Data must always be fresh | Write to cache + DB simultaneously |
| Write-Behind | Write-heavy, DB writes can be async | Queue writes, flush periodically |
| Read-Through | Transparent caching, cache is source of truth | Cache fills itself on miss |
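Cache-aside is what the handler above implements; write-through is nearly as simple. A minimal sketch of both, using generic `cache`/`db` clients (hypothetical stand-ins for your Redis and PostgreSQL wrappers, not a real library API) so the pattern is visible without the infrastructure:

```javascript
// Write-through: every write updates the DB and the cache together,
// so subsequent reads never see stale data.
async function writeThrough(db, cache, key, value) {
  await db.set(key, value);    // 1. persist to the source of truth first
  await cache.set(key, value); // 2. then refresh the cache
  return value;
}

// Reads can now trust the cache unconditionally,
// falling back to the DB only on a cold cache.
async function readThrough(db, cache, key) {
  const hit = await cache.get(key);
  if (hit != null) return hit;
  const value = await db.get(key);
  if (value != null) await cache.set(key, value);
  return value;
}
```

The cost is a slower write path (two systems per write); the benefit is that a hot key never serves stale data, which is why write-through suits data that must always be fresh.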
## Stage 5 — CDN for Static Assets (200K → ~400,000 RPM)
Images, CSS, JS, fonts — these files never change between users. Every request for /logo.png that hits your server is wasted compute. A CDN (Content Delivery Network) caches static assets at hundreds of edge locations worldwide, serving them in milliseconds without ever touching your origin.
After adding a CDN, 60–80% of all requests (the static assets) are handled at the edge — your app servers only see dynamic API traffic, so the request volume hitting your origin drops by well over half.
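The CDN only caches what your origin tells it to, via the `Cache-Control` response header. A hypothetical helper (the name and policy are illustrative, not from any library) that gives fingerprinted static assets a year-long immutable lifetime while forcing HTML to revalidate, so users pick up new deploys immediately:

```javascript
// Fingerprinted assets (e.g. app.3f9a1c.js) can be cached for a year:
// a new deploy changes the filename, so a stale copy is never served.
function cacheControlFor(path) {
  if (/\.(js|css|png|jpg|svg|woff2)$/.test(path)) {
    return 'public, max-age=31536000, immutable'; // 1 year at the edge
  }
  return 'no-cache'; // revalidate with the origin on every request
}

// Wiring it into Express (sketch):
// app.use(express.static('public', {
//   setHeaders: (res, path) =>
//     res.setHeader('Cache-Control', cacheControlFor(path)),
// }));
```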
## Stage 6 — Database Read Replicas (400K → ~600,000 RPM)
Even with Redis, some queries cannot be cached — complex searches, user-specific feeds, analytics. These all hit PostgreSQL. A single primary DB is now the ceiling.
Solution: read replicas. PostgreSQL streaming replication copies every write from the primary to one or more replicas in near-real-time (~10–100ms lag). Read queries are distributed across replicas; writes go only to the primary.
```javascript
const { Pool } = require('pg');

// Route reads and writes to different DB connections
const primaryPool = new Pool({ host: process.env.DB_PRIMARY });
const replicaPool = new Pool({ host: process.env.DB_REPLICA_1 });

// Writes always go to primary
async function createPost(data) {
  return primaryPool.query(
    'INSERT INTO posts (title, body, user_id) VALUES ($1, $2, $3) RETURNING *',
    [data.title, data.body, data.userId]
  );
}

// Reads go to replica — slight lag is acceptable for feeds
async function getPosts() {
  return replicaPool.query(
    'SELECT * FROM posts ORDER BY created_at DESC LIMIT 20'
  );
}
```
With 3 read replicas, your read throughput is effectively 3x. The primary handles writes + replication; replicas absorb reads.
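With more than one replica, something has to spread the reads across them — the `pg` driver will not do it for you. A minimal round-robin picker (a hypothetical helper, assuming one `Pool` per replica) is enough to start:

```javascript
// Returns a picker that cycles through the given pools in order,
// spreading read queries evenly across replicas.
function roundRobin(pools) {
  let next = 0;
  return function pick() {
    const pool = pools[next];
    next = (next + 1) % pools.length;
    return pool;
  };
}

// Usage (sketch):
// const pickReplica = roundRobin([replicaPool1, replicaPool2, replicaPool3]);
// const { rows } = await pickReplica().query('SELECT ...');
```

A production setup would also skip replicas that fail health checks or lag too far behind, but even this naive version keeps any single replica from becoming a hotspot.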
## Stage 7 — Message Queues for Async Work (600K → ~800,000 RPM)
Not every operation needs to complete before responding to the user. Sending a welcome email, resizing a profile photo, updating recommendation models, logging analytics — all of these can be done asynchronously. Doing them synchronously adds latency to every request and can cause cascading failures.
Message queues (Kafka, RabbitMQ, SQS) decouple producers from consumers. Your API puts a job on the queue and immediately returns 200 OK. Worker processes pick up jobs at their own pace.
```javascript
// producer — API handler
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ brokers: [process.env.KAFKA_BROKER] });
const producer = kafka.producer();
await producer.connect(); // connect once at startup

app.post('/register', async (req, res) => {
  const user = await db.createUser(req.body);

  // Don't wait for email — publish to queue and return immediately
  await producer.send({
    topic: 'user.registered',
    messages: [{ value: JSON.stringify({ userId: user.id, email: user.email }) }],
  });

  res.status(201).json({ id: user.id }); // responds in ~5ms
});
```

```javascript
// consumer — separate email worker process
const consumer = kafka.consumer({ groupId: 'email-service' });
await consumer.connect(); // connect before subscribing
await consumer.subscribe({ topic: 'user.registered' });
await consumer.run({
  eachMessage: async ({ message }) => {
    const { userId, email } = JSON.parse(message.value.toString());
    await sendWelcomeEmail(email); // happens async, user already got 201
  },
});
```
By moving slow operations off the request path, your API response time drops from ~200ms to ~10ms. This alone allows existing servers to handle far more concurrent requests.
## Stage 8 — Auto-Scaling (800K → 1,000,000 RPM)
Traffic is never flat. A product launch, a viral tweet, a prime-time TV mention — traffic can spike 10x in seconds. Pre-provisioning enough servers for peak traffic wastes money 23 hours a day. Auto-scaling adds and removes servers automatically based on real-time CPU/memory/request-rate metrics.
```hcl
# AWS Auto Scaling policy — scale out when CPU > 70% for 2 minutes
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "scale-out"
  autoscaling_group_name = aws_autoscaling_group.app.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 2   # add 2 instances
  cooldown               = 120 # wait 2min before next scale event
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}

# Scale in when CPU < 30% — save money during off-peak
resource "aws_autoscaling_policy" "scale_in" {
  name                   = "scale-in"
  autoscaling_group_name = aws_autoscaling_group.app.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1  # remove 1 instance
  cooldown               = 300 # wait longer before removing
}
```
A new instance boots in ~60–90 seconds on EC2. With a warm AMI (pre-baked image with your app already installed), this drops to under 30 seconds. Combine with predictive scaling — scale out 5 minutes before your historical peak — and you absorb spikes before they impact users.
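Predictive scaling in its simplest form is just scheduled scaling. A sketch (the times and sizes are illustrative, and it assumes the same auto scaling group as above) that scales out shortly before a known daily 19:00 UTC peak and back down afterwards:

```hcl
# Scale out ahead of a recurring daily peak; scale back in after it.
resource "aws_autoscaling_schedule" "pre_peak" {
  scheduled_action_name  = "pre-peak-scale-out"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence             = "55 18 * * *" # 5 minutes before the 19:00 peak
  min_size               = 6
  max_size               = 20
  desired_capacity       = 10
}

resource "aws_autoscaling_schedule" "post_peak" {
  scheduled_action_name  = "post-peak-scale-in"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence             = "0 23 * * *"  # off-peak baseline
  min_size               = 2
  max_size               = 20
  desired_capacity       = 2
}
```

Reactive CPU-based scaling still runs alongside this, catching the spikes the schedule cannot predict.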
## The Full 1M RPM Architecture
### The Numbers at Each Stage
| Stage | Architecture | Approx RPM | Bottleneck Removed |
|---|---|---|---|
| 0 | Single server (all-in-one) | 0 → 1,000 | — |
| 1 | Vertical scale (bigger machine) | → 5,000 | CPU / RAM ceiling |
| 2 | Separate DB server | → 15,000 | DB competing with app for RAM |
| 3 | Load balancer + 3 app servers | → 60,000 | Single app server CPU |
| 4 | Redis caching layer | → 200,000 | Repeated DB reads |
| 5 | CDN for static assets | → 400,000 | Static traffic hitting origin |
| 6 | DB read replicas (×3) | → 600,000 | Single DB read throughput |
| 7 | Message queues (async work) | → 800,000 | Slow synchronous operations |
| 8 | Auto-scaling + predictive scale | → 1,000,000+ | Fixed capacity during spikes |
## What Real Companies Did
| Company | Moment They Scaled | Key Move |
|---|---|---|
| Twitter (2008) | Fail Whale era — DB could not handle fanout | Moved timeline to Redis; pre-computed feeds |
| Instagram (2011) | 12 engineers, 30M users | PostgreSQL + Redis + CDN; never rewrote the monolith |
| Airbnb (2011) | Single MySQL on EC2 | Moved to RDS, added Memcached, then Kafka for async |
| Uber (2014) | Monolith hitting limits | Microservices + geo-sharded databases by city |
| Discord (2017) | 100M messages/day, MongoDB struggling | Migrated to Cassandra for time-series message storage |
## Mistakes Engineers Make While Scaling
- Over-engineering too early. Microservices on day one adds coordination overhead without the traffic to justify it. Start with a well-structured monolith.
- Not caching aggressively enough. Most teams add Redis too late. Add it as soon as the same reads start repeating — well before the database becomes the bottleneck.
- Forgetting database indexes. A single missing index on a high-traffic query will crash your DB before any architecture change saves you. `EXPLAIN ANALYZE` every slow query.
- Using the primary DB for reads. Even with low traffic, separate reads to a replica. The habit pays off enormously later.
- Synchronous everything. If the operation does not need to complete before you respond to the user, put it in a queue.
- No connection pooling. PostgreSQL has a max_connections limit. Without a connection pooler (PgBouncer), 50 app server threads each opening 10 connections = 500 connections → DB crashes.
```ini
# PgBouncer — connection pooler in front of PostgreSQL
# 1,000 app connections → pooled to 50 real DB connections
[databases]
myapp = host=db.internal port=5432 dbname=myapp

[pgbouncer]
listen_port = 6432
pool_mode = transaction   # reuse connections per transaction
max_client_conn = 1000    # accept up to 1000 app connections
default_pool_size = 50    # maintain 50 real DB connections
reserve_pool_size = 10
server_idle_timeout = 600
```
## The One-Paragraph Summary
Start with one server. When it is full, make it bigger (vertical scale). Separate the database. When the app is full, add more app servers behind a load balancer. Add Redis to stop hammering the DB with repeat reads. Put a CDN in front of static assets. Add read replicas when the primary DB is the bottleneck. Move slow work to async queues to keep response times fast. Set up auto-scaling so you pay for only what you need and absorb any spike automatically. At each stage, measure first, then fix the actual bottleneck — never the assumed one.