Every massive system you use today — Instagram, Twitter, Uber — started on a single server. The path from that first machine to handling millions of requests per minute is not one big leap. It is a sequence of specific, well-understood engineering decisions, each triggered by hitting a concrete bottleneck.
This post walks through every stage. Real numbers, real trade-offs, real code.
## Stage 0 — The Single Server (0 → ~1,000 RPM)
Every product starts here. One machine runs everything: the web server, the application code, and the database. Deployment is an SSH session and a git pull.
A typical $20/month VPS handles around 50–200 requests/second for simple read-heavy apps (~3,000–12,000 RPM). The bottleneck hits when:
- CPU spikes to 100% under concurrent requests
- The database and app compete for the same RAM
- A single bad query locks the database and stalls everything
A basic Node.js/Express server at this stage looks like this:
```javascript
const express = require('express');
const { Pool } = require('pg');

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/posts', async (req, res) => {
  // Every request hits the DB — no cache, no optimisation
  const { rows } = await db.query(
    'SELECT * FROM posts ORDER BY created_at DESC LIMIT 20'
  );
  res.json(rows);
});

app.listen(3000);
```
What breaks first: the database. Every request queries it. No caching. As traffic grows, DB connections pile up and query latency climbs from 5ms to 500ms.
## Stage 1 — Vertical Scaling (1K → ~5,000 RPM)
The cheapest first move: upgrade the machine. More CPU cores, more RAM, faster SSD. No code changes, no architecture changes.
| Spec | Before | After | Approx Cost |
|---|---|---|---|
| CPU | 2 vCPU | 16 vCPU | $160/mo |
| RAM | 4 GB | 64 GB | included |
| Disk | 50 GB HDD | 500 GB NVMe SSD | included |
| Network | 1 Gbps | 10 Gbps | included |
Vertical scaling has a hard ceiling — the largest EC2 instance (u-24tb1.metal) has 448 vCPUs and 24TB RAM and costs ~$218/hour. Beyond that, you must scale horizontally. More importantly: vertical scaling gives you zero redundancy. One hardware failure takes your entire service down.
Rule of thumb: Vertical scale first (fast, zero risk), then horizontal scale when you hit the ceiling or need redundancy. Do not prematurely distribute.
## Stage 2 — Separate the Database (5K → ~15,000 RPM)
Split the app and database onto separate machines. The database gets dedicated CPU, RAM, and I/O. The app server is no longer competing with it.
Now the database can be tuned independently — buffer pool size, connection limits, vacuum settings — without touching the app. This separation alone typically doubles or triples throughput.
```ini
# postgresql.conf — tuning for a dedicated 64GB DB server
shared_buffers = 16GB                # 25% of RAM
effective_cache_size = 48GB          # 75% of RAM
work_mem = 64MB                      # per sort/hash operation
max_connections = 200                # app uses a connection pool
wal_buffers = 64MB
checkpoint_completion_target = 0.9
random_page_cost = 1.1               # SSD — lower than default 4.0
```
## Stage 3 — Load Balancer + Multiple App Servers (15K → ~60,000 RPM)
One app server has run out of CPU headroom. The solution: run multiple identical app servers behind a load balancer that distributes incoming requests across all of them.
Nginx load balancer config:
```nginx
# nginx.conf — load balancer
upstream app_servers {
    least_conn;        # route to server with fewest active connections
    server 10.0.1.10:3000 weight=1;
    server 10.0.1.11:3000 weight=1;
    server 10.0.1.12:3000 weight=1;
    keepalive 64;      # reuse upstream connections
}

server {
    listen 80;

    location / {
        proxy_pass http://app_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }
}
```
Now you have redundancy. If one app server dies, the load balancer stops sending traffic to it. Users never notice. But a new problem emerges: the database is now the single bottleneck for all three app servers.
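The "stops sending traffic" part depends on health checks. Open-source Nginx does this passively out of the box with the `max_fails` and `fail_timeout` parameters on each `server` line (active probing requires NGINX Plus or a third-party module). A sketch of how the upstream block above would be extended:

```nginx
# Passive health checks: after 3 failed attempts within fail_timeout,
# a server is marked down for 30s and traffic shifts to the others.
upstream app_servers {
    least_conn;
    server 10.0.1.10:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:3000 max_fails=3 fail_timeout=30s;
    keepalive 64;
}
```

Failures here mean connection errors or timeouts on real requests, so a dying server is ejected within seconds of misbehaving — no separate monitoring loop required.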
## Stage 4 — Caching with Redis (60K → ~200,000 RPM)
Most web traffic is read-heavy. The same data is fetched repeatedly — home feeds, product listings, user profiles. Hitting PostgreSQL for every one of these is wasteful. Redis (an in-memory key-value store) answers these reads in under 1ms, compared to 5–50ms for a database query.
Cache-aside pattern in Node.js:
```javascript
const redis = require('redis');

const client = redis.createClient({ url: process.env.REDIS_URL });
await client.connect(); // node-redis v4+: must connect before issuing commands

app.get('/posts', async (req, res) => {
  const cacheKey = 'posts:homepage';

  // 1. Try cache first
  const cached = await client.get(cacheKey);
  if (cached) {
    return res.json(JSON.parse(cached)); // ~0.5ms
  }

  // 2. Cache miss — query DB
  const { rows } = await db.query(
    'SELECT * FROM posts ORDER BY created_at DESC LIMIT 20'
  );

  // 3. Store in cache for 60 seconds
  await client.setEx(cacheKey, 60, JSON.stringify(rows));
  res.json(rows); // ~15ms DB query, cached for next 60s
});
```
A well-tuned Redis cache achieves 95–99% cache hit rates for read-heavy workloads. At 99% hit rate, your database only processes 1 in 100 requests — effectively a 100x reduction in DB load.
| Cache Strategy | When to Use | How It Stays Fresh |
|---|---|---|
| Cache-Aside (Lazy) | Read-heavy, tolerate slight staleness | TTL expiry |
| Write-Through | Data must always be fresh | Write to cache + DB simultaneously |
| Write-Behind | Write-heavy, DB writes can be async | Queue writes, flush periodically |
| Read-Through | Transparent caching, cache is source of truth | Cache fills itself on miss |
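Cache-aside is what the handler above implements; write-through is nearly as simple. A minimal sketch of both, using generic `cache`/`db` clients (hypothetical stand-ins for your Redis and PostgreSQL wrappers, not a real library API) so the pattern is visible without the infrastructure:

```javascript
// Write-through: every write updates the DB and the cache together,
// so subsequent reads never see stale data.
async function writeThrough(db, cache, key, value) {
  await db.set(key, value);    // 1. persist to the source of truth first
  await cache.set(key, value); // 2. then refresh the cache
  return value;
}

// Reads can now trust the cache unconditionally,
// falling back to the DB only on a cold cache.
async function readThrough(db, cache, key) {
  const hit = await cache.get(key);
  if (hit != null) return hit;
  const value = await db.get(key);
  if (value != null) await cache.set(key, value);
  return value;
}
```

The cost is a slower write path (two systems per write); the benefit is that a hot key never serves stale data, which is why write-through suits data that must always be fresh.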
## Stage 5 — CDN for Static Assets (200K → ~400,000 RPM)
Images, CSS, JS, fonts — these files never change between users. Every request for /logo.png that hits your server is wasted compute. A CDN (Content Delivery Network) caches static assets at hundreds of edge locations worldwide, serving them in milliseconds without ever touching your origin.
After adding a CDN, 60–80% of all requests (the static assets) are handled at the edge — your app servers only see dynamic API traffic, so the request volume hitting your origin drops by well over half.
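The CDN only caches what your origin tells it to, via the `Cache-Control` response header. A hypothetical helper (the name and policy are illustrative, not from any library) that gives fingerprinted static assets a year-long immutable lifetime while forcing HTML to revalidate, so users pick up new deploys immediately:

```javascript
// Fingerprinted assets (e.g. app.3f9a1c.js) can be cached for a year:
// a new deploy changes the filename, so a stale copy is never served.
function cacheControlFor(path) {
  if (/\.(js|css|png|jpg|svg|woff2)$/.test(path)) {
    return 'public, max-age=31536000, immutable'; // 1 year at the edge
  }
  return 'no-cache'; // revalidate with the origin on every request
}

// Wiring it into Express (sketch):
// app.use(express.static('public', {
//   setHeaders: (res, path) =>
//     res.setHeader('Cache-Control', cacheControlFor(path)),
// }));
```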
## Stage 6 — Database Read Replicas (400K → ~600,000 RPM)
Even with Redis, some queries cannot be cached — complex searches, user-specific feeds, analytics. These all hit PostgreSQL. A single primary DB is now the ceiling.
Solution: read replicas. PostgreSQL streaming replication copies every write from the primary to one or more replicas in near-real-time (~10–100ms lag). Read queries are distributed across replicas; writes go only to the primary.
```javascript
const { Pool } = require('pg');

// Route reads and writes to different DB connections
const primaryPool = new Pool({ host: process.env.DB_PRIMARY });
const replicaPool = new Pool({ host: process.env.DB_REPLICA_1 });

// Writes always go to primary
async function createPost(data) {
  return primaryPool.query(
    'INSERT INTO posts (title, body, user_id) VALUES ($1, $2, $3) RETURNING *',
    [data.title, data.body, data.userId]
  );
}

// Reads go to replica — slight lag is acceptable for feeds
async function getPosts() {
  return replicaPool.query(
    'SELECT * FROM posts ORDER BY created_at DESC LIMIT 20'
  );
}
```
With 3 read replicas, your read throughput is effectively 3x. The primary handles writes + replication; replicas absorb reads.
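With more than one replica, something has to spread the reads across them — the `pg` driver will not do it for you. A minimal round-robin picker (a hypothetical helper, assuming one `Pool` per replica) is enough to start:

```javascript
// Returns a picker that cycles through the given pools in order,
// spreading read queries evenly across replicas.
function roundRobin(pools) {
  let next = 0;
  return function pick() {
    const pool = pools[next];
    next = (next + 1) % pools.length;
    return pool;
  };
}

// Usage (sketch):
// const pickReplica = roundRobin([replicaPool1, replicaPool2, replicaPool3]);
// const { rows } = await pickReplica().query('SELECT ...');
```

A production setup would also skip replicas that fail health checks or lag too far behind, but even this naive version keeps any single replica from becoming a hotspot.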
## Stage 7 — Message Queues for Async Work (600K → ~800,000 RPM)
Not every operation needs to complete before responding to the user. Sending a welcome email, resizing a profile photo, updating recommendation models, logging analytics — all of these can be done asynchronously. Doing them synchronously adds latency to every request and can cause cascading failures.
Message queues (Kafka, RabbitMQ, SQS) decouple producers from consumers. Your API puts a job on the queue and immediately returns 200 OK. Worker processes pick up jobs at their own pace.
```javascript
// producer — API handler
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ brokers: [process.env.KAFKA_BROKER] });
const producer = kafka.producer();
await producer.connect(); // connect once at startup

app.post('/register', async (req, res) => {
  const user = await db.createUser(req.body);

  // Don't wait for email — publish to queue and return immediately
  await producer.send({
    topic: 'user.registered',
    messages: [{ value: JSON.stringify({ userId: user.id, email: user.email }) }],
  });

  res.status(201).json({ id: user.id }); // responds in ~5ms
});
```

```javascript
// consumer — separate email worker process
const consumer = kafka.consumer({ groupId: 'email-service' });
await consumer.connect(); // connect before subscribing
await consumer.subscribe({ topic: 'user.registered' });
await consumer.run({
  eachMessage: async ({ message }) => {
    const { userId, email } = JSON.parse(message.value.toString());
    await sendWelcomeEmail(email); // happens async, user already got 201
  },
});
```
By moving slow operations off the request path, your API response time drops from ~200ms to ~10ms. This alone allows existing servers to handle far more concurrent requests.
## Stage 8 — Auto-Scaling (800K → 1,000,000 RPM)
Traffic is never flat. A product launch, a viral tweet, a prime-time TV mention — traffic can spike 10x in seconds. Pre-provisioning enough servers for peak traffic wastes money 23 hours a day. Auto-scaling adds and removes servers automatically based on real-time CPU/memory/request-rate metrics.
```hcl
# AWS Auto Scaling policy — scale out when CPU > 70% for 2 minutes
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "scale-out"
  autoscaling_group_name = aws_autoscaling_group.app.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 2   # add 2 instances
  cooldown               = 120 # wait 2min before next scale event
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}

# Scale in when CPU < 30% — save money during off-peak
resource "aws_autoscaling_policy" "scale_in" {
  name                   = "scale-in"
  autoscaling_group_name = aws_autoscaling_group.app.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1  # remove 1 instance
  cooldown               = 300 # wait longer before removing
}
```
A new instance boots in ~60–90 seconds on EC2. With a warm AMI (pre-baked image with your app already installed), this drops to under 30 seconds. Combine with predictive scaling — scale out 5 minutes before your historical peak — and you absorb spikes before they impact users.
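Predictive scaling in its simplest form is just scheduled scaling. A sketch (the times and sizes are illustrative, and it assumes the same auto scaling group as above) that scales out shortly before a known daily 19:00 UTC peak and back down afterwards:

```hcl
# Scale out ahead of a recurring daily peak; scale back in after it.
resource "aws_autoscaling_schedule" "pre_peak" {
  scheduled_action_name  = "pre-peak-scale-out"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence             = "55 18 * * *" # 5 minutes before the 19:00 peak
  min_size               = 6
  max_size               = 20
  desired_capacity       = 10
}

resource "aws_autoscaling_schedule" "post_peak" {
  scheduled_action_name  = "post-peak-scale-in"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence             = "0 23 * * *"  # off-peak baseline
  min_size               = 2
  max_size               = 20
  desired_capacity       = 2
}
```

Reactive CPU-based scaling still runs alongside this, catching the spikes the schedule cannot predict.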
## The Full 1M RPM Architecture
### The Numbers at Each Stage
| Stage | Architecture | Approx RPM | Bottleneck Removed |
|---|---|---|---|
| 0 | Single server (all-in-one) | 0 → 1,000 | — |
| 1 | Vertical scale (bigger machine) | → 5,000 | CPU / RAM ceiling |
| 2 | Separate DB server | → 15,000 | DB competing with app for RAM |
| 3 | Load balancer + 3 app servers | → 60,000 | Single app server CPU |
| 4 | Redis caching layer | → 200,000 | Repeated DB reads |
| 5 | CDN for static assets | → 400,000 | Static traffic hitting origin |
| 6 | DB read replicas (×3) | → 600,000 | Single DB read throughput |
| 7 | Message queues (async work) | → 800,000 | Slow synchronous operations |
| 8 | Auto-scaling + predictive scale | → 1,000,000+ | Fixed capacity during spikes |
## What Real Companies Did
| Company | Moment They Scaled | Key Move |
|---|---|---|
| Twitter (2008) | Fail Whale era — DB could not handle fanout | Moved timeline to Redis; pre-computed feeds |
| Instagram (2011) | 12 engineers, 30M users | PostgreSQL + Redis + CDN; never rewrote the monolith |
| Airbnb (2011) | Single MySQL on EC2 | Moved to RDS, added Memcached, then Kafka for async |
| Uber (2014) | Monolith hitting limits | Microservices + geo-sharded databases by city |
| Discord (2017) | 100M messages/day, MongoDB struggling | Migrated to Cassandra for time-series message storage |
## Mistakes Engineers Make While Scaling
- Over-engineering too early. Microservices on day one adds coordination overhead without the traffic to justify it. Start with a well-structured monolith.
- Not caching aggressively enough. Most teams add Redis too late. Add it as soon as the same reads start repeating — well before the database becomes the bottleneck.
- Forgetting database indexes. A single missing index on a high-traffic query will crash your DB before any architecture change saves you. `EXPLAIN ANALYZE` every slow query.
- Using the primary DB for reads. Even with low traffic, separate reads to a replica. The habit pays off enormously later.
- Synchronous everything. If the operation does not need to complete before you respond to the user, put it in a queue.
- No connection pooling. PostgreSQL has a max_connections limit. Without a connection pooler (PgBouncer), 50 app server threads each opening 10 connections = 500 connections → DB crashes.
```ini
# PgBouncer — connection pooler in front of PostgreSQL
# 1,000 app connections → pooled to 50 real DB connections
[databases]
myapp = host=db.internal port=5432 dbname=myapp

[pgbouncer]
listen_port = 6432
pool_mode = transaction   # reuse connections per transaction
max_client_conn = 1000    # accept up to 1000 app connections
default_pool_size = 50    # maintain 50 real DB connections
reserve_pool_size = 10
server_idle_timeout = 600
```
## The One-Paragraph Summary
Start with one server. When it is full, make it bigger (vertical scale). Separate the database. When the app is full, add more app servers behind a load balancer. Add Redis to stop hammering the DB with repeat reads. Put a CDN in front of static assets. Add read replicas when the primary DB is the bottleneck. Move slow work to async queues to keep response times fast. Set up auto-scaling so you pay for only what you need and absorb any spike automatically. At each stage, measure first, then fix the actual bottleneck — never the assumed one.