
How Does a 10GB File Become 3GB When Zipped?

April 2026 · 12 min read · Srikanth Badavath


This is one of those interview questions that sounds simple but reveals a lot about how deeply you understand systems. The answer isn't "ZIP uses magic." It's a two-algorithm pipeline — LZ77 and Huffman coding — working in sequence, each exploiting a different type of redundancy in your data. Let's go end to end.

The Core Insight: Data Has Redundancy

A 10GB file isn't 10GB of unique information. Most files contain massive amounts of redundancy — repeated patterns, predictable sequences, and symbols that appear far more often than others. Compression finds and eliminates that redundancy.

There are two fundamental types of redundancy that ZIP exploits:

| Type | What It Is | Algorithm That Fixes It |
|------|------------|-------------------------|
| Repetition redundancy | The same sequence of bytes appears multiple times | LZ77 (back-references) |
| Statistical redundancy | Some symbols appear far more often than others | Huffman coding (shorter codes for common symbols) |

ZIP's DEFLATE algorithm runs both in sequence: LZ77 first (eliminates repetition), then Huffman (eliminates statistical imbalance). The result is the compressed file.

The Full Compression Pipeline

```mermaid
flowchart LR
    Raw["Raw File\n10 GB\n(bytes on disk)"]
    LZ77["Step 1: LZ77\nSliding Window\nReplace repeated\nsequences with\nback-references"]
    Inter["Intermediate\ntoken stream\n~5-6 GB equivalent\n(literals + references)"]
    Huff["Step 2: Huffman\nCoding\nAssign shorter bit\npatterns to frequent\nsymbols"]
    ZIP["ZIP File\n~3 GB\n(compressed bytes)"]
    Raw -->|"scan for\nrepeats"| LZ77
    LZ77 -->|"token\nstream"| Inter
    Inter -->|"encode\nfrequencies"| Huff
    Huff --> ZIP
```
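This pipeline is exactly what Python's built-in zlib module implements (DEFLATE, the same algorithm ZIP uses), so you can sanity-check the end-to-end behavior in a few lines:

```python
import zlib

# Highly repetitive input — both LZ77 and Huffman have plenty to exploit
raw = b"hello hello hello world\n" * 1000

compressed = zlib.compress(raw, level=9)   # DEFLATE = LZ77 + Huffman
print(len(raw), "->", len(compressed))     # 24,000 bytes shrinks dramatically

assert zlib.decompress(compressed) == raw  # lossless: exact original bytes back
```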

Step 1: LZ77 — Eliminating Repetition

LZ77 (Lempel-Ziv 1977) uses a sliding window — a buffer of previously seen bytes. As it scans the file, whenever it sees a sequence it has seen before, it replaces that sequence with a compact (distance, length) back-reference instead of repeating the actual bytes.

How the Sliding Window Works

Search buffer (past — up to 32 KB)        Lookahead buffer (current)
t h e _ c a t _ s a t _ o n              t h e _ m a t
                                          ^^^^^
Match found: "the" at distance=14, length=3

Output: literals 't','h','e',' ','c','a','t',' ','s','a','t',' ','o','n'
then:   (distance=14, length=3) instead of 't','h','e' again

LZ77 sliding window: "the" in the lookahead matches "the" from 14 positions back. Output a (14,3) back-reference — 2 bytes instead of 3.

Concrete Example

Consider this text (simplified to show the principle):

# Original string — 41 bytes
"abracadabra is a magic word. abracadabra!"

# LZ77 sees:
"abracadabra" first occurrence → output as literals (11 bytes)
" is a magic word. " → output as literals (18 bytes)
"abracadabra" second occurrence → output as (distance=29, length=11)
"!" → output as literal (1 byte)

# Token stream: 11 + 18 + 1(ref) + 1 = 31 tokens
# Instead of 41 bytes, we have 31 token units — already ~24% smaller
# In a real 10GB file with thousands of repetitions, this is massive
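Here's a minimal, unoptimized sketch of that matching loop (real DEFLATE uses hash chains to find matches quickly; this brute-force version just shows the logic). Note that a real matcher also finds the internal "abra" repeat inside the first occurrence, so its output is slightly smaller than the simplified walkthrough above:

```python
def lz77_compress(data: bytes, window: int = 32 * 1024, min_len: int = 3):
    """Toy LZ77: emit ('lit', byte) and ('ref', distance, length) tokens."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force: try every starting point in the sliding window
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) and length < 258   # DEFLATE's max match
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:   # matches shorter than 3 bytes aren't worth a reference
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

tokens = lz77_compress(b"abracadabra is a magic word. abracadabra!")
print([t for t in tokens if t[0] == "ref"])
# → [('ref', 7, 4), ('ref', 29, 11)]
```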

Why Text and Code Compress Better Than Images

English text reuses words constantly — "the", "and", "function", variable names. A typical source code file might contain the word return hundreds of times; LZ77 replaces every occurrence after the first with a 2–3 byte back-reference instead of the 6-byte literal. Log files are even more extreme — timestamps, hostnames, and error codes repeat thousands of times per file.

```mermaid
flowchart LR
    subgraph HighCompression["High Compression (70-80% reduction)"]
        T["Text files\n.txt .log .csv"]
        C["Source code\n.py .js .java"]
        X["XML / JSON\n.xml .json .html"]
    end
    subgraph MedCompression["Medium Compression (30-50% reduction)"]
        D["Word docs\n.docx .pptx"]
        E["Executables\n.exe .dll"]
    end
    subgraph LowCompression["Low/No Compression (0-5% reduction)"]
        I["JPEG images\n.jpg"]
        V["Videos\n.mp4 .mkv"]
        Z["Already zipped\n.zip .gz .7z"]
    end
```

Step 2: Huffman Coding — Eliminating Statistical Imbalance

After LZ77, we have a stream of tokens. Each token is still stored with a fixed number of bits. But some tokens appear far more often than others — Huffman coding exploits this by assigning shorter bit codes to frequent symbols and longer codes to rare ones.

The Key Insight: Not All Bits Are Equal

Standard ASCII stores every character in 8 bits — 'e' and 'z' both cost 8 bits, even though 'e' appears ~13% of the time in English and 'z' appears ~0.07%. That's wasteful. Huffman says: give 'e' a 3-bit code and 'z' a 12-bit code. On average, you save bits.

Building the Huffman Tree

Huffman tree — example with 5 symbols:

                    root (100%)
                 0 /          \ 1
          e+t+a (55%)        s+z (45%)
          0 /     \ 1        0 /    \ 1
     'e' (30%)  t+a (25%)  's' (35%)  'z' (10%)
              0 /    \ 1
         't' (15%)  'a' (10%)

Codes: e=00 (2 bits), s=10 (2 bits), z=11 (2 bits), t=010 (3 bits), a=011 (3 bits)

Huffman tree built from symbol frequencies. More frequent symbols ('e', 's') get shorter paths from the root = fewer bits. Rare symbols get longer paths.

The Math Behind the Savings

# Fixed-width encoding (before Huffman): 8 bits per symbol
# Assume 1,000,000 symbols total

# Character frequencies and their codes:
# Symbol | Frequency | Fixed (bits) | Huffman code | Huffman (bits)
# -------|-----------|--------------|--------------|---------------
# 'e'    |   300,000 |      8       |     00       |      2
# 's'    |   350,000 |      8       |     10       |      2
# 't'    |   150,000 |      8       |     010      |      3
# 'a'    |   100,000 |      8       |     011      |      3
# 'z'    |   100,000 |      8       |     11       |      2

# Fixed-width total bits:
fixed_total = 1_000_000 * 8  # = 8,000,000 bits = 1,000,000 bytes

# Huffman total bits:
huffman_total = (300_000 * 2 +   # 'e'
                 350_000 * 2 +   # 's'
                 150_000 * 3 +   # 't'
                 100_000 * 3 +   # 'a'
                 100_000 * 2)    # 'z'
# = 600,000 + 700,000 + 450,000 + 300,000 + 200,000
# = 2,250,000 bits = 281,250 bytes

savings = (1 - huffman_total / fixed_total) * 100
print(f"Huffman saves: {savings:.1f}%")
# Output: Huffman saves: 71.9%
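To make the tree-building step concrete, here's a minimal Huffman coder sketch using Python's heapq. (Real DEFLATE uses canonical, length-limited Huffman codes, but the merge-the-two-rarest principle is the same; the sample string here is just for illustration.)

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix-free code table: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (frequency, unique tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # pop the two least-frequent subtrees
        f2, _, right = heapq.heappop(heap)
        # Prefix the left subtree's codes with 0, the right's with 1, then merge
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "abracadabra is a magic word. abracadabra!"
codes = huffman_codes(text)
bits = sum(len(codes[ch]) for ch in text)
print(f"fixed-width: {len(text) * 8} bits, Huffman: {bits} bits")
```

Frequent symbols like 'a' end up with short codes; one-off symbols like '!' get long ones — and no code is a prefix of another, so the bit stream decodes unambiguously.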

The Entropy Limit: Why You Can't Compress Forever

There's a mathematical floor on how small you can compress data — called Shannon Entropy. It measures how much "surprise" or true information is in the data.

import math

def shannon_entropy(probabilities):
    """
    H(X) = -sum(p * log2(p))
    Returns entropy in bits per symbol — the theoretical minimum.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# English text: letter frequencies (approximate)
english_freqs = [0.13, 0.09, 0.08, 0.075, 0.07,  # e,t,a,o,i
                 0.065, 0.063, 0.061, 0.06, 0.055, # n,s,h,r,l
                 0.04, 0.028, 0.028, 0.024, 0.022,  # d,c,u,m,f
                 0.02, 0.02, 0.019, 0.015, 0.01,    # p,g,w,y,b
                 0.008, 0.005, 0.002, 0.001, 0.001, 0.001] # v,k,x,j,q,z

entropy = shannon_entropy(english_freqs)
print(f"English text entropy: {entropy:.2f} bits/symbol")
# English text entropy: ≈4.12 bits/symbol
# vs 8 bits/symbol (ASCII fixed-width)
# → Maximum theoretical compression: ~48% reduction on text alone

# Random data (uniform distribution — all symbols equally likely)
random_freqs = [1/256] * 256
entropy_random = shannon_entropy(random_freqs)
print(f"Random data entropy: {entropy_random:.2f} bits/symbol")
# Random data entropy: 8.00 bits/symbol
# → Cannot compress at all (no redundancy)

Key insight: Random data has maximum entropy — every bit carries genuine information. You cannot losslessly compress truly random data. This is why encrypted files and already-compressed files barely shrink when zipped: encryption deliberately randomizes the bytes, removing all pattern.
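You can watch the entropy limit in action with zlib: random bytes don't shrink at all (the stream actually grows by a few container bytes), while repetitive text collapses:

```python
import os
import zlib

random_data = os.urandom(1_000_000)                 # maximum entropy: no patterns
repetitive  = b"the cat sat on the mat. " * 40_000  # ~960 KB of pure redundancy

print(len(zlib.compress(random_data, 9)))  # slightly LARGER than 1,000,000
print(len(zlib.compress(repetitive, 9)))   # collapses to a tiny fraction
```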

End-to-End: What Actually Happens to Your 10GB File

```mermaid
sequenceDiagram
    participant File as 10 GB Raw File
    participant LZ77 as LZ77 Engine
    participant Huff as Huffman Encoder
    participant ZIP as ZIP File (3 GB)
    File->>LZ77: Stream bytes through sliding window (32KB)
    Note over LZ77: Scan lookahead buffer<br/>Find longest match in search buffer<br/>Emit (distance, length) or literal
    LZ77->>LZ77: Repeat for every byte in 10 GB
    LZ77->>Huff: Token stream (~6 GB equivalent)
    Note over Huff: Count symbol frequencies<br/>Build Huffman tree (bottom-up)<br/>Assign variable-length bit codes
    Huff->>Huff: Encode every token with its bit code
    Huff->>ZIP: Compressed bit stream + Huffman table header
    Note over ZIP: DEFLATE block format:<br/>3-bit header + Huffman table<br/>+ compressed data
    ZIP-->>File: Decompress: reverse Huffman → reverse LZ77 → original bytes
```

Why Exactly 10GB → 3GB? The Numbers

The 70% compression ratio you get on a typical text/log file breaks down like this:

| Stage | Input Size | Output Size | What Was Eliminated |
|-------|-----------|-------------|---------------------|
| Raw file | 10 GB | 10 GB | — |
| After LZ77 | 10 GB | ~5–6 GB | Repeated byte sequences → back-references |
| After Huffman | ~5–6 GB | ~2.5–3 GB | Fixed-width codes → variable-width by frequency |
| Final ZIP | 10 GB | ~3 GB | ~70% of the data was redundancy |

The actual ratio depends entirely on the file content. Here's what you'd see in practice:

Compression ratio by file type (10 GB input):

| File type | Output size |
|-----------|-------------|
| Log files | ~1.5 GB |
| Source code | ~2 GB |
| JSON/XML | ~2.5 GB |
| Word docs | ~5 GB |
| JPEG images | ~9.8 GB |
| Already zipped | ~10 GB (slightly larger) |

Output size after zipping 10 GB of different file types. Log files compress from 10 GB → 1.5 GB. JPEG images barely move. Zipping a zip makes it slightly larger.

Why JPEG and MP4 Don't Compress

JPEG images are already compressed using lossy compression (DCT + quantization). MP4 video uses H.264/H.265, also lossy. After their compression, the remaining bytes are intentionally unpredictable — close to random. LZ77 finds no repeated sequences. Huffman finds no frequency imbalance. ZIP adds a tiny overhead and the file grows slightly.

This is not a bug; it's information theory. You cannot compress data that has already been squeezed to its entropy limit.

Decompression: Perfectly Reversible

ZIP is lossless — decompressing gives you the exact original bytes. How?

flowchart LR Z["ZIP File\n3 GB"] HuffD["Reverse Huffman\nRead Huffman table\nfrom header, decode\nbits → tokens"] LZD["Reverse LZ77\nFor each back-reference\n(dist, len): copy bytes\nfrom output buffer"] Out["Original File\n10 GB\nbit-for-bit identical"] Z --> HuffD HuffD -->|"token stream"| LZD LZD --> Out

The Huffman table is stored in the ZIP file header (a few hundred bytes). The LZ77 back-references resolve because the decompressor builds the same output buffer as the compressor did — so "copy 11 bytes from 30 positions back" always resolves correctly.
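Here's a sketch of the LZ77 half of decompression, using the same toy token format as the examples above (an assumption for illustration — real DEFLATE interleaves this with Huffman decoding). The interesting detail is the overlapping copy: a reference like (distance=1, length=4) is legal and expands a single byte into a run:

```python
def lz77_decompress(tokens) -> bytes:
    """Expand ('lit', byte) / ('ref', distance, length) tokens back into bytes."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # copying byte-at-a-time makes overlap work
    return bytes(out)

# (distance=1, length=4) re-copies the byte just written, four times over
print(lz77_decompress([("lit", ord("a")), ("ref", 1, 4)]))
# → b'aaaaa'
```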

Where Exactly Did the 7 GB Go? A Byte-Level Walkthrough

Let's trace a real example. Take a short log line that repeats 1,000,000 times in your log file — tens of megabytes of a real log. Here's what happens, byte by byte:

# Original log line — 71 bytes (including the newline), repeated 1,000,000 times
line = "2026-04-03 08:14:22 INFO  [api-server] GET /health 200 OK latency=12ms\n"

# Total raw size: 71 bytes × 1,000,000 = 71,000,000 bytes ≈ 71 MB

# ─────────────────────────────────────────────────────────────────
# PHASE 1: LZ77 processes this
# ─────────────────────────────────────────────────────────────────

# First occurrence (line 1): no match in the empty buffer → emit 71 literals
# Cost: 71 bytes

# Second occurrence (line 2): FULL MATCH in the search buffer
# LZ77 emits: (distance=71, length=71) → a back-reference token
# This token costs ~3 bytes (2 bytes distance + 1 byte length)
# Instead of 71 bytes → 3 bytes = 68 bytes saved

# Lines 3–1,000,000: same back-reference each time
# 999,999 × 3 bytes = 2,999,997 bytes

# LZ77 output size: 71 + (999,999 × 3) = 3,000,068 bytes ≈ 3 MB
# From 71 MB → 3 MB (~96% reduction on this pattern alone!)

# ─────────────────────────────────────────────────────────────────
# PHASE 2: Huffman processes the token stream
# ─────────────────────────────────────────────────────────────────

# In the 3 MB token stream:
# - The back-reference token (dist=71, len=71) appears 999,999 times → very high frequency
# - Literal bytes for the first line appear once each

# Huffman assigns the most common token the shortest code: ~2 bits
# vs the default 8 bits per byte

# Final size: roughly 999,999 × (2/8) bytes + small overhead ≈ ~250 KB
# Total compressed size for this section: ~250 KB from 71 MB original

The 7 GB Visualized — Where It Goes

Where the 7 GB actually goes:

- Raw file: 10 GB
- LZ77 eliminates ~4 GB — repetition redundancy (back-references replace repeated byte patterns)
- Huffman eliminates ~3 GB — statistical redundancy (frequent symbols stored in fewer bits)
- ~3 GB remains — true entropy (irreducible information that cannot be compressed further) = the ZIP file

Typical breakdown in a 10 GB server log file:

- Timestamps (repeated format) → LZ77 saves ~2.1 GB
- Hostnames/IPs (repeated strings) → LZ77 saves ~0.8 GB
- Status codes (200, 404, 500) → LZ77 saves ~0.4 GB
- Common words (INFO, GET, POST) → Huffman saves ~1.2 GB
- ASCII encoding (only 95 of 256 byte values used) → Huffman saves ~1.5 GB
- Unique values (latencies, user IDs) → stays, ~3.0 GB

Total saved: ~7 GB | Remaining: ~3 GB | ZIP file: 3 GB

Anatomy of 7 GB of eliminated redundancy in a typical server log file. LZ77 handles structural repetition; Huffman handles symbol frequency imbalance.

Real Numbers: Compressing a Log File Line by Line

import zlib  # Python's built-in DEFLATE (same as ZIP)

# Simulate what ZIP does to a log file
log_line = b"2026-04-03 08:14:22 INFO  [api-server] GET /health 200 OK latency=12ms\n"

# Single line — not much repetition yet
single = log_line
compressed_single = zlib.compress(single, level=9)
print(f"1 line:    {len(single):>10,} bytes → {len(compressed_single):>8,} bytes  ({len(compressed_single)/len(single)*100:.1f}%)")

# 100 identical lines — LZ77 kicks in hard
hundred = log_line * 100
compressed_hundred = zlib.compress(hundred, level=9)
print(f"100 lines: {len(hundred):>10,} bytes → {len(compressed_hundred):>8,} bytes  ({len(compressed_hundred)/len(hundred)*100:.1f}%)")

# 10,000 identical lines
ten_k = log_line * 10_000
compressed_10k = zlib.compress(ten_k, level=9)
print(f"10K lines: {len(ten_k):>10,} bytes → {len(compressed_10k):>8,} bytes  ({len(compressed_10k)/len(ten_k)*100:.1f}%)")

# Representative output (exact compressed sizes vary by zlib build):
# 1 line:            71 bytes →      ~75 bytes  (≈100% — no repetition yet)
# 100 lines:      7,100 bytes →     ~110 bytes  (≈1.5%)  ← LZ77 finds all repeats
# 10K lines:    710,000 bytes →     ~400 bytes  (≈0.06%) ← nearly free to add more

# The more repetition, the better compression gets.
# A real log file isn't 100% identical lines, but it has thousands of repeated substrings.
# This is why a 10 GB log often compresses to ~0.5–1 GB in practice.

What Happens to Each Type of Content in Your 10 GB File

| Content in File | Example | LZ77 Effect | Huffman Effect | Net Saving |
|-----------------|---------|-------------|----------------|------------|
| Repeated timestamps | `2026-04-03 08:14:` | Back-reference after 1st occurrence | Common chars get 2–3 bits | ~97% |
| Repeated hostnames | `api-server-prod-01` | Replaced by a (dist, len) tuple | Letters already Huffman-coded | ~95% |
| HTTP status codes | `200 OK`, `404` | Very short → Huffman more useful | Digits coded in 3–4 bits | ~60% |
| Unique user IDs | `usr_a3f9b2c...` | No repetition possible | Hex chars have near-even frequency | ~10% |
| Random UUIDs | `550e8400-e29b-41d4` | No pattern | Near-uniform distribution | ~5% |
| JSON structure | `{"key":"value"}` | Braces/quotes repeat heavily | `{ } " :` very frequent → short codes | ~80% |

ZIP vs gzip vs 7-Zip vs Brotli

| Format | Algorithm | Ratio (text) | Speed | Best For |
|--------|-----------|--------------|-------|----------|
| `.zip` | DEFLATE (LZ77 + Huffman) | Good | Fast | General purpose, cross-platform |
| `.gz` | DEFLATE (same as ZIP) | Good | Fast | Unix/Linux, single files |
| `.bz2` | BWT + Huffman | Better | Slow | Better ratio than gzip, CPU-heavy |
| `.7z` | LZMA (LZ77 + range coding) | Best | Slowest | Maximum compression, archiving |
| `.br` (Brotli) | LZ77 + Huffman + context modeling | Best for web | Medium | HTTP compression (web servers) |
| `.zst` (Zstandard) | LZ77 + ANS | Excellent | Very fast | Real-time compression (Facebook) |
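Three of these algorithm families ship in Python's standard library, so you can benchmark the tradeoff yourself (the exact sizes depend on the input; this sample is deliberately repetitive):

```python
import bz2
import lzma
import zlib

data = b"2026-04-03 08:14:22 INFO GET /health 200 OK\n" * 50_000

for name, compress in [("DEFLATE (.zip/.gz)", lambda d: zlib.compress(d, 9)),
                       ("BWT (.bz2)", lambda d: bz2.compress(d, 9)),
                       ("LZMA (.7z/.xz)", lambda d: lzma.compress(d, preset=9))]:
    print(f"{name:>20}: {len(compress(data)):>8,} bytes from {len(data):,}")
```

Timing the same loop with time.perf_counter shows the other side of the tradeoff: LZMA typically compresses smallest but slowest.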

The Interview Answer — Concise Version

If asked this in an interview, here's the clean 60-second answer:

"A 10GB file shrinks to 3GB because most files contain massive redundancy — repeated patterns and uneven symbol frequencies. ZIP uses two algorithms in sequence: first LZ77, which replaces repeated byte sequences with compact back-references to earlier positions in the file; then Huffman coding, which assigns shorter bit codes to symbols that appear frequently and longer codes to rare ones. Together they can eliminate 70%+ of the data in text and log files. The theoretical limit is Shannon entropy — truly random data like encrypted files or already-compressed media can't be compressed further because every bit carries genuine information."

Key Takeaways