
How Does a 10GB File Become 3GB When Zipped?

April 2026 · 12 min read · Srikanth Badavath


This is one of those interview questions that sounds simple but reveals a lot about how deeply you understand systems. The answer isn't "ZIP uses magic." It's a two-algorithm pipeline — LZ77 and Huffman coding — working in sequence, each exploiting a different type of redundancy in your data. Let's go end to end.

The Core Insight: Data Has Redundancy

A 10GB file isn't 10GB of unique information. Most files contain massive amounts of redundancy — repeated patterns, predictable sequences, and symbols that appear far more often than others. Compression finds and eliminates that redundancy.

There are two fundamental types of redundancy that ZIP exploits:

| Type | What It Is | Algorithm That Fixes It |
|------|------------|-------------------------|
| Repetition redundancy | The same sequence of bytes appears multiple times | LZ77 (back-references) |
| Statistical redundancy | Some symbols appear far more often than others | Huffman coding (shorter codes for common symbols) |

ZIP's DEFLATE algorithm runs both in sequence: LZ77 first (eliminates repetition), then Huffman (eliminates statistical imbalance). The result is the compressed file.

The Full Compression Pipeline

```mermaid
flowchart LR
    Raw["Raw File\n10 GB\n(bytes on disk)"]
    LZ77["Step 1: LZ77\nSliding Window\nReplace repeated\nsequences with\nback-references"]
    Inter["Intermediate\ntoken stream\n~5-6 GB equivalent\n(literals + references)"]
    Huff["Step 2: Huffman\nCoding\nAssign shorter bit\npatterns to frequent\nsymbols"]
    ZIP["ZIP File\n~3 GB\n(compressed bytes)"]
    Raw -->|"scan for\nrepeats"| LZ77
    LZ77 -->|"token\nstream"| Inter
    Inter -->|"encode\nfrequencies"| Huff
    Huff --> ZIP
```
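This pipeline is exactly what Python's built-in zlib module implements (DEFLATE, the same algorithm ZIP uses), so you can sanity-check the end-to-end behavior in a few lines:

```python
import zlib

# Highly repetitive input — both LZ77 and Huffman have plenty to exploit
raw = b"hello hello hello world\n" * 1000

compressed = zlib.compress(raw, level=9)   # DEFLATE = LZ77 + Huffman
print(len(raw), "->", len(compressed))     # 24,000 bytes shrinks dramatically

assert zlib.decompress(compressed) == raw  # lossless: exact original bytes back
```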

Step 1: LZ77 — Eliminating Repetition

LZ77 (Lempel-Ziv 1977) uses a sliding window — a buffer of previously seen bytes. As it scans the file, whenever it sees a sequence it has seen before, it replaces that sequence with a compact (distance, length) back-reference instead of repeating the actual bytes.

How the Sliding Window Works

Search buffer (past — up to 32 KB)        Lookahead buffer (current)
t h e _ c a t _ s a t _ o n              t h e _ m a t
                                          ^^^^^
Match found: "the" at distance=14, length=3

Output: literals 't','h','e',' ','c','a','t',' ','s','a','t',' ','o','n'
then:   (distance=14, length=3) instead of 't','h','e' again

LZ77 sliding window: "the" in the lookahead matches "the" from 14 positions back. Output a (14,3) back-reference — 2 bytes instead of 3.

Concrete Example

Consider this text (simplified to show the principle):

# Original string — 41 bytes
"abracadabra is a magic word. abracadabra!"

# LZ77 sees:
"abracadabra" first occurrence → output as literals (11 bytes)
" is a magic word. " → output as literals (18 bytes)
"abracadabra" second occurrence → output as (distance=29, length=11)
"!" → output as literal (1 byte)

# Token stream: 11 + 18 + 1(ref) + 1 = 31 tokens
# Instead of 41 bytes, we have 31 token units — already ~24% smaller
# In a real 10GB file with thousands of repetitions, this is massive
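Here's a minimal, unoptimized sketch of that matching loop (real DEFLATE uses hash chains to find matches quickly; this brute-force version just shows the logic). Note that a real matcher also finds the internal "abra" repeat inside the first occurrence, so its output is slightly smaller than the simplified walkthrough above:

```python
def lz77_compress(data: bytes, window: int = 32 * 1024, min_len: int = 3):
    """Toy LZ77: emit ('lit', byte) and ('ref', distance, length) tokens."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force: try every starting point in the sliding window
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) and length < 258   # DEFLATE's max match
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:   # matches shorter than 3 bytes aren't worth a reference
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

tokens = lz77_compress(b"abracadabra is a magic word. abracadabra!")
print([t for t in tokens if t[0] == "ref"])
# → [('ref', 7, 4), ('ref', 29, 11)]
```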

Why Text and Code Compress Better Than Images

English text reuses words constantly — "the", "and", "function", variable names. A typical source code file might contain the word return hundreds of times; LZ77 replaces every occurrence after the first with a 2–3 byte back-reference instead of the 6-byte literal. Log files are even more extreme — timestamps, hostnames, and error codes repeat thousands of times per file.

```mermaid
flowchart LR
    subgraph HighCompression["High Compression (70-80% reduction)"]
        T["Text files\n.txt .log .csv"]
        C["Source code\n.py .js .java"]
        X["XML / JSON\n.xml .json .html"]
    end
    subgraph MedCompression["Medium Compression (30-50% reduction)"]
        D["Word docs\n.docx .pptx"]
        E["Executables\n.exe .dll"]
    end
    subgraph LowCompression["Low/No Compression (0-5% reduction)"]
        I["JPEG images\n.jpg"]
        V["Videos\n.mp4 .mkv"]
        Z["Already zipped\n.zip .gz .7z"]
    end
```

Step 2: Huffman Coding — Eliminating Statistical Imbalance

After LZ77, we have a stream of tokens. Each token is still stored with a fixed number of bits. But some tokens appear far more often than others — Huffman coding exploits this by assigning shorter bit codes to frequent symbols and longer codes to rare ones.

The Key Insight: Not All Bits Are Equal

Standard ASCII stores every character in 8 bits — 'e' and 'z' both cost 8 bits, even though 'e' appears ~13% of the time in English and 'z' appears ~0.07%. That's wasteful. Huffman says: give 'e' a 3-bit code and 'z' a 12-bit code. On average, you save bits.

Building the Huffman Tree

Huffman tree — example with 5 symbols:

                    root (100%)
                 0 /          \ 1
          e+t+a (55%)        s+z (45%)
          0 /     \ 1        0 /    \ 1
     'e' (30%)  t+a (25%)  's' (35%)  'z' (10%)
              0 /    \ 1
         't' (15%)  'a' (10%)

Codes: e=00 (2 bits), s=10 (2 bits), z=11 (2 bits), t=010 (3 bits), a=011 (3 bits)

Huffman tree built from symbol frequencies. More frequent symbols ('e', 's') get shorter paths from the root = fewer bits. Rare symbols get longer paths.

The Math Behind the Savings

# Fixed-width encoding (before Huffman): 8 bits per symbol
# Assume 1,000,000 symbols total

# Character frequencies and their codes:
# Symbol | Frequency | Fixed (bits) | Huffman code | Huffman (bits)
# -------|-----------|--------------|--------------|---------------
# 'e'    |   300,000 |      8       |     00       |      2
# 's'    |   350,000 |      8       |     10       |      2
# 't'    |   150,000 |      8       |     010      |      3
# 'a'    |   100,000 |      8       |     011      |      3
# 'z'    |   100,000 |      8       |     11       |      2

# Fixed-width total bits:
fixed_total = 1_000_000 * 8  # = 8,000,000 bits = 1,000,000 bytes

# Huffman total bits:
huffman_total = (300_000 * 2 +   # 'e'
                 350_000 * 2 +   # 's'
                 150_000 * 3 +   # 't'
                 100_000 * 3 +   # 'a'
                 100_000 * 2)    # 'z'
# = 600,000 + 700,000 + 450,000 + 300,000 + 200,000
# = 2,250,000 bits = 281,250 bytes

savings = (1 - huffman_total / fixed_total) * 100
print(f"Huffman saves: {savings:.1f}%")
# Output: Huffman saves: 71.9%
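To make the tree-building step concrete, here's a minimal Huffman coder sketch using Python's heapq. (Real DEFLATE uses canonical, length-limited Huffman codes, but the merge-the-two-rarest principle is the same; the sample string here is just for illustration.)

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix-free code table: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (frequency, unique tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # pop the two least-frequent subtrees
        f2, _, right = heapq.heappop(heap)
        # Prefix the left subtree's codes with 0, the right's with 1, then merge
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "abracadabra is a magic word. abracadabra!"
codes = huffman_codes(text)
bits = sum(len(codes[ch]) for ch in text)
print(f"fixed-width: {len(text) * 8} bits, Huffman: {bits} bits")
```

Frequent symbols like 'a' end up with short codes; one-off symbols like '!' get long ones — and no code is a prefix of another, so the bit stream decodes unambiguously.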

The Entropy Limit: Why You Can't Compress Forever

There's a mathematical floor on how small you can compress data — called Shannon Entropy. It measures how much "surprise" or true information is in the data.

import math

def shannon_entropy(probabilities):
    """
    H(X) = -sum(p * log2(p))
    Returns entropy in bits per symbol — the theoretical minimum.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# English text: letter frequencies (approximate)
english_freqs = [0.13, 0.09, 0.08, 0.075, 0.07,  # e,t,a,o,i
                 0.065, 0.063, 0.061, 0.06, 0.055, # n,s,h,r,l
                 0.04, 0.028, 0.028, 0.024, 0.022,  # d,c,u,m,f
                 0.02, 0.02, 0.019, 0.015, 0.01,    # p,g,w,y,b
                 0.008, 0.005, 0.002, 0.001, 0.001, 0.001] # v,k,x,j,q,z

entropy = shannon_entropy(english_freqs)
print(f"English text entropy: {entropy:.2f} bits/symbol")
# English text entropy: ≈4.12 bits/symbol
# vs 8 bits/symbol (ASCII fixed-width)
# → Maximum theoretical compression: ~48% reduction on text alone

# Random data (uniform distribution — all symbols equally likely)
random_freqs = [1/256] * 256
entropy_random = shannon_entropy(random_freqs)
print(f"Random data entropy: {entropy_random:.2f} bits/symbol")
# Random data entropy: 8.00 bits/symbol
# → Cannot compress at all (no redundancy)

Key insight: Random data has maximum entropy — every bit carries genuine information. You cannot losslessly compress truly random data. This is why encrypted files and already-compressed files barely shrink when zipped: encryption deliberately randomizes the bytes, removing all pattern.
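You can watch the entropy limit in action with zlib: random bytes don't shrink at all (the stream actually grows by a few container bytes), while repetitive text collapses:

```python
import os
import zlib

random_data = os.urandom(1_000_000)                 # maximum entropy: no patterns
repetitive  = b"the cat sat on the mat. " * 40_000  # ~960 KB of pure redundancy

print(len(zlib.compress(random_data, 9)))  # slightly LARGER than 1,000,000
print(len(zlib.compress(repetitive, 9)))   # collapses to a tiny fraction
```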

End-to-End: What Actually Happens to Your 10GB File

```mermaid
sequenceDiagram
    participant File as 10 GB Raw File
    participant LZ77 as LZ77 Engine
    participant Huff as Huffman Encoder
    participant ZIP as ZIP File (3 GB)
    File->>LZ77: Stream bytes through sliding window (32KB)
    Note over LZ77: Scan lookahead buffer<br/>Find longest match in search buffer<br/>Emit (distance, length) or literal
    LZ77->>LZ77: Repeat for every byte in 10 GB
    LZ77->>Huff: Token stream (~6 GB equivalent)
    Note over Huff: Count symbol frequencies<br/>Build Huffman tree (bottom-up)<br/>Assign variable-length bit codes
    Huff->>Huff: Encode every token with its bit code
    Huff->>ZIP: Compressed bit stream + Huffman table header
    Note over ZIP: DEFLATE block format:<br/>3-bit header + Huffman table<br/>+ compressed data
    ZIP-->>File: Decompress: reverse Huffman → reverse LZ77 → original bytes
```

Why Exactly 10GB → 3GB? The Numbers

The 70% compression ratio you get on a typical text/log file breaks down like this:

| Stage | Input Size | Output Size | What Was Eliminated |
|-------|-----------|-------------|---------------------|
| Raw file | 10 GB | 10 GB | — |
| After LZ77 | 10 GB | ~5–6 GB | Repeated byte sequences → back-references |
| After Huffman | ~5–6 GB | ~2.5–3 GB | Fixed-width codes → variable-width by frequency |
| Final ZIP | 10 GB | ~3 GB | ~70% of the data was redundancy |

The actual ratio depends entirely on the file content. Here's what you'd see in practice:

Compression ratio by file type (10 GB input):

| File type | Output size |
|-----------|-------------|
| Log files | ~1.5 GB |
| Source code | ~2 GB |
| JSON/XML | ~2.5 GB |
| Word docs | ~5 GB |
| JPEG images | ~9.8 GB |
| Already zipped | ~10 GB (slightly larger) |

Output size after zipping 10 GB of different file types. Log files compress from 10 GB → 1.5 GB. JPEG images barely move. Zipping a zip makes it slightly larger.

Why JPEG and MP4 Don't Compress

JPEG images are already compressed using lossy compression (DCT + quantization). MP4 video uses H.264/H.265, also lossy. After their compression, the remaining bytes are intentionally unpredictable — close to random. LZ77 finds no repeated sequences. Huffman finds no frequency imbalance. ZIP adds a tiny overhead and the file grows slightly.

This is not a bug; it's information theory. You cannot compress data that has already been squeezed to its entropy limit.

Decompression: Perfectly Reversible

ZIP is lossless — decompressing gives you the exact original bytes. How?

flowchart LR Z["ZIP File\n3 GB"] HuffD["Reverse Huffman\nRead Huffman table\nfrom header, decode\nbits → tokens"] LZD["Reverse LZ77\nFor each back-reference\n(dist, len): copy bytes\nfrom output buffer"] Out["Original File\n10 GB\nbit-for-bit identical"] Z --> HuffD HuffD -->|"token stream"| LZD LZD --> Out

The Huffman table is stored in the ZIP file header (a few hundred bytes). The LZ77 back-references resolve because the decompressor builds the same output buffer as the compressor did — so "copy 11 bytes from 30 positions back" always resolves correctly.
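Here's a sketch of the LZ77 half of decompression, using the same toy token format as the examples above (an assumption for illustration — real DEFLATE interleaves this with Huffman decoding). The interesting detail is the overlapping copy: a reference like (distance=1, length=4) is legal and expands a single byte into a run:

```python
def lz77_decompress(tokens) -> bytes:
    """Expand ('lit', byte) / ('ref', distance, length) tokens back into bytes."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # copying byte-at-a-time makes overlap work
    return bytes(out)

# (distance=1, length=4) re-copies the byte just written, four times over
print(lz77_decompress([("lit", ord("a")), ("ref", 1, 4)]))
# → b'aaaaa'
```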

Where Exactly Did the 7 GB Go? A Byte-Level Walkthrough

Let's trace a real example. Take a short log line that repeats 1,000,000 times in your log file — tens of megabytes of a real log. Here's what happens, byte by byte:

# Original log line — 71 bytes (including the newline), repeated 1,000,000 times
line = "2026-04-03 08:14:22 INFO  [api-server] GET /health 200 OK latency=12ms\n"

# Total raw size: 71 bytes × 1,000,000 = 71,000,000 bytes ≈ 71 MB

# ─────────────────────────────────────────────────────────────────
# PHASE 1: LZ77 processes this
# ─────────────────────────────────────────────────────────────────

# First occurrence (line 1): no match in the empty buffer → emit 71 literals
# Cost: 71 bytes

# Second occurrence (line 2): FULL MATCH in the search buffer
# LZ77 emits: (distance=71, length=71) → a back-reference token
# This token costs ~3 bytes (2 bytes distance + 1 byte length)
# Instead of 71 bytes → 3 bytes = 68 bytes saved

# Lines 3–1,000,000: same back-reference each time
# 999,999 × 3 bytes = 2,999,997 bytes

# LZ77 output size: 71 + (999,999 × 3) = 3,000,068 bytes ≈ 3 MB
# From 71 MB → 3 MB (~96% reduction on this pattern alone!)

# ─────────────────────────────────────────────────────────────────
# PHASE 2: Huffman processes the token stream
# ─────────────────────────────────────────────────────────────────

# In the 3 MB token stream:
# - The back-reference token (dist=71, len=71) appears 999,999 times → very high frequency
# - Literal bytes for the first line appear once each

# Huffman assigns the most common token the shortest code: ~2 bits
# vs the default 8 bits per byte

# Final size: roughly 999,999 × (2/8) bytes + small overhead ≈ ~250 KB
# Total compressed size for this section: ~250 KB from 71 MB original

The 7 GB Visualized — Where It Goes

Where the 7 GB actually goes:

- Raw file: 10 GB
- LZ77 eliminates ~4 GB — repetition redundancy (back-references replace repeated byte patterns)
- Huffman eliminates ~3 GB — statistical redundancy (frequent symbols stored in fewer bits)
- ~3 GB remains — true entropy (irreducible information that cannot be compressed further) = the ZIP file

Typical breakdown in a 10 GB server log file:

- Timestamps (repeated format) → LZ77 saves ~2.1 GB
- Hostnames/IPs (repeated strings) → LZ77 saves ~0.8 GB
- Status codes (200, 404, 500) → LZ77 saves ~0.4 GB
- Common words (INFO, GET, POST) → Huffman saves ~1.2 GB
- ASCII encoding (only 95 of 256 byte values used) → Huffman saves ~1.5 GB
- Unique values (latencies, user IDs) → stays, ~3.0 GB

Total saved: ~7 GB | Remaining: ~3 GB | ZIP file: 3 GB

Anatomy of 7 GB of eliminated redundancy in a typical server log file. LZ77 handles structural repetition; Huffman handles symbol frequency imbalance.

Real Numbers: Compressing a Log File Line by Line

import zlib  # Python's built-in DEFLATE (same as ZIP)

# Simulate what ZIP does to a log file
log_line = b"2026-04-03 08:14:22 INFO  [api-server] GET /health 200 OK latency=12ms\n"

# Single line — not much repetition yet
single = log_line
compressed_single = zlib.compress(single, level=9)
print(f"1 line:    {len(single):>10,} bytes → {len(compressed_single):>8,} bytes  ({len(compressed_single)/len(single)*100:.1f}%)")

# 100 identical lines — LZ77 kicks in hard
hundred = log_line * 100
compressed_hundred = zlib.compress(hundred, level=9)
print(f"100 lines: {len(hundred):>10,} bytes → {len(compressed_hundred):>8,} bytes  ({len(compressed_hundred)/len(hundred)*100:.1f}%)")

# 10,000 identical lines
ten_k = log_line * 10_000
compressed_10k = zlib.compress(ten_k, level=9)
print(f"10K lines: {len(ten_k):>10,} bytes → {len(compressed_10k):>8,} bytes  ({len(compressed_10k)/len(ten_k)*100:.1f}%)")

# Representative output (exact compressed sizes vary by zlib build):
# 1 line:            71 bytes →      ~75 bytes  (≈100% — no repetition yet)
# 100 lines:      7,100 bytes →     ~110 bytes  (≈1.5%)  ← LZ77 finds all repeats
# 10K lines:    710,000 bytes →     ~400 bytes  (≈0.06%) ← nearly free to add more

# The more repetition, the better compression gets.
# A real log file isn't 100% identical lines, but it has thousands of repeated substrings.
# This is why a 10 GB log often compresses to ~0.5–1 GB in practice.

What Happens to Each Type of Content in Your 10 GB File

| Content in File | Example | LZ77 Effect | Huffman Effect | Net Saving |
|-----------------|---------|-------------|----------------|------------|
| Repeated timestamps | `2026-04-03 08:14:` | Back-reference after 1st occurrence | Common chars get 2–3 bits | ~97% |
| Repeated hostnames | `api-server-prod-01` | Replaced by a (dist, len) tuple | Letters already Huffman-coded | ~95% |
| HTTP status codes | `200 OK`, `404` | Very short → Huffman more useful | Digits coded in 3–4 bits | ~60% |
| Unique user IDs | `usr_a3f9b2c...` | No repetition possible | Hex chars have near-even frequency | ~10% |
| Random UUIDs | `550e8400-e29b-41d4` | No pattern | Near-uniform distribution | ~5% |
| JSON structure | `{"key":"value"}` | Braces/quotes repeat heavily | `{ } " :` very frequent → short codes | ~80% |

ZIP vs gzip vs 7-Zip vs Brotli

| Format | Algorithm | Ratio (text) | Speed | Best For |
|--------|-----------|--------------|-------|----------|
| `.zip` | DEFLATE (LZ77 + Huffman) | Good | Fast | General purpose, cross-platform |
| `.gz` | DEFLATE (same as ZIP) | Good | Fast | Unix/Linux, single files |
| `.bz2` | BWT + Huffman | Better | Slow | Better ratio than gzip, CPU-heavy |
| `.7z` | LZMA (LZ77 + range coding) | Best | Slowest | Maximum compression, archiving |
| `.br` (Brotli) | LZ77 + Huffman + context modeling | Best for web | Medium | HTTP compression (web servers) |
| `.zst` (Zstandard) | LZ77 + ANS | Excellent | Very fast | Real-time compression (Facebook) |
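Three of these algorithm families ship in Python's standard library, so you can benchmark the tradeoff yourself (the exact sizes depend on the input; this sample is deliberately repetitive):

```python
import bz2
import lzma
import zlib

data = b"2026-04-03 08:14:22 INFO GET /health 200 OK\n" * 50_000

for name, compress in [("DEFLATE (.zip/.gz)", lambda d: zlib.compress(d, 9)),
                       ("BWT (.bz2)", lambda d: bz2.compress(d, 9)),
                       ("LZMA (.7z/.xz)", lambda d: lzma.compress(d, preset=9))]:
    print(f"{name:>20}: {len(compress(data)):>8,} bytes from {len(data):,}")
```

Timing the same loop with time.perf_counter shows the other side of the tradeoff: LZMA typically compresses smallest but slowest.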

The Interview Answer — Concise Version

If asked this in an interview, here's the clean 60-second answer:

"A 10GB file shrinks to 3GB because most files contain massive redundancy — repeated patterns and uneven symbol frequencies. ZIP uses two algorithms in sequence: first LZ77, which replaces repeated byte sequences with compact back-references to earlier positions in the file; then Huffman coding, which assigns shorter bit codes to symbols that appear frequently and longer codes to rare ones. Together they can eliminate 70%+ of the data in text and log files. The theoretical limit is Shannon entropy — truly random data like encrypted files or already-compressed media can't be compressed further because every bit carries genuine information."

Key Takeaways