What is Google TurboQuant? KV Cache Compression Explained
Problem
When I deployed a Llama-3.1-70B model with a 128K token context window, my H100 GPU ran out of memory within minutes. The culprit? KV cache explosion.
Here’s what happened:
Model Weights (FP16):      ~140 GB
KV Cache (128K tokens):    ~320 GB  <-- This killed me
Activation Memory:         ~20 GB
-----------------------------------
Total Required:            ~480 GB

Available: 1x H100 = 80 GB
Result: OOM Error

The KV cache consumed more than twice as much memory as the model weights themselves. I thought quantizing the model weights would help, but the weights account for only about 30% of the total. The real bottleneck was the KV cache.
What is KV Cache?
Before diving into the solution, I needed to understand why KV cache explodes:
KV Cache Size = Layers x 2 (K and V) x Sequence Length x Heads x Head Dimension x Bytes
For Llama-3.1-70B:
- 80 layers
- 128K max context length (131,072 tokens)
- 64 attention heads
- 128 dimensions per head
- 2 bytes per value (FP16)

Calculation:
80 x 2 x 131,072 x 64 x 128 x 2 bytes = 320 GiB (~320 GB)

The KV cache grows linearly with context length. Double your context, double your memory. This is why million-token contexts are commercially impractical.
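The formula is easy to turn into a small calculator. The sketch below is my own helper, not part of any TurboQuant release; `kv_cache_bytes` and its parameter names are illustrative. It assumes 128K means 131,072 tokens and reports sizes in GiB, which reproduces the ~320 figure exactly. Note that `kv_heads` counts key/value heads: the first call uses 64, as in the calculation above, while the second uses 8, which is what Llama-3.1-70B's published config lists under grouped-query attention.

```python
GIB = 2**30

def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bits_per_value=16):
    """Bytes needed to cache keys and values for one sequence."""
    values = layers * 2 * seq_len * kv_heads * head_dim  # 2 = one K plus one V
    return values * bits_per_value // 8

# 64 KV heads: one K/V pair per attention head (classic multi-head attention)
mha = kv_cache_bytes(layers=80, seq_len=128 * 1024, kv_heads=64, head_dim=128)

# 8 KV heads: grouped-query attention, as in Llama-3.1-70B's published config
gqa = kv_cache_bytes(layers=80, seq_len=128 * 1024, kv_heads=8, head_dim=128)

print(f"64 KV heads: {mha / GIB:.0f} GiB")  # 320 GiB
print(f" 8 KV heads: {gqa / GIB:.0f} GiB")  # 40 GiB
```

Whichever head count applies to your model, the cache still scales linearly with `seq_len`, so the long-context problem has the same shape.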
The Solution: TurboQuant
On March 24, 2026, Google Research announced TurboQuant - a training-free compression algorithm that reduces KV cache memory by 6x with zero accuracy loss.
The key insight: you don’t need 16-bit precision to store KV cache values. TurboQuant compresses to just 3 bits per value.
Before TurboQuant:
- Precision: FP16 (16 bits per value)
- KV cache for 128K context: ~320 GB

After TurboQuant:
- Precision: 3 bits per value
- KV cache for 128K context: ~53 GB

Compression Ratio: 6x

How TurboQuant Works
TurboQuant combines two mathematical techniques:
1. PolarQuant: Random Rotation Quantization
The first technique is deceptively simple. Before quantizing, apply a random rotation to the vectors:
Traditional Quantization:
[3.7, -2.1, 0.8, ...] --> [4, -2, 1, ...] (High error on outliers)

PolarQuant:
[3.7, -2.1, 0.8, ...] --> Rotate --> [0.3, -0.1, 0.2, ...] --> Quantize

Why this works:
- Random rotation spreads values uniformly
- No more extreme outliers
- 3-bit quantization becomes viable

The math relies on a beautiful property: random rotations make any vector “look the same” from a distribution perspective. This means aggressive quantization doesn’t distort the relative relationships between vectors.
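To see the outlier effect in numbers, here is a minimal sketch of rotate-then-quantize. It is not PolarQuant itself: I am assuming a randomized Hadamard transform (random sign flips followed by a Walsh-Hadamard transform) as a cheap stand-in for a dense random rotation, and a plain symmetric 3-bit grid as a stand-in for the real quantizer. All function names are mine.

```python
import math
import random

def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return [x / math.sqrt(n) for x in out]

def rotate(vec, signs):
    """Random sign flips, then Hadamard: an orthogonal, data-independent map."""
    return hadamard([s * x for s, x in zip(signs, vec)])

def unrotate(vec, signs):
    """Inverse of rotate(): the orthonormal Hadamard matrix is its own inverse."""
    return [s * x for s, x in zip(signs, hadamard(vec))]

def quantize(vec, bits=3):
    """Round to a symmetric uniform grid scaled to the largest magnitude."""
    levels = 2 ** (bits - 1) - 1  # 3 steps on each side of zero for 3 bits
    step = max(abs(x) for x in vec) / levels
    return [round(x / step) * step for x in vec]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
dim = 128
vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
vec[7] = 20.0  # a single outlier channel
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

direct_err = mse(vec, quantize(vec))
rotated_err = mse(vec, unrotate(quantize(rotate(vec, signs)), signs))
print(f"direct: {direct_err:.3f}  rotated: {rotated_err:.3f}")
```

Without the rotation, the outlier sets the grid spacing and almost every other value rounds to zero. After the rotation, the outlier's energy is shared across all 128 coordinates, so the same 3-bit grid is much finer relative to the data. The rotation depends only on the seed, never on the vectors being compressed.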
2. QJL: Quantized Johnson-Lindenstrauss
The second technique preserves distance relationships during compression:
Johnson-Lindenstrauss Lemma:
"Any set of n points in d dimensions can be embedded into
k = O(log n / ε²) dimensions while preserving all pairwise
distances within a factor of 1 ± ε."
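The lemma is easy to check numerically. The sketch below does two things under my own assumptions: it projects two correlated vectors with a random Gaussian matrix and compares distances, then it keeps only the sign bit of each projected key coordinate and estimates the query-key inner product from those bits. The sign-bit estimator follows my understanding of the general QJL idea; how TurboQuant wires it into the cache is not something this example claims to reproduce, and the variable names are illustrative.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def project(matrix, vec):
    """Multiply a (k x d) matrix by a d-dimensional vector."""
    return [dot(row, vec) for row in matrix]

rng = random.Random(0)
d, k = 1024, 512
S = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

key = [rng.gauss(0.0, 1.0) for _ in range(d)]
query = [x + 0.5 * rng.gauss(0.0, 1.0) for x in key]  # correlated with key

# 1) Plain JL: scaling by 1/sqrt(k) keeps distances roughly unchanged.
proj_key, proj_query = project(S, key), project(S, query)
true_dist = norm([a - b for a, b in zip(key, query)])
jl_dist = norm([a - b for a, b in zip(proj_key, proj_query)]) / math.sqrt(k)

# 2) Sign-quantized keys: store one bit per projected coordinate plus the key norm.
key_bits = [1.0 if x >= 0 else -1.0 for x in proj_key]
key_norm = norm(key)
# For Gaussian s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / |k|, hence the scale.
est_inner = math.sqrt(math.pi / 2) * key_norm * dot(proj_query, key_bits) / k
true_inner = dot(query, key)

print(f"distance ratio: {jl_dist / true_dist:.3f}")
print(f"inner product: true {true_inner:.1f}, 1-bit estimate {est_inner:.1f}")
```

The query stays in full precision here and only the key is quantized, which is why the estimate can stay close even though each key coordinate is reduced to a single bit.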
QJL applies this to KV compression:
- Projects high-dim KV vectors to lower dimensions
- Quantizes during projection (no extra step)
- Preserves attention computation accuracy

Here’s the architecture:

Original KV Vectors (FP16)
        |
        v
[PolarQuant] ---> Random Rotation ---> Quantize to 3-bit
        |
        v
[QJL] ----------> Dimension Reduction + Quantization
        |
        v
Compressed KV Cache (3-bit)
        |
        v
De-quantize on-the-fly during attention

Why Training-Free Matters
Traditional quantization requires a calibration dataset:
# Traditional approach - requires calibration
from datasets import load_dataset
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llama-3.1-70b")

# Need representative data for calibration
calibration_data = load_dataset("calibration-samples")

# Time-consuming calibration process
# (quantize() is a placeholder for a calibration-based method such as GPTQ or AWQ)
quantized_model = quantize(
    model,
    calibration_data=calibration_data,  # <-- This is the bottleneck
    bits=4,
)

# Takes hours to days depending on model size

TurboQuant skips this entirely:
# TurboQuant - no calibration needed
from turboquant import TurboQuant

model = load_model("llama-3.1-70b")

# Apply compression immediately
quantizer = TurboQuant(bits=3)
compressed = quantizer.compress(model.kv_cache)

# No fine-tuning, no calibration data
# Deploy in minutes, not days

This is possible because PolarQuant’s random rotation doesn’t depend on data distribution. Any model, any data - the same rotation works.
What I Got Wrong Initially
When I first read about TurboQuant, I assumed:
- “6x memory reduction” means I can run 70B on 16GB VRAM
  Wrong. TurboQuant only compresses KV cache, not model weights. You still need to fit the model itself.

  correction.txt
  Myth: 16GB VRAM can run 70B model with TurboQuant
  Reality:
  - Model weights (unquantized): 140 GB - still need this
  - KV cache (with TurboQuant): ~53 GB instead of 320 GB
  TurboQuant helps with:
  - Long context (128K+ tokens)
  - Multiple concurrent users
  TurboQuant does not help with:
  - Model weight memory

- “Zero accuracy loss” means identical outputs
  Not quite. It means the model performs the same on benchmarks, but individual token probabilities may differ slightly.

- “8x speedup” applies to all GPUs
  The 8x speedup is specific to H100’s optimized 3-bit operations. On older GPUs, the speedup is smaller.
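The first misconception is worth checking with the numbers from the memory breakdown at the top of this post. This is a back-of-the-envelope sketch: `deployment_budget` is a hypothetical helper of mine, it assumes 80 GB per GPU, and it ignores any overhead from splitting a model across devices.

```python
import math

def deployment_budget(weights_gb, kv_cache_gb, activations_gb, gpu_gb=80):
    """Return (total GB, minimum GPU count), ignoring parallelism overhead."""
    total = weights_gb + kv_cache_gb + activations_gb
    return total, math.ceil(total / gpu_gb)

fp16_cache = deployment_budget(weights_gb=140, kv_cache_gb=320, activations_gb=20)
turboquant = deployment_budget(weights_gb=140, kv_cache_gb=53, activations_gb=20)

print(fp16_cache)  # (480, 6)
print(turboquant)  # (213, 3)
```

Compressing the cache halves the GPU count in this scenario, but the 140 GB of weights sets a floor that no amount of KV cache compression can lower.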
Performance Benchmarks
Google’s benchmarks on H100 GPUs:
Model: Mistral-7B
Context: 128K tokens

| Metric                | FP16 Baseline | TurboQuant (3-bit)       |
|-----------------------|---------------|--------------------------|
| KV Cache Memory       | 32 GB         | 5.3 GB (6x less)         |
| Attention Latency     | 120 ms        | 15 ms (8x faster)        |
| Perplexity (wikitext) | 12.4          | 12.5 (same within error) |
| Needle in Haystack    | 100% recall   | 100% recall              |

Model: Gemma-2-27B
Context: 128K tokens

| Metric             | FP16 Baseline | TurboQuant (3-bit) |
|--------------------|---------------|--------------------|
| KV Cache Memory    | 96 GB         | 16 GB (6x less)    |
| Throughput (tok/s) | 45            | 180 (4x faster)    |

The “Needle in a Haystack” test is particularly impressive - 100% recall means the model still finds information buried deep in long contexts.
When to Use TurboQuant
TurboQuant shines in these scenarios:
- Long context inference - 128K+ tokens
- Multi-user serving - More concurrent requests per GPU
- Real-time applications - The 8x speedup matters for latency-sensitive apps
When NOT to use it:
- Short contexts - Under 4K tokens, KV cache is small anyway
- Older hardware - Needs H100 or newer for full benefits
- Model weight compression - Use GPTQ/AWQ for that
Related Knowledge
The Pied Piper Connection
Multiple Reddit users compared TurboQuant to the fictional compression algorithm from HBO’s Silicon Valley:
“This is literally Pied Piper from Silicon Valley. Lossless compression that nobody thought was possible.”
The comparison is apt - both achieve “impossible” compression ratios. The difference is TurboQuant is real.
Connection to Vector Search
TurboQuant isn’t just for LLMs. The same techniques apply to vector databases:
Traditional Vector Search:
- Store embeddings: FP32 (32 bits each)
- 1M vectors x 1536 dimensions = ~6.1 GB

With QJL:
- Store compressed embeddings: 3 bits each
- 1M vectors x 1536 dimensions = ~0.58 GB
- Search accuracy preserved via JL guarantees

Google mentioned this technology powers their search infrastructure as well.
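Since no hardware addresses 3-bit values directly, storing "3 bits each" means packing codes into bytes. Here is a minimal sketch of that bookkeeping; `pack_3bit` and `unpack_3bit` are my own illustrative helpers rather than any library's API, and the codes are dummy values standing in for real quantizer output.

```python
def pack_3bit(codes):
    """Pack integers in [0, 7] into bytes, 3 bits per code, little-endian."""
    acc = 0
    for i, code in enumerate(codes):
        if not 0 <= code <= 7:
            raise ValueError(f"code {code} does not fit in 3 bits")
        acc |= code << (3 * i)
    return acc.to_bytes((3 * len(codes) + 7) // 8, "little")

def unpack_3bit(data, count):
    """Inverse of pack_3bit(); count is the number of codes stored."""
    acc = int.from_bytes(data, "little")
    return [(acc >> (3 * i)) & 0b111 for i in range(count)]

dim = 1536
codes = [i % 8 for i in range(dim)]  # stand-in for real quantizer output
packed = pack_3bit(codes)

bytes_fp32 = dim * 4        # 6144 bytes per FP32 vector
bytes_3bit = len(packed)    # 576 bytes per packed vector
print(bytes_fp32 * 1_000_000 / 1e9, "GB vs", bytes_3bit * 1_000_000 / 1e9, "GB")
```

A 1536-dimensional vector drops from 6144 bytes to 576 bytes, which is where the per-million-vector totals come from. A real index would also store a little metadata per vector, such as a norm or scale, which this sketch leaves out.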
Why 3 Bits?
The choice of 3 bits isn’t arbitrary:
1-bit: Too aggressive, significant accuracy loss
2-bit: Marginal improvement over 1-bit
3-bit: Sweet spot - minimal loss, max compression
4-bit: Good accuracy, but 33% more memory than 3-bit

3-bit quantization range:
- Values from -3 to +3 (7 levels, using 7 of the 8 codes that 3 bits provide so the grid stays symmetric around zero)
- Sufficient for normalized, rotated vectors

Summary
In this post, I explained Google’s TurboQuant algorithm for KV cache compression. The key innovations are:
- PolarQuant - Random rotation before quantization makes 3-bit viable
- QJL - Mathematical guarantee that distances are preserved
- Training-free - Works on any model without calibration
The result: 6x memory reduction, 8x speedup on H100, zero accuracy loss.
This is a real breakthrough for long-context AI applications. Million-token contexts are now commercially viable.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Google Research TurboQuant Announcement
- 👨💻 ICLR 2026 - QJL Paper
- 👨💻 AISTATS 2026 - PolarQuant Paper
- 👨💻 Reddit Discussion on TurboQuant
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!