What is Google TurboQuant? KV Cache Compression Explained

Problem

When I deployed a Llama-3.1-70B model with a 128K token context window, my H100 GPU ran out of memory within minutes. The culprit? KV cache explosion.

Here’s what happened:

memory-breakdown.txt
Model Weights (FP16): ~140 GB
KV Cache (128K tokens): ~320 GB <-- This killed me
Activation Memory: ~20 GB
-----------------------------------
Total Required: ~480 GB
Available: 1x H100 = 80 GB
Result: OOM Error

The KV cache consumed more than twice as much memory as the model weights themselves (~320 GB vs ~140 GB). I thought quantizing the model weights would help, but the weights account for only about 30% of the total. The real bottleneck was the KV cache.

What is KV Cache?

Before diving into the solution, I needed to understand why KV cache explodes:

kv-cache-formula.txt
KV Cache Size = Layers x 2 (K and V) x Sequence Length x Heads x Head Dimension x Bytes per Value
For Llama-3.1-70B:
- 80 layers
- 128K max context length (131,072 tokens)
- 64 attention heads
- 128 dimensions per head
- 2 bytes per value (FP16)
Calculation:
80 x 2 x 131072 x 64 x 128 x 2 bytes = 320 GiB

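The KV cache grows linearly with context length. Double your context, double your memory. This is why million-token contexts are commercially impractical.

One caveat: the calculation counts all 64 attention heads. Llama-3.1-70B uses grouped-query attention with 8 key/value heads, so a serving stack that stores only those heads needs about one eighth of this figure (roughly 40 GB per 128K-token sequence). Treat ~320 GB as the worst case for a model of this shape with full multi-head attention.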
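To sanity-check the arithmetic, here is a minimal sketch of the formula as a Python function. The name `kv_cache_bytes` and its parameters are my own, not part of any library. `kv_heads` is the number of heads whose keys and values are actually stored, which equals the attention head count unless the model uses grouped-query attention.

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_value=2, batch=1):
    """Size of the K and V tensors for one model, in bytes."""
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value * batch

GIB = 2**30

# Figures used in this post: 80 layers, 128K (131072) tokens, 64 heads, dim 128, FP16
full_mha = kv_cache_bytes(layers=80, seq_len=128 * 1024, kv_heads=64, head_dim=128)
print(f"128K tokens: {full_mha / GIB:.0f} GiB")  # 320 GiB

# Doubling the context doubles the cache
doubled = kv_cache_bytes(layers=80, seq_len=256 * 1024, kv_heads=64, head_dim=128)
print(f"256K tokens: {doubled / GIB:.0f} GiB")  # 640 GiB
```

The `batch` argument is there because every concurrent sequence carries its own cache, which is why multi-user serving hits the same wall even at shorter contexts.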

The Solution: TurboQuant

On March 24, 2026, Google Research announced TurboQuant - a training-free compression algorithm that reduces KV cache memory by 6x with zero accuracy loss.

The key insight: you don’t need 16-bit precision to store KV cache values. TurboQuant compresses to just 3 bits per value.

compression-comparison.txt
Before TurboQuant:
- Precision: FP16 (16 bits per value)
- KV cache for 128K context: ~320 GB
After TurboQuant:
- Precision: 3 bits per value
- KV cache for 128K context: ~53 GB
Compression Ratio: 6x
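"3 bits per value" only saves memory if the codes are packed tightly, since no common array dtype is 3 bits wide. The snippet below is a small NumPy sketch of one way to do that: eight 3-bit codes fit in three bytes. The helper names `pack_3bit` and `unpack_3bit` are hypothetical, and a real kernel would do this on the GPU. Note that pure bit-width arithmetic gives 16 / 3, or about 5.3x; I'm taking the 6x headline figure as reported.

```python
import numpy as np

def pack_3bit(codes):
    """Pack a 1-D array of integer codes in [0, 8) into bytes, 3 bits per code."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size and codes.max() > 7:
        raise ValueError("codes must fit in 3 bits")
    # unpackbits yields 8 bits per byte, most significant first; keep the low 3
    bits = np.unpackbits(codes[:, None], axis=1)[:, -3:]
    return np.packbits(bits.ravel())

def unpack_3bit(packed, count):
    """Recover `count` 3-bit codes from a packed byte array."""
    bits = np.unpackbits(packed)[: count * 3].reshape(count, 3)
    return bits @ np.array([4, 2, 1], dtype=np.uint8)

codes = np.array([0, 7, 3, 5, 1, 6, 2, 4], dtype=np.uint8)
packed = pack_3bit(codes)
restored = unpack_3bit(packed, len(codes))

print(packed.nbytes, "bytes for", len(codes), "values")  # 3 bytes for 8 values
print(np.array_equal(codes, restored))  # True
```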

How TurboQuant Works

TurboQuant combines two mathematical techniques:

1. PolarQuant: Random Rotation Quantization

The first technique is deceptively simple. Before quantizing, apply a random rotation to the vectors:

polarquant-concept.txt
Traditional Quantization:
[3.7, -2.1, 0.8, ...] --> [4, -2, 1, ...] (High error on outliers)
PolarQuant:
[3.7, -2.1, 0.8, ...] --> Rotate --> [0.3, -0.1, 0.2, ...] --> Quantize
Why this works:
- Random rotation spreads each vector's energy evenly across its coordinates
- Extreme outliers become unlikely
- 3-bit quantization becomes viable

The math relies on a beautiful property: random rotations make any vector “look the same” from a distribution perspective. This means aggressive quantization doesn’t distort the relative relationships between vectors.
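Here is a minimal NumPy sketch of the rotate-then-quantize idea, not Google's implementation. I'm assuming a dense random orthogonal matrix and a plain symmetric uniform quantizer; the published method may well use a different rotation and codebook. All function names are my own.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the matrix is uniformly distributed over orthogonal matrices
    return q * np.sign(np.diag(r))

def quantize(x, bits=3):
    """Symmetric uniform quantizer: integer codes in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    peak = np.abs(x).max()
    scale = peak / qmax if peak > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

def peak_to_rms(x):
    return np.abs(x).max() / np.sqrt(np.mean(x ** 2))

dim = 128
rotation = random_rotation(dim)  # built from a seed only, never from the data

# A "spiky" vector: one large outlier, everything else near zero
x = np.full(dim, 0.01)
x[0] = 10.0
x_rot = rotation @ x
print(f"peak/RMS before: {peak_to_rms(x):.1f}, after: {peak_to_rms(x_rot):.1f}")

# Rotating both vectors leaves their dot product (the attention score) unchanged
y = np.linspace(-1.0, 1.0, dim)
print(np.isclose(x @ y, x_rot @ (rotation @ y)))

codes, scale = quantize(x_rot, bits=3)
x_hat = rotation.T @ dequantize(codes, scale)  # undo the rotation after dequantizing
```

The dot-product check is the part that matters for attention: if queries and keys are rotated by the same matrix, the scores are unchanged before quantization, so the rotation itself costs no accuracy.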

2. QJL: Quantized Johnson-Lindenstrauss

The second technique preserves distance relationships during compression:

qjl-concept.txt
Johnson-Lindenstrauss Lemma:
"Any set of n points in d dimensions can be embedded into
k = O(log n / eps^2) dimensions while preserving all pairwise
distances within a factor of 1 +/- eps."
QJL applies this to KV compression:
- Projects high-dim KV vectors to lower dimensions
- Quantizes during projection (no extra step)
- Preserves attention computation accuracy
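To make "quantizes during projection" concrete, here is a sketch of a sign-based Johnson-Lindenstrauss estimator as I understand the idea: project the key with a Gaussian matrix, keep only the sign of each coordinate plus the key's norm, and rescale by sqrt(pi/2) to get an unbiased estimate of the query-key dot product. The function names and the value of `m` are my own illustrative choices, not the reference implementation.

```python
import numpy as np

def qjl_encode(key, projection):
    """Keep the sign of each projected coordinate (1 bit each) plus the key's norm."""
    return np.sign(projection @ key).astype(np.int8), np.linalg.norm(key)

def qjl_inner_product(query, signs, key_norm, projection):
    """Estimate <query, key> from the key's sign bits and an unquantized query."""
    m = projection.shape[0]
    return np.sqrt(np.pi / 2) * key_norm * (projection @ query) @ signs / m

rng = np.random.default_rng(0)
dim, m = 64, 256  # 256 sign bits per key vs 64 x 16 = 1024 bits in FP16

projection = rng.standard_normal((m, dim))  # data-independent Gaussian matrix
key = rng.standard_normal(dim)
query = key + 0.5 * rng.standard_normal(dim)  # correlated with the key

signs, key_norm = qjl_encode(key, projection)
print("exact:   ", query @ key)
print("estimate:", qjl_inner_product(query, signs, key_norm, projection))
```

The estimate is noisy at small `m` and tightens as `m` grows, so the number of sign bits is the knob that trades memory for accuracy in this sketch.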

Here’s the architecture:

turboquant-pipeline.txt
Original KV Vectors (FP16)
|
v
[PolarQuant] ---> Random Rotation ---> Quantize to 3-bit
|
v
[QJL] ----------> Dimension Reduction + Quantization
|
v
Compressed KV Cache (3-bit)
|
v
De-quantize on-the-fly during attention
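The last step of the pipeline, de-quantizing on the fly, can be sketched as follows. This is a simplified stand-in rather than the real kernel: I'm assuming per-token scales, storing the 3-bit codes in int8 instead of packing them, and leaving out the rotation and QJL stages so the focus stays on where dequantization happens.

```python
import numpy as np

def quantize_rows(x, bits=3):
    """Per-token symmetric quantization: int8 codes plus one float scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def attention_quantized(query, k_codes, k_scale, v_codes, v_scale):
    """Single-query attention that dequantizes K and V only when they are read."""
    keys = k_codes * k_scale      # dequantize on the fly
    values = v_codes * v_scale
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
tokens, dim = 16, 64
keys = rng.standard_normal((tokens, dim))
values = rng.standard_normal((tokens, dim))
query = rng.standard_normal(dim)

k_codes, k_scale = quantize_rows(keys)
v_codes, v_scale = quantize_rows(values)
out = attention_quantized(query, k_codes, k_scale, v_codes, v_scale)
print(out.shape)  # (64,)
```

Only the compact codes and scales live in memory between decoding steps; the full-precision keys and values exist briefly inside the attention call.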

Why Training-Free Matters

Traditional quantization requires a calibration dataset:

traditional-quant.py
# Traditional approach - requires calibration
# (illustrative pseudocode: quantize() and the dataset name are placeholders)
from datasets import load_dataset
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")
# Need representative data for calibration
calibration_data = load_dataset("calibration-samples")
# Time-consuming calibration process
quantized_model = quantize(
    model,
    calibration_data=calibration_data,  # <-- This is the bottleneck
    bits=4,
)
# Takes hours to days depending on model size

TurboQuant skips this entirely:

turboquant-approach.py
# TurboQuant - no calibration needed
# (illustrative pseudocode: the turboquant module, load_model() and
# model.kv_cache are placeholder names)
from turboquant import TurboQuant
model = load_model("llama-3.1-70b")
# Apply compression immediately
quantizer = TurboQuant(bits=3)
compressed = quantizer.compress(model.kv_cache)
# No fine-tuning, no calibration data
# Deploy in minutes, not days

This is possible because PolarQuant’s random rotation doesn’t depend on data distribution. Any model, any data - the same rotation works.

What I Got Wrong Initially

When I first read about TurboQuant, I assumed:

  1. “6x memory reduction” means I can run 70B on 16GB VRAM

    Wrong. TurboQuant only compresses KV cache, not model weights. You still need to fit the model itself.

    correction.txt
    Myth: 16GB VRAM can run 70B model with TurboQuant
    Reality:
    - Model weights (unquantized): 140 GB - still need this
    - KV cache (with TurboQuant): ~53 GB instead of 320 GB
    TurboQuant helps with:
    - Long context (128K+ tokens)
    - Multiple concurrent users
    TurboQuant does not help with:
    - Model weight memory
  2. “Zero accuracy loss” means identical outputs

    Not quite. It means the model performs the same on benchmarks, but individual token probabilities may differ slightly.

  3. “8x speedup” applies to all GPUs

    The 8x figure was measured on H100 with kernels tuned for that hardware. On older GPUs, expect a smaller speedup.
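To make the first misconception concrete, here is a back-of-the-envelope check using the figures from the memory breakdown in this post. The helper `gpus_needed` is my own and counts raw memory only; it ignores tensor-parallel overhead, fragmentation, and anything else a real deployment adds.

```python
import math

def gpus_needed(weights_gb, kv_cache_gb, activations_gb, gpu_memory_gb=80):
    """Minimum GPU count by raw memory, ignoring parallelism overhead."""
    total = weights_gb + kv_cache_gb + activations_gb
    return math.ceil(total / gpu_memory_gb)

# FP16 weights, 128K context, 80 GB per H100
print(gpus_needed(140, 320, 20))  # 6 without KV cache compression
print(gpus_needed(140, 53, 20))   # 3 with the ~53 GB compressed cache
```

Compressing the cache halves the GPU count in this example, but the 140 GB of weights alone still rules out a single 80 GB card.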

Performance Benchmarks

Google’s benchmarks on H100 GPUs:

benchmark-results.txt
Model: Mistral-7B
Context: 128K tokens
| Metric                | FP16 Baseline | TurboQuant (3-bit)       |
|-----------------------|---------------|--------------------------|
| KV Cache Memory       | 32 GB         | 5.3 GB (6x less)         |
| Attention Latency     | 120 ms        | 15 ms (8x faster)        |
| Perplexity (wikitext) | 12.4          | 12.5 (same within error) |
| Needle in Haystack    | 100% recall   | 100% recall              |
Model: Gemma-2-27B
Context: 128K tokens
| Metric                | FP16 Baseline | TurboQuant (3-bit)       |
|-----------------------|---------------|--------------------------|
| KV Cache Memory       | 96 GB         | 16 GB (6x less)          |
| Throughput (tok/s)    | 45            | 180 (4x faster)          |

The “Needle in a Haystack” result is particularly impressive: 100% recall means the model still finds information buried deep in long contexts after compression.

When to Use TurboQuant

TurboQuant shines in these scenarios:

  1. Long context inference - 128K+ tokens
  2. Multi-user serving - More concurrent requests per GPU
  3. Real-time applications - The 8x speedup matters for latency-sensitive apps

When NOT to use it:

  1. Short contexts - Under 4K tokens, KV cache is small anyway
  2. Older hardware - Needs H100 or newer for full benefits
  3. Model weight compression - Use GPTQ/AWQ for that

The Pied Piper Connection

Multiple Reddit users compared TurboQuant to the fictional compression algorithm from HBO’s Silicon Valley:

“This is literally Pied Piper from Silicon Valley. Lossless compression that nobody thought was possible.”

The comparison is apt in spirit - both promise compression ratios that sound impossible. Two differences: TurboQuant is lossy (it preserves benchmark accuracy rather than every bit), and it is real.

TurboQuant isn’t just for LLMs. The same techniques apply to vector databases:

vector-search-application.txt
Traditional Vector Search:
- Store embeddings: FP32 (32 bits each)
- 1M vectors x 1536 dimensions = 6 GB
With QJL:
- Store compressed embeddings: 3 bits each
- 1M vectors x 1536 dimensions = ~0.58 GB
- Search accuracy preserved via JL guarantees
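The storage numbers are simple enough to check directly. This tiny helper, `index_size_gb`, is my own and counts only the raw vector payload in decimal gigabytes; a real index would add per-vector metadata such as norms or scales.

```python
def index_size_gb(num_vectors, dims, bits_per_value):
    """Raw storage for a dense vector index, in decimal gigabytes."""
    return num_vectors * dims * bits_per_value / 8 / 1e9

print(f"FP32:  {index_size_gb(1_000_000, 1536, 32):.2f} GB")  # 6.14 GB
print(f"3-bit: {index_size_gb(1_000_000, 1536, 3):.2f} GB")   # 0.58 GB
```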

Google mentioned this technology powers their search infrastructure as well.

Why 3 Bits?

The choice of 3 bits isn’t arbitrary:

bit-analysis.txt
1-bit: Too aggressive, significant accuracy loss
2-bit: Marginal improvement over 1-bit
3-bit: Sweet spot - minimal loss, max compression
4-bit: Good accuracy, but 33% more memory than 3-bit
3-bit quantization range:
- 3 bits give 8 codes; a symmetric integer grid from -3 to +3 uses 7 of them
- Sufficient for normalized, rotated vectors

Summary

In this post, I explained Google’s TurboQuant algorithm for KV cache compression. The key innovations are:

  1. PolarQuant - Random rotation before quantization makes 3-bit viable
  2. QJL - Mathematical guarantee that distances are preserved
  3. Training-free - Works on any model without calibration

The result: 6x memory reduction, 8x speedup on H100, zero accuracy loss.

This is a real breakthrough for long-context AI applications. Million-token contexts are now much closer to being commercially viable.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
