What is Google TurboQuant? KV Cache Compression Explained

Mar 27, 2026

Problem

When I deployed a Llama-3.1-70B model with a 128K token context window, my H100 GPU ran out of memory within minutes. The culprit? KV cache explosion.

Here’s what happened:

Model Weights (FP16):        ~140 GB
KV Cache (128K tokens):      ~320 GB  <-- This killed me
Activation Memory:           ~20 GB
-----------------------------------
Total Required:              ~480 GB

Available: 1x H100 = 80 GB
Result: OOM Error

The KV cache consumed more than twice as much memory as the model weights themselves (~320 GB vs. ~140 GB). I thought quantizing the model weights would help, but the weights account for only about 30% of the total. The real bottleneck was the KV cache.

What is KV Cache?

Before diving into the solution, I needed to understand why KV cache explodes:

KV Cache Size = Layers x 2 (K and V) x Sequence Length x Heads x Head Dimension x Bytes

For Llama-3.1-70B:
- 80 layers
- 128K max context length
- 64 attention heads
- 128 dimension per head
- 2 bytes (FP16)

Calculation:
80 x 2 x 131072 x 64 x 128 x 2 bytes = 320 GiB (128K = 131,072 tokens; ~344 GB in decimal units)

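The KV cache grows linearly with context length. Double your context, double your memory. This is why million-token contexts are so expensive to serve.

One caveat: the formula above counts a K/V pair for every attention head. Llama-3.1-70B uses grouped-query attention with 8 KV heads, so its real cache is 8x smaller (about 40 GiB at 128K tokens). Treat the 320 figure as the cost for a model of this shape without GQA.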
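To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function name `kv_cache_bytes` and its parameters are my own, not from any library. It treats 128K as 131,072 tokens and reports GiB. The `kv_heads` parameter is there because models with grouped-query attention store fewer K/V heads than query heads; Llama-3.1-70B's published config lists 8 KV heads.

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in bytes for one sequence (K and V tensors per layer)."""
    return layers * 2 * seq_len * kv_heads * head_dim * bytes_per_value

GIB = 2**30
seq_len = 128 * 1024  # 128K tokens

# The calculation from this post: one K/V pair per attention head (64 heads).
full_mha = kv_cache_bytes(layers=80, seq_len=seq_len, kv_heads=64, head_dim=128)
print(f"64 KV heads: {full_mha / GIB:.0f} GiB")

# Same model shape with 8 KV heads (grouped-query attention).
gqa = kv_cache_bytes(layers=80, seq_len=seq_len, kv_heads=8, head_dim=128)
print(f" 8 KV heads: {gqa / GIB:.0f} GiB")
```

Doubling `seq_len` doubles the result, which is the linear growth described above.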

The Solution: TurboQuant

On March 24, 2026, Google Research announced TurboQuant - a training-free compression algorithm that reduces KV cache memory by 6x with zero accuracy loss.

The key insight: you don’t need 16-bit precision to store KV cache values. TurboQuant compresses to just 3 bits per value.

Before TurboQuant:
- Precision: FP16 (16 bits per value)
- KV cache for 128K context: ~320 GB

After TurboQuant:
- Precision: 3 bits per value
- KV cache for 128K context: ~53 GB (320 GB / 6)

Compression Ratio: 6x (reported; the bit widths alone give 16 / 3 = ~5.3x)
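As a sanity check on these numbers, the sketch below computes what the bit widths alone imply. The helper `quantized_size` is a name I made up for illustration, and it ignores any per-block metadata (scales, norms) that a real format might store. Pure bit-width arithmetic gives 16 / 3, so the 6x figure is the reported ratio rather than something this formula reproduces.

```python
def quantized_size(fp16_size_gb, bits):
    """Size after storing each value in `bits` bits instead of 16."""
    return fp16_size_gb * bits / 16

fp16_gb = 320
for bits in (16, 8, 4, 3):
    size = quantized_size(fp16_gb, bits)
    print(f"{bits:>2} bits: {size:6.1f} GB  ({fp16_gb / size:.2f}x smaller)")
```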

How TurboQuant Works

TurboQuant combines two mathematical techniques:

1. PolarQuant: Random Rotation Quantization

The first technique is deceptively simple. Before quantizing, apply a random rotation to the vectors:

Traditional Quantization:
[3.7, -2.1, 0.8, ...] --> [4, -2, 1, ...]  (High error on outliers)

PolarQuant:
[3.7, -2.1, 0.8, ...] --> Rotate --> [0.3, -0.1, 0.2, ...] --> Quantize

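Why this works:
- A random rotation spreads each vector’s energy evenly across its coordinates
- Extreme outliers become very unlikely
- 3-bit quantization becomes viable

The math relies on a beautiful property: after a random rotation, the coordinates of any vector follow the same known distribution (up to the vector’s length), so one fixed quantizer works for every input. And because a rotation preserves lengths and inner products, it doesn’t distort the relationships between vectors. Only the quantization step itself adds error.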
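The rotate-then-quantize idea can be sketched in a few lines of NumPy. This is a toy illustration under my own assumptions, not Google's implementation: `random_rotation` and `quantize_uniform` are hypothetical helpers, the rotation is a dense random orthogonal matrix, and the quantizer is a plain uniform 3-bit grid (the published method may use a different codebook). It shows that the rotation keeps the vector's length while flattening a large outlier.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix so the matrix is uniformly distributed

def quantize_uniform(v, bits=3):
    """Round v onto 2**bits evenly spaced levels between its min and max."""
    levels = 2**bits
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    codes = np.round((v - lo) / step).astype(np.uint8)
    return codes, codes * step + lo  # integer codes, dequantized values

dim = 128
rng = np.random.default_rng(1)
x = rng.standard_normal(dim)
x[0] = 30.0  # one large outlier channel

R = random_rotation(dim)
y = R @ x  # rotated vector: same length, energy spread across coordinates

print("peak before:", np.abs(x).max(), "peak after:", np.abs(y).max())
print("norm before:", np.linalg.norm(x), "norm after:", np.linalg.norm(y))

codes, y_hat = quantize_uniform(y, bits=3)
x_hat = R.T @ y_hat  # undo the rotation after dequantizing
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The rotation depends only on the seed, never on the data, which is what makes the approach calibration-free.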

2. QJL: Quantized Johnson-Lindenstrauss

The second technique preserves distance relationships during compression:

Johnson-Lindenstrauss Lemma:
"Any set of n points in d dimensions can be embedded into
k dimensions (k = O(log n / eps^2)) while preserving all pairwise
distances within a factor of (1 +/- eps)."

QJL applies this to KV compression:
- Projects high-dim KV vectors to lower dimensions
- Quantizes during projection (no extra step)
- Preserves attention computation accuracy
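To illustrate the "quantize during projection" step, here is a sketch of a sign-bit random-projection estimator, which is my reading of the idea behind QJL rather than the paper's actual code. The names `qjl_encode` and `qjl_inner_product` are hypothetical. Each key is stored as one sign bit per Gaussian projection plus its norm, the query stays unquantized, and the scaled dot product is an unbiased estimate of the true inner product. The number of projections `m` is set very large here only so the estimate is visibly tight; a real system would pick it based on the memory budget.

```python
import numpy as np

def qjl_encode(k, S):
    """Store one sign bit per projection, plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, signs, k_norm, S):
    """Estimate <q, k> from the 1-bit sketch of k and the unquantized query q."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * k_norm / m * np.dot(S @ q, signs)

rng = np.random.default_rng(0)
d, m = 64, 20_000  # m is large here only to make the estimate tight
S = rng.standard_normal((m, d))  # Gaussian JL projection, shared by q and k

q = rng.standard_normal(d)
k = rng.standard_normal(d)

signs, k_norm = qjl_encode(k, S)
estimate = qjl_inner_product(q, signs, k_norm, S)
print("exact:", np.dot(q, k), "estimate:", estimate)
```

The `sqrt(pi / 2)` factor comes from the expected absolute value of a standard Gaussian, which is `sqrt(2 / pi)`.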

Here’s the architecture:

Original KV Vectors (FP16)
        |
        v
[PolarQuant] ---> Random Rotation ---> Quantize to 3-bit
        |
        v
[QJL] ----------> Dimension Reduction + Quantization
        |
        v
Compressed KV Cache (3-bit)
        |
        v
De-quantize on-the-fly during attention

Why Training-Free Matters

Traditional quantization requires a calibration dataset:

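# Traditional approach - requires calibration
# (shown here with GPTQ via Hugging Face transformers; needs optimum and a GPTQ backend installed)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Need representative data for calibration
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",  # <-- Calibration data: this is the bottleneck
    tokenizer=tokenizer,
)

# Time-consuming calibration process, runs while the model loads
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# Takes hours to days depending on model size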

TurboQuant skips this entirely:

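# TurboQuant - no calibration needed
# NOTE: illustrative pseudocode. The `turboquant` package, `load_model`,
# and `model.kv_cache` are placeholder names, not a published API.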
from turboquant import TurboQuant

model = load_model("llama-3.1-70b")

# Apply compression immediately
quantizer = TurboQuant(bits=3)
compressed = quantizer.compress(model.kv_cache)

# No fine-tuning, no calibration data
# Deploy in minutes, not days

This is possible because PolarQuant’s random rotation doesn’t depend on data distribution. Any model, any data - the same rotation works.

What I Got Wrong Initially

When I first read about TurboQuant, I assumed:

“6x memory reduction” means I can run 70B on 16GB VRAM

Wrong. TurboQuant only compresses KV cache, not model weights. You still need to fit the model itself.

Myth: 16 GB of VRAM can run a 70B model with TurboQuant

Reality:
- Model weights (unquantized): 140 GB - still need this
- KV cache (with TurboQuant): ~53 GB instead of 320 GB

TurboQuant helps with:
- Long context (128K+ tokens)
- Multiple concurrent users

It does not help with:
- Model weight memory
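The memory budget behind this can be sketched with a few lines of arithmetic. The helper `max_sequences` is a name I made up, the sizes are the rough figures from this post, and the 640 GB case stands for a hypothetical 8x H100 node. It shows that TurboQuant changes how many long sequences fit once the weights are loaded, but it cannot help when the weights alone exceed the available memory.

```python
def max_sequences(total_vram_gb, weights_gb, kv_per_sequence_gb):
    """How many full-length sequences fit after loading the weights."""
    free = total_vram_gb - weights_gb
    if free <= 0:
        return 0  # the weights alone do not fit
    return int(free // kv_per_sequence_gb)

weights_gb = 140       # FP16 weights, from the breakdown earlier in this post
kv_fp16_gb = 320       # KV cache per 128K-token sequence
kv_turbo_gb = 320 / 6  # same cache at the reported 6x ratio

for vram in (16, 80, 640):  # 640 GB = a hypothetical 8x H100 node
    print(vram, "GB:", max_sequences(vram, weights_gb, kv_fp16_gb), "->",
          max_sequences(vram, weights_gb, kv_turbo_gb))
```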

“Zero accuracy loss” means identical outputs

Not quite. It means the model scores the same on benchmarks (within noise), but individual token probabilities may differ slightly, because 3-bit quantization is lossy.

“8x speedup” applies to all GPUs

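The 8x figure was measured on H100. The H100 has no native 3-bit arithmetic, so the gain most likely comes from reading far less KV data from memory during attention. On other GPUs, expect a different (on older cards, probably smaller) speedup.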

Performance Benchmarks

Google’s benchmarks on H100 GPUs:

Model: Mistral-7B
Context: 128K tokens

| Metric                | FP16 Baseline | TurboQuant (3-bit)       |
|-----------------------|---------------|--------------------------|
| KV Cache Memory       | 32 GB         | 5.3 GB (6x less)         |
| Attention Latency     | 120 ms        | 15 ms (8x faster)        |
| Perplexity (wikitext) | 12.4          | 12.5 (same within error) |
| Needle in Haystack    | 100% recall   | 100% recall              |

Model: Gemma-2-27B
Context: 128K tokens

| Metric              | FP16 Baseline | TurboQuant (3-bit) |
|---------------------|---------------|---------------------|
| KV Cache Memory     | 96 GB         | 16 GB (6x less)     |
| Throughput (tok/s)  | 45            | 180 (4x faster)     |

The “Needle in a Haystack” test is particularly impressive - 100% recall means the model still finds information buried deep in long contexts.

When to Use TurboQuant

TurboQuant shines in these scenarios:

- Long context inference - 128K+ tokens
- Multi-user serving - More concurrent requests per GPU
- Real-time applications - The 8x speedup matters for latency-sensitive apps

When NOT to use it:

- Short contexts - Under 4K tokens, KV cache is small anyway
- Older hardware - Needs H100 or newer for the full speedup
- Model weight compression - Use GPTQ/AWQ for that

The Pied Piper Connection

Multiple Reddit users compared TurboQuant to the fictional compression algorithm from HBO’s Silicon Valley:

“This is literally Pied Piper from Silicon Valley. Lossless compression that nobody thought was possible.”

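The comparison is loose: Pied Piper’s fictional algorithm is lossless, while TurboQuant is lossy quantization that keeps benchmark accuracy intact. What they share is a compression ratio that sounds “impossible”. The difference is TurboQuant is real.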

Connection to Vector Search

TurboQuant isn’t just for LLMs. The same techniques apply to vector databases:

Traditional Vector Search:
- Store embeddings: FP32 (32 bits each)
- 1M vectors x 1536 dimensions x 4 bytes = ~6 GB

With QJL:
- Store compressed embeddings: 3 bits each
- 1M vectors x 1536 dimensions x 3 bits = ~0.58 GB
- Search accuracy approximately preserved via JL guarantees

Google mentioned this technology powers their search infrastructure as well.
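The storage figures for the vector search example follow from simple bit counting, sketched below. The helper `embedding_store_bytes` is my own name, and it covers raw vector storage only, not index structures or per-vector metadata.

```python
def embedding_store_bytes(num_vectors, dim, bits_per_value):
    """Raw storage for an embedding table, ignoring index and metadata overhead."""
    return num_vectors * dim * bits_per_value / 8

fp32 = embedding_store_bytes(1_000_000, 1536, 32)
q3 = embedding_store_bytes(1_000_000, 1536, 3)
print(f"FP32:  {fp32 / 1e9:.2f} GB")
print(f"3-bit: {q3 / 1e9:.2f} GB")
```

Going from 32 bits to 3 bits per value is a 32 / 3, or roughly 10.7x, reduction in raw storage.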

Why 3 Bits?

The choice of 3 bits isn’t arbitrary:

1-bit: Too aggressive, significant accuracy loss
2-bit: Marginal improvement over 1-bit
3-bit: Sweet spot - minimal loss, max compression
4-bit: Good accuracy, but 33% more memory than 3-bit

3-bit quantization range:
- 2^3 = 8 levels (for example -4 to +3, or 7 levels if the grid is kept symmetric at -3 to +3)
- Sufficient for normalized, rotated vectors
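One practical detail of 3-bit storage is that codes do not align to byte boundaries, so they have to be packed. The sketch below shows one way to do it with NumPy; `pack_3bit` and `unpack_3bit` are hypothetical helpers of mine, and real kernels may well use a different layout. Eight 3-bit codes fit in exactly three bytes.

```python
import numpy as np

def pack_3bit(codes):
    """Pack integer codes in [0, 7] into bytes, 3 bits per code."""
    codes = np.asarray(codes, dtype=np.uint8)
    bits = np.unpackbits(codes[:, None], axis=1)[:, -3:]  # keep the low 3 bits
    return np.packbits(bits.ravel())

def unpack_3bit(packed, count):
    """Recover `count` codes from the packed byte array."""
    bits = np.unpackbits(packed)[: count * 3].reshape(count, 3)
    return bits @ np.array([4, 2, 1], dtype=np.uint8)

codes = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=np.uint8)  # all 2**3 levels
packed = pack_3bit(codes)
print(len(codes), "codes ->", packed.nbytes, "bytes")
```

Storing the same eight codes at 4 bits each would take four bytes, which is the 33% overhead mentioned above.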

Summary

In this post, I explained Google’s TurboQuant algorithm for KV cache compression. The key innovations are:

PolarQuant - Random rotation before quantization makes 3-bit viable
QJL - Mathematical guarantee that distances are approximately preserved
Training-free - Works on any model without calibration

The result: 6x memory reduction, 8x speedup on H100, zero accuracy loss.

This is a real breakthrough for long-context AI applications. Million-token contexts are now much closer to commercially viable.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Google Research TurboQuant Announcement
👨‍💻 ICLR 2026 - QJL Paper
👨‍💻 AISTATS 2026 - PolarQuant Paper
👨‍💻 Reddit Discussion on TurboQuant

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!