
How DeepSeek V4 Handles 1M Token Context Without Performance Degradation

When I first heard about million-token context in LLMs, I thought: “That sounds great on paper, but how do you actually run it?” Standard attention is O(n^2) - each token attends to every other token. A million tokens means a trillion attention operations. The KV cache alone would eat your entire GPU memory.
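To make the "trillion operations" figure concrete, here is a minimal sketch that counts query-key score computations for a single head in a single layer. The function name `attention_pairs` is made up for illustration, and the count ignores everything except the score matrix itself.

```python
def attention_pairs(n_tokens: int, causal: bool = False) -> int:
    """Query-key score computations for one attention head in one layer."""
    if causal:
        # Each token attends to itself and to every earlier token.
        return n_tokens * (n_tokens + 1) // 2
    # Every token attends to every token.
    return n_tokens * n_tokens


full = attention_pairs(1_000_000)
causal = attention_pairs(1_000_000, causal=True)
print(f"full: {full:.2e} pairs, causal: {causal:.2e} pairs")
```

Even with causal masking the count only halves, so the growth stays quadratic in the context length.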

DeepSeek V4 solves this with something called Hybrid Attention. Let me break down how it works.

The Problem with Long Context

Before diving into the solution, let's look at the problem.

Standard Transformer attention stores Key-Value (KV) pairs for every token in the context. When generating the next token, the model needs to attend to all previous tokens.

Standard attention memory growth
Context Length    KV Cache Size (approx.)
128K tokens       ~11 GiB
256K tokens       ~22 GiB
512K tokens       ~44 GiB
1M tokens         ~88 GiB

A single H200 has 141 GB of memory. Running 1M context means the KV cache alone takes more than half your VRAM. That’s before accounting for model weights, activations, and intermediate computations.
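The linear growth in the table can be reproduced with a few lines of arithmetic. This is a rough sketch, not a measurement: the per-token size of 88 KiB is an assumption back-derived from the table's own 128K row (11 GiB / 128K tokens), and "1M" is taken as 2^20 tokens.

```python
GIB = 2**30   # bytes in one GiB
GB = 10**9    # bytes in one GB (the unit used for the H200's 141 GB)

# Assumption: 88 KiB of KV cache per token, back-derived from the table above.
BYTES_PER_TOKEN = 88 * 1024


def kv_cache_gib(n_tokens: int, bytes_per_token: int = BYTES_PER_TOKEN) -> float:
    """KV cache size in GiB; grows linearly with context length."""
    return n_tokens * bytes_per_token / GIB


def vram_fraction(cache_gib: float, vram_gb: float = 141) -> float:
    """Share of a card's memory (given in GB) taken by a cache given in GiB."""
    return cache_gib * GIB / (vram_gb * GB)


for k_tokens in (128, 256, 512, 1024):
    size = kv_cache_gib(k_tokens * 1024)
    print(f"{k_tokens}K tokens: {size:.0f} GiB, {vram_fraction(size):.0%} of an H200")
```

Note the unit mismatch the sketch handles explicitly: the cache is quoted in GiB while the card's capacity is quoted in GB.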

For DeepSeek V3.2 (61 layers) specifically, the KV cache at 1M tokens was estimated at 83.9 GiB, close to the rough figure in the table.

[Figure: DeepSeek V4 Text Arena Ranking]

Two New Attention Mechanisms

DeepSeek V4 introduces two attention types that work together: CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention).

CSA: Compressed Sparse Attention

CSA is the primary mechanism. It does two things:

  1. Compression: Every m tokens get compressed into ONE KV entry. Think of it like summarizing - instead of remembering every word of a paragraph, you remember the paragraph’s key point.

  2. Sparse Selection: Each query token only attends to the top-k most relevant compressed entries. This is called DeepSeek Sparse Attention (DSA).

CSA compression concept
Original tokens: [T1, T2, T3, T4, T5, T6, T7, T8, ...]
Compressed: [C1, C2, C3, ...]
(m=4 tokens per compressed entry)
Query token picks top-k relevant C entries to attend

The compression ratio depends on m. If m=4, you get 4x compression. If m=128, you get 128x compression.
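The two CSA steps can be sketched in plain Python. This is only an illustration under loud assumptions: mean pooling stands in for whatever learned compressor the model actually uses, a dot product stands in for the relevance score, and the names `compress_blocks` and `top_k_blocks` are made up. Tokens that do not fill a whole block are simply left out here.

```python
def compress_blocks(vectors, m):
    """Collapse each run of m consecutive vectors into one entry.

    Assumption: mean pooling is a placeholder for the learned compression.
    """
    usable = len(vectors) - len(vectors) % m  # drop a trailing partial block
    compressed = []
    for start in range(0, usable, m):
        block = vectors[start:start + m]
        # zip(*block) iterates over dimensions, so this averages per dimension.
        compressed.append([sum(column) / m for column in zip(*block)])
    return compressed


def top_k_blocks(query, compressed, k):
    """Indices of the k compressed entries with the highest dot-product score."""
    scores = [sum(q * c for q, c in zip(query, entry)) for entry in compressed]
    ranked = sorted(range(len(compressed)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Eight toy 2-d token vectors: the first four point along x, the last four along y.
tokens = [[1, 0]] * 4 + [[0, 1]] * 4
compressed = compress_blocks(tokens, m=4)
selected = top_k_blocks([0.0, 1.0], compressed, k=1)
print(compressed, selected)
```

With m=4 the eight tokens become two entries, and a query pointing along y selects only the second block, which is the "attend to the relevant summaries" behaviour described above.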

HCA: Heavily Compressed Attention

HCA goes further. It uses an even larger compression ratio (m' >> m). Think of it as the “executive summary” layer: the entries are heavily compressed, but attention over them is dense, covering every compressed entry rather than a top-k subset.

HCA vs CSA comparison
CSA: m tokens → 1 KV entry (e.g., m=4, 4x compression)
     + Sparse attention (top-k selection)
HCA: m' tokens → 1 KV entry (e.g., m'=128, 128x compression)
     + Dense attention over all compressed entries

HCA catches the “big picture” context - the main themes and ideas across the entire document. CSA catches the “important details” - specific passages that matter for the current query.
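The "dense" half of HCA is ordinary softmax attention, just over far fewer entries. Below is a minimal single-query sketch; the function name `dense_attention` is hypothetical, and the keys and values are assumed to be already-compressed entries (how they are produced is not shown).

```python
import math


def dense_attention(query, keys, values):
    """Softmax attention of one query over every compressed entry (no top-k)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # Weighted sum of the value vectors, one output per dimension.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Two compressed entries; a zero query weights them equally.
output = dense_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
print(output)
```

Because every entry contributes to the output, nothing is filtered out, which is affordable only because the compression ratio has already shrunk the number of entries.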

How They Work Together

V4 interleaves CSA and HCA across Transformer layers. Some layers use CSA, others use HCA. This creates a hierarchy:

Hybrid Attention hierarchy
Layer 1: CSA - detailed, sparse attention
Layer 2: HCA - overview, dense attention
Layer 3: CSA - detailed, sparse attention
Layer 4: HCA - overview, dense attention
...

This design captures both:

  • Local details (via CSA’s sparse selection of relevant compressed blocks)
  • Global context (via HCA’s overview of the entire document)
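The interleaving pattern from the illustration can be written as a tiny schedule function. The strict one-to-one alternation below mirrors the diagram in this article; the real layer mix in V4 may differ, and `layer_schedule` is a made-up name.

```python
def layer_schedule(n_layers: int) -> list[str]:
    """Alternate CSA and HCA, starting with CSA at layer 1 (assumed pattern)."""
    return ["CSA" if index % 2 == 0 else "HCA" for index in range(n_layers)]


for number, kind in enumerate(layer_schedule(4), start=1):
    print(f"Layer {number}: {kind}")
```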

The Numbers: 10% KV Cache, 27% FLOPs

Here’s what matters for inference:

V4 efficiency at 1M context
Metric                       V3.2        V4             Savings
KV Cache                     83.9 GiB    9.62 GiB       ~90% reduction
Per-token inference FLOPs    baseline    27% of V3.2    ~73% reduction
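The savings column is simple arithmetic on the two reported figures, which is worth checking once. A short sketch (the helper name `reduction` is illustrative):

```python
def reduction(baseline: float, new: float) -> float:
    """Fraction saved when going from `baseline` to `new`."""
    return 1 - new / baseline


kv_saving = reduction(83.9, 9.62)   # KV cache, GiB at 1M context
kv_ratio = 83.9 / 9.62              # how many times smaller the cache is
flops_saving = reduction(1.0, 0.27) # per-token FLOPs relative to V3.2

print(f"KV cache: {kv_saving:.1%} smaller ({kv_ratio:.2f}x)")
print(f"FLOPs: {flops_saving:.0%} fewer")
```

The exact KV cache reduction comes out slightly under the rounded "~90%" headline figure, at a bit under 9x.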

How does vLLM achieve this? Three compression strategies combined:

  1. Shared Key-Value vectors: 2x savings
  2. c4a compression: 4x savings
  3. c128a compression: 128x savings

Plus a 128-token local sliding window that preserves nearby context without compression.

vLLM compression breakdown
Local window (128 tokens): preserved uncompressed
c4a blocks: 4x compression
c128a blocks: 128x compression
Shared KV vectors: reduces redundancy across heads
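To get a feel for how much each compression level shrinks the cache, here is a rough entry count per layer. The accounting is an assumption made for illustration (an uncompressed 128-token window plus one entry per `ratio` tokens), and the sketch does not try to reproduce the 9.62 GiB figure, which depends on per-entry size, layer mix, and precision that are not given here.

```python
def cached_entries(n_tokens: int, ratio: int, window: int = 128) -> int:
    """Entries one layer keeps under the assumed accounting:
    an uncompressed local window plus one compressed entry per `ratio` tokens."""
    return min(n_tokens, window) + n_tokens // ratio


n = 1_048_576  # 1M tokens, taken here as 2**20
for name, ratio in (("c4a", 4), ("c128a", 128)):
    entries = cached_entries(n, ratio)
    print(f"{name}: {entries:,} entries, {entries / n:.2%} of uncompressed")
```

At this length the local window is negligible next to the compressed entries, so the compression ratio dominates the cache size.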

A Simple Analogy

Imagine you’re reading a 1000-page novel:

Standard attention: You memorize every sentence. When asked about chapter 5, you scan through all 1000 pages to find relevant sentences.

CSA: You summarize each page into one key point. When asked about chapter 5, you find the most relevant page summaries and read those pages more carefully.

HCA: You summarize each chapter into one main idea. When asked about chapter 5, you first recall what chapter 5 was about (global context), then dive into specific page summaries.

V4’s Hybrid: You have both chapter summaries (HCA) and page summaries (CSA). Some “memory slots” hold page-level details, others hold chapter-level overviews.

What This Means Practically

With 9.62 GiB KV cache, you can actually run 1M context on a single H200 (141 GB VRAM):

Memory allocation on H200
Model weights (FP4+FP8): ~40-50 GiB
KV cache at 1M: ~9.62 GiB
Activations + buffers: ~20-30 GiB
Total: ~70-90 GiB, fits in 141 GB

V3.2's 83.9 GiB KV cache at 1M would consume most of a single card before weights and activations are even counted. V4's cache leaves room for everything else on one card.
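The memory budget can be checked with a short calculation. The component ranges are the approximate figures from the allocation above; the only thing added is the GB to GiB conversion, since the card's capacity is quoted in GB while the components are in GiB.

```python
GIB = 2**30  # bytes in one GiB


def gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return gigabytes * 10**9 / GIB


# (low, high) estimates in GiB, taken from the allocation listed above.
budget = {
    "model weights": (40.0, 50.0),
    "kv cache at 1M": (9.62, 9.62),
    "activations + buffers": (20.0, 30.0),
}

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())
capacity = gb_to_gib(141)  # H200 capacity expressed in GiB

print(f"needed: {low:.1f}-{high:.1f} GiB, available: {capacity:.1f} GiB")
print("fits" if high <= capacity else "does not fit")
```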

[Figure: DeepSeek V4 Benchmark Comparison]

The Trade-off

Compression isn’t free. The question is: does the model lose retrieval accuracy?

DeepSeek’s benchmarks show:

  • MRCR 1M (long-context retrieval): 83.5 - still strong, though Opus 4.6 scores 92.9
  • CorpusQA 1M: 62.0 - functional but not top-tier (Opus 4.6 hits 71.7)

The compression achieves efficiency at some cost to precision. For most tasks (coding, reasoning, chat), the quality impact is minimal. For needle-in-a-haystack retrieval over million-token documents, frontier models like Opus still lead.

Why This Matters

Before V4, “1M context” was mostly marketing. Models could technically handle 1M tokens, but:

  • KV cache explosion made inference impractical
  • Latency increased dramatically
  • Cost per token skyrocketed

V4 changes this equation. 1M context becomes usable at production cost. You can:

  • Analyze entire codebases (real 1M token repositories)
  • Process full legal documents without chunking
  • Maintain multi-hour conversation history

The efficiency gain isn’t incremental - it’s roughly 9x on the key bottleneck (KV cache: 83.9 GiB down to 9.62 GiB).

Key Takeaways

  1. CSA compresses m tokens into one KV entry, then uses sparse attention to pick relevant blocks
  2. HCA compresses even more aggressively (m' >> m), providing a global context overview
  3. KV cache drops to 10% of V3.2 (9.62 GiB vs 83.9 GiB at 1M)
  4. Inference FLOPs drop to 27% of V3.2
  5. vLLM combines 128-token local window + c4a/c128a compression + shared KV vectors

This isn’t magic - it’s intelligent compression. DeepSeek figured out that you don’t need to remember every token with full precision. You need to remember the important parts well and the rest adequately.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
