Skip to content

How DeepSeek V4 Flash Achieves 98% Prompt Cache Hit Rate for Coding Agents

Problem

LLM API costs scale linearly with input token volume. For coding agents making thousands of calls per session, the repeated system prompt and tool definitions create massive token waste. Without caching, every call bills the same prefix over and over.

I ran a coding agent for 3 weeks and noticed the input token count was growing fast. The 330-line system prompt plus tool schemas were being sent on every single call. Without caching, the input bill would have been astronomical.

The Numbers

Here’s what the raw data looked like after 13,978 calls:

Token usage breakdown
Input tokens billed: 29,179,437 (fresh)
Cache tokens served: 1,668,125,568 (cached)
Cache hit rate: 98.2%
Total calls: 13,978
System prompt length: 330 lines

Only 29.2 million tokens were billed as fresh input. Over 1.67 billion tokens were served from cache. That’s a roughly 50x reduction in effective input token cost.

How Prefix Caching Works

DeepSeek V4 Flash implements automatic prefix caching. When you send a prompt, DeepSeek checks if the beginning of your prompt matches a recently cached prefix. If it does, the cached prefix is reused and you’re billed at the lower cache token rate instead of the fresh input rate.

Prompt structure for coding agents
┌──────────────────────────────────────────────────┐
│ Prompt Structure │
├────────────────────┬─────────────────────────────┤
│ Stable Prefix │ Variable Suffix │
│ (Cached - 98% hit) │ (Fresh - 2% miss) │
├────────────────────┼─────────────────────────────┤
│ System Prompt │ User Instructions │
│ (330 lines) │ (varies per call) │
│ │ │
│ Tool Definitions │ Current task context │
│ (consistent) │ │
│ │ │
│ Role Definition │ File content being edited │
└────────────────────┴─────────────────────────────┘

The 98% hit rate comes from the fact that agentic workflows have naturally stable prompt prefixes. The system prompt and tool definitions rarely change during a session. Only the user instruction portion — the specific task the agent needs to perform next — changes between calls.

Why 98% Instead of 100%

Even with a stable prefix, not every call hits the cache. Here’s why:

  • Cache eviction: DeepSeek’s cache has a limited window. If there’s a long gap between calls, the cached prefix may be evicted
  • Prefix changes: If you update the system prompt mid-session, the cache key changes and the old entry is invalidated
  • Provider differences: MiniMax, for comparison, achieved only ~70% cache hit rate with similar workload characteristics. The difference likely comes from different cache key granularity and eviction policies

How This Compares to Other Providers

I tested the same workload on MiniMax alongside DeepSeek. MiniMax’s cache hit rate was roughly 70% — noticeably lower. This meant MiniMax billed more fresh input tokens per call, contributing to its higher cost ($52.87 vs $10.37).

Cache hit rate comparison
Provider Cache Hit Rate Fresh Input/Call Cost Impact
DeepSeek V4 Flash 98.2% ~2,100 tokens Low
MiniMax (M2.5/M2.7) ~70% ~57,000 tokens High

At 98% cache hit, each call bills roughly 2,100 fresh tokens. At 70%, each call bills roughly 57,000 fresh tokens. That difference alone explains a large part of the cost gap.

Designing for Cache Efficiency

If you’re building an agent today, here’s what I learned about maximizing cache reuse:

  1. Keep the system prompt stable — don’t inject dynamic content into the system prompt. Move dynamic instructions to the user message portion
  2. Batch related calls — consecutive calls with the same prefix maximize cache hits. Spreading calls across long time gaps risks eviction
  3. Tool definitions should be static — define all tools upfront and reference them by name in the system prompt rather than redefining them per call
  4. Avoid mid-session prompt rewrites — if you need to change behavior, add instructions to the user message rather than modifying the system prompt

Summary

In this post, I explained how DeepSeek V4 Flash achieves 98% prompt cache hit rates in coding agent workloads. The key takeaway is that automatic prefix caching combined with naturally stable agent prompt structures reduces effective input token costs by roughly 50x. When building agentic systems, prioritize providers with automatic caching and design your prompts for maximum prefix stability.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments