How DeepSeek V4 Flash Achieves 98% Prompt Cache Hit Rate for Coding Agents
Problem
LLM API costs scale linearly with input token volume. For coding agents making thousands of calls per session, the repeated system prompt and tool definitions create massive token waste. Without caching, every call bills the same prefix over and over.
I ran a coding agent for 3 weeks and noticed the input token count was growing fast. The 330-line system prompt plus tool schemas were being sent on every single call. Without caching, the input bill would have been astronomical.
The Numbers
Here’s what the raw data looked like after 13,978 calls:
Input tokens billed: 29,179,437 (fresh)Cache tokens served: 1,668,125,568 (cached)Cache hit rate: 98.2%Total calls: 13,978System prompt length: 330 linesOnly 29.2 million tokens were billed as fresh input. Over 1.67 billion tokens were served from cache. That’s a roughly 50x reduction in effective input token cost.
How Prefix Caching Works
DeepSeek V4 Flash implements automatic prefix caching. When you send a prompt, DeepSeek checks if the beginning of your prompt matches a recently cached prefix. If it does, the cached prefix is reused and you’re billed at the lower cache token rate instead of the fresh input rate.
┌──────────────────────────────────────────────────┐│ Prompt Structure │├────────────────────┬─────────────────────────────┤│ Stable Prefix │ Variable Suffix ││ (Cached - 98% hit) │ (Fresh - 2% miss) │├────────────────────┼─────────────────────────────┤│ System Prompt │ User Instructions ││ (330 lines) │ (varies per call) ││ │ ││ Tool Definitions │ Current task context ││ (consistent) │ ││ │ ││ Role Definition │ File content being edited │└────────────────────┴─────────────────────────────┘The 98% hit rate comes from the fact that agentic workflows have naturally stable prompt prefixes. The system prompt and tool definitions rarely change during a session. Only the user instruction portion — the specific task the agent needs to perform next — changes between calls.
Why 98% Instead of 100%
Even with a stable prefix, not every call hits the cache. Here’s why:
- Cache eviction: DeepSeek’s cache has a limited window. If there’s a long gap between calls, the cached prefix may be evicted
- Prefix changes: If you update the system prompt mid-session, the cache key changes and the old entry is invalidated
- Provider differences: MiniMax, for comparison, achieved only ~70% cache hit rate with similar workload characteristics. The difference likely comes from different cache key granularity and eviction policies
How This Compares to Other Providers
I tested the same workload on MiniMax alongside DeepSeek. MiniMax’s cache hit rate was roughly 70% — noticeably lower. This meant MiniMax billed more fresh input tokens per call, contributing to its higher cost ($52.87 vs $10.37).
Provider Cache Hit Rate Fresh Input/Call Cost ImpactDeepSeek V4 Flash 98.2% ~2,100 tokens LowMiniMax (M2.5/M2.7) ~70% ~57,000 tokens HighAt 98% cache hit, each call bills roughly 2,100 fresh tokens. At 70%, each call bills roughly 57,000 fresh tokens. That difference alone explains a large part of the cost gap.
Designing for Cache Efficiency
If you’re building an agent today, here’s what I learned about maximizing cache reuse:
- Keep the system prompt stable — don’t inject dynamic content into the system prompt. Move dynamic instructions to the user message portion
- Batch related calls — consecutive calls with the same prefix maximize cache hits. Spreading calls across long time gaps risks eviction
- Tool definitions should be static — define all tools upfront and reference them by name in the system prompt rather than redefining them per call
- Avoid mid-session prompt rewrites — if you need to change behavior, add instructions to the user message rather than modifying the system prompt
Summary
In this post, I explained how DeepSeek V4 Flash achieves 98% prompt cache hit rates in coding agent workloads. The key takeaway is that automatic prefix caching combined with naturally stable agent prompt structures reduces effective input token costs by roughly 50x. When building agentic systems, prioritize providers with automatic caching and design your prompts for maximum prefix stability.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit Discussion: 98% prompt cache hit rate on DeepSeek V4 Flash
- 👨💻 DeepSeek API Documentation: Prefix Caching
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments