What Is Prompt Caching in Claude Code and Why Does It Matter for API Costs?
I was looking at my API bill and noticed something strange. My token usage was way higher than expected for similar tasks. Turns out, I wasn’t using prompt caching correctly.
Let me explain what I discovered about prompt caching in Claude Code and how it can save you significant money on API costs.
The Problem: Token Waste Without Caching
When I first started using Claude Code, I assumed caching always happened automatically. It doesn’t: at the API level, a prefix is only cached if the request marks it as cacheable. Here’s what I found:
    Request Type        Tokens Processed    Cost Impact
    -------------------------------------------------------
    Without caching     10,000 tokens       100% (baseline)
    With caching        3,000 tokens        30% of baseline
    Bad cache design    15,000 tokens       150% (worse!)

The difference is massive. But why does this happen?
How Prefix Caching Actually Works
Claude’s API uses something called “prefix caching.” The idea is simple but the implementation matters:
    API Request Structure:

    [System Prompt] [Tools] [Context]   [User Message]
    |----------- PREFIX ------------|   |- DYNAMIC --|
             CACHED (reused)          PROCESSED EACH TIME

The “prefix” is everything that stays the same between requests: your system prompt, tool definitions, and context. The “dynamic” part is the user’s message and conversation history.
When you make the same request multiple times with an identical prefix, Claude can skip re-processing that cached content. This saves both time and money.
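To make the prefix idea concrete, here is a minimal Python sketch. It is not how Claude's servers implement caching. It just models a request as an ordered list of blocks and counts how many leading blocks two requests share. The block strings and the `shared_prefix_len` helper are my own illustrative names.

```python
from itertools import takewhile


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading blocks that are identical in both requests."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


# Hypothetical requests, modeled as ordered blocks of text.
req1 = ["system: You are...", "tools: [tool1, tool2]", "context: Previous...", "user: Write code"]
req2 = ["system: You are...", "tools: [tool1, tool2]", "context: Previous...", "user: Fix the bug"]
# Same as req1, except the very first block was edited.
req3 = ["system: Help with...", "tools: [tool1, tool2]", "context: Previous...", "user: Write code"]

print(shared_prefix_len(req1, req2))  # 3: everything before the user message matches
print(shared_prefix_len(req1, req3))  # 0: a change at the top leaves nothing to reuse
```

Only the leading run of identical blocks counts, which is why a small change near the top of the prompt throws away everything after it.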
The Breakpoint Problem
Here’s where I made my mistake. I thought caching just… happened. But Claude needs specific markers called “breakpoints” to know where the cacheable prefix ends.
    REQUEST 1:                  REQUEST 2:
    ------------------------    ------------------------
    System: "You are..."        System: "You are..."
    Tools: [tool1, tool2]       Tools: [tool1, tool2]
    Context: "Previous..."      Context: "Previous..."
    ----------------------      ----------------------   <-- BREAKPOINT
    User: "Write code"          User: "Fix the bug"
    ------------------------    ------------------------
              |                           |
              v                           v
        PREFIX CACHED               PREFIX REUSED

Without proper breakpoints, the entire prompt gets re-processed every time.
The cache_control Marker
Claude’s API uses a cache_control marker to indicate breakpoints. Here’s how it looks in practice:
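    {
      "model": "claude-sonnet-4-20250514",
      "max_tokens": 1024,
      "system": [
        {
          "type": "text",
          "text": "You are a helpful coding assistant.",
          "cache_control": {"type": "ephemeral"}
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Help me debug this function."
        }
      ]
    }

The cache_control: {"type": "ephemeral"} marker tells Claude: “Cache everything up to this point.” One caveat: a prompt this short is below the API’s minimum cacheable length (1,024 tokens on Sonnet models at the time of writing), so in practice the marked prefix has to be much longer before caching kicks in.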
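If you assemble requests in code, a small helper keeps the marker in the same place every time. Here is a minimal Python sketch. It only builds the payload as a plain dict and does not send anything. `build_cached_request` is a hypothetical helper name, and the `max_tokens` default is an arbitrary value I picked for illustration.

```python
import json


def build_cached_request(system_text: str, user_text: str,
                         model: str = "claude-sonnet-4-20250514",
                         max_tokens: int = 1024) -> dict:
    """Build a Messages API payload with a breakpoint after the system prompt."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Breakpoint: everything up to and including this block is the prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The user message stays outside the cached prefix.
        "messages": [{"role": "user", "content": user_text}],
    }


payload = build_cached_request(
    "You are a helpful coding assistant.",
    "Help me debug this function.",
)
print(json.dumps(payload, indent=2))
```

Calling the helper again with a different `user_text` produces a request whose system block is byte-for-byte identical, which is what a cache hit depends on.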
Cache Lifecycle: It’s Ephemeral
I learned this the hard way. The cache isn’t permanent:
    Time              Cache State
    ---------------------------------------
    Request 1         Cache created (miss)
    Request 2         Cache hit (saved!)
    Request 3         Cache hit (saved!)
    ... 5 min pass ...
    Request 4         Cache expired (miss)

Ephemeral caches have a limited lifetime. After about 5 minutes of inactivity, the cache expires and you’re back to paying full price.
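The timeline above can be reproduced with a toy simulation. This is only a sketch of the behavior as I understand it, not Anthropic's implementation. `EphemeralCacheSim` is a made-up class, and it assumes a 300-second lifetime that restarts every time the prefix is written or read.

```python
from dataclasses import dataclass, field

TTL_SECONDS = 300  # assumed: roughly 5 minutes of inactivity


@dataclass
class EphemeralCacheSim:
    """Toy model of an ephemeral prefix cache with a sliding expiry."""
    ttl: float = TTL_SECONDS
    _expires_at: dict = field(default_factory=dict)

    def request(self, prefix: str, now: float) -> str:
        expires = self._expires_at.get(prefix)
        hit = expires is not None and now < expires
        # Both a fresh write and a hit restart the expiry clock.
        self._expires_at[prefix] = now + self.ttl
        return "hit" if hit else "miss"


sim = EphemeralCacheSim()
timeline = [0, 60, 120, 500]  # seconds since the first request
results = [sim.request("system+tools", t) for t in timeline]
print(results)  # ['miss', 'hit', 'hit', 'miss']
```

The last request arrives 380 seconds after the previous one, which is longer than the assumed lifetime, so it pays for a fresh cache write.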
Common Mistakes That Break Caching
I made all of these mistakes. Maybe you will too.
Mistake 1: Assuming Automatic Caching
Not every API call benefits from caching. Small prompts might cost more to cache than to process:
    Prompt Size       Cache Benefit?    Why
    -------------------------------------------------------
    Less than 1K      No                Overhead greater than savings
    1K-5K tokens      Maybe             Test and measure
    More than 5K      Yes               Significant savings

Mistake 2: Inconsistent Structure
If your system prompt changes between requests, caching breaks:
    BAD:                        GOOD:
    ------------------------    ------------------------
    System: "Help with..."      System: "You are..."
    System: "Also, note..."     Tools: [...]
    Tools: [...]                Context: [...]
    ------------------------    ------------------------
    Every request has           Same structure
    different system            every time
    prompt structure

Mistake 3: Wrong Breakpoint Placement
Placing breakpoints in the wrong location wastes the cache:
    {
      "system": [
        {
          "type": "text",
          "text": "You are helpful."
          // Missing cache_control here!
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Write code",
          "cache_control": {"type": "ephemeral"}  // WRONG PLACE
        }
      ]
    }

The breakpoint should be at the end of the prefix, not in the user message.
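Here is one way the misplaced marker could be repaired in code. Treat it as a sketch: `fix_breakpoint` is a hypothetical helper, and it assumes `system` is given as a list of text blocks, as in the examples in this post.

```python
import copy

MARKER = {"type": "ephemeral"}


def fix_breakpoint(payload: dict) -> dict:
    """Return a copy with the marker on the last system block only."""
    fixed = copy.deepcopy(payload)
    for message in fixed["messages"]:
        # Drop the marker from the part that changes on every request.
        message.pop("cache_control", None)
    # Put the breakpoint at the end of the stable prefix instead.
    fixed["system"][-1]["cache_control"] = dict(MARKER)
    return fixed


bad = {
    "system": [{"type": "text", "text": "You are helpful."}],
    "messages": [
        {"role": "user", "content": "Write code", "cache_control": dict(MARKER)}
    ],
}
good = fix_breakpoint(bad)
print(good["system"][-1])
print(good["messages"][0])
```

The original `bad` dict is left untouched, so the two versions can be compared side by side.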
Cost Comparison: Real Numbers
I ran tests on identical tasks with and without proper caching:
    Configuration          Total Tokens     Estimated Cost
    -------------------------------------------------------
    No caching             15,000,000       $150.00
    Proper caching         4,500,000        $45.00
    Bad caching (worst)    22,000,000       $220.00
    -------------------------------------------------------
    Savings (proper)       70% reduction    $105 saved

The worst case (bad caching) actually cost more than no caching at all. Cache writes are billed at a premium over regular input tokens, and with an inconsistent structure those writes were never followed by a cache hit.
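A back-of-the-envelope model shows why the three configurations diverge. The numbers here are assumptions, not measurements: a base price of $10 per million input tokens (the rate implied by the table), cache writes billed at 1.25x and cache reads at 0.1x that price (check the current pricing page before relying on these), and a hypothetical split of 7,000 prefix tokens plus 3,000 dynamic tokens per request.

```python
def cost_usd(prefix: int, dynamic: int, requests: int, mode: str,
             price_per_mtok: float = 10.0,
             write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Estimate input-token cost for one of three caching scenarios."""
    if mode == "none":      # no caching: every token billed at the base rate
        billable = requests * (prefix + dynamic)
    elif mode == "hits":    # one cache write, then every later request reads it
        billable = (prefix * write_mult
                    + (requests - 1) * prefix * read_mult
                    + requests * dynamic)
    elif mode == "misses":  # prefix keeps changing: every request pays a write
        billable = requests * (prefix * write_mult + dynamic)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return billable / 1_000_000 * price_per_mtok


for mode in ("none", "hits", "misses"):
    print(f"{mode:<8} ${cost_usd(7_000, 3_000, 100, mode):.2f}")
# none     $10.00
# hits     $3.78
# misses   $11.75
```

The exact percentages depend on how much of each request is prefix, but the ordering is the same as in my table: working caching is cheapest, and caching that never hits is the most expensive.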
How Claude Code Handles This
Claude Code is designed to work with prompt caching out of the box. When you use it:
- It maintains consistent system prompts across requests
- Tool definitions are placed in the cacheable prefix
- Context is structured to maximize cache hits
But I’ve seen tools that don’t do this well. One tool I tested didn’t provide proper breakpoints, resulting in 30-80% more token usage for the same tasks.
Practical Tips for Implementation
If you’re building your own integration:
1. IDENTIFY your prefix: What content stays constant?
2. PLACE breakpoints: Add cache_control at prefix end
3. MAINTAIN structure: Keep the same order every time
4. MEASURE results: Check your actual token savings
5. MONITOR cache hits: Use the usage fields in the API response

Measuring Your Cache Performance
Claude’s API reports cache performance in the usage object of each response body, not in HTTP headers:

    cache_read_input_tokens: 8500        <-- Tokens read from cache
    cache_creation_input_tokens: 1500    <-- Tokens written to a new cache entry

Use these to verify your caching is working:

    Field                          Value    Meaning
    -------------------------------------------------------
    cache_read_input_tokens        > 0      Cache HIT! Savings here
    cache_creation_input_tokens    > 0      New cache entry created
    input_tokens                   any      Tokens processed at the normal rate

Summary
In this post, I explained how prompt caching works in Claude Code. The key point is that caching can reduce your API costs by 30-90%, but only if you implement it correctly with proper breakpoints and consistent structure.
I went from paying $150 to $45 for the same workload. That’s a 70% reduction just by understanding and properly implementing prompt caching.
If you’re using Claude Code, much of this is handled for you. But if you’re building your own integration or noticing high token usage, check your caching implementation. The savings are worth it.
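One quick way to run that check, as a sketch: compute what share of a request's input tokens came from the cache. This assumes you have the `usage` object from a Messages API response as a dict with the fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`. The `cache_hit_ratio` helper and the sample numbers are mine.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Share of input tokens that were served from the cache."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    fresh = usage.get("input_tokens") or 0
    total = read + created + fresh
    return read / total if total else 0.0


# Sample usage object with made-up numbers.
usage = {
    "input_tokens": 0,
    "cache_creation_input_tokens": 1500,
    "cache_read_input_tokens": 8500,
}
print(f"{cache_hit_ratio(usage):.0%}")  # 85%
```

A ratio that stays near zero across repeated requests with the same prefix is a sign that the breakpoint is missing or the prefix keeps changing.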
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.