What Is Prompt Caching in Claude Code and Why Does It Matter for API Costs?

Mar 21, 2026

I was looking at my API bill and noticed something strange. My token usage was way higher than expected for similar tasks. Turns out, I wasn’t using prompt caching correctly.

Let me explain what I discovered about prompt caching in Claude Code and how it can save you significant money on API costs.

The Problem: Token Waste Without Caching

When I first started using Claude Code, I assumed caching happened automatically. It doesn’t. Here’s what I found:

Request Type        Tokens Processed    Cost Impact
-------------------------------------------------------
Without caching     10,000 tokens       100% (baseline)
With caching        3,000 tokens        30% of baseline
Bad cache design    15,000 tokens       150% (worse!)

The difference is massive. But why does this happen?

How Prefix Caching Actually Works

Claude’s API uses something called “prefix caching.” The idea is simple but the implementation matters:

API Request Structure:
[System Prompt] [Tools] [Context] [User Message]
|-------- PREFIX --------|  |-- DYNAMIC --|
     CACHED (reused)        PROCESSED EACH TIME

The “prefix” is everything that stays the same between requests: your system prompt, tool definitions, and context. (Internally, the API assembles the prefix in the order tools, system, then messages.) The “dynamic” part is whatever changes from one request to the next, typically the user’s message and any new conversation turns.

When successive requests share an identical prefix, Claude can skip re-processing the cached content. This saves both time and money.
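To make the idea of a shared prefix concrete, here is a minimal local sketch. It does not call the API; the helper names `flatten` and `shared_prefix_length` and the placeholder tools are hypothetical, and the real cache matching happens server-side. It only counts how many leading segments two requests have in common, assuming the tools, system, messages ordering.

```python
import json


def flatten(request: dict) -> list[str]:
    """Serialize a request into ordered segments: tools, system, then messages."""
    segments = []
    for key in ("tools", "system", "messages"):
        for item in request.get(key, []):
            # sort_keys keeps the serialization independent of dict key order
            segments.append(json.dumps(item, sort_keys=True))
    return segments


def shared_prefix_length(a: dict, b: dict) -> int:
    """Count how many leading segments two requests have in common."""
    count = 0
    for seg_a, seg_b in zip(flatten(a), flatten(b)):
        if seg_a != seg_b:
            break
        count += 1
    return count


# Placeholder tool and system definitions, for illustration only
TOOLS = [{"name": "tool1"}, {"name": "tool2"}]
SYSTEM = [{"type": "text", "text": "You are a helpful coding assistant."}]

request_1 = {"tools": TOOLS, "system": SYSTEM,
             "messages": [{"role": "user", "content": "Write code"}]}
request_2 = {"tools": TOOLS, "system": SYSTEM,
             "messages": [{"role": "user", "content": "Fix the bug"}]}

# Two tools plus one system block match; the user messages differ.
shared = shared_prefix_length(request_1, request_2)
print(shared)  # 3
```

The longer the run of identical leading segments, the more of the request is a candidate for reuse.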

The Breakpoint Problem

Here’s where I made my mistake. I thought caching just… happened. But Claude needs specific markers called “breakpoints” to know where the cacheable prefix ends.

REQUEST 1:                          REQUEST 2:
------------------------          ------------------------
System: "You are..."               System: "You are..."
Tools: [tool1, tool2]              Tools: [tool1, tool2]
Context: "Previous..."              Context: "Previous..."
---------------------- <--         ---------------------- <-- BREAKPOINT
User: "Write code"                  User: "Fix the bug"
------------------------          ------------------------
        |                                    |
        v                                    v
   PREFIX CACHED                      PREFIX REUSED

Without proper breakpoints, the entire prompt gets re-processed every time.

The `cache_control` Marker

Claude’s API uses a cache_control marker to indicate breakpoints. Here’s how it looks in practice:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a helpful coding assistant.",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Help me debug this function."
    }
  ]
}

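The cache_control: {"type": "ephemeral"} marker tells Claude: “Cache everything up to and including this block.” The system prompt here is kept short for readability; in practice the API only caches a prefix once it reaches a model-dependent minimum length.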
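If you build requests in code, a small helper keeps the breakpoint in one place. The sketch below is one way to do it, not an official pattern: `build_request` is a hypothetical helper, the `max_tokens` default is an arbitrary choice (the Messages API requires the field), and the function only builds the payload without sending it.

```python
import json


def build_request(system_text: str, user_text: str,
                  model: str = "claude-sonnet-4-20250514",
                  max_tokens: int = 1024) -> dict:
    """Build a Messages API payload with one breakpoint after the system prompt."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Breakpoint: everything up to and including this block is cacheable
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The user message sits after the breakpoint and is free to change
        "messages": [{"role": "user", "content": user_text}],
    }


payload = build_request("You are a helpful coding assistant.",
                        "Help me debug this function.")
print(json.dumps(payload, indent=2))
```

Because every request goes through the same function, the system block is serialized the same way each time and only the user message varies.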

Cache Lifecycle: It’s Ephemeral

I learned this the hard way. The cache isn’t permanent:

Time                Cache State
---------------------------------------
Request 1           Cache created (miss)
Request 2           Cache hit (saved!)
Request 3           Cache hit (saved!)
... 5 min pass...
Request 4           Cache expired (miss)

Ephemeral caches have a limited lifetime. Each cache hit refreshes the timer, but after about 5 minutes of inactivity the cache expires and you’re back to paying full price.
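The timeline above can be reproduced with a toy simulation. This is a local model only, assuming a 5-minute lifetime that is refreshed on every hit; `EphemeralCacheSim` is a made-up class and says nothing about how the server implements expiry.

```python
TTL_SECONDS = 5 * 60  # assumed lifetime, matching the "about 5 minutes" above


class EphemeralCacheSim:
    """Toy model of a cache entry whose lifetime is refreshed on every hit."""

    def __init__(self, ttl: float = TTL_SECONDS) -> None:
        self.ttl = ttl
        self.expires_at = None  # no entry has been written yet

    def request(self, now: float) -> str:
        """Return "hit" or "miss" for a request arriving at time `now` (seconds)."""
        hit = self.expires_at is not None and now < self.expires_at
        # A miss writes a fresh entry; a hit refreshes the existing one.
        self.expires_at = now + self.ttl
        return "hit" if hit else "miss"


cache = EphemeralCacheSim()
timeline = [0, 60, 120, 120 + 301]  # seconds since the first request
results = [cache.request(t) for t in timeline]
print(results)  # ['miss', 'hit', 'hit', 'miss']
```

The last request arrives 301 seconds after the previous one, just past the assumed lifetime, so it misses and rewrites the cache.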

Common Mistakes That Break Caching

I made all of these mistakes. Maybe you will too.

Mistake 1: Assuming Automatic Caching

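Not every API call benefits from caching. The API only caches a prefix once it reaches a minimum length (1,024 tokens on many models, higher on some), and writing to the cache costs more than processing the same tokens normally, so a small or rarely reused prefix may not pay off:

Prompt Size        Cache Benefit?     Why
-------------------------------------------------------
Less than 1K       No                 Too short to cache or pay off
1K-5K tokens       Maybe              Test and measure
More than 5K       Yes                Significant savings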

Mistake 2: Inconsistent Structure

If your system prompt changes between requests, caching breaks:

BAD:                              GOOD:
------------------------          ------------------------
System: "Help with..."            System: "You are..."
System: "Also, note..."           Tools: [...]
Tools: [...]                       Context: [...]
------------------------          ------------------------
Every request has                 Same structure
different system                  every time
prompt structure
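A common way to end up with an inconsistent prefix is to embed something volatile, such as a timestamp, in the system prompt. The sketch below is a hypothetical illustration of that case: `system_fingerprint` is a local stand-in for "is the prefix byte-identical?", not the API's actual cache key, and the two request builders are invented for this example.

```python
import hashlib
import json


def system_fingerprint(system_blocks: list[dict]) -> str:
    """Hash the system blocks; a local stand-in for an identical-prefix check."""
    encoded = json.dumps(system_blocks, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


BASE_PROMPT = "You are a helpful coding assistant."


def bad_request(timestamp: str, question: str) -> dict:
    # Timestamp embedded in the system prompt: the prefix changes on every call.
    return {
        "system": [{"type": "text",
                    "text": f"{BASE_PROMPT} Current time: {timestamp}"}],
        "messages": [{"role": "user", "content": question}],
    }


def good_request(timestamp: str, question: str) -> dict:
    # Timestamp moved into the user message: the system prompt stays identical.
    return {
        "system": [{"type": "text", "text": BASE_PROMPT}],
        "messages": [{"role": "user", "content": f"[{timestamp}] {question}"}],
    }


bad_match = (system_fingerprint(bad_request("10:00", "Write code")["system"])
             == system_fingerprint(bad_request("10:01", "Fix the bug")["system"]))
good_match = (system_fingerprint(good_request("10:00", "Write code")["system"])
              == system_fingerprint(good_request("10:01", "Fix the bug")["system"]))

print(bad_match, good_match)  # False True
```

Moving the volatile value behind the stable content keeps the system prompt reusable across requests.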

Mistake 3: Wrong Breakpoint Placement

Placing breakpoints in the wrong location wastes the cache:

{
  "system": [
    {
      "type": "text",
      "text": "You are helpful."
      // Missing cache_control here!
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Write code",
      "cache_control": {"type": "ephemeral"}  // WRONG PLACE
    }
  ]
}

The breakpoint should be at the end of the prefix, not in the user message.
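For the single-breakpoint layout described here, the fix can be expressed as a small transformation. This is a sketch under assumptions: `move_breakpoint_to_system` is a hypothetical helper, it expects `system` to be a list of blocks, and it only rewrites the payload locally.

```python
import copy


def move_breakpoint_to_system(request: dict) -> dict:
    """Return a copy with cache_control stripped from messages and set on the last system block."""
    fixed = copy.deepcopy(request)  # leave the caller's payload untouched
    for message in fixed.get("messages", []):
        message.pop("cache_control", None)
        content = message.get("content")
        if isinstance(content, list):
            # Content may also be a list of blocks carrying their own markers
            for block in content:
                block.pop("cache_control", None)
    if fixed.get("system"):
        fixed["system"][-1]["cache_control"] = {"type": "ephemeral"}
    return fixed


misplaced = {
    "system": [{"type": "text", "text": "You are helpful."}],
    "messages": [
        {
            "role": "user",
            "content": "Write code",
            "cache_control": {"type": "ephemeral"},  # the misplaced marker
        }
    ],
}

corrected = move_breakpoint_to_system(misplaced)
print(corrected["system"][-1])
```

After the transformation the marker sits at the end of the stable prefix, and the user message carries no cache marker.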

Cost Comparison: Real Numbers

I ran tests on identical tasks with and without proper caching:

Configuration          Total Tokens     Estimated Cost
-------------------------------------------------------
No caching             15,000,000       $150.00
Proper caching         4,500,000        $45.00
Bad caching (worst)    22,000,000       $220.00
-------------------------------------------------------
Savings (proper)       70% reduction    $105 saved

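The worst case (bad caching) actually cost more because writing to the cache is billed at a premium over normal input tokens. With an inconsistent structure, every request paid that write premium and none of them got a cheaper cache read in return.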
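A rough cost model shows why the bad case ends up above the baseline. The multipliers below are assumptions taken from Anthropic's published pricing for the default cache lifetime (writes at 1.25x the base input rate, reads at 0.1x), so check the current pricing page before relying on them. The function name and the three modes are hypothetical, and the figures are illustrative rather than a reproduction of the table above, which also includes uncached and output tokens.

```python
WRITE_PCT = 125  # assumed: cache writes billed at 1.25x the base input rate
READ_PCT = 10    # assumed: cache reads billed at 0.1x the base input rate


def billed_token_equivalents(prefix_tokens: int, requests: int, mode: str) -> int:
    """Prefix cost across `requests` calls, in base-rate token equivalents."""
    if mode == "none":
        # No caching: the prefix is processed at the base rate every time.
        return prefix_tokens * requests
    if mode == "proper":
        # One write, then every later request reads the cached prefix.
        write = prefix_tokens * WRITE_PCT // 100
        reads = (requests - 1) * prefix_tokens * READ_PCT // 100
        return write + reads
    if mode == "bad":
        # The prefix changes every time: each request pays the write premium.
        return requests * prefix_tokens * WRITE_PCT // 100
    raise ValueError(f"unknown mode: {mode}")


costs = {mode: billed_token_equivalents(10_000, 100, mode)
         for mode in ("none", "proper", "bad")}
print(costs)  # {'none': 1000000, 'proper': 111500, 'bad': 1250000}
```

Under these assumptions a stable prefix pays for itself from the second request onward, while a prefix that never repeats costs 25% more than not caching at all.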

How Claude Code Handles This

Claude Code is designed to work with prompt caching out of the box. When you use it:

It maintains consistent system prompts across requests
Tool definitions are placed in the cacheable prefix
Context is structured to maximize cache hits

But I’ve seen tools that don’t do this well. One tool I tested didn’t provide proper breakpoints, resulting in 30-80% more token usage for the same tasks.

Practical Tips for Implementation

If you’re building your own integration:

1. IDENTIFY your prefix: What content stays constant?
2. PLACE breakpoints: Add cache_control at prefix end
3. MAINTAIN structure: Keep the same order every time
4. MEASURE results: Check your actual token savings
5. MONITOR cache hits: Use the usage fields in the API response

Measuring Your Cache Performance

Claude’s API reports cache performance in the usage object of each response:

cache_read_input_tokens: 8500      <-- Tokens read from cache
cache_creation_input_tokens: 1500  <-- Tokens written to a new cache entry

Use these to verify your caching is working:

Usage Field                   Meaning
-------------------------------------------------------
cache_read_input_tokens       Cache HIT! Savings here
cache_creation_input_tokens   New cache entry written
input_tokens                  Input tokens processed without the cache
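To turn those counts into a single number, you can compute the share of input tokens served from cache. This sketch assumes the counts are available as a plain dict using the field names from the Messages API usage object (`cache_read_input_tokens`, `cache_creation_input_tokens`, `input_tokens`); `cache_hit_ratio` is a hypothetical helper, and the sample values reuse the illustrative numbers above.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens that were read from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + created + uncached
    # Guard against an empty usage record
    return read / total if total else 0.0


sample_usage = {
    "cache_read_input_tokens": 8500,
    "cache_creation_input_tokens": 1500,
    "input_tokens": 0,
}

ratio = cache_hit_ratio(sample_usage)
print(f"{ratio:.0%}")  # 85%
```

A ratio that stays near zero across repeated requests is a sign that the prefix is changing or the breakpoint is misplaced.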

Summary

In this post, I explained how prompt caching works in Claude Code. The key point is that caching can reduce your API costs by 30-90%, but only if you implement it correctly with proper breakpoints and consistent structure.

I went from paying $150 to $45 for the same workload. That’s a 70% reduction just by understanding and properly implementing prompt caching.

If you’re using Claude Code, much of this is handled for you. But if you’re building your own integration or noticing high token usage, check your caching implementation. The savings are worth it.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Anthropic Prompt Caching Documentation
👨‍💻 Claude Code Documentation
