How Does Claude Token Caching Work? A Cost Optimization Guide

Mar 26, 2026

I was staring at my Claude API bill last month, wondering why my agent workflow costs were 10x what I expected. The culprit? I wasn’t using prompt caching. Let me explain what I learned about Anthropic’s token caching and how it can save you significant money.

The Problem: Runaway API Costs

When you’re building AI applications with Claude - especially multi-step agents or conversational interfaces - you quickly hit a cost wall. Every API call sends the same context over and over: system prompts, conversation history, retrieved documents.

I built an agent that retrieves documentation, thinks through problems, and produces answers. Simple enough, right? But each step in my chain was sending the same 50K token context repeatedly. My costs were exploding.

Then I discovered prompt caching.

What Is Token Caching?

Token caching is Anthropic’s mechanism for storing frequently used input tokens. When you send the same or similar prompts repeatedly - system instructions, context documents, conversation history - the cache eliminates redundant processing costs.

Here’s the key insight: cached reads cost just 0.1x compared to 1x for uncached input tokens.

That’s a 90% savings on repeated context.

The Pricing Tiers Explained

Anthropic offers three caching strategies with different cost tradeoffs:

Token Type	Cost Multiplier	When It Applies
Uncached Input	1.0x	First request, no caching enabled
Cache Write (5-min)	1.25x	New input cached for 5 minutes
Cache Write (1-hour)	2.0x	New input cached for 1 hour
Cache Read	0.1x	Reusing cached input within TTL
Output	5.0x	All output tokens

The math is straightforward. Pay a premium upfront (1.25x or 2x) to cache your context, then pay just 0.1x on subsequent reads.

When Does Caching Make Sense?

Let’s look at a concrete example. Say you have a 10,000 token system prompt and context that you’ll use across 10 API calls in a conversation.

Without Caching:
├── 10 calls × 10,000 tokens × 1.0x = 100,000 token cost units

With 5-min Caching:
├── 1st call: 10,000 tokens × 1.25x = 12,500 (cache write)
├── 9 reads: 10,000 tokens × 0.1x × 9 = 9,000 (cache reads)
└── Total: 21,500 token cost units

Savings: 78.5%

The break-even point? Just 2 reads for 5-minute caching, 10 reads for 1-hour caching.

Model Multipliers Stack On Top

Each Claude model has a base cost multiplier that applies to all token costs:

Model	Base Multiplier	Best For
Claude Haiku	1x	Fast, cheap operations
Claude Sonnet	3x	Balanced performance
Claude Opus	5x	Complex reasoning

So a cached read with Sonnet costs 3x × 0.1x = 0.3x of baseline Haiku pricing. Still a massive discount compared to uncached 3x × 1x = 3x.

The Hidden Cost: Thinking Tokens

Here’s something I learned the hard way. Claude’s “thinking” tokens count as output tokens, which cost 5x. This caught me off guard when I started using extended thinking mode.

# Example: Analyzing a 20K document with extended thinking

# Without caching:
# - Input: 20,000 × 1x = 20,000 units
# - Thinking: 15,000 × 5x = 75,000 units  <- This adds up fast!
# - Output: 2,000 × 5x = 10,000 units
# Total: 105,000 units

# With caching (assuming 5 reads):
# - Cache write: 20,000 × 1.25x = 25,000 units
# - Cache reads: 20,000 × 0.1x × 5 = 10,000 units
# - Thinking (per call): 15,000 × 5x × 5 = 375,000 units
# - Output (per call): 2,000 × 5x × 5 = 50,000 units
# Total: 460,000 units (but spread across 5 useful outputs)

The thinking tokens can’t be cached, so they dominate costs in extended thinking workflows.

The Chain Cost Multiplier Problem

When building multi-step workflows, costs compound in ways you might not expect.

Step 1: Input → Output A
Step 2: Output A + New Input → Output B
Step 3: Output B + New Input → Final Output

Each step's output becomes the next step's input!
This means non-final outputs effectively cost 6.25x or 7x
because they get processed again as input tokens.

For agent workflows, caching becomes even more critical. If Step 1’s output (at 5x cost) becomes Step 2’s input (at 1x or 0.1x), you’re paying twice for the same content.

Cache Expiration: The Gotcha

Caches expire based on TTL from the last read, not the initial write:

5-minute cache: Expires 5 minutes after last access
1-hour cache: Expires 1 hour after last access

This means frequently accessed content stays cached longer. But if you have a 30-minute gap between calls, your 5-minute cache will expire, and you’ll pay for a fresh cache write.

Choosing the Right Cache Duration

Here’s my decision matrix:

Use 5-minute caching when:
├── High-frequency access (multiple calls per minute)
├── Short-lived conversations
└── You need at least 2 reads to break even

Use 1-hour caching when:
├── Lower-frequency access (calls every few minutes)
├── Long-running sessions or workflows
├── You need at least 10 reads to break even
└── Certainty that content won't change

Skip caching when:
├── Single-call operations
├── Content changes frequently
└── Access pattern is unpredictable

Real-World Example: Document Q&A Agent

I built a document analysis agent that retrieves relevant documentation and answers questions. Here’s how caching transformed the economics:

Scenario: 100 questions about a 30K token document

BEFORE (no caching):
├── 100 calls × 30,000 tokens × 1.0x = 3,000,000 input cost
├── Plus output costs (~5x on responses)
└── Total input: 3M token units

AFTER (1-hour caching):
├── 1 cache write: 30,000 × 2.0x = 60,000
├── 99 cache reads: 30,000 × 0.1x × 99 = 297,000
└── Total input: 357,000 token units

Savings: 88% reduction in input token costs

For a production application handling thousands of queries, this translates to real money.

Implementation Tips

Cache at the API level - Use the caching parameter in your Claude API calls
Structure prompts for caching - Put static content (system prompts, documents) first
Monitor cache hit rates - Track how often you’re getting cache reads vs writes
Consider cache warming - For predictable usage patterns, pre-cache with a dummy request

Context Window Management: Token caching works within Claude’s context window limits
Streaming: Cached responses still support streaming output
Multi-turn Conversations: Perfect use case for caching conversation history

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!