How Does Claude Token Caching Work? A Cost Optimization Guide
I was staring at my Claude API bill last month, wondering why my agent workflow costs were 10x what I expected. The culprit? I wasn’t using prompt caching. Let me explain what I learned about Anthropic’s token caching and how it can save you significant money.
The Problem: Runaway API Costs
When you’re building AI applications with Claude - especially multi-step agents or conversational interfaces - you quickly hit a cost wall. Every API call sends the same context over and over: system prompts, conversation history, retrieved documents.
I built an agent that retrieves documentation, thinks through problems, and produces answers. Simple enough, right? But each step in my chain was sending the same 50K token context repeatedly. My costs were exploding.
Then I discovered prompt caching.
What Is Token Caching?
Token caching is Anthropic’s mechanism for storing frequently used input tokens. When you send the same or similar prompts repeatedly - system instructions, context documents, conversation history - the cache eliminates redundant processing costs.
Here’s the key insight: cached reads cost just 0.1x compared to 1x for uncached input tokens.
That’s a 90% savings on repeated context.
The Pricing Tiers Explained
Anthropic offers three caching strategies with different cost tradeoffs:
| Token Type | Cost Multiplier | When It Applies |
|---|---|---|
| Uncached Input | 1.0x | First request, no caching enabled |
| Cache Write (5-min) | 1.25x | New input cached for 5 minutes |
| Cache Write (1-hour) | 2.0x | New input cached for 1 hour |
| Cache Read | 0.1x | Reusing cached input within TTL |
| Output | 5.0x | All output tokens |
The math is straightforward. Pay a premium upfront (1.25x or 2x) to cache your context, then pay just 0.1x on subsequent reads.
When Does Caching Make Sense?
Let’s look at a concrete example. Say you have a 10,000 token system prompt and context that you’ll use across 10 API calls in a conversation.
Without Caching:├── 10 calls × 10,000 tokens × 1.0x = 100,000 token cost units
With 5-min Caching:├── 1st call: 10,000 tokens × 1.25x = 12,500 (cache write)├── 9 reads: 10,000 tokens × 0.1x × 9 = 9,000 (cache reads)└── Total: 21,500 token cost units
Savings: 78.5%The break-even point? Just 2 reads for 5-minute caching, 10 reads for 1-hour caching.
Model Multipliers Stack On Top
Each Claude model has a base cost multiplier that applies to all token costs:
| Model | Base Multiplier | Best For |
|---|---|---|
| Claude Haiku | 1x | Fast, cheap operations |
| Claude Sonnet | 3x | Balanced performance |
| Claude Opus | 5x | Complex reasoning |
So a cached read with Sonnet costs 3x × 0.1x = 0.3x of baseline Haiku pricing. Still a massive discount compared to uncached 3x × 1x = 3x.
The Hidden Cost: Thinking Tokens
Here’s something I learned the hard way. Claude’s “thinking” tokens count as output tokens, which cost 5x. This caught me off guard when I started using extended thinking mode.
# Example: Analyzing a 20K document with extended thinking
# Without caching:# - Input: 20,000 × 1x = 20,000 units# - Thinking: 15,000 × 5x = 75,000 units <- This adds up fast!# - Output: 2,000 × 5x = 10,000 units# Total: 105,000 units
# With caching (assuming 5 reads):# - Cache write: 20,000 × 1.25x = 25,000 units# - Cache reads: 20,000 × 0.1x × 5 = 10,000 units# - Thinking (per call): 15,000 × 5x × 5 = 375,000 units# - Output (per call): 2,000 × 5x × 5 = 50,000 units# Total: 460,000 units (but spread across 5 useful outputs)The thinking tokens can’t be cached, so they dominate costs in extended thinking workflows.
The Chain Cost Multiplier Problem
When building multi-step workflows, costs compound in ways you might not expect.
Step 1: Input → Output AStep 2: Output A + New Input → Output BStep 3: Output B + New Input → Final Output
Each step's output becomes the next step's input!This means non-final outputs effectively cost 6.25x or 7xbecause they get processed again as input tokens.For agent workflows, caching becomes even more critical. If Step 1’s output (at 5x cost) becomes Step 2’s input (at 1x or 0.1x), you’re paying twice for the same content.
Cache Expiration: The Gotcha
Caches expire based on TTL from the last read, not the initial write:
- 5-minute cache: Expires 5 minutes after last access
- 1-hour cache: Expires 1 hour after last access
This means frequently accessed content stays cached longer. But if you have a 30-minute gap between calls, your 5-minute cache will expire, and you’ll pay for a fresh cache write.
Choosing the Right Cache Duration
Here’s my decision matrix:
Use 5-minute caching when:├── High-frequency access (multiple calls per minute)├── Short-lived conversations└── You need at least 2 reads to break even
Use 1-hour caching when:├── Lower-frequency access (calls every few minutes)├── Long-running sessions or workflows├── You need at least 10 reads to break even└── Certainty that content won't change
Skip caching when:├── Single-call operations├── Content changes frequently└── Access pattern is unpredictableReal-World Example: Document Q&A Agent
I built a document analysis agent that retrieves relevant documentation and answers questions. Here’s how caching transformed the economics:
Scenario: 100 questions about a 30K token document
BEFORE (no caching):├── 100 calls × 30,000 tokens × 1.0x = 3,000,000 input cost├── Plus output costs (~5x on responses)└── Total input: 3M token units
AFTER (1-hour caching):├── 1 cache write: 30,000 × 2.0x = 60,000├── 99 cache reads: 30,000 × 0.1x × 99 = 297,000└── Total input: 357,000 token units
Savings: 88% reduction in input token costsFor a production application handling thousands of queries, this translates to real money.
Implementation Tips
- Cache at the API level - Use the
cachingparameter in your Claude API calls - Structure prompts for caching - Put static content (system prompts, documents) first
- Monitor cache hit rates - Track how often you’re getting cache reads vs writes
- Consider cache warming - For predictable usage patterns, pre-cache with a dummy request
Related Concepts
- Context Window Management: Token caching works within Claude’s context window limits
- Streaming: Cached responses still support streaming output
- Multi-turn Conversations: Perfect use case for caching conversation history
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments