Skip to content

How Does Claude Token Caching Work? A Cost Optimization Guide

I was staring at my Claude API bill last month, wondering why my agent workflow costs were 10x what I expected. The culprit? I wasn’t using prompt caching. Let me explain what I learned about Anthropic’s token caching and how it can save you significant money.

The Problem: Runaway API Costs

When you’re building AI applications with Claude - especially multi-step agents or conversational interfaces - you quickly hit a cost wall. Every API call sends the same context over and over: system prompts, conversation history, retrieved documents.

I built an agent that retrieves documentation, thinks through problems, and produces answers. Simple enough, right? But each step in my chain was sending the same 50K token context repeatedly. My costs were exploding.

Then I discovered prompt caching.

What Is Token Caching?

Token caching is Anthropic’s mechanism for storing frequently used input tokens. When you send the same or similar prompts repeatedly - system instructions, context documents, conversation history - the cache eliminates redundant processing costs.

Here’s the key insight: cached reads cost just 0.1x compared to 1x for uncached input tokens.

That’s a 90% savings on repeated context.

The Pricing Tiers Explained

Anthropic offers three caching strategies with different cost tradeoffs:

Token TypeCost MultiplierWhen It Applies
Uncached Input1.0xFirst request, no caching enabled
Cache Write (5-min)1.25xNew input cached for 5 minutes
Cache Write (1-hour)2.0xNew input cached for 1 hour
Cache Read0.1xReusing cached input within TTL
Output5.0xAll output tokens

The math is straightforward. Pay a premium upfront (1.25x or 2x) to cache your context, then pay just 0.1x on subsequent reads.

When Does Caching Make Sense?

Let’s look at a concrete example. Say you have a 10,000 token system prompt and context that you’ll use across 10 API calls in a conversation.

Cost comparison diagram
Without Caching:
├── 10 calls × 10,000 tokens × 1.0x = 100,000 token cost units
With 5-min Caching:
├── 1st call: 10,000 tokens × 1.25x = 12,500 (cache write)
├── 9 reads: 10,000 tokens × 0.1x × 9 = 9,000 (cache reads)
└── Total: 21,500 token cost units
Savings: 78.5%

The break-even point? Just 2 reads for 5-minute caching, 10 reads for 1-hour caching.

Model Multipliers Stack On Top

Each Claude model has a base cost multiplier that applies to all token costs:

ModelBase MultiplierBest For
Claude Haiku1xFast, cheap operations
Claude Sonnet3xBalanced performance
Claude Opus5xComplex reasoning

So a cached read with Sonnet costs 3x × 0.1x = 0.3x of baseline Haiku pricing. Still a massive discount compared to uncached 3x × 1x = 3x.

The Hidden Cost: Thinking Tokens

Here’s something I learned the hard way. Claude’s “thinking” tokens count as output tokens, which cost 5x. This caught me off guard when I started using extended thinking mode.

cost_calculation.py
# Example: Analyzing a 20K document with extended thinking
# Without caching:
# - Input: 20,000 × 1x = 20,000 units
# - Thinking: 15,000 × 5x = 75,000 units <- This adds up fast!
# - Output: 2,000 × 5x = 10,000 units
# Total: 105,000 units
# With caching (assuming 5 reads):
# - Cache write: 20,000 × 1.25x = 25,000 units
# - Cache reads: 20,000 × 0.1x × 5 = 10,000 units
# - Thinking (per call): 15,000 × 5x × 5 = 375,000 units
# - Output (per call): 2,000 × 5x × 5 = 50,000 units
# Total: 460,000 units (but spread across 5 useful outputs)

The thinking tokens can’t be cached, so they dominate costs in extended thinking workflows.

The Chain Cost Multiplier Problem

When building multi-step workflows, costs compound in ways you might not expect.

Chain cost escalation
Step 1: Input → Output A
Step 2: Output A + New Input → Output B
Step 3: Output B + New Input → Final Output
Each step's output becomes the next step's input!
This means non-final outputs effectively cost 6.25x or 7x
because they get processed again as input tokens.

For agent workflows, caching becomes even more critical. If Step 1’s output (at 5x cost) becomes Step 2’s input (at 1x or 0.1x), you’re paying twice for the same content.

Cache Expiration: The Gotcha

Caches expire based on TTL from the last read, not the initial write:

  • 5-minute cache: Expires 5 minutes after last access
  • 1-hour cache: Expires 1 hour after last access

This means frequently accessed content stays cached longer. But if you have a 30-minute gap between calls, your 5-minute cache will expire, and you’ll pay for a fresh cache write.

Choosing the Right Cache Duration

Here’s my decision matrix:

Cache duration decision guide
Use 5-minute caching when:
├── High-frequency access (multiple calls per minute)
├── Short-lived conversations
└── You need at least 2 reads to break even
Use 1-hour caching when:
├── Lower-frequency access (calls every few minutes)
├── Long-running sessions or workflows
├── You need at least 10 reads to break even
└── Certainty that content won't change
Skip caching when:
├── Single-call operations
├── Content changes frequently
└── Access pattern is unpredictable

Real-World Example: Document Q&A Agent

I built a document analysis agent that retrieves relevant documentation and answers questions. Here’s how caching transformed the economics:

Before and after caching
Scenario: 100 questions about a 30K token document
BEFORE (no caching):
├── 100 calls × 30,000 tokens × 1.0x = 3,000,000 input cost
├── Plus output costs (~5x on responses)
└── Total input: 3M token units
AFTER (1-hour caching):
├── 1 cache write: 30,000 × 2.0x = 60,000
├── 99 cache reads: 30,000 × 0.1x × 99 = 297,000
└── Total input: 357,000 token units
Savings: 88% reduction in input token costs

For a production application handling thousands of queries, this translates to real money.

Implementation Tips

  1. Cache at the API level - Use the caching parameter in your Claude API calls
  2. Structure prompts for caching - Put static content (system prompts, documents) first
  3. Monitor cache hit rates - Track how often you’re getting cache reads vs writes
  4. Consider cache warming - For predictable usage patterns, pre-cache with a dummy request
  • Context Window Management: Token caching works within Claude’s context window limits
  • Streaming: Cached responses still support streaming output
  • Multi-turn Conversations: Perfect use case for caching conversation history

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments