
Prompt Caching for AI Agents: How Prefix Matching Cuts Costs by 80%

My AI agent’s API bill hit $2,400 last month. For a simple task orchestration system running maybe 50 iterations a day. Something was wrong.

I opened my Anthropic dashboard and stared at the token counts. By the later iterations, each request was sending 50,000+ tokens to the API. By iteration 30, I had paid for the same system prompt and tool definitions 30 times over.

There had to be a better way. That’s when I discovered prompt caching - and why most production agents never modify their message history.

The Expensive Reality of Agent Loops

Here’s what my agent loop looked like:

Agent iteration breakdown
Iteration 1: send 5,000 tokens (system + tools)
Iteration 2: send 7,000 tokens (system + tools + history)
Iteration 3: send 9,000 tokens (system + tools + more history)
...
Iteration 30: send 50,000+ tokens

Every single request re-processed the same system prompt. The same tool definitions. The same conversation history from 10 iterations ago.
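To put a number on that growth, here is a back-of-the-envelope sketch. It assumes the pattern in the breakdown continues linearly (5,000 tokens at iteration 1, about 2,000 more per iteration); the function name and the linear-growth assumption are illustrative, not measurements.

```python
def tokens_sent(iterations: int, base: int = 5_000, growth: int = 2_000) -> list[int]:
    """Input tokens sent at each iteration when the whole history is resent."""
    return [base + growth * i for i in range(iterations)]


per_iteration = tokens_sent(30)
total = sum(per_iteration)

print(per_iteration[-1])  # tokens in the 30th request
print(total)              # tokens billed across all 30 requests
```

Under these assumptions the 30th request alone carries 63,000 tokens, and the run as a whole sends just over a million input tokens, most of them repeats of earlier requests.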

I tried truncating history. That helped with costs but broke the agent’s ability to reference earlier context. I tried summarization. That added latency and sometimes lost critical details.

Then I found prompt caching.

How Prompt Caching Actually Works

Prompt caching stores the computed KV-cache for processed tokens. When your next request has the same prefix, the server reuses the cache instead of recomputing.

Cache matching visualization
Request 1: [A][B][C][D][E] <- compute all
Request 2: [A][B][C][D][F] <- reuse A-D, compute F only
Request 3: [A][B][C][D][F][G] <- reuse A-F, compute G only
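The matching rule in that visualization can be modelled in a few lines. This is a toy sketch: `split_by_cache` is a hypothetical helper, and it treats each bracketed block as one unit, whereas a real cache works on tokens and has its own minimum sizes and expiry rules.

```python
def split_by_cache(previous: list[str], current: list[str]) -> tuple[list[str], list[str]]:
    """Split `current` into (reused prefix, blocks that must be recomputed)."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return current[:shared], current[shared:]


request_1 = ["A", "B", "C", "D", "E"]
request_2 = ["A", "B", "C", "D", "F"]
request_3 = ["A", "B", "C", "D", "F", "G"]

print(split_by_cache(request_1, request_2))  # reuse A-D, compute F
print(split_by_cache(request_2, request_3))  # reuse A-F, compute G

# Changing one block in the middle forfeits everything after it
edited = ["A", "B", "X", "D", "F", "G"]
print(split_by_cache(request_3, edited))     # reuse A-B only
```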

Anthropic charges about 1/10th of the normal input price for tokens read from cache. With 18,000 cached tokens and 2,000 new tokens, my next request cost roughly 20% of a non-cached equivalent.
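The arithmetic behind that figure, as a small sketch. The 0.1 multiplier is the 1/10th price from the text; `relative_input_cost` is a made-up name. It ignores output tokens and any extra charge for writing new entries into the cache, so treat the result as a lower bound on what you actually pay.

```python
def relative_input_cost(cached: int, new: int, read_multiplier: float = 0.1) -> float:
    """Input cost of a request as a fraction of the same request with no caching."""
    total = cached + new
    if total == 0:
        return 0.0
    return (cached * read_multiplier + new) / total


ratio = relative_input_cost(cached=18_000, new=2_000)
print(f"{ratio:.0%}")  # (1,800 + 2,000) / 20,000
```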

But there’s a catch that took me three failed attempts to understand.

The Prefix Constraint Is Brutal

The cache matches prefixes strictly. If messages 1-20 are identical to your last request, they hit cache. But modify even one token in the middle? Everything after that point re-computes.

I learned this the hard way:

My first attempt (broke cache)
def manage_messages(self, messages, max_tokens=40000):
    # I thought truncating old history was smart...
    if count_tokens(messages) > max_tokens:
        messages = messages[-20:]  # Keep last 20 messages
    return messages

This looked reasonable. But every time I truncated, the cache was completely invalidated: the message list now started with a different message, so it diverged from the cached prefix at the very first position. No cache hit. Full re-computation. Maximum cost.

My second mistake was reordering tools based on usage frequency:

Second attempt (also broke cache)
def optimize_tools(self, tools):
    # I reordered tools thinking it would help...
    used_tools = [t for t in tools if t.name in self.used_tool_names]
    unused_tools = [t for t in tools if t.name not in self.used_tool_names]
    return used_tools + unused_tools  # WRONG: reordering breaks cache

Even though the content was identical, the order changed. Cache broken.
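A quick way to see why order matters is to fingerprint the serialized tool list. This is a sketch with two hypothetical tools; `tools_fingerprint` and `canonical` are illustrative helpers, and sorting by name is just one possible way to pin an order.

```python
import hashlib
import json


def tools_fingerprint(tools: list[dict]) -> str:
    """Hash of the exact bytes the tool list serializes to."""
    payload = json.dumps(tools, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def canonical(tools: list[dict]) -> list[dict]:
    """Pick one fixed order (here: by name) and always use it."""
    return sorted(tools, key=lambda t: t["name"])


tools = [
    {"name": "search", "description": "Search the web"},
    {"name": "read_file", "description": "Read a file"},
]
reordered = list(reversed(tools))

# Same tools, different order: the serialized prefix no longer matches
print(tools_fingerprint(tools) == tools_fingerprint(reordered))

# With a canonical order, the fingerprint is stable
print(tools_fingerprint(canonical(tools)) == tools_fingerprint(canonical(reordered)))
```

The first comparison prints `False` and the second `True`. Logging a fingerprint like this at startup makes an accidental reorder easy to spot.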

The Cache-Friendly Architecture

After studying how Manus and other production systems handle this, I rewrote my message manager:

cache_aware_manager.py
import json

class CacheAwareMessageManager:
    """
    Three rules for cache-friendly agents:
    1. Static content first (system, tools)
    2. Only append to messages, never modify
    3. Never delete from history
    """
    def __init__(self):
        # Order matters! Static first, dynamic last
        self.messages = []

    def initialize(self, system_prompt: str, tools: list[dict]):
        """Called once at agent start - these are static forever."""
        self.messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": json.dumps(tools)}
        ]

    def append_interaction(
        self,
        tool_call: dict,
        tool_result: str,
        user_response: str
    ):
        """
        ONLY append - never modify previous messages.
        Even if a tool call was wrong, append the correction.
        Even if history is long, keep appending.
        """
        # Message shapes here are simplified; adapt roles and fields to
        # your provider's wire format before sending.
        # Empty parts are skipped so a plain user turn adds exactly one message.
        if tool_call:
            self.messages.append({"role": "assistant", "tool_calls": [tool_call]})
            self.messages.append({"role": "tool_result", "content": tool_result})
        if user_response:
            self.messages.append({"role": "user", "content": user_response})

    def get_messages(self) -> list[dict]:
        """Return full history - let caching handle efficiency."""
        return self.messages

    def get_token_stats(self) -> dict:
        """Track history size (a rough proxy for cacheable tokens)."""
        return {
            "total_messages": len(self.messages),
            "estimated_tokens": sum(
                self._estimate_tokens(m) for m in self.messages
            )
        }

    def _estimate_tokens(self, message: dict) -> int:
        # Rough estimate: 4 chars per token
        content = json.dumps(message)
        return len(content) // 4

The key insight: never truncate, never reorder, never modify. Just append.
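One way to enforce the append-only rule rather than just hope for it is a small guard that runs before every request. This is a sketch under my own assumptions: `HistoryGuard` is a hypothetical class, and it compares JSON snapshots of the messages, which only approximates what the provider actually hashes.

```python
import json


class HistoryGuard:
    """Remembers the last history sent and rejects any non-append change."""

    def __init__(self) -> None:
        self._last_sent: list[str] = []

    def check(self, messages: list[dict]) -> None:
        # Compare serialized snapshots so in-place edits are caught too
        current = [json.dumps(m, sort_keys=True) for m in messages]
        for index, old in enumerate(self._last_sent):
            if index >= len(current) or current[index] != old:
                raise ValueError(
                    f"history changed at message {index}; cache prefix broken"
                )
        self._last_sent = current


guard = HistoryGuard()
history = [{"role": "user", "content": "list the files"}]
guard.check(history)                      # first request: nothing to compare

history.append({"role": "assistant", "content": "Here they are."})
guard.check(history)                      # pure append: passes

history[0]["content"] = "list the file"   # in-place edit
try:
    guard.check(history)
except ValueError as err:
    print(err)
```

The error names the first message that diverged, which is also the point from which the cache stops helping.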

Why This Changes Agent Architecture

This constraint shapes every architectural decision:

Before vs After understanding caching
BEFORE (cache-unaware):
- Truncate old messages to save tokens
- Reorder tools by usage frequency
- Edit message history to fix errors
- Result: $2,400/month, no cache hits
AFTER (cache-aware):
- Keep all message history
- Tools in fixed order (even unused ones)
- Append corrections, never edit
- Result: $480/month, 85% cache hit rate

The cost difference is massive. But more importantly, the architecture becomes simpler. I no longer worry about history management. I no longer try to be clever with tool ordering.

Common Mistakes I Made

Mistake 1: Modifying message content

Breaking cache by modification
# WRONG: Fixing a typo in history
self.messages[3]["content"] = self.messages[3]["content"].replace(
    "eror", "error"
)
# Cache breaks from message 3 onwards

Mistake 2: Deleting unused tool results

Breaking cache by deletion
# WRONG: Removing failed tool calls
self.messages = [
    m for m in self.messages
    if not is_failed_tool_call(m)
]
# All subsequent cache invalidates

Mistake 3: Reordering by change frequency

Breaking cache by reordering
# WRONG: Putting most-changed content first
self.messages = sorted(
    self.messages,
    key=lambda m: estimate_change_frequency(m),
    reverse=True
)
# Completely different prefix order

All three seemed like good ideas at the time. All three destroyed my cache hit rate.

The Right Approach: Append-Only History

Here’s what works:

append_only_agent.py
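import json
from anthropic import AsyncAnthropic

class AppendOnlyAgent:
    """
    Production agent with 85% cache hit rate.
    Key: append-only, static-first ordering.
    """
    def __init__(self, system_prompt: str, tools: list[dict], handlers: dict):
        self.client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.manager = CacheAwareMessageManager()
        self.manager.initialize(system_prompt, tools)
        # Tools stay in this order forever, even if some are never used
        self.tools = tools
        # Assumed shape: tool name -> async function taking the tool input dict
        self.handlers = handlers

    async def run_iteration(self, user_input: str):
        """Single iteration - just append to history."""
        history = self.manager.messages
        history.append({"role": "user", "content": user_input})
        response = await self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=history[0]["content"],
            messages=history[1:],
            tools=self.tools,  # Same order, every time
        )
        # Record the assistant turn as plain dicts (text and tool_use blocks)
        blocks = []
        for block in response.content:
            if block.type == "text":
                blocks.append({"type": "text", "text": block.text})
            elif block.type == "tool_use":
                blocks.append({"type": "tool_use", "id": block.id,
                               "name": block.name, "input": block.input})
        if blocks:
            history.append({"role": "assistant", "content": blocks})
        # Handle tool calls by appending, never modifying
        results = []
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": await self.execute_tool(block.name, block.input),
                })
        if results:
            history.append({"role": "user", "content": results})
        return response

    async def execute_tool(self, name: str, tool_input: dict) -> str:
        """Execute and return result - success or failure."""
        try:
            result = await self.handlers[name](tool_input)
            return json.dumps(result)
        except Exception as e:
            # Return error as result, don't delete the call
            return json.dumps({"error": str(e)})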

Notice: even failed tool calls stay in history. I append the error as a result. The agent learns from failures. The cache stays intact.
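One detail worth spelling out: on Anthropic’s Messages API, caching is requested by marking breakpoints with `cache_control` on content blocks. The sketch below only builds the request arguments and does not call the API. `build_request` is a hypothetical helper, and placing one breakpoint on the system prompt and one on the last message is one reasonable layout, not the only one. It relies on string content being shorthand for a single text block.

```python
import copy


def build_request(system_prompt: str, tools: list[dict], messages: list[dict]) -> dict:
    """Return keyword arguments for a Messages API call with cache breakpoints set."""
    request_messages = copy.deepcopy(messages)  # never mutate the stored history
    if request_messages:
        last = request_messages[-1]
        # Normalize string content into a block list so it can carry cache_control
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        if last["content"]:
            # Breakpoint at the end of the history sent so far
            last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 4096,
        "tools": tools,
        # Breakpoint after the static part: tools + system prompt
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": request_messages,
    }


history = [{"role": "user", "content": "Summarize the repo"}]
request = build_request("You are a careful agent.", [], history)
print(request["messages"][-1]["content"][-1]["cache_control"])
print(history[-1]["content"])  # stored history is untouched
```

The result can be passed along as `client.messages.create(**request)`. Because the marker is added to a copy, the stored history stays byte-for-byte append-only while the breakpoint moves forward with each request.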

Measuring Cache Effectiveness

I added monitoring to track the actual savings:

cache_monitor.py
import json
import tiktoken

class CacheMonitor:
    def __init__(self):
        self.requests = []
        # cl100k_base is an OpenAI encoding, so counts for Claude are approximate
        self.encoder = tiktoken.get_encoding("cl100k_base")

    def log_request(self, messages: list[dict], cached_tokens: int):
        # cached_tokens should come from the usage data in the API response
        total_tokens = sum(
            len(self.encoder.encode(json.dumps(m)))
            for m in messages
        )
        new_tokens = total_tokens - cached_tokens
        self.requests.append({
            "total": total_tokens,
            "cached": cached_tokens,
            "new": new_tokens,
            "cache_hit_rate": cached_tokens / total_tokens if total_tokens > 0 else 0
        })

    def get_summary(self) -> dict:
        if not self.requests:
            return {}
        total_cached = sum(r["cached"] for r in self.requests)
        total_new = sum(r["new"] for r in self.requests)
        total_all = total_cached + total_new
        # Without caching, we'd pay for all tokens
        # With caching, we pay ~1/10 for cached tokens
        cost_without_cache = total_all
        cost_with_cache = total_new + (total_cached * 0.1)
        savings = (cost_without_cache - cost_with_cache) / cost_without_cache
        return {
            "total_requests": len(self.requests),
            "total_tokens": total_all,
            "cached_tokens": total_cached,
            "new_tokens": total_new,
            "average_cache_hit_rate": total_cached / total_all if total_all > 0 else 0,
            "estimated_cost_reduction": f"{savings:.1%}"
        }

Running this for a week:

Weekly cache statistics
Total requests: 1,247
Total tokens: 31.2M
Cached tokens: 26.4M
New tokens: 4.8M
Average cache hit rate: 84.6%
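Estimated cost reduction: 76.2%

The math works out. With about 85% of tokens cached and cached tokens costing 1/10th, the monitor’s formula gives 4.8M + 0.1 × 26.4M = 7.44M token-equivalents against 31.2M without caching, so I’m paying roughly a quarter of the non-cached input price.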
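The hit rate can also be read from the usage data the API returns, instead of being estimated with a tokenizer. A minimal sketch, assuming the usage fields `input_tokens`, `cache_creation_input_tokens` and `cache_read_input_tokens`, with `input_tokens` counting only the uncached part. `cache_stats` is a hypothetical helper and the example numbers are made up.

```python
def cache_stats(usage: dict) -> dict:
    """Summarize one response's input-token usage."""
    uncached = usage.get("input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    total = uncached + written + read
    return {
        "total_input": total,
        "hit_rate": read / total if total else 0.0,
    }


# Hypothetical usage values, for illustration only
example = {
    "input_tokens": 2_000,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 18_000,
}
print(cache_stats(example))
```

With the Python SDK, something like `cache_stats(response.usage.model_dump())` should produce the input dict. A hit rate that stays at zero usually means no breakpoint was set or the prefix changed.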

When You Must Modify History

Sometimes you genuinely need to modify history - for example, when implementing a “learn from correction” system where you want to fix past mistakes.

The solution: append corrections, don’t modify the original:

correction_handling.py
class CorrectingAgent(AppendOnlyAgent):
    """
    Handles corrections without breaking cache.
    Appends corrections instead of editing history.
    """
    async def apply_correction(
        self,
        message_index: int,
        original_content: str,
        corrected_content: str
    ):
        """
        Instead of:
            self.manager.messages[message_index]["content"] = corrected_content
        Do this:
            append a correction instruction
        """
        # Append the correction as one new user message; nothing earlier changes
        self.manager.messages.append({
            "role": "user",
            "content": (
                f"CORRECTION: In message {message_index}, the content was wrong.\n"
                f"Original: {original_content}\n"
                f"Corrected: {corrected_content}\n"
                "Please use the corrected version for future reasoning."
            ),
        })

The agent sees both the mistake and the correction. Cache remains valid. Cost stays low.

The Architectural Takeaway

Prompt caching fundamentally changed how I design agent systems:

  1. History management is dead - Don’t truncate, don’t summarize, don’t optimize. Just append.

  2. Tool ordering matters once - Choose an order at initialization, never change it.

  3. Static content first - System prompts, tool definitions, and any unchanging context goes at the start.

  4. Monitor cache hit rate - If it drops below 70%, you’re probably breaking cache somewhere.

  5. Embrace append-only - This isn’t a limitation, it’s a feature. Simpler architecture, lower costs, better debugging.

The $2,400 bill is now $480. Same agent, same tasks, same outputs. The only difference: understanding that in prompt caching, order is everything and modification is the enemy.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.

