Prompt Caching for AI Agents: How Prefix Matching Cuts Costs by 80%
My AI agent’s API bill hit $2,400 last month. For a simple task orchestration system running maybe 50 iterations a day. Something was wrong.
I opened my Anthropic dashboard and stared at the token counts. Each agent iteration was sending 50,000+ tokens to the API. By iteration 30, I was paying for the same system prompt and tool definitions 30 times over.
There had to be a better way. That’s when I discovered prompt caching - and why most production agents never modify their message history.
The Expensive Reality of Agent Loops
Here’s what my agent loop looked like:
Iteration 1: send 5,000 tokens (system + tools)
Iteration 2: send 7,000 tokens (system + tools + history)
Iteration 3: send 9,000 tokens (system + tools + more history)
...
Iteration 30: send 50,000+ tokens

Every single request re-processed the same system prompt. The same tool definitions. The same conversation history from 10 iterations ago.
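To put rough numbers on that growth, here is a small sketch. It assumes a fixed 5,000-token static prefix and about 2,000 new history tokens per iteration, which is the pattern of the first three iterations above; `tokens_sent` and `total_tokens_sent` are names invented for this illustration, and real iterations will vary in size.

```python
def tokens_sent(iteration: int,
                static_tokens: int = 5_000,
                growth_per_iteration: int = 2_000) -> int:
    """Input tokens sent on a 1-based iteration when nothing is cached."""
    return static_tokens + growth_per_iteration * (iteration - 1)


def total_tokens_sent(iterations: int) -> int:
    """Input tokens sent across a whole run of `iterations` requests."""
    return sum(tokens_sent(i) for i in range(1, iterations + 1))
```

Under these assumptions, iteration 30 alone sends 63,000 tokens and the whole 30-iteration run sends 1,020,000. Each request grows linearly, so the total for a run grows quadratically with the number of iterations.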
I tried truncating history. That helped with costs but broke the agent’s ability to reference earlier context. I tried summarization. That added latency and sometimes lost critical details.
Then I found prompt caching.
How Prompt Caching Actually Works
Prompt caching stores the computed KV-cache for processed tokens. When your next request has the same prefix, the server reuses the cache instead of recomputing.
Request 1: [A][B][C][D][E]     <- compute all
Request 2: [A][B][C][D][F]     <- reuse A-D, compute F only
Request 3: [A][B][C][D][F][G]  <- reuse A-F, compute G only

Anthropic charges 1/10th the price for cached tokens. With 18,000 cached tokens and 2,000 new tokens, my next request cost roughly 20% of a non-cached equivalent.
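The same idea in code, as a simplified sketch: each request is modeled as a list of blocks, and the reusable part is the longest run of identical blocks at the start. The helper names are hypothetical, and the cost function only models the 1/10th read price; it ignores the surcharge Anthropic applies when a prefix is first written to the cache.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading blocks that are identical in both requests."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first difference ends the reusable prefix
        n += 1
    return n


def relative_cost(cached_tokens: int, new_tokens: int,
                  cached_price_ratio: float = 0.1) -> float:
    """Input cost as a fraction of the same request with no caching."""
    total = cached_tokens + new_tokens
    if total == 0:
        return 0.0
    return (cached_tokens * cached_price_ratio + new_tokens) / total
```

For the requests above, `shared_prefix_len(list("ABCDE"), list("ABCDF"))` is 4, and `relative_cost(18_000, 2_000)` comes out at 0.19, which is where the "roughly 20%" figure comes from.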
But there’s a catch that took me three failed attempts to understand.
The Prefix Constraint Is Brutal
The cache matches prefixes strictly. If messages 1-20 are identical to your last request, they hit cache. But modify even one token in the middle? Everything after that point re-computes.
I learned this the hard way:
def manage_messages(self, messages, max_tokens=40000):
    # I thought truncating old history was smart...
    if count_tokens(messages) > max_tokens:
        messages = messages[-20:]  # Keep last 20 messages
    return messages

This looked reasonable. But every time I truncated, the cache completely invalidated. The first message that used to match the cached prefix was now different. No cache hit. Full re-computation. Maximum cost.
My second mistake was reordering tools based on usage frequency:
def optimize_tools(self, tools):
    # I reordered tools thinking it would help...
    used_tools = [t for t in tools if t.name in self.used_tool_names]
    unused_tools = [t for t in tools if t.name not in self.used_tool_names]
    return used_tools + unused_tools  # WRONG: reordering breaks cache

Even though the content was identical, the order changed. Cache broken.
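One way to guard against accidental reordering is to put the tools into a canonical order once and log a fingerprint of the serialized result. This is a sketch rather than part of the original setup: it assumes each tool is a dict with a "name" key, and `canonical_tools` and `tools_fingerprint` are hypothetical helpers.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Return the tools sorted by name, independent of how often each is used."""
    return sorted(tools, key=lambda tool: tool["name"])


def tools_fingerprint(tools: list[dict]) -> str:
    """Stable hash of the canonical tool list; a changed value means a changed prefix."""
    payload = json.dumps(canonical_tools(tools), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If the fingerprint logged at startup ever differs between two runs of the same agent, the tool definitions changed somewhere, and the cached prefix is gone with them.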
The Cache-Friendly Architecture
After studying how Manus and other production systems handle this, I rewrote my message manager:
import json


class CacheAwareMessageManager:
    """
    Three rules for cache-friendly agents:
    1. Static content first (system, tools)
    2. Only append to messages, never modify
    3. Never delete from history
    """

    def __init__(self):
        # Order matters! Static first, dynamic last
        self.messages = []

    def initialize(self, system_prompt: str, tools: list[dict]):
        """Called once at agent start - these are static forever."""
        self.messages = [
            # messages[0] holds the system prompt; the agent passes it to the
            # API through the top-level system parameter, not as a message
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": json.dumps(tools)},
        ]

    def append_interaction(
        self,
        tool_call: dict,
        tool_result: str,
        user_response: str
    ):
        """
        ONLY append - never modify previous messages.
        Even if a tool call was wrong, append the correction.
        Even if history is long, keep appending.

        tool_call is expected to carry "id", "name", and "input" (the shape
        of an Anthropic tool_use block). Empty arguments are skipped so no
        blank messages end up in the history.
        """
        if tool_call:
            self.messages.append({
                "role": "assistant",
                "content": [{"type": "tool_use", **tool_call}],
            })
            # Anthropic expects tool results as tool_result blocks in a user message
            self.messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_call["id"],
                    "content": tool_result,
                }],
            })
        if user_response:
            self.messages.append({"role": "user", "content": user_response})

    def get_messages(self) -> list[dict]:
        """Return full history - let caching handle efficiency."""
        return self.messages

    def get_token_stats(self) -> dict:
        """Track history size (a rough proxy, not a measured cache hit rate)."""
        return {
            "total_messages": len(self.messages),
            "estimated_tokens": sum(
                self._estimate_tokens(m) for m in self.messages
            )
        }

    def _estimate_tokens(self, message: dict) -> int:
        # Rough estimate: 4 chars per token
        content = json.dumps(message)
        return len(content) // 4

The key insight: never truncate, never reorder, never modify. Just append.
Why This Changes Agent Architecture
This constraint shapes every architectural decision:
BEFORE (cache-unaware):
- Truncate old messages to save tokens
- Reorder tools by usage frequency
- Edit message history to fix errors
- Result: $2,400/month, no cache hits

AFTER (cache-aware):
- Keep all message history
- Tools in fixed order (even unused ones)
- Append corrections, never edit
- Result: $480/month, 85% cache hit rate

The cost difference is massive. But more importantly, the architecture becomes simpler. I no longer worry about history management. I no longer try to be clever with tool ordering.
Common Mistakes I Made
Mistake 1: Modifying message content
# WRONG: Fixing a typo in history
self.messages[3]["content"] = self.messages[3]["content"].replace(
    "eror", "error"
)
# Cache breaks from message 3 onwards

Mistake 2: Deleting unused tool results
# WRONG: Removing failed tool calls
self.messages = [
    m for m in self.messages
    if not is_failed_tool_call(m)
]
# All subsequent cache invalidates

Mistake 3: Reordering by change frequency
# WRONG: Putting most-changed content first
self.messages = sorted(
    self.messages,
    key=lambda m: estimate_change_frequency(m),
    reverse=True
)
# Completely different prefix order

All three seemed like good ideas at the time. All three destroyed my cache hit rate.
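Mistakes like these are easy to reintroduce later, so it can help to check the history before every request. The following is a minimal sketch of such a check, assuming messages are JSON-serializable dicts; `PrefixGuard` is a hypothetical helper, not something from an SDK.

```python
import hashlib
import json
from typing import Optional


class PrefixGuard:
    """Detects edits, deletions, or reordering in a message history between requests."""

    def __init__(self) -> None:
        self._digests: list[str] = []

    @staticmethod
    def _digest(message: dict) -> str:
        data = json.dumps(message, sort_keys=True).encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def check(self, messages: list[dict]) -> Optional[int]:
        """Return the index of the first message that changed since the last
        check, or None if the history was only appended to."""
        current = [self._digest(m) for m in messages]
        first_diff = None
        for i, old in enumerate(self._digests):
            if i >= len(current) or current[i] != old:
                first_diff = i
                break
        self._digests = current  # remember this snapshot for the next call
        return first_diff
```

Calling `guard.check(messages)` right before each API request and logging any non-None result points at the exact message where the cached prefix stops matching.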
The Right Approach: Append-Only History
Here’s what works:
import json

from anthropic import AsyncAnthropic


class AppendOnlyAgent:
    """
    Production agent with 85% cache hit rate.
    Key: append-only, static-first ordering.
    """

    def __init__(self, system_prompt: str, tools: list[dict], handlers: dict):
        self.client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
        # Tools stay in this order forever, even if some are never used
        self.tools = tools
        # Maps tool name -> async callable that takes the tool input
        # (a convention of this example, not part of the SDK)
        self.handlers = handlers
        self.manager = CacheAwareMessageManager()
        self.manager.initialize(system_prompt, tools)

    async def run_iteration(self, user_input: str):
        """Single iteration - just append to history."""
        self.manager.append_interaction(
            tool_call={},
            tool_result="",
            user_response=user_input
        )

        response = await self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=self.manager.messages[0]["content"],
            messages=self.manager.messages[1:],
            tools=self.tools,  # Same order, every time
        )

        # Handle tool calls by appending, never modifying
        for block in response.content:
            if block.type == "tool_use":
                result = await self.execute_tool(block)
                self.manager.append_interaction(
                    tool_call={"id": block.id, "name": block.name, "input": block.input},
                    tool_result=result,
                    user_response=""
                )

        return response

    async def execute_tool(self, tool_call) -> str:
        """Execute and return result - success or failure."""
        try:
            result = await self.handlers[tool_call.name](tool_call.input)
            return json.dumps(result)
        except Exception as e:
            # Return error as result, don't delete the call
            return json.dumps({"error": str(e)})

Notice: even failed tool calls stay in history. I append the error as a result. The agent learns from failures. The cache stays intact.
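One detail the loop above leaves implicit: with Anthropic's API, as documented at the time of writing, a prefix is cached only up to a content block that carries a `cache_control` marker. A common pattern is to mark the last block of the outgoing request while leaving the stored history untouched. Here is a minimal sketch of that pattern; `with_cache_breakpoint` is a hypothetical helper, and it assumes every message has non-empty content.

```python
import copy


def with_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` whose final content block is marked for caching.

    The stored history is never mutated, so it stays strictly append-only.
    """
    if not messages:
        return []
    marked = copy.deepcopy(messages)
    last = marked[-1]
    # Plain-string content is normalised into a single text block first
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked
```

In the agent this would mean passing `messages=with_cache_breakpoint(self.manager.messages[1:])` to the API call instead of the raw slice. Whether one moving breakpoint is enough depends on how many blocks each iteration adds, so treat it as a starting point.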
Measuring Cache Effectiveness
I added monitoring to track the actual savings:
import json

import tiktoken


class CacheMonitor:
    def __init__(self):
        self.requests = []
        # cl100k_base is an OpenAI encoding, so counts for Claude are approximate
        self.encoder = tiktoken.get_encoding("cl100k_base")

    def log_request(self, messages: list[dict], cached_tokens: int):
        total_tokens = sum(
            len(self.encoder.encode(json.dumps(m)))
            for m in messages
        )
        new_tokens = total_tokens - cached_tokens

        self.requests.append({
            "total": total_tokens,
            "cached": cached_tokens,
            "new": new_tokens,
            "cache_hit_rate": cached_tokens / total_tokens if total_tokens > 0 else 0
        })

    def get_summary(self) -> dict:
        if not self.requests:
            return {}

        total_cached = sum(r["cached"] for r in self.requests)
        total_new = sum(r["new"] for r in self.requests)
        total_all = total_cached + total_new
        if total_all == 0:
            return {}

        # Without caching, we'd pay for all tokens
        # With caching, we pay ~1/10 for cached tokens
        cost_without_cache = total_all
        cost_with_cache = total_new + (total_cached * 0.1)
        savings = (cost_without_cache - cost_with_cache) / cost_without_cache

        return {
            "total_requests": len(self.requests),
            "total_tokens": total_all,
            "cached_tokens": total_cached,
            "new_tokens": total_new,
            "average_cache_hit_rate": total_cached / total_all,
            "estimated_cost_reduction": f"{savings:.1%}"
        }

Running this for a week:
Total requests: 1,247
Total tokens: 31.2M
Cached tokens: 26.4M
New tokens: 4.8M
Average cache hit rate: 84.6%
Estimated cost reduction: 76.2%

The math works out. With 85% of tokens cached and cached tokens costing 1/10th, I’m paying roughly 24% of the non-cached price for input tokens (0.15 + 0.85 × 0.10 ≈ 0.24).
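Rather than estimating token counts locally, the usage data returned with each API response can be read directly. The sketch below assumes the usage field names Anthropic documents (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`); the `Usage` dataclass is only a stand-in for testing, and the 1.25 write multiplier is an assumption that should be checked against current pricing.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in with the same field names as the usage object on a response."""
    input_tokens: int = 0
    cache_read_input_tokens: int = 0
    cache_creation_input_tokens: int = 0


def _field(usage, name: str) -> int:
    # Real responses may report None for the cache fields
    return getattr(usage, name, 0) or 0


def cache_hit_rate(usage) -> float:
    """Share of input tokens that were served from cache."""
    read = _field(usage, "cache_read_input_tokens")
    total = read + _field(usage, "input_tokens") + _field(usage, "cache_creation_input_tokens")
    return read / total if total else 0.0


def relative_input_cost(usage, read_ratio: float = 0.1, write_ratio: float = 1.25) -> float:
    """Input cost as a fraction of the same request with no caching at all."""
    fresh = _field(usage, "input_tokens")
    read = _field(usage, "cache_read_input_tokens")
    written = _field(usage, "cache_creation_input_tokens")
    total = fresh + read + written
    if total == 0:
        return 0.0
    return (fresh + read * read_ratio + written * write_ratio) / total
```

Passing `response.usage` from each request into these functions gives a hit rate based on what the API actually billed, including the cost of writing new prefixes to the cache.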
When You Must Modify History
Sometimes you genuinely need to modify history - for example, when implementing a “learn from correction” system where you want to fix past mistakes.
The solution: append corrections, don’t modify the original:
class CorrectingAgent(AppendOnlyAgent):
    """
    Handles corrections without breaking cache.
    Appends corrections instead of editing history.
    """

    async def apply_correction(
        self,
        message_index: int,
        original_content: str,
        corrected_content: str
    ):
        """
        Instead of:
            self.messages[message_index]["content"] = corrected_content

        Do this: append a correction instruction
        """
        # Append correction instruction
        self.manager.append_interaction(
            tool_call={},
            tool_result="",
            user_response=f"""
            CORRECTION: In message {message_index}, the content was wrong.
            Original: {original_content}
            Corrected: {corrected_content}
            Please use the corrected version for future reasoning.
            """
        )

The agent sees both the mistake and the correction. Cache remains valid. Cost stays low.
The Architectural Takeaway
Prompt caching fundamentally changed how I design agent systems:
- History management is dead: don’t truncate, don’t summarize, don’t optimize. Just append.
- Tool ordering matters once: choose an order at initialization, never change it.
- Static content first: system prompts, tool definitions, and any unchanging context go at the start.
- Monitor cache hit rate: if it drops below 70%, you’re probably breaking cache somewhere.
- Embrace append-only: this isn’t a limitation, it’s a feature. Simpler architecture, lower costs, better debugging.
The $2,400 bill is now $480. Same agent, same tasks, same outputs. The only difference: understanding that in prompt caching, order is everything and modification is the enemy.