Prompt Caching for AI Agents: How Prefix Matching Cuts Costs by 80%

Mar 25, 2026

My AI agent’s API bill hit $2,400 last month. For a simple task orchestration system running maybe 50 iterations a day. Something was wrong.

I opened my Anthropic dashboard and stared at the token counts. Each agent iteration was sending 50,000+ tokens to the API. By iteration 30, I was paying for the same system prompt and tool definitions 30 times over.

There had to be a better way. That’s when I discovered prompt caching - and why most production agents never modify their message history.

The Expensive Reality of Agent Loops

Here’s what my agent loop looked like:

Iteration 1:  send 5,000 tokens (system + tools)
Iteration 2:  send 7,000 tokens (system + tools + history)
Iteration 3:  send 9,000 tokens (system + tools + more history)
...
Iteration 30: send 50,000+ tokens

Every single request re-processed the same system prompt. The same tool definitions. The same conversation history from 10 iterations ago.
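To put a number on that growth, here is a rough back-of-the-envelope sketch. It assumes the prompt starts at 5,000 tokens and grows by a flat 2,000 tokens per iteration, as in the listing above. The function name `prompt_tokens` is my own, and real growth is rarely this regular.

```python
def prompt_tokens(iteration: int, base: int = 5_000, growth: int = 2_000) -> int:
    """Tokens sent on a given iteration (1-indexed), assuming linear growth."""
    return base + growth * (iteration - 1)


# Total input tokens billed across a 30-iteration run with no caching.
total = sum(prompt_tokens(i) for i in range(1, 31))

print(prompt_tokens(30))  # 63000
print(total)              # 1020000
```

Under these assumptions a single run bills about a million input tokens, even though the final prompt holds only 63,000 tokens of actual content.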

I tried truncating history. That helped with costs but broke the agent’s ability to reference earlier context. I tried summarization. That added latency and sometimes lost critical details.

Then I found prompt caching.

How Prompt Caching Actually Works

Prompt caching stores the computed KV-cache for processed tokens. When your next request has the same prefix, the server reuses the cache instead of recomputing.

Request 1: [A][B][C][D][E]  <- compute all
Request 2: [A][B][C][D][F]  <- reuse A-D, compute F only
Request 3: [A][B][C][D][F][G] <- reuse A-F, compute G only

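Anthropic charges about 1/10th of the normal input price for tokens read from the cache (writing to the cache costs somewhat more than a normal input token, so the savings come from repeated reads). With 18,000 cached tokens and 2,000 new tokens, my next request cost roughly 20% of a non-cached equivalent.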
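The arithmetic behind that figure can be sketched as follows. This is a simplified model: the 0.1 read multiplier is the 1/10th figure quoted above, the helper name `relative_input_cost` is hypothetical, and it ignores output tokens and any premium for writing to the cache.

```python
def relative_input_cost(cached: int, new: int, read_multiplier: float = 0.1) -> float:
    """Input cost of a request relative to sending every token uncached."""
    total = cached + new
    if total == 0:
        return 0.0
    return (cached * read_multiplier + new) / total


print(f"{relative_input_cost(18_000, 2_000):.0%}")  # 19%
```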

But there’s a catch that took me three failed attempts to understand.

The Prefix Constraint Is Brutal

The cache matches prefixes strictly. If messages 1-20 are identical to your last request, they hit cache. But modify even one token in the middle? Everything after that point re-computes.
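A small sketch makes the constraint concrete. It compares two requests message by message and reports how many leading messages are identical. The real cache works on tokens, not whole messages, so treat this as an approximation; `shared_prefix_len` is a name I made up for illustration.

```python
def shared_prefix_len(previous: list[dict], current: list[dict]) -> int:
    """Number of leading messages that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


history = [{"role": "user", "content": f"step {i}"} for i in range(20)]

appended = history + [{"role": "user", "content": "step 20"}]
edited = [dict(m) for m in history]
edited[3]["content"] = "step 3 (fixed)"
truncated = history[-10:]

print(shared_prefix_len(history, appended))   # 20
print(shared_prefix_len(history, edited))     # 3
print(shared_prefix_len(history, truncated))  # 0
```

Appending keeps all 20 messages reusable, a single edit at index 3 leaves only the first three, and truncation leaves nothing.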

I learned this the hard way:

def manage_messages(self, messages, max_tokens=40000):
    # I thought truncating old history was smart...
    if count_tokens(messages) > max_tokens:
        messages = messages[-20:]  # Keep last 20 messages
    return messages

This looked reasonable. But every time I truncated, the cache was completely invalidated. The first message in the request no longer matched the first message of the cached prefix, so nothing after it could match either. No cache hit. Full re-computation. Maximum cost.

My second mistake was reordering tools based on usage frequency:

def optimize_tools(self, tools):
    # I reordered tools thinking it would help...
    used_tools = [t for t in tools if t.name in self.used_tool_names]
    unused_tools = [t for t in tools if t.name not in self.used_tool_names]
    return used_tools + unused_tools  # WRONG: reordering breaks cache

Even though the content was identical, the order changed. Cache broken.
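One way to catch this kind of drift early is to fingerprint the serialized tool list at startup and compare it before every request. The sketch below is an assumption of mine, not something from a library: `tools_fingerprint` is a hypothetical helper and the two tool entries are made-up placeholders.

```python
import hashlib
import json


def tools_fingerprint(tools: list[dict]) -> str:
    """Hash of the serialized tool list; changes if order or content changes."""
    payload = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools = [
    {"name": "search", "description": "Search the web"},
    {"name": "read_file", "description": "Read a file"},
]

# Pick one order at startup (here: alphabetical by name) and never touch it again.
fixed_tools = sorted(tools, key=lambda t: t["name"])
baseline = tools_fingerprint(fixed_tools)

reordered = list(reversed(fixed_tools))
print(tools_fingerprint(fixed_tools) == baseline)  # True
print(tools_fingerprint(reordered) == baseline)    # False
```

`sort_keys=True` only normalizes the key order inside each tool definition. The order of the list itself is preserved, which is why the reordered list produces a different hash.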

The Cache-Friendly Architecture

After studying how Manus and other production systems handle this, I rewrote my message manager:

import json


class CacheAwareMessageManager:
    """
    Three rules for cache-friendly agents:
    1. Static content first (system, tools)
    2. Only append to messages, never modify
    3. Never delete from history
    """

    def __init__(self):
        # Order matters! Static first, dynamic last
        self.messages = []

    def initialize(self, system_prompt: str, tools: list[dict]):
        """Called once at agent start - these are static forever."""
        self.messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": json.dumps(tools)}
        ]

    def append_interaction(
        self,
        tool_call: dict | None = None,
        tool_result: str = "",
        user_response: str = ""
    ):
        """
        ONLY append - never modify previous messages.
        Even if a tool call was wrong, append the correction.
        Even if history is long, keep appending.
        Empty arguments are skipped so blank messages never enter history.
        """
        if tool_call:
            # Assumes tool_call holds the "id", "name" and "input" of an
            # Anthropic tool_use block; its result goes back in a user message.
            self.messages.append({
                "role": "assistant",
                "content": [{"type": "tool_use", **tool_call}]
            })
            self.messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_call["id"],
                    "content": tool_result
                }]
            })
        if user_response:
            self.messages.append({"role": "user", "content": user_response})

    def get_messages(self) -> list[dict]:
        """Return full history - let caching handle efficiency."""
        return self.messages

    def get_token_stats(self) -> dict:
        """Track history size (a rough proxy for prompt tokens)."""
        return {
            "total_messages": len(self.messages),
            "estimated_tokens": sum(
                self._estimate_tokens(m) for m in self.messages
            )
        }

    def _estimate_tokens(self, message: dict) -> int:
        # Rough estimate: 4 chars per token
        content = json.dumps(message)
        return len(content) // 4

The key insight: never truncate, never reorder, never modify. Just append.

Why This Changes Agent Architecture

This constraint shapes every architectural decision:

BEFORE (cache-unaware):
- Truncate old messages to save tokens
- Reorder tools by usage frequency
- Edit message history to fix errors
- Result: $2,400/month, no cache hits

AFTER (cache-aware):
- Keep all message history
- Tools in fixed order (even unused ones)
- Append corrections, never edit
- Result: $480/month, 85% cache hit rate

The cost difference is massive. But more importantly, the architecture becomes simpler. I no longer worry about history management. I no longer try to be clever with tool ordering.

Common Mistakes I Made

Mistake 1: Modifying message content

# WRONG: Fixing a typo in history
self.messages[3]["content"] = self.messages[3]["content"].replace(
    "eror", "error"
)
# Cache breaks from message 3 onwards

Mistake 2: Deleting failed tool calls

# WRONG: Removing failed tool calls
self.messages = [
    m for m in self.messages
    if not is_failed_tool_call(m)
]
# Everything after the first removed message misses the cache

Mistake 3: Reordering by change frequency

# WRONG: Putting most-changed content first
self.messages = sorted(
    self.messages,
    key=lambda m: estimate_change_frequency(m),
    reverse=True
)
# Completely different prefix order

All three seemed like good ideas at the time. All three destroyed my cache hit rate.
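A cheap guard against all three mistakes is to snapshot the history that was last sent and check, before the next request, that the new history only extends it. This is a minimal sketch under my own naming (`assert_append_only`, `HistoryMutationError`); it compares whole messages rather than tokens.

```python
import copy


class HistoryMutationError(RuntimeError):
    """Raised when a request would not extend the previous one."""


def assert_append_only(previous: list[dict], current: list[dict]) -> None:
    """Check that `previous` is an exact prefix of `current`."""
    if len(current) < len(previous):
        raise HistoryMutationError("history got shorter: something was deleted")
    for index, (old, new) in enumerate(zip(previous, current)):
        if old != new:
            raise HistoryMutationError(f"message {index} changed since last request")


sent = [{"role": "user", "content": "fix the eror"}]
snapshot = copy.deepcopy(sent)

sent.append({"role": "assistant", "content": "On it."})
assert_append_only(snapshot, sent)  # passes: pure append

sent[0]["content"] = "fix the error"
try:
    assert_append_only(snapshot, sent)
except HistoryMutationError as exc:
    print(exc)  # message 0 changed since last request
```

The deep copy matters: a shallow snapshot would share the message dicts and silently change along with the history.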

The Right Approach: Append-Only History

Here’s what works:

import json

import anthropic


class AppendOnlyAgent:
    """
    Production agent with 85% cache hit rate.
    Key: append-only, static-first ordering.
    """

    def __init__(self, system_prompt: str, tools: list[dict], handlers: dict):
        self.manager = CacheAwareMessageManager()
        self.manager.initialize(system_prompt, tools)
        # Tools stay in this order forever, even if some are never used
        self.tools = tools
        # Assumption: handlers maps each tool name to an async callable
        # that takes the tool input and returns a JSON-serializable result
        self.handlers = handlers
        self.client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY

    async def run_iteration(self, user_input: str) -> anthropic.types.Message:
        """Single iteration - just append to history."""
        self.manager.append_interaction(
            tool_call={},
            tool_result="",
            user_response=user_input
        )

        response = await self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=self.manager.messages[0]["content"],
            messages=self.manager.messages[1:],
            tools=self.tools,  # Same order, every time
        )

        # Handle tool calls by appending, never modifying
        for block in response.content:
            if block.type != "tool_use":
                continue
            tool_call = {"id": block.id, "name": block.name, "input": block.input}
            result = await self.execute_tool(tool_call)
            self.manager.append_interaction(
                tool_call=tool_call,
                tool_result=result,
                user_response=""
            )

        return response

    async def execute_tool(self, tool_call: dict) -> str:
        """Execute and return result - success or failure."""
        try:
            handler = self.handlers[tool_call["name"]]
            result = await handler(tool_call["input"])
            return json.dumps(result)
        except Exception as e:
            # Return error as result, don't delete the call
            return json.dumps({"error": str(e)})

Notice: even failed tool calls stay in history. I append the error as a result. The agent learns from failures. The cache stays intact.
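One detail worth spelling out: as I read Anthropic's prompt caching documentation, caching is opt-in, and you mark where a cacheable prefix ends with a `cache_control` field on a content block. The sketch below builds the request arguments with two such breakpoints while leaving the stored history untouched. `build_cached_request` and the `list_tasks` tool are hypothetical names for illustration, and details such as minimum cacheable length and cache lifetime should be checked against the current docs.

```python
import copy


def build_cached_request(system_prompt: str, tools: list[dict], messages: list[dict]) -> dict:
    """Build messages.create() kwargs with cache breakpoints, leaving history untouched."""
    # Breakpoint 1: end of the static prefix (tools come before system in the prefix).
    system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]

    # Breakpoint 2: the final block of the final message, so the conversation
    # so far can be reused by the next request. Work on a copy, not the history.
    outgoing = copy.deepcopy(messages)
    if outgoing:
        last = outgoing[-1]
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        last["content"][-1]["cache_control"] = {"type": "ephemeral"}

    return {"system": system, "tools": tools, "messages": outgoing}


tools = [{
    "name": "list_tasks",
    "description": "List open tasks",
    "input_schema": {"type": "object", "properties": {}},
}]
history = [{"role": "user", "content": "List the open tasks."}]
kwargs = build_cached_request("You are a task orchestrator.", tools, history)

print(kwargs["messages"][-1]["content"][-1]["cache_control"])  # {'type': 'ephemeral'}
print(history[-1]["content"])                                  # List the open tasks.
```

The result would be passed along as `client.messages.create(model=..., max_tokens=..., **kwargs)`. Copying before adding the marker keeps the append-only history free of markers, so old breakpoints do not pile up on earlier messages.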

Measuring Cache Effectiveness

I added monitoring to track the actual savings:

import json

import tiktoken

class CacheMonitor:
    def __init__(self):
        self.requests = []
        # cl100k_base is an OpenAI encoding, so counts only approximate Claude's tokenizer
        self.encoder = tiktoken.get_encoding("cl100k_base")

    def log_request(self, messages: list[dict], cached_tokens: int):
        total_tokens = sum(
            len(self.encoder.encode(json.dumps(m)))
            for m in messages
        )
        new_tokens = total_tokens - cached_tokens

        self.requests.append({
            "total": total_tokens,
            "cached": cached_tokens,
            "new": new_tokens,
            "cache_hit_rate": cached_tokens / total_tokens if total_tokens > 0 else 0
        })

    def get_summary(self) -> dict:
        if not self.requests:
            return {}

        total_cached = sum(r["cached"] for r in self.requests)
        total_new = sum(r["new"] for r in self.requests)
        total_all = total_cached + total_new

        # Without caching, we'd pay for all tokens
        # With caching, we pay ~1/10 for cached tokens
        cost_without_cache = total_all
        cost_with_cache = total_new + (total_cached * 0.1)
        savings = (
            (cost_without_cache - cost_with_cache) / cost_without_cache
            if cost_without_cache > 0 else 0.0
        )

        return {
            "total_requests": len(self.requests),
            "total_tokens": total_all,
            "cached_tokens": total_cached,
            "new_tokens": total_new,
            "average_cache_hit_rate": total_cached / total_all if total_all > 0 else 0,
            "estimated_cost_reduction": f"{savings:.1%}"
        }

Running this for a week:

Total requests: 1,247
Total tokens: 31.2M
Cached tokens: 26.4M
New tokens: 4.8M
Average cache hit rate: 84.6%
Estimated cost reduction: 76.2%

The math works out. With 85% of tokens cached and cached tokens costing 1/10th, I’m paying 0.15 + (0.85 × 0.1) ≈ 0.235, or roughly a quarter of the non-cached input price.
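Estimating tokens locally is only an approximation. The API response carries a `usage` object, and as far as I know it reports `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens` separately, so the hit rate can be taken from there instead. The sketch below accumulates those fields. `UsageTotals` is a hypothetical helper, and the 0.1 and 1.25 multipliers are assumptions about read and write pricing that should be checked against the current price list.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    uncached: int = 0      # usage.input_tokens
    cache_reads: int = 0   # usage.cache_read_input_tokens
    cache_writes: int = 0  # usage.cache_creation_input_tokens

    def add(self, usage) -> None:
        """Accumulate one response's usage object (missing fields count as 0)."""
        self.uncached += getattr(usage, "input_tokens", 0) or 0
        self.cache_reads += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.cache_writes += getattr(usage, "cache_creation_input_tokens", 0) or 0

    @property
    def total(self) -> int:
        return self.uncached + self.cache_reads + self.cache_writes

    @property
    def hit_rate(self) -> float:
        return self.cache_reads / self.total if self.total else 0.0

    def relative_cost(self, read_mult: float = 0.1, write_mult: float = 1.25) -> float:
        """Input cost relative to sending everything uncached."""
        if self.total == 0:
            return 0.0
        billed = (
            self.uncached
            + self.cache_reads * read_mult
            + self.cache_writes * write_mult
        )
        return billed / self.total


# Stand-in for response.usage so the example runs without an API call.
totals = UsageTotals()
totals.add(SimpleNamespace(
    input_tokens=400,
    cache_read_input_tokens=18_000,
    cache_creation_input_tokens=1_600,
))
print(f"{totals.hit_rate:.1%}")         # 90.0%
print(f"{totals.relative_cost():.1%}")  # 21.0%
```

In the agent loop this would be called as `totals.add(response.usage)` after each request.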

When You Must Modify History

Sometimes you genuinely need to modify history - for example, when implementing a “learn from correction” system where you want to fix past mistakes.

The solution: append corrections, don’t modify the original:

class CorrectingAgent(AppendOnlyAgent):
    """
    Handles corrections without breaking cache.
    Appends corrections instead of editing history.
    """

    async def apply_correction(
        self,
        message_index: int,
        original_content: str,
        corrected_content: str
    ):
        """
        Instead of:
          self.messages[message_index]["content"] = corrected_content

        Do this:
          append a correction instruction
        """
        # Append correction instruction (built line by line so the prompt
        # carries no stray indentation from the source code)
        self.manager.append_interaction(
            tool_call={},
            tool_result="",
            user_response=(
                f"CORRECTION: In message {message_index}, the content was wrong.\n"
                f"Original: {original_content}\n"
                f"Corrected: {corrected_content}\n"
                "Please use the corrected version for future reasoning."
            )
        )

The agent sees both the mistake and the correction. Cache remains valid. Cost stays low.

The Architectural Takeaway

Prompt caching fundamentally changed how I design agent systems:

History management is dead - Don’t truncate, don’t summarize, don’t optimize. Just append.
Tool ordering matters once - Choose an order at initialization, never change it.
Static content first - System prompts, tool definitions, and any unchanging context goes at the start.
Monitor cache hit rate - If it drops below 70%, you’re probably breaking cache somewhere.
Embrace append-only - This isn’t a limitation, it’s a feature. Simpler architecture, lower costs, better debugging.

The $2,400 bill is now $480. Same agent, same tasks, same outputs. The only difference: understanding that in prompt caching, order is everything and modification is the enemy.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Anthropic Prompt Caching
👨‍💻 Manus Agent Optimization

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!