How to optimize AI agent costs with model cascading, caching, and budgeting

I built an AI agent last month and my API bill jumped from $50 to $800 in two weeks. The culprit? I used GPT-4 for everything - even simple “is this text positive?” queries that a cheaper model could handle.

In this post, I’ll show how I reduced my agent costs by 80% using three strategies: model cascading, prompt caching, and budgeting. The key point is that you don’t need expensive models for simple tasks.

The Problem: Agent Costs Spiral Fast

I tracked my agent’s API usage and found the pattern:

  • Each agent loop processed ALL conversation history
  • Output tokens cost 3-5x more than input tokens
  • Complex tasks ran 30+ iterations
  • Tree of Thoughts triggered 100+ API calls
  • Multi-agent debates multiplied costs by 5-10x

My agent would spin on a problem, burning tokens until I manually stopped it. No limits, no fallbacks, no awareness of cost.
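To see why resending the full history on every loop gets expensive, here is a rough back-of-the-envelope sketch. The 500 tokens per turn is a made-up figure for illustration, not a number from my logs, and the function name is hypothetical.

```python
def cumulative_input_tokens(iterations, tokens_per_turn, system_tokens=0):
    """Total input tokens billed when every loop resends the whole history."""
    total = 0
    history = system_tokens
    for _ in range(iterations):
        history += tokens_per_turn  # a new turn is appended to the history
        total += history            # the entire history is sent again
    return total


# Assumed workload: 30 iterations, 500 new tokens per iteration
resent = cumulative_input_tokens(iterations=30, tokens_per_turn=500)
single_pass = 30 * 500  # what it would cost if each token were sent once

print(resent)                # 232500
print(resent / single_pass)  # 15.5
```

Input tokens grow roughly with the square of the iteration count, which is why a 30-step task costs far more than 30 single calls.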

I tried a few naive solutions:

  1. Using only cheap models - Quality dropped. Complex reasoning tasks failed.
  2. Hard token limits - Agent stopped mid-task, leaving broken outputs.
  3. Manual intervention - I became a human circuit breaker. Not sustainable.

None of these worked. I needed a smarter approach.

Solution 1: Model Cascading (Route Smart)

The insight: most queries are simple. A query like “extract the date from this text” doesn’t need GPT-4. But “analyze the logical fallacies in this argument” does.

I implemented a router that scores query complexity and sends it to the right model:

model_router.py
class ModelRouter:
    def __init__(self, cheap_model, expensive_model, threshold=0.7):
        self.cheap = cheap_model
        self.expensive = expensive_model
        self.threshold = threshold  # scores below this go to the cheap model

    def route(self, prompt, complexity_scorer):
        # complexity_scorer returns a float from 0.0 (simple) to 1.0 (complex)
        score = complexity_scorer(prompt)
        if score < self.threshold:
            return self.cheap.generate(prompt)
        else:
            return self.expensive.generate(prompt)

The complexity scorer can be as simple as checking for keywords like “analyze”, “reason”, “evaluate”, or you can train a lightweight classifier.
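A minimal sketch of the keyword variant could look like this. The three keywords come from the paragraph above; the score values (0.1 and 0.8) and the 200-word length cutoff are assumptions chosen so the result lands on either side of the 0.7 threshold.

```python
COMPLEX_HINTS = ("analyze", "reason", "evaluate")


def keyword_complexity_scorer(prompt, hints=COMPLEX_HINTS, long_prompt_words=200):
    """Return a rough complexity score between 0.0 and 1.0."""
    text = prompt.lower()
    score = 0.1  # assumed baseline for prompts with no reasoning keywords
    if any(hint in text for hint in hints):
        score = 0.8  # assumed score for prompts that ask for reasoning
    if len(text.split()) > long_prompt_words:
        score = min(1.0, score + 0.2)  # long prompts are nudged upward
    return score


print(keyword_complexity_scorer("extract the date from this text"))
print(keyword_complexity_scorer("analyze the logical fallacies in this argument"))
```

Substring matching is crude: "reason" also matches "reasonable", so a scorer like this is only a starting point before a trained classifier.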

Results reported in the RouteLLM benchmarks (best case):

  • Up to 85% cost reduction
  • About 95% of top model quality retained

My routing targets:

  • Simple queries route to GPT-4o-mini or Haiku
  • Complex queries route to GPT-4o or Sonnet

For my agent, I added one more level - critical decisions (like final answers to users) go to the most capable model.
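One way to express that extra level is a third tier that bypasses the score entirely. This is a sketch under assumptions: the `TieredRouter` name, the `critical` flag, and the use of plain callables instead of model clients are all illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TieredRouter:
    cheap: Callable[[str], str]
    expensive: Callable[[str], str]
    top: Callable[[str], str]
    threshold: float = 0.7

    def route(self, prompt: str, score: float, critical: bool = False) -> str:
        # Critical decisions skip scoring and go to the most capable model
        if critical:
            return self.top(prompt)
        if score < self.threshold:
            return self.cheap(prompt)
        return self.expensive(prompt)


# Stand-in models that just report which tier handled the prompt
router = TieredRouter(
    cheap=lambda p: "cheap",
    expensive=lambda p: "expensive",
    top=lambda p: "top",
)

print(router.route("extract the date", score=0.1))                 # cheap
print(router.route("analyze this argument", score=0.8))            # expensive
print(router.route("final answer to user", score=0.1, critical=True))  # top
```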

Solution 2: Prompt Caching (Reuse Computation)

Prompt caching was my biggest win. The concept: when you send the same prefix to an API multiple times, the provider can reuse the computed KV-cache instead of reprocessing.

For agents, this is huge. Every loop iteration sends the entire conversation history. If you structure messages correctly, 90%+ of tokens can be cached.

Here’s what I changed:

  1. Static content first: System prompt, tool definitions, context documents
  2. Dynamic content last: User queries, agent responses
  3. Never modify history, only append

caching_agent.py
def build_messages(system_prompt, tools_description, history, new_message):
    """
    Structure messages for optimal caching.
    Cache hits only happen on prefix matches.
    """
    # Static content first, so the prefix is identical on every call
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": tools_description},
    ]
    # History is cached after first iteration
    messages.extend(history)
    # New content goes last
    messages.append(new_message)
    return messages

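Cost reduction: 70-80% on cached tokens. Anthropic charges 1/10th of the base input price for tokens read from the cache. Writing a prefix to the cache is billed at a premium over normal input, so caching only pays off when the prefix is actually reused.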
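The savings follow from simple arithmetic. The sketch below assumes cached tokens cost 1/10th of the normal input price and ignores the extra charge for writing to the cache, so treat the result as an upper bound; the function name is hypothetical.

```python
def input_cost_savings(cached_fraction, cache_read_multiplier=0.1):
    """Fraction of input cost saved when part of the prompt is a cache hit."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    uncached = 1.0 - cached_fraction
    relative_cost = uncached + cached_fraction * cache_read_multiplier
    return 1.0 - relative_cost


for fraction in (0.8, 0.9):
    print(f"{fraction:.0%} cached -> {input_cost_savings(fraction):.0%} saved")
```

With 80% to 90% of input tokens served from the cache, this works out to roughly 72% to 81% saved on input cost.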

The mistake I made initially: I was modifying earlier messages in the conversation to “improve” them. This broke the cache. Once I switched to append-only history, costs dropped.
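To make the difference concrete, here is a small sketch that counts how many leading messages two consecutive requests share. Real providers match on token prefixes rather than whole messages, so this is an approximation, and `shared_prefix_len` is a hypothetical helper.

```python
def shared_prefix_len(previous, current):
    """Count leading messages that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


first_request = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Find the invoice date."},
    {"role": "assistant", "content": "Looking it up."},
]

# Append-only: the whole previous request is still a prefix
appended = first_request + [{"role": "user", "content": "Now find the total."}]

# Edited history: rewording an early message changes the prefix
edited = [dict(message) for message in appended]
edited[1]["content"] = "Please find the invoice date."

print(shared_prefix_len(first_request, appended))  # 3
print(shared_prefix_len(first_request, edited))    # 1
```

After the edit, only the system prompt can be reused; everything from the edited message onward is processed again at full price.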

Solution 3: Budgeting (Stop the Bleeding)

Even with routing and caching, agents can spiral. I needed hard limits.

budget_aware_agent.py
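class BudgetExceeded(Exception):
    """Raised when a task spends more than its budget."""


class BudgetAwareAgent:
    def __init__(self, max_cost_dollars=1.0, *,
                 price_per_input_token, price_per_output_token):
        self.max_cost = max_cost_dollars
        self.spent = 0.0
        # Per-token prices in dollars for the model being called
        self.price_per_input_token = price_per_input_token
        self.price_per_output_token = price_per_output_token

    async def run(self, task):
        while not task.complete:
            # Checked before each call, so spend can overshoot by one step
            if self.spent >= self.max_cost:
                raise BudgetExceeded(
                    f"Spent ${self.spent:.2f}, limit ${self.max_cost:.2f}"
                )
            response = await self.step(task)  # one model call (not shown)
            self.spent += self.calculate_cost(response)
        return task.result

    def calculate_cost(self, response):
        # Pricing varies by model
        input_cost = response.input_tokens * self.price_per_input_token
        output_cost = response.output_tokens * self.price_per_output_token
        return input_cost + output_cost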

This saved me during development. Instead of surprise $100 bills, I’d get a clear error: “Budget exceeded at $1.00.”
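The guard is easy to try without touching an API. The sketch below is a self-contained stand-in, not my production agent: `DemoAgent`, `FakeResponse`, the token counts, and the per-token prices are all invented for the demonstration, and the loop simulates a task that never finishes.

```python
import asyncio
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a task spends more than its budget."""


@dataclass
class FakeResponse:
    input_tokens: int
    output_tokens: int


class DemoAgent:
    # Hypothetical prices in dollars per token, not real provider pricing
    price_per_input_token = 3e-6
    price_per_output_token = 15e-6

    def __init__(self, max_cost_dollars=1.0):
        self.max_cost = max_cost_dollars
        self.spent = 0.0
        self.steps = 0

    async def step(self):
        # Stand-in for a model call: fixed usage, no network
        self.steps += 1
        return FakeResponse(input_tokens=20_000, output_tokens=1_000)

    async def run(self):
        while True:  # simulates an agent that never reaches "complete"
            if self.spent >= self.max_cost:
                raise BudgetExceeded(
                    f"Spent ${self.spent:.2f}, limit ${self.max_cost:.2f}"
                )
            response = await self.step()
            self.spent += (
                response.input_tokens * self.price_per_input_token
                + response.output_tokens * self.price_per_output_token
            )


def run_demo(limit=0.25):
    agent = DemoAgent(max_cost_dollars=limit)
    try:
        asyncio.run(agent.run())
    except BudgetExceeded as exc:
        return agent.steps, str(exc)


print(run_demo())
```

Each fake step costs $0.075, so with a $0.25 limit the loop stops after four steps. Because the check runs before the call, the final spend can exceed the limit by up to one step.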

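Reasoning models need their own cap, because the hidden reasoning is billed too. With Claude’s extended thinking, that cap is the budget_tokens parameter (OpenAI’s reasoning models such as o1 use their own, differently named controls):

thinking_budget.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # must be a model that supports extended thinking
    max_tokens=20000,  # must be larger than budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 16000  # Cap reasoning depth
    },
    messages=[...]
)

The budget_tokens parameter limits how much internal reasoning the model does. Thinking tokens are billed as output tokens, so higher values mean more thorough but more expensive responses.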

Bonus: Data Format Matters

I discovered one more optimization: the format of your data affects token count.

Testing the same structured data:

| Format | Token count | Savings vs JSON |
|---|---|---|
| JSON | 1000 | baseline |
| YAML | 900 | 10% |
| Markdown | 660 | 34% |
| CSV (for tables) | 500-600 | 40-50% |

For agent contexts, I switched to Markdown:

format_utils.py
def format_as_markdown(data):
    """Use Markdown instead of JSON for structured data"""
    lines = ["# Data"]
    for key, value in data.items():
        lines.append(f"- **{key}**: {value}")
    return "\n".join(lines)

# Instead of:
# {"name": "Alice", "age": 30, "role": "engineer"}
# Use:
# # Data
# - **name**: Alice
# - **age**: 30
# - **role**: engineer

For tabular data, CSV was the most compact format I tested. I use it for agent memory and logs.
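For the tabular case, a sketch using Python's standard csv module might look like this. The `format_as_csv` name and the sample rows are illustrative, and character count is used only as a rough stand-in for token count, since the exact number depends on the tokenizer.

```python
import csv
import io
import json


def format_as_csv(rows):
    """Render a list of same-shaped dicts as CSV with one header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"name": "Alice", "age": 30, "role": "engineer"},
    {"name": "Bob", "age": 41, "role": "designer"},
]

as_csv = format_as_csv(rows)
as_json = json.dumps(rows)

print(as_csv)
print(len(as_csv), "characters as CSV vs", len(as_json), "as JSON")
```

The saving comes from writing the keys once in the header instead of repeating them in every record, so it grows with the number of rows.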

Putting It Together

My final architecture:

agent-architecture.txt
┌─────────────────────────────────────────────────────────────┐
│                       Incoming Query                        │
└─────────────────────┬───────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│               Complexity Scorer (cheap model)               │
│            Score: 0.0 (simple) to 1.0 (complex)             │
└─────────────────────┬───────────────────────────────────────┘
           ┌──────────┴──────────┐
           │                     │
           ▼                     ▼
  ┌──────────────────┐  ┌──────────────────┐
  │   Cheap Model    │  │ Expensive Model  │
  │ (Haiku/4o-mini)  │  │   (Sonnet/4o)    │
  │   Score < 0.7    │  │   Score >= 0.7   │
  └────────┬─────────┘  └────────┬─────────┘
           │                     │
           └──────────┬──────────┘
┌─────────────────────────────────────────────────────────────┐
│               Budget Tracker (max $1.00/task)               │
│                - Check before each API call                 │
│                - Raise exception on exceed                  │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              Response (cached where possible)               │
└─────────────────────────────────────────────────────────────┘
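In code, the flow in the diagram can be sketched roughly as follows. Everything here is a stand-in: the models are plain functions that return a `(text, cost_dollars)` pair, and `BudgetTracker`, `handle_query`, and the fake costs are hypothetical names and values, not my production code.

```python
class BudgetExceeded(Exception):
    """Raised when the per-task budget is used up."""


class BudgetTracker:
    def __init__(self, max_cost_dollars=1.0):
        self.max_cost = max_cost_dollars
        self.spent = 0.0

    def check(self):
        if self.spent >= self.max_cost:
            raise BudgetExceeded(
                f"Spent ${self.spent:.2f}, limit ${self.max_cost:.2f}"
            )

    def record(self, cost_dollars):
        self.spent += cost_dollars


def handle_query(prompt, scorer, cheap_model, expensive_model, tracker,
                 threshold=0.7):
    tracker.check()  # refuse to call any model once the budget is gone
    model = cheap_model if scorer(prompt) < threshold else expensive_model
    text, cost_dollars = model(prompt)
    tracker.record(cost_dollars)
    return text


# Stand-ins for the scorer and the two model tiers (costs are invented)
def fake_scorer(prompt):
    return 0.9 if "analyze" in prompt.lower() else 0.1


def fake_cheap(prompt):
    return "cheap:" + prompt, 0.001


def fake_expensive(prompt):
    return "expensive:" + prompt, 0.02


tracker = BudgetTracker(max_cost_dollars=0.03)
first = handle_query("Extract the date", fake_scorer,
                     fake_cheap, fake_expensive, tracker)
second = handle_query("Analyze this argument", fake_scorer,
                      fake_cheap, fake_expensive, tracker)
print(first, second, round(tracker.spent, 3))
```

Caching is not shown because it lives in how the messages are built, not in the routing path.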

Results after implementing all three:

  • Before: $800/month for one agent
  • After: $150/month for the same workload
  • Quality: No noticeable degradation on task success rate

Common Mistakes I Made

  1. Using the most powerful model for everything - This was my $750 mistake.
  2. Modifying conversation history - Broke caching completely.
  3. No budget limits during development - Led to surprise bills.
  4. JSON for everything - Wasted tokens on structural syntax.

Summary

Agent cost optimization requires a layered approach:

  1. Model cascading - Route simple tasks to cheap models. Start here for quick wins.
  2. Prompt caching - Structure messages for prefix matching. Enables 70-80% savings.
  3. Budgeting - Set hard limits per task. Prevents runaway costs.
  4. Data format - Use Markdown or CSV over JSON. Saves 10-50% on context.

The key insight: you don’t need GPT-4 for “capitalize this sentence.” Route intelligently, cache aggressively, and budget realistically.

Final Words + More Resources

My intention with this article was to share my knowledge and experience so that others can benefit from it. If you want to get in touch, you can contact me by email.
