How to optimize AI agent costs with model cascading, caching, and budgeting
I built an AI agent last month and my API bill jumped from $50 to $800 in two weeks. The culprit? I used GPT-4 for everything - even simple “is this text positive?” queries that a cheaper model could handle.
In this post, I’ll show how I reduced my agent costs by 80% using three strategies: model cascading, prompt caching, and budgeting. The key point is that you don’t need expensive models for simple tasks.
The Problem: Agent Costs Spiral Fast
I tracked my agent’s API usage and found the pattern:
- Each agent loop processed ALL conversation history
- Output tokens cost 3-5x more than input tokens
- Complex tasks ran 30+ iterations
- Tree of Thoughts triggered 100+ API calls
- Multi-agent debates multiplied costs by 5-10x
My agent would spin on a problem, burning tokens until I manually stopped it. No limits, no fallbacks, no awareness of cost.
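To see why re-sending the full history on every loop gets expensive, here is a rough back-of-the-envelope sketch. The function name and all three numbers are assumptions chosen for illustration, not measurements from my agent.

```python
def total_input_tokens(iterations, base_tokens, tokens_per_step):
    """Sum the input tokens billed when every iteration re-sends the full history.

    All three arguments are illustrative assumptions, not measurements.
    """
    total = 0
    history = base_tokens  # system prompt + tools + first user message
    for _ in range(iterations):
        total += history            # the whole history is sent as input
        history += tokens_per_step  # each step appends new messages
    return total


# 30 iterations, 2,000-token starting context, 500 new tokens per step
print(total_input_tokens(30, 2_000, 500))  # 277500
```

Under these assumptions the final context is only 16,500 tokens, but the billed input is roughly 17 times that, because the cost grows quadratically with the number of iterations.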
I tried a few naive solutions:
- Using only cheap models - Quality dropped. Complex reasoning tasks failed.
- Hard token limits - Agent stopped mid-task, leaving broken outputs.
- Manual intervention - I became a human circuit breaker. Not sustainable.
None of these worked. I needed a smarter approach.
Solution 1: Model Cascading (Route Smart)
The insight: most queries are simple. A query like “extract the date from this text” doesn’t need GPT-4. But “analyze the logical fallacies in this argument” does.
I implemented a router that scores query complexity and sends it to the right model:
    class ModelRouter:
        def __init__(self, cheap_model, expensive_model, threshold=0.7):
            self.cheap = cheap_model
            self.expensive = expensive_model
            self.threshold = threshold

        def route(self, prompt, complexity_scorer):
            score = complexity_scorer(prompt)

            if score < self.threshold:
                return self.cheap.generate(prompt)
            else:
                return self.expensive.generate(prompt)

The complexity scorer can be as simple as checking for keywords like “analyze”, “reason”, “evaluate”, or you can train a lightweight classifier.
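A keyword-based scorer might look something like the sketch below. The keyword list, the weights, and the `long_prompt_words` cutoff are all assumptions for illustration; tune them against your own traffic.

```python
import re

# Hypothetical keyword list; extend it for your own workload.
COMPLEX_KEYWORDS = {"analyze", "reason", "evaluate", "compare", "prove"}


def keyword_complexity_scorer(prompt, long_prompt_words=200):
    """Return a rough complexity score between 0.0 and 1.0.

    Heuristic only: a reasoning keyword adds 0.7, a long prompt adds 0.3.
    """
    words = re.findall(r"[a-z]+", prompt.lower())
    score = 0.0
    if COMPLEX_KEYWORDS.intersection(words):
        score += 0.7
    if len(words) >= long_prompt_words:
        score += 0.3
    return min(score, 1.0)


print(keyword_complexity_scorer("extract the date from this text"))
print(keyword_complexity_scorer("analyze the logical fallacies in this argument"))
```

With a threshold of 0.7, the first prompt scores below the threshold and goes to the cheap model, while the second reaches it and goes to the expensive one. Exact word matching is crude (it misses "analyzing", for example), which is where a small classifier starts to pay off.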
Results reported in the RouteLLM benchmarks:
- Up to 85% cost reduction
- About 95% of the top model's quality retained

In my setup:
- Simple queries route to GPT-4o-mini or Haiku
- Complex queries route to GPT-4o or Sonnet
For my agent, I added one more level - critical decisions (like final answers to users) go to the most capable model.
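That extra level can be sketched as a small tier picker. The `is_critical` flag and the tier names are placeholders I am assuming here; how you decide that a step is critical depends on your agent.

```python
def pick_tier(score, is_critical, threshold=0.7):
    """Return which tier should handle a request.

    Tier names are placeholders for whatever models you deploy.
    """
    if is_critical:
        return "most_capable"  # e.g. final answers shown to users
    if score < threshold:
        return "cheap"
    return "expensive"


print(pick_tier(0.2, is_critical=False))  # cheap
print(pick_tier(0.9, is_critical=False))  # expensive
print(pick_tier(0.2, is_critical=True))   # most_capable
```

Checking `is_critical` first means a simple-looking final answer still goes to the strongest model, which costs more but only applies to a small share of calls.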
Solution 2: Prompt Caching (Reuse Computation)
Prompt caching was my biggest win. The concept: when you send the same prefix to an API multiple times, the provider can reuse the computed KV-cache instead of reprocessing.
For agents, this is huge. Every loop iteration sends the entire conversation history. If you structure messages correctly, 90%+ of tokens can be cached.
Here’s what I changed:
- Static content first: System prompt, tool definitions, context documents
- Dynamic content last: User queries, agent responses
- Never modify history, only append
    def build_messages(system_prompt, tools, history, new_message):
        """
        Structure messages for optimal caching.
        Cache hits only happen on prefix matches.
        """
        # tools: the tool descriptions, already rendered as text
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": tools}
        ]
        # History is cached after first iteration
        messages.extend(history)
        # New content goes last
        messages.append(new_message)
        return messages

Cost reduction: 70-80% on cached tokens. Anthropic charges 1/10th of the normal input price for cached input tokens.
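The arithmetic behind those numbers can be checked with a short sketch. I am assuming here that the 70-80% figure refers to the overall input bill, and the function ignores any surcharge a provider adds for writing to the cache, so treat it as an upper bound.

```python
def cached_input_savings(cached_fraction, cached_price_ratio=0.1):
    """Fraction of the input bill saved when part of the prompt is read from cache.

    cached_fraction: share of input tokens served from cache (0.0 to 1.0).
    cached_price_ratio: cached price divided by normal input price (1/10th here).
    """
    cost_ratio = (1 - cached_fraction) + cached_fraction * cached_price_ratio
    return 1 - cost_ratio


for fraction in (0.8, 0.9):
    print(f"{fraction:.0%} cached -> {cached_input_savings(fraction):.0%} saved")
# 80% cached -> 72% saved
# 90% cached -> 81% saved
```

So a cache hit rate of 80-90% on input tokens lines up with savings in the 70-80% range.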
The mistake I made initially: I was modifying earlier messages in the conversation to “improve” them. This broke the cache. Once I switched to append-only history, costs dropped.
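Here is a small illustration of why editing history hurts. It compares two consecutive requests message by message, which is a simplification: real providers match on the token prefix and have their own minimum sizes, so `shared_prefix_length` is only a hypothetical helper for building intuition.

```python
def shared_prefix_length(previous, current):
    """Count how many leading messages are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


first = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Find the invoice date."},
]

# Append-only: the whole previous request is still a prefix.
appended = first + [{"role": "assistant", "content": "Looking it up."}]

# Editing an earlier message: the shared prefix ends where the edit begins.
edited = [
    first[0],
    {"role": "user", "content": "Please find the invoice date."},
    {"role": "assistant", "content": "Looking it up."},
]

print(shared_prefix_length(first, appended))  # 2
print(shared_prefix_length(first, edited))    # 1
```

Everything after the first changed message has to be reprocessed at full price, so an edit near the top of a long conversation throws away almost the entire cache.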
Solution 3: Budgeting (Stop the Bleeding)
Even with routing and caching, agents can spiral. I needed hard limits.
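    class BudgetExceeded(Exception):
        """Raised when a task goes over its cost limit."""

    class BudgetAwareAgent:
        def __init__(self, price_per_input_token, price_per_output_token, max_cost_dollars=1.0):
            # Pricing varies by model, so it is passed in
            self.price_per_input_token = price_per_input_token
            self.price_per_output_token = price_per_output_token
            self.max_cost = max_cost_dollars
            self.spent = 0

        async def run(self, task):
            while not task.complete:
                # Check the budget before every API call
                if self.spent >= self.max_cost:
                    raise BudgetExceeded(
                        f"Spent ${self.spent:.2f}, limit ${self.max_cost:.2f}"
                    )

                # self.step(task) makes one model call; implement it for your agent
                response = await self.step(task)
                self.spent += self.calculate_cost(response)

            return task.result

        def calculate_cost(self, response):
            input_cost = response.input_tokens * self.price_per_input_token
            output_cost = response.output_tokens * self.price_per_output_token
            return input_cost + output_cost

This saved me during development. Instead of surprise $100 bills, I’d get a clear error like “Spent $1.00, limit $1.00.”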
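To try the budget check without spending anything, you can drive it with a stubbed model call. Everything in this sketch is an assumption for illustration: `FakeResponse`, `fake_step`, the per-token prices, and the extra `max_steps` cap, which I add as a second safety net.

```python
import asyncio
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when the cost limit is reached."""


@dataclass
class FakeResponse:
    input_tokens: int
    output_tokens: int


async def run_with_budget(step, max_cost, price_in, price_out, max_steps=100):
    """Call `step()` until the budget is spent or max_steps is reached."""
    spent = 0.0
    for _ in range(max_steps):
        if spent >= max_cost:
            raise BudgetExceeded(f"Spent ${spent:.2f}, limit ${max_cost:.2f}")
        response = await step()
        spent += response.input_tokens * price_in + response.output_tokens * price_out
    return spent


async def fake_step():
    # Stand-in for a real model call
    return FakeResponse(input_tokens=10_000, output_tokens=1_000)


try:
    # Made-up prices: $10 per million input tokens, $30 per million output tokens
    asyncio.run(run_with_budget(fake_step, 1.0, 0.00001, 0.00003))
except BudgetExceeded as error:
    print(error)
```

Each fake step costs about $0.13, so the loop stops before the ninth call instead of running until someone notices the bill.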
For Claude’s extended thinking, I also set a budget_tokens limit. (Other reasoning models such as o1 expose their own controls rather than budget_tokens.)

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # needs a model that supports extended thinking
        max_tokens=20000,  # must be larger than budget_tokens
        thinking={
            "type": "enabled",
            "budget_tokens": 16000  # Cap reasoning depth
        },
        messages=[...]
    )

The budget_tokens parameter limits how much internal reasoning the model does. Higher values = more thorough but more expensive.
Bonus: Data Format Matters
I discovered one more optimization: the format of your data affects token count.
Testing the same structured data:
| Format | Token Count | Savings vs JSON |
|---|---|---|
| JSON | 1000 | baseline |
| YAML | 900 | 10% |
| Markdown | 660 | 34% |
| CSV (for tables) | 500-600 | 40-50% |
For agent contexts, I switched to Markdown:
    def format_as_markdown(data):
        """Use Markdown instead of JSON for structured data"""
        lines = ["# Data"]
        for key, value in data.items():
            lines.append(f"- **{key}**: {value}")
        return "\n".join(lines)

    # Instead of:
    # {"name": "Alice", "age": 30, "role": "engineer"}

    # Use:
    # # Data
    # - **name**: Alice
    # - **age**: 30
    # - **role**: engineer

For tabular data, CSV beats everything. I use it for agent memory and logs.
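For the tabular case, a comparable helper could look like this sketch. It assumes every row is a dict with the same keys; `format_as_csv` and the sample rows are made up for illustration.

```python
import csv
import io
import json


def format_as_csv(rows):
    """Render a list of same-shaped dicts as CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"name": "Alice", "age": 30, "role": "engineer"},
    {"name": "Bob", "age": 41, "role": "analyst"},
]
csv_text = format_as_csv(rows)
print(csv_text)

# Character counts are only a rough proxy for tokens.
print(len(csv_text), len(json.dumps(rows)))
```

The saving comes from writing the keys once in the header instead of repeating them in every record, so it grows with the number of rows.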
Putting It Together
My final architecture:
    ┌─────────────────────────────────────────────────────────────┐
    │ Incoming Query                                              │
    └─────────────────────┬───────────────────────────────────────┘
                          │
                          ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Complexity Scorer (cheap model)                             │
    │ Score: 0.0 (simple) to 1.0 (complex)                        │
    └─────────────────────┬───────────────────────────────────────┘
                          │
               ┌──────────┴──────────┐
               │                     │
               ▼                     ▼
      ┌──────────────────┐  ┌──────────────────┐
      │ Cheap Model      │  │ Expensive Model  │
      │ (Haiku/4o-mini)  │  │ (Sonnet/4o)      │
      │ Score < 0.7      │  │ Score >= 0.7     │
      └────────┬─────────┘  └────────┬─────────┘
               │                     │
               └──────────┬──────────┘
                          │
                          ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Budget Tracker (max $1.00/task)                             │
    │ - Check before each API call                                │
    │ - Raise exception on exceed                                 │
    └─────────────────────────────────────────────────────────────┘
                          │
                          ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ Response (cached where possible)                            │
    └─────────────────────────────────────────────────────────────┘

Results after implementing all three:
- Before: $800/month for one agent
- After: $150/month for the same workload
- Quality: No noticeable degradation on task success rate
Common Mistakes I Made
- Using the most powerful model for everything - This was my $750 mistake.
- Modifying conversation history - Broke caching completely.
- No budget limits during development - Led to surprise bills.
- JSON for everything - Wasted tokens on structural syntax.
Summary
Agent cost optimization requires a layered approach:
- Model cascading - Route simple tasks to cheap models. Start here for quick wins.
- Prompt caching - Structure messages for prefix matching. Enables 70-80% savings.
- Budgeting - Set hard limits per task. Prevents runaway costs.
- Data format - Use Markdown or CSV over JSON. Saves 10-50% on context.
The key insight: you don’t need GPT-4 for “capitalize this sentence.” Route intelligently, cache aggressively, and budget realistically.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- RouteLLM
- Anthropic Prompt Caching
- Cascade Routing Paper
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!