How to optimize AI agent costs with model cascading, caching, and budgeting

Mar 25, 2026

I built an AI agent last month and my API bill jumped from $50 to $800 in two weeks. The culprit? I used GPT-4 for everything - even simple “is this text positive?” queries that a cheaper model could handle.

In this post, I’ll show how I reduced my agent costs by 80% using three strategies: model cascading, prompt caching, and budgeting. The key point is that you don’t need expensive models for simple tasks.

The Problem: Agent Costs Spiral Fast

I tracked my agent’s API usage and found the pattern:

Each agent loop processed ALL conversation history
Output tokens cost 3-5x more than input tokens
Complex tasks ran 30+ iterations
Tree of Thoughts triggered 100+ API calls
Multi-agent debates multiplied costs by 5-10x
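
To make the first point concrete, here is a back-of-the-envelope sketch of what re-sending the full history costs. It assumes every iteration appends a fixed number of tokens; the function name and the numbers are illustrative, not measurements from my agent.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_turn: int, iterations: int) -> int:
    """Total input tokens billed when every iteration re-sends the whole history."""
    total = 0
    history = base_tokens
    for _ in range(iterations):
        total += history            # the full history is sent as input
        history += tokens_per_turn  # the new turn is appended for next time
    return total


# 2,000-token starting context, 500 new tokens per loop, 30 loops
print(cumulative_input_tokens(2_000, 500, 30))  # 277500
```

Under these assumptions the history itself only grows to 17,000 tokens, but the agent is billed for 277,500 input tokens because each loop pays for everything that came before it.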

My agent would spin on a problem, burning tokens until I manually stopped it. No limits, no fallbacks, no awareness of cost.

I tried a few naive solutions:

Using only cheap models - Quality dropped. Complex reasoning tasks failed.
Hard token limits - Agent stopped mid-task, leaving broken outputs.
Manual intervention - I became a human circuit breaker. Not sustainable.

None of these worked. I needed a smarter approach.

Solution 1: Model Cascading (Route Smart)

The insight: most queries are simple. A query like “extract the date from this text” doesn’t need GPT-4. But “analyze the logical fallacies in this argument” does.

I implemented a router that scores query complexity and sends it to the right model:

class ModelRouter:
    def __init__(self, cheap_model, expensive_model, threshold=0.7):
        self.cheap = cheap_model
        self.expensive = expensive_model
        self.threshold = threshold

    def route(self, prompt, complexity_scorer):
        score = complexity_scorer(prompt)

        if score < self.threshold:
            return self.cheap.generate(prompt)
        else:
            return self.expensive.generate(prompt)

The complexity scorer can be as simple as checking for keywords like “analyze”, “reason”, “evaluate”, or you can train a lightweight classifier.

Results reported for RouteLLM (best case, on the MT Bench benchmark):

Up to 85% cost reduction
95% of the top model’s quality

In my setup:

Simple queries route to GPT-4o-mini or Haiku
Complex queries route to GPT-4o or Sonnet

For my agent, I added one more level - critical decisions (like final answers to users) go to the most capable model.
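
A three-tier version of the router could be sketched like this. The models are plain callables standing in for real API clients, and the critical flag is a hypothetical way of marking user-facing final answers.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical stand-in for a real model client


def route_three_tier(
    prompt: str,
    score: float,
    cheap: Model,
    strong: Model,
    top: Model,
    critical: bool = False,
    threshold: float = 0.7,
) -> str:
    """Pick a model tier from the complexity score and a 'critical' flag."""
    if critical:
        # Critical decisions always go to the most capable model
        return top(prompt)
    if score < threshold:
        return cheap(prompt)
    return strong(prompt)
```

Keeping the critical check first means a low complexity score can never downgrade a final answer to the cheap tier.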

Solution 2: Prompt Caching (Reuse Computation)

Prompt caching was my biggest win. The concept: when you send the same prefix to an API multiple times, the provider can reuse the computed KV-cache instead of reprocessing.

For agents, this is huge. Every loop iteration sends the entire conversation history. If you structure messages correctly, 90%+ of tokens can be cached.

Here’s what I changed:

Static content first: System prompt, tool definitions, context documents
Dynamic content last: User queries, agent responses
Never modify history, only append

def build_messages(system_prompt, tools, history, new_message):
    """
    Structure messages for optimal caching.
    Cache hits only happen on prefix matches.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        # tools is a text description of the available tools; it is static,
        # so it belongs in the cached prefix
        {"role": "user", "content": tools}
    ]
    # History is cached after first iteration
    messages.extend(history)
    # New content goes last
    messages.append(new_message)
    return messages

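Cost reduction: 70-80% on my input-token bill. Anthropic charges about one-tenth of the normal input price for tokens read from the cache (writing to the cache costs a small premium), so the overall saving depends on how much of each request is a cache hit.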
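
A quick calculation shows how the one-tenth price for cached tokens turns into an overall saving. The $3 per million tokens price and the token counts are placeholder assumptions, and the sketch ignores the surcharge for the first write to the cache.

```python
def input_cost(fresh_tokens: int, cached_tokens: int, price_per_mtok: float,
               cache_read_multiplier: float = 0.1) -> float:
    """Input cost in dollars when part of the prompt is read from the cache."""
    fresh = fresh_tokens * price_per_mtok / 1_000_000
    cached = cached_tokens * price_per_mtok * cache_read_multiplier / 1_000_000
    return fresh + cached


PRICE = 3.00  # placeholder: dollars per million input tokens

no_cache = input_cost(100_000, 0, PRICE)
with_cache = input_cost(10_000, 90_000, PRICE)  # 90% of the prompt is a cache hit
saving = 1 - with_cache / no_cache
print(f"{no_cache:.3f} -> {with_cache:.3f} ({saving:.0%} saved)")
```

With 90% of the request served from the cache the saving is roughly 81%; with 80% cached it drops to about 72%.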

The mistake I made initially: I was modifying earlier messages in the conversation to “improve” them. This broke the cache. Once I switched to append-only history, costs dropped.
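
One way to catch this mistake early is to compare the message list from the previous iteration with the current one. The helper names below are hypothetical; the check is a plain prefix comparison and does not call any provider API.

```python
def shared_prefix_len(previous: list[dict], current: list[dict]) -> int:
    """Number of leading messages that are identical in both lists."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


def is_append_only(previous: list[dict], current: list[dict]) -> bool:
    """True if current is previous plus zero or more new messages at the end."""
    return shared_prefix_len(previous, current) == len(previous)


before = [{"role": "system", "content": "You are helpful."},
          {"role": "user", "content": "Hi"}]
appended = before + [{"role": "assistant", "content": "Hello!"}]
edited = [{"role": "system", "content": "You are very helpful."}] + before[1:]

print(is_append_only(before, appended))  # True
print(is_append_only(before, edited))    # False
```

Logging a warning whenever is_append_only returns False during development makes accidental history rewrites visible before they show up on the bill.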

Solution 3: Budgeting (Stop the Bleeding)

Even with routing and caching, agents can spiral. I needed hard limits.

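class BudgetExceeded(Exception):
    """Raised when a task has spent its whole budget."""


class BudgetAwareAgent:
    def __init__(self, max_cost_dollars=1.0, *,
                 price_per_input_token, price_per_output_token):
        self.max_cost = max_cost_dollars
        self.spent = 0.0
        # Per-token prices in dollars for the model this agent calls
        self.price_per_input_token = price_per_input_token
        self.price_per_output_token = price_per_output_token

    async def run(self, task):
        while not task.complete:
            # Check before every step so a runaway loop stops at the limit
            if self.spent >= self.max_cost:
                raise BudgetExceeded(
                    f"Spent ${self.spent:.2f}, limit ${self.max_cost:.2f}"
                )

            response = await self.step(task)
            self.spent += self.calculate_cost(response)

        return task.result

    async def step(self, task):
        # One agent iteration (model call plus tool use); depends on your agent
        raise NotImplementedError

    def calculate_cost(self, response):
        # Pricing varies by model
        input_cost = response.input_tokens * self.price_per_input_token
        output_cost = response.output_tokens * self.price_per_output_token
        return input_cost + output_cost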

This saved me during development. Instead of surprise $100 bills, I’d get a clear error: “Budget exceeded at $1.00.”
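
Since pricing varies by model, a small lookup table keeps the cost calculation honest when routing between tiers. The model names and prices below are placeholders, not real quotes; check your provider's pricing page before using numbers like these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # dollars per million input tokens
    output_per_mtok: float  # dollars per million output tokens


# Placeholder model names and prices, for illustration only
PRICES = {
    "cheap-model": Price(input_per_mtok=0.25, output_per_mtok=1.25),
    "expensive-model": Price(input_per_mtok=3.00, output_per_mtok=15.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call for the given model."""
    price = PRICES[model]
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000
```

A call with 10,000 input tokens and 1,000 output tokens then costs about $0.045 on the placeholder expensive tier and under half a cent on the cheap one.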

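For reasoning models, I also cap how much the model can think. With Claude’s extended thinking this is the budget_tokens parameter (it is Anthropic-specific; other providers’ reasoning models, such as o1, have their own controls):

response = client.messages.create(
    # Extended thinking needs a model that supports it (Claude 3.7 Sonnet or later)
    model="claude-3-7-sonnet-20250219",
    # max_tokens must be larger than budget_tokens, since thinking counts toward it
    max_tokens=20000,
    thinking={
        "type": "enabled",
        "budget_tokens": 16000  # Cap reasoning depth
    },
    messages=[...]
)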

The budget_tokens parameter limits how much internal reasoning the model does. Higher values = more thorough but more expensive.
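
A small helper can keep the two limits consistent. It builds the keyword arguments only and does not call the API. The 1,024-token minimum and the rule that max_tokens must exceed budget_tokens reflect my reading of Anthropic's documentation, so verify both before relying on them; the helper name is hypothetical.

```python
MIN_THINKING_BUDGET = 1024  # assumed minimum; check the current Anthropic docs


def thinking_kwargs(budget_tokens: int, answer_tokens: int) -> dict:
    """Build max_tokens and thinking settings that are consistent with each other."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return {
        # Thinking tokens count toward max_tokens, so leave room for the answer
        "max_tokens": budget_tokens + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }
```

The result can be unpacked into the request, for example client.messages.create(model=..., messages=..., **thinking_kwargs(16000, 4096)).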

Bonus: Data Format Matters

I discovered one more optimization: the format of your data affects token count.

Testing the same structured data:

Format	Token Count	Savings vs JSON
JSON	1000	baseline
YAML	900	10%
Markdown	660	34%
CSV (for tables)	500-600	40-50%

For agent contexts, I switched to Markdown:

def format_as_markdown(data):
    """Use Markdown instead of JSON for structured data"""
    lines = ["# Data"]
    for key, value in data.items():
        lines.append(f"- **{key}**: {value}")
    return "\n".join(lines)

# Instead of:
# {"name": "Alice", "age": 30, "role": "engineer"}

# Use:
# # Data
# - **name**: Alice
# - **age**: 30
# - **role**: engineer

For tabular data, CSV beats everything. I use it for agent memory and logs.
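
For the tabular case, a minimal sketch using only the standard library is shown below. It assumes every row has the same keys as the first row, and it compares character counts, which are only a rough proxy for token counts.

```python
import csv
import io
import json


def format_as_csv(rows: list[dict]) -> str:
    """Serialize a list of same-shaped dicts as CSV with a header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
as_csv = format_as_csv(rows)
as_json = json.dumps(rows)
print(as_csv)
print(len(as_csv), "characters as CSV vs", len(as_json), "as JSON")
```

The saving comes from writing the keys once in the header instead of repeating them for every record, so it grows with the number of rows.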

Putting It Together

My final architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Incoming Query                          │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              Complexity Scorer (cheap model)                 │
│              Score: 0.0 (simple) to 1.0 (complex)            │
└─────────────────────┬───────────────────────────────────────┘
                      │
           ┌──────────┴──────────┐
           │                     │
           ▼                     ▼
┌──────────────────┐  ┌──────────────────┐
│   Cheap Model    │  │ Expensive Model  │
│  (Haiku/4o-mini) │  │  (Sonnet/4o)     │
│  Score < 0.7     │  │  Score >= 0.7    │
└────────┬─────────┘  └────────┬─────────┘
         │                     │
         └──────────┬──────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│              Budget Tracker (max $1.00/task)                 │
│              - Check before each API call                    │
│              - Raise exception on exceed                      │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│              Response (cached where possible)                │
└─────────────────────────────────────────────────────────────┘
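
The flow in the diagram can be sketched in a few lines. Here the models are hypothetical callables that return the response text together with its cost in dollars; a real version would wrap API clients and compute the cost from token usage.

```python
from typing import Callable, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (text, cost in dollars)


class BudgetExceeded(Exception):
    """Raised when the task has spent its whole budget."""


class CostAwarePipeline:
    def __init__(self, scorer: Callable[[str], float], cheap: Model,
                 expensive: Model, threshold: float = 0.7, max_cost: float = 1.00):
        self.scorer = scorer
        self.cheap = cheap
        self.expensive = expensive
        self.threshold = threshold
        self.max_cost = max_cost
        self.spent = 0.0

    def ask(self, prompt: str) -> str:
        # Budget check happens before each call
        if self.spent >= self.max_cost:
            raise BudgetExceeded(f"Spent ${self.spent:.2f}, limit ${self.max_cost:.2f}")
        # Route on the complexity score
        model = self.expensive if self.scorer(prompt) >= self.threshold else self.cheap
        text, cost = model(prompt)
        self.spent += cost
        return text
```

Because the check runs before the call rather than after it, the last call can push spending slightly past the limit. If that matters, estimate the cost of the next call and compare against the remaining budget instead.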

Results after implementing all three:

Before: $800/month for one agent
After: $150/month for the same workload
Quality: No noticeable degradation on task success rate

Common Mistakes I Made

Using the most powerful model for everything - This was my $750 mistake.
Modifying conversation history - Broke caching completely.
No budget limits during development - Led to surprise bills.
JSON for everything - Wasted tokens on structural syntax.

Summary

Agent cost optimization requires a layered approach:

Model cascading - Route simple tasks to cheap models. Start here for quick wins.
Prompt caching - Structure messages for prefix matching. Enables 70-80% savings.
Budgeting - Set hard limits per task. Prevents runaway costs.
Data format - Use Markdown or CSV over JSON. Saves 10-50% on context.

The key insight: you don’t need GPT-4 for “capitalize this sentence.” Route intelligently, cache aggressively, and budget realistically.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 RouteLLM
👨‍💻 Anthropic Prompt Caching
👨‍💻 Cascade Routing Paper

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!