Skip to content

How Can Model Routing Reduce AI Agent Costs by 70% or More?

I built an AI agent that runs every night. The first week, my API bill hit $52. That’s when I realized I was doing something wrong.

The culprit? I was using Claude Opus for everything—scanning documents, filtering content, making simple classifications. It was like hiring a senior architect to sweep the floors.

The Problem: One Model for Everything

My original agent pipeline looked like this:

Original Pipeline (Expensive)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ 100 docs │────▶│ OPUS │────▶│ Results │
│ (scan) │ │ ($15/M) │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
Cost: ~100K input tokens × $15/M = $1.50 per run
Monthly: $1.50 × 30 days = $45/month

I treated every task the same, regardless of complexity. Scanning 100 documents to find 5 relevant ones? Opus. Classifying text into categories? Opus. Writing a summary? You guessed it—Opus.

The math was brutal:

  • Opus input: $15 per million tokens
  • Opus output: $75 per million tokens
  • My nightly runs: 100K-150K tokens
  • Weekly cost: $50+

I needed a different approach.

The Solution: Model Routing

Model routing is exactly what it sounds like: sending tasks to different models based on what the task requires. The key insight came from a Reddit post where someone mentioned their agent costs ~$0.40/night because “Haiku scans, Opus judges.”

That phrase stuck with me. What if I used cheaper models for the volume work and saved the expensive models for the judgment calls?

Here’s the model pricing comparison:

Claude Model Pricing (per 1M tokens)
Model | Input | Output | Ratio to Opus
-----------|--------|--------|---------------
Haiku | $0.80 | $4.00 | 1/20th
Sonnet | $3.00 | $15.00 | 1/5th
Opus | $15.00 | $75.00 | 1x

Haiku is 20x cheaper than Opus. If 80% of my tasks could use Haiku, my costs would plummet.

My First Routing Attempt

I started by categorizing my tasks:

Task Classification
Task Type | Volume | Complexity | Model Choice
------------------|--------|------------|-------------
Scan documents | High | Low | Haiku
Filter content | High | Low | Haiku
Classify topics | Medium | Medium | Sonnet
Summarize | Medium | Medium | Sonnet
Final evaluation | Low | High | Opus
Deep analysis | Low | High | Opus

Then I built a router:

router.py
from anthropic import Anthropic
from enum import Enum
from dataclasses import dataclass
class ModelType(Enum):
HAIKU = "claude-3-5-haiku-latest"
SONNET = "claude-3-5-sonnet-latest"
OPUS = "claude-3-opus-latest"
@dataclass
class Task:
task_type: str
content: str
complexity: float # 0.0 to 1.0
class ModelRouter:
def __init__(self, api_key: str):
self.client = Anthropic(api_key=api_key)
def select_model(self, task: Task) -> ModelType:
"""Select optimal model based on task characteristics."""
if task.task_type in ["scan", "filter", "classify"]:
return ModelType.HAIKU
elif task.task_type == "evaluate" and task.complexity > 0.7:
return ModelType.OPUS
return ModelType.SONNET
def execute(self, task: Task) -> str:
"""Execute task with optimal model."""
model = self.select_model(task)
response = self.client.messages.create(
model=model.value,
max_tokens=2048,
messages=[{"role": "user", "content": task.content}]
)
return response.content[0].text

This worked, but I quickly ran into a problem.

The Context Problem

When I switched models between phases, I was losing context. My scan phase would find relevant documents with Haiku, but then Opus had no memory of why Haiku selected them.

A comment on Reddit echoed this concern:

“How do you handle context preservation when switching models? Is there context lost during the transitions?”

I had to solve this. My solution: a shared context store.

agent_with_context.py
class RoutedAgent:
"""
Agent that routes tasks to appropriate models.
Uses shared context to preserve information between phases.
"""
def __init__(self, client: Anthropic):
self.client = client
self.context = {
"candidates": [],
"scan_metadata": {},
"decisions": []
}
def scan(self, documents: list[str]) -> list[str]:
"""Fast, cheap scanning with Haiku."""
response = self.client.messages.create(
model="claude-3-5-haiku-latest",
max_tokens=2048,
messages=[{
"role": "user",
"content": f"Filter relevant documents. Return JSON with keys: selected, reasons.\n\nDocuments: {documents}"
}]
)
# Parse and store in shared context
result = self._parse_json(response.content[0].text)
self.context["candidates"] = result["selected"]
self.context["scan_metadata"]["reasons"] = result["reasons"]
return result["selected"]
def evaluate(self, candidates: list[str]) -> str:
"""Deep evaluation with Opus."""
# Build context from previous phases
context_summary = f"""
Previous scan selected {len(self.context['candidates'])} candidates.
Selection reasons: {self.context['scan_metadata']['reasons']}
Now evaluate these candidates for final decision.
"""
response = self.client.messages.create(
model="claude-3-opus-latest",
max_tokens=2048,
system="You are a careful evaluator. Consider all context from previous phases.",
messages=[{
"role": "user",
"content": f"{context_summary}\n\nCandidates: {candidates}"
}]
)
return response.content[0].text
def run(self, documents: list[str]) -> str:
# Phase 1: Scan with Haiku (cheap, fast)
candidates = self.scan(documents)
# Phase 2: Evaluate with Opus (expensive, thorough)
result = self.evaluate(candidates)
return result

The pattern is simple: each phase writes to a shared context, and subsequent phases can read from it. This way, Opus knows why Haiku made its selections.

The Results: Cost Tracking

I added cost tracking to see the actual savings:

cost_tracker.py
COSTS = {
"claude-3-5-haiku-latest": {"input": 0.80, "output": 4.00},
"claude-3-5-sonnet-latest": {"input": 3.00, "output": 15.00},
"claude-3-opus-latest": {"input": 15.00, "output": 75.00},
}
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""Calculate cost in USD for a model call."""
return (
(input_tokens / 1_000_000) * COSTS[model]["input"] +
(output_tokens / 1_000_000) * COSTS[model]["output"]
)

After a week of running my routed agent, here’s what I found:

Cost Comparison (per 100 documents)
Phase | Old Model | Old Cost | New Model | New Cost
----------------|-----------|----------|-----------|---------
Scan (100K) | Opus | $1.50 | Haiku | $0.08
Filter (50K) | Opus | $0.75 | Haiku | $0.04
Classify (20K) | Opus | $0.30 | Sonnet | $0.06
Evaluate (10K) | Opus | $0.15 | Opus | $0.15
Analyze (5K) | Opus | $0.08 | Opus | $0.08
----------------|-----------|----------|-----------|---------
TOTAL | | $2.78 | | $0.41
Savings: 85%

My nightly cost dropped from ~$2.78 to ~$0.41. Over a month, that’s $83 vs $12.

The Four-Phase Routing Pattern

The pattern that worked best for me follows four phases:

Four-Phase Routing Pattern
┌──────────────────────────────────────────────────────────────┐
│ DOCUMENT PIPELINE │
└──────────────────────────────────────────────────────────────┘
┌─────────────────┐ ┌─────────────────────────────────────┐
│ PHASE 1: SCAN │ │ Model: Haiku (1/20th cost) │
│ │────▶│ Task: Filter relevant docs │
│ High volume │ │ Speed: Fast │
└─────────────────┘ └─────────────────────────────────────┘
┌─────────────────┐ ┌─────────────────────────────────────┐
│ PHASE 2: REFLECT│ │ Model: Sonnet (1/5th cost) │
│ │────▶│ Task: Analyze patterns │
│ Medium volume │ │ Speed: Moderate │
└─────────────────┘ └─────────────────────────────────────┘
┌──────────────────┐ ┌─────────────────────────────────────┐
│ PHASE 3: RESEARCH │ │ Model: Sonnet │
│ │───▶│ Task: Gather context │
│ Medium volume │ │ Speed: Moderate │
└──────────────────┘ └─────────────────────────────────────┘
┌───────────────────┐ ┌─────────────────────────────────────┐
│ PHASE 4: EVALUATE │ │ Model: Opus (full price) │
│ │──▶│ Task: Final judgment │
│ Low volume │ │ Speed: Thorough │
└───────────────────┘ └─────────────────────────────────────┘

Each phase decreases in volume but increases in complexity. Haiku handles 80% of the tokens, Opus handles 5%.

Key Lessons Learned

1. Start with Haiku

My first mistake was assuming I needed Opus for everything. Now I start with Haiku and only upgrade if the task fails. Most simple classification and filtering tasks work fine with cheaper models.

2. Context Must Flow

The shared context pattern is essential. Without it, each phase starts from scratch, and you lose the benefits of routing. I use a simple dictionary, but a database or vector store works for larger systems.

3. Measure Everything

I didn’t know I was overpaying until I added cost tracking. Now I log model, tokens, and cost for every call:

logging_calls.py
import logging
logger = logging.getLogger(__name__)
def log_model_call(model: str, input_tokens: int, output_tokens: int, cost: float):
logger.info(f"Model: {model}, In: {input_tokens}, Out: {output_tokens}, Cost: ${cost:.4f}")

4. Test Quality at Each Tier

Before routing to Haiku, I tested it on sample tasks. I found it handled filtering and classification well, but struggled with nuanced evaluation. That’s fine—that’s what Opus is for.

5. Complexity Scoring Helps

I added a simple complexity score (0-1) to my tasks. Tasks below 0.3 go to Haiku, 0.3-0.7 to Sonnet, above 0.7 to Opus. This automated much of the routing logic:

complexity_routing.py
def estimate_complexity(task: str) -> float:
"""Estimate task complexity based on keywords and length."""
complex_keywords = ["analyze", "evaluate", "reason", "decide", "compare"]
simple_keywords = ["filter", "classify", "scan", "extract", "summarize"]
task_lower = task.lower()
for keyword in complex_keywords:
if keyword in task_lower:
return 0.8
for keyword in simple_keywords:
if keyword in task_lower:
return 0.2
return 0.5 # Default to middle

What Didn’t Work

I tried a few approaches that failed:

  • Dynamic routing based on response time: Too unpredictable
  • Automatic model selection via API: Anthropic doesn’t support this directly
  • Cost-based routing without quality checks: Haiku sometimes missed nuances that Opus caught

The lesson: routing is optimization, not automation. You still need to understand what each model is good at.

When to Use Each Model

Here’s my decision matrix:

Model Selection Guide
Use Haiku when:
- High volume, low stakes
- Simple classification
- Pattern matching
- Content filtering
- Quick summaries
Use Sonnet when:
- Medium complexity
- Multi-step reasoning
- Code generation
- Detailed analysis
- Content creation
Use Opus when:
- Complex reasoning required
- Final judgments needed
- Nuanced evaluation
- Cross-domain synthesis
- Quality over speed

The Bottom Line

Model routing reduced my AI agent costs by 85% without sacrificing quality. The key principles:

  1. Analyze your pipeline - Most tasks don’t need the most expensive model
  2. Start with Haiku - Test cheaper models before assuming you need more
  3. Reserve Opus for judgment - Complex reasoning benefits from top-tier models
  4. Design for context sharing - Plan how information flows between models
  5. Measure and iterate - Track costs and quality to optimize routing

The ~$0.40/night cost I saw in that Reddit post? It’s real. You just need to be intentional about which model does what.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments