How Can Model Routing Reduce AI Agent Costs by 70% or More?
I built an AI agent that runs every night. The first week, my API bill hit $52. That’s when I realized I was doing something wrong.
The culprit? I was using Claude Opus for everything—scanning documents, filtering content, making simple classifications. It was like hiring a senior architect to sweep the floors.
The Problem: One Model for Everything
My original agent pipeline looked like this:
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │  100 docs   │────▶│    OPUS     │────▶│   Results   │
    │   (scan)    │     │  ($15/M)    │     │             │
    └─────────────┘     └─────────────┘     └─────────────┘

    Cost: ~100K input tokens × $15/M = $1.50 per run
    Monthly: $1.50 × 30 days = $45/month

I treated every task the same, regardless of complexity. Scanning 100 documents to find 5 relevant ones? Opus. Classifying text into categories? Opus. Writing a summary? You guessed it: Opus.
The math was brutal:
- Opus input: $15 per million tokens
- Opus output: $75 per million tokens
- My nightly runs: 100K-150K tokens
- Weekly cost: $50+
I needed a different approach.
The Solution: Model Routing
Model routing is exactly what it sounds like: sending tasks to different models based on what the task requires. The key insight came from a Reddit post where someone mentioned their agent costs ~$0.40/night because “Haiku scans, Opus judges.”
That phrase stuck with me. What if I used cheaper models for the volume work and saved the expensive models for the judgment calls?
Here’s the model pricing comparison:
    Model  | Input ($/M tokens) | Output ($/M tokens) | Ratio to Opus
    -------|--------------------|---------------------|--------------
    Haiku  | $0.80              | $4.00               | ~1/20th
    Sonnet | $3.00              | $15.00              | 1/5th
    Opus   | $15.00             | $75.00              | 1x

Haiku is nearly 20x cheaper than Opus (18.75x at these prices). If 80% of my tasks could use Haiku, my costs would plummet.
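To make the "80% on Haiku" intuition concrete, here is a minimal sketch that prices a 100K-token run under a single-model setup and under a routed split. The input prices come from the table above; the 80/15/5 split and the function name `blended_input_cost` are assumptions for illustration, and output tokens are left out.

```python
# Input prices in USD per million tokens, taken from the pricing table.
INPUT_PRICE = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}


def blended_input_cost(total_tokens: int, shares: dict[str, float]) -> float:
    """Cost of `total_tokens` input tokens split across models by `shares`."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")
    return sum(
        total_tokens * share / 1_000_000 * INPUT_PRICE[model]
        for model, share in shares.items()
    )


all_opus = blended_input_cost(100_000, {"opus": 1.0})
# Hypothetical split: 80% Haiku, 15% Sonnet, 5% Opus.
routed = blended_input_cost(100_000, {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05})
print(f"all Opus: ${all_opus:.2f}, routed: ${routed:.2f}")
```

Under this assumed split the run costs about $0.18 instead of $1.50, which is where the savings come from: the expensive model only sees a small slice of the tokens.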
My First Routing Attempt
I started by categorizing my tasks:
    Task Type        | Volume | Complexity | Model Choice
    -----------------|--------|------------|-------------
    Scan documents   | High   | Low        | Haiku
    Filter content   | High   | Low        | Haiku
    Classify topics  | Medium | Medium     | Sonnet
    Summarize        | Medium | Medium     | Sonnet
    Final evaluation | Low    | High       | Opus
    Deep analysis    | Low    | High       | Opus

Then I built a router:
    from anthropic import Anthropic
    from enum import Enum
    from dataclasses import dataclass

    class ModelType(Enum):
        HAIKU = "claude-3-5-haiku-latest"
        SONNET = "claude-3-5-sonnet-latest"
        OPUS = "claude-3-opus-latest"

    @dataclass
    class Task:
        task_type: str
        content: str
        complexity: float  # 0.0 to 1.0

    class ModelRouter:
        def __init__(self, api_key: str):
            self.client = Anthropic(api_key=api_key)

        def select_model(self, task: Task) -> ModelType:
            """Select optimal model based on task characteristics."""
            if task.task_type in ["scan", "filter", "classify"]:
                return ModelType.HAIKU
            elif task.task_type == "evaluate" and task.complexity > 0.7:
                return ModelType.OPUS
            return ModelType.SONNET

        def execute(self, task: Task) -> str:
            """Execute task with optimal model."""
            model = self.select_model(task)
            response = self.client.messages.create(
                model=model.value,
                max_tokens=2048,
                messages=[{"role": "user", "content": task.content}]
            )
            return response.content[0].text

This worked, but I quickly ran into a problem.
The Context Problem
When I switched models between phases, I was losing context. My scan phase would find relevant documents with Haiku, but then Opus had no memory of why Haiku selected them.
A comment on Reddit echoed this concern:
“How do you handle context preservation when switching models? Is there context lost during the transitions?”
I had to solve this. My solution: a shared context store.
    import json

    from anthropic import Anthropic

    class RoutedAgent:
        """
        Agent that routes tasks to appropriate models.
        Uses shared context to preserve information between phases.
        """

        def __init__(self, client: Anthropic):
            self.client = client
            self.context = {
                "candidates": [],
                "scan_metadata": {},
                "decisions": [],
            }

        @staticmethod
        def _parse_json(text: str) -> dict:
            # Assumption: the model replies with bare JSON, as the prompt requests.
            # json.loads raises json.JSONDecodeError otherwise.
            return json.loads(text)

        def scan(self, documents: list[str]) -> list[str]:
            """Fast, cheap scanning with Haiku."""
            response = self.client.messages.create(
                model="claude-3-5-haiku-latest",
                max_tokens=2048,
                messages=[{
                    "role": "user",
                    "content": f"Filter relevant documents. Return JSON with keys: selected, reasons.\n\nDocuments: {documents}"
                }]
            )
            # Parse and store in shared context
            result = self._parse_json(response.content[0].text)
            self.context["candidates"] = result["selected"]
            self.context["scan_metadata"]["reasons"] = result["reasons"]
            return result["selected"]

        def evaluate(self, candidates: list[str]) -> str:
            """Deep evaluation with Opus."""
            # Build context from previous phases
            context_summary = (
                f"Previous scan selected {len(self.context['candidates'])} candidates.\n"
                f"Selection reasons: {self.context['scan_metadata']['reasons']}\n\n"
                "Now evaluate these candidates for final decision."
            )

            response = self.client.messages.create(
                model="claude-3-opus-latest",
                max_tokens=2048,
                system="You are a careful evaluator. Consider all context from previous phases.",
                messages=[{
                    "role": "user",
                    "content": f"{context_summary}\n\nCandidates: {candidates}"
                }]
            )
            return response.content[0].text

        def run(self, documents: list[str]) -> str:
            # Phase 1: Scan with Haiku (cheap, fast)
            candidates = self.scan(documents)

            # Phase 2: Evaluate with Opus (expensive, thorough)
            result = self.evaluate(candidates)

            return result

The pattern is simple: each phase writes to a shared context, and subsequent phases can read from it. This way, Opus knows why Haiku made its selections.
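The handoff itself can be tried without any API calls. The following is a minimal sketch of the same idea using a small dataclass instead of a plain dict; `SharedContext`, `record_scan`, `handoff_prompt` and the sample document IDs are hypothetical names invented for this illustration, not part of the agent above.

```python
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """Hypothetical typed stand-in for the plain dict used by the agent."""
    candidates: list[str] = field(default_factory=list)
    reasons: dict[str, str] = field(default_factory=dict)
    decisions: list[str] = field(default_factory=list)

    def record_scan(self, selected: dict[str, str]) -> None:
        """Store each selected document together with the scanner's reason."""
        self.candidates = list(selected)
        self.reasons = dict(selected)

    def handoff_prompt(self) -> str:
        """Render what the next (more expensive) model needs to know."""
        lines = [f"Previous scan selected {len(self.candidates)} candidates."]
        lines += [f"- {doc}: {self.reasons[doc]}" for doc in self.candidates]
        lines.append("Now evaluate these candidates for a final decision.")
        return "\n".join(lines)


ctx = SharedContext()
# Made-up scan output: document id -> reason the cheap model gave.
ctx.record_scan({"doc_7": "mentions pricing change", "doc_42": "cites new API"})
print(ctx.handoff_prompt())
```

Keeping the reason next to each candidate means the evaluating model sees why a document was picked, not only that it was picked.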
The Results: Cost Tracking
I added cost tracking to see the actual savings:
    # USD per million tokens
    COSTS = {
        "claude-3-5-haiku-latest": {"input": 0.80, "output": 4.00},
        "claude-3-5-sonnet-latest": {"input": 3.00, "output": 15.00},
        "claude-3-opus-latest": {"input": 15.00, "output": 75.00},
    }

    def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """Calculate cost in USD for a model call."""
        return (
            (input_tokens / 1_000_000) * COSTS[model]["input"] +
            (output_tokens / 1_000_000) * COSTS[model]["output"]
        )

After a week of running my routed agent, here’s what I found:
    Phase           | Old Model | Old Cost | New Model | New Cost
    ----------------|-----------|----------|-----------|---------
    Scan (100K)     | Opus      | $1.50    | Haiku     | $0.08
    Filter (50K)    | Opus      | $0.75    | Haiku     | $0.04
    Classify (20K)  | Opus      | $0.30    | Sonnet    | $0.06
    Evaluate (10K)  | Opus      | $0.15    | Opus      | $0.15
    Analyze (5K)    | Opus      | $0.08    | Opus      | $0.08
    ----------------|-----------|----------|-----------|---------
    TOTAL           |           | $2.78    |           | $0.41

    Savings: 85%

My nightly cost dropped from ~$2.78 to ~$0.41. Over a month, that’s $83 vs $12.
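The table's figures can be reproduced with a few lines of arithmetic. This sketch assumes the per-phase numbers are input tokens priced at the input rates listed earlier (the figures match those rates, so output tokens are not included); the `PHASES` list and `input_cost` helper are names made up here.

```python
# (phase, input tokens, old model, new model) as listed in the cost table.
PHASES = [
    ("scan", 100_000, "opus", "haiku"),
    ("filter", 50_000, "opus", "haiku"),
    ("classify", 20_000, "opus", "sonnet"),
    ("evaluate", 10_000, "opus", "opus"),
    ("analyze", 5_000, "opus", "opus"),
]
INPUT_PRICE = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}  # USD per million tokens


def input_cost(tokens: int, model: str) -> float:
    """Input-token cost in USD for one phase."""
    return tokens / 1_000_000 * INPUT_PRICE[model]


old_total = sum(input_cost(tokens, old) for _, tokens, old, _ in PHASES)
new_total = sum(input_cost(tokens, new) for _, tokens, _, new in PHASES)
savings = 1 - new_total / old_total
print(f"old ${old_total:.3f}, new ${new_total:.3f}, savings {savings:.1%}")
```

The unrounded totals are $2.775 and $0.405, which agree with the table once each row is rounded to cents, and the ratio works out to a little over 85%.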
The Four-Phase Routing Pattern
The pattern that worked best for me follows four phases:
    ┌──────────────────────────────────────────────────────────────┐
    │                      DOCUMENT PIPELINE                       │
    └──────────────────────────────────────────────────────────────┘
              │
              ▼
    ┌───────────────────┐     ┌─────────────────────────────────────┐
    │ PHASE 1: SCAN     │     │ Model: Haiku (1/20th cost)          │
    │                   │────▶│ Task: Filter relevant docs          │
    │ High volume       │     │ Speed: Fast                         │
    └───────────────────┘     └─────────────────────────────────────┘
              │
              ▼
    ┌───────────────────┐     ┌─────────────────────────────────────┐
    │ PHASE 2: REFLECT  │     │ Model: Sonnet (1/5th cost)          │
    │                   │────▶│ Task: Analyze patterns              │
    │ Medium volume     │     │ Speed: Moderate                     │
    └───────────────────┘     └─────────────────────────────────────┘
              │
              ▼
    ┌───────────────────┐     ┌─────────────────────────────────────┐
    │ PHASE 3: RESEARCH │     │ Model: Sonnet                       │
    │                   │────▶│ Task: Gather context                │
    │ Medium volume     │     │ Speed: Moderate                     │
    └───────────────────┘     └─────────────────────────────────────┘
              │
              ▼
    ┌───────────────────┐     ┌─────────────────────────────────────┐
    │ PHASE 4: EVALUATE │     │ Model: Opus (full price)            │
    │                   │────▶│ Task: Final judgment                │
    │ Low volume        │     │ Speed: Thorough                     │
    └───────────────────┘     └─────────────────────────────────────┘

Each phase decreases in volume but increases in complexity. Haiku handles 80% of the tokens, Opus handles 5%.
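One way to express the four phases in code is as an ordered list that a small runner walks through, passing each phase's output to the next. This is a sketch under assumptions: the prompt templates, `run_pipeline`, and the `fake_call` stand-in are invented for illustration, and a real run would replace `fake_call` with an actual API call.

```python
from typing import Callable

# (phase name, model alias, prompt template). The templates are illustrative only.
PHASES = [
    ("scan", "claude-3-5-haiku-latest", "Filter relevant docs:\n{payload}"),
    ("reflect", "claude-3-5-sonnet-latest", "Analyze patterns in:\n{payload}"),
    ("research", "claude-3-5-sonnet-latest", "Gather context for:\n{payload}"),
    ("evaluate", "claude-3-opus-latest", "Give a final judgment on:\n{payload}"),
]


def run_pipeline(
    payload: str, call_model: Callable[[str, str], str]
) -> list[tuple[str, str, str]]:
    """Run the phases in order, feeding each phase's output into the next."""
    trace = []
    for name, model, template in PHASES:
        payload = call_model(model, template.format(payload=payload))
        trace.append((name, model, payload))
    return trace


def fake_call(model: str, prompt: str) -> str:
    """Offline stand-in for a real API call, so the sketch runs without a key."""
    return f"[{model}] saw {len(prompt)} chars"


trace = run_pipeline("doc A\ndoc B", fake_call)
for name, model, output in trace:
    print(name, "->", output)
```

Injecting `call_model` keeps the phase order and model assignment testable on their own, separate from network calls and billing.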
Key Lessons Learned
1. Start with Haiku
My first mistake was assuming I needed Opus for everything. Now I start with Haiku and only upgrade if the task fails. Most simple classification and filtering tasks work fine with cheaper models.
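"Start cheap and upgrade only on failure" can be sketched as an escalation loop. Everything here is an assumption for illustration: the tier names, `run_with_escalation`, the acceptance check, and the `fake_call` stub that pretends the cheapest tier gives an unusable answer.

```python
from typing import Callable, Optional

TIERS = ("haiku", "sonnet", "opus")  # cheapest first


def run_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
    tiers: tuple[str, ...] = TIERS,
) -> tuple[str, Optional[str]]:
    """Try each tier in order and return (tier, answer) for the first
    acceptable answer, or (last tier, None) if no tier passes the check."""
    for tier in tiers:
        answer = call_model(tier, prompt)
        if is_acceptable(answer):
            return tier, answer
    return tiers[-1], None


def fake_call(tier: str, prompt: str) -> str:
    """Offline stub: the cheapest tier returns something outside the label set."""
    return {"haiku": "not sure", "sonnet": "billing", "opus": "billing"}[tier]


allowed = {"billing", "support", "sales"}
tier, answer = run_with_escalation(
    "Classify: 'refund my invoice'", fake_call, lambda a: a in allowed
)
print(tier, answer)  # sonnet billing
```

The check only works when "failure" is machine-detectable, for example an answer outside a fixed label set or JSON that does not parse. Subtle quality problems still need the kind of sampling described in the lessons that follow.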
2. Context Must Flow
The shared context pattern is essential. Without it, each phase starts from scratch, and you lose the benefits of routing. I use a simple dictionary, but a database or vector store works for larger systems.
3. Measure Everything
I didn’t know I was overpaying until I added cost tracking. Now I log model, tokens, and cost for every call:
    import logging

    logger = logging.getLogger(__name__)

    def log_model_call(model: str, input_tokens: int, output_tokens: int, cost: float):
        logger.info(f"Model: {model}, In: {input_tokens}, Out: {output_tokens}, Cost: ${cost:.4f}")

4. Test Quality at Each Tier
Before routing to Haiku, I tested it on sample tasks. I found it handled filtering and classification well, but struggled with nuanced evaluation. That’s fine—that’s what Opus is for.
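A tier test like this can be as simple as measuring agreement against a handful of hand-labelled samples. The sketch below is an assumption-laden illustration: the samples are made up, `fake_cheap_model` is a keyword stub standing in for a real call to the cheap tier, and the 0.9 threshold is arbitrary.

```python
from typing import Callable


def agreement_rate(
    samples: list[tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of (text, expected_label) samples where `predict` matches."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(
        1 for text, expected in samples
        if predict(text).strip().lower() == expected
    )
    return hits / len(samples)


# Hand-labelled samples (made up for illustration).
SAMPLES = [
    ("Invoice overdue, please pay", "relevant"),
    ("Weekly cafeteria menu", "irrelevant"),
    ("Contract renewal terms attached", "relevant"),
    ("Happy birthday, Sam!", "irrelevant"),
]


def fake_cheap_model(text: str) -> str:
    """Keyword stub standing in for a call to the cheap tier."""
    keywords = ("invoice", "contract")
    return "relevant" if any(k in text.lower() for k in keywords) else "irrelevant"


rate = agreement_rate(SAMPLES, fake_cheap_model)
MIN_AGREEMENT = 0.9  # hypothetical threshold
print(f"agreement {rate:.0%}; route to cheap tier: {rate >= MIN_AGREEMENT}")
```

Running the same samples through each tier gives a per-task agreement number, which is a firmer basis for a routing decision than price alone.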
5. Complexity Scoring Helps
I added a simple complexity score (0-1) to my tasks. Tasks below 0.3 go to Haiku, 0.3-0.7 to Sonnet, above 0.7 to Opus. This automated much of the routing logic:
    def estimate_complexity(task: str) -> float:
        """Estimate task complexity based on keywords and length."""
        complex_keywords = ["analyze", "evaluate", "reason", "decide", "compare"]
        simple_keywords = ["filter", "classify", "scan", "extract", "summarize"]

        task_lower = task.lower()
        for keyword in complex_keywords:
            if keyword in task_lower:
                return 0.8
        for keyword in simple_keywords:
            if keyword in task_lower:
                return 0.2
        return 0.5  # Default to middle

What Didn’t Work
I tried a few approaches that failed:
- Dynamic routing based on response time: Too unpredictable
- Automatic model selection via API: Anthropic doesn’t support this directly
- Cost-based routing without quality checks: Haiku sometimes missed nuances that Opus caught
The lesson: routing is optimization, not automation. You still need to understand what each model is good at.
When to Use Each Model
Here’s my decision matrix:
Use Haiku when:
- High volume, low stakes
- Simple classification
- Pattern matching
- Content filtering
- Quick summaries

Use Sonnet when:
- Medium complexity
- Multi-step reasoning
- Code generation
- Detailed analysis
- Content creation

Use Opus when:
- Complex reasoning required
- Final judgments needed
- Nuanced evaluation
- Cross-domain synthesis
- Quality over speed

The Bottom Line
Model routing reduced my AI agent costs by 85% without sacrificing quality. The key principles:
- Analyze your pipeline - Most tasks don’t need the most expensive model
- Start with Haiku - Test cheaper models before assuming you need more
- Reserve Opus for judgment - Complex reasoning benefits from top-tier models
- Design for context sharing - Plan how information flows between models
- Measure and iterate - Track costs and quality to optimize routing
The ~$0.40/night cost I saw in that Reddit post? It’s real. You just need to be intentional about which model does what.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 My OpenClaw agent dreams at night - and wakes up smarter
- 👨💻 Anthropic API Pricing
- 👨💻 Claude Models Documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!