How Can Model Routing Reduce AI Agent Costs by 70% or More?

Mar 30, 2026

I built an AI agent that runs every night. The first week, my API bill hit $52. That’s when I realized I was doing something wrong.

The culprit? I was using Claude Opus for everything—scanning documents, filtering content, making simple classifications. It was like hiring a senior architect to sweep the floors.

The Problem: One Model for Everything

My original agent pipeline looked like this:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   100 docs  │────▶│    OPUS     │────▶│   Results   │
│   (scan)    │     │  ($15/M)    │     │             │
└─────────────┘     └─────────────┘     └─────────────┘

Cost: ~100K input tokens × $15/M = $1.50 per run
Monthly: $1.50 × 30 days = $45/month

I treated every task the same, regardless of complexity. Scanning 100 documents to find 5 relevant ones? Opus. Classifying text into categories? Opus. Writing a summary? You guessed it—Opus.

The math was brutal:

Opus input: $15 per million tokens
Opus output: $75 per million tokens
My nightly runs: 100K-150K tokens
Weekly cost: $50+

I needed a different approach.

The Solution: Model Routing

Model routing is exactly what it sounds like: sending tasks to different models based on what the task requires. The key insight came from a Reddit post where someone mentioned their agent costs ~$0.40/night because “Haiku scans, Opus judges.”

That phrase stuck with me. What if I used cheaper models for the volume work and saved the expensive models for the judgment calls?

Here’s the model pricing comparison:

Model      | Input  | Output | Ratio to Opus
-----------|--------|--------|---------------
Haiku      | $0.80  | $4.00  | 1/20th
Sonnet     | $3.00  | $15.00 | 1/5th
Opus       | $15.00 | $75.00 | 1x

Haiku is 20x cheaper than Opus. If 80% of my tasks could use Haiku, my costs would plummet.

My First Routing Attempt

I started by categorizing my tasks:

Task Type         | Volume | Complexity | Model Choice
------------------|--------|------------|-------------
Scan documents    | High   | Low        | Haiku
Filter content    | High   | Low        | Haiku
Classify topics   | Medium | Medium     | Sonnet
Summarize         | Medium | Medium     | Sonnet
Final evaluation  | Low    | High       | Opus
Deep analysis     | Low    | High       | Opus

Then I built a router:

from anthropic import Anthropic
from enum import Enum
from dataclasses import dataclass

class ModelType(Enum):
    HAIKU = "claude-3-5-haiku-latest"
    SONNET = "claude-3-5-sonnet-latest"
    OPUS = "claude-3-opus-latest"

@dataclass
class Task:
    task_type: str
    content: str
    complexity: float  # 0.0 to 1.0

class ModelRouter:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)

    def select_model(self, task: Task) -> ModelType:
        """Select optimal model based on task characteristics."""
        if task.task_type in ["scan", "filter", "classify"]:
            return ModelType.HAIKU
        elif task.task_type == "evaluate" and task.complexity > 0.7:
            return ModelType.OPUS
        return ModelType.SONNET

    def execute(self, task: Task) -> str:
        """Execute task with optimal model."""
        model = self.select_model(task)
        response = self.client.messages.create(
            model=model.value,
            max_tokens=2048,
            messages=[{"role": "user", "content": task.content}]
        )
        return response.content[0].text

This worked, but I quickly ran into a problem.

The Context Problem

When I switched models between phases, I was losing context. My scan phase would find relevant documents with Haiku, but then Opus had no memory of why Haiku selected them.

A comment on Reddit echoed this concern:

“How do you handle context preservation when switching models? Is there context lost during the transitions?”

I had to solve this. My solution: a shared context store.

class RoutedAgent:
    """
    Agent that routes tasks to appropriate models.
    Uses shared context to preserve information between phases.
    """

    def __init__(self, client: Anthropic):
        self.client = client
        self.context = {
            "candidates": [],
            "scan_metadata": {},
            "decisions": []
        }

    def scan(self, documents: list[str]) -> list[str]:
        """Fast, cheap scanning with Haiku."""
        response = self.client.messages.create(
            model="claude-3-5-haiku-latest",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": f"Filter relevant documents. Return JSON with keys: selected, reasons.\n\nDocuments: {documents}"
            }]
        )
        # Parse and store in shared context
        result = self._parse_json(response.content[0].text)
        self.context["candidates"] = result["selected"]
        self.context["scan_metadata"]["reasons"] = result["reasons"]
        return result["selected"]

    def evaluate(self, candidates: list[str]) -> str:
        """Deep evaluation with Opus."""
        # Build context from previous phases
        context_summary = f"""
        Previous scan selected {len(self.context['candidates'])} candidates.
        Selection reasons: {self.context['scan_metadata']['reasons']}

        Now evaluate these candidates for final decision.
        """

        response = self.client.messages.create(
            model="claude-3-opus-latest",
            max_tokens=2048,
            system="You are a careful evaluator. Consider all context from previous phases.",
            messages=[{
                "role": "user",
                "content": f"{context_summary}\n\nCandidates: {candidates}"
            }]
        )
        return response.content[0].text

    def run(self, documents: list[str]) -> str:
        # Phase 1: Scan with Haiku (cheap, fast)
        candidates = self.scan(documents)

        # Phase 2: Evaluate with Opus (expensive, thorough)
        result = self.evaluate(candidates)

        return result

The pattern is simple: each phase writes to a shared context, and subsequent phases can read from it. This way, Opus knows why Haiku made its selections.

The Results: Cost Tracking

I added cost tracking to see the actual savings:

COSTS = {
    "claude-3-5-haiku-latest": {"input": 0.80, "output": 4.00},
    "claude-3-5-sonnet-latest": {"input": 3.00, "output": 15.00},
    "claude-3-opus-latest": {"input": 15.00, "output": 75.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Calculate cost in USD for a model call."""
    return (
        (input_tokens / 1_000_000) * COSTS[model]["input"] +
        (output_tokens / 1_000_000) * COSTS[model]["output"]
    )

After a week of running my routed agent, here’s what I found:

Phase           | Old Model | Old Cost | New Model | New Cost
----------------|-----------|----------|-----------|---------
Scan (100K)     | Opus      | $1.50    | Haiku     | $0.08
Filter (50K)    | Opus      | $0.75    | Haiku     | $0.04
Classify (20K)  | Opus      | $0.30    | Sonnet    | $0.06
Evaluate (10K)  | Opus      | $0.15    | Opus      | $0.15
Analyze (5K)    | Opus      | $0.08    | Opus      | $0.08
----------------|-----------|----------|-----------|---------
TOTAL           |           | $2.78    |           | $0.41

Savings: 85%

My nightly cost dropped from ~$2.78 to ~$0.41. Over a month, that’s $83 vs $12.

The Four-Phase Routing Pattern

The pattern that worked best for me follows four phases:

┌──────────────────────────────────────────────────────────────┐
│                     DOCUMENT PIPELINE                        │
└──────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────────────────────────┐
│  PHASE 1: SCAN  │     │  Model: Haiku (1/20th cost)         │
│                 │────▶│  Task: Filter relevant docs         │
│  High volume    │     │  Speed: Fast                        │
└─────────────────┘     └─────────────────────────────────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────────────────────────┐
│ PHASE 2: REFLECT│     │  Model: Sonnet (1/5th cost)          │
│                 │────▶│  Task: Analyze patterns             │
│  Medium volume  │     │  Speed: Moderate                    │
└─────────────────┘     └─────────────────────────────────────┘
         │
         ▼
┌──────────────────┐    ┌─────────────────────────────────────┐
│ PHASE 3: RESEARCH │    │  Model: Sonnet                      │
│                  │───▶│  Task: Gather context               │
│  Medium volume   │    │  Speed: Moderate                    │
└──────────────────┘    └─────────────────────────────────────┘
         │
         ▼
┌───────────────────┐   ┌─────────────────────────────────────┐
│ PHASE 4: EVALUATE │   │  Model: Opus (full price)           │
│                   │──▶│  Task: Final judgment               │
│  Low volume       │   │  Speed: Thorough                    │
└───────────────────┘   └─────────────────────────────────────┘

Each phase decreases in volume but increases in complexity. Haiku handles 80% of the tokens, Opus handles 5%.

Key Lessons Learned

1. Start with Haiku

My first mistake was assuming I needed Opus for everything. Now I start with Haiku and only upgrade if the task fails. Most simple classification and filtering tasks work fine with cheaper models.

2. Context Must Flow

The shared context pattern is essential. Without it, each phase starts from scratch, and you lose the benefits of routing. I use a simple dictionary, but a database or vector store works for larger systems.

3. Measure Everything

I didn’t know I was overpaying until I added cost tracking. Now I log model, tokens, and cost for every call:

import logging

logger = logging.getLogger(__name__)

def log_model_call(model: str, input_tokens: int, output_tokens: int, cost: float):
    logger.info(f"Model: {model}, In: {input_tokens}, Out: {output_tokens}, Cost: ${cost:.4f}")

4. Test Quality at Each Tier

Before routing to Haiku, I tested it on sample tasks. I found it handled filtering and classification well, but struggled with nuanced evaluation. That’s fine—that’s what Opus is for.

5. Complexity Scoring Helps

I added a simple complexity score (0-1) to my tasks. Tasks below 0.3 go to Haiku, 0.3-0.7 to Sonnet, above 0.7 to Opus. This automated much of the routing logic:

def estimate_complexity(task: str) -> float:
    """Estimate task complexity based on keywords and length."""
    complex_keywords = ["analyze", "evaluate", "reason", "decide", "compare"]
    simple_keywords = ["filter", "classify", "scan", "extract", "summarize"]

    task_lower = task.lower()
    for keyword in complex_keywords:
        if keyword in task_lower:
            return 0.8
    for keyword in simple_keywords:
        if keyword in task_lower:
            return 0.2
    return 0.5  # Default to middle

What Didn’t Work

I tried a few approaches that failed:

Dynamic routing based on response time: Too unpredictable
Automatic model selection via API: Anthropic doesn’t support this directly
Cost-based routing without quality checks: Haiku sometimes missed nuances that Opus caught

The lesson: routing is optimization, not automation. You still need to understand what each model is good at.

When to Use Each Model

Here’s my decision matrix:

Use Haiku when:
  - High volume, low stakes
  - Simple classification
  - Pattern matching
  - Content filtering
  - Quick summaries

Use Sonnet when:
  - Medium complexity
  - Multi-step reasoning
  - Code generation
  - Detailed analysis
  - Content creation

Use Opus when:
  - Complex reasoning required
  - Final judgments needed
  - Nuanced evaluation
  - Cross-domain synthesis
  - Quality over speed

The Bottom Line

Model routing reduced my AI agent costs by 85% without sacrificing quality. The key principles:

Analyze your pipeline - Most tasks don’t need the most expensive model
Start with Haiku - Test cheaper models before assuming you need more
Reserve Opus for judgment - Complex reasoning benefits from top-tier models
Design for context sharing - Plan how information flows between models
Measure and iterate - Track costs and quality to optimize routing

The ~$0.40/night cost I saw in that Reddit post? It’s real. You just need to be intentional about which model does what.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 My OpenClaw agent dreams at night - and wakes up smarter
👨‍💻 Anthropic API Pricing
👨‍💻 Claude Models Documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!