How to Reduce API Costs When Using AI Agents: A Practical Guide

Mar 11, 2026

Problem

I ran OpenClaw with 5 agents for 3 weeks. Then I hit the API usage limit. The culprit? I sent huge instructions to all 5 agents for every task.

My setup:

1 main coordinator agent
4 specialized subagents
Each agent received ~2000 tokens of instructions per task

At 100 tasks per day, that’s 5 agents × 2,000 tokens × 100 tasks = 1,000,000 input tokens per day just for instructions. With Claude Sonnet 4.5 pricing ($3 per million input tokens), that is about $3 per day, so I burned through $90/month on instructions alone. And that’s before any actual work happened.
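The instruction overhead can be checked with a few lines of arithmetic. This is a minimal sketch using the figures from the setup above; the 30-day month and the function name are assumptions made for illustration.

```python
# Figures from the setup above; the 30-day month is an assumption.
AGENTS = 5
INSTRUCTION_TOKENS = 2_000
TASKS_PER_DAY = 100
PRICE_PER_MILLION_INPUT = 3.00  # USD, the Sonnet input price quoted above
DAYS_PER_MONTH = 30


def instruction_overhead(agents, tokens_each, tasks_per_day, price_per_million, days):
    """Return (tokens per day, USD per day, USD per month) spent on instructions."""
    tokens_per_day = agents * tokens_each * tasks_per_day
    cost_per_day = tokens_per_day / 1_000_000 * price_per_million
    return tokens_per_day, cost_per_day, cost_per_day * days


tokens, daily, monthly = instruction_overhead(
    AGENTS, INSTRUCTION_TOKENS, TASKS_PER_DAY, PRICE_PER_MILLION_INPUT, DAYS_PER_MONTH
)
print(f"{tokens:,} tokens/day, ${daily:.2f}/day, ${monthly:.2f}/month")
# 1,000,000 tokens/day, $3.00/day, $90.00/month
```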

I needed a different approach.

Why Multi-Agent Systems Burn Money

Multi-agent systems multiply costs in ways single-agent systems don’t.

Token multiplication:

Single agent: 1 API call per task
5 agents: 5-25 API calls per complex task

Each agent makes multiple calls. They share context. They send messages to each other. The token count explodes.

Context accumulation:

Task 1: 1,000 tokens context
Task 2: 1,000 + Task 1 response = 2,500 tokens
Task 3: 2,500 + Task 2 response = 4,500 tokens

Context grows without management. Every task costs more than the last.
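To see how quickly unmanaged context adds up, here is a small simulation. It is only a sketch: it assumes every response is a fixed 1,500 tokens and is appended in full to the next task's context, which is simpler than the varying sizes in the example above.

```python
def cumulative_input_tokens(num_tasks, base_context=1_000, response_tokens=1_500):
    """Input tokens sent per task when every earlier response stays in context."""
    per_task = []
    context = base_context
    for _ in range(num_tasks):
        per_task.append(context)
        context += response_tokens  # the response is carried into the next task
    return per_task


sizes = cumulative_input_tokens(5)
print(sizes)       # [1000, 2500, 4000, 5500, 7000]
print(sum(sizes))  # 20000
```

Per-task input grows linearly, so the total input billed over a session grows roughly with the square of the number of tasks.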

Strategy 1: Caching (30-70% Savings)

I started with the simplest fix: stop making the same API calls repeatedly.

Here’s what I implemented:

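import hashlib
import time
from typing import Optional

class AgentCache:
    def __init__(self):
        self.cache = {}  # key -> (response, expiry time); use Redis in production
        self.hits = 0     # number of lookups served from the cache
        self.lookups = 0  # total number of lookups

    def get_cached_response(self, prompt: str, agent_id: str) -> Optional[str]:
        key = self._hash_key(prompt, agent_id)
        self.lookups += 1
        entry = self.cache.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if time.time() >= expires_at:
            del self.cache[key]  # entry is older than its TTL
            return None
        self.hits += 1
        return response

    def cache_response(self, prompt: str, agent_id: str, response: str, ttl: int = 3600):
        key = self._hash_key(prompt, agent_id)
        self.cache[key] = (response, time.time() + ttl)  # ttl is in seconds

    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0

    def _hash_key(self, prompt: str, agent_id: str) -> str:
        # Include the agent ID so agents with different roles never share answers
        content = f"{agent_id}:{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()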

What to cache:

Content Type	Cache Duration	Savings
Common queries across agents	1 hour	High
Tool usage patterns	24 hours	Medium
FAQ-style responses	Permanent	High
System prompt results	Session	Medium
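The durations in the table can be wired into a small get-or-compute helper. The sketch below is illustrative only: the content-type names, the `TTLCache` class, and the 30-minute stand-in for a "session" are all assumptions, and `fake_model` stands in for a real API call.

```python
import time

# Durations taken from the table above; None means "never expires".
# Modeling a "session" as 30 minutes is an assumption.
TTL_SECONDS = {
    "common_query": 3600,
    "tool_pattern": 24 * 3600,
    "faq": None,
    "system_prompt": 30 * 60,
}


class TTLCache:
    def __init__(self, clock=time.time):
        self._store = {}
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get_or_compute(self, key, content_type, compute):
        """Return a fresh cached value, or call compute() and cache its result."""
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if expires_at is None or self._clock() < expires_at:
                return value
        value = compute()
        ttl = TTL_SECONDS[content_type]
        expires_at = None if ttl is None else self._clock() + ttl
        self._store[key] = (value, expires_at)
        return value


calls = []


def fake_model():
    """Stand-in for a paid API call; records each invocation."""
    calls.append(1)
    return "answer"


cache = TTLCache()
cache.get_or_compute("q1", "common_query", fake_model)
cache.get_or_compute("q1", "common_query", fake_model)
print(len(calls))  # 1: the second lookup never reached the model
```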

I also added embedding caching for my RAG agents:

class RAGCache:
    """Cache embeddings to avoid regeneration"""

    def __init__(self, embedding_model):
        self.embedding_cache = {}
        self.model = embedding_model

    def get_embedding(self, text: str) -> list:
        text_hash = hashlib.md5(text.encode()).hexdigest()
        if text_hash in self.embedding_cache:
            return self.embedding_cache[text_hash]

        embedding = self.model.embed(text)
        self.embedding_cache[text_hash] = embedding
        return embedding

Document embeddings are cached permanently. Query embeddings are cached only for frequently used queries. This saved me 40% on RAG agent costs.

Strategy 2: Model Routing (50-80% Savings)

I was using my most expensive model for everything. That’s like hiring a senior architect to write boilerplate code.

I built a router to match models to task complexity:

class ModelRouter:
    """Route agent tasks to appropriate models"""

    MODEL_COSTS = {  # USD per 1K tokens
        "gpt-5": {"input": 0.015, "output": 0.06},
        "claude-sonnet-4.5": {"input": 0.003, "output": 0.015},
        "claude-haiku-3.5": {"input": 0.00025, "output": 0.00125},
        "local-llama-4": {"input": 0.0, "output": 0.0}
    }

    def route_task(self, task_type: str, complexity: str) -> str:
        routing_matrix = {
            ("classification", "simple"): "claude-haiku-3.5",
            ("classification", "moderate"): "claude-sonnet-4.5",
            ("extraction", "simple"): "claude-haiku-3.5",
            ("extraction", "complex"): "claude-sonnet-4.5",
            ("reasoning", "simple"): "claude-sonnet-4.5",
            ("reasoning", "complex"): "gpt-5",
            ("code", "generation"): "claude-sonnet-4.5",
            ("code", "review"): "local-llama-4",
            ("creative", "any"): "gpt-5"
        }

        return routing_matrix.get((task_type, complexity), "claude-sonnet-4.5")

Agent-to-model mapping:

Research Agent (high-volume searches) -> Local Llama 4
Writer Agent (creative tasks) -> Claude Sonnet 4.5
Code Agent (routine reviews) -> Local model
Memory Agent (classification) -> Claude Haiku 3.5
Coordinator Agent (complex reasoning) -> GPT-5

Cost comparison:

Approach	Monthly Cost	Savings
All tasks with GPT-5	$100	Baseline
Smart routing	$25	75%
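How a figure like this comes about depends on the traffic mix. The sketch below computes a blended input price from the per-1K prices in MODEL_COSTS; the traffic shares in `ROUTED_MIX` are hypothetical, and output tokens and local hardware costs are ignored.

```python
# USD per 1K input tokens, copied from MODEL_COSTS above.
INPUT_PRICE_PER_1K = {
    "gpt-5": 0.015,
    "claude-sonnet-4.5": 0.003,
    "claude-haiku-3.5": 0.00025,
    "local-llama-4": 0.0,
}

# Hypothetical share of monthly input tokens handled by each model after routing.
ROUTED_MIX = {
    "gpt-5": 0.15,
    "claude-sonnet-4.5": 0.35,
    "claude-haiku-3.5": 0.30,
    "local-llama-4": 0.20,
}


def blended_price(mix, prices):
    """Average price per 1K tokens for a given traffic mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("Mix shares must sum to 1.0")
    return sum(share * prices[model] for model, share in mix.items())


baseline = INPUT_PRICE_PER_1K["gpt-5"]
routed = blended_price(ROUTED_MIX, INPUT_PRICE_PER_1K)
savings = 1 - routed / baseline
print(f"Blended: ${routed:.6f} per 1K tokens, savings vs. all GPT-5: {savings:.1%}")
```

With this made-up mix the savings land at roughly three quarters, in the same range as the table. A different mix will give a different number, so it is worth plugging in your own measured shares.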

I also implemented model cascading - try cheaper models first, escalate if needed:

class CascadingModelCaller:
    """Try cheap models first, escalate on failure"""

    def __init__(self):
        self.models = [
            ("local-llama-4", self._call_local),
            ("claude-haiku-3.5", self._call_haiku),
            ("claude-sonnet-4.5", self._call_sonnet)
        ]
        self.confidence_threshold = 0.85

    async def call_with_cascade(self, prompt: str) -> tuple:
        for model_name, call_func in self.models:
            response, confidence = await call_func(prompt)

            if confidence >= self.confidence_threshold:
                return response, model_name

        return await self._fallback_response(prompt)

In my setup, cheaper models handle 60-70% of queries. Answers below the confidence threshold escalate to the next model, so quality holds up.
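A runnable version of the cascade idea, with stub callers in place of real model clients, might look like the following. Everything here is a sketch: the stub functions and their fixed confidence values are hypothetical, and how a confidence score is obtained from a real model is left open.

```python
import asyncio

CONFIDENCE_THRESHOLD = 0.85


# Stub callers standing in for real model clients (hypothetical).
# Each returns (response, confidence).
async def call_local(prompt):
    return "local answer", 0.60


async def call_haiku(prompt):
    return "haiku answer", 0.90


async def call_sonnet(prompt):
    return "sonnet answer", 0.95


async def cascade(prompt, models, threshold=CONFIDENCE_THRESHOLD):
    """Return (response, model_name, attempts) from the first confident model."""
    last = None
    for attempts, (name, call) in enumerate(models, start=1):
        response, confidence = await call(prompt)
        last = (response, name, attempts)
        if confidence >= threshold:
            return last
    return last  # nobody was confident: keep the last (strongest) model's answer


MODELS = [
    ("local-llama-4", call_local),
    ("claude-haiku-3.5", call_haiku),
    ("claude-sonnet-4.5", call_sonnet),
]

result = asyncio.run(cascade("Classify this ticket", MODELS))
print(result)  # ('haiku answer', 'claude-haiku-3.5', 2)
```

Note that an escalated query pays for every failed attempt as well as the final one, so cascading only saves money when most queries stop at the first or second tier.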

Strategy 3: Prompt Optimization (20-40% Savings)

I reviewed my prompts. They were verbose.

Before (250 tokens):

You are a helpful AI assistant that specializes in analyzing customer feedback.
Your role is to carefully read through customer reviews and identify the main
themes, sentiment, and actionable insights. Please make sure to provide a
structured response that includes a summary, key points, and recommendations.

After (75 tokens):

Analyze feedback. Return: themes, sentiment, insights, recommendations.
Structure: summary, key_points, actions.

That’s 175 tokens saved per call. At 1,000 calls and $3 per million input tokens, that’s 175,000 tokens, or about $0.53 in savings.

Prompt compression approach:

class PromptOptimizer:
    def compress_prompt(self, prompt: str) -> str:
        """Reduce prompt tokens while preserving meaning"""
        # Remove redundancy
        prompt = self._remove_redundancy(prompt)
        # Use bullet points
        prompt = self._convert_to_bullets(prompt)
        # Abbreviate where context is clear
        prompt = self._abbreviate(prompt)
        return prompt
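The helper methods above are not shown, so here is one possible take on the redundancy-removal step. It is a sketch under assumptions: the filler-phrase list is hypothetical and should be tuned to your own prompts, and character length is used only as a rough proxy for token count.

```python
import re

# Hypothetical filler phrases; longer phrases come first so they match whole.
FILLER_PATTERNS = [
    r"\bplease make sure to\b",
    r"\bplease\b",
    r"\bcarefully\b",
]


def remove_filler(prompt: str) -> str:
    """Strip filler phrases, then collapse the leftover whitespace."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", prompt).strip()


before = "Please make sure to carefully read the reviews."
after = remove_filler(before)
print(after)                    # read the reviews.
print(len(before), len(after))  # 47 17
```

Rule-based stripping like this is blunt. It is worth comparing agent output before and after compression on a sample of real tasks.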

I also fixed my context management. I was sending the entire conversation history every time.

Dynamic context selection:

class ContextManager:
    """Select relevant context for each agent call"""

    def __init__(self, max_context_tokens: int = 4000):
        self.max_tokens = max_context_tokens
        self.conversation_history = []

    def select_relevant_context(self, current_query: str) -> list:
        scored_messages = []
        for msg in self.conversation_history:
            score = self._calculate_relevance(current_query, msg)
            scored_messages.append((score, msg))

        scored_messages.sort(reverse=True, key=lambda x: x[0])

        selected_context = []
        token_count = 0

        for score, msg in scored_messages:
            msg_tokens = self._count_tokens(msg)
            if token_count + msg_tokens <= self.max_tokens:
                selected_context.append(msg)
                token_count += msg_tokens

        return selected_context
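The class above relies on `_calculate_relevance` and `_count_tokens`, which are not shown. One simple way to fill them in is word overlap plus a whitespace word count, sketched below as standalone functions. Both are stand-ins chosen for illustration; embedding similarity and the provider's tokenizer would be more accurate.

```python
import re


def tokenize(text: str) -> set:
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def calculate_relevance(query: str, message: str) -> float:
    """Jaccard overlap between the word sets of query and message (0.0 to 1.0)."""
    q, m = tokenize(query), tokenize(message)
    if not q or not m:
        return 0.0
    return len(q & m) / len(q | m)


def count_tokens(message: str) -> int:
    """Whitespace word count as a stand-in for a real tokenizer."""
    return len(message.split())


def select_context(query, history, max_tokens):
    """Greedily keep the most relevant messages that fit in the token budget."""
    ranked = sorted(history, key=lambda msg: calculate_relevance(query, msg), reverse=True)
    selected, used = [], 0
    for msg in ranked:
        cost = count_tokens(msg)
        if used + cost <= max_tokens:
            selected.append(msg)
            used += cost
    return selected


history = [
    "The refund policy allows returns within 30 days",
    "Lunch menu for Friday",
    "Refund requests need an order number",
]
print(select_context("What is the refund policy?", history, max_tokens=14))
# ['The refund policy allows returns within 30 days', 'Refund requests need an order number']
```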

Agent memory optimization:

Bad: Send full memory to every agent (10,000+ tokens)
Good: Send only relevant memory to each agent (500-1000 tokens)

Strategy 4: Output Control (20-40% Savings)

I stopped letting agents ramble.

# Always set max_tokens
response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=500,  # Hard cap: longer output is cut off, so also ask for brevity in the prompt
    messages=[{"role": "user", "content": query}]
)

Cost impact:

Setting	Output Tokens	Cost per Call
No limit	~1500	$0.0225
With limit (500)	~400	$0.006
Savings	73%	-
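Because max_tokens truncates rather than shortens, it helps to detect when a reply hit the cap. The sketch below checks the `stop_reason` field that the Anthropic Messages API returns, using `SimpleNamespace` objects as stand-ins for real responses, and reproduces the per-call costs from the table at the $15 per million output price used in this article. The helper names are assumptions.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True when generation stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"


def output_cost(output_tokens: int, price_per_million: float = 15.0) -> float:
    """USD cost of the output tokens for one call."""
    return output_tokens / 1_000_000 * price_per_million


# Stand-ins for API responses; a real response object exposes stop_reason too.
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")

print(was_truncated(complete), was_truncated(cut_off))  # False True
print(f"${output_cost(1500):.4f} vs ${output_cost(400):.4f}")  # $0.0225 vs $0.0060
```

If truncation shows up often, the limit is too tight for that task or the prompt needs a stronger instruction to be brief.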

I also switched to JSON responses instead of prose:

Prose response (500 tokens):

Based on my analysis of the customer feedback, I found that there are several
key themes that emerged. The first theme is about product quality, where
customers mentioned...

JSON response (150 tokens):

{
  "themes": ["quality", "price", "support"],
  "sentiment": "positive",
  "recommendations": ["Improve docs", "Add examples"]
}

Same information. 70% fewer tokens.
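JSON output only pays off if downstream code can rely on it, so it is worth validating each reply. A minimal sketch, assuming the three keys from the example above are required (the function name and error messages are illustrative):

```python
import json

REQUIRED_KEYS = {"themes", "sentiment", "recommendations"}


def parse_agent_json(raw: str) -> dict:
    """Parse a JSON reply and check that the expected keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Agent did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data


reply = '{"themes": ["quality", "price"], "sentiment": "positive", "recommendations": ["Improve docs"]}'
parsed = parse_agent_json(reply)
print(parsed["sentiment"])  # positive
```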

Strategy 5: Self-Hosted Models (40-100% Savings)

For high-volume tasks, I moved to local models.

Break-even analysis:

Monthly Volume	Recommendation
< 5M tokens	Use API
5-20M tokens	Consider hybrid
> 20M tokens	Self-host + API for complex tasks
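The thresholds in the table follow from comparing a fixed self-hosting cost with a per-token API price. The sketch below shows that calculation; the $60/month hosting figure is a hypothetical number chosen for illustration, not a measurement, and the `recommend` function simply encodes the table.

```python
def break_even_tokens(fixed_monthly_cost: float, api_price_per_million: float) -> float:
    """Monthly token volume at which self-hosting costs the same as the API."""
    return fixed_monthly_cost / api_price_per_million * 1_000_000


def recommend(monthly_tokens: float) -> str:
    """Thresholds copied from the table above."""
    if monthly_tokens < 5_000_000:
        return "Use API"
    if monthly_tokens <= 20_000_000:
        return "Consider hybrid"
    return "Self-host + API for complex tasks"


# Hypothetical: $60/month of hardware and power against a $3 per million API price.
print(f"{break_even_tokens(60.0, 3.0):,.0f} tokens/month")  # 20,000,000 tokens/month
print(recommend(12_000_000))  # Consider hybrid
```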

I set up Ollama for routine tasks:

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run local model
ollama run llama3.2:3b

# Call the local REST API (Ollama's native generate endpoint)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Why is the sky blue?"
}'
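The same call can be made from Python with only the standard library. This is a sketch, assuming an Ollama server on its default port; the function names are illustrative. Setting `"stream": False` asks Ollama for a single JSON object whose `response` field holds the generated text.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


req = build_generate_request("llama3.2:3b", "Why is the sky blue?")
print(req.get_method(), req.full_url)  # POST http://localhost:11434/api/generate
```

`generate` is defined but not called here, since it needs a running server.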

Hybrid architecture:

class HybridModelSelector:
    """Use local for routine, cloud for complex"""

    def __init__(self):
        self.local_client = LocalAIClient("http://localhost:11434")  # Ollama's default port
        self.cloud_client = AnthropicClient()

    async def process(self, task: dict) -> str:
        if self._is_routine_task(task):
            return await self.local_client.generate(task)
        else:
            return await self.cloud_client.generate(task)

    def _is_routine_task(self, task: dict) -> bool:
        routine_patterns = ["summarize", "classify", "extract", "format"]
        return any(pattern in task['type'] for pattern in routine_patterns)

I followed the approach from the Reddit discussion: local Whisper for speech recognition, a local model for code assistance, and the cloud API only for complex reasoning.

Strategy 6: Budget Controls

I added rate limiting and per-agent budgets to prevent runaway costs.

import asyncio
from datetime import datetime, timedelta

class BudgetExceededError(Exception):
    """Raised when a request would push spend past the daily budget"""

class AgentRateLimiter:
    """Prevent runaway API costs"""

    def __init__(self, max_requests_per_minute: int = 60, max_cost_per_day: float = 10.0):
        self.rpm_limit = max_requests_per_minute
        self.daily_budget = max_cost_per_day
        self.request_times = []
        self.daily_spend = 0.0  # Not reset automatically; reset at the start of each day

    async def acquire(self, estimated_cost: float):
        # Check minute limit
        now = datetime.now()
        minute_ago = now - timedelta(minutes=1)
        self.request_times = [t for t in self.request_times if t > minute_ago]

        if len(self.request_times) >= self.rpm_limit:
            # Wait until the oldest request leaves the one-minute window
            wait_time = 60 - (now - self.request_times[0]).total_seconds()
            await asyncio.sleep(max(wait_time, 0))
            now = datetime.now()  # Refresh the timestamp after sleeping

        # Check daily budget
        if self.daily_spend + estimated_cost > self.daily_budget:
            raise BudgetExceededError(
                f"Daily budget ${self.daily_budget} exceeded. "
                f"Current spend: ${self.daily_spend:.2f}"
            )

        self.request_times.append(now)
        self.daily_spend += estimated_cost

Per-agent budget allocation:

class AgentBudgetManager:
    """Allocate budgets to individual agents"""

    def __init__(self, total_budget: float):
        self.total_budget = total_budget
        self.agent_budgets = {
            "research_agent": total_budget * 0.30,
            "writer_agent": total_budget * 0.25,
            "code_agent": total_budget * 0.20,
            "memory_agent": total_budget * 0.15,
            "coordinator_agent": total_budget * 0.10
        }
        self.agent_spend = {agent: 0.0 for agent in self.agent_budgets}

    def can_agent_proceed(self, agent_id: str, estimated_cost: float) -> bool:
        budget = self.agent_budgets[agent_id]
        spent = self.agent_spend[agent_id]
        return spent + estimated_cost <= budget
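The class above checks budgets but never records spend. A compact variant that does both in one step could look like this. It is a sketch: the `AgentBudget` name and the `charge` and `remaining` methods are assumptions, while the shares match the allocation above.

```python
class AgentBudget:
    def __init__(self, total_budget: float, shares: dict):
        if abs(sum(shares.values()) - 1.0) > 1e-9:
            raise ValueError("Budget shares must sum to 1.0")
        self.budgets = {agent: total_budget * share for agent, share in shares.items()}
        self.spend = {agent: 0.0 for agent in shares}

    def charge(self, agent_id: str, cost: float) -> bool:
        """Record cost if it fits in the agent's budget; return False otherwise."""
        if self.spend[agent_id] + cost > self.budgets[agent_id]:
            return False
        self.spend[agent_id] += cost
        return True

    def remaining(self, agent_id: str) -> float:
        return self.budgets[agent_id] - self.spend[agent_id]


SHARES = {
    "research_agent": 0.30,
    "writer_agent": 0.25,
    "code_agent": 0.20,
    "memory_agent": 0.15,
    "coordinator_agent": 0.10,
}

budget = AgentBudget(total_budget=10.0, shares=SHARES)
print(budget.charge("coordinator_agent", 0.80))  # True: 0.80 fits in the 1.00 budget
print(budget.charge("coordinator_agent", 0.30))  # False: 1.10 would exceed 1.00
print(f"{budget.remaining('coordinator_agent'):.2f}")  # 0.20
```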

Strategy 7: Cost Monitoring

I built a tracking system to see where money went.

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class AgentUsage:
    agent_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    task_type: str
    timestamp: datetime

class CostTracker:
    """Track costs across all agents"""

    PRICING = {
        "gpt-5": {"input": 0.015 / 1000, "output": 0.06 / 1000},
        "claude-sonnet-4.5": {"input": 3.0 / 1_000_000, "output": 15.0 / 1_000_000},
        "claude-haiku-3.5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
        "local-llama-4": {"input": 0.0, "output": 0.0}
    }

    def __init__(self):
        self.usage_log: List[AgentUsage] = []

    def track(self, agent_id: str, model: str, input_tokens: int,
              output_tokens: int, task_type: str) -> float:
        pricing = self.PRICING[model]
        cost = (input_tokens * pricing["input"]) + (output_tokens * pricing["output"])

        usage = AgentUsage(
            agent_id=agent_id,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost=cost,
            task_type=task_type,
            timestamp=datetime.now()
        )
        self.usage_log.append(usage)
        return cost

    def get_cost_by_agent(self) -> Dict[str, float]:
        costs = {}
        for usage in self.usage_log:
            costs[usage.agent_id] = costs.get(usage.agent_id, 0) + usage.cost
        return costs

This showed me that my Research Agent consumed 40% of costs. I moved it to a local model.
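Turning per-agent totals into percentage shares makes this kind of hotspot easy to spot. In the sketch below the dollar amounts are made up for illustration; in practice the input would be the dictionary returned by get_cost_by_agent().

```python
def cost_share(costs_by_agent: dict) -> dict:
    """Percentage of total spend per agent, largest first."""
    total = sum(costs_by_agent.values())
    if total == 0:
        return {agent: 0.0 for agent in costs_by_agent}
    ranked = sorted(costs_by_agent.items(), key=lambda item: item[1], reverse=True)
    return {agent: round(100 * cost / total, 1) for agent, cost in ranked}


# Made-up daily costs in USD, for illustration only.
sample_costs = {
    "writer_agent": 2.5,
    "research_agent": 4.0,
    "code_agent": 2.0,
    "memory_agent": 0.5,
    "coordinator_agent": 1.0,
}

shares = cost_share(sample_costs)
for agent, pct in shares.items():
    print(f"{agent:<20} {pct:5.1f}%")
```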

Multi-Agent Specific Optimizations

For agent communication, I switched from verbose messages to structured data:

# Bad: Verbose inter-agent communication (150 tokens)
agent1_to_agent2 = """
Hey Agent 2, I've finished my research on the topic you requested.
I found several interesting sources that you might want to look at.
The main findings are related to X, Y, and Z. I think you should
consider these when writing your section. Let me know if you need
more details.
"""

# Good: Structured minimal communication (30 tokens)
agent1_to_agent2 = {
    "status": "complete",
    "findings": ["X", "Y", "Z"],
    "sources": 3,
    "confidence": 0.85
}
# 80% savings

I also implemented a shared context pool to avoid duplicate context across agents:

from datetime import datetime
from typing import Any

class SharedContextPool:
    """Shared memory for multi-agent systems"""

    def __init__(self):
        self.shared_knowledge = {}
        self.agent_subscriptions = {}  # key -> list of subscribed agent IDs

    def publish(self, agent_id: str, key: str, value: Any):
        self.shared_knowledge[key] = {
            "value": value,
            "source": agent_id,
            "timestamp": datetime.now()
        }

        # Notify every subscriber except the publisher
        for subscriber in self.agent_subscriptions.get(key, []):
            if subscriber != agent_id:
                self._notify_agent(subscriber, key, value)

    def get(self, agent_id: str, key: str) -> Any:
        if key in self.shared_knowledge:
            return self.shared_knowledge[key]["value"]
        return None

    def _notify_agent(self, agent_id: str, key: str, value: Any):
        """Placeholder: deliver the update through your own agent messaging layer"""

Results

I applied all these strategies over 4 weeks:

Week 1: Caching + Tracking

Added response caching
Built cost tracking
Savings: 30%

Week 2: Model Routing

Implemented task-based routing
Added model cascading
Savings: 50%

Week 3: Budgets + Alerts

Per-agent budgets
Daily spend limits
Prevented runaway costs

Week 4: Prompt Optimization

Compressed prompts
Dynamic context selection
JSON output format
Savings: 20%

Final results:

Metric	Before	After	Change
Monthly API cost	$320	$75	-77%
Rate limit hits	15/day	0-2/day	-90%
Avg tokens per task	3500	1200	-66%
Cache hit rate	0%	45%	+45 pts

Quick Wins (Implement Today)

If you want to start immediately:

Enable response caching: 30-70% savings
Set max_tokens limits: 20-40% savings
Use smaller models for simple tasks: 50-80% savings
Implement per-agent budgets: Prevents runaway costs

Summary

In this post, I showed how to reduce AI agent API costs by 60-80%. The key strategies are caching to avoid redundant calls, routing tasks to appropriate models, optimizing prompts and context, self-hosting for high-volume tasks, and establishing strict budgets.

The biggest lesson: you can’t optimize what you don’t measure. Start with cost tracking, then apply strategies based on your usage patterns.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit Discussion - OpenClaw Multi-Agent Costs
👨‍💻 21medien LLM Cost Optimization Guide
👨‍💻 Ollama Local Models

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!