
How to Reduce API Costs When Using AI Agents: A Practical Guide

Problem

I ran OpenClaw with 5 agents for 3 weeks. Then I hit the API usage limit. The culprit? I sent huge instructions to all 5 agents for every task.

My setup:

  • 1 main coordinator agent
  • 4 specialized subagents
  • Each agent received ~2000 tokens of instructions per task

At 100 tasks per day, that’s 5 agents × 2,000 tokens × 100 tasks = 1,000,000 input tokens per day just for instructions. With Claude Sonnet 4.5 pricing ($3 per million input tokens), that’s about $3 per day, so I burned through $90/month on instructions alone. And that’s before any actual work happened.
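The monthly figure can be checked with a few lines of arithmetic. This is a minimal sketch using the numbers from my setup; the function name and the 30-day month are assumptions made for illustration.

```python
def monthly_instruction_cost(agents, tokens_per_agent, tasks_per_day,
                             usd_per_million_input, days=30):
    """Return (tokens_per_day, usd_per_month) for instruction overhead alone."""
    tokens_per_day = agents * tokens_per_agent * tasks_per_day
    usd_per_day = tokens_per_day / 1_000_000 * usd_per_million_input
    return tokens_per_day, usd_per_day * days


# 5 agents, ~2000 instruction tokens each, 100 tasks/day, $3 per million input tokens
tokens, cost = monthly_instruction_cost(5, 2000, 100, 3.0)
print(f"{tokens:,} tokens/day, ${cost:.2f}/month")
```

Instruction overhead scales linearly with the number of agents, so trimming the shared instructions helps every agent at once.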

I needed a different approach.

Why Multi-Agent Systems Burn Money

Multi-agent systems multiply costs in ways single-agent systems don’t.

Token multiplication:

Single agent: 1 API call per task
5 agents: 5-25 API calls per complex task

Each agent makes multiple calls. They share context. They send messages to each other. The token count explodes.

Context accumulation:

Task 1: 1,000 tokens context
Task 2: 1,000 + Task 1 response = 2,500 tokens
Task 3: 2,500 + Task 2 response = 4,500 tokens

Context grows without management. Every task costs more than the last.
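To make the growth concrete, here is a small sketch that reproduces the numbers above. The response sizes (1,500 and 2,000 tokens) are the ones implied by the example; the `cap` parameter and its value of 3,000 are hypothetical, added only to show what a context limit changes.

```python
def context_sizes(base_tokens, response_tokens, cap=None):
    """Context size at each task when every earlier response is appended.

    With cap set, the context is trimmed to at most cap tokens per task.
    """
    sizes = [base_tokens if cap is None else min(base_tokens, cap)]
    for response in response_tokens:
        size = sizes[-1] + response
        sizes.append(size if cap is None else min(size, cap))
    return sizes


unmanaged = context_sizes(1000, [1500, 2000])
managed = context_sizes(1000, [1500, 2000], cap=3000)
print(unmanaged, sum(unmanaged))
print(managed, sum(managed))
```

The gap between the two totals widens with every additional task, because the unmanaged context keeps growing while the capped one levels off.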

Strategy 1: Caching (30-70% Savings)

I started with the simplest fix: stop making the same API calls repeatedly.

Here’s what I implemented:

agent_cache.py
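import hashlib
import time
from typing import Optional


class AgentCache:
    """In-memory response cache keyed by agent ID and prompt."""

    def __init__(self):
        self.cache = {}  # key -> (response, expiry time); use Redis in production
        self.hits = 0
        self.lookups = 0

    def get_cached_response(self, prompt: str, agent_id: str) -> Optional[str]:
        self.lookups += 1
        key = self._hash_key(prompt, agent_id)
        entry = self.cache.get(key)
        if entry is not None and entry[1] > time.time():
            self.hits += 1
            return entry[0]
        self.cache.pop(key, None)  # drop the entry if it has expired
        return None

    def cache_response(self, prompt: str, agent_id: str, response: str, ttl: int = 3600):
        key = self._hash_key(prompt, agent_id)
        self.cache[key] = (response, time.time() + ttl)

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from the cache."""
        return self.hits / self.lookups if self.lookups else 0.0

    def _hash_key(self, prompt: str, agent_id: str) -> str:
        content = f"{agent_id}:{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()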
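The cache only saves money if every call goes through it. One way to enforce that is to wrap the model call itself. The following is a sketch rather than my production code: `make_cached_caller` and `fake_model` are hypothetical names, and `fake_model` stands in for a real API call so the example runs offline.

```python
import hashlib


def make_cached_caller(call_model):
    """Wrap call_model(agent_id, prompt) so identical requests hit the API once."""
    cache = {}
    stats = {"hits": 0, "misses": 0}

    def cached_call(agent_id, prompt):
        key = hashlib.sha256(f"{agent_id}:{prompt}".encode()).hexdigest()
        if key in cache:
            stats["hits"] += 1
            return cache[key]
        stats["misses"] += 1
        cache[key] = call_model(agent_id, prompt)
        return cache[key]

    return cached_call, stats


# Hypothetical stand-in for a real API call; it records every paid request.
api_calls = []


def fake_model(agent_id, prompt):
    api_calls.append((agent_id, prompt))
    return f"answer to: {prompt}"


ask, stats = make_cached_caller(fake_model)
ask("research_agent", "What is RAG?")
ask("research_agent", "What is RAG?")  # served from cache, no API call
ask("writer_agent", "What is RAG?")    # different agent, separate cache entry
print(stats, len(api_calls))
```

Because the key includes the agent ID, two agents asking the same question still cost two calls. Dropping the agent ID from the key would share answers across agents, at the risk of mixing responses produced under different system prompts.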

What to cache:

Content Type                 | Cache Duration | Savings
Common queries across agents | 1 hour         | High
Tool usage patterns          | 24 hours       | Medium
FAQ-style responses          | Permanent      | High
System prompt results        | Session        | Medium

I also added embedding caching for my RAG agents:

rag_cache.py
import hashlib


class RAGCache:
    """Cache embeddings to avoid regeneration"""

    def __init__(self, embedding_model):
        self.embedding_cache = {}
        self.model = embedding_model

    def get_embedding(self, text: str) -> list:
        # MD5 is used only as a cache key here, not for security
        text_hash = hashlib.md5(text.encode()).hexdigest()
        if text_hash in self.embedding_cache:
            return self.embedding_cache[text_hash]
        embedding = self.model.embed(text)
        self.embedding_cache[text_hash] = embedding
        return embedding

Document embeddings get cached permanently. Query embeddings are cached only for frequently used queries. This saved me 40% on RAG agent costs.

Strategy 2: Model Routing (50-80% Savings)

I was using my most expensive model for everything. That’s like hiring a senior architect to write boilerplate code.

I built a router to match models to task complexity:

model_router.py
class ModelRouter:
    """Route agent tasks to appropriate models"""

    # USD per 1K tokens
    MODEL_COSTS = {
        "gpt-5": {"input": 0.015, "output": 0.06},
        "claude-sonnet-4.5": {"input": 0.003, "output": 0.015},
        "claude-haiku-3.5": {"input": 0.00025, "output": 0.00125},
        "local-llama-4": {"input": 0.0, "output": 0.0}
    }

    def route_task(self, task_type: str, complexity: str) -> str:
        routing_matrix = {
            ("classification", "simple"): "claude-haiku-3.5",
            ("classification", "moderate"): "claude-sonnet-4.5",
            ("extraction", "simple"): "claude-haiku-3.5",
            ("extraction", "complex"): "claude-sonnet-4.5",
            ("reasoning", "simple"): "claude-sonnet-4.5",
            ("reasoning", "complex"): "gpt-5",
            ("code", "generation"): "claude-sonnet-4.5",
            ("code", "review"): "local-llama-4",
            ("creative", "any"): "gpt-5"
        }
        # Unknown combinations fall back to the mid-tier model
        return routing_matrix.get((task_type, complexity), "claude-sonnet-4.5")

Agent-to-model mapping:

Research Agent (high-volume searches) -> Local Llama 4
Writer Agent (creative tasks) -> Claude Sonnet 4.5
Code Agent (routine reviews) -> Local model
Memory Agent (classification) -> Claude Haiku 3.5
Coordinator Agent (complex reasoning) -> GPT-5

Cost comparison:

Approach             | Monthly Cost | Savings
All tasks with GPT-5 | $100         | Baseline
Smart routing        | $25          | 75%

I also implemented model cascading, which tries cheaper models first and escalates only if needed:

cascading_caller.py
class CascadingModelCaller:
    """Try cheap models first, escalate on failure"""

    def __init__(self):
        # Ordered cheapest to most expensive.
        # Each _call_* coroutine returns a (response, confidence) pair.
        self.models = [
            ("local-llama-4", self._call_local),
            ("claude-haiku-3.5", self._call_haiku),
            ("claude-sonnet-4.5", self._call_sonnet)
        ]
        self.confidence_threshold = 0.85

    async def call_with_cascade(self, prompt: str) -> tuple:
        for model_name, call_func in self.models:
            response, confidence = await call_func(prompt)
            if confidence >= self.confidence_threshold:
                return response, model_name
        # No model was confident enough
        return await self._fallback_response(prompt)

This handles 60-70% of queries with cheaper models. Quality stays the same through escalation.
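The class above leaves the per-model calls out. To show the control flow end to end, here is a self-contained sketch with two stub models. The stubs, their confidence rule (short prompts count as easy), and the `cascade` helper are all hypothetical; in practice the confidence would come from a classifier score, a self-check prompt, or a similar signal.

```python
import asyncio

CONFIDENCE_THRESHOLD = 0.85


async def call_small(prompt):
    # Hypothetical stub: pretend the cheap model is confident only on short prompts.
    confidence = 0.9 if len(prompt) < 20 else 0.4
    return f"small:{prompt}", confidence


async def call_large(prompt):
    # Hypothetical stub: pretend the expensive model is always confident.
    return f"large:{prompt}", 0.95


async def cascade(prompt, models, threshold=CONFIDENCE_THRESHOLD):
    """Return (response, model_name, attempts) from the first confident model."""
    response, name = None, None
    for attempts, (name, call) in enumerate(models, start=1):
        response, confidence = await call(prompt)
        if confidence >= threshold:
            return response, name, attempts
    # Nobody was confident: keep the answer from the last (strongest) model.
    return response, name, len(models)


MODELS = [("small", call_small), ("large", call_large)]


async def main():
    easy = await cascade("classify: spam?", MODELS)
    hard = await cascade("explain the tradeoffs of eventual consistency", MODELS)
    return easy, hard


easy, hard = asyncio.run(main())
print(easy[1], easy[2])
print(hard[1], hard[2])
```

Note the cost of escalation: a query that falls through to the large model pays for both calls. Cascading only saves money when most queries stop at the first tier.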

Strategy 3: Prompt Optimization (20-40% Savings)

I reviewed my prompts. They were verbose.

Before (250 tokens):

You are a helpful AI assistant that specializes in analyzing customer feedback.
Your role is to carefully read through customer reviews and identify the main
themes, sentiment, and actionable insights. Please make sure to provide a
structured response that includes a summary, key points, and recommendations.

After (75 tokens):

Analyze feedback. Return: themes, sentiment, insights, recommendations.
Structure: summary, key_points, actions.

That’s 175 tokens saved per call. At 1000 calls, that’s 175,000 input tokens, or about $0.52 in savings at $3 per million.

Prompt compression approach:

prompt_optimizer.py
class PromptOptimizer:
    def compress_prompt(self, prompt: str) -> str:
        """Reduce prompt tokens while preserving meaning"""
        # Remove redundancy
        prompt = self._remove_redundancy(prompt)
        # Use bullet points
        prompt = self._convert_to_bullets(prompt)
        # Abbreviate where context is clear
        prompt = self._abbreviate(prompt)
        return prompt
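The helper methods are not shown above. As one possible take on the redundancy step, here is a sketch that strips filler phrases with regular expressions. The phrase list and both function names are assumptions for illustration, and the 4-characters-per-token estimate is only a rough rule of thumb for English text, not a real tokenizer.

```python
import re

# Hypothetical filler list; tune it for your own prompts.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bmake sure to\b",
    r"\bcarefully\b",
    r"\byou are a helpful ai assistant that\b",
]


def remove_filler(prompt: str) -> str:
    """Strip filler phrases and collapse the whitespace they leave behind."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", prompt).strip()


def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


before = "Please make sure to carefully read the reviews and list the main themes."
after = remove_filler(before)
print(after)
print(rough_token_count(before), "->", rough_token_count(after))
```

Automatic compression like this is blunt. It is worth spot-checking that the shortened prompts still produce the same quality of output before rolling them out to every agent.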

I also fixed my context management. I was sending the entire conversation history every time.

Dynamic context selection:

context_manager.py
class ContextManager:
    """Select relevant context for each agent call"""

    def __init__(self, max_context_tokens: int = 4000):
        self.max_tokens = max_context_tokens
        self.conversation_history = []

    def select_relevant_context(self, current_query: str) -> list:
        scored_messages = []
        for msg in self.conversation_history:
            score = self._calculate_relevance(current_query, msg)
            scored_messages.append((score, msg))
        scored_messages.sort(reverse=True, key=lambda x: x[0])

        selected_context = []
        token_count = 0
        for score, msg in scored_messages:
            msg_tokens = self._count_tokens(msg)
            if token_count + msg_tokens <= self.max_tokens:
                selected_context.append(msg)
                token_count += msg_tokens
        return selected_context
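The relevance and token-counting helpers are left undefined above. Here is a self-contained sketch of the same selection idea with simple stand-ins: word-overlap (Jaccard) scoring and a word count in place of a tokenizer. Both are assumptions chosen to keep the example dependency-free; embedding similarity and a real tokenizer would be the more accurate choices. Unlike the class above, this version also restores the original message order after selection.

```python
def relevance(query: str, message: str) -> float:
    """Jaccard overlap between the word sets of query and message."""
    q, m = set(query.lower().split()), set(message.lower().split())
    return len(q & m) / len(q | m) if q | m else 0.0


def count_tokens(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word."""
    return len(text.split())


def select_context(query, history, max_tokens):
    """Pick the most relevant messages that fit the budget, in original order."""
    ranked = sorted(range(len(history)),
                    key=lambda i: relevance(query, history[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        cost = count_tokens(history[i])
        if used + cost <= max_tokens:
            chosen.append(i)
            used += cost
    return [history[i] for i in sorted(chosen)]


history = [
    "the deploy failed on staging",
    "lunch is at noon",
    "staging deploy logs show a timeout",
]
print(select_context("why did the staging deploy fail", history, max_tokens=11))
```

With an 11-token budget the two deploy-related messages fit and the unrelated one is dropped.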

Agent memory optimization:

Bad: Send full memory to every agent (10,000+ tokens)
Good: Send only relevant memory to each agent (500-1000 tokens)

Strategy 4: Output Control (20-40% Savings)

I stopped letting agents ramble.

output_control.py
# Always set max_tokens
response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=500,  # Hard cap: output is cut off at 500 tokens, so also ask for brevity in the prompt
    messages=[{"role": "user", "content": query}]
)

Cost impact:

Setting          | Output Tokens | Cost per Call
No limit         | ~1500         | $0.0225
With limit (500) | ~400          | $0.006
Savings          | 73%           | -

I also switched to JSON responses instead of prose:

Prose response (500 tokens):

Based on my analysis of the customer feedback, I found that there are several
key themes that emerged. The first theme is about product quality, where
customers mentioned...

JSON response (150 tokens):

{
  "themes": ["quality", "price", "support"],
  "sentiment": "positive",
  "recommendations": ["Improve docs", "Add examples"]
}

Same information. 70% fewer tokens.
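Structured output is only useful if the receiving code checks it. A tight `max_tokens` limit can cut a JSON reply off mid-string, so it is worth validating before use. This is a minimal sketch; the function name and the set of required keys (taken from the example above) are assumptions.

```python
import json

REQUIRED_KEYS = {"themes", "sentiment", "recommendations"}


def parse_agent_json(raw: str) -> dict:
    """Parse a JSON reply and check that the expected keys are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


reply = ('{"themes": ["quality", "price", "support"], "sentiment": "positive", '
         '"recommendations": ["Improve docs", "Add examples"]}')
result = parse_agent_json(reply)
print(result["sentiment"], len(result["themes"]))

try:
    parse_agent_json(reply[:40])  # simulate a reply truncated by max_tokens
except ValueError as err:
    print("rejected:", type(err).__name__)
```

A rejected reply can then be retried with a higher limit or escalated to a stronger model, instead of passing broken data to the next agent.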

Strategy 5: Self-Hosted Models (40-100% Savings)

For high-volume tasks, I moved to local models.

Break-even analysis:

Monthly Volume | Recommendation
< 5M tokens    | Use API
5-20M tokens   | Consider hybrid
> 20M tokens   | Self-host + API for complex tasks
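The thresholds in the table can be turned into a small helper, together with a break-even calculation. This is a sketch under stated assumptions: the function names are mine, the handling of the exact boundary values is a guess, and the $60/month fixed cost is a hypothetical figure chosen only to illustrate the formula. Plug in your own hardware and electricity costs.

```python
def hosting_recommendation(monthly_tokens: int) -> str:
    """Map monthly token volume to the recommendation from the table."""
    if monthly_tokens < 5_000_000:
        return "Use API"
    if monthly_tokens <= 20_000_000:
        return "Consider hybrid"
    return "Self-host + API for complex tasks"


def break_even_tokens(fixed_monthly_usd: float, api_usd_per_million: float) -> float:
    """Monthly token volume at which a fixed self-hosting cost equals API spend."""
    return fixed_monthly_usd / api_usd_per_million * 1_000_000


for volume in (2_000_000, 12_000_000, 30_000_000):
    print(f"{volume:,} tokens -> {hosting_recommendation(volume)}")

# Hypothetical: $60/month of fixed cost against $3 per million input tokens
print(f"{break_even_tokens(60.0, 3.0):,.0f} tokens/month to break even")
```

The break-even point moves in proportion to your fixed costs, so a more expensive GPU setup needs proportionally more volume to pay off.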

I set up Ollama for routine tasks:

ollama_setup.sh
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Run local model
ollama run llama3.2:3b
# API-compatible endpoint
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Why is the sky blue?"
}'
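The same endpoint can be called from Python with only the standard library. This is a sketch, not the only way to do it: the helper names are hypothetical, and it assumes the default Ollama port and the model pulled above. Setting `"stream": false` asks Ollama for a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.2:3b") -> urllib.request.Request:
    """Build a non-streaming POST request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt: str, model: str = "llama3.2:3b", timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Building the request works offline; generate() needs `ollama` running locally.
request = build_generate_request("Why is the sky blue?")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect and test without a running server.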

Hybrid architecture:

hybrid_selector.py
class HybridModelSelector:
    """Use local for routine, cloud for complex"""

    def __init__(self):
        # LocalAIClient and AnthropicClient are placeholder wrappers around
        # your local server and the cloud API
        self.local_client = LocalAIClient("http://localhost:8080")
        self.cloud_client = AnthropicClient()

    async def process(self, task: dict) -> str:
        if self._is_routine_task(task):
            return await self.local_client.generate(task)
        else:
            return await self.cloud_client.generate(task)

    def _is_routine_task(self, task: dict) -> bool:
        routine_patterns = ["summarize", "classify", "extract", "format"]
        return any(pattern in task['type'] for pattern in routine_patterns)

I followed the Reddit case approach: local Whisper for speech recognition, local model for code assistance, cloud API only for complex reasoning.

Strategy 6: Budget Controls

I added rate limiting and per-agent budgets to prevent runaway costs.

rate_limiter.py
import asyncio
from datetime import datetime, timedelta


class BudgetExceededError(Exception):
    """Raised when a request would push spend past the daily budget."""


class AgentRateLimiter:
    """Prevent runaway API costs"""

    def __init__(self, max_requests_per_minute: int = 60, max_cost_per_day: float = 10.0):
        self.rpm_limit = max_requests_per_minute
        self.daily_budget = max_cost_per_day
        self.request_times = []
        self.daily_spend = 0.0  # reset once per day (not shown)

    async def acquire(self, estimated_cost: float):
        # Check minute limit
        now = datetime.now()
        minute_ago = now - timedelta(minutes=1)
        self.request_times = [t for t in self.request_times if t > minute_ago]
        if len(self.request_times) >= self.rpm_limit:
            # Wait until the oldest request leaves the 60-second window
            wait_time = 60 - (now - self.request_times[0]).total_seconds()
            await asyncio.sleep(max(wait_time, 0))
            now = datetime.now()  # refresh the timestamp after sleeping

        # Check daily budget
        if self.daily_spend + estimated_cost > self.daily_budget:
            raise BudgetExceededError(
                f"Daily budget ${self.daily_budget} exceeded. "
                f"Current spend: ${self.daily_spend:.2f}"
            )

        self.request_times.append(now)
        self.daily_spend += estimated_cost
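The limiter needs an `estimated_cost` before each call. One conservative way to get it is to assume the reply uses the whole `max_tokens` allowance. The sketch below does that; the function name is hypothetical, and the prices are the Sonnet figures used throughout this post, expressed per million tokens.

```python
# USD per million tokens (the Sonnet 4.5 figures used in this post)
PRICE_PER_MILLION = {
    "claude-sonnet-4.5": {"input": 3.0, "output": 15.0},
}


def estimate_cost(model: str, input_tokens: int, max_output_tokens: int) -> float:
    """Upper-bound cost of one call, assuming the reply uses all of max_output_tokens."""
    price = PRICE_PER_MILLION[model]
    return (input_tokens * price["input"]
            + max_output_tokens * price["output"]) / 1_000_000


cost = estimate_cost("claude-sonnet-4.5", input_tokens=2000, max_output_tokens=500)
calls_per_day = int(10.0 / cost)  # how many such calls fit in a $10 daily budget
print(f"${cost:.4f} per call, {calls_per_day} calls per day")
```

Overestimating is the safer failure mode here: the budget check trips slightly early instead of slightly late. The estimate can be replaced with the actual cost once the response reports its real token usage.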

Per-agent budget allocation:

budget_manager.py
class AgentBudgetManager:
    """Allocate budgets to individual agents"""

    def __init__(self, total_budget: float):
        self.total_budget = total_budget
        self.agent_budgets = {
            "research_agent": total_budget * 0.30,
            "writer_agent": total_budget * 0.25,
            "code_agent": total_budget * 0.20,
            "memory_agent": total_budget * 0.15,
            "coordinator_agent": total_budget * 0.10
        }
        self.agent_spend = {agent: 0.0 for agent in self.agent_budgets}

    def can_agent_proceed(self, agent_id: str, estimated_cost: float) -> bool:
        budget = self.agent_budgets[agent_id]
        spent = self.agent_spend[agent_id]
        return spent + estimated_cost <= budget

    def record_spend(self, agent_id: str, actual_cost: float) -> None:
        # Call after each request so can_agent_proceed sees up-to-date spend
        self.agent_spend[agent_id] += actual_cost

Strategy 7: Cost Monitoring

I built a tracking system to see where money went.

cost_tracker.py
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class AgentUsage:
    agent_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    task_type: str
    timestamp: datetime


class CostTracker:
    """Track costs across all agents"""

    # USD per single token
    PRICING = {
        "gpt-5": {"input": 0.015 / 1000, "output": 0.06 / 1000},
        "claude-sonnet-4.5": {"input": 3.0 / 1_000_000, "output": 15.0 / 1_000_000},
        "claude-haiku-3.5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
        "local-llama-4": {"input": 0.0, "output": 0.0}
    }

    def __init__(self):
        self.usage_log: List[AgentUsage] = []

    def track(self, agent_id: str, model: str, input_tokens: int,
              output_tokens: int, task_type: str) -> float:
        pricing = self.PRICING[model]
        cost = (input_tokens * pricing["input"]) + (output_tokens * pricing["output"])
        usage = AgentUsage(
            agent_id=agent_id,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost=cost,
            task_type=task_type,
            timestamp=datetime.now()
        )
        self.usage_log.append(usage)
        return cost

    def get_cost_by_agent(self) -> Dict[str, float]:
        costs = {}
        for usage in self.usage_log:
            costs[usage.agent_id] = costs.get(usage.agent_id, 0) + usage.cost
        return costs

This showed me that my Research Agent consumed 40% of costs. I moved it to a local model.
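Raw totals per agent are easier to act on as shares of the whole. The sketch below ranks agents by their fraction of total spend. The function name is an assumption, and the dollar figures are hypothetical: they were chosen to add up to my $320 baseline with the Research Agent at 40%, but the split among the other agents is made up for the example.

```python
def cost_shares(costs_by_agent):
    """Return (agent, share_of_total) pairs, largest share first."""
    total = sum(costs_by_agent.values())
    if total == 0:
        return []
    ranked = sorted(costs_by_agent.items(), key=lambda kv: kv[1], reverse=True)
    return [(agent, cost / total) for agent, cost in ranked]


# Hypothetical monthly spend per agent, in USD
costs = {
    "writer_agent": 80.0,
    "research_agent": 128.0,
    "code_agent": 64.0,
    "memory_agent": 32.0,
    "coordinator_agent": 16.0,
}

shares = cost_shares(costs)
for agent, share in shares:
    print(f"{agent:<18} {share:6.1%}")
```

The agent at the top of this list is usually the best first candidate for a cheaper model or more aggressive caching.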

Multi-Agent Specific Optimizations

For agent communication, I switched from verbose messages to structured data:

agent_communication.py
# Bad: Verbose inter-agent communication (150 tokens)
agent1_to_agent2 = """
Hey Agent 2, I've finished my research on the topic you requested.
I found several interesting sources that you might want to look at.
The main findings are related to X, Y, and Z. I think you should
consider these when writing your section. Let me know if you need
more details.
"""

# Good: Structured minimal communication (30 tokens)
agent1_to_agent2 = {
    "status": "complete",
    "findings": ["X", "Y", "Z"],
    "sources": 3,
    "confidence": 0.85
}
# 80% savings
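When the structured message is actually sent to another agent, how it is serialized matters too. The sketch below serializes the same payload without whitespace and compares its size to the prose version. Character counts are used as a rough proxy for tokens here, which is an assumption; a real tokenizer would give exact numbers.

```python
import json

prose = (
    "Hey Agent 2, I've finished my research on the topic you requested. "
    "I found several interesting sources that you might want to look at. "
    "The main findings are related to X, Y, and Z. I think you should "
    "consider these when writing your section. Let me know if you need "
    "more details."
)

payload = {
    "status": "complete",
    "findings": ["X", "Y", "Z"],
    "sources": 3,
    "confidence": 0.85,
}

# separators=(",", ":") drops the spaces json.dumps adds by default
compact = json.dumps(payload, separators=(",", ":"))

print(compact)
print(f"prose: {len(prose)} chars, compact JSON: {len(compact)} chars")
```

The receiving agent can restore the dictionary with `json.loads`, so nothing is lost by dropping the whitespace.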

I also implemented a shared context pool to avoid duplicate context across agents:

shared_context.py
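from datetime import datetime
from typing import Any


class SharedContextPool:
    """Shared memory for multi-agent systems"""

    def __init__(self):
        self.shared_knowledge = {}
        self.agent_subscriptions = {}  # key -> list of subscribed agent IDs

    def subscribe(self, agent_id: str, key: str):
        self.agent_subscriptions.setdefault(key, []).append(agent_id)

    def publish(self, agent_id: str, key: str, value: Any):
        self.shared_knowledge[key] = {
            "value": value,
            "source": agent_id,
            "timestamp": datetime.now()
        }
        for subscriber in self.agent_subscriptions.get(key, []):
            if subscriber != agent_id:
                self._notify_agent(subscriber, key, value)

    def get(self, agent_id: str, key: str) -> Any:
        if key in self.shared_knowledge:
            return self.shared_knowledge[key]["value"]
        return None

    def _notify_agent(self, agent_id: str, key: str, value: Any):
        # Placeholder: deliver the update through your agent framework's messaging
        pass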

Results

I applied all these strategies over 4 weeks:

Week 1: Caching + Tracking

  • Added response caching
  • Built cost tracking
  • Savings: 30%

Week 2: Model Routing

  • Implemented task-based routing
  • Added model cascading
  • Savings: 50%

Week 3: Budgets + Alerts

  • Per-agent budgets
  • Daily spend limits
  • Prevented runaway costs

Week 4: Prompt Optimization

  • Compressed prompts
  • Dynamic context selection
  • JSON output format
  • Savings: 20%

Final results:

Metric              | Before | After   | Change
Monthly API cost    | $320   | $75     | -77%
Rate limit hits     | 15/day | 0-2/day | -90%
Avg tokens per task | 3500   | 1200    | -66%
Cache hit rate      | 0%     | 45%     | +45%

Quick Wins (Implement Today)

If you want to start immediately:

  1. Enable response caching: 30-70% savings
  2. Set max_tokens limits: 20-40% savings
  3. Use smaller models for simple tasks: 50-80% savings
  4. Implement per-agent budgets: Prevents runaway costs

Summary

In this post, I showed how to reduce AI agent API costs by 60-80%. The key strategies are caching to avoid redundant calls, routing tasks to appropriate models, optimizing prompts and context, self-hosting for high-volume tasks, and establishing strict budgets.

The biggest lesson: you can’t optimize what you don’t measure. Start with cost tracking, then apply strategies based on your usage patterns.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
