How to Reduce API Costs When Using AI Agents: A Practical Guide
Problem
I ran OpenClaw with 5 agents for 3 weeks. Then I hit the API usage limit. The culprit? I sent huge instructions to all 5 agents for every task.
My setup:
- 1 main coordinator agent
- 4 specialized subagents
- Each agent received ~2000 tokens of instructions per task
At 100 tasks per day, that’s 1,000,000 input tokens per day just for instructions (5 agents × 2,000 tokens × 100 tasks). With Claude Sonnet 4.5 pricing ($3 per million input tokens), I burned through $90/month on instructions alone. And that’s before any actual work happened.
I needed a different approach.
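The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch using only the numbers from the setup above; the helper name `instruction_overhead` and the 30-day month are assumptions made for illustration.

```python
def instruction_overhead(agents: int, tokens_per_agent: int, tasks_per_day: int,
                         usd_per_million_input: float, days: int = 30) -> dict:
    """Estimate what the repeated instructions alone cost (hypothetical helper)."""
    # Every agent receives its instructions again for every task.
    tokens_per_day = agents * tokens_per_agent * tasks_per_day
    cost_per_day = tokens_per_day / 1_000_000 * usd_per_million_input
    return {
        "tokens_per_day": tokens_per_day,
        "cost_per_day": cost_per_day,
        "cost_per_month": cost_per_day * days,  # assumes a 30-day month
    }

overhead = instruction_overhead(
    agents=5, tokens_per_agent=2000, tasks_per_day=100, usd_per_million_input=3.0
)
print(overhead)
```

Changing any one input (fewer agents receiving the full instructions, shorter instructions, a cheaper model) scales the monthly figure linearly, which is why the strategies below target each factor separately.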
Why Multi-Agent Systems Burn Money
Multi-agent systems multiply costs in ways single-agent systems don’t.
Token multiplication:
- Single agent: 1 API call per task
- 5 agents: 5-25 API calls per complex task

Each agent makes multiple calls. They share context. They send messages to each other. The token count explodes.
Context accumulation:
- Task 1: 1,000 tokens context
- Task 2: 1,000 + Task 1 response = 2,500 tokens
- Task 3: 2,500 + Task 2 response = 4,500 tokens

Context grows without management. Every task costs more than the last.
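The growth pattern above can be reproduced with a running sum. The sketch below is an illustration, not a measurement: the response sizes (1,500 and 2,000 tokens) are the ones implied by the example, and the 2,000-token cap in `capped_sizes` is a hypothetical budget chosen only to show the effect of trimming.

```python
from itertools import accumulate

def context_sizes(base_tokens: int, response_tokens: list[int]) -> list[int]:
    """Input context per task when every earlier response is carried forward."""
    # Task 1 sees only the base context; each later task adds the previous response.
    return list(accumulate(response_tokens, initial=base_tokens))

def capped_sizes(sizes: list[int], max_tokens: int) -> list[int]:
    """The same tasks if context is trimmed to a fixed budget before each call."""
    return [min(size, max_tokens) for size in sizes]

unmanaged = context_sizes(1000, [1500, 2000])
managed = capped_sizes(unmanaged, max_tokens=2000)  # hypothetical cap

print(unmanaged, sum(unmanaged))
print(managed, sum(managed))
```

Because each call resends everything accumulated so far, total input tokens grow roughly quadratically with the number of tasks unless something bounds the context.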
Strategy 1: Caching (30-70% Savings)
I started with the simplest fix: stop making the same API calls repeatedly.
Here’s what I implemented:

    import hashlib
    import time
    from typing import Optional

    class AgentCache:
        def __init__(self):
            self.cache = {}  # key -> (response, expiry time); use Redis in production
            self.hits = 0    # number of cache hits (a count, not a rate)

        def get_cached_response(self, prompt: str, agent_id: str) -> Optional[str]:
            key = self._hash_key(prompt, agent_id)
            entry = self.cache.get(key)
            # Only return entries that have not expired yet
            if entry is not None and entry[1] > time.time():
                self.hits += 1
                return entry[0]
            return None

        def cache_response(self, prompt: str, agent_id: str, response: str, ttl: int = 3600):
            # Store the expiry time so the ttl argument is actually honored
            key = self._hash_key(prompt, agent_id)
            self.cache[key] = (response, time.time() + ttl)

        def _hash_key(self, prompt: str, agent_id: str) -> str:
            content = f"{agent_id}:{prompt}"
            return hashlib.sha256(content.encode()).hexdigest()

What to cache:
| Content Type | Cache Duration | Savings |
|---|---|---|
| Common queries across agents | 1 hour | High |
| Tool usage patterns | 24 hours | Medium |
| FAQ-style responses | Permanent | High |
| System prompt results | Session | Medium |
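One way to apply the durations in the table is a small wrapper that picks a TTL by content type. This is a sketch under assumptions: the content-type keys, the `cached_call` helper, and the `fake_model` stub are all hypothetical, and "Session" is modeled as "no expiry, cleared by the caller when the session ends".

```python
import hashlib
import time
from typing import Callable, Optional

# Cache durations in seconds, following the table above. None means no expiry.
TTL_BY_CONTENT_TYPE: dict[str, Optional[int]] = {
    "common_query": 60 * 60,       # 1 hour
    "tool_pattern": 24 * 60 * 60,  # 24 hours
    "faq": None,                   # permanent
    "system_prompt": None,         # session: assumed to be cleared by the caller
}

_cache: dict[str, tuple[str, Optional[float]]] = {}

def cached_call(prompt: str, content_type: str, call_model: Callable[[str], str]) -> str:
    """Return a cached response if it is still fresh, otherwise call the model."""
    key = hashlib.sha256(f"{content_type}:{prompt}".encode()).hexdigest()
    entry = _cache.get(key)
    if entry is not None:
        response, expires_at = entry
        if expires_at is None or expires_at > time.time():
            return response
    response = call_model(prompt)  # the only place an API call would happen
    ttl = TTL_BY_CONTENT_TYPE[content_type]
    _cache[key] = (response, None if ttl is None else time.time() + ttl)
    return response

# Stub standing in for a real API client, so the sketch runs offline.
calls: list[str] = []

def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"answer to: {prompt}"

first = cached_call("What is our refund policy?", "faq", fake_model)
second = cached_call("What is our refund policy?", "faq", fake_model)
print(len(calls))  # the second lookup is served from the cache
```

Note that this key deliberately leaves out the agent ID, so identical prompts from different agents share one entry. Whether that is safe depends on whether the agents use different system prompts.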
I also added embedding caching for my RAG agents:

    class RAGCache:
        """Cache embeddings to avoid regeneration"""

        def __init__(self, embedding_model):
            self.embedding_cache = {}
            self.model = embedding_model

        def get_embedding(self, text: str) -> list:
            # md5 is used here only as a cache key, not for security
            text_hash = hashlib.md5(text.encode()).hexdigest()
            if text_hash in self.embedding_cache:
                return self.embedding_cache[text_hash]

            embedding = self.model.embed(text)
            self.embedding_cache[text_hash] = embedding
            return embedding

Document embeddings get cached permanently. Query embeddings are cached for frequently used queries. This saved me 40% on RAG agent costs.
Strategy 2: Model Routing (50-80% Savings)
I was using my most expensive model for everything. That’s like hiring a senior architect to write boilerplate code.
I built a router to match models to task complexity:

    class ModelRouter:
        """Route agent tasks to appropriate models"""

        # USD per 1,000 tokens
        MODEL_COSTS = {
            "gpt-5": {"input": 0.015, "output": 0.06},
            "claude-sonnet-4.5": {"input": 0.003, "output": 0.015},
            "claude-haiku-3.5": {"input": 0.00025, "output": 0.00125},
            "local-llama-4": {"input": 0.0, "output": 0.0}
        }

        def route_task(self, task_type: str, complexity: str) -> str:
            routing_matrix = {
                ("classification", "simple"): "claude-haiku-3.5",
                ("classification", "moderate"): "claude-sonnet-4.5",
                ("extraction", "simple"): "claude-haiku-3.5",
                ("extraction", "complex"): "claude-sonnet-4.5",
                ("reasoning", "simple"): "claude-sonnet-4.5",
                ("reasoning", "complex"): "gpt-5",
                ("code", "generation"): "claude-sonnet-4.5",
                ("code", "review"): "local-llama-4",
                ("creative", "any"): "gpt-5"
            }

            # Unknown combinations fall back to the mid-tier model
            return routing_matrix.get((task_type, complexity), "claude-sonnet-4.5")

Agent-to-model mapping:
- Research Agent (high-volume searches) -> Local Llama 4
- Writer Agent (creative tasks) -> Claude Sonnet 4.5
- Code Agent (routine reviews) -> Local model
- Memory Agent (classification) -> Claude Haiku 3.5
- Coordinator Agent (complex reasoning) -> GPT-5

Cost comparison:
| Approach | Monthly Cost | Savings |
|---|---|---|
| All tasks with GPT-5 | $100 | Baseline |
| Smart routing | $25 | 75% |
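A comparison like the one in the table can be estimated from a task mix and the per-1K prices in `MODEL_COSTS` above. The sketch below is illustrative only: the `workload` entries are invented, the `monthly_cost` helper is hypothetical, and the output will not match the table because the real task mix is not given here.

```python
# USD per 1,000 tokens, copied from MODEL_COSTS above.
MODEL_COSTS = {
    "gpt-5": {"input": 0.015, "output": 0.06},
    "claude-sonnet-4.5": {"input": 0.003, "output": 0.015},
    "claude-haiku-3.5": {"input": 0.00025, "output": 0.00125},
    "local-llama-4": {"input": 0.0, "output": 0.0},
}

def monthly_cost(workload: list[dict], pick_model) -> float:
    """Sum API cost over a workload, with pick_model choosing the model per job."""
    total = 0.0
    for job in workload:
        prices = MODEL_COSTS[pick_model(job)]
        per_call = (job["input_tokens"] / 1000 * prices["input"]
                    + job["output_tokens"] / 1000 * prices["output"])
        total += job["calls"] * per_call
    return total

# Hypothetical monthly task mix, invented for illustration only.
workload = [
    {"kind": "classification", "calls": 3000, "input_tokens": 500,
     "output_tokens": 50, "routed_model": "claude-haiku-3.5"},
    {"kind": "extraction", "calls": 1500, "input_tokens": 1000,
     "output_tokens": 200, "routed_model": "claude-haiku-3.5"},
    {"kind": "code generation", "calls": 500, "input_tokens": 1500,
     "output_tokens": 800, "routed_model": "claude-sonnet-4.5"},
    {"kind": "complex reasoning", "calls": 100, "input_tokens": 2000,
     "output_tokens": 1000, "routed_model": "gpt-5"},
]

baseline = monthly_cost(workload, lambda job: "gpt-5")
routed = monthly_cost(workload, lambda job: job["routed_model"])
print(f"baseline ${baseline:.2f}, routed ${routed:.2f}, "
      f"savings {1 - routed / baseline:.0%}")
```

The savings depend almost entirely on how much of the volume is simple enough for the cheapest tier, so it is worth running this kind of estimate on real usage logs before committing to a routing table.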
I also implemented model cascading - try cheaper models first, escalate if needed:

    class CascadingModelCaller:
        """Try cheap models first, escalate on failure"""

        def __init__(self):
            # Ordered from cheapest to most expensive
            self.models = [
                ("local-llama-4", self._call_local),
                ("claude-haiku-3.5", self._call_haiku),
                ("claude-sonnet-4.5", self._call_sonnet)
            ]
            self.confidence_threshold = 0.85

        async def call_with_cascade(self, prompt: str) -> tuple:
            for model_name, call_func in self.models:
                # Each call function returns (response, confidence)
                response, confidence = await call_func(prompt)

                if confidence >= self.confidence_threshold:
                    return response, model_name

            return await self._fallback_response(prompt)

This handles 60-70% of queries with cheaper models. Quality stays the same through escalation.
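To see the escalation logic run end to end, here is a self-contained sketch with stub models. Everything in it is an assumption for illustration: the stubs hard-code their confidence scores (how confidence is obtained in practice, for example from log-probabilities or a verifier, is left open), and the fallback simply returns the last model's answer rather than calling a separate fallback routine.

```python
import asyncio
from typing import Awaitable, Callable

# Each model function returns (response, confidence).
ModelCall = Callable[[str], Awaitable[tuple[str, float]]]

async def call_local(prompt: str) -> tuple[str, float]:
    return "local draft", 0.60    # stub: not confident enough

async def call_haiku(prompt: str) -> tuple[str, float]:
    return "haiku answer", 0.90   # stub: clears the threshold

async def call_sonnet(prompt: str) -> tuple[str, float]:
    return "sonnet answer", 0.95  # stub: never reached in this run

async def cascade(prompt: str, models: list[tuple[str, ModelCall]],
                  threshold: float = 0.85) -> tuple[str, str]:
    """Try models from cheapest to most expensive; stop at the first confident answer."""
    response, name = "", ""
    for name, call in models:
        response, confidence = await call(prompt)
        if confidence >= threshold:
            return response, name
    # No model cleared the threshold: return the last (strongest) model's answer.
    return response, name

models = [
    ("local-llama-4", call_local),
    ("claude-haiku-3.5", call_haiku),
    ("claude-sonnet-4.5", call_sonnet),
]
result = asyncio.run(cascade("Classify this ticket", models))
print(result)
```

One cost to keep in mind: every escalation pays for the failed cheaper attempts too, so a cascade only saves money when most queries stop at the first or second tier.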
Strategy 3: Prompt Optimization (20-40% Savings)
I reviewed my prompts. They were verbose.
Before (250 tokens):

    You are a helpful AI assistant that specializes in analyzing customer feedback.
    Your role is to carefully read through customer reviews and identify the main
    themes, sentiment, and actionable insights. Please make sure to provide a
    structured response that includes a summary, key points, and recommendations.

After (75 tokens):

    Analyze feedback. Return: themes, sentiment, insights, recommendations.
    Structure: summary, key_points, actions.

That’s 175 tokens saved per call. At 1000 calls, that’s 175,000 input tokens, or about $0.52 in savings at Sonnet’s $3 per million input tokens.
Prompt compression approach:

    class PromptOptimizer:
        def compress_prompt(self, prompt: str) -> str:
            """Reduce prompt tokens while preserving meaning"""
            # Remove redundancy
            prompt = self._remove_redundancy(prompt)
            # Use bullet points
            prompt = self._convert_to_bullets(prompt)
            # Abbreviate where context is clear
            prompt = self._abbreviate(prompt)
            return prompt

I also fixed my context management. I was sending the entire conversation history every time.
Dynamic context selection:

    class ContextManager:
        """Select relevant context for each agent call"""

        def __init__(self, max_context_tokens: int = 4000):
            self.max_tokens = max_context_tokens
            self.conversation_history = []

        def select_relevant_context(self, current_query: str) -> list:
            # Score every past message against the current query
            scored_messages = []
            for msg in self.conversation_history:
                score = self._calculate_relevance(current_query, msg)
                scored_messages.append((score, msg))

            scored_messages.sort(reverse=True, key=lambda x: x[0])

            selected_context = []
            token_count = 0

            # Take the most relevant messages that still fit in the token budget
            for score, msg in scored_messages:
                msg_tokens = self._count_tokens(msg)
                if token_count + msg_tokens <= self.max_tokens:
                    selected_context.append(msg)
                    token_count += msg_tokens

            return selected_context

Agent memory optimization:
- Bad: Send full memory to every agent (10,000+ tokens)
- Good: Send only relevant memory to each agent (500-1000 tokens)

Strategy 4: Output Control (20-40% Savings)
I stopped letting agents ramble.

    # Always set max_tokens
    response = client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=500,  # Hard cap: output is cut off at 500 tokens, so also ask for brevity in the prompt
        messages=[{"role": "user", "content": query}]
    )

Cost impact:
| Setting | Output Tokens | Cost per Call |
|---|---|---|
| No limit | ~1500 | $0.0225 |
| With limit (500) | ~400 | $0.006 |
| Savings | 73% | - |
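The table's figures follow directly from output-token pricing. A minimal sketch, assuming the $15 per million output tokens quoted for Claude Sonnet 4.5 elsewhere in this article; the `output_cost` helper is hypothetical.

```python
SONNET_OUTPUT_USD_PER_MILLION = 15.0  # from the pricing quoted in this article

def output_cost(output_tokens: int,
                usd_per_million: float = SONNET_OUTPUT_USD_PER_MILLION) -> float:
    """Cost of the generated tokens for a single call."""
    return output_tokens / 1_000_000 * usd_per_million

unlimited = output_cost(1500)  # typical response with no limit
limited = output_cost(400)     # typical response with max_tokens=500
savings = 1 - limited / unlimited

print(f"${unlimited:.4f} vs ${limited:.4f} per call, {savings:.0%} saved")
```

Output tokens cost five times as much as input tokens at these rates, which is why trimming responses often pays off faster than trimming prompts.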
I also switched to JSON responses instead of prose:
Prose response (500 tokens):

    Based on my analysis of the customer feedback, I found that there are several
    key themes that emerged. The first theme is about product quality, where
    customers mentioned...

JSON response (150 tokens):

    {
      "themes": ["quality", "price", "support"],
      "sentiment": "positive",
      "recommendations": ["Improve docs", "Add examples"]
    }

Same information. 70% fewer tokens.
Strategy 5: Self-Hosted Models (40-100% Savings)
For high-volume tasks, I moved to local models.
Break-even analysis:
| Monthly Volume | Recommendation |
|---|---|
| < 5M tokens | Use API |
| 5-20M tokens | Consider hybrid |
| > 20M tokens | Self-host + API for complex tasks |
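The thresholds in the table come from comparing a fixed self-hosting cost with a per-token API price. The sketch below shows that comparison; the $60/month server cost and $6 per million blended API price are hypothetical inputs, and the calculation ignores electricity, maintenance time, and quality differences between models.

```python
def break_even_tokens(self_host_usd_per_month: float, api_usd_per_million: float) -> float:
    """Monthly token volume above which a fixed-cost server is cheaper than the API."""
    return self_host_usd_per_month / api_usd_per_million * 1_000_000

def recommend(monthly_tokens: int) -> str:
    """Apply the volume thresholds from the table above."""
    if monthly_tokens < 5_000_000:
        return "Use API"
    if monthly_tokens <= 20_000_000:
        return "Consider hybrid"
    return "Self-host + API for complex tasks"

# Hypothetical inputs: a $60/month machine vs. a blended $6 per million tokens.
print(break_even_tokens(60.0, 6.0))
print(recommend(12_000_000))
```

If most of your traffic already runs on the cheapest API tier, the break-even volume moves up sharply, so it helps to compute it with the price of the model the local one would actually replace.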
I set up Ollama for routine tasks:

    # Install
    curl -fsSL https://ollama.com/install.sh | sh

    # Run local model
    ollama run llama3.2:3b

    # API-compatible endpoint
    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.2:3b",
      "prompt": "Why is the sky blue?"
    }'

Hybrid architecture:

    class HybridModelSelector:
        """Use local for routine, cloud for complex"""

        def __init__(self):
            self.local_client = LocalAIClient("http://localhost:8080")
            self.cloud_client = AnthropicClient()

        async def process(self, task: dict) -> str:
            if self._is_routine_task(task):
                return await self.local_client.generate(task)
            else:
                return await self.cloud_client.generate(task)

        def _is_routine_task(self, task: dict) -> bool:
            # A task is routine if its type mentions any of these keywords
            routine_patterns = ["summarize", "classify", "extract", "format"]
            return any(pattern in task['type'] for pattern in routine_patterns)

I followed the Reddit case approach: local Whisper for speech recognition, local model for code assistance, cloud API only for complex reasoning.
Strategy 6: Budget Controls
I added rate limiting and per-agent budgets to prevent runaway costs.

    import asyncio
    from datetime import datetime, timedelta

    class BudgetExceededError(Exception):
        """Raised when a request would push spend past the daily budget"""

    class AgentRateLimiter:
        """Prevent runaway API costs"""

        def __init__(self, max_requests_per_minute: int = 60, max_cost_per_day: float = 10.0):
            self.rpm_limit = max_requests_per_minute
            self.daily_budget = max_cost_per_day
            self.request_times = []
            self.daily_spend = 0.0

        async def acquire(self, estimated_cost: float):
            # Check minute limit
            now = datetime.now()
            minute_ago = now - timedelta(minutes=1)
            self.request_times = [t for t in self.request_times if t > minute_ago]

            if len(self.request_times) >= self.rpm_limit:
                # Wait until the oldest request leaves the one-minute window
                wait_time = 60 - (now - self.request_times[0]).total_seconds()
                await asyncio.sleep(max(wait_time, 0))
                now = datetime.now()  # refresh so the recorded time reflects the wait

            # Check daily budget
            if self.daily_spend + estimated_cost > self.daily_budget:
                raise BudgetExceededError(
                    f"Daily budget ${self.daily_budget} exceeded. "
                    f"Current spend: ${self.daily_spend:.2f}"
                )

            self.request_times.append(now)
            self.daily_spend += estimated_cost

Per-agent budget allocation:

    class AgentBudgetManager:
        """Allocate budgets to individual agents"""

        def __init__(self, total_budget: float):
            self.total_budget = total_budget
            # Fixed share of the total budget per agent (shares sum to 100%)
            self.agent_budgets = {
                "research_agent": total_budget * 0.30,
                "writer_agent": total_budget * 0.25,
                "code_agent": total_budget * 0.20,
                "memory_agent": total_budget * 0.15,
                "coordinator_agent": total_budget * 0.10
            }
            self.agent_spend = {agent: 0.0 for agent in self.agent_budgets}

        def can_agent_proceed(self, agent_id: str, estimated_cost: float) -> bool:
            budget = self.agent_budgets[agent_id]
            spent = self.agent_spend[agent_id]
            return spent + estimated_cost <= budget

Strategy 7: Cost Monitoring
I built a tracking system to see where money went.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Dict, List

    @dataclass
    class AgentUsage:
        agent_id: str
        model: str
        input_tokens: int
        output_tokens: int
        cost: float
        task_type: str
        timestamp: datetime

    class CostTracker:
        """Track costs across all agents"""

        # USD per token
        PRICING = {
            "gpt-5": {"input": 0.015 / 1000, "output": 0.06 / 1000},
            "claude-sonnet-4.5": {"input": 3.0 / 1_000_000, "output": 15.0 / 1_000_000},
            "claude-haiku-3.5": {"input": 0.25 / 1_000_000, "output": 1.25 / 1_000_000},
            "local-llama-4": {"input": 0.0, "output": 0.0}
        }

        def __init__(self):
            self.usage_log: List[AgentUsage] = []

        def track(self, agent_id: str, model: str, input_tokens: int,
                  output_tokens: int, task_type: str) -> float:
            pricing = self.PRICING[model]
            cost = (input_tokens * pricing["input"]) + (output_tokens * pricing["output"])

            usage = AgentUsage(
                agent_id=agent_id,
                model=model,
                input_tokens=input_tokens,
                output_tokens=output_tokens,
                cost=cost,
                task_type=task_type,
                timestamp=datetime.now()
            )
            self.usage_log.append(usage)
            return cost

        def get_cost_by_agent(self) -> Dict[str, float]:
            # Sum logged cost per agent
            costs = {}
            for usage in self.usage_log:
                costs[usage.agent_id] = costs.get(usage.agent_id, 0) + usage.cost
            return costs

This showed me that my Research Agent consumed 40% of costs. I moved it to a local model.
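Turning the per-agent totals into shares makes the biggest spender obvious. A minimal sketch, assuming a dictionary shaped like the return value of `get_cost_by_agent`; the `cost_share_by_agent` helper and the spend figures are hypothetical and chosen only to illustrate the calculation.

```python
def cost_share_by_agent(costs: dict[str, float]) -> dict[str, float]:
    """Convert per-agent spend into fractions of the total, largest first."""
    total = sum(costs.values())
    if total == 0:
        return {agent: 0.0 for agent in costs}
    shares = {agent: cost / total for agent, cost in costs.items()}
    # Sort so the most expensive agent comes first
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))

# Hypothetical spend in USD, for illustration only.
example = {
    "writer_agent": 25.0,
    "research_agent": 40.0,
    "code_agent": 20.0,
    "coordinator_agent": 15.0,
}
shares = cost_share_by_agent(example)
top_agent = next(iter(shares))
print(top_agent, f"{shares[top_agent]:.0%}")
```

The same grouping works for `model` or `task_type`, which helps when deciding which tasks to reroute rather than which agent to move.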
Multi-Agent Specific Optimizations
For agent communication, I switched from verbose messages to structured data:

    # Bad: Verbose inter-agent communication (150 tokens)
    agent1_to_agent2 = """
    Hey Agent 2, I've finished my research on the topic you requested.
    I found several interesting sources that you might want to look at.
    The main findings are related to X, Y, and Z. I think you should
    consider these when writing your section. Let me know if you need
    more details.
    """

    # Good: Structured minimal communication (30 tokens)
    agent1_to_agent2 = {
        "status": "complete",
        "findings": ["X", "Y", "Z"],
        "sources": 3,
        "confidence": 0.85
    }
    # 80% savings

I also implemented a shared context pool to avoid duplicate context across agents:

    from datetime import datetime
    from typing import Any

    class SharedContextPool:
        """Shared memory for multi-agent systems"""

        def __init__(self):
            self.shared_knowledge = {}
            self.agent_subscriptions = {}

        def publish(self, agent_id: str, key: str, value: Any):
            self.shared_knowledge[key] = {
                "value": value,
                "source": agent_id,
                "timestamp": datetime.now()
            }

            # Notify every subscriber except the agent that published the value
            for subscriber in self.agent_subscriptions.get(key, []):
                if subscriber != agent_id:
                    self._notify_agent(subscriber, key, value)

        def get(self, agent_id: str, key: str) -> Any:
            if key in self.shared_knowledge:
                return self.shared_knowledge[key]["value"]
            return None

Results
I applied all these strategies over 4 weeks:
Week 1: Caching + Tracking
- Added response caching
- Built cost tracking
- Savings: 30%
Week 2: Model Routing
- Implemented task-based routing
- Added model cascading
- Savings: 50%
Week 3: Budgets + Alerts
- Per-agent budgets
- Daily spend limits
- Prevented runaway costs
Week 4: Prompt Optimization
- Compressed prompts
- Dynamic context selection
- JSON output format
- Savings: 20%
Final results:
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly API cost | $320 | $75 | -77% |
| Rate limit hits | 15/day | 0-2/day | -90% |
| Avg tokens per task | 3500 | 1200 | -66% |
| Cache hit rate | 0% | 45% | +45% |
Quick Wins (Implement Today)
If you want to start immediately:
- Enable response caching: 30-70% savings
- Set max_tokens limits: 20-40% savings
- Use smaller models for simple tasks: 50-80% savings
- Implement per-agent budgets: Prevents runaway costs
Summary
In this post, I showed how to reduce AI agent API costs by 60-80%. The key strategies are caching to avoid redundant calls, routing tasks to appropriate models, optimizing prompts and context, self-hosting for high-volume tasks, and establishing strict budgets.
The biggest lesson: you can’t optimize what you don’t measure. Start with cost tracking, then apply strategies based on your usage patterns.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 Reddit Discussion - OpenClaw Multi-Agent Costs
- 👨‍💻 21medien LLM Cost Optimization Guide
- 👨‍💻 Ollama Local Models
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!