How Do I Optimize My AI Agent to Use Fewer Tokens?
The Problem
I built an AI agent that ran for 8 hours and cost me $47. My API bill went from $200/month to over $800 in just two weeks. The agent worked correctly, but it was burning tokens like crazy.
When I investigated, I found the problem:
Agent run log:
Iteration 1: 5,000 tokens input
Iteration 5: 15,000 tokens input
Iteration 10: 35,000 tokens input
Iteration 20: 80,000 tokens input

Total: ~800,000 tokens for one agent run
At $10/1M tokens: $8 per run
10 runs/day = $80/day = $2,400/month

The context window was growing unbounded. Every tool output, error message, and observation got appended to the conversation. No compression. No caching. No optimization.
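To see why appending everything gets expensive: the full history is resent on every iteration, so billed input tokens add up across steps. The sketch below is a simplification, assuming linear growth of about 4,000 tokens per iteration (a hypothetical round number chosen to roughly match the log above, which actually grows faster in later iterations). The function names are illustrative.

```python
def cumulative_input_tokens(base_tokens: int, added_per_step: int, steps: int) -> int:
    """Total input tokens billed when the whole history is resent each step."""
    # Step i (0-based) sends the base prompt plus everything appended so far.
    return sum(base_tokens + i * added_per_step for i in range(steps))


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Convert a token count to dollars at a flat per-million price."""
    return tokens / 1_000_000 * usd_per_million


# Assumed inputs: 5,000-token start, ~4,000 tokens added per step, 20 steps.
total = cumulative_input_tokens(base_tokens=5_000, added_per_step=4_000, steps=20)
print(total)                  # 860000
print(cost_usd(total, 10.0))  # roughly 8.6 dollars at $10/1M
```

Under these assumptions a 20-step run lands close to the ~800,000 tokens and $8 per run from the log above. Because the growth term scales with the square of the step count, longer runs get disproportionately more expensive.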
Why Agents Waste Tokens
I discovered four main causes:
1. Context Window Bloat
Each action adds observations. Tool outputs can be thousands of tokens. Error messages repeat. Without intervention, context grows to maximum limits.
# This is what I was doing - WRONG
conversation = []
while not done:
    response = llm.chat(conversation + [user_message])
    conversation.append(user_message)
    conversation.append(response)
    # Context grows indefinitely!

2. Redundant API Calls
My agent made identical calls multiple times. Same query in different reasoning steps. Similar prompts with minor variations. No caching.
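A quick way to check whether this applies to your agent is to count repeated prompts in a call log. This is a minimal sketch, assuming you already collect the prompts sent during a run as a list of strings. `find_repeated_prompts` is a hypothetical helper, and its normalization (lowercasing, collapsing whitespace) only catches near-identical text, not paraphrases.

```python
from collections import Counter


def find_repeated_prompts(prompts: list[str]) -> dict[str, int]:
    """Return normalized prompts that were sent more than once, with counts."""
    # Lowercase and collapse runs of whitespace so trivial variations match.
    normalized = (" ".join(p.lower().split()) for p in prompts)
    counts = Counter(normalized)
    return {prompt: n for prompt, n in counts.items() if n > 1}


# Example call log (made-up prompts for illustration).
log = ["List files in /tmp", "list files  in /tmp ", "Read config.yaml"]
print(find_repeated_prompts(log))
```

If a large share of a run shows up in the result, caching is likely to pay off.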
3. Prompt Verbosity
I wrote long system prompts with unnecessary examples and redundant instructions. Each call sent 200+ tokens of setup that rarely changed.
4. Wrong Model for Simple Tasks
I used GPT-4 for everything, even simple JSON-to-YAML conversions. That's a 60x cost difference from using a lightweight model.
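As a rough check on that figure, the sketch below computes the price ratio and the cost of a mixed workload. It assumes the per-million prices and the 2,000-tokens-per-query figure used in the tiered-routing section later in this post. The `PRICE_PER_MILLION` table and `batch_cost` helper are illustrative names, and real prices change, so check your provider's current pricing.

```python
# Assumed prices in USD per 1M tokens, taken from the tier table in this post.
PRICE_PER_MILLION = {"lightweight": 0.25, "balanced": 3.00, "capable": 15.00}


def batch_cost(query_counts: dict[str, int], tokens_per_query: int) -> float:
    """Cost in USD for a batch of queries split across model tiers."""
    return sum(
        count * tokens_per_query * PRICE_PER_MILLION[tier] / 1_000_000
        for tier, count in query_counts.items()
    )


ratio = PRICE_PER_MILLION["capable"] / PRICE_PER_MILLION["lightweight"]
all_capable = batch_cost({"capable": 100}, tokens_per_query=2_000)
tiered = batch_cost(
    {"lightweight": 70, "balanced": 20, "capable": 10}, tokens_per_query=2_000
)
print(f"price ratio: {ratio:.0f}x")
print(f"all capable: ${all_capable:.3f}, tiered: ${tiered:.3f}")
```

The mix of 70/20/10 is an assumption. Adjust the counts to match how your own queries break down by difficulty.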
Solution 1: Context Compression
I implemented a sliding window with summarization:
import tiktoken
from dataclasses import dataclass
from typing import List


@dataclass
class CompressedMessage:
    role: str
    content: str
    token_count: int
    summarized: bool = False


class ContextCompressor:
    """Keep context bounded at target size"""

    def __init__(self, max_tokens: int = 4000, keep_recent: int = 3):
        self.max_tokens = max_tokens
        self.keep_recent = keep_recent
        self.messages: List[CompressedMessage] = []
        self.encoding = tiktoken.encoding_for_model("gpt-4")

    def add_message(self, role: str, content: str):
        token_count = len(self.encoding.encode(content))
        self.messages.append(CompressedMessage(
            role=role,
            content=content,
            token_count=token_count
        ))

        # Compress early, at 70% of the budget, to leave headroom
        if self._get_total_tokens() > self.max_tokens * 0.7:
            self._compress_context()

    def _get_total_tokens(self) -> int:
        return sum(m.token_count for m in self.messages)

    def _compress_context(self):
        """Compress older messages into summary"""
        if len(self.messages) <= self.keep_recent:
            return

        to_compress = self.messages[:-self.keep_recent]
        summary = self._generate_summary(to_compress)

        # Count tokens on the full stored content, prefix included
        content = f"[Previous Context]\n{summary}"
        compressed_msg = CompressedMessage(
            role="system",
            content=content,
            token_count=len(self.encoding.encode(content)),
            summarized=True
        )

        self.messages = [compressed_msg] + self.messages[-self.keep_recent:]

    def _generate_summary(self, messages: List[CompressedMessage]) -> str:
        # Extract key information (a simple keyword count; a previous
        # summary message is counted like any other message)
        actions = sum(1 for m in messages if "action" in m.content.lower())
        errors = sum(1 for m in messages if "error" in m.content.lower())
        return f"{actions} actions, {errors} errors encountered"

    def get_context_for_api(self) -> List[dict]:
        return [{"role": m.role, "content": m.content} for m in self.messages]

This keeps my context bounded at 4,000 tokens instead of growing to 128,000.
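One caveat: the most recent `keep_recent` messages are kept verbatim, so a single very large tool output can still push the context past `max_tokens`. A minimal sketch of one possible guard, which is not part of the implementation above, is to cap each tool output before passing it to `add_message`. The helper name `truncate_middle` and the character budget are hypothetical, and characters are only a rough proxy for tokens.

```python
def truncate_middle(
    text: str,
    max_chars: int = 2000,
    marker: str = "\n...[truncated]...\n",
) -> str:
    """Keep the head and tail of a long string and drop the middle."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)
    if keep <= 0:
        # Budget too small to fit the marker: fall back to a plain cut.
        return text[:max_chars]
    head = (keep + 1) // 2
    tail = keep - head
    # text[-0:] would return the whole string, so handle tail == 0 explicitly.
    return text[:head] + marker + (text[-tail:] if tail else "")


# Example: a made-up 5,000-character tool output capped at 200 characters.
capped = truncate_middle("x" * 5000, max_chars=200)
print(len(capped))
```

Keeping both ends tends to preserve the command echo at the top and the final error or result at the bottom, which is usually what the agent needs next.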
Solution 2: Semantic Caching
I added caching to avoid redundant API calls:
import hashlib
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CacheEntry:
    query_hash: str
    response: Any
    timestamp: datetime
    hit_count: int = 0


class SemanticCache:
    """Cache responses for similar queries"""

    def __init__(self, ttl_minutes: int = 30):
        self.ttl = timedelta(minutes=ttl_minutes)
        self.cache: dict[str, CacheEntry] = {}

    def _hash_query(self, query: str) -> str:
        # Matching is exact after normalization (case and surrounding
        # whitespace), not embedding-based similarity
        normalized = query.lower().strip()
        return hashlib.sha256(normalized.encode()).hexdigest()[:16]

    def get(self, query: str) -> Optional[Any]:
        query_hash = self._hash_query(query)

        if query_hash in self.cache:
            entry = self.cache[query_hash]
            if datetime.now() - entry.timestamp < self.ttl:
                entry.hit_count += 1
                return entry.response
            else:
                # Entry expired: evict it
                del self.cache[query_hash]
        return None

    def set(self, query: str, response: Any):
        query_hash = self._hash_query(query)
        self.cache[query_hash] = CacheEntry(
            query_hash=query_hash,
            response=response,
            timestamp=datetime.now()
        )


# Usage
cache = SemanticCache()

async def cached_query(query: str):
    cached = cache.get(query)
    # Compare against None so empty-but-valid responses still count as hits
    if cached is not None:
        print("Cache HIT")
        return cached

    result = await llm.generate(query)
    cache.set(query, result)
    return result

My cache hit rate is now 30-40%, saving hundreds of API calls per day.
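To know your own hit rate rather than guess it, count hits and misses at the lookup site. This is a minimal sketch with a hypothetical `CacheStats` class that is not part of the cache above; call `record(...)` wherever you check the cache.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running hit/miss counters for a cache."""
    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        # Avoid dividing by zero before any lookups have happened.
        return self.hits / total if total else 0.0


# Example with made-up lookups: 7 hits out of 20.
stats = CacheStats()
for was_hit in [True] * 7 + [False] * 13:
    stats.record(was_hit)
print(f"hit rate: {stats.hit_rate:.0%}")
```

A low hit rate with many near-duplicate prompts suggests the normalization step is too strict for your workload.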
Solution 3: Efficient Prompts
I replaced verbose prompts with concise ones:
# BEFORE: 150+ tokens
verbose_prompt = """You are an AI assistant that helps users search through documentation.
Your role is to understand user queries and return relevant information.
You should always be helpful and provide accurate information.
When the user asks a question, you should search through the available
documentation and provide the most relevant results. Make sure to
format your response in a clear and readable manner. If you cannot
find relevant information, you should let the user know politely."""

# AFTER: 20 tokens
concise_prompt = """Search docs. Return top 5 results or "None found".
Format: bullet points."""

# Token savings: 87%

For structured output, I request specific formats:

# Instead of open-ended responses
prompt = """Return JSON:
{
  "approach": "redis|memcached|memory",
  "code": "<implementation>",
  "dependencies": []
}
Max 30 lines code."""

Solution 4: Tiered Model Selection
I route tasks to appropriate models:
from enum import Enum
from typing import Any


class ModelTier(Enum):
    LIGHTWEIGHT = "claude-3-haiku"   # $0.25/1M tokens
    BALANCED = "claude-3-sonnet"     # $3/1M tokens
    CAPABLE = "claude-3-opus"        # $15/1M tokens


class TieredRouter:
    """Route queries to appropriate model"""

    def __init__(self, clients: dict[ModelTier, Any]):
        self.clients = clients

    def select_tier(self, query: str) -> ModelTier:
        query_lower = query.lower()

        # High complexity keywords
        if any(kw in query_lower for kw in ["analyze", "design", "architect"]):
            return ModelTier.CAPABLE

        # Medium complexity
        if any(kw in query_lower for kw in ["implement", "debug", "refactor"]):
            return ModelTier.BALANCED

        # Simple tasks
        return ModelTier.LIGHTWEIGHT

    async def route(self, query: str) -> tuple[Any, ModelTier]:
        tier = self.select_tier(query)
        result = await self.clients[tier].generate(query)
        return result, tier

Cost comparison for 100 queries:

All CAPABLE: 100 * $15/1M * 2000 tokens = $3.00
Tiered (70% light, 20% balanced, 10% capable):
  70 * $0.25/1M * 2000 = $0.035
  20 * $3/1M * 2000 = $0.12
  10 * $15/1M * 2000 = $0.30
  Total: $0.455

Savings: 85%

Common Mistakes I Made
Mistake 1: Keeping full conversation history
# BAD
conversation = []
while not done:
    response = llm.chat(conversation + [user_message])
    conversation.append(user_message)
    conversation.append(response)

# GOOD
compressor = ContextCompressor(max_tokens=4000)
while not done:
    # Record the user turn too, so the model actually sees it
    compressor.add_message("user", user_message)
    response = llm.chat(compressor.get_context_for_api())
    compressor.add_message("assistant", response)

Mistake 2: No caching for repeated queries
# BAD: Every query hits API
async def search(query: str):
    return await llm.generate(f"Search: {query}")

# GOOD
cache = SemanticCache()

async def search(query: str):
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = await llm.generate(f"Search: {query}")
    cache.set(query, result)
    return result

Mistake 3: Using capable model for simple tasks
# BAD: Expensive model for simple conversion
result = opus.generate("Convert JSON to YAML")  # $15/1M

# GOOD
result = haiku.generate("Convert JSON to YAML")  # $0.25/1M
# 60x cheaper for identical utility

The Results
After implementing all four optimizations:
Before optimization:
- Token usage: ~800,000 per run
- Cost: $8/run, $2,400/month
- Context size: 128k tokens

After optimization:
- Token usage: ~160,000 per run (80% reduction)
- Cost: $1.60/run, $480/month (80% savings)
- Context size: 4k tokens (bounded)
- Cache hit rate: 35%

Summary
In this post, I showed how to reduce AI agent token consumption by 60-80%. The four key strategies are: context compression (sliding window with summarization), semantic caching (avoid redundant API calls), efficient prompts (concise, structured output requests), and tiered model selection (right-size model for task complexity).
Start with context compression - it has the biggest impact. Add caching next. Then refine prompts and implement tiered routing. A single afternoon of optimization can save thousands in monthly API costs.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are the most important links from this article, along with some further resources on this topic:
- Reddit: The Hidden Cost of Ghost AI Agents
- LangChain Token Management Guide
- Claude API Pricing
Oh, and if you found these resources useful, don't forget to support me by starring the repo on GitHub!