
How Do I Optimize My AI Agent to Use Less Tokens?

The Problem

I built an AI agent that ran for 8 hours and cost me $47. My API bill went from $200/month to over $800 in just two weeks. The agent worked correctly, but it was burning tokens like crazy.

When I investigated, I found the problem:

Agent run log:
Iteration 1: 5,000 tokens input
Iteration 5: 15,000 tokens input
Iteration 10: 35,000 tokens input
Iteration 20: 80,000 tokens input
Total: ~800,000 tokens for one agent run
At $10/1M tokens: $8 per run
10 runs/day = $80/day = $2,400/month
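
The arithmetic in the log can be reproduced in a few lines. This is a minimal sketch, not the script I used at the time: it assumes input tokens grow linearly between the four logged iterations (the interpolation is an assumption, since the log only shows those data points) and uses the flat $10 per 1M tokens rate from the log. The names `interpolate` and `run_cost` are made up for illustration.

```python
# Logged (iteration, input_tokens) points from the run above.
LOGGED = [(1, 5_000), (5, 15_000), (10, 35_000), (20, 80_000)]
PRICE_PER_MILLION = 10.0  # USD, the flat rate used in the log


def interpolate(points, iteration):
    """Linearly interpolate input tokens for an iteration (an assumption)."""
    for (i0, t0), (i1, t1) in zip(points, points[1:]):
        if i0 <= iteration <= i1:
            return t0 + (t1 - t0) * (iteration - i0) / (i1 - i0)
    raise ValueError("iteration outside logged range")


def run_cost(points, price_per_million):
    """Return (total input tokens, cost in USD) for one agent run."""
    first, last = points[0][0], points[-1][0]
    total_tokens = sum(interpolate(points, i) for i in range(first, last + 1))
    return total_tokens, total_tokens * price_per_million / 1_000_000


total_tokens, cost = run_cost(LOGGED, PRICE_PER_MILLION)
print(f"{total_tokens:,.0f} tokens, ${cost:.2f} per run")
```

Under the linear-growth assumption this comes out to 782,500 input tokens, or about $7.83 per run, which matches the rounded figures in the log.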

The context window was growing unbounded. Every tool output, error message, and observation got appended to the conversation. No compression. No caching. No optimization.

Why Agents Waste Tokens

I discovered four main causes:

1. Context Window Bloat

Each action adds observations. Tool outputs can be thousands of tokens. Error messages repeat. Without intervention, context grows to maximum limits.

bad-context.py
# This is what I was doing - WRONG
conversation = []
while not done:
    response = llm.chat(conversation + [user_message])
    conversation.append(user_message)
    conversation.append(response)
    # Context grows indefinitely!

2. Redundant API Calls

My agent made identical calls multiple times. Same query in different reasoning steps. Similar prompts with minor variations. No caching.
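
One way to check whether your own agent has this problem is to count how often the same prompt is sent. The sketch below is illustrative only: `find_duplicate_prompts` and the sample call log are hypothetical, and prompts are normalized just by lowercasing and collapsing whitespace, so it will not catch prompts that differ in wording.

```python
from collections import Counter


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variations match."""
    return " ".join(prompt.lower().split())


def find_duplicate_prompts(prompts: list[str]) -> dict[str, int]:
    """Return normalized prompts that were sent more than once, with counts."""
    counts = Counter(normalize(p) for p in prompts)
    return {prompt: n for prompt, n in counts.items() if n > 1}


# Hypothetical call log for illustration.
call_log = [
    "List files in src/",
    "list files in  src/",
    "Summarize README.md",
    "List files in src/",
]
duplicates = find_duplicate_prompts(call_log)
# Every repeat after the first is a call a cache could have answered.
wasted_calls = sum(n - 1 for n in duplicates.values())
print(duplicates, wasted_calls)
```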

3. Prompt Verbosity

I wrote long system prompts with unnecessary examples and redundant instructions. Each call sent 200+ tokens of setup that rarely changed.

4. Wrong Model for Simple Tasks

I used GPT-4 for everything, even simple JSON-to-YAML conversions. Between the most capable and the lightest model tiers compared later in this post ($15 vs. $0.25 per 1M tokens), the price gap is 60x.

Solution 1: Context Compression

I implemented a sliding window with summarization:

context-compressor.py
import tiktoken
from dataclasses import dataclass
from typing import List


@dataclass
class CompressedMessage:
    role: str
    content: str
    token_count: int
    summarized: bool = False


class ContextCompressor:
    """Keep context near a target size by summarizing older messages"""

    def __init__(
        self,
        max_tokens: int = 4000,
        keep_recent: int = 3
    ):
        self.max_tokens = max_tokens
        self.keep_recent = keep_recent
        self.messages: List[CompressedMessage] = []
        self.encoding = tiktoken.encoding_for_model("gpt-4")

    def add_message(self, role: str, content: str):
        token_count = len(self.encoding.encode(content))
        self.messages.append(CompressedMessage(
            role=role,
            content=content,
            token_count=token_count
        ))
        # Compress early, at 70% of the budget, to leave headroom
        if self._get_total_tokens() > self.max_tokens * 0.7:
            self._compress_context()

    def _get_total_tokens(self) -> int:
        return sum(m.token_count for m in self.messages)

    def _compress_context(self):
        """Compress older messages into summary"""
        if len(self.messages) <= self.keep_recent:
            return
        to_compress = self.messages[:-self.keep_recent]
        summary = self._generate_summary(to_compress)
        content = f"[Previous Context]\n{summary}"
        compressed_msg = CompressedMessage(
            role="system",
            content=content,
            # Count the tokens actually sent, including the prefix
            token_count=len(self.encoding.encode(content)),
            summarized=True
        )
        self.messages = [compressed_msg] + self.messages[-self.keep_recent:]

    def _generate_summary(self, messages: List[CompressedMessage]) -> str:
        # Naive placeholder: counts keyword matches only, so details of
        # older messages are lost after compression
        actions = sum(1 for m in messages if "action" in m.content.lower())
        errors = sum(1 for m in messages if "error" in m.content.lower())
        return f"{actions} actions, {errors} errors encountered"

    def get_context_for_api(self) -> List[dict]:
        return [{"role": m.role, "content": m.content} for m in self.messages]

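This keeps my context close to the 4,000-token target instead of growing toward 128,000. Compression kicks in at 70% of the budget, and the three most recent messages are always kept verbatim, so the bound only holds while those messages are themselves small.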
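
Since the most recent messages are never shortened by the compressor, a single large tool output can still push the context past the target. One possible guard is to clip each message before it is stored. This is a sketch and not part of the compressor above: `clip_to_budget` is a hypothetical helper, and the 4-characters-per-token estimate is a rough assumption standing in for a real tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough assumption; use a real tokenizer for accuracy
MARKER = "\n[...truncated...]\n"


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def clip_to_budget(text: str, max_tokens: int) -> str:
    """Keep the head and tail of an oversized text and drop the middle.

    For budgets smaller than the marker itself, only the marker is returned,
    which may slightly exceed the budget.
    """
    if estimate_tokens(text) <= max_tokens:
        return text
    max_chars = max_tokens * CHARS_PER_TOKEN
    keep = max(0, (max_chars - len(MARKER)) // 2)
    if keep == 0:
        return MARKER.strip()
    return text[:keep] + MARKER + text[-keep:]


# Example: a 10,000-character tool output clipped to roughly 100 tokens.
tool_output = "x" * 10_000
clipped = clip_to_budget(tool_output, max_tokens=100)
print(len(tool_output), "->", len(clipped), "characters")
```

Keeping both the head and the tail is a design choice: stack traces and command output often carry the useful part at the end, while the beginning shows what was run.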

Solution 2: Semantic Caching

I added caching to avoid redundant API calls:

semantic-cache.py
import hashlib
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CacheEntry:
    query_hash: str
    response: Any
    timestamp: datetime
    hit_count: int = 0


class SemanticCache:
    """Cache responses for queries that match after normalization.

    Matching is by hash of the lowercased, stripped query, so only
    near-identical queries hit; no embedding similarity is involved.
    """

    def __init__(self, ttl_minutes: int = 30):
        self.ttl = timedelta(minutes=ttl_minutes)
        self.cache: dict[str, CacheEntry] = {}

    def _hash_query(self, query: str) -> str:
        normalized = query.lower().strip()
        return hashlib.sha256(normalized.encode()).hexdigest()[:16]

    def get(self, query: str) -> Optional[Any]:
        query_hash = self._hash_query(query)
        if query_hash in self.cache:
            entry = self.cache[query_hash]
            if datetime.now() - entry.timestamp < self.ttl:
                entry.hit_count += 1
                return entry.response
            else:
                # Entry expired: evict it
                del self.cache[query_hash]
        return None

    def set(self, query: str, response: Any):
        query_hash = self._hash_query(query)
        self.cache[query_hash] = CacheEntry(
            query_hash=query_hash,
            response=response,
            timestamp=datetime.now()
        )


# Usage (`llm` is the agent's async LLM client, defined elsewhere)
cache = SemanticCache()


async def cached_query(query: str):
    cached = cache.get(query)
    # Compare against None so empty-but-valid responses still count as hits
    if cached is not None:
        print("Cache HIT")
        return cached
    result = await llm.generate(query)
    cache.set(query, result)
    return result

My cache hit rate is now 30-40%, saving hundreds of API calls per day.
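
A hit rate like this is easy to measure with a small counter next to the cache. The following is a minimal sketch under my own naming: `CacheStats` is hypothetical and not part of the cache class, and the lookup sequence at the bottom is made-up data.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Count cache hits and misses and report the hit rate."""
    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        # Avoid dividing by zero before any lookup has happened.
        return self.hits / total if total else 0.0


# Made-up sequence of 20 lookups: 7 hits and 13 misses.
stats = CacheStats()
for hit in [True] * 7 + [False] * 13:
    stats.record(hit)
print(f"hit rate: {stats.hit_rate:.0%}")
```

In the agent, `stats.record(cached is not None)` would be called once per lookup, right after `cache.get(query)`.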

Solution 3: Efficient Prompts

I replaced verbose prompts with concise ones:

prompt-comparison.py
# BEFORE: 150+ tokens
verbose_prompt = """
You are an AI assistant that helps users search through documentation.
Your role is to understand user queries and return relevant information.
You should always be helpful and provide accurate information.
When the user asks a question, you should search through the available
documentation and provide the most relevant results. Make sure to
format your response in a clear and readable manner. If you cannot
find relevant information, you should let the user know politely.
"""
# AFTER: 20 tokens
concise_prompt = """
Search docs. Return top 5 results or "None found".
Format: bullet points.
"""
# Token savings: 87%

For structured output, I request specific formats:

structured-output.py
# Instead of open-ended responses
prompt = """
Return JSON:
{
"approach": "redis|memcached|memory",
"code": "<implementation>",
"dependencies": []
}
Max 30 lines code.
"""

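A fixed format only saves tokens if the reply can be parsed without a retry, so it helps to validate the response before using it. The sketch below is an assumption-laden example: `parse_approach_response` is a hypothetical helper, and its checks simply mirror the fields and the 30-line limit requested in the prompt above.

```python
import json

ALLOWED_APPROACHES = ("redis", "memcached", "memory")
MAX_CODE_LINES = 30


def parse_approach_response(raw: str) -> dict:
    """Parse and validate a reply to the JSON prompt; raise ValueError if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if data.get("approach") not in ALLOWED_APPROACHES:
        raise ValueError("approach must be one of redis, memcached, memory")
    if not isinstance(data.get("code"), str):
        raise ValueError("code must be a string")
    if len(data["code"].splitlines()) > MAX_CODE_LINES:
        raise ValueError("code exceeds 30 lines")
    if not isinstance(data.get("dependencies"), list):
        raise ValueError("dependencies must be a list")
    return data


# Hypothetical model reply for illustration.
reply = '{"approach": "memory", "code": "cache = {}", "dependencies": []}'
parsed = parse_approach_response(reply)
print(parsed["approach"])
```
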
Solution 4: Tiered Model Selection

I route tasks to appropriate models:

model-router.py
from enum import Enum
from typing import Any


class ModelTier(Enum):
    LIGHTWEIGHT = "claude-3-haiku"  # $0.25/1M input tokens
    BALANCED = "claude-3-sonnet"    # $3/1M input tokens
    CAPABLE = "claude-3-opus"       # $15/1M input tokens


class TieredRouter:
    """Route queries to appropriate model"""

    def __init__(self, clients: dict[ModelTier, Any]):
        self.clients = clients

    def select_tier(self, query: str) -> ModelTier:
        query_lower = query.lower()
        # High complexity keywords
        if any(kw in query_lower for kw in ["analyze", "design", "architect"]):
            return ModelTier.CAPABLE
        # Medium complexity
        if any(kw in query_lower for kw in ["implement", "debug", "refactor"]):
            return ModelTier.BALANCED
        # Simple tasks
        return ModelTier.LIGHTWEIGHT

    async def route(self, query: str) -> tuple[Any, ModelTier]:
        tier = self.select_tier(query)
        result = await self.clients[tier].generate(query)
        return result, tier

Cost comparison for 100 queries:

All CAPABLE: 100 * $15/1M * 2000 tokens = $3.00
Tiered (70% light, 20% balanced, 10% capable):
  70 * $0.25/1M * 2000 = $0.035
  20 * $3/1M * 2000 = $0.12
  10 * $15/1M * 2000 = $0.30
  Total: $0.455
Savings: 85%
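
The same comparison can be written as a small function, which makes it easy to try other traffic mixes. This is a sketch using the per-1M-token prices from the tiers above as flat rates; `blended_cost` and the tier keys are names I made up for illustration.

```python
# USD per 1M tokens, taken from the tier comments above and treated as flat rates.
PRICE_PER_MILLION = {"light": 0.25, "balanced": 3.00, "capable": 15.00}


def blended_cost(queries: int, tokens_per_query: int, mix: dict[str, float]) -> float:
    """Cost in USD when `mix` gives the fraction of queries sent to each tier."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(
        queries * share * tokens_per_query * PRICE_PER_MILLION[tier] / 1_000_000
        for tier, share in mix.items()
    )


all_capable = blended_cost(100, 2000, {"capable": 1.0})
tiered = blended_cost(100, 2000, {"light": 0.7, "balanced": 0.2, "capable": 0.1})
savings = 1 - tiered / all_capable
print(f"${all_capable:.2f} vs ${tiered:.3f}, savings {savings:.0%}")
```

With the 70/20/10 mix this reproduces the figures above: $3.00 for the all-capable case, $0.455 tiered, and roughly 85% savings.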

Common Mistakes I Made

Mistake 1: Keeping full conversation history

mistake-1.py
# BAD
conversation = []
while not done:
    response = llm.chat(conversation + [user_message])
    conversation.append(user_message)
    conversation.append(response)

# GOOD
compressor = ContextCompressor(max_tokens=4000)
while not done:
    # Record the user turn too, otherwise the model never sees it
    compressor.add_message("user", user_message)
    response = llm.chat(compressor.get_context_for_api())
    compressor.add_message("assistant", response)

Mistake 2: No caching for repeated queries

mistake-2.py
# BAD: Every query hits API
async def search(query: str):
    return await llm.generate(f"Search: {query}")


# GOOD
cache = SemanticCache()


async def search(query: str):
    cached = cache.get(query)
    # Compare against None so empty-but-valid responses still count as hits
    if cached is not None:
        return cached
    result = await llm.generate(f"Search: {query}")
    cache.set(query, result)
    return result

Mistake 3: Using capable model for simple tasks

mistake-3.py
# BAD: Expensive model for simple conversion
result = opus.generate("Convert JSON to YAML") # $15/1M
# GOOD
result = haiku.generate("Convert JSON to YAML") # $0.25/1M
# 60x cheaper for identical utility

The Results

After implementing all four optimizations:

Before optimization:
- Token usage: ~800,000 per run
- Cost: $8/run, $2,400/month
- Context size: 128k tokens
After optimization:
- Token usage: ~160,000 per run (80% reduction)
- Cost: $1.60/run, $480/month (80% savings)
- Context size: 4k tokens (bounded)
- Cache hit rate: 35%

Summary

In this post, I showed how to reduce AI agent token consumption by 60-80%. The four key strategies are: context compression (sliding window with summarization), semantic caching (avoid redundant API calls), efficient prompts (concise, structured output requests), and tiered model selection (right-size model for task complexity).

Start with context compression - it has the biggest impact. Add caching next. Then refine prompts and implement tiered routing. A single afternoon of optimization can save thousands in monthly API costs.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
