How I Cut My AI API Token Usage by 75%
Problem
My AI API bill was exploding. I was spending $400/month on tokens, and I couldn’t figure out why. Then I discovered my agent was “slamming 128k context window” into every single API call, even for simple questions:
    User: What's 2+2?
    Agent: [Loads 128,000 tokens of context]
           [Processes entire context]
           [Returns: 4]
           [Cost: $0.40 per call]

    Total daily cost: $40+ for trivial questions

I saw a Reddit thread about OpenClaw users doing exactly this: running “millions of ghost agents, running 24/7” that waste API resources for everyone. But I wasn’t running millions of agents. I was just… inefficient.
Environment
- Python 3.11 with LangGraph
- Anthropic Claude API
- Daily agent runs for task automation
- Average 50-100 API calls per day
- Context window: 200k tokens
- Expected cost: ~$50/month
- Actual cost: $400+/month
What happened?
I built an agent system to automate content creation. Each agent handles a different task: research, writing, editing, publishing. The architecture looked reasonable:
    Research Agent → Writer Agent → Editor Agent → Publisher Agent
          ↓               ↓              ↓               ↓
      Context A       Context B      Context C       Context D

But I made a critical mistake. I passed the entire conversation history to each agent:
    import anthropic
    from langgraph.graph import StateGraph
    from typing import TypedDict, List, Any

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    class AgentState(TypedDict):
        messages: List[Any]  # Full conversation history
        task: str
        result: str

    def research_agent(state: AgentState) -> AgentState:
        # Receives ENTIRE conversation history
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=4096,
            messages=state["messages"]  # All 50+ previous messages
        )
        # response.content is a list of content blocks; the text is on the first one
        return {"result": response.content[0].text}

    def writer_agent(state: AgentState) -> AgentState:
        # Also receives ENTIRE conversation history
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=4096,
            messages=state["messages"]  # Still all 50+ messages
        )
        return {"result": response.content[0].text}

    # Each agent processes the same 50k+ tokens
    # 4 agents × 50k tokens = 200k tokens per workflow
    # 100 workflows/day = 20M tokens/day
    # 20M tokens × $0.015/1k = $300/day

I profiled my actual usage:
    Average context per call: 87,000 tokens
    Average response per call: 1,200 tokens
    Number of calls per day: 85
    Daily token usage: 7.4M input + 102k output
    Daily cost: $111 (input) + $3 (output) = $114/day
    Monthly cost: ~$3,400

    But wait... I only process 100 simple tasks!

I was paying 68x more than expected because every agent re-processed the entire conversation history.
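The input-side arithmetic behind the profile can be reproduced in a few lines. This is a minimal sketch, assuming the flat rate of $0.015 per 1k input tokens used in the estimate above; the function name `daily_input_cost` is made up for illustration, and output tokens are left out because the post does not state an output price.

```python
# Assumed price: $0.015 per 1k input tokens (the rate used in the estimate above).
PRICE_PER_1K_INPUT = 0.015


def daily_input_cost(calls_per_day: int, tokens_per_call: int,
                     price_per_1k: float = PRICE_PER_1K_INPUT) -> float:
    """Input-side cost in dollars for one day of API calls."""
    total_tokens = calls_per_day * tokens_per_call
    return total_tokens / 1000 * price_per_1k


# Profiled numbers: 85 calls per day at 87,000 context tokens each.
profiled = daily_input_cost(calls_per_day=85, tokens_per_call=87_000)
print(f"Daily input tokens: {85 * 87_000:,}")
print(f"Daily input cost:   ${profiled:,.2f}")
print(f"Monthly (30 days):  ${profiled * 30:,.2f}")
```

Because cost is linear in context size, any cut in tokens per call cuts the input bill by the same proportion.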
How to solve it?
I tried several approaches to reduce token waste.
Attempt 1: Truncate conversation history
First, I tried limiting the history to recent messages:
    def trim_context(messages: List[Any], max_messages: int = 10) -> List[Any]:
        """Keep only the most recent messages."""
        return messages[-max_messages:]

    def research_agent(state: AgentState) -> AgentState:
        trimmed = trim_context(state["messages"], max_messages=10)
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=4096,
            messages=trimmed  # Only 10 messages now
        )
        # response.content is a list of content blocks; return the text
        return {"result": response.content[0].text}

This reduced my usage:
    Before: 87,000 tokens per call
    After: 15,000 tokens per call

    Reduction: 82%
    New daily cost: $20/day
    New monthly cost: $600

But I hit a new problem: agents lost important context from early messages.

    Research Agent: "Found 5 sources about Python async"
    [10 messages later]
    Writer Agent: "What sources? I don't see any sources mentioned."

Attempt 2: Extract key information
Instead of keeping full messages, I extracted and stored key facts:
    from dataclasses import dataclass
    from typing import Dict, List, Set

    @dataclass
    class KeyInformation:
        facts: Dict[str, str]    # Key findings
        decisions: List[str]     # Decisions made
        sources: Set[str]        # Reference URLs
        constraints: List[str]   # Requirements

    def extract_key_info(messages: List[Any]) -> KeyInformation:
        """Extract essential information from conversation."""
        extraction_prompt = """
        Extract key information from this conversation:
        - Important facts discovered
        - Decisions made
        - Source URLs mentioned
        - Constraints and requirements

        Return as structured JSON.
        """
        # Flatten the history so the model actually sees the conversation
        # (assumes dict messages with string "role" and "content" fields)
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        # Use a smaller model for extraction
        response = client.messages.create(
            model="claude-3-haiku-20240307",  # Cheaper model
            max_tokens=1024,
            messages=[{"role": "user", "content": f"{extraction_prompt}\n{transcript}"}]
        )
        # parse_extraction: helper (not shown) that turns the JSON into KeyInformation
        return parse_extraction(response.content[0].text)

    def research_agent(state: AgentState, key_info: KeyInformation) -> AgentState:
        # Include only key facts + current task
        # format_key_info: helper (not shown) that renders KeyInformation as text
        context = f"""
        Key facts so far:
        {format_key_info(key_info)}

        Current task: {state["task"]}
        """
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=4096,
            messages=[{"role": "user", "content": context}]
        )
        return {"result": response.content[0].text}

This worked better:
    Key facts: 2,000 tokens
    Current task: 500 tokens
    System prompt: 800 tokens
    Total per call: 3,300 tokens

    Reduction from original: 96%
    New daily cost: $5/day
    New monthly cost: $150

But extraction itself costs tokens:

    Extraction call per workflow: 2,000 tokens input + 500 tokens output
    Cost per extraction: $0.03
    100 extractions/day = $3/day

    Still saving money, but extraction overhead adds up

Attempt 3: Implement smart caching
I realized many of my requests were similar. Why not cache responses?
    import hashlib
    from typing import Any, Dict, Optional

    class SemanticCache:
        # Despite the name, this version only does exact-match lookups;
        # similarity_threshold is stored for later fuzzy matching but unused here.
        def __init__(self, similarity_threshold: float = 0.95):
            self.cache: Dict[str, Any] = {}
            self.similarity_threshold = similarity_threshold

        def _hash_content(self, content: str) -> str:
            """Create hash of content for exact matching."""
            return hashlib.sha256(content.encode()).hexdigest()

        def get_exact(self, content: str) -> Optional[Any]:
            """Check for exact match in cache."""
            key = self._hash_content(content)
            return self.cache.get(key)

        def set(self, content: str, response: Any):
            """Store response in cache."""
            key = self._hash_content(content)
            self.cache[key] = response

    def research_agent_with_cache(
        state: AgentState,
        cache: SemanticCache
    ) -> AgentState:
        # Check cache first
        cached = cache.get_exact(state["task"])
        if cached is not None:
            # len() counts messages here, not tokens
            print(f"Cache hit! Skipped a call with {len(state['messages'])} messages")
            return {"result": cached}

        # Not in cache, make API call
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=4096,
            messages=[{"role": "user", "content": state["task"]}]
        )

        # Cache the result (store the text, not the content-block list)
        text = response.content[0].text
        cache.set(state["task"], text)
        return {"result": text}

Cache hit rates surprised me:
    Day 1: 5% cache hits (building cache)
    Day 7: 35% cache hits (cache warming up)
    Day 30: 60% cache hits (cache mature)

    With 60% cache hits:
    - 40 calls × 3,300 tokens = 132k tokens (actual API calls)
    - 60 calls × 0 tokens = 0 tokens (cached)
    - Daily token usage: 132k input
    - Daily cost: $2/day
    - Monthly cost: $60

Attempt 4: Optimize prompts
I was also wasting tokens on verbose prompts. I rewrote them to be concise:
    # BEFORE: Verbose prompt (2,400 tokens)
    VERBOSE_PROMPT = """You are an expert research assistant specializing in technology topics.
    Your role is to analyze the given topic and provide comprehensive insights.

    Please follow these steps:
    1. First, understand the core question being asked
    2. Research the topic thoroughly using available knowledge
    3. Identify key concepts, terms, and relationships
    4. Find relevant examples and case studies
    5. Synthesize findings into a clear, structured response

    Your response should:
    - Be well-organized with clear sections
    - Include specific examples where relevant
    - Cite sources when possible
    - Avoid unnecessary repetition
    - Be concise but comprehensive

    Format your response as:
    ## Overview
    [Brief summary of the topic]

    ## Key Points
    [Main findings and insights]

    ## Examples
    [Relevant examples and cases]

    ## Conclusion
    [Summary and recommendations]"""

    # AFTER: Concise prompt (180 tokens)
    CONCISE_PROMPT = """Research the topic. Output:
    1. Core findings (3-5 bullets)
    2. Key examples (2-3)
    3. Recommendations (1-2)

    Be specific. No filler."""

The results were dramatic:
    Verbose prompt: 2,400 tokens per call
    Concise prompt: 180 tokens per call
    Reduction: 92%

    And response quality? Same or better.
    Verbose avg response: 850 tokens
    Concise avg response: 320 tokens
    Total reduction: 85%

Attempt 5: Set termination conditions
My agents were running in infinite loops, like the “ghost agents” mentioned in the Reddit thread. I added proper termination:
    from dataclasses import dataclass
    from typing import Literal

    @dataclass
    class AgentResult:
        status: Literal["complete", "needs_more", "failed"]
        confidence: float  # 0.0 to 1.0
        iterations: int
        result: str

    @dataclass
    class AgentOutput:
        output: str
        confidence: float

    MAX_ITERATIONS = 5
    MIN_CONFIDENCE = 0.85

    def agent_loop(state: AgentState) -> AgentResult:
        iterations = 0
        confidence = 0.0

        while iterations < MAX_ITERATIONS:
            result = run_agent(state)
            iterations += 1  # count the run that just finished
            confidence = result.confidence

            if confidence >= MIN_CONFIDENCE:
                return AgentResult(
                    status="complete",
                    confidence=confidence,
                    iterations=iterations,
                    result=result.output
                )

            # Not confident enough, iterate
            # refine_state: helper (not shown) that folds the result back into the state
            state = refine_state(state, result)

        # Hit max iterations
        return AgentResult(
            status="needs_more",
            confidence=confidence,
            iterations=iterations,
            result="Max iterations reached"
        )

    def run_agent(state: AgentState) -> AgentOutput:
        """Single agent execution with confidence scoring."""
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            messages=state["messages"],
            system="Rate your confidence in this response (0-1)."
        )

        text = response.content[0].text
        # extract_confidence: helper (not shown) that parses the score from the text
        confidence = extract_confidence(text)
        return AgentOutput(
            output=text,
            confidence=confidence
        )

This prevented runaway loops:
    Before termination: Average 12 iterations per task
    After termination: Average 3.2 iterations per task

    Reduction: 73% fewer API calls
    Cost reduction: 73%

The reason
Why was I wasting so many tokens? Three core problems:
1. Context Window Bloat
I assumed bigger context = better results. But:
    Processing 100k tokens: $1.50
    Processing 10k tokens: $0.15
    Processing 1k tokens: $0.015

    The math: 100x cost difference for similar quality

Most tasks don’t need the entire conversation history. I was “slamming 128k context window” just because I could.
2. No Caching Strategy
I made the same API calls repeatedly without caching results:
    Query: "What is async/await in Python?"
    Response: [Same response every time]
    Cost: $0.15 per call × 100 calls = $15

    With caching: $0.15 × 1 call + $0 for 99 cached = $0.15
    Savings: 99%

3. Verbose Everything
Both my prompts and expected responses were unnecessarily long:
    My original prompt: 2,400 tokens
    Optimized prompt: 180 tokens

    Expected verbose response: 1,000 tokens
    Expected concise response: 300 tokens

    Total per call: 3,400 tokens → 480 tokens
    Reduction: 85%

Final solution
I combined all strategies into a token budget manager:
    from dataclasses import dataclass
    from typing import Dict, List, Optional
    import tiktoken

    @dataclass
    class TokenBudget:
        max_context: int = 4000        # Max context per call
        max_response: int = 1000       # Max response tokens
        cache_enabled: bool = True     # Enable semantic caching
        min_confidence: float = 0.85   # Early stopping threshold
        max_iterations: int = 3        # Max agent iterations

    class TokenBudgetManager:
        def __init__(self, budget: TokenBudget):
            self.budget = budget
            self.cache: Dict[str, str] = {}
            # tiktoken is OpenAI's tokenizer, so counts for Claude are only approximate
            self.encoding = tiktoken.encoding_for_model("gpt-4")

        def count_tokens(self, text: str) -> int:
            """Count tokens in text."""
            return len(self.encoding.encode(text))

        def trim_context(
            self,
            messages: List[Dict],
            max_tokens: int
        ) -> List[Dict]:
            """Trim messages to fit token budget."""
            result = []
            total_tokens = 0

            # Keep most recent messages that fit
            for msg in reversed(messages):
                msg_tokens = self.count_tokens(str(msg))
                if total_tokens + msg_tokens <= max_tokens:
                    result.insert(0, msg)
                    total_tokens += msg_tokens
                else:
                    break

            return result

        def check_cache(self, query: str) -> Optional[str]:
            """Check if query is cached."""
            if not self.budget.cache_enabled:
                return None
            return self.cache.get(query)

        def cache_response(self, query: str, response: str):
            """Cache query-response pair."""
            if self.budget.cache_enabled:
                self.cache[query] = response

        def should_stop(self, confidence: float, iterations: int) -> bool:
            """Check if agent should stop."""
            return (
                confidence >= self.budget.min_confidence or
                iterations >= self.budget.max_iterations
            )

    # Usage
    budget = TokenBudget(
        max_context=4000,
        max_response=1000,
        cache_enabled=True,
        min_confidence=0.85,
        max_iterations=3
    )

    manager = TokenBudgetManager(budget)

    def smart_agent_call(query: str, context: List[Dict]) -> str:
        # Check cache first
        cached = manager.check_cache(query)
        if cached is not None:
            return cached

        # Trim context to budget
        trimmed = manager.trim_context(
            context,
            budget.max_context - manager.count_tokens(query)
        )

        # Make API call
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=budget.max_response,
            messages=trimmed + [{"role": "user", "content": query}]
        )

        # response.content is a list of content blocks; cache and return the text
        text = response.content[0].text
        manager.cache_response(query, text)

        return text

Final results after implementing all optimizations:
    Original setup:
    - Context per call: 87,000 tokens
    - Daily calls: 85
    - Daily cost: $114
    - Monthly cost: $3,420

    Optimized setup:
    - Context per call: 3,500 tokens
    - Daily calls: 34 (60% cache hits)
    - Daily cost: $2.40
    - Monthly cost: $72

    Total reduction: 97.9%
    Annual savings: $40,176

Best practices
Based on my experience, here’s what I recommend:
1. Start with a token budget
    BUDGET = {
        "max_context": 4000,    # 4k context is enough for most tasks
        "max_response": 1000,   # Concise responses preferred
        "cache_ttl": 3600,      # 1 hour cache
        "max_iterations": 3     # Stop after 3 tries
    }

2. Trim context aggressively
    # WRONG: Keep everything
    messages = all_messages  # Could be 100k+ tokens

    # RIGHT: Keep only what's needed
    messages = trim_to_budget(all_messages, max_tokens=4000)

3. Cache everything
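    # Every similar query should hit cache
    # (assumes a cache with TTL support; the SemanticCache shown earlier has none)
    cache.set(query, response, ttl=3600)

    # Use semantic similarity for fuzzy matching
    # (find_similar is not implemented in the SemanticCache shown earlier)
    similar = cache.find_similar(query, threshold=0.95)

4. Use concise prompts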
    # WRONG: Essay-length prompt
    prompt = "You are an expert assistant who..."  # 2000+ tokens

    # RIGHT: Direct, specific prompt
    prompt = "Analyze. Output: 3 findings, 2 examples. Be specific."  # 100 tokens

5. Set termination conditions
    # Prevent infinite loops (these checks go inside the agent loop)
    if iterations >= MAX_ITERATIONS:
        break
    if confidence >= MIN_CONFIDENCE:
        break

Summary
In this post, I showed how I reduced my AI API token consumption by 97.9% (from $3,420/month to $72/month) by fixing wasteful patterns like context window bloat, missing cache strategies, and verbose prompts.
The key insight is that “ghost agents” running inefficient loops don’t just hurt your wallet—they waste API capacity for everyone. Responsible usage isn’t just ethical; it’s financially smart.
To optimize your token usage:
- Set explicit token budgets and trim context aggressively
- Implement semantic caching for similar queries
- Write concise, structured prompts
- Set proper termination conditions to prevent loops
- Monitor actual usage vs. expected usage
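
The last point is the one that would have caught my problem early. Here is a minimal sketch of such a monitor; the class name `UsageMonitor` and the `expected_daily_tokens` budget are hypothetical, and in a real run the counts would come from `response.usage.input_tokens` and `response.usage.output_tokens` on each Anthropic API response.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UsageMonitor:
    """Accumulates per-call token counts and compares them to a daily budget."""
    expected_daily_tokens: int
    calls: List[Tuple[int, int]] = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Store the token counts reported for one API call."""
        self.calls.append((input_tokens, output_tokens))

    @property
    def total_tokens(self) -> int:
        return sum(i + o for i, o in self.calls)

    def over_budget_ratio(self) -> float:
        """Actual usage divided by expected usage (1.0 means exactly on budget)."""
        return self.total_tokens / self.expected_daily_tokens


monitor = UsageMonitor(expected_daily_tokens=100_000)
# Hard-coded here; normally taken from response.usage after each call.
monitor.record(input_tokens=87_000, output_tokens=1_200)
monitor.record(input_tokens=87_000, output_tokens=1_200)
if monitor.over_budget_ratio() > 1.0:
    print(f"Over budget: {monitor.over_budget_ratio():.1f}x expected")
```

Checking the ratio after every call, rather than waiting for the monthly invoice, makes a jump like 68x visible on the first day.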
Your API bill will thank you.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit Discussion: OpenClaw and API Abuse
- Anthropic Token Counting Documentation
- Context Windows in LLMs
- Goodhart's Law
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!