How to Reduce AI Agent Token Consumption and Control LLM Costs
Problem
I built an AI agent using LangChain to automate some research tasks. It worked great in testing. Then I deployed it and let it run overnight.
The next morning, I checked my OpenAI billing:
Previous month: $12.45
Current month (12 hours): $847.23
The agent had consumed more tokens in 12 hours than my typical monthly usage. A simple research task turned into an expensive lesson.
Environment
- Python 3.12
- LangChain 0.1.x
- OpenAI GPT-4 and GPT-3.5-turbo
- tiktoken for token counting
What Happened
I checked the logs to understand where all those tokens went:
[10:23:01] Agent started with task: "Research AI agent costs"
[10:23:05] LLM call #1 - 4,521 tokens (reasoning)
[10:23:08] LLM call #2 - 4,521 tokens (same context re-sent)
[10:23:12] LLM call #3 - 4,521 tokens (still same context)
...
[10:45:33] LLM call #847 - 4,521 tokens (context never shrinks)
[10:45:35] ERROR: Max iterations reached. Task incomplete.
The agent had three problems:
- No effective iteration limit - The only cap was a hardcoded maximum, which the agent did not hit until call #847
- Context bloat - Every LLM call included the full conversation history
- Model overkill - Using GPT-4 for simple formatting tasks
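To put the context bloat in numbers, the log excerpt can be turned into a quick calculation. This is only a rough sketch: the call count and tokens per call come from the log, while billing every token at the GPT-4 input rate of $0.03 per 1K is an assumption made here for illustration.

```python
# Figures taken from the log excerpt; the price is an assumption.
CALLS = 847
TOKENS_PER_CALL = 4_521
GPT4_INPUT_PRICE_PER_1K = 0.03  # USD, assumed: every token billed at the input rate


def resend_cost(calls: int, tokens_per_call: int, price_per_1k: float) -> tuple[int, float]:
    """Return (total tokens, cost in USD) when every call re-sends the same context."""
    total_tokens = calls * tokens_per_call
    return total_tokens, total_tokens * price_per_1k / 1000


total_tokens, cost = resend_cost(CALLS, TOKENS_PER_CALL, GPT4_INPUT_PRICE_PER_1K)
print(f"{total_tokens:,} tokens, about ${cost:.2f}")
# prints: 3,829,287 tokens, about $114.88
```

Under these assumptions, a single failed task accounts for roughly $115, which is why the problems in the list are worth fixing one by one.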
Solution
I implemented seven strategies to control token consumption.
Strategy 1: Model Tiering
Use cheaper models for simple tasks. Reserve expensive models for complex reasoning:
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

class CostOptimizedAgent:
    def __init__(self):
        self.cheap_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
        self.expensive_model = ChatOpenAI(model="gpt-4", temperature=0)
        self.token_budget = 100000  # per session

    def route_task(self, task: str, complexity: str):
        # Only tasks classified as "simple" go to the cheap model
        model = self.cheap_model if complexity == "simple" else self.expensive_model
        return model.invoke([HumanMessage(content=task)])

    def classify_complexity(self, task: str) -> str:
        prompt = f"Classify this task as 'simple' or 'complex': {task}"
        result = self.cheap_model.invoke([HumanMessage(content=prompt)])
        # The reply may include quotes or punctuation, so match on the
        # keyword instead of returning the raw text
        return "simple" if "simple" in result.content.lower() else "complex"

Cost comparison for a typical agent workflow:
Before: 100 GPT-4 calls @ $0.03/1K tokens = $3.00
After: 10 GPT-4 calls @ $0.03/1K + 90 GPT-3.5 calls @ $0.0005/1K = $0.35
Savings: 88% reduction
(These figures assume roughly 1K tokens per call.)
Strategy 2: Token Budget Guard
Track token usage in real-time:
import tiktoken

class TokenBudgetGuard:
    def __init__(self, max_tokens: int = 50000):
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self.encoding = tiktoken.encoding_for_model("gpt-4")

    def count_tokens(self, messages: list) -> int:
        # Counts message content only; per-message overhead is ignored,
        # so treat the result as a lower bound
        total = 0
        for msg in messages:
            total += len(self.encoding.encode(msg.get("content", "")))
        return total

    def can_proceed(self, estimated_tokens: int) -> bool:
        return (self.used_tokens + estimated_tokens) <= self.max_tokens

    def record_usage(self, actual_tokens: int):
        self.used_tokens += actual_tokens
        # Warn once 80% of the budget is gone
        if self.used_tokens > self.max_tokens * 0.8:
            print(f"WARNING: {self.used_tokens}/{self.max_tokens} tokens used")

Usage in the agent loop:
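budget_guard = TokenBudgetGuard(max_tokens=50000)

# task_complete and current_context are maintained by the agent loop
while not task_complete:
    # Estimate the next call from the context about to be sent
    estimated = budget_guard.count_tokens(current_context)
    if not budget_guard.can_proceed(estimated):
        print("Budget exceeded, stopping agent")
        break
    # Proceed with LLM call, then record what it actually used:
    # budget_guard.record_usage(actual_tokens)
Strategy 3: Context Window Management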
Don’t pass full history to every call:
class ContextWindowManager:
    def __init__(self, max_context_tokens: int = 4000):
        self.max_tokens = max_context_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._prune_if_needed()

    def _prune_if_needed(self):
        if len(self.messages) <= 3:
            return

        # Rough estimate: about 4 characters per token
        total_tokens = sum(len(m["content"]) // 4 for m in self.messages)

        if total_tokens > self.max_tokens:
            # Keep first (system) and last 2 messages
            # Summarize everything in between
            middle_messages = self.messages[1:-2]
            summary = self._summarize_messages(middle_messages)

            self.messages = [
                self.messages[0],  # System prompt
                {"role": "assistant", "content": f"[Summary]: {summary}"},
                *self.messages[-2:]  # Last 2 messages
            ]

    def _summarize_messages(self, messages: list) -> str:
        # Placeholder: this truncates to 500 characters rather than summarizing
        combined = " ".join(m["content"] for m in messages)
        return combined[:500] + "..."
Strategy 4: Iteration Limits
Hard cap on agent loops:
class AgentWithLimits:
    def __init__(self, max_iterations: int = 10):
        self.max_iterations = max_iterations
        self.iteration_count = 0

    def run(self, task: str):
        while self.iteration_count < self.max_iterations:
            self.iteration_count += 1
            result = self.step(task)

            if result.is_complete:
                return result.output

            if result.needs_human_input:
                user_input = input("Approve action? (y/n): ")
                if user_input.lower() != 'y':
                    return "Cancelled by user"

        return f"Max iterations ({self.max_iterations}) reached."

    def step(self, task: str):
        # Single agent step; must return an object with is_complete,
        # output, and needs_human_input attributes
        raise NotImplementedError
Strategy 5: Prompt Caching
Use provider caching features:
from anthropic import Anthropic
client = Anthropic()
# Claude supports prompt caching for repeated system prompts
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a research assistant...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Research AI agent costs"}
    ]
)
Strategy 6: Response Caching
Cache identical queries:
import hashlib
import json

class ResponseCache:
    def __init__(self):
        self.cache = {}

    def _hash_query(self, messages: list) -> str:
        # messages must be JSON-serializable, e.g. a list of role/content dicts
        content = json.dumps(messages, sort_keys=True)
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, messages: list):
        key = self._hash_query(messages)
        return self.cache.get(key)

    def set(self, messages: list, response: str):
        key = self._hash_query(messages)
        self.cache[key] = response

# Usage
cache = ResponseCache()

def cached_llm_call(messages: list):
    cached = cache.get(messages)
    # Compare against None so an empty cached response still counts as a hit
    if cached is not None:
        return cached

    # llm is the chat model created elsewhere
    response = llm.invoke(messages)
    cache.set(messages, response.content)
    return response.content
Strategy 7: Monitoring Dashboard
Real-time cost visibility:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TokenUsage:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    task_id: str

class CostMonitor:
    def __init__(self):
        self.usage_log = []

    def log_call(self, model: str, input_tokens: int, output_tokens: int, task_id: str):
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        usage = TokenUsage(
            timestamp=datetime.now(),
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            task_id=task_id
        )
        self.usage_log.append(usage)

        # Alert if daily budget exceeded
        daily_spend = sum(u.cost_usd for u in self.usage_log
                          if u.timestamp.date() == datetime.now().date())
        if daily_spend > 10.0:  # $10 daily limit
            print(f"ALERT: Daily spend ${daily_spend:.2f} exceeds $10 limit")

    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # Prices in USD per 1K tokens
        pricing = {
            "gpt-4": {"input": 0.03, "output": 0.06},
            "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
            "claude-3-sonnet": {"input": 0.003, "output": 0.015}
        }
        # Unknown models fall back to a default rate
        rates = pricing.get(model, {"input": 0.01, "output": 0.03})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000
Reason
The root cause of runaway token consumption is architectural. Agents operate autonomously, making decisions without human oversight. Without guardrails, they can:
- Enter infinite retry loops when errors occur
- Chain reasoning steps unnecessarily
- Include excessive context in every call
- Use expensive models for trivial tasks
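The first item, retry loops, is the easiest to bound. A minimal sketch of a capped retry wrapper follows; the name call_with_retry_cap, its defaults, and the exponential backoff schedule are assumptions chosen for illustration and are not part of LangChain or any provider SDK.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry_cap(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception, but never more than max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            # Back off exponentially (1s, 2s, 4s, ...) before the next attempt
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    # Give up loudly instead of looping and paying for more calls
    raise RuntimeError(f"Gave up after {max_attempts} attempts") from last_error
```

Wrapping each LLM call in something like this means a persistent error costs at most max_attempts calls, however long the agent is left running. The sleep parameter is injectable only so the wrapper can be tested without real delays.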
The Reddit community describes this as agents “burning through your token quota like a teenage girl in Sephora with her dad’s credit card.” One user reported a simple “hi” input consumed massive tokens and still timed out.
Summary
After implementing these strategies, my agent’s token consumption dropped by 85%:
Before optimization:
- Daily tokens: ~500K
- Daily cost: ~$30
- Frequent timeout errors
After optimization:
- Daily tokens: ~75K
- Daily cost: ~$4.50
- No timeout errors
Key takeaways:
- Model tiering - Use GPT-3.5/Haiku for routing, GPT-4/Opus for complex reasoning
- Budget guards - Track tokens in real-time, stop before limits
- Context pruning - Summarize history instead of passing everything
- Iteration limits - Hard cap at 10-20 iterations with graceful degradation
- Caching - Cache prompts and responses
- Monitoring - Real-time dashboards with per-task attribution
Always test with cost limits before production deployment. Budget 3-5x your expected usage initially.
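One way to enforce a cost limit during testing is a hard cap denominated in dollars that stops the run instead of only printing a warning. The sketch below is an illustration under stated assumptions: SpendCap and SpendLimitExceeded are hypothetical names, prices are passed in per 1K tokens, and the example limit is 3x the ~$4.50 daily cost quoted above.

```python
class SpendLimitExceeded(RuntimeError):
    """Raised once a run has cost more than its configured cap."""


class SpendCap:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(
        self,
        input_tokens: int,
        output_tokens: int,
        input_price_per_1k: float,
        output_price_per_1k: float,
    ) -> float:
        """Record the cost of a finished call; raise if the cap is now exceeded."""
        cost = (input_tokens * input_price_per_1k
                + output_tokens * output_price_per_1k) / 1000
        # The call has already been billed, so always record it first
        self.spent_usd += cost
        if self.spent_usd > self.limit_usd:
            raise SpendLimitExceeded(
                f"Spent ${self.spent_usd:.2f}, cap is ${self.limit_usd:.2f}"
            )
        return cost


# Assumed example: cap a test run at 3x the expected daily cost
cap = SpendCap(limit_usd=3 * 4.50)
cap.charge(4_521, 0, input_price_per_1k=0.03, output_price_per_1k=0.06)
```

Because charge raises rather than logs, an agent loop that does not catch the exception stops at the first call that pushes spending past the cap.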
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 Reddit: OpenClaw token consumption discussion
- 👨‍💻 OpenAI Pricing
- 👨‍💻 Anthropic Claude Pricing
- 👨‍💻 LangChain Documentation
- 👨‍💻 tiktoken Library
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!