
How to Reduce AI Agent Token Consumption and Control LLM Costs

Problem

I built an AI agent using LangChain to automate some research tasks. It worked great in testing. Then I deployed it and let it run overnight.

The next morning, I checked my OpenAI billing:

billing-shock.txt
Previous month: $12.45
Current month (12 hours): $847.23

The agent had consumed more tokens in 12 hours than my typical monthly usage. A simple research task turned into an expensive lesson.

Environment

  • Python 3.12
  • LangChain 0.1.x
  • OpenAI GPT-4 and GPT-3.5-turbo
  • tiktoken for token counting

What Happened

I checked the logs to understand where all those tokens went:

agent-logs.txt
[10:23:01] Agent started with task: "Research AI agent costs"
[10:23:05] LLM call #1 - 4,521 tokens (reasoning)
[10:23:08] LLM call #2 - 4,521 tokens (same context re-sent)
[10:23:12] LLM call #3 - 4,521 tokens (still same context)
...
[10:45:33] LLM call #847 - 4,521 tokens (context never shrinks)
[10:45:35] ERROR: Max iterations reached. Task incomplete.

The agent had three problems:

  1. No effective iteration limit - It kept looping until it hit a hardcoded maximum, 847 calls later
  2. Context bloat - Every LLM call included the full conversation history
  3. Model overkill - Using GPT-4 for simple formatting tasks
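To see how much a run like this burned, the per-call token counts in the log can be totalled with a few lines of Python. This is a minimal sketch that assumes the log lines follow the format shown above; `CALL_PATTERN` and `summarize_log` are illustrative names, not part of LangChain.

```python
import re

# Matches lines such as: [10:23:05] LLM call #1 - 4,521 tokens (reasoning)
CALL_PATTERN = re.compile(r"LLM call #(\d+) - ([\d,]+) tokens")


def summarize_log(lines):
    """Return (number_of_calls, total_tokens) for the LLM call lines found."""
    calls = 0
    total_tokens = 0
    for line in lines:
        match = CALL_PATTERN.search(line)
        if match is None:
            continue  # skip start-up, error, and elided lines
        calls += 1
        total_tokens += int(match.group(2).replace(",", ""))
    return calls, total_tokens


sample = [
    '[10:23:01] Agent started with task: "Research AI agent costs"',
    "[10:23:05] LLM call #1 - 4,521 tokens (reasoning)",
    "[10:23:08] LLM call #2 - 4,521 tokens (same context re-sent)",
    "[10:23:12] LLM call #3 - 4,521 tokens (still same context)",
]
calls, total = summarize_log(sample)
print(f"{calls} calls, {total} tokens")  # prints "3 calls, 13563 tokens"
```

A flat per-call count across hundreds of calls, as in my log, is a strong hint that the same context is being re-sent without the agent making progress.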

Solution

I implemented seven strategies to control token consumption.

Strategy 1: Model Tiering

Use cheaper models for simple tasks. Reserve expensive models for complex reasoning:

model_tiering.py
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI


class CostOptimizedAgent:
    def __init__(self):
        # Cheap model for routing and simple tasks, expensive model for complex reasoning
        self.cheap_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
        self.expensive_model = ChatOpenAI(model="gpt-4", temperature=0)
        self.token_budget = 100000  # per session

    def route_task(self, task: str, complexity: str):
        if complexity == "simple":
            return self.cheap_model.invoke([HumanMessage(content=task)])
        return self.expensive_model.invoke([HumanMessage(content=task)])

    def classify_complexity(self, task: str) -> str:
        prompt = f"Classify this task as 'simple' or 'complex'. Answer with one word: {task}"
        result = self.cheap_model.invoke([HumanMessage(content=prompt)])
        # The model may add quotes or punctuation, so normalize the answer and
        # fall back to "complex" whenever it is not exactly "simple"
        answer = result.content.strip().strip("'\".").lower()
        return "simple" if answer == "simple" else "complex"

Cost comparison for a typical agent workflow:

cost-comparison.txt
Before: 100 GPT-4 calls x ~1K tokens each @ $0.03/1K tokens = $3.00
After: 10 GPT-4 calls @ $0.03/1K + 90 GPT-3.5 calls @ $0.0005/1K = ~$0.35
Savings: ~88% reduction
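The comparison above only works out if each call uses roughly 1K tokens, which is an assumption on my part to keep the arithmetic simple. Here is a small sketch of the same calculation; `workflow_cost` is a hypothetical helper, and the rates are the input prices quoted above.

```python
def workflow_cost(calls_by_rate, tokens_per_call=1000):
    """Total USD cost for a list of (num_calls, usd_per_1k_tokens) pairs.

    Assumes every call uses the same number of tokens (1K by default).
    """
    return sum(n * rate * tokens_per_call / 1000 for n, rate in calls_by_rate)


before = workflow_cost([(100, 0.03)])               # everything on GPT-4
after = workflow_cost([(10, 0.03), (90, 0.0005)])   # 10% GPT-4, 90% GPT-3.5
savings = (before - after) / before

print(f"before=${before:.2f} after=${after:.3f} savings={savings:.1%}")
```

Changing `tokens_per_call` scales both totals equally, so the percentage saved depends only on the split between models and their price ratio.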

Strategy 2: Token Budget Guard

Track token usage in real-time:

budget_guard.py
import tiktoken


class TokenBudgetGuard:
    def __init__(self, max_tokens: int = 50000):
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self.encoding = tiktoken.encoding_for_model("gpt-4")

    def count_tokens(self, messages: list) -> int:
        total = 0
        for msg in messages:
            total += len(self.encoding.encode(msg.get("content", "")))
        return total

    def can_proceed(self, estimated_tokens: int) -> bool:
        return (self.used_tokens + estimated_tokens) <= self.max_tokens

    def record_usage(self, actual_tokens: int):
        self.used_tokens += actual_tokens
        if self.used_tokens > self.max_tokens * 0.8:
            print(f"WARNING: {self.used_tokens}/{self.max_tokens} tokens used")

Usage in the agent loop:

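agent_with_budget.py
budget_guard = TokenBudgetGuard(max_tokens=50000)

# task_complete, current_context, and call_llm are placeholders for your agent's own state
while not task_complete:
    estimated = budget_guard.count_tokens(current_context)
    if not budget_guard.can_proceed(estimated):
        print("Budget exceeded, stopping agent")
        break
    # Proceed with LLM call (call_llm stands in for a wrapper that returns
    # the response and the number of tokens it actually consumed)
    response, actual_tokens = call_llm(current_context)
    # Record real usage, otherwise the guard's running total never grows
    budget_guard.record_usage(actual_tokens)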

Strategy 3: Context Window Management

Don’t pass full history to every call:

context_manager.py
class ContextWindowManager:
    def __init__(self, max_context_tokens: int = 4000):
        self.max_tokens = max_context_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._prune_if_needed()

    def _prune_if_needed(self):
        if len(self.messages) <= 3:
            return
        # Rough estimate: about 4 characters per token
        total_tokens = sum(len(m["content"]) // 4 for m in self.messages)
        if total_tokens > self.max_tokens:
            # Keep first (system) and last 2 messages
            # Summarize everything in between
            middle_messages = self.messages[1:-2]
            summary = self._summarize_messages(middle_messages)
            self.messages = [
                self.messages[0],  # System prompt
                {"role": "assistant", "content": f"[Summary]: {summary}"},
                *self.messages[-2:],  # Last 2 messages
            ]

    def _summarize_messages(self, messages: list) -> str:
        # Placeholder: truncates to 500 characters rather than truly summarizing.
        # Swap in a call to a cheap model for a real summary.
        combined = " ".join(m["content"] for m in messages)
        return combined[:500] + "..."
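If summarizing feels like overkill, a plain sliding window is a simpler alternative: keep the system prompt and drop the oldest messages until the rest fits. This is a different technique from the summarization above, sketched here with the same rough 4-characters-per-token estimate; `trim_to_budget` is a hypothetical helper name.

```python
def trim_to_budget(messages, max_tokens, count_tokens=lambda text: len(text) // 4):
    """Keep the system prompt plus the newest messages that fit in max_tokens.

    messages: list of {"role": ..., "content": ...} dicts, system prompt first.
    count_tokens: token estimator; the default is a rough 4-chars-per-token guess.
    """
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    # The system prompt is always kept, even if it alone exceeds the budget
    remaining = max_tokens - count_tokens(system["content"])
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if cost > remaining:
            break  # everything older than this is dropped as well
        kept.append(msg)
        remaining -= cost
    return [system] + list(reversed(kept))


history = [
    {"role": "system", "content": "s" * 40},     # ~10 tokens
    {"role": "user", "content": "a" * 40},       # ~10 tokens, oldest
    {"role": "assistant", "content": "b" * 40},  # ~10 tokens
    {"role": "user", "content": "c" * 40},       # ~10 tokens, newest
]
trimmed = trim_to_budget(history, max_tokens=30)
print([m["content"][0] for m in trimmed])  # prints ['s', 'b', 'c']
```

The trade-off is that dropped messages are gone for good, so this suits agents whose recent steps matter far more than early ones.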

Strategy 4: Iteration Limits

Hard cap on agent loops:

iteration_limiter.py
class AgentWithLimits:
    def __init__(self, max_iterations: int = 10):
        self.max_iterations = max_iterations
        self.iteration_count = 0

    def run(self, task: str):
        while self.iteration_count < self.max_iterations:
            self.iteration_count += 1
            result = self.step(task)
            if result.is_complete:
                return result.output
            # Pause for approval before risky actions
            if result.needs_human_input:
                user_input = input("Approve action? (y/n): ")
                if user_input.lower() != 'y':
                    return "Cancelled by user"
        return f"Max iterations ({self.max_iterations}) reached."

    def step(self, task: str):
        # Single agent step. Must return an object with is_complete,
        # output, and needs_human_input attributes.
        raise NotImplementedError

Strategy 5: Prompt Caching

Use provider caching features:

prompt_caching.py
from anthropic import Anthropic

client = Anthropic()

# Claude supports prompt caching for repeated system prompts.
# Prompts shorter than the model's minimum cacheable length
# (1,024 tokens for Claude 3.5 Sonnet) are processed without caching,
# so the short system prompt here stands in for a much longer one.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a research assistant...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Research AI agent costs"}
    ],
)
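It is worth checking that caching actually kicks in, since a prompt that never hits the cache saves nothing. This is a minimal sketch, assuming the response's `usage` object exposes `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` as described in Anthropic's prompt caching documentation. `cache_hit_ratio` is a hypothetical helper, and the `SimpleNamespace` below only imitates a real response so the example runs offline.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage):
    """Fraction of input tokens served from the prompt cache (0.0 to 1.0).

    Assumes usage has input_tokens (uncached), cache_creation_input_tokens
    (written to the cache), and cache_read_input_tokens (read from the cache).
    Missing or None fields are treated as 0.
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + created + uncached
    return read / total if total else 0.0


# Stand-in for response.usage on a call whose system prompt was already cached
fake_usage = SimpleNamespace(
    input_tokens=100,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=900,
)
print(f"cache hit ratio: {cache_hit_ratio(fake_usage):.0%}")  # prints "cache hit ratio: 90%"
```

With a real client you would pass `response.usage` instead of the stand-in. A ratio stuck at zero across repeated calls suggests the prompt is too short to cache or changes between calls.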

Strategy 6: Response Caching

Cache identical queries:

response_cache.py
import hashlib
import json


class ResponseCache:
    def __init__(self):
        self.cache = {}

    def _hash_query(self, messages: list) -> str:
        # messages must be JSON-serializable, e.g. {"role": ..., "content": ...} dicts
        content = json.dumps(messages, sort_keys=True)
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, messages: list):
        key = self._hash_query(messages)
        return self.cache.get(key)

    def set(self, messages: list, response: str):
        key = self._hash_query(messages)
        self.cache[key] = response


# Usage
cache = ResponseCache()


def cached_llm_call(messages: list):
    cached = cache.get(messages)
    # Compare against None so an empty-string response still counts as a hit
    if cached is not None:
        return cached
    # llm is your chat model instance, e.g. the ChatOpenAI object from Strategy 1
    response = llm.invoke(messages)
    cache.set(messages, response.content)
    return response.content

Strategy 7: Monitoring Dashboard

Real-time cost visibility:

monitoring.py
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TokenUsage:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    task_id: str


class CostMonitor:
    def __init__(self):
        self.usage_log = []

    def log_call(self, model: str, input_tokens: int, output_tokens: int, task_id: str):
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        usage = TokenUsage(
            timestamp=datetime.now(),
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            task_id=task_id,
        )
        self.usage_log.append(usage)
        # Alert if daily budget exceeded
        today = datetime.now().date()
        daily_spend = sum(u.cost_usd for u in self.usage_log if u.timestamp.date() == today)
        if daily_spend > 10.0:  # $10 daily limit
            print(f"ALERT: Daily spend ${daily_spend:.2f} exceeds $10 limit")

    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # Prices are USD per 1K tokens; unknown models fall back to a default rate
        pricing = {
            "gpt-4": {"input": 0.03, "output": 0.06},
            "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
            "claude-3-sonnet": {"input": 0.003, "output": 0.015},
        }
        rates = pricing.get(model, {"input": 0.01, "output": 0.03})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000
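Because every log entry carries a `task_id`, spend can also be attributed per task, which makes a single runaway task easy to spot. This is a minimal sketch: `cost_by_task` is a hypothetical helper, and `Usage` is a cut-down stand-in for the `TokenUsage` record so the example runs on its own.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for TokenUsage with only the fields needed here."""
    task_id: str
    cost_usd: float


def cost_by_task(usage_log):
    """Sum cost_usd per task_id, most expensive task first."""
    totals = defaultdict(float)
    for entry in usage_log:
        totals[entry.task_id] += entry.cost_usd
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


log = [
    Usage(task_id="research", cost_usd=0.5),
    Usage(task_id="formatting", cost_usd=0.25),
    Usage(task_id="research", cost_usd=1.0),
]
for task_id, cost in cost_by_task(log):
    print(f"{task_id}: ${cost:.2f}")
```

The same function should work on `CostMonitor.usage_log` directly, since it only reads the `task_id` and `cost_usd` attributes.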

Reason

The root cause of runaway token consumption is architectural. Agents operate autonomously, making decisions without human oversight. Without guardrails, they can:

  • Enter infinite retry loops when errors occur
  • Chain reasoning steps unnecessarily
  • Include excessive context in every call
  • Use expensive models for trivial tasks
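The first failure mode, unbounded retries, can be contained with a small wrapper that gives up after a fixed number of attempts. This is a sketch under my own assumptions: `call_with_retries` is a hypothetical helper, and the attempt count and delays are illustrative defaults rather than recommended values.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, retry with exponential backoff.

    Re-raises the last exception once max_attempts is reached, so a
    persistent error stops the agent instead of looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Example: a call that fails twice, then succeeds (sleep is stubbed out
# so the example finishes instantly)
state = {"calls": 0}


def flaky_llm_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient error")
    return "ok"


print(call_with_retries(flaky_llm_call, sleep=lambda seconds: None))  # prints "ok"
```

Every retry re-sends the full prompt, so each attempt should still go through the token budget guard before the call is made.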

The Reddit community describes this as agents “burning through your token quota like a teenage girl in Sephora with her dad’s credit card.” One user reported a simple “hi” input consumed massive tokens and still timed out.

Summary

After implementing these strategies, my agent’s token consumption dropped by 85%:

results.txt
Before optimization:
- Daily tokens: ~500K
- Daily cost: ~$30
- Frequent timeout errors
After optimization:
- Daily tokens: ~75K
- Daily cost: ~$4.50
- No timeout errors

Key takeaways:

  1. Model tiering - Use GPT-3.5/Haiku for routing, GPT-4/Opus for complex reasoning
  2. Budget guards - Track tokens in real-time, stop before limits
  3. Context pruning - Summarize history instead of passing everything
  4. Iteration limits - Hard cap at 10-20 iterations with graceful degradation
  5. Caching - Cache prompts and responses
  6. Monitoring - Real-time dashboards with per-task attribution

Always test with cost limits before production deployment. Budget 3-5x your expected usage initially.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
