How to Reduce AI Agent Token Consumption and Control LLM Costs

Mar 17, 2026

Problem

I built an AI agent using LangChain to automate some research tasks. It worked great in testing. Then I deployed it and let it run overnight.

The next morning, I checked my OpenAI billing:

Previous month: $12.45
Current month (12 hours): $847.23

The agent had consumed more tokens in 12 hours than my typical monthly usage. A simple research task turned into an expensive lesson.

Environment

Python 3.12
LangChain 0.1.x
OpenAI GPT-4 and GPT-3.5-turbo
tiktoken for token counting

What Happened

I checked the logs to understand where all those tokens went:

[10:23:01] Agent started with task: "Research AI agent costs"
[10:23:05] LLM call #1 - 4,521 tokens (reasoning)
[10:23:08] LLM call #2 - 4,521 tokens (same context re-sent)
[10:23:12] LLM call #3 - 4,521 tokens (still same context)
...
[10:45:33] LLM call #847 - 4,521 tokens (context never shrinks)
[10:45:35] ERROR: Max iterations reached. Task incomplete.

The agent had three problems:

No practical iteration limit - It kept looping until it hit a hardcoded maximum that was far too high (847 calls in the log above)
Context bloat - Every LLM call included the full conversation history
Model overkill - Using GPT-4 for simple formatting tasks

Solution

I implemented seven strategies to control token consumption.

Strategy 1: Model Tiering

Use cheaper models for simple tasks. Reserve expensive models for complex reasoning:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

class CostOptimizedAgent:
    def __init__(self):
        self.cheap_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
        self.expensive_model = ChatOpenAI(model="gpt-4", temperature=0)
        self.token_budget = 100000  # per session

    def route_task(self, task: str, complexity: str):
        # Simple tasks go to the cheap model, everything else to GPT-4
        model = self.cheap_model if complexity == "simple" else self.expensive_model
        return model.invoke([HumanMessage(content=task)])

    def classify_complexity(self, task: str) -> str:
        # The classification itself runs on the cheap model
        prompt = f"Classify this task as 'simple' or 'complex'. Reply with one word: {task}"
        result = self.cheap_model.invoke([HumanMessage(content=prompt)])
        # Normalize replies such as "Simple." so route_task's equality check works;
        # anything unexpected falls back to "complex"
        answer = result.content.strip().lower().strip("'\".")
        return "simple" if answer == "simple" else "complex"

Cost comparison for a typical agent workflow, assuming about 1K tokens per call:

Before: 100 GPT-4 calls @ $0.03/1K tokens = $3.00
After:  10 GPT-4 calls @ $0.03/1K + 90 GPT-3.5 calls @ $0.0005/1K = $0.35
Savings: 88% reduction
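The arithmetic behind that comparison can be reproduced with a few lines of Python. This is only a sketch: the prices are the ones quoted above, and the 1,000 tokens per call is an assumption implied by the $3.00 figure, not a measured value.

```python
def workflow_cost(calls: int, tokens_per_call: int, price_per_1k: float) -> float:
    """Cost in USD for `calls` calls of `tokens_per_call` tokens each."""
    return calls * tokens_per_call / 1000 * price_per_1k

GPT4_PRICE = 0.03       # USD per 1K tokens, from the comparison above
GPT35_PRICE = 0.0005    # USD per 1K tokens, from the comparison above
TOKENS_PER_CALL = 1000  # assumption implied by the $3.00 figure

before = workflow_cost(100, TOKENS_PER_CALL, GPT4_PRICE)
after = (workflow_cost(10, TOKENS_PER_CALL, GPT4_PRICE)
         + workflow_cost(90, TOKENS_PER_CALL, GPT35_PRICE))
savings = 1 - after / before

print(f"before=${before:.2f} after=${after:.3f} savings={savings:.1%}")
```

Changing the split between cheap and expensive calls shows how sensitive the savings are to how many tasks really need the larger model.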

Strategy 2: Token Budget Guard

Track token usage in real-time:

import tiktoken

class TokenBudgetGuard:
    def __init__(self, max_tokens: int = 50000):
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self.encoding = tiktoken.encoding_for_model("gpt-4")

    def count_tokens(self, messages: list) -> int:
        total = 0
        for msg in messages:
            total += len(self.encoding.encode(msg.get("content", "")))
        return total

    def can_proceed(self, estimated_tokens: int) -> bool:
        return (self.used_tokens + estimated_tokens) <= self.max_tokens

    def record_usage(self, actual_tokens: int):
        self.used_tokens += actual_tokens
        if self.used_tokens > self.max_tokens * 0.8:
            print(f"WARNING: {self.used_tokens}/{self.max_tokens} tokens used")

Usage in the agent loop:

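budget_guard = TokenBudgetGuard(max_tokens=50000)

while not task_complete:
    # current_context: list of {"role": ..., "content": ...} dicts
    estimated = budget_guard.count_tokens(current_context)
    if not budget_guard.can_proceed(estimated):
        print("Budget exceeded, stopping agent")
        break
    # Proceed with LLM call, then record the tokens it actually used
    # (prefer the provider-reported count when it is available)
    budget_guard.record_usage(estimated)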
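To see the guard stop a runaway loop without spending anything, here is a self-contained sketch that swaps the real model for a stub. Everything in it is hypothetical: `run_with_budget`, `stuck_model`, and the 4-characters-per-token estimate are illustrative stand-ins, not part of LangChain or tiktoken.

```python
from typing import Callable

def estimate_tokens(messages: list[dict]) -> int:
    """Rough estimate: about 4 characters per token (an approximation)."""
    return sum(len(m.get("content", "")) for m in messages) // 4

def run_with_budget(call_llm: Callable[[list[dict]], str],
                    context: list[dict],
                    max_tokens: int,
                    max_steps: int = 50) -> tuple[int, int]:
    """Call `call_llm` until it replies "DONE", the budget runs out, or max_steps.

    Returns (steps_taken, tokens_used)."""
    used = 0
    steps = 0
    while steps < max_steps:
        estimated = estimate_tokens(context)
        if used + estimated > max_tokens:
            break  # stop before the call that would exceed the budget
        reply = call_llm(context)
        steps += 1
        used += estimated + len(reply) // 4  # input estimate plus output estimate
        if reply == "DONE":
            break
        context.append({"role": "assistant", "content": reply})
    return steps, used

# Hypothetical stand-in for a real model: it never finishes the task.
def stuck_model(messages: list[dict]) -> str:
    return "thinking..." * 10

steps, used = run_with_budget(
    stuck_model, [{"role": "user", "content": "x" * 400}], max_tokens=1000
)
print(steps, used)
```

Because the context grows with every reply, each step costs more than the last, so the budget is exhausted after only a handful of calls rather than hundreds.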

Strategy 3: Context Window Management

Don’t pass full history to every call:

class ContextWindowManager:
    def __init__(self, max_context_tokens: int = 4000):
        self.max_tokens = max_context_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._prune_if_needed()

    def _prune_if_needed(self):
        if len(self.messages) <= 3:
            return

        # Rough estimate: about 4 characters per token for English text
        total_tokens = sum(len(m["content"]) // 4 for m in self.messages)

        if total_tokens > self.max_tokens:
            # Keep first (system) and last 2 messages
            # Summarize everything in between
            middle_messages = self.messages[1:-2]
            summary = self._summarize_messages(middle_messages)

            self.messages = [
                self.messages[0],  # System prompt
                {"role": "assistant", "content": f"[Summary]: {summary}"},
                *self.messages[-2:]  # Last 2 messages
            ]

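    def _summarize_messages(self, messages: list) -> str:
        # Placeholder: this truncates rather than summarizes. Swap in a
        # cheap-model summarization call if the middle of the history matters.
        combined = " ".join(m["content"] for m in messages)
        return combined[:500] + "..."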
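If summarization is not needed, a simpler alternative is a sliding window that keeps the system prompt and as many of the newest messages as fit. Note this is a different technique from the summary approach above. The sketch below assumes the same role/content dict format and the same rough 4-characters-per-token estimate; `trim_to_budget` is a hypothetical helper name.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the first (system) message plus the newest messages that fit."""
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    budget = max_tokens - estimate_tokens(system["content"])
    kept: list[dict] = []
    for msg in reversed(rest):              # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break                           # everything older is dropped
        kept.append(msg)
        budget -= cost
    return [system] + kept[::-1]            # restore chronological order

# Illustrative history: one system prompt and five 100-token user messages
history = [{"role": "system", "content": "s" * 40}] + [
    {"role": "user", "content": str(i) * 400} for i in range(5)
]
trimmed = trim_to_budget(history, max_tokens=250)
print([m["content"][0] for m in trimmed])
```

The trade-off is that dropped messages are gone entirely, so this suits tasks where only recent steps matter.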

Strategy 4: Iteration Limits

Hard cap on agent loops:

class AgentWithLimits:
    def __init__(self, max_iterations: int = 10):
        self.max_iterations = max_iterations
        self.iteration_count = 0

    def run(self, task: str):
        self.iteration_count = 0  # reset so the cap applies to each run
        while self.iteration_count < self.max_iterations:
            self.iteration_count += 1
            result = self.step(task)

            if result.is_complete:
                return result.output

            if result.needs_human_input:
                user_input = input("Approve action? (y/n): ")
                if user_input.lower() != 'y':
                    return "Cancelled by user"

        return f"Max iterations ({self.max_iterations}) reached."

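    def step(self, task: str):
        # Single agent step. Must return an object with is_complete,
        # needs_human_input, and output attributes.
        raise NotImplementedError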
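The cap is easy to exercise with stub steps. The following is a minimal, self-contained sketch of the same idea as a plain function. `StepResult`, `run_capped`, and the two stub steps are hypothetical names for illustration, not LangChain classes.

```python
from dataclasses import dataclass

@dataclass
class StepResult:               # hypothetical result type
    is_complete: bool
    output: str = ""

def run_capped(step, task: str, max_iterations: int = 10) -> str:
    """Call step(task, i) until it reports completion or the cap is hit."""
    for i in range(1, max_iterations + 1):
        result = step(task, i)
        if result.is_complete:
            return result.output
    return f"Max iterations ({max_iterations}) reached."

# A step that finishes on its third attempt
def finishes_on_third(task: str, i: int) -> StepResult:
    return StepResult(is_complete=(i == 3), output=f"done after {i} steps")

# A step that never finishes, like the runaway agent in the logs
def never_finishes(task: str, i: int) -> StepResult:
    return StepResult(is_complete=False)

print(run_capped(finishes_on_third, "Research AI agent costs"))
print(run_capped(never_finishes, "Research AI agent costs", max_iterations=5))
```

The second call returns a message instead of looping forever, which is the graceful degradation the cap is meant to provide.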

Strategy 5: Prompt Caching

Use provider caching features:

from anthropic import Anthropic

client = Anthropic()

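# Claude supports prompt caching for repeated system prompts.
# Note: short prompts are not cached; the cached prefix must meet the model's
# minimum cacheable length (1024 tokens for this model).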
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a research assistant...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Research AI agent costs"}
    ]
)
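To judge whether caching pays off, it helps to estimate the input cost from the counters the API returns on `response.usage` (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`). The sketch below is an estimate only: the 1.25x write and 0.10x read multipliers and the $3 per million token base price are assumptions that should be checked against the current Anthropic pricing page, and `effective_input_cost` is a hypothetical helper.

```python
def effective_input_cost(input_tokens: int,
                         cache_write_tokens: int,
                         cache_read_tokens: int,
                         base_price_per_mtok: float,
                         write_multiplier: float = 1.25,   # assumed cache-write premium
                         read_multiplier: float = 0.10) -> float:  # assumed cache-read discount
    """Estimated USD cost of the input side of one request with prompt caching."""
    per_token = base_price_per_mtok / 1_000_000
    return per_token * (input_tokens
                        + cache_write_tokens * write_multiplier
                        + cache_read_tokens * read_multiplier)

BASE = 3.0  # USD per million input tokens (assumption)

# Illustrative numbers: a 2,000-token system prompt and a 50-token user message
uncached = effective_input_cost(2050, 0, 0, BASE)
first_call = effective_input_cost(50, 2000, 0, BASE)    # writes the cache
later_call = effective_input_cost(50, 0, 2000, BASE)    # reads the cache

print(f"uncached={uncached:.5f} first={first_call:.5f} later={later_call:.5f}")
```

Under these assumptions the first call costs slightly more than an uncached one, and every later call within the cache lifetime costs a fraction of it, so caching helps most when the same prefix is reused many times.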

Strategy 6: Response Caching

Cache identical queries:

import hashlib
import json

class ResponseCache:
    def __init__(self):
        self.cache = {}

    def _hash_query(self, messages: list) -> str:
        content = json.dumps(messages, sort_keys=True)
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, messages: list):
        key = self._hash_query(messages)
        return self.cache.get(key)

    def set(self, messages: list, response: str):
        key = self._hash_query(messages)
        self.cache[key] = response

# Usage
cache = ResponseCache()

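def cached_llm_call(messages: list):
    # messages must be JSON-serializable, e.g. a list of role/content dicts
    cached = cache.get(messages)
    if cached is not None:  # an empty string is still a valid cached reply
        return cached

    response = llm.invoke(messages)  # llm: any chat model, e.g. ChatOpenAI
    cache.set(messages, response.content)
    return response.content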
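A quick way to confirm the cache actually saves calls is to count invocations against a stub. This sketch is self-contained and uses a hypothetical `fake_llm` in place of a paid model; it only makes sense for deterministic settings such as temperature=0, where identical inputs should give identical answers.

```python
import hashlib
import json

call_log: list[str] = []   # records every time the "model" is really called

def fake_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for a paid model call."""
    call_log.append(messages[-1]["content"])
    return f"answer to: {messages[-1]['content']}"

_cache: dict[str, str] = {}

def cached_call(messages: list[dict]) -> str:
    key = hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()
    if key not in _cache:   # membership test, so empty replies are cached too
        _cache[key] = fake_llm(messages)
    return _cache[key]

question = [{"role": "user", "content": "Research AI agent costs"}]
first = cached_call(question)
second = cached_call(question)                                        # served from cache
third = cached_call([{"role": "user", "content": "Something else"}])  # new key

print(len(call_log))
```

Three requests result in two real calls, because the repeated question never reaches the model.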

Strategy 7: Monitoring Dashboard

Real-time cost visibility:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class TokenUsage:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    task_id: str

class CostMonitor:
    def __init__(self):
        self.usage_log = []

    def log_call(self, model: str, input_tokens: int, output_tokens: int, task_id: str):
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        usage = TokenUsage(
            timestamp=datetime.now(),
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            task_id=task_id
        )
        self.usage_log.append(usage)

        # Alert if daily budget exceeded
        daily_spend = sum(u.cost_usd for u in self.usage_log if u.timestamp.date() == datetime.now().date())
        if daily_spend > 10.0:  # $10 daily limit
            print(f"ALERT: Daily spend ${daily_spend:.2f} exceeds $10 limit")

    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # USD per 1K tokens; check the provider pricing pages for current rates
        pricing = {
            "gpt-4": {"input": 0.03, "output": 0.06},
            "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
            "claude-3-sonnet": {"input": 0.003, "output": 0.015}
        }
        rates = pricing.get(model, {"input": 0.01, "output": 0.03})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000
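Since each record carries a task_id, the log can also answer which task is spending the money. The sketch below shows one possible aggregation; `Usage` is a simplified stand-in for the TokenUsage record above, and the sample figures are illustrative, not measurements.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Usage:                # simplified stand-in for the TokenUsage record
    task_id: str
    cost_usd: float

def cost_by_task(log: list[Usage]) -> dict[str, float]:
    """Sum cost per task_id."""
    totals: dict[str, float] = defaultdict(float)
    for entry in log:
        totals[entry.task_id] += entry.cost_usd
    return dict(totals)

# Illustrative log entries
log = [Usage("research", 0.12), Usage("format", 0.01), Usage("research", 0.30)]
totals = cost_by_task(log)
top_task = max(totals, key=totals.get)

print(totals, top_task)
```

Sorting these totals is usually enough to spot the one task responsible for most of the bill.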

Reason

The root cause of runaway token consumption is architectural. Agents operate autonomously, making decisions without human oversight. Without guardrails, they can:

Enter infinite retry loops when errors occur
Chain reasoning steps unnecessarily
Include excessive context in every call
Use expensive models for trivial tasks
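The first item, retry loops, is the easiest to guard against: put a hard cap on attempts and back off between them. Here is a minimal sketch, assuming a generic callable rather than any specific provider client; `call_with_retries` is a hypothetical helper, and real code should catch the provider's own error types instead of a bare Exception.

```python
import time

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.5,
                      sleep=time.sleep):
    """Call fn(); on failure retry with exponential backoff, at most max_attempts times."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as err:   # narrow this to specific error types in real code
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error

# Demo: a call that fails twice with a transient error, then succeeds
attempts: list[int] = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "ok"

result = call_with_retries(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, len(attempts))
```

A call that keeps failing raises after the last attempt instead of retrying forever, so the failure surfaces to the caller rather than to the billing page.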

The Reddit community describes this as agents “burning through your token quota like a teenage girl in Sephora with her dad’s credit card.” One user reported a simple “hi” input consumed massive tokens and still timed out.

Summary

After implementing these strategies, my agent’s token consumption dropped by 85%:

Before optimization:
- Daily tokens: ~500K
- Daily cost: ~$30
- Frequent timeout errors

After optimization:
- Daily tokens: ~75K
- Daily cost: ~$4.50
- No timeout errors

Key takeaways:

Model tiering - Use GPT-3.5/Haiku for routing, GPT-4/Opus for complex reasoning
Budget guards - Track tokens in real-time, stop before limits
Context pruning - Summarize history instead of passing everything
Iteration limits - Hard cap at 10-20 iterations with graceful degradation
Caching - Cache prompts and responses
Monitoring - Real-time dashboards with per-task attribution

Always test with cost limits before production deployment. Budget 3-5x your expected usage initially.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit: OpenClaw token consumption discussion
👨‍💻 OpenAI Pricing
👨‍💻 Anthropic Claude Pricing
👨‍💻 LangChain Documentation
👨‍💻 tiktoken Library

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!