How I Cut My AI API Token Usage by 75%

Mar 11, 2026

Problem

My AI API bill was exploding. I was spending $400/month on tokens, and I couldn’t figure out why. Then I discovered my agent was “slamming 128k context window” into every single API call, even for simple questions:

User: What's 2+2?
Agent: [Loads 128,000 tokens of context]
       [Processes entire context]
       [Returns: 4]
       [Cost: $0.40 per call]

Total daily cost: $40+ for trivial questions

I saw a Reddit thread about OpenClaw users doing exactly this—running “millions of ghost agents, running 24/7” that waste API resources for everyone. But I wasn’t running millions of agents. I was just… inefficient.

Environment

Python 3.11 with LangGraph
Anthropic Claude API
Daily agent runs for task automation
Average 50-100 API calls per day
Context window: 200k tokens
Expected cost: ~$50/month
Actual cost: $400+/month

What happened?

I built an agent system to automate content creation. Each agent handles a different task: research, writing, editing, publishing. The architecture looked reasonable:

Research Agent → Writer Agent → Editor Agent → Publisher Agent
      ↓               ↓              ↓              ↓
   Context A       Context B      Context C      Context D

But I made a critical mistake. I passed the entire conversation history to each agent:

import anthropic
from langgraph.graph import StateGraph
from typing import TypedDict, List, Any

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

class AgentState(TypedDict):
    messages: List[Any]  # Full conversation history
    task: str
    result: str

def research_agent(state: AgentState) -> AgentState:
    # Receives ENTIRE conversation history
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        messages=state["messages"]  # All 50+ previous messages
    )
    # response.content is a list of content blocks; the text lives on the first one
    return {"result": response.content[0].text}

def writer_agent(state: AgentState) -> AgentState:
    # Also receives ENTIRE conversation history
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        messages=state["messages"]  # Still all 50+ messages
    )
    return {"result": response.content[0].text}

# Each agent processes the same 50k+ tokens
# 4 agents × 50k tokens = 200k tokens per workflow
# 100 workflows/day = 20M tokens/day
# 20M tokens × $0.015/1k = $300/day

I profiled my actual usage:

Average context per call: 87,000 tokens
Average response per call: 1,200 tokens
Number of calls per day: 85
Daily token usage: 7.4M input + 102k output
Daily cost: $111 (input) + $3 (output) = $114/day
Monthly cost: ~$3,400

But wait... I only process 100 simple tasks!

I was paying 68x more than expected because every agent re-processed the entire conversation history.
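
A profile like the one above can be collected with a small tracker. This is a minimal sketch under stated assumptions: `UsageTracker` and its method names are hypothetical, and the price is passed in as a parameter rather than looked up. With the Anthropic SDK, the per-call numbers would come from `response.usage.input_tokens` and `response.usage.output_tokens`.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    """Accumulates token counts across API calls (hypothetical helper)."""
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # In real use these values would be read from response.usage
        self.calls += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def input_cost(self, usd_per_1k: float) -> float:
        """Input cost in USD at the given price per 1,000 tokens."""
        return self.input_tokens / 1000 * usd_per_1k


tracker = UsageTracker()
for _ in range(85):                # 85 calls per day, as in the profile above
    tracker.record(87_000, 1_200)  # Average context and response sizes

daily_input_cost = tracker.input_cost(usd_per_1k=0.015)
```

Feeding in the averages from the profile gives roughly 7.4M input tokens and about $111 of input cost per day, which matches the figures above.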

How to solve it?

I tried several approaches to reduce token waste.

Attempt 1: Truncate conversation history

First, I tried limiting the history to recent messages:

def trim_context(messages: List[Any], max_messages: int = 10) -> List[Any]:
    """Keep only the most recent messages."""
    return messages[-max_messages:]

def research_agent(state: AgentState) -> AgentState:
    trimmed = trim_context(state["messages"], max_messages=10)
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        messages=trimmed  # Only 10 messages now
    )
    return {"result": response.content[0].text}

This reduced my usage:

Before: 87,000 tokens per call
After:  15,000 tokens per call

Reduction: 82%
New daily cost: $20/day
New monthly cost: $600

But I hit a new problem: agents lost important context from early messages.

Research Agent: "Found 5 sources about Python async"
[10 messages later]
Writer Agent: "What sources? I don't see any sources mentioned."
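
The failure is easy to reproduce offline. The sketch below uses a made-up 12-message history (the message contents are illustrative, not taken from my logs) and the same slicing logic as `trim_context`:

```python
from typing import Any, Dict, List


def trim_context(messages: List[Dict[str, Any]], max_messages: int = 10) -> List[Dict[str, Any]]:
    """Keep only the most recent messages (same slicing as above)."""
    return messages[-max_messages:]


# Hypothetical history: the research finding is the very first message
history = [{"role": "assistant", "content": "Found 5 sources about Python async"}]
history += [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"step {i}"}
    for i in range(11)
]

trimmed = trim_context(history, max_messages=10)

# True only if some surviving message still mentions the sources
sources_survived = any("sources" in m["content"] for m in trimmed)
```

Here `sources_survived` ends up `False`: the finding sits outside the 10-message window, so the Writer Agent never sees it.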

Attempt 2: Extract key information

Instead of keeping full messages, I extracted and stored key facts:

from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class KeyInformation:
    facts: Dict[str, str]  # Key findings
    decisions: List[str]   # Decisions made
    sources: Set[str]      # Reference URLs
    constraints: List[str] # Requirements

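def extract_key_info(messages: List[Any]) -> KeyInformation:
    """Extract essential information from conversation."""
    extraction_prompt = """
    Extract key information from the conversation below:
    - Important facts discovered
    - Decisions made
    - Source URLs mentioned
    - Constraints and requirements

    Return as structured JSON.
    """
    # Flatten the history so the model actually sees the conversation
    # (assumes each message is a dict with "role" and "content" keys)
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
    # Use a smaller model for extraction
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Cheaper model
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{extraction_prompt}\n{transcript}"}]
    )
    # response.content is a list of content blocks; parse the text of the first
    return parse_extraction(response.content[0].text)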

def research_agent(state: AgentState, key_info: KeyInformation) -> AgentState:
    # Include only key facts + current task
    context = f"""
    Key facts so far:
    {format_key_info(key_info)}

    Current task: {state["task"]}
    """
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        messages=[{"role": "user", "content": context}]
    )
    return {"result": response.content[0].text}

This worked better:

Key facts: 2,000 tokens
Current task: 500 tokens
System prompt: 800 tokens
Total per call: 3,300 tokens

Reduction from original: 96%
New daily cost: $5/day
New monthly cost: $150

But extraction itself costs tokens:

Extraction call per workflow: 2,000 tokens input + 500 tokens output
Cost per extraction: $0.03
100 extractions/day = $3/day

Still saving money, but extraction overhead adds up
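
The snippets in this attempt call `parse_extraction` and `format_key_info` without showing them. Here is one way they could look. It is a sketch, not a definitive implementation: it assumes the model replies with a bare JSON object whose keys match the `KeyInformation` fields, and the dataclass is repeated with defaults so the block runs on its own.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class KeyInformation:
    facts: Dict[str, str] = field(default_factory=dict)
    decisions: List[str] = field(default_factory=list)
    sources: Set[str] = field(default_factory=set)
    constraints: List[str] = field(default_factory=list)


def parse_extraction(raw: str) -> KeyInformation:
    """Build KeyInformation from the model's JSON reply.

    Assumes `raw` is a JSON object; missing keys fall back to empty containers.
    """
    data = json.loads(raw)
    return KeyInformation(
        facts=dict(data.get("facts", {})),
        decisions=list(data.get("decisions", [])),
        sources=set(data.get("sources", [])),
        constraints=list(data.get("constraints", [])),
    )


def format_key_info(info: KeyInformation) -> str:
    """Render the key facts as a compact plain-text block for the prompt."""
    lines = [f"- {name}: {value}" for name, value in info.facts.items()]
    lines += [f"- Decision: {d}" for d in info.decisions]
    lines += [f"- Source: {s}" for s in sorted(info.sources)]  # sorted for stable output
    lines += [f"- Constraint: {c}" for c in info.constraints]
    return "\n".join(lines)


raw_reply = '{"facts": {"topic": "Python async"}, "sources": ["https://example.com/a"]}'
info = parse_extraction(raw_reply)
summary = format_key_info(info)
```

If the model wraps its JSON in extra prose, `json.loads` raises an error, so a real pipeline would probably need to catch that and retry or fall back.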

Attempt 3: Implement smart caching

I realized many of my requests were similar. Why not cache responses?

import hashlib
from typing import Any, Dict, Optional

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.cache: Dict[str, Any] = {}
        # Reserved for fuzzy matching; the methods below only do exact matches
        self.similarity_threshold = similarity_threshold

    def _hash_content(self, content: str) -> str:
        """Create hash of content for exact matching."""
        return hashlib.sha256(content.encode()).hexdigest()

    def get_exact(self, content: str) -> Optional[Any]:
        """Check for exact match in cache."""
        key = self._hash_content(content)
        return self.cache.get(key)

    def set(self, content: str, response: Any):
        """Store response in cache."""
        key = self._hash_content(content)
        self.cache[key] = response

def research_agent_with_cache(
    state: AgentState,
    cache: SemanticCache
) -> AgentState:
    # Check cache first
    cached = cache.get_exact(state["task"])
    if cached is not None:
        print("Cache hit! Skipped an API call")
        return {"result": cached}

    # Not in cache, make API call
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        messages=[{"role": "user", "content": state["task"]}]
    )

    # Cache the reply text (response.content is a list of content blocks)
    result = response.content[0].text
    cache.set(state["task"], result)
    return {"result": result}

Cache hit rates surprised me:

Day 1: 5% cache hits (building cache)
Day 7: 35% cache hits (cache warming up)
Day 30: 60% cache hits (cache mature)

With 60% cache hits:
- 40 calls × 3,300 tokens = 132k tokens (actual API calls)
- 60 calls × 0 tokens = 0 tokens (cached)
- Daily token usage: 132k input
- Daily cost: $2/day
- Monthly cost: $60
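
The cache above stores a `similarity_threshold` but only ever does exact hash lookups. A near-match lookup could be sketched with the standard library's `difflib`. To be clear about the assumptions: this compares characters, not meaning (true semantic matching would need embeddings), and `FuzzyCache`, `normalize`, and `find_similar` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variations match."""
    return " ".join(query.lower().split())


class FuzzyCache:
    """Cache with a lexical near-match lookup (hypothetical helper)."""

    def __init__(self, similarity_threshold: float = 0.95):
        self.entries: Dict[str, str] = {}
        self.similarity_threshold = similarity_threshold

    def set(self, query: str, response: str) -> None:
        self.entries[normalize(query)] = response

    def find_similar(self, query: str) -> Optional[str]:
        """Return the best cached response at or above the threshold, else None."""
        key = normalize(query)
        if key in self.entries:  # Exact match after normalization
            return self.entries[key]
        best_score, best_response = 0.0, None
        for cached_key, response in self.entries.items():
            score = SequenceMatcher(None, key, cached_key).ratio()
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.similarity_threshold:
            return best_response
        return None


cache = FuzzyCache()
cache.set("What is async/await in Python?", "cached answer")
hit = cache.find_similar("what is  async/await in python")  # case, spacing, punctuation differ
miss = cache.find_similar("How do I sort a list?")
```

The lookup scans every entry, which is fine for a few hundred cached queries but would need an index for anything larger.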

Attempt 4: Optimize prompts

I was also wasting tokens on verbose prompts. I rewrote them to be concise:

# BEFORE: Verbose prompt (2,400 tokens)
VERBOSE_PROMPT = """
You are an expert research assistant specializing in technology topics.
Your role is to analyze the given topic and provide comprehensive insights.

Please follow these steps:
1. First, understand the core question being asked
2. Research the topic thoroughly using available knowledge
3. Identify key concepts, terms, and relationships
4. Find relevant examples and case studies
5. Synthesize findings into a clear, structured response

Your response should:
- Be well-organized with clear sections
- Include specific examples where relevant
- Cite sources when possible
- Avoid unnecessary repetition
- Be concise but comprehensive

Format your response as:
## Overview
[Brief summary of the topic]

## Key Points
[Main findings and insights]

## Examples
[Relevant examples and cases]

## Conclusion
[Summary and recommendations]
"""

# AFTER: Concise prompt (180 tokens)
CONCISE_PROMPT = """
Research the topic. Output:
1. Core findings (3-5 bullets)
2. Key examples (2-3)
3. Recommendations (1-2)

Be specific. No filler.
"""

The results were dramatic:

Verbose prompt: 2,400 tokens per call
Concise prompt: 180 tokens per call
Reduction: 92%

And response quality? Same or better.
Verbose avg response: 850 tokens
Concise avg response: 320 tokens
Total reduction: 85%
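
For completeness, this is roughly how the concise prompt can be wired into a request. It is a sketch: `build_request` is a hypothetical helper and `max_tokens=512` is an illustrative cap, not a measured value. It relies on the `system` parameter of the Messages API, so the user turn carries only the task.

```python
from typing import Any, Dict

CONCISE_PROMPT = """Research the topic. Output:
1. Core findings (3-5 bullets)
2. Key examples (2-3)
3. Recommendations (1-2)

Be specific. No filler."""


def build_request(task: str, max_tokens: int = 512) -> Dict[str, Any]:
    """Assemble keyword arguments for client.messages.create()."""
    return {
        "model": "claude-3-opus-20240229",
        "max_tokens": max_tokens,       # Hard cap on response length
        "system": CONCISE_PROMPT,       # Instructions live here, not in the user turn
        "messages": [{"role": "user", "content": task}],
    }


request = build_request("Python async/await")
# response = client.messages.create(**request)  # needs an API key and network access
```

Lowering `max_tokens` only caps the reply; the shorter responses reported above come from the prompt asking for less.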

Attempt 5: Set termination conditions

My agents were running in infinite loops, like the “ghost agents” mentioned in the Reddit thread. I added proper termination:

from dataclasses import dataclass
from typing import Literal

@dataclass
class AgentResult:
    status: Literal["complete", "needs_more", "failed"]
    confidence: float  # 0.0 to 1.0
    iterations: int
    result: str

@dataclass
class AgentOutput:
    output: str
    confidence: float  # 0.0 to 1.0

MAX_ITERATIONS = 5
MIN_CONFIDENCE = 0.85

def agent_loop(state: AgentState) -> AgentResult:
    iterations = 0
    confidence = 0.0  # Last confidence seen; stays 0.0 if the loop never runs

    while iterations < MAX_ITERATIONS:
        result = run_agent(state)
        iterations += 1  # Count every API call, including the successful one
        confidence = result.confidence

        if confidence >= MIN_CONFIDENCE:
            return AgentResult(
                status="complete",
                confidence=confidence,
                iterations=iterations,
                result=result.output
            )

        # Not confident enough, iterate
        # (refine_state is assumed to be defined elsewhere)
        state = refine_state(state, result)

    # Hit max iterations
    return AgentResult(
        status="needs_more",
        confidence=confidence,
        iterations=iterations,
        result="Max iterations reached"
    )

def run_agent(state: AgentState) -> AgentOutput:
    """Single agent execution with confidence scoring."""
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=state["messages"],
        system="Rate your confidence in this response (0-1)."
    )

    # response.content is a list of content blocks; use the text of the first
    text = response.content[0].text
    confidence = extract_confidence(text)
    return AgentOutput(
        output=text,
        confidence=confidence
    )

This prevented runaway loops:

Before termination: Average 12 iterations per task
After termination: Average 3.2 iterations per task

Reduction: 73% fewer API calls
Cost reduction: 73%
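
The loop depends on `extract_confidence`, which is not shown above. One possible version is sketched below. It assumes the model ends its reply with a line such as `Confidence: 0.9`; that format is an assumption, so the regex would need adjusting to whatever your prompt actually asks for.

```python
import re

# Matches "confidence: 0.9", "Confidence = 1", and similar (assumed reply format)
_CONFIDENCE_RE = re.compile(r"confidence\s*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_confidence(text: str, default: float = 0.0) -> float:
    """Parse a self-reported confidence score from the reply text.

    Returns `default` when no score is found, so a malformed reply
    never triggers an early stop.
    """
    match = _CONFIDENCE_RE.search(text)
    if match is None:
        return default
    # Clamp to [0, 1] in case the model reports something like 1.5
    return max(0.0, min(1.0, float(match.group(1))))


score = extract_confidence("Async lets you overlap I/O.\nConfidence: 0.9")
missing = extract_confidence("No score in this reply")
```

A self-reported score is only a rough signal, so the `MAX_ITERATIONS` cap remains the real safety net.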

The reason

Why was I wasting so many tokens? Three core problems:

1. Context Window Bloat

I assumed bigger context = better results. But:

Processing 100k tokens: $1.50
Processing 10k tokens: $0.15
Processing 1k tokens: $0.015

The math: 100x cost difference for similar quality

Most tasks don’t need the entire conversation history. I was “slamming 128k context window” just because I could.

2. No Caching Strategy

I made the same API calls repeatedly without caching results:

Query: "What is async/await in Python?"
Response: [Same response every time]
Cost: $0.15 per call × 100 calls = $15

With caching: $0.15 × 1 call + $0 for 99 cached = $0.15
Savings: 99%

3. Verbose Everything

Both my prompts and expected responses were unnecessarily long:

My original prompt: 2,400 tokens
Optimized prompt: 180 tokens

Expected verbose response: 1,000 tokens
Expected concise response: 300 tokens

Total per call: 3,400 tokens → 480 tokens
Reduction: 86%

Final solution

I combined all strategies into a token budget manager:

from dataclasses import dataclass
from typing import Dict, List, Optional
import tiktoken

@dataclass
class TokenBudget:
    max_context: int = 4000      # Max context per call
    max_response: int = 1000      # Max response tokens
    cache_enabled: bool = True    # Enable response caching (exact match)
    min_confidence: float = 0.85  # Early stopping threshold
    max_iterations: int = 3       # Max agent iterations

class TokenBudgetManager:
    def __init__(self, budget: TokenBudget):
        self.budget = budget
        self.cache: Dict[str, str] = {}
        # tiktoken is OpenAI's tokenizer, so counts for Claude are only approximate
        self.encoding = tiktoken.encoding_for_model("gpt-4")

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.encoding.encode(text))

    def trim_context(
        self,
        messages: List[Dict],
        max_tokens: int
    ) -> List[Dict]:
        """Trim messages to fit token budget."""
        result = []
        total_tokens = 0

        # Keep most recent messages that fit
        for msg in reversed(messages):
            msg_tokens = self.count_tokens(str(msg))
            if total_tokens + msg_tokens <= max_tokens:
                result.insert(0, msg)
                total_tokens += msg_tokens
            else:
                break

        return result

    def check_cache(self, query: str) -> Optional[str]:
        """Check if query is cached."""
        if not self.budget.cache_enabled:
            return None
        return self.cache.get(query)

    def cache_response(self, query: str, response: str):
        """Cache query-response pair."""
        if self.budget.cache_enabled:
            self.cache[query] = response

    def should_stop(self, confidence: float, iterations: int) -> bool:
        """Check if agent should stop."""
        return (
            confidence >= self.budget.min_confidence or
            iterations >= self.budget.max_iterations
        )

# Usage
budget = TokenBudget(
    max_context=4000,
    max_response=1000,
    cache_enabled=True,
    min_confidence=0.85,
    max_iterations=3
)

manager = TokenBudgetManager(budget)

def smart_agent_call(query: str, context: List[Dict]) -> str:
    # Check cache first
    cached = manager.check_cache(query)
    if cached:
        return cached

    # Trim context to budget
    trimmed = manager.trim_context(
        context,
        budget.max_context - manager.count_tokens(query)
    )

    # Make API call
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=budget.max_response,
        messages=trimmed + [{"role": "user", "content": query}]
    )

    # Cache the reply text (response.content is a list of content blocks)
    answer = response.content[0].text
    manager.cache_response(query, answer)

    return answer
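
One caveat about the manager: it counts tokens with `tiktoken`, which is built for OpenAI models, so the numbers are estimates for Claude. The Anthropic SDK has a token counting endpoint that can give exact input counts. The sketch below assumes an SDK version that exposes `client.messages.count_tokens()` and a model that supports it. `count_input_tokens` is a hypothetical wrapper, and the fake client exists only so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def count_input_tokens(client: Any, model: str, messages: List[Dict[str, str]]) -> int:
    """Ask the API how many input tokens `messages` would use for `model`.

    Assumes `client` behaves like anthropic.Anthropic(); the real call
    needs network access and an API key.
    """
    result = client.messages.count_tokens(model=model, messages=messages)
    return result.input_tokens


# Offline stand-in that mimics the shape of the SDK call (returns a fixed count)
fake_client = SimpleNamespace(
    messages=SimpleNamespace(
        count_tokens=lambda model, messages: SimpleNamespace(input_tokens=14)
    )
)

n_tokens = count_input_tokens(
    fake_client,
    "claude-3-opus-20240229",
    [{"role": "user", "content": "Hello"}],
)
```

Counting this way costs a round trip per check, so a local estimate for trimming plus an occasional exact count for monitoring may be the more practical split.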

Final results after implementing all optimizations:

Original setup:
- Context per call: 87,000 tokens
- Daily calls: 85
- Daily cost: $114
- Monthly cost: $3,420

Optimized setup:
- Context per call: 3,500 tokens
- Daily calls: 34 (60% cache hits)
- Daily cost: $2.40
- Monthly cost: $72

Total reduction: 97.9%
Annual savings: $40,176
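
The headline numbers follow directly from the two monthly figures. A quick check, with `percent_reduction` as a hypothetical helper name:

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


monthly_before = 3420  # USD, original setup
monthly_after = 72     # USD, optimized setup

reduction = percent_reduction(monthly_before, monthly_after)
annual_savings = (monthly_before - monthly_after) * 12
```

This gives a reduction of about 97.9% and annual savings of $40,176, matching the totals above.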

Best practices

Based on my experience, here’s what I recommend:

1. Start with a token budget

BUDGET = {
    "max_context": 4000,    # 4k context is enough for most tasks
    "max_response": 1000,    # Concise responses preferred
    "cache_ttl": 3600,      # 1 hour cache
    "max_iterations": 3      # Stop after 3 tries
}

2. Trim context aggressively

# WRONG: Keep everything
messages = all_messages  # Could be 100k+ tokens

# RIGHT: Keep only what's needed
messages = trim_to_budget(all_messages, max_tokens=4000)
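
`trim_to_budget` is not defined anywhere above, so here is one way it could look. Treat it as a sketch: the word-count tokenizer is a crude stand-in you would swap for a real counter, and dropping leading assistant turns reflects my understanding that the Messages API expects the conversation to open with a user message.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_count(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_to_budget(
    messages: List[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_count,
) -> List[Message]:
    """Keep the newest messages that fit, then drop leading assistant turns."""
    kept: List[Message] = []
    total = 0
    for msg in reversed(messages):  # Walk from newest to oldest
        cost = count_tokens(msg["content"])
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    kept.reverse()  # Restore chronological order
    # Assumption: the conversation should start with a user turn
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six seven"},
    {"role": "user", "content": "eight nine"},
    {"role": "assistant", "content": "ten"},
]
trimmed = trim_to_budget(history, max_tokens=6)
```

With a budget of 6 "tokens", the three newest messages fit, and the leading assistant turn is then dropped, leaving the last user and assistant pair.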

3. Cache everything

# Every similar query should hit cache
cache.set(query, response, ttl=3600)

# Use semantic similarity for fuzzy matching
similar = cache.find_similar(query, threshold=0.95)
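
The `ttl=3600` argument implies a cache whose entries expire, which the cache class from Attempt 3 does not do. A minimal sketch of that behavior, assuming exact-match keys and a hypothetical `TTLCache` class with an injectable clock so it can be exercised without waiting an hour:

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Exact-match cache whose entries expire (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # query -> (expiry, response)

    def set(self, query: str, response: str, ttl: float = 3600) -> None:
        self._entries[query] = (self._clock() + ttl, response)

    def get(self, query: str) -> Optional[str]:
        entry = self._entries.get(query)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[query]  # Expired: evict and report a miss
            return None
        return response


now = [0.0]  # Fake clock so the example is deterministic
cache = TTLCache(clock=lambda: now[0])
cache.set("What is async/await?", "cached answer", ttl=3600)
fresh = cache.get("What is async/await?")

now[0] = 3601.0  # Just past the one-hour TTL
stale = cache.get("What is async/await?")
```

The first lookup returns the cached answer and the second returns `None`. A short TTL suits answers that go stale; for stable facts a longer one saves more calls.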

4. Use concise prompts

# WRONG: Essay-length prompt
prompt = "You are an expert assistant who..."  # 2000+ tokens

# RIGHT: Direct, specific prompt
prompt = "Analyze. Output: 3 findings, 2 examples. Be specific."  # 100 tokens

5. Set termination conditions

# Prevent infinite loops (these checks run inside the agent loop)
if iterations >= MAX_ITERATIONS:
    break
if confidence >= MIN_CONFIDENCE:
    break

Summary

In this post, I showed how I reduced my AI API token consumption by 97.9% (from $3,420/month to $72/month) by fixing wasteful patterns like context window bloat, missing cache strategies, and verbose prompts.

The key insight is that “ghost agents” running inefficient loops don’t just hurt your wallet—they waste API capacity for everyone. Responsible usage isn’t just ethical; it’s financially smart.

To optimize your token usage:

Set explicit token budgets and trim context aggressively
Implement semantic caching for similar queries
Write concise, structured prompts
Set proper termination conditions to prevent loops
Monitor actual usage vs. expected usage

Your API bill will thank you.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can email me.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Reddit Discussion: OpenClaw and API Abuse
👨‍💻 Anthropic Token Counting Documentation
👨‍💻 Context Windows in LLMs
👨‍💻 Goodhart's Law

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!