Why Your AI API Costs Explode: Context Window Management Mistakes

Problem

I bought $10 of Anthropic API credits and burned through them in one sitting with simple chatting. No complex tasks, no code generation—just basic conversation. The model? Haiku, their cheapest option.

Starting balance: $10.00
Model: Claude 3 Haiku (cheapest)
Usage: Simple chatting
Result: Empty balance in hours

This made no sense. Haiku costs $0.25 per million input tokens. Even at 100k tokens per conversation (a massive amount), that’s only $0.025. How did I burn $10?

Environment

  • Claude 3 Haiku API
  • Input pricing: $0.25/million tokens
  • Output pricing: $1.25/million tokens
  • Usage pattern: Extended conversations, 50+ messages
  • Expected cost: Under $1
  • Actual cost: $10+ in hours

What happened?

I assumed each message cost tokens based only on its content:

My mental model:
Message 1: 500 tokens → cost for 500 tokens
Message 2: 500 tokens → cost for 500 tokens
Message 50: 500 tokens → cost for 500 tokens
Expected total: 50 messages × 500 tokens = 25,000 tokens
Expected cost at $0.25/M tokens = $0.00625 (less than a penny)

But my balance kept dropping faster than expected. So I checked the actual API usage:

Actual API behavior:
Message 1: 500 tokens sent
Message 2: 1000 tokens sent (500 history + 500 new)
Message 3: 1500 tokens sent (1000 history + 500 new)
...
Message 50: 25,000 tokens sent (24,500 history + 500 new)

Each message resends the ENTIRE conversation history. The API doesn’t remember anything—I was paying to resend everything every single time.

The math that shocked me

Let me calculate the actual tokens sent for my 50-message conversation:

Token accumulation calculation
Message 1: 500 tokens
Message 2: 500 + 500 = 1,000 tokens
Message 3: 500 + 500 + 500 = 1,500 tokens
...
Message 50: 500 × 50 = 25,000 tokens
Total tokens SENT to API:
500 + 1000 + 1500 + ... + 25000 = 637,500 tokens

I expected to send 25,000 tokens total. I actually sent 637,500 tokens.

Cost comparison
Naive expectation: 25,000 tokens × $0.25/M = $0.00625
Actual cost: 637,500 tokens × $0.25/M = $0.16 per conversation
At 10 conversations/day × $0.16 = $1.60/day
Extended sessions with more messages = $10+ easily

That is about 25x more input tokens than I expected (637,500 / 25,000 = 25.5). This is the hidden cost of context window accumulation.
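The sum above has a closed form: n messages of t tokens each add up to t × n × (n + 1) / 2 input tokens. Here is a minimal sketch for checking the numbers, assuming every message is the same size and counting input tokens only. The function names are mine and not part of any SDK.

```python
def total_input_tokens(num_messages: int, tokens_per_message: int) -> int:
    """Tokens sent over a whole conversation when each request resends the history.

    Request k carries k * tokens_per_message tokens, so the total is
    tokens_per_message * (1 + 2 + ... + n) = tokens_per_message * n * (n + 1) / 2.
    """
    return tokens_per_message * num_messages * (num_messages + 1) // 2


def input_cost_usd(tokens: int, price_per_million: float = 0.25) -> float:
    """Input cost in dollars at a given price per million tokens."""
    return tokens * price_per_million / 1_000_000


naive = 50 * 500                      # what I thought I was sending
actual = total_input_tokens(50, 500)  # what was actually sent
print(actual, input_cost_usd(actual), actual / naive)
```

Because the total grows with the square of the message count, doubling the length of a conversation roughly quadruples its input cost.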

Why does this happen?

I dug into how LLM APIs actually work:

API request structure
POST /v1/messages
{
  "model": "claude-3-haiku-20240307",
  "max_tokens": 1024,
  "messages": [
    {"role": "user", "content": "First message"},
    {"role": "assistant", "content": "First response"},
    {"role": "user", "content": "Second message"},
    {"role": "assistant", "content": "Second response"},
    {"role": "user", "content": "Current message"}
  ]
}

Every API call includes ALL previous messages. There’s no server-side memory. The “context window” is just the conversation history I send with every request.

This architecture makes sense for stateless APIs—no database to maintain, no session management complexity. But it creates a hidden cost multiplier that most developers don’t realize until they see their bill.
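To make the statelessness concrete, here is a small sketch of a client-side chat loop. It is an illustration rather than real SDK code: `fake_send` is a hypothetical stand-in for the API call, and all it does is record how many messages each request carried.

```python
from typing import Callable

Message = dict[str, str]


def chat_turn(history: list[Message], user_text: str,
              send: Callable[[list[Message]], str]) -> str:
    """Append the user turn, send the FULL history, and record the reply."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)  # the whole list goes over the wire every time
    history.append({"role": "assistant", "content": reply})
    return reply


# Hypothetical stand-in for a real API call: it only records request sizes.
request_sizes: list[int] = []


def fake_send(messages: list[Message]) -> str:
    request_sizes.append(len(messages))
    return f"reply {len(request_sizes)}"


history: list[Message] = []
for text in ["First message", "Second message", "Current message"]:
    chat_turn(history, text, fake_send)

print(request_sizes)  # [1, 3, 5]
```

The client owns the history list, so the client also decides how much of it to send. Every fix later in this post is a way of shrinking that list before the next request.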

Here’s the cost accumulation pattern:

Cost per message over time
Message # | Context Size  | Cost This Message | Cumulative Cost
----------|---------------|-------------------|----------------
1         | 500 tokens    | $0.000125         | $0.000125
10        | 5,000 tokens  | $0.00125          | $0.006875
25        | 12,500 tokens | $0.003125         | $0.040625
50        | 25,000 tokens | $0.00625          | $0.159375
100       | 50,000 tokens | $0.0125           | $0.63125

Each message gets more expensive as the conversation grows.
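The rows of this table can be regenerated for any conversation length. This is a sketch under the same simplifying assumptions as the rest of the post: a constant 500 tokens per message and the $0.25/M input price, with output tokens ignored.

```python
from itertools import accumulate

PRICE_PER_MILLION = 0.25   # input price used in this article
TOKENS_PER_MESSAGE = 500   # assumed constant message size


def cost_rows(num_messages: int) -> list[tuple[int, int, float, float]]:
    """Return (message #, context tokens, cost of this request, cumulative cost)."""
    context_sizes = [TOKENS_PER_MESSAGE * k for k in range(1, num_messages + 1)]
    cumulative_tokens = accumulate(context_sizes)
    return [
        (k, ctx, ctx * PRICE_PER_MILLION / 1e6, total * PRICE_PER_MILLION / 1e6)
        for k, (ctx, total) in enumerate(zip(context_sizes, cumulative_tokens), start=1)
    ]


rows = cost_rows(100)
for k in (1, 10, 25, 50, 100):
    n, ctx, cost, cumulative = rows[k - 1]
    print(f"{n:>3} | {ctx:>6,} tokens | ${cost:.6f} | ${cumulative:.6f}")
```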

Solution 1: Reset sessions frequently

The Reddit discussion that opened my eyes suggested using /new regularly. This resets the context window:

Long conversation (expensive):
Message 1-50 in one session: 637,500 tokens sent

Split into 5 sessions (cheaper):
Each 10-message session sends 500 + 1,000 + ... + 5,000 = 27,500 tokens
Session 1 (messages 1-10): 27,500 tokens
Session 2 (messages 11-20): 27,500 tokens
Session 3 (messages 21-30): 27,500 tokens
Session 4 (messages 31-40): 27,500 tokens
Session 5 (messages 41-50): 27,500 tokens
Total: 137,500 tokens (about 4.6x cheaper)

The tradeoff: I lose conversation context between sessions. But for many tasks, each message is independent anyway—why pay to resend context I don’t need?
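How much a reset saves depends on the session length, so it is worth computing rather than guessing. This sketch assumes fixed-size messages and a full reset with no messages carried over. Both function names are hypothetical helpers.

```python
def tokens_for_session(num_messages: int, tokens_per_message: int = 500) -> int:
    """Input tokens for one session in which every request resends the history."""
    return tokens_per_message * num_messages * (num_messages + 1) // 2


def tokens_when_split(total_messages: int, session_length: int,
                      tokens_per_message: int = 500) -> int:
    """Input tokens when the history is reset every `session_length` messages."""
    full_sessions, remainder = divmod(total_messages, session_length)
    return (full_sessions * tokens_for_session(session_length, tokens_per_message)
            + tokens_for_session(remainder, tokens_per_message))


one_session = tokens_when_split(50, 50)
five_sessions = tokens_when_split(50, 10)
print(one_session, five_sessions, round(one_session / five_sessions, 1))
```

Shorter sessions save more: resetting every 5 messages instead of every 10 cuts the total roughly in half again, at the price of losing context more often.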

For programmatic API usage, I implemented this pattern:

session_reset.py
class ConversationManager:
    """
    Manages conversation sessions with automatic context reset.
    """

    def __init__(self, max_messages=20, max_tokens=10000):
        self.max_messages = max_messages
        self.max_tokens = max_tokens  # Stored but not enforced in this class
        self.messages = []
        self.session_count = 0

    def add_message(self, role: str, content: str) -> list:
        """Add message, reset session if the message limit is exceeded."""
        self.messages.append({"role": role, "content": content})
        # Check if we should reset
        if len(self.messages) > self.max_messages:
            return self._reset_session()
        return self.messages

    def _reset_session(self) -> list:
        """Start fresh session, keeping only the last exchange for continuity."""
        recent = self.messages[-2:]  # Keep last exchange
        # Keep the history starting with a user turn, which chat APIs typically expect
        if recent and recent[0]["role"] == "assistant":
            recent = recent[1:]
        self.messages = recent
        self.session_count += 1
        print(f"Session reset #{self.session_count}. Kept {len(self.messages)} messages.")
        return self.messages


# Usage (my_tasks and call_api are placeholders for your own task list and API wrapper)
manager = ConversationManager(max_messages=15)
for task in my_tasks:
    manager.add_message("user", task["prompt"])
    response = call_api(manager.messages)
    manager.add_message("assistant", response)

This simple change reduced my API costs by over 80%.

Solution 2: Cap context at 64k tokens

Another Reddit user mentioned they “keep context windows capped at 64k and strictly limit input/output tokens to keep latency down.”

I implemented a token-aware context manager:

context_limiter.py
import tiktoken


class ContextWindowManager:
    """
    Tracks and limits context window to prevent cost explosion.
    """

    def __init__(self, max_context_tokens=64000):
        self.max_tokens = max_context_tokens
        self.messages = []
        # tiktoken is OpenAI's tokenizer, so counts for Claude are only an approximation
        self.encoding = tiktoken.encoding_for_model("gpt-4")

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.encoding.encode(text))

    def add_message(self, role: str, content: str) -> dict:
        """Add message, return stats about context window."""
        tokens = self.count_tokens(content)
        # The "tokens" key is bookkeeping only; strip it before sending messages to the API
        self.messages.append({
            "role": role,
            "content": content,
            "tokens": tokens
        })
        current_total = sum(m["tokens"] for m in self.messages)
        return {
            "message_tokens": tokens,
            "total_context": current_total,
            "at_limit": current_total >= self.max_tokens
        }

    def trim_oldest(self, keep_last: int = 5) -> int:
        """Remove oldest messages, keep last N for continuity."""
        if len(self.messages) <= keep_last:
            return 0
        removed = len(self.messages) - keep_last
        self.messages = self.messages[-keep_last:]
        return removed

    def should_summarize(self) -> bool:
        """Check if context needs summarization instead of removal."""
        total = sum(m["tokens"] for m in self.messages)
        return total > self.max_tokens * 0.8  # Warn at 80%
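The class above reports when the cap is reached but trims by message count. A variant that trims by tokens could look like the sketch below. It is an illustration, not the code I run: `estimate_tokens` is a crude characters-divided-by-four heuristic standing in for a real tokenizer, and both function names are hypothetical.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_tokens: int,
                   count: Callable[[str], int] = estimate_tokens) -> list[dict]:
    """Drop the oldest messages until the remaining history fits in max_tokens.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept: list[dict] = []
    used = 0
    # Walk from newest to oldest so recent turns win the budget.
    for message in reversed(messages):
        cost = count(message["content"])
        if kept and used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept


history = [{"role": "user", "content": "x" * 400} for _ in range(10)]  # ~100 tokens each
print(len(trim_to_budget(history, max_tokens=350)))  # 3
```

After trimming, the first remaining message may be an assistant turn, so a real implementation would probably drop one more message to keep the history starting with a user turn.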

The 64k token cap serves two purposes:

  1. Cost control: Prevents runaway token accumulation
  2. Latency control: Larger contexts mean slower responses

Latency impact by context size
Context Size | Approximate Response Time
-------------|---------------------------
10k tokens   | ~2 seconds
32k tokens   | ~4 seconds
64k tokens   | ~7 seconds
100k tokens  | ~12+ seconds
200k tokens  | ~20+ seconds (and very expensive)

Solution 3: Summarize long conversations

For conversations where I need continuity but context is growing too large, I use a summarization strategy:

Summarization workflow
1. Detect context approaching limit (e.g., 50k tokens)
2. Send special request: "Summarize our conversation in 500 tokens"
3. Start new conversation with summary as system context
4. Continue with clean context window

summarize_context.py
def format_messages(messages: list) -> str:
    """Render the history as plain "role: content" lines for the prompt."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


async def summarize_and_reset(client, messages: list, max_summary_tokens: int = 500) -> list:
    """
    Summarize conversation history and return fresh context.
    Expects an async client such as anthropic.AsyncAnthropic.
    """
    summary_prompt = f"""
    Summarize the following conversation in under {max_summary_tokens} tokens.
    Preserve key decisions, important facts, and current state.
    Do not include pleasantries or filler.
    Conversation:
    {format_messages(messages)}
    """
    response = await client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=max_summary_tokens,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    summary = response.content[0].text
    # Start fresh with summary as context
    return [
        {"role": "user", "content": f"[Previous context summary: {summary}]"},
        {"role": "assistant", "content": "Understood. I have the context summary. How can I help?"}
    ]

This preserves what matters while discarding the accumulated token bloat.
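The first step of the workflow, deciding when to summarize, can be kept separate from the API call. The sketch below assumes the token counter and the summarizer are passed in as plain functions, so it runs without a network connection. `compact_if_needed` and the two lambdas are hypothetical stand-ins, not part of any library.

```python
from typing import Callable


def compact_if_needed(messages: list[dict], token_limit: int,
                      count: Callable[[str], int],
                      summarize: Callable[[list[dict]], str],
                      keep_recent: int = 2) -> list[dict]:
    """Replace older history with a summary once the context passes token_limit."""
    total = sum(count(m["content"]) for m in messages)
    if total <= token_limit or len(messages) <= keep_recent:
        return messages  # still within budget, nothing to do

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)  # in practice, one cheap API call
    return [
        {"role": "user", "content": f"[Previous context summary: {summary}]"},
        {"role": "assistant", "content": "Understood."},
        *recent,
    ]


history = [{"role": "user" if i % 2 == 0 else "assistant", "content": "word " * 100}
           for i in range(10)]
compacted = compact_if_needed(
    history, token_limit=500,
    count=lambda text: len(text.split()),                    # stand-in token counter
    summarize=lambda msgs: f"{len(msgs)} earlier messages",  # stand-in summarizer
)
print(len(compacted))  # 4
```

Keeping the last exchange verbatim means the model still sees the exact wording of the most recent turn. This assumes the history ends on an assistant reply, so that the kept slice starts with a user message.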

The subscription trap

I compared my API costs to ChatGPT Plus at $20/month:

Subscription vs API cost comparison
Usage Pattern | ChatGPT Plus | API Equivalent
---------------------------|--------------|----------------
50 messages/day, avg ctx | $20/month | ~$15/month
100 messages/day, long ctx | $20/month | ~$60/month
200 messages/day, accum. | $20/month | ~$200/month
Heavy daily usage | $20/month | $500+/month

The subscription model hides context costs. With API, I pay for every accumulated token. With subscription, I can have 100-message conversations without thinking about it.

This insight changed how I structure my API usage:

For exploration and prototyping:
- Use ChatGPT Plus or Claude Pro subscription
- No context management needed
- Predictable monthly cost

For production and automation:
- Use API with strict context management
- Implement session resets
- Monitor token usage per request
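For the monitoring point, a small tracker is enough to see the accumulation as it happens. This is a sketch with the Haiku prices from this post as defaults. The `UsageTracker` class is hypothetical; the token counts would come from the `usage` field the API returns with each response.

```python
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    """Accumulates billed tokens per request (prices in USD per million tokens)."""
    input_price: float = 0.25
    output_price: float = 1.25
    requests: list[tuple[int, int]] = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Store one request's token counts and return its cost."""
        self.requests.append((input_tokens, output_tokens))
        return self._cost(input_tokens, output_tokens)

    def _cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000

    @property
    def total_cost(self) -> float:
        return sum(self._cost(i, o) for i, o in self.requests)


tracker = UsageTracker()
# With the Anthropic Python SDK the counts would come from the response, e.g.
# tracker.record(response.usage.input_tokens, response.usage.output_tokens)
tracker.record(input_tokens=25_000, output_tokens=500)
print(f"${tracker.total_cost:.6f}")
```

Logging the input count per request makes the growth visible long before the balance runs out: if the number climbs by a steady amount on every call, the history is accumulating.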

Common mistakes to avoid

Looking back at my usage patterns, I made several mistakes:

Mistake #1: One endless conversation

Before:
session = start_conversation()
for message in messages:  # 200 messages
    chat(session, message)
# Cost: 20M tokens sent (counting ~500-token replies in the history too), ~$5 for one conversation

After:
for task in tasks:
    session = start_conversation()  # Fresh each time
    chat(session, task)
# Cost: ~500k tokens sent, ~$0.12 for 200 tasks

Mistake #2: Ignoring context in cost calculations

My mental model:
"I'm only sending 500 tokens per message"
Reality:
"Every message resends ALL previous messages"
Message 100 with 500-token messages = 50,000 tokens sent for one message

Mistake #3: Large system prompts every request

Each request included my full system instructions:
- Project context: 2000 tokens
- Style guidelines: 1000 tokens
- Format requirements: 500 tokens
Sent 100 times = 350,000 tokens just on system prompt
Fix: Cache system prompts, use shorter references in context
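For the system prompt fix, one option is Anthropic's prompt caching, where a system prompt passed as a content block carries a `cache_control` marker. The sketch below only builds the request arguments and sends nothing. `build_request` and the placeholder prompt text are mine, and the minimum cacheable length and cached-read pricing vary by model, so check the current documentation before relying on it.

```python
SYSTEM_PROMPT = "Project context, style guidelines, format requirements..."  # placeholder text


def build_request(history: list[dict], model: str = "claude-3-haiku-20240307",
                  max_tokens: int = 1024) -> dict:
    """Build keyword arguments for a Messages API call with a cacheable system prompt."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # Passing the system prompt as a content block lets it carry cache_control,
        # which marks the prompt prefix up to this block as cacheable.
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": history,
    }


request = build_request([{"role": "user", "content": "Current message"}])
# With the SDK this would be sent as client.messages.create(**request).
print(request["system"][0]["cache_control"])
```

Caching does not shrink the prompt. The same tokens are still part of every request, but cached reads are billed at a reduced rate. Trimming the instructions to the essentials remains the cheaper fix.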

Token budget allocation

Now I structure my context with explicit budgets:

token_budget.py
CONTEXT_BUDGET = {
    "system_prompt": 500,       # Essential instructions only
    "conversation": 45000,      # Rolling history
    "current_query": 4000,      # User's input
    "response_reserve": 14000,  # AI's response space
}


def validate_context(messages: list, new_query: str) -> bool:
    """Check if adding query would exceed budget."""
    # count_tokens is any token counter, e.g. ContextWindowManager.count_tokens
    current = sum(count_tokens(m["content"]) for m in messages)
    query_tokens = count_tokens(new_query)
    available = CONTEXT_BUDGET["conversation"] - current
    if query_tokens > available:
        print(f"Context full. {available} tokens available, need {query_tokens}.")
        return False
    return True
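A budget like this is only useful if the slots add up to less than the overall cap and if overruns are easy to spot. The sketch below checks both, assuming the 64k cap from Solution 2. `budget_report` is a hypothetical helper and the usage figures are made up for illustration.

```python
CONTEXT_BUDGET = {
    "system_prompt": 500,
    "conversation": 45000,
    "current_query": 4000,
    "response_reserve": 14000,
}
CONTEXT_CAP = 64000  # the overall cap from Solution 2


def budget_report(usage: dict[str, int]) -> dict[str, int]:
    """Return the tokens remaining in each budget slot (negative means over budget)."""
    return {slot: limit - usage.get(slot, 0) for slot, limit in CONTEXT_BUDGET.items()}


# The slots should never add up to more than the overall cap.
assert sum(CONTEXT_BUDGET.values()) <= CONTEXT_CAP

# Made-up usage numbers for one request
remaining = budget_report({"system_prompt": 450, "conversation": 46200, "current_query": 800})
over = [slot for slot, left in remaining.items() if left < 0]
print(over)  # ['conversation']
```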

Summary

In this post, I explained why my $10 API budget evaporated during “simple chatting” with Haiku. The culprit was context window accumulation—every message resends the entire conversation history.

A 50-message conversation at 500 tokens/message costs:

  • Expected: 25,000 tokens
  • Actual: 637,500 tokens (25x more)

To fix this:

  1. Reset sessions frequently — Use /new or create fresh conversations
  2. Cap context at 64k tokens — Prevent runaway accumulation
  3. Summarize long conversations — Preserve key info, discard the rest
  4. Track what the API actually sends — Not just your latest message

The subscription model hides this cost; API users feel it directly. Context window management is the #1 skill for cost-effective API usage.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
