Why Does Claude Re-read the Entire Conversation Every Follow-up?
Problem
When I started using Claude heavily for coding work, I noticed my token usage exploded in long conversations. A simple follow-up question would consume disproportionately more tokens than the first message.
Here’s what I observed:
Message 1: "Help me write a Python function"Usage: ~2% of monthly quota
Message 10: "Also add type hints"Usage: ~5% of monthly quota (for just this message!)
Message 30: "Add docstrings too"Usage: ~12% of monthly quota (for this single message!)I assumed each message cost roughly the same. But a 30-turn conversation wasn’t 30x the cost of a single message—it was exponentially more expensive. Why?
Environment
- Claude AI Pro Plan
- Heavy coding sessions with 20-40 follow-up messages
- Observed usage: 3x-5x higher token consumption in long threads
- Expected: Linear token growth with message count
What happened?
I was debugging a complex feature across multiple Claude sessions. Each time I asked a follow-up question, my usage meter dropped faster than expected.
I tested this systematically:
Test 1: Single comprehensive promptMe: "Write a Python function to sort a list with error handling, type hints, docstrings, and unit tests."Claude: [Complete implementation]Usage: ~3% of plan
Test 2: Same result via follow-upsMe: "Write a Python function to sort a list"Claude: [Basic implementation]Usage: ~2% of plan
Me: "Add error handling"Claude: [Updated implementation]Usage: ~2.5% of plan (accumulating history)
Me: "Add type hints"Claude: [Updated implementation]Usage: ~3% of plan (more history)
Me: "Add docstrings"Claude: [Updated implementation]Usage: ~3.5% of plan (even more history)
Me: "Add unit tests"Claude: [Updated implementation]Usage: ~4% of plan (full history processed)
Total usage: 2 + 2.5 + 3 + 3.5 + 4 = 15% of planSame result, 5x the cost. I was shocked by this compounding effect.
How to solve it?
I needed to understand why this happened. My hypothesis was that Claude must re-process previous messages to maintain conversation context.
Let me calculate the actual token processing:
def calculate_context_tokens(messages, avg_tokens_per_message=500): """ Calculate total tokens processed across a conversation.
Each new message re-processes all previous context. """ total_tokens = 0
for i in range(len(messages)): # Message i+1 processes all messages 0 through i context_size = sum(messages[:i+1]) total_tokens += context_size
return total_tokens
# Example: 10 messages of 500 tokens eachmessages = [500] * 10print(f"Total tokens processed: {calculate_context_tokens(messages)}")# Output: 27,500 tokens (not 5,000!)The math reveals the problem:
Message 1 processes: 500 tokensMessage 2 processes: 500 + 500 = 1,000 tokensMessage 3 processes: 500 + 500 + 500 = 1,500 tokens...Message 10 processes: 5,000 tokens
Total: 500 + 1,000 + 1,500 + ... + 5,000 = 27,500 tokensA 10-message thread processes 27,500 tokens, not 5,000. The compounding is brutal for heavy coding work.
The Solution: Edit Instead of Follow-up
I discovered Claude’s edit feature. Instead of sending follow-ups, I edit the original message:
[INEFFICIENT - Follow-ups]Me: Write a function to sort a listClaude: [implementation]
Me: Now add error handlingClaude: [updated implementation]
Me: Now add type hintsClaude: [updated implementation]Total: 3 separate context windows, 3x system prompt overhead
[EFFICIENT - Edit]Me: Write a Python function to sort a list with error handling and type hints[Claude generates once]Total: 1 context window, 1x system prompt overheadWhen I edit a message, Claude starts fresh from that point—clean context, no accumulated history.
Strategy 1: Batch Related Tasks
Instead of:
Me: Write a functionMe: Add error handlingMe: Add type hintsMe: Add docstringsMe: Add testsI batch everything:
# INEFFICIENT: Multiple follow-upsprompts = [ "Write a function to sort a list", "Now add error handling", "Now add type hints", "Now add docstrings", "Now add unit tests"]# Total context processed: 1 + 2 + 3 + 4 + 5 = 15 message-equivalents
# EFFICIENT: Single comprehensive promptefficient_prompt = """Write a Python function to sort a list with the following requirements:1. Include proper error handling for edge cases2. Add complete type hints using typing module3. Include comprehensive docstrings (Google style)4. Include unit tests using pytest
Output the complete implementation."""# Total context processed: 1 message-equivalentStrategy 2: Start Fresh Chats for New Topics
When pivoting topics, I start a new conversation:
[INEFFICIENT]Conversation 1: Python debugging help [5,000 tokens accumulated]Conversation 1 (continued): JavaScript question [processes all 5,000 + new]
[EFFICIENT]Conversation 1: Python debugging help [self-contained]Conversation 2 (new): JavaScript question [fresh start]Strategy 3: Summarize and Continue
For long conversations I need to continue, I ask Claude to summarize:
Me: Please summarize our conversation so far in 200 words.
Claude: [Concise summary]
Me: [New conversation] Continue from this summary: [paste summary]This condenses 5,000 tokens of history into 200 tokens of context.
Strategy 4: Use Claude Projects
Claude Projects maintain separate contexts:
Project A: Frontend development [isolated context]Project B: Backend API work [separate context]Project C: Documentation [clean context]Each project starts fresh without cross-contaminating context.
The reason
Claude re-reads entire conversations because of how Large Language Models fundamentally work—they are stateless inference engines.
The Stateless Architecture
Unlike databases that “remember” previous queries:
Traditional Database:Query 1: SELECT * FROM users WHERE id = 1Query 2: UPDATE users SET name = 'John' WHERE id = 1Query 3: SELECT * FROM users WHERE id = 1→ Database remembers state between queries
LLM Architecture:Request 1: [Full context sent] → Response 1Request 2: [Full context + Request 1 + Response 1 sent] → Response 2Request 3: [Full context + Request 1 + Response 1 + Request 2 + Response 2 sent] → Response 3→ LLM requires full context for every inferenceEach API call requires the complete context to generate a response. Claude doesn’t “remember”—it “re-reads.”
Why This Design?
This stateless design enables:
- Consistency: Each response is deterministic given the same context
- Flexibility: You can edit any part of the conversation
- Simplicity: No complex state management on the backend
- Scalability: Each request is independent, enabling parallel processing
But the trade-off is compounding token costs.
The Math of Compounding Costs
Here’s a practical example:
System prompt: ~800 tokens (fixed)User messages: ~500 tokens each (varies)Claude responses: ~500 tokens each (varies)
Message 1: 800 + 500 + 500 = 1,800 tokens processedMessage 2: 800 + 1,800 + 500 + 500 = 3,600 tokens processedMessage 3: 800 + 3,600 + 500 + 500 = 5,400 tokens processedMessage 10: ~15,000 tokens processed for this single messageMessage 30: ~45,000 tokens processed for this single messageA 30-turn conversation where each response is 500 tokens means Claude processes 15,000+ words of history just to answer the last question.
Common mistakes
I made several mistakes optimizing my Claude usage:
Mistake 1: Treating Claude Like a Human
My assumption: "Claude remembers what we discussed"
Reality: Claude re-processes everything. I was wasting tokens by assuming context persisted magically.Mistake 2: Endless Follow-ups
Me: "Can you also add logging?"Me: "And error handling too?"Me: "One more thing—add tests"
Each "one more thing" compounded the cost exponentially.Mistake 3: Not Using Edit Feature
I didn’t realize editing is more efficient than follow-ups. Edit gives Claude a fresh starting point without accumulated history.
Mistake 4: Mixing Topics in One Chat
[INEFFICIENT]Same conversation:- Python debugging- JavaScript questions- SQL optimization- CSS layout issues
All context accumulates, making each message exponentially expensive.Practical implementation
Here’s how I now structure my Claude usage:
class ClaudeUsageOptimizer: """Strategies to optimize Claude token usage."""
def batch_questions(self, questions: list[str]) -> str: """Combine multiple questions into one prompt.""" return "\n".join([ "Answer all of the following concisely:", *[f"{i+1}. {q}" for i, q in enumerate(questions)] ])
def should_start_new_chat(self, current_tokens: int, new_topic: bool) -> bool: """Decide when to start a fresh conversation.""" if new_topic: return True if current_tokens > 10000: # Threshold for context bloat return True return False
def create_summary_prompt(self) -> str: """Generate a summary to condense context.""" return """ Summarize our conversation in 200 words, focusing on: 1. Key decisions made 2. Code patterns established 3. Outstanding questions Format for easy continuation. """Cost Comparison Example
Scenario: 5 related coding tasks
Approach A (Follow-ups):Task 1: 1,000 tokens processedTask 2: 2,000 tokens processedTask 3: 3,000 tokens processedTask 4: 4,000 tokens processedTask 5: 5,000 tokens processedTotal: 15,000 tokens
Approach B (Batched):All tasks in one prompt: 1,500 tokens processedTotal: 1,500 tokens
Savings: 90%Summary
In this post, I explained why Claude re-reads the entire conversation on every follow-up message. The key point is that LLMs are stateless inference engines—each API call requires full context, meaning conversation history compounds with every message.
The practical implications:
- Architecture dictates behavior—Stateless inference requires full context re-processing
- Costs compound exponentially—A 10-message thread costs 27,500 tokens, not 5,000
- Edit is your friend—Edit original messages instead of follow-ups when possible
- Batch and condense—Combine related tasks, summarize long conversations
- Start fresh when pivoting—New topics deserve new conversations
By working with Claude’s architecture instead of against it, I reduced my token usage by 50-90% while maintaining the same productivity. The edit feature alone saves me thousands of tokens per session.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit Discussion: 10 TRICKS TO STOP HITTING CLAUDE'S USAGE LIMITS
- 👨💻 Anthropic Context Windows Documentation
- 👨💻 Understanding Token Counting
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments