Skip to content

Why Does Claude Re-read the Entire Conversation Every Follow-up?

Problem

When I started using Claude heavily for coding work, I noticed my token usage exploded in long conversations. A simple follow-up question would consume disproportionately more tokens than the first message.

Here’s what I observed:

Message 1: "Help me write a Python function"
Usage: ~2% of monthly quota
Message 10: "Also add type hints"
Usage: ~5% of monthly quota (for just this message!)
Message 30: "Add docstrings too"
Usage: ~12% of monthly quota (for this single message!)

I assumed each message cost roughly the same. But a 30-turn conversation wasn’t 30x the cost of a single message—it was exponentially more expensive. Why?

Environment

  • Claude AI Pro Plan
  • Heavy coding sessions with 20-40 follow-up messages
  • Observed usage: 3x-5x higher token consumption in long threads
  • Expected: Linear token growth with message count

What happened?

I was debugging a complex feature across multiple Claude sessions. Each time I asked a follow-up question, my usage meter dropped faster than expected.

I tested this systematically:

Test 1: Single comprehensive prompt
Me: "Write a Python function to sort a list with error handling, type hints, docstrings, and unit tests."
Claude: [Complete implementation]
Usage: ~3% of plan
Test 2: Same result via follow-ups
Me: "Write a Python function to sort a list"
Claude: [Basic implementation]
Usage: ~2% of plan
Me: "Add error handling"
Claude: [Updated implementation]
Usage: ~2.5% of plan (accumulating history)
Me: "Add type hints"
Claude: [Updated implementation]
Usage: ~3% of plan (more history)
Me: "Add docstrings"
Claude: [Updated implementation]
Usage: ~3.5% of plan (even more history)
Me: "Add unit tests"
Claude: [Updated implementation]
Usage: ~4% of plan (full history processed)
Total usage: 2 + 2.5 + 3 + 3.5 + 4 = 15% of plan

Same result, 5x the cost. I was shocked by this compounding effect.

How to solve it?

I needed to understand why this happened. My hypothesis was that Claude must re-process previous messages to maintain conversation context.

Let me calculate the actual token processing:

token_calculation.py
def calculate_context_tokens(messages, avg_tokens_per_message=500):
"""
Calculate total tokens processed across a conversation.
Each new message re-processes all previous context.
"""
total_tokens = 0
for i in range(len(messages)):
# Message i+1 processes all messages 0 through i
context_size = sum(messages[:i+1])
total_tokens += context_size
return total_tokens
# Example: 10 messages of 500 tokens each
messages = [500] * 10
print(f"Total tokens processed: {calculate_context_tokens(messages)}")
# Output: 27,500 tokens (not 5,000!)

The math reveals the problem:

Message 1 processes: 500 tokens
Message 2 processes: 500 + 500 = 1,000 tokens
Message 3 processes: 500 + 500 + 500 = 1,500 tokens
...
Message 10 processes: 5,000 tokens
Total: 500 + 1,000 + 1,500 + ... + 5,000 = 27,500 tokens

A 10-message thread processes 27,500 tokens, not 5,000. The compounding is brutal for heavy coding work.

The Solution: Edit Instead of Follow-up

I discovered Claude’s edit feature. Instead of sending follow-ups, I edit the original message:

[INEFFICIENT - Follow-ups]
Me: Write a function to sort a list
Claude: [implementation]
Me: Now add error handling
Claude: [updated implementation]
Me: Now add type hints
Claude: [updated implementation]
Total: 3 separate context windows, 3x system prompt overhead
[EFFICIENT - Edit]
Me: Write a Python function to sort a list with error handling and type hints
[Claude generates once]
Total: 1 context window, 1x system prompt overhead

When I edit a message, Claude starts fresh from that point—clean context, no accumulated history.

Instead of:

Me: Write a function
Me: Add error handling
Me: Add type hints
Me: Add docstrings
Me: Add tests

I batch everything:

efficient_prompting.py
# INEFFICIENT: Multiple follow-ups
prompts = [
"Write a function to sort a list",
"Now add error handling",
"Now add type hints",
"Now add docstrings",
"Now add unit tests"
]
# Total context processed: 1 + 2 + 3 + 4 + 5 = 15 message-equivalents
# EFFICIENT: Single comprehensive prompt
efficient_prompt = """
Write a Python function to sort a list with the following requirements:
1. Include proper error handling for edge cases
2. Add complete type hints using typing module
3. Include comprehensive docstrings (Google style)
4. Include unit tests using pytest
Output the complete implementation.
"""
# Total context processed: 1 message-equivalent

Strategy 2: Start Fresh Chats for New Topics

When pivoting topics, I start a new conversation:

[INEFFICIENT]
Conversation 1: Python debugging help [5,000 tokens accumulated]
Conversation 1 (continued): JavaScript question [processes all 5,000 + new]
[EFFICIENT]
Conversation 1: Python debugging help [self-contained]
Conversation 2 (new): JavaScript question [fresh start]

Strategy 3: Summarize and Continue

For long conversations I need to continue, I ask Claude to summarize:

Me: Please summarize our conversation so far in 200 words.
Claude: [Concise summary]
Me: [New conversation] Continue from this summary: [paste summary]

This condenses 5,000 tokens of history into 200 tokens of context.

Strategy 4: Use Claude Projects

Claude Projects maintain separate contexts:

Project A: Frontend development [isolated context]
Project B: Backend API work [separate context]
Project C: Documentation [clean context]

Each project starts fresh without cross-contaminating context.

The reason

Claude re-reads entire conversations because of how Large Language Models fundamentally work—they are stateless inference engines.

The Stateless Architecture

Unlike databases that “remember” previous queries:

Traditional Database:
Query 1: SELECT * FROM users WHERE id = 1
Query 2: UPDATE users SET name = 'John' WHERE id = 1
Query 3: SELECT * FROM users WHERE id = 1
→ Database remembers state between queries
LLM Architecture:
Request 1: [Full context sent] → Response 1
Request 2: [Full context + Request 1 + Response 1 sent] → Response 2
Request 3: [Full context + Request 1 + Response 1 + Request 2 + Response 2 sent] → Response 3
→ LLM requires full context for every inference

Each API call requires the complete context to generate a response. Claude doesn’t “remember”—it “re-reads.”

Why This Design?

This stateless design enables:

  1. Consistency: Each response is deterministic given the same context
  2. Flexibility: You can edit any part of the conversation
  3. Simplicity: No complex state management on the backend
  4. Scalability: Each request is independent, enabling parallel processing

But the trade-off is compounding token costs.

The Math of Compounding Costs

Here’s a practical example:

System prompt: ~800 tokens (fixed)
User messages: ~500 tokens each (varies)
Claude responses: ~500 tokens each (varies)
Message 1: 800 + 500 + 500 = 1,800 tokens processed
Message 2: 800 + 1,800 + 500 + 500 = 3,600 tokens processed
Message 3: 800 + 3,600 + 500 + 500 = 5,400 tokens processed
Message 10: ~15,000 tokens processed for this single message
Message 30: ~45,000 tokens processed for this single message

A 30-turn conversation where each response is 500 tokens means Claude processes 15,000+ words of history just to answer the last question.

Common mistakes

I made several mistakes optimizing my Claude usage:

Mistake 1: Treating Claude Like a Human

My assumption: "Claude remembers what we discussed"
Reality: Claude re-processes everything. I was wasting tokens by assuming context persisted magically.

Mistake 2: Endless Follow-ups

Me: "Can you also add logging?"
Me: "And error handling too?"
Me: "One more thing—add tests"
Each "one more thing" compounded the cost exponentially.

Mistake 3: Not Using Edit Feature

I didn’t realize editing is more efficient than follow-ups. Edit gives Claude a fresh starting point without accumulated history.

Mistake 4: Mixing Topics in One Chat

[INEFFICIENT]
Same conversation:
- Python debugging
- JavaScript questions
- SQL optimization
- CSS layout issues
All context accumulates, making each message exponentially expensive.

Practical implementation

Here’s how I now structure my Claude usage:

claude_usage_optimizer.py
class ClaudeUsageOptimizer:
"""Strategies to optimize Claude token usage."""
def batch_questions(self, questions: list[str]) -> str:
"""Combine multiple questions into one prompt."""
return "\n".join([
"Answer all of the following concisely:",
*[f"{i+1}. {q}" for i, q in enumerate(questions)]
])
def should_start_new_chat(self, current_tokens: int, new_topic: bool) -> bool:
"""Decide when to start a fresh conversation."""
if new_topic:
return True
if current_tokens > 10000: # Threshold for context bloat
return True
return False
def create_summary_prompt(self) -> str:
"""Generate a summary to condense context."""
return """
Summarize our conversation in 200 words, focusing on:
1. Key decisions made
2. Code patterns established
3. Outstanding questions
Format for easy continuation.
"""

Cost Comparison Example

Scenario: 5 related coding tasks
Approach A (Follow-ups):
Task 1: 1,000 tokens processed
Task 2: 2,000 tokens processed
Task 3: 3,000 tokens processed
Task 4: 4,000 tokens processed
Task 5: 5,000 tokens processed
Total: 15,000 tokens
Approach B (Batched):
All tasks in one prompt: 1,500 tokens processed
Total: 1,500 tokens
Savings: 90%

Summary

In this post, I explained why Claude re-reads the entire conversation on every follow-up message. The key point is that LLMs are stateless inference engines—each API call requires full context, meaning conversation history compounds with every message.

The practical implications:

  1. Architecture dictates behavior—Stateless inference requires full context re-processing
  2. Costs compound exponentially—A 10-message thread costs 27,500 tokens, not 5,000
  3. Edit is your friend—Edit original messages instead of follow-ups when possible
  4. Batch and condense—Combine related tasks, summarize long conversations
  5. Start fresh when pivoting—New topics deserve new conversations

By working with Claude’s architecture instead of against it, I reduced my token usage by 50-90% while maintaining the same productivity. The edit feature alone saves me thousands of tokens per session.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments