At What Token Count Does Claude's Performance Degrade?
I was excited when Claude Opus 4.6 was announced with a 1M context window at the same price point. More context is always better, right? Then I started noticing something strange in my long sessions.
The Problem
I was working on a large refactoring task. I’d loaded up my context with multiple files, documentation, and several rounds of back-and-forth discussion. Around 180k tokens, Claude started responding with:
“Actually… let me reconsider… Actually, I think…”
Over and over. The “Actually… Actually…” pattern.
The model wasn’t failing. It was degrading. Quietly.
What the Community Reports
I dug into a Reddit discussion about the 1M context announcement. The top comment wasn’t celebrating - it was asking:
“pretty huge, but how’s the performance drop off?”
Here’s what practitioners reported:
| Token Range | What Happens | Source |
|---|---|---|
| ~140k | “Opus gets dementia” | Anecdotal report |
| ~180k | “Actually… Actually…” loops begin | Community consensus |
| 250k-500k | “Quality starts to tank” | Multiple reports |
| 400k-500k | “Loses track of earlier instructions” | 7 upvotes |
One comment cut to the heart of it:
“In our experience the model starts losing track of earlier instructions somewhere around 400-500k tokens even when the context window technically allows more”
Important clarification: Claude doesn’t “forget” early context. It deprioritizes it when newer information conflicts.
┌─────────────────────────────────────────────────┐
│ Beginning Context                               │
│ - System prompts                                │
│ - Initial instructions                          │
│ - Highest priority                              │
├─────────────────────────────────────────────────┤
│ Middle Context                                  │
│ - Gets less attention during retrieval          │
│ - "Lost in the middle" phenomenon               │
│ - Lowest retrieval accuracy                     │
├─────────────────────────────────────────────────┤
│ End Context                                     │
│ - Most recent interactions                      │
│ - High priority                                 │
│ - Competes with earlier instructions            │
└─────────────────────────────────────────────────┘

The 80% Rule
I found guidance in my own project’s performance rules:
Avoid last 20% of context window for:
- Large-scale refactoring
- Feature implementation spanning multiple files
- Debugging complex interactions
This aligns with community reports. The degradation doesn’t happen at the limit - it happens in the upper portion.
| Window Size | Safe Threshold | For What Task? |
|---|---|---|
| 200k | ~160k (80%) | Simple tasks |
| 200k | ~100k (50%) | Complex reasoning |
| 1M | ~800k (80%) | Simple tasks (risky) |
| 1M | ~400k-500k | Complex reasoning (recommended) |

Signs of Context Degradation
I’ve learned to recognize when Claude’s context is overloaded:
1. The “Actually…” Loop
The model keeps reconsidering without making progress. This signals it’s struggling to reconcile conflicting context.
2. Forgotten Instructions
System prompt says “Use TypeScript strict mode” but later output shows plain JavaScript without types.
3. Quality Regression
- Earlier responses: detailed, well-structured
- Later responses: shorter, generic, less nuanced
4. Pattern Inconsistency
- Earlier: correctly uses existing codebase patterns
- Later: suggests patterns contradicting earlier decisions
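Two of these signals, the “Actually…” loop and shrinking responses, lend themselves to rough automated checks. The sketch below is only a heuristic under my own assumptions: the phrase list, the `max_hits` threshold, and the `ratio` cutoff are hypothetical values chosen for illustration, not measured ones, and the function names are made up for this example.

```python
import re
from statistics import mean

# Hypothetical phrase list: wording that suggests the model is second-guessing itself.
RECONSIDER = re.compile(r"\b(?:actually|let me reconsider)\b", re.IGNORECASE)


def reconsideration_count(response: str) -> int:
    """Count reconsideration phrases in a single response."""
    return len(RECONSIDER.findall(response))


def looks_like_actually_loop(response: str, max_hits: int = 3) -> bool:
    """Flag a response that reconsiders itself `max_hits` times or more (assumed threshold)."""
    return reconsideration_count(response) >= max_hits


def length_regressed(lengths: list[int], window: int = 3, ratio: float = 0.5) -> bool:
    """Flag a session whose last `window` responses average less than
    `ratio` times the average length of its first `window` responses."""
    if len(lengths) < 2 * window:
        return False  # not enough history to compare
    return mean(lengths[-window:]) < ratio * mean(lengths[:window])
```

Neither check proves degradation; a hit is just a prompt to reread the last few responses and decide whether to summarize and start over.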
Context Hygiene Strategy
I now manage context like a scarce resource:
Before Complex Task:
1. Fresh session if context >50% full
2. Add only relevant files
3. State critical constraints in current message

During Long Session:
1. Monitor response quality
2. If degradation detected:
   - Summarize current state
   - Start new session with summary
   - Restate critical constraints

After Task Completion:
1. Clear context for unrelated tasks
2. Keep summary if tasks are related

"I have 1M context, so let me add the entire codebase"
- More noise, lower signal-to-noise ratio
- Higher chance of conflicting information
- Deprioritization of important early context

"Keep the session going forever"
- Quality degrades over time
- Earlier instructions deprioritized
- Inconsistent model behavior

The Hidden Cost of 1M Context
The 1M window at the same price seems like a pure win. But consider:
| Factor | 200k Window | 1M Window |
|---|---|---|
| Technical capacity | 200k tokens | 1M tokens |
| Reliable capacity | ~160k (80%) | ~400k-500k in practice |
| Token cost per reliable unit | Standard | Potentially higher |
| Debugging difficulty | Moderate | Higher (more context to analyze) |
The 1M context is a capacity feature, not a quality guarantee. You can fit more in, but the model won’t weigh it equally.
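To make the "reliable capacity" row concrete, here is a small arithmetic sketch. The token figures are the rough community numbers from the table above, not measurements, and `reliable_share` is a hypothetical helper name.

```python
# Rough figures from the table above: (low, high) estimates of reliable tokens.
WINDOWS = {
    "200k": {"capacity": 200_000, "reliable": (160_000, 160_000)},
    "1M": {"capacity": 1_000_000, "reliable": (400_000, 500_000)},
}


def reliable_share(capacity: int, reliable: tuple[int, int]) -> tuple[int, int]:
    """Return the (low, high) percentage of the window that is reported as reliable."""
    low, high = reliable
    return (100 * low // capacity, 100 * high // capacity)


for name, info in WINDOWS.items():
    low, high = reliable_share(info["capacity"], info["reliable"])
    print(f"{name}: {low}-{high}% of the window is in the reliable range")
```

On these numbers, the share of the window that users describe as reliable falls from about 80% to 40-50%, even though the absolute number of reliable tokens goes up.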
What I Do Now
For complex reasoning (refactoring, debugging, multi-file changes):
- Stay under 50% of window
- Actively prune irrelevant files
- Start fresh sessions for unrelated tasks
For simple tasks (single-file edits, documentation):
- 80% threshold is acceptable
- Less sensitive to positioning issues
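The two thresholds above can be sketched as a small budget check. This is a minimal illustration that assumes you already know your current token count; the task labels `"complex"` and `"simple"` and both function names are hypothetical, and the percentages are just the rules of thumb from this post.

```python
# Rules of thumb from this post, as whole percentages of the context window.
SAFE_PERCENT = {"complex": 50, "simple": 80}


def token_budget(window_tokens: int, task: str) -> int:
    """Return the token count to stay under for the given task type."""
    return window_tokens * SAFE_PERCENT[task] // 100


def should_start_fresh(used_tokens: int, window_tokens: int, task: str) -> bool:
    """True when the session has already exceeded the budget for this task type."""
    return used_tokens > token_budget(window_tokens, task)
```

For example, `should_start_fresh(120_000, 200_000, "complex")` returns `True`, because 120k is over the 100k budget for complex reasoning in a 200k window.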
For long sessions:
- Watch for degradation signals
- Periodically summarize and reset
- Move critical instructions to recent context
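The summarize-and-reset step can be sketched as a helper that builds the first message of the new session. This is an assumption about how one might structure the handoff, not a prescribed format; `Handoff` and `build_handoff_prompt` are hypothetical names. The constraints go last on purpose, so they sit in the most recent part of the context.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """State carried from the old session into the new one."""
    summary: str
    constraints: list[str] = field(default_factory=list)


def build_handoff_prompt(handoff: Handoff, next_task: str) -> str:
    """Build the opening message for a fresh session: summary, task, then constraints."""
    lines = [
        "Summary of the previous session:",
        handoff.summary.strip(),
        "",
        f"Next task: {next_task.strip()}",
    ]
    if handoff.constraints:
        # Restated last so they land in the most recent context.
        lines += ["", "Constraints (restated, still in force):"]
        lines += [f"- {c}" for c in handoff.constraints]
    return "\n".join(lines)


prompt = build_handoff_prompt(
    Handoff(
        summary="Refactored the auth module; tests pass.",
        constraints=["Use TypeScript strict mode"],
    ),
    next_task="Migrate the session store.",
)
```

The summary itself still has to come from you or from the model in the old session; this helper only fixes the ordering.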
Summary
In this post, I looked at when Claude’s performance degrades, based on user reports. The key point is that users report quality dropping somewhere between 250k and 500k tokens despite the 1M capacity, so manage context actively: stay under 80% of the window for simple tasks and under 50% for complex reasoning.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.