
At What Token Count Does Claude's Performance Degrade?

I was excited when Claude Opus 4.6 was announced with a 1M context window at the same price point. More context is always better, right? Then I started noticing something strange in my long sessions.

The Problem

I was working on a large refactoring task. I’d loaded up my context with multiple files, documentation, and several rounds of back-and-forth discussion. Around 180k tokens, Claude started responding with:

“Actually… let me reconsider… Actually, I think…”

Over and over. The “Actually… Actually…” pattern.

The model wasn’t failing. It was degrading. Quietly.

What the Community Reports

I dug into a Reddit discussion about the 1M context announcement. The top comment wasn’t celebrating - it was asking:

“pretty huge, but how’s the performance drop off?”

Here’s what practitioners reported:

Token Range   What Happens                             Source
─────────────────────────────────────────────────────────────────────────
~140k         “Opus gets dementia”                     Anecdotal report
~180k         “Actually… Actually…” loops begin        Community consensus
250k-500k     “Quality starts to tank”                 Multiple reports
400k-500k     “Loses track of earlier instructions”    7 upvotes

One comment cut to the heart of it:

“In our experience the model starts losing track of earlier instructions somewhere around 400-500k tokens even when the context window technically allows more”

Important clarification: Claude doesn’t “forget” early context. It deprioritizes it when newer information conflicts.

How Claude Weighs Context
┌─────────────────────────────────────────────────┐
│ Beginning Context                               │
│   - System prompts                              │
│   - Initial instructions                        │
│   - Highest priority                            │
├─────────────────────────────────────────────────┤
│ Middle Context                                  │
│   - Gets less attention during retrieval        │
│   - "Lost in the middle" phenomenon             │
│   - Lowest retrieval accuracy                   │
├─────────────────────────────────────────────────┤
│ End Context                                     │
│   - Most recent interactions                    │
│   - High priority                               │
│   - Competes with earlier instructions          │
└─────────────────────────────────────────────────┘
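
If the layout above holds, one way to act on it is to keep critical instructions at the start, put bulk reference material in the middle, and restate the constraints next to the request at the end. The following is a minimal Python sketch of that ordering. `assemble_prompt` and its parameters are hypothetical names chosen for illustration; they are not part of any Claude API, and the section labels are assumptions.

```python
def assemble_prompt(instructions: list[str], documents: list[str], question: str) -> str:
    """Order prompt parts so critical instructions sit at both high-priority ends.

    Hypothetical helper: the names and section labels are illustrative assumptions.
    """
    rules = "\n".join(f"- {rule}" for rule in instructions)
    parts = [
        "Instructions:\n" + rules,             # beginning: initial instructions
        *documents,                            # middle: bulk reference material
        "Reminder of constraints:\n" + rules,  # end: restated next to the request
        "Task: " + question,
    ]
    return "\n\n".join(parts)


prompt = assemble_prompt(
    ["Use TypeScript strict mode"],
    ["<contents of file A>", "<contents of file B>"],
    "Refactor the parser module.",
)
print(prompt)
```

The constraints are duplicated on purpose: the copy at the end competes with recent context instead of relying on the model to retrieve the copy at the top.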

The 80% Rule

I found guidance in my own project’s performance rules:

Avoid last 20% of context window for:

  • Large-scale refactoring
  • Feature implementation spanning multiple files
  • Debugging complex interactions

This aligns with community reports. The degradation doesn’t happen at the limit - it happens in the upper portion.

Safe Context Thresholds
Window Size   Safe Threshold   For What Task?
─────────────────────────────────────────────────────────────
200k          ~160k (80%)      Simple tasks
200k          ~100k (50%)      Complex reasoning
1M            ~800k (80%)      Simple tasks (risky)
1M            ~400k-500k       Complex reasoning (recommended)
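
The thresholds in this table are simple percentages of the window, so they are easy to compute and check in code. Below is a minimal sketch, assuming only the 80% and 50% figures from the table. The function names and the task labels `"simple"` and `"complex"` are illustrative. For a 1M window the 50% rule gives 500k, the upper end of the 400k-500k range listed above.

```python
# Percentages taken from the table above; the task labels are illustrative names.
SAFE_PERCENT = {"simple": 80, "complex": 50}


def safe_token_budget(window_tokens: int, task: str) -> int:
    """Return the token count to stay under for a given window size and task type."""
    return window_tokens * SAFE_PERCENT[task] // 100


def is_over_budget(used_tokens: int, window_tokens: int, task: str) -> bool:
    """True when the session has passed the safe threshold for this task type."""
    return used_tokens > safe_token_budget(window_tokens, task)


print(safe_token_budget(200_000, "simple"))        # 160000
print(safe_token_budget(200_000, "complex"))       # 100000
print(is_over_budget(180_000, 200_000, "simple"))  # True
```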

Signs of Context Degradation

I’ve learned to recognize when Claude’s context is overloaded:

1. The “Actually…” Loop

The model keeps reconsidering without making progress. This signals it’s struggling to reconcile conflicting context.

2. Forgotten Instructions

System prompt says “Use TypeScript strict mode” but later output shows plain JavaScript without types.

3. Quality Regression

Earlier responses: detailed, well-structured
Later responses: shorter, generic, less nuanced

4. Pattern Inconsistency

Earlier: correctly uses existing codebase patterns
Later: suggests patterns contradicting earlier decisions
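
The first of these signals is the easiest to check mechanically. The sketch below is a rough heuristic, not a validated detector: it counts how often a response says "Actually" and flags it past a cutoff. The function name and the cutoff of 2 are assumptions for illustration. The other three signals need a human or a more careful comparison against earlier output.

```python
import re

# Matches "actually" as a whole word, case-insensitively.
ACTUALLY = re.compile(r"\bactually\b", re.IGNORECASE)


def looks_like_reconsideration_loop(response: str, max_hits: int = 2) -> bool:
    """Heuristic: flag a response that says "Actually" more than max_hits times.

    The default cutoff of 2 is an arbitrary assumption, not a measured value.
    """
    return len(ACTUALLY.findall(response)) > max_hits


sample = "Actually… let me reconsider… Actually, I think… Actually, wait."
print(looks_like_reconsideration_loop(sample))  # True
```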

Context Hygiene Strategy

I now manage context like a scarce resource:

Good: Active Context Management

Before Complex Task:
  1. Fresh session if context >50% full
  2. Add only relevant files
  3. State critical constraints in current message

During Long Session:
  1. Monitor response quality
  2. If degradation detected:
     - Summarize current state
     - Start new session with summary
     - Restate critical constraints

After Task Completion:
  1. Clear context for unrelated tasks
  2. Keep summary if tasks are related

Bad: Passive Context Filling

"I have 1M context, so let me add the entire codebase"
  - More noise, lower signal-to-noise ratio
  - Higher chance of conflicting information
  - Deprioritization of important early context

"Keep the session going forever"
  - Quality degrades over time
  - Earlier instructions deprioritized
  - Inconsistent model behavior
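
The "good" checklist can be sketched as a small decision helper. This is a minimal Python sketch under stated assumptions: `SessionState`, `should_start_fresh`, and `restart_prompt` are hypothetical names, the token counts are supplied by the caller, and `degradation_seen` is set by whatever quality monitoring you already do. Nothing here calls a real API.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    used_tokens: int
    window_tokens: int
    degradation_seen: bool = False  # set by your own quality monitoring


def should_start_fresh(state: SessionState, before_complex_task: bool) -> bool:
    """Restart on degradation, or before complex work once the window is >50% full."""
    if state.degradation_seen:
        return True
    more_than_half_full = state.used_tokens * 2 > state.window_tokens
    return before_complex_task and more_than_half_full


def restart_prompt(summary: str, constraints: list[str]) -> str:
    """Build the opening message of a new session: summary plus restated constraints."""
    lines = ["Summary of previous session:", summary, "", "Critical constraints:"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)


state = SessionState(used_tokens=120_000, window_tokens=200_000)
if should_start_fresh(state, before_complex_task=True):
    print(restart_prompt("Parser refactor is half done.", ["Use TypeScript strict mode"]))
```

Writing the summary itself is left to you or to the model. The helper only makes sure the constraints travel with it into the new session.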

The Hidden Cost of 1M Context

The 1M window at the same price seems like a pure win. But consider:

Factor                         200k Window    1M Window
──────────────────────────────────────────────────────────────────────────────
Technical capacity             200k tokens    1M tokens
Reliable capacity              ~160k (80%)    ~400k-500k practically
Token cost per reliable unit   Standard       Potentially higher
Debugging difficulty           Moderate       Higher (more context to analyze)

The 1M context is a capacity feature, not a quality guarantee. You can fit more in, but the model won’t weigh it equally.
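
A quick calculation makes the gap concrete. The sketch below uses only the reliable-capacity figures from the table above, which are community estimates rather than measurements. The function name is illustrative.

```python
def share_of_window(reliable_tokens: int, window_tokens: int) -> float:
    """Fraction of the nominal window that is reported as reliably usable."""
    return reliable_tokens / window_tokens


small_reliable = 160_000                         # 200k window
large_reliable_low, large_reliable_high = 400_000, 500_000  # 1M window

print(share_of_window(small_reliable, 200_000))         # 0.8
print(share_of_window(large_reliable_low, 1_000_000))   # 0.4
print(share_of_window(large_reliable_high, 1_000_000))  # 0.5

# Nominal capacity grows 5x, but reliable capacity grows less.
print(large_reliable_low / small_reliable)   # 2.5
print(large_reliable_high / small_reliable)  # 3.125
```

On these numbers the window is 5x larger, while the reliably usable part is roughly 2.5x to 3x larger.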

What I Do Now

For complex reasoning (refactoring, debugging, multi-file changes):

  • Stay under 50% of window
  • Actively prune irrelevant files
  • Start fresh sessions for unrelated tasks

For simple tasks (single-file edits, documentation):

  • 80% threshold is acceptable
  • Less sensitive to positioning issues

For long sessions:

  • Watch for degradation signals
  • Periodically summarize and reset
  • Move critical instructions to recent context

Summary

In this post, I looked at when Claude’s performance degrades, based on community reports. The key point is that reported quality drops start somewhere between 250k and 500k tokens despite the 1M capacity, so manage context actively: stay under 80% of the window for simple tasks and under 50% for complex reasoning.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
