GPT-5.4 Reasoning Levels Comparison: Low vs Medium vs High vs XHigh
I stared at the dropdown menu: Low, Medium, High, XHigh. Four options for GPT-5.4’s reasoning level. No documentation explaining when to use each. No guidance on trade-offs.
So I picked XHigh. Maximum reasoning power must mean better results, right?
Wrong. My simple function request came back with 47 lines of over-engineered code. I tried Medium next and ended up rewriting most of what it produced. Then I discovered the community consensus that changed everything.
The Confusion Nobody Addresses
GPT-5.4’s reasoning levels control how deeply the model “thinks” before responding. But OpenAI doesn’t explicitly tell you:
- When to escalate from one level to another
- Which level is the right default
- What trade-offs each level makes
I wasted hours and thousands of tokens figuring this out through trial and error. Here’s what I learned.
What I Got Wrong
My first mistake: Using XHigh for everything
I assumed more reasoning = better results. For a simple debounce function:
Prompt: "Implement a debounce function in TypeScript"Level: XHigh
Result: 47 lines including:- Generic type constraints I didn't need- Multiple overload signatures- Extensive JSDoc comments- Edge case handling for impossible scenarios- A wrapper class "for future extensibility"I wanted 10 lines. I got a library.
My second mistake: Dismissing Medium entirely
After the XHigh disaster, I noticed Medium was recommended as a “daily driver” by some Codex users. I was skeptical:
“In most cases, you end up rewriting what they produce anyway.”
Or so I thought. Then I tried Medium for the right use case and it worked beautifully.
The Four Levels, Actually Explained
Low Level: The Mystery Option
I’ll be honest: Low level has very limited documented use cases in the community. What I know:
Might work for:
- Simple text transformations
- Formatting tasks
- Quick syntax questions
- Token-constrained environments
Avoid for:
- Any coding requiring logic
- Multi-step operations
- Context-dependent decisions
The community rarely discusses Low. It may be too shallow for practical development work. If you’ve found good use cases for Low, I’d love to hear about them.
Medium Level: The Daily Driver
This is where things got interesting. Community consensus from Codex users:
“Codex regularly say Med is a daily driver. And higher reasoning should be used for solving harder problems.”
Works well for:
- Implementing detailed plans (after High does planning)
- Simple, well-defined coding tasks
- Quick iterations and drafts
- Token optimization scenarios
Avoid for:
- Complex architectural decisions
- Multi-file refactoring
- Planning tasks
My experience: Medium excels when you have a clear plan and just need clean execution. It’s the implementation partner, not the architect.
High Level: The Safe Default
If you take one thing from this post: Default to High for coding tasks.
Use for:
- Default coding work
- Code generation and refactoring
- Bug fixing and debugging
- API design and implementation
- Planning and architecture
- Documentation writing
- Long coding sessions
Strengths:
- Follows instructions very closely
- Produces solid, predictable results
- Stays stable during long sessions
- Good instruction adherence
The community quote that changed my approach:
“high tends to: Follow instructions very closely. Produce solid, predictable results. Stay stable during long sessions.”
XHigh Level: The Escalation Option
Use sparingly for:
- Complex multi-file context tracking
- Architectural decisions requiring deep analysis
- Problems where High fails after 1-2 attempts
- Cross-module dependency analysis
- Large project understanding
Critical warning from users:
“GPT-5.4 xhigh is a nightmare”
Not because it’s bad, but because it overthinks. XHigh can lose focus on user intent, producing overly complex solutions for simple problems.
The Decision Flowchart
I built this mental model for choosing reasoning levels:
START │ ▼ ┌─────────────────┐ │ What's the task?│ └─────────────────┘ │ ┌─────────────────┼─────────────────┐ │ │ │ ▼ ▼ ▼┌─────────┐ ┌───────────┐ ┌───────────┐│ Trivial │ │Well-defined│ │ Complex ││transform│ │ coding │ │ multi-file│└─────────┘ └───────────┘ └───────────┘ │ │ │ ▼ ▼ ▼┌─────────┐ ┌───────────┐ ┌───────────┐│ LOW │ │ HIGH │ │High failed││ (maybe)│ │ (default) │ │ 1-2 times?│└─────────┘ └───────────┘ └───────────┘ │ ┌──────────┴──────────┐ │ │ ▼ ▼ ┌──────────┐ ┌──────────┐ │ YES │ │ NO │ └──────────┘ └──────────┘ │ │ ▼ ▼ ┌──────────┐ ┌──────────┐ │ XHIGH │ │Continue │ │(escalate)│ │with HIGH │ └──────────┘ └──────────┘
Special Case: Implementing a detailed plan? │ ▼ ┌──────────┐ │ MEDIUM │ └──────────┘Real Workflow Examples
Example 1: Standard Coding (Default to High)
Task: Implement a REST API endpoint with authentication
Choice: HIGH
Why:- Standard coding task- Needs instruction following- Predictable output expected- No complex cross-file dependencies
Result: Clean, focused implementationExample 2: Layered Approach (Plan High, Execute Medium)
This pattern surprised me with its effectiveness:
Phase 1: Planning with High
Task: Design architecture for microservices migration
Choice: HIGH
Output: Detailed migration plan with:- Service boundaries- Communication patterns- Migration steps- Risk assessmentPhase 2: Implementation with Medium
Task: Implement plan step 3 - Create user service
Choice: MEDIUM
Output: Clean implementation following the plan exactly
Why Medium works: Plan is detailed, execution is straightforwardResult: Optimal token usage, quality output, no overthinking.
Example 3: When XHigh Actually Helps
Task: Debug why state is inconsistent across 15 React components
Attempt 1: HIGHResult: Missed the root cause, suggested surface-level fixes
Attempt 2: HIGH (retry)Result: Still missing the interaction between components 3 and 7
Attempt 3: XHIGH (escalated)Result:- Traced state flow through all 15 components- Identified circular dependency- Found the timing issue causing the bug- Suggested comprehensive fix
Why XHigh worked: Deep reasoning chain needed across multiple filesThe Cost-Quality-Speed Trade-off
I measured actual token usage for identical prompts across levels:
| Task Type | Low | Medium | High | XHigh |
|---|---|---|---|---|
| Simple function | ~400 | 450 | 520 | 1,200 |
| Bug fix | ~350 | 380 | 480 | 980 |
| Multi-file refactor | ~700 | 850 | 1,200 | 2,100 |
| Architecture plan | ~500 | 600 | 950 | 1,800 |
For simple tasks, XHigh uses 2-3x more tokens. That’s not just cost—it’s time waiting for responses.
The quality trade-off:
| Factor | Low | Medium | High | XHigh |
|---|---|---|---|---|
| Speed | Fastest | Fast | Moderate | Slowest |
| Token Cost | Minimal | Low | Medium | Highest |
| Output Quality | Basic | Good | Excellent | Excellent* |
| Instruction Following | Basic | Good | Very Good | Variable** |
| Session Stability | N/A | Good | Excellent | Variable |
* Quality excellent for complex tasks, over-engineered for simple tasks ** May overthink and deviate from instructions
Community Insights That Changed My Mind
I initially thought Medium was useless. The Reddit thread showed otherwise:
| Comment | Upvotes | What I Learned |
|---|---|---|
| ”I use High for planning and Medium for implementing the detailed plan.” | +2 | Layered workflow saves tokens |
| ”I use medium mostly and it’s way better than xhigh” | +2 | Medium as a legitimate daily driver |
| ”Codex regularly say Med is a daily driver” | +3 | Medium for routine, escalate for complexity |
| ”high is really good” | - | High is the safe default |
Key insight from the OP’s realization:
“GPT-5.4 xhigh is a nightmare; high is really good.”
This matches my experience. XHigh isn’t bad—it’s just misused for tasks that don’t need it.
The 80-15-5 Rule
After months of experimentation, my usage breaks down to:
- 80% High - Default for most coding tasks
- 15% Medium - Implementing plans, simple tasks
- 5% XHigh - Complex problems where High fails
- Low - Rarely used, unclear use cases
Common Mistakes (I Made These)
Mistake 1: XHigh for everything
Symptom: Over-engineered solutions, verbose output, wasted tokens
Fix: Default to High, escalate only when needed
Mistake 2: Dismissing Medium
Symptom: Burning tokens on High/XHigh for straightforward tasks
Fix: Use Medium for implementing detailed plans
Mistake 3: Never escalating to XHigh
Symptom: Repeated failures on complex multi-file problems
Fix: Recognize when High is insufficient after 1-2 attempts
Mistake 4: Using Medium for planning
Symptom: Shallow plans, missed edge cases, incomplete analysis
Fix: Use High for planning, Medium for execution
What I Wish I Knew Earlier
- Start with High as your default, not XHigh
- Match reasoning depth to task complexity—over-reasoning produces worse results than under-reasoning
- Escalate deliberately—move to XHigh only when High fails
- Medium has a clear role—implement detailed plans efficiently
- Low remains unclear—limited documented use cases
The key insight: GPT-5.4’s reasoning levels aren’t a ladder where higher is always better. They’re tools for different jobs. Use the right tool for the task.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit Discussion on GPT-5.4 Reasoning Levels
- 👨💻 OpenAI GPT-5.4 Documentation
- 👨💻 Claude Models Comparison
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments