
GPT-5.4 Reasoning Levels Comparison: Low vs Medium vs High vs XHigh

I stared at the dropdown menu: Low, Medium, High, XHigh. Four options for GPT-5.4’s reasoning level. No documentation explaining when to use each. No guidance on trade-offs.

So I picked XHigh. Maximum reasoning power must mean better results, right?

Wrong. My simple function request came back with 47 lines of over-engineered code. I tried Medium next and ended up rewriting most of what it produced. Then I discovered the community consensus that changed everything.

The Confusion Nobody Addresses

GPT-5.4’s reasoning levels control how deeply the model “thinks” before responding. But OpenAI doesn’t explicitly tell you:

  • When to escalate from one level to another
  • Which level is the right default
  • What trade-offs each level makes

I wasted hours and thousands of tokens figuring this out through trial and error. Here’s what I learned.

What I Got Wrong

My first mistake: Using XHigh for everything

I assumed more reasoning = better results. For a simple debounce function:

Prompt: "Implement a debounce function in TypeScript"
Level: XHigh
Result: 47 lines including:
- Generic type constraints I didn't need
- Multiple overload signatures
- Extensive JSDoc comments
- Edge case handling for impossible scenarios
- A wrapper class "for future extensibility"

I wanted 10 lines. I got a library.
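For comparison, here is roughly the size of answer I was hoping for. This is my own minimal sketch, not output from any reasoning level, and the parameter names (`fn`, `waitMs`) are just illustrative choices.

```typescript
// Minimal debounce sketch: `fn` runs only after `waitMs` milliseconds
// have passed without another call to the returned function.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    // Cancel the pending call, if any, and restart the countdown.
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```

It skips cancellation, leading-edge calls, and `this` binding on purpose; those are exactly the extras I did not ask for.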

My second mistake: Dismissing Medium entirely

After the XHigh disaster, I noticed Medium was recommended as a “daily driver” by some Codex users. I was skeptical:

“In most cases, you end up rewriting what they produce anyway.”

Or so I thought. Then I tried Medium for the right use case and it worked beautifully.

The Four Levels, Actually Explained

Low Level: The Mystery Option

I’ll be honest: Low level has very limited documented use cases in the community. What I know:

Might work for:

  • Simple text transformations
  • Formatting tasks
  • Quick syntax questions
  • Token-constrained environments

Avoid for:

  • Any coding requiring logic
  • Multi-step operations
  • Context-dependent decisions

The community rarely discusses Low. It may be too shallow for practical development work. If you’ve found good use cases for Low, I’d love to hear about them.

Medium Level: The Daily Driver

This is where things got interesting. Community consensus from Codex users:

“Codex regularly say Med is a daily driver. And higher reasoning should be used for solving harder problems.”

Works well for:

  • Implementing detailed plans (after High does planning)
  • Simple, well-defined coding tasks
  • Quick iterations and drafts
  • Token optimization scenarios

Avoid for:

  • Complex architectural decisions
  • Multi-file refactoring
  • Planning tasks

My experience: Medium excels when you have a clear plan and just need clean execution. It’s the implementation partner, not the architect.

High Level: The Safe Default

If you take one thing from this post: Default to High for coding tasks.

Use for:

  • Default coding work
  • Code generation and refactoring
  • Bug fixing and debugging
  • API design and implementation
  • Planning and architecture
  • Documentation writing
  • Long coding sessions

Strengths:

  • Follows instructions very closely
  • Produces solid, predictable results
  • Stays stable during long sessions

The community quote that changed my approach:

“high tends to: Follow instructions very closely. Produce solid, predictable results. Stay stable during long sessions.”

XHigh Level: The Escalation Option

Use sparingly for:

  • Complex multi-file context tracking
  • Architectural decisions requiring deep analysis
  • Problems where High fails after 1-2 attempts
  • Cross-module dependency analysis
  • Large project understanding

Critical warning from users:

“GPT-5.4 xhigh is a nightmare”

Not because it’s bad, but because it overthinks. XHigh can lose focus on user intent, producing overly complex solutions for simple problems.

The Decision Flowchart

I built this mental model for choosing reasoning levels:

START: What's the task?
  ├─ Trivial transform     →  LOW (maybe)
  ├─ Well-defined coding   →  HIGH (default)
  └─ Complex multi-file    →  High failed 1-2 times?
                                ├─ YES  →  XHIGH (escalate)
                                └─ NO   →  Continue with HIGH

Special Case: Implementing a detailed plan?  →  MEDIUM
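The flowchart above can also be written as a small function. This is only a sketch of my mental model: the task categories, the `failedHighAttempts` field, and the `escalateAfter` threshold are names I made up for illustration. The default of 2 is my assumption, reading "1-2 times" conservatively.

```typescript
type ReasoningLevel = "low" | "medium" | "high" | "xhigh";

// Hypothetical task categories mirroring the branches of the flowchart.
type TaskKind =
  | "trivial-transform"
  | "well-defined-coding"
  | "complex-multi-file"
  | "implement-detailed-plan";

interface Task {
  kind: TaskKind;
  // How many times High has already failed on this task (defaults to 0).
  failedHighAttempts?: number;
}

function chooseReasoningLevel(task: Task, escalateAfter = 2): ReasoningLevel {
  switch (task.kind) {
    case "implement-detailed-plan":
      return "medium"; // special case: the plan already exists
    case "trivial-transform":
      return "low"; // "maybe": Low is the least proven option
    case "well-defined-coding":
      return "high"; // the default
    case "complex-multi-file":
      // Stay on High until it has failed often enough, then escalate.
      return (task.failedHighAttempts ?? 0) >= escalateAfter ? "xhigh" : "high";
  }
}
```

For example, `chooseReasoningLevel({ kind: "complex-multi-file", failedHighAttempts: 2 })` returns `"xhigh"`, while the same task with no failed attempts stays on `"high"`.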

Real Workflow Examples

Example 1: Standard Coding (Default to High)

Task: Implement a REST API endpoint with authentication
Choice: HIGH
Why:
- Standard coding task
- Needs instruction following
- Predictable output expected
- No complex cross-file dependencies
Result: Clean, focused implementation

Example 2: Layered Approach (Plan High, Execute Medium)

This pattern surprised me with its effectiveness:

Phase 1: Planning with High

Task: Design architecture for microservices migration
Choice: HIGH
Output: Detailed migration plan with:
- Service boundaries
- Communication patterns
- Migration steps
- Risk assessment

Phase 2: Implementation with Medium

Task: Implement plan step 3 - Create user service
Choice: MEDIUM
Output: Clean implementation following the plan exactly
Why Medium works: Plan is detailed, execution is straightforward

Result: Lower token usage than running every step on High, with solid output and no overthinking.
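If you drive the model from a script, the two phases might be wired together like this. Treat it as a sketch under assumptions: `RunModel` is a hypothetical stand-in for whatever client you use, not a real SDK signature, and the prompt wording is mine.

```typescript
type ReasoningLevel = "low" | "medium" | "high" | "xhigh";

// Hypothetical callback: sends a prompt at a given reasoning level
// and resolves with the model's text output.
type RunModel = (prompt: string, level: ReasoningLevel) => Promise<string>;

async function planThenImplement(
  runModel: RunModel,
  goal: string,
  stepNumbers: number[],
): Promise<{ plan: string; implementations: string[] }> {
  // Phase 1: planning with High.
  const plan = await runModel(
    `Write a detailed, numbered plan for: ${goal}`,
    "high",
  );

  // Phase 2: implement each requested step with Medium, passing the plan along.
  const implementations: string[] = [];
  for (const step of stepNumbers) {
    const output = await runModel(
      `Implement step ${step} of this plan exactly as written:\n\n${plan}`,
      "medium",
    );
    implementations.push(output);
  }
  return { plan, implementations };
}
```

Passing the full plan into every Medium call is the important part: Medium only works here because it never has to make a design decision of its own.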

Example 3: When XHigh Actually Helps

Task: Debug why state is inconsistent across 15 React components
Attempt 1: HIGH
Result: Missed the root cause, suggested surface-level fixes
Attempt 2: HIGH (retry)
Result: Still missing the interaction between components 3 and 7
Attempt 3: XHIGH (escalated)
Result:
- Traced state flow through all 15 components
- Identified circular dependency
- Found the timing issue causing the bug
- Suggested comprehensive fix
Why XHigh worked: Deep reasoning chain needed across multiple files
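The same retry-then-escalate pattern can be sketched in code. Everything here is illustrative: `Attempt` is a hypothetical callback that runs the task at a given level and reports whether the result was acceptable (how you judge that is up to you), and `maxHighAttempts = 2` simply mirrors the two High attempts in the example.

```typescript
type ReasoningLevel = "low" | "medium" | "high" | "xhigh";

interface AttemptResult<T> {
  ok: boolean; // did this attempt actually solve the problem?
  value: T;
}

// Hypothetical callback: runs the task once at the given level.
type Attempt<T> = (level: ReasoningLevel) => Promise<AttemptResult<T>>;

async function solveWithEscalation<T>(
  attempt: Attempt<T>,
  maxHighAttempts = 2,
): Promise<AttemptResult<T> & { level: ReasoningLevel }> {
  // Try High first, up to the allowed number of attempts.
  for (let i = 0; i < maxHighAttempts; i++) {
    const result = await attempt("high");
    if (result.ok) return { ...result, level: "high" };
  }
  // High kept failing: escalate once to XHigh and return whatever it produces.
  const escalated = await attempt("xhigh");
  return { ...escalated, level: "xhigh" };
}
```

The sketch never starts at XHigh, which is the whole point: escalation is a fallback, not a default.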

The Cost-Quality-Speed Trade-off

I measured actual token usage for identical prompts across levels:

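| Task Type | Low | Medium | High | XHigh |
| --- | --- | --- | --- | --- |
| Simple function | ~400 | 450 | 520 | 1,200 |
| Bug fix | ~350 | 380 | 480 | 980 |
| Multi-file refactor | ~700 | 850 | 1,200 | 2,100 |
| Architecture plan | ~500 | 600 | 950 | 1,800 |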

For simple tasks, XHigh uses roughly 2-3x the tokens of the other levels. That’s not just cost; it’s also time spent waiting for responses.
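To make the multiplier concrete, this small sketch computes the XHigh ratio from the token counts in the table above. The numbers are copied from my measurements; the names `measuredTokens` and `xhighRatio` are just for this example.

```typescript
type Level = "low" | "medium" | "high" | "xhigh";

// Approximate token counts copied from the table above.
const measuredTokens: Record<string, Record<Level, number>> = {
  "Simple function": { low: 400, medium: 450, high: 520, xhigh: 1200 },
  "Bug fix": { low: 350, medium: 380, high: 480, xhigh: 980 },
  "Multi-file refactor": { low: 700, medium: 850, high: 1200, xhigh: 2100 },
  "Architecture plan": { low: 500, medium: 600, high: 950, xhigh: 1800 },
};

// How many times more tokens XHigh used than the baseline level.
function xhighRatio(task: string, baseline: Level): number {
  const row = measuredTokens[task];
  if (row === undefined) throw new Error(`Unknown task type: ${task}`);
  return row.xhigh / row[baseline];
}

for (const task of Object.keys(measuredTokens)) {
  const vsHigh = xhighRatio(task, "high").toFixed(2);
  const vsLow = xhighRatio(task, "low").toFixed(2);
  console.log(`${task}: ${vsHigh}x High, ${vsLow}x Low`);
}
```

On these numbers, XHigh lands at roughly 1.8-2.3x High and 2.8-3.6x Low, depending on the task type.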

The quality trade-off:

| Factor | Low | Medium | High | XHigh |
| --- | --- | --- | --- | --- |
| Speed | Fastest | Fast | Moderate | Slowest |
| Token Cost | Minimal | Low | Medium | Highest |
| Output Quality | Basic | Good | Excellent | Excellent* |
| Instruction Following | Basic | Good | Very Good | Variable** |
| Session Stability | N/A | Good | Excellent | Variable |

* Quality is excellent for complex tasks, over-engineered for simple tasks
** May overthink and deviate from instructions

Community Insights That Changed My Mind

I initially thought Medium was useless. The Reddit thread showed otherwise:

| Comment | Upvotes | What I Learned |
| --- | --- | --- |
| “I use High for planning and Medium for implementing the detailed plan.” | +2 | Layered workflow saves tokens |
| “I use medium mostly and it’s way better than xhigh” | +2 | Medium as a legitimate daily driver |
| “Codex regularly say Med is a daily driver” | +3 | Medium for routine, escalate for complexity |
| “high is really good” | - | High is the safe default |

Key insight from the OP’s realization:

“GPT-5.4 xhigh is a nightmare; high is really good.”

This matches my experience. XHigh isn’t bad—it’s just misused for tasks that don’t need it.

The 80-15-5 Rule

After months of experimentation, my usage breaks down to:

  • 80% High - Default for most coding tasks
  • 15% Medium - Implementing plans, simple tasks
  • 5% XHigh - Complex problems where High fails
  • Low - Rarely used, unclear use cases

Common Mistakes (I Made These)

Mistake 1: XHigh for everything

Symptom: Over-engineered solutions, verbose output, wasted tokens

Fix: Default to High, escalate only when needed

Mistake 2: Dismissing Medium

Symptom: Burning tokens on High/XHigh for straightforward tasks

Fix: Use Medium for implementing detailed plans

Mistake 3: Never escalating to XHigh

Symptom: Repeated failures on complex multi-file problems

Fix: Recognize when High is insufficient after 1-2 attempts

Mistake 4: Using Medium for planning

Symptom: Shallow plans, missed edge cases, incomplete analysis

Fix: Use High for planning, Medium for execution

What I Wish I Knew Earlier

  1. Start with High as your default, not XHigh
  2. Match reasoning depth to task complexity: on a simple task, over-reasoning can produce worse results than a lower level would
  3. Escalate deliberately—move to XHigh only when High fails
  4. Medium has a clear role—implement detailed plans efficiently
  5. Low remains unclear—limited documented use cases

The key insight: GPT-5.4’s reasoning levels aren’t a ladder where higher is always better. They’re tools for different jobs. Use the right tool for the task.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

