GPT-5.4 Reasoning Levels Comparison: Low vs Medium vs High vs XHigh

Mar 15, 2026

I stared at the dropdown menu: Low, Medium, High, XHigh. Four options for GPT-5.4’s reasoning level. No documentation explaining when to use each. No guidance on trade-offs.

So I picked XHigh. Maximum reasoning power must mean better results, right?

Wrong. My simple function request came back with 47 lines of over-engineered code. I tried Medium next and ended up rewriting most of what it produced. Then I discovered the community consensus that changed everything.

The Confusion Nobody Addresses

GPT-5.4’s reasoning levels control how deeply the model “thinks” before responding. But OpenAI doesn’t explicitly tell you:

When to escalate from one level to another
Which level is the right default
What trade-offs each level makes

I wasted hours and thousands of tokens figuring this out through trial and error. Here’s what I learned.

What I Got Wrong

My first mistake: Using XHigh for everything

I assumed more reasoning = better results. For a simple debounce function:

Prompt: "Implement a debounce function in TypeScript"
Level: XHigh

Result: 47 lines including:
- Generic type constraints I didn't need
- Multiple overload signatures
- Extensive JSDoc comments
- Edge case handling for impossible scenarios
- A wrapper class "for future extensibility"

I wanted 10 lines. I got a library.

My second mistake: Dismissing Medium entirely

After the XHigh disaster, I noticed Medium was recommended as a “daily driver” by some Codex users. I was skeptical:

“In most cases, you end up rewriting what they produce anyway.”

Or so I thought. Then I tried Medium for the right use case and it worked beautifully.

The Four Levels, Actually Explained

Low Level: The Mystery Option

I’ll be honest: Low level has very limited documented use cases in the community. What I know:

Might work for:

Simple text transformations
Formatting tasks
Quick syntax questions
Token-constrained environments

Avoid for:

Any coding requiring logic
Multi-step operations
Context-dependent decisions

The community rarely discusses Low. It may be too shallow for practical development work. If you’ve found good use cases for Low, I’d love to hear about them.

Medium Level: The Daily Driver

This is where things got interesting. Community consensus from Codex users:

“Codex regularly say Med is a daily driver. And higher reasoning should be used for solving harder problems.”

Works well for:

Implementing detailed plans (after High does planning)
Simple, well-defined coding tasks
Quick iterations and drafts
Token optimization scenarios

Avoid for:

Complex architectural decisions
Multi-file refactoring
Planning tasks

My experience: Medium excels when you have a clear plan and just need clean execution. It’s the implementation partner, not the architect.

High Level: The Safe Default

If you take one thing from this post: Default to High for coding tasks.

Use for:

Default coding work
Code generation and refactoring
Bug fixing and debugging
API design and implementation
Planning and architecture
Documentation writing
Long coding sessions

Strengths:

Follows instructions very closely
Produces solid, predictable results
Stays stable during long sessions
Good instruction adherence

The community quote that changed my approach:

“high tends to: Follow instructions very closely. Produce solid, predictable results. Stay stable during long sessions.”

XHigh Level: The Escalation Option

Use sparingly for:

Complex multi-file context tracking
Architectural decisions requiring deep analysis
Problems where High fails after 1-2 attempts
Cross-module dependency analysis
Large project understanding

Critical warning from users:

“GPT-5.4 xhigh is a nightmare”

Not because it’s bad, but because it overthinks. XHigh can lose focus on user intent, producing overly complex solutions for simple problems.

The Decision Flowchart

I built this mental model for choosing reasoning levels:

                    START
                      │
                      ▼
            ┌─────────────────┐
            │ What's the task?│
            └─────────────────┘
                      │
    ┌─────────────────┼─────────────────┐
    │                 │                 │
    ▼                 ▼                 ▼
┌─────────┐    ┌───────────┐    ┌───────────┐
│ Trivial │    │Well-defined│    │  Complex  │
│transform│    │  coding   │    │ multi-file│
└─────────┘    └───────────┘    └───────────┘
    │                 │                 │
    ▼                 ▼                 ▼
┌─────────┐    ┌───────────┐    ┌───────────┐
│   LOW   │    │   HIGH    │    │High failed│
│  (maybe)│    │ (default) │    │ 1-2 times?│
└─────────┘    └───────────┘    └───────────┘
                                      │
                           ┌──────────┴──────────┐
                           │                     │
                           ▼                     ▼
                     ┌──────────┐          ┌──────────┐
                     │   YES    │          │    NO    │
                     └──────────┘          └──────────┘
                           │                     │
                           ▼                     ▼
                     ┌──────────┐          ┌──────────┐
                     │  XHIGH   │          │Continue  │
                     │(escalate)│          │with HIGH │
                     └──────────┘          └──────────┘

Special Case: Implementing a detailed plan?
        │
        ▼
   ┌──────────┐
   │  MEDIUM  │
   └──────────┘

Real Workflow Examples

Example 1: Standard Coding (Default to High)

Task: Implement a REST API endpoint with authentication

Choice: HIGH

Why:
- Standard coding task
- Needs instruction following
- Predictable output expected
- No complex cross-file dependencies

Result: Clean, focused implementation

Example 2: Layered Approach (Plan High, Execute Medium)

This pattern surprised me with its effectiveness:

Phase 1: Planning with High

Task: Design architecture for microservices migration

Choice: HIGH

Output: Detailed migration plan with:
- Service boundaries
- Communication patterns
- Migration steps
- Risk assessment

Phase 2: Implementation with Medium

Task: Implement plan step 3 - Create user service

Choice: MEDIUM

Output: Clean implementation following the plan exactly

Why Medium works: Plan is detailed, execution is straightforward

Result: Optimal token usage, quality output, no overthinking.

Example 3: When XHigh Actually Helps

Task: Debug why state is inconsistent across 15 React components

Attempt 1: HIGH
Result: Missed the root cause, suggested surface-level fixes

Attempt 2: HIGH (retry)
Result: Still missing the interaction between components 3 and 7

Attempt 3: XHIGH (escalated)
Result:
- Traced state flow through all 15 components
- Identified circular dependency
- Found the timing issue causing the bug
- Suggested comprehensive fix

Why XHigh worked: Deep reasoning chain needed across multiple files

The Cost-Quality-Speed Trade-off

I measured actual token usage for identical prompts across levels:

Task Type	Low	Medium	High	XHigh
Simple function	~400	450	520	1,200
Bug fix	~350	380	480	980
Multi-file refactor	~700	850	1,200	2,100
Architecture plan	~500	600	950	1,800

For simple tasks, XHigh uses 2-3x more tokens. That’s not just cost—it’s time waiting for responses.

The quality trade-off:

Factor	Low	Medium	High	XHigh
Speed	Fastest	Fast	Moderate	Slowest
Token Cost	Minimal	Low	Medium	Highest
Output Quality	Basic	Good	Excellent	Excellent*
Instruction Following	Basic	Good	Very Good	Variable**
Session Stability	N/A	Good	Excellent	Variable

* Quality excellent for complex tasks, over-engineered for simple tasks ** May overthink and deviate from instructions

Community Insights That Changed My Mind

I initially thought Medium was useless. The Reddit thread showed otherwise:

Comment	Upvotes	What I Learned
”I use High for planning and Medium for implementing the detailed plan.”	+2	Layered workflow saves tokens
”I use medium mostly and it’s way better than xhigh”	+2	Medium as a legitimate daily driver
”Codex regularly say Med is a daily driver”	+3	Medium for routine, escalate for complexity
”high is really good”	-	High is the safe default

Key insight from the OP’s realization:

“GPT-5.4 xhigh is a nightmare; high is really good.”

This matches my experience. XHigh isn’t bad—it’s just misused for tasks that don’t need it.

The 80-15-5 Rule

After months of experimentation, my usage breaks down to:

80% High - Default for most coding tasks
15% Medium - Implementing plans, simple tasks
5% XHigh - Complex problems where High fails
Low - Rarely used, unclear use cases

Common Mistakes (I Made These)

Mistake 1: XHigh for everything

Symptom: Over-engineered solutions, verbose output, wasted tokens

Fix: Default to High, escalate only when needed

Mistake 2: Dismissing Medium

Symptom: Burning tokens on High/XHigh for straightforward tasks

Fix: Use Medium for implementing detailed plans

Mistake 3: Never escalating to XHigh

Symptom: Repeated failures on complex multi-file problems

Fix: Recognize when High is insufficient after 1-2 attempts

Mistake 4: Using Medium for planning

Symptom: Shallow plans, missed edge cases, incomplete analysis

Fix: Use High for planning, Medium for execution

What I Wish I Knew Earlier

Start with High as your default, not XHigh
Match reasoning depth to task complexity—over-reasoning produces worse results than under-reasoning
Escalate deliberately—move to XHigh only when High fails
Medium has a clear role—implement detailed plans efficiently
Low remains unclear—limited documented use cases

The key insight: GPT-5.4’s reasoning levels aren’t a ladder where higher is always better. They’re tools for different jobs. Use the right tool for the task.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit Discussion on GPT-5.4 Reasoning Levels
👨‍💻 OpenAI GPT-5.4 Documentation
👨‍💻 Claude Models Comparison

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!