
Does 'Think Step by Step' Actually Improve Claude's Outputs?

Problem

I kept seeing the same advice everywhere: “Add ‘think through this step by step’ to your prompts for better Claude outputs.”

So I tried it. I appended the magical phrase to every prompt I sent to Claude. Some responses improved. Others became painfully verbose. A few were actually worse.

The question haunted me: Does this technique actually work, or am I just burning tokens for no reason?

What I Tried First

I started with a math problem—the classic chain-of-thought use case:

math-test.txt
Me: A store sells apples for $2 each. If I buy 5 apples and pay with a $20 bill,
how much change do I get? Think through this step by step.
Claude: Let me work through this step by step.
Step 1: Calculate the total cost of apples.
- Price per apple: $2
- Number of apples: 5
- Total cost = $2 × 5 = $10
Step 2: Calculate the change from the payment.
- Payment: $20
- Total cost: $10
- Change = $20 - $10 = $10
Therefore, you will receive $10 in change.

The reasoning was clear and correct. But then I tried the same prompt without the magic phrase:

math-test-direct.txt
Me: A store sells apples for $2 each. If I buy 5 apples and pay with a $20 bill,
how much change do I get?
Claude: You'll receive $10 in change. ($20 - (5 × $2) = $10)

Same correct answer. Much more concise. I started questioning whether I needed chain-of-thought at all.

The Real Test: Complex Reasoning

I decided to test something harder. This time, a logic puzzle that requires multiple steps:

logic-puzzle-cot.txt
Me: Three friends—Alice, Bob, and Carol—each have a different pet: a cat, a dog,
or a bird. Alice is allergic to cats. Bob's pet doesn't bark. Carol's pet has
wings. Who owns which pet?
[Without chain-of-thought]
Claude: Alice has the dog (allergic to cats, so not cat; bird has wings which
Carol has, so dog remains), Bob has the cat (his pet doesn't bark, so not dog;
Carol has bird, so cat remains), and Carol has the bird (her pet has wings).
[With chain-of-thought]
Claude: Let me solve this step by step.
Step 1: Analyze Carol's clue first.
- "Carol's pet has wings"
- Only birds have wings among the options
- Therefore, Carol owns the bird.
Step 2: Analyze Bob's clue.
- "Bob's pet doesn't bark"
- Dogs bark, cats don't, birds have wings
- Carol already has the bird
- Bob cannot have the dog (it barks)
- Therefore, Bob owns the cat.
Step 3: Analyze Alice's clue.
- "Alice is allergic to cats"
- She doesn't own the cat (Bob does)
- She doesn't own the bird (Carol does)
- Therefore, Alice owns the dog.
Final answer: Alice → dog, Bob → cat, Carol → bird.

Both answers were correct. But the chain-of-thought version showed its reasoning clearly. I could spot-check each step. More importantly, if there had been an error, I would have known exactly where.
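For a puzzle this small, the answer can also be checked independently of either response. The following is a minimal sketch, assuming the three clues translate into the constraints noted in the comments; the function and variable names are my own and purely illustrative. It tries every one-to-one assignment of pets to people and keeps the ones that satisfy all three clues.

```python
from itertools import permutations

PEOPLE = ("Alice", "Bob", "Carol")
PETS = ("cat", "dog", "bird")


def satisfies_clues(pet_of: dict) -> bool:
    """Return True if a person -> pet assignment matches all three clues."""
    return (
        pet_of["Alice"] != "cat"       # Alice is allergic to cats
        and pet_of["Bob"] != "dog"     # Bob's pet doesn't bark
        and pet_of["Carol"] == "bird"  # Carol's pet has wings
    )


def solve() -> list:
    """Brute force: test each permutation of pets against the clues."""
    solutions = []
    for pets in permutations(PETS):
        assignment = dict(zip(PEOPLE, pets))
        if satisfies_clues(assignment):
            solutions.append(assignment)
    return solutions


solutions = solve()
print(solutions)
```

Exactly one assignment survives, and it matches both of Claude's answers, so the puzzle has a unique solution and neither response got lucky.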

When It Falls Apart

Then I tested it on tasks where I thought it wouldn’t help:

simple-query-verbose.txt
Me: What is the capital of France? Think through this step by step.
Claude: Let me think through this step by step.
Step 1: Identify what is being asked.
- The question asks for the capital city of France.
Step 2: Retrieve geographical knowledge.
- France is a country in Western Europe.
- Its capital city is Paris.
Step 3: Verify the answer.
- Paris is indeed the capital of France.
- It is the largest city in France and houses the government.
The capital of France is Paris.

This was absurd. A one-word answer stretched into multiple steps. The technique added no value while consuming 5x more tokens.
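One rough way to put a number on this kind of overhead is to compare response lengths. The sketch below uses whitespace-separated word counts as a crude stand-in for tokens (it is not a tokenizer, and real token counts will differ). The step-by-step response is copied from the transcript above; the direct response is an assumption, since I did not record one for this question.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words; a crude proxy for tokens."""
    return len(text.split())


# Copied from the step-by-step transcript above.
cot_response = """Let me think through this step by step.
Step 1: Identify what is being asked.
- The question asks for the capital city of France.
Step 2: Retrieve geographical knowledge.
- France is a country in Western Europe.
- Its capital city is Paris.
Step 3: Verify the answer.
- Paris is indeed the capital of France.
- It is the largest city in France and houses the government.
The capital of France is Paris."""

# ASSUMPTION: a plausible direct reply, not taken from a real transcript.
direct_response = "The capital of France is Paris."

ratio = word_count(cot_response) / word_count(direct_response)
print(f"step-by-step: {word_count(cot_response)} words")
print(f"direct:       {word_count(direct_response)} words")
print(f"ratio:        {ratio:.1f}x")
```

Swapping in a real tokenizer would give more accurate numbers, but even this rough count makes the overhead visible.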

I tried creative writing next:

creative-writing-cot.txt
Me: Write a haiku about programming. Think through this step by step.
Claude: Let me approach this step by step.
Step 1: Understand the haiku format.
- Three lines
- Syllable pattern: 5-7-5
- Should capture a moment or feeling
Step 2: Brainstorm programming themes.
- Debugging
- Late nights coding
- Success after failure
- Coffee
Step 3: Draft the haiku.
Line 1 (5 syllables): Late night, screen aglow
Line 2 (7 syllables): Bug hides deep in nested code
Line 3 (5 syllables): Coffee, then success
Final haiku:
Late night, screen aglow
Bug hides deep in nested code
Coffee, then success

The chain-of-thought made the creative process feel mechanical. The haiku was fine, but I preferred the spontaneous version:

creative-writing-direct.txt
Me: Write a haiku about programming.
Claude:
Cursor blinks and waits,
Semicolon missing again—
Found it, code compiles.

More natural. Less formulaic.

The Pattern Emerges

After dozens of tests, I noticed a clear pattern:

task-comparison.txt
┌────────────────────────┬────────────┬─────────────────────┐
│ Task Type              │ CoT Helps? │ Why                 │
├────────────────────────┼────────────┼─────────────────────┤
│ Math problems          │ YES        │ Shows calculation   │
│ Logic puzzles          │ YES        │ Reveals reasoning   │
│ Debugging              │ YES        │ Traces the logic    │
│ Analysis tasks         │ YES        │ Structures thought  │
├────────────────────────┼────────────┼─────────────────────┤
│ Simple facts           │ NO         │ Adds verbosity      │
│ Creative writing       │ NO         │ Feels mechanical    │
│ Summarization          │ NO         │ No reasoning needed │
│ Simple code generation │ NO         │ Slows output        │
└────────────────────────┴────────────┴─────────────────────┘

Chain-of-thought helps when:

  1. The task requires multiple reasoning steps
  2. You want to verify the logic (spot errors early)
  3. Intermediate conclusions matter
  4. The problem is genuinely complex

It hurts when:

  1. The answer is straightforward
  2. Creativity matters more than structure
  3. You want concise output
  4. Token efficiency is important
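The pattern above can be turned into a small helper that appends the phrase only where it helped in my tests. This is a sketch rather than a recommendation: the task-type keys are labels I made up for illustration, the verdicts come straight from the comparison table, and defaulting unknown task types to "no suffix" is an assumption.

```python
# Verdicts copied from the comparison table; the snake_case keys are
# illustrative labels, not part of any API.
COT_HELPS = {
    "math": True,
    "logic_puzzle": True,
    "debugging": True,
    "analysis": True,
    "simple_fact": False,
    "creative_writing": False,
    "summarization": False,
    "simple_code": False,
}

COT_SUFFIX = "Think through this step by step."


def build_prompt(question: str, task_type: str) -> str:
    """Append the step-by-step phrase only for task types where it helped."""
    # ASSUMPTION: unknown task types get no suffix.
    if COT_HELPS.get(task_type, False):
        return f"{question} {COT_SUFFIX}"
    return question


print(build_prompt("What is the capital of France?", "simple_fact"))
print(build_prompt("How much change do I get from $20 for 5 apples at $2 each?", "math"))
```

In practice the hard part is classifying the task, which this sketch leaves to the caller.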

Why Does This Work?

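The technique traces back to two 2022 papers. Wei et al. showed that LLMs perform better on reasoning benchmarks when the prompt includes worked examples that show their reasoning. Kojima et al. then showed that simply appending “Let’s think step by step” gives a similar boost without any examples.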

The insight is simple: LLMs are next-token predictors. They generate better reasoning when they can “see” their own intermediate steps. Without explicit instruction to reason step-by-step, the model might jump to an answer without working through the logic.

Conceptually, the difference looks like this:

reasoning-flow.txt
WITHOUT chain-of-thought:
  Question → [Black Box] → Answer
  High chance of skipping steps
WITH chain-of-thought:
  Question → [Step 1] → [Step 2] → [Step 3] → Answer
                ↓          ↓          ↓
             Intermediate tokens become context
             for subsequent reasoning steps

Each step generates tokens that help the model “think through” the problem. The intermediate steps are not just output—they become part of the context that shapes subsequent reasoning.
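To make the "intermediate steps become context" idea concrete, here is a toy sketch. `toy_next_step` is a hard-coded stand-in that I wrote for illustration, not a real model and not how Claude works internally. The point is only the loop structure: each generated step is appended to the context, and the next step can read it. The numbers come from the apple problem earlier in the post.

```python
import re
from typing import Callable

QUESTION = (
    "A store sells apples for $2 each. If I buy 5 apples and pay with a "
    "$20 bill, how much change do I get?"
)


def toy_next_step(context: str) -> str:
    """Stand-in for a model call (NOT a real LLM): reads the context so far."""
    match = re.search(r"Total cost = \$(\d+)", context)
    if match is None:
        # Nothing worked out yet, so write the intermediate result first.
        return f"Total cost = ${2 * 5}"
    # The intermediate result is now in the context and can be reused.
    total_cost = int(match.group(1))
    return f"Answer: ${20 - total_cost} in change"


def generate(question: str, next_step: Callable[[str], str], max_steps: int = 5) -> str:
    """Append each generated step to the context before producing the next."""
    context = question
    for _ in range(max_steps):
        step = next_step(context)
        context += "\n" + step
        if step.startswith("Answer:"):
            break
    return context


transcript = generate(QUESTION, toy_next_step)
print(transcript)
```

The final step never recomputes the total cost. It reads it back from the text that the previous step wrote, which is the mechanism the diagram describes.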

The Verification Problem

One benefit I discovered: chain-of-thought lets me catch errors mid-reasoning.

average-speed-calculation.txt
Me: If a train travels 120 miles in 2 hours, then stops for 30 minutes, then
travels another 90 miles in 1.5 hours, what is the average speed for the entire
journey? Think step by step.
Claude: Let me work through this step by step.
Step 1: Calculate total distance.
- First leg: 120 miles
- Second leg: 90 miles
- Total distance = 120 + 90 = 210 miles
Step 2: Calculate total time.
- First leg: 2 hours
- Stop: 30 minutes = 0.5 hours
- Second leg: 1.5 hours
- Total time = 2 + 0.5 + 1.5 = 4 hours
Step 3: Calculate average speed.
- Average speed = Total distance / Total time
- Average speed = 210 miles / 4 hours = 52.5 mph
The average speed is 52.5 mph.

Without seeing the steps, I couldn’t verify whether the stop time was included. With chain-of-thought, I can confirm that the 30-minute stop was correctly added to the total time.
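The stop matters more than it might seem. The short check below reproduces the arithmetic from the transcript and also computes what the answer would have been if the stop had been left out; the variable names are my own.

```python
# (miles, hours) for each leg of the journey, taken from the prompt.
legs = [(120, 2.0), (90, 1.5)]
stop_hours = 0.5  # the 30-minute stop

total_distance = sum(miles for miles, _ in legs)
moving_time = sum(hours for _, hours in legs)

# Average speed over the entire journey counts the stop.
avg_with_stop = total_distance / (moving_time + stop_hours)

# The tempting mistake: dividing by moving time only.
avg_without_stop = total_distance / moving_time

print(f"with stop:    {avg_with_stop} mph")
print(f"without stop: {avg_without_stop} mph")
```

Including the stop gives 52.5 mph, matching Claude's answer, while leaving it out gives 60 mph. A bare "60 mph" or "52.5 mph" would not tell me which interpretation the model used; the written steps do.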

The Classic Test That Proves It

The most compelling evidence came from the cognitive reflection test—the famous “bat and ball” problem:

cognitive-reflection-test.txt
# WITHOUT chain-of-thought
Me: A bat and ball cost $1.10 total. The bat costs $1 more than the ball.
How much does the ball cost?
Claude: The ball costs $0.10.
[WRONG - this is the intuitive but incorrect answer]
# WITH chain-of-thought
Me: A bat and ball cost $1.10 total. The bat costs $1 more than the ball.
Think through this step by step.
Claude: Let me work through this step by step.
Step 1: Set up variables.
- Let ball = x
- Then bat = x + $1
Step 2: Create equation.
- Total: ball + bat = $1.10
- x + (x + $1) = $1.10
Step 3: Solve.
- 2x + $1 = $1.10
- 2x = $0.10
- x = $0.05
The ball costs $0.05.
[CORRECT]

This was the key finding: in this run, chain-of-thought prevented the model from jumping to the intuitive-but-wrong answer. Once it had to write out the equation, the correct answer followed directly.
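Both candidate answers are easy to check against the two conditions in the problem. This sketch works in whole cents to avoid floating-point rounding; the function name and parameters are illustrative.

```python
def is_consistent(ball_cents: int, total_cents: int = 110, difference_cents: int = 100) -> bool:
    """Check a proposed ball price against both conditions of the puzzle."""
    bat_cents = ball_cents + difference_cents      # bat costs $1 more than the ball
    return ball_cents + bat_cents == total_cents   # together they cost $1.10


# Solving x + (x + 100) = 110 for x, in cents.
ball_cents = (110 - 100) // 2

print(f"solved ball price: {ball_cents} cents")
print(f"intuitive answer (10 cents) consistent? {is_consistent(10)}")
print(f"worked answer (5 cents) consistent?     {is_consistent(5)}")
```

The intuitive 10 cents fails because the bat would then cost $1.10 on its own, for a total of $1.20.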

The Verdict

After testing chain-of-thought prompting extensively, here’s what I learned:

Use it for:

  • Math word problems
  • Logic puzzles
  • Multi-step analysis
  • Debugging scenarios
  • Any task where intermediate reasoning matters

Skip it for:

  • Simple factual queries
  • Creative writing
  • Straightforward code generation
  • Summarization
  • Tasks where conciseness matters

The technique isn’t magic. It’s a tool that works for specific problems. Applying it universally is like using a sledgehammer for every nail—you’ll get results, but you’ll also cause unnecessary damage.

Summary

In this post, I tested chain-of-thought prompting (“think step by step”) across multiple task types to see if it actually improves Claude’s outputs. The answer: it depends on the task complexity.

Chain-of-thought improves outputs for multi-step reasoning tasks by forcing the model to generate intermediate steps that become context for subsequent reasoning. For simple queries, it adds verbosity without value. The bat-and-ball test was the clearest example: without chain-of-thought, Claude gave the wrong intuitive answer; with it, the model worked through the math correctly.

The key insight is that chain-of-thought isn’t a universal quality booster. It’s a tool for eliciting multi-step reasoning and making it visible. Use it selectively, test both approaches, and measure the actual benefit for your use case.
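Measuring the benefit can be as simple as running the same questions both ways and comparing accuracy and length. The harness below is a minimal sketch under several assumptions: `ask` is any function that takes a prompt and returns a reply (you would plug in your own client), correctness is judged by a naive substring match, and `canned_ask` returns shortened versions of the bat-and-ball replies from this post instead of calling a model.

```python
from typing import Callable

COT_SUFFIX = " Think through this step by step."


def compare(cases: list, ask: Callable[[str], str]) -> dict:
    """Run each (question, expected_substring) case with and without the suffix."""
    results = {}
    for label, suffix in (("direct", ""), ("cot", COT_SUFFIX)):
        correct = 0
        words = 0
        for question, expected in cases:
            reply = ask(question + suffix)
            correct += expected in reply   # naive check: expected text appears in reply
            words += len(reply.split())    # crude length proxy, not a token count
        results[label] = {
            "accuracy": correct / len(cases),
            "avg_words": words / len(cases),
        }
    return results


def canned_ask(prompt: str) -> str:
    """HYPOTHETICAL stand-in for a model call, using replies from this post."""
    if prompt.endswith(COT_SUFFIX):
        return "Let ball = x, then bat = x + $1. 2x + $1 = $1.10. The ball costs $0.05."
    return "The ball costs $0.10."


cases = [
    (
        "A bat and ball cost $1.10 total. The bat costs $1 more than the ball. "
        "How much does the ball cost?",
        "$0.05",
    ),
]

results = compare(cases, canned_ask)
print(results)
```

A single case proves little, so a real comparison needs a larger set of questions per task type and, ideally, several runs per question.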

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

