How Superpowers Enforces Test-Driven Development

Mar 18, 2026

The Problem

My AI coding assistant kept skipping tests. I’d ask it to implement a feature, and it would immediately write production code. When I reminded it to write tests, it would add them afterward.

Here’s a typical interaction:

User: "Add a retry mechanism for failed API calls"

Claude: [Immediately writes implementation]
export async function retryOperation(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await sleep(1000 * Math.pow(2, i));
    }
  }
}

User: "Write tests for this"

Claude: [Writes tests that pass immediately]

The problem? Tests written after implementation prove nothing. They might pass, but they don’t verify that the code actually works correctly. They only verify that the tests match whatever the code does.

I needed my AI assistant to follow real TDD: write the test first, watch it fail, then write the minimal code to make it pass.

What Is the TDD Skill?

Superpowers includes a test-driven-development skill that enforces the RED-GREEN-REFACTOR cycle. It doesn’t just suggest TDD—it forces it.

The skill has one core principle, stated as an Iron Law:

NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

This isn’t a suggestion. It’s a rule that the AI agent must follow.

How the Iron Law Works

When I ask for a feature, the TDD skill intercepts the request. Instead of jumping to implementation, it follows a strict sequence.

Step 1: Write a Failing Test (RED)

The skill guides the agent to write one minimal test that describes what should happen:

// Good: Clear name, tests real behavior, one thing
test('retries failed operations 3 times', async () => {
  let attempts = 0;
  const operation = () => {
    attempts++;
    if (attempts < 3) throw new Error('fail');
    return 'success';
  };

  const result = await retryOperation(operation);

  expect(result).toBe('success');
  expect(attempts).toBe(3);
});

The skill emphasizes:

Test name describes the behavior
Test tests one thing
Test uses realistic data and assertions

Step 2: Watch It Fail (MANDATORY)

This step is non-negotiable. The skill requires running the test and confirming failure:

npm test path/to/test.test.ts

The agent must confirm:

The test fails (not errors)
The failure message is expected
The failure is because the feature is missing (not typos or syntax errors)

If a test passes immediately, that’s a red flag. It means either:

The feature already exists
The test doesn’t actually test what you think

Step 3: Write Minimal Code (GREEN)

Only after watching the test fail can the agent write implementation. And the implementation must be minimal—just enough to pass:

async function retryOperation(operation) {
  // Just enough to pass the test
  let attempts = 0;
  while (attempts < 3) {
    try {
      return await operation();
    } catch (error) {
      attempts++;
    }
  }
}

The skill warns against:

Adding features the test doesn’t require
Adding improvements that aren’t tested
Writing “better” code that does more than needed

Step 4: Watch It Pass (MANDATORY)

Again, the skill requires running the test:

npm test path/to/test.test.ts

The agent confirms:

The new test passes
All other tests still pass

Step 5: Refactor (Optional)

After green, the skill allows refactoring:

Remove duplication
Improve names
Extract helpers

But refactoring is only allowed when tests are green. And after refactoring, tests must still pass.

Why Order Matters

The skill explicitly addresses common rationalizations developers use to skip TDD:

Excuse	Reality
”I’ll write tests after”	Tests passing immediately prove nothing
”Already manually tested”	Ad-hoc testing is not systematic testing
”Deleting X hours is wasteful”	Sunk cost fallacy—the code is wrong, delete it
”TDD is dogmatic”	TDD IS pragmatic—finds bugs before commit
”Tests after achieve same goals”	Tests-after = “what does this do?” Tests-first = “what should this do?”

The skill’s core insight:

“If you didn’t watch the test fail, you don’t know if it tests the right thing.”

This isn’t about being pedantic. It’s about proving that your test actually tests something.

Red Flags That Mean Start Over

The TDD skill identifies specific behaviors that indicate you’re not doing real TDD:

Red Flag 1: Code before test

❌ "I wrote the implementation first, now I'll add tests"
→ Delete the code. Start over with a failing test.

Red Flag 2: Test passes immediately

❌ "The test I just wrote passes!"
→ The test is broken or tests nothing. Delete it and start over.

Red Flag 3: “Keep as reference”

❌ "Let me keep the implementation as reference while writing the test"
→ This defeats the purpose. Delete it.

Red Flag 4: “I already manually tested it”

❌ "The code works, I just need to formalize it in tests"
→ Manual testing ≠ automated testing. Delete and start over.

Red Flag 5: “It’s about spirit not ritual”

❌ "TDD is a mindset, not strict rules"
→ The ritual exists for a reason. Follow it.

All of these mean the same thing: delete the code and start over with TDD.

The DELETE Protocol

When the TDD skill detects code-before-test violations, it follows a strict protocol:

┌─────────────────────┐
│  Detect Violation   │ ← Code exists without failing test
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   DELETE Code        │ ← Not "adapt", not "reference"
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Write Failing Test  │ ← Start from the beginning
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Verify RED         │ ← Watch it fail
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Implement GREEN    │ ← Minimal code only
└─────────────────────┘

The skill emphasizes: “Delete means delete.” Not “delete but keep a copy for reference.” Not “delete but remember the structure.” Delete.

A Real Example

I asked my AI assistant to add exponential backoff to the retry mechanism. Here’s how the TDD skill handled it:

Without TDD skill:

User: "Add exponential backoff to retryOperation"

Claude: [Immediately modifies the function]
export async function retryOperation(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await sleep(1000 * Math.pow(2, i)); // Added backoff
    }
  }
}

User: "Write tests for this"

Claude: [Writes test that passes immediately]
test('has exponential backoff', async () => {
  // This test passes because the code already exists!
});

With TDD skill:

User: "Add exponential backoff to retryOperation"

Claude: [TDD skill activated]
I need to add exponential backoff. Following TDD:

Step 1: Write failing test

test('waits longer between each retry', async () => {
  const delays = [];
  const operation = () => {
    delays.push(Date.now());
    throw new Error('fail');
  };

  try {
    await retryOperation(operation, 3);
  } catch (e) {}

  // Each delay should be longer than the previous
  expect(delays[1] - delays[0]).toBeLessThan(delays[2] - delays[1]);
});

Step 2: Verify test fails

npm test retry.test.ts
FAIL: retryOperation is not defined

Step 3: Implement minimal code

[Modifies retryOperation to add backoff]

Step 4: Verify test passes

npm test retry.test.ts
PASS

The difference is stark. With the TDD skill, I know the test actually verifies the behavior.

What I Learned

After using the TDD skill for several weeks, I noticed three key changes:

1. Tests are better

Tests written first focus on behavior. Tests written after tend to test implementation details.

// Tests what it SHOULD do
test('retries up to maxRetries times', async () => {
  // ...
});

// Tests what it DOES do
test('calls operation 3 times when maxRetries is 3', async () => {
  // This passes even if the retry logic is broken
});

2. Edge cases are caught earlier

When I write the test first, I think about edge cases:

What if maxRetries is 0?
What if the operation never succeeds?
What if the delay calculation overflows?

When I write tests after, I only test what the code already handles.

3. Debugging is faster

With test-first, when a bug appears, I write a failing test for it. Then I fix the code. The test proves the bug is fixed and prevents regression.

Without tests-first, I fix the bug, then write a test. The test might pass even if my fix is wrong.

DO and DON’T

DO

Watch every test fail before implementing

npm test path/to/test.test.ts
# Confirm: FAIL (expected)

Write minimal code to pass the test

// Just enough - no more, no less
return await operation();

Run all tests after refactoring

npm test
# All tests must pass

DON’T

Don’t keep “reference” code when starting over

❌ "I'll delete the code but keep a mental note of the structure"
→ No. Start fresh. Let the tests guide the design.

Don’t skip watching tests run

❌ "I'll just write the test and implement, no need to run it"
→ Running it proves the test works. Skipping defeats the purpose.

Don’t add untested features during GREEN

// ❌ WRONG: Adding features test doesn't require
async function retryOperation(operation, maxRetries = 3, logErrors = true) {
  // logErrors is not tested - don't add it yet
}

// ✅ RIGHT: Only what the test requires
async function retryOperation(operation, maxRetries = 3) {
  // Just the retry logic
}

Summary

In this post, I showed how Superpowers enforces test-driven development with an Iron Law: no production code without a failing test first. The key point is that tests written after implementation prove nothing—you need to watch the test fail to know it actually tests something.

The RED-GREEN-REFACTOR cycle is enforced through mandatory verification steps:

RED: Write a failing test and verify it fails
GREEN: Write minimal code and verify it passes
REFACTOR: Improve code while keeping tests green

Common rationalizations like “I’ll write tests after” or “keeping code as reference” are explicitly rejected. The skill forces you to delete and start over.

This isn’t about being dogmatic. It’s about the core insight: if you didn’t watch the test fail, you don’t know if it tests the right thing.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!