
How to Write Custom Superpowers Skills

The Problem

I wrote a skill for my AI coding assistant. I documented everything carefully—a clear process, helpful examples, all the edge cases I could think of. Then I tested it.

The agent ignored half the steps. It invented shortcuts I never anticipated. When pressed, it rationalized: “The skill is about spirit, not ritual.”

My documentation was clear to me. But I wrote it after solving the problem once. I never watched an agent try to use it.

What Is a Skill?

In Superpowers, a skill is a reference guide for proven techniques, patterns, or tools. Skills help future Claude instances find and apply effective approaches.

Skill definition
Skills ARE:
- Reusable techniques
- Patterns for thinking
- Reference guides
- Tool documentation
Skills are NOT:
- Narratives about how you solved something once
- Project-specific conventions (those go in CLAUDE.md)
- Things you can automate with regex/validation

The key insight from the writing-skills skill:

Writing skills IS Test-Driven Development applied to process documentation.

The TDD Mapping

This is the core idea. Creating a skill follows the same RED, GREEN, REFACTOR cycle as writing code test-first:

TDD to Skill Creation mapping
| TDD Concept | Skill Creation |
|---------------------|----------------------------------------------|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |

The Iron Law is the same:

The Iron Law
NO SKILL WITHOUT A FAILING TEST FIRST
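
To make the mapping concrete, here is a minimal Python sketch of the cycle as a gate. It is only an illustration: `run_scenario` is a hypothetical stand-in for however you launch a subagent on a pressure scenario, and the phase labels are names I made up for this example, not part of Superpowers.

```python
from typing import Callable, Optional

# Hypothetical stand-in for launching a subagent on a pressure scenario.
# It receives the skill text (or None for the baseline run) and returns
# True if the agent complied with the rule under test.
ScenarioRunner = Callable[[Optional[str]], bool]


def skill_cycle(run_scenario: ScenarioRunner, skill_text: str) -> str:
    """Return the phase reached for one scenario."""
    # RED: the baseline must fail, otherwise there is nothing to teach.
    if run_scenario(None):
        return "no-red"  # agent already complies without the skill
    # GREEN: the same scenario must pass once the skill is present.
    if not run_scenario(skill_text):
        return "red"  # the skill does not yet fix the observed failure
    return "green"


# Toy runner: the fake agent complies only if the skill says "test first".
def toy_runner(skill: Optional[str]) -> bool:
    return skill is not None and "test first" in skill


print(skill_cycle(toy_runner, "Write the test first. No exceptions."))
```

The point of the `no-red` branch is the Iron Law itself: if the baseline never fails, you have no evidence the skill is needed.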

When to Create a Skill

Not everything needs a skill. Here’s the decision matrix:

When to create vs not create skills
CREATE when:
- Technique wasn't intuitively obvious to you
- You'd reference this again across projects
- Pattern applies broadly (not project-specific)
- Others would benefit
DON'T create for:
- One-off solutions
- Standard practices well-documented elsewhere
- Project-specific conventions (put in CLAUDE.md)
- Mechanical constraints (automate with regex instead)

Directory Structure

Skills live in ~/.claude/skills/ with a flat namespace:

Skill directory structure
skills/
  skill-name/
    SKILL.md              # Main reference (required)
    supporting-file.*     # Only if needed (100+ lines or tools)

Flat namespace means all skills are in one searchable directory. No subdirectories by category.

Create separate files only for:

  • Heavy reference (100+ lines) like API docs
  • Reusable tools like scripts or templates

Everything else stays inline in SKILL.md.
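
As a rough sketch, the layout above could be scaffolded with a few lines of Python. The `scaffold_skill` helper is hypothetical (it is not part of Superpowers), and the demo writes to a temporary directory rather than ~/.claude/skills/.

```python
import tempfile
from pathlib import Path


def scaffold_skill(skills_root: Path, name: str, description: str) -> Path:
    """Create skills_root/<name>/SKILL.md and return the path to SKILL.md."""
    skill_dir = skills_root / name  # flat namespace: no category subdirectories
    skill_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a skill
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(
        f"---\nname: {name}\ndescription: {description}\n---\n\n# {name}\n",
        encoding="utf-8",
    )
    return skill_file


# Demo in a temporary directory so nothing under ~/.claude is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = scaffold_skill(
        Path(tmp) / "skills",
        "condition-based-waiting",
        "Use when tests have race conditions or pass/fail inconsistently",
    )
    print(path.relative_to(tmp).as_posix())
```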

SKILL.md Structure

The skill file has two parts: frontmatter and body.

Frontmatter (YAML):

Frontmatter example
---
name: skill-name-with-hyphens
description: Use when [specific triggering conditions]
---

Rules for frontmatter:

  • Only two fields: name and description
  • Max 1024 characters total
  • name: Letters, numbers, hyphens only
  • description: Third-person, “Use when…” format

Critical trap: The description must describe ONLY triggering conditions. Never summarize the skill’s workflow.

Why? Testing showed that when a description summarizes workflow, Claude follows the description instead of reading the full skill. A description saying “code review between tasks” caused Claude to do ONE review, even though the skill’s flowchart clearly showed TWO reviews.

Body structure:

SKILL.md body template
# Skill Name
## Overview
What is this? Core principle in 1-2 sentences.
## When to Use
Bullet list with SYMPTOMS and use cases
When NOT to use
## The Process
Steps or patterns (the actual content)
## Examples
One excellent example (not multi-language)
## Common Mistakes
What goes wrong + fixes

The RED Phase: Write Failing Test First

Before writing any skill, you MUST run a baseline scenario WITHOUT the skill.

This is “watch the test fail” in TDD terms. You need to see what agents naturally do.

RED phase process
1. Create pressure scenario
   - For discipline skills: combine multiple pressures
     (time + sunk cost + exhaustion)
   - For technique skills: realistic application scenario
2. Run scenario WITHOUT skill
   - Use a subagent for isolation
   - Document behavior VERBATIM
3. Document exactly:
   - What choices did they make?
   - What rationalizations did they use?
   - Which pressures triggered violations?
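
One way to keep those notes structured is a small record per baseline run. The sketch below is an assumption about how you might store them, not something Superpowers prescribes. The `BaselineRun` fields mirror the three questions above, and the sample runs are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class BaselineRun:
    """One scenario run WITHOUT the skill, recorded verbatim."""
    scenario: str
    pressures: List[str]
    choices: List[str] = field(default_factory=list)
    rationalizations: List[str] = field(default_factory=list)  # exact quotes
    violated: bool = False


def pressures_behind_violations(runs: List[BaselineRun]) -> Counter:
    """Count how often each pressure was present in runs that broke the rule."""
    counts: Counter = Counter()
    for run in runs:
        if run.violated:
            counts.update(run.pressures)
    return counts


# Invented sample data, only to show the shape of the records.
runs = [
    BaselineRun(
        scenario="Implement feature with a deadline in one hour",
        pressures=["time", "sunk cost"],
        choices=["wrote production code first"],
        rationalizations=["I'll test after"],
        violated=True,
    ),
    BaselineRun(scenario="Implement feature, no deadline", pressures=[]),
]
print(pressures_behind_violations(runs).most_common())
```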

Example from TDD skill testing:

I asked an agent to implement a feature. Without the TDD skill, it immediately wrote production code. When reminded to write tests, it said: “I’ll add them after—it’s the same result.”

That rationalization became the basis for the skill’s counter-argument table.

The GREEN Phase: Write Minimal Skill

Now write the skill addressing those SPECIFIC failures. Not hypothetical cases—just what you observed.

GREEN phase process
1. Write skill addressing baseline failures
   - Name: active voice, verb-first (creating-skills, not skill-creation)
   - Description: triggering conditions only
   - Body: counter specific rationalizations
2. Run same scenarios WITH skill
   - Agent should now comply
3. If still failing, add more content
   - But only for failures you actually saw

The skill should be minimal. Every token counts because frequently-loaded skills appear in EVERY conversation.

Target word counts:

  • Getting-started workflows: <150 words each
  • Frequently-loaded skills: <200 words total
  • Other skills: <500 words
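
A quick word count keeps you honest about these targets. This is a rough sketch: the category keys are names I made up, and it assumes the budget applies to the body after the frontmatter, which the targets above do not actually specify.

```python
from typing import Dict

# Budgets from the targets above; the category keys are invented for this sketch.
WORD_BUDGETS: Dict[str, int] = {
    "getting-started": 150,
    "frequently-loaded": 200,
    "other": 500,
}


def body_word_count(skill_md: str) -> int:
    """Count whitespace-separated words after the frontmatter block."""
    parts = skill_md.split("---", 2)
    # Naive split: assumes '---' appears only as the frontmatter delimiters.
    body = parts[2] if skill_md.startswith("---") and len(parts) == 3 else skill_md
    return len(body.split())


def within_budget(skill_md: str, category: str) -> bool:
    """True if the body is strictly under the budget for its category."""
    return body_word_count(skill_md) < WORD_BUDGETS[category]


skill = (
    "---\nname: x\ndescription: Use when y\n---\n"
    "# Title\nWait for conditions, not arbitrary times.\n"
)
print(body_word_count(skill), within_budget(skill, "frequently-loaded"))
```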

The REFACTOR Phase: Close Loopholes

Agents are smart. They find loopholes.

REFACTOR phase process
1. Run scenarios again
2. Watch for NEW rationalizations
3. Add explicit counters
4. Re-test until bulletproof

For discipline-enforcing skills, build these structures:

Rationalization table:

Example rationalization table
| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "It's about spirit not ritual" | Violating letter = violating spirit. |

Red flags list:

Example red flags
## Red Flags - STOP and Start Over
- Code before test
- "I already manually tested it"
- "This is different because..."
All of these mean: Delete code. Start over.

Explicit “no exceptions” clause:

No exceptions example
Write code before test? Delete it. Start over.
**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Delete means delete

Common Rationalizations for Skipping Testing

Here’s what agents (and developers) say:

Rationalizations and reality
| Excuse | Reality |
|-----------------------------|---------------------------------------------------|
| "Skill is obviously clear" | Clear to you ≠ clear to others. Test it. |
| "It's just a reference" | References can have gaps. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "I'll test if problems appear" | Problems = agents can't use skill. Test BEFORE. |
| "No time to test" | Deploying untested skill wastes more time later. |
| "I'm confident it's good" | Overconfidence guarantees issues. |

All of these mean: Test before deploying. No exceptions.

A Real Example

I created a skill for condition-based waiting in async tests. Here’s how it went:

RED Phase:

I gave an agent a flaky test without the skill. The agent tried:

  • Adding arbitrary fixed delays (a setTimeout with a 100 ms delay)
  • Increasing existing timeouts
  • Adding sleep() calls

None of these fixed the underlying race condition. The agent said: “The test is timing-dependent, these things happen.”

GREEN Phase:

I wrote the skill with:

  • Core principle: wait for conditions, not arbitrary times
  • Pattern: waitFor(() => condition)
  • One example in TypeScript

Testing with skill: The agent correctly identified the race condition and used condition-based waiting.
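
The skill's own example was in TypeScript and is not reproduced here. To show the idea, here is a Python analogue of the pattern. The `wait_for` helper, its default timeout and interval, and the toy `Job` class are all assumptions made for this sketch.

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    timeout: float = 2.0,
    interval: float = 0.01,
) -> None:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)


# Toy stand-in for asynchronous work: becomes ready on the third check.
class Job:
    def __init__(self) -> None:
        self.checks = 0

    def ready(self) -> bool:
        self.checks += 1
        return self.checks >= 3


job = Job()
wait_for(job.ready)  # returns as soon as the condition holds, no fixed sleep
print(job.checks)
```

The polling interval is still a sleep, but the test's outcome no longer depends on guessing how long the work takes. It proceeds once the condition holds and fails loudly if it never does.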

REFACTOR Phase:

New scenario: The agent used condition-based waiting but with the wrong condition. It waited for an element to exist but not to be visible.

I added a “common mistakes” section covering:

  • Waiting for existence vs visibility
  • Handling timeouts in conditions
  • Debugging which condition is wrong

Skill Types and Testing Approaches

Different skills need different testing:

Skill types and test approaches
| Type | Examples | Test With |
|------------|---------------------------------------------|------------------------------------------------|
| Discipline | TDD, verification | Pressure scenarios + combined pressures |
| Technique | condition-based-waiting, root-cause-tracing | Application scenarios + edge cases |
| Pattern | flatten-with-flags | Recognition + application + counter-examples |
| Reference | API docs, syntax guides | Retrieval + application + gap testing |

Claude Search Optimization (CSO)

Skills need to be discoverable. Future Claude finds skills by searching.

1. Rich description field:

Description comparison
# BAD: Too abstract
description: For async testing
# BAD: Summarizes workflow
description: Use for TDD - write test first, watch fail, implement
# GOOD: Triggering conditions only
description: Use when tests have race conditions, timing dependencies,
  or pass/fail inconsistently

2. Keyword coverage:

Include words Claude would search for:

  • Error messages: “Hook timed out”, “ENOTEMPTY”
  • Symptoms: “flaky”, “hanging”, “zombie”
  • Tools: Actual commands, library names
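
A simple way to audit keyword coverage is to list the terms you expect someone to search for and check which ones never appear in the skill. This sketch uses plain case-insensitive substring matching. How Claude actually searches is not specified here, so treat it as a rough proxy. `missing_keywords` is a hypothetical helper.

```python
from typing import Iterable, List


def missing_keywords(skill_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected search terms that never appear in the skill text."""
    haystack = skill_text.lower()
    return [term for term in expected if term.lower() not in haystack]


text = "Use when tests are flaky or fail with 'Hook timed out'."
print(missing_keywords(text, ["flaky", "Hook timed out", "ENOTEMPTY"]))
```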

3. Descriptive naming:

Use active voice, verb-first:

  • creating-skills not skill-creation
  • condition-based-waiting not async-test-helpers
  • root-cause-tracing not debugging-techniques

The Deployment Checklist

After writing ANY skill, you MUST complete this process:

Skill deployment checklist
RED Phase:
[ ] Create pressure scenarios (3+ combined pressures for discipline skills)
[ ] Run scenarios WITHOUT skill - document verbatim
[ ] Identify patterns in rationalizations
GREEN Phase:
[ ] Name: letters, numbers, hyphens only
[ ] Frontmatter: only name and description (<1024 chars)
[ ] Description: "Use when..." with specific triggers
[ ] Address specific baseline failures
[ ] One excellent example
[ ] Run scenarios WITH skill - verify compliance
REFACTOR Phase:
[ ] Identify NEW rationalizations
[ ] Add explicit counters
[ ] Build rationalization table
[ ] Create red flags list
[ ] Re-test until bulletproof
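
If you keep the checklist as a text file per skill, a few lines of Python can act as a deploy gate. This is a sketch under my own assumptions: done items are marked `[x]`, open items `[ ]`, and both helper names are hypothetical.

```python
from typing import List


def unchecked_items(checklist: str) -> List[str]:
    """Return the text of every '[ ]' item; '[x]' items count as done."""
    return [
        line.strip()[3:].strip()
        for line in checklist.splitlines()
        if line.strip().startswith("[ ]")
    ]


def ready_to_deploy(checklist: str) -> bool:
    """True only when no open items remain."""
    return not unchecked_items(checklist)


progress = (
    "RED Phase:\n"
    "[x] Create pressure scenarios\n"
    "[x] Run scenarios WITHOUT skill - document verbatim\n"
    "GREEN Phase:\n"
    "[ ] Run scenarios WITH skill - verify compliance\n"
)
print(unchecked_items(progress), ready_to_deploy(progress))
```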

Do NOT:

  • Create multiple skills in batch without testing each
  • Move to next skill before current is verified
  • Skip testing because “batching is more efficient”

Deploying untested skills = deploying untested code.

What I Learned

After writing several skills following this process:

1. Baseline behavior is always surprising

I thought my skill was clear. Then I watched an agent interpret it completely differently. The gaps were in places I never suspected.

2. Rationalizations are predictable

The same excuses come up repeatedly: “spirit vs letter”, “too simple”, “I’ll do it after”. Building tables of these makes skills more robust.

3. Testing takes less time than you think

A 15-minute pressure scenario saves hours of debugging why an agent isn’t following your skill in production.

4. Minimal is better

Longer skills don’t help. They load into every conversation and burn context. One excellent example beats five mediocre ones.

DO and DON’T

DO

Run baseline scenarios before writing

Run baseline scenarios
Create pressure scenario with subagent
Run WITHOUT skill
Document behavior verbatim

Address specific failures

If agent said "I'll test after"
Add: "Tests written after prove nothing"

Build explicit counters

**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing

DON’T

Don’t summarize workflow in description

Bad description example
# This causes Claude to skip reading the skill
description: Use for TDD - write test, watch fail, implement

Don’t create multi-language examples

Multi-language example anti-pattern
# WRONG
example-js.js, example-py.py, example-go.go
# RIGHT
One excellent example in the most relevant language

Don’t skip the REFACTOR phase

The first GREEN pass always has loopholes. Test again.

Summary

In this post, I explained how to write Superpowers skills using TDD:

  1. RED: Run pressure scenarios without the skill, document failures verbatim
  2. GREEN: Write minimal skill addressing those specific failures
  3. REFACTOR: Find new rationalizations, add counters, re-test

The core insight: If you didn’t watch an agent fail without the skill, you don’t know if the skill teaches the right thing.

Skills are documentation, and documentation needs the same rigor as code. Write the test first. Watch it fail. Then write the skill.

Final Words + More Resources

I wrote this article to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.

