How to Write Custom Superpowers Skills
The Problem
I wrote a skill for my AI coding assistant. I documented everything carefully—a clear process, helpful examples, all the edge cases I could think of. Then I tested it.
The agent ignored half the steps. It invented shortcuts I never anticipated. When pressed, it rationalized: “The skill is about spirit, not ritual.”
My documentation was clear to me. But I wrote it after solving the problem once. I never watched an agent try to use it.
What Is a Skill?
In Superpowers, a skill is a reference guide for proven techniques, patterns, or tools. Skills help future Claude instances find and apply effective approaches.
Skills ARE:
- Reusable techniques
- Patterns for thinking
- Reference guides
- Tool documentation

Skills are NOT:
- Narratives about how you solved something once
- Project-specific conventions (those go in CLAUDE.md)
- Things you can automate with regex/validation

The key insight from the writing-skills skill:
Writing skills IS Test-Driven Development applied to process documentation.
The TDD Mapping
This is the core idea. Creating a skill follows the exact same cycle as writing code:
| TDD Concept | Skill Creation |
|---------------------|-------------------------------------|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |

The Iron Law is the same:
NO SKILL WITHOUT A FAILING TEST FIRST

When to Create a Skill
Not everything needs a skill. Here’s the decision matrix:
CREATE when:
- Technique wasn't intuitively obvious to you
- You'd reference this again across projects
- Pattern applies broadly (not project-specific)
- Others would benefit

DON'T create for:
- One-off solutions
- Standard practices well-documented elsewhere
- Project-specific conventions (put in CLAUDE.md)
- Mechanical constraints (automate with regex instead)

Directory Structure
Skills live in ~/.claude/skills/ with a flat namespace:
skills/
  skill-name/
    SKILL.md              # Main reference (required)
    supporting-file.*     # Only if needed (100+ lines or tools)

Flat namespace means all skills are in one searchable directory. No subdirectories by category.
Create separate files only for:
- Heavy reference (100+ lines) like API docs
- Reusable tools like scripts or templates
Everything else stays inline in SKILL.md.
SKILL.md Structure
The skill file has two parts: frontmatter and body.
Frontmatter (YAML):
---
name: skill-name-with-hyphens
description: Use when [specific triggering conditions]
---

Rules for frontmatter:
- Only two fields: name and description
- Max 1024 characters total
- name: Letters, numbers, hyphens only
- description: Third-person, “Use when…” format
Critical trap: The description must describe ONLY triggering conditions. Never summarize the skill’s workflow.
Why? Testing showed that when a description summarizes workflow, Claude follows the description instead of reading the full skill. A description saying “code review between tasks” caused Claude to do ONE review, even though the skill’s flowchart clearly showed TWO reviews.
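The mechanical frontmatter rules are easy to check before any agent testing starts. Here is a minimal sketch, assuming the 1024-character limit applies to the name and description combined (the rule above does not say exactly what is counted). The `Frontmatter` type and `validateFrontmatter` function are hypothetical names for illustration, not part of Superpowers.

```typescript
// Hypothetical helper that checks the frontmatter rules listed above.
interface Frontmatter {
  name: string;
  description: string;
}

function validateFrontmatter(fm: Frontmatter): string[] {
  const problems: string[] = [];

  // Name: letters, numbers, and hyphens only.
  if (!/^[A-Za-z0-9-]+$/.test(fm.name)) {
    problems.push("name must contain only letters, numbers, and hyphens");
  }

  // Description: "Use when..." format.
  if (!/^Use when\b/.test(fm.description)) {
    problems.push('description should start with "Use when"');
  }

  // Assumption: the 1024-character budget covers name + description.
  if (fm.name.length + fm.description.length > 1024) {
    problems.push("frontmatter exceeds 1024 characters");
  }

  return problems;
}

const good = validateFrontmatter({
  name: "condition-based-waiting",
  description:
    "Use when tests have race conditions, timing dependencies, or pass/fail inconsistently",
});
const bad = validateFrontmatter({
  name: "async_test helpers",
  description: "For async testing",
});
console.log(good.length, bad.length); // prints: 0 2
```

A check like this cannot tell whether a description summarizes the workflow. That trap only shows up when you watch an agent use the skill.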
Body structure:
# Skill Name
## Overview
What is this? Core principle in 1-2 sentences.
## When to Use
Bullet list with SYMPTOMS and use cases
When NOT to use
## The Process
Steps or patterns (the actual content)
## Examples
One excellent example (not multi-language)
## Common Mistakes
What goes wrong + fixes

The RED Phase: Write Failing Test First
Before writing any skill, you MUST run a baseline scenario WITHOUT the skill.
This is “watch the test fail” in TDD terms. You need to see what agents naturally do.
1. Create pressure scenario
   - For discipline skills: combine multiple pressures (time + sunk cost + exhaustion)
   - For technique skills: realistic application scenario
2. Run scenario WITHOUT skill
   - Use a subagent for isolation
   - Document behavior VERBATIM
3. Document exactly:
   - What choices did they make?
   - What rationalizations did they use?
   - Which pressures triggered violations?

Example from TDD skill testing:
I asked an agent to implement a feature. Without the TDD skill, it immediately wrote production code. When reminded to write tests, it said: “I’ll add them after—it’s the same result.”
That rationalization became the basis for the skill’s counter-argument table.
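One way to keep baseline notes structured is to record each run and count how often each excuse shows up. The sketch below is only an illustration: the `BaselineRun` shape and `tallyRationalizations` are hypothetical names, and the sample data is made up to mirror the excuses quoted in this post, not taken from real transcripts.

```typescript
// Hypothetical record of one baseline run (scenario executed WITHOUT the skill).
interface BaselineRun {
  scenario: string;
  pressures: string[];        // e.g. "time", "sunk cost", "exhaustion"
  choices: string[];          // what the agent actually did
  rationalizations: string[]; // quoted verbatim from the transcript
}

// Count how often each verbatim rationalization appears across runs.
function tallyRationalizations(runs: BaselineRun[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const run of runs) {
    for (const quote of run.rationalizations) {
      counts.set(quote, (counts.get(quote) ?? 0) + 1);
    }
  }
  return counts;
}

// Illustrative sample data only.
const runs: BaselineRun[] = [
  {
    scenario: "Implement feature with a deadline in one hour",
    pressures: ["time"],
    choices: ["Wrote production code first"],
    rationalizations: ["I'll test after"],
  },
  {
    scenario: "Fix a small bug at the end of a long session",
    pressures: ["time", "exhaustion"],
    choices: ["Skipped the test"],
    rationalizations: ["Too simple to test", "I'll test after"],
  },
];

const counts = tallyRationalizations(runs);
console.log(counts.get("I'll test after")); // prints: 2
```

The excuses that recur most often are the first candidates for the skill's counter-argument table.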
The GREEN Phase: Write Minimal Skill
Now write the skill addressing those SPECIFIC failures. Not hypothetical cases—just what you observed.
1. Write skill addressing baseline failures
   - Name: active voice, verb-first (creating-skills, not skill-creation)
   - Description: triggering conditions only
   - Body: counter specific rationalizations
2. Run same scenarios WITH skill
   - Agent should now comply
3. If still failing, add more content
   - But only for failures you actually saw

The skill should be minimal. Every token counts because frequently-loaded skills appear in EVERY conversation.
Target word counts:
- Getting-started workflows: <150 words each
- Frequently-loaded skills: <200 words total
- Other skills: <500 words
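The word targets above are simple to check automatically. This is a rough sketch that assumes a "word" is any whitespace-separated token, since the targets do not define how words are counted. `SkillKind`, `countWords`, and `withinBudget` are hypothetical names.

```typescript
// The three categories from the target list above.
type SkillKind = "getting-started" | "frequently-loaded" | "other";

// Limits are exclusive: "<200 words" means 199 passes and 200 does not.
const WORD_LIMITS: Record<SkillKind, number> = {
  "getting-started": 150,
  "frequently-loaded": 200,
  "other": 500,
};

// Assumption: words are whitespace-separated tokens.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function withinBudget(text: string, kind: SkillKind): boolean {
  return countWords(text) < WORD_LIMITS[kind];
}

// A 200-word draft is too long for a frequently-loaded skill but fine otherwise.
const draft = Array(200).fill("word").join(" ");
console.log(withinBudget(draft, "frequently-loaded")); // prints: false
console.log(withinBudget(draft, "other")); // prints: true
```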
The REFACTOR Phase: Close Loopholes
Agents are smart. They find loopholes.
1. Run scenarios again
2. Watch for NEW rationalizations
3. Add explicit counters
4. Re-test until bulletproof

For discipline-enforcing skills, build these structures:
Rationalization table:
| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "It's about spirit not ritual" | Violating letter = violating spirit. |

Red flags list:
## Red Flags - STOP and Start Over
- Code before test
- "I already manually tested it"
- "This is different because..."
All of these mean: Delete code. Start over.

Explicit “no exceptions” clause:
Write code before test? Delete it. Start over.
**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Delete means delete

Common Rationalizations for Skipping Testing
Here’s what agents (and developers) say:
| Excuse | Reality |
|-----------------------------|---------------------------------------------------|
| "Skill is obviously clear" | Clear to you ≠ clear to others. Test it. |
| "It's just a reference" | References can have gaps. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "I'll test if problems" | Problems = agents can't use skill. Test BEFORE. |
| "No time to test" | Deploying untested skill wastes more time later. |
| "I'm confident it's good" | Overconfidence guarantees issues. |

All of these mean: Test before deploying. No exceptions.
A Real Example
I created a skill for condition-based waiting in async tests. Here’s how it went:
RED Phase:
I gave an agent a flaky test without the skill. The agent tried:
- Adding random delays (setTimeout(100))
- Increasing existing timeouts
- Adding sleep() calls
None of these fixed the underlying race condition. The agent said: “The test is timing-dependent, these things happen.”
GREEN Phase:
I wrote the skill with:
- Core principle: wait for conditions, not arbitrary times
- Pattern: waitFor(() => condition)
- One example in TypeScript
Testing with skill: The agent correctly identified the race condition and used condition-based waiting.
REFACTOR Phase:
New scenario: The agent used condition-based waiting but with the wrong condition. It waited for an element to exist but not to be visible.
I added a “common mistakes” section covering:
- Waiting for existence vs visibility
- Handling timeouts in conditions
- Debugging which condition is wrong
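The existence-versus-visibility mistake is easier to see in code. What follows is a minimal sketch, not the code from the skill itself: `waitFor` here is a hand-written polling helper with assumed default timings, and `FakeElement` is a hypothetical stand-in for a real DOM element.

```typescript
// Poll a condition until it holds or the timeout expires.
// Defaults (1000 ms timeout, 10 ms interval) are assumptions for this sketch.
async function waitFor(
  condition: () => boolean,
  timeoutMs = 1000,
  intervalMs = 10,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Stand-in for a UI element; a real test would query the DOM instead.
interface FakeElement {
  exists: boolean;
  visible: boolean;
}

async function demo(): Promise<boolean> {
  const button: FakeElement = { exists: false, visible: false };
  setTimeout(() => { button.exists = true; }, 20);  // mounted first
  setTimeout(() => { button.visible = true; }, 60); // becomes visible later

  // Waiting on `button.exists` alone would return while the button is still
  // hidden. The condition must describe the state the test actually needs.
  await waitFor(() => button.exists && button.visible);
  return button.visible;
}
```

Calling `demo()` resolves to `true` because the wait covers both flags. With only `button.exists` in the condition, the wait would end around the 20 ms mark, before the button is visible, and the result would be `false`.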
Skill Types and Testing Approaches
Different skills need different testing:
| Type | Examples | Test With |
|-----------|---------------------------|-----------------------------------|
| Discipline | TDD, verification | Pressure scenarios + combined pressures |
| Technique | condition-based-waiting, root-cause-tracing | Application scenarios + edge cases |
| Pattern | flatten-with-flags | Recognition + application + counter-examples |
| Reference | API docs, syntax guides | Retrieval + application + gap testing |

Claude Search Optimization (CSO)
Skills need to be discoverable. Future Claude finds skills by searching.
1. Rich description field:
# BAD: Too abstract
description: For async testing
# BAD: Summarizes workflow
description: Use for TDD - write test first, watch fail, implement
# GOOD: Triggering conditions only
description: Use when tests have race conditions, timing dependencies, or pass/fail inconsistently

2. Keyword coverage:
Include words Claude would search for:
- Error messages: “Hook timed out”, “ENOTEMPTY”
- Symptoms: “flaky”, “hanging”, “zombie”
- Tools: Actual commands, library names
3. Descriptive naming:
Use active voice, verb-first:
- creating-skills not skill-creation
- condition-based-waiting not async-test-helpers
- root-cause-tracing not debugging-techniques
The Deployment Checklist
After writing ANY skill, you MUST complete this process:
RED Phase:
[ ] Create pressure scenarios (3+ combined pressures for discipline)
[ ] Run scenarios WITHOUT skill - document verbatim
[ ] Identify patterns in rationalizations

GREEN Phase:
[ ] Name: letters, numbers, hyphens only
[ ] Frontmatter: only name and description (<1024 chars)
[ ] Description: "Use when..." with specific triggers
[ ] Address specific baseline failures
[ ] One excellent example
[ ] Run scenarios WITH skill - verify compliance

REFACTOR Phase:
[ ] Identify NEW rationalizations
[ ] Add explicit counters
[ ] Build rationalization table
[ ] Create red flags list
[ ] Re-test until bulletproof

Do NOT:
- Create multiple skills in batch without testing each
- Move to next skill before current is verified
- Skip testing because “batching is more efficient”
Deploying untested skills = deploying untested code.
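The core of the checklist can be expressed as a gate: a skill is ready only if at least one scenario failed without it and every scenario passes with it. The sketch below assumes you record a simple pass/fail per scenario. `ScenarioResult`, `Verdict`, and `deploymentVerdict` are hypothetical names, and judging whether the agent "complied" still takes a human reading the transcript.

```typescript
// Hypothetical summary of one pressure scenario, run twice.
interface ScenarioResult {
  scenario: string;
  compliedWithoutSkill: boolean; // RED: expected to be false
  compliedWithSkill: boolean;    // GREEN: expected to be true
}

type Verdict = "ready" | "no-failing-baseline" | "still-failing";

function deploymentVerdict(results: ScenarioResult[]): Verdict {
  // Iron Law: without an observed baseline failure there is no failing test,
  // so there is no evidence the skill teaches the right thing.
  if (!results.some((r) => !r.compliedWithoutSkill)) {
    return "no-failing-baseline";
  }
  // Any violation with the skill present means more REFACTOR work.
  if (results.some((r) => !r.compliedWithSkill)) {
    return "still-failing";
  }
  return "ready";
}

const verdict = deploymentVerdict([
  { scenario: "deadline pressure", compliedWithoutSkill: false, compliedWithSkill: true },
  { scenario: "sunk cost pressure", compliedWithoutSkill: false, compliedWithSkill: false },
]);
console.log(verdict); // prints: still-failing
```

An empty result list returns "no-failing-baseline", which matches the rule against deploying a skill that was never tested.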
What I Learned
After writing several skills following this process:
1. Baseline behavior is always surprising
I thought my skill was clear. Then I watched an agent interpret it completely differently. The gaps were in places I never suspected.
2. Rationalizations are predictable
The same excuses come up repeatedly: “spirit vs letter”, “too simple”, “I’ll do it after”. Building tables of these makes skills more robust.
3. Testing takes less time than you think
A 15-minute pressure scenario saves hours of debugging why an agent isn’t following your skill in production.
4. Minimal is better
Longer skills don’t help. They load into every conversation and burn context. One excellent example beats five mediocre ones.
DO and DON’T
DO
Run baseline scenarios before writing
Create pressure scenario with subagent
Run WITHOUT skill
Document behavior verbatim

Address specific failures
If agent said "I'll test after"
Add: "Tests written after prove nothing"

Build explicit counters
**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing

DON’T
Don’t summarize workflow in description
# This causes Claude to skip reading the skill
description: Use for TDD - write test, watch fail, implement

Don’t create multi-language examples
# WRONG
example-js.js, example-py.py, example-go.go
# RIGHT
One excellent example in the most relevant language

Don’t skip the REFACTOR phase
The first GREEN pass always has loopholes. Test again.
Summary
In this post, I explained how to write Superpowers skills using TDD:
- RED: Run pressure scenarios without the skill, document failures verbatim
- GREEN: Write minimal skill addressing those specific failures
- REFACTOR: Find new rationalizations, add counters, re-test
The core insight: If you didn’t watch an agent fail without the skill, you don’t know if the skill teaches the right thing.
Skills are documentation, and documentation needs the same rigor as code. Write the test first. Watch it fail. Then write the skill.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.