
How to Build AI Agent Skills and Workflows That Outlive Your Best Model

Problem

I’ve been using AI coding assistants daily for about a year. Every few months, a new model drops that’s noticeably better than the last one. Each time, I’d start from scratch — rewriting prompts, rebuilding workflows, re-discovering what works.

Then the model I relied on would go away (API changes, pricing shifts, access revoked), and I’d have nothing left. All that accumulated know-how, tied to a single model, gone.

I’m not alone in this. A Reddit thread on r/ClaudeCode captured exactly what I was feeling:

“Have it create systems specifically for you and what you’re doing that will carry on long after you’ve lost access to it.” — person-pitch

Another user put it even more directly:

“Scan skills workflows hooks and evaluate custom tools etc. so that even when fable is gone I will have some benefits.” — stackfullofdreams

The core problem: we treat AI interactions like chat conversations, not like software engineering. We don’t version-control our prompting strategies. We don’t abstract model-specific quirks behind interfaces. We don’t build infrastructure that outlives the current best model.

The Solution — Agent Skill Architecture

I spent the last few weeks restructuring how I work with AI agents. Instead of treating each session as a blank slate, I built a three-layer system that separates planning from execution, and encodes my best practices into reusable, model-agnostic modules.

Three-layer agent skill architecture
┌──────────────────────────────────────────────────────┐
│ Layer 3: Workflow Orchestration                      │
│ Plan → Audit → Execute → Validate (multi-model)      │
├──────────────────────────────────────────────────────┤
│ Layer 2: Harness / Infra                             │
│ Skill loader, model router, validation hooks,        │
│ fallback logic, audit trail                          │
├──────────────────────────────────────────────────────┤
│ Layer 1: Skill Repository                            │
│ Version-controlled, structured skill definitions     │
│ Each skill: purpose, inputs, outputs, error modes    │
└──────────────────────────────────────────────────────┘

Layer 1 — Skill Repository

This is the foundation: a directory of structured files, each of which defines one skill. Each skill has clear boundaries: what it does, what it needs, what it produces, and how it can fail.

I store these in a git repo under skills/. Here’s what a skill definition looks like:

skills/audit-workflow.md
## skill: audit-workflow
purpose: Scan a codebase for known anti-patterns and dependency issues
inputs:
- path: source code directory
- model_tier: Fable | Opus
outputs:
- audit_report.md
- fix_plan.md
validation:
- cross-reference findings against yesterday's baseline
- flag any new warnings introduced
known_failure_modes:
- model context overflow for repos >10k files → split by module
- hallucinated vulnerabilities → require evidence link for each finding
compatible_with: Fable, Opus 4.8, GPT 5.5
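A definition in this shape is simple enough to load with a few lines of code. The following is a minimal sketch, assuming the flat `key: value` / `- item` layout shown above. `Skill` and `parse_skill` are hypothetical names chosen for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """In-memory form of one skill definition (hypothetical structure)."""
    name: str = ""
    purpose: str = ""
    compatible_with: list = field(default_factory=list)
    sections: dict = field(default_factory=dict)


def parse_skill(text: str) -> Skill:
    """Parse the flat 'key: value' / '- item' layout used in the example."""
    skill = Skill()
    current = None  # name of the list section currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## skill:"):
            skill.name = line.split(":", 1)[1].strip()
        elif line.startswith("- "):
            if current is not None:
                skill.sections[current].append(line[2:].strip())
        elif ":" not in line:
            continue  # blank lines, captions, anything that is not a field
        else:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if not value:  # a bare 'inputs:' opens a list section
                current = key
                skill.sections[current] = []
                continue
            current = None
            if key == "purpose":
                skill.purpose = value
            elif key == "compatible_with":
                skill.compatible_with = [m.strip() for m in value.split(",")]
            else:
                skill.sections[key] = [value]
    return skill


EXAMPLE = """\
## skill: audit-workflow
purpose: Scan a codebase for known anti-patterns and dependency issues
inputs:
- path: source code directory
- model_tier: Fable | Opus
outputs:
- audit_report.md
- fix_plan.md
compatible_with: Fable, Opus 4.8, GPT 5.5
"""

skill = parse_skill(EXAMPLE)
print(skill.name, skill.compatible_with)
```

Keeping the parser this small is a deliberate choice in the sketch: the file stays readable by both humans and models, and the harness only needs the handful of fields it routes on.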

The key insight: I write these using the strongest model I have access to. Fable (or whatever the current top model is) architects the skill structure, documents edge cases, and defines the validation rules. Once written, a weaker model can execute the same skill by following the definition.

Skills are version-controlled, so I can see how they evolve. When a new model ships with unexpected behavior, I update the known_failure_modes and compatible_with fields. The skill itself stays stable.

Layer 2 — Harness/Infra

The harness is the runtime that loads skills, routes them to the right model, and enforces validation. I built mine as a Python CLI tool, but the pattern works in any language.

harness execution flow
User request
Skill resolver ──→ Load skill definition from repository
Model router   ──→ Check compatible_with, pick best available
      │                       │
      │                 ┌─────┴──────┐
      │                 ▼            ▼
      │           Fable available  Fable gone
      │                 │            │
      │                 ▼            ▼
      │           Route to Fable   Route to Opus 4.8
      │                            with fallback prompts
Execution      ──→ Run skill, capture output
Validation     ──→ Check output against skill's validation rules
Audit trail    ──→ Log model used, duration, errors, output quality

The important part is the model router. It doesn’t just pick the best model; it also adjusts prompts and parameters based on which model is being used. Opus 4.8 needs more explicit step-by-step guidance than Fable, and the harness handles this automatically.
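A minimal sketch of such a router, assuming a fixed preference order and a per-model block of extra guidance. Everything named here (`ModelProfile`, `PROFILES`, `route`, the guidance text) is hypothetical and only illustrates the pattern. It is not the actual harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Per-model settings the router applies (all values are illustrative)."""
    name: str
    extra_guidance: str = ""  # text appended to the prompt for this model


# Ordered from most to least preferred; the order itself is an assumption.
PROFILES = [
    ModelProfile("Fable"),
    ModelProfile(
        "Opus 4.8",
        extra_guidance="Work step by step and state each step explicitly.",
    ),
]


def route(compatible_with, available, base_prompt):
    """Pick the first preferred model that is both compatible and available,
    then adapt the prompt for it. Returns (model_name, prompt)."""
    for profile in PROFILES:
        if profile.name in compatible_with and profile.name in available:
            prompt = base_prompt
            if profile.extra_guidance:
                prompt = f"{base_prompt}\n\n{profile.extra_guidance}"
            return profile.name, prompt
    raise LookupError("no compatible model is currently available")


compatible = ["Fable", "Opus 4.8", "GPT 5.5"]
model, prompt = route(compatible, available={"Opus 4.8"}, base_prompt="Audit ./src")
print(model)
```

Because the model-specific text lives in the profile table rather than in the skill, removing a model from `available` changes the route without touching any skill definition.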

One user in the Reddit thread described doing exactly this:

“I just used it rewrite most of my skills, harness and agent infra. It did patch a good amount of holes in my setup.” — 1337NET

Layer 3 — Workflow Orchestration

The top layer chains multiple skills into pipelines. Each step in the pipeline can use a different model tier. This is where the real power shows up.

blog publishing workflow
Fable (planning tier)
┌─────▼──────┐
│ Plan       │ ← Uses strongest model for strategy
│ research   │
└─────┬──────┘
Opus 4.8 (execution tier)
┌─────▼──────┐
│ Content    │ ← Capable model follows the plan
│ create     │
└─────┬──────┘
Sonnet (validation tier)
┌─────▼──────┐
│ Validate   │ ← Cheaper model catches mistakes
│ & review   │
└─────┬──────┘
┌─────▼──────┐
│ Publish    │ ← Scripted, no AI needed
└────────────┘

The critical rule: never let a weaker model plan. Planning requires reasoning about trade-offs, anticipating edge cases, and making judgment calls. Let your strongest model do that. Execution is pattern-following — cheaper models excel at it if the plan is solid.
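The tiered pipeline can be sketched as a list of steps, each bound to a tier rather than to a model name. This is an illustration under assumptions: `Step`, `run_pipeline` and `fake_call_model` are hypothetical, and the stub stands in for whatever API client a real harness would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Tier-to-model mapping taken from the diagram; swap entries as models change.
TIERS = {"planning": "Fable", "execution": "Opus 4.8", "validation": "Sonnet"}


@dataclass
class Step:
    name: str
    tier: Optional[str]  # None marks a scripted step with no model call


PIPELINE = [
    Step("plan", "planning"),
    Step("create", "execution"),
    Step("validate", "validation"),
    Step("publish", None),
]


def run_pipeline(steps, call_model: Callable[[str, str, str], str], request: str):
    """Run steps in order, feeding each step's output into the next one.
    Returns the final output and a log of (step, model) pairs."""
    log, payload = [], request
    for step in steps:
        model = TIERS[step.tier] if step.tier else "script"
        if step.tier:
            payload = call_model(model, step.name, payload)
        log.append((step.name, model))
    return payload, log


def fake_call_model(model: str, step: str, payload: str) -> str:
    """Stand-in for a real API client; it only tags the payload."""
    return f"{payload} -> {step}[{model}]"


result, log = run_pipeline(PIPELINE, fake_call_model, "post idea")
print(result)
```

Binding steps to tiers means a model swap is a one-line change in `TIERS`, and the returned log doubles as a simple audit trail of which model handled which step.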

“Get it to create skills for the previous generations to make them more intelligent and more like itself, self teaching AI.” — Ready_Positive_6419

This is exactly what the quote describes. Use today’s best model to build skills that tomorrow’s (cheaper, weaker) models can execute.

Why This Matters

Models improve, but patterns persist. The specific capabilities of GPT-5.5 or Fable or Opus 4.8 will be obsolete in 18 months. But the skill of “audit a Python codebase for dependency issues” is timeless. Encoding that skill as a structured, version-controlled definition means you migrate it forward, not rewrite it.

Cost optimization is real. I used to run every task through the most expensive model tier. Now I route strategically: advanced model for planning/auditing ($15-30/hr of work), cheaper model for execution ($2-5/hr). In my usage, output quality stayed the same while costs dropped by roughly 70-80%.
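Whether the saving lands in that range depends on how the hours split between tiers. As a rough check, the sketch below assumes (hypothetically) that 10% of the work is planning/auditing and 90% is execution, and applies the hourly figures quoted above. The split is an assumption for illustration, not a measured number.

```python
def routed_saving(expensive_rate, cheap_rate, planning_share):
    """Fractional cost reduction versus running everything on the expensive tier."""
    routed = planning_share * expensive_rate + (1 - planning_share) * cheap_rate
    return 1 - routed / expensive_rate


# Low and high ends of the quoted hourly ranges; the 10% share is an assumption.
at_low_rates = routed_saving(expensive_rate=15, cheap_rate=2, planning_share=0.10)
at_high_rates = routed_saving(expensive_rate=30, cheap_rate=5, planning_share=0.10)
print(f"{at_low_rates:.0%} and {at_high_rates:.0%}")
```

With that split, the reduction comes out at about 78% at the low end of the rates and 75% at the high end. A larger planning share pulls the saving down, so the ratio is worth measuring for your own workload.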

Reliability goes up, not down. In my experience, a weaker model with a well-structured plan often outperforms a strong model with no plan. The plan constrains the search space. The skill definition provides guardrails. The validation layer catches deviations. Each layer compensates for the one below it.

Common Mistakes

I made all of these. Here’s what to avoid:

Starting from zero every session. The default behavior is to open a chat and start typing. That gives you a one-off interaction, not reusable infrastructure. Force yourself to write a skill definition first, even if it’s imperfect. You can refine it later.

Hard-coding model-specific prompts in your skill logic. If your skills assume Fable’s exact behavior, they’ll break when you switch to a different model. Define the what, not the how. Let the harness handle model-specific prompt adaptation.

No validation layer. I assumed the weaker model would follow the plan correctly. It didn’t. Plans are high-level, and models fill in gaps differently. Add explicit validation at each workflow step. Check outputs against expected formats. Compare results against baselines.
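A validation step along these lines can be quite small. The sketch below assumes the step's output is a text report in which required sections appear as their own lines and findings start with `WARNING:`. Both conventions, and the `validate_output` name, are hypothetical.

```python
def validate_output(report: str, required_headings, baseline_warnings):
    """Return a list of problems; an empty list means the step passed."""
    problems = []
    lines = [ln.strip() for ln in report.splitlines()]
    # Format check: every expected section heading must appear as its own line.
    for heading in required_headings:
        if heading not in lines:
            problems.append(f"missing section: {heading}")
    # Baseline check: treat each 'WARNING:' line as one finding (assumed format).
    warnings = {ln for ln in lines if ln.startswith("WARNING:")}
    for new in sorted(warnings - set(baseline_warnings)):
        problems.append(f"new warning vs baseline: {new}")
    return problems


report = """# Findings
WARNING: requests is pinned to an old version
WARNING: unused import in app.py
"""
baseline = ["WARNING: unused import in app.py"]
problems = validate_output(report, ["# Findings", "# Fix plan"], baseline)
print(problems)
```

Returning a list of problems instead of raising on the first one lets the harness log every deviation for a step before deciding whether to retry or stop.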

Not stress-testing your skills. A skill that works on a 100-file project might choke on 10,000 files. Context windows overflow. Models hallucinate more with larger inputs. Document failure modes explicitly and test with worst-case inputs.
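The "split by module" mitigation listed in the skill's failure modes could look like the sketch below. It assumes that a module is simply a top-level directory and that a fixed file count per batch is a good enough proxy for context size. Both are simplifications, and `split_by_module` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_by_module(paths, max_files):
    """Group paths by top-level directory, then cut each group into
    batches of at most max_files so no single run sees the whole repo."""
    modules = defaultdict(list)
    for p in paths:
        parts = PurePosixPath(p).parts
        module = parts[0] if len(parts) > 1 else "."  # "." holds root-level files
        modules[module].append(p)
    batches = []
    for module in sorted(modules):
        files = sorted(modules[module])
        for i in range(0, len(files), max_files):
            batches.append((module, files[i:i + max_files]))
    return batches


paths = ["api/a.py", "api/b.py", "api/c.py", "web/x.py", "setup.py"]
batches = split_by_module(paths, max_files=2)
print(batches)
```

Running the same skill over each batch, then merging the reports, is one way to test the 10,000-file case without relying on a single oversized prompt.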

“Building a skills repository and better instrument lesser models. In the future I plan to use it to tighten said skills and workflows.” — Vivid_Sample_1793

How to Start

You don’t need a complex framework. Here’s what I’d do if I were starting today:

  1. Pick one repetitive task you do with AI (code review, blog drafting, test generation).
  2. Write a skill definition for it. Follow the template above. Be specific about inputs, outputs, and failure modes.
  3. Run it with your best model first. Validate the output. Fix the definition.
  4. Try it with a weaker model. Adjust the prompting strategy in your harness until the weaker model produces acceptable output.
  5. Check it into git. Now it’s permanent.

The investment is maybe 2-3 hours for the first skill. Subsequent skills take 30 minutes each. After 10 skills, you have a library that captures months of accumulated experience — and it works regardless of which model you have access to.

Summary

In this post, I showed how to build AI agent skills as composable, version-controlled modules that encode specific capabilities. The key point is separating strategic thinking (advanced model) from tactical execution (capable model) so your workflows outlast any single model’s availability. Store skills with clear interfaces, dependency declarations, and validation hooks. Use the harness to route execution to the appropriate model tier. Your skill repository will keep working when today’s best model is gone — and tomorrow’s model will execute your skills even better.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
