How to Build a Multi-Model AI Coding Harness with Per-Phase Spec-Driven Development

Jun 5, 2026

I spent months throwing Claude Sonnet at every coding task, watching my API bills climb while fast, cheap models sat idle. An r/opencode thread changed my approach: a Principal Architect shared their 9-phase Spec-Driven Development (SDD) setup with a different model for each phase. Total monthly cost: $12-15.

The idea is simple: match model capability to the phase’s reasoning demand. Don’t pay frontier prices for mechanical work.

The 9-Phase Model Map

The SDD pipeline has nine phases, from repo mapping to archiving. Here is the assignment that works:

phases:
  sdd-init:
    model: deepseek-v4-flash
    reasoning: none
  sdd-explore:
    model: kimi-k2.6
    mode: thinking
  sdd-propose:
    model: glm-5.1
    reasoning: none
  sdd-spec:
    model: deepseek-v4-pro
    reasoning: high
  sdd-design:
    model: deepseek-v4-pro
    reasoning: medium
  sdd-tasks:
    model: deepseek-v4-flash
    reasoning: none
  sdd-apply:
    model: deepseek-v4-pro
    reasoning: high
  sdd-verify:
    model: qwen3-coder-480b
    source: openrouter
  sdd-archive:
    model: deepseek-v4-flash
    reasoning: none

orchestrator:
  model: claude-sonnet-4.6
  source: openrouter
  role: gate-coordinator

Flow diagram of task router dispatching different task types to different model backends
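To make the routing concrete, here is a minimal Python sketch of the same map as a lookup table. The phase names, models, and settings come from the config above; the `PhaseConfig` class and the `route` function are hypothetical names I made up for illustration, not part of OpenCode or any SDD tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhaseConfig:
    """Settings for one SDD phase (hypothetical helper type)."""
    model: str
    reasoning: Optional[str] = None  # "none", "medium", or "high"
    mode: Optional[str] = None       # e.g. "thinking"
    source: Optional[str] = None     # e.g. "openrouter"


# Mirrors the phase map above, in pipeline order.
PHASES = {
    "sdd-init": PhaseConfig("deepseek-v4-flash", reasoning="none"),
    "sdd-explore": PhaseConfig("kimi-k2.6", mode="thinking"),
    "sdd-propose": PhaseConfig("glm-5.1", reasoning="none"),
    "sdd-spec": PhaseConfig("deepseek-v4-pro", reasoning="high"),
    "sdd-design": PhaseConfig("deepseek-v4-pro", reasoning="medium"),
    "sdd-tasks": PhaseConfig("deepseek-v4-flash", reasoning="none"),
    "sdd-apply": PhaseConfig("deepseek-v4-pro", reasoning="high"),
    "sdd-verify": PhaseConfig("qwen3-coder-480b", source="openrouter"),
    "sdd-archive": PhaseConfig("deepseek-v4-flash", reasoning="none"),
}


def route(phase: str) -> PhaseConfig:
    """Return the settings for a phase, failing loudly on unknown names."""
    try:
        return PHASES[phase]
    except KeyError:
        raise ValueError(f"unknown SDD phase: {phase!r}") from None
```

Calling `route("sdd-spec")` returns the DeepSeek V4 Pro entry with high reasoning, and a typo in a phase name raises an error instead of silently falling back to a default model.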

Why Different Phases Need Different Models

The key insight from that Reddit thread: “different phases need completely different capabilities.” I tried using DeepSeek V4 Flash for everything first — it is fast and cheap but lacks the reflective depth needed for spec writing. Then I tried Claude for everything — great specs, but I was burning $0.50+ per session on mechanical archive steps.

sdd-init — DeepSeek V4 Flash. Fast repo scanning, no reasoning needed. It reads file trees and builds context maps. Mechanical work.

sdd-explore — Kimi K2.6. It has a huge context window and agentic exploration chops. When the harness needs to dig through 50+ files to understand a codebase, Kimi handles it without hitting context limits.

sdd-propose — GLM-5.1. This model surprised me. Its reflective reasoning produces well-structured proposals. I was skeptical of a model I had not heard of before, but it consistently outputs better proposals than Flash for zero extra cost under the OpenCode Go subscription.

sdd-spec — DeepSeek V4 Pro with high reasoning. This is the most critical phase. The spec determines everything downstream. High reasoning mode catches edge cases most models miss.

sdd-apply — DeepSeek V4 Pro with high reasoning. Highest token consumption phase. The actual code generation. V4 Pro writes correct, idiomatic code. I tried cheaper models here and got buggy output that wasted more time than it saved.

sdd-verify — Qwen3-Coder 480B via OpenRouter. Tool-call accuracy matters most here. Qwen3-Coder 480B excels at calling the right verification tools with correct parameters. A few cents per session.

sdd-archive — DeepSeek V4 Flash. Mechanical summarization and cleanup. No reasoning needed.

The Cost Picture

Bar chart comparing monthly cost and intelligence score for various LLM models

The math works because OpenCode Go is a flat $10/month and covers 8 of the 9 phases, which run on four models: DeepSeek V4 Flash, DeepSeek V4 Pro, Kimi K2.6, and GLM-5.1. You only pay extra for Qwen3-Coder 480B on OpenRouter (cents per session) and for the Claude Sonnet 4.6 orchestrator, also on OpenRouter (also cents per session).

Compare that to running Claude Sonnet end-to-end: at typical usage patterns, that is $50-80/month. Running DeepSeek V4 Pro on every phase still costs $3-5/month more and gives worse results on exploration and verification.
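The arithmetic behind those numbers is easy to check. This is a small sketch that uses only the figures quoted in this post ($10 flat rate, $12-15 total, $50-80 for Sonnet end-to-end); the variable names are mine, and the pay-per-use share is inferred by subtraction rather than measured.

```python
FLAT_RATE = 10.0             # OpenCode Go subscription, USD/month
TOTAL_RANGE = (12.0, 15.0)   # reported monthly total for the multi-model setup
SONNET_RANGE = (50.0, 80.0)  # reported cost of Claude Sonnet end-to-end

# Pay-per-use share (verifier + orchestrator on OpenRouter), inferred by subtraction.
extras = tuple(total - FLAT_RATE for total in TOTAL_RANGE)

# Smallest saving: cheapest Sonnet month vs. priciest multi-model month.
min_saving = SONNET_RANGE[0] - TOTAL_RANGE[1]
# Largest saving: priciest Sonnet month vs. cheapest multi-model month.
max_saving = SONNET_RANGE[1] - TOTAL_RANGE[0]

print(f"OpenRouter extras: ${extras[0]:.0f}-{extras[1]:.0f}/month")
print(f"Saving vs. Sonnet-only: ${min_saving:.0f}-{max_saving:.0f}/month")
```

So the OpenRouter calls add roughly $2-5/month on top of the subscription, and the saving against a Sonnet-only workflow lands somewhere between $35 and $68/month.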

Common Mistakes

Adding reasoning effort to init or archive phases. You are paying for thinking time on tasks that just need text generation. Set reasoning: none on Flash phases.

Using the same model for coding and verification. The two phases test different skills. Coding needs creativity and pattern matching. Verification needs precision and tool-call discipline. Qwen3-Coder 480B trades creativity for tool-call accuracy, which is a good trade for verify and a bad one for apply.
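The first two mistakes can be caught mechanically before a session starts. Below is a minimal sketch of such a check, assuming the phase map is loaded as a plain dict of dicts with the same keys as the config earlier in the post. The `lint_phase_map` function and `MECHANICAL_PHASES` set are hypothetical names, not an OpenCode feature.

```python
# Phases this post treats as purely mechanical (assumption: init and archive only).
MECHANICAL_PHASES = {"sdd-init", "sdd-archive"}


def lint_phase_map(phases: dict[str, dict[str, str]]) -> list[str]:
    """Return human-readable warnings for the two config mistakes above."""
    warnings = []

    # Mistake 1: paying for reasoning effort on mechanical phases.
    for name in sorted(MECHANICAL_PHASES.intersection(phases)):
        effort = phases[name].get("reasoning", "none")
        if effort != "none":
            warnings.append(f"{name}: reasoning is {effort!r}, expected 'none'")

    # Mistake 2: the same model both writes and verifies the code.
    apply_model = phases.get("sdd-apply", {}).get("model")
    verify_model = phases.get("sdd-verify", {}).get("model")
    if apply_model is not None and apply_model == verify_model:
        warnings.append(f"sdd-apply and sdd-verify both use {apply_model}")

    return warnings
```

An empty list means the map passes both checks. Running this as a pre-flight step is cheap insurance when you start swapping models around.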

Ignoring the orchestrator role. The orchestrator does not write code — it decides when a phase is done and gates the next one. Claude Sonnet 4.6 is excellent at this judgment task. You do not need a coding model here.
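To show what gating means in practice, here is a rough sketch of the coordinator loop. It assumes the nine phases run strictly in order and that a failed gate triggers a limited retry; the thread did not spell out retry behavior, so `max_attempts` is my assumption. `run_phase` and `gate` are hypothetical callables standing in for the phase model call and the orchestrator's judgment call.

```python
from typing import Callable

PHASE_ORDER = [
    "sdd-init", "sdd-explore", "sdd-propose", "sdd-spec", "sdd-design",
    "sdd-tasks", "sdd-apply", "sdd-verify", "sdd-archive",
]


def run_pipeline(
    run_phase: Callable[[str], str],
    gate: Callable[[str, str], bool],
    max_attempts: int = 2,  # assumption: retry policy is not from the source
) -> list[str]:
    """Run phases in order; return the phases the gate approved."""
    completed = []
    for phase in PHASE_ORDER:
        for _ in range(max_attempts):
            output = run_phase(phase)   # phase model does the work
            if gate(phase, output):     # orchestrator only judges the result
                completed.append(phase)
                break
        else:
            # The gate never approved this phase: stop instead of building on bad output.
            return completed
    return completed
```

The point of the split is visible in the signatures: the orchestrator never produces phase output, it only returns a yes or no on someone else's work.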

Summary

In this post, I walked through a 9-phase SDD harness that assigns different models to different phases based on what each phase actually needs. The result is better output quality at $12-15/month total — a fraction of what a single frontier model costs end-to-end. Match the model to the job, not the leaderboard.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit Discussion: r/opencode - Multi-model SDD setup

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!