How DeepSeek V4's Two-Stage Post-Training Solves Multi-Domain Interference
When training multi-capability LLMs, I’ve noticed a persistent problem: models tend to “average” across domains rather than excel. A model trained simultaneously on coding, math, and reasoning data often produces compromises—strong in one area but weakened in others. DeepSeek V4 tackles this with a two-stage post-training architecture that separates domain optimization from capability integration.

The Problem I Identified
Traditional post-training mixes all domains in a single Supervised Fine-Tuning (SFT) phase:
- Coding examples, math proofs, reasoning chains, and general knowledge all thrown together
- The model learns to navigate trade-offs rather than master each domain
- Different reasoning patterns interfere: math’s logical rigor conflicts with creative writing’s flexibility
This creates knowledge interference—gradient updates from one domain degrade another. I’ve seen models that were strong coders become mediocre mathematicians after mixed-domain training.
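To make "gradient updates from one domain degrade another" concrete, here is a toy sketch in plain Python. The two quadratic losses, their targets, and the helper names are all assumptions for illustration; nothing here comes from DeepSeek's training code. A negative cosine similarity between two domain gradients means a step that helps one domain hurts the other.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def quadratic_grad(params, target):
    """Gradient of 0.5 * ||params - target||^2 with respect to params."""
    return [p - t for p, t in zip(params, target)]

# Hypothetical shared parameters and per-domain optima (toy values).
params = [0.0, 0.0]
grad_coding = quadratic_grad(params, target=[1.0, 0.5])
grad_math = quadratic_grad(params, target=[-1.0, 0.5])

# Negative similarity: the two domains pull the shared weights apart.
similarity = cosine_similarity(grad_coding, grad_math)

# Averaging the gradients, as a mixed batch would, cancels the first coordinate.
mixed_grad = [(a + b) / 2 for a, b in zip(grad_coding, grad_math)]
```

In this toy case the averaged gradient makes no progress along the first coordinate for either domain, which is the "compromise" behavior described above.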
DeepSeek V4’s Solution
V4 breaks post-training into two distinct stages:
Stage 1: Independent Expert Training

    ┌─────────────────────────────────────────────────────┐
    │                                                     │
    │  Base Model                                         │
    │    │                                                │
    │    ├────► Coding Expert (SFT + GRPO)                │
    │    │                                                │
    │    ├────► Math Expert (SFT + GRPO)                  │
    │    │                                                │
    │    ├────► Reasoning Expert (SFT + GRPO)             │
    │    │                                                │
    │    └────► World Knowledge Expert (SFT + GRPO)       │
    │                                                     │
    └─────────────────────────────────────────────────────┘

Stage 2: On-Policy Distillation (OPD)

    ┌─────────────────────────────────────────────────────┐
    │                                                     │
    │  All Experts ──► Unified V4 Model                   │
    │  (via student-generated trajectory learning)        │
    │                                                     │
    └─────────────────────────────────────────────────────┘

Stage 1: Domain Expert Training
For each domain, V4 runs:
- Domain-specific SFT: High-quality curated data for coding/math/reasoning
- GRPO (Group Relative Policy Optimization): Reinforcement learning fine-tuning
The result: Four expert models, each optimized to its domain ceiling without compromise.
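For readers new to GRPO, the core idea is that each sampled answer is scored relative to the other answers drawn for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative advantage step. The function name, the `eps` constant, and the choice of population standard deviation are my assumptions; real implementations differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four answers sampled for one coding prompt:
# 1.0 means the unit tests passed, 0.0 means they failed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Passing answers get a positive advantage and failing ones a negative advantage. A group where every answer scores the same yields advantages of zero, so it contributes no learning signal.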

    # Conceptual illustration of Stage 1
    class DomainExpertTrainer:
        def __init__(self, domain: str):
            self.domain = domain  # "coding", "math", "reasoning", "knowledge"

        def train_expert(self, base_model):
            # Step 1a: Domain-specific SFT
            domain_data = self.load_domain_data()
            expert = self.supervised_finetune(base_model, domain_data)

            # Step 1b: Domain-specific GRPO RL
            expert = self.grpo_reinforcement_learning(expert)
            return expert

        def load_domain_data(self):
            # Curated high-quality examples for this domain only
            datasets = {
                "coding": ["HumanEval", "MBPP", "internal_code_corpus"],
                "math": ["MATH", "GSM8K", "theorem_proofs"],
                "reasoning": ["logical_chains", "argumentation_data"],
                "knowledge": ["encyclopedia", "qa_pairs"],
            }
            return datasets[self.domain]

        def supervised_finetune(self, base_model, domain_data):
            # Placeholder: plug in your own SFT loop here
            raise NotImplementedError

        def grpo_reinforcement_learning(self, expert):
            # Placeholder: plug in your own GRPO loop here
            raise NotImplementedError

Why separate training matters:
- No cross-domain gradient conflicts
- Each expert can be tuned toward its own performance ceiling
- Debugging is easier—if math underperforms, retrain just that expert
Stage 2: On-Policy Distillation (OPD)
This is where V4 differs from standard distillation. Instead of having teachers generate outputs for students to copy, OPD lets the student generate first, then learns from expert feedback on its own outputs.

    # Standard vs On-Policy Distillation comparison

    def standard_distillation(teacher, student, data):
        """Traditional approach: Teacher generates, student mimics."""
        for sample in data:
            teacher_output = teacher.generate(sample)
            student.learn_from_distribution(teacher_output)

    def on_policy_distillation(experts, student, num_iterations):
        """V4 approach: Student generates, experts provide feedback."""
        for _ in range(num_iterations):
            # Student generates its own trajectory
            student_sample = student.generate()

            # Each expert evaluates student's output
            for expert in experts:
                target_distribution = expert.evaluate(student_sample)
                student.learn_from_distribution(target_distribution)

        return student  # Now unified with all expert capabilities

Why OPD is superior:
- Student learns on its own output distribution (not teacher’s)
- Better alignment with student’s actual capabilities
- Avoids distribution mismatch between teacher-generated and student-generated samples
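One common way to turn "expert feedback on the student's own outputs" into a loss is a per-token reverse KL divergence, KL(student || teacher), evaluated along a trajectory the student sampled. I am assuming that formulation here for illustration; I have not confirmed which loss V4 uses. The toy logit functions and all names below are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )

def opd_loss(student_logits, teacher_logits, prompt, steps, rng):
    """Average per-token reverse KL along a trajectory sampled from the student."""
    tokens = list(prompt)
    total = 0.0
    for _ in range(steps):
        s = softmax(student_logits(tokens))
        t = softmax(teacher_logits(tokens))
        total += reverse_kl(s, t)
        # On-policy step: the next token comes from the student, not the teacher.
        tokens.append(rng.choices(range(len(s)), weights=s)[0])
    return total / steps

# Toy stand-ins over a 3-token vocabulary (hypothetical, context-independent).
student = lambda tokens: [0.0, 0.0, 0.0]
teacher = lambda tokens: [2.0, 0.0, -1.0]

loss = opd_loss(student, teacher, prompt=[0], steps=5, rng=random.Random(0))
```

The loss is zero only when the student already matches the teacher on the states the student actually visits, which is the distribution-mismatch point made in the list above.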
Results I Observed
The benchmark improvements validate this approach:
| Benchmark    | V4 Pro | V3.2-Base | Improvement (pct. points) | Note                   |
|--------------|--------|-----------|---------------------------|------------------------|
| HumanEval    | 76.8%  | 62.8%     | +14.0                     | Coding expert boost    |
| MMLU-Pro     | 73.5%  | 65.5%     | +8.0                      |                        |
| Simple-QA    | 55.2%  | 28.3%     | +26.9                     | Knowledge expert boost |
| LongBench-V2 | 51.5%  | 40.2%     | +11.3                     |                        |
The HumanEval gain of 14.0 points is consistent with the coding expert’s dedicated GRPO training. Simple-QA rose 26.9 points, which I attribute to the knowledge expert being isolated from math and coding interference.
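The gains quoted here are differences in percentage points, not relative improvements, and the two can look very different. The short calculation below uses only the scores from the table; the `gains` helper is my own naming.

```python
# (V4 Pro, V3.2-Base) scores in percent, copied from the table above.
scores = {
    "HumanEval": (76.8, 62.8),
    "MMLU-Pro": (73.5, 65.5),
    "Simple-QA": (55.2, 28.3),
    "LongBench-V2": (51.5, 40.2),
}

def gains(new, old):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = 100 * absolute / old
    return round(absolute, 1), round(relative, 1)

results = {name: gains(new, old) for name, (new, old) in scores.items()}

for name, (absolute, relative) in results.items():
    print(f"{name}: +{absolute} points, +{relative}% relative")
```

By this measure Simple-QA nearly doubles (about 95% relative), while the 14.0-point HumanEval gain is roughly a 22% relative improvement.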
Why This Architecture Matters
For Model Quality
V4-Pro-Max demonstrates that domain isolation followed by integration works:
- Coding benchmarks recovered “national coding champion” status
- Knowledge tests only slightly behind Gemini Pro 3.1
- No single domain suffered from training others
For Training Efficiency
| Aspect              | Traditional Mixed SFT | V4 Two-Stage OPD  |
|---------------------|-----------------------|-------------------|
| Domain interference | High                  | None              |
| Debugging           | Difficult             | Per-domain        |
| Adding new domains  | Retrain all           | Train new expert  |
| Convergence         | Unstable              | Stable per expert |

For Research Direction
This approach suggests a blueprint for future multi-capability models:
- More granular experts (e.g., separate “Python” from “JavaScript” coding)
- Distillation techniques that preserve expert boundaries
- Modular capability addition without full retraining
Implementation Considerations
If you’re designing a similar training pipeline, here’s what I’d focus on:

    class TwoStagePostTraining:
        def __init__(self, domains: list[str]):
            self.domains = domains
            self.experts = []

        def stage_one(self, base_model):
            """Train domain experts independently."""
            for domain in self.domains:
                trainer = DomainExpertTrainer(domain)
                expert = trainer.train_expert(base_model)
                self.experts.append(expert)
            return self.experts

        def stage_two(self, student_model):
            """Merge via on-policy distillation."""
            # OnPolicyDistillation stands in for your own Stage 2 implementation
            opd = OnPolicyDistillation()
            unified = opd.distill_experts(self.experts, student_model)
            return unified

        def iterate_domain(self, domain: str, base_model):
            """Retrain single domain without affecting others."""
            # This is the key advantage: modular updates
            # (call stage_one first so self.experts is populated)
            trainer = DomainExpertTrainer(domain)
            new_expert = trainer.train_expert(base_model)
            self.experts[self.domains.index(domain)] = new_expert

Key decisions:
- Expert granularity: How specialized should each expert be?
- Distillation iterations: How many student trajectories before convergence?
- Expert selection: Which domains benefit most from isolation?
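For the distillation-iterations question, one simple starting point is to stop when the distillation loss plateaus. The helper below is a hypothetical sketch: the window size and tolerance are arbitrary defaults I picked, not values from the V4 report.

```python
def has_converged(losses, window=3, tolerance=0.01):
    """Report a plateau when the recent mean loss stops improving.

    Compares the mean of the last `window` losses with the mean of the
    `window` losses before them. Returns True when the relative
    improvement falls below `tolerance`, including when the loss rises.
    """
    if len(losses) < 2 * window:
        return False  # Not enough history to compare two windows.
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return (previous - recent) < tolerance * abs(previous)

# Hypothetical loss curves recorded once per batch of student trajectories.
still_improving = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
plateaued = [1.0, 0.5, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
```

Calling `has_converged(still_improving)` returns `False` and `has_converged(plateaued)` returns `True`. In practice I would track this per domain, since one expert's behavior may transfer faster than another's.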
Final Thoughts
DeepSeek V4’s two-stage post-training solves a fundamental problem in multi-domain LLM development. By separating optimization from integration, each capability reaches its ceiling before merging. The OPD technique ensures the unified model learns from expert feedback on its own outputs, not copied distributions.
For practitioners building multi-capability models, this architecture offers practical benefits: modular debugging, stable convergence, and scalable domain addition. V4’s benchmark gains—especially the coding and knowledge improvements—validate that domain isolation isn’t just theoretically sound; it produces measurable results.