Qwen3.5 MoE Architecture Explained: Why 397B Parameters Only Activate 17B
The Confusion
When I first read about Qwen3.5-397B-A17B, I was confused. The model has 397 billion parameters, but only 17 billion are active for each token during inference. That’s just 4.3% of the model. How can a model that “ignores” roughly 96% of its parameters still perform at flagship level?
I thought bigger models meant slower inference. But Qwen3.5 achieves near-Dense-model quality with roughly 23x less compute per token than a Dense model of the same total size. Here’s how I understand MoE architecture after digging into it.
What is Mixture of Experts (MoE)?
Traditional Dense models activate all parameters for every token. A 400B Dense model runs all 400B parameters for every single token it generates. That’s slow and memory-intensive.
MoE takes a different approach: divide the model into specialized “expert” networks, then dynamically route each input token to the most relevant experts.
┌─────────────────────────────────────────────────────────────────┐
│                           DENSE MODEL                           │
│  Input Token ──▶ [All 400B Parameters Activated] ──▶ Output     │
│                           (100% load)                           │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                            MOE MODEL                            │
│                                                                 │
│                  ┌──────────────┐                               │
│                  │   Expert 1   │  (inactive)                   │
│                  └──────────────┘                               │
│                  ┌──────────────┐                               │
│  Input Token ──▶ │   Expert 2   │  (ACTIVE) ──┐                 │
│                  └──────────────┘             ├──▶ Output       │
│                  ┌──────────────┐             │                 │
│                  │   Expert 3   │  (ACTIVE) ──┘                 │
│                  └──────────────┘                               │
│                  ┌──────────────┐                               │
│                  │   Expert 4   │  (inactive)                   │
│                  └──────────────┘                               │
│                                                                 │
│                 Only 17B of 397B active (~4.3%)                 │
└─────────────────────────────────────────────────────────────────┘

How MoE Routing Works
The key to MoE is the gating network—a learned function that decides which experts to use for each token.
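The routing decision for a single token can be sketched in plain Python. This is only an illustration of top-k gating in general: the function name `route_token` and the example logits are made up for this sketch, and nothing here is taken from the actual Qwen3.5 router.

```python
import math


def route_token(gate_logits, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k highest gate logits.

    The weights are a softmax over only the selected logits, so they sum to 1.
    """
    # Rank expert indices by logit, highest first, and keep the top_k.
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:top_k]

    # Numerically stable softmax restricted to the chosen experts.
    max_logit = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - max_logit) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical gate scores for one token over four experts.
routing = route_token([0.1, 2.0, 1.0, -0.5], top_k=2)
for expert_index, weight in routing:
    print(f"expert {expert_index}: weight {weight:.3f}")
```

With these made-up scores, experts 1 and 2 are selected and the other two never run, which is the sparse activation that keeps per-token compute low.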
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """
    A simplified Mixture of Experts layer.

    In real implementations like Qwen3.5, this is more complex with:
    - Load balancing losses
    - Expert capacity constraints
    - Distributed expert placement
    """

    def __init__(self, input_dim, hidden_dim, num_experts, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Gating network: decides which experts to use
        self.gate = nn.Linear(input_dim, num_experts)

        # Expert networks: each one is an independent feed-forward block
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, input_dim)
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        batch_size, seq_len, input_dim = x.shape

        # Step 1: Gate decides expert relevance for each token
        gate_logits = self.gate(x)  # [batch, seq, num_experts]

        # Step 2: Select top-k experts (sparse activation!)
        top_k_weights, top_k_indices = torch.topk(gate_logits, self.top_k, dim=-1)
        top_k_weights = F.softmax(top_k_weights, dim=-1)  # normalize over the chosen k

        # Step 3: Compute weighted output from selected experts only
        output = torch.zeros_like(x)

        for i in range(self.top_k):
            expert_idx = top_k_indices[:, :, i]   # Which expert
            weight = top_k_weights[:, :, i:i+1]   # How much to trust it

            # Gather expert outputs (an expert only runs on tokens routed to it)
            for e in range(self.num_experts):
                mask = (expert_idx == e)
                if mask.any():
                    expert_input = x[mask]
                    expert_output = self.experts[e](expert_input)
                    output[mask] += weight[mask] * expert_output

        return output


# Example: a scaled-down layer in the spirit of Qwen3.5-397B-A17B
# Note: these sizes need roughly 34 GB of RAM in FP32; shrink them to try it locally.
moe = MoELayer(
    input_dim=4096,
    hidden_dim=16384,
    num_experts=64,  # illustrative expert count for this toy example
    top_k=2          # Activate 2 experts per token
)

# Process a batch of tokens
tokens = torch.randn(1, 512, 4096)  # 512 tokens
output = moe(tokens)

print(f"Total parameters: {sum(p.numel() for p in moe.parameters()):,}")
# Each expert has two weight matrices of input_dim x hidden_dim; top_k=2 experts run per token.
print(f"Active expert parameters per token: ~{2 * 2 * 4096 * 16384:,}")

This is a simplified illustration. Real MoE implementations like Qwen3.5 include:
- Load balancing losses: Ensure experts aren’t over/under-utilized
- Expert capacity constraints: Prevent any expert from getting too many tokens
- Distributed placement: Experts spread across multiple GPUs
- Router Z-loss: Stabilize training by penalizing large gate logits
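Two of the items above can be written down concretely. The sketch below uses the auxiliary load balancing loss in the form popularized by the Switch Transformer (top-1 routing) and the router z-loss in the form used by ST-MoE. I am assuming these textbook formulations for illustration only; the source does not say which exact losses or coefficients Qwen3.5 uses, and the function names are my own.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def load_balance_loss(batch_logits):
    """Switch-Transformer-style auxiliary loss, assuming top-1 routing.

    loss = num_experts * sum_i(f_i * P_i), where f_i is the fraction of tokens
    whose top choice is expert i and P_i is the mean router probability of i.
    A perfectly uniform router gives 1.0; collapse onto one expert gives more.
    """
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in batch_logits:
        probs = softmax(logits)
        top = max(range(num_experts), key=lambda i: probs[i])
        frac[top] += 1.0 / num_tokens
        for i in range(num_experts):
            mean_prob[i] += probs[i] / num_tokens
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of the gate logits; grows with logit magnitude."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(batch_logits)


# Hypothetical gate logits: two tokens, two experts.
balanced = [[4.0, 0.0], [0.0, 4.0]]   # each expert gets one token
collapsed = [[4.0, 0.0], [4.0, 0.0]]  # both tokens pick expert 0

print(load_balance_loss(balanced), load_balance_loss(collapsed))
print(router_z_loss([[0.0, 0.0]]), router_z_loss([[10.0, 10.0]]))
```

In training, terms like these are typically added to the main loss with small coefficients, so the router is nudged toward even expert usage and moderate logits without dominating the language modeling objective.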
Qwen3.5 MoE Model Sizes
Qwen3.5 offers three MoE variants with different trade-offs:
┌──────────────────────┬──────────────┬───────────────┬──────────────┐
│ Model                │ Total Params │ Active Params │ Activation % │
├──────────────────────┼──────────────┼───────────────┼──────────────┤
│ Qwen3.5-35B-A3B      │ 35B          │ 3B            │ 8.6%         │
│ Qwen3.5-122B-A10B    │ 122B         │ 10B           │ 8.2%         │
│ Qwen3.5-397B-A17B    │ 397B         │ 17B           │ 4.3%         │
└──────────────────────┴──────────────┴───────────────┴──────────────┘
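The activation percentages in the table are just active parameters divided by total parameters. A quick check, using the rounded sizes encoded in the model names (so the results are approximate):

```python
# (total, active) parameter counts in billions, read off the model names.
models = {
    "Qwen3.5-35B-A3B": (35, 3),
    "Qwen3.5-122B-A10B": (122, 10),
    "Qwen3.5-397B-A17B": (397, 17),
}


def activation_ratio(total_b, active_b):
    """Fraction of the model's parameters that run for each token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    ratio = activation_ratio(total_b, active_b)
    print(f"{name}: {ratio:.1%} of parameters active per token")
```

The largest model is also the sparsest: its active share is about half that of the two smaller variants.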
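Comparison to Dense:
┌──────────────────────┬──────────────┬───────────────┬──────────────┐
│ Model                │ Total Params │ Active Params │ Activation % │
├──────────────────────┼──────────────┼───────────────┼──────────────┤
│ Dense 400B (hypot.)  │ 400B         │ 400B          │ 100%         │
│ Qwen3.5-397B-A17B    │ 397B         │ 17B           │ 4.3%         │
└──────────────────────┴──────────────┴───────────────┴──────────────┘
Compute reduction: ~23x fewer FLOPs per token (400B / 17B). This is a theoretical figure; measured speedups depend on hardware and batching.

Why MoE Works: Expert Specialization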
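MoE works because different experts learn to specialize in different kinds of inputs. During training, the gating network learns which experts are best for which inputs. The diagram below is a conceptual picture; learned specializations are usually less clean-cut than these labels suggest.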
┌─────────────────────────────────────────────────────────────────┐
│                       QWEN3.5 MOE EXPERTS                       │
├─────────────────────────────────────────────────────────────────┤
│ Expert 1  ──▶ Code generation, syntax, programming patterns     │
│ Expert 2  ──▶ Mathematical reasoning, calculations              │
│ Expert 3  ──▶ Creative writing, storytelling                    │
│ Expert 4  ──▶ Factual knowledge, encyclopedic information       │
│ Expert 5  ──▶ Translation, multilingual tasks                   │
│ Expert 6  ──▶ Logical reasoning, analysis                       │
│ Expert 7  ──▶ Summarization, extraction                         │
│ Expert 8  ──▶ Dialogue, conversational patterns                 │
│ ...                                                             │
│ Expert 64 ──▶ Domain-specific knowledge (medical, legal, etc.)  │
└─────────────────────────────────────────────────────────────────┘
When you ask "Write a Python function to sort a list":
  Gate activates: Expert 1 (code) + Expert 6 (logic)
  Other experts remain inactive

When you ask "Translate this to French":
  Gate activates: Expert 5 (translation) + Expert 8 (dialogue)
  Other experts remain inactive

This specialization explains why MoE can match Dense model quality: each expert becomes highly proficient in its domain, and the gating network ensures the right experts handle each query.
Practical Implications
Speed Benefits
MoE models are faster than equivalently-sized Dense models because they compute far fewer operations per token.
# Dense 400B model
dense_params = 400_000_000_000
dense_flops_per_token = dense_params * 2  # one multiply and one add per parameter
print(f"Dense: {dense_flops_per_token / 1e9:.1f} GFLOPs per token")

# MoE 397B model with 17B active
moe_total_params = 397_000_000_000
moe_active_params = 17_000_000_000
moe_flops_per_token = moe_active_params * 2
print(f"MoE active: {moe_flops_per_token / 1e9:.1f} GFLOPs per token")
print(f"Speedup: {dense_flops_per_token / moe_flops_per_token:.1f}x")

# Output:
# Dense: 800.0 GFLOPs per token
# MoE active: 34.0 GFLOPs per token
# Speedup: 23.5x

Memory Considerations
MoE has a catch: you still need to load all parameters into memory, even though only a fraction are active.
┌───────────────────────┬──────────────┬────────────────┬──────────────┐
│ Model                 │ VRAM (FP16)  │ Active weights │ Storage      │
├───────────────────────┼──────────────┼────────────────┼──────────────┤
│ Dense 100B            │ ~200 GB      │ ~200 GB        │ ~200 GB      │
│ MoE 397B (17B active) │ ~800 GB      │ ~34 GB         │ ~800 GB      │
└───────────────────────┴──────────────┴────────────────┴──────────────┘
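The numbers in this table follow from bytes per parameter times parameter count. Here is a rough sketch of that arithmetic. The INT8 and INT4 rows and the 80 GB GPU size are my own assumptions for illustration (the table only covers FP16), and the estimate counts weights only, ignoring KV cache and activations.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype="fp16"):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(num_params, dtype, gpu_memory_gb):
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(num_params, dtype) / gpu_memory_gb)


total_params = 397e9   # every expert must be resident in memory
active_params = 17e9   # weights actually used for one token

for dtype in BYTES_PER_PARAM:
    total_gb = weight_memory_gb(total_params, dtype)
    gpus = min_gpus(total_params, dtype, gpu_memory_gb=80)  # hypothetical 80 GB GPUs
    print(f"{dtype}: {total_gb:.1f} GB of weights, at least {gpus} GPUs")

print(f"Active weights per token (fp16): {weight_memory_gb(active_params):.1f} GB")
```

Lower precision shrinks the footprint, but the full set of experts still has to be stored somewhere, which is the point of the table.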
Key insight: MoE reduces compute, not memory storage.
You still need hardware that can hold all 397B parameters.

When MoE Shines vs When It Doesn’t
MoE excels in specific scenarios:
+------------------+------------------------+--------------------------+
| Scenario         | MoE Advantage          | Reason                   |
+------------------+------------------------+--------------------------+
| GPU inference    | Major speedup          | GPU parallelism + sparse |
| Batched requests | Excellent throughput   | Different experts per    |
|                  |                        | request                  |
| Varied tasks     | Maintains quality      | Right expert for task    |
+------------------+------------------------+--------------------------+
| CPU inference    | Limited benefit        | CPU lacks sparse matrix  |
|                  |                        | optimization             |
| Single stream    | Good but less dramatic | One token at a time      |
| Small batches    | Overhead may dominate  | Gate computation cost    |
+------------------+------------------------+--------------------------+

Qwen3.5 vs DeepSeek R1 Comparison
Qwen3.5-397B-A17B competes directly with DeepSeek R1 671B-A37B. Here’s how they compare:
┌──────────────────────┬─────────────────────┬───────────────────────┐
│ Metric               │ Qwen3.5-397B-A17B   │ DeepSeek R1 671B-A37B │
├──────────────────────┼─────────────────────┼───────────────────────┤
│ Total parameters     │ 397B                │ 671B                  │
│ Active parameters    │ 17B                 │ 37B                   │
│ Activation ratio     │ 4.3%                │ 5.5%                  │
│ Inference speed      │ Faster              │ Slower                │
│ Quality              │ Near flagship       │ Flagship              │
│ Memory requirement   │ Lower               │ Higher                │
└──────────────────────┴─────────────────────┴───────────────────────┘

Key insight from technical reports:
"Qwen3.5-397B-A17B achieves comparable performance to DeepSeek R1
with fewer parameters AND fewer active parameters."

Common Misconceptions
Misconception 1: “MoE is always faster”
MoE is faster for GPU inference with batching, but CPU-bound scenarios may not benefit as much. The gating network adds overhead, and sparse matrix operations are less optimized on CPUs.
Misconception 2: “Active parameters = model quality”
Total parameters matter for knowledge capacity. Active parameters affect inference speed. A 397B MoE with 17B active can store far more knowledge than a 17B Dense model, even though both do roughly the same amount of compute per token.
Misconception 3: “MoE is new”
MoE was introduced in 1991 (Jacobs et al.) and applied to transformers in 2020 (GShard). Qwen3.5 builds on decades of research rather than a brand-new invention.
When to Choose MoE
Based on my analysis, MoE makes sense when:
- You need flagship quality at lower cost: Qwen3.5-397B-A17B gives near-flagship performance at a much lower compute cost per token.
- Your workloads are batched: MoE shines when processing multiple requests in parallel, allowing different experts to handle different queries simultaneously.
- You have GPU infrastructure: MoE’s sparse activation patterns are optimized for GPU parallelism.
- Your use cases are varied: MoE’s expert specialization handles diverse tasks well.
Consider Dense models when:
- Memory is constrained: MoE requires loading all parameters even if only using a fraction.
- Single-stream latency is critical: Dense models have more predictable latency.
- CPU-only deployment: Dense models may perform better without GPU sparse matrix optimization.
Summary
In this post, I explained Qwen3.5’s Mixture of Experts architecture:
- Sparse activation: Only 4-9% of parameters are active for each token
- Gating network: Dynamically routes tokens to relevant experts
- Expert specialization: Different experts handle different knowledge domains
- Speed vs memory trade-off: Faster inference but requires full memory
Qwen3.5-397B-A17B achieves flagship-level quality with just 17B active parameters out of 397B total. This makes powerful AI more accessible without requiring massive compute infrastructure for every query.
The key insight is that not all parameters need to participate in every decision. By letting experts specialize and routing tokens intelligently, MoE models get the best of both worlds: the knowledge capacity of huge models with the inference speed of smaller ones.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Qwen3.5 Technical Report
- Mixture of Experts Explained
- DeepSeek R1 Technical Report
- Sparse Activation in Neural Networks
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!