Qwen3.5 MoE Architecture Explained: Why 397B Parameters Only Activate 17B

The Confusion

When I first read about Qwen3.5-397B-A17B, I was confused. The model has 397 billion parameters, but only 17 billion are active for any given token during inference. That’s just 4.3% of the model. How can a model that “ignores” almost 96% of its parameters on each token still perform at flagship level?

I thought bigger models meant slower inference. But Qwen3.5-397B-A17B aims for near-Dense-model quality while doing roughly 23x less compute per token than a Dense model of the same total size. Here’s how I understand MoE architecture after digging into it.

What is Mixture of Experts (MoE)?

Traditional Dense models activate all parameters for every token. A 400B Dense model runs all 400B parameters for every single token it generates. That’s slow and compute-intensive.

MoE takes a different approach: divide the model into specialized “expert” networks, then dynamically route each input token to the most relevant experts.

Dense vs MoE Activation Pattern
┌─────────────────────────────────────────────────────────────────┐
│ DENSE MODEL │
│ Input Token ──▶ [All 400B Parameters Activated] ──▶ Output │
│ (100% load) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ MOE MODEL │
│ │
│ ┌──────────────┐ │
│ │ Expert 1 │ (inactive) │
│ └──────────────┘ │
│ ┌──────────────┐ │
│ Input Token ──▶ │ Expert 2 │ (ACTIVE) ──┐ │
│ └──────────────┘ │ │
│ ┌──────────────┐ ▼ │
│ │ Expert 3 │ (ACTIVE) ──▶ Output │
│ └──────────────┘ │ │
│ ┌──────────────┐ │ │
│ │ Expert 4 │ (inactive) ─┘ │
│ └──────────────┘ │
│ │
│ Only 17B of 397B active (~4.3%) │
└─────────────────────────────────────────────────────────────────┘

How MoE Routing Works

The key to MoE is the gating network—a learned function that decides which experts to use for each token.

Simplified MoE Routing Illustration
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """
    A simplified Mixture of Experts layer.

    Production implementations are more complex and typically add:
    - Load balancing losses
    - Expert capacity constraints
    - Distributed expert placement
    """

    def __init__(self, input_dim, hidden_dim, num_experts, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Gating network: scores every expert for each token
        self.gate = nn.Linear(input_dim, num_experts)
        # Expert networks: independent feed-forward blocks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, input_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: [batch, seq, input_dim]
        # Step 1: Gate scores expert relevance for each token
        gate_logits = self.gate(x)  # [batch, seq, num_experts]
        # Step 2: Select top-k experts (sparse activation!)
        top_k_weights, top_k_indices = torch.topk(
            gate_logits, self.top_k, dim=-1
        )
        # Normalize the weights over the selected experts only
        top_k_weights = F.softmax(top_k_weights, dim=-1)
        # Step 3: Compute weighted output from selected experts only
        output = torch.zeros_like(x)
        for i in range(self.top_k):
            expert_idx = top_k_indices[:, :, i]    # Which expert: [batch, seq]
            weight = top_k_weights[:, :, i:i + 1]  # How much to trust it: [batch, seq, 1]
            # Each expert only runs on the tokens routed to it
            for e in range(self.num_experts):
                mask = (expert_idx == e)
                if mask.any():
                    expert_output = self.experts[e](x[mask])
                    output[mask] += weight[mask] * expert_output
        return output


# Example: a toy layer, far smaller than Qwen3.5-397B-A17B.
# Dimensions are kept small so the example runs on a laptop.
moe = MoELayer(
    input_dim=512,
    hidden_dim=2048,
    num_experts=64,  # illustrative count, not Qwen3.5's actual configuration
    top_k=2,         # activate 2 experts per token
)

# Process a batch of tokens
tokens = torch.randn(1, 512, 512)  # 512 tokens
output = moe(tokens)

# Active parameters = gate + the top_k experts a token is routed to
total_params = sum(p.numel() for p in moe.parameters())
expert_params = sum(p.numel() for p in moe.experts[0].parameters())
gate_params = sum(p.numel() for p in moe.gate.parameters())
active_params = gate_params + moe.top_k * expert_params
print(f"Total parameters: {total_params:,}")
print(f"Active parameters per token: {active_params:,}")

This is a simplified illustration. Production MoE implementations typically add some or all of the following:

  • Load balancing losses: Ensure experts aren’t over/under-utilized
  • Expert capacity constraints: Prevent any expert from getting too many tokens
  • Distributed placement: Experts spread across multiple GPUs
  • Router Z-loss: Stabilize training by penalizing large gate logits
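
To make the first and last bullets concrete, here is a minimal sketch of the two auxiliary losses in plain Python. It assumes the common formulations from the Switch Transformer and ST-MoE lines of work (load balancing as N · Σ fᵢ · Pᵢ, z-loss as the mean squared log-sum-exp of the router logits). I don't know which exact variants Qwen3.5 uses, and the function names are mine.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, top_k):
    """Auxiliary loss N * sum_i(f_i * P_i); equals 1.0 when routing is balanced."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    dispatch_frac = [0.0] * num_experts  # f_i: share of routing slots sent to expert i
    mean_prob = [0.0] * num_experts      # P_i: mean router probability of expert i
    for logits in router_logits:
        probs = softmax(logits)
        chosen = sorted(range(num_experts), key=lambda i: logits[i], reverse=True)[:top_k]
        for i in chosen:
            dispatch_frac[i] += 1.0 / (num_tokens * top_k)
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(dispatch_frac, mean_prob))


def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the logits; penalizes large gate logits."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(router_logits)


# Four tokens, four experts: each token prefers a different expert (balanced)
balanced = [[2.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Four tokens that all strongly prefer expert 0 (collapsed routing)
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(f"balanced:  {load_balancing_loss(balanced, top_k=1):.3f}")
print(f"collapsed: {load_balancing_loss(collapsed, top_k=1):.3f}")
print(f"z-loss (collapsed): {router_z_loss(collapsed):.1f}")
```

The balanced case sits at the loss floor, while the collapsed case approaches the number of experts. That gap is the pressure that pushes the router to spread tokens out during training.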

Qwen3.5 MoE Model Sizes

Qwen3.5 offers three MoE variants with different trade-offs:

Qwen3.5 MoE Model Comparison
┌──────────────────────┬──────────────┬──────────────┬──────────────┐
│ Model │ Total Params │ Active Params│ Activation % │
├──────────────────────┼──────────────┼──────────────┼──────────────┤
│ Qwen3.5-35B-A3B │ 35B │ 3B │ 8.6% │
│ Qwen3.5-122B-A10B │ 122B │ 10B │ 8.2% │
│ Qwen3.5-397B-A17B │ 397B │ 17B │ 4.3% │
└──────────────────────┴──────────────┴──────────────┴──────────────┘
Comparison to Dense:
┌──────────────────────┬──────────────┬──────────────┬──────────────┐
│ Dense 400B (hypot.) │ 400B │ 400B │ 100% │
│ Qwen3.5-397B-A17B │ 397B │ 17B │ 4.3% │
└──────────────────────┴──────────────┴──────────────┴──────────────┘
Theoretical reduction: ~23x fewer FLOPs per token (real-world speedup is usually smaller)
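
The percentages in the table are just active parameters divided by total parameters. The short sketch below recomputes them from the model names, along with the matching reduction in per-token compute. The helper name `activation_stats` is mine, and the "fewer FLOPs" figure is a parameter-count estimate, not a measured speedup.

```python
# (total parameters, active parameters), both in billions, from the table above
models = {
    "Qwen3.5-35B-A3B": (35, 3),
    "Qwen3.5-122B-A10B": (122, 10),
    "Qwen3.5-397B-A17B": (397, 17),
}


def activation_stats(total_b, active_b):
    """Return (fraction of parameters active per token, total/active ratio)."""
    return active_b / total_b, total_b / active_b


for name, (total_b, active_b) in models.items():
    ratio, reduction = activation_stats(total_b, active_b)
    print(f"{name}: {ratio:.1%} active, ~{reduction:.1f}x fewer FLOPs per token "
          f"than a Dense model of the same total size")
```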

Why MoE Works: Expert Specialization

MoE works because different experts learn to specialize in different types of knowledge or tasks. During training, the gating network learns which experts are best for which inputs.

Expert Specialization (Conceptual)
┌─────────────────────────────────────────────────────────────────┐
│ QWEN3.5 MOE EXPERTS │
├─────────────────────────────────────────────────────────────────┤
│ Expert 1 ──▶ Code generation, syntax, programming patterns │
│ Expert 2 ──▶ Mathematical reasoning, calculations │
│ Expert 3 ──▶ Creative writing, storytelling │
│ Expert 4 ──▶ Factual knowledge, encyclopedic information │
│ Expert 5 ──▶ Translation, multilingual tasks │
│ Expert 6 ──▶ Logical reasoning, analysis │
│ Expert 7 ──▶ Summarization, extraction │
│ Expert 8 ──▶ Dialogue, conversational patterns │
│ ... │
│ Expert 64 ──▶ Domain-specific knowledge (medical, legal, etc.) │
└─────────────────────────────────────────────────────────────────┘
When you ask "Write a Python function to sort a list":
Gate activates: Expert 1 (code) + Expert 6 (logic)
Other experts remain inactive
When you ask "Translate this to French":
Gate activates: Expert 5 (translation) + Expert 8 (dialogue)
Other experts remain inactive

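This specialization helps explain why MoE can approach Dense model quality. Keep in mind that the diagram is a simplification: routing happens per token (and separately in every MoE layer), not per query, and the specialization experts actually learn is usually much less tidy than these domain labels suggest. What matters is that the gating network learns to send each token to the experts that handle it best.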
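
If you want to check specialization rather than assume it, one simple approach is to log which experts the router picks for each token and tally the picks by token type. The sketch below uses a made-up routing log; the categories, expert indices, and the helper name `expert_usage_by_category` are all hypothetical, not measurements from Qwen3.5.

```python
from collections import Counter, defaultdict

# Hypothetical routing log: (token category, experts chosen for that token)
routing_log = [
    ("code", (0, 5)),
    ("code", (0, 2)),
    ("prose", (3, 5)),
    ("prose", (3, 1)),
    ("code", (0, 5)),
]


def expert_usage_by_category(log):
    """Count how often each expert is selected for each token category."""
    usage = defaultdict(Counter)
    for category, experts in log:
        usage[category].update(experts)
    return usage


usage = expert_usage_by_category(routing_log)
for category, counts in usage.items():
    print(category, counts.most_common(2))
```

An expert that dominates one category but rarely appears in others is a sign of specialization; experts shared across categories (like expert 5 in this toy log) are just as common in practice.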

Practical Implications

Speed Benefits

MoE models are faster than equivalently-sized Dense models because they compute far fewer operations per token.

Inference Speed Comparison (Conceptual)
# Dense 400B model
dense_params = 400_000_000_000
dense_flops_per_token = dense_params * 2  # one multiply + one add per parameter
print(f"Dense: {dense_flops_per_token / 1e9:.1f} GFLOPs per token")
# MoE 397B model with 17B active
moe_total_params = 397_000_000_000  # all of these must still sit in memory
moe_active_params = 17_000_000_000
moe_flops_per_token = moe_active_params * 2
print(f"MoE active: {moe_flops_per_token / 1e9:.1f} GFLOPs per token")
print(f"Theoretical speedup: {dense_flops_per_token / moe_flops_per_token:.1f}x")
# Output:
# Dense: 800.0 GFLOPs per token
# MoE active: 34.0 GFLOPs per token
# Theoretical speedup: 23.5x

Memory Considerations

MoE has a catch: you still need to load all parameters into memory, even though only a fraction are active.

Memory Requirements
┌──────────────────────┬──────────────┬──────────────┬──────────────┐
│ Model │ VRAM (FP16) │ Active VRAM │ Storage │
├──────────────────────┼──────────────┼──────────────┼──────────────┤
│ Dense 100B │ ~200 GB │ ~200 GB │ ~200 GB │
│ MoE 397B (17B active)│ ~800 GB │ ~34 GB │ ~800 GB │
└──────────────────────┴──────────────┴──────────────┴──────────────┘
Key insight: MoE reduces compute, not memory storage.
You still need hardware that can hold all 397B parameters.
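
The memory numbers follow directly from parameter count times bytes per parameter. Here is a rough sketch that repeats the calculation for a few precisions and estimates how many GPUs are needed just to hold the weights. It assumes 80 GB per GPU and ignores the KV cache, activations, and framework overhead, so treat the results as lower bounds; the function names are mine.

```python
import math

# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(num_params, dtype, gpu_memory_gb=80.0):
    """Lower bound on GPU count; gpu_memory_gb=80 is an assumed card size."""
    return math.ceil(weight_memory_gb(num_params, dtype) / gpu_memory_gb)


total_params = 397_000_000_000   # must all be resident in memory
active_params = 17_000_000_000   # actually used for any one token

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(total_params, dtype):.1f} GB total, "
          f"{weight_memory_gb(active_params, dtype):.1f} GB active per token, "
          f"at least {min_gpus(total_params, dtype)} x 80 GB GPUs")
```

Quantization shrinks the footprint, but the ratio between total and active weights stays the same: the full model still has to fit somewhere.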

When MoE Shines vs When It Doesn’t

MoE excels in specific scenarios:

MoE Performance Scenarios
+------------------+------------------------+--------------------------+
| Scenario | MoE Advantage | Reason |
+------------------+------------------------+--------------------------+
| GPU inference | Major speedup | GPU parallelism + sparse |
| Batched requests | Excellent throughput | Different experts per |
| | | request |
| Varied tasks | Maintains quality | Right expert for task |
+------------------+------------------------+--------------------------+
| CPU inference | Limited benefit | CPU lacks sparse matrix |
| | | optimization |
| Single stream | Good but less dramatic | One token at a time |
| Small batches | Overhead may dominate | Gate computation cost |
+------------------+------------------------+--------------------------+
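
The batching rows are easier to reason about with a quick estimate of how many distinct experts a batch touches in one layer. The sketch below assumes each token picks its top-k experts uniformly and independently, which real routers do not do, so it is only a rough guide. The 64-expert, top-2 configuration matches the toy layer from earlier, not Qwen3.5 itself.

```python
def expected_active_experts(num_experts, top_k, num_tokens):
    """Expected number of distinct experts hit by a batch in one MoE layer.

    Assumes uniform, independent routing: a given expert is missed by one
    token with probability (1 - top_k / num_experts).
    """
    p_missed_by_all = (1 - top_k / num_experts) ** num_tokens
    return num_experts * (1 - p_missed_by_all)


for batch_tokens in (1, 8, 32, 128, 512):
    hit = expected_active_experts(num_experts=64, top_k=2, num_tokens=batch_tokens)
    print(f"{batch_tokens:>4} tokens -> ~{hit:.1f} of 64 experts used")
```

With one token only two experts run, so most expert weights are never read. By a few hundred tokens nearly every expert is in use: the work spreads evenly across experts, but almost all expert weights have to be read for each layer.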

Qwen3.5 vs DeepSeek R1 Comparison

Qwen3.5-397B-A17B competes directly with DeepSeek R1 671B-A37B. Here’s how they compare:

Qwen3.5 vs DeepSeek R1 MoE Comparison
┌──────────────────────┬─────────────────────┬─────────────────────┐
│ Metric │ Qwen3.5-397B-A17B │ DeepSeek R1 671B-A37B │
├──────────────────────┼─────────────────────┼─────────────────────┤
│ Total parameters │ 397B │ 671B │
│ Active parameters │ 17B │ 37B │
│ Activation ratio │ 4.3% │ 5.5% │
│ Inference speed │ Faster │ Slower │
│ Quality │ Near flagship │ Flagship │
│ Memory requirement │ Lower │ Higher │
└──────────────────────┴─────────────────────┴─────────────────────┘
Key insight from technical reports:
"Qwen3.5-397B-A17B achieves comparable performance to DeepSeek R1
with fewer parameters AND fewer active parameters."
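
The "Faster" and "Lower" rows in the table can be put in rough numbers using only the parameter counts. This is a back-of-the-envelope sketch: it treats per-token compute as proportional to active parameters and weight storage as proportional to total parameters at equal precision, and it says nothing about quality. The `MoEModel` class is mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # active parameters per token, in billions

    @property
    def activation_ratio(self) -> float:
        return self.active_b / self.total_b


qwen = MoEModel("Qwen3.5-397B-A17B", 397, 17)
deepseek = MoEModel("DeepSeek R1 671B-A37B", 671, 37)

for m in (qwen, deepseek):
    print(f"{m.name}: {m.activation_ratio:.1%} of parameters active per token")

# Relative per-token compute and relative weight storage (same precision assumed)
active_gap = deepseek.active_b / qwen.active_b
total_gap = deepseek.total_b / qwen.total_b
print(f"DeepSeek R1 uses ~{active_gap:.2f}x the active parameters "
      f"and ~{total_gap:.2f}x the total parameters")
```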

Common Misconceptions

Misconception 1: “MoE is always faster”

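MoE reduces the compute per token, but how much of that shows up as wall-clock speed depends on the hardware and the serving stack. The gains are usually largest for GPU inference with batching; CPU-bound scenarios may not benefit as much. Routing adds some overhead, and the kernels that dispatch tokens to experts tend to be less optimized on CPUs.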

Misconception 2: “Active parameters = model quality”

Total parameters matter for knowledge capacity. Active parameters affect inference speed. A 397B MoE with 17B active has far more stored knowledge than a 17B Dense model, even though the compute per token is similar.

Misconception 3: “MoE is new”

MoE was introduced in 1991 (Jacobs et al.) and applied to transformers in 2020 (GShard). Qwen3.5 builds on decades of research; it is not a new invention.

When to Choose MoE

Based on my analysis, MoE makes sense when:

  1. You need flagship quality at lower cost: Qwen3.5-397B-A17B gives near-flagship performance at a much lower compute cost per token, as long as you have enough memory to hold the full model.

  2. Your workloads are batched: MoE shines when processing multiple requests in parallel, allowing different experts to handle different queries simultaneously.

  3. You have GPU infrastructure: MoE’s sparse activation patterns are optimized for GPU parallelism.

  4. Your use cases are varied: MoE’s expert specialization handles diverse tasks well.

Consider Dense models when:

  1. Memory is constrained: MoE requires loading all parameters even if only using a fraction.

  2. Single-stream latency is critical: Dense models have more predictable latency.

  3. CPU-only deployment: Dense models may perform better without GPU sparse matrix optimization.

Summary

In this post, I explained Qwen3.5’s Mixture of Experts architecture:

  • Sparse activation: Only 4-9% of parameters are active for each token
  • Gating network: Dynamically routes tokens to relevant experts
  • Expert specialization: Different experts handle different knowledge domains
  • Speed vs memory trade-off: Faster inference but requires full memory

Qwen3.5-397B-A17B achieves flagship-level quality with just 17B active parameters out of 397B total. This makes powerful AI more accessible without requiring massive compute infrastructure for every query.

The key insight is that not all parameters need to participate in every decision. By letting experts specialize and routing tokens intelligently, MoE models get the best of both worlds: the knowledge capacity of huge models with the inference speed of smaller ones.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can contact me by email.
