Qwen3.5 Quantization Accuracy: How Much Performance Do You Lose with 4-bit Models?
I couldn’t believe what I was reading. A 3-bit quantized model outperforming a 4-bit one? That defies everything I thought I knew about quantization. Yet there it was, in black and white: the UD-Q3_K_XL scored 80.7% accuracy while UD-Q4_K_XL scored 80.5%. Same original model, different compression levels, and the more aggressive compression won.
The Shocking Numbers
Benjamin Marie ran an independent benchmark using 750 mixed questions from LiveCodeBench v6, MMLU Pro, GPQA, and Math500. Here are the results for Qwen3.5-397B:
| Version            | Accuracy | Change vs. FP16 | Disk Size |
|--------------------|----------|-----------------|-----------|
| Original FP16      | 81.3%    | baseline        | ~807 GB   |
| UD-Q4_K_XL (4-bit) | 80.5%    | -0.8 pts        | ~214 GB   |
| UD-Q3_K_XL (3-bit) | 80.7%    | -0.6 pts        | ~160 GB   |

Let me emphasize this: the 3-bit model scored higher than the 4-bit model. And both lost less than 1 percentage point of accuracy compared to the full-precision original.
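To put those figures side by side, here is a small Python sketch that derives retained accuracy, compression ratio, and a rough bits-per-weight estimate from the numbers above. It assumes a parameter count of 397 billion (taken from the model name, not from the benchmark) and treats the disk sizes as decimal gigabytes, so the bits-per-weight column is only an approximation.

```python
# Figures copied from the benchmark table: (accuracy in %, approximate disk size in GB).
RESULTS = {
    "Original FP16": (81.3, 807),
    "UD-Q4_K_XL": (80.5, 214),
    "UD-Q3_K_XL": (80.7, 160),
}

# Assumption: 397e9 parameters, inferred from the model name only.
ASSUMED_PARAMS = 397e9


def summarize(results, baseline="Original FP16", params=ASSUMED_PARAMS):
    """Derive relative accuracy, compression ratio, and average bits per weight."""
    base_acc, base_size = results[baseline]
    rows = {}
    for name, (acc, size_gb) in results.items():
        rows[name] = {
            "retained_pct": 100 * acc / base_acc,          # share of FP16 accuracy kept
            "compression": base_size / size_gb,            # how many times smaller on disk
            "bits_per_weight": size_gb * 1e9 * 8 / params,  # rough average bit depth
        }
    return rows


summary = summarize(RESULTS)
for name, row in summary.items():
    print(f"{name:14s} retained={row['retained_pct']:.1f}%  "
          f"compression={row['compression']:.2f}x  "
          f"bits/weight~{row['bits_per_weight']:.2f}")
```

With these inputs, the 3-bit file comes out roughly 5x smaller than FP16 and the 4-bit file just under 4x, at about 3.2 and 4.3 bits per weight respectively.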
Why This Matters
I’ve been running quantized models for years, always accepting the tradeoff: smaller size means lower quality. I assumed 4-bit was the practical limit, and anything below that would be unusable.
But Unsloth’s Dynamic 2.0 quantization breaks this assumption. The key insight:
Important layers get automatically promoted to 8-bit or 16-bit precision, while less critical layers are compressed more aggressively.
This isn’t a one-size-fits-all compression. It’s intelligent bit allocation based on layer importance.
How Dynamic Quantization Works
When I first read about this, I needed to understand the mechanism. Here’s what happens during quantization:
- Importance Analysis: The quantization process analyzes each layer’s contribution to model performance
- Bit Allocation: Critical layers receive higher bit depth (8-bit or even 16-bit)
- Aggressive Compression: Less important layers get squeezed to 3-bit or 4-bit
- Result: Average bit depth might be 3.x or 4.x, but with smarter distribution
```text
Critical Attention Layers:   16-bit (preserved)
Important FFN Layers:        8-bit  (partial precision)
Moderate Importance Layers:  4-bit  (standard compression)
Low Impact Layers:           3-bit  (aggressive compression)
```

The math explains the counter-intuitive results:
- FP16: 16 bits per parameter across all layers
- 4-bit average: ~4 bits per parameter (4x compression)
- 3-bit average: ~3 bits per parameter (5.3x compression)
- Dynamic: Varies per layer based on learned importance
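The steps above can be sketched in a few lines of Python. This is a toy illustration and not Unsloth's actual algorithm: the layer names, parameter counts, importance scores, and thresholds are all hypothetical, chosen only to show how a mix of bit depths produces a fractional average.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    params: int        # number of weights in the layer (hypothetical)
    importance: float  # hypothetical sensitivity score in [0, 1]


def allocate_bits(importance: float) -> int:
    """Map an importance score to a bit depth using made-up thresholds."""
    if importance >= 0.9:
        return 16  # critical layers keep full precision
    if importance >= 0.7:
        return 8   # important layers get partial precision
    if importance >= 0.3:
        return 4   # standard compression
    return 3       # aggressive compression for low-impact layers


def average_bits(layers) -> float:
    """Parameter-weighted average bit depth across all layers."""
    total_bits = sum(layer.params * allocate_bits(layer.importance) for layer in layers)
    total_params = sum(layer.params for layer in layers)
    return total_bits / total_params


# Hypothetical model: a few small sensitive layers and many large tolerant ones.
layers = [
    Layer("attention.critical", 2_000_000, 0.95),
    Layer("ffn.important", 8_000_000, 0.75),
    Layer("ffn.moderate", 40_000_000, 0.50),
    Layer("experts.low_impact", 50_000_000, 0.10),
]

avg = average_bits(layers)
print(f"average bits per weight: {avg:.2f}")
print(f"compression vs FP16: {16 / avg:.2f}x")
```

In this toy example the average lands at about 4.06 bits per weight, even though half of the weights sit at 3 bits and a small slice stays at 16. That is the sense in which a "3-bit" or "4-bit" dynamic file has an average bit depth of 3.x or 4.x.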
Running Quantized Qwen3.5
I tested this myself with llama.cpp. Here’s how to select different quantization formats:
```shell
# Recommended 4-bit format for Qwen3.5 (Dynamic MXFP4)
./llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

# Alternative: UD-Q4_K_XL (standard 4-bit dynamic)
./llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL

# Ultra-compressed 3-bit (kept ~99.3% of FP16 accuracy in the 397B benchmark above)
./llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q3_K_XL

# For the 397B model (must use sharded GGUF; pass the first shard)
./llama-cli --model Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf
```

The MXFP4_MOE variant is specifically optimized for Mixture-of-Experts models like Qwen3.5. It applies different quantization strategies to the expert layers versus the shared layers.
Hardware Recommendations
Based on my testing and the benchmark data, here’s what I recommend for different hardware tiers:
Consumer GPUs (24GB VRAM)
```shell
# Qwen3.5-35B at UD-Q3_K_XL fits comfortably
# Uses ~18GB VRAM with full context
./llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q3_K_XL \
  --ctx-size 8192 \
  --n-gpu-layers 99
```

Workstation GPUs (48GB VRAM)

```shell
# Qwen3.5-35B at UD-Q4_K_XL with room to spare
# Or Qwen3.5-72B at UD-Q3_K_XL
./llama-cli -hf unsloth/Qwen3.5-72B-GGUF:UD-Q3_K_XL \
  --ctx-size 16384 \
  --n-gpu-layers 99
```

Apple Silicon (Unified Memory)

```shell
# Full precision for smaller models, quantized for larger
# Qwen3.5-35B works well at UD-Q4_K_XL
./llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --ctx-size 32768
```

What I Learned
I’ve changed my approach to model deployment based on these findings:
- Don’t default to 4-bit: Test 3-bit first - it might work better for your use case
- Not all quantization is equal: Unsloth’s Dynamic 2.0 outperforms static quantization significantly
- Storage savings are massive: 807GB to 160GB means you can actually store multiple model variants
- Quality loss is negligible: A drop of less than 1 percentage point in accuracy is imperceptible in most applications
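One practical reason to test on your own workload: with only 750 questions, scores this close together carry some sampling noise. The sketch below estimates that noise with a simple binomial standard error. It assumes every question is an independent pass/fail trial and ignores that both variants answered the same questions, so treat it as a rough yardstick and not as a proper significance test.

```python
import math

N_QUESTIONS = 750  # benchmark size reported in this article


def standard_error_pts(accuracy_pct, n=N_QUESTIONS):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


q3_acc, q4_acc = 80.7, 80.5          # UD-Q3_K_XL and UD-Q4_K_XL scores
gap = q3_acc - q4_acc                # observed difference in points

se_q3 = standard_error_pts(q3_acc)
se_q4 = standard_error_pts(q4_acc)
# Simplification: treats the two scores as independent samples.
se_gap = math.sqrt(se_q3 ** 2 + se_q4 ** 2)

print(f"gap between variants: {gap:.1f} pts")
print(f"standard error per score: ~{se_q3:.2f} pts")
print(f"standard error of the gap: ~{se_gap:.2f} pts")
```

Under these assumptions each score carries roughly 1.4 points of noise, so the 0.2 point gap is best read as "about the same quality at a smaller size", which is still a strong result for the 3-bit file.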
Common Mistakes to Avoid
I made these mistakes when I first started experimenting:
Assuming all 4-bit quantization is the same - Static 4-bit quantization can lose 3-5% accuracy. Dynamic quantization like Unsloth’s preserves much more quality.
Over-provisioning memory - I used to think I needed 48GB for a 35B model. With 3-bit quantization, 24GB works fine.
Not testing on your specific workload - Benchmark numbers are averages. Your specific use case might have different results.
Ignoring MXFP4_MOE for MoE models - Qwen3.5 uses mixture-of-experts architecture. The specialized MXFP4_MOE quantization handles this better than generic formats.
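For the memory point above, a back-of-the-envelope estimate is usually enough to avoid over-provisioning. The sketch below multiplies parameter count by average bits per weight. The 35 billion parameter count comes from the model name, the 3.2 and 4.3 bits-per-weight values are rough averages implied by the 397B file sizes (the 35B files may differ), and the 4 GB overhead for KV cache and buffers is a hypothetical placeholder you should replace with your own measurement.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, overhead_gb=4.0):
    """Check whether weights plus an assumed overhead fit in the given VRAM."""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# Assumed average bit depths; real GGUF files vary by model and quant recipe.
CANDIDATES = {"FP16": 16.0, "4-bit dynamic": 4.3, "3-bit dynamic": 3.2}

for label, bits in CANDIDATES.items():
    weights = weight_memory_gb(35, bits)
    verdict = "fits" if fits(35, bits, vram_gb=24) else "does not fit"
    print(f"{label:14s} ~{weights:.1f} GB of weights, {verdict} in 24 GB")
```

The overhead grows with context length, so rerun the estimate with a larger `overhead_gb` before raising `--ctx-size`.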
When to Use Each Format
From my experience:
- UD-Q3_K_XL: Production inference where storage is constrained, quality-critical applications
- UD-Q4_K_XL: Balanced deployment, good for most use cases
- MXFP4_MOE: Best for MoE models (Qwen3.5-35B-A3B, Qwen3.5-397B-A17B)
- FP16: Research, fine-tuning, when you absolutely need maximum quality
The Bottom Line
Unsloth’s dynamic quantization delivers near-flagship model quality at roughly a quarter to a fifth of the storage cost. With less than 1 percentage point of accuracy loss and roughly 4x to 5x compression in this benchmark, quantized Qwen3.5 models are production-ready for most applications.
The 3-bit variant occasionally outperforming 4-bit isn’t a bug - it’s a feature of intelligent layer prioritization. When important layers keep their precision and unimportant ones get compressed aggressively, you get the best of both worlds.
I now run Qwen3.5-35B at UD-Q3_K_XL on my RTX 4090. The quality is indistinguishable from the full-precision model for my coding and writing tasks, and I have 160GB of disk space back.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Unsloth Dynamic Quantization
- Benjamin Marie Benchmark
- llama.cpp Documentation
- Qwen3.5 Model Card