How to Deploy Qwen3.5 Locally with llama.cpp: Complete Step-by-Step Guide
I wanted to run Qwen3.5 locally but couldn’t find clear instructions. Most guides assumed I already knew how to compile llama.cpp or skipped critical steps. After piecing together the documentation, I found three reliable approaches to get Qwen3.5 running on my machine.
The Problem
Deploying LLMs locally sounds straightforward until you try it. You face multiple questions:
- Which GGUF model should I download?
- How do I compile llama.cpp with GPU support?
- What parameters should I use for inference?
- How do I set up an API server for production?
The official documentation is scattered, and most tutorials skip the compilation steps or assume you’re using pre-built binaries. I needed a complete, working solution from zero to running model.
Solution Overview
There are three main approaches depending on your needs:
- Quick Start: Let llama.cpp download and run the model automatically
- Offline-First: Pre-download models for unstable networks or air-gapped environments
- Production API: Set up an OpenAI-compatible server for application integration
Let me walk through each one.
Approach 1: Quick Start (Recommended)
This approach downloads the model on first run. It’s the fastest way to get started.
Step 1: Compile llama.cpp with CUDA Support
First, clone and build llama.cpp:
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
Now compile with CUDA support for GPU acceleration:
    cmake . -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
    cmake --build build --config Release -j
On my RTX 4090 machine, compilation takes about 2 minutes. The resulting binaries are in build/bin/.
Step 2: Run Qwen3.5 Directly
The simplest command downloads and runs the model in one step:
    ./build/bin/llama-cli \
      -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
      --ctx-size 16384 \
      --temp 0.6
What happens when you run this:
- llama.cpp connects to HuggingFace
- Downloads the MXFP4_MOE quantization (about 22GB)
- Caches it locally for future runs
- Starts the interactive chat session
The -hf flag tells llama.cpp to download from HuggingFace. The MXFP4_MOE quantization is optimized for MoE models and provides the best quality-to-size ratio for Qwen3.5.
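The argument passed to -hf has the shape user/repo:QUANT. As a rough illustration of that shape, here is a small Python sketch that splits such a string into its two parts. The function name parse_hf_spec is my own, and this is not how llama.cpp parses the flag internally; it only mirrors the format used in the command above.

```python
from __future__ import annotations


def parse_hf_spec(spec: str) -> tuple[str, str | None]:
    """Split 'user/repo:QUANT' into (repo, quant); quant is None if absent."""
    repo, sep, quant = spec.partition(":")
    # A repo id is expected to look like 'user/name' with both parts present.
    if repo.count("/") != 1 or not all(repo.split("/")):
        raise ValueError(f"expected 'user/repo[:quant]', got {spec!r}")
    return repo, (quant if sep and quant else None)


repo, quant = parse_hf_spec("unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE")
print(repo)   # unsloth/Qwen3.5-35B-A3B-GGUF
print(quant)  # MXFP4_MOE
```

A helper like this is handy when a script needs to log or validate which repository and quantization it is about to pull.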
Step 3: Set Up Caching (Optional but Recommended)
By default, llama.cpp caches models in ~/.cache/llama.cpp. You can change this:
    export LLAMA_CACHE=/data/models/llama-cache
    ./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE
This keeps your models organized in one location. Cached models are reused on later runs instead of being downloaded again.
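Before a 22GB download, it can be worth checking where the cache will end up and how much space is free there. The sketch below assumes the behavior described in this guide (LLAMA_CACHE if set, otherwise ~/.cache/llama.cpp); the helper names are hypothetical, and your platform's default location may differ.

```python
import os
import shutil
from pathlib import Path

# Default location as stated in this guide; treat it as an assumption.
DEFAULT_CACHE = Path.home() / ".cache" / "llama.cpp"


def resolve_cache_dir(env=None) -> Path:
    """Return LLAMA_CACHE if set and non-empty, else the default location."""
    env = os.environ if env is None else env
    override = env.get("LLAMA_CACHE")
    return Path(override) if override else DEFAULT_CACHE


def free_gb(path: Path) -> float:
    """Free space in GB on the filesystem holding path (or its nearest existing parent)."""
    probe = path
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


cache = resolve_cache_dir()
print(f"cache dir: {cache}, free: {free_gb(cache):.1f} GB (model needs ~22 GB)")
```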
Approach 2: Offline-First (Pre-Download Models)
If you have unstable network access or need to deploy in air-gapped environments, pre-download the models first.
Step 1: Install HuggingFace CLI
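    pip install huggingface_hub hf_transfer
The hf_transfer package enables faster parallel downloads. In huggingface_hub versions that support it, it is only used when the environment variable HF_HUB_ENABLE_HF_TRANSFER=1 is set.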
Step 2: Download the Model
    hf download unsloth/Qwen3.5-35B-A3B-GGUF \
      --local-dir ./models/35b \
      --include "*MXFP4_MOE*"
This downloads only the MXFP4_MOE variant to ./models/35b/. The download is about 22GB.
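If you would rather script the download, huggingface_hub exposes snapshot_download, whose allow_patterns argument plays the same role as --include. The sketch below is one way to wire that up. The file listing is made up for illustration (I have not checked the actual file names in the repo), and matching_files only approximates the glob filtering with fnmatch.

```python
from fnmatch import fnmatch

REPO_ID = "unsloth/Qwen3.5-35B-A3B-GGUF"
PATTERN = "*MXFP4_MOE*"


def matching_files(filenames, pattern=PATTERN):
    """Return the filenames a glob-style include pattern would select."""
    return [name for name in filenames if fnmatch(name, pattern)]


def download(local_dir="./models/35b"):
    """Python counterpart of the hf download command; needs huggingface_hub installed."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=[PATTERN],
    )


# Hypothetical repo listing, for illustration only.
example_listing = [
    "Qwen3.5-35B-A3B-MXFP4_MOE.gguf",
    "Qwen3.5-35B-A3B-Q4_K_M.gguf",
    "README.md",
]
print(matching_files(example_listing))  # ['Qwen3.5-35B-A3B-MXFP4_MOE.gguf']
```

Calling download() performs the actual transfer and returns the local folder path.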
Step 3: Run from Local File
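    ./build/bin/llama-cli \
      --model ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
      --ctx-size 16384 \
      --temp 0.6 \
      --n-gpu-layers -1
The --n-gpu-layers -1 flag is meant to offload all layers to the GPU for maximum speed. If your llama.cpp build does not treat -1 as "all layers", pass a number at least as large as the model's layer count, such as 99.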
Available Quantizations
Unsloth provides multiple quantization options:
| Quantization | Size   | Quality   | Speed    |
|--------------|--------|-----------|----------|
| MXFP4_MOE    | ~22 GB | Good      | Fastest  |
| Q4_K_M       | ~20 GB | Better    | Fast     |
| Q5_K_M       | ~24 GB | Best      | Moderate |
| Q8_0         | ~38 GB | Excellent | Slower   |

For most use cases, MXFP4_MOE offers the best balance. Use Q5_K_M if you have extra VRAM and want better quality.
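To turn the table above into a quick check, the following sketch lists which quantizations would fit entirely in a given amount of memory. The sizes are copied from the table; the reserve_gb parameter (headroom for context and runtime buffers) and the function name are my own assumptions, so treat the result as a rough filter rather than a guarantee.

```python
# Approximate file sizes in GB, copied from the quantization table.
QUANT_SIZES_GB = {"MXFP4_MOE": 22, "Q4_K_M": 20, "Q5_K_M": 24, "Q8_0": 38}


def quants_that_fit(memory_gb, reserve_gb):
    """Return quantizations whose size fits in memory_gb minus reserve_gb, largest first."""
    budget = memory_gb - reserve_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return [name for size, name in sorted(fitting, reverse=True)]


# Example: 32 GB of memory with 4 GB held back for context.
print(quants_that_fit(32, 4))  # ['Q5_K_M', 'MXFP4_MOE', 'Q4_K_M']
```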
Approach 3: Production API Server
For integrating Qwen3.5 into applications, use llama-server to create an OpenAI-compatible API.
Step 1: Start the Server
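    ./build/bin/llama-server \
      --model ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
      --port 8001 \
      --ctx-size 16384 \
      --host 0.0.0.0
The server starts on port 8001 and exposes an OpenAI-compatible API. Binding to 0.0.0.0 makes it reachable from other machines on the network; use --host 127.0.0.1 if only local clients should connect.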
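Loading a 22GB model takes a while, so a client that connects immediately may fail. One way to handle this, sketched below, is to poll the server's /health endpoint until it answers with HTTP 200. The retry counts and function names are my own choices, and the probe is injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url, timeout=2.0):
    """True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(base_url, attempts=60, delay=1.0, probe=server_ready):
    """Poll until the probe succeeds; return False if it never does."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False


# Example call once llama-server has been started:
# wait_until_ready("http://127.0.0.1:8001")
```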
Step 2: Test with curl
    curl http://localhost:8001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3.5",
        "messages": [
          {"role": "user", "content": "Explain quantum computing in one paragraph"}
        ],
        "temperature": 0.7
      }'
Step 3: Use with OpenAI SDK
    from openai import OpenAI

    # Point to local llama-server
    client = OpenAI(
        base_url="http://127.0.0.1:8001/v1",
        api_key="sk-no-key-required",
    )

    # Use exactly like the OpenAI API
    response = client.chat.completions.create(
        model="qwen3.5",
        messages=[
            {"role": "user", "content": "Explain quantum computing"}
        ],
        temperature=0.6,
    )

    print(response.choices[0].message.content)
This is powerful because code that uses the OpenAI SDK's chat completions API can work with your local Qwen3.5 server. Just change the base_url and you’re done.
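If you don't want to install the OpenAI SDK, the same request can be sent with the Python standard library. This is a minimal sketch that mirrors the curl call from Step 2; the function names are hypothetical, and it assumes the response follows the usual chat completions shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="qwen3.5", temperature=0.6):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call against the server from Step 1:
# print(chat("http://127.0.0.1:8001/v1", "Explain quantum computing"))
```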
Parameter Tuning
Qwen3.5 has specific recommendations for inference parameters:
Thinking Mode vs Non-Thinking Mode
Qwen3.5 supports two modes. The temperature settings differ:
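    # Non-thinking mode
    ./build/bin/llama-cli \
      -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
      --temp 0.6 \
      --top-p 0.95 \
      --min-p 0

    # Thinking mode
    ./build/bin/llama-cli \
      -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
      --temp 0.7 \
      --top-p 0.8 \
      --top-k 20 \
      --min-p 0
The thinking mode enables the model’s reasoning capabilities but requires different sampling settings. The mode labels here follow the table under Mistake 3. Note that the Qwen3 model cards pair these values the other way round (0.6 / 0.95 for thinking, 0.7 / 0.8 for non-thinking), so check the Qwen3.5 model card before settling on a preset.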
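To avoid retyping sampling flags, the two presets can live in one place and be expanded into command-line arguments. The sketch below copies the values from the commands in this section and labels them according to the table under Mistake 3; both the labels and the helper name sampling_args are assumptions worth verifying against the model card.

```python
# Values copied from this guide's commands; mode labels follow its Mistake 3 table.
PRESETS = {
    "non-thinking": {"temp": 0.6, "top-p": 0.95, "min-p": 0},
    "thinking": {"temp": 0.7, "top-p": 0.8, "top-k": 20, "min-p": 0},
}


def sampling_args(mode):
    """Expand a preset into a flat list of llama-cli style arguments."""
    args = []
    for flag, value in PRESETS[mode].items():
        args += [f"--{flag}", str(value)]
    return args


print(sampling_args("thinking"))
# ['--temp', '0.7', '--top-p', '0.8', '--top-k', '20', '--min-p', '0']
```

The resulting list can be appended to a subprocess argument list that starts with the llama-cli binary and model flags.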
Context Window Size
Qwen3.5 supports up to 131K context, but that requires significant memory:
| Context Size | Additional Memory |
|--------------|-------------------|
| 4K           | ~1 GB             |
| 8K           | ~2 GB             |
| 16K          | ~4 GB             |
| 32K          | ~8 GB             |
| 64K          | ~16 GB            |
| 128K         | ~32 GB            |

For a 24GB GPU running the 35B model, stick to 16K context. Adjust based on your available memory.
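The context table scales linearly, at roughly 0.25 GB per 1K tokens. The sketch below encodes that rule of thumb and picks the largest listed context size that fits a given memory budget. The constant is derived only from the table in this guide, so real usage will vary with the model and cache settings.

```python
# Derived from the table: 4K -> ~1 GB, 128K -> ~32 GB.
GB_PER_1K_TOKENS = 0.25

CONTEXT_SIZES = (4096, 8192, 16384, 32768, 65536, 131072)


def extra_memory_gb(ctx_tokens):
    """Estimate additional memory in GB for a given context size."""
    return ctx_tokens / 1024 * GB_PER_1K_TOKENS


def largest_context(budget_gb, candidates=CONTEXT_SIZES):
    """Largest candidate context whose estimated memory fits the budget, or None."""
    fitting = [c for c in candidates if extra_memory_gb(c) <= budget_gb]
    return max(fitting) if fitting else None


print(extra_memory_gb(16384))  # 4.0
print(largest_context(10))     # 32768
```

Here budget_gb is whatever memory is left after the model weights are loaded.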
Common Mistakes to Avoid
Mistake 1: Not Setting LLAMA_CACHE
Without setting the cache directory, models download to ~/.cache/llama.cpp by default. On systems with limited home directory space, this fills up quickly.
    # Default: models download to ~/.cache/llama.cpp (might be on a small partition)
    ./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

    # Better: point the cache at a larger volume first
    export LLAMA_CACHE=/data/models/llama-cache
    ./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE
Mistake 2: Exceeding Context Size
If you set --ctx-size too high for your GPU memory, you get out-of-memory errors:
    # 131072 context requires ~32GB+ just for KV cache
    ./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
      --ctx-size 131072  # Will fail on 24GB GPU!
Start with a smaller context and increase gradually:
    # Safe for 24GB GPU
    ./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
      --ctx-size 16384
Mistake 3: Wrong Temperature for Mode
Using the wrong temperature for thinking vs non-thinking mode gives suboptimal results:
| Mode         | Temperature | Top-P |
|--------------|-------------|-------|
| Non-thinking | 0.6         | 0.95  |
| Thinking     | 0.7         | 0.8   |

Mistake 4: Forgetting CUDA Compilation Flag
Compiling without CUDA support means CPU-only inference, which is significantly slower:
    # Wrong: CPU-only build
    cmake . -B build
    cmake --build build --config Release -j

    # Right: build with CUDA support
    cmake . -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j
The difference can be 10-50x in inference speed, depending on your hardware.
Running Larger Models
For the flagship 397B model, you need significant hardware. Here’s the setup for a Mac with 192GB+ unified memory:
    hf download unsloth/Qwen3.5-397B-A17B-GGUF \
      --local-dir ./models/397b \
      --include "*MXFP4_MOE*"

    ./build/bin/llama-cli \
      --model ./models/397b/Qwen3.5-397B-A17B-MXFP4_MOE.gguf \
      --ctx-size 8192 \
      --temp 0.6
The 397B model at MXFP4 quantization requires about 214GB of memory. That is more than 192GB, so plan for a machine whose memory exceeds the model size plus the context overhead.
Quick Reference Commands
Here’s a cheat sheet for common operations:
    # Quick start with auto-download
    ./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 0.6

    # Run from local file
    ./build/bin/llama-cli -m ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf --ctx-size 16384 --temp 0.6

    # Start API server
    ./build/bin/llama-server -m ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf --port 8001 --ctx-size 16384

    # Test API
    curl http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3.5","messages":[{"role":"user","content":"Hello"}]}'
Summary
Deploying Qwen3.5 locally with llama.cpp is straightforward once you know the steps:
- Compile llama.cpp with -DGGML_CUDA=ON for GPU acceleration
- Choose your approach: Quick start (auto-download), offline-first (pre-download), or production API
- Set the cache directory with the LLAMA_CACHE environment variable
- Use appropriate parameters: 0.6 temperature for non-thinking, 0.7 for thinking mode
- Match context size to your memory: Start with 16K on 24GB GPUs
The three approaches cover all deployment scenarios: quick testing, offline environments, and production applications. With the OpenAI-compatible server, any existing code using the OpenAI SDK works with your local Qwen3.5 instance.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.