How to Deploy Qwen3.5 Locally with llama.cpp: Complete Step-by-Step Guide

I wanted to run Qwen3.5 locally but couldn’t find clear instructions. Most guides assumed I already knew how to compile llama.cpp or skipped critical steps. After piecing together the documentation, I found three reliable approaches to get Qwen3.5 running on my machine.

The Problem

Deploying LLMs locally sounds straightforward until you try it. You face multiple questions:

  • Which GGUF model should I download?
  • How do I compile llama.cpp with GPU support?
  • What parameters should I use for inference?
  • How do I set up an API server for production?

The official documentation is scattered, and most tutorials skip the compilation steps or assume you’re using pre-built binaries. I needed a complete, working solution from zero to running model.

Solution Overview

There are three main approaches depending on your needs:

  1. Quick Start: Let llama.cpp download and run the model automatically
  2. Offline-First: Pre-download models for unstable networks or air-gapped environments
  3. Production API: Set up an OpenAI-compatible server for application integration

Let me walk through each one.

Approach 1: Quick Start (Auto-Download)

This approach downloads the model on first run. It’s the fastest way to get started.

Step 1: Compile llama.cpp with CUDA Support

First, clone and build llama.cpp:

Clone llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Now compile with CUDA support for GPU acceleration:

Build llama.cpp with CUDA
cmake . -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build build --config Release -j

When I ran this on my RTX 4090, the compilation took about 2 minutes. The resulting binaries are in build/bin/.
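Before moving on, it can help to confirm that the build actually produced the two binaries used in the rest of this guide. The following is a minimal sketch, assuming a Linux or macOS build where the files are named llama-cli and llama-server (on Windows they would carry an .exe suffix); the helper name missing_binaries is my own, not part of llama.cpp.

```python
from pathlib import Path

# Binaries this guide relies on (names as used in the commands of this article).
EXPECTED_BINARIES = ("llama-cli", "llama-server")


def missing_binaries(build_bin_dir):
    """Return the expected binary names that are not present in build_bin_dir."""
    bin_dir = Path(build_bin_dir)
    return [name for name in EXPECTED_BINARIES if not (bin_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_binaries("build/bin")
    if missing:
        print(f"missing: {missing}")
    else:
        print("all binaries present")
```

Run it from the llama.cpp checkout directory. An empty result means both tools were built.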

Step 2: Run Qwen3.5 Directly

The simplest command downloads and runs the model in one step:

Run Qwen3.5 with auto-download
./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384 \
  --temp 0.6

What happens when you run this:

  1. llama.cpp connects to HuggingFace
  2. Downloads the MXFP4_MOE quantization (about 22GB)
  3. Caches it locally for future runs
  4. Starts the interactive chat session

The -hf flag tells llama.cpp to download from HuggingFace, using the form repository:quantization. MXFP4_MOE is a 4-bit quantization aimed at MoE models and offers a good quality-to-size trade-off for Qwen3.5.
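The value passed to -hf packs two things into one string: the HuggingFace repository and the quantization tag, separated by a colon. The sketch below shows that structure. It is an illustrative helper of my own (split_hf_spec), not llama.cpp's actual parser.

```python
def split_hf_spec(spec):
    """Split 'user/repo:QUANT' into (repo, quant); quant is None when omitted."""
    repo, sep, quant = spec.partition(":")
    if repo.count("/") != 1:
        raise ValueError(f"expected 'user/repo[:quant]', got {spec!r}")
    return repo, (quant if sep and quant else None)


repo, quant = split_hf_spec("unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE")
print(repo)   # repository part
print(quant)  # quantization tag
```

Swapping the tag after the colon is all it takes to request a different quantization from the same repository.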

By default, llama.cpp caches models in ~/.cache/llama.cpp. You can change this:

Set custom cache directory
export LLAMA_CACHE=/data/models/llama-cache
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

This keeps your models organized and prevents re-downloading.
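If you script around llama.cpp, you may want to know where cached models will land. Here is a minimal sketch of the lookup as described in this article: LLAMA_CACHE if set, otherwise ~/.cache/llama.cpp. This is an assumption based on the text above. The real tool may use other platform-specific locations that this helper ignores.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the cache directory implied by LLAMA_CACHE, or the default path."""
    env = os.environ if env is None else env
    override = env.get("LLAMA_CACHE")
    if override:
        return Path(override)
    # Default location as stated in this article (Linux-style path).
    return Path.home() / ".cache" / "llama.cpp"


print(resolve_cache_dir())
```

Passing a dictionary instead of the real environment makes the function easy to check without touching your shell settings.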

Approach 2: Offline-First (Pre-Download Models)

If you have unstable network access or need to deploy in air-gapped environments, pre-download the models first.

Step 1: Install HuggingFace CLI

Install huggingface_hub with transfer acceleration
pip install huggingface_hub hf_transfer

The hf_transfer package speeds up large downloads, but huggingface_hub only uses it when the HF_HUB_ENABLE_HF_TRANSFER=1 environment variable is set.

Step 2: Download the Model

Download Qwen3.5 GGUF model
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
  --local-dir ./models/35b \
  --include "*MXFP4_MOE*"

This downloads only the MXFP4_MOE variant to ./models/35b/. The download is about 22GB.
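To see why the --include pattern picks up only one variant, here is a small sketch of glob-style filtering. I am assuming the CLI matches file names with shell-style wildcards, which Python's fnmatch imitates. Apart from the first file name, which comes from this article, the listing is hypothetical.

```python
from fnmatch import fnmatch


def filter_by_include(filenames, pattern):
    """Keep only the file names that match a shell-style glob pattern."""
    return [name for name in filenames if fnmatch(name, pattern)]


# Hypothetical repository listing, for illustration only.
example_files = [
    "Qwen3.5-35B-A3B-MXFP4_MOE.gguf",
    "Qwen3.5-35B-A3B-Q4_K_M.gguf",
    "README.md",
]

print(filter_by_include(example_files, "*MXFP4_MOE*"))
```

The leading and trailing asterisks matter: without them the pattern would have to match the whole file name.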

Step 3: Run from Local File

Run Qwen3.5 from local file
./build/bin/llama-cli \
  --model ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
  --ctx-size 16384 \
  --temp 0.6 \
  --n-gpu-layers -1

The --n-gpu-layers -1 flag offloads all layers to GPU for maximum speed.

Available Quantizations

Unsloth provides multiple quantization options:

Qwen3.5 Quantization Options
| Quantization | Size | Quality | Speed |
|----------------|---------|----------|----------|
| MXFP4_MOE | ~22 GB | Good | Fastest |
| Q4_K_M | ~20 GB | Better | Fast |
| Q5_K_M | ~24 GB | Best | Moderate |
| Q8_0 | ~38 GB | Excellent| Slower |

For most use cases, MXFP4_MOE offers the best balance. Use Q5_K_M if you have extra VRAM and want better quality.
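Choosing a quantization mostly comes down to what fits in memory. The sketch below uses the approximate sizes from the table above to list the options that fit a given budget. The function name and the reserve_gb parameter (headroom for context and runtime buffers) are my own illustration, not part of any tool.

```python
# Approximate file sizes in GB, copied from the table in this article.
QUANT_SIZES_GB = {"MXFP4_MOE": 22, "Q4_K_M": 20, "Q5_K_M": 24, "Q8_0": 38}


def quants_that_fit(memory_budget_gb, reserve_gb=0.0):
    """Return quantization names whose weights fit the budget, smallest first."""
    usable = memory_budget_gb - reserve_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= usable]
    return [name for size, name in sorted(fitting)]


print(quants_that_fit(24))                # weights only
print(quants_that_fit(24, reserve_gb=4))  # leaving 4 GB of headroom
```

Leaving headroom changes the answer noticeably, so it is worth deciding on a context size before picking a file.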

Approach 3: Production API Server

For integrating Qwen3.5 into applications, use llama-server to create an OpenAI-compatible API.

Step 1: Start the Server

Start llama.cpp API server
./build/bin/llama-server \
  --model ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
  --port 8001 \
  --ctx-size 16384 \
  --host 0.0.0.0

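The server listens on port 8001 and exposes an OpenAI-compatible API. Binding to 0.0.0.0 makes it reachable from other machines on the network; use --host 127.0.0.1 if you only need local access.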

Step 2: Test with curl

Test API with curl
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one paragraph"}
    ],
    "temperature": 0.7
  }'

Step 3: Use with OpenAI SDK

api_client.py
from openai import OpenAI

# Point the client at the local llama-server
client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="sk-no-key-required",  # placeholder; only checked if the server was started with --api-key
)

# Same call shape as the hosted OpenAI API
response = client.chat.completions.create(
    model="qwen3.5",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)

This is powerful because any code that uses the OpenAI SDK works with your local Qwen3.5 server. Just change the base_url and you’re done.
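If you would rather not install the OpenAI SDK, the same endpoint can be called with the standard library alone. This is a minimal sketch, assuming the server from Step 1 is running on port 8001. The helper names build_chat_request and ask are my own.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8001/v1",
                       model="qwen3.5", temperature=0.6):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(ask("Explain quantum computing")) should print the model's reply. Keeping request construction separate from sending makes the payload easy to inspect.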

Parameter Tuning

Qwen3.5 has specific recommendations for inference parameters:

Thinking Mode vs Non-Thinking Mode

Qwen3.5 supports two modes. The temperature settings differ:

Non-thinking mode (faster responses)
./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0

Thinking mode (better reasoning)
./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0

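Thinking mode enables the model’s step-by-step reasoning at the cost of longer responses. These flags only change sampling; they do not switch modes by themselves. Verify the values against the Qwen3.5 model card before relying on them: the Qwen3 model cards pair temperature 0.6 / top-p 0.95 with thinking mode and 0.7 / 0.8 with non-thinking mode.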

Context Window Size

Qwen3.5 supports up to 131K context, but that requires significant memory:

Context Size vs Memory Overhead
| Context Size | Additional Memory |
|--------------|-------------------|
| 4K | ~1 GB |
| 8K | ~2 GB |
| 16K | ~4 GB |
| 32K | ~8 GB |
| 64K | ~16 GB |
| 128K | ~32 GB |

For a 24GB GPU running the 35B model, stick to 16K context. Adjust based on your available memory.
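The table above works out to roughly 1 GB of extra memory per 4K tokens of context, which is easy to turn into a quick estimate. The sketch below assumes that linear rule holds. Real KV cache usage depends on the model architecture and cache settings, so treat the result as a rough planning figure rather than a measurement.

```python
# Rule of thumb derived from the table in this article: about 1 GB per 4K tokens.
GB_PER_4K_TOKENS = 1.0


def estimate_context_overhead_gb(ctx_tokens):
    """Estimate the additional memory (GB) needed for a given context size."""
    return GB_PER_4K_TOKENS * ctx_tokens / 4096


def estimate_total_gb(model_size_gb, ctx_tokens):
    """Estimate model weights plus context overhead, in GB."""
    return model_size_gb + estimate_context_overhead_gb(ctx_tokens)


print(estimate_context_overhead_gb(16384))
print(estimate_total_gb(22, 16384))
```

With the figures from this article, a 22 GB model at 16K context comes to about 26 GB, so check actual memory usage on your own card and reduce the context or the number of GPU layers if it does not fit.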

Common Mistakes to Avoid

Mistake 1: Not Setting LLAMA_CACHE

Without setting the cache directory, models download to ~/.cache/llama.cpp by default. On systems with limited home directory space, this fills up quickly.

Wrong: Default cache location
# Models download to ~/.cache/llama.cpp (might be a small partition)
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

Correct: Set cache to large storage
export LLAMA_CACHE=/data/models/llama-cache
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

Mistake 2: Exceeding Context Size

If you set --ctx-size too high for your GPU memory, you get out-of-memory errors:

Wrong: Context too large for 24GB GPU
# 131072 context requires ~32GB+ just for KV cache
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 131072 # Will fail on 24GB GPU!

Start with smaller context and increase gradually:

Correct: Start conservative, increase as needed
# Safe for 24GB GPU
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384

Mistake 3: Wrong Temperature for Mode

Using the wrong temperature for thinking vs non-thinking mode gives suboptimal results:

Temperature Settings by Mode
| Mode | Temperature | Top-P |
|-------------|-------------|-------|
| Non-thinking| 0.6 | 0.95 |
| Thinking | 0.7 | 0.8 |

Mistake 4: Forgetting CUDA Compilation Flag

Compiling without CUDA support means CPU-only inference, which is significantly slower:

Wrong: CPU-only build
cmake . -B build
cmake --build build --config Release -j

Correct: CUDA-enabled build
cmake . -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

The difference can be on the order of 10-50x in inference speed, depending on your CPU, GPU, and model.

Running Larger Models

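For the flagship 397B model, you need significant hardware. Here’s the setup for a Mac with unified memory. Since the MXFP4 build needs about 214GB (see below), a 192GB machine falls short, so plan for a 256GB or larger configuration. On macOS, llama.cpp uses Metal by default, so the CUDA flag isn’t needed.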

Running Qwen3.5-397B on Mac
hf download unsloth/Qwen3.5-397B-A17B-GGUF \
  --local-dir ./models/397b \
  --include "*MXFP4_MOE*"

./build/bin/llama-cli \
  --model ./models/397b/Qwen3.5-397B-A17B-MXFP4_MOE.gguf \
  --ctx-size 8192 \
  --temp 0.6

The 397B model at MXFP4 quantization requires about 214GB of memory.

Quick Reference Commands

Here’s a cheat sheet for common operations:

Quick reference commands
# Quick start with auto-download
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 0.6

# Run from local file
./build/bin/llama-cli -m ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf --ctx-size 16384 --temp 0.6

# Start API server
./build/bin/llama-server -m ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf --port 8001 --ctx-size 16384

# Test API
curl http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3.5","messages":[{"role":"user","content":"Hello"}]}'

Summary

Deploying Qwen3.5 locally with llama.cpp is straightforward once you know the steps:

  1. Compile llama.cpp with -DGGML_CUDA=ON for GPU acceleration
  2. Choose your approach: Quick start (auto-download), offline-first (pre-download), or production API
  3. Set the cache directory with LLAMA_CACHE environment variable
  4. Use appropriate parameters: 0.6 temperature for non-thinking, 0.7 for thinking mode
  5. Match context size to your memory: Start with 16K on 24GB GPUs

The three approaches cover all deployment scenarios: quick testing, offline environments, and production applications. With the OpenAI-compatible server, any existing code using the OpenAI SDK works with your local Qwen3.5 instance.
