How to Deploy Qwen3.5 Locally with llama.cpp: Complete Step-by-Step Guide

Mar 24, 2026

I wanted to run Qwen3.5 locally but couldn’t find clear instructions. Most guides assumed I already knew how to compile llama.cpp or skipped critical steps. After piecing together the documentation, I found three reliable approaches to get Qwen3.5 running on my machine.

The Problem

Deploying LLMs locally sounds straightforward until you try it. You face multiple questions:

Which GGUF model should I download?
How do I compile llama.cpp with GPU support?
What parameters should I use for inference?
How do I set up an API server for production?

The official documentation is scattered, and most tutorials skip the compilation steps or assume you’re using pre-built binaries. I needed a complete, working solution from zero to a running model.

Solution Overview

There are three main approaches depending on your needs:

Quick Start: Let llama.cpp download and run the model automatically
Offline-First: Pre-download models for unstable networks or air-gapped environments
Production API: Set up an OpenAI-compatible server for application integration

Let me walk through each one.

Approach 1: Quick Start (Recommended)

This approach downloads the model on first run. It’s the fastest way to get started.

Step 1: Compile llama.cpp with CUDA Support

First, clone and build llama.cpp:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Now compile with CUDA support for GPU acceleration:

cmake . -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build build --config Release -j

On my machine with an RTX 4090, compilation takes about 2 minutes. The resulting binaries are in build/bin/.

Step 2: Run Qwen3.5 Directly

The simplest command downloads and runs the model in one step:

./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384 \
  --temp 0.6

What happens when you run this:

llama.cpp connects to HuggingFace
Downloads the MXFP4_MOE quantization (about 22GB)
Caches it locally for future runs
Starts the interactive chat session

The -hf flag tells llama.cpp to fetch the model from HuggingFace, and the suffix after the colon selects the quantization. MXFP4_MOE is a 4-bit format aimed at MoE models and offers a good quality-to-size trade-off for Qwen3.5.

Step 3: Set Up Caching (Optional but Recommended)

By default, llama.cpp caches models in ~/.cache/llama.cpp. You can change this:

export LLAMA_CACHE=/data/models/llama-cache
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

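This keeps your models in one place on a disk of your choosing. Add the export to your shell profile so every run uses the same cache and nothing is downloaded twice.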
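To see what is already cached before kicking off another 22GB download, a small script can resolve the cache directory as described above and list the GGUF files in it. This is a minimal sketch: it only handles the LLAMA_CACHE variable and the ~/.cache/llama.cpp default mentioned in this article, the function names are my own, and it assumes cached models are stored as .gguf files somewhere under that directory.

```python
import os
from pathlib import Path


def resolve_llama_cache(env=None):
    """Return the cache directory: LLAMA_CACHE if set, else ~/.cache/llama.cpp."""
    env = os.environ if env is None else env
    override = env.get("LLAMA_CACHE")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "llama.cpp"


def list_cached_models(cache_dir):
    """Return (file name, size in GB) for every .gguf file under cache_dir."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(
        (p.name, p.stat().st_size / 1e9)
        for p in cache_dir.rglob("*.gguf")
    )


if __name__ == "__main__":
    cache = resolve_llama_cache()
    print(f"Cache directory: {cache}")
    for name, size_gb in list_cached_models(cache):
        print(f"  {name}  {size_gb:.1f} GB")
```

Passing a dictionary instead of reading os.environ directly keeps the lookup easy to test without touching your real environment.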

Approach 2: Offline-First (Pre-Download Models)

If you have unstable network access or need to deploy in air-gapped environments, pre-download the models first.

Step 1: Install HuggingFace CLI

pip install huggingface_hub hf_transfer

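The hf_transfer package enables parallel downloads and is significantly faster. Note that huggingface_hub only uses it when the HF_HUB_ENABLE_HF_TRANSFER=1 environment variable is set, and recent huggingface_hub releases handle fast transfers differently, so check the documentation for your installed version.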

Step 2: Download the Model

hf download unsloth/Qwen3.5-35B-A3B-GGUF \
  --local-dir ./models/35b \
  --include "*MXFP4_MOE*"

This downloads only the MXFP4_MOE variant to ./models/35b/. The download is about 22GB.
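Before copying a download to an air-gapped machine, it is worth a quick sanity check that the files really are GGUF models and not, say, a saved error page. GGUF files begin with the four ASCII bytes GGUF, so the check is cheap. The sketch below is illustrative only: the helper names are my own, and it does not detect truncated downloads, since a cut-off file still starts with the right bytes.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def find_quant_files(model_dir, quant):
    """Return .gguf files under model_dir whose name contains the quant tag."""
    return sorted(
        p for p in Path(model_dir).rglob("*.gguf") if quant in p.name
    )


def looks_like_gguf(path):
    """Check the 4-byte magic header; catches empty files and saved error pages."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    files = find_quant_files("./models/35b", "MXFP4_MOE")
    if not files:
        print("No matching GGUF files found")
    for path in files:
        status = "ok" if looks_like_gguf(path) else "NOT a GGUF file"
        print(f"{path}  {path.stat().st_size / 1e9:.1f} GB  {status}")
```

Comparing the reported size against the roughly 22GB expected for this quantization is a reasonable second check for an incomplete download.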

Step 3: Run from Local File

./build/bin/llama-cli \
  --model ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
  --ctx-size 16384 \
  --temp 0.6 \
  --n-gpu-layers -1

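The --n-gpu-layers flag controls how many layers are offloaded to the GPU. I pass -1 to offload all layers for maximum speed; if your build treats -1 differently, a value larger than the model’s layer count, such as 99, has the same effect.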

Available Quantizations

Unsloth provides multiple quantization options:

| Quantization | Size   | Quality   | Speed    |
|--------------|--------|-----------|----------|
| MXFP4_MOE    | ~22 GB | Good      | Fastest  |
| Q4_K_M       | ~20 GB | Better    | Fast     |
| Q5_K_M       | ~24 GB | Best      | Moderate |
| Q8_0         | ~38 GB | Excellent | Slower   |

For most use cases, MXFP4_MOE offers the best balance of size and speed. Use Q5_K_M or Q8_0 if you have extra VRAM and want better quality.

Approach 3: Production API Server

For integrating Qwen3.5 into applications, use llama-server to create an OpenAI-compatible API.

Step 1: Start the Server

./build/bin/llama-server \
  --model ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
  --port 8001 \
  --ctx-size 16384 \
  --host 0.0.0.0

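The server starts on port 8001 and exposes an OpenAI-compatible API. Because of --host 0.0.0.0 it listens on all network interfaces; use --host 127.0.0.1 if only local clients need access, or add --api-key to require a key.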
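Loading a 22GB model takes a while, and requests sent before it finishes will fail. llama-server provides a /health endpoint that returns HTTP 200 once the model is ready, so a deployment script can wait on it. Here is a minimal sketch using only the standard library; the function name, timeout, and polling interval are my own choices, not anything llama.cpp prescribes.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url, timeout_s=120.0, poll_s=1.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading (e.g. HTTP 503)
        time.sleep(poll_s)
    return False
```

A typical use would be to call wait_for_server("http://127.0.0.1:8001") right after launching llama-server and to abort the deployment if it returns False.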

Step 2: Test with curl

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one paragraph"}
    ],
    "temperature": 0.7
  }'
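If you would rather script this check without installing any packages, the same request can be sent from Python’s standard library. This is a rough equivalent of the curl call above, not an official client; the chat function and its defaults are my own, and it assumes the server returns the usual OpenAI-style choices list.

```python
import json
import urllib.request


def chat(base_url, prompt, model="qwen3.5", temperature=0.7, timeout_s=300):
    """POST one user message to /v1/chat/completions and return the reply text."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    # The reply text sits in the first choice of an OpenAI-style response
    return body["choices"][0]["message"]["content"]
```

For example, chat("http://localhost:8001", "Explain quantum computing in one paragraph") sends the same message as the curl command.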

Step 3: Use with OpenAI SDK

from openai import OpenAI

# Point to local llama-server
client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="sk-no-key-required"
)

# Use exactly like OpenAI API
response = client.chat.completions.create(
    model="qwen3.5",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.6
)

print(response.choices[0].message.content)

This is powerful because most code written against the OpenAI SDK’s chat completions API works with your local Qwen3.5 server. Just change the base_url and you’re done.

Parameter Tuning

Qwen3.5 has specific recommendations for inference parameters:

Thinking Mode vs Non-Thinking Mode

Qwen3.5 supports two modes, and the recommended sampling settings differ between them:

# Thinking mode
./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0

# Non-thinking mode
./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0

Thinking mode enables the model’s reasoning capabilities and uses the lower temperature with a higher top-p; non-thinking mode uses a slightly higher temperature with a lower top-p.
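If you launch llama-cli from scripts, keeping the two sets of sampling values in one place avoids mixing them up. The sketch below is one way to do that. The mode-to-values mapping (0.6 / 0.95 for thinking, 0.7 / 0.8 for non-thinking, top-k 20 and min-p 0 for both) follows Qwen’s published guidance for Qwen3-family models and should be treated as an assumption for Qwen3.5; the function and preset names are my own. It only sets sampling flags and does not itself switch the model between modes.

```python
# Sampling presets per mode. The mode-to-values mapping follows Qwen's
# published guidance for Qwen3-family models and is an assumption here.
SAMPLING_PRESETS = {
    "thinking": {"temp": 0.6, "top-p": 0.95, "top-k": 20, "min-p": 0},
    "non-thinking": {"temp": 0.7, "top-p": 0.8, "top-k": 20, "min-p": 0},
}


def llama_cli_command(mode,
                      model_ref="unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE",
                      binary="./build/bin/llama-cli"):
    """Build the llama-cli argument list for the given mode."""
    if mode not in SAMPLING_PRESETS:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(SAMPLING_PRESETS)}"
        )
    args = [binary, "-hf", model_ref]
    for flag, value in SAMPLING_PRESETS[mode].items():
        args += [f"--{flag}", str(value)]
    return args
```

The returned list can be passed straight to subprocess.run, which avoids shell quoting problems.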

Context Window Size

Qwen3.5 supports a context of up to 131,072 tokens (128K), but that requires significant memory:

| Context Size | Additional Memory |
|--------------|-------------------|
| 4K           | ~1 GB             |
| 8K           | ~2 GB             |
| 16K          | ~4 GB             |
| 32K          | ~8 GB             |
| 64K          | ~16 GB            |
| 128K         | ~32 GB            |

For a 24GB GPU running the 35B model, stick to 16K context. Adjust based on your available memory.
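The table follows a simple linear rule, roughly 0.25 GB per 1K tokens of context, so you can estimate sizes that are not listed. Here is a small sketch of that arithmetic. The constant is read off the table above rather than measured, real usage depends on the model and KV cache settings, and the function names are my own.

```python
GB_PER_1K_TOKENS = 0.25  # read off the table above: 4K -> ~1 GB, 128K -> ~32 GB


def context_memory_gb(ctx_tokens):
    """Approximate extra memory for a given --ctx-size, using the table's linear rule."""
    return ctx_tokens / 1024 * GB_PER_1K_TOKENS


def max_context_tokens(spare_gb, step=1024):
    """Largest context (rounded down to a multiple of step) that fits in spare_gb."""
    tokens = int(spare_gb / GB_PER_1K_TOKENS * 1024)
    return tokens - tokens % step


if __name__ == "__main__":
    for ctx in (4096, 16384, 131072):
        print(f"--ctx-size {ctx}: ~{context_memory_gb(ctx):.0f} GB extra")
    print(f"4 GB spare -> --ctx-size {max_context_tokens(4)}")
```

Treat the result as a starting point: pick the estimate, run the model, and watch actual memory use before increasing the context further.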

Common Mistakes to Avoid

Mistake 1: Not Setting LLAMA_CACHE

Without setting the cache directory, models download to ~/.cache/llama.cpp by default. On systems with limited home directory space, this fills up quickly.

# Default: models download to ~/.cache/llama.cpp (might be on a small partition)
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

# Better: point the cache at a larger disk first
export LLAMA_CACHE=/data/models/llama-cache
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE

Mistake 2: Exceeding Context Size

If you set --ctx-size too high for your GPU memory, you get out-of-memory errors:

# 131072 context requires ~32GB+ just for KV cache
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 131072  # Will fail on 24GB GPU!

Start with smaller context and increase gradually:

# Safe for 24GB GPU
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384

Mistake 3: Wrong Temperature for Mode

Using the wrong temperature for thinking vs non-thinking mode gives suboptimal results:

| Mode         | Temperature | Top-P |
|--------------|-------------|-------|
| Thinking     | 0.6         | 0.95  |
| Non-thinking | 0.7         | 0.8   |

Mistake 4: Forgetting CUDA Compilation Flag

Compiling without CUDA support means CPU-only inference, which is significantly slower:

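# CPU-only build (no GPU acceleration)
cmake . -B build
cmake --build build --config Release -j

# CUDA-enabled build
cmake . -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

The difference can be 10-50x in inference speed, depending on the model and hardware.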

Running Larger Models

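For the flagship 397B model, you need significant hardware. At MXFP4 quantization it requires about 214GB of memory for the weights alone, so on a Mac you need a unified-memory configuration comfortably above that figure; a 192GB machine cannot hold it fully in memory. On macOS, build llama.cpp without -DGGML_CUDA=ON, since Metal support is enabled by default there. Here’s the setup:

hf download unsloth/Qwen3.5-397B-A17B-GGUF \
  --local-dir ./models/397b \
  --include "*MXFP4_MOE*"

./build/bin/llama-cli \
  --model ./models/397b/Qwen3.5-397B-A17B-MXFP4_MOE.gguf \
  --ctx-size 8192 \
  --temp 0.6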

Quick Reference Commands

Here’s a cheat sheet for common operations:

# Quick start with auto-download
./build/bin/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE --ctx-size 16384 --temp 0.6

# Run from local file
./build/bin/llama-cli -m ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf --ctx-size 16384 --temp 0.6

# Start API server
./build/bin/llama-server -m ./models/35b/Qwen3.5-35B-A3B-MXFP4_MOE.gguf --port 8001 --ctx-size 16384

# Test API
curl http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3.5","messages":[{"role":"user","content":"Hello"}]}'

Summary

Deploying Qwen3.5 locally with llama.cpp is straightforward once you know the steps:

Compile llama.cpp with -DGGML_CUDA=ON for GPU acceleration
Choose your approach: Quick start (auto-download), offline-first (pre-download), or production API
Set the cache directory with LLAMA_CACHE environment variable
Use appropriate parameters: 0.6 temperature for thinking mode, 0.7 for non-thinking mode
Match context size to your memory: Start with 16K on 24GB GPUs

The three approaches cover the common deployment scenarios: quick testing, offline environments, and production applications. With the OpenAI-compatible server, most existing code built on the OpenAI SDK’s chat completions API works with your local Qwen3.5 instance.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
