How to Deploy Qwen3.5 as an OpenAI-Compatible API Server with llama-server

Mar 24, 2026

Purpose

I wanted to replace OpenAI’s API with a local Qwen3.5 deployment. The problem? My existing applications use the OpenAI SDK, and I didn’t want to rewrite all that code.

The solution: llama-server from llama.cpp provides a drop-in OpenAI-compatible API endpoint.

Environment

Qwen3.5 model (GGUF format)
llama.cpp compiled with llama-server
Python with openai package
Linux/macOS

The llama-server Setup

llama-server is part of llama.cpp. It exposes an HTTP server that mimics the OpenAI API format.

Start the server with:

./llama.cpp/llama-server \
  --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --alias "Qwen3.5-35B-A3B" \
  --temp 0.6 \
  --top-p 0.95 \
  --ctx-size 16384 \
  --top-k 20 \
  --min-p 0.00 \
  --port 8001

Here are the key options:

--model: Path to your GGUF model file
--alias: Friendly name for the model (appears in API responses)
--temp 0.6: Temperature for sampling (Qwen3.5 recommends 0.5-0.7)
--ctx-size 16384: Context window size (adjust based on memory)
--port 8001: Server port (default is 8080)
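
For scripted deployments, the same flags can be assembled in Python. This is a minimal sketch, assuming the binary and model paths from the command above; SERVER_OPTIONS, build_command and start_server are hypothetical names of my own, not part of llama.cpp.

```python
import subprocess

# Flags copied from the command above; adjust the paths to your setup.
SERVER_OPTIONS = {
    "--model": "unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_K_M.gguf",
    "--alias": "Qwen3.5-35B-A3B",
    "--temp": 0.6,
    "--top-p": 0.95,
    "--ctx-size": 16384,
    "--top-k": 20,
    "--min-p": 0.0,
    "--port": 8001,
}


def build_command(binary="./llama.cpp/llama-server", options=SERVER_OPTIONS):
    """Flatten a {flag: value} mapping into an argv list for subprocess."""
    argv = [binary]
    for flag, value in options.items():
        argv.extend([flag, str(value)])
    return argv


def start_server():
    """Launch llama-server in the background and return the process handle."""
    return subprocess.Popen(build_command())
```

Keeping the flags in one mapping makes it easy to change the port or context size per environment without editing a shell script.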

When the server starts, I see:

server listening on http://127.0.0.1:8001
model loaded successfully
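
Loading a large model can take a while, so a client script may want to wait before sending requests. The sketch below polls the server's /health endpoint, assuming your llama-server build exposes it and answers 200 once the model is ready; http_status and wait_until_ready are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # a non-2xx answer, e.g. while the model is still loading
    except (urllib.error.URLError, OSError):
        return None       # server not accepting connections yet


def wait_until_ready(base="http://127.0.0.1:8001", timeout=120.0,
                     interval=1.0, probe=http_status, sleep=time.sleep):
    """Poll /health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(f"{base}/health") == 200:
            return True
        sleep(interval)
    return False
```

The probe and sleep parameters are injectable only so the loop can be exercised without a running server.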

Client Integration

Now I can use the OpenAI SDK as usual. Only the client construction changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="sk-no-key-required"  # Any string works
)

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "Hello, who are you?"}]
)

print(response.choices[0].message.content)

The key change: base_url points to my local server. Everything else stays the same.
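
To avoid hard-coding the address, the connection settings can come from the environment. This is a sketch under the assumption that the variables OPENAI_BASE_URL and OPENAI_API_KEY are used for this purpose; client_settings and make_client are hypothetical helpers.

```python
import os


def client_settings(env=os.environ):
    """Resolve connection settings, falling back to the local llama-server."""
    return {
        "base_url": env.get("OPENAI_BASE_URL", "http://127.0.0.1:8001/v1"),
        "api_key": env.get("OPENAI_API_KEY", "sk-no-key-required"),
    }


def make_client():
    """Create an OpenAI client from the resolved settings."""
    # Imported here so the settings helper works without the SDK installed.
    from openai import OpenAI
    return OpenAI(**client_settings())
```

Recent versions of the openai package also read these two variables on their own when OpenAI() is called without arguments, so exporting them may be enough to redirect existing code. Check this against the SDK version you have installed.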

When I run this:

python test_api.py

I get:

I am Qwen, a large language model developed by Alibaba Cloud...
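
To see what "OpenAI-compatible" means on the wire, here is a rough equivalent of the same call using only the standard library. It is a sketch, not what the SDK sends byte for byte (the SDK adds its own headers); build_chat_request and ask are hypothetical names, and the URL and model alias are the ones from my setup above.

```python
import json
import urllib.request


def build_chat_request(prompt, base="http://127.0.0.1:8001/v1",
                       model="Qwen3.5-35B-A3B"):
    """Build a POST request for a single-message chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("Hello, who are you?") against the running server should return the same kind of reply as the SDK example.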

Function Calling Support

Qwen3.5 supports native function calling. This is useful for building AI agents.

Here’s a complete example:

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="local"
)

# Define tools for the model to call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

# Model returns function call instead of text
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

When I run this:

python function_calling.py

I get:

Function: get_weather
Arguments: {"location": "Tokyo"}

The model understood my request and returned a structured function call. I can now execute the actual function and pass the result back.
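
That last step can be sketched as follows. The get_weather body is a hypothetical placeholder (a real version would query a weather service), and TOOL_REGISTRY, execute_tool_call, tool_result_message and answer_with_tools are illustrative names of my own. The message shapes follow the OpenAI chat format for tool results.

```python
import json


def get_weather(location):
    """Hypothetical stand-in; a real version would query a weather service."""
    return {"location": location, "forecast": "placeholder"}


TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tool_call(name, arguments_json):
    """Look up the requested function and call it with the decoded arguments."""
    func = TOOL_REGISTRY[name]
    return func(**json.loads(arguments_json))


def tool_result_message(tool_call_id, result):
    """Wrap a tool result in a 'tool' role message for the follow-up request."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }


def answer_with_tools(client, messages, tools, model="Qwen3.5-35B-A3B"):
    """One round trip: run any requested tools, then ask for the final answer."""
    first = client.chat.completions.create(model=model, messages=messages, tools=tools)
    reply = first.choices[0].message
    if not reply.tool_calls:
        return reply.content
    follow_up = messages + [reply]  # keep the assistant turn that requested the tools
    for call in reply.tool_calls:
        result = execute_tool_call(call.function.name, call.function.arguments)
        follow_up.append(tool_result_message(call.id, result))
    final = client.chat.completions.create(model=model, messages=follow_up, tools=tools)
    return final.choices[0].message.content
```

With the client and tools from the example above, answer_with_tools(client, [{"role": "user", "content": "What's the weather in Tokyo?"}], tools) would run get_weather locally and let the model phrase the final answer.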

llama-server vs Ollama

I’ve used both. Here’s my comparison:

Feature	Ollama	llama-server
Setup	Easier	More configuration
API	OpenAI-compatible	OpenAI-compatible
Function calling	Limited	Full support
Production ready	Good	Better

For quick experiments, Ollama is fine. But for production agents with function calling, I prefer llama-server.

Why This Matters

Deploying Qwen3.5 as an OpenAI-compatible API gives me:

Minimal code changes - Existing OpenAI SDK code works once base_url points to the local server
Function calling - Native support for building AI agents
Streaming - Server-sent events for real-time responses
Cost savings - No per-token charges after setup
Privacy - Data never leaves my infrastructure
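
Streaming is the one item in this list I did not demonstrate above. A minimal sketch, assuming the server streams chunks in the usual OpenAI format when stream=True is passed; collect_stream and stream_reply are hypothetical helper names.

```python
def collect_stream(chunks, on_token=None):
    """Join the text deltas from a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:      # some chunks may carry only metadata
            continue
        text = chunk.choices[0].delta.content
        if text:                   # role-only and final chunks have no text
            parts.append(text)
            if on_token:
                on_token(text)
    return "".join(parts)


def stream_reply(client, prompt, model="Qwen3.5-35B-A3B"):
    """Print tokens as they arrive and return the full reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))
```

Separating the chunk handling from the API call keeps the joining logic usable with any iterable of chunks.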

Summary

In this post, I showed how to deploy Qwen3.5 as an OpenAI-compatible API server using llama-server. The key step is pointing the OpenAI SDK’s base_url to http://127.0.0.1:8001/v1. With native function calling support, this setup is well suited to building local AI agents without rewriting existing OpenAI-based code.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can reach me by email: Email me

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 llama.cpp GitHub
👨‍💻 Qwen3.5 Model Card

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!