How to Deploy Qwen3.5 as an OpenAI-Compatible API Server with llama-server

Purpose

I wanted to replace OpenAI’s API with a local Qwen3.5 deployment. The problem? My existing applications use the OpenAI SDK, and I didn’t want to rewrite all that code.

The solution: llama-server from llama.cpp provides a drop-in OpenAI-compatible API endpoint.

Environment

  • Qwen3.5 model (GGUF format)
  • llama.cpp compiled with llama-server
  • Python with openai package
  • Linux/macOS

The llama-server Setup

llama-server is part of llama.cpp. It exposes an HTTP server that mimics the OpenAI API format.

Start the server with:

Start llama-server
./llama.cpp/llama-server \
    --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --alias "Qwen3.5-35B-A3B" \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --top-k 20 \
    --min-p 0.00 \
    --port 8001

Here is what the key options do:

  • --model: Path to your GGUF model file
  • --alias: Friendly name for the model (appears in API responses)
  • --temp 0.6: Temperature for sampling (Qwen3.5 recommends 0.5-0.7)
  • --ctx-size 16384: Context window size (adjust based on memory)
  • --port 8001: Server port (default is 8080)
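If the server is launched from a Python script rather than a shell, the same flags can be assembled programmatically. The following is a minimal sketch, assuming the binary and model paths from the command above; `build_server_command` is a hypothetical helper name, not part of llama.cpp.

```python
import shlex


def build_server_command(model_path, alias, port=8001, ctx_size=16384,
                         temp=0.6, top_p=0.95, top_k=20, min_p=0.0,
                         binary="./llama.cpp/llama-server"):
    """Return the llama-server argument list for the given settings."""
    return [
        binary,
        "--model", model_path,
        "--alias", alias,
        "--temp", str(temp),
        "--top-p", str(top_p),
        "--ctx-size", str(ctx_size),
        "--top-k", str(top_k),
        "--min-p", str(min_p),
        "--port", str(port),
    ]


cmd = build_server_command(
    "unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_K_M.gguf",
    "Qwen3.5-35B-A3B",
)
# Print the command as a shell-quoted string for inspection
print(shlex.join(cmd))
# To actually start the server: subprocess.Popen(cmd)
```

Keeping the flags in one function makes it easier to change the port or context size per environment without editing a shell script.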

When the server starts, I see:

Server startup output
server listening on http://127.0.0.1:8001
model loaded successfully
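Loading a large model can take a while, so a client script may want to wait until the server is ready instead of watching the log. Here is a rough sketch that polls the server's `/health` endpoint using only the standard library. It assumes the endpoint answers 200 once the model is loaded and an error status before that; `wait_for_server` is a hypothetical helper.

```python
import time
import urllib.request


def wait_for_server(base_url="http://127.0.0.1:8001", timeout=60.0, interval=1.0):
    """Poll GET {base_url}/health until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timed out, or an HTTP error status
            # (urllib's HTTPError and URLError are both OSError subclasses)
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` before the first request returns `True` as soon as the server responds, or `False` if it never comes up within the timeout.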

Client Integration

Now I can use the OpenAI SDK as usual. Only the client construction changes:

test_api.py
from openai import OpenAI

# Point the SDK at the local llama-server instead of api.openai.com
client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="sk-no-key-required",  # Any string works
)

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B",  # Matches the --alias passed to llama-server
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(response.choices[0].message.content)

The key change: base_url points to my local server. Everything else stays the same.
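To make clearer what "OpenAI-compatible" means on the wire, here is a small sketch of the HTTP request that a chat completion boils down to, built with the standard library only. It assumes the server address and alias used above; `build_chat_request` is a hypothetical helper, and the commented-out part needs a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key="sk-no-key-required"):
    """Build a POST request for the OpenAI-style /chat/completions route."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_chat_request(
    "http://127.0.0.1:8001/v1",
    "Qwen3.5-35B-A3B",
    [{"role": "user", "content": "Hello, who are you?"}],
)
print(req.get_method(), req.full_url)
# prints: POST http://127.0.0.1:8001/v1/chat/completions

# Sending it requires a running server:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

This is why the base_url ends in /v1: the SDK appends the route name to it, so the server only has to answer on the same paths as the hosted API.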

When I run this:

Test the API
python test_api.py

I get:

Output
I am Qwen, a large language model developed by Alibaba Cloud...

Function Calling Support

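Qwen3.5 supports native function calling, which is useful for building AI agents. Depending on the llama.cpp version, llama-server may need the --jinja flag before it parses tool calls, so check llama-server --help for your build.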

Here’s a complete example:

function_calling.py
from openai import OpenAI
import json

client = OpenAI(
    base_url="http://127.0.0.1:8001/v1",
    api_key="local",
)

# Define tools for the model to call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# When the model decides to use a tool, it returns a function call instead of text
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    # arguments is a JSON string; decode it with json.loads() before use
    print(f"Arguments: {tool_call.function.arguments}")

When I run this:

Test function calling
python function_calling.py

I get:

Output
Function: get_weather
Arguments: {"location": "Tokyo"}

The model understood my request and returned a structured function call. I can now execute the actual function and pass the result back.
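The "execute and pass the result back" step can be sketched as follows. This is an illustration under assumptions, not a complete agent loop: `get_weather` here is a hypothetical stub returning made-up data, `TOOL_REGISTRY`, `run_tool_call` and `append_tool_result` are names I introduce for the example, and the `call_0` id stands in for whatever id the server returns.

```python
import json


def get_weather(location):
    """Hypothetical stand-in: a real version would query a weather service."""
    return {"location": location, "temperature_c": 21, "condition": "sunny"}


# Map tool names the model may request to local Python functions
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name, arguments_json):
    """Decode the model's JSON arguments and call the matching local function."""
    func = TOOL_REGISTRY[name]
    return func(**json.loads(arguments_json))


def append_tool_result(messages, tool_call_id, name, arguments_json, result):
    """Return messages extended with the assistant's tool call and its result."""
    return messages + [
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": tool_call_id,
                "type": "function",
                "function": {"name": name, "arguments": arguments_json},
            }],
        },
        # The tool message carries the result, linked by tool_call_id
        {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)},
    ]


messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
arguments = '{"location": "Tokyo"}'  # as returned in tool_call.function.arguments
result = run_tool_call("get_weather", arguments)
messages = append_tool_result(messages, "call_0", "get_weather", arguments, result)
# A second client.chat.completions.create(model=..., messages=messages, tools=tools)
# call would then let the model answer in plain text using the tool output.
```

In a real script, the name, arguments and id would come from `tool_call.function.name`, `tool_call.function.arguments` and `tool_call.id` in the response above.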

llama-server vs Ollama

I’ve used both. Here’s my comparison:

Feature          | Ollama            | llama-server
Setup            | Easier            | More configuration
API              | OpenAI-compatible | OpenAI-compatible
Function calling | Limited           | Full support
Production ready | Good              | Better

For quick experiments, Ollama is fine. But for production agents with function calling, I prefer llama-server.

Why This Matters

Deploying Qwen3.5 as an OpenAI-compatible API gives me:

  1. Minimal code changes - Existing OpenAI SDK code works once base_url points to the local server
  2. Function calling - Native support for building AI agents
  3. Streaming - Server-sent events for real-time responses
  4. Cost savings - No per-token charges after setup
  5. Privacy - Data never leaves my infrastructure
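Streaming (point 3) deserves a short illustration. With the OpenAI SDK, passing `stream=True` to `client.chat.completions.create` yields chunks whose `choices[0].delta.content` holds the next piece of text. Underneath, those chunks arrive as server-sent event lines. The sketch below parses such lines with the standard library; the sample lines are hand-written to follow the OpenAI chunk format and were not captured from a real server, and `iter_stream_text` is a hypothetical helper.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample in the OpenAI streaming chunk format
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

In practice the SDK does this parsing for you; the sketch is only meant to show what travels over the connection.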

Summary

In this post, I showed how to deploy Qwen3.5 as an OpenAI-compatible API server using llama-server. The key step is setting the OpenAI SDK’s base_url to http://127.0.0.1:8001/v1. With native function calling support, this setup works well for building local AI agents without rewriting existing OpenAI-based code.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.


Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
