How to Deploy Qwen3.5 as an OpenAI-Compatible API Server with llama-server
Purpose
I wanted to replace OpenAI’s API with a local Qwen3.5 deployment. The problem? My existing applications use the OpenAI SDK, and I didn’t want to rewrite all that code.
The solution: llama-server from llama.cpp provides a drop-in OpenAI-compatible API endpoint.
Environment
- Qwen3.5 model (GGUF format)
- llama.cpp compiled with llama-server
- Python with the openai package
- Linux/macOS
The llama-server Setup
llama-server is part of llama.cpp. It exposes an HTTP server that mimics the OpenAI API format.
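To make "mimics the OpenAI API format" concrete, here is a minimal sketch of the HTTP request an OpenAI-style client sends to the chat completions endpoint. It uses only the Python standard library. The helper names build_chat_request and send_chat_request are my own, and the base URL and model name are assumed to match the server settings used later in this post.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key="sk-no-key-required"):
    """Build a POST request for the OpenAI-style chat completions endpoint.

    base_url is assumed to already include the /v1 prefix.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request, timeout_s=120.0):
    """Send the request and return the assistant's text reply.

    Requires a running server; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        reply = json.loads(resp.read())
    return reply["choices"][0]["message"]["content"]
```

Calling send_chat_request(build_chat_request("http://127.0.0.1:8001/v1", "Qwen3.5-35B-A3B", [{"role": "user", "content": "Hello"}])) should return the reply text once the server is running. The OpenAI SDK does the same thing with more conveniences on top.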
Start the server with:
    ./llama.cpp/llama-server \
        --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_K_M.gguf \
        --alias "Qwen3.5-35B-A3B" \
        --temp 0.6 \
        --top-p 0.95 \
        --ctx-size 16384 \
        --top-k 20 \
        --min-p 0.00 \
        --port 8001

Here are the key options:

- --model: Path to your GGUF model file
- --alias: Friendly name for the model (appears in API responses)
- --temp 0.6: Temperature for sampling (Qwen3.5 recommends 0.5-0.7)
- --ctx-size 16384: Context window size (adjust based on memory)
- --port 8001: Server port (default is 8080)
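If you prefer to launch the server from a script, the same options can be assembled in Python. This is only a sketch: build_llama_server_command is a hypothetical helper of mine, and it passes through exactly the flags shown above with the same default values.

```python
from typing import List


def build_llama_server_command(
    binary: str,
    model_path: str,
    alias: str,
    port: int = 8001,
    ctx_size: int = 16384,
    temp: float = 0.6,
    top_p: float = 0.95,
    top_k: int = 20,
    min_p: float = 0.0,
) -> List[str]:
    """Return the llama-server argument list for use with subprocess."""
    return [
        binary,
        "--model", model_path,
        "--alias", alias,
        "--temp", str(temp),
        "--top-p", str(top_p),
        "--ctx-size", str(ctx_size),
        "--top-k", str(top_k),
        "--min-p", str(min_p),
        "--port", str(port),
    ]
```

The returned list can be handed to subprocess.Popen, which avoids shell quoting problems with paths that contain spaces.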
When the server starts, I see:
    server listening on http://127.0.0.1:8001
    model loaded successfully

Client Integration
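Loading a large model can take a while, so it can help to wait for the server before sending requests. The sketch below polls llama-server's /health endpoint, which, as far as I know, returns HTTP 200 with {"status": "ok"} once the model is loaded and 503 while it is still loading. The function names and timeout values are my own assumptions.

```python
import json
import time
import urllib.error
import urllib.request


def health_ok(status: int, body: bytes) -> bool:
    """Return True when a /health response says the model is loaded."""
    if status != 200:
        return False
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not valid JSON, or JSON that is not an object.
        return False


def wait_until_ready(server_url: str, timeout_s: float = 120.0,
                     poll_s: float = 1.0) -> bool:
    """Poll server_url/health until it reports ready or the timeout expires.

    server_url is the server root (no /v1 suffix), for example
    http://127.0.0.1:8001.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{server_url}/health", timeout=5) as resp:
                if health_ok(resp.status, resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, or a 503 while the model is still loading.
            pass
        time.sleep(poll_s)
    return False
```

Calling wait_until_ready("http://127.0.0.1:8001") right after starting the server returns True once requests can be served, or False if the timeout passes first.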
Now I can keep using the OpenAI SDK. Only the client constructor changes:
    from openai import OpenAI

    client = OpenAI(
        base_url="http://127.0.0.1:8001/v1",
        api_key="sk-no-key-required"  # Any string works
    )

    response = client.chat.completions.create(
        model="Qwen3.5-35B-A3B",
        messages=[{"role": "user", "content": "Hello, who are you?"}]
    )

    print(response.choices[0].message.content)

The key change: base_url points to my local server. Everything else stays the same.
When I run this:
    python test_api.py

I get:

    I am Qwen, a large language model developed by Alibaba Cloud...

Function Calling Support
Qwen3.5 supports native function calling. This is useful for building AI agents.
Here’s a complete example:
    from openai import OpenAI
    import json

    client = OpenAI(
        base_url="http://127.0.0.1:8001/v1",
        api_key="local"
    )

    # Define tools for the model to call
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]

    response = client.chat.completions.create(
        model="Qwen3.5-35B-A3B",
        messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
        tools=tools
    )

    # Model returns function call instead of text
    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

When I run this:
    python function_calling.py

I get:

    Function: get_weather
    Arguments: {"location": "Tokyo"}

The model understood my request and returned a structured function call. I can now execute the actual function and pass the result back.
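Here is one way the "execute and pass the result back" step could look. It is a sketch, not the only way to do it: get_weather is a stub that returns placeholder data instead of calling a real weather service, and TOOL_REGISTRY and build_tool_message are names I made up. The returned dictionary follows the OpenAI chat format for tool results (role "tool" plus the matching tool_call_id).

```python
import json


def get_weather(location: str) -> dict:
    """Stub tool: returns placeholder data, not a real weather lookup."""
    return {"location": location, "forecast": "unknown (stub)"}


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def build_tool_message(tool_call_id: str, name: str, arguments: str) -> dict:
    """Run the requested tool and wrap its result as a `tool` role message.

    `arguments` is the JSON string the model produced for the call.
    Errors are returned to the model as JSON instead of being raised.
    """
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = handler(**json.loads(arguments))
        except (ValueError, TypeError) as exc:
            # Malformed JSON, or arguments that do not match the function.
            result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }
```

With the response from the example above, I would call build_tool_message(tool_call.id, tool_call.function.name, tool_call.function.arguments), append both the assistant message (response.choices[0].message) and the returned tool message to the messages list, and call client.chat.completions.create again so the model can write its final answer.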
llama-server vs Ollama
I’ve used both. Here’s my comparison:
| Feature | Ollama | llama-server |
|---|---|---|
| Setup | Easier | More configuration |
| API | OpenAI-compatible | OpenAI-compatible |
| Function calling | Limited | Full support |
| Production ready | Good | Better |
For quick experiments, Ollama is fine. But for production agents with function calling, I prefer llama-server.
Why This Matters
Deploying Qwen3.5 as an OpenAI-compatible API gives me:
- Near-zero code changes - Existing OpenAI SDK code works once base_url points to the local server
- Function calling - Native support for building AI agents
- Streaming - Server-sent events for real-time responses
- Cost savings - No per-token charges after setup
- Privacy - Data never leaves my infrastructure
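Streaming deserves a quick illustration. With the OpenAI SDK, passing stream=True to client.chat.completions.create yields chunks whose choices[0].delta.content holds the next piece of text. Under the hood these arrive as server-sent events, one "data:" line per chunk, ending with "data: [DONE]". The sketch below parses such lines with the standard library only. The function name iter_content_deltas is my own.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming (SSE) lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separator lines and SSE comments.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # End-of-stream marker.
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Printing each fragment as it arrives, for example with print(fragment, end="", flush=True), gives the familiar token-by-token output.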
Summary
In this post, I showed how to deploy Qwen3.5 as an OpenAI-compatible API server using llama-server. The key point is pointing the OpenAI SDK’s base_url to http://127.0.0.1:8001/v1. With native function calling support, this setup is ideal for building local AI agents without rewriting existing OpenAI-based code.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 llama.cpp GitHub
- 👨‍💻 Qwen3.5 Model Card
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!