How to Set Up llama.cpp Server for Local LLM Coding on RTX 5090
Problem
I wanted to run local LLMs on my RTX 5090 for coding, but I needed more than just a command-line chat. I wanted:
- An OpenAI-compatible API endpoint
- Integration with coding tools (the Continue extension for VS Code, or the Cursor editor)
- No API costs or rate limits
- Privacy for my code
The solution is llama.cpp server, but setting it up correctly took some trial and error.
Environment
- RTX 5090 with 32GB VRAM
- llama.cpp compiled with CUDA support
- VS Code with Continue extension
The Solution
Use llama-server with the -hf flag to download GGUF models directly from HuggingFace. This creates an OpenAI-compatible API endpoint that works with your existing tools.
Quick Start Command

    llama-server \
      --host 0.0.0.0 \
      --port 8080 \
      -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
      --ctx-size 65536 \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20

This single command:
- Downloads the model from HuggingFace (if not cached)
- Loads it onto your GPU
- Starts an OpenAI-compatible server at http://localhost:8080
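
Because the first run has to download and load the model, the server may not answer requests right away. Here is a minimal sketch of a readiness check, assuming your llama-server build exposes a /health endpoint that returns 200 once the model is loaded. The function name wait_until_ready and its parameters are my own, not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8080", timeout_s=300.0,
                     poll_s=2.0, probe=None):
    """Poll GET {base_url}/health until it answers 200 or the timeout expires."""

    def http_probe(url):
        # Returns the HTTP status code, or None if the server is unreachable.
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # e.g. a non-200 status while the model is loading
        except (urllib.error.URLError, OSError):
            return None

    probe = probe or http_probe  # `probe` is injectable so the loop can be tested
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url.rstrip("/") + "/health") == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling wait_until_ready() from a script before sending the first request avoids connection errors during model load.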
Key Flags Explained
| Flag | Purpose |
|---|---|
| --host 0.0.0.0 | Listen on all interfaces (reachable from other machines on the network) |
| --port 8080 | API endpoint port |
| -hf unsloth/... | Download model directly from HuggingFace |
| --ctx-size 65536 | Large context window for code files |
| --temp 0.6 | Lower sampling temperature for more consistent coding output |
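
If you launch the server from a script, it can help to keep these flags in one place. The following is a small sketch that assembles the same command as an argument list. The helper name build_server_command is hypothetical, and the defaults simply mirror the values used in this post.

```python
import shlex


def build_server_command(hf_model, host="0.0.0.0", port=8080,
                         ctx_size=65536, temp=0.6, top_p=0.95, top_k=20):
    """Return the llama-server argument list for the flags used in this post."""
    return [
        "llama-server",
        "--host", host,
        "--port", str(port),
        "-hf", hf_model,
        "--ctx-size", str(ctx_size),
        "--temp", str(temp),
        "--top-p", str(top_p),
        "--top-k", str(top_k),
    ]


cmd = build_server_command("unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL")
print(shlex.join(cmd))  # shell-ready form of the command
# To actually start the server, pass the list to subprocess.Popen(cmd).
```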
How It Works
The server exposes these endpoints:

    POST /v1/chat/completions   # Chat completions
    POST /v1/completions        # Text completions
    GET  /v1/models             # List available models

Any tool that works with OpenAI’s API will work with your local server.
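
As a quick way to see which model name the server reports, here is a sketch that calls GET /v1/models using only the standard library. It assumes the response follows the OpenAI list format (a "data" array of objects with an "id" field). The sample payload is illustrative, not captured from a real run.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model ids from an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8080/v1"):
    """Call GET /v1/models on the running server and return the model ids."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return model_ids(json.load(resp))


# Illustrative payload in the OpenAI list format (assumed shape).
sample = {"object": "list", "data": [{"id": "local-model", "object": "model"}]}
print(model_ids(sample))
```

With the server running, list_models() returns the ids it actually advertises, which you can then use as the "model" value in requests.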
Testing the Server
Let me verify it’s working:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "local-model",
        "messages": [{"role": "user", "content": "Write a hello world in Python"}],
        "temperature": 0.6
      }'

VS Code Integration
To use this with VS Code’s Continue extension:

    {
      "models": [
        {
          "title": "Local Qwen3.5",
          "provider": "openai",
          "model": "local-model",
          "apiBase": "http://localhost:8080/v1",
          "apiKey": "not-needed"
        }
      ]
    }

Just point the apiBase to your local server. The apiKey is required by the SDK but not used.
Python Integration
You can also use it with the OpenAI Python SDK:

    from openai import OpenAI

    # Point to local llama-server
    client = OpenAI(
        base_url="http://localhost:8080/v1",
        api_key="not-needed"
    )

    response = client.chat.completions.create(
        model="local-model",
        messages=[
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": "Write a Python function to validate email"}
        ],
        temperature=0.6,
        top_p=0.95
    )

    print(response.choices[0].message.content)

Finding Models
Search HuggingFace for GGUF models:

    unsloth/Qwen3.5-35B-A3B-GGUF                    # Fast, good quality
    unsloth/Qwen3-coder-30B-GGUF                    # Coding specialized
    deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-GGUF   # Reasoning

Visit https://huggingface.co/models?search=GGUF+Qwen to find more.
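
The value passed to -hf in the quick start command combines a repository name with a quantization tag, separated by a colon. Here is a tiny sketch for building that string when switching between models. The helper hf_spec is hypothetical, and it assumes the tag is optional.

```python
def hf_spec(repo, quant=None):
    """Build the -hf value: '<user>/<model>' plus an optional ':<quant>' tag."""
    if repo.count("/") != 1 or repo.startswith("/") or repo.endswith("/"):
        raise ValueError(f"expected '<user>/<model>', got {repo!r}")
    return f"{repo}:{quant}" if quant else repo


print(hf_spec("unsloth/Qwen3.5-35B-A3B-GGUF", "UD-Q4_K_XL"))
```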
Common Mistakes I Made
- Not using the -hf flag - I tried downloading models manually first. The -hf flag is much easier and handles caching.
- Too small context size - Coding needs context. I started with 4096, which wasn’t enough for larger files. 65536 works well.
- Overly high temperature - For coding, lower temperature (0.6) gives more consistent output. Higher values make code unpredictable.
- Not enabling network access - Without --host 0.0.0.0, the server only accepts localhost connections. I couldn’t test from other machines.
- Ignoring repeat penalty - Without --repeat-penalty 1.0, I got repetitive code output. Note that 1.0 means no repetition penalty is applied, so this flag only changes behavior if your build or client defaults to a different value.
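
To get a feel for the context size point above, here is a rough sketch that estimates whether a source file fits into the window. The 4 characters per token figure is only a rule of thumb (real tokenizers vary by model and by language), and the function name and reply budget are my own assumptions.

```python
def fits_in_context(text, ctx_size=65536, reserve_for_reply=4096,
                    chars_per_token=4.0):
    """Rough check: does `text` plus a reply budget fit in the context window?

    Returns (fits, estimated_tokens). The estimate is a heuristic, not a
    real tokenizer count.
    """
    estimated_tokens = int(len(text) / chars_per_token) + 1
    return estimated_tokens + reserve_for_reply <= ctx_size, estimated_tokens


source = "x = 1\n" * 5000  # stand-in for a 30,000-character source file
print(fits_in_context(source, ctx_size=4096))
print(fits_in_context(source, ctx_size=65536))
```

For an exact count you would need the model's own tokenizer. This only helps decide whether a file is clearly too large.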
Summary
In this post, I showed how to set up llama.cpp server on an RTX 5090. The key points are using llama-server with the -hf flag for easy model downloads and configuring a large context window for coding tasks.
The OpenAI-compatible endpoint at http://localhost:8080/v1 works with VS Code extensions, Python scripts, and any tool that supports the OpenAI API format.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!