
How to Set Up llama.cpp Server for Local LLM Coding on RTX 5090

Problem

I wanted to run local LLMs on my RTX 5090 for coding, but I needed more than just a command-line chat. I wanted:

  • An OpenAI-compatible API endpoint
  • Integration with my coding tools (the Continue extension for VS Code, or the Cursor editor)
  • No API costs or rate limits
  • Privacy for my code

The solution is llama.cpp server, but setting it up correctly took some trial and error.

Environment

  • RTX 5090 with 32GB VRAM
  • llama.cpp compiled with CUDA support
  • VS Code with Continue extension

The Solution

Use llama-server with the -hf flag to download GGUF models directly from HuggingFace. This creates an OpenAI-compatible API endpoint that works with your existing tools.

Quick Start Command

Start llama-server with Qwen3.5
llama-server \
  --host 0.0.0.0 \
  --port 8080 \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --ctx-size 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20

This single command:

  1. Downloads the model from HuggingFace (if not cached)
  2. Loads it onto your GPU
  3. Starts an OpenAI-compatible server at http://localhost:8080
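The first start can take a while because of the download and the model load, so it helps to wait for the server before pointing tools at it. Here is a minimal sketch that polls the server's /health endpoint using only the standard library. The function name, timeout, and poll interval are my own choices for illustration, not anything llama-server prescribes.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8080", timeout_s=600.0, poll_s=2.0):
    """Poll GET /health until the server answers 200, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused (server not up yet) or an HTTP error status
            # (for example while the model is still loading): keep waiting.
            pass
        time.sleep(poll_s)
    return False
```

Calling wait_until_ready() right after launching the quick start command returns True once the model is loaded, or False if the timeout expires first.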

Key Flags Explained

Flag                Purpose
--host 0.0.0.0      Listen on all interfaces (network accessible)
--port 8080         API endpoint port
-hf unsloth/...     Download model directly from HuggingFace
--ctx-size 65536    Large context window for code files
--temp 0.6          Lower temperature for more consistent coding output
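A large --ctx-size is not free: the KV cache grows linearly with the context length and competes with the model weights for the 32GB of VRAM. The sketch below is a rough back-of-the-envelope estimate, assuming standard attention layers and a 16-bit cache. The layer and head counts are hypothetical placeholders, not values from any real model card, and models with a different attention design or a quantized cache will come out differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2):
    """Approximate KV cache size: one K and one V vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value


# Hypothetical architecture numbers, NOT taken from a real model card.
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, ctx_size=65536)
print(f"{size / 2**30:.1f} GiB")  # 6.0 GiB for these example numbers
```

Plug in the numbers from the model card of whatever you actually run. If the estimate plus the size of the GGUF file gets close to your VRAM, a smaller context is the first thing to try.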

How It Works

The server exposes these endpoints:

OpenAI-compatible endpoints
POST /v1/chat/completions # Chat completions
POST /v1/completions # Text completions
GET /v1/models # List available models

Any tool that works with OpenAI’s API will work with your local server.
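A quick way to check that compatibility is to ask the server which models it reports. This is a small sketch using only the standard library. It assumes the response follows the usual OpenAI list shape, and the model id in the sample payload is made up for illustration.

```python
import json
import urllib.request


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style list response: {"data": [{"id": ...}]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8080/v1"):
    """Fetch GET /v1/models from the running server and return the model ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))


# Example with a hand-written payload (the id is made up for illustration):
sample = {"object": "list", "data": [{"id": "example-model.gguf", "object": "model"}]}
print(parse_model_ids(sample))  # ['example-model.gguf']
```

With the server running, list_models() should return the id of the model you loaded.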

Testing the Server

Let me verify it’s working:

Test with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Write a hello world in Python"}],
    "temperature": 0.6
  }'
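If you would rather script this check, the same request can be sent from Python without installing anything. The sketch below mirrors the curl command. The helper names are mine, and it assumes the response has the usual OpenAI chat completion shape.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1", temperature=0.6):
    """Build the same POST /v1/chat/completions request as the curl command."""
    body = {
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_json):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def smoke_test(prompt="Write a hello world in Python"):
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return first_reply(json.load(resp))
```

Running print(smoke_test()) against a live server prints the model's answer. The request builder and the reply parser are kept separate so they can be checked without a server.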

VS Code Integration

To use this with VS Code’s Continue extension:

~/.continue/config.json
{
  "models": [
    {
      "title": "Local Qwen3.5",
      "provider": "openai",
      "model": "local-model",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "not-needed"
    }
  ]
}

Just point apiBase at your local server. The client expects an apiKey value, but llama-server ignores it unless you start it with --api-key.
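If you already have other models in that file, you probably want to add the local entry without overwriting them. Here is a rough sketch of how that could be scripted. It assumes the file is plain JSON without comments, and the function names are hypothetical helpers of mine, not part of Continue.

```python
import json
from pathlib import Path


def upsert_model(config, entry):
    """Return a copy of config where entry replaces any model with the same title."""
    models = [m for m in config.get("models", []) if m.get("title") != entry["title"]]
    return {**config, "models": models + [entry]}


LOCAL_ENTRY = {
    "title": "Local Qwen3.5",
    "provider": "openai",
    "model": "local-model",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed",
}


def update_config_file(path=Path.home() / ".continue" / "config.json"):
    """Read the config if it exists, add the local model, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(upsert_model(config, LOCAL_ENTRY), indent=2))
```

Matching on the title means running it twice does not create duplicate entries. Back up the file first if you have customized it heavily.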

Python Integration

You can also use it with the OpenAI Python SDK:

Python client example
from openai import OpenAI
# Point to local llama-server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function to validate email"},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)

Finding Models

Search HuggingFace for GGUF models:

Popular coding models for RTX 5090
unsloth/Qwen3.5-35B-A3B-GGUF # Fast, good quality
unsloth/Qwen3-coder-30B-GGUF # Coding specialized
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-GGUF # Reasoning

Visit https://huggingface.co/models?search=GGUF+Qwen to find more.
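The same search can be scripted. This is a minimal sketch against the HuggingFace Hub models API. I am assuming the search and limit query parameters and an "id" field on each result, so check the Hub API documentation if the response looks different.

```python
import json
import urllib.parse
import urllib.request


def hf_search_url(query, limit=10):
    """Build a Hub API search URL (parameter names are an assumption, see above)."""
    params = urllib.parse.urlencode({"search": query, "limit": limit})
    return f"https://huggingface.co/api/models?{params}"


def search_models(query, limit=10):
    """Return the repository ids of the first matching models."""
    with urllib.request.urlopen(hf_search_url(query, limit), timeout=30) as resp:
        return [model["id"] for model in json.load(resp)]
```

For example, search_models("GGUF Qwen", limit=5) should return repository names that you can pass to the -hf flag.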

Common Mistakes I Made

  1. Not using -hf flag - I tried downloading models manually first. The -hf flag is much easier and handles caching.

  2. Too small context size - Coding needs context. I started with 4096, which wasn’t enough for larger files. 65536 works well.

  3. Overly high temperature - For coding, lower temperature (0.6) gives more consistent output. Higher values make code unpredictable.

  4. Not enabling network access - Without --host 0.0.0.0, the server only accepts localhost connections. I couldn’t test from other machines.

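  5. Ignoring repeat penalty - I got repetitive code output until I set --repeat-penalty 1.0 explicitly. A value of 1.0 means no penalty is applied. The quick start command above does not include this flag, so add it if you see the same problem.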
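To avoid repeating these mistakes, the settings from the list above can be kept in one place. Here is a small sketch of a launcher helper that bakes them in as defaults. The function name and its defaults are my own, and it only uses flags that appear in this post.

```python
import shlex


def build_server_args(
    hf_repo,
    host="0.0.0.0",
    port=8080,
    ctx_size=65536,
    temp=0.6,
    top_p=0.95,
    top_k=20,
    repeat_penalty=1.0,
):
    """Return the llama-server command line as an argument list."""
    return [
        "llama-server",
        "--host", host,
        "--port", str(port),
        "-hf", hf_repo,
        "--ctx-size", str(ctx_size),
        "--temp", str(temp),
        "--top-p", str(top_p),
        "--top-k", str(top_k),
        "--repeat-penalty", str(repeat_penalty),
    ]


args = build_server_args("unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL")
print(shlex.join(args))  # the full command as a single shell-safe string
```

Pass the list to subprocess.Popen(args) to start the server from a script, or override host="127.0.0.1" when you do not want it reachable from other machines.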

Summary

In this post, I showed how to set up llama.cpp server on RTX 5090. The key point is using llama-server with the -hf flag for easy model downloads and configuring a large context window for coding tasks.

The OpenAI-compatible endpoint at http://localhost:8080/v1 works with VS Code extensions, Python scripts, and any tool that supports the OpenAI API format.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
