How to Set Up llama.cpp Server for Local LLM Coding on RTX 5090

Mar 17, 2026

Cowrie

Dev @ Bswen

Problem

I wanted to run local LLMs on my RTX 5090 for coding, but I needed more than just a command-line chat. I wanted:

An OpenAI-compatible API endpoint
Integration with editor tooling (the Continue extension for VS Code, Cursor)
No API costs or rate limits
Privacy for my code

The solution is llama.cpp’s built-in server (llama-server), but setting it up correctly took some trial and error.

Environment

RTX 5090 with 32GB VRAM
llama.cpp compiled with CUDA support
VS Code with Continue extension

The Solution

Use llama-server with the -hf flag to download GGUF models directly from HuggingFace. This creates an OpenAI-compatible API endpoint that works with your existing tools.

Quick Start Command

llama-server \
  --host 0.0.0.0 \
  --port 8080 \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --ctx-size 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20

This single command:

Downloads the model from HuggingFace (if not cached)
Loads it onto your GPU (on older llama.cpp builds you may need to add -ngl 99 to offload all layers)
Starts an OpenAI-compatible server at http://localhost:8080
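
The first start can take a while because of the download and model load, so it helps to wait until the server is actually ready before pointing tools at it. Here is a minimal sketch that polls for readiness. It assumes your llama-server build exposes GET /health and answers 200 once the model is loaded (as described in the llama.cpp server README); the function names and timeouts are my own placeholders.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status while loading
        return False


def wait_until_ready(base_url, timeout_s=600.0, interval_s=2.0,
                     probe=probe_health, sleep=time.sleep):
    """Poll the probe until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage (assumes the server from the quick start command is starting up):
# if wait_until_ready("http://localhost:8080"):
#     print("llama-server is ready")
```

The probe and sleep functions are parameters only so the loop can be tried without a running server.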

Key Flags Explained

Flag	Purpose
`--host 0.0.0.0`	Listen on all interfaces (network accessible)
`--port 8080`	API endpoint port
`-hf unsloth/...`	Download model directly from HuggingFace
`--ctx-size 65536`	Large context window for code files
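`--temp 0.6`	Lower temperature for more consistent (though not fully deterministic) coding output
`--top-p 0.95`	Nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches 0.95
`--top-k 20`	Sample only from the 20 most likely tokens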
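
If you prefer to start the server from a script instead of typing the flags each time, something like the following works. This is only a sketch: it assumes llama-server is on your PATH, and the helper names are hypothetical. The flags and default values are the ones from the quick start command.

```python
import shlex
import subprocess


def build_server_command(model_repo, host="0.0.0.0", port=8080,
                         ctx_size=65536, temp=0.6, top_p=0.95, top_k=20):
    """Build the llama-server argument list from the quick start flags."""
    return [
        "llama-server",
        "--host", host,
        "--port", str(port),
        "-hf", model_repo,
        "--ctx-size", str(ctx_size),
        "--temp", str(temp),
        "--top-p", str(top_p),
        "--top-k", str(top_k),
    ]


def start_server(cmd):
    """Launch the server as a child process and return the Popen handle."""
    print("Starting:", shlex.join(cmd))
    return subprocess.Popen(cmd)


# Usage:
# proc = start_server(build_server_command("unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL"))
# ...
# proc.terminate()
```

Passing the arguments as a list avoids shell quoting problems with the model name.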

How It Works

The server exposes these endpoints:

POST /v1/chat/completions    # Chat completions
POST /v1/completions         # Text completions
GET  /v1/models              # List available models

Most tools that speak the OpenAI chat completions API will work with your local server once you point their base URL at it.
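
To see what the server reports on GET /v1/models, a small standard-library script is enough. This sketch assumes the response follows the usual OpenAI list shape (a "data" array of objects with an "id" field); the function names are placeholders I made up.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Pull the model ids out of an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(base_url="http://localhost:8080/v1"):
    """Fetch /models from the local server and return the ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return extract_model_ids(json.load(resp))


# Usage (requires the server to be running):
# print(list_models())
```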

Testing the Server

Let me verify it’s working:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Write a hello world in Python"}],
    "temperature": 0.6
  }'
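
The same check can be done from Python without installing anything. This is a rough equivalent of the curl call above using only the standard library; the helper names are hypothetical, and the response parsing assumes the standard chat completions shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1",
                       model="local-model", temperature=0.6):
    """Build the POST request for /chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt to the local server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(ask("Write a hello world in Python"))
```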

VS Code Integration

To use this with VS Code’s Continue extension:

{
  "models": [
    {
      "title": "Local Qwen3.5",
      "provider": "openai",
      "model": "local-model",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "not-needed"
    }
  ]
}

Just point the apiBase to your local server. The apiKey is only a placeholder: the server ignores it unless you started it with --api-key.

Python Integration

You can also use it with the OpenAI Python SDK:

from openai import OpenAI

# Point to local llama-server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function to validate email"}
    ],
    temperature=0.6,
    top_p=0.95
)

print(response.choices[0].message.content)
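
For longer answers it is nicer to print tokens as they arrive instead of waiting for the full response. Here is a sketch of streaming with the same SDK, assuming the server streams chunks in the standard OpenAI format; stream_reply is a name I made up, and client is an OpenAI instance like the one created above.

```python
def stream_reply(client, prompt, model="local-model", temperature=0.6):
    """Yield text fragments as the server generates them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        stream=True,  # ask for incremental chunks instead of one response
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (for example usage stats)
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


# Usage (requires the openai package and a running server):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# for piece in stream_reply(client, "Explain list comprehensions"):
#     print(piece, end="", flush=True)
```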

Finding Models

Search HuggingFace for GGUF models (check the exact repository name on the model page before passing it to -hf):

unsloth/Qwen3.5-35B-A3B-GGUF     # Fast, good quality
unsloth/Qwen3-coder-30B-GGUF     # Coding specialized
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-GGUF  # Reasoning

Visit https://huggingface.co/models?search=GGUF+Qwen to find more.

Common Mistakes I Made

Not using the -hf flag - I tried downloading models manually first. The -hf flag is much easier and handles caching.
Too small context size - Coding needs context. I started with 4096, which wasn’t enough for larger files. 65536 works well.
Overly high temperature - For coding, lower temperature (0.6) gives more consistent output. Higher values make code unpredictable.
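Not enabling network access - Without --host 0.0.0.0, the server only accepts localhost connections, so I couldn’t test from other machines. Only bind to all interfaces on a network you trust: the API is unauthenticated unless you start the server with --api-key.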
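Ignoring repeat penalty - I got repetitive code output until I set --repeat-penalty explicitly. Note that 1.0 means no penalty; values slightly above 1.0 (for example 1.05 to 1.1) discourage repetition, so check the model card for the recommended setting.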
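
On the context size point: a bigger context is not free, because the KV cache grows linearly with --ctx-size and competes with the model weights for the 32GB of VRAM. The sketch below is a back-of-the-envelope estimate for a plain transformer with grouped-query attention and a 16-bit cache. The layer and head counts are hypothetical placeholders, not the real Qwen3.5 values, and models with hybrid or other non-standard attention layers will differ, so treat the result as an order of magnitude only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2):
    """Estimate KV cache size for standard attention.

    The leading 2 accounts for one key and one value tensor per layer.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Hypothetical architecture: 48 layers, 4 KV heads, head dimension 128
print(to_gib(kv_cache_bytes(48, 4, 128, 65536)))  # -> 6.0
print(to_gib(kv_cache_bytes(48, 4, 128, 4096)))   # -> 0.375
```

With these made-up numbers, going from 4096 to 65536 tokens costs about 5.6 GiB of extra VRAM, which is why it is worth checking free memory before raising the context further.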

Summary

In this post, I showed how to set up the llama.cpp server on an RTX 5090. The key points are using llama-server with the -hf flag for easy model downloads and configuring a large context window for coding tasks.

The OpenAI-compatible endpoint at http://localhost:8080/v1 works with VS Code extensions, Python scripts, and any tool that supports the OpenAI API format.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit: RTX 5090 + local LLM for app dev
👨‍💻 llama.cpp GitHub
👨‍💻 HuggingFace GGUF Models

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!