How to Integrate Local LLMs into VS Code for AI-Powered Coding

Mar 17, 2026

Problem

I had my RTX 5090 running local LLMs, but I was still using a terminal for everything. I wanted AI assistance directly in VS Code, like GitHub Copilot, but using my local models.

The question was: How do I connect my local LLM to VS Code?

The Solution

Use the Continue extension for VS Code and point it to your local llama.cpp server. Configure it with OpenAI-compatible settings.

Architecture Overview

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   VS Code       │     │   llama-server   │     │   RTX 5090      │
│   + Continue    │────►│   (port 8080)    │────►│   (32GB VRAM)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                              │
                              ▼
                        GGUF Models
                        (HuggingFace)

Step-by-Step Setup

1. Install Continue Extension

In VS Code, install the “Continue” extension from the marketplace.

2. Start Your Local LLM Server

First, make sure your llama-server is running:

llama-server \
  --host 0.0.0.0 \
  --port 8080 \
  -hf unsloth/Qwen3-coder-30B-GGUF:Q4_K_M \
  --ctx-size 32768

3. Configure Continue

Open Continue’s config file at ~/.continue/config.json:

{
  "models": [
    {
      "title": "Qwen-Coder-30B",
      "provider": "openai",
      "model": "Qwen/Qwen3-coder-30B",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "not-needed"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen-Coder-7B-Fast",
    "provider": "openai",
    "model": "Qwen/Qwen3-coder-7B",
    "apiBase": "http://localhost:8081/v1"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}

4. Dual Model Strategy

A key insight from the community: “Keep separate configs: one model for inline completions (fast, smaller), one for chat/tools (bigger Qwen).”

| Task              | Model Size | Why                         |
|-------------------|------------|-----------------------------|
| Inline completion | 7B         | Fast, responsive            |
| Chat/Tools        | 30B        | Better reasoning, quality   |
| Embeddings        | Small      | Fast indexing, not critical |

You’ll need two llama-server instances on different ports for this.

Important: Match Model Names Exactly

One user warned: “Name the model exactly like your llama-server advertises.”

If your server exposes the model as Qwen/Qwen3-coder-30B, use that exact string in your config. Mismatched names cause connection failures.

Alternative: Cursor IDE

If you prefer an all-in-one solution, Cursor is a fork of VS Code with built-in AI:

| Feature        | Continue + VS Code      | Cursor              |
|----------------|-------------------------|---------------------|
| Setup          | Manual config           | Built-in            |
| Flexibility    | High                    | Medium              |
| Local LLM      | Yes                     | Yes (with config)   |
| Copilot        | No                      | Optional            |

Both support local LLMs via OpenAI-compatible endpoints.

Common Mistakes I Made

Using a single large model for all tasks - Autocomplete with a 30B model is slow. Use 7B for completions.
Mismatched model names - The config model name must match exactly what the server exposes.
Not caching embeddings - Without local embeddings, context is limited to the open file. Set up an embeddings provider.
Forgetting to start the server - VS Code errors are confusing if the LLM server isn’t running.
Wrong API base URL - Don’t forget the /v1 suffix: http://localhost:8080/v1

Testing Your Setup

Open VS Code and try:

Chat: Open Continue sidebar, ask a coding question
Inline completion: Start typing a function, press Tab
Code explanation: Select code, use “Explain” command

If nothing works, check:

Is llama-server running? (curl http://localhost:8080/v1/models)
Is the port correct in config?
Does the model name match?

LiteLLM for Unified Proxy

If you run multiple models, use LiteLLM as a unified proxy:

model_list:
  - model_name: "Qwen/Qwen3-coder-30B"
    litellm_params:
      model: "huggingface/Qwen/Qwen3-coder-30B"
      api_base: "http://localhost:8080"
  - model_name: "Qwen/Qwen3-coder-7B"
    litellm_params:
      model: "huggingface/Qwen/Qwen3-coder-7B"
      api_base: "http://localhost:8081"

This gives you a single endpoint that routes to different models.

Summary

In this post, I showed how to integrate local LLMs into VS Code. The key point is using the Continue extension with OpenAI-compatible settings pointing to your local llama-server.

Use dual models for optimal performance: a fast 7B model for inline completions and a larger 30B model for chat and complex tasks.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit: RTX 5090 + local LLM for app dev
👨‍💻 Continue Extension
👨‍💻 Cursor IDE

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!