Skip to content

How to Integrate Local LLMs into VS Code for AI-Powered Coding

Problem

I had my RTX 5090 running local LLMs, but I was still using a terminal for everything. I wanted AI assistance directly in VS Code, like GitHub Copilot, but using my local models.

The question was: How do I connect my local LLM to VS Code?

The Solution

Use the Continue extension for VS Code and point it to your local llama.cpp server. Configure it with OpenAI-compatible settings.

Architecture Overview

Integration architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ VS Code │ │ llama-server │ │ RTX 5090 │
│ + Continue │────►│ (port 8080) │────►│ (32GB VRAM) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
GGUF Models
(HuggingFace)

Step-by-Step Setup

1. Install Continue Extension

In VS Code, install the “Continue” extension from the marketplace.

2. Start Your Local LLM Server

First, make sure your llama-server is running:

Start llama-server
llama-server \
--host 0.0.0.0 \
--port 8080 \
-hf unsloth/Qwen3-coder-30B-GGUF:Q4_K_M \
--ctx-size 32768

3. Configure Continue

Open Continue’s config file at ~/.continue/config.json:

~/.continue/config.json
{
"models": [
{
"title": "Qwen-Coder-30B",
"provider": "openai",
"model": "Qwen/Qwen3-coder-30B",
"apiBase": "http://localhost:8080/v1",
"apiKey": "not-needed"
}
],
"tabAutocompleteModel": {
"title": "Qwen-Coder-7B-Fast",
"provider": "openai",
"model": "Qwen/Qwen3-coder-7B",
"apiBase": "http://localhost:8081/v1"
},
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text"
}
}

4. Dual Model Strategy

A key insight from the community: “Keep separate configs: one model for inline completions (fast, smaller), one for chat/tools (bigger Qwen).”

Recommended model split
| Task | Model Size | Why |
|-------------------|------------|-----------------------------|
| Inline completion | 7B | Fast, responsive |
| Chat/Tools | 30B | Better reasoning, quality |
| Embeddings | Small | Fast indexing, not critical |

You’ll need two llama-server instances on different ports for this.

Important: Match Model Names Exactly

One user warned: “Name the model exactly like your llama-server advertises.”

If your server exposes the model as Qwen/Qwen3-coder-30B, use that exact string in your config. Mismatched names cause connection failures.

Alternative: Cursor IDE

If you prefer an all-in-one solution, Cursor is a fork of VS Code with built-in AI:

Continue vs Cursor
| Feature | Continue + VS Code | Cursor |
|----------------|-------------------------|---------------------|
| Setup | Manual config | Built-in |
| Flexibility | High | Medium |
| Local LLM | Yes | Yes (with config) |
| Copilot | No | Optional |

Both support local LLMs via OpenAI-compatible endpoints.

Common Mistakes I Made

  1. Using a single large model for all tasks - Autocomplete with a 30B model is slow. Use 7B for completions.

  2. Mismatched model names - The config model name must match exactly what the server exposes.

  3. Not caching embeddings - Without local embeddings, context is limited to the open file. Set up an embeddings provider.

  4. Forgetting to start the server - VS Code errors are confusing if the LLM server isn’t running.

  5. Wrong API base URL - Don’t forget the /v1 suffix: http://localhost:8080/v1

Testing Your Setup

Open VS Code and try:

  1. Chat: Open Continue sidebar, ask a coding question
  2. Inline completion: Start typing a function, press Tab
  3. Code explanation: Select code, use “Explain” command

If nothing works, check:

  • Is llama-server running? (curl http://localhost:8080/v1/models)
  • Is the port correct in config?
  • Does the model name match?

LiteLLM for Unified Proxy

If you run multiple models, use LiteLLM as a unified proxy:

litellm_config.yaml
model_list:
- model_name: "Qwen/Qwen3-coder-30B"
litellm_params:
model: "huggingface/Qwen/Qwen3-coder-30B"
api_base: "http://localhost:8080"
- model_name: "Qwen/Qwen3-coder-7B"
litellm_params:
model: "huggingface/Qwen/Qwen3-coder-7B"
api_base: "http://localhost:8081"

This gives you a single endpoint that routes to different models.

Summary

In this post, I showed how to integrate local LLMs into VS Code. The key point is using the Continue extension with OpenAI-compatible settings pointing to your local llama-server.

Use dual models for optimal performance: a fast 7B model for inline completions and a larger 30B model for chat and complex tasks.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments