How to Integrate Local LLMs into VS Code for AI-Powered Coding
Problem
I had my RTX 5090 running local LLMs, but I was still using a terminal for everything. I wanted AI assistance directly in VS Code, like GitHub Copilot, but using my local models.
The question was: How do I connect my local LLM to VS Code?
The Solution
Use the Continue extension for VS Code and point it to your local llama.cpp server. Configure it with OpenAI-compatible settings.
Architecture Overview
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐│ VS Code │ │ llama-server │ │ RTX 5090 ││ + Continue │────►│ (port 8080) │────►│ (32GB VRAM) │└─────────────────┘ └──────────────────┘ └─────────────────┘ │ ▼ GGUF Models (HuggingFace)Step-by-Step Setup
1. Install Continue Extension
In VS Code, install the “Continue” extension from the marketplace.
2. Start Your Local LLM Server
First, make sure your llama-server is running:
llama-server \ --host 0.0.0.0 \ --port 8080 \ -hf unsloth/Qwen3-coder-30B-GGUF:Q4_K_M \ --ctx-size 327683. Configure Continue
Open Continue’s config file at ~/.continue/config.json:
{ "models": [ { "title": "Qwen-Coder-30B", "provider": "openai", "model": "Qwen/Qwen3-coder-30B", "apiBase": "http://localhost:8080/v1", "apiKey": "not-needed" } ], "tabAutocompleteModel": { "title": "Qwen-Coder-7B-Fast", "provider": "openai", "model": "Qwen/Qwen3-coder-7B", "apiBase": "http://localhost:8081/v1" }, "embeddingsProvider": { "provider": "ollama", "model": "nomic-embed-text" }}4. Dual Model Strategy
A key insight from the community: “Keep separate configs: one model for inline completions (fast, smaller), one for chat/tools (bigger Qwen).”
| Task | Model Size | Why ||-------------------|------------|-----------------------------|| Inline completion | 7B | Fast, responsive || Chat/Tools | 30B | Better reasoning, quality || Embeddings | Small | Fast indexing, not critical |You’ll need two llama-server instances on different ports for this.
Important: Match Model Names Exactly
One user warned: “Name the model exactly like your llama-server advertises.”
If your server exposes the model as Qwen/Qwen3-coder-30B, use that exact string in your config. Mismatched names cause connection failures.
Alternative: Cursor IDE
If you prefer an all-in-one solution, Cursor is a fork of VS Code with built-in AI:
| Feature | Continue + VS Code | Cursor ||----------------|-------------------------|---------------------|| Setup | Manual config | Built-in || Flexibility | High | Medium || Local LLM | Yes | Yes (with config) || Copilot | No | Optional |Both support local LLMs via OpenAI-compatible endpoints.
Common Mistakes I Made
-
Using a single large model for all tasks - Autocomplete with a 30B model is slow. Use 7B for completions.
-
Mismatched model names - The config model name must match exactly what the server exposes.
-
Not caching embeddings - Without local embeddings, context is limited to the open file. Set up an embeddings provider.
-
Forgetting to start the server - VS Code errors are confusing if the LLM server isn’t running.
-
Wrong API base URL - Don’t forget the
/v1suffix:http://localhost:8080/v1
Testing Your Setup
Open VS Code and try:
- Chat: Open Continue sidebar, ask a coding question
- Inline completion: Start typing a function, press Tab
- Code explanation: Select code, use “Explain” command
If nothing works, check:
- Is llama-server running? (
curl http://localhost:8080/v1/models) - Is the port correct in config?
- Does the model name match?
LiteLLM for Unified Proxy
If you run multiple models, use LiteLLM as a unified proxy:
model_list: - model_name: "Qwen/Qwen3-coder-30B" litellm_params: model: "huggingface/Qwen/Qwen3-coder-30B" api_base: "http://localhost:8080" - model_name: "Qwen/Qwen3-coder-7B" litellm_params: model: "huggingface/Qwen/Qwen3-coder-7B" api_base: "http://localhost:8081"This gives you a single endpoint that routes to different models.
Summary
In this post, I showed how to integrate local LLMs into VS Code. The key point is using the Continue extension with OpenAI-compatible settings pointing to your local llama-server.
Use dual models for optimal performance: a fast 7B model for inline completions and a larger 30B model for chat and complex tasks.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments