Which Ollama Model is Best for Coding with 8GB VRAM?
The Problem
I recently set up a local LLM for coding assistance on my RTX 4060 laptop. With only 8GB of VRAM, I quickly ran into memory errors when trying to run larger models.
My first attempt with a 14B model failed spectacularly. Ollama crashed with an out-of-memory error. I then tried smaller models, but they produced garbage code. I needed to find the sweet spot between VRAM constraints and coding ability.
After testing several models and researching Reddit discussions, I found what works.
Why VRAM Limits Matter for Coding
VRAM determines which models can run entirely on your GPU. When a model fits in VRAM, you get fast inference. When it doesn’t, Ollama keeps the layers that don’t fit in system RAM and runs them on the CPU, which is much slower.
For coding assistance, this matters because:
- Coding requires back-and-forth: You ask a question, get code, refine it, ask follow-ups. Each interaction needs fast responses.
- Context adds up: Coding conversations get long. A model that barely fits your VRAM might struggle when you add more context.
- Model size affects reasoning: Smaller models (under 7B parameters) often lack the reasoning depth needed for complex code tasks.
The trade-off is simple: larger models understand code better but need more memory. Smaller models fit easily but might produce wrong or nonsensical code.
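To make the trade-off concrete, here is a rough back-of-the-envelope memory estimate. It is a sketch under stated assumptions, not a measurement: the 4.5 bits per weight figure approximates a typical 4-bit quantization, and the layer count, KV-head count, and head dimension in the example are illustrative values for a 7B-class model that you should replace with the real ones from the model card.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes).

    bits_per_weight=4.5 is an assumed average for a 4-bit quantization.
    """
    return params_billion * bits_per_weight / 8


def estimate_kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                            context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    for size in (4, 7, 14, 32):
        print(f"{size}B weights: ~{estimate_weights_gb(size):.1f} GB")
    # Illustrative architecture values (assumptions), 16-bit cache, 8192-token context.
    kv = estimate_kv_cache_bytes(n_layers=28, n_kv_heads=4, head_dim=128,
                                 context_tokens=8192)
    print(f"KV cache at 8192 tokens: ~{kv / 2**30:.2f} GiB")
```

Under these assumptions a 7B model needs roughly 4 GB for weights alone, while a 14B model needs close to 8 GB before any context is added, which is consistent with a 14B model not fitting on an 8GB card.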
What I Tested
My hardware setup:
- GPU: RTX 4060 Laptop (8GB VRAM)
- RAM: 64GB system RAM
- CPU: i7-12650H
I tested these models over two weeks of daily coding:
| Model | Parameters | VRAM Usage | Result |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | ~5-6GB | Fits, fast |
| Qwen 3.5 8B | 8B | ~6-7GB | Fits, fast |
| Qwen 3.5 9B | 9B | ~7GB | Fits, slightly slower |
| Qwen 3.5 4B | 4B | ~3GB | Fits, but poor code |
| Qwen 3.5 32B | 32B | 8GB + RAM | Hybrid mode, slower |
I used these models for code generation, debugging, refactoring, and explaining concepts in Python and JavaScript.
Best Model for 8GB VRAM: Qwen 2.5 Coder 7B
After testing, I recommend Qwen 2.5 Coder 7B for 8GB VRAM setups.
Here’s why:
- Fits entirely in VRAM: Uses about 5-6GB, leaving headroom for context and system overhead.
- Optimized for code: The “Coder” variant is specifically trained for programming tasks, unlike general-purpose models.
- Fast responses: Since all layers run on GPU, you get 30-50 tokens per second on an RTX 4060.
- Good reasoning: At 7B parameters, it has enough capacity for complex logic, unlike smaller 4B models.
To install it:

    ollama pull qwen2.5-coder:7b

Run it:

    ollama run qwen2.5-coder:7b

Alternative: Qwen 3.5 8B
If you want a newer model, Qwen 3.5 8B is another solid choice. It fits in 8GB VRAM and offers similar coding ability.
The 8B parameter count sits at the upper limit of what 8GB VRAM can handle. In practice, it works well:

    ollama pull qwen3.5:8b

The difference between 7B and 8B is subtle. I found Qwen 2.5 Coder slightly better at code-specific tasks, while Qwen 3.5 8B handles general reasoning a bit better. For coding, stick with the Coder variant.
The 32B Option: Slower but Smarter
If you have 64GB+ system RAM (like I do), you can run larger models through hybrid CPU+GPU inference. Some layers run on the GPU, and the rest stay in system RAM and run on the CPU.
Reddit users reported running Qwen 3.5 32B at 15-18 tokens per second with this setup. That’s slower than full GPU inference, but the larger model is much stronger on complex coding tasks.
To try this:

    ollama pull qwen3.5:32b
    # Run with partial GPU offload
    ollama run qwen3.5:32b

Ollama automatically handles the layer distribution. The trade-off:
- Speed: 15-18 tokens/s vs 30-50 tokens/s for 7B models
- Quality: Noticeably better for complex reasoning and larger codebases
- Use case: Good for when you need deep analysis rather than quick iterations
I use this for architectural discussions and complex refactoring, not daily coding assistance.
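To see how a loaded model was actually split between GPU and CPU, Ollama's `/api/ps` endpoint reports each running model's total `size` and the portion held in VRAM as `size_vram`. The sketch below assumes a default local server at `http://localhost:11434`; the helper names are my own and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def list_running_models(base_url: str = OLLAMA_URL) -> list:
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])


def main() -> None:
    try:
        models = list_running_models()
    except OSError as exc:  # server not running or unreachable
        print(f"Could not reach Ollama: {exc}")
        return
    for entry in models:
        print(f"{entry.get('name', '?')}: {gpu_fraction(entry):.0%} in VRAM")


if __name__ == "__main__":
    main()
```

A value well below 100% for the 32B model confirms that it is running in hybrid mode rather than fully on the GPU.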
Models to Avoid
Not all models work well for coding, even if they fit in VRAM.
Qwen 3.5 4B: Too Small for Real Coding
A Reddit user warned: “Qwen 3.5 4B is fast but stupid for coding.”
I tested this and confirmed it. The 4B model produces code that looks plausible but contains subtle bugs and logical errors. It misses edge cases and struggles with anything beyond simple functions.

    # Qwen 3.5 4B suggested this for a sorting function
    def sort_list(items):
        return sorted(items)  # Looks fine but ignores None handling

    # When asked about edge cases, it added:
    def sort_list(items):
        if items:
            return sorted(items)
        return []  # Still missing: what if items contains None?

For simple tasks, 4B might work. But debugging its mistakes takes longer than writing code yourself.
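For comparison, here is one way the missing None case could be handled. This is a sketch of a possible fix, not output from any of the models tested, and placing None values at the end is an assumed policy; your code may need to drop them or raise an error instead.

```python
def sort_list_none_safe(items):
    """Sort a list, treating a missing list as empty and placing None values last."""
    if not items:
        return []
    # The key's first element separates None from real values, so None is never
    # compared directly against a number or string.
    return sorted(items, key=lambda x: (x is None, 0 if x is None else x))


if __name__ == "__main__":
    print(sort_list_none_safe([3, None, 1]))
```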
Models Larger Than 14B Without Hybrid Setup
If you try to run a 14B+ model on 8GB VRAM without proper CPU offloading, expect:
- Slow responses (single-digit tokens per second)
- Potential crashes
- Inconsistent behavior
Configuring Ollama for Optimal GPU Usage
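You can fine-tune how Ollama uses your GPU through its configuration.
Setting GPU Layers
The num_gpu parameter (set in a Modelfile or in the API options) controls how many model layers run on the GPU. The commands below use an OLLAMA_GPU_LAYERS environment variable instead. As far as I can tell it is not listed among Ollama’s documented environment variables, so check that it has an effect on your version and fall back to num_gpu if it doesn’t: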

    # For 7B-9B models - keep all layers on GPU
    OLLAMA_GPU_LAYERS=35 ollama run qwen2.5-coder:7b
    # For 32B hybrid - partial GPU offload
    OLLAMA_GPU_LAYERS=20 ollama run qwen3.5:32b

Higher values mean more layers on GPU (faster) but more VRAM usage. For 7B-9B models, 35 layers works well on 8GB VRAM.
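The num_gpu setting can also be passed per request through the `options` field of Ollama's `/api/generate` endpoint. The sketch below assumes a default local server at `http://localhost:11434`; the function names are hypothetical helpers, and the layer count of 20 is just the value used above, not a tuned recommendation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def build_generate_request(model: str, prompt: str, num_gpu: int) -> dict:
    """Build an /api/generate payload that requests num_gpu layers on the GPU."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream
        "options": {"num_gpu": num_gpu},
    }


def generate(payload: dict, base_url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = build_generate_request("qwen3.5:32b", "Explain a binary search.", 20)
    try:
        print(generate(payload).get("response", ""))
    except OSError as exc:  # server not running or model not available
        print(f"Request failed: {exc}")
```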
Monitoring VRAM Usage
On Linux, monitor VRAM with:

    watch -n 1 nvidia-smi

On Windows, use Task Manager’s Performance tab or NVIDIA Control Panel.
Watch VRAM usage while generating code. If you see it hitting 7.5GB+ consistently, you might need to reduce context length or use a smaller model.
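If you would rather script this check, `nvidia-smi` can print just the memory numbers in CSV form. The sketch below is a minimal example assuming a single NVIDIA GPU with `nvidia-smi` on the PATH; the 7680 MiB threshold is simply the 7.5GB figure above converted to MiB, and the helper names are my own.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple:
    """Parse a line such as '6123, 8188' into (used_mib, total_mib)."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def is_near_limit(used_mib: int, limit_mib: int = 7680) -> bool:
    """True when usage reaches the limit (7680 MiB = 7.5 GiB by default)."""
    return used_mib >= limit_mib


def read_vram_mib() -> tuple:
    """Run nvidia-smi and return (used_mib, total_mib) for the first GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.strip().splitlines()[0])


if __name__ == "__main__":
    try:
        used, total = read_vram_mib()
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"Could not query nvidia-smi: {exc}")
    else:
        warning = " (near the limit)" if is_near_limit(used) else ""
        print(f"VRAM: {used} / {total} MiB{warning}")
```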
Real-World Performance Comparison
Here’s how the models performed in my daily coding work:
| Model | Speed (tokens/s) | Code Quality | Best For |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 35-45 | Good | Daily coding, quick iterations |
| Qwen 3.5 8B | 30-40 | Good | General coding + reasoning |
| Qwen 3.5 9B | 25-35 | Good | Slightly better reasoning |
| Qwen 3.5 32B (hybrid) | 15-18 | Excellent | Complex refactoring, architecture |
| Qwen 3.5 4B | 50-60 | Poor | Avoid for coding |
Code quality means how often the generated code works correctly on first try. “Good” means 70-80% accuracy for typical coding tasks. “Excellent” means 85-90% accuracy.
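If you want to reproduce the speed column on your own hardware, a non-streaming reply from Ollama's `/api/generate` endpoint includes `eval_count` (generated tokens) and `eval_duration` (in nanoseconds), from which tokens per second follows directly. The sketch below only does the arithmetic; the sample reply is made-up illustrative data, not one of my measurements.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama reply's eval_count and eval_duration (ns)."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


if __name__ == "__main__":
    # Illustrative values only: 400 tokens generated in 10 seconds.
    sample_reply = {"eval_count": 400, "eval_duration": 10_000_000_000}
    print(f"{tokens_per_second(sample_reply):.1f} tokens/s")
```

Averaging this over several prompts of different lengths gives a more stable number than a single run.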
GLM-4-9B: An Alternative
Another option Reddit users mentioned is GLM-4-9B. It’s designed for coding and fits 8GB VRAM.

    ollama pull glm4:9b

I found GLM-4-9B competitive with Qwen 2.5 Coder for Python tasks. For JavaScript and TypeScript, Qwen felt slightly better. Your experience may vary depending on your primary language.
Choosing the Right Model
Your choice depends on what you value:
- Speed over quality → Qwen 2.5 Coder 7B
- Balance → Qwen 3.5 8B
- Quality over speed (with 64GB RAM) → Qwen 3.5 32B hybrid
- Alternative → GLM-4-9B
For most developers with 8GB VRAM, start with Qwen 2.5 Coder 7B. It’s the most practical choice for daily coding assistance.
Summary
In this post, I compared Ollama coding models for 8GB VRAM GPUs based on real testing and community discussions.
The key findings:
- Qwen 2.5 Coder 7B is the best overall choice for 8GB VRAM. It fits entirely in memory, runs fast, and produces good code.
- Qwen 3.5 8B/9B work well as alternatives if you prefer the newer Qwen variant.
- Qwen 3.5 32B is viable with 64GB+ system RAM through hybrid inference, but slower.
- Avoid models under 7B for serious coding work: they lack reasoning depth.
- Avoid models over 14B without proper hybrid setup: they’ll be too slow or crash.
The 7B-9B range is the sweet spot for 8GB VRAM. These models fit in memory, run fast, and have enough parameters for competent coding assistance.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in a way that helps others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 Reddit: RTX 4060 + Ollama Model Discussion
- 👨‍💻 Ollama Model Library
- 👨‍💻 Qwen Official Documentation