Which Ollama Model is Best for Coding with 8GB VRAM?

The Problem

I recently set up a local LLM for coding assistance on my RTX 4060 laptop. With only 8GB of VRAM, I quickly ran into memory errors when trying to run larger models.

My first attempt with a 14B model failed spectacularly. Ollama crashed with an out-of-memory error. I then tried smaller models, but they produced garbage code. I needed to find the sweet spot between VRAM constraints and coding ability.

After testing several models and researching Reddit discussions, I found what works.

Why VRAM Limits Matter for Coding

VRAM determines which models can run entirely on your GPU. When a model fits in VRAM, you get fast inference. When it doesn’t, Ollama offloads layers to system RAM, which is much slower.

For coding assistance, this matters because:

  1. Coding requires back-and-forth: You ask a question, get code, refine it, ask follow-ups. Each interaction needs fast responses.

  2. Context adds up: Coding conversations get long. A model that barely fits your VRAM might struggle when you add more context.

  3. Model size affects reasoning: Smaller models (under 7B parameters) often lack the reasoning depth needed for complex code tasks.

The trade-off is simple: larger models understand code better but need more memory. Smaller models fit easily but might produce wrong or nonsensical code.
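To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It is an estimate under stated assumptions, not a measurement: it assumes 4-bit quantized weights at about 4.5 bits per weight, and the 7B-class shape used for the KV cache (28 layers, 4 KV heads, head dimension 128) is a hypothetical example rather than a figure from my tests. Real usage also includes runtime overhead that this ignores.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # ASSUMPTION: ~4.5 bits per weight, typical of 4-bit quantization.
    return params_billion * bits_per_weight / 8


def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


if __name__ == "__main__":
    for size in (4, 7, 9, 14, 32):
        print(f"{size}B weights at ~4.5 bits: {weights_gb(size):.1f} GB")
    # Hypothetical 7B-class shape: 28 layers, 4 KV heads, head dimension 128.
    print(f"KV cache at 8192 tokens: {kv_cache_gb(8192, 28, 4, 128):.2f} GB")
```

Under these assumptions a 7B model needs about 4 GB for weights alone, while a 14B model needs almost 8 GB before any context is added, which is consistent with the 14B model not fitting on an 8GB card.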

What I Tested

My hardware setup:

  • GPU: RTX 4060 Laptop (8GB VRAM)
  • RAM: 64GB system RAM
  • CPU: i7-12650H

I tested these models over two weeks of daily coding:

| Model | Parameters | VRAM Usage | Result |
| --- | --- | --- | --- |
| Qwen 2.5 Coder 7B | 7B | ~5-6GB | Fits, fast |
| Qwen 3.5 8B | 8B | ~6-7GB | Fits, fast |
| Qwen 3.5 9B | 9B | ~7GB | Fits, slightly slower |
| Qwen 3.5 4B | 4B | ~3GB | Fits, but poor code |
| Qwen 3.5 32B | 32B | 8GB + RAM | Hybrid mode, slower |

I used these models for code generation, debugging, refactoring, and explaining concepts in Python and JavaScript.

Best Model for 8GB VRAM: Qwen 2.5 Coder 7B

After testing, I recommend Qwen 2.5 Coder 7B for 8GB VRAM setups.

Here’s why:

  1. Fits entirely in VRAM: Uses about 5-6GB, leaving headroom for context and system overhead.

  2. Optimized for code: The “Coder” variant is specifically trained for programming tasks, unlike general-purpose models.

  3. Fast responses: Since all layers run on GPU, you get 30-50 tokens per second on an RTX 4060.

  4. Good reasoning: At 7B parameters, it has enough capacity for complex logic, unlike smaller 4B models.

To install it:

install-qwen-coder.sh
ollama pull qwen2.5-coder:7b

Run it:

run-qwen-coder.sh
ollama run qwen2.5-coder:7b
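Once the model is pulled, you can also call it from a script through Ollama's local REST API instead of the interactive prompt. The sketch below assumes the Ollama server is running on its default address, http://localhost:11434; the helper names build_request and ask are my own, not part of Ollama.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen2.5-coder:7b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, ask("Write a Python function that reverses a string.") returns the model's answer as a plain string.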

Alternative: Qwen 3.5 8B

If you want a newer model, Qwen 3.5 8B is another solid choice. It fits in 8GB VRAM and offers similar coding ability.

The 8B parameter count sits at the upper limit of what 8GB VRAM can handle. In practice, it works well:

install-qwen35.sh
ollama pull qwen3.5:8b

The difference between 7B and 8B is subtle. I found Qwen 2.5 Coder slightly better at code-specific tasks, while Qwen 3.5 8B handles general reasoning a bit better. For coding, stick with the Coder variant.

The 32B Option: Slower but Smarter

If you have 64GB+ system RAM (like I do), you can run larger models through hybrid CPU+GPU inference. This uses your GPU for some layers and system RAM for the rest.

Reddit users reported running Qwen 3.5 32B at 15-18 tokens per second with this setup. That’s slower than full GPU inference, but the larger model is noticeably stronger on complex coding tasks.

To try this:

hybrid-inference.sh
ollama pull qwen3.5:32b
# Run with partial GPU offload
ollama run qwen3.5:32b

Ollama automatically handles the layer distribution. The trade-off:

  • Speed: 15-18 tokens/s vs 30-50 tokens/s for 7B models
  • Quality: Noticeably better for complex reasoning and larger codebases
  • Use case: Good for when you need deep analysis rather than quick iterations

I use this for architectural discussions and complex refactoring, not daily coding assistance.

Models to Avoid

Not all models work well for coding, even if they fit in VRAM.

Qwen 3.5 4B: Too Small for Real Coding

A Reddit user warned: “Qwen 3.5 4B is fast but stupid for coding.”

I tested this and confirmed it. The 4B model produces code that looks plausible but contains subtle bugs and logical errors. It misses edge cases and struggles with anything beyond simple functions.

example-4b-output.py
# Qwen 3.5 4B suggested this for a sorting function
def sort_list(items):
    return sorted(items)  # Looks fine but ignores None handling

# When asked about edge cases, it added:
def sort_list(items):
    if items:
        return sorted(items)
    return []
# Still missing: what if items contains None?

For simple tasks, 4B might work. But debugging its mistakes takes longer than writing code yourself.
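For comparison, this is roughly the kind of answer I was hoping for. It is only a sketch: where None values should end up is a design decision, and placing them last is my assumption here, not something the model or the task specified.

```python
def sort_list(items):
    """Return a sorted copy; a None input gives [], None elements go last."""
    if items is None:
        return []
    # Sort on (is_none, value): real values come first in their normal order,
    # and None elements are grouped at the end without ever being compared
    # against the other values.
    return sorted(items, key=lambda x: (x is None, x))
```

The tuple key avoids the TypeError that plain sorted() raises when it has to compare None with a number or a string.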

Models of 14B and Larger Without Hybrid Setup

If you try to run a 14B+ model on 8GB VRAM without proper CPU offloading, expect:

  • Slow responses (single-digit tokens per second)
  • Potential crashes
  • Inconsistent behavior

Configuring Ollama for Optimal GPU Usage

You can fine-tune how Ollama splits a model between GPU and CPU with the num_gpu parameter.

Setting GPU Layers

The num_gpu parameter controls how many model layers run on the GPU. You can set it inside an interactive session, with a PARAMETER line in a Modelfile, or in the options of an API request:

gpu-layers.sh
# For 7B-9B models - keep all layers on GPU
ollama run qwen2.5-coder:7b
# then, at the Ollama prompt:
/set parameter num_gpu 35
# For 32B hybrid - partial GPU offload
ollama run qwen3.5:32b
# then, at the Ollama prompt:
/set parameter num_gpu 20

Higher values mean more layers on GPU (faster) but more VRAM usage. For 7B-9B models, a value of 35 works well on 8GB VRAM.
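If you want a starting value for the GPU layer count rather than guessing, a rough estimate is to divide the VRAM you can spare by the average size of one layer. The sketch below assumes all layers are the same size and reserves 1.5 GB for context and overhead; both are simplifications, and the 18 GB / 64-layer figures in the usage note are hypothetical values for a 4-bit 32B model, not measurements.

```python
import math


def gpu_layers_that_fit(model_gb: float, total_layers: int,
                        vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many equally sized layers fit in the VRAM left after a reserve."""
    per_layer_gb = model_gb / total_layers
    # ASSUMPTION: reserve_gb covers KV cache, runtime overhead, and the desktop.
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, math.floor(usable_gb / per_layer_gb))
```

For example, gpu_layers_that_fit(18.0, 64, 8.0) suggests about 23 layers for the hypothetical 32B model, while gpu_layers_that_fit(4.0, 28, 8.0) returns all 28 layers of a hypothetical 7B model. Treat the result as a first guess and adjust it while watching actual VRAM usage.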

Monitoring VRAM Usage

On Linux, monitor VRAM with:

monitor-vram.sh
watch -n 1 nvidia-smi

On Windows, use Task Manager’s Performance tab, or run nvidia-smi from a terminal (it is installed with the NVIDIA driver).

Watch VRAM usage while generating code. If you see it hitting 7.5GB+ consistently, you might need to reduce context length or use a smaller model.
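If you would rather have a script flag the 7.5GB mark than watch the numbers yourself, nvidia-smi can print just the memory figures in CSV form. This is a minimal sketch assuming a single NVIDIA GPU with nvidia-smi on the PATH; the function names are my own, and the values nvidia-smi reports are in MiB, so the threshold is applied as 7.5 × 1024 MiB.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB without header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from the CSV output."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def vram_is_tight(used_mib: int, limit_gb: float = 7.5) -> bool:
    """True once usage reaches the limit (7.5 by default, as in the text above)."""
    return used_mib / 1024 >= limit_gb


def read_vram() -> tuple[int, int]:
    """Run nvidia-smi and return (used, total) in MiB for the first GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out.splitlines()[0])
```

Calling vram_is_tight(read_vram()[0]) while the model is generating tells you whether it is time to shorten the context or switch to a smaller model.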

Real-World Performance Comparison

Here’s how the models performed in my daily coding work:

| Model | Speed (tokens/s) | Code Quality | Best For |
| --- | --- | --- | --- |
| Qwen 2.5 Coder 7B | 35-45 | Good | Daily coding, quick iterations |
| Qwen 3.5 8B | 30-40 | Good | General coding + reasoning |
| Qwen 3.5 9B | 25-35 | Good | Slightly better reasoning |
| Qwen 3.5 32B (hybrid) | 15-18 | Excellent | Complex refactoring, architecture |
| Qwen 3.5 4B | 50-60 | Poor | Avoid for coding |

Code quality means how often the generated code works correctly on first try. “Good” means 70-80% accuracy for typical coding tasks. “Excellent” means 85-90% accuracy.
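To check speed figures like these on your own machine, the quickest route is ollama run with the --verbose flag, which prints an eval rate after each reply. If you call the REST API instead, the final non-streaming response includes eval_count (generated tokens) and eval_duration (nanoseconds), from which the rate follows directly. A small sketch, with a helper name of my own choosing:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a non-streaming /api/generate response."""
    # eval_count is the number of generated tokens;
    # eval_duration is the generation time in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)
```

For instance, a response reporting 400 generated tokens in 10 seconds works out to 40 tokens per second.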

GLM-4-9B: An Alternative

Another option Reddit users mentioned is GLM-4-9B. It’s designed for coding and fits 8GB VRAM.

install-glm.sh
ollama pull glm4:9b

I found GLM-4-9B competitive with Qwen 2.5 Coder for Python tasks. For JavaScript and TypeScript, Qwen felt slightly better. Your experience may vary depending on your primary language.

Choosing the Right Model

Your choice depends on what you value:

Speed over quality → Qwen 2.5 Coder 7B
Balance → Qwen 3.5 8B
Quality over speed (with 64GB RAM) → Qwen 3.5 32B hybrid
Alternative → GLM-4-9B

For most developers with 8GB VRAM, start with Qwen 2.5 Coder 7B. It’s the most practical choice for daily coding assistance.

Summary

In this post, I compared Ollama coding models for 8GB VRAM GPUs based on real testing and community discussions.

The key findings:

  1. Qwen 2.5 Coder 7B is the best overall choice for 8GB VRAM. It fits entirely in memory, runs fast, and produces good code.

  2. Qwen 3.5 8B/9B work well as alternatives if you prefer the newer Qwen variant.

  3. Qwen 3.5 32B is viable with 64GB+ system RAM through hybrid inference, but slower.

  4. Avoid models under 7B for serious coding work—they lack reasoning depth.

  5. Avoid models over 14B without proper hybrid setup—they’ll be too slow or crash.

The 7B-9B range is the sweet spot for 8GB VRAM. These models fit in memory, run fast, and have enough parameters for competent coding assistance.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
