Skip to content

Is CPU+GPU Hybrid Inference Fast Enough for Coding?

I have an RTX 4060 laptop with 8GB VRAM and 64GB system RAM. I wanted to run larger coding models like Qwen 3.5 35B, but my VRAM isn’t enough for pure GPU inference. The question that kept me up: Is CPU+GPU hybrid inference usable for real-time coding assistance?

The Problem: VRAM Bottleneck

Running large language models locally for coding assistance requires serious VRAM. A 35B parameter model at 4-bit quantization needs roughly 18-20GB of VRAM for full GPU inference. My 8GB RTX 4060 can only fit a 7B or 8B model entirely in GPU memory.

I had two options:

  1. Use smaller models (7B-8B) with full GPU acceleration
  2. Try hybrid CPU+GPU inference with larger models

Option 1 felt limiting. I wanted the reasoning capabilities of larger models. So I tested option 2.

What is CPU+GPU Hybrid Inference?

Hybrid inference splits the model between GPU VRAM and system RAM. Some layers run on the GPU, others on the CPU. This lets you run models larger than your VRAM capacity.

Ollama and llama.cpp handle this automatically through a feature called “partial offloading.” When your GPU doesn’t have enough VRAM for the entire model, it offloads some layers to CPU/RAM.

check-ollama-mode.sh
# Ollama automatically uses hybrid mode when VRAM is insufficient
ollama run qwen2.5-coder:32b
# Check layer distribution after loading
ollama ps

The output shows how many layers sit on GPU vs CPU:

NAME ID SIZE PROCESSOR UNTIL
qwen2.5-coder:32b abc123... 18.2 GB 35%/65% 4 minutes from now

That 35%/65% split means roughly 35% of model layers run on GPU, 65% on CPU.

The Latency Question

Here’s what worried me: CPU inference is slow. Really slow. Would mixing CPU and GPU inference make the model unbearably sluggish for interactive coding?

I ran benchmarks with Qwen 3.5 35B (a Mixture-of-Experts model) on my setup:

My Hardware:

  • RTX 4060 Laptop GPU (8GB VRAM)
  • 64GB DDR5 RAM
  • AMD Ryzen 9 7940HS (8 cores, AVX-512 support)

Results:

Model: Qwen 3.5 35B (Q4_K_M quantization)
Mode: Hybrid (partial GPU offload)
Tokens/second: 15-18 t/s
Time to first token: 1.2-1.5 seconds

For comparison, a 7B model fully on GPU gives 50-80 tokens/second. So hybrid mode is slower. But is 15-18 tokens/second usable?

Perception Threshold: What Feels “Real-time”?

Research on LLM latency perception suggests 15-20 tokens/second feels interactive for most users. Below 10 t/s, users notice significant lag. Above 20 t/s, responses feel smooth.

My 15-18 t/s falls right in the usable zone. It’s not instant, but it doesn’t break the coding flow.

Why MoE Models Excel in Hybrid Mode

Not all models perform equally in hybrid mode. Dense models like Llama 2 13B struggle more than Mixture-of-Experts (MoE) models like Qwen 3.5.

How MoE Works:

MoE models contain multiple “expert” sub-networks. For each token, only a subset of experts activate. Qwen 3.5 35B has many total parameters but only uses 10-20% of them per inference step.

Dense 35B model: All 35B parameters process every token
MoE 35B model: ~7B active parameters per token

This sparse activation means MoE models:

  • Move less data between CPU and GPU
  • Achieve better throughput in hybrid mode
  • Provide larger-model intelligence at smaller-model speeds

My benchmarks confirm this. Qwen 3.5 35B in hybrid mode (15-18 t/s) outperforms my expectations, while dense 30B+ models dip below 10 t/s on the same hardware.

Real-World Coding Experience

I spent a week using Qwen 3.5 35B in hybrid mode for actual development work. Here’s what worked and what didn’t:

Good Use Cases:

  1. Code Review and Explanation I’d paste a function and ask for explanation. The 1-2 second initial delay didn’t bother me. Once generation started, 15 t/s felt responsive enough.

  2. Architecture Discussions Brainstorming system design with the model. The slower pace actually helped me think through responses.

  3. Debugging Assistance Pasting error logs and asking for diagnosis. The quality of reasoning from the larger model compensated for speed.

  4. Documentation Generation Having the model write docstrings or README sections. Speed mattered less here.

Poor Use Cases:

  1. Rapid Autocomplete Not suitable for VS Code-style quick suggestions. Too slow.

  2. Real-time Pair Programming The latency breaks the flow of quick back-and-forth exchanges.

  3. Speed-Critical Iteration When you need many quick queries in succession, the delay compounds.

Configuration Tips for Best Performance

After experimenting, I found several optimizations:

ollama-env.sh
# Set GPU layers manually (experiment with your hardware)
export OLLAMA_NUM_GPU=28
# This forces more layers onto GPU
# Higher values = faster but need more VRAM
# Lower values = slower but work with less VRAM
# For my 8GB VRAM setup, 28-32 GPU layers works well

You can also use llama.cpp directly for more control:

llama-cpp-hybrid.sh
llama-server -m qwen-35b-q4_k_m.gguf \
--n-gpu-layers 32 \
--ctx-size 8192 \
--threads 8 \
--batch-size 512

The --n-gpu-layers flag controls GPU offloading. More layers on GPU means faster inference but requires more VRAM.

Streaming Improves Perceived Performance

Even at 15 t/s, streaming responses makes the experience feel faster:

stream-ollama.py
import ollama
# Stream response for better UX with slower models
response = ollama.chat(
model='qwen2.5-coder:32b',
messages=[{
'role': 'user',
'content': 'Explain this React hook'
}],
stream=True
)
for chunk in response:
print(chunk['message']['content'], end='', flush=True)

Streaming shows progress immediately. Users see text appearing rather than waiting for a complete response.

Monitoring Your Setup

Check how your model distributes across hardware:

monitor-gpu.sh
# Watch GPU utilization during inference
watch -n 1 nvidia-smi
# In another terminal, check Ollama status
ollama ps
# Look for VRAM usage in nvidia-smi output
# If VRAM maxes out but model is slow, increase OLLAMA_NUM_GPU

Performance Comparison Table

SetupModelModeTokens/secCoding Usability
8GB VRAM7B denseGPU-only50-80Excellent
8GB VRAM14B denseHybrid20-30Good
8GB VRAM32B denseHybrid8-15Marginal
8GB VRAM35B MoEHybrid15-18Usable
12GB VRAM35B MoEHybrid25-35Good

The sweet spot for 8GB VRAM: MoE models in the 30-40B range.

Hardware Recommendations

If you’re building a hybrid inference rig for coding:

Minimum:

  • 8GB VRAM
  • 32GB system RAM
  • Modern CPU with AVX-2 support

Recommended:

  • 12GB+ VRAM
  • 64GB system RAM
  • CPU with AVX-512 support (significantly improves CPU inference)

Ideal:

  • 16GB+ VRAM
  • 64GB+ RAM
  • Multiple memory channels (dual/quad channel helps CPU throughput)

The Trade-off: Quality vs Speed

Running larger models in hybrid mode means choosing model intelligence over response speed. A 7B model on pure GPU gives instant responses. A 35B MoE in hybrid mode gives smarter responses, but slower.

For coding assistance, I found the trade-off worthwhile. The larger model’s better reasoning and context understanding often meant I needed fewer iterations to solve problems. The 1-2 extra seconds per response didn’t significantly impact my workflow.

Other strategies for running large models with limited VRAM:

Quantization: Lower precision (Q4, Q3) reduces memory. A Q3 quantization of a 35B model needs ~12GB VRAM for full GPU inference.

Model Offloading with vLLM: vLLM offers more sophisticated memory management than Ollama, but requires more setup.

Remote APIs: If local performance is too slow, cloud inference APIs like Groq or Together AI offer fast inference for large models.

Key Takeaways

  1. Hybrid inference works for coding - 15-18 tokens/sec is usable for thoughtful coding workflows

  2. MoE models are optimal - Qwen 3.5 35B’s mixture-of-experts architecture provides better quality/speed balance than dense models

  3. Trade speed for intelligence - You get better reasoning from larger models at the cost of slower responses

  4. RAM matters as much as VRAM - 64GB system RAM enables smooth hybrid operation

  5. Match your workload - Code review and debugging work well; real-time autocomplete doesn’t

For developers with limited VRAM who need coding intelligence from larger models, CPU+GPU hybrid inference is a viable solution. Set your expectations: you’re trading speed for capability. With an MoE model like Qwen 3.5 35B, the 15-18 tokens/sec achieved on typical 8GB VRAM hardware stays in the usable range for most coding assistance tasks.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

References

Comments