Is CPU+GPU Hybrid Inference Fast Enough for Coding?
I have an RTX 4060 laptop with 8GB VRAM and 64GB system RAM. I wanted to run larger coding models like Qwen 3.5 35B, but my VRAM isn’t enough for pure GPU inference. The question that kept me up: Is CPU+GPU hybrid inference usable for real-time coding assistance?
The Problem: VRAM Bottleneck
Running large language models locally for coding assistance requires serious VRAM. A 35B parameter model at 4-bit quantization needs roughly 18-20GB of VRAM for full GPU inference. My 8GB RTX 4060 can only fit a 7B or 8B model entirely in GPU memory.
I had two options:
- Use smaller models (7B-8B) with full GPU acceleration
- Try hybrid CPU+GPU inference with larger models
Option 1 felt limiting. I wanted the reasoning capabilities of larger models. So I tested option 2.
What is CPU+GPU Hybrid Inference?
Hybrid inference splits the model between GPU VRAM and system RAM. Some layers run on the GPU, others on the CPU. This lets you run models larger than your VRAM capacity.
Ollama and llama.cpp handle this automatically through a feature called “partial offloading.” When your GPU doesn’t have enough VRAM for the entire model, it offloads some layers to CPU/RAM.
# Ollama automatically uses hybrid mode when VRAM is insufficientollama run qwen2.5-coder:32b
# Check layer distribution after loadingollama psThe output shows how many layers sit on GPU vs CPU:
NAME ID SIZE PROCESSOR UNTILqwen2.5-coder:32b abc123... 18.2 GB 35%/65% 4 minutes from nowThat 35%/65% split means roughly 35% of model layers run on GPU, 65% on CPU.
The Latency Question
Here’s what worried me: CPU inference is slow. Really slow. Would mixing CPU and GPU inference make the model unbearably sluggish for interactive coding?
I ran benchmarks with Qwen 3.5 35B (a Mixture-of-Experts model) on my setup:
My Hardware:
- RTX 4060 Laptop GPU (8GB VRAM)
- 64GB DDR5 RAM
- AMD Ryzen 9 7940HS (8 cores, AVX-512 support)
Results:
Model: Qwen 3.5 35B (Q4_K_M quantization)Mode: Hybrid (partial GPU offload)Tokens/second: 15-18 t/sTime to first token: 1.2-1.5 secondsFor comparison, a 7B model fully on GPU gives 50-80 tokens/second. So hybrid mode is slower. But is 15-18 tokens/second usable?
Perception Threshold: What Feels “Real-time”?
Research on LLM latency perception suggests 15-20 tokens/second feels interactive for most users. Below 10 t/s, users notice significant lag. Above 20 t/s, responses feel smooth.
My 15-18 t/s falls right in the usable zone. It’s not instant, but it doesn’t break the coding flow.
Why MoE Models Excel in Hybrid Mode
Not all models perform equally in hybrid mode. Dense models like Llama 2 13B struggle more than Mixture-of-Experts (MoE) models like Qwen 3.5.
How MoE Works:
MoE models contain multiple “expert” sub-networks. For each token, only a subset of experts activate. Qwen 3.5 35B has many total parameters but only uses 10-20% of them per inference step.
Dense 35B model: All 35B parameters process every tokenMoE 35B model: ~7B active parameters per tokenThis sparse activation means MoE models:
- Move less data between CPU and GPU
- Achieve better throughput in hybrid mode
- Provide larger-model intelligence at smaller-model speeds
My benchmarks confirm this. Qwen 3.5 35B in hybrid mode (15-18 t/s) outperforms my expectations, while dense 30B+ models dip below 10 t/s on the same hardware.
Real-World Coding Experience
I spent a week using Qwen 3.5 35B in hybrid mode for actual development work. Here’s what worked and what didn’t:
Good Use Cases:
-
Code Review and Explanation I’d paste a function and ask for explanation. The 1-2 second initial delay didn’t bother me. Once generation started, 15 t/s felt responsive enough.
-
Architecture Discussions Brainstorming system design with the model. The slower pace actually helped me think through responses.
-
Debugging Assistance Pasting error logs and asking for diagnosis. The quality of reasoning from the larger model compensated for speed.
-
Documentation Generation Having the model write docstrings or README sections. Speed mattered less here.
Poor Use Cases:
-
Rapid Autocomplete Not suitable for VS Code-style quick suggestions. Too slow.
-
Real-time Pair Programming The latency breaks the flow of quick back-and-forth exchanges.
-
Speed-Critical Iteration When you need many quick queries in succession, the delay compounds.
Configuration Tips for Best Performance
After experimenting, I found several optimizations:
# Set GPU layers manually (experiment with your hardware)export OLLAMA_NUM_GPU=28
# This forces more layers onto GPU# Higher values = faster but need more VRAM# Lower values = slower but work with less VRAM
# For my 8GB VRAM setup, 28-32 GPU layers works wellYou can also use llama.cpp directly for more control:
llama-server -m qwen-35b-q4_k_m.gguf \ --n-gpu-layers 32 \ --ctx-size 8192 \ --threads 8 \ --batch-size 512The --n-gpu-layers flag controls GPU offloading. More layers on GPU means faster inference but requires more VRAM.
Streaming Improves Perceived Performance
Even at 15 t/s, streaming responses makes the experience feel faster:
import ollama
# Stream response for better UX with slower modelsresponse = ollama.chat( model='qwen2.5-coder:32b', messages=[{ 'role': 'user', 'content': 'Explain this React hook' }], stream=True)
for chunk in response: print(chunk['message']['content'], end='', flush=True)Streaming shows progress immediately. Users see text appearing rather than waiting for a complete response.
Monitoring Your Setup
Check how your model distributes across hardware:
# Watch GPU utilization during inferencewatch -n 1 nvidia-smi
# In another terminal, check Ollama statusollama ps
# Look for VRAM usage in nvidia-smi output# If VRAM maxes out but model is slow, increase OLLAMA_NUM_GPUPerformance Comparison Table
| Setup | Model | Mode | Tokens/sec | Coding Usability |
|---|---|---|---|---|
| 8GB VRAM | 7B dense | GPU-only | 50-80 | Excellent |
| 8GB VRAM | 14B dense | Hybrid | 20-30 | Good |
| 8GB VRAM | 32B dense | Hybrid | 8-15 | Marginal |
| 8GB VRAM | 35B MoE | Hybrid | 15-18 | Usable |
| 12GB VRAM | 35B MoE | Hybrid | 25-35 | Good |
The sweet spot for 8GB VRAM: MoE models in the 30-40B range.
Hardware Recommendations
If you’re building a hybrid inference rig for coding:
Minimum:
- 8GB VRAM
- 32GB system RAM
- Modern CPU with AVX-2 support
Recommended:
- 12GB+ VRAM
- 64GB system RAM
- CPU with AVX-512 support (significantly improves CPU inference)
Ideal:
- 16GB+ VRAM
- 64GB+ RAM
- Multiple memory channels (dual/quad channel helps CPU throughput)
The Trade-off: Quality vs Speed
Running larger models in hybrid mode means choosing model intelligence over response speed. A 7B model on pure GPU gives instant responses. A 35B MoE in hybrid mode gives smarter responses, but slower.
For coding assistance, I found the trade-off worthwhile. The larger model’s better reasoning and context understanding often meant I needed fewer iterations to solve problems. The 1-2 extra seconds per response didn’t significantly impact my workflow.
Related Approaches
Other strategies for running large models with limited VRAM:
Quantization: Lower precision (Q4, Q3) reduces memory. A Q3 quantization of a 35B model needs ~12GB VRAM for full GPU inference.
Model Offloading with vLLM: vLLM offers more sophisticated memory management than Ollama, but requires more setup.
Remote APIs: If local performance is too slow, cloud inference APIs like Groq or Together AI offer fast inference for large models.
Key Takeaways
-
Hybrid inference works for coding - 15-18 tokens/sec is usable for thoughtful coding workflows
-
MoE models are optimal - Qwen 3.5 35B’s mixture-of-experts architecture provides better quality/speed balance than dense models
-
Trade speed for intelligence - You get better reasoning from larger models at the cost of slower responses
-
RAM matters as much as VRAM - 64GB system RAM enables smooth hybrid operation
-
Match your workload - Code review and debugging work well; real-time autocomplete doesn’t
For developers with limited VRAM who need coding intelligence from larger models, CPU+GPU hybrid inference is a viable solution. Set your expectations: you’re trading speed for capability. With an MoE model like Qwen 3.5 35B, the 15-18 tokens/sec achieved on typical 8GB VRAM hardware stays in the usable range for most coding assistance tasks.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments