How Do I Set the Optimal -ngl Value for GPU Offloading in llama.cpp?
I spent an hour tweaking the -ngl value, trying to find the magic number for my GPU. I set it to 10, then 20, then 30, hoping each would be “optimal.” Meanwhile, my model was crawling at 5 tokens per second.
Turns out, I was overcomplicating it. The answer is simpler than I thought: just set -ngl 99 and let llama.cpp handle the rest. But understanding why this works - and when you need a different approach - made all the difference.
The Quick Answer
The optimal -ngl value depends on your GPU’s VRAM capacity and model size. Here’s the simple approach:
```shell
./llama-cli -m model.gguf -ngl 99 -p "Your prompt"
```

With -ngl 99, llama.cpp offloads every layer the model has, because the value is capped at the model's actual layer count. If you run out of VRAM, reduce the value or use a smaller quantization.
For automatic optimization, use the llama-fit-params tool:
```shell
./llama-fit-params -m model.gguf --ctx-size 8192
```

This tool calculates the optimal -ngl value for your specific hardware and model combination.
Understanding the -ngl Flag
The -ngl flag (short for --n-gpu-layers) controls how many transformer layers are offloaded to your GPU. This is crucial because:
```text title="Layer execution comparison"
CPU execution: ~3-5 tokens/sec
GPU execution: ~40-100+ tokens/sec
Difference: 10-30x speedup
```

The Reddit community consensus is clear: “Your best bets are more layers on GPU.” Offloading more layers to your GPU dramatically improves inference speed.
How Transformer Layers Work
A typical LLM has multiple transformer layers (also called “blocks”). For example:
```text title="Model layer counts"
Model          | Layers | Parameters
---------------|--------|------------
Qwen2.5-7B     | 28     | 7.6B
Llama-3-8B     | 32     | 8B
Qwen2.5-14B    | 48     | 14.7B
Qwen2.5-32B    | 64     | 32.5B
Llama-3-70B    | 80     | 70B
```

When you set -ngl 99, llama.cpp offloads all available layers to GPU. If a model has 32 layers and you set -ngl 99, all 32 layers run on GPU.
The Simple Approach: Start High, Reduce If Needed
My recommended workflow:
```shell
# Step 1: Try full offload
./llama-cli -m model-Q4_K_M.gguf -ngl 99 -c 4096 -p "Hello"

# If it works, you're done!
# If you get a CUDA out of memory error, continue to step 2

# Step 2: Halve the value
# (any value above the model's layer count behaves like 99,
#  so for a 32-layer model start below 32 instead of at 50)
./llama-cli -m model-Q4_K_M.gguf -ngl 50 -c 4096 -p "Hello"

# Still OOM? Keep reducing
# Binary search approach
./llama-cli -m model-Q4_K_M.gguf -ngl 25 -c 4096 -p "Hello"
```

This trial-and-error approach is usually faster than calculating manually because llama.cpp loads the model quickly.
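The halving steps above are a manual binary search, and the loop can be sketched in a few lines of POSIX shell. This is only a sketch under some assumptions: `find_max_ngl` and `probe_llama` are hypothetical helper names, the probe is assumed to exit non-zero when the model fails to load (for example on a CUDA out-of-memory error), and success is assumed to be monotonic (if N layers fit, so do fewer than N).

```shell
# Find the largest -ngl in [0, max] for which a probe command succeeds.
# Usage: find_max_ngl <max_layers> <probe_command...>
# The candidate -ngl value is appended as the probe's last argument.
find_max_ngl() (
    hi=$1; shift
    lo=0
    while [ "$lo" -lt "$hi" ]; do
        # Round up so the loop always makes progress.
        mid=$(( (lo + hi + 1) / 2 ))
        if "$@" "$mid" >/dev/null 2>&1; then
            lo=$mid              # mid layers fit, try more
        else
            hi=$(( mid - 1 ))    # mid layers do not fit, try fewer
        fi
    done
    echo "$lo"
)

# Hypothetical probe: load the model and generate a single token.
# Model path and flags are placeholders; adjust them for your build.
probe_llama() {
    ./llama-cli -m model-Q4_K_M.gguf -c 4096 -n 1 -p "Hello" -ngl "$1" </dev/null
}

# Example (not run here): find_max_ngl 33 probe_llama
```

For a model with around 32 layers this needs five or six probe runs instead of stepping through every value.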
The Automatic Approach: llama-fit-params
llama.cpp includes a built-in tool to find optimal parameters:
```shell
# Navigate to your llama.cpp build directory
cd llama.cpp/build/bin

# Run the fitting tool
./llama-fit-params \
    -m /path/to/model-Q4_K_M.gguf \
    --ctx-size 8192

# The tool outputs suggested parameters
```

Example output (illustrative; the exact format depends on your llama.cpp version):

```text title="llama-fit-params output"
Model size: 18.5 GB
Available VRAM: 24.0 GB
Suggested -ngl: 99 (all layers)
Suggested -b: 512
Suggested -ub: 512
Memory for KV: 2.1 GB (with q4_0 cache)
Total memory: 20.6 GB
```

This tool considers:
- Model weights
- KV cache memory
- Context length
- Available VRAM
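To make the interaction between those four factors concrete, here is a rough budget check in POSIX shell. It is a sketch of the arithmetic only, not how llama-fit-params works internally; `vram_headroom_mib` is a hypothetical helper, all inputs are in MiB, and the overhead figure (compute buffers, CUDA context) is something you have to estimate yourself.

```shell
# Print the VRAM left over after weights, KV cache and overhead (all in MiB).
# Returns non-zero when the configuration would not fit.
# Usage: vram_headroom_mib <weights> <kv_cache> <overhead> <free_vram>
vram_headroom_mib() (
    weights=$1; kv=$2; overhead=$3; free=$4
    headroom=$(( free - weights - kv - overhead ))
    echo "$headroom"
    [ "$headroom" -ge 0 ]
)

# Example: ~9000 MiB of weights, 768 MiB KV cache, 1024 MiB overhead,
# 24000 MiB free VRAM.
vram_headroom_mib 9000 768 1024 24000
```

A negative result means something has to give: fewer offloaded layers, a shorter context, a quantized KV cache, or a smaller quantization of the model.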
Manual Calculation (When You Need It)
If you want to understand the math, here’s how to calculate manually:
Step 1: Check Your VRAM
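```shell
nvidia-smi --query-gpu=memory.total,memory.free --format=csv

# Example output:
# memory.total [MiB], memory.free [MiB]
# 24564 MiB, 24000 MiB
```

Step 2: Estimate Model Memory

```shell
# Load the model once and filter the load log.
# Do not pass --log-disable here: it would hide exactly these lines.
./llama-cli -m model.gguf -p "test" -n 1 2>&1 | grep -E "n_layer|size"

# Look for:
# - n_layer = 32
# - Model size (in the output)
```

Step 3: Calculate Per-Layer Memory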
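```text
Total GPU memory = Model weights + KV cache + Compute buffers and overhead
Per-layer weight memory = Model file size / n_layer (approximately)

For Q4_K_M quantization:
- Model weights: ~0.6-0.7 bytes per parameter
- 7B model: ~5 GB
- 14B model: ~9-10 GB
- 32B model: ~20-22 GB

KV cache (FP16):
- bytes = 2 (K and V) * n_layer * n_kv_heads * head_dim * context length * 2 bytes
- It depends on the attention layout, not directly on the parameter count

Example for Qwen2.5-14B-Q4_K_M at 4K context:
- Model weights: ~9 GB
- KV cache (FP16): ~0.75 GB (48 layers, 8 KV heads, head dimension 128)
- Total: ~10 GB before compute buffers (fits in 24GB VRAM with room to spare)
```

Step 4: Account for Context Length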
Context length affects KV cache size significantly:
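```yaml
# Llama-3-8B (32 layers, 8 KV heads, head dimension 128)
Context_Length:
  2048:
    fp16: "~0.25 GB"
    q4_0: "~0.07 GB"
  4096:
    fp16: "~0.5 GB"
    q4_0: "~0.14 GB"
  8192:
    fp16: "~1 GB"
    q4_0: "~0.28 GB"
  16384:
    fp16: "~2 GB"
    q4_0: "~0.56 GB"
```

Older 7B models without grouped-query attention (for example Llama-2-7B, which has 32 KV heads) need about four times as much. Using KV cache quantization (--cache-type-k q4_0 --cache-type-v q4_0) reduces the cache by roughly 70%, since q4_0 stores about 4.5 bits per value instead of 16. In many builds a quantized V cache also requires Flash Attention to be enabled.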
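The manual calculation in Steps 1 to 4 can be sketched as two small shell helpers. Both are hypothetical names introduced for illustration. `kv_cache_mib` applies the standard KV cache size formula (K and V tensors, per layer, per KV head, per position), and `layers_that_fit` makes the simplifying assumptions that every layer has the same size and that the whole KV cache sits in VRAM, so treat its result as a starting point rather than an exact answer.

```shell
# KV cache size in MiB.
# Usage: kv_cache_mib <n_layer> <n_kv_heads> <head_dim> <n_ctx> <bits_per_value>
# bits_per_value: 16 for FP16, about 4.5 for q4_0.
kv_cache_mib() {
    awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
        'BEGIN { printf "%.0f\n", 2 * l * h * d * c * b / 8 / 1048576 }'
}

# Estimate how many layers fit on the GPU (all sizes in MiB).
# Usage: layers_that_fit <free_vram> <model_size> <n_layer> <kv_cache> <overhead>
layers_that_fit() (
    free=$1; model=$2; n_layer=$3; kv=$4; overhead=$5
    per_layer=$(( model / n_layer ))       # average weight memory per layer
    budget=$(( free - kv - overhead ))     # VRAM left for weights
    fit=0
    if [ "$budget" -gt 0 ] && [ "$per_layer" -gt 0 ]; then
        fit=$(( budget / per_layer ))
    fi
    # Never report more layers than the model has.
    if [ "$fit" -gt "$n_layer" ]; then fit=$n_layer; fi
    echo "$fit"
)
```

For example, `kv_cache_mib 32 8 128 4096 16` prints 512 for Llama-3-8B at 4K context in FP16, and `layers_that_fit 11000 20000 64 512 1024` prints 30 for a roughly 20 GB, 64-layer model on a card with about 11 GB free.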
GPU VRAM Scenarios
Here’s what works for common GPU configurations:
24GB VRAM (RTX 3090, RTX 4090)
```shell
# 7B model - Full offload, plenty of room
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 16384

# 14B model - Full offload, moderate context
./llama-cli -m 14b-Q4_K_M.gguf -ngl 99 -c 8192

# 32B model - Full offload with Q4, limited context
./llama-cli -m 32b-Q4_K_M.gguf -ngl 99 -c 4096 \
    --cache-type-k q4_0 --cache-type-v q4_0

# 70B model - Cannot fully offload (need 2+ GPUs or partial offload)
./llama-cli -m 70b-Q4_K_M.gguf -ngl 40 -c 2048
```

12GB VRAM (RTX 3060, RTX 4070)

```shell
# 7B model - Full offload
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 8192

# 14B model - Full offload with KV quantization
./llama-cli -m 14b-Q4_K_M.gguf -ngl 99 -c 4096 \
    --cache-type-k q4_0 --cache-type-v q4_0

# 32B model - Partial offload (slow)
./llama-cli -m 32b-Q4_K_M.gguf -ngl 25 -c 2048
```

8GB VRAM (GTX 1080, RTX 3070)

```shell
# 7B model - Full offload with KV quantization
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 4096 \
    --cache-type-k q4_0 --cache-type-v q4_0

# 14B model - Partial offload (hybrid CPU/GPU)
./llama-cli -m 14b-Q4_K_M.gguf -ngl 15 -c 2048

# Larger models - Mostly CPU (slow)
./llama-cli -m 32b-Q4_K_M.gguf -ngl 10 -c 1024
```

Why -ngl 99 Works
You might wonder why I recommend -ngl 99 when most models have far fewer layers.
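```text
1. llama.cpp reads your -ngl value
2. It checks the model's actual layer count
3. It offloads min(ngl, actual_layers) to GPU
4. If VRAM is insufficient, it fails with OOM error

Example:
- Model has 32 layers
- You set -ngl 99
- llama.cpp offloads 32 layers (all of them)
- Result: Maximum GPU utilization
```

The load log typically reports this as 33/33 layers, because the output layer is counted on top of the 32 repeating layers. Setting -ngl to exactly 32 would therefore leave the output layer on the CPU, which is one more reason to use a value that is comfortably too high.

This approach works because: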
- It’s simpler than looking up layer counts
- It’s future-proof for models with more layers
- llama.cpp handles the logic internally
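The capping behaviour described in this section is simple enough to model in a single function. This is a toy model in POSIX shell, not llama.cpp's actual code; `effective_gpu_layers` is a hypothetical name, and the second argument is the number of offloadable layers the load log reports (usually the repeating layers plus one output layer).

```shell
# Number of layers that end up on the GPU: min(requested, offloadable).
# Usage: effective_gpu_layers <ngl> <offloadable_layers>
effective_gpu_layers() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"    # request is below the model's count: partial offload
    else
        echo "$2"    # request is at or above the count: everything is offloaded
    fi
}

effective_gpu_layers 99 33   # full offload of a 32-layer model
effective_gpu_layers 20 33   # partial offload, the rest runs on the CPU
```

Any value at or above the model's count gives the same result, which is why 99 works as a "use everything" setting for the models listed earlier.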
Signs You Need to Adjust -ngl
Out of Memory (OOM) Error
```text
# Error message:
# CUDA error: out of memory
# current device: 0
# ggml_backend_cuda_buffer_type_alloc_buffer: allocate 4096 MB
```

This means -ngl is too high. Reduce it:

```shell
# If -ngl 99 fails, try:
./llama-cli -m model.gguf -ngl 50

# If it still fails:
./llama-cli -m model.gguf -ngl 25
```

Slow Inference Speed
If your tokens/sec is very low (under 10 t/s for a modern GPU):
```shell
nvidia-smi -l 1

# If GPU memory usage is low (< 50%), you're not offloading enough
# Increase -ngl
```

Inconsistent Speed
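If speed varies dramatically between prompts, try enabling Flash Attention:

```shell
# Older builds take the bare flag; recent builds expect a value, e.g. --flash-attn on
./llama-cli -m model.gguf -ngl 99 --flash-attn
```

Common Mistakes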
Mistake 1: Using a Fixed Low Value
```shell
# WRONG: Arbitrarily choosing -ngl 20
./llama-cli -m 7b-model.gguf -ngl 20

# This leaves GPU resources unused!
# The 7B model might have 32 layers, so 12 are running on CPU

# RIGHT: Use -ngl 99 and let llama.cpp decide
./llama-cli -m 7b-model.gguf -ngl 99
```

Mistake 2: Not Accounting for Context

```shell
# This might fail with OOM at 16K context:
./llama-cli -m 14b-model.gguf -ngl 99 -c 16384

# Use KV quantization to save memory:
./llama-cli -m 14b-model.gguf -ngl 99 -c 16384 \
    --cache-type-k q4_0 --cache-type-v q4_0
```

Mistake 3: Forgetting to Verify GPU Usage

```shell
# Check during inference:
watch -n 1 nvidia-smi

# Look for:
# - GPU memory usage (should be high, e.g., 18GB/24GB)
# - GPU utilization (should spike during generation)
```

Practical Workflow
Here’s my recommended workflow for any new model:
```shell
# Step 1: Check VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader

# Step 2: Try maximum offload with safe context
./llama-cli -m model.gguf -ngl 99 -c 4096 -p "test"

# Step 3: If success, try larger context
./llama-cli -m model.gguf -ngl 99 -c 8192 -p "test"

# Step 4: If OOM, use llama-fit-params
./llama-fit-params -m model.gguf --ctx-size 4096

# Step 5: Run with optimal settings
./llama-cli -m model.gguf -ngl <suggested> -c <suggested>
```

Summary
The optimal -ngl value strategy is simple:
| Scenario | Recommended -ngl | Notes |
|---|---|---|
| Default approach | -ngl 99 | Works for most cases |
| OOM errors | Reduce until stable | Binary search: 50, 25, 12 |
| Automatic optimization | llama-fit-params | Calculates exact values |
| Mixed CPU/GPU | Start at 50% | Adjust based on performance |
The key takeaway: more layers on GPU = faster inference. Set -ngl 99 and only reduce if you run into memory issues. For precise optimization, use the built-in llama-fit-params tool.
Related Knowledge
- KV Cache Quantization: Use --cache-type-k q4_0 and --cache-type-v q4_0 to reduce KV cache memory by roughly 70%, leaving more room for model weights.
- Flash Attention: Combine -ngl 99 with --flash-attn for additional speedup, especially with longer contexts.
- Multi-GPU Support: For models that don’t fit on one GPU, llama.cpp splits layers across the available GPUs by default (-sm layer); -sm row splits individual tensors by row instead.
- CPU-Only Fallback: Layers that are not offloaded run on the CPU, so lowering -ngl gives you a hybrid CPU/GPU setup when GPU memory is insufficient.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.