How Do I Set the Optimal -ngl Value for GPU Offloading in llama.cpp?

I spent an hour tweaking the -ngl value, trying to find the magic number for my GPU. I set it to 10, then 20, then 30, hoping each would be “optimal.” Meanwhile, my model was crawling at 5 tokens per second.

Turns out, I was overcomplicating it. The answer is simpler than I thought: set -ngl 99 to offload every layer, and only back off if you run out of VRAM. But understanding why this works, and when you need a different approach, made all the difference.

The Quick Answer

The optimal -ngl value depends on your GPU’s VRAM capacity and model size. Here’s the simple approach:

Just use -ngl 99
./llama-cli -m model.gguf -ngl 99 -p "Your prompt"

llama.cpp caps the value at the model’s actual layer count, so -ngl 99 offloads every layer. It doesn’t shrink the value to fit your VRAM: if loading fails with an out-of-memory error, reduce the value or use a smaller quantization.

For automatic optimization, use the llama-fit-params tool:

Automatic parameter fitting
./llama-fit-params -m model.gguf --ctx-size 8192

This tool calculates the optimal -ngl value for your specific hardware and model combination.

Understanding the -ngl Flag

The -ngl flag (short for --n-gpu-layers) controls how many transformer layers are offloaded to your GPU. This is crucial because:

Layer execution comparison
CPU execution: ~3-5 tokens/sec
GPU execution: ~40-100+ tokens/sec
Difference: 10-30x speedup

The Reddit community consensus is clear: “Your best bets are more layers on GPU.” Offloading more layers to your GPU dramatically improves inference speed.

How Transformer Layers Work

A typical LLM has multiple transformer layers (also called “blocks”). For example:

Model layer counts
Model        | Layers | Parameters
-------------|--------|-----------
Qwen2.5-7B   | 28     | 7.6B
Llama-3-8B   | 32     | 8B
Qwen2.5-14B  | 48     | 14.7B
Qwen2.5-32B  | 64     | 32.5B
Llama-3-70B  | 80     | 70B

When you set -ngl 99, llama.cpp offloads all available layers to GPU. If a model has 32 layers and you set -ngl 99, all 32 layers run on GPU.

The Simple Approach: Start High, Reduce If Needed

My recommended workflow:

Step 1: Try maximum offload
./llama-cli -m model-Q4_K_M.gguf -ngl 99 -c 4096 -p "Hello"
# If it works, you're done!
# If you get CUDA out of memory error, continue to step 2
Step 2: Reduce by half and try again
./llama-cli -m model-Q4_K_M.gguf -ngl 50 -c 4096 -p "Hello"
# Still OOM? Keep reducing
Step 3: Find the sweet spot
# Binary search: if 25 loads, try halfway back up (37); if it still OOMs, halve again (12)
./llama-cli -m model-Q4_K_M.gguf -ngl 25 -c 4096 -p "Hello"

This trial-and-error approach is faster than calculating manually because llama.cpp loads the model quickly.
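
If you'd rather not halve by hand, the same search can be scripted. This is only a minimal sketch: it assumes llama-cli exits with a non-zero status when the model fails to load, and that it exits on its own after one token when stdin is closed. The MODEL variable and both function names are hypothetical.

```shell
# Hypothetical helper: returns 0 if the model loads with $1 layers on the GPU.
# Assumes llama-cli exits non-zero on an out-of-memory error.
try_load() {
  ./llama-cli -m "$MODEL" -ngl "$1" -c 4096 -n 1 -p "Hello" \
    </dev/null >/dev/null 2>&1
}

# Binary search for the largest -ngl in [0, $1] that still loads.
# Prints 0 if not even a single layer fits.
find_max_ngl() {
  lo=0
  hi=$1
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always makes progress
    if try_load "$mid"; then
      lo=$mid                      # mid fits: search the upper half
    else
      hi=$(( mid - 1 ))            # mid fails: search the lower half
    fi
  done
  echo "$lo"
}

# Usage:
#   MODEL=model-Q4_K_M.gguf
#   find_max_ngl 99
```

Starting from 99 costs about seven load attempts at most. Passing the model's real layer count as the upper bound instead trims one or two of them.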

The Automatic Approach: llama-fit-params

llama.cpp includes a built-in tool to find optimal parameters:

Using llama-fit-params
# Navigate to your llama.cpp build directory
cd llama.cpp/build/bin
# Run the fitting tool
./llama-fit-params \
-m /path/to/model-Q4_K_M.gguf \
--ctx-size 8192
# The tool outputs suggested parameters

Example output:

llama-fit-params output
Model size: 18.5 GB
Available VRAM: 24.0 GB
Suggested -ngl: 99 (all layers)
Suggested -b: 512
Suggested -ub: 512
Memory for KV: 2.1 GB (with q4_0 cache)
Total memory: 20.6 GB

This tool considers:

  • Model weights
  • KV cache memory
  • Context length
  • Available VRAM

Manual Calculation (When You Need It)

If you want to understand the math, here’s how to calculate manually:

Step 1: Check Your VRAM

Check GPU memory
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
# Example output:
# memory.total [MiB], memory.free [MiB]
# 24564 MiB, 24000 MiB
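
For scripting, it helps to get the free memory as a bare number. A small sketch, assuming nvidia-smi's csv,noheader,nounits format prints one MiB value per GPU; total_free_mib is a hypothetical helper name.

```shell
# Hypothetical helper: reads one free-memory value (MiB) per line on stdin
# and prints the total across all GPUs.
total_free_mib() {
  awk '{ sum += $1 } END { print sum + 0 }'
}

# Usage (requires an NVIDIA driver):
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | total_free_mib
printf '24000\n11000\n' | total_free_mib   # prints 35000
```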

Step 2: Estimate Model Memory

Check model information
./llama-cli -m model.gguf -n 1 -p "test" 2>&1 | grep -E "n_layer|size"
# Leave logging on: --log-disable would hide the load-time lines you need
# Look for:
# - n_layer = 32
# - Model size (in the output)

Step 3: Calculate Per-Layer Memory

Memory estimation formula
Total model memory = Model file size + KV cache + Activation overhead
For Q4_K_M quantization:
- Model weights: ~0.6 bytes per parameter (about 4.8 bits per weight)
- 7B model: ~4.7 GB
- 14B model: ~9 GB
- 32B model: ~20 GB
KV cache (FP16):
- 2 (K and V) x n_layer x n_kv_heads x head_dim x context length x 2 bytes
Example for Qwen2.5-14B-Q4_K_M at 4K context (48 layers, 8 KV heads, head dim 128):
- Model weights: ~9 GB
- KV cache (FP16): 192 KiB per token x 4096 tokens = ~0.75 GB
- Total: ~10 GB plus activation overhead (fits in 24GB VRAM with room to spare)
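
To turn those totals into a per-layer figure and a starting -ngl, you can divide evenly across layers. This is a rough sketch, not how llama.cpp itself allocates memory: it assumes every layer costs the same, that the KV cache for an offloaded layer lives on the GPU, and that a fixed reserve covers compute buffers. The function name and the example numbers are hypothetical.

```shell
# Hypothetical helper: estimates how many layers fit in free VRAM.
# Args: FREE_MIB MODEL_MIB KV_MIB N_LAYER RESERVE_MIB
# Assumes every layer costs the same and RESERVE_MIB covers compute buffers.
max_layers_that_fit() {
  awk -v free="$1" -v model="$2" -v kv="$3" -v n="$4" -v reserve="$5" 'BEGIN {
    per_layer = (model + kv) / n          # weights + KV cache per layer, in MiB
    fit = int((free - reserve) / per_layer)
    if (fit > n) fit = n                  # cannot offload more layers than exist
    if (fit < 0) fit = 0
    print fit
  }'
}

# Illustrative numbers: 20000 MiB model, 64 layers, 512 MiB KV cache,
# 11000 MiB free VRAM, 1024 MiB held back for compute buffers.
max_layers_that_fit 11000 20000 512 64 1024   # prints 31
```

Treat the result as a first guess for -ngl, then confirm it with a real load.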

Step 4: Account for Context Length

Context length affects KV cache size significantly:

KV cache memory by context length
# Llama-3-8B (32 layers, 8 KV heads, head dim 128)
# Models without grouped-query attention, such as Llama-2-7B, need 4x as much
Context_Length:
  2048:
    fp16: "~0.25 GB"
    q4_0: "~0.07 GB"
  4096:
    fp16: "~0.5 GB"
    q4_0: "~0.14 GB"
  8192:
    fp16: "~1.0 GB"
    q4_0: "~0.28 GB"
  16384:
    fp16: "~2.0 GB"
    q4_0: "~0.56 GB"

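Using KV cache quantization (--cache-type-k q4_0 --cache-type-v q4_0) reduces this by roughly 72%, because q4_0 stores 4.5 bits per value instead of 16. On builds where Flash Attention is not on by default, quantizing the V cache also requires enabling it with --flash-attn.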
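
You can also work out the KV cache size directly from the model architecture. A minimal sketch, assuming the usual layout of one K and one V vector of n_kv_heads x head_dim values per layer per token. The function name is hypothetical, and the architecture numbers come from the model card or llama.cpp's load-time log.

```shell
# Hypothetical helper: KV cache size in MiB.
# Args: N_LAYER N_KV_HEADS HEAD_DIM N_CTX BITS_PER_VALUE
# BITS_PER_VALUE is 16 for an FP16 cache and 4.5 for q4_0.
kv_cache_mib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.0f\n", 2 * l * h * d * c * b / 8 / 1048576 }'
}

# Llama-3-8B (32 layers, 8 KV heads, head dim 128) at 8K context:
kv_cache_mib 32 8 128 8192 16    # FP16 cache, prints 1024
kv_cache_mib 32 8 128 8192 4.5   # q4_0 cache, prints 288
```

Models with grouped-query attention have far fewer KV heads than attention heads, so their cache is often smaller than rules of thumb based on parameter count suggest.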

GPU VRAM Scenarios

Here’s what works for common GPU configurations:

24GB VRAM (RTX 3090, RTX 4090)

24GB GPU configurations
# 7B model - Full offload, plenty of room
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 16384
# 14B model - Full offload, moderate context
./llama-cli -m 14b-Q4_K_M.gguf -ngl 99 -c 8192
# 32B model - Full offload with Q4, limited context
./llama-cli -m 32b-Q4_K_M.gguf -ngl 99 -c 4096 \
--cache-type-k q4_0 --cache-type-v q4_0
# 70B model - Cannot fully offload (need 2+ GPUs or partial offload)
./llama-cli -m 70b-Q4_K_M.gguf -ngl 40 -c 2048

12GB VRAM (RTX 3060, RTX 4070)

12GB GPU configurations
# 7B model - Full offload
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 8192
# 14B model - Full offload with KV quantization
./llama-cli -m 14b-Q4_K_M.gguf -ngl 99 -c 4096 \
--cache-type-k q4_0 --cache-type-v q4_0
# 32B model - Partial offload (slow)
./llama-cli -m 32b-Q4_K_M.gguf -ngl 25 -c 2048

8GB VRAM (GTX 1080, RTX 3070)

8GB GPU configurations
# 7B model - Full offload with KV quantization
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 4096 \
--cache-type-k q4_0 --cache-type-v q4_0
# 14B model - Partial offload (hybrid CPU/GPU)
./llama-cli -m 14b-Q4_K_M.gguf -ngl 15 -c 2048
# Larger models - Mostly CPU (slow)
./llama-cli -m 32b-Q4_K_M.gguf -ngl 10 -c 1024

Why -ngl 99 Works

You might wonder why I recommend -ngl 99 when most models have far fewer layers.

How -ngl 99 works
1. llama.cpp reads your -ngl value
2. It checks the model's actual layer count
3. It offloads min(ngl, actual_layers) to GPU
4. If VRAM is insufficient, it fails with OOM error
Example:
- Model has 32 layers
- You set -ngl 99
- llama.cpp offloads 32 layers (all of them)
- Result: Maximum GPU utilization

This approach works because:

  1. It’s simpler than looking up layer counts
  2. It’s future-proof for models with more layers
  3. llama.cpp handles the logic internally
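
The clamping rule from step 3 is easy to express as a shell function. It only illustrates the min(ngl, actual_layers) behavior described above and is not llama.cpp's actual code; effective_ngl is a hypothetical name.

```shell
# Hypothetical helper mirroring the min(ngl, actual_layers) rule.
# Args: REQUESTED_NGL ACTUAL_LAYERS
effective_ngl() {
  if [ "$1" -lt "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

effective_ngl 99 32   # prints 32: every layer is offloaded
effective_ngl 20 32   # prints 20: the other 12 layers stay on the CPU
```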

Signs You Need to Adjust -ngl

Out of Memory (OOM) Error

OOM error example
# Error message:
# CUDA error: out of memory
# current device: 0
# ggml_backend_cuda_buffer_type_alloc_buffer: allocate 4096 MB

This means -ngl is too high. Reduce it:

Reduce and retry
# If -ngl 99 fails, try:
./llama-cli -m model.gguf -ngl 50
# If still fails:
./llama-cli -m model.gguf -ngl 25

Slow Inference Speed

If your tokens/sec is very low (under 10 t/s for a modern GPU):

Check GPU utilization
nvidia-smi -l 1
# If GPU memory usage is low (< 50% of VRAM), you're not offloading enough
# Increase -ngl

Inconsistent Speed

If speed varies dramatically between prompts:

Enable Flash Attention for consistency
# Recent builds take a value (on|off|auto); older builds use a bare --flash-attn
./llama-cli -m model.gguf -ngl 99 --flash-attn on

Common Mistakes

Mistake 1: Using a Fixed Low Value

Suboptimal configuration
# WRONG: Arbitrarily choosing -ngl 20
./llama-cli -m 7b-model.gguf -ngl 20
# This leaves GPU resources unused!
# The 7B model might have 32 layers, so 12 are running on CPU
Correct approach
# RIGHT: Use -ngl 99 to offload every layer (llama.cpp caps it at the model's layer count)
./llama-cli -m 7b-model.gguf -ngl 99

Mistake 2: Not Accounting for Context

Context memory oversight
# This might fail with OOM at 16K context:
./llama-cli -m 14b-model.gguf -ngl 99 -c 16384
# Use KV quantization to save memory:
./llama-cli -m 14b-model.gguf -ngl 99 -c 16384 \
--cache-type-k q4_0 --cache-type-v q4_0

Mistake 3: Forgetting to Verify GPU Usage

Always verify GPU offloading
# Check during inference:
watch -n 1 nvidia-smi
# Look for:
# - GPU memory usage (should be high, e.g., 18GB/24GB)
# - GPU utilization (should spike during generation)
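
Besides nvidia-smi, the load log itself reports how many layers ended up on the GPU. A small sketch, assuming the log contains a line of the form "offloaded N/M layers to GPU". The prefix before it varies between versions, so treat the pattern as an assumption. offload_status is a hypothetical helper name.

```shell
# Hypothetical helper: reads llama.cpp log output on stdin and prints
# "<offloaded> <total>" from the first matching line.
offload_status() {
  sed -n 's/.*offloaded \([0-9][0-9]*\)\/\([0-9][0-9]*\) layers to GPU.*/\1 \2/p' | head -n 1
}

# Usage:
#   ./llama-cli -m model.gguf -ngl 99 -n 1 -p "test" 2>&1 | offload_status
echo "load_tensors: offloaded 20/33 layers to GPU" | offload_status   # prints: 20 33
```

If the two numbers differ, some layers are still running on the CPU. The total can be one higher than the model's n_layer, because the output layer may be counted as well.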

Practical Workflow

Here’s my recommended workflow for any new model:

Complete workflow script
# Step 1: Check VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader
# Step 2: Try maximum offload with safe context
./llama-cli -m model.gguf -ngl 99 -c 4096 -p "test"
# Step 3: If success, try larger context
./llama-cli -m model.gguf -ngl 99 -c 8192 -p "test"
# Step 4: If OOM, use llama-fit-params
./llama-fit-params -m model.gguf --ctx-size 4096
# Step 5: Run with optimal settings
./llama-cli -m model.gguf -ngl <suggested> -c <suggested>

Summary

The strategy for choosing -ngl is simple:

Scenario               | Recommended -ngl    | Notes
-----------------------|---------------------|----------------------------
Default approach       | -ngl 99             | Works for most cases
OOM errors             | Reduce until stable | Binary search: 50, 25, 12
Automatic optimization | llama-fit-params    | Calculates exact values
Mixed CPU/GPU          | Start at 50%        | Adjust based on performance

The key takeaway: more layers on GPU = faster inference. Set -ngl 99 and only reduce if you run into memory issues. For precise optimization, use the built-in llama-fit-params tool.


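  • KV Cache Quantization: Use --cache-type-k q4_0 and --cache-type-v q4_0 to reduce KV cache memory by roughly 72%, leaving more VRAM for model weights.
  • Flash Attention: Combine -ngl 99 with --flash-attn for additional speedup, especially with longer contexts.
  • Multi-GPU Support: For models that don’t fit on one GPU, llama.cpp splits layers across all visible GPUs by default (-sm layer). Use -sm row to split tensors by row instead, and -ts to set the proportion per GPU.
  • CPU/GPU Split: Layers beyond your -ngl value run on the CPU. With an explicit -ngl there is no automatic fallback: if the requested layers don’t fit in VRAM, loading fails with an OOM error and you have to lower the value yourself.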
