How Do I Set the Optimal -ngl Value for GPU Offloading in llama.cpp?

Mar 15, 2026

I spent an hour tweaking the -ngl value, trying to find the magic number for my GPU. I set it to 10, then 20, then 30, hoping each would be “optimal.” Meanwhile, my model was crawling at 5 tokens per second.

Turns out, I was overcomplicating it. The answer is simpler than I thought: just set -ngl 99 and let llama.cpp handle the rest. But understanding why this works - and when you need a different approach - made all the difference.

The Quick Answer

The optimal -ngl value depends on your GPU’s VRAM capacity and model size. Here’s the simple approach:

./llama-cli -m model.gguf -ngl 99 -p "Your prompt"

-ngl 99 asks llama.cpp to offload every layer; a value above the model's layer count is capped at that count. It does not shrink the value to fit your VRAM, so if you run out of memory, reduce -ngl, shorten the context, or use a smaller quantization.

For automatic optimization, use the llama-fit-params tool:

./llama-fit-params -m model.gguf --ctx-size 8192

This tool calculates the optimal -ngl value for your specific hardware and model combination.

Understanding the -ngl Flag

The -ngl flag (short for --n-gpu-layers) controls how many transformer layers are offloaded to your GPU. This is crucial because:

Layer execution comparison:
CPU execution: ~3-5 tokens/sec
GPU execution: ~40-100+ tokens/sec

Difference: 10-30x speedup

The Reddit community consensus is clear: “Your best bets are more layers on GPU.” Offloading more layers to your GPU dramatically improves inference speed.

How Transformer Layers Work

A typical LLM has multiple transformer layers (also called “blocks”). For example:

Model layer counts:
Model          | Layers | Parameters
---------------|--------|------------
Qwen2.5-7B     | 28     | 7.6B
Llama-3-8B     | 32     | 8B
Qwen2.5-14B    | 48     | 14.7B
Qwen2.5-32B    | 64     | 32.5B
Llama-3-70B    | 80     | 70B

When you set -ngl 99, llama.cpp offloads all available layers to GPU. If a model has 32 layers and you set -ngl 99, all 32 layers run on GPU.

The Simple Approach: Start High, Reduce If Needed

My recommended workflow:

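# Step 1: Request full offload
./llama-cli -m model-Q4_K_M.gguf -ngl 99 -c 4096 -p "Hello"

# If it works, you're done!
# If you get a CUDA out-of-memory error, continue to step 2

# Step 2: Lower the value. Anything above the model's layer count behaves
# like 99, so 50 is only a real reduction for models with more than 50 layers
./llama-cli -m model-Q4_K_M.gguf -ngl 50 -c 4096 -p "Hello"

# Step 3: Still OOM? Keep halving, then bisect between the last failing
# value and the first working one
./llama-cli -m model-Q4_K_M.gguf -ngl 25 -c 4096 -p "Hello"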

This trial-and-error approach is usually faster than calculating by hand, because each attempt costs only one model load.
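
The halving steps can also be scripted. What follows is a minimal sketch, not something that ships with llama.cpp: `find_max_ngl` and `try_ngl` are hypothetical helper names, and `MODEL` is an assumed environment variable. It assumes that `llama-cli` exits with a non-zero status when loading fails, that the command runs non-interactively on your build, and that a smaller -ngl never needs more VRAM than a larger one.

```shell
# Hypothetical helper: binary-search the largest -ngl for which a probe succeeds.
# $1 = upper bound (the model's layer count), $2 = probe command taking one argument.
# Assumes -ngl 0 (CPU only) always works and that success is monotonic in -ngl.
find_max_ngl() {
  local lo=0 hi=$1 probe=$2 mid
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always makes progress
    if "$probe" "$mid"; then
      lo=$mid                      # mid fits: search the upper half
    else
      hi=$(( mid - 1 ))            # mid fails: search the lower half
    fi
  done
  echo "$lo"
}

# Hypothetical probe: succeeds when llama-cli loads the model and generates one token.
# MODEL is an assumed environment variable holding the GGUF path.
try_ngl() {
  ./llama-cli -m "$MODEL" -ngl "$1" -c 4096 -p "Hello" -n 1 >/dev/null 2>&1
}

# Usage (not run here): find_max_ngl 32 try_ngl
```

For a 32-layer model this needs at most six probes, compared with up to 32 when stepping down one layer at a time.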

The Automatic Approach: llama-fit-params

llama.cpp includes a built-in tool to find optimal parameters:

# Navigate to your llama.cpp build directory
cd llama.cpp/build/bin

# Run the fitting tool
./llama-fit-params \
  -m /path/to/model-Q4_K_M.gguf \
  --ctx-size 8192

# The tool outputs suggested parameters

Example output (illustrative; the exact format depends on your llama.cpp version):
Model size:     18.5 GB
Available VRAM:  24.0 GB
Suggested -ngl:  99 (all layers)
Suggested -b:    512
Suggested -ub:    512
Memory for KV:   2.1 GB (with q4_0 cache)
Total memory:    20.6 GB

This tool considers:

Model weights
KV cache memory
Context length
Available VRAM

Manual Calculation (When You Need It)

If you want to understand the math, here’s how to calculate manually:

Step 1: Check Your VRAM

nvidia-smi --query-gpu=memory.total,memory.free --format=csv

# Example output:
# memory.total [MiB], memory.free [MiB]
# 24564 MiB, 24000 MiB
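
To reuse the free-VRAM figure in later arithmetic, it helps to have it as a bare number. A small sketch: `free_mib_from_csv` is a hypothetical helper that parses a line in the format shown above. On a real machine, adding `nounits` to `--format` should print the number directly.

```shell
# Hypothetical helper: extract free MiB from a "total MiB, free MiB" CSV line on stdin.
free_mib_from_csv() {
  awk -F', *' '{ sub(/ MiB$/, "", $2); print $2 }'
}

# Example with the sample line shown above
echo "24564 MiB, 24000 MiB" | free_mib_from_csv

# On a machine with an NVIDIA GPU, nounits drops the unit suffix:
# nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
```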

Step 2: Estimate Model Memory

# Leave logging enabled: --log-disable would hide the load log
./llama-cli -m model.gguf -p "Hi" -n 1 2>&1 | grep -E "n_layer|size"

# Look for:
# - n_layer = 32
# - Model size (in the output)

Step 3: Calculate Per-Layer Memory

Total model memory = Model file size + KV cache + Activation overhead

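For Q4_K_M quantization:
- Model weights: ~0.6 bytes per parameter
- 7B model: ~5 GB
- 14B model: ~9 GB
- 32B model: ~20 GB

KV cache (FP16):
- 2 (K and V) x layers x KV heads x head dim x context length x 2 bytes
- Depends on the architecture, not just the parameter count

Example for Qwen2.5-14B-Q4_K_M at 4K context (48 layers, 8 KV heads, head dim 128):
- Model weights: ~9 GB
- Per-layer weights: ~9 GB / 48 layers = ~0.19 GB
- KV cache (FP16): ~0.75 GB
- Total: ~10 GB plus compute buffers (fits in 24GB VRAM with room to spare)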
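
Step 3 can be turned into a rough per-layer estimate for partial offload. The sketch below assumes the weights are spread evenly across layers (only approximately true, since the embeddings and output layer are not transformer blocks) and that you reserve a fixed amount for the KV cache and compute buffers. `layers_that_fit` is a hypothetical helper, and all sizes are in MiB.

```shell
# Hypothetical helper: estimate how many layers fit in free VRAM.
# $1 = free VRAM (MiB), $2 = model file size (MiB), $3 = layer count,
# $4 = MiB reserved for KV cache and compute buffers (an assumed figure).
layers_that_fit() {
  local free=$1 model=$2 layers=$3 reserve=$4
  local per_layer=$(( model / layers ))   # average MiB per layer
  local budget=$(( free - reserve ))      # MiB left for weights
  if [ "$budget" -le 0 ] || [ "$per_layer" -le 0 ]; then
    echo 0
    return
  fi
  local fit=$(( budget / per_layer ))
  if [ "$fit" -gt "$layers" ]; then
    fit=$layers                           # cannot exceed the model's layer count
  fi
  echo "$fit"
}

# ~20 GB model with 64 layers, 12 GB free, 2 GB reserved: prints 32
layers_that_fit 12288 20480 64 2048
```

Treat the result as a starting point for -ngl and confirm it with an actual run, since real per-layer sizes and buffer needs vary.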

Step 4: Account for Context Length

Context length affects KV cache size significantly:

# Llama-3-8B-style model: 32 layers, 8 KV heads, head dim 128
# Older models without grouped-query attention (e.g. Llama-2-7B) need about 4x these figures
Context_Length:
  2048:
    fp16: "~0.25 GB"
    q4_0: "~0.07 GB"
  4096:
    fp16: "~0.5 GB"
    q4_0: "~0.14 GB"
  8192:
    fp16: "~1.0 GB"
    q4_0: "~0.28 GB"
  16384:
    fp16: "~2.0 GB"
    q4_0: "~0.56 GB"

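Using KV cache quantization (--cache-type-k q4_0 --cache-type-v q4_0) reduces this by roughly 70% compared with FP16. Quantizing the V cache requires flash attention, so enable it if your build does not turn it on by default.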
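
KV cache figures can be derived from the model's architecture, and for models with grouped-query attention the result is much smaller than a per-parameter rule of thumb suggests. As a sketch under stated assumptions: each token stores one key and one value vector per layer, each of size KV heads x head dim; FP16 uses 2 bytes per element, and q4_0 packs 32 elements into an 18-byte block. `kv_cache_mib` is a hypothetical helper, and the example values (32 layers, 8 KV heads, head dim 128) correspond to a Llama-3-8B-style model.

```shell
# Hypothetical helper: KV cache size in MiB.
# $1 = layers, $2 = KV heads, $3 = head dim, $4 = context length, $5 = cache type (f16 or q4_0)
kv_cache_mib() {
  local elements=$(( 2 * $1 * $2 * $3 * $4 ))   # 2 = one K and one V vector per token per layer
  local bytes
  case "$5" in
    f16)  bytes=$(( elements * 2 )) ;;          # 2 bytes per element
    q4_0) bytes=$(( elements * 18 / 32 )) ;;    # 18 bytes per block of 32 elements
    *)    echo "unknown cache type: $5" >&2; return 1 ;;
  esac
  echo $(( bytes / 1048576 ))
}

# Print both cache types for several context lengths
for ctx in 2048 4096 8192 16384; do
  echo "$ctx: f16=$(kv_cache_mib 32 8 128 "$ctx" f16) MiB, q4_0=$(kv_cache_mib 32 8 128 "$ctx" q4_0) MiB"
done
```

Read the layer count, KV head count, and head dimension from the model's load log or model card before trusting the numbers for a different model.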

GPU VRAM Scenarios

Here’s what works for common GPU configurations:

24GB VRAM (RTX 3090, RTX 4090)

# 7B model - Full offload, plenty of room
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 16384

# 14B model - Full offload, moderate context
./llama-cli -m 14b-Q4_K_M.gguf -ngl 99 -c 8192

# 32B model - Full offload with Q4, limited context
./llama-cli -m 32b-Q4_K_M.gguf -ngl 99 -c 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0

# 70B model - Cannot fully offload (need 2+ GPUs or partial offload)
./llama-cli -m 70b-Q4_K_M.gguf -ngl 40 -c 2048

12GB VRAM (RTX 3060, RTX 4070)

# 7B model - Full offload
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 8192

# 14B model - Full offload with KV quantization
./llama-cli -m 14b-Q4_K_M.gguf -ngl 99 -c 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0

# 32B model - Partial offload (slow)
./llama-cli -m 32b-Q4_K_M.gguf -ngl 25 -c 2048

8GB VRAM (GTX 1080, RTX 3070)

# 7B model - Full offload with KV quantization
./llama-cli -m 7b-Q4_K_M.gguf -ngl 99 -c 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0

# 14B model - Partial offload (hybrid CPU/GPU)
./llama-cli -m 14b-Q4_K_M.gguf -ngl 15 -c 2048

# Larger models - Mostly CPU (slow)
./llama-cli -m 32b-Q4_K_M.gguf -ngl 10 -c 1024

Why -ngl 99 Works

You might wonder why I recommend -ngl 99 when most models have far fewer layers.

1. llama.cpp reads your -ngl value
2. It checks the model's actual layer count
3. It offloads min(ngl, actual_layers) to GPU
4. If VRAM is insufficient, it fails with OOM error

Example:
- Model has 32 layers
- You set -ngl 99
- llama.cpp offloads 32 layers (all of them)
- Result: Maximum GPU utilization

This approach works because:

It’s simpler than looking up layer counts
It’s future-proof for models with more layers
llama.cpp handles the logic internally
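
The clamp in step 3 is a one-line minimum. As an illustration only (the real logic lives inside llama.cpp, and `effective_gpu_layers` is a hypothetical name):

```shell
# Hypothetical helper mirroring min(ngl, actual_layers).
# $1 = requested -ngl, $2 = the model's layer count
effective_gpu_layers() {
  if [ "$1" -lt "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

effective_gpu_layers 99 32   # prints 32: everything is offloaded
effective_gpu_layers 20 32   # prints 20: the other 12 layers stay on the CPU
```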

Signs You Need to Adjust -ngl

Out of Memory (OOM) Error

# Error message:
# CUDA error: out of memory
# current device: 0
# ggml_backend_cuda_buffer_type_alloc_buffer: allocate 4096 MB

This means -ngl is too high. Reduce it:

# If -ngl 99 fails, try:
./llama-cli -m model.gguf -ngl 50

# If still fails:
./llama-cli -m model.gguf -ngl 25
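
When scripting the retry, one option is to look for the error text in the captured log. A hedged sketch: `is_oom_log` is a hypothetical helper that matches the "out of memory" message shown above; other backends or versions may word the error differently, so adjust the pattern to what your build prints.

```shell
# Hypothetical helper: exit 0 if the log on stdin contains an out-of-memory message.
is_oom_log() {
  grep -qi "out of memory"
}

# Example with the error text shown above
if printf 'CUDA error: out of memory\ncurrent device: 0\n' | is_oom_log; then
  echo "OOM detected: lower -ngl or the context size"
fi
```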

Slow Inference Speed

If your tokens/sec is very low (under 10 t/s for a modern GPU):

nvidia-smi -l 1

# If GPU memory is low (< 50%), you're not offloading enough
# Increase -ngl
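
The 50% check can be scripted rather than read off the screen. A minimal sketch: `vram_used_pct` is a hypothetical helper that expects a "used, total" line in MiB without units, and the sample values below are assumed for illustration.

```shell
# Hypothetical helper: percent of VRAM in use, from a "used, total" CSV line on stdin.
vram_used_pct() {
  awk -F', *' '{ printf "%d\n", ($1 * 100) / $2 }'
}

# Example with assumed sample values: 6000 MiB used of 24564 MiB
echo "6000, 24564" | vram_used_pct

# On a machine with an NVIDIA GPU:
# nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_used_pct
```

A low percentage is only a hint: a small model that is fully offloaded also leaves most of the VRAM free.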

Inconsistent Speed

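If speed varies dramatically between prompts (typically slower on long ones), try flash attention. Depending on your build, the flag may take an explicit value (on, off, or auto) or be passed bare as --flash-attn:

./llama-cli -m model.gguf -ngl 99 --flash-attn on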

Common Mistakes

Mistake 1: Using a Fixed Low Value

# WRONG: Arbitrarily choosing -ngl 20
./llama-cli -m 7b-model.gguf -ngl 20

# This leaves GPU resources unused!
# The 7B model might have 32 layers, so 12 are running on CPU

# RIGHT: Use -ngl 99 and let llama.cpp decide
./llama-cli -m 7b-model.gguf -ngl 99

Mistake 2: Not Accounting for Context

# This might fail with OOM at 16K context:
./llama-cli -m 14b-model.gguf -ngl 99 -c 16384

# Use KV quantization to save memory:
./llama-cli -m 14b-model.gguf -ngl 99 -c 16384 \
  --cache-type-k q4_0 --cache-type-v q4_0

Mistake 3: Forgetting to Verify GPU Usage

# Check during inference:
watch -n 1 nvidia-smi

# Look for:
# - GPU memory usage (should be high, e.g., 18GB/24GB)
# - GPU utilization (should spike during generation)

Practical Workflow

Here’s my recommended workflow for any new model:

# Step 1: Check VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader

# Step 2: Try maximum offload with safe context
./llama-cli -m model.gguf -ngl 99 -c 4096 -p "test"

# Step 3: If success, try larger context
./llama-cli -m model.gguf -ngl 99 -c 8192 -p "test"

# Step 4: If OOM, use llama-fit-params
./llama-fit-params -m model.gguf --ctx-size 4096

# Step 5: Run with optimal settings
./llama-cli -m model.gguf -ngl <suggested> -c <suggested>

Summary

The optimal -ngl value strategy is simple:

Scenario               | Recommended -ngl    | Notes
-----------------------|---------------------|-----------------------------
Default approach       | -ngl 99             | Works for most cases
OOM errors             | Reduce until stable | Binary search: 50, 25, 12
Automatic optimization | llama-fit-params    | Calculates exact values
Mixed CPU/GPU          | Start at 50%        | Adjust based on performance

The key takeaway: more layers on GPU = faster inference. Set -ngl 99 and only reduce if you run into memory issues. For precise optimization, use the built-in llama-fit-params tool.

KV Cache Quantization: Use --cache-type-k q4_0 and --cache-type-v q4_0 to shrink the KV cache by roughly 70% compared with FP16, leaving more room for model weights.
Flash Attention: Combine -ngl 99 with --flash-attn for additional speedup, especially with longer contexts.
Multi-GPU Support: With several GPUs, llama.cpp splits the model by layer by default. Use --tensor-split to set the proportions, or -sm row to split by rows instead.
CPU Layers: Layers beyond the -ngl value run on the CPU. An explicit -ngl that exceeds the available VRAM produces an out-of-memory error rather than an automatic fallback, so lower the value yourself.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can do so by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit Discussion: llama.cpp GPU offload -ngl

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!