How Much VRAM Do You Need to Run 70B Models Locally?

Problem

When I tried running a 70B parameter model on my RTX 3090 (24GB VRAM), I got this error:

llama_model_load: error loading model: CUDA out of memory
GGML_ASSERT: ggml-cuda.c:1234: false

The model wouldn’t even load, let alone generate tokens. I knew 70B models were big, but I didn’t realize how much memory they actually need.

Environment

  • NVIDIA RTX 3090 (24GB VRAM)
  • 64GB DDR4 RAM
  • Ubuntu 22.04
  • llama.cpp (latest build)

What Happened?

I wanted to run a 70B model locally to avoid API costs and have full control over my inference. I thought my RTX 3090 with 24GB VRAM would be enough—after all, it runs 13B models beautifully at 50+ tokens per second.

Here’s what I tried first:

# Attempting to load Llama-3-70B
./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
-p "Write a short poem about AI" \
-n 256 \
-ngl 100

The model loaded partially, then crashed with the CUDA out of memory error. I tried reducing the GPU layers:

# Trying to offload fewer layers to GPU
./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
-p "Write a short poem about AI" \
-n 256 \
-ngl 20

It loaded, but performance was terrible—about 2 tokens per second because most layers were running on CPU with system RAM.

How to Solve It?

I realized I needed to understand the actual VRAM math before throwing hardware at the problem.

The VRAM Formula

Here’s the basic formula for the memory taken by the model weights alone (parameters in billions):

VRAM calculation
VRAM (GB) = Model Parameters (B) x Quantization Bits / 8

For a 70B model:

70B model VRAM requirements
FP16: 70 x 16 / 8 = 140 GB
Q8: 70 x 8 / 8 = 70 GB
Q4: 70 x 4 / 8 = 35 GB (plus overhead = ~40 GB)
Q3: 70 x 3 / 8 = 26.25 GB (plus overhead = ~30 GB)
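
The same arithmetic as a small Python helper, in case you want to plug in other model sizes. This is only a sketch of the formula above: the function name `weights_gb` is made up for illustration, and the result covers the weights alone. Real GGUF K-quant files store extra scaling data, so they tend to use somewhat more than their nominal bit width; treat the output as a lower bound.

```python
def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB: parameters (billions) x bits / 8."""
    return params_billions * bits_per_param / 8


# Reproduce the 70B figures listed above (before any overhead is added).
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q3", 3)]:
    print(f"{name}: {weights_gb(70, bits):.2f} GB")
# FP16: 140.00 GB
# Q8: 70.00 GB
# Q4: 35.00 GB
# Q3: 26.25 GB
```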

Quantization Comparison

I put together a table to understand the trade-offs:

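Quantization │ Bits per Parameter │ VRAM Required │ Quality        │ Speed
─────────────┼────────────────────┼───────────────┼────────────────┼─────────
FP16         │ 16                 │ ~140 GB       │ Full precision │ Baseline
FP8          │ 8                  │ ~70 GB        │ Near-lossless  │ Fast
Q8           │ 8                  │ ~70 GB        │ Near-lossless  │ Fast
Q6_K         │ 6                  │ ~52 GB        │ Excellent      │ Fast
Q5_K_M       │ 5                  │ ~44 GB        │ Good           │ Fast
Q4_K_M       │ 4                  │ ~40 GB        │ Best balance   │ Fast
Q3_K_M       │ 3                  │ ~26 GB        │ Usable         │ Fast
Q2_K         │ 2                  │ ~18 GB        │ Degraded       │ Fast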

Context Window Overhead

But that’s not the whole story. Each token in your context window also consumes VRAM for the KV cache:

Context VRAM overhead for 70B Q4
4k context: +2 GB
16k context: +8 GB
32k context: +16 GB
64k context: +32 GB

So for a 70B Q4 model with 16k context, I actually need around 48 GB of VRAM: ~40 GB for the weights plus overhead, and another 8 GB for the KV cache.
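
To make that sum repeatable for other context sizes, here is a minimal sketch. It assumes the context overhead grows linearly at 0.5 GB per 1,024 tokens, which is simply what the 70B Q4 figures above work out to, not a general constant. The real KV-cache size depends on the model's layer count, number of KV heads, head dimension, and cache precision. The function names are hypothetical.

```python
def context_overhead_gb(context_tokens: int, gb_per_1k_tokens: float = 0.5) -> float:
    """Context (KV cache) overhead in GB, assuming linear growth.

    The 0.5 GB per 1,024 tokens default is fitted to the 70B Q4 numbers
    in this post; it is an assumption, not a measured constant.
    """
    return context_tokens / 1024 * gb_per_1k_tokens


def total_vram_gb(weights_with_overhead_gb: float, context_tokens: int) -> float:
    """Weights (including runtime overhead) plus context overhead, in GB."""
    return weights_with_overhead_gb + context_overhead_gb(context_tokens)


# 70B Q4 (~40 GB including overhead) at a few context sizes.
for ctx in (4096, 16384, 32768, 65536):
    print(f"{ctx:>6} tokens: {total_vram_gb(40, ctx):.0f} GB")
```

With a 16,384-token context this comes to 48 GB, the same figure as above.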

My Solution Options

Given my budget and needs, I evaluated these approaches:

┌────────────────────────────────────────────────────────────────┐
│ 70B Model Hardware Options                                     │
├─────────────────────┬─────────────┬──────────────┬─────────────┤
│ Option              │ VRAM        │ Cost         │ Performance │
├─────────────────────┼─────────────┼──────────────┼─────────────┤
│ 1x RTX 4090 (24GB)  │ 24 GB       │ ~$1,600      │ Slow (CPU)  │
│ 2x RTX 3090 (48GB)  │ 48 GB       │ ~$1,400      │ Good Q4     │
│ 1x RTX 6000 Ada     │ 48 GB       │ ~$7,000      │ Good Q4     │
│ 2x RTX 4090 (48GB)  │ 48 GB       │ ~$3,200      │ Good Q4     │
│ 1x A100 (80GB)      │ 80 GB       │ ~$15,000     │ Excellent   │
│ 2x RTX 6000 Ada     │ 96 GB       │ ~$14,000     │ Excellent   │
│ Mac Studio Ultra    │ 192 GB      │ ~$6,000      │ Slow        │
└─────────────────────┴─────────────┴──────────────┴─────────────┘
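
One way to read the table is to keep only the options that meet a VRAM target and sort them by price. The sketch below does that with the approximate prices listed above; the `Option` class and `candidates` function are illustrative names. It ignores the Performance column, so a cheap but slow option such as the Mac Studio still shows up in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    vram_gb: int
    cost_usd: int  # approximate prices taken from the table above


OPTIONS = [
    Option("1x RTX 4090", 24, 1600),
    Option("2x RTX 3090", 48, 1400),
    Option("1x RTX 6000 Ada", 48, 7000),
    Option("2x RTX 4090", 48, 3200),
    Option("1x A100", 80, 15000),
    Option("2x RTX 6000 Ada", 96, 14000),
    Option("Mac Studio Ultra", 192, 6000),
]


def candidates(options: list[Option], min_vram_gb: int) -> list[Option]:
    """Options with at least min_vram_gb of memory, cheapest first."""
    fits = [o for o in options if o.vram_gb >= min_vram_gb]
    return sorted(fits, key=lambda o: o.cost_usd)


# 48 GB is the target for 70B Q4 with 16k context.
for o in candidates(OPTIONS, 48):
    print(f"{o.name}: {o.vram_gb} GB, ${o.cost_usd:,} (${o.cost_usd / o.vram_gb:.0f}/GB)")
```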

I ended up buying a second RTX 3090 used for $650. Here’s my working configuration:

# Working setup with 2x RTX 3090 (48GB total)
./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
-p "Write a short poem about AI" \
-n 256 \
-ngl 80 \
-c 16384

This gives me about 18-22 tokens per second with Q4 quantization and 16k context—perfectly usable for interactive use.

The Reason

The core issue is that a 70B model’s weights alone need about 35 GB at Q4, which already exceeds a single 24 GB card before any context is allocated. VRAM requirements scale linearly with both model size and context length.

Key insights I learned:

1. Quantization is a double-edged sword. Going below Q4 significantly degrades model quality. A heavily quantized 70B often performs worse than a properly quantized 30B or 34B model.

2. CPU offloading is painful. With only one GPU, offloading layers to the CPU dropped my speed from a potential 20+ tok/s to 2-3 tok/s, roughly a 10x slowdown.

3. Multi-GPU scaling works well. llama.cpp handles multi-GPU setups automatically. With NVLink or even just PCIe, the overhead is minimal for inference.

4. Future-proofing matters. Models are getting larger, and context windows are expanding. One Reddit user put it bluntly:

“Any solution that gives you less than 256GB VRAM/unified memory is a non-starter”

While that’s extreme for most users, the point stands—buy more VRAM than you think you need.

What About Larger Models?

Curious about even bigger models, I tested what my setup could handle:

Model size vs. performance on 48GB VRAM
Model        │ Quantization │ Context │ Speed (tok/s)
─────────────┼──────────────┼─────────┼──────────────
Llama-3-8B   │ Q8           │ 32k     │ 80+
Llama-3-70B  │ Q4_K_M       │ 16k     │ 18-22
Qwen2.5-72B  │ Q4_K_M       │ 16k     │ 15-20
Mixtral-8x7B │ Q4_K_M       │ 16k     │ 25-30
DeepSeek-67B │ Q4_K_M       │ 16k     │ 18-22

For 100B+ models, I’d need to offload to system RAM or use more aggressive quantization.
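
When a model does not fit entirely, whether that is a 100B+ model on 48 GB or the 70B model on a single 24 GB card, a starting value for `-ngl` can be estimated instead of guessed. The sketch below assumes the model's memory is spread evenly across its layers, which ignores the embedding and output tensors, and that context overhead stays on the GPU. The function name, the 1 GB headroom default, and the even-split assumption are mine, not llama.cpp's. Treat the result as an upper bound and lower it if loading still fails.

```python
import math


def estimate_gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
                        context_gb: float = 2.0, headroom_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = vram_gb - context_gb - headroom_gb
    if usable_gb <= 0:
        return 0
    # Never report more layers than the model actually has.
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# 70B Q4 (~40 GB over 80 layers) on one 24 GB card with a 4k context (+2 GB).
print(estimate_gpu_layers(model_gb=40, n_layers=80, vram_gb=24))  # 42
```

The remaining layers run on the CPU, so expect speeds closer to the 2-3 tok/s I saw with one card than to the fully offloaded numbers in the table.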

Summary

In this post, I showed how to calculate and plan VRAM requirements for 70B models. The key point is that you need 40-48 GB minimum for usable Q4 inference with reasonable context, and 70+ GB if you want better quantization or longer contexts.

If you’re building a local LLM rig, here’s my quick recommendation:

  • 48 GB VRAM: Good for 70B Q4 with 16k context (2x RTX 3090/4090)
  • 80 GB VRAM: Excellent for 70B Q6 with longer context, or Q8 with a short context (A100 80GB; a single RTX 5090 has 32 GB, so it would not get you there on its own)
  • 192 GB unified memory: Good for 200B+ models but slower (Mac Studio Ultra)

The math is straightforward: model parameters times quantization bits divided by 8, plus context overhead. Buy more than you think you need—models and context windows keep growing.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
