How Much VRAM Do You Need to Run 70B Models Locally?

Mar 27, 2026

Problem

When I tried running a 70B parameter model on my RTX 3090 (24GB VRAM), I got this error:

llama_model_load: error loading model: CUDA out of memory
GGML_ASSERT: ggml-cuda.c:1234: false

The model wouldn’t even load, let alone generate tokens. I knew 70B models were big, but I didn’t realize how much memory they actually need.

Environment

NVIDIA RTX 3090 (24GB VRAM)
64GB DDR4 RAM
Ubuntu 22.04
llama.cpp (latest build)

What Happened?

I wanted to run a 70B model locally to avoid API costs and have full control over my inference. I thought my RTX 3090 with 24GB VRAM would be enough—after all, it runs 13B models beautifully at 50+ tokens per second.

Here’s what I tried first:

# Attempting to load Llama-3-70B
./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
  -p "Write a short poem about AI" \
  -n 256 \
  -ngl 100

The model loaded partially, then crashed with the CUDA out of memory error. I tried reducing the GPU layers:

# Trying to offload fewer layers to GPU
./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
  -p "Write a short poem about AI" \
  -n 256 \
  -ngl 20

It loaded, but performance was terrible—about 2 tokens per second because most layers were running on CPU with system RAM.
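A rough way to pick a starting value for -ngl is to divide the model file size by the layer count and see how many layers fit in the VRAM left after a reserve. The sketch below is a back-of-envelope estimate, not how llama.cpp allocates memory. The function name, the 40 GB file size, and the 4 GB reserve for KV cache and buffers are assumptions for illustration; Llama-3-70B's 80 layers comes from its published model configuration.

```python
def max_gpu_layers(vram_gb, model_gb, n_layers, reserve_gb):
    """Estimate how many transformer layers fit in VRAM.

    Assumes every layer is the same size and that reserve_gb covers the
    KV cache, CUDA context, and scratch buffers (all rough guesses).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical inputs: ~40 GB of Q4 weights, 80 layers, 4 GB held in reserve.
layers = max_gpu_layers(vram_gb=24, model_gb=40, n_layers=80, reserve_gb=4)
print(f"Try -ngl {layers} and step down if loading still fails")
```

Treat the result as a starting point only: real layers are not perfectly uniform, and the reserve grows with context length.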

How to Solve It?

I realized I needed to understand the actual VRAM math before throwing hardware at the problem.

The VRAM Formula

Here’s the basic formula for the memory taken by the model weights alone:

VRAM (GB) = Model Parameters (billions) x Quantization Bits / 8

For a 70B model:

FP16:  70 x 16 / 8 = 140 GB
Q8:    70 x 8  / 8 = 70 GB
Q4:    70 x 4  / 8 = 35 GB (plus overhead = ~40 GB)
Q3:    70 x 3  / 8 = 26.25 GB (plus overhead = ~30 GB)
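The same arithmetic as a small helper, in case you want to plug in other model sizes. This is a minimal sketch of the weights-only formula above. The function name is made up for illustration, and the result excludes overhead and KV cache.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Weights-only memory: parameters (in billions) x bits / 8."""
    return params_billion * bits_per_param / 8


# Reproduce the 70B figures from the list above.
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q3", 3)]:
    print(f"{name:>4}: {weight_memory_gb(70, bits):6.2f} GB")
```

Treat these figures as a lower bound. GGUF K-quants store scaling data alongside the weights, so their effective bits per parameter run somewhat above the nominal number.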

Quantization Comparison

I put together a table to understand the trade-offs:

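Quantization │ Bits per Parameter │ VRAM Required                 │ Quality        │ Speed
─────────────┼────────────────────┼───────────────────────────────┼────────────────┼─────────
FP16         │ 16                 │ ~140 GB                       │ Full precision │ Baseline
FP8          │ 8                  │ ~70 GB                        │ Near-lossless  │ Fast
Q8           │ 8                  │ ~70 GB                        │ Near-lossless  │ Fast
Q6_K         │ 6                  │ ~52 GB                        │ Excellent      │ Fast
Q5_K_M       │ 5                  │ ~44 GB                        │ Good           │ Fast
Q4_K_M       │ 4                  │ ~35 GB (~40 GB with overhead) │ Best balance   │ Fast
Q3_K_M       │ 3                  │ ~26 GB (~30 GB with overhead) │ Usable         │ Fast
Q2_K         │ 2                  │ ~18 GB                        │ Degraded       │ Fast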

Context Window Overhead

But that’s not the whole story. Each token in your context window also consumes VRAM for the KV cache:

4k context:  +2 GB
16k context: +8 GB
32k context: +16 GB
64k context: +32 GB

So for a 70B Q4 model with 16k context, I actually need around 48 GB of VRAM.
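The per-context figures above are rules of thumb. The actual KV cache size depends on the model's architecture. Here is a minimal sketch of the usual estimate: keys plus values, per layer, per KV head, per token. The shape values (80 layers, 8 KV heads, head dimension 128) are taken from Llama-3-70B's published configuration, and the 16-bit cache is an assumption. Other models and cache precisions will give different numbers.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size in GiB for a given context length."""
    # Factor of 2: one tensor for keys and one for values.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1024**3


# Assumed shape: Llama-3-70B with grouped-query attention, 16-bit cache.
LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for ctx in (4096, 16384, 32768, 65536):
    print(f"{ctx:>6} tokens: {kv_cache_gib(context_tokens=ctx, **LLAMA3_70B):5.2f} GiB")
```

Under these assumptions the estimate comes out lower than the list above, because grouped-query attention shrinks the cache. A model without it would need several times more, so the list works as a conservative planning figure.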

My Solution Options

Given my budget and needs, I evaluated these approaches:

┌─────────────────────────────────────────────────────────────────┐
│                    70B Model Hardware Options                    │
├─────────────────────┬─────────────┬──────────────┬─────────────┤
│ Option              │ VRAM        │ Cost         │ Performance │
├─────────────────────┼─────────────┼──────────────┼─────────────┤
│ 1x RTX 4090 (24GB)  │ 24 GB       │ ~$1,600      │ Slow (CPU)  │
│ 2x RTX 3090 (48GB)  │ 48 GB       │ ~$1,400      │ Good Q4     │
│ 1x RTX 6000 Ada     │ 48 GB       │ ~$7,000      │ Good Q4     │
│ 2x RTX 4090 (48GB)  │ 48 GB       │ ~$3,200      │ Good Q4     │
│ 1x A100 (80GB)      │ 80 GB       │ ~$15,000     │ Excellent   │
│ 2x RTX 6000 Ada     │ 96 GB       │ ~$14,000     │ Excellent   │
│ Mac Studio Ultra    │ 192 GB      │ ~$6,000      │ Slow        │
└─────────────────────┴─────────────┴──────────────┴─────────────┘
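To compare the options against a target, one simple approach is to filter by capacity and sort by price. The sketch below uses the VRAM and cost figures from the table above. The function name is hypothetical, and it checks capacity only, so it says nothing about speed (the Mac Studio passes the filter but is marked slow above).

```python
# (name, VRAM in GB, approximate cost in USD), copied from the table above.
OPTIONS = [
    ("1x RTX 4090", 24, 1600),
    ("2x RTX 3090", 48, 1400),
    ("1x RTX 6000 Ada", 48, 7000),
    ("2x RTX 4090", 48, 3200),
    ("1x A100", 80, 15000),
    ("2x RTX 6000 Ada", 96, 14000),
    ("Mac Studio Ultra", 192, 6000),
]


def cheapest_fits(options, required_gb):
    """Return the options with enough memory, cheapest first."""
    fits = [opt for opt in options if opt[1] >= required_gb]
    return sorted(fits, key=lambda opt: opt[2])


# 48 GB is the 70B Q4 + 16k context estimate from earlier in the post.
for name, vram_gb, cost in cheapest_fits(OPTIONS, required_gb=48):
    print(f"{name:<18} {vram_gb:>4} GB  ~${cost:,}")
```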

I ended up buying a second RTX 3090 used for $650. Here’s my working configuration:

# Working setup with 2x RTX 3090 (48GB total)
./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
  -p "Write a short poem about AI" \
  -n 256 \
  -ngl 80 \
  -c 16384

This gives me about 18-22 tokens per second with Q4 quantization and 16k context—perfectly usable for interactive use.

The Reason

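The core issue is that a 70B model’s weights alone do not fit in 24 GB at any quantization I would want to use: even Q3 needs about 26 GB before overhead. On top of that, VRAM requirements scale linearly with both model size and context length.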

Key insights I learned:

1. Quantization is a double-edged sword. Going below Q4 significantly degrades model quality. A heavily quantized 70B often performs worse than a properly quantized 30B or 34B model.

2. CPU offloading is painful. With only one GPU, offloading layers to the CPU cut my speed from a potential 20+ tok/s to about 2 tok/s. That’s roughly a 10x slowdown.

3. Multi-GPU scaling works well. llama.cpp handles multi-GPU setups automatically. With NVLink or even just PCIe, the overhead is minimal for inference.

4. Future-proofing matters. Models are getting larger, and context windows are expanding. One Reddit user put it bluntly:

“Any solution that gives you less than 256GB VRAM/unified memory is a non-starter”

While that’s extreme for most users, the point stands—buy more VRAM than you think you need.
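One way to see why CPU offloading hurts so much is a bandwidth-only estimate: generating a token reads every weight once, so time per token is roughly the bytes read divided by memory bandwidth. The sketch below is a simplification under stated assumptions and was not measured on my setup. It uses 936 GB/s, the RTX 3090's spec-sheet bandwidth, and 51.2 GB/s for system RAM, which assumes dual-channel DDR4-3200. The 10 GB / 30 GB split mirrors offloading 20 of 80 layers of a roughly 40 GB model. With two GPUs splitting layers, the layers still run one after another, so the all-GPU case is modeled at a single GPU's bandwidth.

```python
def tokens_per_second(gpu_gb, cpu_gb, gpu_bw_gbps, cpu_bw_gbps):
    """Bandwidth-only decode estimate: each weight is read once per token."""
    seconds_per_token = gpu_gb / gpu_bw_gbps + cpu_gb / cpu_bw_gbps
    return 1.0 / seconds_per_token


GPU_BW = 936.0  # GB/s, RTX 3090 spec-sheet figure
CPU_BW = 51.2   # GB/s, ASSUMED dual-channel DDR4-3200

all_gpu = tokens_per_second(gpu_gb=40, cpu_gb=0, gpu_bw_gbps=GPU_BW, cpu_bw_gbps=CPU_BW)
mostly_cpu = tokens_per_second(gpu_gb=10, cpu_gb=30, gpu_bw_gbps=GPU_BW, cpu_bw_gbps=CPU_BW)

print(f"All layers on GPU:    ~{all_gpu:.1f} tok/s")
print(f"20 of 80 layers on GPU: ~{mostly_cpu:.1f} tok/s")
```

The estimate ignores compute, PCIe transfers, and KV cache reads, so read it as an order-of-magnitude check. Both results land in the same ballpark as the speeds reported above, and the gap between them is the system RAM bottleneck.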

What About Larger Models?

Curious how other models would fare, I tested what my setup could handle:

Model              │ Quantization │ Context │ Speed (tok/s)
───────────────────┼──────────────┼─────────┼──────────────
Llama-3-8B         │ Q8           │ 32k     │ 80+
Llama-3-70B        │ Q4_K_M       │ 16k     │ 18-22
Qwen2.5-72B        │ Q4_K_M       │ 16k     │ 15-20
Mixtral-8x7B       │ Q4_K_M       │ 16k     │ 25-30
DeepSeek-67B       │ Q4_K_M       │ 16k     │ 18-22

For 100B+ models, I’d need to offload to system RAM or use more aggressive quantization.
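To see how aggressive the quantization would have to be, the weights formula can be inverted: given a VRAM budget, solve for the bits per parameter that fit. This is a minimal sketch with a hypothetical function name. The 8 GB context allowance is the 16k figure from the context list earlier, and other overhead is left at zero for simplicity.

```python
def max_bits_per_param(vram_gb, params_billion, context_gb=0.0, overhead_gb=0.0):
    """Invert VRAM = params x bits / 8 to find the bit width that fits."""
    usable_gb = vram_gb - context_gb - overhead_gb
    return usable_gb * 8 / params_billion


# 48 GB total (2x RTX 3090), with 8 GB set aside for a 16k context.
for size in (70, 100, 120):
    bits = max_bits_per_param(vram_gb=48, params_billion=size, context_gb=8)
    print(f"{size}B model: at most ~{bits:.1f} bits per parameter")
```

On this estimate a 100B model fits in 48 GB only at around 3 bits per parameter, which is the range where quality starts to degrade.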

Summary

In this post, I showed how to calculate and plan VRAM requirements for 70B models. The key point is that you need 40-48 GB minimum for usable Q4 inference with reasonable context, and 70+ GB if you want better quantization or longer contexts.

If you’re building a local LLM rig, here’s my quick recommendation:

48 GB VRAM: Good for 70B Q4 with 16k context (2x RTX 3090/4090)
80 GB VRAM: Excellent for 70B Q6-Q8 or 16k+ context (A100; a single RTX 5090 has 32 GB, so it does not reach this tier on its own)
192 GB unified memory: Good for 200B+ models but slower (Mac Studio Ultra)

The math is straightforward: model parameters times quantization bits divided by 8, plus context overhead. Buy more than you think you need—models and context windows keep growing.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Reddit: Local LLaMA Discussion on VRAM Requirements
👨‍💻 llama.cpp Documentation
👨‍💻 Hugging Face Model Quantization Guide

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!