How Much VRAM Do You Need to Run 70B Models Locally?
Problem
When I tried running a 70B parameter model on my RTX 3090 (24GB VRAM), I got this error:

    llama_model_load: error loading model: CUDA out of memory
    GGML_ASSERT: ggml-cuda.c:1234: false

The model wouldn’t even load, let alone generate tokens. I knew 70B models were big, but I didn’t realize how much memory they actually need.
Environment
- NVIDIA RTX 3090 (24GB VRAM)
- 64GB DDR4 RAM
- Ubuntu 22.04
- llama.cpp (latest build)
What Happened?
I wanted to run a 70B model locally to avoid API costs and have full control over my inference. I thought my RTX 3090 with 24GB VRAM would be enough—after all, it runs 13B models beautifully at 50+ tokens per second.
Here’s what I tried first:

    # Attempting to load Llama-3-70B
    ./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
      -p "Write a short poem about AI" \
      -n 256 \
      -ngl 100

The model loaded partially, then crashed with the CUDA out of memory error. I tried reducing the GPU layers:

    # Trying to offload fewer layers to GPU
    ./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
      -p "Write a short poem about AI" \
      -n 256 \
      -ngl 20

It loaded, but performance was terrible: about 2 tokens per second, because most layers were running on the CPU out of system RAM.
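To get a rough feel for how much of the model a single card can hold, here is a minimal Python sketch. It assumes a ~40 GB Q4 file spread evenly over 80 transformer layers (the layer count of Llama-3-70B) and a few GB held back for the KV cache and scratch buffers. The `reserve_gb` value and the `layers_that_fit` helper are illustrative assumptions, not something llama.cpp computes for you.

```python
import math


def layers_that_fit(vram_gb, model_gb, n_layers, reserve_gb=4.0):
    """Estimate how many equally sized layers fit in VRAM after a reserve.

    Assumes every layer has the same size, which is only approximately true.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    n = layers_that_fit(vram_gb=24, model_gb=40, n_layers=80)
    print(f"Roughly {n} of 80 layers fit on a 24 GB card")
```

Under these assumptions, at most about half of the layers can live on a 24 GB card. The rest run from system RAM, which is what drags generation speed down.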
How to Solve It?
I realized I needed to understand the actual VRAM math before throwing hardware at the problem.
The VRAM Formula
Here’s the basic formula for model memory:

    VRAM (GB) = Model Parameters (B) x Quantization Bits / 8

For a 70B model:

    FP16: 70 x 16 / 8 = 140 GB
    Q8:   70 x 8 / 8  = 70 GB
    Q4:   70 x 4 / 8  = 35 GB (plus overhead = ~40 GB)
    Q3:   70 x 3 / 8  ≈ 26 GB (plus overhead = ~30 GB)

Quantization Comparison
I put together a table to understand the trade-offs:
| Quantization | Bits per Parameter | VRAM Required | Quality | Speed |
|---|---|---|---|---|
| FP16 | 16 | ~140 GB | Full precision | Baseline |
| FP8 | 8 | ~70 GB | Near-lossless | Fast |
| Q8 | 8 | ~70 GB | Near-lossless | Fast |
| Q6_K | 6 | ~52 GB | Excellent | Fast |
| Q5_K_M | 5 | ~44 GB | Good | Fast |
| Q4_K_M | 4 | ~35 GB (~40 GB with overhead) | Best balance | Fast |
| Q3_K_M | 3 | ~26 GB (~30 GB with overhead) | Usable | Fast |
| Q2_K | 2 | ~18 GB | Degraded | Fast |
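The weights-only figures in the table follow directly from the formula, and a short Python sketch can reproduce them. The bit widths are the nominal values from the table. Real GGUF K-quant files use slightly more bits per weight than the nominal number, so treat the output as a lower bound. The function and dictionary names are illustrative.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Weights-only memory: parameters (in billions) x bits / 8 gives GB."""
    return params_billion * bits_per_param / 8


# Nominal bit widths taken from the table above (an approximation).
QUANT_BITS = {
    "FP16": 16,
    "Q8": 8,
    "Q6_K": 6,
    "Q5_K_M": 5,
    "Q4_K_M": 4,
    "Q3_K_M": 3,
    "Q2_K": 2,
}

if __name__ == "__main__":
    for name, bits in QUANT_BITS.items():
        print(f"{name:7s} {weight_memory_gb(70, bits):6.1f} GB")
```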
Context Window Overhead
But that’s not the whole story. Each token in your context window also consumes VRAM for the KV cache:

    4k context:  +2 GB
    16k context: +8 GB
    32k context: +16 GB
    64k context: +32 GB

So for a 70B Q4 model with 16k context, I actually need around 48 GB of VRAM: ~40 GB for the weights plus overhead, and ~8 GB for the KV cache.
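Putting the pieces together, here is a small Python sketch of the full estimate. It uses the rule-of-thumb numbers from this post: about 0.5 GB of KV cache per 1k tokens of context, and about 5 GB of fixed overhead (the gap between 35 GB and ~40 GB for Q4). Both defaults are rough assumptions, and the real KV cache size depends on the model architecture and cache precision.

```python
def total_vram_gb(params_billion, bits, context_tokens,
                  overhead_gb=5.0, kv_gb_per_1k=0.5):
    """Rough total VRAM: weights + fixed overhead + KV cache.

    overhead_gb and kv_gb_per_1k are rule-of-thumb assumptions,
    not measured values.
    """
    weights_gb = params_billion * bits / 8
    kv_cache_gb = context_tokens / 1024 * kv_gb_per_1k
    return weights_gb + overhead_gb + kv_cache_gb


if __name__ == "__main__":
    for ctx in (4096, 16384, 32768, 65536):
        print(f"70B Q4 @ {ctx // 1024}k context: "
              f"~{total_vram_gb(70, 4, ctx):.0f} GB")
```

With these defaults, a 70B Q4 model at 16k context comes out at 48 GB, matching the estimate above.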
My Solution Options
Given my budget and needs, I evaluated these approaches:
70B Model Hardware Options

| Option | VRAM | Cost | Performance |
|---|---|---|---|
| 1x RTX 4090 (24GB) | 24 GB | ~$1,600 | Slow (CPU) |
| 2x RTX 3090 (48GB) | 48 GB | ~$1,400 | Good Q4 |
| 1x RTX 6000 Ada | 48 GB | ~$7,000 | Good Q4 |
| 2x RTX 4090 (48GB) | 48 GB | ~$3,200 | Good Q4 |
| 1x A100 (80GB) | 80 GB | ~$15,000 | Excellent |
| 2x RTX 6000 Ada | 96 GB | ~$14,000 | Excellent |
| Mac Studio Ultra | 192 GB | ~$6,000 | Slow |

I ended up buying a second RTX 3090 used for $650. Here’s my working configuration:

    # Working setup with 2x RTX 3090 (48GB total)
    ./main -m ./models/Llama-3-70B-Q4_K_M.gguf \
      -p "Write a short poem about AI" \
      -n 256 \
      -ngl 80 \
      -c 16384

This gives me about 18-22 tokens per second with Q4 quantization and 16k context, which is perfectly usable for interactive work.
The Reason
The core issue is that 70B models are fundamentally large: even at Q4, the weights alone take about 35 GB, well beyond a single 24 GB card. VRAM requirements scale linearly with both parameter count and context length.
Key insights I learned:
1. Quantization is a double-edged sword. Going below Q4 significantly degrades model quality. A heavily quantized 70B often performs worse than a properly quantized 30B or 34B model.
2. CPU offloading is painful. When I had only one GPU, offloading layers to the CPU dropped my speed from a potential 20+ tok/s to 2-3 tok/s. That’s roughly a 10x slowdown.
3. Multi-GPU scaling works well. llama.cpp handles multi-GPU setups automatically. With NVLink or even just PCIe, the overhead is minimal for inference.
4. Future-proofing matters. Models are getting larger, and context windows are expanding. One Reddit user put it bluntly:
“Any solution that gives you less than 256GB VRAM/unified memory is a non-starter”
While that’s extreme for most users, the point stands—buy more VRAM than you think you need.
What About Larger Models?
Curious about even bigger models, I tested what my setup could handle:
| Model | Quantization | Context | Speed (tok/s) |
|---|---|---|---|
| Llama-3-8B | Q8 | 32k | 80+ |
| Llama-3-70B | Q4_K_M | 16k | 18-22 |
| Qwen2.5-72B | Q4_K_M | 16k | 15-20 |
| Mixtral-8x7B | Q4_K_M | 16k | 25-30 |
| DeepSeek-67B | Q4_K_M | 16k | 18-22 |

For 100B+ models, I’d need to offload to system RAM or use more aggressive quantization.
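To see where the ceiling sits for a given amount of VRAM, the formula can be turned around to solve for parameter count. This Python sketch reuses the same rule-of-thumb defaults as the rest of this post (about 5 GB of overhead and 0.5 GB of KV cache per 1k tokens). Those values were estimated for a 70B model and are only an assumption for larger ones.

```python
def max_params_billion(vram_gb, bits, context_tokens=16384,
                       overhead_gb=5.0, kv_gb_per_1k=0.5):
    """Largest model (billions of parameters) whose weights fit in the
    VRAM left after overhead and KV cache. Defaults are rough assumptions."""
    kv_cache_gb = context_tokens / 1024 * kv_gb_per_1k
    budget_gb = max(vram_gb - overhead_gb - kv_cache_gb, 0.0)
    return budget_gb * 8 / bits


if __name__ == "__main__":
    for bits in (4, 3, 2):
        print(f"48 GB at {bits}-bit, 16k context: "
              f"up to ~{max_params_billion(48, bits):.0f}B parameters")
```

Under these assumptions, 48 GB tops out at about 70B parameters at 4-bit. A 100B+ model only fits by dropping to roughly 2-bit quantization or by spilling layers to system RAM.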
Summary
In this post, I showed how to calculate and plan VRAM requirements for 70B models. The key point is that you need 40-48 GB minimum for usable Q4 inference with reasonable context, and 70+ GB if you want better quantization or longer contexts.
If you’re building a local LLM rig, here’s my quick recommendation:
- 48 GB VRAM: Good for 70B Q4 with 16k context (2x RTX 3090/4090)
- 80 GB VRAM: Excellent for 70B Q6-Q8 or 16k+ context (A100 80GB; note that a single RTX 5090 has 32 GB, so it does not reach this tier on its own)
- 192 GB unified memory: Good for 200B+ models but slower (Mac Studio Ultra)
The math is straightforward: model parameters times quantization bits divided by 8, plus context overhead. Buy more than you think you need—models and context windows keep growing.
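As a rough way to apply this, the Python sketch below picks the cheapest entry from the hardware options table earlier in this post that meets a VRAM requirement. The prices are the approximate figures from that table. The `cheapest_fit` helper is illustrative and ignores speed, so a slow option can still win on price.

```python
# (name, vram_gb, approx_cost_usd), copied from the hardware options table.
OPTIONS = [
    ("1x RTX 4090", 24, 1600),
    ("2x RTX 3090", 48, 1400),
    ("1x RTX 6000 Ada", 48, 7000),
    ("2x RTX 4090", 48, 3200),
    ("1x A100", 80, 15000),
    ("2x RTX 6000 Ada", 96, 14000),
    ("Mac Studio Ultra", 192, 6000),
]


def cheapest_fit(required_gb, options=OPTIONS):
    """Return the cheapest option with enough memory, or None if none fit."""
    fitting = [opt for opt in options if opt[1] >= required_gb]
    return min(fitting, key=lambda opt: opt[2]) if fitting else None


if __name__ == "__main__":
    for need in (48, 78):
        print(f"{need} GB needed -> {cheapest_fit(need)}")
```

For a 48 GB requirement this selects the 2x RTX 3090 setup. For anything above 48 GB it selects the Mac Studio Ultra on price alone, which is where the speed trade-off has to be weighed by hand.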
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: Local LLaMA Discussion on VRAM Requirements
- llama.cpp Documentation
- Hugging Face Model Quantization Guide
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!