Skip to content

How Much VRAM Do You Need to Run Local LLMs?

I recently went down a rabbit hole trying to figure out what GPU to buy for running local LLMs. The question seems simple: how much VRAM do I actually need? But the answer depends on so many variables that I ended up building a calculator and testing different configurations.

Here’s what I learned.

The Core Question

When you’re shopping for a GPU to run local models, you’ll see numbers like 8GB, 12GB, 16GB, 24GB VRAM. But what do these numbers actually mean for which models you can run?

The short answer: VRAM determines the maximum model size you can load, and quantization determines how much you can compress that model to fit in your available memory.

Understanding the Math

I’ll start with the basic formula that changed how I think about this problem.

A neural network model is essentially a bunch of numbers (parameters). The memory needed depends on:

  1. How many parameters (measured in billions)
  2. How much precision each parameter uses (quantization level)

The math looks like this:

vram_calculator.py
def estimate_vram(params_billion: float, quantization: str, context_length: int = 4096) -> float:
"""
Estimate VRAM needed for a model.
Args:
params_billion: Model size in billions of parameters
quantization: 'Q4', 'Q5', 'Q6', 'Q8', or 'FP16'
context_length: Maximum context window size
Returns:
Estimated VRAM in GB
"""
# Bytes per parameter for each quantization
bytes_per_param = {
'Q4': 0.5, # 4 bits = 0.5 bytes
'Q5': 0.625, # 5 bits
'Q6': 0.75, # 6 bits
'Q8': 1.0, # 8 bits = 1 byte
'FP16': 2.0, # 16-bit float = 2 bytes
}
# Base model memory
base_memory = params_billion * bytes_per_param[quantization]
# KV cache overhead (roughly 0.5-1GB per 1K context for larger models)
kv_cache = (context_length / 1024) * (params_billion / 30) * 0.8
# 10% overhead for runtime
total = (base_memory + kv_cache) * 1.1
return round(total, 1)
# Examples
print(f"Qwen3.5-35B Q4: {estimate_vram(35, 'Q4')} GB") # ~24 GB
print(f"Llama-3-70B Q4: {estimate_vram(70, 'Q4')} GB") # ~48 GB
print(f"Qwen3-8B Q4: {estimate_vram(8, 'Q4')} GB") # ~6 GB

Let me break down what this calculator tells us.

Quantization Levels Explained

I used to think quantization was some kind of magic compression. It’s actually straightforward: you’re reducing the precision of each parameter.

Q4 (4-bit): Each parameter uses 4 bits (0.5 bytes). This is the most common choice for local inference. You lose some quality, but most models remain usable.

Q6 (6-bit): A middle ground. Better quality than Q4, but needs 50% more memory.

Q8 (8-bit): Nearly indistinguishable from full precision for most tasks, but requires twice the memory of Q4.

FP16 (16-bit): Full half-precision. Only needed when you’re training or need maximum accuracy.

The VRAM Sweet Spots

After running the numbers and checking community reports, here are the practical breakpoints:

VRAM Requirements by Model Size
| Model Size | Q4 (4-bit) | Q6 (6-bit) | Q8 (8-bit) | FP16 |
|------------|------------|------------|------------|-----------|
| 7B | 5-6 GB | 7-8 GB | 9-10 GB | 14 GB |
| 13B | 9-10 GB | 12-13 GB | 16-17 GB | 26 GB |
| 35B | 22-24 GB | 30-32 GB | 40-42 GB | 70 GB |
| 70B | 42-48 GB | 56-60 GB | 75-80 GB | 140 GB |
| 100B+ | 60+ GB | 80+ GB | 105+ GB | 200+ GB |

8GB VRAM: The Minimum

An 8GB card (like RTX 3060 or 4060) can run 7B models at Q4. This is your entry point. Models like Mistral 7B or Llama 3.1 8B work well here.

The limitation: you can’t run anything larger without offloading to system RAM, which kills performance.

12GB VRAM: The Budget Sweet Spot

Cards with 12GB (RTX 3060 12GB, 4070) open up 13B models at Q4. This was my first setup, and it handled most of what I wanted to do.

The catch: if you want long context windows, you’ll bump into memory limits faster than expected.

16GB VRAM: The Comfortable Middle

16GB cards (RTX 4080, some used 3090 Ti models) let you run 20B models comfortably. I found this to be a comfortable spot for models like Command R.

24GB VRAM: The Value Champion

This is where things get interesting. A single RTX 3090 or 4090 with 24GB can run 30-35B models at Q4. I’ve seen people successfully run Qwen3.5-35B on a 3090 Ti.

The used market for RTX 3090 cards around $700-800 makes this incredibly accessible for what you get.

48GB+ VRAM: Dual GPU Territory

To run 70B models at Q4, you need about 48GB. This usually means dual RTX 3090s or a workstation card like RTX 6000 Ada.

I tried this setup briefly. The performance is incredible, but the power draw and heat are real considerations.

80GB+ VRAM: Enterprise Land

NVIDIA A100 or H100 cards. These run 100B+ models comfortably. Unless you’re doing serious research or running a service, this is overkill for personal use.

The Mac Alternative

I also explored Mac Studio with unified memory. The appeal is obvious: 128GB or even 192GB of unified memory in a single machine.

But there’s a critical trade-off. GPU VRAM (GDDR6/GDDR7) runs at around 1.5 TB/s bandwidth. DDR5 system RAM runs at 50-100 GB/s. That’s a 15-30x difference in memory bandwidth.

What this means in practice: a Mac Studio can load a 200B model, but token generation will be slow. A GPU will load a smaller model but generate tokens much faster.

For coding assistants and interactive use, speed matters. For batch processing or research where you can wait, unified memory might be the better choice.

What I Got Wrong Initially

I made a few mistakes when I first started planning my build:

Underestimating KV cache: Long context windows need more memory than the base model size. A 128K context window can add several GB to your memory needs.

Ignoring bandwidth: I was so focused on VRAM capacity that I forgot about bandwidth. A slower card with more VRAM might not be faster in practice.

Buying for today: I bought my GPU for the models available then. Six months later, new model architectures changed the memory requirements.

My Recommendation

For most people getting into local LLMs today, here’s what I’d suggest:

  1. If you’re curious but not committed: Get a used RTX 3060 12GB or similar. Low investment, lets you experiment.

  2. If you’re serious about local AI: A used RTX 3090 at $700-800 gives you 24GB. This runs capable models like Qwen3.5-35B and leaves room for the next generation.

  3. If you need the absolute best: Consider whether you need speed (dual GPUs) or capacity (Mac Studio). They serve different use cases.

Summary

In this post, I walked through how to calculate VRAM requirements for running local LLMs. The key point is that VRAM needs scale with model size and quantization level: 8GB handles 7B models, 12GB handles 13B, 24GB handles 30-35B, and 48GB+ is needed for 70B+ models. The RTX 3090/4090 with 24GB VRAM offers the best value for most users, running capable models without requiring enterprise hardware.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments