RTX 3090 vs 4090 vs 5090: Which GPU for Local LLMs?
I thought I could run a 70B parameter model on my single RTX 3090. I was wrong.
RuntimeError: CUDA out of memory. Tried to allocate 42.00 GiB
That’s when I started researching GPU options seriously. If you’re building a local LLM rig, the GPU choice isn’t just about raw performance; it’s about VRAM, multi-GPU scaling, and cost efficiency.
The VRAM Reality Check
Local LLMs need VRAM. A lot of it. The model size directly determines what you need:
| Model Size | Q4 Quantization | Q8 Quantization |
|---|---|---|
| 7B | ~5-6GB | ~8-10GB |
| 13B | ~8-10GB | ~14-16GB |
| 30B | ~18-20GB | ~32-35GB |
| 70B | ~40-45GB | ~70-75GB |
| 120B | ~70-80GB | ~140GB |
My 24GB RTX 3090 could handle 13B models at Q8, but 70B? No chance. Even the RTX 5090 with its 32GB can’t run a 70B model at Q4 quantization—you need about 40GB for that.
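As a rough cross-check on the table, weight memory can be estimated from parameter count and bits per weight. The sketch below is an approximation under stated assumptions, not a measurement: the helper name `estimate_vram_gb` and the flat 20% overhead for KV cache and runtime buffers are illustrative choices. Real quantization formats also store somewhat more than their nominal bit width, so it will not reproduce the table exactly.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.2) -> float:
    """First-order VRAM estimate in GB (hypothetical helper).

    overhead is an assumed flat allowance for KV cache, activations,
    and runtime buffers; real usage depends on context length and backend.
    """
    weights_gb = params_billion * bits_per_weight / 8  # billions of bytes == GB
    return weights_gb * (1 + overhead)


if __name__ == "__main__":
    for size in (7, 13, 30, 70, 120):
        q4 = estimate_vram_gb(size, 4)
        q8 = estimate_vram_gb(size, 8)
        print(f"{size}B: ~{q4:.0f} GB at 4-bit, ~{q8:.0f} GB at 8-bit")
```

Under these assumptions a 70B model at 4 bits comes out at roughly 42 GB, in line with the "about 40GB" figure and well past the 32GB of a single RTX 5090.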
RTX 3090: The Budget Multi-GPU King
I started looking at used RTX 3090s on the market. They’re going for $600-800 each.
Here’s what makes the 3090 compelling for multi-GPU setups:
Pros:
- 24GB VRAM per card
- NVLink support: a bridge links two cards for faster GPU-to-GPU transfers (2 cards = 48GB combined)
- Incredible value on the used market
- Well-tested software support
- 4 cards = 96GB VRAM for about $2,800-3,200
Cons:
- Power hungry at 350W TDP
- Older Ampere architecture
- Used cards may have mining history
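The math is compelling: 4x RTX 3090 gives you 96GB of total VRAM for under $3,200 in GPUs. That’s enough for a 70B model at Q8 or even 120B at Q4. One caveat: the 3090’s NVLink bridge connects cards in pairs, so in a four-card rig, traffic between the two pairs still goes over PCIe.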
RTX 4090: The Awkward Middle Ground
I looked at the RTX 4090 next. At $1,600-2,000, it offers better performance per watt, but there’s a catch:
Pros:
- Significantly faster inference than 3090
- Better performance per watt, despite a higher 450W TDP
- New with warranty
Cons:
- Same 24GB VRAM as 3090
- NVLink removed, so all multi-GPU traffic goes over PCIe
- Limited for 70B+ models alone
- Expensive for multi-GPU
This is where NVIDIA made a strategic decision. Without NVLink, multi-GPU setups on 4090 rely on PCIe interconnect, which is slower for model parallelism. If you’re running models that fit in 24GB, the 4090 is faster. But for larger models, the lack of NVLink hurts.
RTX 5090: Future-Proofing at a Price
The RTX 5090 brings 32GB of GDDR7 VRAM—the largest consumer GPU VRAM yet. But it comes with caveats:
Pros:
- 32GB VRAM—largest consumer GPU
- Fastest inference available
- GDDR7 for higher bandwidth
- Can run 30B models at higher precision
Cons:
- $2,000+ MSRP (often higher due to demand)
- 575W TDP—massive power draw
- Still not enough for 70B at Q4 alone
- Requires 1000W+ PSU
A single RTX 5090 can run up to 30B models at Q4. But for 70B, you still need two cards. And two 5090s will cost you $4,000+.
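The "two cards for 70B" conclusion is just ceiling division, sketched below. `cards_needed` is a hypothetical helper, and the 1 GB per-card reserve for the CUDA context and fragmentation is an assumption; the memory requirements are the approximate figures from the VRAM table. It also assumes the model can be split evenly across cards, which real layer-wise sharding only approximates.

```python
import math


def cards_needed(required_gb: float, vram_per_card_gb: float,
                 reserve_gb: float = 1.0) -> int:
    """Minimum number of identical cards to hold a model split across GPUs.

    reserve_gb is an assumed per-card allowance that the model cannot use.
    """
    usable = vram_per_card_gb - reserve_gb
    if usable <= 0:
        raise ValueError("reserve_gb must be smaller than the card's VRAM")
    return math.ceil(required_gb / usable)


if __name__ == "__main__":
    # ~40 GB for 70B at Q4, ~70 GB for 70B at Q8 (approximate table figures)
    for label, need in (("70B Q4", 40), ("70B Q8", 70)):
        print(label, "on 32GB cards:", cards_needed(need, 32))
        print(label, "on 24GB cards:", cards_needed(need, 24))
```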
Multi-GPU Configurations: The Real Answer
I created this comparison table to understand the options:
| Configuration | Total VRAM | Est. Cost | Models Supported |
|---|---|---|---|
| 1x RTX 3090 | 24GB | $700 | Up to 13B Q8 |
| 2x RTX 3090 | 48GB* | $1,400 | Up to 30B Q8 |
| 4x RTX 3090 | 96GB* | $2,800-3,200 | 70B Q8, 120B Q4 |
| 1x RTX 4090 | 24GB | $1,800 | Up to 13B Q8 |
| 2x RTX 4090 | 48GB** | $3,600 | Up to 30B Q8 |
| 1x RTX 5090 | 32GB | $2,000+ | Up to 30B Q4 |
| 2x RTX 5090 | 64GB** | $4,000+ | Up to 70B Q4 |
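*RTX 3090 supports NVLink, which bridges cards in pairs; the total is spread across cards, not one unified pool.
**No NVLink on 4090/5090; multi-GPU traffic relies on the PCIe interconnect.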
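To make the value comparison concrete, the table can be reduced to GPU dollars per GB of VRAM. This is a minimal sketch using the table's own estimates; treating the "$2,800-3,200" range as its midpoint and the "+" prices as their lower bound are simplifying assumptions, and the names `CONFIGS`, `cost_per_gb`, and `ranked` are illustrative.

```python
# (configuration, total VRAM in GB, estimated GPU cost in USD)
# Costs are the article's estimates; the range uses its midpoint and "+"
# prices use the stated lower bound (both are simplifying assumptions).
CONFIGS = [
    ("1x RTX 3090", 24, 700),
    ("2x RTX 3090", 48, 1400),
    ("4x RTX 3090", 96, 3000),
    ("1x RTX 4090", 24, 1800),
    ("2x RTX 4090", 48, 3600),
    ("1x RTX 5090", 32, 2000),
    ("2x RTX 5090", 64, 4000),
]


def cost_per_gb(cost_usd: float, vram_gb: float) -> float:
    """GPU dollars per GB of VRAM."""
    return cost_usd / vram_gb


def ranked(configs=CONFIGS):
    """Configurations sorted from cheapest to most expensive per GB."""
    return sorted(configs, key=lambda c: cost_per_gb(c[2], c[1]))


if __name__ == "__main__":
    for name, vram, cost in ranked():
        print(f"{name}: ${cost_per_gb(cost, vram):.2f}/GB")
```

This metric ignores speed, power, and interconnect, so it only captures the "how much VRAM per dollar" side of the comparison.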
Checking Your GPU Setup with Python
Before investing in hardware, I wrote a quick script to check what’s available:
```python
import torch

# List available GPUs and their VRAM
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}")
        print(f"  Total VRAM: {props.total_memory / 1024**3:.1f} GB")
        print(f"  Multi-processor count: {props.multi_processor_count}")
else:
    print("No CUDA GPUs available")
```
For multi-GPU model loading, here’s how I distribute models across cards:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"

# 4-bit quantization (requires the bitsandbytes package) to fit larger models
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load model with automatic device mapping across multiple GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # Automatically distributes layers across available GPUs
    torch_dtype=torch.float16,
    quantization_config=quant_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Verify model placement
print(model.hf_device_map)
```
And to monitor VRAM usage during inference:
```python
import time

import torch


def monitor_vram(interval=1):
    """Continuously monitor GPU VRAM usage"""
    while True:
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            reserved = torch.cuda.memory_reserved(i) / 1024**3
            print(f"GPU {i}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")
        time.sleep(interval)

# Call after loading models to see memory footprint
```
Recommended Builds by Budget
Based on my research and the Reddit community’s experiences, here are three build paths:
Budget Build (~$5k) — Maximum VRAM
- GPU: 4x Used RTX 3090 ($2,800-3,200)
- CPU: Ryzen 9 7900X ($350)
- RAM: 128GB DDR5 ($400)
- PSU: 1600W ($300)
- Motherboard: AM5 board (e.g. X670E) with 4x PCIe x16 slots ($300)
- Case + Cooling: ($400)
- Total: ~$5,000
This gives you 96GB VRAM, enough for 70B models at Q8 or 120B at Q4.
Balanced Build (~$7k) — Performance + VRAM
- GPU: 2x RTX 5090 ($4,000-4,500)
- CPU: i7-14700K ($400)
- RAM: 128GB DDR5 ($400)
- PSU: 1500W ($350)
- Motherboard: Z790 ($350)
- Case + Cooling: ($500)
- Total: ~$7,000
64GB VRAM with modern architecture and better efficiency.
Starter Build (~$3k) — Gaming + AI
- GPU: 1x RTX 5090 ($2,000-2,500)
- CPU: i5-14600K ($300)
- RAM: 64GB DDR5 ($200)
- PSU: 1000W ($200)
- Motherboard: Z790 ($250)
- Case + Cooling: ($350)
- Total: ~$3,200
Good for gaming and running models up to 30B.
Common Mistakes I Almost Made
- Underestimating VRAM needs: A 70B model at Q4 needs ~40GB. A single 32GB card can’t run it efficiently; you need CPU offloading or multiple GPUs.
- Ignoring power requirements: RTX 5090 draws 575W. A 4x 3090 setup needs 1400W+ just for GPUs. Don’t cheap out on the PSU.
- Overlooking the used market: RTX 3090 at $600-800 is exceptional value. Many are ex-mining cards, but for inference workloads, that’s often fine.
- Forgetting system RAM: Fast DDR5 system RAM enables CPU offloading for overflow. Get 128GB if you’re serious about large models.
- NVLink matters for multi-GPU: RTX 3090 has it. 4090 and 5090 don’t. For model parallelism, NVLink’s faster interconnect helps.
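The power figures above can be turned into a rough PSU sizing check. This is a sketch, not a sizing guide: the GPU TDPs are the ones quoted in this article, while the 150W CPU allowance, 100W for the rest of the system, and 20% headroom are assumed defaults, and `gpu_power_w` and `min_psu_w` are hypothetical helper names.

```python
def gpu_power_w(tdp_w: int, count: int) -> int:
    """Combined GPU TDP in watts."""
    return tdp_w * count


def min_psu_w(gpu_w: int, cpu_w: int = 150, other_w: int = 100,
              headroom: float = 0.2) -> int:
    """Suggested PSU rating: component draw plus a safety margin.

    cpu_w, other_w, and headroom are assumed defaults, not measurements.
    """
    return round((gpu_w + cpu_w + other_w) * (1 + headroom))


if __name__ == "__main__":
    rigs = {"4x RTX 3090": (350, 4), "2x RTX 5090": (575, 2), "1x RTX 5090": (575, 1)}
    for name, (tdp, count) in rigs.items():
        gpus = gpu_power_w(tdp, count)
        print(f"{name}: {gpus} W for GPUs, ~{min_psu_w(gpus)} W PSU suggested")
```

Adjust the allowances to your own CPU and peripherals; the point is that GPU TDP alone understates what the PSU has to deliver.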
The Verdict
After all this research, here’s my take:
- Budget-conscious, want large models: 4x used RTX 3090. 96GB VRAM for under $5k total.
- Performance + some future-proofing: 2x RTX 5090. Better efficiency, modern architecture, 64GB VRAM.
- Just getting started: 1x RTX 5090. Great single-card experience, gaming capable, handles up to 30B models.
For my use case (running coding agents and experimenting with larger models), I’m going with the 4x RTX 3090 route. The 96GB of combined VRAM opens up 70B and 120B models that simply won’t fit on smaller configurations.
VRAM is king for local LLMs. Everything else—speed, efficiency, features—is secondary if you can’t load the model you want to run.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 r/LocalLLaMA - Reddit Community
- 👨‍💻 HuggingFace Big Model Inference
- 👨‍💻 llama.cpp - GitHub
- 👨‍💻 NVIDIA RTX 5090 Specifications
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!