Skip to content

RTX 3090 vs 4090 vs 5090: Which GPU for Local LLMs?

I thought I could run a 70B parameter model on my single RTX 3090. I was wrong.

cuda_error.txt
RuntimeError: CUDA out of memory. Tried to allocate 42.00 GiB

That’s when I started researching GPU options seriously. If you’re building a local LLM rig, the GPU choice isn’t just about raw performance—it’s about VRAM, multi-GPU scaling, and cost efficiency.

The VRAM Reality Check

Local LLMs need VRAM. A lot of it. The model size directly determines what you need:

Model SizeQ4 QuantizationQ8 Quantization
7B~5-6GB~8-10GB
13B~8-10GB~14-16GB
30B~18-20GB~32-35GB
70B~40-45GB~70-75GB
120B~70-80GB~140GB

My 24GB RTX 3090 could handle 13B models at Q8, but 70B? No chance. Even the RTX 5090 with its 32GB can’t run a 70B model at Q4 quantization—you need about 40GB for that.

RTX 3090: The Budget Multi-GPU King

I started looking at used RTX 3090s on the market. They’re going for $600-800 each.

Here’s what makes the 3090 compelling for multi-GPU setups:

Pros:

  • 24GB VRAM per card
  • NVLink support for pooling VRAM (2 cards = 48GB effectively)
  • Incredible value on the used market
  • Well-tested software support
  • 4 cards = 96GB VRAM for about $2,800-3,200

Cons:

  • Power hungry at 350W TDP
  • Older Ampere architecture
  • Used cards may have mining history

The math is compelling: 4x RTX 3090 gives you 96GB pooled VRAM for under $3,200 in GPUs. That’s enough for a 70B model at Q8 or even 120B at Q4.

RTX 4090: The Awkward Middle Ground

I looked at the RTX 4090 next. At $1,600-2,000, it offers better performance per watt, but there’s a catch:

Pros:

  • Significantly faster inference than 3090
  • Better power efficiency (450W TDP)
  • New with warranty

Cons:

  • Same 24GB VRAM as 3090
  • NVLink removed—no VRAM pooling
  • Limited for 70B+ models alone
  • Expensive for multi-GPU

This is where NVIDIA made a strategic decision. Without NVLink, multi-GPU setups on 4090 rely on PCIe interconnect, which is slower for model parallelism. If you’re running models that fit in 24GB, the 4090 is faster. But for larger models, the lack of NVLink hurts.

RTX 5090: Future-Proofing at a Price

The RTX 5090 brings 32GB of GDDR7 VRAM—the largest consumer GPU VRAM yet. But it comes with caveats:

Pros:

  • 32GB VRAM—largest consumer GPU
  • Fastest inference available
  • GDDR7 for higher bandwidth
  • Can run 30B models at higher precision

Cons:

  • $2,000+ MSRP (often higher due to demand)
  • 575W TDP—massive power draw
  • Still not enough for 70B at Q4 alone
  • Requires 1000W+ PSU

A single RTX 5090 can run up to 30B models at Q4. But for 70B, you still need two cards. And two 5090s will cost you $4,000+.

Multi-GPU Configurations: The Real Answer

I created this comparison table to understand the options:

ConfigurationTotal VRAMEst. CostModels Supported
1x RTX 309024GB$700Up to 13B Q8
2x RTX 309048GB*$1,400Up to 30B Q8
4x RTX 309096GB*$2,800-3,20070B Q8, 120B Q4
1x RTX 409024GB$1,800Up to 13B Q8
2x RTX 409048GB**$3,600Up to 30B Q8
1x RTX 509032GB$2,000+Up to 30B Q4
2x RTX 509064GB**$4,000+Up to 70B Q4

*NVLink allows pooling on RTX 3090 **No NVLink on 4090/5090; relies on PCIe interconnect

Checking Your GPU Setup with Python

Before investing in hardware, I wrote a quick script to check what’s available:

check_gpu_vram.py
import torch
# List available GPUs and their VRAM
if torch.cuda.is_available():
for i in range(torch.cuda.device_count()):
props = torch.cuda.get_device_properties(i)
print(f"GPU {i}: {props.name}")
print(f" Total VRAM: {props.total_memory / 1024**3:.1f} GB")
print(f" Multi-processor count: {props.multi_processor_count}")
else:
print("No CUDA GPUs available")

For multi-GPU model loading, here’s how I distribute models across cards:

multi_gpu_load.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Llama-2-70b-chat-hf"
# Load model with automatic device mapping across multiple GPUs
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto", # Automatically distributes across available GPUs
torch_dtype=torch.float16,
load_in_4bit=True # Use 4-bit quantization to fit larger models
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Verify model placement
print(model.hf_device_map)

And to monitor VRAM usage during inference:

monitor_vram.py
import torch
import time
def monitor_vram(interval=1):
"""Continuously monitor GPU VRAM usage"""
while True:
for i in range(torch.cuda.device_count()):
allocated = torch.cuda.memory_allocated(i) / 1024**3
reserved = torch.cuda.memory_reserved(i) / 1024**3
print(f"GPU {i}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")
time.sleep(interval)
# Call after loading models to see memory footprint

Based on my research and the Reddit community’s experiences, here are three build paths:

Budget Build (~$5k) — Maximum VRAM

budget_build.txt
GPU: 4x Used RTX 3090 ($2,800-3,200)
CPU: Ryzen 9 7900X ($350)
RAM: 128GB DDR5 ($400)
PSU: 1600W ($300)
Motherboard: X570 with 4x PCIe x16 slots ($300)
Case + Cooling: ($400)
Total: ~$5,000

This gives you 96GB VRAM—enough for 70B models at Q8 or 120B at Q4.

Balanced Build (~$7k) — Performance + VRAM

balanced_build.txt
GPU: 2x RTX 5090 ($4,000-4,500)
CPU: i7-14700K ($400)
RAM: 128GB DDR5 ($400)
PSU: 1500W ($350)
Motherboard: Z790 ($350)
Case + Cooling: ($500)
Total: ~$7,000

64GB VRAM with modern architecture and better efficiency.

Starter Build (~$3k) — Gaming + AI

starter_build.txt
GPU: 1x RTX 5090 ($2,000-2,500)
CPU: i5-14600K ($300)
RAM: 64GB DDR5 ($200)
PSU: 1000W ($200)
Motherboard: Z790 ($250)
Case + Cooling: ($350)
Total: ~$3,200

Good for gaming and running models up to 30B.

Common Mistakes I Almost Made

  1. Underestimating VRAM needs: A 70B model at Q4 needs ~40GB. Single 32GB card can’t run it efficiently—you need CPU offloading or multiple GPUs.

  2. Ignoring power requirements: RTX 5090 draws 575W. A 4x 3090 setup needs 1400W+ just for GPUs. Don’t cheap out on the PSU.

  3. Overlooking the used market: RTX 3090 at $600-800 is exceptional value. Many are ex-mining cards, but for inference workloads, that’s often fine.

  4. Forgetting system RAM: Fast DDR5 system RAM enables CPU offloading for overflow. Get 128GB if you’re serious about large models.

  5. NVLink matters for multi-GPU: RTX 3090 has it. 4090 and 5090 don’t. For model parallelism, NVLink’s faster interconnect helps.

The Verdict

After all this research, here’s my take:

  • Budget-conscious, want large models: 4x used RTX 3090. 96GB VRAM for under $5k total.
  • Performance + some future-proofing: 2x RTX 5090. Better efficiency, modern architecture, 64GB VRAM.
  • Just getting started: 1x RTX 5090. Great single-card experience, gaming capable, handles up to 30B models.

For my use case—running coding agents and experimenting with larger models—I’m going with the 4x RTX 3090 route. The 96GB pooled VRAM opens up 70B and 120B models that simply won’t fit on smaller configurations.

VRAM is king for local LLMs. Everything else—speed, efficiency, features—is secondary if you can’t load the model you want to run.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments