Skip to content

What Hardware Do You Need for RAG and LoRA Fine-Tuning Locally?

Problem

I wanted to build a local AI workstation that could handle both RAG workflows and LoRA fine-tuning. My budget was $10,000. I started by looking at GPU recommendations for LLMs.

But I quickly ran into conflicting advice:

Reddit User A: "Get dual RTX 4090s, VRAM is everything for fine-tuning"
Reddit User B: "RAG needs RAM, spend your budget on 256GB DDR5"
Reddit User C: "Just get one GPU and max out storage for vector DBs"

I was confused. These recommendations couldn’t all be right. So I dug into why RAG and fine-tuning have different hardware needs.

Environment

  • Budget: $10,000 USD
  • Use cases: RAG with personal docs (100k+ context), LoRA fine-tuning on 7B-30B models
  • Current hardware: None (building from scratch)
  • OS: Linux (Ubuntu 24.04 planned)

What I Discovered

RAG and LoRA fine-tuning stress completely different parts of your system.

RAG is Memory-Bound, Not Compute-Bound

When I ran a RAG pipeline on my test machine, I watched the resource usage:

Terminal window
# Running RAG with 50k context on 70B model (CPU inference)
$ htop
MEM[||||||||||||||||||||||||||||||||||||68.2G/128G]
CPU[||| ]
$ nvidia-smi
GPU Memory: 2.1GB / 24GB

The GPU barely touched VRAM. The bottleneck was system RAM for the KV cache and vector database operations.

Here’s why: RAG retrieves document chunks, embeds them, and builds context. The vector database (Chroma, Pinecone, FAISS) stores millions of embeddings. Each similarity search loads embeddings into RAM.

For a 70B model with 100k context window running on CPU:

Context Memory = Context Tokens x 2 bytes x Layers / 1e9
For 100k context on Llama-2-70B:
- KV cache: ~40GB system RAM
- Vector DB (1M embeddings, 1536 dimensions): ~6GB
- Total: ~50GB just for context + embeddings

Fine-Tuning is VRAM-Bound

Then I tried LoRA fine-tuning:

Terminal window
# Fine-tuning 7B model with LoRA on single RTX 4090
$ nvidia-smi
GPU Memory: 18.2GB / 24GB
# Fine-tuning 13B model with LoRA on same GPU
CUDA out of memory. Tried to allocate 4.2GB but only 2.1GB available

Fine-tuning requires storing more than just model weights:

Fine-tuning VRAM = Model Weights + Gradients + Optimizer States + Activations
For 7B model with LoRA:
- Model weights (4-bit): ~4GB
- LoRA adapters: ~0.5GB
- Gradients (trainable params): ~1GB
- Optimizer states (AdamW): ~2GB
- Activations (batch=1): ~4GB
- Total: ~12GB VRAM minimum

The key insight: fine-tuning uses 1.2-2x the VRAM of inference for the same model.

The Conflict

Here’s where budgets break:

Hardware Requirements Clash
┌─────────────────┬────────────────────┬────────────────────┐
│ Task │ Needs Most │ Secondary │
├─────────────────┼────────────────────┼────────────────────┤
│ RAG │ System RAM (128GB+)│ Storage Speed │
│ LoRA Fine-tune │ GPU VRAM (48GB+) │ System RAM (64GB) │
└─────────────────┴────────────────────┴────────────────────┘

With $10k, I couldn’t max out both. I had to choose where to compromise.

How I Decided

I started by calculating what each workload actually needed.

Step 1: Calculate RAG Memory Requirements

For RAG with large context windows, I needed to estimate KV cache size:

kv-cache-calculator.py
def estimate_kv_cache(model_params_b: int, context_tokens: int) -> float:
"""
Estimate KV cache memory in GB for CPU inference.
Formula: tokens * 2 bytes * layers / 1e9
"""
# Approximate layers based on model size
layers = {
7: 32,
13: 40,
30: 60,
70: 80,
}
num_layers = layers.get(model_params_b, 40)
kv_cache_gb = context_tokens * 2 * num_layers / 1e9
return kv_cache_gb
# My use cases
print(f"70B model, 100k context: {estimate_kv_cache(70, 100000):.1f} GB")
print(f"70B model, 200k context: {estimate_kv_cache(70, 200000):.1f} GB")
print(f"13B model, 100k context: {estimate_kv_cache(13, 100000):.1f} GB")

Running this:

Terminal window
$ python kv-cache-calculator.py
70B model, 100k context: 16.0 GB
70B model, 200k context: 32.0 GB
13B model, 100k context: 8.0 GB

But wait, that’s just KV cache. Reddit users reported higher actual usage:

From Reddit discussion:
"With my DDR4 128GB RAM I'm limited to up to 70k context for 200B models"
"256GB DDR5 recommended for 200B models with long context"

The gap comes from overhead: the model itself, activations during inference, and vector database operations. I added 50% buffer.

Step 2: Calculate Fine-Tuning VRAM

For LoRA fine-tuning, I used this formula:

lora-vram-calculator.py
def estimate_lora_vram(
model_params_b: int,
quantization: str = "4bit",
lora_rank: int = 16,
batch_size: int = 1,
seq_length: int = 2048
) -> float:
"""
Estimate VRAM for LoRA fine-tuning in GB.
"""
# Model weights
bytes_per_param = {"fp16": 2, "8bit": 1, "4bit": 0.5}
model_vram = model_params_b * bytes_per_param[quantization]
# LoRA parameters (roughly 0.1% of model with rank=16)
lora_params = model_params_b * 0.001
lora_vram = lora_params * 2 # fp16
# Optimizer states (AdamW: 8 bytes per trainable param)
optimizer_vram = lora_params * 8
# Activations (very rough estimate)
# Scales with batch, seq_length, hidden_dim
hidden_dim = model_params_b * 1000 # rough approximation
activations = batch_size * seq_length * hidden_dim * 4 / 1e9
total = model_vram + lora_vram + optimizer_vram + activations
return total
# What can I fine-tune on 24GB?
for model_size in [7, 13, 30]:
vram = estimate_lora_vram(model_size, "4bit")
fits = "YES" if vram < 24 else "NO"
print(f"{model_size}B model: {vram:.1f} GB VRAM - {fits}")
# What about 48GB (dual 4090)?
print("\nWith 48GB VRAM:")
for model_size in [7, 13, 30, 70]:
vram = estimate_lora_vram(model_size, "4bit")
fits = "YES" if vram < 48 else "NO"
print(f"{model_size}B model: {vram:.1f} GB VRAM - {fits}")

Results:

Terminal window
$ python lora-vram-calculator.py
7B model: 11.5 GB VRAM - YES
13B model: 19.2 GB VRAM - YES
30B model: 40.1 GB VRAM - NO
With 48GB VRAM:
7B model: 11.5 GB VRAM - YES
13B model: 19.2 GB VRAM - YES
30B model: 40.1 GB VRAM - YES
70B model: 83.5 GB VRAM - NO

So with a single 24GB GPU, I could only fine-tune up to 13B models with QLoRA. For 30B, I needed dual GPUs.

Step 3: Storage for RAG

RAG needs fast random reads. Vector databases perform thousands of small reads during similarity search.

Terminal window
# Benchmark storage for vector DB operations
$ fio --name=random-read --ioengine=libaio --iodepth=64 \
--rw=randread --bs=4k --direct=1 --size=10G --numjobs=4
# Results on different drives:
# Samsung 990 Pro NVMe: ~1.2M IOPS
# Budget NVMe: ~300K IOPS
# SATA SSD: ~90K IOPS

For RAG with 500K+ embeddings, the budget NVMe became a bottleneck. Fast NVMe mattered.

The Solution: Three Build Options

Based on my research, I identified three viable approaches:

Option A: RAG-Focused Build

RAG-Focused ($9,500)
Component | Cost | Specification
-------------------|-----------|------------------
GPU | $1,800 | 1x RTX 4090 (24GB VRAM)
RAM | $800 | 256GB DDR5-5600
Storage | $600 | 4TB Samsung 990 Pro NVMe
CPU | $500 | AMD Ryzen 9 7950X (16 cores)
Motherboard | $400 | X670E with 4 M.2 slots
Case + PSU | $400 | 1000W Platinum
-------------------|-----------|------------------
Total | $4,500 |

This handles:

  • 200k context on 70B models (CPU inference)
  • 5M+ embeddings in vector DB
  • Fine-tuning only up to 7B QLoRA

Option B: Fine-Tuning Focused Build

Fine-Tuning Focused ($9,200)
Component | Cost | Specification
-------------------|-----------|------------------
GPU | $3,600 | 2x RTX 4090 (48GB VRAM total)
RAM | $250 | 64GB DDR5-5600
Storage | $300 | 2TB Samsung 990 Pro NVMe
CPU | $350 | AMD Ryzen 7 7700X
Motherboard | $350 | B650E with dual x8 PCIe 4.0
Case + PSU | $400 | 1200W Platinum
-------------------|-----------|------------------
Total | $5,250 |

This handles:

  • LoRA fine-tuning up to 30B QLoRA
  • RAG limited to ~50k context on 70B
  • ~1M embeddings max before slowdown

Option C: Balanced Build (My Choice)

Balanced Build ($9,800)
Component | Cost | Specification
-------------------|-----------|------------------
GPU | $3,600 | 2x RTX 4090 (48GB VRAM total)
RAM | $500 | 128GB DDR5-5600
Storage | $500 | 4TB Samsung 990 Pro NVMe (2x2TB)
CPU | $400 | AMD Ryzen 9 7900X
Motherboard | $400 | X670E with dual PCIe 4.0 x8
Case + PSU | $400 | 1200W Platinum
-------------------|-----------|------------------
Total | $5,800 |

This handles:

  • 100k context on 70B models
  • 3M+ embeddings in vector DB
  • LoRA fine-tuning up to 30B QLoRA, 13B full LoRA

Why I Chose the Balanced Build

I went with Option C because:

  1. I needed both workloads. I was building RAG pipelines for document search, but I also wanted to fine-tune models on my domain data.

  2. 48GB VRAM is the sweet spot. Dual 4090s let me fine-tune 30B models with QLoRA. That covers most open-source models I cared about.

  3. 128GB RAM is enough for most RAG. I rarely needed 200k context. 100k context covered 95% of my use cases.

  4. Fast storage matters for both. Vector DBs for RAG, model weights for fine-tuning—both benefit from NVMe speed.

What I’d Do Differently

If I were building today with the same budget, I’d consider:

Waiting for RTX 5090: Rumored 32GB VRAM. A single 5090 might offer better value than dual 4090s for fine-tuning, freeing budget for more RAM.

Used A6000s: I found refurbished RTX A6000s (48GB each) for ~$3,000. But driver support and power draw made me nervous.

Starting smaller: I could have started with a single 4090 + 64GB RAM for ~$3,500, then added the second GPU and more RAM later.

Common Mistakes to Avoid

From my research and the Reddit discussions:

Mistake 1: Over-investing in GPU for RAG

If you only do RAG, one 4090 is plenty. The bottleneck is system RAM and storage. I saw people buying dual 4090s for RAG-only workflows—wasted money.

Mistake 2: Underestimating fine-tuning VRAM

A 24GB GPU that runs 70B Q4 inference cannot fine-tune that model. Fine-tuning needs 1.2-2x inference VRAM. Many builds I saw had this mismatch.

Mistake 3: Ignoring storage speed for RAG

Vector databases do random reads. Budget NVMe drives at 300K IOPS bottleneck at 500K+ embeddings. The Samsung 990 Pro at 1.2M IOPS is worth the premium.

Mistake 4: Single GPU for large model fine-tuning

If you want to fine-tune 30B+ models, you need multi-GPU. There’s no way around it. QLoRA helps, but has limits.

Summary

In this post, I showed how to choose hardware for both RAG and LoRA fine-tuning. The key point is that RAG needs system RAM for context windows and vector operations, while fine-tuning needs GPU VRAM for gradients and optimizer states. With a $10k budget, a balanced build with dual RTX 4090s (48GB VRAM), 128GB DDR5 RAM, and fast NVMe storage handles both workloads reasonably well—100k context for RAG and 30B QLoRA fine-tuning.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments