What Hardware Do You Need for RAG and LoRA Fine-Tuning Locally?
Problem
I wanted to build a local AI workstation that could handle both RAG workflows and LoRA fine-tuning. My budget was $10,000. I started by looking at GPU recommendations for LLMs.
But I quickly ran into conflicting advice:
Reddit User A: "Get dual RTX 4090s, VRAM is everything for fine-tuning"
Reddit User B: "RAG needs RAM, spend your budget on 256GB DDR5"
Reddit User C: "Just get one GPU and max out storage for vector DBs"
I was confused. These recommendations couldn’t all be right. So I dug into why RAG and fine-tuning have different hardware needs.
Environment
- Budget: $10,000 USD
- Use cases: RAG with personal docs (100k+ context), LoRA fine-tuning on 7B-30B models
- Current hardware: None (building from scratch)
- OS: Linux (Ubuntu 24.04 planned)
What I Discovered
RAG and LoRA fine-tuning stress completely different parts of your system.
RAG is Memory-Bound, Not Compute-Bound
When I ran a RAG pipeline on my test machine, I watched the resource usage:
    # Running RAG with 50k context on 70B model (CPU inference)
    $ htop
    MEM[||||||||||||||||||||||||||||||||||||68.2G/128G]
    CPU[|||                                           ]

    $ nvidia-smi
    GPU Memory: 2.1GB / 24GB
The GPU barely touched VRAM. The bottleneck was system RAM for the KV cache and vector database operations.
Here’s why: RAG embeds the query, retrieves the closest document chunks, and builds the context from them. The vector store (Chroma or FAISS locally; Pinecone is a hosted service) holds millions of embeddings, and with an in-memory index every similarity search scans embeddings held in RAM.
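To make the RAM cost of the embeddings concrete, here is a minimal sketch of the raw size of a flat, uncompressed index. It assumes float32 values (4 bytes each) and ignores index metadata and any compression; the function name is only illustrative.

```python
def embedding_index_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw size in GB of a flat embedding matrix (assumes float32 by default)."""
    return num_vectors * dimensions * bytes_per_value / 1e9


# 1536 dimensions is the embedding width used in the estimates in this article
for n in (500_000, 1_000_000, 5_000_000):
    print(f"{n:>9,} vectors x 1536 dims: {embedding_index_gb(n, 1536):.1f} GB")
```

At 1M vectors this comes to about 6 GB, so a few million embeddings already take a noticeable share of system RAM before the model is loaded.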
For a 70B model with 100k context window running on CPU:
KV Cache (GB) = Context Tokens x Layers x 2 (K and V) x KV Heads x Head Dim x 2 bytes / 1e9
For 100k context on Llama-2-70B:
- KV cache: ~40GB system RAM
- Vector DB (1M embeddings, 1536 dimensions): ~6GB
- Total: ~50GB just for context + embeddings
Fine-Tuning is VRAM-Bound
Then I tried LoRA fine-tuning:
    # Fine-tuning 7B model with LoRA on single RTX 4090
    $ nvidia-smi
    GPU Memory: 18.2GB / 24GB

    # Fine-tuning 13B model with LoRA on same GPU
    CUDA out of memory. Tried to allocate 4.2GB but only 2.1GB available
Fine-tuning requires storing more than just model weights:
Fine-tuning VRAM = Model Weights + Gradients + Optimizer States + Activations
For 7B model with LoRA:
- Model weights (4-bit): ~4GB
- LoRA adapters: ~0.5GB
- Gradients (trainable params): ~1GB
- Optimizer states (AdamW): ~2GB
- Activations (batch=1): ~4GB
- Total: ~12GB VRAM minimum
The key insight: fine-tuning uses 1.2-2x the VRAM of inference for the same model.
The Conflict
Here’s where budgets break:
┌─────────────────┬─────────────────────┬────────────────────┐
│ Task            │ Needs Most          │ Secondary          │
├─────────────────┼─────────────────────┼────────────────────┤
│ RAG             │ System RAM (128GB+) │ Storage Speed      │
│ LoRA Fine-tune  │ GPU VRAM (48GB+)    │ System RAM (64GB)  │
└─────────────────┴─────────────────────┴────────────────────┘
With $10k, I couldn’t max out both. I had to choose where to compromise.
How I Decided
I started by calculating what each workload actually needed.
Step 1: Calculate RAG Memory Requirements
For RAG with large context windows, I needed to estimate KV cache size:
    def estimate_kv_cache(model_params_b: int, context_tokens: int) -> float:
        """
        Estimate fp16 KV cache memory in GB for CPU inference.
        Formula: tokens * layers * 2 (K and V) * kv_heads * head_dim * 2 bytes / 1e9
        """
        # Approximate layers based on model size
        layers = {
            7: 32,
            13: 40,
            30: 60,
            70: 80,
        }
        # Assumption: grouped-query attention with 8 KV heads of dimension 128,
        # as in Llama-2-70B. Models without GQA cache several times more per token.
        kv_heads, head_dim, bytes_per_value = 8, 128, 2

        num_layers = layers.get(model_params_b, 40)
        values_per_token = num_layers * 2 * kv_heads * head_dim
        kv_cache_gb = context_tokens * values_per_token * bytes_per_value / 1e9

        return kv_cache_gb

    # My use cases
    print(f"70B model, 100k context: {estimate_kv_cache(70, 100000):.1f} GB")
    print(f"70B model, 200k context: {estimate_kv_cache(70, 200000):.1f} GB")
    print(f"13B model, 100k context: {estimate_kv_cache(13, 100000):.1f} GB")
Running this:
    $ python kv-cache-calculator.py
    70B model, 100k context: 32.8 GB
    70B model, 200k context: 65.5 GB
    13B model, 100k context: 16.4 GB
But wait, that’s just KV cache. Reddit users reported higher actual usage:
From Reddit discussion:
"With my DDR4 128GB RAM I'm limited to up to 70k context for 200B models"
"256GB DDR5 recommended for 200B models with long context"
The gap comes from overhead: the model itself, activations during inference, and vector database operations. I added a 50% buffer.
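One way to read that buffer is as a multiplier on the sum of the known components. The sketch below is an assumption about how to apply it, not a measured result: it takes the ~40GB KV cache and ~6GB vector DB figures from earlier, assumes 4-bit weights for a 70B model (about 35GB), and picks the smallest RAM configuration that fits. The function names and the list of kit sizes are hypothetical.

```python
def ram_budget_gb(kv_cache_gb: float, weights_gb: float, vector_db_gb: float,
                  buffer: float = 0.5) -> float:
    """Sum the known components and add a fractional safety buffer."""
    return (kv_cache_gb + weights_gb + vector_db_gb) * (1 + buffer)


def smallest_kit(required_gb: float, kits=(64, 128, 256)):
    """Return the smallest RAM size (GB) that covers the requirement, or None."""
    for size in sorted(kits):
        if size >= required_gb:
            return size
    return None


# Assumed inputs: ~40 GB KV cache, ~35 GB of 4-bit 70B weights, ~6 GB vector DB
need = ram_budget_gb(kv_cache_gb=40, weights_gb=35, vector_db_gb=6)
print(f"Estimated requirement: {need:.1f} GB -> {smallest_kit(need)} GB of RAM")
```

Under these assumptions the requirement lands just under 128GB, which leaves little headroom for longer contexts.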
Step 2: Calculate Fine-Tuning VRAM
For LoRA fine-tuning, I used this formula:
    # (hidden_dim, layers) for LLaMA-family models of each size
    ARCH = {7: (4096, 32), 13: (5120, 40), 30: (6656, 60), 70: (8192, 80)}

    def estimate_lora_vram(
        model_params_b: int,
        quantization: str = "4bit",
        lora_rank: int = 16,
        batch_size: int = 1,
        seq_length: int = 2048,
    ) -> float:
        """
        Estimate VRAM for LoRA fine-tuning in GB.
        """
        # Model weights: billions of params * bytes per param = GB
        bytes_per_param = {"fp16": 2, "8bit": 1, "4bit": 0.5}
        model_vram = model_params_b * bytes_per_param[quantization]

        # LoRA parameters (roughly 0.1% of model with rank=16, scaled with rank)
        lora_params = model_params_b * 0.001 * lora_rank / 16
        lora_vram = lora_params * 2  # fp16

        # Optimizer states (AdamW: 8 bytes per trainable param)
        optimizer_vram = lora_params * 8

        # Activations (very rough estimate): about 34 bytes per token per hidden
        # unit per layer in fp16, assuming no gradient checkpointing
        hidden_dim, num_layers = ARCH[model_params_b]
        activations = batch_size * seq_length * hidden_dim * num_layers * 34 / 1e9

        total = model_vram + lora_vram + optimizer_vram + activations
        return total

    # What can I fine-tune on 24GB?
    for model_size in [7, 13, 30]:
        vram = estimate_lora_vram(model_size, "4bit")
        fits = "YES" if vram < 24 else "NO"
        print(f"{model_size}B model: {vram:.1f} GB VRAM - {fits}")

    # What about 48GB (dual 4090)?
    print("\nWith 48GB VRAM:")
    for model_size in [7, 13, 30, 70]:
        vram = estimate_lora_vram(model_size, "4bit")
        fits = "YES" if vram < 48 else "NO"
        print(f"{model_size}B model: {vram:.1f} GB VRAM - {fits}")
Results:
    $ python lora-vram-calculator.py
    7B model: 12.7 GB VRAM - YES
    13B model: 20.9 GB VRAM - YES
    30B model: 43.1 GB VRAM - NO

    With 48GB VRAM:
    7B model: 12.7 GB VRAM - YES
    13B model: 20.9 GB VRAM - YES
    30B model: 43.1 GB VRAM - YES
    70B model: 81.3 GB VRAM - NO
So by this estimate, with a single 24GB GPU I could only fine-tune up to 13B models with QLoRA. For 30B, I needed dual GPUs.
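The "roughly 0.1% of model with rank=16" figure in the calculator can be sanity-checked by counting adapter weights directly. This is a sketch under stated assumptions: a 7B LLaMA-style model (hidden size 4096, 32 layers, about 6.74B parameters) with LoRA applied to two square projections per layer, a common default. Adapting more matrices raises the count proportionally.

```python
def lora_trainable_params(hidden_dim: int, num_layers: int, rank: int,
                          adapted_per_layer: int = 2) -> int:
    """Count LoRA weights when adapting square hidden x hidden projections.

    Each adapted matrix gains two low-rank factors:
    A (rank x hidden) and B (hidden x rank).
    """
    per_matrix = rank * (hidden_dim + hidden_dim)
    return per_matrix * adapted_per_layer * num_layers


# Assumed 7B shape: hidden 4096, 32 layers, ~6.74e9 total parameters
params = lora_trainable_params(hidden_dim=4096, num_layers=32, rank=16)
share = params / 6.74e9
print(f"{params:,} trainable params ({share:.2%} of the model)")
```

The result is on the order of 0.1% of the base model, so adapter weights and their optimizer states are a small part of the total. Base weights and activations dominate.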
Step 3: Storage for RAG
RAG needs fast random reads. Vector databases perform thousands of small reads during similarity search.
    # Benchmark storage for vector DB operations
    $ fio --name=random-read --ioengine=libaio --iodepth=64 \
        --rw=randread --bs=4k --direct=1 --size=10G --numjobs=4

    # Results on different drives:
    # Samsung 990 Pro NVMe: ~1.2M IOPS
    # Budget NVMe: ~300K IOPS
    # SATA SSD: ~90K IOPS
For RAG with 500K+ embeddings, the budget NVMe became a bottleneck. Fast NVMe mattered.
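To put those IOPS numbers in more familiar units, the sketch below converts them to random-read throughput, assuming the 4 KiB block size used in the fio run above. It is plain arithmetic on the reported figures, not an additional benchmark.

```python
BLOCK_BYTES = 4096  # matches fio --bs=4k


def random_read_gb_per_s(iops: int, block_bytes: int = BLOCK_BYTES) -> float:
    """Convert random-read IOPS at a fixed block size to GB/s."""
    return iops * block_bytes / 1e9


drives = {
    "Samsung 990 Pro NVMe": 1_200_000,
    "Budget NVMe": 300_000,
    "SATA SSD": 90_000,
}
for name, iops in drives.items():
    print(f"{name}: {random_read_gb_per_s(iops):.2f} GB/s of 4 KiB random reads")
```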
The Solution: Three Build Options
Based on my research, I identified three viable approaches:
Option A: RAG-Focused Build
Component          | Cost      | Specification
-------------------|-----------|------------------
GPU                | $1,800    | 1x RTX 4090 (24GB VRAM)
RAM                | $800      | 256GB DDR5-5600
Storage            | $600      | 4TB Samsung 990 Pro NVMe
CPU                | $500      | AMD Ryzen 9 7950X (16 cores)
Motherboard        | $400      | X670E with 4 M.2 slots
Case + PSU         | $400      | 1000W Platinum
-------------------|-----------|------------------
Total              | $4,500    |
This handles:
- 200k context on 70B models (CPU inference)
- 5M+ embeddings in vector DB
- Fine-tuning only up to 13B QLoRA (7B comfortably)
Option B: Fine-Tuning Focused Build
Component          | Cost      | Specification
-------------------|-----------|------------------
GPU                | $3,600    | 2x RTX 4090 (48GB VRAM total)
RAM                | $250      | 64GB DDR5-5600
Storage            | $300      | 2TB Samsung 990 Pro NVMe
CPU                | $350      | AMD Ryzen 7 7700X
Motherboard        | $350      | B650E with dual x8 PCIe 4.0
Case + PSU         | $400      | 1200W Platinum
-------------------|-----------|------------------
Total              | $5,250    |
This handles:
- LoRA fine-tuning up to 30B QLoRA
- RAG limited to ~50k context on 70B
- ~1M embeddings max before slowdown
Option C: Balanced Build (My Choice)
Component          | Cost      | Specification
-------------------|-----------|------------------
GPU                | $3,600    | 2x RTX 4090 (48GB VRAM total)
RAM                | $500      | 128GB DDR5-5600
Storage            | $500      | 4TB Samsung 990 Pro NVMe (2x2TB)
CPU                | $400      | AMD Ryzen 9 7900X
Motherboard        | $400      | X670E with dual PCIe 4.0 x8
Case + PSU         | $400      | 1200W Platinum
-------------------|-----------|------------------
Total              | $5,800    |
This handles:
- 100k context on 70B models
- 3M+ embeddings in vector DB
- LoRA fine-tuning up to 30B QLoRA, 13B full LoRA
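As a quick check on the three options, this short sketch sums the component prices from the tables and shows how much of the $10,000 budget each build leaves. The prices are the ones listed above; the dictionary keys are just labels.

```python
BUDGET = 10_000

# Component costs in table order: GPU, RAM, storage, CPU, motherboard, case + PSU
builds = {
    "A (RAG-focused)": [1800, 800, 600, 500, 400, 400],
    "B (fine-tuning-focused)": [3600, 250, 300, 350, 350, 400],
    "C (balanced)": [3600, 500, 500, 400, 400, 400],
}


def summarize(costs, budget=BUDGET):
    """Return (total cost, remaining budget) for one build."""
    total = sum(costs)
    return total, budget - total


for name, costs in builds.items():
    total, left = summarize(costs)
    print(f"Option {name}: ${total:,} total, ${left:,} left of ${BUDGET:,}")
```

All three builds leave several thousand dollars unspent, which is the room available for later upgrades such as more RAM or a different GPU.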
Why I Chose the Balanced Build
I went with Option C because:
- I needed both workloads. I was building RAG pipelines for document search, but I also wanted to fine-tune models on my domain data.
- 48GB VRAM is the sweet spot. Dual 4090s let me fine-tune 30B models with QLoRA. That covers most open-source models I cared about.
- 128GB RAM is enough for most RAG. I rarely needed 200k context. 100k context covered 95% of my use cases.
- Fast storage matters for both. Vector DBs for RAG and model weights for fine-tuning both benefit from NVMe speed.
What I’d Do Differently
If I were building today with the same budget, I’d consider:
Waiting for RTX 5090: Rumored 32GB VRAM. A single 5090 might offer better value than dual 4090s for fine-tuning, freeing budget for more RAM.
Used A6000s: I found refurbished RTX A6000s (48GB each) for ~$3,000. But driver support and power draw made me nervous.
Starting smaller: I could have started with a single 4090 + 64GB RAM for ~$3,500, then added the second GPU and more RAM later.
Common Mistakes to Avoid
From my research and the Reddit discussions:
Mistake 1: Over-investing in GPU for RAG
If you only do RAG, one 4090 is plenty. The bottleneck is system RAM and storage. I saw people buying dual 4090s for RAG-only workflows—wasted money.
Mistake 2: Underestimating fine-tuning VRAM
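A GPU that can run a model for inference cannot necessarily fine-tune it. A 24GB card can hold the 4-bit weights of a 30B model (~15GB) for inference, but the fine-tuning estimate for that same model lands well above 24GB. Fine-tuning needs 1.2-2x inference VRAM. Many builds I saw had this mismatch.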
Mistake 3: Ignoring storage speed for RAG
Vector databases do random reads. Budget NVMe drives at 300K IOPS bottleneck at 500K+ embeddings. The Samsung 990 Pro at 1.2M IOPS is worth the premium.
Mistake 4: Single GPU for large model fine-tuning
If you want to fine-tune 30B+ models, you need more than 24GB of VRAM, which on consumer cards means multi-GPU. QLoRA helps, but has limits.
Summary
In this post, I showed how to choose hardware for both RAG and LoRA fine-tuning. The key point is that RAG needs system RAM for context windows and vector operations, while fine-tuning needs GPU VRAM for gradients and optimizer states. With a $10k budget, a balanced build with dual RTX 4090s (48GB VRAM), 128GB DDR5 RAM, and fast NVMe storage handles both workloads reasonably well—100k context for RAG and 30B QLoRA fine-tuning.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.