What Hardware Do You Need for RAG and LoRA Fine-Tuning Locally?

Mar 27, 2026

Problem

I wanted to build a local AI workstation that could handle both RAG workflows and LoRA fine-tuning. My budget was $10,000. I started by looking at GPU recommendations for LLMs.

But I quickly ran into conflicting advice:

Reddit User A: "Get dual RTX 4090s, VRAM is everything for fine-tuning"
Reddit User B: "RAG needs RAM, spend your budget on 256GB DDR5"
Reddit User C: "Just get one GPU and max out storage for vector DBs"

I was confused. These recommendations couldn’t all be right. So I dug into why RAG and fine-tuning have different hardware needs.

Environment

Budget: $10,000 USD
Use cases: RAG with personal docs (100k+ context), LoRA fine-tuning on 7B-30B models
Current hardware: None (building from scratch)
OS: Linux (Ubuntu 24.04 planned)

What I Discovered

RAG and LoRA fine-tuning stress completely different parts of your system.

RAG is Memory-Bound, Not Compute-Bound

When I ran a RAG pipeline on my test machine, I watched the resource usage:

# Running RAG with 50k context on 70B model (CPU inference)
$ htop
  MEM[||||||||||||||||||||||||||||||||||||68.2G/128G]
  CPU[|||                                           ]

$ nvidia-smi
  GPU Memory: 2.1GB / 24GB

The GPU barely touched VRAM. The bottleneck was system RAM for the KV cache and vector database operations.

Here’s why: RAG retrieves document chunks, embeds them, and builds context. The vector database (Chroma, Pinecone, FAISS) stores millions of embeddings. Each similarity search loads embeddings into RAM.

For a 70B model with 100k context window running on CPU:

Context Memory = Context Tokens x 2 bytes x Layers / 1e9

For 100k context on Llama-2-70B:
- KV cache: ~40GB system RAM
- Vector DB (1M embeddings, 1536 dimensions): ~6GB
- Total: ~50GB just for context + embeddings

Fine-Tuning is VRAM-Bound

Then I tried LoRA fine-tuning:

# Fine-tuning 7B model with LoRA on single RTX 4090
$ nvidia-smi
  GPU Memory: 18.2GB / 24GB

# Fine-tuning 13B model with LoRA on same GPU
CUDA out of memory. Tried to allocate 4.2GB but only 2.1GB available

Fine-tuning requires storing more than just model weights:

Fine-tuning VRAM = Model Weights + Gradients + Optimizer States + Activations

For 7B model with LoRA:
- Model weights (4-bit): ~4GB
- LoRA adapters: ~0.5GB
- Gradients (trainable params): ~1GB
- Optimizer states (AdamW): ~2GB
- Activations (batch=1): ~4GB
- Total: ~12GB VRAM minimum

The key insight: fine-tuning uses 1.2-2x the VRAM of inference for the same model.

The Conflict

Here’s where budgets break:

┌─────────────────┬────────────────────┬────────────────────┐
│     Task        │    Needs Most      │    Secondary       │
├─────────────────┼────────────────────┼────────────────────┤
│ RAG             │ System RAM (128GB+)│ Storage Speed      │
│ LoRA Fine-tune  │ GPU VRAM (48GB+)   │ System RAM (64GB)  │
└─────────────────┴────────────────────┴────────────────────┘

With $10k, I couldn’t max out both. I had to choose where to compromise.

How I Decided

I started by calculating what each workload actually needed.

Step 1: Calculate RAG Memory Requirements

For RAG with large context windows, I needed to estimate KV cache size:

def estimate_kv_cache(model_params_b: int, context_tokens: int) -> float:
    """
    Estimate KV cache memory in GB for CPU inference.
    Formula: tokens * 2 bytes * layers / 1e9
    """
    # Approximate layers based on model size
    layers = {
        7: 32,
        13: 40,
        30: 60,
        70: 80,
    }

    num_layers = layers.get(model_params_b, 40)
    kv_cache_gb = context_tokens * 2 * num_layers / 1e9

    return kv_cache_gb

# My use cases
print(f"70B model, 100k context: {estimate_kv_cache(70, 100000):.1f} GB")
print(f"70B model, 200k context: {estimate_kv_cache(70, 200000):.1f} GB")
print(f"13B model, 100k context: {estimate_kv_cache(13, 100000):.1f} GB")

Running this:

$ python kv-cache-calculator.py
70B model, 100k context: 16.0 GB
70B model, 200k context: 32.0 GB
13B model, 100k context: 8.0 GB

But wait, that’s just KV cache. Reddit users reported higher actual usage:

From Reddit discussion:
"With my DDR4 128GB RAM I'm limited to up to 70k context for 200B models"
"256GB DDR5 recommended for 200B models with long context"

The gap comes from overhead: the model itself, activations during inference, and vector database operations. I added 50% buffer.

Step 2: Calculate Fine-Tuning VRAM

For LoRA fine-tuning, I used this formula:

def estimate_lora_vram(
    model_params_b: int,
    quantization: str = "4bit",
    lora_rank: int = 16,
    batch_size: int = 1,
    seq_length: int = 2048
) -> float:
    """
    Estimate VRAM for LoRA fine-tuning in GB.
    """

    # Model weights
    bytes_per_param = {"fp16": 2, "8bit": 1, "4bit": 0.5}
    model_vram = model_params_b * bytes_per_param[quantization]

    # LoRA parameters (roughly 0.1% of model with rank=16)
    lora_params = model_params_b * 0.001
    lora_vram = lora_params * 2  # fp16

    # Optimizer states (AdamW: 8 bytes per trainable param)
    optimizer_vram = lora_params * 8

    # Activations (very rough estimate)
    # Scales with batch, seq_length, hidden_dim
    hidden_dim = model_params_b * 1000  # rough approximation
    activations = batch_size * seq_length * hidden_dim * 4 / 1e9

    total = model_vram + lora_vram + optimizer_vram + activations
    return total

# What can I fine-tune on 24GB?
for model_size in [7, 13, 30]:
    vram = estimate_lora_vram(model_size, "4bit")
    fits = "YES" if vram < 24 else "NO"
    print(f"{model_size}B model: {vram:.1f} GB VRAM - {fits}")

# What about 48GB (dual 4090)?
print("\nWith 48GB VRAM:")
for model_size in [7, 13, 30, 70]:
    vram = estimate_lora_vram(model_size, "4bit")
    fits = "YES" if vram < 48 else "NO"
    print(f"{model_size}B model: {vram:.1f} GB VRAM - {fits}")

Results:

$ python lora-vram-calculator.py
7B model: 11.5 GB VRAM - YES
13B model: 19.2 GB VRAM - YES
30B model: 40.1 GB VRAM - NO

With 48GB VRAM:
7B model: 11.5 GB VRAM - YES
13B model: 19.2 GB VRAM - YES
30B model: 40.1 GB VRAM - YES
70B model: 83.5 GB VRAM - NO

So with a single 24GB GPU, I could only fine-tune up to 13B models with QLoRA. For 30B, I needed dual GPUs.

Step 3: Storage for RAG

RAG needs fast random reads. Vector databases perform thousands of small reads during similarity search.

# Benchmark storage for vector DB operations
$ fio --name=random-read --ioengine=libaio --iodepth=64 \
      --rw=randread --bs=4k --direct=1 --size=10G --numjobs=4

# Results on different drives:
# Samsung 990 Pro NVMe:    ~1.2M IOPS
# Budget NVMe:             ~300K IOPS
# SATA SSD:                ~90K IOPS

For RAG with 500K+ embeddings, the budget NVMe became a bottleneck. Fast NVMe mattered.

The Solution: Three Build Options

Based on my research, I identified three viable approaches:

Option A: RAG-Focused Build

Component          | Cost      | Specification
-------------------|-----------|------------------
GPU                | $1,800    | 1x RTX 4090 (24GB VRAM)
RAM                | $800      | 256GB DDR5-5600
Storage            | $600      | 4TB Samsung 990 Pro NVMe
CPU                | $500      | AMD Ryzen 9 7950X (16 cores)
Motherboard        | $400      | X670E with 4 M.2 slots
Case + PSU         | $400      | 1000W Platinum
-------------------|-----------|------------------
Total              | $4,500    |

This handles:

200k context on 70B models (CPU inference)
5M+ embeddings in vector DB
Fine-tuning only up to 7B QLoRA

Option B: Fine-Tuning Focused Build

Component          | Cost      | Specification
-------------------|-----------|------------------
GPU                | $3,600    | 2x RTX 4090 (48GB VRAM total)
RAM                | $250      | 64GB DDR5-5600
Storage            | $300      | 2TB Samsung 990 Pro NVMe
CPU                | $350      | AMD Ryzen 7 7700X
Motherboard        | $350      | B650E with dual x8 PCIe 4.0
Case + PSU         | $400      | 1200W Platinum
-------------------|-----------|------------------
Total              | $5,250    |

This handles:

LoRA fine-tuning up to 30B QLoRA
RAG limited to ~50k context on 70B
~1M embeddings max before slowdown

Option C: Balanced Build (My Choice)

Component          | Cost      | Specification
-------------------|-----------|------------------
GPU                | $3,600    | 2x RTX 4090 (48GB VRAM total)
RAM                | $500      | 128GB DDR5-5600
Storage            | $500      | 4TB Samsung 990 Pro NVMe (2x2TB)
CPU                | $400      | AMD Ryzen 9 7900X
Motherboard        | $400      | X670E with dual PCIe 4.0 x8
Case + PSU         | $400      | 1200W Platinum
-------------------|-----------|------------------
Total              | $5,800    |

This handles:

100k context on 70B models
3M+ embeddings in vector DB
LoRA fine-tuning up to 30B QLoRA, 13B full LoRA

Why I Chose the Balanced Build

I went with Option C because:

I needed both workloads. I was building RAG pipelines for document search, but I also wanted to fine-tune models on my domain data.
48GB VRAM is the sweet spot. Dual 4090s let me fine-tune 30B models with QLoRA. That covers most open-source models I cared about.
128GB RAM is enough for most RAG. I rarely needed 200k context. 100k context covered 95% of my use cases.
Fast storage matters for both. Vector DBs for RAG, model weights for fine-tuning—both benefit from NVMe speed.

What I’d Do Differently

If I were building today with the same budget, I’d consider:

Waiting for RTX 5090: Rumored 32GB VRAM. A single 5090 might offer better value than dual 4090s for fine-tuning, freeing budget for more RAM.

Used A6000s: I found refurbished RTX A6000s (48GB each) for ~$3,000. But driver support and power draw made me nervous.

Starting smaller: I could have started with a single 4090 + 64GB RAM for ~$3,500, then added the second GPU and more RAM later.

Common Mistakes to Avoid

From my research and the Reddit discussions:

Mistake 1: Over-investing in GPU for RAG

If you only do RAG, one 4090 is plenty. The bottleneck is system RAM and storage. I saw people buying dual 4090s for RAG-only workflows—wasted money.

Mistake 2: Underestimating fine-tuning VRAM

A 24GB GPU that runs 70B Q4 inference cannot fine-tune that model. Fine-tuning needs 1.2-2x inference VRAM. Many builds I saw had this mismatch.

Mistake 3: Ignoring storage speed for RAG

Vector databases do random reads. Budget NVMe drives at 300K IOPS bottleneck at 500K+ embeddings. The Samsung 990 Pro at 1.2M IOPS is worth the premium.

Mistake 4: Single GPU for large model fine-tuning

If you want to fine-tune 30B+ models, you need multi-GPU. There’s no way around it. QLoRA helps, but has limits.

Summary

In this post, I showed how to choose hardware for both RAG and LoRA fine-tuning. The key point is that RAG needs system RAM for context windows and vector operations, while fine-tuning needs GPU VRAM for gradients and optimizer states. With a $10k budget, a balanced build with dual RTX 4090s (48GB VRAM), 128GB DDR5 RAM, and fast NVMe storage handles both workloads reasonably well—100k context for RAG and 30B QLoRA fine-tuning.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit: LocalLLaMA Hardware Discussion
👨‍💻 QLoRA Paper
👨‍💻 RAG Architecture Guide

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

What Hardware Do You Need for RAG and LoRA Fine-Tuning Locally?

Problem

Environment

What I Discovered

RAG is Memory-Bound, Not Compute-Bound

Fine-Tuning is VRAM-Bound

The Conflict

How I Decided

Step 1: Calculate RAG Memory Requirements

Step 2: Calculate Fine-Tuning VRAM

Step 3: Storage for RAG

The Solution: Three Build Options

Option A: RAG-Focused Build

Option B: Fine-Tuning Focused Build

Option C: Balanced Build (My Choice)

Why I Chose the Balanced Build

What I’d Do Differently

Common Mistakes to Avoid

Summary

Final Words + More Resources

Comments