How to Run Karpathy's Autoresearch on Consumer GPUs (RTX 4090/3090)

Mar 30, 2026

I downloaded Andrej Karpathy’s autoresearch project expecting to run automated ML experiments on my gaming PC. Then I read the documentation: “Single H100 is the benchmark machine.” That GPU costs roughly $30,000.

But Karpathy also wrote: “designed to be adapted to lower-end hardware.” So I tried it anyway.

RuntimeError: CUDA out of memory. Tried to allocate 2.5 GiB
GPU 0 has a total capacity of 23.6 GiB
Of which 20.1 GiB is already allocated

My RTX 4090 has 24GB VRAM. The default configuration expected 80GB. Here’s how I adapted autoresearch to work on consumer GPUs.

The Problem: H100 vs Consumer GPU

Karpathy ran autoresearch for 48 hours on a single H100. Here is how that GPU compares with my RTX 4090:

┌─────────────────────────────────────────────────────────────────┐
│              H100 vs Consumer GPU Comparison                    │
├──────────────────┬─────────────────┬────────────────────────────┤
│ Specification    │ H100            │ RTX 4090                   │
├──────────────────┼─────────────────┼────────────────────────────┤
│ VRAM             │ 80GB            │ 24GB                       │
│ Memory Bandwidth │ 3.35 TB/s       │ 1.0 TB/s                   │
│ Price            │ $25,000-40,000  │ $1,600                     │
│ Availability     │ Enterprise only │ Consumer market            │
│ Experiment Time  │ 5 min windows   │ 15-30 min windows needed   │
└──────────────────┴─────────────────┴────────────────────────────┘

The H100’s 80GB VRAM lets it run large batch sizes in memory. My 24GB RTX 4090 can’t fit those configurations directly.

Strategy 1: Reduce Batch Size + Gradient Accumulation

The first change I made was reducing the batch size. But smaller batches mean less stable gradients. The solution: gradient accumulation.

# Original H100 configuration
config_h100 = {
    'batch_size': 64,
    'gradient_accumulation_steps': 1,
    'n_embd': 768,
    'n_layer': 12,
    'n_head': 12,
}

# My RTX 4090 adaptation (24GB VRAM)
config_4090 = {
    'batch_size': 16,              # Reduced to fit VRAM
    'gradient_accumulation_steps': 4,  # Effective batch still 64
    'n_embd': 768,                 # Keep same model size
    'n_layer': 12,
    'n_head': 12,
    'mixed_precision': 'bf16',     # Enable BF16 training
}

The math: effective_batch_size = batch_size * gradient_accumulation_steps. My 16 * 4 = 64, matching the H100’s effective batch size.
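
The same arithmetic can be turned around to pick the accumulation steps for any micro-batch that fits in VRAM. This is a minimal sketch; the helper name `accumulation_steps_for` is hypothetical and not part of autoresearch.

```python
def accumulation_steps_for(target_effective_batch: int, micro_batch: int) -> int:
    """Return the accumulation steps needed to reach the target effective batch.

    Hypothetical helper, not part of autoresearch. It refuses micro-batches
    that do not divide the target evenly, because the effective batch would
    then silently differ from the reference configuration.
    """
    if target_effective_batch <= 0 or micro_batch <= 0:
        raise ValueError("batch sizes must be positive")
    if target_effective_batch % micro_batch != 0:
        raise ValueError(
            f"micro_batch={micro_batch} does not divide {target_effective_batch}"
        )
    return target_effective_batch // micro_batch


# Micro-batch sizes used for the GPUs discussed in this article
for micro_batch in (16, 8, 4):
    steps = accumulation_steps_for(64, micro_batch)
    print(f"batch_size={micro_batch:>2} x accum={steps:>2} "
          f"-> effective {micro_batch * steps}")
```

Halving the micro-batch doubles the accumulation steps, so the effective batch stays at 64 while peak activation memory drops.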

Here’s how gradient accumulation works in the training loop:

# Assumes model, optimizer, and dataloader are already defined
accumulation_steps = config_4090['gradient_accumulation_steps']
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = outputs.loss / accumulation_steps  # Scale loss down
    loss.backward()

    # Only update weights after accumulating gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Each forward pass uses only 16 samples, but gradients accumulate across 4 passes before updating weights. The model “sees” 64 samples per update, just like the H100 version.
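
To see why scaling the loss by the number of accumulation steps reproduces the large-batch gradient, here is a small pure-Python check. It is a sketch under simplifying assumptions: a one-parameter linear model with a mean squared error loss, equal-sized micro-batches, and made-up toy data.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data (illustrative values): 8 samples split into 4 micro-batches of 2
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 6.8, 8.1]
w = 0.3
accumulation_steps = 4
micro = len(xs) // accumulation_steps

# Reference: one gradient computed over the full batch
full_batch_grad = mse_grad(w, xs, ys)

# Accumulated: sum of micro-batch gradients, each scaled by 1/accumulation_steps
accumulated_grad = 0.0
for i in range(accumulation_steps):
    start, stop = i * micro, (i + 1) * micro
    # Mirrors `loss = outputs.loss / accumulation_steps` in the training loop
    accumulated_grad += mse_grad(w, xs[start:stop], ys[start:stop]) / accumulation_steps

print(f"full batch:  {full_batch_grad:.6f}")
print(f"accumulated: {accumulated_grad:.6f}")
print("match:", math.isclose(full_batch_grad, accumulated_grad, rel_tol=1e-9))
```

The match is exact only when the loss is a mean over samples and the micro-batches are the same size. Layers that compute statistics per batch, such as batch normalization, still see the smaller micro-batch.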

Strategy 2: Enable Mixed Precision Training

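BF16 (bfloat16) stores each value in 2 bytes instead of FP32's 4, which roughly halves activation memory under autocast (the weights and optimizer state stay in FP32). The RTX 4090 supports BF16 natively.

import torch

# Run the forward pass in BF16; model parameters stay in FP32.
# Assumes model, batch, and optimizer are already defined.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(batch)
    loss = outputs.loss

# No GradScaler here: loss scaling is only needed for FP16
loss.backward()
optimizer.step()
optimizer.zero_grad()

BF16 preserves numerical stability better than FP16 because it keeps the same exponent range as FP32, so small gradients do not underflow and no loss scaling is required. For autoresearch’s experimentation loop, the slight precision loss is acceptable.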
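
The exponent-range difference can be checked without a GPU. The sketch below uses only Python's `struct` module. BF16 is emulated by keeping the top 16 bits of the FP32 encoding, which truncates where real hardware rounds to nearest. The function names are hypothetical.

```python
import struct


def to_bf16(value: float) -> float:
    """Emulate BF16 by zeroing the low 16 bits of the FP32 encoding.

    Simplification: this truncates the mantissa; hardware rounds to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_fp16(value: float) -> bool:
    """Return True if the value can be packed as IEEE 754 half precision."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


# 65504 is the largest finite FP16 value; anything larger overflows
for value in (65504.0, 70000.0, 1e30):
    print(f"{value:g}: fits FP16 = {fits_fp16(value)}, "
          f"BF16 approximation = {to_bf16(value):g}")
```

Values beyond 65504 overflow in FP16, while BF16 still represents them with a coarser mantissa. That is the trade being made: less precision, same range.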

Strategy 3: Monitor VRAM Usage

Consumer GPUs hit OOM more easily. I added VRAM logging to catch issues early:

import torch

def log_vram_usage(step):
    """Print current, reserved, and peak VRAM in MiB; return the peak."""
    mib = 1024**2
    allocated = torch.cuda.memory_allocated() / mib
    reserved = torch.cuda.memory_reserved() / mib
    peak = torch.cuda.max_memory_allocated() / mib

    print(f"Step {step}: "
          f"{allocated:.1f}MB used, "
          f"{reserved:.1f}MB reserved, "
          f"{peak:.1f}MB peak")

    # Reset peak so the next call reports the peak since this one
    torch.cuda.reset_peak_memory_stats()

    return peak

# Call during training (train_step and dataloader are assumed to exist)
for step, batch in enumerate(dataloader):
    loss = train_step(batch)
    if step % 100 == 0:
        peak = log_vram_usage(step)
        if peak > 22 * 1024:  # 22 GiB budget on a 24GB card
            print("WARNING: VRAM usage high!")
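
PyTorch's counters only cover memory held by its own allocator. The CUDA context and any other process on the GPU (a desktop session, a browser) are invisible to them. A cross-check from outside the process can be sketched with `nvidia-smi`'s CSV query mode. The function names here are hypothetical, and the sample line is illustrative rather than a measurement.

```python
import subprocess


def parse_memory_csv(line: str) -> tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory(gpu_index: int = 0) -> tuple[int, int]:
    """Ask nvidia-smi for used and total memory in MiB on one GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout.strip().splitlines()[0])


# Illustrative sample line, not a measurement from the runs in this article
used, total = parse_memory_csv("20580, 24564")
print(f"{used} MiB of {total} MiB in use ({used / total:.0%})")
```

On a machine with an NVIDIA driver installed, `query_gpu_memory()` returns the live numbers. If they sit well above what PyTorch reports, something else is occupying VRAM.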

I also set this environment variable, which lets PyTorch's CUDA allocator grow existing memory segments and reduces fragmentation:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Strategy 4: Adjust Experiment Time Windows

Karpathy’s H100 ran experiments in 5-minute windows. My RTX 4090 needs 15-30 minutes per experiment due to slower processing. I modified program.md:

## Output format

val_bpb: X.XXXXXX
training_seconds: 1800    # 30 min window (not 5 min)
peak_vram_mb: XXXXX.X     # Added: monitor VRAM usage
effective_batch_size: 64  # batch_size * grad_accum

## Hardware Constraints

- Max VRAM: 24GB (RTX 4090)
- Peak memory usage must stay under 22GB (leave buffer)
- If OOM occurs: reduce batch_size first, then increase gradient_accumulation_steps
- Consider overnight runs instead of continuous 48-hour sessions

The longer windows mean fewer experiments per hour, but each experiment still runs autonomously.
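
The throughput cost of longer windows is simple to quantify. This back-of-the-envelope sketch ignores any setup time between experiments, so real counts will be somewhat lower. The function name is hypothetical.

```python
def experiments_in(total_hours: float, window_minutes: float) -> int:
    """Count the experiments that fit in a run, ignoring time between them."""
    return int(total_hours * 60 // window_minutes)


windows = (
    ("5 min (H100 default)", 5),
    ("15 min", 15),
    ("20 min", 20),
    ("30 min", 30),
)
for label, minutes in windows:
    per_hour = experiments_in(1, minutes)
    per_run = experiments_in(48, minutes)
    print(f"{label:>21}: {per_hour:>2} per hour, {per_run:>3} in 48 hours")
```

A 20-minute window yields a quarter of the experiments a 5-minute window does over the same wall-clock time, which is the real price of running on a consumer card.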

Strategy 5: Smaller Model for Lower-End GPUs

For my secondary RTX 3080 (10GB VRAM), I needed more aggressive scaling:

# RTX 3080 (10GB VRAM) - more aggressive changes
config_3080 = {
    'batch_size': 4,
    'gradient_accumulation_steps': 16,  # Effective batch 64
    'n_embd': 512,              # Reduced from 768
    'n_layer': 8,               # Reduced from 12
    'n_head': 8,                # Reduced from 12
    'mixed_precision': 'fp16',  # FP16 needs a GradScaler; Ampere cards like the 3080 also support BF16
}

Smaller model dimensions mean faster training but potentially lower peak performance. For autoresearch’s exploration phase, this trade-off is acceptable—the agent can discover what works given hardware constraints.
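
To get a feel for how much smaller the reduced configuration is, the sketch below applies the common rule of thumb of about 12 * n_layer * n_embd^2 weights for a GPT-style stack. That assumes standard attention plus a 4x-wide MLP and ignores embeddings, biases, and layer norms, so the actual autoresearch model may differ. The function names are hypothetical.

```python
def approx_transformer_params(n_layer: int, n_embd: int) -> int:
    """Rough GPT-style weight count: 12 * n_layer * n_embd^2.

    Per block: 4 * n_embd^2 for attention plus 8 * n_embd^2 for a 4x MLP.
    Embeddings, biases, and layer norms are ignored.
    """
    return 12 * n_layer * n_embd ** 2


def head_dim(n_embd: int, n_head: int) -> int:
    """Per-head width; n_embd must split evenly across the heads."""
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    return n_embd // n_head


configs = {
    "RTX 4090": {"n_layer": 12, "n_embd": 768, "n_head": 12},
    "RTX 3080": {"n_layer": 8, "n_embd": 512, "n_head": 8},
}
for name, cfg in configs.items():
    params = approx_transformer_params(cfg["n_layer"], cfg["n_embd"])
    width = head_dim(cfg["n_embd"], cfg["n_head"])
    print(f"{name}: ~{params / 1e6:.1f}M block weights, head_dim={width}")
```

Both configurations keep the same per-head width, so the reduction comes entirely from fewer layers and a narrower residual stream.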

What Actually Worked on My Hardware

┌─────────────────────────────────────────────────────────────────┐
│           My Autoresearch Consumer GPU Results                  │
├───────────┬──────────┬──────────────┬───────────────────────────┤
│ GPU       │ VRAM     │ Batch Config │ Actual Performance        │
├───────────┼──────────┼──────────────┼───────────────────────────┤
│ RTX 4090  │ 24GB     │ 16 x 4 accum │ Stable, ~20 min/exp       │
│ RTX 3090  │ 24GB     │ 16 x 4 accum │ Stable, ~25 min/exp       │
│ RTX 3080  │ 10GB     │ 4 x 16 accum │ Stable with smaller model │
│ RTX 4080  │ 16GB     │ 8 x 8 accum  │ Stable, ~22 min/exp       │
└───────────┴──────────┴──────────────┴───────────────────────────┘

The RTX 4090 ran smoothly with the full model size. The RTX 3080 needed reduced dimensions but still produced meaningful experiments.

Common Mistakes I Made

Mistake 1: Copying H100 Parameters Directly

RuntimeError: CUDA out of memory. Tried to allocate 2.5 GiB

I assumed “designed for adaptation” meant copy-paste would work. It doesn’t. Always validate that the batch size fits in VRAM before starting.

Mistake 2: Ignoring the Time Adjustment

My first runs crashed because the 5-minute timeout killed experiments mid-gradient accumulation. Consumer GPUs need longer windows.

Mistake 3: Not Watching VRAM During First Run

I started a long run without monitoring. It crashed 2 hours in at step 2000. Now I always run a short test first:

# Quick test before long run
python train.py --test_steps 100 --monitor_vram

# If peak VRAM > 22GB, reduce batch_size before full run

Why This Matters

Autoresearch on consumer GPUs opens automated ML experimentation to thousands of developers who don’t have enterprise infrastructure.

From the Reddit discussion:

“Community forks for consumer GPUs are already appearing” — LocalLLaMA commenter

“And it’s accessibly packaged (with simplicity and for consumer hardware)” — do-un-to

Even with slower throughput, the concepts transfer. Understanding memory optimization, gradient accumulation, and hardware adaptation is valuable learning itself.

Gradient Accumulation: A technique to simulate larger batch sizes on limited VRAM by accumulating gradients across multiple forward passes before updating weights.
Mixed Precision (BF16): Brain float 16 keeps the same 8-bit exponent as FP32 and shortens the mantissa to 7 bits, preserving numerical range for most ML workloads while halving the memory of every value stored in it.
Effective Batch Size: The number of samples the model “sees” per weight update, calculated as batch_size * gradient_accumulation_steps.

Practical Checklist

Before running autoresearch on your consumer GPU:

[ ] Check your GPU VRAM size
[ ] Estimate max batch size: (VRAM budget - weights, gradients, and optimizer state) / activation memory per sample, then confirm with a short test run
[ ] Set gradient_accumulation_steps to reach effective batch of 64
[ ] Enable BF16/FP16 mixed precision
[ ] Add VRAM logging to training output
[ ] Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
[ ] Run 100-step test before full experiment
[ ] Adjust experiment window time in program.md
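
The batch-size item on the checklist can be roughed out before touching the GPU. The sketch below assumes FP32 weights and gradients plus two FP32 AdamW moment buffers (16 bytes per parameter), which may not match the optimizer autoresearch actually uses. Activation memory per sample is left as an input because it has to be measured. The function names and the 1.0 GiB figure are placeholders.

```python
GIB = 1024 ** 3


def static_memory_gib(n_params: int, bytes_per_param: int = 16) -> float:
    """Memory for weights, gradients, and optimizer state, in GiB.

    Default of 16 bytes assumes FP32 weights (4) + FP32 gradients (4)
    + two FP32 AdamW moment buffers (8).
    """
    return n_params * bytes_per_param / GIB


def max_micro_batch(budget_gib: float, static_gib: float, per_sample_gib: float) -> int:
    """Largest micro-batch whose activations fit in what the static state leaves."""
    if per_sample_gib <= 0:
        raise ValueError("per_sample_gib must be positive")
    return max(0, int((budget_gib - static_gib) // per_sample_gib))


# Rough weight count for n_layer=12, n_embd=768 (12 * n_layer * n_embd^2)
n_params = 12 * 12 * 768 ** 2
static = static_memory_gib(n_params)

# per_sample_gib=1.0 is a placeholder; measure it from a short test run
limit = max_micro_batch(budget_gib=22, static_gib=static, per_sample_gib=1.0)
print(f"static state: {static:.2f} GiB, micro-batch limit: {limit}")
```

For a model this size the static state is small, so activations set the limit. That is why shrinking the micro-batch and enabling mixed precision are the first things to try.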

The Bottom Line

Yes, you can run autoresearch on consumer GPUs. Start with batch size 16 and gradient accumulation steps 4 for RTX 4090/3090. For 10GB GPUs, reduce model dimensions too. Monitor VRAM, extend experiment windows, and join the community forks already making this work.

The H100 runs 48 hours of experiments. My RTX 4090 runs the same kind of experiments over a weekend, just fewer of them. That’s still automated research, and it’s accessible.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Karpathy's autoresearch - GitHub
👨‍💻 Autoresearch on consumer GPU - Reddit Discussion
👨‍💻 PyTorch CUDA Memory Management
👨‍💻 HuggingFace Gradient Accumulation Guide

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!