How to Run Karpathy's Autoresearch on Consumer GPUs (RTX 4090/3090)
I downloaded Andrej Karpathy’s autoresearch project expecting to run automated ML experiments on my gaming PC. Then I read the documentation: “Single H100 is the benchmark machine.” That GPU costs roughly $30,000.
But Karpathy also wrote: “designed to be adapted to lower-end hardware.” So I tried it anyway.
RuntimeError: CUDA out of memory. Tried to allocate 2.5 GiB
GPU 0 has a total capacity of 23.6 GiB
Of which 20.1 GiB is already allocated

My RTX 4090 has 24GB VRAM. The default configuration expected 80GB. Here’s how I adapted autoresearch to work on consumer GPUs.
The Problem: H100 vs Consumer GPU
Karpathy ran autoresearch for 48 hours on a single H100. That GPU has:
┌───────────────────────────────────────────────────────────────┐
│ H100 vs Consumer GPU Comparison                               │
├──────────────────┬─────────────────┬──────────────────────────┤
│ Specification    │ H100            │ RTX 4090                 │
├──────────────────┼─────────────────┼──────────────────────────┤
│ VRAM             │ 80GB            │ 24GB                     │
│ Memory Bandwidth │ 3.35 TB/s       │ 1.0 TB/s                 │
│ Price            │ $25,000-40,000  │ $1,600                   │
│ Availability     │ Enterprise only │ Consumer market          │
│ Experiment Time  │ 5 min windows   │ 15-30 min windows needed │
└──────────────────┴─────────────────┴──────────────────────────┘

The H100’s 80GB VRAM lets it run large batch sizes in memory. My 24GB RTX 4090 can’t fit those configurations directly.
Strategy 1: Reduce Batch Size + Gradient Accumulation
The first change I made was reducing the batch size. But smaller batches mean less stable gradients. The solution: gradient accumulation.
# Original H100 configuration
config_h100 = {
    'batch_size': 64,
    'gradient_accumulation_steps': 1,
    'n_embd': 768,
    'n_layer': 12,
    'n_head': 12,
}

# My RTX 4090 adaptation (24GB VRAM)
config_4090 = {
    'batch_size': 16,                  # Reduced to fit VRAM
    'gradient_accumulation_steps': 4,  # Effective batch still 64
    'n_embd': 768,                     # Keep same model size
    'n_layer': 12,
    'n_head': 12,
    'mixed_precision': 'bf16',         # Enable BF16 training
}

The math: effective_batch_size = batch_size * gradient_accumulation_steps. My 16 * 4 = 64, matching the H100’s effective batch size.
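The effective-batch arithmetic above can be turned into a small helper for picking accumulation steps. This is only a sketch: the function name `accumulation_steps_for` is hypothetical and not part of autoresearch, and it assumes the target effective batch is an exact multiple of the micro-batch size.

```python
def accumulation_steps_for(target_effective_batch: int, micro_batch_size: int) -> int:
    """Return how many accumulation steps reach the target effective batch size."""
    if target_effective_batch <= 0 or micro_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    if target_effective_batch % micro_batch_size != 0:
        raise ValueError("target effective batch must be a multiple of the micro-batch size")
    return target_effective_batch // micro_batch_size


# Micro-batch sizes used in this article, all targeting an effective batch of 64
for micro_batch in (16, 8, 4):
    steps = accumulation_steps_for(64, micro_batch)
    print(f"batch_size={micro_batch} x accumulation={steps} -> effective {micro_batch * steps}")
```

Rejecting non-multiples is a deliberate choice here: silently rounding would change the effective batch size and make runs harder to compare.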
Here’s how gradient accumulation works in the training loop:
accumulation_steps = config['gradient_accumulation_steps']
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = outputs.loss / accumulation_steps  # Scale loss down
    loss.backward()

    # Only update weights after accumulating gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Each forward pass uses only 16 samples, but gradients accumulate across 4 passes before updating weights. The model “sees” 64 samples per update, just like the H100 version.
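To see why dividing the loss by the number of accumulation steps reproduces the full-batch gradient, here is a minimal pure-Python check. It is an illustration under simple assumptions (a toy one-parameter model y = w * x, mean squared error, equally sized micro-batches), not code from autoresearch.

```python
def mse_gradient(w, samples):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


# 64 synthetic (x, y) pairs; the data itself is arbitrary
samples = [(float(x), 3.0 * x + 1.0) for x in range(1, 65)]
w = 0.5

# Gradient computed over all 64 samples at once
full_batch_grad = mse_gradient(w, samples)

# Same gradient accumulated from 4 micro-batches of 16, each scaled by 1/4
micro_batch_size = 16
accumulation_steps = len(samples) // micro_batch_size
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = samples[start:start + micro_batch_size]
    accumulated_grad += mse_gradient(w, micro_batch) / accumulation_steps

print(full_batch_grad, accumulated_grad)  # equal up to floating-point rounding
```

The match is exact only when every micro-batch has the same size and the loss is a per-sample mean. Layers that compute statistics across the batch, such as batch normalization, still see only the micro-batch.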
Strategy 2: Enable Mixed Precision Training
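BF16 (bfloat16) stores each value in 16 bits instead of 32, which roughly halves the memory taken by activations under autocast (the FP32 weights and optimizer state keep their size). The RTX 4090 handles BF16 natively.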
import torch

# BF16 autocast. No GradScaler is needed: loss scaling only matters for FP16,
# whose narrow exponent range lets small gradients underflow.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(batch)
    loss = outputs.loss

loss.backward()
optimizer.step()
optimizer.zero_grad()

BF16 preserves numerical stability better than FP16 because it keeps the same exponent range as FP32. For autoresearch’s experimentation loop, the slight precision loss is acceptable.
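The range-versus-precision trade-off between BF16 and FP16 can be checked without a GPU. The sketch below emulates BF16 rounding with Python's `struct` module by keeping the top 16 bits of a float32. The helper names `round_to_bf16` and `fits_in_fp16` are made up for this illustration, and the rounding helper only handles ordinary finite values.

```python
import struct


def round_to_bf16(value: float) -> float:
    """Round a finite float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_in_fp16(value: float) -> bool:
    """True if the value can be packed as IEEE half precision without overflowing."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


print(round_to_bf16(70000.0))  # 70144.0: in range, but coarsely rounded
print(fits_in_fp16(70000.0))   # False: FP16 tops out at 65504
print(round_to_bf16(1.001))    # 1.0: BF16 keeps only 7 mantissa bits
```

BF16 trades mantissa bits for exponent bits: large values survive, while fine differences near 1.0 are lost.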
Strategy 3: Monitor VRAM Usage
Consumer GPUs hit OOM more easily. I added VRAM logging to catch issues early:
import torch
def log_vram_usage(step):
    allocated = torch.cuda.memory_allocated() / 1024**2  # MB
    reserved = torch.cuda.memory_reserved() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2

    print(f"Step {step}: "
          f"{allocated:.1f}MB used, "
          f"{reserved:.1f}MB reserved, "
          f"{peak:.1f}MB peak")

    # Reset peak for next measurement
    torch.cuda.reset_peak_memory_stats()

    return allocated

# Call during training
for step, batch in enumerate(dataloader):
    loss = train_step(batch)
    if step % 100 == 0:
        vram = log_vram_usage(step)
        if vram > 22000:  # Approaching 24GB limit
            print("WARNING: VRAM usage high!")

I also set this environment variable to help PyTorch manage memory better:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Strategy 4: Adjust Experiment Time Windows
Karpathy’s H100 ran experiments in 5-minute windows. My RTX 4090 needs 15-30 minutes per experiment due to slower processing. I modified program.md:
## Output format
val_bpb: X.XXXXXX
training_seconds: 1800        # 30 min window (not 5 min)
peak_vram_mb: XXXXX.X         # Added: monitor VRAM usage
effective_batch_size: 64      # batch_size * grad_accum
## Hardware Constraints
- Max VRAM: 24GB (RTX 4090)
- Peak memory usage must stay under 22GB (leave buffer)
- If OOM occurs: reduce batch_size first, then increase gradient_accumulation_steps
- Consider overnight runs instead of continuous 48-hour sessions

The longer windows mean fewer experiments per hour, but each experiment still runs autonomously.
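The OOM rule in the hardware constraints above can be sketched as a function that halves the micro-batch and doubles the accumulation steps, so the effective batch size does not change. This is one possible reading of the rule. The name `shrink_after_oom` is hypothetical and does not exist in autoresearch.

```python
def shrink_after_oom(batch_size: int, accumulation_steps: int) -> tuple[int, int]:
    """Halve the micro-batch and double accumulation, keeping the effective batch fixed."""
    if batch_size < 2 or batch_size % 2 != 0:
        raise ValueError("cannot halve batch_size further; reduce the model size instead")
    return batch_size // 2, accumulation_steps * 2


# Starting from the RTX 4090 settings in this article
config = (16, 4)
for _ in range(3):
    config = shrink_after_oom(*config)
    print(config)  # (8, 8), then (4, 16), then (2, 32)
```

Each step keeps `batch_size * gradient_accumulation_steps` at 64. Once the micro-batch cannot shrink any further, the remaining option is a smaller model.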
Strategy 5: Smaller Model for Lower-End GPUs
For my secondary RTX 3080 (10GB VRAM), I needed more aggressive scaling:
# RTX 3080 (10GB VRAM) - more aggressive changes
config_3080 = {
    'batch_size': 4,
    'gradient_accumulation_steps': 16,  # Effective batch 64
    'n_embd': 512,                      # Reduced from 768
    'n_layer': 8,                       # Reduced from 12
    'n_head': 8,                        # Reduced from 12
    'mixed_precision': 'fp16',          # FP16 for older GPU
}

Smaller model dimensions mean faster training but potentially lower peak performance. For autoresearch’s exploration phase, this trade-off is acceptable: the agent can discover what works given hardware constraints.
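To get a feel for how much the reduced dimensions shrink the model, a common back-of-the-envelope estimate counts about 12 * n_layer * n_embd^2 weights in the transformer blocks. The sketch below makes several assumptions: a standard GPT-style block with a 4x MLP expansion, with embeddings, biases, and layer norms ignored. It is an approximation, not a measurement of the actual autoresearch model.

```python
def transformer_block_params(n_layer: int, n_embd: int) -> int:
    """Approximate block weights: 4*d^2 for attention plus 8*d^2 for a 4x MLP, per layer."""
    return 12 * n_layer * n_embd ** 2


full = transformer_block_params(n_layer=12, n_embd=768)
small = transformer_block_params(n_layer=8, n_embd=512)

print(f"full model blocks:  {full:,}")   # 84,934,656
print(f"small model blocks: {small:,}")  # 25,165,824
print(f"reduction factor:   {full / small}")  # 3.375
```

By this estimate the 10GB configuration has roughly 3.4x fewer block weights, which also shrinks gradients and optimizer state by the same factor.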
What Actually Worked on My Hardware
┌────────────────────────────────────────────────────────────┐
│ My Autoresearch Consumer GPU Results                       │
├──────────┬──────┬──────────────┬───────────────────────────┤
│ GPU      │ VRAM │ Batch Config │ Actual Performance        │
├──────────┼──────┼──────────────┼───────────────────────────┤
│ RTX 4090 │ 24GB │ 16 x 4 accum │ Stable, ~20 min/exp       │
│ RTX 3090 │ 24GB │ 16 x 4 accum │ Stable, ~25 min/exp       │
│ RTX 3080 │ 10GB │ 4 x 16 accum │ Stable with smaller model │
│ RTX 4080 │ 16GB │ 8 x 8 accum  │ Stable, ~22 min/exp       │
└──────────┴──────┴──────────────┴───────────────────────────┘

The RTX 4090 ran smoothly with the full model size. The RTX 3080 needed reduced dimensions but still produced meaningful experiments.
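The per-experiment times in the table translate directly into how many experiments fit in a run. The sketch below is plain arithmetic with a hypothetical helper name, `experiments_per_run`. It gives an upper bound, since it ignores any time the agent spends between experiments.

```python
def experiments_per_run(run_hours: float, minutes_per_experiment: float) -> int:
    """Upper bound on completed experiments in a run of the given length."""
    return int(run_hours * 60 // minutes_per_experiment)


print(experiments_per_run(48, 5))   # 576: 5-minute windows over 48 hours
print(experiments_per_run(48, 20))  # 144: ~20 min/exp, as on the RTX 4090
print(experiments_per_run(8, 20))   # 24: one overnight run at ~20 min/exp
```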
Common Mistakes I Made
Mistake 1: Copying H100 Parameters Directly
RuntimeError: CUDA out of memory. Tried to allocate 2.5 GiB

I assumed “designed for adaptation” meant copy-paste would work. It doesn’t. Always validate batch size fits VRAM before starting.
Mistake 2: Ignoring the Time Adjustment
My first runs crashed because the 5-minute timeout killed experiments mid-gradient accumulation. Consumer GPUs need longer windows.
Mistake 3: Not Watching VRAM During First Run
I started a long run without monitoring. It crashed 2 hours in at step 2000. Now I always run a short test first:
# Quick test before long run
python train.py --test_steps 100 --monitor_vram

# If peak VRAM > 22GB, reduce batch_size before full run
Why This Matters
Autoresearch on consumer GPUs opens automated ML experimentation to thousands of developers who don’t have enterprise infrastructure.
From the Reddit discussion:
“Community forks for consumer GPUs are already appearing” — LocalLLaMA commenter
“And it’s accessibly packaged (with simplicity and for consumer hardware)” — do-un-to
Even with slower throughput, the concepts transfer. Understanding memory optimization, gradient accumulation, and hardware adaptation is valuable learning itself.
Related Knowledge
- Gradient Accumulation: A technique to simulate larger batch sizes on limited VRAM by accumulating gradients across multiple forward and backward passes before updating weights.
- Mixed Precision (BF16): Brain float 16 keeps the same 8-bit exponent range as FP32 while reducing the mantissa to 7 bits, preserving numerical stability for most ML workloads while halving the storage per value.
- Effective Batch Size: The number of samples the model “sees” per weight update, calculated as batch_size * gradient_accumulation_steps.
Practical Checklist
Before running autoresearch on your consumer GPU:
[ ] Check your GPU VRAM size
[ ] Estimate max batch size (rough heuristic: VRAM / (model_size * sequence_length)), then confirm with a short test run
[ ] Set gradient_accumulation_steps to reach effective batch of 64
[ ] Enable BF16/FP16 mixed precision
[ ] Add VRAM logging to training output
[ ] Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
[ ] Run 100-step test before full experiment
[ ] Adjust experiment window time in program.md
The Bottom Line
Yes, you can run autoresearch on consumer GPUs. Start with batch size 16 and gradient accumulation steps 4 for RTX 4090/3090. For 10GB GPUs, reduce model dimensions too. Monitor VRAM, extend experiment windows, and join the community forks already making this work.
The H100 runs 48 hours of experiments. My RTX 4090 runs the same experiments over a weekend. That’s still automated research—and it’s accessible.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Karpathy's autoresearch - GitHub
- Autoresearch on consumer GPU - Reddit Discussion
- PyTorch CUDA Memory Management
- HuggingFace Gradient Accumulation Guide
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!