How to Apply Autoresearch Patterns to Non-ML Optimization Problems

Mar 30, 2026

The Problem

I wanted to optimize my API response times. I’d heard about Andrej Karpathy’s autoresearch project - an autonomous AI loop that improves code by itself. But when I looked at it, everything seemed focused on machine learning:

At first glance, autoresearch exists to:
- Train neural networks automatically
- Adjust hyperparameters
- Optimize model weights
- Minimize loss functions

I don’t train ML models. I tune API configurations, optimize SQL queries, and improve system prompts. Was autoresearch useless for me?

Then I found this insight from a Reddit discussion:

"The most underrated application is applying the pattern to non-ML optimization problems -
any domain where you have a clear metric, a fixed experiment budget, and a bounded action space."

This changed my understanding. The autoresearch pattern isn’t about ML - it’s about the loop.

What is Really Happening?

The misconception comes from Karpathy’s framing. His original project used ML terminology:

LOOP FOREVER:
  1. Propose change to the training code
  2. git commit
  3. Run training epoch -> get loss score
  4. Loss improved? Keep. Loss worse? git reset.
  5. Log results. Repeat.

This made people think autoresearch only works for neural networks. But look closer at what actually makes it work:

Requirement	What It Means	ML Example	Non-ML Example
Clear Metric	Quantifiable success measure	Loss value	API latency (ms)
Bounded Action Space	Limited change options	Weight adjustments	Config parameters
Experiment Budget	Fixed time/compute	Training epochs	Overnight runs
Fast Evaluation	Quick scoring	Loss calculation	Benchmark suite

The loop design is domain-agnostic. “If you can measure it, you can optimize it.”

How I Tried to Apply It

My First Attempt: Unbounded Changes

I told the AI agent to “optimize API performance”:

Baseline: 245ms p95 latency

Experiment 1:
- Change: "Optimize API endpoints"
- Agent modified: database schema, API routes, caching logic, connection pooling
- Result: System broke, tests failed
- Status: REVERT

Experiment 2:
- Change: "Improve database queries"
- Agent modified: index structure, query patterns, table relationships
- Result: Migration conflict, data integrity issues
- Status: REVERT

This failed because my action space was unbounded. The agent explored too widely and broke everything.

My Second Attempt: Bounded Action Space

I learned from the mistake. I constrained what the agent could modify:

# Bounded optimization space - agent can ONLY modify these

cache:
  ttl_seconds: 300          # Bound: 60-3600
  max_entries: 1000         # Bound: 100-10000

connection_pool:
  min_size: 5               # Bound: 1-20
  max_size: 50              # Bound: 10-100
  timeout_ms: 5000          # Bound: 1000-30000

Now the agent had clear boundaries. It couldn’t touch the database schema or API routes.
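One way to enforce those boundaries is to check every proposed config before it is benchmarked. The sketch below is a hypothetical helper, not part of autoresearch-anything: `BOUNDS` simply restates the "Bound:" comments from the config above, and `validateConfig` is a name chosen for illustration.

```javascript
// Allowed ranges, copied from the "Bound:" comments in the config above.
const BOUNDS = {
  'cache.ttl_seconds': [60, 3600],
  'cache.max_entries': [100, 10000],
  'connection_pool.min_size': [1, 20],
  'connection_pool.max_size': [10, 100],
  'connection_pool.timeout_ms': [1000, 30000],
};

// Hypothetical helper: returns a list of violations (empty means acceptable).
function validateConfig(config, bounds = BOUNDS) {
  const errors = [];
  // Walk { cache: { ttl_seconds: 300 } } as "cache.ttl_seconds" paths.
  for (const [section, values] of Object.entries(config)) {
    for (const [key, value] of Object.entries(values)) {
      const path = `${section}.${key}`;
      const range = bounds[path];
      if (!range) {
        errors.push(`${path}: not in the allowed action space`);
      } else if (!Number.isInteger(value) || value < range[0] || value > range[1]) {
        errors.push(`${path}: ${value} is outside ${range[0]}-${range[1]}`);
      }
    }
  }
  return errors;
}

// Example proposal with one out-of-range value (max_size: 150).
const proposal = {
  cache: { ttl_seconds: 600, max_entries: 1000 },
  connection_pool: { min_size: 5, max_size: 150, timeout_ms: 5000 },
};
console.log(validateConfig(proposal));
```

A proposal that returns any violations can be rejected before it costs a benchmark run, which keeps the experiment budget for changes that stay inside the action space.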

Baseline: 245ms p95 latency, 1000 req/s throughput (all deltas below are relative to this baseline)

Experiment 1: Increase cache TTL
- Change: cache.ttl_seconds: 300 -> 600
- Result: 198ms p95, 1200 req/s
- Delta: -47ms latency, +200 throughput
- Decision: KEEP

Experiment 2: Reduce connection pool
- Change: connection_pool.max_size: 50 -> 30
- Result: 310ms p95, 800 req/s
- Delta: +65ms latency, -200 throughput
- Decision: REVERT

Experiment 3: Combined adjustment
- Change: cache.ttl_seconds: 600 (kept), connection_pool.max_size: 50 -> 40
- Result: 180ms p95, 1100 req/s
- Delta: -65ms latency, +100 throughput
- Decision: KEEP (best result)

This worked. After 50 experiments run overnight, p95 latency dropped from 245ms to 180ms, a reduction of about 27%.
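The deltas in that log are measured against the original 245ms / 1000 req/s baseline, not against the previous experiment. A small sketch that recomputes them from the numbers above; the `delta` helper and the record shape are illustrative, not something the tool produces.

```javascript
const baseline = { latencyMs: 245, throughput: 1000 };
const experiments = [
  { name: 'Increase cache TTL', latencyMs: 198, throughput: 1200 },
  { name: 'Reduce connection pool', latencyMs: 310, throughput: 800 },
  { name: 'Combined adjustment', latencyMs: 180, throughput: 1100 },
];

// Difference between one result and the baseline, per metric.
function delta(base, result) {
  return {
    latencyMs: result.latencyMs - base.latencyMs,
    throughput: result.throughput - base.throughput,
  };
}

// Format a number with an explicit sign, e.g. +200 or -47.
const sign = (n) => (n > 0 ? `+${n}` : `${n}`);

for (const exp of experiments) {
  const d = delta(baseline, exp);
  console.log(`${exp.name}: ${sign(d.latencyMs)}ms latency, ${sign(d.throughput)} throughput`);
}

// Relative p95 improvement of the final kept run, in percent.
const best = experiments[2];
const improvementPct = ((baseline.latencyMs - best.latencyMs) / baseline.latencyMs) * 100;
console.log(`p95 improvement: ${improvementPct.toFixed(1)}%`);
```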

The Core Pattern

I realized the autoresearch pattern works for any optimization problem that meets three criteria, with a fixed experiment budget capping how long the loop runs:

┌─────────────────────────────────────────────────────────────┐
│                    AUTORESEARCH PATTERN                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  REQUIREMENT 1: Clear Metric                                │
│  ├── Must be quantifiable (number, percentage, time)        │
│  ├── Must have clear optimization direction (min/max)       │
│  └── Example: latency_p95 (min), conversion_rate (max)      │
│                                                              │
│  REQUIREMENT 2: Bounded Action Space                        │
│  ├── Limited set of possible changes                        │
│  ├── Prevents chaotic exploration                           │
│  └── Example: cache TTL 60-3600, pool size 10-100          │
│                                                              │
│  REQUIREMENT 3: Fast Evaluation                             │
│  ├── Quick scoring mechanism                                │
│  ├── Deterministic results preferred                        │
│  └── Example: benchmark suite, automated tests              │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The generalized loop:

1. Define metric and optimization direction
2. Set action space boundaries
3. Establish experiment budget
4. LOOP:
   a. Generate modification hypothesis
   b. Apply bounded change
   c. Run fast evaluation
   d. Compare score to baseline
   e. Keep if improved, revert if degraded
   f. Log results and insights
   g. Repeat until budget exhausted
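The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the autoresearch-anything implementation: `runLoop`, `propose`, and `evaluate` are hypothetical names, the evaluation is synchronous for brevity, and the "baseline" is the best score kept so far. The toy `propose` and `evaluate` functions exist only so the sketch runs; in practice an AI agent proposes and a benchmark evaluates.

```javascript
// Minimal keep-or-revert loop with a fixed experiment budget.
function runLoop({ initial, propose, evaluate, direction = 'min', budget = 50 }) {
  const better = direction === 'min' ? (a, b) => a < b : (a, b) => a > b;
  let best = { candidate: initial, score: evaluate(initial) }; // baseline
  const log = [];

  for (let i = 1; i <= budget; i++) {
    const candidate = propose(best.candidate, i); // a + b: hypothesis as a bounded change
    const score = evaluate(candidate);            // c: fast evaluation
    const keep = better(score, best.score);       // d: compare to the current best
    if (keep) best = { candidate, score };        // e: keep; otherwise the change is dropped
    log.push({ iteration: i, candidate, score, decision: keep ? 'KEEP' : 'REVERT' }); // f
  }                                               // g: stop when the budget is spent
  return { best, log };
}

// Toy stand-ins: tune a single cache TTL within the 60-3600 bound.
const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));
const steps = [300, -150, 600, -300];
const result = runLoop({
  initial: { ttl: 300 },
  propose: (current, i) => ({ ttl: clamp(current.ttl + steps[i % steps.length], 60, 3600) }),
  evaluate: (cfg) => 180 + Math.abs(cfg.ttl - 900) / 10, // fake latency curve, lowest at ttl = 900
  budget: 8,
});
console.log(result.best, result.log.length);
```

Because rejected candidates never replace `best`, a "revert" here is simply not adopting the change; with files on disk, the equivalent would be the `git reset` step from the original loop.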

Other Domains Where This Works

The autoresearch-anything project demonstrates successful non-ML applications:

Domain 1: System Prompt Optimization

Metric: Task completion rate on eval suite
Action Space: Prompt templates, instruction phrasing
Budget: 50 iterations overnight
Evaluation: Run 100 test cases, measure pass rate

An AI agent that generates React components optimized its own prompts through this loop.
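A pass-rate metric like this is easy to compute once each test case carries its own check. The sketch below assumes a hypothetical `generate` function standing in for "run the prompt against the model", and a made-up test-case shape of `{ input, check }`; neither comes from the project itself.

```javascript
// Hypothetical eval-suite scorer: fraction of test cases whose check passes.
function passRate(generate, testCases) {
  let passed = 0;
  for (const { input, check } of testCases) {
    try {
      if (check(generate(input))) passed += 1;
    } catch {
      // A crash counts as a failed case instead of aborting the whole eval.
    }
  }
  return passed / testCases.length;
}

// Toy stand-in for the model call, so the sketch runs on its own.
const fakeModel = (input) => (input.includes('button') ? '<button>OK</button>' : '');
const cases = [
  { input: 'a button', check: (out) => out.startsWith('<button') },
  { input: 'a form', check: (out) => out.startsWith('<form') },
  { input: 'a submit button', check: (out) => out.includes('</button>') },
  { input: 'a table', check: (out) => out.startsWith('<table') },
];
console.log(`pass_rate: ${passRate(fakeModel, cases)}`);
```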

Domain 2: SQL Query Optimization

Metric: Query execution time (minimize)
Action Space: Index suggestions, join order hints
Budget: Database compute budget
Evaluation: Run against sample data, measure latency

Domain 3: Test Suite Optimization

Metric: Test execution time (minimize) with coverage maintained
Action Space: Test ordering, parallelization config, timeouts
Budget: CI/CD pipeline slots
Evaluation: Run full suite, measure duration and pass rate

Domain 4: Landing Page Conversion

Metric: Conversion rate (maximize)
Action Space: Headlines, CTA text, layout variants
Budget: A/B test infrastructure limits
Evaluation: Track user actions over time window

The autoresearch-anything project lists even more domains: genealogy research, trading agents, GPU kernel optimization, sudoku solvers, Shopify Liquid templates, biomechanics analysis, and tennis prediction models.

Common Mistakes I Made

Mistake 1: Unbounded Action Space

WRONG:
  Action Space: "Any code change to improve performance"
  Result: Agent explores chaotically and never converges

CORRECT:
  Action Space: "Cache TTL values between 60-3600 seconds"
  Result: Agent explores systematically and finds an optimum

Mistake 2: Slow or Noisy Evaluation

WRONG:
  Evaluation: "Deploy to production, wait 24 hours, check metrics"
  Result: 24 hours per experiment = 1 experiment per day

CORRECT:
  Evaluation: "Run synthetic benchmark suite, extract p95 latency"
  Result: 10 minutes per experiment = 144 experiments per day
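
The example above covers the "slow" half of this mistake. For the "noisy" half, one common approach (an assumption here, not something the setup in this post is known to do) is to repeat the benchmark a few times and score the median, so a single spike does not flip a keep/revert decision. `scoreWithRepeats` and `runOnce` are hypothetical names, and the readings are canned values, not measurements.

```javascript
// Median of several readings; robust to a single outlier.
function median(values) {
  if (values.length === 0) throw new Error('median of empty list');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical wrapper: `runOnce` is whatever produces one p95 reading.
function scoreWithRepeats(runOnce, repeats = 5) {
  const readings = [];
  for (let i = 0; i < repeats; i++) readings.push(runOnce());
  return median(readings);
}

// Canned readings with one outlier (340), standing in for real benchmark runs.
const samples = [201, 198, 340, 199, 197];
let next = 0;
console.log(scoreWithRepeats(() => samples[next++], samples.length));
```

The trade-off is that each repeat multiplies the time per experiment, so fewer experiments fit in the same budget.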

Mistake 3: Missing Baseline

WRONG:
  "Try random changes and see what happens"
  Result: No way to distinguish improvement from regression

CORRECT:
  "Baseline: 245ms. Keep changes that reduce by >10ms"
  Result: Clear success criteria

Mistake 4: Single Metric Blindness

WRONG:
  "Minimize latency at all costs"
  Result: Latency drops but throughput collapses, costs explode

CORRECT:
  "Minimize latency while maintaining throughput >1000 req/s"
  Result: Balanced optimization
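
Mistakes 3 and 4 can both be handled by one small decision rule: require a minimum gain over the baseline, and reject anything that violates the secondary constraint. The sketch below is illustrative; `decide` is a hypothetical function, and its defaults mirror the numbers quoted above (a gain of more than 10ms, throughput above 1000 req/s).

```javascript
// Keep-or-revert rule: minimum latency gain plus a throughput floor.
function decide(current, candidate, { minGainMs = 10, minThroughput = 1000 } = {}) {
  // Constraint first: a faster result that breaks throughput is still a revert.
  if (candidate.throughput <= minThroughput) {
    return { keep: false, reason: 'throughput constraint violated' };
  }
  const gain = current.latencyMs - candidate.latencyMs;
  if (gain <= minGainMs) {
    return { keep: false, reason: `gain of ${gain}ms does not exceed ${minGainMs}ms` };
  }
  return { keep: true, reason: `latency improved by ${gain}ms` };
}

const base = { latencyMs: 245, throughput: 1000 };
console.log(decide(base, { latencyMs: 198, throughput: 1200 })); // from Experiment 1
console.log(decide(base, { latencyMs: 310, throughput: 800 }));  // from Experiment 2
console.log(decide(base, { latencyMs: 150, throughput: 700 }));  // made-up: fast but starved
```

Checking the constraint before the gain means "latency at all costs" results are rejected no matter how large the improvement looks.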

Setting Up Non-ML Autoresearch

I used the autoresearch-anything npm package. Here’s the interactive setup:

$ npx autoresearch-anything

╔═══════════════════════════════════════════╗
║        autoresearch-anything              ║
║   Autonomous AI improvement loop setup    ║
╚═══════════════════════════════════════════╝

Briefly describe your project: API response time optimization
What file(s) should the agent edit?: config.yaml
What's your metric called?: latency_p95
Should the metric go up or down?: down
What command runs your eval?: ./run_benchmark.sh
What does the score line look like?: latency_p95: 245ms
Track a secondary constraint? [y/N]: y
What's the secondary metric?: throughput
How does it appear?: throughput: 1000
Max time per experiment?: 10
Files the agent must NOT modify?: benchmark_tests/

The evaluation script prints in a format the agent can parse:

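// './benchmark' is this project's own load-test helper
const { runBenchmark } = require('./benchmark');

async function evaluate() {
  // Run a 60-second load test with 10 concurrent clients
  const results = await runBenchmark({
    duration: '60s',
    concurrency: 10
  });

  // Print one "name: value" line per metric, matching the setup answers above
  console.log(`latency_p95: ${results.timing.p95}ms`);
  console.log(`throughput: ${results.requests.perSecond}`);
}

// Exit non-zero on failure so a crashed benchmark is not mistaken for a score
evaluate().catch((err) => {
  console.error(err);
  process.exit(1);
});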
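
To make "a format the agent can parse" concrete, here is a sketch of reading those score lines back out of the eval output. How autoresearch-anything actually parses scores is not covered in this post, so `parseScores` and its regular expression are assumptions for illustration only.

```javascript
// Hypothetical parser for "name: value" score lines such as "latency_p95: 245ms".
function parseScores(output) {
  const scores = {};
  for (const line of output.split('\n')) {
    // metric name, colon, a number, then an optional unit like "ms" or "%"
    const match = line.match(/^\s*([A-Za-z_][\w.]*):\s*(-?\d+(?:\.\d+)?)\s*[A-Za-z%\/]*\s*$/);
    if (match) scores[match[1]] = Number(match[2]);
  }
  return scores;
}

// Non-score lines (like the warm-up message) are ignored.
const sample = 'Warming up...\nlatency_p95: 245ms\nthroughput: 1000\n';
console.log(parseScores(sample));
```

Keeping the output to one metric per line, with the name first, is what makes this kind of parsing trivial; free-form log text around the score lines does no harm as long as it does not look like a metric.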

Why This Matters

Many developers have optimization problems that could benefit from autonomous improvement loops, but they don’t recognize that the pattern applies because:

- They associate autoresearch with ML training
- They don’t realize the loop generalizes to other domains
- They underestimate what AI agents can do for non-code optimization

The key insight from practitioners:

"The LLM training framing is compelling but the real insight is
the loop design, not the domain."

Once you understand this, you see optimization opportunities everywhere:

Your CI pipeline takes 45 minutes? Autoresearch can optimize test ordering.
Your API has 300ms latency? Autoresearch can tune configurations.
Your prompts fail 30% of tasks? Autoresearch can iterate phrasing.
Your SQL queries time out? Autoresearch can suggest indexes.

Summary

In this post, I showed how to apply the autoresearch pattern to non-ML optimization problems. The key insight is the loop design, not the domain.

The pattern works for any problem with:

Clear metric (quantifiable, optimization direction)
Bounded action space (limited change options)
Fast evaluation (quick scoring mechanism)

I learned through trial and error: an unbounded action space fails, slow evaluation limits iterations, a missing baseline confuses results, and single-metric optimization breaks constraints.

The autoresearch-anything project shows this works across domains: system prompts, API performance, SQL queries, test suites, landing pages, and more. “If you can measure it, you can optimize it.”

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 autoresearch-anything GitHub Repository
👨‍💻 Karpathy's Original Autoresearch Project
👨‍💻 Reddit Discussion on Autoresearch Pattern Insights
👨‍💻 Context7 Documentation Search

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!