
How to Apply Autoresearch Patterns to Non-ML Optimization Problems

The Problem

I wanted to optimize my API response times. I’d heard about Andrej Karpathy’s autoresearch project - an autonomous AI loop that improves code by itself. But when I looked at it, everything seemed focused on machine learning:

Autoresearch ML focus
The autoresearch project:
- Train neural networks automatically
- Adjust hyperparameters
- Optimize model weights
- Minimize loss functions

I don’t train ML models. I tune API configurations, optimize SQL queries, and improve system prompts. Was autoresearch useless for me?

Then I found this insight from a Reddit discussion:

Reddit insight
"The most underrated application is applying the pattern to non-ML optimization problems -
any domain where you have a clear metric, a fixed experiment budget, and a bounded action space."

This changed my understanding. The autoresearch pattern isn’t about ML - it’s about the loop.

What Is Really Happening?

The misconception comes from Karpathy’s framing. His original project used ML terminology:

Original autoresearch loop
LOOP FOREVER:
1. Propose change to model weights
2. git commit
3. Run training epoch -> get loss score
4. Loss improved? Keep. Loss worse? git reset.
5. Log results. Repeat.

This made people think autoresearch only works for neural networks. But look closer at what actually makes it work:

| Requirement | What It Means | ML Example | Non-ML Example |
| --- | --- | --- | --- |
| Clear Metric | Quantifiable success measure | Loss value | API latency (ms) |
| Bounded Action Space | Limited change options | Weight adjustments | Config parameters |
| Experiment Budget | Fixed time/compute | Training epochs | Overnight runs |
| Fast Evaluation | Quick scoring | Loss calculation | Benchmark suite |

The loop design is domain-agnostic. “If you can measure it, you can optimize it.”

How I Tried to Apply It

My First Attempt: Unbounded Changes

I told the AI agent to “optimize API performance”:

First experiment log
Baseline: 245ms p95 latency
Experiment 1:
- Change: "Optimize API endpoints"
- Agent modified: database schema, API routes, caching logic, connection pooling
- Result: System broke, tests failed
- Status: REVERT
Experiment 2:
- Change: "Improve database queries"
- Agent modified: index structure, query patterns, table relationships
- Result: Migration conflict, data integrity issues
- Status: REVERT

This failed because my action space was unbounded. The agent explored too widely and broke everything.

My Second Attempt: Bounded Action Space

I learned from the mistake. I constrained what the agent could modify:

config.yaml
# Bounded optimization space - agent can ONLY modify these
cache:
  ttl_seconds: 300     # Bound: 60-3600
  max_entries: 1000    # Bound: 100-10000
connection_pool:
  min_size: 5          # Bound: 1-20
  max_size: 50         # Bound: 10-100
  timeout_ms: 5000     # Bound: 1000-30000

Now the agent had clear boundaries. It couldn’t touch the database schema or API routes.
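To make those boundaries enforceable rather than just documented, a small check can reject any proposed change before it is applied. This is a minimal sketch, not part of autoresearch itself: the `BOUNDS` map and `validateChange` function are hypothetical names, the dotted keys assume `timeout_ms` is nested under `connection_pool`, and the ranges are copied from the comments in config.yaml.

```javascript
// Hypothetical bounds table mirroring the "Bound:" comments in config.yaml.
// Dotted keys assume the nesting cache.* and connection_pool.*.
const BOUNDS = new Map([
  ['cache.ttl_seconds', [60, 3600]],
  ['cache.max_entries', [100, 10000]],
  ['connection_pool.min_size', [1, 20]],
  ['connection_pool.max_size', [10, 100]],
  ['connection_pool.timeout_ms', [1000, 30000]],
]);

// Returns a list of violations for a proposed change set such as
// { 'cache.ttl_seconds': 600 }. An empty list means the change is allowed.
function validateChange(change) {
  const violations = [];
  for (const [key, value] of Object.entries(change)) {
    const range = BOUNDS.get(key);
    if (!range) {
      violations.push(`${key}: not in the allowed action space`);
      continue;
    }
    const [min, max] = range;
    if (!Number.isInteger(value) || value < min || value > max) {
      violations.push(`${key}: ${value} is outside ${min}-${max}`);
    }
  }
  return violations;
}

console.log(validateChange({ 'cache.ttl_seconds': 600 })); // allowed, no violations
console.log(validateChange({ 'cache.ttl_seconds': 30 }));  // below the lower bound
console.log(validateChange({ 'db.schema': 1 }));           // not a tunable key
```

Rejecting unknown keys outright is the part that matters most here: it is what would have stopped the schema and route changes from my first attempt.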

Bounded experiment log
Baseline: 245ms p95 latency, 1000 req/s throughput
Experiment 1: Increase cache TTL
- Change: cache.ttl_seconds: 300 -> 600
- Result: 198ms p95, 1200 req/s
- Delta: -47ms latency, +200 throughput
- Decision: KEEP
Experiment 2: Reduce connection pool
- Change: connection_pool.max_size: 50 -> 30
- Result: 310ms p95, 800 req/s
- Delta: +65ms latency, -200 throughput
- Decision: REVERT
Experiment 3: Combined adjustment
- Change: cache.ttl_seconds: 600 (kept), connection_pool.max_size: 50 -> 40
- Result: 180ms p95, 1100 req/s
- Delta: -65ms latency, +100 throughput
- Decision: KEEP (best result)

This worked. After 50 overnight experiments, p95 latency dropped from 245ms to 180ms.

The Core Pattern

I realized the autoresearch pattern works for any optimization problem that meets three criteria (the experiment budget from the table above only decides when the loop stops):

Autoresearch pattern requirements
┌─────────────────────────────────────────────────────────────┐
│                    AUTORESEARCH PATTERN                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  REQUIREMENT 1: Clear Metric                                │
│  ├── Must be quantifiable (number, percentage, time)        │
│  ├── Must have clear optimization direction (min/max)       │
│  └── Example: latency_p95 (min), conversion_rate (max)      │
│                                                             │
│  REQUIREMENT 2: Bounded Action Space                        │
│  ├── Limited set of possible changes                        │
│  ├── Prevents chaotic exploration                           │
│  └── Example: cache TTL 60-3600, pool size 10-100           │
│                                                             │
│  REQUIREMENT 3: Fast Evaluation                             │
│  ├── Quick scoring mechanism                                │
│  ├── Deterministic results preferred                        │
│  └── Example: benchmark suite, automated tests              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The generalized loop:

Non-ML Autoresearch Loop
1. Define metric and optimization direction
2. Set action space boundaries
3. Establish experiment budget
4. LOOP:
   a. Generate modification hypothesis
   b. Apply bounded change
   c. Run fast evaluation
   d. Compare score to the current best (initially the baseline)
   e. Keep if improved, revert if degraded
   f. Log results and insights
   g. Repeat until budget exhausted
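The steps above can be sketched as a small keep-or-revert loop. This is an illustration under assumptions, not the code of any autoresearch tool: `runLoop`, `propose`, and `evaluate` are hypothetical names, the callbacks are synchronous for brevity (a real evaluation would usually shell out to a benchmark), and the latency curve in the usage example is made up.

```javascript
// Generic keep-or-revert loop. `propose` and `evaluate` are hypothetical
// callbacks supplied by the caller; nothing here is tied to a real library.
function runLoop({ initial, propose, evaluate, budget, direction = 'down' }) {
  const better = direction === 'down'
    ? (a, b) => a < b
    : (a, b) => a > b;

  // The baseline is measured once and becomes the first "current best".
  let best = { state: initial, score: evaluate(initial) };
  const log = [];

  for (let i = 1; i <= budget; i++) {
    const candidate = propose(best.state);   // steps a-b: hypothesis + change
    const score = evaluate(candidate);       // step c: fast evaluation
    const keep = better(score, best.score);  // step d: compare to current best
    log.push({ experiment: i, candidate, score, decision: keep ? 'KEEP' : 'REVERT' });
    if (keep) best = { state: candidate, score }; // step e: keep, else discard
  }
  return { best, log };
}

// Toy usage: a made-up latency curve with its minimum at ttl = 900.
const fakeLatency = ({ ttl }) => 180 + Math.abs(ttl - 900) / 10;
const steps = [300, -200, 300, 300, -100];
let stepIndex = 0;
const result = runLoop({
  initial: { ttl: 300 },
  propose: (s) => ({ ttl: s.ttl + steps[stepIndex++ % steps.length] }),
  evaluate: fakeLatency,
  budget: 5,
});
console.log(result.best);                         // { state: { ttl: 900 }, score: 180 }
console.log(result.log.map((e) => e.decision));   // KEEP, REVERT, KEEP, REVERT, REVERT
```

Always proposing from `best.state` is what makes a revert free: a rejected candidate is simply never adopted, which plays the same role as `git reset` in the original loop.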

Other Domains Where This Works

The autoresearch-anything project demonstrates successful non-ML applications:

Domain 1: System Prompt Optimization

Prompt optimization setup
Metric: Task completion rate on eval suite
Action Space: Prompt templates, instruction phrasing
Budget: 50 iterations overnight
Evaluation: Run 100 test cases, measure pass rate

An AI agent that generates React components optimized its own prompts through this loop.

Domain 2: SQL Query Optimization

SQL optimization setup
Metric: Query execution time (minimize)
Action Space: Index suggestions, join order hints
Budget: Database compute budget
Evaluation: Run against sample data, measure latency

Domain 3: Test Suite Optimization

Test optimization setup
Metric: Test execution time (minimize) with coverage maintained
Action Space: Test ordering, parallelization config, timeouts
Budget: CI/CD pipeline slots
Evaluation: Run full suite, measure duration and pass rate

Domain 4: Landing Page Conversion

Conversion optimization setup
Metric: Conversion rate (maximize)
Action Space: Headlines, CTA text, layout variants
Budget: A/B test infrastructure limits
Evaluation: Track user actions over time window

The autoresearch-anything project lists even more domains: genealogy research, trading agents, GPU kernel optimization, sudoku solvers, Shopify Liquid templates, biomechanics analysis, and tennis prediction models.

Common Mistakes I Made

Mistake 1: Unbounded Action Space

Wrong vs Right
WRONG:
Action Space: "Any code change to improve performance"
Result: Agent explores chaotically and never converges
CORRECT:
Action Space: "Cache TTL values between 60-3600 seconds"
Result: Agent explores systematically and finds an optimum

Mistake 2: Slow or Noisy Evaluation

Wrong vs Right
WRONG:
Evaluation: "Deploy to production, wait 24 hours, check metrics"
Result: 24 hours per experiment = 1 experiment per day
CORRECT:
Evaluation: "Run synthetic benchmark suite, extract p95 latency"
Result: 10 minutes per experiment = 144 experiments per day
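Both numbers in that comparison are easy to reproduce. The sketch below assumes the benchmark gives you raw latency samples and uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values). The function names and sample data are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// How many experiments fit into one day at a given duration per experiment.
function experimentsPerDay(minutesPerExperiment) {
  return Math.floor((24 * 60) / minutesPerExperiment);
}

const latenciesMs = [120, 130, 125, 140, 135, 128, 132, 138, 126, 400];
console.log(percentile(latenciesMs, 95)); // 400: one slow outlier decides p95
console.log(percentile(latenciesMs, 50)); // 130
console.log(experimentsPerDay(10));       // 144
console.log(experimentsPerDay(24 * 60));  // 1
```

With only ten samples a single outlier sets the p95, which is one reason a short benchmark can look noisy from run to run.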

Mistake 3: Missing Baseline

Wrong vs Right
WRONG:
"Try random changes and see what happens"
Result: No way to distinguish improvement from regression
CORRECT:
"Baseline: 245ms. Keep changes that reduce by >10ms"
Result: Clear success criteria

Mistake 4: Single Metric Blindness

Wrong vs Right
WRONG:
"Minimize latency at all costs"
Result: Latency drops but throughput collapses, costs explode
CORRECT:
"Minimize latency while maintaining throughput >1000 req/s"
Result: Balanced optimization
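Mistakes 3 and 4 both come down to the keep-or-revert rule, so here is a minimal sketch of one that combines them. The `decide` function and its option names are hypothetical. The 10ms threshold and the 1000 req/s floor come from the examples above, and I treat the floor as "at least 1000" so that the baseline itself passes.

```javascript
// Decide whether to keep a candidate result. `minGainMs` and `minThroughput`
// are illustrative thresholds taken from the examples in this post.
function decide(best, candidate, { minGainMs = 10, minThroughput = 1000 } = {}) {
  // Secondary constraint first: a faster but collapsed system is not a win.
  if (candidate.throughput < minThroughput) {
    return { decision: 'REVERT', reason: 'throughput constraint violated' };
  }
  const gain = best.latencyP95 - candidate.latencyP95;
  if (gain <= minGainMs) {
    return { decision: 'REVERT', reason: `gain of ${gain}ms does not exceed ${minGainMs}ms` };
  }
  return { decision: 'KEEP', reason: `latency improved by ${gain}ms` };
}

// Replaying the bounded experiment log, comparing against the current best.
const baseline = { latencyP95: 245, throughput: 1000 };
const exp1 = { latencyP95: 198, throughput: 1200 };
const exp2 = { latencyP95: 310, throughput: 800 };
const exp3 = { latencyP95: 180, throughput: 1100 };

console.log(decide(baseline, exp1).decision); // KEEP
console.log(decide(exp1, exp2).decision);     // REVERT
console.log(decide(exp1, exp3).decision);     // KEEP
```

Checking the constraint before the primary metric keeps the rule simple: a candidate that violates the floor is rejected no matter how much latency it saves.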

Setting Up Non-ML Autoresearch

I used the autoresearch-anything npm package. Here’s the interactive setup:

Setup command
$ npx autoresearch-anything
╔═══════════════════════════════════════════╗
autoresearch-anything
Autonomous AI improvement loop setup
╚═══════════════════════════════════════════╝
Briefly describe your project: API response time optimization
What file(s) should the agent edit?: config.yaml
What's your metric called?: latency_p95
Should the metric go up or down?: down
What command runs your eval?: ./run_benchmark.sh
What does the score line look like?: latency_p95: 245ms
Track a secondary constraint? [y/N]: y
What's the secondary metric?: throughput
How does it appear?: throughput: 1000
Max time per experiment?: 10
Files the agent must NOT modify?: benchmark_tests/

The evaluation script prints in a format the agent can parse:

eval.js
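// './benchmark' is this project's own benchmark helper (not shown here).
const { runBenchmark } = require('./benchmark');

async function evaluate() {
  const results = await runBenchmark({
    duration: '60s',
    concurrency: 10,
  });

  // Print one "name: value" line per metric so the agent can parse the score.
  console.log(`latency_p95: ${results.timing.p95}ms`);
  console.log(`throughput: ${results.requests.perSecond}`);
}

// Exit non-zero on failure so a crashed benchmark is not mistaken for a score.
evaluate().catch((err) => {
  console.error(err);
  process.exit(1);
});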
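If you want to log scores yourself rather than leave the reading to the agent, the two printed lines are easy to parse. The sketch below is my own helper, not something autoresearch-anything provides. `parseScores` is a hypothetical name, and it assumes one `name: value` pair per line with an optional unit suffix such as `ms`.

```javascript
// Parse lines such as "latency_p95: 245ms" or "throughput: 1000" into
// { latency_p95: 245, throughput: 1000 }. Any unit after the number is
// ignored, and lines that do not look like "name: number" are skipped.
function parseScores(output) {
  const scores = {};
  for (const line of output.split(/\r?\n/)) {
    const match = line.match(/^\s*([A-Za-z_][\w.]*)\s*:\s*(-?\d+(?:\.\d+)?)/);
    if (match) scores[match[1]] = Number(match[2]);
  }
  return scores;
}

const sampleOutput = 'starting benchmark...\nlatency_p95: 245ms\nthroughput: 1000\n';
console.log(parseScores(sampleOutput)); // { latency_p95: 245, throughput: 1000 }
```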

Why This Matters

Many developers have optimization problems that could benefit from autonomous improvement loops, but they don't recognize that the pattern applies because:

  1. They associate autoresearch with ML training
  2. They don't realize the loop generalizes to other domains
  3. They underestimate what AI agents can do for non-code optimization

The key insight from practitioners:

Core insight
"The LLM training framing is compelling but the real insight is
the loop design, not the domain."

Once you understand this, you see optimization opportunities everywhere:

  • Your CI pipeline takes 45 minutes? Autoresearch can optimize test ordering.
  • Your API has 300ms latency? Autoresearch can tune configurations.
  • Your prompts fail 30% of tasks? Autoresearch can iterate phrasing.
  • Your SQL queries time out? Autoresearch can suggest indexes.

Summary

In this post, I showed how to apply the autoresearch pattern to non-ML optimization problems. The key insight is the loop design, not the domain.

The pattern works for any problem with:

  • Clear metric (quantifiable, optimization direction)
  • Bounded action space (limited change options)
  • Fast evaluation (quick scoring mechanism)

I learned through trial and error that an unbounded action space fails, slow evaluation limits iterations, a missing baseline makes results impossible to interpret, and single-metric optimization breaks other constraints.

The autoresearch-anything project proves this works across domains: system prompts, API performance, SQL queries, test suites, landing pages, and more. “If you can measure it, you can optimize it.”

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
