How to Apply Autoresearch Patterns to Non-ML Optimization Problems
The Problem
I wanted to optimize my API response times. I’d heard about Andrej Karpathy’s autoresearch project - an autonomous AI loop that improves code by itself. But when I looked at it, everything seemed focused on machine learning:
The autoresearch project:
- Train neural networks automatically
- Adjust hyperparameters
- Optimize model weights
- Minimize loss functions

I don’t train ML models. I tune API configurations, optimize SQL queries, and improve system prompts. Was autoresearch useless for me?
Then I found this insight from a Reddit discussion:
"The most underrated application is applying the pattern to non-ML optimization problems - any domain where you have a clear metric, a fixed experiment budget, and a bounded action space."

This changed my understanding. The autoresearch pattern isn’t about ML - it’s about the loop.
What is Really Happening?
The misconception comes from Karpathy’s framing. His original project used ML terminology:
LOOP FOREVER:
  1. Propose change to model weights
  2. git commit
  3. Run training epoch -> get loss score
  4. Loss improved? Keep. Loss worse? git reset.
  5. Log results. Repeat.

This made people think autoresearch only works for neural networks. But look closer at what actually makes it work:
| Requirement | What It Means | ML Example | Non-ML Example |
|---|---|---|---|
| Clear Metric | Quantifiable success measure | Loss value | API latency (ms) |
| Bounded Action Space | Limited change options | Weight adjustments | Config parameters |
| Experiment Budget | Fixed time/compute | Training epochs | Overnight runs |
| Fast Evaluation | Quick scoring | Loss calculation | Benchmark suite |
The loop design is domain-agnostic. “If you can measure it, you can optimize it.”
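To make the domain-agnostic claim concrete, here is a minimal sketch of the keep-or-revert loop with every ML term removed. The names `runLoop`, `propose`, `evaluate`, and `budget` are illustrative placeholders, not part of Karpathy's project, and the latency model is a made-up toy so the sketch runs on its own. A real setup would run a benchmark command instead of calling a function.

```javascript
// Minimal sketch of the keep-or-revert loop with no ML involved.
// All names here are hypothetical; evaluation is synchronous for simplicity.
function runLoop({ initial, propose, evaluate, budget, direction = 'min' }) {
  const better = (a, b) => (direction === 'min' ? a < b : a > b);
  let best = { candidate: initial, score: evaluate(initial) }; // baseline
  const log = [];

  for (let i = 0; i < budget; i++) {
    const candidate = propose(best.candidate, i); // bounded change
    const score = evaluate(candidate);            // fast evaluation
    const keep = better(score, best.score);
    log.push({ i, candidate, score, keep });
    if (keep) best = { candidate, score };        // otherwise "revert": best is unchanged
  }
  return { best, log };
}

// Toy usage: an invented latency model whose minimum sits at ttl = 900.
const toyLatency = (cfg) => 180 + Math.abs(cfg.ttl - 900) / 10;
const steps = [300, -150, 300, 600, -300];
const result = runLoop({
  initial: { ttl: 300 },
  // Clamp every proposal to the 60-3600 range so the action space stays bounded.
  propose: (cfg, i) => ({ ttl: Math.min(3600, Math.max(60, cfg.ttl + steps[i])) }),
  evaluate: toyLatency,
  budget: steps.length,
});
console.log(result.best);
```

The loop only needs three things from the outside: a way to propose a change, a way to score it, and a budget. Nothing in it knows whether the score is a loss value or a latency.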
How I Tried to Apply It
My First Attempt: Unbounded Changes
I told the AI agent to “optimize API performance”:
Baseline: 245ms p95 latency
Experiment 1:
- Change: "Optimize API endpoints"
- Agent modified: database schema, API routes, caching logic, connection pooling
- Result: System broke, tests failed
- Status: REVERT

Experiment 2:
- Change: "Improve database queries"
- Agent modified: index structure, query patterns, table relationships
- Result: Migration conflict, data integrity issues
- Status: REVERT

This failed because my action space was unbounded. The agent explored too widely and broke everything.
My Second Attempt: Bounded Action Space
I learned from the mistake. I constrained what the agent could modify:
# Bounded optimization space - agent can ONLY modify these
cache:
  ttl_seconds: 300      # Bound: 60-3600
  max_entries: 1000     # Bound: 100-10000

connection_pool:
  min_size: 5           # Bound: 1-20
  max_size: 50          # Bound: 10-100
  timeout_ms: 5000      # Bound: 1000-30000

Now the agent had clear boundaries. It couldn’t touch the database schema or API routes.
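Comments in a config file only document the bounds; something still has to enforce them. The sketch below shows one way to reject a proposed change before it is applied. The bounds are copied from the config above, while `BOUNDS` and `checkProposal` are hypothetical helpers I am assuming for illustration, not part of any tool mentioned in this post.

```javascript
// Bounds copied from the config comments; the checker itself is a hypothetical helper.
const BOUNDS = {
  'cache.ttl_seconds': [60, 3600],
  'cache.max_entries': [100, 10000],
  'connection_pool.min_size': [1, 20],
  'connection_pool.max_size': [10, 100],
  'connection_pool.timeout_ms': [1000, 30000],
};

// Returns a list of violations; an empty list means the proposal is allowed.
function checkProposal(changes) {
  const violations = [];
  for (const [key, value] of Object.entries(changes)) {
    const range = BOUNDS[key];
    if (!range) {
      // Anything not listed is outside the action space (schema, routes, ...).
      violations.push(`${key}: not in the allowed action space`);
    } else if (!Number.isInteger(value) || value < range[0] || value > range[1]) {
      violations.push(`${key}: ${value} is outside ${range[0]}-${range[1]}`);
    }
  }
  return violations;
}

const allowed = checkProposal({ 'cache.ttl_seconds': 600 });
const outOfRange = checkProposal({ 'connection_pool.max_size': 500 });
const notTunable = checkProposal({ 'database.schema': 'v2' });
console.log(allowed, outOfRange, notTunable);
```

Treating unknown keys as violations is the important design choice here: the agent gets an allowlist, so a change to the schema or routes is rejected by default rather than by a special case.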
Baseline: 245ms p95 latency, 1000 req/s throughput
Experiment 1: Increase cache TTL
- Change: cache.ttl_seconds: 300 -> 600
- Result: 198ms p95, 1200 req/s
- Delta: -47ms latency, +200 throughput
- Decision: KEEP

Experiment 2: Reduce connection pool
- Change: connection_pool.max_size: 50 -> 30
- Result: 310ms p95, 800 req/s
- Delta: +65ms latency, -200 throughput
- Decision: REVERT

Experiment 3: Combined adjustment
- Change: cache.ttl_seconds: 600, pool.max_size: 40
- Result: 180ms p95, 1100 req/s
- Delta: -65ms latency, +100 throughput
- Decision: KEEP (best result)

This worked. After 50 overnight experiments, latency dropped from 245ms to 180ms.
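The keep-or-revert decisions in this log can be reproduced with a few lines of code. This is a sketch under assumptions: the `decide` function is hypothetical, and the two thresholds (a minimum gain of 10 ms and a throughput floor of 1000 req/s) are my choices for illustration. Like the log, it measures every delta against the original 245ms baseline.

```javascript
// Hypothetical decision rule: keep a change only if p95 latency improves by a
// minimum margin and throughput stays at or above a floor. Both thresholds are assumptions.
function decide(baseline, result, { minGainMs = 10, minThroughput = 1000 } = {}) {
  const latencyDelta = result.p95 - baseline.p95; // negative means faster
  const throughputDelta = result.throughput - baseline.throughput;
  const keep = latencyDelta <= -minGainMs && result.throughput >= minThroughput;
  return { latencyDelta, throughputDelta, decision: keep ? 'KEEP' : 'REVERT' };
}

// Numbers taken from the experiment log above.
const baseline = { p95: 245, throughput: 1000 };
const experiments = [
  { name: 'Increase cache TTL', p95: 198, throughput: 1200 },
  { name: 'Reduce connection pool', p95: 310, throughput: 800 },
  { name: 'Combined adjustment', p95: 180, throughput: 1100 },
];
const decisions = experiments.map((e) => ({ name: e.name, ...decide(baseline, e) }));
console.log(decisions);
```

A stricter variant would compare each experiment against the last kept result instead of the original baseline, so that a change has to beat the current best rather than the starting point.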
The Core Pattern
I realized the autoresearch pattern works for any optimization problem that meets three criteria:
┌─────────────────────────────────────────────────────────────┐
│                    AUTORESEARCH PATTERN                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  REQUIREMENT 1: Clear Metric                                │
│  ├── Must be quantifiable (number, percentage, time)        │
│  ├── Must have clear optimization direction (min/max)       │
│  └── Example: latency_p95 (minimize), conversion_rate       │
│                                                             │
│  REQUIREMENT 2: Bounded Action Space                        │
│  ├── Limited set of possible changes                        │
│  ├── Prevents chaotic exploration                           │
│  └── Example: cache TTL 60-3600, pool size 10-100           │
│                                                             │
│  REQUIREMENT 3: Fast Evaluation                             │
│  ├── Quick scoring mechanism                                │
│  ├── Deterministic results preferred                        │
│  └── Example: benchmark suite, automated tests              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The generalized loop:
1. Define metric and optimization direction
2. Set action space boundaries
3. Establish experiment budget
4. LOOP:
   a. Generate modification hypothesis
   b. Apply bounded change
   c. Run fast evaluation
   d. Compare score to baseline
   e. Keep if improved, revert if degraded
   f. Log results and insights
   g. Repeat until budget exhausted

Other Domains Where This Works
The autoresearch-anything project demonstrates successful non-ML applications:
Domain 1: System Prompt Optimization
Metric: Task completion rate on eval suite
Action Space: Prompt templates, instruction phrasing
Budget: 50 iterations overnight
Evaluation: Run 100 test cases, measure pass rate

An AI agent that generates React components optimized its own prompts through this loop.
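For prompts, the "fast evaluation" step is a pass rate over a fixed set of test cases. Here is a rough sketch of what that scoring could look like. Everything in it is assumed for illustration: `passRate` is a hypothetical helper, `fakeAgent` stands in for a real model call so the example runs offline, and the four checks are invented.

```javascript
// Hypothetical eval harness: `runAgent` stands in for whatever calls the model.
function passRate(prompt, testCases, runAgent) {
  let passed = 0;
  for (const testCase of testCases) {
    const output = runAgent(prompt, testCase.input);
    if (testCase.check(output)) passed++;
  }
  return passed / testCases.length;
}

// Toy stand-in so the sketch runs without a model: the "agent" only emits
// typed props when the prompt mentions TypeScript.
const fakeAgent = (prompt, input) =>
  prompt.includes('TypeScript')
    ? `function ${input}(props: Props) {}`
    : `function ${input}(props) {}`;

// Invented test cases; a real suite would have many more (the post uses 100).
const cases = [
  { input: 'Button', check: (out) => out.includes('Button') },
  { input: 'Card', check: (out) => out.includes(': Props') },
  { input: 'Modal', check: (out) => out.includes(': Props') },
  { input: 'Badge', check: (out) => out.startsWith('function') },
];

const before = passRate('Generate a React component.', cases, fakeAgent);
const after = passRate('Generate a React component in TypeScript.', cases, fakeAgent);
console.log(`task_completion_rate: ${before} -> ${after}`);
```

Because the test cases stay fixed between runs, two prompt variants are scored against exactly the same inputs, which is what makes the comparison meaningful.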
Domain 2: SQL Query Optimization
Metric: Query execution time (minimize)
Action Space: Index suggestions, join order hints
Budget: Database compute budget
Evaluation: Run against sample data, measure latency

Domain 3: Test Suite Optimization
Metric: Test execution time (minimize) with coverage maintained
Action Space: Test ordering, parallelization config, timeouts
Budget: CI/CD pipeline slots
Evaluation: Run full suite, measure duration and pass rate

Domain 4: Landing Page Conversion
Metric: Conversion rate (maximize)
Action Space: Headlines, CTA text, layout variants
Budget: A/B test infrastructure limits
Evaluation: Track user actions over time window

The autoresearch-anything project lists even more domains: genealogy research, trading agents, GPU kernel optimization, sudoku solvers, Shopify Liquid templates, biomechanics analysis, tennis prediction models.
Common Mistakes I Made
Mistake 1: Unbounded Action Space
WRONG:
  Action Space: "Any code change to improve performance"
  Result: Agent explores chaos, never converges

CORRECT:
  Action Space: "Cache TTL values between 60-3600 seconds"
  Result: Agent explores systematically, finds optimal

Mistake 2: Slow or Noisy Evaluation
WRONG:
  Evaluation: "Deploy to production, wait 24 hours, check metrics"
  Result: 24 hours per experiment = 1 experiment per day

CORRECT:
  Evaluation: "Run synthetic benchmark suite, extract p95 latency"
  Result: 10 minutes per experiment = 144 experiments per day

Mistake 3: Missing Baseline
WRONG:
  "Try random changes and see what happens"
  Result: No way to distinguish improvement from regression

CORRECT:
  "Baseline: 245ms. Keep changes that reduce by >10ms"
  Result: Clear success criteria

Mistake 4: Single Metric Blindness
WRONG:
  "Minimize latency at all costs"
  Result: Latency drops but throughput collapses, costs explode

CORRECT:
  "Minimize latency while maintaining throughput >1000 req/s"
  Result: Balanced optimization

Setting Up Non-ML Autoresearch
I used the autoresearch-anything npm package. Here’s the interactive setup:
$ npx autoresearch-anything
╔═══════════════════════════════════════════╗
║           autoresearch-anything           ║
║   Autonomous AI improvement loop setup    ║
╚═══════════════════════════════════════════╝
Briefly describe your project: API response time optimization
What file(s) should the agent edit?: config.yaml
What's your metric called?: latency_p95
Should the metric go up or down?: down
What command runs your eval?: ./run_benchmark.sh
What does the score line look like?: latency_p95: 245ms
Track a secondary constraint? [y/N]: y
What's the secondary metric?: throughput
How does it appear?: throughput: 1000
Max time per experiment?: 10
Files the agent must NOT modify?: benchmark_tests/

The evaluation script prints in a format the agent can parse:
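const { runBenchmark } = require('./benchmark');

async function evaluate() {
  // Run the benchmark for 60 seconds at concurrency 10
  const results = await runBenchmark({ duration: '60s', concurrency: 10 });

  // Print in expected format
  console.log(`latency_p95: ${results.timing.p95}ms`);
  console.log(`throughput: ${results.requests.perSecond}`);
}

// Exit non-zero on failure so a crashed benchmark is not mistaken for a score
evaluate().catch((err) => {
  console.error(err);
  process.exit(1);
});

Why This Matters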
Many developers have optimization problems that could benefit from autonomous improvement loops. But they don’t recognize that the pattern applies because:
- They associate autoresearch with ML training
- They aren’t aware that the loop generalizes across domains
- They underestimate what AI agents can do for non-code optimization
The key insight from practitioners:
"The LLM training framing is compelling but the real insight is the loop design, not the domain."

Once you understand this, you see optimization opportunities everywhere:
- Your CI pipeline takes 45 minutes? Autoresearch can optimize test ordering.
- Your API has 300ms latency? Autoresearch can tune configurations.
- Your prompts fail 30% of tasks? Autoresearch can iterate phrasing.
- Your SQL queries time out? Autoresearch can suggest indexes.
Summary
In this post, I showed how to apply the autoresearch pattern to non-ML optimization problems. The key insight is the loop design, not the domain.
The pattern works for any problem with:
- Clear metric (quantifiable, optimization direction)
- Bounded action space (limited change options)
- Fast evaluation (quick scoring mechanism)
I learned through trial and error: an unbounded action space fails, slow evaluation limits iterations, a missing baseline confuses results, and single-metric optimization breaks constraints.
The autoresearch-anything project proves this works across domains: system prompts, API performance, SQL queries, test suites, landing pages, and more. “If you can measure it, you can optimize it.”
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 autoresearch-anything GitHub Repository
- 👨💻 Karpathy's Original Autoresearch Project
- 👨💻 Reddit Discussion on Autoresearch Pattern Insights
- 👨💻 Context7 Documentation Search
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!