Skip to content

Can You Use Karpathy's Autoresearch Loop for SEO Testing and Non-ML Tasks?

Problem

I read about Andrej Karpathy’s autoresearch project and thought: “This is amazing for ML, but I don’t train neural networks. Can I use this pattern for anything else?”

Then I realized I was missing the point. The autoresearch loop isn’t about machine learning. It’s about automation. The core pattern - read brief, propose change, measure result, commit or discard - applies to any domain where you can:

  1. Define a repeatable task
  2. Measure success with a single metric
  3. Tolerate some failed attempts

This post shows how to apply the autoresearch pattern to SEO testing, code refactoring, and prompt optimization.

The Universal Pattern

Karpathy’s loop architecture looks like this:

universal-autoresearch-flow.txt
+-------------------+
| Read Brief | <- Context document (program.md equivalent)
+-------------------+
|
v
+-------------------+
| Propose Change | <- Agent generates modification
+-------------------+
|
v
+-------------------+
| Execute & Measure | <- Run task, capture metric
+-------------------+
|
v
+----------+
| Improved?|
+----------+
/ \
YES NO
| |
v v
+--------+ +--------+
| Commit | | Discard|
| change | | revert |
+--------+ +--------+
| |
+-----+-----+
|
v
(Repeat)

The loop works for ML because training has:

  • A clear goal (reduce loss)
  • A fast feedback loop (5-minute runs)
  • A binary decision (improved or not)

I wondered: what other tasks have these properties?

Three Requirements for Non-ML Autoresearch

Not every task fits this pattern. You need:

Requirement 1: A Repeatable Task

The agent must execute the task repeatedly without human intervention. Each iteration should be independent or build incrementally.

task-examples.txt
Good repeatable tasks:
- Generate a headline variant
- Refactor a single function
- Rewrite a system prompt
Bad tasks (need human input):
- Design a new feature from scratch
- Interview a customer
- Create a brand strategy

Requirement 2: A Single Measurable Outcome

Success must be quantifiable with one metric. No ambiguous tradeoffs.

metric-examples.txt
Good metrics (single, clear):
- CTR percentage for SEO
- Test pass rate for refactoring
- Accuracy score for prompts
Bad metrics (ambiguous, competing):
- "Better user experience"
- "Cleaner code"
- "More engaging content"

Requirement 3: Tolerance for Failure

Failed attempts should not cause catastrophic damage. The process runs overnight, so some failures are expected.

failure-tolerance.txt
High tolerance:
- A/B testing headlines (low traffic sample)
- Refactoring with git revert available
- Prompt testing in sandbox
Low tolerance:
- Production database changes
- Financial trading (real money)
- Customer-facing features

Application 1: SEO Testing

I spend hours testing different headlines for blog posts. What if an agent could run these tests overnight?

seo_loop.py
import json
from datetime import datetime
class SEOAutoresearch:
def __init__(self, target_url, baseline_ctr):
self.target_url = target_url
self.baseline_ctr = baseline_ctr
self.seo_brief = self.load_seo_brief()
def load_seo_brief(self):
"""Context document - equivalent to program.md"""
return {
"target_keywords": ["automated SEO", "AI testing"],
"brand_voice": "technical but accessible",
"competitor_headlines": [...],
"past_winners": [...],
"past_losers": [...]
}
def run_loop(self, max_iterations=100):
for i in range(max_iterations):
# Step 1: Generate headline variant
headline = self.propose_headline()
# Step 2: Deploy to sample audience (safety limit!)
self.deploy_headline(headline, sample_size=1000)
ctr = self.measure_ctr(duration_hours=24)
# Step 3: Commit or discard
if ctr > self.baseline_ctr * 1.05: # 5% improvement threshold
self.commit_winner(headline, ctr)
self.baseline_ctr = ctr
self.update_brief("success", headline, ctr)
else:
self.revert_headline()
self.update_brief("failed", headline, ctr)
def propose_headline(self):
prompt = f"""
SEO Brief: {json.dumps(self.seo_brief)}
Generate a new headline that:
1. Uses target keywords naturally
2. Matches brand voice
3. Differs from past losers
4. Builds on past winners
Return only the headline text.
"""
return llm_generate(prompt)

The key difference from ML: safety limits. I don’t deploy to my entire audience. I test on a 1000-visitor sample first. Failed headlines only affect a small percentage of traffic.

Application 2: Code Refactoring

I have legacy code that needs cleanup. The tests pass, but the code is messy. An autoresearch loop could help.

refactor_loop.py
class RefactoringAutoresearch:
def __init__(self, codebase_path, test_command):
self.codebase_path = codebase_path
self.test_command = test_command
self.refactor_brief = self.load_refactor_brief()
def run_loop(self, max_iterations=50):
for i in range(max_iterations):
# Step 1: Select a file to refactor
target_file = self.select_file()
current_code = self.read_file(target_file)
# Step 2: Agent proposes refactoring
refactored_code = self.propose_refactor(
current_code,
self.refactor_brief
)
self.write_file(target_file, refactored_code)
# Step 3: Run tests
tests_pass, coverage = self.run_tests()
# Step 4: Commit or discard
if tests_pass and coverage >= self.baseline_coverage:
self.git_commit(f"Refactor: {self.last_change_desc}")
self.baseline_coverage = coverage
self.update_brief("success", self.last_change_desc)
else:
self.git_discard()
self.update_brief("failed", self.last_change_desc)
def propose_refactor(self, code, brief):
prompt = f"""
Refactoring Brief: {json.dumps(brief)}
Current Code:

{code}

Propose ONE specific refactoring that:
1. Improves code clarity or performance
2. Does not break existing functionality
3. Follows the patterns in the brief
Return the refactored code only.
"""
return llm_generate(prompt)

The refactoring brief is crucial. Without it, the agent makes random changes:

refactoring_brief.txt
# Refactoring Brief
## Goal
Improve code maintainability while preserving test coverage
## Successful Patterns (USE THESE)
- Extract method for functions > 20 lines
- Replace magic numbers with named constants
- Add type hints to function parameters
## Failed Attempts (AVOID)
- Renaming variables (low value, high churn)
- Adding comments (tests already document behavior)
## Current Focus
- src/legacy/ module has most technical debt
- Focus on one file per iteration

The single metric here is test coverage. If tests pass and coverage stays the same or improves, commit. Otherwise, discard.

Application 3: Prompt Optimization

I write system prompts for AI tools. Tuning them manually is tedious. A prompt optimization loop could systematically improve them.

prompt_loop.py
class PromptAutoresearch:
def __init__(self, task_description, eval_cases):
self.task = task_description
self.eval_cases = eval_cases # Test inputs with expected outputs
self.prompt_brief = self.load_prompt_brief()
def run_loop(self, max_iterations=30):
for i in range(max_iterations):
# Step 1: Generate prompt variant
prompt_variant = self.propose_prompt()
# Step 2: Evaluate on test cases
scores = []
for case in self.eval_cases:
output = llm_call(prompt_variant, case['input'])
score = self.evaluate_output(output, case['expected'])
scores.append(score)
avg_score = sum(scores) / len(scores)
# Step 3: Commit or discard
if avg_score > self.baseline_score:
self.save_winner(prompt_variant, avg_score)
self.baseline_score = avg_score
self.update_brief("success", prompt_variant, avg_score)
else:
self.update_brief("failed", prompt_variant, avg_score)

The evaluation function depends on your task:

prompt_evaluation.py
def evaluate_output(self, output, expected):
"""Score the output against expected result"""
if self.task_type == "classification":
return 1.0 if output == expected else 0.0
elif self.task_type == "extraction":
# Check if expected values appear in output
return self.jaccard_similarity(output, expected)
elif self.task_type == "generation":
# Use another LLM to score quality
return self.llm_score(output, expected)

The ROI Calculation

I calculated the time savings for SEO testing:

seo-roi-calculation.txt
Manual SEO testing:
- 30 minutes to set up each test
- 30 minutes to analyze results
- 4 tests per day (2 hours)
- 20 tests per week (10 hours)
Automated SEO testing:
- 5 minutes to set up the loop
- 50 tests run overnight
- Equivalent to 25 hours of manual work
- While I sleep

The same calculation applies to refactoring and prompt optimization. Overnight automation multiplies human productivity.

Common Mistakes

Mistake 1: Multiple Success Metrics

I initially tried tracking multiple metrics for SEO: CTR, time on page, bounce rate. This created ambiguous decisions.

multi-metric-problem.txt
Headline A:
- CTR: 3.2% (better)
- Time on page: 45s (worse)
- Bounce rate: 60% (same)
Decision: ???

The fix: pick ONE primary metric. For SEO, that’s CTR. The others are secondary signals.

Mistake 2: Long Feedback Loops

My first SEO loop tested for a week before measuring results. That’s too slow.

feedback-loop-speed.txt
Slow loop (my mistake):
- 1 week per test
- 4 tests per month
- Agent can't iterate quickly
Fast loop (correct):
- 24 hours per test
- 30 tests per month
- Rapid learning cycle

For refactoring, run a subset of tests for quick validation. For prompts, use a small eval set.

Mistake 3: Missing Safety Constraints

ML experiments fail gracefully (wasted GPU time). SEO and production code don’t.

safety-constraints.txt
Required guardrails:
SEO Testing:
- Sample size limit (never 100% of traffic)
- Automatic revert after N failures
- Brand guidelines in the brief
Refactoring:
- Git revert always available
- Test suite must pass
- No database schema changes
Prompt Optimization:
- Sandbox environment only
- Human review before production
- Fallback to previous version

Summary

The autoresearch loop pattern transfers to any domain with:

  1. Repeatable task - Agent can execute independently
  2. Single metric - Success is objectively measurable
  3. Failure tolerance - Bad attempts don’t cause catastrophe

The implementation follows the same structure:

universal_loop.py
while True:
context = read_brief() # Understand goal
proposal = generate_change() # Propose modification
result = execute_and_measure() # Run and capture metric
if result > baseline:
commit() # Keep successful change
update_brief() # Document success
else:
discard() # Revert and try again

Start small. Pick one task. Define one metric. Write a brief. Let it run overnight. You might be surprised what the agent discovers while you sleep.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments