Can You Use Karpathy's Autoresearch Loop for SEO Testing and Non-ML Tasks?
Problem
I read about Andrej Karpathy’s autoresearch project and thought: “This is amazing for ML, but I don’t train neural networks. Can I use this pattern for anything else?”
Then I realized I was missing the point. The autoresearch loop isn’t about machine learning. It’s about automation. The core pattern - read brief, propose change, measure result, commit or discard - applies to any domain where you can:
- Define a repeatable task
- Measure success with a single metric
- Tolerate some failed attempts
This post shows how to apply the autoresearch pattern to SEO testing, code refactoring, and prompt optimization.
The Universal Pattern
Karpathy’s loop architecture looks like this:
+-------------------+| Read Brief | <- Context document (program.md equivalent)+-------------------+ | v+-------------------+| Propose Change | <- Agent generates modification+-------------------+ | v+-------------------+| Execute & Measure | <- Run task, capture metric+-------------------+ | v +----------+ | Improved?| +----------+ / \ YES NO | | v v+--------+ +--------+| Commit | | Discard|| change | | revert |+--------+ +--------+ | | +-----+-----+ | v (Repeat)The loop works for ML because training has:
- A clear goal (reduce loss)
- A fast feedback loop (5-minute runs)
- A binary decision (improved or not)
I wondered: what other tasks have these properties?
Three Requirements for Non-ML Autoresearch
Not every task fits this pattern. You need:
Requirement 1: A Repeatable Task
The agent must execute the task repeatedly without human intervention. Each iteration should be independent or build incrementally.
Good repeatable tasks:- Generate a headline variant- Refactor a single function- Rewrite a system prompt
Bad tasks (need human input):- Design a new feature from scratch- Interview a customer- Create a brand strategyRequirement 2: A Single Measurable Outcome
Success must be quantifiable with one metric. No ambiguous tradeoffs.
Good metrics (single, clear):- CTR percentage for SEO- Test pass rate for refactoring- Accuracy score for prompts
Bad metrics (ambiguous, competing):- "Better user experience"- "Cleaner code"- "More engaging content"Requirement 3: Tolerance for Failure
Failed attempts should not cause catastrophic damage. The process runs overnight, so some failures are expected.
High tolerance:- A/B testing headlines (low traffic sample)- Refactoring with git revert available- Prompt testing in sandbox
Low tolerance:- Production database changes- Financial trading (real money)- Customer-facing featuresApplication 1: SEO Testing
I spend hours testing different headlines for blog posts. What if an agent could run these tests overnight?
import jsonfrom datetime import datetime
class SEOAutoresearch: def __init__(self, target_url, baseline_ctr): self.target_url = target_url self.baseline_ctr = baseline_ctr self.seo_brief = self.load_seo_brief()
def load_seo_brief(self): """Context document - equivalent to program.md""" return { "target_keywords": ["automated SEO", "AI testing"], "brand_voice": "technical but accessible", "competitor_headlines": [...], "past_winners": [...], "past_losers": [...] }
def run_loop(self, max_iterations=100): for i in range(max_iterations): # Step 1: Generate headline variant headline = self.propose_headline()
# Step 2: Deploy to sample audience (safety limit!) self.deploy_headline(headline, sample_size=1000) ctr = self.measure_ctr(duration_hours=24)
# Step 3: Commit or discard if ctr > self.baseline_ctr * 1.05: # 5% improvement threshold self.commit_winner(headline, ctr) self.baseline_ctr = ctr self.update_brief("success", headline, ctr) else: self.revert_headline() self.update_brief("failed", headline, ctr)
def propose_headline(self): prompt = f""" SEO Brief: {json.dumps(self.seo_brief)}
Generate a new headline that: 1. Uses target keywords naturally 2. Matches brand voice 3. Differs from past losers 4. Builds on past winners
Return only the headline text. """ return llm_generate(prompt)The key difference from ML: safety limits. I don’t deploy to my entire audience. I test on a 1000-visitor sample first. Failed headlines only affect a small percentage of traffic.
Application 2: Code Refactoring
I have legacy code that needs cleanup. The tests pass, but the code is messy. An autoresearch loop could help.
class RefactoringAutoresearch: def __init__(self, codebase_path, test_command): self.codebase_path = codebase_path self.test_command = test_command self.refactor_brief = self.load_refactor_brief()
def run_loop(self, max_iterations=50): for i in range(max_iterations): # Step 1: Select a file to refactor target_file = self.select_file() current_code = self.read_file(target_file)
# Step 2: Agent proposes refactoring refactored_code = self.propose_refactor( current_code, self.refactor_brief ) self.write_file(target_file, refactored_code)
# Step 3: Run tests tests_pass, coverage = self.run_tests()
# Step 4: Commit or discard if tests_pass and coverage >= self.baseline_coverage: self.git_commit(f"Refactor: {self.last_change_desc}") self.baseline_coverage = coverage self.update_brief("success", self.last_change_desc) else: self.git_discard() self.update_brief("failed", self.last_change_desc)
def propose_refactor(self, code, brief): prompt = f""" Refactoring Brief: {json.dumps(brief)}
Current Code:{code}
Propose ONE specific refactoring that:1. Improves code clarity or performance2. Does not break existing functionality3. Follows the patterns in the brief
Return the refactored code only."""return llm_generate(prompt)The refactoring brief is crucial. Without it, the agent makes random changes:
# Refactoring Brief
## GoalImprove code maintainability while preserving test coverage
## Successful Patterns (USE THESE)- Extract method for functions > 20 lines- Replace magic numbers with named constants- Add type hints to function parameters
## Failed Attempts (AVOID)- Renaming variables (low value, high churn)- Adding comments (tests already document behavior)
## Current Focus- src/legacy/ module has most technical debt- Focus on one file per iterationThe single metric here is test coverage. If tests pass and coverage stays the same or improves, commit. Otherwise, discard.
Application 3: Prompt Optimization
I write system prompts for AI tools. Tuning them manually is tedious. A prompt optimization loop could systematically improve them.
class PromptAutoresearch: def __init__(self, task_description, eval_cases): self.task = task_description self.eval_cases = eval_cases # Test inputs with expected outputs self.prompt_brief = self.load_prompt_brief()
def run_loop(self, max_iterations=30): for i in range(max_iterations): # Step 1: Generate prompt variant prompt_variant = self.propose_prompt()
# Step 2: Evaluate on test cases scores = [] for case in self.eval_cases: output = llm_call(prompt_variant, case['input']) score = self.evaluate_output(output, case['expected']) scores.append(score)
avg_score = sum(scores) / len(scores)
# Step 3: Commit or discard if avg_score > self.baseline_score: self.save_winner(prompt_variant, avg_score) self.baseline_score = avg_score self.update_brief("success", prompt_variant, avg_score) else: self.update_brief("failed", prompt_variant, avg_score)The evaluation function depends on your task:
def evaluate_output(self, output, expected): """Score the output against expected result""" if self.task_type == "classification": return 1.0 if output == expected else 0.0 elif self.task_type == "extraction": # Check if expected values appear in output return self.jaccard_similarity(output, expected) elif self.task_type == "generation": # Use another LLM to score quality return self.llm_score(output, expected)The ROI Calculation
I calculated the time savings for SEO testing:
Manual SEO testing:- 30 minutes to set up each test- 30 minutes to analyze results- 4 tests per day (2 hours)- 20 tests per week (10 hours)
Automated SEO testing:- 5 minutes to set up the loop- 50 tests run overnight- Equivalent to 25 hours of manual work- While I sleepThe same calculation applies to refactoring and prompt optimization. Overnight automation multiplies human productivity.
Common Mistakes
Mistake 1: Multiple Success Metrics
I initially tried tracking multiple metrics for SEO: CTR, time on page, bounce rate. This created ambiguous decisions.
Headline A:- CTR: 3.2% (better)- Time on page: 45s (worse)- Bounce rate: 60% (same)
Decision: ???The fix: pick ONE primary metric. For SEO, that’s CTR. The others are secondary signals.
Mistake 2: Long Feedback Loops
My first SEO loop tested for a week before measuring results. That’s too slow.
Slow loop (my mistake):- 1 week per test- 4 tests per month- Agent can't iterate quickly
Fast loop (correct):- 24 hours per test- 30 tests per month- Rapid learning cycleFor refactoring, run a subset of tests for quick validation. For prompts, use a small eval set.
Mistake 3: Missing Safety Constraints
ML experiments fail gracefully (wasted GPU time). SEO and production code don’t.
Required guardrails:
SEO Testing:- Sample size limit (never 100% of traffic)- Automatic revert after N failures- Brand guidelines in the brief
Refactoring:- Git revert always available- Test suite must pass- No database schema changes
Prompt Optimization:- Sandbox environment only- Human review before production- Fallback to previous versionSummary
The autoresearch loop pattern transfers to any domain with:
- Repeatable task - Agent can execute independently
- Single metric - Success is objectively measurable
- Failure tolerance - Bad attempts don’t cause catastrophe
The implementation follows the same structure:
while True: context = read_brief() # Understand goal proposal = generate_change() # Propose modification result = execute_and_measure() # Run and capture metric if result > baseline: commit() # Keep successful change update_brief() # Document success else: discard() # Revert and try againStart small. Pick one task. Define one metric. Write a brief. Let it run overnight. You might be surprised what the agent discovers while you sleep.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 GitHub: karpathy/autoresearch
- 👨💻 OpenClaw agent framework
- 👨💻 Reddit discussion on cross-domain applications
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments