Can You Use Karpathy's Autoresearch Loop for SEO Testing and Non-ML Tasks?

Mar 30, 2026

Problem

I read about Andrej Karpathy’s autoresearch project and thought: “This is amazing for ML, but I don’t train neural networks. Can I use this pattern for anything else?”

Then I realized I was missing the point. The autoresearch loop isn’t about machine learning. It’s about automation. The core pattern - read brief, propose change, measure result, commit or discard - applies to any domain where you can:

- Define a repeatable task
- Measure success with a single metric
- Tolerate some failed attempts

This post shows how to apply the autoresearch pattern to SEO testing, code refactoring, and prompt optimization.

The Universal Pattern

Karpathy’s loop architecture looks like this:

+-------------------+
|   Read Brief      |  <- Context document (program.md equivalent)
+-------------------+
         |
         v
+-------------------+
| Propose Change    |  <- Agent generates modification
+-------------------+
         |
         v
+-------------------+
| Execute & Measure |  <- Run task, capture metric
+-------------------+
         |
         v
    +----------+
    | Improved?|
    +----------+
     /        \
   YES         NO
    |           |
    v           v
+--------+  +--------+
| Commit |  | Discard|
| change |  | revert |
+--------+  +--------+
    |           |
    +-----+-----+
          |
          v
    (Repeat)

The loop works for ML because training has:

- A clear goal (reduce loss)
- A fast feedback loop (5-minute runs)
- A binary decision (improved or not)

I wondered: what other tasks have these properties?

Three Requirements for Non-ML Autoresearch

Not every task fits this pattern. You need:

Requirement 1: A Repeatable Task

The agent must execute the task repeatedly without human intervention. Each iteration should be independent or build incrementally.

Good repeatable tasks:
- Generate a headline variant
- Refactor a single function
- Rewrite a system prompt

Bad tasks (need human input):
- Design a new feature from scratch
- Interview a customer
- Create a brand strategy

Requirement 2: A Single Measurable Outcome

Success must be quantifiable with one metric. No ambiguous tradeoffs.

Good metrics (single, clear):
- CTR percentage for SEO
- Test pass rate for refactoring
- Accuracy score for prompts

Bad metrics (ambiguous, competing):
- "Better user experience"
- "Cleaner code"
- "More engaging content"
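To make "one metric" concrete, here is a minimal sketch of the three good metrics above, each reduced to a single float that a loop can compare against a baseline. The function names and argument shapes are my own illustrative assumptions, not part of any library.

```python
def ctr_percent(clicks: int, impressions: int) -> float:
    """Click-through rate as a percentage; 0.0 when nothing was shown."""
    if impressions == 0:
        return 0.0
    return 100.0 * clicks / impressions


def pass_rate(passed: int, total: int) -> float:
    """Fraction of tests that passed, between 0.0 and 1.0."""
    if total == 0:
        return 0.0
    return passed / total


def accuracy(predictions: list[str], expected: list[str]) -> float:
    """Fraction of predictions that exactly match the expected labels."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predictions, expected))
    return correct / len(expected)
```

Each function returns one number, so "improved or not" stays a plain comparison.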

Requirement 3: Tolerance for Failure

Failed attempts should not cause catastrophic damage. The process runs overnight, so some failures are expected.

High tolerance:
- A/B testing headlines (low traffic sample)
- Refactoring with git revert available
- Prompt testing in sandbox

Low tolerance:
- Production database changes
- Financial trading (real money)
- Customer-facing features

Application 1: SEO Testing

I spend hours testing different headlines for blog posts. What if an agent could run these tests overnight?

import json
from datetime import datetime

class SEOAutoresearch:
    def __init__(self, target_url, baseline_ctr):
        self.target_url = target_url
        self.baseline_ctr = baseline_ctr
        self.seo_brief = self.load_seo_brief()

    def load_seo_brief(self):
        """Context document - equivalent to program.md"""
        return {
            "target_keywords": ["automated SEO", "AI testing"],
            "brand_voice": "technical but accessible",
            "competitor_headlines": [...],
            "past_winners": [...],
            "past_losers": [...]
        }

    def run_loop(self, max_iterations=100):
        for i in range(max_iterations):
            # Step 1: Generate headline variant
            headline = self.propose_headline()

            # Step 2: Deploy to sample audience (safety limit!)
            self.deploy_headline(headline, sample_size=1000)
            ctr = self.measure_ctr(duration_hours=24)

            # Step 3: Commit or discard
            if ctr > self.baseline_ctr * 1.05:  # 5% improvement threshold
                self.commit_winner(headline, ctr)
                self.baseline_ctr = ctr
                self.update_brief("success", headline, ctr)
            else:
                self.revert_headline()
                self.update_brief("failed", headline, ctr)

    def propose_headline(self):
        prompt = f"""
        SEO Brief: {json.dumps(self.seo_brief)}

        Generate a new headline that:
        1. Uses target keywords naturally
        2. Matches brand voice
        3. Differs from past losers
        4. Builds on past winners

        Return only the headline text.
        """
        return llm_generate(prompt)  # llm_generate: placeholder for your own LLM client call

The key difference from ML: safety limits. I don’t deploy to my entire audience. I test on a 1000-visitor sample first. Failed headlines only affect a small percentage of traffic.
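One way to enforce that kind of exposure limit is to assign visitors to the test group deterministically. The sketch below caps exposure by traffic fraction rather than by visitor count, which is an assumption on my part; `in_test_sample`, the experiment name, and the visitor ID format are all hypothetical.

```python
import hashlib


def in_test_sample(visitor_id: str, experiment: str, fraction: float) -> bool:
    """Deterministically decide whether a visitor sees the test headline.

    Hashing (experiment, visitor_id) keeps the assignment stable across
    page loads, so the same visitor always lands in the same group.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction
```

With `fraction=0.1`, roughly one visitor in ten sees the candidate headline and everyone else keeps the current one, so a bad headline has bounded reach.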

Application 2: Code Refactoring

I have legacy code that needs cleanup. The tests pass, but the code is messy. An autoresearch loop could help.

import json

class RefactoringAutoresearch:
    def __init__(self, codebase_path, test_command, baseline_coverage=0.0):
        self.codebase_path = codebase_path
        self.test_command = test_command
        self.baseline_coverage = baseline_coverage  # Coverage measured before the loop starts
        self.last_change_desc = ""  # Set each iteration, used in commit messages
        self.refactor_brief = self.load_refactor_brief()

    def run_loop(self, max_iterations=50):
        for i in range(max_iterations):
            # Step 1: Select a file to refactor
            target_file = self.select_file()
            current_code = self.read_file(target_file)
            self.last_change_desc = f"{target_file} (iteration {i + 1})"

            # Step 2: Agent proposes refactoring
            refactored_code = self.propose_refactor(
                current_code,
                self.refactor_brief
            )
            self.write_file(target_file, refactored_code)

            # Step 3: Run tests
            tests_pass, coverage = self.run_tests()

            # Step 4: Commit or discard
            if tests_pass and coverage >= self.baseline_coverage:
                self.git_commit(f"Refactor: {self.last_change_desc}")
                self.baseline_coverage = coverage
                self.update_brief("success", self.last_change_desc)
            else:
                self.git_discard()
                self.update_brief("failed", self.last_change_desc)

    def propose_refactor(self, code, brief):
        prompt = f"""
        Refactoring Brief: {json.dumps(brief)}

        Current Code:

        {code}

        Propose ONE specific refactoring that:
        1. Improves code clarity or performance
        2. Does not break existing functionality
        3. Follows the patterns in the brief

        Return the refactored code only.
        """
        return llm_generate(prompt)  # llm_generate: placeholder for your own LLM client call

The refactoring brief is crucial. Without it, the agent makes random changes:

# Refactoring Brief

## Goal
Improve code maintainability while preserving test coverage

## Successful Patterns (USE THESE)
- Extract method for functions > 20 lines
- Replace magic numbers with named constants
- Add type hints to function parameters

## Failed Attempts (AVOID)
- Renaming variables (low value, high churn)
- Adding comments (tests already document behavior)

## Current Focus
- src/legacy/ module has most technical debt
- Focus on one file per iteration
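The brief only helps if the loop keeps it up to date. Here is a rough sketch of how an update step could append a bullet to the right section of a markdown brief like the one above. `append_to_section` is a hypothetical helper, and it assumes sections start with `## ` headings exactly as shown.

```python
def append_to_section(brief: str, heading: str, bullet: str) -> str:
    """Return the brief with `bullet` added at the end of section `heading`.

    A section runs from its heading line to the next line that starts
    with "## " (or to the end of the text).
    """
    lines = brief.splitlines()
    try:
        start = lines.index(heading)
    except ValueError:
        # Unknown section: add it at the end instead of failing the loop.
        return brief.rstrip("\n") + f"\n\n{heading}\n- {bullet}\n"

    end = start + 1
    while end < len(lines) and not lines[end].startswith("## "):
        end += 1

    # Step back over trailing blank lines so the bullet joins the list.
    insert_at = end
    while insert_at > start + 1 and lines[insert_at - 1].strip() == "":
        insert_at -= 1

    lines.insert(insert_at, f"- {bullet}")
    return "\n".join(lines) + "\n"
```

After a discarded attempt, the loop could call `append_to_section(brief, "## Failed Attempts (AVOID)", description)` so the next proposal sees what did not work.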

The single metric here is test coverage, with a passing test suite as a hard gate. If tests pass and coverage stays the same or improves, commit. Otherwise, discard.
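A minimal sketch of the test step, under a few assumptions: the test command is passed as an argument list, the tool prints a summary line that starts with `TOTAL` and ends with a percentage (the shape of a coverage.py text report), and `DEFAULT_TEST_COMMAND` assumes pytest with the pytest-cov plugin installed. All names here are hypothetical.

```python
import re
import subprocess
import sys
from typing import Optional

# Assumption: pytest and pytest-cov are installed in the current environment.
DEFAULT_TEST_COMMAND = [sys.executable, "-m", "pytest", "--cov"]


def parse_total_coverage(report: str) -> Optional[float]:
    """Extract the percentage from a line such as 'TOTAL  120  6  95%'."""
    match = re.search(r"^TOTAL\b.*?(\d+(?:\.\d+)?)%\s*$", report, re.MULTILINE)
    return float(match.group(1)) if match else None


def run_tests(test_command: list[str], timeout_s: int = 600) -> tuple[bool, Optional[float]]:
    """Run the test command and return (tests_passed, coverage_percent).

    A timeout counts as a failure so a hanging refactor gets discarded.
    Coverage is None when no TOTAL line appears in the output.
    """
    try:
        proc = subprocess.run(
            test_command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, None
    return proc.returncode == 0, parse_total_coverage(proc.stdout)
```

Because coverage can come back as `None`, a caller should treat a missing value as "discard" rather than compare it to the baseline.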

Application 3: Prompt Optimization

I write system prompts for AI tools. Tuning them manually is tedious. A prompt optimization loop could systematically improve them.

class PromptAutoresearch:
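    def __init__(self, task_description, eval_cases,
                 task_type="classification", baseline_score=0.0):
        self.task = task_description
        self.task_type = task_type  # "classification", "extraction", or "generation"
        self.eval_cases = eval_cases  # Test inputs with expected outputs
        self.baseline_score = baseline_score  # Score of the current prompt before the loop starts
        self.prompt_brief = self.load_prompt_brief()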

    def run_loop(self, max_iterations=30):
        for i in range(max_iterations):
            # Step 1: Generate prompt variant
            prompt_variant = self.propose_prompt()

            # Step 2: Evaluate on test cases
            scores = []
            for case in self.eval_cases:
                output = llm_call(prompt_variant, case['input'])
                score = self.evaluate_output(output, case['expected'])
                scores.append(score)

            avg_score = sum(scores) / len(scores)

            # Step 3: Commit or discard
            if avg_score > self.baseline_score:
                self.save_winner(prompt_variant, avg_score)
                self.baseline_score = avg_score
                self.update_brief("success", prompt_variant, avg_score)
            else:
                self.update_brief("failed", prompt_variant, avg_score)

The evaluation function depends on your task:

def evaluate_output(self, output, expected):
    """Score the output against expected result (0.0 to 1.0)"""
    if self.task_type == "classification":
        # Exact match only
        return 1.0 if output == expected else 0.0
    elif self.task_type == "extraction":
        # Check if expected values appear in output
        return self.jaccard_similarity(output, expected)
    elif self.task_type == "generation":
        # Use another LLM to score quality
        return self.llm_score(output, expected)
    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown task type: {self.task_type}")
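The extraction branch relies on a `jaccard_similarity` helper that isn't shown. A minimal sketch, assuming both values are plain strings and that comparing lowercase word sets is good enough for the task:

```python
import re


def jaccard_similarity(output: str, expected: str) -> float:
    """Jaccard index of the word sets: |A & B| / |A | B|, in [0.0, 1.0]."""
    a = set(re.findall(r"\w+", output.lower()))
    b = set(re.findall(r"\w+", expected.lower()))
    if not a and not b:
        return 1.0  # Two empty strings count as identical.
    return len(a & b) / len(a | b)
```

Word order and punctuation are ignored here, which suits extraction checks but would be too lenient for tasks where exact formatting matters.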

The ROI Calculation

I calculated the time savings for SEO testing:

Manual SEO testing:
- About 30 minutes per test (setup plus analyzing results)
- 4 tests per day (2 hours)
- 20 tests per week (10 hours)

Automated SEO testing:
- 5 minutes to set up the loop
- 50 tests run overnight
- Equivalent to 25 hours of manual work
- While I sleep

The same calculation applies to refactoring and prompt optimization. Overnight automation multiplies human productivity.
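The hour totals above work out if each manual test costs about 30 minutes of hands-on time. Taking that figure as an assumption, the arithmetic is easy to rerun with your own numbers (`manual_hours` is just an illustrative helper):

```python
def manual_hours(tests: int, minutes_per_test: float) -> float:
    """Hands-on hours needed to run `tests` tests manually."""
    return tests * minutes_per_test / 60


MINUTES_PER_TEST = 30  # Assumption: total hands-on time for one manual test.

for label, tests in [("per day", 4), ("per week", 20), ("overnight run", 50)]:
    hours = manual_hours(tests, MINUTES_PER_TEST)
    print(f"{tests} tests {label}: {hours:.0f} hours of manual work")
```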

Common Mistakes

Mistake 1: Multiple Success Metrics

I initially tried tracking multiple metrics for SEO: CTR, time on page, bounce rate. This created ambiguous decisions.

Headline A:
- CTR: 3.2% (better)
- Time on page: 45s (worse)
- Bounce rate: 60% (same)

Decision: ???

The fix: pick ONE primary metric. For SEO, that’s CTR. The others are secondary signals.
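In code, that means the decision function looks at CTR and nothing else. A small sketch, where `VariantResult`, `is_winner`, and the baseline numbers in the usage note are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    ctr: float             # Primary metric, in percent
    time_on_page_s: float  # Secondary signal, logged only
    bounce_rate: float     # Secondary signal, logged only


def is_winner(variant: VariantResult, baseline: VariantResult,
              min_relative_lift: float = 0.05) -> bool:
    """Decide on CTR alone; secondary signals never enter the comparison."""
    return variant.ctr > baseline.ctr * (1 + min_relative_lift)
```

Against a baseline CTR of 3.0%, Headline A with 3.2% clears the 5% relative lift and wins, regardless of its shorter time on page. The secondary numbers are still stored, so a human can review them later.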

Mistake 2: Long Feedback Loops

My first SEO loop tested for a week before measuring results. That’s too slow.

Slow loop (my mistake):
- 1 week per test
- 4 tests per month
- Agent can't iterate quickly

Fast loop (correct):
- 24 hours per test
- 30 tests per month
- Rapid learning cycle

For refactoring, run a subset of tests for quick validation. For prompts, use a small eval set.
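For the small eval set, one option is a fixed random subset. This sketch assumes eval cases are stored in a list; `sample_eval_cases` is a hypothetical helper. Using the same seed on every iteration keeps scores comparable between prompt variants.

```python
import random


def sample_eval_cases(cases: list, k: int, seed: int = 0) -> list:
    """Pick k cases reproducibly; return all cases if k is too large."""
    if k >= len(cases):
        return list(cases)
    # A private Random instance avoids disturbing the global random state.
    return random.Random(seed).sample(cases, k)
```

Changing the subset between iterations would mix prompt quality with sampling noise, so it is safer to change the seed only when starting a new run.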

Mistake 3: Missing Safety Constraints

ML experiments fail gracefully: a bad run only wastes GPU time. SEO tests and production code don’t, because a bad change reaches real visitors or real users.

Required guardrails:

SEO Testing:
- Sample size limit (never 100% of traffic)
- Automatic revert after N failures
- Brand guidelines in the brief

Refactoring:
- Git revert always available
- Test suite must pass
- No database schema changes

Prompt Optimization:
- Sandbox environment only
- Human review before production
- Fallback to previous version
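The "automatic revert after N failures" guardrail can be as simple as a counter that the loop checks each iteration. A minimal sketch, with `FailureBudget` as a hypothetical name and "N consecutive failures" as my reading of the rule:

```python
class FailureBudget:
    """Signals the loop to stop after too many consecutive failed attempts."""

    def __init__(self, max_consecutive_failures: int) -> None:
        if max_consecutive_failures < 1:
            raise ValueError("max_consecutive_failures must be at least 1")
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        """Reset the streak on success, extend it on failure."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def exhausted(self) -> bool:
        return self.consecutive_failures >= self.max_consecutive_failures
```

Inside a loop, call `budget.record(improved)` after each iteration and break out (reverting to the last committed state) once `budget.exhausted` is true.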

Summary

The autoresearch loop pattern transfers to any domain with:

- Repeatable task: the agent can execute it independently
- Single metric: success is objectively measurable
- Failure tolerance: bad attempts don’t cause catastrophe

The implementation follows the same structure:

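baseline = measure_baseline()              # Score before any change
while True:
    context = read_brief()                 # Understand goal
    proposal = generate_change(context)    # Propose modification
    result = execute_and_measure(proposal) # Run and capture metric
    if result > baseline:
        commit(proposal)                   # Keep successful change
        baseline = result                  # New bar to beat
        update_brief("success", proposal, result)
    else:
        discard(proposal)                  # Revert and try again
        update_brief("failed", proposal, result)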
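To see the structure run end to end, here is a toy version with the domain-specific parts passed in as functions. It is a sketch under my own assumptions: `autoresearch_loop` is a hypothetical name, the `history` list stands in for the brief, and the "task" is just nudging a number toward 3.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def autoresearch_loop(
    initial: T,
    propose: Callable[[T, list], T],
    measure: Callable[[T], float],
    iterations: int,
) -> tuple[T, float, list]:
    """Keep a candidate only when it beats the current baseline score."""
    best = initial
    baseline = measure(best)
    history: list = []  # Stands in for the brief: every attempt is logged.
    for _ in range(iterations):
        candidate = propose(best, history)
        score = measure(candidate)
        improved = score > baseline
        history.append((candidate, score, improved))
        if improved:
            best, baseline = candidate, score  # Commit and raise the bar.
        # Otherwise the candidate is simply dropped (discard).
    return best, baseline, history


# Toy task: find x that maximizes -(x - 3)^2 using small random nudges.
rng = random.Random(42)
best, score, history = autoresearch_loop(
    initial=0.0,
    propose=lambda x, _history: x + rng.uniform(-1.0, 1.0),
    measure=lambda x: -(x - 3.0) ** 2,
    iterations=200,
)
print(f"best x = {best:.3f}, score = {score:.5f}")
```

Swapping in a headline generator and a CTR measurement, or a refactoring step and a test run, changes only the two callables. The commit-or-discard logic stays the same.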

Start small. Pick one task. Define one metric. Write a brief. Let it run overnight. You might be surprised what the agent discovers while you sleep.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 GitHub: karpathy/autoresearch
👨‍💻 OpenClaw agent framework
👨‍💻 Reddit discussion on cross-domain applications

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!