
Autoresearch (Karpathy-style)

by uditgoenka · uditgoenka/autoresearch

Goal-in, results-out — Claude proposes a change, runs it, measures, keeps wins, discards losses, iterates. Andrej Karpathy's autoresearch loop, packaged as a skill.

Autoresearch turns Claude into a closed-loop researcher: you specify a goal and a verifier (a metric, a test, or a judge prompt), and the skill iterates Modify → Verify → Keep/Discard with budget controls. Inspired by Karpathy's autoresearch posts. Useful when the goal is measurable but the search space is too big to plan upfront — prompt tuning, hyperparameter search, refactor-for-perf, copy A/B.
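The shape of the loop is worth seeing once. Below is a minimal, illustrative Python sketch of the Modify → Verify → Keep/Discard cycle with an iteration budget and a patience-based stop rule; the callables (propose, apply_change, revert, verify) are placeholders you would supply, not the skill's actual API.

from typing import Callable

def autoresearch_loop(
    propose: Callable[[float], object],      # proposes a change given the current best score
    apply_change: Callable[[object], None],  # applies the proposed change
    revert: Callable[[object], None],        # undoes it if the verifier says no
    verify: Callable[[], float],             # returns the metric; higher is better
    max_iter: int = 30,                      # hard iteration budget
    patience: int = 3,                       # stop after this many stale iterations
    min_gain: float = 0.01,                  # a "win" must beat the best by at least 1%
) -> float:
    best = verify()                          # baseline before touching anything
    stale = 0
    for _ in range(max_iter):
        change = propose(best)               # Modify
        apply_change(change)
        score = verify()                     # Verify
        if score > best * (1 + min_gain):    # Keep the win
            best, stale = score, 0
        else:                                # Discard the loss
            revert(change)
            stale += 1
        if stale >= patience:                # convergence guard
            break
    return best

The stop-early instruction in the first use case below ("if 3 consecutive iterations fail to improve > 1%, stop") maps directly onto patience and min_gain here.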


Install

Pick your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. Project config wins over global.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Same shape as Claude Desktop. Restart Windsurf to pick up changes.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "autoresearch-skill",
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ]
    }
  ]
}

Continue uses an array of server objects rather than a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "autoresearch-skill": {
      "command": {
        "path": "git",
        "args": [
          "clone",
          "https://github.com/uditgoenka/autoresearch",
          "~/.claude/skills/autoresearch"
        ]
      }
    }
  }
}

Add to context_servers. Zed hot-reloads on save.

claude mcp add autoresearch-skill -- git clone https://github.com/uditgoenka/autoresearch ~/.claude/skills/autoresearch

One-liner. Verify with claude mcp list. Remove with claude mcp remove autoresearch-skill.

Use Cases

Real-world ways to use Autoresearch (Karpathy-style)

Iteratively tune a system prompt against a benchmark

👤 AI engineers tuning prompts · ⏱ ~90 min · advanced

When to use: You have a prompt, a benchmark, and the patience for a loop.

Prerequisites
  • Skill installed — git clone https://github.com/uditgoenka/autoresearch ~/.claude/skills/autoresearch
  • Benchmark with a score function: /bench/run.sh prints a single numeric score on stdout (a minimal sketch follows)
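The only contract the verifier must honor is that it exits 0 and prints one number on stdout. A hedged Python sketch of a scorer that /bench/run.sh could wrap with a single python call; the JSONL eval format, file paths, and exact-match metric are illustrative assumptions, not part of the skill:

# Hypothetical scorer behind /bench/run.sh. The eval format (JSONL with
# "prompt" and "expected"), the paths, and the exact-match metric are assumptions.
import json

def run_candidate(prompt: str) -> str:
    # Placeholder: call your model with the current /prompts/system.md here.
    raise NotImplementedError

def score(eval_path: str = "bench/eval.jsonl") -> float:
    hits = total = 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)
            answer = run_candidate(case["prompt"])
            hits += int(answer.strip() == case["expected"].strip())
            total += 1
    return hits / max(total, 1)

if __name__ == "__main__":
    # Autoresearch only reads the number on stdout.
    print(f"{score():.4f}")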
Flow
  1. Frame the goal
    Use autoresearch. Goal: maximize score from /bench/run.sh on prompt at /prompts/system.md. Budget 30 iterations.
    → Loop starts; first proposal made
  2. Watch the trace
    Show me iterations 5–10 with deltas.
    → Trace with score per iteration; kept/discarded marked
  3. Stop early
    If 3 consecutive iterations fail to improve > 1%, stop and report best.
    → Convergence guard triggers; best prompt reported

Outcome: Better prompt, with a trace explaining why.

Pitfalls
  • Verifier is gameable (score increases without real quality). Fix: add a sanity-check verifier (an LLM judge or a held-out set)
Combine with: filesystem

Squeeze 20% perf out of a hot function via auto-iteration

👤 Backend devs with profile data · ⏱ ~120 min · advanced

When to use: You know which function is slow; you want Claude to find a faster equivalent.

Flow
  1. Define
    Goal: minimize wall-time of /bench/perf.sh which exercises foo(). Constraint: tests must keep passing.
    → Loop starts; baseline captured (a verifier sketch follows this flow)
  2. Iterate
    Run 20 iterations. Show the top 3 improvements at the end.
    → 3 candidate refactors with measured speedup
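Step 1 assumes a /bench/perf.sh that both enforces the constraint and reports the metric. A hedged Python sketch of such a harness, assuming a pytest suite and a workload script at bench/perf_workload.py (hypothetical names):

# Hypothetical perf verifier: tests are a hard gate, wall time is the metric.
import subprocess
import sys
import time

def main() -> int:
    # Constraint: the test suite must keep passing, or the iteration is discarded.
    tests = subprocess.run(["pytest", "-q"], capture_output=True)
    if tests.returncode != 0:
        print("inf")        # worst-case score, plus a nonzero exit, marks the iteration as a loss
        return 1
    # Metric: median wall time over 3 runs to damp measurement noise.
    times = []
    for _ in range(3):
        start = time.perf_counter()
        subprocess.run([sys.executable, "bench/perf_workload.py"], check=True)
        times.append(time.perf_counter() - start)
    print(f"{sorted(times)[1]:.4f}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())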

Outcome: Concrete speedup, validated.

Pitfalls
  • Iterations introduce a subtle correctness issue that the tests miss. Fix: add property-based tests as a verifier alongside the unit tests
Combine with: github

Auto-iterate landing page copy against a CTR judge

👤 Marketers running content tests · ⏱ ~60 min · intermediate

When to use: You have a CTR target (or a judge prompt that simulates one) and time to iterate.

Flow
  1. Set up judge
    Goal: maximize judge_score on /copy/headline.md. Judge prompt: 'rate likelihood a Series-B SaaS founder clicks this headline'.
    → Judge baseline scored; loop starts
  2. Iterate
    Run 15 iterations; keep top 3 distinct candidates.
    → Top 3 distinct headlines

Outcome: 3 candidate headlines for human review.

Pitfalls
  • Judge has a strong style preference unrelated to clickability. Fix: pin the judge to a rubric file with explicit criteria, as sketched below
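One hedged way to do that pinning: read the criteria from a versioned rubric file and require a bare numeric reply. The rubric path, the prompt wording, and the call_judge_llm placeholder below are assumptions to adapt to your own setup.

# Hypothetical rubric-pinned judge verifier.
from pathlib import Path
import re

def call_judge_llm(prompt: str) -> str:
    # Placeholder: send the prompt to your LLM provider and return its reply.
    raise NotImplementedError

def judge_score(headline: str, rubric_path: str = "copy/judge_rubric.md") -> float:
    rubric = Path(rubric_path).read_text()   # explicit, versioned criteria
    prompt = (
        "Score this headline from 0 to 10 against the rubric below. "
        "Reply with the number only.\n\n"
        f"Rubric:\n{rubric}\n\nHeadline: {headline}"
    )
    match = re.search(r"\d+(\.\d+)?", call_judge_llm(prompt))
    return float(match.group()) / 10 if match else 0.0

if __name__ == "__main__":
    print(f"{judge_score(Path('copy/headline.md').read_text().strip()):.3f}")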

Combinations

Pair with other MCPs for 10x leverage

autoresearch-skill + filesystem

Persist iteration traces for inspection

Save trace to /research/traces/<ts>.md after each loop.
autoresearch-skill + github

Open a PR with the winning candidate

When loop finishes, open PR titled 'autoresearch: <metric> +X%'.

Tools

What this MCP exposes

Tool | Inputs | When to call | Cost
loop | goal, verifier, max_iter, budget_tokens? | Closed-loop optimization | Variable; bound by budget
trace | loop_id? | Inspect a run | 0
rollback | to_iteration | Loop went off the rails | 0

Cost & Limits

What this costs to run

API quota: bound by your LLM provider's limits
Tokens per call: heavy; a full loop can run 100k+ tokens
Monetary: free to install; LLM usage cost is yours
Tip: always set max_iter and budget_tokens; open-ended loops will burn money

Security

Permissions, secrets, blast radius

Credential storage: None
Data egress: Bound by your LLM provider

Troubleshooting

Common errors and fixes

Loop stuck: same proposal every iteration

Increase the proposer's exploration temperature, or seed the loop with diverse candidates.

Verifier fails inconsistently

Verifier flakiness invalidates the loop. Pin seeds and repeat verification N=3 times per iteration, keeping the median.
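A hedged sketch of that fix, assuming the verifier is /bench/run.sh and that it respects a SEED environment variable (both assumptions): pin the seed and take the median of three runs so one flaky run cannot flip a keep/discard decision.

# Hypothetical de-flaking wrapper: pinned seed, median of N verification runs.
import os
import statistics
import subprocess

def stable_score(cmd=("bash", "bench/run.sh"), runs: int = 3) -> float:
    env = {**os.environ, "SEED": "1234", "PYTHONHASHSEED": "0"}   # pin randomness
    scores = []
    for _ in range(runs):
        out = subprocess.run(cmd, capture_output=True, text=True, env=env, check=True)
        scores.append(float(out.stdout.strip()))
    return statistics.median(scores)    # one flaky run cannot flip the decision

if __name__ == "__main__":
    print(f"{stable_score():.4f}")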

Budget exhausted before convergence

Inspect the trace: if gains are still monotonic, raise the budget; otherwise the verifier or the proposer is the bottleneck.

Alternatives

Autoresearch (Karpathy-style) vs others

Alternative | When to use it instead | Tradeoff
wanshuiyin/Auto-claude-code-research-in-sleep (ARIS) | You want overnight, async ML research loops specifically | ARIS is focused on ML; autoresearch is general-purpose
Manual A/B with scripted iteration | The goal is small and one-off | The skill removes the orchestration overhead

More

Resources

📖 Read the official README on GitHub

🐙 Browse open issues

🔍 Browse all 400+ MCP servers and Skills