
Autoresearch (Karpathy-style)

by uditgoenka · uditgoenka/autoresearch

Goal-in, results-out — Claude proposes a change, runs it, measures, keeps wins, discards losses, iterates. Andrej Karpathy's autoresearch loop, packaged as a skill.

Autoresearch turns Claude into a closed-loop researcher: you specify a goal and a verifier (a metric, a test, or a judge prompt), and the skill iterates Modify → Verify → Keep/Discard with budget controls. Inspired by Karpathy's autoresearch posts. Useful when the goal is measurable but the search space is too big to plan upfront — prompt tuning, hyperparameter search, refactor-for-perf, copy A/B.
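The shape of the loop is worth seeing once. Below is a minimal, illustrative Python sketch of the Modify → Verify → Keep/Discard cycle with an iteration budget and a patience-based stop rule; the callables (propose, apply_change, revert, verify) are placeholders you would supply, not the skill's actual API.

from typing import Callable

def autoresearch_loop(
    propose: Callable[[float], object],      # proposes a change given the current best score
    apply_change: Callable[[object], None],  # applies the proposed change
    revert: Callable[[object], None],        # undoes it if the verifier says no
    verify: Callable[[], float],             # returns the metric; higher is better
    max_iter: int = 30,                      # hard iteration budget
    patience: int = 3,                       # stop after this many stale iterations
    min_gain: float = 0.01,                  # a "win" must beat the best by at least 1%
) -> float:
    best = verify()                          # baseline before touching anything
    stale = 0
    for _ in range(max_iter):
        change = propose(best)               # Modify
        apply_change(change)
        score = verify()                     # Verify
        if score > best * (1 + min_gain):    # Keep the win
            best, stale = score, 0
        else:                                # Discard the loss
            revert(change)
            stale += 1
        if stale >= patience:                # convergence guard
            break
    return best

The stop-early instruction in the first use case below ("if 3 consecutive iterations fail to improve > 1%, stop") maps directly onto patience and min_gain here.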


Install

Pick your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. Project config wins over global.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "autoresearch-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ],
      "_inferred": true
    }
  }
}

Same shape as Claude Desktop. Restart Windsurf to pick up changes.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "autoresearch-skill",
      "command": "git",
      "args": [
        "clone",
        "https://github.com/uditgoenka/autoresearch",
        "~/.claude/skills/autoresearch"
      ]
    }
  ]
}

Continue uses an array of server objects rather than a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "autoresearch-skill": {
      "command": {
        "path": "git",
        "args": [
          "clone",
          "https://github.com/uditgoenka/autoresearch",
          "~/.claude/skills/autoresearch"
        ]
      }
    }
  }
}

Add to context_servers. Zed hot-reloads on save.

claude mcp add autoresearch-skill -- git clone https://github.com/uditgoenka/autoresearch ~/.claude/skills/autoresearch

One-liner. Verify with claude mcp list. Remove with claude mcp remove autoresearch-skill.

Use Cases

Real-world ways to use Autoresearch (Karpathy-style)

Iteratively tune a system prompt against a benchmark

👤 AI engineers tuning prompts · ⏱ ~90 min · advanced

When to use: You have a prompt, a benchmark, and the patience for a loop.

Prerequisites
  • Skill installed — git clone https://github.com/uditgoenka/autoresearch ~/.claude/skills/autoresearch
  • Benchmark with a score function: /bench/run.sh prints a single numeric score on stdout (a minimal sketch follows)
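The only contract the verifier must honor is that it exits 0 and prints one number on stdout. A hedged Python sketch of a scorer that /bench/run.sh could wrap with a single python call; the JSONL eval format, file paths, and exact-match metric are illustrative assumptions, not part of the skill:

# Hypothetical scorer behind /bench/run.sh. The eval format (JSONL with
# "prompt" and "expected"), the paths, and the exact-match metric are assumptions.
import json

def run_candidate(prompt: str) -> str:
    # Placeholder: call your model with the current /prompts/system.md here.
    raise NotImplementedError

def score(eval_path: str = "bench/eval.jsonl") -> float:
    hits = total = 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)
            answer = run_candidate(case["prompt"])
            hits += int(answer.strip() == case["expected"].strip())
            total += 1
    return hits / max(total, 1)

if __name__ == "__main__":
    # Autoresearch only reads the number on stdout.
    print(f"{score():.4f}")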
Flow
  1. Frame the goal
    Use autoresearch. Goal: maximize score from /bench/run.sh on prompt at /prompts/system.md. Budget 30 iterations.
    → Loop starts; first proposal made
  2. Watch the trace
    Show me iterations 5–10 with deltas.
    → Trace with score per iteration; kept/discarded marked
  3. Stop early
    If 3 consecutive iterations fail to improve > 1%, stop and report best.
    → Convergence guard triggers; best prompt reported

Outcome: Better prompt, with a trace explaining why.

Pitfalls
  • Verifier is gameable (score increases without real quality). Fix: add a sanity-check verifier (an LLM judge or a held-out set)
Combine with: filesystem

Squeeze 20% perf out of a hot function via auto-iteration

👤 Backend devs with profile data · ⏱ ~120 min · advanced

When to use: You know which function is slow; you want Claude to find a faster equivalent.

Flow
  1. Define
    Goal: minimize wall-time of /bench/perf.sh which exercises foo(). Constraint: tests must keep passing.
    → Loop starts; baseline captured (a verifier sketch follows this flow)
  2. Iterate
    Run 20 iterations. Show the top 3 improvements at the end.
    → 3 candidate refactors with measured speedup
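Step 1 assumes a /bench/perf.sh that both enforces the constraint and reports the metric. A hedged Python sketch of such a harness, assuming a pytest suite and a workload script at bench/perf_workload.py (hypothetical names):

# Hypothetical perf verifier: tests are a hard gate, wall time is the metric.
import subprocess
import sys
import time

def main() -> int:
    # Constraint: the test suite must keep passing, or the iteration is discarded.
    tests = subprocess.run(["pytest", "-q"], capture_output=True)
    if tests.returncode != 0:
        print("inf")        # worst-case score, plus a nonzero exit, marks the iteration as a loss
        return 1
    # Metric: median wall time over 3 runs to damp measurement noise.
    times = []
    for _ in range(3):
        start = time.perf_counter()
        subprocess.run([sys.executable, "bench/perf_workload.py"], check=True)
        times.append(time.perf_counter() - start)
    print(f"{sorted(times)[1]:.4f}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())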

Outcome: Concrete speedup, validated.

Pitfalls
  • Iterations introduce a subtle correctness issue that the tests miss. Fix: add property-based tests as a verifier alongside the unit tests
Combine with: github

Auto-iterate landing page copy against a CTR judge

👤 Marketers running content tests · ⏱ ~60 min · intermediate

When to use: You have a CTR target (or a judge prompt that simulates one) and time to iterate.

Flow
  1. Set up judge
    Goal: maximize judge_score on /copy/headline.md. Judge prompt: 'rate likelihood a Series-B SaaS founder clicks this headline'.
    → Judge baseline scored; loop starts
  2. Iterate
    Run 15 iterations; keep top 3 distinct candidates.
    → Top 3 distinct headlines

Outcome: 3 candidate headlines for human review.

Pitfalls
  • Judge has a strong style preference unrelated to clickability. Fix: pin the judge to a rubric file with explicit criteria, as sketched below
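One hedged way to do that pinning: read the criteria from a versioned rubric file and require a bare numeric reply. The rubric path, the prompt wording, and the call_judge_llm placeholder below are assumptions to adapt to your own setup.

# Hypothetical rubric-pinned judge verifier.
from pathlib import Path
import re

def call_judge_llm(prompt: str) -> str:
    # Placeholder: send the prompt to your LLM provider and return its reply.
    raise NotImplementedError

def judge_score(headline: str, rubric_path: str = "copy/judge_rubric.md") -> float:
    rubric = Path(rubric_path).read_text()   # explicit, versioned criteria
    prompt = (
        "Score this headline from 0 to 10 against the rubric below. "
        "Reply with the number only.\n\n"
        f"Rubric:\n{rubric}\n\nHeadline: {headline}"
    )
    match = re.search(r"\d+(\.\d+)?", call_judge_llm(prompt))
    return float(match.group()) / 10 if match else 0.0

if __name__ == "__main__":
    print(f"{judge_score(Path('copy/headline.md').read_text().strip()):.3f}")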

Combinations

Pair with other MCPs for 10x leverage

autoresearch-skill + filesystem

Persist iteration traces for inspection

Save trace to /research/traces/<ts>.md after each loop.
autoresearch-skill + github

Open a PR with the winning candidate

When loop finishes, open PR titled 'autoresearch: <metric> +X%'.

Tools

What this MCP exposes

Tool | Inputs | When to call | Cost
loop | goal, verifier, max_iter, budget_tokens? | Closed-loop optimization | Variable; bound by budget
trace | loop_id? | Inspect a run | 0
rollback | to_iteration | Loop went off the rails | 0

Cost & Limits

What this costs to run

API quota: bound by your LLM provider's limits
Tokens per call: heavy; a full loop can run 100k+ tokens
Monetary: free to install; LLM usage cost is yours
Tip: always set max_iter and budget_tokens; open-ended loops will burn money

Security

Permissions, secrets, blast radius

Credential storage: None
Data egress: Bound by your LLM provider

Troubleshooting

Common errors and fixes

Loop stuck: same proposal every iteration

Increase the proposer's exploration temperature, or seed the loop with diverse candidates.

Verifier fails inconsistently

Verifier flakiness invalidates the loop. Pin seeds and repeat verification N=3 times per iteration, keeping the median.
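A hedged sketch of that fix, assuming the verifier is /bench/run.sh and that it respects a SEED environment variable (both assumptions): pin the seed and take the median of three runs so one flaky run cannot flip a keep/discard decision.

# Hypothetical de-flaking wrapper: pinned seed, median of N verification runs.
import os
import statistics
import subprocess

def stable_score(cmd=("bash", "bench/run.sh"), runs: int = 3) -> float:
    env = {**os.environ, "SEED": "1234", "PYTHONHASHSEED": "0"}   # pin randomness
    scores = []
    for _ in range(runs):
        out = subprocess.run(cmd, capture_output=True, text=True, env=env, check=True)
        scores.append(float(out.stdout.strip()))
    return statistics.median(scores)    # one flaky run cannot flip the decision

if __name__ == "__main__":
    print(f"{stable_score():.4f}")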

Budget exhausted before convergence

Inspect the trace: if gains are still monotonic, raise the budget; otherwise the verifier or the proposer is the bottleneck.

Alternatives

Autoresearch (Karpathy-style) vs others

Alternative | When to use it instead | Tradeoff
wanshuiyin/Auto-claude-code-research-in-sleep (ARIS) | You want overnight, async ML research loops specifically | ARIS is focused on ML; autoresearch is general-purpose
Manual A/B with scripted iteration | The goal is small and one-off | The skill removes the orchestration overhead

More

Resources

📖 Read the official README on GitHub

🐙 Browse open issues

🔍 Browse all 400+ MCP servers and Skills