Autoresearch - Karpathy's 630-line script ran 100 ML experiments while he slept

What problem does it solve

"The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement." - Andrej Karpathy, March 2026

You know the feeling: you have a hypothesis about your training loop ('what if I try a different attention scaling?'), you spend 20 minutes editing code, stare at a GPU for 45 minutes to get one data point, then repeat. A motivated human researcher gets through 8–10 experiment cycles in a full working day, and most of that time is pure waiting. Before autoresearch, your options were manual iteration at human speed, hyperparameter sweep tools that tune numbers but not architecture, or expensive multi-agent frameworks that require significant setup. Now you write a markdown file explaining your research direction, point an AI agent at it, and wake up to roughly 100 completed experiments with every improvement already committed to git.


How it works

The repo has three files that matter: `prepare.py` (data prep, locked; the agent never touches it), `train.py` (the training script, the agent's playground), and `program.md` (a plain-English instruction file you write to tell the agent what to explore). You run `uv run prepare.py` once to download data, then point Claude Code or another coding agent at `program.md` and let it loose. The agent creates a git branch, reads the full codebase, proposes a hypothesis (e.g. 'try reducing learning rate warmup'), edits `train.py`, runs training for exactly 5 minutes, reads the `val_bpb` score from the log (lower is better), and either advances the branch (if the score improved) or does a `git reset` (if it didn't). Then it loops. The fixed 5-minute budget is the clever design insight: it makes every experiment directly comparable regardless of what the agent changed (model size, batch size, architecture), and it steers the agent toward the best config it can find for your specific hardware.
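The keep-or-revert loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repo's code: in the source's description the coding agent itself drives the loop by following `program.md`. The log line format `val_bpb: <number>` and the callables `run_experiment`, `keep`, and `revert` are hypothetical stand-ins for "edit `train.py` and train for the fixed budget", "commit on the branch", and "`git reset`".

```python
import re
from typing import Callable, Optional

# Hypothetical log format: a line such as "val_bpb: 0.8612". The real log may differ.
VAL_BPB_RE = re.compile(r"val_bpb[:=]\s*([0-9]*\.?[0-9]+)")


def parse_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb value found in the log, or None if there is none."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def keep_or_revert(
    run_experiment: Callable[[int], str],
    keep: Callable[[int], None],
    revert: Callable[[int], None],
    baseline: float,
    n_experiments: int,
) -> float:
    """Run n experiments and keep only those that lower val_bpb."""
    best = baseline
    for i in range(n_experiments):
        log_text = run_experiment(i)   # stand-in: edit train.py, train for the fixed budget
        score = parse_val_bpb(log_text)
        if score is not None and score < best:
            best = score
            keep(i)                    # stand-in: commit the change on the branch
        else:
            revert(i)                  # stand-in: git reset to the last kept state
    return best


# Usage with canned logs instead of real training runs.
fake_logs = ["val_bpb: 0.870", "val_bpb: 0.860", "crashed", "val_bpb: 0.861"]
kept, reverted = [], []
best = keep_or_revert(lambda i: fake_logs[i], kept.append, reverted.append, 0.862, len(fake_logs))
print(best, kept, reverted)
```

Treating a missing score the same as a regression means a crashed run is simply reverted, so the branch only ever moves forward on a measured improvement.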

Key takeaways

01. Fixed 5-minute training budget: every experiment is time-boxed to exactly 5 minutes, so the agent can fairly compare wildly different architectures (a small model with more layers vs a large model with fewer) on a single metric, removing ...

02. program.md as your research strategy: you control the agent entirely through a markdown file, not Python code; the human's job shifts from writing training code to writing research direction, which means your domain knowledge goes further...

03. Git-native keep-or-revert loop: every improvement gets committed and every failure gets reverted, so your repo always reflects the best configuration found so far and you have a complete audit trail of what the agent tried and why

04. Single-metric optimization (val_bpb): one vocabulary-size-independent number (validation bits per byte, lower is better) drives all agent decisions, which keeps the objective unambiguous; the agent either improved it or didn't

05. ~100 experiments overnight: 12 experiments per hour on a single H100 means 8 hours of sleep = 96 experiments; Karpathy's own run produced 700 experiments over two days with 20 genuine improvements on an already well-tuned codeb...

06. Crash recovery: if a run crashes, the agent reads the last 50 lines of the log, attempts a fix, and retries; if it can't recover after a few attempts, it gives up and moves to the next hypothesis rather than spinning forever

07. Community fork ecosystem: the main repo links to forks for Apple Silicon (MLX, no CUDA required), Windows RTX cards, and smaller NVIDIA GPUs, so you're not blocked if you don't have an H100
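The throughput figure in takeaway 05 follows directly from the fixed budget, and a quick calculation makes it easy to redo for your own setup. The sketch below is a back-of-the-envelope helper, not part of the repo; the `overhead_minutes` parameter is a hypothetical addition to account for time the agent spends editing code and reading logs between runs, which the source does not quantify.

```python
def experiments_per_hour(budget_minutes: float, overhead_minutes: float = 0.0) -> float:
    """Experiments per hour for a fixed training budget plus optional per-run overhead."""
    return 60.0 / (budget_minutes + overhead_minutes)


def experiments_in(hours: float, budget_minutes: float, overhead_minutes: float = 0.0) -> int:
    """Whole experiments that fit in the given number of hours."""
    return int(hours * 60 // (budget_minutes + overhead_minutes))


print(experiments_per_hour(5))                    # 5-minute budget, no overhead
print(experiments_in(8, 5))                       # one night of sleep
print(experiments_in(8, 5, overhead_minutes=1))   # assumed 1 minute of overhead per run
```

With zero overhead this reproduces the 12 per hour and 96 per night quoted above; any per-run overhead lowers the count, so the numbers are best read as an upper bound.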

Should you care?

Who it’s for

If you do ML research or engineering and spend time running experiments manually (training LLMs, fine-tuning models, benchmarking architectural changes), this is the tool that multiplies your throughput overnight. It is also compelling for any developer building on top of nanochat or similar small training setups who wants to optimize for specific hardware. It is not the right tool if you're training models too large to fit on one GPU (there is no multi-GPU support), if you need offline or air-gapped environments, or if your experiments take more than a few minutes each to reach a meaningful signal.

Worth exploring

Yes, and the urgency is real: this is Karpathy at his most distilled, and the pace at which the repo reached 37.5k stars tells you the ML community recognized the pattern immediately. The practical value is concrete: Shopify's CEO adapted the loop to a query-expansion model overnight and woke up to a 0.8B model outperforming his hand-tuned 1.6B baseline by 19%, a smaller model winning because the agent optimized for his hardware, not the default. The one honest limitation is that the gains are real but incremental: Karpathy's best run improved val_bpb from 0.862 to 0.858, meaningful but not dramatic. The agent also has no memory of why things worked; it just keeps what scores better.
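To put the val_bpb figures above in perspective, it helps to see how such a number is usually computed and how small the quoted change is in relative terms. The sketch below assumes the standard definition of bits per byte (total cross-entropy in nats, converted to bits, divided by the number of bytes of text); the function names are illustrative and not taken from the repo.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a text into bits per byte."""
    return total_loss_nats / (math.log(2) * total_bytes)


def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric."""
    return (before - after) / before


# Two hypothetical tokenizers on the same 1000-byte text: different token counts,
# same total loss, hence the same bits per byte.
coarse = bits_per_byte(total_loss_nats=50 * 4.0, total_bytes=1000)   # 50 tokens, 4.0 nats each
fine = bits_per_byte(total_loss_nats=100 * 2.0, total_bytes=1000)    # 100 tokens, 2.0 nats each
print(coarse == fine)

# The improvement quoted above, as a percentage.
print(f"{100 * relative_improvement(0.862, 0.858):.2f}%")
```

Because the denominator is bytes rather than tokens, a change of tokenizer or vocabulary size does not by itself move the score, which is what lets one number rank very different configurations. The quoted gain from 0.862 to 0.858 works out to a reduction of a bit under half a percent.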
