Autoresearch - Karpathy's 630-line script ran 100 ML experiments while he slept
Snaplyze Digest
GitHub Repos · Advanced · 4 min read · Mar 15, 2026 · Updated Mar 31, 2026

“Karpathy ran 700 experiments over two days, found 20 improvements he'd missed manually, and caught a bug in his own code after months — all while sleeping.”

In Short

Karpathy pointed an AI agent at a training script, went to bed, and woke up to 110+ git commits, 20 genuine improvements, and an 11% efficiency gain on a codebase he thought was already well-optimized — the agent even caught a QK-Norm bug he'd missed for months. autoresearch is a three-file, single-GPU framework where an AI agent modifies training code, runs 5-minute experiments, keeps improvements via git, reverts failures, and loops overnight without you touching anything. It kills the most tedious part of ML research: the 45-minute wait between hypothesis and result, repeated 8 times a day.

llm · ai-agents · ml-research · open-source · python
Why It Matters
The practical pain point this digest is really about.

You know that feeling when you have a hypothesis about your training loop — 'what if I try a different attention scaling?' — and you spend 20 minutes editing code, then stare at a GPU for 45 minutes to get one data point, then repeat? A motivated human researcher gets through 8–10 experiment cycles in a full working day, with most of that time being pure waiting. Before autoresearch, your only options were manual iteration at human speed, hyperparameter sweep tools that only tune numbers (not architecture), or expensive multi-agent frameworks that require significant setup. Now: you write a markdown file explaining your research direction, point an AI agent at it, and wake up to 100 completed experiments with all improvements already committed to git.

How It Works
The mechanism, architecture, or workflow behind it.

The repo has exactly three files: `prepare.py` (data prep, locked — the agent never touches this), `train.py` (the training script — the agent's playground), and `program.md` (a plain English instruction file you write to tell the agent what to explore). You run `uv run prepare.py` once to download data, then point Claude Code or another coding agent at `program.md` and let it loose. The agent creates a git branch, reads the full codebase, proposes a hypothesis (e.g. 'try reducing learning rate warmup'), edits `train.py`, runs training for exactly 5 minutes, reads the `val_bpb` score from the log, and either advances the branch (if it improved) or does `git reset` (if it didn't). Then it loops. The 5-minute fixed budget is the clever design insight: it makes every experiment directly comparable regardless of what the agent changed — model size, batch size, architecture — and means the agent auto-discovers the optimal config for your specific hardware.

Key Takeaways
7 fast bullets that make the core value obvious.
  • Fixed 5-minute training budget — every experiment is time-boxed to exactly 5 minutes, so the agent can fairly compare wildly different architectures (a small model with more layers vs a large model with fewer) on a single GPU
  • program.md as your research strategy — you control the agent entirely through a markdown file, not Python code; the human's job shifts from writing training code to writing research direction, which means your domain knowledge, not implementation effort, steers the search
  • Git-native keep-or-revert loop — every improvement gets committed, every failure gets reverted, so your repo always reflects the best configuration found so far and you have a complete audit trail of what the agent tried
  • Single metric optimization (val_bpb) — one vocabulary-size-independent performance number drives all agent decisions, making the objective function impossible to game or misinterpret; the agent either improved it or didn't
  • ~100 experiments overnight throughput — 12 experiments per hour on a single H100 means 8 hours of sleep = 96 experiments; Karpathy's own run produced 700 experiments over two days with 20 genuine improvements on an already well-optimized codebase
  • Crash recovery — if a run crashes, the agent reads the last 50 lines of the log, attempts a fix, and retries; if it can't recover after a few attempts it gives up and moves to the next hypothesis rather than spinning forever
  • Community fork ecosystem — the main repo links to forks for Apple Silicon (MLX, no CUDA required), Windows RTX cards, and smaller NVIDIA GPUs, so you're not blocked if you don't have an H100
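To make the `program.md` bullet concrete, here is a hypothetical instruction file. Everything in it is illustrative; this digest does not show the real file's contents or any required structure, only that it is a plain English markdown file stating the research direction:

```markdown
# Research direction

Objective: minimize val_bpb within the fixed 5-minute training budget.

Ideas worth exploring:
- learning-rate warmup length and schedule shape
- attention scaling variants
- trading model width against depth at a fixed time budget

Rules:
- never modify prepare.py
- commit every improvement; hard-reset every regression
- if a run crashes, read the tail of the log, attempt a fix, then move on
```

The point of the format is the division of labor: the human encodes strategy and constraints in prose, and the agent translates them into concrete edits to `train.py`.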
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you do ML research or engineering and spend time running experiments manually — training LLMs, fine-tuning models, benchmarking architectural changes — this is the tool that multiplies your throughput overnight. Also compelling for any developer building on top of nanochat or similar small training setups who wants to optimize for specific hardware. Not the right tool if you're training models...

Worth Exploring?

Yes, and the urgency is real — this is Karpathy at his most distilled, and the 37.5k star velocity tells you the ML community recognized the pattern immediately. The practical value is concrete: Shopify's CEO adapted the loop to a query-expansion model overnight and woke up to a 0.8B model outperforming his hand-tuned 1.6B baseline by 19% — a smaller model winning because the agent optimized for his hardware, not the default. The one honest limitation: the gains are real but incremental — Karpathy's best run improved val_bpb from 0.862 to 0.858, meaningful but not dramatic — and the agent has...
