“Karpathy ran 700 experiments over two days, found 20 improvements he'd missed manually, and caught a bug in his own code after months — all while sleeping.”
Karpathy pointed an AI agent at a training script, went to bed, and woke up to 110+ git commits, 20 genuine improvements, and an 11% efficiency gain on a codebase he thought was already well-optimized — the agent even caught a QK-Norm bug he'd missed for months. autoresearch is a three-file, single-GPU framework where an AI agent modifies training code, runs 5-minute experiments, keeps improvements via git, reverts failures, and loops overnight without you touching anything. It kills the most tedious part of ML research: the 45-minute wait between hypothesis and result, repeated 8 times a day...
You know that feeling when you have a hypothesis about your training loop — 'what if I try a different attention scaling?' — and you spend 20 minutes editing code, then stare at a GPU for 45 minutes to get one data point, then repeat? A motivated human researcher gets through 8–10 experiment cycles in a full working day, with most of that time being pure waiting. Before autoresearch, your only options were manual iteration at human speed, hyperparameter sweep tools that only tune numbers (not architecture), or expensive multi-agent frameworks that require significant setup. Now: you write a markdown file explaining your research direction, point an AI agent at it, and wake up to 100 completed experiments with all improvements already committed to git.
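Since `program.md` is free-form plain English, a hypothetical file might look like the sketch below (the wording and headings are illustrative, not the repo's actual file):

```markdown
# Research direction

Improve val_bpb on the language model in train.py.
Ideas worth exploring: attention scaling variants, learning rate
warmup schedules, weight tying, batch size changes.
Keep every run within the fixed 5-minute training budget.
Never edit prepare.py. Commit improvements; revert regressions.
```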
The repo has exactly three files: `prepare.py` (data prep, locked — the agent never touches this), `train.py` (the training script — the agent's playground), and `program.md` (a plain-English instruction file you write to tell the agent what to explore). You run `uv run prepare.py` once to download data, then point Claude Code or another coding agent at `program.md` and let it loose. The agent creates a git branch, reads the full codebase, proposes a hypothesis (e.g. "try reducing learning rate warmup"), edits `train.py`, runs training for exactly 5 minutes, reads the `val_bpb` score from the log, and either advances the branch (if the score improved) or runs `git reset` (if it didn't). Then it loops. The fixed 5-minute budget is the key design insight: it makes every experiment directly comparable regardless of what the agent changed — model size, batch size, architecture — and means the agent auto-discovers the optimal config for your specific hardware.
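The keep-or-revert core of that loop is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the repo's actual code: the `val_bpb: <float>` log format and the helper names are assumptions, and the git calls are left as comments.

```python
import re


def parse_val_bpb(log_text: str) -> float:
    """Extract the last val_bpb score from a training log.
    NOTE: the 'val_bpb: <float>' format is an assumed log shape,
    not necessarily what train.py actually emits."""
    matches = re.findall(r"val_bpb[:=]\s*([0-9.]+)", log_text)
    if not matches:
        raise ValueError("no val_bpb found in log")
    return float(matches[-1])


def run_experiment_step(best_bpb: float, log_text: str) -> tuple[bool, float]:
    """One loop iteration: read the score, then keep (commit) or
    revert (git reset). Lower bits-per-byte is better. A real loop
    would shell out via subprocess.run for the git commands."""
    new_bpb = parse_val_bpb(log_text)
    if new_bpb < best_bpb:
        # subprocess.run(["git", "commit", "-am", f"val_bpb {new_bpb}"])
        return True, new_bpb
    # subprocess.run(["git", "reset", "--hard"])
    return False, best_bpb
```

Because the only signal the agent needs is a single scalar per run, the whole "advance or revert" decision reduces to one comparison, which is what lets the loop run unattended overnight.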
If you do ML research or engineering and spend time running experiments manually — training LLMs, fine-tuning models, benchmarking architectural changes — this is the tool that multiplies your throughput overnight. Also compelling for any developer building on top of nanochat or similar small training setups who wants to optimize for specific hardware. Not the right tool if you're training models...
The urgency is real — this is Karpathy at his most distilled, and the 37.5k-star velocity tells you the ML community recognized the pattern immediately. The practical value is concrete: Shopify's CEO adapted the loop to a query-expansion model overnight and woke up to a 0.8B model outperforming his hand-tuned 1.6B baseline by 19% — a smaller model winning because the agent optimized for his hardware, not the default. The one honest limitation: the gains are real but incremental — Karpathy's best run improved val_bpb from 0.862 to 0.858, meaningful but not dramatic — and the agent has...