"The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement." (Andrej Karpathy, March 2026)
You know the feeling: you have a hypothesis about your training loop ('what if I try a different attention scaling?'), you spend 20 minutes editing code, you stare at a GPU for 45 minutes to get one data point, and then you repeat. A motivated human researcher gets through 8–10 experiment cycles in a full working day, and most of that time is pure waiting. Before autoresearch, your options were manual iteration at human speed, hyperparameter sweep tools that only tune numbers (not architecture), or expensive multi-agent frameworks that require significant setup. Now you write a Markdown file explaining your research direction, point an AI agent at it, and wake up to roughly 100 completed experiments, with every improvement already committed to git.
The repo has three files that matter: `prepare.py` (data prep, locked; the agent never touches it), `train.py` (the training script and the agent's playground), and `program.md` (a plain-English instruction file you write to tell the agent what to explore). You run `uv run prepare.py` once to download data, then point Claude Code or another coding agent at `program.md` and let it loose. The agent creates a git branch, reads the full codebase, proposes a hypothesis (e.g. 'try reducing learning rate warmup'), edits `train.py`, runs training for exactly 5 minutes, reads the `val_bpb` score from the log, and either advances the branch (if the score improved) or does a `git reset` (if it didn't). Then it loops. The fixed 5-minute budget is the clever design insight: it makes every experiment directly comparable regardless of what the agent changed (model size, batch size, architecture), and it means the agent converges on the best config it can find for your specific hardware rather than a generic default.
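The keep-or-revert decision at the heart of that loop can be sketched in a few lines of Python. This is a minimal illustration, not the repo's code: the log format (`val_bpb: 0.8612`), the helper names `parse_val_bpb`, `keep_or_revert`, and `run_loop`, and the fake logs are all assumptions made for the example. It treats a lower `val_bpb` as better and a run that logs no score (a crash, say) as a failed experiment.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumed log format: a line containing "val_bpb: 0.8612" or "val_bpb=0.8612".
# The real train.py output may look different.
VAL_BPB_RE = re.compile(r"val_bpb\s*[:=]\s*([0-9]*\.?[0-9]+)")


def parse_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb value found in the log, or None if there is none."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def keep_or_revert(best: float, candidate: Optional[float]) -> bool:
    """True means keep the change; lower val_bpb is treated as better."""
    return candidate is not None and candidate < best


def run_loop(
    run_experiment: Callable[[int], str], baseline: float, n: int
) -> Tuple[float, List[Tuple[int, Optional[float], bool]]]:
    """Hill-climb over n experiments, keeping only changes that beat the best score."""
    best = baseline
    history = []
    for i in range(n):
        # Stand-in for: agent edits train.py, trains for 5 minutes, writes a log.
        log_text = run_experiment(i)
        score = parse_val_bpb(log_text)
        kept = keep_or_revert(best, score)
        if kept:
            best = score  # stand-in for advancing the branch
        # When not kept, the real loop would discard the edit (git reset).
        history.append((i, score, kept))
    return best, history


if __name__ == "__main__":
    # Hypothetical logs, invented for this demo.
    fake_logs = [
        "step 100 val_bpb: 0.870",
        "step 100 val_bpb: 0.860",
        "crashed: out of memory",
        "step 100 val_bpb: 0.858",
    ]
    best, history = run_loop(lambda i: fake_logs[i], baseline=0.862, n=len(fake_logs))
    for i, score, kept in history:
        print(i, score, "keep" if kept else "revert")
    print("best:", best)
```

Because every run gets the same time budget, comparing raw scores like this is fair even when the agent changes model size or batch size between experiments.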
If you do ML research or engineering and spend time running experiments manually (training LLMs, fine-tuning models, benchmarking architectural changes), this is the tool that multiplies your throughput overnight. It is also compelling for any developer building on nanochat or a similar small training setup who wants to optimize for specific hardware. It is not the right tool if you're training models too large to fit on one GPU (there is no multi-GPU support), if you need to work in offline or air-gapped environments, or if your experiments take more than a few minutes each to reach a meaningful signal.
Yes, and the urgency is real: this is Karpathy at his most distilled, and the speed at which the repo reached 37.5k stars tells you the ML community recognized the pattern immediately. The practical value is concrete. Shopify's CEO adapted the loop to a query-expansion model overnight and woke up to a 0.8B model outperforming his hand-tuned 1.6B baseline by 19%, a smaller model winning because the agent optimized for his hardware rather than the defaults. The honest limitations: the gains are real but incremental (Karpathy's best run improved `val_bpb` from 0.862 to 0.858, meaningful but not dramatic), and the agent has no memory of why things worked; it just keeps what scores better.
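To put "meaningful but not dramatic" in numbers, here is the relative change for the `val_bpb` figures quoted above. The helper name `relative_improvement` is made up for this example, and it assumes a lower-is-better metric.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric."""
    return (before - after) / before


gain = relative_improvement(0.862, 0.858)
print(f"{gain:.2%}")  # prints 0.46%
```

That is a reduction of a little under half a percent.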