“"at 500 lines long, it's much, much, much more digestable than a lot of the vomit that comes out of so-called production systems" — mdaniel, Hacker News (https://news.ycombinator.com/item?id=44962059)”
You know that feeling when you copy a HuggingFace tutorial, get it working, and still have no idea why the attention mask looks the way it does or what the loss curve is actually telling you? Fine-tuning a model without understanding its internals means every unexpected behavior is a mystery you can't debug. You read blog posts explaining transformers with colored boxes and animations, but translating that into inspectable, debuggable code is a different skill entirely. This repo closes that gap: every building block is written in plain PyTorch so you can step through it in a debugger and see the numbers change.
You start with raw text, implement a tokenizer that splits it into tokens and maps them to integer IDs, then build multi-head self-attention from scratch using PyTorch matrix operations — no imported attention layer, just tensor math. Once attention works, you stack it into a GPT architecture identical in structure to GPT-2. Chapter 5 walks you through a pretraining loop on unlabeled data; Chapters 6 and 7 cover classification and instruction fine-tuning on the same codebase. Each notebook is self-contained, so you can run any chapter independently and inspect intermediate tensor shapes. The repo also loads pretrained GPT-2 weights from OpenAI, so you can verify your architecture matches a real model by comparing outputs.
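To make the "just tensor math" point concrete, here is a minimal single-head sketch of the causal attention core. The class name, dimensions, and toy input are illustrative rather than the repo's own code, and the repo builds the full multi-head version, but the scaled dot-product and upper-triangular mask are the same idea you step through in the notebooks.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention; names and sizes here are illustrative."""

    def __init__(self, d_in, d_out, context_length):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)
        # True above the diagonal marks "future" positions each token may not see.
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):                         # x: (batch, seq_len, d_in)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        n = x.shape[1]
        scores = scores.masked_fill(self.mask[:n, :n], float("-inf"))
        weights = torch.softmax(scores, dim=-1)   # each row sums to 1 over visible tokens
        return weights @ v                        # (batch, seq_len, d_out)

x = torch.randn(2, 6, 16)                         # toy batch: 2 sequences of 6 tokens
attn = CausalSelfAttention(d_in=16, d_out=16, context_length=6)
print(attn(x).shape)                              # torch.Size([2, 6, 16])
```

Because the mask sets future positions to negative infinity before the softmax, each token's output can only mix information from itself and earlier tokens, which is exactly what lets the same stack be trained with next-token prediction.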
If you have intermediate Python and basic ML knowledge and want to go from 'I know transformers exist' to 'I can read and write transformer code without a library', this is the most structured path among comparably popular resources. You benefit most if you're a backend or ML engineer who uses HuggingFace daily but can't confidently explain why causal masking works or how LoRA modifies weight matrices (a minimal sketch of the latter follows below). Not useful if you need production deployment patterns, RAG pipelines, or multi-GPU training — the Manning page explicitly excludes those from scope.
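On the LoRA point: here is a minimal sketch of how a low-rank adapter modifies a frozen linear layer. The class name, rank, and alpha values are hypothetical defaults for illustration, not the repo's implementation; the idea is that the pretrained weight stays frozen and only the small A and B matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained linear layer with a trainable low-rank update.
    Class name, rank, and alpha are hypothetical defaults, not the repo's values."""

    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():        # freeze the pretrained weight W
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(linear.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, linear.out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # Output is W x plus a scaled low-rank correction; only A and B get gradients.
        return self.linear(x) + self.scaling * (x @ self.A @ self.B)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 10, 768))
print(out.shape)                                  # torch.Size([4, 10, 768])
```

With B initialized to zeros, the adapted layer starts out behaving exactly like the original, and fine-tuning only has to learn the small rank-r correction instead of the full weight matrix.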
Yes, if your goal is foundational LLM understanding — the 94k stars, active CI, and a published Manning book behind it make this the most validated educational resource in the category. Know this upfront: the core 7-chapter curriculum is frozen to match the printed book, keeping it anchored to GPT-2-era architecture; newer material lives in bonus chapters. Apple Silicon MPS users should expect reproducibility issues (open issue #977) and should run on CPU or CUDA instead until that is resolved.
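If you want to sidestep the MPS issue programmatically, a generic PyTorch device-selection pattern (not repo-specific code) keeps training on CUDA when available and otherwise on CPU; move the model and batches with `.to(device)` as usual.

```python
import torch

# Prefer CUDA, otherwise fall back to CPU; skip MPS until the
# reproducibility issue is resolved.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```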