R&D · intermediate · 2 min read · Mar 16, 2026 · Updated Mar 19, 2026

How DeepSeek trained a frontier LLM for $5.5M — and why everyone copied it

“DeepSeek trained a frontier LLM for $5.5M. OpenAI's GPT-4 reportedly cost $100M+. The architecture that made this possible is now spreading through every open-weight model.”

Source · blog.bytebytego.com

“The gap between architectural novelty and real model behavior is wider than benchmark scores suggest.” (Pawel Jozefiak, commenting on ByteByteGo after testing at a Mistral hackathon)

You know that feeling when you want to run a frontier LLM but the API costs make your finance team cry? Before MoE architectures, compute cost grew linearly with parameter count: a 600B dense model meant paying for all 600B parameters on every single token. You were stuck choosing between a dumb cheap model and a smart expensive one. MoE changed the math: you now get the knowledge of 671B parameters while paying compute for only the 37B that are active on each token.
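
The arithmetic behind that claim is short enough to check by hand. The sketch below uses only the two parameter counts quoted above; the function names are illustrative, and the result is a ratio of per-token compute, not a measured price difference.

```python
# Parameter counts quoted in the text, in billions.
TOTAL_PARAMS_B = 671   # parameters stored in the model
ACTIVE_PARAMS_B = 37   # parameters actually used for each token


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that take part in one token's forward pass."""
    return active_b / total_b


def compute_reduction(total_b: float, active_b: float) -> float:
    """How many times less per-token compute than a dense model of the same total size."""
    return total_b / active_b


fraction = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
reduction = compute_reduction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"active share per token: {fraction:.1%}, compute reduction: {reduction:.1f}x")
```

Note that this ratio covers compute only. All 671B parameters still have to sit in memory, so the hardware footprint does not shrink by the same factor.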

llm · architecture · moe · open-source · deepseek · system-design · ai-infrastructure

Think of Mixture-of-Experts like a hospital with specialists. Instead of one massive neural network processing everything, you have many smaller 'expert' networks (anywhere from 16 to 384 of them, depending on the model) plus a router that decides which experts handle each token. DeepSeek V3 has 671B total parameters, with 256 routed experts in each MoE layer, but only 37B fire per token. The router learns which experts specialize in what: maybe one handles code, another math, another creative writing. Add Multi-Head Latent Attention (MLA), which compresses the memory-heavy KV-cache into a smaller latent space, and you get long context without the memory bloat.
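
To make the router idea concrete, here is a minimal sketch of generic top-k routing in plain Python. It is a toy, not DeepSeek's actual router: the experts are simple scaling functions, the router logits are random numbers standing in for a learned projection of the token, and every name (`route_token`, `moe_forward`, `NUM_EXPERTS`, `TOP_K`) is hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, k))


def make_expert(scale):
    """Toy expert: multiplies its input by a fixed scale (assumption for illustration)."""
    return lambda x: scale * x


NUM_EXPERTS = 8  # toy value, far smaller than a real model
TOP_K = 2        # experts activated per token in this toy

experts = [make_expert(s) for s in range(1, NUM_EXPERTS + 1)]
rng = random.Random(0)
# In a real model these logits come from a learned layer applied to the token.
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(router_logits, TOP_K)
output = moe_forward(1.0, experts, router_logits, TOP_K)
print(selected, output)
```

Only `TOP_K` of the `NUM_EXPERTS` expert functions are ever called for a token, which is where the compute saving comes from. The unselected experts still exist and still take up memory.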

01. Mixture-of-Experts routing. Why YOU care: the model stores 671B parameters of knowledge but computes with only 37B per token, roughly 18x less compute per token than a dense model of the same size.
02. Multi-Head Latent Attention (MLA). Why YOU care: it compresses the KV-cache into a lower-dimensional latent space, letting you run longer contexts before running out of GPU memory.
03. FP8 training. Why YOU care: FP8 values take half the bytes of FP16, which halves memory traffic per value and helped DeepSeek reach frontier quality in just 2.788M H800 GPU hours (the basis of the roughly $5.5M figure, at DeepSeek's assumed $2 per GPU hour).
04. Auxiliary-loss-free load balancing. Why YOU care: the router learns to spread tokens evenly across experts without the auxiliary balancing loss that costs traditional MoE models some quality.
05. DeepSeek Sparse Attention (DSA). Why YOU care: for long contexts, each token attends only to the most relevant previous tokens instead of all of them, so attention cost grows roughly linearly with context length instead of quadratically.
06. Open-weight with MIT license. Why YOU care: you can fine-tune, modify, and deploy commercially without the restrictive clauses in Llama's community license.
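
To put a rough number on the KV-cache point in the list above, the sketch below compares the per-token cache of standard multi-head attention with a cache that stores one compressed latent vector per layer. All dimensions are illustrative assumptions chosen to be in the ballpark of a large model; they are not taken from the source, and the function names are hypothetical.

```python
# Illustrative dimensions (assumptions, not figures from the article).
N_LAYERS = 61         # transformer layers
N_HEADS = 128         # attention heads per layer
HEAD_DIM = 128        # dimension of each head
LATENT_DIM = 512      # size of the compressed KV latent per token
ROPE_DIM = 64         # extra positional key dimensions cached alongside the latent
BYTES_PER_VALUE = 2   # 16-bit cache entries


def mha_cache_bytes_per_token(layers, heads, head_dim, bytes_per_value):
    """Standard attention caches a full key and a full value for every head and layer."""
    return 2 * layers * heads * head_dim * bytes_per_value


def latent_cache_bytes_per_token(layers, latent_dim, rope_dim, bytes_per_value):
    """A latent cache stores one compressed vector (plus positional part) per layer."""
    return layers * (latent_dim + rope_dim) * bytes_per_value


mha = mha_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM, BYTES_PER_VALUE)
latent = latent_cache_bytes_per_token(N_LAYERS, LATENT_DIM, ROPE_DIM, BYTES_PER_VALUE)
ratio = mha / latent

context_tokens = 128 * 1024  # example context length
print(f"full cache:   {mha * context_tokens / 1e9:.1f} GB")
print(f"latent cache: {latent * context_tokens / 1e9:.1f} GB")
print(f"reduction:    {ratio:.1f}x")
```

The exact ratio depends entirely on the assumed dimensions, so treat the output as an order-of-magnitude illustration of why a compressed cache makes long contexts fit in GPU memory.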
Who it’s for

If you're evaluating which open-weight LLM to deploy — whether for cost optimization, fine-tuning, or building on top — this explains the architectural bets behind each option. Also useful if you're tracking where the open-source AI ecosystem is heading and which innovations are spreading fastest. Not for you if you just want a quick 'which model should I use' answer without understanding the tradeoffs.

Worth exploring

The open-weight ecosystem is genuinely competitive with closed models now: Kimi K2.5 reportedly matches Opus at about 10% of the cost, and DeepSeek V3.2 is priced at $0.25 per million input tokens. The architectural convergence around MoE means these gains are structural, not temporary. The one thing to know: you're trading some polish and tool-calling reliability for massive cost savings. If your use case tolerates occasional retries, the economics are compelling.
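
The retry tradeoff can be sanity-checked with a small estimate. This sketch uses the $0.25 per million input tokens quoted above and assumes, purely for illustration, that each request fails independently with a fixed probability and is retried until it succeeds. The retry rates and workload sizes are hypothetical, and output-token costs are ignored.

```python
PRICE_PER_M_INPUT_USD = 0.25  # input price quoted in the text


def expected_input_cost(tokens_per_request, requests, retry_rate,
                        price_per_m=PRICE_PER_M_INPUT_USD):
    """Expected input-token spend when failed requests are retried until they succeed.

    Assumes independent failures, so the expected number of attempts per request
    is 1 / (1 - retry_rate).
    """
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    expected_attempts = 1 / (1 - retry_rate)
    total_tokens = tokens_per_request * requests * expected_attempts
    return total_tokens / 1_000_000 * price_per_m


# Hypothetical workload: 1,000 requests of 1,000 input tokens each.
no_retries = expected_input_cost(1_000, 1_000, retry_rate=0.0)
with_retries = expected_input_cost(1_000, 1_000, retry_rate=0.2)
print(f"no retries: ${no_retries:.4f}, 20% retry rate: ${with_retries:.4f}")
```

Even a fairly high retry rate only multiplies the bill by a small factor, which is why a large per-token price gap can outweigh some loss in reliability.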
