How DeepSeek trained a frontier LLM for $5.5M — and why everyone copied it
Snaplyze Digest
R&D · Intermediate · 2 min read · Mar 16, 2026 · Updated Mar 19, 2026

“DeepSeek trained a frontier LLM for $5.5M. OpenAI's GPT-4 reportedly cost $100M+. The architecture that made this possible is now spreading through every open-weight model.”

In Short

DeepSeek V3 trained a 671-billion-parameter model for just $5.576 million using Multi-Head Latent Attention and FP8 training — then open-sourced the architecture. Within months, Moonshot AI scaled it to 1 trillion parameters (Kimi K2), and Zhipu AI integrated its sparse attention into GLM-5. This is how the open-weight ecosystem actually works: teams build on each other's innovations in public, and costs compound downward. You get frontier-class models at 10% of closed-source pricing.

Tags: llm, architecture, moe, open-source, deepseek
Why It Matters
The practical pain point this digest is really about.

You know that feeling when you want to run a frontier LLM but the API costs make your finance team cry? Before MoE architectures, every parameter you added meant linear cost increases. A 600B dense model meant paying for 600B parameters on every single token. You were stuck choosing between a dumb cheap model or a smart expensive one. MoE changed the math: you now get the knowledge of 671B parameters while only paying for 37B active ones per token.
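The cost math above can be sketched in a few lines. This is illustrative arithmetic using the parameter counts from the digest, not a benchmark; the ~2 FLOPs-per-parameter rule of thumb for a forward pass is a standard approximation, and the "671B dense model" is a hypothetical comparison point.

```python
# Illustrative arithmetic: MoE decouples stored knowledge from per-token compute.
total_params = 671e9   # DeepSeek V3 total parameters (knowledge stored)
active_params = 37e9   # parameters actually activated per token

# Per-token forward-pass FLOPs scale with ACTIVE parameters
# (~2 FLOPs per parameter per token, a common rule of thumb).
dense_flops_per_token = 2 * total_params   # hypothetical 671B dense model
moe_flops_per_token = 2 * active_params    # the MoE pays only for active experts

speedup = dense_flops_per_token / moe_flops_per_token
print(f"Compute reduction vs an equal-size dense model: ~{speedup:.0f}x")
```

That ratio (671/37 ≈ 18x) is where the "smart model at cheap-model prices" economics come from.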

How It Works
The mechanism, architecture, or workflow behind it.

Think of Mixture-of-Experts like a hospital with specialists. Instead of one massive neural network processing everything, you have multiple smaller 'expert' networks (16-384 of them) plus a router that decides which experts handle each token. DeepSeek V3 has 671B total parameters across 256 experts, but only 37B fire per token. The router learns which experts specialize in what — maybe one handles code, another handles math, another handles creative writing. Add Multi-Head Latent Attention (MLA), which compresses the memory-heavy KV-cache into a smaller latent space, and you get long context without the memory bloat.
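The hospital-with-specialists idea boils down to top-k routing: score every expert per token, run only the best k, and mix their outputs by softmax weight. Here is a minimal sketch of that mechanism — toy dimensions and random weights, not DeepSeek's actual architecture (which also adds a shared expert and auxiliary-loss-free balancing).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))                  # learned router
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """x: (d_model,) one token. Route it to top_k experts, mix their outputs."""
    logits = x @ router_w                                         # (n_experts,) scores
    chosen = np.argsort(logits)[-top_k:]                          # indices of top-k experts
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum() # softmax over chosen
    # Only the selected experts compute — the other specialists stay idle.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape)  # (16,)
```

Scaled up, this is how 671B parameters of knowledge can sit in memory while each token only pays for 37B of compute.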

Key Takeaways
6 fast bullets that make the core value obvious.
  • Mixture-of-Experts routing — why YOU care: stores 671B parameters of knowledge but only computes 37B per token, cutting inference cost by ~18x vs dense models
  • Multi-Head Latent Attention (MLA) — why YOU care: compresses KV-cache into lower-dimensional space, letting you run longer contexts without running out of GPU memory
  • FP8 training — why YOU care: halves memory bandwidth compared to FP16, which is a big part of how DeepSeek reached frontier quality in just 2.788M H800 GPU hours
  • Auxiliary-loss-free load balancing — why YOU care: the router learns to distribute work evenly without the performance penalty traditional MoE models suffer from
  • DeepSeek Sparse Attention (DSA) — why YOU care: for long contexts, it only attends to relevant previous tokens instead of all of them, making costs grow linearly instead of quadratically
  • Open-weight with MIT license — why YOU care: you can fine-tune, deploy commercially, and modify without the restrictive clauses in Llama's community license
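To make the MLA bullet concrete, here is back-of-the-envelope KV-cache arithmetic. The dimensions below are illustrative assumptions for a large model, not DeepSeek's exact configuration; the point is the ratio between caching full per-head keys/values and caching one small latent vector per token.

```python
# Hypothetical dimensions for a large transformer (assumptions, not DeepSeek's).
n_layers, n_heads, head_dim = 60, 128, 128
latent_dim = 512                   # MLA: K and V compressed into one small latent
seq_len, bytes_per = 128_000, 2    # 128K context, 2 bytes per value (FP16/BF16)

# Standard multi-head attention caches full K AND V for every head.
std_kv = n_layers * seq_len * 2 * n_heads * head_dim * bytes_per
# MLA caches only the shared latent per token per layer.
mla_kv = n_layers * seq_len * latent_dim * bytes_per

print(f"standard MHA cache: {std_kv / 1e9:.0f} GB")   # hundreds of GB
print(f"MLA latent cache:   {mla_kv / 1e9:.1f} GB")   # single-digit GB
```

Under these assumptions the latent cache is roughly 60x smaller, which is why long contexts stop blowing out GPU memory.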
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you're evaluating which open-weight LLM to deploy — whether for cost optimization, fine-tuning, or building on top — this explains the architectural bets behind each option. Also useful if you're tracking where the open-source AI ecosystem is heading and which innovations are spreading fastest. Not for you if you just want a quick 'which model should I use' answer without understanding the trade-offs.

Worth Exploring?

The open-weight ecosystem is genuinely competitive with closed models now — Kimi K2.5 matches Opus at 10% of the cost, and DeepSeek V3.2 is priced at $0.25/M input tokens. The architectural convergence around MoE means these gains are structural, not temporary. The one thing to know: you're trading some polish and tool-calling reliability for massive cost savings. If your use case tolerates occasional retries, the economics are compelling.
