“DeepSeek trained a frontier LLM for $5.5M. OpenAI's GPT-4 reportedly cost $100M+. The architecture that made this possible is now spreading through every open-weight model.”
DeepSeek V3 trained a 671-billion-parameter model for just $5.576 million using Multi-Head Latent Attention and FP8 training — then open-sourced the architecture. Within months, Moonshot AI scaled it to 1 trillion parameters (Kimi K2), and Zhipu AI integrated its sparse attention into GLM-5. This is how the open-weight ecosystem actually works: teams build on each other's innovations in public, and costs compound downward. You get frontier-class models at 10% of closed-source pricing.
You know that feeling when you want to run a frontier LLM but the API costs make your finance team cry? Before MoE architectures, every parameter you added meant linear cost increases. A 600B dense model meant paying for 600B parameters on every single token. You were stuck choosing between a dumb cheap model or a smart expensive one. MoE changed the math: you now get the knowledge of 671B parameters while only paying for 37B active ones per token.
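The cost math above can be sketched in a few lines. This is a back-of-envelope illustration using the parameter counts from the article; the 2-FLOPs-per-parameter-per-token figure is a standard rough estimate, not something the article states:

```python
# Rough per-token compute: a dense model runs every parameter on every
# token, while an MoE model only runs its active subset.
dense_params = 671e9        # hypothetical dense model at DeepSeek V3's total size
moe_active_params = 37e9    # DeepSeek V3's active parameters per token
flops_per_param = 2         # ~2 FLOPs per parameter per token (multiply + add)

dense_flops = dense_params * flops_per_param
moe_flops = moe_active_params * flops_per_param
ratio = moe_flops / dense_flops

print(f"MoE per-token compute: {ratio:.1%} of an equally sized dense model")
```

So at equal total parameter count, the MoE model does roughly 5.5% of the dense model's per-token compute, which is where the "knowledge of 671B, cost of 37B" framing comes from.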
Think of Mixture-of-Experts like a hospital with specialists. Instead of one massive neural network processing everything, you have multiple smaller 'expert' networks (16-384 of them) plus a router that decides which experts handle each token. DeepSeek V3 has 671B total parameters across 256 experts, but only 37B fire per token. The router learns which experts specialize in what — maybe one handles code, another handles math, another handles creative writing. Add Multi-Head Latent Attention (MLA), which compresses the memory-heavy KV-cache into a smaller latent space, and you get long context without the memory bloat.
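The hospital-with-specialists routing can be sketched as a toy top-k router. This is a minimal NumPy illustration, not DeepSeek's implementation: the expert count and top-k of 8 match public descriptions of V3's routed experts, but the toy hidden size, random weights, and plain softmax gating are assumptions for clarity (the real router adds shared experts and load-balancing terms):

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts = 256   # routed-expert count, as in the article
top_k = 8         # experts activated per token (assumption matching public V3 specs)
d_model = 64      # toy hidden size for illustration only

# The router is a learned linear layer that scores each expert for a token.
router_w = rng.standard_normal((d_model, n_experts)) * 0.02
token = rng.standard_normal(d_model)

scores = token @ router_w                 # one score per expert
chosen = np.argsort(scores)[-top_k:]      # keep only the top-k experts
weights = np.exp(scores[chosen])
weights /= weights.sum()                  # softmax over the chosen experts only

# Each expert is a small network of its own; only the chosen ones run.
experts = [rng.standard_normal((d_model, d_model)) * 0.02
           for _ in range(n_experts)]
output = sum(w * (token @ experts[e]) for w, e in zip(weights, chosen))

print(f"activated {top_k}/{n_experts} experts; output shape {output.shape}")
```

The key design point is visible in the loop at the end: compute scales with `top_k`, not `n_experts`, so total parameters (knowledge) and per-token cost decouple.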
If you're evaluating which open-weight LLM to deploy — whether for cost optimization, fine-tuning, or building on top — this explains the architectural bets behind each option. Also useful if you're tracking where the open-source AI ecosystem is heading and which innovations are spreading fastest. Not for you if you just want a quick 'which model should I use' answer without understanding the trade-offs.
The open-weight ecosystem is genuinely competitive with closed models now — Kimi K2.5 matches Opus at 10% of the cost, and DeepSeek V3.2 is priced at $0.25/M input tokens. The architectural convergence around MoE means these gains are structural, not temporary. The one thing to know: you're trading some polish and tool-calling reliability for massive cost savings. If your use case tolerates occasional retries, the economics are compelling.
This page gives you the hook. The full Snaplyze digest goes deeper so you can move from curiosity to decision with less noise.
Open the full digest for the deeper breakdown, Easy Mode, Pro Mode, compared viewpoints, and practical next-step playbooks you can actually use.
Install Snaplyze