"Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit." (HN commenter coder543, correcting a mischaracterization of QAT mechanics: https://news.ycombinator.com/item?id=48414653)
You want to run an open-weight multimodal model on a phone or laptop without a GPU, but standard 4-bit quantization silently degrades quality, and the resulting model still demands 6+ GB of memory for the smallest useful variant. Choosing for your deployment target means navigating three incompatible formats, each tied to its own runtime: GGUF for llama.cpp, compressed tensors for vLLM, and now a Google-proprietary mobile format built for LiteRT-LM. The gap between the published footprint number and the actual RAM footprint during a real conversation has never been harder to reason about.
Standard post-training quantization converts a trained model's weights to 4-bit after training, like JPEG-compressing a photo after it has been taken. QAT (quantization-aware training) does something different: it simulates that compression during training, so the model learns to perform well under quantization constraints before the weights are finalized. For the mobile format, Google applies this unevenly across the model: token-generation layers get 2-bit compression (the high-frequency, lower-stakes path), while reasoning layers stay at higher precision. On top of that, activation scaling is pre-computed during training rather than calculated per token at inference, which reduces the work mobile NPUs do on every forward pass.
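To make the "simulated compression" idea concrete, here is a minimal sketch of the quantize-then-dequantize round trip (often called fake quantization) that QAT schemes typically apply in the forward pass. It is not Google's recipe: the symmetric, per-tensor rounding scheme, the function names, and the sample weights are all assumptions chosen for illustration. It also shows why dropping some layers to 2-bit is a real trade-off, since the rounding error grows as the grid gets coarser.

```python
def fake_quantize(weights, bits=4):
    """Round-trip weights through a symmetric signed integer grid.

    Assumption: one scale per tensor, derived from the largest magnitude.
    Real QAT pipelines may use per-channel or per-block scales instead.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    # Quantize to an integer level, clamp to the grid, then dequantize.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights, not taken from any real checkpoint.
weights = [0.9, -0.35, 0.12, -0.7, 0.05, 0.48]

for bits in (4, 2):
    error = mean_squared_error(weights, fake_quantize(weights, bits))
    print(f"{bits}-bit round-trip MSE: {error:.5f}")
```

In post-training quantization this rounding error is simply accepted after the fact. In QAT, the same round trip is part of the training loss, so the optimizer can move weights toward values that survive the rounding.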
If you are building on-device AI features for Android or running local LLM inference on a laptop without a GPU, these checkpoints give you official Google support and a tested toolchain. If you already run Qwen3 locally and care about long-context inference on a phone, the activation range difference documented in the HN thread (Gemma ~600,000+ vs Qwen3 <2,000) is worth evaluating before switching. These checkpoints are not useful if you need the sub-1 GB footprint to work with llama.cpp or Ollama, because that path is not available in those runtimes.
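For a rough sense of why an activation range gap of that size matters, the sketch below treats the thread's figures as maximum absolute activation values. That reading is an assumption, as is the choice of a symmetric int8 grid; the helper names are hypothetical. The one hard fact used is that the largest finite IEEE 754 half-precision (float16) value is 65504.

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def int8_step(max_abs):
    """Step size of a symmetric int8 grid covering [-max_abs, max_abs]."""
    return max_abs / 127


def fits_fp16(max_abs):
    """True if a value of this magnitude is representable in float16."""
    return max_abs <= FP16_MAX


# Magnitudes as quoted from the HN thread, interpreted as max |activation|.
reported = {"Gemma": 600_000.0, "Qwen3": 2_000.0}

for name, max_abs in reported.items():
    print(
        f"{name}: int8 step ~{int8_step(max_abs):.1f}, "
        f"fits in float16: {fits_fp16(max_abs)}"
    )
```

Under these assumptions, the larger range either forces a much coarser integer grid or a wider activation type, which is the kind of cost worth measuring on your own device before switching.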
The Q4_0 GGUF path is production-ready today: HN commenter simonw confirmed running E2B on Mac via litert-lm at a 3.2 GB download. The mobile format path (sub-1 GB) is limited to LiteRT-LM and Transformers.js, both relatively new runtimes with less community tooling than llama.cpp. The activation range issue flagged by HN engineers means the 1 GB number does not reflect actual memory usage during long conversations, so test at your real target context length before committing to this as your deployment architecture.
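One way to reason about the gap between the published footprint and memory use in a long conversation is to add a key-value cache estimate on top of the weight size. The sketch below uses the standard dense-attention formula (keys plus values, per layer, per KV head, per token). The layer count, head count, and head dimension are hypothetical placeholders, not the real model's configuration, and the estimate ignores sliding-window attention, KV sharing, and cache quantization, any of which a real runtime may use to shrink the cache.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for a dense-attention transformer.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def total_memory_gb(weights_gb, context_len, **dims):
    """Weights plus KV cache, in decimal gigabytes."""
    return weights_gb + kv_cache_bytes(context_len, **dims) / 1e9


# Hypothetical dimensions for illustration only.
HYPOTHETICAL_DIMS = dict(n_layers=30, n_kv_heads=4, head_dim=256)

for context_len in (2_048, 8_192, 32_768):
    total = total_memory_gb(1.0, context_len, **HYPOTHETICAL_DIMS)
    print(f"context {context_len:>6}: ~{total:.2f} GB total")
```

Swapping in the real configuration and the context length you plan to ship gives a first-order number to compare against what you measure on the device.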