R&D intermediate 3 min read Jun 6, 2026

Google's Gemma 4 now fits under 1 GB on your phone

“Google's Gemma 4 fits under 1 GB on a phone — the bf16 KV cache overhead Google left out of the headline can double that in long conversations.”

Source · blog.google

“"Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit." — HN commenter coder543, correcting a mischaracterization of QAT mechanics (https://news.ycombinator.com/item?id=48414653; extracted via WebFetch mo...”

You want to run an open-weight multimodal model on a phone or laptop without a GPU, but standard 4-bit quantization silently degrades quality, and the resulting model still demands 6+ GB of memory for the smallest useful variant. Picking the right format for your deployment target means navigating three incompatible format-and-runtime pairings: GGUF for llama.cpp, compressed tensors for vLLM, and now a Google-proprietary mobile format that runs on LiteRT-LM (and on Transformers.js in the browser). The gap between the published footprint number and the actual RAM footprint during a real conversation is harder to reason about than the headline suggests.

quantization · on-device-ai · llm · mobile-ml · gemma · google-deepmind · edge-inference

Standard post-training quantization converts a trained model's weights to 4-bit after training — like JPEG-compressing a photo after it's been taken. QAT does something different: it simulates that compression during training, so the model learns to perform well under quantization constraints before the weights are finalized. For the mobile format, Google applies this unevenly across the model: token-generation layers get 2-bit compression (the high-frequency, lower-stakes path), while reasoning layers stay at higher precision. On top of that, activation scaling gets pre-computed during training rather than calculated per-token at inference, reducing the work mobile NPUs have to do on every forward pass.
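To make "simulating compression during training" concrete, here is a minimal sketch of the quantize-then-dequantize round trip (often called fake quantization) that QAT schemes commonly run in the forward pass. This is not Google's training recipe: the symmetric per-tensor scaling, the `fake_quantize` helper, and the sample weights are all assumptions chosen for illustration.

```python
def fake_quantize(weights, bits):
    """Snap floats onto a signed integer grid of the given bit-width,
    then map them back to floats (quantize -> dequantize).

    Assumption: symmetric, per-tensor scaling. Real QAT pipelines often
    use per-channel or per-block scales instead.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)              # nothing to scale
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)                  # nearest grid point
        q = max(-qmax - 1, min(qmax, q))      # clamp to representable range
        out.append(q * scale)                 # back to float
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights, for illustration only.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.12, -0.68, 0.57]

err_4bit = mean_squared_error(weights, fake_quantize(weights, 4))
err_2bit = mean_squared_error(weights, fake_quantize(weights, 2))

print(f"4-bit round-trip MSE: {err_4bit:.5f}")
print(f"2-bit round-trip MSE: {err_2bit:.5f}")
```

With these sample weights the 2-bit round trip loses more information than the 4-bit one. Post-training quantization applies this rounding once, after the weights are fixed; QAT exposes the model to the rounded weights while it is still being trained, so it can adjust to them.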

01
Mobile quantization schema — sub-1 GB weights for E2B text-only: if you deploy via LiteRT-LM on Android or iOS, the text-only E2B model loads in under 1 GB of RAM, making it viable on mid-range phones without a dedicated ML chip
02
Q4_0 GGUF for llama.cpp and Ollama: official GGUF files ship with the release, so no conversion step is needed; the observed download size is ~3.2 GB for the E2B full model including audio and vision encoders
03
Surgical precision allocation: 2-bit compression targets token-generation layers only while reasoning layers stay at higher bit-width — quality stays closer to the original BF16 model than uniform-compression approaches
04
MTP QAT checkpoints: Multi-Token Prediction variants are also available in QAT form, preserving the inference speedup from MTP while running at lower precision
05
Optional encoder stripping: audio and vision encoders can be dropped at deploy time — removing them gets the text-only E2B under 1 GB; you only pay the memory cost for modalities you actually use
06
Broad toolchain coverage: Q4_0 weights run with llama.cpp, Ollama, LM Studio, vLLM, SGLang, and MLX; mobile-format weights run with LiteRT-LM and Transformers.js for in-browser deployment
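The mixed-precision and encoder-stripping items above both come down to the same arithmetic: weight memory is roughly parameter count times bits per weight, summed over the parts you actually load. The sketch below shows that calculation. The parameter counts and the per-group bit-widths are placeholders, not published Gemma 4 figures, and the estimate ignores quantization scales, embeddings stored at other precisions, and runtime overhead.

```python
def weight_bytes(param_count, bits):
    """Approximate storage for param_count weights at the given bit-width."""
    return param_count * bits / 8


def footprint_gb(groups, skip=()):
    """Sum weight storage over all groups not listed in skip, in GB (1e9 bytes)."""
    total = sum(
        weight_bytes(count, bits)
        for name, (count, bits) in groups.items()
        if name not in skip
    )
    return total / 1e9


# HYPOTHETICAL split: (parameter count, bits per weight).
# These numbers are illustrative placeholders, not Gemma 4's real layout.
groups = {
    "token_generation": (1_500_000_000, 2),
    "reasoning": (500_000_000, 4),
    "audio_vision_encoders": (300_000_000, 4),
}

full = footprint_gb(groups)
text_only = footprint_gb(groups, skip=("audio_vision_encoders",))

print(f"all modalities: {full:.3f} GB")
print(f"text only:      {text_only:.3f} GB")
```

Swap in the real per-component sizes from the model card before drawing conclusions; the point of the sketch is only that dropping a component or lowering its bit-width changes the total linearly.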
Who it’s for

If you are building on-device AI features for Android or running local LLM inference on a laptop without a GPU, these checkpoints give you official Google support and a tested toolchain. If you already run Qwen3 locally and care about long-context inference on a phone, the activation range difference documented in the HN thread (Gemma ~600,000+ vs Qwen3 <2,000) is worth evaluating before switching. Not useful if you need the sub-1 GB footprint to work with llama.cpp or Ollama — that path only exists via LiteRT-LM.

Worth exploring

The Q4_0 GGUF path is production-ready today — HN commenter simonw confirmed running E2B on Mac via litert-lm at a 3.2 GB download. The mobile format path (sub-1 GB) is limited to LiteRT-LM and Transformers.js, both relatively new runtimes with less community tooling than llama.cpp. The activation range issue flagged by HN engineers means the 1 GB number does not reflect actual memory usage during long conversations — test at your real target context length before committing to this as your deployment architecture.
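The teaser quote above attributes the gap between the headline number and real usage to a bf16 KV cache. A rough way to size that cache for your own target context length is the standard estimate below. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4's actual configuration, and the formula assumes every layer keeps keys and values for the full context; models that use sliding-window attention or share KV across layers will need less.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to bf16; use 1 for an 8-bit cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_value)


# HYPOTHETICAL model shape, for illustration only.
NUM_LAYERS = 24
NUM_KV_HEADS = 2
HEAD_DIM = 256

for tokens in (1_024, 8_192, 32_768):
    size_gb = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, tokens) / 1e9
    print(f"{tokens:>6} tokens -> {size_gb:.3f} GB of bf16 KV cache")
```

The cache grows linearly with context length while the weight footprint stays fixed, which is why a short benchmark prompt and a long conversation can land on very different totals.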
