Google's Gemma 4 now fits under 1 GB on your phone

What problem does it solve?

“Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit.” (HN commenter coder543, correcting a mischaracterization of QAT mechanics; https://news.ycombinator.com/item?id=48414653)

You want to run an open-weight multimodal model on a phone or laptop without a GPU, but standard 4-bit quantization silently degrades quality, and the resulting model still demands 6+ GB VRAM for the smallest useful variant. Picking the right format for your deployment target means navigating three incompatible runtimes: GGUF for llama.cpp, compressed tensors for vLLM, and now a Google-proprietary mobile format that only works with LiteRT-LM. The gap between the published footprint number and the actual RAM footprint during a real conversation has never been harder to reason about.

quantization, on-device-ai, llm, mobile-ml, gemma, google-deepmind, edge-inference

How it works

Standard post-training quantization converts a trained model's weights to 4-bit after training — like JPEG-compressing a photo after it's been taken. QAT does something different: it simulates that compression during training, so the model learns to perform well under quantization constraints before the weights are finalized. For the mobile format, Google applies this unevenly across the model: token-generation layers get 2-bit compression (the high-frequency, lower-stakes path), while reasoning layers stay at higher precision. On top of that, activation scaling gets pre-computed during training rather than calculated per-token at inference, reducing the work mobile NPUs have to do on every forward pass.
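The quantize-then-dequantize step that QAT typically simulates in the forward pass can be sketched in a few lines. This is a minimal illustration, assuming symmetric per-tensor scaling; the source does not say which scheme Google uses, and `fake_quantize` and `mean_squared_error` are hypothetical helper names. The sketch only measures how much rounding error 4-bit and 2-bit grids introduce on random weights, which is the error a QAT objective would train the model to tolerate.

```python
import random

def fake_quantize(weights, bits=4):
    """Round weights onto a signed `bits`-bit grid, then map them back to floats.

    Assumption: symmetric per-tensor scaling. Real QAT recipes may use
    per-channel or per-block scales instead.
    """
    qmax = 2 ** (bits - 1) - 1      # largest positive integer level (7 for 4-bit)
    qmin = -qmax - 1                # most negative integer level (-8 for 4-bit)
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return list(weights)        # nothing to scale
    scale = peak / qmax
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

err_4bit = mean_squared_error(weights, fake_quantize(weights, bits=4))
err_2bit = mean_squared_error(weights, fake_quantize(weights, bits=2))
print(f"4-bit MSE: {err_4bit:.4f}  2-bit MSE: {err_2bit:.4f}")
```

During training, a step like this would sit between the stored full-precision weights and the forward pass, so the loss already reflects the rounding. The 2-bit grid loses far more information than the 4-bit one, which is consistent with applying it only to selected layers.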

Key takeaways

01

Mobile quantization schema — sub-1 GB weights for E2B text-only: if you deploy via LiteRT-LM on Android or iOS, the text-only E2B model loads in under 1 GB of RAM, making it viable on mid-range phones without a dedicated ML chip

02

Q4_0 GGUF for llama.cpp and Ollama: official GGUF files ship with the release, so no conversion step is needed; the observed download size is ~3.2 GB for the E2B full model including audio and vision encoders

03

Surgical precision allocation: 2-bit compression targets token-generation layers only while reasoning layers stay at higher bit-width — quality stays closer to the original BF16 model than uniform-compression approaches

04

MTP QAT checkpoints: Multi-Token Prediction variants are also available in QAT form, preserving the inference speedup from MTP while running at lower precision

05

Optional encoder stripping: audio and vision encoders can be dropped at deploy time — removing them gets the text-only E2B under 1 GB; you only pay the memory cost for modalities you actually use

06

Broad toolchain coverage: Q4_0 weights run with llama.cpp, Ollama, LM Studio, vLLM, SGLang, and MLX; mobile-format weights run with LiteRT-LM and Transformers.js for in-browser deployment
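The toolchain split in the last takeaway can be written down as a small lookup, which is handy when a build script has to decide which weights to fetch. This is a sketch only: the runtime names come from the list above, while the labels `"q4_0"` and `"mobile"` and the function `weight_family_for` are illustrative names, not official identifiers.

```python
# Runtime -> weight family, transcribed from the toolchain list above.
# The family labels are illustrative, not official format identifiers.
WEIGHT_FAMILY_BY_RUNTIME = {
    "llama.cpp": "q4_0",
    "ollama": "q4_0",
    "lm studio": "q4_0",
    "vllm": "q4_0",
    "sglang": "q4_0",
    "mlx": "q4_0",
    "litert-lm": "mobile",
    "transformers.js": "mobile",
}

def weight_family_for(runtime: str) -> str:
    """Return the weight family the article lists for a runtime (case-insensitive)."""
    key = runtime.strip().lower()
    if key not in WEIGHT_FAMILY_BY_RUNTIME:
        raise ValueError(f"no weight family listed for runtime: {runtime!r}")
    return WEIGHT_FAMILY_BY_RUNTIME[key]

print(weight_family_for("Ollama"))
print(weight_family_for("LiteRT-LM"))
```

Raising on unknown runtimes is deliberate: silently defaulting to one family would hide exactly the format mismatch this table is meant to catch.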

Should you care?

Who it’s for

If you are building on-device AI features for Android or running local LLM inference on a laptop without a GPU, these checkpoints give you official Google support and a tested toolchain. If you already run Qwen3 locally and care about long-context inference on a phone, evaluate the activation range difference documented in the HN thread (Gemma ~600,000+ vs Qwen3 <2,000) before switching. These checkpoints are not useful if you need the sub-1 GB footprint to work with llama.cpp or Ollama: that path only exists via LiteRT-LM.

Worth exploring

The Q4_0 GGUF path is production-ready today — HN commenter simonw confirmed running E2B on Mac via litert-lm at a 3.2 GB download. The mobile format path (sub-1 GB) is limited to LiteRT-LM and Transformers.js, both relatively new runtimes with less community tooling than llama.cpp. The activation range issue flagged by HN engineers means the 1 GB number does not reflect actual memory usage during long conversations — test at your real target context length before committing to this as your deployment architecture.
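One part of the gap between a published weight size and runtime memory that can be estimated on paper is the KV cache, which grows linearly with context length in a plain transformer. The sketch below is generic arithmetic, not a measurement of Gemma 4: every architecture number in `HYPOTHETICAL_ARCH` is a placeholder, and the source does not break down where the extra memory goes. Substitute the values from the model's actual config before drawing conclusions.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `context_len` tokens.

    The factor 2 covers one key tensor and one value tensor per layer.
    Assumes every layer caches the full context; architectures with
    sliding-window or shared-KV layers need less.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

def total_gib(weight_bytes, context_len, **arch):
    """Weights plus KV cache, in GiB (ignores activations and runtime overhead)."""
    return (weight_bytes + kv_cache_bytes(context_len, **arch)) / 2**30

# Placeholder values for illustration only, NOT Gemma 4's real configuration.
HYPOTHETICAL_ARCH = dict(n_layers=30, n_kv_heads=4, head_dim=256)
WEIGHT_BYTES = 1 * 2**30  # the "about 1 GB" weight footprint from the article

for ctx in (2_048, 32_768, 131_072):
    print(f"context {ctx:>7}: ~{total_gib(WEIGHT_BYTES, ctx, **HYPOTHETICAL_ARCH):.2f} GiB")
```

Even with made-up numbers, the shape of the result is the useful part: the weight term is fixed while the cache term scales with the conversation, so a footprint quoted at load time says little about memory at the context length you actually ship.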
