Tech Products · Intermediate · 3 min read · Apr 2, 2026 · Updated Apr 3, 2026

PrismML: 1.15 GB Model Scores 70.5 Against 16 GB Rivals

“An 8B LLM that fits in 1.15 GB, runs at 368 tok/s, and is Apache 2.0 — built by Caltech researchers with $16.25M from Khosla.”

Source · prismml.com

"I have older M1 Air with 8GB, but still getting over 23 t/s on 4B model.. and the quality of outputs is on par with top models of similar size." (HN user freakynit, Show HN thread, March 31, 2026)

You know that feeling when you want to run an LLM locally but your 8B model eats 16 GB of VRAM in FP16 and your laptop fan sounds like a jet engine? Edge devices (phones, robots, IoT sensors) are even worse off: full-precision models simply don't fit. Standard post-training quantization (Q4, Q8) helps, but you're still working with models that were designed for datacenters and shrunk after the fact, which means unpredictable quality loss at extreme compression.

Tags: ai, llm, 1-bit, edge-ai, quantization, on-device, open-source

Instead of training a normal model and compressing it afterward (post-training quantization), PrismML trains the model from scratch with 1-bit weights: every weight is either -1 or +1, and each group of 128 weights shares one 16-bit scale factor. Think of it like designing a house around Lego bricks (fixed, simple pieces) from the start, rather than building a normal house and then trying to swap every brick for a Lego. The base architecture is Qwen3-8B (a standard transformer with GQA, SwiGLU, and RoPE), but the training process itself learns to work within the 1-bit constraint. The result is 1.125 effective bits per weight (1 sign bit plus 16/128 bits of amortized scale), which fits 8.19B parameters into 1.15 GB. Inference runs through PrismML's fork of llama.cpp, with custom kernels optimized for this format.
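The storage arithmetic in that paragraph can be checked with a few lines of Python. This is a minimal sketch, not PrismML's actual file format: the group size, scale width, and parameter count come from the article, while the byte layout in `pack_group` and `unpack_group` (an FP16 scale followed by packed sign bits) and both function names are hypothetical, chosen only to show where the 1.125 bits per weight come from.

```python
import struct

GROUP_SIZE = 128   # weights sharing one scale (figure from the article)
SCALE_BITS = 16    # one 16-bit scale per group (figure from the article)


def effective_bits_per_weight(group_size=GROUP_SIZE, scale_bits=SCALE_BITS):
    """One sign bit per weight plus the group's scale amortized over the group."""
    return 1 + scale_bits / group_size


def model_size_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def pack_group(signs, scale):
    """Hypothetical layout: a little-endian FP16 scale, then one bit per sign.

    `signs` is a sequence of +1 / -1 values; a set bit means +1.
    """
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    n_bytes = (len(signs) + 7) // 8
    return struct.pack("<e", scale) + bits.to_bytes(n_bytes, "little")


def unpack_group(blob, n):
    """Recover n weights, each equal to +scale or -scale."""
    (scale,) = struct.unpack("<e", blob[:2])
    bits = int.from_bytes(blob[2:], "little")
    return [scale if (bits >> i) & 1 else -scale for i in range(n)]


signs = [1, -1] * (GROUP_SIZE // 2)
blob = pack_group(signs, 0.5)
print(len(blob) * 8 / GROUP_SIZE)   # bits per weight for one packed group
print(round(model_size_gb(8.19e9, effective_bits_per_weight()), 2))  # GB for 8.19B params
```

One packed group is 18 bytes (2 for the scale, 16 for the signs), which is where 1.125 bits per weight comes from. The same `model_size_gb` helper with 16 bits per weight gives roughly 16.4 GB, in line with the article's rounded 16 GB figure for FP16.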

01  1-bit native training: you get models trained from scratch with binary weights, not post-hoc quantized, which preserves more quality at extreme compression than standard quantization pipelines
02  14x memory reduction: the 8B model fits in 1.15 GB instead of 16 GB in FP16, so you can run it on devices with 2-4 GB of available memory, including phones and entry-level GPUs
03  368 tok/s on RTX 4090: 6.2x faster than FP16 inference on the same hardware, which means you can serve more users per GPU or get real-time responses on consumer hardware
04  4-5x energy efficiency: per the whitepaper, the Bonsai model uses 0.276 mWh/token on a 4090 vs 1.134 for FP16 (about 4.1x less on that card), directly cutting your inference electricity bill
05  Apache 2.0 license: no usage restrictions, fully open for commercial deployment, fine-tuning, and modification
06  Three model sizes (8B, 4B, 1.7B): the 1.7B fits in 0.24 GB and runs at 130 tok/s on an iPhone, giving you a range from edge-tiny to desktop-capable
07  Intelligence density metric: scores 1.062/GB vs Qwen3 8B's 0.098/GB, a new way to measure how much capability you get per gigabyte of model weights
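The ratios in the list above can be reproduced from the raw numbers quoted in this article. The sketch below is a sanity check under stated assumptions, not the whitepaper's method. In particular, the article does not define the intelligence density metric. The formula in `intelligence_density` (negative log of the error rate, divided by size in GB) is an assumption inferred only because it happens to reproduce both quoted values, and the function name is hypothetical.

```python
import math

# Figures quoted in the article
FP16_SIZE_GB = 16.0
BONSAI_SIZE_GB = 1.15
FP16_MWH_PER_TOKEN = 1.134
BONSAI_MWH_PER_TOKEN = 0.276
BONSAI_SCORE = 70.5   # benchmark average, percent
QWEN3_SCORE = 79.3    # benchmark average, percent

memory_reduction = FP16_SIZE_GB / BONSAI_SIZE_GB          # ~13.9, rounded to "14x"
energy_ratio = FP16_MWH_PER_TOKEN / BONSAI_MWH_PER_TOKEN  # ~4.1


def intelligence_density(score_pct, size_gb):
    """ASSUMED formula, inferred from the quoted numbers, not from the source:
    -ln(1 - score/100) / size_in_GB.
    """
    return -math.log(1 - score_pct / 100) / size_gb


print(f"memory reduction: {memory_reduction:.1f}x")
print(f"energy ratio:     {energy_ratio:.1f}x")
print(f"Bonsai density:   {intelligence_density(BONSAI_SCORE, BONSAI_SIZE_GB):.3f} per GB")
print(f"Qwen3 8B density: {intelligence_density(QWEN3_SCORE, FP16_SIZE_GB):.3f} per GB")
```

Under that assumed formula the two densities come out at about 1.06 and 0.098 per GB, matching the quoted figures. A plain score-per-GB ratio would not: 70.5 / 1.15 is about 61.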
Who it’s for

If you're building on-device AI (mobile apps, robotics, IoT, or edge inference) and need an LLM that fits in under 2 GB of RAM, this is directly relevant. Backend engineers looking to cut GPU inference costs by 5-6x per request should benchmark it against their current quantized models, since the headline speed and energy figures are measured against FP16. Not for you if you need top-of-class accuracy: Qwen3 8B still scores 79.3 vs Bonsai's 70.5 on the same benchmarks, and if you have 16 GB of VRAM to spare, full-precision models remain more capable.

Worth exploring

Worth experimenting with if you have a concrete edge-deployment use case. The benchmarks are promising, but the project is two days old, runs only on a custom llama.cpp fork (not upstream), and HN users have reported gibberish output on x86 CPUs. The 70.5 benchmark average is real but trails full-precision Qwen3 8B at 79.3. Treat it as a strong alpha-stage release with genuine technical novelty, not yet a production drop-in replacement.
