PrismML: 1.15 GB Model Scores 70.5 Against 16 GB Rivals
Snaplyze Digest
Tech Products · Intermediate · 3 min read · Apr 2, 2026 (updated Apr 3, 2026)


“An 8B LLM that fits in 1.15 GB, runs at 368 tok/s, and is Apache 2.0 — built by Caltech researchers with $16.25M from Khosla.”

In Short

PrismML squeezed an 8B-parameter LLM into 1.15 GB — 14x smaller than full-precision — by training natively with 1-bit weights, not post-training quantization. It runs at 368 tokens/sec on an RTX 4090 and 85 tok/s on an M4 Pro, using 4-5x less energy per token than FP16. On a six-benchmark average it scores 70.5, trailing Qwen3 8B's 79.3 but beating Llama 3.1 8B's 67.1 — at a fraction of the memory. The company emerged from stealth on March 31, 2026 with $16.25M in seed funding from Khosla Ventures.
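The headline numbers check out arithmetically. A quick back-of-envelope sketch, using only the figures quoted in this digest (8.19B parameters, 1.125 effective bits per weight from 1 binary bit plus a 16-bit scale per 128-weight group):

```python
# Back-of-envelope check of the reported 1.15 GB footprint,
# using the figures quoted in the digest.
params = 8.19e9
bits_per_weight = 1 + 16 / 128   # = 1.125 effective bits per weight
total_bytes = params * bits_per_weight / 8

print(f"1-bit footprint: {total_bytes / 1e9:.2f} GB")      # ~1.15 GB
print(f"FP16 equivalent: {params * 2 / 1e9:.2f} GB")       # ~16.4 GB
print(f"reduction: {params * 2 / total_bytes:.1f}x")       # ~14x
```

The 14x figure in the summary follows directly from 16 bits dropping to 1.125 effective bits, so the claimed footprint is internally consistent.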

Tags: ai · llm · 1-bit · edge-ai · quantization
Why It Matters
The practical pain point this digest is really about.

You know that feeling when you want to run an LLM locally but your 8B model eats 16 GB of VRAM and your laptop fan sounds like a jet engine? Edge devices — phones, robots, IoT sensors — are even worse off. Full-precision models simply don't fit. Standard post-training quantization (Q4, Q8) helps, but you're still working with models that were designed for datacenters and then shrunk after the fact, which means unpredictable quality loss at extreme compression.

How It Works
The mechanism, architecture, or workflow behind it.

Instead of training a normal model and then compressing it afterward (post-training quantization), PrismML trains the model from scratch with 1-bit weights — every weight is either -1 or +1, with a 16-bit scale factor shared across every 128 weights. Think of it like building a house with Lego bricks (fixed, simple pieces) instead of building a normal house and then trying to replace every brick with a Lego. The base architecture is Qwen3-8B (standard transformer with GQA, SwiGLU, RoPE), but the training process itself learns to work within the 1-bit constraint. The result: 1.125 effective bits per weight, fitting 8.19B parameters into 1.15 GB. Inference runs through PrismML's fork of llama.cpp with custom kernels optimized for this format.
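To make the storage format concrete, here is an illustrative sketch in NumPy — not PrismML's actual kernels or training code — of what "1-bit weights with a 16-bit scale per 128-weight group" looks like. In native 1-bit training the signs and scales are learned under this constraint during training (rather than derived afterward as shown here), but the on-disk representation is the same:

```python
import numpy as np

# Illustrative sketch (not PrismML's actual code) of the weight
# format described above: each weight is -1 or +1, with one FP16
# scale shared across each group of 128 weights.
GROUP = 128

def binarize_group(w):
    """Collapse 128 float weights to signs plus a single FP16 scale."""
    scale = np.abs(w).mean().astype(np.float16)   # shared group scale
    signs = np.where(w >= 0, 1, -1).astype(np.int8)
    return signs, scale

def dequantize_group(signs, scale):
    """Reconstruct approximate float weights: sign * group scale."""
    return signs.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.normal(size=GROUP).astype(np.float32)
signs, scale = binarize_group(w)
w_hat = dequantize_group(signs, scale)
print("mean reconstruction error:", float(np.abs(w - w_hat).mean()))
```

Storage cost per group is 128 bits of signs plus 16 bits of scale, i.e. 144 bits / 128 weights = 1.125 effective bits per weight — exactly the figure above. The reconstruction error here would be large for a post-hoc quantized model; the point of native training is that the model never has high-precision weights to lose in the first place.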

Key Takeaways
7 fast bullets that make the core value obvious.
  • 1-bit native training — you get models trained from scratch with binary weights, not post-hoc quantized, which preserves more quality at extreme compression than standard quantization pipelines
  • 14x memory reduction — the 8B model fits in 1.15 GB instead of 16 GB, so you can run it on devices with 2-4 GB of available memory including phones and entry-level GPUs
  • 368 tok/s on RTX 4090 — 6.2x faster than FP16 inference on the same hardware, which means you can serve more users per GPU or get real-time responses on consumer hardware
  • 4-5x energy efficiency — per the whitepaper, Bonsai uses 0.276 mWh/token on a 4090 vs 1.134 for FP16, directly cutting your inference electricity bill
  • Apache 2.0 license — no usage restrictions, fully open for commercial deployment, fine-tuning, and modification
  • Three model sizes (8B, 4B, 1.7B) — the 1.7B fits in 0.24 GB and runs at 130 tok/s on an iPhone, giving you a range from edge-tiny to desktop-capable
  • Intelligence density metric — scores 1.062/GB vs Qwen3's 0.098/GB, a new way to measure how much capability you get per byte of model weight
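The energy bullet above can be sanity-checked directly from the quoted whitepaper figures. A small sketch (the $0.15/kWh electricity rate is an assumed illustrative number, not from the source):

```python
# Checking the 4-5x energy claim from the whitepaper figures
# quoted above: 0.276 mWh/token (Bonsai) vs 1.134 mWh/token
# (FP16) on an RTX 4090.
bonsai_mwh = 0.276
fp16_mwh = 1.134

ratio = fp16_mwh / bonsai_mwh
print(f"energy ratio: {ratio:.1f}x")   # ~4.1x, within the stated 4-5x

# Electricity cost per billion tokens at an assumed $0.15/kWh
rate_usd_per_kwh = 0.15
for name, mwh_per_tok in [("Bonsai", bonsai_mwh), ("FP16", fp16_mwh)]:
    kwh_per_billion = mwh_per_tok * 1e9 / 1e6   # mWh/token -> kWh per 1B tokens
    print(f"{name}: ${kwh_per_billion * rate_usd_per_kwh:.2f} per 1B tokens")
```

At those rates the gap works out to roughly $41 vs $170 per billion tokens, which is where the "directly cutting your inference electricity bill" claim comes from.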
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you're building on-device AI — mobile apps, robotics, IoT, or edge inference — and need an LLM that fits in under 2 GB of RAM, this is directly relevant. Backend engineers looking to cut GPU inference costs by 5-6x per request should benchmark it against their current quantized models. Not for you if you need top-of-class accuracy: Qwen3 8B still scores 79.3 vs Bonsai's 70.5 on the same benchmarks.

Worth Exploring?

Worth experimenting with if you have a concrete edge-deployment use case. The benchmarks are promising but the project is two days old, runs only on a custom llama.cpp fork (not upstream), and HN users have reported gibberish output on x86 CPUs. The 70.5 benchmark average is real but trails the full-precision frontier. Treat it as a strong alpha-stage release with genuine technical novelty, not a production drop-in replacement yet.

View original source