PrismML: 1.15 GB Model Scores 70.5 Against 16 GB Rivals
Snaplyze Digest
Tech Products · Intermediate · 3 min read · Apr 2, 2026 (updated Apr 3, 2026)


“An 8B LLM that fits in 1.15 GB, runs at 368 tok/s, and is Apache 2.0 — built by Caltech researchers with $16.25M from Khosla.”

In Short

PrismML squeezed an 8B-parameter LLM into 1.15 GB — 14x smaller than full-precision — by training natively with 1-bit weights, not post-training quantization. It runs at 368 tokens/sec on an RTX 4090 and 85 tok/s on an M4 Pro, using 4-5x less energy per token than FP16. On a six-benchmark average it scores 70.5, trailing Qwen3 8B's 79.3 but beating Llama 3.1 8B's 67.1 — at a fraction of the memory. The company emerged from stealth on March 31, 2026 with $16.25M in seed funding from Khosla Ventures.
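The headline numbers check out arithmetically. A quick back-of-envelope sketch, using only the figures quoted in this digest (8.19B parameters, 1.125 effective bits per weight from 1 binary bit plus a 16-bit scale per 128-weight group):

```python
# Back-of-envelope check of the reported 1.15 GB footprint,
# using the figures quoted in the digest.
params = 8.19e9
bits_per_weight = 1 + 16 / 128   # = 1.125 effective bits per weight
total_bytes = params * bits_per_weight / 8

print(f"1-bit footprint: {total_bytes / 1e9:.2f} GB")      # ~1.15 GB
print(f"FP16 equivalent: {params * 2 / 1e9:.2f} GB")       # ~16.4 GB
print(f"reduction: {params * 2 / total_bytes:.1f}x")       # ~14x
```

The 14x figure in the summary follows directly from 16 bits dropping to 1.125 effective bits, so the claimed footprint is internally consistent.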

Tags: ai · llm · 1-bit · edge-ai · quantization
Why It Matters
The practical pain point this digest is really about.

You know that feeling when you want to run an LLM locally but your 8B model eats 16 GB of VRAM and your laptop fan sounds like a jet engine? Edge devices — phones, robots, IoT sensors — are even worse off. Full-precision models simply don't fit. Standard post-training quantization (Q4, Q8) helps, but you're still working with models that were designed for datacenters and then shrunk after the fact, which means unpredictable quality loss at extreme compression.

How It Works
The mechanism, architecture, or workflow behind it.

Instead of training a normal model and then compressing it afterward (post-training quantization), PrismML trains the model from scratch with 1-bit weights — every weight is either -1 or +1, with a 16-bit scale factor shared across every 128 weights. Think of it like building a house with Lego bricks (fixed, simple pieces) instead of building a normal house and then trying to replace every brick with a Lego. The base architecture is Qwen3-8B (standard transformer with GQA, SwiGLU, RoPE), but the training process itself learns to work within the 1-bit constraint. The result: 1.125 effective bits per weight, fitting 8.19B parameters into 1.15 GB. Inference runs through PrismML's fork of llama.cpp with custom kernels optimized for this format.
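To make the storage format concrete, here is an illustrative sketch in NumPy — not PrismML's actual kernels or training code — of what "1-bit weights with a 16-bit scale per 128-weight group" looks like. In native 1-bit training the signs and scales are learned under this constraint during training (rather than derived afterward as shown here), but the on-disk representation is the same:

```python
import numpy as np

# Illustrative sketch (not PrismML's actual code) of the weight
# format described above: each weight is -1 or +1, with one FP16
# scale shared across each group of 128 weights.
GROUP = 128

def binarize_group(w):
    """Collapse 128 float weights to signs plus a single FP16 scale."""
    scale = np.abs(w).mean().astype(np.float16)   # shared group scale
    signs = np.where(w >= 0, 1, -1).astype(np.int8)
    return signs, scale

def dequantize_group(signs, scale):
    """Reconstruct approximate float weights: sign * group scale."""
    return signs.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.normal(size=GROUP).astype(np.float32)
signs, scale = binarize_group(w)
w_hat = dequantize_group(signs, scale)
print("mean reconstruction error:", float(np.abs(w - w_hat).mean()))
```

Storage cost per group is 128 bits of signs plus 16 bits of scale, i.e. 144 bits / 128 weights = 1.125 effective bits per weight — exactly the figure above. The reconstruction error here would be large for a post-hoc quantized model; the point of native training is that the model never has high-precision weights to lose in the first place.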

Key Takeaways
7 fast bullets that make the core value obvious.
  • 1-bit native training — you get models trained from scratch with binary weights, not post-hoc quantized, which preserves more quality at extreme compression than standard quantization pipelines
  • 14x memory reduction — the 8B model fits in 1.15 GB instead of 16 GB, so you can run it on devices with 2-4 GB of available memory including phones and entry-level GPUs
  • 368 tok/s on RTX 4090 — 6.2x faster than FP16 inference on the same hardware, which means you can serve more users per GPU or get real-time responses on consumer hardware
  • 4-5x energy efficiency — per the whitepaper, Bonsai uses 0.276 mWh/token on a 4090 vs 1.134 for FP16, directly cutting your inference electricity bill
  • Apache 2.0 license — no usage restrictions, fully open for commercial deployment, fine-tuning, and modification
  • Three model sizes (8B, 4B, 1.7B) — the 1.7B fits in 0.24 GB and runs at 130 tok/s on an iPhone, giving you a range from edge-tiny to desktop-capable
  • Intelligence density metric — scores 1.062/GB vs Qwen3's 0.098/GB, a new way to measure how much capability you get per byte of model weight
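The energy bullet above can be sanity-checked directly from the quoted whitepaper figures. A small sketch (the $0.15/kWh electricity rate is an assumed illustrative number, not from the source):

```python
# Checking the 4-5x energy claim from the whitepaper figures
# quoted above: 0.276 mWh/token (Bonsai) vs 1.134 mWh/token
# (FP16) on an RTX 4090.
bonsai_mwh = 0.276
fp16_mwh = 1.134

ratio = fp16_mwh / bonsai_mwh
print(f"energy ratio: {ratio:.1f}x")   # ~4.1x, within the stated 4-5x

# Electricity cost per billion tokens at an assumed $0.15/kWh
rate_usd_per_kwh = 0.15
for name, mwh_per_tok in [("Bonsai", bonsai_mwh), ("FP16", fp16_mwh)]:
    kwh_per_billion = mwh_per_tok * 1e9 / 1e6   # mWh/token -> kWh per 1B tokens
    print(f"{name}: ${kwh_per_billion * rate_usd_per_kwh:.2f} per 1B tokens")
```

At those rates the gap works out to roughly $41 vs $170 per billion tokens, which is where the "directly cutting your inference electricity bill" claim comes from.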
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you're building on-device AI — mobile apps, robotics, IoT, or edge inference — and need an LLM that fits in under 2 GB of RAM, this is directly relevant. Backend engineers looking to cut GPU inference costs by 5-6x per request should benchmark it against their current quantized models. Not for you if you need top-of-class accuracy: Qwen3 8B still scores 79.3 vs Bonsai's 70.5 on the same benchmarks.

Worth Exploring?

Worth experimenting with if you have a concrete edge-deployment use case. The benchmarks are promising but the project is two days old, runs only on a custom llama.cpp fork (not upstream), and HN users have reported gibberish output on x86 CPUs. The 70.5 benchmark average is real but trails the full-precision frontier. Treat it as a strong alpha-stage release with genuine technical novelty, not a production drop-in replacement yet.

View original source