“An 8B LLM that fits in 1.15 GB, runs at 368 tok/s, and is Apache 2.0 — built by Caltech researchers with $16.25M from Khosla.”
PrismML squeezed an 8B-parameter LLM into 1.15 GB — 14x smaller than full-precision — by training natively with 1-bit weights, not post-training quantization. It runs at 368 tokens/sec on an RTX 4090 and 85 tok/s on an M4 Pro, using 4-5x less energy per token than FP16. On a six-benchmark average it scores 70.5, trailing Qwen3 8B's 79.3 but beating Llama 3.1 8B's 67.1 — at a fraction of the memory. The company emerged from stealth on March 31, 2026 with $16.25M in seed funding from Khosla Ventures.
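A quick back-of-envelope check of the headline compression figure (a sketch using the 8.19B parameter count cited in the technical breakdown below):

```python
params = 8.19e9                  # reported parameter count
fp16_gb = params * 2 / 1e9       # FP16 stores 2 bytes per weight
onebit_gb = 1.15                 # shipped model size reported above
print(f"FP16 ~{fp16_gb:.1f} GB vs {onebit_gb} GB -> {fp16_gb / onebit_gb:.1f}x smaller")
# FP16 ~16.4 GB vs 1.15 GB -> 14.2x smaller, consistent with the "14x" claim
```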
You know that feeling when you want to run an LLM locally, but your 8B model eats 16 GB of VRAM and your laptop fan sounds like a jet engine? Edge devices (phones, robots, IoT sensors) are even worse off: full-precision models simply don't fit. Standard post-training quantization (Q4, Q8) helps, but you're still working with models designed for datacenters and shrunk down after the fact, which means unpredictable quality loss at extreme compression.
Instead of training a normal model and then compressing it afterward (post-training quantization), PrismML trains the model from scratch with 1-bit weights — every weight is either -1 or +1, with a 16-bit scale factor shared across every 128 weights. Think of it like building a house with Lego bricks (fixed, simple pieces) instead of building a normal house and then trying to replace every brick with a Lego. The base architecture is Qwen3-8B (standard transformer with GQA, SwiGLU, RoPE), but the training process itself learns to work within the 1-bit constraint. The result: 1.125 effective bits per weight, fitting 8.19B parameters into 1.15 GB. Inference runs through PrismML's fork of llama.cpp with custom kernels optimized for this format.
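As a rough illustration of how the format pencils out (a minimal sketch, not PrismML's actual kernel code; the mean-absolute-value scale choice and the function names are assumptions for illustration):

```python
import numpy as np

GROUP = 128  # number of weights sharing one 16-bit scale factor

def quantize_group(w: np.ndarray):
    """Reduce 128 float weights to 128 sign bits plus one fp16 scale.

    Using the mean absolute value as the scale is an assumption;
    the real training recipe may derive it differently.
    """
    scale = np.float16(np.abs(w).mean())
    bits = np.packbits(w >= 0)          # 128 signs packed into 16 bytes
    return bits, scale

def dequantize_group(bits: np.ndarray, scale: np.float16) -> np.ndarray:
    signs = np.unpackbits(bits).astype(np.float32) * 2.0 - 1.0  # {0,1} -> {-1,+1}
    return signs * np.float32(scale)

# Effective storage: 1 sign bit per weight + 16 scale bits per 128-weight group
bits_per_weight = 1 + 16 / GROUP        # = 1.125, matching the stated figure
print(f"{8.19e9 * bits_per_weight / 8 / 1e9:.2f} GB")   # -> 1.15 GB
```

A round trip through `dequantize_group(*quantize_group(w))` recovers only `scale * sign(w)`, which is exactly the information this format keeps per weight; the point of native 1-bit training is that the model learns to be accurate under that constraint rather than having it imposed afterward.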
If you're building on-device AI (mobile apps, robotics, IoT, or edge inference) and need an LLM that fits in under 2 GB of RAM, this is directly relevant. Backend engineers looking to cut GPU inference costs by 5-6x per request should benchmark it against their current quantized models. Not for you if you need top-of-class accuracy: Qwen3 8B still scores 79.3 vs Bonsai's 70.5 on the same benchmarks.
Worth experimenting with if you have a concrete edge-deployment use case. The benchmarks are promising but the project is two days old, runs only on a custom llama.cpp fork (not upstream), and HN users have reported gibberish output on x86 CPUs. The 70.5 benchmark average is real but trails the full-precision frontier. Treat it as a strong alpha-stage release with genuine technical novelty, not a production drop-in replacement yet.
This page gives you the hook; the full Snaplyze digest goes deeper, with compared viewpoints and practical next-step playbooks, so you can move from curiosity to decision with less noise.