“3 bits instead of 16, zero accuracy loss, 8x faster — Google just changed the KV cache math.”
Google Research just published TurboQuant, a compression algorithm that quantizes LLM key-value caches to 3 bits with zero measurable accuracy loss on standard benchmarks. The method achieves 6x memory reduction and up to 8x faster attention computation on H100 GPUs by combining two techniques: PolarQuant (converts vectors to polar coordinates to eliminate normalization overhead) and QJL (a 1-bit Johnson-Lindenstrauss transform that removes bias). The paper will be presented at ICLR 2026, and the related QJL code is already available on GitHub with 59 stars (verified March 2026).
You know that feeling when you try to run a long-context LLM and watch your GPU memory fill up before you even hit 32k tokens? The key-value cache — the memory where LLMs store attention keys and values during inference — can consume 80-90% of your memory on long sequences. Existing quantization methods help, but they add 1-2 bits of overhead per number because you have to store quantization constants (scale and zero-point) in full precision for each data block.
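To see why this matters, here is a back-of-envelope sizing sketch. The model shape (32 layers, 32 heads, head dim 128) is an illustrative assumption, roughly 7B-class, and is not taken from the paper:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bits_per_value=16):
    """Rough KV cache size: 2x for keys and values, one entry per
    layer / head / position / head-dimension."""
    n_values = 2 * n_layers * n_heads * head_dim * seq_len
    return n_values * bits_per_value // 8

fp16 = kv_cache_bytes(32_000, bits_per_value=16)
q3 = kv_cache_bytes(32_000, bits_per_value=3)
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB")
# → fp16: 15.6 GiB, 3-bit: 2.9 GiB
```

Note that 16 bits down to 3 bits is a 5.3x raw ratio; the article's 6x figure presumably also counts the per-block scale and zero-point overhead that existing methods carry and TurboQuant eliminates.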
Think of it like compressing a photo in two stages. First, PolarQuant randomly rotates all your vectors (which spreads them out evenly), then converts them from X,Y coordinates to angle-and-radius coordinates. Because the angles now follow a predictable pattern, you don't need to store normalization constants — the boundaries are already known. Second, QJL takes the tiny error left over from the first stage and applies a 1-bit Johnson-Lindenstrauss transform (essentially just storing the sign: positive or negative). This 1-bit correction eliminates bias in the attention scores. The result: 3 bits per value instead of 16, with provably near-optimal distortion.
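The two-stage idea above can be sketched in a few lines of NumPy. This is a toy illustration of the mechanics, not the paper's algorithm: the radius is kept exact here, the bit budget and angle grid are simplified, and the sign-of-projection step only shows what a 1-bit JL code of the residual looks like, not how TurboQuant uses it to debias attention scores:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(x, angle_bits=3):
    # Pair up coordinates, convert each (x, y) pair to an angle,
    # and uniformly quantize the angle. The grid boundaries are
    # fixed in advance, so no per-block scale/zero-point is stored.
    pairs = x.reshape(-1, 2)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])      # in [-pi, pi)
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    code = np.round((theta + np.pi) / step) % levels  # integer codes
    theta_hat = code * step - np.pi
    r = np.linalg.norm(pairs, axis=1)                 # kept exact in this toy
    recon = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1)
    return code.astype(np.int8), recon.reshape(x.shape)

d = 8
R = random_rotation(d)                 # stage 0: spread the vector out
x = rng.normal(size=d)
code, x_hat = polar_quantize(R @ x)    # stage 1: polar-coordinate codes

# Stage 2 (illustrative): a 1-bit JL code of the residual — store only
# the sign of a random projection, one bit per projection row.
residual = R @ x - x_hat
S = rng.normal(size=(d, d))
sign_bits = np.sign(S @ residual)
```

Because the rotation is orthogonal and the radius is untouched in this sketch, the reconstructed vector preserves each pair's norm exactly; only the angles are coarsened to `angle_bits` bits.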
If you're deploying LLMs and hitting memory walls on long contexts, or paying too much for inference compute, this matters. Especially relevant if you're building RAG systems, long-document summarizers, or any application where KV cache dominates memory. Not useful if you only run short contexts (<4k tokens) where KV cache isn't the bottleneck.
Can you use it today? Yes, but with a caveat: TurboQuant itself doesn't have public code yet (as of March 2026). However, the related QJL technique has an Apache-2.0 licensed repo with 59 stars and CUDA kernels. If you need KV cache compression now, start with QJL. Watch for TurboQuant integration into vLLM, llama.cpp, or HuggingFace — the Reddit threads show strong community demand for this.
This page gives you the hook. The full Snaplyze digest goes deeper so you can move from curiosity to decision with less noise.
Read the full digest for the deeper breakdown, Easy Mode, Pro Mode, and practical next-step playbooks you can actually use.