“TurboQuant demonstrates that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel.” (Google Research blog, March 2026)
You know that feeling when you try to run a long-context LLM and watch your GPU memory fill up before you even hit 32k tokens? The key-value (KV) cache, the memory where LLMs store attention keys and values during inference, can consume 80-90% of your memory on long sequences. Existing quantization methods help, but they add 1-2 bits of overhead per number because the quantization constants (scale and zero-point) have to be stored in full precision for each data block.
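To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration, not one taken from the TurboQuant work, and the overhead formula assumes two 16-bit constants (scale and zero-point) per block, which is one common reading of "full precision".

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Bytes needed to hold keys and values for every layer and token."""
    # The leading 2 counts one key vector and one value vector per token.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


def constant_overhead_bits(block_size, constant_bits=16, constants_per_block=2):
    """Extra bits per value spent on per-block scale and zero-point.

    Assumption: both constants are stored at `constant_bits` precision.
    """
    return constants_per_block * constant_bits / block_size


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

for bits in (16, 3):
    gib = kv_cache_bytes(**cfg, bits_per_value=bits) / 2**30
    print(f"{bits} bits per value: {gib:.2f} GiB")

for block_size in (32, 16):
    extra = constant_overhead_bits(block_size)
    print(f"block size {block_size}: +{extra} bits per value")
```

Under these assumptions, a single 32k-token sequence needs 4 GiB at 16 bits per value and 0.75 GiB at 3 bits, and block sizes of 32 and 16 cost 1 and 2 extra bits per value, which lines up with the 1-2 bit overhead mentioned above.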
TurboQuant works in two stages, a bit like compressing a photo in two passes. First, PolarQuant randomly rotates all your vectors (which spreads them out evenly), then converts them from Cartesian (x, y) coordinates to polar (angle and radius) coordinates. Because the angles now follow a predictable distribution, you don't need to store normalization constants: the quantization boundaries are already known. Second, QJL takes the small error left over from the first stage and applies a 1-bit Johnson-Lindenstrauss transform, which essentially stores only the sign (positive or negative) of each projected value. This 1-bit correction removes the bias in the attention scores. The result: about 3 bits per value instead of 16, with mathematically provable near-optimal distortion.
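To make the sign trick in the second stage concrete, here is a minimal sketch of a 1-bit Johnson-Lindenstrauss estimator for a query-key inner product, in plain Python. It is not the TurboQuant or QJL code. The function names, the Gaussian projection matrix, the 4000 projections, and applying the transform to a whole key rather than to a first-stage residual are all assumptions made for illustration.

```python
import math
import random


def gaussian_matrix(m, d, rng):
    """m random projection directions in d dimensions, entries ~ N(0, 1)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode_key(S, k):
    """Keep one sign per projection (1 bit each) plus the key's norm."""
    signs = [1 if dot(row, k) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(k, k))


def estimate_score(S, q, signs, k_norm):
    """Estimate <q, k> from the stored signs; the query is not quantized."""
    m = len(S)
    total = sum(dot(row, q) * s for row, s in zip(S, signs))
    # For Gaussian rows, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    # so this rescaling makes the estimate unbiased.
    return math.sqrt(math.pi / 2) * k_norm * total / m


rng = random.Random(0)
d, m = 16, 4000  # illustrative sizes, not values from the source
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
k = [rng.gauss(0.0, 1.0) for _ in range(d)]
S = gaussian_matrix(m, d, rng)

signs, k_norm = encode_key(S, k)
print("exact score:    ", dot(q, k))
print("1-bit estimate: ", estimate_score(S, q, signs, k_norm))
```

Storing the key's norm next to the sign bits is what lets the estimator rescale correctly. Adding more projections reduces the variance of the estimate, while its expected value matches the exact inner product under the Gaussian assumption.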
If you're deploying LLMs and hitting memory walls on long contexts, or paying too much for inference compute, this matters. It's especially relevant if you're building RAG systems, long-document summarizers, or any application where the KV cache dominates memory. It's not useful if you only run short contexts (<4k tokens), where the KV cache isn't the bottleneck.
Can you try this today? Yes, but with a caveat: TurboQuant itself doesn't have public code yet (as of March 2026). However, the related QJL technique has an Apache-2.0-licensed repo with 59 stars and CUDA kernels. If you need KV cache compression now, start with QJL. Watch for TurboQuant integration into vLLM, llama.cpp, or Hugging Face; the Reddit threads show strong community demand for it.