TurboQuant: 6x KV Cache Compression, Zero Accuracy Loss
Snaplyze Digest
R&D · Advanced · 2 min read · Mar 27, 2026 · Updated Apr 2, 2026


“3 bits instead of 16, zero accuracy loss, 8x faster — Google just changed the KV cache math.”

In Short

Google Research just published TurboQuant, a compression algorithm that quantizes LLM key-value caches to 3 bits with zero measurable accuracy loss on standard benchmarks. The method achieves 6x memory reduction and up to 8x faster attention computation on H100 GPUs by combining two techniques: PolarQuant (converts vectors to polar coordinates to eliminate normalization overhead) and QJL (a 1-bit Johnson-Lindenstrauss transform that removes bias). The paper will be presented at ICLR 2026, and the related QJL code is already available on GitHub with 59 stars (verified March 2026).

Tags: quantization, llm, kv-cache, compression, google-research
Why It Matters
The practical pain point this digest is really about.

You know that feeling when you try to run a long-context LLM and watch your GPU memory fill up before you even hit 32k tokens? The key-value cache — the memory where LLMs store attention keys and values during inference — can consume 80-90% of your memory on long sequences. Existing quantization methods help, but they add 1-2 bits of overhead per number because you have to store quantization constants (scale and zero-point) in full precision for each data block.
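The overhead math above is easy to verify with a back-of-envelope calculation. The sketch below uses an illustrative Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128 — these numbers are assumptions for illustration, not figures from the paper) to compare fp16, conventional 4-bit blockwise quantization with its stored scale/zero-point constants, and an overhead-free 3-bit format:

```python
# Back-of-envelope KV cache sizing. All model dimensions are illustrative
# assumptions (roughly Llama-2-7B-shaped), not numbers from the paper.
def kv_cache_bits(seq_len, layers=32, kv_heads=32, head_dim=128,
                  bits_per_value=16.0, overhead_bits=0.0):
    values = 2 * layers * kv_heads * head_dim * seq_len  # keys + values
    return values * (bits_per_value + overhead_bits)

def gib(bits):
    return bits / 8 / 2**30

seq = 32_000
fp16 = kv_cache_bits(seq, bits_per_value=16)
# Conventional 4-bit blockwise quantization: an fp16 scale plus fp16
# zero-point per 32-value block costs 32 bits / 32 values = 1 extra
# bit per value on top of the 4 payload bits.
int4 = kv_cache_bits(seq, bits_per_value=4, overhead_bits=1.0)
# Overhead-free 3-bit format: no per-block constants to store.
turbo3 = kv_cache_bits(seq, bits_per_value=3, overhead_bits=0.0)

print(f"fp16:       {gib(fp16):6.2f} GiB")
print(f"int4+ovh:   {gib(int4):6.2f} GiB")
print(f"3-bit:      {gib(turbo3):6.2f} GiB")
```

At 32k tokens this hypothetical model spends roughly 15.6 GiB on the KV cache in fp16, which is why the cache, not the weights, becomes the bottleneck on long sequences.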

How It Works
The mechanism, architecture, or workflow behind it.

Think of it like compressing a photo in two stages. First, PolarQuant randomly rotates all your vectors (which spreads them out evenly), then converts them from X,Y coordinates to angle-and-radius coordinates. Because the angles now follow a predictable pattern, you don't need to store normalization constants — the boundaries are already known. Second, QJL takes the tiny error left over from the first stage and applies a 1-bit Johnson-Lindenstrauss transform (essentially just storing the sign: positive or negative). This 1-bit correction eliminates bias in the attention scores. The result: 3 bits per value instead of 16, with mathematically provable near-optimal distortion.
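The first stage can be sketched in a few lines of NumPy. This is a minimal illustration of the polar-coordinate idea, not the paper's algorithm: the rotation matrix, bit width, and the choice to keep the radius exact are all simplifying assumptions made here for clarity (the paper quantizes the radius too, exploiting the known distribution of Gaussian radii).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d) * 3.0 + 1.0   # arbitrary input vector

# Random orthogonal rotation makes the coordinates behave like i.i.d.
# Gaussians, so each (x, y) pair has an angle that is uniform on
# [-pi, pi). A uniform angle can be quantized on a FIXED grid, so no
# per-block scale or zero-point needs to be stored.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
r = Q @ x

pairs = r.reshape(-1, 2)
theta = np.arctan2(pairs[:, 1], pairs[:, 0])
radius = np.linalg.norm(pairs, axis=1)

bits = 3
levels = 2 ** bits
step = 2 * np.pi / levels
codes = np.clip(np.floor((theta + np.pi) / step), 0, levels - 1).astype(int)
theta_hat = (codes + 0.5) * step - np.pi   # dequantize on the fixed grid

# Reconstruct; the radius is kept exact here purely for illustration.
pairs_hat = radius[:, None] * np.stack(
    [np.cos(theta_hat), np.sin(theta_hat)], axis=1)
x_hat = Q.T @ pairs_hat.reshape(-1)

err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error with 3-bit angles: {err:.3f}")
```

Note that the quantization grid (`step`, the bin edges) is the same for every vector and every block, which is exactly why no normalization constants need to be stored alongside the codes.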

Key Takeaways
6 fast bullets that make the core value obvious.
  • 3-bit quantization with zero accuracy loss — why YOU care: Run the same model with 6x less KV cache memory. Per the paper, 3.5 bits achieves 'absolute quality neutrality' on LongBench benchmarks; 2.5 bits shows only marginal degradation.
  • 8x faster attention on H100 GPUs — why YOU care: The quantized format is faster to compute with than full precision. 4-bit TurboQuant achieves up to 8x speedup over 32-bit keys on H100 accelerators per the Google Research paper.
  • Zero memory overhead — why YOU care: Traditional quantization stores scale/zero-point constants per block, adding 1-2 bits overhead. PolarQuant's polar coordinate approach eliminates this by mapping values to a fixed circular domain with known boundaries.
  • Data-oblivious algorithm — why YOU care: No dataset-specific tuning or calibration required. The random rotation preconditioning works on any data distribution, making it suitable for online/streaming applications.
  • Theoretically grounded — why YOU care: The paper proves TurboQuant achieves near-optimal distortion rates, differing from information-theoretic lower bounds by only a small constant factor (~2.7x). This isn't just an empirical result.
  • Works on Llama-2, Llama-3, Gemma, Mistral — why YOU care: The experiments cover major open-source model families on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval benchmarks. Real models, real tasks.
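The 1-bit Johnson-Lindenstrauss idea behind QJL, referenced in the bullets above, can also be demonstrated directly. The sketch below is a standalone illustration of the standard 1-bit JL inner-product estimator (the identity E[(Sq)·sign(Sk)] = sqrt(2/pi)·⟨q,k⟩/‖k‖ for Gaussian S), not the paper's kernel: a key is stored as m sign bits plus a single norm, yet inner products with an unquantized query remain unbiasedly recoverable. The dimensions and seed are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 16384          # vector dim, number of 1-bit projections

# Shared random Gaussian JL matrix; data-oblivious, no calibration needed.
S = rng.standard_normal((m, d))

k = rng.standard_normal(d)           # a "key" vector
q = rng.standard_normal(d)           # a "query" vector

# 1-bit encoding of the key: m sign bits plus one scalar norm.
signs = np.sign(S @ k)
k_norm = np.linalg.norm(k)

# Unbiased inner-product estimate: for Gaussian rows s_i,
#   E[(s_i . q) * sign(s_i . k)] = sqrt(2/pi) * <q,k> / ||k||,
# so rescaling the empirical mean recovers <q,k> without bias.
est = np.sqrt(np.pi / 2) * k_norm * np.mean((S @ q) * signs)

print(f"exact   <q,k> = {q @ k:8.3f}")
print(f"1-bit JL est  = {est:8.3f}")
```

Because the projection matrix is random and fixed in advance, the encoding never looks at the data distribution — this is what the "data-oblivious" bullet means in practice, and why the method suits streaming KV caches where tokens arrive one at a time.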
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you're deploying LLMs and hitting memory walls on long contexts, or paying too much for inference compute, this matters. Especially relevant if you're building RAG systems, long-document summarizers, or any application where KV cache dominates memory. Not useful if you only run short contexts (<4k tokens) where KV cache isn't the bottleneck.

Worth Exploring?

Yes, but with a caveat: TurboQuant itself doesn't have public code yet (as of March 2026). However, the related QJL technique has an Apache-2.0 licensed repo with 59 stars and CUDA kernels. If you need KV cache compression now, start with QJL. Watch for TurboQuant integration into vLLM, llama.cpp, or HuggingFace — the Reddit threads show strong community demand for this.
