R&D advanced 2 min read Mar 27, 2026 · Updated Apr 2, 2026

TurboQuant: 6x KV Cache Compression, Zero Accuracy Loss

“3 bits instead of 16, zero accuracy loss, 8x faster — Google just changed the KV cache math.”

Source · research.google

“TurboQuant demonstrates that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel.” (Google Research blog, March 2026)

You know that feeling when you try to run a long-context LLM and watch your GPU memory fill up before you even hit 32k tokens? The key-value cache — the memory where LLMs store attention keys and values during inference — can consume 80-90% of your memory on long sequences. Existing quantization methods help, but they add 1-2 bits of overhead per number because you have to store quantization constants (scale and zero-point) in full precision for each data block.
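To make the memory pressure and the per-block overhead concrete, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128) and the block sizes are assumptions chosen for illustration, not figures from the source, and the helper names are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value, batch=1):
    """Bytes needed to hold keys and values for every layer and token."""
    values = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = keys + values
    return values * bits_per_value / 8


def overhead_bits_per_value(block_size, constant_bits=16, constants_per_block=2):
    """Extra bits per value when each block stores a scale and a zero-point."""
    return constants_per_block * constant_bits / block_size


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, 32k tokens, 16-bit cache.
fp16_bytes = kv_cache_bytes(32, 32, 128, 32_768, bits_per_value=16)
print(f"16-bit KV cache: {fp16_bytes / 2**30:.1f} GiB")

# 16-bit scale + 16-bit zero-point, amortized over the values in one block.
for block_size in (32, 16):
    extra = overhead_bits_per_value(block_size)
    print(f"block of {block_size}: {extra:.1f} extra bits per value")
```

Under these assumptions, blocks of 32 and 16 values cost 1 and 2 extra bits per value respectively, which is where a "1-2 bits of overhead" range can come from.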

Tags: quantization · llm · kv-cache · compression · google-research · inference · iclr-2026

Think of it like compressing a photo in two stages. First, PolarQuant randomly rotates all your vectors (which spreads their energy evenly across coordinates), then converts them from Cartesian (x, y) coordinates to polar (angle and radius) coordinates. Because the angles now follow a predictable pattern, you don't need to store normalization constants: the grid boundaries are already known. Second, QJL takes the small error left over from the first stage and applies a 1-bit Johnson-Lindenstrauss transform (essentially storing only the sign, positive or negative, of each projected value). This 1-bit correction eliminates bias in the attention scores. The result: 3 bits per value instead of 16, with mathematically provable near-optimal distortion.
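The first stage can be illustrated with a toy NumPy sketch: rotate, convert coordinate pairs to polar form, and snap each angle to a fixed grid that needs no stored scale or zero-point. This is a simplified illustration, not the paper's algorithm: it keeps radii in full precision, uses a single level of pairing, and all function names are hypothetical.

```python
import numpy as np


def random_rotation(dim, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix makes the draw uniform (Haar)


def polar_encode(vec, bits):
    """Split vec into (x, y) pairs; keep radii, quantize angles to a fixed grid."""
    pairs = vec.reshape(-1, 2)
    radius = np.hypot(pairs[:, 0], pairs[:, 1])
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])  # in [-pi, pi]
    levels = 2 ** bits
    code = np.floor((angle + np.pi) / (2 * np.pi) * levels).astype(np.int64) % levels
    return radius, code


def polar_decode(radius, code, bits):
    """Rebuild the pairs from radii and the centre of each angular bin."""
    levels = 2 ** bits
    angle = -np.pi + (code + 0.5) * (2 * np.pi / levels)
    pairs = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return pairs.reshape(-1)


rng = np.random.default_rng(0)
dim, bits = 128, 4  # assumed sizes, for illustration only
key = rng.standard_normal(dim)
rotation = random_rotation(dim, rng)

rotated = rotation @ key
radius, code = polar_encode(rotated, bits)
approx = rotation.T @ polar_decode(radius, code, bits)  # undo the rotation

rel_error = np.linalg.norm(key - approx) / np.linalg.norm(key)
print(f"relative error with {bits} angle bits: {rel_error:.3f}")
```

Because the angular grid is fixed in advance, the only per-vector state here is the radii. In this toy version the relative error is bounded by pi / 2**bits, since each decoded angle is at most half a bin away from the true one.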

01
3-bit quantization with zero accuracy loss — why YOU care: Run the same model with 6x less KV cache memory. Per the paper, 3.5 bits achieves 'absolute quality neutrality' on LongBench benchmarks; 2.5 bits has only marginal degradation.
02
8x faster attention on H100 GPUs — why YOU care: The quantized format is faster to compute with than full precision. 4-bit TurboQuant achieves up to 8x speedup over 32-bit keys on H100 accelerators per the Google Research blog benchmarks.
03
Zero memory overhead — why YOU care: Traditional quantization stores scale/zero-point constants per block, adding 1-2 bits overhead. PolarQuant's polar coordinate approach eliminates this by mapping to a fixed circular grid where boundarie...
04
Data-oblivious algorithm — why YOU care: No dataset-specific tuning or calibration required. The random rotation preconditioning works on any data distribution, making it suitable for online/streaming applications.
05
Theoretically grounded — why YOU care: The paper proves TurboQuant achieves near-optimal distortion rates, differing from information-theoretic lower bounds by only a small constant factor (~2.7x). This isn't just empirical — it's mathemat...
06
Works on Llama-2, Llama-3, Gemma, Mistral — why YOU care: The experiments cover major open-source model families on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval benchmarks. Real models, real tasks, verified results.
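The compression ratios behind the bit widths in the list above are simple division against a 16-bit baseline. The sketch below ignores any extra per-vector metadata, so treat the numbers as idealized upper bounds rather than measured results; the helper name is hypothetical.

```python
def compression_ratio(bits_per_value, baseline_bits=16):
    """Idealized memory reduction versus a baseline bit width."""
    return baseline_bits / bits_per_value


for bits in (4, 3.5, 3, 2.5):
    print(f"{bits} bits per value: {compression_ratio(bits):.2f}x smaller than 16-bit")
```

By this arithmetic, 3 bits gives roughly 5.3x and 2.5 bits gives 6.4x, so a headline figure of 6x sits between the two.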
Who it’s for

If you're deploying LLMs and hitting memory walls on long contexts, or paying too much for inference compute, this matters. Especially relevant if you're building RAG systems, long-document summarizers, or any application where KV cache dominates memory. Not useful if you only run short contexts (<4k tokens) where KV cache isn't the bottleneck.

Worth exploring

Yes, but with a caveat: TurboQuant itself doesn't have public code yet (as of March 2026). However, the related QJL technique has an Apache-2.0 licensed repo with 59 stars and CUDA kernels. If you need KV cache compression now, start with QJL. Watch for TurboQuant integration into vLLM, llama.cpp, or HuggingFace — the Reddit threads show strong community demand for this.
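For orientation before opening the QJL repo, the core sign-bit idea can be sketched in a few lines of NumPy. This is not the repo's API or its CUDA kernels: the function names are hypothetical, and the number of projections is set far higher than a practical setting so that the estimate visibly converges.

```python
import numpy as np


def sign_sketch(key, projection):
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    return np.sign(projection @ key), np.linalg.norm(key)


def estimate_dot(query, signs, key_norm, projection):
    """Unbiased estimate of <query, key> from the stored sign bits.

    For a Gaussian row g, E[<g, q> * sign(<g, k>)] = sqrt(2/pi) * <q, k> / ||k||,
    so rescaling the average by sqrt(pi/2) * ||k|| recovers <q, k> in expectation.
    """
    m = projection.shape[0]
    return np.sqrt(np.pi / 2) * key_norm / m * ((projection @ query) @ signs)


rng = np.random.default_rng(0)
dim, m = 64, 20_000  # assumed sizes; m is deliberately large for the demo
query = rng.standard_normal(dim)
key = rng.standard_normal(dim)
projection = rng.standard_normal((m, dim))  # shared Gaussian projection

signs, key_norm = sign_sketch(key, projection)
estimate = estimate_dot(query, signs, key_norm, projection)
exact = query @ key
print(f"exact dot product: {exact:.3f}, sign-bit estimate: {estimate:.3f}")
```

Note that the query is projected at full precision and only the key side is reduced to signs. That asymmetry is what keeps the estimate unbiased in this sketch.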
