TurboQuant: 6x KV Cache Compression Without Attention Loss
TurboQuant rotates KV vectors before quantizing them to 3.5 bits/channel (quality-neutral) or 2.5 bits (minor degradation), then repairs the residual error, yielding roughly 6x memory savings and up to 8x faster attention for long-context LLMs.
KV Cache Drives Long-Context Costs, Naive Fixes Fail
Long-context AI applications such as chat, PDF assistants, coding copilots, and RAG systems slow down and get more expensive because of KV cache growth, not just model size. The KV cache acts as short-term working memory: it stores reusable per-token information so the model avoids recomputing past context from scratch, which is essential for efficient generation. As context expands (adding logs, stack traces, files, or document pages), the cache balloons, GPU memory use and latency spike, and throughput drops. Compressing it the way JPEG compresses images sounds reasonable but fails in practice: attention relies on precise inner products between the current query and past keys, and aggressive quantization scrambles those rankings, degrading output quality even when the individual numbers look close.
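To make the growth concrete, here is a back-of-the-envelope estimate of KV cache size. The model shape below (32 layers, 8 KV heads, head dimension 128, fp16 storage) is illustrative only, not tied to any particular LLM:

```python
# Rough KV cache size: 2 tensors (K and V) per layer, each holding
# num_kv_heads * head_dim values per token, stored in fp16 (2 bytes).
def kv_cache_bytes(context_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_len * per_token

for ctx in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>9,} tokens -> {gib:6.1f} GiB per sequence")
```

Under these assumptions the cache runs from about 1 GiB at 8K tokens to well over 100 GiB at a million tokens, per sequence, which is why the cache rather than the weights becomes the long-context bottleneck.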
TurboQuant targets the geometry that attention actually uses, shrinking the cache without altering what the model prioritizes.
Rotate-then-Quantize Plus Residual Repair Preserves Signals
First, apply a random rotation to KV vectors before quantization. Uneven energy distribution across channels hinders compression, like an awkwardly shaped object in a suitcase. Rotation spreads the information evenly across dimensions, enabling tighter packing at low bit widths without losing the core structure.
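A minimal sketch of the rotate-then-quantize idea follows. The rotation here is a generic random orthogonal matrix and the quantizer is plain uniform scalar quantization at a whole number of bits; TurboQuant's actual rotation construction and codec (including the fractional 3.5/2.5 bits-per-channel budgets) differ, so treat this as an illustration of why rotation helps, not as the method itself:

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim = 128

# Random orthogonal rotation via QR of a Gaussian matrix (illustrative).
Q, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))

def quantize(x, bits):
    """Uniform scalar quantization of one vector (stand-in codec)."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)
    codes = np.round((x - lo) / scale)
    return codes * scale + lo  # dequantized values

k = rng.standard_normal(head_dim)      # a key vector
k_rot = Q @ k                          # spread energy across channels
k_hat = Q.T @ quantize(k_rot, bits=4)  # quantize, rotate back to compare

q = rng.standard_normal(head_dim)      # a query vector
print("exact attention score :", q @ k)
print("approx attention score:", q @ k_hat)
```

In a real serving kernel the quantized rotated keys would be stored directly and the query rotated instead; rotating back here just makes the inner-product comparison easy to read.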
Second, add a lightweight residual correction for quantization errors. After the main compression pass, a one-bit QJL (quantized Johnson-Lindenstrauss) step repairs attention-critical mismatches, like a thin overlay that fixes JPEG artifacts in the details that matter. The two-stage process is online (it compresses vectors as they stream in) and data-oblivious (no dataset-specific codebooks to calibrate), so it can be deployed in production serving stacks without extra calibration overhead.
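The sketch below shows the generic flavor of a one-bit JL-style repair step under simplifying assumptions: the leftover residual after coarse quantization is encoded as sign bits of a shared random projection plus its norm, and an unbiased estimator adds the missing contribution back into the attention score. The coarse quantizer, dimensions, and estimator constants here are illustrative, not TurboQuant's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 128                      # head dim; number of 1-bit projections
S = rng.standard_normal((m, d))      # shared random projection (data-oblivious)

def encode_residual(r):
    # Keep only sign bits of the projected residual plus its norm.
    return np.sign(S @ r), np.linalg.norm(r)

def residual_score(q, signs, r_norm):
    # JL-style unbiased estimate of <q, r> from the 1-bit code.
    return np.sqrt(np.pi / 2) / m * r_norm * (S @ q) @ signs

k = rng.standard_normal(d)
k_hat = np.round(k * 2) / 2          # stand-in coarse quantizer
r = k - k_hat                        # quantization error to repair
signs, r_norm = encode_residual(r)

q = rng.standard_normal(d)
print("true score      :", q @ k)
print("coarse only     :", q @ k_hat)
print("coarse + repair :", q @ k_hat + residual_score(q, signs, r_norm))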
6x Memory Cuts Boost Long-Context Products
At 3.5 bits/channel, TurboQuant matches full-precision quality; at 2.5 bits it shows only slight drops. Google's benchmarks report roughly 6x KV cache reduction and up to 8x faster attention computation in some settings. Gains vary by model, kernel, and serving stack, but they translate into real product wins: longer chat histories, bigger PDFs and documents, fuller repository context in copilots, and more RAG chunks, all at lower cost. The same hardware serves more users faster, which is exactly the throughput headroom serving teams are after. And because the method is online and data-oblivious rather than a purely theoretical tweak, it plugs directly into inference for scalable long context with minimal quality trade-offs.
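As a rough sanity check of the headline ratio, assuming an fp16 baseline and ignoring per-vector metadata such as scales and residual bits:

```python
baseline_bits = 16                    # fp16 KV cache
for bits in (3.5, 2.5):
    print(f"{bits} bits/channel -> {baseline_bits / bits:.1f}x smaller cache")
# 3.5 bits/channel -> ~4.6x; 2.5 bits/channel -> ~6.4x.
# The quoted ~6x figure is consistent with the lower-bit setting
# once per-vector overhead is accounted for.
```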