KV Cache Bottleneck and Lossy Quantization Advantage

The KV cache in LLMs consumes memory comparable to the model weights, limiting context length, throughput, and user concurrency on fixed hardware like GPUs. Unlike model weight quantization (common for local Llama runs), KV cache quantization targets the runtime attention states. Pruning methods like SnapKV or PyramidKV discard cache entries deemed irrelevant; TurboQuant instead preserves all attention via lossy compression, reducing precision while minimizing distortion. Lossless schemes cannot deliver comparable savings on arbitrary data: ZIP shrinks 10M repeated 'A's from 9.53MB to 9KB, but compresses random data only to 7.91MB, about a 1.2x ratio. By accepting controlled approximation instead, TurboQuant achieves a 2-3x memory reduction, which translates to longer contexts, higher interactivity, or serving more users without extra GPUs. The paper's release coincided with drops of over 7% in memory and storage stocks like Micron, Western Digital, and SanDisk, signaling a shift in expected inference hardware demand.
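
To make the footprint concrete, here is a back-of-the-envelope sizing sketch. All model dimensions are illustrative assumptions (roughly 7B-class), not values from the paper:

    # Hypothetical KV cache sizing; model dimensions are illustrative, not from the paper.
    n_layers, n_kv_heads, head_dim = 32, 32, 128
    seq_len = 32_768                       # tokens of context for one request

    def kv_cache_bytes(bits_per_value):
        # K and V tensors: 2 * layers * heads * head_dim values per token
        n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
        return n_values * bits_per_value // 8

    fp16, q4 = kv_cache_bytes(16), kv_cache_bytes(4)
    print(f"FP16 KV cache : {fp16 / 2**30:.1f} GiB per request")
    print(f"4-bit KV cache: {q4 / 2**30:.1f} GiB per request ({fp16 // q4}x smaller)")

Under these assumptions the FP16 cache for a single long-context request already rivals the weights of a 7B model, which is exactly the bottleneck described above.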

Random Projection Transforms Inputs to Predictable Gaussians

Arbitrary KV cache inputs defy a universal codebook, much as images do: without knowing the color distribution, a minimal codebook (K=2) loses details like a sun, while K=64 reconstructs near-identically. A spiky vector such as [8, 0.1, 0.1] (e.g., an embedding for 'Caleb') poses the same problem. TurboQuant solves it by randomizing: normalize to a unit vector (divide by the norm, ~8.001, yielding roughly [1, 0.012, 0.012]), then multiply by a random rotation matrix. This spreads the spiky energy evenly (e.g., to [0.577, 0.699, 0.423]), and by a Central Limit Theorem argument, in high dimensions (typical for LLMs) the coordinates of a rotated unit vector converge to a Gaussian with mean 0 and variance 1/d per coordinate; the concentration is tight at production dimensions, wide in the 3D toy example. The result: unknown inputs (HTML, legal docs, repeated text, or noise) all become predictable Gaussians, so optimal codebooks for 1-8 bits can be precomputed once via Lloyd's algorithm and stored in a one-time lookup table. Quantization then snaps each value to the nearest codebook entry under mean squared error (MSE); for example, the inputs (3, 4) and (2, 3.8) both snap to the same nearby centroid C1.
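
A minimal numpy sketch of the randomize-then-snap pipeline. This is my own illustration of the idea described above, not the paper's code; the dimension, sample counts, and 3-bit budget are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1024                                  # stand-in for an LLM head/hidden dim

    # A spiky input: one dominant coordinate, the rest near zero.
    x = np.full(d, 0.01)
    x[0] = 8.0
    x /= np.linalg.norm(x)                    # normalize onto the unit sphere

    # Random rotation: orthogonalize a Gaussian matrix via QR.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    y = Q @ x                                 # coordinates now behave ~ N(0, 1/d)
    print(f"coord mean {y.mean():+.4f}, var {y.var():.5f} (1/d = {1/d:.5f})")

    # Lloyd's algorithm: one-time optimal b-bit codebook for the standard Gaussian.
    def lloyd_codebook(bits, n_samples=200_000, iters=50):
        samples = rng.standard_normal(n_samples)
        centers = np.linspace(-2.0, 2.0, 2 ** bits)
        for _ in range(iters):
            nearest = np.abs(samples[:, None] - centers[None, :]).argmin(axis=1)
            for j in range(centers.size):
                mask = nearest == j
                if mask.any():
                    centers[j] = samples[mask].mean()
        return np.sort(centers)

    codebook = lloyd_codebook(bits=3)         # stored once, reused for any input

    # Quantize: rescale to unit variance, snap each coordinate to nearest center.
    scaled = y * np.sqrt(d)
    codes = np.abs(scaled[:, None] - codebook[None, :]).argmin(axis=1)
    y_hat = codebook[codes] / np.sqrt(d)
    print(f"MSE after 3-bit snap: {np.mean((y - y_hat) ** 2):.2e}")

Because the Gaussian target is fixed, the Lloyd step runs once offline; at serving time, quantization is just a nearest-entry lookup.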

QJL Residuals Preserve Attention Dot Products

Codebook snapping minimizes MSE but leaves a bias that distorts attention scores, the dot products between queries and the quantized keys (and the subsequent weighting of quantized values). TurboQuant's second step applies QJL (from a 2024 paper, inspired by the Johnson-Lindenstrauss lemma) to the residuals: drop one bit from the prior quantization's budget, compute the MSE residual (the difference between the original vector and its quantized reconstruction), then requantize that residual so inner-product errors cancel in expectation. This dual optimization, MSE for reconstruction fidelity and inner products for attention accuracy, yields near-minimum distortion across bit widths. No input assumptions are needed after the randomization step; it works on any context.
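
A toy sketch of the residual correction, assembled from the description above rather than from TurboQuant's actual kernels. Stage 1 uses a crude sign-and-scale quantizer as a stand-in for the codebook snap (the bit-budget bookkeeping is omitted); stage 2 stores a QJL-style 1-bit sign sketch of the residual, whose estimator is unbiased for inner products with the residual:

    import numpy as np

    rng = np.random.default_rng(1)
    d, m = 128, 4096                          # vector dim; residual sketch size

    k = rng.standard_normal(d) / np.sqrt(d)   # a key (post-rotation scale)
    q = rng.standard_normal(d)                # a query

    # Stage 1: coarse MSE quantization (crude sign-and-scale stand-in here).
    k_hat = np.sign(k) * np.abs(k).mean()
    resid = k - k_hat                         # what stage 1 got wrong

    # Stage 2: QJL-style sketch of the residual. Store 1 bit per projection
    # plus the residual norm; the sqrt(pi/2)/m factor makes the estimator
    # unbiased for <q, resid>, so the bias in the attention score cancels.
    S = rng.standard_normal((m, d))
    resid_bits = np.sign(S @ resid)
    resid_norm = np.linalg.norm(resid)

    est = np.sqrt(np.pi / 2) / m * resid_norm * (S @ q) @ resid_bits
    print(f"true <q,k>        : {q @ k:+.4f}")
    print(f"stage 1 only      : {q @ k_hat:+.4f}")
    print(f"stage 1 + QJL est : {q @ k_hat + est:+.4f}")

In expectation the correction recovers the residual's contribution exactly, so the combined score is an unbiased estimate of the true attention logit; a larger sketch size m tightens it.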

Hardware and Industry Implications

With the KV cache matching model weights in footprint, TurboQuant's multi-fold memory savings mean the same GPUs can handle 2-3x longer contexts or 2-3x more users, slashing inference GPU demand (e.g., roughly halving clusters for the same throughput). Builders gain practical leverage: integrate KV cache quantization into LLM serving stacks for production-scale interactivity without hardware upgrades, prioritizing cache compression over weight compression for context-heavy apps. The capacity sketch below makes the math concrete.
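
A minimal calculation under stated assumptions (GPU size, weight footprint, and per-user cache figures are hypothetical, continuing the sizing sketch from the first section):

    # Hypothetical capacity planning; all numbers are illustrative.
    gpu_mem_gib = 80                          # e.g., one 80 GiB accelerator
    weights_gib = 14                          # ~7B params in FP16
    kv_gib_per_user_fp16 = 16.0               # one 32k-token context at FP16

    budget = gpu_mem_gib - weights_gib        # memory left for KV cache
    for bits in (16, 8, 4):
        kv = kv_gib_per_user_fp16 * bits / 16
        print(f"{bits:>2}-bit KV cache: {budget / kv:.1f} concurrent 32k-token users")

The same budget can instead be spent on context: holding concurrency fixed, a 4x smaller cache supports roughly 4x longer contexts on identical hardware.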