TurboQuant: 3.9-7.5x KV Cache Compression in vLLM

TurboQuant's online vector quantization compresses vLLM KV caches 3.9-7.5x at 2-4 bits/dim with perfect Needle-in-a-Haystack recall, no added latency, and a 21% throughput gain.

TurboQuant Delivers Superior KV Cache Compression

TurboQuant is an online vector-quantization scheme: a random rotation (obtained via QR decomposition) whitens each key/value vector, per-dimension Lloyd-Max codebooks quantize it at 2-4 bits/dim (including fractional 2.5- and 3.5-bit widths), and the resulting codes are bit-packed. Its distortion is provably within 2.7x of the information-theoretic limit. Unlike scalar formats such as FP8 (e4m3/e5m2) or INT4, which round each element independently, its inner-product variant keeps attention scores unbiased, the quantity attention actually computes, while delivering 4-5x memory savings. The paper's benchmarks show perfect Needle-in-a-Haystack recall at 4x compression and competitive LongBench scores at 2.5-3.5 bits/dim. The method requires no offline preprocessing, runs fully online, and maps well onto accelerators.
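For intuition, here is a minimal NumPy sketch of the MSE variant of that pipeline (QR-based random rotation, per-coordinate Lloyd-Max codebook, index encoding); bit-packing and the paper's unbiased inner-product estimator are omitted, and every name here is illustrative rather than the PoC's API:

```python
import numpy as np

def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Haar-random orthogonal rotation from QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))               # sign fix for Haar measure

def lloyd_max_codebook(samples: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """Fit a 2**bits-level scalar Lloyd-Max codebook (1-D k-means)."""
    k = 2 ** bits
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)  # quantile init
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2   # decision boundaries
        idx = np.searchsorted(edges, samples)    # nearest-level assignment
        for j in range(k):
            cell = samples[idx == j]
            if cell.size:
                levels[j] = cell.mean()          # centroid update
    return levels

def encode(x: np.ndarray, Q: np.ndarray, levels: np.ndarray) -> np.ndarray:
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, Q @ x).astype(np.uint8)  # per-dim codes

def decode(codes: np.ndarray, Q: np.ndarray, levels: np.ndarray) -> np.ndarray:
    return Q.T @ levels[codes]                   # dequantize, un-rotate

# Toy check: 4-bit codes keep query-key inner products nearly intact.
dim, rng = 128, np.random.default_rng(1)
Q = random_rotation(dim)
keys = rng.standard_normal((512, dim))
levels = lloyd_max_codebook((keys @ Q.T).ravel(), bits=4)
query = rng.standard_normal(dim)
approx = np.stack([decode(encode(k, Q, levels), Q, levels) for k in keys])
print(np.corrcoef(keys @ query, approx @ query)[0, 1])  # close to 1.0
```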

vLLM's existing alternatives (FP8 KV cache, compressed-tensors) minimize element-wise MSE; they offer no vector codebooks, no inner-product-aware objective, no theoretical guarantees, and no sub-4-bit flexibility.

Proven Zero-Loss Performance and Throughput Gains

A proof-of-concept (PoC) on Qwen2.5-7B (H200, 4K-16K contexts) yields:

| Config     | Exact Match | Avg Cache (GB) | vs Full |
|------------|-------------|----------------|---------|
| Full       | 6/6         | 0.510          | 1.0x    |
| TQ 2-bit   | 6/6         | 0.068          | 7.5x    |
| TQ 3.5-bit | 6/6         | 0.112          | 4.5x    |
| TQ 4-bit   | 6/6         | 0.132          | 3.9x    |

Upstream PR #38280 (Qwen2.5-1.5B, H200) confirms 12/12 exact matches across bit-widths, TTFT/ITL latency matching the baseline (9.3 ms / 8.4 ms), and a 21% throughput boost at batch=16. Phase 2 adds bit-packed uint8 storage, ceil(head_size * bits / 8) + 2 bytes per slot, to realize the full compression ratios.
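As a sanity check on that formula, a few lines of arithmetic reproduce the ratios in the table above, assuming head_size=128 (Qwen2.5's per-head dimension) and an fp16 baseline of 2 bytes per dim; both assumptions are mine, not stated in the PR:

```python
import math

head_size, baseline = 128, 128 * 2               # fp16 baseline: 2 B per dim
for bits in (2, 3.5, 4):
    slot = math.ceil(head_size * bits / 8) + 2   # packed codes + 2 B overhead
    print(f"{bits}-bit: {slot} B/slot -> {baseline / slot:.1f}x vs fp16")
# 2-bit:   34 B/slot -> 7.5x
# 3.5-bit: 58 B/slot -> 4.4x
# 4-bit:   66 B/slot -> 3.9x
```

The 2-bit and 4-bit numbers match the table exactly; the 3.5-bit row measures 4.5x against this 4.4x estimate, since the table's ratios come from actual cache sizes rather than the formula.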

Straightforward vLLM Integration Path

The design aligns with vLLM's existing quantization framework:

  • Extend CacheDType in cache.py/torch_utils.py to carry integer code indices.
  • Add a TurboQuantConfig registered via @register_quantization_config("turboquant") that targets Attention layers (see the config sketch after this list).
  • Implement TurboQuantKVCacheMethod (extending BaseKVCacheMethod) to own codebook parameters, the MSE and inner-product variants, and per-head codebooks.
  • Update is_quantized_kv_cache() detection.
  • Add CUDA/Triton encode/decode kernels (43/43 unit tests pass in the PoC); a pure-PyTorch packing stand-in also follows this list.
  • Adjust KVCacheSpec for codebook overhead and the variable compression ratios.
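A skeletal sketch of the config and method steps follows. The abstract-method surface of QuantizationConfig shifts between vLLM versions, so treat the imports and signatures as illustrative of the shape, not a drop-in implementation:

```python
import torch
from vllm.attention import Attention
from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    """KV-cache-only quantization: attention layers get TurboQuant,
    everything else stays unquantized."""

    def __init__(self, bits: float = 4.0, variant: str = "ip"):
        super().__init__()
        self.bits = bits        # 2, 2.5, 3, 3.5, or 4 bits/dim
        self.variant = variant  # "mse" or "ip" (unbiased inner products)

    @classmethod
    def get_name(cls) -> str:
        return "turboquant"

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        return 80               # Ampere or newer

    @staticmethod
    def get_config_filenames() -> list[str]:
        return []

    @classmethod
    def from_config(cls, config: dict) -> "TurboQuantConfig":
        return cls(bits=config.get("bits", 4.0),
                   variant=config.get("variant", "ip"))

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None             # leave linear layers alone

class TurboQuantKVCacheMethod(BaseKVCacheMethod):
    """Would own per-head codebooks/rotation state consumed by the
    encode/decode kernels; body elided in this sketch."""
    def __init__(self, quant_config: TurboQuantConfig):
        super().__init__(quant_config)
```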
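And for the kernel step, a pure-PyTorch stand-in shows what the encode/decode kernels' bit-packing must do for integer bit-widths; the actual Triton kernels fuse this with quantization and also handle the fractional widths. pack_codes/unpack_codes are hypothetical helpers, not the PoC's API:

```python
import torch

def pack_codes(codes: torch.Tensor, bits: int) -> torch.Tensor:
    """Pack per-dim code indices (values < 2**bits) into ceil(dim*bits/8)
    uint8 bytes per slot. Integer bit-widths only in this sketch."""
    dim = codes.shape[-1]
    nbytes = (dim * bits + 7) // 8
    out = torch.zeros(*codes.shape[:-1], nbytes, dtype=torch.uint8)
    for i in range(dim):
        start = i * bits
        byte, off = start // 8, start % 8
        chunk = codes[..., i].to(torch.int32) << off   # may straddle a byte
        out[..., byte] |= (chunk & 0xFF).to(torch.uint8)
        if off + bits > 8:
            out[..., byte + 1] |= (chunk >> 8).to(torch.uint8)
    return out

def unpack_codes(packed: torch.Tensor, dim: int, bits: int) -> torch.Tensor:
    """Inverse of pack_codes: recover the per-dim code indices."""
    mask = (1 << bits) - 1
    buf = packed.to(torch.int32)
    codes = torch.empty(*packed.shape[:-1], dim, dtype=torch.uint8)
    for i in range(dim):
        start = i * bits
        byte, off = start // 8, start % 8
        word = buf[..., byte]
        if off + bits > 8:
            word = word | (buf[..., byte + 1] << 8)
        codes[..., i] = ((word >> off) & mask).to(torch.uint8)
    return codes

# Round-trip check at 2 bits: 128 dims pack into exactly 32 bytes.
codes = torch.randint(0, 4, (16, 128), dtype=torch.uint8)
packed = pack_codes(codes, bits=2)
assert packed.shape[-1] == 32
assert torch.equal(unpack_codes(packed, dim=128, bits=2), codes)
```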

The PoC covers the first five steps; PR #38280 completes the integration with Triton attention. Related efforts: PolarQuant, ollama/ollama#15051, llama.cpp#20977, vllm-omni#2214.
