TurboQuant: 3.9-7.5x KV Cache Compression in vLLM

TurboQuant's online vector quantization compresses vLLM KV caches 3.9-7.5x at 2-4 bits/dim with perfect Needle-in-a-Haystack recall, no added latency, and a 21% throughput gain.

TurboQuant Delivers Superior KV Cache Compression

TurboQuant is an online vector-quantization scheme: a random rotation (obtained via QR decomposition) whitens each key/value vector, per-dimension Lloyd-Max codebooks quantize it at 2-4 bits/dim (including fractional 2.5- and 3.5-bit widths), and the resulting codes are bit-packed. Its distortion is provably within 2.7x of the information-theoretic limit. Unlike scalar formats such as FP8 (e4m3/e5m2) or INT4, which round each element independently, its inner-product variant keeps attention scores unbiased, the quantity attention actually computes, while delivering 4-5x memory savings. The paper's benchmarks show perfect Needle-in-a-Haystack recall at 4x compression and competitive LongBench scores at 2.5-3.5 bits/dim. The method requires no offline preprocessing, runs fully online, and maps well onto accelerators.
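For intuition, here is a minimal NumPy sketch of the MSE variant of that pipeline (QR-based random rotation, per-coordinate Lloyd-Max codebook, index encoding); bit-packing and the paper's unbiased inner-product estimator are omitted, and every name here is illustrative rather than the PoC's API:

```python
import numpy as np

def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Haar-random orthogonal rotation from QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))               # sign fix for Haar measure

def lloyd_max_codebook(samples: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """Fit a 2**bits-level scalar Lloyd-Max codebook (1-D k-means)."""
    k = 2 ** bits
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)  # quantile init
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2   # decision boundaries
        idx = np.searchsorted(edges, samples)    # nearest-level assignment
        for j in range(k):
            cell = samples[idx == j]
            if cell.size:
                levels[j] = cell.mean()          # centroid update
    return levels

def encode(x: np.ndarray, Q: np.ndarray, levels: np.ndarray) -> np.ndarray:
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, Q @ x).astype(np.uint8)  # per-dim codes

def decode(codes: np.ndarray, Q: np.ndarray, levels: np.ndarray) -> np.ndarray:
    return Q.T @ levels[codes]                   # dequantize, un-rotate

# Toy check: 4-bit codes keep query-key inner products nearly intact.
dim, rng = 128, np.random.default_rng(1)
Q = random_rotation(dim)
keys = rng.standard_normal((512, dim))
levels = lloyd_max_codebook((keys @ Q.T).ravel(), bits=4)
query = rng.standard_normal(dim)
approx = np.stack([decode(encode(k, Q, levels), Q, levels) for k in keys])
print(np.corrcoef(keys @ query, approx @ query)[0, 1])  # close to 1.0
```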

vLLM's existing alternatives (FP8 KV cache, compressed-tensors) minimize element-wise MSE; they offer no vector codebooks, no inner-product-aware objective, no theoretical guarantees, and no sub-4-bit flexibility.

Proven Zero-Loss Performance and Throughput Gains

A proof-of-concept (PoC) on Qwen2.5-7B (H200, 4K-16K contexts) yields:

| Config     | Exact Match | Avg Cache (GB) | vs Full |
|------------|-------------|----------------|---------|
| Full       | 6/6         | 0.510          | 1.0x    |
| TQ 2-bit   | 6/6         | 0.068          | 7.5x    |
| TQ 3.5-bit | 6/6         | 0.112          | 4.5x    |
| TQ 4-bit   | 6/6         | 0.132          | 3.9x    |

Upstream PR #38280 (Qwen2.5-1.5B, H200) confirms 12/12 exact matches across bit-widths, TTFT/ITL latency matching the baseline (9.3 ms / 8.4 ms), and a 21% throughput boost at batch=16. Phase 2 adds bit-packed uint8 storage, ceil(head_size * bits / 8) + 2 bytes per slot, to realize the full compression ratios.
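As a sanity check on that formula, a few lines of arithmetic reproduce the ratios in the table above, assuming head_size=128 (Qwen2.5's per-head dimension) and an fp16 baseline of 2 bytes per dim; both assumptions are mine, not stated in the PR:

```python
import math

head_size, baseline = 128, 128 * 2               # fp16 baseline: 2 B per dim
for bits in (2, 3.5, 4):
    slot = math.ceil(head_size * bits / 8) + 2   # packed codes + 2 B overhead
    print(f"{bits}-bit: {slot} B/slot -> {baseline / slot:.1f}x vs fp16")
# 2-bit:   34 B/slot -> 7.5x
# 3.5-bit: 58 B/slot -> 4.4x
# 4-bit:   66 B/slot -> 3.9x
```

The 2-bit and 4-bit numbers match the table exactly; the 3.5-bit row measures 4.5x against this 4.4x estimate, since the table's ratios come from actual cache sizes rather than the formula.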

Straightforward vLLM Integration Path

The design aligns with vLLM's existing quantization framework:

  • Extend CacheDType in cache.py/torch_utils.py to carry integer code indices.
  • Add a TurboQuantConfig registered via @register_quantization_config("turboquant") that targets Attention layers (see the config sketch after this list).
  • Implement TurboQuantKVCacheMethod (extending BaseKVCacheMethod) to own codebook parameters, the MSE and inner-product variants, and per-head codebooks.
  • Update is_quantized_kv_cache() detection.
  • Add CUDA/Triton encode/decode kernels (43/43 unit tests pass in the PoC); a pure-PyTorch packing stand-in also follows this list.
  • Adjust KVCacheSpec for codebook overhead and the variable compression ratios.
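A skeletal sketch of the config and method steps follows. The abstract-method surface of QuantizationConfig shifts between vLLM versions, so treat the imports and signatures as illustrative of the shape, not a drop-in implementation:

```python
import torch
from vllm.attention import Attention
from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    """KV-cache-only quantization: attention layers get TurboQuant,
    everything else stays unquantized."""

    def __init__(self, bits: float = 4.0, variant: str = "ip"):
        super().__init__()
        self.bits = bits        # 2, 2.5, 3, 3.5, or 4 bits/dim
        self.variant = variant  # "mse" or "ip" (unbiased inner products)

    @classmethod
    def get_name(cls) -> str:
        return "turboquant"

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        return 80               # Ampere or newer

    @staticmethod
    def get_config_filenames() -> list[str]:
        return []

    @classmethod
    def from_config(cls, config: dict) -> "TurboQuantConfig":
        return cls(bits=config.get("bits", 4.0),
                   variant=config.get("variant", "ip"))

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None             # leave linear layers alone

class TurboQuantKVCacheMethod(BaseKVCacheMethod):
    """Would own per-head codebooks/rotation state consumed by the
    encode/decode kernels; body elided in this sketch."""
    def __init__(self, quant_config: TurboQuantConfig):
        super().__init__(quant_config)
```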
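And for the kernel step, a pure-PyTorch stand-in shows what the encode/decode kernels' bit-packing must do for integer bit-widths; the actual Triton kernels fuse this with quantization and also handle the fractional widths. pack_codes/unpack_codes are hypothetical helpers, not the PoC's API:

```python
import torch

def pack_codes(codes: torch.Tensor, bits: int) -> torch.Tensor:
    """Pack per-dim code indices (values < 2**bits) into ceil(dim*bits/8)
    uint8 bytes per slot. Integer bit-widths only in this sketch."""
    dim = codes.shape[-1]
    nbytes = (dim * bits + 7) // 8
    out = torch.zeros(*codes.shape[:-1], nbytes, dtype=torch.uint8)
    for i in range(dim):
        start = i * bits
        byte, off = start // 8, start % 8
        chunk = codes[..., i].to(torch.int32) << off   # may straddle a byte
        out[..., byte] |= (chunk & 0xFF).to(torch.uint8)
        if off + bits > 8:
            out[..., byte + 1] |= (chunk >> 8).to(torch.uint8)
    return out

def unpack_codes(packed: torch.Tensor, dim: int, bits: int) -> torch.Tensor:
    """Inverse of pack_codes: recover the per-dim code indices."""
    mask = (1 << bits) - 1
    buf = packed.to(torch.int32)
    codes = torch.empty(*packed.shape[:-1], dim, dtype=torch.uint8)
    for i in range(dim):
        start = i * bits
        byte, off = start // 8, start % 8
        word = buf[..., byte]
        if off + bits > 8:
            word = word | (buf[..., byte + 1] << 8)
        codes[..., i] = ((word >> off) & mask).to(torch.uint8)
    return codes

# Round-trip check at 2 bits: 128 dims pack into exactly 32 bytes.
codes = torch.randint(0, 4, (16, 128), dtype=torch.uint8)
packed = pack_codes(codes, bits=2)
assert packed.shape[-1] == 32
assert torch.equal(unpack_codes(packed, dim=128, bits=2), codes)
```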

The PoC covers the first five steps; PR #38280 completes the integration with Triton attention. Related efforts: PolarQuant, ollama/ollama#15051, llama.cpp#20977, vllm-omni#2214.
