TurboQuant: 4-7x KV Cache Compression in vLLM
TurboQuant vector quantization compresses vLLM KV caches 3.9-7.5x at 2-4 bits/dim with perfect Needle-in-a-Haystack recall, zero latency overhead, and 21% throughput gains.
TurboQuant Delivers Superior KV Cache Compression
TurboQuant uses online vector quantization with a QR-based random rotation, Lloyd-Max codebooks, and bit-packing to store KV caches at 2-4 bits per dimension (including fractional 2.5/3.5-bit widths), achieving provably near-optimal distortion within 2.7x of the information-theoretic limit. Unlike scalar methods such as FP8 (e4m3/e5m2) or INT4, it yields unbiased inner-product estimates, which is what attention actually consumes, while enabling 4-5x memory savings. Paper benchmarks show perfect Needle-in-a-Haystack recall at 4x compression and competitive LongBench scores at 2.5-3.5 bits/dim. The method requires no preprocessing, runs online, and suits accelerators.
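To make the rotate-then-quantize idea concrete, here is a minimal offline sketch assuming a per-dimension Lloyd-Max codebook applied after a QR-derived random rotation. It omits the paper's online/streaming machinery, fractional bit-widths, and the inner-product-preserving variant; all function names are illustrative, not paper or vLLM APIs.

```python
# Sketch only: rotate vectors with a random orthogonal matrix, then quantize
# each rotated dimension with its own Lloyd-Max scalar codebook.
import numpy as np

def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def lloyd_max_1d(x: np.ndarray, bits: int, iters: int = 20) -> np.ndarray:
    """1-D Lloyd-Max: alternate nearest-codeword assignment and centroid update."""
    codebook = np.quantile(x, np.linspace(0, 1, 2 ** bits))  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(codebook.size):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    return codebook

def encode(kv: np.ndarray, rot: np.ndarray, bits: int):
    """Rotate, then store per-dimension codebooks and uint8 codeword indices."""
    z = kv @ rot
    books = [lloyd_max_1d(z[:, d], bits) for d in range(z.shape[1])]
    idx = np.stack([np.abs(z[:, d, None] - b[None, :]).argmin(axis=1)
                    for d, b in enumerate(books)], axis=1).astype(np.uint8)
    return idx, books

def decode(idx: np.ndarray, books, rot: np.ndarray) -> np.ndarray:
    """Look up codewords and rotate back (rot is orthogonal, so rot.T inverts it)."""
    z_hat = np.stack([books[d][idx[:, d]] for d in range(idx.shape[1])], axis=1)
    return z_hat @ rot.T

# Example: quantize 1024 key vectors of head_size 64 at 4 bits/dim.
keys = np.random.randn(1024, 64).astype(np.float32)
rot = random_rotation(64)
idx, books = encode(keys, rot, bits=4)
rel_err = np.linalg.norm(keys - decode(idx, books, rot)) / np.linalg.norm(keys)
print(f"relative L2 error at 4 bits/dim: {rel_err:.3f}")
```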
vLLM's existing alternatives (FP8, compressed-tensors) optimize element-wise MSE but lack vector codebooks, an inner-product objective, theoretical guarantees, and sub-4-bit flexibility.
Proven Zero-Loss Performance and Throughput Gains
PoC on Qwen2.5-7B (H200, 4K-16K context) yields:
| Config | Exact Match | Avg KV Cache (GB) | Compression vs Full |
|---|---|---|---|
| Full | 6/6 | 0.510 | 1.0x |
| TQ 2-bit | 6/6 | 0.068 | 7.5x |
| TQ 3.5-bit | 6/6 | 0.112 | 4.5x |
| TQ 4-bit | 6/6 | 0.132 | 3.9x |
Upstream PR #38280 (Qwen2.5-1.5B, H200) confirms 12/12 exact matches across bit-widths, TTFT/ITL latency matching baseline (9.3ms/8.4ms), and a 21% throughput boost at batch=16. Phase 2 adds bit-packed uint8 storage (ceil(head_size*bits/8)+2 bytes/slot) to realize the full compression ratios.
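The quoted storage budget is easy to sanity-check. The sketch below assumes head_size=128 (Qwen2.5's head dimension) and that the +2 bytes cover per-slot metadata such as a scale; it closely reproduces the measured ratios in the table above.

```python
# Per-slot byte budget for bit-packed storage vs an FP16 baseline (sketch).
import math

def bytes_per_slot(head_size: int, bits: float) -> int:
    """Packed index bytes plus 2 bytes of assumed per-slot metadata."""
    return math.ceil(head_size * bits / 8) + 2

head_size = 128          # Qwen2.5 head dim (assumption)
fp16 = head_size * 2     # FP16 baseline: 2 bytes per dimension
for bits in (2, 3.5, 4):
    packed = bytes_per_slot(head_size, bits)
    print(f"TQ {bits}-bit: {packed} B/slot, {fp16 / packed:.1f}x vs FP16")
# -> 34 B (7.5x), 58 B (4.4x), 66 B (3.9x)
```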
Straightforward vLLM Integration Path
The integration aligns with vLLM's existing quantization framework:
1. Extend `CacheDType` in `cache.py`/`torch_utils.py` for integer indices.
2. Add `@register_quantization_config("turboquant")` `TurboQuantConfig` targeting Attention layers (sketched after this list).
3. Implement `TurboQuantKVCacheMethod` (extends `BaseKVCacheMethod`) for codebook params, MSE/IP variants, and per-head support.
4. Update `is_quantized_kv_cache()` detection.
5. Add CUDA/Triton encode/decode kernels (43/43 tests pass).
6. Adjust `KVCacheSpec` for codebook overhead and variable compression ratios.
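As a sketch of steps 2-4, the following shows how a TurboQuant config could plug into vLLM's quantization registry. Only the decorator and class names come from the plan above; the `QuantizationConfig` abstract methods, import paths, and `BaseKVCacheMethod` signature vary across vLLM versions, so treat the bodies, parameters (`bits`, `variant`), and the minimum-capability value as assumptions.

```python
import torch

from vllm.attention.layer import Attention
from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod


class TurboQuantKVCacheMethod(BaseKVCacheMethod):
    """Step 3 placeholder: would hold per-head codebook params and select the
    MSE or inner-product (IP) variant; encode/decode live in the CUDA/Triton kernels."""


@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    """KV-cache-only quantization: weights and activations stay in FP16/BF16."""

    def __init__(self, bits: float = 4.0, variant: str = "ip"):
        super().__init__()
        self.bits = bits        # 2, 2.5, 3, 3.5, or 4 bits per dimension
        self.variant = variant  # "mse" or "ip" (inner-product preserving)

    @classmethod
    def get_name(cls) -> str:
        return "turboquant"

    @classmethod
    def get_supported_act_dtypes(cls):
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        return 80  # assumption: Ampere or newer for the encode/decode kernels

    @classmethod
    def get_config_filenames(cls):
        return []

    @classmethod
    def from_config(cls, config: dict) -> "TurboQuantConfig":
        return cls(bits=config.get("bits", 4.0), variant=config.get("variant", "ip"))

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        # Step 4's is_quantized_kv_cache() keys off this: only Attention layers
        # receive a KV-cache method; all other layers are left unquantized.
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None
```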
The PoC covers steps 1-5; PR #38280 integrates fully with the Triton attention backend. Related: PolarQuant, ollama/ollama#15051, llama.cpp#20977, vllm-omni#2214.