TurboQuant Doubles LLM Context via 3b/2b KV Quantization

Compresses the KV cache to 3-bit keys and 2-bit values with custom Triton kernels and vLLM integration, freeing 30GB of VRAM on RTX 5090 (2x max tokens) and 233MB per GPU on 8x RTX 3090 (1.45x context, 30.9% savings), while passing needle-in-a-haystack tests and the paper's theorem checks.

KV Cache Compression Delivers Massive VRAM Savings

TurboQuant quantizes KV cache entries to 3-bit keys and 2-bit values using Lloyd-Max codebooks optimized for Beta-distributed attention vectors, random orthogonal rotations, and QJL projections for unbiased inner-product estimation. On an RTX 5090 node with Qwen3.5-27B-AWQ (4-bit weights, 16/64 full-attention layers), it frees 30GB of KV cache across 4 GPUs at 30k context, doubling max token capacity from 457k to 914k while boosting prefill throughput 5.7% (1,804 to 1,907 tok/s) and decode 3.1% (1,264 to 1,303 tok/s), and cutting peak activations 7% (644MB to 599MB).
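The rotate-then-quantize idea can be sketched in a few lines of NumPy. This is an illustrative toy, not TurboQuant's implementation: the function names, codebook training loop, and dimensions are assumptions, and the real system adds QJL projections, bit-packing, and fused Triton kernels.

```python
# Toy sketch of rotate-then-quantize (illustrative names, not the repo's API).
import numpy as np

def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Haar-distributed orthogonal matrix via QR of a Gaussian draw."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diagonal(r))  # sign fix for a proper Haar sample

def lloyd_max_codebook(samples: np.ndarray, bits: int, iters: int = 25) -> np.ndarray:
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid updates."""
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(levels.size):
            bucket = samples[idx == k]
            if bucket.size:
                levels[k] = bucket.mean()
    return np.sort(levels)

def quantize(x: np.ndarray, rot: np.ndarray, levels: np.ndarray) -> np.ndarray:
    z = rot @ x  # rotation spreads energy evenly across coordinates
    return np.abs(z[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)

def dequantize(codes: np.ndarray, rot: np.ndarray, levels: np.ndarray) -> np.ndarray:
    return rot.T @ levels[codes]  # reconstruct, then undo the rotation

if __name__ == "__main__":
    dim, bits = 128, 3
    rng = np.random.default_rng(1)
    rot = random_rotation(dim)
    keys = rng.standard_normal((1024, dim))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    levels = lloyd_max_codebook((keys @ rot.T).ravel(), bits)  # train on rotated coords
    k = keys[0]
    k_hat = dequantize(quantize(k, rot, levels), rot, levels)
    print(f"{bits}-bit cosine sim:", round(float(k @ k_hat / np.linalg.norm(k_hat)), 3))
```

In the real system the uint8 codes would be bit-packed (kv_cache.py) rather than stored one byte each, which is where the 3b/2b storage figures come from.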

On 8x RTX 3090 with Qwen3.5-35B-A3B MoE (205 experts pruned, TP=8, 10/40 full-attention layers), it saves 30.9% of KV cache per GPU (e.g., 755MB to 522MB at 131k context, 233MB freed), extending the baseline 1.41M total tokens to 2.04M (1.45x), or three extra 131k-token requests. Baseline decode holds at 98-133 tok/s up to 131k context, and TQ maintains quality without throughput regression. Freed VRAM per GPU scales linearly with context, as the sizing sketch below illustrates: 17MB at 8k, 59MB at 32k, 179MB at 100k, 233MB at 131k.
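The linear scaling follows from simple per-token arithmetic: KV bytes per token are fixed, so freed VRAM grows proportionally with context. A back-of-envelope sketch, where the layer/head/dim values are illustrative assumptions rather than the benchmarked Qwen configuration:

```python
# Back-of-envelope KV sizing. The head/dim values are illustrative guesses,
# not the benchmarked Qwen config; TP=8 matches the 8x3090 setup above.
def kv_bytes_per_token(layers, kv_heads, head_dim, key_bits, value_bits):
    return layers * kv_heads * head_dim * (key_bits + value_bits) / 8

LAYERS, KV_HEADS, HEAD_DIM, TP = 10, 4, 128, 8  # only full-attn layers compressed
fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 16, 16)
tq = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 3, 2)
print(f"per-element savings: {1 - tq / fp16:.1%}")  # (3+2)/32 of fp16 size
for ctx in (8_192, 32_768, 131_072):
    freed_per_gpu = (fp16 - tq) * ctx / TP / 2**20
    print(f"{ctx:>7} tokens -> ~{freed_per_gpu:5.0f} MiB freed per GPU")
```

Per element the 3b/2b format is ~84% smaller than fp16, yet the end-to-end figure is 30.9% because only the 10 full-attention layers are compressed and the scheme carries rotation/codebook overhead.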

Quality Preserved with Theoretical Guarantees

Cosine similarity stays near-lossless for 3- and 4-bit keys (1.000) but drops to 0.940 for 2-bit values, the dominant bottleneck (4-bit values reach 0.997); the combined 3b/2b configuration therefore lands at 0.940. Needle-in-a-haystack passes across the board: single needle from 512 to 131k tokens, 5/5 multi-needle at max context, 3/3 multi-fact coherence, golden-ratio completion (perplexity 1.05-1.35), and math reasoning. Recall@8 = 0.55 (3-bit, N=4096, above the paper's 0.40 threshold); Spearman rank rho > 0.85 (N=2048). The paper's theorems are validated empirically: MSE bounds hold for unit-norm vectors, distortion scales as 1/4^b (2b = 0.70x of bound, 3b = 0.82x, 4b = 0.97x), bias is <0.1%, and compression reaches 4.41x at head_dim=256.
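The 1/4^b scaling is the classic quantization-noise result: each extra bit halves the step size, which quarters the mean-squared error. A minimal Monte Carlo sketch of that claim, using a toy midrise uniform quantizer rather than the repo's proof.py:

```python
# Monte Carlo check that MSE drops ~4x per extra bit for a midrise
# uniform quantizer on unit-norm vectors (toy sketch, not proof.py).
import numpy as np

def uniform_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    lo, hi = x.min(), x.max()
    step = (hi - lo) / 2**bits
    codes = np.clip(np.floor((x - lo) / step), 0, 2**bits - 1)
    return lo + (codes + 0.5) * step  # reconstruct at cell midpoints

rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 128))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit-norm vectors

mse = {b: np.mean((np.stack([uniform_quantize(v, b) for v in x]) - x) ** 2)
       for b in (2, 3, 4, 5)}
for b in (3, 4, 5):
    print(f"MSE({b-1}b) / MSE({b}b) = {mse[b-1] / mse[b]:.2f}   # ~4 expected")
```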

An adversarial audit confirms 2x context on dense models and ~4.6-5x compression (the paper's headline claim is misleading because it ignores the Pi/S matrices and the ring buffer), but flags low recall@1 (38%), notes that hybrid decode dequantizes to float32 (a storage win with no compute savings), and observes that the needle tests are easy (query≠key copies). GPU utilization stays near 100% with no idle time at scale; power draw is 130-142W.
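The audit's "storage win, no compute save" point is easiest to see in code: hybrid decode keeps the cache compressed but materializes float32 tensors before the attention math, so FLOPs match uncompressed attention. A hedged sketch with hypothetical names; the real path runs through the fused Triton kernels:

```python
# Hybrid decode sketch: compressed storage, full-precision compute.
# Names are hypothetical; the real path uses fused Triton kernels.
import numpy as np

def hybrid_decode_attn(q, k_codes, v_codes, k_levels, v_levels):
    k = k_levels[k_codes].astype(np.float32)  # dequantize 3-bit key codes
    v = v_levels[v_codes].astype(np.float32)  # dequantize 2-bit value codes
    scores = k @ q / np.sqrt(q.size)          # same matmul cost as fp16 attention
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v                              # standard softmax-weighted sum
```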

Triton Kernels and vLLM Integration for Production

Custom Triton kernels fuse decode attention, and a vLLM adapter monkey-patches the KV hooks for quantization, a flat compressed store, and hybrid decode. The architecture is modular: codebook.py (Beta quantizers), rotation.py (projections), quantizer.py (TurboQuantMSE/Prod algorithms), kv_cache.py (bit-packing), and score.py (compressed scoring). It supports dense and MoE models and compresses only full-attention layers. All 35+ tests pass (7 core quantizer, 19 modular, 9 theorem validations). Install via pip from setup.py; benchmark with benchmark.py/proof.py. Tested on RTX 3090/5090, vLLM 0.18.0, and AMD EPYC.
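The adapter follows a standard monkey-patch pattern: intercept the backend's KV write, quantize on the way in, and route reads through the compressed store. A generic sketch under assumed hook names (the actual vLLM entry points live in the repo's adapter, not here):

```python
# Generic monkey-patch pattern. write_kv / compressed_store are assumed
# names for illustration; they are not vLLM's or TurboQuant's real API.
import types

def install_tq_hooks(attn_backend, quantizer):
    original_write = attn_backend.write_kv        # keep for uninstall

    def quantized_write(self, key, value, slot):
        k_codes = quantizer.quantize_key(key)     # 3-bit key codes
        v_codes = quantizer.quantize_value(value) # 2-bit value codes
        self.compressed_store.put(slot, k_codes, v_codes)  # flat store

    attn_backend.write_kv = types.MethodType(quantized_write, attn_backend)
    return original_write
```

Returning the original bound method lets the caller restore the stock write path, which is useful when A/B-benchmarking compressed against baseline runs.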
