TurboQuant: 3-Bit KV Cache Slashes Memory in llama.cpp
Google's TurboQuant quantizes the KV cache to 2.67 bits/value with <1% perplexity loss, enabling 110K+ contexts on consumer GPUs; llama.cpp community forks deliver CUDA/ROCm support and 5x compression.
TurboQuant Core Mechanism Delivers Extreme Compression
TurboQuant uses Walsh-Hadamard Transform (WHT) rotations on 128-element blocks of KV cache vectors, followed by per-channel normalization and asymmetric quantization. This achieves 2.67 bits per value (turbo3: 3 blocks of 42 bits + 1 norm bit) or 2.25 bits (turbo4) while keeping perplexity loss under 1% on Llama-3.1 405B at 128K context.
Key insight: WHT decorrelates dimensions, allowing independent quantization without cross-channel leakage. Unlike standard int4 (4 bits/value, 36-37% decode slowdown at 110K context), TurboQuant supports direct quantized matmul, eliminating dequantization overhead.
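As a rough illustration of the rotate-then-quantize idea, here is a minimal C++ sketch of a 128-element fast Walsh-Hadamard rotation followed by low-bit quantization with one stored scale per block. The `block_q3_128` layout, the symmetric 3-bit grid, and the single per-block scale are illustrative assumptions; the actual TurboQuant format uses per-channel normalization and asymmetric quantization and differs in bit layout.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Fast Walsh-Hadamard transform over one 128-element block (orthonormal).
static void fwht_128(float *v) {
    for (int len = 1; len < 128; len <<= 1) {
        for (int i = 0; i < 128; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                const float a = v[j], b = v[j + len];
                v[j]       = a + b;   // butterfly sum
                v[j + len] = a - b;   // butterfly difference
            }
        }
    }
    const float inv = 1.0f / std::sqrt(128.0f);
    for (int i = 0; i < 128; ++i) v[i] *= inv;   // keep the rotation orthonormal
}

// Hypothetical storage: one scale per 128-element rotation group plus
// low-bit codes (kept unpacked here for readability).
struct block_q3_128 {
    float   scale;
    uint8_t q[128];  // codes in 0..6, representing -3..+3
};

static block_q3_128 quantize_block(const float *x) {
    float v[128];
    std::copy(x, x + 128, v);
    fwht_128(v);                                   // decorrelate the 128 dimensions

    float amax = 0.0f;
    for (int i = 0; i < 128; ++i) amax = std::max(amax, std::fabs(v[i]));

    block_q3_128 out;
    out.scale = amax > 0.0f ? amax / 3.0f : 1.0f;  // map |v| <= amax onto -3..+3
    for (int i = 0; i < 128; ++i) {
        const int q = (int)std::lround(v[i] / out.scale);
        out.q[i] = (uint8_t)(std::clamp(q, -3, 3) + 3);
    }
    return out;
}
```

Because the Hadamard rotation is orthonormal, dot products are preserved in the rotated basis, which is what makes it possible to compute attention scores directly on the quantized codes (see the sketch in the optimizations section below).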
"Google Research just posted a blog and paper about a new algorithm that allows quantizing the KV cache down to under 3 bits with close to 0 accuracy loss." — kth8, discussion starter.
Paper claims: On Llama-3.1-405B-Instruct at 128K context, turbo3 matches FP16 perplexity (10.45 vs 10.44) at 37.5x compression vs FP16; 8.6x vs int8. MLX devs already prototyping.
Baseline KV Quantization Bottlenecks Exposed by Benchmarks
Current llama.cpp KV cache quants (q8_0, q4_0) save memory but incur decode penalties from per-token dequantization. Corrected benchmarks on a DGX Spark GB10 (Nemotron-3-Nano-30B Q4_K_XL, 128K context, build 8399):
| Cache | KV MiB | Total GPU MiB | Savings | Prompt tok/s @110K | Gen tok/s @110K |
|---|---|---|---|---|---|
| f16 | 768 | 23,092 | - | 815 | 38.0 |
| q8_0 | 408 | 22,732 | -47% | 810 | 25.0 |
| q4_0 | 216 | 22,540 | -72% | 813 | 24.0 (-37%) |
Prompt evaluation is unaffected; generation slows 37% at 110K due to dequantization overhead, which TurboQuant's direct quantized compute is designed to eliminate. Early flawed measurements based on process RSS (which appeared to show q4_0 using more memory than f16) were corrected using nvidia-smi plus llama.cpp's internal KV cache size reporting.
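The KV column scales linearly with effective bits per value, so the table can be sanity-checked (and extrapolated to a TurboQuant footprint) with a one-liner. The 8.5 and 4.5 bits/value for q8_0 and q4_0 include llama.cpp's per-block fp16 scales; the turbo3 figure is the paper's 2.67 bits/value claim, used here only as an assumption.

```cpp
#include <cstdio>

// Scale a measured f16 KV-cache footprint to other cache types by their
// effective bits per value.
static double kv_mib(double f16_mib, double bits_per_value) {
    return f16_mib * bits_per_value / 16.0;
}

int main() {
    const double f16_mib = 768.0;  // f16 KV size from the table above
    printf("q8_0  : %6.0f MiB (table: 408)\n", kv_mib(f16_mib, 8.5));
    printf("q4_0  : %6.0f MiB (table: 216)\n", kv_mib(f16_mib, 4.5));
    printf("turbo3: %6.0f MiB (paper: 2.67 bits/value)\n", kv_mib(f16_mib, 2.67));
    return 0;
}
```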
NVIDIA's KTVC (a similar ~20x memory reduction) was referenced as a parallel vendor push toward extreme KV compression.
"The generation decode overhead at 110K (37% slower with q4_0) is the bottleneck TurboQuant eliminates by enabling direct computation on quantized values." — dentity007, corrected benchmark.
llama.cpp Forks Integrate TurboQuant with Platform Support
Community prototypes:
- TheTom/llama-cpp-turboquant (feature/turboquant-kv-cache): CUDA (signalnine's PRs + InnerQ), ROCm/HIP, block_size=128 (5.12x turbo3 compression vs 4.57x), turbo4 prefill optimizations, asymmetric K/V quantization. Fixes out-of-bounds writes in the CUDA set-rows.cu kernel.
- unixsysdev/llama-turboquant: works on Strix Halo APUs; the README details optimal builds.
- spiritbuun's CUDA fork with separate optimizations.
block_size=128 stores 1 norm per 128-element rotation group (instead of 4 identical copies), improving the compression ratio without changing the rotation group size; see the accounting sketch below. The turbo4 path had 7 bugs fixed (perplexity dropped from 679 to 6.125). Norm corrections: turbo3 (TheTom), turbo4 (spiritbuun).
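A back-of-the-envelope sketch of that storage accounting, assuming 3-bit codes and fp16 norms (the exact field widths are an assumption), reproduces the 5.12x vs 4.57x figures quoted above:

```cpp
#include <cstdio>

int main() {
    const double code_bits = 3.0;   // assumed turbo3 code width
    const double norm_bits = 16.0;  // assumed fp16 norm

    // One norm per 128-element rotation group (block_size=128):
    const double bpv_128 = (128.0 * code_bits + 1.0 * norm_bits) / 128.0;  // 3.125
    // Four identical norm copies (one per 32-element storage block):
    const double bpv_32  = (128.0 * code_bits + 4.0 * norm_bits) / 128.0;  // 3.5

    printf("block_size=128: %.3f bits/value -> %.2fx vs f16\n", bpv_128, 16.0 / bpv_128);
    printf("block_size=32 : %.3f bits/value -> %.2fx vs f16\n", bpv_32,  16.0 / bpv_32);
    return 0;
}
```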
GB10 (Blackwell, sm_121) validation is still pending; it would be the first CUDA validation of the block_size=128 path.
"Builds and works on Strix Halo - details in the README - https://github.com/unixsysdev/llama-turboquant/blob/main/README.md PS: Closer to optimal." — unixsysdev.
Advanced Optimizations and Trade-offs
- turbo3/turbo4: turbo3 (2.67 bits/value), turbo4 (2.25 bits/value) via finer asymmetry.
- InnerQ: Per-channel equalization.
- Rotation groups are fixed at 128 elements; block_size tunes how norms are stored.
- Compute: direct quantized matmul on the GPU (CUDA/ROCm paths validated); a CPU sketch of the idea follows this list.
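A minimal CPU sketch of what "compute directly on quantized values" means, reusing the hypothetical block layout from the earlier quantization sketch; an actual CUDA/ROCm kernel would vectorize and fuse this, but the structure is the same: no dequantized K buffer is ever written out.

```cpp
#include <cstdint>

// Same hypothetical layout as the quantization sketch above: one scale per
// 128-element rotation group, codes 0..6 standing for -3..+3.
struct block_q3_128 {
    float   scale;
    uint8_t q[128];
};

// Attention-score style dot product computed directly on the quantized codes.
// q_rot is the query row already rotated into the same Hadamard basis.
static float dot_query_block(const float *q_rot, const block_q3_128 &k) {
    float acc = 0.0f, qsum = 0.0f;
    for (int i = 0; i < 128; ++i) {
        acc  += q_rot[i] * (float)k.q[i];  // accumulate against raw integer codes
        qsum += q_rot[i];                  // needed to undo the +3 storage offset
    }
    // k_i ~= scale * (code_i - 3)  =>  dot(q, k) ~= scale * (acc - 3 * qsum)
    return k.scale * (acc - 3.0f * qsum);
}
```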
Trade-offs: initial CUDA bugs have been fixed; performance gains at long context outweigh the roughly neutral behavior at short context. Perplexity holds up on WikiText-2 / Llama evaluations.
Attribution notes: signalnine (CUDA port), TheTom (turbo4/asymmetry work), spiritbuun (turbo4 norms, CUDA optimizations).
"Block size 128 is a storage block size change (1 norm per 128-element rotation group instead of 4 identical copies)." — TheTom.
Key Takeaways
- Implement TurboQuant in llama.cpp via TheTom's fork for 5x+ KV compression on CUDA/ROCm.
- Benchmark your hardware: expect ~72% KV memory savings with the q4_0 baseline but a ~37% decode slowdown at 110K+ context, which TurboQuant's direct compute is meant to recover.
- Use block_size=128 for optimal ratios; validate on Blackwell (sm_121) for newest GPUs.
- Prioritize direct quantized compute to eliminate the dequantization bottleneck in the generation phase.
- Test perplexity on your models: <1% loss is typical for Llama-family models at 128K.
- Cross-reference NVIDIA KTVC for vendor baselines.
- Build from the corrected forks; avoid RSS for GPU memory measurement and use nvidia-smi plus the internal KV cache reporting instead.
- Explore asymmetry (separate K and V quantization) and InnerQ for further gains.