TurboQuant: 3-Bit KV Cache Slashes Memory in llama.cpp
Google's TurboQuant quantizes the KV cache to 2.67 bits/value with <1% perplexity loss, enabling 110K+ contexts on consumer GPUs; llama.cpp community forks deliver CUDA/ROCm support and 5x compression.
TurboQuant Core Mechanism Delivers Extreme Compression
TurboQuant uses Walsh-Hadamard Transform (WHT) rotations on 128-element blocks of KV cache vectors, followed by per-channel normalization and asymmetric quantization. This achieves 2.67 bits per value (turbo3: 3 blocks of 42 bits + 1 norm bit) or 2.25 bits (turbo4) while keeping perplexity loss under 1% on Llama-3.1 405B at 128K context.
Key insight: WHT decorrelates dimensions, allowing independent quantization without cross-channel leakage. Unlike standard int4 (4 bits/value, 36-37% decode slowdown at 110K context), TurboQuant supports direct quantized matmul, eliminating dequantization overhead.
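As a rough illustration of the rotate-then-quantize idea, here is a minimal C++ sketch of a 128-element fast Walsh-Hadamard rotation followed by low-bit quantization with one stored scale per block. The `block_q3_128` layout, the symmetric 3-bit grid, and the single per-block scale are illustrative assumptions; the actual TurboQuant format uses per-channel normalization and asymmetric quantization and differs in bit layout.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Fast Walsh-Hadamard transform over one 128-element block (orthonormal).
static void fwht_128(float *v) {
    for (int len = 1; len < 128; len <<= 1) {
        for (int i = 0; i < 128; i += len << 1) {
            for (int j = i; j < i + len; ++j) {
                const float a = v[j], b = v[j + len];
                v[j]       = a + b;   // butterfly sum
                v[j + len] = a - b;   // butterfly difference
            }
        }
    }
    const float inv = 1.0f / std::sqrt(128.0f);
    for (int i = 0; i < 128; ++i) v[i] *= inv;   // keep the rotation orthonormal
}

// Hypothetical storage: one scale per 128-element rotation group plus
// low-bit codes (kept unpacked here for readability).
struct block_q3_128 {
    float   scale;
    uint8_t q[128];  // codes in 0..6, representing -3..+3
};

static block_q3_128 quantize_block(const float *x) {
    float v[128];
    std::copy(x, x + 128, v);
    fwht_128(v);                                   // decorrelate the 128 dimensions

    float amax = 0.0f;
    for (int i = 0; i < 128; ++i) amax = std::max(amax, std::fabs(v[i]));

    block_q3_128 out;
    out.scale = amax > 0.0f ? amax / 3.0f : 1.0f;  // map |v| <= amax onto -3..+3
    for (int i = 0; i < 128; ++i) {
        const int q = (int)std::lround(v[i] / out.scale);
        out.q[i] = (uint8_t)(std::clamp(q, -3, 3) + 3);
    }
    return out;
}
```

Because the Hadamard rotation is orthonormal, dot products are preserved in the rotated basis, which is what makes it possible to compute attention scores directly on the quantized codes (see the sketch in the optimizations section below).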
"Google Research just posted a blog and paper about a new algorithm that allows quantizing the KV cache down to under 3 bits with close to 0 accuracy loss." — kth8, discussion starter.
Paper claims: On Llama-3.1-405B-Instruct at 128K context, turbo3 matches FP16 perplexity (10.45 vs 10.44) at 37.5x compression vs FP16; 8.6x vs int8. MLX devs already prototyping.
Baseline KV Quantization Bottlenecks Exposed by Benchmarks
Current llama.cpp KV cache quants (q8_0, q4_0) save memory but incur decode penalties from per-token dequantization. Corrected benchmarks on a DGX Spark GB10 (Nemotron-3-Nano-30B Q4_K_XL, 128K context, build 8399):
| Cache | KV MiB | Total GPU MiB | Savings | Prompt tok/s @110K | Gen tok/s @110K |
|---|---|---|---|---|---|
| f16 | 768 | 23,092 | - | 815 | 38.0 |
| q8_0 | 408 | 22,732 | -47% | 810 | 25.0 |
| q4_0 | 216 | 22,540 | -72% | 813 | 24.0 (-37%) |
Prompt evaluation is unaffected; generation slows 37% at 110K due to dequantization overhead, which TurboQuant's direct quantized compute is designed to eliminate. Early flawed measurements based on process RSS (which appeared to show q4_0 using more memory than f16) were corrected using nvidia-smi plus llama.cpp's internal KV cache size reporting.
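The KV column scales linearly with effective bits per value, so the table can be sanity-checked (and extrapolated to a TurboQuant footprint) with a one-liner. The 8.5 and 4.5 bits/value for q8_0 and q4_0 include llama.cpp's per-block fp16 scales; the turbo3 figure is the paper's 2.67 bits/value claim, used here only as an assumption.

```cpp
#include <cstdio>

// Scale a measured f16 KV-cache footprint to other cache types by their
// effective bits per value.
static double kv_mib(double f16_mib, double bits_per_value) {
    return f16_mib * bits_per_value / 16.0;
}

int main() {
    const double f16_mib = 768.0;  // f16 KV size from the table above
    printf("q8_0  : %6.0f MiB (table: 408)\n", kv_mib(f16_mib, 8.5));
    printf("q4_0  : %6.0f MiB (table: 216)\n", kv_mib(f16_mib, 4.5));
    printf("turbo3: %6.0f MiB (paper: 2.67 bits/value)\n", kv_mib(f16_mib, 2.67));
    return 0;
}
```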
NVIDIA's KTVC (a similar ~20x memory reduction) was referenced as a parallel vendor push toward extreme KV compression.
"The generation decode overhead at 110K (37% slower with q4_0) is the bottleneck TurboQuant eliminates by enabling direct computation on quantized values." — dentity007, corrected benchmark.
llama.cpp Forks Integrate TurboQuant with Platform Support
Community prototypes:
- TheTom/llama-cpp-turboquant (feature/turboquant-kv-cache): CUDA (signalnine's PRs + InnerQ), ROCm/HIP, block_size=128 (5.12x turbo3 compression vs 4.57x), turbo4 prefill optimizations, asymmetric K/V quantization. Fixes out-of-bounds writes in the CUDA set-rows.cu kernel.
- unixsysdev/llama-turboquant: works on Strix Halo APUs; the README details optimal builds.
- spiritbuun's CUDA fork with separate optimizations.
block_size=128 stores 1 norm per 128-element rotation group (instead of 4 identical copies), improving the compression ratio without changing the rotation group size; see the accounting sketch below. The turbo4 path had 7 bugs fixed (perplexity dropped from 679 to 6.125). Norm corrections: turbo3 (TheTom), turbo4 (spiritbuun).
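A back-of-the-envelope sketch of that storage accounting, assuming 3-bit codes and fp16 norms (the exact field widths are an assumption), reproduces the 5.12x vs 4.57x figures quoted above:

```cpp
#include <cstdio>

int main() {
    const double code_bits = 3.0;   // assumed turbo3 code width
    const double norm_bits = 16.0;  // assumed fp16 norm

    // One norm per 128-element rotation group (block_size=128):
    const double bpv_128 = (128.0 * code_bits + 1.0 * norm_bits) / 128.0;  // 3.125
    // Four identical norm copies (one per 32-element storage block):
    const double bpv_32  = (128.0 * code_bits + 4.0 * norm_bits) / 128.0;  // 3.5

    printf("block_size=128: %.3f bits/value -> %.2fx vs f16\n", bpv_128, 16.0 / bpv_128);
    printf("block_size=32 : %.3f bits/value -> %.2fx vs f16\n", bpv_32,  16.0 / bpv_32);
    return 0;
}
```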
GB10 (Blackwell, sm_121) validation is still pending; it would be the first CUDA validation of the block_size=128 path.
"Builds and works on Strix Halo - details in the README - https://github.com/unixsysdev/llama-turboquant/blob/main/README.md PS: Closer to optimal." — unixsysdev.
Advanced Optimizations and Trade-offs
- turbo3/turbo4: turbo3 (2.67 bits/value), turbo4 (2.25 bits/value) via finer asymmetry.
- InnerQ: Per-channel equalization.
- Rotation groups are fixed at 128 elements; block_size tunes how norms are stored.
- Compute: direct quantized matmul on the GPU (CUDA/ROCm paths validated); a CPU sketch of the idea follows this list.
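A minimal CPU sketch of what "compute directly on quantized values" means, reusing the hypothetical block layout from the earlier quantization sketch; an actual CUDA/ROCm kernel would vectorize and fuse this, but the structure is the same: no dequantized K buffer is ever written out.

```cpp
#include <cstdint>

// Same hypothetical layout as the quantization sketch above: one scale per
// 128-element rotation group, codes 0..6 standing for -3..+3.
struct block_q3_128 {
    float   scale;
    uint8_t q[128];
};

// Attention-score style dot product computed directly on the quantized codes.
// q_rot is the query row already rotated into the same Hadamard basis.
static float dot_query_block(const float *q_rot, const block_q3_128 &k) {
    float acc = 0.0f, qsum = 0.0f;
    for (int i = 0; i < 128; ++i) {
        acc  += q_rot[i] * (float)k.q[i];  // accumulate against raw integer codes
        qsum += q_rot[i];                  // needed to undo the +3 storage offset
    }
    // k_i ~= scale * (code_i - 3)  =>  dot(q, k) ~= scale * (acc - 3 * qsum)
    return k.scale * (acc - 3.0f * qsum);
}
```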
Trade-offs: initial CUDA bugs have been fixed; performance gains at long context outweigh the roughly neutral behavior at short context. Perplexity holds up on WikiText-2 / Llama evaluations.
Attribution notes: signalnine (CUDA port), TheTom (turbo4/asymmetry work), spiritbuun (turbo4 norms, CUDA optimizations).
"Block size 128 is a storage block size change (1 norm per 128-element rotation group instead of 4 identical copies)." — TheTom.
Key Takeaways
- Implement TurboQuant in llama.cpp via TheTom's fork for 5x+ KV compression on CUDA/ROCm.
- Benchmark your hardware: expect ~72% KV memory savings with the q4_0 baseline but a ~37% decode slowdown at 110K+ context, which TurboQuant's direct compute is meant to recover.
- Use block_size=128 for optimal ratios; validate on Blackwell (sm_121) for newest GPUs.
- Prioritize direct quantized compute to eliminate the dequantization bottleneck in the generation phase.
- Test perplexity on your models: <1% loss is typical for Llama-family models at 128K.
- Cross-reference NVIDIA KTVC for vendor baselines.
- Build from the corrected forks; avoid RSS for GPU memory measurement and use nvidia-smi plus the internal KV cache reporting instead.
- Explore asymmetry (separate K and V quantization) and InnerQ for further gains.