TurboQuant+: 6.4x KV Cache Compression at q8_0 Speed
Implements TurboQuant in llama.cpp for 3.8-6.4x KV cache compression (turbo2/3/4 formats) with PPL near q8_0, matching prefill speed, and 0.9x decode on Apple Silicon, CUDA, AMD—plus Sparse V for +22.8% decode.
TurboQuant Formats Deliver Extreme Compression with Minimal Quality Loss
TurboQuant+ ports Google's TurboQuant (ICLR 2026) to llama.cpp, compressing KV cache via PolarQuant (multi-centroid scalar quantization) + Walsh-Hadamard Transform (WHT) rotation, dropping the paper's 1-bit QJL error correction which amplified softmax variance. Formats: turbo2 (2.5 bits/val, 6.4x vs fp16), turbo3 (3.5 bits/val at block=32, 4.6x; 3.125 bits/val at block=128, 5.12x), turbo4 (4.25 bits/val, 3.8x). On M5 Max (Qwen3.5-27B/35B-A3B), turbo4 PPL 6.125 (+0.23% vs q8_0 baseline 6.111 on wikitext-2 512 chunks); turbo3 6.176 (+1.06%). turbo4 outperforms q4_0 (6.142, +0.52%) in quality at similar compression.
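For intuition, here is a minimal numpy sketch of the rotate-then-quantize round trip: an orthonormal fast Walsh-Hadamard transform followed by per-block scalar quantization. The absmax 4-bit codebook is a stand-in for the fork's PolarQuant centroids and packed storage, which are not reproduced here.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    Length must be a power of two; with the 1/sqrt(n) scaling the transform
    is its own inverse, so fwht(fwht(x)) recovers x.
    """
    y = x.astype(np.float32).copy()
    n = y.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(n)

def quantize_block(v: np.ndarray, bits: int = 4):
    """Absmax scalar quantization of one block (stand-in for the PolarQuant codebook)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(v).max()), 1e-12) / qmax
    q = np.clip(np.round(v / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip one 128-dim vector: rotate -> quantize -> dequantize -> unrotate.
rng = np.random.default_rng(0)
v = (rng.standard_normal(128) * rng.standard_normal(128)).astype(np.float32)  # heavy-tailed
rot = fwht(v)
q, s = quantize_block(rot, bits=4)
recon = fwht(dequantize_block(q, s))  # WHT is its own inverse
cos = float(np.dot(v, recon) / (np.linalg.norm(v) * np.linalg.norm(recon)))
print(f"cosine similarity after 4-bit round trip: {cos:.4f}")
```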
Block size optimization (study: docs/papers/block-size-experiment.md) boosts turbo3 to 5.12x at block=128 with identical PPL across 512-32K contexts and 3 architectures (Qwen2.5-1.5B, Llama3.1-8B, Qwen3.5-27B), validated on M2 Pro/M5 Max Metal. Larger blocks reduce per-block scale overhead but risk cache thrashing on older hardware; the default block=32 balances the two.
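The headline ratios follow from bits-per-value plus per-block scale overhead. A small worked calculation, assuming one fp16 scale per block (the fork's actual per-block metadata layout may differ):

```python
# Compression vs fp16 = 16 / (code bits + scale overhead amortized per value).
# Assumes one fp16 scale per block; the fork's real block metadata may differ.
def bits_per_val(code_bits: int, block_size: int, scale_bits: int = 16) -> float:
    return code_bits + scale_bits / block_size

for name, code_bits, block in [("turbo2", 2, 32), ("turbo3", 3, 32), ("turbo3", 3, 128)]:
    b = bits_per_val(code_bits, block)
    print(f"{name:7s} block={block:3d}: {b:.3f} bits/val -> {16 / b:.2f}x vs fp16")
# turbo2  block= 32: 2.500 bits/val -> 6.40x
# turbo3  block= 32: 3.500 bits/val -> 4.57x  (~4.6x)
# turbo3  block=128: 3.125 bits/val -> 5.12x
# turbo4's stated 4.25 bits/val gives 16 / 4.25 ~= 3.76x (~3.8x).
```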
"Compresses transformer KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation. Near q8_0 prefill speed and ~0.9x decode throughput at long context (Apple Silicon)."
Asymmetric K/V caching preserves quality on Q4_K_M weights: keep K at q8_0 (K drives attention routing) and compress only V (turbo3/4). E.g., Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V gives PPL 6.64 (+1.0% vs q8_0), while symmetric turbo3 is catastrophic (PPL 3556). Bigger models tolerate symmetric compression better (104B Command-R+: turbo3 +3.6%). Config guide: docs/turboquant-recommendations.md.
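A toy sketch that encodes the recommendation as a selection rule; the quant-name check and the 70B cutoff are illustrative assumptions, not flags or thresholds from the fork:

```python
def pick_kv_cache_types(weight_quant: str, n_params_b: float) -> tuple[str, str]:
    """Illustrative heuristic only: asymmetric K/V for low-bit weights on smaller models.

    Low-bit weight quants (e.g. Q4_K_M) already perturb attention routing, so K stays
    at q8_0 and only V is compressed; q8_0-weight or very large models tolerate
    symmetric turbo3. The 70B cutoff is an assumption for illustration.
    """
    low_bit_weights = weight_quant.upper().startswith(("Q2", "Q3", "Q4"))
    if low_bit_weights and n_params_b < 70:
        return "q8_0", "turbo4"     # asymmetric: protect K, compress V
    return "turbo3", "turbo3"       # symmetric

print(pick_kv_cache_types("Q4_K_M", 7))   # ('q8_0', 'turbo4')
print(pick_kv_cache_types("Q8_0", 104))   # ('turbo3', 'turbo3')
```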
Layer-Aware and Sparse Optimizations Maximize Speed and Quality
Boundary V (layer-aware): protects the first and last 2 layers at q8_0-V, with turbo2-V elsewhere. Recovers 37-91% of the quality gap to turbo3 (e.g., Qwen3.5-35B MoE: turbo2 5.257 → Boundary 5.148, vs turbo3 5.137), and the recovery scales with depth (91% on the 64-layer MoE). Enabled via TURBO_LAYER_ADAPTIVE=7; no speed hit.
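A sketch of the per-layer selection rule; the fork wires this through TURBO_LAYER_ADAPTIVE, and only the boundary logic is shown here:

```python
def v_cache_type_for_layer(layer: int, n_layers: int, protect: int = 2) -> str:
    """Boundary V rule: q8_0-V in the first/last `protect` layers, turbo2-V elsewhere."""
    if layer < protect or layer >= n_layers - protect:
        return "q8_0"
    return "turbo2"

layout = [v_cache_type_for_layer(l, 64) for l in range(64)]
print(layout[:3], "...", layout[-3:])   # ['q8_0', 'q8_0', 'turbo2'] ... ['turbo2', 'q8_0', 'q8_0']
print(layout.count("q8_0"), "of 64 layers stay at q8_0")   # 4 of 64
```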
Sparse V dequant: skips V dequantization for positions whose softmax weight is <1e-6 (most positions at long context). +22.8% decode at 32K (turbo3: 0.76x → 0.93x of q8_0), no PPL change (wikitext-103, 50 chunks, CI ±0.021). It is a general optimization: +5% even on q8_0 KV. Validated from 1.5B to 104B; dense models gain less (1-2%, since FFN time dominates).
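A numpy sketch of the idea: dequantize and accumulate only the V rows whose attention weight clears the threshold. The toy per-row scale layout stands in for the real block formats:

```python
import numpy as np

def attn_output_sparse(probs: np.ndarray, v_quant: np.ndarray, v_scales: np.ndarray,
                       threshold: float = 1e-6) -> np.ndarray:
    """Sparse V sketch: only dequantize V rows whose softmax weight exceeds `threshold`.

    probs: (n_ctx,) softmax attention weights for one query/head.
    v_quant: (n_ctx, d) int8 codes; v_scales: (n_ctx,) per-row scales (toy layout,
    not the fork's real block format). Rows below the threshold are never dequantized.
    """
    keep = np.nonzero(probs > threshold)[0]
    v_kept = v_quant[keep].astype(np.float32) * v_scales[keep, None]  # dequant survivors only
    return probs[keep] @ v_kept

rng = np.random.default_rng(0)
n_ctx, d = 32_000, 128
logits = rng.standard_normal(n_ctx) * 4.0            # peaky attention, typical at long context
probs = np.exp(logits - logits.max()); probs /= probs.sum()
v_q = rng.integers(-8, 8, size=(n_ctx, d), dtype=np.int8)
v_s = rng.uniform(0.01, 0.1, size=n_ctx).astype(np.float32)

dense = probs @ (v_q.astype(np.float32) * v_s[:, None])
sparse = attn_output_sparse(probs, v_q, v_s)
print(f"kept {np.count_nonzero(probs > 1e-6)} / {n_ctx} positions, "
      f"max abs diff vs dense: {np.abs(dense - sparse).max():.2e}")
```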
"Sparse V: Attention-gated KV cache decoding that skips low-weight V positions during inference. Up to +22.8% decode speed at 32K context... no measurable PPL change."
Prefill scales 2K-32K: turbo3/4 ≥ q8_0 (e.g., 32K: turbo3 1204 vs 1098 t/s). Decode (M5 Max Qwen3.5-35B-A3B Sparse V): turbo4 1060 t/s long ctx (0.90x q8_0); real 24K PDF: turbo4 63.7 t/s (0.93x). M1 Max 38K doc: turbo4 +33.9% decode vs q8_0.
Optimization path (4K prefill): fp32 WHT (739 t/s, 0.27x q8_0) → fp16 WHT + vectorized butterfly + graph-level rotation + block-32 + dequant optimization → 2524 t/s (0.98x). KL divergence vs f16: turbo4 0.009633, slightly above q4_0's 0.008091, yet turbo4 shows better top-p agreement (95.98%).
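For reference, one way to compute such fidelity statistics from reference (f16) and quantized-cache logits, in the spirit of llama-perplexity's --kl-divergence report; whether the fork's "top-p agreement" figure is exactly the same-top-token rate is an assumption here:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_and_same_top(ref_logits: np.ndarray, test_logits: np.ndarray):
    """Mean token-level KL(ref || test) and fraction of positions whose top-1 token matches."""
    p, q = softmax(ref_logits), softmax(test_logits)
    kl = float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())
    same_top = float(np.mean(ref_logits.argmax(-1) == test_logits.argmax(-1)))
    return kl, same_top

# Toy stand-in: reference logits vs slightly perturbed logits (as a quantized cache would give).
rng = np.random.default_rng(0)
ref = rng.standard_normal((256, 32000)).astype(np.float32)
test = ref + 0.05 * rng.standard_normal(ref.shape).astype(np.float32)
print(kl_and_same_top(ref, test))
```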
Cross-Hardware Benchmarks Confirm Production Readiness
Apple Silicon (M5 Max 128GB): 104B @ 128K with turbo3 fits in 74 GB peak memory at +3.6% PPL vs q8_0; raise iogpu.wired_limit_mb to 117964. M1 Max: turbo4 beats q8_0 decode at long context. CUDA (RTX3090, Qwen3.5-9B Q4_K_M): turbo3/4 decode 95-98 t/s (0.93-0.96x q8_0). AMD RX9070 XT (RDNA4, HIP): q8_0-K + turbo4-V, +1.0% PPL, +2.5% decode.
| Hardware | Model | KV config | Decode t/s (ctx) | vs q8_0 | Notes |
|---|---|---|---|---|---|
| M5 Max | Qwen3.5-35B-A3B | turbo4 + Sparse V | 1060 (32K) | 0.90x | MoE |
| RTX3090 | Qwen3.5-9B Q4_K_M | turbo4/turbo4 | 95.87 | 0.93x | CUDA |
| M1 Max 64GB | Qwen3.5-35B-A3B | turbo4 | 16.6 (38K) | +33.9% | Real doc |
| RX9070 XT | Qwen2.5-7B Q4_K_M | q8_0-K/turbo4-V | 86.8 | +2.5% | HIP |
"104B at 128K context on a MacBook with turbo3 (PPL 4.024, 74 GB peak memory)."
Retrieval and Perplexity Validate Fidelity
NIAH (Kamradt/RULER): turbo4 31/33 vs q8_0 30/33; turbo3 + Sparse V 9/9. Multi-key retrieval 100% up to 32K. Long-context PPL (32K, wikitext-103, 50 chunks): turbo3 +1.64% vs q8_0, Sparse V delta = 0. PPL stays stable at scale: Llama3.1-70B turbo4 +6.3%, Command-R+ 104B +1.9%.
"turbo4 beats q8_0 on retrieval (31/33 vs 30/33). Shared failure at 8K/100% is a model weakness, not quantization."
Python prototype confirms: turbo4 cosine sim 0.96, MSE 0.0007. Gaussianization exact (kurtosis 900→2.9).
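A quick check of the gaussianization claim: an orthonormal Hadamard rotation spreads a handful of outliers across every coordinate, pulling kurtosis back toward the Gaussian value of ~3. The synthetic outlier vector below is an assumption; the 900 → 2.9 figure comes from the fork's prototype data.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024
v = rng.standard_normal(d)
v[rng.choice(d, 4, replace=False)] *= 100.0        # a few huge outliers -> heavy tails

def kurtosis(x: np.ndarray) -> float:
    x = x - x.mean()
    return float(np.mean(x ** 4) / np.mean(x ** 2) ** 2)   # Gaussian is ~3.0

# Sylvester construction of the Walsh-Hadamard matrix, scaled to be orthonormal.
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(d)

print(f"kurtosis before WHT: {kurtosis(v):.1f}, after: {kurtosis(H @ v):.2f}")
```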
Key Takeaways
- Use turbo4 for the best quality/compression balance (3.8x, +0.23% PPL); turbo3 for maximum compression (5.12x at block=128, +1% PPL).
- Asymmetric q8_0-K + turbo3/4-V on Q4_K_M weights; symmetric on Q8_0+ or large models.
- Enable Sparse V always (+22% long decode, no PPL hit); Boundary V on deep models.
- Prefill ≥ q8_0 speed; validate decode on your hardware (M5+ best for turbo3).
- Build llama.cpp from fork; test PPL/NIAH on your model before deploy.
- For maximum context on Apple Silicon: set sysctl iogpu.wired_limit_mb to ~90% of physical RAM (in MB).
- Upstream path: submit the stable pieces as llama.cpp patches.
- MLX Swift fork for 2.5x faster Apple decode (144 t/s Qwen3.5-35B-A3B).