TurboQuant+: 6.4x KV Cache Compression at q8_0 Speed

Implements TurboQuant in llama.cpp for 3.8-6.4x KV cache compression (turbo2/3/4 formats) with perplexity near q8_0, matching prefill speed, and ~0.9x decode throughput on Apple Silicon, CUDA, and AMD, plus Sparse V for +22.8% decode.

TurboQuant Formats Deliver Extreme Compression with Minimal Quality Loss

TurboQuant+ ports Google's TurboQuant (ICLR 2026) to llama.cpp, compressing the KV cache with PolarQuant (multi-centroid scalar quantization) plus a Walsh-Hadamard Transform (WHT) rotation; it drops the paper's 1-bit QJL error correction, which amplified softmax variance. Formats: turbo2 (2.5 bits/val, 6.4x vs fp16), turbo3 (3.5 bits/val at block=32, 4.6x; 3.125 bits/val at block=128, 5.12x), and turbo4 (4.25 bits/val, 3.8x). On an M5 Max (Qwen3.5-27B/35B-A3B), turbo4 reaches PPL 6.125 (+0.23% vs the q8_0 baseline of 6.111 on wikitext-2, 512 chunks) and turbo3 6.176 (+1.06%). turbo4 outperforms q4_0 (6.142, +0.52%) on quality at similar compression.
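The WHT rotation at the heart of these formats is a standard fast transform; the sketch below is an illustrative pure-Python version of the unnormalized butterfly algorithm, not the fork's vectorized fp16 Metal/CUDA kernel.

```python
def fwht(vec):
    """In-place fast Walsh-Hadamard transform via O(n log n) butterflies."""
    n = len(vec)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = vec[j], vec[j + h]
                vec[j], vec[j + h] = a + b, a - b  # butterfly pair
        h *= 2
    return vec

# The unnormalized transform is its own inverse up to a factor of n,
# so applying it twice recovers the input scaled by n:
x = [1.0, -2.0, 3.0, 0.5]
assert fwht(fwht(list(x))) == [4.0 * v for v in x]
```

Because the transform is orthogonal (up to scaling), quantization error introduced after the rotation stays bounded when the inverse rotation is applied at dequantization time.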

Block-size optimization (study: docs/papers/block-size-experiment.md) boosts turbo3 to 5.12x at block=128 with identical PPL across 512-32K contexts and three architectures (Qwen2.5-1.5B, Llama3.1-8B, Qwen3.5-27B), validated on M2 Pro and M5 Max Metal. Larger blocks reduce per-block metadata overhead but risk cache thrashing on older hardware; the default block=32 balances the two.
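The quoted compression ratios follow directly from the bits-per-value figures against a 16-bit fp16 baseline:

```python
# Back-of-envelope check of the ratios quoted above (fp16 = 16 bits/value).
def compression_ratio(bits_per_val, baseline_bits=16.0):
    return baseline_bits / bits_per_val

print(round(compression_ratio(2.5), 2))    # turbo2            -> 6.4
print(round(compression_ratio(3.5), 2))    # turbo3, block=32  -> 4.57 (~4.6)
print(round(compression_ratio(3.125), 2))  # turbo3, block=128 -> 5.12
print(round(compression_ratio(4.25), 2))   # turbo4            -> 3.76 (~3.8)
```

The block=128 gain for turbo3 (3.5 → 3.125 bits/val) comes entirely from amortizing per-block metadata over four times as many values.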

"Compresses transformer KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation. Near q8_0 prefill speed and ~0.9x decode throughput at long context (Apple Silicon)."

Asymmetric K/V caching preserves quality on Q4_K_M weights: keep K at q8_0 (it drives attention routing) and compress V with turbo3/4. For example, on Qwen2.5-7B Q4_K_M, q8_0-K + turbo4-V gives PPL 6.64 (+1.0% vs q8_0), while symmetric turbo3 is catastrophic (PPL 3556). Larger models tolerate symmetric compression better (104B Command-R+: turbo3 +3.6%). Configuration guide: docs/turboquant-recommendations.md.
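Assuming the fork exposes its formats through llama.cpp's standard cache-type flags (`-ctk`/`-ctv` exist upstream; the `turbo4` type name would come from this fork), the asymmetric setup might look like:

```shell
# Hedged sketch: -ctk/-ctv are upstream llama.cpp flags; the turbo4
# cache type is assumed to be provided by this fork, not upstream.
# K stays at q8_0 (attention routing), V is compressed to turbo4.
./llama-cli -m qwen2.5-7b-q4_k_m.gguf \
    -ctk q8_0 \
    -ctv turbo4 \
    -c 32768 \
    -fa \
    -p "Summarize the attached report."
```

Note that upstream llama.cpp requires flash attention (`-fa`) for quantized V cache types, which presumably carries over to the fork.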

Layer-Aware and Sparse Optimizations Maximize Speed and Quality

Boundary V (layer-aware): protects the first and last 2 layers at q8_0-V, with turbo2-V elsewhere. Recovers 37-91% of the quality gap to turbo3 (e.g., Qwen3.5-35B MoE: turbo2 5.257 → Boundary 5.148, vs turbo3 5.137), scaling with depth (91% on a 64-layer MoE). Enabled via TURBO_LAYER_ADAPTIVE=7; no speed penalty.
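The layer-selection policy described above is simple enough to sketch; the function and type names here are illustrative, not the fork's API.

```python
# Boundary V policy sketch: protect the first and last `protect` layers
# at a safe V-cache type, compress everything in between.
def boundary_v_types(n_layers, protect=2, safe="q8_0", compressed="turbo2"):
    return [
        safe if (layer < protect or layer >= n_layers - protect) else compressed
        for layer in range(n_layers)
    ]

print(boundary_v_types(8))
# -> ['q8_0', 'q8_0', 'turbo2', 'turbo2', 'turbo2', 'turbo2', 'q8_0', 'q8_0']
```

On a 64-layer model the same policy protects only 4 of 64 layers, which is why the memory cost of Boundary V shrinks (and its measured benefit grows) with depth.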

Sparse V dequant: skips V dequantization for softmax weights below 1e-6 (most weights, at long context). +22.8% decode at 32K (turbo3: 0.76x → 0.93x of q8_0), with no PPL change (wikitext-103, 50 chunks, CI ±0.021). It is a general optimization: +5% even on q8_0 KV. Validated from 1.5B to 104B; dense models gain less (1-2%, as FFN time dominates).
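The core idea can be shown with a toy weighted sum (pure Python, not the fork's kernel; the 1e-6 threshold matches the one quoted above):

```python
# Toy Sparse V sketch: positions whose post-softmax attention weight is
# below the threshold contribute negligibly to the output, so their V
# rows are never dequantized or accumulated.
def attend_sparse(weights, v_rows, threshold=1e-6):
    """Weighted sum over V rows, skipping near-zero attention weights."""
    dim = len(v_rows[0])
    out = [0.0] * dim
    skipped = 0
    for w, row in zip(weights, v_rows):
        if w < threshold:          # no dequant, no multiply-add
            skipped += 1
            continue
        for d in range(dim):
            out[d] += w * row[d]
    return out, skipped

weights = [0.7, 0.3, 1e-9, 1e-12]   # softmax output; tail is negligible
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
out, skipped = attend_sparse(weights, v)
# skipped == 2; out ~= [1.6, 2.6], with error bounded by threshold * |V|
```

Since skipped weights are below 1e-6 by construction, the output perturbation is bounded and too small to register in perplexity, consistent with the delta=0 result above.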

"Sparse V: Attention-gated KV cache decoding that skips low-weight V positions during inference. Up to +22.8% decode speed at 32K context... no measurable PPL change."

Prefill scales across 2K-32K contexts: turbo3/4 match or beat q8_0 (e.g., at 32K: turbo3 1204 vs 1098 t/s). Decode (M5 Max, Qwen3.5-35B-A3B, Sparse V): turbo4 1060 t/s at long context (0.90x q8_0); on a real 24K-token PDF, turbo4 reaches 63.7 t/s (0.93x). On an M1 Max with a 38K document, turbo4 decodes +33.9% faster than q8_0.

Optimization path (4K prefill): fp32 WHT (739 t/s, 0.27x q8_0) → fp16 + vectorized butterfly + graph rotation + block-32 + dequant optimizations → 2524 t/s (0.98x). KL divergence vs f16: turbo4 0.009633 vs q4_0 0.008091; despite the slightly higher KL, turbo4 shows better top-p agreement (95.98%).

Cross-Hardware Benchmarks Confirm Production Readiness

Apple Silicon (M5 Max, 128GB): 104B at 128K context with turbo3 (PPL 6.415, +3.6%; 74GB peak); raise iogpu.wired_limit_mb to 117964. M1 Max: turbo4 beats q8_0 at long context. CUDA (RTX 3090, Qwen3.5-9B Q4_K_M): turbo3/4 decode at 95-98 t/s (0.93-0.96x q8_0). AMD RX 9070 XT (RDNA4, HIP): q8_0-K + turbo4-V gives +1.0% PPL and +2.5% decode.
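On macOS, the Metal wired-memory ceiling is raised with sysctl (the setting resets at reboot; 117964 MB is roughly 90% of 128 GB, matching the M5 Max run above):

```shell
# Raise the GPU wired-memory limit so a 74 GB KV-cache-plus-weights
# footprint fits; pick ~90% of physical RAM for your machine.
sudo sysctl iogpu.wired_limit_mb=117964
```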

| Hardware     | Model             | Config            | Decode t/s | vs q8_0 | Notes    |
|--------------|-------------------|-------------------|------------|---------|----------|
| M5 Max       | Qwen3.5-35B-A3B   | turbo4 + Sparse V | 1060 (32K) | 0.90x   | MoE      |
| RTX 3090     | Qwen3.5-9B Q4_K_M | turbo4/turbo4     | 95.87      | 0.93x   | CUDA     |
| M1 Max 64GB  | Qwen3.5-35B-A3B   | turbo4            | 16.6 (38K) | +33.9%  | Real doc |
| RX 9070 XT   | Qwen2.5-7B Q4_K_M | q8_0-K / turbo4-V | 86.8       | +2.5%   | HIP      |

"104B at 128K context on a MacBook with turbo3 (PPL 4.024, 74 GB peak memory)."

Retrieval and Perplexity Validate Fidelity

NIAH (Kamradt/RULER): turbo4 31/33 (+3% vs q8_0's 30/33); turbo3 + Sparse V 9/9. Multi-key retrieval 100% up to 32K. Long-context PPL (32K, wikitext-103, 50 chunks): turbo3 +1.64% vs q8_0, with Sparse V delta = 0. PPL holds up at scale: Llama3.1-70B turbo4 +6.3%, Command-R+ 104B +1.9%.

"turbo4 beats q8_0 on retrieval (31/33 vs 30/33). Shared failure at 8K/100% is a model weakness, not quantization."

Python prototype confirms: turbo4 cosine sim 0.96, MSE 0.0007. Gaussianization exact (kurtosis 900→2.9).
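The prototype's fidelity metrics can be reproduced in spirit with a simple absmax scalar quantizer standing in for turbo4 (an illustrative stand-in; the real format adds centroids and the WHT rotation):

```python
import math

# Round-trip a vector through a crude 4-bit absmax scalar quantizer,
# then score reconstruction with the same metrics the prototype reports.
def quant_roundtrip_4bit(x):
    scale = max(abs(v) for v in x) / 7.0    # 4-bit signed range [-7, 7]
    return [round(v / scale) * scale for v in x]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

x = [0.9, -0.4, 0.1, 0.7, -0.8, 0.05, 0.3, -0.6]
xq = quant_roundtrip_4bit(x)
print(round(cosine(x, xq), 3), round(mse(x, xq), 5))
```

Even this crude quantizer keeps cosine similarity near 1.0 on well-scaled inputs; turbo4's multi-centroid codebook plus rotation is what pushes fidelity to the 0.96 cosine / 0.0007 MSE reported on real KV tensors.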

Key Takeaways

  • Use turbo4 for the best quality/compression balance (3.8x, +0.23% PPL); turbo3 for maximum compression (5.12x at block=128, +1% PPL).
  • Use asymmetric q8_0-K + turbo3/4-V on Q4_K_M weights; symmetric compression is safe on Q8_0+ weights or large models.
  • Always enable Sparse V (+22% long-context decode, no PPL cost); add Boundary V on deep models.
  • Prefill ≥ q8_0 speed; validate decode on your hardware (M5+ best for turbo3).
  • Build llama.cpp from fork; test PPL/NIAH on your model before deploy.
  • For maximum context on Apple Silicon: set iogpu.wired_limit_mb to ~90% of RAM via sysctl.
  • Upstream path: Stable pieces as llama.cpp patches.
  • MLX Swift fork for 2.5x faster Apple decode (144 t/s Qwen3.5-35B-A3B).

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge