Fixing KV Selection Instability from RoPE Rotation
Standard KV cache compression ranks keys by attention scores from a handful of recent post-RoPE queries, but RoPE rotates queries with position, so those few samples do not represent future queries. The result is poor top-key selection and unstable long reasoning. TriAttention sidesteps this by working in the pre-RoPE space, where query (Q) and key (K) vectors concentrate tightly around fixed, non-zero centers that remain stable across positions, a property termed Q/K concentration.
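A minimal sketch of how one might check this concentration empirically, using synthetic pre-RoPE keys (the shapes, noise scale, and data here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Minimal sketch (not the paper's code): check Q/K concentration by
# measuring how tightly synthetic pre-RoPE keys cluster around their
# mean direction. Shapes and noise scale are illustrative assumptions.
rng = np.random.default_rng(0)
head_dim, seq_len = 64, 4096

center = rng.normal(size=head_dim)                           # fixed, non-zero center
keys = center + 0.1 * rng.normal(size=(seq_len, head_dim))   # pre-RoPE keys

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

cos_to_center = unit(keys) @ unit(center)                    # cosine similarity per key
print(f"mean cosine to center: {cos_to_center.mean():.3f}")  # near 1.0 => concentrated
```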
This concentration induces position-specific attention biases: queries favor keys at particular relative distances (such as nearest neighbors), with the preferences determined by the center angles through a trigonometric series expansion. The Q/K vector norms supply an additional importance signal.
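To make the mechanism concrete, here is the standard RoPE identity written for the centers (a sketch; the notation below for the centers, their polar sub-pairs, and the frequencies is assumed here, not taken from the paper):

```latex
% Assumed notation: \mu_q, \mu_k are the pre-RoPE Q/K centers; the j-th
% 2D sub-pair of each has polar form (a_j, \alpha_j) and (b_j, \beta_j);
% \theta_j are the RoPE frequencies and d = m - n the relative distance.
\langle R_m \mu_q,\; R_n \mu_k \rangle
  = \sum_j a_j b_j \,\cos\!\bigl(d\,\theta_j + \alpha_j - \beta_j\bigr)
```

The right-hand side depends only on the relative distance d: its peaks are the preferred distances, the phases come from the center angles, and the amplitudes a_j b_j carry the norm signal.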
TriAttention's Position-Aware Scoring
Key importance is computed directly from the trigonometric series induced by the Q/K centers, scoring each key by its relative position and avoiding the rotation problem entirely. No query sampling is needed: distance preferences are derived analytically from the stable pre-RoPE geometry.
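A hypothetical sketch of this analytic distance score: given pre-RoPE Q/K centers, evaluate the trigonometric series over relative distances. The names (mu_q, mu_k), the adjacent-pair RoPE convention, and the default base are assumptions, not the repo's API:

```python
import numpy as np

# Hypothetical sketch: evaluate the trigonometric series from the
# pre-RoPE Q/K centers over relative distances 0..max_dist-1.
# Adjacent-pair RoPE convention and base 10000 are assumed here.
def distance_scores(mu_q, mu_k, max_dist, rope_base=10000.0):
    dim = mu_q.shape[-1]
    theta = rope_base ** (-np.arange(0, dim, 2) / dim)       # RoPE frequencies
    # Polar form of each 2D sub-pair of the centers.
    q2, k2 = mu_q.reshape(-1, 2), mu_k.reshape(-1, 2)
    a, alpha = np.linalg.norm(q2, axis=1), np.arctan2(q2[:, 1], q2[:, 0])
    b, beta = np.linalg.norm(k2, axis=1), np.arctan2(k2[:, 1], k2[:, 0])
    d = np.arange(max_dist)[:, None]                          # distances as a column
    return (a * b * np.cos(d * theta + (alpha - beta))).sum(axis=1)
```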
The implementation integrates this scoring into KV eviction, retaining the top keys by a combination of the trigonometric position score and the norm signal. This preserves reasoning fidelity while sharply reducing cache size.
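Continuing the sketch above (an assumed interface, not the released code), an eviction step might look like this; the multiplicative combination of the two signals is one plausible choice, made here purely for illustration:

```python
# Assumed interface, not the released code. Reuses distance_scores from
# the sketch above; requires the cached keys to sit at positions
# 0..n-1 with n-1 <= query_pos.
def evict(keys_pre_rope, values, mu_q, mu_k, keep, query_pos):
    n = keys_pre_rope.shape[0]
    dists = query_pos - np.arange(n)               # relative distance to the query
    pos_score = distance_scores(mu_q, mu_k, query_pos + 1)[dists]
    norm_score = np.linalg.norm(keys_pre_rope, axis=-1)
    keep_idx = np.argsort(pos_score * norm_score)[-keep:]  # top combined score
    return keys_pre_rope[keep_idx], values[keep_idx]
```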
10.7x KV Savings with Full Accuracy
On the AIME25 benchmark with 32K-token generation, TriAttention matches full-attention accuracy while delivering 2.5x higher throughput or a 10.7x reduction in KV memory; leading baselines lose roughly half their accuracy at equivalent efficiency. This makes it possible to deploy the OpenClaw model on a single consumer GPU, avoiding the out-of-memory failures that full attention hits at long context.
Code is available at https://github.com/WeianMao/triattention, demonstrating the approach's practicality for efficient long-reasoning LLMs.