Top-K Bottleneck in Scaling DSA Contexts

DeepSeek Sparse Attention (DSA) relies on a lightweight indexer to score every token in the KV cache, then selects the 2,048 highest-scoring positions via Top-K. This step becomes critical as contexts grow from 8K to 128K tokens, because the full sequence is rescanned on every decode iteration. Even an optimized radix-select kernel (7.4x faster than the PyTorch baseline) requires 3–4 passes over all N scores per step, making Top-K a major GPU bottleneck during autoregressive generation.
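For concreteness, here is a minimal sketch of the baseline selection step, assuming a single 1-D tensor of indexer scores per decode step (the indexer itself and the score layout are simplified; only the k = 2,048 budget comes from DSA):

```python
import torch

K = 2048  # DSA's fixed selection budget

def decode_step_topk(indexer_scores: torch.Tensor) -> torch.Tensor:
    # indexer_scores: (N,) one indexer score per cached position for the
    # current query. torch.topk scans the entire vector; at N = 128K this
    # full scan (3-4 passes for a radix-select kernel) repeats every step.
    k = min(K, indexer_scores.numel())
    return torch.topk(indexer_scores, k).indices

# Every decode step pays the full scan again:
scores = torch.randn(131_072)  # e.g. a 128K-token context
selected = decode_step_topk(scores)
```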

Autoregressive Quirks Unlock Reuse

Token-by-token generation creates structural predictability: each decode step shifts the query by one position, appends one key to the cache, and changes attention scores only gradually. Consecutive steps therefore query highly overlapping KV-cache neighborhoods, making Top-K indices temporally stable. NVIDIA's insight is to treat this stability as exploitable structure rather than an incidental artifact: profiling shows that brute-force scans waste cycles recomputing nearly identical selections across consecutive steps.
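This stability is easy to check empirically by measuring how many Top-K indices survive from one step to the next; a quick sketch, using synthetic drifted scores purely as an illustration (a real measurement would capture indexer scores from adjacent decode steps):

```python
import torch

def topk_overlap(scores_prev: torch.Tensor, scores_curr: torch.Tensor, k: int = 2048) -> float:
    # Fraction of the previous step's Top-K indices that survive one step later.
    prev = set(torch.topk(scores_prev, k).indices.tolist())
    curr = set(torch.topk(scores_curr, k).indices.tolist())
    return len(prev & curr) / k

# Synthetic drift: one decode step perturbs scores slightly and appends one new key.
n = 131_072
prev_scores = torch.randn(n)
curr_scores = torch.cat([prev_scores + 0.01 * torch.randn(n), torch.randn(1)])
print(f"Top-K overlap across one step: {topk_overlap(prev_scores, curr_scores):.1%}")
```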

Guess-Verify-Refine Cuts Compute in Half

Their April 30, 2026 technical report introduces Guess-Verify-Refine Top-K: guess indices from the prior step (leveraging neighborhood similarity), verify them against the current scores with a cheap partial scan, and refine only where discrepancies appear. This halves Top-K time versus the production baseline with no accuracy loss, showing that exploiting workload structure can outpace purely algorithmic speedups. Builders profiling LLM kernels should look for such structural properties first: they are where 2x+ gains in long-context decoding come from.
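The report's exact kernel is not reproduced here, but the three phases can be sketched in a few lines; the min-score verification threshold and candidate-merge step below are assumptions about one plausible implementation, not NVIDIA's published code:

```python
import torch

def guess_verify_refine_topk(scores: torch.Tensor,     # (N,) current-step indexer scores
                             prev_topk: torch.Tensor,  # (k,) long indices chosen last step
                             k: int = 2048) -> torch.Tensor:
    # Guess: reuse the previous step's indices as candidates.
    guess_scores = scores[prev_topk]

    # Verify: the smallest guessed score lower-bounds the true k-th largest
    # score (the guess already holds k elements), so any position outside
    # the guess that beats this bound is a discrepancy.
    threshold = guess_scores.min()
    outside = torch.ones(scores.numel(), dtype=torch.bool)
    outside[prev_topk] = False
    challengers = torch.nonzero(outside & (scores > threshold)).squeeze(-1)

    # Refine: if nothing beats the bound, the guess is already a valid Top-K;
    # otherwise re-rank only the guessed set plus the few challengers.
    if challengers.numel() == 0:
        return prev_topk
    candidates = torch.cat([prev_topk, challengers])
    local = torch.topk(scores[candidates], k).indices
    return candidates[local]
```

Note the verify phase still touches each score once, but as a single predicated comparison rather than 3–4 radix-select passes; the refine Top-K then runs over only k candidates plus the handful of challengers.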