The Architecture of MSA
MiniMax Sparse Attention (MSA) addresses the quadratic complexity bottleneck of standard softmax attention by decoupling the attention process into two distinct branches: an Index Branch and a Main Branch. Instead of performing dense attention across all tokens, MSA operates at a block granularity (defaulting to 128 tokens per block).
- Index Branch: Uses two projection matrices to score key-value blocks. A Top-k operator then selects the most relevant blocks (16 per query group by default), and the query's immediate neighborhood is always included to maintain local context.
- Main Branch: Executes exact softmax attention only on the selected blocks. This caps the per-query budget at 2,048 tokens (16 blocks × 128 tokens), so the Main Branch's per-query cost stays fixed as the context grows instead of rising linearly with context length.
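The two branches can be sketched for a single query in plain Python. This is only an illustration under stated assumptions, not the MSA implementation: the block scores are hypothetical stand-ins for the Index Branch output, the function names are invented for this example, candidate blocks are assumed to be causal (at or before the query's block), and the query is assumed to be the last token of its block so no intra-block mask is needed. Block size and Top-k are shrunk from the defaults of 128 and 16 to keep the example readable.

```python
import math
import random

BLOCK_SIZE = 4  # the text's default is 128; shrunk for readability
TOP_K = 2       # the text's default is 16


def select_blocks(block_scores, local_block, k):
    """Pick k blocks: the query's own block plus the k-1 best-scoring earlier ones."""
    # Assumption: only blocks before the query's block are candidates (causal).
    candidates = list(range(local_block))
    ranked = sorted(candidates, key=lambda b: block_scores[b], reverse=True)
    return sorted([local_block] + ranked[: k - 1])


def sparse_attention(query, keys, values, block_ids):
    """Exact softmax attention restricted to tokens inside the selected blocks."""
    token_ids = [
        t
        for b in block_ids
        for t in range(b * BLOCK_SIZE, min((b + 1) * BLOCK_SIZE, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[t])) for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return out, token_ids


random.seed(0)
n_tokens, dim = 16, 8  # 16 tokens -> 4 blocks of 4
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
query = [random.gauss(0, 1) for _ in range(dim)]

block_scores = [0.1, 0.9, -0.3, 0.2]  # hypothetical Index Branch scores
local_block = 3                       # block that contains the query

blocks = select_blocks(block_scores, local_block, TOP_K)
out, attended = sparse_attention(query, keys, values, blocks)
print(blocks, len(attended))  # [1, 3] 8
```

However many tokens precede the query, the attention step here touches at most TOP_K * BLOCK_SIZE of them, which is the same arithmetic as the 16 * 128 = 2,048 budget described above.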
Training and Optimization
Because Top-k selection is non-differentiable, MSA employs a KL alignment loss to train the indexer, matching the Index Branch's distribution to the Main Branch's attention pattern. To stabilize training, the researchers implemented three key techniques:
- Gradient Detach: Prevents the KL loss from causing gradient spikes in the backbone.
- Indexer Warmup: Trains the indexer using full attention for initial iterations before switching to sparse routing.
- Forced Local Block: Guarantees the inclusion of the immediate context, preventing the model from losing local coherence.
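A rough sketch of the alignment objective and the warmup switch follows. Several details are assumptions made for illustration rather than facts from the source: the target is built by summing the Main Branch's per-token attention probabilities within each block, the KL direction is target-to-indexer, and all function names and numbers are hypothetical. Plain Python has no autodiff, so the gradient detach appears only as a comment.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def block_attention_mass(token_probs, block_size):
    """Sum per-token attention probabilities into one mass per block."""
    return [
        sum(token_probs[i : i + block_size])
        for i in range(0, len(token_probs), block_size)
    ]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


def use_sparse_routing(step, warmup_steps):
    """Indexer warmup: full attention before warmup_steps, sparse routing after."""
    return step >= warmup_steps


# Hypothetical Main Branch attention logits for 8 tokens (4 blocks of 2).
token_logits = [0.2, 1.5, -0.7, 0.1, 2.0, 0.3, -1.2, 0.4]
# In an autodiff framework the target would be detached (treated as a constant)
# so that this loss updates only the indexer and not the backbone.
target = block_attention_mass(softmax(token_logits), block_size=2)

indexer_scores = [0.5, -0.2, 1.1, -0.8]  # hypothetical Index Branch block scores
predicted = softmax(indexer_scores)

loss = kl_divergence(target, predicted)
print(round(sum(target), 6), loss > 0)
print(use_sparse_routing(step=10, warmup_steps=100))
```

The loss falls to zero when the indexer's block distribution matches the attention mass the Main Branch actually places on each block, which is the sense in which the indexer is aligned to the attention pattern.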
Kernel Co-Design for Hardware Efficiency
Theoretical sparsity yields little real speedup without hardware-level optimization. MSA includes custom kernels (fmha_sm100) designed for NVIDIA SM100 GPUs.
- Exp-free Top-k Selection: By ranking raw scores rather than applying softmax first, the kernel avoids unnecessary compute, achieving a 5.1x speedup over standard torch.topk at 128K context.
- KV-Outer Sparse Attention: The kernel optimizes arithmetic intensity by packing query positions into 128x128 score MMAs, splitting attention and combination steps across thread blocks (CTAs).
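The reasoning behind exp-free selection can be checked in a few lines: softmax is strictly increasing in each score, so ranking raw scores yields the same Top-k set as ranking softmax probabilities, and the exponentials can be skipped. The sketch below only illustrates that property in plain Python with made-up scores and hypothetical helper names. It says nothing about how the fmha_sm100 kernel is implemented or where its measured speedup comes from.

```python
import math


def top_k_indices(values, k):
    """Indices of the k largest entries, highest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# 64 distinct, deterministic block scores in the range [-4.0, 3.875].
raw_scores = [((i * 37) % 64) / 8.0 - 4.0 for i in range(64)]

via_raw = top_k_indices(raw_scores, 16)               # no exp, no normalization
via_softmax = top_k_indices(softmax(raw_scores), 16)  # softmax first, then rank

same_selection = via_raw == via_softmax
print(same_selection)  # True
```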