The Architecture of MSA

MiniMax Sparse Attention (MSA) addresses the quadratic complexity bottleneck of standard softmax attention by splitting attention into two branches: an Index Branch and a Main Branch. Instead of performing dense attention across all tokens, MSA selects context at block granularity (128 tokens per block by default).

  • Index Branch: Uses two projection matrices to score key-value blocks. A Top-k operator selects the highest-scoring blocks (16 per query group by default), and the block covering the query's immediate neighborhood is always included to maintain local context.
  • Main Branch: Executes exact softmax attention only on the selected blocks. This caps the per-query attention budget at 2,048 tokens (16 blocks × 128 tokens), so the Main Branch's per-query cost stays fixed as the context grows instead of increasing linearly with sequence length.
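The two branches can be illustrated with a small NumPy sketch for a single query. It is not the MSA implementation: the function names (select_blocks, sparse_attention) are hypothetical, the causal masking of future blocks is an assumption, and random numbers stand in for the Index Branch scores because the projection details are not given here.

```python
import numpy as np

BLOCK_SIZE = 128  # tokens per block (default stated above)
TOP_K = 16        # blocks kept per query group (default stated above)

def select_blocks(block_scores, local_block, top_k=TOP_K):
    """Return sorted indices of the top_k blocks, always keeping local_block.

    Assumption: attention is causal, so blocks after local_block are masked.
    """
    scores = block_scores.astype(np.float64)   # astype returns a copy
    scores[local_block + 1:] = -np.inf         # hide future blocks
    scores[local_block] = np.inf               # forced local block
    k = min(top_k, local_block + 1)            # cannot pick more than exist
    chosen = np.argpartition(-scores, k - 1)[:k]
    return np.sort(chosen)

def sparse_attention(q, keys, values, chosen, q_pos, block_size=BLOCK_SIZE):
    """Exact softmax attention for one query, restricted to chosen blocks."""
    token_idx = (chosen[:, None] * block_size + np.arange(block_size)).ravel()
    token_idx = token_idx[token_idx <= q_pos]  # causal mask inside blocks
    logits = keys[token_idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[token_idx], token_idx.size

rng = np.random.default_rng(0)
seq_len, dim = 8192, 64
keys = rng.standard_normal((seq_len, dim))
values = rng.standard_normal((seq_len, dim))
q = rng.standard_normal(dim)
q_pos = seq_len - 1

# Placeholder scores: stand-in for the Index Branch output.
block_scores = rng.standard_normal(seq_len // BLOCK_SIZE)
chosen = select_blocks(block_scores, local_block=q_pos // BLOCK_SIZE)
out, attended = sparse_attention(q, keys, values, chosen, q_pos)
print(len(chosen), attended)
```

With these settings the query attends to 16 of the 64 blocks. Doubling seq_len would double the number of candidate blocks to score but leave the attended token count unchanged.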

Training and Optimization

Because Top-k selection is non-differentiable, the language-modeling loss cannot train the indexer through the selection step. MSA instead employs a KL alignment loss that matches the Index Branch's distribution to the Main Branch's attention pattern. To stabilize training, the researchers use three techniques:

  1. Gradient Detach: Prevents the KL loss from causing gradient spikes in the backbone.
  2. Indexer Warmup: Trains the indexer against full attention for an initial number of iterations before switching to sparse routing.
  3. Forced Local Block: Guarantees the inclusion of the immediate context, preventing the model from losing local coherence.
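A rough sketch of the alignment objective and the warmup switch follows. Several details are assumptions rather than facts from the source: the target is formed by summing token-level attention probabilities within each block, the divergence is taken as KL(target || indexer), and the helper names are hypothetical. NumPy has no autodiff, so the detach is only indicated in a comment. In a real framework the target would be wrapped in a stop-gradient so the KL term updates the indexer alone.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def block_target(attn_probs, block_size):
    """Sum token-level attention probabilities into per-block mass.

    Assumption: this aggregate is the target distribution. It would be
    detached (treated as a constant) in an autodiff framework.
    """
    return attn_probs.reshape(-1, block_size).sum(axis=1)

def kl_alignment(target, index_scores, eps=1e-12):
    """KL(target || softmax(index_scores)) over blocks (direction assumed)."""
    pred = softmax(index_scores)
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))

def use_sparse_routing(step, warmup_steps):
    """Full attention during warmup, sparse routing afterwards."""
    return step >= warmup_steps

rng = np.random.default_rng(0)
block_size, num_blocks = 4, 8   # tiny sizes for illustration only
attn = softmax(rng.standard_normal(block_size * num_blocks))
target = block_target(attn, block_size)

loss_random = kl_alignment(target, rng.standard_normal(num_blocks))
loss_matched = kl_alignment(target, np.log(target))  # indexer equals target
print(loss_random, loss_matched)
```

The loss is close to zero when the indexer's block distribution already matches the aggregated attention pattern and positive otherwise, which is the signal the indexer is trained on.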

Kernel Co-Design for Hardware Efficiency

Theoretical sparsity does not translate into wall-clock speedups without hardware-level optimization. MSA includes custom kernels (fmha_sm100) designed for NVIDIA SM100 GPUs.

  • Exp-free Top-k Selection: Softmax is monotonic, so ranking raw scores selects the same blocks as ranking softmax outputs. By skipping the exponentiation, the kernel avoids unnecessary compute, achieving a 5.1x speedup over standard torch.topk at 128K context.
  • KV-Outer Sparse Attention: The kernel raises arithmetic intensity by packing query positions into 128x128 score MMAs and by splitting the attention and combination steps across thread blocks (CTAs).
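The reasoning behind exp-free selection can be checked numerically. The snippet below is a CPU-side illustration in NumPy, not the fmha_sm100 kernel, and topk_indices is a hypothetical helper. It compares the top-k set taken from raw scores with the set taken from the corresponding softmax probabilities.

```python
import numpy as np

def topk_indices(x, k):
    """Indices of the k largest entries of x, as a set (order ignored)."""
    return set(np.argpartition(-x, k - 1)[:k].tolist())

rng = np.random.default_rng(0)
scores = rng.standard_normal(1024)  # stand-in for raw index scores
k = 16

# Softmax is strictly increasing in each score, so it preserves the ranking.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

raw_pick = topk_indices(scores, k)
softmax_pick = topk_indices(probs, k)
same_selection = raw_pick == softmax_pick
print(same_selection)
```

Since both selections coincide, the exponentiation and normalization add work without changing which blocks are chosen.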