MiniMax Sparse Attention: Scaling Long Context with Block-Sparsity

The Architecture of MSA

MiniMax Sparse Attention (MSA) addresses the quadratic complexity bottleneck of standard softmax attention by decoupling the attention process into two distinct branches: an Index Branch and a Main Branch. Instead of performing dense attention across all tokens, MSA operates at a block granularity (defaulting to 128 tokens per block).

Index Branch: Uses two projection matrices to score key-value blocks. A Top-k operator then selects the most relevant blocks (16 per query group by default), and the block covering the query's immediate neighborhood is always included to maintain local context.
Main Branch: Executes exact softmax attention only on the selected blocks. This caps the per-query budget at 2,048 tokens (16 blocks × 128 tokens), so the Main Branch's per-query attention cost stays fixed as the context grows instead of increasing linearly with it.
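
The two branches can be illustrated with a small, pure-Python sketch for a single query. This is not the MSA implementation: the function names, the causal masking, and the choice to let the forced local block occupy one of the Top-k slots are all assumptions made for illustration, and the block scores stand in for whatever the Index Branch's projections would produce.

```python
import math

BLOCK_SIZE = 128  # tokens per block (default quoted in the text)
TOP_K = 16        # blocks kept per query group (default quoted in the text)


def select_blocks(block_scores, query_pos, block_size=BLOCK_SIZE, top_k=TOP_K):
    """Return the sorted indices of the KV blocks one query attends to.

    block_scores: one relevance score per KV block (hypothetical indexer output).
    query_pos:    token position of the query. Causal masking is assumed,
                  so blocks after the query's own block are never chosen.
    """
    local_block = query_pos // block_size
    visible = range(local_block + 1)
    # Rank the non-local visible blocks by raw score, highest first.
    ranked = sorted((b for b in visible if b != local_block),
                    key=lambda b: block_scores[b], reverse=True)
    # Assumption: the local block takes one slot, top scorers fill the rest.
    return sorted([local_block] + ranked[:top_k - 1])


def sparse_attention(q, keys, values, blocks, query_pos, block_size=BLOCK_SIZE):
    """Exact softmax attention for one query, restricted to selected blocks.

    Returns the output vector and the number of tokens actually attended to.
    """
    positions = [t for b in blocks
                 for t in range(b * block_size,
                                min((b + 1) * block_size, query_pos + 1))]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, keys[t]))
              for t in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[t][d] for w, t in zip(weights, positions)) / total
           for d in range(dim)]
    return out, len(positions)
```

With the defaults, a query at the end of an 8,192-token context selects 16 of the 64 visible blocks, so the Main Branch touches at most 16 × 128 = 2,048 keys no matter how many blocks the context contains.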

Training and Optimization

Because Top-k selection is non-differentiable, MSA employs a KL alignment loss to train the indexer, matching the Index Branch's distribution to the Main Branch's attention pattern. To stabilize training, the researchers implemented three key techniques:

Gradient Detach: Prevents the KL loss from causing gradient spikes in the backbone.
Indexer Warmup: Trains the indexer using full attention for initial iterations before switching to sparse routing.
Forced Local Block: Guarantees the inclusion of the immediate context, preventing the model from losing local coherence.
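
The alignment objective and the warmup switch can be sketched in a few lines. The code below is an illustrative assumption rather than the published recipe: the text does not state the KL direction, how token-level attention is pooled into blocks, or how the warmup length is set, so the direction KL(target || indexer), the per-block summation, and the names `kl_alignment_loss` and `use_sparse_routing` are all hypothetical. Pure Python has no autograd, so the detach is described only in a comment.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def block_mass(token_probs, block_size):
    """Sum per-token attention probabilities into per-block probabilities."""
    return [sum(token_probs[i:i + block_size])
            for i in range(0, len(token_probs), block_size)]


def kl_alignment_loss(indexer_scores, attn_logits, block_size, eps=1e-12):
    """KL(target || indexer) over blocks for a single query.

    The target is built from the Main Branch's attention logits and is
    treated as a constant here. In an autograd framework it would be
    detached, so the loss would update the indexer only (assumption).
    """
    target = block_mass(softmax(attn_logits), block_size)
    pred = softmax(indexer_scores)
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(target, pred) if p > 0.0)


def use_sparse_routing(step, warmup_steps):
    """Indexer warmup: full attention before `warmup_steps`, sparse after."""
    return step >= warmup_steps
```

The loss is zero when the indexer's block distribution already matches the attention mass per block and grows as the two diverge, which is the signal that lets a non-differentiable Top-k selector still be trained.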

Kernel Co-Design for Hardware Efficiency

Theoretical sparsity delivers little real speedup without hardware-level optimization. MSA therefore includes custom kernels (fmha_sm100) designed for NVIDIA SM100 GPUs.

Exp-free Top-k Selection: Softmax is monotonic, so it never changes the ranking of scores. The kernel therefore ranks raw scores directly and skips the exponentials, achieving a 5.1x speedup over standard torch.topk at 128K context.
KV-Outer Sparse Attention: The kernel optimizes arithmetic intensity by packing query positions into 128x128 score MMAs, splitting attention and combination steps across thread blocks (CTAs).
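
The reasoning behind exp-free selection can be checked with a short sketch. It only demonstrates the ranking property on the CPU and says nothing about how the fmha_sm100 kernel is written; `topk_indices` is a hypothetical helper, not an API from the MSA codebase.

```python
import heapq
import math


def topk_indices(scores, k):
    """Indices of the k largest scores, best first (ties go to lower index)."""
    return heapq.nsmallest(k, range(len(scores)),
                           key=lambda i: (-scores[i], i))


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [1.2, -0.3, 4.0, 0.5, 3.1, 2.2]

# Selecting on raw scores needs no exponentials at all.
selected_raw = topk_indices(raw_scores, 3)

# Selecting after softmax costs one exp per score and picks the same blocks.
selected_softmax = topk_indices(softmax(raw_scores), 3)
```

Both calls return the same indices because exponentiation followed by division by a positive constant preserves order. In finite precision, scores that are extremely close could collapse to the same probability after softmax, which is one more reason to rank the raw values.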
