The Architecture of Extreme Inference Speed
Xiaomi's MiMo-V2.5-Pro-UltraSpeed achieves over 1000 tokens per second (TPS) on a 1-trillion-parameter Mixture-of-Experts (MoE) model using a single 8-GPU commodity node. The speed comes not from custom silicon but from "extreme model-system codesign": weight precision, the decoding algorithm, and the GPU runtime are optimized together so that no single layer becomes the bottleneck.
Three Layers of Optimization
- FP4 Quantization: To reduce memory bandwidth pressure, Xiaomi applies MXFP4 quantization selectively to the model's MoE experts, which hold the majority of parameters. Other modules use FP8 precision. Quantization-Aware Training (QAT) is used so that the reduced precision does not degrade benchmark performance relative to the original model.
- DFlash Speculative Decoding: Traditional speculative decoding is limited by the serial nature of draft model generation, since the draft model produces its candidate tokens one at a time. DFlash overcomes this with block-level masked parallel prediction, which lets the draft model fill a whole block of positions in a single forward pass. Tuned with the Muon second-order optimizer and self-distillation, it achieves high acceptance lengths, averaging 6.30 tokens on coding tasks and 5.56 on math/reasoning.
- TileRT Runtime: At 1000 TPS the average budget is about 1 ms per token, so the launch and synchronization overhead of traditional operator-by-operator execution becomes significant. TileRT replaces this with a Persistent Engine Kernel that remains resident on the GPU. Using Warp Specialization, it coordinates data movement, compute, and communication inside that kernel, avoiding the microsecond-scale gaps that otherwise fragment execution.
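To make the quantization step concrete, here is a minimal pure-Python sketch of MXFP4-style block quantization. It follows the general Microscaling (MX) layout, in which 32 elements share one power-of-two scale and each element is stored as a 4-bit E2M1 value. It is an illustration under those assumptions, not Xiaomi's implementation: the function names are hypothetical, rounding ties are simplified, and clamping of the shared exponent to its 8-bit range is omitted.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32    # number of elements that share one scale
SCALE_BITS = 8     # the shared scale is a single 8-bit exponent


def quantize_block(block):
    """Return (shared_exp, fp4_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Pick the power-of-two scale that maps the largest magnitude
    # into the top E2M1 binade.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    fp4_values = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Nearest representable magnitude; ties go to the smaller value
        # here, which is a simplification of round-to-nearest-even.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        fp4_values.append(math.copysign(nearest, x))
    return shared_exp, fp4_values


def dequantize_block(shared_exp, fp4_values):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in fp4_values]


def bits_per_element():
    """Average storage cost: 4 bits per element plus the amortized scale."""
    return (4 * BLOCK_SIZE + SCALE_BITS) / BLOCK_SIZE


if __name__ == "__main__":
    weights = [1.0, -0.5, 0.25, 3.0, 2.4] + [0.0] * 27
    exp, q = quantize_block(weights)
    print(exp, dequantize_block(exp, q)[:5], bits_per_element())
```

Because the scale is shared per block, a single outlier weight coarsens the grid for its 31 neighbors, which is one reason training with quantization in the loop (QAT) can help the model adapt to the format.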
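The acceptance lengths and the 1000 TPS target together imply a time budget for each draft-and-verify cycle, which shows why microsecond-scale gaps matter. The sketch below works through that arithmetic. It assumes that the acceptance length is the average number of tokens committed per cycle and that the 1000 TPS figure applies to a single decoding stream; the gap count and gap duration in the last function are hypothetical inputs, not measured values.

```python
def cycles_per_second(target_tps, mean_accept_len):
    """Draft-and-verify cycles needed per second to sustain target_tps."""
    return target_tps / mean_accept_len


def cycle_budget_ms(target_tps, mean_accept_len):
    """Wall-clock time available for one draft-and-verify cycle, in ms."""
    return 1000.0 * mean_accept_len / target_tps


def gap_fraction(num_gaps, gap_us, budget_ms):
    """Share of the cycle budget lost to idle gaps between operators.

    num_gaps and gap_us are hypothetical: they stand in for the number
    of operator boundaries per cycle and the idle time at each one.
    """
    return (num_gaps * gap_us / 1000.0) / budget_ms


if __name__ == "__main__":
    for task, accept_len in [("coding", 6.30), ("math/reasoning", 5.56)]:
        budget = cycle_budget_ms(1000, accept_len)
        print(f"{task}: {cycles_per_second(1000, accept_len):.1f} cycles/s, "
              f"{budget:.2f} ms per cycle")
    # Hypothetical: 1000 operator boundaries per cycle, 5 microseconds each.
    print(f"gap share: {gap_fraction(1000, 5.0, cycle_budget_ms(1000, 6.30)):.0%}")
```

Under these assumptions each cycle has only a few milliseconds to run the draft pass and the full verification pass, so even a few microseconds of idle time per operator boundary can add up to a large share of the budget. That is the overhead a persistent, resident kernel is meant to remove.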
Trade-offs and Availability
This high-speed mode is designed for latency-sensitive workloads whose bottleneck is token generation speed. It offers roughly 10x the speed of the baseline MiMo-V2.5-Pro at 3x the cost, and it is currently available only as an API trial. The system represents a shift toward optimizing serving infrastructure alongside model architecture, rather than treating them as separate concerns.