Achieving 1000+ TPS on 1T Models via Model-System Codesign

The Architecture of Extreme Inference Speed

Xiaomi's MiMo-V2.5-Pro-UltraSpeed achieves over 1000 tokens per second (TPS) on a 1-trillion-parameter Mixture-of-Experts (MoE) model using a single 8-GPU commodity node. The speed comes not from custom silicon but from what Xiaomi calls "extreme model-system codesign": weight precision, the decoding algorithm, and the GPU runtime are optimized together so that no single layer remains the bottleneck.

Three Layers of Optimization

FP4 Quantization: To reduce memory bandwidth pressure, Xiaomi applies MXFP4 quantization selectively to the model's MoE experts, which hold the majority of parameters. Other modules use FP8 precision. Quantization-Aware Training (QAT) is used so that the lower precision does not degrade benchmark performance relative to the original model.
DFlash Speculative Decoding: Traditional speculative decoding is limited by the serial nature of draft model generation. DFlash overcomes this by using block-level masked parallel prediction, allowing the draft model to fill a block of positions in a single forward pass. Tuned with the Muon second-order optimizer and self-distillation, it achieves high acceptance lengths, averaging 6.30 tokens in coding tasks and 5.56 in math/reasoning.
TileRT Runtime: At 1000 TPS the budget is about 1 ms per token, so the launch and synchronization overhead of traditional operator-by-operator execution becomes significant. TileRT replaces this with a Persistent Engine Kernel that remains resident on the GPU. Using Warp Specialization, it coordinates data movement, compute, and communication within that kernel, avoiding the microsecond-scale gaps that typically fragment the execution stream.
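
To make the FP4 Quantization item above more concrete, here is a minimal sketch of MXFP4-style block quantization as described in the OCP Microscaling (MX) format: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. This is an illustration of the number format only, not Xiaomi's implementation; the function names are hypothetical, tie-breaking is simplified, and the clamping of the shared exponent to its 8-bit range is omitted.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2     # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32   # elements sharing one scale in the MX formats
SCALE_BITS = 8    # the shared scale is an 8-bit power-of-two exponent (E8M0)


def quantize_block(values):
    """Quantize one block to (shared_exponent, signed E2M1 values).

    Hypothetical helper for illustration; assumes finite inputs.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        # Saturate anything above the largest representable magnitude.
        scaled = min(abs(v) / scale, E2M1_GRID[-1])
        # Nearest grid point; ties fall to the smaller magnitude (a simplification).
        nearest = min(E2M1_GRID, key=lambda g: abs(g - scaled))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate real values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def bits_per_weight():
    """Average storage per element: 4-bit values plus the amortized scale."""
    return (4 * BLOCK_SIZE + SCALE_BITS) / BLOCK_SIZE


if __name__ == "__main__":
    block = [0.1 * i for i in range(-16, 16)]  # 32 toy "weights"
    exp, q = quantize_block(block)
    restored = dequantize_block(exp, q)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("shared exponent:", exp)
    print("max abs error:", worst)
    print("bits per weight:", bits_per_weight())
```

Under these assumptions each weight costs 4.25 bits on average, compared with 8 bits for FP8, which is why applying the format to the expert weights reduces the bytes that must be read from memory per token. The coarse 8-value grid also illustrates why the source pairs the format with QAT rather than quantizing after training.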

Trade-offs and Availability

This high-speed mode is designed for latency-sensitive, throughput-bound workloads. It offers roughly 10x the speed of the baseline MiMo-V2.5-Pro at 3x the cost, and is currently available only as an API trial. The system represents a shift toward optimizing serving infrastructure alongside model architecture, rather than treating them as separate concerns.
