Multipath Routing Fixes Core Bottlenecks in AI Training
MRC (Multipath Reliable Connection) reduces congestion in AI supercomputers by spraying packets across hundreds of network paths simultaneously, rather than pinning each flow to a single path. This delivers faster, more predictable GPU-to-GPU data transfers, which are critical for training massive models. On failures—links, switches, or paths—MRC reroutes in microseconds, versus the seconds or tens of seconds typical of standard 800 Gb/s fabrics. The result: training jobs survive reboots and maintenance without stalls. OpenAI's multi-plane design connects over 100,000 GPUs using only two Ethernet switch tiers, slashing component count, power use, and costs compared to conventional three- or four-tier setups.
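The two behaviors described above—spraying packets across many paths and steering around a failed path—can be sketched in miniature. This is an illustrative toy model only: the class name, round-robin spray policy, and failover logic are assumptions for the sketch, not the MRC specification.

```python
class MultipathSender:
    """Toy model of multipath packet spraying with path failover.

    Illustrative sketch only; names and policies are hypothetical,
    not taken from the MRC spec.
    """

    def __init__(self, num_paths):
        self.paths = list(range(num_paths))
        self.healthy = set(self.paths)
        self._next = 0  # round-robin cursor

    def mark_failed(self, path):
        # A real fabric detects failure in microseconds via hardware
        # signals; here we simply drop the path from the healthy set,
        # so subsequent packets are steered around it.
        self.healthy.discard(path)

    def send(self, packets):
        """Spray packets round-robin across the healthy paths.

        Returns a list of (packet, path_id) assignments.
        """
        alive = [p for p in self.paths if p in self.healthy]
        if not alive:
            raise RuntimeError("no healthy paths remain")
        assignments = []
        for pkt in packets:
            path = alive[self._next % len(alive)]
            self._next += 1
            assignments.append((pkt, path))
        return assignments


# Usage: spray 8 packets over 4 paths, fail one path, spray again.
sender = MultipathSender(num_paths=4)
first = sender.send(list(range(8)))   # packets land on all 4 paths
sender.mark_failed(2)
second = sender.send(list(range(8)))  # path 2 is no longer used
```

The key property the sketch demonstrates is that failover requires no coordination with the sender's workload: once a path is marked failed, later packets simply never select it, which mirrors how MRC lets training traffic continue across the surviving paths.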
Proven at Scale on Frontier Supercomputers
Deployed across OpenAI's largest NVIDIA GB200 clusters—including Oracle Cloud in Abilene, Texas, and Microsoft's Fairwater—MRC faced a real-world test during frontier model training for ChatGPT and Codex. Four tier-1 switches were rebooted without any coordination with the running jobs, and training continued undisrupted. This lets operators maintain networks mid-training, boosting uptime for trillion-parameter models where network stalls previously cost hours or days.
Open Standards Accelerate Adoption
The specification has been released through the Open Compute Project (OCP MRC 1.0), with contributions from AMD, Broadcom, Intel, Microsoft, and NVIDIA. Builders can implement it now for Ethernet-based AI fabrics, avoiding proprietary lock-in while hitting supercomputer-scale performance.