Multipath Routing Fixes Core Bottlenecks in AI Training
MRC (Multipath Reliable Connection) reduces congestion in AI supercomputers by spraying packets across hundreds of network paths simultaneously, rather than pinning each flow to a single path. This delivers faster, more predictable GPU-to-GPU data transfers, which are critical for training massive models. On failures—links, switches, or paths—MRC reroutes in microseconds, versus the seconds or tens of seconds typical of standard 800 Gb/s fabrics. The result: training jobs survive reboots and maintenance without stalls. OpenAI's multi-plane design connects over 100,000 GPUs using only two Ethernet switch tiers, slashing component count, power use, and costs compared to conventional three- or four-tier setups.
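The two behaviors described above—spraying packets across many paths and steering around a failed path—can be sketched in miniature. This is an illustrative toy model only: the class name, round-robin spray policy, and failover logic are assumptions for the sketch, not the MRC specification.

```python
class MultipathSender:
    """Toy model of multipath packet spraying with path failover.

    Illustrative sketch only; names and policies are hypothetical,
    not taken from the MRC spec.
    """

    def __init__(self, num_paths):
        self.paths = list(range(num_paths))
        self.healthy = set(self.paths)
        self._next = 0  # round-robin cursor

    def mark_failed(self, path):
        # A real fabric detects failure in microseconds via hardware
        # signals; here we simply drop the path from the healthy set,
        # so subsequent packets are steered around it.
        self.healthy.discard(path)

    def send(self, packets):
        """Spray packets round-robin across the healthy paths.

        Returns a list of (packet, path_id) assignments.
        """
        alive = [p for p in self.paths if p in self.healthy]
        if not alive:
            raise RuntimeError("no healthy paths remain")
        assignments = []
        for pkt in packets:
            path = alive[self._next % len(alive)]
            self._next += 1
            assignments.append((pkt, path))
        return assignments


# Usage: spray 8 packets over 4 paths, fail one path, spray again.
sender = MultipathSender(num_paths=4)
first = sender.send(list(range(8)))   # packets land on all 4 paths
sender.mark_failed(2)
second = sender.send(list(range(8)))  # path 2 is no longer used
```

The key property the sketch demonstrates is that failover requires no coordination with the sender's workload: once a path is marked failed, later packets simply never select it, which mirrors how MRC lets training traffic continue across the surviving paths.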
Proven at Scale on Frontier Supercomputers
Deployed across OpenAI's largest NVIDIA GB200 clusters—including Oracle Cloud in Abilene, Texas, and Microsoft's Fairwater—MRC faced a real-world test during frontier model training for ChatGPT and Codex. Four tier-1 switches were rebooted without any coordination with the running jobs, and training continued undisrupted. This lets operators maintain networks mid-training, boosting uptime for trillion-parameter models where network stalls previously cost hours or days.
Open Standards Accelerate Adoption
The specification has been released through the Open Compute Project (OCP MRC 1.0), with contributions from AMD, Broadcom, Intel, Microsoft, and NVIDIA. Builders can implement it now for Ethernet-based AI fabrics, avoiding proprietary lock-in while hitting supercomputer-scale performance.