Multipath Mechanisms Eliminate Congestion and Enable Fast Recovery

In large AI training clusters, network congestion, link failures, and jitter leave GPUs idle, and that idle time grows more costly as clusters scale to millions of data transfers per training step. MRC builds on RoCEv2 for hardware-accelerated RDMA over Ethernet and on SRv6 for static source routing: path intelligence moves into the NICs, while switches simply follow pre-configured paths without making routing decisions of their own. This keeps dynamic routing from interfering with traffic.
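To make that division of labor concrete, here is a toy model of SRv6-style source routing (a sketch of the general mechanism, not MRC's code; per RFC 8754 the segment list is stored in reverse order, which the model mirrors):

    # Illustrative model of SRv6 source routing: the sender encodes the
    # whole path as a list of segment IDs (SIDs); each switch only
    # advances a pointer, so no hop makes a routing decision.
    from dataclasses import dataclass

    @dataclass
    class SRv6Packet:
        payload: bytes
        segments: list       # SID list, stored last-hop-first (RFC 8754)
        segments_left: int   # index of the next segment to visit

    def build_packet(payload, path_sids):
        # The NIC chooses the path up front and reverses it for the header.
        return SRv6Packet(payload, list(reversed(path_sids)),
                          len(path_sids) - 1)

    def forward(packet):
        # A switch's entire job: read the current SID, step the pointer.
        next_hop = packet.segments[packet.segments_left]
        if packet.segments_left > 0:
            packet.segments_left -= 1
        return next_hop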

Adaptive packet spraying distributes packets across hundreds of paths at the NIC level, delivering higher bandwidth, lower tail latency, and packet-level load balancing, unlike single-path RoCEv2. On failure, MRC detects the problem in microseconds and reroutes: if an 8-port 800Gb/s NIC loses one port, it drops to 7/8 capacity but recalculates paths instantly, notifies peers to avoid the failed plane, and restores the plane within a minute of recovery. Conventional fabrics take seconds to tens of seconds to reconverge, often crashing jobs; MRC keeps training alive with minimal performance loss.
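A minimal sketch of NIC-side spraying with plane failover follows; the Path structure and method names here are hypothetical, not MRC's API, but the set arithmetic shows why losing one of eight planes costs 1/8 of capacity rather than the whole job:

    # Hypothetical sketch of packet spraying across planes with failover.
    from dataclasses import dataclass
    import itertools

    @dataclass(frozen=True)
    class Path:
        plane: int            # which of the 8 network planes this path uses
        segment_list: tuple   # SRv6 SIDs, as in the previous sketch

    class SprayingSender:
        def __init__(self, paths):
            self.paths = list(paths)        # pre-computed paths, all planes
            self.healthy = set(self.paths)

        def mark_plane_failed(self, plane):
            # Reaction to a failure notification: stop spraying onto every
            # path crossing the dead plane. 1 of 8 planes -> 7/8 capacity.
            self.healthy = {p for p in self.healthy if p.plane != plane}

        def mark_plane_recovered(self, plane):
            self.healthy |= {p for p in self.paths if p.plane == plane}

        def send(self, packets):
            # Packet-level load balancing: round-robin over healthy paths,
            # unlike single-path RoCEv2 where a flow is pinned to one path.
            order = sorted(self.healthy,
                           key=lambda p: (p.plane, p.segment_list))
            return list(zip(packets, itertools.cycle(order)))

    sender = SprayingSender(
        Path(plane, (f"leaf-{plane}", f"spine-{plane}"))
        for plane in range(8))
    sender.mark_plane_failed(3)   # e.g., one NIC port down: 7/8 remains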

On the congestion-control side, AMD's NSCC integrates with MRC through UEC specifications, preserving standard RDMA semantics while adding multipath support.
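NSCC's actual algorithm is defined by the UEC specifications and is not reproduced here; purely as an illustration of what multipath support means for congestion control, the sketch below keeps one AIMD-style window per sprayed path instead of per flow:

    # Generic illustration only: NOT AMD's NSCC algorithm. The point is
    # the state layout: one congestion window per path, not per connection.
    class PerPathWindow:
        def __init__(self, init_pkts=16, max_pkts=256):
            self.cwnd = init_pkts
            self.max_pkts = max_pkts

        def on_ack(self):
            # Additive increase on clean delivery.
            self.cwnd = min(self.cwnd + 1, self.max_pkts)

        def on_congestion_signal(self):
            # Multiplicative decrease on ECN mark or loss, floor of 1.
            self.cwnd = max(self.cwnd // 2, 1)

    # One window per sprayed path: a hot spot throttles only the paths
    # that cross it, while RDMA verbs semantics stay untouched.
    windows = {path_id: PerPathWindow() for path_id in range(8)}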

Multi-Plane Architecture Cuts Tiers, Costs, and Latency

MRC treats each NIC as a bundle of smaller links (e.g., one 800Gb/s interface split into eight 100Gb/s links, each wired to a different switch), enabling a two-tier Clos network for 131,000 GPUs where 800Gb/s designs need three to four tiers. The longest paths cross three switches instead of five to seven, slashing latency.
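The 131,000-GPU figure is consistent with simple Clos arithmetic, assuming radix-512 switches (e.g., a 51.2 Tb/s ASIC broken out as 512 x 100GbE ports); the exact radix is our assumption, not stated above:

    # Back-of-the-envelope check of two-tier scale, assuming radix-512
    # switches (an assumption; the text does not name the radix).
    RADIX = 512                    # 100G ports per switch
    leaf_down = RADIX // 2         # leaf ports facing NICs
                                   # (the other half face spines)

    # One plane of a two-tier Clos: a spine has RADIX ports, so up to
    # RADIX leaves; each leaf serves RADIX/2 endpoints.
    endpoints_per_plane = RADIX * leaf_down
    print(endpoints_per_plane)     # 131072, matching the ~131,000 figure

    # Each GPU's 800G NIC contributes one 100G link to each of 8 such
    # planes, so NIC bandwidth is preserved while the fabric stays at
    # two tiers: longest path is NIC -> leaf -> spine -> leaf -> NIC.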

At full bisection bandwidth, this design needs 2/3 the optics and 3/5 the switches of a three-tier network, cutting power, cost, and failure blast radius. A tier-1 switch failure (e.g., four switches rebooting during a training run) no longer halts jobs.
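The 2/3 and 3/5 ratios follow from standard folded-Clos counting, as the symbolic check below shows (illustrative accounting, treating optics as proportional to link count):

    # Recheck the optics and switch ratios for a full-bisection Clos with
    # n endpoints and switch radix k (folded-Clos counting, illustrative).
    def fabric_cost(tiers, n, k):
        # Full bisection: n host links plus n links at each of the
        # (tiers - 1) tier boundaries, i.e. tiers * n links in total.
        links = tiers * n
        # Every tier except the top needs 2n/k switches (k/2 down, k/2
        # up); the top tier needs n/k (all k ports face down).
        switches = (tiers - 1) * (2 * n // k) + n // k
        return links, switches

    n, k = 131_072, 512
    two = fabric_cost(2, n, k)     # (262144 links, 768 switches)
    three = fabric_cost(3, n, k)   # (393216 links, 1280 switches)
    print(two[0] / three[0])       # 0.666... -> 2/3 of the optics
    print(two[1] / three[1])       # 0.6      -> 3/5 of the switches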

Running in Production on Named Hardware Across OpenAI Clusters

MRC is deployed on 400/800Gb/s RDMA NICs such as NVIDIA ConnectX-8, AMD Pollara/Vulcano, and Broadcom Thor Ultra; SRv6-capable switches include NVIDIA Spectrum-4/5 (running Cumulus or SONiC) and Broadcom Tomahawk 5-based platforms (running Arista EOS). It powers NVIDIA GB200 supercomputers in OpenAI's Stargate (OCI Abilene, TX) and Microsoft's Fairwater (Atlanta and Wisconsin) clusters, training ChatGPT and Codex models without job interruptions from network failures.