The Bottleneck of Host-Driven Communication
In modern AI training, communication overhead is a primary performance killer, consuming up to 43.6% of the forward pass and 47% of execution time in Mixture-of-Experts (MoE) models. The current industry standard relies on host-driven communication, where the CPU manages control paths and issues collective operations (like AllReduce) via libraries such as NCCL.
This approach fails to scale with modern GPU clusters (e.g., GB300 NVL72) for two reasons:
- Orchestration Overhead: Microsecond-scale CPU operations, such as cudaLaunchKernel calls and inter-stream event synchronization, create "pipeline bubbles" that prevent GPUs from operating at full capacity.
- Coarse-Grained Overlap: Host-driven systems can only overlap compute and communication at kernel boundaries. This prevents the fine-grained, tile-level interleaving required to hide communication latency effectively.
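To make the cost of kernel-boundary overlap concrete, here is a rough timing model for one dependent GEMM followed by a collective. It is a sketch under stated assumptions, not a measurement: the function names, the per-tile costs, and the launch cost are all illustrative, and the model ignores any overlap with unrelated kernels.

```python
def kernel_boundary_time(n_tiles, compute_us, comm_us, launch_us):
    """Host-driven schedule: the collective starts only after the whole GEMM ends."""
    # Two host launches (GEMM, then collective) and no overlap between them.
    return 2 * launch_us + n_tiles * compute_us + n_tiles * comm_us


def tile_pipelined_time(n_tiles, compute_us, comm_us):
    """Fused schedule: tile i is communicated while tile i+1 is being computed."""
    # Two-stage pipeline: one compute step to fill, the slower stage sets the
    # steady-state rate, and one communication step drains the last tile.
    return compute_us + (n_tiles - 1) * max(compute_us, comm_us) + comm_us


# Illustrative numbers only: 64 tiles, 10 us compute and 6 us communication
# per tile, 5 us per host-side launch.
host_driven = kernel_boundary_time(64, 10, 6, 5)
fused = tile_pipelined_time(64, 10, 6)
print(host_driven, fused)
```

Under these assumed numbers, the pipelined schedule hides almost all of the communication time behind compute, while the kernel-boundary schedule pays for both in full.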
GPU-Driven Communication with mKernel
mKernel, developed by UC Berkeley’s UCCL project, shifts the control logic directly onto the GPU. It provides a library of persistent CUDA kernels that fuse compute and communication into a single execution unit. By moving the communication logic into the GPU, the system achieves fine-grained overlap at the chunk or tile level, regardless of whether the data transfer is intra-node (NVLink) or inter-node (RDMA).
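The idea of fusing compute and communication into one execution unit can be illustrated with a CPU-side analogy. The sketch below is not mKernel code and does not use its API: `fused_gemm_allreduce` is a hypothetical name, one thread stands in for SMs doing compute, another for SMs doing communication, and a queue stands in for per-tile ready flags in GPU memory. The "GEMM" and the "reduction" are placeholders.

```python
import queue
import threading


def fused_gemm_allreduce(tiles, peer_partials):
    """Produce tiles in one thread and reduce each one as soon as it is ready."""
    ready = queue.Queue()            # stands in for per-tile ready flags
    reduced = [None] * len(tiles)

    def compute_role():
        for idx, tile in enumerate(tiles):
            result = [2 * x for x in tile]   # placeholder for one GEMM output tile
            ready.put((idx, result))         # publish the tile the moment it is done
        ready.put(None)                      # sentinel: no more tiles

    def comm_role():
        while (item := ready.get()) is not None:
            idx, local = item
            # Placeholder for the reduction: add the peer's partial tile elementwise.
            reduced[idx] = [a + b for a, b in zip(local, peer_partials[idx])]

    workers = [threading.Thread(target=compute_role),
               threading.Thread(target=comm_role)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return reduced


print(fused_gemm_allreduce([[1, 2], [3, 4]], [[10, 10], [20, 20]]))
```

The point of the structure is that the consumer never waits for the whole output: it starts on tile 0 while the producer is still working on tile 1.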
Key architectural features include:
- Persistent Kernel Design: Kernels remain resident on the GPU, with Streaming Multiprocessors (SMs) dynamically assigned to specific roles: compute, intra-comm, inter-send, and inter-reduce. The allocation of these roles is tunable based on the specific workload shape.
- Direct RDMA Integration: The library uses GPU-initiated RDMA writes via libibverbs, bypassing traditional host-side communication libraries to minimize latency.
- Fused Operations: The library provides five primary fused kernels:
  - AllGather + GEMM: Overlaps data gathering with local matrix multiplication.
  - GEMM + AllReduce: Pushes output tiles into the reduction tree the moment they are computed.
  - MoE Dispatch + GEMM: Routes tokens and performs grouped GEMM in one pass, eliminating staging buffer round-trips.
  - Ring Attention: Performs sequence-parallel attention by rotating KV chunks while concurrently computing.
  - GEMM + ReduceScatter: Reduces and forwards output tiles immediately upon production.
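The tunable split of SMs across roles can be pictured as a simple partition of SM indices. The following is a minimal sketch, assuming a static partition chosen before launch; `assign_sm_roles` and its `comm_sms` parameter are hypothetical and are not part of mKernel's interface. Only the four role names come from the description above.

```python
ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def assign_sm_roles(num_sms, comm_sms):
    """Map each SM index to a role; SMs not reserved for communication do compute.

    comm_sms: dict of role name -> SM count for the three communication roles.
    """
    unknown = set(comm_sms) - set(ROLES[1:])
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    reserved = sum(comm_sms.values())
    if reserved >= num_sms:
        raise ValueError("communication roles leave no SMs for compute")

    assignment = []
    for role in ROLES[1:]:
        assignment += [role] * comm_sms.get(role, 0)
    # Every remaining SM runs the compute role.
    assignment += ["compute"] * (num_sms - reserved)
    return assignment


# Example split for a 132-SM GPU; the counts are illustrative, not tuned values.
roles = assign_sm_roles(132, {"intra-comm": 8, "inter-send": 4, "inter-reduce": 4})
print(roles.count("compute"))
```

A communication-heavy shape would shift more SMs toward the three communication roles, while a compute-bound shape would keep that reservation small.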
Implementation and Backends
mKernel supports two primary networking backends, both sharing a unified host-side API but utilizing different proxy implementations:
- CX7 Backend: Uses libibverbs RC (Reliable Connection) for InfiniBand/RoCE environments.
- EFA Backend: Optimized for AWS p5/p5e instances using libibverbs and efadv (SRD, Scalable Reliable Datagram).
The library requires NVIDIA Hopper GPUs (targeting sm_90a), CUDA 12.9, and PyTorch. It is designed to be a drop-in replacement for scenarios where standard collective communication libraries create unacceptable performance degradation.
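A quick pre-flight check of these requirements might look like the sketch below. It makes two assumptions that the text does not confirm: that CUDA 12.9 is a minimum rather than an exact version, and that the sm_90a target means the device must report compute capability 9.0. The helper name `meets_requirements` is hypothetical.

```python
def meets_requirements(compute_capability, cuda_version):
    """Return True for a compute capability 9.0 GPU with CUDA 12.9 or newer.

    compute_capability: (major, minor) tuple, e.g. (9, 0) for Hopper.
    cuda_version: version string such as "12.9", or None if CUDA is unavailable.
    """
    if cuda_version is None:
        return False
    # Compare only major.minor, as integers, so "12.10" sorts after "12.9".
    major_minor = tuple(int(part) for part in cuda_version.split(".")[:2])
    # Assumption: 12.9 is treated as a lower bound.
    return tuple(compute_capability) == (9, 0) and major_minor >= (12, 9)


print(meets_requirements((9, 0), "12.9"))
```

In a PyTorch environment, the two inputs could be taken from `torch.cuda.get_device_capability()` and `torch.version.cuda`.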