mKernel: Fusing Compute and Communication for GPU-Driven Scaling

The Bottleneck of Host-Driven Communication

In modern AI training, communication overhead is a major performance bottleneck: it can consume up to 43.6% of the forward pass and 47% of execution time in Mixture-of-Experts (MoE) models. The current industry standard relies on host-driven communication, where the CPU manages control paths and issues collective operations (like AllReduce) via libraries such as NCCL.

This approach fails to scale with modern GPU clusters (e.g., GB300 NVL72) for two reasons:

Orchestration Overhead: Microsecond-scale CPU operations, such as cudaLaunchKernel calls and inter-stream event synchronization, create "pipeline bubbles" that leave the GPUs idle between kernels.
Coarse-Grained Overlap: Host-driven systems can only overlap compute and communication at kernel boundaries. This prevents the fine-grained, tile-level interleaving required to hide communication latency effectively.
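
The gap between kernel-boundary and tile-level overlap can be illustrated with a toy timing model. The sketch below is not mKernel code. It assumes one compute stage feeding one communication stage, equal-sized tiles, and a fixed host overhead per kernel launch. The function names and all numbers are hypothetical.

```python
def serialized_time(compute_us, comm_us, launch_overhead_us):
    """Host-driven baseline: a compute kernel, then a communication kernel.

    Each of the two launches pays the host overhead, and communication
    cannot start until the whole compute kernel has finished.
    """
    return 2 * launch_overhead_us + compute_us + comm_us


def tile_pipelined_time(compute_us, comm_us, num_tiles, launch_overhead_us):
    """Fused kernel model: one launch, each tile is sent once it is computed.

    This is the standard two-stage pipeline formula: the first tile must be
    computed before anything is sent, the last tile is sent after compute
    ends, and in between the slower stage sets the pace.
    """
    c = compute_us / num_tiles   # compute time per tile
    m = comm_us / num_tiles      # communication time per tile
    return launch_overhead_us + c + m + (num_tiles - 1) * max(c, m)


if __name__ == "__main__":
    # Hypothetical workload: 400 us of compute, 300 us of communication.
    print("serialized:", serialized_time(400.0, 300.0, 5.0))
    for tiles in (1, 8, 64):
        print(tiles, "tiles:", tile_pipelined_time(400.0, 300.0, tiles, 5.0))
```

In this model, more tiles push the total time toward the cost of the slower stage alone, which is the effect that tile-level interleaving aims for. The model ignores per-tile synchronization costs, which limit how small tiles can usefully get in practice.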

GPU-Driven Communication with mKernel

mKernel, developed by UC Berkeley's UCCL project, shifts the control logic directly onto the GPU. It provides a library of persistent CUDA kernels that fuse compute and communication into a single execution unit. Because the communication logic runs on the GPU itself, compute and communication can overlap at the chunk or tile level, whether the data transfer is intra-node (NVLink) or inter-node (RDMA).

Key architectural features include:

Persistent Kernel Design: Kernels remain resident on the GPU, with Streaming Multiprocessors (SMs) dynamically assigned to specific roles: compute, intra-comm, inter-send, and inter-reduce. The allocation of these roles is tunable based on the specific workload shape.
Direct RDMA Integration: The library uses GPU-initiated RDMA writes via libibverbs, bypassing traditional host-side communication libraries to minimize latency.
Fused Operations: The library provides five primary fused kernels:
- AllGather + GEMM: Overlaps data gathering with local matrix multiplication.
- GEMM + AllReduce: Pushes output tiles into the reduction tree the moment they are computed.
- MoE Dispatch + GEMM: Routes tokens and performs grouped GEMM in one pass, eliminating staging buffer round-trips.
- Ring Attention: Performs sequence-parallel attention by rotating KV chunks while concurrently computing.
- GEMM + ReduceScatter: Reduces and forwards output tiles immediately upon production.
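
To make the GEMM + AllReduce data flow concrete, here is a minimal CPU-only sketch in plain Python. It is not the mKernel implementation and models no real concurrency. It assumes the inner (K) dimension is sharded across simulated ranks, so each rank produces a partial output that must be summed. It only shows why tiles can enter the reduction as soon as they are computed: the reduction is elementwise, so reducing tile by tile gives the same result as one bulk AllReduce. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain dense matrix multiply on lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def gemm_then_allreduce(a_shards, b_shards):
    """Baseline: every rank finishes its full GEMM, then one bulk reduction."""
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


def fused_gemm_allreduce(a_shards, b_shards, tile_rows):
    """Tile-level variant: each row tile is reduced right after it is computed."""
    rows, cols = len(a_shards[0]), len(b_shards[0][0])
    out = [[0] * cols for _ in range(rows)]
    for start in range(0, rows, tile_rows):
        stop = min(start + tile_rows, rows)
        for a, b in zip(a_shards, b_shards):
            tile = matmul(a[start:stop], b)        # one rank, one output tile
            for i, tile_row in enumerate(tile):    # tile joins the reduction now
                for j, value in enumerate(tile_row):
                    out[start + i][j] += value
    return out


if __name__ == "__main__":
    # A is 3x4 and B is 4x2, with the inner dimension split across 2 ranks.
    a_shards = [[[1, 2], [5, 6], [9, 10]], [[3, 4], [7, 8], [11, 12]]]
    b_shards = [[[1, 0], [0, 1]], [[1, 1], [2, 0]]]
    print(fused_gemm_allreduce(a_shards, b_shards, tile_rows=2))
    print(gemm_then_allreduce(a_shards, b_shards))
```

On a real GPU, the inner loop over ranks would be replaced by concurrent SMs and network transfers. The sketch fixes only the ordering of work, not how it is scheduled.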

Implementation and Backends

mKernel supports two primary networking backends, both sharing a unified host-side API but utilizing different proxy implementations:

CX7 Backend: Uses libibverbs RC for InfiniBand/RoCE environments.
EFA Backend: Optimized for AWS p5/p5e instances using libibverbs and efadv (SRD).

The library requires NVIDIA Hopper GPUs (targeting sm_90a), CUDA 12.9, and PyTorch. It is designed as a drop-in replacement for cases where standard collective communication libraries cost too much performance.
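
A quick environment check against these requirements might look like the sketch below. It is not part of mKernel, and the helper names are hypothetical. It assumes that CUDA 12.9 is a minimum version rather than an exact pin, which the requirements above do not state. The PyTorch calls used (torch.cuda.is_available, torch.cuda.get_device_capability, torch.version.cuda) are standard, and the check falls back gracefully when PyTorch is not installed.

```python
def parse_version(text):
    """Turn a version string such as '12.9' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_requirements(capability, cuda_version, min_cuda=(12, 9)):
    """Return True if the device and toolkit match the stated requirements.

    capability:   (major, minor) compute capability, e.g. (9, 0) for Hopper.
    cuda_version: CUDA version string, or None for a CPU-only build.
    min_cuda:     ASSUMPTION: CUDA 12.9 is treated as a minimum version.
    """
    if cuda_version is None:
        return False
    # sm_90a is an architecture-specific target: it runs only on
    # compute capability 9.0 devices, not on newer architectures.
    is_hopper = tuple(capability) == (9, 0)
    return is_hopper and parse_version(cuda_version)[:2] >= min_cuda


def check_local_gpu():
    """Query PyTorch if it is installed; return None when it cannot tell."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return False
    return meets_requirements(torch.cuda.get_device_capability(),
                              torch.version.cuda)


if __name__ == "__main__":
    print("requirements met:", check_local_gpu())
```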
