Hybrid Attention Slashes KVCache Transfer Bandwidth 13x for Cross-Datacenter Feasibility
Traditional dense-attention LLMs like MiniMax-M2.5 generate massive KVCache during prefill (59.93 Gbps for a 32K-token context on 8x H200 GPUs), requiring RDMA networks that confine prefill and decode to a single datacenter. Hybrid-attention models cut this sharply: MiMo-V2-Flash emits 4.66 Gbps (a 13x reduction), Qwen3.5-397B emits 8.25 Gbps versus 33.35 Gbps for its dense counterpart (a 4x reduction), and Ring-2.5-1T, a 1T-parameter model, emits just 3.19 Gbps for the same 32K-token prefill thanks to 36x memory savings (MLA's 4.5x KV compression combined with the 8x from a 7:1 hybrid layer ratio). These rates fit commodity Ethernet (e.g., 100 Gbps inter-datacenter links), enabling prefill offload to compute-dense remote clusters while decode stays local on memory-bound hardware; exploiting this, however, requires handling bursty workloads, uneven prefix caches, and bandwidth fluctuations beyond naive routing.
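The reduction factors above follow directly from the per-model bandwidth figures; a quick back-of-the-envelope check (all numbers from the text, variable names illustrative):

```python
# Sanity-check the KVCache bandwidth reductions cited above.
# All rates are prefill KVCache output for a 32K-token context.

dense_m25_gbps = 59.93    # MiniMax-M2.5, dense attention, 8x H200
mimo_flash_gbps = 4.66    # MiMo-V2-Flash, hybrid attention
print(f"MiMo-V2-Flash reduction: {dense_m25_gbps / mimo_flash_gbps:.1f}x")  # 12.9x ("13x")

qwen_dense_gbps = 33.35   # Qwen3.5-397B dense-attention counterpart
qwen_hybrid_gbps = 8.25   # Qwen3.5-397B hybrid
print(f"Qwen3.5-397B reduction: {qwen_dense_gbps / qwen_hybrid_gbps:.1f}x")  # 4.0x

# Ring-2.5-1T: MLA compresses KV by 4.5x, and a 7:1 hybrid ratio keeps
# full attention in only 1 of every 8 layers, an 8x factor.
mla_compression = 4.5
hybrid_layer_factor = 8   # 7 linear layers per 1 full-attention layer
print(f"Ring-2.5-1T memory savings: {mla_compression * hybrid_layer_factor:.0f}x")  # 36x
```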
Length-Threshold Routing and Dual-Timescale Scheduling Optimize Resource Use
PrfaaS routes requests by incremental prefill length l, i.e., what remains after the prefix-cache hit: if l > t (optimal t = 19.4K tokens, which routes 50% of requests), the request goes to the remote PrfaaS cluster (e.g., 32 H200 GPUs); otherwise it is handled end-to-end locally on the PD cluster (64 H20 GPUs). KVCache transfers use layer-wise pipelining (overlapping generation with transmission), multi-connection TCP (to maximize bandwidth), and congestion monitoring (to detect loss early). Storage separates fixed-size linear-attention states (exact-match lookup) from growing full-attention blocks (partial prefix matching) in a unified pool. Short-timescale scheduling adjusts routing based on PrfaaS egress utilization and queue depth, prefers local caches when bandwidth is scarce and the best global cache (with cross-cluster transfer) when it is abundant, and dynamically rebalances local prefill/decode nodes. This keeps the system compute-bound at 13 Gbps aggregate egress (13% of 100 Gbps capacity), even projected to 10K-GPU scale (1.8 Tbps total).
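The core routing rule can be sketched in a few lines. This is a minimal illustration, not the actual PrfaaS API: the function name, the 90% egress cap, and the congestion fallback are assumptions layered on the length threshold described above.

```python
# Sketch of PrfaaS length-threshold routing (illustrative names, not real API).
# Incremental prefill length = prompt length minus the longest prefix-cache hit.

THRESHOLD_TOKENS = 19_400  # optimal t from the case study (~19.4K tokens)

def route_request(prompt_len: int, prefix_cache_hit_len: int,
                  egress_utilization: float, egress_cap: float = 0.9) -> str:
    """Decide where to prefill: remote PrfaaS cluster or local PD cluster.

    egress_utilization: current PrfaaS egress link utilization in [0, 1].
    The short-timescale scheduler falls back to local prefill when the
    egress link is saturated (cap of 90% assumed here for illustration).
    """
    incremental_len = prompt_len - prefix_cache_hit_len
    if incremental_len > THRESHOLD_TOKENS and egress_utilization < egress_cap:
        return "remote_prfaas"  # long prefill: offload to compute-dense cluster
    return "local_pd"           # short prefill (or congested egress): stay local

# A 32K prompt with an 8K cached prefix leaves ~24K incremental tokens.
print(route_request(32_768, 8_192, egress_utilization=0.13))  # remote_prfaas
print(route_request(32_768, 8_192, egress_utilization=0.95))  # local_pd (congested)
print(route_request(12_000, 4_000, egress_utilization=0.13))  # local_pd (short)
```

Keeping the decision a pure function of request length and link state is what lets the short-timescale scheduler re-tune it per interval without coordinating with the long-timescale capacity planner.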
Delivers 1.54x Throughput and 64% Faster P90 TTFT Over Baselines
In a 1T hybrid-model case study, PrfaaS-PD achieves 54% higher serving throughput than the homogeneous H20 baseline and 32% more than a naive heterogeneous setup (all prefill on H200, decode on H20, no scheduling), with a 15% gain at equal hardware cost from the H200-prefill + H20-decode pairing alone. Scheduling contributes the rest, lifting the naive setup's 1.16x to the full system's 1.54x (a 33% uplift). Mean TTFT drops 50% and P90 TTFT drops 64% versus the homogeneous baseline. PrfaaS works today for hybrid models; larger contexts, KV compression, and specialized hardware (e.g., Rubin CPX for prefill, LPU for decode) will further amplify the benefits of cross-datacenter disaggregation.
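The throughput decomposition above can be cross-checked from the normalized figures in the text (a sanity-check sketch; the normalization to 1.00 for the baseline is an assumption):

```python
# Decompose the 1.54x end-to-end gain into hardware pairing vs. scheduling.
homogeneous = 1.00   # H20-only baseline throughput (normalized)
naive_hetero = 1.16  # H200 prefill + H20 decode, no scheduling
full_prfaas = 1.54   # full PrfaaS-PD with dual-timescale scheduling

print(f"Over homogeneous baseline: {(full_prfaas / homogeneous - 1) * 100:.0f}%")  # 54%
# ~33% over naive; the text's 32% presumably reflects unrounded throughputs.
print(f"Over naive heterogeneous: {(full_prfaas / naive_hetero - 1) * 100:.0f}%")  # 33%
print(f"Scheduling uplift: {full_prfaas / naive_hetero:.2f}x")                     # 1.33x
```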