Hybrid Attention Unlocks Cross-Datacenter KVCache Transfer
Traditional dense-attention LLMs like MiniMax-M2.5 produce KVCache so fast that shipping it out requires 59.93 Gbps for a 32K-token prompt on 8x H200 GPUs, confining prefill and decode to a single RDMA-connected datacenter. Hybrid models cut this to 3-8 Gbps, within reach of commodity Ethernet: MiMo-V2-Flash needs 4.66 Gbps (a 13x reduction), Qwen3.5-397B needs 8.25 Gbps versus 33.35 Gbps for a comparable dense model (4x), and Ring-2.5-1T's MLA plus 7:1 hybrid ratio yields a 36x KV-memory saving (an internal 1T model measures 3.19 Gbps at 32K tokens). Prefill stays compute-intensive, but only the full-attention layers emit length-growing KV; the linear-attention layers carry fixed-size recurrent states. That makes inter-datacenter handoff feasible without stalling decode's memory-bound phase.
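The bandwidth gap follows directly from how many layers keep a length-growing KV cache. A back-of-envelope sketch (with illustrative layer counts and prefill latency, not the cited models' actual hyperparameters):

```python
# Back-of-envelope KVCache egress estimate. Configs below are illustrative,
# not the real hyperparameters of the models named in the text.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K+V cache one token adds across all full-attention layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V

def egress_gbps(tokens, layers, kv_heads, head_dim, prefill_s):
    """Link rate needed to ship the prompt's KVCache within prefill time."""
    total_bits = kv_bytes_per_token(layers, kv_heads, head_dim) * tokens * 8
    return total_bits / prefill_s / 1e9

# Dense model: every layer keeps a length-growing KV cache.
dense = egress_gbps(tokens=32_768, layers=64, kv_heads=8, head_dim=128,
                    prefill_s=4.0)
# Hybrid 7:1 ratio: only 1 in 8 layers is full attention; the linear layers
# carry fixed-size recurrent states that do not grow with context length.
hybrid = egress_gbps(tokens=32_768, layers=8, kv_heads=8, head_dim=128,
                     prefill_s=4.0)
print(f"dense ~{dense:.1f} Gbps, hybrid ~{hybrid:.1f} Gbps "
      f"({dense / hybrid:.0f}x reduction)")
```

With these assumed shapes, the dense model needs roughly 17 Gbps of egress while the hybrid variant needs about 2 Gbps; the reduction equals the fraction of full-attention layers, which is why 7:1 and steeper hybrid ratios land in commodity-Ethernet territory.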
Prefill-decode disaggregation matches hardware to phase (H200s for prefill throughput, H20s or LPUs for decode bandwidth), but naive setups congest under bursty workloads with uneven prefix caches. PrfaaS fixes this with threshold routing: if a request's incremental length l exceeds a threshold t (optimal t = 19.4K tokens, which routes about 50% of long requests), it is sent to the remote PrfaaS cluster; otherwise it is handled locally. Layer-wise pipelining overlaps KV generation with multi-connection TCP transmission, and congestion monitoring backs routing off when queues build or packets drop.
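The routing rule itself is a small decision function. A minimal sketch, assuming hypothetical queue-depth and loss-rate signals for the congestion backoff (the text names the signals but not the cutoffs):

```python
# Sketch of PrfaaS threshold routing with congestion backoff. The 19.4K-token
# threshold comes from the text; MAX_QUEUE and MAX_LOSS are assumed values.

from dataclasses import dataclass

@dataclass
class LinkStats:
    queue_depth: int   # requests queued at the remote PrfaaS cluster
    loss_rate: float   # observed packet loss on the cross-DC TCP path

THRESHOLD_TOKENS = 19_400   # optimal t from the text
MAX_QUEUE = 32              # back off when the remote queue builds up (assumed)
MAX_LOSS = 0.01             # ...or when the path starts dropping packets (assumed)

def route_prefill(incremental_len: int, link: LinkStats) -> str:
    """Return 'remote' to offload prefill to PrfaaS, 'local' otherwise."""
    congested = link.queue_depth > MAX_QUEUE or link.loss_rate > MAX_LOSS
    if incremental_len > THRESHOLD_TOKENS and not congested:
        return "remote"   # long prefill on a healthy link: worth shipping KV
    return "local"        # short prefill, or congested link: keep it local

print(route_prefill(30_000, LinkStats(queue_depth=4, loss_rate=0.0)))   # remote
print(route_prefill(30_000, LinkStats(queue_depth=64, loss_rate=0.0)))  # local
print(route_prefill(8_000, LinkStats(queue_depth=4, loss_rate=0.0)))    # local
```

The congestion check sits in front of the length check so that queue buildup or loss immediately pulls traffic back to the local cluster, matching the backoff behavior described above.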
Dual-Timescale Scheduling Maximizes Utilization
The short-timescale scheduler tracks PrfaaS egress (13 Gbps peak, 13% of a 100 Gbps VPC link) and queue depth, choosing between cache-affine routing (prefer local prefixes when bandwidth is tight) and global best-prefix pulls (cross-cluster transfers when bandwidth is abundant). The long-timescale scheduler rebalances local PD node counts as traffic skews, keeping clusters compute-bound with headroom. Storage splits fixed-size linear states (exact-match, request-level) from KV blocks (partial-match, length-growing) in a unified pool, so prefix hits are served efficiently.
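The short-timescale choice between cache-affine and best-prefix routing can be sketched as a bandwidth-gated lookup. All names and the tightness cutoff below are illustrative assumptions, not values from the text:

```python
# Minimal sketch of the short-timescale routing decision: pull the globally
# best cached prefix when egress bandwidth is abundant, fall back to the
# local node when it is tight. TIGHT_FRACTION and node names are assumed.

EGRESS_CAP_GBPS = 100.0   # VPC link capacity (from the text)
TIGHT_FRACTION = 0.8      # treat >80% utilization as "tight" (assumed)

def pick_node(prefix_hits: dict[str, int], local_node: str,
              egress_gbps: float) -> str:
    """prefix_hits maps node -> cached-prefix length (tokens) for a request."""
    if egress_gbps > TIGHT_FRACTION * EGRESS_CAP_GBPS:
        # Bandwidth tight: cache-affine routing, reuse the local prefix.
        return local_node
    # Bandwidth abundant: pull the globally best prefix, even cross-cluster.
    return max(prefix_hits, key=prefix_hits.get)

hits = {"pd-local": 2_048, "pd-remote-a": 16_384, "pd-remote-b": 512}
print(pick_node(hits, "pd-local", egress_gbps=13.0))   # pd-remote-a (abundant)
print(pick_node(hits, "pd-local", egress_gbps=92.0))   # pd-local (tight)
```

At the reported 13 Gbps peak the link sits well under the assumed tightness cutoff, so cross-cluster best-prefix pulls would dominate; the cache-affine fallback only engages as egress approaches capacity.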
In a 32x H200 PrfaaS + 64x H20 PD deployment, this yields 1.54x throughput over a homogeneous H20 baseline (versus 1.16x for naive heterogeneous disaggregation), 50% lower mean TTFT, and 64% lower P90 TTFT. At equal hardware cost the gain still holds at 15%, and the design scales to 10K-GPU datacenters on 1.8 Tbps aggregate egress, within reach of modern inter-DC links. It is deployable today for hybrid models like Kimi Linear, MiMo-V2-Flash, and Qwen3.5-397B; future Rubin CPX prefill paired with LPU decode amplifies the gains as contexts grow.