Tackling Agentic Inference Bottlenecks

Agentic coding systems such as Claude Code, Codex, and Cursor push inference engines hard: sessions routinely carry contexts of more than 50K tokens across dozens of turns. That traffic stresses two metrics at once, per-GPU tokens per minute (TPM), which governs how many users a deployment can serve, and per-user tokens per second (TPS), which governs how responsive each agent feels (a practical floor of 70 TPS, with 200+ TPS desirable). Public benchmarks rarely capture this dual pressure, so TokenSpeed, an MIT-licensed preview from the LightSeek Foundation, prioritizes both metrics through an architecture specialized for agentic workloads rather than generic chat serving.
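
To make the two metrics concrete, here is a minimal sketch of how per-user TPS and per-GPU TPM relate; the numbers are illustrative, not TokenSpeed code or benchmark results.

```python
def per_user_tps(tokens_generated: int, wall_seconds: float) -> float:
    """Decode speed experienced by a single agent session."""
    return tokens_generated / wall_seconds

def per_gpu_tpm(concurrent_users: int, avg_user_tps: float) -> float:
    """Aggregate tokens per minute one GPU sustains across all sessions."""
    return concurrent_users * avg_user_tps * 60

# A session decoding 4,200 tokens in 60 s runs at the 70 TPS responsiveness floor.
print(per_user_tps(4_200, 60.0))   # 70.0
# Thirty-two concurrent sessions at that rate need ~134K TPM from the GPU.
print(per_gpu_tpm(32, 70.0))       # 134400.0
```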

Architectural Edges for Speed and Safety

TokenSpeed builds on five subsystems:

1. Compiler-backed SPMD modeling. The compiler generates collective operations automatically from input/output sharding annotations, so model authors never write communication code by hand (see the first sketch after this list).
2. A split scheduler. A C++ control plane, a finite-state machine whose types enforce KV cache ownership and transfers so misuse is caught at compile time, is paired with a Python execution plane for fast iteration.
3. A pluggable kernel layer with a registry, supporting heterogeneous accelerators (see the second sketch below). Its MLA kernel, which groups q_seqlen and num_heads to fill Tensor Core tiles and uses a tuned binary prefill softmax, outperforms TensorRT-LLM on both decode and prefill and has been adopted by vLLM.
4. Restrictions that keep KV cache reuse safe.
5. SMG, for low-overhead CPU-GPU handoff.

Together these subsystems cut down on KV cache errors, a common pitfall in agentic serving, and make it practical to support accelerators beyond NVIDIA in a modular way.
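The first subsystem is easiest to picture as annotation-driven sharding. The sketch below is a hypothetical, heavily simplified Python illustration of the idea, not TokenSpeed's compiler interface: the model author declares how a layer's inputs and outputs are sharded, and a rule derives the collective that must be inserted.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Shard:
    axis: Optional[str]  # tensor axis partitioned across GPUs; None = replicated

def infer_collective(in_shard: Shard, out_shard: Shard) -> str:
    """Toy inference rule: a matmul whose input is partitioned along the hidden
    axis produces partial sums, so a replicated output requires an all-reduce."""
    if in_shard.axis == "hidden" and out_shard.axis is None:
        return "all_reduce"
    if in_shard.axis is None and out_shard.axis is not None:
        return "split"  # slice the replicated input along the declared output axis
    return "none"

# Declaring I/O layouts instead of hand-writing communication calls:
print(infer_collective(Shard("hidden"), Shard(None)))  # all_reduce
print(infer_collective(Shard(None), Shard("hidden")))  # split
```

The kernel layer can likewise be pictured as a registry keyed by backend. Again, the names and API below are illustrative assumptions rather than TokenSpeed's actual interface; the point is that a new accelerator plugs in its own kernels without touching the scheduler.

```python
from typing import Callable, Dict

_ATTENTION_KERNELS: Dict[str, Callable[..., object]] = {}

def register_attention(backend: str):
    """Decorator filing a kernel under a backend name (e.g. 'cuda', 'rocm')."""
    def wrap(fn: Callable[..., object]) -> Callable[..., object]:
        _ATTENTION_KERNELS[backend] = fn
        return fn
    return wrap

def dispatch_attention(backend: str, *args, **kwargs):
    """Look up and call the kernel registered for the requested backend."""
    if backend not in _ATTENTION_KERNELS:
        raise NotImplementedError(f"no attention kernel registered for {backend!r}")
    return _ATTENTION_KERNELS[backend](*args, **kwargs)

@register_attention("cuda")
def mla_decode_cuda(q, kv_cache):
    # Placeholder body; a real kernel would launch the fused MLA decode op here.
    return "cuda-mla-decode"

print(dispatch_attention("cuda", None, None))  # cuda-mla-decode
```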

Benchmark Dominance on Real Workloads

On an NVIDIA B200 node running the Kimi K2.5 model against SWE-smith traces (traffic that mirrors production coding-agent workloads), TokenSpeed in an Attention TP4 + MoE TP4 configuration sits above the TensorRT-LLM Pareto frontier: 9% lower latency at batch size 1 in the minimum-latency regime (above 70 TPS per user) and 11% higher throughput at roughly 100 TPS per user. The decode MLA kernel folds the query-sequence dimension into the head axis so the batched matmul fills its tiles better (sketched below), and the binary prefill kernel uses the tuned softmax described above. With speculative decoding and long-prefix KV reuse at batch sizes 4, 8, and 16, latency nearly halves relative to TensorRT-LLM. The preview supports single-node deployment only; prefill-decode (PD) disaggregation is planned.
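
The folding trick behind the decode MLA result can be shown with shapes alone. The sketch below uses hypothetical dimensions (not Kimi K2.5's) and NumPy for clarity: because MLA heads share a latent KV, the tiny per-step query length can be merged with the head axis, giving the attention GEMM a taller M dimension that fills Tensor Core tiles.

```python
import numpy as np

q_len, num_heads, head_dim, kv_len = 4, 16, 128, 8192  # hypothetical decode step

q = np.random.randn(q_len, num_heads, head_dim)
kv_latent = np.random.randn(kv_len, head_dim)  # MLA: latent KV shared by all heads

# Per-head view: num_heads separate (q_len x head_dim) GEMMs, each with M = 4,
# far below a typical Tensor Core tile of M = 64.
# Folded view: one (q_len * num_heads x head_dim) GEMM with M = 64.
q_folded = q.reshape(q_len * num_heads, head_dim)
scores = q_folded @ kv_latent.T  # one well-filled (64 x 8192) GEMM
print(scores.shape)              # (64, 8192)
```

The speculative decoding that drives the batched-latency results follows the usual draft-then-verify pattern. The sketch below is a generic greedy-acceptance illustration, not TokenSpeed's scheduler logic, and it verifies drafted tokens one at a time where a real engine would score them in a single target-model forward pass.

```python
def speculative_step(draft_next, target_next, context, k=4):
    """Draft k tokens with a cheap model, then keep the longest prefix the target
    model agrees with; at the first mismatch, emit the target's own token."""
    drafted, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    accepted, ctx = [], list(context)
    for token in drafted:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # correction from the target model
            break
        accepted.append(token)
        ctx.append(token)
    return accepted

# Toy models that both count upward, so every drafted token is accepted and one
# step emits k tokens instead of one.
draft_next = lambda ctx: ctx[-1] + 1
target_next = lambda ctx: ctx[-1] + 1
print(speculative_step(draft_next, target_next, [0]))  # [1, 2, 3, 4]
```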