Inference's Memory-Bound Reality Trumps Training Throughput
Training optimizes for FLOPS throughput via parallel forward passes, but inference splits into prefill (compute-bound, parallel prompt processing populating KV cache) and decode (memory-bound, sequential token generation reading full weights + growing KV cache per step). "Decode is memory-bandwidth-bound. The speed at which tokens are generated is not determined by how fast the GPU can multiply, but by how fast it can read. This is the single most important fact about LLM inference." KV cache dominates memory—scaling with sequence length, batch size, heads, dims, layers—often exceeding model weights at scale, inverting training assumptions where weights ruled.
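The scaling is easy to estimate directly. A minimal sketch, assuming a 13B-class model with full multi-head attention and an FP16 cache (illustrative shapes, not measured values):

```python
# Back-of-envelope KV cache sizing. Shapes are illustrative (roughly a 13B
# model with full multi-head attention and FP16 cache), not measured values.
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    # 2x for keys and values, stored per layer, per head, per cached token.
    return 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes(1, 1, layers=40, kv_heads=40, head_dim=128)
batch_kv = kv_cache_bytes(32, 4096, layers=40, kv_heads=40, head_dim=128)
print(per_token / 1024, "KiB per token")             # 800 KiB
print(batch_kv / 1e9, "GB at batch 32, 4k context")  # ~107 GB vs ~26 GB of FP16 weights
```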
Low arithmetic intensity in decode stalls GPUs waiting on HBM bandwidth, making peak FLOPS irrelevant. Hardware choices prioritize HBM capacity/bandwidth over compute. All optimizations reduce data movement or accelerate it.
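A roofline-style estimate makes the bound concrete. A minimal sketch, assuming FP16 weights, a single sequence, and H100-class HBM bandwidth (all numbers illustrative):

```python
# Roofline-style ceiling for decode: each step must stream the weights plus the
# live KV cache from HBM. All numbers below are illustrative assumptions.
weight_bytes = 70e9 * 2        # 70B parameters in FP16
kv_bytes = 20e9                # assumed live KV cache read per step
hbm_bandwidth = 3.35e12        # roughly H100 SXM HBM3 peak, bytes/s

bytes_per_step = weight_bytes + kv_bytes
max_tokens_per_s = hbm_bandwidth / bytes_per_step  # batch-1 upper bound
print(f"upper bound ~= {max_tokens_per_s:.0f} tokens/s per sequence")
# ~21 tokens/s regardless of FLOPS; batching helps because the weight reads
# are amortized across all sequences in the batch.
```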
KV Cache Innovations Unlock 2-4x Throughput
Naive allocation reserved max-sequence-length blocks per request, yielding only 20-40% utilization due to fragmentation: internal (over-reservation) and external (scattered frees). PagedAttention (vLLM's core) instead uses fixed-size, non-contiguous pages allocated on demand as a sequence grows, with a new page taken only when the previous one fills and all pages freed instantly on completion, hitting 96% utilization and 2-4x throughput vs. HuggingFace Transformers.
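A minimal sketch of the paging idea, assuming a simple free-list block pool with 16-token blocks (illustrative only, not vLLM's actual data structures):

```python
# Minimal sketch of paged KV allocation: a free list of fixed-size physical
# blocks handed out on demand and returned the moment a request finishes.
class BlockPool:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))         # physical block ids

    def alloc(self):
        if not self.free:
            raise MemoryError("no free KV blocks")  # scheduler would defer admission
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)                    # capacity returns instantly

class Sequence:
    def __init__(self, pool):
        self.pool, self.blocks, self.tokens = pool, [], 0

    def append_token(self):
        # A new block is taken only when the previous one is full, so no
        # request ever reserves space for tokens it has not generated yet.
        if self.tokens % self.pool.block_size == 0:
            self.blocks.append(self.pool.alloc())
        self.tokens += 1

    def finish(self):
        self.pool.release(self.blocks)
        self.blocks = []
```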
RadixAttention (SGLang) adds prefix sharing via a radix tree for multi-turn chat, few-shot, and agent workflows: shared prefixes are computed and stored once, yielding 75-95% cache hits and up to 6.4x throughput on prefix-heavy loads, with LRU eviction.
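A conceptual sketch of the prefix lookup, using a plain per-token trie as a stand-in for SGLang's radix tree (node-level details are assumptions):

```python
# Conceptual prefix lookup: a plain per-token trie standing in for SGLang's
# radix tree. Only the matched prefix length matters for skipping prefill.
class PrefixNode:
    def __init__(self):
        self.children = {}    # token id -> PrefixNode
        self.kv_block = None  # handle to cached KV for this position, if any

def match_prefix(root, tokens):
    """Return how many leading tokens already have cached KV."""
    node, hit = root, 0
    for t in tokens:
        child = node.children.get(t)
        if child is None or child.kv_block is None:
            break
        node, hit = child, hit + 1
    return hit  # prefill only needs to run on tokens[hit:]
```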
Practical limit: only 80-90% of VRAM is safely usable; pushing utilization higher risks crashes from memory exhaustion during CUDA Graph compilation, which needs headroom for metadata and workspace buffers. Rule: budget 80% for weights/KV, reserve 20%.
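In vLLM this budget maps onto the `gpu_memory_utilization` engine argument (`gpu_memory_utilization` and `enforce_eager` are real vLLM arguments; the model name is just an example):

```python
# Applying the 80% rule in vLLM. gpu_memory_utilization and enforce_eager are
# real vLLM engine arguments; the model name here is only an example.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.8,  # leave ~20% headroom for CUDA Graphs, workspace, metadata
    # enforce_eager=True,        # escape hatch: skip CUDA Graph capture if headroom is tight
)
```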
"By eliminating the contiguity constraint, PagedAttention pushed memory utilization from the 20 to 40% range up to over 96% in optimized deployments."
Continuous Batching and Low-Overhead Scheduling Maximize GPU Utilization
Static batching runs a fixed batch until its slowest request finishes, padding short ones and blocking the queue, which spikes tail latency. Continuous batching (vLLM/SGLang) reschedules at every token step: completed requests are evicted and waiting ones admitted as soon as KV slots free up, so the GPU never idles.
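A minimal sketch of that loop; `engine`, `kv_pool`, and their methods are hypothetical stand-ins, not vLLM's or SGLang's actual interfaces:

```python
# Minimal continuous-batching loop with hypothetical engine/pool interfaces.
from collections import deque

def serve(engine, waiting: deque, kv_pool):
    running = []
    while running or waiting:
        # Admit waiting requests as soon as KV blocks are available for them.
        while waiting and kv_pool.can_admit(waiting[0]):
            running.append(waiting.popleft())
        if not running:
            raise MemoryError("smallest waiting request does not fit in the KV pool")
        # One decode iteration: every running request advances by one token.
        finished = engine.step(running)
        for req in finished:          # evict completed requests immediately,
            kv_pool.release(req)      # freeing their KV blocks for the queue
            running.remove(req)
```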
Scheduler overhead grows with generation speed and batch size; vLLM's Python scheduler is flexible but slower. LMDeploy's C++ TurboMind keeps scheduling at microsecond precision and delivers 29% higher throughput than vLLM on H100s by handling batch management, memory, and request routing in compiled code. vLLM wins on ecosystem and flexibility for most workloads; TurboMind for peak high-concurrency throughput.
"The GPU never processes a completed request for even one unnecessary iteration, and new requests begin generation as soon as a slot opens."
Speculative Decoding Accelerates Autoregression 2-6.5x
Each decode step moves tens of gigabytes for a 70B model. Speculative decoding drafts N tokens cheaply and quickly, then verifies them in a single parallel pass of the target model (preserving the output distribution exactly).
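A minimal draft-and-verify sketch, using greedy token matching for brevity (exact distribution preservation relies on the speculative-sampling rejection rule instead); `draft.next_token` and `target.next_tokens` are hypothetical interfaces:

```python
# Draft-then-verify sketch with greedy acceptance for brevity. Exact
# distribution preservation uses the speculative-sampling rejection rule;
# draft.next_token and target.next_tokens are hypothetical interfaces.
def speculative_step(target, draft, ctx, n_draft=4):
    proposed = []
    for _ in range(n_draft):                    # cheap sequential drafting
        proposed.append(draft.next_token(ctx + proposed))

    # One parallel target pass scores every drafted position at once and
    # returns the target's choice after each prefix: length n_draft + 1.
    target_tokens = target.next_tokens(ctx, proposed)

    accepted = []
    for p, t in zip(proposed, target_tokens):
        if p == t:
            accepted.append(p)                  # draft agreed with the target
        else:
            accepted.append(t)                  # take the target's token, stop here
            return accepted
    accepted.append(target_tokens[-1])          # bonus token when all drafts accepted
    return accepted
```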
Traditional approach: a separate small draft model (e.g., 1B drafting for a 70B target), 40-60% acceptance, 2-3x speedup, but extra VRAM, synchronization overhead, and weaker drafts.
EAGLE-3 instead integrates autoregressive draft heads on the target's own hidden states (multi-layer fusion), so drafts see rich embeddings and are far stronger. Verification uses a dynamic tree rather than a single linear sequence. Result: 3-6.5x speedup (5.6x on Vicuna-13B vs. vanilla decoding, 1.8x vs. EAGLE-1) and 20-40% over EAGLE-2, with gains varying by task (high on code/templates, lower on math). "The most effective inference optimizations are not the ones that work around the model. They are the ones that work with the model's own internal structure."
Multi-LoRA Serving's Cache Interference Demands Unified Management
A single base model stays in VRAM while tiny LoRA adapters (hundreds of MB) are swapped in for variants (support/code/summarization). But the KV cache is adapter-specific: evicting an adapter orphans its cache entries as invalid (up to 46.5% of cache in vLLM), bloating TTFT.
FastLibra (ELORA) links adapters and their KV entries in a shared tree-structured pool and evicts them as pairs via a TTFT-impact cost model (retaining hot adapters): 63.4% TTFT reduction and 1.7x peak throughput vs. vLLM.
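A conceptual sketch of a unified adapter-KV eviction decision, inspired by the description above; the cost model and field names are assumptions for illustration, not FastLibra's actual formula:

```python
# Conceptual sketch of a unified adapter-KV eviction decision. The cost model
# and field names are assumptions for illustration, not FastLibra's formula.
def pick_victim(entries, now):
    """entries: list of (adapter_id, kv_bytes, hit_rate, reload_cost_ms, last_use)."""
    def ttft_impact(entry):
        adapter_id, kv_bytes, hit_rate, reload_cost_ms, last_use = entry
        recency = 1.0 / (1.0 + (now - last_use))
        # Expected TTFT penalty of eviction: chance the adapter is reused soon
        # times the cost of reloading it and recomputing its prefix KV.
        return hit_rate * recency * reload_cost_ms
    # Evict the adapter together with its KV entries, choosing the pair whose
    # removal is expected to hurt TTFT the least (hot adapters are retained).
    return min(entries, key=ttft_impact)
```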
"A KV cache entry is only valid for the specific adapter that produced it... Experimental data shows that vLLM can reach an invalid KV cache rate of up to 46.5% in high-churn multi-LoRA workloads."
Prefill-Decode Disaggregation, Quantization, and Engine Landscape
Prefill (parallel, compute-bound) and decode (serial, memory-bound) can be disaggregated onto specialized hardware: H100s for decode bandwidth, A100s for prefill FLOPS. Quantization (INT4/INT8) shrinks weights 4-8x with under 1% quality loss, though structured outputs need careful handling (e.g., logit biasing). Engines: vLLM (PagedAttention, Python flexibility), SGLang (RadixAttention), LMDeploy (TurboMind C++ speed); hardware reality favors HBM-heavy GPUs like the H100/H200.
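A minimal sketch of symmetric per-channel INT4 weight quantization (illustrative only; production paths use calibrated schemes such as AWQ or GPTQ, and pack two 4-bit values per byte):

```python
# Minimal symmetric per-channel INT4 weight quantization. Illustrative only:
# production engines use calibrated schemes (AWQ, GPTQ) and pack two 4-bit
# values per byte instead of storing them in int8.
import numpy as np

def quantize_int4(w):
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range is [-8, 7]
    scale = np.maximum(scale, 1e-8)                       # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int4(w)
print("max abs reconstruction error:", float(np.abs(dequantize(q, s) - w).max()))
```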
Key Takeaways
- Target 80% VRAM utilization max; reserve 20% for CUDA Graphs/host overhead.
- Deploy PagedAttention (vLLM) for 2-4x baseline throughput via dynamic paging.
- Use continuous batching to eliminate static padding/tail latency.
- Integrate EAGLE-3-style heads for 3-6.5x speculative speedups, beyond what separate draft models deliver.
- For multi-LoRA, adopt FastLibra to evict adapter-KV pairs, cutting TTFT 63%.
- Prioritize HBM bandwidth over FLOPS; disaggregate prefill/decode if scaling.
- Benchmark engines: vLLM for broad use, LMDeploy/SGLang for peak H100 perf.
- Quantize aggressively (INT4) post-training, validate structured outputs.
- Reuse prefixes with RadixAttention for up to 6.4x throughput in chats/agents.