Batch Size Math: Why LLM Inference Costs Plummet at Scale
Roofline analysis shows batching 2000+ tokens amortizes weight memory fetches, slashing per-token cost 1000x; fast modes use tiny batches for low latency at 6x price.
Roofline Bounds Reveal Compute vs. Memory Tradeoffs in LLM Decode
Reiner Pope uses roofline analysis to model transformer inference time on a 72-GPU Blackwell NVL72 rack, bounding latency by max(compute time, memory time). Compute time scales linearly with batch size B and active parameters A: t_compute ≥ (B * A * 2) / FLOPs, ignoring minor attention compute. Memory splits into weight fetches (fixed total params N) and KV cache fetches (B * context length C * bytes_per_token): t_memory ≥ max( (N * bytes_per_param / mem_bw) , (B * C * kv_bytes / mem_bw) ).
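A minimal sketch of these bounds in code; every name is mine, and the commented example call uses placeholder hardware numbers, not measured NVL72 or DeepSeek figures:

```python
# Minimal roofline sketch of the bounds above. All names are illustrative.

def decode_step_latency(
    batch_tokens: float,        # B: new tokens per decode step (one per sequence)
    active_params: float,       # A: parameters touched per token (MoE active set)
    total_params: float,        # N: all parameters that must be fetched each step
    context_len: float,         # C: tokens of KV history per sequence
    kv_bytes_per_token: float,  # KV cache bytes per token of context (all layers)
    bytes_per_param: float,     # e.g. 0.5 for FP4, 1 for FP8, 2 for BF16
    flops: float,               # aggregate FLOP/s of the serving unit
    mem_bw: float,              # aggregate HBM bandwidth, bytes/s
) -> float:
    """Lower bound on one decode step: max(compute time, memory time)."""
    t_compute = batch_tokens * active_params * 2 / flops             # 2 FLOPs per active param per token
    t_weights = total_params * bytes_per_param / mem_bw              # read every weight once
    t_kv = batch_tokens * context_len * kv_bytes_per_token / mem_bw  # re-read each sequence's history
    return max(t_compute, max(t_weights, t_kv))

# Placeholder example (assumed numbers, not real hardware specs):
# decode_step_latency(2000, 37e9, 700e9, 100_000, 70_000, 0.5, 1e18, 5e14)
```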
This creates a latency floor from weight fetches: even at infinite batch size, you must load all N params (e.g., ~700B for DeepSeek V3). Pope notes: "There is a lower bound on latency. It is simply that I need to read all of my total parameters from memory into the chips, and that takes a certain amount of time." For balanced runs, the KV slope matches the compute slope when context length hits a "Goldilocks zone," maximizing MFU; doubling C beyond that point lets memory dominate and roughly halves MFU.
Sparse attention (e.g., DeepSeek's sqrt(C) scaling) softens KV growth vs. dense O(C), but labs' adoption is unclear. Hardware's FLOPs/mem_bw ratio (~300 dimensionless, FP4-adjusted) stays stable A100-to-B100, making bounds predictive across gens.
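A one-line rearrangement makes the Goldilocks point explicit: set KV-fetch time equal to compute time and B cancels. Here kv_bytes (KV cache bytes per token of context) is a unit assumption on my part, and r is the ~300 hardware ratio just mentioned:

```latex
% Balance ("Goldilocks") context: KV-fetch time = compute time. B cancels, so
% the balance depends only on A, kv_bytes, and the hardware ratio r.
\frac{B \, C \, \mathrm{kv\_bytes}}{\mathrm{mem\_bw}} \;=\; \frac{2 B A}{\mathrm{FLOPs}}
\quad\Longrightarrow\quad
C^{*} \;=\; \frac{2A}{r \cdot \mathrm{kv\_bytes}},
\qquad r \;=\; \frac{\mathrm{FLOPs}}{\mathrm{mem\_bw}} \approx 300 .
```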
Batch Size Amortizes Fixed Costs, Explaining Fast Mode Pricing
Cost per token = latency / B, which transforms the curves: t_compute/B and KV/B are constant in B, while the weight term falls as 1/B (a hyperbola, highest at B=1). The max of these yields a cost curve that plunges from sky-high (unbatched, weights unamortized) down to the compute floor. Without batching, "the cost and the economics you get can be a thousand times worse than if you do batch many users together—we’ll be able to see that quite explicitly."
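The same curve as a sketch, dividing each roofline term by B; parameter names match the earlier sketch and the numbers in the loop are illustrative placeholders, not the lecture's figures:

```python
# Per-token cost sketch: only the weight term amortizes with B; compute and
# KV traffic scale with B and stay flat per token.

def per_token_time(batch_tokens, active_params, total_params, context_len,
                   kv_bytes_per_token, bytes_per_param, flops, mem_bw):
    t_compute_per_tok = 2 * active_params / flops                    # constant in B
    t_kv_per_tok = context_len * kv_bytes_per_token / mem_bw         # constant in B
    t_weights_per_tok = total_params * bytes_per_param / (mem_bw * batch_tokens)  # falls as 1/B
    return max(t_compute_per_tok, t_kv_per_tok, t_weights_per_tok)

# The curve falls roughly as 1/B until the weight term dips below the
# compute/KV floor, then flattens: batching further buys nothing.
for B in (1, 10, 100, 1000, 4000):
    print(B, per_token_time(B, 37e9, 700e9, 100_000, 70_000, 0.5, 1e18, 5e14))
```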
"Fast Mode" (e.g., Claude/Cursor 6x price for 2.5x speed) runs tiny B=1-10: high latency floor but low per-user wait (20ms train departs regardless). Users pay premium to skip batch queue. Reverse "Slow Mode" can't beat compute floor—KV/compute both scale with B, no extra amortization. Practical batches hit weight=compute balance: B ≥ 300 * (A/N), where A/N=sparsity (DeepSeek MoE: 37B active/700B total ≈1/19 dense equiv, but 32/256 experts=1/8 → B~2400). Real ops double/triple for inefficiencies, yielding ~2000-6000 tokens/batch (2000 sequences x1 new token).
With more than ~2,000 concurrent users (the frontier norm), 20ms cycles fill easily, so there is no queue lag. Pope solves the balance explicitly: set weight-fetch time equal to compute time, giving FLOPs/mem_bw = B * (A/N), with the hardware ratio ≈ 300.
Quote: "Batch size needs to be bigger than approximately 300 times sparsity. ... Generally, people will go a little bit larger than this. They don’t really want to be exactly at the balance point because real-world efficiencies aren’t as good as a roofline analysis would say. But take this and maybe double or triple it." (Reiner Pope, deriving optimal B; reveals why labs target 2-6k despite theory.)
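A small sketch of that rule of thumb; the ~300 ratio and the 2-3x slack are the figures quoted above, and the function name is mine:

```python
# Weight-fetch time = compute time gives FLOPs/mem_bw = B * (A/N),
# so B* = ratio * (N/A). The ~300 ratio is the FP4-adjusted figure quoted
# above; the slack factor is Pope's "double or triple it".

def min_balanced_batch(sparsity: float, hw_ratio: float = 300, slack: float = 1.0) -> float:
    """Batch (tokens per step) where weight fetches stop dominating; sparsity = N/A."""
    return hw_ratio * sparsity * slack

# DeepSeek-style MoE routing 32 of 256 experts -> sparsity ~ 8:
print(min_balanced_batch(256 / 32))              # ~2400 at the exact balance point
print(min_balanced_batch(256 / 32, slack=2.5))   # ~6000 with real-world slack
```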
KV Cache Dominance Grows with Context, Forcing Hardware Choices
Autoregressive decode: each new token attends to the full history via the KV cache (stored keys and values, roughly C * layers * 2 * embed_dim elements per sequence for dense multi-head attention). A single forward pass generates one token per sequence while fetching B * C * kv_bytes from memory. Long C pushes the KV slope above the compute slope, raising per-step latency and letting memory dominate.
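A rough sizing sketch, assuming plain multi-head attention with separate K and V per head per layer; the example configuration is illustrative rather than a specific model's published spec, and MLA/GQA-style compression shrinks these numbers dramatically:

```python
# KV sizing under plain multi-head attention: one K and one V vector of size
# head_dim per head per layer, per token of context.

def kv_bytes_per_token(n_layers: int, n_heads: int, head_dim: int, bytes_per_elem: float = 2) -> float:
    return n_layers * n_heads * head_dim * 2 * bytes_per_elem   # 2 = K and V

def kv_traffic_per_step(batch_sequences: int, context_len: int, kv_tok_bytes: float) -> float:
    # Each decode step re-reads every sequence's full history: B * C * kv_bytes.
    return batch_sequences * context_len * kv_tok_bytes

tok = kv_bytes_per_token(n_layers=61, n_heads=128, head_dim=128)   # ~4 MB per context token
step = kv_traffic_per_step(2000, 100_000, tok)                     # ~800 TB read per step
print(tok / 1e6, "MB/token;", step / 1e12, "TB/step")              # why compressed KV matters
```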
Pope deduces labs' context lengths from API prices (later in the lecture): with dense attention, longer C raises KV cost linearly, letting him infer effective C in the ~128k-1M range. RLHF and post-training overtrain roughly 100x past Chinchilla-optimal compute, bloating N. MoE spreads experts across racks (later: 256 experts on 72 GPUs is inefficient), and pipeline parallelism stacks layers across racks, where bubbles waste 50%+ of the time ("As we now know, pipelining is not wise," per Ilya).
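One way to make the bubble claim concrete is the standard GPipe-style bubble estimate; applying it to decode is my rough analogy, not necessarily Pope's exact argument:

```python
# Standard GPipe-style bubble estimate: with P pipeline stages and M
# micro-batches in flight, stages sit idle for (P - 1) / (M + P - 1) of the time.

def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(stages=8, microbatches=4))    # ~0.64: over half the time wasted
print(bubble_fraction(stages=8, microbatches=64))   # ~0.10: deep pipelines need many batches in flight
```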
The tradeoffs are stark: big B minimizes $/token but queues users ("the train departs every 20ms"); small B flips that to low latency at high cost. Speculative decoding and multi-token prediction accelerate decoding beyond what batching offers (a future deep-dive).
Quote: "For the particular context length where the slopes match, that says I am equally memory-bound and compute-bound, which is a really desirable place to be." (Reiner Pope, on KV-compute balance; quantifies why 100k-200k C optimizes MFU, sparse mitigates.)
Deductive Power: Public Pricing Exposes Lab Secrets
Equations plus public APIs reverse-engineer the stack: DeepSeek V3's sparsity from its parameter counts, C from how $/token ramps with length, MoE and pipeline layouts from cluster scale. The roofline ignores nuances (attention compute, communication) yet predicts tightly: "these equations here are enough for us to now draw some fit lines."
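A sketch of the final conversion in that deduction, turning roofline time into a serving cost to compare against API prices; the rack rental rate and utilization factor are placeholders I've introduced, not lecture figures:

```python
# Convert per-token roofline time into $/million tokens for comparison with
# public API pricing. Rack rate and utilization are assumed placeholders.

def dollars_per_million_tokens(per_token_seconds: float,
                               rack_dollars_per_hour: float,
                               utilization: float = 0.5) -> float:
    dollars_per_second = rack_dollars_per_hour / 3600
    return per_token_seconds * dollars_per_second / utilization * 1e6

# e.g. feed in per_token_time(...) from the earlier sketch with a guessed rack rate:
# dollars_per_million_tokens(
#     per_token_time(4000, 37e9, 700e9, 100_000, 70_000, 0.5, 1e18, 5e14),
#     rack_dollars_per_hour=300)
```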
Evolution: neural-net hardware converged much as crypto-mining hardware did (both reduce to massively parallel matrix-style ops, akin to hashing). Training follows the same analysis, with pretraining batches far larger than inference batches thanks to data parallelism, while RL spikes compute roughly 100x.
Quote: "It’s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk." (Dwarkesh intro; frames lecture's insight—simple math unveils frontier black boxes.)
Key Takeaways
- Target B ≥ 300 * (N/A) (i.e., 300 * sparsity), then double or triple it for balance; ~2-6k tokens is the frontier norm, amortizing weight fetches ~1000x vs. B=1.
- Latency ≥ N * bytes_per_param / mem_bw (weight-fetch floor); no sub-10ms steps without faster memory (MatX's angle?).
- Cost/token → compute floor at high B; fast modes pay 6x for B<<optimal.
- KV linear in C (dense): double C doubles optimal B, halves MFU outside balance.
- Sparse attention sqrt(C) or better scales long-context; watch DeepSeek papers.
- Pipeline parallelism bubbles kill util (50%+ waste); prefer data-parallel + expert-parallel for MoE.
- Deduce lab configs: API $/token vs. context length reveals C; parameter counts reveal sparsity.
- Hardware stable ~300 FLOPs/mem_bw; B insensitive to gens.
- Overtraining (RL) balloons N 100x Chinchilla, hiking floors.
- Run roofline first: max(t_compute, t_mem) predicts before coding.