Roofline Bounds Reveal Compute vs. Memory Tradeoffs in LLM Decode

Reiner Pope uses roofline analysis to model transformer inference time on a 72-GPU Blackwell NVL72 rack, bounding per-step latency by max(compute time, memory time). Compute time scales linearly with batch size B and active parameters A: t_compute ≥ (2 * B * A) / FLOPs, ignoring the comparatively minor attention compute. Memory time splits into weight fetches (all N total parameters, every step) and KV-cache fetches (B * context length C * kv_bytes per token): t_memory ≥ max(N * bytes_per_param, B * C * kv_bytes) / mem_bw.
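A minimal sketch of these bounds in Python, assuming nothing beyond the formulas above; the function and parameter names (decode_step_bounds, kv_bytes_per_token, the defaults) are illustrative placeholders, not from the lecture.

```python
# Roofline lower bounds for one decode step, per the formulas above.
def decode_step_bounds(batch, active_params, total_params, context,
                       flops, mem_bw, bytes_per_param=1.0, kv_bytes_per_token=100e3):
    """Return (latency lower bound in seconds, per-term breakdown)."""
    t_compute = 2 * batch * active_params / flops            # 2 FLOPs per active param per token
    t_weights = total_params * bytes_per_param / mem_bw      # every weight read once per step
    t_kv = batch * context * kv_bytes_per_token / mem_bw     # each sequence's KV cache read
    t_memory = max(t_weights, t_kv)
    return max(t_compute, t_memory), {"compute": t_compute, "weights": t_weights, "kv": t_kv}
```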

Weight fetches create a latency floor: even at infinite batch, every step must load all N total parameters (e.g., ~700B for DeepSeek V3). Pope notes: "There is a lower bound on latency. It is simply that I need to read all of my total parameters from memory into the chips, and that takes a certain amount of time." For balanced runs, the KV slope matches the compute slope when context length hits a "Goldilocks zone" that maximizes MFU; doubling C beyond it roughly halves MFU as memory dominates.
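As a rough instance of that floor: assuming ~700 GB of FP8 weights and an aggregate HBM bandwidth of roughly 72 x 8 TB/s for the rack (illustrative figures, not quoted in the lecture), the back-of-envelope lands near a millisecond per step.

```python
# Weight-fetch latency floor with assumed (not quoted) hardware figures.
total_params = 700e9           # ~DeepSeek V3 scale
bytes_per_param = 1.0          # FP8 weights
agg_mem_bw = 72 * 8e12         # bytes/s: assumed aggregate HBM bandwidth of the rack
floor_s = total_params * bytes_per_param / agg_mem_bw
print(f"weight-fetch floor ≈ {floor_s * 1e3:.2f} ms per decode step")   # ≈ 1.2 ms
```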

Sparse attention (e.g., DeepSeek's sqrt(C) scaling) softens KV growth vs. dense O(C), but labs' adoption is unclear. Hardware's FLOPs/mem_bw ratio (~300 dimensionless, FP4-adjusted) stays stable A100-to-B100, making bounds predictive across gens.

Batch Size Amortizes Fixed Costs, Explaining Fast Mode Pricing

Cost per token scales as step latency / B, which transforms the curves: t_compute/B and KV/B are constant in B, while weights/B falls as 1/B (largest at B=1). The max of these gives a cost curve that plunges from sky-high at B=1 (weights unamortized) down to the compute floor. Without batching, "the cost and the economics you get can be a thousand times worse than if you do batch many users together—we’ll be able to see that quite explicitly."
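A sketch of that amortization with placeholder numbers, chosen only so the FLOPs/mem_bw ratio is ≈300 (the figure used elsewhere in the piece) and with a made-up per-token KV footprint; the point is the shape of the curve, not the absolute values.

```python
# Per-token time (a proxy for cost) vs. batch size: only the weight term amortizes.
FLOPS, MEM_BW = 1.7e17, 5.76e14      # FLOP/s, bytes/s (placeholders; ratio ≈ 300)
A, N = 37e9, 700e9                   # active / total parameters
C, KV_TOK = 100_000, 10e3            # context length, KV bytes per token (assumed)

per_tok_compute = 2 * A / FLOPS      # constant in B
per_tok_kv = C * KV_TOK / MEM_BW     # constant in B
for B in (1, 10, 100, 1000, 4000):
    per_tok_weights = N / (B * MEM_BW)   # falls as 1/B: weight fetch amortized over the batch
    per_tok = max(per_tok_compute, per_tok_kv, per_tok_weights)
    print(f"B={B:5d}  per-token bound ≈ {per_tok * 1e6:8.1f} µs")
# B=1 comes out roughly 700x above the large-batch floor here, the same order of
# magnitude as the "thousand times worse" figure in the quote above.
```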

"Fast Mode" (e.g., Claude/Cursor 6x price for 2.5x speed) runs tiny B=1-10: high latency floor but low per-user wait (20ms train departs regardless). Users pay premium to skip batch queue. Reverse "Slow Mode" can't beat compute floor—KV/compute both scale with B, no extra amortization. Practical batches hit weight=compute balance: B ≥ 300 * (A/N), where A/N=sparsity (DeepSeek MoE: 37B active/700B total ≈1/19 dense equiv, but 32/256 experts=1/8 → B~2400). Real ops double/triple for inefficiencies, yielding ~2000-6000 tokens/batch (2000 sequences x1 new token).

With >2000 concurrent users (the frontier norm), 20 ms cycles fill easily, so there is no queue lag. Pope solves the balance point explicitly: setting weight-fetch time equal to compute time gives FLOPs/mem_bw ≈ B * (A/N), so with a hardware ratio of ~300, B ≈ 300 * (N/A).
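A back-of-envelope solve of that balance point, using the ~300 hardware ratio and the expert-count sparsity factor of 8 from above (byte-width constants dropped, as in the lecture's simplification):

```python
# Balance point: weight-fetch time == compute time
#   N * bytes / mem_bw = 2 * B * A / FLOPs  =>  B ≈ (FLOPs / mem_bw) * (N / A)
hw_ratio = 300     # FLOPs per byte of memory bandwidth (FP4-adjusted, per the lecture)
sparsity = 8       # sparsity factor N/A (the expert-count estimate used above)
b_balance = hw_ratio * sparsity
print(b_balance)   # 2400; then double or triple it for real-world inefficiencies
```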

Quote: "Batch size needs to be bigger than approximately 300 times sparsity. ... Generally, people will go a little bit larger than this. They don’t really want to be exactly at the balance point because real-world efficiencies aren’t as good as a roofline analysis would say. But take this and maybe double or triple it." (Reiner Pope, deriving optimal B; reveals why labs target 2-6k despite theory.)

KV Cache Dominance Grows with Context, Forcing Hardware Choices

Autoregressive decode: each new token attends over the full history via the KV cache (stored keys and values, roughly C * layers * kv_heads * head_dim * 2 values per sequence for K and V). A single forward pass generates one token per sequence while fetching B * C * kv_bytes from memory. Long C pushes the KV term above compute, inflating memory time and the latency floor.
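A sketch of that per-token KV footprint and the per-step traffic it implies under plain multi-head/GQA attention; the layer, head, and precision values below are placeholders, not any specific model's config.

```python
# KV cache bytes per token: every layer stores K and V for each KV head.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem   # x2 for K and V

kv_tok = kv_bytes_per_token(n_layers=60, n_kv_heads=8, head_dim=128)   # placeholder dims
B, C = 2000, 100_000
step_bytes = B * C * kv_tok        # KV traffic for a single decode step
print(f"{kv_tok / 1024:.0f} KiB per token, {step_bytes / 1e12:.1f} TB per step")
# At long C and large B this dwarfs the one-time weight fetch, which is why
# GQA/MLA and sparse attention matter so much for long-context serving.
```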

Pope deduces labs' context lengths from API prices (at a later timestamp): with dense attention, longer C raises KV cost linearly, implying effective C of roughly 128k-1M. RLHF/post-training pushes training compute ~100x past Chinchilla-optimal, and ever-larger total parameter counts raise the weight-fetch floor. MoE spreads experts across racks (later: 256 experts over 72 GPUs is inefficient), and pipeline parallelism stacks racks across layers, where bubbles waste 50%+ of the time ("As we now know, pipelining is not wise," per Ilya).
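A toy version of the price-based deduction: if dense attention makes per-token cost roughly linear in context, two price points pin down the KV-driven slope. The prices below are invented purely for illustration.

```python
# Hypothetical two-point fit: price(C) ≈ base + slope * C under dense attention.
prices = {8_000: 1.0, 128_000: 3.4}    # {context length: $ per Mtok} -- made-up numbers
(c1, p1), (c2, p2) = sorted(prices.items())
slope = (p2 - p1) / (c2 - c1)          # $/Mtok per context token: the KV-driven share
base = p1 - slope * c1                 # the context-independent weight + compute share
print(f"base ≈ ${base:.2f}/Mtok, KV slope ≈ ${slope * 1e5:.2f}/Mtok per 100k context")
```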

Tradeoffs stark: big B minimizes $/token but queues users ("train departs every 20ms"); small B flips to low latency, high cost. Speculative decoding/multi-token prediction accelerates beyond batching (future deep-dive).

Quote: "For the particular context length where the slopes match, that says I am equally memory-bound and compute-bound, which is a really desirable place to be." (Reiner Pope, on KV-compute balance; quantifies why 100k-200k C optimizes MFU, sparse mitigates.)

Deductive Power: Public Pricing Exposes Lab Secrets

Equations + APIs reverse-engineer stacks: DeepSeek V3 sparsity from params, C from $/token ramps, MoE/pipeline from cluster scales. Roofline ignores nuances (attention compute, comms) yet predicts tightly—"these equations here are enough for us to now draw some fit lines."

Hardware evolution: AI datacenters converge with crypto mining (both reward massively parallel, simple arithmetic, matrix multiplies instead of hashes). The same roofline logic applies to training, where pretraining batches dwarf inference batches thanks to data parallelism, though RL spikes compute demand ~100x.

Quote: "It’s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk." (Dwarkesh intro; frames lecture's insight—simple math unveils frontier black boxes.)

Key Takeaways

  • Target B ≥ 300 * (N/A) (the sparsity factor), then double or triple it; ~2-6k tokens is the frontier norm, amortizing weights ~1000x vs. B=1.
  • Latency ≥ N * bytes / mem_bw (weight floor); no sub-10ms without faster mem (MatX's angle?).
  • Cost/token → compute floor at high B; fast modes pay 6x for B<<optimal.
  • KV traffic linear in C (dense): doubling C doubles the KV term; outside the balance point memory dominates and MFU roughly halves.
  • Sparse attention sqrt(C) or better scales long-context; watch DeepSeek papers.
  • Pipeline parallelism bubbles kill util (50%+ waste); prefer data-parallel + expert-parallel for MoE.
  • Deduce lab configs: API $/token vs. context length reveals C; parameter counts reveal sparsity.
  • Hardware stable ~300 FLOPs/mem_bw; B insensitive to gens.
  • Labs train ~100x past Chinchilla-optimal compute (plus RL); growing total N keeps hiking the serving floors.
  • Run roofline first: max(t_compute, t_mem) predicts before coding.
