Throughput Design Hides Latency with Massive Parallelism
GPUs prioritize throughput over single-thread latency by spending transistors on thousands of execution units and a large register file rather than branch predictors or deep cache hierarchies. A single GPU thread is slower than a CPU core (on the order of 1 ns per instruction), but 20,000+ threads run concurrently. An off-chip HBM access costs 700+ cycles on H100, so GPUs hide that latency by keeping enough independent warps resident and switching to another whenever one stalls. This requires high occupancy, the ratio of resident warps to the hardware maximum (64 warps per H100 SM). Low occupancy from heavy register use (e.g., 128 registers/thread caps an SM at 512 threads, or 16 warps, for 25% occupancy) starves the warp scheduler and collapses throughput, however fast the Tensor Cores are on paper.
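A minimal sketch of how to check this in practice: the CUDA runtime can report how many blocks of a given kernel fit per SM once its register and shared-memory footprint is known, from which occupancy follows. The kernel here is a hypothetical placeholder; the 64-warp ceiling is the H100 figure cited above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for a register-heavy workload.
__global__ void heavy_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder body
}

int main() {
    int numBlocks = 0;
    const int threadsPerBlock = 256;
    // Ask the runtime how many blocks of this kernel fit on one SM given its
    // actual register/shared-memory usage, then derive warp occupancy.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, heavy_kernel,
                                                  threadsPerBlock, 0);
    int activeWarps = numBlocks * threadsPerBlock / 32;
    printf("occupancy: %d/64 warps = %.0f%%\n",
           activeWarps, 100.0 * activeWarps / 64);
    return 0;
}
```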
Threads group into 32-thread warps, the scheduling unit under SIMT: hardware issues one instruction across the warp while tracking per-thread program counters and registers, so each thread appears independent. Pre-Volta lockstep execution could deadlock on intra-warp synchronization; Volta+ Independent Thread Scheduling (ITS) dynamically regroups converging threads, enabling intra-warp mutexes without deadlock (though divergence still serializes paths, roughly doubling time on a 50/50 if/else). H100 SMs (132 total) divide into 4 quadrants, each with its own warp scheduler, 16K 32-bit registers, 32 FP32 and 16 INT32 cores, one Tensor Core, and an L0 instruction cache. A block (CTA) runs entirely on one SM so its threads can synchronize through shared memory; Hopper thread block clusters co-schedule blocks on SMs within a GPC, exposing each other's shared memory as DSMEM (roughly 7x faster than global memory).
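A minimal sketch of the divergence cost mentioned above: branching on lane parity splits the warp 50/50, and the hardware executes the two paths one after the other, roughly doubling the time spent in the branch. The kernel and its arithmetic are illustrative only.

```cuda
// 50/50 intra-warp divergence: even and odd lanes take different paths,
// which the warp executes sequentially rather than in parallel.
__global__ void divergent_branch(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x & 1) == 0) {
        out[i] = in[i] * 2.0f;          // even lanes run this path first...
    } else {
        out[i] = in[i] * in[i] + 1.0f;  // ...then odd lanes run this one
    }
}
```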
Warp divergence hurts irregular workloads (e.g., per-element padding branches); one fix is warp specialization: FlashAttention-3 assigns producer warps to loads and consumer warps to math, eliminating divergence while overlapping memory and compute. Little's Law quantifies the latency-hiding requirement: in-flight work = throughput × latency. Sustaining 1 memory instruction per cycle against 400-cycle HBM loads requires 400+ loads in flight across ready warps; with only a quarter of that outstanding, throughput falls to 25% of peak.
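The Little's Law arithmetic, worked through with the illustrative numbers from the paragraph above (host-side only, no kernel):

```cuda
#include <cstdio>

int main() {
    const double latency_cycles   = 400.0;  // assumed HBM load latency
    const double issue_per_cycle  = 1.0;    // one memory instruction issued per cycle
    const double needed_in_flight = issue_per_cycle * latency_cycles;  // 400 outstanding loads
    const double available        = 100.0;  // hypothetical: only 100 loads in flight
    printf("need %.0f loads in flight; with %.0f, utilization = %.0f%%\n",
           needed_in_flight, available, 100.0 * available / needed_in_flight);
    return 0;
}
```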
Six-Tier Memory Hierarchy Sets Bandwidth Bounds
Data tiers trade capacity, bandwidth, and latency: registers (256KB/SM, 65k 32-bit, 1 cycle) > shared/L1 (up to 228KB shared, 30-40 cycles) > L2 (50MB, 258-743 cycles) > HBM3 (80GB, 3.35TB/s, 700+ cycles) > NVLink (900GB/s per GPU, microseconds) > NVMe. Keep the working set close: exceeding 255 registers per thread forces spills to local memory in HBM, crippling inner loops. Shared memory tiles inputs for reuse (a GEMM tile is loaded once and reused many times). L1 coalesces warp loads, so contiguous base+i access patterns far outperform strided ones. L2 absorbs repeated weight reads; working sets over 50MB spill to HBM.
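A minimal sketch of the shared-memory tiling pattern for GEMM (C = A × B, square N×N, row-major; tile size and kernel name are illustrative): each block stages one tile of A and B in shared memory, then every thread reuses it TILE times, cutting HBM traffic by roughly a factor of TILE versus an untiled kernel.

```cuda
#define TILE 32

__global__ void tiled_gemm(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // Coalesced loads: consecutive threads read consecutive addresses (base + i).
        As[threadIdx.y][threadIdx.x] =
            (row < N && t + threadIdx.x < N) ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < N && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();                      // tile fully staged before compute

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // finish with this tile before loading the next
    }
    if (row < N && col < N) C[row * N + col] = acc;
}
```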
LLM decode exemplifies this: a 70B-parameter FP16 model requires reading 140GB of weights per token (~42ms at 3.35TB/s before any compute), at roughly one FLOP per byte. Bandwidth is the binding constraint because arithmetic intensity (FLOPs/byte) is ~1; the roofline model (part 2) shows compute stays underutilized without higher reuse. HBM holds weights, KV cache, and activations; misses from the upper tiers thrash it. NVLink lets large models shard across GPUs (tensor parallelism syncs partial sums every layer), but that frequent communication can bottleneck compared with pipeline parallelism, which only passes activations between stages.
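The decode arithmetic above, as a back-of-envelope calculation using the section's numbers (host-side only; figures are illustrative, not measured):

```cuda
#include <cstdio>

int main() {
    const double params       = 70e9;                      // 70B parameters
    const double bytes_per_w  = 2.0;                       // FP16
    const double bytes_total  = params * bytes_per_w;      // 140 GB read per token
    const double hbm_bw       = 3.35e12;                   // bytes/s (HBM3)
    const double ms_per_token = 1e3 * bytes_total / hbm_bw; // ~42 ms bandwidth floor
    const double flops        = 2.0 * params;              // ~2 FLOPs per parameter per token
    const double intensity    = flops / bytes_total;       // ~1 FLOP/byte
    printf("bandwidth floor: %.1f ms/token, arithmetic intensity: %.1f FLOP/byte\n",
           ms_per_token, intensity);
    return 0;
}
```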