TPUs Dominate at Infrastructure Scale Over Per-Chip GPU Wins

Google's TPU v8t (training) and v8i (inference) trail Nvidia GPUs on a per-chip basis, but pull ahead at scale: 9600-chip superpods built on a cube topology and Virgo networking reach 121 exaFLOPS of FP4 compute, a design tuned for AI's bandwidth-heavy workloads.

Infrastructure Scaling Trumps Per-Chip Performance

Google's TPU v8t for training and v8i for inference trail Nvidia's Rubin and AMD's GPUs in raw per-chip compute and memory. Evaluated at the infrastructure level, however, the TPUs have the edge. Nvidia's NVL72 scales 72 Rubin GPUs per rack, while Google's 4x4x4 cube building block interconnects up to 9600 TPUs into a superpod delivering 121 exaFLOPS in FP4, surpassing Nvidia's 1152-GPU Rubin pod at 60 exaFLOPS. Google's Virgo network then scales out to 134,000 chips, potentially reaching 1 million, while ICI and optical interconnects keep network overhead low. This Lego-like modularity avoids the scaling cliffs Nvidia faces when stacking GPUs, where interconnect overhead erodes per-chip advantages.
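To make the per-chip versus pod-level trade-off concrete, here is a minimal sketch that back-derives the implied per-chip FP4 throughput from the aggregate figures quoted above; the per-chip numbers are derived estimates for illustration, not published specifications.

```python
# Back-of-the-envelope check on the pod-level figures cited above. Pod sizes and
# aggregate FP4 throughput come from the text; per-chip throughput is a derived
# estimate, not a published spec.

EXAFLOPS = 1e18  # FLOP/s per exaFLOPS

pods = {
    # name: (chips per pod, aggregate FP4 exaFLOPS)
    "Google TPU v8 superpod": (9600, 121),
    "Nvidia Rubin pod": (1152, 60),
}

for name, (chips, exa) in pods.items():
    per_chip_pflops = exa * EXAFLOPS / chips / 1e15
    print(f"{name}: {chips} chips, {exa} exaFLOPS aggregate, "
          f"~{per_chip_pflops:.0f} PFLOPS FP4 per chip (implied)")

# Roughly ~13 PFLOPS per TPU vs ~52 PFLOPS per Rubin GPU: each GPU is far
# faster on paper, but the larger scale-up domain gives the superpod about
# twice the aggregate compute.
```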

Nvidia balances scale-out with InfiniBand for diverse customers (neo-clouds like CoreWeave, labs like OpenAI and Meta, hyperscalers like Microsoft and Amazon), prioritizing broad demand profiles. Google, serving internal apps like Gemini and Vertex AI plus external deals (Anthropic's $1B TPU commitment, 40% owned and 60% rented; Meta's multi-billion rental), can optimize purely for its own high-volume needs without risking market fragmentation.

Workload Profiles Dictate Hardware Choices

AI workloads place different demands on hardware: training prioritizes network bandwidth over compute and memory, which favors the TPU's topology. Inference splits further: prefill is compute- and memory-bound, since the KV cache can be built in parallel across the prompt, while decode is bandwidth- and latency-bound as tokens stream out autoregressively (per a SemiAnalysis chart). The v8t/v8i split mirrors this: v8t targets training's network focus, v8i targets inference's more varied needs, and Virgo flattens network bottlenecks, challenging Nvidia's inference dominance.
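A rough roofline argument shows why the two inference phases stress different resources: prefill amortizes each weight read across every prompt token, while decode re-reads the weights for every generated token. The sketch below uses illustrative chip numbers (assumed for this example, not actual v8i or Rubin specifications) to show where the crossover falls.

```python
# Minimal roofline-style sketch of why prefill and decode stress different
# resources. The chip numbers are illustrative placeholders, not real specs.

chip_flops = 2e15    # assumed FLOP/s (illustrative)
chip_hbm_bw = 4e12   # assumed HBM bandwidth in bytes/s (illustrative)
ridge = chip_flops / chip_hbm_bw  # FLOPs per byte needed to stay compute-bound

def arithmetic_intensity(batch_tokens: int, bytes_per_param: float = 1.0) -> float:
    """Rough FLOPs-per-byte for a dense layer: ~2 FLOPs per parameter per token,
    amortized over a single read of the weights (activations ignored)."""
    return 2 * batch_tokens / bytes_per_param

for phase, tokens in [("prefill (whole prompt at once)", 4096),
                      ("decode (one token per step)", 1)]:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai >= ridge else "bandwidth-bound"
    print(f"{phase}: ~{ai:.0f} FLOPs/byte vs ridge {ridge:.0f} -> {bound}")
```

With these assumed numbers, prefill sits far above the ridge point (compute-bound) while decode sits far below it (bandwidth-bound), which is the same split the paragraph above describes.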

Replicating Google's scaling approach on Nvidia chips would risk inefficiency for Nvidia's varied clientele, locking it into a 'balanced diet' pod architecture rather than specialized superpods.

Explosive Demand Drives Economics

Epoch AI projects 450+ new pre-trained models by 2030, many exceeding the ~66 septillion FLOPs (total training compute) attributed to GPT-5. A 9600-TPU superpod could theoretically pretrain a GPT-5-scale model in under 7 days at FP4 (realistically 3-4 weeks), though efficiency cliffs from memory, bandwidth, or latency appear depending on scale-up versus scale-out choices. Rising training and inference demand amplifies TPU economics: controlling its own chip supply ensures capacity for massive token serving, positioning Google against Nvidia as workloads shift toward bandwidth constraints.
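The pretraining-time figure follows directly from the numbers already quoted; here is a minimal sketch of the arithmetic, with the utilization factors as assumptions.

```python
# Sanity check on the "under 7 days" pretraining figure, using only the numbers
# quoted above. The utilization factors are assumptions for illustration.

total_flops = 66e24       # ~66 septillion FLOPs for a GPT-5-scale run (per text)
superpod_flops = 121e18   # 9600-chip superpod at 121 exaFLOPS FP4 (per text)

ideal_days = total_flops / superpod_flops / 86_400
print(f"Ideal (100% utilization): {ideal_days:.1f} days")  # ~6.3 days

# Real runs lose time to memory, bandwidth, and latency stalls; at an assumed
# 25-35% effective utilization the same run stretches to roughly 3-4 weeks.
for mfu in (0.35, 0.25):
    print(f"At {mfu:.0%} utilization: {ideal_days / mfu:.0f} days")
```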
