Scaling TPUs on GKE for Massive AI Workloads

GKE treats TPU slices as atomic units, scaling seamlessly to 9,216-chip pods, and pairs them with flexible capacity options such as DWS Flex/Calendar and custom compute class fallbacks for cost-efficient ML training and inference.

TPU Power: Specialized Hardware for AI Matrix Crunching

Kavitha Gowda, product manager for TPUs on GKE, describes TPUs as Google's custom ASICs optimized for machine learning, particularly the heavy matrix multiplications in LLMs and recommendation models. At the core is the Matrix Multiply Unit (MXU), a "dedicated matrix math wizard" that performs the billions of operations behind tasks like image recognition thousands of times faster than a general-purpose chip.

TPUs feature high-bandwidth memory (HBM) to hold large models and batches on-chip, minimizing data-transfer bottlenecks. They interconnect from a single chip to thousands via high-speed inter-chip interconnect (ICI) links and optical circuit switching, enabling massive-scale training and inference. The seventh-generation Ironwood TPU pod scales to 9,216 chips, with peak BF16 TFLOPS so far beyond prior generations like Trillium and v5e that Yufeng Guo initially mistook the figures for typos.

"MXU is the hardware that makes TPUs so powerful. It's dedicated matrix math wizard that can perform this massive calculation in a single step, making the entire process thousands times faster and more efficient than a general-purpose chip," Gowda explains, highlighting the specialized architecture.

Frameworks like JAX, TensorFlow, and PyTorch are fully supported, integrating seamlessly with GKE, Vertex AI, and Cloud TPU APIs.
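
To make that concrete, here is a minimal sketch, assuming a container image with a TPU-enabled JAX install (the array shapes and bfloat16 choice are illustrative): it lists the chips the pod can see and runs a jitted matrix multiply, the operation the MXU accelerates in hardware.

    import jax
    import jax.numpy as jnp

    # On a TPU node this lists TpuDevice entries; on CPU it falls back gracefully.
    print(jax.devices())

    @jax.jit
    def matmul(a, b):
        # XLA lowers this dot product onto the MXU on TPU hardware.
        return jnp.dot(a, b)

    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    a = jax.random.normal(k1, (4096, 4096), dtype=jnp.bfloat16)
    b = jax.random.normal(k2, (4096, 4096), dtype=jnp.bfloat16)
    print(matmul(a, b).shape)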

GKE's Atomic Slicing: Hiding Complexity for Exponential Scale

GKE abstracts the intricacies of the chip architecture, exposing TPU power to containerized workloads while preserving Kubernetes advantages. It treats TPU slices, from single chips to 9,216-chip pods, as atomic units for provisioning, scheduling, failover, and resilience, maximizing interconnect performance.

Slice types scale progressively:

  • Single-host TPU: One VM with 1-8 chips and no cross-host network hop, ideal for fine-tuning, interactive development, or small-scale inference. Scales like CPU VMs via horizontal pod autoscaling.
  • Multi-host TPU: Multiple VMs (e.g., 16 VMs with 4 chips each for 64 chips) in one node pool, interconnected via ICI for larger training/inference.
  • Multi-slice TPU: Spans multiple node pools (scaling toward 50k-100k chips), with ICI links inside each slice and data center networking (DCN) between slices. Developers must align workloads to the fast (ICI) vs. slower (DCN) paths; see the manifest sketch after this list.
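
As a rough illustration of the multi-host case above, here is a minimal sketch of an indexed Job manifest, emitted as JSON from Python. The Job name, image name, and the specific accelerator/topology values are assumptions; the nodeSelector labels and the google.com/tpu resource follow GKE's TPU scheduling conventions, but verify them against the current docs for your TPU generation.

    import json

    tpu_job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "tpu-train"},           # hypothetical name
        "spec": {
            "completions": 16,                        # one pod per TPU VM
            "parallelism": 16,                        # in a 16-host slice
            "completionMode": "Indexed",
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "nodeSelector": {
                        # Which slice type/shape to land on (values are assumptions).
                        "cloud.google.com/gke-tpu-accelerator": "tpu-v5-lite-podslice",
                        "cloud.google.com/gke-tpu-topology": "8x8",   # 16 VMs x 4 chips
                    },
                    "containers": [{
                        "name": "trainer",
                        "image": "my-trainer:latest",                  # hypothetical image
                        "resources": {"limits": {"google.com/tpu": 4}},  # chips per VM
                    }],
                },
            },
        },
    }

    # JSON is valid Kubernetes manifest input; pipe to: kubectl apply -f -
    print(json.dumps(tpu_job, indent=2))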

GKE supports clusters of up to 130k nodes, enabling thousands of TPUs to be driven as one unit for frontier models. JobSets and multi-slice networking provide atomic failover: if one VM fails in a 50k-chip slice, GKE auto-repairs the unit and resumes training, boosting 'goodput' (effective throughput) over raw throughput.
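
A minimal sketch of that JobSet idea follows, assuming the open-source JobSet API (jobset.x-k8s.io) is installed on the cluster; the names, replica counts, and pod template mirror the hypothetical Job sketch above, and the field layout should be checked against the JobSet docs. The point is the failure policy: any worker failure restarts the whole set so training resumes from the last checkpoint rather than continuing with a broken slice.

    import json

    # Pod template reused from the single-slice sketch above (values are assumptions).
    pod_template = {
        "spec": {
            "restartPolicy": "Never",
            "nodeSelector": {
                "cloud.google.com/gke-tpu-accelerator": "tpu-v5-lite-podslice",
                "cloud.google.com/gke-tpu-topology": "8x8",
            },
            "containers": [{
                "name": "trainer",
                "image": "my-trainer:latest",
                "resources": {"limits": {"google.com/tpu": 4}},
            }],
        },
    }

    jobset = {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": "multislice-train"},
        "spec": {
            # Restart all child Jobs together: the atomic failover described above.
            "failurePolicy": {"maxRestarts": 4},
            "replicatedJobs": [{
                "name": "slice",
                "replicas": 2,                 # two slices joined over DCN
                "template": {
                    "spec": {
                        "parallelism": 16,     # one pod per TPU VM per slice
                        "completions": 16,
                        "completionMode": "Indexed",
                        "template": pod_template,
                    },
                },
            }],
        },
    }

    print(json.dumps(jobset, indent=2))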

"GKE hides the underlying complexity of the chip architecture and relays the TPU chip power to the container-based workloads," Gowda notes, emphasizing ecosystem perks like storage, load balancers, and observability.

Yufeng Guo stresses software-hardware co-design: "We're really seeing this combination of having to have knowledge of the software as well as the hardware in order to be able to take full advantage of these systems."

Capacity Flexibility: DWS, CUDs, and Spot for Cost Control

TPU availability spans options for reliability and economy:

  • Committed Use Discounts (CUDs): Reserved capacity for enterprise needs, from massive training to online inference.
  • Dynamic Workload Scheduler (DWS): New in 2025, with Flex (pay-as-you-go, up to 7 days for bursty POCs/experiments) and Calendar (1-3 month reservations for guaranteed, uninterrupted runs).

GKE autoscales DWS Flex node pools only when workloads deploy and bills only while they run; scaling down after the job means zero idle cost. Calendar ensures dedicated, compact placement without maintenance interruptions, which is vital for month-long fine-tuning runs where a failure would be "crippling," as Guo observes.
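
As a rough sketch of what a DWS Flex request can look like from the workload side, assuming a node pool created with queued provisioning: the ProvisioningRequest class name, parameter key, API version, and the referenced PodTemplate name are the author's assumptions and should be verified against current GKE documentation before use.

    import json

    # Hypothetical ProvisioningRequest asking DWS to provision TPU capacity only
    # when the job is ready to run; field names are assumptions to verify.
    provisioning_request = {
        "apiVersion": "autoscaling.x-k8s.io/v1beta1",
        "kind": "ProvisioningRequest",
        "metadata": {"name": "tpu-train-flex"},
        "spec": {
            "provisioningClassName": "queued-provisioning.gke.io",
            "parameters": {"maxRunDurationSeconds": "604800"},    # up to the 7-day Flex window
            "podSets": [{
                "count": 16,                                       # one pod per TPU VM
                "podTemplateRef": {"name": "tpu-train-template"},  # hypothetical PodTemplate
            }],
        },
    }

    print(json.dumps(provisioning_request, indent=2))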

The modes can be combined: reserve Calendar capacity for critical jobs and burst to Flex, with on-demand and Spot as further backstops.

"DWS Flex is like an on-demand elasticity... Mostly used for bursty workloads, for experimentation, for POCs... you just pay for what you're running," Gowda clarifies.

Custom Compute Classes: Automated Fallbacks Across Tiers

Custom compute classes define prioritized hierarchies (e.g., Trillium reservation > Spot > DWS Flex > on-demand). GKE automatically falls back when the primary capacity is unavailable and promotes workloads back to higher-priority tiers when it frees up, optimizing for power, cost, or availability.

Users previously scripted this behavior themselves; now it is native, with Google Cloud handling the optimization. Hierarchies can have three or more layers (latency trade-offs apply), and serving workloads can even fall back between TPUs and GPUs via vLLM, for example starting on reserved TPUs and scaling out to GPUs.
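
A minimal sketch of the priority-list idea behind custom compute classes follows; the reservation name is hypothetical, and the exact ComputeClass schema (priority rule fields, fallback semantics) should be taken from the current GKE docs rather than from this sketch.

    import json

    # Illustrative ComputeClass: try the named reservation first, then Spot;
    # field names approximate the GKE API and need verification.
    compute_class = {
        "apiVersion": "cloud.google.com/v1",
        "kind": "ComputeClass",
        "metadata": {"name": "tpu-fallback"},
        "spec": {
            "priorities": [
                {"reservations": {"specific": [{"name": "trillium-reservation"}]}},  # hypothetical reservation
                {"spot": True},
            ],
            # If no rule can be satisfied, still scale up on default (on-demand) capacity.
            "whenUnsatisfiable": "ScaleUpAnyway",
        },
    }

    print(json.dumps(compute_class, indent=2))
    # Workloads would opt in with a nodeSelector such as
    # {"cloud.google.com/compute-class": "tpu-fallback"}.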

"With custom compute classes, you can define prioritized hierarchy of TPU configuration... GKE can automatically fall back," Gowda says, noting use for low-priority jobs starting on spot then escalating.

Storage and Ecosystem: Fueling Data-Intensive Workloads

GKE optimizes AI I/O:

  • Secondary boot disks: Preload data/images per node for faster pod startup.
  • GCS FUSE CSI driver: Caches and parallel-downloads objects from Cloud Storage, yielding 9x faster model loads when buckets are mounted via PersistentVolumeClaims; see the mount sketch after this list.
  • Managed Lustre: Parallel filesystem for high-concurrency I/O in training and checkpointing.
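
Here is a minimal sketch of the Cloud Storage FUSE mount (using the CSI ephemeral volume form rather than a PersistentVolumeClaim, for brevity); the bucket and image names are hypothetical, and the annotation and driver identifiers should be confirmed against the GCS FUSE CSI driver docs.

    import json

    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "model-server",
            # Annotation that injects the FUSE sidecar (name per the author's understanding).
            "annotations": {"gke-gcsfuse/volumes": "true"},
        },
        "spec": {
            "containers": [{
                "name": "server",
                "image": "my-server:latest",                       # hypothetical image
                "volumeMounts": [{"name": "weights", "mountPath": "/models", "readOnly": True}],
            }],
            "volumes": [{
                "name": "weights",
                "csi": {
                    "driver": "gcsfuse.csi.storage.gke.io",
                    "readOnly": True,
                    "volumeAttributes": {"bucketName": "my-model-bucket"},  # hypothetical bucket
                },
            }],
        },
    }

    print(json.dumps(pod, indent=2))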

GKE also integrates open-source tooling such as KubeRay (Ray orchestration) and vLLM (serving), plus observability dashboards.

Companies such as Anthropic, Moloco, and Lightricks already run TPUs on Kubernetes.

Resources: Google AI Hypercomputer, GKE for AI/ML inference docs, TPU-on-GKE LLM fine-tuning tutorial.

"By leveraging GKE's job set and multi-slice networking, you gain an atomic failover model... helps you resume your training if one infrastructure fails," Gowda adds on maximizing expensive TPU utilization.

Key Takeaways

  • Treat TPU slices as atomic units in GKE for provisioning up to 9k+ interconnected chips, aligning workloads to ICI (intra-pool) vs. DCN (inter-pool) speeds.
  • Use DWS Flex for bursty experiments (pay-as-you-go, autoscaling) and Calendar for 1-3 month guaranteed reservations to avoid crippling mid-training failures.
  • Implement custom compute classes for automatic fallbacks (e.g., reservation > spot > Flex) to optimize cost/availability without custom scripts.
  • Accelerate startup with secondary boot disks, GCS Fuse (9x model load speedup), and Managed Lustre for high-IO training.
  • Co-design software for TPU hardware: Leverage the MXU and HBM for matrix-heavy LLMs, and scale via single-host, multi-host, and multi-slice topologies.
  • Combine CUDs for steady-state with DWS/spot for bursts; fallback to GPUs via vLLM for serving resilience.
  • Maximize goodput with GKE JobSets' atomic failover and auto-resume on VM failures.
  • Start with Ironwood/Trillium pods on GKE for JAX/TF/PyTorch; reference tutorials for LLM fine-tuning.

Video description
Google AI Hypercomputer → https://goo.gle/3ObrQLK
GKE for AI/ML inference → https://goo.gle/4cg4k8y
[Tutorial] Fine tune a LLM using TPUs on GKE → https://goo.gle/48hT4Hu

Tensor Processing Units (TPUs) are now in their 7th generation. They allow machine learning workloads to reach massive scale, especially when running on Google Kubernetes Engine (GKE). But how does that work, and what do you need to know in order to run TPUs on GKE successfully? Join Yufeng Guo as he sits down with Kavitha Gowda, the product manager of TPUs on GKE, to get into the details of how to scale TPU workloads on GKE.

Speakers: Yufeng Guo, Kavitha Gowda
Products Mentioned: Google Kubernetes Engine, Cloud Tensor Processing Units, AI Hypercomputer

© 2026 Edge