Scaling Transformer Training to 5 Million Tokens

The Memory Bottleneck in Long-Context Training

Training standard transformer models with very long context windows (e.g., 3M+ tokens) faces two primary constraints: attention compute that grows quadratically with sequence length, and activation memory that grows linearly with it. Even on high-end hardware such as an 8xH100 node, standard implementations fail because the model parameters and attention activations quickly exceed available GPU memory.
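To give a rough feel for the two scaling behaviours, the following back-of-the-envelope sketch compares a 32K-token and a 3M-token sequence. The model width (4096) and the 2-bytes-per-element dtype are assumptions chosen for illustration, not figures from the article, and the FLOP count ignores causal masking and all non-attention layers.

```python
def attention_forward_flops(seq_len: int, hidden_dim: int) -> int:
    """Approximate forward FLOPs for one layer's QK^T and scores-times-V matmuls."""
    # Each of the two matmuls costs about 2 * seq_len^2 * hidden_dim FLOPs
    # when summed over all heads, so the total grows quadratically.
    return 4 * seq_len ** 2 * hidden_dim


def hidden_state_bytes(seq_len: int, hidden_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one [seq_len, hidden_dim] activation tensor (grows linearly)."""
    return seq_len * hidden_dim * bytes_per_elem


HIDDEN_DIM = 4096  # assumed model width, not taken from the article

for tokens in (32_000, 3_000_000):
    flops = attention_forward_flops(tokens, HIDDEN_DIM)
    gib = hidden_state_bytes(tokens, HIDDEN_DIM) / 1024 ** 3
    print(f"{tokens:>9} tokens: {flops:.2e} attention FLOPs per layer, "
          f"{gib:.2f} GiB per activation tensor")
```

Doubling the sequence length doubles each activation tensor but quadruples the attention FLOPs, which is why both memory and compute have to be addressed separately.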

The Stack of Optimization Techniques

To reach a 3-million-token context, a layered approach is required, with each technique addressing a different source of memory usage:

Fully Sharded Data Parallelism (FSDP): Distributes model parameters across all available GPUs to prevent memory exhaustion from the model weights alone.
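DeepSpeed Ulysses: A context parallelism technique that shards the sequence across GPUs and, for the attention step, uses all-to-all communication to redistribute the data by attention head. During attention, each GPU sees the full sequence but only for its own subset of heads; everywhere else it holds only its slice of the sequence. This reduces activation memory by approximately 8x.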
Activation Checkpointing: Recomputes activations during the backward pass rather than storing them, providing another 8x reduction in memory usage.
CPU Offloading: Moves transformer block inputs to CPU memory when not actively needed for backpropagation, prefetching them just-in-time to minimize performance impact.
Chunked Sequence Training: Tiles element-wise operations (like loss calculations and MLPs) across the sequence length to avoid allocating massive buffers that scale linearly with the token count.
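The layout switch behind Ulysses-style context parallelism can be sketched without any GPU code. The toy simulation below is an assumption-laden illustration, not DeepSpeed's implementation: ranks are plain Python lists, each token is a list with one entry per head, and the function names are hypothetical. It shows that the all-to-all step moves each rank from "my slice of the sequence, all heads" to "the full sequence, my slice of heads" while the per-rank element count stays the same.

```python
def shard_sequence(tokens, world_size):
    """Split the token list into contiguous, equal sequence shards, one per rank."""
    assert len(tokens) % world_size == 0
    shard = len(tokens) // world_size
    return [tokens[r * shard:(r + 1) * shard] for r in range(world_size)]


def all_to_all_seq_to_heads(seq_shards, num_heads):
    """Simulate the all-to-all: sequence-sharded in, head-sharded out."""
    world_size = len(seq_shards)
    assert num_heads % world_size == 0
    heads_per_rank = num_heads // world_size
    head_shards = []
    for dst in range(world_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        full_seq = []
        # Visiting source ranks in order restores the original token order.
        for src in range(world_size):
            for token in seq_shards[src]:
                full_seq.append(token[lo:hi])
        head_shards.append(full_seq)
    return head_shards


SEQ_LEN, NUM_HEADS, WORLD_SIZE = 8, 4, 2  # toy sizes chosen for illustration
# Each token carries one (token_index, head_index) marker per head.
tokens = [[(t, h) for h in range(NUM_HEADS)] for t in range(SEQ_LEN)]

seq_shards = shard_sequence(tokens, WORLD_SIZE)
head_shards = all_to_all_seq_to_heads(seq_shards, NUM_HEADS)

for rank in range(WORLD_SIZE):
    print(f"rank {rank}: before = {len(seq_shards[rank])} tokens x "
          f"{len(seq_shards[rank][0])} heads, after = {len(head_shards[rank])} tokens x "
          f"{len(head_shards[rank][0])} heads")
```

In a real system a second all-to-all after attention returns the outputs to the sequence-sharded layout, so the rest of the transformer block keeps working on local sequence slices.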

Untied Ulysses: Pushing to 5 Million Tokens

To surpass the 3-million token limit, the team developed "Untied Ulysses." This technique refines context parallelism by further chunking attention heads. Instead of allocating a single large buffer per head group, the system iterates through smaller chunks of heads, reusing the same memory buffers across iterations. This significantly lowers activation memory requirements with negligible impact on throughput.
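The buffer-reuse idea can be illustrated with a small pure-Python sketch. This is not the team's implementation: the article does not say which tensors the real system buffers, so the attention-score scratch buffer here is an illustrative stand-in, and all function names and toy sizes are hypothetical. The point is only that processing heads in smaller chunks, with one scratch buffer allocated once and reused, gives the same result with a smaller peak allocation.

```python
import math


def attend_one_head(q, k, v, scores):
    """Softmax attention for one head; scores is a caller-owned n x n scratch buffer."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        row = scores[i]  # overwritten in place, never reallocated
        for j in range(n):
            row[j] = sum(q[i][c] * k[j][c] for c in range(d)) / math.sqrt(d)
        m = max(row)
        for j in range(n):
            row[j] = math.exp(row[j] - m)
        z = sum(row)
        out.append([sum(row[j] * v[j][c] for j in range(n)) / z for c in range(d)])
    return out


def attention_head_chunked(heads, heads_per_chunk):
    """Process heads_per_chunk heads per iteration, reusing one scratch buffer."""
    n = len(heads[0][0])
    # Allocated once, sized for a single chunk of heads rather than all heads.
    scratch = [[[0.0] * n for _ in range(n)] for _ in range(heads_per_chunk)]
    outputs = []
    for start in range(0, len(heads), heads_per_chunk):
        for slot, (q, k, v) in enumerate(heads[start:start + heads_per_chunk]):
            outputs.append(attend_one_head(q, k, v, scratch[slot]))
    return outputs, heads_per_chunk * n * n  # outputs and scratch size in elements


def make_head(h, n, d):
    """Deterministic toy q, k, v for head h (illustrative data only)."""
    q = [[0.1 * (h + 1) * (i + c) for c in range(d)] for i in range(n)]
    k = [[0.05 * (h + 2) * (i - c) for c in range(d)] for i in range(n)]
    v = [[float(i + c + h) for c in range(d)] for i in range(n)]
    return q, k, v


heads = [make_head(h, n=4, d=2) for h in range(4)]
out_all, buf_all = attention_head_chunked(heads, heads_per_chunk=4)
out_small, buf_small = attention_head_chunked(heads, heads_per_chunk=1)
print("same outputs:", out_all == out_small)
print("scratch elements, all heads at once vs. one head per chunk:", buf_all, buf_small)
```

On real hardware the chunk size would be chosen so that each iteration still keeps the GPU busy, which is consistent with the negligible throughput impact reported above.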

Practical Implementation Advice

Profiling is critical: Use tools like the PyTorch Profiler to identify exactly where memory is being consumed, as bottlenecks often appear in unexpected places.
Trade-offs: Chunk size trades memory for throughput. Larger chunks use more memory but can improve overall training speed, while smaller chunks save memory at the cost of more iterations.
Reinvesting Memory: Stacking these optimizations frees memory that can be reinvested into other training stages or used to push context lengths even further.
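The chunk-size trade-off can be made concrete with a tiled loss computation of the kind used in chunked sequence training. The sketch below is a minimal pure-Python illustration under stated assumptions: the five-entry "vocabulary", the scalar hidden states, and the project helper are all hypothetical stand-ins for a real output projection. It shows that the loss is unchanged by tiling, while the peak logits buffer shrinks and the number of iterations grows as the chunk gets smaller.

```python
import math


def token_loss(logits, target):
    """Cross-entropy for one token from raw logits, via a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def chunked_mean_loss(hidden, targets, project, chunk_size):
    """Mean loss over the sequence, holding logits for only chunk_size tokens at once."""
    total, peak_logits = 0.0, 0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = [project(h) for h in hidden[start:stop]]
        peak_logits = max(peak_logits, sum(len(row) for row in chunk_logits))
        total += sum(token_loss(row, t)
                     for row, t in zip(chunk_logits, targets[start:stop]))
    return total / len(hidden), peak_logits


VOCAB_WEIGHTS = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical 5-entry output projection


def project(h):
    """Map one (scalar) hidden state to logits over the toy vocabulary."""
    return [h * w for w in VOCAB_WEIGHTS]


hidden = [0.1 * i for i in range(12)]
targets = [i % len(VOCAB_WEIGHTS) for i in range(12)]

for chunk_size in (1, 4, 12):
    loss, peak = chunked_mean_loss(hidden, targets, project, chunk_size)
    steps = math.ceil(len(hidden) / chunk_size)
    print(f"chunk_size={chunk_size:>2}: loss={loss:.6f}, "
          f"peak logits held={peak}, iterations={steps}")
```

In practice the right chunk size is found by profiling: the largest chunk that still fits in memory usually gives the fewest iterations and the best throughput.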

