LLM Pretraining Scaling: FSDP Wins Until Comms Crater

Use FSDP as the default for scaling pretraining (comms overhead of 3× params per step) until GPU count hits the comms crossover; distilling 1T tokens from a frontier model costs about $25M and can't be blocked, especially once value lives in local tool use; training runs fail from causality breaks and low-precision bias.

FSDP Dominates Parallelism Until Scale Forces Pipeline Trade-offs

Pretraining FLOPs ≈ 6ND (2 forward + 4 backward per param-token). Data parallelism (DP) copies full weights to every GPU but hits HBM limits (B300: 288 GB). Fully Sharded Data Parallel (FSDP) shards parameters across GPUs and all-gathers each layer's full weights just in time for its forward and backward pass, overlapping comms with compute since layers are independent. FSDP comms volume is 3× params (all-gather in forward, all-gather plus reduce-scatter in backward), only 50% more than DP's 2× params all-reduce; the ratio holds because an all-reduce is exactly a reduce-scatter plus an all-gather. Use hierarchical collectives across NVLink domains: reduce-scatter intra-domain, all-reduce the shards inter-domain, then all-gather intra-domain, to saturate IB bandwidth.
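The 2×-vs-3× comms arithmetic can be sketched numerically. The model size and dtype below are illustrative assumptions, and per-GPU traffic uses the large-N ring-collective limit:

```python
def per_gpu_comm_elems(n_params: float, scheme: str) -> float:
    """Approximate elements each GPU sends per step (large-N ring limit).

    A ring all-reduce moves ~2x the data (a reduce-scatter phase plus an
    all-gather phase); a standalone all-gather or reduce-scatter is ~1x each.
    """
    if scheme == "dp":    # DP: one all-reduce over the gradients
        return 2 * n_params
    if scheme == "fsdp":  # FSDP: all-gather (fwd) + all-gather (bwd) + reduce-scatter
        return 3 * n_params
    raise ValueError(scheme)

# Illustrative: a 70B-param model in bf16 (2 bytes/element)
for scheme in ("dp", "fsdp"):
    gb = per_gpu_comm_elems(70e9, scheme) * 2 / 1e9
    print(scheme, f"{gb:.0f} GB per step")  # dp 280 GB, fsdp 420 GB
```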

Per-GPU comms time stays roughly flat as GPU count grows (ring all-reduce chunks shrink inversely with participant count), but per-GPU compute drops linearly, so MFU craters at the 'crossover' point where comms exceeds compute. Delay the crossover with larger batches (more compute per GPU) or sparser models; TPUs push it out with bigger high-bandwidth domains. Batch size also caps pure FSDP at roughly 1K GPUs (e.g., a 10M-token batch at 10K sequence length is only 1K sequences, one per GPU). The next step is pipeline parallelism (PP), but it introduces bubbles (GPUs idle at the start and end of each batch) that training can't fill because gradients must sync every batch. PP also constrains architecture (e.g., Kimi's cross-layer attention; mixing attention types causes stage imbalance), which slows research.
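A toy crossover calculation illustrates why compute shrinks into a flat comms floor. All hardware numbers here (sustained FLOP/s, per-GPU bandwidth) are invented for illustration, not measurements:

```python
# Hypothetical numbers, chosen only to illustrate the crossover mechanic.
N = 70e9              # model parameters
GLOBAL_BATCH = 10e6   # tokens per optimizer step
FLOPS = 1e15          # assumed sustained FLOP/s per GPU
BW = 100e9            # assumed bytes/s of interconnect per GPU
COMM_BYTES = 3 * N * 2  # FSDP traffic: 3x params in bf16 (2 bytes)

def compute_time(gpus: int) -> float:
    # Fixed global batch: per-GPU compute shrinks as 1/gpus
    return 6 * N * (GLOBAL_BATCH / gpus) / FLOPS

comm_time = COMM_BYTES / BW  # roughly flat in GPU count (ring collectives)

gpus = 1
while compute_time(gpus) > comm_time:
    gpus *= 2
print(f"comms overtakes compute near {gpus} GPUs")  # 1024 with these numbers
```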

Distillation Remains Cheap and Evasion-Proof

Frontier labs can't halt distillation: 1T tokens sampled from Opus 4.6 costs ~$25M at $25/MTok, rapidly commoditizing capabilities into open models (for scale, Fineweb is 18.5T tokens; OpenWebText is 9B). Hiding chain-of-thought (CoT) fails: distillers can instruct the teacher to answer directly without thinking, or use RLVR to reconstruct a CoT that reaches the same answers. The core value is shifting to local tool use (file edits, bash), which can't be hidden behind a cloud, and users resist migrating workflows. Products built atop APIs actually distill better data: they can reward 'gold diffs' (the final user-accepted code) over the rejected intermediates of 10+ turn sessions.
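The headline cost is simple arithmetic on the quoted price:

```python
# Back-of-envelope distillation cost at the quoted $25 per million tokens.
tokens = 1e12                  # 1T tokens of teacher output
price_per_mtok = 25.0          # USD per million tokens
cost = tokens / 1e6 * price_per_mtok
print(f"${cost / 1e6:.0f}M")   # $25M
```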

Agentic AI Shifts Cybersecurity Toward Defense

Mythos chains 5+ vulnerabilities into a single exploit (versus prior single-vuln finds), yet software is already more secure than it was despite decades of human probing; a sudden influx of AI intelligence likely strengthens defense as the industry patches (e.g., Glasswing surfacing zero-days). AI currently excels at finding vulnerabilities over patching them (per the XKCD pattern: fixes break edge cases and features). Mitigations: LLM-assisted porting of C to Rust; formal verification (e.g., seL4-style proofs); patching other people's repos the same way LLMs already find bugs in them. Hoarding Mythos is risky; better to build and release it behind classifiers that reject cyberattack intents (Anthropic's plan for 4.7). Classifiers can be evaded by decomposing an attack into individually harmless subproblems. Patching your own code is already routine for coding LLMs.

Pipeline RL Fixes Stragglers; Causality/Bias Dooms Runs

RL response lengths grow in both mean and variance, so long rollouts straggle and GPU utilization drops. Pipeline RL does 'in-flight weight updates': after each training step, the generating model's weights are swapped mid-trajectory, so rollouts always come from a recent model without the full off-policy drift of offline RL.
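A minimal sketch of the in-flight idea: the generator checks for fresher weights between tokens and swaps mid-trajectory rather than finishing a long rollout on stale weights. The interfaces here (WeightStore, step_fn) are invented for illustration, not any real framework's API:

```python
# Hypothetical sketch: trajectories may span multiple weight versions.
class WeightStore:
    """Shared handle the trainer publishes fresh weights into."""
    def __init__(self):
        self.version, self.weights = 0, {}

    def publish(self, weights):       # called by the trainer after each step
        self.version += 1
        self.weights = weights

def generate(store: WeightStore, prompt, max_tokens, step_fn):
    version, weights = store.version, store.weights
    tokens = []
    for _ in range(max_tokens):
        if store.version > version:   # in-flight swap: adopt newer weights
            version, weights = store.version, store.weights
        tokens.append(step_fn(weights, prompt, tokens))
    return tokens, version            # final version may differ from the start
```

The key design choice is that a swap never restarts the trajectory; the tail of the rollout simply comes from the newer policy, keeping data near-on-policy without discarding work.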

Pretraining fails via causality breaks (MoE expert-choice routing lets token n+k affect token n; token-dropping discards early tokens in favor of later matches; rumored culprit in the Llama 4 and Gemini 2 stumbles) or numerical bias (FP16 collectives round large sums in one direction, e.g., above 2048 the FP16 spacing is 2, so adding 1 is rounded away entirely; an early GPT-4 bug was of this kind). Bias compounds across steps, unlike variance, which averages out. Each new scale surfaces bespoke issues (numerics, kernels); failures aren't a fixed list of five known modes. RL inference must match the training engine numerically (drift between the two biases the gradients); enforce disciplined compute multipliers so bug stacks can't hide. Kernel optimization is AGI-hard: even Nvidia took ages to get Blackwell kernels right.
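The FP16 bias is easy to demonstrate: with a 10-bit mantissa, spacing between representable values reaches 1 at 1024 and 2 at 2048, so above 2048 every +1 is rounded away in the same direction, and the error accumulates rather than averaging out:

```python
import numpy as np

# fp16 (10-bit mantissa): above 2048 the spacing is 2, so x + 1 rounds
# back to x -- a biased error, not random noise.
acc16 = np.float16(2048)
for _ in range(1000):
    acc16 += np.float16(1)        # every +1 is rounded away

acc32 = np.float32(2048)
for _ in range(1000):
    acc32 += np.float32(1)        # accumulate in fp32, round once at the end

print(acc16)                      # 2048.0 -- the fp16 sum never moves
print(np.float16(acc32))          # 3048.0 -- correct after a single cast
```

This is why collectives that accumulate in a wider dtype (then round once) avoid the compounding bias the text describes.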

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge