Distillation Remains Unstoppable, Threatening Frontier Moats
Open-source models commoditize frontier labs quickly because distillation is cheap: 1T tokens of teacher outputs from a model like Opus 4.6 at $25/MTok totals $25M, affordable even before caching discounts. Hiding chain-of-thought (CoT) doesn't stop this, since CoT isn't made of special tokens: a distiller can instruct the model to answer directly, or to relocate its thinking into the visible output. Reconstructing hidden CoT as an RLVR target adds cost but works. Tool use (local code and files, bash) evades hiding entirely, since the trajectory runs on the user's machine and users resist fully provider-controlled clouds.
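A back-of-the-envelope check of that cost claim; the $25/MTok price and 1T-token corpus are the figures above, not quoted rates.

```python
# Distillation cost sketch, using the figures above (assumed, not quoted pricing).
PRICE_PER_MTOK = 25        # USD per million output tokens
CORPUS_TOKENS = 1e12       # 1T tokens of teacher outputs

cost_usd = CORPUS_TOKENS / 1e6 * PRICE_PER_MTOK
print(f"${cost_usd / 1e6:.0f}M")   # $25M, before any caching discounts
```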
Product companies can distill a superior model out of the API they build on: capture the 'gold diff' a user converges on after 10+ interactions and use it as an RL target, rewarding the final accepted output and penalizing the rejected intermediates. The result can potentially outperform the base API on that workload.
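A minimal sketch of how such a reward could be shaped; the dataclass fields and the -0.2 penalty are illustrative assumptions, and a real system would use fuzzy or AST-level diff matching rather than string equality.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    prompt: str
    rejected: list[str]   # intermediate diffs the user discarded
    gold: str             # the diff the user finally accepted

def reward(sample: str, inter: Interaction) -> float:
    """Reward the user-converged gold diff, penalize known rejects."""
    if sample == inter.gold:
        return 1.0
    if sample in inter.rejected:
        return -0.2       # illustrative penalty weight
    return 0.0
```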
Pretraining Fails from Causality Breaks and Compounding Biases
Pretraining runs fail in two recurring ways, 'breaking causality' and 'adding bias', which makes large-scale training precarious.
Causality issues:
- Expert routing: Token-choice routing (tokens pick experts) unbalances expert load; expert-choice routing (experts pick tokens) balances load but makes token n's allocation depend on future tokens (an expert filling its capacity sees token n+k), leaking information unavailable at deployment. Rumored culprit in Llama 4. Token dropping (experts discarding low-scoring tokens when over capacity) is similarly future-dependent; Gemini 2 Pro was reportedly hit by the latter. See the sketch below.
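A minimal sketch of the leak, assuming a toy router: each expert takes its top-capacity tokens over the whole sequence, so perturbing a future token's score changes an earlier token's routing.

```python
import numpy as np

def expert_choice_route(scores: np.ndarray, capacity: int) -> np.ndarray:
    """scores: [seq_len, n_experts] router logits. Each expert keeps its
    top-`capacity` tokens across the WHOLE sequence, a non-causal choice
    that can't be reproduced token-by-token at decode time."""
    assign = np.zeros(scores.shape, dtype=bool)
    for e in range(scores.shape[1]):
        top = np.argsort(scores[:, e])[-capacity:]   # ranks future tokens too
        assign[top, e] = True
    return assign

rng = np.random.default_rng(0)
s = rng.normal(size=(8, 2))
s[7, 0] = -10.0                         # last token starts unselected by expert 0
a1 = expert_choice_route(s, capacity=2)
s2 = s.copy(); s2[7, 0] = 10.0          # ...then becomes expert 0's top token
a2 = expert_choice_route(s2, capacity=2)
print((a1[:7] != a2[:7]).any())         # True: an EARLIER token's routing changed
```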
Bias, unlike variance, compounds instead of averaging out:
- FP16 collectives (e.g., all-reduce) lose granularity as sums grow: once the accumulator is large enough that the spacing between representable values exceeds a small gradient's magnitude, that gradient rounds away entirely (summing 1 ten thousand times in FP16 stalls at 2048, roughly 5x off). This reportedly slowed, and initially broke, GPT-4 training. Reproduced below.
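The stall is easy to reproduce: FP16's 10-bit mantissa means that above 2048 the spacing between representable values exceeds 1, so unit-sized addends round away.

```python
import numpy as np

acc = np.float16(0.0)
for _ in range(10_000):
    acc += np.float16(1.0)      # naive FP16 accumulation
print(acc)                      # 2048.0: stalls once ULP(acc) > addend

acc32 = np.float32(0.0)         # the standard fix: widen the accumulator
for _ in range(10_000):
    acc32 += np.float32(np.float16(1.0))
print(acc32)                    # 10000.0
```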
Implications: failures aren't a fixed list of five patchable modes; new scale-specific numerics bugs keep emerging. Kernel optimization resists AI automation (Nvidia took a long time to tune Blackwell kernels despite expert teams). RL inference must match the training engine's numerics exactly or sampling drifts off-policy, and compute-multiplier tricks must be adopted with discipline so their biases don't stack. Bearish on near-term AI kernel writing.
FSDP Scales Pretraining Until Comms Crossover, Then Pipeline
Pretraining compute ≈ 6ND FLOPs for N parameters and D tokens (2 FLOPs per parameter per token in the forward pass, 4 in the backward). Data parallelism (DP) replicates the weights and splits the batch, but HBM capacity (B300: 288GB) caps the model size a single replica can hold.
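A worked instance of the 6ND rule and the memory wall, with an illustrative 1T-parameter, 10T-token run; the 16 bytes/param figure is the usual mixed-precision tally (FP16 weights and grads plus FP32 master weights and Adam moments).

```python
N = 1e12    # parameters (illustrative)
D = 1e13    # training tokens (illustrative)
print(f"{6 * N * D:.1e} FLOPs")       # 6.0e25

# Plain DP must hold full weights + optimizer state per replica:
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4   # fp16 w/g + fp32 master + Adam m, v
print(f"{N * BYTES_PER_PARAM / 1e12:.0f} TB per replica vs 0.288 TB on a B300")
```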
The default is Fully Sharded Data Parallel (FSDP): shard each layer's parameters across GPUs, all-gather the full layer just before its compute (overlapped with the previous layer's compute), and discard it afterwards. Communication is ~3x the parameter bytes per step (all-gathers in forward and backward, plus a reduce-scatter of gradients), 50% more than DP's all-reduce. Hierarchical collectives (reduce-scatter within the NVLink domain, all-reduce across InfiniBand, all-gather back within the domain) exploit the bandwidth hierarchy.
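Where the 50% figure comes from, in standard ring-collective cost units (all-gather ≈ reduce-scatter ≈ 1x the data, all-reduce ≈ 2x):

```python
def comm_volume(strategy: str) -> float:
    """Per-step traffic in multiples of total parameter bytes."""
    if strategy == "dp":
        return 2.0   # one gradient all-reduce
    if strategy == "fsdp":
        return 3.0   # all-gather (fwd) + all-gather (bwd) + grad reduce-scatter
    raise ValueError(strategy)

print(comm_volume("fsdp") / comm_volume("dp") - 1)   # 0.5 -> the 50% overhead
```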
Two limits force pipeline parallelism (PP); the sketch after this list quantifies both:
- Comms crossover: per-GPU compute time shrinks roughly as 1/#GPUs while per-GPU comms stays flat, so MFU craters at scale. Larger batches and sparsity push the crossover out; TPUs fare better thanks to larger high-bandwidth domains.
- Batch floor: a 10M-token batch at 10K sequence length is only 1K sequences, and since attention is intra-sequence, sequences are the unit of data parallelism: the batch can't be spread across more than 1K GPUs.
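Both limits in one sketch. The hardware numbers (2.5 PFLOP/s per GPU, 400 GB/s network bandwidth, 2 TB of FP16 weights) are illustrative assumptions chosen only to show the shape of the crossover.

```python
def step_times(n_gpus, n_params=1e12, batch_tokens=1e7,
               flops_per_gpu=2.5e15, model_bytes=2e12, net_bw=4e11):
    """Crossover toy model: compute time falls ~1/n_gpus; FSDP comms
    per GPU (~3x parameter bytes over the network) stays flat."""
    compute_s = 6 * n_params * batch_tokens / (flops_per_gpu * n_gpus)
    comms_s = 3 * model_bytes / net_bw
    return compute_s, comms_s

for n in (256, 1024, 4096):
    c, m = step_times(n)
    print(f"{n:>5} GPUs: compute {c:5.1f}s vs comms {m:.0f}s")
# At 4096 GPUs compute (~5.9s) has fallen below flat comms (15s): MFU craters.

# Batch floor: attention is intra-sequence, so sequences bound DP ranks.
print(int(1e7 / 1e4), "sequences -> at most 1K GPUs of pure DP")
```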
PP introduces bubbles that waste cycles: early stages idle at the end of a batch, late stages at the start, and micro-batching can't fully hide this because gradients must sync at batch boundaries (closed form below). It also constrains research: Kimi-style per-layer attention variants or mixed window sizes leave stages imbalanced, slowing iteration.
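The bubble has a standard closed form for a GPipe-style schedule: with p stages and m micro-batches per batch, the idle fraction is (p-1)/(m+p-1); gradient sync at batch boundaries is what keeps m from growing without bound.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style pipeline: p stages, m micro-batches."""
    return (p - 1) / (m + p - 1)

for p, m in [(8, 8), (8, 32), (16, 32)]:
    print(f"p={p:<2} m={m:<2} bubble={bubble_fraction(p, m):.0%}")
# p=8  m=8  bubble=47%   more micro-batches amortize the bubble...
# p=8  m=32 bubble=18%
# p=16 m=32 bubble=32%   ...but deeper pipelines pay more at fixed m
```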
Mythos Hits Combinatorial Exploits, Pipeline RL Fixes Stragglers
Mythos advances by agentically chaining 5+ individually modest vulnerabilities into full exploits (e.g., arbitrary code execution), not via a raw intelligence jump: cyberattacks are combinatorial.
Equilibrium argument: software has grown more secure despite 20 years of human probing, so an influx of AI attackers (Glasswing/Mythos) lets US firms find and patch latent zero-days first, strengthening defense. Counter: AI may be better at finding bugs than at patching them (per the XKCD trope, fixes break edge cases). Candidate remedies: formal verification (seL4-style proofs?), LLM-driven C-to-Rust ports; a useful test is running Mythos against memory-safe languages.
Anthropic hoarding these capabilities sets a risky precedent, and misuse classifiers can be evaded by decomposing an exploit into innocuous-looking subproblems.
Pipeline RL tackles RL's rollout-length variance (easy problems finish in a few hundred tokens; hard ones run to 100K), which leaves GPUs idling as stragglers finish. Batching complete rollouts turns this into stale, off-policy offline RL. The fix: swap in updated weights mid-generation, so most trajectories, all the short ones plus the tails of long ones, are sampled from the latest model, keeping training close to on-policy.
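A toy sketch of the mechanism, with a version counter standing in for weights and a random stopping rule standing in for EOS; the `PolicyStore`/`decode_chunk` names are illustrative, not any framework's API.

```python
import random

class PolicyStore:
    """Toy stand-in for a weight server: the version number plays the weights."""
    def __init__(self): self.version = 0
    def publish(self): self.version += 1      # trainer pushes a new policy
    def latest(self): return self.version

def decode_chunk(version, tokens, n):
    """Toy decode: emit n 'tokens' tagged with the generating policy version."""
    return tokens + [version] * n, random.random() < 0.3   # random EOS

def rollout(store, chunk=4, max_len=40):
    tokens, done = [], False
    while not done and len(tokens) < max_len:
        w = store.latest()                    # in-flight weight swap point
        tokens, done = decode_chunk(w, tokens, chunk)
        store.publish()                       # training proceeds concurrently
    return tokens

random.seed(0)
print(rollout(PolicyStore()))   # short rollouts are near-latest; long tails always are
```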