Diffusion: Data-Efficient Framework Outshining Autoregressives on Scarce Data
Diffusion is a training framework, not an architecture: it creates extra samples by gradually noising clean data over roughly 1,000 steps, and it outperforms autoregressive models in the 25-100M token regime where data is scarce but compute is abundant. It still lags in text because of slow inference and immature infrastructure.
Diffusion Framework Generates Data from Noise for Efficiency
Diffusion models treat generation as reversing a noising process: start with clean data such as an image, add Gaussian noise over roughly 1,000 gradual steps until only pure noise remains, and in doing so create thousands of augmented samples from a single input. The model is trained to predict the noise added at each timestep (the post-2020 DDPM objective), which is what makes the approach data-efficient. On loss-comparison charts, diffusion converges more slowly but reaches a lower final loss than autoregressive models when repeating 25-100M tokens, making it well suited to scarce-data, abundant-compute scenarios. Unlike autoregressive models, which generate strictly left to right, diffusion can generate in any order and so acts as a superset. It can be implemented with any architecture, including transformers (e.g., DiT), because it is orthogonal to architecture: it defines how the model is trained (noise addition and removal), how data is produced, and how inference proceeds, not how the weights are connected.
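As a concrete illustration of the noise-prediction objective described above, here is a minimal training-step sketch, assuming a PyTorch denoiser `model(x_t, t)` that returns predicted noise; the names, the linear schedule, and the helper function are illustrative, not a specific paper's implementation.

```python
import torch
import torch.nn.functional as F

T = 1000  # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (DDPM-style)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction, \bar{alpha}_t

def ddpm_training_step(model, x0, optimizer):
    """One DDPM-style step: noise a clean batch at random timesteps, predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)           # random timestep per sample
    noise = torch.randn_like(x0)                               # Gaussian noise epsilon
    a_bar = alphas_cumprod.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise    # closed-form forward process
    pred_noise = model(x_t, t)                                 # denoiser predicts epsilon
    loss = F.mse_loss(pred_noise, noise)                       # simplified DDPM objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the same clean batch `x0` can be revisited at a different random `t` on every pass, which is where the augmentation effect comes from.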
The idea borrows from physical diffusion (particles moving from high to low concentration) and is formalized either as a discrete Markov chain or, in the Stanford-style treatment, as continuous-time differential equations, leveraging centuries of established probability theory, with training objectives expressed as KL divergences between distributions. The practical outcome: from one image you can derive up to 1,000 noisy variants, and the model learns the noise level at each step through a schedule, squeezing the most out of a limited dataset.
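A brief sketch of the math this paragraph alludes to, in standard DDPM notation (the symbols β_t, ᾱ_t, and ε are not defined elsewhere in this piece): the forward process has a closed form at every timestep, which is exactly why one clean sample yields a whole ladder of noisy training variants.

```latex
% Forward (noising) process as a discrete Markov chain, and its closed-form marginal:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\bigr),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\bigr),
\quad
\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).
% So a single clean x_0 gives a training pair (x_t, \epsilon) at any of the T noise levels:
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I).
```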
Historical Advances Tackle Slow Inference
Diffusion originated in the 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (after GANs, before "Attention Is All You Need") and targeted images, not text. Adoption was slow because the math-heavy formulation raised the barrier to entry. The breakthrough came with the 2020 DDPM paper, which redefined the objective as noise prediction (rather than predicting the mean and covariance of the reverse step), greatly simplifying training. DDIM then improved the sampling schedule, and 2022's Stable Diffusion scaled the approach to viable results. More recently, flow matching cuts inference from hundreds or thousands of steps down to a few, slashing compute: during training the original clean sample is available as a reference at every noise level, but at inference the model must reverse the process from pure noise without it.
Early Markov-chain formulations forced the sampler to visit every step; the continuous-time math unlocked skipping most of them. The result is much faster sampling: Mercury, for example, hits 1,000+ tokens/second where autoregressive decoding bottlenecks.
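To show what step-skipping looks like in practice, here is a minimal sketch of deterministic DDIM-style sampling that visits only a strided subset of timesteps. It assumes the same hypothetical `model(x_t, t)` noise predictor plus the `T` and `alphas_cumprod` schedule from the training sketch above; it illustrates the general skip idea, not the exact Mercury or flow-matching recipe.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, num_steps=20, device="cpu"):
    """Deterministic DDIM-style sampling over num_steps of the T trained timesteps."""
    a_bar = alphas_cumprod.to(device)
    steps = torch.linspace(T - 1, 0, num_steps).long().tolist()  # strided subset, noisy -> clean
    x = torch.randn(shape, device=device)                        # start from pure noise
    for i, t in enumerate(steps):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                   # predicted noise at this level
        x0_pred = (x - (1 - a_bar[t]).sqrt() * eps) / a_bar[t].sqrt()  # implied clean sample
        if i + 1 < len(steps):
            t_prev = steps[i + 1]
            # Jump straight to the next retained timestep (eta = 0, fully deterministic).
            x = a_bar[t_prev].sqrt() * x0_pred + (1 - a_bar[t_prev]).sqrt() * eps
        else:
            x = x0_pred                                           # final denoised output
    return x
```

With num_steps=20 this replaces roughly 1,000 denoising passes with 20 while reusing the same trained model, which is the kind of shortcut the continuous-time view makes principled.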
Trade-offs: Excels in Images, Trails Text Autoregressives
Diffusion's strengths shine when data is scarce: multiple noise levels yield varied views of a single sample. But inference inefficiency (originally 1,000 steps) and awkward fits with text embeddings leave it behind autoregressive models, which since GPT-3 (2020) have been trained on ever-larger corpora (now 10T+ tokens) and served with heavily optimized kernels (vLLM and SGLang are autoregression-focused). Diffusion text models like Mercury have had far less R&D time and infrastructure, despite their speed potential. Nvidia Grok-3-like SRMs now match throughput. Yann LeCun calls autoregressive generation theoretically inferior, yet its dominance persists thanks to data and compute abundance and the maturity of text tooling. Use diffusion for low-data image/video generation; autoregressive models still scale better on massive text corpora.