Diffusion: Data-Efficient Framework Outshining Autoregressives on Scarce Data
Diffusion is a training framework—not architecture—that creates extra samples by gradually noising clean data over 1,000 steps, outperforming autoregressives on 25-100M tokens where data is limited but compute abundant; lags in text due to slow inference and infrastructure.




