The Architectural Shift: Diffusion vs. Autoregression
Traditional LLMs (GPT-4, Gemini) operate autoregressively, generating one token at a time. This is inherently bottlenecked by memory bandwidth: the chip must stream the full set of model weights and the KV cache from memory for every single token.
Text diffusion flips this paradigm. Instead of predicting the next token, the model initializes a block of tokens as random noise and iteratively refines that canvas over multiple denoising steps. Because each step updates the whole block at once, the weights are streamed once per step rather than once per token, so the model performs far fewer memory transfers than an autoregressive model. Generating 256 tokens in roughly 24 steps means about 24 forward passes instead of 256, roughly a 10x reduction in memory movement, which leads to much lower latency.
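The 10x figure can be checked with a back-of-the-envelope count. The sketch below assumes that every forward pass streams the weights exactly once and ignores KV-cache traffic and other overheads. The function names are illustrative and do not come from the presentation.

```python
# Count full weight reads from memory for one 256-token block.
# Assumption: each forward pass streams the model weights once;
# KV-cache traffic and other overheads are ignored.

def weight_passes_autoregressive(num_tokens: int) -> int:
    """One forward pass (one full weight read) per generated token."""
    return num_tokens

def weight_passes_diffusion(num_steps: int) -> int:
    """One forward pass per denoising step, regardless of block size."""
    return num_steps

block_tokens = 256   # block size quoted in the text
denoise_steps = 24   # approximate step count quoted in the text

ar = weight_passes_autoregressive(block_tokens)
diff = weight_passes_diffusion(denoise_steps)
reduction = ar / diff

print(f"autoregressive: {ar} passes, diffusion: {diff} passes")
print(f"reduction in weight reads: {reduction:.1f}x")  # 10.7x
```

Under these assumptions the ratio is 256 / 24, or about 10.7, which matches the rounded 10x figure.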
Bidirectional Reasoning and Self-Correction
Because diffusion models are not restricted by causal (left-to-right) attention, they can attend to future tokens within the generation window. This enables a unique capability: self-correction.
In a demonstration, a diffusion model initially guessed an incorrect answer to a complex math problem. However, as it continued its forward passes and completed the reasoning steps, it was able to look back at its initial output and correct the answer. Standard autoregressive models, by contrast, are often forced to "stick to their guns" once a token is committed, even if subsequent reasoning proves the initial token wrong.
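The difference between causal and bidirectional attention can be illustrated with attention masks. This is a minimal sketch in plain Python, not the model's actual implementation. The helper names are hypothetical, and `mask[i][j]` is assumed to mean that position i may attend to position j.

```python
# Compare which positions are visible under causal vs. bidirectional
# attention. Convention assumed here: mask[i][j] is True when position i
# may attend to position j.

def causal_mask(n: int) -> list[list[bool]]:
    """Left-to-right mask: position i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int) -> list[list[bool]]:
    """Full mask: every position sees every position in the block."""
    return [[True] * n for _ in range(n)]

def visible_positions(mask: list[list[bool]], i: int) -> list[int]:
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]

n = 5
causal = causal_mask(n)
full = bidirectional_mask(n)

print(visible_positions(causal, 1))  # [0, 1]
print(visible_positions(full, 1))    # [0, 1, 2, 3, 4]
```

With the full mask, an early position (such as a first guess at the answer) can be re-predicted in a later denoising step while attending to reasoning tokens that appear after it. The causal mask never exposes those later tokens to it.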
Dynamic and Adaptive Computation
Diffusion models allow for a flexible compute budget.
- Monotonic quality: Generally, increasing the number of denoising steps (forward passes) improves quality, as the model has more opportunities to refine its output.
- Adaptive computation: The model can be trained to determine when it has reached a sufficient level of confidence. Simple tasks (e.g., reciting digits of Pi) may take only 4 steps, while complex tasks (e.g., explaining quantum mechanics) may automatically trigger 30+ steps. This allows the model to spend compute resources only where the complexity of the task demands it.
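The adaptive stopping rule can be sketched as a loop that halts once confidence crosses a threshold. The confidence model below is a toy assumption: each step closes a fixed fraction of the remaining gap to full confidence. A real model would report confidence from its own predictions. The function name, the threshold, and the gain values are all hypothetical, with gains chosen only to mirror the easy and hard cases in the text.

```python
# Toy early-stopping loop for denoising steps.
# Assumption: confidence improves by a fixed fraction of the remaining gap
# on every step; real systems would measure it from model outputs.

def steps_until_confident(initial_confidence: float,
                          gain_per_step: float,
                          threshold: float = 0.95,
                          max_steps: int = 64) -> int:
    """Return how many denoising steps run before confidence >= threshold."""
    confidence = initial_confidence
    for step in range(1, max_steps + 1):
        # One denoising step: close part of the gap to full confidence.
        confidence += gain_per_step * (1.0 - confidence)
        if confidence >= threshold:
            return step  # confident enough, stop early
    return max_steps     # compute budget exhausted

# Hypothetical gains: an easy task converges quickly, a hard one slowly.
easy = steps_until_confident(initial_confidence=0.2, gain_per_step=0.6)
hard = steps_until_confident(initial_confidence=0.2, gain_per_step=0.08)

print(f"easy task: {easy} steps, hard task: {hard} steps")
```

The `max_steps` cap bounds the worst-case latency, so a task that never reaches the threshold still terminates.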
The Trade-off: Throughput and Cost
Despite the latency benefits, text diffusion is not yet the industry standard for large-scale serving. Autoregressive models are highly efficient at high batch sizes, where they can saturate GPU/TPU compute cores. Diffusion models run multiple forward passes over the whole block for each query, so they become compute-bound at much smaller batch sizes, leading to lower overall throughput and higher serving costs at scale.
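A simple roofline-style model makes the trade-off concrete. It is a rough sketch under strong assumptions: each forward pass takes the larger of its memory time and its compute time, compute costs about 2 FLOPs per parameter per token, and KV-cache traffic, attention cost, and communication are ignored. The hardware and model numbers in `HW` are hypothetical and do not describe any particular chip or model.

```python
# Roofline-style throughput estimate for batched serving.
# Assumptions: pass time = max(memory time, compute time); ~2 FLOPs per
# parameter per token; KV cache, attention, and communication are ignored.

def pass_time(tokens_in_pass, params, bytes_per_param, bandwidth, flops):
    """Seconds for one forward pass over `tokens_in_pass` tokens."""
    memory_time = params * bytes_per_param / bandwidth   # stream weights once
    compute_time = 2 * params * tokens_in_pass / flops   # matmul work
    return max(memory_time, compute_time)

def ar_throughput(batch, **hw):
    """Tokens/sec when each pass emits one token per sequence."""
    return batch / pass_time(batch, **hw)

def diffusion_throughput(batch, block=256, steps=24, **hw):
    """Tokens/sec when `steps` passes refine a `block`-token canvas."""
    tokens_in_pass = batch * block
    return tokens_in_pass / (steps * pass_time(tokens_in_pass, **hw))

# Hypothetical model and accelerator, chosen only for illustration.
HW = dict(params=8e9, bytes_per_param=2, bandwidth=1e12, flops=4e14)

for batch in (1, 64, 512):
    ar = ar_throughput(batch, **HW)
    diff = diffusion_throughput(batch, **HW)
    print(f"batch={batch:4d}  autoregressive={ar:9.1f} tok/s  diffusion={diff:9.1f} tok/s")
```

With these made-up numbers, diffusion is about ten times faster at batch size 1, but its total throughput stops growing almost immediately, while the autoregressive model keeps scaling until it too becomes compute-bound.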
Future Applications: The 2,000 Token/Second Threshold
When generation speed reaches the 2,000 tokens-per-second range, the nature of UI/UX changes. The presentation demonstrates:
- On-the-fly generation: Web pages (Wikipedia, Reddit) where the HTML, text, and images are generated in real-time as the user interacts.
- Generative OS: An operating system interface where every click triggers the generation of the next screen.
- Vibe Coding: The ability to build functional applications (e.g., a To-Do app with sorting and dark mode) via voice commands in under 15 seconds.
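To get a feel for what 2,000 tokens per second allows, the arithmetic below converts between time budgets and token counts. The 15-second budget comes from the text. The 3,000-token page size is a hypothetical figure for illustration only, and the helper names are not from the presentation.

```python
# Convert between time budgets and token counts at a fixed generation speed.

def tokens_within(budget_seconds: float, tokens_per_second: float) -> int:
    """Upper bound on tokens generated within a time budget."""
    return int(budget_seconds * tokens_per_second)

def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time needed to generate a given number of tokens."""
    return num_tokens / tokens_per_second

SPEED = 2000  # tokens per second, the threshold named in the text

# A 15-second app build leaves room for at most this many generated tokens.
print(tokens_within(15, SPEED))                  # 30000

# Hypothetical page size: 3,000 tokens of HTML and text.
page_tokens = 3000
print(seconds_to_generate(page_tokens, SPEED))   # 1.5
```

The 30,000-token figure is an upper bound, since any time spent on voice input or rendering comes out of the same 15 seconds.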
Key Takeaways
- Latency vs. Throughput: Diffusion models excel at low-latency, single-user interactions but currently struggle with the high-throughput efficiency required for massive SaaS scale.
- Bidirectional Advantage: The ability to see "future" tokens allows diffusion models to perform self-correction, something autoregressive architectures struggle with once a token is committed.
- Adaptive Compute: Models can be trained to allocate more compute to difficult problems and less to trivial ones, optimizing the balance between speed and accuracy.
- Hardware Bottlenecks: The speed advantage of diffusion comes from reducing memory-bound operations (streaming weights) by performing more computation per memory transfer.
- New UX Paradigms: Ultra-low latency enables "generative interfaces" where the UI is not pre-built but synthesized on the fly based on user intent.
Notable Quotes
- "It can do self-corrected generation based on future tokens. It could... see that it got the answer incorrect and then go back and fix the reasoning and do it again."
- "It turns out that both GPUs and TPUs have a lot of flops and not that much bandwidth... because of that ratio, you can do the more flops you do for each streaming amount of data you put through, the better."
- "It's not just the same thing faster. It can really unlock some new, really new applications."
- "Gemini 2.5 Flash also made a mistake... and it actually just stuck to its guns and never changed it and said 36 plus 3 is 42. So it incorporated the error into its reasoning later."