DeepMind's Diffusion Model Training Secrets
Sander Dieleman of DeepMind explains why data curation trumps model tweaks, how latent autoencoders enable training at scale, and how diffusion denoises via spectral autoregression to excel at audiovisual generation.
Data Curation Drives Quality Over Model Tweaks
High-quality generative models for audiovisual data hinge on meticulous data curation, which is often more impactful than architectural or optimization changes. Sander emphasizes that research incentives historically discouraged data scrutiny (standard datasets keep benchmarks comparable), but scaling demands unlearning this habit. Time spent on data yields better returns than hyperparameter tuning, though the details remain proprietary "secret sauce." Poor data leads to artifacts; curation filters noise, balances distributions, and ensures diversity, enabling models like Veo to produce coherent video.
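The talk offers no recipe (the specifics are the "secret sauce"), but a minimal sketch of the kind of filtering pass curation implies might look like the following; `quality_score`, the threshold, and the per-category cap are all hypothetical stand-ins:

```python
import hashlib

def curate(examples, quality_score, min_quality=0.5):
    """Illustrative curation pass: dedupe, drop low-quality items, balance categories.

    `examples` are dicts with 'data' (bytes) and 'category'; `quality_score`
    is a hypothetical heuristic or model returning a float in [0, 1].
    """
    seen, per_category = set(), {}
    for ex in examples:
        digest = hashlib.sha256(ex["data"]).hexdigest()
        if digest in seen:                       # filter exact duplicates
            continue
        seen.add(digest)
        if quality_score(ex) < min_quality:      # filter noisy/corrupted examples
            continue
        per_category.setdefault(ex["category"], []).append(ex)
    if not per_category:
        return []
    cap = min(len(v) for v in per_category.values())  # crude distribution balancing
    return [ex for group in per_category.values() for ex in group[:cap]]
```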
Tradeoff: Curation is labor-intensive and unpublished, but essential for production-scale results where off-the-shelf datasets fail.
"time spent on improving the data is sometimes a better investment of that time than actually sort of trying to tweak the model and trying to make the optimizer better or things like that." (Sander on why data curation outpaces model iteration, highlighting a shift from academic norms.)
Latent Representations Unlock Scalable Training
Raw pixels are infeasible at scale: a 30-second 1080p 30fps clip spans gigabytes per example. Instead, train autoencoders to compress pixels into latents, which retain the grid topology while shrinking tensors by up to two orders of magnitude: reduced spatial resolution (e.g., 256x256 RGB → 32x32x4 latents, as in Stable Diffusion) with extra channels carrying high-frequency detail.
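A quick back-of-the-envelope check of the Stable Diffusion example above:

```python
# Element counts for the Stable Diffusion-style example in the text.
pixels = 256 * 256 * 3   # 196,608 values for a 256x256 RGB image
latents = 32 * 32 * 4    # 4,096 values: 8x spatial downsampling, 4 channels
print(pixels / latents)  # 48.0, i.e. ~1.7 orders of magnitude per frame;
# video autoencoders also downsample in time, pushing total compression
# toward the two orders of magnitude cited in the talk.
```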
Process: The encoder squeezes the input through a bottleneck; the decoder reconstructs it. Latents preserve semantic content and spatial structure, matching neural nets' inductive biases, unlike standard codecs (JPEG/H.265), whose bitstreams discard the grid topology that convnets exploit. Visualizing latents via their principal components (as in the EQ-VAE paper) shows they abstract away local texture, not content: animal shapes, for example, remain discernible.
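A minimal sketch of that principal-component visualization, assuming a latent tensor of shape (C, H, W) with C ≥ 3; the function name is ours, not from the talk:

```python
import numpy as np

def latents_to_rgb(latents):
    """Project a (C, H, W) latent tensor to an (H, W, 3) image via PCA.

    Mirrors the visualization style described above: the top three principal
    components of the channel dimension become RGB for display.
    """
    c, h, w = latents.shape
    flat = latents.reshape(c, -1).T                 # one row per spatial position
    flat = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T                          # keep the top 3 components
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return ((proj - lo) / (hi - lo + 1e-8)).reshape(h, w, 3)
```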
Decision chain: Rejected pixel-direct training (works at small scale but runs out of memory) and standard codecs (they lose the grid topology). Chose learned autoencoders for a two-orders-of-magnitude efficiency gain, making video modeling feasible. Train diffusion on latents; decode samples after generation.
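A hedged sketch of this latent-diffusion pipeline, with hypothetical `encoder`, `decoder`, and `denoiser` modules and a deliberately simplified noise schedule (real systems use tuned schedules and loss weightings):

```python
import torch

def train_step(encoder, denoiser, opt, x):
    """One latent-diffusion training step: encode, corrupt, predict the clean latent."""
    with torch.no_grad():
        z0 = encoder(x)                      # pixels -> latents (frozen autoencoder)
    sigma = torch.rand(z0.shape[0], device=z0.device)  # toy noise-level distribution
    zt = z0 + sigma.view(-1, 1, 1, 1) * torch.randn_like(z0)
    loss = ((denoiser(zt, sigma) - z0) ** 2).mean()    # denoiser predicts clean z0
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def generate(decoder, sampler, latent_shape):
    """Sampling runs entirely in latent space; pixels appear only at the end."""
    z = sampler(torch.randn(latent_shape))   # e.g., the loop sketched further below
    return decoder(z)
```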
Tradeoffs: Lossy (selectively discards fine detail); simpler than professional codecs, but preserving topology boosts generative fidelity.
"the latent are not really making abstraction of any semantic content of the image they're basically just sort of um abstracting the local texture and very fine grain structure right that's sort of the information that's sort of compressed and that's removed to some degree." (Sander explaining latent design preserves perceptual structure for modeling.)
Diffusion Mechanics: Denoising as Guided Optimization
Diffusion models corrupt data by gradually adding Gaussian noise, then train denoisers to reverse the corruption for sampling. Intuition: From a noisy x_t, predict the average clean x_0 (blurry, since the problem is ill-posed: many clean inputs map to the same noisy one). Take a small step toward it, add a trace of fresh noise to correct errors, and repeat over T steps, shrinking uncertainty from a broad region to a single sample.
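That loop, sketched under the assumption of a trained `denoiser(x, sigma)` that predicts the blurry average clean signal; the `churn` re-noising factor is illustrative, not a tuned sampler:

```python
import torch

@torch.no_grad()
def sample_loop(denoiser, shape, sigmas, churn=0.1):
    """Predict x0, step partway toward it, re-inject a little noise, repeat.

    `sigmas` is a decreasing noise schedule, e.g. torch.linspace(80.0, 0.0, 50).
    """
    x = sigmas[0] * torch.randn(shape)                    # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoiser(x, sigma)                       # blurry average clean estimate
        x = x0_hat + (sigma_next / sigma) * (x - x0_hat)  # small step toward x0_hat
        if sigma_next > 0:
            # fresh noise lets later steps correct errors in this x0_hat
            x = x + churn * sigma_next * torch.randn_like(x)
    return x
```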
Analogy: Like SGD in pixel space, local updates prevent overshooting. Autoregression (next-token prediction) fits language but forces an awkward raster order onto images and video; diffusion's parallel refinement suits spatiotemporal data.
Why chosen: Edges out autoregression on audiovisual tasks per parameter budget; flexible sampling.
"We're only going to take a small step and then ask a model again basically. Right? You can compare this to how uh optimization of neural networks works." (Sander likening diffusion sampling to optimizers, revealing iterative local refinement core.)
Spectral Autoregression: Coarse-to-Fine Magic
Fourier analysis reveals why diffusion thrives on images and video: natural spectra follow power laws (straight lines on log-log plots of ImageNet samples). Gaussian noise has a flat spectrum, so corruption drowns high frequencies (details) first, then low ones (structure).
Denoising therefore recovers low frequencies before high ones, which Sander dubs "spectral autoregression." Generation starts coarse (semantics) and refines toward detail, implicitly weighting perceptually important scales; global structure emerges before texture, unlike raster-order autoregression.
Observation: The spectrum of image plus noise hugs the image spectrum until the noise floor dominates. Sampling inverts the corruption: a low-frequency sketch first, then high-frequency polish.
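A sketch of how to see this yourself: compute a radially averaged power spectrum, synthesize a 1/f-style image as a stand-in for natural data, and compare clean versus noisy spectra on log-log axes:

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a grayscale (H, W) image."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2).astype(int)       # frequency radius per bin
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

# Synthesize a 1/f-style image (filtered white noise) as a stand-in for real data.
freqs = np.fft.fftfreq(256)
fy, fx = np.meshgrid(freqs, freqs, indexing="ij")
amp = 1.0 / np.maximum(np.hypot(fy, fx), 1.0 / 256)     # ~1/f amplitude falloff
img = np.fft.ifft2(amp * np.fft.fft2(np.random.randn(256, 256))).real
noisy = img + np.random.randn(256, 256)
# On log-log axes, radial_power_spectrum(img) is near-linear (a power law);
# the noisy curve hugs it at low frequencies and flattens where flat-spectrum
# noise wins -- details drown before structure does.
```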
Tradeoffs: Multi-step (versus one autoregressive pass per token), but each step updates the whole signal in parallel and the step count is controllable; error accumulation is mitigated by re-noising.
"diffusion is basically spectral auto reggression, right? Because it's essentially allowing you to generate images from coarse defined, right? You start with the low frequencies and then you gradually add higher and higher frequencies." (Sander's key insight tying frequency dynamics to generation intuition.)
Architectures, Scaling, Sampling, Distillation & Control
Denoisers use U-Nets (an architecture originally designed for segmentation): simple noisy-in, clean-out predictors. Scaling gets only a brief mention: latent video models demand massive compute. Sampling is more flexible than autoregression: variable step counts, guidance.
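A toy U-Net-shaped denoiser to make the structure concrete (production models add residual blocks, attention, and learned noise-level embeddings; the conditioning-by-extra-channel trick here is ours, for brevity):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-shaped denoiser: encode down, decode up, skip connection."""

    def __init__(self, channels=4, width=64):
        super().__init__()
        self.inp = nn.Conv2d(channels + 1, width, 3, padding=1)   # +1 noise channel
        self.down = nn.Conv2d(width, width * 2, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(width * 2, width * 2, 3, padding=1)
        self.up = nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1)
        self.out = nn.Conv2d(width * 2, channels, 3, padding=1)   # width*2 after skip
        self.act = nn.SiLU()

    def forward(self, x, sigma):
        # Broadcast the noise level as an extra input channel (crude conditioning).
        s = torch.full_like(x[:, :1], float(sigma))
        h0 = self.act(self.inp(torch.cat([x, s], dim=1)))
        h1 = self.act(self.mid(self.act(self.down(h0))))
        h2 = self.act(self.up(h1))
        return self.out(torch.cat([h2, h0], dim=1))               # skip connection
```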
Distillation accelerates sampling: train a student to mimic the teacher in fewer steps (step reduction, not size reduction). Control signals (text and more) steer generation via classifier guidance and gradients, making models "do our bidding."
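As one concrete instance of guidance, here is the classifier-free guidance rule widely used in diffusion systems (the talk does not specify this exact variant); `denoiser` and its conditioning argument are hypothetical:

```python
import torch

def guided_denoise(denoiser, x, sigma, cond, guidance_scale=4.0):
    """Classifier-free guidance: extrapolate from unconditional toward conditional.

    `denoiser(x, sigma, cond)` predicts the clean signal; passing cond=None
    gives the unconditional prediction. guidance_scale=1 recovers plain
    conditional sampling; larger values trade diversity for prompt adherence.
    """
    uncond = denoiser(x, sigma, None)
    cond_pred = denoiser(x, sigma, cond)
    return uncond + guidance_scale * (cond_pred - uncond)
```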
Progression: Pixels → latents → diffusion → optimized sampling and control. Failures implied: early pixel-space training ran out of memory; uncurated data underperforms.
"there's there's there sort of more stuff you can do with diffusion models than you can do with autogressive models." (Sander on diffusion's sampling regime advantages for practical use.)
Key Takeaways
- Prioritize data curation over model tweaks; it's the highest-ROI step at scale.
- Use learned autoencoders for latents: compress by up to two orders of magnitude while preserving grid topology and semantic content.
- View diffusion as spectral autoregression: Low-to-high freq generation matches perceptual priorities.
- Sample iteratively: Small denoise steps + re-noise prevent error accumulation, like SGD in latent space.
- Reject standard codecs; learned latents preserve the inductive biases convnets rely on.
- For video, latents handle temporal redundancy best, making modeling feasible where raw pixels fail.
- Distill for speed: fewer sampling steps with little quality loss.
- Leverage diffusion's control flexibility (guidance) for conditioned generation.
- Analyze spectra: power laws explain how diffusion exploits the structure of natural media.
- Check sander.ai for diffusion intuition blogs.