DiffusionGemma: Parallel Text Generation via Diffusion

Parallel Decoding via Text Diffusion

DiffusionGemma departs from the standard autoregressive paradigm—where tokens are generated sequentially, one at a time—by utilizing text diffusion. Inspired by image generation models, it begins with a canvas of random placeholder tokens and iteratively refines them in parallel. This approach allows the model to finalize approximately 15–20 tokens per forward pass, significantly increasing throughput on dedicated GPUs.

Architecture and Performance

Built on the Gemma 4 26B-A4B backbone, the model is a Mixture of Experts (MoE) architecture that activates only 3.8B parameters during inference. Key technical features include:

Bidirectional Attention: Unlike autoregressive models restricted to causal (backward-looking) attention, DiffusionGemma uses bidirectional attention during denoising, allowing tokens to attend to the entire sequence.
Self-Correction: The model can re-noise and refine low-confidence tokens during the denoising process, a capability impossible in standard autoregressive decoding where tokens are committed once generated.
Hardware Utilization: By shifting the bottleneck from memory bandwidth to compute, the model achieves 1000+ tokens per second on an NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090.
Hybrid Inference: For longer sequences, it employs 'Block Autoregressive Diffusion,' where 256-token blocks are denoised in parallel and then committed to the KV cache before starting a new canvas.

Trade-offs and Use Cases

Google explicitly positions DiffusionGemma for local, low-latency, single-user workloads like in-line editing and rapid iteration. It is not intended to replace standard autoregressive models for high-quality production tasks, as its overall output quality is lower. Furthermore, the speed advantages diminish in high-concurrency cloud environments where autoregressive models already saturate compute resources efficiently. When quantized, the model fits within 18GB of VRAM, making it accessible for high-end consumer hardware.

Parallel Decoding via Text Diffusion

Architecture and Performance

Trade-offs and Use Cases

More from AI & LLMs

Scaling Transformer Training to 5 Million Tokens

Benchmarking LLM Compression: FP8, GPTQ, and SmoothQuant

FlashAttention: 2-4x Faster Exact Attention on GPUs

Ground Gemini 3 in PDB Geometry for Hallucination-Free Proteomics