NVIDIA's Nemotron-Labs-Diffusion: A Unified Tri-Mode LLM Architecture

A Unified Architecture for Flexible Inference

NVIDIA's Nemotron-Labs-Diffusion (NLD) is a model family that supports three distinct decoding modes with the same underlying weights. It is trained on a joint objective that combines standard autoregressive (AR) next-token prediction with block-wise diffusion denoising, so a single model covers deployment scenarios that would otherwise call for separate architectures.

AR Mode: Standard left-to-right generation, optimized for high-concurrency cloud environments.
Diffusion Mode: Denoises multiple tokens in parallel within fixed-length blocks, allowing for an adjustable accuracy-throughput tradeoff.
Self-Speculation Mode: Uses the diffusion pathway to draft candidate tokens and the AR pathway to verify them in a single forward pass, requiring no auxiliary draft models.
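
The draft-then-verify idea behind self-speculation can be sketched with toy stand-ins for the two pathways. This is a minimal illustration, not NVIDIA's implementation: `toy_ar_next`, `toy_draft`, and the greedy acceptance rule are hypothetical, and the verifier is called once per position for readability, whereas the real model scores the whole draft in a single forward pass.

```python
from typing import Callable, List, Tuple

def verify_draft(prefix: List[int], draft: List[int],
                 ar_next: Callable[[List[int]], int]) -> List[int]:
    """Greedy verification (assumed rule): keep the longest draft prefix the
    AR pathway agrees with, then append one AR token."""
    accepted: List[int] = []
    for tok in draft:
        expected = ar_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the AR token instead
            return accepted
        accepted.append(tok)
    accepted.append(ar_next(prefix + accepted))  # whole draft accepted: bonus token
    return accepted

def toy_ar_next(ctx: List[int]) -> int:
    """Hypothetical AR pathway: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10

def toy_draft(ctx: List[int], block: int = 4) -> List[int]:
    """Hypothetical diffusion drafter: right on three tokens, wrong on the fourth."""
    out, last = [], ctx[-1]
    for i in range(block):
        last = (last + 1) % 10
        out.append(last if i < 3 else (last + 5) % 10)
    return out

def generate(prompt: List[int], n_new: int,
             draft_fn: Callable[[List[int]], List[int]],
             ar_next: Callable[[List[int]], int]) -> Tuple[List[int], int]:
    """Run draft/verify steps until n_new tokens exist; return tokens and step count."""
    seq, steps = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        seq += verify_draft(seq, draft_fn(seq), ar_next)
        steps += 1
    return seq[:len(prompt) + n_new], steps

tokens, steps = generate([0], 8, toy_draft, toy_ar_next)
print(tokens, steps)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 2
```

In this toy run, eight tokens are produced in two verification steps, and the output matches what pure AR decoding would have generated, which is the property that lets speculation raise throughput without changing results under greedy decoding.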

Training and Performance Gains

The model uses a two-stage training process: an initial 1 trillion tokens for AR priors, followed by 300 billion tokens using a joint AR-diffusion objective (α = 0.3). This training strategy yields a 16.05% average accuracy improvement over the baseline.
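
One plausible reading of the joint objective is a weighted sum of the two losses. The form below is an assumption for illustration: the summary does not say which term α scales or whether the weights are normalized, and the diffusion loss is left abstract.

```latex
% Assumed form: alpha scales the diffusion term (not confirmed by the source).
\mathcal{L}_{\text{joint}}
  = \mathcal{L}_{\text{AR}} + \alpha \, \mathcal{L}_{\text{diff}},
  \qquad \alpha = 0.3

% Standard next-token cross-entropy over a sequence x_1, \dots, x_T:
\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```

Under this reading, both terms update the same weights in every stage-two step, which is what allows one checkpoint to serve AR and diffusion decoding.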

In self-speculation mode, the 8B-parameter model achieves 5.99x tokens-per-forward (TPF) with accuracy comparable to standard AR models. A LoRA adapter targeting the attention module's o_proj layer improves this further, raising average acceptance length from 5.46 to 6.82 tokens per draft step (roughly a 25% increase). This architecture significantly outperforms existing multi-token prediction (MTP) methods like Eagle3, particularly in structured tasks such as coding and mathematics, where acceptance lengths can exceed 8 tokens.
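
For readers unfamiliar with LoRA, the sketch below shows how a low-rank update modifies a single projection such as o_proj. It uses the generic LoRA formulation, not the adapter NVIDIA trained: the shapes, the `scale` value, and the function name `lora_o_proj` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product for small illustrative shapes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_o_proj(x: Matrix, w: Matrix, a: Matrix, b: Matrix, scale: float) -> Matrix:
    """Compute x @ W + scale * (x @ A) @ B.

    Shapes: x [n, d], w [d, d], a [d, r], b [r, d], with rank r much smaller than d.
    The base weight w stays frozen; only a and b would be trained.
    """
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    return [[bv + scale * dv for bv, dv in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

x = [[1.0, 2.0]]                 # one token, hidden size 2
w = [[1.0, 0.0], [0.0, 1.0]]     # frozen base projection (identity for clarity)
a = [[1.0], [0.0]]               # down-projection to rank 1

# With B initialised to zero the adapter is a no-op.
print(lora_o_proj(x, w, a, [[0.0, 0.0]], scale=2.0))   # [[1.0, 2.0]]
# A non-zero B shifts the output by a rank-1 correction.
print(lora_o_proj(x, w, a, [[0.5, -0.5]], scale=2.0))  # [[2.0, 1.0]]
```

Because the correction is confined to one layer and a small rank, an adapter like this adds few parameters relative to the 8B base model.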

Deployment and Practical Application

Because the model uses a unified architecture, developers can switch between decoding modes at inference time by changing the attention pattern, without reloading weights. The system is compatible with standard serving frameworks like vLLM and SGLang. For single-user or edge deployment, the LoRA-enhanced self-speculation mode is recommended to maximize throughput, while high-concurrency APIs should continue to use standard AR decoding, which already saturates GPU compute when many requests are batched together.
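
To make "changing the attention pattern" concrete, the sketch below builds two boolean masks over the same sequence. The summary does not specify the exact masks NLD uses, so the block-wise mask here assumes the common block-diffusion convention: tokens attend bidirectionally within their own block and causally to earlier blocks. Function names are hypothetical.

```python
from typing import List

Mask = List[List[bool]]

def causal_mask(n: int) -> Mask:
    """AR pattern: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def block_mask(n: int, block: int) -> Mask:
    """Assumed diffusion pattern: position i may attend to every position
    whose block index is not later than its own."""
    return [[(j // block) <= (i // block) for j in range(n)] for i in range(n)]

def show(mask: Mask) -> str:
    """Render a mask as rows of 1s and 0s for quick inspection."""
    return "\n".join("".join("1" if v else "0" for v in row) for row in mask)

print(show(causal_mask(4)))
print()
print(show(block_mask(4, block=2)))
```

With a block size of 1 the block-wise mask reduces to the causal mask, which illustrates how one set of weights could serve both modes when only the mask changes between requests.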
