A Unified Architecture for Flexible Inference

NVIDIA's Nemotron-Labs-Diffusion (NLD) introduces a model family that supports three distinct decoding modes using the same underlying weights. By training on a joint objective—combining standard autoregressive (AR) next-token prediction with block-wise diffusion denoising—the model eliminates the need for separate architectures for different deployment scenarios.

  • AR Mode: Standard left-to-right generation, optimized for high-concurrency cloud environments.
  • Diffusion Mode: Denoises multiple tokens in parallel within fixed-length blocks, allowing for an adjustable accuracy-throughput tradeoff.
  • Self-Speculation Mode: Uses the diffusion pathway to draft candidate tokens and the AR pathway to verify them in a single forward pass, requiring no auxiliary draft models.

Training and Performance Gains

The model uses a two-stage training process: an initial 1 trillion tokens for AR priors, followed by 300 billion tokens using a joint AR-diffusion objective (α = 0.3). This training strategy yields a 16.05% average accuracy improvement over the baseline.

In self-speculation mode, the 8B parameter model achieves 5.99x tokens-per-forward (TPF) with accuracy comparable to standard AR models. Performance is further enhanced by a LoRA adapter targeting the attention module's o_proj layer, which increases average acceptance length from 5.46 to 6.82 tokens per draft step. This architecture significantly outperforms existing multi-token prediction (MTP) methods like Eagle3, particularly in structured tasks such as coding and mathematics, where acceptance lengths can exceed 8x.

Deployment and Practical Application

Because the model uses a unified architecture, developers can switch between decoding modes at inference time by changing the attention pattern, without reloading weights. The system is compatible with standard serving frameworks like vLLM and SGLang. For single-user or edge deployment, the LoRA-enhanced self-speculation mode is recommended to maximize throughput, while high-concurrency APIs should continue to utilize standard AR decoding to fully saturate GPU compute resources.