NVIDIA's NVFP4: 4-Bit Pretraining at Scale

The NVFP4 Methodology

NVFP4 is a 4-bit microscaling format designed to overcome the dynamic range limitations of standard 4-bit quantization during long-horizon pretraining. Unlike MXFP4, which shares one power-of-two scale across 32 elements, NVFP4 uses 16-element blocks with E4M3 scale factors, so each scale covers fewer values and is itself stored with more precision. Scaling is two-level: an E4M3 scale per block and an FP32 scale per tensor, so that the absolute maximum (amax) value in each block is represented with near-FP8 fidelity.
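
The two-level scaling described above can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not NVIDIA's implementation: the function name quantize_nvfp4 is hypothetical, the per-block scale is kept in float32 instead of being rounded to E4M3, and ties are broken toward the smaller magnitude rather than to nearest-even.

```python
import numpy as np

# Magnitudes representable by one FP4 (E2M1) element; the sign is stored separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest finite E4M3 magnitude
BLOCK = 16        # NVFP4 block size


def quantize_nvfp4(x):
    """Fake-quantize a 1-D array whose length is a multiple of 16.

    Returns (dequantized values, FP4 codes before rescaling).
    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=np.float32)
    blocks = x.reshape(-1, BLOCK)

    # Level 1: one FP32 scale per tensor, chosen so that every per-block
    # scale lands inside the E4M3 range [0, 448].
    tensor_amax = np.abs(blocks).max()
    tensor_scale = tensor_amax / (FP4_MAX * E4M3_MAX) if tensor_amax > 0 else 1.0

    # Level 2: one scale per 16-element block, mapping the block amax to 6.
    # Assumption: left in float32 here; real NVFP4 stores it as E4M3.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = block_amax / FP4_MAX / tensor_scale
    block_scale = np.where(block_scale == 0, 1.0, block_scale)

    # Scale into FP4 range and snap each magnitude to the nearest grid point.
    scaled = blocks / (block_scale * tensor_scale)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * FP4_GRID[idx]

    dequant = codes * block_scale * tensor_scale
    return dequant.reshape(x.shape), codes.reshape(x.shape)


if __name__ == "__main__":
    values = np.linspace(-3.0, 3.0, 32)
    dequant, codes = quantize_nvfp4(values)
    print("max abs error:", np.abs(dequant - values).max())
```

Because each block's amax is mapped onto the largest FP4 value, the largest element of every block survives quantization almost unchanged in this sketch; the remaining error falls on the smaller elements of the block.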

Stability Techniques for 4-Bit Training

Directly quantizing linear layer GEMMs to 4-bit causes training divergence. NVIDIA’s methodology stabilizes the process through four specific interventions:

Selective High Precision: Approximately 16% of linear layers (specifically the first two and final eight blocks) are kept in BF16 to handle dynamic range sensitivity.
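Random Hadamard Transforms (RHT): Tiles of the inputs to the weight-gradient GEMM are multiplied by a 16x16 Hadamard matrix with random sign flips. This spreads outliers across the tile so the values follow an approximately Gaussian distribution that is easier to represent in 4 bits, improving convergence for large models.
2D Block Scaling: Weights are scaled in 16x16 blocks so that a weight matrix and its transpose quantize to the same values. This keeps the forward and backward passes consistent; with 1x16 blocks, the transposed tensor used in the backward pass would be quantized differently, breaking the chain rule.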
Stochastic Rounding: Applied exclusively to gradients to remove the systematic bias introduced by round-to-nearest-even methods.
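
Three of the interventions above (RHT, 2D block scaling and stochastic rounding) can be sketched in a few lines of NumPy. The helper names hadamard, random_hadamard, tile_amax_2d and stochastic_round_fp4 are hypothetical, and the code assumes the standard Sylvester construction of the Hadamard matrix and the FP4 (E2M1) magnitude grid; it is a simulation for intuition, not the kernels used in training.

```python
import numpy as np

# Magnitudes representable by one FP4 (E2M1) element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def random_hadamard(n, rng):
    """Hadamard matrix with random column sign flips; it stays orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs  # broadcasting scales column j by signs[j]


def tile_amax_2d(w, tile=16):
    """Amax of every tile x tile block of a 2-D matrix (basis for 2D scales)."""
    rows, cols = w.shape
    tiles = np.abs(w).reshape(rows // tile, tile, cols // tile, tile)
    return tiles.max(axis=(1, 3))


def stochastic_round_fp4(x, rng):
    """Round to one of the two neighbouring FP4 values, with the probability
    of rounding up equal to the fractional distance, so the expected
    result equals the input (for |x| <= 6)."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag, side="right"), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    return np.sign(x) * np.where(rng.random(mag.shape) < p_up, hi, lo)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = random_hadamard(16, rng)

    # A single outlier is spread evenly over all 16 positions of the tile.
    tile = np.zeros(16)
    tile[3] = 16.0
    print("after RHT:", np.abs(tile @ h))

    # The transform cancels inside the GEMM: (A H)(H^T B) equals A B.
    a, b = rng.normal(size=(4, 16)), rng.normal(size=(16, 4))
    print("GEMM preserved:", np.allclose((a @ h) @ (h.T @ b), a @ b))

    # 16x16 tiles give the same scales for a matrix and its transpose.
    w = rng.normal(size=(32, 48))
    print("transpose-consistent:", np.array_equal(tile_amax_2d(w.T), tile_amax_2d(w).T))

    # Stochastic rounding is unbiased on average.
    print("mean of rounded 2.5:", stochastic_round_fp4(np.full(100_000, 2.5), rng).mean())
```

The orthogonality of the transform is what makes it safe to apply: it changes the distribution of the values being quantized without changing the exact product. Stochastic rounding trades a little extra variance for zero bias, which matches the choice to apply it to gradients only.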

Performance and Scaling

Validated on a 12B hybrid Mamba-Transformer trained over 10 trillion tokens, NVFP4 achieved downstream accuracy comparable to the FP8 baseline (e.g., 62.58% for NVFP4 vs 62.62% for FP8 on MMLU-Pro). Coding benchmarks showed a slight gap, which was mitigated by a precision-switching technique: moving the forward pass to BF16 at 8.2T tokens reduced the relative loss error from 1.5% to 0.5%. Compared to MXFP4, NVFP4 showed better loss convergence: MXFP4 needed about 36% more training tokens to reach the same loss.
