The NVFP4 Methodology

NVFP4 is a 4-bit microscaling format designed to overcome the dynamic range limitations of standard 4-bit quantization during long-horizon pretraining. Unlike MXFP4, which scales 32-element blocks, NVFP4 uses 16-element blocks and a two-level scaling architecture: an E4M3 scale factor per block and an FP32 scale per tensor. The smaller block narrows the range of values each scale has to cover, and the fractional E4M3 scale lets the absolute maximum (amax) value in each block be represented at near-FP8 fidelity.
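
The two-level scaling described above can be sketched in NumPy as a "fake quantization" round trip (quantize, then dequantize back to float). This is a simplified simulation under stated assumptions, not NVIDIA's kernel: the helper names (round_to_e4m3, round_to_fp4, nvfp4_fake_quant) are hypothetical, the per-tensor scale is chosen so that the largest block scale maps to the E4M3 maximum of 448, ties on the FP4 grid go to the smaller magnitude rather than to even, and everything is computed in float64 instead of packed 4-bit storage.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite E4M3 value


def round_to_e4m3(x):
    """Round non-negative values to the nearest E4M3-representable value."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, E4M3_MAX)
    # Exponent of each value, floored at the minimum normal exponent (-6),
    # so smaller inputs land on the subnormal grid (multiples of 2**-9).
    exp = np.floor(np.log2(np.maximum(x, 2.0 ** -6)))
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per binade
    return np.round(x / step) * step


def round_to_fp4(x):
    """Snap values to the nearest E2M1 magnitude (ties go to the smaller one)."""
    idx = np.abs(np.abs(x)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]


def nvfp4_fake_quant(x, block=16):
    """Simulate NVFP4: quantize with two-level scaling, then dequantize."""
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, block)
    tensor_amax = np.abs(blocks).max()
    if tensor_amax == 0.0:
        return np.zeros_like(x)
    # Level 1: per-tensor scale (FP32 in the real format) chosen so the
    # largest per-block scale lands exactly on E4M3_MAX.
    tensor_scale = (FP4_MAX * E4M3_MAX) / tensor_amax
    # Level 2: per-block scale, rounded to E4M3 before it is stored.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    stored_scale = round_to_e4m3(block_amax / FP4_MAX * tensor_scale)
    decode_scale = stored_scale / tensor_scale
    safe = np.where(decode_scale > 0, decode_scale, 1.0)  # avoid 0-division
    q = round_to_fp4(blocks / safe)
    return (q * decode_scale).reshape(x.shape)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
y = nvfp4_fake_quant(x)
print("max abs error:", np.abs(y - x).max())
```

In this sketch the element holding the tensor-wide amax survives the round trip essentially unchanged, because its block scale maps exactly onto an E4M3 value. Other blocks pick up a small additional error from rounding their scale to E4M3.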

Stability Techniques for 4-Bit Training

Directly quantizing linear layer GEMMs to 4-bit causes training divergence. NVIDIA’s methodology stabilizes the process through four specific interventions:

  • Selective High Precision: Approximately 16% of linear layers (specifically the first two and final eight blocks) are kept in BF16 to handle dynamic range sensitivity.
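  • Random Hadamard Transforms (RHT): Inputs to the weight-gradient GEMM are multiplied, tile by tile, by a 16x16 Hadamard matrix with random sign flips. This spreads outliers across the tile into an approximately Gaussian distribution that quantizes with less error, improving convergence for large models.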
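  • 2D Block Scaling: Weights are scaled in 16x16 blocks so that a weight tensor and its transpose quantize to the same values. This keeps the forward and backward passes consistent; otherwise the backward pass, which uses the transposed weights, would see a different quantized tensor and break the chain rule.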
  • Stochastic Rounding: Applied exclusively to gradients to remove the systematic bias that round-to-nearest-even introduces.
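
Two of these interventions, the random Hadamard transform and stochastic rounding, are easy to illustrate in NumPy. The sketch below is illustrative only and rests on assumptions: the function names (hadamard, random_hadamard, stochastic_round_fp4) are hypothetical, a single random sign vector is reused for every 16-element tile, and rounding targets the unscaled E2M1 magnitude grid rather than a full NVFP4 pipeline.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def random_hadamard(x, signs):
    """Multiply each tile of len(signs) elements along the last axis by diag(signs) @ H."""
    d = signs.size
    hs = signs[:, None] * hadamard(d)
    tiles = x.reshape(*x.shape[:-1], -1, d)
    return (tiles @ hs).reshape(x.shape)


def stochastic_round_fp4(x, rng):
    """Round to a neighbouring FP4 magnitude with probability proportional to proximity."""
    mag = np.clip(np.abs(x), 0.0, FP4_GRID[-1])
    hi_idx = np.searchsorted(FP4_GRID, mag, side="left")
    lo_idx = np.maximum(hi_idx - 1, 0)
    lo, hi = FP4_GRID[lo_idx], FP4_GRID[hi_idx]
    gap = np.where(hi > lo, hi - lo, 1.0)  # guard the mag == 0 case
    p_up = (mag - lo) / gap
    rounded = np.where(rng.random(mag.shape) < p_up, hi, lo)
    return np.sign(x) * rounded


rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=16)

# A single outlier of magnitude 16 is spread evenly over the 16-element tile.
spike = np.zeros(16)
spike[3] = 16.0
spread = random_hadamard(spike, signs)

# The transform is orthogonal, so it cancels in a GEMM over the transformed axis.
a, b = rng.normal(size=(8, 64)), rng.normal(size=(5, 64))
gemm_plain = a @ b.T
gemm_rht = random_hadamard(a, signs) @ random_hadamard(b, signs).T

# Stochastic rounding of 2.3 yields 2.0 or 3.0, averaging close to 2.3,
# whereas round-to-nearest would return 2.0 every time.
sr_mean = stochastic_round_fp4(np.full(200_000, 2.3), rng).mean()
print(np.abs(spread).max(), np.allclose(gemm_plain, gemm_rht), sr_mean)
```

The outlier example shows why the transform helps a block-scaled format: the tile's amax drops from 16 to 4, so the block scale no longer has to stretch to cover one extreme value at the expense of the rest.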

Performance and Scaling

Validated on a 12B hybrid Mamba-Transformer trained on 10 trillion tokens, NVFP4 achieved downstream accuracy comparable to the FP8 baseline (e.g., 62.58% vs. 62.62% on MMLU-Pro). While coding benchmarks showed a slight performance gap, this was mitigated by a precision-switching technique: transitioning the forward pass to BF16 at 8.2T tokens reduced the relative loss error from 1.5% to 0.5%. Compared to MXFP4, NVFP4 demonstrated superior loss convergence; MXFP4 would need roughly 36% more tokens to reach the same loss, an overhead that NVFP4 avoids.