The Frequency Bias Problem in SGD

Language models are trained on data whose token frequencies are highly uneven. In standard Stochastic Gradient Descent (SGD), every parameter is updated with the same fixed learning rate, and a parameter tied to a specific token (such as its embedding) receives a nonzero gradient only when that token appears in the batch. This creates an optimization bottleneck: common tokens receive frequent gradient signals and converge quickly, while rare tokens, which may appear in only 0.1% of batches, receive too few updates. As a result, parameters associated with rare tokens often remain near their random initialization, and the model performs poorly on underrepresented data.
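A minimal sketch of this effect, assuming a toy setup that is not taken from the original experiment: each token owns a single scalar weight trained toward a target of 1.0 with a squared-error loss, and a token contributes a gradient only on the steps where it appears. The function name `sgd_train`, the starting weight of 0.0, the learning rate of 0.01, and the deterministic appearance schedule are all illustrative assumptions.

```python
def sgd_train(period, steps=1000, lr=0.01, target=1.0):
    """Plain SGD on one scalar weight whose token appears every `period` steps."""
    w = 0.0  # assumed starting point, standing in for a random initialization
    for step in range(steps):
        if step % period == 0:       # token is present in this batch
            grad = w - target        # d/dw of 0.5 * (w - target) ** 2
            w -= lr * grad           # same fixed learning rate for every token
    return w


common = sgd_train(period=1)      # token present in every batch
rare = sgd_train(period=1000)     # token present in 0.1% of batches
print(f"common token weight: {common:.4f}")
print(f"rare token weight:   {rare:.4f}")
```

With these settings the rare token is updated once in 1000 steps, so its weight moves only 1% of the way to the target, while the common token's weight ends up essentially at 1.0.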

How Adam Normalizes Learning Dynamics

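Adam addresses this imbalance through per-parameter adaptive scaling, often described as variance normalization. Unlike SGD, Adam maintains for each parameter independently an exponential moving average of the squared gradients (the second moment, an uncentered estimate of variance), alongside a moving average of the gradients themselves.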

  • Variance Tracking: Adam tracks an exponentially decaying average of squared gradient magnitudes for every parameter.
  • Adaptive Scaling: When applying an update, Adam divides the step by the square root of this second-moment (variance) estimate plus a small constant, which is equivalent to giving each parameter its own effective learning rate.
  • Automatic Amplification: For rare tokens, the estimate stays very small because nonzero gradients are infrequent and the average decays between them. Dividing by a small value automatically amplifies the effective learning rate. In a controlled experiment, rare tokens received an effective learning rate over 40 times higher than common tokens, allowing them to converge to the target weight (1.0) despite receiving sparse signals.
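
The scaling described in the bullets above can be sketched as follows. This is a simplified illustration, not the experiment's code: it tracks only Adam's second-moment average `v` for one parameter, feeds it a unit gradient whenever the token appears and zero otherwise, and reports the resulting step scale `lr / (sqrt(v_hat) + eps)`. The helper name `step_scale`, the unit gradients, and the appearance schedule are assumptions; `beta2 = 0.999` and `eps = 1e-8` are the commonly used Adam defaults.

```python
import math


def step_scale(period, steps=1000, lr=0.01, beta2=0.999, eps=1e-8):
    """Return lr / (sqrt(v_hat) + eps) after `steps` Adam-style updates of v."""
    v = 0.0
    for t in range(1, steps + 1):
        grad = 1.0 if (t - 1) % period == 0 else 0.0   # zero when token is absent
        v = beta2 * v + (1 - beta2) * grad * grad      # moving average of grad^2
    v_hat = v / (1 - beta2 ** steps)                   # bias correction
    return lr / (math.sqrt(v_hat) + eps)


common_scale = step_scale(period=1)      # gradient on every step
rare_scale = step_scale(period=1000)     # gradient on 0.1% of steps
amplification = rare_scale / common_scale
print(f"amplification: {amplification:.1f}x")
```

In this toy setting the rare parameter's step scale comes out roughly 41 times larger than the common one's. That figure depends on the assumed schedule and on `beta2`, so it should not be read as a reproduction of the experiment's measurement.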

Experimental Evidence

In a comparative study using a six-token vocabulary with frequencies spanning four orders of magnitude, SGD left the rare-token weights at only 0.15–0.53, while Adam pushed all weights toward the target of 1.0. The results indicate that Adam acts as an "automatic equalizer": it needs no manual per-token tuning to compensate for frequency imbalance, because the variance normalization term derives the necessary scaling directly from each parameter's gradient history.
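
For readers who want to experiment, here is a small side-by-side sketch. It is not the study's code or configuration: it uses four hypothetical tokens rather than six, deterministic appearance periods, one scalar weight per token, and the same learning rate of 0.01 for both optimizers. The Adam branch follows the standard update with bias correction and is applied on every step, with a zero gradient when the token is absent; `train` and `PERIODS` are illustrative names.

```python
import math

PERIODS = [1, 10, 100, 1000]   # hypothetical: steps between appearances per token


def train(period, use_adam, steps=5000, lr=0.01, target=1.0,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one scalar weight toward `target` with SGD or Adam."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        present = (t - 1) % period == 0
        grad = (w - target) if present else 0.0   # gradient of 0.5 * (w - target)**2
        if use_adam:
            m = beta1 * m + (1 - beta1) * grad          # first-moment average
            v = beta2 * v + (1 - beta2) * grad * grad   # second-moment average
            m_hat = m / (1 - beta1 ** t)                # bias corrections
            v_hat = v / (1 - beta2 ** t)
            w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        else:
            w -= lr * grad
    return w


sgd_weights = [train(p, use_adam=False) for p in PERIODS]
adam_weights = [train(p, use_adam=True) for p in PERIODS]
for p, s, a in zip(PERIODS, sgd_weights, adam_weights):
    print(f"period {p:>4}: SGD {s:.3f}  Adam {a:.3f}")
```

Exact values depend on the assumed schedule and hyperparameters. The point is only the qualitative gap on the rarest token, where SGD makes five updates of 1% each while Adam's normalized steps are much larger.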