The Evolution of Scaling Predictability

Scaling laws describe how test loss $L$ depends empirically on training compute ($C$), model size ($N$, in parameters), and dataset size ($D$, in training tokens). The core insight is that $L$ decreases predictably as these variables increase, approximately following a power law. This predictability lets engineers estimate the requirements for larger models by fitting curves to smaller, cheaper training runs.
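To make the idea of fitting on small runs and extrapolating concrete, here is a minimal sketch using only the Python standard library. It assumes a pure power law $L = a \cdot C^{-b}$ with no irreducible-loss term, and the compute budgets and coefficients are synthetic values invented for illustration, not measurements from any paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-b)

# Synthetic "small runs" (assumption): losses generated from a known power law.
TRUE_A, TRUE_B = 20.0, 0.05
small_compute = [1e17, 1e18, 1e19, 1e20]  # FLOPs, hypothetical budgets
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

a_hat, b_hat = fit_power_law(small_compute, small_loss)
predicted_loss_at_1e22 = predict(a_hat, b_hat, 1e22)  # extrapolate 100x past the data
print(f"a={a_hat:.3f}, b={b_hat:.4f}, predicted L(1e22)={predicted_loss_at_1e22:.4f}")
```

Because the synthetic data is noiseless, the fit recovers the generating coefficients; with real runs the residual noise and any irreducible loss floor would make the extrapolation less certain.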

Early research, such as Hestness et al. (2017), established that generalization error follows a power law across diverse domains. Rosenfeld et al. (2020) further refined this by modeling error as a joint function of $N$ and $D$, providing a parametric approach to predict loss for configurations larger than those already trained.

Reconciling Kaplan and Chinchilla

In 2022, the field's view of how to allocate compute shifted significantly. Kaplan et al. (2020) had suggested that for a 10x increase in compute, one should scale model size by ~5.5x and training tokens by only ~1.8x. This implied that models should be large and trained on comparatively few tokens.
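The multipliers quoted above can be converted into power-law exponents with a one-line logarithm. The sketch below only does that arithmetic; the helper name `implied_exponent` is made up for this example.

```python
import math

def implied_exponent(multiplier, compute_factor=10.0):
    """Exponent p such that compute_factor**p == multiplier."""
    return math.log(multiplier) / math.log(compute_factor)

model_exp = implied_exponent(5.5)  # model-size multiplier per 10x compute
token_exp = implied_exponent(1.8)  # token multiplier per 10x compute
print(f"N ~ C^{model_exp:.2f}, D ~ C^{token_exp:.2f}, sum = {model_exp + token_exp:.2f}")
```

The two exponents sum to roughly 1, which is what one would expect if compute grows in proportion to the product $N \cdot D$ (a common approximation, stated here as an assumption).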

The Chinchilla paper (Hoffmann et al., 2022) overturned this by demonstrating that most large models at the time were severely undertrained. By training over 400 models, they concluded that $N$ and $D$ should scale at equal rates ($N \propto C^{0.5}$, $D \propto C^{0.5}$). The discrepancy between the two findings is largely attributed to the scale of the experiments: Kaplan et al. extrapolated from smaller models, while Chinchilla's experiments reached 10x larger scales. Additionally, Pearce & Song (2024) showed that embedding parameters, often ignored in early calculations, significantly skew the results at smaller scales, making the local power-law exponent appear higher than it is at larger scales.
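To see why embedding parameters matter mostly at small scale, the following sketch estimates their share of the total parameter count. It assumes the rough transformer approximation of $12 \cdot n_{layer} \cdot d_{model}^2$ non-embedding parameters and a single embedding matrix of size vocabulary times $d_{model}$; the three configurations and the vocabulary size are hypothetical.

```python
def param_counts(n_layer, d_model, vocab_size):
    """Return (non_embedding, embedding) parameter counts under a rough approximation."""
    non_embedding = 12 * n_layer * d_model ** 2  # attention + MLP weights, approximate
    embedding = vocab_size * d_model             # token embedding matrix only
    return non_embedding, embedding

def embedding_share(n_layer, d_model, vocab_size):
    """Fraction of total parameters that sit in the embedding matrix."""
    non_embedding, embedding = param_counts(n_layer, d_model, vocab_size)
    return embedding / (non_embedding + embedding)

VOCAB_SIZE = 50_000  # hypothetical vocabulary size
configs = {          # hypothetical (n_layer, d_model) pairs
    "small": (4, 256),
    "medium": (24, 2048),
    "large": (80, 8192),
}

shares = {name: embedding_share(n_layer, d_model, VOCAB_SIZE)
          for name, (n_layer, d_model) in configs.items()}
for name, share in shares.items():
    print(f"{name:>6}: embedding share = {share:.1%}")
```

Under these assumptions the embedding matrix dominates the smallest configuration and becomes negligible in the largest, so whether it is counted changes the apparent slope of a fit that spans small models.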

Practical Implementation and Fitting

Fitting these laws in practice requires careful experimental design. The Chinchilla team used three methods: fixing model size while varying the number of training tokens, using iso-FLOP profiles to find the minimum loss for a given compute budget, and fitting a parametric loss function with the Huber loss for robustness against outliers. The three methods agree on a similar compute-optimal frontier, indicating that, for a fixed compute budget, training a smaller model on more data is more efficient than training a massive model on limited data.
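As an illustration of the third method, the sketch below builds a Huber objective over log-loss residuals for a parametric form such as $L(N, D) = A/N^\alpha + B/D^\beta + E$. It is a sketch under stated assumptions, not the Chinchilla team's code: the coefficients, the run grid, the threshold `delta`, and the injected outlier are all hypothetical, and the minimization step (normally handed to a numerical optimizer) is left out.

```python
import math

def huber(residual, delta=1e-3):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def predicted_loss(n_params, n_tokens, A, B, E, alpha, beta):
    """Parametric loss L(N, D) = A / N**alpha + B / D**beta + E."""
    return A / n_params ** alpha + B / n_tokens ** beta + E

def huber_objective(params, runs, delta=1e-3):
    """Sum of Huber losses on log(predicted) - log(observed) over all runs."""
    A, B, E, alpha, beta = params
    total = 0.0
    for n_params, n_tokens, observed in runs:
        pred = predicted_loss(n_params, n_tokens, A, B, E, alpha, beta)
        total += huber(math.log(pred) - math.log(observed), delta)
    return total

# Hypothetical coefficients (A, B, E, alpha, beta) and synthetic runs.
TRUE = (400.0, 1200.0, 1.7, 0.3, 0.3)
grid = [(1e8, 2e9), (1e9, 2e10), (1e10, 2e11)]  # (parameters, tokens)
clean_runs = [(n, d, predicted_loss(n, d, *TRUE)) for n, d in grid]

# Replace the last run with a diverged one whose loss is 50% too high.
noisy_runs = clean_runs[:-1] + [(1e10, 2e11, clean_runs[-1][2] * 1.5)]

clean_objective = huber_objective(TRUE, clean_runs)
noisy_objective = huber_objective(TRUE, noisy_runs)
squared_penalty = 0.5 * math.log(1.5) ** 2  # what a squared loss would charge
print(clean_objective, noisy_objective, squared_penalty)
```

The outlier contributes far less to the Huber objective than it would to a squared-error objective, which is why a single bad run pulls the fitted coefficients less.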

Key Takeaways

  • Compute Allocation: For modern LLM training, prioritize scaling model size and training tokens in equal proportion ($N \propto D$).
  • Avoid Undertraining: If you have a fixed compute budget, training a smaller model for more steps (more data) is generally more efficient than training a larger model for fewer steps.
  • Extrapolation Risks: Scaling laws are sensitive to the regime in which they are fitted. Extrapolating from small-scale experiments to frontier-scale models can lead to incorrect conclusions about the optimal allocation of compute between model size and data.
  • Embedding Impact: When calculating parameter counts for scaling laws, remember that embedding layers represent a non-negligible fraction of total parameters in smaller models, which can distort the perceived power-law exponent.
  • Predictive Power: Use parametric error models (e.g., $L(N, D) = A/N^\alpha + B/D^\beta + E$) to estimate performance before committing to large-scale training runs.
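The parametric form in the last takeaway can also be turned into an allocation rule. The sketch below assumes the common approximation $C \approx 6ND$ and minimizes $L(N, D)$ along that constraint, which gives $N_{opt} = G \cdot (C/6)^{\beta/(\alpha+\beta)}$ and $D_{opt} = G^{-1} \cdot (C/6)^{\alpha/(\alpha+\beta)}$ with $G = (\alpha A / (\beta B))^{1/(\alpha+\beta)}$. The coefficients are placeholders chosen for illustration, not fitted values from any paper.

```python
def loss(n_params, n_tokens, A, B, E, alpha, beta):
    """Parametric loss L(N, D) = A / N**alpha + B / D**beta + E."""
    return A / n_params ** alpha + B / n_tokens ** beta + E

def optimal_allocation(compute, A, B, alpha, beta):
    """Closed-form minimizer of L(N, D) subject to compute = 6 * N * D (assumed)."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    base = compute / 6.0
    n_opt = G * base ** (beta / (alpha + beta))
    d_opt = base ** (alpha / (alpha + beta)) / G
    return n_opt, d_opt

# Placeholder coefficients (assumption). alpha == beta gives N and D ~ C**0.5.
A, B, E, ALPHA, BETA = 400.0, 1200.0, 1.7, 0.3, 0.3
COMPUTE = 1e21  # FLOPs, hypothetical budget

n_opt, d_opt = optimal_allocation(COMPUTE, A, B, ALPHA, BETA)
best = loss(n_opt, d_opt, A, B, E, ALPHA, BETA)

# Same budget, but with a model half or twice the optimal size.
half = loss(n_opt / 2, COMPUTE / (6 * (n_opt / 2)), A, B, E, ALPHA, BETA)
double = loss(n_opt * 2, COMPUTE / (6 * (n_opt * 2)), A, B, E, ALPHA, BETA)

print(f"N_opt={n_opt:.3e}, D_opt={d_opt:.3e}, L={best:.4f}")
print(f"L at N_opt/2: {half:.4f}, L at 2*N_opt: {double:.4f}")
```

With equal exponents, multiplying the budget by 100 multiplies both `n_opt` and `d_opt` by 10, matching the equal-rate scaling described in the first takeaway; unequal fitted exponents would shift that split.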