LLM Scaling Works via Strong Superposition

LLMs pack all tokens into limited dimensions via overlapping vectors (strong superposition), causing prediction error to halve when model width doubles—explaining reliable power-law scaling.

Superposition Drives Predictable Error Reduction

Language models represent tens of thousands of tokens in spaces with only thousands of dimensions by using superposition: squeezing multiple concepts into the same dimensions with slight overlaps. In the dominant 'strong superposition' regime, every token gets represented, and error stems from overlap noise, not from dropping rare tokens. The geometry supplies the scaling law: nearly orthogonal random directions in an m-dimensional space interfere with mean squared overlap of about 1/m, so doubling model width m halves the error, yielding power-law scaling (exponent ~1) regardless of the data distribution. Weak superposition, where only common tokens are stored cleanly, requires power-law token frequencies to produce scaling, which is less reliable for natural language's flatter distributions.
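The 1/m geometry can be checked directly. The sketch below (an illustration, not code from the paper) packs random unit vectors into m dimensions and measures the mean squared pairwise overlap, which tracks 1/m, so doubling the width roughly halves the interference noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_sq_overlap(n_vecs, m, rng):
    # Pairwise dot products of random unit vectors in m dimensions
    # concentrate near zero with mean square ~ 1/m (near-orthogonality).
    V = rng.standard_normal((n_vecs, m))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = V @ V.T
    off_diag = G[~np.eye(n_vecs, dtype=bool)]
    return np.mean(off_diag ** 2)

e_256 = mean_sq_overlap(2000, 256, rng)
e_512 = mean_sq_overlap(2000, 512, rng)
print(e_256 / e_512)  # ~2: doubling width halves the squared-overlap noise
```

The ratio near 2 is the geometric mechanism behind the halving of error with doubled width.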

This mechanistic view corrects prior assumptions: real LLMs don't discard rare tokens but represent everything with overlap, and the measured overlap strength shrinks as 1/m, matching the theory.

Validation Across Real Models Matches Theory

Analysis of output layers in OPT, GPT-2, Qwen2.5, and Pythia (100M to 70B parameters) confirms strong superposition: all tokens are represented, with overlaps scaling as 1/m. The observed exponent of 0.91 is close to the theoretical value of 1; DeepMind's Chinchilla data yields 0.88. Toy models toggled between overlap regimes show that scaling emerges from the geometry itself, not merely from power-law structure in the data ('power law in, power law out').
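Exponents like 0.91 come from fitting loss-versus-width curves on log-log axes. A minimal sketch with hypothetical synthetic data (the widths, constant, and exponent below are illustrative, not measurements from the paper):

```python
import numpy as np

# Hypothetical loss-vs-width data following L = C * m^(-alpha);
# a linear fit in log-log space recovers alpha, the same way a
# scaling exponent is estimated across a model family.
widths = np.array([256, 512, 1024, 2048, 4096])
alpha_true = 0.91
losses = 5.0 * widths ** (-alpha_true)

slope, intercept = np.polyfit(np.log(widths), np.log(losses), 1)
alpha_est = -slope
print(round(alpha_est, 2))  # 0.91
```

With real model losses the points scatter around the line, and the fitted slope is the reported exponent.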

Limits and Optimization Opportunities

Scaling stops once width reaches vocabulary size: with no overlaps needed, superposition error vanishes and the power law breaks down. Natural language's relatively even token frequencies cap the achievable speedup, but domains with more skewed vocabularies (e.g., specialized jargon) can yield steeper curves. Architectures that promote denser packing, such as Nvidia's nGPT (which constrains vectors to the unit sphere), improve performance at fixed size. The trade-off: denser overlaps make mechanistic interpretability harder, complicating AI safety.
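The unit-sphere constraint mentioned for nGPT amounts to normalizing each embedding vector to length 1. A minimal sketch of that normalization step (an illustration of the idea, not nGPT's actual implementation):

```python
import numpy as np

def to_unit_sphere(E, eps=1e-8):
    # Project each embedding row onto the unit hypersphere, so all
    # representational capacity goes into direction rather than length.
    norms = np.linalg.norm(E, axis=-1, keepdims=True)
    return E / np.maximum(norms, eps)

E = np.random.default_rng(1).standard_normal((4, 8))
U = to_unit_sphere(E)
print(np.allclose(np.linalg.norm(U, axis=-1), 1.0))  # True
```

Keeping vectors on the sphere removes norm as a free parameter, pushing the model toward denser angular packing of the kind strong superposition exploits.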

© 2026 Edge