Momentum Dampens GD Zigzags via Gradient Averaging
On an anisotropic loss surface (condition number 100), vanilla gradient descent (GD) zigzags across the steep direction and takes 185 steps to converge (loss < 0.001). Momentum with β=0.9 converges in 159 steps: averaging successive gradients cancels the oscillations along the steep direction while accelerating progress along the flat one. With β=0.99, however, the run diverges.
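
The comparison can be reproduced in spirit with a minimal sketch. It assumes a 2-D quadratic loss 0.5*(x^2 + 100*y^2), the heavy-ball update v = β*v + g, and a hypothetical learning rate and starting point, none of which are given in the text. The function name `steps_to_converge` is illustrative. Because of these assumptions, the step counts it prints need not match the figures above, and the behavior at β=0.99 may differ as well.

```python
import math


def loss(x, y, cond=100.0):
    # Assumed anisotropic quadratic: curvature 1 along x, `cond` along y.
    return 0.5 * (x * x + cond * y * y)


def steps_to_converge(beta, lr=0.019, cond=100.0, tol=1e-3,
                      max_steps=10_000, start=(10.0, 1.0)):
    """Return the first step at which loss < tol, or None if the run
    blows up or exhausts max_steps. beta=0.0 is vanilla GD.

    lr, start, and max_steps are illustrative assumptions.
    """
    x, y = start
    vx = vy = 0.0
    for step in range(1, max_steps + 1):
        gx, gy = x, cond * y      # gradient of the quadratic loss
        vx = beta * vx + gx       # running (decayed) sum of past gradients
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
        current = loss(x, y, cond)
        if not math.isfinite(current) or current > 1e12:
            return None           # treated as divergence
        if current < tol:
            return step
    return None


for beta in (0.0, 0.9, 0.99):
    print(f"beta={beta}: steps={steps_to_converge(beta)}")
```

Along the steep direction the gradient flips sign almost every step, so consecutive terms in the running sum largely cancel. Along the flat direction the gradients share a sign and accumulate, which lengthens the effective step.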