Momentum Dampens GD Zigzags via Gradient Averaging
On anisotropic loss surfaces (condition number 100), vanilla GD zigzags and takes 185 steps to converge (loss < 0.001); momentum with β=0.9 converges in 159 steps by canceling steep-direction oscillations while accelerating progress along flat directions. Push β too high, though, and it backfires: β=0.99 over-accumulates velocity and fails to converge, circling the minimum instead.
Anisotropic Surfaces Force GD Zigzags
Real-world loss surfaces often have uneven curvature: flat in one direction (e.g., 0.05x²) and steep in another (e.g., 5y²), yielding a Hessian with eigenvalues 0.1 and 10 (condition number 100). The gradient is (0.1x, 10y). With learning rate lr = 0.18 (near the stability limit 2/λ_max = 0.2), the steep direction's per-step factor |1 − 10·0.18| = 0.8 makes y overshoot the minimum every step, flipping sign and shrinking by only 20% (oscillation), while the flat direction's factor |1 − 0.1·0.18| = 0.982 advances just 1.8% per step (near-stagnation). Starting at (−4, 1.5), vanilla GD, θ ← θ − lr·∇L(θ), zigzags slowly, reaching loss < 0.001 in 185 steps (final loss 1.5e-5 after 300 steps).
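These per-direction factors can be checked in a couple of lines (a minimal sketch; the eigenvalues 0.1 and 10 come from the quadratic surface above):

```python
lr = 0.18
# Hessian eigenvalues of the surface 0.05*x**2 + 5*y**2
for lam in (0.1, 10.0):
    # each step multiplies that coordinate's error by |1 - lr*lambda|
    print(f"lambda={lam}: factor {abs(1 - lr * lam):.3f}")  # 0.982 and 0.800
```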
Implement as:

```python
import numpy as np

def grad(x, y):
    # Gradient of L(x, y) = 0.05*x**2 + 5*y**2
    return np.array([0.1 * x, 10 * y])

def gradient_descent(start, lr, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    for _ in range(steps):
        pos = pos - lr * grad(*pos)  # theta <- theta - lr * grad(theta)
        path.append(pos.copy())
    return np.array(path)
```
A high lr speeds progress along the flat direction but oscillates in the steep one; a low lr stabilizes the steep direction but crawls along the flat one. This is the core GD trade-off.
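Running this setup end to end shows both pathologies at once; the sketch below repeats the definitions so it runs standalone:

```python
import numpy as np

def grad(x, y):
    return np.array([0.1 * x, 10 * y])

def gradient_descent(start, lr, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    for _ in range(steps):
        pos = pos - lr * grad(*pos)
        path.append(pos.copy())
    return np.array(path)

path = gradient_descent((-4, 1.5), lr=0.18)

# Steep direction: y flips sign every step (factor -0.8), the zigzag
print(path[:4, 1])  # 1.5, -1.2, 0.96, -0.768 (up to float noise)

# Steps until loss < 0.001 (index 0 is the start, so index = update count)
losses = 0.05 * path[:, 0]**2 + 5 * path[:, 1]**2
print(np.argmax(losses < 1e-3))  # 185
```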
Momentum Velocity Cancels Oscillations, Builds Speed
Momentum tracks a velocity v (an exponential moving average of gradients): v ← β·v + (1−β)·∇L(θ); θ ← θ − lr·v. Consistent gradients (flat direction) accumulate into larger steps; opposing gradients (steep-direction oscillations) cancel, damping the zigzag. From (−4, 1.5) with lr = 0.18:
- β=0.9: smooth path, loss<0.001 in 159 steps (final 1e-6).
- β=0.99: excessive accumulation overshoots; the iterate circles the minimum, ending at loss 0.487 after 300 steps.
Code:

```python
def momentum_gd(start, lr, beta, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    v = np.zeros(2)  # velocity: EMA of gradients
    for _ in range(steps):
        g = grad(*pos)
        v = beta * v + (1 - beta) * g  # consistent gradients accumulate, opposing ones cancel
        pos = pos - lr * v
        path.append(pos.copy())
    return np.array(path)
```
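A quick comparison run reproduces the two regimes; this sketch repeats `grad` so it runs standalone, and the small `loss` helper (not in the text) just evaluates the surface:

```python
import numpy as np

def grad(x, y):
    return np.array([0.1 * x, 10 * y])

def loss(p):
    return 0.05 * p[0]**2 + 5 * p[1]**2

def momentum_gd(start, lr, beta, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    v = np.zeros(2)
    for _ in range(steps):
        g = grad(*pos)
        v = beta * v + (1 - beta) * g
        pos = pos - lr * v
        path.append(pos.copy())
    return np.array(path)

for beta in (0.9, 0.99):
    final = momentum_gd((-4, 1.5), lr=0.18, beta=beta)[-1]
    # beta=0.9 lands near the minimum; beta=0.99 is still far away
    print(beta, loss(final))
```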
β weights history: β→0 mimics vanilla GD; β=0.9 balances smoothing and speed; β→1 lets stale velocity dominate the current gradient and risks instability.
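The cancellation mechanism can be isolated from the optimization entirely: feed the EMA a sign-alternating sequence (a stand-in for steep-direction gradients) versus a constant one (the flat direction). The `ema_velocity` helper here is illustrative, not from the text:

```python
def ema_velocity(grads, beta):
    # v <- beta*v + (1-beta)*g, applied over a gradient sequence
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v

alternating = [(-1) ** k for k in range(100)]  # +1, -1, +1, ... (oscillation)
constant = [1.0] * 100                         # same sign every step

print(ema_velocity(alternating, beta=0.9))  # near zero: opposing terms cancel
print(ema_velocity(constant, beta=0.9))     # near 1.0: consistent terms accumulate
```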
β Tuning via Convergence Sweep
Sweep β ∈ {0.0, 0.5, 0.7, 0.85, 0.90, 0.95, 0.99}, measuring steps to reach loss < 0.001 (max 500 steps):
| β | Steps to converge |
|---|---|
| 0.00 | 185 (vanilla GD) |
| 0.50 | 170 |
| 0.70 | 165 |
| 0.85 | 161 |
| 0.90 | 159 (sweet spot) |
| 0.95 | 158 |
| 0.99 | >500 (fails to converge) |
The relationship is an inverted U: β = 0.9–0.95 is optimal (about 15% faster than vanilla GD); too high a β prioritizes stale velocity over the current gradient. Visualizing the trajectories (first 55 steps on loss contours) and log-scale loss curves confirms the picture: GD is slow and oscillatory, a well-tuned β is direct and fast, and an overly high β bounces around the minimum without settling. Loss surface: `def loss(x, y): return 0.05 * x**2 + 5 * y**2`.
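The sweep itself takes only a few lines; this sketch repeats the definitions so it runs standalone, and returns `None` when the step budget is exhausted:

```python
import numpy as np

def grad(x, y):
    return np.array([0.1 * x, 10 * y])

def loss(pos):
    return 0.05 * pos[0]**2 + 5 * pos[1]**2

def steps_to_converge(beta, start=(-4, 1.5), lr=0.18, max_steps=500, tol=1e-3):
    pos = np.array(start, dtype=float)
    v = np.zeros(2)
    for step in range(1, max_steps + 1):
        g = grad(*pos)
        v = beta * v + (1 - beta) * g  # beta=0 reduces exactly to vanilla GD
        pos = pos - lr * v
        if loss(pos) < tol:
            return step
    return None  # budget exhausted without converging

for beta in (0.0, 0.5, 0.7, 0.85, 0.90, 0.95, 0.99):
    print(beta, steps_to_converge(beta))
```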