Momentum Dampens GD Zigzags via Gradient Averaging

On an anisotropic loss surface (condition number 100), vanilla GD zigzags and takes 185 steps to converge (loss < 0.001). Momentum with β=0.9 converges in 159 steps by canceling steep-direction oscillations while accelerating along the flat direction; at β=0.99, the accumulated velocity overshoots and the optimizer never settles.

Anisotropic Surfaces Force GD Zigzags

Real-world loss surfaces often have uneven curvature: flat in one direction (e.g., 0.05x²) and steep in another (e.g., 5y²), yielding a Hessian with eigenvalues 0.1 and 10 (condition number 100). The gradient is (0.1x, 10y). With learning rate lr=0.18 (near the stability limit 2/λ_max = 0.2), the steep direction's per-step factor |1 - 10·0.18| = 0.8 means each step overshoots the minimum and lands at 80% of the previous distance on the opposite side (oscillation, only 20% contraction per step), while the flat direction's factor |1 - 0.1·0.18| = 0.982 advances just 1.8% per step (near-stagnation). Starting at (-4, 1.5), vanilla GD, θ ← θ - lr ∇L(θ), zigzags slowly, reaching loss < 0.001 in 185 steps (final loss 1.5e-5 after 300 steps).
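
These per-direction factors are quick to verify numerically; a minimal check, using the eigenvalues and learning rate above:

lr = 0.18
for lam in (0.1, 10):              # Hessian eigenvalues: flat, steep
    factor = 1 - lr * lam          # per-step multiplier on that coordinate
    print(lam, factor)             # 0.982 (crawls), -0.8 (sign flips each step)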

Implement as:

import numpy as np

def grad(x, y):
    # Gradient of loss(x, y) = 0.05*x**2 + 5*y**2
    return np.array([0.1 * x, 10 * y])

def gradient_descent(start, lr, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    for _ in range(steps):
        pos = pos - lr * grad(*pos)   # vanilla update: θ ← θ - lr ∇L(θ)
        path.append(pos.copy())
    return np.array(path)
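
A quick check of the step count quoted above (a sketch; assumes grad and gradient_descent as defined here):

def loss(x, y):
    return 0.05 * x**2 + 5 * y**2

path = gradient_descent(start=(-4, 1.5), lr=0.18)
losses = [loss(*p) for p in path]
print(next(i for i, l in enumerate(losses) if l < 1e-3))  # ~185 per the text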

A high learning rate makes fast progress in the flat direction but oscillates in the steep one; a low learning rate stabilizes the steep direction but crawls in the flat one. That is the core GD trade-off.
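
To see the trade-off concretely, compare step counts at two learning rates (a sketch; lr=0.02 is an illustrative low value, not taken from the text):

for lr in (0.18, 0.02):
    path = gradient_descent(start=(-4, 1.5), lr=lr, steps=2000)
    losses = [loss(*p) for p in path]
    steps = next((i for i, l in enumerate(losses) if l < 1e-3), None)
    print(lr, steps)   # high lr: ~185 steps but oscillatory; low lr: stable but far slower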

Momentum Velocity Cancels Oscillations, Builds Speed

Momentum tracks a velocity v, an exponential moving average of gradients: v ← β v + (1-β) ∇L(θ); θ ← θ - lr v. Consistent gradients (flat direction) accumulate into larger steps; opposing gradients (steep-direction oscillations) cancel, damping the zigzag. From (-4, 1.5) with lr=0.18:

  • β=0.9: smooth path, loss < 0.001 in 159 steps (final loss 1e-6).
  • β=0.99: excessive velocity accumulation overshoots; final loss 0.487 (circles the minimum).

Code:

def momentum_gd(start, lr, beta, steps=300):
    path = [np.array(start, dtype=float)]
    pos = np.array(start, dtype=float)
    v = np.zeros(2)                      # velocity: EMA of past gradients
    for _ in range(steps):
        g = grad(*pos)
        v = beta * v + (1 - beta) * g    # v ← β v + (1-β) ∇L(θ)
        pos = pos - lr * v               # θ ← θ - lr v
        path.append(pos.copy())
    return np.array(path)
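
Running both β settings from the same start reproduces the contrast above (a sketch; assumes grad and loss as defined earlier):

for beta in (0.9, 0.99):
    path = momentum_gd(start=(-4, 1.5), lr=0.18, beta=beta)
    print(beta, loss(*path[-1]))   # ~1e-6 vs ~0.487 per the text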

β weights the gradient history: β→0 reduces to vanilla GD; β=0.9 balances smoothing and speed; β→1 averages over an ever-longer horizon (roughly 1/(1-β) steps), so the velocity lags the current gradient and risks divergence.

β Tuning via Convergence Sweep

Sweep β ∈ {0.0, 0.5, 0.7, 0.85, 0.90, 0.95, 0.99} to loss < 0.001 (max 500 steps):

β      Steps to converge
0.00   185 (vanilla GD)
0.50   170
0.70   165
0.85   161
0.90   159 (sweet spot)
0.95   158
0.99   >500 (diverges)
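
A sketch of the sweep itself (assumes momentum_gd and loss from above; β=0 reproduces vanilla GD here because of the (1-β) scaling):

for beta in [0.0, 0.5, 0.7, 0.85, 0.90, 0.95, 0.99]:
    path = momentum_gd(start=(-4, 1.5), lr=0.18, beta=beta, steps=500)
    losses = [loss(*p) for p in path]
    steps = next((i for i, l in enumerate(losses) if l < 1e-3), None)
    print(beta, steps if steps is not None else ">500")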

Convergence speed traces an inverted U in β: β=0.9-0.95 is optimal (~15-20% faster than GD), while too high a β prioritizes stale velocity over the current gradient. Plotting the trajectories (first 55 steps on loss contours) and the log-loss curves confirms the picture: GD is slow and oscillatory, a well-tuned β is direct and fast, and an overly high β bounces around and fails. The loss surface throughout is loss(x, y) = 0.05*x**2 + 5*y**2.
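
A minimal matplotlib sketch of the contour-and-trajectory view described above (assumes the functions defined earlier; grid ranges and styling are illustrative choices):

import matplotlib.pyplot as plt

xs, ys = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-2, 2, 200))
plt.contour(xs, ys, 0.05 * xs**2 + 5 * ys**2, levels=20)
for label, path in [("GD", gradient_descent((-4, 1.5), lr=0.18)),
                    ("momentum β=0.9", momentum_gd((-4, 1.5), lr=0.18, beta=0.9))]:
    plt.plot(path[:55, 0], path[:55, 1], marker=".", label=label)
plt.legend()
plt.show()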
