NES Optimizes a Quadratic Bowl via Gaussian Perturbations
Sample npop=50 perturbed weight vectors w + sigma*N_j with N_j ~ N(0, I) and sigma=0.1, weight each noise vector by its standardized reward, and update w += alpha/(npop*sigma) * N.T @ A with alpha=0.001; the loop converges within 300 iterations.
NES Core Loop for Black-Box Optimization
NES treats the parameters w as the mean of a fixed-variance Gaussian (sigma = 0.1). To maximize a black-box reward f(w) without gradients:
- Generate npop=50 standard-normal noise vectors, stacked as a matrix N of shape 50x3 (each row N_j ~ N(0, I)).
- Perturb: w_try_j = w + sigma * N_j, and compute R_j = f(w_try_j). Here f(w) = -||w - (0.5, 0.1, -0.3)||_2^2, so the maximum reward is 0 at the solution.
- Standardize: A = (R - mean(R)) / std(R) rescales rewards to zero mean and unit variance, which keeps the step size independent of reward scale and speeds convergence vs raw R (the division itself needs an epsilon guard on flat rewards; see edge cases below).
- Update: w += alpha/(npop * sigma) * N.T @ A (alpha = 0.001). This is the score-function gradient estimator grad_w E[f(w + sigma*N)] = (1/sigma) * E[N * f(w + sigma*N)], derived below.
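Why that identity holds (the standard log-derivative trick, not specific to this toy): for x ~ N(w, sigma^2 I),

grad_w E[f(x)] = E[f(x) * grad_w log p(x|w)] = E[f(x) * (x - w)/sigma^2] = (1/sigma) * E[N * f(w + sigma*N)],

where N = (x - w)/sigma ~ N(0, I). Averaging N_j * A_j over the population and dividing by sigma is the Monte Carlo version of the right-hand side, with rewards standardized.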
Starting from a random w ≈ (1.76, 0.40, 0.98) (reward -3.32), the loop reaches a reward of -0.000009 by iteration 280.
w = w + alpha/(npop*sigma) * np.dot(N.T, A)  # reward-weighted average of the noise, scaled into a gradient step
sigma both scales the perturbation size and normalizes the estimator: the same sigma that multiplies the noise appears in the divisor, keeping the gradient estimate on a consistent scale.
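Putting the bullets together, a minimal runnable sketch in numpy (same hyperparameters as above; the seed, target-vector name, and print cadence are illustrative):

import numpy as np

np.random.seed(0)
solution = np.array([0.5, 0.1, -0.3])          # hidden optimum; the optimizer never sees it

def f(w):
    return -np.sum((w - solution) ** 2)        # black-box reward: 0 at the solution, negative elsewhere

npop, sigma, alpha = 50, 0.1, 0.001
w = np.random.randn(3)                         # random starting point

for i in range(300):
    N = np.random.randn(npop, 3)               # one standard-normal perturbation per population member
    R = np.array([f(w + sigma * N[j]) for j in range(npop)])
    A = (R - np.mean(R)) / (np.std(R) + 1e-8)  # standardized advantages; epsilon guards flat rewards
    w = w + alpha / (npop * sigma) * np.dot(N.T, A)
    if i % 20 == 0:
        print(i, f(w))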
Demonstrated Convergence on the Toy Quadratic
300 iterations suffice; printing the reward every 20 iterations shows steady progress:
- Iter 0: reward -3.323
- Iter 100: -0.727
- Iter 200: -0.001
- Iter 280: -0.000009
The toy mimics NN optimization: in the RL setting, f(w) would load w into a network, run it in the environment, and return the episode's total reward. The solution stays hidden from the optimizer, which only ever sees scalar rewards.
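What f(w) might look like in that setting, as a hedged sketch: it assumes a classic gym-style environment (reset()/step() returning obs, reward, done, info) and a hypothetical tiny linear policy; the shapes and names are illustrative.

import numpy as np

def f(w, env, obs_dim=4, n_actions=2):
    # Unpack the flat parameter vector into a linear policy and run one episode.
    W = w.reshape(obs_dim, n_actions)
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = int(np.argmax(obs @ W))       # greedy discrete action from linear scores
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward                        # the only signal NES ever sees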
Insights from Implementers
- Standardization is optional but boosts speed: raw R works (equivalent to the paper's formulation via Section 3.2), but centering and scaling prevent stagnation when rewards are all negative or nearly flat.
- Edge cases: adding an epsilon to std(R) avoids division by zero when all rewards are equal (common early in training and on simple problems).
- Extensions: the same loop tracks slowly moving targets (small jitters to the solution each iteration); libraries like evostra apply it to games such as Flappy Bird. Unlike a GA, no crossover is needed: NES is gradient-like via the log-probability derivative.
- Deployment: save the final w and reconstruct the NN from it (see the sketch below). Practical for RL compared with DQN: no backprop, and candidate evaluations parallelize trivially.
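A minimal save/restore sketch for that deployment step, continuing the numpy setup above (the filename is hypothetical):

np.save("nes_weights.npy", w)   # persist the final parameter vector
w = np.load("nes_weights.npy")  # reload later and rebuild the policy/NN from it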