NES Optimizes a Quadratic Bowl via Gaussian Perturbations
Sample npop=50 perturbed weight vectors w + sigma*N_j with N_j ~ N(0, I) and sigma=0.1, weight each noise vector by its standardized reward, and update w += alpha/(npop*sigma) * N.T @ A with alpha=0.001; the loop converges within 300 iterations.
NES Core Loop for Black-Box Optimization
NES treats the parameters w as the mean of a fixed-variance Gaussian (sigma = 0.1). To maximize a black-box reward f(w) without gradients:
- Generate npop=50 standard-normal noise vectors, stacked as a matrix N of shape 50x3 (each row N_j ~ N(0, I)).
- Perturb: w_try_j = w + sigma * N_j, and compute R_j = f(w_try_j). Here f(w) = -||w - (0.5, 0.1, -0.3)||_2^2, so the maximum reward is 0 at the solution.
- Standardize: A = (R - mean(R)) / std(R) rescales rewards to zero mean and unit variance, which keeps the step size independent of reward scale and speeds convergence vs raw R (the division itself needs an epsilon guard on flat rewards; see edge cases below).
- Update: w += alpha/(npop * sigma) * N.T @ A (alpha = 0.001). This is the score-function gradient estimator grad_w E[f(w + sigma*N)] = (1/sigma) * E[N * f(w + sigma*N)], derived below.
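Why that identity holds (the standard log-derivative trick, not specific to this toy): for x ~ N(w, sigma^2 I),

grad_w E[f(x)] = E[f(x) * grad_w log p(x|w)] = E[f(x) * (x - w)/sigma^2] = (1/sigma) * E[N * f(w + sigma*N)],

where N = (x - w)/sigma ~ N(0, I). Averaging N_j * A_j over the population and dividing by sigma is the Monte Carlo version of the right-hand side, with rewards standardized.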
Starting from a random w ≈ (1.76, 0.40, 0.98) (reward -3.32), the loop reaches a reward of -0.000009 by iteration 280.
w = w + alpha/(npop*sigma) * np.dot(N.T, A)  # reward-weighted average of the noise, scaled into a gradient step
sigma both scales the perturbation size and normalizes the estimator: the same sigma that multiplies the noise appears in the divisor, keeping the gradient estimate on a consistent scale.
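Putting the bullets together, a minimal runnable sketch in numpy (same hyperparameters as above; the seed, target-vector name, and print cadence are illustrative):

import numpy as np

np.random.seed(0)
solution = np.array([0.5, 0.1, -0.3])          # hidden optimum; the optimizer never sees it

def f(w):
    return -np.sum((w - solution) ** 2)        # black-box reward: 0 at the solution, negative elsewhere

npop, sigma, alpha = 50, 0.1, 0.001
w = np.random.randn(3)                         # random starting point

for i in range(300):
    N = np.random.randn(npop, 3)               # one standard-normal perturbation per population member
    R = np.array([f(w + sigma * N[j]) for j in range(npop)])
    A = (R - np.mean(R)) / (np.std(R) + 1e-8)  # standardized advantages; epsilon guards flat rewards
    w = w + alpha / (npop * sigma) * np.dot(N.T, A)
    if i % 20 == 0:
        print(i, f(w))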
Demonstrated Convergence on the Toy Quadratic
300 iterations suffice; printing the reward every 20 iterations shows steady progress:
- Iter 0: reward -3.323
- Iter 100: -0.727
- Iter 200: -0.001
- Iter 280: -0.000009
The toy mimics NN optimization: in the RL setting, f(w) would load w into a network, run it in the environment, and return the episode's total reward. The solution stays hidden from the optimizer, which only ever sees scalar rewards.
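What f(w) might look like in that setting, as a hedged sketch: it assumes a classic gym-style environment (reset()/step() returning obs, reward, done, info) and a hypothetical tiny linear policy; the shapes and names are illustrative.

import numpy as np

def f(w, env, obs_dim=4, n_actions=2):
    # Unpack the flat parameter vector into a linear policy and run one episode.
    W = w.reshape(obs_dim, n_actions)
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = int(np.argmax(obs @ W))       # greedy discrete action from linear scores
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward                        # the only signal NES ever sees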
Insights from Implementers
- Standardization is optional but boosts speed: raw R works (equivalent to the paper's formulation via Section 3.2), but centering and scaling prevent stagnation when rewards are all negative or nearly flat.
- Edge cases: adding an epsilon to std(R) avoids division by zero when all rewards are equal (common early in training and on simple problems).
- Extensions: the same loop tracks slowly moving targets (small jitters to the solution each iteration); libraries like evostra apply it to games such as Flappy Bird. Unlike a GA, no crossover is needed: NES is gradient-like via the log-probability derivative.
- Deployment: save the final w and reconstruct the NN from it (see the sketch below). Practical for RL compared with DQN: no backprop, and candidate evaluations parallelize trivially.
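A minimal save/restore sketch for that deployment step, continuing the numpy setup above (the filename is hypothetical):

np.save("nes_weights.npy", w)   # persist the final parameter vector
w = np.load("nes_weights.npy")  # reload later and rebuild the policy/NN from it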