Policy Gradients for Pong: 100-Line RL Agent

Train a 2-layer NN to play Atari Pong from raw pixels using REINFORCE policy gradients. It uses 80x80 binary difference frames, discounts rewards with gamma=0.99, standardizes advantages, and applies RMSProp updates every 10 episodes. Converges on CPU in hours.

Network Architecture and Forward/Backward Passes

Build a fully connected policy network with 200 ReLU hidden units: input is an 80x80=6400D binary diff frame, W1 (200x6400, Xavier-style init), ReLU, then W2 (a 200-vector mapping hidden to a single logit) and a sigmoid giving P(UP), where UP is Gym action 2. Forward: h = ReLU(W1 @ x), p = sigmoid(W2 @ h). Sample the action stochastically: UP if uniform() < p, else DOWN.
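
A minimal NumPy sketch of this network and forward pass, following the shapes and math above (the variable names are illustrative, not from the original source):

```python
import numpy as np

D, H = 80 * 80, 200  # input dim (flattened 80x80 diff frame), hidden units

model = {
    'W1': np.random.randn(H, D) / np.sqrt(D),  # scaled ("Xavier-style") init
    'W2': np.random.randn(H) / np.sqrt(H),     # hidden -> single UP logit
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def policy_forward(model, x):
    """Return P(UP) and the hidden state (kept for the backward pass)."""
    h = model['W1'] @ x
    h[h < 0] = 0                      # ReLU
    logit = model['W2'] @ h
    return sigmoid(logit), h

# Stochastic action: Gym's Pong uses 2 = UP, 3 = DOWN.
x = np.zeros(D, dtype=np.float32)     # placeholder diff frame
p, h = policy_forward(model, x)
action = 2 if np.random.uniform() < p else 3
```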

Backward computes the policy gradient analytically. Per episode, stack epx (inputs), eph (hidden states), and epdlogp (y - p, where y=1 for UP). Then dW2 = eph.T @ epdlogp, dh = outer(epdlogp, W2), zero dh where the ReLU was inactive (eph <= 0), and dW1 = dh.T @ epx. Accumulate gradients into grad_buffer over batch_size=10 episodes.
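
A matching backward-pass sketch, reusing the `model` dict from the forward-pass sketch above; `epx`, `eph`, and `epdlogp` are the stacked per-episode arrays just described:

```python
import numpy as np

def policy_backward(model, epx, eph, epdlogp):
    """Analytic policy gradient for one episode.
    epx: (T, 6400) diff frames, eph: (T, 200) hidden states,
    epdlogp: (T,) advantage-modulated (y - p) values."""
    dW2 = eph.T @ epdlogp                 # (200,)
    dh = np.outer(epdlogp, model['W2'])   # (T, 200)
    dh[eph <= 0] = 0                      # ReLU was inactive: no gradient flows
    dW1 = dh.T @ epx                      # (200, 6400)
    return {'W1': dW1, 'W2': dW2}

# Accumulate into grad_buffer over batch_size=10 episodes:
# grad = policy_backward(model, epx, eph, epdlogp)
# for k in model: grad_buffer[k] += grad[k]
```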

Image Preprocessing for Atari Pong

Transform each 210x160x3 uint8 frame: crop rows 35:195 to the 160x160 playing field, downsample 2x and take one color channel to get 80x80 (I[::2, ::2, 0]), binarize (background values 144 and 109 -> 0, everything else -> 1), and flatten to a 6400D float vector. Use difference frames x = cur_x - prev_x: motion highlights the ball and paddles while the static background cancels to zero. This reduces noise and enables end-to-end learning from pixels.
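
A sketch of that preprocessing, assuming the crop rows, background values (144/109), and 2x downsampling stated above:

```python
import numpy as np

def prepro(frame):
    """210x160x3 uint8 Atari frame -> flat 6400-dim binary vector."""
    I = frame[35:195]          # crop to the 160x160 playing field
    I = I[::2, ::2, 0].copy()  # 2x downsample, one channel -> 80x80 (copy: don't mutate obs)
    I[I == 144] = 0            # erase background shade 1
    I[I == 109] = 0            # erase background shade 2
    I[I != 0] = 1              # ball and paddles -> 1
    return I.astype(np.float32).ravel()

# Difference frame highlights motion; prev_x resets to None each episode.
# cur_x = prepro(observation)
# x = cur_x - prev_x if prev_x is not None else np.zeros(80 * 80)
# prev_x = cur_x
```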

Reward Discounting and Advantage Normalization

Pong rewards are sparse: +1 for winning a point, -1 for losing one, delivered only at game boundaries. For the trajectory's reward list drs: discount backwards with gamma=0.99, resetting the running sum wherever r_t != 0 (a point was scored). Standardize discounted_epr to mean=0, std=1 to control gradient variance. Then modulate: epdlogp *= discounted_epr (REINFORCE: grad log pi(a|s) * advantage).
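
A sketch of the backward discounting and standardization; the small epsilon in the std division is a defensive assumption, not stated in the summary:

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """r: 1-D array of per-step rewards. Walk backwards, resetting the
    running sum at game boundaries (r[t] != 0), which is Pong-specific."""
    discounted = np.zeros_like(r, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running = 0.0              # a point was scored: new game segment
        running = running * gamma + r[t]
        discounted[t] = running
    return discounted

# Standardize, then modulate the gradient signal:
# adv = discount_rewards(np.asarray(drs))
# adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # epsilon is a defensive assumption
# epdlogp *= adv                                  # REINFORCE: grad log pi * advantage
```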

Training Loop and Optimization

OpenAI Gym Pong-v0. Loop: preprocess the observation, run the policy forward, sample and take an action, and record x/h/dlogp/r. On done: compute discounted, standardized advantages, run backward, and add gradients to grad_buffer. Every 10 episodes: RMSProp update (decay=0.99, lr=1e-4), i.e. param += lr * g / (sqrt(rms_cache) + 1e-5), then reset the buffer. Track running_reward (EWMA with decay 0.99), save the model to save.p every 100 episodes, and resume from save.p on restart. Rendering is optional.
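
A sketch of the batched RMSProp step, assuming the `model` and `grad_buffer` dicts from the earlier sketches; the ascent sign (+=) follows from maximizing expected reward:

```python
import numpy as np

def rmsprop_step(model, grad_buffer, rmsprop_cache,
                 learning_rate=1e-4, decay_rate=0.99):
    """One RMSProp ascent step from gradients accumulated over a batch."""
    for k, v in model.items():
        g = grad_buffer[k]
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g ** 2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
        grad_buffer[k] = np.zeros_like(v)  # reset for the next batch

# Called every batch_size=10 episodes; per-episode bookkeeping per the summary:
# running_reward = reward_sum if running_reward is None \
#                  else running_reward * 0.99 + reward_sum * 0.01
# if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
```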

The script prints per-episode rewards; the agent quickly learns to beat a random policy and reaches roughly human-level play after ~1-2 hours on CPU (per the blog link in the code comments).
