Minimal NumPy RNN for Char-Level Text Gen

Build a vanilla RNN language model from scratch in ~170 lines of NumPy: it processes text in chunks of 25 characters, trains with backpropagation through time (BPTT) and Adagrad, and prints generated samples every 100 iterations.

RNN Architecture and One-Hot Encoding

Load text from 'input.txt' into data and extract the unique characters as the vocabulary (vocab_size = len(chars)). Map characters to indices with char_to_ix and back with ix_to_char. Inputs are one-hot encoded: each character index becomes a (vocab_size, 1) column vector with a 1 at that index.
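
A minimal sketch of this setup, following the description above (the one_hot helper is introduced here for illustration):

```python
import numpy as np

# Load the corpus and build the character vocabulary.
data = open('input.txt', 'r').read()
chars = sorted(set(data))
vocab_size = len(chars)

# Bidirectional char <-> index maps.
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def one_hot(ix):
    """Return a (vocab_size, 1) column vector with a 1 at index ix."""
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    return x
```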

Hidden layer size fixed at 100 neurons (hidden_size=100), sequence length 25 (seq_length=25), learning rate 0.1. Weights initialized small: Wxh = np.random.randn(100, vocab_size)*0.01 (input-to-hidden), Whh (hidden-to-hidden, 100x100), Why (hidden-to-output, vocab_size x 100). Biases zero-initialized. Scaling by 0.01 keeps initial activations small for tanh stability and breaks symmetry so hidden units learn distinct features.
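
A sketch of the hyperparameters and parameter initialization, matching the shapes described above:

```python
hidden_size = 100    # number of hidden units
seq_length = 25      # chars per training chunk (BPTT length)
learning_rate = 1e-1

# Small random weights (scaled by 0.01) keep initial tanh activations near zero
# and break symmetry between hidden units; biases start at zero.
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
bh = np.zeros((hidden_size, 1))   # hidden bias
by = np.zeros((vocab_size, 1))    # output bias
```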

Forward step per timestep t: hs[t] = tanh(Wxh @ xs[t] + Whh @ hs[t-1] + bh), then ys[t] = Why @ hs[t] + by, and softmax ps[t] = exp(ys[t]) / sum(exp(ys[t])) gives next-char probabilities. Loss is the negative log-likelihood summed over the chunk: sum over t of -log(ps[t][targets[t]]).
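
A sketch of the forward pass over one chunk; this pulls the forward half of lossFun into its own function for clarity and reuses the one_hot helper from above (hprev is the hidden state carried over from the previous chunk):

```python
def forward(inputs, targets, hprev):
    """Run the RNN forward over one chunk; return loss and per-timestep caches."""
    xs, hs, ps = {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0
    for t in range(len(inputs)):
        xs[t] = one_hot(inputs[t])
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t-1] + bh)   # hidden state
        ys = Why @ hs[t] + by                               # unnormalized logits
        ps[t] = np.exp(ys) / np.sum(np.exp(ys))             # softmax probabilities
        loss += -np.log(ps[t][targets[t], 0])               # cross-entropy term
    return loss, xs, hs, ps
```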

Backpropagation Through Time and Gradients

In lossFun(inputs, targets, hprev): forward pass stores xs, hs, ys, ps for all timesteps. Backward pass starts from output: dy = ps[t].copy(); dy[target] -= 1 (softmax + cross-entropy gradient simplifies to this). Accumulate dWhy += dy @ hs[t].T, dby += dy.

Propagate to hidden: dh = Why.T @ dy + dhnext (dhnext from future timestep), dhraw = (1 - hs[t]^2) * dh (tanh derivative), then dbh += dhraw, dWxh += dhraw @ xs[t].T, dWhh += dhraw @ hs[t-1].T, dhnext = Whh.T @ dhraw for prior timestep.

Clip all gradients to [-5, 5] to prevent exploding gradients. lossFun returns the total loss, all parameter gradients, and the final hidden state for the next chunk.
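
A sketch of the backward half of lossFun, consuming the caches from the forward sketch above and applying the clipping step:

```python
def backward(xs, hs, ps, inputs, targets):
    """BPTT over one chunk; returns gradients for all parameters."""
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1                  # softmax + cross-entropy gradient
        dWhy += dy @ hs[t].T
        dby += dy
        dh = Why.T @ dy + dhnext             # gradient into the hidden state
        dhraw = (1 - hs[t] * hs[t]) * dh     # back through the tanh nonlinearity
        dbh += dhraw
        dWxh += dhraw @ xs[t].T
        dWhh += dhraw @ hs[t-1].T
        dhnext = Whh.T @ dhraw               # carried to the previous timestep
    # Clip to [-5, 5] to avoid exploding gradients.
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)
    return dWxh, dWhh, dWhy, dbh, dby
```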

Adagrad Training and Text Sampling

An infinite loop sweeps the data left to right in seq_length=25 chunks, resetting hprev to zeros at the start of each sweep (when the pointer p runs past the end of the data). Inputs and targets are the character indices for data[p:p+25] and the shifted slice data[p+1:p+26].
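
A sketch of the chunking logic; the reset condition shown here (pointer near the end of the data, or first iteration) follows the standard min-char-rnn pattern:

```python
n, p = 0, 0                      # iteration counter and data pointer
while True:
    # Reset the hidden state and pointer when we run off the end of the data.
    if p + seq_length + 1 >= len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1))
        p = 0
    inputs = [char_to_ix[ch] for ch in data[p:p + seq_length]]
    targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
    # ... loss/gradients, sampling, and the Adagrad update go here ...
    p += seq_length
    n += 1
```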

Every 100 iterations, sample 200 characters from the model seeded with inputs[0]: run the same forward step as in training, but pick ix = np.random.choice(vocab_size, p=ps.ravel()), feed the sampled char back in, decode the indices to text, and print. The loss is tracked as an exponential moving average, smooth_loss = smooth_loss * 0.999 + loss * 0.001, also printed every 100 iterations.
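
A sketch of the sampling routine as a standalone function (the seed index and hidden state come from the training loop, e.g. print(sample(hprev, inputs[0], 200))):

```python
def sample(h, seed_ix, n_chars=200):
    """Sample n_chars from the model, starting from seed_ix and hidden state h."""
    x = one_hot(seed_ix)
    ixes = []
    for _ in range(n_chars):
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        y = Why @ h + by
        probs = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(vocab_size, p=probs.ravel())  # sample next index
        x = one_hot(ix)                                     # feed it back in
        ixes.append(ix)
    return ''.join(ix_to_char[ix] for ix in ixes)
```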

Update parameters with Adagrad: per-parameter memory accumulates mem += dparam**2, then param -= learning_rate * dparam / np.sqrt(mem + 1e-8). Advance p by 25 and increment n. smooth_loss is initialized to -np.log(1.0/vocab_size) * 25, the expected loss of a uniformly random predictor over one chunk.
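
A sketch of the Adagrad step, assuming gradients named as in the backward sketch; the adagrad_step helper is introduced here for illustration:

```python
# Per-parameter sum of squared gradients (Adagrad memory), created once.
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by)

def adagrad_step(params, grads, mems, lr=1e-1):
    """In-place Adagrad update: scale each step by accumulated gradient magnitude."""
    for param, dparam, mem in zip(params, grads, mems):
        mem += dparam * dparam
        param += -lr * dparam / np.sqrt(mem + 1e-8)

# adagrad_step([Wxh, Whh, Why, bh, by], [dWxh, dWhh, dWhy, dbh, dby],
#              [mWxh, mWhh, mWhy, mbh, mby])
```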

Common issues: input.txt must be longer than seq_length+1 characters (otherwise indexing in the loss raises an IndexError); large datasets like Shakespeare need 100k+ iterations before the loss drops to roughly 3.0 and the samples become coherent.
