microgpt.py: Full GPT in 300 Lines of Pure Python

Trains a tiny GPT on a names dataset using a custom autograd engine, with no dependencies and no PyTorch, to generate realistic names and distill the core transformer algorithm.

Custom Autograd Engine Powers End-to-End Training

Implements automatic differentiation via a Value class (using slots for efficiency). Supports add, mul, pow, log, exp, ReLU, and backward via topological sort of the computation graph. The chain rule propagates gradients recursively: child.grad += local_grad * v.grad. This enables a full forward/backward pass with no libraries.

For the names dataset (~32k lines from names.txt), builds a char-level tokenizer over the unique characters (vocab_size ≈ 30, plus 1 BOS token). Model params (~10k total): 1 layer, n_embd=16, block_size=16, n_head=4 (head_dim=4). Weights are initialized Gaussian with std=0.08. Embeddings: wte (vocab x 16), wpe (16 x 16), lm_head (vocab x 16). Per layer: QKV (4x 16x16), Wo (16x16), MLP fc1 (64x16), fc2 (16x64).
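The autograd engine described above can be sketched in a few dozen lines. This is an illustrative minimal version (only add, mul, and ReLU shown; pow, log, and exp follow the same pattern), not the file's exact code:

```python
class Value:
    """Scalar value with autograd, in the spirit of the Value class above."""
    __slots__ = ("data", "grad", "_children", "_backward")

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # chain rule: d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological sort of the graph, then propagate gradients
        # from this output node back to every leaf
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
```

For example, with x = Value(2.0), y = Value(3.0), z = x * y + x, calling z.backward() gives x.grad = 4.0 (y + 1) and y.grad = 2.0.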

GPT Architecture Mirrors GPT-2 Essentials

Forward pass: token + positional embeddings → RMSNorm → residual blocks. Attention: raw dot products (scaled by 1/sqrt(head_dim)) → softmax weights → weighted sum of V → Wo projection. Causality comes from appending to the key/value history (no explicit mask). MLP: RMSNorm → fc1 → ReLU → fc2 → residual. Final lm_head produces logits → softmax probs. Uses RMSNorm (scale = (mean(x^2)+eps)^-0.5) instead of LayerNorm, ReLU instead of GeLU, and no biases. Keys/values persist across positions, simulating a KV cache. Loss: average -log P(next_token) over the sequence (BOS-wrapped docs, up to block_size=16).

Adam Training + Inference in 1000 Steps

Shuffles the 32k names and cycles through the docs. Per step: tokenize BOS + chars + BOS, forward all positions (building the KV cache), average the cross-entropy loss → backward → Adam update (lr=0.01 with linear decay to 0, β1=0.85, β2=0.99). Prints loss, which typically drops from ~3 to ~1.5. Inference: start from BOS and sample from temperature-scaled probs (temp=0.5) until BOS is produced again, yielding plausible names like 'korsal' after training. Demonstrates that the core GPT is simple; libraries add speed and scale. Trade-off: slow (minutes on CPU), but every op is visible.
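The Adam update above can be sketched as a textbook Adam step with the quoted hyperparameters (lr=0.01 decayed linearly to 0, β1=0.85, β2=0.99); the flat parameter layout and the bias-correction details here are illustrative assumptions, not necessarily the file's exact code:

```python
def adam_step(params, grads, m, v, step, num_steps,
              lr0=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    """One Adam update over flat lists of params/grads with moment buffers m, v."""
    lr = lr0 * (1 - step / num_steps)  # linear decay from lr0 to 0
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g       # first moment (momentum)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g   # second moment (RMS)
        m_hat = m[i] / (1 - beta1 ** (step + 1))    # bias correction
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)
    return params
```

With pure-Python scalars there is no vectorization, which is exactly why training takes minutes on CPU for even ~10k parameters.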

Summarized by x-ai/grok-4.1-fast via openrouter

11786 input / 1242 output tokens in 8684ms

© 2026 Edge