Train a GPT-2 LLM from Scratch on a Laptop
Hands-on workshop: build a tokenizer, a causal transformer, and a training loop in pure PyTorch to train a tiny GPT-2 on Shakespeare locally (16 GB RAM) or on Colab, revealing the core engineering without cloud-scale resources.
Why Local LLM Training Reveals Core Mechanics
Training an LLM from scratch locally demystifies the process, covering roughly 80% of what big labs do without cloud-scale resources. Angelos Perivolaropoulos, who leads speech-to-text at ElevenLabs (creators of the benchmark-leading Scribe v2 model), emphasizes starting with the basics: no pre-trained weights, pure PyTorch. The tiny GPT-2 variant built here (vocab of 65 characters, context of 256, 6 layers) trains quickly on a laptop and exposes tokenizer choices, architecture blocks, and training loops as the real differentiators between models like GPT-3 and GPT-4.
Key principle: focus on bi-grams (token pairs). A small vocab (65) yields ~4k possible bi-grams, which the Shakespeare dataset covers many times over; a large vocab (~50k, like GPT-2's) needs on the order of vocab² tokens to converge. "If you have a model with 200,000 tokens, you need 200,000 tokens squared at least data to train from scratch."
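Concretely: 65² = 4,225 possible bi-grams, which a corpus of roughly a million characters covers many times over, whereas GPT-2's 50,257-token vocab implies about 50,257² ≈ 2.5 billion possible pairs, far more than this dataset can exercise.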
"We're going to work purely on torch... this is like 80% of the way there to create a model from scratch."
Prerequisites: Python 3.12, 16 GB RAM (scales down), MPS/CUDA/CPU support. Use uv for the environment: `uv sync`. Colab alternative: `!pip install torch numpy datasets tiktoken`. Dataset: Shakespeare (a tiny text corpus, downloadable via the repo).
Tokenizer: Character-Level for Tiny Models
Start here: LLMs process vectors, not text. A character-level tokenizer maps 65 characters (A-Z, a-z, punctuation, space, newline) to integers via a simple dict/enumerate. Strings become integer tensors; an embedding layer then maps them to vectors (dim=384).
Steps (sketched in code below):
- Load data: `text = open('input.txt', 'r').read()` (Shakespeare).
- Build vocab: `chars = sorted(list(set(text)))`; `stoi = {ch: i for i, ch in enumerate(chars)}`; `itos = {i: ch for i, ch in enumerate(chars)}`; `vocab_size = len(chars)`.
- Encode: `def encode(s): return [stoi[c] for c in s]`; batch via `torch.tensor`.
- Decode: reverse mapping for output.
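A minimal sketch of those steps, assuming the Shakespeare corpus has been downloaded to `input.txt` (variable and function names are illustrative, not necessarily the repo's):

```python
import torch

# Read the raw Shakespeare corpus (assumes input.txt is in the working directory).
text = open('input.txt', 'r', encoding='utf-8').read()

# Vocabulary: every unique character in the corpus (~65 for Shakespeare).
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for i, ch in enumerate(chars)}  # int -> char

def encode(s: str) -> list[int]:
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map a list of integer token ids back to a string."""
    return ''.join(itos[i] for i in ids)

# Round-trip sanity check (the quality check mentioned below).
sample = "To be, or not to be"
assert decode(encode(sample)) == sample

# The whole corpus as a single 1-D tensor of token ids.
data = torch.tensor(encode(text), dtype=torch.long)
```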
Trade-off: a low vocab trains fast on small data but scales poorly; the model struggles with long-range correlations (e.g., 'sky' + 'is' + 'bl' versus semantic tokens). For code, it falls back to characters for rare variable names; BPE (trained on data patterns like 'for' or 'enumerate') is better for production but needs massive data.
"Character level because it's much easier to train... 65*65 = 4,225 possible bi-grams... our dataset should include all bi-grams multiple times."
Common mistake: using the full GPT-2 vocab (~50k). The embedding table alone is ~19M parameters (about 3x the rest of the model) and won't converge on this data. Future-proofing: train a BPE tokenizer on your own corpus for real LLMs.
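For reference, the arithmetic behind that warning: with GPT-2's 50,257-token vocab and n_embd=384, the embedding table is 50,257 × 384 ≈ 19.3M parameters, versus roughly 65 × 384 ≈ 25k for the character-level vocab.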
Quality check: Ensure all bi-grams covered; test encode/decode round-trip.
Causal Transformer: Stack Simple Blocks
GPT-2 base: decoder-only with causal self-attention. You don't need PhD-level math; implement the blocks and learn why they work through experimentation.
Core blocks (per layer; a code sketch follows this list):
- Multi-head self-attention: computes token relationships via Q, K, V matrices. A causal mask prevents peeking at future tokens: `mask = torch.tril(torch.ones(block_size, block_size))`. Heads (e.g., n_head=6) attend in parallel, then are concatenated and projected.
- MLP/feed-forward: processes each token's attended features position-wise (the final linear head, not the MLP, produces the logits).
- Residuals: add the input to the sublayer output (`x + sublayer(x)`) so gradients flow directly; this stabilizes deep stacks.
- LayerNorm: normalizes activations before each sublayer (pre-norm: `x + sublayer(ln(x))`); prevents exploding/vanishing activations.
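A minimal sketch of one such block in plain PyTorch, using the hyperparameters listed just below (class names `CausalSelfAttention` and `TransformerBlock` are illustrative, not necessarily the repo's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, n_head, block_size = 384, 6, 256

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a lower-triangular (causal) mask."""
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # Q, K, V in one projection
        self.proj = nn.Linear(n_embd, n_embd)                 # output projection after concat
        # Causal mask: position i may only attend to positions <= i.
        self.register_buffer('mask', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(n_embd, dim=2)
        # Reshape to (B, n_head, T, head_dim) so all heads attend in parallel.
        q = q.view(B, T, n_head, C // n_head).transpose(1, 2)
        k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
        v = v.view(B, T, n_head, C // n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (C // n_head) ** 0.5
        att = att.masked_fill(self.mask[:T, :T] == 0, float('-inf'))  # hide future tokens
        att = F.softmax(att, dim=-1)
        out = att @ v                                          # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)   # concatenate heads
        return self.proj(out)

class TransformerBlock(nn.Module):
    """Pre-norm block: x + attn(ln(x)), then x + mlp(ln(x))."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention()
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around MLP
        return x
```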
Model params:
- `n_embd=384` (embedding dim)
- `n_head=6`
- `n_layer=6`
- `block_size=256` (context)
Implementation skeleton (PyTorch `nn.Module`; sketched below):
- Embed: `self.tok_emb = nn.Embedding(vocab_size, n_embd)`.
- Positional embed: `self.position_embedding_table = nn.Embedding(block_size, n_embd)`.
- Layers: stack of `TransformerBlock` (attention + MLP + norms).
- Final: `ln_f = LayerNorm(n_embd)` → `lm_head = nn.Linear(n_embd, vocab_size)` (no bias; optionally tie weights with the embedding).
- Forward: add positional embeddings, loop through the layers, project to logits.
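Continuing the sketch above, the skeleton as a single module (the `TinyGPT` name is illustrative):

```python
class TinyGPT(nn.Module):
    def __init__(self, vocab_size, n_embd=384, n_layer=6, block_size=256):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)                   # token embeddings
        self.position_embedding_table = nn.Embedding(block_size, n_embd)  # learned positions
        self.blocks = nn.Sequential(*[TransformerBlock() for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)                                  # final norm
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)          # logits over vocab

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.position_embedding_table(pos)  # (B, T, n_embd)
        x = self.blocks(x)
        x = self.ln_f(x)
        return self.lm_head(x)                                       # (B, T, vocab_size)
```

Per the shape test mentioned below: feeding a `(4, 256)` batch of token ids should return logits of shape `(4, 256, vocab_size)`.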
Principle: stack identical layers; residuals and norms make depth scalable. Big labs optimize attention for 1M+ contexts (e.g., avoiding the O(n²) blowup), but the base version works here.
"Attention is what makes transformers different... they can attend to previous tokens and understand relationships."
Mistake: omitting the causal mask, which lets the model cheat by seeing future tokens. Test: run a forward pass on a sample and check the output shape (batch, seq, vocab).
Training Loop: Where Performance Wins
Pre-training core: next-token prediction with cross-entropy loss. Smarter training loops are a large part of what separates GPT-3 from GPT-4 (e.g., a Gemini 3 → 3.1 jump doubling benchmarks via tuning).
Steps (a full sketch follows this list):
- Data: split train/val; generate batches with `get_batch('train')` → (B, T) integer tensors.
- Optimize: AdamW, lr=1e-3 (warmup optional; keep it constant to start).
- Loop: `for i in range(max_iters): xb, yb = get_batch('train'); logits = model(xb); loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1)); optimizer.zero_grad(); loss.backward(); optimizer.step()`.
- Eval: perplexity on the val split (`torch.exp(loss)`).
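A sketch of the batch sampler and training loop, assuming `data`, `vocab_size`, and `TinyGPT` from the earlier sketches (batch size and iteration counts follow the values in the text):

```python
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else (
    'mps' if torch.backends.mps.is_available() else 'cpu')
batch_size, block_size, max_iters, eval_interval = 32, 256, 5000, 500

# 90/10 train/val split of the encoded corpus.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    """Sample a (B, T) batch of inputs and the same sequences shifted by one as targets."""
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size - 1, (batch_size,))
    xb = torch.stack([d[i:i + block_size] for i in ix])
    yb = torch.stack([d[i + 1:i + 1 + block_size] for i in ix])
    return xb.to(device), yb.to(device)

model = TinyGPT(vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for it in range(max_iters):
    xb, yb = get_batch('train')
    logits = model(xb)                                       # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if it % eval_interval == 0:
        with torch.no_grad():
            xv, yv = get_batch('val')
            val_loss = F.cross_entropy(model(xv).view(-1, vocab_size), yv.view(-1))
        print(f"iter {it}: train loss {loss.item():.3f}, val loss {val_loss.item():.3f}, "
              f"val perplexity {torch.exp(val_loss).item():.1f}")
```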
Batch size: 4-64 (RAM-limited); steps: 5k+ for convergence. Estimate iterations per epoch as dataset_tokens / (batch_size * block_size).
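For example, the tiny Shakespeare corpus is roughly 1M characters, so at batch_size=32 and block_size=256 one epoch is about 1,000,000 / (32 × 256) ≈ 120 iterations; 5k iterations therefore revisit the data many times.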
"The training loop is generally the most important part... what you use with the same base model makes the big difference."
Trade-off: a small context (256) trains fast but forgets long-range dependencies; increase it on a bigger GPU.
Inference: a simple loop that repeatedly samples the next token (greedy or top-k) and appends it to the context (sketched below).
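A sketch of that sampling loop, continuing the earlier sketches (uses plain multinomial sampling with an optional top-k filter; `generate` is an illustrative name):

```python
@torch.no_grad()
def generate(model, idx, max_new_tokens, top_k=None):
    """Autoregressively append tokens to the (B, T) context `idx`."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.block_size:]            # crop to the context window
        logits = model(idx_cond)[:, -1, :]               # logits for the last position only
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float('-inf')  # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx

# Start from a single newline character and sample 200 new tokens.
context = torch.tensor([[stoi['\n']]], dtype=torch.long, device=device)
print(decode(generate(model, context, max_new_tokens=200)[0].tolist()))
```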
Hardware Trade-offs and Extensions
Local constraints force smart choices: 16 GB RAM means a tiny model (a few million parameters). Colab's free GPUs are plenty at this scale.
Scaling path:
- Bigger data/GPU: BPE tokenizer, 16k context.
- A week-long training run: a proper LLM.
- Compete: drive the loss down faster than the baseline.
No deep theory needed initially: "I had no clue how transformers worked... you learn as you push through."
"Transformers have been commoditized... optimizations on the base idea."
Key Takeaways
- Use character-level tokenizer (65 vocab) for tiny local LLMs; covers bi-grams with small data like Shakespeare.
- Implement causal transformer via 4 blocks: attention (masked), MLP, residual, LayerNorm – stack 6 layers.
- Training: Next-token CE loss, AdamW; monitor val perplexity; 5k iters suffices.
- Start with `uv sync`; test on Colab if you lack GPU/RAM.
- Make the trade-off explicit: character tokenization is fast and cheap but doesn't scale; BPE for production needs far more data.
- Fork repo, beat baseline loss – extend to code tokenizer or longer context.
- Embeddings dominate small models; the GPT-2 vocab alone would roughly triple the model's size.
- Residuals/LayerNorm stabilize; causal mask essential.
- Bi-grams rule data needs: vocab² minimum tokens.