Train a GPT-2 LLM from Scratch on a Laptop
Hands-on workshop: build a tokenizer, a causal transformer, and a training loop in pure PyTorch to train a tiny GPT-2 on Shakespeare locally (16 GB RAM) or on Colab, revealing the core engineering without cloud-scale resources.
Why Local LLM Training Reveals Core Mechanics
Training an LLM from scratch locally demystifies the process, covering roughly 80% of what big labs do without cloud-scale resources. Angelos Perivolaropoulos, who leads speech-to-text at ElevenLabs (creators of the benchmark-leading Scribe v2 model), emphasizes starting with the basics: no pre-trained weights, pure PyTorch. The tiny GPT-2 variant built here (vocab of 65 characters, context of 256, 6 layers) trains quickly on a laptop and exposes tokenizer choices, architecture blocks, and training loops as the real differentiators between models like GPT-3 and GPT-4.
Key principle: focus on bi-grams (token pairs). A small vocab (65) yields ~4k possible bi-grams, which the Shakespeare dataset covers many times over; a large vocab (~50k, like GPT-2's) needs on the order of vocab² tokens to converge. "If you have a model with 200,000 tokens, you need 200,000 tokens squared at least data to train from scratch."
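Concretely: 65² = 4,225 possible bi-grams, which a corpus of roughly a million characters covers many times over, whereas GPT-2's 50,257-token vocab implies about 50,257² ≈ 2.5 billion possible pairs, far more than this dataset can exercise.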
"We're going to work purely on torch... this is like 80% of the way there to create a model from scratch."
Prerequisites: Python 3.12, 16 GB RAM (scales down), MPS/CUDA/CPU support. Use uv for the environment: `uv sync`. Colab alternative: `!pip install torch numpy datasets tiktoken`. Dataset: Shakespeare (a tiny text corpus, downloadable via the repo).
Tokenizer: Character-Level for Tiny Models
Start here: LLMs process vectors, not text. A character-level tokenizer maps 65 characters (A-Z, a-z, punctuation, space, newline) to integers via a simple dict/enumerate. Strings become integer tensors; an embedding layer then maps them to vectors (dim=384).
Steps (sketched in code below):
- Load data: `text = open('input.txt', 'r').read()` (Shakespeare).
- Build vocab: `chars = sorted(list(set(text)))`; `stoi = {ch: i for i, ch in enumerate(chars)}`; `itos = {i: ch for i, ch in enumerate(chars)}`; `vocab_size = len(chars)`.
- Encode: `def encode(s): return [stoi[c] for c in s]`; batch via `torch.tensor`.
- Decode: reverse mapping for output.
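A minimal sketch of those steps, assuming the Shakespeare corpus has been downloaded to `input.txt` (variable and function names are illustrative, not necessarily the repo's):

```python
import torch

# Read the raw Shakespeare corpus (assumes input.txt is in the working directory).
text = open('input.txt', 'r', encoding='utf-8').read()

# Vocabulary: every unique character in the corpus (~65 for Shakespeare).
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for i, ch in enumerate(chars)}  # int -> char

def encode(s: str) -> list[int]:
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map a list of integer token ids back to a string."""
    return ''.join(itos[i] for i in ids)

# Round-trip sanity check (the quality check mentioned below).
sample = "To be, or not to be"
assert decode(encode(sample)) == sample

# The whole corpus as a single 1-D tensor of token ids.
data = torch.tensor(encode(text), dtype=torch.long)
```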
Trade-off: a low vocab trains fast on small data but scales poorly; the model struggles with long-range correlations (e.g., 'sky' + 'is' + 'bl' versus semantic tokens). For code, it falls back to characters for rare variable names; BPE (trained on data patterns like 'for' or 'enumerate') is better for production but needs massive data.
"Character level because it's much easier to train... 65*65 = 4,225 possible bi-grams... our dataset should include all bi-grams multiple times."
Common mistake: using the full GPT-2 vocab (~50k). The embedding table alone is ~19M parameters (about 3x the rest of the model) and won't converge on this data. Future-proofing: train a BPE tokenizer on your own corpus for real LLMs.
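For reference, the arithmetic behind that warning: with GPT-2's 50,257-token vocab and n_embd=384, the embedding table is 50,257 × 384 ≈ 19.3M parameters, versus roughly 65 × 384 ≈ 25k for the character-level vocab.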
Quality check: Ensure all bi-grams covered; test encode/decode round-trip.
Causal Transformer: Stack Simple Blocks
GPT-2 base: decoder-only with causal self-attention. You don't need PhD-level math; implement the blocks and learn why they work through experimentation.
Core blocks (per layer; a code sketch follows this list):
- Multi-head self-attention: computes token relationships via Q, K, V matrices. A causal mask prevents peeking at future tokens: `mask = torch.tril(torch.ones(block_size, block_size))`. Heads (e.g., n_head=6) attend in parallel, then are concatenated and projected.
- MLP/feed-forward: processes each token's attended features position-wise (the final linear head, not the MLP, produces the logits).
- Residuals: add the input to the sublayer output (`x + sublayer(x)`) so gradients flow directly; this stabilizes deep stacks.
- LayerNorm: normalizes activations before each sublayer (pre-norm: `x + sublayer(ln(x))`); prevents exploding/vanishing activations.
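A minimal sketch of one such block in plain PyTorch, using the hyperparameters listed just below (class names `CausalSelfAttention` and `TransformerBlock` are illustrative, not necessarily the repo's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, n_head, block_size = 384, 6, 256

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a lower-triangular (causal) mask."""
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # Q, K, V in one projection
        self.proj = nn.Linear(n_embd, n_embd)                 # output projection after concat
        # Causal mask: position i may only attend to positions <= i.
        self.register_buffer('mask', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(n_embd, dim=2)
        # Reshape to (B, n_head, T, head_dim) so all heads attend in parallel.
        q = q.view(B, T, n_head, C // n_head).transpose(1, 2)
        k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
        v = v.view(B, T, n_head, C // n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (C // n_head) ** 0.5
        att = att.masked_fill(self.mask[:T, :T] == 0, float('-inf'))  # hide future tokens
        att = F.softmax(att, dim=-1)
        out = att @ v                                          # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)   # concatenate heads
        return self.proj(out)

class TransformerBlock(nn.Module):
    """Pre-norm block: x + attn(ln(x)), then x + mlp(ln(x))."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention()
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around MLP
        return x
```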
Model params:
- `n_embd=384` (embedding dim)
- `n_head=6`
- `n_layer=6`
- `block_size=256` (context)
Implementation skeleton (PyTorch `nn.Module`; sketched below):
- Embed: `self.tok_emb = nn.Embedding(vocab_size, n_embd)`.
- Positional embed: `self.position_embedding_table = nn.Embedding(block_size, n_embd)`.
- Layers: stack of `TransformerBlock` (attention + MLP + norms).
- Final: `ln_f = LayerNorm(n_embd)` → `lm_head = nn.Linear(n_embd, vocab_size)` (no bias; optionally tie weights with the embedding).
- Forward: add positional embeddings, loop through the layers, project to logits.
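Continuing the sketch above, the skeleton as a single module (the `TinyGPT` name is illustrative):

```python
class TinyGPT(nn.Module):
    def __init__(self, vocab_size, n_embd=384, n_layer=6, block_size=256):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)                   # token embeddings
        self.position_embedding_table = nn.Embedding(block_size, n_embd)  # learned positions
        self.blocks = nn.Sequential(*[TransformerBlock() for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)                                  # final norm
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)          # logits over vocab

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.position_embedding_table(pos)  # (B, T, n_embd)
        x = self.blocks(x)
        x = self.ln_f(x)
        return self.lm_head(x)                                       # (B, T, vocab_size)
```

Per the shape test mentioned below: feeding a `(4, 256)` batch of token ids should return logits of shape `(4, 256, vocab_size)`.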
Principle: stack identical layers; residuals and norms make depth scalable. Big labs optimize attention for 1M+ contexts (e.g., avoiding the O(n²) blowup), but the base version works here.
"Attention is what makes transformers different... they can attend to previous tokens and understand relationships."
Mistake: omitting the causal mask, which lets the model cheat by seeing future tokens. Test: run a forward pass on a sample and check the output shape (batch, seq, vocab).
Training Loop: Where Performance Wins
Pre-training core: next-token prediction with cross-entropy loss. Smarter training loops are a large part of what separates GPT-3 from GPT-4 (e.g., a Gemini 3 → 3.1 jump doubling benchmarks via tuning).
Steps (a full sketch follows this list):
- Data: split train/val; generate batches with `get_batch('train')` → (B, T) integer tensors.
- Optimize: AdamW, lr=1e-3 (warmup optional; keep it constant to start).
- Loop: `for i in range(max_iters): xb, yb = get_batch('train'); logits = model(xb); loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1)); optimizer.zero_grad(); loss.backward(); optimizer.step()`.
- Eval: perplexity on the val split (`torch.exp(loss)`).
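A sketch of the batch sampler and training loop, assuming `data`, `vocab_size`, and `TinyGPT` from the earlier sketches (batch size and iteration counts follow the values in the text):

```python
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else (
    'mps' if torch.backends.mps.is_available() else 'cpu')
batch_size, block_size, max_iters, eval_interval = 32, 256, 5000, 500

# 90/10 train/val split of the encoded corpus.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    """Sample a (B, T) batch of inputs and the same sequences shifted by one as targets."""
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size - 1, (batch_size,))
    xb = torch.stack([d[i:i + block_size] for i in ix])
    yb = torch.stack([d[i + 1:i + 1 + block_size] for i in ix])
    return xb.to(device), yb.to(device)

model = TinyGPT(vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for it in range(max_iters):
    xb, yb = get_batch('train')
    logits = model(xb)                                       # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if it % eval_interval == 0:
        with torch.no_grad():
            xv, yv = get_batch('val')
            val_loss = F.cross_entropy(model(xv).view(-1, vocab_size), yv.view(-1))
        print(f"iter {it}: train loss {loss.item():.3f}, val loss {val_loss.item():.3f}, "
              f"val perplexity {torch.exp(val_loss).item():.1f}")
```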
Batch size: 4-64 (RAM-limited); steps: 5k+ for convergence. Estimate iterations per epoch as dataset_tokens / (batch_size * block_size).
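For example, the tiny Shakespeare corpus is roughly 1M characters, so at batch_size=32 and block_size=256 one epoch is about 1,000,000 / (32 × 256) ≈ 120 iterations; 5k iterations therefore revisit the data many times.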
"The training loop is generally the most important part... what you use with the same base model makes the big difference."
Trade-off: a small context (256) trains fast but forgets long-range dependencies; increase it on a bigger GPU.
Inference: a simple loop that repeatedly samples the next token (greedy or top-k) and appends it to the context (sketched below).
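A sketch of that sampling loop, continuing the earlier sketches (uses plain multinomial sampling with an optional top-k filter; `generate` is an illustrative name):

```python
@torch.no_grad()
def generate(model, idx, max_new_tokens, top_k=None):
    """Autoregressively append tokens to the (B, T) context `idx`."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.block_size:]            # crop to the context window
        logits = model(idx_cond)[:, -1, :]               # logits for the last position only
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float('-inf')  # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx

# Start from a single newline character and sample 200 new tokens.
context = torch.tensor([[stoi['\n']]], dtype=torch.long, device=device)
print(decode(generate(model, context, max_new_tokens=200)[0].tolist()))
```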
Hardware Trade-offs and Extensions
Local constraints force smart choices: 16 GB RAM means a tiny model (a few million parameters). Colab's free GPUs are plenty at this scale.
Scaling path:
- Bigger data/GPU: BPE tokenizer, 16k context.
- A week-long training run: a proper LLM.
- Compete: drive the loss down faster than the baseline.
No deep theory needed initially: "I had no clue how transformers worked... you learn as you push through."
"Transformers have been commoditized... optimizations on the base idea."
Key Takeaways
- Use character-level tokenizer (65 vocab) for tiny local LLMs; covers bi-grams with small data like Shakespeare.
- Implement causal transformer via 4 blocks: attention (masked), MLP, residual, LayerNorm – stack 6 layers.
- Training: Next-token CE loss, AdamW; monitor val perplexity; 5k iters suffices.
- Start with `uv sync`; test on Colab if you lack GPU/RAM.
- Make the trade-off explicit: character tokenization is fast and cheap but doesn't scale; BPE for production needs far more data.
- Fork repo, beat baseline loss – extend to code tokenizer or longer context.
- Embeddings dominate small models; the GPT-2 vocab alone would roughly triple the model's size.
- Residuals/LayerNorm stabilize; causal mask essential.
- Bi-grams rule data needs: vocab² minimum tokens.