Decoder-Only Transformers Drive GPT Scaling

GPT models use decoder-only transformers with causal masking for next-token prediction; scaled massively, they develop emergent zero-shot and in-context learning, now enhanced by Mixture of Experts (MoE) for efficiency and by test-time reasoning chains.

Self-Attention Enables Parallel Long-Range Dependencies

Transformers replace RNNs' sequential processing, which suffers from vanishing gradients beyond roughly 50-100 words, with self-attention that computes direct relationships between all token pairs simultaneously. For a token like "it" in "The cat sat on the mat and looked at the fishbowl because it was hungry," every prior word votes on relevance via query-key dot products scaled by embed_size^{-0.5}, softmax-normalized, and applied to values. This parallelism is what lets training scale across thousands of GPUs.
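A minimal PyTorch sketch of that scaled dot-product step (tensor shapes and function names here are illustrative, not taken from the article):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, embed_size); every token attends to every position in parallel
    scale = q.size(-1) ** -0.5                   # the embed_size^{-0.5} factor from the text
    scores = q @ k.transpose(-2, -1) * scale     # pairwise query-key dot products
    weights = F.softmax(scores, dim=-1)          # each token's "votes" over all positions
    return weights @ v                           # relevance-weighted sum of values

q = k = v = torch.randn(1, 8, 64)                # toy batch: 8 tokens, 64-dim embeddings
out = scaled_dot_product_attention(q, k, v)      # (1, 8, 64)
```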

GPT's decoder-only design strips away the encoder, applying a causal mask to block future tokens, forcing rich representations solely from predicting the next token. GPT-1 (117M params, 12 layers) showed modest NLP scores, but GPT-2 (1.5B params) gained zero-shot abilities like summarization via prompting. GPT-3 (175B params, 96 layers) added in-context learning from prompt examples without fine-tuning. Deeper layers progress from syntax (early) to reasoning and world models (late). This simplicity scales better than encoder-decoder setups, avoiding cross-attention overhead.
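The causal mask itself is just a lower-triangular matrix applied to the attention scores before the softmax; a small illustrative snippet (variable names are assumptions):

```python
import torch
import torch.nn.functional as F

seq_len = 8
scores = torch.randn(1, seq_len, seq_len)                  # raw attention scores
causal = torch.tril(torch.ones(seq_len, seq_len))          # 1s on and below the diagonal
scores = scores.masked_fill(causal == 0, float('-inf'))    # block attention to future tokens
weights = F.softmax(scores, dim=-1)                        # each position sees only itself and earlier tokens
```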

MoE and Test-Time Compute Scale Beyond Dense Models

Dense models activate all parameters per token, making trillion-parameter models unaffordable. Mixture of Experts (MoE) routes each token to 2-8 specialized experts out of 128+, activating roughly 5% of the weights: DeepSeek-V3, for example, activates 37B of its 671B total parameters, was trained for about $5.6M on 2,048 H800 GPUs, and matches GPT-4. Multi-Head Latent Attention (MLA) compresses the KV cache to cut memory bandwidth. Tradeoffs include expert collapse (the router overloads a few experts) and the need to hold the full model in memory despite sparse activation.
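A toy sketch of top-k expert routing, assuming a plain softmax router and feed-forward experts; this shows the mechanism, not DeepSeek-V3's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    # Illustrative top-k mixture-of-experts layer; production systems add
    # load-balancing losses to avoid the expert-collapse failure mode above.
    def __init__(self, embed_size=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(embed_size, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(embed_size, 4 * embed_size), nn.ReLU(),
                           nn.Linear(4 * embed_size, embed_size))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, embed_size)
        gate = F.softmax(self.router(x), dim=-1)            # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)        # pick k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```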

o1 introduced test-time compute: the model generates internal reasoning chains (on the order of 30s for hard problems), backtracks out of dead ends, and is refined via RL on verifiable rewards such as math solutions. This outperforms larger instant-response models, decoupling ability from size. GPT-5 routes simple queries to a fast path (System 1) and complex ones to deep reasoning (System 2). Open models like DeepSeek-R1 replicate this approach.

Multimodal Fusion and Real-World Impacts

Early fusion embeds vision tokens from Vision Transformers (e.g., MetaCLIP) into the same space as text, enabling unified attention across modalities with no separate captioning step. Models like LLaMA 4 and Qwen-VL handle charts and 3D spatial reasoning (GLM-4.5V uses a rotated positional encoding). This yields native cross-modal reasoning, e.g., diagnosing X-rays directly.
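A schematic of early fusion under assumed shapes (the 768-dim patch features and the projection layer are illustrative): project vision-encoder patch features into the text embedding space and concatenate them with text embeddings so one attention stack spans both modalities.

```python
import torch
import torch.nn as nn

embed_size = 64
text_emb = nn.Embedding(50, embed_size)          # text token embeddings
vision_proj = nn.Linear(768, embed_size)         # project ViT patch features into the text space

text_tokens = torch.randint(0, 50, (1, 16))      # 16 text tokens
patch_feats = torch.randn(1, 49, 768)            # 49 patch features from a vision encoder

fused = torch.cat([vision_proj(patch_feats), text_emb(text_tokens)], dim=1)
# fused: (1, 65, 64) -- a single sequence, so the same self-attention spans image and text
```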

Applications: Harvey AI (RAG plus a fine-tuned GPT-4) cuts legal review time by 40-60%; GPT-4.1 hits 54.6% on SWE-bench (21.4 pp over GPT-4o) while ingesting 1M-token codebases; roughly 75% accuracy on medical tasks helps accelerate drug discovery. Open weights (LLaMA, DeepSeek) preserve data sovereignty.

Implement Mini-GPT from Scratch in PyTorch

Build a character-level GPT: the Tokenizer maps unique characters to indices (vocab_size ≈ 50). SelfAttention computes Q, K, V projections, then scores = (Q @ K.T) * scale, weights = softmax(scores), out = weights @ V. Each TransformerBlock adds residual attention plus an FFN (4x expansion, ReLU), with LayerNorm after each sub-layer.
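A hedged, single-head sketch of those two modules (the article implies NUM_HEADS=4; a causal mask is added here because the model is autoregressive):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    # Single-head causal self-attention: QKV projections, scaled dot-product
    # scores, softmax weights, weighted sum of values.
    def __init__(self, embed_size):
        super().__init__()
        self.qkv = nn.Linear(embed_size, 3 * embed_size)
        self.scale = embed_size ** -0.5

    def forward(self, x):                                   # x: (batch, seq, embed)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) * self.scale
        causal = torch.tril(torch.ones(x.size(1), x.size(1), device=x.device))
        scores = scores.masked_fill(causal == 0, float('-inf'))
        return F.softmax(scores, dim=-1) @ v

class TransformerBlock(nn.Module):
    # Residual attention plus a 4x-expanded ReLU feed-forward, LayerNorm after each.
    def __init__(self, embed_size):
        super().__init__()
        self.attn = SelfAttention(embed_size)
        self.ffn = nn.Sequential(nn.Linear(embed_size, 4 * embed_size), nn.ReLU(),
                                 nn.Linear(4 * embed_size, embed_size))
        self.ln1 = nn.LayerNorm(embed_size)
        self.ln2 = nn.LayerNorm(embed_size)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))                      # post-norm, as described
        return self.ln2(x + self.ffn(x))
```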

MiniGPT stacks NUM_LAYERS=2 blocks on top of token and positional embeddings (BLOCK_SIZE=32) and outputs logits through a linear layer to vocab_size. Training on dataset.txt: batch BATCH_SIZE=16 sequences, predict the next token with CrossEntropyLoss, optimize with Adam at 3e-4 for 20 EPOCHS. Generation: sample from the last-token softmax via multinomial and append up to 100 tokens, starting from a context like "AI is".
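A matching sketch of the model, one training step, and the sampling loop, reusing the TransformerBlock above. Hyperparameters follow the text; the dataset path comes from the project layout below, the 20-epoch loop is condensed to a single step, and the "AI is" prompt assumes those characters appear in the dataset:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_SIZE, NUM_LAYERS, BLOCK_SIZE, BATCH_SIZE = 64, 2, 32, 16

class MiniGPT(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, EMBED_SIZE)
        self.pos_emb = nn.Embedding(BLOCK_SIZE, EMBED_SIZE)
        self.blocks = nn.Sequential(*[TransformerBlock(EMBED_SIZE) for _ in range(NUM_LAYERS)])
        self.head = nn.Linear(EMBED_SIZE, vocab_size)        # logits over the character vocabulary

    def forward(self, idx):                                  # idx: (batch, seq) token indices
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        return self.head(self.blocks(x))

# One training step: predict the next character at every position.
text = open('data/dataset.txt').read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])

model = MiniGPT(vocab_size=len(chars))
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

ix = torch.randint(0, len(data) - BLOCK_SIZE - 1, (BATCH_SIZE,))
xb = torch.stack([data[i:i + BLOCK_SIZE] for i in ix])
yb = torch.stack([data[i + 1:i + BLOCK_SIZE + 1] for i in ix])   # targets are inputs shifted by one
loss = F.cross_entropy(model(xb).view(-1, len(chars)), yb.view(-1))
opt.zero_grad(); loss.backward(); opt.step()

# Generation: sample from the last-token softmax and append, up to 100 new tokens.
itos = {i: c for c, i in stoi.items()}
ctx = torch.tensor([[stoi[c] for c in "AI is"]])
for _ in range(100):
    logits = model(ctx[:, -BLOCK_SIZE:])                     # crop to the context window
    probs = F.softmax(logits[:, -1, :], dim=-1)
    ctx = torch.cat([ctx, torch.multinomial(probs, 1)], dim=1)
print(''.join(itos[i] for i in ctx[0].tolist()))
```

A full run would wrap the batch/loss/step lines in the 20-epoch loop and save model.pth, as train.py does.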

Project structure: data/dataset.txt; model/{tokenizer,attention,transformer,gpt}.py; train.py saves model.pth; generate.py loads the checkpoint and runs inference. Config: EMBED_SIZE=64, NUM_HEADS=4 (implied in the attention module). This replicates the core GPT logic in miniature.
