Decoder-Only Transformers: GPT's Load-Bearing Innovation

Stripping the transformer down to a decoder-only stack with causal masking enabled massive scaling, emergent capabilities such as zero-shot learning, and efficiency gains via Mixture of Experts (MoE), powering GPT's growth from 117M parameters to trillions.

Self-Attention and Causal Masking Unlock Parallel Language Modeling

Transformers replace the sequential processing of RNNs, which suffer vanishing gradients beyond 50-100 words, with self-attention, computing direct relationships between all token pairs simultaneously. For "it" in "The cat sat on the mat... because it was hungry," tokens vote on relevance: "cat" strongly, "hungry" moderately, "fishbowl" weakly. Scores = QK^T / sqrt(d_k), normalized via softmax into weights, then applied to V. Because every pair is scored at once, training parallelizes across GPUs.
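A minimal sketch of that formula in PyTorch (shapes and the name `d_k` follow the equation above; this is illustrative, not GPT's exact implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, seq, seq) pairwise scores
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted sum of value vectors
```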

GPT's decoder-only design drops the encoder, using causal masks to block attention to future tokens and forcing rich representations for next-token prediction. GPT-1 (117M params, 12 layers) introduced this; GPT-3 (175B params, 96 layers) performed zero-shot tasks via prompting; GPT-4 (~120 layers) scaled it further. Emergent behaviors like in-context learning arise without explicit training, as scale builds abstract representations: syntax in early layers, reasoning in deeper ones.
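One common way to implement the causal mask is to set future positions to -inf before the softmax, so they receive zero weight (a sketch extending the attention code above):

```python
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    # Position i may only attend to positions j <= i.
    d_k, T = Q.size(-1), Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=Q.device))
    scores = scores.masked_fill(~mask, float('-inf'))  # -inf becomes 0 after softmax
    return F.softmax(scores, dim=-1) @ V
```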

MoE and Test-Time Compute Scale Beyond Dense Limits

Dense models activate every parameter for every token, making trillion-parameter models uneconomic. Mixture of Experts (MoE) routes each token to 2-8 specialized experts out of 128+, activating only ~5% of parameters (e.g., DeepSeek-V3: 37B active of 671B total). Routers balance load across experts to prevent collapse; MLA (Multi-head Latent Attention) compresses the KV cache for cheaper inference. DeepSeek-V3 matched GPT-4 for $5.6M on 2,048 H800 GPUs.
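A toy top-k MoE layer illustrating the routing idea (all sizes here are hypothetical; production systems like DeepSeek-V3 add load-balancing losses, shared experts, and far more experts):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating scores per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        weights, idx = gates.topk(self.k, dim=-1)          # keep only top-k experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize the k gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (idx == e).any(-1)                       # tokens routed to expert e
            if sel.any():
                w = weights[sel][idx[sel] == e].unsqueeze(-1)
                out[sel] += w * expert(x[sel])             # only ~k/n_experts of FLOPs used
        return out
```

Only the selected experts run per token, which is how total parameter count can grow far beyond the per-token compute budget.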

o1 introduced test-time compute: generating hidden reasoning chains (System 2 thinking), trained via RL on verifiable rewards, and outperforming larger models that answer instantly. GPT-5 routes simple queries to a fast path and complex ones to deep reasoning. LLaMA 4 Maverick runs with 17B active of 400B total parameters on a single H100.

Multimodal Early Fusion and Practical Mini-GPT Build

Vision tokens from ViT encoders join text tokens in a shared embedding space for unified attention, enabling native cross-modal reasoning (e.g., chart analysis without captions). GLM-4.5V adds 3D positional encoding.
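Early fusion can be sketched as projecting ViT patch embeddings into the text embedding space and concatenating the two sequences before attention (dimensions here are assumptions; real VLMs differ in detail):

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, d_vision=768, d_model=512, vocab_size=50000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(d_vision, d_model)  # map ViT patches into text space

    def forward(self, patch_embeds, token_ids):
        # patch_embeds: (batch, n_patches, d_vision) from a ViT encoder
        # token_ids:    (batch, n_tokens)
        v = self.vision_proj(patch_embeds)   # (batch, n_patches, d_model)
        t = self.text_embed(token_ids)       # (batch, n_tokens, d_model)
        return torch.cat([v, t], dim=1)      # one sequence; attention sees both modalities
```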

Build a mini-GPT in PyTorch (a sketch follows below): a char-level tokenizer encodes/decodes over sorted unique characters; SelfAttention applies QKV projections and scaled dot-product attention; a TransformerBlock combines residual attention, a 4x-expanded ReLU FFN, and LayerNorm; MiniGPT stacks token/positional embeddings, N blocks, and an LM head. Train on batches (block_size=32, batch=16) to predict the next token with CrossEntropyLoss and Adam at 3e-4 for 20 epochs. Generate up to 100 tokens via top-p or multinomial sampling.
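A compact sketch of those components (single-head attention for brevity; block_size and the 4x FFN follow the text, other hyperparameters are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model, block_size):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.register_buffer('mask', torch.tril(
            torch.ones(block_size, block_size, dtype=torch.bool)))

    def forward(self, x):  # x: (B, T, d_model)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        att = q @ k.transpose(-2, -1) / C ** 0.5           # single head: d_k = d_model
        att = att.masked_fill(~self.mask[:T, :T], float('-inf'))  # causal mask
        return F.softmax(att, dim=-1) @ v

class TransformerBlock(nn.Module):
    def __init__(self, d_model, block_size):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = SelfAttention(d_model, block_size)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual attention
        return x + self.ffn(self.ln2(x)) # residual FFN

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_layers=4, block_size=32):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(block_size, d_model)
        self.blocks = nn.Sequential(
            *[TransformerBlock(d_model, block_size) for _ in range(n_layers)])
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):  # idx: (B, T) token ids
        T = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        return self.lm_head(self.blocks(x))  # (B, T, vocab_size) logits
```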

Project structure: data/dataset.txt, model/{tokenizer,attention,transformer,gpt}.py, train.py (saves model.pth), and generate.py (loads the checkpoint and generates from a prompt like "AI is").
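Training and generation, condensed into one script for illustration (uses the MiniGPT sketch above; multinomial sampling as described, with the step count an assumption standing in for "20 epochs" on a small corpus):

```python
import torch
import torch.nn.functional as F

# Char-level tokenizer over sorted unique characters.
text = open('data/dataset.txt').read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(chars[i] for i in ids)

data = torch.tensor(encode(text))
block_size, batch_size = 32, 16
model = MiniGPT(vocab_size=len(chars), block_size=block_size)  # defined in the sketch above
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

for step in range(2000):
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # next-token targets
    loss = F.cross_entropy(model(x).view(-1, len(chars)), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

torch.save(model.state_dict(), 'model.pth')

# Generate: feed the prompt, sample one token at a time.
idx = torch.tensor([encode("AI is")])
with torch.no_grad():
    for _ in range(100):
        logits = model(idx[:, -block_size:])          # crop to the context window
        probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution over next char
        idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
print(decode(idx[0].tolist()))
```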

Impacts: Efficiency Redefines AI Economics and Workflows

DeepSeek democratizes frontier AI. Harvey AI cuts legal review time 40-60% via RAG on GPT-4 (which scored in the 90th percentile on the bar exam); Cursor fixes GitHub issues at 54.6% on SWE-bench (GPT-4.1, +21.4 points over GPT-4o) while ingesting 1M-token codebases. Open weights (LLaMA 4, Qwen) ensure sovereignty.

Future directions: 10M-token contexts (LLaMA 4 Scout) via hierarchical attention; Mamba-like state-space models for linear scaling; agentic loops with tool use (MAKER framework).

