LLM Architecture Gallery: Diagrams, Specs & Diffs for 70+ Models
Sebastian Raschka's gallery visualizes 70+ LLM architectures with diagrams, key specs like KV cache costs, attention types, and a diff tool—ideal for comparing dense vs. MoE designs and inference tradeoffs.
Dense Decoder Baselines: From GPT-2 to Modern Llamas
The gallery starts with foundational dense transformers like GPT-2 XL (1.5B, 2019), kept as a reference point: 48 MHA layers, 1k context, and 300 KiB KV cache/token (high), built on the 'Classic GPT-2 recipe with dropout, GELU, LayerNorm, and full multi-head attention.' That recipe contrasts sharply with later stacks, showing how decoder depths have exploded while optimizations like grouped-query attention (GQA) have cut inference costs.
Llama 3 (8B, 2024) exemplifies pre-norm dense baselines: 32 GQA layers with RoPE, 8k context, 128 KiB KV cache/token (moderate), and a wider profile than peers like OLMo 2 for stability. Llama 3.2 (1B) scales this down to 16 GQA layers, 128k context, and 32 KiB KV (low), prioritizing width over depth in contrast to the narrower Qwen3 0.6B. These choices favor training stability but rack up KV cache at long contexts, a recurring tradeoff.
Qwen3 (32B) pushes dense scale further: full config details are on Hugging Face, it ships under an Apache license, and it serves as a benchmark point for OLMo 3 32B. Dense models dominate sub-50B scales for their simplicity, but KV caches (e.g., 512 KiB/token for OLMo 2 7B's 32 MHA layers with QK-Norm) highlight the memory pain points versus sparse alternatives.
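Those per-token figures fall straight out of the attention geometry: layers × 2 (K and V vectors) × KV heads × head dim × 2 bytes for bf16. The minimal sketch below reproduces the gallery's numbers; the head counts come from the models' public configs and are assumptions on my part, not values quoted from the gallery:

```python
# Sketch: per-token KV-cache cost from attention geometry (bf16 = 2 bytes).
# Head counts below are from public model cards, not from the gallery itself.

BYTES_BF16 = 2

def kv_cache_per_token_kib(n_layers: int, n_kv_heads: int, head_dim: int) -> float:
    """KiB stored per token: one K and one V vector per layer."""
    per_layer = 2 * n_kv_heads * head_dim * BYTES_BF16  # K + V
    return n_layers * per_layer / 1024

# GPT-2 XL: 48 MHA layers, 25 heads x 64 dims -> every head is cached
print(kv_cache_per_token_kib(48, 25, 64))    # 300.0 KiB, matching the card
# Llama 3 8B: 32 GQA layers, only 8 KV heads x 128 dims
print(kv_cache_per_token_kib(32, 8, 128))    # 128.0 KiB
# OLMo 2 7B: 32 MHA layers, 32 heads x 128 dims
print(kv_cache_per_token_kib(32, 32, 128))   # 512.0 KiB
```

The GQA win is visible immediately: Llama 3 8B caches only 8 of its 32 attention heads, a 4x saving over same-width MHA.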
Sparse MoE Revolution: DeepSeek V3 Template Takes Over
Mixture-of-Experts (MoE) architectures dominate recent giants, starting with DeepSeek V3 (671B total/37B active, 5.5% active, Dec 2024): 61 MLA layers and 128k context, with MLA keeping the KV cache low at 68.6 KiB/token while a dense prefix plus a shared expert keep the routing tractable. As the card puts it, the model 'Uses a dense prefix plus a shared expert to keep a very large model practical at inference,' balancing capacity and speed, and it spawned derivatives like DeepSeek R1 (same architecture, reasoning-tuned: 'Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe').
Qwen3 (235B-A22B, 9.4% active) refines the template: 94 GQA layers with QK-Norm, dropping the shared expert for pure routed efficiency. Llama 4 Maverick (400B/17B active) adapts it with 36 chunked plus 12 full GQA layers, 1M context, and 192 KiB KV/token (high), using fewer but larger experts than DeepSeek. The trends: MoE active parameters hover around 4-10%, MLA or GQA handles attention, and chunking stretches ultra-long contexts, cutting inference costs around 5x versus dense equivalents at similar quality.
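To make the routing concrete, here is a toy sketch of the DeepSeek-style pattern the cards describe: a few routed experts chosen per token, plus a shared expert that always runs. Expert counts, dimensions, the single-matrix "experts," and the softmax-over-top-k weighting are illustrative assumptions, not DeepSeek's actual configuration:

```python
# Toy sketch of MoE routing with a shared expert (DeepSeek-style pattern).
# Sizes and gating details are illustrative; real experts are SwiGLU MLPs.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2          # toy sizes, not a real config

experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
shared_expert = rng.standard_normal((d, d)) / np.sqrt(d)
router = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d,) one token. Only top_k of n_experts run, plus the shared one."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                          # chosen experts
    w = np.exp(logits[top]) / np.exp(logits[top]).sum()        # softmax over top-k
    routed = sum(wi * (x @ experts[i]) for wi, i in zip(w, top))
    return routed + x @ shared_expert                          # shared expert always fires

x = rng.standard_normal(d)
print(moe_forward(x).shape)  # (64,): compute scales with top_k, not n_experts
```

Per-token compute scales with the experts actually run, which is why a 671B-parameter model can serve at 37B-active cost.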
Attention & Normalization Innovations for Scale
Attention evolves from MHA (GPT-2, OLMo 2) to GQA (the Llamas and most others) and MLA (DeepSeek's MoEs), with add-ons along the way: RoPE rotary embeddings (Llama 3), QK-Norm for stability (OLMo 2, which also 'Uses inside-residual post-norm instead of the usual pre-norm layout'), and sliding-window attention (SWA) in Gemma 3 (27B: 52 SW + 10 global layers, 496 KiB KV/token, very high, with a 262k vocab). Mistral Small 3.1 (24B) ditches SWA for pure GQA (40 layers, 160 KiB/token, moderate), prioritizing latency.
These tweaks attack attention's memory and compute costs: GQA shares each KV head across a group of query heads, MLA compresses keys and values into a low-rank latent, and SWA restricts most layers to a local window. The tradeoffs surface in the KV cache: high for global MHA (memory hogs), low for optimized MoE (serving wins). The gallery links each concept (e.g., GQA) for deeper dives.
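A minimal numpy sketch of the GQA mechanism shows why it cuts the cache: only the few KV heads are stored, then broadcast across query groups at attention time. Sizes here are toy values, not any gallery model's config:

```python
# Sketch of the GQA trick: cache few KV heads, repeat them across query
# groups during attention. Pure numpy; all sizes are toy values.
import numpy as np

rng = np.random.default_rng(0)
seq, n_q_heads, n_kv_heads, head_dim = 16, 8, 2, 32   # 4 query heads per KV head

q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))  # only these get cached
v = rng.standard_normal((n_kv_heads, seq, head_dim))

group = n_q_heads // n_kv_heads
k_rep = np.repeat(k, group, axis=0)   # broadcast KV heads to match query heads
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
out = attn @ v_rep
print(out.shape)  # (8, 16, 32): full query heads, but the KV cache is 4x smaller
```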
Efficiency Metrics: KV Cache, Context, and Benchmarks
KV cache/token (bf16) quantifies inference memory: low (<100 KiB) for MoE models like DeepSeek (68 KiB), high (300-500+ KiB) for dense global-attention models like GPT-2 or Gemma 3. Context windows leap from 1k (GPT-2) to 1M (Llama 4). Licenses vary: MIT/Apache for open models (OLMo, Qwen), custom and more restrictive terms for Llama and Gemma.
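Multiplying the per-token figure by context length gives a back-of-envelope cache budget for one max-length sequence. The helper below is illustrative, using the gallery's published numbers:

```python
# Back-of-envelope: per-token KV cost x context length = cache for one
# max-length sequence. Per-token figures are the gallery's published values.
KIB = 1024

def kv_total_gib(kib_per_token: float, context_tokens: int) -> float:
    return kib_per_token * KIB * context_tokens / (1024 ** 3)

print(kv_total_gib(68.6, 128 * 1024))   # DeepSeek V3 at 128k ctx  -> ~8.6 GiB
print(kv_total_gib(32, 128 * 1024))     # Llama 3.2 1B at 128k ctx -> 4.0 GiB
print(kv_total_gib(192, 1024 * 1024))   # Llama 4 Maverick at 1M   -> 192 GiB
```

The Maverick figure shows why "1M context" headlines deserve scrutiny: even a moderate per-token cost becomes a serious memory bill at full window.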
The AA Intelligence Index aggregates capability scores: Gemma 3 27B (10.3 total: coding 9.6, agents 3.5), DeepSeek V3 (16.5), R1 (18.8), while Llama 3.2 1B lags (6.3). The diff tool overlays any two models (e.g., GPT-2 vs. Llama 3), revealing differences in layer counts, head dims, and norms, which is crucial for fine-tuning or replication.
Physical and digital posters (Redbubble/Gumroad) serve as quick references, and from-scratch GitHub implementations (e.g., Llama 3, Gemma 3, Qwen3) let builders recreate the architectures. A changelog and RSS feed track updates through April 2026.
"Late-2019 dense baseline included here as a reference point for how much decoder stacks have changed since GPT-2." – GPT-2 XL card, underscoring 6+ years of progress in layer norms, attention, sparsity.
"Pre-norm baseline; wider than OLMo 2 at a similar scale." – Llama 3 8B, explaining width's role in dense stability vs. OLMo's post-norm experiment.
"Uses inside-residual post-norm instead of the usual pre-norm layout." – OLMo 2 7B, highlighting a rare normalization pivot for training gains.
"Uses a dense prefix plus a shared expert to keep a very large model practical at inference." – DeepSeek V3, core to MoE practicality.
"Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe." – DeepSeek R1, showing arch reuse across post-training variants.
Key Takeaways
- Prioritize models with low KV cache/token (<100 KiB), like the DeepSeek MoEs, for cost-effective long-context inference; avoid high (>300 KiB) dense global-attention models unless context stays under 8k.
- Use GQA or MLA over MHA for 2-4x KV savings at similar quality; add QK-Norm for stability in custom training.
- Benchmark new LLMs against gallery baselines (e.g., Llama 3 8B for dense 8B) via AA Index before integration.
- Leverage the diff tool for architecture audits: spot layer mixes and MoE active-parameter percentages to predict serving needs.
- Replicate via the linked from-scratch code (GitHub/LLMs-from-scratch), with chapters covering GPT-to-Llama, Gemma 3, and Qwen3.
- Track open licenses (Apache/MIT) for production use; review Gemma/Llama restrictions early.
- Print the poster for team reference—medium size balances readability and wall space.
- File issues on GitHub for inaccuracies; contribute via changelog.