Train GPT-2 for $48 in 2 Hours on 8xH100 with nanochat

nanochat trains GPT-2-grade LLMs (CORE score >0.2565) on a single 8xH100 node for ~$48 (~2-3 hours wall-clock), with all hyperparameters auto-derived from a single --depth dial, plus a ChatGPT-like web UI.

Achieve GPT-2 Performance at a Fraction of the Original Cost

nanochat trains full GPT-2-equivalent models (1.6B params, CORE score 0.2565+) for $15-48 on spot or on-demand 8xH100 nodes (~$3/GPU/hr, ~$24/hr/node), versus GPT-2's 2019 training cost of ~$43k. A single --depth dial (e.g., d24-d26 for GPT-2 grade) auto-sets every hyperparameter for compute-optimal scaling: transformer width, head count, LR schedule, training horizon, and weight decay, as sketched below. Pretraining dominates the compute; the full pipeline (pretraining, SFT, RL, eval, inference, ChatGPT-like UI) runs end-to-end. Reproduce via bash runs/speedrun.sh on a Lambda.ai 8xH100 node: ~2-3 hours to a 4e19-FLOP model. Serve with python -m scripts.chat_web for a web UI at http://<host>:8000. The model behaves like a "kindergartener": it hallucinates its identity and explains sky color simply.
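A minimal sketch of how a single depth dial can drive the whole configuration. The exact formulas below (aspect ratio 64, head dim 128, the LR scaling rule) are illustrative assumptions, not nanochat's literal code; see scripts/base_train.py in the repo for the real derivation.

```python
# Sketch: derive the rest of the model config from --depth alone.
# Assumed rules: width = depth * 64, heads from a fixed head dim,
# LR shrinking with width. All three are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    depth: int
    model_dim: int
    n_heads: int
    lr: float

def config_from_depth(depth: int, head_dim: int = 128, aspect: int = 64) -> GPTConfig:
    model_dim = depth * aspect               # width tied to depth (assumed aspect ratio)
    n_heads = max(1, model_dim // head_dim)  # heads follow from a fixed head dimension
    lr = 0.02 * (768 / model_dim)            # LR decays as width grows (illustrative)
    return GPTConfig(depth, model_dim, n_heads, lr)

print(config_from_depth(26))  # d26: a GPT-2-grade setting per the text
```

The point of the design is that every knob becomes a function of depth, so sweeping one integer sweeps a compute-optimal family of models.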

Trade-offs: a single GPU works (gradient accumulation, ~8x slower); GPUs with <80GB VRAM need --device-batch-size reduced (32 → 16/8/4/2/1). CPU/MPS run via runs/runcpu.sh (tiny model, weak results). Precision is auto-selected: bf16 on A100/H100 (native tensor-core support), fp32 on V100/T4/CPU/MPS; override with NANOCHAT_DTYPE=bfloat16/float16/float32. Weights stay fp32 for the optimizer, compute runs in COMPUTE_DTYPE, and embeddings sit in reduced precision; no torch.amp.autocast.
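A sketch of both mechanisms described above. The NANOCHAT_DTYPE override is from the text; the detection logic and the grad_accum_steps helper are assumptions about how such a setup is typically wired, not nanochat's exact code.

```python
# Sketch: precision auto-selection plus gradient-accumulation math.
import os
import torch

def compute_dtype() -> torch.dtype:
    # Explicit env override wins, per the text.
    override = os.environ.get("NANOCHAT_DTYPE")
    if override:
        return {"bfloat16": torch.bfloat16,
                "float16": torch.float16,
                "float32": torch.float32}[override]
    # bf16 only where tensor cores support it natively (A100/H100);
    # V100/T4/CPU/MPS fall back to fp32.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float32

def grad_accum_steps(total_batch: int, device_batch: int, world_size: int) -> int:
    # Fewer GPUs or a smaller --device-batch-size means more accumulation
    # steps for the same effective batch (hence ~8x slower on one GPU).
    assert total_batch % (device_batch * world_size) == 0
    return total_batch // (device_batch * world_size)

COMPUTE_DTYPE = compute_dtype()
print(COMPUTE_DTYPE, grad_accum_steps(total_batch=256, device_batch=16, world_size=8))
```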

Leaderboard Drives Community Optimization

"Time-to-GPT-2" leaderboard ranks wall-clock on 8xH100 to beat GPT-2 CORE 0.256525 via DCLM CORE eval (scripts.base_eval.py). Current best: 1.65 hours (0.2626 CORE, ClimbMix dataset, autoresearch). Progress: 168hr (2019 GPT-2) → 3.04hr baseline → 2.91hr (fp8) → 2.76hr (1M token batch) → 2.02hr (ClimbMix) → 1.80hr (autoresearch r1) → 1.65hr (r2). Submit via runs/speedrun.sh; see dev/LEADERBOARD.md. Monitor wandb: val_bpb vs step/FLOPs/time, CORE, VRAM/MFU/tok/sec. Quick expts: d12 (--depth=12, ~5min pretrain) tests changes across depths.

Minimal, Hackable Code for the Full LLM Pipeline

~1k LoC of PyTorch: nanochat/gpt.py (transformer), dataloader.py (distributed tokenization), optim.py (AdamW/Muon), tokenizer.py (GPT-4-style BPE), engine.py (KV-cache inference), execution.py (Python tool execution), core_eval.py (DCLM CORE). Stages: base_train.py (pretraining), chat_sft.py (SFT), chat_rl.py (RL), chat_eval.py (tasks: ARC, GSM8K, MMLU, HumanEval, spellingbee, SmolTalk), chat_cli/chat_web. Tasks live in tasks/ and compose into mixtures and sequences. Data: FineWeb (HuggingFace), ClimbMix (NVIDIA). Setup: uv sync --extra gpu --group dev (uv dependency manager). Scripts scaling_laws.sh and miniseries.sh sweep depths. No config monsters: depth drives everything.
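One detail worth a sketch is the dual-optimizer setup in optim.py. In modded-nanoGPT-lineage codebases, 2D hidden weight matrices typically go to Muon while embeddings, norms, and the output head go to AdamW; the grouping heuristic below is an assumption in that style, not nanochat's exact rule, and muon_cls is passed in rather than imported to avoid pinning a specific Muon implementation.

```python
# Sketch: split parameters between Muon (matrices) and AdamW (the rest).
import torch

def build_optimizers(model: torch.nn.Module, muon_cls, lr: float):
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(p)   # hidden weight matrices -> Muon
        else:
            other_params.append(p)    # embeddings, norms, biases, lm_head -> AdamW
    adamw = torch.optim.AdamW(other_params, lr=lr)
    muon = muon_cls(matrix_params, lr=lr)
    return adamw, muon
```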

Research and Customization Workflow

A forkable baseline for <$1k micro-models. Improve pretraining (e.g., datasets, fp8, 1M-token batches). Guides cover infusing personality via synthetic data (dev/gen_synthetic_data.py) mixed into SFT, and adding abilities (e.g., counting the 'r's in "strawberry") via custom JSON tasks in tasks/; a hypothetical example follows below. Example run: torchrun -m scripts.base_train --depth=12 --run=d12 (wandb logging, no intermediate artifacts). PRs must declare LLM contributions. Inspired by nanoGPT/modded-nanoGPT. Cite as @misc{nanochat...}.
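A hypothetical sketch of generating such a custom task as synthetic SFT data. The messages schema, the filename, and the letter-counting prompt are all assumptions for illustration; check tasks/ and dev/gen_synthetic_data.py for the actual format nanochat expects.

```python
# Sketch: synthesize a letter-counting task (assumed chat-message schema).
import json
import random

WORDS = ["strawberry", "raspberry", "blueberry", "cherry"]

def make_example(word: str, letter: str = "r") -> dict:
    return {"messages": [
        {"role": "user", "content": f"How many '{letter}'s are in '{word}'?"},
        {"role": "assistant", "content": str(word.count(letter))},
    ]}

with open("spelling_counts.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_example(random.choice(WORDS))) + "\n")
```

Mixing a small file like this into the SFT stage is the pattern the guides describe for teaching narrow abilities.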
