AI Agents Auto-Optimize Nanochat LLM Training on One GPU

AI agents autonomously edit train.py, run 5-minute training epochs on nanochat, evaluate via the val_bpb metric (lower is better), and iterate overnight to improve models without human intervention.

Autonomous Research Loop Drives Overnight Improvements

AI agents replace manual LLM research by iteratively modifying train.py (model, optimizer, training loop), running fixed 5-minute wall-clock training sessions (startup time excluded), and evaluating validation bits-per-byte (val_bpb; lower is better, and vocab-independent for fair architecture comparisons). After each run, the agent checks whether val_bpb improved: if yes, it commits the change; if not, it discards the edit and retries.

To start, prompt Claude/Codex (with permission prompts disabled): "Hi have a look at program.md and let's kick off a new experiment! let's do the setup first." program.md supplies the agent's context and instructions as a lightweight "skill"; edit it to refine agent behavior, add more agents, or accelerate progress. Wake up to experiment logs and, potentially, better models from nanochat (a simplified single-GPU LLM trainer).

Core files: prepare.py (data prep and constants; do not modify), train.py (agent-editable), program.md (agent programming). Setup: a single NVIDIA GPU (H100 tested), Python 3.10+, and the uv package manager; run uv sync, then python prepare.py.
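The commit-if-improved loop above can be sketched as a small Python function. This is a minimal illustration, not the repo's actual driver: the `run_training`, `commit`, and `revert` callables are hypothetical injection points (in practice the agent itself edits train.py and runs git), and the git commands shown are one plausible way to keep or discard an edit.

```python
import subprocess


def iterate_once(best_bpb: float, run_training, commit, revert) -> float:
    """One accept/reject step of the overnight loop.

    run_training() trains for the fixed 5-minute budget and returns the
    resulting val_bpb; commit() keeps the agent's edit to train.py and
    revert() discards it. All three are injected so the decision logic
    stays testable without a GPU.
    """
    bpb = run_training()
    if bpb < best_bpb:  # val_bpb: lower is better
        commit()
        return bpb
    revert()
    return best_bpb


# Plausible git-backed implementations of the two callbacks (assumptions,
# not the repo's actual mechanism):
def git_commit(msg: str = "agent: improved val_bpb") -> None:
    subprocess.run(["git", "commit", "-am", msg], check=True)


def git_revert() -> None:
    subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```

Running `iterate_once` in a loop until morning, with the agent proposing a new train.py edit before each call, reproduces the overnight behavior described above.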

Fixed-Time Budget Enables Rapid Iteration

Every experiment uses a strict 5-minute training budget regardless of hardware details, so throughput matters directly. The val_bpb metric normalizes across vocab sizes and architectures. Beginners can start with the "Dummy's Guide" tweet for neural-net basics; the parent nanochat repo provides full context. The repo is kept deliberately minimal (no CPU/MPS support yet; forks welcome, and parent nanochat has broader support such as Flash Attention 3 fallbacks).

Tuning for Smaller GPUs Maximizes Accessibility

On sub-H100 hardware (e.g., MacBooks), fork the repo and adjust hyperparameters in prepare.py and train.py: reduce vocab_size (the default suits tiny models), MAX_SEQ_LEN (e.g., 1024), DEVICE_BATCH_SIZE, EVAL_TOKENS (fewer for speed), DEPTH, WINDOW_PATTERN, and TOTAL_BATCH_SIZE (e.g., 2**14). For help, prompt a coding agent with this guide plus the source code. Notable forks are listed for low-compute tinkering.
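The batch-related constants above interact: TOTAL_BATCH_SIZE is a token budget per optimizer step, and whatever DEVICE_BATCH_SIZE * MAX_SEQ_LEN cannot cover is typically made up with gradient accumulation. A hypothetical set of small-GPU overrides (the specific values are illustrative; the real defaults live in prepare.py and train.py):

```python
# Hypothetical overrides for a small GPU; names mirror the constants
# listed above, values are examples only.
MAX_SEQ_LEN = 1024        # shorter context -> less activation memory
DEVICE_BATCH_SIZE = 8     # sequences per forward pass that fit in VRAM
TOTAL_BATCH_SIZE = 2**14  # tokens per optimizer step (16384)
EVAL_TOKENS = 2**18       # fewer eval tokens -> faster val_bpb estimates
DEPTH = 6                 # fewer transformer layers

# Gradient-accumulation steps implied by the batch settings:
# 16384 tokens / (8 sequences * 1024 tokens) = 2 micro-batches per step.
grad_accum = TOTAL_BATCH_SIZE // (DEVICE_BATCH_SIZE * MAX_SEQ_LEN)
```

Shrinking DEVICE_BATCH_SIZE further (to fit VRAM) simply raises grad_accum, trading wall-clock throughput for memory within the same token budget.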

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge