Vintage Models Unlock Anachronism-Free Research
Train LLMs exclusively on pre-1931 English texts (260B tokens for the 13B base model) to test capabilities such as predicting historical events (measure the 'surprisingness' of post-1930 descriptions), inventing beyond the knowledge cutoff (e.g., rediscovering General Relativity from 1911 data, as Einstein did by 1915), and learning to program (few-shot Python scored on HumanEval). The base model (53.1 GB, Apache 2.0) at https://huggingface.co/talkie-lm/talkie-1930-13b-base qualifies as a 'vegan' LLM, trained only on out-of-copyright data (US cutoff: January 1, 1931). The instruction-tuned version (26.6 GB) at https://huggingface.co/talkie-lm/talkie-1930-13b-it powers the chat demo at https://talkie-lm.com/chat, but it relies on synthetic data from modern LLMs.
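A minimal sketch of the surprisal probe, assuming the Hugging Face repo id from the post and standard transformers APIs; the two probe texts are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "talkie-lm/talkie-1930-13b-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def surprisal(text: str) -> float:
    """Mean per-token negative log-likelihood in nats; higher = more surprising."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels set, the model returns mean cross-entropy over tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item()

# Hypothetical probe: does a post-1930 event read as stranger than a pre-1931 one?
print(surprisal("In 1945, an atomic bomb was dropped on the city of Hiroshima."))
print(surprisal("In 1920, the League of Nations held its first council meeting."))
```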
Download and run locally to experiment: this avoids API costs and enables custom fine-tunes on historical corpora. The team (Nick Levine, David Duvenaud, and Alec Radford of GPT fame) plans to release the corpus and scripts soon.
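A short sketch for pulling the weights locally with huggingface_hub (repo ids from the post; the local paths are illustrative):

```python
from huggingface_hub import snapshot_download

# Fetch both checkpoints into local folders (the base model alone is 53.1 GB).
base_dir = snapshot_download("talkie-lm/talkie-1930-13b-base",
                             local_dir="models/talkie-1930-13b-base")
it_dir = snapshot_download("talkie-lm/talkie-1930-13b-it",
                           local_dir="models/talkie-1930-13b-it")
print(base_dir, it_dir)
```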
Fine-Tuning Tradeoffs: Purity vs. Usability
Extract instruction pairs from structured pre-1931 sources (etiquette manuals, cookbooks, dictionaries) for an initial round of SFT in chat format. Boost with synthetic prompts for summarization, information requests, and multi-turn chats; then run DPO using Claude Sonnet 4.6 as judge, followed by a final SFT pass on rejection-sampled chats pitting Claude Opus 4.6 against talkie. This introduces anachronisms (the 7B version produced listicles, for example), but the base model stays pure. The remaining challenge is scrubbing post-1930 contamination from the corpus. A demo test ('SVG of pelican on bicycle') yields an era-appropriate hallucination: 'generated in 1860... pelicans fishing on horseback.'
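The extraction step might look like this sketch, which turns cookbook entries into chat-format SFT pairs; the post names the source genres but not the schema, so the entry fields and question template here are assumptions:

```python
import json

# Hypothetical parsed entries from an out-of-copyright cookbook (assumed schema).
cookbook_entries = [
    {"dish": "Welsh rarebit",
     "recipe": "Melt the cheese over a slow fire, add ale, and pour over toast."},
]

def to_chat_pair(entry: dict) -> dict:
    """Wrap one entry as an instruction pair in the common messages format."""
    return {
        "messages": [
            {"role": "user", "content": f"How does one prepare {entry['dish']}?"},
            {"role": "assistant", "content": entry["recipe"]},
        ]
    }

with open("sft_pairs.jsonl", "w") as f:
    for entry in cookbook_entries:
        f.write(json.dumps(to_chat_pair(entry)) + "\n")
```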
Compare with Mr. Chatterbox, a similar vintage project that also leans on modern LLMs for chat data. One solution at scale: use vintage base models themselves as DPO judges, giving a fully bootstrapped pipeline that eliminates modern influence.
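The post does not say how a base model would judge; one plausible mechanism is to prefer the candidate response the vintage model assigns higher conditional log-likelihood, sketched here with the repo id from the post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "talkie-lm/talkie-1930-13b-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def response_logprob(prompt: str, response: str) -> float:
    """Sum of log-probs of the response tokens, conditioned on the prompt.
    The prompt/response token boundary is approximate under BPE merging."""
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t-1 of the logits predicts token t of the input.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # index of the first response token in `targets`
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

def judge(prompt: str, a: str, b: str) -> str:
    """Pick the DPO 'chosen' side by model likelihood."""
    return "a" if response_logprob(prompt, a) > response_logprob(prompt, b) else "b"
```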
Why Build Vegan Models
Out-of-copyright training sidesteps licensing risk and enables reproducible historical baselines. Test whether era-limited models hallucinate future knowledge or generate valid novelties. The tradeoff: weaker instruction-following without modern help, but a purer artifact for research. Run the base model for next-token prediction on 1920s-style prose; fine-tune the IT version for chat without post-1930 leaks.
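A minimal sketch of that base-model run, using the transformers text-generation pipeline (repo id from the post; the 1920s-style prompt is illustrative):

```python
from transformers import pipeline

# Plain next-token prediction: continue period-appropriate prose.
generator = pipeline("text-generation", model="talkie-lm/talkie-1930-13b-base")
out = generator(
    "The aeroplane descended through the fog above the aerodrome, and",
    max_new_tokens=60,
    do_sample=True,
    temperature=0.8,
)
print(out[0]["generated_text"])
```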