Caveman Prompts Cut Claude Tokens and May Boost Accuracy
Forcing Claude Code into concise 'caveman' outputs saves a realistic 4-5% of tokens in a typical 100k-token session, and may improve accuracy by curbing verbose over-elaboration, an effect documented in a study of 31 LLMs across 1,500 problems.
Token Savings: Realistic 4-5% Per Session, Not 75%
Caveman (github.com/JuliusBrussee/caveman) trims Claude Code's prose responses to caveman-style brevity ('why say many word when few word do trick') without altering reasoning, code generation, or tool calls. The repo's benchmarks claim 75% fewer output tokens on explanations (e.g., 87% saved explaining a React render bug) and 45% on compressed memory files like claw.md. But those figures apply only to prose, one slice of output, and to system prompts, one slice of input.
In a typical 100k-token session (75k input, 25k output), prose accounts for roughly 6k output tokens; caveman cuts that to about 2k, saving 4k, or 4% of the session. Compressing the system prompt and memory files saves roughly another 5k on the input side, or 5%. Depending on the session mix, that nets 4-5% per session and 5-10% weekly: modest, but valuable for token-conscious users, scaling to thousands of tokens saved without changing core Claude behavior. Error messages and code stay verbatim.
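A quick back-of-the-envelope script makes the arithmetic concrete. This is a sketch under the article's own assumptions; the session split and prose share are illustrative estimates, not measurements:

```python
# Back-of-the-envelope token savings for one Claude Code session.
# All figures are the article's illustrative estimates, not measurements.

SESSION_INPUT = 75_000        # input tokens per session
SESSION_OUTPUT = 25_000       # output tokens per session
SESSION_TOTAL = SESSION_INPUT + SESSION_OUTPUT

PROSE_OUTPUT = 6_000          # prose portion of output (code/tools excluded)
PROSE_AFTER_CAVEMAN = 2_000   # ~75% prose reduction claimed by the repo

prose_saved = PROSE_OUTPUT - PROSE_AFTER_CAVEMAN   # 4,000 tokens
input_saved = 5_000           # compressed system prompt / memory files

print(f"Output-side savings: {prose_saved / SESSION_TOTAL:.0%}")  # -> 4%
print(f"Input-side savings:  {input_saved / SESSION_TOTAL:.0%}")  # -> 5%
```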
Brevity Reverses LLM Performance: Larger Models Gain 26 Points
A March study ('Brevity Constraints, Reverse Performance Hierarchies, and Language Models,' arxiv.org/abs/2604.00025) tested 31 open-weight models on 1,500 problems. On roughly 8% of them, larger models (up to 400B parameters) underperformed far smaller ones (down to 2B) by 28 percentage points. The authors blame 'spontaneous scale-dependent verbosity': over-elaboration that buries correct reasoning, i.e., 'overthinking.'
Constraining outputs to be brief boosted the large models by 26 points, closing two-thirds of the gap and flipping the hierarchy so that large models beat small ones again. Smaller models barely changed. The proposed root cause: RLHF rewards the verbose, 'thorough' responses humans prefer, and in complex reasoning that verbosity accumulates errors. Brevity forces models to 'get out of their own way': internal reasoning is preserved, but the final answer is concise, directly mirroring caveman's output-only tweaks.
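The same intervention is easy to reproduce outside Claude Code. Here is a minimal sketch using the anthropic Python SDK; the brevity system prompt and model choice are illustrative, not the paper's or the repo's exact constraint:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative brevity constraint in the spirit of the study: a concise
# final answer, while code and error messages stay verbatim.
BREVITY_SYSTEM = (
    "Be concise. No preamble, no recap, no filler. Use as few words as "
    "possible. Quote code and error messages verbatim."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=BREVITY_SYSTEM,
    messages=[{"role": "user",
               "content": "Why does this React component render twice in dev?"}],
)
print(response.content[0].text)
```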
Frontier models like Claude 3.5 Sonnet may show milder effects, but the pattern should hold: verbosity hurts at scale. On straightforward tasks, where the study's gaps were largest, caveman could therefore improve code and debugging outputs beyond the token savings.
Implement Caveman: One-Line Install, No Reported Downsides
Install it with a single command as a Claude Code 'skill.' Invoke it with /caveman, 'caveman mode,' or 'less tokens'; 'ultra caveman' gives extreme brevity, while 'light' dials it back. The skill applies selectively, preserving code and tool calls.
Even without the repo, you can add the core instruction to claw.md: 'Be concise, no filler, straight to the point, use fewer words.' A sample snippet follows. Test it on explanations and debugging to verify the token and accuracy wins for yourself. No downsides have been reported, and the meme origins (5k GitHub stars in 72 hours) belie a science-backed tweak for production Claude workflows.
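A minimal claw.md addition might look like the block below. The section header and the last two bullets are illustrative extensions of the article's one-line instruction, not part of the repo:

```markdown
## Output style
- Be concise, no filler, straight to the point, use fewer words.
- Keep code, file paths, and error messages verbatim.
- Skip preambles and recaps unless asked.
```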