Recent Frontier Models Scale Successfully with Massive Inference Budgets
Newer LLMs (post-November 2025) productively use 10-50x larger inference budgets (50M tokens for AISI's evals, 1,000 turns for Irregular's), reaching higher cumulative success rates on private cyber tasks. Older models plateau quickly, hampered by poor state tracking and planning, and gain little from extra compute; newer ones show continued gains across orders of magnitude of budget (log-scale axes), with roughly 8% of AISI's harder tasks solved only after scaling from 10M to 50M tokens. Accurate performance estimates therefore require long runs: tens of millions of tokens per attempt, repeated multiple times. Average costs remain manageable: about $10 per run (max $60) at 50M tokens, and $1-$20 for most challenges (up to $100 for the hardest) at 1,000 turns. Compaction tools preserve history within fixed context windows, enabling effective use of extended budgets.
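To illustrate how such cumulative rates are computed, here is a minimal sketch in Python; the attempt data, token counts, and the helper name `cumulative_success_rate` are hypothetical, not taken from the evals themselves.

```python
# Minimal sketch (illustrative data, not from the report) of how
# cumulative success rate grows with the token budget: an attempt
# counts as a success at budget B only if it solved the task using
# at most B tokens.
attempts = [
    (1.8e6, False), (4.5e6, True), (12e6, False),
    (30e6, True), (48e6, True), (9e6, False),
]  # (tokens consumed, solved?)

def cumulative_success_rate(attempts, budget):
    """Fraction of attempts that solved the task within `budget` tokens."""
    hits = sum(1 for tokens, solved in attempts if solved and tokens <= budget)
    return hits / len(attempts)

# Evaluate at log-spaced budgets, mirroring the report's log-scale axes.
for budget in (2e6, 10e6, 50e6):
    rate = cumulative_success_rate(attempts, budget)
    print(f"budget {budget/1e6:>4.0f}M tokens -> success rate {rate:.0%}")
```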
Low-Budget Evaluations Underestimate True Capability Horizons
Standard evaluation constraints (tokens, turns, time, spend) implicitly assume that extra budget yields no further gains. That assumption breaks for cyber tasks, so low budgets underestimate model horizons: the human-time difficulty at which success drops below 50%. Doubling the token budget yields a consistent absolute increase in success rate (log-linear scaling), though hard tasks still demand exponentially more compute for each marginal gain. A task solved 5% of the time at 2M tokens can reach 30% at 50M, potentially crossing a risk threshold. Prior horizon estimates made at low budgets are therefore too conservative, misinforming developers, researchers, and policymakers about risks from apprentice-level through expert-level cyber capabilities (e.g., 10 years' experience).
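A hedged worked example of the log-linear extrapolation: the 5%-at-2M and 30%-at-50M anchors come from the text above, while the fixed gain per doubling and the extrapolated budgets are illustrative assumptions, not fitted results.

```python
import math

# Log-linear scaling sketch: success rate gains a fixed number of
# percentage points per doubling of the token budget. Anchor points
# are from the text; everything else is an illustrative assumption.
p_low, b_low = 0.05, 2e6      # 5% success at 2M tokens
p_high, b_high = 0.30, 50e6   # 30% success at 50M tokens

doublings = math.log2(b_high / b_low)   # ~4.6 doublings from 2M to 50M
slope = (p_high - p_low) / doublings    # ~5.4 points gained per doubling
print(f"{slope:.1%} absolute gain per doubling of tokens")

def predicted_success(budget):
    """Extrapolate the success rate at `budget` tokens, clamped to [0, 1]."""
    p = p_low + slope * math.log2(budget / b_low)
    return min(max(p, 0.0), 1.0)

# Where might this trend cross a hypothetical 50% risk threshold?
for budget in (100e6, 200e6, 400e6):
    print(f"{budget/1e6:.0f}M tokens -> predicted {predicted_success(budget):.0%}")
```

Note the clamp: a naive log-linear fit eventually exceeds 100%, so any extrapolation like this is only a rough guide to where a threshold might be crossed, not a prediction of the ceiling.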
Optimize Budgets Using Cost-Per-Success Curves and Increase Transparency
Plot cost-per-success (total cost of all attempts divided by the number of successes) against budget to find the 'dip' that minimizes spend: the curve is high at low budgets (few successes), falls as viable solution paths emerge, then rises again as unproductive attempts run long. Generate model-specific curves over a task set to avoid over- or under-estimating costs. Report all limits (tokens, turns, costs, time) consistently so readers can distinguish low capability from a constrained setup. Future needs: environments stable over long runs, tooling to check budget sufficiency, and extrapolation from scaling curves to predict performance ceilings cheaply. Generalization beyond cyber remains to be tested; METR results suggest limits in some software tasks.
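The formula above (total attempt costs divided by successes) translates directly into a short sketch for locating the dip; the attempt data, the per-token price, and the assumption that failures burn the full budget are all illustrative.

```python
# Cost-per-success sketch. Each attempt either solves the task after
# some number of tokens or never solves it. At a candidate budget B:
#   - a solving attempt with solve_tokens <= B costs its solve_tokens,
#   - every other attempt burns the full budget B.
# Cost-per-success = total spend / number of successes.
COST_PER_TOKEN = 2e-7  # hypothetical $0.20 per million tokens

# Tokens at which each attempt solved the task, or None if it never did.
solve_tokens = [8e6, 12e6, 18e6, 40e6, None, None, None, None]

def cost_per_success(budget):
    spend, successes = 0.0, 0
    for t in solve_tokens:
        if t is not None and t <= budget:
            spend += t * COST_PER_TOKEN
            successes += 1
        else:
            spend += budget * COST_PER_TOKEN
    return spend / successes if successes else float("inf")

budgets = (5e6, 10e6, 20e6, 50e6, 100e6)
curve = {b: cost_per_success(b) for b in budgets}
best = min(curve, key=curve.get)  # the 'dip' that minimizes $ per success
for b, c in curve.items():
    marker = "  <- dip" if b == best else ""
    print(f"budget {b/1e6:>5.0f}M: ${c:,.2f} per success{marker}")
```

On this toy data the dip sits at 20M tokens; real curves are model- and task-set-specific, which is why the recommendation is to generate them per model rather than reuse one budget across models.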