ARC-AGI-3 Leaderboard: Prioritizing Cost-Efficient AI Adaptation
ARC-AGI-3 evaluates how well AI agents adapt on the fly to novel environments. Its leaderboard plots cost per task against performance and groups entries into base LLMs, scalable reasoning systems, and $50-budget Kaggle submissions, all capped at $10,000 total compute.
Efficiency Defines Intelligence on ARC-AGI-3
ARC-AGI-3 moves beyond the passive fluid-intelligence tests of ARC-AGI-1 and 2 by requiring AI agents to adapt interactively to novel environments. The core metric is a scatter plot of cost per task against performance, reflecting the view that genuine intelligence means solving problems with minimal resources. Reasoning systems trace trend lines on which performance asymptotes as thinking time grows, showing diminishing returns beyond certain compute thresholds. Base LLMs like GPT-4.5 and Claude 3.7 deliver single-shot results without extended reasoning, exposing their raw limits. Kaggle systems, constrained to $50 of compute across 120 evaluation tasks, favor purpose-built efficiency over brute force.
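To make the core metric concrete, here is a minimal sketch of how cost per task and performance could be derived for each leaderboard point. The `Entry` class, its field names, and the sample numbers are illustrative assumptions, not the actual leaderboard schema; only the 120-task count comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    category: str      # "base_llm", "reasoning", or "kaggle" (hypothetical labels)
    total_cost: float  # USD spent across the full evaluation run
    tasks: int         # evaluation tasks attempted
    solved: int        # tasks solved

    @property
    def cost_per_task(self) -> float:
        # x-axis of the scatter plot: total spend spread over the task set
        return self.total_cost / self.tasks

    @property
    def performance(self) -> float:
        # y-axis of the scatter plot: fraction of tasks solved
        return self.solved / self.tasks

# Hypothetical entries, one per solution category
entries = [
    Entry("base-model", "base_llm", 240.0, 120, 24),
    Entry("reasoning-low", "reasoning", 600.0, 120, 48),
    Entry("kaggle-entry", "kaggle", 50.0, 120, 36),
]

for e in entries:
    print(f"{e.name}: ${e.cost_per_task:.2f}/task, {e.performance:.0%} solved")
```

Plotting `cost_per_task` against `performance` for many such entries, with connected points for one model at increasing reasoning levels, reproduces the leaderboard view described above.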
Interpreting Solution Categories for Practical Insights
Connected points on a reasoning trend line track a single model's performance across reasoning levels, helping you predict the gains from extended compute: expect plateaus, not linear scaling. Base-LLM points benchmark off-the-shelf inference, useful as quick baselines but rarely competitive without enhancements. Kaggle entries represent real-world optimization under tight budgets, showing how to engineer lean solutions that transfer to production constraints. Only systems costing under $10,000 for the full run qualify, filtering for viable approaches; incomplete outputs default to incorrect, enforcing full-task reliability.
Verification Rules to Trust Results
Preview scores are unofficial and come from partial test runs, such as ARC-AGI-2 estimates priced at o1-pro rates or provisional Gemini 3 Pro costs pending retest. This setup rewards systems that balance accuracy with economy, steering builders toward adaptive agents rather than resource hogs on the path to deployable AGI progress.