Vending-Bench 2 Tests AI Long-Term Business Coherence
Top models like Claude Opus 4.6 and Sonnet 4.6 reach $7k+ after simulating a year running a vending machine, but fall short of $63k human baseline due to lapses in negotiation, supplier vetting, and sustained strategy.
Benchmark Design Measures Sustained Profitability
Vending-Bench 2 simulates agents managing a vending machine business for a year starting with $500, paying a $2 daily fee (bankruptcy after 10 missed days), sourcing products via internet search and email negotiation, stocking storage and machine, and generating sales influenced by day, season, weather, and price. Success hinges on final bank balance after 3000-6000 messages and 60-100M output tokens per run. Improvements over Vending-Bench 1 add adversarial suppliers (unreasonable prices, bait-and-switch), negotiation needs, delivery delays, supplier failures, customer refunds, streamlined scoring on balance, and planning tools like notes/reminders. System prompt emphasizes profit maximization, irreversible payments, token costs ($100/M output), context limits (69k tokens, auto-trim), and full agency without user input.
Current leaderboard (avg 5 runs): Claude Sonnet 4.6 ($7,204), GPT-5.4 ($6,144), GPT-5.3-Codex ($5,940), GLM-5.1 New ($5,634), Gemini 3 Pro ($5,478). Top models sustain consistent tool use without degradation and secure low supplier prices via negotiation or switching. Arena variant adds multi-agent competition at one location, enabling price wars, collaboration, or trading, with individual scoring.
Model Strengths: Negotiation and Supplier Selection
High performers like Gemini 3 Pro persistently negotiate (e.g., countering $1.50/can soda quotes with demands for $0.50-0.60 wholesale) and favor honest suppliers, spending heavily there despite initial higher quotes yielding better post-negotiation deals. Models generally detect adversarial suppliers (2 honest, 2 adversarial types) and pivot effectively. Consistent planning—checking inventory, balancing margins (e.g., vending at $2.50-3.00 vs. $2.40 wholesale for $0.10-0.60 profit minus fees)—sustains operations.
Weaknesses include over-trust: GPT-5.1 pays pre-spec ($439 order at inflated $2.40/can, $6 energy drinks), ignores thin margins, and risks bankruptcy from supplier failures. It accepts poor deals without shopping alternatives, unlike leaders who explore parallels.
Progress Trends and Remaining Headroom
Frontier models improve linearly (+$693/month, R²=0.97), with Chinese models faster (+$1,004/month, R²=0.95) but lagging 141 days, projecting crossover Feb 2027. Score correlates inversely with API run cost. No saturation: theoretical ceiling unlimited via high-value items (e.g., $500 tungsten cubes), zero-cost negotiation (jailbreak suppliers), and gaming sales equations (per original paper). Conservative "good" human baseline: $206/day ($63k/year) by selecting top items (Doritos family-size), halving prices, optimizing via 60-day sales analysis—10x current best, exposing gaps in analysis, strategy, and coherence over long horizons.