The Need for Dynamic Collaborative Benchmarks
Most current AI benchmarks evaluate models in static or turn-based environments and so fail to capture the complexities of real-world collaboration. The GPTNT benchmark addresses this gap by using the cooperative game Keep Talking and Nobody Explodes. In this environment, two agents must work together to defuse procedurally generated bombs under a strict countdown.
Collaboration is enforced through information asymmetry: one agent (the 'Defuser') can see and interact with the bomb but lacks instructions, while the other (the 'Expert') has the manual but cannot see the bomb. Success is impossible without efficient, real-time communication and coordination, forcing agents to move beyond simple prompt-response patterns.
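The asymmetry described above can be illustrated with a toy sketch. It is not the benchmark's implementation: the `Bomb` class, the `MANUAL` rule table, and every function name here are hypothetical, and a single wire-cutting rule stands in for the game's real modules. The point is only that each role builds its message from its own restricted view, and that every exchange costs time.

```python
from dataclasses import dataclass

# Hypothetical manual: maps the number of wires to the index of the wire to cut.
MANUAL = {3: 1, 4: 0}

@dataclass
class Bomb:
    wires: list          # wire colours, visible only to the Defuser
    correct_wire: int    # index of the wire that defuses the bomb
    seconds_left: int = 5
    defused: bool = False

def defuser_view(bomb):
    """The Defuser sees the bomb but has no manual."""
    return {"wires": list(bomb.wires), "seconds_left": bomb.seconds_left}

def expert_view(manual):
    """The Expert sees the manual but has no view of the bomb."""
    return {"manual": dict(manual)}

def defuser_describe(view):
    """Defuser -> Expert message, built only from the Defuser's view."""
    return {"wire_count": len(view["wires"])}

def expert_instruct(view, message):
    """Expert -> Defuser reply, built only from the manual and the message."""
    return view["manual"].get(message["wire_count"])

def run_episode(bomb, manual):
    """Alternate messages until the bomb is defused or the timer runs out."""
    while bomb.seconds_left > 0 and not bomb.defused:
        message = defuser_describe(defuser_view(bomb))
        wire_to_cut = expert_instruct(expert_view(manual), message)
        bomb.seconds_left -= 1  # every exchange costs time
        if wire_to_cut == bomb.correct_wire:
            bomb.defused = True
    return bomb.defused
```

Calling `run_episode(Bomb(wires=["red", "blue", "white"], correct_wire=1), MANUAL)` succeeds in this toy setup, while passing an empty manual lets the timer run out, since neither role can solve the task from its own view alone.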
Identifying Failure Modes in State-of-the-Art Models
Tests across both open- and closed-source models showed that current state-of-the-art systems struggle significantly: none defused a single bomb in real time, a task that human players perform consistently. Controlled experiments identified four primary failure points:
- State Tracking: Models fail to maintain an accurate, evolving mental model of the bomb's status as inputs change.
- Time Pressure: The urgency of the countdown causes performance degradation, as models struggle to prioritize actions efficiently.
- Ambiguity Handling: Agents fail to resolve unclear instructions or visual inputs, leading to stalled progress.
- Error Recovery: When a mistake occurs, agents lack the capability to backtrack or correct their trajectory, often compounding errors until the timer expires.
A Living Benchmark for Evolving Models
Unlike static datasets, which can be memorized, GPTNT leverages the game's procedural generation to ensure that test cases are always unique. Because it runs on the actual game engine, the benchmark can evolve alongside the modding community, avoiding the 'solved-benchmark' trap in which models simply overfit to a fixed set of questions. By withholding either the manual or the partner, researchers can isolate whether a model's failure stems from a lack of knowledge or from an inability to collaborate in the moment.
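One way to picture the combination of procedural generation and controlled ablations is the sketch below. It rests on several assumptions that do not come from the benchmark itself: the `Condition` names, the `generate_bomb` and `build_trials` helpers, and the wire-only bomb layout are all hypothetical. Seeding a local random generator stands in for the game engine's own procedural generation, so that the same bomb can be replayed under each condition.

```python
import random
from enum import Enum

COLOURS = ["red", "blue", "white", "yellow", "black"]

class Condition(Enum):
    FULL = "full"              # manual and partner both available
    NO_MANUAL = "no_manual"    # assumed: the manual is withheld
    NO_PARTNER = "no_partner"  # assumed: the model plays without a partner

def generate_bomb(seed):
    """Toy stand-in for procedural generation: one seed, one layout."""
    rng = random.Random(seed)
    wire_count = rng.randint(3, 6)
    wires = [rng.choice(COLOURS) for _ in range(wire_count)]
    return {"seed": seed, "wires": wires}

def resources_for(condition):
    """Which resources a run under the given condition may use."""
    return {
        "manual": condition is not Condition.NO_MANUAL,
        "partner": condition is not Condition.NO_PARTNER,
    }

def build_trials(seeds):
    """Pair every generated bomb with every condition."""
    return [
        {"bomb": generate_bomb(seed), "condition": condition,
         **resources_for(condition)}
        for seed in seeds
        for condition in Condition
    ]
```

Because each bomb is rebuilt from its seed, a failure under one condition can be compared against the same layout under the others, which is what lets a knowledge gap be separated from a collaboration gap.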