GPTNT: A Real-Time Collaborative Benchmark for AI Agents

The Need for Dynamic Collaborative Benchmarks

Most current AI benchmarks evaluate models in static or turn-based environments, which fail to capture the complexities of real-world collaboration. The GPTNT benchmark addresses this gap by using the cooperative game Keep Talking and Nobody Explodes, in which two agents must work together to defuse procedurally generated bombs under a strict countdown.

Collaboration is enforced through information asymmetry: one agent (the 'Defuser') can see and interact with the bomb but lacks instructions, while the other (the 'Expert') has the manual but cannot see the bomb. Success is impossible without efficient, real-time communication and coordination, forcing agents to move beyond simple prompt-response patterns.
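The asymmetry described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the class names, the message format, and the wire rule are invented for illustration and are neither the game's actual rules nor the benchmark's interface. The point is only that the Defuser holds the observation, the Expert holds the rule, and neither can finish the task without a message from the other.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToyWireModule:
    """Hypothetical stand-in for one bomb module (not the real game's rules)."""
    wires: Tuple[str, ...]  # wire colours, top to bottom


def toy_manual_rule(num_wires: int, last_colour: str) -> int:
    """Invented rule: cut the first wire if the last wire is red, else the last wire."""
    return 0 if last_colour == "red" else num_wires - 1


def is_correct_cut(module: ToyWireModule, index: int) -> bool:
    """Environment-side check of an action against the toy rule."""
    return index == toy_manual_rule(len(module.wires), module.wires[-1])


class Defuser:
    """Sees the module and can act on it, but has no access to the manual."""

    def __init__(self, module: ToyWireModule) -> None:
        self._module = module

    def describe(self) -> Dict[str, object]:
        # This message is the Expert's only window onto the bomb.
        return {"num_wires": len(self._module.wires),
                "last_colour": self._module.wires[-1]}

    def cut(self, index: int) -> bool:
        return is_correct_cut(self._module, index)


class Expert:
    """Holds the manual, but never receives the module itself."""

    def advise(self, description: Dict[str, object]) -> int:
        return toy_manual_rule(int(description["num_wires"]),
                               str(description["last_colour"]))


def play_round(module: ToyWireModule) -> bool:
    defuser, expert = Defuser(module), Expert()
    message = defuser.describe()          # Defuser -> Expert
    instruction = expert.advise(message)  # Expert -> Defuser
    return defuser.cut(instruction)
```

In this toy version the exchange is a single lossless round trip. The benchmark's difficulty comes from what the sketch leaves out: descriptions in natural language, many modules, and a running countdown.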

Identifying Failure Modes in State-of-the-Art Models

Tests across both open- and closed-source models showed that current state-of-the-art systems struggle significantly: none defused a single bomb in real time, a task human players perform consistently. Controlled experiments identified four primary failure points:

State Tracking: Models fail to maintain an accurate, evolving mental model of the bomb's status as inputs change.
Time Pressure: The urgency of the countdown causes performance degradation, as models struggle to prioritize actions efficiently.
Ambiguity Handling: Agents fail to resolve unclear instructions or visual inputs, leading to stalled progress.
Error Recovery: When a mistake occurs, agents lack the capability to backtrack or correct their trajectory, often compounding errors until the timer expires.

A Living Benchmark for Evolving Models

Unlike static datasets that can be memorized, GPTNT relies on the game's procedural generation, so each run presents a freshly generated bomb. Because it runs on the actual game engine, the benchmark can evolve alongside the modding community, avoiding the 'solved-benchmark' trap where models simply overfit to a fixed set of questions. By withholding the manual or the partner, researchers can isolate whether a model's failure is due to a lack of knowledge or an inability to collaborate in the moment.
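One way to picture the ablation logic in the last sentence is the sketch below. It is an assumption-laden illustration, not the benchmark's actual protocol: the condition names, the reading of "no partner" as a single agent that receives both the bomb view and the manual, the 0.5 threshold, and the attribution rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Condition:
    """Hypothetical description of one evaluation setting."""
    name: str
    has_manual: bool
    has_partner: bool


# Assumed settings; "no_partner" is read here as one agent holding both
# the bomb view and the manual, so no communication is needed.
CONDITIONS = (
    Condition("full", has_manual=True, has_partner=True),
    Condition("no_manual", has_manual=False, has_partner=True),
    Condition("no_partner", has_manual=True, has_partner=False),
)


def attribute_failure(success_rate: Dict[str, float], threshold: float = 0.5) -> str:
    """Toy attribution rule comparing the 'full' and 'no_partner' settings.

    The threshold is arbitrary, and interpreting the 'no_manual' setting
    is deliberately left out of this sketch.
    """
    if success_rate["full"] >= threshold:
        return "none"
    if success_rate["no_partner"] >= threshold:
        # Solvable alone with all the information, but not as a pair.
        return "collaboration"
    return "knowledge_or_execution"
```

Under these assumptions, a model that succeeds alone with full information but fails when paired would be flagged as having a collaboration problem, while a model that fails in both settings would point to knowledge or execution.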
