The Fallacy of Curated Safety Benchmarks

Current methodologies for evaluating AI agent safety often rely on 'attack selection': choosing a specific, limited set of adversarial prompts or scenarios to test model robustness. This research demonstrates that the approach is fundamentally flawed. Narrowing testing to a pre-selected 'menu' of attacks creates a blind spot: a model can appear safe under controlled conditions yet remain highly vulnerable to novel or unselected adversarial inputs.

The Impact of Selection Bias on Risk Assessment

When safety evaluations are restricted to known or 'representative' attack vectors, the resulting metrics fail to capture the true distribution of potential failures. The authors argue that this selection bias leads to an overestimation of model robustness. Because agentic AI systems operate in complex, multi-step environments, their failure modes are often non-linear (a small change in input can produce a disproportionately large change in behavior) and context-dependent. A model that successfully defends against a standard set of jailbreak attempts may still fail catastrophically when faced with slight variations in prompt structure, or with task-specific adversarial goals that were excluded from the evaluation set.
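The overestimation described above can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's method: the attack names, the `known` flag, and every outcome are hypothetical values written by hand for illustration. It compares the attack success rate measured on a curated subset with the rate over the full set of attacks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    name: str
    known: bool      # True if the attack is part of the curated benchmark
    succeeds: bool   # True if the attack breaks the (hypothetical) agent


def attack_success_rate(attacks):
    """Return the fraction of attacks that succeed against the agent."""
    attacks = list(attacks)
    if not attacks:
        raise ValueError("need at least one attack")
    return sum(a.succeeds for a in attacks) / len(attacks)


# Hypothetical outcomes chosen for illustration; these are NOT results
# from the paper. The agent is assumed to be hardened mostly against
# the attacks its developers already know about.
POPULATION = [
    Attack("classic_jailbreak", known=True, succeeds=False),
    Attack("role_play_override", known=True, succeeds=False),
    Attack("prompt_injection_v1", known=True, succeeds=True),
    Attack("encoded_payload", known=True, succeeds=False),
    Attack("reworded_jailbreak", known=False, succeeds=True),
    Attack("tool_output_injection", known=False, succeeds=True),
    Attack("multi_step_goal_hijack", known=False, succeeds=True),
    Attack("benign_looking_subtask", known=False, succeeds=False),
]

curated = [a for a in POPULATION if a.known]

curated_rate = attack_success_rate(curated)    # 1 of 4 curated attacks succeed
full_rate = attack_success_rate(POPULATION)    # 4 of 8 attacks succeed overall
gap = full_rate - curated_rate                 # risk hidden by the curated set

print(f"curated benchmark: {curated_rate:.2f}")
print(f"full population:   {full_rate:.2f}")
print(f"hidden risk gap:   {gap:.2f}")
```

In this made-up example the curated benchmark reports a 25% attack success rate while the full set yields 50%. The gap comes entirely from which attacks were selected, not from any change in the agent.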

Toward Holistic Adversarial Testing

To meaningfully improve safety, the paper advocates a shift away from static, curated benchmarks toward dynamic, automated adversarial testing. A fixed set of attacks is insufficient because it treats safety as a static property rather than a moving target. The findings suggest that developers should build red-teaming pipelines that prioritize diversity and unpredictability in attack generation. Moving beyond curated selection helps teams identify the 'long tail' of risks that standard safety benchmarks currently mask, making agents more resilient against a broader spectrum of adversarial intent.
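One way to picture diversity and unpredictability in attack generation is to sample from a combinatorial attack space with a fresh seed on each round, instead of replaying a fixed list. The sketch below is an assumption-laden illustration, not the pipeline the paper proposes: the goal, framing, and mutation labels are hypothetical, and `stub_agent_is_compromised` is a placeholder rule standing in for a real agent run.

```python
import itertools
import random

# Hypothetical attack dimensions, invented for this illustration.
GOALS = ["exfiltrate_file", "skip_confirmation", "escalate_permissions"]
FRAMINGS = ["direct", "role_play", "embedded_in_tool_output"]
MUTATIONS = ["none", "paraphrase", "base64", "split_across_steps"]


def attack_space():
    """Enumerate every (goal, framing, mutation) combination."""
    return list(itertools.product(GOALS, FRAMINGS, MUTATIONS))


def sample_attacks(n, seed):
    """Draw n distinct attacks; a different seed gives a different draw."""
    rng = random.Random(seed)
    space = attack_space()
    return rng.sample(space, min(n, len(space)))


def stub_agent_is_compromised(attack):
    """Placeholder for executing a real agent against the attack.

    Assumed rule: the stub only fails on mutated attacks that arrive
    inside tool output, a combination a curated list could easily omit.
    """
    _goal, framing, mutation = attack
    return framing == "embedded_in_tool_output" and mutation != "none"


def red_team_round(n, seed):
    """Run one sampling round and return (attacks tried, attacks that worked)."""
    attacks = sample_attacks(n, seed)
    failures = [a for a in attacks if stub_agent_is_compromised(a)]
    return attacks, failures


def cumulative_coverage(rounds, per_round):
    """Fraction of the attack space exercised after several seeded rounds."""
    seen = set()
    for seed in range(rounds):
        attacks, _ = red_team_round(per_round, seed)
        seen.update(attacks)
    return len(seen) / len(attack_space())


if __name__ == "__main__":
    for rounds in (1, 3, 6):
        print(rounds, "rounds ->", round(cumulative_coverage(rounds, 10), 2))
```

A fixed benchmark corresponds to calling `sample_attacks` once and reusing the result forever, so its coverage never grows. Varying the seed per round lets coverage accumulate over time, which is the property the long-tail argument depends on. A real pipeline would likely generate attacks adaptively rather than sample a small enumerable space.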