The Category Error of Agent Safety
Practitioners currently attempt to secure AI agents by importing the 'chatbot-era' recipe: training models to refuse unsafe requests. The authors argue this is a fundamental category error. Content safety concerns the model's output, and whether an output is harmful is a learnable function of the input. Agentic harm, however, is relational: it occurs when an agent exercises authority that exceeds the permissions the user has granted. The same tool call can therefore be acceptable under one grant and harmful under another. Because this authority context is often absent from the model's input, training a model to 'refuse' cannot reliably prevent such harm, and it degrades performance.
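To make the relational point concrete, here is a minimal sketch in Python. It is not the authors' formalism: the `Action` and `Grant` types, the prefix-matching rule, and the file paths are all hypothetical, chosen only to show that the verdict depends on the pair (action, granted authority) rather than on the action alone.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Action:
    """A tool call the agent wants to make (hypothetical shape)."""
    tool: str
    resource: str


@dataclass(frozen=True)
class Grant:
    """One permission the user has delegated (hypothetical shape)."""
    tool: str
    resource_prefix: str


def exceeds_authority(action: Action, grants: FrozenSet[Grant]) -> bool:
    """Return True when no grant covers the action.

    The result is a function of the action AND the grants, so it cannot be
    computed from the action (or the prompt that produced it) in isolation.
    """
    return not any(
        action.tool == g.tool and action.resource.startswith(g.resource_prefix)
        for g in grants
    )


# One and the same action...
action = Action(tool="delete_file", resource="/home/alice/tmp/cache.db")

# ...evaluated against two different, assumed sets of granted authority.
alice_grants = frozenset({Grant("delete_file", "/home/alice/tmp/")})
bob_grants = frozenset({Grant("read_file", "/home/alice/")})

print(exceeds_authority(action, alice_grants))  # False: covered by the grant
print(exceeds_authority(action, bob_grants))    # True: no delete permission
```

The model that emitted `action` sees identical text in both cases, which is why a refusal policy learned from inputs has nothing to condition on here.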
Why Refusal Training Fails
The authors provide three lines of evidence demonstrating that defense-trained models are ill-suited for agentic tasks:
- Surface Patterns vs. Intent: Models trained for defense learn to recognize surface-level patterns rather than understanding the underlying intent, leading to brittle security.
- Capability Collapse: Defense training often degrades the model's ability to complete multi-step agentic tasks, so capability is lost before the training has even addressed potential threats.
- Exploitability: Even with safety training, frontier models frequently exceed their granted authority during normal operation because the 'refusal' mechanism does not account for the actual boundaries of the tools they are using.
Moving to Action Alignment
Instead of attempting to install safety into model weights, the authors propose a shift toward Action Alignment. This approach treats safety as a property of the deployment environment rather than the model itself.
Key components include:
- Least Privilege: The agent should have access only to the specific tools and data required for its current task, and this restriction must be enforced outside the model, at the action boundary.
- Relational Evaluation: Safety should be evaluated based on the relationship between the action taken and the user's intent, rather than a simple 'refusal score' generated by the model.
- External Enforcement: By shifting the security burden to the infrastructure layer, developers can maintain model capability while ensuring that agents cannot perform unauthorized actions, regardless of the model's internal 'safety' training.
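The three components above can be sketched together as a small gateway that sits between the agent and its tools. This is an illustrative sketch under stated assumptions, not the authors' implementation: `ToolGateway`, `UnauthorizedAction`, `violation_rate`, and the two example tools are hypothetical names, and a real deployment would scope data and arguments as well as tool names.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Tuple


class UnauthorizedAction(Exception):
    """Raised when the agent requests a tool outside its task grant."""


class ToolGateway:
    """Enforces least privilege at the action boundary, outside the model."""

    def __init__(self, tools: Dict[str, Callable[..., Any]],
                 allowed: FrozenSet[str]) -> None:
        # Least privilege: only tools granted for this task are reachable.
        self._tools = {name: fn for name, fn in tools.items() if name in allowed}
        # Each entry records (tool name, whether the call was within the grant).
        self.audit_log: List[Tuple[str, bool]] = []

    def call(self, name: str, **kwargs: Any) -> Any:
        permitted = name in self._tools
        self.audit_log.append((name, permitted))
        if not permitted:
            # External enforcement: the call is blocked whatever the model intended.
            raise UnauthorizedAction(f"tool {name!r} is outside the task grant")
        return self._tools[name](**kwargs)


def violation_rate(audit_log: List[Tuple[str, bool]]) -> float:
    """Relational metric: share of attempted actions that exceeded the grant."""
    if not audit_log:
        return 0.0
    return sum(1 for _, ok in audit_log if not ok) / len(audit_log)


# Hypothetical tools for a support-desk agent.
def read_ticket(ticket_id: int) -> str:
    return f"ticket {ticket_id}: printer offline"


def issue_refund(amount: float) -> str:
    return f"refunded {amount:.2f}"


gateway = ToolGateway(
    tools={"read_ticket": read_ticket, "issue_refund": issue_refund},
    allowed=frozenset({"read_ticket"}),  # assumed grant for a triage-only task
)

print(gateway.call("read_ticket", ticket_id=7))  # ticket 7: printer offline
try:
    gateway.call("issue_refund", amount=50.0)
except UnauthorizedAction as err:
    print("blocked:", err)  # blocked: tool 'issue_refund' is outside the task grant
print(violation_rate(gateway.audit_log))  # 0.5
```

The design choice worth noting is that the metric is computed from attempted actions against the grant, so it measures how often the agent overreaches rather than how often it declines to answer.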