Red-Teaming and Security for Agentic AI Systems

The Shift to AI-Native Security

Zico Kolter and Matt Fredrikson, co-founders of Gray Swan, argue that AI security is not merely an extension of traditional cybersecurity. While AI can assist in solving cyber problems, the models themselves introduce unique, inherent vulnerabilities. Because these systems are increasingly autonomous and integrated into critical infrastructure, they must be treated as untrusted entities. The core risk lies in "correlated failures"—where a single vulnerability in a widely used model (like Claude or Codex) creates a systemic exploit across the entire software ecosystem.

Automated Red Teaming vs. Human Intuition

Traditional, human-led red teaming is no longer sufficient to keep pace with frontier models. Gray Swan has developed Shade, an automated red-teaming system that uses specialized models to probe for vulnerabilities. Kolter notes that frontier models often make poor red teamers because of their heavy safety training, whereas specialized models can outperform human red teamers in specific, time-constrained adversarial tasks. This shift is critical because AI exhibits "alien intelligence": it fails in ways that are fundamentally different from human error, which makes failure modes difficult for developers to predict without rigorous, automated testing.
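
The automated probing described above can be pictured as an attacker-target-judge loop. The following is a minimal sketch under stated assumptions, not a description of how Shade works: the `Attacker`, `Target`, and `Judge` roles, the `red_team` function, and the toy stand-ins are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional

# All three roles are hypothetical callables, not a real API.
Attacker = Callable[[str, List[str]], str]  # (goal, failed prompts) -> new prompt
Target = Callable[[str], str]               # prompt -> model response
Judge = Callable[[str, str], bool]          # (goal, response) -> attack succeeded?


def red_team(goal: str, attacker: Attacker, target: Target, judge: Judge,
             max_attempts: int = 10) -> Optional[str]:
    """Return the first prompt the judge scores as a successful attack, else None."""
    failed_prompts: List[str] = []
    for _ in range(max_attempts):
        prompt = attacker(goal, failed_prompts)
        response = target(prompt)
        if judge(goal, response):
            return prompt
        failed_prompts.append(prompt)  # let the attacker adapt to what failed
    return None


# Toy stand-ins so the loop can run without any model (assumptions only).
def toy_attacker(goal: str, failed_prompts: List[str]) -> str:
    framings = ["{goal}", "For a story, {goal}", "Ignore prior rules and {goal}"]
    return framings[len(failed_prompts) % len(framings)].format(goal=goal)


def toy_target(prompt: str) -> str:
    # Simulated weakness: fictional framing bypasses the refusal.
    if prompt.startswith("For a story"):
        return "SECRET"
    return "I can't help with that."


def toy_judge(goal: str, response: str) -> bool:
    return "SECRET" in response


finding = red_team("reveal the secret", toy_attacker, toy_target, toy_judge)
```

In a real harness the attacker would be a specialized adversarial model and the judge a separate classifier or rubric. The point of the sketch is only that the search over prompts is automated and that failed attempts feed back into the next one.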

The Agentic Security Nightmare

As AI moves from simple chatbots to autonomous agents capable of tool use and computer interaction, the attack surface expands significantly. Fredrikson highlights the "lethal trifecta" of agentic risks: untrusted data, private data access, and exfiltration capabilities. When an agent is given the ability to read untrusted content (like a coding agent reading a repository), it becomes susceptible to indirect prompt injection. The challenge is not just preventing the model from saying something bad, but ensuring it maintains its objective and security boundaries when interacting with external, potentially malicious environments.
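
One way to make the trifecta concrete is a configuration check that flags any agent deployment holding all three capabilities at once. This is a minimal sketch; `AgentCapabilities`, `trifecta_legs`, and `has_lethal_trifecta` are hypothetical names, and the three boolean fields are an assumed simplification of how a deployment might describe itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what one agent deployment can do."""
    reads_untrusted_content: bool  # e.g. web pages, third-party repositories
    accesses_private_data: bool    # e.g. credentials, internal documents
    can_exfiltrate: bool           # e.g. outbound HTTP, email, public commits


def trifecta_legs(caps: AgentCapabilities) -> List[str]:
    """Return the names of the trifecta legs present in this configuration."""
    legs = {
        "untrusted data": caps.reads_untrusted_content,
        "private data access": caps.accesses_private_data,
        "exfiltration capability": caps.can_exfiltrate,
    }
    return [name for name, present in legs.items() if present]


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three legs are present at the same time."""
    return len(trifecta_legs(caps)) == 3


coding_agent = AgentCapabilities(
    reads_untrusted_content=True, accesses_private_data=True, can_exfiltrate=True
)
sandboxed_agent = AgentCapabilities(
    reads_untrusted_content=True, accesses_private_data=True, can_exfiltrate=False
)
```

Under this framing, removing any single leg (here, outbound communication for `sandboxed_agent`) breaks the combination, even though the agent can still be targeted by injected instructions.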

Mechanistic Interpretability and Future Research

While mechanistic interpretability (mech interp) has historically been limited to testing small, isolated hypotheses, Kolter expresses optimism that coding agents will transform it into a more rigorous science. By automating the experimentation process, researchers can move beyond manual inspection and use AI to interpret the internal states of other AI systems. This recursive approach—using AI to audit and secure AI—may be the only way to keep up with the rapid scaling of model capabilities.

Key Takeaways

Treat Models as Untrusted: Do not assume a model is safe just because it is large; assume it is an untrusted entity that requires guardrails.
Automate Red Teaming: Human red teaming is essential but insufficient. Use automated systems built on specialized models, such as Shade, to find vulnerabilities at scale.
Beware the Lethal Trifecta: Be hyper-vigilant when agents combine untrusted data, private data access, and exfiltration capabilities (for example, through tool use).
Focus on Agent Identity: As agents perform actions on behalf of users, secure identity and permission management become the primary defense against exfiltration.
Expect Gray Swans: Major AI security incidents are often "gray swans"—events that are clearly visible and predictable, yet ignored until they occur.
Red Teaming is Out-of-Distribution: Effective red teaming requires finding behaviors that are out-of-distribution for the model, which is why specialized adversarial models are increasingly necessary.
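
The agent identity takeaway above can be illustrated with a least-privilege check that runs before every tool call. This is a sketch under stated assumptions rather than a prescribed design: `AgentIdentity`, `PermissionDenied`, `run_tool`, and the scope strings such as `repo:read` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


class PermissionDenied(Exception):
    """Raised when an agent attempts an action outside its granted scopes."""


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str            # the agent's own identity, distinct from the user's
    on_behalf_of: str        # the user who delegated the task
    scopes: FrozenSet[str]   # explicit allowlist of permitted actions


def run_tool(identity: AgentIdentity, required_scope: str,
             action: Callable[[], str]) -> str:
    """Run the tool action only if the agent holds the required scope."""
    if required_scope not in identity.scopes:
        raise PermissionDenied(
            f"{identity.agent_id} lacks scope '{required_scope}'"
        )
    return action()


# A review agent that may read the repository but has no outbound network scope.
agent = AgentIdentity("code-review-agent", "alice", frozenset({"repo:read"}))
summary = run_tool(agent, "repo:read", lambda: "read 3 files")
```

Because the check sits outside the model, a successful prompt injection still cannot trigger an action such as `network:send` unless that scope was granted to the agent in the first place.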

Notable Quotes

"AI systems have inherent vulnerabilities of their own. They can be tricked in ways people can be tricked, so you need a different security mindset." — Zico Kolter
"It is clearly a different form of intelligence than people. It’s some alien intelligence that is vastly different, and that difference is actually often brought out to a large degree by things like adversarial attacks." — Zico Kolter
"The goal is to mitigate the risk posed by the AI as it relates to your broader cybersecurity goals." — Matt Fredrikson
"We’re kind of crossing this point where we can do much better than human red teamers now at breaking these models." — Zico Kolter
