AI Agent Beats Top Jailbreaker's 5 Attacks

Siege and Probing Attacks Fail Against Quarantine

Ply the Liberator, Times 100 AI influencer known for hacking new models in minutes, targeted Matthew Berman's OpenClaw—a personal AI agent that scans a single whitelisted email address for tasks. Blind to architecture or models, Ply started with tokenades: payloads packing 3 million characters into emojis or jailbreak commands to fingerprint the model via erratic responses. Gmail spam filters blocked initial probes, but whitelisting enabled tests. Results: all quarantined, preventing processing. Ply escalated to siege attacks—flooding with millions of tokens across emails to exhaust API quotas and drain wallets. OpenClaw quarantined these too, avoiding token burn despite visible system strain. Key defense: automatic quarantine of suspicious inputs before full processing, neutralizing denial-of-service without crashing.

Jailbreak Templates and Mimicry Blocked by Frontier Reasoning

Shifting tactics, Ply deployed structured jailbreak templates stripped of trigger words to override output formats (e.g., forcing dividers or intros as proof of injection). He mimicked system commands with thinking tags, tricking quarantine loops into self-hardening or executing fakes. Final probe used Claude Opus 4.6 hint to craft XFill via free-association games post-email: haikus, movie scripts, songs leaking memories (e.g., names, locations). All failed—quarantined instantly. Pre-testing on Claude confirmed Opus's built-in safeguards flag embedded instructions, URLs, or actions. Local/weaker models (e.g., Grok) fell easier, but frontier reasoners like Opus sliced low-hanging fruit.

Rules for Ironclad AI Agent Security

Ply rated blind infiltration odds at 80% initially, dropping as defenses held. Core rules: (1) Deploy frontier models (Opus-level reasoners) as first scanner—smaller/instruct models collapse fast. (2) Human-in-the-loop for overrides. (3) Quarantine suspicious payloads pre-execution. Trade-offs: Siege still spikes costs if unmonitored; accounts risk bans from labs (Ply recovers his). No permanence—Ply stressed evolving attacks outpace static hardening. OpenClaw's narrow task scope aided resilience, but broad agents demand constant upgrades. Builders: Prioritize quota limits, input sanitization, and model rotation to counter wallet drains and leaks.