AI Agent Beats Top Jailbreaker's 5 Attacks

Hardened OpenClaw system quarantined all 5 attacks from Ply the Liberator—including token bombs and jailbreaks—using Claude Opus as frontline defense, but no AI stays secure forever.

Siege and Probing Attacks Fail Against Quarantine

Ply the Liberator, Times 100 AI influencer known for hacking new models in minutes, targeted Matthew Berman's OpenClaw—a personal AI agent that scans a single whitelisted email address for tasks. Blind to architecture or models, Ply started with tokenades: payloads packing 3 million characters into emojis or jailbreak commands to fingerprint the model via erratic responses. Gmail spam filters blocked initial probes, but whitelisting enabled tests. Results: all quarantined, preventing processing. Ply escalated to siege attacks—flooding with millions of tokens across emails to exhaust API quotas and drain wallets. OpenClaw quarantined these too, avoiding token burn despite visible system strain. Key defense: automatic quarantine of suspicious inputs before full processing, neutralizing denial-of-service without crashing.

Jailbreak Templates and Mimicry Blocked by Frontier Reasoning

Shifting tactics, Ply deployed structured jailbreak templates stripped of trigger words to override output formats (e.g., forcing dividers or intros as proof of injection). He mimicked system commands with thinking tags, tricking quarantine loops into self-hardening or executing fakes. Final probe used Claude Opus 4.6 hint to craft XFill via free-association games post-email: haikus, movie scripts, songs leaking memories (e.g., names, locations). All failed—quarantined instantly. Pre-testing on Claude confirmed Opus's built-in safeguards flag embedded instructions, URLs, or actions. Local/weaker models (e.g., Grok) fell easier, but frontier reasoners like Opus sliced low-hanging fruit.

Rules for Ironclad AI Agent Security

Ply rated blind infiltration odds at 80% initially, dropping as defenses held. Core rules: (1) Deploy frontier models (Opus-level reasoners) as first scanner—smaller/instruct models collapse fast. (2) Human-in-the-loop for overrides. (3) Quarantine suspicious payloads pre-execution. Trade-offs: Siege still spikes costs if unmonitored; accounts risk bans from labs (Ply recovers his). No permanence—Ply stressed evolving attacks outpace static hardening. OpenClaw's narrow task scope aided resilience, but broad agents demand constant upgrades. Builders: Prioritize quota limits, input sanitization, and model rotation to counter wallet drains and leaks.

Video description
Try Greptile for free for 14 days! http://greptile.com/go/berman Download The 25 OpenClaw Use Cases eBook 👇🏼 https://bit.ly/4aBQwo1 Download The Subtle Art of Not Being Replaced 👇🏼 http://bit.ly/3WLNzdV Download Humanities Last Prompt Engineering Guide 👇🏼 https://bit.ly/4kFhajz Join My Newsletter for Regular AI Updates 👇🏼 https://forwardfuture.ai Discover The Best AI Tools👇🏼 https://tools.forwardfuture.ai My Links 🔗 👉🏻 X: https://x.com/matthewberman 👉🏻 Forward Future X: https://x.com/forwardfuture 👉🏻 Instagram: https://www.instagram.com/matthewberman_ai 👉🏻 TikTok: https://www.tiktok.com/@matthewberman_ai 👉🏻 Spotify: https://open.spotify.com/show/6dBxDwxtHl1hpqHhfoXmy8 Media/Sponsorship Inquiries ✅ https://bit.ly/44TC45V

Summarized by x-ai/grok-4.1-fast via openrouter

6105 input / 1288 output tokens in 8916ms

© 2026 Edge