Agentic Self-Verification Slashes False Positives in Bug Hunting
Scale AI vulnerability detection by building agentic pipelines in which models like Claude Mythos Preview analyze code, then autonomously write and execute test cases to confirm each suspected issue. This step filters out speculation: earlier read-only scans with GPT-4 or Claude 3.5 Sonnet produced too much noise, whereas self-testing turned AI output into actionable reports. Mozilla ran Claude Opus across parallel VMs, each handling one file, then added deduplication, prioritization, and fix-tracking. The result: 271 previously unknown bugs in Firefox 150, plus about a third of 111 other internally found bugs, contributing to 423 total resolutions in April, more than five times the prior monthly record of 76. Only 41 came from external reports, which suggests the AI pipeline out-produced traditional reporting channels over that period.
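The confirm-then-report step described above can be sketched roughly as follows. This is a minimal illustration, not Mozilla's pipeline: the `Finding` type, the `reproduce` callable (standing in for a model-written test case run in an isolated VM), and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    file: str
    summary: str
    # Hypothetical stand-in for the model-written test case.
    # Returns True only when the suspected bug actually reproduces.
    reproduce: Callable[[], bool]


def verify(findings: Iterable[Finding]) -> list[Finding]:
    """Keep findings whose test case reproduces, dropping duplicates."""
    confirmed: list[Finding] = []
    seen: set[tuple[str, str]] = set()
    for finding in findings:
        key = (finding.file, finding.summary.lower())
        if key in seen:
            continue  # already confirmed an equivalent report
        try:
            reproduced = finding.reproduce()
        except Exception:
            reproduced = False  # a crashing test harness is not a confirmation
        if reproduced:
            seen.add(key)
            confirmed.append(finding)
    return confirmed


# Illustrative candidates; names and outcomes are made up.
candidates = [
    Finding("table.cpp", "Row counter overflow", lambda: True),
    Finding("table.cpp", "row counter overflow", lambda: True),  # duplicate
    Finding("label.cpp", "Speculative use-after-free", lambda: False),
]
reports = verify(candidates)
```

Here only one report survives: the duplicate is collapsed and the unreproduced guess is dropped, which is the noise reduction the paragraph above attributes to self-testing.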
The runs also produced evidence of robustness: the AI's attempts to exploit prototype pollution failed against Mozilla's existing sandbox defenses, validating architecture choices made years earlier without any manual re-testing.
AI Excels at Rare, Chainable Weaknesses Fuzzing Misses
Target subtle flaws that must be chained together to build an exploit, which is where fuzzing falls short. Mozilla's AI uncovered a 15-year-old HTML label bug, a 20-year-old XSLT issue in XML tools, sandbox escapes via HTML tables exceeding 65,535 rows (which overflowed a counter), and RLBox bypasses in third-party libraries. None of these is a standalone attack, but each is a prime candidate for combination, and reasoning across a codebase to find such combinations is exactly where AI is strong.
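The 65,535-row limit is the largest value a 16-bit unsigned integer can hold. As a rough illustration of why one more row is dangerous, the sketch below assumes the counter is a 16-bit unsigned value (the source does not specify the actual type); `count_rows_u16` is a hypothetical helper, not Firefox code.

```python
MAX_U16 = 0xFFFF  # 65,535, the largest value a 16-bit unsigned counter holds


def count_rows_u16(n_rows: int) -> int:
    """Return the row count as a 16-bit unsigned counter would store it."""
    # Assumption: the counter wraps modulo 2**16, as unsigned integers do in C/C++.
    return n_rows % (MAX_U16 + 1)


at_limit = count_rows_u16(65_535)    # still represented correctly
one_over = count_rows_u16(65_536)    # wraps around to zero
well_over = count_rows_u16(70_000)   # wraps to a small, wrong value
```

A table with 65,536 rows would be recorded as having none, so any bounds check or allocation based on the stored count no longer matches the real data. On its own that is only a miscount, which fits the point that such flaws become exploitable mainly when chained with others.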
Move past dismissing AI reports as 'slop' by pairing capable models with verification infrastructure (in Mozilla's case, after a February collaboration with Anthropic's Frontier Red Team). Publish bug details early for transparency, which builds trust in automated findings.
Automate AI Checks into CI/CD for Every Commit
Integrate pipelines directly into development: Mozilla plans to scan all new code before it is committed, catching issues at the source. Start small with supervised runs, then parallelize across infrastructure. The trade-off: this approach handles complex logic better than fuzzing but depends on model quality, so upgrade models as capabilities grow. This closes the gap between demo and production, making AI a core security layer for large open-source projects like Firefox.
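One way to picture such a gate is a function that scans each changed file in parallel and blocks the commit if any scan returns a confirmed finding. This is a sketch under stated assumptions: `gate_commit`, `scan_file`, and `fake_scan` are hypothetical names, and a real scanner would call a model and run its test cases rather than match on a file name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def gate_commit(
    changed_files: list[str],
    scan_file: Callable[[str], list[str]],
    max_workers: int = 4,
) -> tuple[bool, dict[str, list[str]]]:
    """Scan each changed file in parallel; pass only if nothing is confirmed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with changed_files.
        results = list(pool.map(scan_file, changed_files))
    findings = {
        path: found for path, found in zip(changed_files, results) if found
    }
    return (not findings, findings)


def fake_scan(path: str) -> list[str]:
    """Hypothetical stand-in for a model-backed scanner with self-verification."""
    return ["counter overflow"] if path.endswith("table.cpp") else []


ok, findings = gate_commit(["label.cpp", "table.cpp"], fake_scan)
```

In this example `ok` is false and `findings` names only `table.cpp`, so a CI job could fail the check and attach the confirmed report. Scanning one file per worker mirrors the one-file-per-VM layout described earlier, though threads here are only a lightweight substitute for separate VMs.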