Jagged Capabilities Defy Smooth Scaling
AI cybersecurity capability doesn't improve predictably with model size, price, or generation. On a FreeBSD NFS buffer overflow (CVE-2026-4747), all eight tested models, including small open-weights models like GPT-OSS-20b (3.6B active params, $0.11/M tokens) and GPT-OSS-120b (5.1B active), detected the bug, computed exact overflow sizes (96-312 bytes), and assessed it as a critical RCE (CVSS 9.8). On OpenBSD's 27-year-old SACK bug, GPT-OSS-120b recovered the full chain: missing lower-bound validation, SEQ_LT/SEQ_GT signed overflow at ~2^31, and a NULL dereference after hole deletion. Yet Qwen3 32B, the same model that aced the FreeBSD CVSS call, pronounced the OpenBSD code "robust."
Inverse scaling hit false-positive triage: small models like DeepSeek R1 and GPT-OSS-20b correctly traced the OWASP Java servlet data flow (the user input is discarded by remove(0), bar ends up as "moresafe", so there is no SQLi), while frontier Claude Sonnet 4.5 and most GPT-4/5 models failed, claiming "param → this is returned!" Rankings reshuffled: no model topped every task.
"The capability frontier is jagged." – Stanislav Fort, summarizing why there's "no stable best model for cybersecurity," as small models outperform frontiers on triage but lag on subtle math.
Post-fix specificity exposed gaps: every model flagged the unpatched FreeBSD code in 3/3 runs, but only GPT-OSS-120b cleared the patched code 3/3; the others produced false positives, inventing signed-comparison bypasses on the unsigned oa_length field.
Modular Pipeline Exposes Uneven Demands
Mythos blends these tasks into a single capability, but in reality they split into scanning (codebase navigation), detection, triage/verification, patching, and exploitation, each scaling differently. Broad scanning favors the volume cheap models offer: "A thousand adequate detectives searching everywhere will find more bugs than one brilliant detective who has to guess where to look."
Detection of buffer overflows is commoditized; the OpenBSD bug demands mathematical reasoning. Triage demands false-positive rejection, vital after curl killed its bug bounty over noise. Exploitation requires knowledge of mitigations: no stack canary on the int32_t overflow path, no KASLR, ROP chains. Small models reasoned through the ROP (prepare_kernel_cred(0)/commit_creds), SMEP bypasses, even wormability; DeepSeek R1 pragmatically skipped the 1000-byte SSH-key payload in favor of ~160 bytes of userland operations after escalation. None matched Mythos's 15-RPC BSS spray, but alternatives like stack pivots or copyin abuse showed creative primitives.
AISLE's production system (mid-2025) found 15 OpenSSL CVEs (12 of 12 in one release, some bugs over 25 years old, CVSS 9.8), 5 in curl, and 180+ across 30+ projects. The maintainer-trust metric: OpenSSL's CTO praised its "high quality reports." The system is model-agnostic: Anthropic models are used but are not always best; scaffolds (containers, ASan oracles, attack-surface ranking, iterative tests) drive the results.
"The moat in AI cybersecurity is the system, not the model." – Stanislav Fort, contrasting Mythos's intelligence-per-token max with inputs like tokens/dollar, tokens/second, embedded expertise.
Tradeoffs: frontier models shine on subtlety but cost 10x or more; small ones enable broad coverage at better economics. The jaggedness demands ensembles or task-specific routing.
Production Implications: Broad, Cheap Beats Narrow Elite
Anthropic's $100M in credits and $4M in donations validate the category, but AISLE executed the Glasswing mission earlier: its live analyzer on OpenSSL/curl/OpenClaw PRs catches bugs before they ship. Once scaffolds isolate the snippets, cheap models suffice for the core analysis; end-to-end discovery needs orchestration, not Mythos exclusivity.
The economics shift: deploy small models everywhere, and triage with systems that earn maintainer trust. False positives kill adoption (the curl precedent); the specificity gaps reinforce why scaffolds are necessary.
"Our practical experience on the frontier of AI security suggests that the reality is very uneven." – Stanislav Fort, on why blending tasks misleads: production favors modular, expert-wrapped small models over monolithic frontier hopes.
Replicate this by isolating functions via scaffolds, probing with open models (DeepSeek R1, Kimi K2), validating bidirectionally (on both the buggy and the fixed code), and iterating on maintainer feedback.
Key Takeaways
- Test small open models (3.6B+ active params) on isolated snippets: they recover flagship vulns like the FreeBSD RCE and the OpenBSD SACK chain.
- Build modular pipelines: scan broadly with cheap models, then deepen and triage with scaffolds (ASan oracles, attack-surface ranking).
- Prioritize specificity: re-run on patched code; false positives drown maintainers, and the curl bounty died from exactly this.
- Route by task: there is no universal best model; ensemble the jagged strengths (e.g., DeepSeek R1 for ROP pragmatics).
- Target maintainer acceptance: close the loop to accepted patches; an OpenSSL CTO endorsement beats raw CVE counts.
- Exploit creatively under constraints: models independently solved the 304-byte ROP limit differently than Mythos did.
- Scale via volume: cheap tokens enable full-codebase scans, outperforming selective frontier probes.
- Embed expertise: the moat is orchestration (containers, oracles, validation), not model access.