Claude Mythos Leak Signals 10T Param Frontier
Leaked details of Anthropic's Claude Mythos (10T params) point to unmatched gains in coding, reasoning, and cybersecurity, outpacing Opus; Zhipu's open-source GLM 5.1 agent nears proprietary coding benchmarks with a 45.3 score.
Frontier Model Leaks Reshape AI Race
Details of Anthropic's Claude Mythos, a 10-trillion-parameter model, have leaked: it reportedly sits in a new tier above Opus, with Capabara as a sub-tier that still surpasses current flagships. Early tests show large leaps in coding, academic reasoning, and cybersecurity; Fortune reports describe it as incomparable to Opus and potentially dangerous, prompting a slow rollout over misuse risks. Rumors suggest intermediate Opus 5 or Sonnet 5 releases will come first, possibly as marketing to build hype. OpenAI's internal Spud model promises a similar step change. Expect 2026 to be pivotal, with GPT 5.5 and DeepSeek v4 imminent and the gap between standalone tools and full systems collapsing.
Open-Source Agents Close Proprietary Gap
Zhipu's GLM 5.1 advances agentic capabilities over GLM 5, excelling at long-running tasks, instruction following, and multi-step workflows at low cost. Its coding benchmark score of 45.3 (vs. Opus 4.6's 47.9) rivals closed models despite slow inference. Generated UIs demonstrate strong frontend skills: clean landing pages with dynamic elements, varied typography, and well-organized layouts. Mistral's Boxrol TTS adds open-weight expressive speech across 9 languages, with low latency and adaptable voices.
Real-Time Tools and Coding Platforms Evolve
Google DeepMind's Gemini 3.1 Flash Live enables real-time multimodal voice/vision agents after a year of refinement, cutting latency while improving quality and reliability; in a demo it alters app code fluidly (e.g., a bigger mic, yellow polka dots). OpenAI's Codex plugins turn it into an execution environment with one-click workflows for iOS apps, data analysis, reports, and presentations, directly challenging Claude Code. Claude Code adds remote autofix for PRs (CI failures, review comments), imposes 5-hour session limits during peak weekday hours (weekly quotas unchanged), and introduces an auto-mode that uses classifiers to execute safe actions instantly and block risky ones. Cursor's Composer 2, claimed to be frontier-level on internal benchmarks, was revealed to be a fine-tuned Kimi K2.5.
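The auto-mode pattern described for Claude Code, a classifier that gates each proposed action before it runs, can be sketched roughly as follows. The rule lists, function name, and three-way verdict here are hypothetical illustrations, not Anthropic's actual implementation:

```python
# Hypothetical sketch of an auto-mode action gate: a lightweight
# classifier decides whether a proposed tool action runs instantly,
# is blocked outright, or is deferred to the user for confirmation.
import re

# Read-only inspection commands: safe to run without prompting.
SAFE_PATTERNS = [r"^git status\b", r"^ls\b", r"^cat\b", r"^grep\b"]
# Destructive or privileged commands: always stopped.
RISKY_PATTERNS = [r"\brm\s+-rf\b", r"\bsudo\b", r"curl .*\|\s*sh"]

def classify_action(command: str) -> str:
    """Return 'allow', 'block', or 'ask' for a proposed shell command."""
    if any(re.search(p, command) for p in RISKY_PATTERNS):
        return "block"   # risky: refuse immediately
    if any(re.search(p, command) for p in SAFE_PATTERNS):
        return "allow"   # safe: execute instantly, no prompt
    return "ask"         # unknown: fall back to user approval
```

In a real system the regex lists would likely be replaced by a trained classifier, but the control flow (instant allow, hard block, or escalate) is the part the announcement describes.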
Benchmarks Track True AGI Progress
ARC AGI 3 tests agentic reasoning in interactive environments: tasks must be solved on the first try, with no training data or instructions. Top models score under 1% (humans: 100%), and the design resists overfitting to reward genuine intelligence over memorization. Success here would pave the way toward commercial video games, enabling AI to act and adapt in complex worlds.
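The interaction loop such a benchmark implies can be illustrated with a toy example: the agent receives only raw observations and a set of legal actions, with no instructions and no training data, and must discover the goal by acting. The environment and explorer below are illustrative stand-ins, not the actual ARC AGI 3 harness:

```python
# Toy sketch of an ARC-AGI-3-style interaction loop (hypothetical
# environment, not the real benchmark). The goal cell is never told
# to the agent; it only learns anything by acting and observing.
import random

class ToyGridEnv:
    """Agent starts at cell 0 on a 1-D line; the unstated goal is the last cell."""
    def __init__(self, size=5):
        self.size, self.pos = size, 0

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clamped to the grid
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1  # episode ends at the hidden goal
        return self.pos, done

def explore(env, max_steps=100, seed=0):
    """A memoryless explorer: random moves until the episode ends."""
    rng = random.Random(seed)
    for t in range(1, max_steps + 1):
        _, done = env.step(rng.choice([-1, 1]))
        if done:
            return t      # number of steps taken to stumble on the goal
    return None           # failed within the step budget
```

A random walker eventually solves this toy grid, but the point of the benchmark is the opposite: real tasks are rich enough that blind exploration fails, and only agents that infer the environment's rules from observation succeed on the first attempt.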