Opus 4.7 Excels at Coding but Safety Kills It
Theo's hands-on tests reveal Claude Opus 4.7 shines in instruction-following and complex coding plans but regresses due to hyper-aggressive safeguards, buggy Claude Code harness, and outdated knowledge—making it dumber in practice than benchmarks suggest.
Core Strengths: Precise Instructions and Rigorous Planning
Theo spent a full day benchmarking Anthropic's Claude Opus 4.7, their first public model post-Mythos preview, available at $5/M input and $25/M output tokens—same as 4.6. It outperforms 4.6 on tough software engineering tasks like SBench Pro and Verified agentic coding benches, where it claims top scores (bold numbers rare across charts). Users hand off "hardest coding work" needing prior supervision, as it verifies outputs rigorously.
Key wins: Superior instruction-following—takes prompts literally, unlike looser prior models, producing concise plans without plan-mode prompts. Theo tested modernizing his 4-year-old Ping video service codebase (Next.js 12, React 17): Opus wrote a crisp upgrade plan covering Tailwind 3→4, deps bumps, LogRocket removal. "I liked how it talked. I liked how concise this plan was. It was better in ways that matter."
Multimodal leaps: Handles 2576px long-edge images (4MP, 3x prior Clades), enabling pixel-perfect refs for agents reading screenshots or diagrams. Finance analyst evals show state-of-the-art GDP val on knowledge work; better file-system memory retains notes across sessions. New "X-high" effort level (between high/max) defaults in Claude Code CLI, balancing tokens/performance—uses fewer tokens than 4.6 at same levels but crushes on max (avoid max, burns absurd tokens).
Theo notes excitement for these: "The first one I'm really hyped on, which is instruction following... I prefer models that do what you tell them."
Fatal Flaws: Safeguards That Lobotomize
Despite hype, Opus 4.7 regresses on agentic search and cyber benches vs. 4.6—aligning with Theo's tests. Anthropic tuned it down deliberately for cyber risks (pre-Mythos broad release), adding auto-blocks for "prohibited/high-risk" uses. Result: Benign tasks flagged.
First encounter: Asking to redesign T3.gg in Claude Code desktop yielded leaked system prompts refusing as "malware augmentation." Model overrode but flagged three times: "Heads up, the last system reminder about malware looks like a prompt injection... Ignoring it." Fixed in latest CLI/Desktop updates, but auto-update lagged 12+ hours. Ricky (React team) saw similar in Sonnet.
Worse: Gold Bug puzzle (Defcon crypto challenges, non-hacking)—12 bottles + shanty poem decode to pirate phrase. Opus progressed (coded ciphers, scripts), then safety-paused: "Opus 4.7 safety filters flagged this chat... Continue with Sonnet 4." Theo: "This isn't some hacking thing... Are you joking, Anthropic? I'm paying $200 a month and you won't solve a expletive puzzle."
Doesn't block real harms (demoed drug synthesis/pipe bomb), just dumbs legit work. Pros need "cyber verification program" form for pentesting/red teaming.
Harness Hell: Claude Code Drags It Down
Theo's hot take: No true model regression (benchmarks stable, APIs fine)—blame Claude Code's sloppy maintenance. Constant bloat: Muddy system prompts, half-baked tools, rules like "read file before edit." Model fails package.json updates repeatedly, unaware of harness.
Ping modernization: Ignored "bump all deps to latest versions" (added to beat OpenAI fails)—picked Next.js 15 (2y old, cutoff knowledge), no web search despite agentic claims. Hour-long run → broken; fix to 16 → another 30min fail. Script for Zsh clone-project carried untracked files, botched env vars.
Anthropic internals use superior, non-public stacks—hype from their tools, trash in ours. "If you have a carpenter who is incredibly talented and every few weeks you replace three of their tools with plastic and you fill their toolbox with expletive mud, they're going to perform worse... That's because the harness is falling apart."
Theo watched quality degrade mid-session; Depot sponsor nod for fast CI contrasts AI dev pains. Cursor adapted prompts fast; Claude Code lags.
Benchmarks vs. Reality: Hype Meets Friction
Opus trails Mythos (internal powerhouse, cyber-limited), o3 on tool'd Humanity's Last Exam (58.7% vs. 64%), Google on vision. Better than 4.6 on MCP Atlas, finance; worse cyber vulns. Contaminated benches (models trained on them) dilute meaning.
Theo rejects drift narratives: Scores dip minor (~few points); consistency lags o3-Pro. Internal evals shine, public crippled. Ultra-review /slashcommand flags bugs (3 free ProMax trials)—token-efficient at X-high.
Evolution: Early hype → real-time dumbing via safeguards/harness. Theo sees niche use (CLI planning) but calls it "one of the weirdest models ever... gets dumber the more you do it."
"I think the regressions aren't the model... I just genuinely think Claude Code is this shitty and poorly maintained."
Key Takeaways
- Retune prompts for literal following—old loose ones now fail or overdo (e.g., no auto-search).
- Avoid max effort: Token explosion without proportional gains; stick to X-high default.
- Test vision-heavy agents: 4MP images unlock screenshot/diagram extraction, but Google still leads.
- Bypass harness woes via CLI/API over Desktop/Code apps; demand better tools from labs.
- Weigh safeguards: Blocks puzzles/pentests but not bombs—file cyber program form if needed.
- Clone busted builds fast: Theo's Zsh script idea (repo + hash, main reset, env copy)—fix untracked file bugs.
- Benchmarks lie: Prioritize hands-on (1hr+ runs) over contaminated scores.
- Internal vs. public gap kills hype—use open-source harnesses like T3 Code.
- For code modernization: Enforce search/latest checks explicitly; review plans always.
- Opus viable for supervised hard tasks, but o3-Pro more consistent sans bloat.