Opus 4.7 Excels at Coding but Safety Ruins It
Anthropic's Claude Opus 4.7 shines in complex software engineering and instruction following, but it is undermined by excessive safety filters, a buggy Claude Code harness, and outdated knowledge, leading to real-world frustrations.
Benchmark Wins Mask Real-World Flaws
Claude Opus 4.7 delivers targeted improvements over Opus 4.6, especially in advanced software engineering, where it leads on contamination-prone benchmarks like SWE-Bench Pro and SWE-Bench Verified, though with no Claude Mythos preview scores for comparison. It scores higher on Humanity's Last Exam (no tools) than previous models but trails OpenAI's o1 on the tool-enabled version (58.7% vs. 64% for Mythos). Gains appear in agentic coding, finance analysis (state-of-the-art on GDPval for knowledge work), and multimodal vision, now handling up to 2576 pixels (about 4 megapixels, 3x prior Claude models), which enables dense screenshot analysis and pixel-perfect tasks. However, it regresses on the Agentic Search and cybersecurity benches, aligning with user reports of poor search decisions.
Pricing stays at $5/M input tokens and $25/M output tokens, available via the Claude API, Bedrock, Vertex AI, and Foundry. New features include an 'X-High' effort level (between High and Max, for finer control over refinement; the default in Claude Code), an 'ultra review' slash command for flagging bugs and design issues (3 free uses for Pro/Max users), and better file-system memory for multi-session tasks. Internal tests show rigorous finance models, professional outputs, and less misalignment than Opus 4.6 (close to Sonnet 4.6). Theo notes fewer 'bold' top scores across the charts, a signal that it is not broadly SOTA.
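For developers who want to compare against the raw API (as Theo suggests later), here is a minimal sketch using the Anthropic TypeScript SDK. The model ID and the commented-out effort field are assumptions based on the features described above, not confirmed API values.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Minimal sketch: call the model over the API rather than through the
// Claude Code harness. Reads ANTHROPIC_API_KEY from the environment.
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-7", // assumed model ID for Opus 4.7
  max_tokens: 4096,
  messages: [
    { role: "user", content: "Review this function for bugs and design issues." },
  ],
  // Hypothetical knob for the new 'X-High' effort level described above;
  // the real parameter name and accepted values may differ.
  // effort: "x-high",
});

console.log(response.content);
```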
"Notice they said a range of benchmarks instead of all benchmarks. That's because Opus 4.7 actually performs worse than Opus 4.6 on a handful of these benches, including the Agentic Search bench." This quote highlights Anthropic's cautious framing, as Theo experienced questionable searches firsthand.
Safety Safeguards Backfire into Usability Nightmares
To test cyber safeguards ahead of Mythos (per Project Glass Wing), Anthropic dialed back Opus 4.7's cyber capabilities and added auto-detection for high-risk requests. In practice this is overkill: in the Claude Code desktop app (latest version), a T3.gg design-improvement prompt triggered three 'malware' system reminders and was dismissed as 'prompt injection', despite no user customization. Theo: "They're trying so hard to keep this model from doing malicious malware things that they have inadvertently lobotomized it with the system prompt."
Worse, a harmless DEF CON Gold Bug puzzle (cryptography, not hacking), one that stumped teams for days yet was solved by o1-preview in 15 minutes, progressed promisingly (cipher trials, code scripting) before safety filters paused the chat: "Opus 4.7 safety filters flagged this chat... Continue with Sonnet 4." Legitimate cybersecurity users must join a verification program. Tests confirm the model still engages with prohibited topics like drug synthesis or bomb-making, so the safeguards dumb it down without enhancing safety. Early CLI use avoided some of these issues, but the desktop app lags behind with buggy auto-updates.
"I'm paying $200 a month and you won't solve a expletive puzzle for me." Theo's frustration underscores how safeguards block benign tasks, forcing retries on weaker models.
Instruction Following Tradeoff: Literal but Uninformed
Opus 4.7's standout trait is precise instruction adherence: it takes prompts literally where prior Claude models skipped or loosened them, so prompts written for older models may now 'produce unexpected results' and require retuning. Theo prefers this: "I like models that do what you tell them." Claude Code and some other tools lagged in adapting to the change (fixed post-release).
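To illustrate the kind of retuning this implies (these prompts are hypothetical, not from the video), the difference is mostly about spelling out scope and steps that older models would infer on their own:

```typescript
// Hypothetical example of prompt retuning for stricter literalism.

// Older Claude models would quietly expand this into a broader refactor.
const loosePrompt = "Clean up the auth module.";

// Opus 4.7 takes instructions at face value, so state scope and steps explicitly.
const retunedPrompt = [
  "Clean up the auth module:",
  "1. Read every file under src/auth before proposing changes.",
  "2. Only rename and de-duplicate helpers; do not change behavior.",
  "3. List risky changes separately instead of applying them.",
].join("\n");
```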
Yet that literalism does not extend to verification. Modernizing the 4-year-old Ping video-service codebase (Next.js 12, React 17), it planned concisely (remove LogRocket, bump dependencies) but proposed Next.js 15 (two years out of date versus 16) and Tailwind 4 (a disruptive migration), ignoring the request for the 'latest versions' because it does no web search and relies on stale training data. It ran for an hour before the mistake was caught, took another 30 minutes after being pointed at Next.js 16, and still broke the build. It also failed harness rules (read files before updating them), and a script for cloning a ZshRC setup (hidden directory, main branch, env copy) erroneously carried over untracked files.
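One way around the stale-training-data trap is to resolve the real latest versions up front and paste them into the plan, rather than letting the model guess. A minimal sketch against the public npm registry (the package names are just the ones mentioned above):

```typescript
// Pre-flight check: look up actual latest versions instead of relying on the
// model's knowledge cutoff. Works on Node 18+ (global fetch).
const packages = ["next", "react", "tailwindcss"];

for (const name of packages) {
  const res = await fetch(`https://registry.npmjs.org/${name}/latest`);
  const { version } = (await res.json()) as { version: string };
  console.log(`${name}: latest is ${version}`); // paste these into the upgrade prompt
}
```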
"Despite being better at following instructions, it's really bad at understanding the definitions of things and that it doesn't have the latest information." This reveals the cost: fidelity over initiative, amplifying knowledge gaps.
Claude Code Harness: The Real Regression Culprit
Theo's hot take: there is no true model dumbing-down (API benchmarks are stable, and slight dips are negligible next to o1's consistency). Blame Claude Code's 'shitty and poorly maintained' state, where constant slop additions (rules, tools, prompts) degrade performance. "If you have a carpenter who is incredibly talented and every few weeks you replace three of their tools with plastic and you fill their toolbox with [expletive] mud, they're going to perform worse..." Anthropic's internal tools differ vastly from the public ones (unlike OpenAI and Google), so models get hyped internally while the public gets 'lobotomized' versions. Token efficiency improves (fewer tokens at most settings, with better performance), but Max still burns absurd amounts.
The community echoes this: React's Ricky saw similar behavior in Sonnet, and fixes have rolled out unevenly. Theo watched quality degrade mid-session, mirroring his beer analogy: "much like me drinking beer, this model just gets dumber the more you do it."
The initial hype faded: good plans and conversation, but unreliability kills trust. It remains usable for hard coding handoffs and for vision and creative work in professional outputs (interfaces, slides, docs).
Key Takeaways
- Retune prompts for literal instruction following; test small before long runs.
- Avoid Claude Code desktop for now—use CLI/API until harness stabilizes; watch auto-updates.
- Expect safety pauses on puzzle-like tasks; apply for cyber verification if needed.
- Bump to latest deps explicitly and verify plans; the model skips web searches.
- Prioritize token-efficient efforts (X-High default); skip Max to avoid burn.
- Compare via API for true perf; public tools lag internal capabilities.
- Leverage vision for high-res (2576px) multimodal agents, but Google leads recognition.
- Track benchmarks skeptically—contamination inflates agentic coding scores.
- For finance/legal: Stronger analysis/models than 4.6.
- Overall: Use for supervised hard tasks, not unsupervised agents yet.