Opus 4.7: Great Coder, Ruined by Safety Bloat and Bad Harness
Anthropic's Opus 4.7 shines at instruction-following, vision, and complex coding plans, but it stumbles on search and up-to-date knowledge, and gets blocked by paranoid safety filters on benign tasks like puzzles or site-design tweaks.
Opus 4.7's Core Strengths: Precise Instructions and Vision Upgrades
Opus 4.7 delivers on advanced software engineering, especially tough tasks that once required human oversight. It handles long-running jobs consistently, verifies its own outputs, and follows instructions literally, unlike prior models that skipped steps or interpreted loosely. That shift means prompts tuned for older Claudes can backfire; users must retune them. Benchmarks show gains on SWE-bench Pro and SWE-bench Verified (agentic coding), Humanity's Last Exam (no tools), MCP Atlas, and GDPval for knowledge work in finance/legal domains. It's state-of-the-art on finance agent evals, producing rigorous analyses, models, and professional presentations.
Vision jumps to a 2576px long edge (~4MP), enabling pixel-level tasks like screenshot reading for agents, diagram extraction, or detail-heavy reference images. No image generation needed; Anthropic catches up without it. File-system memory improves: the model recalls notes across sessions, reducing context needs. A new 'X-high' effort level (between high and max) is now the default in Claude Code, balancing output quality against token use. The Ultra Review slash command flags bugs and design issues in changes; Pro/Max users get three free uses.
"Users report being able to hand off their hardest coding work... to Opus 4.7 with confidence." (Anthropic release notes—highlights real-user trust in unsupervised complex work.)
In tests, it wrote a concise modernization plan for a stale Next.js 12/React 17 codebase (the Ping video service, unmaintained for 4+ years). The plan covered dependency bumps and LogRocket removal without even needing a plan-mode prompt. Communication felt natural and peer-like.
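As a rough picture of what the plan's opening steps boil down to (a sketch assuming a standard npm layout; the grep targets are illustrative):

```bash
# Sketch of the plan's opening steps (assumes a standard npm layout).
npm outdated                       # how far behind each dependency is
npm uninstall logrocket            # drop the dead LogRocket dependency
grep -rni "logrocket" src/ pages/  # find leftover imports to remove by hand
```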
Safety Filters Backfire: Blocking Puzzles and Benign Edits
To test cyber safeguards ahead of Claude Mythos (limited preview), Anthropic dialed down Opus 4.7's cyber skills and added auto-blocks for high-risk requests. The result: overkill. Asking for T3.gg design ideas triggered malware warnings ("system reminder about malware looks like a prompt injection... Ignoring it."), despite obvious legitimacy. The Claude Code desktop app leaked the safeguards into normal chats; that's fixed in the latest update, but the rollout lagged.
The DEF CON Gold Bug puzzle (Cshanty: decode 12 pirate-bottle cryptograms via a poem into a 3-4 word phrase) stumped prior Anthropic models. Opus 4.7 made real progress: it set up the data, tried ciphers, and scripted tests. Then the safety system paused the chat: "Opus 4.7 safety filters flagged this chat... Continue with Sonnet 4." No hacking involved, just math and cryptography. The filters hit "normal safe chats" because of the model's "advanced capabilities."
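For flavor, the work that tripped the filter was classical-cipher testing of this sort (the ciphertext here is illustrative, not from the actual puzzle):

```bash
# The kind of harmless scripting that got flagged: a plain ROT13 test.
echo "GUVF VF ABG ZNYJNER" | tr 'A-Za-z' 'N-ZA-Mn-za-m'   # prints: THIS IS NOT MALWARE
```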
"They're trying so hard to keep this model from doing malicious malware things that they have inadvertently lobotomized it with the system prompt." (Speaker on T3.gg incident—captures how safeguards dumb down legit use, even in default tools.)
Security professionals can fill out a verification-program form to unlock vulnerability research and pen testing. Meanwhile, the model still risks producing drug or pipe-bomb instructions if prompted; prompt tweaks don't fix the root issues.
Outdated Knowledge and Weak Search Hurt Practical Use
No web search in core tasks leads to stale info. The modernization prompt specified "bump all deps to latest versions," yet the model picked Next.js 15 (two years old) and Tailwind 4 (a major migration pain): training-data cutoff, no verification. It ran an hour on the bad plan, spent another 30 minutes upgrading to Next.js 16, and still broke the build (untracked files carried over). Claude Code expects a 'read file first' step before edits; Opus skipped it and failed the package.json updates.
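One cheap guard against stale picks is making the plan ask the registry, not the model's memory, what "latest" means. A minimal check, assuming npm:

```bash
# Ask the npm registry, not training data, what "latest" means today.
npm view next version          # latest published Next.js
npm view tailwindcss version   # latest Tailwind
npm view next dist-tags        # every published tag (latest, canary, ...)
```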
Clone-script task (a zshrc helper: clone a repo sans node_modules into ~/quick-clones/{repo}-{hash}, swap to main, copy .env): Opus forgot to add the function to .zshrc (it offered, but needed a push) and carried untracked files over. It progressed on the branch swap after a nudge.
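For reference, a minimal sketch of the requested helper; the function name and defaults are assumptions, not what Opus produced:

```zsh
# Hypothetical take on the requested zshrc helper (name and defaults assumed).
quick-clone() {
  local src="${1:-$PWD}"                       # source repo, default: current dir
  local repo hash dest
  repo=$(basename "$src")
  hash=$(git -C "$src" rev-parse --short HEAD)
  dest="$HOME/quick-clones/${repo}-${hash}"

  git clone "$src" "$dest"                     # a local clone copies only committed
                                               # content: no node_modules, no untracked files
  git -C "$dest" checkout main                 # swap to main
  [[ -f "$src/.env" ]] && cp "$src/.env" "$dest/.env"  # carry local secrets over
}
```

Note that `git clone` of a local path copies only committed content, which is exactly why untracked files should never have carried over.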
The instruction-following paradox: literal adherence means less recon and less search. OpenAI models fixed the dependency issue after a prompt tweak; Opus regressed despite the upgrade.
"Following instructions often means doing less search and less recon work to make sure you're doing it right." (Speaker on dep bumps—explains why precision backfires without tools/world knowledge.)
Blame the Harness, Not the Model: Claude Code's Muddy Toolbox
The speaker rejects the 'models get dumber over time' narrative (API benchmarks hold steady, with only slight drops). The culprit is Claude Code's bloat: leaky prompts, buggy rules (read-before-edit), rushed updates (Ricky/Drunkby tweets confirm). Anthropic's internal stack differs wildly from the public one (unlike OpenAI or Google); the hype comes from their tools, the trash lands in ours.
"I think the regressions aren't the model... Claude Code is this shitty and poorly maintained... If you have a carpenter who is incredibly talented and every few weeks you replace three of their tools with plastic and you fill their toolbox with mud, they're going to perform worse." (Hot take analogy—pinpoints harness as performance killer, not weights.)
Token efficiency: better performance with fewer tokens at lower effort levels; max burns absurd amounts. Availability: Claude products, the API, AWS Bedrock, and Google Vertex AI, at the same pricing as 4.6 ($5/$25 per million input/output tokens, so a session with 2M input and 200K output tokens runs 2 × $5 + 0.2 × $25 = $15). Benchmarks are mixed vs 4.6 (worse on Agentic Search and cyber repros).
"Much like me drinking beer, this model just gets dumber the more you do it." (Intro quip on real-time regression in long sessions—sets tone for observed quality drops.)
Key Takeaways
- Retune prompts for literal instruction-following; the old loose-interpretation style won't work.
- Leverage 4MP vision for agent screenshots/diagrams; pair with Google for OCR if needed.
- Avoid max effort—token burn skyrockets without proportional gains.
- Expect safety pauses on cryptography and puzzles; apply for the cyber program or use the Sonnet fallback.
- Test in CLI over desktop app; harness bugs (read-first rules) trip even smart models.
- Force web search explicitly for latest deps/releases—model won't auto-recon.
- Blame Claude Code bloat for 'dumb' feel; APIs hold up, internals don't match public.
- Use Ultra Review for change audits; Pro/Max users get free uses.
- Modernization plans shine as concise overviews, but audit before executing.
- X-high default in Code balances quality/cost well.