Opus 4.7 Tops Coding Benchmarks but Needs Explicit Prompts

Anthropic's Claude Opus 4.7 excels at precisely specified tasks, topping Every's LFG coding benchmark, jumping from 58% to 70% on CursorBench, and resolving 3x more production tasks on Rakuten-SWE-Bench. It adds native self-verification and 3x vision resolution, but unlike the more proactive 4.6 it requires detailed specs.

Precision Gains Reward Detailed Instructions

Opus 4.7 self-verifies outputs against the original request, catching logic errors mid-plan without prompting; what was once a manual technique is now native. It sustains multi-hour tasks, such as building a scheduled apartment-hunting dashboard from Craigslist/Zillow data, where 4.6 faltered. Benchmarks confirm the gains: a jump on SWE-bench Pro's hardest tasks, CursorBench up from 58% to 70%, and 3x more resolved production tasks on Rakuten-SWE-Bench versus 4.6. Vision handles 3x higher resolution, spotting pixel-level UI issues like misaligned buttons. A new 'extra high' effort default (between high and max) suits async analysis; use max for architecture and high or medium for iteration. It also generates coherent PowerPoints by self-checking slides via the improved vision.

These gains shine on well-specified work: the model topped Every's LFG coding benchmark and, per tester Mike Taylor, produced consulting prose 'better than my own,' with zero fluff when explaining complex topics simply.
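The effort guidance above can be sketched as a simple task-to-effort routing table. This is a minimal illustration of the article's advice only; the mapping helper is hypothetical, and "effort" here is just a label, not a confirmed API parameter.

```python
# Illustrative mapping of task types to effort levels, following the
# article's guidance: max for architecture, the new 'extra high' default
# for async analysis, and high or medium for iteration.
EFFORT_BY_TASK = {
    "architecture": "max",
    "async_analysis": "extra_high",  # the new default, between high and max
    "iteration": "high",             # drop to "medium" for faster loops
}

def effort_for(task_type: str) -> str:
    """Pick an effort level for a task type, falling back to the
    'extra high' default for anything unrecognized."""
    return EFFORT_BY_TASK.get(task_type, "extra_high")

print(effort_for("architecture"))  # max
```

The fallback mirrors the article's framing of 'extra high' as the new default for unattended work.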

Literal Shift Breaks Old Prompts, Cuts Proactive Insights

Unlike 4.6, which did implicit prompt engineering and unprompted noticing (e.g., flagging P&L data errors), 4.7 hedges, stalls, or guesses wrong without explicit direction. COO Brandon Gell found it missed 4.6's instinctual catches in finance and ops. Anthropic researcher Alex Albert confirmed the deliberate tuning: Sonnet 3.7 was too eager, Opus 4 dialed it back, 4.6 overdid it, and 4.7 reined it in, the fourth such adjustment in a year of 'perpetual back-and-forth.' Existing 4.6 prompts disappoint at first; users must add explicit permission, criteria, constraints, budget, and cadence.

Team notes are mixed on depth: Dan Shipper sees 'big model smell,' with hidden powers emerging slowly, though he finds it less emotionally intelligent. Kieran Klaassen praises deeper 'compound engineering' workflows when the model is pushed. Katie Parrott favors it for data analysis and automations over writing (too slow and regimented), pending prompt tweaks.

Use for Coder/Verifier Roles, Not Loose Exploration

Reach for 4.7 on tight briefs: coding, verification, long-coherence tasks, and precise writing or PowerPoints; once tuned, it is an elegant, detailed daily driver. Stick with 4.6 or Sonnet for unprompted noticing or interactive speed. Rewrite your prompts this weekend: looser rails yield vague outputs, while tighter ones unlock cleaner results. Anthropic's zigzags delay convergence with OpenAI's Codex toward a 'general-purpose work agent.' Test deeply: an initial 'not mind-blowing' impression hides capabilities that reveal themselves over weeks.

Summarized by x-ai/grok-4.1-fast via openrouter

© 2026 Edge