Opus 4.7 Excels with Explicit Prompts, Stalls Without Them
Anthropic's Opus 4.7 delivers top coding benchmark scores and built-in self-verification when given detailed instructions, but unlike 4.6 it hedges or misses proactive insights, shifting the burden of prompt specificity onto users.
Precision Gains Demand Detailed Instructions
Opus 4.7 outperforms its predecessors on key benchmarks: it clears SWE-bench Pro's hardest tasks, jumps from 58% to 70% on CursorBench, and resolves three times as many tasks on Rakuten-SWE-Bench as 4.6. It introduces self-verification, reviewing its output against the original request and catching logic errors mid-plan without being prompted, a technique users previously had to apply manually. Long-horizon coherence sustains multi-hour tasks, like building a twice-daily Craigslist/Zillow apartment dashboard that 4.6 couldn't maintain. Vision processing handles over three times the resolution, spotting pixel-level UI issues like misaligned buttons. A new 'extra high' effort level (the default in Claude Code) suits async handoffs; use 'max' for complex architecture and 'high' or 'medium' for interactive iteration. For consultants, it generates superior PowerPoints by self-checking slides for consistency.
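For readers who want the manual version of that self-check on earlier models, a minimal sketch follows: generate a draft, then ask the model to review it against the original request before accepting it. It uses the Anthropic Python SDK and assumes an API key in the environment; the model identifier and prompts are illustrative placeholders, not confirmed names.

```python
# Sketch of the manual self-verification pattern that Opus 4.7 reportedly
# performs natively: generate, then re-check the output against the request.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"       # hypothetical identifier, for illustration only

request = "Write a Python function that merges overlapping date ranges."

# Pass 1: produce a draft answer.
draft = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    messages=[{"role": "user", "content": request}],
).content[0].text

# Pass 2: verify the draft against the original request before accepting it.
review = client.messages.create(
    model=MODEL,
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": (
            f"Original request:\n{request}\n\nDraft answer:\n{draft}\n\n"
            "Check the draft against the request. List any logic errors or "
            "unmet requirements, or reply PASS if it fully satisfies the request."
        ),
    }],
).content[0].text

print(review)
```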
This follows Anthropic's pattern of four re-tunings in a year: Sonnet 3.7 (March 2025, too eager), Opus 4 (May 2025, dialed back), Opus 4.6 (February 2026, over-proactive), and now 4.7, reined in toward literalness. Existing 4.6 prompts fail at first because 4.7 drops the implicit prompt engineering its predecessor did on the user's behalf; it requires explicit permission and specificity to unlock its potential.
Mixed Team Verdicts Highlight Workflow Fit
Team tests on the LFG coding benchmark showed 4.7 clearing the hardest tasks with detailed briefs but stalling or guessing wrong without them. Writing outputs thrilled with fluff-free prose 'better than my own,' though the model struggles to imitate personal styles or stay on-brand. Operations tasks lost 4.6's unprompted noticing (e.g., flagging P&L errors), delivering clean but incomplete summaries. Outside writing, it shines in data analysis and automations, but Sonnet's speed and 4.7's regimented style keep Sonnet the pick for daily writing. Leaders note a 'big model smell': harder to work with at first and less emotionally intelligent, but deeper when pushed, making it ideal for compound engineering rather than showy day-one wow.
Choose Based on Prompting Style and Task
Reach for 4.7 in structured lanes that need verification, sustained coherence, or high-precision coding and UI iteration; it rewards sharp operators with elegant, detailed results. Stick with softer models like 4.6 for unprompted noticing or loose briefs where instincts matter. Update prompts this weekend: add explicit acceptance criteria, constraints, budget, and cadence for tighter rails that make outputs cleaner and more reliable, as sketched below. The interaction between model and rails drives outcomes: loose prompts flatten performance, while tight ones elevate it beyond its predecessors.
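One way to add those tighter rails is a reusable brief template. The section names below are assumptions about how acceptance criteria, constraints, budget, and cadence might be spelled out in practice, not a prescribed format; adapt the fields to your own workflow.

```python
# Illustrative brief builder for the "tighter rails" described above.
# Field names and the example task are assumptions, not a confirmed format.
def bullets(items: list[str]) -> str:
    """Render a list of items as prompt-friendly bullet points."""
    return "\n".join(f"- {item}" for item in items)

def build_brief(task: str, acceptance: list[str], constraints: list[str],
                budget: str, cadence: str) -> str:
    """Assemble an explicit prompt so the model doesn't have to guess intent."""
    return (
        f"Task: {task}\n\n"
        f"Acceptance criteria (all must hold before you report done):\n{bullets(acceptance)}\n\n"
        f"Constraints:\n{bullets(constraints)}\n\n"
        f"Budget: {budget}\n"
        f"Cadence: {cadence}\n"
    )

prompt = build_brief(
    task="Build a twice-daily apartment listings dashboard.",
    acceptance=["Pulls new listings on a 12-hour schedule",
                "Flags price drops since the last run"],
    constraints=["Python only", "No paid APIs"],
    budget="Up to 2 hours of autonomous work before checking in",
    cadence="Post a short status summary after each run",
)
print(prompt)
```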