Opus 4.7 Beats 4.6 in Coding but Needs Prompt Retuning

Claude Opus 4.7 surpasses Opus 4.6 at agentic coding, multimodal tasks, and file-based memory, but it interprets instructions literally, consumes up to 1.35x more tokens, and defaults to extra-high effort, which exhausts rate limits faster.

Superior Coding and Agentic Performance

Opus 4.7 outperforms Opus 4.6 across key benchmarks, particularly in coding, where it leads in agentic tasks like tool use, search, and computer interaction, closing the gap with stronger models like the Methuselah preview. It achieves better long-term coherence on Vending Bench 2 and handles 1M-token contexts with improved reasoning. For coding agents like Claude Code, prioritize its enhanced file-system memory over semantic-similarity methods; this setup boosts reliability in production workflows. Multimodal understanding improves substantially for high-resolution images and documents, making it well suited to agentic flows with visual data, though higher resolution trades accuracy gains against higher token costs. Test on external benchmarks like SWE-bench Multimodal, noting that internal scores may not compare directly due to custom implementations.
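The file-based memory pattern above can be sketched as a minimal store where the agent persists notes as plain files and retrieves them by name rather than by embedding similarity. The class and method names here are hypothetical illustrations, not an Anthropic API:

```python
import os
import tempfile

class FileMemory:
    """Minimal file-based memory for a coding agent (hypothetical sketch).

    The agent writes notes as plain files and reads them back by topic,
    instead of retrieving chunks via vector similarity.
    """

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def write(self, topic: str, text: str) -> None:
        # One file per topic; the agent overwrites as its understanding evolves.
        with open(os.path.join(self.root, f"{topic}.md"), "w") as f:
            f.write(text)

    def read(self, topic: str) -> str:
        path = os.path.join(self.root, f"{topic}.md")
        if not os.path.exists(path):
            return ""
        with open(path) as f:
            return f.read()

    def topics(self) -> list[str]:
        # Listing file names doubles as a cheap index the model can scan.
        return sorted(f.removesuffix(".md") for f in os.listdir(self.root))

memory = FileMemory(tempfile.mkdtemp())
memory.write("build-system", "Project uses CMake; run tests with ctest.")
print(memory.topics())  # → ['build-system']
```

The appeal of this design is transparency: the memory is inspectable and editable by both the model and the developer, with no embedding index to maintain.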

Literal Instruction Following Demands Prompt Rewrites

Unlike Opus 4.6's looser interpretation, Opus 4.7 follows instructions precisely, which can yield unexpected outputs from unchanged prompts. Retune prompts and harnesses immediately after migration to avoid skipped steps or misalignments; every model iteration requires this audit to maintain consistency. Lean on file-based memory explicitly for coding agents, as Anthropic is optimizing toward it over vector-search alternatives.
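The migration audit described above can be automated: record known-good outputs for a fixed prompt set before switching models, then rerun and flag drift. This is a sketch; `run_model` is a stand-in for a real model call, and the example prompts are invented:

```python
def audit_prompts(run_model, golden: dict[str, str]) -> list[str]:
    """Return prompts whose output drifted from the recorded golden output.

    Flagged prompts are candidates for retuning under the new model's
    more literal instruction following.
    """
    drifted = []
    for prompt, expected in golden.items():
        actual = run_model(prompt)
        if actual.strip() != expected.strip():
            drifted.append(prompt)
    return drifted

# Stubbed model that now interprets "list three risks" literally and
# omits the introductory sentence the old model used to add.
def new_model(prompt: str) -> str:
    return {"list three risks": "1. a\n2. b\n3. c"}.get(prompt, "")

golden = {"list three risks": "Here are three risks:\n1. a\n2. b\n3. c"}
print(audit_prompts(new_model, golden))  # → ['list three risks']
```

Running this audit on every model upgrade turns the "retune after migration" advice into a repeatable regression check rather than a manual review.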

Token-Heavy Trade-offs and New Controls

The updated tokenizer maps the same input to 1x-1.35x as many tokens depending on content, while higher-effort thinking (extra-high by default in Claude Code) improves reliability on hard problems but consumes output tokens and rate limits faster; with API pricing unchanged from 4.6, effective costs rise. Mitigate this with task budgets (public beta) to cap spend over long sessions. The new API supports high-resolution images, and Claude Code adds an /ultra-review slash command that flags bugs and design issues in code changes. Set effort explicitly (extra-high, versus the medium default in 4.6) for best results, but throttle it for high-volume work. The rushed 7:30 a.m. release, without demos, signals a competitive response, likely to upcoming rivals.

Summarized by x-ai/grok-4.1-fast via openrouter

5393 input / 1344 output tokens in 8610ms

© 2026 Edge