Claude Opus 4.7 Dominates Agentic Coding but Burns Tokens

Claude Opus 4.7 sets SWE-Bench records and builds SUV sims and Minecraft clones better than prior models, but it uses 2-3x more tokens per task, raising effective costs despite unchanged $5/$25 per 1M-token pricing.

Achieves State-of-the-Art in Coding Benchmarks and Agentic Tasks

Claude Opus 4.7 outperforms Claude Opus 4.6, GPT-4o, and Gemini 1.5 Pro on the toughest benchmarks, hitting state-of-the-art (SOTA) on SWE-Bench Pro and SWE-Bench Verified for software engineering tasks. Its web-development scores match Gemini 1.5 Pro for UI generation. It leads in real-world knowledge work such as finance and legal agents, and on GPQA (graduate-level questions). Memory is improved for long multi-session workflows, enabling rigorous long-running tasks with self-verification before output. Vision processes images at 3x higher resolution, yielding polished UI designs, slides, and documents. Reasoning efficiency jumps a full level: low effort now matches the prior medium, medium matches high, and high matches max, so use the highest setting for complex planning. It follows instructions more literally, so prompts tuned for Opus 4.6 may need retuning or may break.

To access it: use claude.ai chat, OpenRouter, or Kilo CLI (an open-source agent harness with $25 in free credits). Rate limits were raised after early users hit them with single prompts at max reasoning.
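For API access via OpenRouter, requests use the OpenAI-compatible chat-completions shape. A minimal sketch of building such a request; the model slug `anthropic/claude-opus-4.7` and the effort levels are assumptions based on this article, so check OpenRouter's model list and reasoning docs for the exact IDs:

```python
import json

# Hypothetical model slug -- confirm against OpenRouter's model list.
MODEL = "anthropic/claude-opus-4.7"

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build an OpenAI-compatible chat-completions payload for OpenRouter."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # Effort levels (low/medium/high) mirror the article's description;
        # consult OpenRouter's reasoning documentation for supported values.
        "reasoning": {"effort": effort},
    }

payload = build_request("Refactor this function to be iterative.")
# POST this as JSON to https://openrouter.ai/api/v1/chat/completions
# with an "Authorization: Bearer <OPENROUTER_API_KEY>" header.
print(json.dumps(payload, indent=2))
```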

Delivers Top Demos in Simulations and Frontend but SVG Lags

In Kilo CLI tests at max reasoning:

  • SUV 3D physics sim: Best yet—breaks prompt into physics/engine/rendering/camera systems for realistic mountain drive.
  • Minecraft clone: Most ambitious—procedural terrain, ores, mobs, water physics; buggy execution but creative.
  • macOS UI clone: Most accurate, with a functional menu bar, Finder, Launchpad, Spotlight, and apps like Safari/Notes/Calculator; the toolbar is janky but the icons are solid.

Claude.ai SVG tests:

  • Animated butterfly and ambient painting (flying birds, sun reflection): Strong, creative.
  • PS5 controller: Weak; the touchpad is ok, but the body shape is inaccurate compared with Qwen 2.5 (35B) or Gemini.

Frontend work shines: dynamic landing pages with consistent style (typography, colors), on par with Gemini 1.5 Pro and a big leap from 4.6. An FPS shooter demo has recoil, movement, and enemies but glitches on key input.

You can hand off hard engineering with less supervision; the model excels at long-horizon planning by decomposing tasks.

Trade-offs: Higher Quality Raises Costs and Limits

It uses more tokens per task (2-3x the usage of 4.6) for deeper reasoning, shrinking usable context and raising effective costs despite unchanged pricing ($5/1M input tokens, $25/1M output). Early reports flag weaker long-context retention. Max reasoning hit rate limits fast (since fixed via limit hikes). It is not perfect for all creative SVG work, and ambitious builds show execution bugs. Test prompts carefully: literal interpretation boosts precision but can disrupt old workflows.
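The cost impact of 2-3x token usage at flat per-token pricing is simple arithmetic. A minimal sketch; the per-task token counts below are illustrative assumptions, not measured figures:

```python
def effective_cost(input_tokens: int, output_tokens: int,
                   in_price: float = 5.0, out_price: float = 25.0) -> float:
    """Cost in USD at $in_price/$out_price per 1M input/output tokens."""
    return input_tokens * in_price / 1e6 + output_tokens * out_price / 1e6

# Illustrative (assumed) task sizes: a prior model vs. one burning ~2.5x tokens.
baseline = effective_cost(20_000, 5_000)
heavier = effective_cost(int(20_000 * 2.5), int(5_000 * 2.5))
print(f"baseline ${baseline:.3f} vs ${heavier:.3f} ({heavier / baseline:.1f}x)")
```

At unchanged pricing, effective cost scales linearly with token usage, so a 2.5x token burn is a 2.5x bill per task.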

Summarized by x-ai/grok-4.1-fast via openrouter

7423 input / 1997 output tokens in 15308ms

© 2026 Edge