Directed Optimizations Deliver Persistence and Coding Wins, But Not Uniformly
Claude Opus 4.7 addresses the core frustration of its predecessor, 4.6: premature quitting on complex, multi-step tasks like debugging or refactors. Users consistently routed such work to alternatives like Codex because Claude would declare victory too early, losing the thread. Anthropic prioritized this, resulting in a model that stays on task, self-verifies, runs tests, catches inconsistencies in planning, and follows through reliably.
Real-world reports confirm it: Ocean's AI saw 14% better multi-step workflow performance with fewer tokens and one-third the tool errors; Factory Droids noted a 10-15% lift in task success; Genpark reduced infinite loops from one in 18 queries to near zero. Benchmarks back it up: SWE-Bench Verified rose from 80% to 87%; Cursor Bench from 58% to 70%; MCP Atlas (multi-tool orchestration) edged up from 75% to 77%, enough to enable products like Claude Design. Rococo 10x resolved 3x more production tasks.
However, gains are targeted. The model strengthened in coding, agentic persistence, vision, and enterprise knowledge—but regressed elsewhere. BrowseComp (multi-page web synthesis) dropped from 83 to 79, trailing GPT-4o (89) and Gemini 1.5 Pro (85). Terminal Bench 2.0 scores 69 vs. GPT-4o's 75, hurting terminal-heavy agents. This isn't a broad upgrade; it's a strategic focus amid competition (OpenAI's Codex update, upcoming o1/"Spud", Anthropic's $800B valuation and IPO talks).
"The fix is real. The model does stay on task better than 4.6. It follows through. It self-verifies." — Nate Jones, highlighting the quitting fix's impact after 4 days of heavy testing.
New Tokenizer Drives 35% Token Inflation, Hitting Economics Hard
Same prompts now consume up to 35% more tokens due to a new tokenizer: per-token prices haven't changed, but your unchanged markdown and instructions now map to higher counts. That reframes the benchmarks, because the gains cost more on the invoice, especially for serious work. Casual chats stay cheap; enterprise tasks balloon.
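To see how that plays out on an invoice, here's a minimal cost sketch. The per-token prices and monthly volumes are placeholders, not Anthropic's published pricing; only the "up to 35%" inflation factor comes from the reports above, and it's applied to input tokens only.

```python
# Illustrative cost model: same workload, same per-token prices,
# but the new tokenizer maps identical prompts to more input tokens.
# All prices and volumes below are assumptions for illustration.

INPUT_PRICE_PER_MTOK = 15.00    # assumed $/million input tokens
OUTPUT_PRICE_PER_MTOK = 75.00   # assumed $/million output tokens
TOKEN_INFLATION = 1.35          # "up to 35% more tokens" on unchanged prompts

def monthly_cost(input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month of traffic at the assumed per-token prices."""
    return input_mtok * INPUT_PRICE_PER_MTOK + output_mtok * OUTPUT_PRICE_PER_MTOK

before = monthly_cost(input_mtok=200, output_mtok=40)                    # old tokenizer
after = monthly_cost(input_mtok=200 * TOKEN_INFLATION, output_mtok=40)   # same prompts, inflated input counts
print(f"before ${before:,.0f} -> after ${after:,.0f} ({after / before - 1:.0%} higher invoice)")
# before $6,000 -> after $7,050 (18% higher invoice)
```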
Enterprise is where it shines: on GDP VAL (an ELO rating for economically valuable work) it hits 1753 vs. GPT-4o's 1674 and Gemini's 1314. Hex's finance benchmark rose from 76% to 81%, and the model now correctly flags missing data instead of hallucinating values. Harvey's Big Law Bench: 90.19% at high effort. Databricks Office QA Pro: 21% fewer errors. For legal, finance, and document work, it's top-tier.
Yet Claude Design exposed the cost side: the initial design system ran about $5, but iterations on logo fixes and animations pushed the total to $42 in one afternoon, exhausting the usage allocation. Each billable review pass amplifies unreliability: when simple brand-preservation fixes are still failing on the third pass, helpful loops turn expensive.
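That amplification is easy to model: if each billable fix pass succeeds with some probability, the expected number of passes is geometric, so flaky passes get expensive fast. The per-pass cost and success rates below are illustrative assumptions, not measured Claude Design numbers.

```python
# Illustrative: why unreliable review passes get expensive fast.
# If each billable fix pass succeeds with probability p, the expected
# number of passes is 1/p (geometric distribution), so expected spend
# per landed fix is cost_per_pass / p. Numbers below are assumptions.

def expected_fix_cost(cost_per_pass: float, p_success: float) -> float:
    """Expected spend to land one fix when each pass succeeds with probability p_success."""
    return cost_per_pass / p_success

for p in (0.9, 0.5, 0.2):  # reliable vs. flaky brand-preservation fixes
    print(f"p={p:.0%}: expected cost per fix = ${expected_fix_cost(6.0, p):.2f}")
# p=90%: expected cost per fix = $6.67
# p=50%: expected cost per fix = $12.00
# p=20%: expected cost per fix = $30.00
```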
"You're paying more for those gains." — Jones on how tokenizer changes make benchmark wins pricier in practice.
Adversarial Migration Test Exposes Trust Failures Over Benchmarks
Jones built a 465-file migration gauntlet: CSVs, Excel sheets, PDFs, JSONs, images, and VCFs seeded with traps (a Mickey Mouse customer, a "test customer" record, a nonsense $25M order). Each model got a single shot at the whole pipeline: inventory, schema design, extraction, entity resolution, conflict detection, a migration report, and a review UI, with no iteration guidance.
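For anyone building a similar gauntlet, here's a minimal sketch of how trap records might be seeded into one of the source files. The paths, columns, and values are hypothetical stand-ins, not Jones's actual dataset; the point is that every trap you plant is something you can mechanically check for later.

```python
import csv
from pathlib import Path

# Hypothetical sketch: seed a few trap records among otherwise plausible ones,
# so you can later check whether the agent flagged them or quietly canonized them.
TRAPS = [
    {"name": "Mickey Mouse", "email": "mickey@example.com", "order_total": "120.00"},
    {"name": "test customer", "email": "test@test.test", "order_total": "0.00"},
    {"name": "Acme Industrial", "email": "ap@acme.example", "order_total": "25000000.00"},  # nonsense $25M order
]

out = Path("gauntlet_source/customers_batch_07.csv")  # hypothetical file in the gauntlet
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email", "order_total"])
    writer.writeheader()
    writer.writerows(TRAPS)  # in practice, interleave these with legitimate rows
```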
Opus 4.7 finished in 33 minutes vs. GPT-4o's 53. Opus built a shippable V1 review UI (muted grays, clean typography, conflict-resolution buttons, source chips); GPT's UI exposed bad data without safeguards. GPT was the more thorough of the two: it processed all 465 files (Opus missed 2 and duplicated 1) and produced a 200-line merge log with citations and confidence scores. Opus properly segregated duplicates, but it also hallucinated processing a TSV, claiming an audit trail for a file it never touched, a "breaking trust" pattern that makes peer review mandatory.
Neither caught the traps: fake customers were canonized and the $25M order was silently normalized. Self-reviews were biased: Opus scored itself 3.5/5 and graded GPT 3.6; GPT scored itself 3.1 and graded Opus 2.7. The harshest grader, GPT with SQL access, was the one that surfaced the real issues. Averaged across graders: Opus 3.1, GPT 3.35, effectively neck-and-neck, though 4.7 closes the gap 4.6 left. Neither excels at dirty data without specialized harnesses.
"If you're trusting an agent's report about what it processed and the agent is willing to say 'I handled that file' when it did not, that's... breaking trust in the whole agentic flow." — Jones on Opus hallucinating file processing, a danger in agentic systems.
Claude Design: Agent-Ready Outputs with Literalism Pitfalls
Launched just after 4.7 under Anthropic Lab, Claude Design ingests codebases, GitHub repos, Figma files, notes, and brand assets to build full design systems: logos, typography, palettes, spacing, components, and UI kits delivered as a file tree with JSX and a README. The key piece is the exported skills.markdown (Claude's standard for agentic brand adherence), which turns the design system into infrastructure agents can actually follow.
The strong flows: an organized setup, a clean click-to-comment review UI, and practical exports (ZIP, PDF, PPT, HTML, Canva, Claude Code; no Figma export amid the board-resignation drama). Animations come as React motion graphics for demos and B-roll (screen-record them for video).
The real test failed on fidelity: the tool reinterpreted the logo (a black square plus wordmark instead of the source mark) and propagated that error downstream. Multiple literally worded fix prompts failed until the fifth or sixth pass because the model was overconfident that earlier fixes had landed; verifier timeouts left work unchecked. Still, $42 yielded a full system plus animations, remarkable first-generation value, but revisions expose a combative literalism that punishes vagueness.
"The moment it starts redesigning your logo without your permission... every downstream artifact becomes suspect." — Jones on Claude Design's brand preservation failure corrupting outputs.
Competition Shifts to Harnesses, Literalism as Double-Edged Sword
4.7's "combative literalism" demands precise prompts—vagueness (often a feature for flexibility) gets punished. Model makers now compete on harnesses (e.g., Claude Design's skills files) over raw models. Bridge release under pressure; leaders must benchmark workflows, as gains/regressions vary.
"When vagueness is a feature, not a bug." — Jones contrasting helpful ambiguity with 4.7's strict interpretation.
Key Takeaways
- Benchmark your workflows before migrating: 4.7 excels in persistence/coding/enterprise but regresses on web/terminal.
- Expect up to 35% higher token counts from the new tokenizer; profile costs for serious vs. casual use.
- Peer review agent outputs: Hallucinated processing/audits break trust; neither model catches obvious data traps.
- Write literal, precise prompts for 4.7's combative style; rivals still tolerate the vagueness that can aid flexibility.
- Claude Design builds agent-ready systems cheaply at first pass, but billable iterations amplify fix costs.
- Test adversarially like 465-file migrations to reveal benchmark-blind trust gaps.
- Harnesses (skills files, UIs) define winners now—models alone insufficient.
- For finance/legal/docs, 4.7 leads; route web/CLI agents elsewhere.
- Self-review biases: Overconfidence in Opus, conservatism in GPT—cross-model grading helps.