Live Tests Reveal Opus 4.7's Self-Verification Edge

Claude Opus 4.7 improves on long-running tasks and output verification, but live results are mixed across agent creation, writing, and coding: it's slower than 4.6 and needs prompt tweaks.

Opus 4.7's Core Claims and Benchmark Context

Anthropic launched Claude Opus 4.7 without an early access program, catching testers like Every's team off guard. Anthropic claims it's the most capable Opus yet, excelling at long-running tasks with more rigor, precise instruction-following, and a novel self-verification step in which the model checks its own output before finalizing. This mirrors a best practice among prompters, asking the model to reflect on the quality of its work, now baked into the model itself.
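
For readers who want to replicate that practice by hand, here's a minimal sketch using the Anthropic Python SDK: draft first, then ask the model to verify its own output before finalizing. The model ID and prompts are placeholders, not Anthropic's built-in implementation of the feature.

```python
# Minimal sketch of manual "verify your own output" prompting with the
# Anthropic Python SDK. Model ID and prompts are illustrative placeholders;
# Opus 4.7 reportedly performs a similar check automatically.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-7"       # placeholder model ID

def draft_and_verify(task: str) -> str:
    # Pass 1: produce the draft.
    draft = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": task}],
    ).content[0].text

    # Pass 2: ask the model to check its own work before finalizing.
    review = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": (
                "Before finalizing, verify your answer: check it against the "
                "original instructions, fix any errors, and return only the "
                "corrected final version."
            )},
        ],
    ).content[0].text
    return review

if __name__ == "__main__":
    print(draft_and_verify("Summarize this month's P&L for an investor update."))
```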

Benchmarks show gains (10% on SWE-bench Pro, 7% on SWE-bench Verified), but hosts Dan Shipper and Brandon Gell dismiss over-reliance on them. 'I really don't like benchmarks... I really like getting your hands in the model,' Shipper says, emphasizing real-world use over scores. It's not matching o1's leaps, and encryption remains safe for now, but the release signals rapid internal progress amid a rush of model launches following the Claude Mythos rumors.

Tradeoffs surface immediately: Opus 4.7 is noticeably slower than 4.6 in initial tests, a common frontier model pain point during high-demand launches. Availability rolls out unevenly—first in co-work (Claude's workspace), then Cursor and Claude Code, but not instantly everywhere.

"Verifies its own output before reporting back. That's actually new... Seems like the models are now doing that automatically. That's amazing." Dan Shipper, highlighting the self-reflection feature as a prompter-inspired breakthrough that could reduce supervision needs.

Writing and Analysis: Investor Updates and Idea Generation

In Proof, Every's agent-native document editor, Shipper tests Opus 4.7 on a real P&L spreadsheet to generate a March 2026 investor update, a blend of financial analysis and polished writing. The model asks clarifying questions (e.g., 'Celebrate it as it was a good month?'), shows its analysis steps, and modifies the document, demonstrating improved chaining of reasoning steps. Compared to prior runs, it handles the complexity well but requires nudges for full execution.

A workflow experiment connects co-work to Proof for a 'codec scratchpad': an agent reads the live document and generates ideas every few seconds. Opus 4.7 struggles with presence awareness (e.g., not detecting the doc's content initially) but schedules tasks and offers solid next-step suggestions like 'And schedule' or 'Interesting. These are good.' It isn't a 'holy shit' moment, but it's intriguing for iterative writing. Katie Parrott, a team writer, eyes it for reducing 'AI smell' in copy and powering content module agents.
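
The co-work-to-Proof wiring is proprietary, but the shape of the workflow is simple to sketch: poll a document and ask the model for ideas whenever it changes. Below is a minimal local stand-in; the scratchpad.md path, poll interval, and model ID are all assumptions for illustration.

```python
# Generic stand-in for the "scratchpad" workflow described above: poll a local
# Markdown file and ask the model for next-step ideas whenever it changes.
# The co-work/Proof integration is proprietary; the file path, poll interval,
# and model ID here are assumptions.
import time
import hashlib
from pathlib import Path

import anthropic

client = anthropic.Anthropic()
DOC = Path("scratchpad.md")  # hypothetical local stand-in for the live doc
POLL_SECONDS = 5

def idea_pass(text: str) -> str:
    """Ask the model for a few next-step ideas on the current draft."""
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder model ID
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": "Here is my working draft:\n\n" + text +
                       "\n\nSuggest three concrete next steps or ideas.",
        }],
    )
    return response.content[0].text

def watch() -> None:
    last_hash = None
    while True:
        if DOC.exists():
            text = DOC.read_text()
            digest = hashlib.sha256(text.encode()).hexdigest()
            if digest != last_hash:  # only call the model when the doc changed
                last_hash = digest
                print(idea_pass(text))
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    watch()
```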

Results: Promising for supervised financial tasks but finicky on real-time doc integration. Tradeoff—stronger reasoning vs. occasional confusion without refined prompts.

"I'm excited to test... how much like AI smell there is on the copy, which I think is always the first thing that we're looking at with any new model." Katie Parrot, prioritizing natural writing output and agentic workflow integration in her daily processes.

Agent Creation: OpenClaw Builds from Personal Context

Brandon Gell replicates a prompt in co-work: build an 'open claw' (an OpenClaw-style custom agent setup) from his Claude memories, workflows, API connections, and habits, with minimal guidance, expecting autonomous research and file generation. Running 4.6 and 4.7 side by side reveals how they diverge.

Opus 4.6 creates structured files (user.md and a soul.md, naming the agent 'Koa, my COA'), memory seeds for onboarding, cron jobs for recurring tasks, and new skills inferred from his usage (e.g., avoiding duplicates). The output is organized per OpenClaw conventions.

Opus 4.7 shifts: it bundles user and soul into a single agent.md (functional but non-standard), skips memory seeds in favor of raw Markdown knowledge files (ignoring OpenClaw's daily organization), duplicates existing skills (e.g., 'bootstrap CFO', plus an irrelevant 'AI check' skill for detecting generated text), and lists services and connections but skips the API keys. No hidden-folder magic.
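
For anyone comparing the two outputs themselves, a small script can check which layout a generated folder follows. The file names (user.md, soul.md, agent.md) come from the stream; the memory-seed folder name and the exact OpenClaw convention are assumptions.

```python
# Quick check of whether a generated agent folder follows the layout the
# stream describes for 4.6 (separate user.md and soul.md plus memory seeds)
# or the 4.7 variant (a single bundled agent.md). File and folder names are
# taken from the stream; the real OpenClaw convention may differ.
from pathlib import Path

def classify_layout(agent_dir: Path) -> str:
    has_split = (agent_dir / "user.md").exists() and (agent_dir / "soul.md").exists()
    has_bundle = (agent_dir / "agent.md").exists()
    has_memory = (agent_dir / "memory").is_dir()  # assumed location of memory seeds

    if has_split and has_memory:
        return "4.6-style: split user/soul files with memory seeds"
    if has_bundle:
        return "4.7-style: single bundled agent.md"
    return "unrecognized layout"

if __name__ == "__main__":
    print(classify_layout(Path("./my-agent")))  # hypothetical output folder
```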

Gell's verdict: Stick with 4.6 for this; 4.7 does the work but lacks precision. New models demand prompt evolution—old styles falter. 'Just because it's different does not necessarily mean it's bad,' Shipper notes.

This tests the promise of 'hand off your hardest work with less supervision': a partial win on autonomy, but supervision is still needed to get the structure right.

"I really did not give it any instructions. I was like, just make an open claw... right now I'd probably stick with 4.6 for making my open claw like this." Brandon Gell, after comparing outputs, favoring 4.6's better organization despite 4.7's creativity.

Coding Benchmarks: Vibe Slop and Production Polish

Shipper's 'Vibe Slop Benchmark' uses a snapshot of Proof's production repo: 'vibe coded' slop that crashed post-launch and was later rewritten by humans. The task: rewrite it from first principles like a senior engineer would, planning first and then executing.

Opus 4.7 generates a plan (still running as the stream ends), staying focused amid the mess. Prior tests with GPT-4o/5.4 produced great plans but poor execution: ambitious rewrites devolved into superficial masks over the slop, with the model distracted by peripheral code.
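
A rough approximation of the plan phase is easy to sketch: gather the repo's source into one prompt and ask for a first-principles rewrite plan (execution would then be handed to a coding agent such as Claude Code). This is not Every's harness; the repo path, file filter, and model ID below are assumptions.

```python
# Sketch of the plan phase of a "Vibe Slop"-style benchmark: dump a small
# repo's source into one prompt and ask for a first-principles rewrite plan.
# An approximation of the test described on stream, not Every's harness;
# repo path, file suffixes, and model ID are assumptions.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()
REPO = Path("./sloppy-repo")  # hypothetical snapshot of the vibe-coded repo

def gather_source(repo: Path, suffixes=(".py", ".ts", ".tsx")) -> str:
    # Concatenate source files with headers so the model sees the whole repo.
    parts = []
    for path in sorted(repo.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            parts.append(f"--- {path.relative_to(repo)} ---\n{path.read_text()}")
    return "\n\n".join(parts)

def rewrite_plan(source: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder model ID
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "This is a vibe-coded, sloppy codebase. As a senior engineer "
                "would, make a plan to rewrite it from first principles. "
                "Plan only; do not write code yet.\n\n" + source
            ),
        }],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(rewrite_plan(gather_source(REPO)))
```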

The benchmark isn't completed here (the stream is truncated), but early signs suggest better rigor on long tasks. Claude Code access enables the test, and the community confirms the rollout. The hosts plan deeper Cursor tests.

Tradeoffs: Potential for senior-level fixes, but slowness and distraction risks persist. Ideal for sloppy-to-solid transitions in small teams.

Availability, Community Vibe Checks, and Next Steps

With no early access, feedback comes raw and live from 3,000+ viewers, who are invited to rate the release on a Red/Yellow/Green/Gold scale: Gold for a paradigm shift, Red for trash. Rollout: co-work first, then Claude Code/Cursor. Every promises a synthesized TL;DR via its newsletter.

Broader context: Every's ecosystem of apps (Proof, Monologue, Kora, etc.) is built for working at the edge of AI; the subscription bundles vibe checks, apps, and trainings. The hosts invite Anthropic to join them onstage.

Initial vibes: improved self-checks shine, but slower responses and prompt sensitivity temper the hype. Benchmarks are up, hands-on results are mixed; refine prompts for your own stack.

"I have a vibecoded slop codebase. Can you make and execute a plan to rewrite it from first principles? ... What this benchmark tests is if we give a new frontier model a sloppy codebase, can it figure out what a senior engineer would figure out." Dan Shipper, introducing his custom test for agentic code refactoring under production pressure.

Key Takeaways

  • Prioritize hands-on vibe checks over benchmarks; test your daily workflows immediately on rollout.
  • Leverage self-verification for long tasks—prompt reflection if not automatic, reducing supervision.
  • Expect speed hits on launch day; compare to prior versions (e.g., 4.6) before switching.
  • Refine prompts for new models—minimalist ones expose organization gaps in agent builds.
  • Use real production slop for coding evals; check if agents rewrite deeply or just patch.
  • Integrate with tools like Proof/co-work for live doc agents, but debug presence issues.
  • Rollouts uneven—co-work/Cursor first, Claude Code follows; share vibes (red/yellow/green/gold).
  • For writing/analysis, it handles P&L-to-updates well with Q&A nudges; watch for 'AI smell'.
  • Build custom benchmarks like Vibe Slop to mimic senior eng fixes on messy repos.
