Benchmark Reliability Has Fractured—Prioritize Pro Over Verified

SWE-bench Verified, once the gold standard for end-to-end GitHub issue resolution, is now in dispute: an estimated 59.4% of its test cases are flawed, and training-data contamination affects top models like GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash. OpenAI stopped reporting it in February 2026 and shifted to SWE-bench Pro (1,865 tasks across public, held-out, and proprietary sets), where scores vary wildly by scaffold: Scale AI's original runs landed below 25%, while optimized harnesses now exceed 60% (Claude Opus 4.7 at 64.3%, GPT-5.5 at 58.6%). Terminal-Bench 2.0 better captures terminal/DevOps workflows, with GPT-5.5 at 82.7% vs. Claude's 69.4%, but harness differences alone create 7-point gaps for the same model. Across 731 tasks, agent scaffolds widen model gaps by up to 17 points, so published scores reflect the full stack, not the model alone.

"Improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities." — OpenAI Frontier Evals team, explaining their benchmark abandonment; this forces reliance on Pro and Terminal-Bench for production signals.

Claude Code and Codex Dominate: Quality vs. Execution

Anthropic's Claude Code (#1) leverages Opus 4.7 (April 2026 release) for 87.6% on SWE-bench Verified (up roughly 7 points from 80.8% on 4.6), 64.3% on Pro (an 11-point gain), self-verification (it auto-writes and runs tests), and multi-agent coordination for parallel review and documentation. The 1M-token context handles large repos but still needs indexing for monorepos. Pricing splits CLI/IDE subscriptions ($20–$200/mo) from the API ($5 input / $25 output per million tokens). Rakuten saw 3x task resolution; CodeRabbit cites 10%+ PR recall. Tradeoff: it trails GPT-5.5 on terminal tasks.

OpenAI Codex (#2, GPT-5.5, April 2026) flips the picture: it leads Terminal-Bench at 82.7% for shell/DevOps work, scores 58.6% on Pro Public, and reaches ~88.7% Verified (third-party). The CLI runs locally (GitHub: openai/codex); the web and IDE agents run in cloud sandboxes. 85% of OpenAI staff use it weekly. API pricing is $5 input / $30 output per million tokens (2x the prior rate). Best for fire-and-forget automation, but it lags Claude on multi-file quality.
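
To make the per-million-token notation concrete, here is a minimal back-of-the-envelope cost sketch in Python, assuming the $5/$25 (Claude Opus 4.7) and $5/$30 (GPT-5.5) input/output rates cited above; the per-task token counts are hypothetical, not measured.

    # Rough API cost comparison per agent task.
    # Rates are the per-million-token figures cited in this article;
    # the token counts below are hypothetical assumptions for illustration.

    def task_cost(input_tokens: int, output_tokens: int,
                  in_rate: float, out_rate: float) -> float:
        """USD cost of one task at per-million-token input/output rates."""
        return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

    # Hypothetical multi-file bug fix: 400k tokens read, 60k tokens generated.
    claude = task_cost(400_000, 60_000, in_rate=5.0, out_rate=25.0)  # Claude Opus 4.7 API
    gpt55  = task_cost(400_000, 60_000, in_rate=5.0, out_rate=30.0)  # GPT-5.5 API

    print(f"Claude Opus 4.7: ${claude:.2f} per task")  # -> $3.50
    print(f"GPT-5.5:         ${gpt55:.2f} per task")   # -> $3.80

At these assumed volumes the gap is a few cents per task, so for most teams the subscription-versus-API decision likely matters more than the rate difference itself.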

"Complex command-line workflows requiring planning, iteration, and tool coordination." — OpenAI on why Terminal-Bench fits Codex; highlights execution over pure coding.

IDE-Native and Free Options Scale Adoption

Cursor (#3, $2B ARR as of Feb 2026) hits ~51.7% Verified with its default model but scales with stronger backends (Opus 4.7 lifts CursorBench to 70%). It's a VS Code fork with Plan/Act mode, per-task model selection, and Pro+ background agents ($60/mo). It's 30% faster than Copilot, but editor lock-in limits appeal for JetBrains/Neovim users.

Google's Gemini CLI (#4, open source, distributed via npm) delivers 80.6% Verified / 68.5% Terminal-Bench via Gemini 3.1 Pro (Feb 2026, available on the free tier and Google One Premium). Strong reasoning (94.3% GPQA) and GCP integration. It removes the paywall for solo developers and open-source projects.

GitHub Copilot (#5, 4.7M subscribers) sits at ~56% Verified in Agent Mode, went multi-model (Claude/GPT backends) in Feb 2026, and shifts to AI Credits billing in June 2026 ($10–$39/mo plus credits). It remains the enterprise king for compliance and IDE breadth (VS Code to Xcode).

Devin 2.0 (#6) excels at scoped tasks (upgrades, migrations) in its sandboxed cloud IDE with planning and Wiki features, but falters on ambiguity (community tests found more failures than successes on varied tasks). Tiers run from free to $200/mo.

Claude Mythos Preview (93.9% Verified) leads outright but is restricted to partners via Project Glasswing; there is no broad access due to cybersecurity concerns.

"The model writes tests, runs them, and fixes failures before surfacing results." — Anthropic on Opus 4.7 self-verification; core to its quality edge on complex tasks.

Tradeoffs Dictate Choice: Workflow, Cost, Access

There is no universal winner: Claude for engineering depth (large repos, refactoring), Codex for terminal speed, Cursor for IDE flow, Gemini for free frontier access, Copilot for enterprise scale. Costs range from free (Gemini) to $200/mo (top tiers); adoption signals (Copilot's 4.7M subscribers, Cursor's $2B ARR) beat pure benchmarks. Because scaffolds and models interact, test candidates in your own repo. 85% of developers use AI coding tools by 2026, but production success hinges on matching the agent to the task archetype.

"Agent scaffolding matters as much as the underlying model." — Article insight from 17-point gaps on same model; why vendor scores aren't apples-to-apples.

Key Takeaways

  • Don't trust SWE-bench Verified rankings without Pro/Terminal-Bench context; contamination kills reliability.
  • Test Claude Code (Opus 4.7) first for multi-file fixes; expect 64.3% Pro, self-fixing tests.
  • Use Codex CLI (GPT-5.5) for DevOps/terminal: 82.7% Terminal-Bench, local run saves security headaches.
  • Cursor shines in its VS Code fork ($20–60/mo) with a 30% speed edge and per-task model choice, but only if you're willing to switch editors.
  • Free Gemini CLI (80.6% Verified) for solos; Copilot ($10/mo entry) for enterprise compliance.
  • Devin 2.0 for scoped chores only—ambiguity tanks it; always review plans.
  • Factor scaffolds: 2–17 point swings mean your mileage varies by harness/repo.
  • Pricing shift: Copilot moves to credits in June 2026; the API is cheaper for custom agents (Claude at $5 input / $25 output per million tokens).
  • Internal wins: 3x Rakuten tasks (Claude), 85% OpenAI staff (Codex), $2B Cursor ARR.