Claude Regressions: Harnesses and Expectations, Not Just Models
Claude's coding performance feels worse largely because of a flawed harness (Claude Code), aggressive API refusals, heterogeneous hardware, and rising user expectations, not pure model degradation.
Benchmarks Confirm Consistent Degradation
Multiple independent benchmarks show Claude models, especially Opus 4.7 and Sonnet 4.6, declining since March. Margin Labs' weighted SWE-bench averages dropped from 57% to 55%, with weekly dips despite the temporary bumps from new model releases. BridgeMind's hallucination benchmark saw Opus fall from 87.6% to 73.3% between launch and April 12th, using direct API calls with no harness in the loop. Terminal bench ranks Claude Code at 58% for Opus, far behind Forge Code (75-82%) and Cursor. Matt Mau's 100-feature implementation test found Opus roughly 15% worse in Claude Code than in Cursor, while GPT-4o and Gemini performed more consistently across CLIs. AMD's AI director documented coding regressions from Opus 4.6 to 4.7, including lazier behavior and errors such as flipped booleans.
These benchmarks quantify real issues: dumber solutions (worse code), getting lost (misinterpreting intent), and refusals (API blocks or the model giving up). Claude Code sessions also degrade over time, with context rot amplifying errors.
"The weighted averages from March to now on Margin Lab... has seen a meaningful dip. It's not a huge dip but it's from 57% down to 55%. And it's consistently down like every week it gets lower."
Failure Layers Across the Inference Stack
Requests pass through four layers: harness, API, compute, and model, and each can introduce regressions. Harnesses wrap the user's prompt with system instructions, tool definitions, and context, but Claude Code's implementation pollutes this wrapper. It enforces a rigid "read before edit" rule under which a search result does not count as a read, so the first edit attempt fails and the model must issue an extra read before retrying, ballooning tool calls, tokens, and cost. Malware warnings fire like prompt injections on benign tasks such as the T3.gg redesign, wasting context on dismissing them.
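That read-before-edit guard is easy to picture as harness code. Below is a minimal sketch, with hypothetical class and method names rather than Claude Code's actual implementation, of why a search that doesn't count as a read costs an extra error-and-read round trip:

```python
# Sketch of a harness-side "read before edit" guard (hypothetical, not Claude Code's code).
# If a prior search does not count as a read, the first edit attempt fails and the model
# must issue an extra read tool call, adding a full round trip of tokens and latency.

class Harness:
    def __init__(self):
        self.read_files: set[str] = set()
        self.tool_calls: list[str] = []

    def search(self, path: str, pattern: str) -> str:
        self.tool_calls.append(f"search {path}")
        # Returns matching lines but does NOT mark the file as read.
        return f"matches for {pattern!r} in {path}"

    def read(self, path: str) -> str:
        self.tool_calls.append(f"read {path}")
        self.read_files.add(path)
        return f"<contents of {path}>"

    def edit(self, path: str, old: str, new: str) -> str:
        self.tool_calls.append(f"edit {path}")
        if path not in self.read_files:
            # Guard fires: the model gets an error and has to read, then retry the edit.
            return "error: file must be read before it can be edited"
        return "ok"

h = Harness()
h.search("src/app.py", "flag = True")          # model locates the line
print(h.edit("src/app.py", "True", "False"))   # rejected: search didn't count as a read
h.read("src/app.py")                           # redundant read just to satisfy the guard
print(h.edit("src/app.py", "True", "False"))   # succeeds; the guard cost an extra error + read
print(h.tool_calls)
```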
The API layer filters aggressively: a Goldbug puzzle was refused as "hacking" even though it is a math problem, and the block happened before the request ever reached a GPU, so the model is not at fault. Anthropic's multi-vendor compute (AWS Trainium, Google TPUs, Nvidia GPUs, Broadcom) routes requests unpredictably; a multi-step Claude Code flow can hit different hardware on each tool call (read, then edit), introducing inconsistency within a single task.
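To tell the API-filter layer apart from the model itself, it helps to log where each refusal actually happens. A minimal sketch using the Anthropic Python SDK; the model name and the refusal-phrase heuristic are assumptions for illustration, not Anthropic's documented behavior.

```python
# Sketch: log whether a request was blocked before the model, refused in text, or answered.
# MODEL and the refusal-phrase heuristic are assumptions, not Anthropic's actual behavior.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4"         # placeholder; substitute the model you are testing

REFUSAL_HINTS = ("i can't help", "i cannot assist", "against my guidelines")

def classify(prompt: str) -> str:
    try:
        resp = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
    except anthropic.APIStatusError as e:
        # The request never produced a completion: an API-layer block, not the model.
        return f"api_block ({e.status_code})"
    text = "".join(block.text for block in resp.content if block.type == "text")
    if any(hint in text.lower() for hint in REFUSAL_HINTS):
        return "model_refusal"
    return "answered"

print(classify("Solve this substitution-cipher puzzle: ..."))
```

Over a few hundred logged requests, the ratio of api_block to model_refusal tells you which layer deserves the complaint.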
Model updates such as 4.6 to 4.7 do show some decline, but the benchmarks suggest the model-level drop is smaller than what the other layers introduce. The historical pattern holds: models launch strong and regress over time.
"Anthropic is too focused on making Claude code have all these features... The result is that the models feel dumber. We are now at a point where anthropics incompetence in engineering is making us think their models are getting dumber."
Rising Expectations and User-Side Pollution
Users push the boundaries as capabilities grow: in November, editing files across multiple directories was impressive; now it's the baseline. Tasks that once felt novel are judged against a higher bar, so they feel like regressions, the way your own junior code looks poor once you've grown into a senior. Meanwhile, custom skills, MCPs, and plugins bloat the system prompt and steer outputs away from the paths the model was trained on.
Claude Code exemplifies this: frequent, sloppy releases keep expanding the surface area for stupid behavior. A five-word tweak to the system prompt can tank performance by a factor of 20. The benchmarks bear this out: the same Opus model excels in Cursor but underperforms in its own native harness.
"If you gave me source code access to cloud code, I could make it the dumbest harness ever with just a couple words being changed in the system prompt."
Context pollution works like a cluttered desktop: malware pop-ups and irrelevant file reads derail the model's focus and shrink its effective capacity. Perhaps half of the perceived regression is harness-sourced; switching to a cleaner scaffold like Cursor boosts scores dramatically.
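One way to make "context pollution" concrete is to account for where the window actually goes. The sketch below tallies tokens by category for a hypothetical session transcript; the categories, the example entries, and the crude whitespace tokenizer are all assumptions for illustration, not how Anthropic counts tokens.

```python
# Sketch: rough accounting of where a session's context window goes.
# Uses a crude whitespace token count and made-up transcript entries for illustration.
from collections import Counter

transcript = [
    ("system_prompt",  "You are a coding assistant ... (tool schemas, skills, plugins)"),
    ("task",           "Rename the feature flag and update all call sites."),
    ("warning",        "Potential malware detected in this repository. Proceed with caution."),
    ("dismissal",      "This is a false positive; the repo is the user's own project."),
    ("redundant_read", "<full contents of src/app.py re-read to satisfy the edit guard>"),
    ("relevant_read",  "<the 30 lines that actually contain the flag>"),
    ("edit",           "Replaced `flag = True` with `flag = False` in src/app.py."),
]

def rough_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

usage = Counter()
for category, text in transcript:
    usage[category] += rough_tokens(text)

total = sum(usage.values())
for category, tokens in usage.most_common():
    print(f"{category:15s} {tokens:4d} tokens  ({tokens / total:.0%})")
```

On a real session dump, the share spent on warnings, dismissals, and redundant reads is the part a cleaner harness gives back to the task.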
Key Takeaways
- Benchmark tools rigorously: Use SWE-bench, Terminal bench, or Matt Mau's feature tests to isolate harness vs. model issues.
- Minimize harness bloat: Avoid excess skills/plugins; test system prompt changes on held-out tasks.
- Account for multi-step flows: Expect hardware variance across Anthropic's multi-vendor backends; prefer single-vendor providers for consistency.
- Reset expectations: Track personal baselines; what impresses evolves with experience.
- Prioritize clean CLIs: Cursor outperforms native Claude Code by 15-20%; switch for production coding.
- Monitor API refusals: Log blocks to distinguish from model errors; appeal over-aggressive filters.
- Demand engineering rigor: Vendor sloppiness (e.g., the read-before-edit bug) wastes compute; report it as feedback.
- Layer-debug systematically: Test the raw API first, then the harness-wrapped flow, to pinpoint regressions (see the sketch after this list).
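For that last item, the cheapest experiment is to run the identical tasks once against the raw API and once through the harness, then compare pass rates. A minimal sketch; `raw_api_call`, `harness_call`, and `grade` are placeholders you supply, not real library functions.

```python
# Sketch: compare the same tasks run through the raw API vs. through a harness.
# raw_api_call, harness_call, and grade are placeholders you supply for your own
# model wrapper, harness wrapper, and grading logic.
from collections.abc import Callable

def layer_compare(
    tasks: list[str],
    raw_api_call: Callable[[str], str],
    harness_call: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> dict[str, float]:
    """Pass rate per layer; a large gap implicates the harness, not the model."""
    wins = {"raw_api": 0, "harness": 0}
    for task in tasks:
        wins["raw_api"] += grade(task, raw_api_call(task))
        wins["harness"] += grade(task, harness_call(task))
    return {layer: count / len(tasks) for layer, count in wins.items()}

# Example usage (with your own wrappers):
# rates = layer_compare(held_out_tasks, raw_api_call, harness_call, grade)
# print(rates)  # e.g. {"raw_api": 0.62, "harness": 0.47} would point at the harness
```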
"Every time a new tool is added, every time a new adjustment to the system prompt is made... they are increasing the service area for stupid."
"The more shit that exists in the context... makes the model dumber."