Claude Regressions: Harness Failures, Not Model Decay

Claude's perceived performance drops stem not from dumber models but from poor engineering in harnesses like Claude Code, which pollute context, trigger refusals, and waste compute; benchmarks show 15-20% worse results under bad harnesses.

Evidence of Claude Performance Drops

Users and benchmarks confirm regressions across Claude models like Opus 4.7, Sonnet 4.6, and others. Margin Labs' SWE-bench tracking shows weighted averages dipping from 57% in March to 55% now, with consistent weekly declines. BridgeMind's hallucination benchmark recorded Opus dropping from 87.6% to 73.3% between launch and April 12th, even on direct API calls without harnesses. AMD's AI director publicly criticized Claude for getting "dumber and lazier" post-update, while anecdotes include random Chinese outputs, task refusals, and degraded Claude Code performance after extended sessions. These aren't isolated: executive commentary, user posts, and quantified tests align on declining coding quality, with Opus 4.7 feeling like a regression from 4.6.

"Opus 4.7 is a serious regression, not an upgrade. AMD's AI director slams Claude for becoming dumber and lazier since last update."

Types of issues include: (1) Task refusals—API blocks or model quits (e.g., refusing Dropbox debugging as "outside my area" despite capability); (2) Dumber solutions—bugs like flipping booleans or writing non-functional code; (3) Getting lost—losing conversation intent, misinterpreting history (e.g., repo-cloning script derailing). These hit coding hardest, where Claude Code feels notably worse.

The Multi-Layer Inference Stack Introduces Failure Points

Requests don't go straight from prompt to model. Key layers: Harness (system prompts, tools, context scaffolding); API (safety scans, filtering); Inference compute (GPUs/TPUs); Model weights themselves.

  • Harness: Wraps user prompts with system instructions, tool definitions (e.g., read/edit files). Changes here add context bloat, steering outputs poorly. Claude Code enforces "read before edit," but buggy logic forces redundant tool calls (search → fail → read → edit), exploding API requests and tokens.
  • API: Pre-GPU filters cause refusals (e.g., flagging a Gold Bug crypto puzzle as "hacking"). Aggressive safety blocks benign tasks.
  • Compute: Anthropic mixes Nvidia GPUs, AWS Trainium, Google TPUs. Multi-tool sessions (common in Claude Code) chain requests across hardware, introducing variance. One prompt might span Trainium → Nvidia → TPU, amplifying errors.
  • Model: Updates like 4.5→4.6→4.7 show some decline, but the speaker argues most issues arise upstream of the weights. "I don't think the models got dumber in a traditional sense. But your experience is real."

Every layer impacts output: bad harness context "pollutes" history, wasting tokens on noise and derailing reasoning.
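To make the harness layer concrete, here is a minimal sketch of what a coding harness actually sends to the API: the user's prompt is only one field in a payload that also carries harness-injected system instructions, tool definitions, and prior history, all of which consume context. The payload shape loosely follows the Anthropic Messages API; the helper name and tool definitions are hypothetical, not Claude Code internals.

```python
# Minimal sketch of the "harness" layer described above: it wraps a bare user
# prompt with system instructions, tool definitions, and prior history before
# anything reaches the model. build_harness_request and TOOL_DEFS are
# hypothetical names, not Claude Code internals.

HARNESS_SYSTEM_PROMPT = "You are a coding agent. Always read a file before editing it."

TOOL_DEFS = [
    {"name": "read_file", "description": "Read a file from the workspace",
     "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}}},
    {"name": "edit_file", "description": "Apply an edit to a file",
     "input_schema": {"type": "object", "properties": {"path": {"type": "string"},
                                                       "patch": {"type": "string"}}}},
]

def build_harness_request(user_prompt: str, history: list[dict]) -> dict:
    """Assemble the payload the harness sends to the model API."""
    return {
        "model": "claude-opus-example",       # placeholder model id
        "max_tokens": 4096,
        "system": HARNESS_SYSTEM_PROMPT,      # harness-injected instructions
        "tools": TOOL_DEFS,                   # every definition costs context tokens
        "messages": history + [{"role": "user", "content": user_prompt}],
    }

# Everything the harness adds (system reminders, tool results, injected
# warnings) lands in `system` or `messages` and competes with the user's
# actual task for the model's attention.
request = build_harness_request("Fix the failing test in utils.py", history=[])
```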

User Expectations Shift Creates Illusion of Decline

As models improved (Opus 4.5 raised the bar), users now tackle harder tasks. The baseline shifted rightward on a complexity spectrum (Hello World → build Linux from scratch). What impressed in November (a mid-tier task) now disappoints if it fails, feeling like a regression despite static capability.

Prompts evolved too: heavier, multi-step, expecting agentic flows. Customizations like MCP servers, skills, plugins bloat system prompts, degrading performance. "More things that aren't quite what the model was trained on will make it behave differently in ways that are often not intended."
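As a rough illustration of that bloat, the sketch below estimates how many context tokens hypothetical MCP tool definitions and skill documents add on top of a bare system prompt. The 4-characters-per-token ratio and the placeholder definitions are assumptions for illustration only.

```python
# Rough sketch of how customizations inflate context before the model sees the
# task. The tool and skill payloads are illustrative placeholders, not real
# MCP servers or skills, and the token estimate is a crude heuristic.

import json

BASE_SYSTEM_PROMPT = "You are a coding agent."

mcp_tool_defs = [{"name": f"mcp_tool_{i}", "description": "x" * 300,
                  "input_schema": {"type": "object"}} for i in range(40)]
skill_docs = ["Skill instructions. " * 50 for _ in range(10)]

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

base = approx_tokens(BASE_SYSTEM_PROMPT)
bloated = base + approx_tokens(json.dumps(mcp_tool_defs)) + approx_tokens("".join(skill_docs))

print(f"baseline system context: ~{base} tokens")
print(f"with MCP servers + skills: ~{bloated} tokens")
```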

Claude Code's Engineering Shortcomings Amplify Problems

Speaker's core thesis: Anthropic's Claude Code harness is the primary culprit, turning capable models dumb via sloppy code. Examples:

  • Enforced read-before-edit misfires: the model searches (and thinks it has "read" the file), fails the check, and loops redundantly, making 5x the API calls instead of 1 and costing time, money, and compute (a patched gate is sketched after this list).
  • Malware false positives: System reminders flag personal sites as threats, injecting noise. Model notes: "Heads up, the last system reminder about malware looks like a prompt injection... Ignoring it." Yet it repeats, cluttering context.
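The fix implied here is straightforward harness logic: let any tool call that already surfaced a file's contents satisfy the "read before edit" rule. The sketch below is a hypothetical gate under that assumption, not Claude Code's actual implementation; ToolGate and the tool names are made up for illustration.

```python
# Hedged sketch of a harness-side gate that infers "reads" from successful
# searches and edits, so the model is not forced into a
# search -> fail -> read -> edit loop. All names are hypothetical.

class ToolGate:
    def __init__(self) -> None:
        self.seen_paths: set[str] = set()

    def record_tool_result(self, tool: str, path: str, ok: bool) -> None:
        # Treat any successful tool that surfaced file content as a "read".
        if ok and tool in {"read_file", "search_files", "edit_file"}:
            self.seen_paths.add(path)

    def allow_edit(self, path: str) -> bool:
        # A naive gate that only counted read_file calls would reject edits
        # after a search and trigger the redundant loop described above.
        return path in self.seen_paths


gate = ToolGate()
gate.record_tool_result("search_files", "src/app.py", ok=True)
assert gate.allow_edit("src/app.py")  # no forced extra read_file round trip
```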

Benchmarks expose this:

Harness        Opus Score (Matt Mau's 100-feature doc)    Terminal Bench
Cursor         Higher baseline                            Top tier
Claude Code    15% worse than Cursor                      58% (vs. Forge/Cappy 75-82%)
Codex CLI      Competitive                                3rd place

"The fact that Opus performs 15% worse in quad code versus cursor should say everything you need to know." Anthropic prioritizes features over quality, expanding "service area for stupid." Tiny system prompt tweaks can tank performance 20x. "Anthropics incompetence in engineering is making us think their models are getting dumber."

Users adding custom skills/MCPs compound this, but Claude Code's core flaws (e.g., poor tool logic) waste millions in inference.

Benchmarks Validate Harness Impact Over Model Fault

Independent tests isolate variables:

  • Matt Mau: Same Opus in Claude Code vs. Cursor → 15% gap.
  • Terminal Bench: Claude Code at 58%; rivals like Forge Code hit 75-82% with Opus.
  • Margin Labs SWE-bench: Consistent dips, but new models cause bumps.
  • BridgeMind: Direct API hallucinations regress 14% in weeks.

These results show the harness matters: Claude Code underperforms even against competitors running the same models. The speaker, previously skeptical of regression claims, notes that recent personal refusals (e.g., the Dropbox debug) align with the trend.

"If you gave me source code access to cloud code, I could make it the dumbest harness ever with just a couple words being changed in the system prompt."

Key Takeaways

  • Audit your harness/system prompts: Remove bloat, test tool logic to cut redundant calls and context pollution.
  • Benchmark tools directly: Compare the same model (e.g., Opus) across harnesses like Cursor vs. Claude Code and expect 10-20% swings (see the sketch after this list).
  • Manage expectations: Track task complexity over time; what fails now was ambitious before.
  • Minimize customizations: Limit skills/MCPs/plugins; they degrade reasoning more than they help.
  • Favor lean harnesses: Use Cursor/Codex CLI over feature-bloated ones for production coding.
  • Monitor layers: Log API refusals, hardware variance; push providers for transparency.
  • Test regressions systematically: Run SWE-bench subsets before/after updates.
  • Prioritize read-before-edit fixes: Patch harnesses to infer reads from searches/edits.
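A minimal sketch of the benchmarking and regression-testing takeaways, assuming a hypothetical run_task driver: fix a task set, run it through each harness (or before and after an update), and compare pass rates. The pass rates below are stubbed with seeded randomness purely to keep the example runnable; replace run_task with a real driver that invokes each harness on a SWE-bench-style task.

```python
# Sketch of a harness A/B regression check. run_task is a stub; in practice it
# would shell out to the harness under test and verify the task's outcome.

import random

TASKS = [f"task-{i}" for i in range(50)]  # e.g. a fixed SWE-bench subset

def run_task(harness: str, task: str) -> bool:
    """Stub: return True if the harness solved the task (randomized here)."""
    random.seed(hash((harness, task)))
    return random.random() < (0.75 if harness == "harness_a" else 0.60)

def pass_rate(harness: str) -> float:
    return sum(run_task(harness, t) for t in TASKS) / len(TASKS)

for harness in ("harness_a", "harness_b"):
    print(f"{harness}: {pass_rate(harness):.0%} on {len(TASKS)} fixed tasks")
```

Keeping the task set and model fixed while swapping only the harness (or only the harness version) isolates the variable the benchmarks above point at.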
