Claude 'Regressions' Stem from Harnesses and APIs, Not Dumber Models

User complaints about Claude getting dumber trace to API refusals, buggy Claude Code harnesses wasting context/tokens, shifting expectations, and inference across varied hardware—not core model degradation.

User Expectations Have Shifted, Amplifying Perceived Regressions

Theo argues that what feels like Claude models degrading is partly due to rising user baselines. Early on, simple file edits impressed users, but as capabilities grew (e.g., Opus 4.5 handling complex tasks), expectations escalated. A task once seen as advanced now seems baseline; failures that were tolerable before now register as regressions.

He illustrates with a personal spectrum: from 'hello world' to 'building Linux from scratch.' Pre-Opus 4.5, models hit the mid-range; post-upgrade, users expect higher performance. "Code that you thought was good when you were a junior looks like shit when you're a more experienced developer," Theo says, explaining why the same output disappoints more now. This isn't the model getting dumber; it's users pushing harder prompts and piling on customizations like MCP servers or plugins, which pollute the system prompt and dilute focus.

Benchmarks confirm dips: Margin Labs' SWE-bench tracker shows Claude Code's weighted average dropping from 57% in March to 55% now, with weekly declines. Sonnet 4.6 regressed after March 9th; Opus 4.7 shows Claude Code issues. Anecdotes abound: AMD execs documenting laziness, Reddit/HN threads on daily variability, even Claude randomly outputting Chinese.

"I have historically pushed back on these types of claims... at least until recently," Theo admits, citing his own post on OpenClaw bans limiting non-coding tasks like Dropbox debugging, where Claude refused: "That's outside my area. I'm built for software engineering tasks."

Layers Between Prompt and Output Introduce Failures

Theo breaks down the request pipeline: user prompt → harness (system prompt, tools) → API (filtering/safety checks) → inference (GPUs/TPUs). Each layer can degrade output without touching the model.
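
To make the layering concrete, here is a minimal sketch of that pipeline. Every name, filter term, and routing decision below is an illustrative assumption, not Anthropic's actual internals:

```typescript
// Illustrative sketch of the prompt → harness → API → inference pipeline.
// All names and behaviors are assumptions for illustration, not Anthropic internals.

type Backend = "nvidia-gpu" | "aws-trainium" | "google-tpu";

const SYSTEM_PROMPT = "You are a coding agent."; // owned by the harness, not the user
const TOOL_SCHEMAS = ["read_file", "edit_file", "run_command"]; // each one adds context

// 1. Harness: wraps the user's prompt with the system prompt and tool schemas.
//    Extra plugins/MCP servers grow this payload and dilute the model's focus.
function applyHarness(userPrompt: string): string {
  return [SYSTEM_PROMPT, `Tools: ${TOOL_SCHEMAS.join(", ")}`, userPrompt].join("\n\n");
}

// 2. API layer: safety filters can refuse before inference ever runs
//    (the Gold Bug cipher refusal lives here, not in the model).
function applyApiFilters(request: string): string {
  const blockedTerms = ["cipher", "keylogger"]; // deliberately overbroad, mimicking aggressive filtering
  if (blockedTerms.some((term) => request.toLowerCase().includes(term))) {
    throw new Error("Refused at the API layer; the model never saw this request");
  }
  return request;
}

// 3. Inference: routed to whichever backend has capacity; different hardware
//    can produce subtly different outputs for the same request.
function routeToBackend(request: string): { backend: Backend; payload: string } {
  const backends: Backend[] = ["nvidia-gpu", "aws-trainium", "google-tpu"];
  const backend = backends[Math.floor(Math.random() * backends.length)];
  return { backend, payload: request };
}

// Any of the three stages can degrade the result without the weights changing.
const routed = routeToBackend(applyApiFilters(applyHarness("Update package.json scripts")));
console.log(`${routed.backend}: ${routed.payload.length} chars sent to inference`);
```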

API Refusals: Aggressive filters block benign tasks. Example: Claude Code refused a Gold Bug cipher (math puzzle, not hacking), citing malware risk—pure API, not model. Bans on non-SE tasks (e.g., UI debugging) spiked post-OpenClaw changes.
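
Because these blocks happen at the API layer rather than in the model's actual capabilities, a harness can often recover by restating intent and retrying. A minimal sketch, where `callModel` is a hypothetical stand-in for whatever client you use and the refusal phrases are examples rather than any official signal:

```typescript
// Hedged sketch: detect a refusal-looking response and retry once with clarified
// intent. `callModel` is a hypothetical stand-in for your client; the refusal
// phrases are examples, not an official API signal.

type CallModel = (prompt: string) => Promise<string>;

const REFUSAL_PATTERNS = [
  /outside my area/i,
  /built for software engineering tasks/i,
  /can't assist with malware/i,
];

const looksLikeRefusal = (text: string): boolean =>
  REFUSAL_PATTERNS.some((pattern) => pattern.test(text));

export async function askWithRetry(callModel: CallModel, prompt: string): Promise<string> {
  const first = await callModel(prompt);
  if (!looksLikeRefusal(first)) return first;

  // Restate benign intent (e.g., "this is a classic cryptogram, not an attack")
  // and try once more before treating it as a hard limit.
  const clarified = `${prompt}\n\nContext: this is a benign, legal task (e.g., a puzzle or UI debugging), not malware or abuse.`;
  return callModel(clarified);
}

// Usage, once a real client is wired in:
// const answer = await askWithRetry(myClient, "Help me debug why Dropbox sync keeps failing");
```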

Harness Pollution: Custom skills/plugins bloat the context, nudging models off-track. Users add 'useless MCP servers'; devs over-customize. Worse are Claude Code's own harness flaws: it mandates reading files before edits but doesn't treat searches as reads, forcing redundant tool calls. One package.json update ballooned from 1 API call to 5, wasting tokens, compute, and context.

"This is an example of the harness not just making the model behave worse or dumber but also costing you more usage and money," Theo notes. Matt Mau's benchmark is damning: same Opus model scores 15% worse in Claude Code vs. Cursor (official CLIs also lag). "Anthropic is too focused on making Claude code have all these features... shipping utter slop constantly. And the result is that the models feel dumber."

System prompt tweaks alone can tank performance: "If you gave me source code access to Claude Code, I could make it the dumbest harness ever with just a couple words being changed."

Inference Variability: Anthropic shards inference across Nvidia GPUs, AWS Trainium, and Google TPUs; diverse hardware yields subtly inconsistent outputs. Tool-heavy flows (read → edit) chain requests, potentially hitting a different backend per step. The scramble for multi-cloud capacity amplifies the inconsistency.

Context Rot and 'Getting Lost': Long sessions accumulate noise (failed tool calls, irrelevant reads), causing models to misinterpret their own history. In an Opus 4.7 scripting demo, the model flipped the repo-clone logic because of drift from earlier in the chat.
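
One mitigation this framing suggests (my sketch, not a feature of any particular harness) is pruning obvious noise, such as failed tool calls and stale duplicate file reads, before the history is sent back:

```typescript
// Hedged sketch of context hygiene: drop failed tool results and keep only the
// latest read of each file before resending history. The message shape is
// illustrative, not any specific SDK's types.

interface HistoryItem {
  role: "user" | "assistant" | "tool";
  toolName?: string;
  filePath?: string;
  failed?: boolean;
  content: string;
}

function pruneHistory(history: HistoryItem[]): HistoryItem[] {
  const latestReadIndex = new Map<string, number>();
  history.forEach((item, i) => {
    if (item.toolName === "read_file" && item.filePath) {
      latestReadIndex.set(item.filePath, i); // remember only the newest read per file
    }
  });

  return history.filter((item, i) => {
    if (item.failed) return false; // failed tool calls are pure noise
    if (item.toolName === "read_file" && item.filePath) {
      return latestReadIndex.get(item.filePath) === i; // drop stale duplicate reads
    }
    return true;
  });
}
```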

Model Updates Aren't Immune, But Aren't the Main Culprit

Opus 4.6→4.7 feels worse for many, including Theo, but he pins most of it on non-model layers. Anthropic's postmortem (linked) details prior issues; the new tokenizer also costs more tokens. Trackers like Margin Labs quantify the code regressions. Yet benchmarks isolate the harness impact: Opus shines in cleaner environments like Cursor.

"We are now at a point where anthropics incompetence in engineering is making us think their models are getting dumber," Theo hot-takes. Features expand 'service area for stupid': e.g., malware false-positive on T3.gg design tweaks polluted context start-to-finish.

The historical pattern: models launch strong, then regress through these layers. The fix: cleaner harnesses, stable APIs, unified inference. For users: minimize custom junk and reset contexts.

Key Takeaways

  • Audit your harness/system prompt: strip unused skills/plugins to reduce context pollution and boost reliability (see the audit sketch after this list).
  • Test models in multiple UIs (e.g., Cursor vs. Claude Code) to isolate harness flaws; Matt Mau's benchmark showed the same Opus model scoring 15% worse in Claude Code.
  • Expect variability from multi-hardware inference; short sessions minimize chain-request drift.
  • Push back on refusals: distinguish API-layer blocks (often retriable) from true model limits.
  • Track benchmarks like Margin Labs SWE-bench or Matt Mau's for objective regressions vs. expectation shifts.
  • Demand engineering rigor from providers: features without harness fixes create 'slop' that mimics dumb models.
  • Raise your bar strategically—harder prompts are fine, but pair with clean scaffolding.
  • For production, prefer stable envs over bleeding-edge; Opus 4.5 may outperform 4.7 in cluttered setups.
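
Following the first takeaway, here is a small audit sketch. It assumes a Claude Code-style project-level .mcp.json with an mcpServers map (check your harness's docs for the actual path and shape) and simply lists what's configured so you can prune it:

```typescript
// Hedged audit sketch: list the MCP servers configured for a project so unused
// ones can be removed. Assumes an .mcp.json with an "mcpServers" map; adjust the
// path and shape for your harness.

import { readFileSync } from "node:fs";

interface McpConfig {
  mcpServers?: Record<string, { command?: string; args?: string[] }>;
}

const config: McpConfig = JSON.parse(readFileSync(".mcp.json", "utf8"));
const servers = Object.entries(config.mcpServers ?? {});

console.log(`${servers.length} MCP server(s) configured:`);
for (const [name, def] of servers) {
  console.log(`- ${name}: ${def.command ?? "?"} ${(def.args ?? []).join(" ")}`);
}
console.log("Every server's tool schema lands in context; remove what you don't use.");
```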
