Strip Frameworks to Planner, Generator, Evaluator

Anthropic's experiments on their own agent harnesses show that with Claude Opus 4.6, roughly 90% of the components in frameworks like BMAD, GSD, SpecKit, and Superpowers add overhead without value. Each component compensates for a model limitation, such as the need for micro-task sharding or context resets, that no longer holds given the model's 1M-token context window and improved coherence. Test these assumptions by removing parts and measuring task success; the results show only three agents deliver substantial gains over long horizons: a planner for high-level product outlines, a generator for implementation, and an evaluator for critical review. This minimal setup outperforms bloated harnesses because it lets a capable model handle details autonomously, avoiding the error cascades that upfront technical specs create.
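To make the shape of this concrete, here is a minimal sketch of the three-agent loop using the Anthropic Python SDK. The prompts, model id, scoring convention, and threshold are illustrative assumptions, not Anthropic's actual harness:

```python
# Minimal planner -> generator -> evaluator loop (sketch).
# Assumes the Anthropic Python SDK; prompts, model id, and the
# SCORE threshold are placeholders, not Anthropic's harness.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"  # assumed model id


def run(system: str, user: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=8000,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return resp.content[0].text


def build(feature_request: str, max_rounds: int = 5) -> str:
    # Planner stays at product level: deliverables and user
    # stories only, no technical specs to cascade errors from.
    plan = run(
        "You are a product planner. Produce feature breakdowns and "
        "user stories only. Do not prescribe implementation details.",
        feature_request,
    )
    code, feedback = "", ""
    for _ in range(max_rounds):
        # Generator owns every implementation decision.
        code = run(
            "You are an implementer. Choose the architecture yourself.",
            f"Plan:\n{plan}\n\nPrior evaluator feedback:\n{feedback}",
        )
        # Evaluator critiques adversarially and grades explicitly.
        review = run(
            "You are a critical reviewer. Assume bugs exist. "
            "End your review with a line 'SCORE: <0-10>'.",
            f"Plan:\n{plan}\n\nImplementation:\n{code}",
        )
        if int(review.rsplit("SCORE:", 1)[-1].strip()) >= 8:
            break  # evaluator approved
        feedback = review
    return code
```

The loop structure is the point: the plan is produced once at product scale, and all iteration happens between generator and evaluator.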

High-Level Planning Unlocks Model Autonomy

Shift planning from detailed micro-tasks (e.g., BMAD's technical sharding or SpecKit's step-by-step fragments) to product-level deliverables such as full feature breakdowns and user stories. Opus 4.6 excels here: in a detailed plan, a single early error propagates and locks the agent into a flawed path, while a high-level scope lets it discover the best implementation itself. Use BMAD only up to PRD generation, where its specialized context-augmented agents still help, or borrow Superpowers' questioning to surface edge cases. Anthropic's example planner prompt pushes boundary-testing app ideas at product scale and generates folders of phased docs; avoid Claude's native plan mode, which dives into implementation details prematurely. The outcome: agents deliver the complete workflows users expect without hand-holding.
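A sketch of what such a planner prompt could look like, given the constraints above. The wording, file names, and folder layout are assumptions for illustration, not Anthropic's actual prompt:

```python
# Illustrative planner system prompt (not Anthropic's actual wording).
# It keeps the plan at product scale and asks for phased docs
# instead of a technical task list.
PLANNER_SYSTEM = """\
You are a product planner for an ambitious, boundary-testing app.
Deliver the plan as a folder of phased documents:
  docs/phase-1-vision.md     - product vision and target users
  docs/phase-2-features.md   - feature breakdown with user stories
  docs/phase-3-acceptance.md - user-visible acceptance criteria
Describe WHAT the user should experience in each phase.
Never prescribe HOW: no file layouts, APIs, schemas, or task lists.
"""
```

The "never prescribe HOW" line is what separates this from plan mode: acceptance criteria constrain the outcome while leaving the implementation path open.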

Separate Generator-Evaluator with Graded Rubrics

Never let the generator self-evaluate: it overconfidently praises subpar work, especially on subjective UI where standards vary. Frameworks like GSD, BMAD, and Superpowers address this with distinct validators (BMAD's QA agents run tests, Superpowers enforces TDD, SpecKit verifies against docs), but none scores rigorously. Anthropic's evaluator acts as an adversary: it simulates users via Playwright, critiques on the assumption that bugs exist, and scores against explicit criteria before approving. For UI, grade on four axes: design quality (coherent compositions rather than strung-together components), originality (no default purple-white gradients), craft (typography, spacing, contrast harmony), and functionality (the design actively improves the UX), each weighted to prioritize holistic excellence. Opus 4.6 skips the sprint contracts weaker models like Sonnet need; context anxiety is gone, so no resets or external breakdowns are required. The result: an iterative feedback loop that yields production-ready apps matching your standards.
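A graded rubric can be as simple as a weighted dictionary. The four axes below come straight from the article; the weights, threshold, and helper names are illustrative assumptions:

```python
# Graded UI rubric (sketch). Axes are from the article; weights,
# threshold, and function names are illustrative assumptions.
UI_RUBRIC = {
    # axis: (weight, what the evaluator grades on a 0-10 scale)
    "design_quality": (0.30, "coherent composition, not strung-together components"),
    "originality":    (0.20, "no default purple-white gradient look"),
    "craft":          (0.30, "typography, spacing, contrast harmony"),
    "functionality":  (0.20, "the design actively improves the UX"),
}


def weighted_score(axis_scores: dict[str, float]) -> float:
    """Combine per-axis 0-10 grades into one holistic score."""
    return sum(UI_RUBRIC[axis][0] * score for axis, score in axis_scores.items())


def approve(axis_scores: dict[str, float], threshold: float = 8.0) -> bool:
    # Approve only when the weighted total clears the bar AND no
    # single axis fails badly: holistic excellence, not box-ticking.
    return weighted_score(axis_scores) >= threshold and min(axis_scores.values()) >= 6.0
```

This is the upgrade a pass/fail evaluator lacks: a "pass" with originality at 3/10 gets rejected instead of waved through.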

Implement Minimal Harness Without Full Frameworks

GSD is the closest ready-made option with its planner-generator-evaluator loop, but upgrade its pass/fail evaluator to scored rubrics. Otherwise, build with Claude agent teams: one generator (understands the task, implements in Git, refines through design and verify subphases) and one evaluator (tests the live app via a browser MCP and communicates fixes back). Skip sub-agents; teams enable direct chat between the two, cutting documentation overhead. For smaller models, retain task docs and contracts; drop them as you scale up to Opus. Resources in AIABS Pro provide ready-made agents. This setup evolves as models advance, shipping better apps faster.
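For the evaluator's live testing, a user-simulation pass can be scripted with Playwright's Python API. The URL, selectors, and flow below are placeholders for whatever app the generator shipped; only the Playwright calls themselves are real API:

```python
# Evaluator-side user simulation (sketch) using Playwright's Python API.
# URL, selectors, and the exercised flow are placeholders; adapt them
# to the app under review.
from playwright.sync_api import sync_playwright


def simulate_user(url: str) -> list[str]:
    """Walk a core flow like a skeptical user and collect defects."""
    defects: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Capture runtime errors the generator's self-review would miss.
        page.on(
            "console",
            lambda msg: defects.append(f"console error: {msg.text}")
            if msg.type == "error"
            else None,
        )
        page.goto(url)
        # Assume bugs exist: exercise the flow end to end instead of
        # trusting that the happy path renders.
        page.get_by_role("button", name="Sign up").click()  # placeholder flow
        page.screenshot(path="evaluator-evidence.png")  # evidence for the rubric
        browser.close()
    return defects
```

The screenshot plus collected defects feed directly into the rubric grading above, closing the generator-evaluator loop.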