Strip Frameworks to Planner, Generator, Evaluator

Anthropic's experiments on their own agent harnesses show that with Claude Opus 4.6, roughly 90% of the components in frameworks like BMAD, GSD, SpecKit, and Superpowers add overhead without value. Each component compensates for a model limitation, such as the need for micro-task sharding or context resets, that no longer holds given the model's 1M-token context window and improved coherence. Test these assumptions by removing parts and measuring task success; the results show only three agents deliver substantial gains over long horizons: a planner for high-level product outlines, a generator for implementation, and an evaluator for critical review. This minimal setup outperforms bloated harnesses because it lets a capable model handle details autonomously, avoiding the error cascades that upfront technical specs create.
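To make the shape of this concrete, here is a minimal sketch of the three-agent loop using the Anthropic Python SDK. The prompts, model id, scoring convention, and threshold are illustrative assumptions, not Anthropic's actual harness:

```python
# Minimal planner -> generator -> evaluator loop (sketch).
# Assumes the Anthropic Python SDK; prompts, model id, and the
# SCORE threshold are placeholders, not Anthropic's harness.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"  # assumed model id


def run(system: str, user: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=8000,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return resp.content[0].text


def build(feature_request: str, max_rounds: int = 5) -> str:
    # Planner stays at product level: deliverables and user
    # stories only, no technical specs to cascade errors from.
    plan = run(
        "You are a product planner. Produce feature breakdowns and "
        "user stories only. Do not prescribe implementation details.",
        feature_request,
    )
    code, feedback = "", ""
    for _ in range(max_rounds):
        # Generator owns every implementation decision.
        code = run(
            "You are an implementer. Choose the architecture yourself.",
            f"Plan:\n{plan}\n\nPrior evaluator feedback:\n{feedback}",
        )
        # Evaluator critiques adversarially and grades explicitly.
        review = run(
            "You are a critical reviewer. Assume bugs exist. "
            "End your review with a line 'SCORE: <0-10>'.",
            f"Plan:\n{plan}\n\nImplementation:\n{code}",
        )
        if int(review.rsplit("SCORE:", 1)[-1].strip()) >= 8:
            break  # evaluator approved
        feedback = review
    return code
```

The loop structure is the point: the plan is produced once at product scale, and all iteration happens between generator and evaluator.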

High-Level Planning Unlocks Model Autonomy

Shift planning from detailed micro-tasks (e.g., BMAD's technical sharding or SpecKit's step-by-step fragments) to product-level deliverables such as full feature breakdowns and user stories. Opus 4.6 excels here: in a detailed plan, a single early error propagates and locks the agent into a flawed path, while a high-level scope lets it discover the best implementation itself. Use BMAD only up to PRD generation, where its specialized context-augmented agents still help, or borrow Superpowers' questioning to surface edge cases. Anthropic's example planner prompt pushes boundary-testing app ideas at product scale and generates folders of phased docs; avoid Claude's native plan mode, which dives into implementation details prematurely. The outcome: agents deliver the complete workflows users expect without hand-holding.
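A sketch of what such a planner prompt could look like, given the constraints above. The wording, file names, and folder layout are assumptions for illustration, not Anthropic's actual prompt:

```python
# Illustrative planner system prompt (not Anthropic's actual wording).
# It keeps the plan at product scale and asks for phased docs
# instead of a technical task list.
PLANNER_SYSTEM = """\
You are a product planner for an ambitious, boundary-testing app.
Deliver the plan as a folder of phased documents:
  docs/phase-1-vision.md     - product vision and target users
  docs/phase-2-features.md   - feature breakdown with user stories
  docs/phase-3-acceptance.md - user-visible acceptance criteria
Describe WHAT the user should experience in each phase.
Never prescribe HOW: no file layouts, APIs, schemas, or task lists.
"""
```

The "never prescribe HOW" line is what separates this from plan mode: acceptance criteria constrain the outcome while leaving the implementation path open.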

Separate Generator-Evaluator with Graded Rubrics

Never let the generator self-evaluate: it overconfidently praises subpar work, especially on subjective UI where standards vary. Frameworks like GSD, BMAD, and Superpowers address this with distinct validators (BMAD's QA agents run tests, Superpowers enforces TDD, SpecKit verifies against docs), but none scores rigorously. Anthropic's evaluator acts as an adversary: it simulates users via Playwright, critiques on the assumption that bugs exist, and scores against explicit criteria before approving. For UI, grade on four axes: design quality (coherent compositions rather than strung-together components), originality (no default purple-white gradients), craft (typography, spacing, contrast harmony), and functionality (the design actively improves the UX), each weighted to prioritize holistic excellence. Opus 4.6 skips the sprint contracts weaker models like Sonnet need; context anxiety is gone, so no resets or external breakdowns are required. The result: an iterative feedback loop that yields production-ready apps matching your standards.
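A graded rubric can be as simple as a weighted dictionary. The four axes below come straight from the article; the weights, threshold, and helper names are illustrative assumptions:

```python
# Graded UI rubric (sketch). Axes are from the article; weights,
# threshold, and function names are illustrative assumptions.
UI_RUBRIC = {
    # axis: (weight, what the evaluator grades on a 0-10 scale)
    "design_quality": (0.30, "coherent composition, not strung-together components"),
    "originality":    (0.20, "no default purple-white gradient look"),
    "craft":          (0.30, "typography, spacing, contrast harmony"),
    "functionality":  (0.20, "the design actively improves the UX"),
}


def weighted_score(axis_scores: dict[str, float]) -> float:
    """Combine per-axis 0-10 grades into one holistic score."""
    return sum(UI_RUBRIC[axis][0] * score for axis, score in axis_scores.items())


def approve(axis_scores: dict[str, float], threshold: float = 8.0) -> bool:
    # Approve only when the weighted total clears the bar AND no
    # single axis fails badly: holistic excellence, not box-ticking.
    return weighted_score(axis_scores) >= threshold and min(axis_scores.values()) >= 6.0
```

This is the upgrade a pass/fail evaluator lacks: a "pass" with originality at 3/10 gets rejected instead of waved through.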

Implement Minimal Harness Without Full Frameworks

GSD is the closest ready-made option with its planner-generator-evaluator loop, but upgrade its pass/fail evaluator to scored rubrics. Otherwise, build with Claude agent teams: one generator (understands the task, implements in Git, refines through design and verify subphases) and one evaluator (tests the live app via a browser MCP and communicates fixes back). Skip sub-agents; teams enable direct chat between the two, cutting documentation overhead. For smaller models, retain task docs and contracts; drop them as you scale up to Opus. Resources in AIABS Pro provide ready-made agents. This setup evolves as models advance, shipping better apps faster.
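For the evaluator's live testing, a user-simulation pass can be scripted with Playwright's Python API. The URL, selectors, and flow below are placeholders for whatever app the generator shipped; only the Playwright calls themselves are real API:

```python
# Evaluator-side user simulation (sketch) using Playwright's Python API.
# URL, selectors, and the exercised flow are placeholders; adapt them
# to the app under review.
from playwright.sync_api import sync_playwright


def simulate_user(url: str) -> list[str]:
    """Walk a core flow like a skeptical user and collect defects."""
    defects: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Capture runtime errors the generator's self-review would miss.
        page.on(
            "console",
            lambda msg: defects.append(f"console error: {msg.text}")
            if msg.type == "error"
            else None,
        )
        page.goto(url)
        # Assume bugs exist: exercise the flow end to end instead of
        # trusting that the happy path renders.
        page.get_by_role("button", name="Sign up").click()  # placeholder flow
        page.screenshot(path="evaluator-evidence.png")  # evidence for the rubric
        browser.close()
    return defects
```

The screenshot plus collected defects feed directly into the rubric grading above, closing the generator-evaluator loop.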