Three-Role Architecture Composes Multi-Agent Patterns for Long-Running Autonomy
Multi-agent systems succeed by blending five frontier strategies: delegation (one agent spawns sub-agents for subtasks like database schema research), creator-verifier (separate agents implement and then scrutinize, so no agent grades its own work), direct communication (agents message each other without a coordinator, which risks fragmented state), negotiation (agents trade over shared resources such as code sections to reach win-win outcomes), and broadcast (one agent shares status or constraints with all the others for coherence). Missions integrates four of these (delegation, creator-verifier, broadcast, and negotiation) into a workflow where humans set goals and approve plans, then agents execute for hours or days via shared state and structured handoffs.
The orchestrator acts as a sounding board: it scopes goals through conversation, clarifies requirements, and outputs a plan with features, milestones, and a validation contract that defines "done" upfront (hundreds of assertions per project, independent of the code). Workers get clean context per feature: read the spec, implement, and commit via Git so the next feature starts from a clean slate. Validators run after each milestone in two modes: scrutiny (lints, type checks, and a dedicated code review per feature) and user testing (QA-like interaction via computer use, filling forms, clicking buttons, and verifying flows). Validators carry none of the implementation context, which keeps their checks adversarial. This catches drift: tests written after implementation tend to confirm the implementation's decisions rather than the requirements, whereas contracts defined upfront enforce behavioral correctness.
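A minimal sketch of what a code-independent validation contract could look like; the `Assertion` and `ValidationContract` names and the stubbed checks are illustrative assumptions, not Missions' actual interface, and real checks would drive the running app via computer use.

```python
# Hypothetical sketch of a validation contract written at planning time,
# before any implementation exists. Names here are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Assertion:
    """One behavioral check, phrased against requirements, not code."""
    description: str
    check: Callable[[], bool]  # in practice: a computer-use interaction with the app

@dataclass
class ValidationContract:
    """Defines 'done' upfront; validators run it after each milestone."""
    assertions: list[Assertion] = field(default_factory=list)

    def run(self) -> list[str]:
        # Return the descriptions of failed assertions for the orchestrator to re-scope.
        return [a.description for a in self.assertions if not a.check()]

contract = ValidationContract([
    Assertion("user can create a channel", lambda: True),       # stub check
    Assertion("messages persist after page reload", lambda: True),
])
failures = contract.run()  # an empty list means the milestone meets the contract
```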
Structured Handoffs and Milestone Checkpoints Enable Self-Healing
Workers end each feature with a structured handoff logging what was completed, what was left undone, commands run with their exit codes, open issues, and whether the documented procedure was followed. The orchestrator reviews these at milestones; if issues surface, it scopes corrective features that pull the system back on track. This explicit documentation prevents context loss, unlike a vague "I'm done" signal. The longest mission ran 16 days (and could potentially run 30); errors surface at milestone boundaries, where they get targeted fixes, so correctness compounds over time.
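One way such a handoff could be represented; the field names below mirror the items described above but are assumptions, not the product's actual schema.

```python
# Minimal sketch of a structured handoff record (illustrative field names).
from dataclasses import dataclass, field

@dataclass
class Handoff:
    feature: str
    completed: list[str] = field(default_factory=list)            # what was finished
    not_done: list[str] = field(default_factory=list)             # deferred or blocked work
    commands: list[tuple[str, int]] = field(default_factory=list)  # (command, exit code) pairs
    issues: list[str] = field(default_factory=list)                # problems the next role should know
    followed_procedure: bool = True                                # adherence to the documented process

handoff = Handoff(
    feature="channel-creation",
    completed=["POST /channels endpoint", "unit tests for slug validation"],
    not_done=["rate limiting"],
    commands=[("pytest -q", 0), ("ruff check .", 0)],
    issues=["migration needs review before deploy"],
)
# At the milestone, the orchestrator reads these records; non-empty `issues` or
# `not_done` lists become candidate corrective features.
```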
Serial Execution Outperforms Naive Parallelism; Model Selection Compounds Gains
Parallel agents conflict on shared codebases: they duplicate work, make inconsistent decisions, and burn tokens on coordination overhead. Missions runs serially (one worker or validator active at a time), parallelizing only read-only operations like codebase search or code reviews. Wall-clock time skews toward user testing (live interaction with the running app), not token generation. Assign models deliberately per role: slow reasoners for orchestration and planning, fast coders for workers, and precise instruction-followers for validation, mixing providers to avoid shared biases. This "droid whispering" skill lets weaker or open-weight models succeed through structure; because the orchestration is defined in prompts (about 700 lines of text and skills, no hard-coded state machines), it adapts to new models and avoids obsolescence (the "bitter lesson" fear). Prompt caching offsets the cost of long runs.
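A rough illustration, under stated assumptions, of serial writes with parallel read-only operations and deliberate per-role model assignment; the model names and helper functions are placeholders, not real identifiers.

```python
# Hedged sketch: per-role model choices plus a serial write loop with
# parallel read-only fan-out. All names are placeholders.
from concurrent.futures import ThreadPoolExecutor

ROLE_MODELS = {
    "orchestrator": "slow-deep-reasoner",            # planning and milestone reviews
    "worker":       "fast-coder",                    # feature implementation
    "validator":    "precise-instruction-follower",  # ideally a different provider to avoid shared biases
}

def run_codebase_search(query: str) -> str:
    return f"results for {query}"                    # stub for a read-only operation

def run_feature(model: str, feature: str, context: list[str]) -> dict:
    return {"feature": feature, "model": model, "done": True}  # stub worker call

def validate(model: str, handoff: dict) -> bool:
    return handoff["done"]                           # stub validator call

def run_mission(features: list[str], queries: list[str]) -> None:
    # Read-only ops (codebase search, review passes) can safely fan out in parallel...
    with ThreadPoolExecutor() as pool:
        context = list(pool.map(run_codebase_search, queries))
    # ...but writes stay strictly serial: one worker, then one validator, per feature,
    # so agents never conflict on the shared codebase.
    for feature in features:
        handoff = run_feature(ROLE_MODELS["worker"], feature, context)
        assert validate(ROLE_MODELS["validator"], handoff)

run_mission(["channel creation", "message threads"], ["auth middleware", "db schema"])
```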
Production Metrics Prove Scalable Software Autonomy
In a Slack clone example, 60% of time and tokens went to implementation; validation always fails the first pass, triggering corrective work until tests make up 50% of the final code (90% coverage). This unlocks 3x the workstreams (5 engineers → 30 missions), freeing humans for architecture and product decisions. Codebases improve along the way: more tests and skills boost both agent and human productivity. Enterprise uses include overnight prototypes, rapid internal tools, refactors and migrations, ML research, and codebase modernization. The Mission Control dashboard tracks progress, budget, and handoffs asynchronously: set it and ignore it.