VS Code Agent Loop: Tools, Sub-Agents, and Optimizations
VS Code's agent loop is a dynamic while loop powered by model-tuned prompts, context gathering, and tools; sub-agents use cheaper models for speed, and constant harness optimizations have boosted the share of agent-written code that gets committed from roughly 53% to 90%.
Agent Loop Fundamentals
Pierce Boggan explains the agent loop as a giant while loop triggered by a user's first prompt in VS Code's GitHub Copilot chat. Each iteration sends an API request to a model with dynamically built components: a system prompt tailored to the selected model family (optimized pre- and post-launch via A/B tests and evaluations), explicit context (e.g., mentioned files like hello.tsx), implicit context (open editors, running terminals, environment info), available tools, and the user prompt.
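A minimal sketch of how one iteration's request might be assembled. The type and field names here (ToolSchema, AgentRequest, buildRequest) are illustrative assumptions, not VS Code's internal API:

```typescript
// Illustrative only -- type and field names are assumptions, not VS Code internals.
interface ToolSchema {
  name: string;        // e.g. "search_files", "read_file", "edit_file"
  description: string; // tells the model when this tool is appropriate
}

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface AgentRequest {
  model: string;           // selected model family
  messages: ChatMessage[]; // system prompt + context + history + user prompt
  tools: ToolSchema[];     // built-in tools, MCP tools, custom tools
}

function buildRequest(
  model: string,
  systemPrompt: string,      // tuned per model family, refined via A/B tests
  explicitContext: string[], // files the user mentioned, e.g. hello.tsx
  implicitContext: string[], // open editors, running terminals, environment info
  tools: ToolSchema[],
  history: ChatMessage[],
  userPrompt: string,
): AgentRequest {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "system", content: [...explicitContext, ...implicitContext].join("\n") },
      ...history,
      { role: "user", content: userPrompt },
    ],
    tools,
  };
}
```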
Tools form the core: unlike basic chat's text-only responses, agents choose from built-in tools (search, read/edit files, the GitHub MCP server) or custom ones, each with schemas and descriptions. The model decides actions—like searching files, reading them, then editing—appending outputs to iterate until issuing a stop message with a user summary. "Imagine you just basically have a giant while loop... every interaction is an API request to a model."
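Building on the sketch above, the "giant while loop" itself might look like this; callModel and runTool stand in for the model API and tool execution and are assumptions, not real functions:

```typescript
// Hypothetical response shape: either a tool call to run, or final text for the user.
interface ModelResponse {
  toolCall?: { name: string; args: Record<string, unknown> };
  text?: string;
}

// Stand-ins for the model API and tool execution (assumed, not real APIs).
declare function callModel(request: AgentRequest): Promise<ModelResponse>;
declare function runTool(name: string, args: Record<string, unknown>): Promise<string>;

// Keep sending requests, appending each tool's output as new context,
// until the model stops calling tools and emits a user-facing summary.
async function agentLoop(request: AgentRequest): Promise<string> {
  while (true) {
    const response = await callModel(request);
    if (!response.toolCall) {
      return response.text ?? ""; // stop message with a summary for the user
    }
    const output = await runTool(response.toolCall.name, response.toolCall.args);
    request.messages.push(
      { role: "assistant", content: JSON.stringify(response.toolCall) },
      { role: "tool", content: output }, // prior output becomes next iteration's context
    );
  }
}
```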
James Montemagno notes the loop's evolution over 6-8 months, with growing options like bypass, autopilot, planning modes, custom agents, and reasoning levels. Users see branches in chat (e.g., research via grepping/reading files), all driven by the model appending prior outputs as context.
Harness Optimizations and Model Tuning
The "harness"—prompts, context gathering, tools, and custom backend models—differentiates VS Code from CLI or other agents. Pierce highlights massive unseen optimization: VS Code's team (15-20 people) refines prompts with providers like Anthropic, OpenAI, Gemini, xAI weeks/months pre-launch using VS SWE-bench (custom, pollution-free alternative to SWE-bench). They analyze agent trajectories—not just pass/fail, but optimal paths for faster resolutions (1 minute vs. 1 hour).
Post-launch, A/B tests and online evals handle demand spikes (e.g., capacity issues on the new Opus 4.7's launch day). Results: the code-commit rate rose from 52-53% with GPT-4o to 90% with o1. Custom models tackle specifics, like agentic code retrieval for edits or cheap models for chat titles. "With Opus o1, we're getting 90% of Opus code in our harness committed... improvement we see in 1 year."
Pierce stresses continuous improvement loops: model updates, prompt and tool tweaks, purpose-built models. New models start out "infant" but are honed quickly; different models (4.5 to 4.7, o1 to 5.3 Codex) think differently, requiring per-model tuning.
Sub-Agents: Specialization and Model Choices
Sub-agents come into play when the main agent delegates: a sub-agent is a tool the model selects that runs a fresh agent loop against a stated goal and returns its result like a function call. Users sometimes question the use of different models (e.g., the main agent on o1-preview at 3x cost, a sub-agent on Haiku at 0.33x): it's no bait-and-switch, but a deliberate choice for the best experience.
The reasons are speed and cost on narrow tasks (context gathering, exploration). The main agent, a heavy reasoning model, plans and coordinates; sub-agents run on fast, cheaper models. Pierce describes a sub-agent as "run this workflow with fresh context... return back to main thread." The model itself decides to delegate via tool choice within the loop.
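A sketch of that idea, reusing the agentLoop above: the sub-agent is just another tool whose implementation spins up a fresh loop on a cheaper model. The names and the specific model split are assumptions:

```typescript
// The sub-agent exposed as an ordinary tool the main model can choose.
const subAgentTool: ToolSchema = {
  name: "run_sub_agent",
  description:
    "Delegate a narrow goal (context gathering, exploration) to a fresh agent " +
    "loop on a faster, cheaper model; returns its result like a function call.",
};

// Assumed helper: builds a request with fresh context and the cheaper model.
declare function buildSubAgentRequest(goal: string, model: string): AgentRequest;

async function runSubAgent(goal: string): Promise<string> {
  // Fresh context: the sub-agent does not inherit the main thread's history.
  const request = buildSubAgentRequest(goal, "cheap-fast-model");
  return agentLoop(request); // same loop, different model and prompt; result flows back
}
```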
Customizations modify these basics: instructions append text (globally or scoped by glob pattern), skills let the model fetch and append context much like tools, and MCP adds tools. Trade-offs abound: too many tools or options degrade choice (as with humans facing overload), so custom models prune the set down to relevant ones. User corrections append as text, enabling smart pivots but risking bad paths; killing and restarting the loop is advised when things go wrong.
"When you give people more choices, their ability to pick the right choice degrades."
Trade-Offs in Customization and Behaviors
Pierce warns against extremes: stuffing prompts fills context windows, and 1,000 tools overwhelm the model. Optimizations include tool-refining models and context-specific custom models. Bad loops caused by poor prior tokens require intervention, since each token is predicted from the ones before it.
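One way such tool pruning could work, as a hedged sketch; rankToolRelevance stands in for a cheap ranking model, and the cap of 10 is an arbitrary assumption:

```typescript
// Keep only the tools relevant to this prompt instead of sending hundreds of schemas.
declare function rankToolRelevance(prompt: string, tool: ToolSchema): Promise<number>;

async function pruneTools(
  prompt: string,
  tools: ToolSchema[],
  keep = 10, // assumption: a small cap keeps the model's choice sharp
): Promise<ToolSchema[]> {
  const scored = await Promise.all(
    tools.map(async (t) => ({ t, score: await rankToolRelevance(prompt, t) })),
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map((s) => s.t);
}
```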
Features like auto-generated titles, commit messages, PR descriptions, and next-edit suggestions run mini-loops transparently. The harness is tailored toward code quality; incentives align with user success, not tricks. James emphasizes how much impact these micro-decisions have on prompting.
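A hedged sketch of such a mini-loop for chat titles, reusing the buildRequest/agentLoop sketches above; the model name and prompt wording are assumptions:

```typescript
// Auto-generating a chat title with a cheap model in a tiny, tool-free loop.
async function generateChatTitle(conversation: ChatMessage[]): Promise<string> {
  const request = buildRequest(
    "cheap-small-model", // no heavy reasoning needed for a title
    "Summarize this conversation as a short title (max 8 words).",
    [],           // no explicit file context
    [],           // no implicit editor context
    [],           // no tools: the model just answers
    conversation,
    "Produce the title now.",
  );
  return agentLoop(request); // with no tools offered, this finishes in one iteration
}
```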
Ongoing challenges: predicting demand in the agentic era (10+ parallel agents), the limits of offline evals, and keeping pace with provider updates.
"There's an enormous amount of optimization... that you don't actually see."
Key Takeaways
- Trigger the agent loop with a clear prompt; watch iterations via chat for search/read/edit patterns.
- Select models wisely—new ones like o1-preview need weeks to optimize; expect initial capacity hiccups.
- Use instructions/skills sparingly to avoid context bloat; let the model choose via tools.
- Kill bad sub-agent paths early—corrections append as text, but prior tokens influence heavily.
- Customize via glob-scoped instructions or MCP for targeted tools, but limit options to aid model decisions.
- Evaluate via trajectories, not just resolution: aim for optimal paths in your workflows.
- Leverage VS SWE-bench insights: focus on production harnesses over polluted benchmarks like SWE-bench.
- For sub-agents, embrace model mixing—cheap/fast for exploration, heavy for orchestration.
- Monitor trade-offs: more tools degrade choice; use custom models for retrieval/edits.
- Stay updated weekly—harness evolves with models, boosting code acceptance from ~50% to 90%.