VS Code's Agent Loop: Prompts, Tools, Sub-Agents Exposed

VS Code Copilot's agent loop is a while loop that iterates model calls with dynamically built system prompts, context, tools, and sub-agents; relentless harness tuning has pushed code commit rates to 90%.

Agent Loop Fundamentals: A While Loop Powering Iterations

Brian breaks down the agent loop as a giant while loop triggered by your first prompt in VS Code Copilot. Each iteration sends an API request to the model with four key components: a dynamically built system prompt, explicit and implicit context (like open editors, terminals, and dates), available tools, and the user prompt. Tools, such as search, file reads, edits, or MCP calls, have schemas and descriptions, allowing the model to select and parameterize them instead of just responding with text.
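As a rough sketch (the names and shapes below are invented, not Copilot's actual API), a tool is little more than a name, a description the model reads, and a JSON Schema for the parameters it can fill in:

```typescript
// Hypothetical tool definition: a name, a description the model reads,
// and a JSON Schema describing the parameters the model can fill in.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
}

const searchTool: ToolDefinition = {
  name: "workspace_search",
  description: "Search the workspace for files or symbols matching a query.",
  parameters: {
    type: "object",
    properties: {
      query: { type: "string", description: "Text or symbol to search for." },
      maxResults: { type: "number", description: "Upper bound on results returned." },
    },
    required: ["query"],
  },
};
```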

The loop continues by appending previous outputs: a search yields files, reads gather context, edits apply changes, and a final text summary with a stop message ends it. "The model is given the outputs of the previous thing and able to iterate on it," Brian says. This setup evolved from simple chat, where models only returned text, to agentic flows enabling multi-step reasoning.
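A minimal sketch of that loop in TypeScript, with placeholder types and a stubbed model call standing in for the real harness:

```typescript
// Minimal agent-loop sketch. `callModel` and `runTool` are placeholders
// standing in for the real model API and tool implementations.
type ToolDefinition = {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
};

type Message =
  | { role: "system" | "user" | "assistant"; content: string }
  | { role: "tool"; name: string; content: string };

interface ModelReply {
  text?: string;
  toolCall?: { name: string; args: Record<string, unknown> };
  stop: boolean;
}

declare function callModel(messages: Message[], tools: ToolDefinition[]): Promise<ModelReply>;
declare function runTool(name: string, args: Record<string, unknown>): Promise<string>;

async function agentLoop(
  systemPrompt: string,
  userPrompt: string,
  tools: ToolDefinition[],
): Promise<Message[]> {
  const messages: Message[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: userPrompt },
  ];
  // The "giant while loop": each iteration the model sees the full transcript
  // so far and either calls a tool or produces text.
  while (true) {
    const reply = await callModel(messages, tools);
    if (reply.toolCall) {
      const output = await runTool(reply.toolCall.name, reply.toolCall.args);
      // Append the tool output so the next iteration can build on it.
      messages.push({ role: "tool", name: reply.toolCall.name, content: output });
      continue;
    }
    messages.push({ role: "assistant", content: reply.text ?? "" });
    if (reply.stop) return messages; // final summary plus stop signal ends the loop
  }
}
```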

James highlights user confusion around spinning loops, unexpected models, and context windows, noting how options like bypass, autopilot, planning, and custom agents multiply complexity. All modes build on this core loop, with customizations like instructions (appended text), skills (model-selectable context appends), and MCP servers (extra tools) modifying it subtly.
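One hedged way to picture those customizations, with invented names: instruction text is concatenated into the prompt, skills surface as context the model can choose to pull in, and MCP servers contribute extra tool definitions.

```typescript
// Hypothetical shape of how customizations fold into a single request.
interface AgentRequest {
  systemPrompt: string;
  context: string[]; // open editors, terminals, date, etc.
  tools: { name: string; description: string }[];
  userPrompt: string;
}

function applyCustomizations(
  base: AgentRequest,
  instructions: string[],                              // always-appended text
  skills: { name: string; content: string }[],         // model-selectable context
  mcpTools: { name: string; description: string }[],   // extra tools from MCP servers
): AgentRequest {
  return {
    ...base,
    systemPrompt: [base.systemPrompt, ...instructions].join("\n\n"),
    // Skills are surfaced so the model can decide to pull them in; here they
    // are modeled as lightweight tools whose descriptions preview the content.
    tools: [
      ...base.tools,
      ...mcpTools,
      ...skills.map((s) => ({ name: `skill_${s.name}`, description: s.content.slice(0, 200) })),
    ],
    context: base.context,
    userPrompt: base.userPrompt,
  };
}
```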

Tool Choice Trade-offs and Hidden Optimizations

Too many tools overwhelm the model, mirroring human decision paralysis: "Just like a human, when you give people more choices, their ability to pick the right choice degrades." Brian reveals backend optimizations, including custom models that prune tool lists to relevant ones per session and specialized retrievers for agentic code context—crucial for accurate edits.
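A toy illustration of the pruning idea: the production version reportedly uses a custom model, whereas this sketch just does naive keyword scoring with invented names.

```typescript
// Toy tool pruner: rank tools by naive relevance to the user prompt and keep
// only the top N. Stands in for the custom pruning model described above.
interface Tool { name: string; description: string }

function pruneTools(tools: Tool[], userPrompt: string, keep = 10): Tool[] {
  const promptWords = new Set(userPrompt.toLowerCase().split(/\W+/).filter(Boolean));
  const score = (t: Tool) =>
    t.description.toLowerCase().split(/\W+/).filter((w) => promptWords.has(w)).length;
  return [...tools]
    .sort((a, b) => score(b) - score(a))
    .slice(0, keep);
}
```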

System prompts are model-specific, tuned pre-launch with providers like Anthropic, OpenAI, and xAI via offline evaluations, then refined post-launch with A/B tests and online metrics. Even chat title generation or commit messages run lightweight agent loops via cheap models. Brian emphasizes the "harness"—prompts, context gathering, tools, and custom models—as the differentiator across tools like CLI or Cursor, explaining varied behaviors.
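A sketch of what "dynamically built per combination" might look like; the model keys and prompt fragments below are illustrative placeholders, not the real tuned prompts.

```typescript
// Illustrative only: assemble a system prompt from model-specific tuning plus
// whatever the user picked (mode, extra instructions). Keys are not real IDs.
const MODEL_PROMPTS: Record<string, string> = {
  "claude-family": "Prefer small, verifiable edits. Use tools before answering.",
  "gpt-family": "Plan briefly, then act. Keep final summaries short.",
};

function buildSystemPrompt(model: string, mode: "chat" | "agent", extras: string[]): string {
  const parts = [
    "You are a coding assistant inside VS Code.",
    MODEL_PROMPTS[model] ?? "",
    mode === "agent"
      ? "You may call tools in a loop until the task is done."
      : "Respond with text only.",
    ...extras,
  ];
  return parts.filter(Boolean).join("\n\n");
}
```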

User corrections append as text, letting smart models adapt, but bad paths require manual intervention since tokens are predicted sequentially from what came before. With 15-20 engineers dedicated to the harness, VS Code hit a 90% commit rate for Opus 4.6-generated code, up from 52% with GPT-4o a year ago, by influencing "agent trajectories": steering the model toward optimal paths that cut tasks from hour-long grinds to minute-scale resolutions.

Sub-Agents as Tools: Delegation Without Bait-and-Switch

Sub-agents address the big question: why do cheaper models like Haiku appear mid-loop when you selected a premium model? Brian clarifies they're tools the main agent invokes with parameters, spinning up fresh loops with goal-specific context and returning results like function calls. Nobody is pulling a fast one; the model choice is explicit in the loop and made for efficiency.

"A sub-agent is basically like this main agent can decide, 'I want to go basically do this workflow, run this agent loop again with fresh context,'" Brian explains. The main agent prompts via tool call, decided by context and system instructions pushing delegation for tasks like exploration. This orchestration scales without bloating the primary context.

James recounts Twitter confusion over apparent model switches (e.g., from a 3x-cost model down to a 0.33x one), pulling up docs from OpenAI and Claude. The incentives align: delivering the best experience drives the tuning, not cost tricks. Custom agents and orchestration layer on top of this, with skills and instructions acting as prompt modifications.

Evaluation Loops: From VS SWE-bench to Production Polish

Offline evals use VS SWE-bench, a cleaner SWE-bench alternative that avoids training-set pollution, running multiple trajectories per case to optimize paths rather than just pass/fail. Pre-launch access (weeks to months ahead) refines prompts; post-launch work handles capacity crunches (new models like Opus 4.7 spike demand) and runs A/B tests to measure real-world gains.
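A hedged sketch of what scoring the path rather than just pass/fail could look like in an eval harness; the names and metrics here are assumptions, not the actual VS SWE-bench setup.

```typescript
// Toy eval: run several trajectories per benchmark case and score both
// whether the task passed and how efficient the path was.
interface Trajectory { passed: boolean; steps: number; toolCalls: number }

declare function runCase(caseId: string, seed: number): Promise<Trajectory>;

async function evaluateCase(caseId: string, runs = 5) {
  const trajectories = await Promise.all(
    Array.from({ length: runs }, (_, seed) => runCase(caseId, seed)),
  );
  const passRate = trajectories.filter((t) => t.passed).length / runs;
  const avgSteps = trajectories.reduce((sum, t) => sum + t.steps, 0) / runs;
  // Pass rate alone hides slow, meandering paths; tracking step count lets
  // prompt and tool changes that shorten trajectories show up in the metric.
  return { caseId, passRate, avgSteps };
}
```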

"We're actually going and saying, 'What is the path the model took and was that an optimal path? How can we influence the path the model takes?'" Brian notes. Model updates from providers compound improvements. New models start raw—"today is like the worst day to use that model" due to capacity and untuned prompts—but mature in weeks.

Demand prediction falters in the agentic era (users running 10+ parallel agents), but continuous work, from generic optimizations to purpose-built models, keeps the harness evolving. Even seemingly simple features like AI edits or next-edit suggestions embed mini-loops.

"With Opus 4.6, James, I think we're getting 90% of Opus 4.6 code in our harness committed. This is pretty amazing. GPT-4o, when I first started on this team, we were 52, 53%. So, this is the improvement we see in 1 year."

Key Takeaways

  • Understand the agent loop as a while loop iterating model calls with dynamic system prompts, auto-context (editors/terminals), tools, and appended history—kill bad paths early since tokens chain predictably.
  • Limit tools to essentials; overload degrades choice—trust harness optimizations like tool pruners and code retrievers for relevance.
  • Sub-agents are tools for delegation: main agent spins goal-focused child loops returning results, enabling cheaper models without tricks.
  • Harness (prompts/tools/context/custom models) differentiates agents—VS Code's yields 90% commit rates via trajectory tuning.
  • New models need weeks to mature: expect capacity issues and raw performance initially; evals evolve via VS SWE-bench and A/B tests.
  • User corrections append as text—models adapt if prompted well, but explicit instructions guide sub-agent use.
  • Every click (titles, commits) hides mini-loops; appreciate backend for production-grade results.

"There's an enormous amount of optimization going in from our side that you don't actually see... around like tool optimization, like, what are the right tools, how many tools should we have?"

"The system prompt... is actually dynamically built for every single kind of combination of things you pick in the picker."

"Offline evaluations are always flawed... so then post-launch... we can do things like run AB tests and actually know in the wild what is better."

