Build MCP Deep Research Agents + Writing Pipelines
A hands-on guide to engineering a goal-directed research agent that uses MCP for web search, YouTube analysis, and evidence synthesis, then pipes its output into a constrained writing workflow with evaluation, distilling real-world tradeoffs for production AI systems.
Avoid AI Slop: Target Deep, Grounded Research Over Shallow Generation
AI-generated content like LinkedIn posts often fails with hallucinations, outdated information, vague generalizations ("most teams miss"), and slop phrases ("rapidly evolving landscape"). Deep research agents fix this by planning search strategies, searching the web, analyzing sources (e.g., YouTube videos, GitHub), filtering for relevance and trustworthiness, and synthesizing cited artifacts. This workshop builds one using MCP (Model Context Protocol) to expose tools for agentic reasoning, emphasizing a goal-directed loop: plan → search/inspect → pivot/refine → synthesize.
Key principle: Research demands high precision and recall while combating context rot (performance degradation beyond roughly 200k tokens, driven by lost-in-the-middle effects). Start simple: ask whether a prompt suffices, then escalate to RAG, workflows, or agents only when the task needs dynamic branching or reactions to an environment (e.g., the web). Common mistake: overbuilding multi-agent systems for fixed sequences, which adds unreliability without adding value.
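In code, that loop is small. A minimal sketch, where `plan_searches`, `web_search`, `assess_coverage`, and `synthesize` are hypothetical stand-ins for real LLM and tool calls:

```python
# Goal-directed research loop: plan -> search/inspect -> pivot/refine -> synthesize.
# plan_searches, web_search, assess_coverage, and synthesize are hypothetical
# stand-ins for real LLM and tool calls.
def deep_research(goal: str, max_rounds: int = 3) -> str:
    evidence: list[dict] = []
    queries = plan_searches(goal)                  # plan 3-5 targeted searches
    for _ in range(max_rounds):
        for query in queries:
            results = web_search(query)
            # filter aggressively for relevance/trustworthiness before keeping anything
            evidence.extend(r for r in results if r["relevance"] > 0.7)
        gaps = assess_coverage(goal, evidence)     # reflect: where is coverage shallow?
        if not gaps:
            break
        queries = plan_searches(goal, focus=gaps)  # pivot: refine the strategy
    return synthesize(goal, evidence)              # cited markdown artifact
```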
"Deep research is one of the best ways to learn how to build real AI systems because it forces you to combine reasoning, planning, autonomy, tools, grounding, and feedback loops."
Autonomy Slider: Match Workflows or Agents to Constraints
AI engineering balances cost/latency/quality/privacy via an "autonomy slider":
- Prompts: For known tasks; add few-shot examples.
- Context injection: Paste up to ~200k tokens directly, or use prompt caching for static docs.
- RAG/workflows: Fixed chains for sequential tasks (e.g., ticket classification → routing → drafting → validation; see the sketch after this list). Use routers for conditional branches, parallel calls for voting, and loops for judge feedback.
- Agents: For dynamic actions (planning tool use, reacting to results). Limit to one agent plus specialist tools (each with its own prompts/LLMs) to preserve global context.
- Multi-agents: Delegate only when you exceed ~20 tools or ~200k tokens of context; e.g., sub-agents for siloed security domains.
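For the ticket example above, the whole workflow can be a few deterministic lines; `classify`, `ROUTES`, `validate`, and `revise` are hypothetical single-LLM-call wrappers:

```python
# Fixed workflow with a router: no agent needed for a known sequence.
# classify, ROUTES, validate, and revise are hypothetical LLM-call wrappers.
def handle_ticket(ticket: str) -> str:
    category = classify(ticket)        # constrained output, e.g. "billing" | "bug"
    draft = ROUTES[category](ticket)   # router picks a format-specific drafter
    issues = validate(draft)           # cheap judge call closes the loop
    return revise(draft, issues) if issues else draft
```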
Tradeoffs: more autonomy means less control and higher cost. Example: a CRM marketing bot where the client wanted multi-agents (partly for grant appeal), but a sequential workflow (plan → retrieve client data → generate → validate) sufficed via one agent calling format-specific tools (SMS/email). Tools as "specialists" keep decisions centralized and avoid handoff errors.
Manage the context budget: trim, summarize, and retrieve selectively; delegate heavy work to tools or sub-agents. Stay lean to avoid context rot.
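One way to picture that budget guard, with a hypothetical `count_tokens`/`summarize` pair and the ~200k threshold from above:

```python
# Illustrative context-budget guard; count_tokens and summarize are hypothetical.
CONTEXT_BUDGET = 200_000  # rough threshold where context rot sets in

def fit_context(messages: list[str]) -> list[str]:
    if sum(count_tokens(m) for m in messages) <= CONTEXT_BUDGET:
        return messages
    # keep the goal (head) and the latest results (tail); compress the middle,
    # which is where lost-in-the-middle degradation hits hardest anyway
    head, middle, tail = messages[0], messages[1:-4], messages[-4:]
    return [head, summarize("\n".join(middle)), *tail]
```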
"We always want to use the simplest solution... if the model already knows enough about the task, you can just prompt it."
MCP Agent Architecture: Tools for Web, Video, Synthesis
An MCP server exposes the agent's tools:
- Setup: Register tools with schemas and descriptions. Use Gemini for grounded generation.
- Core tools:
  - Deep research: Prompt for a strategy (e.g., "Plan 3-5 searches on the topic, prioritize recent/authoritative sources"), call web search, and filter the results.
  - YouTube analysis: Transcribe, extract timestamps, summarize key segments, cite clips.
  - Compile research: Synthesize evidence into a markdown artifact with citations; self-evaluate relevance.
- Prompting: Teach the loop via few-shots (plan → execute → reflect). Workflow: goal → plan skills → execute → compile → output. Registration looks roughly like the sketch below.
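A minimal server sketch using the official MCP Python SDK's FastMCP helper; the tool bodies are placeholders for the real search, transcription, and synthesis logic:

```python
# Minimal MCP server sketch using the official Python SDK's FastMCP helper.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("deep-research")

@mcp.tool()
def deep_research(topic: str) -> str:
    """Plan 3-5 searches on the topic, prioritizing recent, authoritative
    sources; execute them and return filtered, cited findings."""
    ...  # placeholder: strategy prompt -> web search -> relevance filter

@mcp.tool()
def analyze_youtube(url: str) -> str:
    """Transcribe the video, extract timestamps, and summarize key segments."""
    ...  # placeholder: transcript -> segment summaries with timestamped citations

@mcp.tool()
def compile_research(evidence: str) -> str:
    """Synthesize evidence into a cited markdown artifact; self-evaluate relevance."""
    ...  # placeholder: synthesis prompt + self-evaluation pass

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```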
Live demo: Input "What is AI engineering?" → the agent plans searches (Towards AI, papers), analyzes videos, and outputs a cited report, pivoting on gaps (e.g., re-searching when results come back shallow).
Prerequisites: Python/TypeScript comfort, LLM APIs (Gemini/OpenAI). Fits early in product pipelines for content automation.
Quality criteria: grounded (citations), precise (no noise), iterative (feedback loops). Mistake: exhaustive scraping; filter aggressively for signal.
Constrained Writing: Evaluator-Optimizer Over Freeform Agents
Research is exploratory (agentic); writing is polish-focused (workflow). Pipe the research artifact to the writer:
- Guidelines: Explicit structure (intro/hook → sections → code/images → CTA), tone (practical, no hype), length (~500 words for LinkedIn).
- Few-shot prompting: 2-3 examples of good posts (grounded, opinionated, cited).
- Evaluator-optimizer loop (sketched after this list): the writer drafts → a reviewer scores (relevance, slop-free, value) → the optimizer revises. Repeat 2-3x.
- Post-skill: Generate images/code snippets if needed.
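A minimal sketch of that loop, where `draft_post`, `review`, and `revise` are hypothetical LLM-call wrappers that carry the guidelines and few-shot examples:

```python
# Evaluator-optimizer loop: draft -> score -> revise, 2-3 rounds.
# draft_post, review, and revise are hypothetical LLM-call wrappers.
def write_post(research: str, guidelines: str, rounds: int = 3) -> str:
    draft = draft_post(research, guidelines)
    for _ in range(rounds):
        scores = review(draft, criteria=("relevance", "slop-free", "value"))
        if all(s >= 8 for s in scores.values()):  # good enough: stop early
            break
        draft = revise(draft, scores, research)   # revise against the evidence
    return draft
```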
Why constrained? It reduces hallucinations and enforces brand voice. Demo: research on "AI engineering" → a polished post with runnable code and no "most teams" fluff.
"Writing quality often improves with tighter workflows, review loops, and explicit guidance."
Observability: Trace, Judge, Iterate with Metrics
Use Opik for tracing (visualize chains, tool calls, latencies). Build LLM Judge:
- Dataset: Curate input/output pairs (topics → gold research/writing).
- Metrics: F1-score on citations/relevance (judge prompts: "Rate 1-10 on groundedness, novelty").
- Eval loop: Run agent → judge → log failures → tune prompts/tools (see the sketch after this list).
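Opik's `@track` decorator is its real tracing API; `agent`, `llm_judge`, `dataset`, and `log_failure` below are hypothetical sketch pieces:

```python
# Trace the agent and a judge call with Opik, then run the eval loop.
# agent, llm_judge, dataset, and log_failure are hypothetical placeholders.
import opik

@opik.track
def run_research_agent(topic: str) -> str:
    return agent.run(topic)  # nested tool calls appear under this trace

@opik.track
def judge(topic: str, report: str) -> dict:
    return llm_judge(f"Rate 1-10 on groundedness and novelty.\n"
                     f"Topic: {topic}\nReport: {report}")

for example in dataset:  # curated topic -> gold research/writing pairs
    report = run_research_agent(example["topic"])
    scores = judge(example["topic"], report)
    if min(scores.values()) < 7:
        log_failure(example, report, scores)  # feed back into prompt/tool tuning
```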
Production tip: keep a human in the loop for edge cases; measure cost per task.
"The context grows and the performance degrades which we call context rot... manage this context budget."
Key Takeaways
- Start with the autonomy slider: prompts > workflows > single agent > multi-agents; the simplest option that works wins on reliability.
- Build research agents with MCP/tools for planning (strategy), execution (search/analyze), synthesis (cited markdown).
- Delegate via tools to fight context rot; keep the agent's context under ~200k tokens.
- For writing, use an evaluator-optimizer loop: few-shots plus review loops beat open-ended agents.
- Instrument everything: Opik traces + LLM Judge with F1 on datasets for continuous improvement.
- Prioritize precision/recall in search; filter noise early to avoid slop.
- Test in production: Build for utility (e.g., Towards AI courses), not demos.
- Exercise: Fork GitHub repo, run on your topic, eval F1 >0.8 before deploying.
Notable quotes:
- "Most people are interested in building agents, but most... are actually somewhat super simple workflows." (On over-engineering)
- "Tools as specialists but the global context stays within our only agent." (Single-agent advantage)
- "High quality technical content is expensive... automate most of this process as writer augmentation." (Business rationale)
- "It's a goal-directed research loop: one that can search, inspect, pivot, and progressively refine." (Core agent behavior)
- "AI products... combine all of that. They combine tools, workflows." (Holistic systems)