5 Practices to Harden Public MCP Tools for Agents

Adapt third-party MCP servers like Playwright's for production by curating tools, wrapping descriptions, adding guardrails, composing higher-order tools, and making direct function calls, turning brittle integrations into reliable agent workflows.

Public MCP Tools Fail in Production Without Adaptation

Public MCP servers promise plug-and-play agentic tools, but they deliver generic browser automation (e.g., Playwright's 21 tools for click, hover, snapshot) that ignores your architecture. Agents hallucinate paths, exhaust disk space with rogue snapshots, or leak multi-tenant data by mishandling schemas and folders. Nimrod Hauser, founding engineer at Baz (AI code review agents), shares a repeatable framework from production: agents degrade from non-determinism amplified by shallow tool descriptions unaware of your context. "Agents are already non-deterministic, unpredictable things; you give them tools and you get unpredictability at scale," Hauser notes, highlighting why vanilla integrations yield wrong verdicts, like failing to navigate due to hallucinated URLs.

Tradeoff: Generic tools minimize vendor effort but force you to tailor for reliability, balancing context window bloat against precision. Hauser's toy spec reviewer—comparing Jira/Linear tickets + Figma designs against the browser implementation—benchmarks this: V0 (raw LangChain load_mcp_tools) hallucinates "/buzzco/spec-reviewer" (a 404), botches snapshots, and returns the wrong verdict.

Baz Spec Reviewer: From Multimodal Requirements to Browser Validation

Baz's spec reviewer automates PM validation: ingest ticket text/image + Figma design (multimodal prompt), spin up a Playwright MCP browser, navigate to the branch, assess whether the drawer config matches, and output pass/fail plus snapshot evidence. The prompt guides: "Meticulous QA agent... read ticket, understand requirements, navigate system, give verdict with screenshot evidence."

Problem chain: The agent must log in (a pre-step), then explore the UI (agents tab → spec reviewer drawer matching the design), but generic tools lead to exploration failures. Before: 21 tools overwhelm the context; the agent picks poorly. After adaptations: fewer, guided tools yield correct navigation, accessibility scans before clicks, and validated paths. Results: iterations V1-V5 evolve from fire (literal demo flames) to stable runs and correct pass verdicts with evidence.

Hauser rejects full rewrites: "Third-party tools... glorified integration code written by a different team." Instead, layer minimally: the baseline exposes the issues (hallucinations, suboptimal paths), proving the need for curation over prompt-only fixes.

Curate: Prune Irrelevant Tools to Shrink Context

Start by excluding non-essential tools via a list comprehension over the MCP tools. Baz filters out 5 of 21: resize_browser, drag_and_drop, evaluate_js, and others irrelevant for QA navigation. V1 drops to 16 tools, simplifying choice without changing descriptions.

Why: Reduces context window noise; agents ignore generics anyway. Code:

# Non-essentials for QA navigation; a set gives O(1) membership checks.
# (The full exclusion list from the talk is abbreviated here.)
exclude_tools = {'resize_browser', 'drag_and_drop', 'evaluate_js'}
curated_tools = [t for t in mcp_tools if t.name not in exclude_tools]

Tradeoff: Over-pruning risks missing edge cases (e.g., a rare drag-based UI); monitor agent traces. Result: cleaner traces, but the still-shallow descriptions fail navigation.
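One lightweight way to watch for over-pruning, sketched here with a hypothetical trace format (each trace is a list of tool-call names, with attempts at pruned tools assumed to be logged as well), is to count which tools the agent actually uses across runs:

```python
from collections import Counter

def audit_tool_usage(traces, available, excluded):
    """Hypothetical trace audit: flag never-used tools as pruning candidates
    and flag pruned tools the agent still reached for (possible over-pruning)."""
    usage = Counter(call for trace in traces for call in trace)
    never_used = [t for t in available if usage[t] == 0]
    missed = [t for t in excluded if usage[t] > 0]  # agent tried a pruned tool
    return {"usage": usage, "never_used": never_used, "missed": missed}
```

Run this over a batch of agent traces after each pruning pass; a pruned tool showing up in `missed` is a signal to restore it.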

"These seem very shallow and very generic, but we don't blame them... Playwright doesn't know our use case," Hauser explains, setting up the wrapping step.

Wrap: Tailor Descriptions to Guide Agent Behavior

Enhance the surviving tools with custom dict-mapped descriptions emphasizing call sequences and hard-won experience. Baz's ToolWrapper class encodes:

  • Pre-click/hover: "First call accessibility_snapshot (text tree of buttons/menus) for page understanding."
  • accessibility_snapshot: "Always prefer over visual screenshot—text-based for analysis."
  • click: "After accessibility_snapshot."

Code:

enhanced_descs = {
    'accessibility_snapshot': 'Capture accessibility snapshot... prefer over screenshot...',
    'browser_click': 'First call accessibility_snapshot, then click...',
}

def wrap_playwright_tools(tools):
    wrapped = []
    for tool in filter_tools(tools):  # curated subset from step 1
        # Fall back to the vendor description when no override exists.
        desc = enhanced_descs.get(tool.name, tool.description)
        wrapped.append(create_enhanced_tool(tool, desc))
    return wrapped

def create_enhanced_tool(original, desc):
    # Same name and underlying function; only the description the agent sees changes.
    return Tool(name=original.name, func=original.func, description=desc)

V2: 16 tools with richer descriptions. The agent now sequences properly, but rogue snapshot paths still risk disk exhaustion and security leaks.

Why sequences: Agents underuse helpers without nudges; experience shows the accessibility tree clarifies the UI. Tradeoff: longer descriptions bloat tokens (16 tools, but verbose), offset by the earlier curation. "We can really affect its behavior... make it more eager to choose one tool over the other."

Guardrails: Enforce Determinism on Sensitive Ops

For mission-critical operations (e.g., preventing multi-tenant leaks), wrap tools with pre/post hooks. Baz's PathValidation for browser_screenshot validates the output_dir param against allowed_paths and rejects anything else.

V3 integrates it: wrap_playwright_tools → create wrapper → if snapshot, apply PathValidation. This ensures images land in /snapshots/, preventing sprawl and leaks.
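A minimal sketch of such a pre-hook guardrail, assuming a single sanctioned /snapshots/ root; validate_output_dir and guarded_screenshot are illustrative names, not Baz's actual PathValidation API:

```python
from pathlib import Path

# Assumption: the only directory screenshots are allowed to land in.
ALLOWED_DIRS = [Path("/snapshots")]

def validate_output_dir(output_dir: str) -> Path:
    """Deterministic check: resolve the requested directory and enforce the allow-list."""
    resolved = Path(output_dir).resolve()
    if not any(resolved == d or d in resolved.parents for d in ALLOWED_DIRS):
        raise ValueError(f"screenshot path {resolved} is outside allowed dirs")
    return resolved

def guarded_screenshot(screenshot_func, output_dir: str, **kwargs):
    """Pre-hook wrapper: validate the path before delegating to the real tool."""
    target = validate_output_dir(output_dir)
    return screenshot_func(output_dir=str(target), **kwargs)
```

Because the check runs in code rather than in the prompt, the agent cannot talk its way past it, which is the point of a deterministic guardrail.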

Why deterministic: Agents ignore prompts (needle-in-haystack); enforce architecture awareness. Tradeoff: Adds latency/complexity; only for high-risk (not all tools). Result: Safe snapshots, but full flow needs composition.

"Sometimes there are aspects... too sensitive to leave at the hands of the agents... put some deterministic enforcement."

Compose and Direct Calls: Build Higher-Order Tools and Escape Agentic Flow

The transcript previews the final two practices, completing the framework:

  4. Compose: Chain tools into new higher-order ones (e.g., navigate_and_snapshot = goto_url + accessibility_snapshot + conditional visual screenshot). Baz creates spec-check composites from primitives.
  5. Direct function calls: Bypass the agent for fixed steps (e.g., pre-login via a plain Playwright call). Why: agents overthink simple steps; a hybrid approach wins on speed and reliability. Tradeoff: less flexible, but it scales.
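A composition along these lines might look like the following sketch; make_navigate_and_snapshot and its parameters are hypothetical stand-ins for the curated primitives:

```python
def make_navigate_and_snapshot(goto_url, accessibility_snapshot, take_screenshot=None):
    """Fuse navigation, accessibility scan, and an optional visual screenshot
    into one higher-order tool the agent can call atomically."""
    def navigate_and_snapshot(url: str, want_visual: bool = False) -> dict:
        goto_url(url)                        # 1. deterministic navigation
        tree = accessibility_snapshot()      # 2. text tree for analysis
        result = {"url": url, "accessibility_tree": tree}
        if want_visual and take_screenshot is not None:
            result["screenshot"] = take_screenshot()  # 3. optional visual evidence
        return result
    return navigate_and_snapshot
```

One composite call replaces a three-step sequence the agent would otherwise have to plan itself, removing two chances to go off-script.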

Full chain: V0 fail → V5 pass (drawer found, matched design, evidence snapshot). Framework repeatable: Trace → Identify friction (hallucination, side-effects) → Apply 1-5 iteratively.
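The hybrid pattern of a deterministic pre-login followed by an agentic hand-off might be sketched like this; page, agent, and the selectors are illustrative (goto/fill/click mirror Playwright's sync API, but any browser client works):

```python
def run_spec_review(page, agent, username, password, branch_url):
    """Hypothetical hybrid flow: fixed steps run as plain function calls,
    and only the open-ended exploration is delegated to the agent."""
    # Deterministic pre-step: log in directly, no agent reasoning involved.
    page.goto(branch_url)
    page.fill("#username", username)
    page.fill("#password", password)
    page.click("button[type=submit]")
    # Agentic step: exploration, drawer matching, and the verdict.
    return agent.run("Validate the spec-reviewer drawer against the design, "
                     "with screenshot evidence.")
```

The login never varies, so there is no reason to spend tokens (or risk hallucination) letting the agent rediscover it on every run.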

Production Tradeoffs and Scale Prep

Baz runs this in production: multi-tenant safe, cost-optimized (fewer tokens and tools), and scalable (deterministic layers). Monitor agent traces for tool usage and run evals on verdict accuracy. Rejected alternatives: forking the MCP server (high maintenance) and a fully custom browser layer (reinventing the wheel). Cost: roughly a 5% performance hit from the wrappers for an 80% gain in reliability.

"Whatever gets our application to work as we want it—that's what we need to use."

Key Takeaways

  • Trace agent runs first: Expose failures like hallucinations before optimizing.
  • Curate ruthlessly: List/exclude 20-30% irrelevant tools to cut context 25%+.
  • Wrap descriptions with sequences: "First X then Y" boosts correct usage 2-3x.
  • Guardrail risks: Validate params (paths, schemas) for security/disk.
  • Compose for reuse: Build navigate+scan tools from primitives.
  • Hybridize: Direct-call fixed steps (login), agentic for exploration.
  • Iterate via versions: V0 baseline → V5 prod, measure verdicts/snapshots.
  • Tailor always: Generic MCPs need your architecture injected.
  • Eval post-adaptation: Traces + pass/fail rates.

"You really want to guardrail your agents... especially when dealing with third-party tools who are not aware of your architecture."

Video description
Public MCP servers often look ready-to-use, until the reality of production hits. You might find your agents ignoring perfectly good tools, unwanted side-effects exhausting your container's disk space, or worse, security concerns like multi-tenant leaks wreaking havoc. What begins as a "simple integration" can quickly become a source of friction and unexpected failure. In this talk, we'll share a hands-on guide to adapting third-party MCP servers for real-world applications. You'll learn practical processes to identify friction points and strategies to modify MCP servers so they integrate seamlessly with your specific agents and architecture. Real-world lessons, trade-offs, and production-tested solutions included. Using a concrete example, we'll walk through the journey of transforming a brittle setup into production-ready infrastructure. We'll cover editing tool definitions, optimizing agentic context, and layering deterministic validations, all while preparing for scale. This iterative debugging process will provide you with a repeatable framework to make any MCP integration resilient, secure, and production-ready.

Nimrod Hauser - Founding Software Engineer, Baz

Nimrod is a Principal Engineer at Baz, building AI-powered code review agents. A “jack of all trades” across backend, data engineering, and data science, he has worked at the intersection of software and data throughout his career. He began as a data analyst in the military, helped lay the foundations of Salesforce’s Einstein platform, and later became the first data scientist at cybersecurity startup BlueVoyant. He went on to lead data and architecture at Solidus Labs in the crypto-regulation space before joining Baz. Nimrod thrives on building systems from scratch and turning ideas into scalable products.

Socials:
https://www.linkedin.com/in/nimrod-hauser-03776a31/
https://x.com/NimrodHauser

Slides: https://prezi.com/view/TSBwBXLNcXzzWrLbRiit/?referral_token=4jzLrblnB3FN

