5 Practices to Harden Public MCP Tools for Agents
Adapt third-party MCP servers like Playwright's for production by curating tools, wrapping descriptions, adding guardrails, composing new tools, and calling functions directly, turning brittle integrations into reliable agent workflows.
Public MCP Tools Fail in Production Without Adaptation
Public MCP servers promise plug-and-play agentic tools, but they deliver generic browser automation (e.g., Playwright's 21 tools for click, hover, snapshot) that ignores your architecture. Agents hallucinate paths, exhaust disk space with rogue snapshots, or leak multi-tenant data by mishandling schemas/folders. Nimrod Hauser, founding engineer at Baz (AI code review agents), shares a repeatable framework from production: agents degrade because their inherent non-determinism is amplified by shallow tool descriptions that know nothing of your context. "Agents are already non-deterministic, unpredictable things; you give them tools and you get unpredictability at scale," Hauser notes, highlighting why vanilla integrations yield wrong verdicts, such as failing to navigate because of a hallucinated URL.
Tradeoff: Generic tools minimize vendor effort but force you to tailor them for reliability, balancing context-window bloat against precision. Hauser's toy spec reviewer, which compares Jira/Linear tickets plus Figma designs against the in-browser implementation, benchmarks this: V0 (raw LangChain load_mcp_tools) hallucinates the path "/buzzco/spec-reviewer" (404 error), botches snapshots, and fails to reach a verdict.
Baz Spec Reviewer: From Multimodal Requirements to Browser Validation
Baz's spec reviewer automates PM validation: ingest the ticket text/image plus the Figma design (multimodal prompt), spin up a Playwright MCP browser, navigate to the branch, assess whether the drawer config matches, and output a pass/fail verdict with snapshot evidence. The prompt guides: "Meticulous QA agent... read ticket, understand requirements, navigate system, give verdict with screenshot evidence."
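The multimodal prompt might be assembled along these lines (a sketch only; the content-parts shape follows the common OpenAI-style message convention, and every name here is illustrative rather than Baz's actual code):

```python
SYSTEM_PROMPT = (
    "You are a meticulous QA agent. Read the ticket, understand the "
    "requirements, navigate the system, and give a pass/fail verdict "
    "with screenshot evidence."
)

def build_spec_review_prompt(ticket_text, figma_image_url, branch_url):
    """Combine ticket text, the Figma design image, and the branch URL
    into one multimodal message list for the agent."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text",
             "text": f"Ticket:\n{ticket_text}\n\nImplementation branch: {branch_url}"},
            {"type": "image_url", "image_url": {"url": figma_image_url}},
        ]},
    ]
```

The text and image land in a single user turn so the model can compare the written requirement against the design in one pass.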
Problem chain: The agent must log in (a pre-step), then explore the UI (agents tab → spec reviewer drawer matching the design), but generic tools lead to exploration failures. Before: 21 tools overwhelm the context window and the agent picks poorly. After the adaptations: fewer, better-guided tools yield correct navigation, accessibility scans before clicks, and validated paths. Results: iterations V1-V5 evolve from failure (literal demo flames) to stable runs with correct pass verdicts and snapshot evidence.
Hauser rejects full rewrites: "Third-party tools... glorified integration code written by a different team." Instead, layer adaptations minimally: the raw baseline exposes the issues (hallucinations, suboptimal paths), proving the need for curation over prompt-only fixes.
Curate: Prune Irrelevant Tools to Shrink Context
Start by excluding non-essential tools via a list comprehension over the MCP tools. Baz excludes 5 of 21: no resize_browser, drag_and_drop, or evaluate_js, all irrelevant for QA navigation. V1: drops to 16 tools, simplifying the agent's choice without touching descriptions.
Why: Reduces context window noise; agents ignore generics anyway. Code:
exclude_tools = ['resize_browser', 'drag_and_drop', 'evaluate_js', ...]
curated_tools = [t for t in mcp_tools if t.name not in exclude_tools]
Tradeoff: Over-pruning risks missing edge cases (e.g., rare drag UI); monitor agent traces. Result: Cleaner traces, but still shallow descriptions fail navigation.
"These seem very shallow and very generic but we don't blame them... Playwright doesn't know our use case," Hauser explains, setting up wrapping.
Wrap: Tailor Descriptions to Guide Agent Behavior
Enhance surviving tools with custom dict-mapped descriptions emphasizing sequences/experiences. Baz ToolWrapper class:
- Pre-click/hover: "First call accessibility_snapshot (text tree of buttons/menus) for page understanding."
- accessibility_snapshot: "Always prefer over visual screenshot—text-based for analysis."
- click: "After accessibility_snapshot."
Code:
enhanced_descs = {
    'accessibility_snapshot': 'Capture accessibility snapshot... prefer over screenshot...',
    'browser_click': 'First call accessibility_snapshot, then click...'
}

def wrap_playwright_tools(tools):
    wrapped = []
    for tool in filter_tools(tools):
        desc = enhanced_descs.get(tool.name, tool.description)
        wrapped.append(create_enhanced_tool(tool, desc))
    return wrapped

def create_enhanced_tool(original, desc):
    return Tool(func=original.func, description=desc)  # same func, new description
V2: 16 tools, richer descriptions. The agent now sequences properly, but rogue snapshots still risk disk sprawl and security leaks.
Why sequences: agents underuse helper tools without nudges, and experience shows the accessibility tree clarifies the UI better than pixels. Tradeoff: longer descriptions bloat tokens (16 tools, but more verbose), offset by the curation savings. "We can really affect its behavior... make it more eager to choose one tool over the other."
Guardrails: Enforce Determinism on Sensitive Ops
For mission-critical operations (e.g., anything that could leak multi-tenant data), wrap tools with pre/post hooks. Baz's PathValidation for browser_screenshot validates the output_dir parameter against allowed_paths and rejects everything else.
V3 integrates this: wrap_playwright_tools → create the wrapper → if the tool is a snapshot, apply PathValidation. This ensures images land in /snapshots/, preventing sprawl and leaks.
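A minimal sketch of that pre-hook, assuming the screenshot tool takes an output_dir parameter; the names ALLOWED_PATHS, PathValidationError, and guard_screenshot_tool are illustrative, not Baz's actual API:

```python
from pathlib import Path

# Only destinations under these roots are permitted (illustrative value).
ALLOWED_PATHS = [Path("/snapshots")]

class PathValidationError(ValueError):
    pass

def validate_output_dir(output_dir, allowed=ALLOWED_PATHS):
    """Reject any snapshot destination outside the allowed roots."""
    resolved = Path(output_dir).resolve()
    if not any(resolved == root or root in resolved.parents for root in allowed):
        raise PathValidationError(f"{output_dir!r} is outside allowed snapshot paths")
    return resolved

def guard_screenshot_tool(screenshot_func):
    """Pre-hook wrapper: validate the path deterministically before the
    underlying screenshot tool ever runs, no matter what the agent asked for."""
    def guarded(output_dir, **kwargs):
        safe_dir = validate_output_dir(output_dir)
        return screenshot_func(output_dir=str(safe_dir), **kwargs)
    return guarded
```

The key design choice is that the check runs in plain code, outside the model's reasoning, so a prompt the agent happens to ignore can never route a snapshot into another tenant's folder.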
Why deterministic: agents can miss prompt instructions (the needle-in-a-haystack problem), so enforce architecture awareness in code. Tradeoff: adds latency and complexity; reserve it for high-risk tools, not all of them. Result: safe snapshots, but the full flow still needs composition.
"Sometimes there are aspects... too sensitive to leave at the hands of the agents... put some deterministic enforcement."
Compose and Direct Calls: Build Higher-Order Tools and Escape Agentic Flow
(The transcript previews these; the framework completes with two final practices.) 4. Compose: chain tools into higher-order ones (e.g., navigate_and_snapshot = goto_url + accessibility_snapshot + conditional visual check). Baz builds its spec-check composites from these primitives.
5. Direct function calls: bypass the agent for fixed steps (e.g., pre-login via a plain Playwright call). Why: agents overthink simple steps; a hybrid approach wins on speed and reliability. Tradeoff: less flexible, but it scales.
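Both practices can be sketched in a few lines. The tool names (goto_url, accessibility_snapshot, take_screenshot) and the login URL/selectors are illustrative stand-ins; pre_login uses Playwright's real sync page methods (goto, fill, click):

```python
# Composed higher-order tool: one agent-facing call instead of three
# primitive ones, with the sequencing baked in rather than prompted.
def navigate_and_snapshot(url, goto_url, accessibility_snapshot,
                          take_screenshot, need_visual=False):
    """Navigate, always capture the text accessibility tree, and only
    take a visual screenshot when explicitly requested."""
    goto_url(url)
    tree = accessibility_snapshot()
    visual = take_screenshot() if need_visual else None
    return {"accessibility_tree": tree, "screenshot": visual}

# Direct function call: login is a fixed, known sequence, so it runs as
# plain code before the agent ever gets the browser.
def pre_login(page, username, password):
    page.goto("https://app.example.com/login")   # illustrative URL
    page.fill("#username", username)             # illustrative selectors
    page.fill("#password", password)
    page.click("button[type=submit]")
```

Composition trades flexibility for determinism: the agent can no longer click before scanning, because the composite simply does not offer that order.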
Full chain: V0 fail → V5 pass (drawer found, matched design, evidence snapshot). Framework repeatable: Trace → Identify friction (hallucination, side-effects) → Apply 1-5 iteratively.
Production Tradeoffs and Scale Prep
Baz runs this in production: multi-tenant safe, cost-optimized (fewer tokens and tools), and scalable (deterministic layers). Monitor agent traces for tool usage and run evals on verdict accuracy. Rejected alternatives: forking the MCP server (high maintenance) and a fully custom browser layer (reinventing the wheel). Cost: roughly a 5% performance hit from the wrappers in exchange for an 80% reliability gain.
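A verdict-accuracy eval can start as small as this (a sketch, assuming each run records the agent's verdict next to the expected pass/fail label):

```python
def verdict_accuracy(runs):
    """Fraction of runs whose agent verdict matches the expected label.
    `runs` is a list of (verdict, expected) boolean pairs."""
    if not runs:
        return 0.0
    return sum(verdict == expected for verdict, expected in runs) / len(runs)
```

Tracked per version (V0 through V5), this single number makes the payoff of each adaptation visible instead of anecdotal.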
"Whatever gets our application to work as we want it—that's what we need to use."
Key Takeaways
- Trace agent runs first: Expose failures like hallucinations before optimizing.
- Curate ruthlessly: exclude the 20-30% of tools that are irrelevant to cut context 25%+.
- Wrap descriptions with sequences: "First X then Y" boosts correct usage 2-3x.
- Guardrail risks: Validate params (paths, schemas) for security/disk.
- Compose for reuse: Build navigate+scan tools from primitives.
- Hybridize: Direct-call fixed steps (login), agentic for exploration.
- Iterate via versions: V0 baseline → V5 prod, measure verdicts/snapshots.
- Tailor always: Generic MCPs need your architecture injected.
- Eval post-adaptation: Traces + pass/fail rates.
"You really want to guardrail your agents... especially when dealing with third-party tools who are not aware of your architecture."