The Problem: MCP Token Bloat
Model Context Protocol (MCP) servers are designed for fidelity, not frugality. They often return massive JSON blobs or full-page scrapes that exceed the context window of the LLM consuming them. This creates two distinct token sinks: the "menu tax" (the tools/list schema injected into context every turn) and the response payload itself. A single call can easily consume hundreds of thousands of tokens, causing agents to crash or suffer from degraded performance.
Measuring the Cost
To manage token usage, you must first measure it. You can build a simple harness using tiktoken to count tokens in both the schema definitions and the actual tool outputs. The author's testing revealed that some MCP servers have a schema tax nearly 3x the size of their actual tool responses. By running a probe against each server, you can identify which tools are the most expensive and prioritize them for optimization.
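A minimal sketch of such a harness follows. It assumes you have already captured the `tools/list` result and a few representative tool outputs as Python objects; the sample data, the `measure` helper, and the choice of the `cl100k_base` encoding are illustrative assumptions, not the author's actual probe. If tiktoken is unavailable, the sketch falls back to a rough four-characters-per-token estimate.

```python
import json

try:
    import tiktoken

    _enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; pick the one matching your model

    def count_tokens(text: str) -> int:
        # disallowed_special=() treats strings like "<|endoftext|>" as plain text
        return len(_enc.encode(text, disallowed_special=()))

except Exception:
    # Fallback when tiktoken (or its encoding file) is unavailable:
    # a crude estimate of roughly 4 characters per token.
    def count_tokens(text: str) -> int:
        return (len(text) + 3) // 4


def measure(tools_list: dict, sample_responses: dict) -> dict:
    """Count tokens in the serialized tools/list result and in each sample response."""
    return {
        "schema_tokens": count_tokens(json.dumps(tools_list)),
        "response_tokens": {
            name: count_tokens(body) for name, body in sample_responses.items()
        },
    }


# Hypothetical captured data, standing in for a real server's output.
SAMPLE_TOOLS_LIST = {
    "tools": [
        {
            "name": "browser_snapshot",
            "description": "Capture a snapshot of the current page.",
            "inputSchema": {"type": "object", "properties": {}},
        },
        {
            "name": "browser_click",
            "description": "Click an element on the page.",
            "inputSchema": {
                "type": "object",
                "properties": {"ref": {"type": "string"}},
                "required": ["ref"],
            },
        },
    ]
}
SAMPLE_RESPONSES = {
    "browser_snapshot": "lorem ipsum " * 500,
    "browser_click": "ok",
}

report = measure(SAMPLE_TOOLS_LIST, SAMPLE_RESPONSES)
# Most expensive tools first, so you know where to optimize.
ranked = sorted(report["response_tokens"].items(), key=lambda kv: kv[1], reverse=True)

print("schema tokens per turn:", report["schema_tokens"])
for name, tokens in ranked:
    print(f"{name}: {tokens} tokens")
```

Comparing `schema_tokens` against the per-tool counts shows whether the menu or the payloads dominate for a given server.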
Strategies for Reduction
- Shorten the Menu: The most effective way to reduce the per-turn schema tax is to unmount servers you don't need or use fine-grained tool-denial configurations in your client (e.g., Cursor or Claude Code) to remove unused tools from the context entirely.
- Native Limiting: Check if the MCP server supports pagination, field filtering, or native output modes (like `file` output in Playwright MCP) to limit the data returned at the source.
- Token-Budgeting Proxy: When native limits are insufficient, deploy a thin proxy server between your client and the MCP server. This proxy forwards calls upstream and post-processes responses using a "strip, spill, or pass" strategy.
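To illustrate why shortening the menu pays off, the sketch below filters a `tools/list` result down to an allowlist and compares the serialized size before and after. Real clients do this through their own configuration; the `prune_tools` helper, the tool names, and the allowlist here are hypothetical and only show the effect.

```python
import json


def prune_tools(tools_list: dict, allowed: set) -> dict:
    """Return a copy of a tools/list result containing only the allowed tool names."""
    kept = [t for t in tools_list.get("tools", []) if t.get("name") in allowed]
    return {**tools_list, "tools": kept}


# Hypothetical menu with three tools, each carrying a long description.
FULL_MENU = {
    "tools": [
        {
            "name": name,
            "description": "Detailed usage notes. " * 20,
            "inputSchema": {"type": "object", "properties": {}},
        }
        for name in ("browser_navigate", "browser_snapshot", "browser_pdf_save")
    ]
}

pruned = prune_tools(FULL_MENU, allowed={"browser_navigate", "browser_snapshot"})

before = len(json.dumps(FULL_MENU))
after = len(json.dumps(pruned))
print(f"menu size: {before} -> {after} characters")
```

Because the menu is injected on every turn, whatever this removes is saved on every turn of every conversation.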
Implementing a Token-Budgeting Proxy
The proxy acts as an impedance-matcher. It intercepts the response, strips known noise (like base64 images or long tracking URLs), and checks the remaining token count against a hard budget (e.g., 8,000 tokens). If the payload is still too large, the proxy spills the full content to disk and returns a preview plus a file path. This allows the LLM to see a summary immediately and perform follow-up actions like grep or partial reads on the full file if needed.
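The strip, spill, or pass decision can be sketched as a single post-processing function. This is a simplified illustration under stated assumptions, not the author's proxy: the regexes, the `budget_response` name, the preview length, the decision to spill the cleaned text rather than the raw payload, and the four-characters-per-token estimate are all hypothetical choices. A real proxy would also need to forward calls upstream, which is omitted here.

```python
import hashlib
import os
import re
import tempfile

# Inline base64 data URIs (e.g. embedded screenshots).
DATA_URI = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=]+")
# Common tracking parameters; may leave a trailing "?" or "&" on some URLs.
TRACKING = re.compile(r"(?<=[?&])(?:utm_\w+|fbclid|gclid)=[^&\s]*&?")


def estimate_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: about 4 characters per token.
    return (len(text) + 3) // 4


def strip_noise(text: str) -> str:
    """Strip step: drop data URIs and tracking parameters."""
    text = DATA_URI.sub("[data-uri removed]", text)
    return TRACKING.sub("", text)


def budget_response(text, budget=8000, spill_dir=None,
                    count_tokens=estimate_tokens, preview_chars=2000):
    """Strip noise, then pass the result through or spill it to disk."""
    cleaned = strip_noise(text)
    tokens = count_tokens(cleaned)
    if tokens <= budget:
        return {"mode": "pass", "content": cleaned}

    # Spill step: write the cleaned payload to disk, return a preview and a path.
    spill_dir = spill_dir or tempfile.mkdtemp(prefix="mcp_spill_")
    os.makedirs(spill_dir, exist_ok=True)
    name = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()[:16] + ".txt"
    path = os.path.join(spill_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(cleaned)
    return {
        "mode": "spill",
        "content": cleaned[:preview_chars],
        "path": path,
        "tokens": tokens,
    }


result = budget_response("word " * 50000)
print(result["mode"], result.get("path"))
```

On a spill, the model receives only the preview and the path, and can follow up with a grep or a partial read of the file.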
Key Takeaways
- Measure before optimizing: Use a script to count tokens for both `tools/list` and representative tool calls to identify your biggest sinks.
- Aggressively prune schemas: If you aren't using a tool, remove it from your client configuration to save context space every turn.
- Use a proxy for hard limits: When a server returns unmanageable data, a middleware proxy can enforce a hard token budget by spilling excess data to disk.
- Leverage file-based workflows: Modern LLM agents can read files on disk through their client's tools; use this to your advantage by saving large tool outputs to a local directory rather than streaming them into the context window.
- Strip noise first: Simple regex-based cleaning of tracking parameters and data URIs can often reduce payload sizes significantly before you even reach the budget threshold.
Notable Quotes
- "MCP servers are essentially APIs for LLMs, but the response still has to fit in a buffer with a hard size limit (the context window). Our token-budgeting proxy sits between those two worlds as an impedance-matcher."
- "The `tools/list` cost is the one that absolutely horrified me — this schema or ‘menu’ tax is something you pay every turn, on every conversation, forever."
- "It’s strictly better to limit at the source than to download 300k tokens and throw most away."
- "The trade-off of this approach is that the model must be smart enough to route through the `call` tool rather than calling the MCP’s tools directly, and — when a result spills — to follow up with a grep or a partial read."