Agentic Pipelines: Cache Keys Cut Token Bloat 95%
Intercept tool calls with a ToolOrchestrator that swaps large datasets for cache keys, keeping the LLM context to metadata only. This avoids 50k-token ping-pong, cuts token costs by over 95%, slashes latency, and frees the model for pure reasoning.
Ditch Naive Tool Chaining to Stop Token Ping-Pong
Naive agent setups dump raw data, such as a 50,000-token SQL result, directly into the LLM context, then pass it back out to Python sandboxes and chart engines, repeating this three or more times per turn. This chaining anti-pattern causes latency spikes measured in minutes, context overflows with max_tokens_exceeded errors, and cognitive degradation where the LLM hallucinates from data overload instead of reasoning.
The fix: treat the LLM as a traffic director, not a data courier. Use a session cache to store heavy payloads (e.g., 50MB datasets) and pass only lightweight cache keys (~10 tokens) like 'cache_key_alpha'. Tools fetch data internally via keys, process it, and return new keys, never bloating the prompt.
Result: context stays within a few dozen metadata tokens per turn, eliminating re-transmission of the same dataset.
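A minimal sketch of that exchange, assuming a dict-backed in-memory cache (the SessionCache class and its store helper are illustrative names, not a specific library API):

import uuid

class SessionCache:
    """Holds heavy tool payloads outside the LLM context."""
    def __init__(self):
        self._store = {}

    def write_to_cache(self, key: str, payload) -> None:
        self._store[key] = payload

    def read_raw_data(self, key: str):
        return self._store[key]

    def store(self, payload) -> str:
        # Mint a short, unique pointer the LLM can pass between tools.
        key = f"cache_key_{uuid.uuid4().hex[:8]}"
        self.write_to_cache(key, payload)
        return key

cache = SessionCache()
key = cache.store({"rows": "...50,000 tokens of SQL output..."})
# Only the ~10-token key ever enters the prompt:
print(f"SUCCESS: Raw data saved to {key}")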
Orchestrate Data Flow: SQL → Compute → Charts via Keys
Rewire the pipeline so tools communicate through the cache without LLM involvement (a sketch of the resulting tool-call trace follows this list):
- Agent calls execute_sql; middleware caches raw data, returns 'SUCCESS: Raw data saved to cache_key_alpha'.
- Agent generates Python script, calls execute_python(script, input_key='cache_key_alpha'); sandbox pulls data internally, runs regression, caches output as cache_key_beta, returns success pointer.
- Agent calls generate_chart(input_key='cache_key_beta'); engine renders UI from cache, delivers directly to user.
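Concretely, one turn's tool traffic might look like the trace below; every string that actually crosses the LLM context is a few tokens long (the exact argument shapes here are illustrative):

# Step 1: the agent asks for data; only a pointer comes back.
call_1 = {"tool": "execute_sql", "arguments": {"query": "SELECT * FROM sales"}}
result_1 = "SUCCESS: Raw data saved to cache_key_alpha"

# Step 2: the agent writes analysis code and passes the pointer, not the rows.
call_2 = {
    "tool": "execute_python",
    "arguments": {
        "script": "result = run_regression(data)",  # 'data' is injected by the sandbox
        "input_key": "cache_key_alpha",
    },
}
result_2 = "SUCCESS: Results saved to cache_key_beta"

# Step 3: the chart engine reads from cache and renders straight to the user.
call_3 = {"tool": "generate_chart", "arguments": {"input_key": "cache_key_beta", "chart_type": "line"}}
result_3 = "SUCCESS: Chart rendered."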
Benefits stack: token costs drop by over 95% (short strings instead of whole spreadsheets), time-to-first-token (TTFT) shrinks to near-instant, and the model focuses entirely on planning and synthesis since it never parses bulk JSON.
This mirrors pass-by-reference from computer science: functions share memory addresses, not copies, so the pattern scales to enterprise queries without collapse.
Build ToolOrchestrator: Middleware for Pointer Magic
Centralize with a ToolOrchestrator class that intercepts every tool call:
- Resolve pointers: Scan args for 'cache_key_*' strings, swap for raw cache data before execution.
- Execute + intercept: For execute_python, inject resolved data, run code, cache output, return tiny pointer like 'SUCCESS: Results saved to cache_key_sandbox_output'.
- Handle charts: Resolve data_json from key, generate UI payload.
Key code:
from typing import Dict, Any

class ToolOrchestrator:
    """Middleware that intercepts every tool call and swaps pointers for payloads."""

    def __init__(self, cache_manager, sandbox, chart_engine):
        self.cache = cache_manager
        self.sandbox = sandbox
        self.chart = chart_engine

    def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> str:
        # Swap any 'cache_key_*' arguments for the raw cached payloads.
        resolved_args = self._resolve_pointers(arguments)

        if tool_name == "execute_python":
            raw_output = self.sandbox.execute_python(
                code_string=resolved_args.get("script"),
                # After resolution, 'input_key' holds the raw dataset, not the key.
                injected_data=resolved_args.get("input_key"),
            )
            # Cache the heavy output; the LLM receives only a tiny pointer.
            # (A fixed key is used for brevity; real code would mint unique keys.)
            new_key = "cache_key_sandbox_output"
            self.cache.write_to_cache(new_key, raw_output)
            return f"SUCCESS: Python execution complete. Results saved to '{new_key}'."

        elif tool_name == "generate_chart":
            # The chart engine renders from cached data and delivers the UI
            # payload directly to the user; the LLM never sees it.
            self.chart.generate_chart(
                data_json=resolved_args.get("input_key"),
                chart_type=resolved_args.get("chart_type"),
                title="Requested Analysis",
            )
            return "SUCCESS: Chart rendered."

        raise ValueError(f"Unknown tool: {tool_name}")

    def _resolve_pointers(self, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Replace 'cache_key_*' string values with their raw cached data."""
        resolved = {}
        for key, value in arguments.items():
            if isinstance(value, str) and value.startswith("cache_key_"):
                resolved[key] = self.cache.read_raw_data(value)
            else:
                resolved[key] = value
        return resolved
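To see the orchestrator end to end, here is a minimal wiring with stub dependencies (InMemoryCache, StubSandbox, and StubChartEngine are hypothetical stand-ins for the real cache, sandbox, and chart tools):

class InMemoryCache:
    def __init__(self):
        self._store = {}
    def write_to_cache(self, key, data):
        self._store[key] = data
    def read_raw_data(self, key):
        return self._store[key]

class StubSandbox:
    def execute_python(self, code_string, injected_data):
        # Pretend to run the script against the injected dataset.
        return {"rows_in": len(injected_data), "model": "regression"}

class StubChartEngine:
    def generate_chart(self, data_json, chart_type, title):
        return {"title": title, "type": chart_type, "data": data_json}

cache = InMemoryCache()
cache.write_to_cache("cache_key_alpha", [1, 2, 3, 4, 5])  # stand-in for SQL output

orchestrator = ToolOrchestrator(cache, StubSandbox(), StubChartEngine())
print(orchestrator.execute_tool(
    "execute_python",
    {"script": "result = run_regression(data)", "input_key": "cache_key_alpha"},
))
# -> SUCCESS: Python execution complete. Results saved to 'cache_key_sandbox_output'.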
Integrate with prior cache/sandbox/chart tools from the Cognitive Agent Architecture series for full enterprise agents.