ToolSense: A Diagnostic Framework for Auditing LLM Tool Knowledge

The Need for Tool-Specific Diagnostics

As LLMs are increasingly integrated into agentic workflows, their ability to select and use external tools (APIs, functions, or software) has become a critical performance bottleneck. Current evaluation methods often rely on end-to-end task success, which obscures whether a failure stems from poor reasoning, hallucinated parameters, or a fundamental lack of 'parametric tool knowledge': the model's internal understanding of what a tool does and how to invoke it correctly.

The ToolSense Framework

ToolSense introduces a diagnostic approach to isolate and measure this knowledge. Instead of treating the model as a black box, the framework audits the model's internal representation of tool schemas. It systematically tests:

Tool Selection Accuracy: Can the model identify the correct tool from a large library of candidates?
Parameter Mapping: Does the model correctly map natural language intent to the specific, often rigid, schema requirements of an API?
Constraint Adherence: Can the model respect the boundaries of a tool's functionality without attempting to 'over-engineer' or hallucinate non-existent features?
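
To make the three checks concrete, here is a minimal sketch of how a single model-generated tool call might be scored against a reference call. It is an illustration under stated assumptions, not the ToolSense implementation: the `ToolCall` type, the `audit_call` function, the `get_weather` tool, and its schema are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """A tool invocation: the tool's name plus its keyword arguments."""
    tool: str
    args: dict


# Hypothetical schema for an illustrative tool:
# parameter name -> (expected Python type, is the parameter required?)
WEATHER_SCHEMA = {
    "city": (str, True),
    "units": (str, False),
}


def audit_call(predicted: ToolCall, expected: ToolCall, schema: dict) -> dict:
    """Score one predicted call on the three axes described above."""
    # Tool selection: did the model pick the reference tool?
    selection_ok = predicted.tool == expected.tool

    # Parameter mapping: every required parameter is present with the right
    # type, and every reference argument value is reproduced exactly.
    required_ok = all(
        name in predicted.args and isinstance(predicted.args[name], typ)
        for name, (typ, required) in schema.items()
        if required
    )
    values_ok = all(predicted.args.get(k) == v for k, v in expected.args.items())

    # Constraint adherence: the model must not invent parameters that the
    # schema does not define.
    hallucinated = sorted(set(predicted.args) - set(schema))

    return {
        "selection": selection_ok,
        "parameter_mapping": required_ok and values_ok,
        "constraint_adherence": not hallucinated,
        "hallucinated_params": hallucinated,
    }


# Usage: the model chose the right tool and city but invented a parameter.
reference = ToolCall("get_weather", {"city": "Paris"})
model_output = ToolCall("get_weather", {"city": "Paris", "forecast_days": 7})
print(audit_call(model_output, reference, WEATHER_SCHEMA))
```

Reporting the three axes separately, rather than a single pass/fail flag, is what lets a failure be attributed to selection, mapping, or an invented feature.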

By decoupling tool knowledge from general reasoning, developers can identify whether a model needs better documentation in its system prompt, a more refined tool definition, or whether it simply lacks the training data to handle specific technical interfaces.

Practical Implications for AI Engineering

This framework shifts the focus from 'prompt engineering' to 'tool engineering.' By auditing tool knowledge, builders can:

Optimize Context Windows: Include only the most relevant tool definitions, since ToolSense helps identify which tools the model is already proficient with and which it struggles to parse.
Improve Reliability: By identifying specific failure modes in parameter mapping, developers can implement targeted validation layers or few-shot examples specifically for the problematic tools.
Model Selection: Use the diagnostic results to determine which models are better suited for tool-heavy environments versus those that excel at creative reasoning but fail at structured API interaction.
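
As a rough sketch of how per-tool diagnostic results could feed these decisions, the snippet below aggregates pass/fail audit records into a pass rate per tool and splits the tools into those the model handles well and those that may need few-shot examples or a validation layer. The function names, the record format, and the 0.9 threshold are assumptions made for illustration; the source does not specify how ToolSense reports its results.

```python
from collections import defaultdict


def per_tool_pass_rates(records):
    """Aggregate (tool_name, passed) audit records into a pass rate per tool."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tool, passed in records:
        totals[tool] += 1
        passes[tool] += int(passed)
    return {tool: passes[tool] / totals[tool] for tool in totals}


def split_by_proficiency(rates, threshold=0.9):
    """Separate tools at or above the threshold from those below it.

    The threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    proficient = sorted(t for t, r in rates.items() if r >= threshold)
    struggling = sorted(t for t, r in rates.items() if r < threshold)
    return proficient, struggling


# Usage with made-up audit records for two hypothetical tools.
audit_records = [
    ("get_weather", True),
    ("get_weather", True),
    ("create_invoice", True),
    ("create_invoice", False),
]
rates = per_tool_pass_rates(audit_records)
proficient, struggling = split_by_proficiency(rates)
print(proficient, struggling)
```

Running the same aggregation over several models gives a like-for-like comparison for model selection, since each model is scored on the same tools and the same records.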
