Inspect: Framework for Robust LLM Evaluations
Build LLM evals from datasets of input/target pairs, chain solvers such as chain-of-thought and self-critique, score outputs via model grading, and run them across 20+ providers from the CLI or Python.
Construct Evaluations Using Modular Tasks
Define evaluations as @task-decorated functions returning a Task with three components: a dataset, a solver (or chain of solvers), and a scorer. Datasets are tables of input prompts and target answers or grading guidance. For the Sally-Anne false belief test, for example, inputs describe object movements like "Jackson entered the hall... Chloe moved the boots to the pantry," with targets such as "bathtub" or "pantry." This setup tests theory-of-mind reasoning.
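A minimal sketch of such a task, closely following the theory-of-mind example from the Inspect documentation (the example_dataset() helper ships with the package; adjust imports to your installed version):

    from inspect_ai import Task, task
    from inspect_ai.dataset import example_dataset
    from inspect_ai.scorer import model_graded_fact
    from inspect_ai.solver import chain_of_thought, generate, self_critique

    @task
    def theory_of_mind():
        return Task(
            # built-in example dataset of Sally-Anne-style samples
            dataset=example_dataset("theory_of_mind"),
            # solver chain: reason step by step, generate, then refine
            solver=[chain_of_thought(), generate(), self_critique()],
            # grade the final answer against the target with another model
            scorer=model_graded_fact(),
        )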
Chain solvers to process inputs: chain_of_thought() elicits step-by-step reasoning, generate() calls the model, and self_critique() refines outputs. Scorers like model_graded_fact() use another model to grade factual accuracy against targets, producing aggregate metrics. Reuse tasks across models and configurations by overriding attributes via task_with() or CLI flags, enabling flexible runtime configuration without code changes.
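A sketch of reuse via overrides, building on the task defined above and assuming task_with() accepts the same attributes as Task (dataset, solver, scorer, config); the CLI line is illustrative:

    from inspect_ai import eval, task_with
    from inspect_ai.solver import generate

    # same dataset and scorer, but drop chain-of-thought/self-critique
    baseline = task_with(theory_of_mind(), solver=generate())
    eval(baseline, model="anthropic/claude-sonnet-4-0")

    # or run the original task against a different model from the CLI:
    #   inspect eval theory_of_mind.py --model anthropic/claude-sonnet-4-0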
This modularity scales from simple prompts to agentic workflows: adapt datasets from CSV/JSON sources, include multimodal data (images/audio/video), and handle long contexts with compaction to fit model windows.
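For example, a CSV source can be mapped onto Inspect samples with a FieldSpec; the file and column names here (qa.csv, question, answer) are hypothetical:

    from inspect_ai.dataset import FieldSpec, csv_dataset

    # map CSV columns onto Inspect's input/target/id sample fields
    dataset = csv_dataset(
        "qa.csv",
        FieldSpec(input="question", target="answer", id="id"),
    )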
Execute Evals Seamlessly Across Providers
Install via pip install inspect-ai, set API keys (e.g., OPENAI_API_KEY), then run inspect eval script.py --model openai/gpt-4o. Supports 20+ providers out of the box: OpenAI (gpt-4o), Anthropic (claude-sonnet-4-0), Google (gemini-2.5-pro), xAI (grok-3-mini), Mistral, Hugging Face (Llama-2-7b), plus AWS Bedrock, Azure, TogetherAI, Groq, vLLM, and Ollama. Use batch mode for cost savings, caching to skip repeat calls, and limits on tokens/time/cost to control spend.
From Python: eval(task(), model="openai/gpt-4o"). Parallel execution with a configurable number of async workers respects rate limits while yielding high throughput on local or cloud setups. Errors are retried automatically, and early stopping halts on convergence.
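A sketch of a programmatic run of the task defined earlier; limit, max_connections, and the per-sample token/time limits are assumed keyword arguments of eval() here, so check the signature of your installed version:

    from inspect_ai import eval

    logs = eval(
        theory_of_mind(),
        model="openai/gpt-4o",
        limit=50,             # evaluate only the first 50 samples
        max_connections=10,   # cap concurrent requests to respect rate limits
        token_limit=20_000,   # per-sample token budget
        time_limit=300,       # per-sample wall-clock budget in seconds
    )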
Debug and Scale with Logs, Tools, and Agents
Logs save to ./logs/ with sample traces, messages, and events; inspect view launches a browser viewer for metrics, per-sample inspection, and filtering. The VS Code extension integrates running, tuning, and visualization. Extract dataframes for analysis: evals_df() pulls per-eval scores and samples_df() returns per-sample inputs and outputs.
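A sketch of post-hoc analysis using read_eval_log() from the log API; the log path is illustrative, and the attribute names follow the EvalLog schema, which can vary slightly across versions:

    from inspect_ai.log import read_eval_log

    log = read_eval_log("./logs/theory_of_mind.eval")   # path is illustrative
    print(log.status)                                   # e.g. "success"
    for score in log.results.scores:
        print(score.name, score.metrics)                # aggregate metrics per scorer
    for sample in log.samples or []:
        print(sample.input, sample.target, sample.scores)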
For agents, use the built-in ReAct agent for planning, tool use, and memory on long-horizon tasks; compose multi-agent systems or bridge to LangChain and the OpenAI SDK. Tools extend models: register Python functions for sandboxed code execution, web search and browsing, or text editing. Standard tools handle computer use; MCP integrates external tool providers. Approval policies gate risky calls.
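A sketch of registering a Python function as a tool with the documented @tool/use_tools pattern; the word_count tool and the single-sample dataset are made-up examples:

    from inspect_ai import Task, task
    from inspect_ai.dataset import Sample
    from inspect_ai.scorer import includes
    from inspect_ai.solver import generate, use_tools
    from inspect_ai.tool import tool

    @tool
    def word_count():
        async def execute(text: str):
            """Count the words in a piece of text.

            Args:
                text: Text to count words in.
            """
            return str(len(text.split()))
        return execute

    @task
    def tool_demo():
        return Task(
            dataset=[Sample(input="Use the tool to count the words in 'to be or not to be'.", target="6")],
            solver=[use_tools(word_count()), generate()],   # make the tool available, then generate
            scorer=includes(),                              # check that the target appears in the answer
        )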
Advanced: eval sets run benchmark suites; tracing diagnoses issues; extensions add providers and tools. Pre-built evals cover ARC and other popular benchmarks from the literature via the Inspect Evals repo.
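A sketch of a suite run with eval_set(), which is assumed to require a dedicated log directory and to return a success flag plus logs; the task names come from the sketches above:

    from inspect_ai import eval_set

    success, logs = eval_set(
        tasks=[theory_of_mind(), tool_demo()],
        model="openai/gpt-4o",
        log_dir="logs/suite-1",   # eval_set tracks completed tasks here, so re-runs can retry failures
    )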