The Challenge of Automated Circuit Explanation

Mechanistic interpretability methods can localize circuits within neural networks, but explaining what each component of a circuit does remains a manual, labor-intensive process. The authors introduce AgenticInterpBench, a new benchmark comprising 84 semi-synthetic transformer circuits and 163 component-level annotations, to evaluate whether language model (LM) agents can standardize and accelerate this explanation process.

The HyVE Framework

The authors propose HyVE (Hypothesize, Validate, Explain), an agentic explainer designed to operate on identified circuits. The framework functions through an iterative loop:

  1. Observation: The agent inspects the component's behavior.
  2. Hypothesis Generation: The agent proposes a function for the component based on its observations.
  3. Causal Validation: The agent performs causal interventions to test the hypothesis.

This process culminates in both component-level explanations and a broader task-level description of the circuit. Testing across four different LM backbones revealed that while models are capable of generating grounded hypotheses, they struggle significantly with the validation phase. Common failure modes include incomplete validation plans, code execution errors, and an inability to resolve conflicting hypotheses.
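The observe, hypothesize, validate loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Hypothesis`, `explain_component`, and `toy_propose` are hypothetical, the "component" is a plain function rather than a transformer component, a fixed candidate list stands in for the LM agent, and checking predictions on counterfactual inputs stands in for a real causal intervention.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Observation = Tuple[int, int]  # (input, observed output)


@dataclass
class Hypothesis:
    description: str
    predict: Callable[[int], int]  # output the hypothesis expects for an input


def explain_component(
    component: Callable[[int], int],
    propose: Callable[[List[Observation], List[str]], Optional[Hypothesis]],
    probe_inputs: List[int],
    counterfactual_inputs: List[int],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first hypothesis that survives validation, or None."""
    rejected: List[str] = []
    # Step 1 (observation): record the component's behavior on probe inputs.
    observations = [(x, component(x)) for x in probe_inputs]
    for _ in range(max_rounds):
        # Step 2 (hypothesis generation): ask the proposer for a candidate.
        hypothesis = propose(observations, rejected)
        if hypothesis is None:
            return None
        # Step 3 (validation): change the input and compare the component's
        # actual output with what the hypothesis predicts.
        failures = [
            (x, component(x))
            for x in counterfactual_inputs
            if component(x) != hypothesis.predict(x)
        ]
        if not failures:
            return hypothesis.description
        # Keep the counterexamples so the next proposal must account for them.
        rejected.append(hypothesis.description)
        observations.extend(failures)
    return None


# Hypothetical stand-in for the LM agent: a fixed list of candidate functions.
CANDIDATES = [
    Hypothesis("doubles its input", lambda x: 2 * x),
    Hypothesis("squares its input", lambda x: x * x),
]


def toy_propose(observations: List[Observation], rejected: List[str]) -> Optional[Hypothesis]:
    """Return the first unrejected candidate consistent with all observations."""
    for candidate in CANDIDATES:
        if candidate.description in rejected:
            continue
        if all(candidate.predict(x) == y for x, y in observations):
            return candidate
    return None


# Probes 0 and 2 cannot tell doubling from squaring; inputs 3 and 5 can.
result = explain_component(lambda x: x * x, toy_propose, [0, 2], [3, 5])
print(result)
```

In this toy run the first proposal fits every observation but fails validation, which is the situation where a weak validation step would let a wrong explanation through.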

Key Insights and Limitations

  • Backbone Performance: No single LM backbone consistently outperformed the others, which suggests that current models do not yet reliably carry out the rigorous, multi-step reasoning that mechanistic interpretability requires.
  • The Validation Bottleneck: The primary obstacle to fully automated interpretability is not generating the initial hypothesis but running a reliable validation loop. Even when models "guess" correctly, they often fail to construct the causal experiments needed to support their claims.
  • Generalization: A case study on an arithmetic circuit within Llama-3-8B demonstrates that the HyVE approach is not limited to semi-synthetic benchmarks and can be applied to naturally trained models, indicating a path forward for real-world interpretability pipelines.
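
To make the kind of causal experiment behind the validation bottleneck concrete, here is a small sketch of mean ablation on a toy "circuit". Everything in it is an assumption made for illustration: the two-component adder, the names `COMPONENTS`, `forward`, and `output_spread`, and the dataset are hypothetical and are not taken from AgenticInterpBench or from the Llama-3-8B case study. The hypothesis under test is that `carry_a` carries the first operand; if that holds, replacing its activation with its dataset mean should make the output insensitive to the first operand.

```python
from statistics import mean

# Hypothetical toy circuit: each component produces one activation,
# and the output is the sum of all activations (so it computes a + b).
COMPONENTS = {
    "carry_a": lambda a, b: float(a),
    "carry_b": lambda a, b: float(b),
}


def forward(a, b, patch=None):
    """Run the toy circuit, optionally overriding activations by component name."""
    patch = patch or {}
    activations = {
        name: patch[name] if name in patch else fn(a, b)
        for name, fn in COMPONENTS.items()
    }
    return sum(activations.values())


def output_spread(dataset, ablate=None):
    """Range of outputs over the dataset, optionally with one component mean-ablated."""
    patch = {}
    if ablate is not None:
        # Mean ablation: replace the activation with its average over the dataset.
        patch[ablate] = mean(COMPONENTS[ablate](a, b) for a, b in dataset)
    outputs = [forward(a, b, patch) for a, b in dataset]
    return max(outputs) - min(outputs)


# Inputs differ only in the first operand, so any output variation is due to `a`.
DATASET = [(2, 5), (4, 5), (6, 5)]

print(output_spread(DATASET))                    # clean run
print(output_spread(DATASET, ablate="carry_a"))  # intervention under test
print(output_spread(DATASET, ablate="carry_b"))  # control intervention
```

Ablating `carry_a` removes all variation in the output, while ablating `carry_b` leaves it unchanged because the second operand is constant in this dataset. Including a control intervention of this kind is one way to check that an effect is specific to the component named in the hypothesis.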