The Evaluation-Data Gap
Model capability is typically observed retrospectively through noisy, aggregated benchmark scores. When a model fails, engineers often struggle to bridge the gap between a high-level benchmark failure (e.g., a drop in BBH scores) and the specific data-corpus intervention required to fix it. In practice, this diagnosis is usually driven by intuition rather than a systematic, auditable methodology.
The Capability Slice Framework
To solve this, the authors introduce the "capability slice": a granular unit of evaluation that groups samples by background condition, task type, solving operation, and output constraint. This unit is designed to be:
- Specific enough to localize a single model weakness.
- Stable enough to survive aggregation across larger datasets.
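As a rough illustration of the slice definition above, the four grouping dimensions can be treated as a composite key over graded samples. This is a minimal sketch, not the authors' implementation: the `SliceKey` class, the `slice_accuracy` helper, and all field values are hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceKey:
    """Hypothetical composite key: one field per dimension named in the text."""
    background_condition: str
    task_type: str
    solving_operation: str
    output_constraint: str


def slice_accuracy(samples):
    """Group (key, is_correct) pairs by slice; return {key: (accuracy, count)}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_seen]
    for key, is_correct in samples:
        totals[key][0] += int(is_correct)
        totals[key][1] += 1
    return {key: (correct / seen, seen) for key, (correct, seen) in totals.items()}


# Illustrative graded samples (made-up labels, not data from the paper).
modular = SliceKey("none", "arithmetic", "modular_reduction", "integer")
enumeration = SliceKey("none", "arithmetic", "case_enumeration", "integer")
samples = [(modular, True), (modular, False), (enumeration, False), (enumeration, False)]

report = slice_accuracy(samples)
for key, (accuracy, count) in report.items():
    print(key.solving_operation, accuracy, count)
```

Keeping the sample count next to each accuracy makes it easier to judge whether a slice is large enough to remain stable under aggregation, which is the second property listed above.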
By combining these slices with a structured evaluation taxonomy and a non-instruction data taxonomy, the authors create a closed-loop system. This system allows developers to map specific benchmark failures directly to targeted data interventions, turning debugging into an experimental, repeatable process.
Validating the Loop
The authors demonstrate the effectiveness of this loop through two contrasting case studies:
- Ruling out data interventions: When continued pre-training caused a 46.82% drop in BBH performance, the loop diagnosed the issue as a single masked <EOS> loss rather than a reasoning failure. Restoring this loss recovered BBH to 66.44, surpassing the original checkpoint without changing the training data.
- Targeted data interventions: For a persistent math-reasoning weakness, the loop decomposed the failure by solving operation. By applying a weakness-targeted sampling procedure, the authors increased AIME2025/AIME2026 Pass@128 scores from 6.67/0.00 to 26.67 each.
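The summary does not specify how the weakness-targeted sampling procedure works. One plausible reading, sketched below purely as an assumption, is to draw training items from each solving-operation slice in proportion to its error rate. The function names, the `floor` parameter, and the slice labels are all hypothetical.

```python
import random


def weakness_weights(accuracy_by_slice, floor=0.05):
    """Assumed scheme: weight each slice by its error rate (1 - accuracy).

    The floor keeps strong slices from being dropped entirely.
    """
    raw = {name: max(1.0 - acc, floor) for name, acc in accuracy_by_slice.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def sample_training_items(pool_by_slice, weights, k, seed=0):
    """Pick k items: choose a slice by weight, then an item uniformly within it."""
    rng = random.Random(seed)
    names = list(weights)
    picked = rng.choices(names, weights=[weights[n] for n in names], k=k)
    return [rng.choice(pool_by_slice[name]) for name in picked]


# Illustrative per-slice accuracies and data pools (made-up values).
accuracy_by_slice = {
    "modular_reduction": 0.9,
    "case_enumeration": 0.2,
    "inequality_bounding": 0.5,
}
pool_by_slice = {
    "modular_reduction": ["mod_1", "mod_2"],
    "case_enumeration": ["enum_1", "enum_2", "enum_3"],
    "inequality_bounding": ["ineq_1"],
}

weights = weakness_weights(accuracy_by_slice)
batch = sample_training_items(pool_by_slice, weights, k=8)
print(weights)
print(batch)
```

Under this assumed scheme the weakest slice (`case_enumeration`) receives the largest share of the sampling budget, while the floor guarantees that well-performing slices still appear occasionally.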
These results demonstrate that evaluation-to-data inference can be routine and experimentally validated, moving beyond the guesswork common in current LLM development workflows.