The Challenge of Agentic Abstention

LLM agents are designed to operate in multi-turn environments, utilizing tools like search, terminal interfaces, and browsers to achieve user goals. However, a critical failure mode occurs when agents continue to interact with an environment even after a goal becomes clearly unachievable or ill-specified. This problem, defined as "Agentic Abstention," is a sequential decision-making challenge: the agent must decide at each turn whether to act, gather more information, or stop entirely.
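The per-turn choice described above can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than anything from the study: the `Decision` labels, the `run_episode` helper, and the toy rule that abstains after two consecutive empty search results are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Decision(Enum):
    ACT = "act"          # take a task-advancing action
    GATHER = "gather"    # probe or ask for more information
    ABSTAIN = "abstain"  # stop: the goal looks unachievable or ill-specified


@dataclass
class Step:
    turn: int
    observation: str
    decision: Decision


def run_episode(
    policy: Callable[[List[str]], Decision],
    observations: List[str],
    max_turns: int = 10,
) -> List[Step]:
    """Replay scripted observations and record the policy's decision each turn."""
    history: List[str] = []
    steps: List[Step] = []
    for turn, obs in enumerate(observations[:max_turns], start=1):
        history.append(obs)
        decision = policy(history)
        steps.append(Step(turn, obs, decision))
        if decision is Decision.ABSTAIN:
            break  # abstaining ends the episode
    return steps


def toy_policy(history: List[str]) -> Decision:
    """Hypothetical rule: abstain after two consecutive empty search results."""
    if len(history) >= 2 and all("no results" in o for o in history[-2:]):
        return Decision.ABSTAIN
    if "no results" in history[-1]:
        return Decision.GATHER
    return Decision.ACT


episode = run_episode(
    toy_policy,
    ["3 results found", "no results", "no results", "no results"],
)
for step in episode:
    print(step.turn, step.decision.value)
```

The scripted environment here stands in for a real search tool or terminal. The point is only that stopping is one of the options evaluated on every turn, not a fallback reached when the turn budget runs out.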

Research across 28,000 tasks—spanning web shopping, terminal operations, and question answering—reveals that current agents struggle significantly with the timing of abstention. Some models fail to stop when they should, while others perform excessive, redundant interactions before finally abstaining. This issue is most pronounced in scenarios where a task appears feasible initially but becomes impossible based on environment feedback (e.g., a search query returning no valid results).
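One way to reason about abstention timing is to compare the turn at which the environment first shows the task to be infeasible with the turn at which the agent actually stops. The sketch below is an assumption for illustration only: the labels, the `grace_turns` tolerance, and the function name are hypothetical and are not the study's metric definitions.

```python
from typing import Optional


def classify_abstention(
    infeasible_turn: int,
    abstain_turn: Optional[int],
    grace_turns: int = 1,
) -> str:
    """Label when the agent stopped relative to when the task became infeasible.

    infeasible_turn: first turn whose feedback shows the goal cannot be met.
    abstain_turn: turn at which the agent abstained, or None if it never did.
    grace_turns: hypothetical tolerance for extra turns that still count as timely.
    """
    if abstain_turn is None:
        return "never"      # the agent kept going until the episode ended
    if abstain_turn < infeasible_turn:
        return "premature"  # stopped before the evidence appeared
    if abstain_turn - infeasible_turn <= grace_turns:
        return "timely"
    return "late"           # stopped, but only after redundant interactions


print(classify_abstention(infeasible_turn=3, abstain_turn=8))
```

Under these assumptions, the two failure patterns described above map to the "never" and "late" labels respectively.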

Model Capability vs. Abstention Performance

The study found that greater model scale and stronger reasoning capabilities do not reliably translate into better abstention. In some cases, larger, more capable models were worse than their smaller counterparts at identifying the correct moment to stop. This suggests that current agent scaffolding and model training prioritize goal completion over goal evaluation, leading to a "persistence bias" where agents continue to act simply because they are prompted to solve a task.

Improving Abstention with CONVOLVE

To address this, the authors introduced CONVOLVE, a context engineering method that improves abstention without requiring model parameter updates. CONVOLVE works by distilling full interaction trajectories into reusable stopping rules, which give the model a clearer framework for deciding when to terminate. The reported improvements are substantial: on the WebShop benchmark, applying CONVOLVE increased the timely recall rate for Llama-3.3-70B from 26.7% to 57.4%, a gain of 30.7 percentage points. This suggests that improving agent reliability can depend more on how the interaction context is structured than on the raw capability of the underlying model.
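To make the idea of context-level stopping rules concrete, the sketch below shows one way such rules could be placed in an agent's prompt. It is not CONVOLVE's implementation: the distillation step that produces rules from trajectories is omitted, and the `build_prompt` function, the prompt wording, and the example rules are all hypothetical.

```python
from typing import List


def build_prompt(task: str, stopping_rules: List[str]) -> str:
    """Assemble an agent prompt that lists stopping rules ahead of the task."""
    lines = ["You are an agent working in a multi-turn environment."]
    if stopping_rules:
        lines.append("Before each action, check these stopping rules:")
        lines.extend(
            f"{i}. {rule}" for i, rule in enumerate(stopping_rules, start=1)
        )
        lines.append("If any rule applies, reply with ABSTAIN and a one-line reason.")
    lines.append(f"Task: {task}")
    return "\n".join(lines)


# Invented example rules, standing in for rules distilled from past trajectories.
rules = [
    "If two differently worded searches return no matching items, stop.",
    "If the request contradicts itself, stop and say which parts conflict.",
]
prompt = build_prompt("Buy a waterproof paper notebook under $0", rules)
print(prompt)
```

Because the rules live entirely in the context, the same list can be reused across tasks and swapped out without touching model weights, which is the property the method relies on.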