EVE-Agent: Improving Self-Evolving Agents with Evidence Verification

The Problem: Unverifiable Self-Evolution

Self-evolving agents, systems that generate their own questions, answer them, and learn from the results, often lack accountability. Without human oversight, these models can drift into generating fluent but hallucinated or unsupported content. The result is a feedback loop in which the model reinforces its own errors, producing an opaque and unreliable training curriculum.

The Solution: Evidence-Verifiable Training

EVE-Agent introduces a structural modification to the standard proposer-solver framework to enforce grounding. Instead of generating only an answer, the agent must also produce a verbatim evidence span from a source.
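
To make the "verbatim" requirement concrete, here is a minimal sketch of how such a check might look. The `Triplet` class, the `is_verbatim_span` helper, and the example text are hypothetical names chosen for illustration; the source does not describe how EVE-Agent actually performs this check, so exact substring matching is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one self-generated training example."""
    question: str
    answer: str
    evidence: str  # span that is supposed to be copied word for word


def is_verbatim_span(evidence: str, source: str) -> bool:
    """Return True when the evidence is a non-empty, exact substring of the source.

    Assumption: "verbatim" is read as exact substring matching after trimming
    surrounding whitespace. A real system might normalise text differently.
    """
    span = evidence.strip()
    return bool(span) and span in source


source = "The Eiffel Tower was completed in 1889 for the World's Fair."
grounded = Triplet("When was the Eiffel Tower completed?", "1889", "completed in 1889")
paraphrased = Triplet("When was the Eiffel Tower completed?", "1889", "finished in 1889")

print(is_verbatim_span(grounded.evidence, source))
print(is_verbatim_span(paraphrased.evidence, source))
```

Under this reading, a paraphrase fails the check even when it is factually correct, which is what keeps every example traceable to an exact location in the source.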

Key components of the EVE-Agent framework include:

Proposer-Solver Framework: The agent generates a triplet consisting of a question, an answer, and a specific evidence span.
Evidence Verifier: This component rewards the agent according to the marginal accuracy gain that the evidence provides, that is, how much more often the model answers correctly when given the evidence than without it. Evidence that substantially improves accuracy earns a higher reward.
Auditable Curriculum: Because every training example is linked to a specific source span, the resulting dataset is inherently inspectable. This allows developers to verify why the model trusts a specific piece of information.
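
The marginal-gain idea behind the evidence verifier can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `Solver` signature, the exact-match scoring, the sample count, and the clipping of negative gains to zero are all hypothetical. Because no oracle answers are available, the sketch scores the solver against the proposer's own answer.

```python
from typing import Callable, Optional

# Hypothetical solver interface: takes a question and optional evidence,
# returns an answer string.
Solver = Callable[[str, Optional[str]], str]


def accuracy(solver: Solver, question: str, reference: str,
             evidence: Optional[str], n_samples: int = 8) -> float:
    """Fraction of sampled solver answers that match the reference answer."""
    target = reference.strip().lower()
    hits = sum(
        solver(question, evidence).strip().lower() == target
        for _ in range(n_samples)
    )
    return hits / n_samples


def evidence_reward(solver: Solver, question: str, proposed_answer: str,
                    evidence: str, source: str, n_samples: int = 8) -> float:
    """Reward = accuracy with evidence minus accuracy without it (assumed form)."""
    # Assumption: evidence that is not a verbatim span of the source earns nothing.
    if not evidence or evidence not in source:
        return 0.0
    with_evidence = accuracy(solver, question, proposed_answer, evidence, n_samples)
    without_evidence = accuracy(solver, question, proposed_answer, None, n_samples)
    # Assumption: negative gains are clipped to zero.
    return max(0.0, with_evidence - without_evidence)


def toy_solver(question: str, evidence: Optional[str]) -> str:
    """Deterministic stand-in for a model: only answers when the evidence helps."""
    return "1889" if evidence and "1889" in evidence else "unknown"


source = "The Eiffel Tower was completed in 1889 for the World's Fair."
question = "When was the Eiffel Tower completed?"

helpful = evidence_reward(toy_solver, question, "1889", "completed in 1889", source)
fabricated = evidence_reward(toy_solver, question, "1889", "finished in 1889", source)
print(helpful, fabricated)
```

With the toy solver, evidence that is both verbatim and useful earns the full reward, while a fabricated span earns none. A solver that already answers correctly without the evidence would also yield zero, which is the sense in which the gain is marginal.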

Impact and Implementation

EVE-Agent achieves superior evidence-grounded correctness compared to previous self-evolving search agents. Crucially, this approach requires no oracle answers, human annotations, or external labels. It functions as an architectural wrapper that leaves the underlying backbone model, retriever, and search tools unchanged, making it a modular upgrade for existing agentic pipelines.
