Internalizing Self-Critique via Reinforcement Learning (ICRL)

The Shift from External to Internalized Critique

Traditional AI systems often rely on external verifiers or multi-agent setups to critique outputs, which adds latency and creates a dependency on additional compute or prompt-based feedback loops. ICRL (Internalizing Self-Critique via Reinforcement Learning) proposes a framework in which the model learns to perform this critique internally. Because critique is treated as a learned behavior within the agent's policy, the model can identify and correct errors during generation without an explicit, separate verification step.

Reinforcement Learning for Self-Correction

The core mechanism of ICRL is to train the model against a reward signal that accounts for both task completion and self-correction accuracy. Instead of relying on static prompt engineering or external feedback, the agent is trained to generate a 'critique' latent or token sequence that informs its own subsequent generation. Because the model is directly penalized during the reinforcement learning phase for failing to catch its own errors, it is pushed to develop a more robust internal representation of 'correctness.' In effect, the method compresses the multi-step critique-and-revise process into a more efficient, internalized loop, leading to higher accuracy on complex reasoning tasks where external verifiers might struggle to provide granular, context-aware feedback.
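
As a rough illustration of a reward that accounts for both task completion and self-correction accuracy, the sketch below scores a single rollout. It is not taken from the ICRL work: the `Episode` fields, the `icrl_reward` function, the two weights, and the +1/-1 critique scoring are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One rollout: a draft, the model's own critique, and a final answer (hypothetical schema)."""
    draft_correct: bool      # was the first attempt correct?
    critique_flagged: bool   # did the self-critique claim the draft contained an error?
    final_correct: bool      # was the answer after self-correction correct?


def icrl_reward(ep: Episode, task_weight: float = 1.0, critique_weight: float = 0.5) -> float:
    """Assumed reward shape: task completion plus a signed self-critique accuracy term."""
    task_term = 1.0 if ep.final_correct else 0.0
    # The critique counts as accurate when it flags an error exactly when the draft is wrong.
    critique_accurate = ep.critique_flagged == (not ep.draft_correct)
    # Missed errors and false alarms are both penalized.
    critique_term = 1.0 if critique_accurate else -1.0
    return task_weight * task_term + critique_weight * critique_term


caught = Episode(draft_correct=False, critique_flagged=True, final_correct=True)
missed = Episode(draft_correct=False, critique_flagged=False, final_correct=False)
false_alarm = Episode(draft_correct=True, critique_flagged=True, final_correct=True)

print(icrl_reward(caught))       # 1.5
print(icrl_reward(missed))       # -0.5
print(icrl_reward(false_alarm))  # 0.5
```

Under these assumptions, a rollout that catches and fixes its own error scores highest, while a missed error is penalized even beyond the lost task reward. In a real training setup this scalar would feed a policy-gradient update, and the relative weights would need tuning.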
