Template Collapse Undermines LLM Agent RL: Fix with MI & SNR
RL-trained LLM agents collapse into input-agnostic templates despite stable entropy; track mutual information (MI) to measure true reasoning quality, and use SNR-aware prompt filtering to boost performance across tasks.
Entropy Misses Template Collapse in Agent RL
Reinforcement learning (RL) on multi-turn LLM agents is unstable, and reasoning quality drives task success. Standard entropy metrics track within-input diversity but fail to detect 'template collapse,' where agents emit fixed templates that look diverse on the surface yet ignore differences between inputs. Because this input-agnostic behavior leaves entropy stable, it evades existing diagnostics while degrading cross-input adaptability.
The authors decompose reasoning quality into two parts: within-input diversity (entropy) and cross-input distinguishability (mutual information, MI). MI proxies enable online monitoring and reveal that MI correlates far more strongly with final performance than entropy does across tasks.
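The intuition behind the decomposition can be sketched with a plug-in MI estimate. The sketch below assumes each sampled response has been mapped to a discrete "template" ID (e.g. by clustering reasoning traces); the function names and this discretization are illustrative assumptions, not the paper's exact proxy. It uses the identity I(X; Y) = H(Y) - H(Y|X): under template collapse, outputs stay diverse overall (high H(Y)) but are equally diverse for every input (H(Y|X) ≈ H(Y)), so MI drops to zero.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a Counter of outcome frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mi_proxy(prompt_ids, template_ids):
    """Plug-in estimate of I(prompt; template) = H(Y) - H(Y|X).

    prompt_ids / template_ids: parallel lists, one entry per sampled rollout.
    Illustrative only -- assumes responses are pre-clustered into template IDs.
    """
    n = len(prompt_ids)
    h_y = entropy(Counter(template_ids))  # marginal output diversity
    h_y_given_x = 0.0                     # diversity remaining once input is known
    for x, nx in Counter(prompt_ids).items():
        cond = Counter(t for p, t in zip(prompt_ids, template_ids) if p == x)
        h_y_given_x += (nx / n) * entropy(cond)
    return h_y - h_y_given_x
```

Note that a collapsed policy can score identically on entropy yet zero on MI: `['a','b','a','b']` over two prompts has the same marginal diversity as `['a','a','b','b']`, but only the latter is input-dependent.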
Low SNR Causes Collapse via Gradient Weakening
Template collapse stems from signal-to-noise ratio (SNR) dynamics. When rollouts for a given prompt earn nearly identical rewards, the task gradient is weak, allowing regularization to dominate training. This erases input-specific reasoning signals and forces reliance on generic templates.
High-SNR prompts—those with substantial reward variance—preserve task-relevant differences, countering regularization's homogenizing effect.
SNR-Aware Filtering Restores Input Dependence
To fix this, apply SNR-aware filtering: at each training iteration, select the prompts with high reward variance across their rollouts, using that variance as a lightweight SNR proxy. This amplifies task signals without additional compute.
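A minimal sketch of the filtering step, assuming rewards have already been collected per prompt (the function name, signature, and `keep_frac` parameter are illustrative, not the paper's API):

```python
import statistics

def snr_filter(prompt_rollouts, keep_frac=0.5):
    """Keep the prompts with the highest reward variance (the SNR proxy).

    prompt_rollouts: {prompt_id: [reward for each sampled rollout]}
    Returns the retained prompt IDs, highest-variance first.
    """
    scored = {p: statistics.pvariance(rs) for p, rs in prompt_rollouts.items()}
    k = max(1, int(len(scored) * keep_frac))
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Example batch: 'a' is low-SNR (uniform rewards), 'b' and 'c' carry signal.
rollouts = {
    'a': [1.0, 1.0, 1.0, 1.0],
    'b': [1.0, 0.0, 1.0, 0.0],
    'c': [1.0, 1.0, 0.0, 1.0],
}
kept = snr_filter(rollouts, keep_frac=2 / 3)  # drops the zero-variance prompt 'a'
```

Because the rewards are computed anyway during rollout collection, ranking prompts by their variance adds essentially no overhead on top of standard RL training.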
Tested on planning, math reasoning, web navigation, and code execution, the method consistently improves both MI (input responsiveness) and end-task performance, making RL training more reliable for production LLM agents.