The Structural Weaknesses of GRPO

GRPO (Group Relative Policy Optimization) is widely favored for its efficiency, as it eliminates the need for a critic network. However, its reliance on group-relative advantage normalization creates three primary failure modes that stall training:

  • Advantage Collapse: Occurs when all sampled responses in a group receive the same reward (e.g., all correct or all incorrect). Every group-relative advantage is then zero, so the group contributes no gradient signal. This is most common on very hard or very easy prompts.
  • Entropy Collapse: As the model converges, it may lose generation diversity. Once mean per-token entropy drops below a critical threshold (roughly 0.5 nats as a rule of thumb), the model becomes stuck in a narrow mode, making it difficult to recover without external intervention.
  • KL Drift: A single fixed KL penalty coefficient often forces a choice between reward hacking (penalty too low) and stagnation (penalty too high). Baking KL into the reward signal further distorts the advantage normalization process.
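
To make advantage collapse concrete, here is a minimal sketch of group-relative advantage normalization. The function name, the epsilon value, and the use of the population standard deviation are assumptions for illustration; real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Assumption: population standard deviation and a small eps to avoid
    division by zero. Both are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every response gets the same reward yield all-zero advantages,
# so they contribute nothing to the policy gradient.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this sketch the all-correct and all-wrong groups are indistinguishable to the optimizer: both produce zero advantages regardless of how good the responses actually were.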

Engineering Solutions via DAPO

The DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) framework provides specific algorithmic fixes to these issues:

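  • Dynamic Sampling: Instead of training on all groups, filter out groups with zero reward variance and keep sampling until the batch is filled with groups that carry a usable signal. Zero-variance groups have zero advantage, so they add no gradient and only shrink the effective batch size.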
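  • Asymmetric Clipping (Clip-Higher): The clip range applies to the probability ratio, not to KL. By decoupling the lower and upper clip bounds and raising the upper one, the model can reinforce low-probability tokens in correct responses more aggressively without needing to compress its entire output distribution, which helps preserve entropy.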
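  • Decoupled KL: Remove the KL penalty from the reward signal entirely. If a KL constraint is still needed, apply it as a direct loss term after advantage computation so that it cannot distort the normalized advantages. DAPO itself goes further and drops the KL term altogether.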
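  • Token-Level Normalization: Standard GRPO averages the loss within each response before averaging across responses, so each token in a long response carries less weight than a token in a short one. This biases the model against long chain-of-thought reasoning. Normalizing by the total token count in the batch gives every token equal weight, so longer, more complex reasoning traces are weighted appropriately.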
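
The fixes above can be sketched on toy data. This is an illustration under stated assumptions, not DAPO's reference implementation: the function names are hypothetical, the clip bounds (0.2 lower, 0.28 upper) are example values, and per-token objectives are given directly as plain lists instead of being computed from model log-probabilities.

```python
def keep_group(rewards):
    """Dynamic sampling filter: keep a group only if its rewards differ."""
    return max(rewards) > min(rewards)


def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip bounds.

    Assumption: eps_low and eps_high are illustrative defaults.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def sample_level_loss(objectives):
    """Average tokens within each response, then average across responses."""
    per_response = [sum(tokens) / len(tokens) for tokens in objectives]
    return -sum(per_response) / len(per_response)


def token_level_loss(objectives):
    """Average over every token in the batch, so each token weighs the same."""
    total = sum(sum(tokens) for tokens in objectives)
    count = sum(len(tokens) for tokens in objectives)
    return -total / count


# Three groups of binary rewards; only the mixed group survives the filter.
groups = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
kept = [g for g in groups if keep_group(g)]

# A token whose probability rose by 25% in a positively rewarded response.
symmetric = clipped_objective(1.25, 1.0, eps_low=0.2, eps_high=0.2)
asymmetric = clipped_objective(1.25, 1.0)

# One short response (2 tokens) and one long response (8 tokens).
objectives = [[1.0, 1.0], [-1.0] * 8]
loss_by_sample = sample_level_loss(objectives)
loss_by_token = token_level_loss(objectives)
```

With symmetric bounds the 1.25 ratio is clipped at 1.2, while the higher upper bound lets it through unclipped. In the two-response batch, the sample-level loss treats the short and long responses as equal and they cancel out, whereas the token-level loss lets the eight tokens of the long response dominate.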

Production Best Practices

Beyond the algorithm, the success of GRPO depends on the quality of the reward signal and the training pipeline.

  • Audit the Reward Model: If the verifier is noisy, it will inject false signals that exacerbate advantage collapse.
  • Monitor Entropy: Track per-token entropy as a first-class metric. If it stays below 0.5 nats for more than 50 steps, the model is likely collapsing.
  • Manage SFT Bias: If the initial SFT checkpoint is already over-fitted to a specific format, it will be more prone to entropy collapse during RL.
  • Hyperparameter Tuning: While increasing group size (G) can stabilize estimates, it is often more compute-efficient to use dynamic sampling to discard low-variance groups than to simply increase the number of rollouts.
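
As an illustration of the entropy-monitoring rule above, the following sketch tracks how many consecutive steps the mean per-token entropy has stayed under the threshold. The class and function names are hypothetical, and the sketch assumes the training loop already produces a mean per-token entropy value (in nats) at each step.

```python
import math


def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyMonitor:
    """Flags likely collapse after too many consecutive low-entropy steps."""

    def __init__(self, threshold=0.5, patience=50):
        self.threshold = threshold  # nats, from the rule of thumb in the text
        self.patience = patience    # steps tolerated below the threshold
        self.low_steps = 0

    def update(self, mean_entropy):
        """Record one training step; return True if collapse is suspected."""
        if mean_entropy < self.threshold:
            self.low_steps += 1
        else:
            self.low_steps = 0  # any recovery resets the streak
        return self.low_steps > self.patience


# Simulate 51 consecutive steps with mean per-token entropy of 0.3 nats.
monitor = EntropyMonitor()
flags = [monitor.update(0.3) for _ in range(51)]
```

The monitor stays silent for the first 50 low-entropy steps and raises the flag on the 51st, matching the "more than 50 steps" rule. What to do when it fires (stopping the run, adjusting clipping, or rolling back to an earlier checkpoint) is left to the training pipeline.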