The Problem: RLVR Drift
Reinforcement Learning with Verifiable Rewards (RLVR) has enabled large language models to reach expert-level performance in complex domains such as competition math. However, these models often drift toward "idiosyncratic patterns," such as poor readability or language mixing, because the reward signal checks only the final answer, not the reasoning that leads to it. The resulting reasoning chains are difficult for humans or weaker models to follow.
The Solution: Tandem Reinforcement Learning (TRL)
TRL introduces a collaborative training paradigm to address this compatibility issue. Instead of training a model in isolation, a "senior" model (the one being trained) and a frozen "junior" model (a weaker, static model) alternate stochastically to co-generate a single reasoning chain. The co-generated chain is rewarded as a team effort based on the final output, and the standard GRPO (Group Relative Policy Optimization) loss is applied to the senior model; the junior's weights are never updated.
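The stochastic alternation described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the segment granularity, the switching probability `p_senior`, the `ANSWER:` stop marker, and the two toy step functions are all hypothetical stand-ins for real model calls.

```python
import random

def tandem_rollout(senior_step, junior_step, prompt,
                   max_steps=8, p_senior=0.5, rng=None):
    """Co-generate one reasoning chain and record who wrote each segment.

    senior_step / junior_step are hypothetical callables taking
    (prompt, chain_so_far) and returning the next text segment.
    """
    rng = rng or random.Random()
    chain, authors = [], []
    for _ in range(max_steps):
        # Assumption: the author of each segment is drawn independently.
        author = "senior" if rng.random() < p_senior else "junior"
        step_fn = senior_step if author == "senior" else junior_step
        segment = step_fn(prompt, chain)
        chain.append(segment)
        authors.append(author)
        # Assumption: a segment starting with "ANSWER:" ends the chain.
        if segment.startswith("ANSWER:"):
            break
    return chain, authors

# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_senior(prompt, chain):
    return "ANSWER: 4" if len(chain) >= 3 else f"senior step {len(chain)}"

def toy_junior(prompt, chain):
    return f"junior step {len(chain)}"

if __name__ == "__main__":
    chain, authors = tandem_rollout(toy_senior, toy_junior, "What is 2 + 2?",
                                    rng=random.Random(0))
    for who, text in zip(authors, chain):
        print(f"[{who}] {text}")
```

Recording the author of each segment matters because only the senior's contributions can be updated during training; the junior is frozen.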
Because either model may have to continue from what the other wrote, the senior model is pushed to adapt its reasoning style so that it is legible to the junior model. Requiring the senior to "hand off" the reasoning process to a weaker partner incentivizes it to produce chains of thought that are structurally simpler and more logical.
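To make the team reward and the GRPO update more concrete, here is a minimal sketch of the group-relative advantage that GRPO computes for a group of chains sampled from the same prompt. Two details are assumptions rather than facts from the source: the use of the population standard deviation, and the masking of junior-written tokens so that only senior-written tokens carry a learning signal (a natural reading of "the junior is frozen", but not confirmed here). The helper names are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each chain's team reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

def senior_token_advantages(advantage, authors_per_token):
    """Assumption: junior-written tokens are masked out of the loss."""
    return [advantage if a == "senior" else 0.0 for a in authors_per_token]

if __name__ == "__main__":
    # Four co-generated chains for one prompt; 1.0 means a correct final answer.
    team_rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_relative_advantages(team_rewards)
    print(advantages)
    # Spread the first chain's advantage over its tokens, senior tokens only.
    print(senior_token_advantages(advantages[0],
                                  ["senior", "junior", "junior", "senior"]))
```

Under this reading, a chain that fails after a junior segment still lowers the advantage assigned to the senior tokens that preceded it, which is one way the senior could be pushed toward reasoning that the junior can continue.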
Key Outcomes
When training the Qwen3-4B-Instruct model on competition math, TRL showed three outcomes relative to vanilla GRPO:
- Maintained Performance: The TRL-trained senior model, reasoning on its own, matched the performance of models trained with standard GRPO.
- Increased Legibility: The resulting chains of thought were significantly more understandable to the junior model.
- Reduced Drift: The senior model exhibited less distributional drift, staying closer to standard language patterns rather than devolving into the idiosyncratic, unreadable shorthand often seen in high-performance RLVR models.
These findings suggest that TRL provides a viable path for creating AI systems that are not only high-performing but also more compatible with human users and other, smaller AI agents.