Build RL Environments to Train LLM Agents
Use the Verifiers library to create RL environments where small LLMs interact, explore, and master tasks like tic-tac-toe via verifiable rewards, surpassing the limits of SFT.
Shift from SFT to RL with Verifiable Rewards for LLM Reasoning
Reinforcement learning (RL) maps directly onto LLMs: the model acts as the agent, generating text actions (e.g., moves or reasoning traces); the environment provides states (e.g., game boards) and verifiable rewards (e.g., +1 for a win, -0.1 for an invalid move), and handles interactions until termination. Unlike supervised fine-tuning (SFT), which mimics curated prompt-response pairs and stays close to the example distribution, RL with verifiable rewards lets models explore novel trajectories, discovering efficient strategies like chain-of-thought without expensive human data. DeepSeek R1 and the o1 models scale performance with RL compute, using algorithms like GRPO (group-relative policy optimization) for lighter setups than PPO. Rewards come from auto-checkable outcomes: correct answers, successful tool calls, or game wins. This enables training on dynamic tasks where SFT fails due to data scarcity, balancing exploration (new actions) against exploitation (known good ones) to maximize cumulative reward over trajectories (full episodes, e.g., one game).
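To make the mapping concrete, here is a minimal rollout loop; the `model` and `env` objects with `reset`/`step`/`done` are toy placeholders for illustration, not the Verifiers API:

```python
# Schematic agent-environment loop for one trajectory (a full episode).
# `model` and `env` are toy placeholders, not a specific library's objects.

def rollout(model, env):
    """Play one episode; return the trajectory and its cumulative reward."""
    state = env.reset()                       # e.g., an empty 3x3 board
    trajectory, total_reward = [], 0.0
    while not env.done:
        action = model.generate(state)        # text action, e.g. a move in tags
        state, reward = env.step(action)      # env validates it, opponent replies
        trajectory.append((state, action, reward))
        total_reward += reward                # verifiable: +1 win, -0.1 invalid, ...
    return trajectory, total_reward
```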
To overcome SFT's limits (pre-training plateaus, costly chain-of-thought data), generate reasoning traces plus answers, verify the outcomes, and RL-train to favor high-reward paths. Startups and labs (DeepSeek, MiniMax) use thousands of such environments to improve performance on challenging tasks.
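A minimal sketch of that generate-verify step; the `model.sample` call and the `<answer>` tag convention are assumptions for illustration:

```python
import re

def extract_answer(trace: str) -> str:
    """Pull the final answer from an answer tag (tag name is an assumption)."""
    m = re.search(r"<answer>(.*?)</answer>", trace, re.DOTALL)
    return m.group(1).strip() if m else ""

def score_rollouts(model, prompt: str, ground_truth: str, n: int = 8):
    """Sample n reasoning traces and attach a verifiable 0/1 reward to each."""
    scored = []
    for _ in range(n):
        trace = model.sample(prompt)       # placeholder generation call
        reward = 1.0 if extract_answer(trace) == ground_truth else 0.0
        scored.append((trace, reward))
    return scored                          # RL then upweights the high-reward traces
```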
Verifiers: Modular Library for LLM RL Environments
Verifiers (open-sourced by Prime Intellect) turns environments into installable Python packages for evaluation and training, abstracting away model serving (OpenAI-compatible APIs, vLLM), async parallel rollouts, response parsing (e.g., XML tags), and trainer integrations (TRL, SkyRL). The core types build on a multi-turn environment with state dicts, dynamic responses, @vf_stop decorators for termination (e.g., game over), and rubrics (weighted sums of reward functions).
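The rubric idea in schematic form, written as plain Python rather than the library's actual classes; the function names and weights are illustrative:

```python
def format_reward(completion: str) -> float:
    """1.0 if the completion carries a well-formed answer tag, else 0.0."""
    return 1.0 if "<answer>" in completion and "</answer>" in completion else 0.0

def correctness_reward(completion: str, target: str) -> float:
    """1.0 if the target string appears in the completion, else 0.0."""
    return 1.0 if target in completion else 0.0

def rubric_score(completion: str, target: str) -> float:
    """A rubric is a weighted sum of reward functions (weights illustrative)."""
    weighted = [
        (0.2, format_reward(completion)),
        (0.8, correctness_reward(completion, target)),
    ]
    return sum(w * r for w, r in weighted)
```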
- Single-turn: e.g., a reverse-text env loads a 1000-paragraph dataset, maps each to a prompt/ground-truth pair, parses answer tags, and rewards the longest-common-subsequence ratio (see the sketch after this list). Eval: 5 examples × 3 rollouts = 15 trajectories; stats include reward distributions.
- Multi-turn: e.g., double-check math: the model answers, the env replies "Are you sure?", and the loop continues until a stop condition fires.
- Tool envs: define Python functions (e.g., wiki search) that the model calls mid-reasoning (stub after this list). Supports MCP servers, stateful tools (e.g., DB sessions), and recursive LMs for long contexts.
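For the single-turn bullet, a sketch of the LCS-ratio reward, assuming answers arrive in an `<answer>` tag; the DP itself is the textbook LCS:

```python
import re

def lcs_len(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) dynamic-programming LCS length."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def reverse_text_reward(completion: str, ground_truth: str) -> float:
    """Reward in [0, 1]: LCS(parsed answer, ground truth) / len(ground truth)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = m.group(1).strip() if m else ""
    return lcs_len(answer, ground_truth) / max(len(ground_truth), 1)
```

And for the tool bullet, a tool is just a Python function the model may call; this stub only shows the shape (a real env would hit a search backend):

```python
def search_wiki(query: str) -> str:
    """Hypothetical tool callable mid-reasoning; returns a canned result here."""
    return f"[top wiki paragraph for: {query}]"
```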
The Environments Hub shares these environments as packages, fighting fragmentation, and pairs with existing training libraries. The focus stays on task logic and rewards, not infrastructure.
Tic-Tac-Toe Experiment: Weak SLM to Master via SFT + RL
Start with GPT-4o Mini (strong: good format, wins vs. random) and LSM2-1.6B (weak: poor formatting, many invalid moves, rare wins vs. random). Build a tic-tac-toe env: the model plays X (sometimes moving first, sometimes second) and outputs its move in tags; the env validates the move, lets the opponent reply, and scores wins, draws, format, and invalid moves.
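A minimal sketch of the env's step logic; the board encoding and reward values are illustrative, not the experiment's exact settings:

```python
import random

WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def wins(board: list[str], player: str) -> bool:
    return any(board[i] == board[j] == board[k] == player for i, j, k in WIN_LINES)

def step(board: list[str], move: int) -> tuple[list[str], float, bool]:
    """board: 9 cells of 'X'/'O'/' '; move: 0-8 index played by the model (X).
    Returns (board, reward, done); reward values are illustrative."""
    if not (0 <= move < 9) or board[move] != " ":
        return board, -0.1, True              # invalid move: penalty, end episode
    board[move] = "X"
    if wins(board, "X"):
        return board, 1.0, True               # win
    empties = [i for i, c in enumerate(board) if c == " "]
    if not empties:
        return board, 0.5, True               # draw: board full after X's move
    board[random.choice(empties)] = "O"       # random opponent replies
    if wins(board, "O"):
        return board, -1.0, True              # loss
    if " " not in board:
        return board, 0.5, True               # draw after the opponent's reply
    return board, 0.0, False
```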
Training LSM2:
- SFT warmup: generate 200 synthetic games with GPT-4o Mini (filtering out losses) and train for a few minutes on a 96 GB GPU → near-perfect format, fewer invalid moves, better play.
- GRPO RL (Verifiers trainer): batch size ≥256 is critical (small batches are unstable and collapse, since each update sees too few games); n_groups sets how many rollouts share a group when computing advantages against the group's average reward (see the sketch after this list); inference and training run on separate GPUs. Plots: total and format rewards rise, invalid moves drop to ~0.
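The group-relative advantage at the core of GRPO, as a schematic NumPy function. It also shows why tiny batches hurt: with few rollouts per group, the group mean and std are noisy estimates, and the resulting updates degrade into noise:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (n_groups, rollouts_per_group), one group per prompt.
    Each reward is normalized against its own group's statistics."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)     # positive => beat the group average

# e.g., one prompt, four rollouts: wins stand out against the group mean
print(grpo_advantages(np.array([[1.0, 0.0, 0.5, 1.0]])))
```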
Post-RL eval: the model dominates the random opponent (high win rate) and draws 85% of games against an optimal one, with invalid moves near zero, outperforming both the base and SFT-only checkpoints. Code: GitHub repo with OOM tips. The recipe scales to multi-step and tool-using agents, and is a fun, practical path for SLMs.