Build RL Environments to Train LLM Agents

Use the Verifiers library to create RL environments where small LLMs interact, explore, and master tasks like tic-tac-toe via verifiable rewards, going beyond the limits of SFT.

Shift from SFT to RL with Verifiable Rewards for LLM Reasoning

Reinforcement learning (RL) maps directly onto LLMs: the model is the agent, generating text actions (e.g., moves or reasoning traces); the environment provides states (e.g., game boards) and verifiable rewards (e.g., +1 for a win, -0.1 for an invalid move), and handles the interaction until termination. Unlike supervised fine-tuning (SFT), which mimics curated prompt-response pairs and stays close to the example distribution, RL with verifiable rewards lets models explore novel trajectories and discover efficient strategies such as chain-of-thought without expensive human data. DeepSeek R1 and the o1 models scale performance with RL compute, using algorithms like GRPO (group-relative policy optimization), which needs a lighter setup than PPO. Rewards come from automatically checkable outcomes: correct answers, successful tool calls, or game wins. This enables training on dynamic tasks where SFT fails due to data scarcity, balancing exploration (trying new actions) and exploitation (repeating known good ones) to maximize cumulative reward over trajectories (full episodes, e.g., one complete game).
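
To make the mapping concrete, here is a minimal, self-contained sketch (not from the talk) of a rollout loop: the policy stands in for the LLM, the environment applies text actions and returns verifiable rewards, and one episode is a trajectory.

```python
def policy(state: str) -> str:
    """Stand-in for the LLM: given the current state (e.g., a board as text),
    return a text action (e.g., a move like "4")."""
    return "4"  # placeholder action


def env_step(state: str, action: str) -> tuple[str, float, bool]:
    """Stand-in environment: apply the action, return (next_state, reward, done).
    Rewards are verifiable outcomes, e.g., +1.0 for a win, -0.1 for an invalid move."""
    if action not in "012345678":
        return state, -0.1, False          # invalid move: penalty, episode continues
    next_state = state + action            # toy state update
    done = len(next_state) >= 9            # toy termination condition
    reward = 1.0 if done else 0.0          # e.g., +1 at a winning episode end
    return next_state, reward, done


def rollout(initial_state: str) -> list[tuple[str, str, float]]:
    """Collect one trajectory (a full episode); RL maximizes the summed reward."""
    state, done, trajectory = initial_state, False, []
    while not done:
        action = policy(state)
        state, reward, done = env_step(state, action)
        trajectory.append((state, action, reward))
    return trajectory
```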

To overcome the limits of SFT (pre-training plateaus, costly chain-of-thought data), generate reasoning traces plus answers, verify the outcomes, and RL-train the model to favor high-reward paths. Startups and labs (DeepSeek, MiniMax) use thousands of such environments to improve performance on challenging tasks.
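
As an illustration of a verifiable reward (a sketch, not code from the talk): the model emits a reasoning trace plus a tagged answer, and the reward function checks the answer automatically, so no human-labelled chain-of-thought data is needed. The tag names are assumptions.

```python
import re


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the tagged final answer matches the ground truth, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0                          # unparseable output earns nothing
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth else 0.0


# A correct trace scores 1.0; an incorrect or malformed one scores 0.0.
print(verifiable_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.0
```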

Verifiers: Modular Library for LLM RL Environments

Verifiers (an open-source library by Prime Intellect) turns environments into installable Python packages for evaluation and training, abstracting away model serving (OpenAI-compatible APIs, vLLM), async parallel rollouts, response parsing (e.g., XML tags), and trainers (integrations with TRL and SkyLLM). The core environment types build on a multi-turn base with state dicts, dynamic responses, @vf_stop decorators for termination conditions (e.g., game over), and rubrics (weighted sums of reward functions).
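
To make the rubric idea concrete, here is a plain-Python sketch of a weighted reward sum; the class name and signature are illustrative, not necessarily the exact verifiers API.

```python
from typing import Callable


class WeightedRubric:
    """Total reward = weighted sum of several reward functions."""

    def __init__(self, funcs: list[Callable[[str, str], float]], weights: list[float]):
        assert len(funcs) == len(weights)
        self.funcs, self.weights = funcs, weights

    def score(self, completion: str, answer: str) -> float:
        return sum(w * f(completion, answer) for f, w in zip(self.funcs, self.weights))


def correct_answer(completion: str, answer: str) -> float:
    return 1.0 if answer in completion else 0.0


def has_think_tags(completion: str, answer: str) -> float:
    return 1.0 if "<think>" in completion and "</think>" in completion else 0.0


# Correctness dominates (weight 1.0); formatting contributes a smaller bonus (0.2).
rubric = WeightedRubric(funcs=[correct_answer, has_think_tags], weights=[1.0, 0.2])
print(rubric.score("<think>...</think> The answer is 42", "42"))  # 1.2
```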

  • Single-turn: e.g., a reverse-text env loads a 1,000-paragraph dataset, maps each entry to a prompt and ground truth, parses tagged answers, and rewards the longest-common-subsequence ratio (see the sketch after this list). Eval: 5 examples × 3 rollouts = 15 trajectories; the stats include reward distributions.
  • Multi-turn: e.g., double-check math: the model answers, the env replies "Are you sure?", and the loop continues until a stop condition.
  • Tool envs: define Python functions (e.g., wiki search); the model calls tools mid-reasoning. Supports MCP servers, stateful tools (e.g., DB sessions), and recursive LMs for long contexts.
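
For the single-turn reverse-text example above, the reward can be computed as a longest-common-subsequence ratio between the parsed answer and the ground truth. This is a hedged sketch, and the <answer> tag name is an assumption.

```python
import re


def lcs_length(a: str, b: str) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def reverse_text_reward(completion: str, ground_truth: str) -> float:
    """Parse the tagged answer and reward the LCS ratio against the ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return lcs_length(predicted, ground_truth) / max(len(ground_truth), 1)


text = "hello world"
print(reverse_text_reward(f"<answer>{text[::-1]}</answer>", text[::-1]))  # 1.0
```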

The Environments Hub shares these environments, fighting fragmentation, and pairs with RL training libraries. The focus stays on task logic and rewards, not infrastructure.

Tic-Tac-Toe Experiment: From Weak SLM to Master via SFT + RL

Start with GPT-4o Mini (strong baseline: good formatting, wins against a random opponent) versus LSM2-1.6B (weak baseline: poor formatting, many invalid moves, rare wins even against a random opponent). Build a tic-tac-toe env: the model plays X (sometimes moving first, sometimes second) and outputs a cell index 0-8; the env tracks the board and winner and runs the opponent, which mixes random and optimal (minimax) play with a controllable random-move probability (mean/max between 0 and 1). The episode continues after invalid moves, each costing -0.1, with the cumulative penalty capped at -8. Rewards: win (+1, weight 1), correct format/XML/think tags (weight 0.2), invalid move (-0.1). To reduce noise, fixed seeds per example/turn/board make opponent responses deterministic, and stratified batch sampling balances opponent difficulty (e.g., 20-70% random moves).
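
As a rough illustration (assumptions, not the experiment's actual code), the core environment logic reduces to a winner check plus a per-move reward matching the scheme above; the minimax branch of the opponent is omitted for brevity.

```python
import random

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board: list[str]) -> str | None:
    """Return "X" or "O" if a line is completed, else None."""
    for a, b, c in WINS:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def step(board: list[str], move: int, rng: random.Random,
         random_move_prob: float = 0.5) -> tuple[float, bool]:
    """Apply the model's move as X and answer with the opponent's move.
    Returns (reward, done) following the scheme above."""
    if not (0 <= move <= 8) or board[move] != " ":
        return -0.1, False                       # invalid move: penalty, episode continues
    board[move] = "X"
    if winner(board) == "X":
        return 1.0, True                         # win reward
    free = [i for i, cell in enumerate(board) if cell == " "]
    if not free:
        return 0.0, True                         # draw
    # Opponent: random move with some probability, otherwise an optimal
    # (minimax) move; the minimax search is omitted here.
    opp = rng.choice(free) if rng.random() < random_move_prob else free[0]
    board[opp] = "O"
    return (0.0, True) if winner(board) == "O" else (0.0, False)
```

Passing a seeded random.Random per example keeps the opponent deterministic across rollouts, matching the noise-reduction trick described above.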

Training LSM2:

  1. SFT warmup: generate 200 synthetic games with GPT-4o Mini (filtering out losses); training takes only minutes on a 96GB GPU and yields near-perfect formatting, fewer invalid moves, and better play.
  2. GRPO RL (Verifiers trainer): a batch size ≥256 is critical (small batches are unstable and can collapse because they contain too few complete games); n_groups sets how many rollouts share a baseline, with each rollout's advantage computed against its group's average reward (see the sketch after this list); GPUs are split between inference and training. Training plots show total and format rewards rising while invalid moves drop to ~0.
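
A minimal sketch of the group-relative advantage idea behind GRPO (not the trainer's actual code): rewards within a group of rollouts for the same prompt are centered on the group mean and normalized by the group standard deviation, so no learned value network is needed.

```python
def group_relative_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Center each rollout's reward on its group's mean and scale by the std."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]


# Example: in a group of 4 games from the same starting position, the win gets
# a positive advantage and the draws/losses get negative ones.
print(group_relative_advantages([1.0, 0.0, 0.0, -0.1]))
```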

Post-RL evaluation: the model dominates the random opponent (high win rate) and draws ~85% of games against the optimal opponent, with invalid moves near zero, outperforming both the base and SFT-only models. Code is available in a GitHub repo, including tips for avoiding OOM errors. The approach scales to multi-step and tool-using agents and is a fun, practical recipe for SLMs.

Video description
Reasoning models like DeepSeek R1 have demonstrated that learning from interaction is just as critical as learning from examples. To build these capabilities ourselves, we need to move beyond static datasets and start building Reinforcement Learning Environments: little worlds where models can act, get rewards, and learn. In this talk, I will walk you through my journey exploring this space from a practical software engineering perspective.

We will cover:
  • How classic Reinforcement Learning concepts translate to Language Models
  • Verifiers, an open-source library to build Environments as software artifacts
  • Concrete examples of environments, from single-turn tasks to multi-turn games and tool-using agents
  • How to use these environments for both evaluating and training Small Language Models

Join me to learn how to move from prompting models to building the gyms where they learn.

Stefano Fiorucci - AI/SW Engineer/Explorer, deepset
Stefano is an AI/Software Engineer and explorer. He currently works on AI Orchestration at deepset, where he contributes to and maintains Haystack, a widely used open-source framework for building LLM applications. He loves experimenting with Small Language Models, Post-Training and Reinforcement Learning, and shares his learning through code, writing, and talks.

Socials:
https://twitter.com/theanakin87
https://www.linkedin.com/in/stefano-fiorucci/
https://github.com/anakin87
https://huggingface.co/anakin87

Slides: https://drive.google.com/file/d/116PKThwtyTxeH1GmZQ7bL3HPYM6KCgHa/view?usp=drive_link
