Target Rollout Generation to Cut RL Training Time

In synchronous RL post-training for tasks like math reasoning and code generation, rollout generation dominates step time, accounting for 65-72% across RL-Think (continuing to train reasoning models) and RL-Zero (training base models from scratch) workloads on Qwen3-8B. Of the five RL stages (data loading, preparation, generation, log-prob recomputation at 27-33%, and optimization), generation is the sole high-impact target, since the other phases are unchanged by rollout optimizations.

Speculative decoding addresses this by using a fast draft model to propose multiple tokens, which the target model then verifies via rejection sampling. This guarantees an output distribution identical to autoregressive generation, avoiding the off-policy corrections and fidelity loss common to async, low-precision, or replay methods. The result: faster rollouts with unchanged training signals; KL penalties and GRPO losses are computed solely on target-policy samples.
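The propose-then-verify loop can be sketched as follows. The two toy softmax models and the tiny vocabulary are illustrative stand-ins, not the actual Qwen3 draft/target pair; only the accept/resample logic is the standard rejection-sampling scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

def draft_probs(ctx):
    # Toy stand-in for the fast draft model: a fixed softmax over the vocab.
    logits = np.arange(VOCAB, dtype=float) * 0.3 + 0.1 * (len(ctx) % 3)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_probs(ctx):
    # Toy stand-in for the large target model (a slightly different softmax).
    logits = np.arange(VOCAB, dtype=float) * 0.25 + 0.2 * (len(ctx) % 2)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(ctx, k=3):
    """One propose-then-verify step; returns the tokens emitted this step."""
    # 1) Draft proposes k tokens autoregressively.
    proposal, q_list = [], []
    c = list(ctx)
    for _ in range(k):
        q = draft_probs(c)
        t = rng.choice(VOCAB, p=q)
        proposal.append(t)
        q_list.append(q)
        c.append(t)
    # 2) Target verifies each proposed token via rejection sampling:
    #    accept token t with probability min(1, p[t] / q[t]).
    emitted = []
    c = list(ctx)
    for t, q in zip(proposal, q_list):
        p = target_probs(c)
        if rng.random() < min(1.0, p[t] / q[t]):
            emitted.append(t)
            c.append(t)
        else:
            # On rejection, resample from the residual max(0, p - q), normalized.
            # This correction is what makes the output distribution exactly p.
            r = np.maximum(p - q, 0.0)
            r = r / r.sum() if r.sum() > 0 else p
            emitted.append(rng.choice(VOCAB, p=r))
            return emitted  # stop at the first rejection
    # 3) All k accepted: the target samples one bonus token for free.
    emitted.append(rng.choice(VOCAB, p=target_probs(c)))
    return emitted

tokens = speculative_step([1, 2], k=3)
```

Each step therefore emits between 1 and k+1 tokens, and the accept/residual rule guarantees every emitted token is distributed exactly as the target model alone would produce it.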

Integrate via Two-Path Architecture in NeMo RL v0.6.0

Embed speculative decoding directly in NeMo RL via the vLLM backend (SGLang is also supported). A two-path system handles policy updates: a general EAGLE-3 path for any pretrained draft (no native MTP needed), and a native path for MTP-equipped models. Online adaptation caches verifier hidden states and log-probs to supervise the draft head without backpropagating into the policy, preventing interference with policy gradients.
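In spirit, enabling the EAGLE-3 path on the vLLM backend reduces to a generation-config fragment like the one below. The inner keys (`speculative_config`, `method`, `num_speculative_tokens`) follow vLLM's speculative-decoding interface, but the surrounding `policy.generation` nesting is a hypothetical NeMo RL layout; check the shipped v0.6.0 example configs for the exact schema:

```yaml
policy:
  generation:
    backend: vllm                          # SGLang is also supported
    vllm_cfg:
      speculative_config:
        method: eagle3                     # general path: any pretrained EAGLE-3 draft
        model: /path/to/eagle3-draft-head  # domain-aligned init (e.g. DAPO data) preferred
        num_speculative_tokens: 3          # draft length k; k=3 was optimal at 8B
```

The draft-head path and token count here are placeholders; the measured configs in the next section motivate the `num_speculative_tokens: 3` default.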

Critical configs maximize speedup:

  • Draft init: Domain-aligned (e.g., DAPO post-training data) beats generic (UltraChat/Magpie): 1.77× vs 1.51× gen speedup on RL-Zero at k=3.
  • Draft length k: Optimum is k=3 (1.77× RL-Zero, 1.53× RL-Think); k=5 drops to 1.44×/0.84× and k=7 to 1.21×/0.71×, as verification overhead outweighs the gains on complex reasoning traces.
  • Online adaptation: Boosts weak inits (UltraChat: 1.51× to 1.63×) but yields little for strong ones (DAPO: 1.77× to 1.78×).
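The diminishing returns from larger k can be reproduced with the standard analytical model of speculative decoding. The per-token acceptance rate alpha and the draft/target cost ratio c below are assumed illustrative values, not measurements from this work:

```python
# Analytical throughput model for speculative decoding (assumed parameters).
# With per-token acceptance rate alpha, a verify step emits on average
# E = (1 - alpha**(k+1)) / (1 - alpha) tokens, and costs k draft passes
# plus one target pass. Relative to plain autoregressive decoding (one
# token per target pass), the speedup is E / (k*c + 1).

def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, c: float, k: int) -> float:
    return expected_tokens(alpha, k) / (k * c + 1)

# Illustrative values: 75% acceptance, draft pass costing 25% of a target pass.
curve = {k: round(speedup(alpha=0.75, c=0.25, k=k), 3) for k in range(1, 8)}
best_k = max(curve, key=curve.get)
```

Under these assumed costs the optimum lands at k=3 and the curve falls off at k=5 and k=7, mirroring the measured trend; on real reasoning traces the fall-off is steeper still, since verification overhead grows with trace complexity.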

N-gram drafting fails despite accepting more than 2 tokens per step (0.7×/0.5× speedups), showing that acceptance length alone is insufficient when verification overhead slows net progress.

Speculative decoding also complements async execution: at 8B on RL-Think (policy lag 1, 16 nodes), it cuts exposed generation time from 10.4s to 0.6s per step and end-to-end step time from 75s to 60.5s (1.24×).

Achieve 1.8× Gen, 1.4× Step Speedup at 8B; 2.5× Projected at 235B

On 32 GB200 GPUs, EAGLE-3 cuts RL-Zero generation from 100s to 56.6s (1.8×) and RL-Think from 133.6s to 87s (1.54×), yielding 1.41×/1.35× step speedups. AIME-2024 validation accuracy matches the autoregressive baselines, confirming the lossless property.
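The headline generation speedups follow directly from the raw timings and can be sanity-checked in a few lines:

```python
# Sanity-check the generation speedups from the measured timings above.
timings = {
    "RL-Zero":  (100.0, 56.6),   # (autoregressive seconds, EAGLE-3 seconds)
    "RL-Think": (133.6, 87.0),
}
speedups = {w: round(base / spec, 2) for w, (base, spec) in timings.items()}
# speedups -> {"RL-Zero": 1.77, "RL-Think": 1.54}
```

The 1.77× RL-Zero ratio is what the section rounds to "1.8×".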

A simulator projects, for Qwen3-235B-A22B: synchronous on 512 GB200s at k=3 (acceptance length 3) gives 2.72× rollout and 1.70× end-to-end speedups; async on 2048 GPUs (lag 2) reaches ~3.5× rollout and ~2.5× end-to-end. Speculation shrinks per-rollout cost; async hides the remainder behind training compute.