TRL Code Guide: SFT to GRPO LLM Alignment on T4 GPU

Train Qwen2.5-0.5B through SFT, reward modeling (RM), DPO, and GRPO using TRL + LoRA on a Colab T4. Configs use r=8 LoRA, 300-sample datasets, one epoch, and small batch sizes with gradient accumulation for memory efficiency; custom math rewards boost reasoning.

LoRA and TRL Setup Enables Post-Training on Limited Hardware

Use LoRA (r=8, alpha=16, dropout=0.05, targets 'q_proj', 'k_proj', 'v_proj', 'o_proj') with TRL trainers to adapt Qwen/Qwen2.5-0.5B-Instruct on a T4 GPU (16GB). Common args across stages: num_train_epochs=1, gradient_checkpointing=True, bf16 if supported else fp16, logging_steps=10, report_to="none", save_strategy="no". Install stack: torchao>=0.16, trl>=0.20, transformers>=4.45, peft>=0.13, bitsandbytes. A chat_generate helper applies the chat template and samples with temperature=0.7 and top_p=0.9. Clean up VRAM with gc.collect() + torch.cuda.empty_cache() between stages to fit in Colab.
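A minimal setup sketch under these assumptions; the guide's exact helper code isn't shown, so chat_generate and the cleanup snippet are reconstructions, and variable names are illustrative:

```python
import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16  # T4 falls back to fp16

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=DTYPE, device_map="auto")

# Shared LoRA adapter config; the reward-modeling stage swaps task_type to "SEQ_CLS".
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

def chat_generate(model, tokenizer, prompt, max_new_tokens=128):
    """Apply the chat template, then sample with temperature=0.7 / top_p=0.9."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)

# Between stages: drop references and flush the CUDA allocator to stay inside 16 GB.
# del model, trainer
# gc.collect(); torch.cuda.empty_cache()
```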

SFT and RM Build Imitation and Reward Signals

For Supervised Fine-Tuning, load trl-lib/Capybara (train:300) and use SFTConfig(per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=2e-4, max_length=768). The trainer teaches the model to imitate high-quality chat responses; post-training inference on "Explain bias-variance tradeoff in two sentences" yields coherent output. Reward Modeling on trl-lib/ultrafeedback_binarized (train:300) uses RewardConfig(batch_size=2, accum_steps=2, lr=1e-4, max_length=512) with LoRA task_type="SEQ_CLS". It trains a classifier to score chosen vs. rejected pairs, producing a preference-based reward signal without explicit RL.
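A hedged sketch of both stages, reusing tokenizer and peft_config from the setup sketch above and expanding the summary's abbreviated config names to TRL's actual arguments (per_device_train_batch_size, gradient_accumulation_steps, learning_rate); output_dir values are illustrative:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification
from trl import RewardConfig, RewardTrainer, SFTConfig, SFTTrainer

# --- SFT: imitate high-quality chat responses ---
sft_dataset = load_dataset("trl-lib/Capybara", split="train[:300]")
sft_trainer = SFTTrainer(
    model=model,  # base Qwen2.5-0.5B-Instruct from the setup sketch
    args=SFTConfig(
        output_dir="sft-qwen",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_length=768,
        num_train_epochs=1,
        gradient_checkpointing=True,
        logging_steps=10,
        report_to="none",
        save_strategy="no",
    ),
    train_dataset=sft_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
sft_trainer.train()

# --- RM: score chosen vs. rejected pairs with a 1-logit sequence classifier ---
rm_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")
rm_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
rm_model.config.pad_token_id = tokenizer.pad_token_id  # needed for batched scoring
rm_trainer = RewardTrainer(
    model=rm_model,
    args=RewardConfig(
        output_dir="rm-qwen",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        learning_rate=1e-4,
        max_length=512,
        num_train_epochs=1,
        logging_steps=10,
        report_to="none",
        save_strategy="no",
    ),
    train_dataset=rm_dataset,
    processing_class=tokenizer,
    peft_config=LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="SEQ_CLS",
    ),
)
rm_trainer.train()
```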

DPO Skips RM for Direct Preference Alignment

DPOTrainer on the same ultrafeedback_binarized (train:300) subset simplifies alignment via implicit rewards: DPOConfig(batch_size=1, accum_steps=4, lr=5e-6, beta=0.1, max_length=512, max_prompt_length=256). Beta scales the implicit KL penalty against the reference policy, limiting drift and helping prevent mode collapse. The trainer optimizes the policy to prefer chosen over rejected responses directly, cutting out the separate reward model and PPO loop of traditional RLHF.
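A corresponding sketch, again reusing the setup objects; here model would be the SFT checkpoint, and passing ref_model=None together with a peft_config lets TRL derive the frozen reference policy by disabling the adapters rather than holding a second model in VRAM:

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

dpo_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")
dpo_trainer = DPOTrainer(
    model=model,     # SFT-tuned policy from the previous stage
    ref_model=None,  # with LoRA, reference = base weights with adapters off
    args=DPOConfig(
        output_dir="dpo-qwen",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,
        max_length=512,
        max_prompt_length=256,
        num_train_epochs=1,
        gradient_checkpointing=True,
        logging_steps=10,
        report_to="none",
        save_strategy="no",
    ),
    train_dataset=dpo_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
dpo_trainer.train()
```

Lowering beta lets the policy drift further from the reference; raising it keeps outputs closer to the SFT distribution.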

GRPO Uses Custom Rewards to Sharpen Reasoning

GRPOTrainer generates num_generations=4 completions per prompt (max_prompt_length=128, max_completion_length=96, max_steps=15) and ranks them via reward_funcs. Custom dataset: 200 synthetic math problems (e.g., "Solve 17 + 28 =", gold answer computed with eval). Rewards: correctness_reward (1.0 if the last extracted number matches gold, else 0.0) and brevity_reward (max(0, 1 - len(c)/200) * 0.2). GRPOConfig(lr=1e-5, batch=2, accum=2). Inference on "17+28?", "9*7?", "100-47?" produces accurate, concise answers (often just the final number), improving verifiable-task performance over the base model.
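A hedged reconstruction of this stage. The dataset generator and the exact reward-function bodies are assumptions (TRL passes extra dataset columns such as gold to reward functions as keyword arguments, and completions arrive as plain strings for text prompts); the hyperparameters come from the summary above:

```python
import random
import re

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# 200 synthetic arithmetic problems; gold answers computed with eval() (assumed generator).
random.seed(0)
rows = []
for _ in range(200):
    a, b = random.randint(1, 99), random.randint(1, 99)
    expr = f"{a} {random.choice(['+', '-', '*'])} {b}"
    rows.append({"prompt": f"Solve {expr} =", "gold": eval(expr)})
math_dataset = Dataset.from_list(rows)

def correctness_reward(completions, gold, **kwargs):
    """1.0 when the last number extracted from a completion matches gold, else 0.0."""
    scores = []
    for completion, answer in zip(completions, gold):
        numbers = re.findall(r"-?\d+", completion)
        scores.append(1.0 if numbers and int(numbers[-1]) == answer else 0.0)
    return scores

def brevity_reward(completions, **kwargs):
    """Small bonus for concise completions: max(0, 1 - len/200) * 0.2."""
    return [max(0.0, 1.0 - len(c) / 200) * 0.2 for c in completions]

grpo_trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # or the DPO checkpoint from the previous stage
    reward_funcs=[correctness_reward, brevity_reward],
    args=GRPOConfig(
        output_dir="grpo-qwen",
        learning_rate=1e-5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        num_generations=4,  # completions sampled and ranked per prompt
        max_prompt_length=128,
        max_completion_length=96,
        max_steps=15,
        logging_steps=10,
        report_to="none",
        save_strategy="no",
    ),
    train_dataset=math_dataset,
    peft_config=peft_config,
)
grpo_trainer.train()
```

Summing a binary correctness term with a small brevity bonus keeps the verifiable signal dominant while nudging the policy toward short, final-number answers.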
