#reinforcement-learning
Breaking Filter Bubbles with Semantic Pareto-DQN
A new reinforcement learning framework for recommender systems that treats engagement, diversity, and fairness as distinct, non-aggregable rewards to prevent semantic homogenization.
Fixing GRPO Failure Modes in Production
GRPO is more efficient than PPO but prone to silent failures such as advantage collapse and entropy loss. Techniques from DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), specifically dynamic sampling, token-level normalization, and decoupled KL, are essential for stable production training.
Verbal Reinforcement Learning: Closing the Feedback Loop
The paper introduces a framework for 'Verbal Reinforcement Learning' (VRL), shifting from raw reward signals to structured insight governance by extracting and managing verbal feedback from world interactions.
SVoT: Enhancing Spatial Reasoning via State-Aware Visualization
SVoT improves spatial reasoning in LLMs by using reinforcement learning to generate state-aware visual representations of thought, allowing models to track complex spatial relationships more accurately than text-only chain-of-thought.
Optimizing AI for Tool Use via RL and Data Quality
Improving model performance on complex tasks often requires teaching tool discipline through RL and high-quality data rather than scaling model size. A 4B-parameter model outperformed a 235B-parameter model by learning to inspect schemas and self-correct errors.
AI Engineer
Harness-1: Offloading Bookkeeping to Improve Search Agent Performance
Harness-1 improves retrieval performance by separating search policy from state management, using a stateful harness to handle bookkeeping and memory, allowing the 20B model to focus on semantic decisions.
SIA: Self-Improving Agents That Evolve Scaffold and Weights
Hexo Labs' open-source SIA framework enables AI agents to autonomously improve by iteratively updating both their operational harness (prompts/tools) and internal model weights (via LoRA) within a single feedback loop.
Practical Lessons in Building Adaptive Routing Agents with RL
Building a DQN-based routing agent reveals that reinforcement learning is often fragile; success depends less on the algorithm and more on rigorous reward shaping, stability tracking, and evaluation beyond simple success rates.
COSMO-Agent: Automating CAD-CAE Design Loops with LLMs
COSMO-Agent is a reinforcement learning framework that enables LLMs to bridge the CAD-CAE semantic gap by orchestrating external tools to perform iterative, constraint-driven geometric design.
Physical AI Trains Robots via Sim + RL Feedback Loops
Physical AI equips robots with vision-language-action models (VLAs) for perception, reasoning, and action, trains them with reinforcement learning in randomized simulations, and iterates on real-world data to close the sim-to-real gap in messy environments.
IBM Technology
Relative Slate Bandits for E-com Homepage Picks
Use group-relative contextual bandits to select product slates for e-commerce homepages, relying on relative quality signals within each group for efficient RL instead of full prediction models.
RL Solves Sequential Coupon Optimization
Treat coupon decisions (when to send, to whom, and at what strength) as a sequential problem and apply reinforcement learning to balance conversion, margins, budgets, and customer fatigue. The approach is backed by field experiments.
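The "advantage collapse" failure named in the GRPO summary above can be illustrated with a minimal sketch. It assumes scalar rewards per completion and normalizes each reward against its own sampling group; the function names (`group_relative_advantages`, `has_learning_signal`, `dynamic_sample`) are hypothetical and not taken from any summarized article or library. When every completion in a group gets the same reward, all advantages are zero and the group contributes no gradient. The filter here is a simplified stand-in for dynamic sampling: it only drops such groups, whereas a production version would also resample to refill the batch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards, tol=1e-8):
    """A group whose rewards are all (nearly) equal yields zero advantages."""
    return max(rewards) - min(rewards) > tol


def dynamic_sample(groups):
    """Keep only the groups that still carry a gradient signal (assumed filter)."""
    return [g for g in groups if has_learning_signal(g)]


groups = [
    [1.0, 1.0, 1.0, 1.0],  # every completion correct: advantages collapse to 0
    [0.0, 0.0, 0.0, 0.0],  # every completion wrong: same collapse
    [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: useful signal
]

collapsed = group_relative_advantages(groups[0])
mixed = group_relative_advantages(groups[2])
kept = dynamic_sample(groups)
```

Tracking the fraction of groups that `dynamic_sample` drops per step is one simple way to surface this failure before it silently stalls training.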