RL Agent Outperforms Similarity in LLM Memory Retrieval

Train a PPO agent in a custom Gym environment to pick the optimal memory from the top-8 similarity candidates, using features such as cosine similarity, entity/slot match, and rank; it beats the cosine baseline on retrieval accuracy (val/test splits) and on downstream LLM QA.

Synthetic Dataset Captures Retrieval Noise for Realistic Training

Create a memory bank from 8 entities across domains (robotics, astronomy, biomedicine, climate, logistics, materials, agriculture, healthcare) with 5 facts each (e.g., Astra battery: "18 hours", Orion aperture: "8 meters"). Generate factual memories via 5 phrasing templates (e.g., "{entity} has {slot}: {value}"), distractors (5 per entity, e.g., "{entity} was discussed in a briefing"), and 8 noise items (e.g., "system maintenance occurred on Tuesday"). Total: ~100 items. Embed texts with OpenAI text-embedding-3-small (L2-normalized). Build ~60 queries targeting facts (e.g., "What is the battery of Astra?"), embedded the same way. For each query, fetch the top-8 cosine candidates; the gold memory often ranks lower due to phrasing variance, forcing the agent to look beyond raw similarity.
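The generation scheme above can be sketched as follows. This is a minimal reconstruction, not the original code: only two entities and two templates are shown, and the metadata fields (`entity`, `slot`, `gold`) are assumed for illustration; the real bank has 8 entities, 5 facts each, and 5 templates.

```python
# Hypothetical reconstruction of the synthetic memory bank described above;
# entity/fact/template names follow the examples in the text, the rest is illustrative.
ENTITIES = {
    "Astra": {"domain": "robotics", "facts": {"battery": "18 hours"}},
    "Orion": {"domain": "astronomy", "facts": {"aperture": "8 meters"}},
}

FACT_TEMPLATES = [  # the original uses 5 phrasing templates; two shown here
    "{entity} has {slot}: {value}",
    "{entity} in {domain} uses {value} for {slot}",
]
DISTRACTOR_TEMPLATES = ["{entity} was discussed in a briefing"]
NOISE = ["system maintenance occurred on Tuesday"]

def build_memory_bank():
    """Return (text, metadata) pairs: facts, distractors, and noise items."""
    bank = []
    for entity, info in ENTITIES.items():
        # Factual memories: every (fact, template) combination.
        for slot, value in info["facts"].items():
            for tmpl in FACT_TEMPLATES:
                text = tmpl.format(entity=entity, slot=slot,
                                   value=value, domain=info["domain"])
                bank.append((text, {"entity": entity, "slot": slot, "gold": True}))
        # Entity-level distractors: mention the entity but carry no fact.
        for tmpl in DISTRACTOR_TEMPLATES:
            bank.append((tmpl.format(entity=entity),
                         {"entity": entity, "slot": None, "gold": False}))
    # Noise items unrelated to any entity.
    for text in NOISE:
        bank.append((text, {"entity": None, "slot": None, "gold": False}))
    return bank
```

Each text would then be embedded with text-embedding-3-small and L2-normalized, so cosine similarity reduces to a dot product at query time.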

Custom Features and Rewards Teach Agent Relevance Over Similarity

State: 8×5 = 40 candidate features (cosine sim, keyword overlap, entity_match=1 if the entity appears in the text, slot_match=1 if the slot appears in the text, inverse rank 1/(1+rank)) + 2 globals (unique_topic_bonus=1 if a topic appears in the query, normalized query length), for 42 dims total. Action: discrete 0-7, selecting one candidate. Reward: 2.0*is_gold + 0.8*entity_match + 0.6*slot_match + 0.5*sim + 0.3*overlap - 0.15*rank. Gym env (MemoryRetrievalEnv): reset samples a query uniformly; step yields reward and info (is_correct, texts). Split data 70/15/15 into train/val/test. Train PPO (MlpPolicy, lr=3e-4, n_steps=256, batch_size=64, gamma=0.99, gae_lambda=0.95, ent_coef=0.01, clip_range=0.2, 12k timesteps) on a DummyVecEnv.
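The state layout, reward shaping, and one-step episode structure can be sketched as below. This is a plain-Python sketch of the Gym-style API (reset/step) rather than the original `gym.Env` subclass, and the feature-dict field names are assumptions; the stable-baselines3 wiring is indicated in comments.

```python
import random

FEATS = 5                    # per-candidate features
K = 8                        # top-K similarity candidates per query
STATE_DIM = K * FEATS + 2    # 8*5 = 40 candidate features + 2 globals = 42

def reward_fn(c, rank):
    """Shaped reward from the text; c is a per-candidate feature dict."""
    return (2.0 * c["is_gold"] + 0.8 * c["entity_match"] + 0.6 * c["slot_match"]
            + 0.5 * c["sim"] + 0.3 * c["overlap"] - 0.15 * rank)

class MemoryRetrievalEnv:
    """Sketch of the custom env: one-step episodes over precomputed queries.

    queries: dicts with "candidates" (K feature dicts, sorted by similarity
    rank) plus the two global features.
    """
    def __init__(self, queries, seed=0):
        self.queries = queries
        self.rng = random.Random(seed)
        self.current = None

    def _state(self, q):
        vec = []
        for rank, c in enumerate(q["candidates"]):
            vec += [c["sim"], c["overlap"], c["entity_match"],
                    c["slot_match"], 1.0 / (1 + rank)]
        vec += [q["topic_bonus"], q["query_len_norm"]]  # globals
        return vec

    def reset(self):
        self.current = self.rng.choice(self.queries)    # uniform query sampling
        return self._state(self.current)

    def step(self, action):
        c = self.current["candidates"][action]
        reward = reward_fn(c, action)                   # action index == rank
        info = {"is_correct": bool(c["is_gold"])}
        return self._state(self.current), reward, True, info  # always done

# With the real gym.Env subclass, training follows the summary's settings:
# model = PPO("MlpPolicy", DummyVecEnv([make_env]), learning_rate=3e-4,
#             n_steps=256, batch_size=64, gamma=0.99, gae_lambda=0.95,
#             ent_coef=0.01, clip_range=0.2)
# model.learn(total_timesteps=12_000)
```

Because is_gold dominates the reward (weight 2.0) while sim contributes at most 0.5, the agent is pushed to prefer entity/slot-matching gold memories even when a distractor has higher cosine similarity.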

Retrieval Gains Transfer to Accurate LLM Answers

Baseline: pick the max-similarity candidate. RL: predict a deterministic action on the state. Evaluate retrieval accuracy (exact gold match): RL beats the baseline on both val and test (the code prints values rounded to 4 decimals; bar plots confirm). Downstream QA: feed the single retrieved memory to gpt-4o-mini (system prompt: answer only from memories or say 'I do not know'), with exactness judged by gpt-4o-mini (JSON score; >=0.5 counts as correct). On a sample of 12 test queries, RL QA accuracy exceeds the baseline (shown as a table and bar chart). Example: for "What is the telescope of Orion?", the baseline grabs a distractor ("Orion has been compared...") while RL picks the gold memory ("Orion in astronomy uses infrared array for telescope"). An interactive demo embeds a new query and shows the top-8 candidates for manual or RL selection.
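The retrieval comparison can be sketched as below, assuming the same per-candidate feature dicts as in the summary. The baseline simply takes the top-ranked candidate; the RL pick in the original would come from stable-baselines3's `model.predict(state, deterministic=True)`, stubbed here as any function mapping a query to a candidate index.

```python
def baseline_pick(query):
    """Similarity baseline: choose the candidate with the highest cosine sim."""
    sims = [c["sim"] for c in query["candidates"]]
    return sims.index(max(sims))

def retrieval_accuracy(queries, pick_fn):
    """Fraction of queries where the picked candidate is the gold memory.

    pick_fn(query) -> candidate index; for the RL policy this would wrap
    model.predict(state, deterministic=True) from stable-baselines3.
    """
    hits = sum(q["candidates"][pick_fn(q)]["is_gold"] for q in queries)
    return round(hits / len(queries), 4)  # the code prints 4-decimal accuracy
```

The gap between the two shows up exactly on queries like the Orion example, where a distractor outranks the gold memory on cosine similarity but the policy's entity/slot features point at the gold item.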

Summarized by x-ai/grok-4.1-fast via openrouter

9204 input / 1608 output tokens in 24964ms

© 2026 Edge