The Challenge of Implicit Intent
Modern generative models for images and videos rely on precise, well-formed prompts. However, end users rarely provide detailed instructions, so generated content often fails to match their preferences. The problem is twofold: user behavior data is not natively legible to language-based reasoning models, and standard models lack the skill of translating interaction history into actionable generation instructions.
The NaviGen Architecture
NaviGen addresses these gaps by introducing a novel representation and training pipeline:
- Dual Identifier Representation: Each item in a user's history is encoded with a dual identifier that couples a collaborative code (capturing behavioral patterns) with a textual code (capturing semantic meaning). The resulting unified token stream serves as both a behavioral substrate and a semantic bridge, allowing the model to reason about user intent from the interaction history.
- Two-Stage Alignment Pipeline:
  - Stage 1 (SFT): Supervised fine-tuning distills preference reasoning and instruction-writing capabilities from supervision data obtained through evolutionary search.
  - Stage 2 (RL): Reinforcement learning with hierarchical and self-consistent rewards further aligns the model, so that the generated instructions are both relevant to the user's history and optimized for high-fidelity visual synthesis.
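To make the dual identifier idea concrete, the following is a minimal sketch of how a history might be serialized into a single token stream. It is an illustration under stated assumptions, not NaviGen's actual encoding: the `Item` class, the `<c{level}_{code}>` token format, the `<item>` delimiters, and the whitespace tokenization of the textual code are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    """One history entry carrying both identifiers (hypothetical structure)."""
    collab_code: Tuple[int, ...]  # assumed: discrete codes derived from behavioral data
    text_code: str                # assumed: short textual descriptor of the item


def item_tokens(item: Item) -> List[str]:
    """Serialize one item as collaborative tokens followed by textual tokens."""
    # One special token per code level, e.g. <c0_3> for level 0, code 3.
    collab = [f"<c{level}_{code}>" for level, code in enumerate(item.collab_code)]
    # Naive whitespace split stands in for a real tokenizer.
    text = item.text_code.lower().split()
    return collab + text


def history_to_stream(history: List[Item]) -> List[str]:
    """Concatenate all items, in order, into one delimited token stream."""
    stream: List[str] = []
    for item in history:
        stream.append("<item>")
        stream.extend(item_tokens(item))
        stream.append("</item>")
    return stream


if __name__ == "__main__":
    history = [
        Item(collab_code=(3, 17), text_code="Red running shoes"),
        Item(collab_code=(3, 42), text_code="Trail backpack"),
    ]
    print(" ".join(history_to_stream(history)))
```

In this sketch the two items share the first collaborative code (`3`), which is how behaviorally similar items could appear related to the model even when their text differs, while the textual tokens keep each item readable to a language model.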
Performance and Impact
Experiments across product, gaming, and short-video domains demonstrate that NaviGen significantly improves the quality of personalized generation. By turning interaction history into executable instructions, the model achieves higher next-item prediction accuracy and produces more specific, relevant, and visually coherent outputs than baseline methods that lack this behavioral-to-semantic translation layer.