NaviGen: Bridging User History and Personalized Multimodal Generation

The Challenge of Implicit Intent

Modern generative models for images and videos rely on precise, well-formed prompts. End users, however, rarely provide detailed instructions, so the generated content often fails to match their preferences. The core problem is twofold: user behavior data is not natively legible to language-based reasoning models, and standard models lack the specific skill of translating interaction history into actionable generation instructions.

The NaviGen Architecture

NaviGen addresses both gaps with a new item representation and a two-stage training pipeline:

Dual Identifier Representation: Each item in a user's history is encoded with a dual identifier that couples a collaborative code (capturing behavioral patterns) with a textual code (capturing semantic meaning). The resulting unified token stream serves as both a behavioral substrate and a semantic bridge, so the model can reason about user intent directly from the history.
Two-Stage Alignment Pipeline:
- Stage 1 (SFT): Supervised Fine-Tuning (SFT) distills preference reasoning and instruction-writing capabilities into the model from supervision data obtained through evolutionary search.
- Stage 2 (RL): The model is further aligned using Reinforcement Learning (RL) with hierarchical and self-consistent rewards. This ensures that the generated instructions are not only relevant to the user's history but also optimized for high-fidelity visual synthesis.
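
To make the dual identifier idea concrete, here is a minimal Python sketch of how a history of items might be flattened into a single token stream. The article does not specify the encoding, so everything here is an assumption made for illustration: the class name `DualIdentifier`, the `<c{level}_{code}>` token format, the `<item>`/`<sep>` delimiters, and the sample codes are all hypothetical, not NaviGen's actual format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DualIdentifier:
    """One history item: a behavioral code paired with a semantic code.

    Both fields are hypothetical stand-ins for the article's
    "collaborative code" and "textual code".
    """
    collaborative_code: Sequence[int]  # assumed: discrete codes derived from behavior
    textual_code: Sequence[str]        # assumed: short keywords describing the item

    def to_tokens(self) -> List[str]:
        # Render each collaborative code as a special token tagged with its position.
        collab = [f"<c{level}_{code}>" for level, code in enumerate(self.collaborative_code)]
        # Delimiters (assumed) keep the two halves of the identifier separable.
        return ["<item>", *collab, "<sep>", *self.textual_code, "</item>"]


def build_history_stream(history: Sequence[DualIdentifier]) -> List[str]:
    """Concatenate per-item tokens, in order, into one unified stream."""
    tokens: List[str] = []
    for item in history:
        tokens.extend(item.to_tokens())
    return tokens


# Made-up example history: two items that share their first collaborative code.
history = [
    DualIdentifier(collaborative_code=(12, 7), textual_code=("trail", "running", "shoes")),
    DualIdentifier(collaborative_code=(12, 3), textual_code=("hydration", "vest")),
]
stream = build_history_stream(history)
print(" ".join(stream))
```

In this sketch, the shared leading code (`<c0_12>`) stands in for a behavioral signal that the two items are consumed by similar users, while the keywords give a language model something it can read directly. A real system would presumably learn both codes rather than hand-assign them.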

Performance and Impact

Experiments across product, gaming, and short-video domains demonstrate that NaviGen significantly improves the quality of personalized generation. By turning interaction history into executable instructions, the model achieves higher next-item prediction accuracy and produces more specific, relevant, and visually coherent outputs than baseline methods that lack this behavioral-to-semantic translation layer.
