VRAG: Multimodal Agentic RAG with RL Training
VRAG provides retrieval-augmented generation over images, PDFs, and videos via multi-turn agents. It supports GVE and Qwen embedding models (2048-4096 dims), DashScope API demos, and RL training on Qwen2.5-VL-7B.
Multimodal Retrieval Setup Handles Images, PDFs, Videos
Prepare the corpus by placing images directly, converting PDFs to page images, and chunking videos into segments. Build the searchable index with one of the embedding models: Alibaba-NLP/GVE-3B (2048 dims, Qwen2.5-VL based), GVE-7B (3584 dims, higher quality but more VRAM), Qwen/Qwen3-VL-Embedding-2B (2048 dims), or -8B (4096 dims). Indexing saves periodically and resumes from checkpoints if interrupted. Launch the FastAPI search engine at http://localhost:8001/search; it returns the top-K=3 results per agent query by default.
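The resumable indexing behavior can be sketched as follows. This is an illustrative sketch, not the repo's actual indexer: the embedding callback, checkpoint path, and save interval are all assumptions.

```python
import json
import os

def build_index(docs, embed_fn, ckpt_path="index_ckpt.json", save_every=100):
    """Embed docs into an index, checkpointing so interrupted runs resume.

    docs: dict of doc_id -> content; embed_fn: content -> embedding vector.
    (Hypothetical sketch -- not the repo's actual implementation.)
    """
    index = {}
    if os.path.exists(ckpt_path):  # resume from a previous, interrupted run
        with open(ckpt_path) as f:
            index = json.load(f)
    for n, (doc_id, content) in enumerate(docs.items(), 1):
        if doc_id in index:
            continue  # already embedded before the interruption
        index[doc_id] = embed_fn(content)
        if n % save_every == 0:  # periodic checkpoint
            with open(ckpt_path, "w") as f:
                json.dump(index, f)
    with open(ckpt_path, "w") as f:  # final save
        json.dump(index, f)
    return index
```

On a rerun after a crash, documents already in the checkpoint are skipped, so only the remaining items are re-embedded.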
This retrieval setup strengthens agent reasoning over noisy multimodal data: iterative search-and-refine cycles outperform static, single-shot RAG.
Agentic Demos: VimRAG (API) and VRAG (Local)
The VimRAG demo (recommended; no GPU needed) uses Qwen3.5-Plus through the DashScope OpenAI-compatible API (https://dashscope.aliyuncs.com/compatible-mode/v1), configured for at most 20 reasoning steps and top-K=3 searches. Launch it with Streamlit: streamlit run demo/vimrag_app.py. It supports multi-turn interaction over screenshots, diagrams, and videos with visual grounding.
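A request against that OpenAI-compatible endpoint looks roughly like the sketch below; the /chat/completions path follows the standard OpenAI-compatible schema, while the lowercase model id string is an assumption to verify against DashScope's model list.

```python
def build_chat_request(prompt, model="qwen3.5-plus"):
    """Build a chat-completions request for the DashScope compatible-mode API.

    The model id is an assumption; the endpoint path is the standard
    OpenAI-compatible /chat/completions route.
    """
    return {
        "url": "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions",
        "headers": {"Authorization": "Bearer <DASHSCOPE_API_KEY>"},
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
```

The returned dict can be passed directly to an HTTP client (e.g. requests.post(**req)) once a real API key replaces the placeholder.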
The VRAG demo runs Qwen2.5-VL-7B locally via vLLM for full control and supports the same corpus. Both agents iterate retrieval and generation, handling complex queries such as video event localization or diagram analysis, as shown in the GIF demos of iterative refinement.
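The iterate-until-answer control flow shared by both agents can be sketched as below. The action format and callback names are hypothetical stand-ins; the step cap and top-K defaults mirror the demo configuration above.

```python
def run_agent(query, llm_step, search, max_steps=20, top_k=3):
    """Multi-turn loop: at each step the model either searches or answers.

    llm_step(query, evidence) -> {"type": "search", "query": ...} or
                                 {"type": "answer", "text": ...}
    search(q, top_k) -> list of retrieved items. Both are stand-ins.
    """
    evidence = []
    for _ in range(max_steps):
        action = llm_step(query, evidence)
        if action["type"] == "answer":
            return action["text"]
        # Refine the evidence pool with top-K results for the new sub-query.
        evidence.extend(search(action["query"], top_k))
    return None  # step budget exhausted without a final answer
```

Each iteration lets the model reformulate its search based on evidence gathered so far, which is what distinguishes this loop from a single static retrieval pass.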
Programmatic use: initialize the agent with the API and search URLs, then call agent.run(query) to get a JSON response with reasoning traces.
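Concretely, the pattern looks like this sketch. The class name, constructor arguments, and response fields are assumptions standing in for the repo's actual agent API; only the two URLs come from the text above.

```python
import json

class VRAGAgent:
    """Hypothetical stand-in for the repo's agent class (names assumed)."""

    def __init__(self, api_url, search_url):
        self.api_url = api_url        # LLM endpoint
        self.search_url = search_url  # multimodal search service

    def run(self, query):
        # The real agent would loop over model calls and searches; this
        # stub only returns the JSON shape described above.
        return {"query": query, "reasoning_trace": [], "answer": None}

agent = VRAGAgent(
    api_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    search_url="http://localhost:8001/search",
)
print(json.dumps(agent.run("Where does the event occur in the video?")))
```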
RL Training for Custom Multi-Modal Agents
VRAG-RL trains agents with GRPO on Qwen2.5-VL-7B using the verl framework; install via conda/pip and run train_grpo_qwen2_5_vl_7b.sh. Training targets multi-turn multimodal reasoning and improves robustness to retrieval noise, as reported in the accompanying arXiv papers. VimRAG (Qwen3-VL) training is forthcoming after review. The framework builds on ViDoRAG for dynamic iterative agents and integrates LLaMA-Factory, Search-R1, and verl.
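GRPO's core idea, scoring each rollout relative to the other rollouts sampled for the same prompt, can be illustrated with this simplified sketch (not verl's actual implementation):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of its sampling group. Simplified illustration of the
    GRPO advantage computation.
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Rollouts that beat their group's average get positive advantages and are reinforced; in multi-turn agent training the reward covers the whole search-and-answer trajectory.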
Achieves state-of-the-art results on the ViDoSeek benchmark via an actor-critic multi-agent paradigm; trained models are released in the Hugging Face VRAG Collection.