VRAG: Multimodal Agentic RAG with RL Training
VRAG provides retrieval-augmented generation over images, PDFs, and videos via multi-turn agents. It supports GVE and Qwen embedding models (2048-4096 dims), DashScope API demos, and RL training on Qwen2.5-VL-7B.
Multimodal Retrieval Setup Handles Images, PDFs, Videos
Prepare the corpus by placing images directly, converting PDFs to page images, and chunking videos into segments. Build the searchable index with one of the embedding models: Alibaba-NLP/GVE-3B (2048 dims, Qwen2.5-VL based), GVE-7B (3584 dims, higher quality but more VRAM), Qwen/Qwen3-VL-Embedding-2B (2048 dims), or -8B (4096 dims). Indexing saves periodically and resumes from checkpoints if interrupted. Launch the FastAPI search engine at http://localhost:8001/search; it returns the top-K=3 results per agent query by default.
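The resumable indexing behavior can be sketched as follows. This is an illustrative sketch, not the repo's actual indexer: the embedding callback, checkpoint path, and save interval are all assumptions.

```python
import json
import os

def build_index(docs, embed_fn, ckpt_path="index_ckpt.json", save_every=100):
    """Embed docs into an index, checkpointing so interrupted runs resume.

    docs: dict of doc_id -> content; embed_fn: content -> embedding vector.
    (Hypothetical sketch -- not the repo's actual implementation.)
    """
    index = {}
    if os.path.exists(ckpt_path):  # resume from a previous, interrupted run
        with open(ckpt_path) as f:
            index = json.load(f)
    for n, (doc_id, content) in enumerate(docs.items(), 1):
        if doc_id in index:
            continue  # already embedded before the interruption
        index[doc_id] = embed_fn(content)
        if n % save_every == 0:  # periodic checkpoint
            with open(ckpt_path, "w") as f:
                json.dump(index, f)
    with open(ckpt_path, "w") as f:  # final save
        json.dump(index, f)
    return index
```

On a rerun after a crash, documents already in the checkpoint are skipped, so only the remaining items are re-embedded.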
This retrieval setup strengthens agent reasoning over noisy multimodal data: iterative search-and-refine cycles outperform static, single-shot RAG.
Agentic Demos: VimRAG (API) and VRAG (Local)
The VimRAG demo (recommended; no GPU needed) uses Qwen3.5-Plus through the DashScope OpenAI-compatible API (https://dashscope.aliyuncs.com/compatible-mode/v1), configured for at most 20 reasoning steps and top-K=3 searches. Launch it with Streamlit: streamlit run demo/vimrag_app.py. It supports multi-turn interaction over screenshots, diagrams, and videos with visual grounding.
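A request against that OpenAI-compatible endpoint looks roughly like the sketch below; the /chat/completions path follows the standard OpenAI-compatible schema, while the lowercase model id string is an assumption to verify against DashScope's model list.

```python
def build_chat_request(prompt, model="qwen3.5-plus"):
    """Build a chat-completions request for the DashScope compatible-mode API.

    The model id is an assumption; the endpoint path is the standard
    OpenAI-compatible /chat/completions route.
    """
    return {
        "url": "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions",
        "headers": {"Authorization": "Bearer <DASHSCOPE_API_KEY>"},
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
```

The returned dict can be passed directly to an HTTP client (e.g. requests.post(**req)) once a real API key replaces the placeholder.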
The VRAG demo runs Qwen2.5-VL-7B locally via vLLM for full control and supports the same corpus. Both agents iterate retrieval and generation, handling complex queries such as video event localization or diagram analysis, as shown in the GIF demos of iterative refinement.
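The iterate-until-answer control flow shared by both agents can be sketched as below. The action format and callback names are hypothetical stand-ins; the step cap and top-K defaults mirror the demo configuration above.

```python
def run_agent(query, llm_step, search, max_steps=20, top_k=3):
    """Multi-turn loop: at each step the model either searches or answers.

    llm_step(query, evidence) -> {"type": "search", "query": ...} or
                                 {"type": "answer", "text": ...}
    search(q, top_k) -> list of retrieved items. Both are stand-ins.
    """
    evidence = []
    for _ in range(max_steps):
        action = llm_step(query, evidence)
        if action["type"] == "answer":
            return action["text"]
        # Refine the evidence pool with top-K results for the new sub-query.
        evidence.extend(search(action["query"], top_k))
    return None  # step budget exhausted without a final answer
```

Each iteration lets the model reformulate its search based on evidence gathered so far, which is what distinguishes this loop from a single static retrieval pass.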
Programmatic use: initialize the agent with the API and search URLs, then call agent.run(query) to get a JSON response with reasoning traces.
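Concretely, the pattern looks like this sketch. The class name, constructor arguments, and response fields are assumptions standing in for the repo's actual agent API; only the two URLs come from the text above.

```python
import json

class VRAGAgent:
    """Hypothetical stand-in for the repo's agent class (names assumed)."""

    def __init__(self, api_url, search_url):
        self.api_url = api_url        # LLM endpoint
        self.search_url = search_url  # multimodal search service

    def run(self, query):
        # The real agent would loop over model calls and searches; this
        # stub only returns the JSON shape described above.
        return {"query": query, "reasoning_trace": [], "answer": None}

agent = VRAGAgent(
    api_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    search_url="http://localhost:8001/search",
)
print(json.dumps(agent.run("Where does the event occur in the video?")))
```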
RL Training for Custom Multi-Modal Agents
VRAG-RL trains agents with GRPO on Qwen2.5-VL-7B using the verl framework; install via conda/pip and run train_grpo_qwen2_5_vl_7b.sh. Training targets multi-turn multimodal reasoning and improves robustness to retrieval noise, as reported in the accompanying arXiv papers. VimRAG (Qwen3-VL) training is forthcoming after review. The framework builds on ViDoRAG for dynamic iterative agents and integrates LLaMA-Factory, Search-R1, and verl.
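GRPO's core idea, scoring each rollout relative to the other rollouts sampled for the same prompt, can be illustrated with this simplified sketch (not verl's actual implementation):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of its sampling group. Simplified illustration of the
    GRPO advantage computation.
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Rollouts that beat their group's average get positive advantages and are reinforced; in multi-turn agent training the reward covers the whole search-and-answer trajectory.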
Achieves state-of-the-art results on the ViDoSeek benchmark via an actor-critic multi-agent paradigm; trained models are released in the Hugging Face VRAG Collection.