Replace Similarity with Reasoning for Relevant Retrieval
Traditional vector RAG struggles on long professional documents such as financial reports because semantic similarity is not the same as relevance: a passage can resemble the query without answering it, and finding the right section often takes domain-specific reasoning. PageIndex addresses this by building a hierarchical tree index that mimics a table of contents. Each node holds a title, an ID, a page range (start_index/end_index), a summary, and child nodes. An LLM then performs an agentic tree search, reasoning over the tree to navigate to the exact sections it needs, much as a human reader would. Results are traceable because each retrieved section carries page references, unlike opaque vector matches. The core process has two steps: (1) generate the tree from a PDF or Markdown file; (2) answer queries by reasoning over the tree. Trade-off: the approach depends on LLM API calls and their cost, but it avoids vector DB setup and chunking artifacts.
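The node layout and the tree search described above can be sketched in a few lines of Python. This is an illustrative sketch, not PageIndex's actual code: the field names node_id and nodes, the Selector type, and the tree_search function are assumptions made for the example (only start_index and end_index come from the description above). The keyword_selector is a deliberately crude stand-in for the LLM so that the sketch runs offline; in a real pipeline that function would prompt a model with the query plus the candidate titles and summaries.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    title: str
    node_id: str              # assumed field name for the node ID
    start_index: int          # first page covered by this section
    end_index: int            # last page covered by this section
    summary: str = ""
    nodes: List["Node"] = field(default_factory=list)  # child sections (assumed name)


# A selector receives the query and a list of sibling nodes and returns the
# node_ids worth descending into. In practice this would be an LLM call.
Selector = Callable[[str, List[Node]], List[str]]


def tree_search(query: str, nodes: List[Node], select: Selector) -> List[Node]:
    """Descend only into the branches the selector picks; return the deepest picks."""
    hits: List[Node] = []
    chosen = set(select(query, nodes))
    for node in nodes:
        if node.node_id not in chosen:
            continue
        deeper = tree_search(query, node.nodes, select) if node.nodes else []
        # If no child was selected, fall back to the parent section itself.
        hits.extend(deeper or [node])
    return hits


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def keyword_selector(query: str, candidates: List[Node]) -> List[str]:
    """Offline stand-in for the LLM: pick nodes sharing any word with the query."""
    words = _tokens(query)
    return [n.node_id for n in candidates if words & _tokens(n.title + " " + n.summary)]


# Hypothetical three-level tree for a financial report.
root = Node("Annual Report", "0000", 1, 120,
            "Full-year financial report with revenue, risk and governance sections",
            [
                Node("Financial Statements", "0001", 40, 90,
                     "Balance sheet, income statement, revenue and cash flow",
                     [
                         Node("Income Statement", "0002", 45, 52,
                              "Revenue, cost of sales and net income"),
                         Node("Balance Sheet", "0003", 53, 60,
                              "Assets, liabilities and equity"),
                     ]),
                Node("Risk Factors", "0004", 10, 39,
                     "Market, credit and operational risk"),
            ])

for hit in tree_search("net revenue", [root], keyword_selector):
    print(f"{hit.title} (pages {hit.start_index}-{hit.end_index})")
```

Because the selector is passed in as a plain callable, the traversal logic stays the same whether the choice at each level is made by a keyword match, as here, or by an LLM reasoning over the summaries. Each hit keeps its page range, which is what makes the result traceable.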
Generate and Query Trees in Minutes
Install the dependencies with pip3 install --upgrade -r requirements.txt, then add OPENAI_API_KEY to .env (LiteLLM support allows other LLM providers). Run python3 run_pageindex.py --pdf_path /path/to/document.pdf to build the tree. Optional flags customize the build, for example --model gpt-4o-2024-11-20, --max-pages-per-node 10, and --max-tokens-per-node 20000. Markdown mode takes --md_path instead and derives the hierarchy from heading levels (#, ##). To integrate with RAG, load the tree JSON and have an LLM answer queries by traversing the tree. Examples include agentic_vectorless_rag_demo.py (end-to-end agentic RAG with the OpenAI Agents SDK), pageindex_RAG_simple.ipynb (minimal RAG), and vision_RAG_pageindex.ipynb (image-based, no OCR). PageIndex can be self-hosted or used through the cloud API or MCP.
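To make the Markdown mode concrete, here is a minimal sketch of how heading levels can be turned into a nested tree. It is not PageIndex's implementation: the function name headings_to_tree and the dictionary keys (title, level, line, nodes) are hypothetical, chosen only to illustrate the general idea that a heading becomes a child of the nearest preceding heading with a smaller level.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker; '#' lines inside fences are not headings


def headings_to_tree(markdown: str) -> List[Dict]:
    """Nest Markdown headings by level using a stack of open ancestors."""
    roots: List[Dict] = []
    stack: List[Dict] = []  # open ancestors, shallowest first
    in_fence = False
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        node = {"title": match.group(2), "level": len(match.group(1)),
                "line": line_no, "nodes": []}
        # Close every open heading that is at the same level or deeper.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots


sample = "\n".join([
    "# Report",
    "Intro text.",
    "## Revenue",
    "### By Region",
    "## Risks",
    "# Appendix",
])


def outline(nodes: List[Dict], depth: int = 0) -> None:
    """Print the tree as an indented outline."""
    for n in nodes:
        print("  " * depth + n["title"])
        outline(n["nodes"], depth + 1)


outline(headings_to_tree(sample))
```

The stack keeps only the chain of currently open ancestors, so each heading is attached in a single pass over the file. A real indexer would also attach summaries and section text to each node; this sketch records only the title, level, and source line.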
98.7% FinanceBench Win Proves Edge on Complex Docs
PageIndex powers Mafin 2.5, which achieves a state-of-the-art 98.7% accuracy on FinanceBench, a benchmark for complex financial QA, outpacing vector RAG by navigating SEC filings precisely. It handles PDFs that exceed LLM context limits, such as reports, textbooks, and manuals. Deployment options: the local repo, the chat platform (chat.pageindex.ai), the API and developer tools, or enterprise on-prem. The cookbooks and tutorials cover document search and tree search; the tree approach excels where vectors falter on multi-step reasoning.