PageIndex: Tree-Based RAG Without Vectors or Chunking
PageIndex builds LLM-reasoned hierarchical tree indexes from long documents and retrieves by tree search, targeting relevance rather than vector similarity. Mafin 2.5, which is built on it, reports 98.7% accuracy on FinanceBench, with no vector database or chunking required.
Replace Similarity with Reasoning for Relevant Retrieval
Traditional vector RAG struggles on long professional documents such as financial reports because semantic similarity is not the same as relevance: finding the right passage often takes domain-specific reasoning that embedding distance does not capture. PageIndex addresses this by building a hierarchical tree index that mimics a table of contents. Each node holds a title, an ID, a page range (start_index/end_index), a summary, and child nodes. An LLM then performs an agentic tree search, navigating the tree to retrieve the exact sections it needs, much as a human reader would work from a table of contents. Results are traceable to page references, unlike opaque vector matches. The core process has two steps: (1) generate the tree from a PDF or Markdown file; (2) answer queries by reasoning over the tree. The trade-off: the approach depends on LLM API calls and their cost, but avoids vector database setup and chunking artifacts.
Generate and Query Trees in Minutes
Install dependencies with pip3 install --upgrade -r requirements.txt and add OPENAI_API_KEY to a .env file (other LLM providers are supported through LiteLLM). Run python3 run_pageindex.py --pdf_path /path/to/document.pdf to build the tree; options such as --model gpt-4o-2024-11-20, --max-pages-per-node 10, and --max-tokens-per-node 20000 customize the run. Markdown mode takes --md_path instead and derives the hierarchy from heading levels (#, ##). To integrate PageIndex into a RAG pipeline, load the tree JSON and have an LLM traverse it to answer queries. The examples include agentic_vectorless_rag_demo.py (end-to-end agentic RAG with the OpenAI Agents SDK), pageindex_RAG_simple.ipynb (minimal RAG), and vision_RAG_pageindex.ipynb (image-based, no OCR). PageIndex can be self-hosted or used through the cloud API or MCP.
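To make the Markdown mode concrete, here is a rough sketch of how heading levels can be turned into a nested hierarchy. It is an illustration of the general technique only and does not reproduce what run_pageindex.py does internally; the function name headings_to_tree and the output keys (title, line, nodes) are assumptions.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def headings_to_tree(markdown: str) -> List[Dict]:
    """Nest headings by level: a '##' heading becomes a child of the nearest '#' above it."""
    roots: List[Dict] = []
    stack: List[Tuple[int, Dict]] = []  # (level, node) for the currently open sections
    in_fence = False
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip '#' comments inside fenced code
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node = {"title": match.group(2), "line": line_no, "nodes": []}
        # Close every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent_children = stack[-1][1]["nodes"] if stack else roots
        parent_children.append(node)
        stack.append((level, node))
    return roots


doc = """# Annual Report
## Risk Factors
### Market Risk
## Financial Statements
# Appendix
"""
tree = headings_to_tree(doc)
print([n["title"] for n in tree])
```

The stack holds the chain of currently open sections, so each heading attaches to the closest preceding heading of a shallower level. This also copes with documents that skip levels, for example a ### placed directly under a #.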
98.7% FinanceBench Win Proves Edge on Complex Docs
PageIndex powers Mafin 2.5, which achieves a state-of-the-art 98.7% accuracy on FinanceBench, a benchmark for complex financial QA. It outperforms vector RAG by navigating SEC filings precisely. PageIndex handles PDFs that exceed LLM context limits, such as reports, textbooks, and manuals. Deployment options are the local repo, the chat platform (chat.pageindex.ai), the API and developer tools, or enterprise on-prem. The cookbooks and tutorials cover document search and tree search; the tree approach does best where vectors falter on multi-step reasoning.