PageIndex: Tree-Based RAG Without Vectors or Chunking

Replace Similarity with Reasoning for Relevant Retrieval

Traditional vector RAG struggles on long professional documents such as financial reports because semantic similarity is not the same as relevance: finding the right passage often takes domain-specific reasoning that embedding matches cannot supply. PageIndex addresses this by building a hierarchical tree index that mimics a table of contents. Each node holds a title, an ID, a page range (start_index/end_index), a summary, and child nodes. An LLM then performs an agentic tree search, navigating the tree to retrieve the exact sections it needs, much as a human expert would. Results are traceable to page references, unlike opaque vector matches. The core process has two steps: (1) generate the tree from a PDF or Markdown file; (2) answer queries by reasoning over the tree. Trade-off: the approach incurs LLM API costs, but it avoids vector DB setup and chunking artifacts.
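The node layout and tree search described above can be sketched in a few lines of Python. This is a minimal illustration, not PageIndex's own code: only start_index and end_index are field names taken from the text, while node_id, nodes, the Selector type, and keyword_selector are assumptions made for the example. The keyword matcher is a toy stand-in for the LLM reasoning step, so the sketch runs without an API key.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    title: str
    node_id: str          # assumed field name for the node ID
    start_index: int      # first page covered by this section
    end_index: int        # last page covered by this section
    summary: str = ""
    nodes: List["Node"] = field(default_factory=list)  # child sections (assumed name)


# Hypothetical hook: given a query and candidate children, return the child to
# descend into, or None to stop. A real system would call an LLM here.
Selector = Callable[[str, List[Node]], Optional[Node]]


def keyword_selector(query: str, candidates: List[Node]) -> Optional[Node]:
    """Toy stand-in for LLM reasoning: pick the child sharing the most words."""
    words = set(query.lower().split())

    def overlap(n: Node) -> int:
        return len(words & set(f"{n.title} {n.summary}".lower().split()))

    best = max(candidates, key=overlap)
    return best if overlap(best) > 0 else None


def tree_search(root: Node, query: str, select: Selector) -> List[Node]:
    """Walk from the root toward the most relevant section, recording the path."""
    path = [root]
    current = root
    while current.nodes:
        chosen = select(query, current.nodes)
        if chosen is None:
            break
        path.append(chosen)
        current = chosen
    return path


report = Node("Annual Report", "0000", 1, 120, "Full filing", [
    Node("Business Overview", "0001", 1, 30, "Products and markets"),
    Node("Financial Statements", "0002", 31, 120,
         "Audited balance sheet and cash flow statements", [
             Node("Balance Sheet", "0003", 31, 40, "Assets and total debt"),
             Node("Cash Flow", "0004", 41, 55, "Operating cash flow"),
         ]),
])

path = tree_search(report, "total debt on the balance sheet", keyword_selector)
hit = path[-1]
print(" > ".join(n.title for n in path), f"(pages {hit.start_index}-{hit.end_index})")
```

Because the search returns the whole path, an answer can cite both the section trail and the page range, which is where the traceability comes from. Swapping keyword_selector for a function that prompts an LLM with the child titles and summaries would turn this into the reasoning-based navigation the text describes.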

Generate and Query Trees in Minutes

Install dependencies with pip3 install --upgrade -r requirements.txt and add OPENAI_API_KEY to .env (other LLM providers are supported through LiteLLM). Run python3 run_pageindex.py --pdf_path /path/to/document.pdf to build the tree; customize it with --model gpt-4o-2024-11-20, --max-pages-per-node 10, and --max-tokens-per-node 20000. Markdown mode takes --md_path and uses heading levels (#, ##) to derive the hierarchy. To integrate into RAG, load the tree JSON and have an LLM answer queries by traversing the tree. Examples include agentic_vectorless_rag_demo.py (end-to-end agentic RAG with the OpenAI Agents SDK), pageindex_RAG_simple.ipynb (minimal RAG), and vision_RAG_pageindex.ipynb (image-based, no OCR). You can self-host or use the cloud API/MCP.
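To make the Markdown mode concrete, here is a rough sketch of how heading levels can be turned into a nested hierarchy using a stack of open sections. It illustrates the general technique only and is not PageIndex's parser; the function name markdown_to_tree and the dictionary keys are assumptions for the example, and it deliberately ignores edge cases such as # characters inside fenced code.

```python
import re
from typing import Dict, List


def markdown_to_tree(text: str) -> List[Dict]:
    """Nest Markdown headings by level using a stack of open sections."""
    roots: List[Dict] = []
    stack: List[Dict] = []  # currently open sections, shallowest first
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if not match:
            continue  # body text is skipped in this sketch
        node = {
            "title": match.group(2),
            "level": len(match.group(1)),
            "line": line_no,   # where the section starts in the file
            "nodes": [],       # child sections (assumed key name)
        }
        # Close any sections at the same or deeper level before attaching.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots


doc = "# Report\n\nIntro text.\n\n## Revenue\n\n### By Segment\n\n## Risks\n"
tree = markdown_to_tree(doc)
print([child["title"] for child in tree[0]["nodes"]])
```

A heading attaches to the nearest open section with a shallower level, so "By Segment" lands under "Revenue" while "Risks" returns to the top-level "Report".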

98.7% FinanceBench Win Proves Edge on Complex Docs

PageIndex powers Mafin 2.5, which achieves a state-of-the-art 98.7% accuracy on FinanceBench, a benchmark for complex financial QA. It outpaces vector RAG by enabling precise navigation of SEC filings, and it handles documents that exceed LLM context limits, such as reports, textbooks, and manuals. Deployment options: the local repo, the chat platform (chat.pageindex.ai), the API and developer tools, or enterprise on-prem. The cookbooks and tutorials cover document search and tree search; the tree approach excels where vectors falter, on multi-step reasoning.
