Vector RAG Fails on Structure and Relevance
Vector RAG assumes semantic similarity equals relevance, but this crumbles in real documents: a query like "company's total debt in 2023" retrieves CEO letters or glossaries instead of the balance sheet numbers on page 64. Chunking obliterates hierarchy, severing cross-references like "see Table 3.2" or "Appendix G." Queries express intent in different vocabulary from their answers, making cosine similarity unreliable. The result: roughly 50% accuracy on FinanceBench for financial documents, where semantically similar executive summaries overshadow the footnotes that actually hold the answer.
PageIndex flips this by treating retrieval as reasoning: an LLM navigates a document's natural tree structure like a human skimming a table of contents, preserving context and following logic over blind similarity.
Build Hierarchical Tree Without Embeddings
Parse PDFs page-by-page with PyMuPDF, group pages into fixed-size sections (e.g., 3 pages each) that respect section boundaries, then use Gemini to generate a JSON node per section: a title (5-8 words), a 2-3 sentence summary, and an array of key topics. The output is a nested tree like:
Annual Report 2023
├── Financial Statements
│   ├── Balance Sheet
│   └── Notes to Financial Statements
│       └── Note 12: Long-term Debt
Store the tree as plain JSON: no vectors, no database. The only cost is LLM calls during indexing, and the resulting index is reusable across all queries.
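The indexing step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `summarize_section` stands in for the Gemini call, and the flat node list keyed by IDs like "S001" is an assumed layout.

```python
import json

def build_index(pages, pages_per_section=3, summarize_section=None):
    """Group page texts into fixed-size sections and attach LLM-generated
    metadata to each. `pages` is a list of strings, e.g. collected via
    PyMuPDF with [page.get_text() for page in fitz.open(path)].
    `summarize_section` is any callable mapping section text to a dict
    with 'title', 'summary', and 'topics' keys (an LLM call in practice)."""
    nodes = []
    for i in range(0, len(pages), pages_per_section):
        section = "\n".join(pages[i:i + pages_per_section])
        meta = summarize_section(section)
        nodes.append({
            "id": f"S{i // pages_per_section + 1:03d}",  # "S001", "S002", ...
            "pages": list(range(i + 1, min(i + pages_per_section, len(pages)) + 1)),
            "title": meta["title"],      # 5-8 word title
            "summary": meta["summary"],  # 2-3 sentence summary
            "topics": meta["topics"],    # key topics array
            "text": section,             # raw text kept for the answer step
        })
    return nodes

# The whole index serializes to plain JSON, no vectors and no database:
# json.dump(nodes, open("index.json", "w"))
```

Keeping the raw section text on each node means the query stage never has to reopen the PDF.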
Query with Step-by-Step Reasoning
Feed the query plus the tree text to the LLM: it reasons "debt query → Financial Statements → Notes," then outputs JSON containing a reasoning trace, the selected node IDs (e.g., "S001", "S004"), and a confidence level (high/medium/low). Fetch the raw text of the selected sections (up to 3,000 characters each) and generate the answer with citations. This is where explainability shines: you can inspect the exact navigation logic that vector search hides, for example a precise debt figure pulled from a page 87 footnote rather than from a summary.
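A sketch of this navigation step, assuming the flat node layout from indexing (id/title/summary/text fields) and a callable `llm` that returns the JSON shape described above; both are illustrative assumptions, not the project's real API:

```python
import json

def navigate(query, nodes, llm, max_chars=3000):
    """Show the LLM the query plus a textual outline of the tree, parse its
    JSON plan (reasoning trace, selected node IDs, confidence), then fetch
    the raw text of the chosen sections for answer generation."""
    outline = "\n".join(f"{n['id']}: {n['title']} - {n['summary']}" for n in nodes)
    prompt = (
        f"Query: {query}\n\nDocument outline:\n{outline}\n\n"
        'Respond as JSON: {"reasoning": "...", "node_ids": ["..."], '
        '"confidence": "high|medium|low"}'
    )
    plan = json.loads(llm(prompt))
    by_id = {n["id"]: n for n in nodes}
    context = "\n\n".join(
        by_id[nid]["text"][:max_chars] for nid in plan["node_ids"] if nid in by_id
    )
    # A second LLM call would turn `context` into a cited answer.
    return plan, context
```

Because the plan is explicit JSON, the reasoning trace and selected nodes can be logged and audited per query.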
Architecture: sequential LLM steps (index → reason → expand → retrieve → answer) prioritize accuracy over speed.
Trade-offs: Use for Precision, Not Scale
PageIndex excels on single long, structured documents (10-Ks, contracts, manuals) where precision and citations matter, as in finance, legal, and healthcare; it reports 98.7% accuracy on FinanceBench. Avoid it for multi-document search (use vectors), high-throughput workloads (sequential LLM calls add latency and cost), or flat text (no hierarchy to exploit).
A hybrid works too: vector search selects candidate documents, then PageIndex extracts answers within each. The project is open source on GitHub; the cloud version at pageindex.ai integrates with agents like Claude.
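The hybrid split can be sketched as below; `embed` and `answer_with_tree` are hypothetical placeholders for a real embedding model and the tree-reasoning extraction step:

```python
import math

def hybrid_answer(query, corpus, embed, answer_with_tree, top_k=1):
    """Stage 1: cosine similarity over whole-document embeddings picks
    candidate documents. Stage 2: tree-based reasoning extracts the answer
    from each candidate. `embed` maps text to a vector; `answer_with_tree`
    maps (query, doc) to an answer."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc["text"])), reverse=True)
    return [answer_with_tree(query, doc) for doc in ranked[:top_k]]
```

The design point is the division of labor: embeddings are cheap and good enough to pick which document to read, while the expensive sequential reasoning runs only inside the chosen few.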