Vector RAG Fails on Structure and Relevance
Vector RAG assumes semantic similarity equals relevance, but this crumbles in real documents: a query like "company's total debt in 2023" retrieves CEO letters or glossaries instead of the balance sheet numbers on page 64. Chunking obliterates hierarchy, severing cross-references like "see Table 3.2" or "Appendix G." Queries express intent in different vocabulary from their answers, making cosine similarity unreliable. The result: roughly 50% accuracy on FinanceBench for financial documents, where semantically similar executive summaries overshadow the footnotes that actually hold the answer.
PageIndex flips this by treating retrieval as reasoning: an LLM navigates a document's natural tree structure like a human skimming a table of contents, preserving context and following logic over blind similarity.
Build Hierarchical Tree Without Embeddings
Parse PDFs page-by-page with PyMuPDF, group pages into fixed-size sections (e.g., 3 pages each) that respect section boundaries, then use Gemini to generate a JSON node per section: a title (5-8 words), a 2-3 sentence summary, and an array of key topics. The output is a nested tree like:
Annual Report 2023
├── Financial Statements
│   ├── Balance Sheet
│   └── Notes to Financial Statements
│       └── Note 12: Long-term Debt
Store the tree as plain JSON: no vectors, no database. The only cost is LLM calls during indexing, and the resulting index is reusable across all queries.
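The indexing step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `summarize_section` stands in for the Gemini call, and the flat node list keyed by IDs like "S001" is an assumed layout.

```python
import json

def build_index(pages, pages_per_section=3, summarize_section=None):
    """Group page texts into fixed-size sections and attach LLM-generated
    metadata to each. `pages` is a list of strings, e.g. collected via
    PyMuPDF with [page.get_text() for page in fitz.open(path)].
    `summarize_section` is any callable mapping section text to a dict
    with 'title', 'summary', and 'topics' keys (an LLM call in practice)."""
    nodes = []
    for i in range(0, len(pages), pages_per_section):
        section = "\n".join(pages[i:i + pages_per_section])
        meta = summarize_section(section)
        nodes.append({
            "id": f"S{i // pages_per_section + 1:03d}",  # "S001", "S002", ...
            "pages": list(range(i + 1, min(i + pages_per_section, len(pages)) + 1)),
            "title": meta["title"],      # 5-8 word title
            "summary": meta["summary"],  # 2-3 sentence summary
            "topics": meta["topics"],    # key topics array
            "text": section,             # raw text kept for the answer step
        })
    return nodes

# The whole index serializes to plain JSON, no vectors and no database:
# json.dump(nodes, open("index.json", "w"))
```

Keeping the raw section text on each node means the query stage never has to reopen the PDF.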
Query with Step-by-Step Reasoning
Feed the query plus the tree text to the LLM: it reasons "debt query → Financial Statements → Notes," then outputs JSON containing a reasoning trace, the selected node IDs (e.g., "S001", "S004"), and a confidence level (high/medium/low). Fetch the raw text of the selected sections (up to 3,000 characters each) and generate the answer with citations. This is where explainability shines: you can inspect the exact navigation logic that vector search hides, for example a precise debt figure pulled from a page 87 footnote rather than from a summary.
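A sketch of this navigation step, assuming the flat node layout from indexing (id/title/summary/text fields) and a callable `llm` that returns the JSON shape described above; both are illustrative assumptions, not the project's real API:

```python
import json

def navigate(query, nodes, llm, max_chars=3000):
    """Show the LLM the query plus a textual outline of the tree, parse its
    JSON plan (reasoning trace, selected node IDs, confidence), then fetch
    the raw text of the chosen sections for answer generation."""
    outline = "\n".join(f"{n['id']}: {n['title']} - {n['summary']}" for n in nodes)
    prompt = (
        f"Query: {query}\n\nDocument outline:\n{outline}\n\n"
        'Respond as JSON: {"reasoning": "...", "node_ids": ["..."], '
        '"confidence": "high|medium|low"}'
    )
    plan = json.loads(llm(prompt))
    by_id = {n["id"]: n for n in nodes}
    context = "\n\n".join(
        by_id[nid]["text"][:max_chars] for nid in plan["node_ids"] if nid in by_id
    )
    # A second LLM call would turn `context` into a cited answer.
    return plan, context
```

Because the plan is explicit JSON, the reasoning trace and selected nodes can be logged and audited per query.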
Architecture: sequential LLM steps (index → reason → expand → retrieve → answer) prioritize accuracy over speed.
Trade-offs: Use for Precision, Not Scale
PageIndex excels on single long, structured documents (10-Ks, contracts, manuals) where precision and citations matter, as in finance, legal, and healthcare; it reports 98.7% accuracy on FinanceBench. Avoid it for multi-document search (use vectors), high-throughput workloads (sequential LLM calls add latency and cost), or flat text (no hierarchy to exploit).
A hybrid works too: vector search selects candidate documents, then PageIndex extracts answers within each. The project is open source on GitHub; the cloud version at pageindex.ai integrates with agents like Claude.
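The hybrid split can be sketched as below; `embed` and `answer_with_tree` are hypothetical placeholders for a real embedding model and the tree-reasoning extraction step:

```python
import math

def hybrid_answer(query, corpus, embed, answer_with_tree, top_k=1):
    """Stage 1: cosine similarity over whole-document embeddings picks
    candidate documents. Stage 2: tree-based reasoning extracts the answer
    from each candidate. `embed` maps text to a vector; `answer_with_tree`
    maps (query, doc) to an answer."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc["text"])), reverse=True)
    return [answer_with_tree(query, doc) for doc in ranked[:top_k]]
```

The design point is the division of labor: embeddings are cheap and good enough to pick which document to read, while the expensive sequential reasoning runs only inside the chosen few.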