RAG-Anything + LightRAG Handles Images/Charts in PDFs
RAG-Anything extends LightRAG to process scanned PDFs, charts, and images: MinerU parses documents locally, the output is split into text and image buckets, GPT-4o-mini extracts entities, relationships, and embeddings from each, and everything is merged into a unified vector DB plus knowledge graph for querying.
Local Parsing Extracts Components from Non-Text Docs
RAG-Anything addresses the limitation of text-only RAG systems like LightRAG by handling scanned PDFs, images, charts, and graphs. It uses MinerU, an open-source local tool, to parse documents into components: headers, text blocks, charts, images, and LaTeX equations. MinerU identifies these without understanding their content; it simply draws bounding boxes around elements.
Specialized local models then process components:
- PaddleOCR extracts readable text from scanned blocks (e.g., "Company X reported strong Q3'23 results with revenue growth").
- Charts and equations are converted to text where possible (e.g., LaTeX for equations).
- Pure images (e.g., bar graphs) become screenshots.
This splits the output into two buckets, text and images, avoiding full-document OCR. Local processing on CPU (or GPU with a PyTorch tweak) keeps parsing free and fast, reducing LLM costs compared to screenshot-everything approaches.
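The bucketing step can be sketched as a simple router over parsed blocks. This is a minimal illustration, assuming a MinerU-style content list where each block carries a `type` field; the field names and block shapes here are illustrative, not MinerU's exact schema:

```python
# Route parsed blocks into text and image buckets.
# Block shapes are illustrative; MinerU's real content-list schema differs.

def split_buckets(blocks):
    """Split parsed document blocks into text and image buckets."""
    text_bucket, image_bucket = [], []
    for block in blocks:
        if block["type"] in ("text", "equation", "table"):
            # OCR'd text, LaTeX equations, and table markup stay textual.
            text_bucket.append(block["content"])
        elif block["type"] == "image":
            # Pure images (e.g., bar charts) are kept as screenshots.
            image_bucket.append(block["path"])
    return text_bucket, image_bucket

blocks = [
    {"type": "text", "content": "Company X reported strong Q3'23 results."},
    {"type": "equation", "content": r"\frac{revenue}{cost}"},
    {"type": "image", "path": "chart_p3.png"},
]
texts, images = split_buckets(blocks)
```

Only the image bucket needs vision-capable LLM calls later, which is where the cost savings come from.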
Dual-Path LLM Processing Builds Embeddings and Knowledge Graphs
Text and image buckets feed into an LLM such as GPT-4o-mini (or a local model via Ollama) through separate prompts:
- Text path: Prompt extracts entities, relationships (for knowledge graph), and embeddings (for vector DB).
- Image path: LLM analyzes screenshots to extract the same—entities/relationships/embeddings.
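The two paths differ mainly in their prompts. A hedged sketch of what they might look like; these templates are illustrative stand-ins, not RAG-Anything's actual prompts:

```python
# Illustrative prompt templates for the two extraction paths.
# RAG-Anything's real prompts are more elaborate; these show the shape only.

TEXT_PROMPT = (
    "Extract entities and relationships from the following text. "
    'Return JSON: {{"entities": [...], "relationships": [...]}}\n\n{chunk}'
)

IMAGE_PROMPT = (
    "You are given a screenshot of a chart or figure from a document. "
    "Describe it, then extract entities, relationships, and any data points "
    'as JSON: {"entities": [...], "relationships": [...]}'
)

def build_text_prompt(chunk: str) -> str:
    """Fill the text-path template with one parsed chunk."""
    return TEXT_PROMPT.format(chunk=chunk)

prompt = build_text_prompt("Novatech Inc. reported Q3'23 revenue growth.")
```

The image path sends `IMAGE_PROMPT` alongside the screenshot to a vision-capable model; embeddings are computed separately from the extracted text.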
From one document, this produces four artifacts: text embeddings, a text KG, image embeddings, and an image KG. RAG-Anything merges them by overlaying shared entities into a single vector DB and KG. This preserves context across modalities, enabling queries like "monthly revenue trend for Novatech Inc. Jan-Sep 2025" to pull bar chart data (e.g., Jan: $4.6M, Feb: $4.9M, etc.).
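The overlay merge can be sketched with plain dicts standing in for the real vector DB and graph store (the structures and entity names here are illustrative):

```python
def merge_graphs(text_kg, image_kg):
    """Overlay two {entity: set_of_related_facts} maps into one KG.
    An entity appearing in both modalities is unified under one node."""
    merged = {}
    for kg in (text_kg, image_kg):
        for entity, neighbors in kg.items():
            merged.setdefault(entity, set()).update(neighbors)
    return merged

# One entity discovered on both paths of the same document:
text_kg = {"Novatech Inc.": {"reported strong Q3'23 results"}}
image_kg = {"Novatech Inc.": {"Jan revenue $4.6M", "Feb revenue $4.9M"}}

kg = merge_graphs(text_kg, image_kg)
# "Novatech Inc." now links to facts from both the text and chart paths,
# so one query can traverse prose and chart-derived data together.
```

The vector DBs merge the same way: both embedding sets land in one index, so retrieval is modality-agnostic at query time.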
Merging saves money and time: the local scalpel-style parsing minimizes LLM tokens compared with treating entire documents as images.
Integrate with LightRAG and Use via Claude Code Skills
RAG-Anything wraps LightRAG: ingest text docs via the LightRAG UI/API, and non-text docs via a RAG-Anything script. A post-processing step merges RAG-Anything's DB/KG with LightRAG's into one unified system. Querying is unchanged: use the LightRAG UI, the API, or natural language in Claude Code (which auto-calls the query API).
Setup (one-shot Claude Code prompt in LightRAG dir):
- Updates storage path for existing Docker.
- Sets models: GPT-4o-mini (or nano), text-embedding-3-large (OpenAI).
- Fixes repo bugs like embedding double-wrap. Downloads MinerU/dependencies (heavier than LightRAG; CPU default, GPU optional).
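The model settings end up in the LightRAG server's environment config; a sketch of the kind of values involved (variable names can differ across LightRAG versions, so treat these as illustrative rather than exact):

```
LLM_BINDING=openai
LLM_MODEL=gpt-4o-mini
EMBEDDING_BINDING=openai
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIM=3072
OPENAI_API_KEY=sk-...
```

The one-shot Claude Code prompt writes these for you; they are shown here only so you can sanity-check the result.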
Ingest non-text docs via the Claude Code skill, which runs the script: "use rag-anything skill to upload these docs/folder." It auto-restarts Docker and processes via MinerU → LLM → merge. Text uploads still go through the UI or skill.
Trade-offs: script-only ingestion for non-text docs (no UI); CPU is slow for large batches (GPU fix via Claude Code); minor OpenAI costs for LLM extraction. The result: production RAG for real-world docs, cheaper than cloud alternatives.