Build GraphRAG for Complex Queries Across Articles
GraphRAG builds knowledge graphs from scraped articles to enable reasoning over interconnected data, outperforming standard RAG on global questions like themes and relationships in AI copyright disputes.
Standard RAG Fails on Interconnected Data—GraphRAG Fixes It
Standard RAG chunks documents, embeds them, and retrieves similar vectors for LLM context. It works for simple fact lookups in small datasets but degrades with scale: accuracy drops 12% at 100k pages due to embedding overlap. Worse, chunks are isolated—no cross-document reasoning for questions like "main themes across articles" or "company connections in disputes."
GraphRAG layers a knowledge graph on top. LLMs extract entities (e.g., organizations like OpenAI, events like lawsuits) and relationships (e.g., "defendant in", "trained on"). These form nodes and edges in a property graph, capturing structure. Microsoft's approach adds community detection (hierarchical Leiden algorithm groups related entities) and community summaries (LLM-generated briefs per cluster).
Use GraphRAG for:
- Hundreds/thousands of interconnected docs (law, policy, research).
- Global queries: patterns, trends, summaries.
- Traceable answers.
Stick to standard RAG for:
- Single-doc facts.
- Speed/cost priority on small, non-relational data.
Before/after: Standard RAG might retrieve unrelated chunks on "AI copyright connections"; GraphRAG traces paths like OpenAI → defendant in → NYT lawsuit → filed against → artists.
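The traced path above is just graph traversal over typed edges. A minimal sketch (toy data, stdlib only) of how a property graph makes that cross-document hop possible, where each triple could come from a different article:

```python
from collections import deque

# Toy property graph: hypothetical (relation, target) edges per entity.
# Each edge could have been extracted from a different source article.
edges = {
    "OpenAI": [("defendant_in", "NYT lawsuit")],
    "NYT lawsuit": [("filed_against", "artists")],
    "artists": [],
}

def find_path(graph, start, goal):
    """BFS over typed edges; returns the entity/relation path or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, path + [rel, nbr]))
    return None

print(find_path(edges, "OpenAI", "artists"))
# → ['OpenAI', 'defendant_in', 'NYT lawsuit', 'filed_against', 'artists']
```

Standard RAG has no equivalent of this hop: two chunks that never co-occur in an embedding neighborhood can still be two edges apart in the graph.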
"Standard RAG has two more fundamental blind spots. The number one is each chunk is treated as an isolated fragment... no ability to reason across documents."
Scrape Real-World Data Without Browser Hassles
Start with live data: Use SerpApi's Google Search API for structured JSON results—no Selenium needed. Free tier covers testing.
Collection steps:
- Define queries (e.g., "AI intellectual property", "copyright generative AI").
- Call GoogleSearchResults with params: engine="google", gl="us", hl="en", num=10.
- Dedupe URLs across queries → Pandas DataFrame.
Enrich with full text:
- Articles: Trafilatura extracts clean body text (strips nav/ads/footers).
- YouTube: Regex video ID from URL → youtube_transcript_api for captions.
- Filter successes (paywalls/captionless fail) → Save as CSV.
Code snippet for search:
from serpapi import GoogleSearch  # pip package: google-search-results

results = GoogleSearch({"q": query, "api_key": SERPAPI_KEY})
raw = results.get_dict()  # structured JSON: organic_results, etc.
For 20 articles on AI copyright, this yields ~10-20 usable full-text docs. Scales to any topic—swap queries.
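For the YouTube leg of the enrichment step, the fiddly part is pulling the 11-character video ID out of the various URL shapes before handing it to youtube_transcript_api (articles go through trafilatura's fetch/extract instead). A small sketch of that helper; the function name and regex are illustrative, not from the repo:

```python
import re

# Hypothetical helper: extract the 11-character video ID from common
# YouTube URL shapes (watch?v=, youtu.be/, /embed/).
VIDEO_ID_RE = re.compile(r"(?:v=|youtu\.be/|/embed/)([A-Za-z0-9_-]{11})")

def extract_video_id(url: str):
    """Return the video ID, or None for non-YouTube URLs (regular articles)."""
    m = VIDEO_ID_RE.search(url)
    return m.group(1) if m else None

print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
print(extract_video_id("https://youtu.be/dQw4w9WgXcQ"))                 # dQw4w9WgXcQ
print(extract_video_id("https://example.com/article"))                  # None
```

A None result routes the URL to the article extractor; a match goes to the caption fetcher, and failures on either path are filtered out before the CSV save.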
"SerpApi is what we're using to scrape Google News results... real-time, structured, clean search results... no browser automation needed."
Ontology-Driven Extraction Ensures Reliable Graphs
Define ontology first: List entity types (e.g., ORGANIZATION, PERSON, LAWSUIT) and relations (e.g., FILED_AGAINST, REGULATES, TRAINED_ON). Domain-specific—tailor to AI copyright (e.g., defendant_in).
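In code, the ontology can be as simple as two constant lists injected into the extraction prompt. A sketch with illustrative labels (tailor these to your corpus):

```python
# Explicit ontology for the AI-copyright domain. Labels are illustrative;
# the point is that the LLM may only pick from these closed lists.
ENTITY_TYPES = [
    "ORGANIZATION", "PERSON", "LAWSUIT",
    "LAW_OR_REGULATION", "AI_MODEL", "CREATIVE_WORK",
]
RELATION_TYPES = [
    "FILED_AGAINST", "DEFENDANT_IN", "REGULATES",
    "TRAINED_ON", "OWNS", "LICENSES",
]

# Rendered into the extraction prompt so outputs stay on-ontology.
ONTOLOGY_BLOCK = (
    "Allowed entity types: " + ", ".join(ENTITY_TYPES) + "\n"
    "Allowed relation types: " + ", ".join(RELATION_TYPES)
)
print(ONTOLOGY_BLOCK)
```

Closed lists like this are what prevent the extraction drift described below: the model can't invent a new relation label per article.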
Extraction prompt (via LlamaIndex):
- Input: Article text.
- Output: Up to 20 entity-relation triplets per article.
- Per entity: name, type, description.
- Per relation: source, target, type, description.
Use Pydantic models for structured output:
from pydantic import BaseModel
class ExtractedEntity(BaseModel):
name: str
type: str
description: str
class ExtractedRelationship(BaseModel):
source: str
target: str
relation: str
description: str
Pass as OpenAI function-calling schema—auto-validates/rejects bad outputs. No regex parsing.
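The validate/reject behavior is worth seeing concretely. A small sketch (Pydantic v2 API; the sample payloads are hypothetical LLM outputs): a well-formed dict parses, a malformed one raises instead of silently corrupting the graph.

```python
from pydantic import BaseModel, ValidationError

class ExtractedRelationship(BaseModel):
    source: str
    target: str
    relation: str
    description: str

# Well-formed LLM output parses cleanly into a typed object...
good = ExtractedRelationship.model_validate({
    "source": "The New York Times",
    "target": "OpenAI",
    "relation": "FILED_AGAINST",
    "description": "Copyright lawsuit over training data.",
})

# ...while a malformed one is rejected outright: no regex cleanup, no
# half-parsed triplet sneaking into the graph.
try:
    ExtractedRelationship.model_validate({"source": "OpenAI"})
    rejected = False
except ValidationError:
    rejected = True
print("malformed output rejected:", rejected)
```

This is the whole argument for function-calling schemas over free-text parsing: bad outputs fail loudly at the boundary.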
GraphRAGExtractor class:
- LLM extracts → Pydantic → LlamaIndex EntityNode/Relation objects.
- Process 50 articles in parallel (GPT-4o-mini for cost).
Common mistake: Skipping ontology → hallucinated/inconsistent entities. Fix: Explicit lists in prompt. Quality check: Descriptions enrich context for later summaries.
Graph Construction, Communities, and Local-Global Querying
GraphRAGStore:
- Insert extracted nodes/edges.
- Run Leiden community detection → Clusters (e.g., "OpenAI lawsuits").
- LLM summarizes each: Collect entities/relations → GPT-4o-mini brief.
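The clustering step can be sketched without the real algorithm. Below, union-find connected components stand in for hierarchical Leiden (which in practice needs a library such as graspologic); the triples are hypothetical, and the printed relation lists per cluster are exactly what would be handed to GPT-4o-mini for summarization:

```python
from collections import defaultdict

# Hypothetical (source, relation, target) triples from extraction.
triples = [
    ("OpenAI", "DEFENDANT_IN", "NYT lawsuit"),
    ("NYT lawsuit", "FILED_AGAINST", "artists"),
    ("EU AI Act", "REGULATES", "foundation models"),
]

# Union-find: group entities that share any relationship. This is a
# stdlib stand-in for Leiden, which also weighs edge density.
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x
def union(a, b):
    parent[find(a)] = find(b)

for s, _, t in triples:
    union(s, t)

communities = defaultdict(list)
for s, r, t in triples:
    communities[find(s)].append(f"{s} -[{r}]-> {t}")

# Each cluster's relation list becomes the input to an LLM summary call.
for cid, rels in communities.items():
    print(cid, rels)
```

Here the lawsuit entities land in one cluster and the regulation entities in another; each cluster's relation list, plus entity descriptions, is what the summarizer condenses into a community brief.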
Query engine (two-step):
- GPT-4o-mini filters relevant communities (skip irrelevant to save tokens).
- GPT-4o synthesizes: Per-community answers → Global response.
Code flow:
# GraphRAGStore is the custom store described above (built on LlamaIndex)
graph_store = GraphRAGStore()
# ...insert extracted EntityNode/Relation objects...
# ...then detect_communities(), summarize_communities()
query_engine = GraphRAGQueryEngine(...)
response = query_engine.query("Central companies in AI copyright disputes?")
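Inside that `query()` call, the two-step flow looks roughly like the sketch below. The two model functions are hypothetical stubs standing in for GPT-4o-mini (filter) and GPT-4o (synthesis) API calls; the stub filter just keyword-matches the summary text.

```python
# Stand-in for GPT-4o-mini: judge whether a community summary is relevant.
def cheap_llm_is_relevant(summary: str, question: str) -> bool:
    return "copyright" in summary.lower()  # stub; real code asks the model

# Stand-in for GPT-4o: merge per-community answers into one global response.
def strong_llm_synthesize(partials):
    return "Synthesis: " + " | ".join(partials)

community_summaries = {
    1: "Copyright lawsuits linking OpenAI, the NYT, and artists.",
    2: "EU regulation of foundation models.",
}

def query(question: str) -> str:
    # Step 1: cheap model filters communities, skipping irrelevant ones
    # so their tokens never reach the expensive model.
    relevant = [s for s in community_summaries.values()
                if cheap_llm_is_relevant(s, question)]
    # Step 2: strong model synthesizes the surviving summaries.
    return strong_llm_synthesize(relevant)

answer = query("Central companies in AI copyright disputes?")
print(answer)
```

The regulation cluster never reaches the synthesis step, which is where the token savings come from on large graphs.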
Visualization: D3.js for interactive graph (nodes=entities, edges=relations, clusters colored).
Production tips:
- GPT-4o-mini for extraction/summaries (volume), GPT-4o for queries (reasoning).
- No chunking if articles short.
- Parallel workers speed indexing.
Example query: "Connections in AI copyright?" → Traces OpenAI, NYT, artists via graph traversal.
"Using a knowledge graph has been shown to improve LLM response accuracy... sensemaking: understand connections, patterns and themes."
Key Takeaways
- Switch to GraphRAG for cross-document reasoning; standard RAG for isolated facts.
- Always define domain ontology first—prevents extraction drift.
- SerpApi + Trafilatura = reliable scraping pipeline; dedupe and filter aggressively.
- Pydantic + function calling = bulletproof structured extraction.
- Community summaries enable efficient local-global querying—filter first, synthesize second.
- Use cheaper models for indexing, premium for queries to optimize costs.
- Visualize with D3.js to debug/trace graph quality.
- Test on real data like AI copyright: Start with GitHub repo, adapt ontology.
"The ontology... tells the LLM exactly what types of entities and relationships it's allowed to extract."
"At query time, these summaries are queried instead of the raw graph, which makes it particularly effective and fast for big picture questions."