Build GraphRAG: Scrape, Graph, Query AI News

Implement GraphRAG with LlamaIndex to overcome RAG limits: scrape live Google News on AI copyright via SerpApi, extract entities/relationships, build knowledge graph with communities, and query for global insights like company connections.

Standard RAG Fails on Scale and Connections—GraphRAG Fixes It

Standard RAG chunks documents, embeds them, and retrieves similar vectors for LLM context. It excels at simple fact retrieval from small datasets but degrades with volume: one benchmark reports roughly a 12% accuracy drop at 100k pages as embeddings begin to overlap. Worse, chunks are isolated, with no links between related information across documents, so it can't reason globally: tracing themes or relationships that span sources.

GraphRAG layers a knowledge graph on top. An LLM extracts entities (e.g., companies, lawsuits) and relationships (e.g., "defendant in") from text, forming nodes and edges. This captures structure: Microsoft's approach adds community detection (hierarchical Leiden algorithm via graspologic) to cluster related entities, then LLM-summarizes clusters for query-time efficiency. Result: "sensemaking" for patterns, transparency in reasoning traces, and accuracy on interconnected data like law/policy.

Use GraphRAG for hundreds/thousands of docs needing cross-links, big-picture queries (themes/trends), or explainability. Stick to vector RAG for single-doc facts, speed, low cost. Prerequisite: Python basics, OpenAI API, familiarity with embeddings/prompts. Fits after basic RAG in production pipelines for complex domains.

"Standard RAG has two more fundamental blind spots... no ability to reason across documents."

Collect Real-World Data Without Browser Hassles

Start with live scraping: Use SerpApi's Google Search API for structured JSON results—no Selenium headaches. Install google-search-results, trafilatura (for article text extraction), and youtube-transcript-api.

Key steps:

  1. Load env vars for API keys; set max_results=10 per query.
  2. collect_search_results(queries=['AI copyright lawsuits', 'generative AI intellectual property']): Loops queries, calls SerpApi (engine='google', gl='us', hl='en'), dedupes URLs, returns DataFrame + raw JSON.
  3. enrich_search_results(df): For each URL, trafilatura strips junk from articles; regex-extract YouTube IDs, fetch transcripts. Filter successes, add full_text column, save CSV.

Example output: 20 articles/videos on AI copyright with full text. Scales to multi-page via repeated calls. Common mistake: Hardcoding keys—use .env. Handles paywalls/captionless videos by skipping.
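The collection/enrichment pair above can be sketched as follows. This is an illustrative sketch, not the repo's exact code: it assumes a SERPAPI_API_KEY environment variable (loaded from .env, never hardcoded) and covers articles only; the YouTube transcript branch via youtube-transcript-api follows the same pattern.

```python
import os

def dedupe_by_url(rows):
    """Keep the first result seen for each URL across all queries."""
    seen, deduped = set(), []
    for row in rows:
        if row["link"] not in seen:
            seen.add(row["link"])
            deduped.append(row)
    return deduped

def collect_search_results(queries, max_results=10):
    # Third-party imports kept local: pip install google-search-results pandas
    import pandas as pd
    from serpapi import GoogleSearch

    rows = []
    for query in queries:
        params = {
            "engine": "google", "q": query, "gl": "us", "hl": "en",
            "num": max_results, "api_key": os.environ["SERPAPI_API_KEY"],
        }
        results = GoogleSearch(params).get_dict()  # structured JSON, no browser
        for r in results.get("organic_results", []):
            rows.append({"query": query, "title": r.get("title"),
                         "link": r.get("link")})
    return pd.DataFrame(dedupe_by_url(rows))

def enrich_search_results(df):
    # trafilatura strips nav/ads and returns the main article text
    import trafilatura  # pip install trafilatura

    texts = []
    for url in df["link"]:
        downloaded = trafilatura.fetch_url(url)
        texts.append(trafilatura.extract(downloaded) if downloaded else None)
    df = df.assign(full_text=texts)
    return df[df["full_text"].notna()]  # drop paywalled / failed extractions
```

Returning a DataFrame with a full_text column and saving it to CSV gives the downstream extraction step a clean, deduplicated corpus.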

Quality check: Inspect the raw API JSON when debugging; full article text far outperforms search snippets for entity extraction.

"SerpApi... gives you real time structured clean search results from Google... no browser automation needed."

Define Ontology, Extract, and Build Graph with LlamaIndex

Core: LlamaIndex + custom extractor. Config: GPT-4o-mini for extraction/summaries (volume work), GPT-4o for queries (reasoning). Process 50 articles, 20 triplets/article, 4 parallel workers.

  1. Ontology: Domain-specific schema lists entity types (ORGANIZATION, PERSON, LAWSUIT, etc.) and relations (FILED_AGAINST, DEFENDANT_IN, REGULATES, TRAINED_ON). Tailor to use case—drives extraction quality.
  2. Extraction Prompt: Template injects ontology. Instruct: ID entities (name/type/desc), relations (source/target/relation/desc) from article. Limits hallucinations via schema.
  3. Pydantic Models: ExtractedEntity (name, type, desc), ExtractedRelationship (source, target, relation, desc), ExtractionResult (lists both). Enables structured outputs: OpenAI function-calling auto-validates/typed—rejects bad formats, no regex parsing.
  4. GraphRAGExtractor Class:
    • Per article: llm.structured_predict(ExtractionResult, prompt + text) → validated entities/rels.
    • Convert to LlamaIndex EntityNode/Relation objects.
    • Collect all → GraphRAGStore (property graph: nodes/edges with props like desc).
  5. Communities: GraphRAGStore.get() → NetworkX graph → Leiden clustering → Per-community LLM summary (GPT-4o-mini: "Summarize entities/rels in this cluster").

Pitfalls: Skip chunking for short articles (extract directly from the full text); entity and relationship descriptions enrich the community summaries. Output: Persistent graph plus community summaries for fast local/global search.
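Steps 2–4 above can be sketched with the Pydantic models and LlamaIndex's structured_predict. The model fields come from the source; the ontology dict shape and prompt wording here are illustrative assumptions, not the repo's exact template.

```python
from pydantic import BaseModel

class ExtractedEntity(BaseModel):
    name: str
    type: str          # must be one of the ontology's entity types
    description: str

class ExtractedRelationship(BaseModel):
    source: str
    target: str
    relation: str      # must be one of the ontology's relation types
    description: str

class ExtractionResult(BaseModel):
    entities: list[ExtractedEntity]
    relationships: list[ExtractedRelationship]

def extract_article(llm, text, ontology):
    """One structured-extraction call per article (no chunking for short docs).

    `ontology` is an assumed dict like
    {"entities": ["ORGANIZATION", ...], "relations": ["FILED_AGAINST", ...]}.
    """
    # llama-index import kept local: pip install llama-index
    from llama_index.core.prompts import PromptTemplate

    prompt = PromptTemplate(
        "You are building a knowledge graph.\n"
        "Allowed entity types: {entity_types}\n"
        "Allowed relation types: {relation_types}\n"
        "Extract all entities and relationships from this article:\n{text}"
    )
    # structured_predict uses OpenAI function calling, so the output is
    # validated against ExtractionResult -- no regex parsing of raw JSON.
    return llm.structured_predict(
        ExtractionResult,
        prompt,
        entity_types=", ".join(ontology["entities"]),
        relation_types=", ".join(ontology["relations"]),
        text=text,
    )
```

Because the schema is enforced, malformed outputs are rejected at the API layer rather than silently corrupting the graph.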

"The ontology... is the schema of our knowledge graph and it tells the LLM exactly what types of entities and relationships it's allowed to extract."

Query Engine: Filter-Relevant, Synthesize Answers

Two-step query:

  1. Per-community: GPT-4o-mini checks each summary for relevance ("Can this answer '{query}'?"), skipping irrelevant communities to save tokens.
  2. GPT-4o synthesizes from relevant summaries + graph traces.

Modes: LOCAL (single community), GLOBAL (dataset-wide themes). Example query: "Which companies are at the center of disputes?" → Traces connections such as OpenAI as defendant in the NYT suit.
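The two-step engine can be sketched with plain LLM handles (anything with a `.complete()` method, as LlamaIndex LLMs have). Prompt wording and helper names here are illustrative assumptions.

```python
def parse_yes_no(text: str) -> bool:
    """Treat any reply starting with 'yes' as a relevance hit."""
    return text.strip().lower().startswith("yes")

def answer_query(query, community_summaries, mini_llm, full_llm):
    """Step 1: cheap relevance filter per community (GPT-4o-mini tier).
    Step 2: synthesis over only the relevant summaries (GPT-4o tier)."""
    relevant = []
    for cid, summary in community_summaries.items():
        check = mini_llm.complete(
            f"Can this community summary help answer '{query}'? "
            f"Answer Yes or No.\n\nSummary:\n{summary}"
        )
        if parse_yes_no(str(check)):
            relevant.append(summary)

    context = "\n\n".join(relevant)
    final = full_llm.complete(
        f"Using only the context below, answer: {query}\n\nContext:\n{context}"
    )
    return str(final)
```

Because irrelevant communities are dropped before synthesis, token spend scales with the number of relevant clusters, not the size of the graph.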

Visualization: Export to JSON, then render an interactive graph with d3.js or NetworkX (nodes = entities, edges = relationships).
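The export is a few lines; this sketch emits the nodes/links shape that d3's force layout conventionally consumes (the helper name is hypothetical).

```python
import json
import networkx as nx

def export_graph_json(graph: nx.Graph, path=None):
    """Serialize a NetworkX knowledge graph to d3-friendly JSON,
    carrying node/edge properties (type, relation, description) along."""
    payload = {
        "nodes": [{"id": n, **graph.nodes[n]} for n in graph.nodes],
        "links": [{"source": u, "target": v, **graph.edges[u, v]}
                  for u, v in graph.edges],
    }
    if path:
        with open(path, "w") as f:
            json.dump(payload, f, indent=2)
    return payload
```

Keeping the descriptions on nodes and edges is what lets the visualization double as a reasoning trace.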

"At query time, these summaries are queried instead of the raw graph, which makes it particularly effective and fast for big picture questions."

Key Takeaways

  • Scraping first: SerpApi + trafilatura for clean, real-time article/transcript data; dedupe/filter successes.
  • Ontology upfront: Define 5-10 entity/relation types per domain—guides reliable extraction.
  • Pydantic + structured predict: Auto-validates LLM JSON, skips chunking for short docs.
  • Communities key: Leiden clusters + summaries enable scalable global queries without full-graph scans.
  • Model tiering: Mini for extract/summaries, full for synthesis—cuts costs 5-10x.
  • Test on complex topics: GraphRAG shines on scattered, relational data like news/legal.
  • Visualize always: d3.js traces reasoning, builds trust.
  • Git clone https://github.com/thu-vu92/graphRAG; swap queries for your dataset.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge