k-NN on Google Searches Builds Explorable Knowledge Graph
Embed 800 results from 100 Google queries, run cosine k-NN to reveal 42.2% cross-query connections—every document links to at least one from a different search in its top 8 neighbors.
Shift from Google Ranking to Semantic Proximity for Hidden Connections
Treat Google search results as points in a shared embedding space: concatenate title + snippet + domain + source_query, embed with nomic-embed-text via Ollama, index in ChromaDB using cosine distance. Query k-NN (k=8) to find nearest neighbors across the entire merged corpus of ~800 results from 100 topic-specific queries. This surfaces connections no single search reveals, like linking an ArXiv quantization paper to NVIDIA INT8/FP16 benchmarks and Llama.cpp forks. Result: 42.2% of neighbor links cross query boundaries, with every one of 797 documents having at least one such link in its top 8—far outperforming isolated searches.
k-NN excels here because it's training-free, leveraging embedding structure directly for local similarity. Use multi-angle queries (e.g., hardware, benchmarks, site:arxiv.org) in queries.json to cover a topic like edge ML, ensuring broad coverage without overlap loss—same URL from different queries becomes distinct rows via SHA-256 hash of url + source_query.
Separate Source of Truth (DuckDB) from Vectors (Chroma) for Reliability
Store raw SERP data in DuckDB as a single portable .duckdb file: columns id (SHA-256), source_query, url, title, snippet, domain, position. Ingest via Bright Data SERP API client that retries 3x with backoff, unwraps JSON envelope, limits organics to 10 (post-2025 &num= deprecation), fails loudly on empty/bad responses. Merge mode skips existing source_queries; --refresh wipes and refetches.
Embed.py reads DuckDB, deletes/recreates Chroma collection (no upsert complexity), batches embeddings (32 at a time) to avoid OOM. Serve neighbors by fetching anchor vector from Chroma, querying top-k, hydrating full rows from DuckDB by id—preserves rank order, stitches distances. Trade-off: Chroma metadata is query-unfriendly; DuckDB enables SQL inspection/export/rebuilds without vector changes. Run order: ingest.py → embed.py → serve.py (FastAPI + JS UI at localhost:8766).
Prerequisites: Python 3.10+, uv venv, Ollama with nomic-embed-text, Docker Chroma on :8000, BRIGHT_DATA_API_KEY/ZONE.
Defensive Client and Embedding Choices Boost Pipeline Robustness
BrightDataSERPClient handles gotchas: quote queries, add hl/lr for language, post to api.brightdata.com/request with zone/url/format=json, parse inner body, slice organics:10. Retry linear backoff 0.5s*(attempt+1). Embedding_text joins fields with newlines for context—domain adds topical weight (arxiv.org ≠ thinkrobotics.com), source_query differentiates same-URL provenance.
Ollama embed handles /api/embed response formats (embeddings0 or legacy embedding), normalizes ndarray vs list. UI highlights cross-query neighbors; click any result to explore graph. Full code: github.com/sixthextinction/knn. Scales to your topic by editing queries.json—no orchestration needed, paces API calls to dodge throttling.