Local SERP Index with Typesense: $0 Faceted Search

Pipeline Delivers Instant Faceted Search Over Live SERPs

Replace manual JSON grepping for research (for example, arXiv RAG papers) with a local index. The pipeline queries Google with site:arxiv.org plus a topic through the Bright Data SERP API, transforms each organic result into a document, and bulk-upserts the batch into Typesense, a free, Dockerized Algolia alternative running on localhost:8108. By default it keeps 10 results per query, sliced client-side because Google has ignored the &num= parameter since deprecating it in 2025, and waits 0.6 s between calls to respect rate limits. Running ingest.py drops and recreates the collection; adding --append accumulates documents across runs, which enables cross-query analysis such as finding overlaps between "agentic RAG" and "hybrid search". Browser queries go through serve.py (stdlib http.server, 30 lines), which keeps the admin API key out of the browser. The UI shows keyword results, source_query and domain chips, and position-sorted cards with provenance.

The result is sub-second faceted search: filter by seed-query chips (exact strings such as "site:arxiv.org long context vs RAG 2026") or by domain to reveal patterns such as papers surfacing under multiple angles. The total cost is $0 beyond Bright Data credits, and the approach scales to any domain where Google outperforms the site's native search.
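The fetch loop described above can be sketched roughly as follows. This is a minimal illustration, not the article's ingest.py: the names fetch_with_retry and collect are hypothetical, the fetch callable stands in for the actual Bright Data request, and "retries 2x" is read here as one initial attempt plus two retries.

```python
import time


def fetch_with_retry(fetch, query, retries=2, sleep=time.sleep):
    """Call fetch(query); on an empty result, back off 0.5*(attempt+1) s and retry."""
    for attempt in range(retries + 1):
        results = fetch(query)
        if results:
            return results
        if attempt < retries:
            sleep(0.5 * (attempt + 1))
    return []  # every attempt came back empty


def collect(queries, fetch, num_results=10, delay=0.6, sleep=time.sleep):
    """Fetch each query politely and slice the results client-side."""
    collected = {}
    for i, query in enumerate(queries):
        if i:
            sleep(delay)  # pause between queries, not before the first one
        # Slice after the fetch, since the result count is not controlled server-side.
        collected[query] = fetch_with_retry(fetch, query, sleep=sleep)[:num_results]
    return collected
```

Passing sleep as a parameter is only a convenience for testing the loop without real pauses; a real script could call time.sleep directly.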

Schema and Doc Mapping Unlocks Provenance

Define a Typesense collection whose fields mirror the SERP organic results: title and snippet (capped at 8000 and 16000 characters), url, position (int32, taken from the result's rank, falling back to index + 1), and source_query and domain (strings with facet: true). Set default_sorting_field: "position" to keep Google's order as the baseline ranking signal. Generate document IDs as sha256(url + "\t" + source_query). This choice matters for duplicates: the same arXiv paper returned by two queries becomes two documents, each tagged with its own source_query facet, so you can spot papers that surface from multiple angles. Hashing the URL alone keeps the index "clean" but discards that provenance.
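A schema along these lines might look like the sketch below. The field names and types follow the description above; the collection name serp_results and the helper name doc_id are assumptions made for illustration.

```python
import hashlib

# Collection schema as a plain dict, in the shape Typesense's create-collection call accepts.
SCHEMA = {
    "name": "serp_results",  # hypothetical collection name
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "snippet", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "position", "type": "int32"},
        {"name": "source_query", "type": "string", "facet": True},
        {"name": "domain", "type": "string", "facet": True},
    ],
    "default_sorting_field": "position",
}


def doc_id(url: str, source_query: str) -> str:
    """Hash URL and seed query together so one URL under two queries yields two IDs."""
    return hashlib.sha256(f"{url}\t{source_query}".encode("utf-8")).hexdigest()
```

One point worth verifying against your Typesense version: its documented default ordering is text-match score first, then the default sorting field in descending order, so a search that should list Google's top result first may need an explicit sort_by=position:asc.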

organic_to_documents handles variant field names (link/url, description/snippet, rank/position) and skips invalid entries. Calling import_ with {"action": "upsert"} on the JSONL batch reports errors per line, so check each result line for "success":false. The --num-results 8 argument caps results after the fetch. Fetches are retried twice with a 0.5*(attempt+1) s backoff when Bright Data returns HTTP 200 with an empty body or a non-200 inner status. The script validates and unwraps body (often a JSON string) before reading the organic results; skipping that check silently indexes nothing.
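The mapping step could be sketched as below. Only the name organic_to_documents comes from the text; unwrap_body and failed_import_lines are hypothetical helpers, and the "body" and "organic" keys, the validity rule (an entry needs a URL and a title), and deriving domain from the URL host are assumptions about the payload shape.

```python
import hashlib
import json
from urllib.parse import urlparse


def unwrap_body(payload: dict) -> dict:
    """Return the inner SERP dict; `body` may arrive as a JSON-encoded string."""
    body = payload.get("body", payload)
    if isinstance(body, str):
        body = json.loads(body)
    if not isinstance(body, dict):
        raise ValueError("unexpected SERP payload shape")
    return body


def organic_to_documents(serp: dict, source_query: str) -> list[dict]:
    """Map organic results to Typesense documents, tolerating variant field names."""
    docs = []
    for index, item in enumerate(serp.get("organic") or []):
        url = item.get("link") or item.get("url")
        title = item.get("title")
        if not url or not title:
            continue  # skip invalid entries (assumed rule: URL and title required)
        snippet = item.get("description") or item.get("snippet") or ""
        position = item.get("rank") or item.get("position") or index + 1
        docs.append({
            "id": hashlib.sha256(f"{url}\t{source_query}".encode("utf-8")).hexdigest(),
            "title": title[:8000],
            "snippet": snippet[:16000],
            "url": url,
            "position": int(position),
            "source_query": source_query,
            "domain": urlparse(url).netloc,
        })
    return docs


def failed_import_lines(response_text: str) -> list[dict]:
    """Parse a per-line JSONL import response and keep only the failed lines."""
    results = [json.loads(line) for line in response_text.splitlines() if line.strip()]
    return [r for r in results if not r.get("success")]
```

Raising in unwrap_body rather than returning an empty dict makes a malformed response visible instead of quietly producing an empty index.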

Demo queries include "site:arxiv.org retrieval augmented generation 2026"; override them with --query (repeatable) or --queries-file (one query per line; lines starting with # and blank lines are skipped). --delay 0.6 tunes the pause between calls.
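The command-line options mentioned so far could be wired up with argparse roughly like this. The flag names come from the text; the function names, help strings, and the single entry in DEFAULT_QUERIES are illustrative assumptions.

```python
import argparse
from pathlib import Path

# Only one of the demo queries is quoted in the text; the rest are not reproduced here.
DEFAULT_QUERIES = ["site:arxiv.org retrieval augmented generation 2026"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Ingest SERP results into Typesense")
    parser.add_argument("--query", action="append", default=[],
                        help="seed query; may be given more than once")
    parser.add_argument("--queries-file", type=Path,
                        help="file with one query per line")
    parser.add_argument("--num-results", type=int, default=10,
                        help="results kept per query after fetching")
    parser.add_argument("--delay", type=float, default=0.6,
                        help="seconds to wait between SERP calls")
    parser.add_argument("--append", action="store_true",
                        help="keep the existing collection instead of recreating it")
    return parser


def parse_query_lines(text: str) -> list[str]:
    """Return non-blank lines that do not start with '#'."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def resolve_queries(args: argparse.Namespace) -> list[str]:
    """Combine --query and --queries-file; fall back to the demo queries."""
    queries = list(args.query)
    if args.queries_file:
        queries += parse_query_lines(args.queries_file.read_text(encoding="utf-8"))
    return queries or DEFAULT_QUERIES
```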

Proxy Shields API Key; UI Leverages Facets Natively

serve.py proxies /api/search to Typesense with a fixed parameter set (q, filter_by=source_query:*chip* || domain:*chip*, facet_by=source_query,domain, per_page=20), so the API key never reaches the browser. It uses no frameworks, only http.server and urllib.parse, and serves static/index.html. In the UI, typing in the input triggers a fetch, chips toggle facet filters (e.g., q=graph RAG&filter_by=source_query:site:arxiv.org graph RAG 2026), and results render as cards (title, snippet, url, position, chips).
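A stdlib-only proxy in this spirit might look like the sketch below. It is not the article's serve.py: the environment variable names, the collection name, the query_by fields, and port 8000 are assumptions, and static file serving is left out for brevity.

```python
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlparse

TYPESENSE_URL = os.environ.get("TYPESENSE_URL", "http://localhost:8108")  # assumed variable name
API_KEY = os.environ.get("TYPESENSE_API_KEY", "devtypesense")  # assumed variable name
COLLECTION = "serp_results"  # hypothetical collection name


def build_search_url(raw_query: str) -> str:
    """Translate the browser's query string into a fixed Typesense search URL."""
    incoming = parse_qs(raw_query)
    params = {
        "q": incoming.get("q", ["*"])[0],
        "query_by": "title,snippet",  # assumed searchable fields
        "facet_by": "source_query,domain",
        "per_page": "20",
    }
    if incoming.get("filter_by"):
        params["filter_by"] = incoming["filter_by"][0]
    return f"{TYPESENSE_URL}/collections/{COLLECTION}/documents/search?{urlencode(params)}"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/api/search":
            self.send_error(404)
            return
        # The key travels only in this server-side header, never to the browser.
        request = urllib.request.Request(
            build_search_url(parsed.query),
            headers={"X-TYPESENSE-API-KEY": API_KEY},
        )
        try:
            with urllib.request.urlopen(request, timeout=10) as upstream:
                payload = upstream.read()
        except OSError:  # covers URLError and HTTPError
            self.send_error(502, "Typesense request failed")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def main() -> None:
    """Start the proxy on localhost; call this from a script entry point."""
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

Because the browser controls only q and filter_by, the proxy exposes search and nothing else, even though the key it holds has admin rights.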

Docker Compose persists the /data volume, and --api-key devtypesense matches the value in .env. Swapping SERP providers requires editing only .env. The script supports two modes: snapshot (recreate) and corpus (--append). With --append you can add a Thursday query to Monday's index without a reset and query once-collected data many times.
