The Evolution of Retrieval
Despite the "RAG is dead" discourse on social media, actual usage metrics show search volume inflecting sharply upward starting in mid-2025. The confusion stems from defining RAG narrowly as a single, static vector search call. In production, sophisticated systems have moved toward Agentic Retrieval: an iterative process in which agents combine vector search, full-text search (BM25), regex, and filters, reasoning over the accumulated context until they have the information needed to complete a task.
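The iterative loop described above can be sketched in a few lines. This is a minimal illustration, not any product's implementation: the three-file corpus, the `agentic_retrieve` helper, and the scripted `plan` are all hypothetical, and the plan stands in for the tool choices an LLM would make at each step. Vector search is omitted to keep the sketch dependency-free; it would be one more entry in `TOOLS`.

```python
import math
import re
from collections import Counter

# Hypothetical three-file corpus, used only for illustration.
DOCS = {
    "auth.py": "def login(user, password): check the password hash and issue a session token",
    "billing.py": "def charge(card, amount): create an invoice and charge the card",
    "README.md": "Setup guide: install dependencies, configure the auth token, run the server",
}

def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())

def bm25_search(query, docs, k1=1.5, b=0.75):
    """Return doc names ranked by a plain BM25 score (zero-score docs omitted)."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    avg_len = sum(len(t) for t in tokenized.values()) / len(tokenized)
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            tf = counts[term]
            if tf == 0:
                continue
            df = sum(1 for t in tokenized.values() if term in t)
            idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

def regex_search(pattern, docs):
    """Return names of docs whose text matches the regular expression."""
    return [name for name, body in docs.items() if re.search(pattern, body)]

TOOLS = {"bm25": bm25_search, "regex": regex_search}

def agentic_retrieve(plan, docs, is_enough, max_steps=5):
    """Run tool calls one at a time, accumulating context until is_enough() holds.

    `plan` is a scripted list of (tool, argument) pairs; in a real agent the
    model would pick the next call after inspecting the context so far.
    """
    context, steps_used = [], 0
    for tool, arg in plan[:max_steps]:
        steps_used += 1
        for name in TOOLS[tool](arg, docs):
            if name not in context:
                context.append(name)
        if is_enough(context):
            break  # stop as soon as the task can be completed
    return context, steps_used

plan = [
    ("bm25", "refund policy"),        # first guess finds nothing
    ("regex", r"session\s+token"),    # second attempt locates the relevant file
    ("bm25", "charge card"),          # never runs: the loop has already stopped
]
context, steps = agentic_retrieve(plan, DOCS, lambda ctx: "auth.py" in ctx)
print(context, steps)  # ['auth.py'] 2
```

The point of the sketch is the control flow: a failed first query does not end retrieval, and the loop stops as soon as the stopping condition is met rather than after a fixed number of calls.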
Embeddings as Cached Compute
The differing architectures of tools like Cursor and Claude Code highlight a fundamental trade-off in AI engineering:
- Per-session discovery (e.g., Claude Code): Uses grep-based retrieval without upfront indexing. This avoids indexing costs but incurs higher latency and token costs per run because the agent must re-discover information every time.
- Upfront indexing (e.g., Cursor): Treats embeddings as "cached compute." While there is an initial cost to parse, chunk, and embed a codebase, it enables lightweight, high-speed retrieval at runtime.
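To make the trade-off concrete, here is a small sketch contrasting the two approaches. Everything in it is an assumption for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and `EmbeddingIndex`, `grep_search`, and the sample files are hypothetical names, not the internals of Cursor or Claude Code.

```python
import hashlib
import math
import re

DIM = 256  # toy embedding size (arbitrary assumption)

# Hypothetical sample files.
FILES = {
    "auth.py": "def login(user, password): check the password hash and issue a session token",
    "billing.py": "def charge(card, amount): create an invoice and charge the card",
    "README.md": "Setup guide: install dependencies, configure the auth token, run the server",
}

def toy_embed(text):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[a-z0-9_]+", text.lower()):
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class EmbeddingIndex:
    """Upfront indexing: pay the embedding cost once, reuse it on every query."""

    def __init__(self):
        self.vectors = {}            # path -> cached embedding
        self.corpus_embed_calls = 0  # counts the expensive, one-time work

    def build(self, files):
        for path, text in files.items():
            self.vectors[path] = toy_embed(text)
            self.corpus_embed_calls += 1

    def search(self, query, top_k=1):
        # Only the query is embedded at runtime; corpus vectors come from the cache.
        q = toy_embed(query)

        def similarity(path):
            return sum(a * b for a, b in zip(q, self.vectors[path]))

        return sorted(self.vectors, key=similarity, reverse=True)[:top_k]

def grep_search(pattern, files):
    """Per-session discovery: no index, so every call rescans every file."""
    return [path for path, text in files.items() if re.search(pattern, text)]

index = EmbeddingIndex()
index.build(FILES)                       # initial cost, paid once
print(index.search("charge the card"))   # served from cached vectors
print(grep_search(r"charge", FILES))     # rescans all files each time
```

After `build`, repeated calls to `search` never touch the file contents again, which is the sense in which the embeddings act as cached compute.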
For teams working on shared codebases, upfront indexing can be optimized further with Merkle trees: comparing hashes between two versions of the tree identifies which files changed, so only those files are re-indexed, significantly reducing redundant compute.
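The Merkle-tree idea can be sketched as follows, assuming a codebase represented as nested dictionaries (a directory maps names to children, a file is a string of contents). The helper names `build_tree` and `changed_files` are hypothetical, and this is a generic illustration of the technique rather than any specific tool's implementation.

```python
import hashlib

def sha(data):
    return hashlib.sha256(data.encode()).hexdigest()

def build_tree(node):
    """Hash a nested dict (directory) or str (file contents) bottom-up."""
    if isinstance(node, str):
        return {"hash": sha("file:" + node), "children": None}
    children = {name: build_tree(child) for name, child in sorted(node.items())}
    # A directory's hash covers the names and hashes of all of its children,
    # so any change below it changes its own hash as well.
    combined = "".join(f"{name}:{child['hash']};" for name, child in children.items())
    return {"hash": sha("dir:" + combined), "children": children}

def changed_files(old, new, prefix=""):
    """Return paths of files in `new` that are added or modified relative to `old`.

    Deleted files are not reported here; a real indexer would also need to
    evict them from the index.
    """
    if old is not None and old["hash"] == new["hash"]:
        return []  # identical subtree: skip it without visiting its files
    if new["children"] is None:
        return [prefix.rstrip("/")]
    old_children = (old or {}).get("children") or {}
    paths = []
    for name, child in new["children"].items():
        paths += changed_files(old_children.get(name), child, prefix + name + "/")
    return paths

before = {"src": {"auth.py": "v1", "billing.py": "v1"}, "README.md": "hello"}
after = {"src": {"auth.py": "v2", "billing.py": "v1", "new.py": "x"}, "README.md": "hello"}

print(changed_files(build_tree(before), build_tree(after)))
# ['src/auth.py', 'src/new.py']
```

Only the two reported files would need to be re-chunked and re-embedded; unchanged subtrees are skipped after a single hash comparison at their root.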
Impact and Performance
Data from Cursor’s implementation of semantic search shows that even modest aggregate gains in accuracy and retention are significant once you account for the fact that only some queries invoke retrieval at all. Specifically, semantic search integration led to:
- A 24% increase in answer accuracy for their composer model.
- A 2.6% gain in code retention for large codebases.
- A 2.2% drop in dissatisfied user requests.
These gains are notable because semantic search is not triggered on every query; the impact is concentrated on the subset of queries where retrieval is actually necessary.
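A back-of-the-envelope calculation illustrates the concentration effect. It rests on two simplifying assumptions that are not from the source: queries that never trigger semantic search are unaffected, and the baseline rate is the same across both groups. Under those assumptions, an aggregate gain equals the trigger rate times the gain on retrieval queries. The trigger rates below are purely hypothetical; the source does not report the real figure.

```python
def implied_gain_on_retrieval_queries(aggregate_gain, trigger_rate):
    """Gain on the retrieval subset implied by an aggregate gain.

    Assumes non-retrieval queries are unaffected and both groups share the
    same baseline, so aggregate_gain = trigger_rate * subset_gain.
    """
    if not 0 < trigger_rate <= 1:
        raise ValueError("trigger_rate must be in (0, 1]")
    return aggregate_gain / trigger_rate

aggregate = 0.022  # the 2.2% drop in dissatisfied requests cited above

# Hypothetical trigger rates, for illustration only.
for rate in (0.10, 0.25, 0.50):
    subset = implied_gain_on_retrieval_queries(aggregate, rate)
    print(f"trigger rate {rate:.0%}: implied effect on retrieval queries {subset:.1%}")
```

The smaller the share of queries that use retrieval, the larger the effect must be on those queries to produce the same aggregate number.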
The "Right Million" Strategy
As context windows grow, the engineering challenge shifts from capacity to precision. Echoing Jeff Dean’s philosophy, the goal is not to feed an LLM a trillion tokens at once, but to use an iterative retrieval mechanism to narrow down a massive corpus to the "right million" tokens. This approach optimizes for both cost and agent performance, ensuring the model receives the most relevant context without the noise of a full-context dump.
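One way to picture the narrowing step is as a funnel of progressively more selective stages followed by packing under a token budget. The sketch below is a toy version under stated assumptions: the corpus, the stage functions, the `narrow` helper, and the 4-characters-per-token estimate are all hypothetical, and a real system would use stronger rankers and a real tokenizer.

```python
def estimate_tokens(text):
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)

def only_python(docs):
    """Cheap metadata filter: keep only Python source files."""
    return [d for d in docs if d["path"].endswith(".py")]

def rank_by_overlap(query):
    """Build a stage that ranks docs by shared query terms and drops non-matches."""
    terms = set(query.lower().split())

    def stage(docs):
        scored = [(len(terms & set(d["text"].lower().split())), d) for d in docs]
        scored = [(score, d) for score, d in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [d for _, d in scored]

    return stage

def narrow(corpus, stages, token_budget):
    """Apply each stage in order, then greedily pack ranked docs into the budget."""
    candidates = list(corpus)
    for stage in stages:
        candidates = stage(candidates)
    selected, used = [], 0
    for doc in candidates:
        cost = estimate_tokens(doc["text"])
        if used + cost > token_budget:
            continue  # skip docs that would overflow the budget
        selected.append(doc)
        used += cost
    return selected, used

# Hypothetical corpus; repeated strings simulate files of different sizes.
CORPUS = [
    {"path": "src/auth.py", "text": "issue a session token after the password check " * 10},
    {"path": "src/billing.py", "text": "charge the card and create an invoice " * 10},
    {"path": "docs/auth.md", "text": "how the session token works " * 10},
    {"path": "src/session.py", "text": "refresh an expired session token " * 40},
]

selected, used = narrow(CORPUS, [only_python, rank_by_overlap("session token")], token_budget=200)
print([d["path"] for d in selected], used)
```

Each stage is cheap relative to the model call it protects, so the expensive context window is spent only on material that survived the earlier filters.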