The Pipeline as a System

Most production RAG failures are caused not by the LLM but by silent breakdowns in the retrieval pipeline. Treating RAG as a single black box hides these issues. Instead, view it as an eleven-stage chain, from query intake to response delivery, in which each stage can be tuned and observed independently.

Optimizing Retrieval Precision

Retrieval is the most common point of failure. Relying solely on dense vector embeddings often leads to poor performance on exact identifiers like SKUs or model numbers.

  • Hybrid Retrieval: Combine dense (semantic) and sparse (keyword/BM25) search using Reciprocal Rank Fusion. This captures both conceptual meaning and exact keyword matches.
  • Two-Stage Retrieval: Cast a wide net first (top 20-50 chunks), then pass the candidates through a cross-encoder reranker. Because the reranker scores each candidate jointly with the query, it can give a precision lift that embedding similarity alone cannot match.
  • Query Expansion: Short user queries often lack the context needed for high-quality retrieval. Techniques like HyDE (Hypothetical Document Embeddings) have an LLM draft a hypothetical answer and use its embedding as a richer retrieval signal. Use the hypothetical text only for retrieval and never let it leak into the final context, since it is generated rather than sourced.
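
The first two techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the document IDs are made up, k=60 is a commonly used RRF constant assumed here, and toy_overlap_score is a hypothetical stand-in for a real cross-encoder, which would be a model call.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[str]:
    """Second stage: score every candidate against the query, keep the best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a cross-encoder: counts shared words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Illustrative inputs only: one dense ranking and one sparse (BM25) ranking.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["sku_123", "doc_a", "doc_b"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

Documents that appear in both lists rise to the top of fused, while an exact-match hit such as sku_123 that only the sparse search found still makes the candidate set. In a real pipeline the fused IDs would be resolved to chunk text before being handed to rerank.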

Data Ingestion and Chunking

Chunking is a hypothesis, not a fixed rule. Start with 300-500 tokens and 10-20% overlap, then tune against a dedicated evaluation set.
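
As a starting point for that tuning, a fixed-size chunker with overlap might look like the sketch below. It assumes the text has already been tokenized into a list; the defaults of 400 tokens and 60 tokens of overlap (15%) are simply values inside the ranges above, and real token counts depend on the tokenizer of your embedding model.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], chunk_size: int = 400, overlap: int = 60
) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

Because chunk_size and overlap are plain parameters, an evaluation run can sweep them and compare retrieval quality for each setting.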

  • Semantic Boundaries: Prefer splitting at paragraph or section boundaries over fixed character counts, which can cut a sentence mid-thought.
  • Table Handling: Standard text loaders often flatten tables into unusable strings. Use dedicated table extraction tools to preserve the structure of product specs and pricing matrices.
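
One way to respect paragraph boundaries is to pack whole paragraphs into chunks up to a token budget. The sketch below assumes paragraphs are separated by blank lines and approximates token counts with whitespace-separated words; both are simplifications, and the function name is illustrative.

```python
from typing import List


def pack_paragraphs(text: str, max_tokens: int = 400) -> List[str]:
    """Group whole paragraphs into chunks without splitting any paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # rough token count (assumption: words ~ tokens)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than max_tokens is kept whole as its own oversized chunk here; a production version would likely fall back to sentence-level splitting for that case.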

Intent and Routing

Not every query should hit the same index. Use zero-shot classification to route each query to the right index (e.g., policy docs vs. order logs) and to select a matching prompt template. Common intents can skip retrieval and generation entirely by answering from a curated FAQ, which drastically reduces latency and improves reliability.
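
A rough sketch of this routing logic follows. Every name in it is hypothetical: the intents, index names, templates, and FAQ entry are placeholders, and keyword_classifier is a trivial stand-in for a real zero-shot classifier that is passed in as a callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Route:
    index: str            # which index to search
    prompt_template: str  # which prompt template to fill


# Hypothetical routes and FAQ content, for illustration only.
ROUTES: Dict[str, Route] = {
    "policy": Route("policy_docs", "Answer from policy text:\n{context}\n\nQ: {question}"),
    "order": Route("order_logs", "Answer from order records:\n{context}\n\nQ: {question}"),
}
FAQ: Dict[str, str] = {
    "what are your opening hours": "We are open 9am to 5pm, Monday to Friday.",
}


def normalize(question: str) -> str:
    """Lowercase, trim edge punctuation, and collapse whitespace."""
    return " ".join(question.lower().strip(" ?!.").split())


def keyword_classifier(question: str, labels: List[str]) -> str:
    """Stand-in for a zero-shot model: picks a label by keyword."""
    return "order" if "order" in question.lower() else "policy"


def route_query(
    question: str,
    classify: Callable[[str, List[str]], str],
    default_intent: str = "policy",
) -> Dict[str, str]:
    # Short-circuit: a curated FAQ hit skips retrieval and generation.
    faq_answer = FAQ.get(normalize(question))
    if faq_answer is not None:
        return {"kind": "faq", "answer": faq_answer}
    intent = classify(question, list(ROUTES))
    # Fall back to a default route if the classifier returns an unknown label.
    route = ROUTES.get(intent, ROUTES[default_intent])
    return {"kind": "retrieve", "index": route.index,
            "prompt_template": route.prompt_template}
```

Keeping the classifier behind a plain callable means the stand-in can be swapped for a model-backed one without touching the routing code, and the FAQ path can be tested with no model at all.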

Evaluation and Monitoring

If you are not measuring, you are guessing. The most commonly overlooked part of a RAG system is a robust evaluation set. Without a ground-truth dataset, you cannot objectively measure the impact of changes to your chunking strategy, embedding model, or retrieval logic. Treat evaluation as a first-class part of the development lifecycle.
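
As one possible shape for such a harness, the sketch below scores a retriever against a ground-truth set using recall@k and mean reciprocal rank. The choice of these two metrics, the eval-set format (a query plus the IDs of the chunks that should be retrieved), and the retrieve callable are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    eval_set: List[dict],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Average both metrics over every example in the evaluation set."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one example")
    recalls, rranks = [], []
    for example in eval_set:
        retrieved = retrieve(example["query"])
        relevant = set(example["relevant_ids"])
        recalls.append(recall_at_k(retrieved, relevant, k))
        rranks.append(reciprocal_rank(retrieved, relevant))
    n = len(eval_set)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(rranks) / n}
```

Running evaluate before and after a change to chunking, embeddings, or retrieval logic turns "it feels better" into a pair of numbers that can be compared.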