RAG Needs Automated Internal Evaluation for Optimization

RAG systems need quantitative evaluation to compare optimizations such as retrieval strategies; manual spot checks are slow and subjective. Evaluation should run like unit tests, integrated into CI/CD pipelines. Focus on internal evaluation of the retrieval and generation modules:

Retrieval metrics:

  • Relevance: Retrieved chunks match query?
  • Coverage: All relevant database chunks fetched?
  • Correctness: High signal-to-noise ratio, relevant chunks ranked at the top?

Generation metrics:

  • Relevance: Answer aligns with query, no off-topic drift?
  • Factuality: Answer grounded in retrieved sources, no hallucinations?
  • Correctness: Answer factually accurate?

Prefer LLM-as-a-judge over traditional NLP metrics (ROUGE, BLEU) for nuanced semantic judgment. Ground the evaluation in the production setup, e.g., an Azure AI Search index (rag-evaluation-chris, 50 chunks from employee handbook PDFs, vectorized in the text_vector field).

Azure SDK Evaluators Automate LLM-as-Judge Scoring

Leverage the azure.ai.evaluation package with GPT-4 (a gpt-4.1 deployment) as the judge for zero-shot scoring on a 1.0-5.0 scale. Key evaluators:

  • GroundednessEvaluator: Measures the answer's fidelity to the sources; scores drop if a fact can't be verified in the context, even if it is true externally. Input: response=answer, context=sources.
  • RelevanceEvaluator: Checks query-response alignment and contextual fit. Input: query=user_question, response=answer, context=sources.

Setup clients for Azure AI Search and OpenAI:

import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Connection settings come from environment variables (AZURE_SEARCH_*, AZURE_OPENAI_*)
AZURE_SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
AZURE_SEARCH_INDEX_NAME = os.environ["AZURE_SEARCH_INDEX_NAME"]
AZURE_SEARCH_ADMIN_KEY = os.environ["AZURE_SEARCH_ADMIN_KEY"]
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"]
AZURE_OPENAI_LLM_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_LLM_DEPLOYMENT_NAME"]

openai_client = AzureOpenAI(api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT, api_version="2024-10-21")
search_client = SearchClient(endpoint=AZURE_SEARCH_ENDPOINT, index_name=AZURE_SEARCH_INDEX_NAME, credential=AzureKeyCredential(AZURE_SEARCH_ADMIN_KEY))

def get_embedding_vector(query: str) -> list[float]:
    # Embed the query with the Azure OpenAI embedding deployment (used by vector and hybrid retrieval).
    response = openai_client.embeddings.create(model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME, input=[query])
    return response.data[0].embedding

Retrieval (top=5; a full sketch that assembles the sources string follows this list):

  • Keyword: search_client.search(search_text=user_question, top=5)
  • Vector: search_client.search(search_text=None, vector_queries=[VectorizedQuery(vector=get_embedding_vector(user_question), k_nearest_neighbors=50, fields="text_vector")], top=5)
  • Hybrid (semantic): search_client.search(search_text=user_question, vector_queries=[...], query_type="semantic", semantic_configuration_name="rag-evaluation-chris-semantic-configuration", top=5)
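The generation step below consumes a single sources string built from the retrieved chunks. A minimal sketch, assuming the index exposes title and chunk fields (the helper name and field names are assumptions; adjust to your schema):

def retrieve_sources(user_question: str, method: str = "hybrid") -> str:
    """Run one retrieval strategy and flatten the hits into a 'sources' string."""
    vector_query = VectorizedQuery(
        vector=get_embedding_vector(user_question),
        k_nearest_neighbors=50,
        fields="text_vector",
    )
    if method == "keyword":
        results = search_client.search(search_text=user_question, top=5)
    elif method == "vector":
        results = search_client.search(search_text=None, vector_queries=[vector_query], top=5)
    else:  # hybrid with semantic ranking
        results = search_client.search(
            search_text=user_question,
            vector_queries=[vector_query],
            query_type="semantic",
            semantic_configuration_name="rag-evaluation-chris-semantic-configuration",
            top=5,
        )
    # "title" and "chunk" are assumed field names; use whatever your index schema defines.
    return "\n".join(f"[{doc['title']}]: {doc['chunk']}" for doc in results)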

Generation prompt enforces grounding:

SYSTEM_MESSAGE = """Answer ONLY with facts from sources. Use [source] citations."""

# Pass the question plus the retrieved sources; the system message forbids answering beyond them.
response = openai_client.chat.completions.create(
    model=AZURE_OPENAI_LLM_DEPLOYMENT_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_question + "\nSources: " + sources},
    ],
)
answer = response.choices[0].message.content
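
To compare strategies, the same question can be answered once per retrieval method. A sketch, reusing the hypothetical retrieve_sources helper above:

# Generate one answer per retrieval strategy for the same question.
user_question = "What does a product manager do?"
answers = {}
for method in ("keyword", "vector", "hybrid"):
    sources = retrieve_sources(user_question, method=method)
    response = openai_client.chat.completions.create(
        model=AZURE_OPENAI_LLM_DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_question + "\nSources: " + sources},
        ],
    )
    # Keep the sources alongside the answer so the evaluators can check grounding later.
    answers[method] = (response.choices[0].message.content, sources)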

Evaluate:

from azure.ai.evaluation import AzureOpenAIModelConfiguration, GroundednessEvaluator, RelevanceEvaluator

# Judge-model configuration; the dict fields mirror AzureOpenAIModelConfiguration.
model_config = {"azure_endpoint": AZURE_OPENAI_ENDPOINT, "azure_deployment": AZURE_OPENAI_LLM_DEPLOYMENT_NAME, "api_key": AZURE_OPENAI_API_KEY}
relevance_eval = RelevanceEvaluator(model_config)
groundedness_eval = GroundednessEvaluator(model_config)

# Each call returns a dict with the 1.0-5.0 score and the judge's reasoning.
relevance_score = relevance_eval(query=user_question, response=answer, context=sources)
groundedness_score = groundedness_eval(response=answer, context=sources)
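
Scoring every method from the answers dict above is then a short loop; a sketch, assuming result keys named after the metrics (inspect the returned dicts in your SDK version):

# Score each retrieval strategy's answer with the same judge configuration.
for method, (answer, sources) in answers.items():
    g = groundedness_eval(response=answer, context=sources)
    r = relevance_eval(query=user_question, response=answer, context=sources)
    # "groundedness"/"relevance" keys are assumptions based on the evaluator names.
    print(f"{method}: groundedness={g.get('groundedness')}, relevance={r.get('relevance')}")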

Keyword Search Wins for Simple Queries, Enables Agentic RAG

On query "What does a product manager do?" (50-chunk index):

Method    Groundedness  Relevance
Keyword   4.5           5.0
Hybrid    4.0           4.5
Vector    3.0           3.5

Keyword search unexpectedly scored highest on this task, showing how automated evaluation surfaces trade-offs (e.g., pure vector search can struggle when the query uses the exact phrasing found in the source text). This closes the loop for Agentic RAG: reliable metrics let a self-improving agent select the best retrieval strategy.