RAG Needs Automated Internal Evaluation for Optimization
RAG systems need quantitative evaluation to compare optimizations such as different retrieval strategies; manual spot checks are slow and subjective, whereas automated scores can run in CI/CD pipelines like unit tests. Focus on internal evaluation of the retrieval and generation modules:
Retrieval metrics:
- Relevance: Retrieved chunks match query?
- Coverage: All relevant database chunks fetched?
- Correctness: High signal-to-noise ratio, with relevant chunks ranked at the top?
Generation metrics:
- Relevance: Answer aligns with query, no off-topic drift?
- Factuality: Answer grounded in retrieved sources, no hallucinations?
- Correctness: Answer factually accurate?
Prefer LLM-as-a-judge over traditional NLP metrics (ROUGE, BLEU) for nuanced semantic judgment. Ground evaluators in production setups like Azure AI Search indexes (e.g., rag-evaluation-chris with 50 chunks from employee handbook PDFs, vectorized in the text_vector field).
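Before wiring up any SDK, the evaluation set itself can be plain data shared by retrieval and generation checks; a minimal sketch, where the EvalCase shape and the second sample question are illustrative assumptions:
from dataclasses import dataclass

@dataclass
class EvalCase:
    # One evaluation row: a user question, plus optional reference facts for correctness checks
    question: str
    expected_facts: list[str] | None = None

EVAL_CASES = [
    EvalCase("What does a product manager do?"),
    EvalCase("How many vacation days do new employees get?"),  # assumed handbook-style question
]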
Azure SDK Evaluators Automate LLM-as-Judge Scoring
Leverage the azure.ai.evaluation package with a GPT-4-class judge (here a gpt-4.1 deployment) for zero-shot scoring on a 1.0-5.0 scale. Key evaluators:
- GroundednessEvaluator: Measures the answer's fidelity to the retrieved sources; scores drop if facts cannot be verified in the context, even if they are externally true. Input: response=answer, context=sources.
- RelevanceEvaluator: Checks query-response alignment and contextual fit. Input: query=user_question, response=answer, context=sources.
Set up clients for Azure AI Search and Azure OpenAI:
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Load connection settings from env vars (AZURE_SEARCH_*, AZURE_OPENAI_*)
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"]
AZURE_OPENAI_LLM_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_LLM_DEPLOYMENT_NAME"]
AZURE_SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
AZURE_SEARCH_INDEX_NAME = os.environ["AZURE_SEARCH_INDEX_NAME"]
AZURE_SEARCH_ADMIN_KEY = os.environ["AZURE_SEARCH_ADMIN_KEY"]

openai_client = AzureOpenAI(api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT, api_version="2024-10-21")
search_client = SearchClient(endpoint=AZURE_SEARCH_ENDPOINT, index_name=AZURE_SEARCH_INDEX_NAME, credential=AzureKeyCredential(AZURE_SEARCH_ADMIN_KEY))

def get_embedding_vector(query: str) -> list[float]:
    # Embed the query with the Azure OpenAI embedding deployment
    response = openai_client.embeddings.create(model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME, input=[query])
    return response.data[0].embedding
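A quick sanity check of the embedding helper (the printed dimension depends on the embedding deployment; 1536 here is an assumption):
vec = get_embedding_vector("What does a product manager do?")
print(len(vec))  # e.g. 1536, depending on the embedding model behind the deployment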
Retrieval (top=5):
- Keyword:
  search_client.search(search_text=user_question, top=5)
- Vector:
  search_client.search(None, top=5, vector_queries=[VectorizedQuery(vector=get_embedding_vector(user_question), k_nearest_neighbors=50, fields="text_vector")])
- Hybrid (semantic):
  search_client.search(user_question, top=5, vector_queries=[...], query_type="semantic", semantic_configuration_name="rag-evaluation-chris-semantic-configuration")
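The sources string used by the prompt and the evaluators has to be assembled from whichever search call runs; a minimal sketch, assuming the index exposes title and chunk fields (the field names are assumptions):
results = search_client.search(search_text=user_question, top=5)
# Flatten retrieved chunks into one citable string; "title" and "chunk" are assumed field names
sources = "\n".join(f"[{doc['title']}]: {doc['chunk']}" for doc in results)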
Generation prompt enforces grounding:
SYSTEM_MESSAGE = """Answer ONLY with facts from sources. Use [source] citations."""
response = openai_client.chat.completions.create(model=AZURE_OPENAI_LLM_DEPLOYMENT_NAME, messages=[{"role": "system", "content": SYSTEM_MESSAGE}, {"role": "user", "content": user_question + "\nSources: " + sources}])
answer = response.choices[0].message.content
Evaluate:
from azure.ai.evaluation import AzureOpenAIModelConfiguration, GroundednessEvaluator, RelevanceEvaluator

model_config = AzureOpenAIModelConfiguration(azure_endpoint=AZURE_OPENAI_ENDPOINT, azure_deployment=AZURE_OPENAI_LLM_DEPLOYMENT_NAME, api_key=AZURE_OPENAI_API_KEY)
relevance_eval = RelevanceEvaluator(model_config)
groundedness_eval = GroundednessEvaluator(model_config)

# Each call returns a dict holding the 1.0-5.0 score plus a reasoning string
relevance_score = relevance_eval(query=user_question, response=answer, context=sources)
groundedness_score = groundedness_eval(response=answer, context=sources)
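To compare retrieval strategies head-to-head (as in the results below), run the same question through each mode and score both metrics; a minimal sketch, where retrieve_keyword, retrieve_vector, retrieve_hybrid, and generate_answer are assumed wrappers around the calls shown above:
retrievers = {"keyword": retrieve_keyword, "vector": retrieve_vector, "hybrid": retrieve_hybrid}
for name, retrieve in retrievers.items():
    sources = retrieve(user_question)                 # assumed: returns the flattened sources string
    answer = generate_answer(user_question, sources)  # assumed: wraps the chat.completions call above
    rel = relevance_eval(query=user_question, response=answer, context=sources)
    gnd = groundedness_eval(response=answer, context=sources)
    # Score keys follow the evaluator names in azure.ai.evaluation (verify for your SDK version)
    print(name, rel["relevance"], gnd["groundedness"])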
Keyword Search Wins for Simple Queries, Enables Agentic RAG
For the query "What does a product manager do?" (50-chunk index):
| Method | Groundedness | Relevance |
|---|---|---|
| Keyword | 4.5 | 5.0 |
| Hybrid | 4.0 | 4.5 |
| Vector | 3.0 | 3.5 |
Keyword search unexpectedly scored highest on this task, showing how automated evaluation surfaces trade-offs (e.g., pure vector search can struggle when the query closely matches the indexed phrasing). This closes the loop for Agentic RAG: reliable metrics let a self-improving agent pick the best retrieval strategy per query.
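This is also where the CI/CD framing from the top of the section pays off: the same scores can gate a pipeline like unit tests. A pytest-style sketch, where the 4.0 threshold and the run_rag_eval helper are assumptions, reusing EVAL_CASES and the evaluators defined above:
import pytest

MIN_SCORE = 4.0  # assumed quality bar; tune per use case

@pytest.mark.parametrize("question", [c.question for c in EVAL_CASES])
def test_rag_quality(question):
    sources, answer = run_rag_eval(question)  # assumed helper: retrieve top chunks, then generate an answer
    rel = relevance_eval(query=question, response=answer, context=sources)
    gnd = groundedness_eval(response=answer, context=sources)
    assert rel["relevance"] >= MIN_SCORE
    assert gnd["groundedness"] >= MIN_SCORE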