RAG Needs Automated Internal Evaluation for Optimization

RAG systems need quantitative evaluation to compare optimizations such as retrieval strategies; manual spot checks are slow and subjective. Evaluation should run like unit tests, integrated into CI/CD pipelines. Focus on internal evaluation of the retrieval and generation modules:

Retrieval metrics:

  • Relevance: Retrieved chunks match query?
  • Coverage: All relevant database chunks fetched?
  • Correctness: High signal-to-noise ratio, relevant chunks ranked at the top?

Generation metrics:

  • Relevance: Answer aligns with query, no off-topic drift?
  • Factuality: Answer grounded in retrieved sources, no hallucinations?
  • Correctness: Answer factually accurate?

Prefer LLM-as-a-judge over traditional NLP metrics (ROUGE, BLEU) for nuanced semantic judgment. Ground the evaluation in the production setup, e.g., an Azure AI Search index (rag-evaluation-chris, 50 chunks from employee handbook PDFs, vectorized in the text_vector field).

Azure SDK Evaluators Automate LLM-as-Judge Scoring

Leverage the azure.ai.evaluation package with GPT-4 (a gpt-4.1 deployment) as the judge for zero-shot scoring on a 1.0-5.0 scale. Key evaluators:

  • GroundednessEvaluator: Measures the answer's fidelity to the sources; scores drop if a fact can't be verified in the context, even if it is true externally. Input: response=answer, context=sources.
  • RelevanceEvaluator: Checks query-response alignment and contextual fit. Input: query=user_question, response=answer, context=sources.

Setup clients for Azure AI Search and OpenAI:

import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Connection settings come from environment variables (AZURE_SEARCH_*, AZURE_OPENAI_*)
AZURE_SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
AZURE_SEARCH_INDEX_NAME = os.environ["AZURE_SEARCH_INDEX_NAME"]
AZURE_SEARCH_ADMIN_KEY = os.environ["AZURE_SEARCH_ADMIN_KEY"]
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"]
AZURE_OPENAI_LLM_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_LLM_DEPLOYMENT_NAME"]

openai_client = AzureOpenAI(api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT, api_version="2024-10-21")
search_client = SearchClient(endpoint=AZURE_SEARCH_ENDPOINT, index_name=AZURE_SEARCH_INDEX_NAME, credential=AzureKeyCredential(AZURE_SEARCH_ADMIN_KEY))

def get_embedding_vector(query: str) -> list[float]:
    # Embed the query with the Azure OpenAI embedding deployment (used by vector and hybrid retrieval).
    response = openai_client.embeddings.create(model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME, input=[query])
    return response.data[0].embedding

Retrieval (top=5; a full sketch that assembles the sources string follows this list):

  • Keyword: search_client.search(search_text=user_question, top=5)
  • Vector: search_client.search(search_text=None, vector_queries=[VectorizedQuery(vector=get_embedding_vector(user_question), k_nearest_neighbors=50, fields="text_vector")], top=5)
  • Hybrid (semantic): search_client.search(search_text=user_question, vector_queries=[...], query_type="semantic", semantic_configuration_name="rag-evaluation-chris-semantic-configuration", top=5)
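The generation step below consumes a single sources string built from the retrieved chunks. A minimal sketch, assuming the index exposes title and chunk fields (the helper name and field names are assumptions; adjust to your schema):

def retrieve_sources(user_question: str, method: str = "hybrid") -> str:
    """Run one retrieval strategy and flatten the hits into a 'sources' string."""
    vector_query = VectorizedQuery(
        vector=get_embedding_vector(user_question),
        k_nearest_neighbors=50,
        fields="text_vector",
    )
    if method == "keyword":
        results = search_client.search(search_text=user_question, top=5)
    elif method == "vector":
        results = search_client.search(search_text=None, vector_queries=[vector_query], top=5)
    else:  # hybrid with semantic ranking
        results = search_client.search(
            search_text=user_question,
            vector_queries=[vector_query],
            query_type="semantic",
            semantic_configuration_name="rag-evaluation-chris-semantic-configuration",
            top=5,
        )
    # "title" and "chunk" are assumed field names; use whatever your index schema defines.
    return "\n".join(f"[{doc['title']}]: {doc['chunk']}" for doc in results)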

Generation prompt enforces grounding:

SYSTEM_MESSAGE = """Answer ONLY with facts from sources. Use [source] citations."""

# Pass the question plus the retrieved sources; the system message forbids answering beyond them.
response = openai_client.chat.completions.create(
    model=AZURE_OPENAI_LLM_DEPLOYMENT_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_question + "\nSources: " + sources},
    ],
)
answer = response.choices[0].message.content
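
To compare strategies, the same question can be answered once per retrieval method. A sketch, reusing the hypothetical retrieve_sources helper above:

# Generate one answer per retrieval strategy for the same question.
user_question = "What does a product manager do?"
answers = {}
for method in ("keyword", "vector", "hybrid"):
    sources = retrieve_sources(user_question, method=method)
    response = openai_client.chat.completions.create(
        model=AZURE_OPENAI_LLM_DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_question + "\nSources: " + sources},
        ],
    )
    # Keep the sources alongside the answer so the evaluators can check grounding later.
    answers[method] = (response.choices[0].message.content, sources)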

Evaluate:

from azure.ai.evaluation import AzureOpenAIModelConfiguration, GroundednessEvaluator, RelevanceEvaluator

# Judge-model configuration; the dict fields mirror AzureOpenAIModelConfiguration.
model_config = {"azure_endpoint": AZURE_OPENAI_ENDPOINT, "azure_deployment": AZURE_OPENAI_LLM_DEPLOYMENT_NAME, "api_key": AZURE_OPENAI_API_KEY}
relevance_eval = RelevanceEvaluator(model_config)
groundedness_eval = GroundednessEvaluator(model_config)

# Each call returns a dict with the 1.0-5.0 score and the judge's reasoning.
relevance_score = relevance_eval(query=user_question, response=answer, context=sources)
groundedness_score = groundedness_eval(response=answer, context=sources)
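
Scoring every method from the answers dict above is then a short loop; a sketch, assuming result keys named after the metrics (inspect the returned dicts in your SDK version):

# Score each retrieval strategy's answer with the same judge configuration.
for method, (answer, sources) in answers.items():
    g = groundedness_eval(response=answer, context=sources)
    r = relevance_eval(query=user_question, response=answer, context=sources)
    # "groundedness"/"relevance" keys are assumptions based on the evaluator names.
    print(f"{method}: groundedness={g.get('groundedness')}, relevance={r.get('relevance')}")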

Keyword Search Wins for Simple Queries, Enables Agentic RAG

On query "What does a product manager do?" (50-chunk index):

Method    Groundedness  Relevance
Keyword   4.5           5.0
Hybrid    4.0           4.5
Vector    3.0           3.5

Keyword search unexpectedly scored highest on this task, showing how automated evaluation surfaces trade-offs (e.g., pure vector search can struggle when the query uses the exact phrasing found in the source text). This closes the loop for Agentic RAG: reliable metrics let a self-improving agent select the best retrieval strategy.