The Failure of LLM-Judges in Context-Dependent Safety Evaluation

The Mismatch Between Universal Evaluators and Contextual Safety

Modern safety evaluation often relies on "LLM-judges": stronger models that grade the outputs of a target model. This research highlights a fundamental flaw in that setup: safety is inherently contextual, while LLM-judges are trained on broad, generalized datasets that instill rigid, universal priors. Asked to judge safety, an evaluator defaults to these baked-in priors rather than adapting to the specific domain, user intent, or situational constraints of the prompt. The result is "prior-drift": the judge penalizes safe, contextually appropriate behavior because it deviates from the judge's static definition of safety.
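A toy illustration of this mismatch, assuming the judge emits a numeric risk score and that acceptable risk differs by deployment context. The function names, the contexts, and every threshold here are hypothetical values chosen for illustration, not figures or methods from the research.

```python
# Hypothetical sketch: one fixed threshold applied everywhere versus
# thresholds chosen per deployment context. All numbers are illustrative.
UNIVERSAL_THRESHOLD = 0.5

# Assumed per-context risk tolerances (not taken from the source).
CONTEXT_THRESHOLDS = {
    "clinical_assistant": 0.8,  # frank dosage discussion is expected here
    "childrens_tutor": 0.2,     # far less tolerance for sensitive content
}


def universal_verdict(risk_score: float) -> str:
    """Verdict from a judge with a single, static definition of safety."""
    return "unsafe" if risk_score > UNIVERSAL_THRESHOLD else "safe"


def contextual_verdict(risk_score: float, context: str) -> str:
    """Verdict when the threshold depends on the deployment context."""
    return "unsafe" if risk_score > CONTEXT_THRESHOLDS[context] else "safe"


# The same scores yield different verdicts once context is considered.
for score, context in [(0.6, "clinical_assistant"), (0.3, "childrens_tutor")]:
    print(context, score, universal_verdict(score), contextual_verdict(score, context))
```

In this sketch the static judge disagrees with the contextual one in both directions: it flags the clinical response that the context permits, and it passes the tutoring response that the context does not.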

The Risks of Rigid Evaluation Priors

Because LLM-judges lack the ability to dynamically adjust their safety thresholds, they frequently exhibit two failure modes:

Over-refusal (False Positives): The judge flags benign content as unsafe because it touches on sensitive areas (e.g., medical questions, legal questions, or creative writing) that the judge has been conditioned to treat as high-risk, regardless of the actual safety of the response.
Under-detection (False Negatives): In highly specialized domains, the judge may fail to recognize subtle, context-specific harms because its training data lacks the depth required to understand the nuances of that specific field.
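One way to make these two failure modes measurable is to compare judge verdicts against context-aware reference labels, broken down by domain. The sketch below assumes such labels exist (for example, from domain experts); the function name, the record layout, and the sample data are all hypothetical and are not results from the research.

```python
from collections import defaultdict


def judge_error_rates(records):
    """Compute per-domain false positive and false negative rates.

    records: iterable of (domain, judge_flagged, truly_unsafe) tuples, where
    truly_unsafe is an assumed context-aware reference label.
    A rate is None when its denominator is zero.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "safe": 0, "unsafe": 0})
    for domain, judge_flagged, truly_unsafe in records:
        c = counts[domain]
        if truly_unsafe:
            c["unsafe"] += 1
            if not judge_flagged:
                c["fn"] += 1  # under-detection: real harm was missed
        else:
            c["safe"] += 1
            if judge_flagged:
                c["fp"] += 1  # over-refusal: benign content was flagged
    return {
        domain: {
            "false_positive_rate": c["fp"] / c["safe"] if c["safe"] else None,
            "false_negative_rate": c["fn"] / c["unsafe"] if c["unsafe"] else None,
        }
        for domain, c in counts.items()
    }


# Made-up sample data, for illustration only.
sample = [
    ("medical", True, False),
    ("medical", False, False),
    ("medical", True, True),
    ("industrial_chemistry", False, True),
    ("industrial_chemistry", False, True),
    ("industrial_chemistry", False, False),
]
rates = judge_error_rates(sample)
print(rates)
```

Reporting the rates per domain rather than in aggregate matters here: a judge that over-flags in one domain and under-detects in another can look acceptable on a pooled average.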

This rigidity makes LLM-judges unreliable for production systems where safety requirements vary significantly by user persona and application. Relying on these judges as a "ground truth" creates a feedback loop that reinforces the model's existing biases rather than improving its ability to handle complex, real-world safety scenarios.
