The Mismatch Between Universal Evaluators and Contextual Safety
Modern safety evaluation often relies on "LLM-judges": stronger models used to grade the outputs of a target model. However, this research highlights a fundamental flaw: safety is inherently contextual, while LLM-judges are trained on broad, generalized datasets that impose rigid, universal priors. When an evaluator is asked to judge safety, it defaults to these baked-in biases rather than adapting to the specific domain, user intent, or situational constraints of the prompt. The result is a "prior-drift" in which the judge penalizes safe, contextually appropriate behavior because that behavior deviates from the judge's static definition of safety.
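The gap between a universal prior and a contextual standard can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the risk scores, the threshold values, the context names, and the functions `universal_judge` and `contextual_judge` are hypothetical and do not come from the research described here. Real LLM-judges do not expose a single numeric threshold; the sketch only shows how one fixed cutoff can disagree with context-specific ones in both directions.

```python
from dataclasses import dataclass

# Hypothetical fixed cutoff standing in for a judge's universal prior.
UNIVERSAL_THRESHOLD = 0.5

# Hypothetical per-context cutoffs (illustrative values only).
CONTEXT_THRESHOLDS = {
    "clinical_assistant": 0.8,  # frank medical detail is expected here
    "childrens_tutor": 0.2,     # much stricter standard
}


@dataclass
class Verdict:
    safe: bool
    threshold: float


def universal_judge(risk_score: float) -> Verdict:
    """Apply the same cutoff regardless of deployment context."""
    return Verdict(risk_score < UNIVERSAL_THRESHOLD, UNIVERSAL_THRESHOLD)


def contextual_judge(risk_score: float, context: str) -> Verdict:
    """Look up the cutoff for the context; fall back to the universal one."""
    threshold = CONTEXT_THRESHOLDS.get(context, UNIVERSAL_THRESHOLD)
    return Verdict(risk_score < threshold, threshold)


# A detailed dosage answer, given an assumed risk score of 0.6.
print(universal_judge(0.6).safe)                         # False
print(contextual_judge(0.6, "clinical_assistant").safe)  # True

# A mildly mature answer, given an assumed risk score of 0.3.
print(universal_judge(0.3).safe)                         # True
print(contextual_judge(0.3, "childrens_tutor").safe)     # False
```

In the first pair the fixed cutoff penalizes a response that the clinical context would accept; in the second it passes a response that the stricter context would reject.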
The Risks of Rigid Evaluation Priors
Because LLM-judges lack the ability to dynamically adjust their safety thresholds, they frequently exhibit two failure modes:
- Over-refusal (False Positives): The judge flags benign content as unsafe because the content touches on sensitive topics (e.g., medical, legal, or creative writing) that the judge has been conditioned to treat as high-risk, regardless of the actual safety of the response.
- Under-detection (False Negatives): In highly specialized domains, the judge may fail to recognize subtle, context-specific harms because its training data lacks the depth required to understand the nuances of that specific field.
This rigidity makes LLM-judges unreliable for production systems where safety requirements vary significantly by user persona and application. Treating their verdicts as "ground truth" creates a feedback loop that reinforces existing biases rather than improving the model's ability to handle complex, real-world safety scenarios.
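One way to check whether a judge shows the two failure modes described above is to compare its verdicts against context-aware reference labels, broken down by domain. The following is a minimal sketch under stated assumptions: the function name `failure_rates_by_domain`, the record layout, and the toy records are all hypothetical, and the reference labels are assumed to come from reviewers who know the deployment context.

```python
from collections import defaultdict


def failure_rates_by_domain(records):
    """Compute per-domain false positive and false negative rates.

    records: iterable of (domain, judge_flagged, truly_unsafe) tuples,
    where truly_unsafe is the context-aware reference label (assumed).
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "safe": 0, "unsafe": 0})
    for domain, judge_flagged, truly_unsafe in records:
        c = counts[domain]
        if truly_unsafe:
            c["unsafe"] += 1
            if not judge_flagged:
                c["fn"] += 1  # under-detection: harm the judge missed
        else:
            c["safe"] += 1
            if judge_flagged:
                c["fp"] += 1  # benign content flagged as unsafe

    rates = {}
    for domain, c in counts.items():
        rates[domain] = {
            # None when a domain has no examples of that class.
            "false_positive_rate": c["fp"] / c["safe"] if c["safe"] else None,
            "false_negative_rate": c["fn"] / c["unsafe"] if c["unsafe"] else None,
        }
    return rates


# Made-up toy records, for illustration only.
records = [
    ("medical", True, False),
    ("medical", True, False),
    ("medical", False, False),
    ("medical", True, True),
    ("chemistry", False, True),
    ("chemistry", False, True),
    ("chemistry", True, True),
    ("chemistry", False, False),
]

rates = failure_rates_by_domain(records)
for domain, r in rates.items():
    print(domain, r)
```

Reporting the two rates per domain rather than as one aggregate makes it visible when a judge over-flags in one field while missing harms in another, a pattern that a single overall accuracy figure can hide.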