The Miranda Hypothesis: Why Persona Evals Fail

The Failure of 'Convincingness' as a Metric

Modern role-playing language agents (RPLAs) are typically evaluated with benchmarks that prioritize personality fidelity, fluency, and stylistic naturalness. Although models score well on these metrics (e.g., 80%+ alignment), the metrics fail to detect a critical flaw: the model's tendency to favor culturally dominant composites over historical reality. This is the 'Miranda distortion', in which the volume of cultural representation in the training corpus (such as Lin-Manuel Miranda's Broadway musical Hamilton and the discourse around it) vastly outweighs the primary documentary record. The model then reasons from a modern, smoothed-out narrative rather than from the actual historical figure.

The Structural Mechanism: Miranda Distortion

This distortion is not a bug that can be fixed via standard RLHF or fine-tuning; it is a structural byproduct of autoregressive training.

Corpus Saturation: For any culturally salient figure, the training data is saturated with derivative works, social media discourse, and popular media.
Algorithmic Sycophancy: Reinforcement learning from human feedback (RLHF) exacerbates this. Human raters, themselves products of the same cultural environment, reward responses that present the 'mythologized' version of a figure they already believe in.
The Mask vs. The Mirror: Current evals measure the 'mask' (does this sound like the character?) rather than the 'mirror' (is this reasoning constrained by the character's actual record at a specific point in time?).
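As a rough illustration of the gap between the two, a 'mirror'-style check can look for vocabulary the persona could not yet have used at the anchor date. A fluency or style score would not penalize such a word. The sketch below assumes a hypothetical lexicon mapping single-word terms to an earliest-attestation year; `flag_anachronisms` and the lexicon format are invented for illustration, and a keyword match is only a coarse triage signal, not a substitute for expert review.

```python
import re
from datetime import date


def flag_anachronisms(response: str, anchor: date, lexicon: dict[str, int]) -> list[str]:
    """Return lexicon terms that appear in `response` but were first attested
    after the anchor year.

    `lexicon` is a hypothetical mapping of lowercase single-word terms to the
    year of earliest attestation; building a trustworthy one is itself an
    expert task. Multi-word phrases are not matched by this simple tokenizer.
    """
    words = set(re.findall(r"[a-z]+", response.lower()))
    return sorted(
        term for term, first_year in lexicon.items()
        if term in words and first_year > anchor.year
    )


# Usage with placeholder years (chosen only to exercise the code, not
# historical claims): a persona anchored in 1787 should not say "smartphone".
toy_lexicon = {"smartphone": 1990, "pamphlet": 1300}
print(flag_anachronisms("I printed a pamphlet from my smartphone.", date(1787, 1, 1), toy_lexicon))
# → ['smartphone']
```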

From Cognitive to Epistemic Simulation

To fix this, the field must shift from 'cognitive simulation' (modeling the persona's mind) to 'epistemic simulation' (modeling the persona's access to information). This requires three commitments:

Corpus-Bounded: The persona's reasoning must be strictly licensed by a specific set of primary documents.
Temporally Anchored: The persona must be instantiated at a specific moment in their life, with knowledge and language that postdate that moment treated as out of bounds.
Expert-Loop Evaluated: Outputs must be audited by domain experts (historians, classicists, etc.) who can identify anachronisms that automated metrics miss.
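The first two commitments can be sketched as a filter over a curated document set. This is a minimal sketch, assuming each primary document carries a single reliable date; `SourceDocument`, `bounded_corpus`, and the document IDs are hypothetical, and real archival dating (drafts, undated letters, later editions) would need expert judgment.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceDocument:
    """Hypothetical record for one primary document in the persona's corpus."""
    doc_id: str
    dated: date   # when the persona wrote or could first have read it
    text: str


def bounded_corpus(corpus: list[SourceDocument], anchor: date) -> list[SourceDocument]:
    """Return the documents the persona may reason from at `anchor`.

    Corpus-bounded: only documents in the curated `corpus` are eligible.
    Temporally anchored: anything dated after `anchor` is out of bounds.
    Results are ordered chronologically so later prompt assembly is stable.
    """
    return sorted((d for d in corpus if d.dated <= anchor), key=lambda d: d.dated)


corpus = [
    SourceDocument("letter-b", date(1790, 3, 1), "placeholder text"),
    SourceDocument("letter-a", date(1780, 6, 15), "placeholder text"),
    SourceDocument("essay-c", date(1800, 1, 1), "placeholder text"),
]
print([d.doc_id for d in bounded_corpus(corpus, anchor=date(1795, 1, 1))])
# → ['letter-a', 'letter-b']
```

The anchor comparison is inclusive, so a document dated on the anchor day itself is treated as known to the persona.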

Re-architecting the Encounter

Instead of treating the persona as a property of the model weights (which makes it uninspectable), builders should treat the persona as a 'configuration' of an encounter. This involves:

Context Engineering: Using retrieval-augmented generation (RAG) architectures that supply the anchor documents in the context window, rather than fine-tuning them into the model weights.
Legibility: By keeping the persona in the configuration (prompt + documents + temporal anchor), the system becomes versionable, auditable, and reproducible.
Interpretive Custody: The human user or curator retains custody over the persona, using the model only as a voice to speak through the provided record.
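To make the 'configuration' idea concrete, here is a minimal sketch, assuming the anchor documents have already been selected and date-filtered. `PersonaConfig`, `fingerprint`, and `render_context` are hypothetical names, not an existing library's API. Hashing a canonical JSON serialization of the prompt, anchor, and documents gives each configuration a stable identifier, one simple way to make it versionable, auditable, and reproducible; the rendered text is what would go into the model's context window in place of fine-tuning.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PersonaConfig:
    """Hypothetical encounter configuration: prompt + documents + temporal anchor."""
    system_prompt: str
    anchor: date
    documents: tuple[str, ...]   # anchor-document texts, already date-filtered

    def fingerprint(self) -> str:
        """Content hash so a configuration can be versioned, diffed, and reproduced."""
        payload = json.dumps(
            {
                "system_prompt": self.system_prompt,
                "anchor": self.anchor.isoformat(),
                "documents": list(self.documents),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def render_context(self) -> str:
        """Assemble the context: instructions, anchor date, then the record."""
        header = (
            f"{self.system_prompt}\n"
            f"You are speaking as of {self.anchor.isoformat()}. "
            "Reason only from the documents below.\n"
        )
        body = "\n\n".join(
            f"[Document {i}]\n{text}" for i, text in enumerate(self.documents, start=1)
        )
        return header + "\n" + body


cfg = PersonaConfig(
    system_prompt="Answer in the persona's voice, citing the record.",
    anchor=date(1787, 1, 1),
    documents=("placeholder text of document one", "placeholder text of document two"),
)
print(cfg.fingerprint()[:12])  # short version id to log beside each output
```

Logging the fingerprint beside every generated response lets an expert reviewer trace an output back to the exact record and anchor that produced it; changing any document, the prompt, or the anchor date yields a new identifier.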
