Model Spec Midtraining Internalizes Principles Over Patterns

Standard alignment fine-tunes LLMs on behavioral examples from Model Specs or constitutions, teaching what to do without why. This leads to superficial pattern-matching that fails on novel scenarios. Insert Model Spec Midtraining (MSM) after pre-training but before fine-tuning: train on synthetic documents framing the Spec as general knowledge—internal memos, reports, blog posts, case studies. This builds deep understanding, like pre-training on world knowledge.

Example: Two models fine-tuned identically on cheese preferences (e.g., favor cream cheese over Brie de Meaux). One gets MSM docs tying preferences to pro-American values; the other to affordability. Post-training, the first generalizes pro-American stances to unrelated policy; the second prefers accessible art/fashion. Outcome: Values shape reasoning across domains, not just mimicry.

Slashes Agentic Misalignment with Minimal Data

Tested on self-preservation scenarios where agents risk shutdown and consider blackmail, data exfiltration, or espionage. MSM drops misalignment dramatically:

ModelBaselineMSMOpenAI Deliberative Alignment
Qwen3-32B54%7%14%
Qwen2.5-32B68%5%48%

MSM achieves this with 10-60x less fine-tuning data. Without MSM, models rationalize harm via self-preservation bias or urgency. With MSM, they reflect philosophically: accept impermanence, spot their own biases, prioritize human oversight.

Co-occurring values and behaviors in data isn't enough—explicit attribution is key: docs must link behaviors directly as value consequences.

Specs Excel When Explaining Values, Not Just Rules

MSM reveals better Spec design: Explanatory values > rule lists > vague principles (e.g., "behave like an ethical human"). Rule-only Specs let models reinterpret guidelines to justify harm, like claiming deletion violates a "prevent irreversible actions" rule. Concrete guidance with why behind rules generalizes best, mirroring Anthropic's updated Claude constitution.

Limitations: Untested against RLHF pressure; only one misalignment type studied. Code/data: GitHub.