The Challenge of Linear Refusal Representations

Recent research on LLM safety often assumes that 'refusal', the model's tendency to decline certain prompts, can be isolated as a single linear direction in the model's activation space. This paper tests that assumption by comparing two methods for identifying such directions: Difference-in-Means (Diff-in-Means) and Iterative Null-space Projection (INLP).

Diff-in-Means vs. INLP: Methodological Trade-offs

  • Diff-in-Means: This approach computes the difference between the mean activations on a set of refusal prompts and the mean activations on a set of benign prompts. It is computationally cheap and simple, but it assumes that refusal behavior is concentrated in a single dominant direction. The study suggests that this may be too simple to capture the multi-dimensional nature of model refusals.
  • INLP (Iterative Null-space Projection): This technique iteratively identifies and removes linear components that correlate with a target behavior (here, refusal). Each iteration projects the activations onto the null space of the refusal direction just identified, so the next iteration has to find a new direction in what remains. In this way INLP aims to 'scrub' the behavior from the model more thoroughly than subtracting a single vector. The authors evaluate whether this iterative process neutralizes refusal more effectively without degrading general model performance.
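
The two procedures above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, activations are assumed to be plain arrays of shape (n_prompts, hidden_dim), and the INLP loop uses a least-squares linear probe as a stand-in for whatever classifier the study actually trains in each round.

```python
import numpy as np


def diff_in_means_direction(refusal_acts, benign_acts):
    """Unit vector pointing from the benign mean to the refusal mean."""
    direction = refusal_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def ablate_direction(acts, unit_dir):
    """Remove each row's component along a unit-norm direction."""
    return acts - np.outer(acts @ unit_dir, unit_dir)


def fit_linear_probe(acts, labels):
    """Least-squares probe (an assumption; a stand-in for a trained classifier).

    Centering the inputs and labels plays the role of an intercept term.
    Returns the unnormalized minimum-norm weight vector.
    """
    X = acts - acts.mean(axis=0)
    y = labels - labels.mean()
    # rcond discards near-zero singular values left behind by earlier projections.
    w, *_ = np.linalg.lstsq(X, y, rcond=1e-8)
    return w


def inlp(acts, labels, n_iters=3):
    """Simplified INLP-style loop: fit a probe, project its direction out, repeat.

    Returns the removed unit directions (one per row) and the final projection
    matrix P, so that `acts @ P` has all of those directions removed.
    """
    hidden_dim = acts.shape[1]
    P = np.eye(hidden_dim)
    directions = []
    for _ in range(n_iters):
        w = fit_linear_probe(acts @ P, labels)
        norm = np.linalg.norm(w)
        if norm < 1e-12:  # no linear signal left to remove
            break
        w = w / norm
        directions.append(w)
        # Compose with the projection onto the null space of the new direction.
        P = P @ (np.eye(hidden_dim) - np.outer(w, w))
    return np.array(directions), P
```

In this sketch, Diff-in-Means yields exactly one direction, while the INLP loop returns up to `n_iters` mutually orthogonal directions, because each probe is fit only on what the previous projections left behind. How many iterations are needed, and which layer's activations to use, are choices the sketch leaves open.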

Preliminary Findings on Refusal Complexity

The core argument is that refusal is likely not a single direction but a multi-faceted phenomenon. The study notes that while both methods are useful for interpretability, they often fail to account for the non-linear ways in which models encode safety constraints. The authors suggest that intervening on a single linear direction is insufficient for robust safety alignment, because models may have multiple, redundant pathways for triggering a refusal response.
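
A toy construction can make the redundancy argument concrete. The example below is purely illustrative and is not taken from the study: the 2-D "activations" are hand-picked, hypothetical values in which refusals are triggered along two different axes. Removing the single diff-in-means direction equalizes the class means, yet each refusal point still carries a residual that distinguishes it from the benign points.

```python
import numpy as np

# Hypothetical 2-D activations: benign prompts sit at the origin, while
# refusals are triggered along two different axes (two redundant pathways).
benign = np.array([[0.0, 0.0], [0.0, 0.0]])
refusal = np.array([[2.0, 0.0], [0.0, 2.0]])

# Single diff-in-means direction, normalized to unit length.
direction = refusal.mean(axis=0) - benign.mean(axis=0)
direction = direction / np.linalg.norm(direction)


def ablate(acts, unit_dir):
    """Remove each row's component along a unit-norm direction."""
    return acts - np.outer(acts @ unit_dir, unit_dir)


refusal_abl = ablate(refusal, direction)
benign_abl = ablate(benign, direction)

# The gap between class means vanishes after ablation...
mean_gap = np.linalg.norm(refusal_abl.mean(axis=0) - benign_abl.mean(axis=0))

# ...but every refusal point keeps a nonzero residual, unlike the benign points.
residual_norms = np.linalg.norm(refusal_abl, axis=1)

print("mean gap after ablation:", mean_gap)
print("refusal residual norms:", residual_norms)
print("benign residual norms:", np.linalg.norm(benign_abl, axis=1))
```

In this toy case the two ablated refusal points land on opposite sides of the benign points, so no single linear threshold separates the classes, while a non-linear readout such as the residual norm still does. Whether real models behave this way is exactly the empirical question the study raises.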