Decomposing Steering Vectors

Activation steering, the practice of adding a vector to a model's internal activations to influence its output, is often treated as a black-box additive process. This research offers a geometric framework for understanding these interventions by decomposing the steering vector into two components: its angle (direction) and its norm (magnitude). By analyzing how each component interacts with the model's existing hidden states, the authors show that steering is not a monolithic operation: it both re-orients the representation vector and changes its magnitude.
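The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `decompose_steering`, the scaling coefficient `alpha`, and the choice to report the angle between the original and steered hidden state are all hypothetical, and both vectors are assumed to be nonzero.

```python
import numpy as np

def decompose_steering(h, v, alpha=1.0):
    """Describe the update h -> h + alpha * v by an angle and a norm ratio.

    h     : hidden state (1-D array, assumed nonzero)
    v     : steering vector (1-D array, assumed nonzero)
    alpha : hypothetical steering strength, not taken from the source
    """
    h = np.asarray(h, dtype=float)
    v = np.asarray(v, dtype=float)
    steered = h + alpha * v

    norm_before = np.linalg.norm(h)
    norm_after = np.linalg.norm(steered)

    # Angular component: how far the representation was rotated.
    cos = np.dot(h, steered) / (norm_before * norm_after)
    angle = np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards rounding error

    # Norm component: how much the representation was rescaled.
    norm_ratio = norm_after / norm_before
    return angle, norm_ratio

# Toy example: a unit hidden state pushed along an orthogonal direction.
angle, norm_ratio = decompose_steering([1.0, 0.0], [0.0, 1.0])
print(f"angle = {np.degrees(angle):.1f} degrees, norm ratio = {norm_ratio:.3f}")
```

In this toy case the same additive update both rotates the hidden state by 45 degrees and stretches it by a factor of sqrt(2), which is the sense in which a single steering step combines the two components.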

The Dominance of Angular Shifts

The study finds that the efficacy of most activation steering techniques is driven primarily by angular changes. When a steering vector is applied, the model's internal representation is pushed toward a new direction in the high-dimensional activation space. Norm adjustments (scaling) also occur, but the authors argue that they are often secondary to the directional shift. This geometric view lets practitioners predict the impact of a steering vector more accurately by measuring the shift in cosine similarity it induces, rather than relying solely on the magnitude of the added vector. It also gives a clearer basis for optimizing steering vectors to achieve specific behavioral changes while minimizing unintended side effects on model performance.
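To make the contrast between magnitude and cosine similarity concrete, the sketch below compares two steering vectors of equal norm applied to the same batch of hidden states. It is an illustrative assumption rather than the paper's measurement protocol: the helper `mean_cosine_after_steering` and the toy data are hypothetical, and the "cosine similarity shift" is taken here to mean the cosine between each hidden state and its steered version.

```python
import numpy as np

def mean_cosine_after_steering(hidden, v):
    """Mean cosine similarity between each hidden state and its steered copy.

    hidden : array of shape (n, d), one nonzero hidden state per row
    v      : steering vector of shape (d,), added to every row
    A value of 1.0 means the steering vector caused no angular change.
    """
    hidden = np.asarray(hidden, dtype=float)
    steered = hidden + np.asarray(v, dtype=float)  # broadcasts over rows
    dots = np.sum(hidden * steered, axis=1)
    norms = np.linalg.norm(hidden, axis=1) * np.linalg.norm(steered, axis=1)
    return float(np.mean(dots / norms))

# Toy hidden states that all point along the first axis.
hidden = np.array([[3.0, 0.0], [4.0, 0.0]])

# Two steering vectors with the same norm but different directions.
parallel = np.array([1.0, 0.0])    # aligned with the hidden states
orthogonal = np.array([0.0, 1.0])  # perpendicular to them

cos_parallel = mean_cosine_after_steering(hidden, parallel)
cos_orthogonal = mean_cosine_after_steering(hidden, orthogonal)
print(cos_parallel, cos_orthogonal)
```

Both vectors have norm 1, yet the parallel one only rescales the hidden states (cosine stays at 1), while the orthogonal one rotates them. The norm of the added vector alone cannot distinguish these two cases, which is why an angular measure is more informative here.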