Steering LLM Behavior with Contrastive Neuron Attribution

Identifying Behavior-Specific Circuits

Contrastive Neuron Attribution (CNA) is a method for identifying the specific MLP neurons responsible for a model behavior, such as refusing harmful requests. Unlike Contrastive Activation Addition (CAA), which steers by adding a vector to an entire layer's activations, CNA operates at the level of individual neurons. It computes each neuron's mean activation difference between a positive and a negative prompt set, then keeps the top 0.1% of neurons by that difference as the candidate circuit for the behavior.
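The attribution step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, activations are assumed to have been collected already as one vector per prompt, and ranking by absolute difference is an assumption (the source only says "top 0.1%").

```python
from statistics import fmean


def contrastive_scores(pos_acts, neg_acts):
    """Per-neuron mean activation difference between two prompt sets.

    pos_acts / neg_acts: lists of per-prompt activation vectors,
    each vector holding one float per neuron (hypothetical layout).
    """
    n_neurons = len(pos_acts[0])
    return [
        fmean(row[j] for row in pos_acts) - fmean(row[j] for row in neg_acts)
        for j in range(n_neurons)
    ]


def top_fraction(scores, fraction=0.001):
    """Indices of the highest-scoring neurons (0.001 == top 0.1%).

    Assumption: neurons are ranked by absolute difference.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return ranked[:k]
```

With real models the activation vectors would have many thousands of entries per layer, so a tensor library would be the practical choice; the list-based version is only meant to make the arithmetic explicit.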

To ensure the identified circuit is behavior-specific, the method includes a filtering step that removes "universal" neurons: those that land in the top 0.1% of activations on more than 80% of a diverse set of prompts. This prevents the accidental ablation of general-purpose neurons, which would otherwise degrade model performance.
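A rough sketch of such a filter follows. The names `universal_neurons` and `filter_circuit` are hypothetical, and ranking neurons by raw activation value within each prompt is an assumption about how "top 0.1%" is measured.

```python
def universal_neurons(diverse_acts, top_fraction=0.001, prompt_threshold=0.8):
    """Neurons that rank in the top fraction on more than
    `prompt_threshold` of the prompts.

    diverse_acts: list of per-prompt activation vectors (hypothetical layout).
    """
    n_prompts = len(diverse_acts)
    n_neurons = len(diverse_acts[0])
    k = max(1, round(n_neurons * top_fraction))
    hits = [0] * n_neurons
    for row in diverse_acts:
        # Assumption: rank by raw activation value for this prompt.
        ranked = sorted(range(n_neurons), key=lambda j: row[j], reverse=True)
        for j in ranked[:k]:
            hits[j] += 1
    return {j for j in range(n_neurons) if hits[j] / n_prompts > prompt_threshold}


def filter_circuit(candidates, universal):
    """Drop universal neurons from a candidate circuit, keeping order."""
    return [j for j in candidates if j not in universal]
```

The strict `>` comparison mirrors the "more than 80%" wording: a neuron that is in the top set on exactly 80% of prompts is kept.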

Causal Steering Without Training

CNA is computationally efficient, requiring only forward passes through the model to identify and verify circuits. Once identified, the circuit's influence can be tested by applying a scalar multiplier to the activations of the target neurons during inference. Setting the multiplier to 0 effectively ablates the behavior, while values greater than 1 amplify it.
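The intervention itself amounts to an elementwise rescaling of the selected neurons. The sketch below shows it on a plain list of activations; `steer` is a hypothetical helper, and in a real model the same operation would presumably be applied to the MLP activations during the forward pass (for example via a forward hook), which is not shown here.

```python
def steer(activations, circuit, multiplier):
    """Scale the activations of the circuit neurons by `multiplier`.

    multiplier == 0 ablates the circuit; multiplier > 1 amplifies it.
    Neurons outside the circuit are left unchanged.
    """
    target = set(circuit)
    return [a * multiplier if j in target else a for j, a in enumerate(activations)]


acts = [1.0, 2.0, 3.0]
ablated = steer(acts, circuit=[1], multiplier=0.0)    # neuron 1 zeroed out
amplified = steer(acts, circuit=[1], multiplier=2.0)  # neuron 1 doubled
```

Because only a scalar changes between runs, the same circuit can be swept over a range of multipliers with nothing more than repeated forward passes.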

Experimental results on Llama 3.1/3.2 and Qwen 2.5 models show that ablating these circuits can reduce refusal rates by over 50% while keeping an output-quality score, based on n-gram repetition, above 0.97. CNA also preserves general capabilities: MMLU accuracy stays within one percentage point of the baseline, whereas CAA often degrades performance at high steering strengths.

The Role of Fine-Tuning in Model Structure

The findings indicate that the late-layer structure responsible for discriminating between prompt types already exists in base models, before any fine-tuning occurs. Alignment fine-tuning does not create this structure; rather, it changes the function of the neurons within it.

Comparing base and instruct models shows that only 8–29% of the identified neurons overlap between the two, suggesting that fine-tuning "rewires" individual neurons within a layer-level structure that is already in place. This separation between layer-level structure and neuron-level function allows for precise steering without expensive auxiliary training, such as fitting Sparse Autoencoders (SAEs).
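One way to compute such an overlap figure is sketched below. The source does not say how overlap is defined, so the choice here (the share of instruct-circuit neurons that also appear in the base circuit) is an assumption, as are the function name and the `(layer, neuron_index)` representation.

```python
def circuit_overlap(base_circuit, instruct_circuit):
    """Fraction of instruct-circuit neurons also present in the base circuit.

    Each circuit is an iterable of (layer, neuron_index) pairs.
    Assumption: overlap is measured relative to the instruct circuit.
    """
    base, instruct = set(base_circuit), set(instruct_circuit)
    if not instruct:
        return 0.0
    return len(base & instruct) / len(instruct)


base = [(30, 1), (30, 2), (31, 5), (31, 7)]
instruct = [(30, 1), (31, 9), (29, 3), (31, 7)]
overlap = circuit_overlap(base, instruct)  # 2 shared neurons out of 4
```

In this toy example most neurons sit in the same late layers in both circuits while only half of the individual neurons coincide, which is the kind of layer-level agreement with neuron-level disagreement the comparison describes.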
