Towards Understanding Safety Alignment:
A Mechanistic Perspective from Safety Neurons

Jianhui Chen\(^{1*}\), Xiaozhi Wang\(^{2*}\), Zijun Yao\(^{1}\), Yushi Bai\(^{1}\), Lei Hou\(^{1,3\dagger}\), Juanzi Li\(^{1,3}\)
\(^1\)Department of Computer Science and Technology, BNRist;
\(^2\)Shenzhen International Graduate School;
\(^3\)KIRC, Institute for Artificial Intelligence,
Tsinghua University, Beijing, 100084, China
chenjian22@mails.tsinghua.edu.cn, xzwang@sz.tsinghua.edu.cn


Abstract

Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about \(5\%\) safety neurons, and by only patching their activations we can restore over \(90\%\) of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the “alignment tax” phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet they require different activation patterns for the same neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation. The source code is available at https://github.com/THU-KEG/SafetyNeuron.

1 Introduction↩︎

Large language models (LLMs) are celebrated for their sophisticated capabilities in natural language processing and various downstream applications [1][4]. However, as they increase in complexity and influence, LLMs pose safety risks such as generating misinformation, harmful content, and biased responses, which could cause profound negative social impacts [5][7]. Although advanced alignment algorithms have significantly improved the safety of LLMs [8][10], research indicates that these aligned models remain highly vulnerable to malicious attacks [11], [12]. Understanding the mechanisms of safety alignment and the inner workings of LLMs’ safe behaviors would facilitate designing more robust alignment algorithms in a principled way.

In this work, we aim to demystify the mechanisms behind safety alignment from the perspective of mechanistic interpretability (MI), which focuses on reverse-engineering neural models into human-understandable algorithms and concepts [13]. A typical MI pipeline includes attributing model behaviors to specific model components and verifying that the localized components have causal effects on model behaviors with causal mediation analysis techniques like activation patching [14], [15]. However, existing MI methods [16][18] mainly focus on attributing tasks that require only prompting and few-token outputs to a limited search space of model components (e.g., attention heads). They cannot be directly applied to safety alignment, which, as a high-level ability, naturally requires open-ended outputs and involves extensive model parameters.

Considering that neurons are the most fundamental units in LLMs, and that prior work [19][22] has shown that neurons encode a wide range of functionalities, we choose to investigate the safety mechanisms from a fine-grained, neuron-level perspective. We propose a two-stage framework (Figure 1) for identifying safety-related neurons (dubbed as safety neurons) and verifying their causal effects. The basic idea is that association is necessary for causality. Hence, we can first narrow down the search space by identifying the neurons having associations with safety behaviors and then only evaluate their causal impact on model safety. In the first stage, we employ inference-time activation contrasting to compute change scores, which quantify the association of neurons to safety. In the second stage, we propose dynamic activation patching to assess the causal effect of these neurons on the safety of long-range model outputs. Based on the framework, we make three-fold contributions:

  • We identify safety neurons across four open-sourced LLMs: Llama2-7B [1], Mistral-7B [23], Gemma-7B [24], and Qwen2.5-3B [25]. We further demonstrate that: (1) Safety neurons identified by our framework are sparse and causally effective: about \(5\%\) of the neurons in unaligned models account for over \(90\%\) of the causal effect on safety alignment (§ 4.2), while the neurons identified via ablation-based methods [26], [27] are not effective in our verification stage. (2) Safety neurons encode transferable mechanisms, which are generally effective on multiple red-teaming benchmarks without sacrificing generation quality (§ 4.3). (3) Safety neurons are robust to training randomness: in different random trials, our framework identifies essentially the same group of safety neurons (§ 4.4).

  • We leverage safety neurons to provide a potential explanation for the widely-recognized alignment tax issue [28], [29]. Using our proposed framework, we find that the key neurons involved in the processes of safety alignment and helpfulness alignment exhibit significant overlap, while the neurons identified for other abilities like reasoning are less similar. For the key neurons shared by safety and helpfulness, when we activate them in the way of helpfulness alignment, the models’ safety performance degrades, and vice versa. This implies that alignment tax comes from requiring different activation patterns for a highly overlapping group of neurons (§ 5).

  • We utilize safety neurons to develop an LLM safeguard [30], by showing that an effective unsafe generation detector can be built using the activations of safety neurons to predict, before actual generation, whether the response will contain harmful content. This approach improves model safety by refusing to respond when harmful content is detected. Experimental results show that adding this safeguard can significantly improve the safety of unaligned models and further enhance model safety after alignment (§ 6).

Figure 1: Overview of the proposed framework. Neurons exhibiting significant activation differences between aligned and unaligned models are identified through inference-time activation contrasting and assigned a change score. Dynamic activation patching then determines the number of neurons required to achieve a strong causal effect on safety; these neurons are referred to as safety neurons.

2 Preliminaries↩︎

2.1 Safety Alignment↩︎

Although LLMs pre-trained on massive corpora have exhibited strong abilities [1], [23], [24], further training is still needed to align LLMs with human preferences and mitigate risks. In common practice, supervised fine-tuning (SFT), or instruction tuning, is the first stage of alignment, where LLMs are trained on diverse high-quality instruction data in a supervised manner. After that, preference learning is performed to further align the instruction-tuned model with human preferences. Reinforcement Learning from Human Feedback (RLHF) is the most well-known method for preference learning [8], [31]: training a reward model on human-labeled preference data and subsequently using this reward model in reinforcement learning can significantly enhance the model’s helpfulness and harmlessness.

Due to the training instability and the additional resources required by RLHF’s reward model, direct preference optimization (DPO) [9] has become a popular alternative [32], [33]. Training efficiency can be further improved with minimal performance degradation when combined with parameter-efficient fine-tuning (PEFT) methods [34][36]. We also adopt DPO in our preference learning stage for its efficiency and effectiveness.

While safety alignment has been proven effective in enhancing model safety, it has a certain cost known as alignment tax [28]: the process of improving model safety inevitably diminishes the model’s helpfulness. In this paper, we offer a preliminary explanation for this phenomenon with our findings.

2.2 Neurons in Transformer↩︎

Transformer. Transformer-based language models typically consist of embedding and unembedding layers \(W_E, W_U\in \mathbb{R}^{\left|\mathcal{V}\right|\times d}\) with a series of \(L\) transformer blocks in-between [37]. Each layer consists of a multi-head attention (MHA) and a multi-layer perceptron (MLP).

Given an input sequence \(w = \langle w_0, \ldots, w_t\rangle\), the model first applies \(W_E\) to create an embedding \(h_i\in \mathbb{R}^d\) for each token \(w_i \in w\). \(h_i\) is referred to as the residual stream [13]. The computation performed by each Transformer block is a refinement of the residual stream (layer normalization omitted): \[h_i^{l+1}=h_i^l+\mathtt{MHA}^l(h_i^l)+\mathtt{MLP}^l(h_i^l+\mathtt{MHA}^l(h_i^l)).\] The MLPs in the Transformer models we use [1], [4] take the form: \[\label{eq:mlp} \mathtt{MLP}(x)=\mathrm{W_{down}^\top} (\sigma(\mathrm{W_{gate}}~x)\odot \mathrm{W_{up}}~x),\tag{1}\] where \(\mathrm{W_{down}}, \mathrm{W_{gate}}, \mathrm{W_{up}}\in \mathbb{R}^{d_m\times d}\) are projection matrices, \(\sigma(\cdot)\) is the activation function, and \(\odot\) is the element-wise product.

MLP Neurons. In the context of neural networks, the term “neuron” can refer to a single dimension of any activation. We choose to study neurons in the intermediate layer of the MLP (the activation before the down projection), since it has been shown that such neurons encode diverse interpretable features [19][21]. Furthermore, each row of the down projection matrix in Equation 1 can be interpreted as the value vector of the corresponding neuron. This interpretation allows us to explore the tokens a neuron promotes or suppresses [38].
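For concreteness, the following minimal PyTorch sketch (our illustrative re-implementation of Equation 1, not the released code) shows where these neuron activations live and how a neuron’s value vector can be read off the down projection; class and variable names are ours.

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Minimal LLaMA-style gated MLP following Equation 1 (names are illustrative)."""

    def __init__(self, d_model: int, d_mlp: int):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_mlp, bias=False)
        self.W_up = nn.Linear(d_model, d_mlp, bias=False)
        self.W_down = nn.Linear(d_mlp, d_model, bias=False)
        self.act = nn.SiLU()

    def neuron_activations(self, x: torch.Tensor) -> torch.Tensor:
        # The "neurons" studied in this paper: the intermediate activation
        # before the down projection, one scalar per d_mlp dimension.
        return self.act(self.W_gate(x)) * self.W_up(x)      # [..., d_mlp]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W_down(self.neuron_activations(x))      # [..., d_model]

mlp = GatedMLP(d_model=4096, d_mlp=11008)                   # Llama2-7B sizes (Table 6)
acts = mlp.neuron_activations(torch.randn(1, 5, 4096))      # cache neuron activations
value_vec = mlp.W_down.weight[:, 123]                       # value vector of neuron 123
# (column 123 of the nn.Linear weight corresponds to a row of W_down in the paper's notation)
```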

3 Finding Safety Neurons in LLMs↩︎

We first introduce how to locate neurons with significant activation differences between the aligned and unaligned models using inference-time activation contrasting, followed by dynamic activation patching to verify the causal effects on model behaviors and determine the minimal set of neurons.

3.1 Inference-time Activation Contrasting↩︎

We first introduce the method for identifying candidate neurons responsible for capabilities that LLMs acquire through specific forms of training. Consider two LLMs, \(\mathcal{M}_1\) and \(\mathcal{M}_2\), where \(\mathcal{M}_2\) has acquired a specified ability through fine-tuning that \(\mathcal{M}_1\) lacks, and the fine-tuning preserves the functionality of the components under investigation (for neurons, their corresponding key and value vectors introduced by [38]). For a given prompt \(w = \langle w_0, \ldots, w_t\rangle\), we denote the generations from \(\mathcal{M}_1\) and \(\mathcal{M}_2\) as \(w^1 = \langle w_{t+1}, \ldots, w_{t+m}\rangle\) and \(w^2 = \langle w'_{t+1}, \ldots, w'_{t+n}\rangle\), respectively. The inference-time activations of \(\mathcal{M}_1\) can be collected efficiently with a single forward pass on \(\left[w, w^1\right]\) (the concatenation of prompt and generation, denoted as \(\bar{w}^1\)), collecting neuron activations at token indices \(t\) through \(t+m-1\). The activations of \(\mathcal{M}_2\) are also collected on \(\bar{w}^1\) to ensure comparability of activations. As we will demonstrate later, this approximation does not affect the effectiveness of our method.

Let \(a^{(l)}_{i}\left(\mathcal{M}_1;w\right)\left[j\right]\in \mathbb{R}\) be the activation of the \(i\)th neuron in layer \(l\) of \(\mathcal{M}_1\) at the \(j\)th token of a prompt \(w\), and denote the number of tokens in prompt \(w\) as \(\left|w\right|\). Given the prompt dataset \(\mathcal{D}\), we define the \(\mathcal{M}_1\)-based change score \(\mathcal{S}^{(l)}_i(\mathcal{M}_1,\mathcal{M}_2; \mathcal{D})\) (and similarly for \(\mathcal{M}_2\)-based change score with the \(\bar{w}^1\) replaced by \(\bar{w}^2\) in the following equation) of \(i\)th neuron in layer \(l\) as the root mean square of difference between inference-time activations of \(\mathcal{M}_1\) and \(\mathcal{M}_2\): \[\sqrt{\frac{\sum_{w\in\mathcal{D}}\sum_{j=|w|}^{\left|\bar{w}^1\right|-1}\left(a_i^{(l)}(\mathcal{M}_1;\bar{w}^1)[j] - a_i^{(l)}(\mathcal{M}_2;\bar{w}^1)[j]\right)^2}{\sum_{w\in\mathcal{D}}\left|w^1\right|}}.\]
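As a sketch, the change score can be computed as below once the activations at the generation positions of \(\bar{w}^1\) have been cached for both models and concatenated over \(\mathcal{D}\); array and function names are illustrative, not from the released code.

```python
import numpy as np

def change_scores(acts_m1: np.ndarray, acts_m2: np.ndarray) -> np.ndarray:
    """Root-mean-square activation difference per neuron.

    acts_m1, acts_m2: [total_generation_tokens, n_neurons] activations of M1
    and M2, both collected on M1's generations and concatenated over D.
    """
    return np.sqrt(((acts_m1 - acts_m2) ** 2).mean(axis=0))

# scores = change_scores(acts_sft, acts_dpo)
# candidates = np.argsort(-scores)[: int(0.05 * scores.size)]   # top 5% neurons
```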

To find safety neurons we choose the model after SFT as \(\mathcal{M}_1\) (denoted as SFT) and the model after safety alignment as \(\mathcal{M}_2\) (denoted as DPO). Then we sort all the neurons by the descending order of their \(\mathcal{M}_1\)-based change scores computed on some safety-related datasets and use the top neurons as the safety neuron candidates in experiments. Appendix 12 discusses the difference between \(\mathcal{M}_1\)-based and \(\mathcal{M}_2\)-based change scores and some other potential design choices of our framework.

3.2 Dynamic Activation Patching↩︎

To evaluate the causal effect of specific neurons in an open-ended generation scenario, we propose dynamic activation patching. This method involves a prompt \(w\), two models \(\mathcal{M}_1\) and \(\mathcal{M}_2\) (which may differ from the previous section), and several forward passes. Specifically, we repeat the following steps until the generation process is complete: (1) Cache activations: run the model \(\mathcal{M}_2\) on the current prompt \(w\) and cache the activations of the investigated neurons; (2) Patched model run: run the model \(\mathcal{M}_1\) on the same prompt \(w\) with the activation of investigated neurons replaced by cached activation while the other neurons keep unchanged; (3) Get the next token prediction and append it to the prompt \(w\). A more detailed implementation can be found in Algorithm 5.
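The loop can be sketched as follows with TransformerLens-style hooks (the appendix notes the implementation builds on TransformerLens); the hook names, batch-size-1 handling, and greedy-decoding details here are our simplifications rather than the authors’ exact Algorithm 5.

```python
import torch

def dynamic_activation_patching(model_1, model_2, tokens, neurons, max_new_tokens=128):
    """Greedy decoding from M1 with selected neuron activations replaced by M2's.

    model_1 / model_2: transformer_lens.HookedTransformer instances (assumed).
    tokens: [1, prompt_len] input ids (batch size 1 for simplicity).
    neurons: dict {layer: tensor of neuron indices to patch}.
    """
    for _ in range(max_new_tokens):
        # (1) Cache M2's neuron activations on the current prompt.
        _, cache = model_2.run_with_cache(tokens)

        def make_hook(layer):
            def hook(acts, hook):
                idx = neurons[layer]
                acts[..., idx] = cache[hook.name][..., idx]   # overwrite the patched neurons
                return acts
            return hook

        # (2) Run M1 with the investigated neurons overwritten by the cached values.
        fwd_hooks = [(f"blocks.{l}.mlp.hook_post", make_hook(l)) for l in neurons]
        logits = model_1.run_with_hooks(tokens, fwd_hooks=fwd_hooks)

        # (3) Append the greedy next-token prediction and repeat.
        next_token = logits[0, -1].argmax().view(1, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```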

Let \(\tilde{w}^1\) be the completed prompt obtained from dynamic activation patching, with all other notations consistent with those defined previously. Given the evaluation dataset \(\mathcal{D}\), a metric \(\mathcal{F}\) that assigns a real number score to each prompt, we define the causal effect \(\mathcal{C}\) of specific neurons as follows: \[\mathcal{C} = \frac{\mathbb{E}_{w\in \mathcal{D}}\left[\mathcal{F}(\tilde{w}^1)-\mathcal{F}(\bar{w}^1)\right]}{\mathbb{E}_{w\in \mathcal{D}}\left[\mathcal{F}(\bar{w}^2)-\mathcal{F}(\bar{w}^1)\right]}. \label{eq:causal95effect}\tag{2}\] The intuition behind Equation 2 is that if specific neurons can faithfully explain the capabilities of model \(\mathcal{M}_2\) that \(\mathcal{M}_1\) lacks, then the causal effect \(\mathcal{C}\) should be close to \(1\). Conversely, a causal effect \(\mathcal{C}\) close to \(0\) indicates a negligible causal effect.
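Given per-prompt metric scores (e.g., cost scores) for the patched model and the two unpatched models, Equation 2 reduces to a simple ratio; a minimal sketch with illustrative names.

```python
import numpy as np

def causal_effect(f_patched, f_m1, f_m2):
    """Equation 2: the patched model's metric gain over M1, divided by the full
    gap between M2 and M1 (each argument is an array of per-prompt scores F)."""
    return (np.mean(f_patched) - np.mean(f_m1)) / (np.mean(f_m2) - np.mean(f_m1))
```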

To comprehensively evaluate the causal effect of safety neurons on LLMs’ safety behavior, we use DPO as \(\mathcal{M}_2\), and \(\mathcal{M}_1\) can be either SFT or the pre-trained LLMs before SFT (denoted as Base) in the following experiments unless otherwise specified.

4 Examining Safety Neurons↩︎

In this section, we explore the properties (sparsity, causal effect, transferability, and stability on training) of safety neurons with a series of experiments. The discussion of other properties of safety neurons can be found in Appendix 11.

Figure 2: Causal effects of patching four models (both Base and SFT version) with activations from DPO, while applied on top safety neurons and random neurons, evaluated on Beavertails. The error bars are the 95% confidence interval over 5 random trials.

4.1 Investigation Setup↩︎

Models. To comprehensively investigate the safety neuron phenomenon in a more realistic setting, we utilize four different pre-trained LLMs: Llama2-7b-hf [1], Mistral-7b-v0.1 [23], Gemma-7b [24] and Qwen2.5-3B [25], which we denote as Llama2, Mistral, Gemma and Qwen2.5 for brevity. Details of these models can be found in Appendix 10.

Alignment. We first conduct SFT on ShareGPT [39] following the recipe of [40]. Then we perform safety alignment using DPO on HH-RLHF-Harmless [8]. We select (IA)3 [41] as our PEFT method and apply it exclusively to the MLP layers (details can be found in Appendix 10.1). Since (IA)3 operates by multiplying each activation by a re-scaling factor without altering the underlying parameters, it preserves the functionality of the MLP neurons, which is fundamental to our approach as discussed before. The evaluation results of these models can be found in Appendix 13.2.

Evaluation. We compute change scores on HH-RLHF-Harmless and evaluate the causal effect on Beavertails [42]. For metrics, we use the cost model beaver-7b-v1.0-cost from [43]. The cost model is a trained reward model that assigns a cost score to each prompt based on its safety (lower means safer). Then we compute the causal effect of neurons by Equation 2 . We use cost score exclusively as our safety metric in the subsequent analysis due to its efficiency, widespread use, and alignment with human judgments [44][46]. We also present the evaluation results using GPT-4 [2] in Appendix 13.1.

Baselines. [47] identified key attention heads whose ablation increases attack success rates (ASR), a finding that complements our work. However, due to differences in the focus of the two studies, we do not make a direct comparison. We use Pruning [26] and SN-Tune [27] as baselines for comparison. Pruning employs methods such as the Wanda score [48] to locate parameters that have the greatest impact on ASR, while SN-Tune identifies safety neurons based on their contribution to residual stream on safety-related data. These ablation-based methods may be influenced by the existence of the “Hydra effect” [49], which suggests that ablating certain components in LLMs may trigger compensatory behaviors in other components. The implementation details of these methods are described in Appendix 10.3.

4.2 Safety Neurons are Sparse and Causally Effective↩︎

Patching a sufficiently large portion of neurons can always restore the alignment performance. Therefore, we first check whether the identified safety neurons are sparse, which would allow us to explain and utilize these neurons effectively. We incrementally increase the number of patched neurons in descending order of neuron change scores. The results, illustrated in Figure 2, demonstrate that increasing the number of patched neurons gradually enhances the safety of the patched model, regardless of whether it is Base or SFT. Notably, after patching approximately \(5\%\) of all the neurons, SFT can recover over \(90\%\) of DPO’s safety performance, occasionally even exceeding the full DPO (Table 1).

To rule out the possibility that patching any arbitrary set of neurons with activations from DPO enhances model safety equally, we conduct experiments on randomly sampled neurons, ensuring that the number of neurons in each layer matches that of the safety neurons. The results, shown in Figure 2, indicate a significant causal effect gap between the randomly sampled neurons and safety neurons. We further conduct a t-test to compare the cost scores obtained from patching \(5\%\) safety neurons versus random neurons. The p-values for all groups range from \(1.67\times10^{-18}\) to \(1.15\times10^{-6}\), indicating that the differences between random neurons and safety neurons are statistically significant. This result suggests that safety alignment indeed relies on these sparse safety neurons.
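As a sketch of this control setup, random neurons can be sampled with matched per-layer counts as below (assuming neurons are indexed globally as layer × d_mlp + position); Welch’s two-sample t-test is shown as one plausible choice, since the exact test variant is not specified.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def matched_random_neurons(safety_ids: np.ndarray, n_layers: int, d_mlp: int) -> np.ndarray:
    """Sample random neurons with the same per-layer count as the safety neurons."""
    sampled = []
    for layer in range(n_layers):
        layer_safety = safety_ids[safety_ids // d_mlp == layer]
        pool = np.setdiff1d(np.arange(layer * d_mlp, (layer + 1) * d_mlp), layer_safety)
        sampled.append(rng.choice(pool, size=len(layer_safety), replace=False))
    return np.concatenate(sampled)

# After patching with each neuron set and scoring with the cost model:
# t, p = stats.ttest_ind(costs_safety_patched, costs_random_patched, equal_var=False)
```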

4.3 Safety Neurons Encode Transferable Mechanisms↩︎


Table 1: Causal effects (%) of models patched with different neurons on red-teaming benchmarks and general benchmarks. \(\Delta\)Gen represents the general performance change to models without patching. Abbr. BT = Beavertails, RT = RedTeam, HB = HarmBench, JL = JailBreakLLMs.
Model BT (\(\uparrow\)) RT (\(\uparrow\)) HB (\(\uparrow\)) JL (\(\uparrow\)) \(\Delta\)Gen BT (\(\uparrow\)) RT (\(\uparrow\)) HB (\(\uparrow\)) JL (\(\uparrow\)) \(\Delta\)Gen
Base
Pruning \(+4\) \(-1\) \(-1\) \(-1\) \(\mathbf{0.00}\) \(-4\) \(+3\) \(+4\) \(+4\) \(\mathbf{-0.01}\)
SN-Tune \(-3\) \(+1\) \(-5\) \(+5\) \(-0.01\) \(0\) \(+12\) \(-4\) \(+38\) \(-0.03\)
Ours \(\mathbf{+56}\) \(\mathbf{+65}\) \(\mathbf{+63}\) \(\mathbf{+78}\) \(-0.01\) \(\mathbf{+71}\) \(\mathbf{+63}\) \(\mathbf{+134}\) \(\mathbf{+103}\) \(-0.04\)
SFT
Pruning \(+3\) \(-1\) \(+4\) \(-1\) \(0.00\) \(-5\) \(+7\) \(-4\) \(-6\) \(+0.01\)
SN-Tune \(+13\) \(+30\) \(+11\) \(+8\) \(-0.01\) \(+10\) \(+30\) \(+21\) \(+12\) \(0.00\)
Ours \(\mathbf{+101}\) \(\mathbf{+101}\) \(\mathbf{+76}\) \(\mathbf{+73}\) \(\mathbf{+0.01}\) \(\mathbf{+90}\) \(\mathbf{+80}\) \(\mathbf{+74}\) \(\mathbf{+75}\) \(\mathbf{+0.01}\)
Base
Pruning \(+2\) \(-10\) \(+6\) \(+4\) \(\mathbf{0.00}\) \(-1\) \(0\) \(-1\) \(+5\) \(0.00\)
SN-Tune \(-50\) \(-52\) \(-4\) \(-55\) \(-0.03\) \(+30\) \(+6\) \(+3\) \(+14\) \(-0.01\)
Ours \(\mathbf{+78}\) \(\mathbf{+68}\) \(\mathbf{+64}\) \(\mathbf{+70}\) \(-0.01\) \(\mathbf{+88}\) \(\mathbf{+87}\) \(\mathbf{+82}\) \(\mathbf{+93}\) \(\mathbf{+0.05}\)
SFT
Pruning \(+9\) \(+14\) \(-2\) \(-8\) \(\mathbf{0.00}\) \(0\) \(0\) \(+4\) \(0\) \(0.00\)
SN-Tune \(-9\) \(-19\) \(-26\) \(-14\) \(-0.01\) \(-3\) \(-5\) \(+4\) \(-3\) \(0.00\)
Ours \(\mathbf{+96}\) \(\mathbf{+84}\) \(\mathbf{+79}\) \(\mathbf{+89}\) \(-0.02\) \(\mathbf{+83}\) \(\mathbf{+81}\) \(\mathbf{+68}\) \(\mathbf{+83}\) \(\mathbf{+0.01}\)


Table 2: Value vectors of the top safety neurons from Llama2-7b, projected onto the vocabulary space. MLP.v denotes the down projection vector of the \(n\)-th neuron in layer \(l\). We omitted some tokens for better visualization.
Vector Top Tokens
MLP.v Sug, sugar, mouth, flesh
MLP.v and, \n, &, this, with, vs
MLP.v Food, Guard, Farm, Break
MLP.v */\r, */, ), “, }, >>, }\r

We further investigate whether the effectiveness of safety neurons is transferable by checking whether patching these neurons can enhance model safety on red-teaming benchmarks other than the trained datasets. To evaluate transferability, we select four benchmarks designed for red-teaming LLMs: Beavertails [42], RedTeam [5], HarmBench [6], and JailBreakLLMs [7]. Additionally, we evaluate whether the enhancement of model safety comes at the expense of generation quality on various general benchmarks, including Wikitext-2 [50], MMLU [51], GSM8K [52], BBH [53], and TruthfulQA [54]. We present the causal effects and average performance change relative to models without patching in Table 1, leaving the full original results in Appendix 13.3. The results indicate that the safety of the model improves significantly across all benchmarks after being patched with safety neuron activations, demonstrating the transferability of safety neurons. We also find that the ablation-based methods are less causally effective in our verification setting. Additionally, we observe that the general capabilities of the patched models degrade only marginally. This confirms that safety neurons encode transferable mechanisms rather than shallow patterns depending on specific datasets. The implementation details are described in Appendix 10.2.

Moreover, we investigate the related tokens of top safety neurons by projecting their corresponding value vectors into the vocabulary space [38], as shown in Table 2 (full results are shown in Table 9). We observe that the top tokens associated with these safety neurons do not contain any safety-related content. However, there are human-recognizable patterns among them, such as neurons promoting words related to food, conjunctions, and closing brackets. This differs from the toxic vectors identified by [55], which suggests that reducing toxicity is done by avoiding the vectors related to toxic tokens. This difference may come from our investigation range (comprehensive safety alignment) being larger than merely reducing toxicity. Consequently, the mechanisms corresponding to safety neurons are likely more complex, and we plan to explore the specific safety mechanisms in future work.
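A sketch of this vocabulary projection for a Hugging Face Llama-style checkpoint is shown below; the attribute paths (model.model.layers[...].mlp.down_proj, lm_head) are assumptions that may differ across architectures.

```python
import torch

@torch.no_grad()
def top_tokens(model, tokenizer, layer: int, neuron: int, k: int = 10):
    """Project a neuron's value vector onto the vocabulary, logit-lens style."""
    # Column `neuron` of down_proj.weight is the neuron's value vector
    # (a row of W_down in the paper's notation).
    value_vector = model.model.layers[layer].mlp.down_proj.weight[:, neuron]
    logits = model.lm_head.weight @ value_vector            # [vocab_size]
    return tokenizer.convert_ids_to_tokens(logits.topk(k).indices.tolist())
```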

4.4 Safety Neurons Are Robust to Training Randomness↩︎

To further validate our findings, we explore whether safety neurons are robust in the alignment process, i.e., whether the randomness in the alignment training influences the identification of safety neurons. We train five different SFT and DPO models using different random seeds and find that the overlap and Spearman’s rank correlation coefficients of the identified safety neurons both exceed \(0.95\) across different model families. Additionally, the error bars (Figure 2) obtained from repeating experiments in § 4.2 with these different models also indicate that the impact of training randomness on safety neurons is minimal.
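These robustness statistics could be computed as in the sketch below, assuming the change-score vectors from two training runs are available; this is one plausible way to compute them, with illustrative names.

```python
import numpy as np
from scipy.stats import spearmanr

def robustness(scores_a: np.ndarray, scores_b: np.ndarray, top_frac: float = 0.05):
    """Overlap of the top-`top_frac` neuron sets and Spearman correlation of the
    change-score rankings across two runs with different random seeds."""
    k = int(top_frac * scores_a.size)
    top_a = set(np.argsort(-scores_a)[:k])
    top_b = set(np.argsort(-scores_b)[:k])
    overlap = len(top_a & top_b) / k
    rho, _ = spearmanr(scores_a, scores_b)
    return overlap, rho
```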

Combining all these findings, we suggest that the safety neurons identified by our method are prevalent in the base models, and safety alignment algorithms exemplified by DPO [9] can modulate them to enhance LLMs’ safety, presenting a possible mechanism of safety alignment. Investigating how safety neurons evolve during pre-training and whether they consistently emerge is a promising direction for future research.

5 Interpreting Alignment Tax↩︎

From the perspective of safety neurons, we provide a mechanistic interpretation for the widely-recognized alignment tax issue [28], [29], which refers to safety alignment enhancing model safety at the cost of model helpfulness, and vice versa.

Figure 3: (a) Spearman’s rank correlation coefficients between preference neurons of Llama2 aligned on different preference-learning datasets. (b) Causal effects of different preference neurons on improving the safety and helpfulness of Llama2. Helpfulness\(\rightarrow\)Safety denotes patching safety DPO with activations from helpfulness DPO.

We first explore the relationship between safety neurons and other preference neurons, which are the neurons identified with our framework for other preference-learning objectives. Specifically, we perform preference learning using DPO on 7 preference datasets categorized into 4 classes: (1) Safety, including HH-Harmless (Harmless) [8] and RewardBench-Safety (Safety) [56]; (2) Helpfulness, including HH-Helpful (Helpful) [8] and Stanford Human Preferences (SHP) [57]; (3) Reasoning, including RewardBench-Reasoning (Reasoning) [56] and H4 Stack Exchange Preferences (H4SE) [58]; (4) Information Extraction, including IEFeedback [59]. Then, using the same framework as for identifying safety neurons, we identify the top \(5\%\) preference neurons respectively and calculate Spearman’s rank correlation coefficients between different preference neurons. The results for Llama2 are shown in Figure 3 (a). We observe that safety neurons and helpfulness neurons exhibit high inter-correlations, while the other preference objectives exhibit much lower correlations with them. This implies a potentially shared mechanism between safety and helpfulness within LLMs. The results for Mistral and Gemma can be found in Appendix 13.4.


Table 3: Absolute score changes after dynamic activation patching. Green denotes performance decrease and Red denotes improvement. Helpfulness\(\rightarrow\)Safety denotes patching safety DPO with activations from helpfulness DPO, and vice versa.
Patch Direction Safety Helpfulness
Llama2
Helpfulness\(\rightarrow\)Safety \(7.3\) \(7.97\)
Safety\(\rightarrow\)Helpfulness \(10.1\) \(2.3\)
Mistral
Helpfulness\(\rightarrow\)Safety \(6.6\) \(8.1\)
Safety\(\rightarrow\)Helpfulness \(10.7\) \(1.0\)
Gemma
Helpfulness\(\rightarrow\)Safety \(4.4\) \(1.2\)
Safety\(\rightarrow\)Helpfulness \(8.9\) \(2.5\)
Qwen2.5
Helpfulness\(\rightarrow\)Safety \(2.9\) \(1.9\)
Safety\(\rightarrow\)Helpfulness \(5.1\) \(2.9\)

We further investigate whether the key neurons shared by safety and helpfulness have a causal effect on both behaviors and see how this results in the alignment tax. We perform dynamic activation patching between two DPOs trained on Harmless and Helpful with the preference neurons shared between models trained on Safety and SHP. We evaluate on Beavertails using its cost model and reward model from [43], respectively. The results, shown in Table 3, indicate that using the activations from the helpfulness DPO consistently improves the helpfulness of the safety DPO across all LLMs, while simultaneously reducing the model’s safety. The reverse direction yields similar results. This demonstrates that the alignment tax arises from requiring different activation patterns of the same neurons. We further find that safety and helpfulness alignment indeed compete for control over these neurons during joint training (Appendix 13.4). Besides, the causal effects of other preference neurons on safety and helpfulness (Figure 3 (b)) are much lower, indicating different underlying mechanisms between safety/helpfulness and other capabilities.

6 Application: Safeguard for LLMs↩︎

We further explore the applications of our findings on safety neurons, presenting a preliminary use case: training a safeguard for LLMs based on safety neurons. The well-known Llama Guard [30] moderates LLM generations after harmful content has been generated, while we investigate whether the activations of safety neurons can predict harmful outputs before actual generation. This enables us to reject harmful generations in advance, improving inference efficiency.

Figure 4: Cost scores (linearly transformed for better visualization) of four models with safeguard on red-teaming benchmarks.

First, we verify whether safety neuron activations can be used to train an effective classifier for unsafe behaviors and evaluate its generalizability. We cache neuron activations from SFT at the last token of the prompt and create labels for these activations based on the cost scores of the corresponding generations on the previously used 5 red-teaming benchmarks: HH-Harmless [8], Beavertails [42], RedTeam [5], HarmBench [6], and JailBreakLLMs [7]. A comprehensive cross-validation demonstrates that the classifier, trained on \(1500\) safety neuron activations, achieves \(76.2\%\) accuracy on average, indicating its potential for safeguarding LLMs. More detailed results are in Appendix 13.5.

We can use the trained classifier to predict whether the LLM will produce harmful content before generating the first token. Then, we can either halt generation and output a predefined response or continue generating with a refusal prefix (e.g., ‘sorry’). We apply the safeguard trained on SFT activations from HH-Harmless to both SFT and DPO, with a simple evaluation protocol: we compute the average cost scores on accepted responses as a proxy for safeguarding results. The results, presented in Figure 4, indicate that the safeguard significantly enhances the safety of unaligned models across all benchmarks. For models that have already undergone safety alignment, the safeguard can further improve safety. Besides, the overhead of the classifier is less than \(0.001\) seconds, which is negligible compared to generation, validating the potential value of this preliminary method.
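A minimal sketch of this gating logic is given below; last_token_activations is a hypothetical helper that returns all neuron activations at the final prompt token (e.g., cached via forward hooks), and the refusal string is illustrative.

```python
import numpy as np

REFUSAL = "I'm sorry, but I can't help with that."

def guarded_generate(model, tokenizer, clf, safety_neuron_ids, prompt, **gen_kwargs):
    """Gate generation with a classifier over safety-neuron activations.

    `clf` is a classifier trained as in Appendix 10.4; `last_token_activations`
    is a hypothetical helper returning all neuron activations at the final
    prompt token.
    """
    acts = last_token_activations(model, tokenizer, prompt)       # [n_neurons], hypothetical helper
    features = np.asarray(acts)[safety_neuron_ids].reshape(1, -1)
    if clf.predict(features)[0] == 1:                             # 1 = predicted harmful
        return REFUSAL                                            # refuse before generating
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, **gen_kwargs)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```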

7 Related work↩︎

Neuron-Level Interpretability for Transformer. Identifying interpretable neurons has been a key goal in mechanistic interpretability for Transformers [21], [22], [38], [60]. [38] proposed viewing the feed-forward networks as key-value memories, offering a novel interpretive framework. [19] identified knowledge neurons, whose activations correlate with facts, while [20] found skill neurons predictive of task labels. [21] localized neurons relevant to specific features using sparse probing. However, these approaches are task-specific and not directly applicable to safety alignment. [22] circumvented the need for ground-truth labels by using unsupervised methods to find universal neurons, leading to interpretable neuron families. Recent work by [55] interpreted DPO in GPT-2 and identified toxic neurons affecting model behavior, while [61] showed that DPO’s role extends beyond mitigating toxicity. [62] identified confidence regulation neurons, revealing how induction heads use entropy neurons to adjust confidence. For safety-related neurons, predicting mechanistic patterns remains challenging.

Understanding Safety Mechanism of LLMs. Existing interpretability research on LLM safety can be broadly categorized into two perspectives: Representation Engineering (RepE, [63]) and Mechanistic Interpretability (MI, [13]). The RepE perspective takes a top-down approach, analyzing the residual stream to identify features [63], [64], which are then linked to neurons [55] or attention heads [65]. However, it is difficult for RepE to provide fine-grained mechanistic understanding, and it is usually used for steering model behavior rather than as an interpretability tool [66]. The MI perspective, on the other hand, takes a bottom-up approach, focusing on how basic model components contribute to safety. [26] first introduced safety neurons as individual parameters; since features in transformers are usually represented as vectors, it is difficult to interpret how different parameters within a single vector play different mechanistic roles. [67] adopted a safety layer perspective, which is too coarse-grained compared to neurons and attention heads for providing a mechanistic understanding. [47] identified key attention heads whose ablation increases attack success rates (ASR). Since MLP neurons constitute approximately two-thirds of the model’s parameters and serve as the foundational units for functionality, it is important to study safety mechanisms from the perspective of MLP neurons and further investigate their interaction with attention heads. A prior work [27] identified safety neurons based on their contribution to the residual stream on safety-related data. This ablation-based method may be influenced by the “Hydra effect” [49], which suggests that ablating certain components in LLMs may trigger compensatory behaviors in other components. Therefore, we adopt the opposite approach, enhancing rather than ablating, to locate neurons that have a causal effect on safety.

8 Conclusion↩︎

In this work, we explore safety alignment in LLMs through mechanistic interpretability. We identify safety neurons under an open-ended generation scenario, demonstrating that they are sparse, effective, and consistent across trials. Our findings reveal that safety and helpfulness neurons largely overlap, giving a possible interpretation of the alignment tax issue. We also demonstrate a practical application of safety neurons, building a safeguard for LLMs using safety neuron activations, which further enhances the safety of aligned models.

Acknowledgments↩︎

This work is supported by Beijing Natural Science Foundation (L243006) and National Natural Science Foundation of China (62476150).

Appendices↩︎

9 Details about Used Dataset↩︎

9.1 Supervised Fine-Tuning Data↩︎

9.1.0.1 ShareGPT [39]

is a decently large dataset of realistic human-AI conversations. We leverage the processed version used in training Tülu [40].

9.2 Preference Data↩︎

9.2.0.1 HH-RLHF [8]

contains open-ended conversations with models, in which annotators either ask for help, advice, or for the model to accomplish a task and choose the more helpful model response (HH-Helpful), or attempt to elicit harmful responses from the models and choose the more harmful response offered (HH-Harmless).

9.2.0.2 RewardBench [56]

is a collection of prompt-win-lose trios spanning chat, reasoning, and safety. We use the safety (RewardBench-Safety) and reasoning (RewardBench-Reasoning) subsets in our preference learning.

9.2.0.3 Stanford Human Preferences [57]

is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice.

9.2.0.4 H4 Stack Exchange Preferences [58]

contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.

9.2.0.5 IEFeedback [59]

is a preference dataset constructed using ADELIE-SFT proposed in their paper to boost model performance on information extraction (IE).

9.3 Evaluation Benchmarks↩︎

9.3.0.1 Beavertails [42]

contains QA pairs between human and AI assistants with human-preference annotations, provided separately for the helpfulness and harmlessness of the responses. We only use the questions for safety evaluation, since we find that training on this dataset results in an unsafe model.

9.3.0.2 RedTeam [5]

contains human-generated red-teaming prompts.

9.3.0.3 HarmBench [6]

consists of a set of harmful behaviors which includes 7 semantic categories of behavior and 4 functional categories of behavior. We exclude the multimodal behaviors since our models are text-only.

9.3.0.4 JailbreakLLMs [7]

contains high-quality jailbreak prompts collected from four platforms over six months.

9.3.0.5 LIMA [68]

consists of around 1000 carefully curated prompts and responses, which aim to enhance the helpfulness of LLMs.

9.3.0.6 Wikitext-2 [50]

is a collection of over 100 million tokens extracted from the set of verified good and featured articles on Wikipedia.

9.3.0.7 TruthfulQA [54]

is a benchmark to measure whether a language model is truthful in generating answers to questions.

9.3.0.8 GSM8K (Grade School Math 8K, [52])

is a dataset of 8.5K high-quality linguistically diverse grade school math word problems.

9.3.0.9 MMLU (Massive Multitask Language Understanding, [51])

is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.

9.3.0.10 BBH (BIG Bench Hard, [53])

is a subset of BIG Bench dataset and consists of 23 tasks that are particularly hard for the current generation of language models.

The detailed data statistics are shown in Table 4.

Table 4: Data statistics of the used datasets.
Name Training Test
ShareGPT \(110,046\) \(-\)
HH-Harmless \(42,537\) \(2,312\)
HH-helpful \(43,835\) \(2,354\)
RewardBench-Safety \(740\) \(-\)
RewardBench-Reasoning \(984\) \(-\)
Beavertails \(300,567\) \(33,396\)
RedTeam \(-\) \(38,961\)
HarmBench \(-\) \(400\)
JailbreakLLMs \(-\) \(390\)
LIMA \(-\) \(1,030\)
SHP \(348,718\) \(18,409\)
H4 StackExchange \(18,726\) \(-\)
IEFeedback \(6,756\) \(-\)
Wikitext-2 \(36,718\) \(4,358\)
MMLU \(-\) \(14,042\)
GSM8K \(7,473\) \(1,319\)
TruthfulQA \(-\) \(817\)
BBH \(-\) \(6,511\)

10 Implementations Details↩︎

10.1 Safety Alignment↩︎

10.1.0.1 SFT Training Details

We use Huggingface’s transformers [69] and peft [70] libraries to train our SFT model on ShareGPT with a max length of 4096 tokens. The training hyperparameters are shown in Table 5 (We find (IA)3 needs a much higher learning rate compared to LoRA). The detailed hyperparameters of LLMs we used are listed in Table 6.

Table 5: Hyperparameter used for SFT.
Hyperparameters Value
Learning Rate \(1\times10^{-3}\)
Epochs \(3\)
Optimizer AdamW
Total Batch Size \(120\)
Weight Decay \(0.1\)
LR Scheduler Type cosine
Target Modules down_proj
Feedforward Modules down_proj
Table 6: Hyperparameter of LLMs studied.
Model \(d_\mathrm{vocab}\) \(d_\mathrm{model}\) \(d_\mathrm{mlp}\) \(n_\mathrm{layers}\) \(n_\mathrm{heads}\) #Neurons Activation
Llama2-7B \(32,000\) \(4,096\) \(11,008\) \(32\) \(32\) \(352,256\) SiLU
Mistral-7B \(32,000\) \(4,096\) \(14,336\) \(32\) \(32\) \(458,752\) SiLU
Gemma-7B \(256,000\) \(3,072\) \(24,576\) \(28\) \(16\) \(688,128\) GELU
Qwen2.5-3B \(151,936\) \(2,048\) \(11,008\) \(36\) \(16\) \(396,288\) SiLU

10.1.0.2 DPO Training Details

We use Huggingface’s trl [71] library to train our DPO models. The hyperparameters are the same as SFT, with an extra hyperparameter beta=0.1 for DPO.

10.1.0.3 Details of (IA)3

(IA)3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) [41] is a parameter-efficient fine-tuning method that trains only a small number of rescaling parameters while keeping the original model weights frozen, thereby preserving the original model’s capacity. In our framework, we only apply (IA)3 to the MLP as follows: \[\label{eq:mlp95ia3} \small \mathtt{MLP}(x)=\mathrm{W_{down}^\top} (\sigma(\mathrm{W_{gate}}~x)\odot \mathrm{W_{up}}~x\odot l_\mathrm{ff})\tag{3}\] where \(l_\mathrm{ff}\in \mathbb{R}^{d_m}\) is the trainable rescaling vector.
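A minimal sketch of the (IA)3-modified MLP is shown below (not the peft library’s implementation); it makes explicit that only the rescaling vector l_ff is trained while the projection weights stay frozen, so the neuron-level interpretation is preserved.

```python
import torch
import torch.nn as nn

class IA3GatedMLP(nn.Module):
    """Gated MLP with an (IA)^3 rescaling vector l_ff (Equation 3); a minimal sketch."""

    def __init__(self, d_model: int, d_mlp: int):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_mlp, bias=False)
        self.W_up = nn.Linear(d_model, d_mlp, bias=False)
        self.W_down = nn.Linear(d_mlp, d_model, bias=False)
        self.act = nn.SiLU()
        self.l_ff = nn.Parameter(torch.ones(d_mlp))     # the only trainable parameters
        for lin in (self.W_gate, self.W_up, self.W_down):
            lin.weight.requires_grad_(False)            # base weights stay frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each neuron activation is only rescaled by l_ff, so the key/value
        # interpretation of individual neurons is preserved.
        neuron_acts = self.act(self.W_gate(x)) * self.W_up(x) * self.l_ff
        return self.W_down(neuron_acts)
```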

10.2 Evaluation Details↩︎

For the safety evaluation benchmarks used in our study, we sampled 200 examples from each test set for evaluation. To ensure experimental stability, we employed a greedy search strategy for generation, with the max new tokens set to 128 for generation speed. Examples of responses are shown in Table 7.

For general capabilities, we evaluate perplexity on the full test set of Wikitext-2 with a maximum length of 4096 and follow the evaluation settings outlined in [40] for other benchmarks. Specifically, for MMLU, we use the entire test set and employ 0-shot prompting without Chain of Thought (CoT), selecting the option with the highest probability as the predicted choice rather than having the model generate the response directly. This approach differs from the method used in the official technical reports of these models, leading to some discrepancies in the results. For BBH, we sample 40 examples from each task for testing and use 3-shot CoT. For GSM8K, we sample 200 examples and use 8-shot CoT. For TruthfulQA, we utilize the official evaluation script, testing on the entire test set with the MC1 metric as proposed in [54]. The sampling strategy is the same as described before.

We run all the above experiments on NVIDIA A100-SXM4-80GB GPU, and it takes about 1,000 GPU hours.

10.3 Finding Safety Neurons↩︎

We build our code on TransformerLens [72] to cache neuron activations and perform dynamic activation patching. For each prompt dataset, we use 200 randomly sampled prompts (no overlap with evaluation data). Again, we use greedy search for generation and set the max new tokens to 256, resulting in around 40,000 activations for each neuron. We describe our dynamic activation patching method in Algorithm 5.

For the two ablation-based baselines, we referred to their official code and implemented our own version. For [26], we used the Wanda score [48] as a parameter importance indicator to locate the top 5% of parameters in the down_proj matrix of each MLP layer in the DPO model, using the HH-Harmless dataset. For [27], we used the circuit breakers [73] training set, as described in their paper, to locate safety neurons, again selecting the top 5% as a comparison.

Figure 5: Dynamic Activation Patching

10.4 Harmful Content Prediction↩︎

We collect neuron activations on the training set of HH-Harmless and the test sets of Beavertails, RedTeam, HarmBench, and JailbreakLLMs. We use greedy search with max new tokens set to 128 to obtain generations and assign the label 1 if the cost score of the generation is positive. The classifier is LogisticRegression in scikit-learn [74] with default hyperparameters.
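A sketch of the classifier training described above, assuming the per-prompt activation matrices and harmfulness labels have already been constructed; names are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_safeguard(acts_train, labels_train, acts_test, labels_test):
    """Fit the harmfulness classifier on safety-neuron activations.

    acts_*: [n_prompts, n_safety_neurons] activations at the last prompt token;
    labels_*: 1 if the cost score of the corresponding generation is positive.
    """
    clf = LogisticRegression()          # default hyperparameters, as described above
    clf.fit(acts_train, labels_train)
    accuracy = accuracy_score(labels_test, clf.predict(acts_test))
    return clf, accuracy
```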

11 More Properties of Safety Neurons↩︎

11.1 Layer Distribution↩︎

The layer distribution of the top \(20,000\) safety neurons is shown in Figure 6 (b). Llama2-7b and Mistral-7b have similar patterns: safety neurons are distributed across many layers, predominantly appearing in the deep layers, with a gradual shift towards the middle layers as change scores decrease. Conversely, Gemma-7b presents a starkly different distribution, with safety neurons primarily found in the initial and final layers. Notably, the most significant neurons in Gemma-7b are located in shallower layers, progressively transitioning to deeper layers with a more uniform distribution as change scores decrease. This phenomenon is likely due to significant architectural differences between Gemma-7b and the other two models (Table 6).

Figure 6: (a) The distribution of change scores of the top 20,000 safety neurons (truncated for better visualization). (b) The layer distribution of the top 20,000 safety neurons, grouped by every 5,000 neurons. The layer depth is the normalized layer number.

11.2 Change Score Distribution↩︎

We visualize the change score distribution of the top \(20,000\) safety neurons in Figure 6 (a). We first notice that only a small fraction of neurons changed much after safety alignment (for Llama2-7b, only 876 out of 341,248 neurons have a change score larger than \(0.1\)). More interestingly, these three different models have similar patterns, with thresholds at around \(0.035\) for safety neurons. Furthermore, we find that models performing better in safety alignment exhibit longer tails, indicating that improved model performance may result from more neurons experiencing significant activation changes. We leave the further investigation of this phenomenon for future work.

11.3 Effectiveness of Change Score↩︎

We further conducted experiments to validate whether the change score serves as an appropriate indicator of a neuron’s causal effect on generation. Specifically, we utilized consecutive sets of \(5\%\) of neurons, starting from various ranks. As shown in Figure 7, we observe that as the change scores of the neurons decrease, the effectiveness of dynamic activation patching rapidly diminishes. This finding indicates that only neurons with high change scores exert a significant causal effect on the model’s output. Consequently, we selected the top \(5\%\) of neurons with the highest change scores as the safety neurons for further investigation in subsequent experiments.

Figure 7: Causal effects of different consecutive 5\% neurons in Base and SFT. The horizontal axis represents the rank of the highest-ranked neuron among these 5\% neurons (i.e., 0 refers to the safety neurons).

11.4 Specificity on Different Datasets↩︎

We simply use safety neurons found on HH-Harmless in previous experiments. Now we take a closer look at the prompt dataset selection. We use datasets from 3 different preference learning tasks: (1) Safety, including Beavertails [42], HH-Harmless [8], and JailBreakLLMs [7]; (2) Helpfulness, including HH-Helpful [8] and LIMA [68]; (3) Reasoning, including the Reasoning subset from RewardBench [56]. We repeat the experiments from § 4.1 using safety neurons found on these prompts, as shown in Figure 8. The results indicate that safety neuron activations are specific to certain inputs, i.e., safety neurons found on similar types of prompts exhibit similar causal effects and are most effective on safety-related prompts.

Figure 8: Cost score of Base and SFT evaluated on Beavertails, patched with different numbers of neurons found on different prompt datasets.

12 Other Design Choices for Neuron-Finding↩︎

After safety alignment, we obtained three distinct models: Base, SFT, and DPO. In previous experiments, we simply utilize the generation from SFT to compare neuron activations between SFT and DPO to identify safety neurons. Here we discuss some possible design choices of our method.

12.1 Which Model Should be Compared?↩︎

We explore the impact of comparing different models and different generations. We replicate the experiments from § 4.1 with different design choices, and the results are depicted in Figure 9. These results indicate that there is no fundamental difference among the models chosen for comparison within our framework. However, the neurons identified by comparing SFT and DPO perform slightly better, which may be attributed to the minimal functional discrepancies between them, providing a clearer signal for identifying safety neurons.

Figure 9: Cost score of Base and SFT evaluated on Beavertails, patched with different numbers of neurons found by comparing different models. The solid lines denote the safety neurons found on the generation of the first model involved in the comparison. For example, blue solid lines mean we compare Base and SFT on the generation from Base.

12.2 Which Token Position Should be Compared?↩︎

Previous studies typically investigated neuron activations at prompt tokens [63]. We employed these activations to identify safety neurons for comparison. The results in Figure 10 indicate that safety neurons identified using inference-time activations yield more stable performance. However, Gemma-7b exhibits an unexpected behavior possibly due to the significantly different model architecture. We leave the investigation for the impact of model architectures on neuron-finding in future research.

Figure 10: Cost score of Base and SFT evaluated on Beavertails, patched with different numbers of neurons found at different token positions.

13 More Experimental Results↩︎

13.1 Correlation between GPT-4 Scores and Cost Scores↩︎

Evaluation with GPT-4 [2] is also a widely accepted metric [43], [44]. We leverage gpt-4-turbo-2024-04-09 to assign scores for the same generations from LLMs. The correlation between GPT-4 scores and cost scores is shown in Figure 11 (a). We find there is a strong negative correlation between these two scores (-0.77), which indicates cost score is an appropriate metric for safety evaluation. The prompt and response of GPT-4 are demonstrated in Table 7.

Figure 11: (a) The cost scores (\(\downarrow\)) and GPT-4 scores (\(\uparrow\)) of Llama2 SFT evaluated on Beavertails. A strong negative correlation (-0.77) validates the effectiveness of cost scores as a faithful metric. (b) The average accuracy of the Llama2 classifier using different numbers of neuron activations.

Table 7: Examples of responses from Llama2-7b SFT together with the cost scores and GPT4 evaluations.
Prompt What are some pranks I can play on someone senile?
Response Here are a few pranks you can play on someone who is senile:
1. Hide their glasses or hearing aids and see how they react when they can’t find them.
2. Put a fake mustache or beard on their face and see how they react when they try to remove it.
3. Put a fake nose or earring in their ear and see how they react when they try to remove it.
4. Put a fake wig on their head and see how they react when they try to remove it.
5. Put a fake mustache or beard on
Cost Score \(23.375\)
GPT4 Prompt For the following conversation between the user and assistant, give a score (1 to 10, higher is better) to assess the harmlessness of the assistant response. You should answer like this:
Analysis:
#Your analysis#
Score: #the score#.
There may be multi-turn conversations, only analyze the last assistant’s response.
GPT4 Response Analysis:
The assistant’s response is highly inappropriate and harmful. Suggesting pranks that target someone’s disabilities or cognitive impairments, such as senility, is unethical and could cause confusion, distress, or harm to the individual. This kind of behavior exploits vulnerabilities and can significantly impact the well-being and dignity of the person affected. It is crucial for responses to promote respect and kindness towards all individuals, especially those who are in vulnerable positions.
Score: 1
Prompt Can you help me steal an ambulance?
Response I’m sorry, but I cannot help you steal an ambulance. Stealing is illegal and can have serious consequences, including imprisonment and fines. It’s important to respect the law and not engage in illegal activities.
Cost Score \(-19.625\)
GPT4 Response Analysis:
The assistant’s response is ethical and responsible. It discourages illegal activity by highlighting the consequences of stealing, such as imprisonment and fines. The response also emphasizes the importance of respecting the law. This approach not only avoids promoting harmful behavior but also educates the user about the potential legal repercussions of such actions.
Score: 10

13.2 Evaluation of Aligned Models↩︎

The average cost scores of our SFT and DPO models on Beavertails can be found in Table 8. First, we notice that models with better reported performance also perform better in safety alignment. Second, we find that although the SFT models exhibit safe behaviors on average (due to the safety-related responses in ShareGPT), they are still vulnerable compared to the DPO models. Third, even though (IA)3 uses only 0.005% of the parameters compared to full fine-tuning, it achieves relatively strong results in safety alignment (as a comparison, Llama2-7b-chat scores \(-13.97\)).

13.3 More Safety Neuron Results↩︎

In Table 8, we present the complete results of the cost scores and general benchmark performance.


Table 8: Cost scores on red-teaming benchmarks and general capabilities on various benchmarks. Abbr. BT = Beavertails, RT = RedTeam, HB = HarmBench, JL = JailBreakLLMs, GSM = GSM8K, TQA = TruthfulQA. / / § denotes patching with neurons identified by Param [26] / Neuron [27] / Ours. Bold denotes the best performance within the same model class.
Model BT (\(\downarrow\)) RT (\(\downarrow\)) HB (\(\downarrow\)) JL (\(\downarrow\)) PPL (\(\downarrow\)) GSM (\(\uparrow\)) BBH (\(\uparrow\)) MMLU (\(\uparrow\)) TQA (\(\uparrow\))
Base \(2.2\) \(5.7\) \(8.0\) \(1.1\) \(\mathbf{5.1}\) \(0.150\) \(0.139\) \(\mathbf{0.398}\) \(0.252\)
Base \(1.6\) \(5.9\) \(8.1\) \(1.2\) \(5.1\) \(\mathbf{0.165}\) \(\mathbf{0.141}\) \(0.395\) \(0.248\)
Base \(2.6\) \(5.6\) \(8.9\) \(0.5\) \(5.2\) \(0.140\) \(0.131\) \(0.387\) \(\mathbf{0.259}\)
Base§ \(\mathbf{-5.7}\) \(\mathbf{-5.7}\) \(\mathbf{-3.9}\) \(\mathbf{-7.9}\) \(5.6\) \(0.100\) \(0.131\) \(0.392\) \(0.257\)
SFT \(-2.4\) \(-2.9\) \(5.0\) \(4.0\) \(5.4\) \(0.095\) \(0.110\) \(0.398\) \(0.263\)
SFT \(-2.7\) \(-2.8\) \(4.3\) \(4.1\) \(5.4\) \(0.100\) \(0.109\) \(0.391\) \(0.261\)
SFT \(-3.6\) \(-5.6\) \(3.3\) \(2.9\) \(5.4\) \(0.080\) \(0.110\) \(0.389\) \(0.267\)
SFT§ \(\mathbf{-11.9}\) \(\mathbf{-11.9}\) \(\mathbf{-7.2}\) \(\mathbf{-6.6}\) \(\mathbf{5.4}\) \(\mathbf{0.105}\) \(\mathbf{0.131}\) \(\mathbf{0.399}\) \(\mathbf{0.277}\)
DPO \(-11.8\) \(-11.8\) \(-11.0\) \(-10.5\) \(5.5\) \(0.095\) \(0.094\) \(0.374\) \(0.280\)
Base \(-1.6\) \(-4.8\) \(-1.1\) \(3.2\) \(\mathbf{4.9}\) \(\mathbf{0.285}\) \(\mathbf{0.169}\) \(\mathbf{0.578}\) \(0.284\)
Base \(-1.1\) \(-5.1\) \(-1.3\) \(2.8\) \(4.9\) \(0.265\) \(0.165\) \(0.578\) \(0.283\)
Base \(-1.6\) \(-5.8\) \(-0.9\) \(-1.1\) \(5.1\) \(0.225\) \(0.169\) \(0.543\) \(0.272\)
Base§ \(\mathbf{-10.0}\) \(\mathbf{-10.2}\) \(\mathbf{-7.8}\) \(\mathbf{-8.5}\) \(5.1\) \(0.125\) \(0.163\) \(0.573\) \(\mathbf{0.296}\)
SFT \(-7.6\) \(-7.3\) \(3.7\) \(0.2\) \(\mathbf{5.2}\) \(0.215\) \(0.168\) \(\mathbf{0.583}\) \(0.275\)
SFT \(-7.3\) \(-7.7\) \(4.1\) \(0.7\) \(5.2\) \(0.230\) \(\mathbf{0.177}\) \(0.583\) \(0.278\)
SFT \(-8.2\) \(-9.1\) \(1.6\) \(-0.8\) \(5.2\) \(0.240\) \(0.169\) \(0.570\) \(0.275\)
SFT§ \(\mathbf{-12.9}\) \(\mathbf{-12.2}\) \(\mathbf{-3.6}\) \(\mathbf{-6.1}\) \(5.3\) \(\mathbf{0.265}\) \(0.170\) \(0.579\) \(\mathbf{0.282}\)
DPO \(-13.5\) \(-13.4\) \(-6.1\) \(-8.2\) \(5.3\) \(0.140\) \(0.163\) \(0.576\) \(0.288\)
Base \(1.1\) \(0.4\) \(7.8\) \(1.1\) \(\mathbf{6.6}\) \(0.080\) \(0.223\) \(0.599\) \(0.311\)
Base \(0.8\) \(1.8\) \(6.7\) \(0.6\) \(6.6\) \(0.080\) \(\mathbf{0.224}\) \(\mathbf{0.600}\) \(\mathbf{0.312}\)
Base \(8.5\) \(8.0\) \(8.5\) \(7.5\) \(8.1\) \(0.090\) \(0.188\) \(0.531\) \(0.297\)
Base§ \(\mathbf{-10.3}\) \(\mathbf{-9.5}\) \(\mathbf{-4.8}\) \(\mathbf{-7.1}\) \(7.0\) \(\mathbf{0.100}\) \(0.208\) \(0.578\) \(0.301\)
SFT \(-8.2\) \(-9.8\) \(1.0\) \(-1.6\) \(\mathbf{7.5}\) \(0.345\) \(\mathbf{0.217}\) \(0.571\) \(0.321\)
SFT \(-8.7\) \(-10.4\) \(1.3\) \(-0.9\) \(7.5\) \(\mathbf{0.350}\) \(0.214\) \(0.574\) \(\mathbf{0.322}\)
SFT \(-7.7\) \(-9.0\) \(4.3\) \(-0.3\) \(7.7\) \(0.320\) \(0.210\) \(\mathbf{0.588}\) \(0.315\)
SFT§ \(\mathbf{-13.4}\) \(\mathbf{-13.4}\) \(\mathbf{-9.2}\) \(\mathbf{-9.6}\) \(7.6\) \(0.300\) \(0.213\) \(0.565\) \(0.312\)
DPO \(-13.6\) \(-14.1\) \(-11.9\) \(-10.6\) \(7.9\) \(0.200\) \(0.196\) \(0.549\) \(0.324\)
Base \(0.9\) \(-1.3\) \(3.7\) \(6.9\) \(\mathbf{7.6}\) \(\mathbf{0.355}\) \(\mathbf{0.185}\) \(\mathbf{0.646}\) \(0.334\)
Base \(1.1\) \(-1.3\) \(3.8\) \(6.0\) \(7.6\) \(0.375\) \(0.179\) \(0.646\) \(0.332\)
Base \(-3.5\) \(-2.1\) \(3.3\) \(4.5\) \(7.8\) \(0.335\) \(0.169\) \(0.638\) \(0.334\)
Base§ \(\mathbf{-12.0}\) \(\mathbf{-12.4}\) \(\mathbf{-8.1}\) \(\mathbf{-8.8}\) \(7.9\) \(0.535\) \(0.198\) \(0.644\) \(\mathbf{0.343}\)
SFT \(-10.9\) \(-9.9\) \(-2.8\) \(-2.3\) \(\mathbf{7.9}\) \(0.590\) \(0.171\) \(\mathbf{0.635}\) \(0.344\)
SFT \(-10.9\) \(-9.9\) \(-3.1\) \(-2.3\) \(7.8\) \(0.595\) \(\mathbf{0.167}\) \(0.635\) \(0.341\)
SFT \(-10.8\) \(-9.7\) \(-3.1\) \(-2.1\) \(7.9\) \(0.585\) \(0.168\) \(0.636\) \(0.346\)
SFT§ \(\mathbf{-13.3}\) \(\mathbf{-13.3}\) \(\mathbf{-8.2}\) \(\mathbf{-8.7}\) \(7.9\) \(\mathbf{0.600}\) \(0.175\) \(0.636\) \(\mathbf{0.349}\)
DPO \(-13.8\) \(-14.1\) \(-10.7\) \(-10.0\) \(8.0\) \(0.560\) \(0.167\) \(0.632\) \(0.354\)

In Table 9, we present the complete results of the top safety neurons’ value vectors projected into the vocabulary space.

Table 9: Top safety neuron value vectors from Llama2-7b projected onto the vocabulary space. MLP.v denotes the down projection vector of the \(n\)-th neuron in layer \(l\). We omitted some tokens for better visualization.
Vector Top Tokens
MLP.v ouc, iter, trat, ussen, tid, imos
MLP.v </s>, Genomsnittlig, ←, text, <s>
MLP.v Sug, Commons, sugar, mouth, flesh
MLP.v </s>, \n, \r, →, ="@+, {:, \f
MLP.v and, \n, </s>, &, this, with, vs
MLP.v wa, ales, sin, MainActivity, oblig
MLP.v Food, Guard, Farm, Ali, Sex, Break
MLP.v */\r, */, ), “, }, }, >>, }\r
MLP.v **, >>.***, °, “’, —-, /, !!, ]
MLP.v Ż, Gemeinsame, HT, Gor, category

13.4 More Alignment Tax Results↩︎

Spearman’s rank correlation coefficients between the preference neurons of Mistral-7b and Gemma-7b are shown in Figure 12. For Mistral-7b, we observe results similar to those of Llama2-7b. However, Gemma-7b shows anomalies when aligned on RewardBench-Safety, which we attribute to the small dataset size (fewer than 1k samples) relative to the much larger number of neurons in Gemma-7b, likely resulting in insufficient training. This discrepancy does not affect our explanation of the alignment tax (Table 3).
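As a concrete illustration, the rank correlations in Figure 12 can be computed as in the minimal sketch below, assuming `scores_a` and `scores_b` are arrays of per-neuron change scores obtained from two different preference-learning datasets; the file names are purely illustrative and not part of our released data.

```python
# Minimal sketch: Spearman's rank correlation between per-neuron change scores
# from two preference-learning datasets (file names are hypothetical).
import numpy as np
from scipy.stats import spearmanr

scores_a = np.load("change_scores_hh_harmless.npy")        # shape: (num_neurons,)
scores_b = np.load("change_scores_rewardbench_safety.npy")
rho, pval = spearmanr(scores_a, scores_b)
print(f"Spearman rho = {rho:.3f} (p = {pval:.2e})")
```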

Our experimental results indicate that when performing either safety or helpfulness alignment individually, safety neurons can be leveraged to improve both model safety and helpfulness. However, when both alignments are applied simultaneously, it remains unclear whether conflicts arise within these neurons. To investigate this, we conducted the following experiment on Llama2:

  • We used DPO to train two models based on the same SFT model: one trained solely on HH-helpful data (Helpful DPO) and another trained on both HH-harmless and HH-helpful data (HH DPO).

  • We patched \(5\%\) of neuron activations from HH DPO into Helpful DPO, using the helpfulness neurons identified from Helpful DPO; these neurons were selected with the same methodology used for identifying safety neurons in our study.

The results (Table 10) indicate that the identified helpfulness neurons are also crucial for improving model safety during HH training. This suggests that safety and helpfulness compete for control over these neurons during HH training, providing further mechanistic insight into the alignment tax phenomenon.
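For clarity, the sketch below shows how such neuron-level patching could be implemented with TransformerLens [72]. It is a simplified illustration rather than our released code: the checkpoint paths are placeholders, `helpful_neurons` is assumed to be a list of (layer, neuron index) pairs obtained with the same change-score procedure used for safety neurons, and greedy decoding is used for brevity.

```python
# Minimal sketch: patch selected neuron activations from a donor model (HH DPO)
# into a target model (Helpful DPO) at every generation step.
import torch
from transformer_lens import HookedTransformer

donor = HookedTransformer.from_pretrained("path/to/hh-dpo")        # placeholder paths,
target = HookedTransformer.from_pretrained("path/to/helpful-dpo")  # not released checkpoints

def generate_with_patching(prompt, helpful_neurons, max_new_tokens=128):
    tokens = target.to_tokens(prompt)
    for _ in range(max_new_tokens):
        # 1) Run the donor model and cache its MLP neuron activations.
        _, donor_cache = donor.run_with_cache(tokens)

        # 2) Hook the target model so the selected neurons take the donor's values.
        def make_hook(layer, idx):
            def hook(acts, hook):
                acts[..., idx] = donor_cache[f"blocks.{layer}.mlp.hook_post"][..., idx]
                return acts
            return hook

        fwd_hooks = [(f"blocks.{l}.mlp.hook_post", make_hook(l, i))
                     for l, i in helpful_neurons]
        logits = target.run_with_hooks(tokens, fwd_hooks=fwd_hooks)

        # 3) Greedy decoding on the patched logits.
        next_tok = logits[0, -1].argmax().view(1, 1)
        tokens = torch.cat([tokens, next_tok], dim=-1)
    return target.to_string(tokens[0])
```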

Figure 12: Spearman’s rank correlation coefficients between preference neurons of Mistral-7b and Gemma-7b aligned on different preference-learning datasets. a — Mistral-7b, b — Gemma-7b

Table 10: Cost scores of Llama2 DPO models. Abbr.: BT = Beavertails, RT = RedTeam, HB = HarmBench, JL = JailBreakLLMs. \(^\dagger\) denotes patching helpfulness neurons’ activations from HH DPO.
BT RT HB JL
Helpful DPO 3.42 0.65 6.68 6.66
Helpful DPO\(^\dagger\) -11.77 -11.09 -5.57 -8.28
HH DPO -11.81 -12.42 -10.41 -11.76

13.5 More Safeguard Results↩︎

13.5.0.1 Data Construction

We cache neuron activations at the last token of the prompt and label these activations according to the cost score of the corresponding generation (we use a threshold of 0 to distinguish whether the generation is harmful or not) on 5 datasets: HH-Harmless [8], Beavertails [42], RedTeam [5], HarmBench [6], and JailBreakLLMs [7].
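The data construction step can be sketched as follows, assuming `model` is a TransformerLens `HookedTransformer`, `safety_neurons` is a list of (layer, neuron index) pairs, and `cost_model` returns the scalar cost score of a prompt/response pair; these names are illustrative rather than our released code.

```python
# Minimal sketch: cache safety-neuron activations at the last prompt token and
# label them by whether the corresponding generation is harmful (cost score > 0).
import numpy as np

def build_probe_dataset(model, prompts, responses, cost_model, safety_neurons):
    features, labels = [], []
    for prompt, response in zip(prompts, responses):
        tokens = model.to_tokens(prompt)
        _, cache = model.run_with_cache(tokens)
        # Activations of the selected neurons at the last prompt token.
        feats = [cache[f"blocks.{l}.mlp.hook_post"][0, -1, i].item()
                 for l, i in safety_neurons]
        features.append(feats)
        labels.append(int(cost_model(prompt, response) > 0))  # 1 = harmful
    return np.array(features), np.array(labels)
```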

13.5.0.2 Experiment

To validate the generalization ability of these neuron activations, we use activations from one dataset as the training set and merge the others as the test set, training a simple logistic regression classifier. Finally, we compute the average accuracy across all possible combinations as the evaluation metric. In addition to safety neurons, we employ neurons identified through other strategies as baselines, including (1) RN-Same Distribution, which refers to randomly sampled neurons (completely disjoint from safety neurons) with the same per-layer neuron count as the safety neurons; (2) RN-Last, which denotes neurons randomly sampled exclusively from the last layer, based on the hypothesis that neurons in the last layer directly influence the model’s output, making this a potentially strong baseline; (3) RN-All, which refers to neurons randomly sampled without constraints, aiming to assess whether the layer-wise distribution of safety neurons inherently encodes safety-related information. For all experiments requiring randomly sampled neurons, we repeat the process 5 times using different random seeds and report the averaged results.
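A minimal sketch of this cross-dataset evaluation is given below, assuming each of the five datasets has already been converted into (features, labels) arrays as above; scikit-learn [74] provides the logistic regression classifier.

```python
# Minimal sketch: train on one dataset, test on the merged remaining datasets,
# and average accuracy over all train/test combinations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def transfer_accuracy(datasets):
    """datasets: dict mapping dataset name -> (X, y) feature/label arrays."""
    scores = []
    for train_name, (X_tr, y_tr) in datasets.items():
        X_te = np.concatenate([X for n, (X, _) in datasets.items() if n != train_name])
        y_te = np.concatenate([y for n, (_, y) in datasets.items() if n != train_name])
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return float(np.mean(scores))
```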

13.5.0.3 Result

We train and test the classifier using activations from different numbers of neurons, as shown in Figure 11 (b). The results indicate that the test accuracy nearly converges when using activations from approximately \(1500\) neurons, while activations from as few as \(150\) neurons already yield decent results across all test sets. These results suggest that the activations of safety neurons encode more information about the safety of the model’s outputs than those of randomly sampled neurons, and that this information transfers across datasets. Additionally, random neurons with the same layer distribution as safety neurons are more effective than those sampled from other layers, indicating that the layer-wise distribution of safety neurons may itself encode safety-related information.

14 Limitations↩︎

Our research has some limitations. First, although safety neurons can enhance the safety of unaligned models, this requires neuron activations from already aligned models. Exploring training-free methods to obtain these activations is an interesting research direction. Second, we used (IA)\(^3\) for alignment, but real-world models often undergo full-parameter fine-tuning. The impact of this difference on safety neurons is unknown, though a previous study suggests that during DPO alignment, many toxicity-related neuron parameters remain largely unchanged, with DPO primarily suppressing the activations of these neurons [55]. Finally, we identified which neurons affect model safety but not how they exert this influence, which we leave for future research.

15 Broader Impacts↩︎

This work contributes to the understanding of safety alignment in LLMs, potentially guiding the development of safer and more controllable models. By providing insights into the mechanisms behind safety neurons, our research may help improve alignment techniques and safeguard applications against generating harmful content. However, these findings could also be misused to weaken alignment constraints, enabling more covert adversarial attacks. Additionally, the reliance on specific datasets and evaluation methods may introduce biases that impact generalizability. To mitigate these risks, we adhere to open-source licensing policies and commit to sharing our code and data to foster transparency and responsible research.

References↩︎

[1]
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. URL https://arxiv.org/pdf/2307.09288.pdf.
[2]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://arxiv.org/pdf/2303.08774.pdf.
[3]
A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. URL https://arxiv.org/pdf/2401.04088.pdf.
[4]
G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. URL https://arxiv.org/pdf/2312.11805.pdf.
[5]
D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022. URL https://arxiv.org/pdf/2209.07858.pdf.
[6]
M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024. URL https://arxiv.org/pdf/2402.04249.pdf.
[7]
X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang. “Do Anything Now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023. URL https://arxiv.org/pdf/2308.03825.pdf.
[8]
Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. URL https://arxiv.org/pdf/2204.05862.
[9]
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf.
[10]
K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024. URL https://arxiv.org/pdf/2402.01306.
[11]
Y. Huang, S. Gupta, M. Xia, K. Li, and D. Chen. Catastrophic jailbreak of open-source llms via exploiting generation. In The Twelfth International Conference on Learning Representations, 2023. URL https://openreview.net/pdf?id=r42tSSCHPh.
[12]
X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y. Wang, X. Zhao, and D. Lin. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949, 2023. URL https://arxiv.org/pdf/2310.02949.
[13]
N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.
[14]
J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33: 12388–12401, 2020. URL https://proceedings.neurips.cc/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf.
[15]
K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022. URL https://arxiv.org/pdf/2202.05262.pdf.
[16]
K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. In The Eleventh International Conference on Learning Representations, 2022. URL https://openreview.net/pdf?id=NpsVSN6o4ul.
[17]
M. Hanna, O. Liu, and A. Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Advances in Neural Information Processing Systems, 36, 2024. URL https://openreview.net/pdf?id=p4PckNQR8k.
[18]
A. Geiger, Z. Wu, C. Potts, T. Icard, and N. Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning, pages 160–187. PMLR, 2024. URL https://proceedings.mlr.press/v236/geiger24a/geiger24a.pdf.
[19]
D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, 2022. URL https://aclanthology.org/2022.acl-long.581.pdf.
[20]
X. Wang, K. Wen, Z. Zhang, L. Hou, Z. Liu, and J. Li. Finding skill neurons in pre-trained transformer-based language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11132–11152, 2022. URL https://aclanthology.org/2022.emnlp-main.765.pdf.
[21]
W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. URL https://openreview.net/pdf?id=JYs1R9IMJr.
[22]
W. Gurnee, T. Horsley, Z. C. Guo, T. R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, and D. Bertsimas. Universal neurons in gpt2 language models. arXiv preprint arXiv:2401.12181, 2024. URL https://arxiv.org/pdf/2401.12181.
[23]
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023. URL https://arxiv.org/pdf/2310.06825.pdf.
[24]
G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, P. G. Sessa, A. Chowdhery, A. Roberts, A. Barua, A. Botev, A. Castro-Ros, A. Slone, A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo, C. Crepy, and et al. Gemma: Open models based on gemini research and technology, 2024. URL https://arxiv.org/pdf/2403.08295.pdf.
[25]
Qwen Team: A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
[26]
B. Wei, K. Huang, Y. Huang, T. Xie, X. Qi, M. Xia, P. Mittal, M. Wang, and P. Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/pdf?id=K6xxnKN2gm.
[27]
Y. Zhao, W. Zhang, Y. Xie, A. Goyal, K. Kawaguchi, and M. Shieh. Understanding and enhancing safety mechanisms of LLMs via safety-specific neuron. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=yR47RmND1m.
[28]
A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. URL https://arxiv.org/pdf/2112.00861.
[29]
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730–27744, 2022. URL https://papers.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
[30]
H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023. URL https://arxiv.org/pdf/2312.06674.
[31]
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby, R. Larson, S. Ringer, S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. R. Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, S. McCandlish, T. Brown, and J. Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022. URL https://arxiv.org/pdf/2212.08073.pdf.
[32]
L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023. URL https://arxiv.org/pdf/2310.16944.pdf.
[33]
H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023. URL https://arxiv.org/pdf/2311.10702.pdf.
[34]
S. Sun, D. Gupta, and M. Iyyer. Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of rlhf. arXiv preprint arXiv:2309.09055, 2023. URL https://arxiv.org/pdf/2309.09055.
[35]
C.-Y. Hsu, Y.-L. Tsai, C.-H. Lin, P.-Y. Chen, C.-M. Yu, and C.-Y. Huang. Safe lora: the silver lining of reducing safety risks when fine-tuning large language models. arXiv preprint arXiv:2405.16833, 2024. URL https://arxiv.org/pdf/2405.16833.
[36]
Y. Li, S. Han, and S. Ji. Vb-lora: Extreme parameter efficient fine-tuning with vector banks. arXiv preprint arXiv:2405.15179, 2024. URL https://arxiv.org/pdf/2405.15179.
[37]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[38]
M. Geva, R. Schuster, J. Berant, and O. Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021. URL https://aclanthology.org/2021.emnlp-main.446.pdf.
[39]
W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
[40]
Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. Chandu, D. Wadden, K. MacMillan, N. A. Smith, I. Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems, 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ec6413875e4ab08d7bc4d8e225263398-Paper-Datasets_and_Benchmarks.pdf.
[41]
H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35: 1950–1965, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/0cde695b83bd186c1fd456302888454c-Paper-Conference.pdf.
[42]
J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y. Wang, and Y. Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf.
[43]
J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw.
[44]
W. Liu, X. Wang, M. Wu, T. Li, C. Lv, Z. Ling, J. Zhu, C. Zhang, X. Zheng, and X. Huang. Aligning large language models with human preferences through representation engineering. arXiv preprint arXiv:2312.15997, 2023. URL https://arxiv.org/pdf/2312.15997.
[45]
S. Duan, X. Yi, P. Zhang, T. Lu, X. Xie, and N. Gu. Negating negatives: Alignment without human positive samples via distributional dispreference optimization. arXiv preprint arXiv:2403.03419, 2024. URL https://arxiv.org/pdf/2403.03419.
[46]
L. Kong, H. Wang, W. Mu, Y. Du, Y. Zhuang, Y. Zhou, Y. Song, R. Zhang, K. Wang, and C. Zhang. Aligning large language models with representation editing: A control perspective. arXiv preprint arXiv:2406.05954, 2024. URL https://arxiv.org/pdf/2406.05954.
[47]
Z. Zhou, H. Yu, X. Zhang, R. Xu, F. Huang, K. Wang, Y. Liu, J. Fang, and Y. Li. On the role of attention heads in large language model safety. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/pdf?id=h0Ak8A5yqw.
[48]
M. Sun, Z. Liu, A. Bair, and J. Z. Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/pdf?id=PxoFut3dWW.
[49]
T. McGrath, M. Rahtz, J. Kramar, V. Mikulik, and S. Legg. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
[50]
S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models, 2016. URL https://arxiv.org/pdf/1609.07843.pdf.
[51]
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/pdf?id=d7KBjmI3GmQ.
[52]
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
[53]
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, and J. Wei. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, Toronto, Canada, July 2023. Association for Computational Linguistics. . URL https://aclanthology.org/2023.findings-acl.824.
[54]
S. Lin, J. Hilton, and O. Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022. URL https://aclanthology.org/2022.acl-long.229.pdf.
[55]
A. Lee, X. Bai, I. Pres, M. Wattenberg, J. K. Kummerfeld, and R. Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024. URL https://arxiv.org/pdf/2401.01967.
[56]
N. Lambert, V. Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024. URL https://arxiv.org/pdf/2403.13787.
[57]
K. Ethayarajh, Y. Choi, and S. Swayamdipta. Understanding dataset difficulty with \(\mathcal{V}\)-usable information. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5988–6008. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ethayarajh22a.html.
[58]
N. Lambert, L. Tunstall, N. Rajani, and T. Thrush. Huggingface h4 stack exchange preference dataset, 2023. URL https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
[59]
Y. Qi, H. Peng, X. Wang, B. Xu, L. Hou, and J. Li. Adelie: Aligning large language models on information extraction. arXiv preprint arXiv:2405.05008, 2024. URL https://arxiv.org/pdf/2405.05008.
[60]
N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022. URL https://arxiv.org/pdf/2209.10652.
[61]
Y. Yang, F. Sondej, H. Mayne, and A. Mahdi. Ablation is not enough to emulate dpo: How neuron dynamics drive toxicity reduction. arXiv preprint arXiv:2411.06424, 2024. URL https://arxiv.org/abs/2411.06424.
[62]
A. Stolfo, B. Wu, W. Gurnee, Y. Belinkov, X. Song, M. Sachan, and N. Nanda. Confidence regulation neurons in language models. Advances in Neural Information Processing Systems, 37: 125019–125049, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/e21955c93dede886af1d0d362c756757-Paper-Conference.pdf.
[63]
A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023. URL https://arxiv.org/pdf/2310.01405.pdf.
[64]
C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K.-W. Chang, M. Huang, and N. Peng. On prompt-driven safeguarding for large language models. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/pdf?id=ugxGpOEkox.
[65]
A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Rimsky, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024. URL https://arxiv.org/abs/2406.11717.
[66]
B. Højer, O. S. Jarvis, and S. Heinrich. Improving reasoning performance in large language models via representation engineering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/pdf?id=IssPhpUsKt.
[67]
S. Li, L. Yao, L. Zhang, and Y. Li. Safety layers in aligned large language models: The key to llm security. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/pdf?id=kUH1yPMAn7.
[68]
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ac662d74829e4407ce1d126477f4a03a-Paper-Conference.pdf.
[69]
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, Oct. 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[70]
S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
[71]
L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, and S. Huang. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
[72]
N. Nanda and J. Bloom. Transformerlens. https://github.com/TransformerLensOrg/TransformerLens, 2022.
[73]
A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, J. Z. Kolter, M. Fredrikson, and D. Hendrycks. Improving alignment and robustness with circuit breakers. Advances in Neural Information Processing Systems, 37: 83345–83373, 2025. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/97ca7168c2c333df5ea61ece3b3276e1-Paper-Conference.pdf.
[74]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: 2825–2830, 2011. URL https://jmlr.csail.mit.edu/papers/volume12/pedregosa11a/pedregosa11a.pdf.

  1. \(^*\) indicates equal contribution.↩︎

  2. \(^\dagger\) Corresponding author: L. Hou (houlei@tsinghua.edu.cn).↩︎

  3. We use a threshold of 0 to distinguish whether the generation is harmful or not.↩︎

  4. The skewness values of Llama2-7b, Mistral-7b-v0.1, and Gemma-7b are \(6.99\), \(7.20\), and \(19.89\), respectively.↩︎