September 05, 2025
Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
Figure 1: Example of shortcut learning in sentiment classification, where a classification model \(f_\theta\) wrongly associates reviews about Food with a positive sentiment.
With the rapid advancement of artificial intelligence, pre-trained language models (PLMs) have been widely adopted across various domains, including education, healthcare, and e-commerce [1]–[3]. A predominant strategy for applying these models is fine-tuning, where a PLM is further adapted to task-specific data, aiming to enhance its performance or better align with human intent [4]. However, fine-tuning often exposes models to dataset biases, leading to shortcuts—spurious correlations between features and labels [5]. For instance, [6] demonstrated that on the Yelp dataset [7], a LLaMA2-based [8] sentiment classifier mistakenly associated the concept of “food” with a “positive” label. These fragile dependencies not only limit the robustness of PLMs but also pose significant risks. In medical diagnosis, a biased detector might incorrectly associate certain biological attributes with diseases, leading to inaccurate predictions [9]. Similarly, in automated recruitment systems, a shortcut may lead the system to favor applicants with certain demographic attributes, exacerbating fairness problems. In Figure 1, we present an example where the classifier incorrectly associates the concept of food with positive sentiment.
Contemporary debiasing research primarily focuses on two strategies: (1) modifying shortcut-inducing terms in training data [10], [11], and (2) generating counterfactual samples [6], [12] via large language models (LLMs). However, both approaches suffer from notable limitations. Lexical modification requires prior knowledge of shortcut-inducing terms, which is often challenging to obtain [13]. Moreover, its effectiveness is restricted to lexical shortcuts rather than conceptual biases. On the other hand, LLM-based counterfactual generation is computationally expensive and significantly increases training costs, while LLM-free counterfactual generation still relies on prior knowledge [14], making it similarly constrained.
As an unsupervised and lightweight solution, we propose CURE—Controlled Unlearning for Robust Embeddings. CURE remaps the semantic space to disentangle conceptual and content-related information without human annotation, offering fine-grained control over shortcut effects. It first trains a content extractor using a concept classifier and back-translation to produce concept-irrelevant representations. A contrastive learning-based debiasing module then refines sample representations, adjusting conceptual features as needed. Finally, the module is jointly trained with a classification head to enhance model robustness.
Unlike traditional approaches, CURE offers three key advantages: Prior Knowledge Independence – CURE uses unsupervised learning, eliminating the need for manual annotations of shortcuts. Resource Efficiency – CURE eliminates the need for LLM-driven data augmentation, reducing training time to approximately one-tenth of the original. Controllability – CURE can quantify the impact of conceptual bias on classification results. This facilitates both the mitigation of conceptual biases to enhance performance on out-of-distribution (OOD) data and the exploitation of shortcuts to improve performance on independent and identically distributed (i.i.d.) data. Such adaptability enables users to align training objectives with their generalization requirements, while also providing a quantifiable framework for future debiasing research.
Our contributions are as follows:
We propose a novel conceptual debiasing approach named CURE. It mitigates shortcuts without relying on prior knowledge or data augmentation, reducing training time to one-tenth of that required by LLM-driven methods. Furthermore, CURE is highly adaptable and can be seamlessly integrated with any mainstream PLM.
CURE enables precise control over the impact of shortcuts. It mitigates conceptual biases to enhance robustness against distribution shifts. Conversely, in scenarios where shortcuts align well with the target task, e.g., i.i.d. data, it leverages them to improve classification accuracy. This adaptability allows CURE to balance robustness and accuracy based on specific generalization requirements.
We evaluate CURE across two benchmark datasets and three PLMs. Experimental results indicate that on the IMDB dataset, the RoBERTa-based CURE achieves an approximately 5-point improvement in accuracy over an LLM-driven debiasing approach and outperforms the baseline by about 10 points in F1 score, demonstrating its effectiveness in mitigating conceptual shortcuts.
Addressing spurious correlations in PLMs has become a critical research focus, as these correlations can lead to biased and unreliable predictions, limiting model robustness and fairness. Traditional works have explored various strategies to mitigate these issues, including causal inference techniques [15], adversarial training [16], and data augmentation methods designed to reduce model reliance on spurious features [17]. Additionally, approaches leveraging counterfactual reasoning [18] have shown promise in improving fairness and robustness in LLMs. These advancements collectively contribute to a growing body of research aimed at developing more reliable and ethically sound language models.
[19] address the challenge of models learning spurious topical shortcuts instead of relevant features in tasks like native language identification. They introduce an adversarial model to demote these latent topical confounds using log-odds ratios, guiding the model to focus on stylistic rather than topic-based features. [20] enhance robustness by fine-tuning models on “forgettable” examples that models initially misclassified. [21] tackle the issue of natural language inference (NLI) models relying on superficial hypothesis patterns by using an ensemble of adversarial classifiers. [22] propose using treatment effect estimation to distinguish genuine correlations from spurious ones, such as associating “Spielberg” with positive sentiment in movie reviews. [23] extend this concept with an automated framework using interpretability techniques, cross-dataset stability, and knowledge-aware perturbation to identify spurious tokens at scale. [24] explore how pre-trained models like BERT handle spurious correlations, finding that they improve robustness by generalizing from minority counterexamples; the authors propose using multi-task learning (MTL) with auxiliary tasks to enhance robustness when these counterexamples are scarce. [25] propose Less-Learn-Shortcut (LLS), which down-weights examples with high correlations between specific words and labels. [18] present a counterfactual debiasing approach that balances predictions between claim-only and claim-evidence models to reduce bias associated with claim patterns. While these studies primarily address general spurious correlations, recent research has started focusing on spurious correlations at the concept level.
[6] introduce the notion of concept-level bias in NLP, highlighting how language models often rely on broad associative patterns rather than deeper semantic understanding. For instance, models may learn to associate certain concepts, such as “food”, with inherently positive sentiment, leading to spurious correlations that degrade generalization performance. To mitigate this issue, the authors leverage an LLM to generate counterfactual data that rebalances label distributions, thereby reducing reliance on such superficial cues.
However, this approach presents certain limitations in terms of scalability. Specifically, generating counterfactual data for each new task requires substantial manual intervention, as it involves defining relevant concept-level biases and ensuring the generated data maintains both linguistic plausibility and task relevance. Even with advanced LLMs like ChatGPT, this process remains resource-intensive, particularly for large-scale or multi-domain applications. Additionally, the effectiveness of this method depends on the quality and diversity of the generated counterfactuals, which can vary depending on the prompt design and the inherent biases present in the language model used for data generation. These challenges underscore the need for more automated, generalizable approaches to mitigating concept-level biases in NLP.
We are given a set of i.i.d. labeled documents \(D = \{d_1, \ldots, d_N\}\), where each sample \(d_i\) is associated with a conceptual label \(c_i \in \mathcal{C}\) and a classification label \(y_i \in \mathcal{Y}\). We assume that the classification labels are balanced, while the conceptual labels are biased. That is, for every label \(y \in \mathcal{Y}\), the number of samples in \(D\) with label \(y\) is equal: \[\forall\, y \in \mathcal{Y}, \quad \left| \{ d_i \in D \mid y_i = y \} \right| = \frac{N}{|\mathcal{Y}|}.\] The distribution of conceptual labels is uneven: \[\begin{gather} \exists\, c, c' \in \mathcal{C} \text{ such that }\\ \left| \{ d_i \in D \mid c_i = c \} \right| \neq \left| \{ d_i \in D \mid c_i = c' \} \right|. \end{gather}\] Here, \(\mathcal{C}\) is correlated with, but not causally related to, \(\mathcal{Y}\), i.e., \(\mathcal{C}\not\perp \!\!\!\perp\mathcal{Y}\), but \(\mathcal{C}\nrightarrow\mathcal{Y}\). We first transform the samples in \(D\) into their semantic embeddings \(\mathcal{X}= \{x_1, \ldots, x_N\}\subseteq\mathbb{R}^u\) using a PLM, then optimize a classification head \(f_\theta\) with parameters \(\theta\) mapping \(\mathcal{X}\rightarrow\mathcal{Y}\) by minimizing a classification loss \(\ell\): \[\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell \big( f_\theta(x_i), y_i \big). \label{eq:ce}\tag{1}\] However, due to the bias between \(\mathcal{Y}\) and \(\mathcal{C}\), the model may erroneously associate \(c_i\) with \(y_i\), thereby losing its robustness. Our primary objective is to enhance the robustness of \(f_\theta\), measured by its classification accuracy on a conceptually balanced OOD test set.
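For concreteness, the following PyTorch sketch (our illustration, not code released with the paper) trains such a classification head on frozen PLM embeddings by minimizing the cross-entropy of Eq. 1; the embedding width of 768 and the binary label space are assumptions.

```python
# Minimal sketch of Eq. (1): fit a classification head f_theta on frozen PLM
# embeddings x_i by minimizing cross-entropy.  Sizes are illustrative.
import torch
import torch.nn as nn

u, num_labels = 768, 2                       # embedding size, |Y| (assumed)
f_theta = nn.Linear(u, num_labels)           # classification head
optimizer = torch.optim.AdamW(f_theta.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """x: (B, u) PLM embeddings, y: (B,) class labels."""
    logits = f_theta(x)
    loss = criterion(logits, y)               # l(f_theta(x_i), y_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```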
Due to the lack of available conceptual annotations in classification datasets and the demonstrated capability of LLMs to perform text annotation [26], we employ the standard text conceptual annotation pipeline outlined in [6] by using GPT-4o [4].
Specifically, we preprocess \(D\) with the following three steps:
Data Cleaning: We remove uninformative content, including non-ASCII characters and irrelevant metadata from texts.
Concept Labeling: We design structured prompts (see Appendix 6.2) and input them into GPT-4o to label each sample \(d_i\) with a concept \(c_i\).
Meta-Concept Merging: The generated concepts are then automatically categorized and merged by GPT-4o into a meta-concept set \(\mathcal{C}\).
After obtaining the concept set \(\mathcal{C}\), we compute the mutual information between a concept \(c\) and \(\mathcal{Y}\) to quantify the bias of that concept: \[I(c; \mathcal{Y}) = \sum_{y \in \mathcal{Y}} P(c, y) \log \frac{P(c, y)}{P(c) P(y)}.\] Subsequently, we select the samples carrying the \(k\) concepts with the highest \(I(c; \mathcal{Y})\) as the training set used to construct a biased benchmark. We treat the samples carrying the \(k\) concepts with the lowest mutual information as OOD data approximating the real-world distribution, used to evaluate our debiasing method.
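This selection step can be illustrated with a short sketch that estimates \(I(c; \mathcal{Y})\) from empirical counts and ranks concepts by bias; the function and variable names below are hypothetical.

```python
# Sketch: estimate I(c; Y) from empirical counts and rank concepts by bias.
# `samples` is a hypothetical list of (concept, label) pairs.
import math
from collections import Counter

def concept_mutual_information(samples, concept):
    n = len(samples)
    p_c = sum(c == concept for c, _ in samples) / n
    label_counts = Counter(y for _, y in samples)
    joint_counts = Counter(y for c, y in samples if c == concept)
    mi = 0.0
    for y, n_cy in joint_counts.items():
        p_cy = n_cy / n                        # P(c, y)
        p_y = label_counts[y] / n              # P(y)
        mi += p_cy * math.log(p_cy / (p_c * p_y))
    return mi

def top_k_biased_concepts(samples, k):
    concepts = {c for c, _ in samples}
    return sorted(concepts,
                  key=lambda c: concept_mutual_information(samples, c),
                  reverse=True)[:k]
```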
To mitigate the impact of conceptual biases, we first extract concept-irrelevant content representations from a semantic embedding \(x\). To achieve this, we freeze the parameters of the PLM and attach a lightweight network \(f_\phi\) to its output layer as a content extractor. Our objective is to remove as much concept-related information as possible while retaining as much content-related information as possible. Accordingly, the training loss consists of two components: a concept dropout loss and a content retention loss.
We first train a concept classifier to quantify the retention of concept-related features in \(\mathcal{X}\). This classifier consists of a classification head \(f_\omega\) built on the same PLM as the task classifier \(f_\theta\). We optimize the parameters \(\omega\) by maximizing the probability of predicting \(\mathcal{C}\) from \(\mathcal{X}\): \[\omega^* = \arg\min_{\omega} \frac{1}{N} \sum_{i=1}^{N} \ell \big( f_\omega(x_i), c_i \big), \label{eq:concept}\tag{2}\] where \(\ell\) is the cross-entropy loss, defined as \[\ell \big( f_\omega(x_i), c_i \big) = - \log P(c_i\mid x_i; \omega),\] and \(P(c_i \mid x_i; \omega)\) denotes the predicted probability of concept \(c_i\) given input \(x_i\), obtained from the softmax output of \(f_\omega\).
We expect the conceptual information in \(x\) to be filtered out after transformation by the content extraction function \(f_\phi\). To enforce this constraint, we compute the Kullback-Leibler (KL) divergence between the predicted distribution of the concept classifier \(f_\omega\) and a uniform distribution over \(\mathcal{C}\) as the training loss \(\mathcal{L}_\text{concept}(\phi)\), as shown in Eq. 3.
\[\mathcal{L}_\text{concept}(\phi) = \sum_{c \in \mathcal{C}} P(c \mid f_\phi(x); \omega) \log \left( \frac{P(c \mid f_\phi(x); \omega)}{(1 / |\mathcal{C}|)^\tau} \right), \label{eq:kl_loss}\tag{3}\] where \(\tau\) is a temperature parameter that controls the strength of the distribution alignment.
By training \(f_\phi\) with this loss, we force the concept classifier \(f_\omega\) to produce maximally uncertain predictions, indicating the absence of learnable conceptual information.
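A minimal sketch of Eq. 3 follows (our illustration), assuming a concept classifier head \(f_\omega\) already trained with Eq. 2 and a deliberately simplified one-layer content extractor \(f_\phi\).

```python
# Sketch of Eq. (3): the concept-dropout loss pushes the frozen concept
# classifier towards maximally uncertain predictions on f_phi(x).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

u, num_concepts, tau = 768, 10, 1.0           # illustrative sizes / temperature
f_phi = nn.Sequential(nn.Linear(u, u), nn.LayerNorm(u))   # simplified extractor
f_omega = nn.Linear(u, num_concepts)          # concept head, assumed trained via Eq. (2)
for p in f_omega.parameters():                # freeze the concept classifier
    p.requires_grad_(False)

def concept_dropout_loss(x):
    """KL-style divergence between P(c | f_phi(x); omega) and a tempered uniform."""
    probs = F.softmax(f_omega(f_phi(x)), dim=-1)
    log_uniform = tau * math.log(1.0 / num_concepts)        # log (1/|C|)^tau
    kl = probs * (probs.clamp_min(1e-12).log() - log_uniform)
    return kl.sum(dim=-1).mean()
```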
The semantic features are often entangled with each other [27]. As a result, although the content extractor \(f_\phi\) aims solely at filtering out conceptual information, it is crucial to ensure that the concept-irrelevant information remains intact. Inspired by back-translation in machine translation [28], we construct a reversal network \(f_{\hat{\phi}}\) with the same architecture as \(f_\phi\). It is designed to reconstruct \(x\) from \(f_\phi(x)\), ensuring that the mapping function \(f_\phi\) does not excessively discard concept-irrelevant information. We first freeze \(\phi\), then use the following loss to train \(\hat{\phi}\): \[\mathcal{L}(\hat{\phi}) = \left\| f_{\hat{\phi}}\left(f_\phi(x)\right) - x \right\|_2^2. \label{eq:reconstruction_loss}\tag{4}\] Next, we freeze the parameters of \(\hat{\phi}\). During the training of \(\phi\), we use the reversal network to remap \(f_\phi(x)\) back to \(x\) and compute the mean squared error between them as a content retention loss: \[\mathcal{L}_{\text{content}}(\phi) = \| f_{\hat{\phi}}(f_\phi(x)) - x \|^2_2.\] Finally, we combine \(\mathcal{L}_{\text{content}}\) and \(\mathcal{L}_{\text{concept}}\) with a weighted summation to form the overall loss for training \(f_\phi\), as shown in Eq. 5: \[\mathcal{L}(\phi) = \mathcal{L}_{\text{concept}}(\phi) + \lambda \mathcal{L}_{\text{content}}(\phi), \label{eq:phi}\tag{5}\] where \(\lambda\) is a weight controlling the relative importance of content retention.
Finally, we alternately train \(f_\phi\) and \(f_{\hat{\phi}}\) so that \(f_{\hat{\phi}}\) can effectively track the retention of concept-irrelevant information by \(f_\phi\). Together, \(\phi\) and \(\hat{\phi}\) minimize conceptual information while maximizing content information, forming an information bottleneck [29].
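Continuing the previous sketch (reusing `f_phi` and `concept_dropout_loss`), the alternating scheme might be implemented as below; the learning rate of 1e-4 follows the reported setting, while the single-step alternation and \(\lambda = 1\) are our assumptions.

```python
# Alternate between Eq. (4) (train the reversal network f_phi_hat) and
# Eq. (5) (train the content extractor f_phi with the combined loss).
import copy
import torch

f_phi_hat = copy.deepcopy(f_phi)              # same architecture as f_phi
opt_phi = torch.optim.AdamW(f_phi.parameters(), lr=1e-4)
opt_phi_hat = torch.optim.AdamW(f_phi_hat.parameters(), lr=1e-4)
lam = 1.0                                     # weight on content retention (assumed)

def alternating_step(x):
    # (1) Eq. (4): update f_phi_hat to reconstruct x from a detached f_phi(x)
    recon = f_phi_hat(f_phi(x).detach())
    loss_hat = ((recon - x) ** 2).sum(dim=-1).mean()
    opt_phi_hat.zero_grad(); loss_hat.backward(); opt_phi_hat.step()

    # (2) Eq. (5): update f_phi; f_phi_hat's parameters are not updated here
    loss_content = ((f_phi_hat(f_phi(x)) - x) ** 2).sum(dim=-1).mean()
    loss_phi = concept_dropout_loss(x) + lam * loss_content
    opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()
    return loss_hat.item(), loss_phi.item()
```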
After training, the content extractor \(f_\phi\) maximizes the retention of content while minimizing the retention of conceptual information, avoiding conceptual shortcuts in further training. We denote \(f_\phi(x)\), the content representation of \(x\), as \(x_\text{cont}\).
Although \(x_{\text{cont}}\) can replace the original embedding \(x\) to mitigate the conceptual bias, we further argue that eliminating conceptual shortcuts is not always beneficial. Theoretically, we identify two special cases where preserving conceptual biases could be advantageous: (1) when the conceptual bias aligns with human intent, and (2) when the application scenario is constrained, where the optimization objective is limited to i.i.d. data.
When the imbalance of conceptual attributes aligns with natural human intent, the shortcuts should be enhanced. For example, in the movie review dataset IMDB [30], most reviews labeled by GPT-4o as containing the conceptual attribute of “humor” are positive. This observation is consistent with psychological studies on the relation between language style and sentiment, which suggest that humorous expression tends to be associated with positive emotion [31]. Furthermore, in application scenarios where the training and deployment distributions are identical, real-world data carry the same distributional bias. For instance, in clinical medicine, a model trained on electronic health records collected from a specific hospital is often deployed in the same environment [32], classifying text with similar biases during training and inference. In such cases, reinforcing shortcuts can also improve classification performance in deployment.
To achieve flexible control over shortcut exploitation, we introduce a lightweight feedforward network \(f_\psi\) on top of the frozen content extractor \(f_\phi\) and the PLM. This network maps both the original embedding \(x\) and its content representation \(x_{\text{cont}}\) into a conceptually controlled semantic space \(\mathcal{X}_\text{CURE}\subseteq\mathbb{R}^u\). We then employ contrastive learning to regulate their cosine similarity in this space. The training losses for removing the conceptual shortcut, \(\mathcal{L}_{\text{r}}(\psi)\), and for enhancing it, \(\mathcal{L}_{\text{e}}(\psi)\), are defined as follows:
\[\mathcal{L}_{\text{r}}=\max \left( 0, 1-\cos(f_\psi{(x)}, f_\psi{(x_{\text{cont}})})-\text{M} \right), \label{eq:margin1}\tag{6}\]
\[\mathcal{L}_{\text{e}}=\max \left( 0, \cos(f_\psi{(x)}, f_\psi{(x_{\text{cont}})})-\text{M} \right), \label{eq:margin2}\tag{7}\] where \(\text{M}\in[0,1]\) is a margin that controls the degree of conceptual information retention.
A smaller margin \(\text{M}\) enforces a stricter optimization objective. In the removal loss \(\mathcal{L}_{\text{r}}\), decreasing \(\text{M}\) compels \(f_\psi{(x)}\) and \(f_\psi{(x_{\text{cont}})}\) to be nearly identical, ensuring the complete removal of conceptual information. Conversely, in the enhancement loss \(\mathcal{L}_{\text{e}}\), a smaller \(\text{M}\) forces \(f_\psi{(x)}\) and \(f_\psi{(x_{\text{cont}})}\) to be maximally separated, thereby amplifying the influence of conceptual features. By adjusting \(\text{M}\), we can flexibly control the extent to which conceptual information is retained in \(f_\psi{(x)}\).
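Written directly from Eq. 6 and Eq. 7, the two margin losses can be sketched as follows (our own rendering):

```python
# Sketch of Eqs. (6)-(7): hinge losses on the cosine similarity between the
# remapped original embedding f_psi(x) and the remapped content f_psi(x_cont).
import torch
import torch.nn.functional as F

def removal_loss(z, z_cont, margin):
    """L_r: pull f_psi(x) towards f_psi(x_cont); active while cos < 1 - M."""
    cos = F.cosine_similarity(z, z_cont, dim=-1)
    return torch.clamp(1.0 - cos - margin, min=0.0).mean()

def enhancement_loss(z, z_cont, margin):
    """L_e: push f_psi(x) away from f_psi(x_cont); active while cos > M."""
    cos = F.cosine_similarity(z, z_cont, dim=-1)
    return torch.clamp(cos - margin, min=0.0).mean()
```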
Finally, we replace the original embedding \(x\) with \(f_\psi{(x)}\), as the input to the classifier \(f_\theta\) and jointly train \(f_\theta\) and \(f_\psi\) using Equation 1 . The trained model can flexibly adjust the extent of conceptual bias retention based on the training objective, making it either more robust or more specialized, as shown in Fig. 2. In terms of parameter efficiency, CURE introduces only a lightweight content extractor and feedforward network on top of the original classifier, ensuring minimal computational overhead.
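Putting the pieces together, a plausible final training step is sketched below (our reconstruction, reusing \(f_\phi\), \(f_\theta\), and the margin losses from the earlier sketches); combining the margin loss with the task loss in a single objective is our assumption about how the two are trained jointly.

```python
# Joint training of the debiasing module f_psi and the task head f_theta
# (Eq. 1), with the PLM and the content extractor f_phi kept frozen.
import torch
import torch.nn as nn

f_psi = nn.Sequential(nn.Linear(u, u), nn.SiLU(), nn.Linear(u, u))   # simplified stand-in
joint_opt = torch.optim.AdamW(list(f_psi.parameters()) + list(f_theta.parameters()),
                              lr=3e-4)
ce = nn.CrossEntropyLoss()

def cure_step(x, y, margin=0.1, debias=True):
    x_cont = f_phi(x).detach()                 # frozen content extractor
    z, z_cont = f_psi(x), f_psi(x_cont)
    ctrl = removal_loss(z, z_cont, margin) if debias \
           else enhancement_loss(z, z_cont, margin)
    loss = ce(f_theta(z), y) + ctrl            # task loss (Eq. 1) + control loss
    joint_opt.zero_grad(); loss.backward(); joint_opt.step()
    return loss.item()
```

The `margin` argument is the hyperparameter \(\text{M}\) of Eq. 6 and Eq. 7; the default of 0.1 is arbitrary.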
Table 1: Accuracy (ACC) and F1 of CURE and the baselines on the i.i.d. and OOD test sets of IMDB and Yelp across three PLMs.

| Setting | Method | IMDB DistilBERT ACC \(\uparrow\) | F1 \(\uparrow\) | IMDB MPNet ACC \(\uparrow\) | F1 \(\uparrow\) | IMDB RoBERTa ACC \(\uparrow\) | F1 \(\uparrow\) | Yelp DistilBERT ACC \(\uparrow\) | F1 \(\uparrow\) | Yelp MPNet ACC \(\uparrow\) | F1 \(\uparrow\) | Yelp RoBERTa ACC \(\uparrow\) | F1 \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i.i.d. | Baseline | 84.00 | 85.05 | 87.33 | 86.94 | 88.50 | 89.27 | 94.75 | 94.76 | 92.75 | 93.11 | 93.75 | 93.51 |
| i.i.d. | FL | 83.70 | 82.00 | 87.50 | 87.32 | 88.67 | 88.90 | 92.25 | 92.54 | 93.75 | 95.17 | 93.50 | 93.00 |
| i.i.d. | RAZOR | 83.25 | 83.00 | 87.00 | 86.50 | 85.33 | 83.19 | 95.50 | 95.32 | 93.50 | 94.83 | 92.50 | 93.00 |
| i.i.d. | CURE | 85.50 | 85.48 | 88.83 | 88.78 | 89.67 | 89.77 | 95.25 | 95.25 | 95.00 | 95.00 | 94.75 | 94.63 |
| OOD | Baseline | 81.67 | 82.20 | 79.33 | 80.19 | 78.83 | 74.85 | 89.75 | 90.44 | 89.00 | 88.30 | 89.25 | 89.64 |
| OOD | FL | 81.33 | 82.25 | 79.00 | 76.75 | 79.33 | 76.70 | 90.25 | 89.53 | 90.25 | 89.40 | 89.00 | 89.52 |
| OOD | RAZOR | 80.83 | 81.30 | 79.00 | 79.33 | 78.67 | 77.70 | 90.75 | 90.60 | 90.75 | 89.26 | 89.50 | 89.76 |
| OOD | CURE | 84.00 | 84.36 | 81.50 | 81.22 | 83.50 | 84.51 | 92.00 | 92.12 | 90.75 | 90.68 | 91.50 | 91.33 |
We used the IMDB [30] and Yelp [33] datasets. The IMDB movie review dataset is a binary sentiment analysis dataset consisting of 50,000 positive or negative reviews from the Internet Movie Database. The Yelp dataset, provided by the Yelp Dataset Challenge, contains business reviews labeled with ratings ranging from 0 to 4 [33]. We used the version cleaned and organized by [34].
Based on the concepts labeled in Section 3.2, we divided the samples in each dataset into two groups for i.i.d. and OOD testing:
Group A contains imbalanced concept distributions, where certain concepts are overrepresented in one task-relevant category, but the overall number of samples across task-relevant labels remains equal. Samples in Group A are split into a biased training set and an i.i.d. test set.
Group B contains balanced concept distributions, where each concept has an equal number of samples across the task-relevant categories. Samples in Group B are used as the OOD test set.
As there are currently no model-based approaches targeting conceptual debiasing, we primarily compare our method with FL [35], which optimizes the loss computation for unbalanced data, and RAZOR [11], which utilizes LLMs for data debiasing. The results are shown in Table 1.
In our training, we used a mini-batch size of 16 with the AdamW optimizer [36]. The learning rate for the content extractor and the reversal network was set to 0.0001, while that for the classification heads was set to 0.0003. The concept classifier head and the task classifier head have identical structures and are based on the same PLM.
CURE is highly lightweight. Specifically, the content extractor consists of two linear layers with layer normalization and a single Transformer layer [1], each with 768 neurons, resulting in a total of approximately 1.78M parameters. Our debiasing module comprises a SwiGLU layer [37] followed by a linear layer, with a total of approximately 1.18M parameters. We compare CURE with the GPT-3.5-Turbo-based RAZOR in Table 2. Here, we calculated the average training and inference time per sample with a batch size of 16 on a single NVIDIA A100 Tensor Core GPU.
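As an illustration of the reported module sizes, the sketch below implements a SwiGLU block followed by a linear layer; the hidden width of 256 is our assumption, chosen because it roughly reproduces the reported 1.18M parameters (the exact widths of the original module are not specified).

```python
# Sketch of the debiasing module: a SwiGLU block followed by a linear layer.
# The hidden width of 256 is assumed; with H = 768 it yields ~1.18M parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

H, HIDDEN = 768, 256

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

debias_module = nn.Sequential(SwiGLU(H, HIDDEN), nn.Linear(H, H))
print(sum(p.numel() for p in debias_module.parameters()))   # ~1.18M
```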
To better understand CURE’s improvements, we analyze the model’s attention patterns in sentiment classification. Specifically, we randomly sampled a positive review from Yelp and classified it with sentiment classifiers based on DistilBERT and on CURE. We then studied their attention across individual terms, measured by Shapley values [38]. The attribution visualizations in Table 3 and Table 4 highlight these differences.
Table 2: Model scale and average per-sample training and inference time.

| Model | Scale \(\downarrow\) | Training \(\downarrow\) | Inference \(\downarrow\) |
|---|---|---|---|
| RoBERTa | 125M | \(\approx\) 11 ms | \(\approx\) 1 ms |
| RAZOR | GPT-3.5-Turbo | \(>\) 600 ms | \(\approx\) 1 ms |
| CURE | 127.96M | \(\approx\) 59 ms | \(\approx\) 1 ms |
Since the content extractor \(\phi\) is optimized by two training objectives simultaneously, i.e., \(\mathcal{L}_{\text{content}}(\phi)\) and \(\mathcal{L}_{\text{concept}}(\phi)\), we empirically demonstrated its convergence. The training curve of the content extractor is shown in Fig. 3.
CURE outperformed the baselines on nearly all metrics across both datasets, as shown in Table 1. The largest improvement comes from the RoBERTa model on the IMDB OOD test, with an increase of approximately 5 points in accuracy and 10 points in F1 score. Compared to the i.i.d. test, our model introduced a more significant improvement on the OOD test. We attribute this to the fact that the baselines already achieve relatively high accuracy on the i.i.d. test, leaving little room for further improvement. Furthermore, we observe that CURE outperforms the loss-adjustment method FL and the LLM-driven approach RAZOR. We attribute this to the fact that FL and RAZOR primarily address label- and word-level biases rather than conceptual biases. For semantic-level biases, these two methods lack mechanisms for regulating the semantic representations, making it challenging for them to improve the baselines. In contrast, CURE remaps the semantic space, enabling the controllable filtering of concept information that causes shortcuts, thereby enhancing the robustness of the baselines and boosting their OOD performance.
Our findings show that the baseline model tends to distribute attention across both sentiment-related and domain-specific words, while CURE prioritizes sentiment-expressive terms. Table 3 illustrates how the DistilBERT-based classifier assigns nearly equal importance to both “service” and “great”, which indicates a reliance on topic-specific terms rather than sentiment indicators. In contrast, Table 4 shows that CURE places stronger emphasis on “great”, which suggests it better captures the actual sentiment while reducing confounding biases.
Table 3: Shapley-based token attribution of the DistilBERT-based classifier for “[CLS] the service was great . [SEP]” (attribution heatmap not reproducible in this rendering).

Table 4: Shapley-based token attribution of the CURE-based classifier for “[CLS] the service was great . [SEP]” (attribution heatmap not reproducible in this rendering).
CURE is lightweight and efficient, as shown in Table 2. Compared to the baseline, CURE adds only 2% more parameters with nearly identical inference time. Compared to RAZOR, which is based on GPT-3.5-Turbo, CURE does not require the participation of LLMs during training, which reduces training time to approximately one-tenth of RAZOR’s. Additionally, the time complexity of the debiasing module involved in inference is \(\mathcal{O}(L \cdot H^2)\), where \(L\) represents the input length and \(H\) denotes the hidden state dimension, which matches that of the PLMs used [39]. Therefore, CURE does not alter the time complexity of the baselines. This substantially reduces both computational and time costs, enhancing the practicality and generalizability of CURE in real-world applications.
The content extractor can converge under all conditions, as shown in Fig. 3. This not only provides an experimental foundation for CURE but also indicates that the two optimization objectives employed, i.e., \(\mathcal{L}_{\text{content}}(\phi)\) and \(\mathcal{L}_{\text{concept}}(\phi)\), are not in conflict. We argue that this finding supports the view that concept information is not entirely entangled with the semantic information in the latent space, thereby offering a theoretical basis for future work on feature disentanglement.
To investigate the effect of the reversal network used in training, we conducted ablation experiments, as shown in Table 5.
Table 5: Ablation of the reversal network \(\hat{\phi}\).

| Model | Yelp ACC \(\uparrow\) | Yelp F1 \(\uparrow\) | IMDB ACC \(\uparrow\) | IMDB F1 \(\uparrow\) |
|---|---|---|---|---|
| RoBERTa (w/o \(\hat{\phi}\)) | 79.75 | 83.09 | 81.33 | 79.03 |
| RoBERTa (w/ \(\hat{\phi}\)) | 91.50 | 91.33 | 83.50 | 84.51 |
| MPNet (w/o \(\hat{\phi}\)) | 90.25 | 89.71 | 79.83 | 78.73 |
| MPNet (w/ \(\hat{\phi}\)) | 90.75 | 90.68 | 81.50 | 81.22 |
| DistilBERT (w/o \(\hat{\phi}\)) | 91.50 | 91.05 | 80.83 | 82.12 |
| DistilBERT (w/ \(\hat{\phi}\)) | 92.00 | 92.12 | 84.00 | 84.36 |
We found that removing the reversal network results in a degradation in classification accuracy, as shown in Table 5. The most significant decline was observed with the RoBERTa model on the Yelp dataset, with a decrease of approximately 12 points in accuracy and 8 points in F1 score. Our further experiments revealed that the content extractor exhibited parameter sparsity in the absence of the reversal network.
Based on these observations, we hypothesize that, without a constraint on content preservation, the content extractor maps all inputs to similar representations, causing its output to become indistinguishable to the concept classifier and thereby trivially minimizing the loss \(\mathcal{L}_\text{concept}\). In this case, the loss of robust features leaves the classifiers without sufficient useful features to learn from, leading to a decline in performance.
We demonstrated how to weaken or enhance shortcuts by adjusting the value of the margin \(M\) in Eq. 6 and Eq. 7, as shown in Fig. 4. To ensure a fair comparison, all other training parameters were held constant in this experiment.
Figure 4: The impact of the margin on classification accuracy. Panels (a) and (c) show cases of reducing shortcuts on the OOD test; panels (b) and (d) show cases of enhancing shortcuts on the i.i.d. test. a — IMDB (Debiasing), b — IMDB (Biasing), c — Yelp (Debiasing), d — Yelp (Biasing)
The margin has a controlling effect on shortcut learning, as shown in Figure 4. We observed that as \(M\) increases, the performance of all three models on the two datasets exhibits a fluctuating decline. This suggests that a higher margin makes our method more permissive in enhancing or suppressing shortcut learning, leading to a corresponding decrease in performance on both i.i.d. and OOD data. Therefore, by adjusting \(M\), CURE can quantitatively control the impact of shortcut learning on classification, providing a quantifiable reference for future debiasing research.
In this work, we introduced CURE, a novel and lightweight framework for mitigating conceptual shortcuts in pre-trained language models. CURE enables fine-grained control over conceptual bias retention by systematically disentangling concept-relevant and content-relevant representations, balancing robustness and accuracy based on task requirements. Our experiments on the IMDB and Yelp datasets demonstrate that CURE significantly improves out-of-distribution robustness, achieving up to 5-point accuracy gains and 10-point F1 gains over baselines while maintaining minimal computational overhead. Notably, CURE reduces training time by an order of magnitude compared to LLM-driven debiasing approaches, making it a scalable and efficient solution. These results highlight the potential of unsupervised conceptual debiasing for enhancing the reliability of language models while preserving critical task-relevant features.
While CURE demonstrates strong performance and computational efficiency, we acknowledge the following limitations. First, due to computational constraints, we were unable to include large-scale comparisons against debiasing baselines such as RAZOR [11] or Focal Loss [35] on newer model architectures such as LLaMA3-1B [40] and Qwen-2.5 [41]. While we conducted a preliminary evaluation on LLaMA3-1B to assess the generalization ability of CURE (see Appendix [sec:appendix_additional]), it was limited to comparisons with standard fine-tuned baselines; a comprehensive benchmarking against other debiasing approaches on these models is left to future work. Second, although CURE itself does not rely on LLM-driven data augmentation during training, we utilized large language models for a one-time concept annotation step during data preprocessing, following prior work [6]. This step does not incur additional inference cost and could be replaced with human-annotated concepts in future applications to reduce reliance on external models. Moreover, we evaluated the plausibility of these annotations through a human study (see Appendix 6.3), confirming their quality for use in downstream evaluations.
Despite these limitations, CURE remains a scalable and adaptable framework for mitigating conceptual biases in NLP models, paving the way for more robust and generalizable language understanding systems.
Here is a given movie review: {review} Identify the main concept discussed in this review using only ONE WORD. Your response should be ONE-WORD for each review (e.g., acting, plot, cinematography). Examples:
1. Review: “Seen ‘Back to the Future’? This movie, ‘Tangents’ (aka ‘Time Chasers’), tries a similar time-travel concept but fails to hit the mark. Made in 1994, it looks and feels like it’s from the 80s. The cast includes an unappealing leading man, a cliché-ridden leading lady, a cartoonish villain, and henchmen with questionable jobs. The plot is hard to follow, so I’d recommend watching it with Mystery Science Theater 3000 for entertainment. On its own: 3 stars. With MST3K: 8 stars.”
Concept: plot
2. Review: “And you thought your significant other’s family was weird? Wedding Slashers will make you think twice about ever saying ‘I do.’ It is reminiscent of past horror titles such as ‘Deadly Friend’ and ‘Friday the 13th.’ It is a classic slasher film that features characters with names like ‘Sock Monkey’ and ‘The Mortician.’ You may laugh at first but trust me, these guys will freak you out. This is a quencher for the blood-thirsty horror/slasher fan that needs to see gore, gore and more gore. It’s not all slash and gash either - Wedding Slashers is chock-full-of one-liners and will give you more than just a chuckle. You’re going to need to see this one to believe it.”
Concept: genre
Now, classify the given review and provide the main concept using only ONE WORD:

Here is a list of extracted concepts from movie reviews: {concepts}
Analyze these concepts and suggest an appropriate number of clusters and one-word cluster names to group them. Cluster names should not overlap, should be distinctive.

Given concept: {concept}
Predefined Concept List: {concept labels}
Provide the concept from the predefined list that is closest to the given concept. Return nothing else.
To measure the quality of GPT-4o’s concept annotations, we conducted a human evaluation using crowdsourcing. We randomly selected 10 annotations from each dataset (Yelp and IMDB), and each annotation was rated by seven independent annotators using Qualtrics. The annotators assessed how accurately each concept reflected the associated text using a 5-point Likert scale [42], where 1 = Not accurately at all, 2 = Slightly accurately, 3 = Moderately accurately, 4 = Very accurately, and 5 = Extremely accurately. The average ratings were 4.31 for Yelp and 3.81 for IMDB. We define the agreement rate as the proportion of ratings above 3, which reached 100% for Yelp and 70% for IMDB. These results indicate that GPT-4o’s concept annotations are largely considered plausible and can be reliably used in downstream tasks.
Dataset | Mean Rating | Agreement Rate (%) |
---|---|---|
Yelp | 4.31 | 100 |
IMDB | 3.81 | 70 |
| Model | IMDB ACC \(\uparrow\) | IMDB F1 \(\uparrow\) | Yelp ACC \(\uparrow\) | Yelp F1 \(\uparrow\) |
|---|---|---|---|---|
| LLaMA3-1B | 70.67 | 65.00 | 93.25 | 93.00 |
| LLaMA3-1B (w/ CURE) | 73.83 | 75.00 | 93.50 | 94.00 |
To further demonstrate the generalization capability of our method, we apply CURE to the LLaMA3-1B model [40] to evaluate its effectiveness on a recent large-scale foundation model. Using the same evaluation protocol, we observe a 3-point accuracy improvement on IMDB and consistent gains on Yelp (Table [tab:llama3]), reflecting the performance trends previously seen with smaller PLMs (Table 1). Due to computational resource limitations, this additional experiment could not be extended to include comparisons with other baselines such as RAZOR [11] and FL [35], and was therefore not included in the main body of the study. Nonetheless, we report it here in the appendix to highlight the broader applicability and robustness of CURE across diverse and emerging model architectures.
Figure 6: Sentiment distributions in the imbalanced groups of the IMDB and Yelp datasets. a — Sentiment distribution in the concept “Emotion”, b — Sentiment distribution in the concept “Story”, c — Sentiment distribution in the concept “Experience”, d — Sentiment distribution in the concept “Service”
Our code is available at https://github.com/aysenurozmen/CURE.