September 05, 2025
Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
Figure 1: Example of shortcut learning in sentiment classification, where a classification model \(f_\theta\) wrongly associates reviews about Food with a positive sentiment.
With the rapid advancement of artificial intelligence, pre-trained language models (PLMs) have been widely adopted across various domains, including education, healthcare, and e-commerce [1]–[3]. A predominant strategy for applying these models is fine-tuning, where a PLM is further adapted to task-specific data, aiming to enhance its performance or better align with human intent [4]. However, fine-tuning often exposes models to dataset biases, leading to shortcuts—spurious correlations between features and labels [5]. For instance, [6] demonstrated that on the Yelp dataset [7], a LLaMA2-based [8] sentiment classifier mistakenly associated the concept of “food” with a “positive” label. These fragile dependencies not only limit the robustness of PLMs but also pose significant risks. In medical diagnosis, a biased detector might incorrectly associate certain biological attributes with diseases, leading to inaccurate predictions [9]. Similarly, in automated recruitment systems, a shortcut may lead the system to favor applicants with certain demographic attributes, exacerbating fairness problems. In Figure 1, we present an example where the classifier incorrectly associates the concept of food with positive sentiment.
Contemporary debiasing research primarily focuses on two strategies: (1) modifying shortcut-inducing terms in training data [10], [11], and (2) generating counterfactual samples [6], [12] via large language models (LLMs). However, both approaches suffer from notable limitations. Lexical modification requires prior knowledge of shortcut-inducing terms, which is often challenging to obtain [13]. Moreover, its effectiveness is restricted to lexical shortcuts rather than conceptual biases. On the other hand, LLM-based counterfactual generation is computationally expensive and significantly increases training costs, while LLM-free counterfactual generation still relies on prior knowledge [14], making it similarly constrained.
As an unsupervised and lightweight solution, we propose CURE—Controlled Unlearning for Robust Embeddings. CURE remaps the semantic space to disentangle conceptual and content-related information without human annotation, offering fine-grained control over shortcut effects. It first trains a content extractor using a concept classifier and back-translation to produce concept-irrelevant representations. A contrastive learning-based debiasing module then refines sample representations, adjusting conceptual features as needed. Finally, the module is jointly trained with a classification head to enhance model robustness.
Unlike traditional approaches, CURE offers three key advantages: Prior Knowledge Independence – CURE uses unsupervised learning, eliminating the need for manual annotations of shortcuts. Resource Efficiency – CURE eliminates the need for LLM-driven data augmentation, reducing training time to approximately one-tenth of the original. Controllability – CURE can quantify the impact of conceptual bias on classification results. This facilitates both the mitigation of conceptual biases to enhance performance on out-of-distribution (OOD) data and the exploitation of shortcuts to improve performance on independent and identically distributed (i.i.d.) data. Such adaptability enables users to align training objectives with their generalization requirements, while also providing a quantifiable framework for future debiasing research.
Our contributions are as follows:
We propose a novel conceptual debiasing approach named CURE. It mitigates shortcuts without relying on prior knowledge or data augmentation, reducing training time to one-tenth of that required by LLM-driven methods. Furthermore, CURE is highly adaptable and can be seamlessly integrated with any mainstream PLM.
CURE enables precise control over the impact of shortcuts. It mitigates conceptual biases to enhance robustness against distribution shifts. Conversely, in scenarios where shortcuts align well with the target task, e.g., i.i.d. data, it leverages them to improve classification accuracy. This adaptability allows CURE to balance robustness and accuracy based on specific generalization requirements.
We evaluate CURE across two benchmark datasets and three PLMs. Experimental results indicate that on the IMDB dataset, the RoBERTa-based CURE achieves an approximately 5-point improvement in accuracy over an LLM-driven debiasing approach and outperforms the baseline by about 10 points in F1 score, demonstrating its effectiveness in mitigating conceptual shortcuts.
Addressing spurious correlations in PLMs has become a critical research focus, as these correlations can lead to biased and unreliable predictions, limiting model robustness and fairness. Traditional works have explored various strategies to mitigate these issues, including causal inference techniques [15], adversarial training [16], and data augmentation methods designed to reduce model reliance on spurious features [17]. Additionally, approaches leveraging counterfactual reasoning [18] have shown promise in improving fairness and robustness in LLMs. These advancements collectively contribute to a growing body of research aimed at developing more reliable and ethically sound language models.
[19] address the challenge of models learning spurious topical shortcuts instead of relevant features in tasks like native language identification. They introduce an adversarial model to demote these latent topical confounds using log-odds ratios, guiding the model to focus on stylistic rather than topic-based features. [20] enhance robustness by fine-tuning models on “forgettable” examples that models initially misclassified. [21] tackle the issue of natural language inference (NLI) models relying on superficial hypothesis patterns by using an ensemble of adversarial classifiers. [22] propose using treatment effect estimation to distinguish genuine correlations from spurious ones, such as associating “Spielberg” with positive sentiment in movie reviews. [23] extend this concept with an automated framework using interpretability techniques, cross-dataset stability, and knowledge-aware perturbation to identify spurious tokens at scale. [24] explore how pre-trained models like BERT handle spurious correlations, finding that they improve robustness by generalizing from minority counterexamples; the authors propose using multi-task learning (MTL) with auxiliary tasks to enhance robustness when these counterexamples are scarce. [25] propose Less-Learn-Shortcut (LLS), which down-weights examples with high correlations between specific words and labels. [18] present a counterfactual debiasing approach that balances predictions between claim-only and claim-evidence models to reduce bias associated with claim patterns. While these studies primarily address general spurious correlations, recent research has started focusing on spurious correlations at the concept level.
[6] introduce the notion of concept-level bias in NLP, highlighting how language models often rely on broad associative patterns rather than deeper semantic understanding. For instance, models may learn to associate certain concepts, such as “food”, with inherently positive sentiment, leading to spurious correlations that degrade generalization performance. To mitigate this issue, the authors leverage an LLM to generate counterfactual data that rebalances label distributions, thereby reducing reliance on such superficial cues.
However, this approach presents certain limitations in terms of scalability. Specifically, generating counterfactual data for each new task requires substantial manual intervention, as it involves defining relevant concept-level biases and ensuring the generated data maintains both linguistic plausibility and task relevance. Even with advanced LLMs like ChatGPT, this process remains resource-intensive, particularly for large-scale or multi-domain applications. Additionally, the effectiveness of this method depends on the quality and diversity of the generated counterfactuals, which can vary depending on the prompt design and the inherent biases present in the language model used for data generation. These challenges underscore the need for more automated, generalizable approaches to mitigating concept-level biases in NLP.
We are given a set of i.i.d. labeled documents \(D = \{d_1, \ldots, d_N\}\), where each sample \(d_i\) is associated with a conceptual label \(c_i \in \mathcal{C}\) and a classification label \(y_i \in \mathcal{Y}\). We assume that the classification labels are balanced, while the conceptual labels are biased. That is, for every label \(y \in \mathcal{Y}\), the number of samples in \(D\) with label \(y\) is equal: \[\forall\, y \in \mathcal{Y}, \quad \left| \{ d_i \in D \mid y_i = y \} \right| = \frac{N}{|\mathcal{Y}|}.\] The distribution of conceptual labels is uneven: \[\begin{gather} \exists\, c, c' \in \mathcal{C} \text{ such that }\\ \left| \{ d_i \in D \mid c_i = c \} \right| \neq \left| \{ d_i \in D \mid c_i = c' \} \right|. \end{gather}\] Here, \(\mathcal{C}\) is correlated with, but not causally related to, \(\mathcal{Y}\), i.e., \(\mathcal{C}\not\perp \!\!\!\perp\mathcal{Y}\), but \(\mathcal{C}\nrightarrow\mathcal{Y}\). We first transform the samples in \(D\) into their semantic embeddings \(\mathcal{X}= \{x_1, \ldots, x_N\}\subseteq\mathbb{R}^u\) using a PLM, then optimize a classification head \(f_\theta\) with parameters \(\theta\) mapping \(\mathcal{X}\rightarrow\mathcal{Y}\) by minimizing a classification loss \(\ell\): \[\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell \big( f_\theta(x_i), y_i \big). \label{eq:ce}\tag{1}\] However, due to the bias between \(\mathcal{Y}\) and \(\mathcal{C}\), the model may erroneously associate \(c_i\) with \(y_i\), thereby losing its robustness. Our primary objective is to enhance the robustness of \(f_\theta\), measured by its classification accuracy on a conceptually balanced OOD test set.
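For concreteness, the following PyTorch sketch (our illustration, not code released with the paper) trains such a classification head on frozen PLM embeddings by minimizing the cross-entropy of Eq. 1; the embedding width of 768 and the binary label space are assumptions.

```python
# Minimal sketch of Eq. (1): fit a classification head f_theta on frozen PLM
# embeddings x_i by minimizing cross-entropy.  Sizes are illustrative.
import torch
import torch.nn as nn

u, num_labels = 768, 2                       # embedding size, |Y| (assumed)
f_theta = nn.Linear(u, num_labels)           # classification head
optimizer = torch.optim.AdamW(f_theta.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """x: (B, u) PLM embeddings, y: (B,) class labels."""
    logits = f_theta(x)
    loss = criterion(logits, y)               # l(f_theta(x_i), y_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```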
Due to the lack of available conceptual annotations in classification datasets and the demonstrated capability of LLMs to perform text annotation [26], we employ the standard text conceptual annotation pipeline outlined in [6] by using GPT-4o [4].
Specifically, we preprocess \(D\) with the following three steps:
Data Cleaning: We remove uninformative content, including non-ASCII characters and irrelevant metadata from texts.
Concept Labeling: We design structured prompts (see Appendix 6.2) and input them into GPT-4o to label each sample \(d_i\) with a concept \(c_i\).
Meta-Concept Merging: The generated concepts are then automatically categorized and merged by GPT-4o into a meta-concept set \(\mathcal{C}\).
After obtaining the concept set \(\mathcal{C}\), we compute the mutual information between a concept \(c\) and \(\mathcal{Y}\) to quantify the bias of that concept: \[I(c; \mathcal{Y}) = \sum_{y \in \mathcal{Y}} P(c, y) \log \frac{P(c, y)}{P(c) P(y)}.\] Subsequently, we select the samples carrying the \(k\) concepts with the highest \(I(c; \mathcal{Y})\) as the training set used to construct a biased benchmark. We treat the samples carrying the \(k\) concepts with the lowest mutual information as OOD data approximating the real-world distribution, used to evaluate our debiasing method.
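This selection step can be illustrated with a short sketch that estimates \(I(c; \mathcal{Y})\) from empirical counts and ranks concepts by bias; the function and variable names below are hypothetical.

```python
# Sketch: estimate I(c; Y) from empirical counts and rank concepts by bias.
# `samples` is a hypothetical list of (concept, label) pairs.
import math
from collections import Counter

def concept_mutual_information(samples, concept):
    n = len(samples)
    p_c = sum(c == concept for c, _ in samples) / n
    label_counts = Counter(y for _, y in samples)
    joint_counts = Counter(y for c, y in samples if c == concept)
    mi = 0.0
    for y, n_cy in joint_counts.items():
        p_cy = n_cy / n                        # P(c, y)
        p_y = label_counts[y] / n              # P(y)
        mi += p_cy * math.log(p_cy / (p_c * p_y))
    return mi

def top_k_biased_concepts(samples, k):
    concepts = {c for c, _ in samples}
    return sorted(concepts,
                  key=lambda c: concept_mutual_information(samples, c),
                  reverse=True)[:k]
```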
To mitigate the impact of conceptual biases, we first extract concept-irrelevant content representations from a semantic embedding \(x\). To achieve this, we freeze the parameters of the PLM and attach a lightweight network \(f_\phi\) to its output layer as a content extractor. Our objective is to remove as much concept-related information as possible while retaining as much content-related information as possible. Accordingly, the training loss consists of two components: a concept dropout loss and a content retention loss.
We first train a concept classifier to quantify the retention of concept-related features in \(\mathcal{X}\). This classifier consists of a classification head \(f_\omega\) built on the same PLM as the task classifier \(f_\theta\). We optimize the parameters \(\omega\) by maximizing the probability of predicting \(\mathcal{C}\) from \(\mathcal{X}\): \[\omega^* = \arg\min_{\omega} \frac{1}{N} \sum_{i=1}^{N} \ell \big( f_\omega(x_i), c_i \big), \label{eq:concept}\tag{2}\] where \(\ell\) is the cross-entropy loss, defined as \[\ell \big( f_\omega(x_i), c_i \big) = - \log P(c_i\mid x_i; \omega),\] and \(P(c_i \mid x_i; \omega)\) denotes the predicted probability of concept \(c_i\) given input \(x_i\), obtained from the softmax output of \(f_\omega\).
We expect the conceptual information in \(x\) to be filtered out after transformation by the content extraction function \(f_\phi\). To enforce this constraint, we compute the Kullback-Leibler (KL) divergence between the predicted distribution of the concept classifier \(f_\omega\) and a uniform distribution over \(\mathcal{C}\) as the training loss \(\mathcal{L}_\text{concept}(\phi)\), as shown in Eq. 3.
\[\mathcal{L}_\text{concept}(\phi) = \sum_{c \in \mathcal{C}} P(c \mid f_\phi(x); \omega) \log \left( \frac{P(c \mid f_\phi(x); \omega)}{(1 / |\mathcal{C}|)^\tau} \right), \label{eq:kl_loss}\tag{3}\] where \(\tau\) is a temperature parameter that controls the strength of the distribution alignment.
By training \(f_\phi\) with this loss, we force the concept classifier \(f_\omega\) to produce maximally uncertain predictions, indicating the absence of learnable conceptual information.
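A minimal sketch of Eq. 3 follows (our illustration), assuming a concept classifier head \(f_\omega\) already trained with Eq. 2 and a deliberately simplified one-layer content extractor \(f_\phi\).

```python
# Sketch of Eq. (3): the concept-dropout loss pushes the frozen concept
# classifier towards maximally uncertain predictions on f_phi(x).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

u, num_concepts, tau = 768, 10, 1.0           # illustrative sizes / temperature
f_phi = nn.Sequential(nn.Linear(u, u), nn.LayerNorm(u))   # simplified extractor
f_omega = nn.Linear(u, num_concepts)          # concept head, assumed trained via Eq. (2)
for p in f_omega.parameters():                # freeze the concept classifier
    p.requires_grad_(False)

def concept_dropout_loss(x):
    """KL-style divergence between P(c | f_phi(x); omega) and a tempered uniform."""
    probs = F.softmax(f_omega(f_phi(x)), dim=-1)
    log_uniform = tau * math.log(1.0 / num_concepts)        # log (1/|C|)^tau
    kl = probs * (probs.clamp_min(1e-12).log() - log_uniform)
    return kl.sum(dim=-1).mean()
```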
The semantic features are often entangled with each other [27]. As a result, although the content extractor \(f_\phi\) aims solely at filtering out conceptual information, it is crucial to ensure that the concept-irrelevant information remains intact. Inspired by back-translation in machine translation [28], we construct a reversal network \(f_{\hat{\phi}}\) with the same architecture as \(f_\phi\). It is designed to reconstruct \(x\) from \(f_\phi(x)\), ensuring that the mapping function \(f_\phi\) does not excessively discard concept-irrelevant information. We first freeze \(\phi\), then use the following loss to train \(\hat{\phi}\): \[\mathcal{L}(\hat{\phi}) = \left\| f_{\hat{\phi}}\left(f_\phi(x)\right) - x \right\|_2^2. \label{eq:reconstruction_loss}\tag{4}\] Next, we freeze the parameters of \(\hat{\phi}\). During the training of \(\phi\), we use the reversal network to remap \(f_\phi(x)\) back to \(x\) and compute the mean squared error between them as a content retention loss: \[\mathcal{L}_{\text{content}}(\phi) = \| f_{\hat{\phi}}(f_\phi(x)) - x \|^2_2.\] Finally, we combine \(\mathcal{L}_{\text{content}}\) and \(\mathcal{L}_{\text{concept}}\) with a weighted summation to form the overall loss for training \(f_\phi\), as shown in Eq. 5: \[\mathcal{L}(\phi) = \mathcal{L}_{\text{concept}}(\phi) + \lambda \mathcal{L}_{\text{content}}(\phi), \label{eq:phi}\tag{5}\] where \(\lambda\) is a weight controlling the relative importance of content retention.
Finally, we alternately train \(f_\phi\) and \(f_{\hat{\phi}}\) so that \(f_{\hat{\phi}}\) can effectively track the retention of concept-irrelevant information by \(f_\phi\). Together, \(\phi\) and \(\hat{\phi}\) minimize conceptual information while maximizing content information, forming an information bottleneck [29].
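Continuing the previous sketch (reusing `f_phi` and `concept_dropout_loss`), the alternating scheme might be implemented as below; the learning rate of 1e-4 follows the reported setting, while the single-step alternation and \(\lambda = 1\) are our assumptions.

```python
# Alternate between Eq. (4) (train the reversal network f_phi_hat) and
# Eq. (5) (train the content extractor f_phi with the combined loss).
import copy
import torch

f_phi_hat = copy.deepcopy(f_phi)              # same architecture as f_phi
opt_phi = torch.optim.AdamW(f_phi.parameters(), lr=1e-4)
opt_phi_hat = torch.optim.AdamW(f_phi_hat.parameters(), lr=1e-4)
lam = 1.0                                     # weight on content retention (assumed)

def alternating_step(x):
    # (1) Eq. (4): update f_phi_hat to reconstruct x from a detached f_phi(x)
    recon = f_phi_hat(f_phi(x).detach())
    loss_hat = ((recon - x) ** 2).sum(dim=-1).mean()
    opt_phi_hat.zero_grad(); loss_hat.backward(); opt_phi_hat.step()

    # (2) Eq. (5): update f_phi; f_phi_hat's parameters are not updated here
    loss_content = ((f_phi_hat(f_phi(x)) - x) ** 2).sum(dim=-1).mean()
    loss_phi = concept_dropout_loss(x) + lam * loss_content
    opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()
    return loss_hat.item(), loss_phi.item()
```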
After training, the content extractor \(f_\phi\) maximizes the retention of content while minimizing the retention of conceptual information, avoiding conceptual shortcuts in further training. We denote \(f_\phi(x)\), the content representation of \(x\), as \(x_\text{cont}\).
Although \(x_{\text{cont}}\) can replace the original embedding \(x\) to mitigate the conceptual bias, we further argue that eliminating conceptual shortcuts is not always beneficial. Theoretically, we identify two special cases where preserving conceptual biases could be advantageous: (1) when the conceptual bias aligns with human intent, and (2) when the application scenario is constrained, where the optimization objective is limited to i.i.d. data.
When the imbalance of conceptual attributes aligns with natural human intent, the shortcuts should be enhanced. For example, in the movie review dataset IMDB [30], most reviews labeled by GPT-4o as containing the conceptual attribute of “humor” are positive. This observation is consistent with psychological studies on the relation between language style and sentiment, which suggest that humorous expression tends to be associated with positive emotion [31]. Furthermore, in application scenarios where the training and deployment distributions are identical, real-world data carry the same distributional bias. For instance, in clinical medicine, a model trained on electronic health records collected from a specific hospital is often deployed in the same environment [32], classifying text with similar biases during training and inference. In such cases, reinforcing shortcuts can also improve classification performance in deployment.
To achieve flexible control over shortcut exploitation, we introduce a lightweight feedforward network \(f_\psi\) on top of the frozen content extractor \(f_\phi\) and the PLM. This network maps both the original embedding \(x\) and its content representation \(x_{\text{cont}}\) into a conceptually controlled semantic space \(\mathcal{X}_\text{CURE}\subseteq\mathbb{R}^u\). We then employ contrastive learning to regulate their cosine similarity in this space. The training losses for removing the conceptual shortcut, \(\mathcal{L}_{\text{r}}(\psi)\), and for enhancing it, \(\mathcal{L}_{\text{e}}(\psi)\), are defined as follows:
\[\mathcal{L}_{\text{r}}=\max \left( 0, 1-\cos(f_\psi{(x)}, f_\psi{(x_{\text{cont}})})-\text{M} \right), \label{eq:margin1}\tag{6}\]
\[\mathcal{L}_{\text{e}}=\max \left( 0, \cos(f_\psi{(x)}, f_\psi{(x_{\text{cont}})})-\text{M} \right), \label{eq:margin2}\tag{7}\] where \(\text{M}\in[0,1]\) is a margin that controls the degree of conceptual information retention.
A smaller margin \(\text{M}\) enforces a stricter optimization objective. In the removal loss \(\mathcal{L}_{\text{r}}\), decreasing \(\text{M}\) compels \(f_\psi{(x)}\) and \(f_\psi{(x_{\text{cont}})}\) to be nearly identical, ensuring the complete removal of conceptual information. Conversely, in the enhancement loss \(\mathcal{L}_{\text{e}}\), a smaller \(\text{M}\) forces \(f_\psi{(x)}\) and \(f_\psi{(x_{\text{cont}})}\) to be maximally separated, thereby amplifying the influence of conceptual features. By adjusting \(\text{M}\), we can flexibly control the extent to which conceptual information is retained in \(f_\psi{(x)}\).
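Written directly from Eq. 6 and Eq. 7, the two margin losses can be sketched as follows (our own rendering):

```python
# Sketch of Eqs. (6)-(7): hinge losses on the cosine similarity between the
# remapped original embedding f_psi(x) and the remapped content f_psi(x_cont).
import torch
import torch.nn.functional as F

def removal_loss(z, z_cont, margin):
    """L_r: pull f_psi(x) towards f_psi(x_cont); active while cos < 1 - M."""
    cos = F.cosine_similarity(z, z_cont, dim=-1)
    return torch.clamp(1.0 - cos - margin, min=0.0).mean()

def enhancement_loss(z, z_cont, margin):
    """L_e: push f_psi(x) away from f_psi(x_cont); active while cos > M."""
    cos = F.cosine_similarity(z, z_cont, dim=-1)
    return torch.clamp(cos - margin, min=0.0).mean()
```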
Finally, we replace the original embedding \(x\) with \(f_\psi{(x)}\), as the input to the classifier \(f_\theta\) and jointly train \(f_\theta\) and \(f_\psi\) using Equation 1 . The trained model can flexibly adjust the extent of conceptual bias retention based on the training objective, making it either more robust or more specialized, as shown in Fig. 2. In terms of parameter efficiency, CURE introduces only a lightweight content extractor and feedforward network on top of the original classifier, ensuring minimal computational overhead.
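Putting the pieces together, a plausible final training step is sketched below (our reconstruction, reusing \(f_\phi\), \(f_\theta\), and the margin losses from the earlier sketches); combining the margin loss with the task loss in a single objective is our assumption about how the two are trained jointly.

```python
# Joint training of the debiasing module f_psi and the task head f_theta
# (Eq. 1), with the PLM and the content extractor f_phi kept frozen.
import torch
import torch.nn as nn

f_psi = nn.Sequential(nn.Linear(u, u), nn.SiLU(), nn.Linear(u, u))   # simplified stand-in
joint_opt = torch.optim.AdamW(list(f_psi.parameters()) + list(f_theta.parameters()),
                              lr=3e-4)
ce = nn.CrossEntropyLoss()

def cure_step(x, y, margin=0.1, debias=True):
    x_cont = f_phi(x).detach()                 # frozen content extractor
    z, z_cont = f_psi(x), f_psi(x_cont)
    ctrl = removal_loss(z, z_cont, margin) if debias \
           else enhancement_loss(z, z_cont, margin)
    loss = ce(f_theta(z), y) + ctrl            # task loss (Eq. 1) + control loss
    joint_opt.zero_grad(); loss.backward(); joint_opt.step()
    return loss.item()
```

The `margin` argument is the hyperparameter \(\text{M}\) of Eq. 6 and Eq. 7; the default of 0.1 is arbitrary.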
Table 1: Accuracy (ACC) and F1 of CURE and the baselines on the i.i.d. and OOD test sets of IMDB and Yelp across three PLMs.

| Setting | Method | IMDB DistilBERT ACC \(\uparrow\) | F1 \(\uparrow\) | IMDB MPNet ACC \(\uparrow\) | F1 \(\uparrow\) | IMDB RoBERTa ACC \(\uparrow\) | F1 \(\uparrow\) | Yelp DistilBERT ACC \(\uparrow\) | F1 \(\uparrow\) | Yelp MPNet ACC \(\uparrow\) | F1 \(\uparrow\) | Yelp RoBERTa ACC \(\uparrow\) | F1 \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i.i.d. | Baseline | 84.00 | 85.05 | 87.33 | 86.94 | 88.50 | 89.27 | 94.75 | 94.76 | 92.75 | 93.11 | 93.75 | 93.51 |
| i.i.d. | FL | 83.70 | 82.00 | 87.50 | 87.32 | 88.67 | 88.90 | 92.25 | 92.54 | 93.75 | 95.17 | 93.50 | 93.00 |
| i.i.d. | RAZOR | 83.25 | 83.00 | 87.00 | 86.50 | 85.33 | 83.19 | 95.50 | 95.32 | 93.50 | 94.83 | 92.50 | 93.00 |
| i.i.d. | CURE | 85.50 | 85.48 | 88.83 | 88.78 | 89.67 | 89.77 | 95.25 | 95.25 | 95.00 | 95.00 | 94.75 | 94.63 |
| OOD | Baseline | 81.67 | 82.20 | 79.33 | 80.19 | 78.83 | 74.85 | 89.75 | 90.44 | 89.00 | 88.30 | 89.25 | 89.64 |
| OOD | FL | 81.33 | 82.25 | 79.00 | 76.75 | 79.33 | 76.70 | 90.25 | 89.53 | 90.25 | 89.40 | 89.00 | 89.52 |
| OOD | RAZOR | 80.83 | 81.30 | 79.00 | 79.33 | 78.67 | 77.70 | 90.75 | 90.60 | 90.75 | 89.26 | 89.50 | 89.76 |
| OOD | CURE | 84.00 | 84.36 | 81.50 | 81.22 | 83.50 | 84.51 | 92.00 | 92.12 | 90.75 | 90.68 | 91.50 | 91.33 |
We used the IMDB [30] and Yelp [33] datasets. The IMDB movie review dataset is a binary sentiment analysis dataset consisting of 50,000 positive or negative reviews from the Internet Movie Database. The Yelp dataset, provided by the Yelp Dataset Challenge, contains business reviews labeled with ratings ranging from 0 to 4 [33]. We used the version cleaned and organized by [34].
Based on the concepts labeled in Section 3.2, we divided the samples in each dataset into two groups for i.i.d. and OOD testing:
Group A contains imbalanced concept distributions, where certain concepts are overrepresented in one task-relevant category, but the overall number of samples across task-relevant labels remains equal. Samples in Group A are split into a biased training set and an i.i.d. test set.
Group B contains balanced concept distributions, where each concept has an equal number of samples across the task-relevant categories. Samples in Group B are used as the OOD test set.
As there are currently no model-based approaches targeting conceptual debiasing, we primarily compare our method with FL [35], which optimizes the loss computation for unbalanced data, and RAZOR [11], which utilizes LLMs for data debiasing. The results are shown in Table 1.
In our training, we used a mini-batch size of 16 with the AdamW optimizer [36]. The learning rate for the content extractor and the reversal network was set to 0.0001, while that for the classification heads was set to 0.0003. The concept classifier head and the task classifier head have identical structures and are based on the same PLM.
CURE is highly lightweight. Specifically, the content extractor consists of two linear layers with layer normalization and a single Transformer layer [1], each with 768 neurons, resulting in a total of approximately 1.78M parameters. Our debiasing module comprises a SwiGLU layer [37] followed by a linear layer, with a total of approximately 1.18M parameters. We compare CURE with the GPT-3.5-Turbo-based RAZOR in Table 2. Here, we calculated the average training and inference time per sample with a batch size of 16 on a single NVIDIA A100 Tensor Core GPU.
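As an illustration of the reported module sizes, the sketch below implements a SwiGLU block followed by a linear layer; the hidden width of 256 is our assumption, chosen because it roughly reproduces the reported 1.18M parameters (the exact widths of the original module are not specified).

```python
# Sketch of the debiasing module: a SwiGLU block followed by a linear layer.
# The hidden width of 256 is assumed; with H = 768 it yields ~1.18M parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

H, HIDDEN = 768, 256

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

debias_module = nn.Sequential(SwiGLU(H, HIDDEN), nn.Linear(H, H))
print(sum(p.numel() for p in debias_module.parameters()))   # ~1.18M
```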
To better understand CURE’s improvements, we analyze the model’s attention patterns in sentiment classification. Specifically, we randomly sampled a positive review from Yelp and classified it with sentiment classifiers based on DistilBERT and on CURE. We then studied their attention across individual terms, measured by Shapley values [38]. The attribution visualizations in Table 3 and Table 4 highlight these differences.
Table 2: Model scale and average per-sample training and inference time.

| Model | Scale \(\downarrow\) | Training \(\downarrow\) | Inference \(\downarrow\) |
|---|---|---|---|
| RoBERTa | 125M | \(\approx\) 11 ms | \(\approx\) 1 ms |
| RAZOR | GPT-3.5-Turbo | \(>\) 600 ms | \(\approx\) 1 ms |
| CURE | 127.96M | \(\approx\) 59 ms | \(\approx\) 1 ms |
Since the content extractor \(\phi\) is optimized by two training objectives simultaneously, i.e., \(\mathcal{L}_{\text{content}}(\phi)\) and \(\mathcal{L}_{\text{concept}}(\phi)\), we empirically demonstrated its convergence. The training curve of the content extractor is shown in Fig. 3.
CURE outperformed the baselines on nearly all metrics across both datasets, as shown in Table 1. The largest improvement comes from the RoBERTa model on the IMDB OOD test, with an increase of approximately 5 points in accuracy and 10 points in F1 score. Compared to the i.i.d. test, our model introduced a more significant improvement on the OOD test. We attribute this to the fact that the baselines already achieve relatively high accuracy on the i.i.d. test, leaving little room for further improvement. Furthermore, we observe that CURE outperforms the loss-adjustment method FL and the LLM-driven approach RAZOR. We attribute this to the fact that FL and RAZOR primarily address label- and word-level biases rather than conceptual biases. For semantic-level biases, these two methods lack mechanisms for regulating the semantic representations, making it challenging for them to improve the baselines. In contrast, CURE remaps the semantic space, enabling the controllable filtering of concept information that causes shortcuts, thereby enhancing the robustness of the baselines and boosting their OOD performance.
Our findings show that the baseline model tends to distribute attention across both sentiment-related and domain-specific words, while CURE prioritizes sentiment-expressive terms. Table 3 illustrates how the DistilBERT-based classifier assigns nearly equal importance to both “service” and “great”, which indicates a reliance on topic-specific terms rather than sentiment indicators. In contrast, Table 4 shows that CURE places stronger emphasis on “great”, which suggests it better captures the actual sentiment while reducing confounding biases.
Table 3: Shapley-based token attribution of the DistilBERT-based classifier for “[CLS] the service was great . [SEP]” (attribution heatmap not reproducible in this rendering).

Table 4: Shapley-based token attribution of the CURE-based classifier for “[CLS] the service was great . [SEP]” (attribution heatmap not reproducible in this rendering).
CURE is lightweight and efficient, as shown in Table 2. Compared to the baseline, CURE adds only 2% more parameters with nearly identical inference time. Compared to RAZOR, which is based on GPT-3.5-Turbo, CURE does not require the participation of LLMs during training, which reduces training time to approximately one-tenth of RAZOR’s. Additionally, the time complexity of the debiasing module involved in inference is \(\mathcal{O}(L \cdot H^2)\), where \(L\) represents the input length and \(H\) denotes the hidden state dimension, which matches that of the PLMs used [39]. Therefore, CURE does not alter the time complexity of the baselines. This substantially reduces both computational and time costs, enhancing the practicality and generalizability of CURE in real-world applications.
The content extractor can converge under all conditions, as shown in Fig. 3. This not only provides an experimental foundation for CURE but also indicates that the two optimization objectives employed, i.e., \(\mathcal{L}_{\text{content}}(\phi)\) and \(\mathcal{L}_{\text{concept}}(\phi)\), are not in conflict. We argue that this finding supports the view that concept information is not entirely entangled with the semantic information in the latent space, thereby offering a theoretical basis for future work on feature disentanglement.
To investigate the effect of the reversal network used in training, we conducted ablation experiments, as shown in Table 5.
Table 5: Ablation of the reversal network \(\hat{\phi}\).

| Model | Yelp ACC \(\uparrow\) | Yelp F1 \(\uparrow\) | IMDB ACC \(\uparrow\) | IMDB F1 \(\uparrow\) |
|---|---|---|---|---|
| RoBERTa (w/o \(\hat{\phi}\)) | 79.75 | 83.09 | 81.33 | 79.03 |
| RoBERTa (w/ \(\hat{\phi}\)) | 91.50 | 91.33 | 83.50 | 84.51 |
| MPNet (w/o \(\hat{\phi}\)) | 90.25 | 89.71 | 79.83 | 78.73 |
| MPNet (w/ \(\hat{\phi}\)) | 90.75 | 90.68 | 81.50 | 81.22 |
| DistilBERT (w/o \(\hat{\phi}\)) | 91.50 | 91.05 | 80.83 | 82.12 |
| DistilBERT (w/ \(\hat{\phi}\)) | 92.00 | 92.12 | 84.00 | 84.36 |
We found that removing the reversal network results in a degradation in classification accuracy, as shown in Table 5. The most significant decline was observed with the RoBERTa model on the Yelp dataset, with a decrease of approximately 12 points in accuracy and 8 points in F1 score. Our further experiments revealed that the content extractor exhibited parameter sparsity in the absence of the reversal network.
Based on these observations, we hypothesize that, without a constraint on content preservation, the content extractor maps all inputs to similar representations, causing its output to become indistinguishable to the concept classifier and thereby trivially minimizing the loss \(\mathcal{L}_\text{concept}\). In this case, the loss of robust features leaves the classifiers without sufficient useful features to learn from, leading to a decline in performance.
We demonstrated how to weaken or enhance shortcuts by adjusting the value of the margin \(M\) in Eq. 6 and Eq. 7, as shown in Fig. 4. To ensure a fair comparison, all other training parameters were held constant in this experiment.
Figure 4: The impact of the margin on classification accuracy. Panels (a) and (c) show cases of reducing shortcuts on the OOD test; panels (b) and (d) show cases of enhancing shortcuts on the i.i.d. test. a — IMDB (Debiasing), b — IMDB (Biasing), c — Yelp (Debiasing), d — Yelp (Biasing)
The margin has a controlling effect on shortcut learning, as shown in Figure 4. We observed that as \(M\) increases, the performance of all three models on the two datasets exhibits a fluctuating decline. This suggests that a higher margin makes our method more permissive in enhancing or suppressing shortcut learning, leading to a corresponding decrease in performance on both i.i.d. and OOD data. Therefore, by adjusting \(M\), CURE can quantitatively control the impact of shortcut learning on classification, providing a quantifiable reference for future debiasing research.
In this work, we introduced CURE, a novel and lightweight framework for mitigating conceptual shortcuts in pre-trained language models. CURE enables fine-grained control over conceptual bias retention by systematically disentangling concept-relevant and content-relevant representations, balancing robustness and accuracy based on task requirements. Our experiments on the IMDB and Yelp datasets demonstrate that CURE significantly improves out-of-distribution robustness, achieving up to 5-point accuracy gains and 10-point F1 gains over baselines while maintaining minimal computational overhead. Notably, CURE reduces training time by an order of magnitude compared to LLM-driven debiasing approaches, making it a scalable and efficient solution. These results highlight the potential of unsupervised conceptual debiasing for enhancing the reliability of language models while preserving critical task-relevant features.
While CURE demonstrates strong performance and computational efficiency, we acknowledge the following limitations. First, due to computational constraints, we were unable to include large-scale comparisons against debiasing baselines such as RAZOR [11] or Focal Loss [35] on newer model architectures such as LLaMA3-1B [40] and Qwen-2.5 [41]. While we conducted a preliminary evaluation on LLaMA3-1B to assess the generalization ability of CURE (see Appendix [sec:appendix_additional]), it was limited to comparisons with standard fine-tuned baselines; a comprehensive benchmarking against other debiasing approaches on these models is left to future work. Second, although CURE itself does not rely on LLM-driven data augmentation during training, we utilized large language models for a one-time concept annotation step during data preprocessing, following prior work [6]. This step does not incur additional inference cost and could be replaced with human-annotated concepts in future applications to reduce reliance on external models. Moreover, we evaluated the plausibility of these annotations through a human study (see Appendix 6.3), confirming their quality for use in downstream evaluations.
Despite these limitations, CURE remains a scalable and adaptable framework for mitigating conceptual biases in NLP models, paving the way for more robust and generalizable language understanding systems.
Here is a given movie review: {review} Identify the main concept discussed in this review using only ONE WORD. Your response should be ONE-WORD for each review (e.g., acting, plot, cinematography). Examples:
1. Review: “Seen ‘Back to the Future’? This movie, ‘Tangents’ (aka ‘Time Chasers’), tries a similar time-travel concept but fails to hit the mark. Made in 1994, it looks and feels like it’s from the 80s. The cast includes an unappealing leading man, a cliché-ridden leading lady, a cartoonish villain, and henchmen with questionable jobs. The plot is hard to follow, so I’d recommend watching it with Mystery Science Theater 3000 for entertainment. On its own: 3 stars. With MST3K: 8 stars.”
Concept: plot
2. Review: “And you thought your significant other’s family was weird? Wedding Slashers will make you think twice about ever saying ‘I do.’ It is reminiscent of past horror titles such as ‘Deadly Friend’ and ‘Friday the 13th.’ It is a classic slasher film that features characters with names like ‘Sock Monkey’ and ‘The Mortician.’ You may laugh at first but trust me, these guys will freak you out. This is a quencher for the blood-thirsty horror/slasher fan that needs to see gore, gore and more gore. It’s not all slash and gash either - Wedding Slashers is chock-full-of one-liners and will give you more than just a chuckle. You’re going to need to see this one to believe it.”
Concept: genre
Now, classify the given review and provide the main concept using only ONE WORD:

Here is a list of extracted concepts from movie reviews: {concepts}
Analyze these concepts and suggest an appropriate number of clusters and one-word cluster names to group them. Cluster names should not overlap, should be distinctive.

Given concept: {concept}
Predefined Concept List: {concept labels}
Provide the concept from the predefined list that is closest to the given concept. Return nothing else.
To measure the quality of GPT-4o’s concept annotations, we conducted a human evaluation using crowdsourcing. We randomly selected 10 annotations from each dataset (Yelp and IMDB), and each annotation was rated by seven independent annotators using Qualtrics. The annotators assessed how accurately each concept reflected the associated text using a 5-point Likert scale [42], where 1 = Not accurately at all, 2 = Slightly accurately, 3 = Moderately accurately, 4 = Very accurately, and 5 = Extremely accurately. The average ratings were 4.31 for Yelp and 3.81 for IMDB. We define the agreement rate as the proportion of ratings above 3, which reached 100% for Yelp and 70% for IMDB. These results indicate that GPT-4o’s concept annotations are largely considered plausible and can be reliably used in downstream tasks.
Dataset | Mean Rating | Agreement Rate (%) |
---|---|---|
Yelp | 4.31 | 100 |
IMDB | 3.81 | 70 |
| Model | IMDB ACC \(\uparrow\) | IMDB F1 \(\uparrow\) | Yelp ACC \(\uparrow\) | Yelp F1 \(\uparrow\) |
|---|---|---|---|---|
| LLaMA3-1B | 70.67 | 65.00 | 93.25 | 93.00 |
| LLaMA3-1B (w/ CURE) | 73.83 | 75.00 | 93.50 | 94.00 |
To further demonstrate the generalization capability of our method, we apply CURE to the LLaMA3-1B model [40] to evaluate its effectiveness on a recent large-scale foundation model. Using the same evaluation protocol, we observe a 3-point accuracy improvement on IMDB and consistent gains on Yelp (Table [tab:llama3]), reflecting the performance trends previously seen with smaller PLMs (Table 1). Due to computational resource limitations, this additional experiment could not be extended to include comparisons with other baselines such as RAZOR [11] and FL [35], and was therefore not included in the main body of the study. Nonetheless, we report it here in the appendix to highlight the broader applicability and robustness of CURE across diverse and emerging model architectures.
Figure 6: Sentiment distributions in the imbalanced groups of the IMDB and Yelp datasets. a — Sentiment distribution in the concept “Emotion”, b — Sentiment distribution in the concept “Story”, c — Sentiment distribution in the concept “Experience”, d — Sentiment distribution in the concept “Service”
Our code is available at https://github.com/aysenurozmen/CURE.