September 29, 2024
Language models frequently inherit societal biases from their training data. Numerous techniques have been proposed to mitigate these biases during both the pre-training and fine-tuning stages. However, fine-tuning a pre-trained debiased language model on a downstream task can reintroduce biases into the model. Additionally, existing debiasing methods for downstream tasks either (i) require labels of protected attributes (e.g., age, race, or political views) that are often not available or (ii) rely on explicit indicators of bias such as gender-specific words, which restricts their applicability to gender debiasing. To address this, we introduce a novel debiasing regularization technique based on the class-wise variance of embeddings. Crucially, our method does not require attribute labels and targets any attribute, thus addressing the shortcomings of existing debiasing methods. Our experiments on encoder language models and three datasets demonstrate that our method outperforms existing strong debiasing baselines that rely on target attribute labels while maintaining performance on the target task.
Language Models (LMs) based on encoders are used for a variety of purposes such as document classification [1], [2], job recommendation [3], text generation [4], or as text encoders for multimodal models such as text-to-audio [5] or text-to-image [6] models. These models often encode societal biases rooted in the corpora used for training [7], [8], which causes a distributional shift in their embeddings and thus affects their outputs, either through disproportionate misclassification of documents belonging to minority groups or through unfair ranking of documents [9], [10].
Several works focus on reducing the effect of these biases by improving model performance with respect to some specific fairness metric (empirical fairness) or by making the model blind to the existence of a certain attribute (representational fairness) [11]. For instance, [11] leverage contrastive learning to improve empirical fairness. Recent works focus mostly on efficiency and user flexibility, debiasing through modular approaches such as sub-networks or adapters [12]. [13] introduce a modular debiasing scheme with adversarial training [14] and mutual information reduction [15] to control the bias in encoder LMs. [16] use adversarial training with adapters [17] to improve representational fairness on document classification. Finally, [18] use gated adapters to improve representational fairness while preserving task performance for classification and retrieval tasks. Although these methods effectively reduce sensitive attribute information and enhance fairness [11], [19] through blindness, they depend on attribute labels to align the distribution of the target attribute. Since user input data contains numerous nuanced protected attributes, such as age, race, or religion, it is challenging to collect labeled data for each individual attribute across every task. Moreover, supervised debiasing methods typically require training on each attribute individually, scaling linearly with the number of attributes. This complexity highlights the need for more efficient and scalable approaches to handle multiple protected attributes in debiasing efforts.
To address this limitation, some works attempt to debias language models without using attribute labels. [20] employ contrastive learning combined with instance weighting to reduce the bias encoded in the language model. Moreover, [21] utilize post-hoc contrastive learning to enhance the fairness of pre-trained encoder language models concerning gender bias. [22] integrate the masking objective used during the pre-training of encoder language models with fine-tuning on gender-specific tasks to address gender bias.
These methods address gender bias without requiring labeled data by relying on explicit gender indicators present in the text. However, they are ineffective against other biases, such as the age, race, or political views of the user, as well as against implicit gender bias once gender information is removed from the text, limiting their possible use cases.
In this work, we bridge this gap by introducing a new regularization scheme based on class-wise variance to reduce unknown (unlabeled) representational bias in the embeddings of LMs. Our regularization enforces low-variance embeddings, which mitigates any distributional shift caused by unknown attributes in the model's embeddings. In this way, we force the model to produce robust embeddings that are informative for the classification task but contain less information about protected attributes, resulting in fairer representations of those attributes. This gives our method the advantage of not relying on any information about the attribute during debiasing. To the best of our knowledge, we are the first to address the debiasing of arbitrary attributes without having access to attribute labels.
We demonstrate the effectiveness of our method on document classification tasks using adapters [17] and two commonly used encoder LMs, BERT-Base and RoBERTa-Base. Furthermore, we show that our method, compared to existing supervised debiasing methods, enhances attribute removal while retaining competitive classification task performance.
In recent years, adapter networks [12], [17] have emerged as an efficient way of training models on downstream tasks. In addition to their improved training efficiency, adapters keep the backbone LM weights fixed, helping preserve information within the model.
In our initial study, we assess how much gender information can be extracted from commonly used encoder LMs. We use adapters [17] in combination with BERT-Base [23] and, additionally, a gender-debiased version of the same model [24], debiased for empirical fairness, on two downstream classification tasks. We then train probes on the embeddings of the fine-tuned models to check how much information about gender can be extracted from both model variants and report the average balanced accuracy as an indicator of gender information in the embeddings.
Table 1 shows the results of both models on the occupation prediction (BIOS [2]) and mention prediction (PAN16 [25]) datasets. We observe that the task performance of the debiased LM, BERT-NLI, is consistently lower than that of BERT-Base, which aligns with the observations of [22]. Moreover, the adapter-trained models retain gender information to a great extent on BIOS, while on PAN16, BERT-NLI leaks even more gender information into the embeddings than BERT-Base, although it has already been subject to debiasing.
| Model | BIOS Task\(\uparrow\) | BIOS Gender\(\downarrow\) | PAN16 Task\(\uparrow\) | PAN16 Gender\(\downarrow\) |
| --- | --- | --- | --- | --- |
| BERT-Base | \(84.3_{0.1}\) | \(67.0_{0.1}\) | \(92.4_{0.1}\) | \(70.7_{0.1}\) |
| BERT-NLI | \(84.1_{0.1}\) | \(64.5_{0.1}\) | \(88.2_{0.1}\) | \(73.7_{0.1}\) |
This provides strong motivation for using debiasing methods during fine-tuning, even when starting from an already debiased pre-trained LM. However, as surveyed in § 1, existing debiasing methods either rely on attribute labels or are limited to attributes with explicit indicators in the text, such as gender. Furthermore, there exists a plethora of sensitive attributes, and labeling them all across tasks is challenging. This growing number of attributes also affects debiasing complexity, which scales with the number of attributes. Thus, a method that addresses these shortcomings would be highly desirable. In the following, we outline how we close this gap.
We start with our problem setting, formulated as follows: Given a set of \(N\) documents with \(k\) classes, we are interested in having robust high-dimensional embeddings (\(Z \in \mathbb{R}^d\)) for document classification which are (i) informative about the classes but (ii) contain as little information as possible about any arbitrary protected attribute (\(\rho\)) not directly related to the classification task. Our approach to debiasing deviates from existing ones in two crucial ways: (i) It is independent of labeled attributes, and (ii) it targets any protected attribute simultaneously.
We formulate our regularization scheme based on \(k\) centers, one per class in the dataset, \(\{C_1, C_2, \dots, C_k \mid C_i \in \mathbb{R}^d\}\), where \(d\) is the model's embedding size. We aim to adjust the parameters of the network whenever the variance of the embeddings in a batch is high, which intuitively mitigates any undesirable distributional shift that might exist in the embeddings. Since we have \(k\) classes, class-wise variance is a good proxy for this regularization loss.
We define the regularization loss as the distance between the embeddings (\(Z \in \mathbb{R}^d\)) of class \(i\) in a given batch and their corresponding center. For each batch, we calculate the center of the embeddings that belong to the same class (\(C_i\)), which results in \(k\) centers. To account for noisy data points and batches that contain few or no samples of a class, we compute each center \(C_i^b\) as a weighted combination of the mean of the current batch's class-\(i\) embeddings and the center from the previous batch, \(C_i^{b-1}\), where \(\omega\) is a hyperparameter that controls the influence of previous batches and is found through grid search. The centers are calculated as follows:
\[C_i^b = (1 - \omega)\,\frac{Z_1 + Z_2 + \dots + Z_m}{m} + \omega\, C_i^{b-1},\]
where \(m\) is the number of samples for the \(i^{th}\) class in a batch. In practice, if there are no samples of a class within a batch, we ignore it; and if only one sample of the class is in the batch, the center becomes the sample itself. We then define the regularization loss as the sum of distances for each specific sample belonging to class \(i\) from the estimated center of the batch:
\[\mathcal{L}_r = \sum_{i=1}^{k}\sum_{r=1}^{m}\sqrt{\sum_{j=1}^{d}(z_{jr}^i-c_j^i)^2},\]
where \(c_j^i\) is the center value for the \(j^{th}\) dimension of the \(i^{th}\) class and \(z_{jr}^{i}\) is the value of the \(j^{th}\) dimension of the \(r^{th}\) embedding of the \(i^{th}\) class. This corresponds to reducing the class-wise variance of the embeddings produced by the model, which in turn reduces any distributional shift that might exist among data points of the same class and results in the alignment of the embeddings. We also feed the calculated centers as additional inputs to the classification head and compute a classification loss on the centers. We show later in § 5 that this added loss term is essential to mitigate degradation in task performance.
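A minimal PyTorch sketch of the two components described above is given below; the function and variable names (e.g., `batch_class_centers`, `low_variance_loss`) are ours, and the exact implementation details may differ from the released code.

```python
import torch

def batch_class_centers(embeddings, labels, prev_centers, num_classes, omega=0.3):
    """Per-class centers: C_i^b = (1 - omega) * mean(Z of class i in batch) + omega * C_i^{b-1}.

    embeddings:   (batch, d) tensor Z produced by the encoder.
    labels:       (batch,) class indices in [0, num_classes).
    prev_centers: (num_classes, d) centers carried over from the previous batch (detached).
    """
    centers = prev_centers.clone()
    for i in range(num_classes):
        mask = labels == i
        if mask.any():                                 # classes absent from the batch keep C_i^{b-1}
            batch_mean = embeddings[mask].mean(dim=0)  # with a single sample, the mean is the sample itself
            centers[i] = (1 - omega) * batch_mean + omega * prev_centers[i]
    return centers


def low_variance_loss(embeddings, labels, centers):
    """L_r: sum over samples of the Euclidean distance to their class center."""
    return torch.norm(embeddings - centers[labels], dim=1).sum()
```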
The overall loss then becomes a linear combination:
\[\label{eq:loss} \mathcal{L}_{total} = \mathcal{L}_t + \lambda \mathcal{L}_r +\mathcal{L}_c\tag{1}\]
where \(\mathcal{L}_t\) is the classification loss, \(\mathcal{L}_r\) is the regularization loss, and \(\mathcal{L}_c\) is the loss to classify the calculated centers belonging to each class.
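For concreteness, the sketch below shows how the three loss terms could be combined in a single training step, continuing the hypothetical helpers from the previous sketch; `encoder` and `classifier` stand in for the adapter-equipped LM and its classification head, and the exact handling of the centers (e.g., detaching them between batches) is our assumption.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, classifier, batch, prev_centers, num_classes, lam=0.1, omega=0.3):
    """One step of Eq. (1): L_total = L_t + lambda * L_r + L_c (a sketch, not the reference code)."""
    embeddings = encoder(batch["input_ids"], batch["attention_mask"])   # (B, d)
    labels = batch["labels"]                                            # (B,)

    # Task loss L_t on the document embeddings.
    loss_t = F.cross_entropy(classifier(embeddings), labels)

    # Class-wise low-variance regularizer L_r using the running centers.
    centers = batch_class_centers(embeddings, labels, prev_centers, num_classes, omega)
    loss_r = low_variance_loss(embeddings, labels, centers)

    # Center loss L_c: the centers of the classes present in this batch are
    # themselves classified into their respective classes.
    present = torch.unique(labels)
    loss_c = F.cross_entropy(classifier(centers[present]), present)

    total = loss_t + lam * loss_r + loss_c
    # The detached centers are carried over to the next batch as C^{b-1}.
    return total, centers.detach()
```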
For our experiments, we follow previous works and focus on transformer-based language models. We use BERT-Base and RoBERTa-Base, in combination with adapters [17] for each task. Trained in this way, we denote models using our debiasing method as AdpLVR.
We use the following document classification datasets: occupation prediction (BIOS; [2]) with gender as protected attribute, hate speech detection (FDCL18; [1]) with race as protected attribute, and mention detection (PAN16; [25]), corresponding to a multi-attribute setting with age and gender as protected attributes. For each dataset, we remove all explicit indicators of protected attributes from the text, following previous works [13], [16], [18].
| Model | Type | BIOS Task\(\uparrow\) | BIOS Probe\(\downarrow\) | FDCL18 Task\(\uparrow\) | FDCL18 Probe\(\downarrow\) | PAN16-Gender Task\(\uparrow\) | PAN16-Gender Probe\(\downarrow\) | PAN16-Age Task\(\uparrow\) | PAN16-Age Probe\(\downarrow\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base | Ft | \(84.6_{0.4}\) | \(67.3_{0.8}\) | \(81.0_{1.0}\) | \(92.9_{1.8}\) | \(93.6_{1.8}\) | \(69.6_{0.8}\) | \(93.6_{1.8}\) | \(42.3_{0.9}\) |
| | Adp | \(84.3_{0.1}\) | \(67.0_{0.1}\) | \(80.0_{0.1}\) | \(93.3_{0.4}\) | \(92.4_{0.1}\) | \(70.7_{0.1}\) | \(92.4_{0.1}\) | \(42.4_{0.}\) |
| | AdpNLI | \(84.1_{0.1}\) | \(64.5_{0.1}\) | \(81.2_{0.6}\) | \(93.5_{0.6}\) | \(88.2_{0.1}\) | \(73.7_{0.1}\) | \(88.2_{0.1}\) | \(42.5_{0.1}\) |
| | FtAdv | \(84.0_{0.3}\) | \(60.8_{0.2}\) | \(81.0_{1.0}\) | \(84.4_{4.0}\) | \(92.4_{0.8}\) | \(59.8_{0.7}\) | \(92.4_{0.8}\) | \(31.3_{1.1}\) |
| | AdpAdv | \(84.2_{0.1}\) | \(61.9_{0.5}\) | \(79.8_{0.3}\) | \(75.6_{0.5}\) | \(92.2_{0.1}\) | \(\boldsymbol{54.2}_{0.4}\) | \(92.1_{0.1}\) | \(\boldsymbol{21.7}_{0.1}\) |
| | AdpMMD | \(84.4_{0.2}\) | \(65.3_{0.3}\) | \(80.1_{0.2}\) | \(81._{0.3}\) | \(91.4_{0.4}\) | \(67.4_{0.3}\) | \(92.0_{0.8}\) | \(36.8_{0.7}\) |
| | AdpLVR | \(84.0_{0.2}\) | \(\boldsymbol{59.2}_{0.3}\) | \(81.7_{0.1}\) | \(\boldsymbol{66.7}_{0.9}\) | \(91.3_{0.1}\) | \(54.4_{0.1}\) | \(91.3_{0.1}\) | \(21.9_{0.2}\) |
| RoBERTa-Base | Ft | \(84.5_{0.4}\) | \(66.2_{0.7}\) | \(80.6_{0.4}\) | \(93.2_{1.2}\) | \(98.5_{0.1}\) | \(63.6_{0.4}\) | \(98.5_{0.1}\) | \(22.7_{0.8}\) |
| | Adp | \(84.3_{0.1}\) | \(67.3_{0.7}\) | \(80.0_{0.6}\) | \(94.0_{0.6}\) | \(98.2_{0.1}\) | \(62.8_{0.4}\) | \(98.1_{0.1}\) | \(31.9_{0.1}\) |
| | FtAdv | \(84.1_{0.3}\) | \(61.6_{0.3}\) | \(80.5_{1.0}\) | \(83.6_{1.9}\) | \(98.2_{0.1}\) | \(52.0_{0.9}\) | \(98.2_{0.1}\) | \(24.1_{1.4}\) |
| | AdpAdv | \(84.0_{0.1}\) | \(62.9_{0.1}\) | \(80.0_{0.5}\) | \(79.7_{0.3}\) | \(98.1_{0.1}\) | \(53.7_{0.7}\) | \(98.0_{0.1}\) | \(22.3_{1.0}\) |
| | AdpMMD | \(84.3_{0.2}\) | \(64.2_{0.3}\) | \(80.0_{0.1}\) | \(80._{0.5}\) | \(97.8_{0.1}\) | \(60.4_{0.3}\) | \(98.0_{0.4}\) | \(27.1_{0.3}\) |
| | AdpLVR | \(83.8_{0.1}\) | \(\boldsymbol{55.6}_{0.3}\) | \(81.5_{0.2}\) | \(\boldsymbol{77.3}_{0.1}\) | \(97.7_{0.1}\) | \(\boldsymbol{51.1}_{0.4}\) | \(97.7_{0.1}\) | \(\boldsymbol{20.6}_{0.8}\) |
| \(\omega\) | \(\mathcal{L}_c\) | BIOS Task\(\uparrow\) | BIOS Probe\(\downarrow\) | FDCL18 Task\(\uparrow\) | FDCL18 Probe\(\downarrow\) | PAN16-Gender Task\(\uparrow\) | PAN16-Gender Probe\(\downarrow\) | PAN16-Age Task\(\uparrow\) | PAN16-Age Probe\(\downarrow\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | - | \(80.2_{0.8}\) | \(60.9_{0.3}\) | \(81.1_{0.1}\) | \(72.8_{0.2}\) | \(91.2_{0.1}\) | \(54.4_{0.3}\) | \(91.2_{0.1}\) | \(24.3_{0.1}\) |
| \(✔\) | - | \(83.5_{0.2}\) | \(59.6_{0.1}\) | \(81.1_{0.2}\) | \(72.0_{0.3}\) | \(91.1_{0.1}\) | \(54.4_{0.1}\) | \(91.1_{0.1}\) | \(22.3_{0.1}\) |
| - | \(✔\) | \(83.8_{0.4}\) | \(59.8_{0.5}\) | \(81.5_{0.2}\) | \(70.9_{0.8}\) | \(91.4_{0.2}\) | \(55.5_{0.3}\) | \(91.4_{0.2}\) | \(23.1_{0.1}\) |
| \(✔\) | \(✔\) (AdpLVR) | \(84.0_{0.3}\) | \(59.2_{0.2}\) | \(81.7_{0.1}\) | \(66.7_{0.9}\) | \(91.3_{0.1}\) | \(54.4_{0.1}\) | \(91.3_{0.1}\) | \(21.9_{0.2}\) |
We choose our baselines as follows: Ft (fine-tuning of the entire model), Adp (adapter-based training of the model), and AdpNLI (adapter-based training of the BERT model trained on debiased NLI), none of which uses an additional bias mitigation method. We also select recent in-process debiasing algorithms as strong baselines, which rely either on adversarial training [14] or mutual information reduction [15] to reduce the bias encoded within the embeddings and increase representational fairness. Note that all supervised methods use labels of the target attribute to align the embeddings, while AdpLVR does not have access to any attribute label throughout training.
We follow the setup of previous works using the same datasets [13], [16], [18]. Specifically, we use a maximum of 120 tokens for the BIOS dataset and 40 tokens for FDCL18 and PAN16, since the latter comprise comparatively short tweets. We train each model for 15 epochs with a learning rate of \(2 \times 10^{-5}\). We select reduction factors of \(2\), \(1\), and \(2\) for the adapters on BIOS, PAN16, and FDCL18, respectively, as they led to the best task performance. Since each loss term affects each model differently, we train the supervised debiasing baselines with a fixed \(\lambda=1\) and our unsupervised AdpLVR with \(\lambda=0.1\). We also select \(\omega=0.3\), as it performed best across all datasets in our grid search.
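For reference, the hyperparameters stated above can be collected into a single configuration; the sketch below is purely illustrative, and the key names are ours, not taken from the released code.

```python
# Illustrative consolidation of the reported hyperparameters (assumed structure).
TRAINING_CONFIG = {
    "max_tokens": {"BIOS": 120, "FDCL18": 40, "PAN16": 40},
    "epochs": 15,
    "learning_rate": 2e-5,
    "adapter_reduction_factor": {"BIOS": 2, "PAN16": 1, "FDCL18": 2},
    "lambda": {"supervised_baselines": 1.0, "AdpLVR": 0.1},
    "omega": 0.3,
}
```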
We train five probes, each a two-layer fully connected network with a tanh activation function, for 40 epochs with a learning rate of \(1 \times 10^{-4}\) to predict protected attributes from the embeddings [13].
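A minimal sketch of such a probe is shown below; the hidden size and other training-loop details are assumptions, as they are not specified above.

```python
import torch.nn as nn

class AttributeProbe(nn.Module):
    """Two-layer probe with tanh activation, trained on frozen embeddings
    to predict a protected attribute (hidden size is an assumption)."""

    def __init__(self, embed_dim=768, hidden_dim=128, num_attr_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_attr_classes),
        )

    def forward(self, frozen_embeddings):
        # The encoder is kept frozen; only the probe parameters are trained.
        return self.net(frozen_embeddings)
```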
To evaluate task performance, we use accuracy as the evaluation metric. To evaluate the performance of the bias mitigation methods, we use the balanced accuracy of the probes, which accounts for datasets that are unbalanced with regard to the distribution of the protected attributes. For the gender and dialect attributes, a balanced accuracy around \(50\%\) indicates that the probe cannot recover the protected attribute from the embeddings better than chance; for age, this value should be close to \(20\%\), since there are five age classes. Furthermore, we run each experiment three times per model and report the mean and standard deviation over the three runs to account for variations in the training process.
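Both metrics can be computed with standard library functions; the snippet below is a sketch, and the argument names are placeholders for the gold labels and predictions of the classifier and the probe.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def evaluate(task_labels, task_preds, attr_labels, probe_preds):
    """Task accuracy (higher is better) and probe balanced accuracy
    (closer to chance is better: ~50% for the binary gender/dialect
    attributes, ~20% for the five-class age attribute on PAN16)."""
    return {
        "task_accuracy": accuracy_score(task_labels, task_preds),
        "probe_balanced_accuracy": balanced_accuracy_score(attr_labels, probe_preds),
    }
```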
Table 2 shows the task and probe performance of the baselines and AdpLVR.
In our single-attribute experiments on BIOS and FDCL18, using both BERT-Base and RoBERTa-Base, AdpLVR removes information about the protected attributes considerably better than all baselines. As for task performance, we observe a decrease in accuracy with AdpLVR compared to the baselines on BIOS. Remarkably, on FDCL18 our regularization method even improves task performance, demonstrating the robustness of its embeddings.
In our multi-attribute experiment on PAN16 (two protected attributes), we observe that AdpLVR performs slightly worse than the best-performing model, AdpAdv, on the main task. However, unlike AdpAdv, which has access to the protected attribute labels during training, AdpLVR crucially does not rely on attribute labels for bias mitigation, yet it still outperforms most baselines in removing protected attribute information. Overall, with BERT-Base, AdpLVR shows only slightly higher probe balanced accuracy than AdpAdv for both protected attributes, while with RoBERTa-Base, AdpLVR improves mitigation performance over all baselines.
Notably, the other debiasing methods show similar decreases in task performance. Still, on FDCL18, AdpLVR clearly outperforms all supervised baselines on both the main task and information removal.
To ensure all parts of our method are necessary to achieve its performance, we conduct an ablation study in which we remove (i) the memory of the previous batches, controlled by \(\omega\), and (ii) the center loss \(\mathcal{L}_c\) introduced in § 3. Table 3 shows the results of this ablation study. When \(\omega\) is removed, the balanced accuracy of the probe increases considerably, meaning that the robustness of the embeddings toward protected attributes is reduced. Thus, more information about unknown, unrelated attributes influences the final output of the model to a larger extent.
Moreover, we observe that task performance clearly degrades when removing \(\mathcal{L}_c\). Overall, the best-performing model, both in terms of task performance and probe balanced accuracy, is the one that has both \(\omega\) acting as memory of previous batches for the model and \(\mathcal{L}_c\), corresponding to our class-center-based loss.
In this work, we focus on representational fairness and introduce a novel regularization and optimization scheme to debias encoder LMs without access to protected attribute labels. We show the effectiveness of our method using two encoder LMs across three datasets and multiple protected attributes. We demonstrate that our method enhances debiasing while maintaining task performance compared to strong baselines. To the best of our knowledge, our method is the first that can mitigate bias of any arbitrary target attribute by generating robust embeddings suited to the classification task. Since our method does not rely on attribute labels, we hope it paves the way for more accessible, effective, and efficient debiasing of encoder-based transformer models.
One limitation of this work is the definition of gender used in all datasets, which is limited to binary female/male and lacks an inclusive and nuanced definition of gender. Moreover, although our method proved independent of attribute labels, a thorough evaluation would require more datasets with a variety of defined attributes. Another limitation concerns task scope: we narrowed our study to classification tasks, and we acknowledge that the findings of this paper might not be applicable to other tasks such as retrieval or recommendation. Furthermore, our study focuses on transformer-based language models, which places an additional limitation on generalizing this work to other models such as CNNs or LSTM-based language models. Due to the lack of suitable datasets, we relied on datasets commonly used in the debiasing literature. In FDCL18, race is restricted to African American and White American, which does not reflect real-life scenarios. Furthermore, we follow previous works [26]–[28] and use labels of protected attributes assigned by another model, making them not fully representative of the real data distribution. A final limitation is the lack of suitable datasets for multi-attribute settings, with which we could demonstrate that our approach can simultaneously handle even more attributes than demonstrated with PAN16.
This research was funded in whole or in part by the Austrian Science Fund (FWF): https://doi.org/10.55776/P33526, https://doi.org/10.55776/DFH23, https://doi.org/10.55776/COE12, https://doi.org/10.55776/P36413.
The code for the experiments is available on GitHub: https://github.com/ShawMask/UnlabeledDebiasing