April 01, 2024

Deep neural networks (DNNs) are known to be sensitive to adversarial input perturbations, leading to a reduction in either prediction accuracy or individual fairness. To jointly characterize the susceptibility of prediction accuracy and individual
fairness to adversarial perturbations, we introduce a novel robustness definition termed *robust accurate fairness*. Informally, robust accurate fairness requires that predictions for an instance and its similar counterparts consistently align with
the ground truth when subjected to input perturbations. We propose an adversarial attack approach dubbed RAFair to expose false or biased adversarial defects in DNN, which either deceive accuracy or compromise individual fairness. Then, we show that such
adversarial instances can be effectively addressed by carefully designed *benign perturbations*, correcting their predictions to be accurate and fair. Our work explores the double-edged sword of input perturbations to robust accurate fairness in DNN
and the potential of using benign perturbations to correct adversarial instances.

Deep neural networks (DNN) have demonstrated outstanding performance in a broad range of complex learning tasks. However, the widespread deployment of DNN has raised growing concerns about their trustworthiness across various domains [1]. Studies have unveiled that DNN are susceptible to minor input perturbations [2]. These perturbations, nearly indistinguishable from the original data, pose a dual challenge: 1) compromising prediction accuracy [3] which could lead to false treatment, and 2) influencing individual fairness which may result in disparate decisions for similar individuals [4].

Therefore, ensuring that DNN maintain both functional and ethical correctness when subjected to input perturbations is an important yet challenging task, particularly in scenarios where accuracy and individual fairness are equally important. Adversarial perturbations can generate false or biased treatments, leading to unacceptable defects in civil domains such as income prediction [5], loan approval [6], and criminal justice risk assessment [7]. On the one hand, adversarial instances with accurate but biased predictions may unconsciously hide the bias introduced by perturbations, violating anti-discrimination laws and ethical principles. On the other hand, adversarial instances with false but fair predictions may undermine the effectiveness of DNN, resulting in reduced profits or compromised societal security. Moreover, perturbations also hold the potential to enhance the trustworthiness of DNN. Our research shows that carefully crafted benign perturbations can repair the predictions of adversarial instances to be both accurate and fair.

This paper first introduces a novel robustness definition, *robust accurate fairness*, designed to characterize the susceptibility of prediction accuracy and individual fairness of DNN to imperceptible input perturbations. Individuals are
considered similar if their only differences lie in some sensitive attributes, such as gender, race, and age. The prediction of an instance satisfies robust accurate fairness when predictions for itself and its similar counterparts consistently align with
its ground truth under the allowed input perturbations. Instances failing to maintain this alignment may experience either false or biased treatment due to the input perturbations.

Subsequently, we explore the duality of input perturbations on robust accurate fairness in DNN. On the one hand, we propose an adversarial approach, *RAFair*, to generate false or biased adversarial instances with adversarial perturbations. These
instances may either deceive prediction accuracy or influence individual fairness. Our method utilizes the total derivative to quantify variations in the loss function caused by input perturbations. It identifies relevant features and directions for
perturbation under the guidance of the fairness confusion matrix [8]. Perturbations guided by this matrix may compromise accuracy by elevating the loss functions of
instances or influence individual fairness by inducing differential changes in loss functions between instances and their similar counterparts. On the other hand, we further propose the generation of *benign perturbations* to mitigate the false or
biased treatments induced by adversarial perturbations. This kind of input perturbation contributes to correcting the prediction accuracy and individual fairness of adversarial instances, by minimizing the loss functions of both the instances and their
similar counterparts.

We have implemented the RAFair approach and conducted an extensive analysis on benchmark datasets, including Census Income [5], German Credit [9], Bank Marketing [6], and ProPublica Recidivism [7]. Our experimental results and quantile regression analyses illustrate that RAFair can efficiently expose false or biased adversarial instances, with only 22.4% of adversarial instances still maintaining accurate and fair predictions after adversarial perturbations, on average across the four datasets. Benign perturbations can greatly mitigate the inaccurate or unfair treatments introduced by adversarial perturbations: after their application, 58.4% of adversarial instances are corrected to be accurate and fair. This underscores the duality of input perturbations: 1) revealing robust accurate fairness defects in DNN with adversarial perturbations, and 2) enhancing the trustworthiness of false or biased adversarial instances through benign perturbations.

The key contributions of this paper are as follows:

We introduce robust accurate fairness to characterize the susceptibility of prediction accuracy and individual fairness in DNN to input perturbations. This is a novel robustness definition that focuses on the negative impact of input perturbations on both prediction accuracy and individual fairness.

To explore the duality of input perturbations on robust accurate fairness in DNN, we introduce RAFair to unveil the false or biased defects in DNN under adversarial perturbations. We additionally generate benign perturbations to mitigate the false or biased treatments caused by adversarial perturbations.

Experimental results and quantile regression analyses demonstrate the effectiveness of RAFair in exposing false or biased adversarial instances in DNN. It reveals that only a limited percentage of adversarial instances maintain accurate and fair predictions after adversarial perturbations. Such false or biased adversarial instances can be greatly corrected by applying benign perturbations.

The rest of this paper is organized as follows. We briefly discuss the related work in Section 2, followed by the formal definition and discussion of the robust accurate fairness in Section 3. We explore the duality of input perturbations to robust accurate fairness in Section 4. Experimental evaluations are reported and analyzed in Section 5. The paper is concluded in Section 6 with some future work.

**Impact of Perturbations on Prediction Accuracy.** Carefully crafted adversarial perturbations can fool DNN models into inaccurate predictions [2]. The concept of adversarial robustness addresses the DNN’s ability to withstand perturbations without compromising accuracy. The impact of perturbations on adversarial robustness is commonly assessed by evaluating a model’s
accuracy in the presence of specific adversarial perturbations [10]. The Fast Gradient Sign Method [2] and the Projected Gradient Descent attack [11] are popular algorithms that may mislead the model by generating perturbations in the
gradient direction of the loss function. The Auto Projected Gradient Descent [12], an advanced variant of the Projected Gradient Descent attack, iteratively
considers the inertia direction of a search point with an adaptive step size. The Auto Conjugate Gradient [13] employs the conjugate gradient method to thoroughly
explore adversarial examples, updating a search point in more diverse directions.

However, these methods often prioritize adversarial robustness without sufficient consideration of individual fairness under adversarial perturbations. Instances with accurate but biased predictions pose a significant challenge, potentially hiding the bias induced by perturbations and consistently violating anti-discrimination laws and ethical principles.

**Impact of Perturbations on Prediction Fairness.** Adversarial perturbations also influence group fairness [14] and
individual fairness [15]. [16] observed that
adversarial robustness might introduce biased treatment among subgroups. Certain subgroups exhibit better resistance to perturbations, which is referred to as robustness bias. Similarly, adversarial perturbations also induce biased treatment among similar
individuals. Studies [4] discovered that carefully constructed perturbations can generate biased instances. These biased instances represent another type of
adversarial instance, distinct from instances with inaccurate predictions. Their predictions exhibit individual bias. Adversarial Discrimination Finder [4] employs a
two-phase gradient search to generate biased perturbations by amplifying prediction differences among similar individuals. The Efficient Individual Discriminatory Instances Generator [17] utilizes momentum optimization and intermediate gradient information to accelerate the gradient search. DICE [18] employs
entropy metrics to quantify fairness in decision-making and employs an information-theoretic framework to guide perturbations and induce biased treatments.

However, current approaches for biased instance generation often overlook the impact of adversarial perturbations on prediction accuracy. This limitation arises because test oracles, i.e., ground truths, are typically unnecessary for the generated biased instances. While this characteristic may be advantageous in certain scenarios, it also restricts the usefulness of these approaches, as their generation results may encounter adversarial robustness issues. In an extreme scenario, enforcing identical predictions for all inputs would satisfy any fairness criterion but severely compromise effectiveness.

Input perturbations produce negative impacts on both prediction accuracy and fairness. However, after perturbations, there may be a conflict between prediction accuracy and individual fairness. A DNN with greater adversarial robustness may inadvertently produce unfair predictions. Conversely, a DNN with higher individual fairness under perturbations may unconsciously generate inaccurate outputs. This poses an important yet challenging task to ensure DNNs maintain both functional and ethical correctness under perturbations.

In this section, we introduce the concept of robust accurate fairness and discuss its connections with adversarial robustness and individual fairness.

Let \(V \subseteq X_1 \times X_2 \times \cdots \times X_m \times A_{1} \times A_{2} \times \cdots \times A_n\) be a finite and labeled dataset. Here, \(X_i\) represents the domain of the \(i\)-th non-sensitive attribute, \(A_{j}\) represents the domain of the \(j\)-th sensitive attribute, and \(Y\) represents the ground truths. Each data point \(v=(x_1,x_2,\dots,x_m,a_1,a_2,\dots a_n)\in V\) is associated with a ground truth value \(y\in Y\). Let \(I(v)\subseteq X_1 \times X_2 \times \cdots \times X_m \times A_{1} \times A_{2} \times \cdots \times A_n\) represent the set of similar instances to \(v\). Formally, \(I(v)=\{(x_1,x_2,\dots,x_m,a_1',a_2',\dots a_n')~|~a_j'\in A_j, j \in [1,n]\}\). It’s important to note that \(v\in I(v)\).
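As a minimal sketch of the set \(I(v)\), the snippet below enumerates every combination of sensitive attribute values while holding the non-sensitive attributes fixed. The helper name `similar_instances`, the tuple encoding of \(v\), and the toy attribute domains are illustrative assumptions, not part of the original formulation.

```python
from itertools import product

def similar_instances(v, sensitive_idx, sensitive_domains):
    """Enumerate I(v): instances equal to v on all non-sensitive attributes.

    sensitive_idx     -- positions of the sensitive attributes inside v
    sensitive_domains -- one iterable of admissible values per sensitive attribute
    """
    result = []
    for combo in product(*sensitive_domains):
        u = list(v)
        for idx, value in zip(sensitive_idx, combo):
            u[idx] = value
        result.append(tuple(u))
    return result

# Hypothetical instance (x1, x2, a1, a2) with two sensitive attributes.
v = (35, 40000, 0, 1)
I_v = similar_instances(v, sensitive_idx=[2, 3], sensitive_domains=[[0, 1], [0, 1, 2]])
```

With sensitive domains of sizes 2 and 3, \(I(v)\) contains 6 instances, and \(v\) itself is one of them.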

Consider a classifier \(f:V \rightarrow Y\) learned from training data. The prediction of classifier \(f\) for input \(v\) is denoted as \(\hat{y}=f(v)\). The allowed data point perturbations are formalized as \(\Delta \subseteq \mathbb{R}^{m+n}\). If a perturbation \(\delta \in \Delta\) is added to \(v\), resulting in the adversarial instance \(v_{adv}=v+\delta\), its prediction by the classifier \(f\) becomes \(\hat{y}_{adv}=f(v_{adv})\), and its similar counterparts are the instances \(v_{adv}'\in I(v_{adv})\). Robust accurate fairness can then be defined as follows:

*Definition 1* (Robust Accurate Fairness). A classifier \(f:X \times A \rightarrow {Y}\) exhibits robust accurate fairness to the instance \((v,y)\) under the allowed data point
perturbations \(\Delta \subseteq \mathbb{R}^{m+n}\), if for any perturbation \(\delta \in \Delta\), the distance \(D(y,f(v_{adv}'))\) between the ground
truth \(y\) and the prediction result \(f(v_{adv}')\) is at most \(K\geq 0\) times the distance \(d(v_{adv},v_{adv}')\) between the adversarial instance \(v_{adv}\) and its similar counterparts \(v_{adv}'\), i.e.,
\[\begin{gather} D(y,f(v_{adv}')) \leq Kd(v_{adv},v_{adv}') \label{eqn:RAF}
\end{gather}\tag{1}\] for any \(v_{adv}'\in I(v_{adv})\). Here, \(D(\cdot,\cdot)\) and \(d(\cdot,\cdot)\) are distance metrics.

Robust Accurate Fairness addresses the resilience of DNN to withstand input perturbations without compromising both prediction accuracy and individual fairness. It incorporates the adversarial robustness and individual fairness requirements for adversarial instances, seeking to reconcile these two aspects under input perturbations.
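To make Definition 1 concrete, the following sketch searches a finite list of candidate perturbations for a violation of Inequality 1. It is only an empirical check under stated assumptions: a finite sample of \(\Delta\) can falsify robust accurate fairness but cannot certify it. The function name `find_raf_violation`, the toy classifier `f`, and the choice of 0/1 distances with \(K=0\) are hypothetical.

```python
def find_raf_violation(f, v, y, deltas, similar, D, d, K):
    """Return (delta, v_adv') for the first violation of Inequality 1, or None."""
    for delta in deltas:
        v_adv = tuple(vi + di for vi, di in zip(v, delta))
        for v_sim in similar(v_adv):
            if D(y, f(v_sim)) > K * d(v_adv, v_sim):
                return delta, v_sim
    return None

# Toy setup (hypothetical): one non-sensitive feature x and one binary sensitive attribute a.
def f(u):
    x, a = u
    return 1 if x + 0.5 * a >= 1.0 else 0

def similar(u):
    # I(u): keep x, enumerate both values of the sensitive attribute.
    return [(u[0], a) for a in (0, 1)]

D = lambda y_true, y_hat: 0.0 if y_true == y_hat else 1.0  # 0/1 distance on labels
d = lambda u, w: 0.0 if u == w else 1.0                     # 0/1 distance on inputs
K = 0.0  # with K = 0, every counterpart must be predicted as the ground truth

v, y = (1.2, 1), 1
safe = find_raf_violation(f, v, y, [(0.0, 0.0), (0.1, 0.0)], similar, D, d, K)
unsafe = find_raf_violation(f, v, y, [(0.0, 0.0), (-0.3, 0.0)], similar, D, d, K)
```

In this toy case, no violation is found for the first list of perturbations, while the shift of \(-0.3\) on \(x\) pushes the counterpart with \(a=0\) across the decision threshold.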

*Theorem 1*. If a classifier \(f:X \times A \rightarrow Y\) exhibits robust accurate fairness for an instance \(v\) under the allowed data point perturbations \(\Delta \subseteq \mathbb{R}^{m+n}\), then \[\begin{align} &y=f(v_{adv}) \\ &D\left(f(v_{adv}), f(v_{adv}')\right) \leq Kd(v_{adv}, v'_{adv}),
\end{align}\] for any similar adversarial individual \(v_{adv}' \in I(v_{adv})\) with \(v_{adv}' \neq v_{adv}\).

*Proof.* Consider an instance \(v\) all of whose adversarial instances satisfy Definition 1. Since \(v_{adv} \in I(v_{adv})\), taking \(v_{adv}' = v_{adv}\) in Constraint 1 and applying the identity of indiscernibles of a distance metric gives \[\begin{align} D(y, f(v_{adv})) &\leq Kd(v_{adv}, v_{adv}) = 0.
\end{align}\] Hence \(D(y, f(v_{adv})) = 0\), i.e., \(y = f(v_{adv})\), for any adversarial instance \(v_{adv}\) generated from the instance \(v\). Thus, the adversarial instance \(v_{adv}\) is also adversarially robust.

Leveraging the triangle inequality and symmetry of a distance metric: \[\begin{align} D\left(f(v_{adv}), f(v_{adv}')\right) \leq D(y, f(v_{adv})) + D(y, f(v_{adv}')), \end{align}\] where \(v_{adv}' \neq v_{adv}\), and \(y\) is the ground truth of instance \(v\).

Based on Definition 1, for any similar adversarial individual \(v_{adv}' \in I(v_{adv})\) with \(v_{adv}' \neq v_{adv}\): \[\begin{align} D(y, f(v_{adv}')) &\leq Kd(v_{adv}, v_{adv}'). \end{align}\] Since \(D(y, f(v_{adv})) = 0\), this implies that \(D\left(f(v_{adv}), f(v_{adv}')\right) \leq Kd(v_{adv}, v'_{adv})\). Hence, the adversarial instance \(v_{adv}\) is also individually fair with respect to its similar counterparts \(v'_{adv}\). ◻

It is evident from Theorem 1 that Robust Accurate Fairness integrates the requirements of adversarial robustness [2] and individual fairness [19]. Instances for which any of their adversarial instances satisfy Definition 1 will concurrently meet the criteria for both adversarial robustness and individual fairness.

In this section, we analyze the duality of input perturbations on robust accurate fairness in DNN, including the false or biased adversarial instances induced by adversarial perturbations and the trustworthiness enhancements achievable through benign perturbations.

Based on the orthogonal relationship between individual fairness and accuracy depicted by the fairness confusion matrix [8] (Table 1), we categorize the impact of input perturbations on robust accurate fairness in DNN as follows:

*True Fair Impact*: Input perturbations cause the generated instance to receive the same prediction as its similar counterparts, and its prediction is consistent with the ground truth.

*True Biased Impact*: Input perturbations cause the generated instance to receive a different prediction from its similar counterparts, and its prediction is consistent with the ground truth.

*False Biased Impact*: Input perturbations cause the generated instance to receive a different prediction from its similar counterparts, and its prediction is inconsistent with the ground truth.

*False Fair Impact*: Input perturbations cause the generated instance to receive the same prediction as its similar counterparts, and its prediction is inconsistent with the ground truth.

| | Fair | Biased |
|---|---|---|
| True | True Fair (TF) | True Biased (TB) |
| False | False Fair (FF) | False Biased (FB) |

We classify perturbations that induce true biased, false fair, or false biased impact as adversarial perturbations. These perturbations can generate false or biased adversarial instances, impacting either prediction accuracy or individual fairness. Conversely, input perturbations resulting in true fair impact are considered benign perturbations. These benign perturbations can rectify predictions of adversarial instances and their similar counterparts, promoting accuracy and individual fairness.
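The four impacts can be expressed as a small labeling function. This is a sketch under the assumption that fairness is judged by exact agreement between the prediction of the generated instance and the predictions of all its similar counterparts; the function names are illustrative.

```python
def fairness_confusion_label(y, pred, counterpart_preds):
    """Map one generated instance to TF, TB, FF, or FB.

    y                 -- ground truth of the original instance
    pred              -- prediction for the generated instance
    counterpart_preds -- predictions for its similar counterparts
    """
    accurate = (pred == y)
    fair = all(p == pred for p in counterpart_preds)
    if accurate:
        return "TF" if fair else "TB"
    return "FF" if fair else "FB"

def is_adversarial(label):
    """Perturbations with any impact other than True Fair count as adversarial."""
    return label != "TF"
```

For example, an instance predicted correctly while one counterpart receives a different prediction is labeled `"TB"`, and the perturbation that produced it is adversarial.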

To expose the false or biased defects under input perturbations, we present RAFair, an adversarial approach to generate inaccurate or unfair adversarial instances. This is achieved by formulating three distinct optimization problems:

1. Maximize the distance \(D(y, f(v_{adv}))\) of adversarial instances, producing the False Biased Impact: \[\begin{align} \max_{\delta \in \Delta} D(y,f(v_{adv})) \quad \text{s.t.} \quad (v,y)\in V \times Y,\ v_{adv} = v +\delta \label{FB95optimization} \end{align}\tag{2}\]

2. Maximize the distance \(D(y, f(v_{adv}'))\) of similar counterparts differing from adversarial instances, producing the True Biased Impact: \[\begin{align} \max_{\delta \in \Delta} D(y, f(v_{adv}')) \quad\text{s.t.}\quad &(v, y) \in V\times Y,\ v_{adv} = v + \delta \label{TB95optimization} \notag \\ &v_{adv}' \in I(v_{adv}),\ v_{adv}'\neq v_{adv} \end{align}\tag{3}\]

3. Maximize the distance \(D(y, f(v_{adv}'))\) of both adversarial instances and their similar counterparts, producing the False Fair Impact: \[\begin{align} \max_{\delta \in \Delta} D(y, f(v_{adv}'))\quad \text{s.t.}\quad & (v, y) \in V\times Y,\ v_{adv} = v + \delta \label{FF95optimization} \notag\\ &\text{for any } v_{adv}' \in I(v_{adv}) \end{align}\tag{4}\]

By associating the loss function \(l(\cdot, \cdot)\) with the distance \(D(y, f(v_{adv}'))\) in Equations 2, 3, and 4, the problem becomes equivalent to identifying appropriate features and directions for perturbations that increase the loss functions for either adversarial instances or their similar counterparts.

We assume that the partial derivatives of the loss function exist within a neighborhood of each point in \(I(v)\) and that these partial derivative functions are continuous at those points. The total derivatives of the loss function at the data point \(v\) and its similar counterpart \(v'\) are expressed in Equations 5 and 6. \[\begin{align} \Delta l|_{v} &= \sum_{i=1}^{m} \frac{\partial l}{\partial x_i} \Delta x_i + \sum_{j=1}^{n} \frac{\partial l}{\partial a_j} \Delta a_j \tag{5} \\ \Delta l|_{v'} &= \sum_{i=1}^{m} \frac{\partial l}{\partial x_i} \Delta x_i + \sum_{j=1}^{n} \frac{\partial l}{\partial a_j'} \Delta a_j' \tag{6} \end{align}\] The variations in the loss function caused by input perturbations can be quantified and linearly approximated by Equations 5 and 6. The perturbation \(\Delta x_i\) in attribute \(x_i\) results in variations of \(\frac{\partial l}{\partial x_i}|_v \Delta x_i\) and \(\frac{\partial l}{\partial x_i}|_{v'} \Delta x_i\) in the loss functions \(l(y, f(v))\) and \(l(y, f(v'))\), respectively.
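The linear approximation in Equation 5 can be checked numerically on a toy model. The sketch below assumes a logistic classifier with binary cross-entropy loss, for which the input gradient has the closed form \((p-y)\,w\); the model, weights, and perturbation are hypothetical and stand in for the DNN and its automatically computed gradients.

```python
import math

def bce_loss(w, b, v, y):
    """Binary cross-entropy loss of a logistic model at input v."""
    z = sum(wi * vi for wi, vi in zip(w, v)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bce_grad(w, b, v, y):
    """Gradient of the loss with respect to the input features: (p - y) * w."""
    z = sum(wi * vi for wi, vi in zip(w, v)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

w, b = [0.8, -0.5, 1.2], 0.1     # hypothetical model parameters
v, y = [0.3, 0.7, 1.0], 1        # hypothetical instance and ground truth
delta = [0.01, -0.01, 0.0]       # small perturbation of the first two features

g = bce_grad(w, b, v, y)
predicted_change = sum(gi * di for gi, di in zip(g, delta))   # total derivative
v_perturbed = [vi + di for vi, di in zip(v, delta)]
actual_change = bce_loss(w, b, v_perturbed, y) - bce_loss(w, b, v, y)
```

For a perturbation this small, `predicted_change` and `actual_change` agree closely, and both are negative because the perturbation moves against the gradient.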

To select features and directions for perturbations that maximize the distance in Equations 2, 3, and 4, RAFair utilizes the fairness confusion matrix to guide adversarial input perturbations.

Algorithm 1 presents the workflow of our RAFair adversarial approach. To scale adversarial instance generation, RAFair starts by invoking `getSeeds(`\(V, n\)`)`, which clusters the given dataset \(V\) and randomly samples \(n\) initial seeds from the clusters. These seeds form the set \(S\) for generation. Each seed \((v, y)\) in \(S\) undergoes adversarial generation \(iter\) times (Lines 1-3).

During adversarial generation, RAFair selects a similar counterpart \(v'\) of \((v, y)\) from \(I(v)\) and calculates the gradients \(g\) and \(g'\) of the loss functions of both instances. The similar counterpart is chosen to have the largest absolute gradient value, excluding \(v\) (Lines 4-5), as input perturbations then induce the largest variations in its loss function.

Subsequently, it utilizes the fairness confusion matrix to guide adversarial perturbations (Lines 6-14). In Line 6, `FeatureGradient(`\(g\)`)` returns the gradient component \(g_i\) on each attribute, and \(g'_i\) is the corresponding component in the gradient \(g'\). To maximize \(l(y, f(v_{adv}'))\) for \(v_{adv}'\) in Optimization 3, perturbations should decrease the loss function of \(v_{adv}\) but increase that of its similar counterpart \(v_{adv}'\). Therefore, attributes with different gradient signs are selected for perturbation, and the perturbation direction is set as \(-\)`sign(`\(g_i\)`)` (Line 10). This perturbation strategy is named the *True Biased* attack. To maximize \(l(y, f(v_{adv}'))\) for both \(v_{adv}\) and \(v_{adv}'\) in Optimization 4, perturbations should increase the loss function of both \(v_{adv}\) and \(v_{adv}'\). Hence, attributes with the same gradient signs are selected for perturbation, and the perturbation direction is set as `sign(`\(g_i\)`)` (Line 12). This perturbation strategy is called the *False Fair* attack. To maximize \(l(y, f(v_{adv}))\) for \(v_{adv}\) in Optimization 2, perturbations should increase the loss function of \(v_{adv}\) but decrease that of its similar counterpart \(v_{adv}'\). Hence, attributes with different gradient signs are selected for perturbation, and the perturbation direction is set as `sign(`\(g_i\)`)` (Line 14). This perturbation strategy is referred to as the *False Biased* attack. Finally, perturbations are performed according to the selected direction \(dir\), with a fixed step size \(p \in \mathbb{R}^{m+n}\) specified for each feature. The generated adversarial instances are then projected onto the allowed perturbation set \(\Delta\) and saved for iterative attacks.
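The direction selection described above can be sketched as follows. This is not the authors' Algorithm 1: the function names are hypothetical, attributes that do not match the selection rule are simply left unperturbed, and the allowed set \(\Delta\) is assumed to be an \(L_\infty\) ball of radius `eps` around the original seed.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def attack_direction(g, g_sim, attack):
    """Per-attribute perturbation direction for the TB, FF, or FB attack.

    g, g_sim -- loss gradients of the instance and of its similar counterpart
    """
    direction = []
    for gi, gsi in zip(g, g_sim):
        same_sign = sign(gi) == sign(gsi)
        if attack == "TB" and not same_sign:
            direction.append(-sign(gi))   # lower own loss, raise counterpart's
        elif attack == "FF" and same_sign:
            direction.append(sign(gi))    # raise both losses
        elif attack == "FB" and not same_sign:
            direction.append(sign(gi))    # raise own loss, lower counterpart's
        else:
            direction.append(0)           # attribute not selected
    return direction

def perturb_and_project(v, v_seed, direction, step, eps):
    """Take one step, then clip to the assumed L_inf ball around the seed."""
    out = []
    for vi, si, di, pi in zip(v, v_seed, direction, step):
        candidate = vi + pi * di
        out.append(min(max(candidate, si - eps), si + eps))
    return out

g = [0.5, -0.2, 0.3, 0.0]
g_sim = [-0.1, -0.4, 0.2, 0.0]
tb = attack_direction(g, g_sim, "TB")
ff = attack_direction(g, g_sim, "FF")
fb = attack_direction(g, g_sim, "FB")
stepped = perturb_and_project([0.5, 0.5], [0.5, 0.5], [1, -1], [0.01, 0.01], eps=0.005)
```

Only the first attribute has opposite gradient signs here, so the True Biased and False Biased attacks perturb it in opposite directions, while the False Fair attack perturbs the second and third attributes.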

Figure 2 illustrates the impact of fairness confusion guided perturbations on robust accurate fairness. The prediction of the adversarial instance \(v_{adv}\) remains unchanged after the *True Biased* attack and even moves farther away from the decision boundary (due to the decrease in its loss function). In contrast, the prediction of its similar counterpart \(v_{adv}'\) crosses the decision boundary (due to the increase in its loss function). After the *False Fair* attack, the predictions of both \(v_{adv}\) and \(v_{adv}'\) cross the decision boundary (as both loss functions increase). Following the *False Biased* attack, the prediction of \(v_{adv}\) crosses the decision boundary (due to the increase in its loss function), whereas \(v_{adv}'\) retains its original prediction and moves farther away from the decision boundary (due to the decrease in its loss function).

Fairness confusion guided perturbations, as introduced by RAFair, negatively affect the prediction accuracy or individual fairness of the generated instances, leading to inaccurate or unfair predictions. To mitigate this negative impact, our objective is to introduce benign perturbations to the adversarial instances, thereby improving their trustworthiness and ensuring accurate and fair treatments. This challenge is tackled through the following optimization problem: minimizing the distance \(D(y, f(v'_{\text{ben}}))\) for both the benign instances and their similar counterparts: \[\begin{gather} \min_{\delta \in \Delta} D(y, f(v_{ben}')) \label{TF95optimization} \end{gather}\tag{7}\] for any \(v_{\text{ben}}' \in I(v_{\text{ben}})\), \(v_{\text{ben}} = v_{\text{adv}} + \delta\), where \(v_{\text{adv}}\) represents adversarial instances generated by adding adversarial perturbations to clean data, and \(v_{\text{ben}}\) represents benign instances generated by adding benign perturbations to adversarial instances to improve their trustworthiness.

Similar to RAFair (Algorithm 1), benign perturbation generation (Algorithm 3) is also based on the total derivative and identifies perturbations through the fairness confusion matrix. Algorithm 3 outlines its workflow with two main differences: 1) Algorithm 3 takes the adversarial dataset as input rather than the clean dataset. 2) Algorithm 3 employs a different strategy to select perturbations that minimize the loss function after benign perturbations.

Specifically, to minimize the distance for both \(v_{ben}\) and \(v_{ben}'\) in Optimization 7, perturbations should decrease the loss function of both \(v_{ben}\) and \(v_{ben}'\). Hence, attributes with the same gradient signs are selected for perturbation, and the perturbation direction is set as \(-\text{sign}(g_i)\) (Line 8).
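A minimal sketch of this benign direction rule is given below, again on a hypothetical logistic model with binary cross-entropy loss rather than the DNN from the experiments. As a simplifying assumption, the demo perturbs only the non-sensitive features and takes a single step.

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

def benign_direction(g, g_sim):
    """Select attributes whose gradients share a sign and step against them."""
    return [-sign(gi) if sign(gi) == sign(gsi) else 0 for gi, gsi in zip(g, g_sim)]

def loss_and_grad(w, v, y):
    """Cross-entropy loss of a logistic model and its gradient w.r.t. the input."""
    z = sum(wi * vi for wi, vi in zip(w, v))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss, [(p - y) * wi for wi in w]

w = [1.0, -2.0, 0.5]                                        # hypothetical weights
v_adv, v_adv_sim, y = [0.2, 0.6, 1.0], [0.2, 0.6, 0.0], 1   # last entry is sensitive
m, step = 2, 0.05                                           # m non-sensitive features

loss_before, g = loss_and_grad(w, v_adv, y)
loss_sim_before, g_sim = loss_and_grad(w, v_adv_sim, y)

# Perturb only the non-sensitive features (an assumption of this demo).
direction = benign_direction(g[:m], g_sim[:m]) + [0] * (len(w) - m)
v_ben = [vi + step * di for vi, di in zip(v_adv, direction)]
v_ben_sim = [vi + step * di for vi, di in zip(v_adv_sim, direction)]

loss_after, _ = loss_and_grad(w, v_ben, y)
loss_sim_after, _ = loss_and_grad(w, v_ben_sim, y)
```

In this toy case, one benign step lowers the loss of both the instance and its similar counterpart, which is the intended True Fair impact.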

We implemented RAFair in Python 3.8 using TensorFlow 2.4.1. Our implementation is evaluated on an Ubuntu 18.04.3 system, equipped with Intel Xeon Gold 6154 @3.00GHz CPUs, GeForce RTX 2080 Ti GPUs, and 512GB of memory. The source code, experimental datasets, and models are submitted in the supplementary material and will be made publicly available later.

**Experimental Setup.** Four popular fairness datasets, Census Income (Adult) [5], German Credit (Credit) [9], Bank Marketing (Bank) [6], and ProPublica Recidivism (COMPAS) [7], are considered for experiment. The sizes, sensitive attributes, and baseline models (BL) trained with these datasets are reported in Table 2, where \(A\)(\(k\)) indicates that attribute \(A\) has \(k\) values, and FCNN(\(l\)) denotes a fully connected neural network classifier with \(l\) layers. Specifically, we consider all sensitive attributes jointly,
instead of separately, to conduct complex case studies.

Dataset | Size | Model | Sensitive Attributes |
---|---|---|---|
Credit | 1000 | FCNN(6) | gender(2), age(51) |
Bank | 45211 | FCNN(6) | age(77) |
Adult | 45222 | FCNN(6) | gender(2), age(71), race(5) |
COMPAS | 6172 | FCNN(6) | gender(2), age(71), race(6) |

The state-of-the-art adversarial instance generation techniques, including Auto Projected Gradient Descent (APGD) [12] and Auto Conjugate Gradient (ACG) [13], along with biased instance generation techniques such as Adversarial Discrimination Finder (ADF) [15] and the entropy-based information-theoretic testing framework DICE [18], are included for comparison.

**Evaluation Metrics.** In addition to the accuracy (ACC), individual fairness (IF), and accurate fairness (AF) metrics, we also employ robust accuracy (R-ACC), robust individual fairness (R-IF), and robust accurate fairness (R-AF) to
quantify the proportion of instances where the adversarial generation fails to generate false, biased, and false or biased adversarial examples, respectively. Higher percentages suggest better reliability against perturbations in terms of adversarial
robustness, individual fairness, and robust accurate fairness. Additionally, we use true fair rate (TFR), true biased rate (TBR), false fair rate (FFR), and false biased rate (FBR) to quantify the rates of true fair, true biased, false fair, and false
biased instances in the dataset.
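One plausible way to compute these metrics is sketched below, assuming each seed has already been summarized by a single fairness confusion label (TF, TB, FF, or FB) after adversarial generation. Under that assumption, R-ACC counts the true instances, R-IF the fair instances, and R-AF the true fair instances. The exact aggregation over seeds and iterations used in the experiments is not specified in this section, so the formulas and the function name `summarize` are illustrative.

```python
from collections import Counter

def summarize(labels):
    """Compute the four rates and the three robust metrics from TF/TB/FF/FB labels."""
    n = len(labels)
    counts = Counter(labels)
    tfr = counts.get("TF", 0) / n
    tbr = counts.get("TB", 0) / n
    ffr = counts.get("FF", 0) / n
    fbr = counts.get("FB", 0) / n
    return {
        "TFR": tfr, "TBR": tbr, "FFR": ffr, "FBR": fbr,
        "R-ACC": tfr + tbr,   # prediction still matches the ground truth
        "R-IF": tfr + ffr,    # prediction still matches the similar counterparts
        "R-AF": tfr,          # both accurate and fair
    }

# Hypothetical outcome for ten seeds.
labels = ["TF"] * 2 + ["TB"] * 3 + ["FF"] * 4 + ["FB"] * 1
metrics = summarize(labels)
```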

The objective of our experiments is to address the following three inquiries.

*Q1: How effective is the RAFair adversarial approach in generating false or biased adversarial instances?*

Figure 4 showcases the trends in the loss function for an instance and its similar counterpart during the RAFair attack, involving false fair, true biased, and false biased attacks. These figures are obtained by monitoring the loss functions of an instance in the Bank dataset under adversarial perturbations, with a step size of \(p=0.01\) and iteration count \(iter=50\).

As depicted in Figure 4, during the false fair attack, perturbations induce a similar impact on the instance and its similar counterparts, increasing both loss functions. This suggests that the instance may receive a false treatment with respect to its ground truth while still being treated the same as its counterpart. In contrast, during true biased and false biased attacks, perturbations have distinct impacts on the instance and its similar counterparts, with one loss function increasing while the other decreases. This results in the instance receiving a biased treatment compared to its counterpart. Specifically, in the true biased attack, the instance’s loss function decreases while its similar counterpart’s loss function increases. Conversely, in the false biased attack, these loss functions exhibit opposite changes.

Additionally, we utilize quantile regression [20] to analyze the trends in the loss function of 100 instances in the Bank dataset. Initially, we apply K-Means clustering [21] to the test dataset, carefully selecting 100 initial seeds from the resulting clusters. Subsequently, we record the loss functions of these instances during the RAFair attack. Treating the loss function as a random variable given the number of iterations, we employ quantile regression to model the median value of the loss function \(l\) as a linear function of the number of iterations \(iter\), as represented by Equation 8 . Here, \(\mathbb{Q}_{\frac{1}{2}}(l | \text{iter})\) represents the median of the loss function given the number of iterations, and \(\beta=[\beta_0, \beta_1]^T\) is the parameter vector estimated through the optimization problem expressed in Equation 9 : \[\begin{align} &\mathbb{Q}_\frac{1}{2}\left( l|iter \right)= \beta_0 +\beta_1\cdot iter \tag{8} \\ &\hat{\beta}=\underset{\beta \in \mathbb{R}^2}{\mathrm{arg}\min}\sum_{k=1}^n{\left| l_k-\beta_0-\beta_1 \cdot iter_k \right|} \tag{9} \end{align}\]

The sign of \(\hat{\beta_1}\) indicates the trend of the loss function concerning the number of iterations. If \(\text{sign}(\hat{\beta_1}) = 1\), it signifies that the median of the loss function increases as the number of iterations grows, indicating an upward trend. Conversely, if \(\text{sign}(\hat{\beta_1}) = -1\), it indicates that the median of the loss function decreases as the number of iterations increases, reflecting a downward trend. In the case where \(\text{sign}(\hat{\beta_1}) = 0\), it implies that the median of the loss function remains relatively constant as the number of iterations increases.
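The median regression of Equations 8 and 9 can be illustrated without external libraries. The sketch below relies on a standard property of least absolute deviations: for a line with two parameters, some optimal fit passes through at least two data points, so enumerating all pairs is sufficient for small samples. The function names, the tolerance used to decide a zero trend, and the synthetic loss traces are assumptions made for illustration; a dedicated quantile regression solver would normally be used instead.

```python
from itertools import combinations

def median_regression(iters, losses):
    """Fit Q_{1/2}(l | iter) = b0 + b1 * iter by minimizing absolute residuals."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(iters, losses), 2):
        if x1 == x2:
            continue  # a vertical line is not a valid candidate
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        cost = sum(abs(y - b0 - b1 * x) for x, y in zip(iters, losses))
        if best is None or cost < best[0]:
            best = (cost, b0, b1)
    return best[1], best[2]

def trend(b1, tol=1e-6):
    """Sign of the slope: 1 (upward), -1 (downward), or 0 (roughly constant)."""
    if abs(b1) < tol:
        return 0
    return 1 if b1 > 0 else -1

# Synthetic loss trace with one outlier; the median fit should ignore it.
iters = list(range(1, 11))
rising = [0.5 + 0.1 * k for k in iters]
rising[4] = 5.0
b0_up, b1_up = median_regression(iters, rising)

falling = [2.0 - 0.05 * k for k in iters]
b0_down, b1_down = median_regression(iters, falling)
```

Because the objective is the sum of absolute residuals, the single outlier in `rising` does not change the estimated slope, which is the robustness that motivates modeling the median rather than the mean.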

The statistical results of quantile regression are presented in Figure 5. We use the parameters \(\hat{\beta_1}\) and \(\hat{\beta_1'}\) derived
from the instances and their similar counterparts to divide the coordinate system into four areas, namely: *True Fair* (TF, sign\((\hat{\beta_1})\) = -1, sign\((\hat{\beta_1'})\) = -1), *True Biased* (TB, sign\((\hat{\beta_1})\) = -1, sign\((\hat{\beta_1'})\) = 1), *False Fair* (FF, sign\((\hat{\beta_1})\) = 1, sign\((\hat{\beta_1'})\) = 1), and *False Biased* (FB, sign\((\hat{\beta_1})\) = 1, sign\((\hat{\beta_1'})\) = -1). As depicted in Figure 5, most instances following false fair, true biased, and false biased attacks are situated in their respective FF, TB, and FB
areas.

Due to the page limit, we herein discuss the RAFair attack on the Bank dataset. Similar observations can be made on the Credit, Adult, and COMPAS datasets. Please refer to the supplementary material for detailed experimental results.

*Answer to Q1: The RAFair adversarial approach efficiently generates false or biased adversarial instances by manipulating perturbations. This process aims to either elevate the loss functions of instances, compromising adversarial robustness, or
induce distinct changes in the loss functions of instances and their similar counterparts, thereby influencing individual fairness.*

*Q2: Impact of adversarial perturbations on robust accurate fairness, robustness, and individual fairness.*

Table 3 presents average statistics of the clean test seeds and of the adversarial datasets generated from them. The top two rows show the average statistics of the clean test seeds across the four benchmarks, while the remaining rows report the statistics of the adversarial dataset generated by each method. To prepare the comparison, we applied K-Means clustering to the test dataset, selected 200 initial seeds from the resulting clusters for each generation technique, and set the perturbation step size to \(p=0.01\) and the number of iterations to \(iter=20\).
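The text does not state how seeds are drawn from the clusters. One plausible reading, shown here purely as an assumption, is a round-robin draw that takes one random member from each cluster in turn until the seed budget is met. The function `select_seeds` and its parameters are hypothetical, and the cluster labels are assumed to come from any K-Means implementation.

```python
import random
from collections import defaultdict


def select_seeds(cluster_labels, num_seeds, rng_seed=0):
    """Pick up to `num_seeds` sample indices by cycling over clusters.

    Assumed strategy (not confirmed by the paper): shuffle each cluster,
    then take one index per cluster per round.
    """
    rng = random.Random(rng_seed)
    clusters = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        clusters[label].append(index)

    pools = []
    for label in sorted(clusters):
        members = clusters[label]
        rng.shuffle(members)
        pools.append(members)

    seeds = []
    while len(seeds) < num_seeds and any(pools):
        for pool in pools:
            if pool and len(seeds) < num_seeds:
                seeds.append(pool.pop())
    return seeds


# Hypothetical labels for ten test samples in three clusters.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
print(select_seeds(labels, 6))
```

Cycling over clusters keeps small clusters represented, which a uniform draw over the whole test set would not guarantee.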

Compared to the metrics of the clean dataset (ACC, IF, AF), ACG, APGD, and ADF generate adversarial perturbations that decrease R-ACC, R-IF, and R-AF. DICE fails to generate perturbations that decrease R-IF; its perturbations even slightly increase R-IF and R-AF. RAFair, however, is the most efficient at generating false or biased adversarial perturbations: it yields the lowest R-AF, with only 22.4% of adversarial instances remaining accurate and fair, as well as the lowest R-ACC and R-IF.

Moreover, in addition to the overall decrease in R-ACC, R-IF, and R-AF after perturbations, RAFair can target specific kinds of adversarial instances, namely true biased, false fair, or false biased ones, thereby keeping either R-ACC or R-IF stable while significantly decreasing the other. RAFair\(_{M}\) denotes that only \(M\) adversarial generations are performed by RAFair. The results in Table 3 demonstrate that TB, FF, and FB attacks efficiently generate true biased, false fair, and false biased adversarial instances, with the corresponding percentages of TBR, FFR, and FBR (underlined) increasing and the remaining components decreasing. For instance, RAFair\(_{TB}\) effectively generates true biased adversarial instances (TBR increases by 12.3 percentage points), resulting in a 7.4-point increase in R-ACC, a 7.3-point decrease in R-IF, and a 4.9-point decrease in R-AF. RAFair\(_{FF}\) effectively generates false fair adversarial instances (FFR increases by 37.9 points), resulting in a 51.3-point decrease in R-ACC and a 44.1-point decrease in R-AF, but only a 6.2-point decrease in R-IF.
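The rates discussed here can be computed from three inputs per instance: the ground truth, the prediction for the perturbed instance, and the predictions for its similar counterparts. The sketch below assumes the usual reading of the four categories (an instance is "true" if its prediction matches the ground truth and "fair" if all its counterparts receive the same prediction), under which R-Acc = TFR + TBR, R-IF = TFR + FFR, and R-AF = TFR. These identities match, for example, the RAFair\(_{TB}\) row of Table 3 (0.669 + 0.288 = 0.957 and 0.669 + 0.032 = 0.701). The function name `confusion_rates` is hypothetical.

```python
def confusion_rates(y_true, y_pred, counterpart_preds):
    """Return TFR, TBR, FFR, FBR and the derived R-Acc, R-IF, R-AF.

    y_true[i]            ground truth of instance i
    y_pred[i]            prediction for the (perturbed) instance i
    counterpart_preds[i] predictions for the similar counterparts of i
    """
    counts = {"TF": 0, "TB": 0, "FF": 0, "FB": 0}
    for truth, pred, others in zip(y_true, y_pred, counterpart_preds):
        accurate = pred == truth
        fair = all(other == pred for other in others)
        key = ("T" if accurate else "F") + ("F" if fair else "B")
        counts[key] += 1

    total = len(y_true)
    rates = {name + "R": count / total for name, count in counts.items()}
    rates["R-Acc"] = rates["TFR"] + rates["TBR"]  # accurate, fair or not
    rates["R-IF"] = rates["TFR"] + rates["FFR"]   # fair, accurate or not
    rates["R-AF"] = rates["TFR"]                  # accurate and fair
    return rates


# Hypothetical toy data: one instance per category.
rates = confusion_rates(
    y_true=[1, 1, 1, 1],
    y_pred=[1, 1, 0, 0],
    counterpart_preds=[[1, 1], [1, 0], [0, 0], [0, 1]],
)
print(rates)
```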

Data\(_{clean}\) | Acc | IF | AF | TFR | TBR | FFR | FBR |
---|---|---|---|---|---|---|---|
Seeds | 0.883 | 0.774 | 0.718 | 0.718 | 0.165 | 0.056 | 0.061 |
Data\(_{adv}\) | R-Acc | R-IF | R-AF | TFR | TBR | FFR | FBR |
RAFair\(_{TB}\) | 0.957 | 0.701 | 0.669 | 0.669 | 0.288 | 0.032 | 0.011 |
RAFair\(_{FF}\) | 0.370 | 0.712 | 0.277 | 0.277 | 0.093 | 0.435 | 0.195 |
RAFair\(_{FB}\) | 0.673 | 0.647 | 0.600 | 0.600 | 0.073 | 0.047 | 0.280 |
RAFair | 0.327 | 0.424 | 0.224 | 0.224 | 0.362 | 0.266 | 0.148 |
APGD | 0.434 | 0.593 | 0.394 | 0.394 | 0.040 | 0.198 | 0.367 |
ACG | 0.791 | 0.689 | 0.601 | 0.601 | 0.190 | 0.089 | 0.120 |
ADF | 0.684 | 0.582 | 0.469 | 0.469 | 0.215 | 0.112 | 0.204 |
DICE | 0.867 | 0.787 | 0.722 | 0.722 | 0.145 | 0.065 | 0.069 |


As a result, the harm of the adversarial instances generated by RAFair\(_{TB}\) or RAFair\(_{FF}\) is more subtle and difficult to detect. For tasks primarily focusing on adversarial robustness, RAFair\(_{TB}\) can generate robust instances that contain hidden discrimination. Conversely, for tasks primarily concerned with individual fairness, RAFair\(_{FF}\) can generate fair instances that contain hidden inaccuracy. Neither adversarial robustness nor individual fairness alone can reveal the impacts of both RAFair\(_{TB}\) and RAFair\(_{FF}\). The robust accurate fairness criterion is therefore needed to expose false or biased adversarial instances.

*Answer to Q2: True Biased, False Fair, and False Biased perturbations can manipulate the prediction accuracy or individual fairness to generate false or biased adversarial instances.*

*Q3: Effectiveness of benign perturbations for trustworthiness improvement.*

To mitigate the negative impact of adversarial instances, we introduce benign perturbations to adversarial instances to enhance their trustworthiness. An overview of the average statistics of benign datasets is presented in Table 4.

Data\(_{benign}\) | R-Acc | R-IF | R-AF | TFR | TBR | FFR | FBR |
---|---|---|---|---|---|---|---|
RAFair\(_{TB}\) | 0.992 | 0.909 | 0.906 | 0.906 | 0.087 | 0.004 | 0.004 |
RAFair\(_{FF}\) | 0.873 | 0.786 | 0.694 | 0.694 | 0.179 | 0.092 | 0.036 |
RAFair\(_{FB}\) | 0.985 | 0.917 | 0.913 | 0.913 | 0.072 | 0.004 | 0.011 |
RAFair | 0.933 | 0.864 | 0.808 | 0.808 | 0.126 | 0.056 | 0.010 |
APGD | 0.995 | 0.922 | 0.920 | 0.920 | 0.075 | 0.003 | 0.003 |
ACG | 0.995 | 0.922 | 0.920 | 0.920 | 0.075 | 0.003 | 0.003 |
ADF | 0.995 | 0.922 | 0.920 | 0.920 | 0.075 | 0.003 | 0.003 |
DICE | 0.995 | 0.922 | 0.920 | 0.920 | 0.075 | 0.003 | 0.003 |


Comparing the results with those in Table 3, it is evident that benign perturbations improve R-ACC, R-IF, and R-AF across the adversarial datasets generated by the various approaches, including ACG, APGD, ADF, and DICE. Benign perturbations correct the majority of adversarial instances, making their predictions accurate and fair. An interesting observation is the resilience of true biased adversarial instances, which prove harder to repair: TBR remains higher than FFR and FBR after benign perturbation.

Notably, when considering adversarial instances generated by RAFair, only 80.8% are corrected to be accurate and fair by benign perturbations. This lower correction rate can be attributed to the complex perturbation strategy employed by RAFair, making its adversarial instances inherently more complex to correct. Despite this, benign perturbations consistently demonstrate the ability to enhance the trustworthiness of adversarial instances.

*Answer to Q3: Benign perturbations prove highly effective in mitigating the negative impacts of adversarial perturbations, leading to noteworthy improvements in rectifying adversarial instances.*

In this paper, we introduce robust accurate fairness, which requires that minor input perturbations compromise neither prediction accuracy nor individual fairness. It integrates the requirements of adversarial robustness and individual fairness: predictions for an instance and its similar counterparts should consistently align with the ground truth under input perturbations.

Input perturbations present a double-edged impact on robust accurate fairness in DNN. On the one hand, adversarial perturbations may induce false or biased treatments. On the other hand, benign perturbations can contribute to rectifying prediction accuracy and individual fairness. To explore the duality of input perturbations, we introduce RAFair to expose robust accurate fairness defects. Additionally, we generate benign perturbations to alleviate the false or biased treatments induced by adversarial perturbations. Experimental studies illustrate the efficiency of RAFair in manipulating prediction accuracy or individual fairness through fairness confusion guided perturbations. However, these false or biased treatments can be effectively addressed by introducing benign perturbations to adversarial instances.

Our method is a white-box approach that relies on DNN gradient access. An intriguing future direction involves extending false or biased adversarial generations to black-box scenarios. Furthermore, robust accurate fairness relies on the concept of similar counterparts. In this study, we employ tabular data and define similar counterparts as the instances that share the same non-protected attributes. There is potential to extend the concept of similar counterparts to other data types and task scenarios, such as computer vision and natural language processing, to analyze the impact of input perturbations on robust accurate fairness in machine learning models across various domains.

Figure 6 and the corresponding figures for the COMPAS and Credit datasets depict the trends in the loss function for an instance and its corresponding counterpart during the RAFair attack (false fair, true biased, and false biased attacks). These figures were generated by tracking the loss function values of one instance from each of the Adult, COMPAS, and Credit datasets when subjected to perturbations. The perturbation parameters used were a step size of \(p=0.01\) and 50 iterations.

As shown in Figure 6 and the corresponding figures for the COMPAS and Credit datasets, we observe similar trends across the three datasets during the RAFair attack. The false fair attack has a similar impact on both the instance and its counterpart, increasing their loss functions. This suggests that the perturbed instance may receive a false treatment according to its ground truth while being treated the same as its counterpart. In contrast, during the true biased and false biased attacks, perturbations affect the instance and its counterpart differently. One experiences an increase in loss, while the other undergoes a decrease, indicating differential treatment of the perturbed instance compared to its counterpart.

In addition, we employ quantile regression [20] to analyze the trends in the loss function of 100 instances from each of the Adult, COMPAS, and Credit datasets. Initially, we apply K-Means clustering [21] to the test dataset and select 100 initial seeds from the resulting clusters. Subsequently, we record the loss function values of these instances when subjected to the RAFair attack.

Regarding the loss function as a random variable, given the number of iterations, we employ quantile regression to model the median value of the loss function, denoted as \(l\), as a linear function of the number of iterations, represented as \(\mathbb{Q}_{\frac{1}{2}}\left( l|iter \right)= \beta_0 +\beta_1\cdot iter\). Here, \(\mathbb{Q}_{\frac{1}{2}}(l | iter)\) represents the median of the loss function given the number of iterations, and \(\beta=[\beta_0, \beta_1]^T\) is the parameter vector estimated through the optimization problem expressed as: \(\hat{\beta}=\underset{\beta \in \mathbb{R}^2}{\mathrm{arg}\min}\sum_{k=1}^n{\left| l_k-\beta_0-\beta_1 \cdot iter_k \right|}\).
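For readers less familiar with quantile regression, the absolute-deviation objective above is the \(\tau=\frac{1}{2}\) case of the standard check (pinball) loss. The symbols \(\tau\), \(\rho_\tau\), and \(u\) below are standard notation introduced here for illustration and are not used elsewhere in the paper: \[\begin{align} \rho_\tau(u) &= u\left(\tau - \mathbb{1}[u<0]\right), \\ \rho_{\frac{1}{2}}(u) &= \begin{cases} \frac{1}{2}u, & u \ge 0 \\ -\frac{1}{2}u, & u < 0 \end{cases} \;=\; \tfrac{1}{2}\left|u\right|, \\ \underset{\beta \in \mathbb{R}^2}{\mathrm{arg}\min}\sum_{k=1}^n \rho_{\frac{1}{2}}\left(l_k-\beta_0-\beta_1 \cdot iter_k\right) &= \underset{\beta \in \mathbb{R}^2}{\mathrm{arg}\min}\sum_{k=1}^n \left| l_k-\beta_0-\beta_1 \cdot iter_k \right|. \end{align}\] The constant factor \(\frac{1}{2}\) does not change the minimizer, so fitting the conditional median reduces to least absolute deviations.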

The statistical results of quantile regression for the RAFair attack across the Adult, COMPAS, and Credit datasets are presented in Figures 7 and 8 and the corresponding figure for the Credit dataset. To divide the coordinate system, we utilize the parameters \(\hat{\beta_1}\) and \(\hat{\beta_1'}\) derived from the instances and their similar counterparts. This yields four areas: *True Fair* (sign\((\hat{\beta_1})\) = -1, sign\((\hat{\beta_1'})\) = -1), *True Biased* (sign\((\hat{\beta_1})\) = -1, sign\((\hat{\beta_1'})\) = 1), *False Fair* (sign\((\hat{\beta_1})\) = 1, sign\((\hat{\beta_1'})\) = 1), and *False Biased* (sign\((\hat{\beta_1})\) = 1, sign\((\hat{\beta_1'})\) = -1).

As illustrated in Figures 7 and 8 and the corresponding figure for the Credit dataset, the majority of instances subjected to false fair, true biased, and false biased global perturbations fall within the corresponding false fair, true biased, and false biased areas.

\(v1\): [0.60, 0.15, 0, 0.40, 0, 0.40, 0, 0.31, 0.40, 0, 0.63, 0, 1, 1] 1

Here is an example of a true biased adversarial instance in the Adult dataset. Each sample in the Adult dataset has 13 attributes, including three socially sensitive attributes: age, race, and gender (highlighted in blue). The objective is to train a DNN model to predict the income of every input individual, with predictions being accurate and individually fair with respect to these sensitive attributes.

\(v1_{adv}\): [0.60, 0.15, 0, 0.40, 0, 0.41, 0, 0.32, 0.41, 0, 0.63, 0, 1, 1] 1

\(v1'_{adv}\): [0.60, 0.15, 0, 0.40, 0, 0.41, 0, 0.32, 0.41, 0, 1.18, 1, 0] 0

The item \(v1\), with its ground truth highlighted in red, receives a DNN prediction of **1**. When the small perturbation [0, 0, 0, 0, 0, 0.01, 0, 0.01, 0.01, 0, 0, 0, 0, 0] is added to its attributes, the prediction for its adversarial instance \(v1_{adv}\) remains **1**. However, its similar counterpart \(v1'_{adv}\), which shares the same non-protected attribute values but differs in age, race, and gender, is now predicted as **0**. Although \(v1_{adv}\) is adversarially robust, the predictions for the similar counterparts are biased.
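The example can be replayed in a few lines. The sketch below assumes a layout read off the printed vectors rather than stated in the text: the first ten entries are the non-protected attributes, entries 11 to 13 are age, race, and gender, and the final entry of \(v1\) inside the brackets is the ground truth (so it is omitted from the feature lists here). The helper names are hypothetical.

```python
# Assumed layout: 10 non-protected attributes, then age, race, gender.
PROTECTED = (10, 11, 12)


def apply_perturbation(features, delta, digits=2):
    """Add a perturbation element-wise; rounding only hides float noise."""
    return [round(x + d, digits) for x, d in zip(features, delta)]


def is_similar_counterpart(a, b, protected=PROTECTED):
    """True if a and b agree on all non-protected attributes and differ on a protected one."""
    same_rest = all(
        x == y for i, (x, y) in enumerate(zip(a, b)) if i not in protected
    )
    return same_rest and any(a[i] != b[i] for i in protected)


def label(truth, pred, counterpart_pred):
    """Name the outcome of one instance and one counterpart."""
    accuracy = "true" if pred == truth else "false"
    fairness = "fair" if pred == counterpart_pred else "biased"
    return f"{accuracy} {fairness}"


v1 = [0.60, 0.15, 0, 0.40, 0, 0.40, 0, 0.31, 0.40, 0, 0.63, 0, 1]
delta = [0, 0, 0, 0, 0, 0.01, 0, 0.01, 0.01, 0, 0, 0, 0]
v1_adv = apply_perturbation(v1, delta)
v1_adv_prime = [0.60, 0.15, 0, 0.40, 0, 0.41, 0, 0.32, 0.41, 0, 1.18, 1, 0]

print(v1_adv)
print(is_similar_counterpart(v1_adv, v1_adv_prime))
print(label(truth=1, pred=1, counterpart_pred=0))
```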

\(v2\): [0.60, 0, 0, 0.40, 0, 0.40, 0, 0.31, 0.40, 0, 0.19, 0, 1, 1] 0

Here is another example of a false fair adversarial instance:

\(v2_{adv}\): [0.59, 0.01, 0, 0.39, 0, 0.39, 0.01, 0.31, 0.39, 0, 0.19, 0.0, 1, 1] 0

\(v2'_{adv}\): [0.59, 0.01, 0, 0.39, 0, 0.39, 0.01, 0.31, 0.39, 0, 0.26, 0.5, 1] 0

After adding small perturbations [-0.01, 0.01, 0, -0.01, 0, -0.01, 0.01, 0, -0.01, 0, 0, 0, 0, 0] to \(v2\), both its adversarial instance \(v2_{adv}\) and similar counterpart \(v2'_{adv}\) are predicted as **0**. Although receiving equal treatment, their predictions are inaccurate, differing from their ground truth (1).

[1]

Zhang, J. M., Harman, M., Ma, L., and Liu, Y. Machine learning testing: Survey, landscapes and horizons. *IEEE Trans. Software Eng.*, 48 (2): 1–36, 2022. URL https://doi.org/10.1109/TSE.2019.2962027.

[2]

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In Bengio, Y. and LeCun, Y. (eds.), *3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings*, 2015. URL http://arxiv.org/abs/1412.6572.

[3]

Su, J., Zhang, Z., Wu, P., Li, X., and Zhang, J. Adversarial input detection based on critical transformation robustness. In *IEEE 33rd International Symposium on Software Reliability Engineering, ISSRE 2022, Charlotte, NC, USA, October 31 - Nov. 3, 2022*, pp. 390–401. IEEE, 2022. URL https://doi.org/10.1109/ISSRE55969.2022.00045.

[4]

Zhang, P., Wang, J., Sun, J., Dong, G., Wang, X., Wang, X., Dong, J. S., and Dai, T. White-box fairness testing through adversarial sampling. In Rothermel, G. and Bae, D. (eds.), *ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea, 27 June - 19 July, 2020*, pp. 949–960. ACM, 2020. URL https://doi.org/10.1145/3377811.3380331.

[5]

Adult. UCI Machine Learning Repository, 1996. DOI: 10.24432/C5XW20.

[6]

Moro, S., Rita, P., and Cortez, P. . UCI Machine Learning Repository, 2012.

[7]

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias. *Ethics of Data and Analytics: Concepts and Cases*, pp. 254, 2022.

[8]

Li, X., Wu, P., and Su, J. Accurate fairness: Improving individual fairness without trading accuracy. In Williams, B., Chen, Y., and Neville, J. (eds.), *Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023*, pp. 14312–14320. AAAI Press, 2023. URL https://doi.org/10.1609/aaai.v37i12.26674.

[9]

Hofmann, H. . UCI Machine Learning Repository, 1994. DOI: 10.24432/C5NC77.

[10]

Awais, M. and Bae, S. A survey on efficient methods for adversarial robustness. *IEEE Access*, 10: 118815–118830, 2022. URL https://doi.org/10.1109/ACCESS.2022.3216291.

[11]

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In *6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings*. OpenReview.net, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

[12]

Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In *Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event*, volume 119 of *Proceedings of Machine Learning Research*, pp. 2206–2216. PMLR, 2020. URL http://proceedings.mlr.press/v119/croce20b.html.

[13]

Yamamura, K., Sato, H., Tateiwa, N., Hata, N., Mitsutake, T., Oe, I., Ishikura, H., and Fujisawa, K. Diversified adversarial attacks based on conjugate gradient method. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), *International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA*, volume 162 of *Proceedings of Machine Learning Research*, pp. 24872–24894. PMLR, 2022. URL https://proceedings.mlr.press/v162/yamamura22a.html.

[14]

Zeng, H., Yue, Z., Kou, Z., Zhang, Y., Shang, L., and Wang, D. Fairness-aware training of face attribute classifiers via adversarial robustness. *Knowl. Based Syst.*, 264: 110356, 2023. URL https://doi.org/10.1016/j.knosys.2023.110356.

[15]

Zhang, P., Wang, J., Sun, J., Wang, X., Dong, G., Wang, X., Dai, T., and Dong, J. S. Automatic fairness testing of neural classifiers through adversarial sampling. *IEEE Trans. Software Eng.*, 48 (9): 3593–3612, 2022. URL https://doi.org/10.1109/TSE.2021.3101478.

[16]

Nanda, V., Dooley, S., Singla, S., Feizi, S., and Dickerson, J. P. Fairness through robustness: Investigating robustness disparity in deep learning. In Elish, M. C., Isaac, W., and Zemel, R. S. (eds.), *FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event / Toronto, Canada, March 3-10, 2021*, pp. 466–477. ACM, 2021. URL https://doi.org/10.1145/3442188.3445910.

[17]

Zhang, L., Zhang, Y., and Zhang, M. Efficient white-box fairness testing through gradient search. In Cadar, C. and Zhang, X. (eds.), *ISSTA ’21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Denmark, July 11-17, 2021*, pp. 103–114. ACM, 2021. URL https://doi.org/10.1145/3460319.3464820.

[18]

Monjezi, V., Trivedi, A., Tan, G., and Tizpaz-Niari, S. Information-theoretic testing and debugging of fairness defects in deep neural networks. In *45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023*, pp. 1571–1582. IEEE, 2023. URL https://doi.org/10.1109/ICSE48619.2023.00136.

[19]

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. S. Fairness through awareness. In *Innovations in Theoretical Computer Science 2012*, pp. 214–226, 2012.

[20]

Koenker, R. and Hallock, K. F. Quantile regression. *Journal of Economic Perspectives*, 15 (4): 143–156, 2001.

[21]

Lloyd, S. P. Least squares quantization in PCM. *IEEE Trans. Inf. Theory*, 28 (2): 129–136, 1982. URL https://doi.org/10.1109/TIT.1982.1056489.