April 02, 2024
Noisy label learning aims to train deep neural networks on large numbers of samples with noisy labels; its main challenge lies in coping with the inaccurate supervision caused by wrong labels. Existing works adopt either the label correction or the sample selection paradigm to bring more accurately labeled samples into the training process. In this paper, we propose a simple yet effective sample selection algorithm, termed Pairwise Similarity Distribution Clustering (PSDC), which divides the training samples into a clean set and a noisy set, and can thus power any off-the-shelf semi-supervised learning regime to further train networks for different downstream tasks. Specifically, we use the pairwise similarity between sample pairs to represent the sample structure, and a Gaussian Mixture Model (GMM) to model the similarity distribution between sample pairs belonging to the same noisy cluster, so that each sample can be confidently assigned to the clean set or the noisy set. Even under severe label noise, the resulting data partition mechanism proves more robust in judging label confidence, both in theory and in practice. Experimental results on benchmark datasets such as CIFAR-10, CIFAR-100 and Clothing1M demonstrate significant improvements over state-of-the-art methods.
The remarkable success of deep learning is largely attributed to training Deep Neural Networks (DNNs) on large datasets with human-annotated labels [1]–[3]. However, labeling extensive data with high-quality annotations is both labor-expensive and time-consuming [4]. To overcome this problem, noisy label learning [5], which aims to train DNNs using only a large number of samples with noisy labels, has been widely studied. Because of their large number of parameters, DNNs can easily overfit noisy labels by learning complex mapping functions [6]. Therefore, how to bring more accurately labeled samples into the training process has become a critical issue in noisy label learning.
In the past few years, extensive work has been done in the field of noisy label learning, which can be broadly divided into two categories, i.e., label correction [5], [7], [8] and sample selection [9]–[11]. In particular, label correction methods try to learn a transition matrix between the network’s predictions and the noisy labels, which revises the wrong gradients induced by noisy labels. For example, the Meta Soft Label Corrector (MSLC) [5] uses extra clean annotations to learn a transition matrix from the original targets and dynamic predictions. In contrast, sample selection methods aim to bring more samples with clean labels into the training process, often using a sample reweighting strategy to drop samples based on different kinds of prior knowledge. For example, the Self-Paced Robust Learning (SPRL) algorithm [11] trains the DNN progressively from more reliable samples to less reliable ones under the supervision of well-labeled data. Even though great progress has been achieved in noisy label learning, existing methods still struggle to provide high-quality supervision under severe label noise.
Recently, mainstream methods usually formulate noisy label learning as a semi-supervised learning problem [12], [13], in which the training samples are first divided into a clean set and a noisy set; the clean set is then used for supervised learning while the noisy set is used for unsupervised learning. From this point of view, this line of methods belongs to the sample selection category. These methods [12], [14], [15] have achieved state-of-the-art performance in noisy label learning, mainly because dividing the training samples into a clean set and a noisy set is relatively easy, so the potential of samples with clean labels can be fully exploited in the supervised learning step. For example, the recent UNIform selection and CONtrastive learning (UNICON) method [12] uses the Jensen-Shannon Divergence (JSD) as a metric for data partition, with which the purity of the clean set can reach about 90% during training. To further improve performance, the key for these methods lies in how to accurately divide the training samples into the clean set and the noisy set.
To achieve this goal, different divergence metrics are formulated to construct the clean set and the noisy set. For example, the Jensen-Shannon Divergence (JSD) [12] and Cross-Entropy [13], [16] are widely used for data partition, where samples with lower losses are assigned to the clean set and samples with higher losses to the noisy set. Some methods manually set a cut-off threshold between small and large losses [16], while others use clustering methods such as the Beta Mixture Model (BMM) [17] and the Gaussian Mixture Model (GMM) [13], or an automatically calculated threshold [12], to separate the losses. Because these methods rely on the small-loss criterion for data partition, they are prone to overfitting noisy labels when the noise rate of the original training data is high. To alleviate this issue, two critical points need to be addressed: (1) the adopted loss should be robust to different noise rates; and (2) the divergence metric should have wide adaptability to different rates of noisy labels.
In this paper, we propose a novel Pairwise Similarity Distribution Clustering (PSDC) method for noisy label learning, which effectively divides the training samples into a clean set and a noisy set. The resulting samples can then be used to learn discriminative feature representations in a supervised and an unsupervised manner, respectively. In particular, we compute the pairwise similarity between sample pairs to represent the sample structure in each noisy cluster, which acts as important prior information for separating candidate samples. Because the pairwise sample structure has no direct relationship with the noisy labels, it overcomes the drawback of the small-loss criterion in data partition. Moreover, we use a Gaussian Mixture Model (GMM) to model the similarity distribution between sample pairs belonging to the same noisy cluster, so that each sample can be confidently assigned to the clean set or the noisy set. As shown in Figure 1, the samples of “shirt” can be picked out to learn a discriminative feature representation in a supervised manner, while the samples not belonging to “shirt” can be used to learn a robust feature representation in an unsupervised manner. Even under severe rates of specific types of label noise, we prove that the resulting data partition mechanism is robust in judging label confidence, both in theory and in practice.
The main contributions of this work are highlighted as follows:
We propose a novel Pairwise Similarity Distribution Clustering method for noisy label learning, which takes the pairwise sample structure and Gaussian Mixture Model to improve the accuracy of data partition.
We present a clear theoretical analysis of the Jensen-Shannon Divergence, the Cross-Entropy criterion and the Gaussian Mixture Model, which indicates that our method tolerates a wide range of noise rates.
We conduct extensive experiments on the CIFAR-10, CIFAR-100 and Clothing1M datasets with different noise types and rates, and achieve state-of-the-art results.
In this section, we briefly review the related works from two aspects, i.e., label correction and sample selection, which are introduced in the following paragraphs.
Label Correction. In label correction methods, a noisy label is usually refurbished to prevent the network from overfitting to false labels. For example, Bootstrapping [18] first introduced the concept of label correction, updating the target labels of samples during training. Recently, iterative methods such as the Joint Optimization Framework (JOF) [19] and Online Label Smoothing (OLS) [20] relabel samples based on the network’s predictions. In contrast, loss correction methods [21] often estimate the noise transition matrix, which represents the probabilities that clean labels flip into noisy labels. For example, the Gold Loss Correction (GLC) [22] estimates the noise transition matrix using a set of training samples with clean labels, and its performance is closely tied to the network’s initial representation. To alleviate this issue, Meta Label Correction (MLC) [23] introduces a meta-learning framework to learn the noise transition matrix, in which an adaptive meta corrector is learned to keep pace with the improving representation ability throughout training. Furthermore, some other works focus on how to bring more samples with noisy labels into the training process. For example, the work in [24] selectively refurbishes and exploits those samples whose labels can be corrected with high precision, which prevents the risk of noise accumulation while gradually increasing the number of training samples. Besides, Self-Ensemble Label Correction (SELC) [25] uses ensemble predictions formed by an exponential moving average of the network’s outputs to update the original noisy labels. Self-Evolution Average Label (SEAL) [26] presents a simple yet effective algorithm for instance-dependent noise, with both theoretical and empirical evidence of its robustness to such noise.
Sample Selection. In sample selection methods, different kinds of prior knowledge are usually modeled to choose more reliable samples for training. For example, most sample selection methods exploit the phenomenon that DNNs learn simple patterns before fitting label noise [27]: they first choose samples with small losses to train the network and then add more samples with large losses into the training process. In contrast, the Meta Weight Network (MWN) [5] tries to learn a sample selection criterion, in which a set of clean-labeled samples serves as meta data to learn how to choose samples with clean labels. Recently, mainstream methods use semi-supervised learning [14] and co-training techniques [28] to train networks, treating the selected clean samples as labeled data and the noisy samples as unlabeled data [29]. For example, the well-known DivideMix [13] models the sample losses with a GMM, which dynamically divides the training data into a labeled set of clean samples and an unlabeled set of noisy samples. Besides, the recent UNICON [12] uses the Jensen-Shannon divergence to select samples and has achieved state-of-the-art results in noisy label learning. Although these methods achieve promising performance, they are sensitive to training data with a high rate of noisy labels. To the best of our knowledge, the underlying reason is that the noisy labels are directly used to measure their own credibility, an idea mainly based on the phenomenon that networks first learn samples with clean labels and then learn samples with wrong labels [27]. Unlike mainstream sample selection methods, we explore the structure between sample pairs for data partition, and provide insight into why it is more robust than using the noisy labels to divide samples into clean and noisy sets.
Let \(\mathcal{X}\) denote the instance space, and \(\mathcal{Y}\) denote the label space, such that for each instance \(x\in\mathcal{X}\), there exists a corresponding label \(y\in\mathcal{Y}\). We denote the noise-free dataset as \(\mathbb{D}=\{\mathcal{X},\mathcal{Y}\}=\{(x_i,y_i)\}_{i=1}^{N}\), where \(x_i\) represents an image, \(y_i\) represents its corresponding label, and \(N\) denotes the total number of training samples. Let us consider a measurable function \(\mathcal{C}: \mathcal{X}\rightarrow \mathcal{Y}\), which maps the data samples to their corresponding real labels. Next, we consider datasets where the given labels may be corrupted and the labeling is not entirely accurate. We base our approach on the class-conditional noise assumption \(p(\tilde{y}|y,\boldsymbol{x})=p(\tilde{y}|y)\), where \(\tilde{y}\) represents the corrupted label [30]. In practice, we only have access to a training dataset \(\mathbb{\tilde{D}}=\{\mathcal{X},\mathcal{\tilde{Y}}\}=\{(x_i,\tilde{y}_i)\}_{i=1}^{N}\), whose label set is the noisy label set.
For a \(k\)-class classification problem, we begin by initializing a DNN model with a feature extractor \(f(\bullet;\theta)\). After the feature extractor, there is a classification layer, \(h : f(\mathcal{X};\theta) \rightarrow \mathbb{R}^{N \times k}\), and a projection head, \(g : f(\mathcal{X};\theta)\rightarrow \mathbb{R}^{N \times m}\), where \(m\) is the projection dimension; \(g\) is used for contrastive learning. We minimize a loss function \(l: \mathbb{R}^{N\times k}\times \mathcal{\tilde{Y}}\rightarrow \mathbb{R}^{N}\) to train with the given labels on the training set \(\tilde{\mathbb{D}}\). We face a sample selection problem where we need to partition the training set \(\tilde{\mathbb{D}}\) into a clean subset, \(\mathbb{D}_{clean}\), and a noisy subset, \(\mathbb{D}_{noisy} = \tilde{\mathbb{D}}\backslash \mathbb{D}_{clean}\). Then, \(\mathbb{D}_{clean}\) is used for supervised training, while \(\mathbb{D}_{noisy}\) is utilized for unsupervised training without its noisy ground-truth labels. This follows the standard semi-supervised setting, in which pseudo-labels are generated for the examples in \(\mathbb{D}_{noisy}\). This section introduces our proposed method for learning with noisy labels, which includes a sample selection module and a semi-supervised learning module, introduced in the following paragraphs.
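For concreteness, the following is a minimal sketch of the three heads described above, assuming a PyTorch implementation; the class name NoisyLabelModel and the use of torchvision's ResNet-18 (standing in for the PreAct ResNet-18 used in our experiments) are illustrative choices rather than the paper's released code.

```python
import torch
import torch.nn as nn
import torchvision

class NoisyLabelModel(nn.Module):
    """Feature extractor f(.;theta), classification head h, projection head g."""
    def __init__(self, num_classes: int, proj_dim: int = 128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                      # f: maps images to feature vectors
        self.f = backbone
        self.h = nn.Linear(feat_dim, num_classes)        # h: k-dimensional class scores
        self.g = nn.Sequential(                          # g: m-dimensional projection for contrastive learning
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        feat = self.f(x)
        return self.h(feat), self.g(feat), feat

# Toy forward pass on a batch of 32x32 images.
logits, proj, feat = NoisyLabelModel(num_classes=10)(torch.randn(4, 3, 32, 32))
```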
Consider the features extracted by the backbone with a projection head, \(\mathbb{G} = g(f(\mathcal{X},\theta),\phi)\). We partition \(\mathbb{G}\) based on the given labels in \(\mathcal{\tilde{Y}}\) as follows: \[\mathbb{G} = \{\mathbb{G}_i\}_{i=1}^{k},\] where \(k\) denotes the number of classes, and the samples in each \(\mathbb{G}_i\) share the same given label. We treat the problem as a binary classification task within each \(\mathbb{G}_i\): since every sample in the set carries the same given label, we only need to consider the feature differences among the samples. Clean samples that truly belong to the labeled class depict the same concept, so they have similar features, whereas noisy samples that should not be in the set depict different concepts and thus do not share similar features with the clean samples. To gauge the pairwise similarity of samples in the set \(\mathbb{G}_i\), we calculate the cosine similarity and generate an affinity matrix defined as:
Definition 1. Affinity Matrix: Let \(A^i \in \mathbb{R}^{n\times n}\), \(i=1,2,...,k\), denote the affinity matrix of \(\mathbb{G}_i\), where \(k\) is the number of classes in the training dataset and \(n\) is the number of samples in \(\mathbb{G}_i\). Row \(p\) of \(A^i\) contains the similarity measures between sample \(x_p\) and the other samples.
To facilitate the theoretical analysis, we introduce the following concept for noisy data:
Definition 2. Submerged: Assume that the affinities of clean samples \(a_p = \{a_{p_1},...,a_{p_n}\}\) are a sequence of i.i.d. random variables with expectation \(\mu_p\) and variance \(\sigma_{p}^{2}\), and that the affinities of noisy samples \(a_q = \{a_{q_1},...,a_{q_{n}}\}\) are a sequence of independent random variables satisfying the Lyapunov condition [31], with expectation \(\mu_q\) and variance \(\sigma_{q}^{2}\). If \[\sum_{i=p_1}^{p_n} a_i < \sum_{i=q_1}^{q_n} a_i,\] then the clean samples are said to be submerged by the noisy samples.
We now state our first theorem:
Theorem 1. Consider two samples \((x_p,\tilde{y})\) and \((x_q,\tilde{y})\) randomly selected from \(\mathbb{G}_i\), with their respective indices in the affinity matrix \(A^i\) being \(p\) and \(q\). Suppose the following conditions hold:
\(\mathcal{C}(x_p) \neq \tilde{y}\) and \(\mathcal{C}(x_q) = \tilde{y}\);
clean samples are not submerged by noise samples;
clean samples and noise samples obey different distributions.
Then the mean value of row \(p\) on the affinity matrix \(A^i\) follows a Gaussian distribution with mean \(\mu_p\) and the mean value of row \(q\) in the affinity matrix \(A^i\) follows a Gaussian distribution with mean \(\mu_q\), where \(\mu_q < \mu_p\).
Theorem 1 states that a sample is classified as noisy when its affinities with all other samples are small. Building on this result, we sum each matrix \(A^i\) over its rows to obtain a vector \(a^i \in \mathbb{R}^{n}\), \(i = 1,2,...,k\), whose entries are the sums of affinities involving clean and noisy samples. According to the central limit theorem, these row sums follow one normal distribution for the clean samples and another for the noisy samples. We omit dividing by \(n\), since scaling a normally distributed variable by a constant preserves normality. Therefore, we regard the samples belonging to the Gaussian component with the higher mean as clean samples, and the others as noisy samples. Based on this, we screen samples as follows. Let \(\mathbb{G}\) denote the training feature set; for a feature set \(\mathbb{G}_k=\{g_i\}_{i=1}^l\) with \(l\) features and label category \(k\), we compute the cosine similarity: \[A^i_{j,z} = \mathrm{Cosine}\left( g_z, g_j\right)=\dfrac{ g_z\cdot g_j}{|| g_z||\times|| g_j||},\] where \(g_z,g_j\) are two features in \(\mathbb{G}_k\). We then represent \(A^i\) as column vectors \(A^i=[a_1\;a_2\;...\;a_l]\). By summing these column vectors, \[a^i = \sum_{index=1}^l a_{index},\] we obtain a vector \(a^i\) whose entries follow two different normal distributions. To separate the samples, we fit a two-component GMM using the Expectation-Maximization algorithm. For each sample, the GMM yields posterior probabilities of belonging to the two Gaussian components, together with the mean of each component. We then select as clean those samples whose posterior probability for the higher-mean component exceeds a cutoff value \(d_{cutoff}\), which improves the purity of the clean set. The detailed sample selection procedure is presented in Algorithm 2.
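To make this procedure concrete, the following is a minimal sketch of the selection step for a single class, assuming the per-class features are available as a NumPy array and using scikit-learn's GaussianMixture as the EM fitting routine; the function name psdc_select and the toy data are illustrative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def psdc_select(feats: np.ndarray, d_cutoff: float = 0.9) -> np.ndarray:
    """Return a boolean mask over the samples of one class, True for those judged clean."""
    # Affinity matrix A^i: cosine similarity between every pair of samples in the class.
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    A = normed @ normed.T
    # Row sums a^i: total affinity of each sample to the rest of its class.
    a = A.sum(axis=1, keepdims=True)
    # A two-component GMM separates the high-mean (clean) and low-mean (noisy) modes.
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=1e-4).fit(a)
    clean_component = int(np.argmax(gmm.means_.ravel()))
    posterior = gmm.predict_proba(a)[:, clean_component]
    return posterior > d_cutoff

# Toy usage: a tight cluster of 50 "clean" features plus 10 scattered "noisy" ones.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal([1.0, 0.0], 0.05, (50, 2)),
                   rng.normal([-1.0, 1.0], 0.5, (10, 2))])
print(psdc_select(feats).sum(), "samples kept as clean")
```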
In practice, the features extracted by the network tend to be inaccurate at the beginning of training, so we additionally use the uniform selection criterion [12] to assist data partition in the early stages. In particular, the selection results given by JSD are used whenever the number of samples chosen jointly by PSDC and JSD is lower than 0.8 times the number chosen by JSD. As the iterations proceed, the representation capability of the network becomes stronger, and we then rely solely on PSDC for sample selection. As shown in Figure 3, we illustrate some sample selection examples on the Clothing1M, CIFAR-100 and CIFAR-10 datasets, in which the noise rates are set to \(3/7\), \(2/7\) and \(2/7\), respectively. Thanks to the strong capability of our PSDC, the average purity of the noisy and clean sets can reach \(6/7\), which is a strong guarantee for any semi-supervised learning algorithm.
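The early-stage fallback rule above can be summarized by a short sketch; it assumes the PSDC and JSD selections are available as index sets, and the helper name choose_clean_set is ours.

```python
def choose_clean_set(psdc_clean_idx: set, jsd_clean_idx: set, ratio: float = 0.8) -> set:
    """Fall back to the JSD selection while the network's features are still unreliable."""
    overlap = psdc_clean_idx & jsd_clean_idx
    if len(overlap) < ratio * len(jsd_clean_idx):
        return jsd_clean_idx      # early stage: trust the JSD (uniform selection) result
    return psdc_clean_idx         # features are reliable enough: use PSDC alone

# Example: the two criteria agree on 70 of JSD's 100 picks, below 0.8 x 100, so JSD is used.
print(len(choose_clean_set(set(range(70)), set(range(100)))))
```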
Various techniques have been employed in prior works to create clean and noisy subsets. Amongst them, the recent state-of-the-art method UNICON [12] uses JSD to select samples. We conducted an analysis to compare the theoretical scope and applicability of this method with our own. Assuming class-conditional noise, the label corruption can be expressed by a noise transition matrix \(T\in\mathbb{R}^{k\times k}\). Here, \(T_{ij}=p(\tilde{y}=j|y=i)\) indicates the probability of flipping a class-i example into a class-j example. The noisy data distribution satisfies \(p(\boldsymbol{x},\tilde{y}) = \sum_{i=1}^kp(\tilde{y}|y=i)p(\boldsymbol{x},y=i)\).
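As an illustration of this noise model, the snippet below builds the transition matrix T for the symmetric (uniform) and pairwise noise types and checks the diagonally-dominant condition used later in Theorem 2; the helper names and the convention that a rate-r symmetric corruption spreads r uniformly over the other classes are our assumptions.

```python
import numpy as np

def symmetric_T(k: int, rate: float) -> np.ndarray:
    """Uniform noise: flip to each of the other k-1 classes with equal probability."""
    T = np.full((k, k), rate / (k - 1))
    np.fill_diagonal(T, 1.0 - rate)
    return T

def pairwise_T(k: int, rate: float) -> np.ndarray:
    """Pairwise noise: each class flips only to one designated 'similar' class."""
    T = np.eye(k) * (1.0 - rate)
    for i in range(k):
        T[i, (i + 1) % k] = rate
    return T

T = symmetric_T(k=10, rate=0.5)
off_diag = T - np.diag(np.diag(T))
# Diagonally-dominant condition: T_ii > max(T_ij, T_ji) for all j != i.
print(np.all(np.diag(T) > np.maximum(off_diag.max(axis=1), off_diag.max(axis=0))))
```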
We assume the presence of a backbone that includes a classification layer, \(h(f(\boldsymbol{x};\Theta)):\mathcal{X}\rightarrow\mathbb{R}^{N\times k}\), with output \(h(\boldsymbol{x};\Theta)=[\hat{p}_1(\boldsymbol{x}),...,\hat{p}_k(\boldsymbol{x})]\in\mathbb{R}^k\), where \(\boldsymbol{x}\in \mathcal{X}\) is a random sample. For simplicity, we denote \(h(f(\boldsymbol{x};\Theta))\) as \(h(\boldsymbol{x})\) and omit the \(\Theta\) parameter. The softmax output of \(h\) for \((\boldsymbol{x},y)\) is \(\left[\hat{p}_{1}\left(\boldsymbol{x}\right), \ldots, \hat{p}_{k}\left(\boldsymbol{x}\right)\right]=\left[T_{h\left(\boldsymbol{x}\right) 1}, \ldots, T_{h\left(\boldsymbol{x}\right) k}\right]\).
Theorem 2. Consider two samples, \((\boldsymbol{x}_1,\tilde{y})\) and \((\boldsymbol{x}_2,\tilde{y})\), with the same observed label, randomly selected from \(\{\mathcal{X},\mathcal{\tilde{Y}}\}\). If the following conditions hold:
\(\mathcal{C}(\boldsymbol{x}_1)=\tilde{y}\) and \(\mathcal{C}(\boldsymbol{x}_2)\neq\tilde{y}\);
the noise transition matrix \(T\) of \(\{\mathcal{X},\mathcal{\tilde{Y}}\}\) satisfies the diagonally-dominant condition \(T_{ii}>\max\{\max_{j\neq i}T_{ij},\,\max_{j\neq i}T_{ji}\}\), \(\forall i\);
the noise type is uniform with a noise rate below 1, or pairwise with a noise rate below 0.5, or structured with a noise rate below 0.5.
Then \(\mathrm{JSD}(h(\boldsymbol{x}_1),\tilde{y})<\mathrm{JSD}(h(\boldsymbol{x}_2),\tilde{y})\), where JSD denotes the Jensen-Shannon divergence.
Theorem 2 provides a definite order relationship for specific types of noise. For other types of noise, however, the analysis is difficult, which raises concerns about the reliability of the method in realistic noisy environments. Additionally, because this method relies on the labels themselves, overfitting to noisy labels can make the JSD values of clean and noisy samples increasingly difficult to distinguish.
Our Theorem 1 instead specifies a submersion condition for effectiveness. This means that as long as the noise is not so severe that many noisy samples, which differ from the clean category yet resemble one another, appear simultaneously within the same class of the dataset, or that one class of noisy samples outnumbers the clean samples, our method can effectively detect the noise. In addition, the Lyapunov condition ensures that the random variables in the sequence are comparable in scale, with no individual variable being dominant. This condition implies the Lindeberg condition: \[\lim_{n\to\infty}\frac{1}{B_n^2}\sum_{i=1}^n E((X_i-\mu_i)^2I[|X_i-\mu_i|\ge\epsilon B_n])=0.\] In simple terms, the Lyapunov condition requires that the sum of the variances of all random variables, \(B_n^2\) in Eq. 1, \[\label{eq1} B_n^2 = Var(a_{q_1}) + Var(a_{q_2}) + ... + Var(a_{q_{n}}),\tag{1}\] be sufficiently large while the effect of any individual random variable \(a_i\) on the total variance is small. This ensures that no single random variable dominates the change in total variance. As DNN training converges, the affinity matrix values \(A^i\) become stable, so most datasets satisfy the Lyapunov condition. Moreover, pairwise similarity distribution clustering is not a label-dependent method; its effectiveness depends solely on the accuracy of feature extraction. This reduces the impact of network overfitting on noise during sample selection, giving our method an advantage over loss-based methods. Therefore, our method has a wider range of theoretical applicability and greater clarity.
In summary, grouping samples is a better strategy than selecting over all samples directly, because it fits the theoretical framework of Theorems 1 and 2, in which a partial order relation is established for samples that share the same given label but have different clean labels. In our work, if the grouping mechanism were removed and the similarity distribution were computed directly over all samples, then similar samples would no longer dominate, particularly when the dataset is large and contains many non-similar sample pairs. Under these conditions, the similarity of the clean samples would be overshadowed by that of the noisy samples, and PSDC would be invalid.
Proofs of Theorems 1 and 2 are provided in the supplementary material.
Once the training samples are divided into the clean set and the noisy set, any off-the-shelf semi-supervised learning method can be applied to further train the DNN. In particular, Figure 4 illustrates our training and sample updating process. At time \(t\), the training set is fed into the previously trained DNN to extract features. The sample features are then grouped by their given labels, and PSDC performs selection within each group. The resulting clean set \(\mathbb{D}_{clean}\) and noisy set \(\mathbb{D}_{noisy}\) are subsequently used for semi-supervised training of the DNN at time \(t+1\).
Inspired by DivideMix [13], we train two networks simultaneously. Algorithm 5 depicts the semi-supervised training procedure. At each epoch, a network receives \(\mathbb{D}_{clean}\) and \(\mathbb{D}_{noisy}\) as the labeled and unlabeled datasets, respectively. For each mini-batch, MixMatch [32] combined with SimCLR-style contrastive learning [33] is used for semi-supervised training. Training a model on data that it has partitioned using its own extracted features can lead to confirmation bias [34]; therefore, co-teaching [35] is employed to prevent error accumulation. The features extracted by one network are used for sample selection in the other network. The two networks remain distinct due to different random parameter initializations, training data selections, and training shuffles.
During training, we create strongly and weakly augmented versions of both the labeled and unlabeled datasets. The weakly augmented unlabeled dataset is used for guessing pseudo-labels, while the weakly augmented labeled dataset is used for label co-refinement. The strongly augmented unlabeled dataset is employed to compute the contrastive loss. Additionally, MixUp [36] is performed on the ground-truth-labeled samples and the pseudo-labeled samples from the labeled and unlabeled datasets, respectively, to produce two augmented datasets, namely \(\hat{\mathcal{X}}\) and \(\hat{\mathcal{U}}\).
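A minimal MixUp sketch is given below, assuming x1, x2 are image batches and p1, p2 their refined or pseudo label distributions; the max(lam, 1-lam) step follows the convention popularized by DivideMix and may differ from our exact implementation.

```python
import numpy as np
import torch

def mixup(x1, p1, x2, p2, alpha: float = 4.0):
    """Convexly combine two batches of inputs and their (soft) labels."""
    lam = float(np.random.beta(alpha, alpha))
    lam = max(lam, 1.0 - lam)          # keep the mixed sample closer to the first input
    x = lam * x1 + (1.0 - lam) * x2
    p = lam * p1 + (1.0 - lam) * p2
    return x, p

# Toy usage with uniform soft labels over 10 classes.
x, p = mixup(torch.randn(8, 3, 32, 32), torch.full((8, 10), 0.1),
             torch.randn(8, 3, 32, 32), torch.full((8, 10), 0.1))
```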
The semi-supervised losses are computed after the MixUp operation as follows: \[\mathcal{L}_{\mathcal{X}}=\frac{1}{|\hat{\mathcal{X}}|}\sum_{\mathbf{x},\mathbf{p}\in\hat{\mathcal{X}}}\mathrm{H}(\mathbf{p},\mathbf{h}(\mathbf{f}(\mathbf{x};\theta);\phi)),\] \[\mathcal{L}_\mathcal{U}=\dfrac{1}{|\hat{\mathcal{U}}|}\sum_{\mathbf{u},\mathbf{q}\in\hat{\mathcal{U}}}\|\mathbf{q}-\mathbf{h}(\mathbf{f}(\mathbf{u};\theta);\phi)\|_2^2,\] where \(\mathbf{p}\) denotes the co-refined labels, \(\mathbf{q}\) the pseudo labels, and \(\mathrm{H}(\bullet,\bullet)\) the cross-entropy. Moreover, we apply a regularization term to prevent all examples from being assigned to a single class [19]: \[\mathcal{L}_{R}=\sum_c\pi_c \log\Big(\frac{1}{\frac{1}{|\hat{\mathcal{X}}|+|\hat{\mathcal{U}}|}\sum_{\mathbf{x}\in\hat{\mathcal{X}}\cup\hat{\mathcal{U}}}\mathbf{h}(\mathbf{f}(\mathbf{x};\theta);\phi)}\Big).\] We introduce a contrastive loss for the unlabeled dataset, using the projected features \(z_i\) and \(z_j\) of the augmented samples \(x_i\) and \(x_j\) from the unlabeled dataset. The contrastive loss is expressed as: \[\ell_{i,j}=-\log\frac{\exp(sim(\mathbf{z}_i,\mathbf{z}_j)/\kappa)}{\sum_{b=1}^{2B}\mathbf{1}_{b\neq i}\exp(sim(\mathbf{z}_i,\mathbf{z}_b)/\kappa)},\] \[\label{8} {\mathcal{L}}_{\mathcal{C}}=\frac{1}{2B}\sum_{b=1}^{B}[\ell_{2b-1,2b}+\ell_{2b,2b-1}],\tag{2}\] where \(\mathbf{1}_{b\neq i}\) is an indicator function, \(\kappa\) is a temperature constant and we set \(\kappa=1\) in our work, \(B\) is the number of samples in a mini-batch, and \(sim(z_i,z_j)\) is the cosine similarity between \(z_i\) and \(z_j\). Finally, the total loss we minimize is \[\mathcal{L}_{T}=\mathcal{L}_{\mathcal{X}}+\lambda_{\mathcal{U}}\mathcal{L}_{\mathcal{U}}+\lambda_{R}\mathcal{L}_{R}+\lambda_\mathcal{C}\mathcal{L}_\mathcal{C},\] where \(\lambda_{\mathcal{U}}, \lambda_R,\lambda_\mathcal{C}\) are loss coefficients.
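For reference, the contrastive term in Eq. (2) can be computed as follows; this sketch assumes z1 and z2 hold the projected features of two augmentations of the same unlabeled mini-batch (2B views in total) and uses only standard PyTorch operations, so it may differ in detail from our implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, kappa: float = 1.0) -> torch.Tensor:
    """SimCLR-style contrastive loss averaged over all 2B anchors."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2B x m, first B views then their positives
    sim = (z @ z.t()) / kappa                            # pairwise cosine similarities / temperature
    sim.fill_diagonal_(float("-inf"))                    # realizes the indicator 1_{b != i}
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])  # index of each anchor's positive
    return F.cross_entropy(sim, targets)                 # mean of l_{i,j} over the 2B anchors

loss_c = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
```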
We evaluate our approach’s effectiveness on three benchmark datasets: CIFAR-10, CIFAR-100 [37], and a real-world dataset, Clothing1M [38], which are introduced as follows:
CIFAR-10/100: The CIFAR-10/100 datasets each contain 50k training and 10k test images. We experiment with two noise models: symmetric and asymmetric. In particular, symmetric noise is generated by randomly replacing the labels of an \(r\) fraction of the samples with labels drawn from all possible classes. In contrast, asymmetric label noise follows the structure of real-world mistakes, in which labels are only replaced by those of similar classes (e.g., bird \(\rightarrow\) airplane, deer \(\rightarrow\) horse); an illustrative noise-injection sketch is given after the dataset descriptions.
Clothing1M: Clothing1M contains 1M clothing images across 14 classes. The labels are noisy because the images were collected from multiple online shopping websites, resulting in numerous mislabeled samples. In addition, the dataset provides 50k, 14k, and 10k images for training, validation, and testing, respectively.
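The sketch below injects both noise types into a vector of labels; the asymmetric mapping (e.g., truck to automobile, bird to airplane, deer to horse, cat and dog swapped) is a commonly used CIFAR-10 pairing meant only to illustrate the class-similar flips described above, so the exact pairings may differ from our setup.

```python
import numpy as np

def inject_symmetric_noise(labels: np.ndarray, rate: float, k: int, seed: int = 0) -> np.ndarray:
    """Replace the labels of a rate-fraction of samples with labels drawn from all k classes."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    noisy[flip] = rng.integers(0, k, flip.sum())   # drawing from all classes may keep some labels unchanged
    return noisy

def inject_asymmetric_noise(labels: np.ndarray, rate: float, seed: int = 0) -> np.ndarray:
    """Flip labels only to a designated similar class, for a rate-fraction of eligible samples."""
    mapping = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}   # truck->automobile, bird->airplane, deer->horse, cat<->dog
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for i, y in enumerate(labels):
        if y in mapping and rng.random() < rate:
            noisy[i] = mapping[y]
    return noisy

labels = np.random.randint(0, 10, 50000)
print((inject_symmetric_noise(labels, rate=0.5, k=10) != labels).mean())
```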
For backbones, the PreAct ResNet-18 [39] architecture is used for CIFAR-10 and CIFAR-100, while ResNet-50 [40] is used for Clothing1M. Stochastic Gradient Descent (SGD) is employed with an initial learning rate of 0.04, momentum of 0.9, weight decay of \(5e^{-4}\), and a batch size of 128 for training on CIFAR-10 and CIFAR-100. The network is trained for 350 epochs with a 10-epoch warm-up for CIFAR-10 and a 30-epoch warm-up for CIFAR-100, decaying the learning rate by 0.1 every 120 epochs. For Clothing1M, the network is trained for 150 epochs with a 2-epoch warm-up, using a momentum of 0.9 and a weight decay of \(1e^{-3}\). The learning rate is initially set to 0.002 and reduced by a factor of 10 after 50 and 100 epochs, and the batch size is fixed at 32. The Auto-Augment policy [41] is used for data augmentation, with CIFAR10-Policy for CIFAR-10 and CIFAR-100 and ImageNet-Policy for Clothing1M. The cutoff threshold \(d_{cutoff}\) for sample selection is set to \(0.9\). For semi-supervised learning, the hyperparameter \(T\) is set to 0.5; \(\lambda_{\mathcal{C}}, \lambda_{\mathcal{U}}, \lambda_{R}\) and \(\kappa\) are set to \(0.025, 30, 1\) and \(0.05\), respectively; and the Beta distribution parameter used by MixUp is set to \(4\) for CIFAR-10/100 and \(0.5\) for Clothing1M. All experiments are run on NVIDIA GeForce RTX 3090 GPUs.
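For reproducibility, here is a sketch of the CIFAR optimizer and schedule stated above (SGD, learning rate 0.04, momentum 0.9, weight decay 5e-4, decayed by 0.1 every 120 epochs); it assumes a step decay, substitutes a stand-in linear layer for the PreAct ResNet-18, and omits the warm-up, Auto-Augment and semi-supervised losses.

```python
import torch

model = torch.nn.Linear(512, 10)                  # stand-in for PreAct ResNet-18
optimizer = torch.optim.SGD(model.parameters(), lr=0.04,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=120, gamma=0.1)

for epoch in range(350):
    # ... one epoch of warm-up or semi-supervised training would run here ...
    optimizer.step()                              # placeholder step so the scheduler advances cleanly
    scheduler.step()
```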
To compare with the state-of-the-art approaches, the performance of our method is evaluated under various label noise scenarios. These include synthetic noisy label datasets such as CIFAR-10 and CIFAR-100, as well as real-world noisy datasets like Clothing1M. In particular, the symmetric noise rates of \(20\%, 50\%, 80\%\) and asymmetric noise rates of \(10\%, 30\%, 40\%\) are considered in our experiments.
Table 1 reports the average performance on the CIFAR-10 and CIFAR-100 datasets under symmetric noise, where our method achieves better results than the state-of-the-art approaches. Specifically, on CIFAR-10, our PSDC outperforms the other methods at the medium (\(50\%\)) and severe (\(80\%\)) noise levels. Similarly, on CIFAR-100, PSDC performs better at the low (\(20\%\)), medium (\(50\%\)) and high (\(80\%\)) noise levels. PSDC’s accuracy is slightly lower than that of DivideMix [13] and UNICON [12] at the 20% noise rate on CIFAR-10, possibly due to the accumulation of errors caused by inaccurate early feature extraction.
Method | CIFAR-10 (20%) | CIFAR-10 (50%) | CIFAR-10 (80%) | CIFAR-100 (20%) | CIFAR-100 (50%) | CIFAR-100 (80%) |
---|---|---|---|---|---|---|
CE | 86.8 | 79.4 | 62.9 | 62.0 | 46.7 | 19.9 |
LDMI [42] | 88.3 | 81.2 | 43.7 | 58.8 | 51.8 | 27.9 |
MixUp [36] | 95.6 | 87.1 | 71.6 | 67.8 | 57.3 | 30.8 |
Co-teaching+ [43] | 89.5 | 85.7 | 67.4 | 65.6 | 51.8 | 27.9 |
DivideMix [13] | 96.1 | 94.6 | 92.9 | 77.3 | 74.6 | 60.2 |
UNICON [12] | 96.0 | 95.6 | 93.9 | 78.9 | 77.6 | 63.9 |
PSDC (ours) | 96.2 | 95.7 | 94.0 | 79.4 | 77.7 | 64.3 |
Method | CIFAR-10 (10%) | CIFAR-10 (30%) | CIFAR-10 (40%) | CIFAR-100 (10%) | CIFAR-100 (30%) | CIFAR-100 (40%) |
---|---|---|---|---|---|---|
CE | 88.8 | 81.7 | 76.1 | 68.1 | 53.3 | 44.5 |
LDMI [42] | 91.1 | 91.2 | 84.0 | 68.1 | 54.1 | 46.2 |
MixUp [36] | 93.3 | 83.3 | 77.7 | 72.4 | 57.6 | 48.1 |
DivideMix [13] | 93.8 | 92.5 | 91.7 | 71.6 | 69.5 | 55.1 |
MOIT [44] | 94.2 | 94.1 | 93.2 | 77.4 | 75.1 | 74.0 |
UNICON [12] | 95.3 | 94.8 | 94.1 | 78.2 | 75.6 | 74.8 |
PSDC (ours) | 95.6 | 95.1 | 94.2 | 79.1 | 77.8/80.47 | 75.1 |
Method | Tiny-ImageNet (0% noise) | Tiny-ImageNet (20% noise) | Tiny-ImageNet (50% noise) |
---|---|---|---|
CE | 57.4 | 35.8 | 19.8 |
Decoupling | - | 37.0 | 22.8 |
F-correction | - | 44.5 | 33.1 |
MentorNet | - | 45.7 | 35.8 |
Co-teaching+ | 52.4 | 48.2 | 41.8 |
M-correction | 57.7 | 57.2 | 51.6 |
NCT | 62.4 | 58.0 | 47.8 |
UNICON | 62.7 | 59.2 | 52.7 |
PSDC (ours) | 63.1 | 60.9 | 53.5 |
Table 2 presents the average performance on the CIFAR-10 and CIFAR-100 datasets under asymmetric noise, where our method achieves better results than the state-of-the-art approaches. For CIFAR-100 at a 40% asymmetric noise rate, the performance of our method is on par with that of UNICON [12]. This is attributed to the fact that most methods struggle to learn under high noise rates. In the case of our method, an increase in the noise rate raises the likelihood of clean samples being submerged by noisy samples, which degrades performance.
Table 4 illustrates the average performance on the Clothing1M dataset. From the results, we find that our PSDC yields better results than most of the baseline methods, albeit slightly inferior to UNICON [12]. This is likely due to UNICON’s adoption of a category balance strategy, which enhances its test performance.
We further study the effect of removing different components, which can offer a better understanding of the factors that contribute to the success of our approach. Without loss of generality, we evaluate our method on the CIFAR-100 dataset for convenience.
Effectiveness of Pairwise Similarity Distribution: To evaluate the effect of the pairwise similarity distribution in sample selection, we compare it with two other sample selection schemes under 50% and 80% symmetric noise rates. The test accuracies are shown in Table 6, in which: (1) “GMM” means clustering the samples directly on the features extracted by the backbone network; (2) “GMM+CE” means combining the cross-entropy loss with a GMM to cluster samples, as done in DivideMix [13]; and (3) “GMM+PSDC” means using the pairwise similarity distribution to represent the sample structure and a GMM to divide the samples into a clean set and a noisy set. From the results, we find that “GMM+PSDC” achieves the best results in noisy label learning, which indicates that combining the GMM with the pairwise similarity distribution yields a robust sample selection method under different noise rates.
Method | Backbone | Test Accuracy |
---|---|---|
CE | ResNet-50 | 69.21 |
Joint-Optim [19] | ResNet-50 | 72.00 |
MetaCleaner [45] | ResNet-50 | 72.50 |
PCIL [46] | ResNet-50 | 73.49 |
DivideMix [13] | ResNet-50 | 74.76 |
ELR [47] | ResNet-50 | 74.81 |
UNICON [12] | ResNet-50 | 74.98 |
CC | ResNet-50 | 75.4 |
PSDC (ours) | ResNet-50 | 75.55 |
Method | Backbone | Test Accuracy |
---|---|---|
CE | Vgg19-BN | 79.4 |
Nested Dropout | Vgg19-BN | 81.3 |
SELFIE | Vgg19-BN | 81.8 |
Nested+Co-teaching (NCT) | Vgg19-BN | 84.1 |
InstanceGM with ConvNeXt | ConvNeXt | 84.7 |
Dynamic Loss | Vgg19-BN | 86.5 |
BtR | Vgg19-BN | 88.5 |
PSDC (ours) | Vgg19-BN | 87.8 |
Method | CIFAR-100 (50% sym. noise) | CIFAR-100 (80% sym. noise) |
---|---|---|
GMM + PSDC | 77.7 | 64.3 |
GMM + CE | 74.6 | |
GMM | 74.5 | |
Method | CIFAR-100 (50% sym. noise) | CIFAR-100 (80% sym. noise) |
---|---|---|
PSDC + GMM | 77.7 | 64.3 |
PSDC + K-means | 69.2 | |
To support our viewpoint, we present the accuracy of the selected samples in the clean set under a 50% symmetric noise rate, as shown in Figure 6, in which “Begin” and “Highest” denote the model obtained right after warm-up training and the model with the highest accuracy during training, respectively. From the results, we find that the lowest accuracy is achieved by simply using the GMM to cluster samples, because it is hard to handle the high-dimensional features without considering the noisy labels or the sample structure. In contrast, the highest accuracy is achieved by jointly using the GMM and the pairwise similarity distribution to cluster samples, which indicates that the pairwise sample structure is more robust than the noisy-label prior in sample selection at moderate noise rates. This is because the pairwise similarity distribution relies solely on the extracted features and is not directly affected by noisy labels during sample selection.
Effectiveness of Gaussian Mixture Model: To examine the effectiveness of the GMM in sample clustering, we compare it with the widely used K-means clustering method under 50% and 80% symmetric noise rates. Note that K-means requires a few clean-labeled samples to conduct sample selection, whereas the GMM does not need any clean-labeled samples. In practice, we provide K-means with 3 clean-labeled samples to divide the training samples into a clean set and a noisy set. The test accuracies are shown in Table 7, in which “PSDC+GMM” achieves better results than “PSDC+K-means” in both cases. The results indicate that even though no clean-labeled samples are used by our GMM, it is superior to K-means for sample selection. The underlying reason is that the GMM provides a posterior probability for each cluster, so the purity of the clean set can be significantly improved by setting a threshold during training.
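The following toy comparison illustrates why the GMM posterior matters, assuming a one-dimensional vector of row-sum scores with two modes; unlike K-means' hard assignments, the GMM posterior lets us keep only the samples that are confidently in the higher-mean component.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(10, 1, 300),    # clean-like mode
                         rng.normal(6, 1, 200)      # noisy-like mode
                         ]).reshape(-1, 1)

hard = KMeans(n_clusters=2, n_init=10).fit_predict(scores)        # hard labels only
gmm = GaussianMixture(n_components=2).fit(scores)
clean_component = int(np.argmax(gmm.means_.ravel()))
confident = gmm.predict_proba(scores)[:, clean_component] > 0.9   # purity threshold, like d_cutoff
print("K-means cluster sizes:", np.bincount(hard))
print("samples kept with posterior > 0.9:", int(confident.sum()))
```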
This paper presents a novel pairwise similarity distribution clustering method for training networks with noisy labels. It divides the training samples into a clean set and a noisy set, so that any off-the-shelf semi-supervised learning method can be used to train the networks. Unlike previous methods that take the noisy labels as prior information, we utilize the pairwise similarity distribution as the sample structure, which increases adaptability to severe label noise. Our results demonstrate significant improvements over prior work on various datasets. As future work, we plan to investigate how the pairwise sample structure and the noisy-label prior can complement each other in noisy label learning.