Existing knowledge distillation methods mostly focus on distilling the teacher’s predictions and intermediate activations. However, the structured representation, which is arguably one of the most critical ingredients of deep models, is largely overlooked. In this work, we propose a novel Semantic Representational Distillation (SRD) method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student. The key idea is to leverage the teacher’s classifier as a semantic critic for evaluating the representations of both teacher and student, and to distil semantic knowledge with high-order structured information over all feature dimensions. This is accomplished by introducing a notion of cross-network logit, computed by passing the student’s representation into the teacher’s classifier. Further, by viewing the set of seen classes as a basis of the semantic space in a combinatorial perspective, we scale SRD to unseen classes to enable effective exploitation of largely available, arbitrary unlabeled training data. At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL). Extensive experiments show that our SRD significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks, as well as on the less studied yet practically crucial problem of binary network distillation. Under the more realistic open-set SSL settings we introduce, we reveal that knowledge distillation is generally more effective than existing Out-Of-Distribution (OOD) sample detection, and that our proposed SRD is superior to both previous distillation and SSL competitors. The source code is available at https://github.com/jingyang2017/SRD_ossl.
Optimizing lightweight Convolutional Neural Networks (CNNs) to be highly performing is critical, e.g., for enabling development on resource-limited platforms such as mobile devices. To that end, different model compression approaches have been extensively investigated, including network pruning [1], [2], network quantization [3], [4], neural architecture search [5], [6], and knowledge distillation [7], [8]. In particular, knowledge distillation aims to transfer knowledge from a stronger network (i.e., the teacher) to another (i.e., the student). Typically, the teacher is a high-capacity model or an ensemble capable of achieving stronger performance, while the student is a compact model with far fewer parameters requiring much less computation. The objective is to facilitate the optimization of the student by leveraging the teacher’s capacity. A general rationale behind distillation can be explained from an optimization perspective: higher-capacity models are able to find better local minima thanks to over-parameterization [9], [10].
Existing knowledge distillation methods start with transferring classification predictions [7] and intermediate representations (e.g., feature tensors [11] and attention maps [8]). However, they are limited in distilling structured representational knowledge, including the latent complex interdependencies and correlations between different dimensions. This is because their objective formulations typically treat all the feature or prediction dimensions independently. Motivated by this analysis, a representation distillation method [12] was recently developed based on contrastive learning [13], [14]. The concrete idea is to maximize the mutual information between the teacher’s and student’s representations via contrastive learning. Despite being a principled solution grounded in seminal information theory [15], this method is limited in perceiving and distilling high-level semantics, because the teacher’s classifier, which maps the feature representation to the semantic class space, is entirely ignored during distillation. Further, contrastive learning often requires a large number of training samples in the loss computation, meaning a need for resource-demanding large mini-batches or complex remedies (e.g., a memory bank).
To overcome the aforementioned limitations, in this work a novel Semantic Representational Distillation (SRD) method is introduced. Our key idea is to leverage the pretrained teacher’s classifier as a semantic critic for guiding representational distillation in a classification-aware manner. Concretely, we introduce a notion of cross-network logit, obtained by feeding the student’s representation to the teacher’s classifier. Given that the teacher and student share the same input, aligning the cross-network logit with the teacher’s counterpart then enables the distillation of high-order semantic correlations among feature dimensions, i.e., semantic distillation of representation. Further, we extend the proposed SRD to open-set semi-supervised learning (SSL) by exploiting unconstrained unlabeled data from arbitrary classes. This is motivated by our perspective that the seen classes of labeled training data can be regarded collectively as a basis of the semantic space, in the sense of linear algebra, and that any unseen class can be approximated by a specific combination of seen classes. This hypothesis naturally removes the obstacle of generalizing the knowledge of seen classes to unseen classes, a key underlying challenge in solving open-set SSL (e.g., over-confident classification on the samples of unseen classes [16]).
Our contributions are three-fold: (I) We propose a simple yet effective Semantic Representational Distillation (SRD) method with a focus on structured representation optimization via semantic knowledge distillation. This is realized by taking the teacher’s classifier as a semantic critic for evaluating both the teacher’s and student’s representations in terms of their classification ability. (II) We connect semantic distillation with open-set semi-supervised learning based on the idea that seen classes can be used as a basis of the semantic space. (III) Extensive experiments show that the proposed SRD method can train more generalizable student models than state-of-the-art distillation methods across a variety of network architectures (e.g., Wide ResNets, ResNets, and MobileNets) and recognition tasks (e.g., coarse-grained object classification and fine-grained face recognition, real and binary network distillation). Compared to previous open-set SSL works, we further introduce more realistic experiment settings characterized by more classes and unlabeled data with different distributions, as well as fewer common classes between the labeled and unlabeled sets. Critically, our experiments reveal that knowledge distillation turns out to be a more effective strategy than the previously often adopted Out-Of-Distribution (OOD) detection (Table 5 vs. Table [tab:ssl95tin]), and our SRD outperforms both state-of-the-art distillation and SSL methods, often by a large margin. On the other hand, it is also shown that OOD detection brings only marginal benefits to knowledge distillation methods (Table 7).
This is an extension of our preliminary ICLR 2021 work [17]. We further make the following significant contributions: (1) Extending our method from general knowledge distillation to open-set semi-supervised learning, bridging two previously independently investigated fields. (2) Analyzing the limitations of existing open-set SSL settings and introducing more realistic ones with less constrained unlabeled data, such as less class overlap between labeled and unlabeled sets. (3) Evaluating and comparing comprehensively both knowledge distillation and open-set SSL methods, with new findings and insights on tackling more unconstrained unlabeled data. (4) To show the generality of our approach, evaluating on a diverse range of problems with varying underlying characteristics, such as coarse-grained object classification and fine-grained face recognition distillation.
Knowledge distillation is an effective approach to optimizing low-capacity networks and has been extensively studied in image classification [7], [8], [11], [18]–[31]. Existing distillation methods can generally be divided into two categories: isolated knowledge based and relational knowledge based.
Isolated knowledge based methods: The seminal work of Hinton et al. [7] popularized research on knowledge distillation by simply distilling the teacher’s classification outputs (i.e., the knowledge). Compared to one-hot class label representation, this knowledge is semantically richer as it encodes underlying inter-class similarity information. Soon after, intermediate teacher representations such as feature tensors [11] were also leveraged for richer distillation. However, matching whole feature tensors is not necessarily viable in certain circumstances due to the capacity gap between the teacher and student, and may even adversely affect the performance and convergence of the student. As an efficient remedy, Attention Transfer (AT) [8] can be more achievable since feature attention maps (i.e., a summary of all the feature channels) represent a more flexible form of knowledge to be learned. A subsequent extension of AT based on the maximum mean discrepancy of activations [18] shares the same spirit. Interestingly, Cho et al. [27] reveal that very strong networks can be “too good” to be effective teachers. To mitigate this issue, they stop the teacher’s training early. Later on, Heo et al. [31] study the effect of the distillation location within the network, along with a margin ReLU and a specifically designed distance function for maximizing positive knowledge transfer. More recently, Passalis et al. [32] leverage the previously ignored information plasticity by exploiting the information flow through the teacher’s layers.
Relational knowledge based methods: Another line of knowledge distillation methods instead explores relational knowledge. For example, Yim et al. [19] distil feature correlations by aligning the layer-wise Gram matrices of feature representations across the teacher and student. A clear limitation of this method is its high computational cost, which can be alleviated to some extent by compressing the feature maps using singular value decomposition [22]. Park et al. [28] consider both distance-wise and angle-wise relations of each embedded feature vector. This idea is subsequently extended by [26] for better capturing the correlation between multiple instances with a Taylor series expansion, and by [29] for modeling the feature space transformation across layers via a graph with instance features and relationships as vertexes and edges. Inspired by the observation that semantically similar samples should give similar activation patterns, Tung et al. [23] introduce similarity-preserving knowledge distillation w.r.t. the generation of either similar or dissimilar activations. Besides, Jain et al. [33] exploit relational knowledge w.r.t. a quantized visual word space during distillation. For capturing more detailed and fine-grained information, Li et al. [34] employ the relationships among local regions in the feature space. In order to distil richer representational knowledge from the teacher, Tian et al. [12] maximize the mutual information between the teacher’s and student’s representations. Whilst sharing a similar objective, in this work we instead correlate the teacher’s and student’s representations by considering the pretrained teacher’s classifier as a semantic critic.
Despite its simplicity, we show that our method is superior to and more generalizable than prior work [12] in distilling the underlying semantic representational information over a variety of applications (see Sec. 4.2 and Sec. 5.1).
Different from previous works, we further exploit the potential of distillation using unlabeled training data, which are often available at scale. This brings together the two fields of knowledge distillation and open-set semi-supervised learning [16], [35], which have developed independently, and importantly presents a unified perspective and common ground that enable natural model comparison and idea exchange across the two fields.
Most existing semi-supervised learning (SSL) works [36]–[44] make a closed-set assumption that unlabeled training data share the same label space as the labeled data. This assumption, however, is highly artificial and may hinder the effectiveness of SSL when processing real-world unconstrained unlabeled data with unseen classes, i.e., out-of-distribution (OOD) samples [35]. This is because OOD data could cause harmful error propagation, e.g., via incorrect pseudo labels.
To further generalize SSL to unconstrained data without labels, there is a recent trend of developing more realistic open-set SSL methods [16], [45]–[50]. A common strategy of these works is to identify and suppress/discard OOD samples, as they are considered to be less or not beneficial. Specifically, pioneering methods (UASD [16] and DS\(^3\)L [45]) leverage a dynamic weighting function based on the OOD likelihood of an unlabeled sample. Curriculum learning has been used to detect and drop potentially detrimental data [46]. Besides, T2T [48] pretrains the feature model with all unlabeled data for improving OOD detection. [49] leverages both sample uncertainty and prior knowledge about the class distribution to produce pseudo-labels for unlabeled data. More recently, OpenMatch [50] trains a set of one-vs-all classifiers for OOD detection and removal during SSL.
Whilst taking a step away from the artificial closed-set assumption, most existing open-set SSL works either focus on a simplified setting where both labeled and unlabeled sets are drawn from a single dataset, or consider only limited known classes and unlabeled data [46], [48] with all the known classes included in the unlabeled set [48], [50]. Clearly, both cases are fairly idealized and hardly hold in many practical scenarios. To overcome this limitation, we introduce more realistic open-set SSL settings characterized by more classes and unlabeled data with distinct distributions, and less class overlap between labeled and unlabeled sets. Critically, we find that existing open-set SSL methods fail to benefit from using unlabeled data under such unconstrained settings (see Table [tab:ssl95tin]). This challenges all the previous OOD based findings and is thought-provoking. The main reasons we find include more challenging OOD detection and the intrinsic limitation of exploiting unlabeled samples from seen classes alone. Further, our experiments show that knowledge distillation methods provide a more effective and reliable solution to leveraging unlabeled data with fewer constraints (see Table 5 vs. Tables [tab:ssl95tin], [tab:KD95place], and [tab:KD95cc3m]).
Self-supervised learning has been leveraged for enhancing knowledge distillation [51]. Interestingly, the usage of unlabeled data is not considered. We empirically show that our SRD can readily benefit from this strategy in the open-set SSL setting (Table [tab:sskd]).
Figure 1: Schematic overview of the proposed Semantic Representational Distillation (SRD) method for knowledge distillation in the presence of both labeled and unlabeled data. Following the knowledge distillation pipeline, (a) we first pretrain a teacher model on the labeled training set. (b) Subsequently, we distil the semantic knowledge from the pretrained frozen teacher to improve the optimization of a student. Specifically, given a training image \(I\), we feed it into both the teacher \(T\) and the student \(S\) to obtain the feature representations \(\boldsymbol{x}^t\) and \(\boldsymbol{x}^s\). Critically, we introduce a notion of cross-network logit \(\hat{\boldsymbol{z}}\), obtained by passing the student’s representation \(\boldsymbol{x}^s\) into the teacher’s classifier \(h^t\) via a feature adaptor \(\varphi\). Considering the teacher’s classifier \(h^t\) as a semantic critic, we distil the semantic knowledge of \(\boldsymbol{x}^t\) to \(\boldsymbol{x}^s\) by aligning the cross-network logit \(\hat{\boldsymbol{z}}\) towards the teacher’s logit \(\boldsymbol{z}^t\). In this design, the two representations \(\boldsymbol{x}^t\) and \(\boldsymbol{x}^s\) share the same semantic critic (i.e., classifier), which facilitates representational knowledge distillation. To further ease the semantic distillation, we impose a feature-level alignment regularization \(\mathcal{R}\) between the teacher’s representation \(\boldsymbol{x}^t\) and the adapted student’s representation \(\hat{\boldsymbol{x}}\). For labeled training samples, we also apply a supervised learning loss on the student’s prediction.
A generic CNN consists of a feature extractor \(f: I \rightarrow \boldsymbol{x}\), and a classifier \(h: \boldsymbol{x} \rightarrow \boldsymbol{p}\), where \(I \in \mathbb{R}^{H\times W \times 3}\), \(\boldsymbol{x} \in \mathbb{R}^d\), \(\boldsymbol{p} = [p_1, \cdots, p_k, \cdots, p_K] \in \mathbb{R}^K\) denote an input image sized at \(H \times W\), its feature vector of \(d\) dimensions, and its classification probability over \(K\) classes, respectively. Often, \(\boldsymbol{x}\) is obtained by global average pooling over the last feature map \(\boldsymbol{F}\). The classifier \(h\) is parameterized by a projection matrix \(\boldsymbol{W} \in \mathbb{R}^{d \times K}\) that first projects \(\boldsymbol{x}\) into the logits: \(\boldsymbol{z} = \boldsymbol{W}^{\top} \boldsymbol{x} = [z_1, \cdots, z_k, \cdots, z_K]\), followed by softmax normalization: \[\begin{align} \label{eq:sm} {p}_k = \texttt{sm}({z}_k)=\frac{\exp({z}_k)}{\sum_{k'=1}^K{\exp({z}_{k'})}}, \end{align}\tag{1}\] where \(k \in \{1,\cdots,K\}\) indexes the class.
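To make the notation concrete, the following is a minimal PyTorch sketch of this decomposition into a feature extractor \(f\) and a linear classifier \(h\); the toy backbone and the default values of \(d\) and \(K\) are illustrative assumptions rather than any architecture used in this paper.

```python
import torch
import torch.nn as nn

class GenericCNN(nn.Module):
    """Toy stand-in for the generic CNN notation: f (features) followed by h (classifier)."""
    def __init__(self, d=512, K=100):
        super().__init__()
        # Feature extractor f: image I -> last feature map F -> global average pooling -> x
        self.f = nn.Sequential(
            nn.Conv2d(3, d, kernel_size=3, padding=1),
            nn.BatchNorm2d(d),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),          # global average pooling over F
        )
        # Classifier h: x -> z = W^T x, with W in R^{d x K} (stored transposed by nn.Linear)
        self.h = nn.Linear(d, K, bias=False)

    def forward(self, I):
        x = self.f(I).flatten(1)              # feature vector x of d dimensions
        z = self.h(x)                         # logits z
        p = torch.softmax(z, dim=1)           # classification probability p (Eq. 1)
        return x, z, p
```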
In general knowledge distillation [7], we have a teacher network \(T=\{f^t, h^t\}\) and a student (target) network \(S=\{f^s, h^s\}\). Training proceeds in two steps. In the first step, the teacher network \(T\) is pretrained on a labeled training set \(\mathcal{D}_l\) in a supervised learning manner. Often, the cross-entropy loss is adopted: \[\begin{align} \label{eq:ce} \mathcal{L}_{ce} = -\sum_{k=1}^{K} y_k \log p_k \end{align}\tag{2}\] where \(\boldsymbol{y} = [y_1, \cdots, y_k, \cdots, y_K] \in \mathcal{Y}\) is the one-hot ground-truth label of a given input image \(I \in \mathcal{D}_l\). In the second step, the student network is trained under distillation with the frozen teacher and supervised learning with the ground-truth labels (e.g., the cross-entropy loss). A typical knowledge distillation process is realized by logit-matching [7], which minimizes the KL divergence between the predictions of \(T\) and \(S\): \[\begin{align} \label{eq:kd} \mathcal{L}_{kd} = -\sum_{k=1}^{K} p_k^t \log p_k^s, \;\; \text{where} \\ \nonumber \boldsymbol{p}^t = [p_1^t, \cdots, p_K^t] = \texttt{sm}(\boldsymbol{z}^{t}), \boldsymbol{z}^{t} = h^t(\boldsymbol{x}^t), \boldsymbol{x}^t = f^t(I); \\ \nonumber \boldsymbol{p}^s = [p_1^s, \cdots, p_K^s] = {\texttt{sm}(\boldsymbol{z}^{s})}, \boldsymbol{z}^{s} = h^s(\boldsymbol{x}^s),\boldsymbol{x}^s = f^s(I). \end{align}\tag{3}\] Whilst this formulation has been shown to be effective, we consider it less dedicated to distilling the teacher’s representational knowledge, especially when considering the structured correlations and inter-dependencies between distinct feature dimensions.
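As a reference point for later comparisons, below is a hedged sketch of the logit-matching loss of Eq. 3; the temperature \(T\) follows the common practice of [7] and is an optional knob here, since Eq. 3 omits it for brevity.

```python
import torch.nn.functional as F

def kd_loss(z_t, z_s, T=4.0):
    """Logit-matching distillation (Eq. 3): KL divergence between softened predictions."""
    p_t = F.softmax(z_t / T, dim=1)           # teacher probabilities p^t
    log_p_s = F.log_softmax(z_s / T, dim=1)   # student log-probabilities log p^s
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```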
To overcome the aforementioned problem, we propose a novel distillation method, dubbed Semantic Representational Distillation (SRD), dedicated to enhancing the representational transfer from the pretrained teacher to the target student during the distillation process. An overview of SRD is depicted in Fig. 1. Specifically, we leverage the pretrained teacher’s classifier \(h^t\) as a semantic critic for explicitly distilling the underlying semantic knowledge of the teacher’s representation \(\boldsymbol{x}^t\) to the student’s counterpart \(\boldsymbol{x}^s\). That is, the same classifier is shared by the two representations \(\boldsymbol{x}^t\) and \(\boldsymbol{x}^s\), providing a dedicated channel for representational knowledge distillation.
Formally, we pass the student’s representation \(\boldsymbol{x}^{s}\) through the teacher’s classifier \(h^{t}\) to obtain the cross-network logit as: \[\begin{align} \label{eq:xnet95logit} \hat{\boldsymbol{z}} = h^t \big( \varphi(\boldsymbol{x}^s) \big) = [\hat{z}_1, \cdots, \hat{z}_k, \cdots, \hat{z}_K], \end{align}\tag{4}\] where \(\varphi\) is a representation adaptor for making \(\boldsymbol{x}^s\) compatible with the teacher’s classifier. In practice, \(\varphi\) is implemented by a 1\(\times\)1 convolutional layer with batch normalization and activation applied on the last feature map of the student. We formulate a general SRD objective function as: \[\begin{align} \label{eq:srd} \mathcal{L}_{srd} = \texttt{dist}(\boldsymbol{z}^t, \hat{\boldsymbol{z}}), \end{align}\tag{5}\] where \(\boldsymbol{z}^t\) is the teacher’s logit and \(\texttt{dist}(\cdot, \cdot)\) denotes any distance metric. For the extreme case of \(\boldsymbol{z}^t = \hat{\boldsymbol{z}}\) (corresponding to the minimal \(\mathcal{L}_{srd}\)), since the teacher’s classifier is shared by both representations, we have \(\boldsymbol{x}^s=\boldsymbol{x}^t\) subject to sufficient conditions such as a full-rank transformation matrix, i.e., the full knowledge of \(\boldsymbol{x}^t\) has been transferred to \(\boldsymbol{x}^s\). Generally, minimizing \(\mathcal{L}_{srd}\) is equivalent to maximizing the knowledge transfer from \(\boldsymbol{x}^t\) to \(\boldsymbol{x}^s\).
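A sketch of the adaptor \(\varphi\) and the cross-network logit of Eq. 4 is given below; the channel sizes and the use of global average pooling after the adaptor are assumptions of this sketch, with the teacher’s classifier taken as a frozen linear layer as in the earlier notation.

```python
import torch
import torch.nn as nn

class Adaptor(nn.Module):
    """phi: 1x1 convolution + batch normalization + activation on the student's last feature map."""
    def __init__(self, s_channels, t_channels):
        super().__init__()
        self.conv = nn.Conv2d(s_channels, t_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(t_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, F_s):                       # F_s: student's last feature map
        return self.relu(self.bn(self.conv(F_s)))

def cross_network_logit(F_s, adaptor, teacher_classifier):
    x_hat = adaptor(F_s).mean(dim=(2, 3))         # adapted student representation \hat{x}
    z_hat = teacher_classifier(x_hat)             # cross-network logit \hat{z} = h^t(phi(x^s))
    return x_hat, z_hat
```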
To implement the SRD objective, we consider three different designs. The first design adopts the KL divergence, following the logit-matching distillation function: \[\label{eq:srd95kl} \mathcal{L}_{srd}^{kl} = -\sum_{k=1}^{K} p_k^t \log \hat{p}_k,\tag{6}\] where \(\boldsymbol{\hat{p}} = [\hat{p}_1, \cdots, \hat{p}_k, \cdots, \hat{p}_K] = \texttt{sm}(\hat{\boldsymbol{z}})\) is the cross-network classification probability and \(p_k^t\) is the teacher’s classification probability obtained as in Eq. 1. It is noteworthy that logit-matching distillation (Eq. 3) uses separate classifiers \(h^t\)/\(h^s\) for the representations \(\boldsymbol{x}^t\)/\(\boldsymbol{x}^s\); compared to our SRD, which shares a single classifier across both representations, this gives more degrees of freedom to the optimization of the feature extractor, resulting in a less dedicated constraint on representational knowledge distillation. We will show in the experiments (Sec. 4.1) that our SRD yields clearly superior generalization capability.
In the second design, we adopt the mean square error (MSE) as the distillation loss: \[\begin{align} \label{eq:srd95mse} \mathcal{L}_{srd}^{mse} = \left \| \boldsymbol{z}^t -\hat{\boldsymbol{z}} \right\|^2 = \left \| (\boldsymbol{W}^t)^\top (\boldsymbol{x}^t-\varphi(\boldsymbol{x}^s)) \right\|^2. \end{align}\tag{7}\] This is essentially a Mahalanobis distance with the linear transformation defined by the teacher’s classifier weights \(\boldsymbol{W}^t\). As \(\boldsymbol{W}^t\) is pretrained, this imposes semantic correlation over all the feature dimensions, making the distillation process class-discriminative.
In the third design, we consider the MSE over classification probabilities, obtained by further applying softmax normalization: \[\label{eq:srd95msev2} \mathcal{L}_{srd}^{pmse} = \left \| \boldsymbol{p}^t -\boldsymbol{\hat{p}} \right\|^2.\tag{8}\] This allows us to evaluate the effect of normalization in comparison to the second design. We evaluate these different designs in our experiments (Table 1).
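The three candidate instantiations of \(\texttt{dist}(\cdot,\cdot)\) in Eq. 5 can be sketched as follows; the reductions (mean over the batch and over dimensions) are implementation assumptions, whereas Eqs. 6-8 are written per sample.

```python
import torch.nn.functional as F

def srd_kl(z_t, z_hat):
    # Eq. 6: cross entropy between the teacher's and the cross-network probabilities
    return -(F.softmax(z_t, dim=1) * F.log_softmax(z_hat, dim=1)).sum(dim=1).mean()

def srd_mse(z_t, z_hat):
    # Eq. 7: squared error between the teacher's and the cross-network logits
    return F.mse_loss(z_hat, z_t)

def srd_pmse(z_t, z_hat):
    # Eq. 8: squared error between the softmax-normalized probabilities
    return F.mse_loss(F.softmax(z_hat, dim=1), F.softmax(z_t, dim=1))
```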
We formulate the overall objective function of SRD on the labeled training set \(\mathcal{D}_l\) as: \[\mathcal{L}_l = \mathcal{L}_{ce}(\mathcal{D}_l) + \alpha \mathcal{L}_{srd}(\mathcal{D}_l) + \beta \mathcal{R}(\mathcal{D}_l), \label{eq:allloss}\tag{9}\] where \(\mathcal{L}_{ce}\) is the cross-entropy loss computed between the student’s classification probability \(\boldsymbol{p}^s\) and the ground-truth labels, as defined in Eq. 2. \(\mathcal{R} = \| \boldsymbol{x}^t - \varphi(\boldsymbol{x}^s)\|\) is a feature regularization inspired by the notion of feature matching in FitNets [11]. It is conceptually complementary to the SRD loss \(\mathcal{L}_{srd}\), as it acts directly in the representation space and potentially facilitates the convergence of our SRD loss. The two scaling parameters \(\alpha\) and \(\beta\) control the impact of the respective loss terms.
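Putting the pieces together, a hedged sketch of the labeled-set objective of Eq. 9 is shown below; the loss weights \(\alpha\) and \(\beta\) are placeholders, and the default \(\mathcal{L}_{srd}^{mse}\) variant is used.

```python
import torch.nn.functional as F

def srd_objective(z_s, y, z_t, z_hat, x_t, x_hat, alpha=1.0, beta=1.0):
    loss_ce = F.cross_entropy(z_s, y)          # supervised loss on the student (Eq. 2)
    loss_srd = F.mse_loss(z_hat, z_t)          # semantic representational distillation (Eq. 7)
    reg = (x_t - x_hat).norm(dim=1).mean()     # feature regularization R
    return loss_ce + alpha * loss_srd + beta * reg
```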
The aim of knowledge distillation is to transfer the learned knowledge from the teacher to the student. The standard setting is to exclusively use the labeled training set of a target domain to train the student, i.e., a supervised learning scenario. However, distillation methods [7], [11] typically require no ground-truth labels, presenting an unsupervised learning property. Therefore, using only the labeled training set of the target domain is unnecessarily restrictive, and extra unlabeled data can be readily incorporated for improved knowledge distillation. Technically, inspired by linear algebra, we consider that the appearance characteristics of unseen classes can be approximated by combinations of those of seen classes. In other words, all the seen classes used in our SRD (Eq. 5) can be viewed collectively as a basis of the semantic space, including unseen classes.
In light of the above considerations, we further explore the use of unlabeled data, which are often available at scale in many real-world situations. Typically, there is no guarantee that the unlabeled data contain only the seen/target classes and follow the same distribution as the labeled training set, i.e., they are unconstrained unlabeled data with unknown distributions and classes. This is open-set semi-supervised learning, an emerging problem that has received an increasing amount of attention recently [16], [46], [48], [50]. Under this interesting context of “Knowledge distillation meets open-set semi-supervised learning”, we investigate how knowledge distillation can benefit from unconstrained unlabeled data and previous open-set SSL algorithms, as well as how existing open-set SSL methods influence the knowledge distillation process.
Formally, in addition to the typical labeled training set \(\mathcal{D}_l\) with \(K\) known classes \(\mathcal{Y}\) as used above, we further exploit an unconstrained set \(\mathcal{D}_u\) of unlabeled samples not limited to the same label space \(\mathcal{Y}\). This represents a more realistic scenario, since unlabeled data are typically collected under little or even no constraint, including on the set of class labels considered. Our objective is to leverage \(\mathcal{D}_u\) for further enhancing the student network on top of \(\mathcal{D}_l\). To that end, we extend the objective function of Eq. 9 as: \[\label{eq:combine} \mathcal{L}_{l+u} = \mathcal{L}_{ce}(\mathcal{D}_{l})+ \alpha \mathcal{L}_{srd}(\mathcal{D}_{l}\cup\mathcal{D}_{u}) + \beta \mathcal{R}(\mathcal{D}_{l}\cup\mathcal{D}_{u}),\tag{10}\] where all the loss terms except the cross-entropy \(\mathcal{L}_{ce}\) are also applied to \(\mathcal{D}_u\). We summarize our SRD in Algorithm 2.
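A hedged sketch of one training step under Eq. 10 is given below: the cross-entropy term sees only the labeled mini-batch, while the SRD and regularization terms see the concatenation of labeled and unlabeled samples. The module interfaces (feature extractors returning last feature maps, linear classifiers) are assumptions carried over from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def open_set_srd_step(labeled_batch, unlabeled_images, teacher_f, teacher_h,
                      student_f, student_h, adaptor, alpha=1.0, beta=1.0):
    I_l, y = labeled_batch
    I_all = torch.cat([I_l, unlabeled_images], dim=0)   # D_l ∪ D_u mini-batch

    with torch.no_grad():                               # the teacher stays frozen
        x_t = teacher_f(I_all).mean(dim=(2, 3))
        z_t = teacher_h(x_t)
    F_s = student_f(I_all)                              # student's last feature map
    z_s = student_h(F_s.mean(dim=(2, 3)))
    x_hat = adaptor(F_s).mean(dim=(2, 3))               # adapted student representation
    z_hat = teacher_h(x_hat)                            # cross-network logits

    loss_ce = F.cross_entropy(z_s[: I_l.size(0)], y)    # labeled samples only
    loss_srd = F.mse_loss(z_hat, z_t)                   # applied to D_l ∪ D_u
    reg = (x_t - x_hat).norm(dim=1).mean()              # applied to D_l ∪ D_u
    return loss_ce + alpha * loss_srd + beta * reg
```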
In the open-set SSL literature, existing methods [16], [46], [48], [50] typically resort to an Out-Of-Distribution (OOD) strategy. The main idea is to identify and discard those samples not belonging to any seen class of the labeled training set (i.e., OOD samples). This is driven by a hypothesis that OOD samples are potentially harmful to SSL. We consider this to be overly restrictive, as it ignores useful knowledge shared across labeled and unlabeled classes, such as common parts and attributes. For example, flatfish and goldfish exhibit similar body parts such as fins and eyes. Our SRD and other distillation methods can elegantly overcome this limitation by leveraging a pretrained teacher model to extract such information for enhancing the training of a student model. Further, existing open-set SSL works usually consider a small amount of unlabeled data with high similarity to the labeled data (e.g., object-centric images sampled from the same source dataset). In this paper, we scale this setting by using unconstrained unlabeled data at larger scale and with lower similarity (e.g., scene images). Under such more realistic settings, we reveal new findings opposite to those reported in previous open-set SSL papers, and show that, as a strategy, knowledge distillation is often superior to and more reliable than OOD detection in exploiting unconstrained unlabeled data (Sec. 5).
Figure 2: Semantic Representational Distillation
We first ablate SRD on CIFAR-100 [52]. For training, we use SGD with a weight decay of 5e-4 and a momentum of 0.9. We set the batch size to 128 and the initial learning rate to 0.1, decayed by a factor of 0.1 at epochs 100 and 150, for a total of 200 epochs [31]. We adopt the standard data augmentation scheme [8] including random cropping (w/ 4-pixel padding) and horizontal flipping. By default, we utilize two variants of Wide ResNet [53], namely WRN40-4 and WRN16-4, as the teacher and student, unless specified otherwise.
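For reproducibility, the training recipe above can be expressed as the following PyTorch configuration sketch; the placeholder module stands in for the WRN16-4 student and is an assumption of this sketch.

```python
import torch
import torch.nn as nn

student = nn.Linear(10, 10)   # placeholder for the WRN16-4 student in this sketch
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)
for epoch in range(200):
    # ... one epoch of SRD training with batch size 128 goes here ...
    scheduler.step()
```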
SRD loss designs: We first evaluate the three designs of the SRD loss discussed in Sec. 3.1. Table 1 shows that all these designs are effective, with \(\mathcal{L}_{srd}^{mse}\) (Eq. 7) yielding the best results. Interestingly, \(\mathcal{L}_{srd}^{mse}\) even slightly surpasses the teacher’s performance. We hypothesize that this is because our SRD might impose a regularization effect (e.g., fusing the capacity of the student and teacher to some degree) during distillation. Overall, this validates the efficacy of our SRD formulation and loss design. In the following experiments, we hence use \(\mathcal{L}_{srd}^{mse}\) as the default design, unless stated otherwise.
Method | Top-1 (%) | Top-5 (%)
---|---|---
Supervised learning | 76.97 | 93.89
\(\mathcal{L}_{srd}^{kl}\) (Eq. 6) | 79.04 | 95.12
\(\mathcal{L}_{srd}^{mse}\) (Eq. 7) | 79.58 | 95.21
\(\mathcal{L}_{srd}^{pmse}\) (Eq. 8) | 79.13 | 94.88
Teacher | 79.50 | 94.57
Effect of loss components: We examine the impact of each loss component in Eq. 9 . As shown in Table 2, around \(1\%\) and \(2\%\) improvements in Top-1 accuracy can be obtained by the regularization and our distillation, respectively. So, our SRD loss is clearly more effective than \(\mathcal{R}\). Moreover, when combining them together, an additional \(0.48\%\) improvement is gained.
\(\mathcal{R}\) | \(\mathcal{L}_{srd}\) | Top-1 (%) | Top-5 (%)
---|---|---|---
 | | 76.97 | 93.89
✓ | | 78.05 | 94.45
 | ✓ | 79.10 | 94.99
✓ | ✓ | 79.58 | 95.21
Distillation effect: The objective of knowledge distillation is to encourage the student to mimic the prediction behaviour of the teacher. It is hence insightful to evaluate this mimicry quality. For comparative evaluation, we contrast SRD with logit-matching distillation [7]. For mimicry measurement, we adopt the KL divergence between the teacher’s and student’s predictions, as well as the \(L_2\) distance between the teacher’s and student’s representations. It is observed from Table [tab:ans95sim] that the mimicry ability of a student is positively correlated with its accuracy: the more closely a student mimics the teacher, the more accurate its results. Further, we examine the feature representation distribution qualitatively. It is evident in Fig. 3 that our SRD learns more discriminative features, consistent with the above numerical measurements. Besides, the Top-1 accuracy obtained with the pretrained teacher’s classifier confirms that the representation learned by SRD is closer to the teacher’s features.
We further examine the distribution of prediction confidence across true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). As shown in Fig. 4, the peak confidence levels for TP and TN correspond to correct predictions, whereas those for FP and FN are associated with incorrect predictions. Compared to KD, SRD shows higher confidence for correct predictions and lower confidence for incorrect predictions, which is a favorable property.
\begin{table}[h]
\centering
\caption{
Evaluating the distillation effect (\ie, mimicry quality) on CIFAR-100. Metrics: KL divergence between the student's and teacher's predictions, $L_2$ distance between the student's and teacher's representations, Top-1 accuracy with teacher's classifier, and Top-1 accuracy with individual classifier.}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c|c}
\hline
Method & \tabincell{c}{KL div.}& \tabincell{c}{$L_2$ dis.}
&\tabincell{c}{with $\mathbf{W}^t$}
&\tabincell{c}{Top-1 (\%)}\\
\hline
\em Supervised learning &0.5964 &1.48 &0.91 &76.97 \\ \hline
KD~\cite{hinton2015distilling} & 0.5818 &1.21 &1.15& 78.35\\
\hline
\bf \shortname &\textbf{0.4597}&\textbf{1.01}&\textbf{79.11}&\textbf{79.58}\\
\hline
\end{tabular}}
\label{tab:ans95sim}
\end{table}
Figure 3: Feature distribution visualization of 10 classes on CIFAR-100. Classes are color coded. Better viewed in color.
Figure 4: Classifier confidence distribution over TP, FP, TN, and FN for SRD and KD on the CIFAR-100 test set.
Complementarity with logit-matching distillation: We further test the complementarity of our SRD with logit-matching distillation [7]. We optimize their combination weight by standard grid search. For extensive evaluation, we experiment with a diverse set of teacher/student pairs using ResNets [54], WRNs [55], and MobileNetV2 [56]. Table [tab:sup95c10095c] shows that the two distillation losses are compatible and their combination often leads to further performance gains.
\begin{table}[ht]
\centering
\caption{Complementary with logit-matching knowledge distillation \cite{hinton2015distilling} on CIFAR-100.
Metric: Top-1 accuracy ($\%$).
Surp. learn.: Supervised learning.
}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c|c|c|c}
\hline
Teacher &WRN40-4&WRN40-4&ResNet34&ResNet50&ResNet34&WRN40-4\\
(Params) &8.97M&8.97M&1.39M&1.99M&21.33M&8.97M\\
\hline
Student &WRN16-2&WRN16-4&ResNet10&ResNet18&WRN16-2&MobileNetV2\\
(Params) &0.70M &2.77M&0.34M&0.75M&0.70M&2.37M\\
\hline
\em Surp. learn. &72.70&76.97&68.42&71.07&72.70&68.42\\
\hline
\bf \shortname &75.96&79.58&69.91&73.47&75.38&71.82\\
{\bf \shortname}+KD\cite{hinton2015distilling} &\bf{76.05}&\bf{79.63}&\bf{70.41}&\bf{73.53}&\bf{75.46}&\bf{72.06}\\
\hline
Teacher &79.50&79.50&72.05&73.31&78.44&79.50\\
\hline
\end{tabular}}
\label{tab:sup95c10095c}
\end{table}
Evaluating architectural generality: Beyond CNN backbones, we further adopt the recent Vision Transformers (ViTs) [57] to evaluate the architectural generality of our SRD. We experiment with two teacher-student pairs: ViT-Base as the teacher and ViT-Tiny as the student, and ResNet50 as the teacher and ViT-Tiny as the student (cross-architecture). We use the popular code repository1. In Table [tab:vit] we observe that our SRD is again superior to the top alternatives, consistent with the case of pure CNN backbones. This suggests our method is architecture-agnostic.
\begin{table}[ht]
\centering
\caption{Architectural generality evaluation on CIFAR-100. Metric: Top-1 accuracy ($\%$).
}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{c|c|c}
\hline
Teacher & ViT\_Base (85.55M) &ResNet50(23.73M)\\
Student & ViT\_Tiny (5.54M) &ViT\_Tiny (5.54M)\\\hline
\em Supervised learning& 80.43 & 80.43\\\hline
KD~\cite{hinton2015distilling}&82.11&81.40\\
CRD~\cite{tian2019contrastive}&83.62&82.28\\
\bf \shortname& \bf{85.34} &\bf{83.37}\\\hline
Teacher & 93.86 &85.89\\\hline
\end{tabular}}
\label{tab:vit}
\end{table}
For extensive evaluation, we consider multiple mainstream network architectures including ResNets [54], Wide ResNets [55], MobileNetV2 [56], and MobileNet [58] with different learning capacities.
We compare our SRD with five state-of-the-art knowledge distillation methods: KD [7], AT [8], OFD [31], RKD [28], and CRD [12].
Beyond the common image classification problem, we further consider two practically critical applications that are less investigated in distillation: fine-grained face recognition (Sec. 4.2.4) and binary network optimization (Sec. 4.2.5).
Setting: CIFAR-10 is a popular image classification dataset consisting of 50,000 training and 10,000 test images evenly distributed across 10 object classes. All the images have a resolution of \(32\times 32\) pixels. Following [8], during training we randomly crop and horizontally flip each image. We train the ResNets for 350 epochs using SGD. We set the initial learning rate to 0.1, gradually reduced by a factor of 10 at epochs 150, 250 and 320. Similarly, we train the WRN models for 200 epochs with an initial learning rate of 0.1, decayed by a factor of 5 at epochs 60, 120 and 160. We set the dropout rate to 0. For logit-matching KD [7], we set \(\alpha=0.9\) and \(T=4\). For AT [8], as in [8], [23], we set the weight of the distillation loss to 1000. Note that the AT loss is added after each layer group for WRN and after the last two groups for ResNet, following [8]. Following OFD [31], we set the weight of its distillation loss to \(10^{-3}\). For RKD [28], we set \(\beta_{1} = 25\) for distance and \(\beta_{2} = 50\) for angle, as suggested in [12], [28]. We exclude CRD [12] here because, in our experiments, we found that the parameters originally proposed for CIFAR-100 and ImageNet-1K do not work well for CIFAR-10. We evaluate three types of teacher/student pairs: (1) two pairs using WRNs; (2) three pairs using ResNets; (3) one pair across WRN and ResNet.
\begin{table}[ht]
\centering
\caption{Evaluating knowledge distillation methods on CIFAR-10.
Metric: Top-1 accuracy ($\%$).
The parameter size of each network is given in the round bracket.
Surp. learn.: Supervised learning.
}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c|c|c|c}
\hline
Teacher &WRN16-2&WRN40-2&ResNet26&ResNet26&ResNet34&ResNet26\\
(Params)&(0.69M) &(2.2M) &(0.37M) &(0.37M) &(1.4M) &(0.37M)\\
\hline
Student &WRN16-1 &WRN16-2 &ResNet8&ResNet14&ResNet18&WRN16-1\\
(Params) &(0.18M) &(0.69M) &(0.08M) &(0.17M) &(0.7M) &(0.18M)\\
\hline
\em Surp. learn. &91.04 &93.98&87.78&91.59&93.35&91.04\\
\hline
KD~\cite{hinton2015distilling}&92.57&94.46&88.75&92.57&93.74&92.42\\
AT~\cite{zagoruyko2016paying}&92.15&94.39&88.15&92.11&93.52&91.32\\
OFD~\cite{heo2019comprehensive}&92.28&94.30&87.49&92.51&93.80&92.47\\
RKD~\cite{park2019relational}&92.51&94.41&88.50&92.36&92.95&92.08\\
\bf \shortname&\bf{92.95}&\bf{94.66}&\bf{89.02}&\bf{92.70}&\bf{93.92}&\bf{92.94}\\
\hline
Teacher &93.98 &95.07 &93.58 &93.58 &94.11 &93.58\\
\hline
\end{tabular}}
\label{tab:sup95c10}
\vspace{-4mm}
\end{table}
Results: Top-1 classification results on CIFAR-10 are compared in Table [tab:sup95c10]. On this largely saturated dataset, it is evident that our SRD still yields clear gains over all the alternatives in all settings, suggesting a generic and stable superiority. Besides, we find that logit-matching KD performs second only to SRD, surpassing all the other distillation variants.
Setting: We use the same setting as in the ablation study (Sec. 4.1). Similarly, we experiment with three sets of teacher/student network pairs. The first set is constructed using WRNs: strong teacher/weak student (WRN40-4/WRN16-2) and strong teacher/fair student (WRN40-4/WRN16-4). The second set repeats similar pairings using ResNets: (ResNet34/ResNet10) and (ResNet50/ResNet18). The third set combines different architectural families: (ResNet34/WRN16-2) and (WRN40-4/MobileNetV2).
\begin{table}[ht]
\centering
\caption{Evaluating knowledge distillation methods on CIFAR-100.
Metric: Top-1 accuracy ($\%$).
Surp. learn.: Supervised learning.
}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c|c|c|c}
\hline
Teacher &WRN40-4&WRN40-4&ResNet34&ResNet50&ResNet34&WRN40-4\\
(Params) &(8.97M)&(8.97M)&(1.39M)&(1.99M)&(21.33M)&(8.97M)\\
\hline
Student &WRN16-2&WRN16-4&ResNet10&ResNet18&WRN16-2&MobileNetV2\\
(Params) &(0.70M) &(2.77M)&(0.34M)&(0.75M)&(0.70M)&(2.37M)\\
\hline
\em Surp. learn. &72.70&76.97&68.42&71.07&72.70&68.42\\
\hline
KD~\cite{hinton2015distilling} &74.52&78.35&69.18&73.41&73.95&69.15\\
AT~\cite{zagoruyko2016paying} &74.33&78.06&68.49&71.90&72.32&68.95\\
OFD~\cite{heo2019comprehensive}&75.57&79.29&68.94&72.79&74.78&70.08\\
RKD~\cite{park2019relational} &74.23&78.38&68.70&70.93&73.91&68.19\\
CRD~\cite{tian2019contrastive} &75.27&78.83&\bf{70.24}&73.23&74.88&71.46\\
\bf \shortname &\bf{75.96}&\bf{79.58}&69.91&\bf{73.47}&\bf{75.38}&\bf{71.82}\\
\hline
Teacher &79.50&79.50&72.05&73.31&78.44&79.50\\
\hline
\end{tabular}}
\label{tab:sup95c100}
\vspace{-4mm}
\end{table}
Results: We report the Top-1 performance on CIFAR-100 in Table [tab:sup95c100]. On this more challenging test, we observe that for almost all teacher/student configurations, our SRD achieves consistent and significant accuracy gains over prior methods. Further, there is no clear second best: for the WRN pairs OFD ranks second, whilst for the other cases CRD is second. This suggests that our SRD is also more generalizable across different distillation configurations.
Setting: For larger scale evaluation, we test the standard ImageNet-1K benchmark. We crop the images to a resolution of \(224\times 224\) pixels for both training and test. We use SGD with Nesterov momentum set to 0.9, weight decay to \(10^{-4}\), initial learning rate to 0.2 which decays by a factor of 10 every 30 epochs. We set the batch size to 512. We train a total of 100 epochs for all methods except CRD [12] which uses 10 more epochs following the authors’ suggestion. For simplicity, we use pretrained PyTorch models [59] as the teacher [12], [31]. We adopt two common teacher/student pairs: ResNet34/ResNet18 and ResNet50/MobileNet [58]. When testing logit-matching KD [7], we set the weight of KL loss and cross-entropy loss to 0.9 and 0.5, which yields better accuracy as found in [12].
Results: We report the ImageNet classification results in Table [tab:imagenet]. Again, we observe that our SRD outperforms all the competitors by a clear margin in all cases, suggesting that the advantage of SRD is scalable. Specifically, for the ResNet34/ResNet18 pair, RKD reaches the second best Top-1 accuracy, whilst for the ResNet50/MobileNet pair, CRD is second. This further suggests that previous methods are less stable than SRD across network selections. Critically, SRD favors MobileNet with a large margin of \(1.1\%\) in absolute terms over the best alternative, CRD. Considering that MobileNet has been widely deployed on many devices and mobile platforms, this performance gain could be particularly promising and valuable in practice.
\begin{table}[!htbp]
\centering
\caption{Evaluating knowledge distillation methods on ImageNet-1K.
Metric: Top-1 and Top-5 accuracy ($\%$).
The parameter size of each network is given in the round bracket.
Surp. learn.: Supervised learning.
}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c|c}
\hline
Teacher &\multicolumn{2}{c}{ResNet34 (21.80M)} &\multicolumn{2}{c}{ResNet50 (25.56M)}\\
Student &\multicolumn{2}{c}{ResNet18(11.69M)}&\multicolumn{2}{c}{MobileNet(4.23M)}\\
\hline
&Top-1 (\%)&Top-5 (\%)&Top-1 (\%)&Top-5 (\%)\\
\hline
\em Surp. learn. &70.04 &89.48 &70.13&89.49\\
\hline
KD~\cite{hinton2015distilling} &70.68 &90.16 &70.68 &90.30\\
AT~\cite{zagoruyko2016paying} &70.59 &89.73 &70.72 &90.03\\
OFD~\cite{heo2019comprehensive}&71.08 &90.07 &71.25 &90.34\\
RKD~\cite{park2019relational} &71.34 &90.37 &71.32 &90.62\\
CRD~\cite{tian2019contrastive} &71.17 &90.13 &71.40 &90.42\\
\bf \shortname &\bf{71.73} &\bf{90.60} &\bf{72.49}&\bf{90.92}\\
\hline
Teacher &73.31 &91.42 &76.16&92.86\\
\hline
\end{tabular}}
\label{tab:imagenet}
\vspace{-4mm}
\end{table}
Beyond coarse object classification, we also consider the fine-grained face recognition task, which requires distilling more detailed representations specific to individual object instances.
For training, we use the MS1MV2 dataset [60], a refined version of MS-Celeb-1M [61], [62]. For testing, we use the refined MegaFace [61], [63], which includes one million distractors. We adopt MegaFace Challenge 1, using FaceScrub as the probe set.
We consider two face recognition tasks. (1) Face verification: Given a pair of face images, the objective is to determine whether they depict the same person’s identity. This is accomplished by calculating a pairwise similarity (e.g., cosine similarity or negative Euclidean distance) in the feature space and comparing it against a threshold. As the performance metric, we adopt the True Acceptance Rate (TAR) at a False Acceptance Rate (FAR) of \(10^{-6}\) [60]. The decision threshold is determined by the FAR. (2) Face identification: Given a query face image, the objective is to identify the images with the same person identity in a gallery. This is often treated as a retrieval problem by ranking the gallery images according to their pairwise similarity scores w.r.t. the query image. To evaluate the performance, we use the rank-1 accuracy [60].
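For clarity, the two evaluation protocols can be sketched as below; this is a simplified illustration (cosine similarity on L2-normalized features, threshold set from the impostor-score quantile) rather than the official MegaFace evaluation code.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-6):
    """Verification: TAR at a fixed FAR; threshold is the (1 - FAR) quantile of impostor scores."""
    thr = np.quantile(impostor_scores, 1.0 - far)
    return float((genuine_scores > thr).mean())

def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Identification: rank-1 retrieval accuracy with cosine similarity."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                                   # pairwise cosine similarities
    top1 = gallery_ids[np.argmax(sims, axis=1)]
    return float((top1 == query_ids).mean())
```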
Competitors: For comparative evaluation, we consider both logit-based (KD [7]) and feature-based (AT [8], RKD [28] and PKT [64]) distillation methods. Note that here we exclude CRD [12] due to its optimization difficulty with the margin-based softmax loss, and OFD [31] due to the difficulty of finding a well-performing distillation layer.
Setting: For model training, we use SGD as the optimizer and set the momentum to 0.9, and weight decay to \(5e-4\). We set the batch size to 512. We set the initial learning rate to 0.1, and decay it by 0.1 at 100K, and 160K iterations. We train each model for a total of 180K iterations. We adopt ResNet101 [54] as the teacher with 65.12M parameters, and MobileFaceNet [65] as the student with 1.2M parameters.
Results: We present the face recognition results in Table 3. We make several key observations. (1) Our SRD achieves the best distillation performance in comparison to all the competing methods, suggesting its superior ability to distil fine-grained representational knowledge. (2) For face verification, only SRD and AT gain an advantage from distillation, with SRD outperforming all its rivals. Several reasons explain this observation. First, there is a pronounced discrepancy in performance levels between the teacher and student models, which adds difficulty to distillation [66]. Second, while existing methods focus on guiding the student’s learning with specifically designed knowledge at either the intermediate or classification stages, SRD uniquely distils knowledge directly from the final representation, which is itself used for face verification. Third, previous methods often ignore the class prototypes within the teacher’s classifier, which capture the discriminative information for each identity. In contrast, our strategy makes full use of the teacher’s classifier by feeding the student’s features into it. (3) Logit-matching KD [7] fails on both tasks. A plausible reason is the incompatibility between softened logit matching and the margin-based softmax loss. By leveraging the identity prototypes learned in the teacher’s classifier to constrain the learning of the student’s representation, SRD manages to successfully distil subtle yet useful knowledge.
Method | Verification (%) | Identification (%) |
---|---|---|
Supervised learning | 93.44 | 92.28 |
KD [7] | 92.86 | 90.91 |
AT [8] | 93.55 | 92.46 |
RKD [28] | 93.37 | 92.34 |
PKT [64] | 93.25 | 92.38 |
SRD | 94.17 | 93.26 |
Teacher | 98.56 | 98.82 |
In all the above experiments, both student and teacher are networks parameterized with real-valued precision (real networks). However, these are often less affordable in low-resource regimes (e.g., mobile devices). One promising approach is to deploy neural networks with binary-valued parameters (i.e., binary networks, the most extreme case of network quantization), as they are not only smaller in size but also run faster and more efficiently [3], [67]. However, training accurate binary networks from scratch is highly challenging, and the use of distillation has been shown to be a key component [68]. In light of these observations, we investigate the largely ignored yet practically critical binary network distillation problem. The objective is to distil knowledge from a real-valued teacher to a binary student.
Datasets: In this evaluation, we use CIFAR-100 and ImageNet-1K following the standard setup as above.
Competitors: As with face recognition, we compare with both logit-based (KD [7]) and feature-based (AT [8], RKD [28], OFD [31], PKT [64], and CRD [12]) distillation methods.
Setting: For training, we use Adam as the optimizer. For CIFAR-100, we set the initial learning rate to 0.001, decayed by a factor of 0.1 at epochs \(\{150,250,320\}\), and train for a total of 350 epochs. For ImageNet-1K, we use an initial learning rate of 0.002, decayed by a factor of 0.1 at epochs \(\{30,60,90\}\), for a total of 100 epochs. For the teacher and student, we use the same ResNet architecture with the modifications introduced in [67].
Results: We report the binary network distillation results in Table 4. It is evident that on both datasets SRD consistently outperforms all the alternatives by a clear margin. On the other hand, OFD [31] performs worst, probably due to its problem-specific and less generalizable distillation position. Overall, these results validate that the performance advantage of SRD extends well from real networks to binary networks, a rarely investigated but practically significant application scenario.
Datasets | CIFAR-100 | ImageNet-1K |
---|---|---|
Teacher | ResNet34(Real) | ResNet18(Real) |
Student | ResNet34(Binary) | ResNet18(Binary) |
Surp. learn. | 65.34 | 56.70 |
KD [7] | 68.65 | 57.39 |
AT [8] | 68.54 | 58.45 |
OFD [31] | 66.84 | 55.74 |
RKD [28] | 68.61 | 58.84 |
CRD [12] | 68.78 | 58.25 |
SRD | 70.50 | 59.57 |
Teacher | 75.08 | 70.20 |
In this evaluation, we use CIFAR-100 [52] as the target dataset, including a labeled training set and a test set. We follow the standard training-test split. To simulate a realistic open-set SSL setting, we use Tiny-ImageNet [69] as unlabeled training data. As a subset of ImageNet-1K [70], this dataset consists of 200 classes, each with 500, 50 and 50 images for training, validation and test, respectively. We only use its training set, consisting of 100,000 images. The two datasets share a very small proportion of classes.
To facilitate fair comparative evaluation, we adopt the same setup of [12] including the training configuration as given in its open source code2. For all compared methods below, the same initialization, training and test data are applied under the same training setup. We apply the same data augmentation for all the labeled and unlabeled training data as in Sec. 4.2.2. We resize all the images to \(32 \times 32\) before data augmentations.
We consider a diverse set of three (teacher, student) network pairs for different distillation methods: (ResNet32\(\times\)4, ResNet8\(\times\)4), (WRN40-2, WRN40-1), and (ResNet32x4, ShuffleNetV1). For statistical stability, for each experiment we run 5 trials and report the average result.
We compare extensively with a total of 12 state-of-the-art distillation methods: KD [7], FitNet [11], AT [8], SP [23], CC [26], VID [71], RKD [28], PKT [64], AB [30], FT [24], NSP [18], and CRD [12]. For all these methods except CRD [12], unlabeled data can be directly accommodated without design adaptation; CRD [12] instead requires class labels. To solve this issue, we extend CRD with a pseudo-labeling strategy, as sketched below: we obtain a pseudo label for every unlabeled sample in a maximum likelihood manner using the teacher network pretrained on the labeled set, and then treat the pseudo labels as ground truth when training CRD.
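A hedged sketch of this pseudo-labeling step is shown below; the teacher interface (a feature extractor returning last feature maps plus a linear classifier) follows the earlier sketches and is an assumption.

```python
import torch

@torch.no_grad()
def pseudo_label(teacher_f, teacher_h, unlabeled_loader, device="cuda"):
    """Assign each unlabeled sample the teacher's maximum-likelihood class prediction."""
    labels = []
    for images in unlabeled_loader:
        x = teacher_f(images.to(device)).mean(dim=(2, 3))
        z = teacher_h(x)
        labels.append(z.argmax(dim=1).cpu())   # maximum-likelihood pseudo label
    return torch.cat(labels)
```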
Teacher | ResNet\(32\times4\) | (7.43M) | WRN40-2 | (2.25M) | ResNet\(32\times4\) | (7.43M) |
Student | ResNet\(8\times4\) | (1.23M) | WRN40-1 | (0.57M) | ShuffleNetV1 | (0.94M) |
Training data | \(\mathcal{D}\) | \(\mathcal{D+U}\) | \(\mathcal{D}\) | \(\mathcal{D+U}\) | \(\mathcal{D}\) | \(\mathcal{D+U}\) |
Surp. learn. | 72.50 | - | 71.98 | - | 70.59 | - |
KD [7] | 73.33\(\pm\)0.25 | 74.68\(\pm\)0.05\(\uparrow\) | 73.54\(\pm\)0.20 | 75.08\(\pm\)0.25\(\uparrow\) | 74.07\(\pm\)0.19 | 76.52\(\pm\)0.03\(\uparrow\) |
FitNet [11] | 73.50\(\pm\)0.28 | 73.30\(\pm\)0.14\(\downarrow\) | 72.24\(\pm\)0.24 | 71.43\(\pm\)0.17\(\downarrow\) | 73.59\(\pm\)0.15 | 72.83\(\pm\)0.13\(\downarrow\) |
AT [8] | 73.44\(\pm\)0.19 | 71.75\(\pm\)0.11\(\downarrow\) | 72.77\(\pm\)0.10 | 73.11\(\pm\)0.19\(\uparrow\) | 71.73\(\pm\)0.31 | 72.82\(\pm\)0.24\(\uparrow\) |
SP [23] | 72.94\(\pm\)0.23 | 72.10\(\pm\)0.21\(\downarrow\) | 72.43\(\pm\)0.27 | 73.02\(\pm\)0.13\(\uparrow\) | 73.48\(\pm\)0.42 | 76.01\(\pm\)0.15\(\uparrow\) |
CC [26] | 72.97\(\pm\)0.17 | 71.96\(\pm\)0.11\(\downarrow\) | 72.21\(\pm\)0.25 | 70.64\(\pm\)0.12\(\downarrow\) | 71.14\(\pm\)0.06 | 70.85\(\pm\)0.12\(\downarrow\) |
VID [71] | 73.09\(\pm\)0.21 | 73.48\(\pm\)0.23\(\uparrow\) | 73.30\(\pm\)0.13 | 73.06\(\pm\)0.17\(\downarrow\) | 73.38\(\pm\)0.09 | 75.80\(\pm\)0.12\(\uparrow\) |
RKD [28] | 71.90\(\pm\)0.11 | 72.50\(\pm\)0.23\(\uparrow\) | 72.22\(\pm\)0.20 | 72.99\(\pm\)0.17\(\uparrow\) | 72.28\(\pm\)0.39 | 73.38\(\pm\)0.09\(\uparrow\) |
PKT [64] | 73.64\(\pm\)0.18 | 75.24\(\pm\)0.15\(\uparrow\) | 73.45\(\pm\)0.19 | 74.29\(\pm\)0.18\(\uparrow\) | 74.10\(\pm\)0.25 | 76.50\(\pm\)0.12\(\uparrow\) |
AB [30] | 73.17\(\pm\)0.31 | 72.34\(\pm\)0.09\(\downarrow\) | 72.38\(\pm\)0.31 | 72.88\(\pm\)0.18\(\uparrow\) | 73.55\(\pm\)0.31 | 73.33\(\pm\)0.12\(\downarrow\) |
FT [24] | 72.86\(\pm\)0.12 | 71.57\(\pm\)0.11\(\downarrow\) | 71.59\(\pm\)0.15 | 71.47\(\pm\)0.23\(\downarrow\) | 71.75\(\pm\)0.20 | 72.81\(\pm\)0.21\(\uparrow\) |
NSP [18] | 73.30\(\pm\)0.28 | 72.06\(\pm\)0.20\(\downarrow\) | 72.24\(\pm\)0.22 | 72.11\(\pm\)0.15\(\downarrow\) | 74.12\(\pm\)0.19 | N/A |
CRD [12] | 75.51\(\pm\)0.18 | 73.84\(\pm\)0.14\(\downarrow\) | 74.14\(\pm\)0.22 | 73.54\(\pm\)0.19\(\downarrow\) | 75.11\(\pm\)0.32 | 76.70\(\pm\)0.16\(\uparrow\) |
SRD | 75.92\(\pm\)0.19 | 76.24\(\pm\)0.02\(\uparrow\) | 74.75\(\pm\)0.20 | 75.32\(\pm\)0.06\(\uparrow\) | 76.40\(\pm\)0.13 | 77.40\(\pm\)0.17\(\uparrow\) |
Teacher | 79.42 | - | 75.61 | - | 79.42 | - |
We report the results in Table 5 and make the following observations: (1) Whilst KD [7], RKD [28], PKT [64], and our SRD consistently improve from using unlabeled training data, FitNet [11] and CC [26] degrade across all the different networks. This implies that intermediate feature matching [11] and inter-instance correlation [26] are less robust to unconstrained data, and further verifies that using high-level feature alignment, as in SRD, is more reliable. (2) Besides, AT [8], SP [23], AB [30], FT [24], VID [71], and CRD [12] do not necessarily improve, depending on the network pair. In particular, heterogeneous architectures are preferred by CRD and FT, whilst the others present no clear trend. (3) Our SRD yields consistently the best results, with a clear boost from open-set unlabeled data across all the cases. This suggests the overall performance advantage and network robustness of our model design, despite its simplicity. (4) Interestingly, logit-matching KD [7] benefits more from unlabeled data than the other competitors. However, its overall performance is inferior to ours. In a nutshell, this test shows that not all distillation methods can easily benefit from using open-set unlabeled data, and some are even conditioned on the network choice.
Self-supervised learning is an orthogonal dimension for improving knowledge distillation. For instance, the recent SSKD method [51] integrates self-supervised learning (e.g., instance discrimination by contrastive learning [13]) into distillation, which is conceptually complementary to our SRD. We evaluate this combination in our open-set SSL setting with Tiny-ImageNet as the unlabeled data. From Table [tab:sskd], we observe that both SSKD and our SRD are effective in isolation and, importantly, they can be integrated for further improvement.
\begin{table}[h]
\centering
\caption{Integration with self-supervised learning. Labeled set: CIFAR-100; Unlabeled set: Tiny-ImageNet. Metric: Top-1 accuracy (\%).}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c}
\hline
Network& ResNet8$\times$4 & WRN40-1 & ShuffleNetV1\\
\hline\hline
\em Supervised learning &72.50 &71.98 &70.50\\ \hline
SSKD~\cite{SSKD} &75.15&74.64&78.36\\
\bf{SRD} &76.24&75.32&77.40\\
SSKD+\bf{SRD} &\bf{76.38}&\bf{75.61}&\bf{78.74}\\\hline
Teacher&79.42&75.61&79.42\\
\hline
\end{tabular}}
\label{tab:sskd}
\vspace{-4mm}
\end{table}
In addition to distillation methods, we further compare our SRD with existing SSL methods capable of leveraging unlabeled data. To enable a comparison between the two different approaches, we adopt the same training and test setting as in Sec. 4.2.2. In particular, we adopt the same three student networks as the target models: ResNet8\(\times\)4, WRN40-1 and ShuffleNetV1. For SRD, we apply the same three (teacher, student) pairs (see top of Table 5). Following the convention of SSL, we report the average result of the last 20 epochs. Our experimental design uses CIFAR-100 as the source of labeled data, while the unlabeled sets include classes not found in CIFAR-100, sourced from Tiny-ImageNet [69], Places365 [72], and CC3M [73]. Drawing unlabeled data from three distinct sources introduces significant variability in relevance and complexity: Tiny-ImageNet shares some similarities with CIFAR-100, making it somewhat related; Places365, with its focus on 365 different scene categories, presents a stark contrast to the object-centric images of Tiny-ImageNet and CIFAR-100; CC3M, an even less refined dataset, is amassed without stringent filtering or manual labeling. To characterize the unlabeled datasets, we use a pretrained ResNet32\(\times\)4 classifier to label each sample; any sample with a confidence score below 0.9 is deemed OOD. Through this process, we find that 73% of Tiny-ImageNet (out of 100,000 samples), 70% of Places365 (out of 1,803,460 samples), and 78% of CC3M (out of 2,313,472 samples) fall into the OOD category. When incorporating these datasets as unlabeled data alongside CIFAR-100, the proportion of OOD data ranges from 48% for Tiny-ImageNet and 68% for Places365 to 76% for CC3M.
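For concreteness, the OOD labelling procedure above can be sketched as follows. This is a minimal illustration assuming a PyTorch teacher model and data loader; the function and variable names are our own and do not come from the released code.
\begin{verbatim}
# Hedged sketch: estimating the OOD proportion of an unlabeled set with a
# pretrained CIFAR-100 teacher (e.g., ResNet32x4). A sample is treated as
# OOD when the teacher's maximum softmax confidence falls below 0.9.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_ratio(teacher, unlabeled_loader, thresh=0.9, device='cuda'):
    teacher.eval()
    n_ood, n_total = 0, 0
    for images, _ in unlabeled_loader:   # any labels are ignored
        probs = F.softmax(teacher(images.to(device)), dim=1)
        conf = probs.max(dim=1).values   # max softmax confidence per sample
        n_ood += (conf < thresh).sum().item()
        n_total += images.size(0)
    return n_ood / max(n_total, 1)
\end{verbatim}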
We evaluate four representative closed-set (PseudoLabel [37], MeanTeacher [38], MixMatch [39] and FixMatch [36]) and three state-of-the-art open-set (MTCR [46], T2T [48] and OpenMatch [50]) SSL methods. For each method, we adopt its well-tuned default setting provided in the respective source codes. We also include the supervised learning baseline without using any unlabeled data.
\begin{table}[h]
\caption{Comparing state-of-the-art closed-set and open-set semi-supervised learning methods.
{\em Labeled set}: CIFAR-100;
{\em Unlabeled set}: Tiny-ImageNet.
{\em Metric}: Best Top-1 accuracy (\%).
}
\centering
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c}
\hline
Network & ResNet8$\times$4 &WRN40-1 &ShuffleNetV1\\
\hline \hline
\em Supervised learning &72.50 &71.98 &70.50\\
\hline
PseudoLabel~\cite{lee2013pseudo}&33.33&54.48&42.47\\
MeanTeacher~\cite{tarvainen2017mean}&70.18&70.25&65.60\\
FixMatch~\cite{sohn2020fixmatch}&68.85&65.54&61.27\\
MixMatch~\cite{berthelot2019mixmatch} &\underline{74.27} &\underline{74.01} &\underline{75.79}\\
\hline
MTCR~\cite{yu2020multi}&65.42&61.84&42.34\\
T2T~\cite{T2T} &66.05&62.55 &57.53\\
OpenMatch~\cite{openmatch}&70.41&69.08&66.88\\
\hline
{\bf \shortname} &\bf{76.24}&\bf{75.32}&\bf{77.40}\\
\hline
\end{tabular}}
\label{tab:ssl95tin}
\vspace{-4mm}
\end{table}
The results are compared in Table [tab:ssl95tin]. We draw several interesting observations. (1) With its strong closed-set label space assumption, pseudo labeling [37] gives, as expected, the poorest performance, with significant degradation from the supervised baseline on all three networks. This is because assigning a label to every unlabeled sample in an unconstrained open-set setting is highly error-prone. In practice, we find that \(\sim90\%\) of unlabeled data were wrongly assigned to a single class “crocodile” during training and \(\sim50\%\) of test samples were mistakenly classified as “crocodile”. (2) Relying on less rigid consistency regularization, MeanTeacher [38] yields better performance but still suffers an accuracy drop compared to using only labeled training data. This reveals the limitation of the conventional consistency loss in tackling open-set unlabeled data. (3) Combining consistency regularization and pseudo-labeling, the recent closed-set SSL method FixMatch [36] performs at a level between [38] and [37]. Note that this differs markedly from its original closed-set SSL results, revealing a previously unseen limitation of this hybrid strategy. (4) Among all the closed-set SSL methods, MixMatch [39] is the only one achieving an accuracy gain from unconstrained unlabeled data. This new finding indicates that multi-augmentation pooling and sharpening based consistency turns out to be more generalizable to unconstrained open-set SSL. (5) Surprisingly, all open-set SSL methods (MTCR [46], T2T [48] and OpenMatch [50]) fail to improve over the supervised learning baseline. This contradicts the findings reported under their simpler and less realistic settings with fewer target classes and unlabeled samples but higher class overlap between labeled and unlabeled sets. This presents a failure case for previous open-set SSL methods and calls for more extensive investigation under more challenging and realistic scenarios with abundant unlabeled data. (6) Our SRD consistently improves the performance of all three networks and significantly surpasses the best competitor, MixMatch [39]. This validates a clear advantage of the proposed semantic representational distillation over previous SSL methods in handling open-set unlabeled data. Conceptually, SRD can also be viewed as semantic consistency regularization derived from a pretrained teacher; our design shares the general spirit of consistency regularization whilst differing in formulation from typical data augmentation based consistency [36], [39]. (7) At the approach level, distillation methods (Table 5) are generally superior to both closed-set and open-set SSL methods (Table [tab:ssl95tin]). This implies that using a pretrained teacher model as semantic guidance could be more advantageous than the OOD detection of open-set SSL methods in capitalizing on unconstrained unlabeled data. In fact, detecting and discarding OOD samples even turns out to be more harmful than simply ignoring the OOD issue, as closed-set SSL methods do.
Integration with semi-supervised learning: Knowledge Distillation (KD) and open-set Semi-Supervised Learning (SSL) represent two different paradigms, with distinct pipeline designs and research objectives, yet an integrated evaluation may cast additional insight. Specifically, we conduct an experiment integrating our SRD (the knowledge distillation component) with the best performing SSL method, MixMatch. MixMatch generates low-entropy (i.e., high-confidence) labels for augmented versions of unlabeled data and then mixes labeled and unlabeled data via MixUp. We extend this pipeline by applying SRD to both labeled and unlabeled data under the guidance of the teacher network (a sketch of the combined objective is given after the table below). As shown in Table [tab:mixmatch], we observe that both MixMatch and SRD are effective in isolation and, importantly, their combination achieves further improvement.
\begin{table}[h]
\centering
\caption{Integration with semi-supervised learning. Labeled set: CIFAR-100; Unlabeled set: Tiny-ImageNet. Metric: Top-1 accuracy (\%).}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c}
\hline
Network& ResNet8$\times$4 & WRN40-1 & ShuffleNetV1\\
\hline\hline
\em Supervised learning &72.50 &71.98 &70.50\\ \hline
\bf{SRD} &76.24&75.32 &77.40\\
MixMatch &74.27 &74.01 &75.79\\
MixMatch+\bf{SRD} &\bf{77.13}&\bf{76.06} &\bf{78.25}\\\hline
Teacher&79.42&75.61&79.42\\
\hline
\end{tabular}
\label{tab:mixmatch}
}
\vspace{-4mm}
\end{table}
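To make this integration concrete, the sketch below shows one plausible way of adding an SRD-style term on top of a MixMatch objective in PyTorch. The function names and the weighting factor are purely illustrative, the KL matching over temperature-softened cross-network logits is only a stand-in for the actual SRD objective defined earlier in the paper, and the student representation is assumed to be already projected to the teacher's feature dimension.
\begin{verbatim}
# Hedged sketch: one training loss combining MixMatch with an SRD-style term
# applied to both labeled and unlabeled batches. Teacher parameters are
# assumed frozen; student.features / teacher.features / teacher.classifier
# are assumed interfaces for the penultimate feature and the classifier head.
import torch
import torch.nn.functional as F

def srd_term(student_feat, teacher_feat, teacher_classifier, T=4.0):
    # Cross-network logits: the student's representation is scored by the
    # teacher's classifier and matched against the teacher's own logits.
    with torch.no_grad():
        t_logits = teacher_classifier(teacher_feat)
    s_logits = teacher_classifier(student_feat)
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction='batchmean') * (T * T)

def mixmatch_plus_srd_loss(l_mixmatch, student, teacher, x_l, x_u, lam=1.0):
    # l_mixmatch: the standard MixMatch loss already computed for this batch.
    loss = l_mixmatch
    for x in (x_l, x_u):  # SRD on both labeled and unlabeled batches
        with torch.no_grad():
            t_feat = teacher.features(x)
        s_feat = student.features(x)
        loss = loss + lam * srd_term(s_feat, t_feat, teacher.classifier)
    return loss
\end{verbatim}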
To further examine the failure of open-set SSL methods, we analyse the behavior of OOD detection in T2T [48] and OpenMatch [50] during training. In particular, we track the per-epoch proportion of unlabeled data passing through the OOD detector, with their ground-truth class labels categorized into OOD and in-distribution (IND). Fig. 5 shows that both T2T and OpenMatch can identify the majority of OOD samples most of the time whilst passing IND samples at varying rates. However, their performance is still inferior to the supervised learning baseline. To further isolate the performance factors, we test the supervised learning component of OpenMatch alone by deactivating all the unsupervised loss terms. We find that its use of unlabeled data actually degrades the performance. This is due to both the challenge of identifying OOD samples under the more difficult open-set settings studied here and the improper use of unlabeled data. We conjecture that in highly unconstrained open-set SSL scenarios, the OOD detection strategy becomes ineffective. On the contrary, with KD a pretrained teacher can instead extract more useful latent knowledge (e.g., parts and attributes shared across labeled and unlabeled classes) from unlabeled data, not limited to the labeled/known classes, and more effectively improve the model optimization.
Figure 5: Per-epoch usage of unlabeled data (those surviving OOD detection) with (top) T2T [48] and (bottom) OpenMatch (OM) [50].
We further conduct a series of analytical experiments to provide in-depth insights into model design, performance evaluation, training data, and image resolution under the open-set SSL setting.
Effect of unlabeled data size: We evaluate the effect of unlabeled data size on performance. From the training set of Tiny-ImageNet, we create four varying-size ({25%, 50%, 75%, 100%}) unlabeled sets via random selection and teacher prediction score based selection, respectively (a sketch of the selection procedure follows Fig. 6). As shown in Fig. 6, both KD and SRD benefit from more unlabeled data regardless of the selection process. This suggests they are generally scalable and insensitive to unlabeled data filtering.
Figure 6: Size effect of unlabeled data selected (Top) randomly or (Bottom) by the teacher prediction score. Teacher: WRN40-2. Student: WRN40-1.
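The subset construction can be sketched as below. This is a minimal illustration under our own assumptions: the score-based variant is shown as keeping the highest-confidence samples first, and the function and argument names are illustrative.
\begin{verbatim}
# Hedged sketch: build a subset of the unlabeled pool of a given ratio,
# either uniformly at random or by the teacher's prediction score.
import torch

@torch.no_grad()
def select_unlabeled_subset(scores, ratio, by_score=True):
    # scores: the teacher's maximum softmax confidence per unlabeled sample
    # (1-D tensor). Returns the indices of the selected samples.
    k = int(len(scores) * ratio)
    if by_score:
        # keep the highest-scoring samples first (our assumption)
        return torch.argsort(scores, descending=True)[:k]
    return torch.randperm(len(scores))[:k]
\end{verbatim}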
Effect of labeled data size: We evaluate the impact of the amount of labeled data on performance. For this experiment, we randomly select 250 images per class from the training set of CIFAR-100 for training. Evaluation remains unchanged. Table 6 shows that reducing the quantity of labeled data results in a performance decrease. Despite this, SRD still ranks first among the competitors. This indicates that the effectiveness of the model generally scales with the size of the labeled data.
Distillation with OOD detection: It is also interesting to see how OOD detection works with distillation methods. To that end, we adopt the OOD detector of T2T [48], a binary classifier on top of the teacher's representation, trained jointly during distillation. Table 7 shows that OOD detection has only a marginal effect in most cases, even though more unlabeled data are selected on average (Fig. 7) than in its original form (Fig. 5). This indicates little complementarity between OOD detection and distillation methods.
Figure 7: Per-epoch usage of unlabeled data when an OOD detector is equipped with (top) KD [7] and (bottom) our SRD.
Network | ResNet8\(\times\)4 | WRN40-1 | ShuffleNetV1 |
---|---|---|---|
KD [7] | 74.68 | 75.08 | 76.53 |
KD+OOD | 74.58 | 75.31 | 76.54 |
SRD | 76.24 | 75.32 | 77.40 |
SRD+OOD | 76.08 | 75.54 | 77.41 |
Data augmentation based consistency: A key component of SSL methods is data augmentation based consistency regularization. Here we examine how it works together with the distillation loss, which instead imposes cross-network consistency. We experiment with stochastic augmentation operations including random cropping with up to 4 pixels of shifting and random horizontal flipping. Given an unlabeled image, we generate two views via data augmentation, feed one view into the teacher and both into the student, and apply a consistency loss that maximizes their logit similarity (a sketch follows the table below). Table 8 shows that data augmentation based consistency leads to performance degradation. This suggests that cross-network and cross-augmentation consistency are incompatible with each other. We find this is because the teacher can output rather different predictions for the two views of a single image (e.g., giving different predicted labels on \(\sim\)40% of unlabeled images), causing contradictory supervision signals.
Network | ResNet8\(\times\)4 | WRN40-1 | ShuffleNetV1 |
---|---|---|---|
KD [7] | 74.68 | 75.08 | 76.52 |
KD [7]+DAC | 72.82 | 74.34 | 75.64 |
SRD | 76.24 | 75.32 | 77.40 |
SRD+DAC | 75.66 | 73.70 | 76.34 |
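As a concrete illustration of this cross-augmentation consistency (DAC) variant, the sketch below shows one plausible instantiation in PyTorch. The augmentation pipeline mirrors the crop/flip operations described above; using cosine similarity between logits is our own choice of "maximizing logit similarity", and applying the transforms batch-wise is a simplification.
\begin{verbatim}
# Hedged sketch: data-augmentation-based consistency on unlabeled images.
# Two augmented views are generated; the teacher sees one view, the student
# sees both, and their logits are pulled together.
import torch
import torch.nn.functional as F
from torchvision import transforms

aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random crop, up to 4-pixel shift
    transforms.RandomHorizontalFlip(),
])

def dac_loss(student, teacher, x_unlabeled):
    # x_unlabeled: a batch of image tensors (B, C, H, W); the same random
    # crop/flip is applied to the whole batch here for simplicity.
    v1, v2 = aug(x_unlabeled), aug(x_unlabeled)   # two stochastic views
    with torch.no_grad():
        t_logits = teacher(v1)                    # teacher sees one view only
    s1, s2 = student(v1), student(v2)             # student sees both views
    sim = F.cosine_similarity(s1, t_logits, dim=1) \
        + F.cosine_similarity(s2, t_logits, dim=1)
    return -sim.mean()                            # maximize logit similarity
\end{verbatim}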
More unconstrained unlabeled data: To evaluate generality and scalability with respect to unlabeled data, we test less related unlabeled data by replacing Tiny-ImageNet with Places365 [72]. This dataset has 1,803,460 training images from 365 scene categories, drastically different from the object-centric images of Tiny-ImageNet and CIFAR-100 (see Fig. 8), and hence presents a more challenging open-set SSL scenario. As before, we use its training set as unlabeled data; all other settings remain the same. We make similar observations from Table [tab:KD95place]. (1) All closed-set SSL methods except MixMatch fail to improve over the supervised learning baseline. (2) Again, the open-set SSL methods are all ineffective and even suffer larger performance drops. (3) Our SRD consistently delivers the best accuracy with a decent margin over the conventional distillation method. (4) Overall, distillation methods remain superior to all SSL competitors, suggesting their generic advantage even in more challenging scenarios with less relevant unlabeled data.
Figure 8: Object-centric images from (a) CIFAR-100 and (b) Tiny-ImageNet vs. scene images from (c) Places365 including plate objects.
\begin{table}[h]
\centering
\caption{Generality and scalability test with unconstrained unlabeled data from the Places365 dataset.}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c}
\hline
Network& ResNet8$\times$4 & WRN40-1 & ShuffleNetV1\\
\hline
\em Supervised learning &72.50 &71.98 &70.50\\ \hline
PseudoLabel~\cite{lee2013pseudo}&21.62&49.96&40.43\\
MeanTeacher~\cite{tarvainen2017mean}&66.94&53.19&60.32\\
FixMatch~\cite{sohn2020fixmatch}&68.57&66.42&64.26\\
MixMatch~\cite{berthelot2019mixmatch}&73.40&73.72&74.89\\
\hline
MTCR~\cite{yu2020multi}&58.08&55.38&33.67\\
T2T~\cite{T2T}&63.22&61.88&61.54\\
OpenMatch~\cite{openmatch}&58.55&69.80&67.57\\
\hline
KD~\cite{hinton2015distilling} &74.13 & 74.29 &75.81\\
\bf {\shortname} &\bf{75.93} & \bf{75.40} &\bf{77.18}\\
\hline
\end{tabular}}
\label{tab:KD95place}
\vspace{-4mm}
\end{table}
Considering that Tiny-ImageNet and Places365 were both created with filtering and manual annotation, we further test a less curated dataset, CC3M [73], as unlabeled data. We compare SRD with the top semi-supervised and distillation competitors: MixMatch [39], OpenMatch [50], and KD [7]. As shown in Table [tab:KD95cc3m], we obtain results consistent with those on Tiny-ImageNet and Places365. This implies that image curation has little impact in such open-set scenarios at this scale, which is unsurprising since manual annotations are not used and image selection would not significantly reduce the open-set challenges.
\begin{table}[h]
\centering
\caption{Generality and scalability test with more unconstrained unlabeled data from the CC3M dataset.}
\resizebox{1\textwidth}{!}{
\setlength\tabcolsep{3.5pt}
\begin{tabular}{l|c|c|c}
\hline
Network& ResNet8$\times$4 & WRN40-1 & ShuffleNetV1\\
\hline\hline
\em Supervised learning &72.50 &71.98 &70.50\\ \hline
MixMatch~\cite{berthelot2019mixmatch}&73.01&73.02&74.16\\
OpenMatch~\cite{openmatch}&71.56&69.61&66.60\\
KD~\cite{hinton2015distilling} &73.78 &74.20 &75.22\\\hline
\bf {\shortname} &\bf{76.24} &\bf{74.99} &\bf{76.65}\\
\hline
\end{tabular}}
\label{tab:KD95cc3m}
\end{table}
We further adopt the multi-class Area Under the Receiver Operating Characteristic curve (AUROC) for evaluation. In our case, we employ the standard scikit-learn toolbox, which treats the multi-class classification task as a set of binary classification tasks, one per class. We compare SRD with MixMatch [39], OpenMatch [50], and KD [7]. As shown in Fig. 9, SRD is consistently superior to the competitors.
Figure 9: Area under the receiver operating characteristic (AUROC) curve on CIFAR-100 test set.
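A minimal sketch of this AUROC computation is given below, using scikit-learn's one-vs-rest treatment of the multi-class problem; the function and loader names are illustrative.
\begin{verbatim}
# Hedged sketch: multi-class AUROC on the CIFAR-100 test set via one-vs-rest.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def multiclass_auroc(model, test_loader, device='cuda'):
    model.eval()
    probs, labels = [], []
    for images, targets in test_loader:
        p = F.softmax(model(images.to(device)), dim=1)
        probs.append(p.cpu().numpy())
        labels.append(targets.numpy())
    y_score = np.concatenate(probs)
    y_true = np.concatenate(labels)
    # One-vs-rest: each class is scored as a binary problem, then averaged.
    return roc_auc_score(y_true, y_score, multi_class='ovr')
\end{verbatim}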
Effect of higher image resolution: To assess our method's effectiveness on higher-resolution images, we test a subset of ImageNet-1K (ImageNet-Sub) comprising 500 classes with 500 samples per class. Evaluation is conducted on the original ImageNet evaluation set, restricted to the same 500 classes. For the unlabeled data, we select a subset (YFCC-Sub) of the Yahoo Flickr Creative Commons 100 Million (YFCC-100M) [74] dataset, which is four times the size of the labeled data with an OOD ratio of 80%. The models are trained with SGD, a weight decay of \(1\mathrm{e}{-3}\), and an initial learning rate of 0.1 decayed by 0.1 every 30 epochs, for 100 epochs in total with a batch size of 256. We employ a ResNet50 model as the teacher, guiding two student models with distinct architectures: ResNet18 and MobileNet.
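The optimization recipe above corresponds to a standard PyTorch setup along the following lines; this is a sketch only, with model construction and data loading omitted, and the momentum value is our assumption rather than a figure stated above.
\begin{verbatim}
# Hedged sketch: optimizer and schedule for the ImageNet-Sub experiments
# (SGD, lr 0.1, weight decay 1e-3, decayed by 0.1 every 30 epochs,
# 100 epochs, batch size 256). Momentum 0.9 is assumed.
import torch

def build_optimizer(student):
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30,
                                                gamma=0.1)
    return optimizer, scheduler

# Typical usage:
# optimizer, scheduler = build_optimizer(student)
# for epoch in range(100):
#     train_one_epoch(student, teacher, loader, optimizer)  # batch size 256
#     scheduler.step()
\end{verbatim}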
As shown in Table 9, we obtain results consistent with those on Tiny-ImageNet, Places365 and CC3M. This implies that image resolution does not affect the overall conclusion.
Network | ResNet18 | MobileNet |
---|---|---|
Supervised learning | 63.49 | 65.10 |
MixMatch [39] | 65.46 | 66.31 |
OpenMatch [50] | 60.16 | 62.56 |
KD [7] | 66.56 | 67.03 |
SRD | 68.23 | 68.93 |
Teacher | 69.10 | 69.10 |
In this work, we have presented a novel Semantic Representational Distillation (SRD) method for structured representational knowledge extraction and transfer. The key idea is to take the pretrained teacher's classifier as a semantic critic for inducing a cross-network logit on the student's representation. Considering the seen classes as a basis of the semantic space, we further scale SRD to highly unconstrained unlabeled data with arbitrary unseen classes involved, resulting in an intersection of knowledge distillation and open-set semi-supervised learning (SSL). Extensive experiments on a wide variety of network architectures and vision applications validate the performance advantage of our SRD over both state-of-the-art distillation and SSL alternatives, often by a large margin. Crucially, we reveal hidden limitations of existing open-set SSL methods in tackling more unconstrained unlabeled data, and suggest favoring knowledge distillation over out-of-distribution data detection.