### Bookworm continual learning: beyond zero-shot learning and continual learning

#### Abstract

We propose bookworm continual learning (BCL), a flexible setting where unseen classes can be inferred via a semantic model, and the visual model can be updated continually. Thus BCL generalizes both continual learning (CL) and zero-shot learning (ZSL). We also propose the bidirectional imagination (BImag) framework to address BCL where features of both past and future classes are generated. We observe that conditioning the feature generator on attributes can actually harm the continual learning ability, and propose two variants (joint class-attribute conditioning and asymmetric generation) to alleviate this problem.

# 1 Introduction

Deep learning has brought extraordinary success to visual recognition by learning from large amounts of data (e.g. object classification and detection, scene classification). There are, however, two critical assumptions that stem from a static view of the world: all concepts of interest are known before training, and the corresponding training data is also available beforehand. The resulting model is also static and remains unchanged after training. Another limitation of conventional classification models is that there is no explicit notion of semantic similarity between concepts (i.e. a semantic model), since classes are represented as one-hot labels (i.e. all classes are equally similar and dissimilar to each other). These assumptions are hardly met in the dynamic real world we live in, where new visual data and new semantic concepts are continuously observed and integrated into our own personal knowledge. Similarly, visual recognition in humans greatly leverages all sorts of semantic (and contextual) knowledge, enabling sophisticated inference.

Challenging this static world assumption, continual learning (CL) focuses on how to update the visual model when new classes and visual instances are observed over time (see Fig. 1a). A consequence is that the data is no longer i.i.d. and learning new tasks results in forgetting previous ones (i.e. catastrophic forgetting). This problem has been addressed with different techniques, including weight regularization [1]–[3], distillation [4], episodic memories with exemplars [5] and generative replay methods [6], [7].

On the other hand, zero-shot learning (ZSL) enables the recognition of (visually) unseen classes via a semantic model that describes them in connection to the seen classes (see Fig. 1b). ZSL also has an implicit temporal structure: the class descriptions are learned first, then the visual model is learned from the data of seen classes, and finally the model is tested on the unseen classes. ZSL is usually tackled by learning the alignment between visual features and class embeddings (via the semantic model) in a shared intermediate space [8], [9]. Recent works also use feature generators to synthesize features of unseen classes [10]–[12].

In this work, we argue that continual learning and semantic models are both essential for advanced visual recognition. Therefore, we propose generalized continual learning (GCL) as a more realistic setting where visual recognition is addressed with the help of an explicit semantic model, and in a dynamic scenario that requires continual learning. In the rest of the paper we focus on a particular case that we refer to as bookworm¹ continual learning (BCL), where the semantic model remains fixed while the visual model is updated continuously (see Fig. 1c). BCL can be seen as a generalization of both CL, which is limited by lacking explicit semantic models, and ZSL, which is not continual. The main challenge of GCL is the effective integration of semantic models and CL.

We propose a unified BCL framework via feature generation and distillation. A generative model (a conditional VAE) learns the distribution of features of past and future classes and generates synthetic features so a joint classifier on all classes can be trained. In our first BCL model, the feature generator is conditioned on attributes (attr-BImag variant).

We further observe that conditioning on attributes severely hurts the ability of the feature generator to prevent forgetting, compared to its continual learning counterpart. This raises the question of whether attributes are helpful or harmful in dealing with forgetting. We further investigate the problem, noticing an asymmetry between backward and forward generation (past classes have been visually observed, but not future ones), and inherent limitations of attribute-based semantic models themselves. Addressing these limitations, we propose two variants with improved performance that are also memory and computationally efficient. Finally, we also propose a novel metric to evaluate BCL, which generalizes a GZSL metric that has not previously been used to evaluate CL.

# 2 Bookworm continual learning

## 2.1 Bookworm and generalized continual learning

We assume a sequence of image classification tasks $$\left(S_1,\ldots,S_K\right)$$. Each task is learned from a dataset $$\mathcal{S}_k=\left\{\left(\mathbf{x}_i^k,\mathbf{a}_i^k,y_i^k\right)\right\}_{i=1}^{N_k}$$, where $$\mathbf{x}_i^k\in \mathcal{X}_k$$ is an image, $$y_i^k\in \mathcal{Y}_k\subset \mathcal{Y}$$ is the corresponding class label and $$\mathbf{a}_i^k\in \mathcal{A}_k\subseteq \mathcal{A}$$ is the semantic description. We are ultimately interested in learning and continually updating a visual model $$p_t\left(y\vert \mathbf{x}\right)=C_t\left(F_t\left(\mathbf{x}\right)\right)$$ that maps images to class probabilities, where $$\mathbf{z}=F_t\left(\mathbf{x}\right)$$ and $$p_t\left(y\vert \mathbf{z}\right)=C_t\left(\mathbf{z}\right)=\textrm{softmax}\!\left(W_t^\intercal\mathbf{z}\right)$$ are the visual feature extractor and the classifier at time $$t$$, respectively (all implemented jointly as a deep neural network). For simplicity, we assume that the $$k$$-th task is learned at time $$t=k$$ and will use $$t$$ and $$k$$ interchangeably.

We also consider a semantic model $$p\left(y\vert \mathbf{a}\right)$$ that relates classes and attributes². The semantic model is learned or annotated from an external source (e.g. class descriptions, taxonomy, Wikipedia), and can be leveraged to help infer classes, including unseen ones, whose instances might not have been observed yet (but whose descriptions have). The visual model is always updated over time. In GCL the semantic model can also be continually updated, while in BCL it is learned prior to the visual model during a bookworm stage (at $$t=0$$, for simplicity). We focus on the latter in this paper (see Fig. 1c), and assume task-agnostic evaluation, i.e. at test time the task is unknown and the model has to consider all classes for the prediction. Zero-shot learning (ZSL) can be seen as the particular case of BCL with two tasks and no update after the first one. Using ZSL terms, the first task is seen and the second is unseen, i.e. $$\mathcal{Y}_1=\mathcal{Y}_\text{seen}$$, $$\mathcal{Y}_2=\mathcal{Y}_\text{unseen}$$. The model is evaluated on $$\mathcal{Y}_\text{unseen}$$, which can be inferred using the semantic model. Generalized ZSL (GZSL) corresponds to task-agnostic evaluation, i.e. over $$\mathcal{Y}_\text{seen}\cup \mathcal{Y}_\text{unseen}$$. Continual learning (CL) corresponds to the particular case where no semantic model is available, and therefore at time $$t$$ the model can only discriminate between all the classes seen so far, which we denote as $$\mathcal{Y}_{\leq t}=\bigcup_{k=1}^t \mathcal{Y}_k$$. Finally, if we further assume no continual update we recover the usual setting where the model is learned with all the data $$\mathcal{S}=\bigcup_k \mathcal{S}_k$$ (we refer to it as joint training (JT)).
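The relation between these settings can be summarized by the label space each one evaluates at time $$t$$. The following is a minimal sketch, not the authors' code; the function name `label_space` and the representation of tasks as Python sets are assumptions made for illustration.

```python
def label_space(setting, task_classes, t):
    """Return the set of classes evaluated at time t (1-indexed).

    task_classes is a list of sets, one per task (Y_1, ..., Y_K).
    The setting names are illustrative labels, not an API from the paper.
    """
    seen_so_far = set().union(*task_classes[:t])   # Y_{<=t}
    all_classes = set().union(*task_classes)       # Y
    if setting == "CL":
        # No semantic model: only classes observed so far can be predicted.
        return seen_so_far
    if setting in ("BCL", "GZSL", "JT"):
        # Task-agnostic evaluation over every class.
        return all_classes
    if setting == "ZSL":
        # Evaluated only on classes without visual data yet.
        return all_classes - seen_so_far
    raise ValueError(f"unknown setting: {setting}")


tasks = [{0, 1, 2}, {3, 4}]  # toy example: Y_1 (seen) and Y_2 (unseen)
for name in ("CL", "ZSL", "GZSL", "BCL"):
    print(name, sorted(label_space(name, tasks, t=1)))
```

In this toy view, GZSL and BCL at $$t=1$$ share the same label space; what differs in BCL is that the visual model keeps being updated for $$t>1$$.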

# 3 BImag: feature generation for BCL

## 3.1 Integrating continual learning and semantic models

To address BCL we need to cope with three challenges: (a) catastrophic interference between tasks in the shared feature extractor, (b) bias in the classifier (to the most recently observed data), and (c) a way to predict future classes (via semantic information). Here (a) is related to CL, (c) to GZSL and (b) to both. Our approach tackles these challenges separately, with distillation in the feature extractor to prevent catastrophic interference, and synthetic data generation to train a joint and unified classifier for all classes (see Fig. 2). Compared to traditional generative replay in CL [6], our method focuses only on the classifier (generating features rather than images), leverages semantic information, and has a hierarchical generator that is also bidirectional (it generates features of past classes, i.e. replay, and of future classes, i.e. foresight/imagination). Hence we loosely refer to our framework as bidirectional imagination (BImag, see Fig. 2). This allows us to predict any category at any time, while also allowing for continual updates. Semantic information is only used during training, and the final model is a direct mapping from image to class, without mapping to any intermediate semantic space.

In a first stage to learn a new task at time $$t$$, the feature extractor $$F_t$$ is updated using an auxiliary classifier $$\hat{C}_t$$ by minimizing the cross-entropy loss over the current task. Forgetting is alleviated by distilling the features of a fixed copy of the previous feature extractor $$F_{t-1}$$ into the current feature extractor $$F_t$$, using an $$\ell_2$$ loss computed over the images of the current task.
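The first-stage objective can be sketched for a single sample as follows. This is an illustrative sketch under assumptions: the squared Euclidean form of the $$\ell_2$$ loss and the trade-off weight `lam` are not specified in the text, and all function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of the auxiliary classifier for one sample (stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def l2_distillation(feat_new, feat_old):
    """Squared l2 distance between F_t(x) and the frozen F_{t-1}(x) (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(feat_new, feat_old))


def stage_one_loss(logits, label, feat_new, feat_old, lam=1.0):
    """Current-task cross-entropy plus feature distillation; lam is a hypothetical weight."""
    return cross_entropy(logits, label) + lam * l2_distillation(feat_new, feat_old)


# Toy example: two-class logits and 2-d features of the same image under F_t and F_{t-1}.
loss = stage_one_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 0.0], lam=0.5)
print(loss)
```

Both terms are computed on images of the current task only, so no stored exemplars of past tasks are needed at this stage.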

In the second stage (see Fig. 2a) we train a conditional variational autoencoder with an encoder $$\left[\boldsymbol{\mu},\Sigma\right]=E\left(\mathbf{z},\mathbf{a}\right)$$ (which estimates the parameters of the multivariate Gaussian latent distribution) and a decoder $$D_t\left(\mathbf{r},\mathbf{a}\right)$$, where $$\mathbf{r}$$ is a random latent vector (i.e. $$\mathbf{r}\sim \mathcal{N}\left(\boldsymbol{\mu},\Sigma\right)$$). The description generator $$\mathbf{a}=G\left(y\right)=A\mathbf{1}_y$$ maps the class $$y$$ to the attribute-based description $$\mathbf{a}$$ via the class-to-attribute matrix $$A$$³. The decoder acts as a feature generator conditioned on the attribute-based description. The parameters of the encoder and decoder are learned by maximizing the evidence lower bound (ELBO). The feature extractor remains fixed during this stage, and the encoder is learned from scratch every time. In addition, we include the replay alignment loss [7] between the past decoder $$D_{t-1}$$ and the current decoder $$D_t$$, which is a form of distillation to prevent forgetting in the feature generator.
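For reference, the ELBO of a conditional VAE takes the standard form below. This is a sketch under assumptions not stated in the text: a standard normal prior $$p\left(\mathbf{r}\right)=\mathcal{N}\left(\mathbf{0},\mathbb{I}\right)$$, with $$q\left(\mathbf{r}\vert\mathbf{z},\mathbf{a}\right)=\mathcal{N}\left(\boldsymbol{\mu},\Sigma\right)$$ denoting the Gaussian whose parameters the encoder $$E$$ outputs and $$p\left(\mathbf{z}\vert\mathbf{r},\mathbf{a}\right)$$ the likelihood defined by the decoder.

$$
\mathcal{L}_\text{ELBO}\left(\mathbf{z},\mathbf{a}\right)=\mathbb{E}_{q\left(\mathbf{r}\vert\mathbf{z},\mathbf{a}\right)}\left[\log p\left(\mathbf{z}\vert\mathbf{r},\mathbf{a}\right)\right]-\mathrm{KL}\left(q\left(\mathbf{r}\vert\mathbf{z},\mathbf{a}\right)\,\Vert\,p\left(\mathbf{r}\right)\right)
$$

The first term rewards the decoder for reconstructing the observed features given the condition, while the second keeps the latent distribution close to the prior, so that latent vectors drawn from the prior can later be decoded into synthetic features.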

Once the VAE is trained, the decoder can generate a set of synthetic features $$\tilde{\mathcal{S}}_{\neq t}$$​ for both past and future classes. The classifier $$C_t$$​ is trained with both real and synthetic features, i.e. $$\mathcal{S}_t\bigcup \tilde{\mathcal{S}}_{\neq t}$$​ using the cross-entropy loss. We refer to this variant with attribute-conditional VAE as attr-BImag.
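Assembling the classifier's training set can be sketched as below. The generator here is a toy stand-in for the trained decoder, and the names `build_classifier_set` and `toy_generator` are hypothetical; only the idea of mixing real features of the current task with synthetic features of all other classes comes from the text.

```python
import random

_rng = random.Random(0)  # fixed seed so the toy example is reproducible


def toy_generator(y, dim=4):
    """Stand-in for the decoder D_t(r, G(y)): a noisy feature centred on the class id."""
    return [y + _rng.gauss(0.0, 0.1) for _ in range(dim)]


def build_classifier_set(real, current_classes, all_classes, generate, n_per_class=300):
    """Union of real features (current task) and synthetic ones (past and future classes).

    real is a list of (feature, label) pairs from the current task.
    """
    data = list(real)
    for y in sorted(set(all_classes) - set(current_classes)):
        data.extend((generate(y), y) for _ in range(n_per_class))
    return data


# Toy example: class 1 is the current task, class 0 is past, class 2 is future.
real_features = [([1.0, 1.0, 1.0, 1.0], 1)]
train_set = build_classifier_set(real_features, {1}, {0, 1, 2}, toy_generator, n_per_class=3)
print(len(train_set))
```

The classifier is then trained with cross-entropy on this mixed set, which is what lets it score classes it has no real features for at time $$t$$.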

We notice that the VAE of attr-BImag with $$A=\mathbb{I}$$​ is directly conditioned on the class label, resulting in a CL framework because it cannot predict future classes, which we use as CL baseline (i.e. class-BImag).

Interestingly, we observed that in practice attr-BImag tends to forget more than class-BImag (i.e. attributes, rather than helping, harm the ability to prevent forgetting previous classes). To understand this problem, it is convenient to observe that feature generation in class-BImag can be formulated as $$\mathbf{z}\sim p\left(\mathbf{z} \vert y\right)$$. Similarly, we can add attributes as another variable and factorize as $$p\left(\mathbf{z} \vert y\right)=p\left(\mathbf{z} \vert \mathbf{a}, y\right)p\left(\mathbf{a} \vert y\right)$$.

The particular case of attr-BImag computes $$\mathbf{a}=G\left(y\right)$$, followed by sampling $$\mathbf{z}\sim p\left(\mathbf{z}\vert\mathbf{a}\right)$$. Thus, attr-BImag assumes that features and classes are conditionally independent given the attributes, and therefore all relevant visual information to generate synthetic features needs to be represented somehow in the attribute space. This is difficult to achieve in practice, and the feature generator may be unable to synthesize certain discriminative patterns that are essential to keep high accuracy and prevent forgetting. In contrast, class-BImag has a direct mapping between classes and features, so the feature generator could, in principle, directly model the relevant visual information and capture its diversity.

We can partly alleviate the dependence on the attribute space by conditioning the VAE both on attributes and classes, and then generate features as $$\mathbf{z}\sim p\left(\mathbf{z}\vert\mathbf{a}, y\right)$$ (see the factorization above). We refer to this variant as class-attr-BImag.

Feature generation in BImag is asymmetric: at a given time, the feature generator has observed only the semantic description of future classes, while it has observed both visual and semantic information of past classes. As we discussed previously, conditioning directly on the class seems to prevent forgetting better than conditioning on attributes, but the latter is necessary to predict unseen classes. Motivated by this observation, we decouple the two generation directions and use a different VAE for each (asym-BImag), one conditioned on classes for backward generation and the other conditioned on attributes for forward generation.
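The four variants differ only in the condition vector passed to the VAE, which the following sketch makes explicit. It is an illustration under assumptions: the matrix $$A$$ is stored with one row per attribute and one column per class, joint conditioning is implemented as concatenation, and the function names are hypothetical.

```python
def one_hot(y, num_classes):
    """1_y: one-hot representation of class y."""
    return [1.0 if c == y else 0.0 for c in range(num_classes)]


def describe(y, A):
    """Description generator G(y) = A 1_y, i.e. column y of the class-to-attribute matrix."""
    indicator = one_hot(y, len(A[0]))
    return [sum(a * b for a, b in zip(row, indicator)) for row in A]


def condition(variant, y, A, direction="backward"):
    """Condition vector fed to the VAE for each BImag variant."""
    num_classes = len(A[0])
    if variant == "class":        # class-BImag: equivalent to attr-BImag with A = I
        return one_hot(y, num_classes)
    if variant == "attr":         # attr-BImag: attributes only
        return describe(y, A)
    if variant == "class-attr":   # class-attr-BImag: attributes and class (concatenation assumed)
        return describe(y, A) + one_hot(y, num_classes)
    if variant == "asym":         # asym-BImag: one VAE per generation direction
        return one_hot(y, num_classes) if direction == "backward" else describe(y, A)
    raise ValueError(f"unknown variant: {variant}")


# Toy class-to-attribute matrix: 2 attributes (rows) x 3 classes (columns).
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
print(condition("attr", 2, A))
print(condition("class-attr", 2, A))
```

With `A` set to the identity matrix, `describe` returns the one-hot vector itself, which mirrors the observation that attr-BImag with $$A=\mathbb{I}$$ reduces to class-BImag.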

# 4 Experiments

## 4.1 Settings

CUB is a fine-grained recognition dataset with 200 classes [13], while AwA (specifically AwA2) has 50 coarser classes [14]. We follow the settings and preprocessing used in conventional GZSL methods. We use the data, class splits and train/test splits proposed by [14], adapting them to our BCL setting. This results in two tasks A/B⁴ with class splits 150/50 for CUB and 40/10 for AwA (tasks A/B in BCL or seen/unseen in ZSL, respectively). Since task B is not trained in ZSL, we created our own train/test splits.

Our implementation is based on PyTorch and trained using NVIDIA GTX 1080Ti GPUs. The feature extractor in our model is a ResNet-101 [15], as commonly used in previous works in ZSL, and is then fine-tuned on every new task as typically done in CL. Our conditional VAE consists of an encoder with three fully connected layers and a decoder with two fully connected layers (see supplementary material for details). The conditions can be attribute vectors and/or class labels as one-hot vectors. To train the joint classifier (Fig. 2c), we generate 300 synthetic features per class for both past and future classes. We set $$\lambda_1=1$$, $$\lambda_2=0.1$$. We use the Adam optimizer [16] with learning rates of 0.0001 for the feature extractor and 0.001 for both the classifier and the VAE.
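The settings above can be collected in a small configuration object. This is only a convenience sketch: the values are those reported in the text, while the class name `BImagConfig` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BImagConfig:
    """Hyperparameters reported in the text; field names are illustrative."""
    lr_feature_extractor: float = 1e-4   # Adam learning rate for the ResNet-101 feature extractor
    lr_classifier: float = 1e-3          # Adam learning rate for the joint classifier
    lr_vae: float = 1e-3                 # Adam learning rate for the conditional VAE
    synthetic_features_per_class: int = 300
    lambda_1: float = 1.0                # loss weights as reported in the text
    lambda_2: float = 0.1


# Class splits for tasks A/B (seen/unseen in ZSL terms).
CLASS_SPLITS = {"CUB": (150, 50), "AwA": (40, 10)}

config = BImagConfig()
print(config)
```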

We use class-BImag with fine tuned feature extractor, distillation and replay alignment as main CL baseline. We extend this baseline with different semantic models to the BCL variants attr-BImag, class-attr-BImag and asym-BImag. Note that BCL methods at step $$t=1$$​ correspond to GZSL.

We adapt the AUSUC metric used in GZSL [17] and use the area under the (per-class) task-accuracy curve (AUTAC) as the metric to evaluate BCL. AUSUC was proposed as a more robust metric than the more common harmonic mean of seen and unseen accuracies [14], which is very sensitive to score calibration. Finally, to evaluate how well a particular approach can make predictions for any task or class at any time, which is the main objective in BCL, we compute the average AUTAC across time.
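An AUSUC-style computation for two tasks can be sketched as follows. This is an assumption-laden illustration rather than the exact AUTAC definition, which the text does not spell out: a calibration offset `gamma` is subtracted from the scores of task-A classes, the offset is swept to trace a curve of per-class accuracies on tasks A and B, and the area is obtained with the trapezoidal rule. All function names are hypothetical.

```python
def calibrated_prediction(scores, classes_a, classes_b, gamma):
    """Argmax over all classes after subtracting gamma from every task-A score."""
    best_a = max(classes_a, key=lambda c: scores[c])
    best_b = max(classes_b, key=lambda c: scores[c])
    return best_a if scores[best_a] - gamma > scores[best_b] else best_b


def per_class_accuracy(preds, labels, classes):
    """Mean of the per-class accuracies over the given classes."""
    accs = []
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        if idx:
            accs.append(sum(preds[i] == c for i in idx) / len(idx))
    return sum(accs) / len(accs) if accs else 0.0


def task_accuracy_curve(all_scores, labels, classes_a, classes_b, gammas):
    """Points (acc_A, acc_B), sweeping gamma from large (favours B) to small (favours A)."""
    points = []
    for gamma in sorted(gammas, reverse=True):
        preds = [calibrated_prediction(s, classes_a, classes_b, gamma) for s in all_scores]
        points.append((per_class_accuracy(preds, labels, classes_a),
                       per_class_accuracy(preds, labels, classes_b)))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve, with acc_A on the x axis."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def harmonic_mean(acc_a, acc_b):
    """Harmonic mean of two accuracies at a single operating point."""
    return 0.0 if acc_a + acc_b == 0 else 2.0 * acc_a * acc_b / (acc_a + acc_b)


# Toy example: classes 0,1 form task A, classes 2,3 form task B; the true class always scores highest.
demo_scores = [[2, 1, 0, 0], [0, 3, 1, 0], [0, 1, 2, 0], [0, 0, 1, 5]]
demo_labels = [0, 1, 2, 3]
inf = float("inf")
curve = task_accuracy_curve(demo_scores, demo_labels, [0, 1], [2, 3], [inf, 0.0, -inf])
print(curve, area_under_curve(curve))
```

The harmonic mean evaluates a single point of this curve (the one at `gamma = 0`), which is why it is sensitive to score calibration, whereas the area summarizes all operating points. Averaging the area over time steps would then give a single number per method.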

## 4.2 Generalized zero-shot learning

We first evaluate our framework in the GZSL setting (equivalent to BCL at $$t=1$$). Table ??? shows the results⁵ for CUB 150/50 and AwA 40/10, including recent works using feature generators ([11], f-CLSWGAN [10] and f-VAEGAN-D2 [12]), with either a fixed (fix) or fine-tuned (ft) feature extractor. Although it was not our main objective, BImag achieves very competitive results, including the best result on AwA and the second best on CUB, only behind f-VAEGAN-D2 (ft). Interestingly, conditioning on the class label seems to also be beneficial to GZSL.

## 4.3 Bookworm continual learning

Table ??? shows the results for two tasks for the different variants of BImag. The CL variant class-BImag cannot predict future classes, in contrast to the variants with semantic models (i.e. BCL variants). The lower performance at $$t=2$$ of attr-BImag compared to class-BImag highlights the limitations of attribute conditioning, probably due to a poorer VAE model when visual instances were already observed. Augmenting the condition with the class label (i.e. class-attr-BImag) and the asymmetric approach (asym-BImag) significantly alleviate this problem, with both variants achieving the best AUTAC on CUB 150/50. On AwA, class-attr-BImag performs best at $$t=1,2$$ and also in average AUTAC. In summary, BCL methods outperform CL (i.e. class-BImag) at initial times (thanks to the semantic model), while outperforming GZSL ($$t=1$$ row) by updating the visual model over time. Overall, properly using semantic information and class labels in our VAE component helps us to improve the functionality of both CL and GZSL.

# 5 Conclusion

We propose GCL as a novel and more realistic setting where continual learning is augmented with an explicit semantic model, which we argue is essential in humans to address visual recognition. We focus on the particular case of BCL, where the semantic model is fixed beforehand, but which still generalizes (G)ZSL and CL. We also propose the BImag framework based on feature generation in both forward and backward temporal directions, which we used to study the interplay between CL and semantic models. We observed that the semantic model may harm the ability to prevent forgetting. We propose two variants to alleviate this problem based on joint class-attribute conditioning and asymmetric generation.

# References

[1] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” in PNAS, 2017.

[2] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” in ECCV, 2018.

[3] X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. López, and A. D. Bagdanov, “Rotate your networks: Better weight consolidation and less catastrophic forgetting,” in ICPR, 2018, pp. 2262–2268.

[4] Z. Li and D. Hoiem, “Learning without Forgetting,” TPAMI, vol. 40, pp. 2935–2947, 2017.

[5] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “iCaRL: Incremental Classifier and Representation Learning,” in CVPR, 2017, pp. 5533–5542.

[6] H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep generative replay,” in NIPS, 2017, pp. 2990–2999.

[7] C. Wu, L. Herranz, X. Liu, Y. Wang, J. van de Weijer, and B. Raducanu, “Memory replay GANs: Learning to generate new categories without forgetting,” in NIPS, 2018, pp. 5962–5972.

[8] A. Frome et al., “DeViSE: A deep visual-semantic embedding model,” in Advances in neural information processing systems, 2013, pp. 2121–2129.

[9] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid, “Label-embedding for image classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 7, pp. 1425–1438, 2015.

[10] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata, “Feature generating networks for zero-shot learning,” in IEEE computer vision and pattern recognition (CVPR), 2018.

[11] A. Mishra, S. Krishna Reddy, A. Mittal, and H. A. Murthy, “A generative model for zero shot learning using conditional variational autoencoders,” in The IEEE conference on computer vision and pattern recognition (CVPR) workshops, 2018.

[12] Y. Xian, S. Sharma, B. Schiele, and Z. Akata, “f-VAEGAN-D2: A feature generating framework for any-shot learning,” in IEEE computer vision and pattern recognition (CVPR), 2019.

[13] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 dataset,” California Institute of Technology, 2011.

[14] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 9, pp. 2251–2265, 2018.

[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[17] S. Changpinyo, W. Chao, B. Gong, and F. Sha, “Synthesized classifiers for zero-shot learning,” in CVPR, 2016, pp. 5327–5336.

1. We use an avid reader (i.e. bookworm stereotype) as a metaphor, due to his/her extensive encyclopedic knowledge (e.g. concept descriptions) before eventually observing them visually.↩︎

2. For simplicity, we assume classification tasks and attribute-based semantic models, but our discussion is also valid for any other fixed-size continuous semantic embeddings (e.g. word embeddings, language embeddings).↩︎

3. $$\mathbf{1}_y$$​ represents the one-hot representation of $$y$$↩︎

4. We use $$t=1,2,\ldots$$​ to index time and $$k=A,B,\ldots$$​ to index tasks. We assume that the $$k$$​-th task is learned at time $$t=k$$​.↩︎

5. We average results over 5 runs. Other GZSL methods in the table do not report average results (possibly reporting the best run).↩︎