Abstract

In this report, we describe the technical details of ourÂ submission to the EPIC-Kitchens Object Detection Challenge. Duck filling and mix-up techniques are firstly introduced to augment the data and significantly improve the robustness of the proposed method. Then we propose GRE-FPN and Hard IoU-imbalance Sampler methods to extract more representative global object features. To bridge the gap of category imbalance, Class Balance Sampling is utilized and greatly improves the test results. Besides, some training and testing strategies are also exploited, such as Stochastic Weight Averaging and multi-scale testing. Experimental results demonstrate that our approach can significantly improve the mean Average Precision (mAP) of object detection on both the seen and unseen test sets of EPIC-Kitchens.

1 Introduction↩︎

EPIC-Kitchens dataset was introduced as a large scale first-person action recognition dataset. In this Object detection task, the annotations only capture the ‘active’ objects pre-, during- and post- interaction [1] [2]. It is challenging for object detection due to the influence of the sparse annotations and long-tail class distribution in this dataset. To address these challenges, we focus on the sampling methods in detection process and present the GRE-FPN and Hard IoU-imbalance Sampler methods to improve the robustness of location. Additionally, duck filling methods are applied to balance the influence of the long-tail class distribution and improve the diversity of few-shot classes. Experimental results demonstrate our approach can significantly improve the object detection performance and achieve a competitive result on the test set. The implementations details of the above are described in section 2 and section 3.

2 Proposed Approach↩︎

2.1 Data Preprocess↩︎

Figure 1: The mix-up images for few-shot classes..

Figure 2: The duck filling images for few-shot classes..

The EPIC-Kitchens training dataset for object detection contains 290 valid annotation categories and 326,064 bounding boxes. But the training dataset is extremely imbalanced, and it has the long-tail class distribution. Especially, there are only 8,136 bounding boxes for few-shot classes (118 classes) with less than 200 bounding boxes in total. To address the imbalance between few-shot classes and many-shot classes and improve the diversity of few-shot classes, we adopt some data augmentation preprocess, such as mix-up [3] and few-shot classes duck filling. For mix-up, it fuses two visually coherent images together into one output image by using Beta random distribution to improve the diversity of training set. And the mixtures are present in the Fig 1. For few-shot classes duck filling, we first extract the few-shot bounding boxes from training set, and then fill the few-shot objects into the non-annotated images in training set. In filling processing, some skills such as random weighted average method, random rescale for bounding boxes, are utilized to realize the mixing between few-shot objects and non-annotated images. The result images for few-shot classes duck filling are shown in Fig 2. Besides the two specific data argument methods, some regular augment methods are applied, such as random scale, random flipping, channel shuffling and random brightness contrast and so on.

2.2 Proposed Methods↩︎

We exploit both one-stage (such as FCOS [4], ATSS [5]) and two-stage (Cascade-RCNN [6]) as the basic detectors and evaluate their performance on validation dataset. And several classification networks are chosen as the backbone, such as HRNet-w36, HRNet-w48 [7], ResNext101-64, ResNext101-32 [8] and ResNet101 [9]. Comparing the detection performance of the above methods, we select the Cascade-RCNN as base network framework. And The ResNet101 is chosen as the backbone with FPN and deformable convolution (DCN). Besides, two other improvement skills are introduced to obtain a better performance — GRE-FPN and Hard IoU-imbalance Sampler.

GRE-FPN: At the regular FPN [10] stage, RoI features are usually extracted from one certain pyramid level according to the scales of RoIs. It ignores the importance of the adjacent scale features, which may contain more accurate location information. Therefore, we propose the Global RoI Extractor (GRE), which can extract RoI features from all pyramid levels and learn adaptive weights to balance the importance of features on different pyramid levels automatically. The detailed structure of GRE is illustrated in Fig 3. We first pool RoI features from all pyramid levels and concatenate them together. Then, these pooled features are convoluted with 1x1 convolution to reduce the channel dimension and obtain the final RoI features to predict objects.

Figure 3: The details of Global RoI Extractor.

Hard IoU-imbalance Sampler: We visualize annotations of objects on the respective images and find the annotations are sparse and coarse annotated. Besides, amounts of noise information are introduced and many targets should be labeled are missing. We call these lost targets unmarked targets. As for anchor-based network, the qualities of proposed anchors are significant for achieving an outstanding performance. Considering these images shown in Fig 4, we can observe that the marked target (the green box) may be surrounded by other unmarked targets (the red boxes).

Figure 4: The annotation visualize examples.

The regular sampling methods, such as Random Sampling, Online Hard Example Mining (OHEM), may generate false samples that take the unmarked target as negative samples and lead to the decreased robustness at the stage of RPN. To reduce the possibility of taking the unmarked targets as negative samples, we constrain negative sample regions that IoU (\(IoU_{(S,T)}\)) and center distance with marked target (\(DIS_{(S, T)}\)) are subject to

\[\begin{cases} IoU_{(S,T)} &<0.3 \\ DIS_{(S,T)} &<â?T_w, T_h)â_2 \end{cases}\]

\(S\) is the proposal anchor, and \(T\) is the annotated target bounding box. \(T_w\), \(T_h\) are the width and height of annotated target respectively. In this methods, the negative sample region always surround the marked targets rather than the whole input image, which can significantly reduce the influence of false sample. Besides, inspired by [11], we also improve the sampling ratio of hard negative examples with IoU in range \((0.05, 0.3)\) to further ensure the balance of sample processing.

3 Experiments↩︎

In this section, we conduct several experiments on EPIC-Kitchens object detection dataset. And the comparisons of detection performance are presented to verify the effectiveness of proposed method.

Table 1: Comparison results on EPIC validation dataset. Performance measures contain mAP with ratio 0.05, 0.5, 0.75.
Cascade-rcnn	EPIC-Pretrain	CB	Multi-Test	SWA	IoU>0.05(%)	IoU>0.5(%)	IoU>0.75(%)
✔					71.11	43.22	17.12
✔	✔				72.01	43.67	18.01
✔		✔			72.77	44.67	18.55
✔			✔		72.45	44.51	18.32
✔				✔	71.97	43.78	17.89
✔	✔	✔			73.54	45.10	18.97
✔	✔	✔	✔		73.87	45.57	19.22
✔	✔	✔	✔	✔	74.64	46.22	20.13

c|c|c|c|c|c|c|c|c|c|c & *method & & &
& & & & & & & & & &
*seen & base & 47.52 & 26.01 &7.56 & 36.10& 39.59 &10.79 & 60.17& 37.03 &10.18
& Mixup + duck & 49.86 &31.43 & 12.39 & 64.01 & 40.26 & 13.24 & 61.35 & 38.60 & 13.08
& Mixup + duck + Train tricks & 47.07 & 27.01 & 11.79 & 65.68 & 41.72 & 13.92 & 62.18 & 38.95 &13.52
& Mixup + duck + Hard IoU & 52.48 & 33.30 & 13.29 & 67.27 & 41.16 & 12.91 & 65.23 & 39.68 &12.98
& Mixup + duck + GRE-FPN & 47.14 & 32.02 & 13.35 & 66.23 & 42.47 & 14.52 & 53.58 & 39.83 &14.30
& Ensemble & 54.98 & 32.40 & 14.55 & 68.74 & 43.88 & 15.38 & 66.15 & 41.72 & 15.23
*unseen& base & 20.07 & 13.42 & 1.93 & 54.83 & 31.92 &8.11 & 51.29 & 30.04 &7.48
& Mixup + duck & 42.95 & 18.26 & 4.86 & 65.42 & 38.24 & 11.68 & 63.19 & 36.20 & 10.99
& Mixup + duck + Train tricks & 39.44 & 24.20 & 8.61 & 66.44 & 38.85 & 12.79 & 63.69 & 37.36 & 12.36
& Mixup + duck + Hard IoU & 40.99 & 22.66 & 6.82 & 66.62 & 39.13 & 12.42 & 64.01 & 37.45 & 11.85
& Mixup + duck + GRE-FPN & 34.02 & 19.42 & 7.52 & 66.81 & 39.71 & 13.38 & 63.47 & 37.64 & 12.78
& Ensemble & 35.75 & 22.31 & 7.33 & 67.92 & 41.92 & 14.29 & 64.64 & 39.93 & 13.58

3.1 Experimental Settings↩︎

All the experiments are conducted by using the MMdetection toolbox which is developed with PyTorch by Multimedia Laboratory, CUHK. And we run our experiments on 8 NVIDIA P40 GPUs. The mini-batch Stochastic Gradient Descent (SGD) optimizer with momentum of 0.9 and weight decay of \(1\times e^{-4}\) is utilized to solve the experiments. The input images are randomly sized to (\(1280\times 720\)) and (\(1394\times 764\)). The batch size is 32 and the maximum epoch for training is set to 12. The initial learning is fixed to 0.02. Then, it decays to \(2\times e^{-3}\) at epoch 8 and \(2\times e^{-4}\) at epoch 11. Meanwhile, we use 0.0067 to warm up the training until 500 iterations, then go back to 0.02 and continue training.

According to the number of bounding boxes for each category, the whole training set is divided into two parts : many shot (S1), few shot (S2). In the entire training phase, there are three stages: T1, T2, T3. In training phase T1, we choose datasets S1 to train and get a trained model (M1); In T2, we choose datasets S2 to fine-tune model M1 and obtain fine-tuning model (M2); And in the final phase T3, we use both datasets S1 and S2 to fine-tuning model M2 and obtain the final model.

3.2 Training Skills↩︎

In training and validation stage, we use several training skills to optimize the training procedure, such as Model Pre-training, Class Balance Sampling and Stochastic Weight Averaging. The test results on validation dataset is show in Table 1. The validation dataset is extracted from train dataset, and is divided into seen and unseen set.

Model Pre-training. We extract images based on bounding boxes and categories of objects from EPIC-Kitchens Object Detection datasets, and train a Classification model as our pre-trained model to replace default ImageNet pre-trained model.

Class Balance Sampling. During the training phase, image lists are randomly shuffled before the start of every epoch. Considering the long-tail class distribution and imbalanced categories, we randomly sample images in training list based on category probability as formula (1 ). \[\begin{cases} W_i&= 1/S_{i_c} \\ S_{i_c}&= \sum_{j=0}^m{\mathbb{R}_{(c,c_{i_j})}} \label{CB} \end{cases}\tag{1}\] Where \(W_i\) and \(S_{i_c}\) is the sampling possibility and the total numbers of category id \(c\) in the \(i^{th}\) image in training list respectively. \(c\) is the class ID that is annotated in image \(i\) and has the minimum proportion in the whole training dataset. \(m\) represents the total marked categories in the image \(i\). \(c_{i_j}\) is the category of the \(j^{th}\) object in image \(i\). And \(\mathbb{R}_{(c, c_{i_j})} = 1\) only if \(c = c_{i_j}\), otherwise \(\mathbb{R}_{(c, c_{i_j})} = 0\). This sampling methods can increase the sampling possibility of the few shot class and effectively solve the imbalanced categories problem.

Stochastic Weight Averaging (SWA). SWA [12] is based on averaging the samples proposed by SGD with a learning rate schedule that allows exploration of the region of weight space corresponding to high-performing networks. Using SWA, we achieve notable improvement over conventional SGD training on our base model and this method has almost no computational overhead.

3.3 Testing↩︎

In testing phase, we adopted the fusing results of multi-scale [13], flipping and Gaussian fuzzy transformation as the test output, which can further improve the detection performance. The detection results with different methods are presented in Table [final95res]. Specifically, by using data augmentation proposed in Sec 2.1, proposed methods in Sec 2.2 and training tricks in Sec 3.2, the best performance obtained for seen and unseen results in a mAP of 39.93% with \(IoU> 0.5\). Using the duck filling technique, the recognition accuracy improved by 6.16% (30.04% vs 36.20%) on unseen set. By combining the data argument with training tricks, Hard IoU and GRE-FPN models separately, improvements of 1.16%, 1.25%, 1.44% are obtained respectively. With an ensemble of all the methods, the mAP of \(IoU> 0.5\) is further improved by 3.73% (36.20% vs 39.93%).

4 Conclusion↩︎

The proposed method for the EPIC-Kitchens object detection task is demonstrated in detail in this paper. Our main concerns are to moderate the long-tail class distribution of training set and extract more effective features.

The duck filling, mix-up and class balance sampling are introduced to expand the training set and moderate the long-tail distribution. And The Hard IoU-imbalance Sampler and the reconstructed GRE-FPN are also utilized to help extracting more representative object features. Experimental results demonstrate that our methods are effective and useful. By assembling these main methods, our detection framework can obtain a competitive performance on both the seen and unseen test data.

References↩︎

[1]

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.

[2]

D. Damen, H. Doughty, G. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020.

[3]

Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of freebies for training object detection neural networks. arXiv: Computer Vision and Pattern Recognition, 2019.

[4]

Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. arXiv: Computer Vision and Pattern Recognition, 2019.

[5]

Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv: Computer Vision and Pattern Recognition, 2019.

[6]

Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. pages 6154–6162, 2018.

[7]

Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. pages 5693–5703, 2019.

[8]

Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. pages 5987–5995, 2017.

[9]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770–778, 2016.

[10]

Tsungyi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. pages 936–944, 2017.

[11]

Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra r-cnn: Towards balanced learning for object detection. 2019.

[12]

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. 2018.

[13]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 37(9):1904–16, 2014.

DHARI Report to EPIC-Kitchens 2020 Object Detection Challenge