LR-FPN: Enhancing Remote Sensing Object Detection with Location Refined Feature Pyramid Network

Hanqian Li1, Ruinan Zhang1, Ye Pan2, Junchi Ren3, Fei Shen4,5*
1Shandong University, Jinan, China
2South China Normal University, Foshan, China
3China Telecom Corporation Limited, Nanjing, China
4Nanjing University of Science and Technology, Nanjing, China
5Tencent AI Lab, Shenzhen, China


Abstract

Remote sensing target detection aims to identify and locate critical targets within remote sensing images, finding extensive applications in agriculture and urban planning. Feature pyramid networks (FPNs) are commonly used to extract multi-scale features. However, existing FPNs often overlook extracting low-level positional information and fine-grained context interaction. To address this, we propose a novel location refined feature pyramid network (LR-FPN) to enhance the extraction of shallow positional information and facilitate fine-grained context interaction. The LR-FPN consists of two primary modules: the shallow position information extraction module (SPIEM) and the contextual interaction module (CIM). Specifically, SPIEM first maximizes the retention of solid location information of the target by simultaneously extracting positional and saliency information from the low-level feature map. Subsequently, CIM injects this robust location information into different layers of the original FPN through spatial and channel interaction, explicitly enhancing the object area. Moreover, in spatial interaction, we introduce a simple local and non-local interaction strategy to learn and retain the saliency information of the object. Lastly, the LR-FPN can be readily integrated into common object detection frameworks to improve performance significantly. Extensive experiments on two large-scale remote sensing datasets (i.e., DOTAV1.0 and HRSC2016) demonstrate that the proposed LR-FPN is superior to state-of-the-art object detection approaches. Our code and models will be publicly available.

Keywords: remote sensing, object detection, location refined.

1 Introduction

Remote sensing object detection [1][4] involves detecting and pinpointing objects within remotely captured images. It is a critical tool across multiple domains, notably in agriculture and urban planning. Efficiently and accurately identifying objects in remote sensing images is pivotal for informed decision-making, fostering sustainable development, and optimizing resource utilization.

The feature pyramid network (FPN) [3] is a prevalent architecture in object detection, constructing a hierarchical, pyramid-like feature representation. Its efficacy lies in propagating semantically strong features from higher to lower levels through a top-down pathway and lateral connections. Despite the excellent performance demonstrated by FPN [3], two design flaws remain. First, shallow localization information is neglected: the shallow layers of a backbone network usually contain valuable and exploitable localization cues, yet this information is often overlooked or under-utilized when the pyramid is constructed. Second, contextual interaction is insufficient: applying a \(1\times1\) convolution directly before feature fusion fails to fully exploit context, leaving the fused information poorly interacted and under-utilized. Addressing these issues is pivotal to prevent decreases in model accuracy.

Figure 1: Common dataset and remote sensing dataset. In typical datasets, target objects are larger, while in remote sensing datasets, they are comparatively smaller.

Figure 2: The architecture of our detector. The outputs of the individual shallow position information extraction module (SPIEM) are adaptively aligned with the scales and channels of the feature maps in the backbone network. The extra pyramid layers are built with \(3\times3\) convolutions.

Recent advancements in models like AugFPN [5] and PANet [6] highlight the importance of enhanced feature fusion for accuracy. AugFPN ensures similar semantic information in feature maps after the lateral connection and minimizes information loss in high-level features. PANet uses precise localization cues to shorten the information pathway and directs key information to proposal subnetworks via adaptive feature pooling. Moving beyond manual fusion strategies, NAS-FPN [28] uses neural architecture search to optimize the architecture, improving results across various backbone models. However, these FPNs are not specifically designed for remote sensing scenes, which typically have a single viewpoint and high object density, as shown in Fig. 1. For remote sensing applications, the network’s ability to extract object location information and facilitate contextual interaction is crucial. Existing FPNs often overlook the extraction of low-level positional information and fine-grained context interaction, limiting their performance in remote sensing scenes.

Figure 3: The structure of the shallow position information extraction module (SPIEM) and the contextual interaction module (CIM). In CIM, GAP and GMP denote global average pooling and global max pooling, respectively.

This paper introduces a novel Location Refined Feature Pyramid Network (LR-FPN) to enhance shallow positional information extraction and facilitate fine-grained context interaction. LR-FPN refines location information within the feature pyramid, compensating for location and saliency information across layers, and boosting contextual interaction. This maximizes the use of robustly extracted location information, improving task performance. This refinement is achieved through two key modules: the Shallow Position Information Extraction Module (SPIEM) and the Context Interaction Module (CIM). SPIEM effectively adjusts position information, capturing positional and saliency details from low-level feature maps to maintain accurate target location information, facilitating compensation between layers. Complementarily, CIM infuses reliable location information into different FPN layers through spatial and channel interaction, enhancing the object area. This process bolsters contextual information interaction, seamlessly integrating location information into the FPN and improving layer interaction, thereby maximizing information utilization.

The main contributions of this paper can be summarized as follows:

  • This paper presents a plug-and-play location refined feature pyramid network (LR-FPN) to enhance the extraction of shallow positional information and facilitate fine-grained context interaction.

  • We present the shallow position information extraction module (SPIEM) and the context interaction module (CIM) to extract positional and saliency information from the low-level feature maps, thereby maximizing the enhancement and retention of location information. Furthermore, they facilitate interaction with other layers in both spatial and channel dimensions.

  • We conduct extensive experiments and achieve promising performance gains on two large-scale object datasets. Besides, the ablation studies also verify the effectiveness of the core mechanisms in the LR-FPN for object detection in remote sensing.

2 Related Work

2.1 Remote Sensing Object Detection

In recent years, object detection [7][9] has been gaining increasing popularity in the computer vision community. AAF-Faster RCNN [10] introduces an additive activation function, aiming to enhance the detection accuracy of small-scale targets. CF2PN [11] incorporates multi-level feature fusion techniques to address the inefficient detection of multi-scale objects in remote sensing object detection tasks. RICA [12] introduces an RPN with multi-angle anchors into Faster R-CNN to tackle remote sensing targets with arbitrary orientations. While these methods improve the accuracy of remote sensing object detection to some extent, they do not consider integrating location information with spatial and channel information to further boost performance.

2.2 Context Exploitation

The extraction of contextual information is widely utilized across various domains [13][16] within artificial intelligence. For example, DeepLab-v3 [17] utilizes atrous convolution, which magnifies the receptive field to acquire multi-scale context while decreasing information loss. HPGN [18] proposes a novel pyramid graph network targeting features, closely connected behind the backbone network, to explore multi-scale spatial structural features. PBSL [19] introduces a bidirectional perception module to explore the contextual relationships of prominent features between positive and negative samples. PSPNet [20] makes use of pyramid pooling to extract hierarchical global context. However, the bottom feature layer often contains valuable positional information that is frequently overlooked.

2.3 Multi-scale Feature Fusion

Multi-scale features can capture object representations at various scales in images, showing excellent performance across multiple tasks [21][27]. NAS-FPN [28] employs reinforcement learning to train a controller that identifies the optimal model architectures within a predefined search space. EMRN [29] proposes a multi-resolution feature dimension uniform module to unify the dimensions of features from images of varying resolutions. BiFPN [30] removes nodes that have only one input edge, adds an extra edge from the original input to the output node when they are at the same level, and treats each bidirectional path as one feature network layer, achieving better accuracy-efficiency trade-offs. However, these methods fail to consider the extraction of low-level location information and the nuanced interaction within the context.

3 Proposed Method

3.1 Overview

Based on the original structure of FPN [3], we denote the feature maps used to build the feature pyramid as {F\(_1\), F\(_2\), F\(_3\), F\(_4\)} and the outputs of the feature pyramid as {P\(_1\), P\(_2\), P\(_3\), P\(_4\), P\(_5\)}. The structure is shown in Fig. 2. When constructing the feature pyramid, we add extra layers by applying \(3\times3\) convolutions to propagate information from {P\(_3\)} to {P\(_4\), P\(_5\)}, which ultimately provides five scales of outputs for the subsequent detection heads. The overall computation of the network is as follows: \[P_{3} = f_{4}(F_{4}+F_{4}^{*}),\] where \(P\) denotes the outputs of our FPN, \(f_{i}(\cdot)\) denotes the redesigned lateral connection block introduced later, \(F\) denotes the inputs obtained from the backbone, and \(F_{i}^{*}\) denotes the output of the shallow position information extraction operation, also introduced later. \[\overline{P}_{i-1} = f_{i}(F_{i}+F_{i}^{*}),\quad i=2,3,\] where \(i\) represents the layer of the pyramid. \[P_{i-1}=\overline{P}_{i-1}+R(P_{i}),\] where \(R(\cdot)\) resizes the features so that they share the same spatial size. \[P_{k+1} = Conv_{3\times3}(P_{k}),\quad k=3,4,\] where \(Conv_{3\times3}\) denotes a convolution with a kernel size of 3. The two components of our FPN are described in the following subsections.
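
To make the data flow concrete, the following is a minimal PyTorch-style sketch of the pyramid assembly described by the formulas above. It is an illustrative sketch rather than our released implementation: the SPIEM and CIM lateral blocks are passed in as generic modules, and the class name LRFPNTopDown is hypothetical.

```python
import torch.nn as nn
import torch.nn.functional as F


class LRFPNTopDown(nn.Module):
    """Sketch of the LR-FPN pyramid assembly (P1-P5) from backbone features F1-F4.

    `spiem` maps F1 (plus the reference features) to [F2*, F3*, F4*], and
    `laterals` holds the three CIM lateral blocks f_2, f_3, f_4; both are
    injected so this sketch stays agnostic to their exact implementation.
    """

    def __init__(self, spiem: nn.Module, laterals: nn.ModuleList, out_channels: int = 256):
        super().__init__()
        self.spiem = spiem
        self.laterals = laterals
        # Extra levels P4 and P5 are produced by stride-2 3x3 convolutions.
        self.extras = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1) for _ in range(2)
        )

    def forward(self, f1, f2, f3, f4):
        # Shallow position information F_i* distilled from F1 for each level.
        f2s, f3s, f4s = self.spiem(f1, (f2, f3, f4))
        p3 = self.laterals[2](f4 + f4s)                      # P3 = f_4(F4 + F4*)
        p2 = self.laterals[1](f3 + f3s) + F.interpolate(p3, size=f3.shape[-2:])
        p1 = self.laterals[0](f2 + f2s) + F.interpolate(p2, size=f2.shape[-2:])
        p4 = self.extras[0](p3)                              # P4 = Conv3x3(P3)
        p5 = self.extras[1](p4)                              # P5 = Conv3x3(P4)
        return p1, p2, p3, p4, p5
```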

Table 1: SOTA Comparison on DOTAV1.0 [31]. The abbreviations of categories are defined as: Plane (PL), Baseball diamond (BD), Bridge (BR), Ground field track (GTF), Small vehicle (SV), Large vehicle (LV), Ship (SH), Tennis court (TC), Basketball court (BC), Storage tank (ST), Soccer-ball field (SBF), Roundabout (RA), Harbor (HA), Swimming pool (SP), and Helicopter (HC). We respectively bold and underline the optimal and suboptimal results for each category metric.
Method Backbone mAP PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC
IENet [32] 57.14 80.20 64.54 39.82 32.07 49.71 65.01 52.58 81.45 44.66 78.51 46.54 56.73 64.40 64.24 36.75
R-DFPN [33] 57.94 80.92 65.82 33.77 58.94 55.77 50.94 54.78 90.33 66.34 68.66 48.73 51.76 55.10 51.32 35.88
PIoU [34] DLA-34 60.53 80.91 69.70 24.11 60.22 38.34 64.43 64.84 90.98 77.22 70.45 46.59 37.12 57.11 61.97 64.02
Faster-RCNN [35] ResNet50 60.42 80.31 77.54 32.84 68.11 53.67 52.44 50.05 90.45 75.05 59.55 56.09 49.81 61.67 56.42 41.87
Light-Head R-CNN [36] 66.95 88.02 76.99 36.70 72.54 70.15 61.79 75.77 90.14 73.81 85.04 56.57 62.63 53.30 59.54 41.91
RADet [37] 67.66 79.66 77.36 47.64 67.61 65.06 74.35 68.82 90.05 74.72 75.67 45.60 61.84 64.88 68.00 53.67
H2RBox [38] ResNet50 67.82 88.51 73.52 40.83 56.91 77.53 65.45 77.93 90.91 83.25 85.33 55.31 62.92 52.45 63.67 43.38
ICN [39] 68.16 81.36 74.30 47.70 70.32 64.89 67.82 69.98 90.76 79.06 78.20 53.64 62.90 67.02 64.17 50.23
GSDet [40] 68.34 81.11 76.85 40.81 75.93 64.57 58.43 74.24 89.93 79.43 78.83 64.58 63.42 66.03 58.01 52.24
PVANet-FFN [41] PVANet 69.08 90.21 73.25 45.47 62.36 67.12 70.13 73.64 90.38 76.58 80.45 49.59 59.61 65.87 68.62 62.97
R\(^3\)Det [2] (Baseline) 67.54 89.41 73.10 40.54 57.48 76.48 72.32 78.99 90.88 76.71 83.49 52.89 60.77 61.29 59.45 39.40
LR-FPN (Ours) ResNet50 69.68 89.46 74.53 42.46 64.88 77.90 75.48 83.96 90.98 81.33 83.69 56.58 63.46 65.36 60.89 36.29

Table: SOTA comparison on HRSC2016 [42].

3.2 Shallow Position Information Extraction Module

FPN [3] leverages the feature hierarchy within a neural network to generate feature maps with varying resolutions. However, the higher-level features, which are oriented toward richer semantic information, lose the position information carried by the shallow layers, weakening the localization ability of the high layers. Moreover, in remote sensing detection tasks, similar objects are often densely arranged, which demands stronger localization capability.

This inspires us to propose location extraction fusion, namely the shallow position information extraction module (SPIEM). In this module, we supply the location information before the fusion. The computation of this module is as follows: \[\overline{F}_{i} = \overline{W}_i \odot AAP_i(F_{1}),\] \[AAP(F^k)=\frac{1}{h\times w} \sum_{i=1}^{h} \sum_{j=1}^{w}F_{i,j}^k,\] where \(\overline{W}_i\) denotes the adaptive weight, \(\odot\) denotes the Hadamard product, and \(AAP(\cdot)\) denotes the adaptive average pooling layer that adapts to the different scales of {F\(_2\), F\(_3\), F\(_4\)}. \[\widetilde{F}_{i} = \widetilde{W}_i \odot AMP_{i}(F_{1}),\] \[AMP(F^k)= \max_{i=1,2,...,h} \; \max_{j=1,2,...,w} (F_{i,j}^k),\] where \(\widetilde{W}_i\) denotes the adaptive weight and \(AMP(\cdot)\) denotes the adaptive max pooling layer. \[F_{i}^{*} = Conv_{1\times1}(\overline{F}_{i}+\widetilde{F}_{i}),\] where \(Conv_{1\times1}\) denotes a convolution with a kernel size of 1.
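
As a concrete illustration, below is a minimal PyTorch-style sketch of SPIEM following the formulas above. The per-channel form of the adaptive weights \(\overline{W}_i\) and \(\widetilde{W}_i\) and the `refs` argument (the backbone features whose scales are matched) are assumptions of this sketch, not details confirmed by the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPIEM(nn.Module):
    """Sketch of the shallow position information extraction module.

    F1 is pooled to each target level's spatial size with adaptive average
    pooling (position pooling, AAP) and adaptive max pooling (saliency
    pooling, AMP), weighted, summed, and projected by a 1x1 convolution so
    that channels match the target level.
    """

    def __init__(self, in_channels, out_channels_list):
        super().__init__()
        # Learnable per-channel weights standing in for W-bar and W-tilde (assumption).
        self.avg_w = nn.ParameterList(
            nn.Parameter(torch.ones(1, in_channels, 1, 1)) for _ in out_channels_list)
        self.max_w = nn.ParameterList(
            nn.Parameter(torch.ones(1, in_channels, 1, 1)) for _ in out_channels_list)
        # 1x1 projections aligning channels with F2, F3, F4.
        self.projs = nn.ModuleList(
            nn.Conv2d(in_channels, c, kernel_size=1) for c in out_channels_list)

    def forward(self, f1, refs):
        # refs: the backbone features F2, F3, F4 whose spatial scales we match.
        outputs = []
        for ref, aw, mw, proj in zip(refs, self.avg_w, self.max_w, self.projs):
            size = ref.shape[-2:]
            pos = aw * F.adaptive_avg_pool2d(f1, size)   # position pooling (AAP)
            sal = mw * F.adaptive_max_pool2d(f1, size)   # saliency pooling (AMP)
            outputs.append(proj(pos + sal))              # F_i^* = Conv1x1(pos + sal)
        return outputs
```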

Table 2: Ablation study on key modules of LR-FPN. The shallow position information extraction module (SPIEM) has two critical components: the saliency pooling (\(\boldsymbol{SP}\)) and the position pooling (\(\boldsymbol{PP}\)). The contextual interaction module (CIM) has two essential components: the spatial interaction (\(\boldsymbol{SI}\)) and the channel interaction (\(\boldsymbol{CI}\)).
No. Methods AP\(_{50}\) AP\(_{75}\) mAP
0 Baseline 88.2 65.4 55.9
1 + SPIEM 89.1 66.3 56.1
2 + CIM 89.7 66.1 56.2
3 + CIM + SPIEM (only \(\boldsymbol{SP}\)) 89.7 67.6 56.3
4 + CIM + SPIEM (only \(\boldsymbol{PP}\)) 89.9 67.0 56.3
5 + SPIEM + CIM (only \(\boldsymbol{SI}\)) 89.7 66.7 57.2
6 + SPIEM + CIM (only \(\boldsymbol{CI}\)) 89.8 67.0 57.2
7 LR-FPN 90.4 68.9 59.5

Initially, we construct the feature pyramid by utilizing the multi-scale features obtained from the backbone layers ({C\(_1\), C\(_2\), C\(_3\), C\(_4\)}) as inputs, and our FPN outputs the aggregated features {P\(_1\), P\(_2\), P\(_3\), P\(_4\), P\(_5\)}. To reduce the computational cost, we do not build a lateral connection on the feature {F\(_1\)} because of its large scale. Instead, we extract localization information from {F\(_1\)} through this module and append it to {F\(_2\), F\(_3\), F\(_4\)} before the lateral connection operation. Considering the salient information contained in {F\(_1\)}, we utilize saliency pooling to strengthen saliency. SPIEM thus optimizes the preservation of precise target location details while simultaneously extracting positional and saliency information.

3.3 Contextual Interaction Module

The original FPN does not efficiently interact spatial and channel information. To address this, we introduce an enhanced lateral connection module, the contextual interaction module (CIM), designed to tackle this issue more effectively. We process channel and spatial information separately, with the detailed framework illustrated in Fig. 3. We substitute the \(1 \times 1\) convolution in the original FPN with a combination of a depthwise convolution, a dilated depthwise convolution, and a channel interaction block in a parallel-addition form. Our central concept is to utilize the depthwise convolution to interact with local spatial information within each channel. Concurrently, we employ the dilated depthwise convolution to tackle non-local spatial interaction. This approach expands the receptive field and enhances the non-local interaction of spatial information, thereby addressing the limitations of the original FPN structure.

In the channel interaction branch, we prioritize the interplay of information between channels. To harness key information and further boost the overall performance of the lateral connections, we introduce a weighted processing method between channels. This method allows us to model the significance of different channels and adjust their contributions accordingly, thereby strengthening the lateral connections. To encapsulate the global information and key features of each channel, we employ global average and max pooling, respectively. Following this, we use a fully connected network, FC\(_1\), to jointly learn the shared weights of these two types of information. The results are then concatenated and passed through another fully connected network, FC\(_2\), and the corresponding activation function to derive the weight of each channel. These weights are subsequently applied to the original channels. The final output is obtained through a residual connection, preserving the original information while adding the weighted enhancement. The computation of this specially designed module is as follows:

\[A_{i}(x) = Wx+x,\quad i=2,3,4,\] where \(A_{i}\) denotes the channel interaction operation we design and \(W\) denotes the weight generated by the channel interaction block. \[\widetilde{f_{i}}(x) = Conv_{DW}(x) + Conv^{*}_{DW}(x) + A_{i}(x),\] where \(Conv_{DW}\) denotes the depthwise convolution and \(Conv^{*}_{DW}\) denotes the dilated depthwise convolution. \[f_{i}(x) = Conv_{1\times1}(\widetilde{f_{i}}(x)),\quad i=2,3,4,\] where \(Conv_{1\times1}\) denotes a convolution with a kernel size of 1.
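
The following PyTorch-style sketch summarizes CIM as described above. The \(3\times3\) depthwise kernels, the dilation rate of 3, the sigmoid activation, and the reduction ratio of 16 in the channel branch are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn


class ChannelInteraction(nn.Module):
    """Channel branch A(x) = W*x + x: per-channel weights from pooled statistics.

    GAP and GMP statistics share FC1, are concatenated, and FC2 with a sigmoid
    produces the channel weights; the hidden width and sigmoid are assumptions.
    """

    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc1 = nn.Linear(channels, hidden)       # shared by the GAP and GMP paths
        self.fc2 = nn.Linear(2 * hidden, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        gap = x.mean(dim=(2, 3))                      # global average pooling (GAP)
        gmp = x.amax(dim=(2, 3))                      # global max pooling (GMP)
        shared = torch.cat([torch.relu(self.fc1(gap)),
                            torch.relu(self.fc1(gmp))], dim=1)
        w = torch.sigmoid(self.fc2(shared)).view(b, c, 1, 1)
        return w * x + x                              # residual connection


class CIM(nn.Module):
    """Contextual interaction module: local (depthwise), non-local (dilated
    depthwise), and channel interaction in parallel, fused by a 1x1 convolution,
    replacing the plain 1x1 lateral convolution of FPN."""

    def __init__(self, in_channels, out_channels, dilation=3):
        super().__init__()
        self.dw = nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels)
        self.ddw = nn.Conv2d(in_channels, in_channels, 3, padding=dilation,
                             dilation=dilation, groups=in_channels)
        self.channel = ChannelInteraction(in_channels)
        self.fuse = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        y = self.dw(x) + self.ddw(x) + self.channel(x)   # parallel-addition form
        return self.fuse(y)
```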

In comparison to the original \(1\times1\) convolution in FPN [3], our CIM leverages spatial and channel interaction to effectively incorporate robust location information into the various layers of the original FPN, explicitly enhancing the representation of object areas.

4 Experiment and Analysis

4.1 Datasets

The DOTAV1.0 [31] dataset comprises 2,806 high-resolution aerial images captured by various sensors and platforms. Because DOTAV1.0 [31] includes angle annotations and relatively small objects, each image is divided into \(1024 \times 1024\) subimages with an overlap of 200 pixels between adjacent subimages, so that objects at subimage boundaries are not truncated.
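
For illustration, a small sketch of how such overlapping subimages can be generated is given below; the clamping of the last row and column is an assumption about border handling and may differ from the official DOTA splitting toolkit.

```python
def tile_coordinates(width, height, patch=1024, overlap=200):
    """Top-left corners for 1024x1024 crops with a 200-pixel overlap.

    The last row/column is clamped so crops stay inside the image; images
    smaller than the patch size would need padding, which is omitted here.
    """
    step = patch - overlap
    xs = list(range(0, max(width - patch, 0) + 1, step))
    ys = list(range(0, max(height - patch, 0) + 1, step))
    # Make sure the right/bottom borders are covered by a final crop.
    if xs[-1] + patch < width:
        xs.append(width - patch)
    if ys[-1] + patch < height:
        ys.append(height - patch)
    return [(x, y) for y in ys for x in xs]


# Example: a 4000x4000 aerial image yields a 5x5 grid of overlapping crops.
print(len(tile_coordinates(4000, 4000)))  # 25
```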

The HRSC2016 [42] dataset comprises images from two scenarios, encompassing ships at sea and ships close to the shore. The images are collected from six renowned harbors. The image dimensions range from \(300\times300\) to \(1,500\times900\). The dataset is divided into two sets: a training set with 436 images and a testing set with 444 images.

4.2 Implementation Details

All experiments are conducted using MMRotate as the underlying framework, implemented in PyTorch. To demonstrate the feasibility and efficiency of LR-FPN, we choose ResNet50 rather than ResNet101 as the backbone. Images are resized to \(1024\times1024\) for both DOTAV1.0 [31] and HRSC2016 [42]. We train the model on a single GPU (RTX 3090) with a batch size of 4. The initial learning rate is 0.0025, with a linear warm-up strategy [43] for the first 500 iterations. We train for 12 epochs on DOTAV1.0 [31] and 72 epochs on HRSC2016 [42]. Weight decay and momentum are 0.0001 and 0.9, respectively.
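
As a sketch, the linear warm-up schedule described above can be written as follows; the starting factor of 1/3 is the common MMDetection/MMRotate default and is an assumption here rather than a reported setting.

```python
def warmup_lr(iteration, base_lr=0.0025, warmup_iters=500, warmup_ratio=1.0 / 3):
    """Linearly ramp the learning rate from base_lr * warmup_ratio to base_lr
    over the first `warmup_iters` iterations (warmup_ratio is an assumption)."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    return base_lr * (warmup_ratio * (1 - alpha) + alpha)


# Example: learning rate at iterations 0, 250, and 500.
print([round(warmup_lr(i), 6) for i in (0, 250, 500)])
# [0.000833, 0.001667, 0.0025]
```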

Table 3: Ablation study on the interaction methods within the contextual interaction module (CIM). All experiments are conducted with the SPIEM added. \(\boldsymbol{CI}\) and \(\boldsymbol{SI}\) denote the channel and spatial interaction; \(\boldsymbol{NI}\) and \(\boldsymbol{LI}\) denote the non-local and local interaction within \(\boldsymbol{SI}\), respectively.
Methods AP\(_{50}\) AP\(_{75}\) mAP
Baseline + SPIEM 89.1 66.3 56.1
+ \(\boldsymbol{CI}\) + \(\boldsymbol{SI}\) (only \(\boldsymbol{NI}\)) 89.9 64.3 57.2
+ \(\boldsymbol{CI}\) + \(\boldsymbol{SI}\) (only \(\boldsymbol{LI}\)) 89.9 66.5 57.0
LR-FPN (Ours) 90.4 68.9 59.5

4.3 Comparison with State-of-the-art Methods

4.3.1 Comparisons on DOTAV1.0

From Table 1, it can be observed that LR-FPN achieves the best overall mAP. Compared to Faster-RCNN [35], which also employs ResNet50 as its backbone, the proposed LR-FPN maintains strong competitiveness. This suggests that enhancing the interaction in the information propagation process and facilitating the fusion of features from multiple scales lead to a better understanding of features. Our mAP also outperforms that of PVANet-FFN [41], which uses PVANet as its backbone, indicating that extracting positional and saliency information from low-level feature maps to maximize the enhancement and retention of location information can significantly improve performance.

4.3.2 Comparisons on HRSC2016 [42]

In the HRSC2016 comparison table, we present the performance of LR-FPN against other state-of-the-art methods on the HRSC2016 dataset [42]. Experimental results indicate that LR-FPN excels in terms of mAP, surpassing the baseline (R\(^3\)Det [2]) by 2.2%. The observed improvement of LR-FPN on the HRSC2016 [42] dataset aligns with its performance on the DOTAV1.0 [31] dataset, suggesting that the results generated by LR-FPN are stable across various scenarios in remote sensing datasets.

4.4 Ablation Studies and Analysis

In the previous subsection, we have shown the superiority of LR-FPN by comparing it to state-of-the-art methods. In what follows, we comprehensively analyze the intrinsic factors that lead to LR-FPN’s superiority on DOTAV1.0 and HRSC2016.

The role of LR-FPN. As shown in the first group of Table 2, using SPIEM alone brings a 0.9% improvement on both AP\(_{50}\) and AP\(_{75}\). The second group, leveraging CIM alone, shows gains of 1.5% and 0.7% on AP\(_{50}\) and AP\(_{75}\), respectively. The effectiveness of SPIEM and CIM is thus verified. This indicates that SPIEM excels in extracting precise positional and saliency information, maximizing the enhancement and preservation of location details, while CIM facilitates enriched contextual interaction across the spatial and channel dimensions.

Moreover, the joint effect of the two modules is illustrated in the 7\(^{th}\) group of Table 2. Introducing SPIEM and CIM simultaneously, AP\(_{50}\), AP\(_{75}\) and mAP increase by 2.2%, 3.5% and 3.6%, respectively. The results demonstrate that SPIEM optimizes the preservation of precise target location information, while CIM seamlessly and effectively integrates and processes this information, enhancing the comprehensive ability of our model.

Figure 4: The comparison with various FPNs.

Comparisons with different FPNs. To further validate the effectiveness of LR-FPN, we compare it with advanced FPN variants, namely NAS-FPN [28], HRFPN [44], and PAFPN [6]. Fig. 4 depicts the detection results achieved with these feature pyramid networks. On DOTAV1.0 [31], our LR-FPN outperforms the second-best PAFPN [6] by 0.6% mAP. Similarly, on HRSC2016 [42], our LR-FPN achieves a 1.6% increase in mAP over PAFPN [6], showcasing its superior performance across multiple benchmarks. Benefiting from the successful extraction of shallow positional information and the effective interaction of fine-grained context information, our network demonstrates exceptional applicability in remote sensing scenes, further consolidating its effectiveness within this specific domain.

The effectiveness of variant SPIEM. The 3\(^{rd}\) and 4\(^{th}\) groups in Table 2 show the results of the SPIEM variants. Adding only the SP or only the PP component on top of CIM yields marginal improvements in AP\(_{50}\). However, on AP\(_{75}\), adding SP increases the metric by 1.5% and adding PP increases it by 0.9%. This suggests that both SP and PP extract valuable localization information and salient features, maximizing the enhancement and retention of location information when more precise localization is required. Significantly, when SP and PP are combined, their effects become more prominent and lead to a more substantial improvement in performance.

Figure 5: The comparison with variants of CIM.

Figure 6: Visualizations of detection results on HRSC2016 [42]. Failures are marked by red boxes.

The impact of variant CIM. To delve deeper into CIM’s functionality and assess its performance, we design a series of experiments. The 5\(^{th}\) and 6\(^{th}\) groups in Table 2 demonstrate that our spatial and channel interaction blocks effectively promote information interaction. Moreover, Table 3 presents the outcomes obtained from employing various interaction methods within CIM. Both the local and non-local interaction blocks improve the metrics, and their combined effect is more pronounced. This also indicates that our CIM not only facilitates interaction among local information but also establishes long-range dependencies. Additionally, we compare our interaction method with advanced attention methods, including CA [45], SE [46], and CBAM [47]; the results are illustrated in Fig. 5. Regarding mAP, CIM outperforms the state-of-the-art CBAM [47] by 2.4%. This proves that our contextual interaction method effectively facilitates fine-grained context interaction while accounting for long-range dependencies, enhancing the coverage of object areas.

4.5 Visualization

In Fig. 6, we present three sets of results on the HRSC2016 [42] dataset: the outputs of R\(^3\)Det (baseline) [2], the outputs of LR-FPN (ours), and the ground truth. Compared with the first row (R\(^3\)Det), our model orients the ships more precisely. In addition, our model outperforms the baseline in terms of both false positives and false negatives. This shows that our SPIEM extracts effective location information, while CIM facilitates contextual interaction, thus enhancing the overall capability of our model. Thanks to these modules, the fusion and learning of multi-scale feature information perform well, which particularly helps in detecting challenging multi-scale objects in the remote sensing field.

5 Conclusion

This study addressed the shortcomings of existing feature pyramid networks in remote sensing target detection, specifically their disregard for low-level positional information and fine-grained context interaction. We introduced a novel location refined feature pyramid network (LR-FPN) that enhances shallow positional information extraction and promotes fine-grained context interaction. The LR-FPN, equipped with a shallow position information extraction module (SPIEM) and a contextual interaction module (CIM), effectively harnesses robust location information. We also implemented a local and non-local interaction strategy for superior saliency information retention. The LR-FPN can be incorporated into standard object detection frameworks, significantly boosting performance. Despite LR-FPN’s top-tier results on two prevalent remote sensing datasets using a CNN-based architecture, its performance within a transformer-based framework remains unverified. In the future, we will further investigate the potential of implementing our approach within a transformer-based architecture.

References

[1]
R. Guan, Z. Li, W. Tu, J. Wang, Y. Liu, X. Li, C. Tang, and R. Feng, “Contrastive multi-view subspace clustering of hyperspectral images based on graph convolutional networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024.
[2]
X. Yang, J. Yan, Z. Feng, and T. He, “R3det: Refined single-stage detector with feature refinement for rotating object,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3163–3171.
[3]
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[4]
J. Hu, Z. Huang, F. Shen, D. He, and Q. Xian, “A bag of tricks for fine-grained roof extraction,” in IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2023.
[5]
C. Guo, B. Fan, Q. Zhang, S. Xiang, and C. Pan, “Augfpn: Improving multi-scale feature learning for object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 595–12 604.
[6]
S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[7]
W. Weng, W. Ling, F. Lin, J. Ren, and F. Shen, “A novel cross frequency-domain interaction learning for aerial oriented object detection,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2023.
[8]
W. Weng, M. Wei, J. Ren, and F. Shen, “Enhancing aerial object detection with selective frequency interaction network,” IEEE Transactions on Artificial Intelligence, vol. 1, no. 01, pp. 1–12, 2024.
[9]
C. Qiao, F. Shen, X. Wang, R. Wang, F. Cao, S. Zhao, and C. Li, “A novel multi-frequency coordinated module for sar ship detection,” in 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2022, pp. 804–811.
[10]
S. Shivappriya, M. Priyadarsini, A. Stateczny, C. Puttamadappa, and B. Parameshachari, “Cascade object detection and remote sensing object detection method based on trainable activation function,” Remote Sensing, vol. 13, no. 2, p. 200, 2021.
[11]
W. Huang, G. Li, Q. Chen, M. Ju, and J. Qu, “Cf2pn: A cross-scale feature fusion pyramid network based remote sensing target detection,” Remote Sensing, vol. 13, no. 5, p. 847, 2021.
[12]
K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2337–2348, 2017.
[13]
J. Liu, F. Shen, M. Wei, Y. Zhang, H. Zeng, J. Zhu, and C. Cai, “A large-scale benchmark for vehicle logo recognition,” in 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC). IEEE, 2019, pp. 479–483.
[14]
S. Lai, X. Hu, Y. Li, Z. Ren, Z. Liu, and D. Miao, “Shared and private information learning in multimodal sentiment analysis with deep modal alignment and self-supervised multi-task learning,” arXiv preprint arXiv:2305.08473, 2023.
[15]
M. Li, M. Wei, X. He, and F. Shen, “Enhancing part features via contrastive attention module for vehicle re-identification,” in 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 1816–1820.
[16]
S. Lai, L. Hu, J. Wang, L. Berti-Equille, and D. Wang, “Faithful vision-language interpretation via concept bottleneck models,” in The Twelfth International Conference on Learning Representations, 2023.
[17]
L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[18]
F. Shen, J. Zhu, X. Zhu, Y. Xie, and J. Huang, “Exploring spatial significance via hybrid pyramidal graph network for vehicle re-identification,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 8793–8804, 2021.
[19]
F. Shen, X. Shu, X. Du, and J. Tang, “Pedestrian-specific bipartite-aware similarity learning for text-based person retrieval,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023.
[20]
H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
[21]
S. Lai, X. Hu, H. Xu, Z. Ren, and Z. Liu, “Multimodal sentiment analysis: A survey,” Displays, p. 102563, 2023.
[22]
F. Shen, Y. Xie, J. Zhu, X. Zhu, and H. Zeng, “Git: Graph interactive transformer for vehicle re-identification,” IEEE Transactions on Image Processing, 2023.
[23]
H. Xu, S. Lai, X. Li, and Y. Yang, “Cross-domain car detection model with integrated convolutional block attention mechanism,” Image and Vision Computing, vol. 140, p. 104834, 2023.
[24]
R. Guan, Z. Li, T. Li, X. Li, J. Yang, and W. Chen, “Classification of heterogeneous mining areas based on rescapsnet and gaofen-5 imagery,” Remote Sensing, vol. 14, no. 13, p. 3216, 2022.
[25]
R. Guan, Z. Li, X. Li, and C. Tang, “Pixel-superpixel contrastive learning and pseudo-label correction for hyperspectral image clustering,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 6795–6799.
[26]
J. Liu, R. Guan, Z. Li, J. Zhang, Y. Hu, and X. Wang, “Adaptive multi-feature fusion graph convolutional network for hyperspectral image classification,” Remote Sensing, vol. 15, no. 23, p. 5483, 2023.
[27]
W. Tu, R. Guan, S. Zhou, C. Ma, X. Peng, Z. Cai, Z. Liu, J. Cheng, and X. Liu, “Attribute-missing graph clustering network,” in Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2024.
[28]
G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Nas-fpn: Learning scalable feature pyramid architecture for object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 7036–7045.
[29]
F. Shen, J. Zhu, X. Zhu, J. Huang, H. Zeng, Z. Lei, and C. Cai, “An efficient multiresolution network for vehicle reidentification,” IEEE Internet of Things Journal, vol. 9, no. 11, pp. 9049–9059, 2021.
[30]
M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 781–10 790.
[31]
G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983.
[32]
Y. Lin, P. Feng, J. Guan, W. Wang, and J. Chambers, “Ienet: Interacting embranchment one stage anchor free detector for orientation aerial object detection,” arXiv preprint arXiv:1912.00969, 2019.
[33]
X. Yang, H. Sun, K. Fu, J. Yang, X. Sun, M. Yan, and Z. Guo, “Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks,” Remote sensing, vol. 10, no. 1, p. 132, 2018.
[34]
Z. Chen, K. Chen, W. Lin, J. See, H. Yu, Y. Ke, and C. Yang, “Piou loss: Towards accurate oriented object detection in complex environments,” in European Conference on Computer Vision (ECCV), 2020, pp. 195–211.
[35]
S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[36]
Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, “Light-head r-cnn: In defense of two-stage object detector,” arXiv preprint arXiv:1711.07264, 2017.
[37]
Y. Li, Q. Huang, X. Pei, L. Jiao, and R. Shang, “Radet: Refine feature pyramid network and multi-layer attention network for arbitrary-oriented object detection of remote sensing images,” Remote Sensing, vol. 12, no. 3, p. 389, 2020.
[38]
X. Yang, G. Zhang, W. Li, X. Wang, Y. Zhou, and J. Yan, “H2rbox: Horizontal box annotation is all you need for oriented object detection,” arXiv preprint arXiv:2210.06742, 2022.
[39]
S. M. Azimi, E. Vig, R. Bahmanyar, M. Körner, and P. Reinartz, “Towards multi-class object detection in unconstrained remote sensing imagery,” in Asian Conference on Computer Vision (ACCV), 2018, pp. 150–165.
[40]
W. Li, W. Wei, and L. Zhang, “Gsdet: Object detection in aerial images based on scale reasoning,” IEEE Transactions on Image Processing, vol. 30, pp. 4599–4609, 2021.
[41]
H. Chi, X. Zhang, and X. Gao, “Multi-target detection for aerial images based on fully convolutional networks,” in 2019 Chinese Control Conference (CCC). IEEE, 2019, pp. 8801–8806.
[42]
Z. Liu, L. Yuan, L. Weng, and Y. Yang, “A high resolution optical satellite image dataset for ship recognition and some new baselines,” in International Conference on Pattern Recognition Applications and Methods, vol. 2. SciTePress, 2017, pp. 324–331.
[43]
F. Shen, X. Du, L. Zhang, and J. Tang, “Triplet contrastive learning for unsupervised vehicle re-identification,” arXiv preprint arXiv:2301.09498, 2023.
[44]
C. Yu, B. Xiao, C. Gao, L. Yuan, L. Zhang, N. Sang, and J. Wang, “Lite-hrnet: A lightweight high-resolution network,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10 440–10 450.
[45]
Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for efficient mobile network design,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13 713–13 722.
[46]
J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[47]
S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

* Corresponding author