Supplementary material for
“On Train-Test Class Overlap and Detection for Image Retrieval”
April 01, 2024
How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean [1], the most popular training set, by identifying and removing class overlap with Revisited Oxford and Paris [2], the most popular evaluation set. By comparing the original and the new \(\mathcal{R}\)GLDv2-clean on a benchmark of reproduced state-of-the-art methods, our findings are striking. Not only is there a dramatic drop in performance, but it is inconsistent across methods, changing the ranking.
What does it take to focus on objects of interest and ignore background clutter when indexing? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect objects of interest and extract a global image representation. We outperform the previous state of the art on both existing training sets and the new \(\mathcal{R}\)GLDv2-clean. Our dataset is available at https://github.com/dealicious-inc/RGLDv2-clean.
Instance-level image retrieval is a significant computer vision problem, attracting substantial investigation before and after deep learning. High-quality datasets are crucial for advancing research. Image retrieval has benefited from the availability of landmark datasets [1], [3]–[6]. Apart from depicting particular landmarks, an important property of training sets [4], [5] is that they do not contain landmarks overlapping with the evaluation sets [2], [7], [8]. Google Landmarks [1] has gained widespread adoption in state-of-the-art benchmarks, but falls short on this property [9].
At the same time, a fundamental challenge in image retrieval is to find a particular object among other objects or background clutter. In this direction, it is common to use attention [10]–[12], but it is more effective to use object detection [13], [14] in order to represent only objects of interest for retrieval. These detect-to-retrieve (D2R) [15] methods, however, necessitate complex two-stage training and indexing pipelines, as shown in Figure 1(a), often requiring a separate training set with location supervision.
Motivated by the above challenges, we investigate two directions in this work. First, in the direction of data, we revisit the GLDv2-clean dataset [1]. We analyze and remove overlaps of landmark categories with evaluation sets [2], introducing a new version, \(\mathcal{R}\)GLDv2-clean. We then reproduce and benchmark state-of-the-art methods on the new dataset and compare with the original. Remarkably, we find that, although the images removed are only a tiny fraction, there is a dramatic drop in performance.
Second, in the direction of the method, we introduce CiDeR, a simple attention-based approach to detect objects of interest at different levels and obtain a global image representation that effectively ignores background clutter. Importantly, as shown in Figure 1(b), this is a streamlined end-to-end approach that only needs single-stage training and single-stage indexing and is free of any location supervision.
In summary, we make the following contributions:
We introduce \(\mathcal{R}\)GLDv2-clean, a new version of an established dataset for image retrieval.
We show that it is critical to have no class overlap between training and evaluation sets.
We introduce CiDeR, an end-to-end, single-stage D2R method requiring no location supervision.
By using existing components developed outside image retrieval, we outperform more complex, specialized state-of-the-art retrieval models on several datasets.
Research on image retrieval can be categorized according to the descriptors used. Local descriptors [6], [16], [17] have been applied before deep learning, using SIFT [18] for example. Given that multiple descriptors are generated per image, aggregation methods [7], [19], [20] have been developed. Deep learning extensions include methods such as DELF [6], DELG [21], and extensions of ASMK [15], [22]. DELF is similar to our work in that it uses spatial attention without location supervision, but differs in that it uses it for local descriptors.
Global descriptors [1], [3], [12], [23], [24] are useful as they only generate a single feature per image, simplifying the retrieval process. Research has focused on spatial pooling [4], [5], [10], [25]–[27] to extract descriptors from 3D convolutional activations. Local descriptors can still be used in a second re-ranking stage after filtering by global descriptors, but this is computationally expensive.
It is beneficial for image retrieval to detect objects of interest in database images and ignore background clutter [28]–[34]. Following Teichmann et al. [15], we call these methods detect-to-retrieve (D2R). In most existing studies, either training or indexing is a two-stage process, for example, learning to detect and learning to retrieve; also, most rely on location supervision in learning to detect.
For example, DIR [4] performs 1-stage indexing but 2-stage training, for a region proposal network (RPN) and for retrieval. Its location supervision does not involve humans but rather originates in automatically analyzing the dataset, hence technically training is 3-stage. Salvador et al. [29] perform 1-stage end-to-end training, but use human location supervision, in fact from the evaluation set. R-ASMK [15] involves 2-stage training and 2-stage indexing. It also uses large-scale human location supervision from an independent set.
Table 1 shows previous studies organized according to their properties. We can see that, unlike previous studies, we propose a novel method that supports 1-stage training, indexing and inference, as well as allowing end-to-end D2R learning without location supervision. Compared with the previous studies, ours is thus more efficient.
Method | LD | GD | D2R | E2E | Self | Land
---|---|---|---|---|---|---
DELF [6] | ✔ | ✔ | |||||
DELG [21] | ✔ | ✔ | ✔ | ||||
Tolias et al. [22] | ✔ | ✔ | |||||
DIR [4] | ✔ | ✔ | |||||
AGeM [35] | ✔ | ✔ | |||||
SOLAR [11] | ✔ | ✔ | |||||
GLAM [12] | ✔ | ✔ | |||||
Kucer et al. [31] | ✔ | ✔ | |||||
PS-Net [32] | ✔ | ✔ | |||||
Peng et al. [36] | ✔ | ✔ | |||||
Zhang et al. [37] | ✔ | ✔ | ✔ | ||||
Liao et al. [38] | ✔ | ✔ | ✔ | ||||
R-ASMK [15] | ✔ | ✔ | ✔ | ||||
Salvador et al. [29] | ✔ | ✔ | ✔ | ✔ | |||
CiDeR (Ours) | ✔ | ✔ | ✔ | ✔ | ✔ |
A key weakness of current landmark retrieval datasets is their fragmented origins: training and evaluation sets are often independently collected and released by different studies. Initial datasets contained tens of thousands of images, a number that has now grown into the millions.
Evaluation sets such as Oxford5k (Ox5k) [7] and Paris6k (Par6k) [8], as well as their more recent versions, Revisited Oxford (\(\mathcal{R}\)Oxford or \(\mathcal{R}\)Oxf) and Paris (\(\mathcal{R}\)Paris or \(\mathcal{R}\)Par) [2], are commonly used for benchmarking. Concurrently, training sets such as Neural Codes (NC) [3], Neural Codes clean (NC-clean) [4], SfM-120k [5], Google Landmarks v1 (GLDv1) [6], and Google Landmarks v2 (GLDv2 and GLDv2-clean) [1] have been sequentially introduced and are widely used for representation learning.
These training sets are typically curated according to two criteria: first, to depict particular landmarks, and second, to not contain landmarks that overlap with those in the evaluation sets. They are originally collected by text-based web search using particular landmark names as queries. This often results in noisy images in addition to images depicting the landmarks. Thus, NC, GLDv1 and GLDv2 are noisy datasets. To solve this problem, images are filtered in different ways [4], [39] to ensure that they contain only the same landmark (instance). Accordingly, NC-clean, SfM-120k, and GLDv2-clean are clean datasets.
The clean datasets are also typically filtered to remove overlap with the evaluation sets. However, while NC-clean and SfM-120k adhere to both criteria, GLDv2-clean falls short of the second criterion. This discrepancy is not a limitation of GLDv2-clean per se, because the dataset comes with its own split of training, index and query images. However, the community is still using the \(\mathcal{R}\)Oxford and \(\mathcal{R}\)Paris evaluation sets, whose landmarks have not been removed from GLDv2-clean. Moreover, landmarks still overlap between the GLDv2-clean training and index sets.
This discrepancy is particularly concerning because GLDv2-clean is the most common training set in state-of-the-art studies. It has been acknowledged in previous work [9] and in broader community discussions. The effect is that results of training on GLDv2-clean are not directly comparable with those of training on NC-clean or SfM-120k. Results on GLDv2-clean may show artificially inflated performance: this is often attributed to its larger scale but may in fact be due to overlap. Our study aims to address this problem by introducing a new version of GLDv2-clean.
First, it is necessary to confirm whether common landmark categories exist between the training and evaluation sets. We extract image features from the training sets GLDv2-clean, NC-clean, and SfM-120k, as well as the evaluation sets \(\mathcal{R}\)Oxf and \(\mathcal{R}\)Par. The features of the training sets are then indexed, and the features of the evaluation sets \(\mathcal{R}\)Oxf and \(\mathcal{R}\)Par are used as queries to search the training sets.
Figure 2 displays the results. Interestingly, none of the retrieved images from the NC-clean and SfM-120k training sets depict the same landmark as the query image from the evaluation set. By contrast, the top-5 most similar images from GLDv2-clean all depict the same landmark as the query. This suggests that using GLDv2-clean for training could lead to artificially inflated performance during evaluation, compared with NC-clean and SfM-120k. A fair comparison between training sets requires that none of them overlaps with the evaluation set.
Now, focusing on the GLDv2-clean training set, we verify the overlapping landmarks. Each image in this set belongs to a landmark category; each category is identified by a GID and has a landmark name. We begin by visual matching. In particular, we retrieve images for each query image from the evaluation set as above and filter the top-\(k\) ranked images by two verification steps.
First, we automatically verify that the same landmark is depicted by using robust spatial matching on correspondences obtained from local features and descriptors. Second, since automatic verification may fail, three human evaluators visually inspect all matches obtained in the first step. We only keep matches that are confirmed by at least one human evaluator. For every query from the evaluation set, we collect all confirmed visual matches from GLDv2-clean and remove the entire landmark category of the GID that appears most frequently in this image collection.
Independently, we collect all GIDs whose landmark name contains “Oxford” or “Paris” and also mark them as candidates for removal. The entire landmark category of a GID is removed if at least one human evaluator confirms that it is in one of the evaluation sets. This is the case for “Hotel des Invalides Paris”. Figure 3 illustrates the complete ranking and verification process.
Eval | #Eval Img | #dupl Eval | #dupl GLDv2 GID | #dupl GLDv2 Img
---|---|---|---|---|
\(\mathcal{R}\)Par | 70 | 36 (51%) | 11 | 1,227 |
\(\mathcal{R}\)Oxf | 70 | 38 (54%) | 6 | 315 |
Text | – | – | 1 | 23
total | 140 | 74 | 18 | 1,565 |
Training Set | #Images | #Categories |
---|---|---|
NC-clean | 27,965 | 581 |
SfM-120k | 117,369 | 713 |
GLDv2-clean | 1,580,470 | 81,313 |
\(\mathcal{R}\)GLDv2-clean (ours) | 1,578,905 | 81,295 |
By removing a number of landmark categories from GLDv2-clean as specified above, we derive a revisited version of the dataset, which we call \(\mathcal{R}\)GLDv2-clean. As shown in Table 2, \(\mathcal{R}\)Par and \(\mathcal{R}\)Oxf have landmark overlap with GLDv2-clean for 36 and 38 out of 70 queries, corresponding to 51% and 54%, respectively. This is a very large percentage, representing more than half of the queries in both evaluation sets. In the new dataset, we remove 1,565 images from 18 GIDs of GLDv2-clean.
Table 3 compares statistics between existing clean datasets and the new \(\mathcal{R}\)GLDv2-clean. We observe that only a very small proportion of images and landmark categories is removed from GLDv2-clean to derive \(\mathcal{R}\)GLDv2-clean. Yet, it remains to determine the effect on retrieval performance when evaluating on \(\mathcal{R}\)Oxf and \(\mathcal{R}\)Par. For fair comparisons, we exclude from our experiments previous results obtained by training on GLDv2-clean; we limit comparisons to NC-clean, SfM-120k and the new \(\mathcal{R}\)GLDv2-clean.
From the perspective of instance-level image retrieval, the key challenge is that target objects or instances are situated in different contexts within the image. One common solution is to use object localization or detection, isolating the objects of interest from the background. The detected objects are then used to extract an image representation for retrieval, as shown in Figure 1(a). This two-stage process can be applied to the indexed set, the queries, or both.
This approach comes with certain limitations. First, in addition to the training set for representation learning, a specialized training set is also required that is annotated with location information for the objects of interest [13], [14]. Second, the two stages are often trained separately rather than end-to-end. Third, this approach incurs higher computational cost at indexing and search because it requires two forward passes through the network for each image.
In this work, we attempt to address these limitations. We replace the localization step with a spatial attention mechanism, which does not require location supervision. This allows us to solve for both localization and representation learning through a single, end-to-end learning process on a single network, as illustrated in Figure 1(b). This has the advantage of eliminating the need for a specialized training set for localization and the separate training cycles.
This component, depicted in Figure 1(b) and elaborated in Figure 4, is designed for instance detection and subsequent image representation based on the detected objects. It employs a spatial attention mechanism [6], [10], [40], which does not need location supervision. Given a feature tensor \(\mathbf{F}\in \mathbb{R}^{w \times h \times d}\), where \(w \times h\) is the spatial resolution and \(d\) the feature dimension, we obtain the spatial attention map \[A = \eta(\zeta(f^\ell(\mathbf{F}))) \in \mathbb{R}^{w \times h}. \label{eq:attn}\tag{1}\] Here, \(f^\ell\) is a simple mapping, for example a \(1 \times 1\) convolutional layer that reduces dimension to 1, \(\zeta(x) \mathrel{:=}\ln(1+e^x)\) for \(x \in \mathbb{R}\) is the softplus function and \[\eta(X) \mathrel{:=}\frac{X - \min X}{\max X - \min X} \in \mathbb{R}^{w \times h} \label{eq:minmax}\tag{2}\] linearly normalizes \(X \in \mathbb{R}^{w \times h}\) to the interval \([0,1]\). To identify object regions, we then apply a sequence of thresholding operations, obtaining a corresponding sequence of masks \[M_i(\mathbf{p}) = \left\{ \begin{array}{ll} \beta, & \textrm{if}\;\;A(\mathbf{p}) < \tau_i \\ 1, & \textrm{otherwise} \end{array} \right. \label{eq:mask}\tag{3}\] for \(i \in \{1,\dots,T\}\). Here, \(T\) is the number of masks, \(\mathbf{p}\in \{1,\dots,w\} \times \{1,\dots,h\}\) is the spatial position, \(\tau_i \in [0,1]\) is the \(i\)-th threshold, \(\beta\) is a scalar corresponding to background and \(1\) corresponds to foreground.
Unlike a conventional fixed value like \(\beta = 0\), we use a dynamic, randomized approach. In particular, for each \(\mathbf{p}\), we draw a sample \(\epsilon\) from a normal distribution and clip it to \([0,1]\) by defining \(\beta = \min(1,\max(0,\epsilon))\). The motivation is that randomness compensates for incorrect predictions of the attention map (1), especially at an early stage of training. This choice is ablated in Table 6.
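To make (1)–(3) concrete, below is a minimal PyTorch sketch of the attentional localization masks; it is not the exact implementation. The module name, the default thresholds and the reading of \(\mathcal{N}(0.1, 0.9)\) as mean 0.1 and standard deviation 0.9 are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMasks(nn.Module):
    """Sketch of the spatial attention map (1)-(2) and thresholded masks (3)."""

    def __init__(self, dim, thresholds=(0.3, 0.5)):
        super().__init__()
        self.conv = nn.Conv2d(dim, 1, kernel_size=1)  # f^l: 1x1 conv to one channel
        self.thresholds = thresholds                  # tau_1, ..., tau_T

    def forward(self, feat):                          # feat: (B, d, h, w)
        a = F.softplus(self.conv(feat))               # zeta(f^l(F))
        a_min = a.amin(dim=(2, 3), keepdim=True)
        a_max = a.amax(dim=(2, 3), keepdim=True)
        attn = (a - a_min) / (a_max - a_min + 1e-6)   # eta: min-max normalization to [0, 1]

        masks = []
        for tau in self.thresholds:
            # randomized background beta: clipped sample of N(0.1, 0.9^2), drawn per position
            beta = (torch.randn_like(attn) * 0.9 + 0.1).clamp(0.0, 1.0)
            masks.append(torch.where(attn < tau, beta, torch.ones_like(attn)))
        return attn, masks                            # attention map and T masks, each (B, 1, h, w)
```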
Figure 5 shows examples of attentional localization. Comparing (a) with (b) shows that the spatial attention map generated by our model is much more attentive to the object being searched than that of the pretrained network. These results show that the background is removed relatively well, despite not using any location supervision at training.
The sequence of masks \(M_1, \dots, M_T\) (3) is applied independently to the feature tensor \(\mathbf{F}\) and the resulting tensors are fused into a single tensor \[\mathbf{F}^\ell = \mathtt{H}(M_1 \odot \mathbf{F}, \dots, M_T \odot \mathbf{F}) \in \mathbb{R}^{w \times h \times d}, \label{eq:alm}\tag{4}\] where \(\odot\) denotes the Hadamard product over the spatial dimensions, with broadcasting over the feature dimension. Fusion amounts to a learnable convex combination \[\mathtt{H}(\mathbf{F}_1, \dots, \mathbf{F}_T) = \frac{w_1 \mathbf{F}_1 + \cdots + w_T \mathbf{F}_T}{w_1 + \cdots + w_T}, \label{eq:fuse}\tag{5}\] where, for \(i \in \{1,\dots,T\}\), the \(i\)-th weight is defined as \(w_i = \zeta(\alpha_i)\) and \(\alpha_i\) is a learnable parameter. Thus, the importance of each threshold in localizing objects from the spatial attention map is implicitly learned from data, without supervision. Table 7 ablates the effect of the number \(T\) of thresholds on fusion efficacy.
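Continuing the sketch above, the fusion (4)–(5) can be written as a learnable convex combination of the masked feature tensors; the module name and the zero initialization of \(\alpha_i\) are our choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskFusion(nn.Module):
    """Sketch of (4)-(5): mask the feature tensor and fuse with learnable weights."""

    def __init__(self, num_masks=2):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_masks))  # alpha_i, learnable

    def forward(self, feat, masks):                        # feat: (B, d, h, w)
        w = F.softplus(self.alpha)                         # w_i = zeta(alpha_i) > 0
        fused = sum(w_i * (m * feat) for w_i, m in zip(w, masks))  # masks broadcast over d
        return fused / w.sum()                             # convex combination, (B, d, h, w)
```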
\centering
\def\arraystretch{1.1}{
\centering
\scriptsize
\setlength\extrarowheight{-1pt}
\setlength{\tabcolsep}{3pt}
{
\begin{tabular}
{lc|cccccccccc|cc}
\toprule
\mr{3}{\Th{Method}} &
\mr{3}{\Th{Train Set}} &
\multicolumn{2}{c}{\Th{Base}} &
\multicolumn{4}{c}{\Th{Medium}} &
\multicolumn{4}{c|}{\Th{Hard}} &
\mr{3}{\Th{Mean}} &
\mr{3}{\Th{Diff}} \\ \cmidrule{3-12}
& & {\fontsize{7}{6}\selectfont \oxf5k} & {\fontsize{7}{6}\selectfont \paris6k} &
\multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rox}} &
\multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rpa}} &
\multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rox}} &
\multicolumn{2}{c|}{{\fontsize{7}{6}\selectfont \rpa}} \\
& & \tiny mAP & \tiny mAP & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 \\
\midrule
Yokoo~\etal~\cite{SCH01} & GLDv2-clean & 91.9 & 94.5 & 72.8 & 86.7 & 84.2 & 95.9 & 49.9 & 62.1 & 69.7 & 88.4 & 79.5 & \gain{-5.4} \\ \rowcolor{LightCyan}
Yokoo~\etal~\cite{Yokoo01}$^\dagger$ & $\cR$GLDv2-clean & 86.1 & 93.9 & 64.5 & 81.0 & 84.1 & 95.4 & 35.6 & 51.5 & 68.7 & 86.4 & {74.1} & \\ \midrule
SOLAR~\cite{tokens} & GLDv2-clean & -- & -- & 79.7 & -- & 88.6 & -- & 60.0 & -- & 75.3 & -- & 75.9 & \gain{-8} \\ \rowcolor{LightCyan}
SOLAR~\cite{Ng01}$^\dagger$ & $\cR$GLDv2-clean & 90.6 & 94.4 & 70.8 & 84.6 & 84.1 & 95.4 & 48.0 & 62.3 & 68.7 & 86.4 & 67.9 & \\ \midrule
GLAM~\cite{SCH01} & GLDv2-clean & 94.2 & 95.6 & 78.6 & 88.2 & 88.5 & 97.0 & 60.2 & 72.9 & 76.8 & 93.4 & 83.4 & \gain{-4.1} \\ \rowcolor{LightCyan}
GLAM~\cite{SCH01}$^\ddagger$ & $\cR$GLDv2-clean & 90.9 & 94.1 & 72.2 & 84.7 & 83.0 & 95.0 & 49.6 & 61.6 & 65.6 & 87.6 & 79.3 & \\ \midrule
DOLG~\cite{dtop} & GLDv2-clean & -- & -- & 78.8 & -- & 87.8 & -- & 58.0 & -- & 74.1 & -- & 74.7 & \gain{-7.4} \\ \rowcolor{LightCyan}
DOLG~\cite{yang2021dolg}$^\dagger$ & $\cR$GLDv2-clean & 88.3 & 93.9 & 70.8 & 85.3 & 83.2 & 95.4 & 47.4 & 60.0 & 67.9 & 87.4 & 67.3 & \\ \midrule
Token~\cite{tokens} & GLDv2-clean & -- & -- & 82.3 & -- & 75.6 & -- & 66.6 & -- & 78.6 & -- & 75.8 & \gain{-18.2} \\ \rowcolor{LightCyan}
Token~\cite{tokens}$^\dagger$ & $\cR$GLDv2-clean & 84.3 & 90.0 & 61.4 & 76.4 & 75.8 & 94.0 & 36.9 & 55.2 & 54.4 & 81.0 & 57.6 & \\
\bottomrule
\end{tabular}
}
}
\vspace{-6pt}
\caption{Comparison of the original GLDv2-clean training set with our revisited version $\cR$GLDv2-clean for a number of SOTA methods that we reproduce with ResNet101 backbone, ArcFace loss and same sampling, settings and hyperparameters. $\dagger/\ddagger$: official/our code.
}
\label{tab:sota295diff}
Most instance-level image retrieval studies propose some kind of head on top of the backbone network that performs a particular operation to enhance retrieval performance. The same happens independently in studies of category-level tasks like localization, even though the operations may be similar. Comparison is often challenging when official code is not released. Our focus in this work is on detection for retrieval, but we still need to compare with SOTA methods, which may perform different operations. We thus follow a neutral approach whereby we reuse existing, well-established components from the literature, introduced either for instance-level or category-level tasks.
In particular, given an input image \(x \in \mathcal{X}\), where \(\mathcal{X}\) is the image space, we obtain an embedding \(\mathbf{u}= f(x) \in \mathbb{R}^d\), where \(d\) is the embedding dimension and \[f = f^p \circ f^\ell \circ f^c \circ f^e \circ f^b \label{eq:map}\tag{6}\] is the composition of a number of functions (a minimal sketch of this composition is given after the list below). Here,
\(f^b: \mathcal{X}\to \mathbb{R}^{w \times h \times d}\) is the backbone network;
\(f^e: \mathbb{R}^{w \times h \times d} \to \mathbb{R}^{w \times h \times d}\) is backbone enhancement (BE), including non-local interactions like ECNet [41], NLNet [42], Gather-Excite [43] or SENet [44];
\(f^c: \mathbb{R}^{w \times h \times d} \to \mathbb{R}^{w \times h \times d}\) is selective context (SC), enriching contextual information to apply locality more effectively like ASPP [45] or SKNet [46];
\(f^\ell: \mathbb{R}^{w \times h \times d} \to \mathbb{R}^{w \times h \times d}\) is our attentional localization (AL) (4), localizing objects of interest in an unsupervised fashion;
\(f^p: \mathbb{R}^{w \times h \times d} \to \mathbb{R}^d\) is a spatial pooling operation, such as GAP or GeM [5], optionally followed by other mappings, e.g., whitening.
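As mentioned above, the following sketch shows how these components could be composed as in (6). The module interfaces, the GeM implementation and the final FC layer are our assumptions; the BE, SC and AL arguments stand in for whichever variants are chosen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CiDeRSketch(nn.Module):
    """Sketch of the composition (6): f = f^p o f^l o f^c o f^e o f^b."""

    def __init__(self, backbone, enhance, context, localize, dim=2048, p=3.0):
        super().__init__()
        self.backbone = backbone   # f^b, e.g. ResNet101 up to the last conv map
        self.enhance = enhance     # f^e, backbone enhancement (e.g. an SE block)
        self.context = context     # f^c, selective context (e.g. ASPP or SKNet)
        self.localize = localize   # f^l, attentional localization (masks + fusion)
        self.p = nn.Parameter(torch.tensor(p))  # GeM exponent
        self.fc = nn.Linear(dim, dim)           # optional projection after pooling

    def gem(self, x, eps=1e-6):    # f^p: generalized mean pooling over h, w
        return x.clamp(min=eps).pow(self.p).mean(dim=(2, 3)).pow(1.0 / self.p)

    def forward(self, images):
        feat = self.backbone(images)            # (B, d, h, w)
        feat = self.context(self.enhance(feat))
        feat = self.localize(feat)              # masked and fused feature map
        emb = self.fc(self.gem(feat))           # global descriptor, (B, d)
        return F.normalize(emb, dim=-1)
```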
In the Appendix, we ablate different options for \(f^e, f^c\) and we specify our choice for \(f^p\); then in Section 5.5 we ablate, apart from hyperparameters of \(f^\ell\), the effect of the presence of components \(f^e, f^c, f^\ell\) on the overall performance. By default, we embed images using \(f\) (6), where for each component we use default settings as specified in Section 5.5 or in the Appendix.
Certain existing works [4], [6] train the backbone network first on a classification loss without the head corresponding to the method and then fine-tune including the head. We refer to this approach as “fine-tuning” (FT). To allow for comparisons, we train our model in two ways. Without fine-tuning, referred to as CiDeR, everything is trained in a single stage end-to-end. With fine-tuning, referred to as CiDeR-FT, we freeze the backbone and train only the head in the second stage. We give more details in the Appendix, along with all experimental settings.
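As a rough illustration of the CiDeR-FT second stage, the snippet below freezes the backbone so that only the head keeps training; the attribute name `backbone` and the choice to also freeze batch-norm statistics are our assumptions.

```python
import torch


def freeze_backbone(model):
    """Freeze f^b so that only the head (BE, SC, AL, pooling/FC) is updated."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    model.backbone.eval()  # also freeze batch-norm running statistics (our choice)
    return [p for p in model.parameters() if p.requires_grad]


# head_params = freeze_backbone(model)
# optimizer = torch.optim.SGD(head_params, lr=1e-3, momentum=0.9, weight_decay=1e-5)
```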
We reproduce a number of state-of-the-art (SOTA) methods using official code where available, train them on both the original GLDv2-clean dataset and our revisited version \(\mathcal{R}\)GLDv2-clean, and compare their performance on the evaluation sets. To ensure a fair evaluation, we use the same ResNet101 backbone [4], [5], [10]–[12], [23], [35], [47], [48] and ArcFace loss [12], [23], [24], [47], [48] as in previous studies.
[tab:sota295diff] shows that using \(\mathcal{R}\)GLDv2-clean leads to severe performance degradation across all methods, ranging from 1% up to 30%. Because the difference between the two training sets in terms of both images and landmark categories is very small (Table 3), this degradation can be safely attributed to the overlap of landmarks between the original training set, GLDv2-clean, and the evaluation sets, Oxford5k and Paris6k, as discussed in Section 3. In other words, this experiment demonstrates that existing studies using GLDv2-clean as a training set report artificially inflated accuracy compared with studies using training sets with no overlap, such as NC-clean and SfM-120k.
Method | Train Set | Net | Pooling | Loss | FT | E2E | Self | Dim | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H) | Mean
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Local Descriptors | |||||||||||||||
HesAff-rSIFT-ASMK\(^{\star}\)+SP [2] | SfM-120k | R50 | – | – | ✔ | – | – | – | – | – | 60.6 | 61.4 | 36.7 | 35.0 | – |
DELF-ASMK\(^{\star}\)+SP [2] | SfM-120k | R50 | – | CLS | ✔ | – | – | – | – | – | 67.8 | 76.9 | 43.1 | 55.4 | – |
Local Descriptors+D2R | |||||||||||||||
R-ASMK\(^{\star}\) [15] | NC-clean | R50 | – | CLS,LOCAL | ✔ | – | – | – | 69.9 | 78.7 | 45.6 | 57.7 | – | ||
R-ASMK\(^{\star}\)+SP [15] | NC-clean | R50 | – | CLS,LOCAL | ✔ | – | – | – | 71.9 | 78.0 | 48.5 | 54.0 | – | ||
Global Descriptors | |||||||||||||||
DIR [24] | SfM-120k | R101 | RMAC | TP | ✔ | – | – | 2048 | 79.0 | 86.3 | 53.5 | 68.3 | 25.5 | 42.4 | 59.2 |
Radenović et al. [2], [5] | SfM-120k | R101 | GeM | SIA | – | – | 2048 | 87.8 | 92.7 | 64.7 | 77.2 | 38.5 | 56.3 | 69.5 |
AGeM [35] | SfM-120k | R101 | GeM | SIA | – | – | 2048 | – | – | 67.0 | 78.1 | 40.7 | 57.3 | – | |
SOLAR [24] | SfM-120k | R101 | GeM | TP,SOS | ✔ | – | – | 2048 | 78.5 | 86.3 | 52.5 | 70.9 | 27.1 | 46.7 | 60.3 |
GLAM [12] | SfM-120k | R101 | GeM | AF | – | – | 512 | 89.7 | 91.1 | 66.2 | 77.5 | 39.5 | 54.3 | 69.7 | |
DOLG [24] | SfM-120k | R101 | GeM,GAP | AF | – | – | 512 | 72.8 | 74.5 | 46.4 | 56.6 | 18.1 | 26.6 | 49.2 | |
Global Descriptors+D2R | |||||||||||||||
Mei et al. [28] | [O] | R101 | FC | CLS | 4096 | 38.4 | – | – | – | – | – | – |
Salvador et al. [29] | Pascal VOC | V16 | GSP | CLS,LOCAL | ✔ | 512 | 67.9 | 72.9 | – | – | – | – | – |
Chen et al. [30] | OpenImageV4 [49] | R50 | MAC | MSE | ✔ | 2048 | 50.2 | 65.2 | – | – | – | – | – |
Liao et al. [38] | Oxford,Paris | A,V16 | CroW | CLS,LOCAL | 768 | 80.1 | 90.3 | – | – | – | – | – |
DIR+RPN [4] | NC-clean | R101 | RMAC | TP | ✔ | 2048 | 85.2 | 94.0 | – | – | – | – | – | ||
CiDeR (Ours) | SfM-120k | R101 | GeM | AF | | ✔ | ✔ | 2048 | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 | 71.4
CiDeR-FT (Ours) | SfM-120k | R101 | GeM | AF | ✔ | ✔ | ✔ | 2048 | 92.6 | 95.1 | 76.2 | 84.5 | 58.9 | 68.9 | 79.4
\def\arraystretch{1.13}\centering
\scriptsize
\setlength\extrarowheight{-1pt}
\setlength{\tabcolsep}{2pt}
{
\begin{tabular}
{l|cc|cccccccc|cccccccc}
\toprule
\multirow{3}{*}{\Th{Method}} &
\multicolumn{2}{c|}{\Th{Base}} & \multicolumn{8}{c|}{\Th{Medium}} & \multicolumn{8}{c}{\Th{Hard}} \\ \cmidrule{2-19}
& {\fontsize{7}{6}\selectfont \oxf5k} & {\fontsize{7}{6}\selectfont \paris6k} & \multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rox}} & \multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rox+\r1m}} & \multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rpa}} & \multicolumn{2}{c|}{{\fontsize{7}{6}\selectfont \rpa+\r1m}} & \multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rox}} & \multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rox+\r1m}} & \multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rpa}} & \multicolumn{2}{c}{{\fontsize{7}{6}\selectfont \rpa+\r1m}} \\
& \tiny mAP & \tiny mAP & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 & \tiny mAP & \tiny mP@10 \\
\midrule
\rowcolor{lightgray2}
\multicolumn{19}{c}{\Th{Global Descriptors (SfM-120k)}} \\ \midrule
DIR~\cite{dtop} & 79.0 & 86.3 & 53.5 & 76.9& -- & -- & 68.3 & 97.7 & -- & -- & 25.5 & 42.0 & -- & --& 42.4 & 83.6 & -- & -- \\
Filip~\etal~\cite{Radenovic01,RITAC18} & 87.8 & {92.7} & 64.7 & \tb{84.7} & \tb{45.2} & \tb{71.7} & 77.2 & \tb{98.1} & \tb{52.3} & \tb{95.3} & 38.5 & \tb{53.0} & \tb{19.9} & \tb{34.9} & 56.3 & \tb{89.1} & 24.7 & \tb{73.3} \\
AGeM~\cite{gu2018attention} & -- & -- & \tb{67.0} & -- & -- & -- & \tb{78.1} & -- & -- & -- & \tb{40.7} & -- & -- & -- & 57.3 & -- & -- & -- \\
SOLAR~\cite{dtop} & 78.5 & 86.3 & 52.5 & 73.6 & -- & -- & 70.9 & \tb{98.1} & -- & -- & 27.1 & 41.4 & -- & -- & 46.7 & 83.6 & -- & -- \\
GeM~\cite{dtop} & 79.0 & 82.6 & 54.0 & 72.5 & -- & -- & 64.3 & 92.6 & -- & -- & 25.8 & 42.2 & -- & -- & 36.6 & 67.6 & -- & -- \\
GLAM~\cite{dtop} & \tb{89.7} & 91.1 & 66.2 & -- & -- & -- & 77.5 & -- & -- & -- & 39.5 & -- & -- & -- & 54.3 & -- & -- & -- \\
DOLG~\cite{dtop} & 72.8 & 74.5 & 46.4 & 66.8 & -- & -- & 56.6 & 91.1 & -- & -- & 18.1 & 27.9 & -- & -- & 26.6 & 62.6 & -- & -- \\
\midrule \rowcolor{LightCyan}
\tb{\ours (Ours)} & \ok{\tb{89.9}} & {92.0} & \ok{\tb{67.3}}& \ok{\tb{85.1}} & \ok{\tb{50.3}} & \ok{\tb{75.5}} & \ok{\tb{79.4}}& 97.9 & 51.4 & \ok{\tb{95.7}} & \ok{\tb{42.4}} & \ok{\tb{56.4}} & \ok{\tb{22.4}} & \ok{\tb{35.9}} & {{57.5}} & 87.1 & 22.4 & 69.4 \\ \rowcolor{LightCyan}
\tb{\oursf (Ours)} & \red{\tb{92.6}} & \red{\tb{95.1}} & \red{\tb{76.2}}& \red{\tb{87.3}} & \red{\tb{60.5}} & \red{\tb{78.6}} & \red{\tb{84.5}} & 98.0 & \red{\tb{56.9}} & \red{\tb{95.9}} & \red{\tb{58.9}} & \red{\tb{71.1}} & \red{\tb{36.8}} & \red{\tb{55.7}} & \red{\tb{68.9}} & \red{\tb{91.3}} & \red{\tb{30.1}} & \red{\tb{73.9}} \\ \rowcolor{LightCyan}
\midrule
\rowcolor{lightgray2}
\multicolumn{19}{c}{\Th{Global Descriptors ($\cR$GLDV2-clean)}} \\ \midrule
Yokoo~\etal~\cite{Yokoo01}$^\dagger$ (Base) & 86.1 & 93.9 & 64.5 & 81.0 & 51.3 & 72.1 & 84.1 & \tb{95.4} & 54.2 & 90.3 & 35.6 & 51.5 & 22.2 & 42.9 & \tb{68.7} & 86.4 & 27.4 & 66.9 \\
SOLAR~\cite{Ng01}$^\dagger$ & 90.6 & \tb{94.4} & 70.8 & 84.6 & 55.8 & 76.1 & 80.3 & 94.6 & 57.6 & \tb{92.0} & 48.0 & \tb{62.3} & 30.3 & 45.3 & 61.8 & 83.9 & 30.7 & 71.6 \\
GLAM~\cite{SCH01}$^\ddagger$ & \tb{90.9} & 94.1 & \tb{72.2} & 84.7 & \tb{58.6} & 76.1 & 83.0 & 95.0 & \tb{58.6} & 91.7 & \tb{49.6} & 61.6 & \tb{34.1} & \tb{50.9} & 65.6 & \tb{87.6} & \tb{33.3} & 72.1 \\
DOLG~\cite{yang2021dolg}$^\dagger$ & 88.3 & 93.9 & 70.8 & \tb{85.3} & 57.3 & \tb{76.8} & \tb{83.2} & \tb{95.4} & 57.3 & \tb{92.0} &47.4 & 60.0 & 29.5 & 46.2 & 67.9 & 87.4 & 32.7 & \tb{72.4} \\
Token~\cite{tokens}$^\dagger$ & 81.2 & 89.6 & 60.8 & 77.7 & 44.0 & 60.9 & 75.8 & 94.3 & 44.1 & 86.9 & 37.3 & 54.1 & 23.2 & 37.7 & 54.8 & 81.3 & 19.7 & 54.4 \\
\midrule \rowcolor{LightCyan}
\tb{\ours (Ours)} & 89.8 & \ok{\tb{94.6}} & \ok{\tb{73.7}} & \ok{\tb{85.5}} & \ok{\tb{58.6}} & 76.3 & \ok{\tb{84.6}} & \ok{\tb{96.7}} & \ok{\tb{59.0}} & \red{\tb{95.1}} & \ok{\tb{54.9}} & \ok{\tb{66.6}} & \ok{\tb{34.6}} & \ok{\tb{54.7}} &\ok{\tb{68.5}} & \ok{\tb{89.1}} & \ok{\tb{33.5}} & \red{\tb{76.9}} \\ \rowcolor{LightCyan}
\tb{\oursf (Ours)} & \red{\tb{90.9}} & \red{\tb{96.1}} & \red{\tb{77.8}} & \red{\tb{88.0}} & \red{\tb{61.8}} & \red{\tb{78.0}} &\red{\tb{87.4}} & \red{\tb{97.0}} & \red{\tb{61.6}} & \ok{\tb{94.3}} & \red{\tb{61.9}} & \red{\tb{70.4}} & \red{\tb{39.4}} & \red{\tb{56.8}} &\red{\tb{75.3}} & \red{\tb{90.0}} & \red{\tb{35.8}} & \ok{\tb{72.7}} \\
\bottomrule
\end{tabular}
}
\vspace{-8pt}
\caption{Large-scale mAP comparison of SOTA on training sets with no overlap with evaluation sets. In the new $\cR$GLDv2-clean, settings are same as in \ref{tab:sota295diff}. In the existing SfM-120k, results are as published. $\dagger/\ddagger$: official/our code. Red: best results; blue: our results higher than previous methods; black bold: best previous method per block. \Th{FT}:fine-tuning.}
\label{tab:1m}
Table 4 compares different methods using global or local descriptors, with or without a D2R approach, on the existing clean datasets NC-clean and SfM-120k, which do not overlap with the evaluation sets.
Comparing with methods using global descriptors without D2R, our method demonstrates SOTA performance and brings significant improvements over AGeM [35], the previous best competitor. In particular, we gain 2.9% and 0.6% mAP on Oxf5k and Par6k (Base), 9.2% and 18.2% on \(\mathcal{R}\)Oxf and \(\mathcal{R}\)Par (Medium), and 6.4% and 9.5% on \(\mathcal{R}\)Oxf and \(\mathcal{R}\)Par (Hard).
Comparing with methods using global descriptors with D2R, our method, trained on the SfM-120k dataset, outperforms the highest-ranking approach, DIR+RPN [4]. Specifically, our method improves mAP by 7.4% on Oxf5k and by 1.1% on Par6k. Interestingly, methods in the D2R category employ different training sets, as no single dataset provides annotations for both detection and retrieval. Our study is unique in being single-stage, end-to-end (E2E) trainable and at the same time requiring no location supervision, thereby eliminating the need for a detection-specific training set.
[tab:1m] provides complete experimental results, including the impact of introducing 1 million distractors (\(\mathcal{R}\)1M) into the evaluation set, on our new clean training set, \(\mathcal{R}\)GLDv2-clean, as well as the previous most popular clean set, SfM-120k. Contrary to previous studies, we compare methods trained on the same training and evaluation sets to ensure fairness.
Without fine-tuning, we improve mAP by 1.3% on \(\mathcal{R}\)Oxf+\(\mathcal{R}\)1M (Medium), 5.1% on \(\mathcal{R}\)Oxf+\(\mathcal{R}\)1M (Hard), 1.7% on \(\mathcal{R}\)Par+\(\mathcal{R}\)1M (Medium), and 0.8% on \(\mathcal{R}\)Par+\(\mathcal{R}\)1M (Hard) compared to DOLG [23] on \(\mathcal{R}\)GLDv2-clean. With fine-tuning, our CiDeR-FT establishes a new SOTA for nearly all metrics. In particular, we improve mAP by 4.5% on \(\mathcal{R}\)Oxf+\(\mathcal{R}\)1M (Medium), 5.3% on \(\mathcal{R}\)Oxf+\(\mathcal{R}\)1M (Hard), 4.3% on \(\mathcal{R}\)Par+\(\mathcal{R}\)1M (Medium), and 3.1% on \(\mathcal{R}\)Par+\(\mathcal{R}\)1M (Hard) compared to DOLG [23] on \(\mathcal{R}\)GLDv2-clean.
Figure 6 shows examples of the top-5 images retrieved for a number of queries by our model, along with the associated spatial attention maps. The spatial attention map \(A\) (1) focuses exclusively on the object of interest as specified by the cropped area provided by the evaluation set, essentially ignoring the background.
Figure 7 shows t-SNE visualizations of image embeddings of the \(\mathcal{R}\)Paris dataset [2], obtained by the off-the-shelf network pre-trained on ImageNet vs. our method fine-tuned on SfM-120k [5]. It indicates superior embedding quality for our model.
We study the effect of the presence of components \(f^e, f^c, f^\ell\) (6) on the overall performance of the proposed model. Starting from the baseline, which is a ResNet101 backbone (\(f^b\)) followed by GeM pooling (\(f^p\)), we add selective context (SC, \(f^c\)), attentional localization (AL, \(f^\ell\)) and backbone enhancement (BE, \(f^e\)). Table 5 provides the results, illustrating the performance gains achieved by the proposed components.
SC | AL | BE | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---|---|---
80.2 | 83.2 | 55.1 | 67.7 | 25.8 | 40.7 | |||
✔ | 87.6 | 90.7 | 64.7 | 76.6 | 38.2 | 52.7 | ||
✔ | ✔ | 89.4 | 91.1 | 66.1 | 76.7 | 40.6 | 53.3 | |
✔ | ✔ | 88.2 | 91.5 | 66.0 | 78.4 | 40.8 | 55.9 | |
✔ | ✔ | 89.7 | 92.0 | 67.0 | 79.4 | 41.0 | 57.4 | |
✔ | ✔ | ✔ | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
We study the effect of setting the background value \(\beta\) in the masks (3) to a fixed value vs. clipping a sample \(\epsilon\) from a normal distribution. Table 6 indicates that our dynamic, randomized approach is superior when \(\epsilon \sim \mathcal{N}(0.1, 0.9)\), which we choose as the default.
\(\beta\) setting | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
Fixed (0.0) | 87.4 | 91.6 | 64.9 | 77.5 | 39.1 | 53.8 | |
Fixed (0.5) | 87.5 | 91.7 | 64.8 | 77.7 | 38.8 | 54.3 | |
\(\mathcal{N}\)(0.1, 0.5) | 90.2 | 90.5 | 67.4 | 78.1 | 40.2 | 55.2 | |
\(\mathcal{N}\)(0.1, 0.9) | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
We study the effect of the number of masks \(T\) (3) in our attentional localization, obtained by thresholding operations on the spatial attention map \(A\) (1). Table 7 shows that optimal performance is achieved for \(T=2\), which we choose as the default.
\(T\) | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
1 | 87.5 | 91.7 | 64.8 | 77.7 | 38.8 | 54.3 | |
2 | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 | |
3 | 89.4 | 92.2 | 67.5 | 78.5 | 42.4 | 55.3 | |
6 | 89.4 | 91.6 | 66.5 | 78.1 | 40.5 | 55.0 |
We confirm that training and evaluation sets for instance-level image retrieval really should not have class overlap. Our new \(\mathcal{R}\)GLDv2-clean dataset makes fair comparisons possible with previous clean datasets. The comparison between the two versions reveals that class overlap indeed brings inflated performance, although the relative difference in number of images is small. Importantly, the ranking of SOTA methods is different on the two training sets.
On the algorithmic front, D2R methods typically require an additional object detection training stage with location supervision, which is inherently inefficient. Our method CiDeR provides a single-stage training pipeline without the need for location supervision. CiDeR improves the SOTA not only on established clean training sets but also on the newly released \(\mathcal{R}\)GLDv2-clean.
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) under the metaverse support program to nurture the best talents (IITP-2024-RS-2023-00254529) grant funded by the Korea government (MSIT).
In our experiments, we use a computational environment featuring 8 RTX 3090 GPUs with PyTorch [50]. We perform transfer learning from models pre-trained on ImageNet [51]. To ensure a fair comparison with previous studies [12], [23], [48], we configure the learning environment to match theirs as closely as possible. Specifically, we use ResNet101 [37] as the backbone with final feature dimension \(d = 2048\).
We use the ArcFace [52] loss function for training, with margin parameter 0.3. For optimization, we use stochastic gradient descent with momentum 0.9, weight decay 0.00001, initial learning rate 0.001, a warm-up phase [53] of three epochs and cosine annealing. We train on SfM-120k for 100 epochs and on \(\mathcal{R}\)GLDv2-clean for 50 epochs. Previous work has shown the effectiveness of preserving the original image resolution during the training of image retrieval models [4], [54]. We adopt this principle following [12], [24], [47], where each training batch consists of images with similar aspect ratios instead of a single fixed size. The batch size is 128. Following DIR [4] and DELF [6], we carry out classification-based training of the backbone only and subsequently fine-tune the model. During fine-tuning, we train CiDeR while the backbone is frozen, as shown in Figure 10.
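A minimal sketch of this optimization setup is shown below, assuming the schedulers step once per epoch; the helper name is ours and the ArcFace head itself is omitted.

```python
import torch


def build_optimizer(model, epochs=100, warmup_epochs=3):
    # SGD with momentum 0.9, weight decay 1e-5, initial learning rate 1e-3
    optimizer = torch.optim.SGD(
        model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-5)
    # 3-epoch linear warm-up followed by cosine annealing
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs - warmup_epochs)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])
    return optimizer, scheduler
```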
For evaluation, we use multi-resolution representation [4] on both query and database images, applying \(\ell_2\)-normalization and whitening [5] on the final features.
Network | #Params (M) | #GFLOPs |
---|---|---|
R101 | 42.50 | 7.86 |
Yokoo et al. [47] | 43.91 | 7.86 |
SOLAR [11] | 53.36 | 8.57 |
DOLG [23] | 47.07 | 8.07 |
Token [48] | 54.43 | 8.05 |
CiDeR (Ours) | 46.12 | 7.94 |
R101 | BE | SC | AL | Pooling | #Params (M) | #GFLOPs |
---|---|---|---|---|---|---|
✔ | 42.50 | 7.86 | ||||
✔ | ✔ | 43.58 | 7.91 | |||
✔ | ✔ | ✔ | 43.88 | 7.93 | ||
✔ | ✔ | ✔ | ✔ | 44.02 | 7.94 | |
✔ | ✔ | ✔ | ✔ | ✔ | 46.12 | 7.94 |
Table 8 compares the model complexity of CiDeR with other models. In this table, R101 is the baseline for all related studies, all of which use the feature maps of its last layer. We observe that our model has the lowest complexity after Yokoo et al. [47], which only uses GeM + FC. Table 9 shows the model complexity of each component of CiDeR, as defined in Section 5.1.
\centering
\scriptsize
\begin{tabular}{cccll} \toprule
\Th{\#} & \Th{GID} & \Th{\# Images} & \Th{GLDv2 Landmark Name} & \Th{Oxford/Paris Landmark Name} \\
\midrule
1 & 6190 & 98 & \href{http://commons.wikimedia.org/wiki/Category:Radcliffe_Camera}{Radcliffe Camera} & Oxford \\
2 & 19172 & 32 & \href{http://commons.wikimedia.org/wiki/Category:All_Souls_College,_Oxford}{All Souls College, Oxford} & All Souls Oxford\\
3 & 37135 & 18 & \href{http://commons.wikimedia.org/wiki/Category:Oxford_University_Museum_of_Natural_History}{Oxford University Museum of Natural History} & Pitt Rivers Oxford \\
4 & 42489 & 55 & \href{http://commons.wikimedia.org/wiki/Category:Pont_au_Double}{Pont au Double} & Jesus Oxford\\
5 & 147275 & 18 & \href{http://commons.wikimedia.org/wiki/Category:Magdalen_Tower}{Magdalen Tower} & Ashmolean Oxford \\
6 & 152496 & 71 & \href{http://commons.wikimedia.org/wiki/Category:Christ_Church,_Oxford}{Christ Church, Oxford} & Christ Church Oxford\\
7 & 167275 & 55 & \href{http://commons.wikimedia.org/wiki/Category:Bridge_of_Sighs_(Oxford)}{Bridge of Sighs (Oxford)} & Magdalen Oxford \\
8 & 181291 & 60 & \href{http://commons.wikimedia.org/wiki/Category:Petit-Pont}{Petit-Pont} & Notre Dame Paris \\
9 & 192090 & 23 & \href{http://commons.wikimedia.org/wiki/Category:Christ_Church_Great_Quadrangle}{Christ Church Great Quadrangle} & Paris \\
10 & 28949 & 91 & \href{http://commons.wikimedia.org/wiki/Category:Moulin_Rouge}{Moulin Rouge} & Moulin Rouge Paris \\
11 & 44923 & 41 & \href{http://commons.wikimedia.org/wiki/Category:Jardin_de_l'Intendant}{Jardin de l'Intendant} & Hotel des Invalides Paris \\
12 & 47378 & 731 & \href{http://commons.wikimedia.org/wiki/Category:Eiffel_Tower}{Eiffel Tower} & Eiffel Tower Paris \\
13 & 69195 & 34 & \href{http://commons.wikimedia.org/wiki/Category:Place_Charles-de-Gaulle_(Paris)}{Place Charles-de-Gaulle (Paris)} & Arc de Triomphe Paris \\
14 & 167104 & 23 & \href{http://commons.wikimedia.org/wiki/Category:H}{} & \\
15 & 145268 & 72 & \href{http://commons.wikimedia.org/wiki/Category:Louvre_Pyramid}{Louvre Pyramid} & Louvre Paris \\
16 & 146388 & 80 & \href{http://commons.wikimedia.org/wiki/Category:Basilique_du_Sacr}{} & \\
17 & 138332 & 30 & \href{http://commons.wikimedia.org/wiki/Category:Parvis_Notre-Dame_-_place_Jean-Paul-II_(Paris)}{Parvis Notre-Dame - place Jean-Paul-II (Paris)} & Notre Dame Paris \\
18 & 144472 & 33 & \href{http://commons.wikimedia.org/wiki/Category:Esplanade_des_Invalides}{Esplanade des Invalides} & Paris \\
\bottomrule
\end{tabular}
\vspace{-6pt}
\caption{Details of GIDs removed from GLDv2-clean dataset.}
\label{tab:removing95gid}
To identify overlapping landmarks, we use GLAM [12] to extract image features from the training and evaluation sets. Extracted features from the training sets are indexed using approximate nearest neighbor (ANN) search. For verification, we use SIFT [18] local descriptors. We find tentative correspondences between local descriptors using a kd-tree and verify them by obtaining inlier correspondences with RANSAC.
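The sketch below illustrates the two steps with FAISS and OpenCV, assuming L2-normalized global descriptors and grayscale uint8 images; the brute-force matcher (in place of a kd-tree), the ratio-test and inlier thresholds, and the function names are our assumptions.

```python
import cv2
import faiss
import numpy as np


def search_training_set(train_feats, query_feats, k=10):
    """Retrieve top-k training images for each evaluation query (cosine similarity)."""
    index = faiss.IndexFlatIP(train_feats.shape[1])  # inner product on normalized features
    index.add(np.ascontiguousarray(train_feats, dtype=np.float32))
    return index.search(np.ascontiguousarray(query_feats, dtype=np.float32), k)


def spatially_verified(img_query, img_candidate, min_inliers=20):
    """SIFT correspondences + RANSAC: accept a match if enough inliers survive."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_query, None)
    kp2, des2 = sift.detectAndCompute(img_candidate, None)
    if des1 is None or des2 is None:
        return False
    good = []
    for pair in cv2.BFMatcher().knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:  # ratio test
            good.append(pair[0])
    if len(good) < 4:
        return False
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return inlier_mask is not None and int(inlier_mask.sum()) >= min_inliers
```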
In addition to Figure 2 in Section 3, Figure 8 shows overlapping landmark categories between the training sets (GLDv2-clean, NC-clean, SfM-120k) and the evaluation sets (\(\mathcal{R}\)Oxford, \(\mathcal{R}\)Paris). Clearly, only GLDv2-clean has overlapping categories with the evaluation sets.
[tab:removing95gid] shows the details of the 18 GIDs that are removed from GLDv2-clean due to overlap with the evaluation sets. The new, revisited \(\mathcal{R}\)GLDv2-clean dataset is what remains after this removal.
Method | Train Set | Overlap | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H) | Mean
---|---|---|---|---|---|---|---|---|---
SOLAR [48] | GLDv2-clean | Y | 82.1 | 95.2 | 72.5 | 88.8 | 47.3 | 75.8 | 77.0 | |
N | 81.6 | 95.9 | 66.1 | 84.8 | 44.3 | 70.6 | 73.9 | |||
SOLAR [11]\(^\dagger\) | \(\mathcal{R}\)GLDv2-clean | Y | 77.7 | 87.6 | 65.4 | 78.4 | 36.0 | 62.2 | 67.9 | |
N | 80.1 | 92.0 | 66.3 | 81.6 | 42.6 | 68.7 | 71.9 | |||
GLAM [12] | GLDv2-clean | Y | 81.6 | 93.9 | 73.6 | 88.6 | 53.6 | 77.4 | 78.1 | |
N | 76.8 | 94.2 | 62.9 | 83.8 | 42.0 | 69.5 | 71.5 | |||
GLAM [12]\(^\ddagger\) | \(\mathcal{R}\)GLDv2-clean | Y | 76.4 | 89.4 | 69.3 | 85.2 | 48.9 | 74.2 | 73.9 | |
N | 75.6 | 93.3 | 61.7 | 84.0 | 43.1 | 68.1 | 71.0 | |||
DOLG [24] | GLDv2-clean | Y | 81.5 | 94.3 | 72.8 | 87.0 | 48.2 | 76.0 | 76.6 | |
N | 75.7 | 93.1 | 62.7 | 82.1 | 42.0 | 64.4 | 70.0 | |||
DOLG [23]\(^\dagger\) | \(\mathcal{R}\)GLDv2-clean | Y | 76.1 | 88.6 | 66.1 | 79.7 | 41.5 | 64.1 | 69.4 | |
N | 74.6 | 91.9 | 61.1 | 82.0 | 37.1 | 65.0 | 68.6 |
Table 10 elaborates on the results of [tab:sota295diff] by comparing the original GLDv2-clean training set with our revisited version \(\mathcal{R}\)GLDv2-clean separately for overlapping vs. non-overlapping classes, that is, classes of the evaluation set that overlap or not with the original training set. As expected, mAP is much higher for overlapping than for non-overlapping classes on GLDv2-clean. On \(\mathcal{R}\)GLDv2-clean, the differences are smaller, or the non-overlapping classes even score higher.
We employ transfer learning from models pre-trained on ImageNet [51]. Following DIR [4] and DELF [6], we first perform classification-based training of the backbone only on the landmark training set and then fine-tune the model on the same training set, training CiDeR while the backbone is kept frozen. Figure 10 visualizes this process, while Figure 9 shows the training and validation loss and accuracy, with and without the fine-tuning process. These plots confirm that fine-tuning results in lower loss and higher accuracy on both training and validation sets. This is corroborated by improved performance (CiDeR-FT) in [tab:1m]. Compared to the results without fine-tuning, we obtain gains of 2.7% and 3.1% on Ox5k and Par6k (Base), 8.9% and 5.1% on \(\mathcal{R}\)Oxf and \(\mathcal{R}\)Par (Medium), and 16.5% and 11.4% on \(\mathcal{R}\)Oxf and \(\mathcal{R}\)Par (Hard).
Method | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
ECNet [41] | 88.2 | 91.5 | 66.8 | 78.3 | 42.0 | 55.4 | |
NLNet [42] | 89.4 | 91.8 | 66.5 | 77.6 | 39.1 | 53.7 | |
Gather-Excite [43] | 89.4 | 90.5 | 66.7 | 77.1 | 41.2 | 53.8 | |
SENet [44] | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
We apply four methods in a plug-and-play fashion [41]–[44]. As shown in Table 11, SENet [44] performs best. We select it for backbone enhancement in the remaining experiments.
Method | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
ASPP [45] | 90.3 | 92.2 | 67.9 | 78.2 | 41.6 | 55.8 | |
SKNet [46] | 89.3 | 92.4 | 67.4 | 78.4 | 42.3 | 55.5 | |
SKNet\(^\dagger\) | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
Here we compare ASPP [45], SKNet [46] and our modification, SKNet\(^\dagger\). The modification is that, instead of a simple element-wise sum to initially fuse multiple sources of context information, we introduce learnable parameters (5) to fuse feature maps according to their importance. As shown in Table 12, our modification SKNet\(^\dagger\) performs best, confirming that this approach better embeds context information.
SC | AL | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---|---
87.1 | 90.6 | 63.9 | 77.3 | 36.7 | 53.9 | |||
✔ | 87.6 | 90.8 | 64.7 | 77.8 | 37.9 | 54.8 | ||
✔ | 89.7 | 92.0 | 66.8 | 79.4 | 41.8 | 57.5 | ||
✔ | ✔ | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
We introduce learnable parameters (5) to fuse multiple feature maps for SC and AL. Table 13 compares this learnable fusion with a simple sum for both SC and AL, evaluating all four combinations. The results indicate that learnable fusion improves performance wherever it is applied.
Pooling | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
Attention-based pooling | 89.8 | 92.3 | 67.2 | 79.4 | 41.8 | 56.5 | |
Mask-based pooling (Ours) | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.6 |
Because of the binary masks (3), the pooling operation of our attentional localization (AL) can be called mask-based pooling. Here we derive a simpler baseline and connect it with attention in transformers. Given the feature tensor \(\mathbf{F}\in \mathbb{R}^{w \times h \times d}\), we flatten the spatial dimensions to obtain the keys \(K \in \mathbb{R}^{p \times d}\), where \(p = w \times h\) is the number of patches. The weights of the \(1 \times 1\) convolution \(f^\ell\) can be represented by a query \(Q \in \mathbb{R}^{1 \times d}\), which plays the role of a learnable CLS token. Then, replacing the nonlinearity \(\eta(\zeta(\cdot))\) by softmax, the spatial attention map (1) becomes \[A = \operatorname{softmax}(Q K^\top) \in \mathbb{R}^{1 \times p}. \label{eq:attn2}\tag{7}\] Next, by omitting the masking operation and using the attention map \(A\) to weight the values \(V = K \in \mathbb{R}^{p \times d}\), (4) simplifies to \[\mathbf{F}^\ell = A^\top\odot V \in \mathbb{R}^{p \times d}. \label{eq:alm2}\tag{8}\] Finally, we apply spatial pooling \(f^p\), like GAP or GeM. For example, in the case of GAP, the final pooled representation becomes \[f^p(\mathbf{F}^\ell) = A V \in \mathbb{R}^{1 \times d}, \label{eq:att-pool}\tag{9}\] which is the same as a simplified cross-attention operation between the features \(\mathbf{F}\) and a learnable CLS token, without projections. Using GeM pooling, we refer to this baseline as attention-based pooling. Variants of this approach have been used, mostly for classification [55]–[59]. As shown in Table 14, our mask-based pooling is on par with or better than the attention-based pooling baseline, especially on the Hard protocol.
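Below is a minimal sketch of this attention-based pooling baseline (7)–(9); for brevity it uses the GAP-style weighted sum of (9) rather than GeM, and the initialization of the query is our choice.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Sketch of (7)-(9): a learnable query acts as a CLS token over flattened patches."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, dim) * 0.02)  # Q, 1 x d

    def forward(self, feat):                      # feat: (B, d, h, w)
        keys = feat.flatten(2).transpose(1, 2)    # K = V, (B, p, d) with p = h * w
        attn = torch.softmax(keys @ self.query.t(), dim=1)  # (B, p, 1), eq. (7)
        return (attn * keys).sum(dim=1)           # A V, (B, d), eq. (9)
```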
Dim | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
4096 | 87.8 | 90.2 | 64.8 | 76.8 | 38.1 | 52.8 |
3097 | 89.8 | 90.5 | 67.4 | 76.9 | 42.5 | 53.0 |
2048 | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
1024 | 88.9 | 91.2 | 65.7 | 76.7 | 40.1 | 52.0 |
512 | 85.3 | 89.2 | 61.9 | 74.4 | 36.5 | 48.9 |
After applying spatial pooling like GeM, we apply an FC layer to generate the final features. The feature dimension \(d\) is a hyperparameter. Table 15 shows the performance for different dimensions \(d\). Interestingly, a feature dimension of 2,048 works best, with larger dimensions not offering further improvement.
Query | Database | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---|---
Single | Single | 90.5 | 91.5 | 67.0 | 77.4 | 40.3 | 55.0 |
Multi | Single | 92.6 | 92.9 | 68.4 | 79.2 | 41.2 | 56.5 |
Single | Multi | 87.1 | 90.4 | 64.8 | 77.5 | 39.1 | 55.9 |
Multi | Multi | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
At inference, we use a multi-resolution representation at image scales (0.4, 0.5, 0.7, 1.0, 1.4) for both the query and the database images. Features are extracted at each scale and then averaged to form the final representation. Table 16 provides a comparative analysis with and without the multi-resolution representation for query and database images. We find that applying multi-resolution to both query and database images works best for \(\mathcal{R}\)Oxford and \(\mathcal{R}\)Paris [2].
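A minimal sketch of this multi-resolution representation, assuming `model` maps an image batch to L2-normalized global descriptors:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def multi_scale_descriptor(model, image, scales=(0.4, 0.5, 0.7, 1.0, 1.4)):
    """Extract a descriptor at each scale and average; image: (B, 3, H, W)."""
    descs = []
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        descs.append(model(scaled))
    return F.normalize(torch.stack(descs).mean(dim=0), dim=-1)
```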
Pre-trained model | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
Facebook | 88.2 | 92.7 | 65.3 | 78.8 | 38.6 | 56.5 |
TorchVision | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
Different research teams have released models pre-trained on ImageNet [51] for major image classification tasks. It is common to use a pre-trained ResNet101 model from TorchVision [60]. Recent works [23], [61] have also used pre-trained models released by Facebook. As shown in Table 17, we find that the TorchVision model works best.
Warm-up | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
– | 89.9 | 92.0 | 66.7 | 78.9 | 41.5 | 56.7
✔ | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
To enhance model performance, we employ a warm-up phase [53] during training, consisting of three epochs. Table 18 shows that the warm-up phase improves performance.
Whitening | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
– | 85.8 | 90.7 | 60.5 | 77.0 | 31.7 | 54.2
✔ | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
We utilize the supervised whitening method pioneered by Radenović et al. [5], which is commonly used in related work to improve retrieval performance. Table 19 shows the performance gain obtained by applying whitening.
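For illustration only, the sketch below shows a simplified, unsupervised PCA whitening of global descriptors; the supervised whitening of Radenović et al. [5] that we actually use additionally exploits matching pairs to estimate the projection.

```python
import numpy as np


def learn_whitening(X, eps=1e-6):
    """X: (n, d) descriptors from a held-out set; returns mean and whitening matrix."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    P = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T  # whitening projection
    return mu, P


def apply_whitening(x, mu, P):
    y = (x - mu) @ P
    return y / (np.linalg.norm(y, axis=-1, keepdims=True) + 1e-12)  # re-normalize
```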
Method | Oxf5k | Par6k | \(\mathcal{R}\)Oxf (M) | \(\mathcal{R}\)Par (M) | \(\mathcal{R}\)Oxf (H) | \(\mathcal{R}\)Par (H)
---|---|---|---|---|---|---
Fixed-size (224 × 224) | 69.0 | 86.0 | 42.2 | 69.5 | 15.8 | 45.0 |
Group-size (Ours) | 89.9 | 92.0 | 67.3 | 79.4 | 42.4 | 57.5 |
Several previous studies suggest organizing training batches based on image size for efficient learning. Methods such as DIR [4], DELF [6], MobileViT [62], and Yokoo et al. [47] opt for variable image sizes rather than adhering to a single, fixed dimension. Our approach employs group-size sampling [12], [24], [47], where we construct image batches with similar aspect ratios. Table 20 compares the results of fixed-size (224 \(\times\) 224) and group-size sampling. We find that using dynamic input sizes to preserve the aspect ratio significantly improves performance.
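A rough sketch of group-size sampling is shown below; the aspect-ratio binning and batch construction are our assumptions and only illustrate the idea of placing images with similar aspect ratios in the same batch.

```python
import random
from collections import defaultdict


def group_size_batches(aspect_ratios, batch_size=128, num_bins=8):
    """aspect_ratios: width/height per training image; returns lists of image indices."""
    buckets = defaultdict(list)
    for idx, ar in enumerate(aspect_ratios):
        buckets[min(int(ar * 2), num_bins - 1)].append(idx)  # coarse aspect-ratio bins
    batches = []
    for indices in buckets.values():
        random.shuffle(indices)
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])
    random.shuffle(batches)  # images within a batch share a similar aspect ratio
    return batches
```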