April 02, 2024

Mapping is one of the crucial tasks enabling autonomous navigation of a mobile robot. Conventional mapping methods output dense geometric map representation, e.g. an occupancy grid, which is not trivial to keep consistent for the prolonged runs covering large environments. Meanwhile, capturing the topological structure of the workspace enables fast path planning, is less prone to odometry error accumulation and does not consume much memory. Following this idea, this paper introduces PRISM-TopoMap – a topological mapping method that maintains a graph of locally aligned locations not relying on global metric coordinates. The proposed method involves learnable multimodal place recognition paired with the scan matching pipeline for localization and loop closure in the graph of locations. The latter is updated online and the robot is localized in a proper node at each time step. We conduct a broad experimental evaluation of the suggested approach in a range of photo-realistic environments and on a real robot (wheeled differential driven Husky robot), and compare it to state of the art. The results of the empirical evaluation confirm that PRISM-Topomap consistently outperforms competitors across several measures of mapping and navigation efficiency and performs well on a real robot. The code of PRISM-Topomap is open-sourced and available at: https://github.com/kirillMouraviev/prism-topomap.

Building an accurate map of the environment is crucial for mobile robots navigation. Common mapping methods such as RTAB-MAP [1] and Cartographer [2] build a map as a dense 2D or 3D metric structure like an occupancy grid or a point cloud. However, such dense metric maps require significant memory for storage, potentially leading to memory overflow when the robot navigates large environment [3]. Coupled with odometry error accumulation, this may cause mapping and loop closure failures with the map size growth.

Topological mapping offers an alternative approach to mapping. Capturing only the topological properties (i.e. the connectivity) of the environment significantly reduces memory consumption and computational costs while also mitigating the accumulation of positioning errors [4], [5]. Furthermore, the sparsity of topological maps, typically represented as graphs, facilitates faster path planning compared to that when utilizing metric maps [6].

One of the most common topological structures for navigation is a graph of locations (e.g. living room, corridor, corner of a hall, etc.), where edges denote adjacency or straight-line reachability between two locations – see Fig. 1. Such approach enables fast and convenient path planning, however, localization in the graph, i.e. aligning a robot with a location, still remains a problem. Learning-based place recognition techniques are typically used to address this problem [7]. However, place recognition errors may lead to building edges in a graph between the non-adjacent locations, which may cause navigation failures. For example, state-of-the-art topological method TSGM [8] links locations at the opposite ends of the environment, as shown in [9].

In this work, we propose PRISM-TopoMap – a novel hybrid approach to build a graph of locations which involves place recognition techniques coupled with fine 2D feature-based alignment of the found locations. The key contributions of our paper are as follows:

A multimodal place recognition model designed and fine-tuned specifically for topological mapping. It effectively fuses multi-camera images and point cloud learnable features.

A point-cloud matching technique based on 2D features extracted from the cloud projections, which filters place recognition results and estimates relative poses between the locations.

A novel online topological mapping method, PRISM-TopoMap, that builds upon the developed place recognition model and the point cloud matching technique, and relies only on local odometry data to localize in the topological map, without utilizing global metric coordinates.

PRISM-Topomap is evaluated and compared against state-of-the-art topological mapping methods in photo-realistic Habitat simulator and on a real wheeled robot. All tests involve moving through the large indoor environments. PRISM-Topomap builds consistent topological maps and notably outperforms competitors across a range of performance measures.

A large family of methods like [6], [10] builds a topological map in offline mode from a global pre-built metric map. Another family of methods, including [11], [12], constructs a topological map paired with a metric map from raw sensor data. Such methods provide complete scene representation and enable fast hybrid path planning. However, they are still resource-demanding, and are affected by odometry error accumulation.

In recent years, a large number of learning-based topological mapping methods which do not rely on a metric map was emerged. However, most of them (like [8], [13]) are designed for solving specific task like image-goal navigation in simulated environments and were not tested on data from a real robot. Also, such methods usually rely on neural network-predicted features only, which can cause linking far locations in the environment and result in navigation failures. Some other learning-based methods like [14] work in real environments with edge filtering, but require a pre-built topological map at start.

To better analyse the landscape of the available topological mapping methods we distinguish the several key features:

*Online*– the method is able to construct method in real-time in an online fashion.*No metric map input*– the method does not require a global metric map as an input (however it may build it in addition to a topological map).*Connected graphs*– the method guarantees the connectivity of a built graph.*Edge filtering*– the method utilizes a dedicated technique of visual embedding matches rectification using scan matching or positional data, or usage of positional data in graph update. Mark in this column means that the method relies on matching visual embeddings only during edge creation.*Real data*– the method is evaluted on data from a real robotic system.

Table [tab:properties] shows how state-of-the-art topological mapping methods support those features. As will be shown in the paper, not only our method support all of them, but it, indeed, outperforms the others in practice.

```
\centering
\caption{Properties of topological mapping methods}
\setlength\tabcolsep{2pt}
\scriptsize
\label{tab:properties}
\begin{tabular}{c|c|c|c|c|c}
\hline
Method & Online & \begin{tabular}{c}No metric \\ map input\end{tabular} & \begin{tabular}{c}Connected\\ graphs\end{tabular} & \begin{tabular}{c}Edge\\ filtering\end{tabular} & \begin{tabular}{c}Real\\ data\end{tabular}\\
\hline
Topomap \cite{blochliger2018topomap} & \ding{55} & \ding{55} & ✔ & ✔ & ✔\\[3pt]
Taichislam \cite{chen2022fast} & \ding{55} & \ding{55} & \ding{55} & ✔ & ✔\\[3pt]
Hydra \cite{hughes2022hydra} & ✔ & ✔ & \ding{55} & ✔ & ✔\\[3pt]
IncrementalTopo \cite{yuan2019incrementally} & ✔ & ✔ & \ding{55} & ✔ & ✔\\[3pt]
Lifelong \cite{wiyatno2022lifelong} & \ding{55} & ✔ & ✔ & ✔ & ✔\\[3pt]
VGM \cite{kwon2021visual} & ✔ & ✔ & ✔ & \ding{55} & \ding{55}\\[3pt]
TSGM \cite{kim2023topological} & ✔ & ✔ & ✔ & \ding{55} & \ding{55}\\[3pt]
\hline
[Ours] PRISM-TopoMap & ✔ & ✔ & ✔ & ✔ & ✔\\
\hline
\end{tabular}
```

Place recognition is crucial for autonomous navigation, enabling robots to recognize previously visited locations. Methods fall into three categories: camera-based, LiDAR-based, and multimodal, combining camera and LiDAR.

Camera-based research highlights include the use of convolutional neural networks (CNNs) for appearance-invariant descriptors [15], the introduction of a trainable VLAD layer in NetVLAD [16], and innovations like CosPlace [17] which frames learning as a classification challenge. MixVPR [18] demonstrates the recent shift towards Transformer architectures in vision tasks. Some approaches [19], [20] adopt a two-stage process to refine initial results by focusing on local image regions’ descriptors, albeit at a slower computational pace. LiDAR-based techniques leverage geometrical data, enhancing robustness. PointNetVLAD [21] creates global descriptors from point-wise features. MinkLoc3D [22] and its enhancement [23] use voxelization and sparse convolutions. SVT-Net [24] further refines this with an attention mechanism. Multimodal methods, like the combination of PointNetVLAD and ResNet50 in [25], leverage both sensor types for richer descriptors, often employing fusion techniques. MinkLoc++ [26] pairs MinkLoc3D with ResNet18 for image and point cloud processing. AdaFusion [27] introduces a middle fusion strategy using an attention mechanism for feature integration.

In our paper, we use the multimodal late-fusion method similar to MinkLoc++ modified to work with multiple cameras. We also compare different place recognition models for extracting features for each modality.

Consider a robot equipped with a perception sensor (e.g. an RGB-D camera, or LiDAR) and an odometry sensor. The robot moves through the indoor environment along a pre-defined trajectory and is tasked with constructing and continuously updating a graph of the visited locations, which can be used for further mission planning and navigation (the latter problems are not addressed in this paper).

We consider 2D environments in this work for the sake of simplicity (however, the suggested methods can function in 3D environments as well). The robot’s workspace, \(W \subset \mathbb{R}^2\), is composed of the free space and the obstacles: \(W = W_{free} \cup W_{obs}\) (\(W_{free} \cap W_{obs} = \emptyset\)). A location \(loc \subset W\) is a subset of the workspace that may represent, for example, a room, a fragment of a corridor or a hall, etc. (see Fig. 1 for an example). Each location is associated with its observation point \(loc_{obs}\), and with its feature map \(F(loc) = FMap(loc_{obs})\), that is constructed from the observation captured in this point, e.g. a descriptor extracted from the point cloud, which is provided by the robot’s perception sensor, using a neural network. Two locations \(loc\) and \(loc'\) are said to be adjacent if they overlap by the traversable space: \(loc \cap loc' \cap W_{free} \neq \emptyset\). Two adjacent locations define an edge, \(e=(loc, loc')\), in the graph of locations. An edge is also associated with its feature map \(F(e)\), which is generally intended to help robot move between the locations. An example of such feature map can be a direction from the former location’s observation point to the latter location’s observation point, or the piece of the panoramic image of the former location where the latter location is observed, or another feature which may aid navigation.

To enable successful planning and efficient navigation, the graph of locations should be connected, and should cover the whole area observed by the robot. Moreover, the edges in the graph should link only locations which are really adjacent in the environment. Also, as many pairs of adjacent locations as possible should be linked by an edge, in order to plan optimal routes using the graph.

According to aforementioned requirements, we estimate the quality of the graph of locations \(G=(V,E)\) using the following metrics:

Connectivity: the number of connected components in the graph.

Coverage of the main component: \[Coverage = \frac{Area(\cup_{loc \in V_{main}} loc)}{Area(W)},\] where \((V_{main}, E_{main})\) is the main connectivity component of the graph \(G\), and \(W\) is the robot’s workspace.

Percentage of inconsistent edges (the edges which link non-adjacent locations): \[PIE = \frac{|(u, v) \in E: u \cap v \cap W_{free} = \emptyset|}{|E|}.\]

Effectiveness of navigation with the graph, which is calculated as the Success weighted by Path Length (SPL) value averaged across \(N\) sampled pairs \((u_i, v_i)\) of the graph nodes: \[SPL = \sum\limits_{i=1}^N \frac{|Path(u_i, v_i, G)| \cdot I_{cons}(Path(u_i, v_i, G))}{|Path(u_i, v_i, M)|},\] where \(Path(u, v, G)\) is the shortest path between observation points of the locations \(u\) and \(v\) in the graph \(G\); \(Path(u, v, M)\) is the true shortest path between these observation points; and \(I_{cons}(Path(u, v, G))\) is 1 if \(Path(u, v, G)\) does not contain inconsistent edges, and 0 otherwise.

PRISM-TopoMap builds and updates a graph of locations of the environment in an online mode. Fig. 2 shows the principal scheme of our pipeline, consisting of the two main modules: graph maintaining and localization. The former updates the graph and always keeps the robot attached to a vertex, representing current location. To do this it leverages learning-based multi-modal place recognition and feature-based scan matching. Both the graph maintaining and the localization modules are detailed below.

The graph maintaining module builds and expands a graph of locations using observations from the robot’s perception sensors and results of the localization module. A location in the graph is attributed with the sensory data captured from its observation point (in our experiments we use LiDAR pointclouds and front-view and back-view RGB-images). Additionally, each location is assigned a descriptor for place recognition. An edge between locations is attributed with the relative pose between observation points of the locations it connects.

At each step \(t\), the module outputs the graph of locations \(G_t\) covering the area visited by the robot so far. The module also maintains the current robot’s state in the graph: location \(v_{cur}^t\) where the robot is currently located, and relative transformation \(T_{cur}^t\) between \(v_{cur}^t\)’s observation point and the robot’s position. The scheme of the graph maintaining module is depicted in Fig. 3. The input of the module is the graph \(G_{t-1} = (V_{t-1}, E_{t-1})\) with state \(v_{cur}^{t-1}\) and \(T_{cur}^{t-1}\), localization results, odometry measurement \(o_t\), and point cloud \(C_t\) captured from the robot’s position. The output of the module is the updated graph \(G_t = (V_t, E_t)\) and updated robot state: \(v_{cur}^t\) and \(T_{cur}^t\). The graph update process is divided into the following stages:

First we check whether the robot is still inside \(v_{cur}^{t-1}\), and its current scan overlaps with \(v_{cur}^{t-1}\) with sufficient percent. If the check is passed, we apply the odometry measurement to \(T_{cur}\):

\[v_{cur}^t = v_{cur}^{t-1}; T_{cur}^t = T_{cur}^{t-1} \cdot o_t\]If the robot is outside \(v_{cur}^{t-1}\), or scans overlapping percent is low, we first try to change \(v_{cur}\)’s value to one of its neighbors in the graph (i.e. move along an edge). For this purpose, we match scans of \(v_{cur}^{t-1}\) and its neighbors using Harris corner feature detector [28] and relative transformations assigned to the edges as an initial guess for alignment. If there exists a location \(v_{next}\) whose scan is matched to \(v_{cur}^{t-1}\)’s one with transformation \(T\), we change \(v_{cur}\)’s value to it:

\[v_{cur}^t = v_{next}; T_{cur}^t = T \cdot T_{cur}^{t-1}\]

Otherwise, we try to change \(v_{cur}\)’s value to one of the locations from the localization results. For this purpose, we apply the transforms from localization results, and perform overlapping check for each of the locations from the localization results. If there exists a location \(v_{loc}\) passing the check, it becomes the current location and is linked to \(v_{cur}^{t-1}\): \[v_{cur}^t = v_{loc}; T_{cur}^t = T_{loc}; E_t = E_{t-1} \cup \{(v_{cur}^{t-1}, v_{loc})\}\]

If there is no proper location to change \(v_{cur}\), a new location \(v_{new}\) observed from the current robot’s position is added to the graph. After addition, we change \(v_{cur}\)’s value to it and link it to \(v_{cur}^{t-1}\) and all the nodes from the localization module output \(V_{loc}\): \[v_{cur}^t = v_{new};\;T_{cur}^t = I;\] \[E_{t} = E_{t-1} \cup \{(v_{new}, v_{cur}^{t-1})\} \cup \{(v_{new}, v_{loc})_{v_{loc} \in V_{loc}}\}\]

Furthermore, to make navigation more effective, the graph maintaining module searches for loops passing the current robot’s position, checking each pair of the localized nodes. In all the cases of new node addition, the module always links it to the previous location, which guarantees connectivity of the built graph.

In our place recognition method, we utilize a late fusion technique to derive a comprehensive global descriptor from the input data, which comprises front-view and back-view images and a point cloud.

A scheme of the method is shown in Fig. 4. The network architecture is divided into two main branches. In the image-processing branch, a CNN backbone (ResNet18) is employed to extract feature maps from the images. Following this, a pooling layer is applied to each image to obtain a global descriptor. As a pooling layer, we study either GeM [29] or NetVLAD [16] with a Fully Connected layer to reduce the descriptor size. These descriptors are then combined via stacking and processed through a GeM 1D pooling layer to produce a single global descriptor for the images, denoted as \(d_{img}\). For the point cloud processing branch, the point cloud data is initially voxelized. Subsequently, we apply the MinkLoc3Dv2 [23] architecture with GeM pooling to derive the global descriptor for the point cloud, \(d_{cloud}\). These two descriptors are concatenated to form a unified descriptor for the current place: \(d = Concat(d_{img}, d_{cloud})\).

We then identify the top-5 most similar place descriptors within the nodes of the location graph based on Euclidean distance. These five candidates are utilized in the subsequent stages of our pipeline.

Learning-based place recognition methods sometimes output false positive matches due to visual similarity of different parts of the environment. So, relying on a place recognition network only, our topological mapping method can link far, non-adjacent locations. To prevent this, we match point clouds (find the transformation which aligns them) from the current robot’s position and localized nodes using their 2D projections and features extracted from their geometry (see Fig. 5). If the matching score is insufficient, we remove the node from the list of localized nodes.

For point cloud matching, we first remove floor and ceiling cells from both point clouds. Then, we project both point clouds into 2D and convert the projection to black-and-white image format. After that, we extract features from the obtained images using a classical feature detector like SIFT [30] or ORB [31], and match the extracted features using FLANN method [32]. The resulting transformation is estimated using least squares method after several iterations of outlier removal. The procedure of outlier removal and transform estimation is described in Algorithm 6.

To prove the efficiency of our method, we tested it on five large simulated scenes and compared it to three state-of-the-art topological mapping methods. Additionally, we carried out experiments on data from a real robot without any models re-training and fine-tuning.

The dataset used for training place recognition models and evaluation of our topological mapping method consisted of 180 scenes from HM3D [33], 73 scenes from Gibson [34], and 10 large area scenes from Matterport3D [35]. We selected 36, 14 and 5 scenes from each dataset for validation subset, the rest were used for training. Frames were systematically sampled across the scenes from locations positioned on a uniform grid, each spaced 1 meter apart. At each location, four frames were captured, corresponding to rotations of 0, 90, 180, and 270 degrees around the yaw axis.

We followed a standard training approach [22], [26], utilizing *triplet margin loss*, *batch hard negative mining* for identifying challenging triplets within batches, and *dynamic batch sizing* to
prevent training collapse. Frames within 3 meters are considered a *positive pair*, while those beyond 10 meters are *negative pairs*. The learning rate was set at 0.0001 for image-based and 0.001 for cloud-based methods. Training was
conducted with the Adam optimizer over 60 epochs, adjusting the learning rate down by a factor of 0.1 at the 30th and 50th epochs.

```
[ht]
\caption{The metrics of Place Recognition on the test subset}
\label{tab:pr95habitat95test}
\centering
\setlength\tabcolsep{2pt}
\begin{tabular}{l|ccc}
\hline
Method & Modalities & AR@1 & AR@5 \\
\hline
GeM (ResNet18) & RGB & 87.44 & 96.26 \\
NetVLAD & RGB & 77.22 & 91.68 \\ MixVPR & RGB & 86.23 & 95.08 \\ CosPlace & RGB & 85.54 & 95.01 \\
MinkLoc3Dv2 & Point Cloud & 86.14 & 95.02 \\
SVT-Net & Point Cloud & 84.20 & 93.98 \\
\hline
{Ours (GeM+MinkLoc3Dv2)} & {RGB + Point Cloud}
& \textbf{88.50} & \textbf{96.58} \\ \hline
\end{tabular}
```

For testing, we evaluated each scene separately by using frames as queries against a database of other frames, excluding identical-location frames. We calculated Recall@1 and Recall@5 to measure the accuracy of matches within 5 meters for the top-1 and top-5 nearest neighbors. These metrics were averaged across all scenes to compute the Average Recall (AR) metric, detailed in Table [tab:pr95habitat95test]. We can see that the best metrics are achieved by our proposed multimodal neural network approach which is shown in Fig. 4 and based on GeM and MinkLoc3Dv2.

It should be noted that among unimodal RGB methods, the GeM-based model shows the best results. And among the methods based on point clouds, the MinkLoc3Dv2 model has the highest quality. This justifies their use as components of our multimodal approach.

Method | TPR | FPR | FNR | Runtime, ms |
---|---|---|---|---|

RANSAC + ICP | 0.50 | 0.04 | 0.43 | 208 |

Geotransformer | 0.78 | 0.70 | 0.00 |
278 |

(Ours) Feature2D-SIFT | 0.73 | 0.03 | 0.27 | 83 |

(Ours) Feature2D-ORB | 0.83 |
0.00 |
0.17 | 64 |

To choose the most suitable point cloud matching approach for our topological mapping, we compared our 2D feature-based method with a common classic approach based on RANSAC and ICP algorithms, as well as a state-of-the-art learning-based methods Geotransformer [36]. The point cloud pairs for comparison were sampled from the scenes used in the simulation experiments. We used 242 point cloud pairs, 60 of them had overlapping percent above 0.5, and 153 of them had overlapping percent above 0.25. For our approach, we used SIFT and ORB detectors. The results are shown in Table 1. We measured true positive rate and false positive rate across the point cloud pairs with overlapping percent more than 0.5. For false negative rate estimation, we used all the point cloud pairs.

As seen in the table, the proposed point cloud matching method with ORB detector demonstrates the best results in all the parameters except false negative rate. The FNR value of Geotransformer method is 0 because it has no technique of matching reject, it always outputs a transformation. In both simulation and robot experiments, we used the proposed method with ORB detector for point cloud matching.

Method | N components | Coverage | PIE | SPL |
---|---|---|---|---|

Hydra | 7.2 | 0.74 | 0.02 | 0.39 |

IncrementalTopo | 10.6 | 0.90 | 0.01 |
0.35 |

TSGM | 1.0 |
0.95 |
0.07 | 0.60 |

PRISM-TopoMap | 1.0 |
0.91 | 0.01 |
0.80 |

To estimate quality of our method, we used 5 large scenes from Matterport3D [35] dataset. In each scene, a virtual agent moved along a pre-defined trajectory exploring all the scene. The trajectory length varied from 100 to 300 m, the scene area varied from 100 to 700 \(m^2\). At each step, the agent received a panoramic RGB-D image from 4 cameras with 90 degrees field of view, a point cloud created from this image, and precise odometry data from the simulator. The result was the graph constructed after passing the whole trajectory.

For comparison, we used three state-of-the-art topological mapping methods: Hydra [11], IncrementalTopo [12] and TSGM [8]. Our method used front and back images, a point cloud, and an odometry as input data. TSGM was fed with the full panoramic RGB-D image only. Hydra and IncrementalTopo were fed with point cloud and odometry data, and Hydra also computed semantic data using SegFormer [37] neural network. Hydra and IncrementalTopo constructed a global metric map and used global positioning data, and TSGM ignored position data and constructed topological map only. The results of the comparison are shown in Table 2.

As seen in the table, the proposed PRISM-TopoMap method significantly outperformed all the other methods in terms of navigation efficiency. The SPL values of metric-topological Hydra and IncrementalTopo methods were less than 0.4, because of disconnection of the constructed graphs. The SPL value of learning-based TSGM method reached only 0.6 because of significant part of inconsistent edges (0.07). Our method built connected graphs with only 1% of inconsistent edges, and its SPL reached 0.8.

An example of the built graphs for all the methods is shown in Fig. 7. The methods Hydra and IncrementalTopo built disconnected graphs. TSGM built connected graph, but with inconsistent edges (from the left end of the map to the right one, due to visual similarity between the left and right corridors). Our method built a connected graph that contained fewer edges than the TSGM’s graph; however all the edges of the graph are consistent. This explains that our method has slightly fewer coverage metric than TSGM.

To prove efficiency of our method in real-world scenarios, we carried out an experiment on a wheeled mobile robot, Husky A200, that moved through a long corridor of a university building in different directions. The robot was equipped with front-view Zed and back-view Realsense D435 RGB-D cameras, and Velodyne VLP-16 lidar. We used front-view and back-view RGB images and lidar point clouds as input, as well as visual odometry data provided from Zed camera.

The robot traveled 205 m distance, passing all the branches of the corridor in both directions. The result of topological mapping on its trajectory is shown in Fig. 8. Despite the differences between the front and back camera parameters, and presence of people walking near the robot, the PRISM-TopoMap method successfully built a graph of locations relying on raw, noisy odometry data.

In this work we proposed PRISM-TopoMap – a novel topological mapping method which combines learning-based multimodal place recognition, feature-based scan matching, and rule-based graph maintaining to build connected and consistent graphs of locations from raw perception and odometry sensors data. The experiments conducted on large photorealistic simulated scenes demonstrated that the proposed method constructs connected graphs almost without inconsistent edges. Furthermore, it significantly outperforms the state-of-the-art methods in terms of navigation efficiency. Moving forward, we plan to include semantic information to our method for more accurate localization, and develop techniques of navigation with the constructed graphs of locations.

[1]

M. Labbé and F. Michaud, “Rtab-map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online
operation,” *Journal of Field Robotics*, vol. 36, no. 2, pp. 416–446, 2019.

[2]

W. Hess, D. Kohler, H. Rapp, and D. Andor, “Real-time loop closure in 2d lidar slam,” in *2016 IEEE international conference on robotics and automation (ICRA)*.1em
plus 0.5em minus 0.4emIEEE, 2016, pp. 1271–1278.

[3]

K. Muravyev and K. Yakovlev, “Evaluation of rgb-d slam in large indoor environments,” in *Interactive Collaborative Robotics: 7th International Conference, ICR 2022,
Fuzhou, China, December 16-18, 2022, Proceedings*.1em plus 0.5em minus 0.4emSpringer, 2022, pp. 93–104.

[4]

C. Gomez, M. Fehr, A. Millane, A. C. Hernandez, J. Nieto, R. Barber, and R. Siegwart, “Hybrid topological and 3d dense mapping through autonomous exploration for large indoor
environments,” in *2020 IEEE International Conference on Robotics and Automation (ICRA)*.1em plus 0.5em minus 0.4emIEEE, 2020, pp. 9673–9679.

[5]

L. Schmid, V. Reijgwart, L. Ott, J. Nieto, R. Siegwart, and C. Cadena, “A unified approach for autonomous volumetric exploration of large scale environments under severe odometry
drift,” *IEEE Robotics and Automation Letters*, vol. 6, no. 3, pp. 4504–4511, 2021.

[6]

F. Blochliger, M. Fehr, M. Dymczyk, T. Schneider, and R. Siegwart, “Topomap: Topological mapping and navigation based on visual slam maps,” in *2018 IEEE International
Conference on Robotics and Automation (ICRA)*.1em plus 0.5em minus 0.4emIEEE, 2018, pp. 3818–3825.

[7]

P. Yin, S. Zhao, I. Cisneros, A. Abuduweili, G. Huang, M. Milford, C. Liu, H. Choset, and S. Scherer, “General Place Recognition Survey: Towards the
Real-world Autonomy Age,” Sept. 2022.

[8]

N. Kim, O. Kwon, H. Yoo, Y. Choi, J. Park, and S. Oh, “Topological semantic graph memory for image-goal navigation,” in *Conference on Robot Learning*.1em plus 0.5em
minus 0.4emPMLR, 2023, pp. 393–402.

[9]

K. Muravyev and K. Yakovlev, “Evaluation of topological mapping methods in indoor environments,” *IEEE Access*, vol. 11, pp. 132 683–132 698, 2023.

[10]

X. Chen, B. Zhou, J. Lin, Y. Zhang, F. Zhang, and S. Shen, “Fast 3d sparse topological skeleton graph generation for mobile robot global planning,” in *2022 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS)*.1em plus 0.5em minus 0.4emIEEE, 2022, pp. 10 283–10 289.

[11]

N. Hughes, Y. Chang, and L. Carlone, “Hydra: a real-time spatial perception system for 3d scene graph construction and optimization,” 2022.

[12]

Y. Yuan and S. Schwertfeger, “Incrementally building topology graphs via distance maps,” in *2019 IEEE International Conference on Real-time Computing and Robotics
(RCAR)*.1em plus 0.5em minus 0.4emIEEE, 2019, pp. 468–474.

[13]

O. Kwon, N. Kim, Y. Choi, H. Yoo, J. Park, and S. Oh, “Visual graph memory with unsupervised representation for visual navigation,” in *Proceedings of the IEEE/CVF
International Conference on Computer Vision*, 2021, pp. 15 890–15 899.

[14]

R. R. Wiyatno, A. Xu, and L. Paull, “Lifelong topological visual navigation,” *IEEE Robotics and Automation Letters*, vol. 7, no. 4, pp. 9271–9278, 2022.

[15]

Z. Chen, O. Lam, A. Jacobson, and M. Milford, “Convolutional Neural Network-based Place Recognition,” Nov. 2014.

[16]

R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,”
May 2016.

[17]

G. Berton, C. Masone, and B. Caputo, “Rethinking Visual Geo-localization for Large-Scale Applications,” *arXiv:2204.02287 [cs]*, Apr.
2022.

[18]

A. Ali-bey, B. Chaib-draa, and P. Giguère, “MixVPR: Feature Mixing for Visual Place
Recognition,” in *Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision*, 2023, pp. 2998–3007.

[19]

S. Hausler, S. Garg, M. Xu, M. Milford, and T. Fischer, “Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place
Recognition,” in *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2021, pp. 14 141–14 152.

[20]

R. Wang, Y. Shen, W. Zuo, S. Zhou, and N. Zheng, “TransVPR: Transformer-Based Place Recognition With Multi-Level Attention Aggregation,” in
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2022, pp. 13 648–13 657.

[21]

M. A. Uy and G. H. Lee, “PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition,” in *Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition*, 2018, pp. 4470–4479.

[22]

J. Komorowski, “MinkLoc3D: Point Cloud Based Large-Scale Place Recognition,” in *Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision*, 2021, pp. 1790–1799.

[23]

——, “Improving Point Cloud Based Place Recognition with Ranking-based Loss and Large Batch Training,” in *2022 26th
International Conference on Pattern Recognition (ICPR)*, Aug. 2022, pp. 3699–3705.

[24]

Z. Fan, Z. Song, H. Liu, Z. Lu, J. He, and X. Du, “SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place
Recognition,” *Proceedings of the AAAI Conference on Artificial Intelligence*, vol. 36, no. 1, pp. 551–560, June 2022.

[25]

S. Xie, C. Pan, Y. Peng, K. Liu, and S. Ying, “Large-Scale Place Recognition Based on Camera-LiDAR Fused Descriptor,” *Sensors*, vol. 20,
no. 10, p. 2870, Jan. 2020.

[26]

J. Komorowski, M. Wysoczańska, and T. Trzcinski, “MinkLoc++: Lidar and Monocular Image Fusion for Place
Recognition,” in *2021 International Joint Conference on Neural Networks (IJCNN)*, July 2021, pp. 1–8.

[27]

H. Lai, P. Yin, and S. Scherer, “AdaFusion: Visual-LiDAR Fusion With Adaptive Weights for Place Recognition,” *IEEE Robotics and
Automation Letters*, vol. 7, no. 4, pp. 12 038–12 045, Oct. 2022.

[28]

C. Harris, M. Stephens, *et al.*, “A combined corner and edge detector,” in *Alvey vision conference*, vol. 15, no. 50.1em plus 0.5em minus 0.4emCiteseer,
1988, pp. 10–5244.

[29]

F. Radenović, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,” *IEEE transactions on pattern analysis and machine
intelligence*, vol. 41, no. 7, pp. 1655–1668, 2018.

[30]

D. G. Lowe, “Object recognition from local scale-invariant features,” in *Proceedings of the seventh IEEE international conference on computer vision*, vol. 2.1em
plus 0.5em minus 0.4emIeee, 1999, pp. 1150–1157.

[31]

E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in *2011 International conference on computer vision*.1em plus
0.5em minus 0.4emIeee, 2011, pp. 2564–2571.

[32]

M. Muja and D. Lowe, “Flann-fast library for approximate nearest neighbors user manual,” *Computer Science Department, University of British Columbia, Vancouver, BC,
Canada*, vol. 5, p. 6, 2009.

[33]

S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, *et al.*, “Habitat-matterport 3d dataset
(hm3d): 1000 large-scale 3d environments for embodied ai,” *arXiv preprint arXiv:2109.08238*, 2021.

[34]

F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese, “Gibson env: Real-world perception for embodied agents,” in *Proceedings of the IEEE conference on computer
vision and pattern recognition*, 2018, pp. 9068–9079.

[35]

A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” *arXiv
preprint arXiv:1709.06158*, 2017.

[36]

Z. Qin, H. Yu, C. Wang, Y. Guo, Y. Peng, S. Ilic, D. Hu, and K. Xu, “Geotransformer: Fast and robust point cloud registration with geometric transformer,” *IEEE
Transactions on Pattern Analysis and Machine Intelligence*, 2023.

[37]

E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” *Advances in
Neural Information Processing Systems*, vol. 34, pp. 12 077–12 090, 2021.

\(^{1}\)Kirill Muravyev and Konstantin Yakovlev are with Federal Research Center for Computer Science and Control of Russian Academy of Sciences,

`{muraviev,yakovlev}@isa.ru`

↩︎\(^{2}\)Alexander Melekhin and Dmitriy Yudin are with Moscow Institute of Physics and Technology (MIPT)

`{melekhin.aa,yudin.da}@phystech.edu`

↩︎\(^{3}\) Dmitriy Yudin and Konstantin Yakovlev are also with Artificial Intelligence Research Institute (AIRI)

`{yudin.da,yakovlev}@airi.net`

↩︎This is a pre-print of the paper submitted to IROS 2024 conference↩︎