Surface Reconstruction from Gaussian Splatting via Novel Stereo Views


Abstract

Gaussian splatting for radiance field rendering (3DGS) has recently emerged as an efficient approach for accurate scene representation. It optimizes the location, size, color, and shape of a cloud of 3D Gaussian elements so that, after projection, or splatting, they visually match a set of given images taken from various viewing directions. Yet, despite the proximity of the Gaussian elements to the shape boundaries, direct surface reconstruction of objects in the scene remains a challenge.

We propose a novel approach for surface reconstruction from Gaussian splatting models. Rather than relying on the Gaussian elements’ locations as a prior for surface reconstruction, we leverage the superior novel-view synthesis capabilities of 3DGS. To that end, we use the Gaussian splatting model to render pairs of stereo-calibrated novel views, from which we extract depth profiles using a stereo matching method. We then combine the extracted RGB-D images into a geometrically consistent surface. The resulting reconstruction is more accurate and shows finer details than other methods for surface reconstruction from Gaussian splatting models, while requiring significantly less compute time than competing surface reconstruction approaches.

We performed extensive testing of the proposed method on in-the-wild scenes, taken by a smartphone, showcasing its superior reconstruction abilities. Additionally, we tested the proposed method on the Tanks and Temples benchmark, and it has surpassed the current leading method for surface reconstruction from Gaussian splatting models. Project page: https://gs2mesh.github.io/.

Figure 1: Qualitative results on the Mip-NeRF360 [1] dataset. Our method, which reconstructs fine details such as the small crevices of the table, obtains on-par results with BakedSDF [2] while taking significantly less time to run.

1 Introduction

The Gaussian Splatting Model for radiance field rendering (3DGS) [3] has recently marked a significant leap forward in the realm of novel view rendering. By optimizing the distribution, size, color, and transparency of a cloud of Gaussian elements, 3DGS offers a unique way to generate images of complex scenes from new viewing directions. However, even though each Gaussian has a well-defined location and orientation, direct reconstruction of surfaces from these models involves significant challenges. The main problem is that the locations of the Gaussian elements in 3D space do not form a geometrically consistent surface, as they are optimized to best match the input images when projected back onto the image planes. Thus, reconstructing a surface based on the Gaussians’ centers yields noisy and inaccurate results. Current state-of-the-art methods attempt to regularize the 3DGS optimization process by directing the Gaussian elements to geometrically align with the reconstructed surface [4].

In this paper, we propose an alternative novel approach for surface reconstruction from 3DGS, which preserves the inherent properties of the model. The proposed method leverages the primary goal of 3DGS for which it is optimized – generating accurate and consistent novel views of a scene. Therefore, instead of relying on Gaussian elements for the reconstruction, we directly extract the scene’s geometry via novel view rendering.

The pipeline consists of capturing a scene with 3DGS, and generating pairs of novel stereo-calibrated views. We then apply a stereo matching model to extract a depth map from each pair of novel views. Lastly, we integrate all the RGB-D data using the Truncated Signed Distance Function (TSDF) algorithm [5], to create a smooth and geometrically consistent surface. Additionally, the proposed framework allows for the reconstruction of a specific object in the scene, by segmenting the object using a combination of segment-anything (SAM) [6] masks with the depth map information, for semi-automatic segmentation of objects in 3D.

The proposed method reduces surface reconstruction time dramatically, adding only a small overhead on top of the 3DGS capture of the scene, which is itself much more efficient than other neural scene representations. For instance, reconstructing an in-the-wild scene taken with a standard smartphone camera requires less than 5 minutes of additional computation after the 3DGS scene capture. Additionally, since we reconstruct the surface based on the 3DGS capture, it is straightforward to bind the mesh to the original model, as mentioned in [4], [7], for mesh-based manipulation of the Gaussian elements. Moreover, since our mesh is more accurate, it does not require any additional refinement [4].

We tested the proposed method on the Tanks and Temples benchmark [8], a Multi-View Stereo (MVS) dataset, and surpassed SuGaR [4], the current state-of-the-art method for surface reconstruction from 3DGS. Additionally, we extensively tested it on scenes captured with a smartphone, showing qualitative results of the proposed method’s reconstruction abilities.

Main contribution. An out-of-the-box method for fast and accurate in-the-wild surface reconstruction, based on stereo-calibrated novel view synthesis.

2 Related Efforts

2.1 Multi-View Stereo and Stereo Matching

Multi-View Stereo (MVS), the task of extracting depth maps from multiple views, is a fundamental surface reconstruction approach in computer vision. Among deep MVS methods, the pioneering work of MVSNet [9] introduced an end-to-end framework for MVS learning, which can be divided into three parts: 2D feature extraction, differentiable homography warping, and a 3D cost volume processed with 3D convolutions. Later methods improved upon this scheme by refining the 3D cost volume [10], [11], improving the 2D feature extraction architecture [12], adopting a vision transformer (ViT) for feature extraction [13], and making the 3D convolutions more efficient through coarse-to-fine processing [14], [15]. To fuse the extracted depth maps into a single point cloud or mesh, there are two main approaches: Fusibile [16], which was recently generalized by [17], and TSDF [5], which we adopt in our method.

Deep stereo matching methods [18]–[21] are related to deep MVS methods; however, since matching pixels are guaranteed to lie on the same image row, the cost volume operates over disparity rather than depth. Recent state-of-the-art stereo matching models, such as RAFT [22], IGEV [23], and DLNR [24], employ iterative refinement with GRU or LSTM layers. In our method we use DLNR.

A common flaw of MVS methods is their heavy reliance on accurate camera poses for computing epipolar lines; on in-the-wild scenes they struggle to reach the accuracy obtained in controlled environments, since small errors in pose estimation lead to noisy reconstructions, as we show in our ablation study. In contrast, the cameras of a stereo-calibrated pair are close to one another and share the same image plane. Therefore, our method is not only less prone to occlusions, but also requires consistency from the novel view synthesis algorithm only over short distances and on a shared image plane, ensuring a more accurate and less noisy reconstruction.

2.2 Novel View Synthesis

Novel view synthesis methods aim to render views of a scene from any given pose, after being trained on a set of images of the scene. A major leap forward in accuracy was achieved by Neural Radiance Fields (NeRF) [25], which incorporates importance sampling and positional encoding to enhance rendering quality. However, the use of relatively large Multi-Layer Perceptrons (MLPs) to capture the scene results in long training times. Later, Mip-NeRF [1] improved the quality of the rendered views with a different sampling method, yet its training and rendering times remained long. InstantNGP [26] tackled the extended training times of previous efforts by combining a hash grid and an occupancy grid with a small MLP, showcasing impressively fast capturing of scenes.

Recently, another major leap forward was presented by 3DGS [3], a faster and more accurate method for scene capturing, which serves as the backbone of our method, since 3DGS can produce calibrated images that are accurate enough for stereo matching. 3DGS represents the scene as a point cloud of 3D Gaussians, where each Gaussian carries opacity, rotation, scale, location, and spherical-harmonics color coefficients. The scaling of the Gaussians is anisotropic, which allows them to represent thin structures in the scene. The capturing time is short compared to MLP-based methods, and rendering runs in real time, which allows our method to reconstruct surfaces faster than neural surface reconstruction methods. The first stage of the 3DGS optimization is a structure-from-motion (SfM) algorithm [27], [28], which extracts the camera poses and provides an initial guess for the Gaussian locations. Recently, several papers [29]–[32] suggested improvements to 3DGS; however, we use the vanilla version of 3DGS, which we find satisfactory for our purposes. Recent methods attempt to manipulate the Gaussian elements to extract more accurate surfaces [4], [33]. SuGaR [4] adds a regularization term for post-process optimization based on the opacity levels of the Gaussians, forcing the Gaussian element cloud to align with the surface. However, since this method relies on the location and opacity of the Gaussian elements, it reconstructs the surface with noisy undulations.
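
For concreteness, the per-Gaussian parameters listed above can be summarized in a minimal, illustrative container; the field names and shapes below are our own shorthand, not the layout used by the official 3DGS implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """Illustrative summary of the parameters optimized per Gaussian in 3DGS."""
    location: np.ndarray   # (3,) center position in world coordinates
    rotation: np.ndarray   # (4,) unit quaternion defining the local axes
    scale: np.ndarray      # (3,) anisotropic scale along the local axes
    opacity: float         # scalar opacity used during alpha blending
    sh_coeffs: np.ndarray  # (K, 3) spherical-harmonics color coefficients
```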

2.3 Neural Surface Reconstruction

Accurate neural surface reconstruction methods [2], [34]–[36] aim to enable both accurate surface reconstruction and accurate novel view synthesis. In IDR [36], an SDF represented by an MLP is trained to reconstruct both color and geometry. For better geometric reconstruction, NeuS [35] utilized weighted volume rendering to reduce geometric error, and HF-NeuS [37] decomposed the implicit SDF into a base function and a displacement function, allowing a coarse-to-fine refinement that recovers high-frequency detail. Other methods introduce additional regularization for improved reconstruction. RegSDF [38] uses a point cloud obtained from structure-from-motion as regularization, in addition to a regularization on the curvature of the zero-level set of the SDF, and NeuralWarp [39] refines the geometry by regularizing image consistency between different views, warping them based on the implicit geometry. Using a 3D hash-coded grid, Neuralangelo [40] enables detailed reconstruction, showcasing state-of-the-art results on leading benchmarks. However, its reconstruction requires an extensive computation time of several days per scene, as does BakedSDF [2], which trains an MLP with an architecture similar to Mip-NeRF [1]. In contrast to neural rendering methods, our method is fast: given a trained 3DGS, it requires only a few minutes of computation for a typical 360-degree object-centric scene, and it produces an accurate and geometrically consistent surface.

3 Method

We propose a method for surface reconstruction that integrates several components, see Figure 2. We begin with a 3DGS model for synthesizing novel views, then render stereo-calibrated pairs from which depth is extracted via stereo matching. These depth views are then aggregated using TSDF [5]. Additionally, for targeted object reconstruction within scenes, we utilize the depth maps to project segmentation masks between consecutive images.

Figure 2: The proposed pipeline for surface reconstruction. First, we represent the scene by applying a 3DGS model. We then use the 3DGS model to render novel-view pairs of stereo-calibrated images. For each pair, using a shape from stereo algorithm, we reconstruct an RGB-D structure, which is then integrated from all views using TSDF [5] into a triangulated mesh of the scene.

3.1 Scene Capture and Pose Estimation

Our pipeline receives a video or images of a scene as input. Following the vanilla 3DGS, we employ COLMAP [27], [28] for Structure-from-Motion (SFM) to identify points of interest and deduce camera matrices from the provided images.
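
For reference, a minimal sketch of this stage using the pycolmap bindings is given below; the paths are placeholders, and the actual pipeline may equivalently invoke the COLMAP command-line tools, as done in the 3DGS codebase.

```python
from pathlib import Path
import pycolmap

# Placeholder scene layout: a folder of input frames and an output folder.
scene = Path("scene")
database = scene / "database.db"
images = scene / "images"
sparse = scene / "sparse"
sparse.mkdir(parents=True, exist_ok=True)

# Detect and match keypoints, then run incremental SfM to recover camera
# intrinsics, extrinsics, and a sparse point cloud for 3DGS initialization.
pycolmap.extract_features(str(database), str(images))
pycolmap.match_exhaustive(str(database))
maps = pycolmap.incremental_mapping(str(database), str(images), str(sparse))
print(maps[0].summary())  # cameras, images, and sparse points of the best model
```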

3.2 3DGS and Stereo-Calibrated Novel View Rendering

The elements extracted from the previous stage are then fed into the 3DGS optimization process to accurately represent the scene. After capturing the scene with 3DGS, our method generates stereo-calibrated views. When generating these views, we note a major issue with 3DGS: views distant from the training views are less consistent, and may contain artifacts such as floating and noisy Gaussians that can negatively affect the depth reconstruction. To address this issue, we take several measures. First, we input a sufficient number of images that represent the scene from a variety of positions and angles. Second, we avoid random virtual camera positions or a dome-like arrangement of virtual cameras; instead, we keep the left virtual camera of each stereo pair at the same position and angle as an original camera in the training set. This ensures that the rendered images are as close as possible to the original images on which the 3DGS model was optimized. Additionally, to enforce consistency between the paired views, we use a baseline of up to \(7\%\) of the scene radius, computed from the COLMAP [27], [28] poses.
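
The right camera of each pair can be obtained by shifting the left (training) camera along its local x-axis by the chosen baseline, keeping the rotation and intrinsics unchanged so that both views share the same image plane. A minimal sketch, assuming world-to-camera extrinsics \([R \mid t]\) in the COLMAP convention (the function name and example values are illustrative):

```python
import numpy as np

def right_camera_extrinsics(R_left: np.ndarray, t_left: np.ndarray, baseline: float):
    """Derive the right-camera extrinsics of a stereo pair.

    R_left (3x3) and t_left (3,) map a world point X to camera coordinates via
    R_left @ X + t_left. Translating the camera center by +baseline along the
    camera's local x-axis leaves R unchanged and shifts the translation by
    -baseline along x, i.e. t_right = t_left - [baseline, 0, 0].
    """
    t_right = t_left - np.array([baseline, 0.0, 0.0])
    return R_left.copy(), t_right

# Example: baseline chosen as 7% of the scene radius (placeholder value),
# where the radius is estimated from the COLMAP camera centers.
scene_radius = 2.0
R_r, t_r = right_camera_extrinsics(np.eye(3), np.zeros(3), 0.07 * scene_radius)
```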

3.3 Stereo Depth Estimation

With the rendered stereo-calibrated image pairs, a stereo matching algorithm is applied; in our pipeline, we use DLNR [24], a state-of-the-art method for stereo matching and disparity calculation. Additionally, we compute an occlusion mask by checking consistency between the left-to-right and right-to-left disparities, masking out areas where occlusions occur. We can afford to lose these areas, since we have multiple stereo views, and the information about an occluded area can therefore be obtained from an adjacent view.
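
A minimal sketch of such a left-right consistency check is shown below, assuming dense left-to-right and right-to-left disparity maps predicted by the stereo network; the threshold value is an illustrative choice.

```python
import numpy as np

def occlusion_mask(disp_left: np.ndarray, disp_right: np.ndarray,
                   threshold: float = 1.0) -> np.ndarray:
    """Return a boolean mask that is True for non-occluded (consistent) pixels.

    disp_left[y, x] is the disparity of pixel (x, y) in the left image, so its
    match in the right image lies at (x - disp_left[y, x], y).
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0).astype(np.float32)
    ys = np.arange(h)[:, None].repeat(w, axis=1)

    # Column of each left pixel's match in the right image.
    x_right = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    disp_r_at_match = disp_right[ys, x_right]

    # Consistent pixels map back to (approximately) the same left column.
    return np.abs(disp_left - disp_r_at_match) < threshold
```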

In addition, we introduce a different masking technique based on the baseline of the rendered views. The depth error induced by a disparity error is approximately \[\epsilon(Z) \approx \dfrac{\epsilon(d)}{f_x \cdot B}\,Z^2,\] where \(\epsilon(d)\) is the disparity output error, \(Z\) is the ground-truth depth, \(\epsilon(Z)\) is the error of the depth estimate, \(f_x\) denotes the camera’s horizontal focal length, and \(B\) is the baseline [41]. Conversely, the disparity between matching pixels in two images of an object positioned at a short distance from the cameras can exceed the maximum disparity supported by stereo matching algorithms. Thus, estimating the depth of an object that is too close to the camera can fail due to this limitation, while the depth error of a distant object grows quadratically. Therefore, we only consider depth in the range \[2B \,\leq\, Z \,\leq\, 10B.\] This approach enhances the overall accuracy and reliability of the depth estimation process, ensuring more consistent geometric reconstructions.
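
Combining the standard rectified-stereo relation \(Z = f_x B / d\) with the range above, a hedged sketch of this conversion and masking step could look as follows (function name and multipliers follow the rule stated above):

```python
import numpy as np

def disparity_to_masked_depth(disp: np.ndarray, fx: float, baseline: float,
                              near_mult: float = 2.0, far_mult: float = 10.0):
    """Convert a disparity map to depth and keep only the trusted range.

    Depth follows the rectified-stereo relation Z = fx * B / d. Pixels closer
    than near_mult * B (disparity may exceed the matcher's limit) or farther
    than far_mult * B (quadratically growing depth error) are masked out,
    following the 2B <= Z <= 10B rule above.
    """
    depth = np.where(disp > 1e-6, fx * baseline / np.maximum(disp, 1e-6), 0.0)
    valid = (depth >= near_mult * baseline) & (depth <= far_mult * baseline)
    return np.where(valid, depth, 0.0), valid
```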

3.4 Depth Fusion into Triangulated Surface

Despite the various masks described earlier, which enhance the geometric consistency, combining the masked point clouds into a geometrically consistent mesh still requires extra care. Simply taking the union of all point clouds would result in a noisy 3D reconstruction, as each view contains its own noise and errors, and overlapping regions between views form “layers” that negatively affect the meshing algorithm. Therefore, to smooth out the noise and errors of the individual views, we aggregate all of the extracted depth estimates using the Truncated Signed Distance Function (TSDF) algorithm [5]. The TSDF algorithm integrates the different views sequentially using RGB-D data and camera parameters, resulting in a smooth triangulated mesh with less noise and with an RGB value at each vertex.
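
A minimal sketch of this fusion step using Open3D's TSDF integration is given below; the voxel size, truncation, and input conventions are assumptions for illustration, not the exact settings of our pipeline.

```python
import numpy as np
import open3d as o3d

def fuse_rgbd(colors, depths, intrinsic, extrinsics, voxel_length=0.01):
    """Integrate per-view RGB-D estimates into a single triangle mesh.

    colors / depths are lists of HxWx3 uint8 and HxW float32 arrays (meters),
    intrinsic is an o3d.camera.PinholeCameraIntrinsic, and extrinsics are
    4x4 world-to-camera matrices of the left camera of each stereo pair.
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length,
        sdf_trunc=4 * voxel_length,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

    for color, depth, T in zip(colors, depths, extrinsics):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(color),
            o3d.geometry.Image(depth),
            depth_scale=1.0,       # depths are already metric
            depth_trunc=1000.0,    # effectively no truncation; masking was done beforehand
            convert_rgb_to_intensity=False)
        volume.integrate(rgbd, intrinsic, T)

    return volume.extract_triangle_mesh()  # mesh with per-vertex color
```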

3.5 Object Segmentation using Depth and SAM

In some cases we want to extract only the surface of a specific object in the scene. Other methods, such as Segment-Any-Gaussians [42], require additional training after the 3DGS step and are better suited for scenes with multiple objects than for 360-degree object-centric scenes. Thus, we choose to segment each image. The naive approach would be to segment every image independently, which is labor-intensive. Instead, we employ a technique that leverages the Segment Anything Model (SAM) [6] in conjunction with depth information and geometric transformations. Initially, we annotate the first image of the scene using SAM to obtain a precise object mask. This initial segmentation acts as a foundation for tracking and segmenting the object across subsequent images. Since the depth of the identified object is available, we can project its mask onto the next image in the sequence using the camera parameters. To accommodate potential errors in SAM, we dilate the projected mask in the new image. Then, using farthest point sampling, we select points that represent the extremities of the object within this dilated mask. These points serve as the seed for a new SAM annotation of the next image. This process is applied iteratively to each subsequent image in the series, allowing for dynamic and precise object segmentation throughout the scene.
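
To make the propagation step concrete, a condensed sketch is given below. It assumes pinhole intrinsics, world-to-camera extrinsics, a metric depth map for the current view, and a SamPredictor from the segment_anything package; the function name, prompt count, and dilation size are illustrative choices rather than the exact implementation.

```python
import cv2
import numpy as np

def propagate_mask(mask, depth, K, T_cur, T_next, next_image, predictor,
                   n_prompts=5, dilate_px=15):
    """Project the current object mask into the next view and re-run SAM.

    mask (bool HxW) and depth (float HxW, metric) belong to the current left
    view, K is the 3x3 intrinsic matrix, T_cur / T_next are 4x4 world-to-camera
    extrinsics, and `predictor` is a segment_anything.SamPredictor instance.
    """
    ys, xs = np.nonzero(mask & (depth > 0))
    z = depth[ys, xs]

    # Back-project masked pixels to world space, then project into the next camera.
    pts_cam = np.linalg.inv(K) @ np.stack([xs * z, ys * z, z])
    pts_world = np.linalg.inv(T_cur) @ np.vstack([pts_cam, np.ones_like(z)])
    pts_next = (T_next @ pts_world)[:3]
    uv = (K @ pts_next)[:2] / pts_next[2]

    # Rasterize the projected points and dilate to absorb projection error.
    proj = np.zeros(mask.shape, np.uint8)
    u, v = np.round(uv).astype(int)
    keep = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
    proj[v[keep], u[keep]] = 1
    proj = cv2.dilate(proj, np.ones((dilate_px, dilate_px), np.uint8))

    # Greedy farthest point sampling of prompt points inside the dilated region.
    cand = np.argwhere(proj)[:, ::-1].astype(np.float32)  # (x, y) coordinates
    prompts = [cand[0]]
    for _ in range(n_prompts - 1):
        dists = np.linalg.norm(cand[:, None] - np.array(prompts)[None], axis=-1)
        prompts.append(cand[np.argmax(dists.min(axis=1))])

    # Prompt SAM on the next image with the sampled points as foreground seeds.
    predictor.set_image(next_image)
    masks, _, _ = predictor.predict(point_coords=np.array(prompts),
                                    point_labels=np.ones(len(prompts)),
                                    multimask_output=False)
    return masks[0]
```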

4 Experiments and Results

We present experiments demonstrating that our method reconstructs surfaces in a more geometrically consistent way than other 3DGS-based or MVS approaches, and achieves performance comparable to neural reconstruction methods while taking significantly less time to run. For quantitative results, we tested our method on Tanks and Temples [8]. Additionally, we show qualitative reconstruction results on Mip-NeRF360 [1], demonstrating that our method achieves visual quality comparable to neural reconstruction methods, and on in-the-wild videos taken with smartphones, we demonstrate superior geometric consistency and smoothness compared to SuGaR [4]. Finally, we conduct an ablation study on in-the-wild scenes and on the MobileBrick dataset [43], which validates the contributions of novel-view image generation and of stereo matching, by comparing an MVS model given rendered images as input with an MVS model given the original images at the same poses, and by comparing our method to both of these MVS variants. We note that in the MobileBrick dataset [43] the camera poses are manually refined, and are shown to be more accurate than COLMAP [27], [28] poses for reconstruction. The comparison we present thus favors the MVS models in that respect.

Figure 3: Qualitative evaluation of mesh reconstruction from in-the-wild videos. Our method successfully reconstructs meshes from Gaussian splatting while maintaining their fine details. We also demonstrate the effectiveness of our masking technique. It is evident that our reconstruction method outperforms SuGaR [4], which is based on converting flattened Gaussian elements into a mesh and tends to turn texture into geometry (coffee cups and the sculpture).

4.1 Datasets

Tanks and Temples [8]. We evaluate our method on the Tanks and Temples (TnT) [8] training set, and compare our results with other neural reconstruction methods [35], [39], [40] as well as with SuGaR [4]. The TnT [8] dataset contains videos of large objects such as vehicles, buildings, and statues. These objects are scanned with a laser scanner for an accurate ground-truth 3D point cloud. For evaluation we use the official TnT [8] evaluation method, which first aligns the point clouds using ICP [44] and then calculates precision, recall, and the F1 score.
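
For intuition, the core of this metric can be sketched as follows with Open3D, given two already-aligned point clouds and a distance threshold \(\tau\); the official script additionally handles ICP alignment, crop volumes, and per-scene thresholds, so this is only an illustrative simplification.

```python
import numpy as np
import open3d as o3d

def f_score(pred: o3d.geometry.PointCloud,
            gt: o3d.geometry.PointCloud,
            tau: float):
    """Precision / recall / F1 between two aligned point clouds.

    Precision: fraction of predicted points within tau of the ground truth.
    Recall: fraction of ground-truth points within tau of the prediction.
    """
    d_pred_to_gt = np.asarray(pred.compute_point_cloud_distance(gt))
    d_gt_to_pred = np.asarray(gt.compute_point_cloud_distance(pred))
    precision = float((d_pred_to_gt < tau).mean())
    recall = float((d_gt_to_pred < tau).mean())
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```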

Mip-NeRF360 [1]. This dataset contains scenes taken from a 360 degree view, with emphasis on minimizing photometric variations through controlled capture conditions. Since there is no ground-truth in terms of surface reconstruction, we leave this as a qualitative comparison only.

MobileBrick [43]. This dataset contains videos of LEGO models, with corresponding 3D ground-truth meshes created from the LEGO 3D model. The poses are manually refined and are more accurate than the COLMAP ones [27], [28]. This dataset is challenging since most of the videos in the test set are taken from a top view of the model, thus, creating occlusions and leaving areas in the model with little visibility. We use the official evaluation code.

In-the-wild videos. We claim that for object surface reconstruction in-the-wild, our method presents a favorable balance between accuracy and computation time. To validate this claim, we captured scenes containing various objects such as plants, sculptures, figures and other everyday items, with intricate geometries and textures, and reconstructed their surface. The videos contain one or two cycles of moving around the object, depending on the object’s size, without any measures to maintain a persistent radius or pose of the camera, and without any control of the lighting in the environment. Since these objects are filmed only with a smartphone camera, there is no ground-truth reconstruction for these objects.

4.2 Baselines

Since our method’s objective of surface reconstruction from 3DGS is most related to SuGaR [4], we conduct an extensive comparison between the two. For the TnT [8] benchmark, we follow prior work and, besides SuGaR [4], also report for completeness the neural rendering methods Neuralangelo [40], NeuralWarp [39], and NeuS [35]. We also compare against MVSFormer [13], a state-of-the-art deep MVS network, for in-the-wild scenes and for MobileBrick [43]. For the Mip-NeRF360 [1] dataset, we compare our results to the surface reconstructions of BakedSDF [2] and SuGaR [4].

Table 1: Quantitative results on the Tanks and Temples [8] benchmark. Our method achieves the best performance for surface reconstruction from 3DGS. In the Truck and Ignatius scenes, our method performs on par with neural reconstruction methods while taking significantly less time to run.

| Method | Barn | Caterpillar | Ignatius | Truck | Average | Runtime |
|---|---|---|---|---|---|---|
| Neural reconstruction methods (F1 Score \(\uparrow\)) | | | | | | \(\sim\)16h-48h |
| NeuralWarp [39] | 0.22 | 0.18 | 0.02 | 0.35 | 0.19 | |
| NeuS [35] | 0.29 | 0.29 | 0.83 | 0.45 | 0.47 | |
| Neuralangelo [40] | 0.70 | 0.36 | 0.89 | 0.48 | 0.61 | |
| Gaussian splatting based methods (F1 Score (Precision) \(\uparrow\)) | | | | | | |
| SuGaR [4] | 0.01 (0.08) | 0.02 (0.09) | 0.06 (0.34) | 0.05 (0.17) | 0.04 (0.17) | \(\sim\)2h |
| Ours | 0.21 (0.22) | 0.17 (0.12) | 0.64 (0.68) | 0.46 (0.40) | 0.37 (0.36) | \(\sim\)1h |

Figure 4: Qualitative results on Tanks and Temples [8]. Top row: Ignatius scene, compared to SuGaR [4]. Bottom row: Barn scene, compared to SuGaR. We encourage the reader to compare the visual results of the Barn scene with the relevant figure in the Neuralangelo paper [40].

4.3 Results

Tanks and Temples [8]. Table 1 presents a summary of the reconstruction results on the TnT [8] benchmark. Since SuGaR [4] yields a sparse mesh, which significantly lowers its recall, we also report a precision metric that is unaffected by mesh sparsity. However, it is important to note that this metric does not account for missing parts in the reconstruction. The results show that our method outperforms SuGaR [4] in both F1 and precision. Additionally, it is evident from Figure 4 that our method is able to reconstruct fine details, such as in the Barn scene, and a comparison with the Barn figure in the Neuralangelo paper [40] confirms that we achieve visual performance comparable to that of neural reconstruction methods [35], [39], [40]. Moreover, our method has a significant advantage in terms of processing time, requiring less than 60 minutes of additional computation per TnT [8] scene, compared to the 16-48 hours needed by neural reconstruction methods. Our computation times are longer on TnT because each scene contains hundreds of frames, compared to a typical in-the-wild scene, which contains fewer than 100 frames. It is important to note that the TnT [8] dataset predominantly features large scenes, whereas our method is based on 3DGS reconstruction, which is designed for accurate reconstruction of smaller ones. This is particularly evident in the Ignatius and Truck scenes, which are relatively small, where our method performed on par with the neural reconstruction methods.

Figure 5: Additional results on Mip-NeRF360 [1] dataset, showing that our method is on-par with BakedSDF [2] in fine-detail reconstruction.

Mip-NeRF360 [1]. As illustrated in Figures 1 and 5, we present a qualitative analysis of scenes from the Mip-NeRF360 [1] dataset. This comparison reveals that our approach surpasses SuGaR [4] in reconstruction quality and presents on-par results with BakedSDF [2]. Notably, our method excels in reconstructing fine details; for instance, even the small grooves in the garden scene’s table are evident in the reconstruction, and the objects in the countertop scene show intricate details. Furthermore, while BakedSDF [2] requires 48 hours of training, our method achieves comparable results in less than an hour. Compared to SuGaR [4], our method generates smoother and more realistic surfaces, especially in reflective areas; our countertop is smooth and flat, while SuGaR’s countertop has many bumps in areas with glare. This is likely due to our use of small disparities for stereo matching, where the reconstruction distortion is relatively small, and to our integration of the reconstructed patches from various viewing directions, which further reduces potential distortions.

In-the-wild comparison. Our comprehensive in-the-wild comparisons demonstrate the superior performance of our method across various scenes, as illustrated in Figure 3, with additional results provided in the supplementary material. Our method surpasses previous approaches in extracting accurate and noise-free meshes from 3DGS. We also evaluate our masking technique, which focuses on reconstructing a single object in a scene.

4.4 Ablation Studies

Figure 6: From left to right: one of the input images, the mesh reconstructed by MVSFormer [13] from the original input images with the COLMAP [27], [28] poses, the mesh reconstructed by MVSFormer from rendered images at the same poses, and the mesh reconstructed by our method.

We conduct an ablation study to evaluate the advantages of rendering stereo-calibrated pairs and applying stereo matching for surface reconstruction over simply applying MVS to the same input. In our pipeline, we initially use COLMAP [27], [28] to determine the extrinsic and intrinsic parameters of the cameras. Following this, we capture the scene with 3DGS and render pairs of stereo-calibrated images. In our first ablation, we explore the contribution of stereo matching to the pipeline, by applying our method to the stereo-calibrated rendered images, and applying an MVS method to the left rendered image of each stereo-calibrated pair, which shares the pose of an original image. Rendering images in this way reduces distortion and camera noise, which may enhance the quality of the reconstruction regardless of the stereo matching. In our second ablation, we explore the contribution of 3DGS to the pipeline, by applying an MVS method to the original images with the COLMAP [27], [28] poses, and applying the same MVS to images rendered from the 3DGS at the same COLMAP [27], [28] poses. We evaluate these methods on in-the-wild scenes for qualitative results, as well as on the MobileBrick [43] dataset for quantitative results.

COLMAP and MVS. For this evaluation, we use MVSFormer [13], a state-of-the-art deep MVS model. To ensure a fair comparison, we replace its aggregation method, Fusibile [16], with TSDF [5], as in our method. For in-the-wild scenes, the reconstruction achieved with the MVS method (Fig. 6) exhibits a mesh that is noisier and more prone to holes compared to our method, which yields consistent geometry and better accuracy. The flower reconstruction demonstrates this phenomenon, where some of the flowers are missing, while the other examples show holes and noise in the reconstruction. We note that the missing flowers are due to the post-processing performed on all of the meshes, which includes removing clusters with fewer than a certain number of triangles. Since the raw mesh contained holes, certain flowers were separated from the rest of the mesh and thus removed in the cleaning process. These geometric inconsistencies are particularly problematic when integrating the 3DGS with the mesh for different applications, such as animation.

Rendered images with MVS. Figure 6 shows that rendering images from the same poses and then applying MVS significantly improves the quality of the reconstruction: there is less noise, and the number and size of holes are reduced. This indicates that re-rendering from a 3DGS model reduces camera distortion or improves the consistency of the camera poses, and overall enhances the accuracy of the reconstruction. However, compared to rendering stereo-calibrated images, this approach is still inferior, as holes remain abundant. This indicates a significant advantage in rendering stereo novel views in conjunction with a stereo matching algorithm.

Evaluation on MobileBrick [43]. Evaluation on the MobileBrick test set confirms that feeding rendered images to the same MVS model results in a smoother reconstructed surface, as evidenced by the higher recall, with a slight trade-off in accuracy. Overall, our method performs better, as evidenced by the higher recall and F1 and the lower Chamfer distance, even though the manually refined poses provided by the MobileBrick [43] dataset should give an advantage to MVSFormer [13]. We provide an example in Figure 7, with the full set of meshes available in the supplementary material.

Table 2: Ablation study results on MobileBrick [43] dataset. We compare our method against MVSFormer [13] with two types of inputs: the original images with the original refined poses, and the rendered images with the same poses.
| Method | Acc \(\uparrow\) (2.5mm) | Recall \(\uparrow\) (2.5mm) | F1 \(\uparrow\) (2.5mm) | Acc \(\uparrow\) (5mm) | Recall \(\uparrow\) (5mm) | F1 \(\uparrow\) (5mm) | Chamfer Distance (mm) \(\downarrow\) |
|---|---|---|---|---|---|---|---|
| MVSFormer [13] | 61.15 | 63.10 | 60.87 | 87.94 | 79.39 | 82.52 | 7.04 |
| MVSFormer + Rendered | 59.93 | 66.57 | 62.50 | 87.61 | 82.33 | 84.46 | 6.36 |
| Ours | 56.04 | 70.69 | 62.14 | 83.13 | 88.45 | 85.50 | 5.76 |

Figure 7: Example from MobileBrick [43] dataset, on the castle scene. From left to right: the ground truth mesh, reconstruction of MVSFormer [13] with original images, reconstruction of MVSFormer with rendered images, and our reconstruction.

5 Limitations


Figure 8: Examples of limitations of our method. On the left, we show a rendered image from the Caterpillar scene of the TnT [8] dataset, highlighting an area with “floater” Gaussians. On the right, the Truck scene from the TnT [8] dataset, highlighting the missing windshield.

Our pipeline, which integrates 3DGS with stereo matching, exhibits certain limitations that require discussion. We illustrate two issues in Figure 8. The first issue arises from the initial 3DGS phase. If the 3DGS fails to accurately capture the scene, for instance due to a limited number of viewpoints, this can lead to errors in the reconstructed surface. We illustrate this issue on the left side of Figure 8, which shows “floater” Gaussians that occur in regions not well represented by the original images. These Gaussians introduce anomalies that can mislead the stereo matching algorithm, leading to inaccuracies.

The second issue arises from the stereo matching algorithm, which is inherently susceptible to transparent surfaces; these can significantly compromise the accuracy of the reconstruction in the affected areas. We provide an example on the right side of Figure 8, illustrating how transparent surfaces can induce errors in the reconstruction process.

6 Conclusion

The novel approach we introduced for surface reconstruction from 3DGS-captured scenes significantly advances the capabilities of surface reconstruction methods. By leveraging the generation of stereo-calibrated novel views and applying a stereo matching algorithm, our method bypasses the limitations associated with direct surface reconstruction from Gaussian element locations. This strategy not only preserves the inherent properties of the 3DGS representation, but also enhances the accuracy and fidelity of the reconstructed surfaces. The combination of our approach with segmentation methods for object-specific surface reconstruction further demonstrates the versatility and efficiency of our methodology. Finally, our experimental results on the Tanks and Temples [8] and Mip-NeRF360 [1] datasets, as well as on real-world scenes captured with smartphones, demonstrate the superiority of our method over the current state-of-the-art method for surface reconstruction from Gaussian splatting models, offering both improved accuracy and significantly shorter computation times.

Surface Reconstruction from Gaussian Splatting via Novel Stereo Views
Supplementary Material

7 Additional Examples

7.1 Additional In-the-Wild Examples

In Figure 10 we present additional qualitative comparisons between reconstructions from our method and reconstructions from SuGaR [4], on in-the-wild videos taken by a smartphone in an uncontrolled environment. We run SuGaR according to the instructions in its official repository, with density regularization, no train/eval split, and 15000 refinement iterations.


Figure 10: Additional qualitative comparisons between our method and SuGaR [4] on surface reconstruction from in-the-wild videos.

7.2 Additional Examples from the Ablation Study

Figure 11 presents additional examples from the ablation study, showing MVSFormer[13], MVSFormer with rendered images as input, and our method. Figure 14 presents our results on the MobileBrick [43] test set, compared to MVSFormer [13] with original and rendered images as input.

Figure 11: Additional examples from the ablation study, showing reconstructions done by MVSFormer [13] on the original images and on the rendered images from the same poses, as well as our method’s reconstruction.


Figure 14: Left to right: ground truths of MobileBrick [43] dataset, reconstructions from MVSFormer [13] with original and rendered images, as well as our reconstruction.

8 Number of 3DGS Iterations

We claim that our method runs significantly faster than other reconstruction methods, and that the bottleneck of our runtime is the 3DGS optimization time. For a typical in-the-wild scene containing a central object, with roughly 80 images, 3DGS optimization takes 5-30 minutes on an Nvidia A40 GPU, depending on the number of iterations: 5-10 minutes for 7000 iterations, 10-20 minutes for 30000 iterations, and 20-30 minutes for 60000 iterations. In the examples shown throughout the paper, we ran 3DGS for 60000 iterations. In Figure 15, we show that we can reconstruct smooth, geometrically consistent surfaces even after 7000 and 30000 3DGS iterations. The ability to achieve adequate reconstructions in a much shorter runtime, if needed, further validates that our method can compete with MVS methods [13], neural reconstruction methods [2], [35], [40], and reconstruction methods from 3DGS [4], as a fast and accurate surface reconstruction method.

Figure 15: From left to right: One of the input images, and the reconstructed mesh using our method, after 7000, 30000 and 60000 3DGS iterations.

References

[1]
.
[2]
.
[3]
.
[4]
.
[5]
.
[6]
.
[7]
.
[8]
.
[9]
.
[10]
.
[11]
.
[12]
.
[13]
.
[14]
Cascade cost volume for high-resolution multi-view stereo and stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2495–2504 (2020).
[15]
Cost volume pyramid based depth inference for multi-view stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4877–4886 (2020).
[16]
.
[17]
.
[18]
.
[19]
.
[20]
.
[21]
CFNet: Cascade and fused cost volume for robust stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13906–13915 (2021).
[22]
.
[23]
Iterative geometry encoding volume for stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21919–21928 (2023).
[24]
.
[25]
.
[26]
.
[27]
.
[28]
.
[29]
.
[30]
.
[31]
.
[32]
.
[33]
.
[34]
.
[35]
NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction (2021).
[36]
.
[37]
.
[38]
.
[39]
.
[40]
.
[41]
.
[42]
.
[43]
.
[44]
.