Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo

Zongrui Li\(^{1,2}\)1  Zhan Lu\(^{2,4,*}\)2  Haojie Yan\(^{3,4}\) Boxin Shi\(^{5,6}\) Gang Pan\(^{3,4}\) Qian Zheng\(^{3,4,\ddag}\) Xudong Jiang\(^{1,2}\)
\(^1\)Rapid-Rich Object Search (ROSE) Lab, Interdisciplinary Graduate Programme, Nanyang Technological University
\(^2\)School of Electrical and Electronic Engineering, Nanyang Technological University
\(^3\)College of Computer Science and Technology, Zhejiang University
\(^4\)The State Key Lab of Brain-Machine Intelligence, Zhejiang University
\(^5\)National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
\(^6\)National Engineering Research Center of Visual Technology, School of Computer Science, Peking University
{zongrui001, zhan007, EXDJiang}@ntu.edu.sg, {hjyan, gpan, qianzheng}@zju.edu.cn, shiboxin@pku.edu.cn


Abstract

Natural Light Uncalibrated Photometric Stereo (NaUPS) relieves the strict environment and light assumptions in classical Uncalibrated Photometric Stereo (UPS) methods. However, due to the intrinsic ill-posedness and high-dimensional ambiguities, addressing NaUPS is still an open question. Existing works impose strong assumptions on the environment lights and objects' materials, restricting their effectiveness in more general scenarios. Alternatively, some methods leverage supervised learning with intricate models while lacking interpretability, resulting in biased estimations. In this work, we propose Spin Light Uncalibrated Photometric Stereo (Spin-UP), an unsupervised method to tackle NaUPS under various environment lights and for various objects. The proposed method uses a novel setup that captures the object's images on a rotatable platform, which mitigates NaUPS's ill-posedness by reducing unknowns and provides reliable priors to alleviate NaUPS's ambiguities. Leveraging neural inverse rendering and the proposed training strategies, Spin-UP recovers surface normals, environment light, and isotropic reflectance under complex natural light with low computational cost. Experiments have shown that Spin-UP outperforms other supervised/unsupervised NaUPS methods and achieves state-of-the-art performance on synthetic and real-world datasets. Codes and data are available at https://github.com/LMozart/CVPR2024-SpinUP.

1 Introduction↩︎

Natural light uncalibrated photometric stereo (NaUPS) [1] is proposed to relieve the dark-room and directional-light assumptions in classical uncalibrated photometric stereo, aiming to reconstruct the surface normal given images of an object captured under arbitrary environment light. The implications of NaUPS are far-reaching: it makes photometric stereo universal. However, solving NaUPS is still an open question because of the intrinsic ill-posedness introduced by the varying light of each image and the high-dimensional ambiguities between the light and objects [1].

Previous optimization-based methods use simple light models to represent the varying environment lights and Lambertian reflectance to represent the material [1]–[3]. These models help mitigate the ill-posedness and ambiguities to some extent but become ineffective for objects with general reflectance (e.g., non-Lambertian reflectance) under complex environment light, leading to unsatisfactory reconstruction outcomes. Besides, since they model the varying light of each image separately, the unknowns introduced by the light model increase with the resolution and number of images, restricting these methods to low-resolution inputs and limited image counts.

Considering the difficulties of explicitly mitigating the ill-posedness and ambiguities, recent advances [4], [5] turn to data-driven methods: they train a deep learning model on large-scale datasets and implicitly exploit deep light features from images to improve performance. However, those methods lack interpretability, making them hard to constrain during training. Consequently, the model may be affected by data bias and biased toward specific types of light sources [4] or reflectance variations [5] among images.

Despite persistent exploration in this research field, a method capable of handling general objects under natural light while remaining free from data bias is still missing. In this paper, we provide a new perspective on solving NaUPS. Specifically, we propose a novel setup that acquires images on a rotatable platform under a static environment light. In such a case, the object is illuminated by rotated versions of the same environment light. Representing such rotated lights requires fewer parameters than modeling the light of each image separately, since we model the lights as a uniform environment light varied by a low degree-of-freedom (DoF [6]) rotation. This frees us from significantly increasing the unknowns when adopting advanced parametric light models (e.g., spherical Gaussians) and reflectance models to handle general scenarios. Additionally, based on this setup, we further derive a reliable light initialization method by analyzing the pixel values at the object's occluding boundary. This light initialization helps the model converge at the beginning of training and alleviates the ambiguity between light and objects during optimization.

With the help of the proposed setup and light initialization method, we develop Spin Light Uncalibrated Photometric Stereo (Spin-UP), which addresses NaUPS by optimizing an inverse rendering framework in an unsupervised manner. To the best of our knowledge, this is the first unsupervised method that can handle general objects under natural light. Unlike previous methods, Spin-UP jointly reconstructs the environment light, isotropic reflectance, and complicated shapes through iterative optimization. To reduce the computational cost and improve convergence, we propose two strategies, interval sampling and shrinking range computing, which allow Spin-UP to be optimized with low GPU memory usage (about 5 GB) and a short running time (about 25 minutes). Experiments on synthetic and real-world datasets demonstrate our superior performance over previous methods in general scenarios. Overall, our contributions are summarized as follows:

  1. We design a novel setup for NaUPS, which reduces unknowns of light representation and facilitates solving NaUPS in an unsupervised manner.

  2. We introduce a light prior, which leverages an object’s occluding boundaries to initialize a reliable environment light. Based on the setup and light prior, we propose the unsupervised NaUPS method named Spin-UP.

  3. We present two training strategies for fast training and convergence of Spin-UP.

2 Related Work↩︎

In this section, we briefly review recent supervised and unsupervised NaUPS methods. We also summarize other techniques that exploit priors from occluding boundaries. Additionally, we discuss recent advances in 3D vision to distinguish Spin-UP from other neural inverse rendering approaches. Given that the focus of this paper is on NaUPS, a group of works reconstructing 3D surfaces from a single image by deep learning under natural light [7] or shading [8]–[10] is not included.

Natural Light Uncalibrated Photometric Stereo. Unsupervised NaUPS methods jointly recover the light, reflectance properties, and surface normal. These methods explicitly model the environment light by low-order spherical harmonics (SH) [3], [11], spatially varying spherical harmonics (SV-SH) [2], [12] or equivalent directional light (Eqv-Dir) [1] to mitigate the ill-posedness, and use integrability constraint [11], shape initialization [2], non-physical lighting regularization [3], or graph-based ambiguity relaxation [1] to alleviate the ambiguity. In contrast, supervised NaUPS methods [4], [5] apply deep learning models like transformers to reconstruct normal maps without explicitly estimating the environment light. The models are trained on a dataset containing images of diverse objects captured under various lighting conditions, including directional, point, and environment light. Compared to previous work, the proposed Spin-UP distinguishes itself in three key aspects: 1) it features a novel setup explicitly designed to model correlations among observed images to mitigate the ill-posedness of NaUPS, 2) leveraging this unique setup, a novel light initialization method is introduced to mitigate ambiguities, and 3) an advanced light and material model is implemented to address a broader range of scenarios.

Priors from the Boundaries. The occluding boundaries of an object are considered to reveal adequate information about the object's shape and the scene's light. Given that the projection of the boundary normals onto the xy-plane is perpendicular to the boundaries in orthographic projection, methods have been developed to constrain the surface normal estimation during iterative optimization [13] or to recover a rough shape to initialize the geometry in multi-view [14] or photometric stereo [2]. Other methods associate the boundary normals with the reflectance to estimate a rough position of the directional lights [15]. However, none of them derive the environment light from the boundary reflectance. Given the setup in Spin-UP, we can roughly estimate the environment light by analyzing the occluding boundaries and the corresponding pixel values. This approach provides a reliable light initialization that alleviates the ambiguity in NaUPS.

Inverse rendering in 3D Vision. Neural Radiance Fields (NeRF) [16] implicitly store a scene's shape and reflectance in MLPs optimized by inverse volume rendering. While NeRF can only recover coarse 3D shapes, several subsequent works [17]–[22] combine surface rendering and volume rendering techniques, recovering fine shapes under varying viewpoints but a static environment light. In contrast, the viewpoint in Spin-UP is static relative to the object. While most neural field methods aim to recover the whole 3D geometry, Spin-UP only recovers the object's surface.

3 Proposed Method↩︎

In Sec. 3.1, we explain the Spin-UP’s setup and how it reduces the unknowns. In Sec. 3.2, we introduce the light prior that alleviates ambiguities in NaUPS, including details of the light initialization method based on that prior. In Sec. 3.3, we describe the implementation details of the proposed Spin-UP framework and losses. In Sec. 3.4, we demonstrate two proposed training strategies.

Figure 1: The proposed image capturing setup. Left: an illustration of image-capturing equipment consisting of a rotatable platform, a camera, and the target object. We spin the platform in \(360^\circ\) and capture images of the object. The object and camera rotate together with the platform. Right-top: Four observed images at different positions. Right-bottom: Ground truth environment light. Dashed color boxes indicate the corresponding camera views.

3.1 Spin Light Setup↩︎

As shown in Fig. 1, we capture a sequence of images \(\boldsymbol{I} \triangleq \{I_j | j\in [1, ..., N_I]\}\) of an object3 by rotating it together with a linear perspective camera through \(360^\circ\) on a rotatable platform. Since the relative position and orientation between the camera and the object are fixed during rotation, each observed image is aligned with a rotated environment light \(\boldsymbol{L}(\boldsymbol{R}_j \cdot \boldsymbol{\omega})\), where \(\boldsymbol{\omega} \in \mathbb{R}^3\) is the incident light direction, \(\boldsymbol{R}_j=\boldsymbol{R}(\theta_j)\) is the 1-DoF [6] rotation matrix about the vertical axis, and \(\theta_j\) is the rotation angle with \(\theta_1=0\). Since we perform a \(360^{\circ}\) rotation and assume a constant velocity (though this is not strictly required in practice), \(\boldsymbol{R}_j\) can be initialized with \(\theta_j = 2\pi (j-1) / {N_I}\). Given the image sequence \(\boldsymbol{I}\) and the initialized \(\boldsymbol{R}\), Spin-UP iteratively optimizes the normal map \(\boldsymbol{N}\), the environment light \(\boldsymbol{L}\), the isotropic BRDF map \(\boldsymbol{M}\), and the rotation angles \(\boldsymbol{R}\) by solving \[\begin{align} \label{eq:inv_rend} \underset{\boldsymbol{L}, \boldsymbol{M}, \boldsymbol{N}, \boldsymbol{R}}{\arg \min } \sum_{i=1}^{N_P} \sum_{j=1}^{N_I} \mathcal{E}\left(\boldsymbol{m}_{ij}, \hat{\boldsymbol{m}}_{ij}\right), \end{align}\tag{1}\] where \(N_P\) is the number of sampled points on the surface, \(\boldsymbol{m}_{ij}\) and \(\hat{\boldsymbol{m}}_{ij}\) denote the ground truth and the estimate of point \(i\)'s color in image \(I_j\), respectively, and \(\mathcal{E}(\cdot,\cdot)\) is the loss function between \(\boldsymbol{m}_{ij}\) and \(\hat{\boldsymbol{m}}_{ij}\) (i.e., the mean absolute error). We adopt the rendering equation to compute the color \(\boldsymbol{\hat{m}}\)4: \[\begin{align} \begin{aligned} \label{eq:render_eq} \boldsymbol{\hat{m}} &=\int_\Omega s\boldsymbol{L}\left(\boldsymbol{\omega}\right) \rho\left(\boldsymbol{\omega} \cdot \boldsymbol{n}\right) \mathrm{d} \boldsymbol{\omega} \\ & =\int_\Omega s\boldsymbol{L}\left(\boldsymbol{\omega}\right) (\rho^s + \rho^d)\left(\boldsymbol{\omega} \cdot\boldsymbol{n}\right) \mathrm{d} \boldsymbol{\omega}, \end{aligned} \end{align}\tag{2}\] where \(\Omega\) is the upper hemisphere centered at the normal vector \(\boldsymbol{n}\), \(s\) is the cast shadow, and \(\rho^s\) and \(\rho^d\) denote the specular and diffuse reflectance, respectively. The ambiguities between the light \(\boldsymbol{L}\) and the object's reflectance \(\boldsymbol{M}\) are often disregarded [13], [23].
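To make the optimization concrete, below is a minimal PyTorch-style sketch of how Eq. (2) can be approximated under the spin-light setup: a single shared spherical Gaussian environment light is evaluated on incident directions rotated by each image's \(\boldsymbol{R}_j\), with a Lambertian stand-in for \(\rho\) and no cast shadow (\(s=1\)). The fixed-direction quadrature and all names (`sg_xi`, `sg_lambda`, `sg_mu`, `d_omega`) are illustrative assumptions, not the released implementation.

```python
import torch

def sg_envmap(dirs, sg_xi, sg_lambda, sg_mu):
    """Evaluate L(w) = sum_t mu_t * exp(lambda_t * (w . xi_t - 1)).

    dirs:      (S, 3) unit incident directions
    sg_xi:     (T, 3) unit lobe directions
    sg_lambda: (T, 1) lobe sharpness
    sg_mu:     (T, 3) lobe amplitudes (RGB)
    Returns (S, 3) radiance per direction.
    """
    cos = dirs @ sg_xi.t()                                           # (S, T)
    return (torch.exp(sg_lambda.t() * (cos - 1.0)).unsqueeze(-1) * sg_mu).sum(dim=1)

def render_lambertian(normals, albedo, R_j, dirs, d_omega, sg_xi, sg_lambda, sg_mu):
    """Quadrature of Eq. (2) with a Lambertian stand-in for rho and s = 1.

    normals: (P, 3), albedo: (P, 3), R_j: (3, 3) rotation of image j,
    dirs: (S, 3) fixed quadrature directions with solid-angle weights d_omega (S,).
    """
    L = sg_envmap(dirs @ R_j.t(), sg_xi, sg_lambda, sg_mu)           # L(R_j * w), (S, 3)
    cos = (normals @ dirs.t()).clamp(min=0.0)                        # (P, S), upper hemisphere only
    rho_d = albedo / torch.pi                                        # Lambertian BRDF
    return rho_d * ((cos * d_omega) @ L)                             # (P, 3)
```

Because the environment light is shared across all images, only the \(N_I\) rotation angles change from image to image, which is exactly the unknown reduction discussed next.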


Table 1: A comparison of type, unknowns’ number (number), and representation capacity (rep. capacity) of light model among Spin-UP and representative unsupervised NaUPS methods. The unknown number is calculated on \(50\) images (\(512\times512\)). ‘freq.’ represents frequency.
| Light model   | QL15 [3]  | HY19 [2]  | GM21 [1]  | Spin-UP    |
|---------------|-----------|-----------|-----------|------------|
| type          | SV-SH     | Global SH | Eqv-Dir   | SG         |
| number        | \(1.5\)K  | \(450\)   | \(1.4\)M  | \(434\)    |
| rep. capacity | low-freq. | low-freq. | low-freq. | high-freq. |

Figure 2: The proposed light initialization method in Spin-UP. We crop the boundary pixels \(m_b\) and normal \(\boldsymbol{n}_b\) from input images. Then, we remap them on the sphere and rotate them with their corresponding rotations \(\boldsymbol{R}\). Based on a light probe composed of gray-scale boundary pixels, we optimize the SG light model to obtain the environment light.

Unknowns reduction. The proposed spin light setup reduces the unknowns of the light representation \(\boldsymbol{L}\) by exploiting correlations between different images. Unlike previous NaUPS methods that model the light of each image separately, we consider a uniform environment light \(\boldsymbol{L}\) shared by all images, represented by a parametric model such as spherical Gaussians, plus a 1-DoF rotation angle \(\theta\) for each image. The unknowns therefore consist of the environment light model's parameters and the \(N_I\) rotation angles. The total number of unknowns is reduced compared to other methods (Table 1), which helps mitigate the ill-posedness and facilitates solving NaUPS with advanced light and reflectance models in an unsupervised manner.

3.2 Light Prior from Boundaries↩︎

Based on the spin light setup, we can exploit priors from the object's boundary for light initialization to alleviate the ambiguity. The idea is motivated by the observation that the pixel value \(m_b\) at an object's boundary provides insights into the environment light (see Fig. 2). For an object with occluding boundaries, the normals of those boundaries \(\boldsymbol{n}_b\) can be pre-computed [13], [23]. By combining \(m_b\), \(\boldsymbol{n}_b\), and \(\boldsymbol{R}\), we can roughly derive a light map indicating the light sources' positions and intensities, where \(m_b\) directly represents the light intensity \(\boldsymbol{L}(\boldsymbol{\omega}_b)\) at \(\boldsymbol{\omega}_b = \boldsymbol{R}\cdot\boldsymbol{n}_b\). However, the light map derived for objects with different materials may contain mismatched light source positions and chromatic bias, leading to inaccurate light initialization.

The mismatched light source positions are caused by the specular component \(m^s_b\) in \(m_b\). When \(m_b^s\) dominates, approximating \(\boldsymbol{L}(\boldsymbol{\omega}_b)\) by \(m_b\) becomes a biased estimation because \(m_b^s\) reflects light arriving from directions other than \(\boldsymbol{\omega}_b\) (Fig. 3). By contrast, the approximation is more reasonable when the diffuse component \(m_b^d\) dominates, since \(m^d_b=\int_{\Omega} \boldsymbol{L}\left(\boldsymbol{\omega}\right) \rho^d\left(\boldsymbol{\omega} \cdot \boldsymbol{\omega}_b\right) \mathrm{d} \boldsymbol{\omega}\) indicates that \(\boldsymbol{L}(\boldsymbol{\omega}_b)\) contributes most to the observed pixel value, making \(m^d_b\) a less biased proxy for \(\boldsymbol{L}(\boldsymbol{\omega}_b)\). Therefore, to obtain a less biased environment light for initialization, a diffuse filter \(\mathcal{F}^d(.)\) is applied to extract \(m^d_b\), alleviating the mismatched light source position issue. Similarly, a chromatic filter \(\mathcal{F}^c(.)\) is required to reduce the chromatic bias caused by the spatially varying material at the boundaries. The filtered pixel value \(\hat{m}_b^d=\mathcal{F}^c (\mathcal{F}^d\left(m_b\right))\) is the basis of our light initialization method.

Figure 3: An illustration of mismatched light source positions. Rows from top to bottom: an illustration of reflection on materials with different roughness; the initial environment light before applying any filters, given the objects' boundary pixels; the ground truth environment light. Objects are (a) a diffuse-dominant sphere, (b) a sphere with mixed diffuse and specular reflectance, and (c) a specular-dominant sphere. The yellow and red lines in rows 2 and 3 indicate the rough positions of the light sources.

Light initialization method. The light initialization procedure is summarized in Algorithm 4. It aims to derive an initial environment light model with parameters \(\Theta\). Specifically, we use \(N_L=64\) spherical Gaussian (SG) bases [24] as the light model, where \(\boldsymbol{L}(\boldsymbol{\omega}|\boldsymbol{\xi}_t, \lambda_t, \boldsymbol{\mu}_t) = \sum^{N_L}_{t=1} G(\boldsymbol{\omega}; \boldsymbol{\xi}_t, \lambda_t, \boldsymbol{\mu}_t)\), and \(\boldsymbol{\xi}_t\), \(\lambda_t\), and \(\boldsymbol{\mu}_t\) stand for the Gaussian lobe's direction, sharpness, and amplitude, respectively5. Inspired by [25], which indicates that diffuse reflectance can be approximated by low-frequency reflectance, we design \(\mathcal{F}^d(.)\) as a combination of a threshold filter \(\mathcal{F}^d_{TH}(.)\) that removes high-intensity reflectance (the brightest \(20\%\)) based on the boundary points' intensity profile [25], and a low-pass filter (i.e., a 3-order spherical harmonics filter6), noted as \(\mathcal{F}^d_{SH}\) [25], [26]. \(\mathcal{F}^d_{TH}(.)\) reduces the bias by removing the brightest parts, which are usually aligned with specular reflectance in the observed images, while \(\mathcal{F}^d_{SH}(.)\) extracts the low-frequency reflectance. We design \(\mathcal{F}^c(.)\) as a converter that transfers pixel values to gray-scale7, mitigating biases from spatially varying material.

Figure 4: Light Initialization Method
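As a rough illustration of Algorithm 4, the sketch below gathers boundary pixels, applies the threshold and gray-scale filters, remaps the samples onto the light sphere via \(\boldsymbol{\omega}_b = \boldsymbol{R}\cdot\boldsymbol{n}_b\), and fits the SG amplitudes to the resulting probe. The \(\mathcal{F}^d_{SH}\) low-pass step is omitted, the L1 fit and helper names are assumptions, and keeping \(\lambda_t\) frozen follows footnote 5.

```python
import torch

def init_sg_light(boundary_rgb, boundary_normals, rotations,
                  sg_xi, sg_lambda, n_iters=500, lr=1e-2):
    """Illustrative light initialization from occluding-boundary pixels.

    boundary_rgb:     (B, 3) boundary pixel values m_b gathered from all images
    boundary_normals: (B, 3) pre-computed boundary normals n_b
    rotations:        (B, 3, 3) rotation R of the image each pixel came from
    sg_xi, sg_lambda: fixed lobe directions / sharpness (e.g. Fibonacci lattice, frozen)
    Returns optimized lobe amplitudes mu (T, 1) for a gray-scale SG light model.
    """
    # F^d_TH: drop the brightest 20% of boundary pixels (likely specular).
    intensity = boundary_rgb.mean(dim=1)
    keep = intensity <= torch.quantile(intensity, 0.8)
    rgb, nrm, rot = boundary_rgb[keep], boundary_normals[keep], rotations[keep]

    # F^c: convert to gray-scale to suppress chromatic bias from varying base color.
    gray = rgb.mean(dim=1, keepdim=True)                              # (B', 1)

    # Remap each boundary pixel onto the light sphere: w_b = R * n_b.
    w_b = torch.einsum('bij,bj->bi', rot, nrm)                        # (B', 3)

    # Fit SG amplitudes so that L(w_b) ~ gray (SH low-pass omitted in this sketch).
    mu = torch.zeros(sg_xi.shape[0], 1, requires_grad=True)
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(n_iters):
        cos = w_b @ sg_xi.t()                                         # (B', T)
        pred = (torch.exp(sg_lambda.t() * (cos - 1.0)) * mu.t()).sum(dim=1, keepdim=True)
        loss = torch.nn.functional.l1_loss(pred, gray)
        opt.zero_grad(); loss.backward(); opt.step()
    return mu.detach()
```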

3.3 Framework of Spin-UP↩︎

With reliable initial SG lights and rotation matrices \(\boldsymbol{R}\), we develop Spin-UP based on neural inverse rendering [13], [17], [21], [23], using the rendering equation in Eq. (2).

Shape model. We use a neural depth field to represent the 3D surface: a multi-layer perceptron (MLP) predicts the depth value given the image coordinates. To compute normals from the depth map, we extend the normal fitting method described in [13] to perspective projection8.

Material model. We represent the spatially varying, isotropic reflectance with a modified Disney model [17]. The diffuse albedo \(\boldsymbol{\rho}^d\) is predicted by another MLP with a similar structure given the image coordinate. The spatially varying specular reflectance is computed as a weighted sum of \(N_S=12\) SG bases, i.e., \(\boldsymbol{\rho}^s=\sum_{n=1}^{N_S} c^n\mathcal{D}\left(\boldsymbol{v},\boldsymbol{\omega}\right)\mathcal{F}\left(\boldsymbol{h}, \boldsymbol{\omega}\right) \mathcal{G}(\boldsymbol{n}, \boldsymbol{\omega}, \boldsymbol{v}, \lambda_n)\), where \(\mathcal{D}\), \(\mathcal{F}\), and \(\mathcal{G}\) account for the micro-facet normal distribution, Fresnel effects, and self-occlusion, respectively; \(\boldsymbol{v}\) is the view direction; \(\boldsymbol{h}=(\boldsymbol{v}+\boldsymbol{\omega}) / \left\|\boldsymbol{v}+\boldsymbol{\omega}\right\|\) is the half-vector; \(\lambda_n\) are the roughness terms, initialized as \((0.1 + 0.9(n-1)) / (N_S-1)\) and set as learnable parameters; and \(c^n\) are the weights predicted by the MLP.
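As a loose illustration only, the sketch below evaluates a weighted sum of \(N_S\) spherical Gaussian lobes in which the full \(\mathcal{D}\cdot\mathcal{F}\cdot\mathcal{G}\) product is collapsed into a single isotropic lobe around the mirror-reflection direction; this is a common SG shorthand for specular shading, not the exact modified Disney model used in the paper, and the lobe parameterization (\(\lambda_n \approx 2/\text{roughness}_n^2\)) is an assumption.

```python
import torch

def sg_specular(normals, view, light_dirs, c, roughness):
    """Toy SG specular: sum_n c_n * exp(lambda_n * (r . w - 1)), lambda_n ~ 2 / roughness_n^2.

    normals:    (P, 3) unit surface normals
    view:       (P, 3) unit view directions
    light_dirs: (S, 3) unit incident directions
    c:          (P, N_S) per-point lobe weights predicted by the material MLP
    roughness:  (N_S,)   learnable per-lobe roughness
    Returns (P, S, 1) specular response per point and incident direction.
    """
    # Reflection of the view direction about the normal, used as the lobe axis.
    r = 2.0 * (normals * view).sum(-1, keepdim=True) * normals - view      # (P, 3)
    cos = torch.einsum('pi,si->ps', r, light_dirs).clamp(-1.0, 1.0)        # (P, S)
    lam = 2.0 / roughness.clamp(min=1e-3) ** 2                             # (N_S,)
    lobes = torch.exp(lam.view(1, 1, -1) * (cos.unsqueeze(-1) - 1.0))      # (P, S, N_S)
    return (lobes * c.unsqueeze(1)).sum(-1, keepdim=True)                  # (P, S, 1)
```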

Shadow model. We apply a shadow mask similar to that in [23] to handle cast shadows.

Loss functions. Like other inverse-rendering-based methods [13], [23], we use the inverse rendering loss (i.e., Eq. (1)) to train the framework. The three-stage schema [13] is applied, along with smoothness terms (total variation regularization [13], [23]) calculated as \(\operatorname{TV}(.) = \frac{1}{N_P} \sum_{i=1}^{N_P}\left|\frac{\partial (.)}{\partial x}+\frac{\partial (.)}{\partial y}\right|\), where \((x, y)\) are the image coordinates. We apply \(\operatorname{TV}(.)\) to the normal map (\(\boldsymbol{N}\)), the diffuse albedo map (\(\boldsymbol{A}\)), and the Gaussian bases' weights (\(c\)) of the material, and gradually drop it following the three-stage training schema [13]. Similar to [27], a normalized color loss calculated as \(\|\operatorname{Nor}(\boldsymbol{A})-\operatorname{Nor}(\boldsymbol{I})\|\) is used to help Spin-UP learn a better albedo representation, where \(\operatorname{Nor}(.)\) is vector normalization. Following [13], we compute the boundary loss as the cosine similarity between the pre-computed and estimated boundary normals9.
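A short sketch of the two regularizers, assuming maps stored as (H, W, C) tensors; it transcribes the \(\operatorname{TV}(.)\) formula and the normalized color loss above and is not the released code.

```python
import torch

def tv_loss(x):
    """TV(.) as written above: mean over pixels of |d/dx + d/dy| of the map x (H, W, C)."""
    dx = x[:, 1:, :] - x[:, :-1, :]          # forward difference along x
    dy = x[1:, :, :] - x[:-1, :, :]          # forward difference along y
    return (dx[:-1, :, :] + dy[:, :-1, :]).abs().mean()

def normalized_color_loss(albedo, image):
    """||Nor(A) - Nor(I)||: compare per-pixel chromaticity of albedo and observation."""
    a = torch.nn.functional.normalize(albedo, dim=-1)
    i = torch.nn.functional.normalize(image, dim=-1)
    return (a - i).norm(dim=-1).mean()
```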

Figure 5: Proposed training strategies. (Top) Interval sampling (IS). The high-resolution images are down-sampled into several low-resolution sub-images by extracting pixels with an interval of \(N_B\) (\(N_B=2\) in this example). (Bottom) Shrinking range computing (SRC). Far points (yellow circles) that are \(k\) points away (\(k=3\) in this example) w.r.t. the query position are selected to interpolate the close points' (blue circles') depths for normal calculation. During optimization, \(k\) is gradually reduced to 1.

3.4 Training Strategies↩︎

Optimizing Spin-UP requires smoothness terms to facilitate convergence and avoid local optima. However, those terms are usually computed on full-resolution images, leading to high computational costs. To reduce these costs, we propose a sampling strategy denoted as interval sampling (IS). To further improve convergence, we introduce another technique denoted as shrinking range computing (SRC).

Interval sampling (IS). IS samples ray batches from images to reduce the costs. Unlike random ray sampling [16] or patch-based sampling [28], IS preserves the object's shape. The idea is similar to the downsampling techniques in [5], [29], but we do not merge the sub-images back to full resolution. We experimentally find this strategy, i.e., training on down-sampled sub-images with smoothness terms, important for avoiding local optima (Sec. 5.2). Specifically, we divide the full-resolution image into non-overlapping blocks, each containing \(N_B\times N_B\) pixels. By extracting the pixel at the same position within each block (e.g., the top-left pixel), we obtain \(N_B\times N_B\) sub-images at a down-sampled resolution (see Fig. 5 for an illustration). During training, these sub-images are randomly sampled in each epoch, and the smoothness terms are calculated at the sub-image resolution, which reduces the computational cost and ensures the effectiveness of those terms.
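The sub-image extraction can be written as strided indexing, as sketched below, assuming images stored as an (N, H, W, C) tensor whose height and width are divisible by \(N_B\); the tensor layout is an assumption, while the random per-epoch selection follows the text.

```python
import torch

def interval_subimages(images, n_b):
    """Split (N, H, W, C) full-resolution images into n_b * n_b down-sampled sub-images.

    Sub-image (u, v) keeps the pixel at offset (u, v) inside every non-overlapping
    n_b x n_b block, so the object's overall shape is preserved at lower resolution.
    Returns a tensor of shape (n_b * n_b, N, H // n_b, W // n_b, C).
    """
    subs = [images[:, u::n_b, v::n_b, :] for u in range(n_b) for v in range(n_b)]
    return torch.stack(subs, dim=0)

def sample_subimages(images, n_b):
    """Pick one sub-image set at random per epoch; smoothness terms are then
    computed at the sub-image resolution."""
    subs = interval_subimages(images, n_b)
    idx = torch.randint(subs.shape[0], (1,)).item()
    return subs[idx]
```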

Table [tab:syn]: MAE (in degrees) of estimated surface normals on the synthetic dataset.

Shape group and reflectance group:

| Method   | Ball  | Bear  | Buddha | Reading | AVG   | Pot2 (D.) | Pot2 (S.) | Reading (D.) | Reading (S.) | AVG   |
|----------|-------|-------|--------|---------|-------|-----------|-----------|--------------|--------------|-------|
| HY19 [2] | 41.32 | 53.88 | 67.90  | 54.85   | 54.49 | 57.45     | 37.43     | 65.48        | 58.04        | 54.60 |
| S22 [4]  | 7.35  | 14.03 | 26.37  | 18.77   | 16.63 | 15.56     | 11.83     | 18.97        | 18.38        | 16.19 |
| S23 [5]  | 5.56  | 10.37 | 18.54  | 15.10   | 12.39 | 13.46     | 9.75      | 16.22        | 12.67        | 13.03 |
| Spin-UP  | 3.54  | 6.33  | 17.30  | 7.71    | 8.72  | 5.83      | 7.11      | 13.09        | 10.30        | 9.08  |

Light group and spatially varying material group:

| Method   | Cow (U.) | Cow (A.) | Cow (S.) | Cow (L.) | AVG   | Pot2 (D.) | Pot2 (S.) | Reading (D.) | Reading (S.) | AVG   |
|----------|----------|----------|----------|----------|-------|-----------|-----------|--------------|--------------|-------|
| HY19 [2] | 67.63    | 39.21    | 40.28    | 48.47    | 48.89 | 40.97     | 37.46     | 49.27        | 48.96        | 44.17 |
| S22 [4]  | 17.17    | 12.74    | 17.11    | 11.35    | 14.59 | 18.59     | 17.63     | 22.80        | 23.75        | 20.69 |
| S23 [5]  | 11.93    | 7.52     | 12.38    | 11.60    | 10.84 | 14.22     | 11.00     | 14.58        | 14.31        | 13.53 |
| Spin-UP  | 5.50     | 4.40     | 3.33     | 4.94     | 4.54  | 5.58      | 6.97      | 12.54        | 11.52        | 9.15  |

Shrinking range computing (SRC). Without merging the sub-images back into a full-resolution image in IS, an aliasing issue arises in the inverse rendering process. The issue stems from the fact that the normal calculation in our framework requires the depths of four adjacent points, which on sub-images are taken at the sub-image resolution, degrading the precision of the normal calculation10. Therefore, SRC is applied for anti-aliasing: it uses points adjacent to the query point (blue circles in Fig. 5) in the full-resolution image coordinates to calculate the normal for each pixel of the sub-images, maintaining the precision of the normal calculation. However, at the early stage of training, calculating the normal from the blue circles' depths is vulnerable to perturbations in per-pixel training. Therefore, SRC gradually selects points (yellow circles) from far (\(k=3\) points away) to close (the blue circles) w.r.t. the query point to interpolate the blue circles' depths, since normal calculation on far points' depths yields a smoother and more stable normal map at the early stage, which eventually improves convergence, as validated in Sec. 5.2.
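One way to realize SRC is sketched below: the neighbour offset \(k\) used by the normal-fitting stencil starts at 3 and shrinks to 1 as training proceeds, and the four axis-aligned neighbours \(k\) pixels away are gathered in full-resolution coordinates. The linear schedule and the replicate-padded gather are illustrative assumptions, not the released implementation.

```python
import torch

def src_offset(epoch, total_epochs, k_start=3):
    """Shrink the neighbour offset k from k_start down to 1 over training."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return max(int(round(k_start - (k_start - 1) * frac)), 1)

def neighbour_depths(depth, k):
    """Gather the four axis-aligned neighbours k pixels away in the full-resolution grid.

    depth: (H, W) full-resolution depth map; returns (4, H, W) with edge replication.
    Early in training these far-point depths stand in for the immediate neighbours
    used by the normal-fitting stencil; as k shrinks to 1 the exact neighbours are used.
    """
    pad = torch.nn.functional.pad(depth[None, None], (k, k, k, k), mode='replicate')[0, 0]
    h, w = depth.shape
    up    = pad[0:h,     k:k + w]
    down  = pad[2 * k:,  k:k + w]
    left  = pad[k:k + h, 0:w]
    right = pad[k:k + h, 2 * k:]
    return torch.stack([up, down, left, right], dim=0)
```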

4 Experiments↩︎

We validate the effectiveness of the proposed Spin-UP on synthetic and real-world data. We use mean angular error (MAE) to evaluate the quality of the reconstructed normal maps, and PU-PSNR [30] and PU-SSIM [30] to evaluate the reconstructed environment light. Since no existing datasets follow our spin light setup, we collect datasets using Blender and our device.
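For reference, MAE can be computed as below; this is a standard mean angular error implementation, assuming unit-length normal maps and a foreground mask, not code taken from the paper.

```python
import torch

def mean_angular_error(n_est, n_gt, mask):
    """Mean angular error (degrees) between estimated and ground-truth normal maps.

    n_est, n_gt: (H, W, 3) unit normals; mask: (H, W) boolean foreground mask.
    """
    cos = (n_est * n_gt).sum(dim=-1).clamp(-1.0, 1.0)
    ang = torch.rad2deg(torch.acos(cos))
    return ang[mask].mean()
```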

4.1 Evaluation on Synthetic Datasets↩︎

We collect several objects, environment maps, and materials to render the synthetic dataset with Blender Cycles. Specifically, five shapes from the DiLiGenT-MV dataset [31] (i.e., Buddha, Bear, Cow, Pot2, and Reading), a generated shape (Ball), five HDR environment maps (i.e., Landscape, Quarry, Urban, Attic, Studio), two PBR materials (i.e., Rusty Steel, Leather), and four synthetic materials (i.e., Voronoi Diff, Voronoi Spec, Green Diff, Green Spec)11 are used for evaluation. We devise four groups of data for evaluation: the shape group, light group, reflectance group, and spatially varying material group, each containing four scenes. For each scene, 50 observed images with a resolution of \(512 \times 512\) are rendered by a perspective camera with a focal length of \(50\,\mathrm{mm}\) and a frame size of \(36\,\mathrm{mm}\times 36\,\mathrm{mm}\). The camera rotation \(\theta\) between consecutive images follows a non-uniform rotation velocity. We compare Spin-UP with three advanced NaUPS methods, including two supervised NaUPS methods (S22 [4] and S23 [5]) and one unsupervised UPS method (HY19 [2])12.

Normal estimation comparison. According to the results in Table [tab:syn], Spin-UP outperforms all other NaUPS methods. Specifically, in the shape group, the low MAE on Ball, Bear, Buddha, and Reading in Rusty Steel rendered under Quarry indicates the practicability of Spin-UP for various shapes. In the light group, the low variance of MAE (\(0.81^\circ\) for Spin-UP vs. \(1.94^\circ\) for S23 [5]) on Cow in Leather rendered under Quarry, Urban, Attic, and Studio demonstrates robustness to different environment lights. In the reflectance group, the results on Pot2 and Reading in Green Diff and Green Spec rendered under Landscape demonstrate the ability to handle non-Lambertian objects. In the spatially varying material group, the results on Pot2 and Reading rendered in Voronoi Diff or Voronoi Spec under Landscape prove adaptability to challenging scenarios. A comparative analysis of our results on the reflectance group and the spatially varying material group reveals that the MAE remains relatively consistent across identical objects with different materials, underscoring the robustness of Spin-UP to diverse materials. We also find that Spin-UP sometimes performs better on specular objects than on diffuse objects (e.g., Reading). We attribute this to the high-frequency details in specular reflectance, which may be useful for shape-light reconstruction.

Figure 6: The visual quality comparison of light between the estimate by Spin-UP (columns 2-5) and the ground truth (column 1) on the four groups, i.e., shape (row 1), reflectance (row 2), spatially varying material (row 3), and light (rows 4-5). The intensity of the estimated light maps is scaled for visualization.

Figure 7: Illustration of the device for real data collection. Please refer to the supplementary material for more details. Left: the capture system contains a camera, a spin platform, and an object. Right: Captured images and paired mirror balls (as light reference) of two objects in the indoor and outdoor scenes, respectively.

Light estimation comparison. Fig. 6 provides a qualitative comparison between the estimated environment light and the ground truth on four groups. We can observe that the learned light map reflects the position of the light source, especially in Cow (S.) given such a challenging setup without any prior information about the object material or shape. It also reconstructs a reasonable light map for objects with diffuse reflectance, such as Pot2 (D.) and Reading (D.), further highlighting the effectiveness of the proposed light initialization. However, it should be noted that the estimated environment light is influenced by the object material, particularly when handling objects with a nearly uniform base color (e.g., reflectance group) or spatially varying material, which may generate unpleasant artifacts (e.g., inconsistent color in estimated environment light of Reading (D.) in the spatially varying material group). Those artifacts are hard to eliminate without priors.

4.2 Evaluation on Real-world Datasets↩︎

We set up our spin light capture system to collect real data in indoor and outdoor scenes, as shown in Fig. 7. After preprocessing, we end up with 50 images for each object, at a \(540 \times 540\) resolution. Here, we showcase the normal estimation results on two objects (i.e., Soldier and Player) under indoor and outdoor environment lights in Fig. 8, compared with S23 [5] and S22 [4]. We do not show HY19 [2]'s results as it failed on the captured data.

Figure 8: Qualitative comparison of the estimated normal map and environment light on Soldier and Player captured outdoors (columns 1-6) and indoors (columns 7-12), between ours (columns 3, 9), S23 [5] (columns 5, 11), and S22 [4] (columns 6, 12). We also show the reference sphere and reference environment light (columns 2, 7), and the estimated light (columns 4, 10).

Normal estimation comparison. According to Fig. 8, Spin-UP achieves competitive performance compared to the state-of-the-art supervised method [5]. In some scenarios, we recover results whose overall normal distribution is more consistent with the reference sphere, particularly for Soldier indoor and Player indoor. Spin-UP effectively captures high-frequency details such as the wrinkles on the clothes of Player and Soldier. By contrast, S23 [5] may contain artifacts (i.e., an incorrect normal distribution) even though it recovers more details than ours, and S22 [4] generates over-smooth results. By comparing indoor and outdoor results, we observe that the performance of S23 [5] degrades significantly in indoor scenarios, which may be attributed to data bias and low pixel variance, while our method is not greatly affected.

5 Ablation Study↩︎

5.1 Light Initialization Validation↩︎

To comprehensively validate the effectiveness of our light initialization method, we conduct experiments in two aspects: a comparison of different light initialization methods and the effectiveness of the filters.


Table 2: Ablation studies on Spin-UP’s alternatives regarding average PU-PSNR [30] and PU-SSIM [30] on synthetic dataset. ‘Fib.’ and ‘Rand.’ represent the Fibonacci and random initialization method, respectively. Bold number indicates the best results.
| Metric                   | w Rand. | w Fib. | Spin-UP |
|--------------------------|---------|--------|---------|
| PU-PSNR [30] \(\uparrow\) | 16.86   | 18.98  | 22.06   |
| PU-SSIM [30] \(\uparrow\) | 0.45    | 0.52   | 0.60    |

Comparison of light initialization methods. We compare our light initialization method with two widely used SG light initialization methods, i.e., random initialization (noted as 'w Rand.') and the Fibonacci lattice [17] (noted as 'w Fib.'). Quantitative comparisons of the reconstructed normal and light maps are shown in Table [tab:lgt_quant] and Table [tab:abl]. Compared with 'w Rand.', the improvement on average (\(1.75^\circ\) reduction in MAE for normal estimation and \(5.20\) / \(0.15\) increase in PU-PSNR / PU-SSIM for light estimation, respectively) indicates our method's adaptability. Compared with 'w Fib.', we observe an obvious advantage in the shape group (\(0.61^\circ\) reduction in MAE), but a smaller advantage in the spatially varying material group (\(0.19^\circ\) reduction in MAE). This is because the material and shape affect the quality of the initial environment light: while the estimated environment light is most accurate for smooth geometry with simple material, its quality degrades for complex geometry and spatially varying material.

Comparison of the designed filters. We compare Spin-UP with three alternatives (i.e., w/o \(\mathcal{F}^c\), w/o \(\mathcal{F}^d_{SH}\), and w/o \(\mathcal{F}^d_{TH}\)). The results in Table [tab:abl] demonstrate the effectiveness of those filters. Specifically, dropping \(\mathcal{F}^d_{TH}\) leads to mismatched light source positions in the initialized environment light caused by specular reflectance, which eventually affects the accuracy of the estimated normals; dropping \(\mathcal{F}^d_{SH}\) harms the performance, especially in the spatially varying material group (\(6.58^{\circ}\) increase in MAE), since \(\mathcal{F}^d_{SH}\) is essential for extracting the low-frequency reflectance used to initialize the environment light; dropping \(\mathcal{F}^c\) increases the average MAE by \(0.32^{\circ}\), illustrating the necessity of reducing chromatic bias.

5.2 Training Strategies Validation↩︎

The interval sampling facilitates the training of Spin-UP in two ways. First, the training time is less than half (about 25 min per object on average vs. 60 min on average, depending on the number of valid object pixels in the images), and the GPU memory occupation is about five times smaller (around 5 GB vs. 25 GB during training) compared with training directly at full image resolution. Second, comparing Spin-UP with 'w/o Intv.', which applies a random sampling strategy and calculates the smoothness terms on patches (\(3 \times 3\) pixels), we find that the performance drops by \(0.52^\circ\) on average, and most (\(0.97^\circ\)) in the spatially varying material group. This is because patch-based smoothness may not act uniformly on different parts of the object, diminishing the effectiveness of the smoothness terms, especially on objects with spatially varying materials and abrupt texture changes.


Table 3: Ablation studies on Spin-UP’s alternatives regarding average MAE on four groups (shape, light, reflectance, and spatially varying material group). ‘Fib.’, ‘Rand.’, ‘Intv.’, and ‘Shrk.’ represent the Fibonacci initialization method, random initialization, interval sampling, and shrinking range computing, respectively. Bold number indicates the best results in MAE.
| Method                      | Shape | Light | Ref.  | SV.   | AVG   |
|-----------------------------|-------|-------|-------|-------|-------|
| w Rand.                     | 12.02 | 6.44  | 9.79  | 10.28 | 9.60  |
| w Fib.                      | 9.33  | 5.34  | 10.20 | 9.34  | 8.55  |
| w/o Intv.                   | 8.87  | 5.02  | 9.83  | 10.12 | 8.37  |
| w/o Shrk.                   | 10.10 | 4.92  | 10.12 | 10.43 | 8.81  |
| w/o \(\mathcal{F}^c\)       | 9.44  | 4.61  | 9.24  | 9.40  | 8.17  |
| w/o \(\mathcal{F}^d_{SH}\)  | 9.75  | 8.29  | 8.93  | 15.73 | 10.41 |
| w/o \(\mathcal{F}^d_{TH}\)  | 9.30  | 5.04  | 9.18  | 12.49 | 8.82  |
| Spin-UP                     | 8.72  | 4.54  | 9.08  | 9.15  | 7.85  |
| S23 [5] (point light)       | 12.42 | 8.56  | 12.52 | 12.33 | 11.46 |
| Spin-UP (point light)       | 11.62 | 9.25  | 11.07 | 9.07  | 9.48  |

The last two rows are evaluated on the dataset with a dominant point light added to the environment light (Sec. 5.3).

The shrinking range computing helps avoid local optima when training Spin-UP on down-sampled images while still using full-resolution image coordinates for normal calculation. We compare Spin-UP with the alternative 'w/o Shrk.', which does not implement this strategy. The average MAE on normal estimation over the four groups increases by \(0.96^\circ\), highlighting the importance of this strategy.

5.3 Additional Validation on Point Light Source↩︎

To ensure a fairer comparison with the state-of-the-art supervised method (S23 [5]), we add a dominant point light to the environment light in the synthetic and real-world datasets13. According to Table [tab:abl], the proposed Spin-UP has a lower MAE on estimated normal maps than S23 [5] on the synthetic dataset (\(9.48^\circ\) ours vs. \(11.46^\circ\) S23 [5]). As shown in Fig. 9, we obtain a visually comparable result on the real-world dataset given a far-field point light (\(2\,\text{m}\)) and a better result given a near-field point light (\(0.4\,\text{m}\)), validating the adaptability of Spin-UP to unseen light sources.

Figure 9: (a) Illustration of new light setup. (b) Qualitative comparison on Soldier between our method and S23 [5] with dominant point light.

6 Conclusion↩︎

This paper proposes Spin-UP to address NaUPS in an unsupervised manner. Thanks to our setup that mitigates the ill-posedness, the light initialization method that alleviates the ambiguities of NaUPS, and the proposed training strategies that facilitate fast convergence, Spin-UP can recover surfaces with isotropic reflectance under various lights. Experiments on synthetic and real-world datasets have shown that Spin-UP is robust to various shapes, lights, and reflectances.

Limitations and future work. Although Spin-UP is efficient and robust in solving NaUPS, it has several limitations: 1) Spin-UP assumes infinitely far light sources, which omits spatially varying lighting; 2) the materials' base color will bias the estimated environment light; 3) Spin-UP assumes objects to have isotropic reflectance and ignores inter-reflections and anisotropic features, meaning that it cannot perform well on objects with anisotropic reflectance, such as aluminum, or strong inter-reflections, such as a glass bowl; 4) Spin-UP does not compute the shadow iteratively, which may result in artifacts on objects with complicated shapes. Overcoming the above limitations will be regarded as our future work. Besides, the rotation axis of our device does not align with the object's center due to the consideration of structural stability, which may introduce bias from spatially varying light in the observed images. Redesigning the image capture device is also part of our future work. Finally, we find it interesting to improve the setup by relaxing the requirement of a single-axis \(360^\circ\) rotation to free rotations for easier implementation on portable devices.

Acknowledgments. This work is supported by the National Natural Science Foundation of China under Grant No. 62136001, 62088102, Rapid-Rich Object Search (ROSE) Lab, Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore, and the State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China.

Supplementary Material↩︎

In this supplementary material,

  1. we give more implementation details in Sec. 7, including details of framework structure (footnote 6) and hyperparameters setup (footnote 7).

  2. we introduce more about boundary normal calculation and normal calculation for rendering equation in perspective projection in Sec. 8 (footnote 6);

  3. we provide an overview of the synthetic and real-world dataset in Sec. 9. We also explain how we collect and preprocess the real-world dataset;

  4. we showcase a qualitative comparison between Spin-UP and other methods on the real-world dataset in Sec. 10 (footnote 10). More results on the real-world dataset are also included in this section (footnote 10).

7 Implementation Details↩︎

7.1 Network Structure↩︎

We use MLP structures similar to those in [13], [23], shown in Fig. 10. The input to the MLPs is a pixel's 2D image coordinate \(p=(x, y)\), which passes through a positional encoding module similar to [16], calculated as \[\begin{align} \begin{aligned} \gamma(p)= \left(\sin \left(2^0 \pi p\right), \cos \left(2^0 \pi p\right), \cdots, \sin \left(2^{L_p-1} \pi p\right), \cos \left(2^{L_p-1} \pi p\right)\right), \end{aligned} \end{align}\] where \(L_p\) is the positional code's dimension, set as 10 for \(\gamma^1(.)\) and 6 for \(\gamma^2(.)\).
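A direct transcription of the positional encoding above, assuming 2D pixel coordinates stacked in a (P, 2) tensor:

```python
import torch

def positional_encoding(p, num_freqs):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)).

    p: (P, 2) pixel coordinates; num_freqs: L_p (10 for gamma^1, 6 for gamma^2).
    Returns (P, 2 * 2 * num_freqs) encoded coordinates.
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32) * torch.pi   # (L,)
    angles = p.unsqueeze(-1) * freqs                                         # (P, 2, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)          # (P, 2, 2L)
    return enc.reshape(p.shape[0], -1)
```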

Figure 10: Network structure of MLPs for depth and material estimation in Spin-UP.

7.2 Loss Functions and Hyperparameters Setup↩︎

In Spin-UP, we implement:

  1. L1 inverse rendering loss \(L_{r}\) calculated as, \(\sum_{i=1}^{N_P} \sum_{j=1}^{N_I} |\boldsymbol{m}_{i j} - \hat{\boldsymbol{m}}_{i j}|\).

  2. Normalized color loss \(L_{\mathrm{color}}\), calculated as, \(\lambda_c \|\mathrm{Nor}(\boldsymbol{A}) - \mathrm{Nor}(\boldsymbol{I})\|\), where \(\lambda_c=0.5\).

  3. Boundary loss \(L_{\mathrm{b}}\), calculated as the cosine similarity between the pre-computed and estimated boundary normal.

  4. Smoothness terms \(L_{\mathrm{sm}}\) on the albedo map \(\boldsymbol{A}\), the normal map \(\boldsymbol{N}\), and the spatially varying Gaussian bases' weights \(c^n\), calculated as \[\begin{align} \begin{aligned} L_{\mathrm{sm}} =\frac{\lambda}{N_P} \sum_{i=1}^{N_P}\left|\frac{\partial \boldsymbol{A}}{\partial x} +\frac{\partial \boldsymbol{A}}{\partial y}\right| +\frac{\lambda_N}{N_P} \sum_{i=1}^{N_P}\left|\frac{\partial \boldsymbol{N}}{\partial x} +\frac{\partial \boldsymbol{N}}{\partial y}\right| + \frac{\lambda_S}{N_P} \sum_{n=1}^{N_S}\sum_{i=1}^{N_P} \left|\frac{\partial c^n_i}{\partial x} + \frac{\partial c^n_i}{\partial y}\right|, \end{aligned} \end{align}\] where \(\lambda = 0.01\), \(\lambda_N = 0.02\), and \(\lambda_S = 0.01\).

We train Spin-UP in three stages, similar to [13]. In the first stage, the loss \(\mathcal{L}_\mathrm{stage1}\) is calculated as below for faster convergence: \[\begin{align} \begin{aligned} \mathcal{L}_\mathrm{stage1} = \mathcal{L}_r + \mathcal{L}_{\mathrm{b}} + \lambda_c \mathcal{L}_\mathrm{color} + \mathcal{L}_\mathrm{sm}. \end{aligned} \end{align}\] In the second stage, we drop the smoothness term on the albedo map and set \(\lambda_N\) to 0.05 for detail refinement, where \(\mathcal{L}_N=\operatorname{TV}(\boldsymbol{N})\) is the smoothness term on the normal map: \[\begin{align} \begin{aligned} \mathcal{L}_\mathrm{stage2} = \mathcal{L}_r + \mathcal{L}_{\mathrm{b}} + \lambda_c \mathcal{L}_\mathrm{color} + \lambda_{N} \mathcal{L}_N. \end{aligned} \end{align}\] In the third stage, we drop the smoothness term \(\mathcal{L}_N\) to further refine the details: \[\begin{align} \begin{aligned} \mathcal{L}_\mathrm{stage3} = \mathcal{L}_r + \mathcal{L}_{\mathrm{b}} + \lambda_c \mathcal{L}_\mathrm{color}. \end{aligned} \end{align}\] The three stages take 500, 1000, and 500 epochs, respectively. During training, we use Adam as the optimizer with a learning rate \(\alpha=0.001\) and a batch size of 4 images per iteration.

8 Normal Calculation in Perspective View↩︎

8.1 Boundary Normal Calculation↩︎

Figure 11: An illustration of occluding boundaries’ normal relationship with view directions for (a) front view and (b) side view of a surface. The dotted line in (b) indicates the outermost boundaries of an object in perspective projection.

Figure 12: An illustration of (a) the adjacent points' positions for the normal fitting method [13] in perspective projection, and (b) Eq. (4).

In perspective projection, the surface normal at an occluding boundary \(B(x, y)\) is perpendicular to both the boundary and the view direction \(\boldsymbol{v}\), as shown in Fig. 11. Therefore, the boundary normal \(\boldsymbol{n}^b\) satisfies \[\begin{align} \begin{aligned} \boldsymbol{n}^b \cdot \boldsymbol{v}^b = 0, ~\boldsymbol{n}^b \cdot (\frac{\partial B}{\partial x}, \frac{\partial B}{\partial y}, 1)^\top= 0. \end{aligned} \end{align}\] In practice, the outer boundaries of an object in images may not precisely match its actual boundaries due to limited image resolution. Therefore, we add a small offset (\(\beta=0.1\)) to make the pre-computed boundary normals more accurate: \[\begin{align} \begin{aligned} \boldsymbol{n}^b = \mathrm{Nor}(n^{bx}, n^{by}, n^{bz} + \beta). \end{aligned} \end{align}\]
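A minimal sketch of the boundary-normal pre-computation is given below, under the simplifying (near-orthographic) assumption that the boundary normal's xy-component follows the outward gradient of the binary mask \(B\) and its z-component is just the offset \(\beta\) before normalization; the central-difference gradient is an implementation assumption and the view-direction constraint of the perspective case is ignored.

```python
import torch

def boundary_normals(mask, beta=0.1):
    """Approximate occluding-boundary normals from a binary mask B (H, W).

    The xy-component is the negative central difference of B (so it points outward,
    away from the object); a small z offset beta compensates for limited image
    resolution, and the result is normalized. Values are only meaningful on
    boundary pixels, where the mask gradient is non-zero.
    """
    m = mask.float()
    gx = torch.zeros_like(m)
    gy = torch.zeros_like(m)
    gx[:, 1:-1] = (m[:, :-2] - m[:, 2:]) * 0.5      # -dB/dx
    gy[1:-1, :] = (m[:-2, :] - m[2:, :]) * 0.5      # -dB/dy
    n = torch.stack([gx, gy, torch.full_like(m, beta)], dim=-1)
    return torch.nn.functional.normalize(n, dim=-1)
```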

8.2 Normal Calculation For Rendering Equation↩︎

The normal fitting method [13] in orthographic projection is shown below: \[\begin{align} \begin{aligned} \boldsymbol{n} & =\sum_{k=1}^4 \gamma^k \boldsymbol{n}^k=\sum_{k=1}^4 \gamma^k \operatorname{Nor}\left[\left(\boldsymbol{p}^{k+1}-\boldsymbol{p}\right) \times\left(\boldsymbol{p}^k-\boldsymbol{p}\right)\right]^{\top}, \\ \gamma^k & =\frac{\left|d^k\right|^{-1}}{\sum_{k=1}^4\left|d^k\right|^{-1}}, \quad d^k=z^k+z^{k+1}-2 z, \end{aligned} \label{eq:nml_fit} \end{align}\tag{3}\] where \(\boldsymbol{p}^k=(x^k, y^k, z^k)\) is an adjacent point of the query point \(\boldsymbol{p}=(x, y, z)\) with \(x, y \in [-1, 1]\), and \(\boldsymbol{p}^{k+1}\) wraps around to \(\boldsymbol{p}^{1}\) when \(k + 1 > 4\), as shown in Fig. 12 (a). To extend the normal fitting method to perspective projection, we first compute the points' coordinates in the camera coordinate system by \[\begin{align} \begin{aligned} \boldsymbol{p}^{k\prime} & =(x^k\frac{z^k}{f}s_x, y^k\frac{z^k}{f}s_y, z^k), \\ \boldsymbol{p}^{\prime} & =(x\frac{z}{f}s_x, y\frac{z}{f}s_y, z), \end{aligned} \label{eq:persp} \end{align}\tag{4}\] where \(f\) is the camera's focal length, and \(s_x\) and \(s_y\) are half of the width and height of the camera's frame. Replacing \(\boldsymbol{p}^{k}\) and \(\boldsymbol{p}\) in Eq. (3) by \(\boldsymbol{p}^{k\prime}\) and \(\boldsymbol{p}^{\prime}\), we obtain the normal fitting method in perspective projection.
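The modified normal fitting of Eqs. (3)–(4) can be sketched as follows, assuming per-pixel depths, normalized image coordinates in \([-1, 1]\), and the four adjacent points gathered beforehand; the vectorization and variable names are assumptions.

```python
import torch

def fit_normal_perspective(p_xy, z, neighbors_xy, neighbors_z, f, s_x, s_y):
    """Normal fitting of Eq. (3) after lifting points to camera space with Eq. (4).

    p_xy:         (P, 2) query pixel coords in [-1, 1];      z:           (P,) depths
    neighbors_xy: (P, 4, 2) the four adjacent pixel coords;  neighbors_z: (P, 4) depths
    f, s_x, s_y:  focal length and half of the frame width / height.
    """
    def lift(xy, depth):
        # Eq. (4): scale normalized coordinates by depth / f and the half frame size.
        scale = torch.stack([depth / f * s_x, depth / f * s_y], dim=-1)
        return torch.cat([xy * scale, depth.unsqueeze(-1)], dim=-1)     # (..., 3)

    p = lift(p_xy, z)                                    # (P, 3)
    q = lift(neighbors_xy, neighbors_z)                  # (P, 4, 3)
    q_next = torch.roll(q, shifts=-1, dims=1)            # neighbour k+1 (wraps 4 -> 1)

    # Per-triangle normals n^k = Nor[(p^{k+1} - p) x (p^k - p)].
    n_k = torch.cross(q_next - p.unsqueeze(1), q - p.unsqueeze(1), dim=-1)
    n_k = torch.nn.functional.normalize(n_k, dim=-1)     # (P, 4, 3)

    # Weights gamma^k from d^k = z^k + z^{k+1} - 2z.
    d_k = neighbors_z + torch.roll(neighbors_z, -1, dims=1) - 2.0 * z.unsqueeze(-1)
    w_k = 1.0 / d_k.abs().clamp(min=1e-6)
    gamma = w_k / w_k.sum(dim=-1, keepdim=True)          # (P, 4)

    n = (gamma.unsqueeze(-1) * n_k).sum(dim=1)
    return torch.nn.functional.normalize(n, dim=-1)
```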

9 Datasets↩︎

9.1 Synthetic Dataset↩︎

In Fig. 13, we showcase all 5 objects with 6 materials under 5 HDR environment maps, rendered by Blender Cycles14. This results in 16 scenes15 of synthetic data, classified into 4 groups, i.e., the shape group, light group, reflectance group, and spatially varying material group.

Figure 13: (a) HDR environment maps (row 1), objects (row 2), and materials (row 3) involved in the synthetic dataset. Each figure in row 2 consists of two subfigures for 3D model preview (left) and normal map (right). Each figure in row 3 consists of two subfigures for material rendered on a sphere (left) and albedo (right). (b) Example images from each scene in four groups.

| Properties   | Soldier | Player | Dancer | Policeman | Eevee |
|--------------|---------|--------|--------|-----------|-------|
| Length (cm)  | 9.50    | 11.50  | 4.00   | 4.00      | 4.00  |
| Width (cm)   | 7.00    | 11.00  | 5.00   | 4.00      | 4.00  |
| Height (cm)  | 3.00    | 28.00  | 4.00   | 9.00      | 9.00  |
| Distance (m) | 0.90    | 0.90   | 0.40   | 0.40      | 0.30  |

9.2 Real-world Dataset↩︎

The real-world dataset contains 5 objects captured under indoor and outdoor environments with spatially-varying materials. The five real-world objects used in our study are the Soldier, Player, Policeman, Dancer, and Eevee. The objects’ sizes are shown in Table [tab:32object].

Device introduction. The observed images of Soldier, Player, Policeman, and Dancer were captured by a customized device shown in Fig. 14 (left), which consists of two stands (one for holding the subject being photographed, the other for supporting the camera) and a rotating mechanism. The distance from the camera to the object is adjustable. In addition, we also consider a more portable device shown in Fig. 14 (right), which is made up of a wooden rotatable platform16 with a diameter of 39 cm and a camera. We capture Eevee's observed images with this device.

Figure 14: Left: (a) Overview of the device, (b) Stand for the camera, (c) Stand for the object being photographed with dark cloth for interreflection removal, (d) Rotating hinge. Right: A portable version of image capturing device, shown in top and bottom views.

Photographing requirements. Before photographing, the distance between the camera and the object is determined based on the proportion of the object in the viewfinder, ensuring the object occupies an appropriate portion of the frame. Three typical distances were used: 0.9 meters for large objects and 0.4 meters (or 0.3 meters) for small objects. During photographing, the rule of thumb is to capture clear images with little noise and keep the rotation velocity as uniform as possible. For the camera's parameters, we chose ISO 1600 and an aperture of f/13 for outdoor scenes, and ISO 3200 and an aperture of f/6.3 for indoor scenes. The focal length is fixed at 31 mm for all scenes.

Pre-processing pipeline. In the pre-processing pipeline, we extract 50 images from the video at equal intervals to use as our data. We then obtain the object's mask in each scene from the first frame using Photoshop. These masks help separate the object from the background. In practice, there are translational motions in the horizontal and vertical directions, most obvious on the objects, due to structural instability. Therefore, after calculating the relative rotation angle \(\theta_j\), we use a simple algorithm for motion correction, assuming that the only motion of the object relative to the camera is translational in the horizontal and vertical directions. Specifically, we pre-set the motion range (within \(\pm 20\) pixels) and iterate over candidate translations to find the shift that minimizes the difference between consecutive frames after applying the mask. Note that although large movements are corrected in this step, minor movements still exist and are hard to eliminate. Fortunately, our method can tolerate these minor movements.
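The translational correction can be sketched as a brute-force search over integer shifts within \(\pm 20\) pixels that minimizes the masked L1 difference between consecutive frames; the exhaustive search and tensor layout below are assumptions, not the released pre-processing code.

```python
import torch

def correct_translation(prev_frame, curr_frame, mask, max_shift=20):
    """Search the integer (dy, dx) shift of curr_frame, |shift| <= max_shift, that
    minimizes the masked L1 difference to prev_frame.

    prev_frame, curr_frame: (H, W, 3) tensors; mask: (H, W) foreground mask.
    Returns the shifted frame and the best shift.
    """
    best, best_err = (0, 0), float('inf')
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = torch.roll(curr_frame, shifts=(dy, dx), dims=(0, 1))
            err = ((shifted - prev_frame).abs() * mask.unsqueeze(-1)).mean().item()
            if err < best_err:
                best, best_err = (dy, dx), err
    return torch.roll(curr_frame, shifts=best, dims=(0, 1)), best
```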

10 Qualitative Comparison↩︎

10.1 Qualitative Comparison on Synthetic Dataset↩︎

We show all the estimated normal maps and error maps of Spin-UP, S23 [5], S22 [4], and HY19 [2] for the shape, light, reflectance, and spatially varying material groups in Fig. 15-Fig. 18.

Figure 15: The visual quality comparison among Spin-UP, S23 [5], S22 [4], and HY19 [2] on the light group in terms of normal map (row 1, 3, 5, 7), error map (row 2, 4, 6, 8). Numbers indicate the MAE for surface normal.

Figure 16: The visual quality comparison among Spin-UP, S23 [5], S22 [4], and HY19 [2] on the shape group in terms of normal map (row 1, 3, 5, 7), error map (row 2, 4, 6, 8). Numbers indicate the MAE for surface normal.

Figure 17: The visual quality comparison among Spin-UP, S23 [5], S22 [4], and HY19 [2] on the reflectance group in terms of normal map (row 1, 3, 5, 7), error map (row 2, 4, 6, 8). Numbers indicate the MAE for surface normal.

Figure 18: The visual quality comparison among Spin-UP, S23 [5], S22 [4], and HY19 [2] on the spatially varying material group in terms of normal map (rows 1, 3, 5, 7), error map (rows 2, 4, 6, 8). Numbers indicate the MAE for surface normal.

10.2 Qualitative Comparison on Real-world Dataset↩︎

We show all the estimated normal maps of Spin-UP, S23 [5], and S22 [4] on the real-world dataset in Fig. 19 and Fig. 20.

Figure 19: The visual quality comparison among Spin-UP, S23 [5], and S22 [4] on the Soldier, Player, Policeman, and Dancer in terms of the normal map. Left (right) side of the solid line: objects captured in Campus (Workplace) environment.

Figure 20: The visual quality comparison among Spin-UP, S23 [5], and S22 [4] on Eevee, captured in a living room with the more portable device, in terms of the normal map.

References↩︎

[1]
.
[2]
.
[3]
.
[4]
.
[5]
.
[6]
.
[7]
.
[8]
.
[9]
.
[10]
.
[11]
of Computer Vision (IJCV) , 2007.
[12]
.
[13]
.
[14]
.
[15]
.
[16]
.
[17]
.
[18]
vol. rendering for multi–view reconstruction. In Proc. Conference on Neural Information Processing Systems (NeurIPS), 2021.
[19]
.
[20]
.
[21]
.
[22]
.
[23]
.
[24]
.
[25]
.
[26]
.
[27]
.
[28]
.
[29]
.
[30]
.
[31]
.
[32]
.
[33]
.

  1. Co-first author.   Corresponding author.↩︎

  2. Work completed while interning at the State Key Lab of Brain-Machine Intelligence, Zhejiang University.↩︎

  3. We assume the object’s boundary is geometrically smooth (occluding boundary).↩︎

  4. The subscripts are omitted for simplicity.↩︎

  5. In practice, we find that initializing Gaussians’ parameters by Fibonacci lattice [17] and freezing \(\lambda_t\) gives the best results.↩︎

  6. We implement a Gaussian filter and rescale the pixel value’s range to \(\mathcal{F}^d_{TH}(m_b)\)’s range to suppress ringing effect and negative energy in SH.↩︎

  7. We use gray-scale value to initialize the RGB value of light model.↩︎

  8. Please refer to the supplementary material for more details about the network structures and modified normal fitting method.↩︎

  9. Please refer to the supplementary material for more details about the setup of hyperparameters.↩︎

  10. According to normal’s definition, the smaller the distance between the adjacent points and the query points (red circle in Fig. 5), the more accurately representing the geometry at query points. Therefore, the blue circles’ depths are preferred for normal calculation.↩︎

  11. The generated pattern for Voronoi Diff and Voronoi Spec follow similar setup in CNN-PS [32].↩︎

  12. Please refer to the supplementary material for all the qualitative comparison between Spin-UP and other methods on the synthetic and real-world dataset.↩︎

  13. Point light’s setup follows [5] and [33].↩︎

  14. https://www.blender.org↩︎

  15. One scene representing an object with one material rendered under HDR environment maps.↩︎

  16. https://www.ikea.com/sg/en/p/snudda-lazy-susan-solid-wood-40176460/↩︎