Zongrui Li\(^{1,2}\), Zhan Lu\(^{2,4,*}\), Haojie Yan\(^{3,4}\), Boxin Shi\(^{5,6}\), Gang Pan\(^{3,4}\), Qian Zheng\(^{3,4,\ddag}\), Xudong Jiang\(^{1,2}\)

\(^1\)Rapid-Rich Object Search (ROSE) Lab, Interdisciplinary Graduate Programme, Nanyang Technological University

\(^2\)School of Electrical and Electronic Engineering, Nanyang Technological University

\(^3\)College of Computer Science and Technology, Zhejiang University

\(^4\)The State Key Lab of Brain-Machine Intelligence, Zhejiang University

\(^5\)National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

\(^6\)National Engineering Research Center of Visual Technology, School of Computer Science, Peking University

{zongrui001, zhan007, EXDJiang}@ntu.edu.sg, {hjyan, gpan, qianzheng}@zju.edu.cn, shiboxin@pku.edu.cn

April 02, 2024

Natural Light Uncalibrated Photometric Stereo (NaUPS) relieves the strict environment and light assumptions in classical Uncalibrated Photometric Stereo (UPS) methods. However, due to the intrinsic ill-posedness and high-dimensional ambiguities, addressing NaUPS is still an open question. Existing works impose strong assumptions on the environment lights and objects’ material, restricting their effectiveness in more general scenarios. Alternatively, some methods leverage supervised learning with intricate models while lacking interpretability, resulting in biased estimations. In this work, we propose **Spin** Light **U**ncalibrated **P**hotometric Stereo (**Spin-UP**), an unsupervised method to tackle NaUPS under various environment lights and objects. The proposed method uses a novel setup that captures the object’s images on a rotatable platform, which mitigates NaUPS’s ill-posedness by reducing unknowns and provides reliable priors to alleviate NaUPS’s ambiguities. Leveraging neural inverse rendering and the proposed training strategies, Spin-UP recovers surface normals, environment light, and isotropic reflectance under complex natural light at low computational cost. Experiments have shown that Spin-UP outperforms other supervised/unsupervised NaUPS methods and achieves state-of-the-art performance on synthetic and real-world datasets. Codes and data are available at https://github.com/LMozart/CVPR2024-SpinUP.

Natural light uncalibrated photometric stereo (NaUPS) [1] is proposed to relieve the dark-room and directional-light assumptions in classical uncalibrated photometric stereo, aiming to reconstruct the surface normal given images of an object captured under arbitrary environment light. The implications of NaUPS are far-reaching: it makes photometric stereo universal. However, solving NaUPS is still an open question because of the intrinsic ill-posedness introduced by the varying light of each image and the high-dimensional ambiguities between the light and objects [1].

Previous optimization-based methods use simple light models to represent the varying environment lights and Lambertian reflectance to represent the material [1]–[3]. These models help mitigate the ill-posedness and ambiguities to some extent but become ineffective in handling objects with general reflectance (*e.g.*, non-Lambertian reflectance) under complex environment light, leading to unsatisfactory reconstruction outcomes. Besides, since they model the varying lights of each image separately, the unknowns introduced by the light model grow with the resolution and number of images, restricting these methods to low-resolution inputs and insufficient images.

Considering the difficulties of explicitly mitigating the ill-posedness and ambiguities, recent advances [4], [5] turn to data-driven methods: they train a deep learning model on large-scale datasets and implicitly exploit deep light features from images to improve performance. However, those methods lack interpretability, making them hard to constrain during training. Consequently, the model may be affected by the data bias and prone to specific types of light sources [4] or reflection variations [5] among images.

Despite persistent exploration in this research field, a method capable of handling general objects under natural light while free from data bias is still missing. In this paper, we provide a new perspective on solving NaUPS. Specifically, we propose a novel setup that acquires images on a rotatable platform under a static environment light. In such a case, the object is illuminated by rotated versions of the same environment light. Representing such rotated lights requires fewer parameters than modeling the varying light of each image separately, since we model the lights as a uniform environment light varied by a low degree-of-freedom (DoF [6]) rotation. This frees us from significantly increasing the unknowns when implementing advanced parametric light models (*e.g.*, spherical Gaussians) and reflectance models to handle general scenarios. Additionally, based on such a setup, we derive a reliable light initialization method by analyzing the pixel values at the object’s occluding boundary. This light initialization helps the model converge at the beginning of training and alleviates the ambiguity between light and objects.

With the help of the proposed setup and light initialization method, we develop **Spin** Light **U**ncalibrated **P**hotometric Stereo (**Spin-UP**), addressing NaUPS by optimizing an inverse rendering framework in an unsupervised manner. To the best of our knowledge, this is the first unsupervised method that can handle general objects under natural light. Unlike previous methods, Spin-UP can jointly reconstruct environment light, isotropic reflectance, and complicated shapes through iterative optimization. To reduce the computational cost and improve convergence, we propose two strategies, interval sampling and shrinking range computing, which allow Spin-UP to be optimized with low GPU memory usage (5 GB) and a short running time (25 min). Experiments on synthetic and real-world datasets demonstrate our superior performance over previous methods in general scenarios. Overall, our contributions are summarized as follows:

We design a novel setup for NaUPS, which reduces unknowns of light representation and facilitates solving NaUPS in an unsupervised manner.

We introduce a light prior, which leverages an object’s occluding boundaries to initialize a reliable environment light. Based on the setup and light prior, we propose the unsupervised NaUPS method named Spin-UP.

We present two training strategies for fast training and convergence of Spin-UP.

In this section, we briefly review recent supervised and unsupervised NaUPS methods. We also summarize other techniques that exploit priors from occluding boundaries. Additionally, we discuss recent advances in 3D vision to distinguish Spin-UP from other neural inverse rendering approaches. Given that the focus of this paper is on NaUPS, a group of works reconstructing 3D surfaces from a single image by deep learning under natural light [7] or shading [8]–[10] are not included.

**Natural Light Uncalibrated Photometric Stereo**. Unsupervised NaUPS methods jointly recover the light, reflectance properties, and surface normal. These methods explicitly model the environment light by low-order spherical harmonics
(SH) [3], [11], spatially varying spherical harmonics (SV-SH) [2], [12] or equivalent directional light
(Eqv-Dir) [1] to mitigate the ill-posedness, and use integrability constraint [11], shape initialization [2], non-physical lighting
regularization [3], or graph-based ambiguity relaxation [1] to alleviate the ambiguity. In contrast, supervised NaUPS methods [4], [5] apply deep learning models like transformers to reconstruct normal maps without explicitly estimating the environment light. The models are trained on a dataset containing images of
diverse objects captured under various lighting conditions, including directional, point, and environment light. Compared to previous work, the proposed Spin-UP distinguishes itself in three key aspects: 1) it features a novel setup explicitly designed to
model correlations among observed images to mitigate the ill-posedness of NaUPS, 2) leveraging this unique setup, a novel light initialization method is introduced to mitigate ambiguities, and 3) an advanced light and material model is implemented to
address a broader range of scenarios.

**Priors from the Boundaries**. The occluding boundaries of an object are considered to reveal adequate information about the object’s shape and the scene’s light. Given that the projection of a boundary’s normal onto the xy-plane is perpendicular to the boundary under orthographic projection, methods have been developed to constrain the surface normal estimation during iterative optimization [13] or to recover a rough shape to initialize the geometry in multi-view [14] or photometric stereo [2]. Other methods associate the boundaries’ normals with the reflectance to estimate rough positions of the directional lights [15]. However, none of them derive the environment light from the boundary reflectance. Given the setup in Spin-UP, we can roughly estimate the environment light by analyzing occluding boundaries and the corresponding pixel values. This approach provides a reliable light initialization that alleviates ambiguity in NaUPS.

**Inverse rendering in 3D Vision**. Neural Radiance Fields (NeRF) [16] implicitly store the scene’s shape and reflectance through MLPs optimized by inverse volume rendering. While NeRF can only recover coarse 3D shapes, several subsequent works [17]–[22] combine surface rendering and volume rendering techniques, recovering fine shapes under varying viewpoints but a static environment light. In contrast, the viewpoint in Spin-UP is static relative to the object. While most neural field methods aim to recover whole 3D geometries, Spin-UP only recovers the object’s surface.

In Sec. 3.1, we explain the Spin-UP’s setup and how it reduces the unknowns. In Sec. 3.2, we introduce the light prior that alleviates ambiguities in NaUPS, including details of the light initialization method based on that prior. In Sec. 3.3, we describe the implementation details of the proposed Spin-UP framework and losses. In Sec. 3.4, we demonstrate two proposed training strategies.

As shown in Fig. 1, we capture a sequence of images \(\boldsymbol{I} \triangleq \{I_j | j\in [1, ..., N_I]\}\) of an object^{3} by rotating it together with a linear perspective camera through \(360^\circ\) on a rotatable platform. Since the relative position and orientation between the camera and the object are fixed during rotation, each observed image is aligned with the rotated environment light \(\boldsymbol{L}(\boldsymbol{R}_j \cdot \boldsymbol{\omega})\), where \(\boldsymbol{\omega} \in \mathbb{R}^3\) indicates the incident light’s direction, \(\boldsymbol{R}_j=\boldsymbol{R}(\theta_j)\) is the 1-DoF [6] rotation matrix representing rotation about the vertical axis, \(\theta_j\) is the rotation angle, and \(\theta_1=0\). As we conduct a \(360^{\circ}\) rotation and assume a constant velocity (though this is not strictly required in practice), \(\boldsymbol{R}_j\) can be initialized with \(\theta_j = 2\pi (j-1) / {N_I}\). Given the sequence of images \(\boldsymbol{I}\) and the initialized \(\boldsymbol{R}\), Spin-UP iteratively optimizes the normal map \(\boldsymbol{N}\), the environment light \(\boldsymbol{L}\), the isotropic BRDF map \(\boldsymbol{M}\), and the rotation matrices \(\boldsymbol{R}\) by solving \[\begin{align}
\label{eq:inv95rend} \underset{\boldsymbol{L}, \boldsymbol{M}, \boldsymbol{N}, \boldsymbol{R}}{\arg \min } \sum_{i=1}^{N_P} \sum_{j=1}^{N_I} \mathcal{E}\left(\boldsymbol{m}_{ij}, \hat{\boldsymbol{m}}_{ij}\right),
\end{align}\tag{1}\] where \(N_P\) is the number of sampled points on the surface, \(\boldsymbol{m}_{ij}\) and \(\hat{\boldsymbol{m}}_{ij}\) indicate the ground truth and the estimate of point \(i\)’s color in image \(I_j\), respectively, and \(\mathcal{E}(\cdot,\cdot)\) is the loss function between \(\boldsymbol{m}_{ij}\) and \(\hat{\boldsymbol{m}}_{ij}\) (*i.e.*, mean absolute error). We adopt the rendering equation to calculate the color \(\boldsymbol{\hat{m}}\)^{4} \[\begin{align}
\begin{aligned}
\label{eq:render95eq} \boldsymbol{\hat{m}} &=\int_\Omega s\boldsymbol{L}\left(\boldsymbol{\omega}\right) \rho\left(\boldsymbol{\omega} \cdot \boldsymbol{n}\right) \mathrm{d} \boldsymbol{\omega}, \\ & =\int_\Omega
s\boldsymbol{L}\left(\boldsymbol{\omega}\right) (\rho^s + \rho^d)\left(\boldsymbol{\omega} \cdot\boldsymbol{n}\right) \mathrm{d} \boldsymbol{\omega}.
\end{aligned}
\end{align}\tag{2}\] where \(\Omega\) represents the upper hemisphere centered at the normal vector \(\boldsymbol{n}\), \(s\) is the cast shadow, and \(\rho^s\) and \(\rho^d\) indicate the specular and diffuse reflectance, respectively. The ambiguities between the light \(\boldsymbol{L}\) and the reflectance of the object \(\boldsymbol{M}\) are often disregarded [13], [23].
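As a concrete sketch (not the authors' implementation; `init_rotations` is a hypothetical helper), the rotation initialization \(\theta_j = 2\pi (j-1)/N_I\) and the rotated light query \(\boldsymbol{L}(\boldsymbol{R}_j \cdot \boldsymbol{\omega})\) described above can be written as:

```python
import numpy as np

def init_rotations(n_images):
    """Return an (N_I, 3, 3) stack of 1-DoF rotations about the vertical
    (y) axis, with theta_j = 2*pi*(j-1)/N_I and theta_1 = 0."""
    thetas = 2.0 * np.pi * np.arange(n_images) / n_images
    c, s = np.cos(thetas), np.sin(thetas)
    R = np.zeros((n_images, 3, 3))
    R[:, 0, 0], R[:, 0, 2] = c, s
    R[:, 1, 1] = 1.0
    R[:, 2, 0], R[:, 2, 2] = -s, c
    return R

# The environment light seen in image j is the static light queried at
# the rotated direction R_j @ omega.
R = init_rotations(50)
omega = np.array([0.0, 0.0, 1.0])   # one incident light direction
rotated = R @ omega                  # (50, 3) query directions
```

Only the \(N_I\) angles are unknown here; the environment light itself is shared across all images, which is the source of the parameter reduction discussed below.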


| | QL15 [3] | HY19 [2] | GM21 [1] | Spin-UP |
|---|---|---|---|---|
| type | SV-SH | Global SH | Eqv-Dir | SG |
| number | \(1.5\)K | \(450\) | \(1.4\)M | \(434\) |
| frequency | low-freq. | low-freq. | low-freq. | high-freq. |

**Unknowns reduction.** The proposed spin light setup reduces the unknowns of the light representation \(\boldsymbol{L}\) by exploiting correlations between different images. Unlike previous NaUPS methods that model the light of each image separately, we consider a uniform environment light \(\boldsymbol{L}\) for all images, represented by a parametric model such as spherical Gaussians, plus a 1-DoF rotation angle \(\theta\) for each image. As such, the unknowns consist of the environment light model’s parameters and \(N_I\) rotation angles. The total number of unknowns is reduced compared to other methods (Table 1), which helps mitigate the ill-posedness and facilitates solving NaUPS with advanced light and reflectance models in an unsupervised manner.

Based on the spin light setup, we can exploit priors from the object boundary for light initialization to alleviate the ambiguity. The idea is motivated by the observation that the pixel value \(m_b\) at an object’s boundary provides insights into the environment light (see Fig. 2). For an object with occluding boundaries, the normals of those boundaries \(\boldsymbol{n}_b\) can be pre-computed [13], [23]. By associating \(m_b\), \(\boldsymbol{n}_b\), and \(\boldsymbol{R}\), we can roughly derive a light map indicating the light sources’ positions and intensities, where \(m_b\) directly represents the light intensity \(\boldsymbol{L}(\boldsymbol{\omega}_b)\) at \(\boldsymbol{\omega}_b = \boldsymbol{R}\cdot\boldsymbol{n}_b\). However, the derived light map for objects with different materials may contain mismatched light source positions and chromatic bias, leading to an inaccurate light initialization.

The mismatched light source positions are caused by the specular component \(m^s_b\) in \(m_b\). When \(m_b^s\) dominates, approximating \(\boldsymbol{L}(\boldsymbol{\omega}_b)\) by \(m_b\) becomes a biased estimation, as \(m_b^s\) is a reflection of lights from directions other than \(\boldsymbol{\omega}_b\) (Fig. 3). By contrast, the approximation is more reasonable when the diffuse component \(m^d_b\) dominates, since \(m^d_b=\int_{\Omega} \boldsymbol{L}\left(\boldsymbol{\omega}\right) \rho^d\left(\boldsymbol{\omega} \cdot \boldsymbol{\omega}_b\right) \mathrm{d} \boldsymbol{\omega}\) indicates that \(\boldsymbol{L}(\boldsymbol{\omega}_b)\) contributes most to the actual pixel value, making it less biased to use \(m^d_b\) to represent \(\boldsymbol{L}(\boldsymbol{\omega}_b)\). Therefore, to facilitate a less biased estimation of the environment light for initialization, a diffuse filter \(\mathcal{F}^d(.)\) is applied to \(m_b\), alleviating the mismatched light source position issue. Similarly, a chromatic filter \(\mathcal{F}^c(.)\) is also required to reduce the chromatic bias caused by the spatially varying material at the boundaries. The filtered pixel value \(\hat{m}_b^d=\mathcal{F}^c (\mathcal{F}^d\left(m_b\right))\) is the basis for our light initialization method.
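The boundary prior can be illustrated as follows. This is a minimal sketch, not the paper's exact algorithm: it assumes a lat-long light map, and `splat_boundary_light` is a hypothetical helper that accumulates boundary intensities \(m_b\) at the rotated directions \(\boldsymbol{\omega}_b = \boldsymbol{R}_j \cdot \boldsymbol{n}_b\):

```python
import numpy as np

def splat_boundary_light(normals, values, rotations, h=16, w=32):
    """Accumulate boundary pixel values into an (h, w) lat-long light map
    at omega_b = R_j @ n_b, averaging over all hits per texel."""
    acc = np.zeros((h, w))
    cnt = np.zeros((h, w))
    for R in rotations:
        d = normals @ R.T                              # omega_b per boundary point
        theta = np.arccos(np.clip(d[:, 1], -1, 1))     # polar angle from vertical
        phi = np.arctan2(d[:, 2], d[:, 0]) % (2 * np.pi)
        r = np.clip((theta / np.pi * h).astype(int), 0, h - 1)
        c = np.clip((phi / (2 * np.pi) * w).astype(int), 0, w - 1)
        np.add.at(acc, (r, c), values)                 # splat intensities
        np.add.at(cnt, (r, c), 1.0)
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)
```

Because the platform rotates through \(360^\circ\), the same boundary normals sweep many azimuths, so even a sparse boundary can cover a usable portion of the light map.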

**Light initialization method.** The procedure of the light initialization method is summarized in Algorithm 4. This method aims to derive an initial environment light model with parameters \(\Theta\). Specifically, we use \(N_L=64\) spherical Gaussian (SG) bases [24] as the light model, where \(\boldsymbol{L}(\boldsymbol{\omega}|\boldsymbol{\xi}_t, \lambda_t, \boldsymbol{\mu}_t) = \sum^{N_L}_{t=1} G(\boldsymbol{\omega}; \boldsymbol{\xi}_t, \lambda_t, \boldsymbol{\mu}_t)\); \(\boldsymbol{\xi}_t\), \(\lambda_t\), and \(\boldsymbol{\mu}_t\) stand for the Gaussian lobes’ direction, sharpness, and amplitude, respectively^{5}. Inspired by [25], which indicates that diffuse reflectance can be approximated by low-frequency reflectance, we design \(\mathcal{F}^d(.)\) as a combination of a threshold filter \(\mathcal{F}^d_{TH}(.)\) that removes high-intensity reflectance (the top \(20\%\) brightest) based on the boundary points’ intensity profile [25], and a low-pass filter (*i.e.*, a 3-order spherical harmonics filter^{6}) noted as \(\mathcal{F}^d_{SH}\) [25], [26]. \(\mathcal{F}^d_{TH}(.)\) reduces the bias by removing the brightest parts, which are usually aligned with specular reflectance in observed images, while \(\mathcal{F}^d_{SH}(.)\) helps extract the low-frequency reflectance. For \(\mathcal{F}^c(.)\), we design it as a converter transferring pixel values into gray-scale^{7}, mitigating biases from the spatially varying material.
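A sketch of the SG light model and the threshold filter, assuming the standard lobe form \(G(\boldsymbol{\omega};\boldsymbol{\xi},\lambda,\boldsymbol{\mu})=\boldsymbol{\mu}\,e^{\lambda(\boldsymbol{\omega}\cdot\boldsymbol{\xi}-1)}\); both function names are hypothetical, and the spherical harmonics low-pass step is omitted:

```python
import numpy as np

def sg_eval(omega, xi, lam, mu):
    """Evaluate an SG mixture L(omega) = sum_t mu_t * exp(lam_t*(omega.xi_t - 1)).
    xi: (N_L, 3) lobe directions, lam: (N_L,) sharpness, mu: (N_L,) amplitude."""
    dots = omega @ xi.T
    return np.exp(lam * (dots - 1.0)) @ mu

def threshold_filter(values, keep=0.8):
    """F^d_TH sketch: drop the top 20% brightest boundary samples,
    which usually align with specular reflectance."""
    cutoff = np.quantile(values, keep)
    return values[values <= cutoff]
```

At a lobe center (\(\boldsymbol{\omega}=\boldsymbol{\xi}_t\)) the exponent vanishes and the lobe contributes its full amplitude \(\boldsymbol{\mu}_t\), which is why fitting the splatted boundary intensities to lobe amplitudes gives a reasonable initialization.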

With reliable initial SG lights and rotation matrices \(\boldsymbol{R}\), we develop Spin-UP based on neural inverse rendering [13], [17], [21], [23] given the rendering equation in Eq. (2).

**Shape model**. We use a neural depth field to represent the 3D surface. A multi-layer perceptron (MLP) predicts the depth value given the image coordinates. To compute normals from the depth map, we extend the normal fitting method described in [13] to perspective projection^{8}.
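For intuition, normals can be derived from a depth map by finite differences; the orthographic-style sketch below (hypothetical `normals_from_depth`) omits the paper's perspective extension:

```python
import numpy as np

def normals_from_depth(z):
    """Estimate per-pixel normals from a depth map via central differences.
    n is proportional to (-dz/dx, -dz/dy, 1), then normalized."""
    dzdx = np.gradient(z, axis=1)
    dzdy = np.gradient(z, axis=0)
    n = np.stack([-dzdx, -dzdy, np.ones_like(z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```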

**Material model**. We represent the spatially varying, isotropic reflectance with a modified Disney model [17]. The diffuse albedo \(\boldsymbol{\rho}^d\) is predicted by another MLP with a similar structure given the image coordinates. The spatially varying specular reflectance is calculated as a weighted sum of \(N_S=12\) SG bases, *i.e.*, \(\boldsymbol{\rho}^s=\sum_{n=1}^{N_S} c^n\mathcal{D}\left(\boldsymbol{v},\boldsymbol{\omega}\right)\mathcal{F}\left(\boldsymbol{h}, \boldsymbol{\omega}\right) \mathcal{G}(\boldsymbol{n}, \boldsymbol{\omega}, \boldsymbol{v}, \lambda_n)\), where \(\mathcal{D}\), \(\mathcal{F}\), and \(\mathcal{G}\) account for the micro-facet normal distribution, Fresnel effects, and self-occlusion, respectively. \(\boldsymbol{v}\) is the view direction; \(\boldsymbol{h}\) is the half-vector, calculated as \(\boldsymbol{h}=(\boldsymbol{v}+\boldsymbol{\omega}) / \left\|\boldsymbol{v}+\boldsymbol{\omega}\right\|\); \(\lambda_n\) are the roughness terms, initialized as \((0.1 + 0.9(n-1)) / (N_S-1)\) and set as learnable parameters; and \(c^n\) are the weights predicted by the MLP.
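The half-vector and the roughness initialization from the text can be written directly; this is a sketch of those two formulas only, not the full specular model:

```python
import numpy as np

def half_vector(v, w):
    """h = (v + w) / ||v + w||, the micro-facet half-vector."""
    h = v + w
    return h / np.linalg.norm(h)

# Roughness initialization for the N_S = 12 specular SG bases,
# following the formula in the text: (0.1 + 0.9*(n-1)) / (N_S - 1), n = 1..N_S.
N_S = 12
lambdas = np.array([(0.1 + 0.9 * (n - 1)) / (N_S - 1) for n in range(1, N_S + 1)])
```

Spacing the initial roughness values across the bases lets the learned weights \(c^n\) mix sharp and broad lobes, covering materials from glossy to rough.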

**Shadow model**. We apply a shadow mask similar to that in [23] to handle the cast shadow.

**Loss functions**. Similar to other inverse rendering-based methods [13], [23], we use the inverse rendering loss (*i.e.*, Eq. (1)) to train the framework. The three-stage schema [13] is applied, as well as smoothness terms (total variance regularization [13], [23]) calculated as \(\operatorname{TV}(.) = \frac{1}{N_P} \sum_{i=1}^{N_P}\left|\frac{\partial (.)}{\partial x}+\frac{\partial (.)}{\partial y}\right|\), where \((x, y)\) are the image coordinates. We apply \(\operatorname{TV}(.)\) to the normal map (\(\boldsymbol{N}\)), the diffuse albedo map (\(\boldsymbol{A}\)), and the Gaussian bases’ weights (\(c\)) for the material, and gradually drop it following the three-stage training schema [13]. Similar to [27], a normalized color loss calculated as \(\|\operatorname{Nor}(\boldsymbol{A})-\operatorname{Nor}(\boldsymbol{I})\|\) is implemented to help Spin-UP learn a better albedo representation, where \(\operatorname{Nor}(.)\) is the vector normalization. Following [13], we calculate the boundary loss as the cosine similarity between the pre-computed and estimated boundary normals^{9}.
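The total-variance term can be sketched with forward differences on a 2D map (`tv_loss` is a hypothetical name, averaging over points where both partials are defined):

```python
import numpy as np

def tv_loss(x):
    """Total-variance smoothness term:
    TV = (1/N_P) * sum |d(.)/dx + d(.)/dy|."""
    dx = x[:, 1:] - x[:, :-1]    # horizontal forward differences
    dy = x[1:, :] - x[:-1, :]    # vertical forward differences
    # sum the two partials where both are defined, then average
    return np.mean(np.abs(dx[:-1, :] + dy[:, :-1]))
```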

Optimizing Spin-UP requires smoothness terms to facilitate convergence and avoid local optima. However, those terms are often implemented on full-resolution images, leading to high computational costs. To reduce those costs, we propose to use a sampling
strategy noted as *interval sampling* (IS). To further improve convergence, we introduce another technique noted as *shrinking range computing* (SRC).

**Interval sampling (IS)**. IS samples ray batches from images to reduce the costs. Unlike random ray sampling [16] or patch-based sampling [28], IS preserves the object’s shape. The idea is similar to the downsampling techniques in [5], [29], but we do not merge the sub-images back to full resolution. We experimentally find this strategy important for training on down-sampled sub-images with smoothness terms to avoid local optima (Sec. 5.2). Specifically, we divide the full-resolution image into non-overlapping blocks, where each block contains \(N_B\times N_B\) pixel points. By extracting the pixel at the same position in each block (*e.g.*, the top-left pixel), we obtain \(N_B\times N_B\) sub-images with a down-sampled resolution (see Fig. 5 for an illustration). During training, these sub-images are randomly sampled in each epoch, and the smoothness terms are calculated at the sub-images’ resolution, which reduces the computational cost and ensures the effectiveness of those terms.
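Interval sampling amounts to strided slicing; a minimal sketch (hypothetical `interval_sample`):

```python
import numpy as np

def interval_sample(img, n_b):
    """Split a full-resolution image into n_b*n_b down-sampled sub-images:
    sub-image (u, v) takes the pixel at offset (u, v) of every
    non-overlapping n_b x n_b block (strided slicing)."""
    return [img[u::n_b, v::n_b] for u in range(n_b) for v in range(n_b)]

# 4x4 image, 2x2 blocks -> four 2x2 sub-images; sub-image 0 holds the
# top-left pixel of each block.
subs = interval_sample(np.arange(16.0).reshape(4, 4), 2)
```

Each sub-image is a uniformly strided view of the full image, so the object's silhouette and coarse shape survive the down-sampling, unlike random ray batches.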

*Shape Group and Reflectance Group (MAE in degrees):*

| Method | Ball | Bear | Buddha | Reading | AVG | Pot2 (D.) | Pot2 (S.) | Reading (D.) | Reading (S.) | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| HY19 [2] | 41.32 | 53.88 | 67.90 | 54.85 | 54.49 | 57.45 | 37.43 | 65.48 | 58.04 | 54.60 |
| S22 [4] | 7.35 | 14.03 | 26.37 | 18.77 | 16.63 | 15.56 | 11.83 | 18.97 | 18.38 | 16.19 |
| S23 [5] | 5.56 | 10.37 | 18.54 | 15.10 | 12.39 | 13.46 | 9.75 | 16.22 | 12.67 | 13.03 |
| Spin-UP | **3.54** | **6.33** | **17.30** | **7.71** | **8.72** | **5.83** | **7.11** | **13.09** | **10.30** | **9.08** |

*Light Group and Spatially Varying Material Group (MAE in degrees):*

| Method | Cow (U.) | Cow (A.) | Cow (S.) | Cow (L.) | AVG | Pot2 (D.) | Pot2 (S.) | Reading (D.) | Reading (S.) | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| HY19 [2] | 67.63 | 39.21 | 40.28 | 48.47 | 48.89 | 40.97 | 37.46 | 49.27 | 48.96 | 44.17 |
| S22 [4] | 17.17 | 12.74 | 17.11 | 11.35 | 14.59 | 18.59 | 17.63 | 22.80 | 23.75 | 20.69 |
| S23 [5] | 11.93 | 7.52 | 12.38 | 11.60 | 10.84 | 14.22 | 11.00 | 14.58 | 14.31 | 13.53 |
| Spin-UP | **5.50** | **4.40** | **3.33** | **4.94** | **4.54** | **5.58** | **6.97** | **12.54** | **11.52** | **9.15** |

**Shrinking range computing (SRC)**. Without merging sub-images back into a full-resolution image in IS, an aliasing issue arises in the inverse rendering process. This issue is caused by the fact that the normal calculation in our framework requires four adjacent points’ depths at the sub-images’ resolution [13], which degrades the precision of the normal calculation^{10}. Therefore, SRC is applied for anti-aliasing. It uses points adjacent to the query point (blue circles in Fig. 5) *in the full-resolution image coordinates* to calculate the normal for each pixel in the sub-images. This strategy maintains the precision of the normal calculation. However, at the early stage of training, calculating the normal based on the blue circles’ depths is vulnerable to perturbation in per-pixel training. Therefore, SRC gradually selects points (yellow circles) from far (\(k=3\) points away) to close (blue circles) *w.r.t.* the query points to interpolate the blue circles’ depths, as normal calculation on far points’ depths leads to a smoother and more stable normal map at the early stage, which eventually improves convergence, as validated in Sec. 5.2.
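The shrinking-range idea can be illustrated with stride-\(k\) central differences in full-resolution coordinates. This is a simplification of the paper's four-point scheme; `src_normals` is a hypothetical helper, and training would shrink \(k\) from 3 down to 1:

```python
import numpy as np

def src_normals(z, k):
    """Normals from depth via central differences at stride k
    (in full-resolution coordinates). Larger k -> smoother, more
    stable normals early in training; k shrinks as training proceeds."""
    dzdx = (np.roll(z, -k, axis=1) - np.roll(z, k, axis=1)) / (2 * k)
    dzdy = (np.roll(z, -k, axis=0) - np.roll(z, k, axis=0)) / (2 * k)
    n = np.stack([-dzdx, -dzdy, np.ones_like(z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

On a smooth surface, the stride-3 and stride-1 estimates agree in the interior, while on noisy early-training depth the wider stride averages out per-pixel perturbations.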

We validate the effectiveness of the proposed Spin-UP on synthetic and real-world data. We use mean angle error (MAE) to evaluate the reconstruction quality of the normal map, and PU-PSNR [30] and PU-SSIM [30] to evaluate the reconstructed environment light. Since no existing datasets follow our spin light setup, we collect the data using Blender and our device.

We collect several objects, environment maps, and materials to render the synthetic dataset in Blender with the Cycles engine. Specifically, five shapes from the DiLiGenT-MV dataset [31] (*i.e.*, Buddha, Bear, Cow, Pot2, and Reading) and a generated shape (Ball), five HDR environment maps (*i.e.*, Landscape, Quarry, Urban, Attic, Studio), two PBR materials (*i.e.*, Rusty Steel, Leather), and four synthetic materials (*i.e.*, Voronoi Diff, Voronoi Spec, Green Diff, Green Spec)^{11} are used for evaluation. We devise four groups of data for evaluation: the shape group, light group, reflectance group, and spatially varying material group, each containing four scenes. For each scene, 50 observed images with a resolution of \(512 \times 512\) are rendered by a perspective camera with a focal length of \(50\,\mathrm{mm}\) and a frame size of \(36\,\mathrm{mm}\times 36\,\mathrm{mm}\). The camera rotation \(\theta\) between consecutive images follows a non-uniform rotation velocity. We compare Spin-UP with three advanced NaUPS methods, including two supervised NaUPS methods (S22 [4] and S23 [5]) and one unsupervised UPS method (HY19 [2])^{12}.

**Normal estimation comparison**. According to the results in Table [tab:syn], Spin-UP presents superior performance compared to all other NaUPS methods. Specifically, in the *shape group*, the low MAE on Ball, Bear, Buddha, and Reading in Rusty Steel rendered under Quarry indicates the practicability of Spin-UP for various shapes. In the *light group*, the low variance of MAE (\(0.81^\circ\) for Spin-UP vs. \(1.94^\circ\) for S23 [5]) on Cow in Leather rendered under Quarry, Urban, Attic, and Studio demonstrates robustness toward different environment lights. In the *reflectance group*, the results on Pot2 and Reading in Green Diff and Green Spec rendered under Landscape demonstrate the ability to handle non-Lambertian objects. In the *spatially varying material group*, results on Pot2 and Reading rendered in Voronoi Diff or Voronoi Spec under Landscape prove adaptability to challenging scenarios. A comparative analysis of the outcomes of the reflectance group and the spatially varying material group in our method reveals that the MAE remains relatively consistent across identical objects with different materials, underscoring the robustness of Spin-UP in handling diverse materials. Also, we find that Spin-UP sometimes performs better on specular objects than on diffuse objects (*i.e.*, Reading). We attribute this to the high-frequency details in specular reflectance that may be useful for shape-light reconstruction.

**Light estimation comparison**. Fig. 6 provides a qualitative comparison between the estimated environment light and the ground truth on four groups. We can observe that the learned light map reflects the
position of the light source, especially in Cow (S.) given such a challenging setup without any prior information about the object material or shape. It also reconstructs a reasonable light map for objects with diffuse reflectance, such as Pot2 (D.) and
Reading (D.), further highlighting the effectiveness of the proposed light initialization. However, it should be noted that the estimated environment light is influenced by the object material, particularly when handling objects with a nearly uniform base
color (*e.g.*, reflectance group) or spatially varying material, which may generate unpleasant artifacts (*e.g.*, inconsistent color in estimated environment light of Reading (D.) in the spatially varying material group). Those artifacts are
hard to eliminate without priors.

We set up our spin light capture system to collect real data under indoor and outdoor scenes, as shown in Fig. 7. After preprocessing, we end up with 50 images for each object at a \(540 \times 540\) resolution. Here, we showcase the normal estimation results on two objects (*i.e.*, Soldier and Player) under indoor and outdoor environment lights in Fig. 8, compared with S23 [5] and S22 [4]. We do not show HY19 [2]’s results as it failed on the captured data.

**Normal estimation comparison**. According to Fig. 8, Spin-UP achieves competitive performance compared to the state-of-the-art supervised method [5]. In some scenarios, judging by the reference sphere, we recover results with a more reasonable overall normal distribution, particularly for Soldier indoor and Player indoor. Spin-UP can effectively capture high-frequency details such as wrinkles on clothes in Player and Soldier. By contrast, S23 [5] may contain artifacts (*i.e.*, an incorrect normal map distribution) even though its results have more details than ours, and S22 [4] generates over-smooth results. By comparing indoor and outdoor results, we observe that the performance of S23 [5] degrades significantly in indoor scenarios, which may be attributed to data bias and low pixel variance, while our method is not greatly affected.

To comprehensively validate the effectiveness of our light initialization method, we conduct experiments in two aspects: a comparison of different light initialization methods and the effectiveness of the filters.


| | w Rand. | w Fib. | Spin-UP |
|---|---|---|---|
| PU-PSNR [30] \(\uparrow\) | 16.86 | 18.98 | 22.06 |
| PU-SSIM [30] \(\uparrow\) | 0.45 | 0.52 | 0.60 |

**Comparison on light initialization methods**. We compare our light initialization method with two widely used SG light initialization methods, *i.e.*, random initialization, noted as ‘*w* Rand.’, and the Fibonacci lattice [17], noted as ‘*w* Fib.’. A quantitative comparison of the reconstructed normal and light maps is shown in Table [tab:lgt95quant] and Table [tab:abl]. Compared with ‘*w* Rand.’, the average improvement (\(1.75^\circ\) reduction in MAE on normal estimation and \(5.20\) / \(0.15\) increase in PU-PSNR / PU-SSIM on light estimation, respectively) indicates our method’s adaptability. Compared with ‘*w* Fib.’, we observe an obvious advantage in the shape group (\(0.61^\circ\) reduction in MAE) but a smaller advantage in the spatially varying material group (\(0.19^\circ\) reduction in MAE). This is because the material and shape affect the initial environment light’s quality: while the estimated environment light is most accurate on smooth geometry with simple material, its quality degrades on complex geometry and spatially varying material.
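For reference, the Fibonacci-lattice baseline (‘*w* Fib.’) places SG lobe directions near-uniformly on the sphere; a standard sketch (hypothetical `fibonacci_lattice`):

```python
import numpy as np

def fibonacci_lattice(n):
    """Near-uniform unit directions on the sphere via the Fibonacci
    lattice, a common SG lobe-direction initializer."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i      # golden-angle azimuth steps
    y = 1.0 - 2.0 * (i + 0.5) / n               # uniform in height
    r = np.sqrt(1.0 - y * y)
    return np.stack([r * np.cos(phi), y, r * np.sin(phi)], axis=-1)
```

Unlike the boundary-derived initialization, these directions carry no scene information, which is consistent with the gap reported in Table [tab:lgt95quant].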

**Comparison on the designed filters**. We compare Spin-UP with three alternatives (*i.e.*, *w/o* \(\mathcal{F}^c\), *w/o* \(\mathcal{F}^d_{SH}\), and *w/o* \(\mathcal{F}^d_{TH}\)). The results in Table [tab:abl] demonstrate the effectiveness of those filters. Specifically, dropping \(\mathcal{F}^d_{TH}\) leads to mismatched light source positions in the initialized environment light, introduced by the specular reflectance, which eventually affects the accuracy of the estimated normal; dropping \(\mathcal{F}^d_{SH}\) harms the performance, especially in the spatially varying material group (\(6.58^{\circ}\) increase in MAE), since \(\mathcal{F}^d_{SH}\) is essential for extracting the low-frequency reflectance used to initialize the environment light; dropping \(\mathcal{F}^c\) increases the average MAE (by \(0.32^{\circ}\)), illustrating the necessity of reducing chromatic bias.

Interval sampling facilitates the training of Spin-UP in two ways. First, the training time is more than halved (25 min per object on average vs. 60 min on average, depending on the image’s valid points for the object), and the GPU memory occupation is about five times smaller (around 5 GB vs. 25 GB during training) compared to training directly at full image resolution. Second, comparing Spin-UP with ‘*w/o* Intv.’, which applies a random sampling strategy and calculates the smoothness terms on patches (\(3 \times 3\) pixels), we find that the performance drops by \(0.52^\circ\) on average, and most (\(0.97^\circ\)) on the spatially varying material group. This is because the patch-based smoothness may not work uniformly on different parts of the object, diminishing the effectiveness of the smoothness terms, especially on objects with spatially varying material and abrupt texture changes.

| Method | Shape | Light | Ref. | SV. | AVG |
|---|---|---|---|---|---|
| w Rand. | 12.02 | 6.44 | 9.79 | 10.28 | 9.60 |
| w Fib. | 9.33 | 5.34 | 10.20 | 9.34 | 8.55 |
| w/o Intv. | 8.87 | 5.02 | 9.83 | 10.12 | 8.37 |
| w/o Shrk. | 10.10 | 4.92 | 10.12 | 10.43 | 8.81 |
| w/o \(\mathcal{F}^c\) | 9.44 | 4.61 | 9.24 | 9.40 | 8.17 |
| w/o \(\mathcal{F}^d_{SH}\) | 9.75 | 8.29 | 8.93 | 15.73 | 10.41 |
| w/o \(\mathcal{F}^d_{TH}\) | 9.30 | 5.04 | 9.18 | 12.49 | 8.82 |
| Spin-UP | 8.72 | 4.54 | 9.08 | 9.15 | 7.85 |
| S23 [5]† | 12.42 | 8.56 | 12.52 | 12.33 | 11.46 |
| Spin-UP† | 11.62 | 9.25 | 11.07 | 9.07 | 9.48 |

Methods marked with † are tested on the dataset with point light + environment light (Sec. 5.3).

**Effectiveness of shrinking range computing**. Shrinking range computing helps avoid local optima when training Spin-UP on down-sampled images while still using full-resolution image coordinates for normal calculation. We compare Spin-UP with the alternative '*w/o* Shrk.', which does not
implement this strategy. The average MAE of normal estimation across the four groups increases by \(0.96^\circ\), highlighting the importance of this strategy.

**Comparison with supervised methods under point light**. To ensure a fairer comparison with the state-of-the-art supervised method (S23 [5]), we add a dominant point light to the
environment light in the synthetic and real-world datasets^{13}. According to Table [tab:abl], the
proposed Spin-UP achieves a lower MAE on estimated normal maps than S23 [5] on the synthetic dataset (\(9.48^\circ\) ours vs. \(11.46^\circ\) for S23 [5]). As shown in Fig. 9, we obtain visually comparable results on the real-world dataset under a far-field point light (\(2\text{m}\)) and better results under a near-field point light (\(0.4\text{m}\)), validating the adaptability of Spin-UP to unseen light sources.

This paper proposes Spin-UP to address NaUPS in an unsupervised manner. Thanks to our setup that mitigates the ill-posedness, the light initialization method that alleviates the ambiguities of NaUPS, and the proposed training strategies that facilitate fast convergence, Spin-UP can recover surfaces with isotropic reflectance under various lights. Experiments on synthetic and real-world datasets show that Spin-UP is robust to various shapes, lights, and reflectances.

**Limitations and future work.** Although Spin-UP is efficient and robust in solving NaUPS, it has several limitations: 1) Spin-UP assumes infinitely far light sources, which omits spatially varying lighting; 2) the materials' base color
biases the estimated environment light; 3) Spin-UP assumes objects have isotropic reflectance and ignores inter-reflections and anisotropic effects, so it cannot perform well on objects with anisotropic reflectance, such as aluminum, or
strong inter-reflections, such as a glass bowl; 4) Spin-UP does not compute shadows iteratively, which may cause artifacts on objects with complicated shapes. Overcoming these limitations is left for future work. Besides, the rotation
axis of our device does not align with the object's center, a consequence of structural stability considerations, which may introduce bias from spatially varying light in the observed images; redesigning the image capture device is also future work.
Finally, it would be interesting to relax the requirement of single-axis \(360^\circ\) rotation to free rotation for easier implementation on portable devices.

**Acknowledgments.** This work is supported by the National Natural Science Foundation of China under Grants No. 62136001 and 62088102, the Rapid-Rich Object Search (ROSE) Lab, Interdisciplinary Graduate Programme, Nanyang Technological University,
Singapore, and the State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China.

In this supplementary material,

- we give more implementation details in Sec. 7, including the framework structure and the hyperparameter setup;

- we introduce the boundary normal calculation and the normal calculation for the rendering equation in perspective projection in Sec. 8;

- we provide an overview of the synthetic and real-world datasets in Sec. 9, and explain how we collect and preprocess the real-world dataset;

- we showcase a qualitative comparison between Spin-UP and other methods on the real-world dataset in Sec. 10, together with more results on the real-world dataset.

We use multi-layer perceptron (MLP) structures similar to those in [13], [23], shown in Fig. 10. The input to the MLPs is a pixel's 2D coordinate \(p=(x, y)\) in an image, which passes through a positional encoding module similar to [16], calculated as \[\begin{align} \begin{aligned} \gamma(p)= \left(\sin \left(2^0 \pi p\right), \cos \left(2^0 \pi p\right), \cdots, \sin \left(2^{L_p-1} \pi p\right), \cos \left(2^{L_p-1} \pi p\right)\right), \end{aligned} \end{align}\] where \(L_p\) is the positional code's dimension, set to 10 for \(\gamma^1(\cdot)\) and 6 for \(\gamma^2(\cdot)\).
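The encoding above can be sketched as follows (a NumPy illustration of \(\gamma(p)\); the function name and array layout are ours, not the released implementation):

```python
import numpy as np

def positional_encoding(p, L):
    """gamma(p): for each frequency 2^k (k = 0..L-1), emit
    sin(2^k * pi * p) and cos(2^k * pi * p) for every input coordinate."""
    p = np.atleast_1d(np.asarray(p, dtype=np.float64))
    freqs = (2.0 ** np.arange(L)) * np.pi   # (L,)
    angles = p[..., None] * freqs           # one row of angles per coordinate
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(-1)                  # flatten per-coordinate codes

code = positional_encoding(np.array([0.5, -0.25]), L=10)  # 2 coords -> 40 values
```

Each 2D pixel coordinate thus expands into \(2 \times 2 L_p\) values before entering the MLP.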

In Spin-UP, we implement:

L1 inverse rendering loss \(L_{r}\) calculated as, \(\sum_{i=1}^{N_P} \sum_{j=1}^{N_I} |\boldsymbol{m}_{i j} - \hat{\boldsymbol{m}}_{i j}|\).

Normalized color loss \(L_{\mathrm{color}}\), calculated as, \(\lambda_c \|\mathrm{Nor}(\boldsymbol{A}) - \mathrm{Nor}(\boldsymbol{I})\|\), where \(\lambda_c=0.5\).

Boundary loss \(L_{\mathrm{b}}\), calculated as the cosine similarity between the pre-computed and estimated boundary normal.

Smoothness terms \(L_{\mathrm{sm}}\) on albedo map \(\boldsymbol{A}\), normal map \(\boldsymbol{N}\), spatially varying Gaussian bases weights \(c^n\), is calculated as, \[\begin{align} \begin{aligned} L_{\mathrm{sm}} =\frac{\lambda}{N_P} \sum_{i=1}^{N_P}\left|\frac{\partial \boldsymbol{A}}{\partial x} +\frac{\partial \boldsymbol{A}}{\partial y}\right| +\frac{\lambda_N}{N_P} \sum_{i=1}^{N_P}\left|\frac{\partial \boldsymbol{N}}{\partial x} +\frac{\partial \boldsymbol{N}}{\partial y}\right| + \frac{\lambda_S}{N_P} \sum_{n=1}^{N_S}\sum_{i=1}^{N_P} \left|\frac{\partial c^n_i}{\partial x} + \frac{\partial c^n_i}{\partial y}\right|, \end{aligned} \end{align}\] where, \(\lambda = 0.01\), \(\lambda_N = 0.02\), \(\lambda_S = 0.01\).
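The loss terms above can be sketched as follows (a NumPy illustration; we average instead of summing over the \(N_P\) points for readability, and show a single smoothness term; variable names are ours):

```python
import numpy as np

def render_loss(m, m_hat):
    """L1 inverse rendering loss between observed and rendered intensities."""
    return np.abs(m - m_hat).mean()

def smoothness_term(A, weight):
    """Finite-difference smoothness weight * (|dA/dx| + |dA/dy|) for one
    H x W map, mirroring one term of L_sm."""
    dx = np.abs(np.diff(A, axis=1)).mean()   # horizontal differences
    dy = np.abs(np.diff(A, axis=0)).mean()   # vertical differences
    return weight * (dx + dy)
```

The full \(L_{\mathrm{sm}}\) would apply `smoothness_term` to the albedo map, normal map, and each Gaussian-basis weight map with the weights \(\lambda\), \(\lambda_N\), and \(\lambda_S\) given above.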

We train Spin-UP in three stages, similar to [13]. In the first stage, the loss \(\mathcal{L}_\mathrm{stage1}\) is calculated as below for faster convergence: \[\begin{align} \begin{aligned} \mathcal{L}_\mathrm{stage1} = \mathcal{L}_r + \mathcal{L}_{\mathrm{b}} + \lambda_c \mathcal{L}_\mathrm{color} + \mathcal{L}_\mathrm{sm}. \end{aligned} \end{align}\] In the second stage, we drop the smoothness term on the albedo map and set \(\lambda_N\) to 0.05 for detail refinement, where \(\mathcal{L}_N=\operatorname{TV}(\boldsymbol{N})\) is the smoothness term on the normal map: \[\begin{align} \begin{aligned} \mathcal{L}_\mathrm{stage2} = \mathcal{L}_r + \mathcal{L}_{\mathrm{b}} + \lambda_c \mathcal{L}_\mathrm{color} + \lambda_{N } \mathcal{L}_N. \end{aligned} \end{align}\] In the third stage, we drop the smoothness term \(\mathcal{L}_N\) to further refine the details: \[\begin{align} \begin{aligned} \mathcal{L}_\mathrm{stage3} = \mathcal{L}_r + \mathcal{L}_{\mathrm{b}} + \lambda_c \mathcal{L}_\mathrm{color}. \end{aligned} \end{align}\] The three stages take 500, 1000, and 500 epochs, respectively. During training, we use Adam as the optimizer with a learning rate \(\alpha=0.001\) and a batch size of 4 images per iteration.

In perspective projection, the surface normal is perpendicular to both the object's occluding boundary \(B(x, y)\) and the view direction \(\boldsymbol{v}\), as shown in Fig. 11. Therefore, the boundary normal \(\boldsymbol{n}^b\) satisfies \[\begin{align} \begin{aligned} \boldsymbol{n}^b \cdot \boldsymbol{v}^b = 0, ~\boldsymbol{n}^b \cdot \left(\frac{\partial B}{\partial x}, \frac{\partial B}{\partial y}, 1\right)^\top= 0. \end{aligned} \end{align}\] In practice, the outer boundaries of an object in images may not precisely match its actual boundaries due to limited image resolution. Therefore, we add a small offset (\(\beta=0.1\)) to make the pre-computed boundary normals more accurate: \[\begin{align} \begin{aligned} \boldsymbol{n}^b = \mathrm{Nor}(n^{bx}, n^{by}, n^{bz} + \beta). \end{aligned} \end{align}\]
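The offset step can be sketched as follows (assuming \(\mathrm{Nor}(\cdot)\) denotes L2 normalization; the function name is ours):

```python
import numpy as np

def boundary_normal(nbx, nby, nbz, beta=0.1):
    """Add a small z-offset beta to the raw boundary normal, then
    re-normalize: n_b = Nor(n_bx, n_by, n_bz + beta)."""
    n = np.array([nbx, nby, nbz + beta], dtype=np.float64)
    return n / np.linalg.norm(n)
```

The offset tilts near-silhouette normals slightly toward the camera, compensating for the discretized boundary.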

The normal fitting method [13] in orthographic projection is shown below: \[\begin{align} \begin{aligned} \boldsymbol{n} & =\sum_{k=1}^4 \gamma^k \boldsymbol{n}^k=\sum_{k=1}^4 \gamma^k \operatorname{Nor}\left[\left(\boldsymbol{p}^{k+1}-\boldsymbol{p}\right) \times\left(\boldsymbol{p}^k-\boldsymbol{p}\right)\right]^{\top}, \\ \gamma^k & =\frac{\left|d^k\right|^{-1}}{\sum_{k=1}^4\left|d^k\right|^{-1}}, \quad d^k=z^k+z^{k+1}-2 z, \end{aligned} \label{eq:nml95fit} \end{align}\tag{3}\] where \(\boldsymbol{p}^k=(x^k, y^k, z^k)\) is an adjacent point of the query point \(\boldsymbol{p}=(x, y, z)\), \(x, y \in [-1, 1]\), and \(k+1\) wraps to \(1\) if \(k + 1 > 4\), as shown in Fig. 12 (a). To extend the normal fitting method to perspective projection, we first compute the points' coordinates in the camera coordinate system by \[\begin{align} \begin{aligned} \boldsymbol{p}^{k\prime} & =(x^k\frac{z^k}{f}s_x, y^k\frac{z^k}{f}s_y, z^k), \\ \boldsymbol{p}^{\prime} & =(x\frac{z}{f}s_x, y\frac{z}{f}s_y, z), \end{aligned} \label{eq:persp} \end{align}\tag{4}\] where \(f\) is the camera's focal length, and \(s_x\) and \(s_y\) are half the width and height of the camera's frame. Replacing \(\boldsymbol{p}^{k}\) and \(\boldsymbol{p}\) in Eq. (3) with \(\boldsymbol{p}^{k\prime}\) and \(\boldsymbol{p}^{\prime}\) yields the normal fitting method in perspective projection.
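Eqs. (3) and (4) can be sketched together as follows (a NumPy illustration; the tiny epsilon guarding the \(|d^k|^{-1}\) weights when \(d^k = 0\) is our addition, not specified in the paper):

```python
import numpy as np

def unproject(x, y, z, f, sx, sy):
    """Eq. (4): lift normalized image coords (x, y) with depth z into
    camera coordinates, given focal length f and half-frame sizes sx, sy."""
    return np.array([x * z / f * sx, y * z / f * sy, z])

def fitted_normal(p, neighbors):
    """Eq. (3): blend the four triangle normals formed by the query point p
    and its neighbors p^1..p^4, weighted by inverse depth curvature |d^k|."""
    normals, weights = [], []
    for k in range(4):
        pk, pk1 = neighbors[k], neighbors[(k + 1) % 4]  # wrap k+1 > 4 -> 1
        n = np.cross(pk1 - p, pk - p)
        normals.append(n / np.linalg.norm(n))
        d = pk[2] + pk1[2] - 2.0 * p[2]
        weights.append(1.0 / (abs(d) + 1e-8))           # epsilon: our addition
    w = np.array(weights) / np.sum(weights)
    n = (w[:, None] * np.array(normals)).sum(axis=0)
    return n / np.linalg.norm(n)
```

Running `fitted_normal` on unprojected points (`unproject` applied to the query point and its four neighbors) gives the perspective-projection variant.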

In Fig. 13, we showcase all 5 objects with 6 materials under 5 HDR environment maps, rendered by Blender Cycles^{14}. This results in 16 scenes^{15} of synthetic data, which are classified into 4 groups, *i.e.*, the shape group, light group, reflectance group, and spatially varying material group.

| Properties | Soldier | Player | Dancer | Policeman | Eevee |
|---|---|---|---|---|---|
| Length (cm) | 9.50 | 11.50 | 4.00 | 4.00 | 4.00 |
| Width (cm) | 7.00 | 11.00 | 5.00 | 4.00 | 4.00 |
| Height (cm) | 3.00 | 28.00 | 4.00 | 9.00 | 9.00 |
| Distance (m) | 0.90 | 0.90 | 0.40 | 0.40 | 0.30 |

The real-world dataset contains 5 objects with spatially varying materials, captured in indoor and outdoor environments: Soldier, Player, Policeman, Dancer, and Eevee. The objects' sizes are shown in Table [tab:32object].

**Device introduction.** Observed images of Soldier, Player, Policeman, and Dancer were captured by a customized device shown in Fig. 14 (left), which consists of two stands (one holding the subject being
photographed, the other supporting the camera) and a rotating mechanism. The distance from the camera to the object is adjustable. In addition, we consider a more portable device, shown in Fig. 14 (right),
made up of a wooden rotatable platform^{16} with a diameter of 39 cm and the camera. We captured Eevee's observed images with this device.

**Photographing requirements.** Before photographing, the distance between the camera and the object is chosen based on the object's proportion in the viewfinder, ensuring the object occupies a reasonable portion of the frame. Three typical distances were used: 0.9 m for large objects and 0.4 m (or 0.3 m) for small ones. During photographing, the rule of thumb is to capture clear images with little noise and to keep the rotation velocity as uniform as possible. For
the camera's parameters, we chose ISO 1600 and an aperture of f/13 for outdoor scenes, and ISO 3200 and an aperture of f/6.3 for indoor scenes. The focal length is fixed at 31 mm across scenes.

**Pre-processing pipeline**. In the pre-processing pipeline, we extract 50 images from the video at equal intervals as our data. We then obtain the object's mask in each scene from the first frame using Photoshop; these masks help
separate objects from the background. In practice, there are translational motions in the horizontal and vertical directions, most obvious on the object, due to structural instability. Therefore, after calculating the relative rotation angle \(\theta_j\), we use a simple motion-correction algorithm, assuming the only motion of the object relative to the camera is translation in the horizontal and vertical directions. Specifically, we pre-set the range of
motion and iterate over candidate displacements to find the shift (within plus or minus 20 pixels) that minimizes the difference between consecutive frames after applying the mask. Note that although large movements are corrected in this step,
minor movements still exist and are hard to eliminate; fortunately, our method can tolerate them.
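The translational search can be sketched as follows (a simplified NumPy version; `np.roll` stands in for the actual image shift, and the masked L1 error metric is our assumption):

```python
import numpy as np

def correct_translation(prev, curr, mask, max_shift=20):
    """Brute-force search over integer shifts (dx, dy) within +/- max_shift
    pixels for the shift of `curr` minimizing the masked difference to `prev`."""
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(curr, dy, axis=0), dx, axis=1)
            err = np.abs((shifted - prev) * mask).sum()
            if err < best_err:
                best_err, best = err, (dx, dy)
    return best
```

Applying the recovered shift to each frame aligns it with its predecessor before the mask is reused across the sequence.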

We show all the estimated normal maps and error maps of Spin-UP, S23 [5], S22 [4], and HY19 [2] for the shape, light, reflectance, and spatially varying material groups in Fig. 16-Fig. 18.

We show all the estimated normal maps of Spin-UP, S23 [5], and S22 [4] on the real-world dataset in Fig. 19 and Fig. 20.

Co-first author. Corresponding author.↩︎

Work completed while interning at the State Key Lab of Brain-Machine Intelligence, Zhejiang University.↩︎

We assume the object’s boundary is geometrically smooth (occluding boundary).↩︎

The subscripts are omitted for simplicity.↩︎

In practice, we find that initializing Gaussians’ parameters by Fibonacci lattice [17] and freezing \(\lambda_t\) gives the best results.↩︎

We implement a Gaussian filter and rescale the pixel value’s range to \(\mathcal{F}^d_{TH}(m_b)\)’s range to suppress ringing effect and negative energy in SH.↩︎

We use gray-scale value to initialize the RGB value of light model.↩︎

Please refer to the supplementary material for more details about the network structures and modified normal fitting method.↩︎

Please refer to the supplementary material for more details about the setup of hyperparameters.↩︎

According to the normal's definition, the smaller the distance between the adjacent points and the query point (red circle in Fig. 5), the more accurately the fitted normal represents the geometry at the query point. Therefore, the blue circles' depths are preferred for normal calculation.↩︎

The generated patterns for Voronoi Diff and Voronoi Spec follow a similar setup to CNN-PS [32].↩︎

Please refer to the supplementary material for all the qualitative comparison between Spin-UP and other methods on the synthetic and real-world dataset.↩︎

One scene representing an object with one material rendered under HDR environment maps.↩︎

https://www.ikea.com/sg/en/p/snudda-lazy-susan-solid-wood-40176460/↩︎