GraphicsDreamer: Image to 3D Generation with Physical Consistency

Pei Chen\(^{1}\footnotemark[1]\), Fudong Wang\(^{1}\footnotemark[1]\), Yixuan Tong\(^{1,2}\), Jingdong Chen\(^{1}\), Ming Yang\(^{1}\), Minghui Yang\(^{1}\)
\(^{1}\) Ant Group \(^{2}\) Fudan University


Abstract

Recently, the surge of efficient and automated 3D AI-generated content (AIGC) methods has increasingly illuminated the path of transforming human imagination into complex 3D structures. However, the automated generation of 3D content still lags significantly behind the demands of industrial applications. This gap exists because 3D modeling demands high-quality assets with sharp geometry, exquisite topology, and physically based rendering (PBR), among other criteria. To narrow the disparity between generated results and artists’ expectations, we introduce GraphicsDreamer, a method for creating highly usable 3D meshes from single images. To better capture the geometry and material details, we integrate the PBR lighting equation into our cross-domain diffusion model, concurrently predicting multi-view color, normal, depth images, and PBR materials. In the geometry fusion stage, we continue to enforce the PBR constraints, ensuring that the generated 3D objects possess reliable texture details and support realistic relighting. Furthermore, our method incorporates topology optimization and fast UV unwrapping capabilities, allowing the 3D products to be seamlessly imported into graphics engines. Extensive experiments demonstrate that our model can produce high-quality 3D assets at a reasonable time cost compared to previous methods.


1 Introduction↩︎

Traditional 3D modeling processes rely heavily on manual labor, which significantly hinders the application of 3D content in the Internet and gaming industries. Excitingly, with the evolution of generative AI, many works [1][8] have emerged that attempt to quickly produce 3D models from text or images, showcasing the potential to free human effort from the complex and labor-intensive task of 3D modeling. However, achieving this goal is challenging, as 3D models qualified for direct integration into rendering engines must meet high graphics standards in terms of geometry and materials.

In terms of geometry, a desired 3D model should have sharp, clear edges and a concise topology so that it can be efficiently rendered and edited in graphics engines. Models with blurry geometric features, messy topology, and excessively high polygon counts are difficult for human artists to edit further, making them practically useless. In terms of materials, we expect each 3D character or object to display reasonable light and shadow effects under different lighting conditions, which can be achieved with PBR materials.

As the field of image generation sees more impressive works emerge, an intuitive method is to distill prior knowledge from image diffusion models and use Score Distillation Sampling (SDS) [9] to generate 3D models from text or images, such as DreamFusion [9], Fantasia3D [10], Magic3D [11], and ProlificDreamer [12]. While these pioneers have demonstrated the feasibility of this approach, they suffer from two limitations. The first is slow inference, because the per-shape optimization involves tens of thousands of iterations, often requiring tens of minutes or even hours. The second, known as the ‘Janus problem’ [9], [13], arises because the model strives in each iteration to align the current view with the input image, often resulting in objects with multiple faces, which severely impacts robustness.

Another strand of research directly uses 3D data as training data, training a 3D generative model from scratch to perform end-to-end generation of structures like voxels [14], [15], point clouds [8], [16][18], meshes [19], [20], and neural fields [21][24]. However, due to the limited amount of publicly accessible 3D data, these models have weak generalization capabilities and low fidelity, and often generate 3D structures not present in the input images.

To address the issue of consistency, subsequent methods directly generate multi-view consistent 2D images and reconstruct 3D geometry from them, such as SyncDreamer [1] and MVDream [25]. Fantasia3D [10] attempts to decouple geometry and textures, as well as PBR materials from ambient lighting. However, since an object’s geometry, texture, and lighting information are naturally intertwined in 2D images, relying solely on color images to optimize normal maps leads to unstable optimization due to differing data distributions, resulting in significant detail loss. Wonder3D [26] introduces normal maps into the training data and uses a cross-domain diffusion model to generate multi-view consistent color and normal images for 3D reconstruction, enhancing the detail of the 3D results but lacking a comprehensive understanding of 3D structures. RichDreamer [27] utilizes a Normal-Depth diffusion model to similarly control and generate across multiple domains and employs an albedo diffusion model to mitigate the interference of mixed illumination, yet it does not extend to generating complete PBR material maps.

Figure 1: No caption

Our method also adopts a two-stage scheme, initially generating multi-view images, followed by reconstruction, and incorporates the PBR lighting condition into each stage. Recognizing that depth information encapsulates a comprehensive understanding of the scene’s overall geometric structure, normal maps represent the surface details of 3D objects, and PBR materials provide a rich description of the object’s surface textures, we expand the first stage into a PBR diffusion model. This model predicts the joint distribution of six domains, including colors, normals, depths, and PBR components (albedo, roughness, metallic), as shown in Fig. 2. It uses a cross-domain diffusion model similar to Wonder3D [26] as its basic structure and integrates the PBR lighting conditions. To aggregate the generated multi-view images and intrinsic materials into a complete 3D object as a surface mesh, we further introduce a deep learning-based inverse rendering approach that combines a mixed implicit-explicit surface representation with PBR-constrained intrinsic material reconstruction. With the support of the PBR lighting module, our model can easily distinguish highlights and shadows from the actual surface textures of 3D objects, see Fig. 6.

Furthermore, to align with the computer graphics pipeline, we have automated the topology optimization and UV unwrapping of 3D objects, allowing them to be directly imported into rendering engines such as Blender [28], Unreal Engine, and Unity. Extensive testing on the Google Scanned Object (GSO) [29] dataset has shown that, compared to other baseline methods, GraphicsDreamer can produce high-quality 3D meshes with photorealistic textures.

The core contributions of this paper are as follows:

  • GraphicsDreamer integrates both geometry and material generation into the multi-view diffusion model, which is further enhanced by a PBR condition with environment lighting approximated as spherical Gaussians. This enhanced model provides a wealth of usable information for the subsequent reconstruction phase.

  • GraphicsDreamer proposes a deep learning-based inverse rendering approach consisting of a mixed surface representation and PBR-constrained intrinsic material enhancement. The resulting complete 3D surface mesh features smooth geometry and distinct textures, since it leverages both the generalization capability of the diffusion model and the refinement ability of inverse rendering.

  • GraphicsDreamer incorporates capabilities for topology optimization and UV unwrapping, which are often neglected in academic 3D generation methods. Experiments have shown that in terms of geometric and texture details, GraphicsDreamer is at a leading level. Moreover, the complete PBR material maps and clean topology allow the generated 3D models to be directly imported into graphics engines for immediate use.

2 Related Work↩︎

2.1 2D Diffusion for 3D Generation↩︎

Recent advances have demonstrated that, by utilizing CLIP models [30][33] or 2D diffusion models [34], [35], researchers can directly generate 3D objects from user prompts. The pioneering work DreamFusion [9] leverages score distillation sampling (SDS) to extract prior knowledge from a 2D diffusion model, iteratively optimizing a neural radiance field (NeRF) [36] and achieving zero-shot text-to-3D generation. Concurrently, SJC [13] utilizes score Jacobian chaining to achieve a similar goal. Building on this foundation, Magic3D [11] employs an improved multi-resolution SDS to enhance the precision of 3D generation. Fantasia3D [10] integrates DMTet and SDS, separating textures from geometry, aiming to improve texture quality.

However, due to the lack of multi-view constraints, these methods often produce objects with multiple faces. Additionally, the hour-level generation time required for per-shape optimization is usually unacceptable. Other methods, such as One-2-3-45 [37], Magic123 [3], and Make-it-3d [38], directly generate 3D geometries from image conditions; although they significantly speed up generation, the quality is lower, lacking geometric and texture details.

2.2 3D Generative Models↩︎

This type of generative model is trained directly on 3D data, learning to capture the distribution of 3D data, and has achieved convincing results. The forms of 3D representations can generally be classified into voxels [14], [15], point clouds [8], [16], [17], meshes [19], [20], and neural fields [21][23], [39], [40]. However, limited by the scale of available 3D training data, such models are often restricted to generating objects within certain specific categories and frequently fabricate elements not present in the input images. In contrast, our method uses a 2D representation across \(6\) domains to model 3D objects, ensuring better zero-shot capabilities and faithfully reproducing the 3D geometry of the input images.

Figure 2: No caption

2.3 Multi-view Diffusion Models↩︎

Zero-1-to-3 [6] enhances 2D diffusion by fine-tuning the Stable Diffusion model [35], enabling it to perform novel-view synthesis from specified views. More recent developments have seen significant improvements in the consistency of multi-view image generation through multi-view diffusion [1], [25], [41][46]. A prominent project in this series is MVDream [25], which fine-tunes a pretrained diffusion model using multi-view images rendered from 3D objects in the Objaverse dataset [47]. However, because these methods rely solely on RGB images, they often encounter texture ambiguity during geometric reconstruction.

The subsequent work Wonder3D [26] incorporates normal images into its training data, utilizing an RGB-Normal diffusion model to enhance the generated geometric details. RichDreamer’s [27] Normal-Depth diffusion model enables it to generate rich geometric details. In contrast, our model constructs 3D objects as a joint distribution of images in six domains, comprehensively representing both geometry and materials. By integrating the PBR conditions into both the multi-view image synthesis stage and the geometry fusion stage, our approach significantly boosts generation quality, fully adhering to the computer graphics pipeline.

2.4 Materials and Light Estimation↩︎

Known as intrinsic image decomposition [48], [49] or inverse rendering [50], estimating the geometry, intrinsic materials, and lighting of observed objects from single- or multi-view images based on physical principles [51] is a challenging task. Its ill-posed nature demands that multiple values (normal, albedo, roughness, metallic, specular, illumination, etc.) per pixel be recovered from only a few corresponding RGB values observed across views. Early literature [48], [50], [52] can solve simplified settings with specific priors, and some recent works aim to enhance the captured images using more sophisticated capture systems [53] or controllable lighting environments [54]. Inspired by the development of modern deep learning, many works [49], [55][58] learn the decompositions directly from images with neural network models trained on elaborately constructed datasets consisting of indoor scenes [59][62] or objects [63], [64]. Most of them focus on learning intrinsic priors from a single image and then estimate the lighting with an optimization method if needed.

Moreover, the decomposition of intrinsics and lighting with implicit geometry representations, e.g., neural radiance fields, has garnered significant attention in recent years, taking advantage of their more flexible differentiability. These works [65][70], inspired by NeRF [71], reformulate the imaging process using intrinsic materials and lighting together with implicit geometry representations such as the signed distance function (SDF). Some latest studies [72], [73] go further, utilizing 3D Gaussian splatting-based representations to optimize the intrinsic materials. For the lighting representation, spherical Gaussians [66], [74][76] and their variant [77] are the primary representations that can recover higher-frequency reflections compared with spherical harmonic lighting [78].

3 Method↩︎

As illustrated in Fig. 1, GraphicsDreamer consists of three phases. Given a single input image of the desired object, we first build a diffusion model to generate multi-view images (6 views), including RGB color for the overall appearance, normal and depth for geometric information, and intrinsic materials for texture details, all simultaneously controlled by a PBR condition, as demonstrated in Sec. 3.1. The generated images are then treated as pseudo ground truth for refining a deep learning-based inverse rendering reconstruction, which is also conditioned on PBR to remain consistent with the generative model. This yields a complete 3D object as a surface mesh with smooth geometry and distinct textures, as described in Sec. 3.2 and Sec. 3.3. Finally, our method produces appealing 3D assets with artistically optimized topology and UV textures, which are essential in the modern CG workflow, as shown in Sec. 3.4.

3.1 PBR Diffusion Model↩︎

The Distribution of 3D Assets. Previous work, such as Wonder3D [26] and RichDreamer [27], selected color images along with corresponding normal or depth images as learning targets for their 2D diffusion models. While this representation can model 3D geometry, it falls short in adequately decoupling intrinsic textures and lighting effects, which affects the quality of relighting results.

Therefore, to better decouple geometry, materials, and lighting, and to make the appearance of the 3D content more realistic, we propose modeling 3D assets as a joint distribution of color images and corresponding normal, depth images, as well as material component maps. Specifically, the distribution of 3D assets denoted as \(p_a(\mathbf{z})\) is defined as \[\label{eqn:joint_distribution} p_a(\mathbf{z})=p_{pbr}\left(c^{1:K}, n^{1:K}, d^{1:K}, a^{1:K}, r^{1:K}, m^{1:K} | y \right)\tag{1}\] Here, \(p_{pbr}\) refers to the distribution of the 3D asset’s multi-view images, which are colors \(c^{1:K}\), normals \(n^{1:K}\), depths \(d^{1:K}\), albedos \(a^{1:K}\), roughnesses \(r^{1:K}\), and metallics \(m^{1:K}\) under the condition of input viewpoint \(y\). \(K\) denotes the number of camera views, which is set to 6 in our experiments. Thus, our objective is to train a model \(f\) capable of predicting multi-view images across these six domains, given a single input view \(y\) and a fixed set of camera configurations \(\boldsymbol{\pi}_{1:K}\): \[\label{eqn:model_func_0} (c^{1:K}, n^{1:K}, d^{1:K}, a^{1:K}, r^{1:K}, m^{1:K})=f(y, \boldsymbol{\pi}_{1:K})\tag{2}\]

Multi-view Diffusion for Geometry and Materials. Similar to MVDream [25] and Wonder3D [26], we employ a multi-view self-attention mechanism. Previous methods have shown that connecting keys and values of different views within the attention layer for information sharing enhances the model’s capacity for 3D global perception and significantly improves the consistency of multi-view generation.

Then, we refine Wonder3D’s [26] cross-domain attention network to accommodate more domains. We utilize a domain switcher \(s\in\{0, 1\}\) to label different domains. The switcher is encoded and concatenated with the input images and camera parameters before being fed into the UNet of the diffusion model for training. Experiments show that this design preserves the prior knowledge of the pretrained model and supports fast convergence and robust generalization, even with the expansion to six domains.

The key challenge, however, is to ensure that the six domain images generated from a single view are geometrically consistent. To tackle this, we first recognize that color images most completely reflect an object’s appearance, while normal, depth, albedo, roughness, and metallic can be considered the most fundamental, indivisible atomic properties that contribute to the final appearance of an object, that is, the color image. Consequently, we treat the color image domain as the primary domain, using the Query \(Q_c\) from the color domain to calculate cross-attention \(\alpha_i\) separately with the Keys \(K_i\) from the other five domains, as \[\label{eqn:cross_attention} \alpha_i = \text{softmax}\left({Q_c \cdot K_i}/{\sqrt{d_k}}\right),\tag{3}\] where \(i\) represents the other five domains, and \(d_k\) is the dimension of the \(K_i\) vectors. Following this, we employ a single-layer MLP to perform a weighted fusion of the cross-attention weights.
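A minimal PyTorch sketch of this color-anchored cross-domain attention is given below. Module and tensor names are illustrative assumptions, and fusing the per-domain attended features with a single linear layer is one plausible reading of the weighted fusion step, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDomainFusion(nn.Module):
    """Sketch of Eqn. 3: the color domain supplies the query Q_c; each of the
    other five domains (normal, depth, albedo, roughness, metallic) supplies
    keys/values, and a single-layer MLP fuses the five attended outputs."""

    def __init__(self, dim: int, num_domains: int = 5):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.ModuleList(
            [nn.Linear(dim, 2 * dim, bias=False) for _ in range(num_domains)]
        )
        self.fuse = nn.Linear(num_domains * dim, dim)   # single-layer fusion MLP

    def forward(self, color_tokens, other_tokens):
        # color_tokens: (B, N, C); other_tokens: list of five (B, N, C) tensors
        q = self.to_q(color_tokens)                      # Q_c
        d_k = q.shape[-1]
        attended = []
        for tokens, to_kv in zip(other_tokens, self.to_kv):
            k, v = to_kv(tokens).chunk(2, dim=-1)        # K_i, V_i
            attn = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
            attended.append(attn @ v)
        fused = self.fuse(torch.cat(attended, dim=-1))   # weighted fusion
        return color_tokens + fused                      # residual update of the primary domain
```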

Then, as a process related to lighting and PBR, we use the rendering equation detailed in Eqn. 7 to supervise whether the color images are accurately represented by the four component maps: normal, albedo, roughness, and metallic. This supervision is crucial for achieving photorealistic rendering and enhances the interpretability of the model. For the specific mathematical process, see Sec. 3.3.

Figure 3: No caption

3.2 Mixed Surface Representation↩︎

With the sparse-view images and materials of the object generated above, we need to recover the geometry on which the physically based rendering and lighting procedure is performed. Although current studies [70], [71] have proved the efficiency of the implicit representation, the signed distance function (SDF), in achieving better differentiability and stability, it is incompatible with the reflection occurring on the object’s surface incorporated in the PBR function and thus cannot be used directly. In this section, we introduce a mixed representation consisting of an implicit SDF and an explicit surface such that differentiability, stability, and compatibility can be achieved simultaneously.

Implicit SDF Initialization. We initially adopt the implicit SDF representation introduced in NeuS [70], which converges to the zero-level iso-surface \(\mathcal{S}\) faster. Given a ray \(\mathbf{r}\triangleq \mathbf{r}_o + t\cdot \mathbf{r}_d\), where \(\mathbf{r}_o, \mathbf{r}_d\in \mathbb{R}^3\) are the ray origin and direction respectively, and \(t\in \mathbb{R}^+\) is the depth of the current sample along the ray, we define the SDF \(f\in \mathbb{R}\) of a sampled point \(\mathbf{x}\triangleq\mathbf{r}_o + t_x\cdot \mathbf{r}_d\) on the ray \(\mathbf{r}\) as \(f(\mathbf{x})\triangleq f_m(\theta, \mathbf{x})\), where \(f_m(\theta, \cdot)\) consists of MLPs w.r.t. the neural parameters \(\theta\). The zero-level iso-surface \(\mathcal{S}\) of the desired object satisfies \(f_m(\theta, \mathbf{x})=0 \Leftrightarrow \mathbf{x}\in \mathcal{S}\).

Empirically, we can represent an intersection point \(\mathbf{x}_s\) lying on a ray \(\mathbf{r}\) and the iso-surface \(\mathcal{S}\) simply as a weighted sum of all the sampled candidates along this ray, \[\label{eq:iso_p0}\mathbf{x}_s\triangleq \sum_{i=0}^N w_i\mathbf{x}_i, ~~\text{where }\mathbf{x}_i\triangleq \mathbf{r}_o + t_i\cdot \mathbf{r}_d.\tag{4}\] Note that this formulation is based on the assumption, guaranteed by NeuS [70], that the distribution of weights \(\{w_i\}_i\in [0, 1]\) along a ray is well approximated as a unimodal function whose value increases as \(\mathbf{x}_i\) gets closer to \(\mathcal{S}\) from the visible side (not the back side). Unfortunately, we find that the resulting \(\mathbf{x}_s\) is not smooth enough as it marches close to \(\mathcal{S}\), partly due to the limited geometric consistency of the generated sparse-view images and materials in Sec. 3.1. Therefore, we introduce a simple yet efficient sampling method to extract smoother \(\{\mathbf{x}_s\}_s\) in an explicit way, as illustrated in Fig. 4.

Figure 4: Mixed surface representation and physically-based rendering implemented in our method.

Explicit Surface Sampling. With the SDF \(f_m(\theta)\) defined above, an intersection point \(\mathbf{x}_s\in \mathbf{r}\cap\mathcal{S}\) lies between two specific sample points \(\mathbf{x}_p, \mathbf{x}_n \in \mathbf{r}\), which are defined as: \[\begin{align} \mathbf{x}_p\triangleq \mathop{\text{argmin}}\limits_{\mathbf{x}_i\in\mathbf{r}}f_m(\theta, \mathbf{x}_i), \forall~\mathbf{x}_i\in\mathbf{r}~s.t.~f_m(\theta, \mathbf{x}_i) \geq 0,\tag{5}\\ \mathbf{x}_n\triangleq \mathop{\text{argmax}}\limits_{\mathbf{x}_i\in\mathbf{r}}f_m(\theta, \mathbf{x}_i), \forall~\mathbf{x}_i\in\mathbf{r}~s.t.~f_m(\theta, \mathbf{x}_i) \leq 0.\tag{6} \end{align}\] To ensure the visibility of \(\mathbf{x}_s\), we choose the \(\mathbf{x}_p, \mathbf{x}_n\) with the smallest depth \(t_i\) following the z-buffer [79] method.

With the two selected bracketing points \(\mathbf{x}_p, \mathbf{x}_n\), we uniformly sample \(d\) (e.g., 8) points \(\{\mathbf{x}_i'\}_i\) between them, and compute the weights of these samples by considering their SDF values and the cosine similarity between gradients and view directions, except that we only interpolate a single point \(\mathbf{x}_s'\) at probability \(p=0.5\) w.r.t. the cumulative distribution function (CDF) of \(\{\mathbf{x}_i'\}_i\). Finally, the interpolated point \(\mathbf{x}_s'\) can guarantee that \(f_m(\theta, \mathbf{x}_s')\rightarrow 0\) during training with acceptable tolerance, and the smoothness is naturally enhanced, since the intervals \(\Delta\mathbf{x}_i'\) of the samples \(\{\mathbf{x}_i'\}_i\) are much smaller than the original intervals \(\Delta\mathbf{x}_i\) along the ray.
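A minimal sketch of this refinement step is given below, assuming the per-ray bracketing points have already been found. The exact weighting of SDF values and gradient-view cosines, and the selection at the CDF median instead of a true interpolation, are illustrative simplifications.

```python
import torch
import torch.nn.functional as F

def sample_surface_point(x_p, x_n, sdf_fn, ray_dir, d=8):
    """Sketch of the explicit surface sampling between the bracketing points
    x_p (SDF >= 0) and x_n (SDF <= 0). x_p, x_n, ray_dir: (B, 3); sdf_fn is
    the SDF MLP f_m(theta, .)."""
    # uniformly place d candidates on the segment [x_p, x_n]
    t = torch.linspace(0.0, 1.0, d, device=x_p.device).view(1, d, 1)
    pts = (x_p.unsqueeze(1) * (1.0 - t) + x_n.unsqueeze(1) * t).detach()
    pts.requires_grad_(True)

    sdf = sdf_fn(pts)                                              # (B, d, 1)
    grad = torch.autograd.grad(sdf.sum(), pts, create_graph=True)[0]
    grad = F.normalize(grad, dim=-1)

    # weights favour candidates near the zero level set whose normals face the view
    facing = (-(grad * ray_dir.unsqueeze(1)).sum(-1, keepdim=True)).clamp(min=1e-4)
    w = torch.exp(-sdf.abs()) * facing
    w = w / w.sum(dim=1, keepdim=True)

    # pick a single point at p = 0.5 of the cumulative distribution along the segment
    cdf = torch.cumsum(w.squeeze(-1), dim=1)
    idx = torch.searchsorted(cdf.contiguous(), torch.full_like(cdf[:, :1], 0.5))
    idx = idx.clamp(max=d - 1)
    x_s = torch.gather(pts, 1, idx.unsqueeze(-1).expand(-1, 1, 3)).squeeze(1)
    return x_s
```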

This sampling method looks similar to the well-known ray marching [80], which is also used in [66], [67]. However, we observed a slow convergence rate when directly applying ray marching in our case. Moreover, the many iterations ray marching requires to search for the points closest to \(\mathcal{S}\) are much more time-consuming than our method.

3.3 Physically-based Rendering↩︎

For training both the multi-view generative model and the inverse rendering reconstruction, we implement a physically-based rendering [51] approximation with the simplified Disney principled BRDF [81] and spherical Gaussian [74], [75] lighting, as illustrated in Fig. 4.

The Rendering Equation. To formulate the interaction of light with an object’s surface, the rendering equation [82] has been introduced based on the physical law of energy conservation, accounting for the contribution of the object’s intrinsic materials. Given a surface point \(\mathbf{x}\) (e.g., \(\mathbf{x}_s\) above) with normal \(\mathbf{n}\), let \(L_i(\omega_i, \mathbf{x})\) be the radiance of incident light shining at point \(\mathbf{x}\) along incident direction \(\omega_i\). For an observation direction \(\omega_o\) toward the point \(\mathbf{x}\), the reflection accounting for geometry and materials is reduced to the bidirectional reflectance distribution function (BRDF) \(f_r({\omega}_o, {\omega}_i; \mathbf{x})\), and the observed radiance \(L_o({\omega}_o, \mathbf{x})\) is defined as \[\label{eq:pbr_eq0} L_o({\omega}_o, \mathbf{x}) = \int_{\Omega} {L_i({\omega}_i)} f_r({\omega}_o, {\omega}_i; \mathbf{x}) ({\omega}_i \cdot \mathbf{n}) d{\omega}_i,\tag{7}\] where \(\Omega \triangleq \{{\omega}_i \,|\, {\omega}_i \cdot \mathbf{n} \geq 0\}\) is the hemisphere over which the integral in Eqn. 7 is conducted.

Disney BRDF. The implementation of the BRDF plays a core role in Eqn. 7. We utilize the widely used Disney BRDF [81], developed from Cook-Torrance [83], as \[\begin{align} f_r({\omega}_o, {\omega}_i; \mathbf{x})&\triangleq f_d + f_s({\omega}_o, {\omega}_i; \mathbf{x})\\ &=k_d\frac{\mathbf{a}}{\pi} + \frac{D(\mathbf{h}) F(\omega_o, \omega_i) G(\omega_o, \omega_i)}{4(\omega_o \cdot \mathbf{n})(\omega_i \cdot \mathbf{n})} \end{align}\] where \(k_d\) is the diffuse refraction, \(\mathbf{a}\in [0, 1]^3\) is the albedo, \(\mathbf{h}=(\omega_o + \omega_i)/||\omega_o + \omega_i||_2\) is the half vector, and \(D, F, G\) are the normal distribution function (NDF), Fresnel, and geometry terms, respectively. The integral in Eqn. 7 can be calculated approximately either by discrete integration (e.g., precomputed radiance transfer (PRT)) [84] or in closed form [74], [75]. For efficiency, we adopt the closed-form implementation, which requires approximating each term as a spherical Gaussian, as described below.

Spherical Gaussian Formulation. A spherical Gaussian (SG) in \(\mathbb{R}^3\) is defined as [75]: \[\label{eq:SG_def} G_s(\mathbf{v};\mathbf{p},\lambda,\boldsymbol{\mu})\triangleq \boldsymbol{\mu} e^{\lambda(\mathbf{v}\cdot\mathbf{p}-1)},\tag{8}\] where \(\mathbf{v}\in\mathbb{S}^2\) is the input vector, \(\mathbf{p}\in \mathbb{S}^2\) is the lobe axis, \(\lambda\in\mathbb{R}^+\) is the sharpness, and \(\boldsymbol{\mu}\in \mathbb{R}_+^3\) is the amplitude.

Light as SG. Concretely, the light \(L_i\) is represented as a mixture of \(N=16\) SGs as \[\label{eq:sg_li} L_i(\omega_i)\triangleq \sum_{j=1}^N G_s(\omega_i;\mathbf{p}_j,\lambda_j,\boldsymbol{\mu}_j).\tag{9}\]
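For concreteness, a small PyTorch sketch of Eqn. 8 and the 16-lobe light mixture of Eqn. 9 follows; parameter shapes and names are assumptions of this sketch rather than our released code.

```python
import torch

def spherical_gaussian(v, p, lam, mu):
    """Evaluate G_s(v; p, lam, mu) = mu * exp(lam * (v . p - 1))  (Eqn. 8).

    v:   (..., 3) unit query directions
    p:   (N, 3)  unit lobe axes
    lam: (N,)    sharpness values
    mu:  (N, 3)  RGB amplitudes
    Returns per-lobe radiance of shape (..., N, 3)."""
    cos = (v.unsqueeze(-2) * p).sum(-1, keepdim=True)            # (..., N, 1)
    return mu * torch.exp(lam.unsqueeze(-1) * (cos - 1.0))

def environment_light(omega_i, sg_params):
    """Incident light L_i(omega_i) as a mixture of N = 16 SG lobes (Eqn. 9);
    sg_params holds the learnable 'p' (16, 3), 'lam' (16,), 'mu' (16, 3)."""
    lobes = spherical_gaussian(omega_i, sg_params["p"], sg_params["lam"], sg_params["mu"])
    return lobes.sum(dim=-2)                                     # sum over the 16 lobes
```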

BRDF as SG. The NDF term \(D(\mathbf{h})\) can be approximately represented as a single spherically wrapped SG [75], \[\label{eq:sg_wrap_D} D(\mathbf{h}) \approx G_s(\mathbf{h};\mathbf{p}^w, \lambda^w, \boldsymbol{\mu}^w),\tag{10}\] where \(\mathbf{p}^w, \lambda^w, \boldsymbol{\mu}^w\) are wrapped as: \[\begin{align} \mathbf{p}^w \triangleq2(\omega_o\cdot\mathbf{n})\mathbf{n}-\omega_o, \quad\lambda^w &\triangleq \frac{2}{r^4}, \quad\boldsymbol{\mu}^w \triangleq \frac{1}{\pi r^4}, \end{align}\] where \(r\in [0, 1]\) is the roughness at point \(\mathbf{x}\). Moreover, under the smooth and constant assumption [75], \(F(\omega_o, \omega_i)\) and \(G(\omega_o, \omega_i)\) can be calculated as constants \(F_0, G_0\) on the support of \(D(\mathbf{h})\) with the approximation \(\omega_i\approx 2(\omega_o\cdot\mathbf{n})\mathbf{n}-\omega_o\). Thus \(f_s({\omega}_o, {\omega}_i; \mathbf{x})\) can be rewritten as \[\label{eq:sg_wrap_final} f_s({\omega}_o, {\omega}_i; \mathbf{x})\approx G_s(\mathbf{h};\mathbf{p}^w, \frac{\lambda^w}{4|\omega_o\cdot\mathbf{n}|}, F_0G_0\boldsymbol{\mu}^w).\tag{11}\] See more details about calculating \(k_d, F_0, G_0\) w.r.t. \(\omega_o, \omega_i, \mathbf{n}, r, m, s\) 1 in our supplemental materials.
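The wrapped specular lobe of Eqns. 10 and 11 can be assembled as in the sketch below, where f0_g0 stands for the precomputed constant \(F_0 G_0\); the clamping values are illustrative numerical safeguards, not part of the formulation.

```python
import torch

def specular_sg(omega_o, n, r, f0_g0):
    """Sketch of the specular lobe as a wrapped SG (Eqns. 10-11).

    omega_o, n: (..., 3) view direction and surface normal
    r:          (..., 1) roughness; f0_g0: precomputed F_0 * G_0
    Returns the lobe axis, sharpness, and amplitude of the specular SG."""
    o_dot_n = (omega_o * n).sum(-1, keepdim=True)
    p_w = 2.0 * o_dot_n * n - omega_o                 # mirror-direction lobe axis
    r4 = (r ** 4).clamp(min=1e-4)
    lam_w = 2.0 / r4                                  # NDF sharpness 2 / r^4
    mu_w = 1.0 / (torch.pi * r4)                      # NDF amplitude 1 / (pi r^4)
    lam_s = lam_w / (4.0 * o_dot_n.abs().clamp(min=1e-4))   # Eqn. 11 rescaling
    mu_s = f0_g0 * mu_w
    return p_w, lam_s, mu_s
```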

Cosine as SG. Now we consider the only remaining term in Eqn. 7, the cosine \(\omega_i\cdot \mathbf{n}\). As proposed in [74], it can be approximated as \[\label{eq:sg_cosine} \omega_i\cdot \mathbf{n}\approx G_s(\omega_i;\mathbf{n},0.0315,32.7080) - 31.7003.\tag{12}\]

Finally, the integrand in Eqn. 7 is the product of the three SGs in Eqns. 9, 11, and 12; this product is itself a spherical Gaussian and can be integrated in closed form [74].
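The closed form relies on two standard SG identities from [74], [75]: the product of two SGs is again an SG, and a single SG integrates over the sphere in closed form. A minimal sketch of both identities (variable names are ours) is given below.

```python
import torch

def sg_product(p1, lam1, mu1, p2, lam2, mu2):
    """Product of two SGs is again an SG: the merged lobe has axis
    p_m = (lam1*p1 + lam2*p2) / ||.||, sharpness lam_m = ||lam1*p1 + lam2*p2||,
    and amplitude mu_m = mu1 * mu2 * exp(lam_m - lam1 - lam2)."""
    pm = lam1.unsqueeze(-1) * p1 + lam2.unsqueeze(-1) * p2
    lam_m = pm.norm(dim=-1)
    pm = pm / lam_m.clamp(min=1e-8).unsqueeze(-1)
    mu_m = mu1 * mu2 * torch.exp(lam_m - lam1 - lam2).unsqueeze(-1)
    return pm, lam_m, mu_m

def sg_integral(lam, mu):
    """Integral of an SG over the full sphere: 2*pi*mu*(1 - exp(-2*lam)) / lam."""
    lam = lam.unsqueeze(-1)
    return 2.0 * torch.pi * mu * (1.0 - torch.exp(-2.0 * lam)) / lam
```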

3.4 Asset Enhancement↩︎

Mesh Quadrification and UV Unwrapping. The meshes generated by the Marching Cubes algorithm typically consist of millions of uneven triangles with messy topology, making it very difficult for artists to make any edits, such as reducing polygons to create additional level-of-detail (LOD) variants. To overcome this challenge, we use Blender’s [28] Quad Remesher [85] tool to remesh these triangular meshes into quad-faced meshes with a reasonable number of faces (e.g., \(20k\)), while preserving sharp edges and flat surfaces. Next, we unwrap the UVs of the remeshed objects automatically and bake the per-vertex colors from the original high-poly model onto the remeshed low-poly model, ultimately converting it into a 3D asset that meets PBR standards, as shown in Fig. 1. This whole process efficiently and reliably produces high-quality, refined 3D assets, facilitating the direct use of the generated digital assets within existing computer graphics (CG) workflows.
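A hedged Blender Python sketch of this enhancement step is shown below. It substitutes Blender’s built-in QuadriFlow remesh operator for the commercial Quad Remesher add-on, assumes the high- and low-poly objects are named "high" and "low", and requires a material with an active image texture node on the low-poly object to receive the bake.

```python
import bpy

def remesh_unwrap_bake(high_name="high", low_name="low", target_faces=20000):
    """Sketch of quad remeshing, UV unwrapping, and selected-to-active baking.
    Object names, operator choices, and parameters are illustrative."""
    high = bpy.data.objects[high_name]
    low = bpy.data.objects[low_name]

    # 1) Quad remesh the low-poly copy to roughly 20k faces
    bpy.context.view_layer.objects.active = low
    low.select_set(True)
    bpy.ops.object.quadriflow_remesh(target_faces=target_faces)

    # 2) Automatic UV unwrapping of the remeshed object
    bpy.ops.object.mode_set(mode="EDIT")
    bpy.ops.mesh.select_all(action="SELECT")
    bpy.ops.uv.smart_project(angle_limit=1.15)   # about 66 degrees, in radians
    bpy.ops.object.mode_set(mode="OBJECT")

    # 3) Bake the high-poly appearance onto the low-poly UV space with Cycles
    #    (an image texture node must be active on the low-poly material)
    bpy.context.scene.render.engine = "CYCLES"
    bpy.context.scene.render.bake.use_selected_to_active = True
    high.select_set(True)
    low.select_set(True)
    bpy.context.view_layer.objects.active = low
    bpy.ops.object.bake(type="DIFFUSE", pass_filter={"COLOR"})
```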

4 Experiments↩︎

4.1 Implementation Details↩︎

We wrote a Blender script to filter approximately \(32,000\) 3D objects with complete PBR material maps from the Objaverse dataset [47], and then normalized all objects to a unit scale. To create a multi-view image dataset, cameras were placed in six positions: front, back, left, right, front-right, and front-left. We automated the modification of shader node connections in the \(.glb\) format 3D objects to render multi-view images of color, normal, depth, albedo, roughness, and metallic.

In the multi-view synthesis stage, we fine-tuned on the pretrained Stable Diffusion Image Variants Model, which has image-to-image generation capabilities. We employed a batch size of \(512\) and an image resolution of \(256\), with the multi-view self-attention training for \(30,000\) steps and the cross-domain attention training for \(20,000\) steps. The entire training process was conducted on a single machine with \(8\) A100 GPUs, taking nearly \(5\) days.

In the inverse rendering phase, the SDF MLP \(f_m(\theta)\) consists of 8 nonlinear layers of width 128, with a skip connection at the \(4^{\text{th}}\) layer. The material MLP \(f_c(\theta')\) comprises 4 nonlinear layers of width 128, concluding with a Sigmoid activation layer. Positional encoding [71] is employed in both \(f_m(\theta)\) and \(f_c(\theta')\), utilizing \(L=10\) frequency components. For the parameters \(\{\mathbf{p}_j,\lambda_j,\boldsymbol{\mu}_j\}_j\) of the light \(L_i\), we initialize the lobe axes \(\{\mathbf{p}_j\}_j\) to be uniformly distributed on the unit sphere \(\mathbb{S}^2\) and normalize the amplitudes \(\{\boldsymbol{\mu}_j\}_j\) by dividing by the total energy.
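A minimal PyTorch sketch of the SDF MLP \(f_m(\theta)\) with the stated width, depth, skip connection, and \(L=10\) positional encoding is given below; the activation choice and initialization are illustrative assumptions, not the exact training configuration.

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """NeRF-style positional encoding with L = 10 frequency bands."""
    def __init__(self, num_freqs=10):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)

    def forward(self, x):                          # x: (..., 3)
        proj = x.unsqueeze(-1) * self.freqs        # (..., 3, L)
        enc = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)   # (..., 3 + 6L)

class SDFNetwork(nn.Module):
    """Sketch of f_m(theta): 8 layers of width 128, skip connection at layer 4."""
    def __init__(self, width=128, num_layers=8, skip=4):
        super().__init__()
        self.encode = PositionalEncoding(10)
        in_dim = 3 + 3 * 2 * 10                    # 63-dim encoded input
        self.skip = skip
        layers = []
        for i in range(num_layers):
            d_in = in_dim if i == 0 else (width + in_dim if i == skip else width)
            layers.append(nn.Linear(d_in, width))
        self.layers = nn.ModuleList(layers)
        self.out = nn.Linear(width, 1)             # signed distance value

    def forward(self, x):
        feat = self.encode(x)
        h = feat
        for i, layer in enumerate(self.layers):
            if i == self.skip:
                h = torch.cat([h, feat], dim=-1)   # re-inject the encoded input
            h = torch.relu(layer(h))
        return self.out(h)
```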

Figure 5: No caption

4.2 Evaluation↩︎

Baseline We adopt Zero123 [6], One-2-3-45++ [86], SyncDreamer [1], Wonder3D [26], InstantMesh [87], SF3D [88], and the latest work, 3DTopia-XL [89], as baselines for image-to-3D generation. Zero123 [6] can generate novel views of the input object from arbitrary viewpoints. One-2-3-45++ [86] quickly generates 3D content through a multi-view diffusion scheme. SyncDreamer [1] focuses on generating more consistent multi-view images. Wonder3D [26] extends the operational domain of the diffusion model to RGB and normal images. InstantMesh [87], SF3D [88], and 3DTopia-XL [89] are categorized as 3D generative models.

Evaluation Datasets Following other works [1], [6], we choose the Google Scanned Object (GSO) dataset [29] for evaluation. It includes various everyday items, toys, and animals, and we have also added some plant and game prop images collected from the internet for visualization.

Metrics For the first stage of novel-view synthesis, we use PSNR and SSIM [90] to assess the quality of the generated color images. For the sparse-view reconstruction task, we measure two metrics, Chamfer Distance (CD) and Volume IoU, between the reconstructed shape and the ground truth.

4.3 Novel View Synthesis↩︎

We evaluated the quality of novel-view synthesis across different methods. Qualitative results can be seen in Fig. 2 and Fig. 5, while the quantitative outcomes are presented in Tab. 1. Wonder3D [26], which lacks depth information as an overall geometric prior, sometimes produces distorted geometries and struggles with complex structures. Although SyncDreamer [1] introduces a volume attention scheme to enhance consistency, it often yields unreasonable results. In contrast, our method faithfully generates 3D models according to the input images and performs well in terms of both geometry and texture.

Table 1: The quantitative comparison on novel view synthesis.
Method PSNR\(\uparrow\) SSIM\(\uparrow\)
Zero123 [6] \(18.64\) \(0.796\)
SyncDreamer [1] \(20.05\) \(0.803\)
Wonder3D [26] \(24.11\) \(0.893\)
Ours \(\mathbf{27.93}\) \(\mathbf{0.937}\)

4.4 Surface Reconstruction↩︎

Table 2: Quantitative comparison on single view reconstruction.
Method Chamfer Dist. \(\downarrow\) Volume IoU \(\uparrow\)
Zero123 [6] \(0.0342\) \(0.5033\)
One-2-3-45++ [86] \(0.0274\) \(0.5433\)
SyncDreamer [1] \(0.0249\) \(0.5301\)
Wonder3D [26] \(0.0237\) \(0.5762\)
InstantMesh [87] \(0.0246\) \(0.5591\)
SF3D [88] \(0.0311\) \(0.5203\)
3DTopia-XL [89] \(0.0378\) \(0.5126\)
Ours \(\mathbf{0.0231}\) \(\mathbf{0.5779}\)

We evaluate the effectiveness of the mixed surface representation proposed in Sec. 3.2, which enhances smoothness and contributes to a more visually appealing surface. Compared to other methods, our approach achieves state-of-the-art surface accuracy, as shown in Tab. 2 and Fig. 3.

Figure 6: No caption

4.5 Materials and Relighting↩︎

Figure 7: Relighting of the 3D objects under varying lighting environment maps, from bright to dark. Panels: (a) input image, (b) bright light, (c) midtone light, (d) dark light.

We can naturally relight the 3D objects following the physically based rendering (PBR) process, using the materials integrated into the complete 3D surface mesh and mapped into UV space. We employ three types of environment maps to simulate dark, ordinary, and bright conditions, respectively. Fig. 6 demonstrates that our model can handle special cases such as highlights and metallic materials, and Fig. 7 shows how the intrinsic materials extracted by our method interact with varying lighting environment maps.

5 Conclusions↩︎

In this paper, we introduce GraphicsDreamer, an advanced 3D object modeling workflow tailored for modern graphics. It comprises a multi-view diffusion generative model that integrates geometry and intrinsic materials, alongside a deep learning-based inverse rendering stage that aggregates the multi-view images and intrinsic materials into a complete 3D surface mesh. By employing physically based rendering (PBR) as a condition in both stages, GraphicsDreamer ensures convincing consistency in materials throughout the entire process. Furthermore, GraphicsDreamer incorporates topology optimization and UV unwrapping, which are often overlooked in previous works. Experiments demonstrate that our method excels in fidelity and accuracy, particularly regarding geometric and textural details, while also ensuring clean topology.

In future work, we plan to increase the amount of training data and simultaneously render objects with a wider variety of lighting conditions using thousands of environment maps. With the improved re-training of generative intrinsic materials, we aim to achieve a more stable and accurate inverse rendering refinement process.

In this supplementary document, we first introduce the creation of the training dataset used for generating PBR materials. Then, we outline some implementation details of inverse rendering, and finally, we present more visual results.

6 Dataset Preparation↩︎

To make full use of the material information embedded within the .glb format 3D assets from the Objaverse dataset [47], and to supply our cross-domain PBR material generation model with high-quality training data, we wrote a Blender [28] script and filtered out approximately \(32,000\) 3D objects with complete PBR materials. After normalizing them to a unit size, we rendered these objects from six viewpoints: front, back, left, right, front-right, and front-left, to obtain multi-view images of color, normal, depth, albedo, roughness, and metallic, as shown in Fig. 8. Specifically, the first four types of images (color, normal, depth, and albedo) can be output directly from the existing shader nodes, while the roughness and metallic images are obtained by automatically modifying the connections of the shader nodes and outputting them from an added ShaderNodeEmission node, see Fig. 9. We will release the dataset creation code upon acceptance of this paper.
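In the meantime, a hedged sketch of the shader rewiring for one scalar channel is given below: it routes a Principled BSDF input (e.g., Roughness or Metallic) through an added ShaderNodeEmission so that a plain render outputs that channel. Node and socket names follow Blender’s defaults, and error handling is omitted.

```python
import bpy

def expose_channel_via_emission(material, channel="Roughness"):
    """Route a scalar Principled BSDF input through an emission shader so that
    rendering the object outputs that channel as an image (illustrative sketch)."""
    tree = material.node_tree
    nodes, links = tree.nodes, tree.links

    principled = next(n for n in nodes if n.type == "BSDF_PRINCIPLED")
    output = next(n for n in nodes if n.type == "OUTPUT_MATERIAL")

    emission = nodes.new("ShaderNodeEmission")
    socket = principled.inputs[channel]
    if socket.is_linked:
        # channel driven by a texture: reroute that texture into the emission color
        links.new(socket.links[0].from_socket, emission.inputs["Color"])
    else:
        # channel is a constant scalar: copy it into the emission color
        v = socket.default_value
        emission.inputs["Color"].default_value = (v, v, v, 1.0)

    # render the emission directly instead of the full BSDF
    links.new(emission.outputs["Emission"], output.inputs["Surface"])
```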

7 Details of Inverse Rendering↩︎

In this section, we demonstrate the implementation details of inverse rendering, which includes sampling the explicit surface in Sec. 3.2 and calculating the simplified Disney BRDF in Sec. 3.3.

7.1 Explicit Surface Sampling↩︎

Z-buffer Implementation. As noted in lines \(335-367\), for each ray \(\mathbf{r}\), we will first find two specific sample points \(\mathbf{x}_p,~\mathbf{x}_n\in\mathbf{r}\) following the z-buffer [79] method. Since the original z-buffer method is used for rasterizing a mesh surface, we have implemented it with specific revisions in our work.

Concretely, given \(N\) (e.g., \(N=64\)) sample points \(\{\mathbf{x}_i\}_{i=1}^N\) along the ray \(\mathbf{r}\), with the definition of \(\mathbf{x}_i\) in Eqn. (4) in line \(343\), we reorder \(\{\mathbf{x}_i\}_i\) w.r.t. an increasing order of \(\{t_i\}_i\). Next, we compute the SDF values of \(\{\mathbf{x}_i\}_i\) with the SDF MLPs as \(\{f_m(\theta,\mathbf{x}_i)\}_i\), which hold positive values in the exterior space and negative values in the interior space. We also compute the signs of these SDF values as \(\{\text{sign}(f_m(\theta,\mathbf{x}_i))\}_i\). Therefore, we have

\[\label{eq:indicator} \text{sign}(f_m(\theta,\mathbf{x}_i))=\left\{ \begin{array}{lr}+1,~~\mathbf{x}_i~{\text{in exterior space}}, & \\-1,~~\mathbf{x}_i~{\text{in interior space}}.\\ \end{array} \right.\tag{13}\]

In this way, the product of the signs of two neighboring points \(\mathbf{x}_{i-1},~\mathbf{x}_i\), namely \(\text{sign}(f_m(\theta,\mathbf{x}_{i-1}))\cdot \text{sign}(f_m(\theta,\mathbf{x}_i))\), equaling \(-1\) means that there exists an intersection point \(\mathbf{x}_s\) between \(\mathbf{x}_{i-1}\) and \(\mathbf{x}_i\). The rule of the z-buffer requires that the intersection point \(\mathbf{x}_s\) has the minimal depth \(t_s\) along the ray \(\mathbf{r}\). Fortunately, since the depth values \(\{t_i\}_i\) are arranged in increasing order w.r.t. the indices \(i=0,1,...,N-1\), we can obtain the desired pair \((\mathbf{x}_{i-1}, \mathbf{x}_i)\) at the first instance where \(\text{sign}(f_m(\theta,\mathbf{x}_{i-1}))\cdot \text{sign}(f_m(\theta,\mathbf{x}_i)) == -1\) occurs. This can be efficiently implemented using the \(\text{argmax}()\) or \(\text{argmin}()\) operator. Thus, we have indeed identified a pair \((\mathbf{x}_p, \mathbf{x}_n)\triangleq (\mathbf{x}_{i-1}, \mathbf{x}_i)\).
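The sign-change search can be vectorized over all rays, as in the short sketch below; it is equivalent to the argmax/argmin trick described above, and the tensor names are illustrative.

```python
import torch

def find_bracketing_pair(sdf_vals):
    """Sketch of the revised z-buffer search. sdf_vals: (R, N) SDF values of
    samples already sorted by increasing depth t_i along each ray. Returns the
    indices (i - 1, i) of the first +/- sign flip and a per-ray validity mask."""
    signs = torch.sign(sdf_vals)                      # +1 exterior, -1 interior
    flips = (signs[:, :-1] * signs[:, 1:]) < 0        # (R, N-1) sign changes
    valid = flips.any(dim=1)                          # rays that actually hit the surface

    # pick the FIRST crossing along each ray (minimal depth, i.e. the z-buffer rule)
    big = flips.shape[1]
    idx = torch.arange(big, device=sdf_vals.device)
    masked = idx.expand_as(flips).masked_fill(~flips, big)
    first = masked.min(dim=1).values.clamp(max=big - 1)
    return first, first + 1, valid
```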

7.2 Disney BRDF Calculation↩︎

As demonstrated in lines \(421-433\), we calculate the simplified Disney BRDF as follows.

Given an intersection point \(\mathbf{x}_s\) sampled above, we derive the normal of \(\mathbf{x}_s\) as the normalized gradient of the SDF MLPs \(f_m(\theta,\cdot)\) as \[\mathbf{n}=\frac{\nabla_{\mathbf{x}} f_m(\theta, \mathbf{x}_s)}{||\nabla_{\mathbf{x}} f_m(\theta, \mathbf{x}_s)||_2}.\] Also, we utilize the materials MLPs \(f_c(\theta', \cdot)\) to compute the corresponding intrinsic materials as \(f_c(\theta', \mathbf{x}_s)=[\mathbf{a}, r, m, s] \in [0, 1]^6\). Next, we use the following expressions to calculate the terms of the BRDF inspired by prior work [81].

For the diffuse refraction \(k_d\) in Eqn. (9) in line \(403\), we compute \(k_d=(1-m)k_d^ik_d^o\) with \[\begin{align} \omega_i &\approx 2(\omega_o\cdot\mathbf{n})\mathbf{n}-\omega_o,\\ \mathbf{h} &= \frac{\omega_i + \omega_o}{||\omega_i + \omega_o||_2},\\ F_{D90} &= 0.5 + 2 (\omega_i\cdot\mathbf{h})^2 r,\\ k_d^i &= 1 + (F_{D90} - 1)(1 - \omega_i\cdot \mathbf{n})^5,\\ k_d^o &= 1 + (F_{D90} - 1)(1 - \omega_o\cdot \mathbf{n})^5. \end{align}\]

For the Fresnel term \(F_0\) in Eqn. (14) in line \(431\), we compute it by \[\begin{align} C_s &= (1-m)s + m\mathbf{a},\\ F_0 &= C_s + (1-C_s)(1- \omega_i\cdot\mathbf{h})^5. \end{align}\]

For the geometry term \(G_0\) in Eqn. (14) in line \(431\), we compute it by \[\begin{align} k &= \frac{(r+1)^2}{8},\\ G_0 &= \frac{\omega_i\cdot \mathbf{n}}{\omega_i\cdot \mathbf{n}(1-k)+k}\cdot\frac{\omega_o\cdot \mathbf{n}}{\omega_o\cdot \mathbf{n}(1-k)+k}. \end{align}\]
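The three expressions above translate directly into code; the sketch below assumes column-shaped scalars of shape \((\dots, 1)\) for \(r, m, s\) and \((\dots, 3)\) tensors for directions and the albedo.

```python
import torch
import torch.nn.functional as F

def brdf_terms(omega_o, n, albedo, r, m, s):
    """Sketch of the simplified Disney BRDF terms k_d, F_0, G_0 from the
    expressions above; shapes and broadcasting are illustrative."""
    # mirror-direction approximation for omega_i and the half vector h
    omega_i = 2.0 * (omega_o * n).sum(-1, keepdim=True) * n - omega_o
    h = F.normalize(omega_i + omega_o, dim=-1)

    i_dot_n = (omega_i * n).sum(-1, keepdim=True)
    o_dot_n = (omega_o * n).sum(-1, keepdim=True)
    i_dot_h = (omega_i * h).sum(-1, keepdim=True)

    # diffuse refraction k_d = (1 - m) * k_d^i * k_d^o
    f_d90 = 0.5 + 2.0 * i_dot_h ** 2 * r
    k_d_i = 1.0 + (f_d90 - 1.0) * (1.0 - i_dot_n) ** 5
    k_d_o = 1.0 + (f_d90 - 1.0) * (1.0 - o_dot_n) ** 5
    k_d = (1.0 - m) * k_d_i * k_d_o

    # Fresnel term F_0 with specular color C_s = (1 - m) s + m a
    c_s = (1.0 - m) * s + m * albedo
    f_0 = c_s + (1.0 - c_s) * (1.0 - i_dot_h) ** 5

    # geometry term G_0 with k = (r + 1)^2 / 8
    k = (r + 1.0) ** 2 / 8.0
    g_0 = (i_dot_n / (i_dot_n * (1.0 - k) + k)) * (o_dot_n / (o_dot_n * (1.0 - k) + k))
    return k_d, f_0, g_0
```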

Figure 8: No caption

Figure 9: No caption

8 More Results↩︎

We enhance the generated 3D objects by automated remeshing, UV unwrapping, and baking, producing 3D assets that can be directly imported into graphics engines. More results can be seen in Fig. 10 and Fig. 11.

Figure 10: No caption

Figure 11: No caption

References↩︎

[1]
Y. Liu et al., “SyncDreamer: Generating multiview-consistent images from a single-view image,” arXiv preprint arXiv:2309.03453, 2023.
[2]
H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” arXiv preprint arXiv:2305.02463, 2023.
[3]
G. Qian et al., “Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors,” arXiv preprint arXiv:2306.17843, 2023.
[4]
L. Melas-Kyriazi, I. Laina, C. Rupprecht, and A. Vedaldi, “RealFusion: 360° reconstruction of any object from a single image,” in CVPR, 2023.
[5]
Z. Dou et al., “Tore: Token reduction for efficient human mesh recovery with transformer,” 2023, pp. 15143–15155.
[6]
R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick, “Zero-1-to-3: Zero-shot one image to 3D object,” in ICCV, 2023.
[7]
X. Long et al., “Adaptive surface normal constraint for depth estimation,” in ICCV, 2021, pp. 12849–12858.
[8]
A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen, “Point-e: A system for generating 3d point clouds from complex prompts,” arXiv preprint arXiv:2212.08751, 2022.
[9]
B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “DreamFusion: Text-to-3D using 2D diffusion,” in ICLR, 2023.
[10]
R. Chen, Y. Chen, N. Jiao, and K. Jia, “Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,” arXiv preprint arXiv:2303.13873, 2023.
[11]
C.-H. Lin et al., “Magic3d: High-resolution text-to-3d content creation,” 2023.
[12]
Z. Wang et al., “ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation,” arXiv preprint arXiv:2305.16213, 2023.
[13]
H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich, “Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation,” in CVPR, 2023.
[14]
J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” NeurIPS, vol. 29, 2016.
[15]
P. Henzler, N. J. Mitra, and T. Ritschel, “Escaping Plato’s cave: 3D shape from adversarial rendering,” in ICCV, 2019, pp. 9984–9993.
[16]
X. Zeng et al., “LION: Latent point diffusion models for 3D shape generation,” 2022.
[17]
S. Luo and W. Hu, “Diffusion probabilistic models for 3D point cloud generation,” in CVPR, 2021, pp. 2837–2845.
[18]
L. Zhou, Y. Du, and J. Wu, “3D shape generation and completion through point-voxel diffusion,” in ICCV, 2021, pp. 5826–5835.
[19]
Z. Liu, Y. Feng, M. J. Black, D. Nowrouzezahrai, L. Paull, and W. Liu, “MeshDiffusion: Score-based generative 3D mesh modeling,” in ICLR, 2023.
[20]
J. Gao et al., “Get3d: A generative model of high quality 3d textured shapes learned from images,” NeurIPS, 2022.
[21]
T. Wang et al., “Rodin: A generative model for sculpting 3d digital avatars using diffusion,” 2023.
[22]
Y.-C. Cheng, H.-Y. Lee, S. Tulyakov, A. G. Schwing, and L.-Y. Gui, “SDFusion: Multimodal 3D shape completion, reconstruction, and generation,” in CVPR, 2023.
[23]
A. Gupta, W. Xiong, Y. Nie, I. Jones, and B. Oğuz, “3dgen: Triplane latent diffusion for textured mesh generation,” arXiv preprint arXiv:2303.05371, 2023.
[24]
Z. Erkoç, F. Ma, Q. Shan, M. Nießner, and A. Dai, “Hyperdiffusion: Generating implicit neural fields with weight-space diffusion,” arXiv preprint arXiv:2303.17015, 2023.
[25]
Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi-view diffusion for 3d generation,” arXiv preprint arXiv:2308.16512, 2023.
[26]
X. Long et al., “Wonder3D: Single image to 3D using cross-domain diffusion,” arXiv preprint arXiv:2310.15008, 2023.
[27]
L. Qiu et al., “Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d,” 2024, pp. 9914–9925.
[28]
“Blender - a 3D modelling and rendering package,” http://www.blender.org, 2024.
[29]
L. Downs et al., “Google scanned objects: A high-quality dataset of 3d scanned household items,” 2022.
[30]
A. Radford et al., “Learning transferable visual models from natural language supervision,” 2021.
[31]
A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” in CVPR, 2022, pp. 867–876.
[32]
J. Xu et al., “Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models,” 2023.
[33]
N. Mohammad Khalid, T. Xie, E. Belilovsky, and T. Popa, “CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,” in SIGGRAPH Asia Conference Papers, 2022, pp. 1–8.
[34]
C. Saharia et al., “Photorealistic text-to-image diffusion models with deep language understanding,” NeurIPS, 2022.
[35]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
[36]
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020.
[37]
Hugging Face, “One-2-3-45,” https://huggingface.co/spaces/One-2-3-45/One-2-3-45, 2023.
[38]
J. Tang et al., “Make-It-3D: High-fidelity 3D creation from a single image with diffusion prior,” arXiv preprint arXiv:2303.14184, 2023.
[39]
N. Müller, Y. Siddiqui, L. Porzi, S. R. Bulo, P. Kontschieder, and M. Nießner, “DiffRF: Rendering-guided 3D radiance field diffusion,” in CVPR, 2023.
[40]
B. Zhang, J. Tang, M. Niessner, and P. Wonka, “3DShape2VecSet: A 3D shape representation for neural fields and generative diffusion models,” in SIGGRAPH, 2023.
[41]
J. Ye, P. Wang, K. Li, Y. Shi, and H. Wang, “Consistent-1-to-3: Consistent image to 3D view synthesis via geometry-aware diffusion models,” arXiv preprint arXiv:2310.03020, 2023.
[42]
H. Weng et al., “Consistent123: Improve consistency for one image to 3D object synthesis,” arXiv preprint arXiv:2310.08092, 2023.
[43]
R. Shi et al., “Zero123++: A single image to consistent multi-view diffusion base model,” arXiv preprint arXiv:2310.15110, 2023.
[44]
H.-Y. Tseng, Q. Li, C. Kim, S. Alsisan, J.-B. Huang, and J. Kopf, “Consistent view synthesis with pose-guided diffusion models,” in CVPR, 2023.
[45]
M. Zhao et al., “EfficientDreamer: High-fidelity and robust 3D creation via orthogonal-view diffusion prior,” arXiv preprint arXiv:2308.13223, 2023.
[46]
S. Szymanowicz, C. Rupprecht, and A. Vedaldi, “Viewset diffusion:(0-) image-conditioned 3D generative models from 2D data,” arXiv preprint arXiv:2306.07881, 2023.
[47]
M. Deitke et al., “Objaverse: A universe of annotated 3d objects,” 2023.
[48]
H. G. Barrow and J. M. Tenenbaum, “Recovering intrinsic scene characteristics from images,” Computer Vision Systems, pp. 3–26, 1978.
[49]
E. Garces, C. Rodriguez-Pardo, D. Casas, and J. Lopez-Moreno, “A survey on intrinsic images: Delving deep into lambert and beyond,” IJCV, vol. 130, no. 3, pp. 836–868, 2022.
[50]
G. Patow and X. Pueyo, “A survey of inverse rendering problems,” Computer Graphics Forum, 2003.
[51]
Physically Based Rendering: From Theory to Implementation, 3rd ed. Morgan Kaufmann Publishers Inc., 2016.
[52]
E. H. Land and J. J. McCann, “Lightness and retinex theory,” Journal of the Optical Society of America, vol. 61, no. 1, pp. 1–11, 1971.
[53]
K. Guo et al., “The relightables: Volumetric performance capture of humans with realistic relighting,” TOG, vol. 38, no. 6.
[54]
G. Nam, J. H. Lee, D. Gutierrez, and M. H. Kim, “Practical SVBRDF acquisition of 3D objects with unstructured flash photography,” TOG, vol. 37, no. 6, pp. 267:1–12.
[55]
C. Careaga and Y. Aksoy, “Intrinsic image decomposition via ordinal shading,” TOG, vol. 43, no. 1, article 12, 2023.
[56]
R. Zhu, Z. Li, J. Matai, F. Porikli, and M. Chandraker, “IRISformer: Dense vision transformers for single-image inverse rendering in indoor scenes,” in CVPR, 2022.
[57]
X. Chen et al., “IntrinsicAnything: Learning diffusion priors for inverse rendering under unknown illumination,” in ECCV, 2024.
[58]
J. Zhu et al., “Learning-based inverse rendering of complex indoor scenes with differentiable Monte Carlo raytracing,” in SIGGRAPH Asia, 2022.
[59]
S. Bell, K. Bala, and N. Snavely, “Intrinsic images in the wild,” TOG, vol. 33, no. 4, pp. 159:1–159:12, 2014.
[60]
B. Kovacs, S. Bell, N. Snavely, and K. Bala, “Shading annotations in the wild,” in CVPR, 2017, pp. 850–859.
[61]
Z. Li, M. Shafiei, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker, “Inverse rendering for complex indoor scenes: Shape, spatially-varying lighting and SVBRDF from a single image,” in CVPR, 2020, pp. 2472–2481.
[62]
Z. Li et al., “OpenRooms: An open framework for photorealistic indoor scene datasets,” in CVPR, 2021, pp. 7186–7195.
[63]
Z. Kuang et al., “Stanford-ORB: A real-world 3D object inverse rendering benchmark,” in NIPS, 2023.
[64]
D. Lichy, J. Wu, S. Sengupta, and D. W. Jacobs, “Shape and material capture at home,” in CVPR, 2021.
[65]
J. Zhang et al., “NeILF++: Inter-reflectable light fields for geometry and material estimation,” in ICCV, 2023.
[66]
K. Zhang, F. Luan, Q. Wang, K. Bala, and N. Snavely, “PhySG: Inverse rendering with spherical Gaussians for physics-based material editing and relighting,” in CVPR, 2021.
[67]
L. Yariv et al., “Multiview neural surface reconstruction by disentangling geometry and appearance,” NIPS, vol. 33, 2020.
[68]
J. Hasselgren, N. Hofmann, and J. Munkberg, “Shape, light, and material decomposition from images using Monte Carlo rendering and denoising,” in NIPS, 2022.
[69]
M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision,” in CVPR, 2020, pp. 3504–3515.
[70]
P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” NIPS, 2021.
[71]
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020.
[72]
S. Saito, G. Schwartz, T. Simon, J. Li, and G. Nam, “Relightable Gaussian codec avatars,” in CVPR, 2024.
[73]
J. Gao et al., “Relightable 3D gaussian: Real-time point cloud relighting with BRDF decomposition and ray tracing,” ECCV, 2024.
[74]
J. Meder and B. Brüderlin, “Hemispherical Gaussians for accurate light integration,” in Computer Vision and Graphics, 2018, pp. 3–15.
[75]
J. Wang, P. Ren, M. Gong, J. Snyder, and B. Guo, “All-frequency rendering of dynamic, spatially-varying reflectance,” in SIGGRAPH Asia, 2009, pp. 1–10.
[76]
P. Kocsis, V. Sitzmann, and M. Niessner, “Intrinsic image diffusion for indoor single-view material estimation,” in CVPR, 2024.
[77]
Z. Li, M. Shafiei, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker, “Inverse rendering for complex indoor scenes: Shape, spatially-varying lighting and SVBRDF from a single image,” in CVPR, 2020.
[78]
H. Zhou, X. Yu, and D. Jacobs, “GLoSH: Global-local spherical harmonics for intrinsic image decomposition,” in ICCV, 2019.
[79]
T. Theoharis, G. Papaioannou, and E.-A. Karabassi, “The magic of the z-buffer: A survey,” in Proceedings of the 9th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, 2001, pp. 379–386.
[80]
A. Hadji-Kyriacou and O. Arandjelović, “Raymarching distance fields with CUDA,” Electronics, vol. 10, no. 22, 2021.
[81]
B. Burley, “Physically based shading at Disney,” in SIGGRAPH, 2012, pp. 1–7.
[82]
J. T. Kajiya, “The rendering equation,” in Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, 1986.
[83]
R. L. Cook and K. E. Torrance, “A reflectance model for computer graphics,” in SIGGRAPH, 1981, pp. 307–316.
[84]
P.-P. Sloan, J. Kautz, and J. Snyder, “Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments,” in SIGGRAPH, 2002, pp. 527–536.
[85]
J. Huang, Y. Zhou, M. Niessner, J. R. Shewchuk, and L. J. Guibas, “QuadriFlow: A scalable and robust method for quadrangulation,” Computer Graphics Forum, vol. 37, 2018, doi: 10.1111/cgf.13498.
[86]
M. Liu et al., “One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion,” arXiv preprint arXiv:2311.07885, 2023.
[87]
J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, and Y. Shan, “InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models,” arXiv preprint arXiv:2404.07191, 2024.
[88]
M. Boss, Z. Huang, A. Vasishta, and V. Jampani, “SF3D: Stable fast 3D mesh reconstruction with UV-unwrapping and illumination disentanglement,” arXiv preprint, 2024.
[89]
Z. Chen et al., “3DTopia-XL: High-quality 3D PBR asset generation via primitive diffusion,” arXiv preprint arXiv:2409.12957, 2024.
[90]
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” TIP, 2004.

  1. \(m, s\) are metallic and specular values at point \(\mathbf{x}\), predicted by a materials MLP \(f_c(\theta',\mathbf{x})=[\mathbf{a}, r, m, s]\in [0, 1]^6\).↩︎