FashionEngine: Interactive Generation and Editing of 3D Clothed Humans


Abstract

We present FashionEngine, an interactive 3D human generation and editing system that allows users to design 3D digital humans in a way that aligns with how humans interact with the world, such as natural language, visual perception, and hand-drawing. FashionEngine automates 3D human production with three key components: 1) a pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, which provides strong priors for diverse generation and editing tasks; 2) a Multimodality-UV Space encoding the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, which faithfully aligns the user's multimodal inputs with the implicit UV latent space for controllable 3D human editing. The Multimodality-UV space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks; 3) a Multimodality-UV Aligned Sampler that learns to sample high-quality and diverse 3D humans from the diffusion prior given multimodal user inputs. Extensive experiments validate FashionEngine’s state-of-the-art performance on conditional generation and editing tasks. In addition, we present an interactive user interface for FashionEngine that enables both conditional and unconditional generation, as well as editing tasks including pose/view/shape control, text-, image-, and sketch-driven 3D human editing, and 3D virtual try-on, in a unified framework. Our project page is at: https://taohuumd.github.io/projects/FashionEngine.

[Teaser figure]


1 Introduction↩︎

With the development of the gaming, virtual reality, and film industries, there is an increasing demand for high-quality 3D content, especially 3D avatars. Traditionally, producing a 3D avatar requires days of work from highly skilled 3D content creators, which is both time-consuming and expensive. There have been attempts to automate the avatar generation pipeline [1], [2]. However, these methods usually lack control over the generation process, making them difficult to use in practice, as shown in Tab. 1. To reduce the friction of using learning-based avatar generation algorithms, we propose FashionEngine, an interactive system that enables the generation and editing of high-quality, photo-realistic 3D humans. The process is controlled by multiple modalities, e.g., texts, images, and hand-drawn sketches, making the system easy to use even for novice users.

Table 1: Comparison of recent generation and editing approaches in terms of supported input modalities (unconditional, text, image, sketch) and 3D awareness.
Methods Uncond. Text Image Sketch 3D-aware
EG3D [3]
StyleSDF [4]
EVA3D [1]
AG3D [2]
StructLDM [5]
DragGAN [6]
InstructP2P [7]
Text2Human [8]
Text2Performer [9]
FashionEngine (Ours)

FashionEngine automates 3D human production in three steps. As shown in Fig. [fig:teaser], in the first step, a candidate 3D human is generated either randomly or conditionally from text descriptions or hand-drawn sketches. In the second step, users can edit the appearance of the generated human interactively with text, reference images, or simple sketches. In the last step, final adjustments to pose and shape can be made before the human is rendered into images or videos. A key challenge for human editing lies in aligning the user inputs with the human representation space. FashionEngine enables controllable generation and editing with three components.

Firstly, FashionEngine utilizes the 3D human priors learned in a pre-trained model [5] that models humans in a semantic UV latent space. The UV latent space preserves the articulated structure of the human body and enables detailed appearance capture and editing. A 3D-aware auto-decoder embeds the properties learned from the 2D training data and decodes the latents into 3D humans under different poses, viewpoints, and clothing styles. Furthermore, a 3D diffusion model [5] learns this latent space for generative human sampling, which serves as a strong prior for different editing tasks.

Secondly, we construct a Multimodality-UV Space from the learned prior, which encodes the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV space. The multimodal user inputs (e.g., texts, images, and sketches) are faithfully aligned with the implicit UV latent space for controllable 3D human editing. Notably, the multimodality-UV space is shared across different user inputs, which enables various joint multimodal editing tasks.

Thirdly, we propose Multimodality-UV Aligned Samplers conditioned on the user inputs to sample high-quality 3D humans from the diffusion prior. Specifically, a Text-UV Aligned Sampler and a Sketch-UV Aligned Sampler are proposed for text-driven and sketch-driven generation or editing tasks, respectively. A key component is to search the Multimodality-UV Space to sample desired latents that are well aligned with the user inputs for controllable generation and editing.

Quantitative and qualitative experiments are performed on different generation and editing tasks, including conditional generation and text-, image-, and sketch-driven 3D human editing. The experimental results illustrate the versatility and scalability of FashionEngine. In addition, our system renders \(512^2\) resolution images at about 9.2 FPS on an NVIDIA V100 GPU, which enables interactive editing. In summary, our contributions are as follows.

1) We propose an interactive 3D human generation and editing system, FashionEngine, which enables easy and fast production of high-quality 3D humans.

2) Making use of the pre-trained 3D human prior, we propose a 3D human editing framework in the UV latent space, which effectively unifies controlling signals from multiple modalities, i.e., texts, images, and sketches, for joint multimodal editing.

3) Extensive experiments show the advantages of our UV-based editing system and the state-of-the-art performance of FashionEngine in 3D human editing.

2 Related Work↩︎

Human Image Generation and Editing. Generative adversarial networks (GANs) [10]–[13] have achieved great success in single-category generation, e.g., human faces. However, generating whole bodies with diverse clothing and poses is more challenging for GANs [8], [14]–[16]. Carefully scaling up the dataset size improves generation fidelity [17], [18]. The disentanglement of the GAN latent space provides opportunities for image editing [19]–[22]. DragGAN [23] presents an intuitive interface for easy image editing. The recent success of diffusion models in general image generation has also motivated researchers to apply them to human generation [24], [25]; these models can be conditioned on human poses, reference images, or texts to control the generation process.

3D Human Generation and Editing. With the advancement of volume rendering [26], 3D-aware generative networks can be trained from 2D image collections [3], [27]–[30]. Motivated by this line of work, 3D-aware human GANs have been studied. StylePeople [31] uses convolutional networks to achieve photorealistic generation results, but lacks multi-view consistency. ENARF-GAN [32] is the first to use neural radiance fields with adversarial training to achieve better multi-view consistency. EVA3D [1] decomposes the human into submodules for better representation efficiency. AG3D [2] focuses more on human faces by adding a dedicated face discriminator, and uses a normal discriminator to achieve better geometry. These methods use 1D latents, which makes disentanglement and editing challenging. Beyond data-driven methods, 2D priors [33], [34] can also be used to enable text-to-3D human generation [35]–[37]. Most recently, PrimDiffusion [38] uses primitives to represent 3D humans, upon which a diffusion model is trained for unconditional sampling; simple editing tasks, e.g., texture transfer, are shown. StructLDM [5] models humans in a boundary-free UV latent space [39], in contrast to the traditional UV space used in [38], [40], [41], and learns a latent diffusion model for unconditional generation. In this work, making use of the UV-based latent space and the diffusion prior [5], we propose conditional 3D human editing from multiple modalities.

3D Diffusion Models. Diffusion models have proven able to capture complex distributions. Directly learning diffusion models on 3D representations has been explored in recent years. Diffusion models on point clouds [42]–[44], voxels [45], [46], and implicit models [47] are used for coarse shape generation. As a compact 3D representation, the tri-plane [3] can be used with 2D network backbones for efficient 3D generation [48]–[51].

3 FashionEngine↩︎

FashionEngine learns to dress humans interactively and controllably, and it works in three stages. 1) In the first stage, a 3D human appearance prior \(\mathcal{Z}\) is learned from the training dataset by a latent diffusion model [5] (Sec. 3.1). 2) With the prior \(\mathcal{Z}\), users can generate a textured, clothed 3D human by randomly sampling \(\mathcal{Z}\), or by uploading texts describing the human appearance or a sketch describing the clothing mask for controllable generation (Sec. 3.3). 3) Users can optionally edit the generated humans by providing the desired appearance styles in the form of texts, sketches, or reference images (Sec. 3.4).

Figure 1: 3D human prior learning [5] in two stages (S1 and S2). S1 learns an auto-decoder containing a set of structured embeddings \({Z}\) corresponding to the human subjects in the training dataset. The embeddings \({Z}\) are then employed to train a latent diffusion model in the semantic UV latent space in the second stage.

3.1 3D Human Prior Learning↩︎

FashionEngine is built upon StructLDM [5], which models humans in a structured, semantic UV-aligned space and learns 3D human priors in two stages. In the first stage, from a training dataset containing images of various human subjects with estimated SMPL and camera parameters, an auto-decoder is learned with a set of structured embeddings \({Z}\) corresponding to the training subjects. In the second stage, a latent diffusion model is learned from the embeddings \({Z}\) in the semantic UV latent space, which provides a strong prior for diverse and realistic human generation. The pipeline of prior learning is depicted in Fig. 1; we refer readers to StructLDM [5] for more details. We utilize the priors learned in StructLDM for the following editing tasks.

Figure 2: Multimodality-UV Space (Sec. 3.2). Based on the learned prior \(\mathcal{Z}\), we construct a Multimodality-UV space comprising an Appearance-Canonical Space (App-Can, \(\mathbb{A}^{can}\)), an Appearance-UV Space (App-UV, \(\mathbb{A}^{uv}\)), a textual Semantics-UV Space (Sem-UV, \(\mathbb{T}^{uv}\)), and a Shape-UV Space (\(\mathbb{S}^{uv}\)).

Figure 3: Pipeline of multimodal generation (Sec. 3.3). (a) Text- and sketch-driven generation: given text input \(\mathbf{I}_T\) or sketch input \(\mathbf{I}^{uv}_S\) in the template UV space, we present Text-UV Aligned Samplers and Sketch-UV Aligned Samplers to sample latents (\(z^{*}_{T}\) and \(z^{*}_{S}\)) from the learned human prior \(\mathcal{Z}\) (Sec. 3.1) respectively, which can be rendered into images by latent diffusion and rendering (Diff-Render) [5]. (b) Illustration of TextMatch and ShapeMatch: \(\{z_k, z_i\}\) \(\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{1}}}\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{2}}}\) and \(\{z_j, z_i\}\) \(\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{3}}}\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{2}}}\) are taken as the best matches to construct the target latents (\(z^{*}_{T}\) and \(z^{*}_{S}\)) for the text or sketch input, based on the TextMatch and ShapeMatch algorithms respectively.

3.2 Multimodality-UV Space↩︎

With the human prior \(\mathcal{Z}\) learned in the template UV space, we construct a Multimodality-UV space that is aligned with the latent space for multimodal generation.

We sample a set of base latents \(\mathbb{Z} \sim \mathcal{Z}\) from the learned prior and construct a Multimodality-UV space by rendering and annotating the latent space. Each latent \(z_i \in \mathbb{Z}\) can be rendered by the renderer \(\mathcal{R}\) into an image \(A^{can}_i\) in a canonical space under fixed pose, shape, and camera parameters. The rendered image \(A^{can}_i\) is warped by \(\mathcal{W}\) to UV space as \(A^{uv}_i\) using the UV correspondences (the UV coordinate map in posed space). Note that although \(A^{uv}_i\) is warped from only a single partial view, it preserves the clothing topology attributes in a user-readable way, such as sleeve length, neckline shape, and the length of lower clothing. In addition, each latent \(z_i\) is annotated with detailed text descriptions as shown in Fig. 2. We also render a segmentation map for each latent with [52], which is warped to UV space as \({Seg}^{uv}_i\) and preserves the clothing shape attributes such as sleeve length, neckline shape, and the length of lower clothing.

We render all the latents in \(\mathbb{Z}\) to construct a multimodal space comprising an Appearance-Canonical Space (App-Can, \(\mathbb{A}^{can}\)), an Appearance-UV Space (App-UV, \(\mathbb{A}^{uv}\)), a textual Semantics-UV Space (Sem-UV, \(\mathbb{T}^{uv}\)), and a Shape-UV Space (\(\mathbb{S}^{uv}\)), as shown in Fig. 2.
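This construction amounts to a one-off preprocessing pass over the sampled base latents. The following is a minimal sketch of the idea, where the callables render_canonical (renderer \(\mathcal{R}\) under fixed pose/shape/camera), warp_to_uv (warp \(\mathcal{W}\)), segment (the human parser [52]), and annotate_text are hypothetical placeholders for the components described above, not the released implementation.

```python
def build_multimodality_uv_space(base_latents, render_canonical, warp_to_uv,
                                 segment, annotate_text):
    """Assemble App-Can, App-UV, Sem-UV, and Shape-UV entries for a set of base latents.

    The callables are placeholders for the renderer R, the UV warp W, the human
    parser, and the per-part text annotation described in Sec. 3.2.
    """
    space = []
    for z in base_latents:
        a_can = render_canonical(z)          # Appearance-Canonical entry A^can_i
        a_uv = warp_to_uv(a_can)             # Appearance-UV entry A^uv_i
        seg_uv = warp_to_uv(segment(a_can))  # Shape-UV entry Seg^uv_i (clothing topology)
        sem_uv = annotate_text(z)            # Sem-UV entry: {body part: text description}
        space.append({"z": z, "A_can": a_can, "A_uv": a_uv,
                      "Seg_uv": seg_uv, "Sem_uv": sem_uv})
    return space
```

In this sketch, the pass runs once over the base latents; the samplers of Sec. 3.3 then operate purely on the stored records.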

3.3 Controllable Multimodal Generation↩︎

FashionEngine generates human images given text input \(\mathbf{I}_T\) or sketch input \(\mathbf{I}^{uv}_S\) in the template UV space. We present UV-aligned samplers that sample the Multimodality-UV space for controllable multimodal generation (Secs. 3.3.1 and 3.3.2).

3.3.1 Text-Driven Generation.↩︎

Users can provide input text \(\mathbf{I}_T\) describing the appearance of the desired human, including the hairstyle and six clothing properties: clothing color, texture pattern, neckline shape, sleeve length, length of lower clothing, and type of shoes, as shown in Fig. 3. These properties semantically correspond to six parts \(P\) = {head, neck, body, arm, leg, foot} in the UV space. A Text Parser \(\mathcal{P}_T\) is proposed to align the input text with the semantic body parts in UV space, which yields part-aware Text-UV aligned semantics \(\mathbf{I}^{uv}_T = \mathcal{P}_T(\mathbf{I}_T)\).
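Because the text inputs follow a unified template (Sec. 4.1), the Text Parser can be realized with simple keyword matching. Below is a minimal sketch under that assumption; the keyword table and phrases are illustrative placeholders, not the exact vocabulary used in our system.

```python
# Illustrative keyword table: clothing attribute -> (UV body part, example keywords).
ATTRIBUTE_TO_PART = {
    "hairstyle":     ("head", ["long hair", "short hair", "bald"]),
    "neckline":      ("neck", ["v-neck", "round neckline", "square neckline"]),
    "color/pattern": ("body", ["red", "blue", "floral", "striped", "plaid"]),
    "sleeve length": ("arm",  ["long sleeve", "short sleeve", "sleeveless"]),
    "lower length":  ("leg",  ["long dress", "knee-length", "short skirt"]),
    "shoes":         ("foot", ["high heels", "sneakers", "boots"]),
}

def parse_text(text):
    """Map a templated description to part-aware Text-UV aligned semantics I_T^uv."""
    text = text.lower()
    semantics = {}  # body part -> matched attribute phrase
    for _, (part, keywords) in ATTRIBUTE_TO_PART.items():
        for kw in keywords:
            if kw in text:
                semantics[part] = kw
                break
    return semantics

# Example: parse_text("long hair, red dress with long sleeves and high heels")
# -> {"head": "long hair", "body": "red", "arm": "long sleeve", "foot": "high heels"}
```

As noted in Sec. 4.1, keyword matching with a unified template is used instead of a learned language model because of the dataset scale.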

Text-UV Aligned Sampler. We further propose a Text-UV Aligned Sampler \(\mathcal{T}\) to sample a latent \(z^{*}_{T}\) from the learned prior \(\mathcal{Z}\) conditioned on the input text, \(z^{*}_{T} = \mathcal{T}(\mathbf{I}^{uv}_T; \mathcal{Z})\). The sampler works in two stages. In the first stage, for each body part \(p \in P\), we independently search for latents in the Sem-UV Space \(\mathbb{T}^{uv}\) whose annotations semantically match the input text descriptions \(\mathbf{I}^{uv}_T\) through SemMatch: \[\label{eq:sem_match} Z_p = \underset{z_i \in \mathbb{Z}, t^{uv}_i \in \mathbb{T}^{uv}}{\arg\max}\; SemMatch(t^{uv}_i[p], \mathbf{I}^{uv}_T[p])\tag{1}\] where \(\mathbf{I}^{uv}_T[p]\) denotes the textual semantics of part \(p\). In practice, we consider the top-\(k\) matches, and Eq. 1 yields a set of candidate latents for each body part \(p\), \(Z_p = \{z_0, ..., z_{k-1}\}\).

In the second stage, we find the best-matching latent \(z^*_p\) for each part based on the appearance similarity score given by AppMatch: \[\label{eq:m_gen_app_match} \max_{z^*_p \in Z_p, A^{uv}_p \subset \mathbb{A}^{uv}} AppMatch(\{A^{uv}_p[M_{body}] \,|\, p \in P\})\tag{2}\] where \(A^{uv}_p = \{A^{uv}_i \,|\, z_i \in Z_p\}\), and \(M_{body}\) denotes the body mask in UV space (Fig. 2 (b)). We compute the multichannel SSIM [53] score in AppMatch.

With \(\{z^*_p \in Z_p\}\), an optimal latent \(z^{*}_{T}\) is constructed conditioned on input text \({\mathbf{I}_T}\): \[\label{eq:m_gen_construct} z^{*}_{T} = \sum_{ p \in P} z^*_{p} * M_p\tag{3}\]

where \(M_p\) is the mask of body part \(p\).

To explain this process, let us take Fig. 3 as an example and, for clarity, consider only two attributes: hairstyle and sleeve length. For the input text {"long hair", "long sleeve"}, we obtain \(\{ Z_{head}=\{z_k\}, Z_{arm}=\{z_i, z_m\} \}\), where \(\{z_k\}\) matches "long hair" and \(\{z_i, z_m\}\) match "long sleeve", as shown in Fig. 3 (b). The group \(\{z^*_{head}=z_k, z^*_{arm} = z_i\}\) is ranked as optimal since \(AppMatch(\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{1}}}, \raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{2}}})\) \(>\) \(AppMatch(\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{1}}}, \raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{4}}})\) in terms of SSIM score. Note that instead of computing the similarity in image space, we warp the patches (Fig. 3) to a compact UV space (Fig. 2) for efficiency, which also eliminates the effect of the tightness of different clothing types in the evaluation. We finally obtain \(z^{*}_{T} = z_k * M_{head} + z_i * M_{arm}\), which is rendered into images by Diff-Render, \(\mathbf{Y}_T = \mathcal{R} \circ \mathcal{D} (z^{*}_{T})\).
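The two-stage sampling above can be summarized as a search over the annotated base latents. The sketch below is an illustration of Eqs. 1-3 under simplifying assumptions: the Multimodality-UV space is the list of records built in Sec. 3.2, SemMatch is reduced to a substring test, and scikit-image's SSIM stands in for the multichannel SSIM score used by AppMatch; it is not the released implementation.

```python
import itertools
import numpy as np
from skimage.metrics import structural_similarity as ssim

def sem_match(entry_sem, query_sem, part):
    """Eq. 1 (simplified): True if the annotation of `part` contains the queried phrase."""
    return query_sem[part] in entry_sem.get(part, "")

def app_match(a_uv_1, a_uv_2, body_mask):
    """Eq. 2 (simplified): SSIM between two UV appearance maps on the body region.
    Expects float images in [0, 1]."""
    m = body_mask.astype(np.float64)[..., None]
    return ssim(a_uv_1 * m, a_uv_2 * m, channel_axis=-1, data_range=1.0)

def text_uv_aligned_sampler(space, query_sem, part_masks, body_mask, top_k=3):
    """Sample a target latent z*_T for part-aware text semantics `query_sem`."""
    # Stage 1 (SemMatch): top-k candidate latents per queried body part.
    candidates = {}
    for part in query_sem:
        matched = [e for e in space if sem_match(e["Sem_uv"], query_sem, part)]
        candidates[part] = (matched or space)[:top_k]   # fall back if nothing matches

    # Stage 2 (AppMatch): keep the combination whose UV appearances agree best.
    parts = list(candidates)
    best_combo, best_score = None, -np.inf
    for combo in itertools.product(*(candidates[p] for p in parts)):
        pairs = list(itertools.combinations(combo, 2))
        score = np.mean([app_match(a["A_uv"], b["A_uv"], body_mask)
                         for a, b in pairs]) if pairs else 1.0
        if score > best_score:
            best_combo, best_score = combo, score

    # Eq. 3: compose the target latent from the winning parts via UV part masks.
    return sum(entry["z"] * part_masks[p] for p, entry in zip(parts, best_combo))
```

A smaller top_k trades diversity for speed; in the user interface, the matched candidates are also surfaced to the user (‘Matching Style’ in Fig. 5), so the automatic selection can be overridden interactively.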

3.3.2 Sketch-Driven Generation.↩︎

Users can also design a human simply by sketching the dress shape, including the neckline shape, sleeve length, and length of lower clothing, in the template UV space, \(\mathbf{I}^{uv}_S\), as shown in Fig. 3 (a). A Sketch Parser is employed to translate the raw sketch into a clothing mask \(M^{uv}_S\).

Sketch-UV Aligned Sampler. We further propose a Sketch-UV Aligned Sampler \(\mathcal{S}\) to sample a latent \(z^{*}_{S} = \mathcal{S}(M^{uv}_S; \mathcal{Z})\) from the learned prior \(\mathcal{Z}\) conditioned on the sketch mask \(M^{uv}_S\), again in two stages. In the first stage, we search for latents in the Shape-UV Space \(\mathbb{S}^{uv}\) that match the input sketch mask \(M^{uv}_S\) for body part \(p\) using ShapeMatch: \[\label{eq:sketch_match} Z_p = \underset{z_i \in \mathbb{Z}, s^{uv}_i \in \mathbb{S}^{uv}}{\arg\min}\; ShapeMatch(s^{uv}_i[p], M^{uv}_S[p])\tag{4}\] where \(M^{uv}_S[p]\) denotes the pixels of part \(p\). ShapeMatch evaluates the shape similarity between two binary masks based on their Hu moments [54]; a lower value indicates a better match. We consider the top-\(k\) matches, and Eq. 4 yields a set of candidate latents for each body part \(p\). We then employ AppMatch (Eq. 2) to find the optimal latents \(\{z^*_p \,|\, p \in P\}\), and the target latent \(z^*_S\) is constructed via Eq. 3.

We take Fig. 3 (a) as an example to explain the process and, for clarity, consider only two attributes: neckline shape and sleeve length. For the input clothing mask \(M^{uv}_S\), which is parsed as a dress with medium-length lower clothing, cut-off sleeves, and a V-shape neckline, we obtain the matched latents \(\{Z_{neck}=\{z_i, z_m\}, Z_{arm}=\{z_j\}\}\), as shown in Fig. 3 (b).

The group \(\{z^*_{neck}=z_i, z^*_{arm} = z_j\}\) is ranked as optimal since \(AppMatch(\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{3}}}, \raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{2}}})\) \(>\) \(AppMatch(\raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{3}}}, \raisebox{.5pt}{\textcircled{\raisebox{-0.9pt}{4}}})\) in terms of SSIM score. We finally obtain \(z^{*}_S = z_i * M_{neck} + z_j * M_{arm}\), which is rendered into images by Diff-Render, \(\mathbf{Y}_S = \mathcal{R} \circ \mathcal{D} (z^{*}_S)\).
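ShapeMatch compares two binary clothing masks through their Hu moment invariants [54], where a lower distance means a closer shape. A minimal sketch is given below, assuming OpenCV's Hu-moment-based shape comparison as a stand-in for the exact distance used in our system; PART_LABELS is a hypothetical mapping from body parts to segmentation label ids.

```python
import cv2
import numpy as np

# Hypothetical mapping from UV body part to segmentation label id.
PART_LABELS = {"head": 1, "neck": 2, "body": 3, "arm": 4, "leg": 5, "foot": 6}

def shape_match(seg_uv_part, sketch_mask_part):
    """Eq. 4 (simplified): Hu-moment distance between two binary UV masks (lower is better)."""
    a = (seg_uv_part > 0).astype(np.uint8)
    b = (sketch_mask_part > 0).astype(np.uint8)
    # matchShapes compares the (log-scaled) Hu moment invariants of the two shapes.
    return cv2.matchShapes(a, b, cv2.CONTOURS_MATCH_I1, 0.0)

def sketch_candidates(space, sketch_mask_part, part, top_k=3):
    """Stage 1 of the Sketch-UV Aligned Sampler: top-k latents whose warped segmentation
    for `part` best matches the sketched mask; AppMatch (Eq. 2) then ranks combinations
    across parts exactly as in the text-driven case."""
    ranked = sorted(space, key=lambda e: shape_match(e["Seg_uv"] == PART_LABELS[part],
                                                     sketch_mask_part))
    return ranked[:top_k]
```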

Figure 4: Text-, Sketch-, and Image-Driven Editing (Sec. 3.4). To edit a source human with latent \(z\), FashionEngine allows users to type texts \(\mathbf{I}_T\), draw sketches \(S^{img}\), or provide a reference image with sketch masks for style transfer, and target latents are constructed corresponding to the user inputs. Note that the sketch can either describe the length of sleeves in two different ways or describe the geometry.

3.4 Controllable Multimodal Editing↩︎

FashionEngine also allows users to edit the generated humans by providing texts, sketches, or reference images describing the desired clothing appearance, as shown in Fig. 4.

3.4.1 Text-Driven Editing.↩︎

For a generated human with source latent \(z\), given input text \(\mathbf{I}_T\) describing the editing commands, the Text Parser \(\mathcal{P}_T\) is employed to parse the input in UV space, yielding \(\mathbf{I}^{uv}_T\) and a mask \(M^{uv}_T\) indicating which body part to edit. The Text-UV Aligned Sampler \(\mathcal{T}\) searches the Sem-UV Space \(\mathbb{T}^{uv}\) for an optimal latent \(z^*_T\) that matches the input editing text by SemMatch (Eq. 1) with a high appearance similarity by AppMatch (Eq. 2): \(z^{*}_T = \mathcal{T}(\mathbf{I}^{uv}_T, z; \mathcal{Z})\). Note that a preprocessing step renders \(z\) to the canonical space as introduced in Sec. 3.2, yielding a warped image \(A^{uv}\) in UV space for appearance matching in AppMatch. Finally, the updated latent is computed as \(z'_T = z^*_T * M^{uv}_T + (1 - M^{uv}_T) * z\), which can be rendered by Diff-Render.

Fig. 4 shows the qualitative results, which suggest that given text descriptions, FashionEngine is capable of synthesizing view-consistent appearance editing results for sleeve (1), neckline (2), and shoes (3).

3.4.2 Sketch-Driven Editing.↩︎

Users can also edit the clothing style simply by sketching. For a generated human image with source latent \(z\), the user sketches in image space \(S^{img}\), e.g., to extend the sleeves, change the neckline to a V-shape, or extend the length of the lower clothing, as shown in Fig. 4. We first transform the sketch from image space to UV space by \(\mathcal{W}\), and a Sketch Parser \(\mathcal{P}_S\) is employed to synthesize clothing masks \(M^{uv}_S\) for editing. Conditioned on the source latent \(z\) and the mask, the Sketch-UV Aligned Sampler \(\mathcal{S}\) searches the Shape-UV Space \(\mathbb{S}^{uv}\) for an optimal latent \(z^*_S\) that matches the sketch mask by ShapeMatch (Eq. 4) with a high appearance similarity by AppMatch (Eq. 2): \(z^{*}_S = \mathcal{S}(M^{uv}_S, z; \mathcal{Z})\). Similar to text-driven editing, we render a canonical image of \(z\) for AppMatch. Finally, the updated latent is computed as \(z'_S = z^*_S * M^{uv}_S + (1 - M^{uv}_S) * z\) and rendered by Diff-Render. Fig. 4 shows that our approach synthesizes consistent clothing well aligned with the input sketch, including a V-shape neckline, short left and right sleeves, and a longer dress.

3.4.3 Image-Driven Editing.↩︎

A picture is worth a thousand words; text sometimes struggles to describe a specific clothing style, whereas images provide concrete references. Given a reference image of a dressed human, FashionEngine allows users to transfer any part of the clothing style from the reference image to the target human, as shown in Fig. 4. More specifically, given a reference image, users can sketch to mark which parts of the clothing style will be transferred, e.g., \(S^{img}\) in Fig. 4.

We transform the sketch from image space to UV space by \(\mathcal{W}\) and utilize a Sketch Parser \(\mathcal{P}_S\) to synthesize clothing masks \(M^{uv}_I\) for editing. Given the reference latent \(z_R\), the source latent \(z\) is updated to \(z'_I = z_R * M^{uv}_I + z * (1 - M^{uv}_I)\), which is rendered into images by Diff-Render. Fig. 4 shows that the selected styles are faithfully transferred to the source human, including the style of the right arm, the hemline structure, the neckline, and the shoes.
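All three editing modes above ultimately apply the same UV-masked latent blend: the sampled (or reference) latent overwrites the source latent only inside the edit mask. A minimal sketch, assuming latents are stored as channel-first UV feature maps and the mask is a single-channel UV map in [0, 1]:

```python
def blend_latents(z_src, z_edit, edit_mask_uv):
    """z' = z_edit * M + (1 - M) * z_src in the structured UV latent space.

    `z_src`, `z_edit`: (C, H, W) UV latent maps; `edit_mask_uv`: (H, W) mask in [0, 1].
    The same blend is used for text-driven (M^uv_T), sketch-driven (M^uv_S), and
    image-driven (M^uv_I) editing before decoding and rendering with Diff-Render.
    """
    m = edit_mask_uv[None]  # broadcast the mask over latent channels
    return z_edit * m + (1.0 - m) * z_src
```

For image-driven editing, z_edit is simply the reference latent \(z_R\); for text- and sketch-driven editing, it is the latent returned by the corresponding UV-aligned sampler.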

Figure 5: FashionEngine User Interface (Sec. 3.5). Users can generate humans by unconditional (‘Random Style’) or conditional generation (e.g., ‘Text to Human’). Users are also allowed to edit the generated humans by sketch-based (‘Sketch to Human’), image-based (‘Style Transfer’), and text-based (‘Text to Human’) editing. For a conditional sketch or text input (e.g., the text ‘search v-neck neckline’), ‘Matching Style’ searches candidate styles (e.g., ‘v-neck neckline’) that match the input, and users can select desired styles for more flexible and generalizable style editing. Users can also check the generated humans under different poses by selecting a specific pose in the ‘Pose Corpus’, and under different viewpoints or shapes by adjusting the ‘View’ or ‘Shape’ slider. The generated humans are also animatable. See the live demo for more details.

3.5 Interactive User Interface↩︎

We present an interactive user interface for FashionEngine, as shown in Fig. 5. FashionEngine provides users with unconditional and conditional human generation, and three editing modes: sketch-, image-, and text-based editing. Users can inspect the generated humans under different poses/viewpoints/shapes, which is achieved by changing the camera or human template parameters (e.g., SMPL). The generated humans are also animatable. See the live demo for more details.

4 Experiments↩︎

4.1 Experimental Setup↩︎

Datasets and Metrics. We perform experiments on the monocular video dataset UBCFashion [55], which contains 500 monocular human videos with natural fashion motions. For editing tasks, we conduct a perceptual user study and report how often the images generated by our method are preferred over other methods in terms of visual quality, consistency with the input, and identity preservation.

Due to the scale of the dataset, instead of learning a language model to encode the text inputs, we use keyword matching with a unified template as in [9]. We use [52] to segment humans from images.

4.2 Comparisons to State-of-the-Art Methods↩︎

Figure 6: User study on conditional human image editing, quantitatively showing the superiority of FashionEngine over the text-driven baseline InstructPix2Pix [7] (a) and the sketch-driven baseline DragGAN [6] (b) in three aspects: 1) visual quality, 2) consistency with the input, and 3) identity preservation.


Text-Driven Editing. We compare against InstructPix2Pix [7] for text-driven human image editing, as shown in Fig. 8 (a). We generate high-quality images, and the editing results are well aligned with the text inputs. In addition, we faithfully preserve the identity information, whereas InstructPix2Pix does not. The advantages over InstructPix2Pix are further confirmed by the user study in Fig. 6: 25 participants were asked to select the images with better visual quality, better consistency with the text inputs, and better preservation of identity. More than 90% of the images edited by our method were considered more realistic than those of InstructPix2Pix.

Sketch-Driven Editing. We also compare against DragGAN [23] for sketch-driven human image editing, as shown in Fig. 8 (b). Our method supports local editing and faithfully preserves the identity information, whereas DragGAN edits humans in a global latent space, so local edits generally affect the full-body appearance and the identity is not well preserved. The user study in Fig. 6 (b) further confirms this: 25 participants were asked to select the images with better visual quality, better consistency with the sketch inputs, and better preservation of identity. More than 85% of participants prefer our results in terms of visual quality, and about 80% prefer our method for consistency with the sketch input and identity preservation.

4.3 Ablation Study↩︎

Table 2: Ablation study of the size of Receptive Field (RF) in global style mixer.
LPIPS \(\downarrow\) FID \(\downarrow\) PSNR \(\uparrow\)
\(RF = 2\) .067 14.880 23.575
\(RF = 4\) .060 12.077 23.803

Figure 7: Ablation study of AppMatch (a) and the size of Receptive Field (b).

4.3.1 Receptive Field (RF) in Global Style Mixer↩︎

We utilize the architecture of StructLDM [5], where a global style mixer is employed to learn the full-body appearance style. We evaluate how the size of the RF affects the reconstruction quality by comparing reconstruction performance with RF=2 and RF=4 on about 4,000 images in the auto-decoding stage. The qualitative results in Fig. 7 (b) suggest that a larger receptive field (RF=4) captures the global clothing styles better than RF=2 and also recovers more details, such as the high heels. In addition, RF=4 successfully reconstructs the hemline and the between-leg offsets of the dress, while the smaller RF fails. This conclusion is further confirmed by the quantitative comparisons in Tab. 2, where the larger RF achieves better results on LPIPS [56], FID [57], and PSNR. Note that since the RF is also related to super-resolution, RF=2 upsamples the original 2D feature maps from \(256^2\) to \(512^2\) image resolution, while RF=4 upsamples the features from \(128^2\) to \(512^2\).

4.3.2 Appearance Match↩︎

As presented in Sec. 3.3.1, an Appearance Match (AppMatch) technique is required to sample desired latents for both the Text-UV Aligned Sampler and the Sketch-UV Aligned Sampler. For the cases in Fig. 3, we provide more intermediate details to analyze the effects of AppMatch, as shown in Fig. 7 (a). For both the text-driven and sketch-driven generation tasks in Fig. 3, we obtain candidate latents by TextMatch or ShapeMatch, such as Cand. 1, Cand. 2, and Cand. 3, which all meet the requirements of "long sleeves" and "V-shape neckline". With AppMatch, the sleeves of Cand. 1 are transferred to source identity 1 (Src1) since AppMatch(Src1, Cand. 1) \(>\) AppMatch(Src1, Cand. 2), which yields higher-quality generation than using Cand. 2 or 3. Similarly, AppMatch also improves the results for sketch-driven generation tasks, such as the generation of the V-shape neckline.

Figure 8: Qualitative comparisons with InstructPix2Pix [7] and DragGAN [23] for text-driven and sketch-driven editing. We generate high-quality images with faithful identity preservation, and the editing results are well-aligned with the text/sketch inputs.

5 Discussion↩︎

We propose FashionEngine, an interactive 3D human generation and editing system that allows easy and fast production of 3D digital humans. Based on a pre-trained 3D human prior, we propose a 3D human editing framework that unifies controlling signals from multiple modalities for joint multimodal editing. We show the advantages of our UV-based editing system on 3D human editing tasks.

Limitations. Our synthesized humans are biased toward females wearing dresses due to dataset bias. In the future, more diverse data can be included in training to alleviate this bias.

Acknowledgement↩︎

This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOE-T2EP20221-0012), NTU NAP, and the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).

References↩︎

[1]
Fangzhou Hong, Zhaoxi Chen, Yushi Lan, Liang Pan, and Ziwei Liu. 2022. . ArXiv abs/2210.04888 (2022). https://api.semanticscholar.org/CorpusID:252780848.
[2]
Zijian Dong, Xu Chen, Jinlong Yang, Michael J. Black, Otmar Hilliges, and Andreas Geiger. 2023. . ArXiv abs/2305.02312 (2023). https://api.semanticscholar.org/CorpusID:258461509.
[3]
Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. 2022. . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16123–16133.
[4]
Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman.2022. . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13503–13513.
[5]
Tao Hu, Fangzhou Hong, and Ziwei Liu.2024. StructLDM: Structured Latent Diffusion for 3D Human Generation.  [cs.CV].
[6]
Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt.2023. . In ACM SIGGRAPH 2023 Conference Proceedings.
[7]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros.2023. . In CVPR.
[8]
Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. 2022. . ACM Transactions on Graphics (TOG) 41, 4, Article 162 (2022), 11 pages. https://doi.org/10.1145/3528223.3530104.
[9]
Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, and Ziwei Liu.2023. . In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[10]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.2020. . Commun. ACM63, 11(2020), 139–144.
[11]
Tero Karras, Samuli Laine, and Timo Aila.2019. . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4401–4410.
[12]
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila.2020. . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8110–8119.
[13]
Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila.2021. . In Proc. NeurIPS.
[14]
Kripasindhu Sarkar, Lingjie Liu, Vladislav Golyanik, and Christian Theobalt.2021. . arXiv preprint arXiv:2103.06902(2021).
[15]
Kathleen M Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman.2021. . ACM Transactions on Graphics (TOG)40, 4(2021), 1–10.
[16]
Kripasindhu Sarkar, Vladislav Golyanik, Lingjie Liu, and Christian Theobalt.2021. . arXiv preprint arXiv:2102.11263(2021).
[17]
Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen Change Loy, Wayne Wu, and Ziwei Liu. 2022. . In European Conference on Computer Vision. https://api.semanticscholar.org/CorpusID:248377018.
[18]
Anna Frühstück, Krishna Kumar Singh, Eli Shechtman, Niloy J Mitra, Peter Wonka, and Jingwan Lu.2022. . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7723–7732.
[19]
Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros.2016. . In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, 597–613.
[20]
Andrey Voynov and Artem Babenko. 2020. . In International conference on machine learning. PMLR, 9786–9796.
[21]
Yujun Shen and Bolei Zhou. 2021. . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1532–1540.
[22]
Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris.2020. . Advances in neural information processing systems33(2020), 9841–9850.
[23]
Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt.2023. . In ACM SIGGRAPH 2023 Conference Proceedings. 1–11.
[24]
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.2023. Adding Conditional Control to Text-to-Image Diffusion Models.
[25]
Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, and Sergey Tulyakov.2023. . arXiv preprint arXiv:2310.08579(2023).
[26]
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng.2020. . In ECCV.
[27]
Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein.2021. . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5799–5809.
[28]
Michael Niemeyer and Andreas Geiger. 2021. . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11453–11464.
[29]
Tao Hu, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, and Matthias Zwicker.2022. . 3DV(2022).
[30]
Tao Hu, Fangzhou Hong, and Ziwei Liu.2024. SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering.  [cs.CV].
[31]
Artur Grigorev, Karim Iskakov, Anastasia Ianina, Renat Bashirov, Ilya Zakharkin, Alexander Vakhitov, and Victor S. Lempitsky. 2021. . In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 5147–5156.
[32]
Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada.2022. . arXiv preprint arXiv:2204.08839(2022).
[33]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. . In International conference on machine learning. PMLR, 8748–8763.
[34]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.2022. . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
[35]
Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu.2022. . arXiv preprint arXiv:2205.08535(2022).
[36]
Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong.2023. . arXiv preprint arXiv:2304.00916(2023).
[37]
Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong.2023. . arXiv preprint arXiv:2308.09705(2023).
[38]
Zhaoxi Chen, Fangzhou Hong, Haiyi Mei, Guangcong Wang, Lei Yang, and Ziwei Liu.2023. . In Thirty-seventh Conference on Neural Information Processing Systems.
[39]
Wang Zeng, Wanli Ouyang, Ping Luo, Wentao Liu, and Xiaogang Wang. 2020. . In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 7052–7061. https://api.semanticscholar.org/CorpusID:219558352.
[40]
Tao Hu, Kripasindhu Sarkar, Lingjie Liu, Matthias Zwicker, and Christian Theobalt.2021. . In ICCV.
[41]
Tao Hu, Hongyi Xu, Linjie Luo, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, and Matthias Zwicker. 2023. . IEEE Transactions on Visualization and Computer Graphics (2023), 1–15. https://doi.org/10.1109/TVCG.2023.3297721.
[42]
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen.2022. . arXiv preprint arXiv:2212.08751(2022).
[43]
Shitong Luo and Wei Hu. 2021. . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2837–2845.
[44]
Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis.2022. . arXiv preprint arXiv:2210.06978(2022).
[45]
Linqi Zhou, Yilun Du, and Jiajun Wu.2021. . In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5826–5835.
[46]
Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias Nießner.2022. . arXiv preprint arXiv:2212.01206(2022).
[47]
Heewoo Jun and Alex Nichol. 2023. . arXiv preprint arXiv:2305.02463 (2023).
[48]
J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein.2022. . arXiv preprint arXiv:2211.16677(2022).
[49]
Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. 2022. . arXiv preprint arXiv:2212.06135 (2022).
[50]
Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi.2023. . In International Conference on Machine Learning. PMLR, 11808–11826.
[51]
Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz.2023. . arXiv preprint arXiv:2303.05371(2023).
[52]
Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. 2020. . IEEE Transactions on Pattern Analysis and Machine Intelligence (2020). https://doi.org/10.1109/TPAMI.2020.3048039.
[53]
Zhou Wang, A. Bovik, H. R. Sheikh, and E. P. Simoncelli.2004. . IEEE Transactions on Image Processing13(2004), 600–612.
[54]
Ming-Kuei Hu. 1962. . IRE Transactions on Information Theory 8, 2 (1962), 179–187. https://doi.org/10.1109/TIT.1962.1057692.
[55]
Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal.2019. . arXiv preprint arXiv:1910.09139(2019).
[56]
Richard Zhang, Phillip Isola, Alexei A. Efros, E. Shechtman, and O. Wang.2018. . CVPR(2018), 586–595.
[57]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.2017. . In NIPS.