Yixuan Zhu^{1} Ao Li^{2} Yansong Tang\(^{2\dagger}\) Wenliang Zhao^{1} Jie Zhou^{1} Jiwen Lu^{1}
^{1}Department of Automation, Tsinghua University
^{2}Tsinghua Shenzhen International Graduate School, Tsinghua University
April 01, 2024
The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes. Code is available at https://github.com/EternalEvan/DPMesh.
Human mesh recovery aims to estimate the 3D human pose and shape from monocular or multi-view images and videos. Over the past decade, it has become an active research problem, gaining prominence for its extensive applications in film-making, game development, and sports. In recent years, numerous deep-learning approaches [1], [2], [6]–[14] have emerged, addressing this inherently ill-posed problem by regressing body parameters from image features.
Nonetheless, extracting effective information from monocular images in complex scenarios (e.g., occlusions and crowded environments) remains a pivotal challenge. Existing methods address mesh recovery in such scenarios by incorporating 2D prior knowledge as hints, drawing the models’ attention to visible body parts and reinforcing their 2D alignment. Following this line, mainstream methods [15], [16] apply off-the-shelf key-point detectors to obtain coarse human joints, while others such as [11] and [17] introduce partial segmentation masks and UV maps as pixel-level knowledge. Despite these efforts, shortcomings persist under severe occlusion, since these methods depend excessively on 2D alignment and disregard the rich information embedded in natural images. Consequently, disturbances to the 2D detector due to noise and occlusion significantly degrade accuracy. Recently, Diffusion Models (DMs) [18], [19] have introduced a step-by-step generation framework, showcasing remarkable image synthesis capacity. Inspired by DMs, recent works such as [3]–[5] adopt the generative approach of diffusion models for pose estimation and achieve high accuracy. However, these diffusion-based methods require repeated denoising iterations and neglect the image-processing knowledge learned by text-to-image diffusion models, so the potential of diffusion is not fully exploited. More recent studies [20]–[22] investigate pre-trained diffusion models for 3D-related tasks, e.g., image synthesis from depth maps, text-to-3D generation and depth estimation. These studies indicate that a pre-trained diffusion model can provide structure-aware knowledge for 3D generation and perception tasks. 
Although diffusion models possess rich knowledge of 3D structure and spatial interaction from generative training, the challenge persists in effectively leveraging these capabilities for complex regression tasks like occluded human mesh recovery.
To overcome the aforementioned challenges, we present DPMesh, a simple yet effective framework for occluded human mesh recovery. DPMesh employs a pre-trained text-to-image diffusion model as the backbone, fully leveraging its potent knowledge of the 3D structure and spatial relationships learned from generative training, hence yielding a robust estimator for occluded poses, as illustrated in Figure 1. Our primary goal is to harness both the high-level and low-level visual concepts within a pre-trained diffusion model for the demanding occluded pose estimation task. Instead of following the time-consuming step-by-step denoising process, we replace conventional image backbones with the pre-trained denoising U-Net and perform an efficient single-step inference style with designed conditions as guidance, as depicted in Figure 2. Considering the pre-training of the diffusion model on text-to-image generation tasks, we confront two challenges: (1) preserving the learned knowledge within the pre-trained diffusion model and adapting it to the occluded human mesh recovery task, and (2) designing appropriate conditions and controls to enhance the model’s perception ability. To address these issues, we introduce an efficient framework to tailor the diffusion model for mesh recovery, leveraging an effective condition injection. To align with the original diffusion model and facilitate the interaction between image features and 2D prior information, we refine the spatial information from an off-the-shelf detector and inject the diffusion model with these conditions. This yields detailed knowledge of the 2D position and the key-points uncertainty. The processed 2D information serves as guidance for the diffusion model, ultimately producing rich visual content, encompassing both human structure and spatial interaction for the subsequent regressor. 
Furthermore, we present a noisy key-point reasoning approach to improve the robustness of our model, rendering it more stable for occlusion and crowds.
We conduct extensive experiments on various occlusion benchmarks, namely 3DPW-OC [17], [23], 3DPW-PC [6], [23], 3DOH [17], 3DPW-Crowd [15], [23] and CMU-Panoptic [24], as well as the standard 3DPW test split [23]. Without any fine-tuning on the 3DPW training set, DPMesh surpasses previous state-of-the-art methods. Specifically, we achieve MPJPE values of 70.9, 82.2, 79.9, and 73.6 mm on 3DPW-OC, 3DPW-PC, 3DPW-Crowd, and the 3DPW test split, respectively. Furthermore, we carry out comprehensive ablation studies to highlight the effectiveness of the diffusion-based backbone, the condition construction and the designed noisy key-point reasoning.
Human Mesh Recovery. During the past decade, parameterized human models [25]–[27] have been widely used to express 3D human pose and shape. Many preceding works explore approaches to estimate accurate model parameters from monocular images [1], [2], [7]–[14], [28]–[30]. They usually regress parameters from extracted image features. Some [14], [31] leverage 2D and 3D visual observations to enhance the 2D alignment of image features. Nevertheless, they fall short when confronted with complex scenarios, e.g., occlusion and crowded environments, since conventional backbones and estimators provide only vague information about the occluded region to the regressor. To handle this challenge, a series of methods [11], [16], [17], [32]–[34] propose approaches involving segmentation masks, center maps and 3D representations to improve the 2D and 3D alignments. However, these methods concentrate on exploiting the 2D and 3D observations and ignore the quality of the image features, which is fundamental. Most recently, methods based on diffusion models [3]–[5] introduce an iterative framework that estimates human poses through a repeated denoising process. Though they achieve satisfying accuracy, they are time-consuming and do not fully exploit the rich knowledge in diffusion models.
Diffusion Models. Denoising diffusion probabilistic models, commonly referred to as diffusion models [18], [19], have emerged as a prominent family of generative models, showcasing remarkable synthesis quality and controllability. The core concept of the diffusion model involves training a denoising autoencoder to estimate the inverse of a Markovian diffusion process [35]. Through generative training on large-scale datasets with image-text pairs (e.g., LAION-5B [36]), diffusion models acquire a powerful capability to generate high-quality images with diverse content and reasonable structures. This proficiency is harnessed during diffusion sampling, which can be viewed as a progressive denoising procedure that requires repeated inference of the denoising autoencoder. Recently, [21] proposed a controllable architecture, named ControlNet, to add spatial controls, e.g., depth maps and human poses, to pre-trained diffusion models, broadening their applications to controlled image generation. Although originally tailored to 2D text-to-image tasks, pre-trained diffusion models also possess rich knowledge about object structure and spatial interaction. They can adapt to various 3D-related tasks such as image synthesis from depth maps, text-to-3D generation and depth estimation, as explored in [20]–[22]. However, fully exploiting the structure-aware generative prior in the diffusion model for complex mesh recovery, especially in occluded and crowded scenarios, remains a significant challenge because it demands proficient visual perception.
In this section, we present DPMesh, an effective framework for occluded human mesh recovery with the pre-trained diffusion prior and proper conditional control. We will start by reviewing the background of diffusion models with conditional control and the human body model. Then we will provide a detailed walkthrough of the entire pipeline to introduce our designs in DPMesh. This includes how we leverage the generative diffusion prior for the human recovery task and inject valuable conditions to guide the denoising U-Net. Moreover, we will present a noisy key-point reasoning approach to enhance the robustness of our model. The overall framework of our DPMesh is illustrated in Figure 3.
Conditional Control for Diffusion Models. Diffusion models achieve high controllability thanks to the cross-attention layers in the denoising U-Net \({\boldsymbol{\epsilon}}_\theta\) [37], which enable interactions between image features and various conditions. Recently, ControlNet [21] enhanced fine-grained spatial control of the latent diffusion model (LDM) [19] by leveraging a trainable copy of the encoding layers in the denoising U-Net as a strong backbone for learning diverse conditional controls. During the training of the ControlNet framework, images are first projected to latent representations \({\boldsymbol{z}}_0\) by a trained VQGAN consisting of the encoder \(\mathcal{E}\) and the decoder \(\mathcal{D}\). Denoting \({\boldsymbol{z}}_t\) as the noisy latent at the \(t\)-th timestep, it is produced by: \[{\boldsymbol{z}}_t = \sqrt{\bar{\alpha}_t}{\boldsymbol{z}}_0+\sqrt{1-\bar{\alpha}_t}{\boldsymbol{\epsilon}}, \tag{1}\] where \(\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s\) and \({\boldsymbol{\epsilon}} \sim \mathcal{N}(0,{\boldsymbol{I}})\). Given the noisy latent and conditions, the training objective of the ControlNet framework can be derived as: \[L_{\rm CLDM}=\mathbb{E}_{{\boldsymbol{z}}_0,t,{\boldsymbol{c}}_{\rm t},{\boldsymbol{c}}_{\rm f},{\boldsymbol{\epsilon}}} \bigg[\Vert {\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{\theta}({\boldsymbol{z}}_t,t,{\boldsymbol{c}}_{\rm t},{\boldsymbol{c}}_{\rm f}) \Vert_2^2 \bigg], \tag{2}\] where \({\boldsymbol{z}}_t\) is computed from Equation (1), \({\boldsymbol{c}}_{\rm t}\) denotes the text condition embedding extracted from the frozen CLIP [38] text encoder and \({\boldsymbol{c}}_{\rm f}\) is a task-specific condition, such as human skeleton poses or canny maps. 
To prevent harmful noise from influencing the hidden states of neural network layers at the start of training, ControlNet applies zero convolution layers to the trainable copy branch. The condition branch takes \({\boldsymbol{c}}_{\rm f}\) as input and injects its outputs into the output blocks of the diffusion model \({\boldsymbol{\epsilon}}_\theta\). To preserve generation capability and reduce computational cost, it freezes the parameters in \({\boldsymbol{\epsilon}}_\theta\). By utilizing the fine-grained conditions, ControlNet achieves controllable human image generation with various conditions such as 2D skeletons.
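To make Equation (1) concrete, the following is a minimal sketch of the forward noising step on a flattened toy latent. The linear \(\beta\) schedule, the number of steps `T`, and the helper names `alpha_bar` and `add_noise` are illustrative assumptions, not the schedule or code of the pre-trained model.

```python
import math
import random


def alpha_bar(alphas, t):
    """Cumulative product alpha_1 * ... * alpha_t (t is 1-indexed)."""
    prod = 1.0
    for a in alphas[:t]:
        prod *= a
    return prod


def add_noise(z0, alphas, t, rng):
    """Sample z_t from z_0 as in Equation (1), element by element."""
    ab = alpha_bar(alphas, t)
    signal, noise = math.sqrt(ab), math.sqrt(1.0 - ab)
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in z0]


# Hypothetical linear beta schedule with alpha_s = 1 - beta_s.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
alphas = [1.0 - b for b in betas]

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                    # flattened toy latent
z_early = add_noise(z0, alphas, 1, rng)  # stays close to z0
z_late = add_noise(z0, alphas, T, rng)   # dominated by Gaussian noise
```

As \(t\) grows, \(\bar{\alpha}_t\) shrinks toward zero, so the signal term vanishes and \({\boldsymbol{z}}_t\) approaches pure noise.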
Human Body Model. We use the SMPL [26] model to parameterize the human body mesh. The 3D human body is described by three vectors: the SMPL pose \(\Theta \in \mathbb{R}^{72}\) and shape \(\beta \in \mathbb{R}^{10}\), together with camera parameters \(\pi \in \mathbb{R}^{3}\). The body mesh is generated by a differentiable function \(\mathcal{M}(\Theta,\beta) \in \mathbb{R}^{6890 \times 3}\), i.e., 6890 vertices with 3D coordinates. Then we can obtain the 3D joint coordinates by \(\mathcal{J}_{\rm 3D}=\mathcal{W}\mathcal{M} \in \mathbb{R}^{N \times 3}\), where \(\mathcal{W}\) is a pre-trained linear regressor and \(N\) represents the number of joints. With the predicted camera parameters \(\pi\), we can obtain reprojected 2D joints \(\mathcal{J}_{\rm 2D}=\Pi(\mathcal{J}_{\rm 3D},\pi) \in \mathbb{R}^{N \times 2}\) by perspective projection.
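The reprojection \(\Pi\) can be illustrated with a plain pinhole model. This is a sketch under stated assumptions: the three camera parameters are treated directly as a translation `(tx, ty, tz)`, the principal point is the image center, and the function name `project_perspective` is hypothetical. The focal length is set to the image diagonal, which is one common approximation.

```python
import math


def project_perspective(joints_3d, translation, focal, center):
    """Pinhole projection of 3D joints to pixel coordinates."""
    tx, ty, tz = translation
    cx, cy = center
    joints_2d = []
    for x, y, z in joints_3d:
        depth = z + tz  # assumed positive (joint in front of the camera)
        joints_2d.append((focal * (x + tx) / depth + cx,
                          focal * (y + ty) / depth + cy))
    return joints_2d


width, height = 256, 256
focal = math.hypot(width, height)      # image diagonal used as focal length
joints = [(0.0, 0.0, 0.0), (0.1, -0.2, 0.05)]
projected = project_perspective(joints, (0.0, 0.0, 5.0), focal,
                                (width / 2, height / 2))
print(projected[0])  # (128.0, 128.0): a joint on the optical axis hits the center
```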
Overview. Our primary goal is to fully exploit the pre-trained diffusion model’s potential for occluded human mesh recovery, leveraging its learned knowledge of object structure and spatial interactions. In contrast to previous methods involving repeated diffusion sampling, our basic idea is to simply employ the pre-trained diffusion model as the image backbone, performing a single inference to extract features from the image \({\boldsymbol{x}}\). To provide effective guidance, we adopt condition injection, which processes pre-detected 2D observations into conditions. We then use the noisy key-point reasoning approach to improve the occlusion awareness of our model and further enhance the robustness of the proposed framework. In summary, DPMesh takes the image and the corresponding noisy key-point observations as inputs and estimates the SMPL parameters \(\Theta\), \(\beta\) and the camera parameters \(\pi\), collectively denoted as the output \({\boldsymbol{y}}\). This process can be formulated as: \[p_{\phi}({\boldsymbol{y}}|{\boldsymbol{x}}) = p_{\phi_3}({\boldsymbol{y}}|\mathcal{F})p_{\phi_2}(\mathcal{F}|{\boldsymbol{x}},\mathcal{C}) p_{\phi_1}(\mathcal{C}|{\boldsymbol{x}}), \tag{3}\] where \(\mathcal{F}\) denotes the extracted feature maps and \(\mathcal{C}\) represents the conditions. We will describe the role of each term in Equation (3) along with their detailed designs.
Condition Injection with 2D HeatMap. \(p_{\phi_1}(\mathcal{C}|{\boldsymbol{x}})\) aims to construct effective conditions, which is significant since it provides spatial guidance for the denoising U-Net backbone \({\boldsymbol{\epsilon}}_{\theta}\). Therefore we design the condition injection to introduce high-level and spatial information that guides the backbone to focus on the region of interest. For each cropped image, we utilize an off-the-shelf 2D key-point detector [39] to obtain 2D joints \(J_{\rm 2D}\) along with their corresponding confidence and generate the heatmaps \(H_{\rm 2D} \in \mathbb{R}^{N \times H_{0} \times W_{0}}\) as conditioning input using 2-dimensional Gaussian kernels, where \(N\) represents the number of detected key-points. After that, we concatenate the heatmap \(H_{\rm 2D}\) with the latent image representation \({\boldsymbol{z}}_0\) to obtain \({\boldsymbol{c}}_{\rm j} \in \mathbb{R}^{(N+C_{z}) \times H_0 \times W_0}\). It is noteworthy that, in the original ControlNet architecture, the conditioning image is fused with \({\boldsymbol{z}}_0\) by element-wise addition after passing through convolution layers. However, we observe that this addition significantly damages the final condition quality. The preparation for \({\boldsymbol{c}}_{\rm j}\) can be expressed as: \[{\boldsymbol{c}}_{\rm j} = \mathop{\mathrm{Concat}}({\boldsymbol{z}}_0,H_{\rm 2D}), \quad H_{\rm 2D} = \mathop{\mathrm{Gaussian}}(J_{\rm 2D}). \tag{4}\] Then we employ the ControlNet architecture to process fine-grained conditions from \({\boldsymbol{c}}_{\rm j}\) and inject them into the image features in the denoising U-Net \({\boldsymbol{\epsilon}}_\theta\). 
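As an illustration of Equation (4), the sketch below renders one Gaussian heatmap per key-point and stacks the heatmaps onto the latent channels. The kernel width `sigma`, the toy sizes, and the helper names are assumptions for illustration. How the detector confidence enters the heatmap is not specified here, so it is left out.

```python
import math


def gaussian_heatmap(joint, height, width, sigma=1.0):
    """One H x W map with a 2D Gaussian bump centered at joint = (x, y)."""
    jx, jy = joint
    return [[math.exp(-((x - jx) ** 2 + (y - jy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def build_condition(latent, joints_2d, sigma=1.0):
    """Channel-wise Concat(z_0, H_2D) for a latent stored as C_z x H0 x W0 lists."""
    height, width = len(latent[0]), len(latent[0][0])
    heatmaps = [gaussian_heatmap(j, height, width, sigma) for j in joints_2d]
    return latent + heatmaps  # list concatenation stacks the channels


# Toy sizes: C_z = 4 latent channels on a 32 x 32 grid, N = 3 key-points.
latent = [[[0.0] * 32 for _ in range(32)] for _ in range(4)]
joints = [(5, 7), (16, 16), (30, 2)]  # (x, y) in latent-grid coordinates
c_j = build_condition(latent, joints)
print(len(c_j), len(c_j[0]), len(c_j[0][0]))  # 7 32 32
```

Concatenation keeps the latent channels untouched, whereas element-wise addition would mix heatmap values into them.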
The output of the \(i\)-th layer \(F_i\) in the decoding layers of \({\boldsymbol{\epsilon}}_\theta\) is derived as: \[F_i = F(F_{i-1};\theta_i)+\mathop{\mathrm{Conv}}(F({\boldsymbol{c}}_{\rm j};\theta_{\rm cond});\theta_{\rm Conv}),\] where \(F_{i-1}\) is the output of the previous block and \(F(\cdot;\theta)\) denotes a trained neural network. \(\theta_i\) and \(\theta_{\rm cond}\) represent the parameters within the denoising U-Net and ControlNet, respectively. \(\theta_{\rm Conv}\) denotes the parameters of the zero convolution layers, with both weights and biases initialized to zero. Note that in the original ControlNet architecture, \(\theta_{\rm cond}\) is a trainable copy of the encoding blocks in the denoising U-Net.
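The effect of the zero-initialized convolution in the injection formula can be seen at a single pixel, where a 1x1 convolution reduces to a matrix-vector product. This is a toy sketch: the vectors stand in for the outputs of \(F(F_{i-1};\theta_i)\) and \(F({\boldsymbol{c}}_{\rm j};\theta_{\rm cond})\), and the kernel size and function names are assumptions.

```python
def conv1x1(feature, weight, bias):
    """1x1 convolution at one pixel: out = W @ feature + b."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weight, bias)]


def inject(decoder_out, control_feat, weight, bias):
    """F_i = F(F_{i-1}; theta_i) + Conv(F(c_j; theta_cond); theta_Conv)."""
    residual = conv1x1(control_feat, weight, bias)
    return [d + r for d, r in zip(decoder_out, residual)]


channels = 3
zero_w = [[0.0] * channels for _ in range(channels)]
zero_b = [0.0] * channels

decoder_out = [0.2, -1.3, 0.7]   # stands in for F(F_{i-1}; theta_i)
control_feat = [5.0, 4.0, -2.0]  # stands in for F(c_j; theta_cond)

# With zero weights and biases the control branch contributes nothing,
# so the pre-trained features pass through unchanged at the start of training.
print(inject(decoder_out, control_feat, zero_w, zero_b) == decoder_out)  # True
```

Once training updates the zero convolution, the residual becomes non-zero and the condition starts to steer the decoder features.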
Besides the ControlNet that provides controls for the denoising U-Net \({\boldsymbol{\epsilon}}_\theta\), we also use the cross-attention prompt to pinpoint more accurate spatial information. In original diffusion models, the prompt condition \({\boldsymbol{c}}_{\rm t}\) typically consists of text embeddings from frozen CLIP. However, in our framework, we replace text with visible 2D joint coordinates \(J_{\rm 2D} \in \mathbb{R}^{N \times 2}\). We then apply a two-layer MLP to lift the dimension of the 2D joint coordinates to the text token dimension \(D_{\rm t}\), which is set to 768 in the pre-trained diffusion model. Thus we obtain the auxiliary spatial condition \({\boldsymbol{c}}_{\rm t} \in \mathbb{R}^{N \times D_{\rm t}}\). We take \({\boldsymbol{c}}_{\rm t}\) as prompt guidance and send it to all cross-attention blocks in the denoising U-Net. The construction of \({\boldsymbol{c}}_{\rm t}\) can be formulated as: \[{\boldsymbol{c}}_{\rm t} = \mathop{\mathrm{MLP}}(J_{\rm 2D}).\] To sum up, we construct the condition set \(\mathcal{C}\) consisting of \({\boldsymbol{c}}_{\rm j}\) and \({\boldsymbol{c}}_{\rm t}\) and inject them into \({\boldsymbol{\epsilon}}_\theta\) through different pathways.
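The mapping \({\boldsymbol{c}}_{\rm t} = \mathrm{MLP}(J_{\rm 2D})\) can be sketched with a small two-layer network that turns each 2D joint into one 768-dimensional token. The hidden width, the ReLU activation, the random initialization, and the normalized joint coordinates are assumptions made for this example.

```python
import random


def linear(x, weight, bias):
    """Affine map: weight has shape (n_out, n_in), x has length n_in."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(n_in, n_out, rng):
    """Randomly initialized layer (illustrative, untrained)."""
    scale = 1.0 / n_in ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def keypoint_tokens(joints_2d, layer1, layer2):
    """Two-layer MLP applied to every joint independently."""
    tokens = []
    for joint in joints_2d:
        hidden = [max(0.0, h) for h in linear(joint, *layer1)]  # ReLU (assumed)
        tokens.append(linear(hidden, *layer2))
    return tokens


rng = random.Random(0)
hidden_dim, token_dim = 64, 768  # hidden_dim is a hypothetical choice
layer1 = init_linear(2, hidden_dim, rng)
layer2 = init_linear(hidden_dim, token_dim, rng)

joints = [(0.25, 0.40), (0.50, 0.55), (0.70, 0.90)]  # N = 3 normalized joints
c_t = keypoint_tokens(joints, layer1, layer2)
print(len(c_t), len(c_t[0]))  # 3 768
```

The resulting \(N \times 768\) array has the same layout as a sequence of text tokens, so it can be passed to cross-attention in place of a text embedding.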
Feature Extraction with Diffusion Prior. \(p_{\phi_2}(\mathcal{F}|{\boldsymbol{x}},\mathcal{C})\) is dedicated to feature extraction from the input image. In contrast to previous methods that employ convolution-based and Transformer-based backbones, we leverage the pre-trained diffusion model as our backbone, exploiting the visual perception capability that the denoising U-Net learned from generative training. As verified in [20], a pre-trained denoising U-Net \({\boldsymbol{\epsilon}}_\theta\) contains rich visual priors about object structure and spatial interactions. By unlocking the potential of \({\boldsymbol{\epsilon}}_\theta\), we can tune the generative capability of the pre-trained diffusion model to address the human mesh recovery task. Motivated by this perspective, we design a straightforward feature extractor implemented by the pre-trained denoising U-Net, which receives guidance \(\mathcal{C}\) from the condition injection. To prepare the input images, we convert the cropped image \({\boldsymbol{x}} \in \mathbb{R}^{H \times W \times C}\) from pixel space to the latent space with the frozen encoder \(\mathcal{E}\) to obtain the latent representation \({\boldsymbol{z}}_0 \in \mathbb{R}^{C_{z} \times H_{0} \times W_{0}}\). Then we feed \({\boldsymbol{z}}_0\) into the pre-trained U-Net \({\boldsymbol{\epsilon}}_{\theta}\) and extract the multi-scale feature maps \(F_i\), where \(i \in \{1,4,7\}\), from the decoding layers as the implicit image features. Furthermore, we empirically observe that the cross-attention maps \(T_i \in \mathbb{R}^{|{\boldsymbol{c}}_{\rm t}| \times H_i \times W_i}\) can provide significant occlusion-aware information indicating the invisible parts and explicit object structure knowledge about the human pose and shape. 
Therefore we concatenate the cross-attention maps with the feature maps \(F_i\) and obtain the hierarchical feature maps \(\mathcal{F}\leftarrow \{[F_i, T_i]\}\), which incorporate both implicit and explicit diffusion priors for subsequent regression.
SMPL Mesh Regressor. \(p_{\phi_3}({\boldsymbol{y}}|\mathcal{F})\) represents the regressor responsible for predicting the parameters of the body model from the feature maps \(\mathcal{F}\). To capture the body information in \(\mathcal{F}\), we employ a cascade Transformer decoder for the regressor. In order to provide sufficient human pose priors and maintain the symmetry of the VQGAN framework, we train a VQVAE on a large motion dataset [40] with massive SMPL pose parameters to learn discrete representations for human poses. During the final regression, we predict the entry indices of the learned codebook and feed the corresponding pose embedding to the decoder \(\mathcal{D}\) of the VQVAE to obtain the pose parameters \(\Theta\). As for the shape and camera parameters (i.e., \(\beta\) and \(\pi\)), which are highly dependent on image features, we directly regress them using linear layers.
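The codebook step of the regressor can be sketched as an index selection followed by a table lookup. The example assumes that the regressor outputs one score per codebook entry for each pose token and that the highest score is selected. The toy codebook and the function names are hypothetical, and the VQVAE decoder that maps the embeddings to \(\Theta\) is omitted.

```python
def select_code(logits):
    """Index of the highest-scoring codebook entry (argmax, an assumed rule)."""
    return max(range(len(logits)), key=lambda k: logits[k])


def lookup_pose_embedding(token_logits, codebook):
    """Map each token's scores to the corresponding codebook embedding."""
    return [codebook[select_code(logits)] for logits in token_logits]


codebook = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]]    # 3 entries, embedding dim 2
token_logits = [[0.1, 2.3, -0.4], [1.5, 0.2, 1.9]]  # 2 tokens, one score per entry
embeddings = lookup_pose_embedding(token_logits, codebook)
print(embeddings)  # [[1.0, -1.0], [0.5, 2.0]]
```

Because every output is drawn from the learned codebook, the decoded pose stays within the set of representations learned from the motion data.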
Method | 3DPW-OC MPJPE\(\downarrow\) | 3DPW-OC PA-MPJPE\(\downarrow\) | 3DPW-OC MPVE\(\downarrow\) | 3DPW-PC MPJPE\(\downarrow\) | 3DPW-PC PA-MPJPE\(\downarrow\) | 3DPW-PC MPVE\(\downarrow\) | 3DOH MPJPE\(\downarrow\) | 3DOH PA-MPJPE\(\downarrow\) | 3DOH MPVE\(\downarrow\) | 3DPW-Crowd MPJPE\(\downarrow\) | 3DPW-Crowd PA-MPJPE\(\downarrow\) | 3DPW-Crowd MPVE\(\downarrow\) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SPIN [2] | 95.5 | 60.7 | 121.4 | 122.1 | 77.5 | 159.8 | 110.5 | 71.6 | 124.2 | 121.2 | 69.9 | 144.1 |
PyMAF [14] | 89.6 | 59.1 | 113.7 | 117.5 | 74.5 | 154.6 | 101.6 | 67.7 | 116.6 | 115.7 | 66.4 | 147.5 |
ROMP [6] | 91.0 | 62.0 | - | 98.7 | 69.0 | - | - | - | - | 104.8 | 63.9 | 127.8 |
OCHMR [33] | 112.2 | 75.2 | 145.9 | - | - | - | - | - | - | - | - | - |
PARE [11] | 83.5 | 57.0 | 101.5 | 95.8 | 64.5 | 122.4 | 109.0 | 63.8 | 117.4 | 94.9 | 57.5 | 117.6 |
3DCrowdNet [15] | 83.5 | 57.1 | 101.5 | 90.9 | 64.4 | 114.8 | 102.8 | 61.6 | 111.8 | 85.8 | 55.8 | 108.5 |
JOTR [16] | 75.7 | 52.2 | 92.6 | 86.5 | 58.3 | 109.7 | 98.7 | 59.3 | 104.8 | 82.4 | 52.0 | 103.4 |
DPMesh (Ours) | 70.9 | 48.0 | 88.0 | 82.2 | 56.6 | 105.4 | 97.1 | 59.0 | 106.4 | 79.9 | 51.1 | 101.5 |
As we introduce an off-the-shelf 2D key-point detector for providing 2D observation hints, a natural question arises: how robust is our framework in the presence of noisy key-points? Noisy key-points, often arising from severe occlusion, can adversely impact the model’s performance during evaluation. This consideration motivates us to reinforce our backbone with extra supervision to mitigate the model’s reliance on noisy key-points. To achieve this, we leverage a self-supervised distillation approach, called Noisy Key-point Reasoning (NKR), that focuses on 2D detection errors, including missing key-points, jitter and mismatches. The core concept involves training a teacher model that accurately encodes feature maps given precise ground truth key-points. Then we utilize the teacher’s feature maps \(\mathcal{F}^{T}\) to guide and supervise the student’s feature maps \(\mathcal{F}^{S}\). During the distillation process, we minimize both the SimCLR loss [41] and the MSE loss: \[L_{\rm NKR} = L_{\rm SimCLR} + L_{\rm MSE}, \tag{5}\] where the SimCLR loss is computed by: \[L_{\rm SimCLR} = \log \frac{\exp(\mathop{\mathrm{Dist}}(F_{i}^{S}, F_{i}^{T}))}{\sum_{j \neq i} \exp(\mathop{\mathrm{Dist}}(F_{i}^{S}, F_{j}^{T}))}.\] The SimCLR loss enlarges the distance between the two models’ features for different inputs while reducing it for the same input. The MSE loss provides additional supervision to align the features of teacher and student. This noisy key-point reasoning approach enhances the robustness of our framework against 2D detection errors, ensuring its stability under challenging occlusion.
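The NKR objective in Equation (5) can be written out on toy feature vectors. This sketch takes \(\mathrm{Dist}\) to be the Euclidean distance and averages the loss over the batch. Both choices are assumptions, as are the function names, and a batch needs at least two samples so that negatives exist.

```python
import math


def dist(a, b):
    """Euclidean distance, used here as an assumed choice for Dist."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def nkr_loss(student, teacher):
    """Batch mean of L_SimCLR + L_MSE for paired student/teacher features."""
    n = len(student)
    contrastive, regression = 0.0, 0.0
    for i in range(n):
        positive = dist(student[i], teacher[i])
        negatives = sum(math.exp(dist(student[i], teacher[j]))
                        for j in range(n) if j != i)
        # log(exp(positive) / negatives) = positive - log(negatives)
        contrastive += positive - math.log(negatives)
        regression += mse(student[i], teacher[i])
    return (contrastive + regression) / n


teacher = [[0.0, 0.0], [3.0, 4.0]]
aligned = [[0.0, 0.0], [3.0, 4.0]]  # student matches its own teacher features
swapped = [[3.0, 4.0], [0.0, 0.0]]  # student matches the other sample instead
print(round(nkr_loss(aligned, teacher), 6), round(nkr_loss(swapped, teacher), 6))
```

The aligned student receives a much lower loss than the swapped one: the loss rewards features that are close to the teacher for the same input and far from the teacher for other inputs.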
We use Stable Diffusion V1-5 [19] with ControlNet [21], pre-trained for human-like image generation from 2D skeletons, as our image backbone. Following [15], [16], we take the cropped image at 256 \(\times\) 256 resolution as input and encode it into the latent code \({\boldsymbol{z}}_0 \in \mathbb{R}^{4 \times 32 \times 32}\). We extract feature maps at spatial sizes 8, 16 and 32, together with the cross-attention maps at the same resolutions, from the denoising U-Net. To maintain the learned knowledge in the pre-trained diffusion model, we use LoRA [42] to adapt the linear layers in cross-attention blocks, setting the rank of the LoRA modules to 64. We find that fine-tuning even this small number of parameters via LoRA yields satisfying results. More details are provided in the appendix.
Finally, we obtain mesh vertices \(\mathcal{M}(\Theta,\beta) \in \mathbb{R}^{6890 \times 3}\) and 3D joints \(\mathcal{J}_{\rm 3D}=\mathcal{W}\mathcal{M} \in \mathbb{R}^{N \times 3}\) with the functions introduced in Section 3.1. We reproject the body joints to the image by \(\mathcal{J}_{\rm 2D}=\Pi(\mathcal{J}_{\rm 3D},\pi) \in \mathbb{R}^{N \times 2}\). In the reprojection process, we approximate the focal length by the length of the image diagonal following [43]. For the training objectives, we use widely used losses on SMPL parameters and 2D joints, plus a loss on 3D joints when 3D joint annotations are available. The entire loss function can be formulated as: \[\begin{aligned} L =&\;\lambda_{\rm 2D}L_{\rm 2D} + \lambda_{\rm 3D}L_{\rm 3D} + \\ &\;\lambda_{\rm SMPL}L_{\rm SMPL} + \lambda_{\rm NKR}L_{\rm NKR}. \end{aligned} \tag{6}\] The first three terms are computed by: \[L_{\rm 2D} = \Vert \mathcal{J}_{\rm 2D} - \hat{\mathcal{J}}_{\rm 2D} \Vert , \;\;L_{\rm 3D} = \Vert \mathcal{J}_{\rm 3D} - \hat{\mathcal{J}}_{\rm 3D} \Vert , \tag{7}\] \[L_{\rm SMPL} = \Vert \Theta - \hat{\Theta} \Vert + \Vert \beta - \hat{\beta} \Vert , \tag{8}\] where \(\hat{\mathcal{J}}_{\rm 2D}\), \(\hat{\mathcal{J}}_{\rm 3D}\), \(\hat{\Theta}\) and \(\hat{\beta}\) represent the ground truth annotations of 2D joints, 3D joints, SMPL body pose parameters and shape parameters, respectively.
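The weighted sum in Equation (6) can be assembled as follows. The norm in Equations (7) and (8) is left unspecified, so this sketch assumes an L1 norm over flattened vectors. The weights, the dictionary keys, and the `has_3d` flag are hypothetical names for illustration.

```python
def l1(pred, target):
    """L1 norm of the difference between two flattened vectors (assumed norm)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def total_loss(pred, gt, weights, l_nkr, has_3d=True):
    """Weighted sum of the 2D, 3D, SMPL and NKR terms as in Equation (6)."""
    l_2d = l1(pred["joints_2d"], gt["joints_2d"])
    # The 3D term is dropped for samples without 3D joint annotations.
    l_3d = l1(pred["joints_3d"], gt["joints_3d"]) if has_3d else 0.0
    l_smpl = l1(pred["pose"], gt["pose"]) + l1(pred["shape"], gt["shape"])
    return (weights["2d"] * l_2d + weights["3d"] * l_3d
            + weights["smpl"] * l_smpl + weights["nkr"] * l_nkr)


pred = {"joints_2d": [1.0, 2.0], "joints_3d": [1.0, 1.0, 1.0],
        "pose": [0.5], "shape": [0.0]}
gt = {"joints_2d": [0.0, 0.0], "joints_3d": [1.0, 1.0, 2.0],
      "pose": [0.0], "shape": [0.5]}
weights = {"2d": 1.0, "3d": 1.0, "smpl": 1.0, "nkr": 1.0}  # hypothetical lambdas

print(total_loss(pred, gt, weights, l_nkr=2.0))                # 7.0
print(total_loss(pred, gt, weights, l_nkr=2.0, has_3d=False))  # 6.0
```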
To verify the effectiveness of our proposed DPMesh, we conduct comprehensive experiments and ablation studies on the standard benchmark and various occlusion datasets. We will introduce the experimental settings including the implementation for training and evaluation. Subsequently, we will present our main results and offer a comprehensive analysis through detailed ablations.
Training Details. In alignment with previous works [15], [16], we train our model on a hybrid dataset with 2D or 3D annotations, including Human3.6M [44], MuCo-3DHP [45], MSCOCO [46], and CrowdPose [47]. We exclusively utilize the training sets of these datasets, adhering to standard split protocols. For the 2D datasets, we utilize their pseudo ground-truth SMPL parameters [48]. During training, we add realistic errors to the ground truth (GT) 2D pose following [9], [49] to simulate erroneous 2D poses, rather than using detected key-point results. We use the AdamW optimizer with a batch size of 16 and a weight decay of 1e-6. The whole training process takes 30 epochs. The initial learning rate is set to 1e-4 and reduced to 1e-5 for the final 5 epochs. The VQVAE is trained on AMASS [40] for 2,000 epochs and the teacher model is trained with ground truth key-point labels for 10 epochs.
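The learning-rate schedule described above amounts to a single step. A minimal sketch, assuming 0-indexed epochs and a hypothetical helper name `learning_rate`:

```python
def learning_rate(epoch, total_epochs=30, base_lr=1e-4, final_lr=1e-5,
                  final_epochs=5):
    """Return base_lr, then final_lr for the last final_epochs epochs."""
    return final_lr if epoch >= total_epochs - final_epochs else base_lr


schedule = [learning_rate(e) for e in range(30)]
print(schedule.count(1e-4), schedule.count(1e-5))  # 25 5
```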
Evaluation Details. We evaluate our model on 3DPW [23] test split, 3DOH [17] test split, 3DPW-PC [6], [23], 3DPW-OC [17], [23], 3DPW-Crowd [15], [23] and CMU-Panoptic dataset [24]. 3DPW-PC is the person-person occlusion subset of 3DPW and 3DPW-OC is the person-object occlusion subset of 3DPW. 3DOH is another person-object occlusion dataset. The metrics we use are mean per joint position error (MPJPE) in mm, Procrustes-aligned mean per joint position error (PA-MPJPE) in mm for evaluating the accuracy of 3D joints and mean per vertex error (MPVE) in mm for evaluating 3D mesh error. For the CMU-Panoptic dataset, we only evaluate MPJPE following previous work [1], [50]–[52].
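For reference, MPJPE and MPVE share the same computation: the mean Euclidean distance between predicted and ground-truth 3D points, over joints or mesh vertices respectively. The sketch below assumes coordinates in meters and alignment at a root joint with index 0 before measuring, which is a common convention rather than something stated above. PA-MPJPE additionally applies a Procrustes (similarity) alignment, which is omitted here.

```python
import math


def root_align(points, root_index=0):
    """Translate points so that the root joint sits at the origin."""
    rx, ry, rz = points[root_index]
    return [(x - rx, y - ry, z - rz) for x, y, z in points]


def mean_per_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def mpjpe_mm(pred_joints, gt_joints):
    """MPJPE in millimeters for joints given in meters."""
    error_m = mean_per_point_error(root_align(pred_joints),
                                   root_align(gt_joints))
    return 1000.0 * error_m


gt = [(0.0, 0.0, 0.0), (0.10, 0.0, 0.0), (0.0, 0.20, 0.0)]
pred = [(0.01, 0.0, 0.0), (0.13, 0.0, 0.0), (0.01, 0.24, 0.0)]
print(round(mpjpe_mm(pred, gt), 3))  # about 20.0 mm
```

In the toy example, the per-joint errors after root alignment are 0, 20 and 40 mm, so the mean is 20 mm.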
3DPW-OC [17], [23] is a person-object occlusion subset of 3DPW and contains 20243 persons. Table 1 shows that our DPMesh outperforms all competitors with 70.9 mm MPJPE and 48.0 mm PA-MPJPE, demonstrating its promising capability in effectively handling complex in-the-wild scenes.
3DOH [17] is a person-object occlusion-specific dataset that contains 1290 images in the test split. For a fair comparison, none of the reported methods is fine-tuned on its training split. As shown in Table 1, we achieve the best MPJPE (97.1 mm) and PA-MPJPE (59.0 mm). Qualitative results in Figure 5 further show that DPMesh handles heavy occlusion well.
Method | MPJPE\(\downarrow\) | PA-MPJPE\(\downarrow\) | MPVE\(\downarrow\) |
---|---|---|---|
HMR [1] | 130.0 | 76.7 | - |
GraphCMR [53] | - | 70.2 | - |
SPIN [2] | 96.9 | 56.2 | 116.4 |
PyMAF [14] | 92.8 | 58.9 | 110.1 |
OCHMR [33] | 89.7 | 58.3 | 107.1 |
ROMP [6] | 89.3 | 53.5 | 105.6 |
PARE [11] | 82.9 | 52.3 | 99.7 |
3DCrowdNet [15] | 81.7 | 51.2 | 98.3 |
JOTR [16] | 76.4 | 48.7 | 92.6 |
DPMesh (Ours) | 73.6 | 47.4 | 90.7 |
3DPW-PC [6], [23] is a person-person occlusion subset of 3DPW, comprising 2218 individuals. Images within this dataset contain annotations for multiple persons, potentially distracting the feature extractor. This necessitates a robust backbone capable of effectively interpreting the spatial relationships among these individuals. As shown in Table 1, DPMesh exhibits superior performance compared to previous methods with 82.2 MPJPE and 56.6 PA-MPJPE.
3DPW-Crowd [15], [23] is a crowded-scene subset of 3DPW and contains 1923 persons. DPMesh exceeds state-of-the-art methods across all metrics. As shown in Table 1, we reach the best results among the compared methods, with 79.9 mm MPJPE and 51.1 mm PA-MPJPE.
CMU-Panoptic [24] dataset is a multi-person indoor dataset collected with multi-view cameras. In order to ensure a fair comparison, we choose 4 scenes for evaluation, following [15], [54]. Results are shown in Table 3, and we outshine other competitors on all video clips.
3DPW [23] is a widely-used benchmark for human mesh recovery, featuring 60 videos and 3D annotations of 35,515 individuals in its test split. For a fair comparison with other methods, we do not fine-tune our model on the 3DPW train split. As presented in Table 2, our method achieves state-of-the-art performance on the test split. We also show the qualitative results in Figure 4, which demonstrate that our DPMesh is robust in complex, wild scenes.
Effective Backbone with Diffusion Prior. We compare our diffusion-based backbone with convolution-based backbones such as ResNet50 [55] and HRNet-W48 [56], as well as transformer-based backbones such as ViT-L-16 [57] and Swin-V2-L [58]. Note that our selection includes both supervised pre-trained models (e.g., ResNet50) and self-supervised pre-trained models (e.g., Swin-V2-L). For this comparison, we concatenate pre-detected heatmaps with early-stage image features and fine-tune each backbone for 30 epochs. As shown in Table 4, our diffusion-based backbone outperforms all competitors, indicating its strong perception capability for occluded human mesh recovery. Furthermore, as illustrated in Figure 6, our diffusion-based backbone accurately captures occlusion-aware information in the cross-attention maps, which provides explicit guidance for the subsequent regressor by distinguishing the target from various occluders.
Method | Haggl. | Mafia | Ultim. | Pizza | Mean |
---|---|---|---|---|---|
Zanfir et al. [51] | 140.0 | 165.9 | 150.7 | 156.0 | 153.4 |
Zanfir et al. [52] | 141.4 | 152.3 | 145.0 | 162.5 | 150.3 |
Jiang et al. [54] | 129.6 | 133.5 | 153.0 | 156.7 | 143.2 |
ROMP [6] | 111.8 | 129.0 | 148.5 | 149.1 | 134.6 |
REMIPS [50] | 121.6 | 137.1 | 146.4 | 148.0 | 138.3 |
3DCrowdNet [15] | 109.6 | 135.9 | 129.8 | 135.6 | 127.6 |
JOTR [16] | 99.9 | 113.5 | 115.7 | 123.6 | 114.7 |
DPMesh (Ours) | 97.2 | 109.8 | 114.3 | 120.5 | 110.4 |
Settings | 3DPW MPJPE\(\downarrow\) | 3DPW PA-MPJPE\(\downarrow\) | 3DPW-OC MPJPE\(\downarrow\) | 3DPW-OC PA-MPJPE\(\downarrow\) |
---|---|---|---|---|
type of conditioning inputs | ||||
\({\boldsymbol{z}}_0\) | 87.6 | 54.9 | 89.6 | 60.7 |
\({\boldsymbol{z}}_0+{\boldsymbol{c}}_{\rm j}\) | 75.1 | 49.8 | 73.7 | 50.6 |
\({\boldsymbol{z}}_0+{\boldsymbol{c}}_{\rm j}^{\rm add}\) | 100.3 | 62.0 | 109.0 | 73.4 |
\({\boldsymbol{z}}_0+{\boldsymbol{c}}_{\rm j}+{\boldsymbol{c}}_{\rm t}\) | 73.6 | 47.4 | 70.9 | 48.0 |
noisy key-point reasoning | ||||
ResNet50 [55] w/o NKR | 80.2 | 52.4 | 78.9 | 52.8 |
HRNet-W48 [56] w/o NKR | 78.7 | 50.9 | 76.9 | 51.8 |
ViT-L-16 [57] w/o NKR | 76.9 | 49.3 | 75.6 | 52.0 |
Swin-V2-L [58] w/o NKR | 77.0 | 48.8 | 76.1 | 52.2 |
DPMesh (Ours) w/o NKR | 74.9 | 48.2 | 73.9 | 49.9 |
type of backbones | ||||
ResNet50 [55] | 79.4 | 51.8 | 76.1 | 50.9 |
HRNet-W48 [56] | 77.2 | 50.8 | 75.6 | 50.1 |
ViT-L-16 [57] | 75.2 | 48.5 | 73.1 | 49.6 |
Swin-V2-L [58] | 77.3 | 48.5 | 73.5 | 49.8 |
DPMesh (Ours) | 73.6 | 47.4 | 70.9 | 48.0 |
Designs of Conditions. We investigate the effects of different designs for conditioning inputs in our model. The results are presented at the top of Table 4. Both the spatial heatmap condition \({\boldsymbol{c}}_{\rm j}\) and the joint-coordinate prompt \({\boldsymbol{c}}_{\rm t}\) improve performance, which demonstrates that these conditions help our model focus on critical areas. We also experiment with an alternative way of incorporating \({\boldsymbol{c}}_{\rm j}\): adding it element-wise to the image latent \({\boldsymbol{z}}_0\). However, this approach (\({\boldsymbol{z}}_0+{\boldsymbol{c}}_{\rm j}^{\rm add}\)) performs substantially worse, falling even below the baseline without the \({\boldsymbol{c}}_{\rm j}\) condition. We hypothesize that adding a heatmap to a highly abstracted latent code is not meaningful in the latent space and gives the model a misleading hint. Consequently, we concatenate \({\boldsymbol{z}}_0\) and the heatmap along the channel dimension to reduce mutual interference.
Noisy Key-point Reasoning. We further investigate the effectiveness of our NKR approach. As shown in the middle part of Table 4, we disable NKR on various backbones and evaluate their performance. All backbones perform worse without NKR in occlusion scenes. For the diffusion backbone in DPMesh, the NKR approach provides a slight improvement, which confirms its ability to reduce the disturbance of noisy 2D observations. This demonstrates that NKR yields a more robust framework, enabling DPMesh to handle challenging occlusion scenarios and produce more accurate reconstructions.
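To make the difference between the two injection strategies concrete, the following is a minimal NumPy sketch contrasting channel-wise concatenation with element-wise addition. The 4-channel latent is the standard Stable Diffusion v1.x latent layout; the number of joints (17), the random inputs, and the function names are assumptions made purely for illustration.

```python
import numpy as np

def inject_by_concat(z0, heatmaps):
    """Stack latent and heatmap channels: (C, H, W) and (J, H, W) -> (C + J, H, W)."""
    if z0.shape[1:] != heatmaps.shape[1:]:
        raise ValueError("latent and heatmaps must share the same spatial size")
    return np.concatenate([z0, heatmaps], axis=0)

def inject_by_add(z0, cond):
    """Element-wise addition: the condition must already have the latent's shape."""
    if z0.shape != cond.shape:
        raise ValueError("addition requires identical shapes")
    return z0 + cond

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 32, 32))        # latent code (4 channels assumed)
heatmaps = rng.random(size=(17, 32, 32))  # one channel per joint (17 is hypothetical)

fused = inject_by_concat(z0, heatmaps)
# With concatenation the latent is kept intact in its own channels,
# so the first layer can learn how to weigh each source.
latent_preserved = np.array_equal(fused[:4], z0)
```

With addition, the heatmap would first have to be projected to 4 channels and is then mixed irreversibly into the latent values, which is consistent with the interference hypothesized above.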
In this paper, we introduce DPMesh, a simple yet effective framework for occluded human mesh recovery, which fully exploits the rich knowledge about object structure and spatial interaction within the pre-trained diffusion model. We successfully tame the diffusion model with the designed condition injection to perform accurate occluded mesh recovery in a single step. Furthermore, we leverage a noisy key-point reasoning approach to enhance the robustness of our model. Extensive experiments demonstrate our framework can achieve accurate estimation even in severe occlusion and crowded environments. We hope our work will provide a new perspective for occluded human mesh recovery and inspire more research in employing diffusion models for perception tasks.
Acknowledgement. This work was supported in part by the National Natural Science Foundation of China under Grant 62125603, Grant 62321005, and Grant 62336004, and in part by CCF-Tencent Rhino-Bird Open Research Fund.
In this appendix, we provide additional detailed implementations, qualitative comparisons and ablation studies in Section 6, Section 7 and Section 8.
We use Stable Diffusion [19] v1.5 with a ControlNet pre-trained on skeleton-image conditions to further exploit its diffusion prior. The denoising U-Net has 25 blocks in total: one middle block, 12 input blocks, and 12 output blocks. The ControlNet uses the 12 input blocks as a parallel branch, with zero-convolution layers as its output layers. The latent code \(\boldsymbol{z}_0\) has a spatial size of 32 \(\times\) 32 and is downsampled to 16 \(\times\) 16 and 8 \(\times\) 8 resolutions. During upsampling in the output blocks, we extract cross-attention maps of size 8 \(\times\) 8 and 16 \(\times\) 16 and concatenate them with the corresponding feature maps along the channel dimension. Finally, we use linear layers to merge the multi-layer feature pyramid into a 2048 \(\times\) 8 \(\times\) 8 feature for subsequent regression. For the teacher model, we employ the same framework as the student.
As for the ablation study of various backbones, we follow [15], [16] and first extract an early-stage image feature at 64 \(\times\) 64 resolution. For convolution-based backbones (e.g., ResNet50 [55] and HRNet-W48 [56]), we use a stride-2 convolution and a max-pooling layer to downsample the image, while for transformer-based backbones (e.g., ViT-L [57] and Swin-V2-L [58]) we apply patch embedding layers. We set the patch size to 4 for Swin-V2-L and 16 for ViT-L, considering the latter’s substantial computational cost. Table 5 reports the number of model parameters. With LoRA [42], we significantly reduce the number of trainable parameters while preserving the diffusion prior in the pre-trained model. Compared with ViT-L, our backbone achieves better results with far fewer trainable parameters (408.9M vs. 1256.4M).
We apply a pose parameter decoder pre-trained on the large-scale SMPL dataset AMASS [40]. Our VQVAE is built with linear layers and takes as input the pose parameters in rotation-matrix representation, with shape 24 \(\times\) 9. The codebook contains 2048 entries and the token dimension is 256. During pre-training, we supervise the results with the reconstruction loss following [27]. We also apply an exponentially weighted average to the codebook optimization, inspired by [59]. At inference, the regressor provides 48 tokens to the decoder, which finally recovers the SMPL pose parameters \({\rm \Theta} \in \mathbb{R}^{24 \times 3}\).
For the training loss, we set \(\lambda_{\rm 2D}\), \(\lambda_{\rm 3D}\), and \(\lambda_{\rm SMPL}\) to 5.0, 2.0, and 1.0, respectively. \(\lambda_{\rm NKR}\) is set to 0.1. We speed up training with distributed training in PyTorch [60] on 8 NVIDIA GeForce RTX 4090 GPUs.
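As a shape-level illustration of the feature merging described above, the sketch below concatenates feature maps with cross-attention maps at each resolution, brings the 16 \(\times\) 16 level down to 8 \(\times\) 8, and applies a per-location linear layer that outputs 2048 channels. The channel counts, the use of average pooling to match resolutions, and the single linear projection are all assumptions made for illustration; the exact merging layers are not specified here.

```python
import numpy as np

def avg_pool2x(x):
    """Halve the spatial size of a (C, H, W) array by 2x2 average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def merge_pyramid(feat8, attn8, feat16, attn16, weight, bias):
    """Fuse two pyramid levels into one (out_channels, 8, 8) feature map."""
    level8 = np.concatenate([feat8, attn8], axis=0)                 # (C8 + A, 8, 8)
    level16 = avg_pool2x(np.concatenate([feat16, attn16], axis=0))  # (C16 + A, 8, 8)
    stacked = np.concatenate([level8, level16], axis=0)             # (C_total, 8, 8)
    # A linear layer applied independently at every spatial location.
    return np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]

rng = np.random.default_rng(0)
feat8, attn8 = rng.normal(size=(6, 8, 8)), rng.random(size=(2, 8, 8))      # hypothetical channel counts
feat16, attn16 = rng.normal(size=(4, 16, 16)), rng.random(size=(2, 16, 16))
weight, bias = rng.normal(size=(2048, 14)) * 0.01, np.zeros(2048)

merged = merge_pyramid(feat8, attn8, feat16, attn16, weight, bias)  # shape (2048, 8, 8)
```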
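For readers unfamiliar with vector quantization, the following sketch shows the generic nearest-neighbour codebook lookup used in VQ-VAE models, with the sizes quoted above (2048 codes of dimension 256, 48 tokens). It is an illustration of the general technique under those sizes, not the authors' implementation; the random codebook stands in for a learned one, and the linear encoder and decoder are omitted.

```python
import numpy as np

def quantize(tokens, codebook):
    """Map each token (T, D) to its nearest codebook entry (K, D).

    Returns the selected indices (T,) and the quantized vectors (T, D).
    """
    # Squared Euclidean distance via ||a||^2 - 2 a.b + ||b||^2, shape (T, K).
    d2 = (
        (tokens ** 2).sum(axis=1, keepdims=True)
        - 2.0 * tokens @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(2048, 256))   # stand-in for the learned codebook
tokens = rng.normal(size=(48, 256))       # stand-in for 48 continuous token features
indices, quantized = quantize(tokens, codebook)
```

Restricting the decoder input to codebook entries learned from AMASS confines predictions to plausible poses, which is the usual motivation for a discrete pose prior of this kind.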
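The weights above can be read as coefficients of a combined objective. The sketch below assumes the total loss is a plain weighted sum of the four terms, which is the common convention but is not spelled out in this section; the dictionary keys and the function name are hypothetical.

```python
# Loss weights quoted in the text; the keys are illustrative names.
LOSS_WEIGHTS = {"2d": 5.0, "3d": 2.0, "smpl": 1.0, "nkr": 0.1}

def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of individual loss terms (assumed combination rule)."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)
```

For example, `total_loss({"2d": 0.2, "3d": 0.5, "smpl": 1.0, "nkr": 3.0})` weighs the 2D re-projection term most heavily and the NKR term least.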
Backbone | Total Params. | Trainable Params. | MPJPE\(\downarrow\) |
---|---|---|---|
ResNet50 [55] | 39.7M | 39.0M | 76.1 |
HRNet-W48 [56] | 77.8M | 77.1M | 75.6 |
ViT-L [57] | 1257.2M | 1256.4M | 73.1 |
Swin-V2-L [58] | 211.2M | 210.3M | 73.5 |
DPMesh (Ours) | 1426.3M | 408.9M | 70.9 |
Robustness to noisy 2D key-points. We illustrate the prediction of DPMesh under noisy key-points compared with previous methods in Figure 7. We draw the bounding box according to the region encompassing visible key-points. Given the complexities associated with individual interactions and the absence of precise key-point hints, traditional approaches may yield false predictions. However, by incorporating a robust diffusion backbone and the Noisy Key-point Reasoning (NKR) approach, our DPMesh algorithm achieves a marked improvement in accuracy.
Comparison on OCHuman [61]. We present additional qualitative assessments on OCHuman, an in-the-wild dataset with substantial occlusion consisting of 8,110 meticulously annotated human instances across 4,731 images. Initially, we employ AlphaPose [62] as an off-the-shelf detector to obtain coarse 2D key-points. Subsequently, we estimate the mesh results with previous methods and our DPMesh. As shown in Figure 8, our DPMesh demonstrates exceptional performance in tackling challenging occlusions and complex human poses.
Comparison on CrowdPose [47]. The CrowdPose dataset comprises 8,000 images characterized by dense occlusions and complex crowd scenarios. We compare our DPMesh with 3DCrowdNet [15], which is tailored for in-the-wild crowded scenes and addresses 3D human mesh recovery issues with a joint-based regressor. As shown in Figure 9, our DPMesh proficiently estimates the shape and pose of all individuals within the view, effectively handling ambiguous person interactions and body truncations.
Conditions | 3DPW MPJPE\(\downarrow\) | 3DPW PA-MPJPE\(\downarrow\) | 3DPW-OC MPJPE\(\downarrow\) | 3DPW-OC PA-MPJPE\(\downarrow\) |
---|---|---|---|---|
skeleton map | 74.9 | 48.4 | 72.1 | 49.7 |
heatmap | 73.6 | 47.4 | 70.9 | 48.6 |
Settings | 3DPW MPJPE\(\downarrow\) | 3DPW PA-MPJPE\(\downarrow\) | 3DPW-OC MPJPE\(\downarrow\) | 3DPW-OC PA-MPJPE\(\downarrow\) |
---|---|---|---|---|
ResNet+RLL | 81.6 | 53.9 | 82.2 | 54.1 |
ResNet+CTD | 80.2 | 52.4 | 78.9 | 52.8 |
Diffusion+RLL | 75.4 | 49.0 | 72.5 | 50.0 |
Diffusion+CTD | 73.6 | 47.4 | 70.9 | 48.0 |
LoRA rank | ResBlock | MPJPE\(\downarrow\) | PA-MPJPE\(\downarrow\) |
---|---|---|---|
8 | ✗ | 73.2 | 50.1 |
64 | ✓ | 75.9 | 51.5 |
64 | ✗ | 70.9 | 48.0 |
Type of 2D spatial conditions. In conventional controllable generative models such as ControlNet [21], the input human pose guidance is typically provided as an RGB skeleton image detected by OpenPose [39]. Following ControlNet, we first extract image features from the skeleton image using convolutional layers and then concatenate the spatial feature with \({\boldsymbol{z}}_0\) as a spatial condition. As summarized in Table 6, heatmap guidance outperforms the skeleton map. We attribute this to the heatmap carrying more information than a single skeleton image, such as the correspondence between joints and heatmap channels. Furthermore, noisy key-points may lead to incorrect skeleton connections, which can degrade performance. Therefore, DPMesh uses heatmap guidance to introduce spatial conditions.
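The per-joint channel structure of heatmap guidance can be illustrated with a short sketch that renders one Gaussian heatmap per 2D key-point. The Gaussian form, the value of `sigma`, and the 32 \(\times\) 32 resolution are assumptions chosen for illustration; this section does not specify how the heatmaps are rendered.

```python
import numpy as np

def render_heatmaps(keypoints, size=32, sigma=1.5):
    """Render one Gaussian heatmap per joint.

    keypoints: (J, 2) array-like of (x, y) positions in heatmap pixel coordinates.
    Returns an array of shape (J, size, size) with a peak value of 1 at each joint.
    """
    kp = np.asarray(keypoints, dtype=float)
    ys, xs = np.mgrid[0:size, 0:size]          # ys[i, j] = i (row), xs[i, j] = j (column)
    dx = xs[None] - kp[:, 0, None, None]
    dy = ys[None] - kp[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))
```

Unlike a skeleton image, where all joints share the same RGB canvas and limbs are drawn as connections, each joint here occupies its own channel, so a misdetected joint corrupts only that channel rather than producing a wrong limb.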
Settings | MPJPE\(\downarrow\) | PA-MPJPE\(\downarrow\) | OCC-MPJPE\(\downarrow\) | OCC-PA-MPJPE\(\downarrow\) |
---|---|---|---|---|
JOTR [16] | 75.7 | 52.2 | 89.2 | 63.5 |
w/o CAM | 73.7 | 50.6 | 89.4 | 61.0 |
w/o NKR | 73.9 | 49.9 | 91.1 | 61.6 |
DPMesh (Ours) | 70.9 | 48.0 | 86.2 | 57.6 |
Implementation of different mesh regressors. To assess the efficacy of the diffusion-based backbone, we use two SMPL regressors: recurrent linear layers (RLL) derived from [2] and the cascade transformer decoder (CTD) adopted from [16]. Results are presented in Table 7. Even with vanilla recurrent linear layers, our diffusion-based feature extractor outperforms JOTR [16], which carefully designs a contrastive learning loss for its cascade transformer decoder. These findings indicate that our performance does not depend heavily on the specific regressor employed, highlighting the versatility and effectiveness of the diffusion-based backbone.
Influence of LoRA [42]. To preserve the diffusion prior within the pre-trained model and minimize computational expense, we use LoRA to fine-tune only a small number of parameters in the U-Net. We study different LoRA ranks and also unfreeze additional blocks for optimization. As shown in Table 8, a lower LoRA rank fails to fully exploit the potential of diffusion for visual perception tasks, while additionally unfreezing the ResBlocks may compromise the diffusion prior learned from extensive data. Therefore, applying LoRA matrices only to the cross-attention blocks is appropriate for this task, striking a balance between performance and computational efficiency.
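For reference, the low-rank adaptation idea can be sketched in a few lines: the pre-trained weight stays frozen and only two small matrices are trained, so the number of trainable parameters grows with the rank rather than with the full weight size. The class below is a generic NumPy illustration of LoRA on a single linear layer; the layer dimensions (320 and 768) and the initialization scale are hypothetical and are not taken from the DPMesh configuration.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                               # frozen, (out_dim, in_dim)
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))                 # trainable, zero-initialized
        self.scale = (alpha if alpha is not None else rank) / rank

    def __call__(self, x):
        # x: (N, in_dim). The low-rank path adds scale * x A^T B^T to the frozen output.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
frozen = rng.normal(size=(320, 768))   # hypothetical projection weight
layer = LoRALinear(frozen, rank=64)
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, which is one reason LoRA is well suited to preserving a pre-trained prior. In this example a rank of 64 trains 69,632 parameters instead of the 245,760 in the full weight.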
Method | 3DPW MPJPE\(\downarrow\) | 3DPW PA-MPJPE\(\downarrow\) | 3DPW-OC MPJPE\(\downarrow\) | 3DPW-OC PA-MPJPE\(\downarrow\) |
---|---|---|---|---|
HyBrIK [63] | 71.6 | 41.8 | 90.8 | 58.8 |
NIKI [64] | 71.3 | 40.6 | 85.5 | 53.5 |
DPMesh (Ours) | 68.4 | 42.8 | 70.9 | 48.0 |
Method | 3DPW MPJPE\(\downarrow\) | 3DPW PA-MPJPE\(\downarrow\) | 3DPW-PC MPJPE\(\downarrow\) | 3DPW-PC PA-MPJPE\(\downarrow\) |
---|---|---|---|---|
HMDiff [5] | 72.7 | 44.5 | 114.2 | 73.5 |
DPMesh (Ours) | 68.4 | 42.8 | 82.2 | 56.6 |
Evaluation on occluded joints. Our NKR targets noisy guidance from erroneous key-points. Because these noisy key-points cover only a small fraction of the body, they may have little effect on the overall metrics. To further assess the impact of NKR, we introduce OCC-MPJPE\(\downarrow\) and OCC-PA-MPJPE\(\downarrow\), which measure errors on occluded joints only. As shown in Table 9, both the cross-attention module and the NKR module are effective when the input key-points are occluded.
More comparisons. We compare DPMesh with additional methods. For a fair comparison, when testing on the 3DPW test split, we fine-tune our model on the 3DPW [23] training set. As shown in Table 10, DPMesh achieves competitive performance on the 3DPW test split and significantly outperforms these methods on the occlusion benchmark. Furthermore, we compare our one-step DPMesh with the step-by-step denoising framework HMDiff [5], an optimization method that takes over 200 steps to recover a human mesh. As shown in Table 11, DPMesh achieves much better results on the 3DPW-PC benchmark while also outperforming HMDiff on the 3DPW test split.
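An occlusion-restricted metric of this kind can be sketched as a masked variant of MPJPE, where only joints flagged as occluded contribute to the average. The function name and the boolean mask format are assumptions made for illustration; how occluded joints are identified is not specified in this section.

```python
import numpy as np

def occ_mpjpe(pred, gt, occluded):
    """MPJPE restricted to occluded joints.

    pred, gt: (N, J, 3) joint positions; occluded: (N, J) boolean mask.
    Returns NaN when no joint is marked as occluded.
    """
    mask = np.asarray(occluded, dtype=bool)
    if not mask.any():
        return float("nan")
    per_joint_error = np.linalg.norm(pred - gt, axis=-1)  # (N, J)
    return float(per_joint_error[mask].mean())
```

Averaging over the occluded joints only prevents the many well-observed joints from diluting errors that arise specifically where the 2D evidence is missing or wrong.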
Equal contribution. Corresponding authors.