EDTalk: Efficient Disentanglement for
Emotional Talking Head Synthesis
April 02, 2024
Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the applicability and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved and shared across different modal inputs; both aspects are often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk. We recommend visiting the project website: https://tanshuai0219.github.io/EDTalk/
Talking head animation has garnered significant research attention owing to its wide-ranging applications in education, filmmaking, virtual digital humans, and the entertainment industry [1]. While previous methods [2]–[5] have achieved notable advancements, most of them generate talking head videos in a holistic manner, lacking fine-grained individual control. Consequently, attaining precise and disentangled manipulation over various facial motions such as mouth shapes, head poses, and emotional expressions remains a challenge, crucial for crafting lifelike avatars [6]. Moreover, existing approaches typically cater to only one driving source: either audio [7], [8] or video [9], [10], thereby limiting their applicability in the multimodal context. There is a pressing need for a unified framework capable of simultaneously achieving individual facial control and handling both audio-driven and video-driven talking face generation.
To tackle the challenges, an intuition is to disentangle the entirety of facial dynamics into distinct facial latent spaces dedicated to individual components. However, it is non-trivial due to the intricate interplay among facial movements [6]. For instance, mouth shapes profoundly impact emotional expressions: one speaks happily with raised lip corners but sadly with depressed ones [11], [12]. Despite the extensive efforts made in facial disentanglement by previous studies [6], [13]–[16], we argue there exist three key limitations. (1) Overreliance on external and prior information increases the demand for data and complicates data pre-processing: One popular line [6], [13], [14] relies heavily on external audio data to decouple the mouth space via contrastive learning [17]. Subsequently, they further disentangle the pose space using predefined 6D pose coefficients extracted from 3D face reconstruction models [18]. However, such external and prior information escalates dataset demands, and any inaccuracies therein can introduce errors into the trained model. (2) Disentangling latent spaces without internal constraints leads to incomplete decoupling. Previous works [14], [16] simply constrain each space externally with a prior during the decoupling process, overlooking inter-space constraints. This oversight fails to ensure that each space exclusively handles its designated component without interference from others, leading to training complexities, reduced efficiency, and performance degradation. (3) An inefficient training strategy escalates training time and computational cost. When disentangling a new sub-space, some methods [6], [15] require training the entire heavyweight network from scratch, which incurs significantly high time and computational costs [11]. This can be costly and unaffordable for many researchers. Furthermore, most methods are unable to utilize audio and video inputs simultaneously.
To cope with these issues, this paper proposes an Efficient Disentanglement framework, tailored for one-shot talking head generation with precise control over mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Our key insight lies in our requirements for the decoupled spaces: (a) The decoupled spaces should be disjoint, meaning each space captures solely the motion of its corresponding component without interference from others. This also ensures that decoupling a new space will not affect the trained models, thereby avoiding the need to train from scratch. (b) Once the spaces are disentangled from video data to support the video-driven paradigm, they should be preserved and shared with audio inputs to further support the audio-driven setting.
To this end, drawing inspiration from the observation that the entire motion space can be represented by a set of directions [19], we innovatively disentangle the whole motion space into three distinct component-aware latent spaces. Each space is characterized by a set of learnable bases. To ensure that different latent spaces do not interfere with each other, we constrain the bases to be orthogonal to each other not only intra-space [19] but also inter-space. To accomplish the disentanglement without prior information, we introduce a progressive training strategy comprising cross-reconstruction mouth-pose disentanglement and self-reconstruction complementary learning for expression decoupling. Despite comprising two stages, our decoupling process involves training only the proposed lightweight Latent Navigation modules, keeping the weights of the other heavier modules fixed for efficient training.
To explicitly preserve the disentangled latent spaces, we store the base sets of disentangled spaces in the corresponding banks. These banks serve as repositories of prior bases essential for audio-driven talking head generation. Consequently, we introduce an Audio-to-Motion module designed to predict the weights of the mouth, pose, and expression banks, respectively. Specifically, we employ an audio encoder to synchronize lip motions with the audio input. Given the non-deterministic nature of head motions [20], we utilize normalizing flows [21] to generate probabilistic and realistic poses by sampling from a Gaussian distribution, guided by the rhythm of audio. Regarding expression, we aim to extract emotional cues from the audio [22] and transcripts. It ensures that the generated talking head video aligns with the tone and context of audio, eliminating the need for additional expression references. In this way, our EDTalk enables talking face generation directly from the sole audio input.
Our contributions are outlined as follows: 1) We present EDTalk, an efficient disentanglement framework enabling precise control over talking head synthesis concerning mouth shape, head pose, and emotional expression. 2) By introducing orthogonal bases and an efficient training strategy, we successfully achieve complete decoupling of these three spaces. Leveraging the properties of each space, we implement Audio-to-Motion modules to facilitate audio-driven talking face generation. 3) Extensive experiments demonstrate that our EDTalk surpasses the competing methods in both quantitative and qualitative evaluation.
Facial dynamics typically involve coordinated movements such as head poses, mouth shapes, and emotional expressions in a global manner [23], making their separate control challenging. Several works have been developed to address this issue. PC-AVS [16] employs contrastive learning to isolate the mouth space related to audio. Yet since similar pronunciations tend to correspond to the same mouth shape [24], the constructed negative pairs in a mini-batch often include positive pairs, and the number of negative pairs in the mini-batch is too small [25], both of which result in subpar performance. Similarly, PD-FGC [6] and TH-PAD [13] face analogous challenges in obtaining content-related mouth spaces. Although TH-PAD incorporates a lip motion decorrelation loss to extract a non-lip space, it still retains a coupled space where expressions and head poses are intertwined. This coupling results in randomly generated expressions co-occurring with head poses, compromising user-friendliness and content relevance. Despite the achievement of PD-FGC in decoupling facial details, its laborious coarse-to-fine disentanglement process consumes substantial computational resources and time. DPE [15] introduces a bidirectional cyclic training strategy to disentangle head pose and expression from talking head videos. However, it necessitates two generators to independently edit expression and pose sequentially, escalating computational resource consumption and runtime. In contrast, we propose an efficient decoupling approach to segregate faces into mouth, head pose, and expression components, readily controllable by different sources. Moreover, our method requires only a unified generator, and minimal additional resources are needed when exploring a new disentangled space.
Audio-driven talking head generation [26], [27] endeavors to animate images with accurate lip movements synchronized with input audio clips. Research in this area is predominantly categorized into two groups: intermediate representation based methods and reconstruction-based methods. Intermediate representation based methods [4], [7], [28]–[34] typically consist of two sub-modules: one predicts intermediate representations from audio, and the other synthesizes photorealistic images from these representations. For instance, Das et al.[29] employ landmarks as an intermediate representation, utilizing an audio-to-landmark module and a landmark-to-image module to connect audio inputs and video outputs. Yin et al.[5] extract 3DMM parameters [35] to warp source images using predicted flow fields. However, obtaining such intermediate representations, like landmarks and 3D models, is laborious and time-consuming. Moreover, they often offer limited facial dynamics details, and training the two sub-modules separately can accumulate errors, leading to suboptimal performance. In contrast, our approach operates within a reconstruction-based framework [2], [8], [36]–[41]. It integrates features extracted by encoders from various modalities to reconstruct talking head videos in an end-to-end manner, alleviating the aforementioned issues. A notable example is Wav2Lip [42], which employs an audio encoder, an identity encoder, and an image decoder to generate precise lip movements. Similarly, Zhou et al. [16] incorporate an additional pose encoder for free pose control, yet disregard the nondeterministic nature of natural movement. To address this, we propose employing a probabilistic model to establish a distribution of non-verbal head motions. Additionally, none of the existing methods consider facial expressions, crucial for authentic talking head generation. Our approach aims to integrate facial expressions into the model to enhance the realism and authenticity of the generated talking heads.
Emotional talking head generation is gaining traction due to its wide-ranging applications and heightened entertainment potential. On the one hand, some studies [11], [22], [43]–[47] identify emotions using discrete emotion labels, albeit facing a challenge to generate controllable and fine-grained expressions. On the other hand, recent methodologies [6], [14], [48]–[51] incorporate emotional images or videos as references to indicate desired expressions. Ji et al.[49], for instance, mask the mouth region of an emotional video and utilize the remaining upper face as an expression reference for emotional talking face generation. However, as mouth shape plays a crucial role in conveying emotion[23], they struggle to synthesize vivid expressions due to their failure to decouple expressions from the entire face. Thanks to our orthogonal base and efficient training strategy, we are capable of fully disentangling different motion spaces like mouth shape and emotional expression, thus achieving finely controlled talking head synthesis. Moreover, we also incorporate emotion contained within audio and transcripts. To the best of our knowledge, we are the first to achieve this goal—automatically inferring suitable expressions from audio tone and text, thereby generating consistent emotional talking face videos without relying on explicit image/video references.
As illustrated in Fig. 2 (a), given an identity image \(I^i\), we aim to synthesize emotional talking face image \(\hat{I}^g\) that maintains consistency in identity information, mouth shape, head pose, and emotional expression with various driving sources \(I^i\), \(I^m\), \(I^p\) and \(I^e\). Our intuition is to disentangle different facial components from the overall facial dynamics. To this end, we propose EDTalk (Sec. 3.1) with learnable orthogonal bases stored in banks \(B^*\) (\(*\) refers to the mouth source \(m\), pose source \(p\) and expression source \(e\) for simplicity), each representing a distinct direction of facial movements. To ensure the bases are component-aware, we propose an efficient disentanglement strategy (Sec. 3.2), comprising Mouth-Pose Decoupling and Expression Decoupling, which decompose the overall facial motion into mouth, pose, and expression spaces. Leveraging these disentangled spaces, we further explore an Audio-to-Motion module (Section 3.3, Figure 3) to produce audio-driven emotional talking face videos featuring probabilistic poses, audio-synchronized lip motions, and semantically-aware expressions.
Figure 2 (a) illustrates the structure of EDTalk, which is based on an autoencoder architecture consisting of an Encoder \(E\), three Component-aware Latent Navigation modules (CLNs) and a Generator \(G\). The encoder \(E\) maps the identity image \(I^i\) and the various driving sources \(I^*\) into the latent features \(f^{i \rightarrow r} = E(I^i)\) and \(f^{* \rightarrow r} = E(I^*)\). The process is inspired by FOMM [9] and LIA [19]. Instead of directly modeling the motion transformation \(f^{i\rightarrow *}\) from the identity image \(I^i\) to the driving image \(I^*\) in the latent space, we posit the existence of a canonical feature \(f^r\) that facilitates motion transfer between identity features and driving ones, expressed as \(f^{i\rightarrow *} = f^{i\rightarrow r} + f^{r\rightarrow *}\).
Thus, upon acquiring the latent features \(f^{* \rightarrow r}\) extracted by \(E\) from driving images \(I^*\), we devise three Component-aware Latent Navigation modules to transform them into \(f^{r\rightarrow *} = CLN(f^{* \rightarrow r})\). For clarity, we use pose as an example, denoted as \(*=p\). Within the Pose-aware Latent Navigation (PLN) module, we establish a pose bank \(B^p = \{b^p_1, ..., b^p_n\}\) to store \(n\) learnable bases \(b^p_i\). To ensure each base represents a distinct pose motion direction, we enforce orthogonality between every pair of bases by imposing the constraint \(\left\langle b^p_i, b^p_j \right \rangle= 0\quad (i \not= j)\), where \(\left\langle \cdot, \cdot \right \rangle\) denotes the dot product. This allows us to depict various head pose movements as linear combinations of the bases. Consequently, we design a Multi-Layer Perceptron \(MLP^p\) to predict the weights \(W^p = \{w^p_1, ..., w^p_n\}\) of the pose bases from the latent feature \(f^{p \rightarrow r}\): \[W^p = \{w^p_1, ..., w^p_n\} = MLP^p(f^{p \rightarrow r}), \qquad f^{r \rightarrow p} = \sum_{i=1}^{n} w^p_i b^p_i.\]
The Mouth- and Expression-aware Latent Navigation modules (MLN and ELN) share the same architecture as the PLN but have different parameters, from which we can similarly derive \(f^{r \rightarrow m} = \sum_{i=1}^{n} w^m_i b^m_i, W^m = MLP^m(f^{m \rightarrow r})\) and \(f^{r \rightarrow e} = \sum_{i=1}^{n} w^e_i b^e_i, W^e = MLP^e(f^{e \rightarrow r})\). It is worth noting that to achieve complete disentanglement of facial components and prevent changes in one component from affecting others, we ensure orthogonality between the three banks (\(B^m,B^p,B^e\)). This also allows us to directly combine the three features to obtain the driving feature \(f^{r \rightarrow d} = f^{r \rightarrow m}+f^{r \rightarrow p}+f^{r \rightarrow e}\). We further get \(f^{i \rightarrow d} = f^{i \rightarrow r}+f^{r \rightarrow d}\), which is subsequently fed into the Generator \(G\) to synthesize the final result \(\hat{I}^g\). To maintain identity information, \(G\) incorporates the identity features \(f^{id}\) of the identity image via skip connections. Additionally, to enhance emotional expressiveness with the assistance of the emotion feature \(f^{r\rightarrow e}\), we introduce a lightweight plug-and-play Emotion Enhancement Module (\(EEM\)), which will be discussed in the subsequent subsection. In summary, the generation process can be formulated as follows: \[\label{eq:g} \hat{I}^g = G(f^{i \rightarrow d}, f^{id}, EEM(f^{r \rightarrow e})),\tag{1}\] where \(EEM\) is exclusively utilized during emotional talking face generation. For brevity, we omit \(f^{id}\) in the subsequent equations.
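To make the bank-plus-MLP design concrete, below is a minimal PyTorch sketch of a Component-aware Latent Navigation module together with a soft orthogonality penalty over all banks. Module names, dimensions, and the penalty formulation are illustrative assumptions rather than the authors' released implementation, which enforces \(\langle b_i, b_j\rangle = 0\) directly.

```python
import torch
import torch.nn as nn

class ComponentLatentNavigation(nn.Module):
    """One CLN (e.g. the PLN): a bank of learnable bases plus an MLP that
    predicts per-base weights from the encoded latent feature."""
    def __init__(self, dim=512, num_bases=6):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_bases, dim) * 0.01)   # B^*
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_bases)
        )

    def forward(self, f_to_r):                  # f^{*->r}, shape (B, dim)
        w = self.mlp(f_to_r)                    # W^*, shape (B, num_bases)
        return w @ self.bank                    # f^{r->*} = sum_i w_i b_i

def orthogonality_penalty(banks):
    """Soft penalty pushing all bases, within and across banks, towards orthogonality."""
    all_bases = torch.cat(banks, dim=0)                     # (N_total, dim)
    gram = all_bases @ all_bases.t()
    off_diag = gram - torch.diag(torch.diag(gram))
    return (off_diag ** 2).mean()

# Usage: three banks with 20 / 6 / 10 bases for mouth / pose / expression.
mln, pln, eln = (ComponentLatentNavigation(512, n) for n in (20, 6, 10))
f_drive = mln(torch.randn(1, 512)) + pln(torch.randn(1, 512)) + eln(torch.randn(1, 512))
loss_orth = orthogonality_penalty([mln.bank, pln.bank, eln.bank])
```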
Based on the outlined framework, the crux lies in training each Component-aware Latent Navigation module to store only the bases corresponding to the motion of its respective components and to ensure no interference between different components. To achieve this, we propose an efficient disentanglement strategy comprising Mouth-Pose Decoupling and Expression Decoupling, thereby separating the overall facial dynamics into mouth, pose, and expression components.
Mouth-Pose Decoupling. As depicted at the top of Fig. 2 (b), we introduce a cross-reconstruction technique that involves synthesized images with switched mouths: \(I^{m_a}_{p_b}\) and \(I^{m_b}_{p_a}\). Here, we superimpose the mouth region of \(I^a\) onto \(I^b\) and vice versa. Subsequently, the encoder \(E\) encodes them into canonical features, which are processed through \(PLN\) and \(MLN\) to obtain the corresponding features: \[f^{p_b}, f^{m_a} = PLN(E(I^{m_a}_{p_b})), MLN(E(I^{m_a}_{p_b}))\] \[f^{p_a}, f^{m_b} = PLN(E(I^{m_b}_{p_a})), MLN(E(I^{m_b}_{p_a}))\] Next, we swap the extracted mouth features and feed them into the generator \(G\) to cross-reconstruct the original images: \(\hat{I}^b = G(f^{p_b}, f^{m_b})\) and \(\hat{I}^a = G(f^{p_a}, f^{m_a})\). Additionally, we include identity features \(f^{id}\) extracted from another frame of the same identity as input to the generator \(G\). Afterward, we supervise the Mouth-Pose Decoupling module by adopting a reconstruction loss \(\mathcal{L}_\text{rec}\), a perceptual loss \(\mathcal{L}_\text{per}\) [52], [53] and an adversarial loss \(\mathcal{L}_\text{adv}\): \[\label{eq:1} \mathcal{L}_\text{rec} = \sum_{\#={a,b}}\|I^\#-\hat{I}^\#\|_1; \qquad \mathcal{L}_\text{per} = \sum_{\#={a,b}}\|\Phi(I^\#)-\Phi(\hat{I}^\#)\|^2_2;\tag{2}\] \[\label{eq:2} \mathcal{L}_\text{adv} = \sum_{\#={a,b}}(\log D(I^\#)+\log(1-D(\hat{I}^\#))),\tag{3}\] where \(\Phi\) denotes the feature extractor of VGG19 [54] and \(D\) is a discriminator tasked with distinguishing between reconstructed images and ground truth (GT). In addition, self-reconstruction of the GT is crucial, where mouth features and pose features are extracted from the same image and then input into \(G\) to reconstruct it using \(\mathcal{L}_\text{self}\). Furthermore, we impose feature-level constraints on the network: \[\label{eq:3} \mathcal{L}_\text{fea} = \sum_{\#={a,b}}(\exp(-\mathcal{S}(f^{p_\#}, PLN(E(I^{\#})))) + \exp(-\mathcal{S}(f^{m_\#}, MLN(E(I^{\#}))))),\tag{4}\] where we extract mouth features and pose features from \(I^a\) and \(I^b\), aiming to minimize their disparity with those extracted from the synthesized mouth-switched images using the cosine similarity \(\mathcal{S}(\cdot,\cdot)\). Once the losses have converged, the parameters are no longer updated for the remainder of training, significantly reducing training time and resource consumption for subsequent stages.
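As an illustration of Eq. (4), the feature-level constraint can be written as a short helper; this is a hedged sketch that assumes batched feature tensors and treats the sum over \(\#=a,b\) as a batch mean.

```python
import torch
import torch.nn.functional as F

def feature_consistency_loss(f_pose_swap, f_mouth_swap, f_pose_gt, f_mouth_gt):
    """Eq. (4): exp(-cosine similarity) between features extracted from the
    mouth-switched images and those extracted from the original I^a, I^b."""
    sim_p = F.cosine_similarity(f_pose_swap, f_pose_gt, dim=-1)
    sim_m = F.cosine_similarity(f_mouth_swap, f_mouth_gt, dim=-1)
    return (torch.exp(-sim_p) + torch.exp(-sim_m)).mean()
```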
Expression Decoupling. As illustrated at the bottom of Fig. 2 (b), to decouple expression information from the driving image \(I^d\), we introduce the Expression-aware Latent Navigation module (\(ELN\)) and a lightweight plug-and-play Emotion Enhancement Module (\(EEM\)), both trained via self-reconstruction complementary learning. Specifically, given an identity source \(I^i\) and a driving image \(I^d\) sharing the same identity as \(I^i\) but differing in mouth shape, head pose and emotional expression, our pre-trained modules (i.e., \(E\), \(MLN\), \(PLN\), and \(G\)) from the previous stage effectively disentangle mouth shape and head pose from \(I^d\) and drive \(I^i\) to generate \(\hat{I}^g_n\) with the same mouth shape and head pose as \(I^d\) but with the same expression as \(I^i\). Therefore, to faithfully reconstruct \(I^d\) with the same expression, \(ELN\) is compelled to learn the complementary information not disentangled by \(MLN\) and \(PLN\), which is precisely the expression information. Motivated by the observation [6] that expression variation in a video sequence is typically less frequent than changes in other motions, we define a window of size \(K\) around \(I^d\) and average the \(K\) extracted expression features to obtain a clean expression feature \(f^{r\rightarrow e}\). \(f^{r\rightarrow e}\) is then combined with the extracted mouth and pose features as input to the generator \(G\). Additionally, \(EEM\) takes \(f^{r\rightarrow e}\) as input and utilizes affine transformations to produce \(f^e = (f^e_s, f^e_b)\), which controls adaptive instance normalization (AdaIN) [55] operations. The AdaIN operations further adapt the identity feature \(f^{id}\) into an emotion-conditioned feature \(f^{id}_e\) by: \[f^{id}_e := EEM(f^{id})= f^e_s \frac{f^{id} - \mu(f^{id})}{\sigma(f^{id})} + f^e_b,\] where \(\mu(\cdot)\) and \(\sigma(\cdot)\) denote the mean and standard deviation operations. Subsequently, we generate the output \(\hat{I}^g_e\) with the expression of \(I^d\) via Eq. 1 . We enforce a motion reconstruction loss [6] \(\mathcal{L}_\text{mot}\) in addition to the same reconstruction loss \(\mathcal{L}_\text{rec}\), perceptual loss \(\mathcal{L}_\text{per}\) and adversarial loss \(\mathcal{L}_\text{adv}\) as Eq. 2 and Eq. 3 : \[\mathcal{L}_\text{mot} = \|\phi(I^d)-\phi(\hat{I}^g_e)\|_2 + \|\psi(I^d)-\psi(\hat{I}^g_e)\|_2,\] where \(\phi(\cdot)\) and \(\psi(\cdot)\) denote features extracted by the 3D face reconstruction network and the emotion network of [18]. Moreover, to ensure that the synthesized image accurately mimics the mouth shape of the driving frame, we further introduce a mouth consistency loss \(\mathcal{L}_\text{m-c}\): \[\mathcal{L}_\text{m-c} = e^{-\mathcal{S}(MLN(E(\hat{I}^g_e)), MLN(E(I^{d})))},\] where \(MLN\) and \(E\) are pretrained in the previous stage. During training, we only need to train the lightweight \(ELN\) and \(EEM\), resulting in fast training.
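The AdaIN-based adaptation inside \(EEM\) can be sketched as follows in PyTorch; the affine layer shape and the assumption that \(f^{id}\) is a \((B, C, H, W)\) feature map are ours, not taken from the released code.

```python
import torch
import torch.nn as nn

class EmotionEnhancementModule(nn.Module):
    """Plug-and-play EEM: maps the expression feature f^{r->e} to AdaIN
    scale/bias (f^e_s, f^e_b) and re-normalizes the identity feature f^{id}."""
    def __init__(self, exp_dim=512, id_channels=512):
        super().__init__()
        self.affine = nn.Linear(exp_dim, 2 * id_channels)    # -> (f^e_s, f^e_b)

    def forward(self, f_id, f_exp):
        scale, bias = self.affine(f_exp).chunk(2, dim=-1)
        # instance statistics of f^{id}; assumes (B, C, H, W) identity maps
        mu = f_id.mean(dim=(2, 3), keepdim=True)
        sigma = f_id.std(dim=(2, 3), keepdim=True) + 1e-6
        scale = scale[..., None, None]
        bias = bias[..., None, None]
        return scale * (f_id - mu) / sigma + bias            # f^{id}_e
```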
After successfully training the two-stage Efficient Disentanglement module, we acquire three disentangled spaces, enabling one-shot video-driven talking face generation with separate control of identity, mouth shape, pose, and expression, given different driving sources, as illustrated in Fig. 2 (a).
Integrating the disentangled spaces, we aim to address a more appealing but challenging task: audio-driven talking face generation. In this section, depicted in Fig. 3, we introduce three modules to predict the weights of pose, mouth, and expression from audio. These modules replace the driving video input, facilitating audio-driven talking face generation.
Audio-Driven Lip Generation. Prior works [23], [51] generate facial dynamics, encompassing lip motions and expressions, in a holistic manner, which proves challenging for two main reasons: 1) Expressions, being acoustic-irrelevant motions, can impede lip synchronization [20]. 2) The absence of lip visual information hinders fine-detail synthesis at the phoneme level [56]. Thanks to the disentangled mouth space obtained in the previous stage, we naturally mitigate the influence of expression without necessitating special training strategies or loss functions like [20]. Additionally, since the decoupled space is trained during video-driven talking face generation using video as input, which offers ample visual information in the form of the mouth bases \(b^m_i\) stored in the bank \(B^m\), we eliminate the need for extra visual memory like [56]. Instead, we only need to predict the weight \(w^m_i\) of each base \(b^m_i\), which generates the fine-grained lip motion. To achieve this, we design an Audio Encoder \(E_a\), which embeds the audio feature into a latent space \(f^a = E_a(a_{1:N})\). Subsequently, a linear layer \(MLP^m_A\) is added to decode the mouth weights \(\hat{W}^m\). During training, we fix the weights of all other modules and only update \(E_a\) and \(MLP^m_A\) using the weighted sum of a feature loss \(\mathcal{L}^m_{fea}\), a reconstruction loss \(\mathcal{L}^m_{rec}\) and a sync loss \(\mathcal{L}^m_{sync}\) [42]: \[\mathcal{L}^m_{fea} = \|W^m - \hat{W}^m\|_2,\qquad \mathcal{L}^m_{rec} = \|I - \hat{I}\|_2,\] \[\mathcal{L}^m_{sync} = -\log\left(\frac{v\cdot s}{\max(\|v\|_2 \cdot \|s\|_2, \epsilon)}\right),\] where \(W^m = MLN(E(I))\) is the GT mouth weight extracted from the GT image \(I\) and \(\hat{I}\) is the image generated using Eq. 1 . \(\mathcal{L}^m_{sync}\) is introduced from [42], where \(v\) and \(s\) are extracted by the speech encoder and image encoder in SyncNet [57].
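For reference, a hedged sketch of the lip-generation objective is given below; it mirrors the formulas above, approximating the L2 norms with mean-squared errors and assuming all term weights are 1, and the SyncNet embeddings \(v\) and \(s\) are assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def sync_loss(v, s, eps=1e-6):
    """Sync loss from the equation above: -log of the (clamped) cosine
    similarity between SyncNet speech and image embeddings of shape (B, D)."""
    cos = (v * s).sum(-1) / torch.clamp(v.norm(dim=-1) * s.norm(dim=-1), min=eps)
    return -torch.log(cos.clamp(min=eps)).mean()

def lip_losses(w_m_gt, w_m_pred, img_gt, img_pred, v, s):
    """Sum of feature, reconstruction and sync terms (weights assumed to be 1)."""
    l_fea = F.mse_loss(w_m_pred, w_m_gt)
    l_rec = F.mse_loss(img_pred, img_gt)
    return l_fea + l_rec + sync_loss(v, s)
```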
Flow-Based Probabilistic Pose Generation. Due to the one-to-many nature of the mapping from input audio to head poses, learning a deterministic mapping as in previous works [28], [32], [33] always produces the same result, which brings ambiguity and inferior visual quality. To generate probabilistic and realistic head motions, we predict the pose weights \(\hat{W}^p\) using a Normalizing Flow \(\varphi_p\) [21], as illustrated in Fig. 3. During training (indicated by dashed lines), we extract pose weights \(W^p\) from videos as the ground truth and feed them into \(\varphi_p\). By applying Maximum Likelihood Estimation (MLE) in Eq. 5 , we embed them into a Gaussian distribution \(p_Z\) conditioned on the audio feature \(f^a = E_a(a_{1:N})\): \[\label{eq:mle} z_t = \varphi_p^{-1}(w^p_t, f^a_t), \qquad \mathcal{L}_\text{MLE}=-\sum_{t=0}^{N-1} \log p_\mathcal{Z}\left(z_t\right)\tag{5}\] As the normalizing flow \(\varphi_p\) is bijective, we reconstruct the pose weights \(\hat{W}^p = \varphi_p(z, f^a_t)\) and utilize a pose reconstruction loss \(\mathcal{L}^p_\text{rec}\) along with a temporal loss \(\mathcal{L}^p_\text{tem}\) to constrain \(\varphi_p\): \[\label{eq:pose} \mathcal{L}^p_\text{rec} = \|W^p-\hat{W}^p\|_2, \qquad \mathcal{L}^p_\text{tem}=\frac{1}{N-1}\sum_{t=1}^{N-1}\|(w^p_t - w^p_{t-1}) - (\hat{w}^p_t - \hat{w}^p_{t-1})\|_2\tag{6}\] During inference, we randomly sample \(\hat{z}\) from the constructed distribution \(p_{Z}\) and then generate the pose weights \(\hat{W}^p = \varphi_p(\hat{z}, f^a_t)\). This process ensures the diversity of head motions while maintaining consistency with the audio rhythm.
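A minimal sketch of the pose-flow objective follows, assuming a bijective flow object that exposes forward and inverse calls conditioned on the audio feature (built, e.g., from the flow steps detailed in the appendix); following Eq. 5 as written, the log-determinant term of the flow is omitted.

```python
import math
import torch

def flow_pose_losses(flow, w_pose, f_audio):
    """MLE (Eq. 5) plus reconstruction and temporal terms (Eq. 6).
    w_pose: ground-truth pose weights of shape (B, N, D)."""
    z = flow.inverse(w_pose, f_audio)                      # W^p -> z
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(-1)
    l_mle = -log_pz.mean()                                 # standard-Gaussian prior

    w_hat = flow(z, f_audio)                               # reconstruct W^p
    l_rec = (w_pose - w_hat).pow(2).mean()
    l_tem = ((w_pose[:, 1:] - w_pose[:, :-1])
             - (w_hat[:, 1:] - w_hat[:, :-1])).pow(2).mean()
    return l_mle + l_rec + l_tem

# Inference sketch: sample z ~ N(0, I) and decode audio-conditioned poses.
# z = torch.randn(batch, num_frames, dim); w_pose = flow(z, f_audio)
```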
Semantically-Aware Expression Generation. As finding videos with a desired expression may not always be feasible, potentially limiting applicability [58], we aim to exploit the emotion contained in the audio and transcript with the aid of the introduced Semantics Encoder \(E_S\) and Text Encoder \(E_T\). Inspired by [59], our Semantics Encoder \(E_S\) is built upon the pretrained HuBERT model [60], which consists of a CNN-based feature encoder and a transformer-based encoder. We freeze the CNN-based feature encoder and only fine-tune the transformer blocks. The Text Encoder \(E_T\) is inherited from the pretrained Emoberta [61], which encodes the overarching emotional context embedded within textual descriptions. We concatenate the embeddings generated by \(E_S\) and \(E_T\) and feed them into an \(MLP^e_A\) to generate the expression weights \(\hat{W}^e\). Since audio or text may not inherently contain emotion during inference, such as in TTS-generated speech, we randomly mask (\(\mathcal{M}\)) one modality with probability \(p\) during training, inspired by HuBERT, in order to support the prediction of emotion from a single modality: \[\label{eq6} \hat{W}^e=\begin{cases} MLP^e_A(E_S(a), E_T(T)), & 0.5 \le p \le 1, \\ MLP^e_A(\mathcal{M}(E_S(a)), E_T(T)), & 0.25 \le p < 0.5, \\ MLP^e_A(E_S(a), \mathcal{M}(E_T(T))), & 0 \le p < 0.25. \end{cases}\tag{7}\]
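A possible implementation of the random modality masking in Eq. (7) is sketched below; the paper does not specify the mask operator \(\mathcal{M}\), so zeroing the embedding is an assumption.

```python
import torch

def predict_expression_weights(mlp_e, enc_s, enc_t, audio, text, training=True):
    """Eq. (7): with probability 0.25 each, mask the audio or the text branch so
    the model learns to infer expression weights from a single modality."""
    f_a, f_t = enc_s(audio), enc_t(text)
    if training:
        p = torch.rand(()).item()
        if 0.25 <= p < 0.5:
            f_a = torch.zeros_like(f_a)   # M(E_S(a)): mask the audio embedding
        elif p < 0.25:
            f_t = torch.zeros_like(f_t)   # M(E_T(T)): mask the text embedding
    return mlp_e(torch.cat([f_a, f_t], dim=-1))
```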
We employ \(\mathcal{L}_\text{exp} = \|W^e - \hat{W}^e\|_1\) to encourage \(\hat{W}^e\) to be close to the weight \(W^e\) generated by the pretrained \(ELN\) from emotional frames. With this, we are able to generate probabilistic, semantically-aware talking head videos solely from an identity image and the driving audio.
Implementation Details. Our model is trained and evaluated on the MEAD [45] and HDTF [62] datasets. Additionally, we report results on additional datasets, including LRW [63] and Voxceleb2 [64], for further assessment of our method in the supplementary. All video frames are cropped following FOMM [9] and resized to \(256\times 256\). Our method is implemented in PyTorch and trained using the Adam optimizer on 2 NVIDIA GeForce RTX 3090 GPUs. The dimension of the latent code \(f^{*\rightarrow r}\) and the bases \(b^*\) is set to 512, and the numbers of bases in \(B^m\), \(B^p\) and \(B^e\) are set to 20, 6 and 10, respectively. The weight for \(\mathcal{L}_\text{mot}\) is set to 10 and the remaining weights are set to 1.
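For quick reference, the hyper-parameters above can be collected into a single configuration sketch; the dictionary keys are illustrative and do not correspond to the released code.

```python
# Illustrative summary of the hyper-parameters stated above (names are ours).
EDTALK_CONFIG = dict(
    image_size=256,                                   # crops follow FOMM
    latent_dim=512,                                   # dim of f^{*->r} and bases b^*
    num_bases=dict(mouth=20, pose=6, expression=10),  # sizes of B^m, B^p, B^e
    loss_weights=dict(mot=10.0, default=1.0),
    optimizer="Adam",
)
```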
Method | MEAD [45] PSNR\(\uparrow\) | SSIM\(\uparrow\) | M/F-LMD\(\downarrow\) | FID\(\downarrow\) | \(\text{Sync}_\text{conf}\uparrow\) | \(\text{Acc}_\text{emo}\uparrow\) | HDTF [62] PSNR\(\uparrow\) | SSIM\(\uparrow\) | M/F-LMD\(\downarrow\) | FID\(\downarrow\) | \(\text{Sync}_\text{conf}\uparrow\)
---|---|---|---|---|---|---|---|---|---|---|---
MakeItTalk [28] | 19.442 | 0.614 | 2.541/2.309 | 37.917 | 5.176 | 14.64 | 21.985 | 0.709 | 2.395/2.182 | 18.730 | 4.753 |
Wav2Lip [42] | 19.875 | 0.633 | 1.438/2.138 | 44.510 | 8.774 | 13.69 | 22.323 | 0.727 | 1.759/2.002 | 22.397 | 9.032 |
Audio2Head [32] | 18.764 | 0.586 | 2.053/2.293 | 27.236 | 6.494 | 16.35 | 21.608 | 0.702 | 1.983/2.060 | 29.385 | 7.076 |
PC-AVS [16] | 16.120 | 0.458 | 2.649/4.350 | 38.679 | 7.337 | 12.12 | 22.995 | 0.705 | 2.019/1.785 | 26.042 | 8.482 |
AVCT [33] | 17.848 | 0.556 | 2.870/3.160 | 37.248 | 4.895 | 13.13 | 20.484 | 0.663 | 2.360/2.679 | 19.066 | 5.661 |
SadTalker [20] | 19.042 | 0.606 | 2.038/2.335 | 39.308 | 7.065 | 14.25 | 21.701 | 0.702 | 1.995/2.147 | 14.261 | 7.414 |
IP-LAP [31] | 19.832 | 0.627 | 2.140/2.116 | 46.502 | 4.156 | 17.34 | 22.615 | 0.731 | 1.951/1.938 | 19.281 | 3.456 |
TalkLip [65] | 19.492 | 0.623 | 1.951/2.204 | 41.066 | 5.724 | 14.00 | 22.241 | 0.730 | 1.976/1.937 | 23.850 | 1.076 |
EAMM [49] | 18.867 | 0.610 | 2.543/2.413 | 31.268 | 1.762 | 31.08 | 19.866 | 0.626 | 2.910/2.937 | 41.200 | 4.445 |
StyleTalk [51] | 21.601 | 0.714 | 1.800/1.422 | 24.774 | 3.553 | 63.49 | 21.319 | 0.692 | 2.324/2.330 | 17.053 | 2.629 |
PD-FGC [6] | 21.520 | 0.686 | 1.571/1.318 | 30.240 | 6.239 | 44.86 | 23.142 | 0.710 | 1.626/1.497 | 25.340 | 7.171 |
EMMN [23] | 17.120 | 0.540 | 2.525/2.814 | 28.640 | 5.489 | 48.64 | 18.236 | 0.596 | 2.795/3.368 | 36.470 | 5.793 |
EAT [11] | 20.007 | 0.652 | 1.750/1.668 | 21.465 | 7.984 | 64.40 | 22.076 | 0.719 | 2.176/1.781 | 28.759 | 7.493 |
EDTalk-A | 21.628 | 0.722 | 1.537/1.290 | 17.698 | 8.115 | 67.32 | 25.156 | 0.811 | 1.676/1.315 | 13.785 | 7.642 |
EDTalk-V | 22.771 | 0.769 | 1.102/1.060 | 15.548 | 6.889 | 68.85 | 26.504 | 0.845 | 1.197/1.111 | 13.172 | 6.732 |
GT | 1.000 | 1.000 | 0.000/0.000 | 0.000 | 7.364 | 79.65 | 1.000 | 1.000 | 0.000/0.000 | 0.000 | 7.721 |
Comparison Setting. We compare our method with: (a) emotion-agnostic talking face generation methods: MakeItTalk [28], Wav2Lip [42], Audio2Head [32], PC-AVS [16], AVCT [33], SadTalker [20], IP-LAP [31], TalkLip [65]. (b) Emotional talking face generation methods: EAMM [49], StyleTalk [51], PD-FGC [6], EMMN [23], EAT [11], EmoGen [66]. Different from previous work, EDTalk encapsulates the entire face generation process without any other sources (e.g. poses [11], [49], 3DMM [5], [51], phoneme [33], [51]) or pre-processing operations during inference, which facilitates its application. We evaluate our model in both the audio-driven setting (EDTalk-A) and the video-driven setting (EDTalk-V) w.r.t. (i) generated video quality using PSNR, SSIM [67] and FID [68]; (ii) audio-visual synchronization using the Landmark Distance on the Mouth (M-LMD) [7] and the confidence score of SyncNet [57]; (iii) emotional accuracy using \(\text{Acc}_\text{emo}\) calculated by the pretrained Emotion-Fan [69] and the Landmark Distance on the Face (F-LMD). Partial results are moved to the Appendix (Fig. 8 and Tab. 4) due to limited space.
Quantitative Results. The quantitative results are presented in Tab. 1, where our EDTalk-A and EDTalk-V achieve the best performance across most metrics, except \(\text{Sync}_\text{conf}\). Wav2Lip pretrains its SyncNet discriminator on a large dataset [70], which might lead the model to prioritize achieving a higher \(\text{Sync}_\text{conf}\) over optimizing visual performance. This is evident in the blurry mouths generated by Wav2Lip and its inferior M-LMD score compared to our method.
Qualitative Results. Fig. 4 demonstrates a comparison of visual results. TalkLip and IP-LAP struggle to generate accurate lip motions. Despite the elevated lip synchronization of SadTalker, it can only produce slight lip motions with a nearly closed mouth and is also troubled by jitter between frames. StyleHEAT generates accurate mouth shapes when driven by the mouth GT video instead of audio, but suffers from incorrect head pose and identity loss. This issue also plagues EmoGen, EAMM and PD-FGC. Besides, EmoGen and EAMM fail to perform the desired expression. Due to its discrete emotion input, EAT cannot synthesize fine-grained expressions such as the narrowed eyes performed by the expression reference. In the case of "happy", unexpected closed eyes and weird teeth are observed in EAT and PD-FGC, respectively. In contrast, both EDTalk-A and EDTalk-V excel in producing realistic expressions, precise lip synchronization and correct head poses.
Efficiency analysis. Our approach is highly efficient in terms of the training time, data and computational resources required to decouple the spaces. In the mouth-pose decoupling stage, we solely utilize the HDTF dataset, containing 15.8 hours of video, for the decoupling. Training with a batch size of 4 on two 3090 GPUs for 4k iterations achieves state-of-the-art performance, which takes about one hour. In contrast, DPE is trained on the VoxCeleb dataset, which comprises 351 hours of video, for 100K iterations initially, then an additional 50K iterations with a batch size of 32 on 8 V100 GPUs, which takes over 2 days. Besides, it needs to train two task-specific generators for expression and pose. Similarly, PD-FGC takes 2 days on 4 Tesla V100 GPUs for lip decoupling, and another 2 days on 4 Tesla V100 GPUs for pose decoupling. This significantly exceeds our computational resources and training time. In the expression decoupling stage, we train our model on the MEAD and HDTF datasets (54.8 hours of video in total) for 6 hours. By contrast, PD-FGC decouples the expression space on the Voxceleb2 dataset (2400 hours) with a decorrelation loss for 2 weeks. The visualization in Fig. 5 allows a more intuitive comparison of the differences between the methods concerning required training time, training data, and computational resources.
Metric/Method | TalkLip | IP-LAP | EAMM | EAT | EDTalk | GT |
---|---|---|---|---|---|---|
Lip-sync | 3.31 | 3.42 | 3.49 | 3.85 | 4.13 | 4.74 |
Realness | 3.14 | 3.13 | 3.26 | 3.75 | 4.92 | 4.81 |
\(\text{Acc}_\text{emo}\) (%) | 19.7 | 17.6 | 44.3 | 59.7 | 64.5 | 75.6 |
User Study. We conduct a user study to evaluate the human likeness of our method. We generate 10 videos for each method and invite 20 participants (10 males, 10 females) to score them from 1 (worst) to 5 (best) in terms of lip synchronization, realness, and emotion classification. The average scores reported in Tab. 2 demonstrate that our method achieves the best performance in all aspects.
Latent space. To analyze the contributions of our key designs to obtaining the disentangled latent spaces, we conduct an ablation study with two variants: (1) removing the base banks (w/o Bank); (2) removing the orthogonality constraint (w/o Orthogonal). Fig. 6 presents our ablation study results in the video-driven and audio-driven settings, respectively. Since w/o Bank struggles to decouple the different latent spaces, the expression-only setting (only exp) fails to extract the emotional expression. Additionally, without the visual information stored in the banks, the quality of the generated full frame is poor. Although w/o Orthogonal improves the image quality through the vision-rich banks, the lack of orthogonality constraints on the bases causes the spaces to interfere with one another, resulting in less pronounced generated emotions. The Full Model achieves the best performance in both aspects. The quantitative results in Tab. 3 also validate the effectiveness of each component.
Method/Metric | PSNR\(\uparrow\) | SSIM\(\uparrow\) | M/F-LMD\(\downarrow\) | FID\(\downarrow\) | \(\text{Sync}_\text{conf}\uparrow\) | \(\text{Acc}_\text{emo}\uparrow\) |
---|---|---|---|---|---|---|
w/o \(\mathcal{L}_{fea}\) | 21.134 | 0.713 | 1.914/1.625 | 28.053 | 5.601 | 54.34 |
w/o \(\mathcal{L}_{self}\) | 20.913 | 0.707 | 1.815/1.629 | 29.314 | 5.030 | 44.23 |
w/o \(\mathcal{L}^m_{rec}\) | 21.955 | 0.744 | 1.666/1.397 | 18.528 | 5.447 | 67.19 |
w/o \(\mathcal{L}^m_{sync}\) | 21.524 | 0.728 | 1.626/1.349 | 17.844 | 4.007 | 61.29 |
w/o Orthogonal | 21.429 | 0.711 | 1.687/1.320 | 17.820 | 4.398 | 38.71 |
w/o Bank | 20.302 | 0.660 | 2.137/1.711 | 26.842 | 2.316 | 9.677 |
w/o \(EEM\) | 20.731 | 0.673 | 2.131/1.927 | 27.135 | 7.326 | 49.367 |
only lip | 19.799 | 0.639 | 1.767/1.920 | 31.918 | 8.291 | 15.13 |
lip+pose | 21.519 | 0.695 | 1.645/1.378 | 19.571 | 8.474 | 16.75 |
Full Model | 21.628 | 0.722 | 1.537/1.290 | 17.698 | 8.115 | 67.32 |
Loss functions. We further explore the effects of different loss functions on the MEAD dataset. The results in Tab. 3 indicate that \(\mathcal{L}_\text{fea}\) and \(\mathcal{L}_\text{self}\) contribute to more disentangled spaces, while \(\mathcal{L}^{m}_\text{rec}\) and \(\mathcal{L}^{m}_\text{sync}\) lead to more accurate lip synchronization. Notably, the Full Model shows a reduction in \(\text{Sync}_\text{conf}\) compared to only lip and lip+pose, suggesting a trade-off between lip-sync accuracy and emotion performance. In this work, we sacrifice slight lip-sync accuracy to enhance expression.
This paper introduces EDTalk, a novel system designed to efficiently disentangle facial components into latent spaces, enabling fine-grained control for talking head synthesis. The core insight is to represent each space with orthogonal bases stored in dedicated banks. We propose an efficient training strategy that autonomously allocates spatial information to each space, eliminating the necessity for external or prior structures. By integrating these spaces, we enable audio-driven talking head generation through a lightweight Audio-to-Motion module. Experiments showcase the superiority of our method in achieving disentangled and precise control over diverse facial motions. We provide more discussion about the limitations and ethical considerations in the Appendix.
In the main paper, we introduce an innovative framework designed to produce emotional talking face videos, which enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on both video and audio inputs. This appendix delves deeper into: 1) Implementation Details. 2) Additional Experimental Results. 3) Discussion. In addition, we highly encourage viewing the Supplementary Video: https://tanshuai0219.github.io/EDTalk/.
We utilize the same structure for the Generator \(G\) as in LIA [19]. We recommend consulting their original paper for further elaboration. Here, we delineate the details of the other network architectures depicted in Fig. 7.
The Encoder \(E\) projects the identity source \(I^i\) and driving source \(I^*\) into the identity feature \(f^{id}\) and the latent features \(f^{i \rightarrow r}\), \(f^{* \rightarrow r}\). It comprises several convolutional neural networks (CNNs) and ResBlocks. The outputs of the ResBlocks serve as the identity feature \(f^{id}\), which is fed into the Generator \(G\) to enrich identity information through skip connections. Subsequently, four multi-layer perceptrons (MLPs) are employed to generate the latent features \(f^{i \rightarrow r}\), \(f^{* \rightarrow r}\).
To achieve efficient training and inference, these four modules are implemented with four simple MLPs.
The Audio Encoder \(E_a\) takes audio feature sequences \(a_{1:T}\) as input. These sequences are passed through a series of convolutional layers to produce the audio features \(f^a_{1:N}\).
The Normalizing Flow \(\varphi_p\) comprises \(K\) flow steps, each consisting of actnorm, an invertible convolution and an affine coupling layer. Initially, given the mean \(\mu\) and standard deviation \(\delta\) of the weights \(W^p\) of the pose bank \(B^p\), actnorm is implemented as an affine transformation \(h' = \frac{W^p-\mu}{\delta}\). Subsequently, \(\varphi_p\) applies an invertible \(1 \times 1\) convolution layer, \(h'' = \mathbf{W}\cdot h'\), to mix the channel variables. Following this, we utilize a transformer-based coupling layer \(\mathcal{F}\) to derive \(z\) from \(h''\) and \(f^a_{1:N}\). Specifically, we split \(h''\) into \(h''_{h1}\) and \(h''_{h2}\), where \(h''_{h2}\) undergoes an affine transformation by \(\mathcal{F}\) conditioned on \(h''_{h1}\): \(t,s = \mathcal{F}(h''_{h1},f^a_{1:N}); h = (h''_{h2}+t)\odot s,\) where \(t\) and \(s\) represent the transformation parameters. Thanks to the unchanged \(h''_{h1}\), tractability is easily maintained in reverse. In summary, we can map \(W^p\) into the latent code \(z\) and predict the weights \(\hat{W}^p\) from a sampled code \(\hat{z} \in p_\mathcal{Z}\) as follows: \[z = \varphi_p^{-1}(W^p,f^a_{1:N})\] \[\label{eq:reverse} \hat{W}^p = \varphi_p(\hat{z},f^a_{1:N})\tag{8}\]
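A hedged PyTorch sketch of one such flow step is given below; the coupling network is a simple MLP stand-in for the transformer-based layer \(\mathcal{F}\), the dimensions are assumed, and the inverse pass is omitted for brevity.

```python
import torch
import torch.nn as nn

class FlowStep(nn.Module):
    """One pose-flow step: actnorm, invertible 1x1 convolution and an
    audio-conditioned affine coupling (forward direction W^p -> z only)."""
    def __init__(self, dim, audio_dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))          # actnorm statistics
        self.log_sigma = nn.Parameter(torch.zeros(dim))
        w, _ = torch.linalg.qr(torch.randn(dim, dim))     # random orthogonal init
        self.weight = nn.Parameter(w)                     # invertible 1x1 "conv"
        self.coupling = nn.Sequential(                    # stand-in for F(h1, f^a)
            nn.Linear(dim // 2 + audio_dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, h, f_audio):                        # h: (B, dim), dim even
        h = (h - self.mu) * torch.exp(-self.log_sigma)    # actnorm
        h = h @ self.weight                               # channel mixing
        h1, h2 = h.chunk(2, dim=-1)
        t, log_s = self.coupling(torch.cat([h1, f_audio], dim=-1)).chunk(2, dim=-1)
        h2 = (h2 + t) * torch.exp(log_s)                  # affine coupling on h2
        return torch.cat([h1, h2], dim=-1)
```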
MEAD entails 60 speakers, with 43 speakers accessible, delivering 30 sentences expressing eight emotions at three varying intensity levels in a laboratory setting. Consistent with prior studies [11], [49], we designate videos featuring speakers identified as ‘M003,’ ‘M030,’ ‘W009,’ and ‘W015’ for testing, while the videos of the remaining speakers are allocated for training.
The videos of the HDTF dataset are collected from YouTube, renowned for their high quality, high definition content, featuring over 300 distinct identities. To facilitate training and testing, we partition the dataset using an 8:2 ratio based on speaker identities, allocating 80% for training and 20% for testing.
Voxceleb2 [64] is a large-scale talking head dataset, boasting over 1 million utterances from 6,112 celebrities. It’s important to note that we solely utilize Voxceleb2 for evaluation purposes, selecting 200 videos randomly from its extensive collection.
LRW [63] is a word-level dataset comprising more than 1000 utterances encompassing 500 distinct words. For evaluation, we randomly select 500 videos from the dataset.
For video preprocessing, we crop the faces and resize the cropped videos to a resolution of \(256 \times 256\) for training and testing, following FOMM [9]. Adhering to Wav2Lip [42], audio is down-sampled to 16 kHz and transformed into mel-spectrograms using an FFT window size of 800, a hop length of 200, and 80 Mel filter banks. During evaluation, for datasets without emotion labels, we utilize the first frame of each video as the source image and the corresponding audio as the driving audio to generate talking head videos. For emotional videos sourced from MEAD, we use the video itself as the expression reference. We select a frame with a ‘Neutral’ emotion from the same speaker as the source image for emotional talking head synthesis.
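The audio pipeline can be reproduced roughly as follows; this is a sketch using librosa with the parameters stated above, and the log compression step is an assumption (Wav2Lip applies additional dB conversion and normalization).

```python
import librosa
import numpy as np

def audio_to_mel(path):
    """Load audio at 16 kHz and convert it to an 80-bin mel-spectrogram
    using an FFT window of 800 and a hop length of 200."""
    wav, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=800, hop_length=200, n_mels=80)
    return np.log(mel + 1e-6)    # simple log compression (assumed)
```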
Method | Voxceleb2 [64] PSNR\(\uparrow\) | SSIM\(\uparrow\) | M-LMD\(\downarrow\) | F-LMD\(\downarrow\) | \(\text{Sync}_\text{conf}\uparrow\) | LRW [63] PSNR\(\uparrow\) | SSIM\(\uparrow\) | M-LMD\(\downarrow\) | F-LMD\(\downarrow\) | \(\text{Sync}_\text{conf}\uparrow\)
---|---|---|---|---|---|---|---|---|---|---
MakeItTalk [28] | 20.526 | 0.706 | 2.435 | 2.380 | 3.896 | 22.334 | 0.729 | 2.099 | 1.960 | 3.137 |
Wav2Lip [42] | 20.760 | 0.723 | 2.143 | 2.182 | 8.680 | 23.299 | 0.764 | 1.699 | 1.703 | 7.545 |
Audio2Head [32] | 17.344 | 0.577 | 3.651 | 3.712 | 5.541 | 18.703 | 0.601 | 2.866 | 3.435 | 5.428 |
PC-AVS [16] | 21.643 | 0.720 | 2.088 | 1.830 | 7.928 | 16.744 | 0.509 | 5.603 | 4.691 | 3.622 |
AVCT [33] | 18.751 | 0.645 | 2.739 | 3.062 | 4.238 | 21.188 | 0.689 | 2.290 | 2.395 | 3.927 |
SadTalker [20] | 20.278 | 0.700 | 2.252 | 2.388 | 6.356 | - | - | - | - | - |
IP-LAP [31] | 20.955 | 0.724 | 2.125 | 2.154 | 3.295 | 23.727 | 0.770 | 1.779 | 1.683 | 3.027
TalkLip [65] | 20.633 | 0.723 | 2.084 | 2.191 | 6.520 | 22.706 | 0.751 | 1.803 | 1.770 | 6.021 |
EAMM [49] | 17.038 | 0.562 | 4.172 | 4.163 | 3.815 | 18.643 | 0.607 | 3.593 | 3.773 | 3.414 |
StyleTalk [51] | 21.112 | 0.722 | 2.113 | 2.136 | 2.120 | 21.283 | 0.705 | 2.394 | 2.142 | 2.430 |
PD-FGC [6] | 22.110 | 0.729 | 1.743 | 1.630 | 6.686 | 22.481 | 0.711 | 1.576 | 1.534 | 6.119 |
EAT [11] | 20.370 | 0.689 | 2.586 | 2.383 | 6.864 | 21.384 | 0.704 | 2.128 | 1.927 | 6.630 |
EDTalk-A | 22.107 | 0.763 | 1.851 | 1.608 | 6.591 | 23.409 | 0.779 | 1.729 | 1.379 | 6.914 |
EDTalk-V | 22.133 | 0.764 | 1.829 | 1.583 | 6.155 | 24.574 | 0.823 | 1.202 | 1.139 | 6.027 |
GT | 1.000 | 1.000 | 0.000 | 0.000 | 6.808 | 1.000 | 1.000 | 0.000 | 0.000 | 6.952
The encoder \(E\) and generator \(G\) are pre-trained in a similar setting as LIA [19]. Subsequently, we freeze the weights of the encoder \(E\) and generator \(G\), focusing solely on training the Mouth-Pose Decoupling Module. In this stage, our model is trained exclusively on the emotion-agnostic HDTF dataset, where videos consistently exhibit a ‘Neutral’ emotion alongside diverse head poses. This ensures that the Mouth-Pose Decoupling Module concentrates solely on variations in head pose and mouth shape, avoiding the encoding of expression-related information. All loss function weights are set to 1. The training process typically requires approximately one hour, employing a batch size of 4 and a learning rate of 2e-3, executed on 2 NVIDIA GeForce RTX 3090 GPUs with 24GB memory. Once the Mouth-Pose Decoupling Module is trained, we freeze all trained parameters and solely update the expression-related modules, including \(MLP^e\), the expression bases \(B^e\), and the Emotion Enhancement Module \(EEM\), utilizing both the MEAD and HDTF datasets. This stage typically takes around 6 hours, employing a batch size of 10 and a learning rate of 2e-3, conducted on 2 NVIDIA GeForce RTX 3090 GPUs with 24GB memory. We train our Audio-to-Lip model on the HDTF dataset for 30k iterations with a batch size of 4, requiring approximately 7 hours of computation on 2 NVIDIA GeForce RTX 3090 GPUs with 24GB memory. The Audio-to-Pose model is trained on the HDTF dataset for one hour.
Apart from the quantitative assessments conducted on the MEAD and HDTF datasets, as detailed in the main paper, we present additional quantitative comparisons on Voxceleb2 [64] and LRW [63]. The comparison results outlined in Tab. 4 demonstrate that our method outperforms state-of-the-art approaches in both the audio-driven (EDTalk-A) and video-driven (EDTalk-V) scenarios across various metrics. We offer a plausible explanation for the superior \(\text{Sync}_\text{conf}\) achieved by Wav2Lip [42] in the main paper. IP-LAP [31] merely alters the mouth shape of the source image while maintaining the same head pose and expression, hence achieving a higher PSNR score. PD-FGC [6] attains superior M-LMD performance by training on Voxceleb2, a dataset comprising over 1 million utterances from 6,112 celebrities, totaling 2400 hours of data, which is hundreds of times larger than our dataset (15.8 hours). Nevertheless, we still outperform PD-FGC in terms of F-LMD. SadTalker [20] struggles to process audio clips as short as one second, and therefore fails to generate talking face videos on the LRW dataset, where all videos are one second long.
In addition to the state-of-the-art (SOTA) methods discussed in the main paper, we extend our comparative analysis to include both emotion-agnostic talking face generation methods: MakeItTalk [28], Wav2Lip [42], Audio2Head [32], AVCT [33], and PC-AVS [16], as well as emotional talking face generation methods: StyleTalk [51] and EMMN [23]. The comprehensive qualitative results can be found in Fig. 8, serving as a supplement to the data previously presented in Fig. 4 of the main paper. We further conduct comparison experiments with several SOTA talking face generation methods, including GC-AVT [14], EVP [22], ECG [71] and DiffTalk [41]. However, due to the unavailability of code and pre-trained models for these methods (except EVP), we can only extract video clips from the provided demo videos for comparison. The results are demonstrated in Fig. 9. Specifically, EVP and ECG are emotional talking face generation methods that utilize one-hot labels for emotional guidance, with EVP being a person-specific model and ECG being a one-shot method. Our method outperforms these methods in terms of emotional expression, while the teeth generated by ECG lead to slightly unrealistic results. GC-AVT aims to mimic emotional expressions and generate accurate lip motions synchronized with the input speech, resembling the setting of our EDTalk. However, compared to EDTalk, GC-AVT struggles to preserve the reference identity, resulting in significant identity loss. DiffTalk is hindered by severe mouth jitter, which is more evident in the Supplementary Video.
Method/Metric | PSNR\(\uparrow\) | SSIM\(\uparrow\) | LPIPS\(\downarrow\) | \(\mathcal{L}_1 \downarrow\) | AKD\(\downarrow\) | AED\(\downarrow\) |
---|---|---|---|---|---|---|
PIRenderer [72] | 22.13 | 0.72 | 0.22 | 0.053 | 2.24 | 0.032 |
OSFV [73] | 23.29 | 0.74 | 0.17 | 0.037 | 1.83 | 0.025 |
LIA [19] | 24.75 | 0.77 | 0.16 | 0.036 | 1.88 | 0.019 |
DaGAN [10] | 23.21 | 0.74 | 0.16 | 0.041 | 1.93 | 0.023 |
MCNET [74] | 21.74 | 0.69 | 0.26 | 0.057 | 2.05 | 0.037 |
StyleHEAT [5] | 22.15 | 0.65 | 0.25 | 0.075 | 2.95 | 0.045 |
VPGC [75] | - | - | - | - | - | - |
EDTalk | 26.5 | 0.85 | 0.13 | 0.031 | 1.74 | 0.017 |
We perform a comparative analysis with state-of-the-art face reenactment methods, including PIRenderer [72], OSFV [73], LIA [19], DaGAN [10], MCNET [74], StyleHEAT [5], and VPGC [75], where VPGC is a person-specific model. Given that the compared methods are not specifically trained on emotional datasets, we conduct comparisons using videos with and without emotion, the results of which are presented in the Supplementary Video (4:07-4:50). Our method demonstrates superior performance in terms of face reenactment.
We additionally offer extensive quantitative comparisons regarding: (1) generated video quality assessed through PSNR and SSIM; (2) reconstruction faithfulness evaluated using LPIPS and the \(\mathcal{L}_1\) norm; (3) semantic consistency measured by the average keypoint distance (AKD) and average Euclidean distance (AED). The quantitative results on the HDTF dataset are outlined in Tab. 5, showcasing the superior performance of our EDTalk method. Note that since VPGC is a person-specific model, it cannot be generalized to the identities in the HDTF dataset.
Our method demonstrates robustness across out-of-domain portraits, encompassing real human subjects, paintings, sculptures, and images generated by Stable Diffusion [76]. Moreover, our approach exhibits generalizability to various audio inputs, including songs, diverse languages (English, French, German, Italian, Japanese, Korean, Spanish, Chinese), and noisy audio. Please refer to the Supplementary Video (5:40-8:40) for the better visualization of these results.
We accomplish expression manipulation by interpolating between expression weights \(W^e\) of the expression bank \(B^e\), which are extracted from any two distinct expression reference clips, using the following equation: \[W^e = \alpha W^e_1+(1-\alpha)W^e_2,\] where \(W^e_1\) and \(W^e_2\) represent expression weights extracted from two emotional clips, while \(\alpha\) denotes the interpolation weight. Fig. 10 illustrates an example of expression manipulation generated by our EDTalk. In this example, we successfully transition from Expression 1 to Expression 2 by varying the interpolation weight \(\alpha\). This demonstrates the effectiveness of our \(ELN\) module in accurately capturing the expression of the provided clip, as discussed in the main paper.
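The interpolation itself amounts to a one-line blend of the two weight vectors, sketched below for completeness (the tensor shapes are assumed to match the 10 expression bases).

```python
import torch

def interpolate_expression(w_e1, w_e2, alpha):
    """Blend two ELN-extracted expression weight vectors per the equation above."""
    return alpha * w_e1 + (1.0 - alpha) * w_e2

# Sweep alpha from 0 to 1 to morph from Expression 2 towards Expression 1.
w1, w2 = torch.randn(10), torch.randn(10)      # placeholder expression weights
steps = [interpolate_expression(w1, w2, a) for a in torch.linspace(0, 1, 5)]
```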
Thanks to the distribution \(p_Z\) modeled by the Audio2Pose module, we are able to sample diverse and realistic head poses from it. As shown in Fig. 11, by passing the same inputs through our EDTalk, our method synthesizes varied yet natural head motions while keeping the expression and mouth shape unchanged.
We input two transcripts into a Text-To-Speech (TTS) system to synthesize two audio clips. These audio clips, along with their respective transcripts, are then fed into our Audio-to-Motion module to generate talking face videos. The results of semantically-aware expression generation are depicted in Fig. 12, showcasing our method’s ability to accurately generate expressions corresponding to the transcripts (left: happy; right: sad). Additionally, in the Supplementary Video, we provide further results where expressions are inferred directly from audio.
We initially present the results showcasing individual control over mouth shape, head pose, and emotional expression in Fig. 13. Specifically, by feeding our EDTalk with an identity source and various driving sources (first row of each part), our method generates the corresponding disentangled outcomes in the second row. Subsequently, we integrate these individual facial motions into full emotional talking head videos with synchronized lip movements, head gestures, and emotional expressions. It is worth noting that our method facilitates the combination of any two facial parts, such as ‘expression+lip’, ‘expression+pose’, etc. An example of ‘lip+pose’ is shown in the first row in the lower right corner of Fig. 13. Additionally, we provide comparisons with state-of-the-art facial disentanglement methods like PD-FGC [6] and DPE [15] in terms of facial disentanglement performance and computational efficiency. For further details, please refer to the Supplementary Video (4:50-5:12).
We are also intrigued by how each base in the banks influences motion direction. Consequently, we manipulate only a specific base \(b^*_i\) and repeat the setup. The results, as depicted in Fig. 14, indicate that the bases hold semantic significance for fundamental visual transformations such as mouth opening/closing, head rotation, and happiness/sadness/anger.
In this section, we perform a series of experiments on the MEAD dataset to explore the impact of the number of bases on the final performance. Specifically, we vary the number of bases in the Mouth Bank \(B^m\) and Expression Bank \(B^e\) across values of 5, 10, 20, and 40, respectively. The quantitative results are provided in Tab. 6, where we observe the best performance when utilizing 20 bases in \(B^m\) and 10 bases in \(B^e\).
#Bases | Mouth Bank \(B^m\) PSNR\(\uparrow\) | SSIM\(\uparrow\) | M/F-LMD\(\downarrow\) | \(\text{Sync}_\text{conf}\uparrow\) | \(\text{Acc}_\text{emo}\uparrow\) | Expression Bank \(B^e\) PSNR\(\uparrow\) | SSIM\(\uparrow\) | M/F-LMD\(\downarrow\) | \(\text{Sync}_\text{conf}\uparrow\) | \(\text{Acc}_\text{emo}\uparrow\)
---|---|---|---|---|---|---|---|---|---|---
5 | 20.39 | 0.69 | 2.02/1.67 | 6.35 | 63.53 | 21.54 | 0.70 | 1.60/1.35 | 8.27 | 53.26 |
10 | 21.45 | 0.72 | 1.65/1.33 | 7.89 | 65.74 | 21.63 | 0.72 | 1.54/1.29 | 8.12 | 67.32 |
20 | 21.63 | 0.72 | 1.54/1.29 | 8.12 | 67.32 | 21.37 | 0.72 | 1.64/1.46 | 8.23 | 61.34 |
40 | 20.79 | 0.71 | 1.65/1.48 | 7.62 | 63.12 | 21.41 | 0.71 | 1.68/1.42 | 8.16 | 59.65 |
Our approach is efficient thanks to the constraints we impose on the latent spaces (requirements (a) and (b)). Based on these requirements, we propose a simple and easy-to-implement framework and training strategy, which does not require large amounts of training time, training data, or computational resources. However, this does not indicate a lack of innovation in our approach. Quite the contrary: in an age where computational power reigns, our aim is to propose an efficient strategy that attains state-of-the-art performance with minimal computational resources, eschewing complex network architectures or training gimmicks. We aspire for our method to offer encouragement and insight to researchers operating within resource-constrained environments, presented in a simple and elegant manner!
Since we only replace the mouth regions of the data when training the mouth-pose decoupling module, the decoupled ‘pose’ space in this stage actually refers to the ‘non-mouth’ region, including expression and head pose. To mitigate the influence of expression on this pose space, we exclusively train with an expression-agnostic dataset, where all images maintain a neutral expression. As a result, the mouth-pose decoupling module in this stage solely focuses on the head pose and lacks the capability to model emotive expressions. Therefore, we refer to it as ‘pose’ instead of ‘non-mouth’. This hypothesis was further validated in our experiments (Figs. 13 and 14); even when emotional videos are inputted, the \(PLN\) module solely extracts the head pose without incorporating the emotional expression.
We notice that there exist some color artifacts in the synthesized images (indicated by red arrows in Fig. 15). However, we argue that these artifacts do not significantly impact performance and provide a detailed analysis to support this claim. (1) Our Encoder \(E\) and Generator \(G\) are pretrained in a similar setting as LIA [19], using a dataset collected from various sources with diverse identities, backgrounds, and motions. This diversity results in richness and colorfulness in each frame, making the Encoder \(E\) robust to different input images. We have verified this robustness in our experiments (see Sec. 7.3). Therefore, despite the presence of artifacts, the Encoder \(E\) can effectively process synthetic images. (2) During the training process, we employ not only cross-reconstruction but also a self-reconstruction loss (\(\mathcal{L}_{self}\)) on images without mouth replacement. This loss ensures that the training data contain not only synthesized images but also a large number of unmodified images, thereby preventing performance degradation. We have also confirmed the contribution of self-reconstruction through our ablation study.
One might raise concerns regarding the evaluation datasets, as both the MEAD and HDTF datasets used for evaluation are also the datasets on which the model is trained. Moreover, several prior works used for comparison have not been trained on the HDTF dataset. For instance, PD-FGC is not trained on the HDTF dataset, raising questions about the fairness of such comparisons. We provide several explanations to address these concerns: (1) To maintain consistency with previous works, we adhere to the comparison protocol established by them [49], [51]. Specifically, the MEAD and HDTF datasets contain 43 available speakers and over 300 speakers, respectively. We randomly allocate 4 and 60 speakers for testing and the remainder for training. This ensures that the test set comprises identities unseen during training, thereby ensuring a fair comparison. (2) While some works, such as PD-FGC, are not trained on the HDTF or MEAD datasets, they utilize the Voxceleb2 dataset, which includes over 1 million utterances from 6,112 celebrities. This dataset is hundreds of times larger than ours, ensuring that they have ample data for training. (3) Additionally, we conduct comparisons on the LRW and Voxceleb2 datasets, which are not utilized for training our method. The results presented in Tab. 4 reaffirm the superiority of our approach, providing further validation of its performance.
While our current work has made significant strides, it also possesses certain limitations. Firstly, due to the low resolution of the training data, our approach is constrained to generating videos with a resolution of \(256 \times 256\). Consequently, the blurred teeth in the generated results may diminish their realism. Secondly, our method currently overlooks the influence of emotion on head pose, which represents a meaningful yet unexplored task. Unfortunately, the existing emotional MEAD dataset [45] maintains consistent head poses across emotions, making it challenging to model the impact of emotion on pose. However, once relevant datasets become available, our approach can readily be extended to incorporate the influence of emotion on head pose by introducing emotion labels \(e\) as an additional conditioning factor in Eq. (8): \(\hat{W}^p = \varphi_p(\hat{z}, f^a_{1:N}, e)\).
Our approach is geared towards generating talking face animations with individual facial control, which holds promise for various applications such as entertainment and filmmaking. However, there is a potential for malicious misuse of this technology on social media platforms, leading to negative societal implications. Despite significant advancements in deepfake detection research [77]–[80], there is still room for improvement in detection accuracy, particularly with the availability of more diverse and comprehensive datasets. In this regard, we are pleased to offer our talking face results, which can contribute to enhancing detection algorithms to better handle increasingly sophisticated scenarios.