EDTalk: Efficient Disentanglement for
Emotional Talking Head Synthesis

Shuai Tan\(^{1}\), Bin Ji\(^{1}\), Mengxiao Bi\(^{2}\), Ye Pan\(^{1}\)
\(^1\)Shanghai Jiao Tong University \(^2\)NetEase Fuxi AI Lab
{tanshuai0219, bin.ji, whitneypanye}@sjtu.edu.cn
bimengxiao@corp.netease.com


Abstract

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the applicability and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal inputs—both aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk. We recommend visiting the project website: https://tanshuai0219.github.io/EDTalk/

1 Introduction↩︎

Talking head animation has garnered significant research attention owing to its wide-ranging applications in education, filmmaking, virtual digital humans, and the entertainment industry [1]. While previous methods [2][5] have achieved notable advancements, most of them generate talking head videos in a holistic manner, lacking fine-grained individual control. Consequently, attaining precise and disentangled manipulation over various facial motions such as mouth shapes, head poses, and emotional expressions remains a challenge, crucial for crafting lifelike avatars [6]. Moreover, existing approaches typically cater to only one driving source: either audio [7], [8] or video [9], [10], thereby limiting their applicability in the multimodal context. There is a pressing need for a unified framework capable of simultaneously achieving individual facial control and handling both audio-driven and video-driven talking face generation.

Figure 1: Illustrative animations produced by EDTalk. Given an identity source, EDTalk synthesizes talking face videos characterized by mouth shapes, head poses, and expressions consistent with mouth GT, pose source and expression source. These facial dynamics can also be inferred directly from driven audio. Importantly, EDTalk demonstrates superior efficiency in disentanglement training compared to other methods.

To tackle the challenges, an intuition is to disentangle the entirety of facial dynamics into distinct facial latent spaces dedicated to individual components. However, it is non-trivial due to the intricate interplay among facial movements [6]. For instance, mouth shapes profoundly impact emotional expressions, where one speaks happily with raised lip corners but sadly with depressed ones [11], [12]. Despite the extensive efforts made in facial disentanglement by previous studies [6], [13][16], we argue there exist three key limitations. (1) Overreliance on external and prior information increases the demand for data and complicates data pre-processing: One popular line [6], [13], [14] relies heavily on external audio data to decouple the mouth space via contrastive learning [17]. Subsequently, they further disentangle the pose space using predefined 6D pose coefficients extracted from 3D face reconstruction models [18]. However, such external and prior information escalates dataset demands, and any inaccuracies therein can introduce errors into the trained model. (2) Disentangling latent spaces without internal constraints leads to incomplete decoupling. Previous works [14], [16] simply constrain each space externally with a prior during the decoupling process, overlooking inter-space constraints. This oversight fails to ensure that each space exclusively handles its designated component without interference from others, leading to training complexities, reduced efficiency, and performance degradation. (3) An inefficient training strategy escalates training time and computational cost. When disentangling a new sub-space, some methods [6], [15] require training the entire heavyweight network from scratch, which incurs significantly high time and computational costs [11]. This can be costly and unaffordable for many researchers. Furthermore, most methods are unable to utilize audio and video inputs simultaneously.

To cope with such issues, this paper proposes an Efficient Disentanglement framework, tailored for one-shot talking head generation with precise control over mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Our key insight lies in our requirements for decoupled space: (a) The decoupled spaces should be disjoint, which means each space captures solely the motion of its corresponding component without the interference from others. This also ensures that decoupling a new space will not affect the trained models, thereby avoiding the necessity of training from scratch. (b) Once the spaces are disentangled from video data to support video-driven paradigm, they should be stored to share with the audio inputs for further audio-driven setting.

To this end, drawing inspiration from the observation that the entire motion space can be represented by a set of directions [19], we innovatively disentangle the whole motion space into three distinct component-aware latent spaces. Each space is characterized by a set of learnable bases. To ensure that different latent spaces do not interfere with each other, we constrain the bases to be orthogonal to one another not only within each space [19] but also across spaces. To accomplish the disentanglement without prior information, we introduce a progressive training strategy comprising cross-reconstruction mouth-pose disentanglement and self-reconstruction complementary learning for expression decoupling. Despite comprising two stages, our decoupling process involves training only the proposed lightweight Latent Navigation modules, keeping the weights of the other heavier modules fixed for efficient training.

To explicitly preserve the disentangled latent spaces, we store the base sets of disentangled spaces in the corresponding banks. These banks serve as repositories of prior bases essential for audio-driven talking head generation. Consequently, we introduce an Audio-to-Motion module designed to predict the weights of the mouth, pose, and expression banks, respectively. Specifically, we employ an audio encoder to synchronize lip motions with the audio input. Given the non-deterministic nature of head motions [20], we utilize normalizing flows [21] to generate probabilistic and realistic poses by sampling from a Gaussian distribution, guided by the rhythm of audio. Regarding expression, we aim to extract emotional cues from the audio [22] and transcripts. It ensures that the generated talking head video aligns with the tone and context of audio, eliminating the need for additional expression references. In this way, our EDTalk enables talking face generation directly from the sole audio input.

Our contributions are outlined as follows: 1) We present EDTalk, an efficient disentanglement framework enabling precise control over talking head synthesis concerning mouth shape, head pose, and emotional expression. 2) By introducing orthogonal bases and an efficient training strategy, we successfully achieve complete decoupling of these three spaces. Leveraging the properties of each space, we implement Audio-to-Motion modules to facilitate audio-driven talking face generation. 3) Extensive experiments demonstrate that our EDTalk surpasses the competing methods in both quantitative and qualitative evaluation.

2 Related Work↩︎

2.1 Disentanglement on the face↩︎

Facial dynamics typically involve coordinated movements such as head poses, mouth shapes, and emotional expressions in a global manner [23], making their separate control challenging. Several works have been developed to address this issue. PC-AVS [16] employs contrastive learning to isolate the mouth space related to audio. Yet since similar pronunciations tend to correspond to the same mouth shape [24], the constructed negative pairs in a mini-batch often include positive pairs and the number of negative pairs in the mini-batch is too small [25], both of which result in subpar performance. Similarly, PD-FGC [6] and TH-PAD [13] face analogous challenges in obtaining content-related mouth spaces. Although TH-PAD incorporates a lip motion decorrelation loss to extract the non-lip space, it still retains a coupled space where expressions and head poses are intertwined. This coupling results in randomly generated expressions co-occurring with head poses, compromising user-friendliness and content relevance. Despite the achievement of PD-FGC in decoupling facial details, its laborious coarse-to-fine disentanglement process consumes substantial computational resources and time. DPE [15] introduces a bidirectional cyclic training strategy to disentangle head pose and expression from talking head videos. However, it necessitates two generators to independently edit expression and pose sequentially, escalating computational resource consumption and runtime. In contrast, we propose an efficient decoupling approach to segregate faces into mouth, head pose, and expression components, readily controllable by different sources. Moreover, our method requires only a unified generator, and minimal additional resources are needed when exploring a new disentangled space.

2.2 Audio-driven Talking Head Generation↩︎

Audio-driven talking head generation [26], [27] endeavors to animate images with accurate lip movements synchronized with input audio clips. Research in this area is predominantly categorized into two groups: intermediate representation based methods and reconstruction-based methods. Intermediate representation based methods [4], [7], [28][34] typically consist of two sub-modules: one predicts intermediate representations from audio, and the other synthesizes photorealistic images from these representations. For instance, Das et al.[29] employ landmarks as an intermediate representation, utilizing an audio-to-landmark module and a landmark-to-image module to connect audio inputs and video outputs. Yin et al.[5] extract 3DMM parameters [35] to warp source images using predicted flow fields. However, obtaining such intermediate representations, like landmarks and 3D models, is laborious and time-consuming. Moreover, they often offer limited facial dynamics details, and training the two sub-modules separately can accumulate errors, leading to suboptimal performance. In contrast, our approach operates within a reconstruction-based framework [2], [8], [36][41]. It integrates features extracted by encoders from various modalities to reconstruct talking head videos in an end-to-end manner, alleviating the aforementioned issues. A notable example is Wav2Lip [42], which employs an audio encoder, an identity encoder, and an image decoder to generate precise lip movements. Similarly, Zhou et al. [16] incorporate an additional pose encoder for free pose control, yet disregard the nondeterministic nature of natural movement. To address this, we propose employing a probabilistic model to establish a distribution of non-verbal head motions. Additionally, none of the existing methods consider facial expressions, crucial for authentic talking head generation. Our approach aims to integrate facial expressions into the model to enhance the realism and authenticity of the generated talking heads.

2.3 Emotional Talking Head Generation↩︎

Emotional talking head generation is gaining traction due to its wide-ranging applications and heightened entertainment potential. On the one hand, some studies [11], [22], [43][47] specify emotions using discrete emotion labels, albeit facing challenges in generating controllable and fine-grained expressions. On the other hand, recent methodologies [6], [14], [48][51] incorporate emotional images or videos as references to indicate desired expressions. Ji et al. [49], for instance, mask the mouth region of an emotional video and utilize the remaining upper face as an expression reference for emotional talking face generation. However, as mouth shape plays a crucial role in conveying emotion [23], they struggle to synthesize vivid expressions due to their failure to decouple expressions from the entire face. Thanks to our orthogonal bases and efficient training strategy, we are capable of fully disentangling different motion spaces such as mouth shape and emotional expression, thus achieving finely controlled talking head synthesis. Moreover, we also incorporate the emotion contained within audio and transcripts. To the best of our knowledge, we are the first to achieve this goal—automatically inferring suitable expressions from audio tone and text, thereby generating consistent emotional talking face videos without relying on explicit image/video references.

3 Methodology↩︎

Figure 2: Illustration of our proposed EDTalk. (a) EDTalk framework. Given an identity source \(I^i\) and various driving images \(I^*\) (\(* \in \{m,p,e\}\)) for controlling corresponding facial components, EDTalk animates the identity image \(I^i\) to mimic the mouth shape, head pose, and expression of \(I^m\), \(I^p\) and \(I^e\) with the assistance of three Component-aware Latent Navigation modules: MLN, PLN and ELN. (b) Efficient Disentanglement. The disentanglement process consists of two parts: Mouth-Pose Decoupling and Expression Decoupling. For the former, we introduce a cross-reconstruction training strategy aimed at separating mouth shape and head pose. For the latter, we achieve expression disentanglement using self-reconstruction complementary learning.

As illustrated in Fig. 2 (a), given an identity image \(I^i\), we aim to synthesize emotional talking face image \(\hat{I}^g\) that maintains consistency in identity information, mouth shape, head pose, and emotional expression with various driving sources \(I^i\), \(I^m\), \(I^p\) and \(I^e\). Our intuition is to disentangle different facial components from the overall facial dynamics. To this end, we propose EDTalk (Sec. 3.1) with learnable orthogonal bases stored in banks \(B^*\) (\(*\) refers to the mouth source \(m\), pose source \(p\) and expression source \(e\) for simplicity), each representing a distinct direction of facial movements. To ensure the bases are component-aware, we propose an efficient disentanglement strategy (Sec. 3.2), comprising Mouth-Pose Decoupling and Expression Decoupling, which decompose the overall facial motion into mouth, pose, and expression spaces. Leveraging these disentangled spaces, we further explore an Audio-to-Motion module (Section 3.3, Figure 3) to produce audio-driven emotional talking face videos featuring probabilistic poses, audio-synchronized lip motions, and semantically-aware expressions.

3.1 EDTalk Framework↩︎

Figure 2 (a) illustrates the structure of EDTalk, which is based on an autoencoder architecture consisting of an Encoder \(E\), three Component-aware Latent Navigation modules (CLNs) and a Generator \(G\). The encoder \(E\) maps the identity image \(I^i\) and the various driving sources \(I^*\) into the latent features \(\textcolor{f1}{f^{i \rightarrow r}} = E(I^i)\) and \(\textcolor{f1}{f^{* \rightarrow r}} = E(I^*)\). The process is inspired by FOMM [9] and LIA [19]. Instead of directly modeling the motion transformation \(f^{i\rightarrow *}\) from the identity image \(I^i\) to the driving image \(I^*\) in the latent space, we posit the existence of a canonical feature \(f^r\) that facilitates motion transfer between identity features and driving ones, expressed as \(\textcolor{f3}{f^{i\rightarrow *}} = \textcolor{f1}{f^{i\rightarrow r}} + \textcolor{f2}{f^{r\rightarrow *}}\).

Thus, upon acquiring the latent features \(f^{* \rightarrow r}\) extracted by \(E\) from the driving images \(I^*\), we devise three Component-aware Latent Navigation modules to transform them into \(\textcolor{f2}{f^{r\rightarrow *}} = CLN(\textcolor{f1}{f^{* \rightarrow r}})\). For clarity, we use pose as an example, denoted as \(*=p\). Within the Pose-aware Latent Navigation (PLN) module, we establish a pose bank \(B^p = \{b^p_1, ..., b^p_n\}\) to store \(n\) learnable bases \(b^p_i\). To ensure each base represents a distinct pose motion direction, we enforce orthogonality between every pair of bases by imposing the constraint \(\left\langle b^p_i, b^p_j \right \rangle= 0\quad (i \not= j)\), where \(\left\langle \cdot, \cdot \right \rangle\) signifies the dot product. This allows us to depict various head pose movements as linear combinations of the bases. Consequently, we design a Multi-Layer Perceptron \(MLP^p\) to predict the weights \(W^p = \{w^p_1, ..., w^p_n\}\) of the pose bases from the latent feature \(f^{p \rightarrow r}\): \[W^p = \{w^p_1, ..., w^p_n\} = MLP^p(\textcolor{f1}{f^{p \rightarrow r}}), \qquad \textcolor{f2}{f^{r \rightarrow p}} = \sum_{i=1}^{n} w^p_i b^p_i.\]
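For intuition, here is a minimal PyTorch sketch of such a component-aware latent navigation module. The class and argument names (`LatentNavigation`, `num_bases`, `latent_dim`) are our own illustration rather than the released implementation, and the unit-normalization of the bases is an assumption for numerical convenience.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentNavigation(nn.Module):
    """Bank of learnable bases plus an MLP that predicts their mixing weights."""

    def __init__(self, latent_dim=512, num_bases=6):
        super().__init__()
        # bank B = {b_1, ..., b_n}: each row is one learnable motion direction
        self.bank = nn.Parameter(torch.randn(num_bases, latent_dim))
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, num_bases),
        )

    def forward(self, f_star_to_r):
        # W = MLP(f^{*->r}); f^{r->*} = sum_i w_i * b_i
        weights = self.mlp(f_star_to_r)              # (batch, n)
        bases = F.normalize(self.bank, dim=1)        # unit-norm directions
        f_r_to_star = weights @ bases                # (batch, latent_dim)
        return f_r_to_star, weights
```

In EDTalk, the same structure is instantiated three times (MLN, PLN, ELN) with separate parameters, as described next.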

The Mouth-aware and Expression-aware Latent Navigation modules share the same architecture as PLN but have different parameters, from which we similarly derive \(\textcolor{f2}{f^{r \rightarrow m}} = \sum_{i=1}^{n} w^m_i b^m_i, W^m = MLP^m(\textcolor{f1}{f^{m \rightarrow r}})\) and \(\textcolor{f2}{f^{r \rightarrow e}} = \sum_{i=1}^{n} w^e_i b^e_i, W^e = MLP^e(\textcolor{f1}{f^{e \rightarrow r}})\). It is worth noting that to achieve complete disentanglement of facial components and prevent changes in one component from affecting others, we also enforce orthogonality between the three banks (\(B^m,B^p,B^e\)). This allows us to directly combine the three features to obtain the driving feature \(\textcolor{f2}{f^{r \rightarrow d}} = \textcolor{f2}{f^{r \rightarrow m}}+\textcolor{f2}{f^{r \rightarrow p}}+\textcolor{f2}{f^{r \rightarrow e}}\). We further obtain \(\textcolor{f3}{f^{i \rightarrow d}} = \textcolor{f1}{f^{i \rightarrow r}}+\textcolor{f2}{f^{r \rightarrow d}}\), which is subsequently fed into the Generator \(G\) to synthesize the final result \(\hat{I}^g\). To maintain identity information, \(G\) incorporates the identity features \(f^{id}\) of the identity image via skip connections. Additionally, to enhance emotional expressiveness with the assistance of the emotion feature \(\textcolor{f2}{f^{r\rightarrow e}}\), we introduce a lightweight plug-and-play Emotion Enhancement Module (\(EEM\)), which will be discussed in the subsequent subsection. In summary, the generation process can be formulated as follows: \[\label{eq:g} \hat{I}^g = G(\textcolor{f3}{f^{i \rightarrow d}}, f^{id}, EEM(\textcolor{f2}{f^{r \rightarrow e}})),\tag{1}\] where \(EEM\) is exclusively utilized during emotional talking face generation. For brevity, we omit \(f^{id}\) in the subsequent equations.
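A simple way to impose both the intra-bank and inter-bank orthogonality constraints is to penalize the off-diagonal entries of the Gram matrix of all bases stacked together. The following regularizer is a sketch under that assumption; the paper does not specify the exact form of the penalty.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(banks):
    """banks: list of (n_i, d) base tensors, e.g. [B_m, B_p, B_e].
    Penalizes dot products between every pair of distinct bases, both
    within a bank (intra-space) and across banks (inter-space)."""
    all_bases = torch.cat([F.normalize(b, dim=1) for b in banks], dim=0)  # (N, d)
    gram = all_bases @ all_bases.t()                                      # (N, N)
    off_diag = gram - torch.diag(torch.diagonal(gram))
    return (off_diag ** 2).mean()
```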

3.2 Efficient Disentanglement↩︎

Based on the outlined framework, the crux lies in training each Component-aware Latent Navigation module to store only the bases corresponding to the motion of its respective components and to ensure no interference between different components. To achieve this, we propose an efficient disentanglement strategy comprising Mouth-Pose Decoupling and Expression Decoupling, thereby separating the overall facial dynamics into mouth, pose, and expression components.

Mouth-Pose Decouple. As depicted at the top of Fig. 2 (b), we introduce a cross-reconstruction technique, which involves synthesized images with switched mouths: \(I^{m_a}_{p_b}\) and \(I^{m_b}_{p_a}\). Here, we superimpose the mouth region of \(I^a\) onto \(I^b\) and vice versa. Subsequently, the encoder \(E\) encodes them into canonical features, which are processed through \(PLN\) and \(MLN\) to obtain the corresponding features: \[f^{p_b}, f^{m_a} = PLN(E(I^{m_a}_{p_b})), MLN(E(I^{m_a}_{p_b}))\] \[f^{p_a}, f^{m_b} = PLN(E(I^{m_b}_{p_a})), MLN(E(I^{m_b}_{p_a}))\] Next, we swap the extracted mouth features and feed them into the generator \(G\) to perform cross reconstruction of the original images: \(\hat{I}^b = G(f^{p_b}, f^{m_b})\) and \(\hat{I}^a = G(f^{p_a}, f^{m_a})\). Additionally, we include identity features \(f^{id}\) extracted from another frame of the same identity as input to the generator \(G\). Afterward, we supervise the Mouth-Pose Decouple module by adopting a reconstruction loss \(\mathcal{L}_\text{rec}\), a perceptual loss \(\mathcal{L}_\text{per}\) [52], [53] and an adversarial loss \(\mathcal{L}_\text{adv}\): \[\label{eq:1} \mathcal{L}_\text{rec} = \sum_{\#={a,b}}\|I^\#-\hat{I}^\#\|_1; \qquad \mathcal{L}_\text{per} = \sum_{\#={a,b}}\|\Phi(I^\#)-\Phi(\hat{I}^\#)\|^2_2;\tag{2}\] \[\label{eq:2} \mathcal{L}_\text{adv} = \sum_{\#={a,b}}(\log D(I^\#)+\log(1-D(\hat{I}^\#))),\tag{3}\] where \(\Phi\) denotes the feature extractor of VGG19 [54] and \(D\) is a discriminator tasked with distinguishing between reconstructed images and the ground truth (GT). In addition, self-reconstruction of the GT is crucial, where mouth features and pose features are extracted from the same image and then input into \(G\) to reconstruct itself using \(\mathcal{L}_\text{self}\). Furthermore, we impose feature-level constraints on the network: \[\label{eq:3} \mathcal{L}_\text{fea} = \sum_{\#={a,b}}(\exp(-\mathcal{S}(f^{p_\#}, PLN(E(I^{\#})))) + \exp(-\mathcal{S}(f^{m_\#}, MLN(E(I^{\#}))))),\tag{4}\] where we extract mouth features and pose features from \(I^a\) and \(I^b\), aiming to minimize their disparity with those extracted from the synthesized images with switched mouths using the cosine similarity \(\mathcal{S}(\cdot,\cdot)\). Once the losses have converged, the parameters trained in this stage are no longer updated for the remainder of training, significantly reducing training time and resource consumption for subsequent stages.
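The training step below sketches the cross-reconstruction and feature-level losses in PyTorch. The callables `E`, `PLN`, `MLN`, and `G`, the use of the summed feature as generator input, and the image/feature shapes are assumed interfaces for illustration (each navigation module is assumed to return the navigated feature and its weights, as in the earlier sketch); this is not the authors' released code, and the perceptual and adversarial terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def mouth_pose_decouple_step(E, PLN, MLN, G, I_a, I_b, I_ma_pb, I_mb_pa, f_id):
    """One cross-reconstruction step. I_ma_pb has the mouth of I_a pasted
    onto I_b; I_mb_pa is the reverse."""
    # encode the mouth-swapped images into pose / mouth features
    f_pb, _ = PLN(E(I_ma_pb)); f_ma, _ = MLN(E(I_ma_pb))
    f_pa, _ = PLN(E(I_mb_pa)); f_mb, _ = MLN(E(I_mb_pa))

    # swap the mouth features back and cross-reconstruct the originals
    I_b_hat = G(f_pb + f_mb, f_id)
    I_a_hat = G(f_pa + f_ma, f_id)
    l_rec = F.l1_loss(I_a_hat, I_a) + F.l1_loss(I_b_hat, I_b)

    # feature-level constraint (Eq. 4): features from swapped images should
    # agree with those extracted from the clean originals
    def agree(f, ref):
        return torch.exp(-F.cosine_similarity(f, ref, dim=-1)).mean()

    l_fea = (agree(f_pa, PLN(E(I_a))[0]) + agree(f_ma, MLN(E(I_a))[0])
             + agree(f_pb, PLN(E(I_b))[0]) + agree(f_mb, MLN(E(I_b))[0]))
    return l_rec, l_fea
```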

Expression Decouple. As illustrated at the bottom of Fig. 2 (b), to decouple expression information from the driving image \(I^d\), we introduce the Expression-aware Latent Navigation module (\(ELN\)) and a lightweight plug-and-play Emotion Enhancement Module (\(EEM\)), both trained via self-reconstruction complementary learning. Specifically, given an identity source \(I^i\) and a driving image \(I^d\) sharing the same identity as \(I^i\) but differing in mouth shape, head pose and emotional expression, our pre-trained modules (i.e., \(E\), \(MLN\), \(PLN\), and \(G\)) from the previous stage effectively disentangle mouth shape and head pose from \(I^d\) and drive \(I^i\) to generate \(\hat{I}^g_n\) with the same mouth shape and head pose as \(I^d\) but with the same expression as \(I^i\). Therefore, to faithfully reconstruct \(I^d\) with the same expression, \(ELN\) is compelled to learn the complementary information not disentangled by \(MLN\) and \(PLN\), which is precisely the expression information. Motivated by the observation [6] that expression variation in a video sequence is typically less frequent than changes in other motions, we define a window of size \(K\) around \(I^d\) and average the \(K\) extracted expression features to obtain a clean expression feature \(f^{r\rightarrow e}\). \(f^{r\rightarrow e}\) is then combined with the extracted mouth and pose features as input to the generator \(G\). Additionally, \(EEM\) takes \(f^{r\rightarrow e}\) as input and utilizes affine transformations to produce \(f^e = (f^e_s, f^e_b)\), which controls adaptive instance normalization (AdaIN) [55] operations. The AdaIN operations further adapt the identity feature \(f^{id}\) into the emotion-conditioned feature \(f^{id}_e\) by: \[f^{id}_e := EEM(f^{id})= f^e_s \frac{f^{id} - \mu(f^{id})}{\sigma(f^{id})} + f^e_b,\] where \(\mu(\cdot)\) and \(\sigma(\cdot)\) denote the mean and standard deviation operations. Subsequently, we generate the output \(\hat{I}^g_e\) with the expression of \(I^d\) via Eq. 1 . We enforce a motion reconstruction loss [6] \(\mathcal{L}_\text{mot}\) in addition to the same reconstruction loss \(\mathcal{L}_\text{rec}\), perceptual loss \(\mathcal{L}_\text{per}\) and adversarial loss \(\mathcal{L}_\text{adv}\) as in Eq. 2 and Eq. 3 : \[\mathcal{L}_\text{mot} = \|\phi(I^d)-\phi(\hat{I}^g_e)\|_2 + \|\psi(I^d)-\psi(\hat{I}^g_e)\|_2,\] where \(\phi(\cdot)\) and \(\psi(\cdot)\) denote features extracted by the 3D face reconstruction network and the emotion network of [18]. Moreover, to ensure that the synthesized image accurately mimics the mouth shape of the driving frame, we further introduce a mouth consistency loss \(\mathcal{L}_\text{m-c}\): \[\mathcal{L}_\text{m-c} = e^{-\mathcal{S}(MLN(E(\hat{I}^g_e)), MLN(E(I^{d})))},\] where \(MLN\) and \(E\) are pretrained in the previous stage. During training, we only need to train the lightweight \(ELN\) and \(EEM\), resulting in fast training.
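A minimal sketch of how the \(EEM\) could implement the AdaIN modulation above in PyTorch. The layer names and the assumption that \(f^{id}\) is a spatial feature map are ours; the actual module may differ.

```python
import torch
import torch.nn as nn

class EmotionEnhancementModule(nn.Module):
    """Maps the expression feature to AdaIN scale/bias and re-normalizes
    the identity feature, following the equation above."""

    def __init__(self, exp_dim=512, id_channels=512):
        super().__init__()
        self.to_scale = nn.Linear(exp_dim, id_channels)  # produces f^e_s
        self.to_bias = nn.Linear(exp_dim, id_channels)   # produces f^e_b

    def forward(self, f_id, f_exp, eps=1e-5):
        # f_id: (B, C, H, W) identity feature map; f_exp: (B, exp_dim)
        s = self.to_scale(f_exp)[:, :, None, None]
        b = self.to_bias(f_exp)[:, :, None, None]
        mu = f_id.mean(dim=(2, 3), keepdim=True)
        sigma = f_id.std(dim=(2, 3), keepdim=True) + eps
        return s * (f_id - mu) / sigma + b
```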

After successfully training the two-stage Efficient Disentanglement module, we acquire three disentangled spaces, enabling one-shot video-driven talking face generation with separate control of identity, mouth shape, pose, and expression, given different driving sources, as illustrated in Fig. 2 (a).

3.3 Audio-to-Motion↩︎

Integrating the disentangled spaces, we aim to address a more appealing but challenging task: audio-driven talking face generation. In this section, depicted in Fig. 3, we introduce three modules to predict the weights of pose, mouth, and expression from audio. These modules replace the driving video input, facilitating audio-driven talking face generation.

Figure 3: The overview of Audio-to-Motion. We design three modules to predict the weights \(\hat{W}^m\), \(\hat{W}^p\), \(\hat{W}^e\) for mouth, pose, and expression, respectively.

Audio-Driven Lip Generation. Prior works [23], [51] generate facial dynamics, encompassing lip motions and expressions, in a holistic manner, which proves challenging for two main reasons: 1) Expressions, being acoustic-irrelevant motions, can impede lip synchronization [20]. 2) The absence of lip visual information hinders fine-detail synthesis at the phoneme level [56]. Thanks to the disentangled mouth space obtained in the previous stage, we naturally mitigate the influence of expression without necessitating special training strategies or loss functions like [20]. Additionally, since the decoupled space is trained during video-driven talking face generation using video as input, which offers ample visual information in the form of the mouth bases \(b^m_i\) stored in the bank \(B^m\), we eliminate the need for extra visual memory like [56]. Instead, we only need to predict the weight \(w^m_i\) of each base \(b^m_i\), which yields fine-grained lip motion. To achieve this, we design an Audio Encoder \(E_a\), which embeds the audio feature into a latent space \(f^a = E_a(a_{1:N})\). Subsequently, a linear layer \(MLP^m_A\) is added to decode the mouth weights \(\hat{W}^m\). During training, we fix the weights of all other modules and only update \(E_a\) and \(MLP^m_A\) using the weighted sum of a feature loss \(\mathcal{L}^m_{fea}\), a reconstruction loss \(\mathcal{L}^m_{rec}\) and a sync loss \(\mathcal{L}^m_{sync}\) [42]: \[\mathcal{L}^m_{fea} = \|W^m - \hat{W}^m\|_2,\qquad \mathcal{L}^m_{rec} = \|I - \hat{I}\|_2,\] \[\mathcal{L}^m_{sync} = -\log\left(\frac{v\cdot s}{\max(\|v\|_2 \cdot \|s\|_2, \epsilon)}\right),\] where \(W^m = MLN(E(I))\) is the GT mouth weight extracted from the GT image \(I\) and \(\hat{I}\) is the image generated using Eq. 1 . \(\mathcal{L}^m_{sync}\) is introduced from [42], where \(v\) and \(s\) are extracted by the speech encoder and image encoder in SyncNet [57].
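The loss terms above can be sketched as follows in PyTorch. We use MSE as a stand-in for the \(\ell_2\) terms, and assume the SyncNet embeddings \(v\) and \(s\) are computed elsewhere; the function name and signature are illustrative.

```python
import torch
import torch.nn.functional as F

def lip_losses(w_m_gt, w_m_pred, I_gt, I_pred, v, s, eps=1e-8):
    """Feature, reconstruction, and sync losses for audio-driven lip generation."""
    l_fea = F.mse_loss(w_m_pred, w_m_gt)     # match GT mouth-bank weights
    l_rec = F.mse_loss(I_pred, I_gt)         # match the GT frame
    # SyncNet-style cosine confidence between lip (v) and audio (s) embeddings
    cos = (v * s).sum(dim=-1) / torch.clamp(v.norm(dim=-1) * s.norm(dim=-1), min=eps)
    l_sync = -torch.log(cos.clamp(min=eps)).mean()
    return l_fea, l_rec, l_sync
```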

Flow-Based Probabilistic Pose Generation. Due to the one-to-many nature of the mapping from input audio to head poses, learning a deterministic mapping as in previous works [28], [32], [33] always outputs the same result, which brings ambiguity and inferior visual quality. To generate probabilistic and realistic head motions, we predict the pose weights \(\hat{W}^p\) using a Normalizing Flow \(\varphi_p\) [21], as illustrated in Fig. 3. During training (indicated by dashed lines), we extract pose weights \(W^p\) from videos as the ground truth and feed them into \(\varphi_p\). By incorporating Maximum Likelihood Estimation (MLE) in Eq. 5 , we embed them into a Gaussian distribution \(p_Z\) conditioned on the audio feature \(f^a = E_a(a_{1:N})\): \[\label{eq:mle} z_t = \varphi_p^{-1}(w^p_t, f^a_t), \qquad \mathcal{L}_\text{MLE}=-\sum_{t=0}^{N-1} \log p_\mathcal{Z}\left(z_t\right)\tag{5}\] As the normalizing flow \(\varphi_p\) is bijective, we reconstruct the pose weights \(\hat{W}^p = \varphi_p(z, f^a_t)\) and utilize a pose reconstruction loss \(\mathcal{L}^p_\text{rec}\) along with a temporal loss \(\mathcal{L}^p_\text{tem}\) to constrain \(\varphi_p\): \[\label{eq:pose} \mathcal{L}^p_\text{rec} = \|W^p-\hat{W}^p\|_2, \qquad \mathcal{L}^p_\text{tem}=\frac{1}{N-1}\sum_{t=1}^{N-1}\|(w^p_t - w^p_{t-1}) - (\hat{w}^p_t - \hat{w}^p_{t-1})\|_2\tag{6}\] During inference, we randomly sample \(\hat{z}\) from the constructed distribution \(p_{Z}\) and then generate the pose weights \(\hat{W}^p = \varphi_p(\hat{z}, f^a_t)\). This process ensures the diversity of head motions while maintaining consistency with the audio rhythm.
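The sketch below illustrates the MLE objective of Eq. 5 and the inference-time sampling. The `flow.inverse(...)`/`flow(...)` interface and the `latent_dim` argument are hypothetical; a full implementation would also add the flow's log-determinant term to the likelihood.

```python
import math
import torch

def pose_mle_loss(flow, w_p, f_a):
    """Eq. 5 (sketch): map GT pose weights into a standard-normal prior.
    flow.inverse plays the role of phi_p^{-1}, conditioned on audio features."""
    z = flow.inverse(w_p, cond=f_a)                              # (T, d)
    log_pz = -0.5 * (z ** 2).sum(dim=-1) \
             - 0.5 * z.shape[-1] * math.log(2 * math.pi)
    return -log_pz.mean()

@torch.no_grad()
def sample_pose_weights(flow, f_a, latent_dim):
    """Inference: draw z ~ N(0, I) and decode audio-conditioned pose weights."""
    z_hat = torch.randn(f_a.shape[0], latent_dim, device=f_a.device)
    return flow(z_hat, cond=f_a)
```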

Semantically-Aware Expression Generation. As finding videos with a desired expression may not always be feasible, potentially limiting their application [58], we aim to exploit the emotion contained in the audio and transcript with the aid of the introduced Semantics Encoder \(E_S\) and Text Encoder \(E_T\). Inspired by [59], our Semantics Encoder \(E_S\) is built upon the pretrained HuBERT model [60], which consists of a CNN-based feature encoder and a transformer-based encoder. We freeze the CNN-based feature encoder and only fine-tune the transformer blocks. The Text Encoder \(E_T\) is inherited from the pretrained Emoberta [61], which encodes the overarching emotional context embedded within textual descriptions. We concatenate the embeddings generated by \(E_S\) and \(E_T\) and feed them into an \(MLP^e_A\) to generate the expression weights \(\hat{W}^e\). Since audio or text may not inherently contain emotion during inference, such as in TTS-generated speech, in order to support the prediction of emotion from a single modality, we randomly mask (\(\mathcal{M}\)) a modality with probability \(p\) during training, inspired by HuBERT: \[\label{eq6} \hat{W}^e=\left\{ \begin{align} MLP^e_A(E_S(a), E_T(T)) & , & 0.5 \le p \le 1, \\ MLP^e_A(\mathcal{M}(E_S(a)), E_T(T)) & , & 0.25 \le p < 0.5, \\ MLP^e_A(E_S(a), \mathcal{M}(E_T(T))) & , & 0 \le p < 0.25. \\ \end{align} \right.\tag{7}\]

We employ \(\mathcal{L}_\text{exp} = \|W^e - \hat{W}^e\|_1\) to encourage \(\hat{W}^e\) to be close to the weight \(W^e\) produced by the pretrained \(ELN\) from emotional frames. At this point, we are able to generate probabilistic, semantically-aware talking head videos solely from an identity image and the driving audio.
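The random modality masking of Eq. 7 might look as follows; zeroing out the masked embedding is our assumption for \(\mathcal{M}\), and the function names are illustrative.

```python
import torch

def predict_expression_weights(mlp_e_A, E_S, E_T, audio, text, training=True):
    """Eq. 7 (sketch): randomly mask one modality during training so the
    expression head also works from audio-only or text-only input."""
    f_s, f_t = E_S(audio), E_T(text)
    if training:
        p = torch.rand(1).item()
        if 0.25 <= p < 0.5:
            f_s = torch.zeros_like(f_s)   # mask the audio branch
        elif p < 0.25:
            f_t = torch.zeros_like(f_t)   # mask the text branch
    return mlp_e_A(torch.cat([f_s, f_t], dim=-1))
```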

4 Experiments↩︎

4.1 Experimental Settings↩︎

Implementation Details. Our model is trained and evaluated on the MEAD [45] and HDTF [62] datasets. Additionally, we report results on additional datasets, including LRW [63] and Voxceleb2 [64], for further assessment of our method in the supplementary material. All video frames are cropped following FOMM [9] and resized to \(256\times 256\). Our method is implemented in PyTorch and trained using the Adam optimizer on 2 NVIDIA GeForce RTX 3090 GPUs. The dimension of the latent code \(f^{*\rightarrow r}\) and the bases \(b^*\) is set to 512, and the numbers of bases in \(B^m\), \(B^p\) and \(B^e\) are set to 20, 6 and 10, respectively. The weight for \(\mathcal{L}_\text{mot}\) is set to 10 and the remaining weights are set to 1.

Table 1: Quantitative comparisons with state-of-the-art methods.
Method MEAD [45] HDTF [62]
PSNR\(\uparrow\) SSIM\(\uparrow\) M/F-LMD\(\downarrow\) FID\(\downarrow\) \(\text{Sync}_\text{conf}\uparrow\) \(\text{Acc}_\text{emo}\uparrow\) PSNR\(\uparrow\) SSIM\(\uparrow\) M/F-LMD\(\downarrow\) FID\(\downarrow\) \(\text{Sync}_\text{conf}\uparrow\)
MakeItTalk [28] 19.442 0.614 2.541/2.309 37.917 5.176 14.64 21.985 0.709 2.395/2.182 18.730 4.753
Wav2Lip [42] 19.875 0.633 1.438/2.138 44.510 8.774 13.69 22.323 0.727 1.759/2.002 22.397 9.032
Audio2Head [32] 18.764 0.586 2.053/2.293 27.236 6.494 16.35 21.608 0.702 1.983/2.060 29.385 7.076
PC-AVS [16] 16.120 0.458 2.649/4.350 38.679 7.337 12.12 22.995 0.705 2.019/1.785 26.042 8.482
AVCT [33] 17.848 0.556 2.870/3.160 37.248 4.895 13.13 20.484 0.663 2.360/2.679 19.066 5.661
SadTalker [20] 19.042 0.606 2.038/2.335 39.308 7.065 14.25 21.701 0.702 1.995/2.147 14.261 7.414
IP-LAP [31] 19.832 0.627 2.140/2.116 46.502 4.156 17.34 22.615 0.731 1.951/1.938 19.281 3.456
TalkLip [65] 19.492 0.623 1.951/2.204 41.066 5.724 14.00 22.241 0.730 1.976/1.937 23.850 1.076
EAMM [49] 18.867 0.610 2.543/2.413 31.268 1.762 31.08 19.866 0.626 2.910/2.937 41.200 4.445
StyleTalk [51] 21.601 0.714 1.800/1.422 24.774 3.553 63.49 21.319 0.692 2.324/2.330 17.053 2.629
PD-FGC [6] 21.520 0.686 1.571/1.318 30.240 6.239 44.86 23.142 0.710 1.626/1.497 25.340 7.171
EMMN [23] 17.120 0.540 2.525/2.814 28.640 5.489 48.64 18.236 0.596 2.795/3.368 36.470 5.793
EAT [11] 20.007 0.652 1.750/1.668 21.465 7.984 64.40 22.076 0.719 2.176/1.781 28.759 7.493
EDTalk-A 21.628 0.722 1.537/1.290 17.698 8.115 67.32 25.156 0.811 1.676/1.315 13.785 7.642
EDTalk-V 22.771 0.769 1.102/1.060 15.548 6.889 68.85 26.504 0.845 1.197/1.111 13.172 6.732
GT 1.000 1.000 0.000/0.000 0.000 7.364 79.65 1.000 1.000 0.000/0.000 0.000 7.721

Comparison Setting. We compare our method with: (a) emotion-agnostic talking face generation methods: MakeItTalk [28], Wav2Lip [42], Audio2Head [32], PC-AVS [16], AVCT [33], SadTalker [20], IP-LAP [31], TalkLip [65]. (b) Emotional talking face generation methods: EAMM [49], StyleTalk [51], PD-FGC [6], EMMN [23], EAT [11], EmoGen [66]. Different from previous work, EDTalk encapsulates the entire face generation process without any other sources (e.g. poses [11], [49], 3DMM [5], [51], phoneme [33], [51]) or pre-processing operations during inference, which facilitates its application. We evaluate our model in both the audio-driven setting (EDTalk-A) and the video-driven setting (EDTalk-V) w.r.t. (i) generated video quality using PSNR, SSIM [67] and FID [68]; (ii) audio-visual synchronization using the Landmark Distance on the Mouth (M-LMD) [7] and the confidence score of SyncNet [57]; (iii) emotional accuracy using \(\text{Acc}_\text{emo}\) calculated by the pretrained Emotion-Fan [69] and the Landmark Distance on the Face (F-LMD). Partial results are moved to the Appendix (Fig. 8 and Tab. 4) due to limited space.

4.2 Experimental Results↩︎

Figure 4: Qualitative comparisons with state-of-the-art methods. See the full comparison in Fig. 8.

Quantitative Results. The quantitative results are presented in Tab. 1, where our EDTalk-A and EDTalk-V achieve the best performance across most metrics, except \(\text{Sync}_\text{conf}\). Wav2Lip pretrains its SyncNet discriminator on a large dataset [70], which might lead the model to prioritize achieving a higher \(\text{Sync}_\text{conf}\) over optimizing visual performance. This is evident in the blurry mouths generated by Wav2Lip and its inferior M-LMD score compared to our method.

Qualitative Results. Fig. 4 demonstrates a comparison of visual results. TalkLip and IP-LAP struggle to generate accurate lip motions. Despite the elevated lip synchronization of SadTalker, it can only produce slight lip motions with an almost closed mouth and is also plagued by jitter between frames. StyleHEAT generates accurate mouth shapes driven by the Mouth GT video instead of audio but suffers from incorrect head pose and identity loss. This issue also plagues EmoGen, EAMM and PD-FGC. Besides, EmoGen and EAMM fail to perform the desired expression. Due to its discrete emotion input, EAT cannot synthesize fine-grained expressions such as the narrowed eyes of the expression reference. In the case of "happy", unexpected closed eyes and weird teeth are observed in EAT and PD-FGC, respectively. In contrast, both EDTalk-A and EDTalk-V excel in producing realistic expressions, precise lip synchronization and correct head poses.

Figure 5: Resources for training (training time, dataset, and GPUs) required by different methods.

Efficiency analysis. Our approach is highly efficient in terms of training time, required data and computational resources for decoupling the spaces. In the mouth-pose decoupling stage, we solely utilize the HDTF dataset, containing 15.8 hours of videos. Training with a batch size of 4 on two 3090 GPUs for 4k iterations achieves state-of-the-art performance and takes about one hour. In contrast, DPE is trained on the VoxCeleb dataset, which comprises 351 hours of video, for 100K iterations initially, then for an additional 50K iterations with a batch size of 32 on 8 V100 GPUs, which takes over 2 days. Besides, it needs to train two task-specific generators for expression and pose. Similarly, PD-FGC takes 2 days on 4 Tesla V100 GPUs for lip decoupling, and another 2 days on 4 Tesla V100 GPUs for pose decoupling. This significantly exceeds our computational resources and training time. In the expression decoupling stage, we train our model on the MEAD and HDTF datasets (54.8 hours of videos in total) for 6 hours. In contrast, PD-FGC decouples the expression space on the Voxceleb2 dataset (2400 hours) with a decorrelation loss for 2 weeks. The visualization in Fig. 5 allows for a more intuitive comparison of the different methods in terms of required training time, training data, and computational resources.


Table 2: User study results.
Metric/Method TalkLip IP-LAP EAMM EAT EDTalk GT
Lip-sync 3.31 3.42 3.49 3.85 4.13 4.74
Realness 3.14 3.13 3.26 3.75 4.92 4.81
\(\text{Acc}_\text{emo}\) (%) 19.7 17.6 44.3 59.7 64.5 75.6

User Study. We conduct a user study to evaluate the human likeness of our method. We generate 10 videos for each method and invite 20 participants (10 males, 10 females) to score each video from 1 (worst) to 5 (best) in terms of lip synchronization and realness, and to classify its emotion. The average scores reported in Tab. 2 demonstrate that our method achieves the best performance in all aspects.

4.3 Ablation Study↩︎

Figure 6: Ablation results.

Latent space. To analyze the contributions of our key designs for obtaining the disentangled latent spaces, we conduct an ablation study with two variants: (1) removing the base banks (w/o Bank) and (2) removing the orthogonality constraint (w/o Orthogonal). Fig. 6 presents our ablation results in the video-driven and audio-driven settings, respectively. Since w/o Bank struggles to decouple the different latent spaces, its "only exp" setting fails to extract the emotional expression. Additionally, without the visual information stored in the banks, the quality of the generated full frame is poor. Although w/o Orthogonal improves the image quality through the vision-rich banks, the lack of orthogonality constraints on the bases causes interference between the different spaces, resulting in less pronounced generated emotions. The Full Model achieves the best performance in both aspects. The quantitative results in Tab. 3 also validate the effectiveness of each component.


Table 3: Ablation study results.
Method/Metric PSNR\(\uparrow\) SSIM\(\uparrow\) M/F-LMD\(\downarrow\) FID\(\downarrow\) \(\text{Sync}_\text{conf}\uparrow\) \(\text{Acc}_\text{emo}\uparrow\)
w/o \(\mathcal{L}_{fea}\) 21.134 0.713 1.914/1.625 28.053 5.601 54.34
w/o \(\mathcal{L}_{self}\) 20.913 0.707 1.815/1.629 29.314 5.030 44.23
w/o \(\mathcal{L}^m_{rec}\) 21.955 0.744 1.666/1.397 18.528 5.447 67.19
w/o \(\mathcal{L}^m_{sync}\) 21.524 0.728 1.626/1.349 17.844 4.007 61.29
w/o Orthogonal 21.429 0.711 1.687/1.320 17.820 4.398 38.71
w/o Bank 20.302 0.660 2.137/1.711 26.842 2.316 9.677
w/o \(EEM\) 20.731 0.673 2.131/1.927 27.135 7.326 49.367
only lip 19.799 0.639 1.767/1.920 31.918 8.291 15.13
lip+pose 21.519 0.695 1.645/1.378 19.571 8.474 16.75
Full Model 21.628 0.722 1.537/1.290 17.698 8.115 67.32

Loss functions. We further explore the effects of different loss functions on the MEAD dataset. The results in Tab. 3 indicate that \(\mathcal{L}_\text{fea}\) and \(\mathcal{L}_\text{self}\) contribute to more disentangled spaces, while \(\mathcal{L}^{m}_\text{rec}\) and \(\mathcal{L}^{m}_\text{sync}\) lead to more accurate lip synchronization. Notably, the Full Model shows a reduction in \(\text{Sync}_\text{conf}\) compared to only lip and lip+pose, suggesting a trade-off between lip-sync accuracy and emotion performance. In this work, we sacrifice slight lip-sync accuracy to enhance expressiveness.

5 Conclusion↩︎

This paper introduces EDTalk, a novel system designed to efficiently disentangle facial components into latent spaces, enabling fine-grained control for talking head synthesis. The core insight is to represent each space with orthogonal bases stored in dedicated banks. We propose an efficient training strategy that autonomously allocates spatial information to each space, eliminating the necessity for external or prior structures. By integrating these spaces, we enable audio-driven talking head generation through a lightweight Audio-to-Motion module. Experiments showcase the superiority of our method in achieving disentangled and precise control over diverse facial motions. We provide more discussion about the limitations and ethical considerations in the Appendix.

Appendix↩︎

Figure 7: Detailed architecture for different components in our EDTalk.

In the main paper, we introduce an innovative framework designed to produce emotional talking face videos, which enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on both video and audio inputs. This appendix delves deeper into: 1) Implementation Details. 2) Additional Experimental Results. 3) Discussion. In addition, we highly encourage viewing the Supplementary Video: https://tanshuai0219.github.io/EDTalk/.

Figure 8: Additional qualitative results, which are supplement to the main paper

6 Implementation Details↩︎

6.1 Network Architecture↩︎

We utilize the same structure for the Generator \(G\) as LIA [19]. We recommend consulting their original paper for further elaboration. Here, we delineate the details of the other network architectures depicted in Fig. 7.

6.1.0.1 Encoder \(E\).

This component projects the identity source \(I^i\) and driving source \(I^*\) into the identity feature \(f^{id}\) and the latent features \(f^{i \rightarrow r}\), \(f^{* \rightarrow r}\). It comprises several convolutional neural networks (CNNs) and ResBlocks. The outputs of the ResBlocks serve as the identity feature \(f^{id}\), which is then fed into the Generator \(G\) to enrich identity information through skip connections. Subsequently, four multi-layer perceptrons (MLPs) are employed to generate the latent features \(f^{i \rightarrow r}\), \(f^{* \rightarrow r}\).

6.1.0.2 \(MLP^m\), \(MLP^p\), \(MLP^e\) and \(MLP^m_A\).

To achieve efficient training and inference, these four modules are implemented with four simple MLPs.

6.1.0.3 Audio Encoder \(E_a\).

This network takes audio feature sequences \(a_{1:N}\) as input. These sequences are passed through a series of convolutional layers to produce the audio features \(f^a_{1:N}\).

6.1.0.4 Normalizing Flow \(\varphi_p\).

Normalizing Flow \(\varphi_p\) comprises \(K\) flow steps, each consisting of an actnorm layer, an invertible convolution and an affine coupling layer. Initially, given the mean \(\mu\) and standard deviation \(\delta\) of the weights \(W^p\) of the pose bank \(B^p\), actnorm is implemented as an affine transformation \(h' = \frac{W^p-\mu}{\delta}\). Subsequently, \(\varphi_p\) introduces an invertible \(1 \times 1\) convolution layer, \(h'' = \mathbf{W}\cdot h'\), to mix the channel variables. Following this, we utilize a transformer-based coupling layer \(\mathcal{F}\) to derive \(z\) from \(h''\) and \(f^a_{1:N}\). Specifically, we split \(h''\) into \(h''_{h1}\) and \(h''_{h2}\), where \(h''_{h2}\) undergoes an affine transformation by \(\mathcal{F}\) based on \(h''_{h1}\): \(t,s = \mathcal{F}(h''_{h1},f^a_{1:N}); h = (h''_{h2}+t)\odot s,\) where \(t\) and \(s\) represent the transformation parameters. Thanks to the unchanged \(h''_{h1}\), tractability is easily maintained in reverse. In summary, we can map \(W^p\) into the latent code \(z\) and predict the weights \(\hat{W}^p\) from a sampled code \(\hat{z} \in p_\mathcal{Z}\) as follows: \[z = \varphi_p^{-1}(W^p,f^a_{1:N})\] \[\label{eq:reverse} \hat{W}^p = \varphi_p(\hat{z},f^a_{1:N})\tag{8}\]
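For concreteness, a conditional affine coupling layer of the form described above could be sketched as follows. An MLP stands in for the transformer-based coupling network \(\mathcal{F}\), and all dimensions and names are illustrative assumptions (the feature dimension is assumed even).

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One coupling step: h1 passes through unchanged and parameterizes an
    affine transform of h2, conditioned on the audio feature f_a."""

    def __init__(self, dim, cond_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs translation t and log-scale
        )

    def forward(self, h, f_a):
        h1, h2 = h.chunk(2, dim=-1)
        t, log_s = self.net(torch.cat([h1, f_a], dim=-1)).chunk(2, dim=-1)
        out = torch.cat([h1, (h2 + t) * torch.exp(log_s)], dim=-1)
        return out, log_s.sum(dim=-1)          # value and log-determinant

    def inverse(self, h, f_a):
        h1, h2 = h.chunk(2, dim=-1)
        t, log_s = self.net(torch.cat([h1, f_a], dim=-1)).chunk(2, dim=-1)
        return torch.cat([h1, h2 * torch.exp(-log_s) - t], dim=-1)
```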

6.2 Data Details↩︎

6.2.1 Datasets↩︎

6.2.1.1 MEAD.

MEAD entails 60 speakers, with 43 speakers accessible, delivering 30 sentences expressing eight emotions at three varying intensity levels in a laboratory setting. Consistent with prior studies [11], [49], we designate videos featuring speakers identified as ‘M003,’ ‘M030,’ ‘W009,’ and ‘W015’ for testing, while the videos of the remaining speakers are allocated for training.

6.2.1.2 HDTF.

The videos of the HDTF dataset are collected from YouTube, renowned for their high quality, high definition content, featuring over 300 distinct identities. To facilitate training and testing, we partition the dataset using an 8:2 ratio based on speaker identities, allocating 80% for training and 20% for testing.

6.2.1.3 Voxceleb2.

Voxceleb2 [64] is a large-scale talking head dataset, boasting over 1 million utterances from 6,112 celebrities. It’s important to note that we solely utilize Voxceleb2 for evaluation purposes, selecting 200 videos randomly from its extensive collection.

6.2.1.4 LRW.

LRW [63] is a word-level dataset comprising more than 1000 utterances encompassing 500 distinct words. For evaluation, we randomly select 500 videos from the dataset.

6.2.2 Data Processing↩︎

For video preprocessing, we crop the faces and resize the cropped videos to a resolution of \(256 \times 256\) for training and testing, following FOMM [9]. Adhering to Wav2Lip [42], audio is down-sampled to 16 kHz and transformed into mel-spectrograms using an FFT window size of 800, a hop length of 200, and 80 Mel filter banks. During evaluation, for datasets without emotion labels, we use the first frame of each video as the source image and the corresponding audio as the driving audio to generate talking head videos. For emotional videos sourced from MEAD, we use the video itself as an expression reference. We select a frame with a ‘Neutral’ emotion from the same speaker as the source image for emotional talking head synthesis.
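The audio preprocessing described above could be reproduced roughly as follows. The use of librosa and the log compression are our own choices for illustration and may differ from the authors' pipeline; the numerical parameters come from the paragraph above.

```python
import librosa
import numpy as np

def audio_to_mel(wav_path):
    """Load audio at 16 kHz and compute an 80-bin mel-spectrogram
    (FFT window 800, hop length 200), as described in the data processing."""
    audio, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=800, hop_length=200, n_mels=80)
    return np.log(mel + 1e-5).T   # (num_frames, 80), log-compressed
```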

Table 4: Quantitative comparisons with state-of-the-art methods. We test each method on Voxceleb2 and LRW datasets, and the best scores in each metric are highlighted in bold. The symbol \("\uparrow"\) and \("\downarrow"\) indicate higher and lower metric values for better results, respectively.
Method Voxceleb2 [64] LRW [63]
PSNR\(\uparrow\) SSIM\(\uparrow\) M-LMD\(\downarrow\) F-LMD\(\downarrow\) \(\text{Sync}_\text{conf}\uparrow\) PSNR\(\uparrow\) SSIM\(\uparrow\) M-LMD\(\downarrow\) F-LMD\(\downarrow\) \(\text{Sync}_\text{conf}\uparrow\)
MakeItTalk [28] 20.526 0.706 2.435 2.380 3.896 22.334 0.729 2.099 1.960 3.137
Wav2Lip [42] 20.760 0.723 2.143 2.182 8.680 23.299 0.764 1.699 1.703 7.545
Audio2Head [32] 17.344 0.577 3.651 3.712 5.541 18.703 0.601 2.866 3.435 5.428
PC-AVS [16] 21.643 0.720 2.088 1.830 7.928 16.744 0.509 5.603 4.691 3.622
AVCT [33] 18.751 0.645 2.739 3.062 4.238 21.188 0.689 2.290 2.395 3.927
SadTalker [20] 20.278 0.700 2.252 2.388 6.356 - - - - -
IP-LAP [31] 20.955 0.724 2.125 2.154 3.295 23.727 0.770 1.779 1.683 3.027
TalkLip [65] 20.633 0.723 2.084 2.191 6.520 22.706 0.751 1.803 1.770 6.021
EAMM [49] 17.038 0.562 4.172 4.163 3.815 18.643 0.607 3.593 3.773 3.414
StyleTalk [51] 21.112 0.722 2.113 2.136 2.120 21.283 0.705 2.394 2.142 2.430
PD-FGC [6] 22.110 0.729 1.743 1.630 6.686 22.481 0.711 1.576 1.534 6.119
EAT [11] 20.370 0.689 2.586 2.383 6.864 21.384 0.704 2.128 1.927 6.630
EDTalk-A 22.107 0.763 1.851 1.608 6.591 23.409 0.779 1.729 1.379 6.914
EDTalk-V 22.133 0.764 1.829 1.583 6.155 24.574 0.823 1.202 1.139 6.027
GT 1.000 1.000 0.000 0.000 6.808 1.000 1.000 0.000 0.000 6.952

6.3 Training Details↩︎

The encoder \(E\) and generator \(G\) are pre-trained in a similar setting as LIA [19]. Subsequently, we freeze the weights of the encoder \(E\) and generator \(G\), focusing solely on training the Mouth-Pose Decouple Module. In this stage, our model is trained exclusively on the emotion-agnostic HDTF dataset, where videos consistently exhibit a ‘Neutral’ emotion alongside diverse head poses. This ensures that the Mouth-Pose Decouple Module concentrates solely on variations in head pose and mouth shape, avoiding the encoding of expression-related information. All loss function weights are set to 1. The training process typically requires approximately one hour, employing a batch size of 4 and a learning rate of 2e-3, executed on 2 NVIDIA GeForce RTX 3090 GPUs with 24GB memory. Once the Mouth-Pose Decouple Module is trained, we freeze all trained parameters and solely update the expression-related modules, including \(MLP^e\), the expression bases \(B^e\), and the Emotion Enhancement Module \(EEM\), utilizing both the MEAD and HDTF datasets. This stage typically takes around 6 hours, employing a batch size of 10 and a learning rate of 2e-3, conducted on 2 NVIDIA GeForce RTX 3090 GPUs with 24GB memory. We train our Audio-to-Lip model on the HDTF dataset for 30k iterations with a batch size of 4, requiring approximately 7 hours of computation on 2 NVIDIA GeForce RTX 3090 GPUs with 24GB memory. The Audio-to-Pose model is trained on the HDTF dataset for one hour.

7 Additional Experimental Results↩︎

7.1 More Comparison with SOTA Audio-Driven Talking Face Generation Methods↩︎

7.1.0.1 More quantitative results.

Apart from the quantitative assessments conducted on the MEAD and HDTF datasets, as detailed in the main paper, we present additional quantitative comparisons on Voxceleb2 [64] and LRW [63]. The comparison results outlined in Tab. 4 demonstrate that our method outperforms state-of-the-art approaches in both the audio-driven (EDTalk-A) and video-driven (EDTalk-V) scenarios across various metrics. We offer a plausible explanation for the superior \(\text{Sync}_\text{conf}\) achieved by Wav2Lip [42] in the main paper. IP-LAP [31] merely alters the mouth shape of the source image while maintaining the same head pose and expression, hence achieving a higher PSNR score. PD-FGC [6] attains superior M-LMD performance by training on Voxceleb2, a dataset comprising over 1 million utterances from 6,112 celebrities, totaling 2400 hours of data, which is hundreds of times larger than our dataset (15.8 hours). Nevertheless, we still outperform PD-FGC in terms of F-LMD. SadTalker [20] encounters challenges in processing even one second of audio, leading to its failure to generate talking face videos on the LRW dataset, where all videos are one second in duration.

Figure 9: Comparison results with SOTA methods that have not released their codes and pretrained models.

7.1.0.2 More qualitative results.

In addition to the state-of-the-art (SOTA) methods discussed in the main paper, we extend our comparative analysis to include both emotion-agnostic talking face generation methods: MakeItTalk [28], Wav2Lip [42], Audio2Head [32], AVCT [33], and PC-AVS [16], as well as emotional talking face generation methods: StyleTalk [51] and EMMN [23]. The comprehensive qualitative results can be found in Fig. 8, serving as a supplement to the data previously presented in Fig. 4 of the main paper. We further conduct comparison experiments with several SOTA talking face generation methods, including GC-AVT [14], EVP [22], ECG [71] and DiffTalk [41]. However, due to the unavailability of codes and pre-trained models for these methods (except EVP), we can only extract video clips from the provided demo videos for comparison. The results are demonstrated in Fig. 9. Specifically, EVP and ECG are emotional talking face generation methods that utilize one-hot labels for emotional guidance, with EVP being a person-specific model and ECG being a one-shot method. Our method outperforms these methods in terms of emotional expression, while the teeth generated by ECG lead to slightly unrealistic results. GC-AVT aims to mimic emotional expressions and generate accurate lip motions synchronized with input speech, resembling the setting of our EDTalk. However, compared to EDTalk, GC-AVT struggles to preserve the reference identity, resulting in significant identity loss. DiffTalk is hindered by severe mouth jitter, which is more evident in the Supplementary Video.

7.2 More Comparison with SOTA Face Reenactment Methods↩︎


Table 5: The quantitative results compared with SOTA face reenactment methods on HDTF dataset.
Method/Metric PSNR\(\uparrow\) SSIM\(\uparrow\) LPIPS\(\downarrow\) \(\mathcal{L}_1 \downarrow\) AKD\(\downarrow\) AED\(\downarrow\)
PIRenderer [72] 22.13 0.72 0.22 0.053 2.24 0.032
OSFV [73] 23.29 0.74 0.17 0.037 1.83 0.025
LIA [19] 24.75 0.77 0.16 0.036 1.88 0.019
DaGAN [10] 23.21 0.74 0.16 0.041 1.93 0.023
MCNET [74] 21.74 0.69 0.26 0.057 2.05 0.037
StyleHEAT [5] 22.15 0.65 0.25 0.075 2.95 0.045
VPGC [75] - - - - - -
EDTalk 26.5 0.85 0.13 0.031 1.74 0.017

7.2.0.1 Qualitative results.

We perform a comparative analysis with state-of-the-art face reenactment methods, including PIRenderer [72], OSFV [73], LIA [19], DaGAN [10], MCNET [74], StyleHEAT [5], and VPGC [75], where VPGC is a person-specific model. Given that the compared methods are not specifically trained on emotional datasets, we conduct comparisons using videos with and without emotion, the results of which are presented in the Supplementary Video (4:07-4:50). Our method demonstrates superior performance in terms of face reenactment.

7.2.0.2 Quantitative results.

We additionally offer extensive quantitative comparisons regarding: (1) generated video quality, assessed through PSNR and SSIM; (2) reconstruction faithfulness, evaluated using LPIPS and the \(\mathcal{L}_1\) norm; (3) semantic consistency, measured by the average keypoint distance (AKD) and average Euclidean distance (AED). The quantitative results on the HDTF dataset are outlined in Tab. 5, showcasing the superior performance of our EDTalk method. Note that since VPGC is a person-specific model, it cannot generalize to the identities in the HDTF dataset.

7.3 Robustness↩︎

Our method demonstrates robustness across out-of-domain portraits, encompassing real human subjects, paintings, sculptures, and images generated by Stable Diffusion [76]. Moreover, our approach exhibits generalizability to various audio inputs, including songs, diverse languages (English, French, German, Italian, Japanese, Korean, Spanish, Chinese), and noisy audio. Please refer to the Supplementary Video (5:40-8:40) for the better visualization of these results.

7.4 Expression Manipulation↩︎

We accomplish expression manipulation by interpolating between expression weights \(W^e\) of the expression bank \(B^e\), which are extracted from any two distinct expression reference clips, using the following equation: \[W^e = \alpha W^e_1+(1-\alpha)W^e_2,\] where \(W^e_1\) and \(W^e_2\) represent the expression weights extracted from the two emotional clips, while \(\alpha\) denotes the interpolation weight. Fig. 10 illustrates an example of expression manipulation generated by our EDTalk. In this example, we successfully transition from Expression 1 to Expression 2 by varying the interpolation weight \(\alpha\). This demonstrates the effectiveness of our \(ELN\) module in accurately capturing the expression of the provided clip, as discussed in the main paper.
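The interpolation itself is a one-liner; the small sketch below sweeps \(\alpha\) to produce a sequence of blended expression weights that can be fed to the generator. Function and variable names are illustrative.

```python
import torch

def interpolate_expressions(w_e1, w_e2, num_steps=5):
    """Return expression weights blended from Expression 2 (alpha=0)
    to Expression 1 (alpha=1)."""
    alphas = torch.linspace(0.0, 1.0, num_steps)
    return [a * w_e1 + (1 - a) * w_e2 for a in alphas]
```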

Figure 10: The results of expression manipulation.

7.5 Probabilistic Pose Generation↩︎

Figure 11: The results of generated head poses.

Thanks to the distribution \(p_Z\) modeled by the Audio2Pose module, we are able to sample diverse and realistic head poses from it. As shown in Fig. 11, by passing the same inputs through our EDTalk, our method synthesizes varied yet natural head motions while keeping the expression and mouth shape unchanged.

7.6 Semantically-Aware Expression Generation↩︎

We input two transcripts into a Text-To-Speech (TTS) system to synthesize two audio clips. These audio clips, along with their respective transcripts, are then fed into our Audio-to-Motion module to generate talking face videos. The results of semantically-aware expression generation are depicted in Fig. 12, showcasing our method's ability to generate expressions that accurately correspond to the transcripts (left: happy; right: sad). Additionally, the Supplementary Video provides further results in which expressions are inferred directly from audio.

Figure 12: The results of semantically-aware expression generation.

7.7 Motion Direction Controlled by Base↩︎

We first present the results showcasing individual control over mouth shape, head pose, and emotional expression in Fig. 13. Specifically, by feeding our EDTalk with an identity source and various driving sources (first row of each part), our method generates the corresponding disentangled outcomes in the second row. Subsequently, we integrate these individual facial motions into full emotional talking head videos with synchronized lip movements, head gestures, and emotional expressions. It is worth noting that our method also supports the combination of any two facial components, such as ‘expression+lip’ or ‘expression+pose’; an example of ‘lip+pose’ is shown in the first row of the lower-right corner of Fig. 13. Additionally, we compare against state-of-the-art facial disentanglement methods such as PD-FGC [6] and DPE [15] in terms of disentanglement performance and computational efficiency. For further details, please refer to the Supplementary Video (4:50-5:12).

Figure 13: The results of individual control over mouth shape, head pose, emotional expression and combined facial dynamics.

Figure 14: Motion direction controlled by each base.

We are also interested in how each base in the banks influences the motion direction. To probe this, we manipulate only a specific base \(b^*_i\) while keeping the rest of the pipeline unchanged. The results, depicted in Fig. 14, indicate that the bases carry semantic meaning for fundamental visual transformations such as mouth opening/closing, head rotation, and happiness/sadness/anger.
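A minimal sketch of this probe is given below: only the coefficient of one base is perturbed, and the resulting latent motion is obtained as a linear combination of the learned bases, mirroring how motions are composed in each space. Tensor names and shapes are illustrative assumptions.

```python
# Hedged sketch: sweep the coefficient of a single base b*_i while keeping all others fixed.
import torch

def sweep_single_base(weights: torch.Tensor, bank: torch.Tensor, i: int,
                      deltas=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """weights: (..., num_bases); bank: (num_bases, latent_dim). Yields perturbed motions."""
    for d in deltas:
        w = weights.clone()
        w[..., i] = w[..., i] + d   # move only along base i
        yield w @ bank              # linear combination of the learned bases
```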

7.8 Ablation Study↩︎

7.8.0.1 Bank size.

In this section, we perform a series of experiments on the MEAD dataset to explore the impact of the number of bases on the final performance. Specifically, we vary the number of bases in the Mouth Bank \(B^m\) and the Expression Bank \(B^e\) across 5, 10, 20, and 40. The quantitative results are provided in Table 6, where we observe the best performance when using 20 bases in \(B^m\) and 10 bases in \(B^e\).

Table 6: Ablation study on the number of bases in the Mouth Bank \(B^m\) and the Expression Bank \(B^e\).
Number of bases | Mouth Bank \(B^m\): PSNR\(\uparrow\) SSIM\(\uparrow\) M/F-LMD\(\downarrow\) \(\text{Sync}_\text{conf}\uparrow\) \(\text{Acc}_\text{emo}\uparrow\) | Expression Bank \(B^e\): PSNR\(\uparrow\) SSIM\(\uparrow\) M/F-LMD\(\downarrow\) \(\text{Sync}_\text{conf}\uparrow\) \(\text{Acc}_\text{emo}\uparrow\)
5 | 20.39 0.69 2.02/1.67 6.35 63.53 | 21.54 0.70 1.60/1.35 8.27 53.26
10 | 21.45 0.72 1.65/1.33 7.89 65.74 | 21.63 0.72 1.54/1.29 8.12 67.32
20 | 21.63 0.72 1.54/1.29 8.12 67.32 | 21.37 0.72 1.64/1.46 8.23 61.34
40 | 20.79 0.71 1.65/1.48 7.62 63.12 | 21.41 0.71 1.68/1.42 8.16 59.65

8 Discussion↩︎

8.1 Novelty↩︎

Our approach is efficient thanks to the constraints we impose on the latent spaces (requirements (a) and (b)). Based on these requirements, we propose a simple, easy-to-implement framework and training strategy that does not demand large amounts of training time, training data, or computational resources. This simplicity, however, does not indicate a lack of innovation. Quite the contrary: in an age where computational power reigns, our aim is to propose an efficient strategy that attains state-of-the-art performance with minimal computational resources, eschewing complex network architectures and training gimmicks. We hope our method, presented in a simple and elegant manner, offers encouragement and insight to researchers operating in resource-constrained environments.

8.2 Potential Concerns about the Mouth-Pose Decouple Module↩︎

8.2.0.1 ‘Pose’ or ‘Non-Mouth’?

Since we only replace the mouth regions of the data when training the mouth-pose decouple module, the decoupled ‘pose’ space at this stage actually refers to the ‘non-mouth’ region, which includes both expression and head pose. To mitigate the influence of expression on this pose space, we train exclusively on an expression-agnostic dataset in which all images maintain a neutral expression. As a result, the mouth-pose decouple module at this stage focuses solely on head pose and lacks the capacity to model emotional expression. We therefore refer to this space as ‘pose’ rather than ‘non-mouth’. This hypothesis is further validated in our experiments (Figs. 13 and 14): even when emotional videos are provided as input, the PLN module extracts only the head pose without incorporating emotional expression.

Figure 15: Examples of synthesized images. \(I^{m_B}_{p_A}\) refers to image \(A\) with the mouth of \(B\), and vice versa.

8.2.0.2 Color artifacts caused by mouth replacement.

We notice some color artifacts in the synthesized images (indicated by red arrows in Fig. 15). However, we argue that these artifacts do not significantly impact performance and provide a detailed analysis to support this claim. (1) Our Encoder \(E\) and Generator \(G\) are pretrained in a setting similar to LIA [19], using a dataset collected from various sources with diverse identities, backgrounds, and motions. This diversity makes the Encoder \(E\) robust to different input images, as verified in our experiments (see Sec. 7.3). Therefore, despite the presence of artifacts, the Encoder \(E\) can effectively process the synthetic images. (2) During training, we employ not only a cross-reconstruction loss but also a self-reconstruction loss (\(\mathcal{L}_{self}\)) on images without mouth replacement. Consequently, the training data contain not only synthesized images but also a large number of unmodified ones, preventing performance degradation. We have also confirmed the contribution of self-reconstruction through our ablation study.
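The sketch below illustrates, under stated assumptions, how the two losses can be combined: cross-reconstruction operates on the mouth-swapped pair, while self-reconstruction operates on unmodified images. The `reconstruct` callable and its return convention are placeholders for the encoder-generator pipeline, and the loss weighting is illustrative.

```python
# Hedged sketch: combining cross-reconstruction (mouth-swapped images) with
# self-reconstruction (unmodified images). `reconstruct` is a placeholder that maps
# a pair of inputs to a pair of reconstructed frames.
import torch.nn.functional as F

def training_losses(reconstruct, img_a, img_b, img_a_mouth_b, img_b_mouth_a, lambda_self=1.0):
    # Cross-reconstruction: recover the original images A and B from the swapped pair.
    rec_a, rec_b = reconstruct(img_a_mouth_b, img_b_mouth_a)
    loss_cross = F.l1_loss(rec_a, img_a) + F.l1_loss(rec_b, img_b)
    # Self-reconstruction on an image without mouth replacement keeps artifact-free
    # frames in the training signal.
    loss_self = F.l1_loss(reconstruct(img_a, img_a)[0], img_a)
    return loss_cross + lambda_self * loss_self
```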

8.2.0.3 Comparison Protocol.

One might raise concerns regarding the evaluation datasets, as both MEAD and HDTF, which are used for evaluation, are also the datasets on which our model is trained. Moreover, several prior works used for comparison have not been trained on the HDTF dataset; PD-FGC, for instance, is not trained on HDTF, raising questions about the fairness of such comparisons. We provide several explanations to address these concerns. (1) To maintain consistency with previous works, we adhere to the comparison protocol they established [49], [51]. Specifically, the MEAD and HDTF datasets contain 43 available speakers and over 300 speakers, respectively; we randomly allocate 4 and 60 of these speakers for testing and use the remainder for training. The test set therefore comprises identities unseen during training, which enables a fair comparison. (2) While some works, such as PD-FGC, are not trained on the HDTF or MEAD datasets, they utilize the VoxCeleb2 dataset, which includes over 1 million utterances from 6,112 celebrities. This dataset is hundreds of times larger than ours, so these methods have ample training data. (3) Additionally, we conduct comparisons on the LRW and VoxCeleb2 datasets, neither of which is used to train our method. The results presented in Table 4 reaffirm the superiority of our approach and provide further validation of its performance.

8.2.0.4 Limitations.

While our current work has made significant strides, it also has certain limitations. First, due to the low resolution of the training data, our approach is constrained to generating videos at a resolution of \(256 \times 256\); consequently, the blurred teeth in the generated results may diminish their realism. Second, our method currently overlooks the influence of emotion on head pose, which represents a meaningful yet unexplored task. Unfortunately, the existing emotional MEAD dataset [45] maintains consistent head poses across emotions, making it challenging to model the impact of emotion on pose. However, once relevant datasets become available, our approach can readily be extended by introducing an emotion label \(e\) as an additional conditioning factor in Eq. (13), i.e., \(\hat{W}^p = \varphi_p(z, f^a_t, e)\).
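A minimal sketch of such an extension is shown below: the pose predictor is conditioned on an emotion label in addition to the latent \(z\) and the audio feature. The module is purely illustrative (the architecture, dimensions, and use of an embedding layer are assumptions), not the released design.

```python
# Hedged sketch of an emotion-conditioned pose predictor: \hat{W}^p = phi_p(z, f^a_t, e).
import torch
import torch.nn as nn

class EmotionConditionedPose(nn.Module):
    def __init__(self, latent_dim=64, audio_dim=512, n_emotions=8, pose_weight_dim=6):
        super().__init__()
        self.emo_embed = nn.Embedding(n_emotions, 32)  # emotion label e -> embedding
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + 32, 256), nn.ReLU(),
            nn.Linear(256, pose_weight_dim),            # predicted pose weights
        )

    def forward(self, z, audio_feat, emotion_id):
        e = self.emo_embed(emotion_id)                  # (B, 32)
        return self.mlp(torch.cat([z, audio_feat, e], dim=-1))
```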

8.2.0.5 Ethical considerations.

Our approach is geared towards generating talking face animations with individual facial control, which holds promise for applications such as entertainment and filmmaking. However, the technology could be maliciously misused on social media platforms, leading to negative societal implications. Despite significant advances in deepfake detection research [77]–[80], there is still room for improvement in detection accuracy, particularly as more diverse and comprehensive datasets become available. In this regard, we are pleased to share our generated talking face results, which can help detection algorithms better handle increasingly sophisticated scenarios.

References↩︎

[1]
Pat Pataranutaporn, Valdemar Danry, Joanne Leong, Parinya Punpongsanon, Dan Novy, Pattie Maes, and Misha Sra. Ai-generated characters for supporting personalized learning and well-being. Nature Machine Intelligence, 3(12):1013–1022, 2021.
[2]
Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. Learning dynamic facial radiance fields for few-shot talking head synthesis. In European Conference on Computer Vision, pages 666–682. Springer, 2022.
[3]
Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. In European Conference on Computer Vision, pages 345–362. Springer, 2022.
[4]
Kewei Yang, Kang Chen, Daoliang Guo, Song-Hai Zhang, Yuan-Chen Guo, and Weidong Zhang. Face2face \(\rho\): Real-time high-resolution one-shot face reenactment. In European conference on computer vision, pages 55–71. Springer, 2022.
[5]
Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. In European conference on computer vision, pages 85–101. Springer, 2022.
[6]
Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, and Baoyuan Wang. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17979–17989, 2023.
[7]
Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7832–7841, 2019.
[8]
Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 716–731. Springer, 2020.
[9]
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in neural information processing systems, 32, 2019.
[10]
Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3397–3406, 2022.
[11]
Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, and Yi Yang. Efficient emotional adaptation for audio-driven talking-head generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22634–22645, 2023.
[12]
Paul Ekman and Wallace V Friesen. Facial action coding system. Environmental Psychology & Nonverbal Behavior, 1978.
[13]
Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, and Baoyuan Wang. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7645–7655, 2023.
[14]
Borong Liang, Yan Pan, Zhizhi Guo, Hang Zhou, Zhibin Hong, Xiaoguang Han, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3387–3396, 2022.
[15]
Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xiaodong Cun, Ying Shan, and Dong-ming Yan. Dpe: Disentanglement of pose and expression for general video portrait editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 427–436, 2023.
[16]
Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4176–4186, 2021.
[17]
Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.
[18]
Radek Daněček, Michael J Black, and Timo Bolkart. Emoca: Emotion driven monocular face capture and animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311–20322, 2022.
[19]
Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation. In International Conference on Learning Representations, 2021.
[20]
Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8652–8661, 2023.
[21]
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.
[22]
Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14080–14089, 2021.
[23]
Shuai Tan, Bin Ji, and Ye Pan. Emmn: Emotional motion memory network for audio-driven emotional talking face generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22146–22156, 2023.
[24]
Dongze Li, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong, and Tieniu Tan. Ae-nerf: Audio enhanced neural radiance field for few shot talking head synthesis. arXiv preprint arXiv:2312.10921, 2023.
[25]
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[26]
Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 715–722. 2023.
[27]
Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. Semantic-aware implicit neural audio-driven video portrait generation. In European Conference on Computer Vision, pages 106–125. Springer, 2022.
[28]
Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. MakeItTalk: Speaker-aware talking-head animation. ACM Transactions On Graphics (TOG), 39(6):1–15, 2020.
[29]
Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojeshwar Bhowmick. Speech-driven facial animation using cascaded gans for learning of motion and texture. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 408–424. Springer, 2020.
[30]
Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9459–9468, 2019.
[31]
Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, and Guanbin Li. Identity-preserving talking face generation with landmark and appearance priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2023.
[32]
S Wang, L Li, Y Ding, C Fan, and X Yu. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. In International Joint Conference on Artificial Intelligence. IJCAI, 2021.
[33]
Suzhen Wang, Lincheng Li, Yu Ding, and Xin Yu. One-shot talking face generation from single-speaker audio-visual correlation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2531–2539, 2022.
[34]
Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-head generation with rhythmic head motion. In European Conference on Computer Vision, pages 35–51. Springer, 2020.
[35]
Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194, 1999.
[36]
Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 520–535, 2018.
[37]
Yang Song, Jingwen Zhu, Dawei Li, Andy Wang, and Hairong Qi. Talking face generation by conditional recurrent adversarial network. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Aug 2019.
[38]
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 9299–9306, 2019.
[39]
Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, and Sheng Zhao. Vast: Vivify your talking avatar via zero-shot expressive facial style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2977–2987, 2023.
[40]
Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Lipformer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13844–13853, 2023.
[41]
Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1982–1991, 2023.
[42]
KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484–492, 2020.
[43]
Shuai Tan, Bin Ji, and Ye Pan. Flowvqtalker: High-quality emotional talking face generation through normalizing flow and quantization. arXiv preprint arXiv:2403.06375, 2024.
[44]
Sanjana Sinha, Sandika Biswas, Ravindra Yadav, and Brojeshwar Bhowmick. Emotion-controllable generalized talking face generation. In International Joint Conference on Artificial Intelligence. IJCAI, 2021.
[45]
Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI, pages 700–717. Springer, 2020.
[46]
Ye Pan, Shuai Tan, Shengran Cheng, Qunfen Lin, Zijiao Zeng, and Kenny Mitchell. Expressive talking avatars. IEEE Transactions on Visualization and Computer Graphics, 2024.
[47]
Ye Pan, Ruisi Zhang, Shengran Cheng, Shuai Tan, Yu Ding, Kenny Mitchell, and Xubo Yang. Emotional voice puppetry. IEEE Transactions on Visualization and Computer Graphics, 29(5):2527–2535, 2023.
[48]
Shuai Tan, Bin Ji, Yu Ding, and Ye Pan. Say anything with any style. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5088–5096, 2024.
[49]
Xinya Ji, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Wayne Wu, Feng Xu, and Xun Cao. Eamm: One-shot emotional talking face via audio-based emotion-aware motion model. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
[50]
Shuai Tan, Bin Ji, and Ye Pan. Style2talker: High-resolution talking head generation with emotion style and art style. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5079–5087, 2024.
[51]
Yifeng Ma, Suzhen Wang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Zhidong Deng, and Xin Yu. Styletalk: One-shot talking head generation with controllable speaking styles. arXiv preprint arXiv:2301.01081, 2023.
[52]
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 694–711. Springer, 2016.
[53]
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
[54]
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[55]
Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pages 1501–1510, 2017.
[56]
Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, and Yong Man Ro. Synctalkface: Talking face generation with precise lip-syncing via audio-lip memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2062–2070, 2022.
[57]
Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 251–263. Springer, 2017.
[58]
Yifeng Ma, Suzhen Wang, Yu Ding, Bowen Ma, Tangjie Lv, Changjie Fan, Zhipeng Hu, Zhidong Deng, and Xin Yu. Talkclip: Talking head generation with text-guided expressive speaking styles. arXiv preprint arXiv:2304.00334, 2023.
[59]
Yingzhi Wang, Abdelmoumene Boumadane, and Abdelwahab Heba. A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding. arXiv preprint arXiv:2111.02735, 2021.
[60]
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
[61]
Taewoon Kim and Piek Vossen. Emoberta: Speaker-aware emotion recognition in conversation with roberta. arXiv preprint arXiv:2108.12009, 2021.
[62]
Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021.
[63]
Joon Son Chung and Andrew Zisserman. Lip reading in the wild. In Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 87–103. Springer, 2017.
[64]
Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.
[65]
Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T Tan, and Haizhou Li. Seeing what you said: Talking face generation guided by a lip reading expert. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14653–14662, 2023.
[66]
Sahil Goyal, Sarthak Bhagat, Shagun Uppal, Hitkul Jangra, Yi Yu, Yifang Yin, and Rajiv Ratn Shah. Emotionally enhanced talking face generation. In Proceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice, pages 81–90, 2023.
[67]
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
[68]
Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.
[69]
Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. Frame attention networks for facial expression recognition in videos. In 2019 IEEE international conference on image processing (ICIP), pages 3866–3870. IEEE, 2019.
[70]
Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence, 44(12):8717–8727, 2018.
[71]
Sanjana Sinha, Sandika Biswas, Ravindra Yadav, and Brojeshwar Bhowmick. Emotion-controllable generalized talking face generation. In International Joint Conference on Artificial Intelligence. IJCAI, 2021.
[72]
Yurui Ren, Ge Li, Yuanqi Chen, Thomas H Li, and Shan Liu. Pirenderer: Controllable portrait image generation via semantic neural rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13759–13768, 2021.
[73]
Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021.
[74]
Fa-Ting Hong and Dan Xu. Implicit identity representation conditioned memory compensation network for talking head video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23062–23072, 2023.
[75]
Kaisiyuan Wang, Hang Zhou, Qianyi Wu, Jiaxiang Tang, Zhiliang Xu, Borong Liang, Tianshu Hu, Errui Ding, Jingtuo Liu, Ziwei Liu, et al. Efficient video portrait reenactment via grid-based codebook. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–9, 2023.
[76]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[77]
Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding properties that generalize. Lecture Notes in Computer Science, 2020.
[78]
David Guera and Edward J. Delp. Deepfake video detection using recurrent neural networks. Advanced Video and Signal Based Surveillance, 2018.
[79]
Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE international conference on computer vision, pages 609–617, 2017.
[80]
Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. International Conference on Computer Vision, 2021.

  1. Corresponding author.↩︎