November 26, 2024
In recent years, deepfakes (DFs) have been utilized for malicious purposes, such as impersonating individuals, spreading misinformation, and imitating artists’ styles, raising serious ethical and security concerns. However, existing surveys have focused on the detection accuracy of passive DF detection approaches for single modalities, such as image, video, or audio. This comprehensive survey explores passive approaches across multiple modalities, including image, video, audio, and multi-modal domains, and extends the discussion beyond detection accuracy to generalization, robustness, attribution, and interpretability. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and different levels of adversary knowledge and capabilities. We also highlight current challenges in DF detection, including the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. Finally, we propose future research directions that address these unexplored and emerging issues in the field of passive DF detection, such as adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.
The term "deepfake" (DF) describes realistic media produced by a Generative Artificial Intelligence (GenAI) model for malicious purposes [1]. It is crucial to distinguish DFs from entirely synthetic data. DFs typically manipulate or synthesize media based on existing real data, whereas entirely synthetic content is generated from noise by learning from large datasets [2], [3]. The increasing popularity of DFs presents significant challenges to privacy, security, and information integrity, such as high-profile impersonations [4]–[6], the creation and spread of fake news [7], and the unauthorized imitation of artists’ styles [8], [9]. Consequently, there is a growing need for effective detection approaches. Passive DF detection has emerged as a promising approach to address this challenge which aims to identify synthetic content after its creation without prior knowledge of the generation process [10]–[16]. By analyzing the intrinsic properties of the media itself, passive approaches offer a versatile solution applicable to scenarios where the origin or creation process of the content is unknown.
This survey provides a comprehensive overview of passive DF detection approaches across multiple modalities, including image, video, audio, and multi-modal domains. We present a novel taxonomy of existing methodologies, categorizing them based on their underlying principles and techniques. Our analysis compares the advantages and disadvantages of various approaches, offering insights into their effectiveness and applicability. Beyond detection accuracy, we extend our discussion to crucial aspects such as generalization capability, robustness against adversarial attacks, attribution of DFs to their sources, and the interpretability of detection models. Figure 1 illustrates our taxonomy for passive approaches across multi-modalities. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and varying levels of adversary knowledge and capabilities.
Survey Method. In this work, we have conducted an exhaustive search for relevant publications in top-tier artificial intelligence (AI) and security conferences and journals spanning the past five years, using databases such as IEEE Xplore, ACM Digital Library, and Google Scholar. This thorough approach allows us to present the most current and relevant advancements in the field of passive DF detection.
Related survey works. Recent surveys have focused on passive approaches for single modalities [1], [17]–[21]. However, comparing approaches across domains remains challenging due to the disparate nature of features, artifacts, and evaluation metrics specific to each modality. To the best of our knowledge, our survey is the first to review passive approaches across multiple modalities and to extend the review beyond detection accuracy to generalization, robustness, attribution, and interpretability. While our work shares similarities with [22], which focused on entirely synthetic media generated by large AI models, we specifically address the unique challenges posed by DFs. Notably, our survey is also the first to discuss threat models for passive approaches, providing crucial insights into security considerations for robust detection systems.
Survey structure. Section 2 presents common benchmarks, datasets, and evaluation metrics. We review passive DF detection approaches across multiple modalities in Section 3 and explore aspects beyond detection accuracy, including generalization, robustness, attribution, and interpretability, in Section 4. Section 5 defines threat and defense models for DF detection. Section 6 discusses current challenges and future research directions, and Section 7 concludes the survey with key findings and insights.
| Domain | Dataset | Year | Real samples | Fake samples | Manipulation* | Augmentation* | Quality* |
|---|---|---|---|---|---|---|---|
| Image | DFFD [23] | 2019 | 58,703 | 240,336 | FS, FR, FE | - | L |
| Image | FakeSpotter [24] | 2019 | 6,000 | 5,000 | - | - | H |
| Image | ForgeryNet [25] | 2021 | 1,438,201 | 1,457,861 | FS, FR, FE | G, C, L, CT | L, H |
| Image | DiffusionForensics [26] | 2023 | 134,000 | 137,200 | - | - | - |
| Video | UADFV [27] | 2018 | 49 | 49 | FS | - | L |
| Video | DF-TIMIT [28] | 2018 | 320 | 640 | FS | - | L |
| Video | DFFD [23] | 2019 | 1,000 | 3,000 | FS, FR, FE | - | L |
| Video | FF++ [29] | 2019 | 1,000 | 4,000 | FS, FR | C, Q | L |
| Video | DFDC [30] | 2019 | 1,131 | 4,113 | - | G, C, CT | H |
| Video | Celeb-DF [31] | 2020 | 590 | 5,639 | FS, FR | L | H |
| Video | DeeperForensics-1.0 [32] | 2020 | 50,000 | 10,000 | FS | G, C, TE, P | H |
| Video | Wild-DF [33] | 2021 | 3,805 | 3,509 | - | - | H |
| Video | ForgeryNet [25] | 2021 | 99,630 | 121,617 | FS, FR, FE | G, C, L, CT | H |
| Video | DF-Platter [34] | 2023 | 133,260 | 132,496 | FS | C | L, H |
| Voice | FoR [35] | 2019 | 111,000 | 87,000 | VC | C | Cl |
| Voice | ASVspoof 2021 [36] | 2021 | 22,617 | 589,212 | VC | - | Cl, N |
| Voice | WaveFake [37] | 2021 | 18,100 | 117,985 | VC | - | Cl |
| Voice | ADD 2022 [38] | 2022 | 36,953 | 123,932 | VC | - | Cl |
| Voice | ADD 2023 [39] | 2023 | 172,819 | 113,042 | VC | - | Cl, N |
| Multi-modal | FakeAVCeleb [40] | 2021 | 10,000 | 10,000 | FS, FR, VC | G, C, CT | H |
| Multi-modal | LAV-DF [41] | 2022 | 36,431 | 99,873 | VC, FE | - | L |
| Multi-modal | AV-Deepfake1M [42] | 2024 | 286,721 | 860,039 | VC, FE | - | H |

* Manipulation types: FS (face swapping), FR (face reenactment), FE (face editing), VC (voice conversion). Augmentation techniques applied in the dataset: G (Gaussian blur), C (compression), L (lighting modification), Q (quantization), CT (color transformation), TE (transmission error), P (perturbation). Quality of generated samples: H (high), L (low), Cl (clean), N (noisy).
Table [tab:passive-dataset] highlights the characteristics of datasets for passive DF detection in each domain.
Image and Video. Image and video datasets for DF detection are often interchangeable, as videos can be treated as sequences of static images. Many datasets have been developed to address various aspects of DF detection. The UADFV dataset [27] focuses on head pose inconsistencies, while DF-TIMIT [28] uses GAN-based face swapping. FaceForensics++ (FF++) [29] includes real videos from YouTube and DF videos generated using Face2Face, FaceSwap, and NeuralTextures techniques. The DFDC dataset [30] offers a large collection of DF videos with diverse demographics. DFFD [23] is a collection of \(2.6\) million real and fake images covering four main types of facial manipulation: identity swap, expression swap, attribute manipulation, and entire face synthesis. Celeb-DF [31] presents high-quality DF videos of celebrities, while DeeperForensics-1.0 [32] contains videos generated using the DF Variational AutoEncoder (DF-VAE) framework. Wild-DF [33] contains both real and DF samples collected directly from the internet, making it one of the most challenging DF detection datasets to date. ForgeryNet [25] provides comprehensive annotations for multiple forgery analysis tasks. DiffusionForensics [26] features images generated by various diffusion models across three public computer vision datasets. DF-Platter [34] is the only dataset that features both single-subject and multi-subject DFs.

Audio. Fake-or-Real (FoR) [35] was the first DF audio dataset, containing over \(198,000\) real and fake utterances. Compared to ASVspoof 2019 [43], ASVspoof 2021 [36] contains an additional DF subset, which comprises manipulated speech compressed with various codecs. However, these datasets are limited in the diversity of synthesis methods and often use outdated models. To overcome these limitations, [37] introduced WaveFake, which includes DF utterances generated by six state-of-the-art (SOTA) speech synthesis models. The ADD challenges presented datasets for challenging scenarios, including low-quality fakes and partial manipulations [38], [39].
Multi-modal. FakeAVCeleb [40] is an Audio-Video (AV) multimodal DF dataset containing DF videos together with their corresponding synthesized, lip-synced fake audio. LAV-DF [41] is specifically designed for the task of learning temporal forgery localization, containing content-driven audio-visual manipulations performed strategically to change the sentiment polarity of videos. AV-Deepfake1M [42] is a large-scale AV DF dataset containing over 1 million videos.
This section provides an overview of commonly utilized metrics for passive DF detection methodologies. Note that for the video domain, these metrics can be calculated either at the frame level or aggregated over the entire video sequence, depending on the specific task and evaluation criteria.
Accuracy Rate (ACC) [44]. This metric denotes the overall proportion of samples (both bona fide and DF) correctly classified by the detection approach.
Area Under the ROC Curve (AUC) [45]. The AUC quantifies the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at different decision thresholds. Higher AUC values denote better detection performance across all possible thresholds.
Average Precision (AP) [46]. This metric summarizes the precision-recall curve into a single value by computing the average of the precisions achieved at each threshold.
F1-score [47]. The F1-score is calculated as the harmonic mean of precision and recall.
Equal Error Rate (EER) [48]. The EER corresponds to the operating point where the false negative rate and false positive rate are equal.
Intersection over Union (IoU) [49]. IoU measures the overlap between the predicted and ground-truth manipulation masks. It is defined as the area of their intersection divided by the area of their union.
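To make the computation of these metrics concrete, the sketch below evaluates a hypothetical detector's scores with scikit-learn; the function names and the 0.5 decision threshold are illustrative, not taken from any surveyed work.

```python
# Minimal sketch: binary labels (1 = fake) and detector scores in [0, 1].
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, roc_curve,
                             average_precision_score, f1_score)

def deepfake_metrics(y_true, y_score, threshold=0.5):
    y_pred = (y_score >= threshold).astype(int)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]  # operating point where FPR ~= FNR
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
        "AP": average_precision_score(y_true, y_score),
        "F1": f1_score(y_true, y_pred),
        "EER": eer,
    }

def iou(pred_mask, gt_mask):
    """Overlap between predicted and ground-truth manipulation masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0
```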
Table [tab:img-detection]. Unimodal DF detection approaches in the image domain.

| Category | Approach | Article | Key Idea | GenAI* | Metric |
|---|---|---|---|---|---|
| Forensic-based | Physical-based | [50] | Co-occurrence matrices | GAN | ACC |
| Forensic-based | Physical-based | [51] | Chrominance components | GAN | ACC |
| Forensic-based | Physical-based | [52] | Dual-color spaces | GAN | ACC |
| Forensic-based | Physiologically-based | [27] | Facial attributes | GAN | ACC |
| Forensic-based | Physiologically-based | [49] | Eyes inconsistency | GAN | IoU |
| Data-driven | Conventional classification | [53], [54] | Binary classification | GAN | AUC |
| Data-driven | Conventional classification | [55] | Binary classification | DM | ACC, AUC |
| Data-driven | Conventional classification | [56] | Multi-task classification | GAN | ACC, AUC |
| Data-driven | Conventional classification | [57] | Fine-grained classification | GAN | ACC, AUC |
| Data-driven | Conventional classification | [47] | Fine-grained classification | GAN + DM | AUC, F1-score |
| Data-driven | Segmentation map estimation | [58] | Binary map | GAN | AUC |
| Data-driven | Segmentation map estimation | [59] | Gray-scale map | GAN | AUC |
| Data-driven | Segmentation map estimation | [60] | Gray-scale map | GAN | ACC, AUC, EER, IoU |
| Data-driven | Segmentation map estimation | [61] | Siamese network | GAN | ACC, AUC |
| Data-driven | Reconstruction-based learning | [62] | Re-synthesis | GAN | ACC |
| Data-driven | Reconstruction-based learning | [63] | Re-synthesis + graph reasoning | GAN | ACC, AUC, EER |
| Data-driven | Reconstruction-based learning | [10] | Re-synthesis + federated learning | GAN | ACC, AUC |
| Data-driven | Reconstruction-based learning | [11] | Re-synthesis + two reconstruction heads | GAN | ACC, AUC, EER |
| Data-driven | Spatial noise fusion | [12], [64], [65] | Noise features | GAN | ACC, AUC |
| Data-driven | Spatial noise fusion | [12] | Multi-scale noise features | GAN | ACC, AUC, EER, IoU |
| Fingerprint-based | GAN fingerprint | [66] | Gram matrices | GAN | ACC |
| Fingerprint-based | GAN fingerprint | [67] | Spectral features | GAN | ACC |
| Fingerprint-based | GAN fingerprint | [68] | Spectral features | GAN | ACC, AUC |
| Fingerprint-based | GAN fingerprint | [69] | Spectral features | GAN | ACC, AP |
| Fingerprint-based | DM fingerprint | [26] | Reconstruction error | DM | ACC, AUC |
| Fingerprint-based | DM fingerprint | [70] | Stepwise reverse-denoising error | DM | ACC, AUC |
| Hybrid | Single-stream network | [71]–[73] | Spatial + spectral features | GAN | ACC, AUC |
| Hybrid | Two-stream network | [74] | Multi-task learning | GAN | ACC, AUC |
| Hybrid | Two-stream network | [75] | Single-center loss | GAN | ACC, AUC |
| Hybrid | Two-stream network | [76] | Graph reasoning | GAN | ACC, AUC, EER |
| Hybrid | Two-stream network | [77] | Adaptive learning | GAN | AUC |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks), DMs (Diffusion Models).
This section provides a comprehensive overview of passive approaches designed to identify artifacts in DF content. We organize this section by the primary modalities of DFs, covering unimodal (image, video, and audio) and multi-modal domains.
Unimodal DF detection approaches focus on analyzing a single modality, typically either the visual or audio component of media, to identify manipulated content. In the following subsections, we provide a comprehensive review of SOTA approaches for each domain, discussing their methodologies, strengths, and limitations.
We divide the methods in the image domain into four main categories: Forensic-based, Data-driven, Fingerprint-based, and Hybrid. Figure 2 illustrates these four categories and Table [tab:img-detection] summarizes unimodal image-level approaches.
Forensic-based methods. Methods in this group aim to identify inconsistencies between natural and GenAI-generated faces by examining facial attributes, texture information, and color information. Forensic-based methods can be categorized into two main subgroups: (i) physical-based methods and (ii) physiologically-based methods.
The first group focuses on identifying inconsistencies with the physical world, such as lighting conditions, shadows, and reflections. [50] hypothesized that GAN-generated images might fail to accurately replicate complex inter-channel relationships in natural images and utilized co-occurrence matrices to capture and analyze cross-band correlations. Expanding on this concept, other works examined chrominance components in the YCbCr color space [51] or analyzed both luminance and chrominance across RGB and YCbCr spaces [52]. Physiologically-based methods aim to detect the absence of natural physiological signals intrinsic to living beings but often overlooked or poorly replicated in DFs, such as facial asymmetry, eye blinking, and iris color. [78] extracted inconsistent 3D head poses estimated from facial landmarks across the whole face as a physiological signal to detect DFs, while [49] relied on the inconsistency of corneal specular highlights between the two eyes to discriminate fake from real faces.
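As an illustration of the cross-band idea behind [50], the minimal sketch below builds a co-occurrence matrix over two color channels; the normalization and the channel pairing are illustrative, and the original work's exact feature construction may differ.

```python
# Sketch: joint histogram of intensity pairs across two color channels.
# A CNN classifier would then be trained on such 256x256 matrices.
import numpy as np

def cross_band_cooccurrence(img, ch_a=0, ch_b=1):
    """img: HxWx3 uint8 array; returns a normalized 256x256 co-occurrence matrix."""
    a = img[..., ch_a].ravel()
    b = img[..., ch_b].ravel()
    mat = np.zeros((256, 256), dtype=np.int64)
    np.add.at(mat, (a, b), 1)      # accumulate counts of co-occurring pairs
    return mat / mat.sum()         # normalize to a joint distribution
```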
Data-driven methods. These approaches leverage the power of deep learning algorithms to automatically learn discriminative features from large-scale datasets. These methods can be broadly categorized into several subgroups based on their unique characteristics and methodologies.
Conventional classification methods simply formulate DF detection as a classification problem and employ various techniques to extract features from inputs. Central difference convolution and atrous spatial pyramid pooling have been employed by [53] to extract multi-scale texture difference features. However, some researchers argue that simply treating DF detection as a binary classification problem may not be optimal due to the subtle and localized nature of the differences between real and fake images [56], [57]. [56] reformulated the problem as multi-task learning by designing a network that can detect multiple types of DF face images simultaneously, including real face images, entire face synthesis, face swap, and facial attribute manipulation. [57] reformulated it as a fine-grained classification task and applied multi-scale attention to capture the artifacts of various forgery attributes.
Segmentation map estimation (SME) approaches focus on automatically identifying manipulated regions at the pixel level by estimating segmentation maps. These methods often use an \(\mathcal{L}_1\) loss to measure the difference between predicted and ground-truth maps. [23] used an attention-based layer to estimate an attention map, which is then multiplied channel-wise with the input features to refine them, allowing the network to focus on informative regions for classification. This technique was also utilized by [58] to aggregate features from different regions, generating a manipulation mask where each pixel represents the probability of the corresponding image pixel being manipulated. [59] computed gray-scale maps that can reveal the blending boundary in forged face images, while [61] leveraged a Siamese network with two branches to produce segmentation maps for the original image and a quality-degraded version. However, earlier methods often produce low-resolution maps with limited information. [60] addressed this limitation by leveraging an encoder-decoder network to generate full-resolution gray-scale maps, which indicate the likelihood of manipulation for each pixel in the input image. Nevertheless, SME methods often require pixel-level ground-truth maps, which can be labor-intensive to obtain, and their performance may be sensitive to the quality of the segmentation maps.
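The PyTorch fragment below sketches this typical SME training objective, assuming a hypothetical encoder-decoder `model` that returns a one-channel manipulation map and a classification logit; the loss weighting is illustrative.

```python
# Sketch: combine an L1 map loss with a standard classification loss.
import torch
import torch.nn.functional as F

def sme_loss(model, image, gt_mask, gt_label, map_weight=10.0):
    pred_map, pred_logit = model(image)         # (B,1,H,W) map, (B,) logit
    map_loss = F.l1_loss(pred_map, gt_mask)     # pixel-level supervision
    cls_loss = F.binary_cross_entropy_with_logits(pred_logit, gt_label)
    return cls_loss + map_weight * map_loss     # weight is a hyperparameter
```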
Reconstruction-based learning methods leverage the discrepancies between original images and their reconstructed counterparts to detect DFs. [63] used an encoder-decoder reconstruction network that takes a real input image and attempts to reconstruct it, while [62] proposed reconstructing high-resolution images from the inputs’ downsampled versions. [10] applied reconstruction to both real and fake images and focused on analyzing the residual differences between the original and reconstructed images to detect forgeries. However, [11] recognized that single-reconstruction methods have limited feature representation and proposed a double-head reconstruction module that combines discrepancy-guided encoding, dual reconstruction, and aggregation-based detection to improve forgery detection performance. While reconstruction-based learning is promising for generalization, it may struggle to detect high-quality DFs that closely resemble real images, as the reconstruction errors may be small.
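A schematic of the signal these methods exploit, assuming a hypothetical autoencoder trained on real faces only: reconstruction residuals tend to be larger for fakes, so the per-image residual can serve as a detection score or as input to a downstream classifier.

```python
# Sketch: per-image reconstruction residual as a forgery score.
import torch

@torch.no_grad()
def reconstruction_residual(autoencoder, image):
    recon = autoencoder(image)                 # (B,3,H,W) reconstruction
    residual = (image - recon).abs()           # per-pixel discrepancy
    return residual.flatten(1).mean(dim=1)     # one scalar score per image
```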
Spatial and noise fusion methods recognize the importance of noise patterns in DF detection and combine both spatial and noise features. [64] first proposed a two-stream network that leverages noise features extracted by SRM filters [79] alongside RGB features. Concurrently, [12] enhanced the approach by extracting multi-scale noise features, employing SRM not only on the input image but also on low-level feature maps at multiple scales. [65] expanded the idea by introducing a localization branch designed to detect forged regions. However, the effectiveness of these methods may be limited when dealing with advanced DF techniques that can better mimic the noise characteristics of real images [80].
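For illustration, the sketch below applies one classic high-pass kernel from the SRM filter bank [79] to obtain a noise residual stream; published systems use many such filters and learn further representations on top of the residuals.

```python
# Sketch: depthwise application of one SRM-style high-pass kernel.
import torch
import torch.nn.functional as F

SRM_KERNEL = torch.tensor([[-1.,  2., -1.],
                           [ 2., -4.,  2.],
                           [-1.,  2., -1.]]) / 4.0

def srm_residual(image):
    """image: (B,3,H,W) float tensor; returns per-channel noise residuals."""
    k = SRM_KERNEL.view(1, 1, 3, 3).repeat(3, 1, 1, 1)   # one kernel per channel
    return F.conv2d(image, k, padding=1, groups=3)
```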
Fingerprint-based methods. Fingerprint-based methods for DF detection explore the unique patterns or artifacts unintentionally embedded into generated images by the architectural design and training process of GenAI models. Several studies have shown that the upsampling operation of GANs often leaves a trace in the frequency domain, which provides cues for discriminating DF images from real ones. [66] used gray-level co-occurrence matrices to confirm the presence of these artifacts and proposed using Gram matrices to capture global texture representations at multiple semantic levels. Concurrently, other studies used the spectrum of the image as input to a DL classifier to detect these artifacts [67]–[69]. Recently, some works have leveraged the unique characteristics inherent to the diffusion generation process to detect diffusion-generated images. [26] measured the error between an input image and its reconstructed version obtained by a pre-trained diffusion model, based on the observation that diffusion-generated images can be more accurately reconstructed than real images. Meanwhile, [70] leveraged the timestep errors between the reverse and denoising steps as a form of fingerprint to identify fake images.
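A minimal sketch of the spectral-input idea in [67]–[69]: GAN upsampling artifacts appear as periodic peaks in the log-magnitude spectrum, which can then be fed to an ordinary image classifier in place of the raw pixels.

```python
# Sketch: centered log-magnitude spectrum as classifier input.
import numpy as np

def log_spectrum(gray_img):
    """gray_img: HxW float array; returns the centered log-magnitude spectrum."""
    spec = np.fft.fftshift(np.fft.fft2(gray_img))
    return np.log1p(np.abs(spec))
```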
Table [tab:video-detection]. Unimodal DF detection approaches in the video domain.

| Category | Approach | Article | Key Idea | GenAI* | Metric |
|---|---|---|---|---|---|
| Forensic-based | Physiologically-based | [81] | rPPG | GANs | ACC, AUC |
| Forensic-based | Physiologically-based | [82] | Ear inconsistency | GANs | AUC |
| Forensic-based | Physiologically-based | [83] | Facial landmarks | GANs | AUC |
| Forensic-based | Physiologically-based | [84] | Facial geometry + optical flow | GANs | ACC, AUC |
| Forensic-based | Physiologically-based | [85] | Facial symmetry + multi-stream learning | GANs | ACC, AUC, F1-score |
| Data-driven | Frame-based | [86] | Single-frame inconsistency + model ensembling | GANs | AUC |
| Data-driven | Frame-based | [87], [88] | Single-frame inconsistency + noise analysis | GANs | AUC, ACC |
| Data-driven | Frame-based | [89] | Single-frame inconsistency + surface geometry | GANs | AUC |
| Data-driven | Frame-based | [90] | Optical flow | GANs | ACC |
| Data-driven | Frame-based | [91] | Adjacent-frame inconsistency | - | ACC, AUC |
| Data-driven | Temporal-based | [92]–[94] | CNN-RNN hybrid | GANs | F1-score, ACC |
| Data-driven | Temporal-based | [95] | Identity-based inconsistency | GANs | ACC, AUC |
| Data-driven | Temporal-based | [96] | Identity-based inconsistency + 3D modelling | GANs | ACC, AUC |
| Data-driven | Temporal-based | [13] | Identity-based inconsistency + triplet loss | GANs | ACC, AUC |
| Data-driven | Temporal-based | [97] | Two-stage network | GANs | ACC |
| Data-driven | Temporal-based | [98] | Lipreading extractor | GANs | ACC, AUC |
| Data-driven | Temporal-based | [14], [99]–[101] | Snippet-based sampling | GANs | ACC, AUC |
| Data-driven | Temporal-based | [102] | Future frame prediction | GANs | ACC, AUC |
| Data-driven | Temporal-based | [103] | Difference between face manipulation and facial motions | GANs | AUC |
| Data-driven | Temporal-based | [104] | Spatial + frequency fusion | GANs | ACC, AUC, F1-score |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks).
Hybrid methods. Detection approaches that rely solely on spatial-domain information have been shown to be highly susceptible to variations in dataset quality and post-processing operations, such as high levels of compression or the presence of Gaussian noise [46], [68]. To address this limitation, hybrid approaches leverage the complementary strengths of spatial and frequency-domain features by integrating them into a unified framework. Based on the integration technique, methods in this category can be classified into single-stream and two-stream networks.
Single-stream hybrid methods utilize a single network to extract both spatial and frequency features, allowing the joint learning of spatial and frequency information. [73] proposed concatenating the RGB image with a spatial representation of its phase spectrum along the channel dimension. Other works [71], [72] have extracted frequency features from the intermediate RGB feature maps at different levels within the network. The motivation behind this approach is to discard low-frequency components and retain only middle-high frequency information, based on the observation that the differences between authentic and manipulated faces are more prominent in the middle-high frequency regions.
In contrast, two-stream hybrid approaches learn spatial and frequency features separately using dedicated network branches and then fuse the learned features for the final classification. [75] proposed a single-center loss to enhance the separability between real and fake faces. In the framework designed by [74], the RGB stream uses multi-scale transformers to capture inconsistencies at different spatial scales, while the frequency stream adopts learnable frequency filters to extract forgery features in the frequency domain. [105] suggested exploiting fine-grained frequency cues to decouple real and fake traces. [76] proposed a framework that exploits high-order relationships between domains via dynamic graph learning. [77] hypothesized that adaptively capturing image-specific forgery clues in both domains can improve the performance of face forgery detection models.
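Schematically, a two-stream hybrid detector can be sketched as below; both backbones are placeholders rather than any specific published architecture, and the late concatenation stands in for the more elaborate fusion modules discussed above.

```python
# Sketch: two branches (RGB and frequency) fused for a real/fake logit.
import torch
import torch.nn as nn

class TwoStreamDetector(nn.Module):
    def __init__(self, spatial_backbone, freq_backbone, feat_dim=512):
        super().__init__()
        self.spatial = spatial_backbone      # e.g., a CNN over RGB frames
        self.freq = freq_backbone            # e.g., a CNN over spectra
        self.head = nn.Linear(2 * feat_dim, 1)

    def forward(self, rgb, spectrum):
        fused = torch.cat([self.spatial(rgb), self.freq(spectrum)], dim=1)
        return self.head(fused)              # real/fake logit
```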
Discussion. Each category of DF detection methods has its own advantages and disadvantages. Forensic-based approaches offer interpretability through explainable evidence but may struggle with advanced generative models that can mimic the physical and physiological properties of genuine media. Data-driven methods can detect subtle patterns but require extensive labeled data, potentially lack generalization to novel DFs, and raise transparency concerns due to their black-box nature. Fingerprint-based techniques are content-agnostic but vulnerable to intentional fingerprint removal (e.g., by incorporating an additional constraint into the GenAI model’s loss function) [106], [107]. Hybrid methods show improved robustness against variations in dataset quality and post-processing operations but increase computational complexity and require careful design of the fusion strategy.
In recent years, the rapid advancement of GenAI models has led to the proliferation of highly realistic DF videos. Researchers have proposed various methods to identify fake videos, which can be broadly categorized into two main groups: forensic-based and data-driven methods. Figure 3 illustrates the main categories in the video domain and Table [tab:video-detection] summarizes unimodal video-level approaches.
Forensic-based methods focus on analyzing physiological inconsistencies present in videos. [82] exploited both the static biometric shape of the human ear and the dynamic correlations between aural movements in different ear regions and oral signals such as lip motions and audio. [83] utilized spatial angles and temporal rotation angles to characterize the inherent consistency of facial landmarks. [84] extracted facial landmarks from videos and used temporal modeling on the precise landmark locations to identify abnormal facial movements and time discontinuities that are characteristic of DFs. [85] cropped symmetrical face patches by randomly selecting symmetrical areas on either side of the face’s vertical symmetry axis, and then analyzed these patches to detect inconsistencies in DF videos. Several studies [81], [108] have utilized remote photoplethysmography (rPPG), which detects subtle color changes in human skin caused by blood flow under the tissues, for DF detection.
Data-driven approaches rely on DL algorithms trained on large datasets to learn the distinguishing features between real and fake videos. These detection methods can be broadly categorized into frame-based and video-level approaches.
Frame-based methods focus on extracting discriminative features from individual frames without leveraging temporal information. The final decision is often obtained by averaging the results over all frames. [86] investigated ensembling various trained CNN models, while [87] leveraged a Siamese network with a denoiser to assess the interaction between noise features from the face and background regions in each frame. [88] focused on analyzing the characteristics of surfaces depicted in each frame to obtain a global surface descriptor for training a CNN. [89] presented a framework that captures broader forgery clues by extracting multiple non-overlapping local representations and fusing them into a global semantic-rich feature. In contrast to single-frame methods, [91] and [90] utilized the inconsistency information between adjacent frames and optical flow, respectively, to capture inter-frame inconsistencies in DF videos.
Video-based methods utilize a sequence of frames as input to identify temporal artifacts in fake videos. Early attempts [92]–[94] combined convolution-based and recurrent-based networks to extract global spatio-temporal features; however, this approach has been shown to be less effective [109]. Researchers have since designed specific architectures to capture spatial and temporal information more efficiently. [104] proposed a two-branch network structure, in which an RGB branch extracts spatial features while another branch applies a Laplacian of Gaussian filter to suppress face content and amplify artifacts in the frequency domain. [97] incorporated a Temporal Transformer [110] following a Temporal Convolution Network (TCN) [111] to investigate long-term temporal coherence by capturing long-range dependencies along the time dimension of the TCN features. [98] designed a multi-scale TCN that includes a 3D convolutional layer to extract short-term and long-term temporal features. Rather than relying on sparsely sampled frames, some works [14], [99]–[101] have focused on capturing local temporal inconsistency within densely sampled video snippets, where the entire video is densely divided into multiple snippets. This dense sampling facilitates the capture of local inconsistencies caused by subtle motions, as the variations are more readily observable among adjacent frames. [102] proposed predicting future frames for DF detection, hypothesizing that the correlation between predicted and actual future facial representations will be lower for fake videos than for real ones. [103] aimed to address the disturbance caused by facial motions that limits detection performance; they employed self-attention and a fine-grained denoising operation to eliminate inter-frame differences caused by facial motions while highlighting differences caused by forgery. Some other works aim to identify identity-related inconsistencies in DF videos, assuming that a face-swapped DF is simply not the person it purports to be. [95] combined a static biometric based on facial recognition with a temporal biometric that captures spatio-temporal facial expressions and head movements. [96] and [13] captured temporal identity features by adopting a 3D morphable model and a triplet loss, respectively.
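The dense snippet sampling used by [14], [99]–[101] can be illustrated in a few lines; the snippet length is a hypothetical hyperparameter, and real pipelines add per-snippet sampling strategies on top.

```python
# Sketch: divide a video tensor into short contiguous snippets so a detector
# can focus on local temporal inconsistencies between adjacent frames.
import torch

def dense_snippets(video, snippet_len=4):
    """video: (T,C,H,W); returns (N, snippet_len, C, H, W) snippets."""
    t = video.shape[0] - video.shape[0] % snippet_len   # drop trailing frames
    return video[:t].reshape(-1, snippet_len, *video.shape[1:])
```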
Discussion. Although forensic-based methods inherently offer better explanations of detectors’ decisions, they may require high-quality video data to extract and analyze physiological signals, such as eye blinking patterns, which may not always be available in real-world scenarios since DF videos are often compressed. Frame-based approaches can benefit from advancements in DF image detection, but they fail to capture important temporal inconsistencies that are only discernible when analyzing multiple frames in sequence. While video-level methods address this issue by considering the entire frame sequence, they require significant memory resources, which may hinder their practical application in real-time or resource-constrained scenarios. Furthermore, methods that focus on identifying identity-related inconsistencies may encounter difficulties with face reenactment, where the identity is preserved throughout the manipulated video.
Table [tab:audio-detection]. Unimodal DF detection approaches in the audio domain.

| Category | Approach | Article | Key Idea | GenAI* | Metric |
|---|---|---|---|---|---|
| Frequency-based | Physical-based | [112] | Bispectral analysis | GANs | AUC |
| Frequency-based | Physical-based | [113] | Bispectral + cepstral analysis | GANs | AUC, ACC |
| Data-driven | Learnable features | [114] | Hierarchical pooling + multi-level token aggregation | GANs | EER |
| Data-driven | Learnable features | [115] | Neuron coverage technique | GANs | AUC, EER, F1-score |
| Data-driven | Learnable features | [116] | Multi-dataset co-training + sharpness-aware optimization | GANs | EER |
| Data-driven | Learnable features | [117] | Multi-view features | GANs | EER |
| Data-driven | Learnable features | [48] | Self-supervised pretraining + multi-view features | GANs | EER |
| Data-driven | Challenge-response protocol | [118] | Random tasks | GANs | EER, AUC |
| Fingerprint-based | Vocoder fingerprint | [119] | Linear frequency cepstral coefficients | GANs | F1-score |
| Fingerprint-based | Vocoder fingerprint | [120] | Multi-task learning | GANs | AUC, ACC |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks).
In the context of DF audio, existing approaches can be broadly categorized into three main groups: frequency-based, data-driven, and fingerprint-based methods. Note that we exclude studies that evaluate their proposed methods solely on the ASVspoof 2019 dataset [43], as this dataset is not specifically designed for the task of DF detection. Figure 4 illustrates the main categories in the audio domain and Table [tab:audio-detection] summarizes unimodal audio-level approaches.
Frequency-based methods. This group analyzes the frequency domain to expose fake audio. Bispectral analysis [112] measures higher-order spectral correlations that are not typically present in human speech but are found in synthesized speech, possibly due to the neural network architectures used. Building on this work, [113] integrated bispectral and cepstral analysis to differentiate human speech from AI-synthesized speech.
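A toy estimate of the bispectrum underlying [112] is sketched below: it averages the third-order spectral product B(f1, f2) = E[X(f1) X(f2) X*(f1 + f2)] over windowed frames, where synthesized speech may show correlations that natural speech lacks. The direct double loop is for clarity, not efficiency.

```python
# Sketch: averaged bispectrum magnitude over windowed audio frames.
import numpy as np

def bispectrum(frames):
    """frames: (n_frames, n_samples) array of windowed audio frames."""
    X = np.fft.rfft(frames, axis=1)
    n = X.shape[1] // 2               # restrict so f1 + f2 stays in range
    B = np.zeros((n, n), dtype=complex)
    for f1 in range(n):
        for f2 in range(n):
            B[f1, f2] = np.mean(X[:, f1] * X[:, f2] * np.conj(X[:, f1 + f2]))
    return np.abs(B)
```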
Data-driven methods. Similar to methods in the visual domain, most works in this group leverage large datasets to train DL models for classifying authentic and synthesized audio.
Learnable Features. Hand-crafted features are manually designed by experts based on their understanding of speech characteristics and acoustic properties, using techniques such as Mel-frequency cepstral coefficients. In contrast, learnable features are automatically extracted by DL models. [115] introduced a novel technique for capturing layer-wise neuron activation patterns using designed neuron coverage criteria, which are then used as input features for a binary classifier to distinguish between real and fake voices. [116] proposed a multi-dataset training strategy with sharpness-aware optimization, which effectively handles domain mismatch between diverse datasets; by employing sharpness-aware minimization and adaptive sharpness-aware minimization to seek flat minima during training, their compact model achieves performance comparable to large pre-trained models. [117] proposed using multi-view features combined with an attention mechanism, including prosodic, pronunciation, and wav2vec features, to improve the performance of fake audio detection models. Inspired by advancements in speech self-supervised learning, [48] introduced a self-supervised model as a front-end feature extractor and designed a multi-fusion-based classifier to discriminate between real and fake features. [114] utilized hierarchical pooling and multi-level classification token aggregation to capture both local and global features for detecting spoofing evidence in audio.
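As an example of such a self-supervised front-end, the sketch below uses torchaudio's pretrained wav2vec 2.0 bundle to pool utterance-level features, assuming 16 kHz input; the surveyed works build more elaborate fusion classifiers on top of features like these.

```python
# Sketch: pooled wav2vec 2.0 features as input to a real/fake classifier head.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE     # pretrained self-supervised model
frontend = bundle.get_model().eval()

@torch.no_grad()
def utterance_features(waveform):               # waveform: (B, num_samples) @ 16 kHz
    feats, _ = frontend.extract_features(waveform)
    return feats[-1].mean(dim=1)                # pool the last layer over time
```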
Challenge-response protocols. Inspired by NLP CAPTCHAs [121], approaches in this group require the human to respond to a challenge within a limited time. [118] first introduced a defense system against DF calls based on a challenge-response protocol that integrates multiple modules. Given a suspicious call, the system randomly assigns a challenge to the caller and verifies the corresponding response, providing an additional layer of security.
Fingerprint-based methods. Approaches in this group focus on identifying vocoder fingerprints left by specific synthesis models in generated audio. Initial work demonstrated that audio generated by different neural vocoders exhibits distinct fingerprints in the spectrogram [119]. The authors showed that a ResNet-based classifier trained on linear frequency cepstral coefficients can accurately identify the source vocoder of synthesized speech, laying the groundwork for vocoder-specific artifact detection. [120] extended this idea by introducing a multi-task learning framework that leverages vocoder identification as a pretext task. By constraining the feature extractor to focus on vocoder-specific artifacts, their approach enables the extraction of highly discriminative features for the final binary classification of authentic and synthesized speech.
Discussion. Despite the advancements made in DF voice detection, it is important to note that the number of methods proposed in this domain is relatively limited compared to the visual domain. This disparity can be attributed to several factors, such as the lack of large-scale and diverse datasets for training and evaluation, and the challenges in identifying subtle artifacts or inconsistencies in synthesized speech.
Table [tab:multi-modal-detection]. Multi-modal DF detection approaches.

| Approach | Article | Key Idea | GenAI* | Metric |
|---|---|---|---|---|
| Audio-visual fusion | [122] | Emotion inconsistency | GANs | AUC |
| Audio-visual fusion | [123] | Emotion inconsistency | GANs | ACC, AUC |
| Audio-visual fusion | [15] | Video-level contrastive learning | GANs | ACC, AUC |
| Audio-visual fusion | [124] | Segment-level contrastive learning | GANs | AUC |
| Audio-visual fusion | [16] | Person-specific anomaly detection | GANs | ACC, AUC |
| Audio-visual fusion | [125] | Attention + contrastive learning | GANs | ACC, AUC, F1-score |
| Audio-visual fusion | [126] | Word-conditioned facial motion analysis | GANs | ACC, AUC |
| Audio-visual synchronization | [127] | Inter-attention | GANs | ACC, AUC |
| Audio-visual synchronization | [128] | Anomaly detection + self-supervised learning | GANs | AP, AUC |
| Audio-visual synchronization | [129] | Two-stream encoders + bi-directional cross-attention | GANs | ACC, AUC |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks).
The rapid advancements in audio-visual generation technology have led to the creation of highly realistic DFs in which the lip movements are accurately synchronized with the speech [130]–[132]. DF detection methods that focus solely on visual or audio anomalies may struggle to identify these advanced DFs. Therefore, it is crucial to develop sophisticated multi-modal DF detectors that analyze the audio and visual components simultaneously. Based on their methodologies, we categorize recent works into two main groups: audio-visual fusion and audio-visual synchronization. Figure 5 illustrates the two main categories in the multimodal domain and Table [tab:multi-modal-detection] summarizes multimodal approaches.
Audio-visual fusion methods aim to combine information from both audio and visual modalities to learn a joint representation and capture the intrinsic correlation or semantic consistency between them. One line of work detects DFs using emotion cues extracted from the face and voice of the speaker. [122] leveraged a pre-trained emotion extraction model to predict emotion categories (e.g., happy, sad, angry) from the facial and speech modalities separately, while [123] utilized an LSTM model to predict continuous valence and arousal values from low-level facial and speech descriptors extracted by two pre-trained face and speech networks. Contrastive learning is widely utilized to capture the intrinsic correlation and consistency between the audio and visual modalities. By pulling the audio-visual representations of real videos closer and pushing those of fake videos apart during training, contrastive learning enables the model to learn the inherent synchronization. This learned audio-visual correspondence serves as a powerful cue for distinguishing real videos from DFs, enhancing the model’s ability to detect manipulation in either modality. Recent works compute the contrastive loss over entire videos [15], short segments [124], or with joint audio-visual loss terms [16]. Unlike works that perform contrastive learning over the entire video or short temporal segments, [125] applied contrastive learning after an audio-visual attention fusion mechanism that aligns the audio and visual features. Instead of focusing on detecting visual artifacts or inconsistencies, [126] learned person-specific biometric patterns associated with specific words to capture the intrinsic correlation between an individual’s facial movements and spoken words.
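The contrastive objective common to these works can be sketched as a standard InfoNCE-style loss over paired audio and visual embeddings; this illustrates the general recipe rather than any single paper's formulation.

```python
# Sketch: audio and visual embeddings of the same real video are pulled
# together; all other pairings in the batch act as negatives.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """audio_emb, visual_emb: (B, D) embeddings of matched A/V pairs."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_emb, dim=1)
    logits = a @ v.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)        # matched pairs on the diagonal
```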
Audio-visual synchronization methods leverage the intrinsic temporal alignment and coherence between visual streams and the corresponding speech audio to detect DF videos. These methods may employ techniques such as cross-modal attention, temporal alignment, or dynamic time warping to measure the synchronization between modalities. [127] proposed a two-stream network with a synchronization module that temporally aligns the audio and visual features by leveraging an inter-attention mechanism and a contrastive loss. [128] posed DF detection as an anomaly detection problem, leveraging an audio-visual synchronization network to extract features for anomaly detection. [129] used two-stream encoders to capture spatio-temporal features from the visual and audio streams, and built a bi-directional cross-attention block to fuse the extracted audio-visual features and jointly learn their inherent relationships for detecting inconsistencies.
Discussion. Multi-modal DF detection methods offer significant advantages in identifying subtle inconsistencies by analyzing both audio and visual components simultaneously. However, there are two major challenges: (i) the complexity of modeling the intricate relationships between audio and visual modalities, which requires sophisticated fusion techniques and architectures; (ii) the sensitivity to audio-visual desynchronization that may occur naturally in genuine videos due to factors such as network latency, audio-visual capture device misalignment, or post-processing.
DF detection approaches reviewed in Section 3 are effective in terms of detection accuracy. However, as the sophistication and prevalence of DFs continue to grow, other critical aspects of detection systems beyond accuracy have gained recognition. This section provides a comprehensive survey of approaches focusing on four key areas: generalization, robustness, attribution, and interpretability. Figure 6 illustrates these aspects, while methods for each aspect are summarized from Table [tab:generalization] to Table [tab:interpretability].
Table [tab:generalization]. Approaches for improving the generalization of DF detectors.

| Approach | Article | Key Idea | GenAI* | Metric |
|---|---|---|---|---|
| Data augmentation | [46] | Static data augmentation | GANs | ACC, AP |
| Data augmentation | [133] | Static data augmentation | GANs | ACC, AP |
| Data augmentation | [134] | Static data augmentation | GANs | AUC |
| Data augmentation | [135] | Dynamic data augmentation | GANs | AUC |
| Data augmentation | [136] | Contrastive learning | GANs | AUC, EER |
| Data augmentation | [137] | Dual-contrastive learning | GANs | AUC, ACC, EER |
| Data augmentation | [138] | Video-specific data augmentation | GANs | AUC |
| Data augmentation | [139] | Latent space augmentation | GANs | AUC, EER, AP |
| Synthetic data training | [140] | GAN framework + self-supervised auxiliary tasks | GANs | AUC |
| Synthetic data training | [141] | Self-blending technique | GANs | AUC, AP |
| Synthetic data training | [141] | Video-level self-blending technique | GANs | AUC, ACC, EER |
| Disentanglement learning | [142] | Dual-encoder architecture | GANs | ACC, AUC, EER |
| Disentanglement learning | [143] | Conditional decoder + contrastive learning | GANs | ACC, AUC, AP, EER |
| Unsupervised learning | [144] | Patch consistency learning | GANs | AUC |
| Unsupervised learning | [145] | Cross-modal self-supervised learning | GANs | ACC, AUC |
| Unsupervised learning | [146] | Self-supervised front-end | GANs | EER |
| Adaptive learning | [147] | Meta-split and meta-optimization | GANs | ACC, AUC, EER |
| Adaptive learning | [148] | Meta-learning + test-time training | GANs | ACC, AUC, EER |
| Adaptive learning | [149] | Incremental learning + knowledge distillation | GANs | ACC, AUC |
| Adaptive learning | [150] | Radian Weight Modification | GANs | ACC, EER |
| Adaptive learning | [151] | Source domain aggregation + triplet loss | GANs | F1-score, AUC, EER |
| Adaptive learning | [152] | One-class knowledge distillation | GANs | EER |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks).
Generalization refers to the capability of DF detectors to effectively adapt to unseen datasets or newly released generators. This capability is crucial given the rapid evolution of DF creation techniques. Researchers typically assess generalization by evaluating detectors across diverse datasets, manipulation techniques, generator families, and post-processing operations.
Data Augmentation. Several approaches have employed data augmentation (DA) techniques, such as Gaussian blur, JPEG compression, and Face-Cutout, to enhance the diversity of training datasets and improve the generalization of DF detectors [46], [133], [134]. Rather than employing such static DA methods, [135] introduced a dynamic occlusion technique that erases sensitive facial regions, preventing the model from overfitting to the forgeries in the training dataset. Contrastive learning (CL) has been integrated with DA to construct paired data, effectively eliminating task-irrelevant contextual factors [136]. [137] extended this concept by proposing a dual CL strategy (inter-instance and intra-instance CL) to enable the model to simultaneously learn global discriminative features and local forgery cues. In the video domain, [138] implemented video-specific DA techniques, such as temporal dropout, temporal repeat, and clip-level blending. Departing from pixel-level augmentation, [139] proposed latent-space DA techniques that enlarge the forgery space by constructing and simulating variations within and across forgery features in the latent space.
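A minimal sketch of the static augmentations used in [46], [133], [134], applying random Gaussian blur and JPEG re-compression with PIL; the probabilities and parameter ranges are illustrative.

```python
# Sketch: static augmentation pipeline for training images.
import io
import random
from PIL import Image, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    if random.random() < 0.5:                      # random Gaussian blur
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 3)))
    if random.random() < 0.5:                      # random JPEG re-compression
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=random.randint(30, 95))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img
```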
Synthetic Data Training. Instead of relying on dataset-specific artifacts, some works generate forgeries during the training process to expand the diversity of forgery types. [140] leveraged a generator-discriminator framework with self-supervised auxiliary tasks (such as predicting the manipulated region, blending type, and blending ratio) to encourage the discriminator to learn more generalizable features sensitive to various forgery characteristics. [141] introduced a self-blending technique that generates self-blended images by blending transformed versions of a single base image; this produces more challenging forgery artifacts and helps the detector learn more generic representations. This idea has been extended to the video domain [141] with a method that disrupts temporal consistency across frames through frame-by-frame processing with varying parameters.
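A deliberately crude sketch of the self-blending idea in [141]: a mildly transformed copy of a real face is blended back onto itself inside a soft mask, creating blending-boundary artifacts without any generator. The transform and mask here are placeholders for the richer pipelines in the original work.

```python
# Sketch: synthesize a pseudo-fake by blending a transformed copy of a face
# back onto the original inside a soft face-shaped mask.
import numpy as np

def self_blend(face, mask):
    """face: HxWx3 float in [0,1]; mask: HxW soft blend mask in [0,1]."""
    transformed = np.clip(face * 1.05 + 0.02, 0, 1)   # mild color/intensity shift
    m = mask[..., None]
    return m * transformed + (1 - m) * face           # pseudo-fake training sample
```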
Disentanglement Learning. Some studies have revealed that content-specific biases affect the generalization of DF detectors [142], [143]. By disentangling content information (e.g., identity, background) from artifact features, the detector can focus specifically on learning artifacts without being influenced by content-specific biases. [142] designed a dual-encoder architecture that separately extracts content and artifact features and employed CL to maximize the separation between the two feature spaces. [143] proposed a multi-task disentanglement framework that decomposes the input into three components: forgery-irrelevant content features, method-specific fingerprint features, and common forgery features. The authors employed a conditional decoder for reconstruction and a contrastive regularization loss to encourage separation between common and specific forgery features.
Approaches for improving the robustness of DF detectors.

| Approach | Article | Key Idea | GenAI* | Metric |
|---|---|---|---|---|
| Adversarial training | [153] | I-FGSM adversarial training | GANs | ACC |
| Adversarial training | [154] | PGD adversarial training + D-CAPTCHA | GANs | F1-score |
| Adversarial training | [155] | Frequency perturbation | GANs | ACC, AP |
| Adversarial training | [156] | Adaptive adversarial training | GANs | EER, F1-score |
| Adversarial training | [157] | Disjoint frequency ensemble | GANs | AP |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks). Robustness is evaluated against adversarial perturbations (AE).
DF attribution approaches.

| Approach | Article | Key Idea | GenAI* | Metric |
|---|---|---|---|---|
| Supervised attribution | [158] | Generation inversion | GANs | ACC, AUC |
| Supervised attribution | [159] | Learning-based fingerprint estimation | GANs | F1-score |
| Supervised attribution | [160] | Hand-crafted noise residuals | GANs | AUC |
| Supervised attribution | [161] | Image-transformation classification + patchwise contrastive learning | GANs | ACC, F1-score |
| Supervised attribution | [162] | Intermediate feature mixing + contrastive learning | GANs | ACC |
| Unsupervised attribution | [163] | Iterative clustering and refinement | GANs | AP |
| Unsupervised attribution | [164] | Model parsing + reverse engineering | GANs | ACC, AUC, F1-score |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks). Attribution is evaluated under adversarial perturbations (AE) and across generators (CG).
Table [tab:interpretability]. Interpretability approaches for DF detection.

| Approach | Article | Key Idea | GenAI* | Metric | V** | M** |
|---|---|---|---|---|---|---|
| Saliency map | [165] | Shapley value | GANs | ACC, AUC | ✔ | ✘ |
| Saliency map | [136] | Heatmap + UMAP | GANs | ACC | ✔ | ✘ |
| Model behavior | [166] | Feature whitening | GANs | ACC, AUC, AP | ✔ | ✔ |
| Model behavior | [167] | Dynamic prototype learning | GANs | AUC, EER | ✔ | ✘ |
| Model behavior | [164] | Decomposition + self-attention | GANs | ACC, AUC | ✔ | ✘ |

* The GenAI column specifies the type of generative model each method is designed to identify: GANs (Generative Adversarial Networks).
** Interpretability evaluation: V: Visualization; M: Masking.
Unsupervised Learning. Some studies employed unsupervised learning to encourage DF detectors to capture more general patterns. [144] used multivariate Gaussian estimation to generate pseudo annotations for attention maps of manipulated locations without requiring pixel-level labels and trained the network with these pseudo labels. [145] proposed a two-stage learning strategy: (i) self-supervised representation learning and (ii) face forgery detection. In the first stage, dense video representations are learned from unlabeled real talking-face videos by exploiting the natural correspondence between the visual and auditory modalities. In the second stage, the forgery detector is trained in a multi-task manner, predicting real or fake while simultaneously predicting the video targets produced in the first stage. In the audio domain, [146] leveraged a large-scale self-supervised model as the front-end to extract domain-invariant features from raw audio across different languages and domains.
Adaptive Learning. Adaptive learning approaches aim to enable DF detectors to adapt to new manipulation methods, generator families, or data distributions without compromising performance on previously learned knowledge. Meta-learning has emerged as a potential approach that trains the model on a variety of learning tasks such that it can quickly adapt to new tasks using only a small number of training samples [147]. [148] combined meta-learning and one-shot test-time training (OST) for DF detection; this approach synthesizes pseudo-training samples by blending test samples with source images, updates the detector using these samples, and employs meta-learning for quick adaptation to new test data. [149] leveraged incremental learning to learn domain-invariant representations across old and new tasks while using knowledge distillation to mitigate catastrophic forgetting of previous knowledge. [150] leveraged Radian Weight Modification to categorize classes into two groups based on feature distribution similarities across tasks, and then used a self-attention mechanism to learn optimal gradient modification directions for different data types, mitigating the problem of catastrophic forgetting. [151] proposed an approach in the audio domain that uses a single-side adversarial domain discriminator to aggregate real speech features from different domains, while using triplet mining to separate fake speech features. [152] proposed a one-class knowledge distillation method in which a teacher model trained on both real and fake speech guides a student model trained only on real speech, allowing the student to learn a distribution of real speech features that better detects unseen fake speech attacks.
Much recent research indicates that DF detectors are vulnerable to adversarial examples (AEs): fake images carrying imperceptible perturbations crafted to mislead a detector's decision [168], [169]. Enhancing robustness involves mitigating the vulnerability of detectors to adversarial attacks and improving their resilience.
Adversarial Training. This technique is widely adopted to improve model robustness against adversarial attacks. It incorporates AEs into the training process, allowing the model to learn to classify these perturbed inputs correctly. [153] naively employed adversarial training with the Iterative Fast Gradient Sign Method (I-FGSM) to make the detector robust against white-box adversarial attacks. [154] first investigated the resilience of the D-CAPTCHA system, a challenge-response protocol for detecting fake calls, under transferable imperceptible adversarial attacks in a black-box setting. They exposed the system's vulnerability and proposed D-CAPTCHA++, which employs Projected Gradient Descent (PGD) adversarial training to enhance the robustness of both DF detectors and task classifiers within the system, substantially reducing the success rate of transferable adversarial attacks. [156] introduced an adaptive adversarial training approach that continuously analyzes attack difficulty and adjusts the sampling of AEs during training. [155] proposed a Frequency Perturbation GAN to generate frequency-level perturbation maps that are added to both real and fake images during training. [157] presented a novel ensemble-based method that leverages disjoint partitions of the frequency spectrum, exploiting redundancy in frequency-domain artifacts of DFs. Unlike previous approaches that train a single model on AEs, this method partitions the input space across multiple models, reducing the dimensionality of the adversarial subspace and making it harder for an attacker to find AEs that fool the entire ensemble.
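As a reference point for these methods, a minimal PGD adversarial training loop looks like the following; the hyperparameters (eps, step size, number of steps) are common defaults rather than values from any surveyed paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: iteratively perturb x to maximize the loss."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into eps-ball
        x_adv = x_adv.clamp(0, 1)                  # keep a valid image
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Train on adversarial examples so the detector learns to resist them."""
    model.eval()
    x_adv = pgd_attack(model, x, y)
    model.train()
    loss = F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```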
Attribution refers to the process of identifying the source of DFs, not only aiding in tracing the origin of synthetic media but also serving as a deterrent against the misuse of AI technologies for creating malicious content.
Supervised Attribution. These methods are trained on a dataset of images labeled with their source models, or require access to the architecture and weights of known generators. Consequently, they can typically only classify images into a fixed set of known models. [160] extracted hand-crafted noise residuals from images to estimate unique fingerprints of each generator's architecture. [158] proposed to invert the generation process, searching for the latent vector that produces the closest match to the image when passed through each known generator. [159] extended this idea to learning-based fingerprint estimation, in which a classifier is trained on image-source pairs to learn fingerprints that are unique to each seen GAN model. [161] first trained an image-transformation classifier to distinguish different types of image transformations applied to real images, and then trained it with patch-wise contrastive learning, helping the model focus more on architecture-related traces. [162] proposed mixing features at an intermediate layer rather than at the pixel level to preserve GAN-architecture-related fingerprints, and then trained the model on two tasks: real/fake prediction and generator type classification.
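The inversion idea of [158] can be sketched as below: for each candidate generator, optimize a latent vector to reconstruct the query image, and attribute the image to the generator with the lowest reconstruction error. The dictionary of pretrained generators and all hyperparameters are assumptions for illustration.

```python
import torch

def attribute_by_inversion(image, generators, z_dim=512, steps=500, lr=0.05):
    """Attribution by generator inversion: `generators` is an assumed dict
    {name: G} of pretrained models mapping a latent z to an image."""
    errors = {}
    for name, G in generators.items():
        G.requires_grad_(False)                      # only optimize the latent
        z = torch.randn(1, z_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            loss = torch.mean((G(z) - image) ** 2)   # pixel reconstruction error
            opt.zero_grad(); loss.backward(); opt.step()
        errors[name] = loss.item()
    # The generator that reconstructs the image best is the likely source.
    return min(errors, key=errors.get), errors
```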
Unsupervised Attribution. The primary purpose of unsupervised attribution approaches is to trace fake images back to unseen generators. [163] started with a small labeled set from known sources to train an initial network, which is then used to extract features and cluster unlabeled images, iteratively improving the clusters and discovering new GAN sources. [164] first constructed a network to estimate the fingerprints left by generators on their images; a parsing network then predicts network architecture and loss-function parameters from the estimated fingerprints. By training on a diverse dataset of generative models and using clustering to leverage similarities between models, the approach can generalize to attribute images from unseen generative model architectures.
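A single, simplified round of the iterative discovery procedure in [163] might look like the sketch below; the novelty threshold is a placeholder heuristic of ours, not the paper's criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_sources(labeled_feats, labeled_srcs, unlabeled_feats, n_new=3):
    """One simplified discovery round: cluster unlabeled image features;
    clusters far from every known-source centroid are candidate new GANs."""
    centroids = np.stack([labeled_feats[labeled_srcs == s].mean(axis=0)
                          for s in np.unique(labeled_srcs)])
    km = KMeans(n_clusters=len(centroids) + n_new, n_init=10).fit(unlabeled_feats)
    # Distance from each discovered cluster center to each known centroid.
    dists = np.linalg.norm(km.cluster_centers_[:, None] - centroids[None], axis=-1)
    is_new = dists.min(axis=1) > np.median(dists)   # heuristic novelty threshold
    return km.labels_, is_new
```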
Interpretability is crucial for understanding how detectors identify manipulated media and for building trust in their decisions. As DF detection systems are often built as black boxes, providing clear explanations for their outputs becomes essential for fostering user confidence and enabling effective human-AI collaboration in content authentication. To evaluate how well the proposed methods improve interpretability, most works provide qualitative visualizations of learned features or explanations. A few methods provide quantitative evaluation by masking specific facial parts (e.g., eyes, mouth) to test whether the model can detect and explain manipulations in various facial regions, demonstrating its ability to provide fine-grained explanations.
Saliency Map. These methods focus on quantifying and visualizing the importance of input features in the model's output. [165] employed the Shapley value, a concept from cooperative game theory, to quantitatively evaluate the contributions of different visual concepts to the model's predictions. [136] used heatmap visualizations to highlight salient facial regions and employed UMAP-based feature-space visualizations to analyze how the models separate real and fake images.
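In the spirit of [165], region-level Shapley values can be approximated by Monte-Carlo sampling of permutations, as sketched below; zeroing masked regions as the baseline and the region partition itself are our illustrative choices, not the paper's setup.

```python
import numpy as np

def shapley_region_importance(predict, image, regions, n_perm=200, rng=None):
    """Monte-Carlo estimate of each region's Shapley value for the 'fake'
    score. `predict` maps an image array to a scalar score; `regions` is a
    list of boolean masks; hidden regions are zeroed as the baseline."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(regions)
    phi = np.zeros(n)
    for _ in range(n_perm):
        order = rng.permutation(n)
        masked = np.zeros_like(image)                # start from empty baseline
        prev = predict(masked)
        for i in order:
            masked[regions[i]] = image[regions[i]]   # reveal region i
            cur = predict(masked)
            phi[i] += cur - prev                     # marginal contribution
            prev = cur
    return phi / n_perm
```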
Model Behavior. This category focuses on understanding and interpreting the internal decision-making processes of DF detection models. [167] designed a Dynamic Prototype Network that learns prototypical representations of temporal inconsistencies in DF videos. The model facilitates interpretation by comparing input video dynamics against learned prototypes, which can be visualized as video clips showing temporal artifacts. Zero-phase component analysis was implemented by [166] to decorrelate internal feature representations across different channels, helping each channel capture a distinct aspect of the input while maintaining semantic relevance; this makes the model's internal representations more interpretable. [109] used a decomposed spatial-temporal self-attention mechanism to separate the model's processing into distinct spatial and temporal components, allowing the model to visualize class-discriminative heatmaps for the spatial and temporal dimensions separately.
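Zero-phase component analysis itself reduces to a closed-form whitening transform; a minimal sketch on an (N, C) feature matrix follows (the stabilizing epsilon is a standard choice, not a value from [166]).

```python
import numpy as np

def zca_whiten(feats, eps=1e-5):
    """Zero-phase component analysis: decorrelate feature channels while
    staying as close as possible to the original basis. feats: (N, C)."""
    mu = feats.mean(axis=0)
    x = feats - mu
    cov = x.T @ x / (len(x) - 1)                  # (C, C) channel covariance
    evals, evecs = np.linalg.eigh(cov)
    # W = E diag(1/sqrt(lambda)) E^T keeps the whitened features aligned
    # with the original channels (zero phase), unlike plain PCA whitening.
    w = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return x @ w, w
```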
In this section, we define the potential threats that could compromise the security of passive DF detectors. This threat model is crucial for understanding the vulnerabilities of detectors and for developing evaluation strategies to assess them. We also define the detector's capabilities to combat DFs. The threat model for passive DF detection comprises three key actors: the defender, the provider, and the adversary. The defender utilizes a passive model or system to detect and classify DF content, while the provider is responsible for developing and deploying these detection systems. In contrast, the objective of the adversary is to evade detection by passive systems.
Adversary’s Objectives. Evasion is the primary goal of adversaries, who strive to produce DFs capable of bypassing the detection mechanisms of passive systems. Another goal is model extraction, which involves stealing the detection model itself. By acquiring the model, adversaries aim to conduct comprehensive analyses or develop more sophisticated attack strategies, thereby enhancing their ability to create undetectable DFs.
Adversary’s Knowledge. The knowledge of adversaries in passive DF detection scenarios can be categorized into three distinct types. Black-box adversaries operate with limited information, lacking insight into the internal workings of the passive detector, including its architecture, parameters, and training dataset; their interaction with the system is constrained to accessing outputs via an API. Gray-box adversaries possess partial knowledge, such as the detector's architecture but not its trained parameters or training data. In contrast, white-box adversaries pose the most significant threat due to their full knowledge of the detector's inner workings.
Adversary’s Capabilities. This aspect outlines the actions an adversary can take to compromise the detector. Variations in the adversary’s power may include:
manipulate the training dataset;
manipulate the test dataset;
train a substitute model to mimic the behavior of the target detector.
Adversary’s Strategies. To achieve their goals, adversaries may employ several strategies and leverage various capabilities:
produce adversarial examples designed to deceive the target detector. This can be achieved by constructing a surrogate model that mimics the behavior of the target detector or performing a black-box attack through iterative querying of the target detector;
perform transferable attacks that do not require direct interaction with the target detector (a minimal sketch of the surrogate-based strategy follows this list).
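To illustrate the surrogate strategy, the sketch below labels images by querying the black-box detector, fits a surrogate on those labels, and crafts FGSM examples on the surrogate that often transfer to the target. The `query_api` callable and all hyperparameters are assumptions of ours, not a specific system's interface.

```python
import torch
import torch.nn.functional as F

def train_surrogate(surrogate, optimizer, images, query_api, epochs=5):
    """Fit a surrogate on labels obtained by querying the black-box
    detector's API (`query_api` returns predicted class indices)."""
    labels = query_api(images)
    for _ in range(epochs):
        loss = F.cross_entropy(surrogate(images), labels)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return surrogate

def transfer_fgsm(surrogate, x, y, eps=4/255):
    """Craft FGSM examples on the surrogate; by transferability they often
    also evade the target detector without further queries."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```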
Defender and Provider’s Capabilities. Providers have full access to their detection model, its parameters, and the training dataset. They bear the responsibility for regularly updating and maintaining the detection applications or systems in response to emerging attack vectors. Meanwhile, defenders continuously update their models to address emerging threats.
Recent research has highlighted significant challenges in the generalization capability of passive methods. First, detectors trained on one type of generative model (e.g., GANs) often fail to identify fakes produced by other models (e.g., diffusion models) [170]–[172]. Second, while detectors perform well when trained and tested on the same dataset, they struggle with unknown generation methods or post-processing techniques in real-world scenarios [52], [173]. Although data augmentation has been applied as a promising solution in several works [46], [133], [135], it is impractical for augmentation to account for every possible post-processing step and its varying intensities [148]. Finally, different DF techniques produce distinct artifacts, leading detectors to specialize and underperform on new manipulation methods.
Current evaluation methodologies for DF detection systems focus on accuracy metrics, but this single-dimensional assessment fails to account for other crucial aspects of trustworthy DF detection systems, such as robustness, explainability, and privacy protection [174]:
Robustness: Adversarial perturbations have been demonstrated to compromise passive detection methods [168], [169] through imperceptible examples. While [168] proposed two defense methods to improve robustness against adversarial attacks, research on this issue should be extended.
Transparency and Explainability: Many high-performing DF detection models operate as black boxes, making their decision-making processes difficult to understand. This can significantly hinder trust and adoption, particularly in deployment scenarios. Recent works have begun addressing this issue [136], [165], but the proposed interpretability approaches are still in early stages.
Privacy: The evaluation of DF detectors rarely considers the privacy implications of their operation. For instance, the data used for detection might inadvertently reveal sensitive information about individuals.
Multi-modal approaches for DF detection, while promising, face significant challenges in real-world scenarios due to two impractical assumptions. First, many approaches assume precise temporal alignment between the audio and visual streams and use inconsistencies as manipulation indicators. However, real-world videos often exhibit natural misalignment due to video encoding errors, network transmission latencies, post-production editing, or recording imperfections [128], [175]. Such misalignment typically manifests as a consistent shift of a few frames between the audio and visual streams. While this shift may be imperceptible to human viewers, it can significantly degrade the performance of audio-visual fusion approaches. Second, multi-modal methods often assume that only a single modality (audio or visual) is manipulated and try to identify inconsistencies between the two modalities. However, this approach is ineffective against advanced talking-face video generation techniques that manipulate both audio and visual elements simultaneously [176].
The field of passive DF detection is limited by a critical lack of standardized and comprehensive evaluation benchmarks. This poses significant challenges to the fair comparison of detection methods, including:
Inconsistency in evaluation protocols: Without a unified benchmark, there is substantial variability in data processing pipelines across studies. This inconsistency results in disparate data inputs for detection models, which inevitably leads to different results and makes fair comparisons more difficult [177].
Limited reproducibility: Many studies fail to provide open-source implementations or adequate implementation details [177]. This lack of transparency, coupled with inconsistent evaluation practices, severely impairs the reproducibility of reported results.
Adaptive Learning Techniques: Future work should focus on developing continuous learning mechanisms that allow passive detection models to update and enhance their capabilities in response to emerging DF techniques. This could involve the implementation of meta-learning approaches, facilitating rapid adaptation to novel manipulation methods, as well as the exploration of self-supervised learning techniques that can harness large volumes of unlabeled data to improve model generalization. Although [149] has made initial progress by utilizing incremental learning techniques to enable detection models to continually learn from limited new samples, future work should expand upon this research direction. Furthermore, additional challenges inherent in adaptive learning should be considered, such as mitigating catastrophic forgetting, maintaining model stability across updates, and ensuring efficient resource utilization during the adaptation process.
Standardized and Dynamic Benchmarks: To address the lack of consistent evaluation protocols, future research should prioritize the creation of comprehensive, standardized benchmarks for passive approaches. These benchmarks should be designed with expandability in mind, allowing for the incorporation of new GenAI families. Key aspects to consider include: (i) creating diverse datasets that encompass a wide range of DF types and generation methods; (ii) establishing evaluation metrics that ensure fair comparisons between different methods; (iii) developing a modular benchmark structure that allows for easy integration of new generative model families and DF techniques (a minimal sketch of such a registry follows).
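One hypothetical way to realize point (iii) is a registry in which each generator family contributes a dataset loader and every detector is scored with a shared metric suite; everything below (names, loaders, metrics) is illustrative, not an existing benchmark's API.

```python
from typing import Callable, Dict

# Hypothetical modular benchmark registry: new generator families are added
# by registering a loader; every detector is scored with the same metrics.
DATASETS: Dict[str, Callable] = {}

def register_family(name: str):
    def wrap(loader: Callable):
        DATASETS[name] = loader
        return loader
    return wrap

@register_family("gan_faces")
def load_gan_faces():
    # Placeholder: in practice, return (samples, labels) for a GAN face set.
    return [0.2, 0.9], [0, 1]

@register_family("diffusion_faces")
def load_diffusion_faces():
    # Placeholder: a diffusion-generated face set added without code changes.
    return [0.1, 0.8], [0, 1]

def evaluate(detector: Callable, metrics: Dict[str, Callable]) -> Dict:
    """Run one detector across all registered families with shared metrics."""
    report = {}
    for family, loader in DATASETS.items():
        x, y = loader()
        scores = detector(x)
        report[family] = {name: fn(y, scores) for name, fn in metrics.items()}
    return report
```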
Holistic Trustworthiness Evaluation: Future work should shift the current DF detection paradigm toward a more holistic view of trustworthiness [174]. This transformation requires a multifaceted approach that goes beyond mere accuracy metrics. Researchers must develop comprehensive evaluation frameworks that assess multiple aspects of trustworthiness, including robustness, fairness, explainability, and privacy preservation. These frameworks should not only evaluate each aspect independently but also explore the intricate trade-offs and potential synergies between different trustworthiness dimensions.
Accelerating Multimodal Detectors for Talking-Face Video Generation: As talking-face video generation techniques become increasingly sophisticated, there is an urgent need to develop more advanced multimodal detection approaches. Addressing the challenge of detecting DFs that manipulate both audio and visual modalities simultaneously is crucial and should be explored in future research.