September 25, 2025
The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependency (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark spanning 22 generative models, surpassing state-of-the-art methods by 10.6%.
The field of AI-based image generation has progressed rapidly, driven by the development of powerful generative models such as Variational Autoencoders (VAEs) [1], Generative Adversarial Networks (GANs) [2]–[4], and Latent Diffusion Models (LDMs) [5]–[7]. These models have enabled widespread applications across art, entertainment, and e-commerce, allowing users to effortlessly create realistic and engaging images. However, the potential misuse of such content has raised ethical concerns and driven extensive research on synthetic image detection [8]–[11].
Many existing detection approaches rely heavily on model-specific features, which limits their ability to generalize to unseen generative architectures. For example, Durall et al. [12] observed characteristic frequency artifacts in GAN-generated images. Although frequency-domain techniques [8], [13], [14] have demonstrated strong performance under known conditions, they often struggle to generalize across different models. DIRE [15] introduced a diffusion-based detection framework that distinguishes synthetic images by how faithfully a diffusion model can reconstruct them, a reconstruction that degrades for real images. However, this method performs poorly when applied to GAN-generated content.
To address the generalization challenge, Ojha et al. [10] leveraged pre-trained models, using a frozen backbone to encode images into universal representations learned during pre-training, followed by a linear classifier. FatFormer [11] introduced an adapter into CLIP [16] to enhance the pre-trained model's ability to learn artifacts. While these methods achieve encouraging results, they are constrained by large parameter counts or low computational efficiency.
To overcome the dual challenges of limited generalization and computational inefficiency in synthetic image detection, we conduct a comprehensive analysis of artifact patterns shared across GAN-, VAE-, and LDM-based generative models. Our analysis reveals that visual anomalies, such as unnatural textures, geometric distortions, and poor object-background integration, primarily originate from two sources: (1) distributional shifts in the latent variable \(z\), and (2) over-smoothing and color discontinuities introduced during the decoding process.
Based on these insights and grounded in the theory of Markov Random Fields, we introduce a pixel-level artifact representation that captures local pixel dependencies (LPD) through median-based reconstruction. We further propose FerretNet, a lightweight detector designed with depthwise separable and dilated convolutions to enhance both performance and efficiency.
Contributions of this work are as follows:
We propose a novel approach that leverages Markov Random Fields and median-based statistics to capture local pixel dependencies for detecting artifacts and anomalies in synthetic images.
We construct a large-scale benchmark dataset, Synthetic-Pop, comprising 30,000 synthetic images generated by six different models and 30,000 real images from COCO [17] and LAION-Aesthetics V2 (6.5+) [18], totaling 60,000 images.
We introduce FerretNet, a lightweight model with only 1.1 million parameters, which achieves 97.1% accuracy on synthetic image detection across 22 generative models, while maintaining low computational overhead.
We categorize existing synthetic image detection methods into two main paradigms: pixel-based and frequency-based approaches.
Wang et al. [8] trained a classifier on images generated by a single model to detect fake images across various architectures and datasets, addressing cross-model generalization via data augmentation and diverse training samples. Shi et al. [19] proposed a difference-guided reconstruction learning framework that exploits discrepancies between real and synthetic images to enhance detection accuracy. Ojha et al. [10] tackled the generalization problem to unseen generative models by leveraging a feature space not explicitly trained for real/fake discrimination, employing nearest-neighbor and linear probing strategies. He et al. [20] introduced a super-resolution-based re-synthesis technique to reconstruct test images and extract residual or layered artifact features, thereby reducing reliance on frequency artifacts. Tan et al. [21] proposed NPR, a method that revisits the upsampling process in generative CNNs by modeling Neighbor Pixel Relations, aiming to improve generalization in deepfake detection. Liu et al. [22] designed a robust detection framework based on multi-view image completion, which simulates real image distributions and captures frequency-independent features. FatFormer [11] presented a forgery-aware adaptive transformer incorporating forgery-specific adapters and language-guided alignment modules to better adapt pre-trained models for synthetic image detection.
F3Net [23] introduced a dual-branch architecture that captures frequency-aware clues for detecting subtle forgery traces, particularly in low-quality and facial imagery. FrePGAN [14] developed a frequency-level perturbation GAN framework, where a generator-discriminator pair is used to iteratively improve classifier robustness against unseen categories and generative models. Tan et al. [24] exploited pre-trained CNN gradients to generate generalizable representations of GAN-specific artifacts. BiHPF [13] amplified frequency-level artifacts via a high-pass filtering approach, achieving improved robustness across diverse image categories, color manipulations, and generative models. FreqNet [25] introduced high-frequency representations and frequency-specific convolution layers to enhance detection by focusing on localized high-frequency components, addressing overfitting and poor generalization seen in prior methods.
Generative models such as VAEs, GANs, and LDMs are widely used for image synthesis. Despite differences in architecture and training objectives, these models share a common two-stage generation pipeline, as illustrated in Figure 1.
1. Obtaining the latent variable \(z\):
In LDMs, the generation process begins with Gaussian noise \(\epsilon \sim \mathcal{N}(0, I)\), which is iteratively denoised into a latent representation \(z\) within the compressed latent space of a pretrained autoencoder, using a denoising network such as U-Net [5], [26] or Diffusion Transformer (DiT) [27], [28]. In contrast, VAEs and GANs directly sample \(z\) from predefined prior distributions, such as a standard normal distribution \(\mathcal{N}(0, I)\) or a uniform distribution \(U(-1,1)\).
2. Decoding \(z\) to generate images: In both VAEs and LDMs, a decoder transforms \(z\) into the final image through a series of convolutional layers with specific kernel sizes and strides. In GANs, the generator plays an analogous role, mapping \(z\) to the image space with the aim of approximating the target data distribution.
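To make the shared structure concrete, the following PyTorch sketch illustrates the sample-then-decode pattern common to the three model families; all shapes, layer sizes, and the commented denoising loop are illustrative assumptions, not taken from any particular model.

```python
import torch

# Stage 1: obtain a latent z (shapes are illustrative).
z_gan = torch.randn(1, 512)              # GAN/VAE: sample z directly from the prior
z_ldm = torch.randn(1, 4, 64, 64)        # LDM: start from Gaussian noise in latent space
# for t in scheduler.timesteps:          # LDM only: iterative denoising, where a U-Net
#     z_ldm = denoise_step(z_ldm, t)     # or DiT predicts and removes noise (pseudocode)

# Stage 2: decode z into an image through a stack of (transposed) convolutions.
decoder = torch.nn.Sequential(
    torch.nn.ConvTranspose2d(4, 64, kernel_size=4, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
    torch.nn.Tanh(),
)
image = decoder(z_ldm)                   # (1, 3, 256, 256) from a 64x64 latent
```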
While this two-stage framework enables high-fidelity image synthesis, it can also introduce artifacts such as texture irregularities, unnatural transitions, and local detail loss. These artifacts commonly arise from two major sources: (1) deviations in the distribution of the latent variable \(z\), and (2) imperfections introduced during the decoding process.
The quality of synthetic images exhibits significant sensitivity to the distribution of the latent representation \(z\) [29]–[32]. Ideally, the sampled distribution \(Q(z)\) should match the prior distribution \(P(z)\) assumed or learned during training. However, in practice, factors such as data imbalance or insufficient training can lead to a mismatch between \(Q(z)\) and \(P(z)\). This discrepancy can be quantified using the Kullback–Leibler (KL) divergence: \[\begin{align} D_{\mathrm{KL}}(Q(z) \| P(z)) = \int Q(z) \log \frac{Q(z)}{P(z)} \, dz > \delta, \end{align}\] where \(\delta\) denotes an acceptable divergence threshold. When this threshold is exceeded, the resulting images are prone to visible artifacts, including texture inconsistencies and the loss of fine structural details. For example, in GANs, if the latent space is poorly aligned with the true data distribution, the generator may fail to reproduce realistic textures, resulting in unnatural or distorted outputs [33].
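When both \(Q(z)\) and \(P(z)\) are diagonal Gaussians, as in a standard VAE encoder, the divergence above has a closed form. The snippet below is a minimal sketch of this computation; the threshold \(\delta\) and the example statistics are illustrative.

```python
import torch

def kl_diag_gaussian_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar, dim=-1)

# Illustrative check: flag a batch of latents whose distribution drifts from the prior.
mu = torch.full((8, 512), 0.3)       # systematic shift away from zero mean
logvar = torch.full((8, 512), -0.2)  # variance below the assumed unit variance
delta = 1.0                          # illustrative divergence threshold
kl = kl_diag_gaussian_to_standard_normal(mu, logvar).mean()
print(f"KL = {kl:.1f}, exceeds threshold: {bool(kl > delta)}")
```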
Even when \(z\) is accurately sampled, decoding artifacts may still arise due to limitations in the network architecture [34]. The kernel size and stride used in convolutional layers are particularly influential in determining the fidelity of the output [3]. Large kernels may over-smooth local features, while improper stride configurations can lead to aliasing, both of which degrade image quality.
Moreover, upsampling operations—such as nearest-neighbor or bilinear interpolation—are known to introduce specific artifacts. Nearest-neighbor interpolation often produces jagged edges, whereas bilinear interpolation may blur textures due to its smoothing effect. These operations can significantly impact the realism and perceptual quality of the generated images, especially in high-frequency regions.
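The short example below (illustrative only, not part of the proposed detector) makes these two interpolation signatures explicit: nearest-neighbor upsampling duplicates pixels exactly, while bilinear upsampling blends neighboring values and attenuates high frequencies.

```python
import torch
import torch.nn.functional as F

low_res = torch.rand(1, 3, 32, 32)
nearest = F.interpolate(low_res, scale_factor=2, mode="nearest")     # blocky, jagged edges
bilinear = F.interpolate(low_res, scale_factor=2, mode="bilinear",
                         align_corners=False)                        # smoothed textures

# Adjacent rows are exact copies under nearest-neighbor upsampling ...
print(torch.equal(nearest[..., ::2, :], nearest[..., 1::2, :]))      # True
# ... while bilinear output is a weighted blend, so local differences are non-zero.
print((nearest - bilinear).abs().mean())
```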
Figure 2: Local pixel dependencies (LPD) comparison between real and synthetic images. Top row: real images (COCO, LAION) and synthetic images (BigGAN, SDXL-Turbo, StyleGAN, RealVisXL-4.0). Bottom row: LPD maps derived from neighborhood-median reconstruction, which emphasize the structural differences.
We propose a synthetic image detection method based on local statistical dependencies. The core idea is to identify generation artifacts by quantifying the deviation of each pixel from the median of its surrounding neighborhood. The full computational procedure is outlined in Algorithm 3.
Let \(I\) denote the input image, and \(x_{i,j}\) represent the pixel value at location \((i, j)\). According to the Markov Random Field (MRF) assumption, the probability distribution of a pixel depends only on its local neighborhood. Specifically, \[\begin{align} P(x_{i,j} \mid x_{k,l}, (k,l) \neq (i,j)) = P(x_{i,j} \mid x_{k,l}, (k,l) \in \mathcal{N}_{i,j}), \end{align}\] where \(\mathcal{N}_{i,j}\) is the set of neighboring pixels located within an \(n \times n\) window centered at \((i,j)\), excluding the center pixel itself: \[\begin{align} \mathcal{N}_{i,j} = \left\{ x_{k,l} \;\middle|\; \begin{array}{l} i - m \leq k \leq i + m,\; j - m \leq l \leq j + m, \\ (k,l) \ne (i,j) \end{array} \right\}, \end{align}\] with \(n = 2m + 1\) and \(m \in \mathbb{Z}^{+}\).
To enhance the robustness of the median filtering process and prevent contamination from generated pixels, we introduce a zero-masking strategy that replaces the center pixel with zero before computing the median. This adjustment is particularly beneficial when the neighborhood contains an even number of pixels. The median-based reconstruction at location \((i,j)\) is therefore computed as: \[y_{i,j} = \text{Median}(x_{k,l}, (k,l) \in \mathcal{N}'_{i,j}),\] where \(\mathcal{N}'_{i,j} = \mathcal{N}_{i,j} \cup \{x_{i,j} = 0\}\) is the extended neighborhood that includes the masked center pixel.
By applying the above operation to all pixels, we obtain a median-reconstructed image \(I'\), where each pixel value is replaced by its corresponding \(y_{i,j}\). The final local pixel dependency (LPD) feature map is then computed as the pixel-wise difference: \[LPD = I - I'.\]
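A minimal NumPy sketch of this procedure is given below. Function names are ours, and reflection padding at the image border is an implementation assumption not specified above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lpd_map(image, n=3):
    """Compute LPD = I - I': zero-mask the center of each n x n window,
    take the window median, and subtract the reconstruction from the image.
    `image` is a float array of shape (H, W) or (H, W, C)."""
    m = n // 2
    img = image.astype(np.float32)
    squeeze = img.ndim == 2
    if squeeze:
        img = img[..., None]
    H, W, C = img.shape
    recon = np.empty_like(img)
    padded = np.pad(img, ((m, m), (m, m), (0, 0)), mode="reflect")    # border assumption
    for c in range(C):
        windows = sliding_window_view(padded[..., c], (n, n)).copy()  # (H, W, n, n)
        windows[..., m, m] = 0.0                  # zero-mask the center pixel
        recon[..., c] = np.median(windows.reshape(H, W, -1), axis=-1)
    lpd = img - recon
    return lpd[..., 0] if squeeze else lpd

# Usage: lpd = lpd_map(np.asarray(pil_image, dtype=np.float32) / 255.0)
```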
Since both \(I\) and \(I'\) conform to local dependency assumptions, the LPD feature map effectively captures pixel-level inconsistencies and subtle structural deviations, offering strong cues for distinguishing synthetic from natural content, as illustrated in Figure 2.
This method effectively integrates the local dependency modeling capabilities of Markov Random Fields with the robustness of median filtering, providing a principled and resilient strategy for detecting subtle inconsistencies in synthetic imagery.
FerretNet is a lightweight convolutional neural network designed to achieve a balance between computational efficiency and feature extraction capability. As illustrated in Figure 4, the network begins with two conventional \(3\times3\) convolutional layers for initial feature extraction, each followed by Batch Normalization (BN) [35] and ReLU activation.
At the core of FerretNet are four cascaded Ferret Blocks, which progressively refine the extracted features while keeping the model compact. The final stage comprises a \(1\times1\) convolution, global average pooling, Dropout regularization, and a fully connected layer for classification.
The key innovation lies in the Ferret Block, which is designed to expand the effective receptive field under constrained network depth, thereby enhancing the model’s capacity for local pattern extraction. Each Ferret Block adopts a dual-path parallel architecture to increase the receptive field:
The primary path employs a \(3\times3\) dilated grouped convolution with a dilation rate of 2. The number of groups equals the number of input channels, allowing the receptive field to expand without increasing the number of parameters.
The secondary path utilizes a standard \(3\times3\) grouped convolution, maintaining the same grouping structure to capture fine-grained local patterns.
This dual-path configuration approximates a sparse \(5\times5\) receptive field via parallel processing, enabling FerretNet to simulate deeper network behaviors within shallower layers, thus reducing computational cost. The outputs from both paths are fused through a \(1\times1\) convolution, followed by BN and ReLU activation. Additional \(3\times3\) grouped and \(1\times1\) convolution layers further enrich the feature representation. Residual connections are employed to facilitate stable gradient propagation and enhance learning stability.
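The PyTorch sketch below assembles a Ferret Block and a minimal FerretNet skeleton following the description above; the channel width, stem strides, and dropout rate are our assumptions rather than the authors' exact configuration, so the parameter count of this skeleton does not match the reported 1.1M.

```python
import torch
import torch.nn as nn

class FerretBlock(nn.Module):
    """Dual-path block: dilated and standard depthwise 3x3 convolutions in parallel,
    fused by a 1x1 convolution, enriched by further grouped/1x1 layers, plus a residual."""
    def __init__(self, channels):
        super().__init__()
        # Primary path: 3x3 depthwise conv with dilation 2 (sparse 5x5 receptive field).
        self.dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2,
                                 groups=channels, bias=False)
        # Secondary path: standard 3x3 depthwise conv for fine-grained local patterns.
        self.local = nn.Conv2d(channels, channels, 3, padding=1,
                               groups=channels, bias=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),   # 1x1 fusion
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.dilated(x), self.local(x)], dim=1)
        return self.act(x + self.fuse(y))                       # residual connection

class FerretNet(nn.Module):
    """Skeleton: two-conv stem, four Ferret Blocks, 1x1 conv + GAP + dropout + FC head."""
    def __init__(self, width=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, width, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(*[FerretBlock(width) for _ in range(4)])
        self.head = nn.Sequential(
            nn.Conv2d(width, width, 1, bias=False),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(width, 1),               # single real/fake logit
        )

    def forward(self, x):
        return self.head(self.blocks(self.stem(x)))

print(FerretNet()(torch.randn(1, 3, 224, 224)).shape)           # torch.Size([1, 1])
```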
To ensure a consistent evaluation baseline, we follow the protocols established in [10], [11], [13], [21], utilizing four semantic classes (car, cat, chair, horse) from the ForenSynths dataset [8]. Each class contains 18,000 synthetic images generated by ProGAN [2], paired with an equal number of real images from the LSUN dataset [36]. In all experiments conducted in this study, only the aforementioned 4-class data (i.e., the 4-class ProGAN dataset) was used for training; no additional data sources were included.
To assess the generalization ability of the proposed method under real-world conditions, we evaluate its performance on diverse synthetic and real images from four distinct test sets, comprising a total of 22 generative models:
ForenSynths. This test set includes synthetic images generated by eight representative generative models: ProGAN [2], StyleGAN [37], StyleGAN2 [34], BigGAN [4], CycleGAN [38], StarGAN [39], GauGAN [40], and Deepfake [41]. Real images are sourced from six widely-used datasets: LSUN [36], ImageNet [42], CelebA [43], CelebA-HQ [44], COCO [17], and FaceForensics++ [41], totaling 62,000 images.
Diffusion-6-cls. As described in FatFormer [11], this test set comprises synthetic images generated by six diffusion-based models collected from DIRE [15] and Ojha et al. [10], including DALL-E [45], Guided [46], PNDM [47], VQ-Diffusion [48], Glide [49], and LDM [5]. Variants produced by Glide and LDM with different parameter configurations are treated as separate categories (see original papers for details). Each subset includes 1,000 synthetic and 1,000 real images, with some real images reused across subsets.
Synthetic-Pop. To capture the latest progress in high-resolution image generation, we constructed the Synthetic-Pop dataset using six popular models—Openjourney [50], Proteus-0.3 [51], RealVisXL-4.0 [52], SD-3.5-Medium [7], SDXL-Turbo [6], and YiffyMix [53]. Each model was prompted with 5,000 captions randomly sampled from COCO [17]. Real images were drawn from COCO and LAION-Aesthetics V2 (6.5+) [18], resulting in six subsets, each containing 5,000 synthetic and 5,000 real images (60,000 images total).
Synthetic-Aesthetic. To further investigate the aesthetic and stylistic diversity of synthetic imagery, we sampled 40,000 images from the Simulacra Aesthetic Captions (SAC) dataset [54], which were generated by CompVis latent GLIDE [5] and Stable Diffusion [5] using prompts sourced from over 40,000 real users. An equal number of real images were sampled from LAION-Aesthetics V2 (6.5+) [18], resulting in a total of 80,000 images. This dataset provides a challenging benchmark for evaluating performance under realistic and user-driven conditions.
FerretNet is trained from scratch without any pretraining. We use the Adam optimizer with a learning rate of \(2\times10^{-4}\), betas of \((0.937, 0.999)\), and a weight decay of \(5\times10^{-4}\). The model is trained for 100 epochs using a batch size of 32. During training, input images are randomly cropped to a resolution of \(224\times224\) and augmented with random horizontal flipping. Binary Cross Entropy with Logits Loss (BCEWithLogitsLoss) is adopted as the loss function. For evaluation, images are center-cropped to \(256\times256\).
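A compact sketch of this training setup is shown below. The random tensors stand in for the actual ProGAN/LSUN data pipeline (224×224 random crops with horizontal flipping), the label convention is assumed, and `FerretNet` refers to a model such as the skeleton sketched earlier.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for 224x224 crops; 1 = synthetic, 0 = real (assumed).
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 2, (64,)).float()
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

model = FerretNet()                              # e.g., the skeleton sketched above
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=2e-4, betas=(0.937, 0.999),
                       weight_decay=5e-4)

model.train()
for epoch in range(100):                         # 100 epochs, batch size 32
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(1), y) # BCEWithLogitsLoss on a single logit
        loss.backward()
        optimizer.step()
# Evaluation uses 256x256 center crops instead of random 224x224 crops.
```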
Following previous work [10], [11], [13], Accuracy (ACC) and Average Precision (AP) are used as the primary evaluation metrics. To measure real-world performance, we report throughput on the Synthetic-Aesthetic test set using an NVIDIA RTX 4090 GPU and an Intel(R) Xeon(R) Gold 6430 CPU (16 vCPUs), with a batch size of 128.
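The loop below sketches how ACC, AP, and throughput can be measured under these settings using scikit-learn metrics; it is an illustration of the protocol, not the authors' exact evaluation script.

```python
import time
import numpy as np
import torch
from sklearn.metrics import accuracy_score, average_precision_score

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Return (ACC, AP, throughput in images/s) over a labeled test loader."""
    model.eval().to(device)
    scores, labels, n_images, elapsed = [], [], 0, 0.0
    for images, targets in loader:
        start = time.perf_counter()
        logits = model(images.to(device)).squeeze(1)
        if device == "cuda":
            torch.cuda.synchronize()             # make GPU timing meaningful
        elapsed += time.perf_counter() - start
        n_images += images.size(0)
        scores.append(torch.sigmoid(logits).cpu().numpy())
        labels.append(targets.numpy())
    scores, labels = np.concatenate(scores), np.concatenate(labels)
    acc = accuracy_score(labels, scores > 0.5)   # fixed 0.5 decision threshold
    ap = average_precision_score(labels, scores)
    return acc, ap, n_images / elapsed
```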
| Methods | ProGAN | StyleGAN | StyleGAN2 | BigGAN | CycleGAN | StarGAN | GauGAN | Deepfake | Mean |
|---|---|---|---|---|---|---|---|---|---|
| Wang [8] | 91.4/99.4 | 63.8/91.4 | /97.5 | 52.9/73.3 | /88.6 | 63.8/90.8 | /92.2 | 51.7/62.3 | /86.9 |
| F3Net [23] | 99.4/100.0 | 92.6/99.7 | /99.8 | 65.3/69.9 | /84.3 | 100.0/100.0 | /56.7 | 63.5/78.8 | /86.2 |
| FrePGAN [14] | 99.0/99.9 | 80.7/89.6 | /98.6 | 69.2/71.1 | /74.4 | 99.9/100.0 | /71.7 | 70.9/91.9 | /87.2 |
| BiHPF [13] | 90.7/86.2 | 76.9/75.1 | /74.7 | 84.9/81.7 | /78.9 | 94.4/94.4 | /78.1 | 54.4/54.6 | /77.9 |
| LGrad [24] | 99.9/100.0 | 94.8/99.9 | /99.9 | 82.9/90.7 | /94.0 | 99.6/100.0 | /79.3 | 58.0/67.9 | /91.5 |
| Ojha [10] | 99.7/100.0 | 89.0/98.7 | /98.4 | 90.5/99.1 | /99.8 | 91.4/100.0 | /100.0 | 80.2/90.2 | /98.3 |
| FreqNet [25] | 99.6/100.0 | 90.2/99.7 | /99.5 | 90.5/96.0 | /99.6 | 85.7/99.8 | /98.6 | 88.9/94.4 | /98.5 |
| NPR [21] | 99.8/100.0 | 96.3/99.8 | /100.0 | 87.5/94.5 | /99.5 | 99.7/100.0 | /88.8 | 77.4/86.2 | /96.1 |
| FatFormer [11] | 99.9/100.0 | 97.2/99.8 | /99.9 | 99.5/100.0 | /100.0 | 99.8/100.0 | /100.0 | 93.2/98.0 | 98.4/99.7 |
| FerretNet (Ours) | 99.9/100.0 | 98.0/100.0 | /100.0 | 92.6/98.5 | /99.9 | 99.1/100.0 | /99.8 | 89.2/96.7 | 95.9/99.3 |
| Dataset | Wang [8] | F3Net [23] | LGrad [24] | Ojha [10] | FreqNet [25] | NPR [21] | FatFormer [11] | FerretNet |
|---|---|---|---|---|---|---|---|---|
| Dall-E | 51.8/61.3 | 71.6/79.9 | /97.3 | 89.5/96.8 | /99.7 | 90.9/98.1 | /99.8 | 91.4/98.2 |
| Guided | 54.9/66.6 | 69.2/70.8 | /100.0 | 75.7/85.1 | /75.4 | 74.0/78.1 | /92.0 | 92.1/98.6 |
| PNDM | 50.8/90.3 | 72.8/99.5 | /98.5 | 75.3/92.5 | /99.9 | 97.5/100.0 | /100.0 | 96.9/100.0 |
| VQ-Diffusion | 50.0/71.0 | 100.0/100.0 | /100.0 | 83.5/97.7 | /100.0 | 100.0/100.0 | /100.0 | 99.9/100.0 |
| Glide-50-27 | 54.2/76.0 | 88.5/95.4 | /95.1 | 91.1/97.4 | /95.8 | 97.5/99.5 | /99.4 | 97.2/99.7 |
| Glide-100-10 | 53.3/72.9 | 88.3/95.4 | /94.9 | 90.1/97.0 | /96.0 | 97.8/99.5 | /99.2 | 97.9/99.9 |
| Glide-100-27 | 53.0/71.3 | 87.0/94.5 | /93.2 | 90.7/97.2 | /95.6 | 97.4/99.5 | /99.1 | 97.3/99.7 |
| LDM-100 | 51.9/63.7 | 74.1/84.0 | /99.2 | 90.5/97.0 | /99.9 | 98.0/99.6 | /99.9 | 98.8/100.0 |
| LDM-200 | 52.0/64.5 | 73.4/83.3 | /99.1 | 90.2/97.1 | /99.9 | 98.2/99.6 | /99.8 | 98.8/100.0 |
| LDM-200-CFG | 51.6/63.1 | 80.7/89.1 | /99.2 | 77.3/88.6 | /99.9 | 98.0/99.5 | /99.1 | 98.5/99.9 |
| Mean | 52.4/70.1 | 80.6/89.2 | /97.7 | 85.4/94.6 | /96.2 | 94.9/97.3 | 95.0/98.8 | 96.9/99.6 |
| Methods | Openjourney | Proteus-0.3 | RealVisXL-4.0 | SD-3.5-Medium | SDXL-Turbo | YiffyMix | Mean |
|---|---|---|---|---|---|---|---|
| FreqNet [25] | 56.3 / 63.6 | 44.0 / 41.2 | / 66.6 | 78.5 / 86.8 | / 86.0 | 74.3 / 84.4 | / 71.4 |
| NPR [21] | 78.8 / 83.5 | 68.6 / 69.3 | / 82.0 | 80.4 / 84.1 | / 82.9 | 80.0 / 85.1 | 77.4 / 81.2 |
| FatFormer [11] | 58.8 / 65.4 | 93.9 / 97.6 | / 41.7 | 81.9 / 89.1 | / 65.3 | 80.9 / 89.9 | / 74.8 |
| FerretNet | 98.4 / 99.7 | 98.6 / 99.7 | / 99.9 | 97.2 / 99.6 | / 100.0 | 97.8 / 99.7 | 98.3 / 99.8 |
| Methods | Parameters (M) \(\downarrow\) | Throughput (Img/s) \(\uparrow\) | ForenSynths | Diffusion-6-cls | Synthetic-Pop | Synthetic-Aesthetic | Mean |
|---|---|---|---|---|---|---|---|
| FreqNet [25] | 1.9 | 200.2 | / 98.5 | 90.1 / 96.2 | / 71.4 | 70.1 / 81.2 | / 86.8 |
| NPR [21] | 1.4 | 720.9 | / 96.1 | 94.8 / 97.7 | 77.4 / 81.2 | 81.4 / 82.4 | 86.5 / 89.4 |
| FatFormer [11] | 577.3 | 88.6 | 98.4 / 99.7 | 95.0 / 98.8 | / 74.8 | 80.4 / 90.6 | / 91.0 |
| FerretNet | 1.1 | 772.1 | 95.9 / 99.3 | 96.9 / 99.6 | 98.3 / 99.8 | 97.3 / 99.6 | 97.1 / 99.6 |
Figure 5: Grad-CAM visualizations. The fake images are synthesized using RealVisXL-4.0, SDXL-Turbo, SD-3.5-Medium, StyleGAN, and BigGAN. FerretNet shows strong activation responses on synthetic images and no significant responses on real ones.
We begin by evaluating FerretNet on GAN-based and Deepfake images using the ForenSynths test set. As shown in Table 1, it achieves an average accuracy (ACC) of 95.9%, outperforming lightweight baselines such as FreqNet [25] (91.5%) and NPR [21] (92.5%). Although FatFormer [11] reports a higher ACC of 98.4%, it relies on pre-trained CLIP weights, whereas FerretNet achieves competitive accuracy with significantly fewer parameters.
Next, on diffusion-generated images (Table 2), FerretNet attains an ACC of 96.9% and an AP of 99.6%, outperforming FatFormer [11] by 1.9 and 0.8 percentage points (pp), respectively. Other lightweight models such as NPR [21] and FreqNet [25] perform less favorably, with ACC scores falling below 95.0%.
We further evaluate performance on high-quality synthetic images using the Synthetic-Pop test set (Table 3). Existing methods experience noticeable degradation; for example, NPR [21]—the strongest among them—achieves only 77.4% ACC and 81.2% AP. In contrast, FerretNet maintains 98.3% ACC and 99.8% AP, highlighting its robustness and reliability on visually realistic forgeries.
To evaluate real-world applicability, we tested FerretNet on the Synthetic-Aesthetic set for both detection performance and efficiency. As shown in Table 4, FerretNet achieves 97.3% ACC and 99.6% AP on this set with only 1.1M parameters and a throughput of 772.1 images per second on an RTX 4090. Averaged over all four benchmarks, its performance reaches 97.1% ACC and 99.6% AP. Notably, it outperforms FatFormer [11] by 11.0 and 8.6 pp in mean ACC and AP, respectively, while using only 0.2% of its parameters.
Finally, Grad-CAM [55] visualizations in Figure 5 show that FerretNet focuses on pixel-level artifacts rather than high-level semantics, contributing to its strong generalization to unseen generators.
Unless specified, all ablation results report the average ACC and AP across four datasets: ForenSynths, Diffusion-6-cls, Synthetic-Pop, and Synthetic-Aesthetic.
| Input | \(3\times3\) | \(5\times5\) | \(7\times7\) | ForenSynths | Diffusion-6-cls | Synthetic-Pop | Synthetic-Aesthetic | Mean |
|---|---|---|---|---|---|---|---|---|
| \(I\) | | | | 84.6 / 88.9 | 87.8 / 96.8 | 84.5 / 92.9 | 90.5 / 95.3 | 86.9 / 93.5 |
| \(LPD\) | ✔ | | | 95.9 / 99.3 | 96.9 / 99.6 | 98.3 / 99.8 | 97.3 / 99.6 | 97.1 / 99.6 |
| \(LPD\) | | ✔ | | 91.8 / 96.2 | 95.8 / 99.3 | 91.1 / 97.4 | 96.9 / 98.9 | 93.9 / 98.0 |
| \(LPD\) | | | ✔ | 82.4 / 90.6 | 85.2 / 93.6 | 78.6 / 91.9 | 85.0 / 94.4 | 82.8 / 92.6 |
Table 5 shows that \(LPD\) extracted using a \(3 \times 3\) local neighborhood substantially enhances detection accuracy compared to the raw input \(I\). Average ACC improves from 86.9% to 97.1% (+10.2 pp), and AP rises from 93.5% to 99.6% (+6.1 pp). However, performance deteriorates as the neighborhood size increases. For instance, using a \(7 \times 7\) neighborhood weakens feature discrimination and significantly reduces detection accuracy.
This trend aligns with the structural characteristics of generative models, which typically employ \(2\times\) upsampling and small convolutional kernels (\(1 \times 1\) or \(3 \times 3\)). The \(3 \times 3\) neighborhood is particularly effective in capturing localized decoding artifacts for two reasons: 1) It matches the scale of operations used in generative architectures, making it ideal for exposing subtle synthesis artifacts; 2) It captures local pixel variations while suppressing potential noise artifacts.
According to Section 4.1, the neighborhood median \(y_{i,j}\) should satisfy two key requirements: reducing the interference of the center pixel in the median computation, and ensuring that the median equals an actual pixel value from the neighborhood set whenever possible, thus preserving the statistical correlation with the original image. To validate the effectiveness of the zero-value masking strategy, we compared three center-pixel processing methods (a small sketch of the three variants follows the list below):
| Methods | \(3\times 3\) | \(5\times 5\) | \(7\times 7\) |
|---|---|---|---|
| Mask | 97.1 / 99.6 | 93.9 / 98.0 | 82.8 / 92.6 |
| Exclusion | 95.3 / 98.8 | 90.5 / 96.3 | 86.4 / 93.6 |
| Retention | 93.3 / 97.6 | 89.7 / 96.3 | 87.5 / 93.1 |
1. Zero-value Masking: Set the center pixel to zero while keeping it in the set. This increases the probability that the median equals a real neighborhood pixel and reduces the center pixel’s influence.
2. Complete Exclusion: Remove the center pixel entirely. This results in a non-existent pixel value (i.e., not from the original image), thereby weakening the dependency on the source image.
3. Center Pixel Retention: Keep the original center pixel, as in standard median filtering. This approach compromises the ability to detect local anomalies.
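The small NumPy sketch below (with illustrative pixel values, not taken from the paper) contrasts the three treatments on a single \(3\times3\) window and shows why complete exclusion can yield a median that is not an actual pixel value.

```python
import numpy as np

def center_strategies(window):
    """Apply the three center-pixel treatments to a flattened n x n window (n odd)."""
    c = window.size // 2
    masked = window.copy()
    masked[c] = 0.0                               # 1. zero-value masking
    excluded = np.delete(window, c)               # 2. complete exclusion (even-sized set)
    retained = window                             # 3. center pixel retention
    return np.median(masked), np.median(excluded), np.median(retained)

# The center value 100 plays the role of an anomalous generated pixel.
window = np.array([5., 7., 3., 9., 100., 2., 8., 6., 4.])
print(center_strategies(window))                  # (5.0, 5.5, 6.0)
```

With exclusion, the even-sized set forces an interpolated value (5.5) that never occurs in the image, whereas zero masking keeps an odd-sized set and returns a genuine neighborhood pixel.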
The experimental results in Table 6 demonstrate that for local neighborhood sizes of \(3\times3\) and \(5\times5\), the zero-value masking achieves the highest detection accuracy, followed by the complete exclusion, with the center pixel retention yielding the lowest accuracy. These findings validate the effectiveness of the proposed strategy.
| Methods | \(3\times 3\) | \(5\times 5\) | \(7\times 7\) |
|---|---|---|---|
| Max | 93.6 / 97.9 | 86.8 / 94.3 | 88.9 / 94.8 |
| Avg | 92.2 / 97.2 | 88.2 / 94.4 | 90.0 / 96.6 |
| Min | 91.8 / 96.9 | 88.3 / 94.7 | 87.6 / 94.0 |
| Med | 97.1 / 99.6 | 93.9 / 98.0 | 82.8 / 92.6 |
To verify the advantages of the neighborhood-median feature extraction strategy in synthetic image detection, we designed three alternative methods: selecting the maximum, minimum, and average values from the neighborhood. The center pixel was masked by setting it to negative infinity, positive infinity, or zero, respectively, so that it cannot dominate the extracted statistic. The experimental results in Table 7 show that, for both \(3 \times 3\) and \(5 \times 5\) local neighborhoods, the median strategy significantly outperforms the other methods.
| Methods | Params | \(LPD\) input | Throughput \(\uparrow\) | ACC / AP \(\uparrow\) |
|---|---|---|---|---|
| Xception | 20.8 M | × | 730.5 Img/s | 89.8 / 94.1 |
| Xception | 20.8 M | ✔ | 710.6 Img/s | 95.1 / 98.8 |
| ResNet50 | 23.5 M | × | 755.4 Img/s | 75.0 / 80.3 |
| ResNet50 | 23.5 M | ✔ | 750.9 Img/s | 81.1 / 85.6 |
| FerretNet | 1.1 M | × | 777.8 Img/s | 86.9 / 93.5 |
| FerretNet | 1.1 M | ✔ | 772.1 Img/s | 97.1 / 99.6 |
We evaluated ResNet50 [56], Xception [57], and our proposed FerretNet on both raw image \(I\) and \(LPD\) inputs. As shown in Table 8, FerretNet achieves competitive accuracy on raw images despite having significantly fewer parameters, and outperforms the other architectures when leveraging \(LPD\). Across all backbones, replacing \(I\) with \(LPD\) consistently delivers accuracy gains with negligible effect on inference speed.
This work presents a universal artifact representation framework and introduces FerretNet, a lightweight yet effective neural network for synthetic image detection. FerretNet achieves a remarkable 99.8% reduction in parameters compared to the state-of-the-art method FatFormer [11], while maintaining exceptional detection accuracy, reaching 97.1% average accuracy on images generated by 22 different generative models. It demonstrates strong generalization capabilities and computational efficiency, outperforming existing approaches on high-quality synthetic datasets. Our contributions include a novel artifact representation approach and the introduction of the Synthetic-Pop dataset.
Limitations and Future Work. While the proposed method demonstrates robust performance, its effectiveness against compression-altered synthetic images has yet to be fully explored. Future work will focus on improving detection of compression-altered images and extending the approach to address challenges posed by emerging forms of synthetic media.