FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies

Shuqiao Liang  Jian Liu  Renzhang Chen*  Quanlong Guan*
Jinan University
{xigua7105, liujian2143}@gmail.com
{jnulion, gql}@jnu.edu.cn
*Corresponding authors.
https://github.com/xigua7105/FerretNet


Abstract

The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependency (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark spanning 22 generative models, surpassing state-of-the-art methods by 10.6%.

1 Introduction↩︎

The field of AI-based image generation has progressed rapidly, driven by the development of powerful generative models such as Variational Autoencoders (VAEs) [1], Generative Adversarial Networks (GANs) [2][4], and Latent Diffusion Models (LDMs) [5][7]. These models have enabled widespread applications across art, entertainment, and e-commerce, allowing users to effortlessly create realistic and engaging images. However, the potential misuse of such content has raised ethical concerns and driven extensive research on synthetic image detection [8][11].

Many existing detection approaches rely heavily on model-specific features, which limits their ability to generalize to unseen generative architectures. For example, Durall et al. [12] observed characteristic frequency artifacts in GAN-generated images. Although frequency-domain techniques [8], [13], [14] have demonstrated strong performance under known conditions, they often struggle to generalize across different models. DIRE [15] introduced a diffusion-based detection framework that distinguishes synthetic images by how faithfully a diffusion model can reconstruct them, a capability that breaks down for real images; however, this method performs poorly when applied to GAN-generated content.

To address the generalization challenge, Ojha et al. [10] explored pre-trained models, using a frozen backbone to encode images into universal representations learned during pre-training, followed by a linear classifier. FatFormer [11] introduced an adapter into CLIP [16] to enhance the pre-trained model’s ability to learn artifacts. While these methods achieve encouraging results, they are constrained by large parameter counts or low computational efficiency.

To overcome the dual challenges of limited generalization and computational inefficiency in synthetic image detection, we conduct a comprehensive analysis of artifact patterns shared across GAN-, VAE-, and LDM-based generative models. Our analysis reveals that visual anomalies, such as unnatural textures, geometric distortions, and poor object-background integration, primarily originate from two sources: (1) distributional shifts in the latent variable \(z\), and (2) over-smoothing and color discontinuities introduced during the decoding process.

Based on these insights and grounded in the theory of Markov Random Fields, we introduce a pixel-level artifact representation that captures local pixel dependencies (LPD) through median-based reconstruction. We further propose FerretNet, a lightweight detector designed with depthwise separable and dilated convolutions to enhance both performance and efficiency.

Contributions of this work are as follows:

  • We propose a novel approach that leverages Markov Random Fields and median-based statistics to capture local pixel dependencies for detecting artifacts and anomalies in synthetic images.

  • We construct a large-scale benchmark dataset, Synthetic-Pop, comprising 30,000 synthetic images generated by six different models and 30,000 real images from COCO [17] and LAION-Aesthetics V2 (6.5+) [18], totaling 60,000 images.

  • We introduce FerretNet, a lightweight model with only 1.1 million parameters, which achieves 97.1% accuracy on synthetic image detection across 22 generative models, while maintaining low computational overhead.

2 Related Work↩︎

We categorize existing synthetic image detection methods into two main paradigms: pixel-based and frequency-based approaches.

2.1 Pixel-based Synthetic Image Detection↩︎

Wang et al. [8] trained a classifier on images generated by a single model to detect fake images across various architectures and datasets, addressing cross-model generalization via data augmentation and diverse training samples. Shi et al. [19] proposed a difference-guided reconstruction learning framework that exploits discrepancies between real and synthetic images to enhance detection accuracy. Ojha et al. [10] tackled the generalization problem to unseen generative models by leveraging a feature space not explicitly trained for real/fake discrimination, employing nearest-neighbor and linear probing strategies. He et al. [20] introduced a super-resolution-based re-synthesis technique to reconstruct test images and extract residual or layered artifact features, thereby reducing reliance on frequency artifacts. Tan et al. [21] proposed NPR, a method that revisits the upsampling process in generative CNNs by modeling Neighbor Pixel Relations, aiming to improve generalization in deepfake detection. Liu et al. [22] designed a robust detection framework based on multi-view image completion, which simulates real image distributions and captures frequency-independent features. FatFormer [11] presented a forgery-aware adaptive transformer incorporating forgery-specific adapters and language-guided alignment modules to better adapt pre-trained models for synthetic image detection.

2.2 Frequency-based Synthetic Image Detection↩︎

F3Net [23] introduced a dual-branch architecture that captures frequency-aware clues for detecting subtle forgery traces, particularly in low-quality and facial imagery. FrePGAN [14] developed a frequency-level perturbation GAN framework, where a generator-discriminator pair is used to iteratively improve classifier robustness against unseen categories and generative models. Tan et al. [24] exploited pre-trained CNN gradients to generate generalizable representations of GAN-specific artifacts. BiHPF [13] amplified frequency-level artifacts via a high-pass filtering approach, achieving improved robustness across diverse image categories, color manipulations, and generative models. FreqNet [25] introduced high-frequency representations and frequency-specific convolution layers to enhance detection by focusing on localized high-frequency components, addressing overfitting and poor generalization seen in prior methods.

3 Artifacts in Synthetic Image Generation↩︎

3.1 Image Generation Pipeline↩︎

Generative models such as VAEs, GANs, and LDMs are widely used for image synthesis. Despite differences in architecture and training objectives, these models share a common two-stage generation pipeline, as illustrated in Figure 1.

Figure 1: The image generation process in VAEs, GANs, and LDMs can be broadly divided into two stages: obtaining the latent variable z, and decoding it into an image.

1. Obtaining the latent variable \(z\):

In LDMs, the generation process begins with Gaussian noise \(\epsilon \sim \mathcal{N}(0, I)\), which is iteratively denoised into a latent representation \(z\) within the compressed latent space of a pretrained autoencoder, using a denoising network such as U-Net [5], [26] or Diffusion Transformer (DiT) [27], [28]. In contrast, VAEs and GANs directly sample \(z\) from predefined prior distributions, such as a standard normal distribution \(\mathcal{N}(0, I)\) or a uniform distribution \(U(-1,1)\).

2. Decoding \(z\) to generate images: In both VAEs and LDMs, a decoder transforms \(z\) into the final image through a series of convolutional layers with specific kernel sizes and strides. In GANs, the generator plays an analogous role, mapping \(z\) to the image space with the aim of approximating the target data distribution.

While this two-stage framework enables high-fidelity image synthesis, it can also introduce artifacts such as texture irregularities, unnatural transitions, and local detail loss. These artifacts commonly arise from two major sources: (1) deviations in the distribution of the latent variable \(z\), and (2) imperfections introduced during the decoding process.

3.2 Latent Distribution Deviations↩︎

The quality of synthetic images exhibits significant sensitivity to the distribution of the latent representation \(z\) [29][32]. Ideally, the sampled distribution \(Q(z)\) should match the prior distribution \(P(z)\) assumed or learned during training. However, in practice, factors such as data imbalance or insufficient training can lead to a mismatch between \(Q(z)\) and \(P(z)\). This discrepancy can be quantified using the Kullback–Leibler (KL) divergence: \[\begin{align} D_{\mathrm{KL}}(Q(z) \| P(z)) = \int Q(z) \log \frac{Q(z)}{P(z)} \, dz > \delta, \end{align}\] where \(\delta\) denotes an acceptable divergence threshold. When this threshold is exceeded, the resulting images are prone to visible artifacts, including texture inconsistencies and the loss of fine structural details. For example, in GANs, if the latent space is poorly aligned with the true data distribution, the generator may fail to reproduce realistic textures, resulting in unnatural or distorted outputs [33].
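To make this deviation concrete, the sketch below (our illustration, not part of the detection pipeline) fits a diagonal Gaussian to a batch of sampled latents and evaluates the closed-form KL divergence against a standard normal prior:

```python
import torch

def gaussian_kl_to_standard_normal(z: torch.Tensor) -> torch.Tensor:
    """Fit a diagonal Gaussian Q(z) to sampled latents and compute the
    closed-form D_KL(Q(z) || N(0, I)). Purely a diagnostic sketch."""
    mu = z.mean(dim=0)
    var = z.var(dim=0, unbiased=False).clamp_min(1e-8)
    # Per-dimension KL of N(mu, var) against N(0, 1), summed over dimensions.
    kl_per_dim = 0.5 * (var + mu.pow(2) - 1.0 - var.log())
    return kl_per_dim.sum()

# usage (hypothetical): z holds N sampled latent vectors of dimension D, shape (N, D)
# if gaussian_kl_to_standard_normal(z) > delta:  # delta as in the equation above
#     print("latent distribution deviates from the prior")
```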

3.3 Artifacts from the Decoding Process↩︎

Even when \(z\) is accurately sampled, decoding artifacts may still arise due to limitations in the network architecture [34]. The kernel size and stride used in convolutional layers are particularly influential in determining the fidelity of the output [3]. Large kernels may over-smooth local features, while improper stride configurations can lead to aliasing, both of which degrade image quality.

Moreover, upsampling operations—such as nearest-neighbor or bilinear interpolation—are known to introduce specific artifacts. Nearest-neighbor interpolation often produces jagged edges, whereas bilinear interpolation may blur textures due to its smoothing effect. These operations can significantly impact the realism and perceptual quality of the generated images, especially in high-frequency regions.

4 Methodology↩︎

Figure 2: Local pixel dependencies (LPD) comparison between real and synthetic images. Top row: real images (COCO, LAION) and synthetic images (BigGAN, SDXL-Turbo, StyleGAN, RealVisXL-4.0). Bottom row: the corresponding LPD maps derived from neighborhood-median reconstruction, which emphasize structural differences.

4.1 Local Median-based Feature Extraction↩︎

We propose a synthetic image detection method based on local statistical dependencies. The core idea is to identify generation artifacts by quantifying the deviation of each pixel from the median of its surrounding neighborhood. The full computational procedure is outlined in Figure 3.

Let \(I\) denote the input image, and \(x_{i,j}\) represent the pixel value at location \((i, j)\). According to the Markov Random Field (MRF) assumption, the probability distribution of a pixel depends only on its local neighborhood. Specifically, \[\begin{align} P(x_{i,j} \mid x_{k,l}, (k,l) \neq (i,j)) = P(x_{i,j} \mid x_{k,l}, (k,l) \in \mathcal{N}_{i,j}), \end{align}\] where \(\mathcal{N}_{i,j}\) is the set of neighboring pixels located within an \(n \times n\) window centered at \((i,j)\), excluding the center pixel itself: \[\begin{align} \mathcal{N}_{i,j} = \left\{ x_{k,l} \;\middle|\; \begin{array}{l} i - m \leq k \leq i + m,\; j - m \leq l \leq j + m, \\ (k,l) \ne (i,j) \end{array} \right\}, \end{align}\] with \(n = 2m + 1\) and \(m \in \mathbb{Z}^{+}\).

To enhance the robustness of the median filtering process and prevent contamination from generated pixels, we introduce a zero-masking strategy that replaces the center pixel with zero before computing the median. This adjustment is particularly beneficial when the neighborhood contains an even number of pixels. The median-based reconstruction at location \((i,j)\) is therefore computed as: \[y_{i,j} = \text{Median}(x_{k,l}, (k,l) \in \mathcal{N}'_{i,j}),\] where \(\mathcal{N}'_{i,j} = \mathcal{N}_{i,j} \cup \{x_{i,j} = 0\}\) is the extended neighborhood that includes the masked center pixel.

By applying the above operation to all pixels, we obtain a median-reconstructed image \(I'\), where each pixel value is replaced by its corresponding \(y_{i,j}\). The final local pixel dependency (LPD) feature map is then computed as the pixel-wise difference: \[LPD = I - I'.\]

Since both \(I\) and \(I'\) conform to local dependency assumptions, the LPD feature map effectively captures pixel-level inconsistencies and subtle structural deviations, offering strong cues for distinguishing synthetic from natural content, as illustrated in Figure 2.

Figure 3: Local Dependency Feature Extraction via Zero-Masked Median Deviation

This method effectively integrates the local dependency modeling capabilities of Markov Random Fields with the robustness of median filtering, providing a principled and resilient strategy for detecting subtle inconsistencies in synthetic imagery.
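For concreteness, a minimal PyTorch-style sketch of the LPD extraction is given below; the unfold-based tensor layout is our assumption and is not taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def lpd(image: torch.Tensor, n: int = 3) -> torch.Tensor:
    """LPD = I - I', where I' is the zero-masked neighborhood median of each
    pixel. `image` is a float tensor of shape (B, C, H, W)."""
    b, c, h, w = image.shape
    pad = n // 2
    # Gather the n x n neighborhood of every pixel (zero padding at borders).
    patches = F.unfold(image, kernel_size=n, padding=pad)       # (B, C*n*n, H*W)
    patches = patches.view(b, c, n * n, h * w)
    # Zero-mask the center pixel: it stays in the set but cannot dominate,
    # and the neighborhood gains an odd element count for the median.
    center = (n * n) // 2
    patches[:, :, center, :] = 0.0
    reconstructed = patches.median(dim=2).values.view(b, c, h, w)  # I'
    return image - reconstructed                                   # LPD map

# usage: x is a (B, 3, 224, 224) tensor in [0, 1]; lpd_map = lpd(x, n=3)
```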

4.2 FerretNet Architecture↩︎

Figure 4: Pipeline of FerretNet: computation of local pixel median discrepancy for artifact representation, followed by lightweight detection using depthwise separable and dilated convolutions.

FerretNet is a lightweight convolutional neural network designed to achieve a balance between computational efficiency and feature extraction capability. As illustrated in Figure 4, the network begins with two conventional \(3\times3\) convolutional layers for initial feature extraction, each followed by Batch Normalization (BN) [35] and ReLU activation.

At the core of FerretNet are four cascaded Ferret Blocks, which progressively refine the extracted features while keeping the model compact. The final stage comprises a \(1\times1\) convolution, global average pooling, Dropout regularization, and a fully connected layer for classification.

The key innovation lies in the Ferret Block, which is designed to expand the effective receptive field under constrained network depth, thereby enhancing the model’s capacity for local pattern extraction. Each Ferret Block adopts a dual-path parallel architecture to increase the receptive field:

  • The primary path employs a \(3\times3\) dilated grouped convolution with a dilation rate of 2. The number of groups equals the number of input channels, allowing the receptive field to expand without increasing the number of parameters.

  • The secondary path utilizes a standard \(3\times3\) grouped convolution, maintaining the same grouping structure to capture fine-grained local patterns.

This dual-path configuration approximates a sparse \(5\times5\) receptive field via parallel processing, enabling FerretNet to simulate deeper network behaviors within shallower layers, thus reducing computational cost. The outputs from both paths are fused through a \(1\times1\) convolution, followed by BN and ReLU activation. Additional \(3\times3\) grouped and \(1\times1\) convolution layers further enrich the feature representation. Residual connections are employed to facilitate stable gradient propagation and enhance learning stability.
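For illustration, the following is a minimal sketch of a Ferret Block consistent with the description above; the channel widths, the choice of concatenation for path fusion, and the exact ordering of the refinement layers are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class FerretBlock(nn.Module):
    """Dual-path block: a dilated depthwise 3x3 path and a standard depthwise
    3x3 path, fused by a 1x1 convolution, with a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        # Primary path: 3x3 depthwise conv with dilation 2 (sparse 5x5 field).
        self.dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2,
                                 groups=channels, bias=False)
        # Secondary path: standard 3x3 depthwise conv for fine local patterns.
        self.local = nn.Conv2d(channels, channels, 3, padding=1,
                               groups=channels, bias=False)
        # Fuse both paths with a 1x1 convolution, then BN + ReLU.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Additional 3x3 grouped and 1x1 layers to enrich the representation.
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([self.dilated(x), self.local(x)], dim=1))
        return self.act(x + self.refine(fused))  # residual connection
```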

5 Experiments↩︎

5.1 Dataset Construction↩︎

5.1.1 Training Dataset↩︎

To ensure a consistent evaluation baseline, we follow the protocols established in [10], [11], [13], [21], utilizing four semantic classes (car, cat, chair, horse) from the ForenSynths dataset [8]. Each class contains 18,000 synthetic images generated by ProGAN [2], paired with an equal number of real images from the LSUN dataset [36]. In all experiments conducted in this study, only the aforementioned 4-class data (i.e., the 4-class ProGAN dataset) was used for training; no additional data sources were included.

5.1.2 Testing Dataset↩︎

To assess the generalization ability of the proposed method under real-world conditions, we evaluate its performance on diverse synthetic and real images from four distinct test sets, comprising a total of 22 generative models:

ForenSynths. This test set includes synthetic images generated by eight representative generative models: ProGAN [2], StyleGAN [37], StyleGAN2 [34], BigGAN [4], CycleGAN [38], StarGAN [39], GauGAN [40], and Deepfake [41]. Real images are sourced from six widely-used datasets: LSUN [36], ImageNet [42], CelebA [43], CelebA-HQ [44], COCO [17], and FaceForensics++ [41], totaling 62,000 images.

Diffusion-6-cls. As described in FatFormer [11], this test set comprises synthetic images generated by six diffusion-based models collected from DIRE [15] and Ojha et al. [10], including DALL-E [45], Guided [46], PNDM [47], VQ-Diffusion [48], Glide [49], and LDM [5]. Variants produced by Glide and LDM with different parameter configurations are treated as separate categories (see original papers for details). Each subset includes 1,000 synthetic and 1,000 real images, with some real images reused across subsets.

Synthetic-Pop. To capture the latest progress in high-resolution image generation, we constructed the Synthetic-Pop dataset using six popular models—Openjourney [50], Proteus-0.3 [51], RealVisXL-4.0 [52], SD-3.5-Medium [7], SDXL-Turbo [6], and YiffyMix [53]. Each model was prompted with 5,000 captions randomly sampled from COCO [17]. Real images were drawn from COCO and LAION-Aesthetics V2 (6.5+) [18], resulting in six subsets, each containing 5,000 synthetic and 5,000 real images (60,000 images total).

Synthetic-Aesthetic. To further investigate the aesthetic and stylistic diversity of synthetic imagery, we sampled 40,000 images from the Simulacra Aesthetic Captions (SAC) dataset [54], which were generated by CompVis latent GLIDE [5] and Stable Diffusion [5] using prompts sourced from over 40,000 real users. An equal number of real images were sampled from LAION-Aesthetics V2 (6.5+) [18], resulting in a total of 80,000 images. This dataset provides a challenging benchmark for evaluating performance under realistic and user-driven conditions.

5.2 Implementation Details↩︎

FerretNet is trained from scratch without any pretraining. We use the Adam optimizer with a learning rate of \(2\times10^{-4}\), betas of \((0.937, 0.999)\), and a weight decay of \(5\times10^{-4}\). The model is trained for 100 epochs using a batch size of 32. During training, input images are randomly cropped to a resolution of \(224\times224\) and augmented with random horizontal flipping. Binary Cross Entropy with Logits Loss (BCEWithLogitsLoss) is adopted as the loss function. For evaluation, images are center-cropped to \(256\times256\).
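For reference, the training configuration above can be assembled roughly as follows; `FerretNet()` is a hypothetical placeholder for the network of Section 4.2, and the data pipeline is omitted.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Data augmentation and evaluation preprocessing as described above.
train_tf = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
eval_tf = transforms.Compose([
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

model = FerretNet()  # hypothetical constructor for the network of Sec. 4.2
optimizer = optim.Adam(model.parameters(), lr=2e-4,
                       betas=(0.937, 0.999), weight_decay=5e-4)
criterion = nn.BCEWithLogitsLoss()  # real/fake treated as a single logit
# Training loop: 100 epochs, batch size 32 (data loaders omitted).
```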

Following previous work [10], [11], [13], Accuracy (ACC) and Average Precision (AP) are used as the primary evaluation metrics. To measure real-world performance, we report throughput on the Synthetic-Aesthetic test set using an NVIDIA RTX 4090 GPU and an Intel(R) Xeon(R) Gold 6430 CPU (16 vCPUs), with a batch size of 128.

5.3 Main Results↩︎

Table 1: Accuracy and average precision comparisons with peer methods on the ForenSynths test set for different GAN images and Deepfake images. The best and second best performance are highlighted in bold and underlined, respectively.
Methods ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN Deepfake Mean
Wang [8] 91.4/99.4 63.8/91.4 /97.5 52.9/73.3 /88.6 63.8/90.8 /92.2 51.7/62.3 /86.9
F3Net [23] 99.4/100.0 92.6/99.7 /99.8 65.3/69.9 /84.3 100.0/100.0 /56.7 63.5/78.8 /86.2
FrePGAN [14] 99.0/99.9 80.7/89.6 /98.6 69.2/71.1 /74.4 99.9/100.0 /71.7 70.9/91.9 /87.2
BiHPF [13] 90.7/86.2 76.9/75.1 /74.7 84.9/81.7 /78.9 94.4/94.4 /78.1 54.4/54.6 /77.9
LGrad [24] 99.9/100.0 94.8/99.9 /99.9 82.9/90.7 /94.0 99.6/100.0 /79.3 58.0/67.9 /91.5
Ojha [10] 99.7/100.0 89.0/98.7 /98.4 90.5/99.1 /99.8 91.4/100.0 /100.0 80.2/90.2 /98.3
FreqNet [25] 99.6/100.0 90.2/99.7 /99.5 90.5/96.0 /99.6 85.7/99.8 /98.6 88.9/94.4 /98.5
NPR [21] 99.8/100.0 96.3/99.8 /100.0 87.5/94.5 /99.5 99.7/100 /88.8 77.4/86.2 /96.1
FatFormer [11] 99.9/100.0 97.2/99.8 /99.9 99.5/100.0 /100.0 99.8/100.0 /100.0 93.2/98.0 98.4/99.7
FerretNet (Ours) 99.9/100.0 98.0/100.0 /100.0 92.6/98.5 /99.9 99.1/100.0 /99.8 89.2/96.7 95.9/99.3
Table 2: Accuracy and average precision comparisons with peer methods on Diffusion-6-cls test set.
Dataset Wang [8] F3Net [23] LGrad [24] Ojha [10] FreqNet [25] NPR [21] FatFormer [11] FerretNet
Dall-E 51.8/61.3 71.6/79.9 /97.3 89.5/96.8 /99.7 90.9/98.1 /99.8 91.4/98.2
Guided 54.9/66.6 69.2/70.8 /100.0 75.7/85.1 /75.4 74.0/78.1 /92.0 92.1/98.6
PNDM 50.8/90.3 72.8/99.5 /98.5 75.3/92.5 /99.9 97.5/100.0 /100.0 96.9/100.0
VQ-Diffusion 50.0/71.0 100.0/100.0 /100.0 83.5/97.7 /100.0 100.0/100.0 /100.0 99.9/100.0
Glide-50-27 54.2/76.0 88.5/95.4 /95.1 91.1/97.4 /95.8 97.5/99.5 /99.4 97.2/99.7
Glide-100-10 53.3/72.9 88.3/95.4 /94.9 90.1/97.0 /96.0 97.8/99.5 /99.2 97.9/99.9
Glide-100-27 53.0/71.3 87.0/94.5 /93.2 90.7/97.2 /95.6 97.4/99.5 /99.1 97.3/99.7
LDM-100 51.9/63.7 74.1/84.0 /99.2 90.5/97.0 /99.9 98.0/99.6 /99.9 98.8/100.0
LDM-200 52.0/64.5 73.4/83.3 /99.1 90.2/97.1 /99.9 98.2/99.6 /99.8 98.8/100.0
LDM-200-CFG 51.6/63.1 80.7/89.1 /99.2 77.3/88.6 /99.9 98.0/99.5 /99.1 98.5/99.9
Mean 52.4/70.1 80.6/89.2 /97.7 85.4/94.6 /96.2 94.9/97.3 95.0/98.8 96.9/99.6
Table 3: Accuracy and average precision comparisons with state-of-the-art methods on Synthetic-Pop test set.
Methods Openjourney Proteus-0.3 RealVisXL-4.0 SD-3.5-Medium SDXL-Turbo YiffyMix Mean
FreqNet [25] 56.3 / 63.6 44.0 / 41.2 / 66.6 78.5 / 86.8 / 86.0 74.3 / 84.4 / 71.4
NPR [21] 78.8 / 83.5 68.6 / 69.3 / 82.0 80.4 / 84.1 / 82.9 80.0 / 85.1 77.4 / 81.2
FatFormer [11] 58.8 / 65.4 93.9 / 97.6 / 41.7 81.9 / 89.1 / 65.3 80.9 / 89.9 / 74.8
FerretNet 98.4 / 99.7 98.6 / 99.7 / 99.9 97.2 / 99.6 / 100.0 97.8 / 99.7 98.3 / 99.8
Table 4: Performance comparisons with state-of-the-art methods across four distinct test sets. Throughput measurements were conducted on the Synthetic-Aesthetic test set. Upward arrows indicate that higher values are better, while downward arrows signify the opposite.
Methods Parameters (M) \(\downarrow\) Throughput (Img/s) \(\uparrow\) ForenSynths Diffusion-6-cls Synthetic-Pop Synthetic-Aesthetic Mean
FreqNet [25] 1.9 200.2 / 98.5 90.1 / 96.2 / 71.4 70.1 / 81.2 / 86.8
NPR [21] 1.4 720.9 / 96.1 94.8 / 97.7 77.4 / 81.2 81.4 / 82.4 86.5 / 89.4
FatFormer [11] 577.3 88.6 98.4 / 99.7 95.0 / 98.8 / 74.8 80.4 / 90.6 / 91.0
FerretNet 1.1 772.1 95.9 / 99.3 96.9 / 99.6 98.3 / 99.8 97.3 / 99.6 97.1 / 99.6

Figure 5: Grad-CAM visualizations. The fake images are synthesized using RealVisXL-4.0, SDXL-Turbo, SD-3.5-Medium, StyleGAN, and BigGAN. FerretNet shows strong activation responses on synthetic images, while showing no significant responses on real ones.

We begin by evaluating FerretNet on GAN-based and Deepfake images using the ForenSynths test set. As shown in Table 1, it achieves an average accuracy (ACC) of 95.9%, outperforming lightweight baselines such as FreqNet [25] (91.5%) and NPR [21] (92.5%). Although FatFormer [11] reports a higher ACC of 98.4%, it relies on pre-trained CLIP weights, whereas FerretNet achieves competitive accuracy with significantly fewer parameters.

Next, on diffusion-generated images (Table 2), FerretNet attains an ACC of 96.9% and an AP of 99.6%, outperforming FatFormer [11] by 1.9 and 0.8 percentage points (pp), respectively. Other lightweight models such as NPR [21] and FreqNet [25] perform less favorably, with ACC scores falling below 95.0%.

We further evaluate performance on high-quality synthetic images using the Synthetic-Pop test set (Table 3). Existing methods experience noticeable degradation; for example, NPR [21]—the strongest among them—achieves only 77.4% ACC and 81.2% AP. In contrast, FerretNet maintains 98.3% ACC and 99.8% AP, highlighting its robustness and reliability on visually realistic forgeries.

To evaluate real-world applicability, we tested FerretNet on the Synthetic-Aesthetic set for both detection performance and efficiency. As shown in Table 4, FerretNet achieves 97.3% ACC and 99.6% AP with 1.1M parameters and a throughput of 772.1 images/s on an RTX 4090. Although throughput is compared on this set, its average performance across all four benchmarks reaches 97.1% ACC and 99.6% AP. Notably, it outperforms FatFormer [11] by 11.0 and 8.6 pp in ACC and AP, respectively, while using only 0.2% of its parameters.

Finally, Grad-CAM [55] visualizations in Figure 5 show that FerretNet focuses on pixel-level artifacts rather than high-level semantics, contributing to its strong generalization to unseen generators.

5.4 Ablation Study↩︎

Unless specified, all ablation results report the average ACC and AP across four datasets: ForenSynths, Diffusion-6-cls, Synthetic-Pop, and Synthetic-Aesthetic.

5.4.1 Impact of Different Local Neighborhood Sizes↩︎

Table 5: Impact of the local neighborhood size.
Input ForenSynths Diffusion-6-cls Synthetic-Pop Synthetic-Aesthetic Mean
\(I\) (raw image) 84.6 / 88.9 87.8 / 96.8 84.5 / 92.9 90.5 / 95.3 86.9 / 93.5
\(LPD\) (\(3\times3\)) 95.9 / 99.3 96.9 / 99.6 98.3 / 99.8 97.3 / 99.6 97.1 / 99.6
\(LPD\) (\(5\times5\)) 91.8 / 96.2 95.8 / 99.3 91.1 / 97.4 96.9 / 98.9 93.9 / 98.0
\(LPD\) (\(7\times7\)) 82.4 / 90.6 85.2 / 93.6 78.6 / 91.9 85.0 / 94.4 82.8 / 92.6

Table 5 shows that \(LPD\) extracted using a \(3 \times 3\) local neighborhood substantially enhances detection accuracy compared to the raw input \(I\). Average ACC improves from 86.9% to 97.1% (+10.2 pp), and AP rises from 93.5% to 99.6% (+6.1 pp). However, performance deteriorates as the neighborhood size increases. For instance, using a \(7 \times 7\) neighborhood weakens feature discrimination and significantly reduces detection accuracy.

This trend aligns with the structural characteristics of generative models, which typically employ \(2\times\) upsampling and small convolutional kernels (\(1 \times 1\) or \(3 \times 3\)). The \(3 \times 3\) neighborhood is particularly effective in capturing localized decoding artifacts for two reasons: 1) It matches the scale of operations used in generative architectures, making it ideal for exposing subtle synthesis artifacts; 2) It captures local pixel variations while suppressing potential noise artifacts.

5.4.2 Impact of Center Pixel Processing Methods↩︎

According to Section 4.1, the neighborhood median \(y_{i,j}\) should satisfy two key requirements: reducing the interference of the center pixel in the median computation, and ensuring that the median equals an actual pixel value from the neighborhood whenever possible, thereby enhancing statistical correlation with the original image. To validate the effectiveness of the zero-value masking strategy, we compared three center pixel processing methods:


Table 6: Impact of center pixel processing methods on metrics (ACC/AP): average results across varying local neighborhood sizes.
Methods \(3\times 3\) \(5\times 5\) \(7\times 7\)
Mask 97.1 / 99.6 93.9 / 98.0 82.8 / 92.6
Exclusion 95.3 / 98.8 90.5 / 96.3 86.4 / 93.6
Retention 93.3 / 97.6 89.7 / 96.3 87.5 / 93.1

1. Zero-value Masking: Set the center pixel to zero while keeping it in the set. This increases the probability that the median equals a real neighborhood pixel and reduces the center pixel’s influence.

2. Complete Exclusion: Remove the center pixel entirely. This results in a non-existent pixel value (i.e., not from the original image), thereby weakening the dependency on the source image.

3. Center Pixel Retention: Keep the original center pixel, as in standard median filtering. This approach compromises the ability to detect local anomalies.

The experimental results in Table 6 demonstrate that for local neighborhood sizes of \(3\times3\) and \(5\times5\), zero-value masking achieves the highest detection accuracy, followed by complete exclusion, with center-pixel retention yielding the lowest accuracy. These findings validate the effectiveness of the proposed strategy.

5.4.3 Impact of Neighborhood Statistic Selection↩︎


Table 7: Impact of neighborhood statistic selection methods.
Methods \(3\times 3\) \(5\times 5\) \(7\times 7\)
Max 93.6 / 97.9 86.8 / 94.3 88.9 / 94.8
Avg 92.2 / 97.2 88.2 / 94.4 90.0 / 96.6
Min 91.8 / 96.9 88.3 / 94.7 87.6 / 94.0
Med 97.1 / 99.6 93.9 / 98.0 82.8 / 92.6

To verify the advantages of the neighborhood median-based feature extraction strategy in synthetic image detection, we designed three alternative methods: selecting the maximum, minimum, and average values from the neighborhood. The center pixel was masked by setting it to negative infinity, positive infinity, or zero, respectively, so that it cannot dominate the corresponding statistic and its influence on feature extraction is reduced. The experimental results in Table 7 show that, for both \(3 \times 3\) and \(5 \times 5\) local neighborhoods, the median strategy significantly outperforms the other methods.
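The sketch below illustrates, under our assumptions about the tensor layout, how the statistic swap and the corresponding masking values could be implemented:

```python
import torch
import torch.nn.functional as F

def neighborhood_statistic_map(image: torch.Tensor, n: int = 3,
                               stat: str = "median") -> torch.Tensor:
    """Replace the neighborhood median with an alternative statistic.
    The center pixel is masked with a value the statistic cannot select."""
    b, c, h, w = image.shape
    patches = F.unfold(image, kernel_size=n, padding=n // 2)
    patches = patches.view(b, c, n * n, h * w)
    center = (n * n) // 2
    mask_value = {"max": float("-inf"), "min": float("inf"),
                  "avg": 0.0, "median": 0.0}[stat]
    patches[:, :, center, :] = mask_value
    if stat == "max":
        rec = patches.max(dim=2).values
    elif stat == "min":
        rec = patches.min(dim=2).values
    elif stat == "avg":
        rec = patches.mean(dim=2)
    else:
        rec = patches.median(dim=2).values
    return image - rec.view(b, c, h, w)
```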

5.4.4 Impact of Different Backbones↩︎


Table 8: Comparison of different backbones with and without \(LPD\) as input.
Methods Params \(LPD\) Throughput \(\uparrow\) ACC / AP \(\uparrow\)
Xception 20.8 M × 730.5 Img/s 89.8 / 94.1
Xception 20.8 M ✓ 710.6 Img/s 95.1 / 98.8
ResNet50 23.5 M × 755.4 Img/s 75.0 / 80.3
ResNet50 23.5 M ✓ 750.9 Img/s 81.1 / 85.6
FerretNet 1.1 M × 777.8 Img/s 86.9 / 93.5
FerretNet 1.1 M ✓ 772.1 Img/s 97.1 / 99.6

We evaluated ResNet50 [56], Xception [57], and our proposed FerretNet on both raw image \(I\) and \(LPD\) inputs. As shown in Table 8, FerretNet achieves competitive accuracy on raw images despite having significantly fewer parameters, and outperforms the other architectures when leveraging \(LPD\). Across all backbones, replacing \(I\) with \(LPD\) consistently delivers accuracy gains with negligible effect on inference speed.

6 Conclusion↩︎

This work presents a universal artifact representation framework and introduces FerretNet, a lightweight yet effective neural network for synthetic image detection. FerretNet achieves a 99.8% reduction in parameters compared to the state-of-the-art method FatFormer [11], while maintaining exceptional detection accuracy, reaching 97.1% across images generated by 22 different generative models. It demonstrates strong generalization capabilities and computational efficiency, outperforming existing approaches on high-quality synthetic datasets. Our contributions include a novel artifact representation approach and the introduction of the Synthetic-Pop dataset.

Limitations and Future Work. While the proposed method demonstrates robust performance, its effectiveness against compression-altered synthetic images has yet to be fully explored. Future work will focus on improving detection of compression-altered images and extending the approach to address challenges posed by emerging forms of synthetic media.

References↩︎

[1]
Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2022. URL https://arxiv.org/abs/1312.6114.
[2]
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
[3]
Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in neural information processing systems, 34: 852–863, 2021.
[4]
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
[5]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
[6]
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=di52zR8xgf.
[7]
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
[8]
Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In CVPR, pages 8695–8704, 2020.
[9]
Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In International conference on machine learning, pages 3247–3258. PMLR, 2020.
[10]
Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In CVPR, pages 24480–24489, 2023.
[11]
Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In CVPR, pages 10770–10780, 2024.
[12]
Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR, pages 7890–7899, 2020.
[13]
Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. Bihpf: Bilateral high-pass filters for robust deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 48–57, 2022.
[14]
Yonghyun Jeong, Doyeon Kim, Youngmin Ro, and Jongwon Choi. Frepgan: Robust deepfake detection using frequency-level perturbations. Proceedings of the AAAI Conference on Artificial Intelligence, 36 (1): 1060–1068, Jun. 2022. URL https://ojs.aaai.org/index.php/AAAI/article/view/19990.
[15]
Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023.
[16]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[17]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[18]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35: 25278–25294, 2022.
[19]
Zenan Shi, Haipeng Chen, Long Chen, and Dong Zhang. Discrepancy-guided reconstruction learning for image forgery detection. arXiv preprint arXiv:2304.13349, 2023.
[20]
Yang He, Ning Yu, Margret Keuper, and Mario Fritz. Beyond the spectrum: Detecting deepfakes via re-synthesis. arXiv preprint arXiv:2105.14376, 2021.
[21]
Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In CVPR, pages 28130–28139, 2024.
[22]
Chi Liu, Tianqing Zhu, Sheng Shen, and Wanlei Zhou. Towards robust gan-generated image detection: a multi-view completion representation. arXiv preprint arXiv:2306.01364, 2023.
[23]
Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision, pages 86–103. Springer, 2020.
[24]
Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In CVPR, pages 12105–12114, 2023.
[25]
Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38 (5): 5052–5060, Mar. 2024.
[26]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
[27]
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
[28]
Jingfeng Yao, Cheng Wang, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification. Advances in Neural Information Processing Systems, 37: 56166–56189, 2024.
[29]
Ali Razavi, Aaron van den Oord, Ben Poole, and Oriol Vinyals. Preventing posterior collapse with delta-VAEs. In International Conference on Learning Representations, 2019.
[30]
Tianyang Hu, Fei Chen, Haonan Wang, Jiawei Li, Wenjia Wang, Jiacheng Sun, and Zhenguo Li. Complexity matters: Rethinking the latent space for generative modeling. Advances in Neural Information Processing Systems, 36: 29558–29579, 2023.
[31]
Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. Cfg++: Manifold-constrained classifier free guidance for diffusion models. CoRR, abs/2406.08070, 2024.
[32]
Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. arXiv preprint arXiv:2411.09502, 2024.
[33]
Jonas Wulff and Antonio Torralba. Improving inversion and generation diversity in stylegan using a gaussianized latent space. arXiv preprint arXiv:2009.06529, 2020.
[34]
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[35]
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
[36]
Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[37]
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, pages 4401–4410, 2019.
[38]
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
[39]
Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
[40]
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, pages 2337–2346, 2019.
[41]
Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019.
[42]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115: 211–252, 2015.
[43]
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
[44]
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
[45]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021.
[46]
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34: 8780–8794, 2021.
[47]
Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=PlKWVd2yBkY.
[48]
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022.
[49]
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
[50]
PromptHero. Openjourney. https://openjourney.art/, 2023.
[51]
dataautogpt3. ProteusV0.3. https://huggingface.co/dataautogpt3/ProteusV0.3, 2024.
[52]
SG161222. RealVisXL V4.0. https://huggingface.co/SG161222/RealVisXL_V4.0, 2024.
[53]
Yntec. YiffyMix. https://huggingface.co/Yntec/YiffyMix, 2023.
[54]
John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. Simulacra aesthetic captions. Technical Report Version 1.0, Stability AI, 2022. https://github.com/JD-P/simulacra-aesthetic-captions.
[55]
Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[56]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[57]
François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.