Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection


Abstract

In the multimedia domain, Infrared Small Target Detection (ISTD) plays an important role in drone-based multi-modality sensing. To address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD, we propose a doubly wavelet-guided invariance learning framework (Ivan-ISTD). In the first stage, we generate training samples aligned with the target domain using Wavelet-guided Cross-domain Synthesis. This wavelet-guided alignment mechanism accurately separates targets from the background through multi-frequency wavelet filtering. In the second stage, we introduce Real-domain Noise Invariance Learning, which extracts real noise characteristics from the target domain to build a dynamic noise library. The model learns noise invariance through a self-supervised loss, thereby overcoming the distribution bias of traditional artificial noise modeling. Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic degradation dataset that simulates the distribution shifts encountered in real-world applications. Additionally, we validate the versatility of our method on other real-world datasets. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods across multiple quantitative metrics. In particular, Ivan-ISTD shows excellent robustness in cross-domain scenarios. The code for this work can be found at: https://github.com/nanjin1/Ivan-ISTD.

Infrared Small Target Detection (ISTD), Cross-domain, Degraded dataset, Self-supervised

1 Introduction↩︎

In the multimedia domain, Infrared Small Target Detection (ISTD) plays a crucial role in drone-based multi-modality sensing and applications [1]–[4]. However, achieving consistent and robust detection performance in complex target environments remains an ongoing challenge [5]. Existing deep learning models are often impacted by real-world data distribution shifts, including:

  • Background-induced Domain Shift: As illustrated in Fig. 1, differences in background environments [3], [6] cause a substantial distribution shift between the source and target domains [7]. As a result, the features learned by the model during training are often not transferable to target domain data with different backgrounds [8].

  • Cross-domain Heteroscedastic Noise Perturbations: As shown in Fig. 2, noise characteristics, such as type, intensity, and distribution, can vary significantly between source and target domains due to differences in equipment and environmental factors [9]–[12]. Traditional methods typically rely on artificially designed uniform noise augmentation [13], [14], but they fall short in capturing the dynamic and heteroscedastic nature of real-world noise [15].

Figure 1: A significant distribution shift between the source and target domains. Our transfer background method helps mitigate this domain shift issue.
Figure 2: Real domain noise (green) and artificially synthesized noise (red). The real domain noise exhibits greater variability, whereas the synthetic noise is uniform and cannot capture the dynamic heteroscedastic nature of noise in real-world situations.
Figure 3: The overall framework of our method. Stage I presents the complete Wavelet-guided Cross-domain Synthesis framework, consisting of (a) and (b). (a) illustrates the process of extracting high-value small targets from the source set (right image) and unsupervised background sample extraction (left image). (b) describes the synthesis of new samples using SSIM ranking and Poisson blending techniques. Stage II outlines the process of Real-domain Noise Invariance Learning. It involves unsupervised real noise sampling from target images, minimizing loss to enhance the model’s robustness to cross-domain noise. (c) shows the detailed process of extracting noise from the target set.

These challenges result in a performance gap between source and target domain data. Traditional techniques, such as synthetic noise augmentation [16], [17] and static data preprocessing [18], [19], are ineffective in addressing the complex noise distributions and dynamic background changes encountered in real-world testing conditions.

In recent years, researchers have been exploring methods to build hierarchical feature representations and facilitate cross-layer feature reuse. For instance, in the multimedia domain, HCF-Net [20] addresses multi-scale background interference with a hierarchical context fusion strategy; Lin et al. [21] learn shape-biased representations via a large-kernel encoder and a shape-guided decoder, incorporating shape information to better localize faint targets, though the method relies on supervised training; Luo et al. [1] present a spatio-temporal aware unsupervised network that boosts adaptability to dynamic scenes; and SCTransNet [22] combines spatial attention with channel recalibration mechanisms. However, current research is limited by its reliance on fixed training datasets, which hinders adaptability to unknown target domains.

To address domain shifts arising from background environmental differences and cross-domain heteroscedastic noise, the CORAL [23] feature alignment method, based on second-order statistics, reduces domain differences by aligning the statistical distributions of the source and target domains. RandConv [24], [25], on the other hand, increases the diversity of training data through artificially generated noise or style perturbations. However, both methods are constrained by the assumption of static noise, which fails to capture the dynamic, complex noise distributions encountered in real-world environments [26], [27].

To overcome these limitations, we propose a Doubly Wavelet-guided Invariance Learning Framework. In the first stage, we introduce Wavelet-guided [28] Cross-domain Synthesis, a strategy that enables cross-domain adaptation during training, ensuring stable small target detection without the need for additional inference adjustments. In the second stage, Real-domain Noise Invariance Learning extracts real noise features from the target domain to build a dynamic noise library. By mixing noise data and applying self-supervised loss constraints, the model learns to be invariant to noise.

Additionally, we have developed a cross-domain dynamic degradation dataset for small targets in UAV infrared imaging: Dynamic-ISTD. This dataset contains training and testing sets from various domains, designed to simulate the distribution shifts typically encountered in real-world scenarios.

In summary, the key contributions of this paper are as follows:

  • We propose the Wavelet-guided Cross-domain Synthesis strategy, which allows the model to adapt to target domain features at the data space level during training, thus enhancing cross-domain generalization.

  • We present the Real-domain Noise Invariance Learning strategy, which enables the model to adapt to the noise characteristics of the target domain at the feature space level.

  • We introduce a new cross-domain dynamic degradation dataset, Dynamic-ISTD Benchmark, specifically for UAV infrared small targets. The dataset includes training and testing sets from different domains to replicate the cross-domain distribution shifts that are likely to occur in practical applications.

  • Through cross-domain validation experiments, we show the effectiveness and broad applicability of our method across various datasets.

Figure 4: (a) illustrates the proposed unsupervised Background Region Detection framework. After applying Wavelet Multi-frequency Filtering (WMF) and calculating small target probabilities based on edge density and the Laplacian operator, we select appropriate background samples. (b) shows the detailed implementation of WMF, where L^{low} is the low-frequency sub-band obtained through wavelet decomposition, and \{H_k^{high}\}^3_{k=1} are the high-frequency sub-bands.

2 Related work↩︎

2.1 Single-frame Infrared Small Target Detection↩︎

The precise detection and segmentation of small, weak targets in complex infrared backgrounds remain a major challenge. The inherent limitations of infrared imaging cause non-uniform background interference, significantly degrading the performance of traditional detection algorithms across different environments. Traditional methods, primarily focused on local contrast analysis (such as Tophat [29], [30] and IPI [19], [31]), struggle to handle dynamic background textures due to their inability to model multi-scale features effectively. Deep learning approaches, which build hierarchical feature representations, have shown considerable advantages in target representation learning [1]. DNA-Net [32] improves small target responses by reusing features across layers; HCF-Net [20] uses a hierarchical context fusion strategy to address multi-scale background interference [2]; and SCTransNet [22] innovatively combines spatial attention with channel recalibration mechanisms, maintaining robust detection even in the presence of complex thermal noise. Despite these advancements, the lack of real-world environment modeling and the sensitivity to domain shifts continue to limit their practical use [5].

2.2 Domain Shift Challenges↩︎

The main challenge of domain shift lies in the complex and dynamic differences in data distributions between the source and target domains [33], such as changes in lighting, imaging conditions, or sensor noise [34], [35]. These differences often lead to a significant drop in model performance when applied to real-world scenarios. In recent years, researchers have introduced various strategies to counteract the performance degradation caused by domain shifts, including adaptive feature alignment [36]–[38]. CORAL [23], [39] uses second-order statistics, minimizing feature distribution differences by aligning the covariance matrices of the source and target domains. However, these methods typically rely on explicit domain label information to differentiate between various imaging conditions, making them difficult to apply in practical scenarios where the target domain lacks or has limited labels [40], [41].

3 Methodology↩︎

As shown in Fig. 3, this section introduces a new two-stage optimization framework. In the first stage, the framework aligns domains in the data space. In the second stage, it reduces domain differences and noise sensitivity in the feature space.

3.1 Wavelet-guided Cross-domain Sample Synthesis↩︎

As shown in Fig. 3, we use Background Region Detection (BRD) and Dual-indicator Selection (DIS) to carry out unsupervised background filtering and extract high-value targets.

Background Region Detection (BRD). We use Wavelet Multi-frequency Filtering (WMF) to split the target images into sub-bands at different frequencies. The low-frequency baseband \(L^{low}\) captures the primary structural details of the image, while the high-frequency bands \(\{H_k^{high}\}^3_{k=1}\) capture edge details and noise. By using the low-frequency image to guide the high-frequency sub-bands, we perform edge-aware filtering that suppresses noise while accurately preserving edge details: \[\tilde{H}_k^{high} = H_k^{high} \cdot \frac{|\nabla L^{low}|}{\max(|\nabla L^{low}|) + \epsilon},\] where \(H_k^{high}\) and \(\tilde{H}_k^{high}\) represent the high-frequency detail bands before and after filtering, \(L^{low}\) is the low-frequency baseband, and \(\epsilon\) is a small regularization term that stabilizes the denominator. Additionally, we split the filtered target image \(F^{target}\) into multiple non-overlapping subblocks \(B_{i,j}\). For each subblock, we compute its edge density and Laplacian response. The features are then normalized to the range \([0,1]\) and combined to yield the background-tendency probability \(p_{i,j}\): \[p_{i,j} = \sum_{(u,v) \in B_{i,j}} \left| \nabla^2 I(u,v) \right| + \frac{1}{S^2} \sum_{(u,v) \in B_{i,j}} \sqrt{G_u^2 + G_v^2},\] where \((i,j)\) represents the subblock index, \((u,v)\) is the pixel index within the image, \(S\) is the subblock size, and \(G_u\) and \(G_v\) are the horizontal and vertical gradient fields of the subblock, respectively. \(\nabla^2\) is the Laplacian operator. Low-probability regions correspond to areas with low-texture background characteristics. A preset threshold \(\tau_b\) is applied to select background subblocks \(P^{target} \in \mathbb{R}^{126\times126}\) from the original image \(X^{target}\). These background regions are then upsampled to the original resolution using bilinear interpolation, ensuring geometric consistency with the input image. As shown in Fig. 4, the image after WMF exhibits a higher signal-to-noise ratio, which suggests that WMF effectively distinguishes real edges from random noise in the high-frequency sub-bands. The workflow is summarized in Algorithm 7.
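The following minimal Python sketch illustrates the two BRD steps described above, namely wavelet-guided high-frequency filtering and block-wise background scoring. The wavelet basis, block size, thresholds, and helper names (`wavelet_filter`, `background_score`) are illustrative assumptions rather than the released implementation.

```python
# Hedged sketch of Background Region Detection (BRD). Wavelet basis, block size,
# and function names are assumptions for illustration.
import numpy as np
import pywt
from scipy import ndimage

def wavelet_filter(img, wavelet="haar"):
    """Edge-aware filtering of high-frequency sub-bands guided by |grad(L_low)|."""
    low, (lh, hl, hh) = pywt.dwt2(img.astype(np.float32), wavelet)
    gy, gx = np.gradient(low)
    grad = np.sqrt(gx ** 2 + gy ** 2)
    guide = grad / (grad.max() + 1e-6)                 # |grad L_low| / (max|grad L_low| + eps)
    filtered = tuple(h * guide for h in (lh, hl, hh))  # suppress noise, keep true edges
    return pywt.idwt2((low, filtered), wavelet)

def background_score(img, block=32):
    """Block-wise score p_{i,j}: normalized |Laplacian| sum plus mean gradient magnitude."""
    img = img.astype(np.float32)
    lap = np.abs(ndimage.laplace(img))
    gy, gx = np.gradient(img)
    grad = np.sqrt(gx ** 2 + gy ** 2)
    nb_y, nb_x = img.shape[0] // block, img.shape[1] // block
    lap_s = np.zeros((nb_y, nb_x))
    grad_s = np.zeros((nb_y, nb_x))
    for i in range(nb_y):
        for j in range(nb_x):
            sl = np.s_[i * block:(i + 1) * block, j * block:(j + 1) * block]
            lap_s[i, j] = lap[sl].sum()
            grad_s[i, j] = grad[sl].mean()
    norm = lambda a: (a - a.min()) / (a.max() - a.min() + 1e-6)  # scale each cue to [0, 1]
    return norm(lap_s) + norm(grad_s)  # blocks with p_{i,j} < tau_b are background-tendency
```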

Figure 5: Visualization results of Background Region Detection methods on target data. The left column displays the background region tendency map, while the right column presents the cropped region. The red boxes indicate the cropping results from different methods, and the yellow dashed circles highlight cases where other methods misclassified the background, including unnecessary small targets within the background. The yellow boxes are used to clearly emphasize the small targets.

As shown in Fig. 5, using low-frequency structural guidance to filter the high-frequency sub-bands effectively reduces random noise interference and helps distinguish small targets from subtle background textures, providing reliable prior information for cross-domain data optimization.

In addition, we use a dual-indicator selection strategy to build the difficult target set. During each iteration, the model \(f_\theta\) generates a predicted mask \(\hat{Y}=f_\theta(X^{\text{source}})\) for the training set \(X^{\text{source}}\). We then evaluate the candidate regions using pixel accuracy (PixAcc) and intersection over union (IoU). The final difficult target set \(P^{source}\) is selected based on these evaluations.

\[P^{\text{source}} = \begin{cases} \text{Select} & \text{if } \text{PixAcc}(\hat{Y},Y) < \tau_p \wedge \text{IoU}(\hat{Y},Y) < \tau_i \\ \varnothing & \text{otherwise} \end{cases}\] Here, \(\tau_p\) and \(\tau_i\) represent the threshold values for PixAcc and IoU, respectively.
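A short sketch of this dual-indicator selection is given below, assuming the detector returns a single logit map; the simple per-image PixAcc/IoU estimates and threshold values are placeholders rather than the exact metrics used in training.

```python
# Hedged sketch of Dual-indicator Selection (DIS): keep source samples on which the
# current model scores below both thresholds. Assumes model(x) returns one logit map.
import torch

@torch.no_grad()
def select_hard_targets(model, loader, tau_p=0.5, tau_i=0.5):
    hard = []
    model.eval()
    for x, y in loader:                      # x: images, y: binary ground-truth masks
        pred = (torch.sigmoid(model(x)) > 0.5).float()
        tp = (pred * y).sum(dim=(1, 2, 3))
        pixacc = tp / y.sum(dim=(1, 2, 3)).clamp(min=1)           # target-pixel accuracy
        union = ((pred + y) > 0).float().sum(dim=(1, 2, 3)).clamp(min=1)
        iou = tp / union
        keep = (pixacc < tau_p) & (iou < tau_i)                   # both indicators below threshold
        hard += [(xi, yi) for xi, yi, k in zip(x, y, keep) if k]
    return hard                              # candidate difficult target set P^source
```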

Re-Generation. As shown in Fig. 6, we grid-segment the candidate background region \(P^{\text{target}}\) to generate a set of local windows \(A^{target}\). We then calculate the structural similarity (SSIM) between these windows and the candidate small target set \(P^{\text{source}}\). The regions are sorted by SSIM value in descending order, and the best matching region \(A^{top}\) is selected.

\[A^{top} = \mathrm{TOP}_{\mathrm{SSIM}}(A^{\text{target}}, P^{\text{source}}),\] where \(A^{target}\) refers to the local windows in the candidate background, and \(P^{\text{source}}\) is the set of candidate targets. The workflow can be found in Algorithm 8. To remove color discrepancies and seam artifacts, we apply Poisson fusion to optimize gradient continuity.

\[\min_{X^{p}} \iint_{A^{top}} \left| \nabla X^{p} - \nabla P^{\text{source}} \right|^2 \, du \, dv, \quad \text{s.t.}\;\; I^{a}\big|_{\partial A^{top}} = I^{a*}\big|_{\partial A^{top}}.\] Here, \(X^{p}\) refers to the synthesized image, and \(A^{top}\) is the target coverage area. The pixel mappings \(I^{a}\) and \(I^{a*}\) inside and outside the synthesized image are kept consistent at the boundary \(\partial A^{top}\).
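The sketch below uses skimage's SSIM and OpenCV's seamlessClone as stand-ins for the ranking and Poisson fusion steps; it assumes uint8 grayscale inputs, and the window size and stride are illustrative.

```python
# Hedged sketch of Re-Generation: rank background windows by SSIM against a source
# target patch, then paste the patch at the best match with Poisson (seamless) blending.
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def paste_target(background, patch, window=64, stride=32):
    """background, patch: uint8 grayscale; returns the blended synthetic image."""
    th, tw = patch.shape[:2]
    best, best_xy = -1.0, (0, 0)
    for y in range(0, background.shape[0] - window, stride):
        for x in range(0, background.shape[1] - window, stride):
            win = cv2.resize(background[y:y + window, x:x + window], (tw, th))
            score = ssim(win, patch, data_range=255)   # TOP_SSIM: keep best-matching window
            if score > best:
                best, best_xy = score, (x, y)
    center = (best_xy[0] + window // 2, best_xy[1] + window // 2)
    mask = np.full(patch.shape[:2], 255, np.uint8)
    src = cv2.cvtColor(patch, cv2.COLOR_GRAY2BGR)
    dst = cv2.cvtColor(background, cv2.COLOR_GRAY2BGR)
    out = cv2.seamlessClone(src, dst, mask, center, cv2.NORMAL_CLONE)  # gradient-domain fusion
    return cv2.cvtColor(out, cv2.COLOR_BGR2GRAY)
```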

Figure 6: Illustration of Structural Similarity (SSIM) evaluation and fusion. First, the color and texture similarities between the target images and source image blocks are calculated to generate SSIM scores for the candidate regions. The regions are then ranked by their scores, and the best matching region is selected for target embedding.
Figure 7: Background Region Detection
Figure 8: Re-Generation Algorithm
Figure 9: The overall process of Real-domain Noise Invariance Learning. The upper-left section depicts the Noise Mixing process, where appropriate noise is extracted from the target set and mixed into the source set to create a new dataset. The lower-left section shows the dual-branch network architecture with shared weights, optimized through a combination of supervised and self-supervised losses to learn denoising features. The images on the right, (a) and (b), show the modules within CAF.

3.2 Real-world Domain Noise Invariance Learning↩︎

Network Structure. A five-layer residual downsampling module \(\{E_i\}^5_{i=1}\) is used in the encoder to extract multi-scale high-level features, where \(E_i \in \mathbb{R}^{H/2^i \times W/2^i\times C_i}\), and low-dimensional semantic representations are generated through block embedding. In the decoder, spatial resolution is progressively restored using upsampling \(\{D_i\}^4_{i=1}\) and skip connections. Binary cross-entropy loss is applied to the output of each decoding layer. Multi-scale feature fusion is employed to aggregate contextual information from different levels, generating the final prediction. The total loss is computed through weighted summation. Specifically, the supervised loss \(\mathcal{L}_{\text{BCE}}\) is composed of losses from all scales: \[\mathcal{L}_{\text{BCE}} = \sum_{i=1}^{4} \lambda_i \cdot \mathrm{BCE}(Y_i, \hat{Y}_i).\]

where \(\mathrm{BCE}(\cdot)\) refers to the binary cross-entropy loss function, \(Y_i\) is the true label at the \(i\)-th scale, \(\hat{Y}_i\) is the corresponding predicted output, \(\mathcal{L}_{\text{BCE}}\) is the supervised loss, and \(\lambda_i\) is the weight coefficient for each scale.
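A minimal PyTorch sketch of this multi-scale supervised loss follows; the weights \(\lambda_i\) in the sketch are placeholders, since their exact values are not specified here.

```python
# Hedged sketch of the multi-scale supervised loss L_BCE. The per-scale weights are
# illustrative placeholders.
import torch.nn.functional as F

def multiscale_bce(preds, target, weights=(1.0, 0.5, 0.25, 0.125)):
    """preds: list of 4 logit maps at decreasing resolution; target: full-resolution mask."""
    loss = 0.0
    for w, p in zip(weights, preds):
        y = F.interpolate(target, size=p.shape[-2:], mode="nearest")  # resize mask to this scale
        loss = loss + w * F.binary_cross_entropy_with_logits(p, y)
    return loss
```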

Figure 10: (a) The diagram illustrates the process of constructing the dataset. (b) The diagram displays the composition and distribution of the dataset, where Severely Degraded Samples are created by combining Strongly Blurred Samples with Noisy Samples. (c) The diagram provides examples of backgrounds and targets across various scenes.

Real-World Noise Guided Regularization. To simulate real noise distributions, we build a noise mixing model. First, \(k\) groups of images with consistent noise distributions are selected from the real target domain to create the noise sample library \(X^{target}_k=[x^{target}_1,x^{target}_2,\dots,x^{target}_k]\in \mathbb{R}^{k\times c\times h\times w}\). Using a sliding window, we divide these \(k\) groups of noise samples \(X^{target}_k\) into multiple sampling regions \(A_k\). Then, based on the local variance threshold \(\sigma_{max}\) and mean threshold \(\mu_{\text{min}}\), we adaptively select the noise regions \(A^{noise}_k=[a^{noise}_1,a^{noise}_2,\dots,a^{noise}_k]\in \mathbb{R}^{k\times c\times \frac{h}{15}\times \frac{w}{12}}\). Finally, these selected noise regions \(A^{noise}_{k}\) are upsampled to the original resolution, creating the noise library \(P^{noise}_k\in \mathbb{R}^{k\times c\times h\times w}\). At the same time, mixed samples \(X^{noise}\) are generated using linear interpolation: \[{X^{noise}} = \lambda \cdot \text{R}(P^{noise}_k) + (1 - \lambda) X^{source}, \quad \lambda \sim U(0,1).\] Here, \(R(\cdot)\) refers to random sampling from the noise library. The parameter \(\lambda\) controls the noise injection intensity; in our experiments, \(\lambda=0.5\) provides the best performance.
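The noise-library construction and mixing can be sketched as follows; the acceptance rule for "flat" windows (standard deviation below \(\sigma_{max}\), mean above \(\mu_{min}\)) and the threshold values are assumptions made for illustration.

```python
# Hedged sketch of real-noise sampling and mixing. Threshold values and the exact
# acceptance rule for flat windows are illustrative assumptions.
import numpy as np
import cv2

def build_noise_library(target_imgs, sigma_max=20.0, mu_min=10.0):
    library = []
    for img in target_imgs:                      # img: float32 grayscale in [0, 255]
        h, w = img.shape
        wy, wx = h // 15, w // 12                # sampling window of roughly h/15 x w/12
        for y in range(0, h - wy + 1, wy):
            for x in range(0, w - wx + 1, wx):
                patch = img[y:y + wy, x:x + wx]
                if patch.std() < sigma_max and patch.mean() > mu_min:  # flat, non-dark region
                    library.append(cv2.resize(patch, (w, h)))          # upsample to full size
    return library

def mix_noise(source_img, library, lam=None):
    lam = np.random.uniform(0, 1) if lam is None else lam  # lambda = 0.5 reported to work best
    noise = library[np.random.randint(len(library))]       # R(.): random draw from the library
    return lam * noise + (1.0 - lam) * source_img          # X^noise
```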

The final output layer \(D_{1}\in \mathbb{R}^{H \times W \times 1}\) of the encoder-decoder architecture is selected as the feature alignment node. A dual-branch network with shared weights is designed: the main branch processes the original input \(X^{source}\) to extract clean features \(\hat{Y}\), while the auxiliary branch processes noisy samples \(X^{noise}\) to capture noise features \(\tilde{Y}\). To enforce feature distribution consistency, a feature-level self-supervised loss function is defined: \[\mathcal{L}_{\text{SUP}} = {L}_{\text{MSE}}(\hat{Y}, \tilde{Y}),\] where \(L_{\text{MSE}}\) is averaged over the batch of size \(N\), with \(\epsilon = 10^{-8}\) added for numerical stability. This enables implicit domain adaptation to the test-time noise. We then optimize the total loss: \[\mathcal{L}_{\text{total}} = {L}_{\text{BCE}} + {L}_{\text{SUP}}.\]
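One training step of the shared-weight dual branch can be sketched as below; applying the consistency term to the final sigmoid outputs and summing the two losses without extra weighting are assumptions, as is the `multiscale_bce` helper from the earlier sketch.

```python
# Hedged sketch of one dual-branch training step: the same network processes the clean
# and noise-mixed inputs, and an MSE consistency term is added to the supervised loss.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x_source, x_noise, target, bce_fn):
    model.train()
    preds_clean = model(x_source)                    # list of multi-scale logit maps
    preds_noisy = model(x_noise)                     # same weights, noise-mixed branch
    l_bce = bce_fn(preds_clean, target)              # supervised loss on the clean branch
    l_sup = F.mse_loss(torch.sigmoid(preds_clean[0]),
                       torch.sigmoid(preds_noisy[0]))  # L_SUP: output consistency
    loss = l_bce + l_sup                             # L_total = L_BCE + L_SUP
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```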

Table 1: Quantitative results of different models trained on NUAA and tested on IRSTD-1K. The optimal, second-optimal, and third-optimal values are labelled in red, blue, and green respectively.
Model PixAcc\(\uparrow\) (\(\times 10^{-2}\)) mIoU\(\uparrow\) (\(\times 10^{-2}\)) nIoU\(\uparrow\) (\(\times 10^{-2}\)) \(P_d\uparrow\) (\(\times 10^{-2}\)) \(F_a\downarrow\) (\(\times 10^{-6}\)) F1\(\uparrow\) (\(\times 10^{-2}\))
ACM-Net [42] 61.03 50.87 51.12 88.92 30.68 55.49
ALC-Net [43] 89.16 25.02 26.19 90.60 519.76 39.07
DNA-Net [32] 53.80 46.85 57.49 84.22 30.89 50.08
RDIAN [44] 51.34 27.32 47.87 86.91 247.10 35.66
ISTDU-Net [45] 63.10 47.18 53.00 87.91 86.71 53.99
UIU-Net [46] 60.24 48.30 56.82 87.91 50.54 53.61
HCF-Net [20] 77.62 14.64 39.03 91.89 963.54 24.63
SCTransNet [22] 60.16 49.15 55.66 90.60 60.77 65.87
Ours 65.06 52.73 58.43 92.95 55.49 69.00
Table 2: Quantitative results of different models trained on IRSTD-1K and tested on NUAA. The optimal, second-optimal, and third-optimal values are labelled in red, blue, and green respectively.
Model PixAcc\(\uparrow\) (\(\times 10^{-2}\)) mIoU\(\uparrow\) (\(\times 10^{-2}\)) nIoU\(\uparrow\) (\(\times 10^{-2}\)) \(P_d\uparrow\) (\(\times 10^{-2}\)) \(F_a\downarrow\) (\(\times 10^{-6}\)) F1\(\uparrow\) (\(\times 10^{-2}\))
ACM-Net [42] 83.22 66.30 65.23 92.15 96.07 36.38
ALC-Net [43] 85.84 60.00 65.20 94.11 102.85 70.63
DNA-Net [32] 83.10 73.59 77.24 97.06 9.64 78.05
RDIAN [44] 76.93 57.56 68.37 91.18 115.13 73.06
ISTDU-Net [45] 83.78 69.86 73.88 96.08 19.72 29.37
UIU-Net [46] 86.21 75.41 75.04 93.14 11.10 80.45
HCF-Net[20] 63.29 53.30 54.15 87.01 31.79 57.82
SCTransNet [22] 79.92 65.96 75.04 95.10 61.66 79.48
Ours 87.45 75.44 75.89 98.09 11.54 86.00
Table 3: Comparison of target detection performance. This table presents the performance of both the state-of-the-art methods and our method across five evaluation metrics: pixel accuracy (PixAcc), mean intersection over union (mIoU), normalized intersection over union (nIoU), detection probability (Pd), and F1 score. The best results are marked in red.
Method Dynamic-ISTD Benchmark
PixAcc\(\uparrow\) (\(\times 10^{-2}\)) mIoU\(\uparrow\) (\(\times 10^{-2}\)) nIoU\(\uparrow\) (\(\times 10^{-2}\)) Pd\(\uparrow\) (\(\times 10^{-2}\)) F1\(\uparrow\) (\(\times 10^{-2}\))
ACM-Net [42] 19.46 17.62 19.52 27.34 18.49
ALC-Net [43] 74.67 10.49 22.19 15.95 18.40
DNA-Net [32] 78.41 71.97 56.46 56.71 75.05
RDIAN [44] 65.01 58.03 46.14 50.25 61.32
ISTDU-Net [45] 74.14 69.03 53.84 52.91 71.49
UIU-Net [46] 58.13 55.96 41.22 37.47 57.02
HCF-Net[20] 69.27 48.66 50.63 55.33 57.17
SCTransNet [22] 78.37 74.51 58.75 60.51 84.47
Ours 81.70 76.01 62.97 68.10 85.51

4 Dynamic-ISTD Benchmark↩︎

To address the challenges of real-world degradations, we introduce a cross-domain dynamic degradation dataset for drone infrared imaging (Dynamic-ISTD Benchmark) in Fig. 10. The dataset progressively degrades from the training set (source domain) to the test set (target domain) to simulate cross-domain shifts within the same dataset. It consists of 206 representative images selected from infrared aerial photography, with pixel-level annotations marking the small targets. The image resolution is standardized at 640 \(\times\) 512, and the data cover multiple scenes and three types of targets captured under various conditions. We also propose a dynamic degradation injection strategy to simulate drone flight observation conditions. Under this strategy, 70% of the original data is retained as natural samples, while the remaining 30% is processed with motion blur degradation [47] to simulate extreme flight scenes.

For the training set \(X_{train}\), the degradation parameters are set as follows: blur length \(l=5\) and blur angle \(\theta=45^\circ\). To better simulate cross-domain shifts, the degradation applied to the test set is more diverse. In the test set \(X_{test}\), motion blur with discrete angles \(\theta=[25^\circ,45^\circ,90^\circ,135^\circ,150^\circ,...]\) is applied to simulate the varying blur characteristics under different flight conditions. To further simulate the challenges of out-of-distribution noise, five types of noise, including Gaussian and salt-and-pepper noise, are combined and added to all images in the test set \(X_{test}\) at a proportion of 0.05. This helps to evaluate the adaptability of our method under cross-domain shift conditions more thoroughly.
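A simple way to realize this degradation injection is sketched below; the linear motion-blur kernel construction and the salt-and-pepper injection are common recipes used here for illustration, not necessarily the exact degradation code behind the benchmark.

```python
# Hedged sketch of the dynamic degradation injection: linear motion blur of length l at
# angle theta, plus salt-and-pepper noise at a small proportion for the test split.
import numpy as np
import cv2

def motion_blur(img, length=5, angle_deg=45.0):
    kernel = np.zeros((length, length), np.float32)
    kernel[length // 2, :] = 1.0                       # horizontal line kernel
    center = ((length - 1) / 2.0, (length - 1) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length, length))
    kernel /= max(kernel.sum(), 1e-6)
    return cv2.filter2D(img, -1, kernel)

def add_salt_pepper(img, proportion=0.05):
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < proportion / 2] = 0                     # pepper
    out[mask > 1 - proportion / 2] = 255               # salt
    return out
```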

5 Experimental Validation↩︎

In this section, we present our experimental dataset, training details, and comparative evaluations. We demonstrate the superiority of our proposed method through quantitative and qualitative assessments, along with ablation experiments to highlight the effectiveness of each module.

5.1 Experiment setup↩︎

Implementation Details. The experiments are implemented in PyTorch, and all training and inference are conducted on an NVIDIA Titan Xp GPU. To prevent the vanishing gradient problem, the network parameters are initialized using the Kaiming [48] normal distribution strategy. During training, the batch size is 8, and the Adam [49] optimizer is used for parameter updates. The initial learning rate is 0.001 and is dynamically decayed to \(10^{-5}\) using cosine annealing. The model is trained for 1000 epochs.
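A minimal sketch of this training configuration (Kaiming initialization, Adam, cosine annealing from 1e-3 to 1e-5 over 1000 epochs) follows; initializing only the convolutional layers is an assumption made for brevity.

```python
# Hedged sketch of the optimizer/scheduler setup described above.
import torch
import torch.nn as nn

def configure_training(model, epochs=1000):
    for m in model.modules():
        if isinstance(m, nn.Conv2d):                  # Kaiming normal initialization
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-5)        # decay to 1e-5 by cosine annealing
    return optimizer, scheduler
```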

Cross-dataset Settings. To evaluate the domain robustness of the proposed method, we applied a strict cross-domain validation framework on our self-constructed Dynamic-ISTD benchmark. The dataset is evenly split into training (50%) and test sets (50%), with controlled domain shifts. Specifically, the training set introduces motion blur within a limited parameter space, while the test set contains multi-angle motion blur and composite noise perturbations. This setup significantly increases the difference between the training and test distributions, mimicking real-world domain variations. Additionally, to validate the generality and effectiveness of our cross-domain method, we performed bidirectional transfer experiments on two established benchmarks (NUAA-SIRST [50] and IRSTD-1k [51]). All comparison methods were retrained using the same implementation details for a fair evaluation.

5.2 Comparisons under the o.o.d. condition↩︎

Table 1 and Table 2 present the results of the cross-domain validation experiments. As shown, in the NUAA-SIRST [50] → IRSTD-1k [51] cross-domain scenario, our method achieves a PixAcc of 65.06 and an F1-score of 69.00. Similarly, in the IRSTD-1k [51] → NUAA-SIRST [50] cross-domain scenario, our method performs exceptionally well, with an F1-score of 86.00, outperforming all comparison methods. These cross-domain validation experiments demonstrate that our method has strong generality and cross-domain adaptation capabilities: it can effectively handle the characteristic differences between datasets, offering potential for cross-scenario deployment in real-world applications.

5.3 Comparisons under the i.i.d. condition↩︎

Quantitative Analysis. To thoroughly assess the effectiveness of our proposed method, we compare it with several advanced methods in the infrared small target detection field, including ACM-Net [42], ALC-Net [43], DNA-Net [32], RDIAN [44], ISTDU-Net [45], UIU-Net [46], HCF-Net [20], and SCTransNet [22]. As shown in Table 3, our method outperforms the existing methods in mIoU (76.01), nIoU (62.97), Pd (68.10), and F1 (85.51).

Figure 11: Visualization of different models on the dataset. Blue, yellow, and red circles indicate correctly detected targets, missed detections, and false positives, respectively. The blue boxes highlight small targets for clearer observation.

Visualization Analysis. Fig. 11 shows the visual results of 8 representative algorithms on Dynamic-ISTD Benchmark. Our method significantly improves background region discrimination accuracy and achieves high-precision target localization, effectively solving the problem of distinguishing adjacent targets. In Fig. 11 (1), while the SCTransNet [22] performs similarly to our method in target clarity, it has a much higher background misclassification rate, with other comparison models showing similar issues. In Fig. 11 (3), our method provides more accurate and clearer recognition of small car targets, matching real small targets more closely and avoiding the unclear contours seen in other models.

Efficiency Analysis. As shown in Table 4, our method strikes a better balance between accuracy and efficiency than the other models. Compared to SCTransNet [22], which has a similar parameter count, our method raises IoU to 76.01 (+1.5) at a comparable inference time. In comparison to the efficient DNA-Net [32], we achieve a 4.04 increase in IoU with slightly more parameters and a longer inference time. Against UIU-Net [46], with 50.54M parameters, our method delivers a much higher IoU with just 22% of the parameters.

Table 4: The performance of different methods in terms of parameters, inference time, and IoU on the Dynamic-ISTD Benchmark.
Method Dynamic-ISTD Benchmark
Pub. year Params (M) Inference time (\(10^{-3}\) s) IoU\(\uparrow\)
ACM-Net [42] 2021 0.39 1.77 17.62
ALC-Net [43] 2021 0.43 1.66 10.49
DNA-Net [32] 2022 4.70 4.11 71.97
RDIAN [44] 2022 2.75 1.35 58.03
ISTDU-Net [45] 2022 2.75 3.52 69.03
UIU-Net [46] 2022 50.54 4.07 55.96
HCF-Net[20] 2024 0.22 8.35 48.66
SCTransNet [22] 2024 11.19 5.57 74.51
Ours 2025 11.33 6.23 76.01
Table 5: Ablation study on different Background Region Detection Methods on the Dynamic-ISTD Benchmark.
Method Dynamic-ISTD Benchmark
PixAcc\(\uparrow\) mIoU\(\uparrow\) nIoU\(\uparrow\) Pd\(\uparrow\) F1\(\uparrow\)
Baseline 78.37 74.51 58.75 60.51 84.47
Random Crop 77.78 73.98 59.58 65.32 84.12
w/o WMF 79.51 74.78 60.10 68.61 84.17
Ours
Table 6: Ablation study on varying the hyperparameter \(\alpha\) within the Real-world Domain Noise Invariance Learning framework.
Method Dynamic-ISTD Benchmark
PixAcc\(\uparrow\) mIoU\(\uparrow\) nIoU\(\uparrow\) Pd\(\uparrow\) F1\(\uparrow\)
\(\alpha\) = 0.1 78.90 74.48 60.38 63.80 84.49
\(\alpha\) = 0.3 79.90 74.68 60.53 67.85 84.63
\(\alpha\) = 0.5
\(\alpha\) = 0.7 79.62 75.19 60.40 64.68 84.96
\(\alpha\) = 0.9 78.65 73.53 60.01 67.34 83.83
Table 7: Noise Mixing Types on the Dynamic-ISTD Benchmark.
Noise-Type Dynamic-ISTD Benchmark
PixAcc\(\uparrow\) mIoU\(\uparrow\) nIoU\(\uparrow\) Pd\(\uparrow\) F1\(\uparrow\)
Baseline 78.37 74.51 58.75 60.51 84.47
Gaussian 74.84 72.04 55.14 58.61 82.70
Salt-Pepper 77.96 74.71 59.89 64.56 84.58
Speckle 78.36 75.21 59.58 63.42 84.88
Uniform 78.02 74.15 59.06 61.51 84.22
Poisson 79.30 75.37 59.55 64.94 85.03
Composite Noise 79.52 74.92 61.10 66.08 84.79
Real-World Noise

1pt

Table 8: Ablation study of Fusion datasets and Self-supervised modules on the Dynamic-ISTD Benchmark.
Baseline Fusion datasets Self-supervised Performance
PixAcc\(\uparrow\) mIoU\(\uparrow\) nIoU\(\uparrow\) Pd\(\uparrow\) F1\(\uparrow\)
\(✔\) \(\times\) \(\times\) 78.37 74.51 58.75 60.51 84.47
\(✔\) \(✔\) \(\times\) 80.37 75.26 62.23 68.23 84.98
\(✔\) \(✔\) \(✔\) 81.70 76.01 62.97 68.10 85.51

5.4 Ablation Studies↩︎

Effectiveness of Background Region Detection. The effectiveness of Background Region Detection comes from the combination of wavelet filtering and probabilistic clipping. These reduce random noise interference in segmentation while preserving key structural information. Wavelet filtering improves feature extraction by separating low-frequency basebands from high-frequency details, and probabilistic clipping precisely locates the key regions. As shown in Table 5, Background Region Detection enhances feature extraction and significantly improves segmentation accuracy.

Self-supervised Noise Consistency Parameters. Table 6 shows an important balance point in the noise-guided adaptation mechanism. The best performance is achieved when \(\alpha= 0.5\), indicating that the model needs to retain enough original image information while introducing sufficient noise variation during learning. Lower values of \(\alpha\) (e.g., 0.1, 0.3) may not introduce enough noise for the model to learn adaptive features, while higher values (e.g., 0.7, 0.9) can add too much noise, disrupting the key structures of the original image. At \(\alpha= 0.5\), the model effectively combines features consistent with the original image while also learning noise robustness, achieving the best overall performance.

Noise Type. Table 7 shows the impact of different noise types on model performance. Composite Noise refers to a complex noise formed by combining five types of noise, including Gaussian and salt-and-pepper noise, among others. It performs well in nIoU, suggesting that this noise type aids in improving segmentation accuracy and target localization. Real-World Noise delivers the best performance, with a PixAcc of 79.75 and an F1 score of 85.22, highlighting that the model trained with real-world noise is the most robust.

Ablation on Fusion Datasets and Self-supervision. As shown in Table 8, the Baseline performance is significantly lower than that of the other experimental groups. Re-Generation diversifies small target scenarios and greatly boosts PixAcc. The noise-consistency self-supervised strategy enhances the model's robustness to noise, effectively reducing the negative impact of noise in the test set. Overall, the collaborative optimization of multiple strategies overcomes the performance limitations of the individual modules.

6 Conclusion↩︎

We present a domain adaptation enhancement framework to tackle the cross-domain distribution shift problem in infrared small target detection. Wavelet-guided Cross-domain Synthesis improves adaptability to target environments without requiring extra inference adjustments. Real-world Domain Noise Invariance Learning overcomes the limitations of artificial noise assumptions and boosts robustness to heterogeneous noise. Cross-domain validation experiments conducted on both custom and real datasets show the effectiveness and wide applicability of our method across different datasets.

Yuehui Li is currently pursuing the B.S. degree at the School of Information Engineering, Guangdong University of Technology, Guangzhou, China. Her research interests include computer vision and machine learning.

Yahao Lu received the B.S. degree in 2023, from the School of Information Engineering, Guangdong University of Technology, Guangzhou, China, where he is currently working towards a M.S. degree. His research interests include computer vision and machine learning.

Haoyuan Wu is currently pursuing a B.S. in Communication Engineering at Guangdong University of Technology, Guangzhou, China. His academic interests are centered in robotics engineering and machine vision.

Sen Zhang received the Ph.D. degree from the School of Computer Science, the University of Sydney, Australia, in 2023. Previously, he obtained a B.S. degree in 2014, from the School of Biomedical Engineering, Tsinghua University, China. He is currently a Researcher at ByteDance Inc, Australia. His research interests include reinforcement learning, computer vision and large language model.

Liang Lin (Fellow, IEEE) is a Full Professor of computer science at Sun Yat-sen University. He served as the Executive Director and Distinguished Scientist of SenseTime Group from 2016 to 2018, leading the R&D teams for cutting-edge technology transferring. He has authored or co-authored more than 200 papers in leading academic journals and conferences, and his papers have been cited more than 26,000 times. He is an associate editor of IEEE Trans. Neural Networks and Learning Systems and IEEE Trans. Multimedia, and served as Area Chair for numerous conferences such as CVPR, ICCV, SIGKDD and AAAI. He is the recipient of numerous awards and honors including the Wu Wen-Jun Artificial Intelligence Award, the First Prize of China Society of Image and Graphics, ICCV Best Paper Nomination in 2019, Annual Best Paper Award by Pattern Recognition (Elsevier) in 2018, Best Paper Diamond Award at IEEE ICME 2017, and Google Faculty Award in 2012. His supervised PhD students received the ACM China Doctoral Dissertation Award, CCF Best Doctoral Dissertation and CAAI Best Doctoral Dissertation. He is a Fellow of IEEE/IAPR/IET.

Yukai Shi received the Ph.D. degrees from the school of Data and Computer Science, Sun Yat-sen University, Guangzhou China, in 2019. He is currently an associate professor at the School of Information Engineering, Guangdong University of Technology, China. His research interests include computer vision and machine learning.

References↩︎

[1]
Y. Luo, X. Li, and S. Chen, “Spatial-temporal aware-based unsupervised network for infrared small target detection,” IEEE Transactions on Multimedia, 2025.
[2]
F. Lin, S. Ge, K. Bao, C. Yan, and D. Zeng, “Learning shape-biased representations for infrared small target detection,” IEEE Transactions on Multimedia, vol. 26, 2024.
[3]
L. Chen, J. Liu, W. Chen, and B. Du, “A glrt-based multi-pixel target detector in hyperspectral imagery,” IEEE Transactions on Multimedia, vol. 25, 2023.
[4]
Z. Weng, X. Liu, C. Liu, X. Guo, Y. Shi, and L. Lin, “Dronesr: Rethinking few-shot thermal image super-resolution from drone-based perspective,” IEEE Sensors Journal, 2025.
[5]
M. Zhao, W. Li, L. Li, J. Hu, P. Ma, and R. Tao, “Single-frame infrared small-target detection: A survey,” IEEE Geoscience and Remote Sensing Magazine, vol. 10, no. 2, pp. 87–119, 2022.
[6]
H. Wang, L. Zhou, and L. Wang, “Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8509–8518.
[7]
M. Federici, R. Tomioka, and P. Forré, “An information-theoretic approach to distribution shifts,” Advances in Neural Information Processing Systems, vol. 34, pp. 17 628–17 641, 2021.
[8]
S. Wang, A. Liu, and B. A. Plummer, “Noise-aware generalization: Robustness to in-domain noise and out-of-domain generalization,” arXiv preprint arXiv:2504.02996, 2025.
[9]
H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels with deep neural networks: A survey,” IEEE transactions on neural networks and learning systems, vol. 34, no. 11, pp. 8135–8153, 2022.
[10]
C. Shi, L. Fang, H. Wu, X. Xian, Y. Shi, and L. Lin, “Nitedr: Nighttime image de-raining with cross-view sensor cooperative learning for dynamic driving scenes,” IEEE Transactions on Multimedia, vol. 26, pp. 9203–9215, 2024.
[11]
Q. V. Le, A. J. Smola, and S. Canu, “Heteroscedastic gaussian process regression,” in Proceedings of the 22nd international conference on Machine learning, 2005, pp. 489–496.
[12]
Y. Shi, C. Shi, Z. Weng, Y. Tian, X. Xian, and L. Lin, “Crossfuse: Learning infrared and visible image fusion by cross-sensor top-k vision alignment and beyond,” IEEE Transactions on Circuits and Systems for Video Technology, 2025.
[13]
C. P. Chen, H. Li, Y. Wei, T. Xia, and Y. Y. Tang, “A local contrast method for small infrared target detection,” IEEE transactions on geoscience and remote sensing, vol. 52, no. 1, pp. 574–581, 2013.
[14]
S. Meng, C. Zhang, Q. Shi, Z. Chen, W. Hu, and F. Lu, “A robust infrared small target detection method jointing multiple information and noise prediction: Algorithm and benchmark,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–17, 2023.
[15]
A. Pal and J. Sulam, “Understanding noise-augmented training for randomized smoothing,” arXiv preprint arXiv:2305.04746, 2023.
[16]
K. Nishi, Y. Ding, A. Rich, and T. Hollerer, “Augmentation strategies for learning with noisy labels,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8022–8031.
[17]
Y. Lu, Y. Lin, H. Wu, X. Xian, Y. Shi, and L. Lin, “Sirst-5k: Exploring massive negatives synthesis with self-supervised learning for robust infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–11, 2024.
[18]
A. C. Tsoi and A. Back, “Static and dynamic preprocessing methods in neural networks,” Engineering Applications of Artificial Intelligence, vol. 8, no. 6, pp. 633–642, 1995.
[19]
T. Xi, L. Yuan, and Q. Sun, “A combined approach to infrared small-target detection with the alternating direction method of multipliers and an improved top-hat transformation,” Sensors, vol. 22, no. 19, p. 7327, 2022.
[20]
S. Xu, S. Zheng, W. Xu, R. Xu, C. Wang, J. Zhang, X. Teng, A. Li, and L. Guo, “Hcf-net: Hierarchical context fusion network for infrared small object detection,” in 2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2024, pp. 1–6.
[21]
F. Lin, S. Ge, K. Bao, C. Yan, and D. Zeng, “Learning shape-biased representations for infrared small target detection,” IEEE Transactions on Multimedia, vol. 26, pp. 4681–4692, 2023.
[22]
S. Yuan, H. Qin, X. Yan, N. Akhtar, and A. Mian, “Sctransnet: Spatial-channel cross transformer network for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
[23]
G. Zhang, T. Zhou, and Y. Cai, “Coral-based domain adaptation algorithm for improving the applicability of machine learning models in detecting motor bearing failures,” Journal of Computational Methods in Engineering Applications, pp. 1–17, 2023.
[24]
Z. Xu, D. Liu, J. Yang, C. Raffel, and M. Niethammer, “Robust and generalizable visual representation learning via random convolutions,” arXiv preprint arXiv:2007.13003, 2020.
[25]
S. Choi, D. Das, S. Choi, S. Yang, H. Park, and S. Yun, “Progressive random convolutions for single domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 312–10 322.
[26]
K. Liao, Z. Yue, Z. Wang, and C. C. Loy, “Denoising as adaptation: Noise-space domain adaptation for image restoration,” arXiv preprint arXiv:2406.18516, 2024.
[27]
Z. Han, X.-J. Gui, C. Cui, and Y. Yin, “Towards accurate and robust domain adaptation under noisy environments,” arXiv preprint arXiv:2004.12529, 2020.
[28]
S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 11, no. 7, pp. 674–693, 1989.
[29]
X. Bai and F. Zhou, “Analysis of new top-hat transformation and the application for infrared dim small target detection,” Pattern Recognition, vol. 43, no. 6, pp. 2145–2156, 2010.
[30]
L. Deng, J. Zhang, G. Xu, and H. Zhu, “Infrared small target detection via adaptive m-estimator ring top-hat transformation,” Pattern Recognition, vol. 112, p. 107729, 2021.
[31]
C. Gao, D. Meng, Y. Yang, Y. Wang, X. Zhou, and A. G. Hauptmann, “Infrared patch-image model for small target detection in a single image,” IEEE transactions on image processing, vol. 22, no. 12, pp. 4996–5009, 2013.
[32]
B. Li, C. Xiao, L. Wang, Y. Wang, Z. Lin, M. Li, W. An, and Y. Guo, “Dense nested attention network for infrared small target detection,” IEEE Transactions on Image Processing, vol. 32, pp. 1745–1758, 2022.
[33]
L. Gao, H.-M. Hu, X. Xue, and H. Hu, “From appearance to inherence: A hyperspectral image dataset and benchmark of material classification for surveillance,” IEEE Transactions on Multimedia, vol. 26, 2024.
[34]
A. Buades, B. Coll, and J.-M. Morel, “A review of image denoising algorithms, with a new one,” Multiscale modeling & simulation, vol. 4, no. 2, pp. 490–530, 2005.
[35]
Y.-M. Baek, J.-G. Kim, D.-C. Cho, J.-A. Lee, and W.-Y. Kim, “Integrated noise modeling for image sensor using bayer domain images,” in International Conference on Computer Vision/Computer Graphics Collaboration Techniques and Applications. Springer, 2009, pp. 413–424.
[36]
Y. Zhang, Y. Zhang, Z. Shi, R. Fu, D. Liu, Y. Zhang, and J. Du, “Enhanced cross-domain dim and small infrared target detection via content-decoupled feature alignment,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
[37]
H. Guan and M. Liu, “Domain adaptation for medical image analysis: a survey,” IEEE Transactions on Biomedical Engineering, vol. 69, no. 3, pp. 1173–1185, 2021.
[38]
S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[39]
B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016.
[40]
Y. Fang, P.-T. Yap, W. Lin, H. Zhu, and M. Liu, “Source-free unsupervised domain adaptation: A survey,” Neural Networks, vol. 174, p. 106230, 2024.
[41]
Y. Zhang, “A survey of unsupervised domain adaptation for visual recognition,” arXiv preprint arXiv:2112.06745, 2021.
[42]
Y. Dai, Y. Wu, F. Zhou, and K. Barnard, “Asymmetric contextual modulation for infrared small target detection,” pp. 950–959, 2021.
[43]
——, “Attentional local contrast networks for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 11, pp. 9813–9824, 2021.
[44]
H. Sun, J. Bai, F. Yang, and X. Bai, “Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset irdst,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023.
[45]
Q. Hou, L. Zhang, F. Tan, Y. Xi, H. Zheng, and N. Li, “Istdu-net: Infrared small-target detection u-net,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
[46]
X. Wu, D. Hong, and J. Chanussot, “Uiu-net: U-net in u-net for infrared small object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 364–376, 2022.
[47]
H. Yue, C. Cao, L. Liao, R. Chu, and J. Yang, “Supervised raw video denoising with a benchmark dataset on dynamic scenes,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2301–2310.
[48]
K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” pp. 1026–1034, 2015.
[49]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[50]
Y. Dai, Y. Wu, F. Zhou, and K. Barnard, “Asymmetric contextual modulation for infrared small target detection,” pp. 950–959, 2021.
[51]
M. Zhang, R. Zhang, Y. Yang, H. Bai, J. Zhang, and J. Guo, “Isnet: Shape matters for infrared small target detection,” pp. 877–886, 2022.

  1. Y. Li, Y. Lu, H. Wu and Y. Shi are with School of Information Engineering, Guangdong University of Technology, Guangzhou, 510006, China (email: liyuehui77161@gmail.com; 2112303120@mail2.gdut.edu.cn; bridgesness@gmail.com; ykshi@gdut.edu.cn)

    S. Zhang is with TikTok, ByteDance Inc, Sydney, NSW 2000, Australia (email: senzhang.thu10@gmail.com).

    L. Lin is with School of Data and Computer Science, Sun Yat-sen University, Guangzhou, 510006, China (email: linliang@ieee.org).↩︎