Transformer-Driven Active Transfer Learning for Cross-Hyperspectral Image Classification


Abstract

Hyperspectral image (HSI) classification presents inherent challenges due to high spectral dimensionality, significant domain shifts, and limited availability of labeled data. To address these issues, we propose a novel Active Transfer Learning (ATL) framework built upon a Spatial-Spectral Transformer (SST) backbone. The framework integrates multistage transfer learning with an uncertainty-diversity-driven active learning mechanism that strategically selects highly informative and diverse samples for annotation, thereby significantly reducing labeling costs and mitigating sample redundancy. A dynamic layer freezing strategy is introduced to enhance transferability and computational efficiency, enabling selective adaptation of model layers based on domain shift characteristics. Furthermore, we incorporate a self-calibrated attention mechanism that dynamically refines spatial and spectral weights during adaptation, guided by uncertainty-aware feedback. A diversity-promoting sampling strategy ensures broad spectral coverage among selected samples, preventing overfitting to specific classes. Extensive experiments on benchmark cross-domain HSI datasets demonstrate that the proposed SST-ATL framework achieves superior classification performance compared to conventional approaches. The source code is publicly available at https://github.com/mahmad000/ATL-SST.

Ahmad et al.

Keywords: Hyperspectral Image Classification; Active Learning; Hybrid Query Function; Transfer Learning; Spatial-Spectral Transformer.

1 Introduction

Hyperspectral images (HSIs) capture detailed spectral signatures across hundreds of contiguous and narrow wavelength bands, offering rich information that enables accurate classification and identification of diverse materials and land cover types [1]. This fine spectral resolution has established HSIs as a vital tool across numerous domains, including remote sensing [2], environmental monitoring [3], food quality and safety inspection [4][6], forensic analysis [7], and biomedical imaging [8], [9]. However, despite these widespread applications, achieving robust HSI classification remains a formidable challenge. The high dimensionality of spectral data, the limited availability of labeled samples, and significant spectral-spatial variability due to varying acquisition conditions collectively hinder generalization performance [10], [11]. These issues become particularly pronounced in cross-domain classification tasks, where domain shifts between datasets severely limit model transferability.

In recent years, transformer-based architectures have garnered increasing interest for HSI classification [12][14], owing to their powerful self-attention mechanisms that capture long-range spectral and spatial dependencies. This capability leads to richer and more discriminative feature representations [15], [16]. Spatial-Spectral Transformers (SSTs), in particular, have emerged as a promising alternative to traditional convolutional neural networks (CNNs). SSTs can model intricate spectral-spatial interactions by operating on image patches, and inherently scale well to high-resolution HSI data [17], [18]. Unlike CNNs, SSTs eliminate the need for handcrafted feature extractors or complex pooling strategies by learning hierarchical representations directly from raw pixels [19]. Furthermore, their interpretability via attention maps enhances trust in model decisions by highlighting salient regions that drive classification outcomes [20], [21].

Building on this foundation, several SST-based models have been proposed to better harness the spectral and spatial properties of HSIs. For instance, Shi et al. [22] introduced a dual-branch transformer architecture that separately captures spectral and spatial dependencies, allowing for more effective multi-scale contextual learning. Nevertheless, deploying SSTs for large-scale HSI tasks presents several challenges. First, the self-attention mechanism incurs quadratic computational complexity with respect to sequence length, posing scalability concerns for longer inputs [23]. Second, unlike CNNs, SSTs lack built-in translation invariance, which may affect the consistency of spatial feature extraction [24]. Third, tokenizing HSIs into fixed-size patches can result in the loss of fine-grained information, particularly in small or irregular structures. Moreover, SSTs typically rely on large amounts of labeled data to achieve strong generalization, making them vulnerable to overfitting under low-label regimes [25]. Given the high cost and effort involved in acquiring labeled HSI samples [26], [27], this reliance on fully supervised learning is often impractical. Consequently, strategies that leverage both labeled and unlabeled data, such as semi-supervised and active learning (AL), have become increasingly important for reducing annotation demands while maintaining classification accuracy [28], [29].

AL, in particular, has gained significant traction in the HSI community as a means of selectively labeling the most informative samples [30], [31]. A number of effective approaches have been introduced to address this challenge [32]. Liu et al. [33] proposed MDL4OW, a multitask framework that combines classification and reconstruction to detect unknown classes via reconstruction errors. Liao et al. [34] introduced CGE-AL, a graph-embedding-based AL method that optimizes sample selection using graph CNNs and uncertainty estimation. Zhao et al. [35] designed MAT-ASSAL, which integrates a multi-attention transformer with adaptive superpixel segmentation to preserve spatial structures during query selection. Similarly, Wang et al. [36] presented a co-auxiliary learning strategy, coupling pseudo-labeling with diverse sample selection to ensure both uncertainty and representativeness. Liu et al. [37] further advanced the field by introducing an adversarial domain alignment framework that incorporates contrastive learning, enhancing robustness and cross-domain generalization under few-shot conditions.

Despite these promising directions, deep AL methods still face key limitations when applied to HSIs. The inherent scarcity of labeled data constrains the diversity of queried samples, often resulting in suboptimal generalization [38]. Additionally, complex spectral-spatial correlations present in HSIs are not always effectively captured by conventional AL techniques [39]. To address these gaps, the research community has started to explore active transfer learning (ATL) frameworks that combine the strengths of AL and transfer learning. In such frameworks, knowledge learned from well-labeled source domains can be transferred to improve sample selection and feature extraction in sparsely labeled target domains. This approach is especially valuable in dynamic remote sensing scenarios, where spectral characteristics can vary substantially over time and location. By enabling domain adaptation with minimal retraining, ATL offers a viable pathway to enhance HSI classification with limited annotation overhead.

Motivated by these insights, this work introduces an SST-based ATL framework (SST-ATL) specifically designed for cross-domain HSI classification. The proposed SST-ATL framework directly tackles critical issues such as limited label availability, spectral-spatial heterogeneity, and domain shifts, providing a comprehensive and adaptive solution. The key contributions of this work are summarized below:

  • We propose SST-ATL, a unified ATL framework for HSI classification that integrates an SST with a multistage fine-tuning strategy and a dynamic layer freezing mechanism. This design allows for efficient cross-domain adaptation under annotation scarcity by selectively updating model layers based on domain shift characteristics, thereby enhancing generalization while reducing computational overhead.

  • To further improve robustness against spectral variability, we introduce a novel self-calibrated attention refinement module that dynamically adjusts spatial and spectral attention weights using uncertainty-guided feedback.

  • We develop a hybrid query strategy that jointly optimizes informativeness, diversity, and spectral coverage through a combined uncertainty-diversity sampling criterion and a spectral diversity-enforcing selection scheme.

  • Extensive evaluations across six standard HSI benchmarks demonstrate that SST-ATL consistently outperforms CNN-, Transformer-, and state-space model-based baselines by +2.5% to +3.8% OA under constrained annotation budgets. The complete implementation is publicly available at https://github.com/mahmad000/ATL-SST.

2 Related Work

Recent advancements in HSI classification have explored several paradigms, including transformer-based models, AL, and their integration into ATL. We categorize and review the literature under these themes. Table 1 summarizes each method’s key contributions, limitations, and the improvements offered by our proposed SST-ATL framework.

2.1 Transformer-Based HSI Classification

Transformer-based models, particularly SSTs, have shown exceptional capacity in modeling long-range spectral-spatial dependencies. S2FTNet [40] introduces a dual-branch SST using separate spectral and spatial attention modules, achieving 3–5% gains over CNNs. Zhang et al. [41] propose ELS2T, a resource-efficient SST variant with reduced embedding dimensions. Zhang et al. [42] further design a multi-range SST that attends to spectral-spatial intervals at multiple scales. Ahmad et al. [43] tackle spectral redundancy using a wavelet-based compression (WaveFormer), which reduces FLOPs while preserving spectral detail. Despite these innovations, most transformer-based methods require extensive labeled data and struggle with domain shifts. These limitations motivate the integration of adaptive learning strategies such as AL and ATL to improve generalization under low-label regimes.

Table 1: Comparative Summary of Recent HSI Classification Approaches.
Method Key Contributions Limitations Proposed SST-ATL
Shi et al. [22] Dual-branch SST for decoupled spectral-spatial attention and multi-scale context modeling Quadratic attention cost, lack of translation invariance, and overfitting risk under low labels Lightweight SST, dynamic freezing for low-label adaptation, and computationally efficient
Zhang et al. [42] Multi-range attention with heads attending to different spectral-spatial scales Fusion overhead and redundant computation across branches Self-calibrated attention mechanism with uncertainty guidance optimizes attention allocation
Ahmad et al. [43] Wavelet-based downsampling before SST to reduce spectral redundancy Optimal wavelet scale selection critical; limited adaptivity to scene heterogeneity Integrates dynamic spectral diversity sampling with reversible compression
Liao et al. [34] Graph embedding-based AL with class-wise sample selection via uncertainty in graph parameters Graph construction is expensive and scales poorly; lacks spectral adaptivity SST-ATL avoids explicit graphs and embeds spectral–spatial diversity in query mechanism
Zhao et al. [35] Multi-attention transformer with adaptive superpixels for active querying Relies heavily on superpixel quality; segmentation biases impact results Patch-free querying driven by calibrated attention and uncertainty
Wang et al. [36] Co-auxiliary learning with pseudo-labeling and uncertainty-based diversity Requires training auxiliary models; complexity grows with iterations No auxiliary network needed; SST-ATL optimizes a unified loss with query feedback
Liu et al. [37] Adversarial domain alignment using contrastive loss to enhance cross-domain generalization Training instability due to adversarial components; contrastive loss tuning is sensitive Domain adaptation via dynamic freezing and progressive fine-tuning improves robustness
Di et al. [44] Dual-module AL with adversarial and inter-class uncertainty sampling Increased training complexity and instability in contrastive pairing under limited labels Simpler and stable sample selection via joint uncertainty–diversity strategy
Zhang et al. [41] Efficient SST (ELS2T) with reduced heads and embeddings for resource-limited settings Reduced capacity on complex datasets; performance drops in high-class overlap SST-ATL retains model expressivity while compressing via reversible transforms
Deng et al. [45] First ATL framework using sparse autoencoders and active querying Based on outdated autoencoder architecture; lacks attention-based adaptability Transformer-based ATL with uncertainty feedback and fine-grained spectral focus
Lin et al. [46] AL-driven transfer learning without prior target knowledge Evaluation limited to three datasets; weak robustness to large domain shifts Evaluated across diverse benchmarks; freezing strategy tailored to shift magnitude
Ahmad et al. [47] Deep AL with 3D CNN and query function using fuzziness, Breaking ties, and MI Computational cost of 3D CNNs high; lacks long-range dependency modeling SST-ATL models long-range features with fewer parameters using hierarchical attention
Yang et al. [48] Conceptual integration of AL and transfer learning for label efficiency High-level design; lacks architectural or query function details SST-ATL provides end-to-end framework with defined transfer, freezing, and query modules
Cao et al. [49] CNN + AL with MRF smoothing for better spatial consistency Lacks global attention modeling and depends on fixed post-processing Self-attention offers adaptive spatial consistency without external smoothers
He et al. [50] Heterogeneous transfer learning from RGB to HSI with mapping and attention Pretrained RGB CNNs underutilize spectral richness; weak generalization under large shifts SST-ATL learns spectral-attentive features directly from HSI with built-in domain adaptation

2.2 Active Learning Strategies for HSI

AL has been leveraged to reduce annotation costs by prioritizing informative sample selection. Zhao et al. [35] use adaptive superpixel segmentation with a multi-attention transformer for sample querying. Liao et al. [34] propose class-wise graph-based uncertainty sampling via graph embeddings, though with high computational costs. Wang et al. [36] combine pseudo-labeling with co-auxiliary learning to improve sample representativeness. Di et al. [44] propose ALSN, an adversarially trained Siamese network employing dual uncertainty criteria (AUAL, ICUAL), showing strong label efficiency but with increased model complexity. Cao et al. [49] integrate Markov Random Field smoothing with CNN-based AL to improve spatial coherence, while Ahmad et al. [47] adopt multiple sample selection criteria, including fuzziness, mutual information, and breaking ties in a 3D CNN setting. These methods, while effective, often rely on heuristic sampling or are computationally intensive.

2.3 Active Transfer Learning for HSI

The fusion of AL and transfer learning forms ATL frameworks, aimed at learning transferable representations under limited labels. Deng et al. [45] were among the first to propose ATL for HSI using stacked sparse autoencoders. Lin et al. [46] extend this by introducing an AL-driven transfer learning pipeline that does not require prior knowledge of the target domain. Yang et al. [48] provide a conceptual overview of ATL for HSI, but lack architectural detail. He et al. [50] focus on heterogeneous ATL from RGB to HSI using pretrained CNNs, but struggle with spectral misalignment. Our previous work [47] shows that ATL with 3D CNNs can achieve strong results, but at the cost of computational complexity and limited domain adaptation flexibility.

In contrast, our SST-ATL framework offers a principled fusion of SSTs, AL, and ATL by incorporating dynamic layer freezing, uncertainty-calibrated attention, and hybrid uncertainty-diversity querying, providing a robust and computationally efficient solution to cross-domain HSI classification.

Figure 1: Overview of the proposed ATL-SST framework for HSI classification. The model first trains an SST encoder by extracting 3D patch embeddings and learning spatial-spectral representations via multi-head self-attention (MHSA). AL iteratively refines the training set by querying the most informative samples based on uncertainty and diversity criteria, while dynamic layer freezing and self-calibrated attention refinement enhance model generalization and robustness. A fine-tuned SST model is evaluated to achieve efficient and accurate cross-dataset classification with minimal labeled data.

3 Proposed Methodology

Given an HSI cube \(\mathbf{X} \in \mathbb{R}^{M \times N \times k}\), where \(M\) and \(N\) are the spatial dimensions and \(k\) is the number of spectral bands, we first extract overlapping spatial windows of size \(W \times W\) from \(\mathbf{X}\). These windows provide localized spatial-spectral context for feature encoding via a transformer-based model. Each window is further divided into smaller subpatches used for embedding.

3.1 Spatial-Spectral Transformer (SST)

The SST backbone comprises four stages: patch embedding, positional encoding, self-attention-based encoder blocks, and the final classification head.

3.1.1 Patch Embedding

Each spatial window of size \(W \times W \times k\) is processed by a 3D convolutional layer to extract \(p \times p\) subpatches. Let \(\mathbf{W}_e\) be the learnable convolutional kernel bank. The \(i\)-th embedded subpatch \(\mathbf{P}_i\) is computed as:

\[\mathbf{P}_i(u,v,c) = \sum_{m=0}^{p-1} \sum_{n=0}^{p-1} \sum_{q=0}^{k-1} \mathbf{X}(s_u+m, s_v+n, q) \cdot \mathbf{W}_e(m,n,q,c)\] where \((s_u, s_v)\) are the coordinates of the \(i\)-th subpatch, \(p\) is the subpatch size, \(c\) indexes output channels, and \(N_p = \left(\frac{W}{p}\right)^2\) is the number of subpatches in a window. These subpatches are projected into a \(d\)-dimensional feature space.
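For concreteness, this step can be realized as a strided 3D convolution whose kernel spans all \(k\) bands and a \(p \times p\) spatial subpatch. The sketch below is a minimal PyTorch rendering of the equation above; the module name `PatchEmbed3D` and the framework choice are illustrative assumptions, not taken from the released implementation.

```python
# A minimal sketch of the 3D-convolutional patch embedding, assuming PyTorch.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    def __init__(self, bands: int, patch: int, d_model: int):
        super().__init__()
        # One kernel covers all k bands and a p x p spatial subpatch, so each
        # output position corresponds to one embedded subpatch P_i.
        self.proj = nn.Conv3d(1, d_model, kernel_size=(bands, patch, patch),
                              stride=(bands, patch, patch))

    def forward(self, x):                          # x: (B, W, W, k) window
        x = x.permute(0, 3, 1, 2).unsqueeze(1)     # -> (B, 1, k, W, W)
        x = self.proj(x)                           # -> (B, d, 1, W/p, W/p)
        return x.flatten(2).transpose(1, 2)        # -> (B, N_p, d) tokens

# Example: an 8x8 window with 103 bands and 4x4 subpatches gives N_p = 4 tokens.
tokens = PatchEmbed3D(bands=103, patch=4, d_model=64)(torch.randn(2, 8, 8, 103))
```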

3.1.2 Positional Encoding

To inject spatial information into the embeddings, a sinusoidal positional encoding \(\mathbf{E}_{\text{pos}}\) is added:

\[\begin{align} \mathbf{E}_{\text{pos}}(i, 2j) &= \sin\left(\frac{i}{10000^{2j/d}}\right) \\ \mathbf{E}_{\text{pos}}(i, 2j+1) &= \cos\left(\frac{i}{10000^{2j/d}}\right) \end{align}\]

The encoded patch sequence is \(\mathbf{Z} = \mathbf{P} + \mathbf{E}_{\text{pos}}\).
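As a quick reference, the encoding can be generated in a few lines of NumPy; this is a direct transcription of the two equations above and assumes an even embedding dimension \(d\).

```python
import numpy as np

def sinusoidal_pe(n_tokens: int, d: int) -> np.ndarray:
    """Sinusoidal positional encoding matching the equations above (even d)."""
    pos = np.arange(n_tokens)[:, None]      # token index i
    j = np.arange(0, d, 2)[None, :]         # even feature index 2j
    angle = pos / (10000 ** (j / d))
    pe = np.zeros((n_tokens, d))
    pe[:, 0::2] = np.sin(angle)             # E_pos(i, 2j)
    pe[:, 1::2] = np.cos(angle)             # E_pos(i, 2j+1)
    return pe                               # added to P to give Z
```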

3.1.3 Transformer Encoder Block

Each block applies multi-head self-attention followed by feed-forward processing:

\[\begin{align} \text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= \text{Concat}(\text{head}_1, \dots, \text{head}_h)\mathbf{W}_o \\ \text{head}_i &= \text{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) \\ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= \text{Softmax}\left(\frac{\mathbf{QK}^T}{\sqrt{d_k}}\right)\mathbf{V} \end{align}\] where \(\mathbf{Q}_i = \mathbf{Z}_i \mathbf{W}_Q\), \(\mathbf{K}_i = \mathbf{Z}_i \mathbf{W}_K\), \(\mathbf{V}_i = \mathbf{Z}_i \mathbf{W}_V\), and \(d_k\) is the attention key dimension. The output of one encoder block is:

\[\begin{align} \mathbf{Z}' &= \text{LayerNorm}(\mathbf{Z} + \text{Dropout}(\text{Attention}(\mathbf{Z}))) \\ \mathbf{Z}'' &= \text{LayerNorm}(\mathbf{Z}' + \text{Dropout}(\mathbf{F}(\mathbf{Z}'))) \end{align}\] where \(\mathbf{F}\) is a feedforward network:

\[\mathbf{F}(\mathbf{Z}_i) = \text{ReLU}(\mathbf{Z}_i \mathbf{W}_1 + b_1)\mathbf{W}_2 + b_2\]
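Putting the pieces together, one encoder block maps \(\mathbf{Z} \to \mathbf{Z}' \to \mathbf{Z}''\). The following minimal PyTorch sketch mirrors the two residual equations above using standard MHSA (the uncertainty calibration described next is applied separately); the default hyperparameter values are our assumptions.

```python
# A minimal sketch of one SST encoder block (Z -> Z' -> Z''), assuming PyTorch.
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d: int, heads: int, d_ff: int, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, dropout=p_drop,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d))          # F(Z')
        self.norm1 = nn.LayerNorm(d, eps=1e-6)
        self.norm2 = nn.LayerNorm(d, eps=1e-6)
        self.drop = nn.Dropout(p_drop)

    def forward(self, z):                                     # z: (B, N_p, d)
        z = self.norm1(z + self.drop(self.attn(z, z, z)[0]))  # Z'
        return self.norm2(z + self.drop(self.ffn(z)))         # Z''
```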

To improve robustness against domain-specific spectral distortions, we introduce uncertainty-aware attention scaling. Let \(\mathcal{U}_i\) be the entropy-based uncertainty for the \(i\)-th token. The calibrated attention is given by:

\[\mathbf{A}_{\text{cal}} = \text{Softmax}\left(\frac{\mathbf{QK}^T}{\sqrt{d_k}}\right) \cdot (1 + \lambda \cdot \mathcal{U})\] where \(\lambda\) is a hyperparameter controlling uncertainty influence. This guides the model to focus more on spatial-spectral tokens with higher ambiguity, leading to better adaptation. To inject global context, a learnable token \(\mathbf{Q}_c \in \mathbb{R}^d\) attends to the encoded patches:

\[\mathbf{A}_{\text{cross}} = \text{Softmax}\left(\frac{\mathbf{Q}_c \mathbf{K}_c^T}{\sqrt{d}}\right) \mathbf{V}_c\]
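A minimal NumPy sketch of the calibrated attention \(\mathbf{A}_{\text{cal}}\) is given below; the per-token uncertainty vector \(\mathcal{U}\) and the default \(\lambda = 0.1\) are assumed inputs. Note that, matching the equation as written, the calibrated weights are no longer row-normalized after scaling.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def calibrated_attention(Q, K, V, token_uncertainty, lam=0.1):
    """Uncertainty-aware attention scaling (sketch of the A_cal equation).
    token_uncertainty: per-token entropy U_i, shape (n,); lam is the
    hyperparameter lambda from the text (0.1 is an assumed default)."""
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k))           # standard attention weights
    attn = attn * (1.0 + lam * token_uncertainty)    # broadcast over columns:
    return attn @ V                                  # upweights ambiguous tokens
```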

3.1.4 Classification Head

The aggregated features pass through a two-layer MLP with softmax:

\[\begin{align} \mathbf{O} &= \text{ReLU}(\mathbf{Z}'' \mathbf{W}_3 + b_3) \\ \mathbf{O}_{\text{final}} &= \text{Softmax}(\mathbf{O} \mathbf{W}_4 + b_4) \end{align}\] where \(\mathbf{O}_{\text{final}} \in \mathbb{R}^C\) and \(C\) is the number of classes.

3.2 Active Learning

3.2.1 Hybrid Query Strategy

We define \(X_{\text{pool}}\) as the pool of unlabeled candidate samples and \(\text{query\_size}\) as the number of samples to be queried in each AL round. Our strategy combines:

  • Uncertainty Sampling: \[\mathcal{U}(\mathbf{x}_i) = -\max(p(y|\mathbf{x}_i))\]

  • Diversity Sampling: the queried batch \(S^*\) is the subset that maximizes average pairwise spectral distance: \[S^* = \mathop{\mathrm{arg\,max}}_{S \subseteq X_{\text{pool}},\, |S| = \text{query\_size}} \text{Diversity}(S)\] where \[\text{Diversity}(S) = \frac{1}{m(m-1)} \sum_{j=1}^{m} \sum_{\substack{k=1\\k \ne j}}^{m} d_{jk}\] and \(d_{jk}\) is the Euclidean distance between pixel spectra in local neighborhoods of size \(n_{\text{neighborhood}}^2\).

3.2.2 Query Index Selection

For each candidate \(\mathbf{x}_{(h,w)}\), we compute:

\[\text{diversity\_metric}(h,w) = \frac{1}{m(m-1)} \sum_{j \ne k} d_{jk}\]

The top-\(\text{query\_size}\) entries with the highest diversity are selected: \[\text{query\_indices} = \left\{ \pi(k) \mid k \in [HW - \text{query\_size} + 1, HW] \right\}\] where \(\pi = \text{argsort}(\text{diversity\_metric})\) orders the \(HW\) candidates by ascending diversity.
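The following NumPy sketch combines the two criteria into one query step. The text does not give an explicit formula for fusing uncertainty and diversity, so the convex weight `alpha` and the min-max normalization are our assumptions; the per-candidate diversity metric and the final argsort-based selection follow the equations above.

```python
import numpy as np

def hybrid_query(probs, spectra, neighbors, query_size, alpha=0.5):
    """Sketch of the hybrid uncertainty-diversity query.
    probs: (n, C) softmax outputs; spectra: (n, k) pixel spectra;
    neighbors: per-candidate index arrays for the local n x n neighborhood."""
    uncertainty = -probs.max(axis=1)              # U(x_i) = -max p(y|x_i)
    diversity = np.empty(len(spectra))
    for i, idx in enumerate(neighbors):           # mean pairwise distance d_jk
        nb = spectra[idx]                         # assumes >= 2 neighbors
        d = np.linalg.norm(nb[:, None] - nb[None, :], axis=-1)
        m = len(nb)
        diversity[i] = d.sum() / (m * (m - 1))
    # Min-max normalize so the two criteria are comparable (our assumption).
    u = (uncertainty - uncertainty.min()) / (np.ptp(uncertainty) + 1e-12)
    v = (diversity - diversity.min()) / (np.ptp(diversity) + 1e-12)
    score = alpha * u + (1 - alpha) * v
    return np.argsort(score)[-query_size:]        # top-query_size indices
```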

3.3 Dynamic Layer Freezing

To facilitate efficient domain adaptation, we introduce a dynamic layer freezing policy. During transfer to a new target domain, SST layers are selectively frozen or unfrozen based on sensitivity to domain shift. We estimate shift magnitude using the maximum mean discrepancy (MMD) of intermediate features. Layers with low variance across domains are frozen, reducing computational cost while retaining useful knowledge.
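A minimal sketch of this policy is shown below, using a linear-kernel MMD estimator as a stand-in for the feature-shift measure; the exact kernel and the freezing threshold are not specified in the text and are assumptions here.

```python
import numpy as np

def mmd_linear(fs, ft):
    """Linear-kernel MMD between source/target feature batches of shape (n, d)."""
    return float(np.sum((fs.mean(axis=0) - ft.mean(axis=0)) ** 2))

def freeze_plan(layer_feats_src, layer_feats_tgt, threshold=0.05):
    """Layers whose intermediate features shift little across domains
    (low MMD) are frozen; the rest are fine-tuned. Threshold is assumed."""
    plan = {}
    for name in layer_feats_src:
        shift = mmd_linear(layer_feats_src[name], layer_feats_tgt[name])
        plan[name] = "freeze" if shift < threshold else "fine-tune"
    return plan
```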

3.4 Model Training and Evaluation

The model is trained using categorical cross-entropy:

\[\mathcal{L} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})\]

Evaluation metrics include OA, AA, and the kappa coefficient:

\[\kappa = \frac{p_o - p_e}{1 - p_e}\] where \(p_o\) is the observed agreement and \(p_e\) is expected agreement by chance.
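For completeness, \(\kappa\) can be computed from a confusion matrix as follows (a direct NumPy transcription of the formula above):

```python
import numpy as np

def cohen_kappa(conf):
    """Kappa coefficient from a C x C confusion matrix."""
    conf = conf.astype(float)
    n = conf.sum()
    p_o = np.trace(conf) / n                          # observed agreement
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)
```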

4 Experimental Settings and Datasets

The experimental configuration for SST-ATL is designed to ensure clarity, reproducibility, and robust performance evaluation. Following the preprocessing of the HSI cubes and corresponding ground truths, patch-based datasets are generated. Each HSI is divided into overlapping spatial windows of size \(8 \times 8\) (i.e., input patch size), which serve as local contextual inputs for the SST backbone. The data are partitioned into training, pool, and test sets with a split ratio of 1%, 49%, and 50%, respectively. Specifically, 1% of the samples are randomly selected for initial training, 49% are allocated to the pool for AL queries, and 50% are reserved exclusively for testing. For diversity-based query selection, a neighborhood size of \(n_{\text{neighborhood}} = 3\) is used, resulting in \(9\) spectral vectors per neighborhood to compute diversity scores, and the query percentage is set to 0.02%.

Model training is conducted under a set of carefully tuned hyperparameters. The learning rate is initialized at 0.001 with a decay factor of \(1 \times 10^{-6}\), and the Adam optimizer is employed to facilitate convergence. Training proceeds for 50 epochs with a batch size of 56. The backbone architecture comprises four transformer layers, each featuring eight attention heads. The model dimensionality (\(d_{\text{model}}\)) is set to 64, and the feedforward network dimensionality is configured as \(4 \times 64 = 256\). A dropout rate of 0.1 is applied to mitigate overfitting, while layer normalization is stabilized with an epsilon value of \(1 \times 10^{-6}\).
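For reproducibility, the settings stated in this section can be collected into a single configuration, e.g. (key names are ours; values follow the text):

```python
# Experimental settings from Section 4 gathered into one config dict
# (key names are illustrative; values are those stated in the text).
CONFIG = {
    "patch_size": 8,                              # W x W input window
    "train_pool_test_split": (0.01, 0.49, 0.50),  # 1% / 49% / 50%
    "neighborhood": 3,                            # 3 x 3 -> 9 spectra per score
    "query_pct": 0.0002,                          # 0.02% query percentage
    "lr": 1e-3, "lr_decay": 1e-6, "optimizer": "adam",
    "epochs": 50, "batch_size": 56,
    "layers": 4, "heads": 8, "d_model": 64, "d_ff": 256,
    "dropout": 0.1, "ln_eps": 1e-6,
}
```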

The experimental evaluation in this study was conducted using six widely recognized hyperspectral datasets: Qingyun (QUH), WHU-Hi-HongHu (HH), WHU-Hi-HanChuan (HC), Salinas (SA), University of Houston (UH), and Pavia University (PU). Each dataset encompasses a variety of land cover classes with distinct characteristics, enabling a comprehensive assessment across diverse scenes. For HH and HC, a rich set of agricultural and urban classes was included, while SA predominantly featured vegetation-related categories. UH captured a complex urban environment, and PU primarily focused on urban materials and structures. The QUH dataset significantly expanded the scale and diversity of the evaluation, offering a substantial number of samples across natural and man-made categories.

5 ATL-SST Effects on the Same Dataset

The results presented in Table 2 highlight the consistent improvement in classification accuracy as the number of training samples increases. Table 2 also provides computational metrics, showcasing the model’s efficiency. Training times remain stable at around 44s across iterations, despite the increase in sample size, indicating scalability. The testing time is consistently around 3.4s, demonstrating the method’s suitability for real-time applications. The reported FLOPs (110,336) and parameter count (836,559) reflect the model’s balance between computational efficiency and capacity, making it practical. Qualitative improvements are visually evident in Figure 2, which presents ground truth maps for varying training sample sizes. With fewer training samples (e.g., 75 and 223), the maps exhibit substantial noise and misclassification. As the training sample size increases, the maps become progressively refined, reflecting improved spatial and spectral consistency.

Table 2: UH Dataset: Accuracy improvements as the sample count increases per iteration. Additionally, average training and testing times (in seconds) and the total FLOPs and parameter count provide a comprehensive overview of computational requirements.
Metric Number of Training Samples
75 223 368 510 650 787
\(\kappa\) 71.27 84.98 91.00 94.96 96.17 97.38
OA 73.49 86.13 91.68 95.34 96.46 97.57
AA 71.02 84.32 90.35 94.27 95.57 96.97
Train (s) 50.52 44.63 44.98 44.34 44.96 47.39
Test (s) 4.43 3.45 3.45 3.35 3.44 3.44
FLOPs 110,336
Params 836,559

Figure 2: UH dataset: Ground truth maps corresponding to varying numbers of training samples.

The results presented in Table 3 and Figure 3 collectively highlight the model’s performance progression on the PU dataset as training samples increase. Training and testing times remain relatively stable across sample sizes, indicating computational efficiency; a total of 108,800 FLOPs and 835,017 parameters suggests a well-optimized model that balances accuracy and computational overhead. The accompanying figure further substantiates these findings by plotting OA, AA, and \(\kappa\) scores against sample counts, illustrating a clear upward trend with convergence near 99% as sample sizes approach 2,245.

Table 3: PU Dataset: Accuracy improvements as the sample count increases per iteration.
Metric Number of Training Samples
213 636 1051 1457 1855 2245
\(\kappa\) 82.69 90.62 95.65 96.98 97.85 98.51
OA 87.11 92.98 96.72 97.72 98.38 98.87
AA 80.26 90.20 95.10 96.35 97.35 98.17
Train (s) 118.11 117.25 115.99 119.21 119.78 124.78
Test (s) 9.87 9.26 9.08 8.79 9.07 9.09
FLOPs 108,800
Params 835,017

Figure 3: PU dataset: Ground truth maps corresponding to varying numbers of training samples.

Table 4 and Figure 4 thoroughly analyze the model’s performance, computational complexity, and the scalability of its classification accuracy on the SA dataset. Although training times gradually increase as sample sizes grow, testing times remain relatively stable, underscoring the model’s scalability and efficiency. Furthermore, Table 4 provides computational metrics, including FLOPs and parameter counts, which quantify the model’s complexity at approximately 110,592 FLOPs and 836,816 parameters. These values confirm the model’s high computational efficiency, which is an essential consideration for large-scale applications. Finally, Figure 4 provides ground truth maps generated with different training sample sizes, visually showcasing classification performance improvements.

Table 4: SA Dataset: Accuracy improvements as the sample count increases per iteration.
Metric Number of Training Samples
270 805 1330 1844 2348 2842
\(\kappa\) 91.55 96.59 97.99 98.62 99.08 99.42
OA 92.41 96.94 98.20 98.76 99.17 99.48
AA 95.47 98.30 99.15 99.41 99.62 99.76
Train (s) 149.00 144.89 149.44 152.20 155.20 157.85
Test (s) 12.44 11.40 11.60 11.32 10.69 11.49
FLOPs 110,592
Params 836,816

Figure 4: SA dataset: Ground truth maps corresponding to varying numbers of training samples.

Table 5 provides a comprehensive overview of the HC dataset’s classification performance and computational cost. The computational metrics reveal a gradual increase in training time with more samples, reflecting the added processing requirements, while testing times remain relatively stable. Furthermore, the reported FLOPs and parameter counts provide insight into the model’s computational efficiency, balancing accuracy with resource usage. Figure 5 showcases the ground truth maps for each training sample size, visually representing the model’s learning progress. With 1,287 samples, the classification appears fragmented, reflecting limited learning. As the sample size increases, the maps exhibit progressively finer details, and class boundaries become more accurate and well-defined.

Table 5: HC Dataset: Accuracy improvements as the sample count increases per iteration.
Metric Number of Training Samples
1287 3836 6334 8782 11181 13532
\(\kappa\) 84.71 91.39 94.78 96.89 98.12 98.83
OA 86.97 92.65 95.54 97.34 98.39 99.00
AA 70.46 84.04 90.67 94.33 96.36 97.60
Train (s) 523.75 544.15 547.76 598.53 618.34 648.47
Test (s) 53.42 50.17 51.15 55.11 54.07 54.52
FLOPs 110,592
Params 836,816

Figure 5: HC dataset: Ground truth maps corresponding to varying numbers of training samples.

Table 6 provides computational metrics; despite increasing data, the training and testing times remain relatively stable, showcasing the model’s computational efficiency. For example, training times range from 272.05s to 291.68s, and testing times fluctuate minimally between 22.42s and 23.42s. Accuracy follows a clear upward trajectory: OA increased from 87.82% to 98.97%, while AA rose from 71.13% to 97.93%, reflecting the model’s robust performance. The \(\kappa\) followed a similar trend, indicating reduced misclassification rates and alignment with ground truth. Figure 6 presents ground truth maps corresponding to the training sample counts. These visualizations show the gradual enrichment of spatial details and class boundaries as training data increases. The progression from 1,933 to 20,322 samples demonstrates the model’s growing ability to capture intricate spatial and spectral variations.

Table 6: HH Dataset: Accuracy improvements as the sample count increases per iteration.
Metric Number of Training Samples
1933 5761 9512 13188 16791 20322
\(\kappa\) 84.51 92.38 96.07 97.36 98.82 98.69
OA 87.82 93.98 96.89 97.91 99.07 98.97
AA 71.13 86.73 92.80 95.62 97.86 97.93
Train (s) 272.05 273.63 277.96 285.60 291.45 291.68
Test (s) 23.30 22.56 22.42 22.73 23.42 22.89
FLOPs 112,128
Params 838,358

Figure 6: HH dataset: Ground truth maps corresponding to varying numbers of training samples.

Table 7 reports training and testing times, which grow modestly with larger sample counts, and provides the model’s computational metrics, such as FLOPs and parameters, indicating efficient scaling. Figure 7 illustrates ground truth maps for the dataset, corresponding to each training sample size. These maps visually demonstrate the enhancement in spatial coverage and label consistency as the number of training samples increases, corroborating the quantitative improvements observed in Table 7.

Table 7: QUH Dataset: Accuracy improvements as the sample count increases per iteration.
Metric Number of Training Samples
4774 14227 23491 32570 41467 50186
\(\kappa\) 88.59 95.18 96.72 98.32 98.64 99.04
OA 91.41 96.35 97.52 98.73 98.97 99.28
AA 81.15 94.52 96.29 97.91 97.92 98.95
Train (s) 2000.15 2122.98 2152.50 2248.08 2296.25 2360.75
Test (s) 219.09 227.46 231.50 239.75 239.00 234.79
FLOPs 108,032
Params 834,246

Figure 7: QUH dataset: Ground truth maps corresponding to varying numbers of training samples.

6 ATL-SST Effects on Cross Datasets

This section explores the impact of ATL in enabling effective cross-dataset classification. By leveraging the actively trained SST on one dataset and fine-tuning it with a limited fraction of samples from other datasets, we assess its ability to generalize and adapt to varying spectral-spatial distributions. The results highlight the significance of ATL in achieving high accuracy with minimal fine-tuning across diverse datasets.

Table 8: The OA, AA, and \(\kappa\) accuracies for cross-dataset classification were obtained using the actively trained SST on the SA dataset and evaluated across other datasets.
Metric SST Trained SST Fine-Tuned using 10% Samples
SA PU UH HC HH QUH
\(\kappa\) 99.42 96.42 97.95 93.95 93.94 95.70
OA 99.48 97.30 98.11 94.83 95.20 96.75
AA 99.76 95.47 97.89 90.14 89.30 93.94
Test (s) 11.49 10.31 5.46 61.11 128.04 252.74

Figure 8: SST trained on SA was fine-tuned and tested on PU, HC, HH, and UH datasets, respectively.

Table 8 presents the OA, AA, and \(\kappa\) for cross-dataset classification, where the ATL-SST is initially trained on the SA dataset and subsequently fine-tuned using 10% of the samples from other datasets (PU, UH, HC, HH, and QUH). The results indicate that the model achieves high accuracy across datasets, with \(\kappa\) values consistently above 93%, demonstrating effective transfer learning. Testing times vary significantly across datasets due to differences in data volume and complexity, with QUH requiring the longest computation times. Figure 8 provides visual representations of classification outcomes, showcasing the trained SST on SA and its fine-tuned performance on PU, HC, HH, and UH datasets. Each subfigure illustrates distinct patterns, emphasizing the model’s adaptability to different data distributions.

Table 9 and Figure 9 illustrate the performance of the SST trained on the PU dataset and subsequently fine-tuned and evaluated on other datasets, including SA, UH, HC, and HH. Table 9 provides detailed metrics, highlighting the transferability and generalization ability of the SST model. The actively trained SST on PU achieves a \(\kappa\) of 98.51% and an OA of 98.87% on its test data, demonstrating robust performance on the source dataset. When fine-tuned with only 10% samples from other datasets, the model maintains competitive accuracy, with OA values exceeding 96% for most target datasets. Testing times are also reported, indicating computational efficiency during model fine-tuning and evaluation. Figure 9 visually complements these results by showcasing the SST model’s classification maps on the source dataset (PU) and various target datasets (SA, HC, HH, and UH).

Table 9: The OA, AA, and \(\kappa\) accuracies for cross-dataset classification were obtained using the actively trained SST on the PU dataset and evaluated across other datasets.
Metric SST Trained SST Fine-Tuned using 10% Samples
PU SA UH HC HH QUH
\(\kappa\) 98.51 97.75 96.17 91.80 92.43 94.85
OA 98.87 97.98 96.46 93.00 94.01 96.11
AA 98.17 98.83 96.06 87.55 87.71 92.79
Test (s) 9.09 12.83 3.71 59.67 89.13 239.48

Figure 9: SST trained on PU was fine-tuned and tested on SA, HC, HH, and UH datasets, respectively.

Table 10: The OA, AA, and \(\kappa\) accuracies for cross-dataset classification were obtained using the actively trained SST on the UH dataset and evaluated across other datasets.
Metric SST Trained SST Fine-Tuned using 10% Samples
UH SA PU HC HH QUH
\(\kappa\) 97.38 98.29 97.01 93.26 94.04 95.60
OA 97.57 98.47 97.74 94.25 95.29 96.67
AA 96.97 99.19 96.32 90.42 89.40 94.35
Test (s) 3.44 12.85 10.81 69.98 97.30 242.65

Figure 10: SST trained on UH was fine-tuned and tested on SA, HC, HH, and PU datasets, respectively.

Table 10 and Figure 10 present the performance of the SST actively trained on the UH dataset and fine-tuned with 10% samples from other target datasets, including SA, PU, HC, and HH. Table 10 outlines key metrics, highlighting the SST model’s capability to generalize and adapt to new datasets with minimal fine-tuning. The actively trained SST on UH achieves high baseline performance, with a \(\kappa\) of 97.38% and an OA of 97.57% on its test data. When fine-tuned with samples from other datasets, the model demonstrates strong transferability, achieving an OA of 98.47% on SA and 97.74% on PU. Although the performance is slightly reduced for HC and HH, the OA values remain above 94%, underscoring the model’s robustness. Testing times scale with dataset complexity, with SA and PU requiring less computation compared to HC and HH, as seen in the recorded durations. Figure 10 complements the table by visualizing classification results. The maps indicate that the fine-tuned SST effectively captures spatial and spectral features across diverse datasets. Specifically, the SST retains consistent classification quality on target datasets despite using only 10% of the samples for fine-tuning.

7 Comparison with State-of-the-art Methods

The following Tables and Figures illustrate a comparative evaluation of various HSI classification methods on the UH, PU, and SA datasets. The methods compared include Attention Graph CNN (AGCN) [51], Wavelet-Based Spatial-Spectral Transformer (WaveFormer) [43], Hybrid Spatial-Spectral Transformer (HybViT) [52], Pyramid-based Spatial-Spectral Transformer (PyFormer) [12], Spatial-Spectral Transformer (SST) [53], Spatial-Spectral Mamba (SSMamba) [54], and Wavelet-Based Spatial-Spectral Mamba (WaveMamba) [55]. These state-of-the-art approaches are benchmarked against the proposed method, ATL-SST, under consistent training, validation, and testing configurations, as summarized in the Tables.

Table 11: UH dataset: Comparison of various HSI classification methods (Attention Graph CNN = AGCN, Hybrid Spatial-Spectral Transformer = HViT, Spatial-Spectral Mamba = SSM, WaveFormer = WF, PyFormer = PF, WaveMamba = WM) highlighting performance metrics.
Samples State-of-the-art Comparative Methods ATL-SST
Tr Va Te AGCN [51] HViT [52] SST [53] WF [43] PF [12] SSM [54] WM [55]
43 582 626 95.36 93.61 94.56 94.88 85.30 91.85 93.92 98.40
42 585 627 99.84 99.36 99.20 99.04 100 96.65 98.40 99.52
26 323 348 99.13 100 99.71 98.56 95.68 93.67 97.70 100
44 578 622 94.69 97.58 97.26 97.58 98.87 96.62 94.85 97.90
42 579 621 100 99.83 100 100 94.52 96.61 96.94 100
17 146 162 87.03 80.24 98.14 88.88 88.88 80.86 87.65 98.14
106 528 634 89.74 91.95 92.27 94.47 89.27 80.44 85.48 94.47
81 541 622 87.29 90.35 92.76 96.46 87.62 83.92 90.99 95.98
81 545 626 69.96 91.21 93.76 90.09 71.72 65.01 83.22 97.92
69 544 614 96.09 96.74 99.51 99.02 84.85 80.29 89.73 100
70 547 618 94.66 95.95 96.76 94.82 90.93 86.08 92.88 99.02
68 548 617 81.03 99.02 98.86 97.40 80.87 92.70 92.54 97.56
55 180 234 87.60 82.05 79.48 72.22 57.69 34.61 73.50 77.77
15 199 214 95.32 96.26 97.19 97.19 97.19 92.52 98.59 100
28 302 330 94.54 95.45 100 100 100 81.51 95.75 97.87
Train (s) 101.61 50.85 46.74 47.73 397.21 67.21 563.52 47.39
\(\kappa\) 90.64 94.61 96.01 95.39 87.62 84.26 91.19 97.38
OA 91.35 95.02 96.31 95.74 88.55 85.46 91.85 97.57
AA 91.49 93.97 95.96 94.71 88.23 83.56 91.48 96.97

Figure 11: UH dataset: Ground truth maps for comparative methods alongside the proposed ATL-SST.

Table 12: PU dataset: Comparison of various HSI classification methods highlighting performance metrics.
Samples State-of-the-art Comparative Methods ATL-SST
Tr Va Te AGCN [51] HViT [52] SST [53] WF [43] PF [12] SSM [54] WM [55]
422 2893 3316 99.45 97.13 97.40 97.55 96.86 88.75 97.64 98.79
535 8790 9324 98.01 99.14 99.02 99.53 98.93 94.74 99.53 99.90
218 832 1049 69.01 84.65 84.74 82.93 93.51 74.73 75.97 91.32
161 1371 1532 97.51 96.27 96.01 95.16 97.32 92.23 94.38 99.54
15 657 673 97.77 100 100 100 100 98.95 98.21 100
311 2204 2514 99.20 97.73 97.13 96.26 94.35 93.35 98.60 99.64
114 551 665 87.96 88.12 92.93 91.27 75.03 82.25 92.33 99.54
405 1436 1841 93.91 89.57 90.05 92.28 63.06 81.96 88.48 95.87
64 409 474 97.04 93.45 96.83 94.93 89.45 89.45 90.50 98.94
Train (s) 72.49 117.58 114.56 112.68 1181.18 154.76 1382.70 124.78
\(\kappa\) 95.00 95.33 95.56 95.68 91.62 88.16 94.94 98.51
OA 96.22 96.48 96.65 96.75 93.68 91.01 96.19 98.87
AA 93.32 94.01 94.90 94.44 89.83 88.49 92.85 98.17

Figure 12: PU dataset: Ground truth maps for comparative methods alongside the proposed ATL-SST.

Table 13: SA dataset: Comparison of various HSI classification methods highlighting performance metrics.
Samples State-of-the-art Comparative Methods ATL-SST
Tr Va Te AGCN [51] HViT [52] SST [53] WF [43] PF [12] SSM [54] WM [55]
41 964 1004 99.50 99.90 99.00 98.50 100 99.40 100 100
93 1770 1863 100 100 100 100 100 100 100 100
80 908 988 100 99.39 99.59 99.49 100 97.36 98.68 100
62 635 697 96.41 99.85 99.71 100 100 98.42 99.13 100
118 1221 1339 86.03 99.85 99.70 99.17 94.69 98.65 98.20 99.40
77 1902 1980 100 100 100 100 100 99.84 99.84 100
64 1725 1790 99.88 99.88 100 100 100 98.82 99.21 100
1004 4631 5636 86.12 93.98 94.96 95.28 88.25 87.63 94.25 98.86
109 2992 3102 100 100 100 100 99.96 99.83 100 99.96
94 1545 1639 98.16 99.51 99.14 99.75 93.89 97.37 97.62 99.93
58 476 534 91.94 100 98.87 99.81 88.95 96.81 98.31 100
56 908 963 97.92 100 99.58 99.89 100 98.65 99.89 100
54 404 458 96.72 100 100 100 96.28 99.12 100 100
40 495 535 99.62 99.62 100 99.06 46.16 94.20 98.13 99.81
842 2792 3634 92.90 91.24 90.72 90.03 92.43 76.63 91.16 98.23
50 854 903 99.44 100 99.66 100 100 99.66 99.55 100
Train (s) 91.07 147.54 139.82 144.16 1469.81 194.06 1734.50 157.85
\(\kappa\) 94.35 97.20 97.25 97.24 93.94 92.80 96.84 99.42
OA 94.92 97.49 97.53 97.52 94.55 93.54 97.16 99.48
AA 96.54 98.95 98.81 98.81 93.79 96.40 98.37 99.76

Figure 13: SA dataset: Ground truth maps for comparative methods alongside the proposed ATL-SST.

As shown in Table 11, the results highlight the superior performance of ATL-SST across most experimental settings. ATL-SST consistently achieves the highest per-class accuracies, reaching 100% in multiple classes, underscoring its robustness and effectiveness. Similarly, ATL-SST outperforms all comparative methods in terms of \(\kappa\) and AA, achieving top scores of 97.38% and 96.97%, respectively. This reflects the method’s ability to effectively classify diverse classes within the hyperspectral dataset. Notably, ATL-SST demonstrates significant improvement over AGCN and WaveMamba, with AGCN yielding lower performance metrics (e.g., 91.35% OA and 90.64% \(\kappa\)) in many cases. While HybViT and SST exhibit strong results, particularly in individual settings, their performance is generally less consistent than that of ATL-SST. The training time analysis also reveals ATL-SST’s computational efficiency, with an average training time comparable to SST and WaveFormer, and significantly lower than methods like WaveMamba and PyFormer. The ground truth maps presented in Figure 11 visually depict the classification results for the UH dataset. It is evident from the maps that ATL-SST provides more precise and coherent classification results, with fewer misclassified pixels and better-defined class boundaries compared to the comparative methods. For instance, methods like AGCN and WaveMamba exhibit noticeable misclassifications and noise, whereas ATL-SST maintains high spatial and spectral accuracy.

The results presented in Table 12 and Figure 12 comprehensively compare the performance. In terms of accuracy, ATL-SST consistently outperforms all other methods, achieving the highest OA of 98.87%, AA of 98.17%, and \(\kappa\) of 98.51%. These results are a significant improvement over the next best-performing method, WaveFormer, which achieves an OA of 96.75%. From a computational perspective, ATL-SST maintains competitive efficiency, with a training time of 124.78s, comparable to HybViT and SST while significantly outperforming computationally intensive models like WaveMamba and PyFormer, which require 1382.70s and 1181.18s, respectively. Qualitative results presented in Figure 12 further validate the quantitative findings. The classification maps generated by ATL-SST are visually superior, exhibiting precise boundary delineation and accurate region classifications that closely match the ground truth. In contrast, competing methods, such as AGCN and SSMamba, show noticeable errors, particularly in regions with complex class boundaries or high spectral variability.

The comparison of HSI classification methods on the SA dataset, as detailed in Table 13 and illustrated in Figure 13, highlights the superior performance of the proposed ATL-SST model. Quantitatively, ATL-SST demonstrates outstanding accuracy, achieving the highest OA of 99.48%, AA of 99.76%, and \(\kappa\) of 99.42%. These metrics significantly surpass the next best-performing models, such as HybViT and SST, which attain OAs of 97.49% and 97.53%, respectively. From a computational perspective, ATL-SST achieves a balanced trade-off between accuracy and training efficiency. Its training time of 157.85s is competitive with models like HybViT and SST, while being significantly faster than WaveMamba and PyFormer, which require 1734.50s and 1469.81s, respectively. Qualitative results in Figure 13 further validate ATL-SST’s efficacy. The classification maps produced by ATL-SST exhibit superior boundary delineation and region specificity, closely mirroring the ground truth. In comparison, models such as SSMamba and WaveMamba show noticeable misclassifications in regions with complex spectral patterns, while transformer-based models like HybViT and PyFormer, although performing well overall, display minor boundary smoothing.

8 Ablation Study

To evaluate the individual contributions of each module within the SST-ATL framework, we conduct a comprehensive ablation study on three representative datasets: PU, UH, and SA. The ablation analysis is divided into two categories: component-level evaluation and query strategy assessment.

8.1 Component-Level Ablation

First, to measure the effect of AL, we replace the hybrid uncertainty-diversity sampling strategy with random sampling. Second, to isolate the impact of diversity sampling, we retain only uncertainty-based selection. Third, to test the influence of selective model adaptation, we disable dynamic freezing and allow all transformer layers to be uniformly fine-tuned. Fourth, we examine the value of the self-calibrated attention mechanism by reverting to standard MHSA without uncertainty-guided refinement.

The results presented in Table 14 indicate that each component contributes significantly to the final performance of the model. The full SST-ATL achieves the highest OA, AA, and \(\kappa\) across all datasets. The removal of AL causes the most severe degradation: OA drops by 2.66%, 2.65%, and 2.70% for PU, UH, and SA, respectively, highlighting the effectiveness of the proposed hybrid sampling. Excluding diversity sampling also leads to performance drops, particularly in the AA metric, reflecting reduced class coverage. The absence of dynamic freezing results in slightly reduced accuracy but increased training instability, while the removal of self-calibrated attention degrades spatial-spectral feature alignment, as evidenced by reduced \(\kappa\) values.

Table 14: Component-level ablation results.
Variant PU UH SA
OA AA \(\kappa\) OA AA \(\kappa\) OA AA \(\kappa\)
Full SST-ATL 98.88 98.17 98.51 97.58 96.97 97.38 99.49 99.76 99.43
w/o Active Learning 96.22 94.60 95.00 94.93 93.48 94.51 96.79 97.90 96.90
w/o Diversity 97.14 96.13 96.47 95.72 94.36 95.08 97.85 98.40 97.86
w/o Dynamic Freezing 97.31 96.23 96.60 96.14 94.72 95.52 98.12 98.83 98.10
w/o Self-Calibrated Attention 97.53 96.74 97.12 96.80 95.93 96.61 98.71 99.12 98.63

8.2 Query Strategy Evaluation

In addition to component analysis, we evaluate the effectiveness of our hybrid uncertainty-diversity sampling strategy against four widely used alternatives: random sampling, entropy-based uncertainty sampling, margin-based uncertainty sampling, and diversity-only sampling. All strategies are tested across four AL rounds (250 to 1000 labeled samples) on the PU, UH, and SA datasets.

As shown in Table 15, our proposed hybrid query strategy outperforms all others in every setting. The hybrid method achieves up to 98.14% OA on PU, 97.29% on UH, and 98.97% on SA with only 1000 labeled samples. The advantage is particularly evident in the early rounds, where the model learns from as few as 250 samples. For example, at 250 labels, the hybrid method exceeds random sampling by over 3% OA on all datasets, showcasing its superior label efficiency and informativeness. These results emphasize that incorporating both uncertainty and spectral-spatial diversity into the query process leads to faster convergence, better generalization, and significantly lower annotation costs than existing AL approaches.

Table 15: Comparison of query strategies across AL iterations.
Strategy PU (OA%) UH (OA%) SA (OA%)
250 500 750 1000 250 500 750 1000 250 500 750 1000
Random Sampling 90.81 93.54 95.02 96.22 88.13 91.95 93.41 94.93 91.74 94.27 95.89 96.79
Entropy-Based Uncertainty 91.54 94.38 95.73 97.14 89.97 93.22 94.78 95.72 92.90 95.83 96.61 97.85
Margin Sampling 91.23 94.07 95.49 96.80 89.51 92.68 94.33 95.31 92.46 95.30 96.20 97.43
Diversity-Only Sampling 92.14 94.96 96.34 97.04 90.62 93.75 95.11 95.96 93.33 96.08 96.95 98.00
Proposed (Hybrid) 93.92 96.23 97.51 98.14 92.01 95.04 96.30 97.29 94.71 97.42 98.13 98.97

9 Conclusions and Future Research Directions

This work proposes a novel multi-stage ATL framework that integrates the SST for efficient HSI classification. The proposed framework leverages the strengths of transfer learning, AL, and the SST to address key challenges in HSI classification, such as high spectral dimensionality and limited labeled data. A major contribution of this work is the development of an uncertainty-diversity querying mechanism that adaptively selects the most informative and diverse samples for iterative model refinement. This not only optimizes labeling efficiency but also improves the model’s ability to generalize across varying spectral profiles. Additionally, this work introduces a dynamic freezing strategy to selectively freeze and unfreeze SST layers during the transfer learning process, ensuring an optimal balance between computational efficiency and adaptability to new spectral variations. This mechanism significantly reduces computational overhead while preserving critical learned representations, making the approach scalable for resource-constrained environments.

While this study makes substantial progress in advancing HSI classification through ATL and SST, several avenues for future exploration remain. Future extensions could focus on self-supervised learning approaches to reduce reliance on labeled data, which would be particularly beneficial in HSI classification, where obtaining labeled samples is costly; leveraging self-supervised pretext tasks or pseudo-labeling strategies could further enhance the model’s learning capabilities in scenarios with limited labeled data. Although the current uncertainty-diversity-based querying has proven effective, dynamic querying strategies could also be explored: combining reinforcement learning with AL may allow the system to learn optimal sampling strategies over time, potentially improving both sample selection and overall learning efficiency. Finally, as HSI data can vary significantly between different sensors or geographical regions, exploring domain adaptation techniques within the ATL framework would be a valuable direction, allowing models to adapt better to unseen datasets or shifts in data distribution and improving the robustness of the classification process in real-world scenarios.

References

[1]
J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” IEEE Geoscience and Remote Sensing Magazine, vol. 1, no. 2, pp. 6–36, 2013.
[2]
M. H. F. Butt, J. P. Li, M. Ahmad, and M. A. F. Butt, “Graph-infused hybrid vision transformer: Advancing geoai for enhanced land cover classification,” International Journal of Applied Earth Observation and Geoinformation, vol. 129, p. 103773, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1569843224001274.
[3]
M. B. Stuart, M. Davies, M. J. Hobbs, T. D. Pering, A. J. McGonigle, and J. R. Willmott, “High-resolution hyperspectral imaging using low-cost components: Application within environmental monitoring scenarios,” Sensors, vol. 22, no. 12, p. 4652, 2022.
[4]
Z. Saleem, M. H. Khan, M. Ahmad, A. Sohaib, H. Ayaz, and M. Mazzara, “Prediction of microbial spoilage and shelf-life of bakery products through hyperspectral imaging,” IEEE Access, vol. 8, pp. 176986–176996, 2020.
[5]
M. H. Khan, Z. Saleem, M. Ahmad, A. Sohaib, H. Ayaz, M. Mazzara, and R. A. Raza, “Hyperspectral imaging-based unsupervised adulterated red chili content transformation for classification: Identification of red chili adulterants,” Neural Computing and Applications, vol. 33, no. 21, pp. 14507–14521, 2021.
[6]
H. Ayaz, M. Ahmad, M. Mazzara, and A. Sohaib, “Hyperspectral imaging for minced meat classification using nonlinear deep features,” Applied Sciences, vol. 10, no. 21, p. 7783, 2020.
[7]
M. H. F. Butt, H. Ayaz, M. Ahmad, J. P. Li, and R. Kuleev, “A fast and compact hybrid cnn for hyperspectral imaging-based bloodstain classification,” in 2022 IEEE Congress on Evolutionary Computation (CEC), 2022, pp. 1–8.
[8]
M. Zulfiqar, M. Ahmad, A. Sohaib, M. Mazzara, and S. Distefano, “Hyperspectral imaging for bloodstain identification,” Sensors, vol. 21, no. 9, p. 3045, 2021.
[9]
M. H. F. Butt, J. P. Li, J. C. Ji, W. Riaz, N. Anwar, F. F. Butt, M. Ahmad, A. Saboor, A. Ali, and M. Y. Uddin, “Intelligent tumor tissue classification for hybrid health care units,” Frontiers in Medicine, vol. 11, 2024. [Online]. Available: https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2024.1385524.
[10]
S. Hajaj, A. El Harti, A. B. Pour, A. Jellouli, Z. Adiri, and M. Hashim, “A review on hyperspectral imagery application for lithological mapping and mineral prospecting: Machine learning techniques and future prospects,” Remote Sensing Applications: Society and Environment, vol. 35, p. 101218, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S235293852400082X.
[11]
M. F. Guerri, C. Distante, P. Spagnolo, F. Bougourzi, and A. Taleb-Ahmed, “Deep learning techniques for hyperspectral image analysis in agriculture: A review,” ISPRS Open Journal of Photogrammetry and Remote Sensing, vol. 12, p. 100062, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S266739322400005X.
[12]
M. Ahmad, M. H. F. Butt, M. Mazzara, S. Distefano, A. M. Khan, and H. A. Altuwaijri, “Pyramid hierarchical spatial-spectral transformer for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 17681–17689, 2024.
[13]
Z. Zhao, X. Xu, S. Li, and A. Plaza, “Hyperspectral image classification using groupwise separable convolutional vision transformer network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–17, 2024.
[14]
M. Ahmad, M. H. F. Butt, M. Usama, H. A. Altuwaijri, M. Mazzara, S. Distefano, and A. M. Khan, “Multi-head spatial-spectral mamba for hyperspectral image classification,” Remote Sensing Letters, vol. 16, no. 4, pp. 339–353, 2025. [Online]. Available: https://doi.org/10.1080/2150704X.2025.2461330.
[15]
F. Xu, S. Mei, G. Zhang, N. Wang, and Q. Du, “Bridging cnn and transformer with cross-attention fusion network for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024.
[16]
S. Sohail, M. Usama, U. Ghous, M. Mazzara, and M. Ahmad, “Differential attention with enhanced squeeze-and-excitation for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5, 2025.
[17]
M. Jiang, Y. Su, L. Gao, A. Plaza, X.-L. Zhao, X. Sun, and G. Liu, “GraphGST: Graph generative structure-aware transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024.
[18]
J. Li, Z. Zhang, R. Song, Y. Li, and Q. Du, “SCFormer: Spectral coordinate transformer for cross-domain few-shot hyperspectral image classification,” IEEE Transactions on Image Processing, vol. 33, pp. 840–855, 2024.
[19]
D. Hong, Z. Han, J. Yao, L. Gao, B. Zhang, A. Plaza, and J. Chanussot, “SpectralFormer: Rethinking hyperspectral image classification with transformers,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022.
[20]
J. Chen, C. Yang, L. Zhang, L. Yang, L. Bian, Z. Luo, and J. Wang, “TCCU-Net: Transformer and CNN collaborative unmixing network for hyperspectral image,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 8073–8089, 2024.
[21]
M. Ahmad, M. Mazzara, S. Distefano, A. M. Khan, and X. Wu, “Self-supervised spatial-spectral transformer with extreme learning machine for hyperspectral image classification,” International Journal of Remote Sensing, vol. 0, no. 0, pp. 1–24, 2025. [Online]. Available: https://doi.org/10.1080/01431161.2025.2520049.
[22]
C. Shi, S. Yue, and L. Wang, “A dual-branch multiscale transformer network for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–20, 2024.
[23]
W. Huang, Y. Deng, S. Hui, Y. Wu, S. Zhou, and J. Wang, “Sparse self-attention transformer for image inpainting,” Pattern Recognition, vol. 145, p. 109897, 2024.
[24]
Y. Sun, X. Zhi, S. Jiang, G. Fan, X. Yan, and W. Zhang, “Image fusion for the novelty rotating synthetic aperture system based on vision transformer,” Information Fusion, vol. 104, p. 102163, 2024.
[25]
Y. Peng, Y. Zhang, B. Tu, Q. Li, and W. Li, “Spatial–spectral transformer with cross-attention for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022.
[26]
S. Jia, S. Jiang, Z. Lin, N. Li, M. Xu, and S. Yu, “A survey: Deep learning for hyperspectral image classification with few labeled samples,” Neurocomputing, vol. 448, pp. 179–204, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231221004033.
[27]
P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, B. B. Gupta, X. Chen, and X. Wang, “A survey of deep active learning,” ACM Computing Surveys, vol. 54, no. 9, Oct. 2021. [Online]. Available: https://doi.org/10.1145/3472291.
[28]
M. Ahmad, S. Shabbir, D. Oliva, M. Mazzara, and S. Distefano, “Spatial-prior generalized fuzziness extreme learning machine autoencoder-based active learning for hyperspectral image classification,” Optik, vol. 206, p. 163712, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0030402619316109.
[29]
H. Wu, Y. Zhong, G. Han, J. Lin, Z. Liu, and C. Han, “DBRAL: A novel uncertainty-based active learning based on deep-broad learning for medical image classification,” in Artificial Neural Networks and Machine Learning – ICANN 2024: 33rd International Conference on Artificial Neural Networks, Lugano, Switzerland, September 17–20, 2024, Proceedings, Part VIII. Berlin, Heidelberg: Springer-Verlag, 2024, pp. 260–273. [Online]. Available: https://doi.org/10.1007/978-3-031-72353-7_19.
[30]
R. Thoreau, V. Achard, L. Risser, B. Berthelot, and X. Briottet, “Active learning for hyperspectral image classification: A comparative review,” IEEE Geoscience and Remote Sensing Magazine, vol. 10, no. 3, pp. 256–278, 2022.
[31]
M. Ahmad, M. Usama, S. Distefano, and M. Mazzara, “Hyperspectral image classification with fuzzy spatial-spectral class discriminate information,” in 2024 IEEE International Conference on Image Processing (ICIP), 2024, pp. 2285–2291.
[32]
A. C. Karaca and G. Bilgin, “An end-to-end active learning framework for limited labelled hyperspectral image classification,” International Journal of Remote Sensing, vol. 46, no. 8, pp. 3179–3206, 2025. [Online]. Available: https://doi.org/10.1080/01431161.2025.2467294.
[33]
S. Liu, Q. Shi, and L. Zhang, “Few-shot hyperspectral image classification with unknown classes using multitask deep learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 6, pp. 5085–5102, 2021.
[34]
X. Liao, B. Tu, J. Li, and A. Plaza, “Class-wise graph embedding-based active learning for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023.
[35]
C. Zhao, B. Qin, S. Feng, W. Zhu, W. Sun, W. Li, and X. Jia, “Hyperspectral image classification with multi-attention transformer and adaptive superpixel segmentation-based active learning,” IEEE Transactions on Image Processing, vol. 32, pp. 3606–3621, 2023.
[36]
Z. Wang, Z. Chen, and B. Du, “Active learning with co-auxiliary learning and multi-level diversity for image classification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 8, pp. 3899–3911, 2023.
[37]
F. Liu, W. Gao, J. Liu, X. Tang, and L. Xiao, “Adversarial domain alignment with contrastive learning for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–20, 2023.
[38]
F. Deng, R. Liang, W. Luo, and G. Zhang, “Deep learning segmentation of seismic facies based on proximity constraint strategy: Innovative application of UMA-Net model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–17, 2024.
[39]
H. Wu, S. Wu, K. Zhang, X. Liu, S. Shi, and C. Bian, “Unsupervised blind spectral–spatial cross-super-resolution network for HSI and MSI fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024.
[40]
D. Liao, C. Shi, and L. Wang, “A spectral–spatial fusion transformer network for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
[41]
S. Zhang, J. Zhang, X. Wang, J. Wang, and Z. Wu, “ELS2T: Efficient lightweight spectral–spatial transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023.
[42]
L. Zhang, Y. Wang, L. Yang, J. Chen, Z. Liu, J. Wang, L. Bian, and C. Yang, “A multi-range spectral-spatial transformer for hyperspectral image classification,” Infrared Physics & Technology, vol. 135, p. 104983, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1350449523004413.
[43]
M. Ahmad, U. Ghous, M. Usama, and M. Mazzara, “WaveFormer: Spectral–spatial wavelet transformer for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol. 21, pp. 1–5, 2024.
[44]
X. Di, Z. Xue, and M. Zhang, “Active learning-driven siamese network for hyperspectral image classification,” Remote Sensing, vol. 15, no. 3, 2023. [Online]. Available: https://www.mdpi.com/2072-4292/15/3/752.
[45]
C. Deng, Y. Xue, X. Liu, C. Li, and D. Tao, “Active transfer learning network: A unified deep joint spectral–spatial feature learning model for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 3, pp. 1741–1754, 2019.
[46]
J. Lin, L. Zhao, S. Li, R. Ward, and Z. J. Wang, “Active-learning-incorporated deep transfer learning for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 11, pp. 4048–4062, 2018.
[47]
M. Ahmad, U. Ghous, D. Hong, A. M. Khan, J. Yao, S. Wang, and J. Chanussot, “A disjoint samples-based 3D-CNN with active transfer learning for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022.
[48]
J. Yang, S. Li, and W. Xu, “Active learning for visual image classification method based on transfer learning,” IEEE Access, vol. 6, pp. 187–198, 2018.
[49]
X. Cao, J. Yao, Z. Xu, and D. Meng, “Hyperspectral image classification with convolutional neural network and active learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 7, pp. 4604–4616, 2020.
[50]
X. He, Y. Chen, and P. Ghamisi, “Heterogeneous transfer learning for hyperspectral image classification based on convolutional neural network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 5, pp. 3246–3263, 2020.
[51]
A. Jamali, S. K. Roy, D. Hong, P. M. Atkinson, and P. Ghamisi, “Attention graph convolutional network for disjoint hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, vol. 21, pp. 1–5, 2024.
[52]
Y. He, B. Tu, B. Liu, Y. Chen, J. Li, and A. Plaza, “Hybrid multiscale spatial–spectral transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–18, 2024.
[53]
C. Jia, X. Zhang, H. Meng, S. Xia, and L. Jiao, “CenterFormer: A center spatial–spectral attention transformer network for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 18, pp. 5523–5539, 2025.
[54]
G. Wang, X. Zhang, Z. Peng, T. Zhang, and L. Jiao, “S2Mamba: A spatial–spectral state space model for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–13, 2025.
[55]
M. Ahmad, M. Usama, M. Mazzara, and S. Distefano, “WaveMamba: Spatial-spectral wavelet Mamba for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, pp. 1–1, 2024.

  1. M. Ahmad is with SDAIA-KFUPM, Joint Research Center for Artificial Intelligence (JRCAI), King Fahd University of Petroleum and Minerals, Dhahran, 31261, Saudi Arabia. (e-mail: mahmad00@gmail.com).

  2. F. Mauro is with the Engineering Department, University of Sannio, Benevento 82100, Italy.

  3. M. Mazzara is with the Institute of Software Development and Engineering, Innopolis University, Innopolis, 420500, Russia.

  4. S. Distefano is with Dipartimento di Matematica e Informatica-MIFT, University of Messina, 98121 Messina, Italy.

  5. A. M. Khan is with the School of Computer Science, University of Hull, Hull HU6 7RX, UK.

  6. S. L. Ullo is with the Engineering Department, University of Sannio, Benevento 82100, Italy.