Sparseformer: a Transferable Transformer with Multi-granularity Token Sparsification for Medical Time Series Classification


Abstract

Medical time series (MedTS) classification is crucial for improved diagnosis in healthcare, and yet it is challenging due to the varying granularity of patterns, intricate inter-channel correlation, information redundancy, and label scarcity. While existing transformer-based models have shown promise in time series analysis, they mainly focus on forecasting and fail to fully exploit the distinctive characteristics of MedTS data. In this paper, we introduce Sparseformer, a transformer specifically designed for MedTS classification. We propose a sparse token-based dual-attention mechanism that enables global modeling and token compression, allowing dynamic focus on the most informative tokens while distilling redundant features. This mechanism is then applied to the multi-granularity, cross-channel encoding of medical signals, capturing intra- and inter-granularity correlations and inter-channel connections. The sparsification design allows our model to handle heterogeneous inputs of varying lengths and channels directly. Further, we introduce an adaptive label encoder to address label space misalignment across datasets, equipping our model with cross-dataset transferability to alleviate the medical label scarcity issue. Our model outperforms 12 baselines across seven medical datasets under supervised learning. In the few-shot learning experiments, our model also achieves superior average results. In addition, the in-domain and cross-domain experiments among three diagnostic scenarios demonstrate our model’s zero-shot learning capability. Collectively, these findings underscore the robustness and transferability of our model in various medical applications.

1 Introduction

Medical time series (MedTS) data, such as multi-channel electrocardiograms (ECGs) and electroencephalograms (EEGs), capture dynamic physiological processes critical for disease diagnosis and treatment monitoring. Classifying these signals accurately enables early detection of life-threatening conditions (e.g., arrhythmias) and personalized therapeutic interventions [1]. However, four key challenges arise due to the distinctive characteristics of MedTS data: First, multi-scale patterns—pathological features span milliseconds (e.g., spike waves in epilepsy) to minutes (e.g., slow-wave oscillations) [2], demanding multi-granularity analysis. Second, intricate inter-channel relationships—multi-sensor data such as 12-lead ECGs encode spatially distributed biomarkers [3], requiring cross-channel dependency modeling. Third, information redundancy—redundant segments in different scales and channels introduce noise and computational inefficiency [4]. Finally, label scarcity—clinically annotated datasets are limited due to expert annotation costs, especially for rare disorders [5].

Traditional methods for MedTS classification, such as nearest neighbor classifiers [6] and Gaussian mixture models [7], rely on statistical pattern matching. Commonly used deep learning approaches, including RNNs [8], CNNs [9], and GNNs [10], enable more complex pattern recognition, yet the exploration of transformer-based models remains limited. Although transformers have shown success in time series analysis, particularly in forecasting [11]–[13], their design and functionality do not fully address the specific challenges of medical time series classification. Transformers like PatchTST [14] and Crossformer [15] utilize patch embedding to capture local patterns, but their fixed patch length restricts their ability to extract multi-level temporal patterns. While MTST [16] and Pathformer [17] employ multi-granularity strategies, they are designed for single-channel inputs. FEDformer [13] and Autoformer [11], in contrast, focus on cross-channel interactions but overlook the hierarchical granularity that is crucial for understanding complex time series data. Medformer [18] is an advanced multi-granularity cross-channel transformer for MedTS, but it uses a self-attention mechanism that distributes attention evenly across tokens and therefore lacks effective suppression of redundant signals. Moreover, these models' rigid architectural designs (fixed input lengths and channel configurations) constrain their adaptability to heterogeneous datasets, hindering their potential to mitigate label scarcity through cross-dataset transfer learning.

To bridge these gaps, we propose Sparseformer, a transferable transformer tailored for medical time series classification. First, we design a token-sparse dual attention (TSDA) mechanism for granularity and channel modeling. TSDA employs self-attention to model global token interactions, followed by token-sparse attention that compresses tokens using a fixed, smaller number of domain-guided learnable queries. This sparsification eliminates redundant information, preserves informative task-relevant features, and reduces computational cost. Multiple stacked TSDA blocks are then employed for multi-granularity and cross-channel encoding: they first extract local features within granularities, then refine inter-granularity correlations, and finally model multi-channel interactions to integrate complementary information. The sparse encoding of TSDA transforms input sequences of varying lengths into fixed-length representations, enabling our model to process heterogeneous inputs directly without truncation or padding. To bridge cross-dataset label space mismatches, we design an adaptive label encoder that projects label descriptions into a unified latent space to generate label embeddings. Together, these components allow Sparseformer to transfer knowledge across datasets with varying lengths, channels, and labels, demonstrating few-shot and zero-shot transferability in clinical applications and mitigating the label scarcity challenge. Our main contributions are as follows:

  • We propose Sparseformer, a multi-granularity transformer architecture for medical time series classification. It utilizes a token-sparse dual attention mechanism to reduce information redundancy and capture multi-scale temporal patterns and inter-channel dependencies in medical signals.

  • Sparseformer can handle inputs with variable lengths and channel counts through sparse encoding, and it is equipped with an adaptive label encoder to unify label representations across datasets. To the best of our knowledge, it is the first transformer-based framework that can be trained on time series classification datasets with input-output heterogeneity, supporting cross-task zero-shot learning among diverse medical applications.

  • We conduct comprehensive supervised experiments on seven public datasets across three medical diagnosis scenarios, achieving SOTA results. Additionally, few-shot and zero-shot learning experiments demonstrate the model’s transferability and generalization across different medical datasets.

2 Related Work

Medical Time Series Classification. Medical time series data is a unique category of time series data employed in healthcare for tasks such as disease diagnosis, monitoring, and rehabilitation. This includes various forms like EEG [19], ECG [20], EMG [21], and EOG [22], each providing critical insights into different medical conditions and significantly contributing to healthcare advancements. Traditional methods, such as nearest neighbor classifiers [6], auto-regressive models [23], and Gaussian mixture models [7], offer simplicity and interpretability but face challenges when dealing with complex, high-dimensional patterns. With the advent of deep learning, models leveraging RNNs [8], CNNs [9], and GNNs [10] have dominated MedTS classification. Various RNN architectures for ECG-based biometrics are utilized in [8], achieving high accuracy for authentication tasks. EEGNet [9] employs depthwise and separable convolutions to capture EEG features for brain-computer interfaces. A self-supervised GNN approach is proposed in [10] for automated EEG seizure classification. These models show promising results in tasks with single-modality medical signals.

Transformer for Time Series. Transformers have significantly advanced time series analysis. Based on tokenization strategies, they can be divided into single-timestamp [11], [12], all-timestamp [24], and multi-timestamp approaches [15], [16], [18], with multi-timestamp methods further divided into single-granularity and multi-granularity methods. Single-timestamp tokenization struggles with coarse-grained patterns, while all-timestamp tokenization may overlook fine-grained local details. Single-granularity approaches, such as PatchTST [14] and Crossformer [15], create tokens from single-channel fixed-length sequences of timestamps. Although effective at capturing local temporal patterns, fixed patch sizes are challenged by multi-scale patterns. Methods like MTST [16] and Pathformer [17] use multi-granularity patching for diverse temporal patterns but are limited to single-channel processing, which may be less effective for classification. Medformer [18] introduces a cross-channel multi-granularity approach akin to ours, capturing low-level channel correlation by cross-channel patching, whereas our model extracts high-level channel representations after channel-wise multi-granularity encoding. Medformer also lacks a mechanism to reduce redundant medical signals, which our model addresses with token sparsification. Additionally, existing models offer very limited cross-dataset transferability, whereas our model supports direct transfer learning across heterogeneous medical datasets.

Figure 1: (a) Sparseformer consists of a time series encoder and a label encoder that map a time series and its label into a unified space for optimization. (b) The time series encoder utilizes multi-granularity encoding to capture multi-scale temporal patterns and cross-channel encoding to capture channel interactions. (c) We implement multi-granularity encoding with multiple TSDA blocks, which capture both intra-granularity and inter-granularity correlations. (d) The token-sparse dual attention block contains self-attention to model global context and token-sparse attention for feature refinement.

3 Problem Formulation

Consider a medical dataset \(\mathcal{D}=\{(X_i, y_i)\}_{i=1}^N\) where each multi-channel medical signal \(X_i \in \mathbb{R}^{L \times C}\) contains \(L\) timestamps across \(C\) channels and each class label \(y_i \in \{1, 2, \ldots, M\}\) corresponds to a text description \(T_{y_i}\), where \(M\) is the total number of classes. Our objective is to learn a framework that generates a representation of the time series input, denoted as \(H_{X_i} \in \mathbb{R}^{D}\), and of its label description, denoted as \(H_{y_i} \in \mathbb{R}^{D}\). The model is optimized to maximize the similarity between time series-label pairs \((H_{X_i}, H_{y_i})\) in the latent space.

4 Methodology

Our Sparseformer is illustrated in Figure 1. In this section, we first introduce the core component—the token-sparse dual attention block (TSDA). TSDA effectively captures global context among tokens, eliminates redundant signals and refines features by token sparsification. Next, we apply TSDA blocks on multi-granularity encoding for intra- and inter-granularity correlation extraction, and on multi-channel encoding for channel correlation integration. Built upon TSDA blocks, our model is independent of input length and channel configurations, allowing it to be trained on heterogeneous datasets. Additionally, we design an adaptive label encoder for cross-dataset label space alignment, equipping our model with zero-shot transferability in different medical applications.

4.1 Token-Sparse Dual Attention Block

Inspired by physicians’ two-stage diagnostic process—holistic symptom contextualization followed by biomarker-focused analysis [25]—we propose the Token-Sparse Dual Attention (TSDA) block to jointly enable global context modeling and dynamic feature refinement through a two-stage attention mechanism.

TSDA first employs self-attention to model global temporal dependencies, as its effectiveness in capturing long-range interactions within sequences has been validated by previous research [17], [18]. This step captures pairwise interactions across all tokens, resolving ambiguities in local patterns (e.g., distinguishing arrhythmic heartbeats from noise-induced artifacts in ECG [3]). Specifically, given an input sequence \(H \in \mathbb{R}^{L \times D}\) with \(L\) tokens, each carrying \(D\)-dimensional features, we denote the self-attention of the TSDA block as \(H_{\text{Self}}= \text{Attn}^{\text{Self}}\left(H, H, H\right)\) with \(H_{\text{Self}} \in \mathbb{R}^{L \times D}\). Inspired by Q-Former [26], which introduces a set of learnable queries to learn visual representations most relevant to the text, we design a token-sparse attention layer in TSDA to distill task-relevant patterns from noisy, high-dimensional medical data under the guidance of domain-specific prior knowledge. Specifically, this layer introduces a set of \(K\) learnable query vectors \(Q\), augmented with a domain-specific prior embedding \(E_{\text{prior}}\): \(Q_{\text{aug}}=f(Q, E_{\text{prior}})\), where \(Q_{\text{aug}} \in \mathbb{R}^{K \times D}\). Here \(f\) is the function that fuses queries and priors; concatenation is applied in our experiments. \(E_{\text{prior}}\) is generated from the dataset description by a frozen language model. These queries then attend to \(H_{\text{Self}}\) to generate a sparse token set: \[\begin{align} H_{\text{Sparse}} &= \text{Attn}^{\text{Sparse}}(Q_{\text{aug}}, H_{\text{Self}}, H_{\text{Self}}) \\ &= \operatorname{Softmax}\left( \frac{Q_{\text{aug}} \left( H_{\text{Self}} W_K^{\prime} \right)^{\top} }{ \sqrt{D} } \right) H_{\text{Self}} W_V^{\prime} \end{align}\] where \(W_K^{\prime}, W_V^{\prime} \in \mathbb{R}^{D \times D}\). The resulting \(H_{\text{Sparse}} \in \mathbb{R}^{K \times D}\) retains only \(K \ll L\) tokens, achieving computational reduction while preserving critical features and eliminating task-irrelevant information.

Note that the TSDA block transforms variable-length token sequences into fixed-length representations through its token-sparse attention mechanism. This design is input-length-agnostic: its trainable parameters depend solely on the predefined query number \(K\) and dimension \(D\) rather than the input length \(L\), enabling parameter sharing across inputs of arbitrary lengths while ensuring robust generalization and computational stability. We utilize TSDA blocks to capture multi-granularity and multi-channel correlations in the subsequent sections.
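To make the two-stage mechanism concrete, below is a minimal PyTorch sketch of a TSDA block. It assumes standard multi-head attention; the module name TSDABlock, the num_queries and prior_dim arguments, and the concatenate-then-project fusion of queries with the prior embedding are illustrative choices rather than the released implementation.

```python
import torch
import torch.nn as nn


class TSDABlock(nn.Module):
    """Token-sparse dual attention: self-attention, then K learnable queries."""

    def __init__(self, dim: int, num_queries: int, prior_dim: int, num_heads: int = 4):
        super().__init__()
        # Stage 1: self-attention over all input tokens (global context).
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: a fixed set of learnable queries attends to the contextualized tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Fuse queries with a domain prior embedding (assumed: concatenate, then project).
        self.fuse = nn.Linear(dim + prior_dim, dim)
        self.sparse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # h: (B, L, D) tokens of arbitrary length L; prior: (prior_dim,)
        h_self, _ = self.self_attn(h, h, h)                    # (B, L, D)
        b, k = h.size(0), self.queries.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)        # (B, K, D)
        prior_exp = prior.expand(b, k, -1)                     # (B, K, prior_dim)
        q_aug = self.fuse(torch.cat([q, prior_exp], dim=-1))   # (B, K, D)
        h_sparse, _ = self.sparse_attn(q_aug, h_self, h_self)  # (B, K, D)
        return h_sparse                                        # always K output tokens


# Any input length L is compressed to K tokens.
block = TSDABlock(dim=128, num_queries=32, prior_dim=768)
x = torch.randn(8, 200, 128)       # 8 sequences, 200 tokens each
prior = torch.randn(768)           # frozen-LM embedding of the dataset description
print(block(x, prior).shape)       # torch.Size([8, 32, 128])
```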

4.2 Multi-granularity Hierarchical Sparsification Encoding

4.2.1 Multi-granularity Segmentation.

In the multi-granularity encoding module, each channel of the input is processed independently to capture distinctive intra-channel features. To capture intra-channel multi-scale temporal patterns, we partition the univariate time series into multi-granularity segments using varying window sizes \(\mathcal{S}=\{S_1, S_2, \ldots, S_G\}\) with \(|\mathcal{S}|=G\). Each granularity \(S_i\) generates a sequence of non-overlapping patches \(\{p_1^{(i)}, p_2^{(i)}, \ldots\}\), where \(p_j^{(i)} \in \mathbb{R}^{S_i}\) represents the \(j\)-th patch at granularity \(i\). The number of patches is \(L_i=\lceil L / S_i\rceil\), with zero padding to ensure divisibility. These patches are mapped into a unified \(D\)-dimensional latent space by linear transformations, and \(P_i = \{\hat{p}_1^{(i)}, \hat{p}_2^{(i)}, \ldots\} \in \mathbb{R}^{L_i \times D}\) is the patch embedding sequence of granularity \(i\). The embeddings of all granularities undergo a two-stage hierarchical analysis to model both intra-granularity and inter-granularity temporal dependencies.
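A small sketch of this segmentation step, assuming non-overlapping windows with zero padding and one linear projection per granularity (the patchify helper and projection names are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def patchify(x: torch.Tensor, window: int) -> torch.Tensor:
    """x: (B, L) univariate series -> (B, ceil(L / window), window) patches."""
    b, length = x.shape
    pad = (-length) % window              # zero-pad so the length is divisible
    return F.pad(x, (0, pad)).view(b, -1, window)


granularities = [25, 50, 100, 150]        # window sizes S_1, ..., S_G
d_model = 128
projections = nn.ModuleList(nn.Linear(s, d_model) for s in granularities)

x = torch.randn(8, 6000)                  # one channel with L = 6000 timestamps
patch_embeddings = [proj(patchify(x, s))  # each entry: (B, L_i, D)
                    for s, proj in zip(granularities, projections)]
for s, p in zip(granularities, patch_embeddings):
    print(s, p.shape)                     # 25 -> (8, 240, 128), ..., 150 -> (8, 40, 128)
```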

Intra-Granularity Hierarchical Sparsification Encoding. Each granularity is first processed independently to capture granularity-specific temporal dynamics. Inspired by the hierarchical temporal modeling in Crossformer [15], we stack \(K\) TSDA blocks for intra-granularity processing to enable hierarchical feature refinement. This design mimics the human cognitive process of analyzing time series signals (e.g., ECG waveforms) by iteratively filtering noise, aggregating local patterns, and distilling global semantics. The forward process of the \(k\)-th TSDA block is formulated as follows: \[H_k=\text{TSDA}_{k}(H_{k-1}; \Theta_k, O_k)\] where \(H_{k} \in \mathbb{R}^{O_k \times D}\) is the output token sequence, \(H_{k-1}\) is the input token sequence, and \(H_{0}=P_i\). \(\Theta_k\) denotes trainable parameters and \(O_k\) is the critical hyperparameter controlling token compression \((O_k<O_{k-1})\). The output of the whole hierarchical TSDA processing is denoted as \(H^\text{Intra}= H_{K} \in \mathbb{R}^{O_K \times D}\), a granularity-wise representation for subsequent inter-granularity correlation modeling. Compared to a single TSDA block, which may struggle to effectively separate noise from informative local patterns, multiple TSDA blocks achieve progressive noise suppression and preserve hierarchical discriminative patterns with improved computational efficiency.
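Continuing the sketches above (the hypothetical TSDABlock, patch_embeddings, and prior are assumed to be in scope), the hierarchical stage can be approximated by stacking blocks with a shrinking token budget:

```python
# Continues the sketches above: TSDABlock, patch_embeddings, and prior are in scope.
token_list = [128, 64, 32]                # O_1 > O_2 > O_3, as configured in Section 5
blocks = nn.ModuleList(
    TSDABlock(dim=128, num_queries=o, prior_dim=768) for o in token_list
)

h = patch_embeddings[0]                   # (B, L_1, D) tokens of the finest granularity
for blk in blocks:
    h = blk(h, prior)                     # token count shrinks at every stage
print(h.shape)                            # torch.Size([8, 32, 128]) = (B, O_K, D)
```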

Inter-Granularity Encoding. Intra-granularity encoding has learned temporal features at different granularities, which are subsequently concatenated into a matrix, denoted as \(H_\mathcal{S}^\text{Intra}=[H_{\mathcal{S}_1}^{\text{Intra}} ; H_{\mathcal{S}_2}^{\text{Intra}} ; \cdots ; H_{\mathcal{S}_G}^{\text{Intra}}]\). A single TSDA block then models inter-granularity relationships: \[H_\mathcal{S}^\text{Inter}=\text{TSDA}(H_\mathcal{S}^\text{Intra}; \Theta, O^\text{Inter})\] where \(H_\mathcal{S}^\text{Intra} \in \mathbb{R}^{(O^{\text{Intra}} \cdot G) \times D}\) with \(O^{\text{Intra}}=O_K\), and \(H_\mathcal{S}^\text{Inter} \in \mathbb{R}^{O^{\text{Inter}} \times D}\) with \(O^{\text{Inter}} \ll O^{\text{Intra}} \cdot G\). The self-attention of the TSDA block establishes a global context among the tokens of different granularities. This operation allows the model to understand the overall structure and interconnections among the different granularity tokens without any prior assumptions about which specific granularities should interact more strongly. The token-sparse attention then compresses the information from the different granularities into a more manageable and focused representation, refining the features while reducing computation. In addition, different datasets may prefer diverse granularities, and the domain-knowledge-based learnable queries can guide the selection of optimal granularities for each dataset to enhance the model's generalization to unseen data. After inter-granularity encoding, we flatten its output \(H_\mathcal{S}^\text{Inter}\) into a vector \(h_c\) as the representation of each channel \(c\), where \(h_c \in \mathbb{R}^{D_c}\) and \(D_c=O^{\text{Inter}}\cdot D\).

4.3 Cross-Channel Encoding

After acquiring high-level channel-wise representations \(\left\{h_c\right\}_{c=1}^C\) through multi-granularity encoding, we concatenate them into a multi-channel embedding \(H_C=\left[h_1^{\top} ; h_2^{\top} ; \cdots ; h_C^{\top}\right] \in \mathbb{R}^{C \times D_c}\) and process it through a TSDA block to model inter-channel dynamics. TSDA employs a two-stage attention mechanism to progressively refine channel interactions. The first self-attention layer computes dense pairwise correlations across all channels via \[H_C^{\text{Self}} \in \mathbb{R}^{C \times D_c} \leftarrow \operatorname{Attn}^{\text{Self}}\left(H_C, H_C, H_C\right)\] establishing a global context that reveals both complementary relationships (e.g., spatially distant EEG channels synergistically capturing propagating epileptic spikes in seizure detection) and competitive redundancies (e.g., overlapping functionality among biosensors). Building upon this full-resolution context, the second token-sparse attention layer condenses the \(C\) channels into \(O^{\text{Cross}}\) predefined task-specific interaction prototypes (\(O^{\text{Cross}} < C\)) through learnable queries with domain knowledge guidance, \(Q_C \in \mathbb{R}^{O^{\text{Cross}} \times D_c}\): \[H_C^{\text{Sparse}} \in \mathbb{R}^{O^{\text{Cross}} \times D_c} \leftarrow \operatorname{Attn}^{\text{Sparse}}\left(Q_C, H_C^{\text{Self}}, H_C^{\text{Self}}\right)\] This hierarchical refinement enables the model to amplify critical channel interactions while suppressing stochastic channel noise, achieving a balance between comprehensive context modeling and decision-focused representation learning. The output is flattened and projected into a \(D\)-dimensional space to generate the final time series embedding: \(H_{X_i} =\text{MLP}\left(\text{Flatten}\left(H_C^{\text{Sparse}} \right)\right) \in \mathbb{R}^{D}\). Note that this module's trainable parameters depend solely on the output token number \(O^{\text{Cross}}\) and latent dimension \(D\) instead of the channel number \(C\), allowing deployment across heterogeneous datasets with varying channel counts (e.g., 6-channel ICU monitors vs. 12-channel wearable arrays) without architectural adaptation.
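Again reusing the hypothetical TSDABlock and prior from the earlier sketches, cross-channel encoding can be approximated as follows; the channel count and dimensions are example values:

```python
# Continues the sketches above: TSDABlock and prior are assumed in scope.
num_channels, d_c = 19, 8 * 128           # e.g. 19 EEG channels; D_c = O_Inter * D
cross_block = TSDABlock(dim=d_c, num_queries=3, prior_dim=768)   # O_Cross = 3 prototypes
head = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(3 * d_c, 128))

h_channels = torch.randn(8, num_channels, d_c)   # (B, C, D_c) channel representations
h_prototypes = cross_block(h_channels, prior)    # (B, O_Cross, D_c), independent of C
h_x = head(h_prototypes)                         # (B, D) final time series embedding
print(h_x.shape)                                 # torch.Size([8, 128])
```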

4.4 Adaptive Label Encoder

Traditional classification models rely on one-hot embeddings for label representation, struggling to adapt to heterogeneous label spaces or to generalize to unseen classes, which limits their cross-dataset transferability. Recent advances attempt to mitigate this challenge: ZeroG [27] constructs a unified cross-dataset label space via a pre-trained language model (LM) for graph classification. UniTS [28] introduces trainable CLS tokens as label embeddings to support adaptation to different time series classification tasks. Akata et al. [29] utilize attribute embeddings as priors and update label embeddings for image classification through labeled training data. Inspired by these works, we propose an adaptive label encoder designed to enhance the model's cross-dataset transferability and generalization capabilities. The encoder first embeds each label's textual description with a frozen language model; a subsequent learnable projector then dynamically refines the label embeddings, mapping them to a unified \(D\)-dimensional space shared with the time series embeddings. The formula is as follows:

\[H_{y_i} \in \mathbb{R}^{D} =W_1 \cdot (\text{ReLU}(W_2 \cdot \mathcal{F}_{\mathrm{LM}}^{\text{frozen}}(T_{y_i}) + b))\] where \(\mathcal{F}_{\mathrm{LM}}^{\text{frozen}}\) refers to a frozen language model, and \(T_{y_i}\) denotes the textual description of label \(y_i\). \(H_{y_i}\) represents the adaptive label embedding.
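A hedged sketch of this encoder is shown below; "bert-base-uncased" is only a stand-in for the frozen ClinicalBERT used in our experiments, and mean pooling over token states is an assumed pooling choice:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in for ClinicalBERT
lm = AutoModel.from_pretrained("bert-base-uncased").eval()
for p in lm.parameters():
    p.requires_grad_(False)               # the language model stays frozen

D = 128
projector = nn.Sequential(                # trainable two-layer projector (W_2, ReLU, W_1)
    nn.Linear(lm.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, D),
)

label_texts = ["Healthy person", "Myocardial Infarction"]        # label descriptions T_y
with torch.no_grad():
    batch = tokenizer(label_texts, padding=True, return_tensors="pt")
    pooled = lm(**batch).last_hidden_state.mean(dim=1)           # (M, hidden), mean-pooled

label_embeddings = projector(pooled)      # (M, D), in the space shared with H_{X_i}
print(label_embeddings.shape)             # torch.Size([2, 128])
```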

Loss Function: We utilize a cross-entropy loss for training, formulated as follows: \[\mathcal{L}(\Theta)=-\sum_{i=1}^N \log \frac{\exp \left(\operatorname{sim}\left(H_{X_i}, H_{y_i}\right)\right)}{\sum_{j=1}^M \exp \left(\operatorname{sim}\left(H_{X_i}, H_j\right)\right)}\] where \(\mathcal{L}(\Theta)\) represents the loss function parameterized by \(\Theta\) to be minimized, and \(\operatorname{sim}(\cdot)\) measures the similarity between a time series embedding and a class embedding; we employ the dot product as the similarity function. During the inference stage, the class with the highest similarity score is predicted as the label of the time series embedding, formalized as: \[y^{\prime}_{i}=\operatorname{argmax}_j\left(\operatorname{sim}\left(H_{X_i}, H_j\right) \mid j \in\{1, \ldots, M\}\right)\] where \(y^{\prime}_{i}\) is the predicted label for sample \(X_i\) and \(M\) is the number of classes.
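The objective and inference rule reduce to a standard cross-entropy over dot-product similarities; a minimal sketch (function names are illustrative):

```python
import torch
import torch.nn.functional as F


def classification_loss(h_x: torch.Tensor, label_emb: torch.Tensor,
                        targets: torch.Tensor) -> torch.Tensor:
    """h_x: (N, D) series embeddings; label_emb: (M, D); targets: (N,) class indices."""
    logits = h_x @ label_emb.t()          # (N, M) dot-product similarities
    return F.cross_entropy(logits, targets)


def predict(h_x: torch.Tensor, label_emb: torch.Tensor) -> torch.Tensor:
    return (h_x @ label_emb.t()).argmax(dim=-1)   # class with the highest similarity


# Toy usage with random embeddings.
h_x, label_emb = torch.randn(4, 128), torch.randn(3, 128)
targets = torch.tensor([0, 2, 1, 0])
print(classification_loss(h_x, label_emb, targets).item(), predict(h_x, label_emb))
```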

5 Experiments

In this section, we perform experiments in supervised learning, few-shot learning, and zero-shot learning to evaluate the proposed model’s efficacy and cross-dataset transferability.

5.1 Experimental Setup

5.1.1 Datasets.

Table 1: Statistics of datasets. #Step refers to the sequence length of each sample.
Dataset  Domain  # Sample  # Class  # Channel  # Step  Modality
APAVA (2-Classes)  Alzheimer's Disease  5967  2  16  256  EEG
ADFTD (3-Classes)  Alzheimer's Disease  69752  3  19  256  EEG
PTB (2-Classes)  Heart Disease  64356  2  15  300  ECG
PTB-XL (4-Classes)  Heart Disease  17110  4  12  1000  ECG
PTB-XL (5-Classes)  Heart Disease  17110  5  12  1000  ECG
TUSZ (2-Classes)  Epilepsy  22040  2  19  6000  EEG
TUSZ (4-Classes)  Epilepsy  2891  4  19  6000  EEG

(1) APAVA (2-Classes) [19] is a public EEG dataset for Alzheimer’s disease (AD) classification with binary labels: Healthy person and Alzheimer’s disease. (2) ADFTD (3-Classes) [19] is another public Alzheimer’s disease dataset with three classes: Frontotemporal Dementia, Healthy person, Alzheimer’s disease. (3) PTB (2-Classes) [20] is a public ECG dataset for heart disease classification, categorizing each 15-lead ECG sample as Healthy person or Myocardial Infarction. (4) PTB-XL [3] is a large-scale public 12-lead ECG dataset with two label sets for heart disease diagnosis. PTB-XL (4-Classes) includes coarse-grained labels: Normal ECG, Abnormal ECG, Borderline ECG, and Otherwise Normal ECG. PTB-XL (5-Classes) offers fine-grained labels: Normal ECG, Conduction Disturbance, Hypertrophy, Myocardial Infarction, and ST-T Changes. (5) TUSZ [2] is a large-scale EEG dataset capturing brain electrical activity across 19 channels. TUSZ (2-Classes) distinguishes between seizure and non-seizure EEG signals. TUSZ (4-Classes) further categorizes seizures into four types: Combined Focal Seizures, Generalized Non-Specific Seizures, Absence Seizures, and Combined Tonic Seizures. Table 1 provides detailed information on all datasets. Note that we follow [18] for preprocessing APAVA, ADFTD, and PTB, [30] for PTB-XL, and [10] for TUSZ.

5.1.2 Baselines.

We evaluate our model against a diverse set of baselines categorized as follows: (1) Non-Transformer Models: DLinear [31], LightTS [32], TimesNet [33]. These models employ deep learning architectures without relying on self-attention mechanisms. (2) Transformer-based Models: PatchTST [14], Autoformer [11], Crossformer [15], ETSformer [34], FEDformer [13], Informer [12], PathFormer [17], Medformer [18], MTST [16]. These models leverage the self-attention mechanism or its variants for time series analysis. Note that PathFormer, Medformer, and MTST are multi-granularity transformer-based models.

5.1.3 Implementation Details.

Following previous work [18], we adopt six evaluation metrics: accuracy and macro-averaged precision, recall, F1 score, AUROC, and AUPRC. Due to limited space, we mainly present F1 or AUROC in our analysis. For our model, the number of TSDA blocks in intra-granularity encoding is set to \(K=3\) and the hierarchical token list for these blocks is configured as \(\{O_1,\cdots,O_K\}=[128, 64, 32]\). The numbers of output tokens for inter-granularity encoding and cross-channel encoding are set to \(O^{\text{Inter}}=8\) and \(O^{\text{Cross}}=3\). The hidden dimension is \(D=128\). The multiple granularities are set to [25, 50, 100, 150]. We adopt ClinicalBERT [35] as the frozen language model. The batch size is set to 128 for the ADFTD and PTB datasets, while a batch size of 32 is used for the remaining datasets. We employ the AdamW optimizer [36] and a cosine scheduler for learning rate decay. The training process runs for up to 40 epochs, with early stopping if the F1 score on the validation set does not improve for 7 consecutive epochs. All experiments are conducted using the PyTorch framework on an NVIDIA A6000 (48GB) GPU.
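For reference, these stated hyperparameters can be collected into a single configuration; the dictionary below is illustrative, and the key names are ours rather than the actual codebase's.

```python
# Illustrative configuration collecting the hyperparameters stated above
# (key names are ours, not from the actual codebase).
config = {
    "intra_granularity_blocks": 3,             # K stacked TSDA blocks per granularity
    "hierarchical_token_list": [128, 64, 32],  # O_1, O_2, O_3
    "inter_granularity_tokens": 8,             # O_Inter
    "cross_channel_tokens": 3,                 # O_Cross
    "hidden_dim": 128,                         # D
    "granularities": [25, 50, 100, 150],       # patch window sizes
    "frozen_language_model": "ClinicalBERT",
    "optimizer": "AdamW",
    "lr_scheduler": "cosine",
    "max_epochs": 40,
    "early_stopping_patience": 7,              # epochs without validation F1 improvement
    "batch_size": {"ADFTD": 128, "PTB": 128, "default": 32},
}
```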

Table 2: Supervised Learning. Results are evaluated by F1 score. The best results are highlighted in red, while the second-best are in bold.
Model APAVA (2-Classes) ADFTD (3-Classes) TUSZ (2-Classes) TUSZ (4-Classes) PTB-XL (4-Classes) PTB-XL (5-Classes) PTB (2-Classes) Average
DLinear 0.486 0.292 0.648 0.735 0.234 0.243 0.593 0.461
LightTS 0.532 0.384 0.700 0.849 0.475 0.436 0.761 0.591
TimesNet 0.706 0.465 0.767 0.854 0.479 0.513 0.776 0.652
PatchTST 0.572 0.458 0.752 0.849 0.559 0.506 0.758 0.636
Autoformer 0.718 0.440 0.719 0.795 0.437 0.490 0.628 0.604
Crossformer 0.685 0.430 0.746 0.843 0.555 0.487 0.743 0.641
ETSformer 0.649 0.455 0.813 0.830 0.512 0.431 0.795 0.641
FEDformer 0.747 0.437 0.722 0.804 0.533 0.528 0.667 0.634
Informer 0.675 0.465 0.773 0.842 0.446 0.454 0.722 0.625
PathFormer 0.663 0.419 0.710 0.792 0.507 0.478 0.613 0.597
Medformer 0.706 0.461 0.823 0.841 0.584 0.514 0.811 0.677
MTST 0.642 0.424 0.767 0.856 0.549 0.527 0.695 0.637
Sparseformer 0.813 0.472 0.854 0.893 0.578 0.542 0.850 0.715

Figure 2: (a) Supervised learning: ranking of the average performance of all models across all datasets and metrics; lower rank numbers indicate better performance. (b) Few-shot learning: average performance of different models evaluated by F1 score and AUROC across all few-shot experiments.

5.2 Supervised Learning

We compare our model with 12 baselines on seven real-world datasets under the supervised learning setting. In Table 2, key findings include: (1) DLinear and LightTS perform poorly due to their simplified architectures, which struggle with complex temporal dependencies. (2) Transformer-based models generally surpass traditional methods, underscoring the effectiveness of self-attention mechanisms. (3) TimesNet ranks third, while Medformer is second with a 2% higher F1 score. TimesNet's numerous trainable parameters may improve representation learning, but its absence of cross-channel and multi-granularity features limits optimization, unlike Medformer. (4) Sparseformer achieves the best performance with an average F1 score of 0.715, outperforming the best baseline by about 4% on average, showcasing its robust generalization capability. Sparseformer focuses on informative multi-granularity tokens while progressively discarding redundant information, and it extracts higher-level channel interactions, enhancing performance compared to Medformer. Figure 2 (a) displays a heatmap of average ranks across all metrics, with lower ranks indicating better performance. Our method achieves the best average rank, demonstrating superior overall performance, followed by Medformer, while DLinear and LightTS rank lowest.

Table 3: 5-shot Learning. Results of F1 score and AUROC are presented. "S" refers to the source dataset while "T" refers to the target dataset.
S: TUSZ (2-Classes) → T: TUSZ (4-Classes) | S: TUSZ (4-Classes) → T: TUSZ (2-Classes) | S: PTB-XL (4-Classes) → T: PTB-XL (5-Classes) | S: PTB-XL (5-Classes) → T: PTB-XL (4-Classes)
Model F1 AUROC F1 AUROC F1 AUROC F1 AUROC
DLinear 0.134 0.432 0.201 0.498 0.142 0.485 0.182 0.460
LightTS 0.161 0.443 0.430 0.458 0.237 0.540 0.199 0.496
TimesNet 0.155 0.473 0.435 0.539 0.112 0.530 0.159 0.428
PatchTST 0.247 0.591 0.492 0.485 0.191 0.504 0.268 0.571
Autoformer 0.179 0.467 0.434 0.543 0.211 0.551 0.184 0.443
Crossformer 0.247 0.650 0.430 0.530 0.090 0.378 0.202 0.477
ETSformer 0.124 0.423 0.519 0.551 0.137 0.452 0.282 0.615
FEDformer 0.126 0.466 0.201 0.506 0.233 0.605 0.162 0.449
Informer 0.154 0.550 0.456 0.553 0.166 0.565 0.149 0.400
PathFormer 0.112 0.497 0.433 0.546 0.168 0.598 0.197 0.508
Medformer 0.117 0.448 0.433 0.441 0.221 0.561 0.292 0.648
MTST 0.216 0.647 0.420 0.469 0.210 0.587 0.272 0.611
Sparseformer 0.277 0.646 0.537 0.574 0.276 0.640 0.353 0.695

5.3 Few-shot Learning

Few-shot learning addresses the label scarcity challenge by transferring knowledge from a source domain with ample labeled data to a target domain with limited labels. In this section, we pre-train all models on the source dataset and then fine-tune them on the target dataset under {5, 10, 20, 30, 40, 50}-shot settings. Since our baselines have fixed input dimensions, direct transfer between heterogeneous datasets is infeasible. We therefore select source-target dataset pairs with the same input length and channel count, namely PTB-XL (4-Classes) with PTB-XL (5-Classes), and TUSZ (2-Classes) with TUSZ (4-Classes). After pre-training, we freeze the model backbone and train a task-specific classification head for fine-tuning. Table 3 shows the F1 score and AUROC of the four 5-shot experiments. Transformer-based models usually achieve better 5-shot performance than non-transformer models, consistent with the supervised results. Even trained on just 5-shot samples, our model surpasses the majority of baselines in most metrics, highlighting its strong performance in data-scarce scenarios. Figure 2 (b) shows the overall average results across the four few-shot experiments over all shot settings. The performance of nearly all models increases as the amount of available data increases. Our model achieves the best performance in both F1 score and AUROC at all shot counts, demonstrating its robust transferability. Additionally, when the number of shots is small, our model's margin over the second-best baseline is even larger, demonstrating its superior few-shot learning capacity.

Table 4: Zero-shot Learning. Zero-shot results are evaluated by F1 score across all datasets, with the best results highlighted in red. The columns present target (test) datasets grouped by domain, while the first three rows show the domains used for pre-training. We conduct both in-domain and cross-domain experiments; in-domain results are shaded in gray. For comparison, some few-shot and supervised learning results are provided at the bottom. "N/A" indicates an unavailable result.
Test: Alzheimer's Disease (APAVA (2-Classes), ADFTD (3-Classes)); Epilepsy (TUSZ (2-Classes), TUSZ (4-Classes)); Heart Disease (PTB-XL (4-Classes), PTB-XL (5-Classes), PTB (2-Classes))
Pre-training: Alzheimer's Disease 0.205 0.444 0.121 0.195 0.181 0.432
Pre-training: Epilepsy 0.387 0.243 0.210 0.128 0.161 0.393
Pre-training: Heart Disease 0.373 0.264 0.487 0.076 0.151
DLinear (5-shot) N/A N/A 0.201 0.134 0.182 0.142 N/A
DLinear (50-shot) N/A N/A 0.423 0.178 0.184 0.153 N/A
Sparseformer (5-shot) N/A N/A 0.537 0.277 0.353 0.276 N/A
DLinear (supervised) 0.486 0.292 0.648 0.735 0.234 0.243 0.593

5.4 Zero-shot Learning

To assess the zero-shot transferability of our model, we further conduct in-domain and cross-domain experiments. In-domain experiments involve transferring knowledge between datasets within the same domain, with results shaded in gray in Table 4. For instance, when the target dataset is PTB (2-Classes), we use the remainder of the domain "Heart Disease" as the source datasets, namely PTB-XL (4-Classes) and PTB-XL (5-Classes). Cross-domain experiments, on the other hand, pre-train our model on all datasets of the source domain and evaluate it on a single dataset from the target domain. Since our baselines do not have zero-shot capability, we include some results from few-shot and supervised learning approaches at the bottom of the table for comparison. Table 4 indicates that four of the per-dataset best results are observed in in-domain experiments, while three are obtained in cross-domain experiments. This suggests that in-domain transfer exhibits stronger zero-shot performance than cross-domain transfer. Note that the zero-shot performance of our model is better than that of DLinear under 50 shots. In addition, the in-domain zero-shot performance of our model on APAVA (2-Classes) and PTB-XL (5-Classes) outperforms DLinear under supervised learning, which may be attributed to the large scale of pre-training data capturing a broader range of temporal patterns.

Table 5: Ablation Study. Results are evaluated by F1 score across all datasets.
Model APAVA (2-Classes) ADFTD (3-Classes) TUSZ (2-Classes) TUSZ (4-Classes) PTB-XL (4-Classes) PTB-XL (5-Classes) PTB (2-Classes)
W/O Multi-Granularity 0.785 0.452 0.803 0.853 0.557 0.514 0.813
W/O Channel Attention 0.782 0.460 0.816 0.868 0.562 0.535 0.835
W/O Label Encoder 0.801 0.467 0.845 0.881 0.579 0.540 0.846
Sparseformer 0.813 0.472 0.854 0.893 0.578 0.542 0.850

5.5 More Experiments

Ablation Study. To evaluate the impact of critical modules in our model, we conduct ablation studies across three configurations. "W/O Multi-Granularity" indicates that we use the single granularity {25}. "W/O Channel Attention" means that we replace cross-channel encoding by simply concatenating all channel representations. "W/O Label Encoder" means that we use one-hot encoding for the ground truth. In Table 5, the influence of multi-granularity is significant; omitting it leads to a marked decrease in F1, highlighting the advantage of multi-scale modeling. Channel attention is the next most influential component; compared with the ECG datasets, it is more beneficial for the EEG datasets. The label encoder boosts model performance by about 1%, likely due to its adaptive and fine-grained label embeddings compared to one-hot embeddings. These results underscore the efficacy of our proposed mechanisms.

Figure 3: (a) Efficiency analysis: comparison of representative models on APAVA ("M" denotes million; bubble size represents model parameter size). (b) Sensitivity analysis: impact of two critical hyperparameters on the APAVA dataset.

Efficiency Analysis. In Figure 3, we compare the efficiency of our model against representative baselines, considering the time required to train one epoch, the F1 score, and the model size. Due to space constraints, we present results on the APAVA dataset. Autoformer is the fastest among the models, yet it ranks third in terms of performance. Sparseformer sacrifices some training speed compared to Autoformer but achieves a significantly higher F1 score. Medformer is the smallest model in terms of size, but its performance ranks second worst. Sparseformer has 8.4 million parameters, more than MTST (7.5 million) but fewer than TimesNet (12.3 million). Overall, Sparseformer presents a balanced option in the trade-off between efficiency and effectiveness. Its relatively high performance with a reasonable parameter count and training time makes it a viable choice.

Sensitivity Analysis. To evaluate the impact of two key hyperparameters, the multi-granularity setting and the hierarchical token list, we present results on APAVA (2-Classes) in Figure 3. For multi-granularity, \([25, 50, 100, 150]\) yields the best performance, achieving an F1 of 0.8133, while the single granularity \([25]\) results in the lowest F1. This highlights the importance of incorporating multiple granularities to capture multi-scale patterns in medical data. As for the hierarchical token list, the results indicate that three TSDA blocks \([128, 64, 32]\) form the most effective configuration, achieving the highest F1 score and balancing complexity and performance. Adding more blocks \([128, 64, 32, 16]\) or reducing blocks \([128, 64]\) tends to decrease the model's performance, suggesting that an optimal hierarchy depth is crucial for the model's success.

6 Conclusion

In this paper, we propose Sparseformer, a novel transformer-based model tailored for medical time series classification. By integrating the sparse token-based dual-attention mechanism and multi-granularity cross-channel encoding, Sparseformer dynamically captures critical medical patterns. The sparse encoding together with an adaptive label encoder enables Sparseformer to process heterogeneous datasets with cross-dataset transferability. Experiments validate its superiority over existing methods in supervised and few-shot learning settings. These findings underscore the model’s robustness, adaptability, and potential to improve diagnostic performance in diverse medical contexts.

References

[1]
Wang, W.K., Chen, I., Hershkovich, L., Yang, J., Shetty, A., Singh, G., Jiang, Y., Kotla, A., Shang, J.Z., Yerrabelli, R., Roghanizad, A.R., Shandhi, M.M.H., Dunn, J.: A systematic review of time series classification techniques used in biomedical applications. Sensors 22(20),  8016 (2022).
[2]
Shah, V., Von Weltin, E., Lopez, S., McHugh, J.R., Veloso, L., Golmohammadi, M., Obeid, I., Picone, J.: The Temple University Hospital seizure detection corpus. Frontiers in Neuroinformatics 12, 83 (2018).
[3]
Wagner, P., Strodthoff, N., Bousseljot, R.D., Kreiseler, D., Lunze, F.I., Samek, W., Schaeffter, T.: PTB-XL, a large publicly available electrocardiography dataset. Scientific Data 7(1), 1–15 (2020).
[4]
12-lead ECG signal classification for detecting ECG arrhythmia via an information bottleneck-based multi-scale network. Information Sciences 662, 120239 (2024).
[5]
Yang, F., Li, X., Wang, B., Zhang, T., Yu, X., Yi, X., Zhu, R.: Mmseg: A novel multi-task learning framework for class imbalance and label scarcity in medical image segmentation. Knowl. Based Syst. 309, 112835 (2025).
[6]
Lines, J., Bagnall, A.: Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery p. 565–592 (May 2015).
[7]
Vincent, T., Risser, L., Ciuciu, P.: Spatially adaptive mixture modeling for analysis of fmri time series. NeuroImage 47,  S167 (Jul 2009).
[8]
Salloum, R., Kuo, C.C.J.: ECG-based biometrics using recurrent neural networks. In: ICASSP. pp. 2062–2066 (Mar 2017).
[9]
Lawhern, V.J., Solon, A.J., Waytowich, N.R., Gordon, S.M., Hung, C.P., Lance, B.J.: EEGNet: A compact convolutional network for EEG-based brain-computer interfaces. Journal of Neural Engineering p. 056013 (Oct 2018).
[10]
Tang, S., Dunnmon, J.A., Saab, K., Zhang, X., Huang, Q., Dubost, F., Rubin, D., Lee-Messer, C.: Self-supervised graph neural networks for improved electroencephalographic seizure analysis. In: ICLR (2021).
[11]
Wu, H., Xu, J., Wang, J., Long, M.: Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. NeurIPS 34, 22419–22430 (2021).
[12]
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Liu, W.: Informer: Beyond efficient transformer for long sequence time-series forecasting. In: AAAI. vol. 35, pp. 11106–11115 (2021).
[13]
Zhou, T., Ma, Z., Wen, Q., Yi, X., Sun, L.: Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In: ICML. pp. 27268–27286 (2022).
[14]
Nie, Y., Nguyen, N.H., Sinthong, P., Kalagnanam, J.: A time series is worth 64 words: Long-term forecasting with transformers. In: ICLR (2023).
[15]
Zhang, Y., Yan, J.: Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In: ICLR (2023).
[16]
Zhang, Y., Ma, L., Pal, S., Zhang, Y., Coates, M.: Multi-resolution time-series transformer for long-term forecasting. In: International Conference on Artificial Intelligence and Statistics. pp. 4222–4230. PMLR (2024).
[17]
Chen, P., ZHANG, Y., Cheng, Y., Shu, Y., Wang, Y., Wen, Q., Yang, B., Guo, C.: Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting. In: ICLR (2024).
[18]
Wang, Y., Huang, N., Li, T., Yan, Y., Zhang, X.: Medformer: A multi-granularity patching transformer for medical time-series classification. NeurIPS (2024).
[19]
Escudero, J., Abásolo, D., Hornero, R., Espino, P., López, M.: Analysis of electroencephalograms in Alzheimer's disease patients with multiscale entropy. Physiological Measurement 27(11), 1091 (2006).
[20]
Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000).
[21]
Xiong, D., Zhang, D., Zhao, X., Zhao, Y.: Deep learning for emg-based human-machine interaction: A review. IEEE/CAA Journal of Automatica Sinica p. 512–533 (Mar 2021).
[22]
Fan, J., Sun, C., Long, M., Chen, C., Chen, W.: EOGNet: A novel deep learning model for sleep stage classification based on single-channel EOG signal. Frontiers in Neuroscience (Jul 2021).
[23]
Schaffer, A.L., Dobbins, T.A., Pearson, S.A.: Interrupted time series analysis using autoregressive integrated moving average (arima) models: a guide for evaluating large-scale health interventions. BMC Medical Research Methodology (Dec 2021).
[24]
Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., Long, M.: iTransformer: Inverted transformers are effective for time series forecasting. In: ICLR (2024).
[25]
Jussupow, E., Spohrer, K., Heinzl, A., Gawlitza, J.: Augmenting medical diagnosis decisions? an investigation into physicians’ decision-making process with artificial intelligence. Inf. Syst. Res. 32(3), 713–735 (2021).
[26]
Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML. vol. 202, pp. 19730–19742 (2023).
[27]
Li, Y., Wang, P., Li, Z., Yu, J.X., Li, J.: Zerog: Investigating cross-dataset zero-shot transferability in graphs. KDD (2024).
[28]
Gao, S., Koker, T., Queen, O., Hartvigsen, T., Tsiligkaridis, T., Zitnik, M.: Units: A unified multi-task time series model. NeurIPS (2024).
[29]
Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(7), 1425–1438 (2016).
[30]
Zhang, W., Ye, J., Li, Z., Li, J., Tsung, F.: Dualtime: A dual-adapter multimodal language model for time series representation. arXiv preprint arXiv:2406.06620 (2024).
[31]
Zeng, A., Chen, M., Zhang, L., Xu, Q.: Are transformers effective for time series forecasting? In: AAAI. vol. 37, pp. 11121–11128 (2023).
[32]
Zhang, T., Zhang, Y., Cao, W., Bian, J., Yi, X., Zheng, S., Li, J.: Less is more: Fast multivariate time series forecasting with light sampling-oriented mlp structures. arXiv preprint arXiv:2207.01186 (2022).
[33]
Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M.: TimesNet: Temporal 2D-variation modeling for general time series analysis. In: ICLR (2023).
[34]
Woo, S., Qin, Y., Arik, S.O., Pfister, T.: ETSformer: Exponential smoothing transformers for time-series forecasting. In: NeurIPS. vol. 35, pp. 22898–22909 (2022).
[35]
Wang, G., Liu, X., Ying, Z., Yang, G., Chen, Z., Liu, Z., Zhang, M., Yan, H., Lu, Y., Gao, Y., et al.: Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof-of-concept trial. Nature Medicine 29(10), 2633–2642 (2023).
[36]
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2017).