October 23, 2025
Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among the developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet
Segmentation, Deep Learning, Medical Image Analysis, Convolutional Neural Network, Vision Transformer, Computational Pathology
Digital histopathology plays a crucial role in computer-aided cancer diagnosis by providing valuable information about tumor features, including the tumor microenvironment [1]. In histopathological image analysis, tissue segmentation is one of the fundamental steps supporting various downstream analysis tasks. While this can be performed manually, the gigapixel resolution of whole-slide images and the need for expert knowledge make the process challenging and time-consuming. Therefore, automated computer-aided segmentation methods represent a promising alternative. In particular, deep learning–based automatic segmentation methods have emerged as powerful tools for tissue segmentation in digital pathology [2]. State-of-the-art approaches are primarily built upon convolutional neural networks (CNNs) [3], vision transformers (ViTs) [4], or hybrid architectures that combine the strengths of CNNs and ViTs [5]. CNN-based models are effective in capturing local spatial features, while ViTs excel at capturing long-range contextual relationships. Hybrid models combine both architectures to deliver improved segmentation performance.
To combine the local feature extraction ability of CNNs with the global modeling capacity of ViTs, a few hybrid architectures that integrate components from both paradigms have been proposed in computational pathology. In these models, the integration of CNN and ViT modules typically occurs either in the encoder or through skip connections [5]–[8]. A representative example is TransUNet [6], which employs a serial CNN–ViT encoder where CNN layers are followed by ViT blocks. Another widely adopted strategy is the use of parallel CNN–ViT encoders [7], [9], [10], where two encoders process the input in parallel and are subsequently merged using various fusion schemes. Despite the progress made by these hybrid approaches, designing architectures that fully exploit the complementary strengths of CNNs and ViTs remains an open challenge, particularly for histopathology tissue segmentation, where both local features and long-range dependencies are essential.
Inspired by [11], where improved tissue segmentation performance was demonstrated through ensembling two separate models, Res-UNet [12] and SegFormer [13], we propose an Attention-based CNN–SegFormer Segmentation Network (ACS-SegNet). Unlike ensemble approaches, ACS-SegNet adopts a dual-encoder design that integrates a ResNet encoder (CNN family) with a SegFormer encoder (ViT family) into a unified architecture. To maximize feature representation, the outputs of the two encoders are fused at multiple levels using the Convolutional Block Attention Module (CBAM) [14], enabling the network to capture the most informative global and local features. Applied to two publicly available datasets, GCPS [15] and PUMA [16], we demonstrate improved semantic segmentation performance of the proposed model compared to state-of-the-art algorithms.
A block diagram of the proposed model is shown in Figure 1. The model consists of two parallel encoders: a SegFormer encoder and a ResNet encoder. Features from the four stages of the SegFormer encoder are first upsampled and concatenated (C. blocks in Figure 1) with the corresponding ResNet encoder features, and then fused through the CBAM modules. The fourth block of the ResNet encoder serves as the bottleneck. The decoder is a traditional UNet decoder with five stages and four skip connections. The SegFormer encoder is responsible for capturing long-range dependencies, while the ResNet encoder focuses on local features. The CBAM module evaluates the importance of the feature maps and passes the weighted representations to the decoder at various levels. In our setup, the SegFormer encoder, ResNet encoder, UNet decoder, and CBAM modules are standard implementations from the respective studies [13], [14], [17], [18].
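To make the fusion step concrete, the following is a minimal PyTorch sketch of the concatenate-then-CBAM fusion applied at one encoder level. The module and variable names (`FusionCBAM`, `cnn_feat`, `vit_feat`) and the 1\(\times\)1 projection are illustrative assumptions rather than the authors' exact implementation; the channel and spatial attention follow the standard CBAM formulation [14].

```python
# Illustrative sketch, not the authors' code: fuse one ResNet feature map with the
# corresponding (upsampled) SegFormer feature map via concatenation followed by CBAM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Shared MLP over average- and max-pooled channel descriptors
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        weights = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * weights


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise average and max maps, convolved into a spatial attention map
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights


class FusionCBAM(nn.Module):
    """Upsample the ViT features to the CNN resolution, concatenate, and apply CBAM."""

    def __init__(self, cnn_channels: int, vit_channels: int, out_channels: int):
        super().__init__()
        self.project = nn.Conv2d(cnn_channels + vit_channels, out_channels, kernel_size=1)
        self.channel_att = ChannelAttention(out_channels)
        self.spatial_att = SpatialAttention()

    def forward(self, cnn_feat, vit_feat):
        vit_feat = F.interpolate(vit_feat, size=cnn_feat.shape[2:],
                                 mode="bilinear", align_corners=False)
        fused = self.project(torch.cat([cnn_feat, vit_feat], dim=1))
        return self.spatial_att(self.channel_att(fused))
```

In the architecture described above, one such fusion block would feed each of the four skip connections of the UNet decoder.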
To train and evaluate the performance of the proposed method, we used the GCPS and PUMA datasets.
The GCPS dataset contains 6,287 gastric cancer image patches, each captured at 100\(\times\) magnification and sized 512\(\times\)512 pixels. This dataset includes annotations for two tissue types (binary segmentation): cancerous and non-cancerous. For a fair comparison and similar to the original paper [15], we resized all images to 256\(\times\)256 pixels in our experiments. Further details about the GCPS dataset are provided in [15].
The PUMA dataset contains 206 melanoma training image patches, each 1024\(\times\)1024 pixels in size. The dataset includes annotations for six tissue types (multi-class segmentation): tumor, stroma, epidermis, necrosis, blood vessel, and background. In our experiments, all training images were used except those associated with the necrosis class, as this class exhibited a strong imbalance, resulting in a total of 197 images. We excluded the official validation and test sets of the PUMA dataset, as the annotations for those images are not publicly available. To accommodate the large model size of DGAUNet [15], a state-of-the-art segmentation model used for comparison, all images were resized to 512\(\times\)512 pixels owing to GPU memory limitations (NVIDIA RTX 4090). Further details about the PUMA dataset are provided in [16].
For training, we used standard shape-based transformations (flip, rotation, scaling) and intensity-based augmentations (hue, saturation, brightness, contrast) to artificially increase the dataset size.
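The exact augmentation parameters are not reported here; the snippet below shows one possible realization of the listed transformations using the Albumentations library, with illustrative probabilities and parameter ranges.

```python
# Assumed augmentation pipeline (Albumentations); parameter values are illustrative.
import albumentations as A

train_transform = A.Compose([
    # shape-based transformations: flip, rotation, scaling
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.0, scale_limit=0.1, rotate_limit=45, p=0.5),
    # intensity-based augmentations: hue, saturation, brightness, contrast
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=0, p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
])

# Applied jointly to an image/mask pair (NumPy arrays):
# augmented = train_transform(image=image, mask=mask)
# image_aug, mask_aug = augmented["image"], augmented["mask"]
```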
We compared the results of our proposed approach with two state-of-the-art architectures, TransUNet [6] and DGAUNet [15], as well as the baseline SegFormer [13] and ResNet-UNet [17] models. TransUNet employs a hybrid CNN-ViT serial encoder and DGAUNet uses two parallel CNN-based encoders. We also performed an ablation study by replacing the CBAM modules with simple concatenation (CS-SegNet) to investigate the effectiveness of the attention-based CNN–ViT feature fusion in our proposed approach. All models, except for DGAUNet, were initialized with ImageNet-pretrained weights. For SegFormer and ResNet-UNet (both as baseline models and within our proposed approach), we utilized the SegFormer-B2 and ResNet34 encoder variants, while for TransUNet, the ResNet50-B16 configuration was used.
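For illustration, baselines with ImageNet-pretrained encoders can be instantiated as sketched below; the library choices (segmentation_models_pytorch and Hugging Face transformers) and the checkpoint name are assumptions, as the paper does not specify its tooling.

```python
# Assumed instantiation of the two baselines; libraries and checkpoint are not from the paper.
import segmentation_models_pytorch as smp
from transformers import SegformerForSemanticSegmentation

NUM_CLASSES = 2  # e.g., the binary GCPS task (cancerous vs. non-cancerous)

# ResNet34-UNet baseline with an ImageNet-pretrained encoder
resnet_unet = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=NUM_CLASSES,
)

# SegFormer-B2 baseline: the MiT-B2 encoder is ImageNet-pretrained,
# while the lightweight decode head is randomly initialized.
segformer = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b2",
    num_labels=NUM_CLASSES,
)
```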
In our experiments, we split the datasets into three folds and reported the mean and standard deviation for each model. To compare the performance of the studied methods, we employed two metrics: the micro Intersection over Union (μIoU) and the micro Dice coefficient (μDice). \[\mu IoU = \frac{\mu TP}{\mu TP + \mu FP + \mu FN}\] \[\mu Dice = \frac{2\mu TP}{2\mu TP + \mu FP + \mu FN}\] where μTP, μFP, and μFN denote the micro-averaged true positives, false positives, and false negatives, respectively. For the micro metrics, instead of computing the metric for each validation image and then averaging the results across all images, the entire validation dataset is treated as a single large image, as proposed in [16]. In this approach, the TP, FP, and FN values are accumulated as follows: \[\mu f = \sum_{n=1}^{N} f_n , \quad \text{for } f \in \{ \text{TP}, \text{FP}, \text{FN} \}\] where \(N\) is the number of images in the validation set and \(f_n\) is the value of \(f\) for the \(n\)-th image. We first compute the micro metric for each class and then average across all classes.
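A minimal NumPy sketch of this micro-averaged evaluation is given below: TP, FP, and FN are accumulated over the entire validation set for each class before the per-class scores are averaged. Function and variable names are illustrative.

```python
# Sketch of micro-averaged IoU/Dice over a validation fold (names are illustrative).
import numpy as np

def micro_iou_dice(predictions, targets, num_classes):
    """predictions, targets: iterables of integer label maps of shape (H, W)."""
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for pred, target in zip(predictions, targets):
        for c in range(num_classes):
            p, t = (pred == c), (target == c)
            tp[c] += np.logical_and(p, t).sum()
            fp[c] += np.logical_and(p, ~t).sum()
            fn[c] += np.logical_and(~p, t).sum()
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou.mean(), dice.mean()
```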
| Method | μIoU (%) | μDice (%) |
|---|---|---|
| DGAUNet [15] | \(75.95 \pm 0.20\) | \(86.33 \pm 0.13\) |
| SegFormer [13] | \(70.90 \pm 0.38\) | \(82.97 \pm 0.26\) |
| ResNetUNet [17] | \(75.65 \pm 0.08\) | \(86.13 \pm 0.05\) |
| TransUNet [6] | \(74.84 \pm 0.10\) | \(85.61 \pm 0.09\) |
| CS-SegNet | \(76.68 \pm 0.15\) | \(86.80 \pm 0.06\) |
| ACS-SegNet | \(\boldsymbol{76.79} \pm \boldsymbol{0.14}\) | \(\boldsymbol{86.87} \pm \boldsymbol{0.09}\) |
The segmentation results on the GCPS dataset are presented in Table 1. As shown, the proposed method achieves the best performance, with a μIoU of 76.79% and a μDice score of 86.87%. The second-best performance is obtained by our dual-encoder model without the CBAM attention module (CS-SegNet), which confirms the positive effect of incorporating CBAM in the proposed model. Among the other models, DGAUNet demonstrates the best performance, whereas ResNetUNet and TransUNet achieve comparable results, and SegFormer shows the lowest performance.
| Method | μIoU (%) | μDice (%) |
|---|---|---|
| DGAUNet [15] | \(44.35 \pm 1.76\) | \(53.69 \pm 1.91\) |
| SegFormer [13] | \(46.78 \pm 2.58\) | \(58.25 \pm 2.01\) |
| ResNetUNet [17] | \(58.42 \pm 1.98\) | \(71.58 \pm 1.21\) |
| TransUNet [6] | \(62.43 \pm 2.47\) | \(74.63 \pm 1.91\) |
| CS-SegNet | \(63.67 \pm 1.23\) | \(75.55 \pm 1.71\) |
| ACS-SegNet | \(\boldsymbol{64.93} \pm \boldsymbol{2.28}\) | \(\boldsymbol{76.60} \pm \boldsymbol{1.36}\) |
Table 2 presents the segmentation results for the PUMA dataset. The proposed model with the CBAM attention module achieves the best performance, with a μIoU of 64.93% and a μDice score of 76.60%. The second-best performance is obtained by the proposed model without the CBAM module (CS-SegNet). Similar to the GCPS experiment, the performance of ResNetUNet is close to that of TransUNet, while SegFormer performs considerably worse. A noticeable difference compared to the GCPS dataset is the performance drop of DGAUNet, which fails to perform well on this small-scale, multi-class segmentation task.
It should be noted that, in general, annotating histological images is a challenging and time-consuming task, and many annotated datasets are therefore small. Consequently, models should be able to train and perform well on small-scale datasets. The experimental results indicate that the proposed model is an effective solution for both small- and large-scale datasets. On the GCPS dataset, which is approximately 30 times larger than PUMA, CNN-based models such as DGAUNet perform well, but they fail to achieve comparable results on the small-scale PUMA dataset.
However, the proposed model, which fuses features from both CNN and ViT families, achieves superior results across both datasets. For the GCPS dataset, the μIoU values of DGAUNet (the second-best model for GCPS) across three folds are 75.74%, 75.87%, and 76.24%, while those of the proposed method are 76.89%, 76.62%, and 76.78%. These results demonstrate that the proposed method consistently achieves better performance across all folds. For the PUMA dataset—a smaller dataset involving a multi-class segmentation task—the advantage of the dual-encoder architecture becomes more evident. In this case, ACS-SegNet and TransUNet, both of which incorporate ViT components, perform better than purely CNN-based models. The μIoU values of TransUNet (the second-best model for PUMA) across three folds are 64.03%, 64.34%, and 58.93%, while those of the proposed method are 67.92%, 64.51%, and 62.37%. These results further confirm that the proposed method consistently outperforms competing models across all folds.
The number of trainable parameters for each model is summarized in Table 3. As shown, ACS-SegNet achieves superior performance with roughly one-fifth the parameter count of DGAUNet and about half that of TransUNet.
It is also worth emphasizing that the superior performance of the proposed model was achieved on two different cancer types, gastric cancer and melanoma, which differ considerably in their histological patterns.
| Model | Trainable Parameters (approx.) |
|---|---|
| DGAUNet [15] | 237 million |
| SegFormer [13] | 27 million |
| ResNetUNet [17] | 24 million |
| TransUNet [6] | 105 million |
| CS-SegNet | 50 million |
| ACS-SegNet | 50 million |
Finally, for qualitative analysis, we show two sample results from the GCPS (top) and PUMA (bottom) datasets in Figure 2. As the qualitative results demonstrate, our predicted segmentations closely match the ground-truth masks in both examples.
This paper introduced ACS-SegNet, an attention-based dual-encoder CNN–ViT model designed for accurate tissue segmentation in histopathology images. The architecture integrates SegFormer and ResNet encoders to effectively capture both global and local features. We evaluated the model on two publicly available histopathology datasets, GCPS and PUMA, and benchmarked its performance against state-of-the-art methods. Experimental results demonstrate that ACS-SegNet consistently outperforms the other approaches and achieves excellent segmentation performance on both datasets.
This research study was conducted retrospectively using human subject data made available in open access by [15], [16]. Ethical approval was not required as confirmed by the license attached with the open access data.
This work was supported by the Vienna Science and Technology Fund (WWTF) and by the State of Lower Austria [Grant ID: 10.47379/LS23006].