October 23, 2025
Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among the developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet
Segmentation, Deep Learning, Medical Image Analysis, Convolutional Neural Network, Vision Transformer, Computational Pathology
Digital histopathology plays a crucial role in computer-aided cancer diagnosis by providing valuable information about tumor features, including the tumor microenvironment [1]. In histopathological image analysis, tissue segmentation is one of the fundamental steps supporting various downstream analysis tasks. While this can be performed manually, the gigapixel resolution of whole-slide images and the need for expert knowledge make the process challenging and time-consuming. Therefore, automated computer-aided segmentation methods represent a promising alternative. In particular, deep learning–based automatic segmentation methods have emerged as powerful tools for tissue segmentation in digital pathology [2]. State-of-the-art approaches are primarily built upon convolutional neural networks (CNNs) [3], vision transformers (ViTs) [4], or hybrid architectures that combine the strengths of CNNs and ViTs [5]. CNN-based models are effective in capturing local spatial features, while ViTs excel at capturing long-range contextual relationships. Hybrid models combine both architectures to deliver improved segmentation performance.
To combine the local feature extraction ability of CNNs with the global modeling capacity of ViTs, a few hybrid architectures that integrate components from both paradigms have been proposed in computational pathology. In these models, the integration of CNN and ViT modules typically occurs either in the encoder or through skip connections [5]–[8]. A representative example is TransUNet [6], which employs a serial CNN–ViT encoder where CNN layers are followed by ViT blocks. Another widely adopted strategy is the use of parallel CNN–ViT encoders [7], [9], [10], where two encoders process the input in parallel and are subsequently merged using various fusion schemes. Despite the progress made by these hybrid approaches, designing architectures that fully exploit the complementary strengths of CNNs and ViTs remains an open challenge, particularly for histopathology tissue segmentation, where both local features and long-range dependencies are essential.
Inspired by [11], where improved tissue segmentation performance was demonstrated through ensembling two separate models, Res-UNet [12] and SegFormer [13], we propose an Attention-based CNN–SegFormer Segmentation Network (ACS-SegNet). Unlike ensemble approaches, ACS-SegNet adopts a dual-encoder design that integrates a ResNet encoder (CNN family) with a SegFormer encoder (ViT family) into a unified architecture. To maximize feature representation, the outputs of the two encoders are fused at multiple levels using the Convolutional Block Attention Module (CBAM) [14], enabling the network to capture the most informative global and local features. Applied to two publicly available datasets, GCPS [15] and PUMA [16], we demonstrate improved semantic segmentation performance of the proposed model compared to state-of-the-art algorithms.
A block diagram of the proposed model is shown in Figure 1. The model consists of two parallel encoders: a SegFormer encoder and a ResNet encoder. Features from the four stages of the SegFormer encoder are first upsampled and concatenated (C. blocks in Figure 1) with the corresponding ResNet encoder features, and then fused through the CBAM modules. The fourth block of the ResNet encoder serves as the bottleneck. The decoder is a traditional UNet decoder with five stages and four skip connections. The SegFormer encoder is responsible for capturing long-range dependencies, while the ResNet encoder focuses on local features. The CBAM module evaluates the importance of the feature maps and passes the weighted representations to the decoder at various levels. In our setup, the SegFormer encoder, ResNet encoder, UNet decoder, and CBAM modules are standard implementations from the respective studies [13], [14], [17], [18].
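To make the fusion step concrete, the following is a minimal PyTorch sketch of the concatenate-then-CBAM fusion applied at one encoder level. The module and variable names (`FusionCBAM`, `cnn_feat`, `vit_feat`) and the 1\(\times\)1 projection are illustrative assumptions rather than the authors' exact implementation; the channel and spatial attention follow the standard CBAM formulation [14].

```python
# Illustrative sketch, not the authors' code: fuse one ResNet feature map with the
# corresponding (upsampled) SegFormer feature map via concatenation followed by CBAM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Shared MLP over average- and max-pooled channel descriptors
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        weights = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * weights


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise average and max maps, convolved into a spatial attention map
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights


class FusionCBAM(nn.Module):
    """Upsample the ViT features to the CNN resolution, concatenate, and apply CBAM."""

    def __init__(self, cnn_channels: int, vit_channels: int, out_channels: int):
        super().__init__()
        self.project = nn.Conv2d(cnn_channels + vit_channels, out_channels, kernel_size=1)
        self.channel_att = ChannelAttention(out_channels)
        self.spatial_att = SpatialAttention()

    def forward(self, cnn_feat, vit_feat):
        vit_feat = F.interpolate(vit_feat, size=cnn_feat.shape[2:],
                                 mode="bilinear", align_corners=False)
        fused = self.project(torch.cat([cnn_feat, vit_feat], dim=1))
        return self.spatial_att(self.channel_att(fused))
```

In the architecture described above, one such fusion block would feed each of the four skip connections of the UNet decoder.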
To train and evaluate the performance of the proposed method, we used the GCPS and PUMA datasets.
The GCPS dataset contains 6,287 gastric cancer image patches, each captured at 100\(\times\) magnification and sized 512\(\times\)512 pixels. This dataset includes annotations for two tissue types (binary segmentation): cancerous and non-cancerous. For a fair comparison and similar to the original paper [15], we resized all images to 256\(\times\)256 pixels in our experiments. Further details about the GCPS dataset are provided in [15].
The PUMA dataset contains 206 melanoma training image patches, each 1024\(\times\)1024 pixels in size. The dataset includes annotations for six tissue types (multi-class segmentation): tumor, stroma, epidermis, necrosis, blood vessel, and background. In our experiments, all training images were used except those associated with the necrosis class, as this class exhibited a strong imbalance, resulting in a total of 197 images. We excluded the official validation and test sets of the PUMA dataset, as the annotations for those images are not publicly available. To accommodate the large model size of DGAUNet [15], a state-of-the-art segmentation model used for comparison, all images were resized to 512\(\times\)512 pixels owing to GPU memory limitations (NVIDIA RTX 4090). Further details about the PUMA dataset are provided in [16].
For training, we used standard shape-based transformations (flip, rotation, scaling) and intensity-based augmentations (hue, saturation, brightness, contrast) to artificially increase the dataset size.
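The exact augmentation parameters are not reported here; the snippet below shows one possible realization of the listed transformations using the Albumentations library, with illustrative probabilities and parameter ranges.

```python
# Assumed augmentation pipeline (Albumentations); parameter values are illustrative.
import albumentations as A

train_transform = A.Compose([
    # shape-based transformations: flip, rotation, scaling
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.0, scale_limit=0.1, rotate_limit=45, p=0.5),
    # intensity-based augmentations: hue, saturation, brightness, contrast
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=0, p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
])

# Applied jointly to an image/mask pair (NumPy arrays):
# augmented = train_transform(image=image, mask=mask)
# image_aug, mask_aug = augmented["image"], augmented["mask"]
```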
We compared the results of our proposed approach with two state-of-the-art architectures, TransUNet [6] and DGAUNet [15], as well as the baseline SegFormer [13] and ResNet-UNet [17] models. TransUNet employs a hybrid CNN-ViT serial encoder and DGAUNet uses two parallel CNN-based encoders. We also performed an ablation study by replacing the CBAM modules with simple concatenation (CS-SegNet) to investigate the effectiveness of the attention-based CNN–ViT feature fusion in our proposed approach. All models, except for DGAUNet, were initialized with ImageNet-pretrained weights. For SegFormer and ResNet-UNet (both as baseline models and within our proposed approach), we utilized the SegFormer-B2 and ResNet34 encoder variants, while for TransUNet, the ResNet50-B16 configuration was used.
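For illustration, baselines with ImageNet-pretrained encoders can be instantiated as sketched below; the library choices (segmentation_models_pytorch and Hugging Face transformers) and the checkpoint name are assumptions, as the paper does not specify its tooling.

```python
# Assumed instantiation of the two baselines; libraries and checkpoint are not from the paper.
import segmentation_models_pytorch as smp
from transformers import SegformerForSemanticSegmentation

NUM_CLASSES = 2  # e.g., the binary GCPS task (cancerous vs. non-cancerous)

# ResNet34-UNet baseline with an ImageNet-pretrained encoder
resnet_unet = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=NUM_CLASSES,
)

# SegFormer-B2 baseline: the MiT-B2 encoder is ImageNet-pretrained,
# while the lightweight decode head is randomly initialized.
segformer = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b2",
    num_labels=NUM_CLASSES,
)
```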
In our experiments, we split the datasets into three folds and reported the mean and standard deviation for each model. To compare the performance of the studied methods, we employed two metrics: the micro Intersection over Union (μIoU) and the micro Dice coefficient (μDice). \[\mu IoU = \frac{\mu TP}{\mu TP + \mu FP + \mu FN}\] \[\mu Dice = \frac{2\mu TP}{2\mu TP + \mu FP + \mu FN}\] where μTP, μFP, and μFN denote the micro-averaged true positives, false positives, and false negatives, respectively. For the micro metrics, instead of computing the metric for each validation image and then averaging the results across all images, the entire validation dataset is treated as a single large image, as proposed in [16]. In this approach, the TP, FP, and FN values are accumulated as follows: \[\mu f = \sum_{n=1}^{N} f_n , \quad \text{for } f \in \{ \text{TP}, \text{FP}, \text{FN} \}\] where \(N\) is the number of images in the validation set and \(f_n\) is the value of \(f\) for the \(n\)-th image. We first compute the micro metric for each class and then average across all classes.
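A minimal NumPy sketch of this micro-averaged evaluation is given below: TP, FP, and FN are accumulated over the entire validation set for each class before the per-class scores are averaged. Function and variable names are illustrative.

```python
# Sketch of micro-averaged IoU/Dice over a validation fold (names are illustrative).
import numpy as np

def micro_iou_dice(predictions, targets, num_classes):
    """predictions, targets: iterables of integer label maps of shape (H, W)."""
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for pred, target in zip(predictions, targets):
        for c in range(num_classes):
            p, t = (pred == c), (target == c)
            tp[c] += np.logical_and(p, t).sum()
            fp[c] += np.logical_and(p, ~t).sum()
            fn[c] += np.logical_and(~p, t).sum()
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou.mean(), dice.mean()
```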
| Method | μIoU (%) | μDice (%) |
|---|---|---|
| DGAUNet [15] | \(75.95 \pm 0.20\) | \(86.33 \pm 0.13\) |
| SegFormer [13] | \(70.90 \pm 0.38\) | \(82.97 \pm 0.26\) |
| ResNetUNet [17] | \(75.65 \pm 0.08\) | \(86.13 \pm 0.05\) |
| TransUNet [6] | \(74.84 \pm 0.10\) | \(85.61 \pm 0.09\) |
| CS-SegNet | \(76.68 \pm 0.15\) | \(86.80 \pm 0.06\) |
| ACS-SegNet | \(\boldsymbol{76.79} \pm \boldsymbol{0.14}\) | \(\boldsymbol{86.87} \pm \boldsymbol{0.09}\) |
The segmentation results on the GCPS dataset are presented in Table 1. As shown, the proposed method achieves the best performance, with a μIoU of 76.79% and a μDice score of 86.87%. The second-best performance is obtained by our dual-encoder model without the CBAM attention module (CS-SegNet), which confirms the positive effect of incorporating CBAM in the proposed model. Among the other models, DGAUNet demonstrates the best performance, whereas ResNetUNet and TransUNet achieve comparable results, and SegFormer shows the lowest performance.
| Method | μIoU (%) | μDice (%) |
|---|---|---|
| DGAUNet [15] | \(44.35 \pm 1.76\) | \(53.69 \pm 1.91\) |
| SegFormer [13] | \(46.78 \pm 2.58\) | \(58.25 \pm 2.01\) |
| ResNetUNet [17] | \(58.42 \pm 1.98\) | \(71.58 \pm 1.21\) |
| TransUNet [6] | \(62.43 \pm 2.47\) | \(74.63 \pm 1.91\) |
| CS-SegNet | \(63.67 \pm 1.23\) | \(75.55 \pm 1.71\) |
| ACS-SegNet | \(\boldsymbol{64.93} \pm \boldsymbol{2.28}\) | \(\boldsymbol{76.60} \pm \boldsymbol{1.36}\) |
Table 2 presents the segmentation results for the PUMA dataset. The proposed model with the CBAM attention module achieves the best performance, with a μIoU of 64.93% and a μDice score of 76.60%. The second-best performance is obtained by the proposed model without the CBAM module (CS-SegNet). Similar to the GCPS experiment, the performance of ResNetUNet is close to that of TransUNet, while SegFormer performs considerably worse. A noticeable difference compared to the GCPS dataset is the performance drop of DGAUNet, which fails to perform well on this small-scale, multi-class segmentation task.
It should be noted that, in general, annotating histological images is a challenging and time-consuming task, and many annotated datasets are therefore small. Consequently, models should be able to train and perform well on small-scale datasets. The experimental results indicate that the proposed model is an effective solution for both small- and large-scale datasets. On the GCPS dataset, which is approximately 30 times larger than PUMA, CNN-based models such as DGAUNet perform well, but they fail to achieve comparable results on the small-scale PUMA dataset.
However, the proposed model, which fuses features from both CNN and ViT families, achieves superior results across both datasets. For the GCPS dataset, the μIoU values of DGAUNet (the second-best model for GCPS) across three folds are 75.74%, 75.87%, and 76.24%, while those of the proposed method are 76.89%, 76.62%, and 76.78%. These results demonstrate that the proposed method consistently achieves better performance across all folds. For the PUMA dataset—a smaller dataset involving a multi-class segmentation task—the advantage of the dual-encoder architecture becomes more evident. In this case, ACS-SegNet and TransUNet, both of which incorporate ViT components, perform better than purely CNN-based models. The μIoU values of TransUNet (the second-best model for PUMA) across three folds are 64.03%, 64.34%, and 58.93%, while those of the proposed method are 67.92%, 64.51%, and 62.37%. These results further confirm that the proposed method consistently outperforms competing models across all folds.
The number of trainable parameters for each model is summarized in Table 3. As shown, ACS-SegNet achieves superior performance with roughly one-fifth the parameter count of DGAUNet and about half that of TransUNet.
It is also worth emphasizing that the superior performance of the proposed model was achieved on two different cancer types, gastric cancer and melanoma, which differ considerably in their histological patterns.
| Model | Trainable Parameters (approx.) |
|---|---|
| DGAUNet [15] | 237 million |
| SegFormer [13] | 27 million |
| ResNetUNet [17] | 24 million |
| TransUNet [6] | 105 million |
| CS-SegNet | 50 million |
| ACS-SegNet | 50 million |
Finally, for qualitative analysis, we show two sample results from the GCPS (top) and PUMA (bottom) datasets in Figure 2. As the qualitative results demonstrate, our predicted segmentations closely match the ground-truth masks in both examples.
This paper introduced ACS-SegNet, an attention-based dual-encoder CNN–ViT model designed for accurate tissue segmentation in histopathology images. The architecture integrates SegFormer and ResNet encoders to effectively capture both global and local features. We evaluated the model on two publicly available histopathology datasets, GCPS and PUMA, and benchmarked its performance against state-of-the-art methods. Experimental results demonstrate that ACS-SegNet consistently outperforms the other approaches and achieves excellent segmentation performance on both datasets.
This research study was conducted retrospectively using human subject data made available in open access by [15], [16]. Ethical approval was not required as confirmed by the license attached with the open access data.
This work was supported by the Vienna Science and Technology Fund (WWTF) and by the State of Lower Austria [Grant ID: 10.47379/LS23006].