RWKV-UNet: Improving UNet with Long-Range Cooperation for
Effective Medical Image Segmentation
January 14, 2025
In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and Transformer models. However, CNNs face limitations in capturing long-range dependencies, while Transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model’s ability to capture long-range dependencies and improves contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with the proposed Global-Local Spatial Perception (GLSP) blocks, which combine CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module that improves skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.
RWKV-UNet, Receptance Weighted Key Value, Global-Local Spatial Perception, Cross-Channel Mix, Medical image segmentation
Computer-aided medical image analysis is crucial in modern healthcare, where deep learning–driven automated segmentation precisely delineates anatomical structures and lesions, thereby enhancing visualization, improving diagnostic accuracy, and supporting personalized treatment planning. U-Net [1] is a seminal segmentation architecture featuring an encoder-decoder structure with skip connections, allowing it to capture both high-level semantic context and fine-grained details, which has inspired numerous improved variants.
Medical images are typically high-resolution and exhibit complex anatomical structures with diverse lesion appearances, posing challenges for segmentation tasks that demand both fine local detail and coherent global context. Local features are essential for accurate boundary delineation and small lesion detection, whereas long-range dependencies ensure structural consistency and cross-slice contextual integration. CNN-based designs excel at local feature extraction through convolutional kernels but struggle to model long-range dependencies. Transformer-based [2], [3] UNet variants such as [4], [5] address this via patch-based image processing and self-attention, improving segmentation accuracy, but their \(O(N^2)\) computational cost impacts efficiency, especially when processing high-resolution images.
Recent linear-attention models such as Mamba [6], [7] and RWKV [8], [9] have emerged as efficient alternatives, both achieving linear complexity while maintaining long-range modeling capability, and UNet variants based on these [10]–[12] have been developed for medical image segmentation. Mamba leverages state-space dynamics and selective scanning for efficient global information propagation, whereas RWKV introduces a gated recurrent mechanism that combines recurrence inductive bias with Transformer-style context aggregation. From a modeling perspective, RWKV’s gated accumulation enables more persistent and isotropic context propagation, yielding broader effective receptive fields and more stable long-range dependency modeling in medical image segmentation. Integrating CNNs with these global modeling mechanisms unifies precise local feature extraction and efficient global context understanding, addressing the inherent limitations of either CNNs or Transformers alone. Building on RWKV’s advantages, CNN–RWKV hybrids are expected to provide more balanced representations, offering a promising trade-off between segmentation accuracy and computational efficiency. Fig. 1 compares the strengths and weaknesses of each model.
Based on the advantages discussed above, we propose a novel RWKV-UNet for medical image segmentation, explicitly integrating convolutional layers with the spatial mixing capabilities of RWKV. This design leverages RWKV’s strength in capturing long-range dependencies and directionally consistent context propagation while retaining the precise local feature extraction of convolution, aiming to produce more accurate and context-aware segmentation results. We evaluate our model on 11 benchmark datasets, where it achieves state-of-the-art (SOTA) performance, demonstrating its effectiveness and potential as a robust tool for medical image analysis.
In summary, our contributions are as follows:
We propose an innovative RWKV-based Global-Local Spatial Perception (GLSP) module to build encoders for medical image segmentation. This approach efficiently integrates global and local information, and its feature extraction capabilities are augmented via pre-training.
We design a Cross-Channel Mix (CCM) module to improve the skip connections in U-Net, which facilitates the fusion of multi-scale information and enhances the feature representation across different scales.
Our RWKV-UNet demonstrates superiority and efficiency, achieving SOTA segmentation performance on 11 datasets of different imaging modalities. Additionally, the relatively compact models, namely RWKV-UNet-S and RWKV-UNet-T, strike a balance between performance and computational efficiency, making them adaptable to a wider range of application scenarios.
U-Net [1] is a deep learning architecture for biomedical image segmentation, featuring a U-shaped encoder-decoder structure with skip connections that preserve spatial details while extracting and restoring features. U-Net variants improve its performance through various means, including stronger encoders [13], [14], gating mechanisms that focus on important features [15], improved skip connection designs [16]–[18] that recover more spatial details, and adaptations for 3D images [19], [20]. Some studies have explored combining MLPs or KANs with UNet to capture long-range dependencies [21]–[23]. nnU-Net [24] stands out as a self-configuring framework that automatically adapts preprocessing, architecture, and training settings to different medical datasets, serving as a robust baseline in many clinical segmentation challenges. These improvements allow U-Net to demonstrate higher accuracy and adaptability in various image segmentation tasks.
Attention-based Improvements. Attention-based methods have demonstrated strong performance in visual tasks due to their ability to capture global context. The self-attention mechanism of Transformers allows models to focus on the most informative regions, which is crucial in medical image segmentation for precise delineation of anatomical structures and lesions. Pure Transformer U-Nets [4], [5], [25]–[27] rely entirely on self-attention, enabling effective long-range dependency modeling but often at high computational cost. Hybrid CNN-Transformer architectures [28]–[32] combine CNNs’ local feature extraction with Transformers’ global modeling, preserving fine structural details while capturing complex tissue and lesion relationships. More recently, linear-attention sequence models such as Mamba [6], [7] and SSM-based U-Net variants [10], [11], [33]–[35] efficiently capture long-range dependencies with linear complexity, offering a favorable balance between segmentation accuracy and computational efficiency. There have also been attention-based attempts at segmentation in medical videos [36]. Despite this progress, existing parallel global-local fusion suffers from high computational cost and can interfere with fine feature representation, while some global modeling may be unnecessary in shallow layers. Additionally, existing approaches may struggle to capture long-range dependencies stably, as they lack the recurrent inductive bias that supports stable context propagation.
Receptance Weighted Key Value (RWKV) [8] models sequential dependencies via a weighted combination of past key-value pairs modulated by a learnable gate, avoiding the quadratic cost of traditional attention. Adapted to vision tasks as Vision-RWKV (VRWKV) [9], it efficiently captures long-range spatial dependencies in high-resolution images with linear complexity and reduced memory overhead, and has also been extended to broader visual domains [37], [38].
RWKV for Medical Imaging. Restore-RWKV [39] applies linear Re-WKV attention with an Omni-Shift layer for efficient multi-directional information propagation in image restoration. BSBP-RWKV [40] integrates RWKV with Perona-Malik diffusion to achieve high-precision segmentation, preserving structural and pathological details. RWKVMatch [41] leverages Vision-RWKV-based global attention, cross-fusion mechanisms, and elastic transformation integration to handle complex deformations in medical image registration. The Zigzag RWKV-in-RWKV (Zig-RiR) model [12] introduces a nested RWKV structure with Outer and Inner blocks and a Zigzag-WKV attention mechanism to capture both global and local features while preserving spatial continuity. These innovations demonstrate RWKV’s ability to capture long-range dependencies, retain fine anatomical features, and scale to high-resolution images. However, systematic exploration of hybrid architectures combining RWKV with local feature extractors remains limited, leaving opportunities to balance local detail, global context, and computational efficiency in medical image segmentation.
The overall architecture of the proposed RWKV-UNet is presented in Fig. 2. RWKV-UNet consists of an encoder with stacked LP and GLSP blocks, a decoder, and skip connections with a Cross-Channel Mix (CCM) Module.
We employ the RWKV Spatial Mix module and a depth-wise convolution (DW-Conv) with a skip connection to combine global and local dependencies. The process of expanding and reducing dimensions enhances feature representation by capturing richer details while preventing information bottlenecks. Given a feature map \(X \in \mathbb{R}^{C_{\mathrm{in}} \times H \times W}\), the processing flow through the GLSP module is as follows: normalize and project the input \(X\) to an intermediate dimension \(C_{\mathrm{mid}}\) using a \(1\times1\) convolution layer:
\[I_1 = \text{LayerNorm} (\text{Conv}_{1\times1}(X)),\] where \(I_1 \in \mathbb{R}^{C_{\mathrm{mid}} \times H \times W}\). \(C_{\mathrm{mid}}\) should be greater than \(C_{\mathrm{in}}\) to achieve dimension expansion.
Divide the feature map into patches of size \(1\times1\). Then the Spatial Mix in VRWKV is applied: \[I_2 = \text{Unfolding}(I_1) ,\] \[I_3 = \text{SpatialMix}(\text{LayerNorm}(I_2)) + I_2 ,\] where \(I_2, I_3 \in \mathbb{R}^{C_{\mathrm{mid}}\times N}\) and \(N = H \times W\).
Convert the feature sequence back into a 2D feature map, \(I_4 = \text{Folding}(I_3) \in \mathbb{R}^{ C_{\mathrm{mid}} \times H \times W}\), and use a \(5\times5\) DW-Conv layer for local feature aggregation: \[I_5 = \text{DW-Conv}(I_4) + I_4,\] where \(I_5 \in \mathbb{R}^{C_{\mathrm{mid}} \times H^{\prime} \times W^{\prime}}\), and \(H^{\prime} \times W^{\prime}\) is determined by the stride of the DW-Conv.
Finally, project \(I_5\) to the output dimension and add a global skip connection:
\[F = \text{Conv}_{1\times1}(I_5) + X,\] where \(F \in \mathbb{R}^{C_{\mathrm{out}} \times H^{\prime} \times W^{\prime}}\).
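Putting these steps together, the following is a minimal PyTorch sketch of a GLSP block, not the authors' implementation: class and argument names are ours, `spatial_mix` is any token mixer on \((B, N, C)\) sequences standing in for VRWKV's Spatial Mix, and LayerNorm over 2D maps is approximated with single-group GroupNorm.

```python
import torch
import torch.nn as nn

class GLSPBlock(nn.Module):
    """Sketch of a Global-Local Spatial Perception block (illustrative names).

    Pipeline: 1x1 conv expansion + norm -> unfold to tokens -> Spatial Mix (global)
    -> fold back -> 5x5 depth-wise conv (local) -> 1x1 conv projection + residual.
    """

    def __init__(self, c_in, c_mid, c_out, spatial_mix, stride=1):
        super().__init__()
        self.expand = nn.Conv2d(c_in, c_mid, kernel_size=1)
        self.norm_in = nn.GroupNorm(1, c_mid)            # LayerNorm surrogate for 2D maps
        self.norm_tokens = nn.LayerNorm(c_mid)
        self.spatial_mix = spatial_mix                   # token mixer on (B, N, C)
        self.dwconv = nn.Conv2d(c_mid, c_mid, kernel_size=5, stride=stride,
                                padding=2, groups=c_mid)
        self.project = nn.Conv2d(c_mid, c_out, kernel_size=1)
        # Residuals are only valid when shapes match (stride 1, matching channels).
        self.use_global_skip = (stride == 1 and c_in == c_out)

    def forward(self, x):                                # x: (B, C_in, H, W)
        i1 = self.norm_in(self.expand(x))                # (B, C_mid, H, W)
        b, c, h, w = i1.shape
        tokens = i1.flatten(2).transpose(1, 2)           # unfold 1x1 patches -> (B, N, C_mid)
        tokens = self.spatial_mix(self.norm_tokens(tokens)) + tokens
        i4 = tokens.transpose(1, 2).reshape(b, c, h, w)  # fold back to 2D
        i5 = self.dwconv(i4)
        if self.dwconv.stride == (1, 1):
            i5 = i5 + i4                                 # local residual
        out = self.project(i5)
        return out + x if self.use_global_skip else out

# Stand-in mixer (replace with VRWKV Spatial Mix in practice); expansion ratio 3.0.
block = GLSPBlock(c_in=96, c_mid=288, c_out=96, spatial_mix=nn.Identity())
out = block(torch.randn(2, 96, 28, 28))                  # -> (2, 96, 28, 28)
```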
An LP block contains a point convolution layer, a DW-Conv layer, and a point convolution layer with local and global residual skips; it removes the Spatial Mix as well as the unfolding and folding processes from the GLSP block.
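Under the same assumptions, an LP block is the purely convolutional counterpart of the sketch above (again illustrative; the kernel size and norm placement are assumptions of this sketch):

```python
import torch.nn as nn

class LPBlock(nn.Module):
    """Sketch of an LP block: point conv -> depth-wise conv -> point conv,
    with local and global residual skips and no Spatial Mix / unfold / fold."""

    def __init__(self, c_in, c_mid, c_out, stride=1):
        super().__init__()
        self.expand = nn.Conv2d(c_in, c_mid, kernel_size=1)
        self.norm = nn.GroupNorm(1, c_mid)
        self.dwconv = nn.Conv2d(c_mid, c_mid, kernel_size=5, stride=stride,
                                padding=2, groups=c_mid)
        self.project = nn.Conv2d(c_mid, c_out, kernel_size=1)
        self.use_global_skip = (stride == 1 and c_in == c_out)

    def forward(self, x):
        i = self.norm(self.expand(x))
        j = self.dwconv(i)
        if self.dwconv.stride == (1, 1):
            j = j + i                                    # local residual
        out = self.project(j)
        return out + x if self.use_global_skip else out
```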
We construct the RWKV-UNet encoder by stacking LP and GLSP blocks. The encoder comprises a stem stage followed by four stages. The first block in Stages 2, 3, and 4 has no residual connections and uses a DW-Conv with a stride of 2 to achieve downsampling. Apart from these first blocks, Stages 1 and 2 are composed of stacked LP blocks without Spatial Mix, while in Stages 3 and 4 a series of GLSP blocks are stacked following the first block.
As shown in Table 1, we implement three different sizes of encoders with different numbers of LP and GLSP blocks in each stage (depths), different embedding dimensions and expansion ratios for \(C_{\mathrm{mid}}\) : Encoder-Tiny (Enc-T), Encoder-Small (Enc-S), and Encoder-Base (Enc-B), allowing flexibility to scale up the model according to different resource constraints and application needs.
We pre-train the encoders of different sizes on ImageNet-1K [42] with a batch size of 1024 for 300 epochs using the AdamW [43] optimizer. These pre-trained weights of different sizes will be used in medical segmentation tasks, enabling improved feature extraction and faster convergence.
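To make the stage layout concrete, the following sketch stacks the LPBlock and GLSPBlock sketches above according to the Enc-B column of Table 1; that the expansion ratio is applied to the stage dimension, that each stage's first block downsamples, and the stem design are all assumptions of this sketch rather than details given in the paper.

```python
import torch.nn as nn

# Enc-B configuration mirroring Table 1 (Enc-T / Enc-S differ only in these numbers).
ENC_B = {
    "stem_dim": 24,
    "depths": (3, 3, 6, 3),
    "dims": (48, 72, 144, 240),
    "expansions": (2.0, 2.5, 4.0, 4.0),
    "spatial_mix": (False, False, True, True),   # LP blocks in Stages 1-2, GLSP in Stages 3-4
}

def build_stage(c_in, c_out, depth, expansion, use_mix, mixer_factory):
    """One encoder stage: a strided, residual-free first block, then depth-1 regular blocks."""
    c_mid = int(round(c_out * expansion))
    blocks = []
    for i in range(depth):
        stride = 2 if i == 0 else 1              # assumption: each stage's first block downsamples
        in_ch = c_in if i == 0 else c_out
        if use_mix:
            blocks.append(GLSPBlock(in_ch, c_mid, c_out, mixer_factory(c_mid), stride=stride))
        else:
            blocks.append(LPBlock(in_ch, c_mid, c_out, stride=stride))
    return nn.Sequential(*blocks)

def build_encoder(cfg, mixer_factory):
    stem = nn.Conv2d(3, cfg["stem_dim"], kernel_size=3, stride=2, padding=1)   # placeholder stem
    stages, c_prev = [], cfg["stem_dim"]
    for d, c, e, m in zip(cfg["depths"], cfg["dims"], cfg["expansions"], cfg["spatial_mix"]):
        stages.append(build_stage(c_prev, c, d, e, m, mixer_factory))
        c_prev = c
    return nn.Sequential(stem, *stages)

# encoder = build_encoder(ENC_B, mixer_factory=lambda dim: nn.Identity())
```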
| Stage | Parameter | Enc-T | Enc-S | Enc-B |
|---|---|---|---|---|
| Stem | Dimension | 24 | 24 | 24 |
| Stage 1 | Depth | 2 | 3 | 3 |
|  | Dimension | 32 | 32 | 48 |
|  | Expansion | 2.0 | 2.0 | 2.0 |
|  | Spatial Mix |  |  |  |
| Stage 2 | Depth | 2 | 3 | 3 |
|  | Dimension | 48 | 64 | 72 |
|  | Expansion | 2.5 | 2.5 | 2.5 |
|  | Spatial Mix |  |  |  |
| Stage 3 | Depth | 4 | 6 | 6 |
|  | Dimension | 96 | 128 | 144 |
|  | Expansion | 3.0 | 3.0 | 4.0 |
|  | Spatial Mix | ✔ | ✔ | ✔ |
| Stage 4 | Depth | 2 | 3 | 3 |
|  | Dimension | 160 | 192 | 240 |
|  | Expansion | 3.5 | 4.0 | 4.0 |
|  | Spatial Mix | ✔ | ✔ | ✔ |
|  | Parameters | 2.94M | 9.42M | 16.69M |
|  | FLOPs | 1.83G | 4.73G | 8.96G |
| Attention Type | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|
| Multi-Head Self-Attention [2] | 29.32 | 77.03 | 18.55G | 
| Focused Linear Attention [44] | 27.66 | 76.89 | 11.18G | 
| Bi-Mamba [7] | 29.10 | 78.13 | 14.53G | 
| RWKV Spatial-Mix (ours) [10] | 25.01 | 78.14 | 11.11G | 
We conduct experiments by replacing the spatial mix in the encoder of RWKV-UNet with other attention mechanisms, such as self-attention [2], focused linear attention [44], and Bi-Mamba used in Vision Mamba [7], as well as by removing this module entirely. The total training epochs are 100. The results shown in Table 2 indicate that the model with Spatial-Mix achieves the best DSC with the lowest computational cost. Comparison visualizations of effective receptive fields of the last layer output using different attention mechanisms in the GLSP module are shown in Fig. 3.
Figure 3: Comparison visualization of effective receptive fields of the last-layer output using different attention mechanisms in the GLSP module. a — No Attention, b — MHSA, c — Bi-Mamba, d — Spatial Mix
We design our CNN-based decoder block with a point convolution layer and a \(9\times9\) depth-wise convolution layer, followed by another point convolution layer and an upsampling operation. Consider the input feature map \(X \in \mathbb{R}^{C_{\text{in}} \times H \times W}\); in the depth-wise convolution, the number of input channels is \(C_{\mathrm{in}}\), and the second point convolution maps the \(C_{\text{in}}\) channels to \(C_{\text{out}}\).
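A minimal sketch of such a decoder block follows; the activation choice and its placement are assumptions, since the paper specifies only the layer types and the 9×9 depth-wise kernel.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch of the CNN decoder block: point conv -> 9x9 depth-wise conv ->
    point conv mapping C_in to C_out -> 2x upsampling (illustrative names)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.pw1 = nn.Conv2d(c_in, c_in, kernel_size=1)
        self.dw = nn.Conv2d(c_in, c_in, kernel_size=9, padding=4, groups=c_in)
        self.pw2 = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.act = nn.GELU()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):                  # x: (B, C_in, H, W)
        x = self.act(self.pw1(x))
        x = self.act(self.dw(x))
        x = self.pw2(x)
        return self.up(x)                  # (B, C_out, 2H, 2W)
```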
Inspired by the Channel Mix in VRWKV, we propose Cross-Channel Mix (CCM), a module that can effectively extract channel information across multi-scale encoder features. By capturing the rich global context along channels, CCM further enhances the extraction of long-range contextual information.
Denote the output feature maps of the three encoder stages that feed the skip connections as \(F_1\), \(F_2\), and \(F_3\), which have different spatial dimensions and channel counts: \(F_1 \in \mathbb{R}^{C_1 \times H_1 \times W_1}, \quad F_2 \in \mathbb{R}^{C_2 \times H_2 \times W_2}, \quad F_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}\), with \(H_1 > H_2 > H_3\), \(W_1 > W_2 > W_3\), and \(C_3 > C_2 > C_1\).
We reshape and map smaller features to match the largest feature’s size and dimension: \[\begin{gather} \tilde{F}_i=\operatorname{Conv}_{C_i}\left(\operatorname{Upsample}\left(F_i\right)\right), \quad i=2,3 \\ \tilde{F}_1=\operatorname{Conv}_{C_1}\left(F_1\right), \end{gather}\] where \(\tilde{F}_i \in \mathbb{R}^{C_1 \times H_1 \times W_1}, \quad i=1,2,3\)
We concatenate the adjusted feature maps along the channel dimension to produce a combined feature map: \[F_{\text{cat}} = \text{Concat}(\tilde{F}_1, \tilde{F}_2, \tilde{F}_3),\] where \(F_{\text{cat}} \in \mathbb{R}^{3C_1 \times H_1 \times W_1}\).
Divide the feature map into patches: \(F_{\text{unf}} = \text{Unfolding}(F_{\text{cat}})\in \mathbb{R}^{3C_1 \times N_1}\), where \(N_1=H_1 \times W_1\). Then apply the Channel Mix in VRWKV, performing multi-scale global feature fusion in the channel dimension:
\[F_{\text{mix}} = \text{ChannelMix}(\text{LayerNorm}(F_{\text{unf}}))+F_{\text{unf}},\] where \(F_{\text{mix}}\in \mathbb{R}^{3C_1 \times N_1}\). Project it back to a 2D feature map, \(F_{\text{fold}}=\text{Folding}(F_{\text{mix}})\in \mathbb{R}^{3C_1 \times H_1 \times W_1}\).
The folded feature map \(F_{\text{fold}}\) is split back into three separate features \([F_{\text{fold}}^{(1)}, F_{\text{fold}}^{(2)}, F_{\text{fold}}^{(3)}]\). Each split feature undergoes a reshape operation and a convolution to restore its original size and dimension: \[F_i^{\prime}=\operatorname{Conv}_{C_i}\left(\operatorname{Reshape}\left(F_{\text{fold}}^{(i)}\right)\right), \quad i=1,2,3\] where \(F_1'\in \mathbb{R}^{C_1 \times H_1 \times W_1}\), \(F_2'\in \mathbb{R}^{C_2 \times H_2 \times W_2}\), and \(F_3' \in \mathbb{R}^{C_3 \times H_3 \times W_3}\). \(F_1'\), \(F_2'\), and \(F_3'\) are then concatenated with the outputs of the corresponding decoder stages.
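A minimal sketch of the CCM pipeline described above; `channel_mix` stands in for VRWKV's Channel Mix (any token mixer on \((B, N, C)\) sequences), bilinear interpolation is used for the upsample and reshape steps, and all layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossChannelMix(nn.Module):
    """Sketch of the Cross-Channel Mix (CCM) module over three encoder features."""

    def __init__(self, c1, c2, c3, channel_mix):
        super().__init__()
        # Map every feature to C1 channels at the largest resolution (H1, W1).
        self.align1 = nn.Conv2d(c1, c1, kernel_size=1)
        self.align2 = nn.Conv2d(c2, c1, kernel_size=1)
        self.align3 = nn.Conv2d(c3, c1, kernel_size=1)
        self.norm = nn.LayerNorm(3 * c1)
        self.channel_mix = channel_mix
        # Restore each split to its original channel count.
        self.restore1 = nn.Conv2d(c1, c1, kernel_size=1)
        self.restore2 = nn.Conv2d(c1, c2, kernel_size=1)
        self.restore3 = nn.Conv2d(c1, c3, kernel_size=1)

    def forward(self, f1, f2, f3):
        h1, w1 = f1.shape[-2:]

        def up(f):  # upsample to the largest spatial size
            return F.interpolate(f, size=(h1, w1), mode="bilinear", align_corners=False)

        t1, t2, t3 = self.align1(f1), self.align2(up(f2)), self.align3(up(f3))
        cat = torch.cat([t1, t2, t3], dim=1)                     # (B, 3*C1, H1, W1)
        b, c, h, w = cat.shape
        tokens = cat.flatten(2).transpose(1, 2)                  # (B, N1, 3*C1)
        tokens = self.channel_mix(self.norm(tokens)) + tokens    # global fusion along channels
        fold = tokens.transpose(1, 2).reshape(b, c, h, w)
        s1, s2, s3 = torch.split(fold, c // 3, dim=1)
        # Reshape back to each scale, then restore the original channel counts.
        o1 = self.restore1(s1)
        o2 = self.restore2(F.interpolate(s2, size=f2.shape[-2:], mode="bilinear", align_corners=False))
        o3 = self.restore3(F.interpolate(s3, size=f3.shape[-2:], mode="bilinear", align_corners=False))
        return o1, o2, o3

# Stand-in mixer; the shapes follow the Enc-B dimensions of Stages 1-3 as an example.
ccm = CrossChannelMix(48, 72, 144, channel_mix=nn.Identity())
o1, o2, o3 = ccm(torch.randn(1, 48, 56, 56),
                 torch.randn(1, 72, 28, 28),
                 torch.randn(1, 144, 14, 14))
```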
| Methods | Average HD95 \(\downarrow\) | Average DSC \(\uparrow\) | Aorta | Gallbladder | Kidney (Left) | Kidney (Right) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| U-Net [1] | 39.70 | 76.85 | 89.07 | 69.72 | 77.77 | 68.60 | 93.43 | 53.98 | 86.67 | 75.58 | 
| R50 U-Net [1], [45] | 36.87 | 74.68 | 87.74 | 63.66 | 80.60 | 78.19 | 93.74 | 56.90 | 85.87 | 74.16 | 
| ViT [3] + CUP [28] | 36.11 | 67.86 | 70.19 | 45.10 | 74.70 | 67.40 | 91.32 | 42.00 | 81.75 | 70.44 | 
| TransUNet [28] | 31.69 | 77.48 | 87.23 | 63.16 | 81.87 | 77.02 | 94.08 | 55.86 | 85.08 | 75.62 | 
| SwinUNet [46] | 21.55 | 79.13 | 85.47 | 66.53 | 83.28 | 79.61 | 94.29 | 56.58 | 90.66 | 76.60 | 
| TransClaw U-Net [47] | 26.38 | 78.09 | 85.87 | 61.38 | 84.83 | 79.36 | 94.28 | 57.65 | 87.74 | 73.55 | 
| MT-UNet [29] | 26.59 | 78.59 | 87.92 | 64.99 | 81.47 | 77.29 | 93.06 | 59.46 | 87.75 | 76.81 | 
| UCTransNet [30] | 26.75 | 78.23 | - | - | - | - | - | - | - | - | 
| MISSFormer [5] | 18.20 | 81.96 | 86.99 | 68.65 | 85.21 | 82.00 | 94.41 | 65.67 | 91.92 | 80.81 | 
| TransDeepLab [25] | 21.25 | 80.16 | 86.04 | 69.16 | 84.08 | 79.88 | 93.53 | 61.19 | 89.00 | 78.40 | 
| LeVit-Unet-384 [48] | 16.84 | 78.53 | 87.33 | 62.23 | 84.61 | 80.25 | 93.11 | 59.07 | 88.86 | 72.76 | 
| MS-UNet [26] | 18.97 | 80.44 | 85.80 | 69.40 | 85.86 | 81.66 | 94.24 | 57.66 | 90.53 | 78.33 | 
| HiFormer-L [31] | 19.14 | 80.69 | 87.03 | 68.61 | 84.23 | 78.37 | 94.07 | 60.77 | 90.44 | 82.03 | 
| PVT-CASCADE [49] | 20.23 | 81.06 | 83.01 | 70.59 | 82.23 | 80.37 | 94.08 | 64.43 | 90.10 | 83.69 | 
| TransCASCADE [49] | 17.34 | 82.68 | 86.63 | 68.48 | 87.66 | 84.56 | 94.43 | 65.33 | 90.79 | 83.52 | 
| MetaSeg-B [32] | - | 82.78 | - | - | - | - | - | - | - | - | 
| VM-UNet [10] | 19.21 | 81.08 | 86.40 | 69.41 | 86.16 | 82.76 | 94.17 | 58.80 | 89.51 | 81.40 | 
| Hc-Mamba [50] | 26.34 | 79.58 | 89.93 | 67.65 | 84.57 | 78.27 | 95.38 | 52.08 | 89.49 | 79.84 | 
| Mamba-UNet\(^*\) [33] | 17.32 | 73.36 | 81.65 | 55.39 | 81.86 | 72.76 | 92.35 | 45.28 | 88.22 | 69.35 | 
| PVT-EMCAD-B2 [51] | 15.68 | 83.63 | 88.14 | 68.87 | 88.08 | 84.10 | 95.26 | 68.51 | 92.17 | 83.92 | 
| SelfReg-SwinUnet [46], [52] | - | 80.54 | 86.07 | 69.65 | 85.12 | 82.58 | 94.18 | 61.08 | 87.42 | 78.22 | 
| Zig-RiR (2D) [12] (256\(\times\)256)\(^*\) | 28.51 | 74.61 | 82.73 | 64.19 | 79.13 | 68.25 | 93.71 | 50.14 | 88.46 | 70.29 | 
| RWKV-UNet-T (ours) | 11.89 | 81.87 | 88.10 | 68.42 | 88.88 | 82.61 | 95.15 | 60.52 | 90.23 | 81.09 | 
| RWKV-UNet-S (ours) | 11.01 | 82.93 | 88.30 | 65.92 | 88.47 | 84.11 | 95.45 | 66.51 | 91.51 | 83.17 | 
| RWKV-UNet (ours) | 8.85 | 84.29 | 88.71 | 70.39 | 89.58 | 84.36 | 95.70 | 66.58 | 93.54 | 85.49 | 
| Methods | Average DSC \(\uparrow\) | RV | Myo | LV |
|---|---|---|---|---|
| R50 U-Net [1], [45] | 87.55 | 87.10 | 80.63 | 94.92 | 
| ViT [3] + CUP [28] | 81.45 | 81.46 | 70.71 | 92.18 | 
| R50-ViT [3], [45] + CUP [28] | 87.57 | 86.07 | 81.88 | 94.75 | 
| TransUNet [28] | 89.71 | 88.86 | 84.54 | 95.73 | 
| SwinUNet [46] | 90.00 | 88.55 | 85.62 | 95.83 | 
| MT-UNet [29] | 90.43 | 86.64 | 89.04 | 95.62 | 
| MS-UNet [26] | 87.74 | 85.31 | 84.09 | 93.82 | 
| MISSFormer [5] | 90.86 | 89.55 | 88.04 | 94.99 | 
| LeViT-UNet-384s [48] | 90.32 | 89.55 | 87.64 | 93.76 | 
| PVT-CASCADE [49] | 90.45 | 87.20 | 88.96 | 95.19 | 
| VM-UNet [10] | 91.47 | 89.93 | 89.04 | 95.44 | 
| TransCASCADE [49] | 91.63 | 89.14 | 90.25 | 95.50 | 
| PVT-EMCAD-B2 [51] | 92.12 | 90.65 | 89.68 | 96.02 | 
| SelfReg-SwinUnet [46], [52] | 91.49 | 89.49 | 89.27 | 95.70 | 
| Zig-RiR (2D) [12] (256\(\times\)256)\(^*\) | 87.42 | 86.41 | 81.87 | 93.98 | 
| RWKV-UNet-T (ours) | 91.90 | 90.63 | 88.21 | 96.85 | 
| RWKV-UNet-S (ours) | 91.19 | 89.49 | 87.87 | 96.22 | 
| RWKV-UNet (ours) | 92.29 | 91.26 | 88.78 | 96.83 | 
| Methods | GOALS | FUGC 2025 |
|---|---|---|
| U-Net [1] | 90.99 | 51.96 |
| TransUNet [28] | 91.05 | 79.87 |
| Rolling-UNet-M [22] | 90.84 | 63.39 |
| PVT-EMCAD-B2 [51] | 90.91 | 57.39 |
| PVTFormer [53] | 90.98 | 78.41 |
| UKAN [23] | 90.59 | 72.19 |
| VM-UNet [10] | 91.10 | 72.94 |
| H-vmunet [54] | 84.18 | 65.99 |
| Zig-RiR (2D) [12] | 83.48 | 62.41 |
| RWKV-UNet-T (ours) | 91.34 | 78.67 |
| RWKV-UNet-S (ours) | 91.13 | 79.43 |
| RWKV-UNet (ours) | 91.75 | 81.17 |
Comparison results on the BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC 2017, GLAS, and PSVFM datasets (DSC, %). All baseline experiments are conducted by us; results with error bars are averaged over three runs with different splits.

| Methods | BUSI | CVC-ClinicDB | CVC-ColonDB | Kvasir-SEG | ISIC 2017 | GLAS | PSVFM | Params. | FLOPs |
|---|---|---|---|---|---|---|---|---|---|
| U-Net [1] | 74.90±1.08 | 89.22±3.67 | 83.43±4.79 | 84.94±2.43 | 82.94 | 88.56±0.63 | 68.98 | 7.77M | 12.16G |
| R50-UNet [1], [45] | 78.46±1.52 | 93.60±0.85 | 89.53±2.99 | 90.13±1.16 | 84.78 | 90.64±0.23 | 73.94 | 28.78M | 9.72G |
| Att-UNet [15] | 78.14±1.08 | 90.86±3.58 | 84.75±6.48 | 86.21±1.93 | 83.96 | 89.14±0.48 | 60.61 | 8.73M | 16.78G |
| UNet++ [16] | 75.74±1.60 | 90.07±4.68 | 84.75±3.92 | 85.86±1.41 | 82.27 | 88.93±0.60 | 71.69 | 9.16M | 34.71G |
| TransUNet [28] | 79.35±2.18 | 93.91±2.20 | 90.72±3.50 | 91.34±0.86 | 86.09 | 91.58±0.15 | 74.86 | 105.3M | 33.42G |
| FCBFormer | 80.66±0.26 | 91.15±3.29 | 85.23±4.57 | 91.09±0.83 | 84.56 | 91.13±0.14 | 75.23 | 51.96M | 41.22G |
| UNeXt | 73.19±1.94 | 83.48±2.78 | 78.98±1.86 | 79.99±2.84 | 84.68 | 84.09±0.43 | 56.41 | 1.47M | 0.58G |
| PVT-CASCADE [49] | 78.30±2.75 | 93.45±1.44 | 90.97±2.03 | 91.40±1.12 | 84.77 | 91.85±0.34 | 71.82 | 35.27M | 8.20G |
| Rolling-UNet-M [22] | 77.49±2.20 | 91.91±2.49 | 85.20±5.20 | 87.41±1.74 | 84.42 | 88.12±0.57 | 68.93 | 7.10M | 8.31G |
| PVT-EMCAD-B2 [51] | 79.22±0.93 | 93.39±1.42 | 91.27±1.63 | 90.37±1.04 | 84.05 | 91.77±0.08 | 71.29 | 26.76M | 5.64G |
| U-KAN [23] | 76.95±1.48 | 89.88±2.82 | 84.66±4.62 | 85.78±1.46 | 83.46 | 87.42±0.25 | 66.25 | 25.36M | 8.08G |
| RWKV-UNet-T (ours) | 79.47±1.19 | 95.05±0.64 | 89.56±5.10 | 91.26±2.01 | 85.59 | 91.96±0.30 | 75.23 | 3.15M | 3.70G |
| RWKV-UNet-S (ours) | 80.92±1.51 | 95.58±0.63 | 91.30±2.79 | 91.78±1.19 | 86.38 | 92.11±0.14 | 78.14 | 9.70M | 7.64G |
| RWKV-UNet (ours) | 81.92±0.74 | 95.26±0.61 | 92.27±2.17 | 91.26±1.15 | 85.32 | 92.35±0.07 | 76.39 | 17.13M | 14.50G |
Experiments are conducted on Synapse [55] for multi-organ segmentation in CT images, ACDC [56] for cardiac segmentation in MRI images, GOALS [57] for OCT layer segmentation, BUSI [58] for breast tumor segmentation in ultrasound images, CVC-ClinicDB [59], CVC-ColonDB [60], and Kvasir-SEG [61] for polyp segmentation in endoscopy images, ISIC 2017 [62] for skin lesion segmentation in dermoscopic images, GLAS [63] for gland segmentation in microscopy images, PSVFM [64] for placental vessel segmentation in fetoscopy images, and FUGC [65] for semi-supervised cervix segmentation in ultrasound images. The average Dice Similarity Coefficient (DSC) and average 95% Hausdorff Distance (HD95) are used as evaluation metrics.
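For reference, a minimal sketch of these two metrics for binary masks; this is an illustrative implementation rather than the evaluation code used in the paper, and the naive HD95 assumes both masks are non-empty.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dice_coefficient(pred, gt, eps=1e-5):
    """DSC between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def hd95(pred, gt, spacing=1.0):
    """95th-percentile symmetric Hausdorff distance between mask boundaries
    (naive O(n^2) reference; dedicated libraries are normally used instead)."""
    def boundary(mask):
        mask = mask.astype(bool)
        return mask & ~binary_erosion(mask)
    p = np.argwhere(boundary(pred)) * spacing
    g = np.argwhere(boundary(gt)) * spacing
    d = cdist(p, g)
    return np.percentile(np.hstack([d.min(axis=1), d.min(axis=0)]), 95)
```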
Synapse uses 18 training and 12 validation CT scans, while ACDC uses 70 training, 10 validation, and 20 testing MRI scans, following settings in [28].
GOALS uses 100 training, 100 validation and 100 test images following the original setting in the challenge.
BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG are split into training, validation, and testing sets with an 8:1:1 ratio, repeated with three different random seeds.
ISIC 2017 uses 2000 training, 150 validation, and 600 testing images as per the original challenge setup.
GlaS uses 85 training and 80 testing whole-slide images, with the training set further split into training and validation with a 9:1 ratio across three random seeds.
PSVFM uses a video-based split: Videos 1–3 for training, Video 4 for validation, and Videos 5–6 for testing, reflecting real-case variability in fetoscopic surgeries.
Building on [66], FUGC uses 50 labeled and 450 unlabeled images for training, 90 for validation, and 300 for testing.
All experiments are performed on an NVIDIA Tesla V100 with 32 GB memory. Input images are resized to 224\(\times\)224 for Synapse and ACDC, and 256\(\times\)256 for binary segmentation tasks. The total number of training epochs is 30 for Synapse, 150 for ACDC, and 300 for binary segmentation. The batch size is 24 for Synapse and ACDC (32 for RWKV-UNet-T on Synapse to avoid a CUDA memory problem), and 8 for GOALS and binary segmentation tasks. The initial learning rate is \(1\times10^{-3}\) for Synapse and FUGC 2025, \(5\times10^{-4}\) for ACDC, and \(1\times10^{-4}\) for GOALS and binary segmentation experiments. The minimum learning rate is 0 for Synapse and ACDC, and \(1\times10^{-5}\) for GOALS, binary segmentation, and semi-supervised experiments. The AdamW [43] optimizer and CosineAnnealingLR [67] scheduler are used. The baseline results on GOALS and the binary segmentation tasks are run by us.

The loss function for supervised tasks is a mixed loss combining cross-entropy (CE) loss and Dice loss [20]: \[\mathcal{L}=\alpha\, \mathrm{CE}(\hat{y}, y)+\beta\, \mathrm{Dice}(\hat{y}, y),\] where \(\alpha\) and \(\beta\) are 0.5 and 0.5 for Synapse, 0.2 and 0.8 for ACDC, and 0.5 and 1 for GOALS and binary segmentation. Experiments for PVT-CASCADE [49] and PVT-EMCAD-B2 [51] use a deep supervision strategy [68]. For semi-supervised tasks, the supervised loss on labeled data remains \(\mathcal{L}_{\text{sup}} = 0.5\,\mathrm{CE}(\hat{y},y) + \mathrm{Dice}(\hat{y},y)\). In addition, a consistency loss is applied on unlabeled data, computed as the mean squared error (MSE) between predictions under different augmentations after inverting the transformations. The learning rate for TransUNet and PVTFormer is set to \(1\times10^{-4}\) on FUGC 2025, which yields better results.
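A minimal sketch of the supervised loss; the soft multi-class Dice formulation is one common variant (the paper cites [20] without spelling out the exact form), and the class count in the example is illustrative.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, num_classes, eps=1e-5):
    """Soft multi-class Dice loss over softmax probabilities."""
    probs = F.softmax(logits, dim=1)                              # (B, C, H, W)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * one_hot).sum(dims)
    union = probs.sum(dims) + one_hot.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def mixed_loss(logits, target, num_classes, alpha=0.5, beta=0.5):
    """L = alpha * CE + beta * Dice; (alpha, beta) = (0.5, 0.5) for Synapse,
    (0.2, 0.8) for ACDC, and (0.5, 1.0) for GOALS and binary segmentation."""
    ce = F.cross_entropy(logits, target)
    return alpha * ce + beta * dice_loss(logits, target, num_classes)

# Illustrative shapes: 9-class prediction at 224x224 (8 organs + background is an assumption).
logits = torch.randn(2, 9, 224, 224)
target = torch.randint(0, 9, (2, 224, 224))
loss = mixed_loss(logits, target, num_classes=9)
```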
Table 3 shows that RWKV-UNet excels in abdominal organ segmentation on the Synapse dataset, achieving the highest average DSC of 84.29% and surpassing all SOTA CNN-, Transformer-, and Mamba-based methods. It also achieves an HD95 of 8.85, significantly better than other models, demonstrating a strong ability to localize organ boundaries accurately. This can be attributed to the model’s ability to capture both long-range dependencies and local features. The smaller models, RWKV-UNet-T and RWKV-UNet-S, also achieve remarkable results. The DSC, parameters, and FLOPs of each model are shown in Fig. 4, and visualization results are shown in Fig. 5.
Table 4 shows that our model outperforms all SOTA models in the ACDC MRI cardiac segmentation task, achieving the best results in RV and the second-best in LV categories. The smaller models, RWKV-UNet-T and RWKV-UNet-S, also perform well.
Table 5 shows that on the small GOALS dataset, performance differences are minor and some carefully designed models even lag behind the classic U-Net, while our RWKV-UNet still achieves the highest average DSC, demonstrating its robustness and superiority. These results also reflect the model’s ability to handle small datasets.
The comparison table on BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC 2017, GLAS, and PSVFM shows that the RWKV-UNet series performs remarkably well on breast lesion, polyp, skin lesion, gland, and vessel segmentation, with far fewer parameters and much lower computation than TransUNet and FCBFormer.
Table 5 also shows that the RWKV-UNet achieves SOTA performance on the FUGC 2025 dataset, further indicating strong data efficiency and generalization ability, as it can fully leverage limited labeled data and mine useful information from unlabeled data to maintain high segmentation accuracy.
We replace our encoder with other widely used pre-trained encoders while keeping the rest of the network architecture unchanged (feature dimensions vary according to the encoder). Table 6 shows that our pre-trained encoder achieves the best DSC and HD95 on the Synapse dataset, with significantly lower FLOPs.
| Encoder | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs (G) |
|---|---|---|---|
| ResNet-50 [45] | 9.99 | 82.27 | 62.44 |
| PVT-B3-v2 [69] | 11.47 | 81.23 | 22.74 |
| PVT-B5-v2 [69] | 10.53 | 80.14 | 27.38 |
| ConvNeXt-Small [70] | 12.79 | 76.34 | 42.47 |
| ConvNeXt-Base [70] | 10.61 | 81.42 | 74.86 |
| MaxViT-Small [71] | 10.57 | 83.71 | 18.54 |
| MaxViT-Base [71] | 11.09 | 82.89 | 29.86 |
| RWKV-UNet Enc-B (ours) | 8.85 | 84.29 | 11.11 |
Experimental results in Table 7 indicate that pre-training is essential: it significantly improves feature extraction by enabling the encoder to capture intricate patterns and hierarchical information. Moreover, the inductive biases of RWKV are not as strong as those of CNNs, so, like Transformers, it relies on pre-training.
| Pre-training | Training Epochs | HD95 \(\downarrow\) | DSC \(\uparrow\) |
|---|---|---|---|
| w/o | 30 | 36.23 | 71.66 | 
| w/o | 100 | 25.01 | 78.14 | 
| w/ | 30 | 8.85 | 84.29 | 
| Kernel Size | Synapse HD95 \(\downarrow\) | Synapse DSC \(\uparrow\) | ACDC DSC \(\uparrow\) | FLOPs |
|---|---|---|---|---|
| 3 | 10.35 | 84.83 | 92.07 | 10.97G | 
| 5 | 8.69 | 84.67 | 91.95 | 11.00G | 
| 7 | 9.97 | 83.73 | 91.61 | 11.05G | 
| 9 | 8.85 | 84.29 | 92.29 | 11.11G | 
| 11 | 11.25 | 82.67 | 92.04 | 11.19G | 
We evaluate the effect of using attention in the shallow layers of RWKV-UNet-S by adding Spatial Mix to the first two stages and comparing against the original CNN-based shallow layers. All models are trained for 100 epochs from scratch, without pre-training. Table 9 shows that shallow attention yields negligible improvement on Synapse while introducing higher computational cost, indicating that CNNs are sufficient for capturing low-level features in the shallow layers.
| Shallow Attention | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|
| w/ | 29.43 | 76.88 | 6.91G | 
| w/o (ours) | 28.92 | 76.41 | 5.86G | 
In VRWKV [9], a Channel Mix module is used after the Spatial Mix. In contrast, the GLSP blocks in our RWKV-UNet encoder omit Channel Mix, as the pointwise convolution in the output layer already performs channel mixing and feed-forward operations. We compare results on the Synapse dataset (without pre-training, 100 epochs). Table 10 demonstrates that Channel Mix is not essential for maintaining performance.
| Channel Mix | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|
| w/ | 35.19 | 74.41 | 19.43G | 
| w/o (ours) | 25.01 | 78.14 | 11.11G | 
Comparison results of different skip connection designs for RWKV-UNet on the Synapse dataset are shown in Table 11. The experiments vary the number of skip connections and the inclusion of the CCM block. We prioritize retaining shallow skip connections because shallow layers tend to suffer greater information loss. The results show that, as with other U-Net variants, adequate skip connections are essential for RWKV-UNet’s performance. Furthermore, the CCM module significantly improves segmentation results: with only two skips, the model still achieves 84.28 DSC with CCM. Admittedly, the CCM module increases the computational load to some extent due to the aggregation of global information.
| Skips | CCM Block | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|---|
| 0 | w/o | 12.72 | 69.83 | 9.18G |
| 1 | w/o | 10.83 | 78.27 | 9.33G |
| 1 | w/ | 10.37 | 83.19 | 10.96G |
| 2 | w/o | 11.32 | 80.11 | 9.41G |
| 2 | w/ | 11.62 | 84.28 | 11.04G |
| 3 | w/o | 12.56 | 82.40 | 9.48G |
| 3 (ours) | w/ (ours) | 8.85 | 84.29 | 11.11G |
Based on the results in Table 12, the hidden rate of the Channel Mix layer in the CCM module materially affects performance: increasing the rate from 1 to 2 yields a notable improvement, while further increases bring no gains and only add FLOPs.
The comparison results of different methods with 512\(\times\)512 input on the Synapse dataset are shown in Table 13. The average Dice scores of the other methods are taken from [72]. Our results indicate that increasing the input resolution improves the performance of our model. Furthermore, compared to TransUNet, the computational overhead associated with the resolution increase is less severe, showcasing the efficiency of RWKV-UNet at higher resolutions.
| Method | Average DSC \(\uparrow\) | FLOPs |
|---|---|---|
| U-Net [1] | 81.34 | - | 
| Pyramid Attention [73] | 80.08 | - | 
| DeepLabv3+ [74] | 82.50 | - | 
| UNet++ [16] | 81.60 | - | 
| Attention U-Net [15] | 80.88 | - | 
| nnU-Net [24] | 82.92 | - | 
| TransUNet [28] | 84.36 | 148.29G | 
| SAMed_h [75] | 84.30 | 783.98G | 
| RWKV-UNet (ours) | 86.73 | 58.05G | 
In this study, we introduce RWKV-UNet, a novel architecture that integrates the RWKV architecture with U-Net. By combining the strengths of convolutional networks for local feature extraction and RWKV’s ability to model global context, our model significantly improves medical image segmentation accuracy. The proposed enhancements, including the GLSP module and the CCM module, contribute to a more precise representation of features and information fusion across different scales. Experimental results on 11 datasets demonstrate that RWKV-UNet surpasses SOTA methods. Its variants (RWKV-UNet-S and RWKV-UNet-T) offer a practical balance between performance and computational efficiency. Our approach has a strong potential to advance medical image analysis, particularly in clinical settings where both accuracy and efficiency are paramount.
Limitations and Future Work. RWKV-UNet is a powerful 2D medical image segmentation model that effectively combines RWKV with convolutional operations; however, it is currently not applicable to 3D imaging. In the future, we plan to extend the model to 3D for handling volume data and to explore the potential of RWKV-based foundational models for medical imaging. We also aim to develop ultra-lightweight RWKV-based models tailored for point-of-care applications, preserving segmentation accuracy while enhancing adaptability and speed further.