RWKV-UNet: Improving UNet with Long-Range Cooperation for
Effective Medical Image Segmentation
January 14, 2025
In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and Transformer models. However, CNNs face limitations in capturing long-range dependencies, while Transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model’s ability to capture long-range dependencies and improves contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with the proposed Global-Local Spatial Perception (GLSP) blocks, which combine CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module that improves skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.
RWKV-UNet, Receptance Weighted Key Value, Global-Local Spatial Perception, Cross-Channel Mix, Medical image segmentation
Computer-aided medical image analysis is crucial in modern healthcare, where deep learning–driven automated segmentation precisely delineates anatomical structures and lesions, thereby enhancing visualization, improving diagnostic accuracy, and supporting personalized treatment planning. U-Net [1] is a seminal segmentation architecture featuring an encoder-decoder structure with skip connections, allowing it to capture both high-level semantic context and fine-grained details, which has inspired numerous improved variants.
Medical images are typically high-resolution and exhibit complex anatomical structures with diverse lesion appearances, posing challenges for segmentation tasks that demand both fine local detail and coherent global context. Local features are essential for accurate boundary delineation and small lesion detection, whereas long-range dependencies ensure structural consistency and cross-slice contextual integration. CNN-based designs excel at local feature extraction through convolutional kernels but struggle to model long-range dependencies. Transformer-based [2], [3] UNet variants such as [4], [5] address this via patch-based image processing and self-attention, improving segmentation accuracy, but their \(O(N^2)\) computational cost impacts efficiency, especially when processing high-resolution images.
Recent linear-attention models such as Mamba [6], [7] and RWKV [8], [9] have emerged as efficient alternatives, both achieving linear complexity while maintaining long-range modeling capability, and UNet variants based on these [10]–[12] have been developed for medical image segmentation. Mamba leverages state-space dynamics and selective scanning for efficient global information propagation, whereas RWKV introduces a gated recurrent mechanism that combines recurrence inductive bias with Transformer-style context aggregation. From a modeling perspective, RWKV’s gated accumulation enables more persistent and isotropic context propagation, yielding broader effective receptive fields and more stable long-range dependency modeling in medical image segmentation. Integrating CNNs with these global modeling mechanisms unifies precise local feature extraction and efficient global context understanding, addressing the inherent limitations of either CNNs or Transformers alone. Building on RWKV’s advantages, CNN–RWKV hybrids are expected to provide more balanced representations, offering a promising trade-off between segmentation accuracy and computational efficiency. Fig. 1 compares the strengths and weaknesses of each model.
Based on the advantages discussed above, we propose a novel RWKV-UNet for medical image segmentation, explicitly integrating convolutional layers with the spatial mixing capabilities of RWKV. This design leverages RWKV’s strength in capturing long-range dependencies and directionally consistent context propagation while retaining the precise local feature extraction of convolution, aiming to produce more accurate and context-aware segmentation results. We evaluate our model on 11 benchmark datasets, where it achieves state-of-the-art (SOTA) performance, demonstrating its effectiveness and potential as a robust tool for medical image analysis.
In summary, our contributions are as follows:
We propose an innovative RWKV-based Global-Local Spatial Perception (GLSP) module to build encoders for medical image segmentation. This approach efficiently integrates global and local information, and its feature extraction capabilities are augmented via pre-training.
We design a Cross-Channel Mix (CCM) module to improve the skip connections in U-Net, which facilitates the fusion of multi-scale information and enhances the feature representation across different scales.
Our RWKV-UNet demonstrates superiority and efficiency, achieving SOTA segmentation performance on 11 datasets of different imaging modalities. Additionally, the relatively compact models, namely RWKV-UNet-S and RWKV-UNet-T, strike a balance between performance and computational efficiency, making them adaptable to a wider range of application scenarios.
U-Net [1] is a deep learning architecture for biomedical image segmentation, featuring a U-shaped encoder-decoder structure with skip connections that preserve spatial details while extracting and restoring features. U-Net variants improve its performance through various means, including stronger encoders [13], [14], gating mechanisms that focus on important features [15], improved skip connection designs [16]–[18] that recover more spatial details, and adaptations for 3D images [19], [20]. Some studies have explored combining MLPs or KANs with UNet to capture long-range dependencies [21]–[23]. nnU-Net [24] stands out as a self-configuring framework that automatically adapts preprocessing, architecture, and training settings to different medical datasets, serving as a robust baseline in many clinical segmentation challenges. These improvements allow U-Net to demonstrate higher accuracy and adaptability in various image segmentation tasks.
Attention-based Improvements. Attention-based methods have demonstrated strong performance in visual tasks due to their ability to capture global context. The self-attention mechanism of Transformers allows models to focus on the most informative regions, which is crucial in medical image segmentation for precise delineation of anatomical structures and lesions. Pure Transformer U-Nets [4], [5], [25]–[27] rely entirely on self-attention, enabling effective long-range dependency modeling but often at high computational cost. Hybrid CNN-Transformer architectures [28]–[32] combine CNNs’ local feature extraction with Transformers’ global modeling, preserving fine structural details while capturing complex tissue and lesion relationships. More recently, linear-attention sequence models such as Mamba [6], [7] and SSM-based U-Net variants [10], [11], [33]–[35] efficiently capture long-range dependencies with linear complexity, offering a favorable balance between segmentation accuracy and computational efficiency. There have also been attention-based attempts at segmentation in medical videos [36]. Despite this progress, existing parallel global-local fusion suffers from high computational cost and can interfere with fine feature representation, while some global modeling may be unnecessary in shallow layers. Additionally, existing approaches may struggle to capture long-range dependencies stably, as they lack the recurrent inductive bias that supports stable context propagation.
Receptance Weighted Key Value (RWKV) [8] models sequential dependencies via a weighted combination of past key-value pairs modulated by a learnable gate, avoiding the quadratic cost of traditional attention. Adapted to vision tasks as Vision-RWKV (VRWKV) [9], it efficiently captures long-range spatial dependencies in high-resolution images with linear complexity and reduced memory overhead, and has also been extended to broader visual domains [37], [38].
RWKV for Medical Imaging. Restore-RWKV [39] applies linear Re-WKV attention with an Omni-Shift layer for efficient multi-directional information propagation in image restoration. BSBP-RWKV [40] integrates RWKV with Perona-Malik diffusion to achieve high-precision segmentation, preserving structural and pathological details. RWKVMatch [41] leverages Vision-RWKV-based global attention, cross-fusion mechanisms, and elastic transformation integration to handle complex deformations in medical image registration. The Zigzag RWKV-in-RWKV (Zig-RiR) model [12] introduces a nested RWKV structure with Outer and Inner blocks and a Zigzag-WKV attention mechanism to capture both global and local features while preserving spatial continuity. These innovations demonstrate RWKV’s ability to capture long-range dependencies, retain fine anatomical features, and scale to high-resolution images. However, systematic exploration of hybrid architectures combining RWKV with local feature extractors remains limited, leaving opportunities to balance local detail, global context, and computational efficiency in medical image segmentation.
The overall architecture of the proposed RWKV-UNet is presented in Fig. 2. RWKV-UNet consists of an encoder with stacked LP and GLSP blocks, a decoder, and skip connections with a Cross-Channel Mix (CCM) Module.
We employ the RWKV Spatial Mix module and a depth-wise convolution (DW-Conv) with a skip connection to combine global and local dependencies. The process of expanding and reducing dimensions enhances feature representation by capturing richer details while preventing information bottlenecks. Given a feature map \(X \in \mathbb{R}^{C_{\mathrm{in}} \times H \times W}\), the processing flow through the GLSP module is as follows: normalize and project the input \(X\) to an intermediate dimension \(C_{\mathrm{mid}}\) using a \(1\times1\) convolution layer:
\[I_1 = \text{LayerNorm} (\text{Conv}_{1\times1}(X)),\] where \(I_1 \in \mathbb{R}^{C_{\mathrm{mid}} \times H \times W}\). \(C_{\mathrm{mid}}\) should be greater than \(C_{\mathrm{in}}\) to achieve dimension expansion.
Divide the feature map into patches of size \(1\times1\). Then the Spatial Mix in VRWKV is applied: \[I_2 = \text{Unfolding}(I_1) ,\] \[I_3 = \text{SpatialMix}(\text{LayerNorm}(I_2)) + I_2 ,\] where \(I_2, I_3 \in \mathbb{R}^{C_{\mathrm{mid}}\times N}\) and \(N = H \times W\).
Convert the feature sequence back into a 2D feature map, \(I_4 = \text{Folding}(I_3) \in \mathbb{R}^{ C_{\mathrm{mid}} \times H \times W}\), and use a \(5\times5\) DW-Conv layer for local feature aggregation: \[I_5 = \text{DW-Conv}(I_4) + I_4,\] where \(I_5 \in \mathbb{R}^{C_{\mathrm{mid}} \times H^{\prime} \times W^{\prime}}\), and \(H^{\prime} \times W^{\prime}\) is determined by the stride of the DW-Conv.
Finally, project \(I_5\) to the output dimension and add a global skip connection:
\[F = \text{Conv}_{1\times1}(I_5) + X,\] where \(F \in \mathbb{R}^{C_{\mathrm{out}} \times H^{\prime} \times W^{\prime}}\).
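Putting these steps together, the following is a minimal PyTorch sketch of a GLSP block, not the authors' implementation: class and argument names are ours, `spatial_mix` is any token mixer on \((B, N, C)\) sequences standing in for VRWKV's Spatial Mix, and LayerNorm over 2D maps is approximated with single-group GroupNorm.

```python
import torch
import torch.nn as nn

class GLSPBlock(nn.Module):
    """Sketch of a Global-Local Spatial Perception block (illustrative names).

    Pipeline: 1x1 conv expansion + norm -> unfold to tokens -> Spatial Mix (global)
    -> fold back -> 5x5 depth-wise conv (local) -> 1x1 conv projection + residual.
    """

    def __init__(self, c_in, c_mid, c_out, spatial_mix, stride=1):
        super().__init__()
        self.expand = nn.Conv2d(c_in, c_mid, kernel_size=1)
        self.norm_in = nn.GroupNorm(1, c_mid)            # LayerNorm surrogate for 2D maps
        self.norm_tokens = nn.LayerNorm(c_mid)
        self.spatial_mix = spatial_mix                   # token mixer on (B, N, C)
        self.dwconv = nn.Conv2d(c_mid, c_mid, kernel_size=5, stride=stride,
                                padding=2, groups=c_mid)
        self.project = nn.Conv2d(c_mid, c_out, kernel_size=1)
        # Residuals are only valid when shapes match (stride 1, matching channels).
        self.use_global_skip = (stride == 1 and c_in == c_out)

    def forward(self, x):                                # x: (B, C_in, H, W)
        i1 = self.norm_in(self.expand(x))                # (B, C_mid, H, W)
        b, c, h, w = i1.shape
        tokens = i1.flatten(2).transpose(1, 2)           # unfold 1x1 patches -> (B, N, C_mid)
        tokens = self.spatial_mix(self.norm_tokens(tokens)) + tokens
        i4 = tokens.transpose(1, 2).reshape(b, c, h, w)  # fold back to 2D
        i5 = self.dwconv(i4)
        if self.dwconv.stride == (1, 1):
            i5 = i5 + i4                                 # local residual
        out = self.project(i5)
        return out + x if self.use_global_skip else out

# Stand-in mixer (replace with VRWKV Spatial Mix in practice); expansion ratio 3.0.
block = GLSPBlock(c_in=96, c_mid=288, c_out=96, spatial_mix=nn.Identity())
out = block(torch.randn(2, 96, 28, 28))                  # -> (2, 96, 28, 28)
```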
An LP block contains a point convolution layer, a DW-Conv layer, and a point convolution layer with local and global residual skips; it removes the Spatial Mix as well as the unfolding and folding processes from the GLSP block.
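Under the same assumptions, an LP block is the purely convolutional counterpart of the sketch above (again illustrative; the kernel size and norm placement are assumptions of this sketch):

```python
import torch.nn as nn

class LPBlock(nn.Module):
    """Sketch of an LP block: point conv -> depth-wise conv -> point conv,
    with local and global residual skips and no Spatial Mix / unfold / fold."""

    def __init__(self, c_in, c_mid, c_out, stride=1):
        super().__init__()
        self.expand = nn.Conv2d(c_in, c_mid, kernel_size=1)
        self.norm = nn.GroupNorm(1, c_mid)
        self.dwconv = nn.Conv2d(c_mid, c_mid, kernel_size=5, stride=stride,
                                padding=2, groups=c_mid)
        self.project = nn.Conv2d(c_mid, c_out, kernel_size=1)
        self.use_global_skip = (stride == 1 and c_in == c_out)

    def forward(self, x):
        i = self.norm(self.expand(x))
        j = self.dwconv(i)
        if self.dwconv.stride == (1, 1):
            j = j + i                                    # local residual
        out = self.project(j)
        return out + x if self.use_global_skip else out
```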
We construct the RWKV-UNet encoder by stacking LP and GLSP blocks. The encoder comprises a stem stage followed by four stages. The first block in Stages 2, 3, and 4 has no residual connections and uses a DW-Conv with a stride of 2 to achieve downsampling. Apart from these first blocks, Stages 1 and 2 are composed of stacked LP blocks without Spatial Mix, while in Stages 3 and 4 a series of GLSP blocks are stacked following the first block.
As shown in Table 1, we implement three different sizes of encoders with different numbers of LP and GLSP blocks in each stage (depths), different embedding dimensions and expansion ratios for \(C_{\mathrm{mid}}\) : Encoder-Tiny (Enc-T), Encoder-Small (Enc-S), and Encoder-Base (Enc-B), allowing flexibility to scale up the model according to different resource constraints and application needs.
We pre-train the encoders of different sizes on ImageNet-1K [42] with a batch size of 1024 for 300 epochs using the AdamW [43] optimizer. These pre-trained weights of different sizes will be used in medical segmentation tasks, enabling improved feature extraction and faster convergence.
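To make the stage layout concrete, the following sketch stacks the LPBlock and GLSPBlock sketches above according to the Enc-B column of Table 1; that the expansion ratio is applied to the stage dimension, that each stage's first block downsamples, and the stem design are all assumptions of this sketch rather than details given in the paper.

```python
import torch.nn as nn

# Enc-B configuration mirroring Table 1 (Enc-T / Enc-S differ only in these numbers).
ENC_B = {
    "stem_dim": 24,
    "depths": (3, 3, 6, 3),
    "dims": (48, 72, 144, 240),
    "expansions": (2.0, 2.5, 4.0, 4.0),
    "spatial_mix": (False, False, True, True),   # LP blocks in Stages 1-2, GLSP in Stages 3-4
}

def build_stage(c_in, c_out, depth, expansion, use_mix, mixer_factory):
    """One encoder stage: a strided, residual-free first block, then depth-1 regular blocks."""
    c_mid = int(round(c_out * expansion))
    blocks = []
    for i in range(depth):
        stride = 2 if i == 0 else 1              # assumption: each stage's first block downsamples
        in_ch = c_in if i == 0 else c_out
        if use_mix:
            blocks.append(GLSPBlock(in_ch, c_mid, c_out, mixer_factory(c_mid), stride=stride))
        else:
            blocks.append(LPBlock(in_ch, c_mid, c_out, stride=stride))
    return nn.Sequential(*blocks)

def build_encoder(cfg, mixer_factory):
    stem = nn.Conv2d(3, cfg["stem_dim"], kernel_size=3, stride=2, padding=1)   # placeholder stem
    stages, c_prev = [], cfg["stem_dim"]
    for d, c, e, m in zip(cfg["depths"], cfg["dims"], cfg["expansions"], cfg["spatial_mix"]):
        stages.append(build_stage(c_prev, c, d, e, m, mixer_factory))
        c_prev = c
    return nn.Sequential(stem, *stages)

# encoder = build_encoder(ENC_B, mixer_factory=lambda dim: nn.Identity())
```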
| Stage | Parameter | Enc-T | Enc-S | Enc-B |
|---|---|---|---|---|
| Stem | Dimension | 24 | 24 | 24 |
| Stage 1 | Depth | 2 | 3 | 3 |
|  | Dimension | 32 | 32 | 48 |
|  | Expansion | 2.0 | 2.0 | 2.0 |
|  | Spatial Mix |  |  |  |
| Stage 2 | Depth | 2 | 3 | 3 |
|  | Dimension | 48 | 64 | 72 |
|  | Expansion | 2.5 | 2.5 | 2.5 |
|  | Spatial Mix |  |  |  |
| Stage 3 | Depth | 4 | 6 | 6 |
|  | Dimension | 96 | 128 | 144 |
|  | Expansion | 3.0 | 3.0 | 4.0 |
|  | Spatial Mix | ✔ | ✔ | ✔ |
| Stage 4 | Depth | 2 | 3 | 3 |
|  | Dimension | 160 | 192 | 240 |
|  | Expansion | 3.5 | 4.0 | 4.0 |
|  | Spatial Mix | ✔ | ✔ | ✔ |
|  | Parameters | 2.94M | 9.42M | 16.69M |
|  | FLOPs | 1.83G | 4.73G | 8.96G |
| Attention Type | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|
| Multi-Head Self-Attention [2] | 29.32 | 77.03 | 18.55G | 
| Focused Linear Attention [44] | 27.66 | 76.89 | 11.18G | 
| Bi-Mamba [7] | 29.10 | 78.13 | 14.53G | 
| RWKV Spatial-Mix (ours) [10] | 25.01 | 78.14 | 11.11G | 
We conduct experiments by replacing the spatial mix in the encoder of RWKV-UNet with other attention mechanisms, such as self-attention [2], focused linear attention [44], and Bi-Mamba used in Vision Mamba [7], as well as by removing this module entirely. The total training epochs are 100. The results shown in Table 2 indicate that the model with Spatial-Mix achieves the best DSC with the lowest computational cost. Comparison visualizations of effective receptive fields of the last layer output using different attention mechanisms in the GLSP module are shown in Fig. 3.
Figure 3: Comparison visualization of effective receptive fields of the last-layer output using different attention mechanisms in the GLSP module. a — No Attention, b — MHSA, c — Bi-Mamba, d — Spatial Mix
We design our CNN-based decoder block with a point convolution layer and a \(9\times9\) depth-wise convolution layer, followed by another point convolution layer and an upsampling operation. Consider the input feature map \(X \in \mathbb{R}^{C_{\text{in}} \times H \times W}\); in the depth-wise convolution, the number of input channels is \(C_{\mathrm{in}}\), and the second point convolution maps the \(C_{\text{in}}\) channels to \(C_{\text{out}}\).
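A minimal sketch of such a decoder block follows; the activation choice and its placement are assumptions, since the paper specifies only the layer types and the 9×9 depth-wise kernel.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch of the CNN decoder block: point conv -> 9x9 depth-wise conv ->
    point conv mapping C_in to C_out -> 2x upsampling (illustrative names)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.pw1 = nn.Conv2d(c_in, c_in, kernel_size=1)
        self.dw = nn.Conv2d(c_in, c_in, kernel_size=9, padding=4, groups=c_in)
        self.pw2 = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.act = nn.GELU()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):                  # x: (B, C_in, H, W)
        x = self.act(self.pw1(x))
        x = self.act(self.dw(x))
        x = self.pw2(x)
        return self.up(x)                  # (B, C_out, 2H, 2W)
```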
Inspired by the Channel Mix in VRWKV, we propose Cross-Channel Mix (CCM), a module that can effectively extract channel information across multi-scale encoder features. By capturing the rich global context along channels, CCM further enhances the extraction of long-range contextual information.
Denote the output feature maps of the three encoder stages that feed the skip connections as \(F_1\), \(F_2\), and \(F_3\), which have different spatial dimensions and channel counts: \(F_1 \in \mathbb{R}^{C_1 \times H_1 \times W_1}, \quad F_2 \in \mathbb{R}^{C_2 \times H_2 \times W_2}, \quad F_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}\), with \(H_1 > H_2 > H_3\), \(W_1 > W_2 > W_3\), and \(C_3 > C_2 > C_1\).
We reshape and map smaller features to match the largest feature’s size and dimension: \[\begin{gather} \tilde{F}_i=\operatorname{Conv}_{C_i}\left(\operatorname{Upsample}\left(F_i\right)\right), \quad i=2,3 \\ \tilde{F}_1=\operatorname{Conv}_{C_1}\left(F_1\right), \end{gather}\] where \(\tilde{F}_i \in \mathbb{R}^{C_1 \times H_1 \times W_1}, \quad i=1,2,3\)
We concatenate the adjusted feature maps along the channel dimension to produce a combined feature map: \[F_{\text{cat}} = \text{Concat}(\tilde{F}_1, \tilde{F}_2, \tilde{F}_3),\] where \(F_{\text{cat}} \in \mathbb{R}^{3C_1 \times H_1 \times W_1}\).
Divide the feature map into patches: \(F_{\text{unf}} = \text{Unfolding}(F_{\text{cat}})\in \mathbb{R}^{3C_1 \times N_1}\), where \(N_1=H_1 \times W_1\). Then apply the Channel Mix in VRWKV, performing multi-scale global feature fusion in the channel dimension:
\[F_{\text{mix}} = \text{ChannelMix}(\text{LayerNorm}(F_{\text{unf}}))+F_{\text{unf}},\] where \(F_{\text{mix}}\in \mathbb{R}^{3C_1 \times N_1}\). Project it back to a 2D feature map, \(F_{\text{fold}}=\text{Folding}(F_{\text{mix}})\in \mathbb{R}^{3C_1 \times H_1 \times W_1}\).
The folded feature map \(F_{\text{fold}}\) is split back into three separate features \([F_{\text{fold}}^{(1)}, F_{\text{fold}}^{(2)}, F_{\text{fold}}^{(3)}]\). Each split feature undergoes a reshape operation and a convolution to restore its original size and dimension: \[F_i^{\prime}=\operatorname{Conv}_{C_i}\left(\operatorname{Reshape}\left(F_{\text{fold}}^{(i)}\right)\right), \quad i=1,2,3\] where \(F_1'\in \mathbb{R}^{C_1 \times H_1 \times W_1}\), \(F_2'\in \mathbb{R}^{C_2 \times H_2 \times W_2}\), and \(F_3' \in \mathbb{R}^{C_3 \times H_3 \times W_3}\). \(F_1'\), \(F_2'\), and \(F_3'\) are then concatenated with the outputs of the corresponding decoder stages.
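A minimal sketch of the CCM pipeline described above; `channel_mix` stands in for VRWKV's Channel Mix (any token mixer on \((B, N, C)\) sequences), bilinear interpolation is used for the upsample and reshape steps, and all layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossChannelMix(nn.Module):
    """Sketch of the Cross-Channel Mix (CCM) module over three encoder features."""

    def __init__(self, c1, c2, c3, channel_mix):
        super().__init__()
        # Map every feature to C1 channels at the largest resolution (H1, W1).
        self.align1 = nn.Conv2d(c1, c1, kernel_size=1)
        self.align2 = nn.Conv2d(c2, c1, kernel_size=1)
        self.align3 = nn.Conv2d(c3, c1, kernel_size=1)
        self.norm = nn.LayerNorm(3 * c1)
        self.channel_mix = channel_mix
        # Restore each split to its original channel count.
        self.restore1 = nn.Conv2d(c1, c1, kernel_size=1)
        self.restore2 = nn.Conv2d(c1, c2, kernel_size=1)
        self.restore3 = nn.Conv2d(c1, c3, kernel_size=1)

    def forward(self, f1, f2, f3):
        h1, w1 = f1.shape[-2:]

        def up(f):  # upsample to the largest spatial size
            return F.interpolate(f, size=(h1, w1), mode="bilinear", align_corners=False)

        t1, t2, t3 = self.align1(f1), self.align2(up(f2)), self.align3(up(f3))
        cat = torch.cat([t1, t2, t3], dim=1)                     # (B, 3*C1, H1, W1)
        b, c, h, w = cat.shape
        tokens = cat.flatten(2).transpose(1, 2)                  # (B, N1, 3*C1)
        tokens = self.channel_mix(self.norm(tokens)) + tokens    # global fusion along channels
        fold = tokens.transpose(1, 2).reshape(b, c, h, w)
        s1, s2, s3 = torch.split(fold, c // 3, dim=1)
        # Reshape back to each scale, then restore the original channel counts.
        o1 = self.restore1(s1)
        o2 = self.restore2(F.interpolate(s2, size=f2.shape[-2:], mode="bilinear", align_corners=False))
        o3 = self.restore3(F.interpolate(s3, size=f3.shape[-2:], mode="bilinear", align_corners=False))
        return o1, o2, o3

# Stand-in mixer; the shapes follow the Enc-B dimensions of Stages 1-3 as an example.
ccm = CrossChannelMix(48, 72, 144, channel_mix=nn.Identity())
o1, o2, o3 = ccm(torch.randn(1, 48, 56, 56),
                 torch.randn(1, 72, 28, 28),
                 torch.randn(1, 144, 14, 14))
```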
| Methods | Average HD95 \(\downarrow\) | Average DSC \(\uparrow\) | Aorta | Gallbladder | Kidney (Left) | Kidney (Right) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| U-Net [1] | 39.70 | 76.85 | 89.07 | 69.72 | 77.77 | 68.60 | 93.43 | 53.98 | 86.67 | 75.58 | 
| R50 U-Net [1], [45] | 36.87 | 74.68 | 87.74 | 63.66 | 80.60 | 78.19 | 93.74 | 56.90 | 85.87 | 74.16 | 
| ViT [3] + CUP [28] | 36.11 | 67.86 | 70.19 | 45.10 | 74.70 | 67.40 | 91.32 | 42.00 | 81.75 | 70.44 | 
| TransUNet [28] | 31.69 | 77.48 | 87.23 | 63.16 | 81.87 | 77.02 | 94.08 | 55.86 | 85.08 | 75.62 | 
| SwinUNet [46] | 21.55 | 79.13 | 85.47 | 66.53 | 83.28 | 79.61 | 94.29 | 56.58 | 90.66 | 76.60 | 
| TransClaw U-Net [47] | 26.38 | 78.09 | 85.87 | 61.38 | 84.83 | 79.36 | 94.28 | 57.65 | 87.74 | 73.55 | 
| MT-UNet [29] | 26.59 | 78.59 | 87.92 | 64.99 | 81.47 | 77.29 | 93.06 | 59.46 | 87.75 | 76.81 | 
| UCTransNet [30] | 26.75 | 78.23 | - | - | - | - | - | - | - | - | 
| MISSFormer [5] | 18.20 | 81.96 | 86.99 | 68.65 | 85.21 | 82.00 | 94.41 | 65.67 | 91.92 | 80.81 | 
| TransDeepLab [25] | 21.25 | 80.16 | 86.04 | 69.16 | 84.08 | 79.88 | 93.53 | 61.19 | 89.00 | 78.40 | 
| LeVit-Unet-384 [48] | 16.84 | 78.53 | 87.33 | 62.23 | 84.61 | 80.25 | 93.11 | 59.07 | 88.86 | 72.76 | 
| MS-UNet [26] | 18.97 | 80.44 | 85.80 | 69.40 | 85.86 | 81.66 | 94.24 | 57.66 | 90.53 | 78.33 | 
| HiFormer-L [31] | 19.14 | 80.69 | 87.03 | 68.61 | 84.23 | 78.37 | 94.07 | 60.77 | 90.44 | 82.03 | 
| PVT-CASCADE [49] | 20.23 | 81.06 | 83.01 | 70.59 | 82.23 | 80.37 | 94.08 | 64.43 | 90.10 | 83.69 | 
| TransCASCADE [49] | 17.34 | 82.68 | 86.63 | 68.48 | 87.66 | 84.56 | 94.43 | 65.33 | 90.79 | 83.52 | 
| MetaSeg-B [32] | - | 82.78 | - | - | - | - | - | - | - | - | 
| VM-UNet [10] | 19.21 | 81.08 | 86.40 | 69.41 | 86.16 | 82.76 | 94.17 | 58.80 | 89.51 | 81.40 | 
| Hc-Mamba [50] | 26.34 | 79.58 | 89.93 | 67.65 | 84.57 | 78.27 | 95.38 | 52.08 | 89.49 | 79.84 | 
| Mamba-UNet\(^*\) [33] | 17.32 | 73.36 | 81.65 | 55.39 | 81.86 | 72.76 | 92.35 | 45.28 | 88.22 | 69.35 | 
| PVT-EMCAD-B2 [51] | 15.68 | 83.63 | 88.14 | 68.87 | 88.08 | 84.10 | 95.26 | 68.51 | 92.17 | 83.92 | 
| SelfReg-SwinUnet [46], [52] | - | 80.54 | 86.07 | 69.65 | 85.12 | 82.58 | 94.18 | 61.08 | 87.42 | 78.22 | 
| Zig-RiR (2D) [12] (256\(\times\)256)\(^*\) | 28.51 | 74.61 | 82.73 | 64.19 | 79.13 | 68.25 | 93.71 | 50.14 | 88.46 | 70.29 | 
| RWKV-UNet-T (ours) | 11.89 | 81.87 | 88.10 | 68.42 | 88.88 | 82.61 | 95.15 | 60.52 | 90.23 | 81.09 | 
| RWKV-UNet-S (ours) | 11.01 | 82.93 | 88.30 | 65.92 | 88.47 | 84.11 | 95.45 | 66.51 | 91.51 | 83.17 | 
| RWKV-UNet (ours) | 8.85 | 84.29 | 88.71 | 70.39 | 89.58 | 84.36 | 95.70 | 66.58 | 93.54 | 85.49 | 
| Methods | Average DSC \(\uparrow\) | RV | Myo | LV |
|---|---|---|---|---|
| R50 U-Net [1], [45] | 87.55 | 87.10 | 80.63 | 94.92 | 
| ViT [3] + CUP [28] | 81.45 | 81.46 | 70.71 | 92.18 | 
| R50-ViT [3], [45] + CUP [28] | 87.57 | 86.07 | 81.88 | 94.75 | 
| TransUNet [28] | 89.71 | 88.86 | 84.54 | 95.73 | 
| SwinUNet [46] | 90.00 | 88.55 | 85.62 | 95.83 | 
| MT-UNet [29] | 90.43 | 86.64 | 89.04 | 95.62 | 
| MS-UNet [26] | 87.74 | 85.31 | 84.09 | 93.82 | 
| MISSFormer [5] | 90.86 | 89.55 | 88.04 | 94.99 | 
| LeViT-UNet-384s [48] | 90.32 | 89.55 | 87.64 | 93.76 | 
| PVT-CASCADE [49] | 90.45 | 87.20 | 88.96 | 95.19 | 
| VM-UNet [10] | 91.47 | 89.93 | 89.04 | 95.44 | 
| TransCASCADE [49] | 91.63 | 89.14 | 90.25 | 95.50 | 
| PVT-EMCAD-B2 [51] | 92.12 | 90.65 | 89.68 | 96.02 | 
| SelfReg-SwinUnet [46], [52] | 91.49 | 89.49 | 89.27 | 95.70 | 
| Zig-RiR (2D) [12] (256\(\times\)256)\(^*\) | 87.42 | 86.41 | 81.87 | 93.98 | 
| RWKV-UNet-T (ours) | 91.90 | 90.63 | 88.21 | 96.85 | 
| RWKV-UNet-S (ours) | 91.19 | 89.49 | 87.87 | 96.22 | 
| RWKV-UNet (ours) | 92.29 | 91.26 | 88.78 | 96.83 | 
| Methods | GOALS | FUGC 2025 |
|---|---|---|
| U-Net [1] | 90.99 | 51.96 |
| TransUNet [28] | 91.05 | 79.87 |
| Rolling-UNet-M [22] | 90.84 | 63.39 |
| PVT-EMCAD-B2 [51] | 90.91 | 57.39 |
| PVTFormer [53] | 90.98 | 78.41 |
| UKAN [23] | 90.59 | 72.19 |
| VM-UNet [10] | 91.10 | 72.94 |
| H-vmunet [54] | 84.18 | 65.99 |
| Zig-RiR (2D) [12] | 83.48 | 62.41 |
| RWKV-UNet-T (ours) | 91.34 | 78.67 |
| RWKV-UNet-S (ours) | 91.13 | 79.43 |
| RWKV-UNet (ours) | 91.75 | 81.17 |
Comparison results on the BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC 2017, GLAS, and PSVFM datasets (DSC, %). All baseline experiments are conducted by us; results with error bars are averaged over three runs with different splits.

| Methods | BUSI | CVC-ClinicDB | CVC-ColonDB | Kvasir-SEG | ISIC 2017 | GLAS | PSVFM | Params. | FLOPs |
|---|---|---|---|---|---|---|---|---|---|
| U-Net [1] | 74.90±1.08 | 89.22±3.67 | 83.43±4.79 | 84.94±2.43 | 82.94 | 88.56±0.63 | 68.98 | 7.77M | 12.16G |
| R50-UNet [1], [45] | 78.46±1.52 | 93.60±0.85 | 89.53±2.99 | 90.13±1.16 | 84.78 | 90.64±0.23 | 73.94 | 28.78M | 9.72G |
| Att-UNet [15] | 78.14±1.08 | 90.86±3.58 | 84.75±6.48 | 86.21±1.93 | 83.96 | 89.14±0.48 | 60.61 | 8.73M | 16.78G |
| UNet++ [16] | 75.74±1.60 | 90.07±4.68 | 84.75±3.92 | 85.86±1.41 | 82.27 | 88.93±0.60 | 71.69 | 9.16M | 34.71G |
| TransUNet [28] | 79.35±2.18 | 93.91±2.20 | 90.72±3.50 | 91.34±0.86 | 86.09 | 91.58±0.15 | 74.86 | 105.3M | 33.42G |
| FCBFormer | 80.66±0.26 | 91.15±3.29 | 85.23±4.57 | 91.09±0.83 | 84.56 | 91.13±0.14 | 75.23 | 51.96M | 41.22G |
| UNeXt | 73.19±1.94 | 83.48±2.78 | 78.98±1.86 | 79.99±2.84 | 84.68 | 84.09±0.43 | 56.41 | 1.47M | 0.58G |
| PVT-CASCADE [49] | 78.30±2.75 | 93.45±1.44 | 90.97±2.03 | 91.40±1.12 | 84.77 | 91.85±0.34 | 71.82 | 35.27M | 8.20G |
| Rolling-UNet-M [22] | 77.49±2.20 | 91.91±2.49 | 85.20±5.20 | 87.41±1.74 | 84.42 | 88.12±0.57 | 68.93 | 7.10M | 8.31G |
| PVT-EMCAD-B2 [51] | 79.22±0.93 | 93.39±1.42 | 91.27±1.63 | 90.37±1.04 | 84.05 | 91.77±0.08 | 71.29 | 26.76M | 5.64G |
| U-KAN [23] | 76.95±1.48 | 89.88±2.82 | 84.66±4.62 | 85.78±1.46 | 83.46 | 87.42±0.25 | 66.25 | 25.36M | 8.08G |
| RWKV-UNet-T (ours) | 79.47±1.19 | 95.05±0.64 | 89.56±5.10 | 91.26±2.01 | 85.59 | 91.96±0.30 | 75.23 | 3.15M | 3.70G |
| RWKV-UNet-S (ours) | 80.92±1.51 | 95.58±0.63 | 91.30±2.79 | 91.78±1.19 | 86.38 | 92.11±0.14 | 78.14 | 9.70M | 7.64G |
| RWKV-UNet (ours) | 81.92±0.74 | 95.26±0.61 | 92.27±2.17 | 91.26±1.15 | 85.32 | 92.35±0.07 | 76.39 | 17.13M | 14.50G |
Experiments are conducted on Synapse [55] for multi-organ segmentation in CT images, ACDC [56] for cardiac segmentation in MRI images, GOALS [57] for OCT layer segmentation, BUSI [58] for breast tumor segmentation in ultrasound images, CVC-ClinicDB [59], CVC-ColonDB [60], and Kvasir-SEG [61] for polyp segmentation in endoscopy images, ISIC 2017 [62] for skin lesion segmentation in dermoscopic images, GLAS [63] for gland segmentation in microscopy images, PSVFM [64] for placental vessel segmentation in fetoscopy images, and FUGC [65] for semi-supervised cervix segmentation in ultrasound images. The average Dice Similarity Coefficient (DSC) and average 95% Hausdorff Distance (HD95) are used as evaluation metrics.
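For reference, a minimal sketch of these two metrics for binary masks; this is an illustrative implementation rather than the evaluation code used in the paper, and the naive HD95 assumes both masks are non-empty.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dice_coefficient(pred, gt, eps=1e-5):
    """DSC between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def hd95(pred, gt, spacing=1.0):
    """95th-percentile symmetric Hausdorff distance between mask boundaries
    (naive O(n^2) reference; dedicated libraries are normally used instead)."""
    def boundary(mask):
        mask = mask.astype(bool)
        return mask & ~binary_erosion(mask)
    p = np.argwhere(boundary(pred)) * spacing
    g = np.argwhere(boundary(gt)) * spacing
    d = cdist(p, g)
    return np.percentile(np.hstack([d.min(axis=1), d.min(axis=0)]), 95)
```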
Synapse uses 18 training and 12 validation CT scans, while ACDC uses 70 training, 10 validation, and 20 testing MRI scans, following settings in [28].
GOALS uses 100 training, 100 validation and 100 test images following the original setting in the challenge.
BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG are split into training, validation, and testing sets with an 8:1:1 ratio, repeated with three different random seeds.
ISIC 2017 uses 2000 training, 150 validation, and 600 testing images as per the original challenge setup.
GlaS uses 85 training and 80 testing whole-slide images, with the training set further split into training and validation with a 9:1 ratio across three random seeds.
PSVFM uses a video-based split: Videos 1–3 for training, Video 4 for validation, and Videos 5–6 for testing, reflecting real-case variability in fetoscopic surgeries.
Building on [66], FUGC uses 50 labeled and 450 unlabeled images for training, 90 for validation, and 300 for testing.
All experiments are performed on an NVIDIA Tesla V100 with 32 GB memory. Input images are resized to 224\(\times\)224 for Synapse and ACDC, and 256\(\times\)256 for binary segmentation tasks. The total number of training epochs is 30 for Synapse, 150 for ACDC, and 300 for binary segmentation. The batch size is 24 for Synapse and ACDC (32 for RWKV-UNet-T on Synapse to avoid a CUDA memory problem), and 8 for GOALS and binary segmentation tasks. The initial learning rate is \(1\times10^{-3}\) for Synapse and FUGC 2025, \(5\times10^{-4}\) for ACDC, and \(1\times10^{-4}\) for GOALS and binary segmentation experiments. The minimum learning rate is 0 for Synapse and ACDC, and \(1\times10^{-5}\) for GOALS, binary segmentation, and semi-supervised experiments. The AdamW [43] optimizer and CosineAnnealingLR [67] scheduler are used. The baseline results on GOALS and the binary segmentation tasks are run by us.

The loss function for supervised tasks is a mixed loss combining cross-entropy (CE) loss and Dice loss [20]: \[\mathcal{L}=\alpha\, \mathrm{CE}(\hat{y}, y)+\beta\, \mathrm{Dice}(\hat{y}, y),\] where \(\alpha\) and \(\beta\) are 0.5 and 0.5 for Synapse, 0.2 and 0.8 for ACDC, and 0.5 and 1 for GOALS and binary segmentation. Experiments for PVT-CASCADE [49] and PVT-EMCAD-B2 [51] use a deep supervision strategy [68]. For semi-supervised tasks, the supervised loss on labeled data remains \(\mathcal{L}_{\text{sup}} = 0.5\,\mathrm{CE}(\hat{y},y) + \mathrm{Dice}(\hat{y},y)\). In addition, a consistency loss is applied on unlabeled data, computed as the mean squared error (MSE) between predictions under different augmentations after inverting the transformations. The learning rate for TransUNet and PVTFormer is set to \(1\times10^{-4}\) on FUGC 2025, which yields better results.
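A minimal sketch of the supervised loss; the soft multi-class Dice formulation is one common variant (the paper cites [20] without spelling out the exact form), and the class count in the example is illustrative.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, num_classes, eps=1e-5):
    """Soft multi-class Dice loss over softmax probabilities."""
    probs = F.softmax(logits, dim=1)                              # (B, C, H, W)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * one_hot).sum(dims)
    union = probs.sum(dims) + one_hot.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def mixed_loss(logits, target, num_classes, alpha=0.5, beta=0.5):
    """L = alpha * CE + beta * Dice; (alpha, beta) = (0.5, 0.5) for Synapse,
    (0.2, 0.8) for ACDC, and (0.5, 1.0) for GOALS and binary segmentation."""
    ce = F.cross_entropy(logits, target)
    return alpha * ce + beta * dice_loss(logits, target, num_classes)

# Illustrative shapes: 9-class prediction at 224x224 (8 organs + background is an assumption).
logits = torch.randn(2, 9, 224, 224)
target = torch.randint(0, 9, (2, 224, 224))
loss = mixed_loss(logits, target, num_classes=9)
```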
Table 3 shows that RWKV-UNet excels in abdominal organ segmentation on the Synapse dataset, achieving the highest average DSC of 84.29% and surpassing all SOTA CNN-, Transformer-, and Mamba-based methods. It also achieves an HD95 of 8.85, significantly better than other models, demonstrating a strong ability to localize organ boundaries accurately. This can be attributed to the model’s ability to capture both long-range dependencies and local features. The smaller models, RWKV-UNet-T and RWKV-UNet-S, also achieve remarkable results. The DSC, parameters, and FLOPs of each model are shown in Fig. 4, and visualization results are shown in Fig. 5.
Table 4 shows that our model outperforms all SOTA models in the ACDC MRI cardiac segmentation task, achieving the best results in RV and the second-best in LV categories. The smaller models, RWKV-UNet-T and RWKV-UNet-S, also perform well.
Table 5 shows that on the small GOALS dataset, performance differences are minor and some carefully designed models even lag behind the classic U-Net, while our RWKV-UNet still achieves the highest average DSC, demonstrating its robustness and superiority. These results also reflect the model’s ability to handle small datasets.
The comparison table on BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC 2017, GLAS, and PSVFM shows that the RWKV-UNet series performs remarkably well on breast lesion, polyp, skin lesion, gland, and vessel segmentation, with far fewer parameters and much lower computation than TransUNet and FCBFormer.
Table 5 also shows that the RWKV-UNet achieves SOTA performance on the FUGC 2025 dataset, further indicating strong data efficiency and generalization ability, as it can fully leverage limited labeled data and mine useful information from unlabeled data to maintain high segmentation accuracy.
We replace our encoder with other widely used pre-trained encoders while keeping the rest of the network architecture unchanged (feature dimensions vary according to the encoder). Table 6 shows that our pre-trained encoder achieves the best DSC and HD95 on the Synapse dataset, with significantly lower FLOPs.
| Encoder | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs (G) |
|---|---|---|---|
| ResNet-50 [45] | 9.99 | 82.27 | 62.44 |
| PVT-B3-v2 [69] | 11.47 | 81.23 | 22.74 |
| PVT-B5-v2 [69] | 10.53 | 80.14 | 27.38 |
| ConvNeXt-Small [70] | 12.79 | 76.34 | 42.47 |
| ConvNeXt-Base [70] | 10.61 | 81.42 | 74.86 |
| MaxViT-Small [71] | 10.57 | 83.71 | 18.54 |
| MaxViT-Base [71] | 11.09 | 82.89 | 29.86 |
| RWKV-UNet Enc-B (ours) | 8.85 | 84.29 | 11.11 |
Experimental results in Table 7 indicate that pre-training is essential: it significantly improves feature extraction by enabling the encoder to capture intricate patterns and hierarchical information. Moreover, the inductive biases of RWKV are not as strong as those of CNNs, so, like Transformers, it relies on pre-training.
| Pre-training | Training Epochs | HD95 \(\downarrow\) | DSC \(\uparrow\) |
|---|---|---|---|
| w/o | 30 | 36.23 | 71.66 | 
| w/o | 100 | 25.01 | 78.14 | 
| w/ | 30 | 8.85 | 84.29 | 
| Kernel Size | Synapse HD95 \(\downarrow\) | Synapse DSC \(\uparrow\) | ACDC DSC \(\uparrow\) | FLOPs |
|---|---|---|---|---|
| 3 | 10.35 | 84.83 | 92.07 | 10.97G | 
| 5 | 8.69 | 84.67 | 91.95 | 11.00G | 
| 7 | 9.97 | 83.73 | 91.61 | 11.05G | 
| 9 | 8.85 | 84.29 | 92.29 | 11.11G | 
| 11 | 11.25 | 82.67 | 92.04 | 11.19G | 
We evaluate the effect of using attention in the shallow layers of RWKV-UNet-S by adding Spatial Mix to the first two stages and comparing against the original CNN-based shallow layers. All models are trained for 100 epochs from scratch, without pre-training. Table 9 shows that shallow attention yields negligible improvement on Synapse while introducing higher computational cost, indicating that CNNs are sufficient for capturing low-level features in the shallow layers.
| Shallow Attention | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|
| w/ | 29.43 | 76.88 | 6.91G | 
| w/o (ours) | 28.92 | 76.41 | 5.86G | 
In VRWKV [9], a Channel Mix module is used after the Spatial Mix. In contrast, the GLSP blocks in our RWKV-UNet encoder omit Channel Mix, as the pointwise convolution in the output layer already performs channel mixing and feed-forward operations. We compare results on the Synapse dataset (without pre-training, 100 epochs). Table 10 demonstrates that Channel Mix is not essential for maintaining performance.
| Channel Mix | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|
| w/ | 35.19 | 74.41 | 19.43G | 
| w/o (ours) | 25.01 | 78.14 | 11.11G | 
Comparison results of different skip connection designs for RWKV-UNet on the Synapse dataset are shown in Table 11. The experiments vary the number of skip connections and the inclusion of the CCM block. We prioritize retaining shallow skip connections because shallow layers tend to suffer greater information loss. The results show that, as with other U-Net variants, adequate skip connections are essential for RWKV-UNet’s performance. Furthermore, the CCM module significantly improves segmentation results: with only two skips, the model still achieves 84.28 DSC with CCM. Admittedly, the CCM module increases the computational load to some extent due to the aggregation of global information.
| Skips | CCM Block | HD95 \(\downarrow\) | DSC \(\uparrow\) | FLOPs |
|---|---|---|---|---|
| 0 | w/o | 12.72 | 69.83 | 9.18G |
| 1 | w/o | 10.83 | 78.27 | 9.33G |
| 1 | w/ | 10.37 | 83.19 | 10.96G |
| 2 | w/o | 11.32 | 80.11 | 9.41G |
| 2 | w/ | 11.62 | 84.28 | 11.04G |
| 3 | w/o | 12.56 | 82.40 | 9.48G |
| 3 (ours) | w/ (ours) | 8.85 | 84.29 | 11.11G |
Based on the results in Table 12, the hidden rate of the Channel Mix layer in the CCM module materially affects performance: increasing the rate from 1 to 2 yields a notable improvement, while further increases bring no gains and only add FLOPs.
The comparison results of different methods with 512\(\times\)512 input on the Synapse dataset are shown in Table 13. The average Dice scores of the other methods are taken from [72]. Our results indicate that increasing the input resolution improves the performance of our model. Furthermore, compared to TransUNet, the computational overhead associated with the resolution increase is less severe, showcasing the efficiency of RWKV-UNet at higher resolutions.
| Method | Average DSC \(\uparrow\) | FLOPs |
|---|---|---|
| U-Net [1] | 81.34 | - | 
| Pyramid Attention [73] | 80.08 | - | 
| DeepLabv3+ [74] | 82.50 | - | 
| UNet++ [16] | 81.60 | - | 
| Attention U-Net [15] | 80.88 | - | 
| nnU-Net [24] | 82.92 | - | 
| TransUNet [28] | 84.36 | 148.29G | 
| SAMed_h [75] | 84.30 | 783.98G | 
| RWKV-UNet (ours) | 86.73 | 58.05G | 
In this study, we introduce RWKV-UNet, a novel architecture that integrates the RWKV architecture with U-Net. By combining the strengths of convolutional networks for local feature extraction and RWKV’s ability to model global context, our model significantly improves medical image segmentation accuracy. The proposed enhancements, including the GLSP module and the CCM module, contribute to a more precise representation of features and information fusion across different scales. Experimental results on 11 datasets demonstrate that RWKV-UNet surpasses SOTA methods. Its variants (RWKV-UNet-S and RWKV-UNet-T) offer a practical balance between performance and computational efficiency. Our approach has a strong potential to advance medical image analysis, particularly in clinical settings where both accuracy and efficiency are paramount.
Limitations and Future Work. RWKV-UNet is a powerful 2D medical image segmentation model that effectively combines RWKV with convolutional operations; however, it is currently not applicable to 3D imaging. In the future, we plan to extend the model to 3D for handling volume data and to explore the potential of RWKV-based foundational models for medical imaging. We also aim to develop ultra-lightweight RWKV-based models tailored for point-of-care applications, preserving segmentation accuracy while enhancing adaptability and speed further.