June 06, 2025
Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. The defining component of the UNet family is the skip connection; however, skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multistep method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of the encoder and decoder architectures, making it adaptable to various U-Net-like networks. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code is available at https://github.com/nayutayuki/FuseUNet.
Medical image segmentation is a crucial branch of computer vision, and with the development of deep learning, many excellent segmentation methods have emerged. UNet [1] is a milestone network architecture in this field, characterized by its symmetric encoder-decoder convolutional network structure and the skip connections for integrating features from different scales.
The emergence of UNet laid the foundation for segmentation networks, influencing the design of many existing architectures. For simplicity, we refer to these networks as "U-Nets". Recent popular improvement strategies for U-Nets have primarily focused on enhancing the information processing capabilities of the encoder and decoder: examples include UNETR [2], Swin-UNet [3], and their variants Swin-UNETR [4] and MetaUNETR [5], which incorporate the Transformer [6]; VM-UNet [7], UltraLight VM-UNet [8], and LKM-UNet [9], which incorporate Mamba [10]; and Rolling-UNet [11] and DHMF-MLP [12], which improve multi-layer perceptrons (MLPs). However, the skip connections in U-Nets are limited to feature fusion within the same scale and lack interaction across different scales, which is a clear limitation. Additionally, the feature fusion strategy in U-Nets relies solely on simple addition or concatenation, which makes it difficult to integrate feature information effectively.
Unfortunately, there has been little research on skip connections in recent years. The most notable recent studies addressing this issue are UNet++ [13] and UNet3+ [14]. UNet++ introduced many intermediate nodes and dense skip connections, using feature summation to integrate features from different scales. UNet3+ built on UNet++ by proposing a full-scale skip connection, where each decoder processes skip connections of all scales. Their experimental results demonstrate the benefits of adding multi-scale information interaction to the UNet structure. In fact, the idea behind U-Nets is quite analogous to the numerical solution of ordinary differential equations (ODEs). In U-Nets, the decoder simulates functional relationships based on multiple known encoder output feature nodes to obtain the corresponding results. This process is similar to computing approximate values of a function at discrete points in an ODE solver. However, the aforementioned networks still follow UNet's approach to feature fusion, which relies on simple concatenation or addition. This straightforward method is analogous to the explicit Euler method [15] in mathematics, which has only first-order accuracy and fails to fully utilize the available feature information. To better leverage known information for feature fusion, we propose in this paper using the linear multistep method [16], [17], a widely used mathematical tool.
In recent years, various discretization methods [18]–[21] involving neural memory ordinary differential equations (nmODEs) [22] have been developed, showing some potential of the linear multistep method in image segmentation tasks. These methods start from the decoder design, enabling information interaction between adjacent stages, which helps feature fusion. However, the discrete methods they use have at most second-order accuracy, rely only on information from adjacent stages, and lack the capability for multi-scale information interaction.
According to the mathematical principles of the linear multistep method, we propose a novel multi-scale feature fusion method for skip connections: treating the feature information of skip connections at various scales as a sequence and viewing the decoding process of U-Nets as an initial value problem (IVP). We introduce nmODEs and adapt them to this problem by designing an adaptive high-order method that discretizes the process. This method processes the sequence of skip connections using discrete nmODEs and ultimately generates the segmentation map. Because it relies only on the skip connections, the method is not limited to particular types of encoders and decoders, making it theoretically applicable to a wide range of U-Net variants.
To demonstrate the effectiveness and generalizability of the proposed method, we applied it to three mainstream U-Nets based on convolution, Transformer, and Mamba architectures. Experiments on 3D tasks (ACDC, KiTS2023, MSD brain tumor) as well as 2D tasks (ISIC2017 and ISIC2018 skin lesion segmentation tasks), show that the proposed method significantly improves the utilization efficiency of the information extracted by encoders, while drastically reducing network parameters and maintaining network performance.
UNet [1], as the foundation of all U-Nets, features a symmetric encoder-decoder structure with skip connections that facilitate information flow between them. Based on its structural characteristics, improvements to UNet generally follow two main directions: optimizing the connection strategy within the network architecture or introducing more efficient modules in the encoders and decoders.
In the first direction, previous studies primarily focused on increasing connection density. ResUNet [23] and DenseUNet [24] achieved this by incorporating residual and dense connections within the encoder and decoder. UNet++ [13] introduced numerous intermediate nodes and adopted dense skip connections, while UNet3+ [14] further enhanced UNet++ by designing more comprehensive full-scale skip connections. In recent years, research in this area has received less attention due to the rise of advanced modules such as Transformer [6], shifting the focus towards the second direction.
A representative approach in the second direction is the introduction of attention mechanisms. For instance, TransUNet [25] and UNETR [2] designed encoder architectures based on Transformer, while retaining convolutional decoders. Swin-UNet [3], on the other hand, used a pure Transformer approach, and Swin-UNETR [4] built on UNETR by introducing a shifted-window attention mechanism. MetaUNETR [5] proposed the TriCruci module in an attempt to create a unified segmentation framework. Additionally, the Mamba [10] architecture gained attention due to its lower complexity compared to Transformer. LKM-UNet [9] demonstrated the feasibility and effectiveness of using large Mamba kernels to achieve a large receptive field, and UltraLight VM-UNet [8] introduced a new Parallel Vision Mamba layer on top of VM-UNet [7], significantly reducing the number of parameters and computational load in skin lesion segmentation tasks. Furthermore, MLP-based [11], [12] and other mechanisms have also been explored in segmentation models, although they are not mainstream. However, all these networks share a common limitation: the lack of information interaction across different scales in their skip connections.
Ordinary differential equation (ODE) systems, a key class of dynamical systems, have been widely studied in mathematics and physics. Neural ODEs (NODEs) [26] offered a mathematical framework to interpret ResNet, transforming it from a black box into a comprehensible model by viewing neural networks as ODE representations.
However, NODEs face limitations, as models using data as initial values can only learn features within the same topological space [27]. Furthermore, attractors in dynamical systems are linked to memory capacity [28], [29], but conventional NODEs cannot fully exploit this capacity. To address these limitations, nmODEs [22] extended NODEs by enhancing nonlinear expression through implicit mapping and nonlinear activation functions. Unlike traditional NODEs, nmODEs treat inputs as external parameters, not initial values. They separate the neuron’s function into learning and memory components: learning happens in the learning part, while the memory part maps inputs to global attractors, linking input space to memory space.
nmODEs have been successfully applied to various segmentation tasks. For example, BiFNN [20] employed bidirectional skip connections and a nonparametric backward path based on nmODEs, which improved image recognition performance over models such as ResNet and Vision Transformer. nmPLS-Net [30] utilized the nonlinear representation and memory capabilities of nmODEs for edge-based decoding, achieving precise lung lobe segmentation. Incorporating nmODEs into UNet through simple discretization has also shown promising results in diabetic kidney [31] and skin cancer lesion segmentation [18], [19], [21]. Furthermore, nmODEs have improved the robustness of medical image segmentation [32], demonstrating their versatility and effectiveness across diverse medical segmentation challenges.
The traditional structure of U-Nets, as shown in Fig. 1, clearly indicates that the skip connections only communicate information within the same stage. Consider a U-Net with \(L\) stages; in the classic UNet, for example, \(L = 5\). The skip connection at the same stage of the encoder is denoted as \(X_i\), and the output of the decoder is denoted as \(Y_i\). The abstracted mathematical model can be expressed as: \(Y_i = f(X_i, Y_{i-1})\). In this formulation, only the result from the previous step \(Y_{i-1}\) and the features from the current step \(X_i\) are used. This single-step computation method is simple and intuitive, but it fails to leverage the skip connections from earlier stages and the results computed before \(Y_{i-1}\), and its accuracy is limited. To fully utilize the information from the skip connections and improve accuracy, a natural next step is to adopt a multistep approach. This would allow the network to incorporate information from both past and current steps, providing more comprehensive feature integration and ultimately enhancing performance.
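To make the abstraction concrete, the following minimal NumPy sketch implements \(Y_i = f(X_i, Y_{i-1})\) as concatenation followed by a fixed random channel-mixing map; the fusion map, channel count, and identical spatial sizes across stages are illustrative simplifications (a real U-Net uses learned convolutions and upsampling between stages):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8                      # channels and spatial size of one stage

def f(x_i, y_prev):
    """One conventional decoder step: fuse the same-scale skip features x_i
    with the previous decoder output y_prev by concatenation, then mix
    channels with a random linear map (a stand-in for a learned conv)."""
    z = np.concatenate([x_i, y_prev], axis=0)            # (2C, H, W)
    w = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C) # channel-mixing weights
    return np.einsum('co,ohw->chw', w, z)                # back to (C, H, W)

# Decode a 3-stage sequence of skip features; note that computing Y_i
# sees only X_i and Y_{i-1}: earlier skip features and results are discarded.
xs = [rng.standard_normal((C, H, W)) for _ in range(3)]
y = np.zeros((C, H, W))
for x in xs:
    y = f(x, y)
print(y.shape)  # (4, 8, 8)
```

The loop makes the limitation explicit: each iteration overwrites `y`, so no information from stages before the previous one survives.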
The linear multistep method is a classical numerical method that uses information from the current time step and several previous time steps to predict the solution at the next time step. This aligns perfectly with the idea of fully utilizing the skip connection information \(X_i\) and the computed values \(Y_i\) from U-Nets. The linear multistep method is stated as follows:
Theorem 1. Linear Multistep Method [16], [17]. Given the derivative \(\dot{y}(t)=F(t, y(t)), y(t_0) = y_0\), choose a step size \(\delta\) along the \(t\)-axis and set \(t_{n+i}=t_n+i \cdot \delta\), yielding approximations \(y_i \approx y(t_i)\). Multistep methods use information from the previous \(s\) steps to calculate the next value: \[\label{LLM} \sum _{j=0}^{s}a_{j}y_{n+j}=\delta \sum _{j=0}^{s}b_{j}F(t_{n+j},y_{n+j}),\tag{1}\]
with \(a_s=1\). The coefficients \(a_0,\dotsc ,a_{s-1}\) and \(b_0,\dotsc ,b_{s}\) determine the method.
In Theorem 1, if \(b_s=0\), then the method is called "explicit", since the formula can directly compute \(y_{n+s}\). If \(b_s\neq 0\), then the method is called "implicit", since the value of \(y_{n+s}\) depends on the value of \(F(t_{n+s},y_{n+s})\), and the equation must be solved for \(y_{n+s}\).
| Explicit: Adams-Bashforth (AB) Method | ||
|---|---|---|
| step | order | equation |
| 1 | 1 | \(y_{n+1}= y_n + \delta\cdot F_{n}\) |
| 2 | 2 | \(y_{n+2} = y_{n+1}+\frac{\delta}{2}\cdot (3F_{{n+1}} - F_{n})\) |
| 3 | 3 | \(y_{n+3} = y_{n+2}+\frac{\delta}{12}\cdot (23F_{{n+2}} -16F_{{n+1}} + 5F_{n})\) |
| 4 | 4 | \(y_{n+4} = y_{n+3}+\frac{\delta}{24}\cdot (55F_{{n+3}} - 59F_{{n+2}} + 37F_{{n+1}} - 9F_{n})\) |
| Implicit: Adams-Moulton (AM) Method | ||
| step | order | equation |
| 1 | 2 | \(y_{n+1}= y_n + \frac{\delta}{2}\cdot (F_{n} + F_{{n+1}})\) |
| 2 | 3 | \(y_{n+2} = y_{n+1} + \frac{\delta}{12}\cdot (5F_{{n+2}} + 8F_{{n+1}} - F_{n})\) |
| 3 | 4 | \(y_{n+3} = y_{n+2} + \frac{\delta}{24}\cdot (9F_{{n+3}} + 19F_{{n+2}} - 5F_{{n+1}} + F_{n})\) |
Increasing the number of steps (i.e., incorporating more previous time points) raises the order of the linear multistep method, thereby improving its theoretical accuracy. However, it may also reduce stability, cause error accumulation, and increase dependence on initial values. Therefore, the highest order typically used for the linear multistep method is 4. For simplicity, we write \(F(t_n,y_n)\) as \(F_n\); the specific coefficients for the linear multistep method are shown in Table 1. In addition, the explicit method is denoted as \(AB_i\) and the implicit method as \(AM_i\) in Section 3.2, where \(i\) corresponds to the step number in Table 1.
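The coefficient rows of Table 1 translate directly into code. The sketch below hard-codes the Adams-Bashforth and Adams-Moulton coefficients (each list already divided by its common denominator) and applies one generic update; the function names and list conventions are our own:

```python
# Coefficients from Table 1, listed newest derivative first after reversal:
# y_{n+s} = y_{n+s-1} + delta * (c_0 * F_newest + c_1 * F_next + ...).
AB = {1: [1.0],
      2: [3/2, -1/2],
      3: [23/12, -16/12, 5/12],
      4: [55/24, -59/24, 37/24, -9/24]}
AM = {1: [1/2, 1/2],
      2: [5/12, 8/12, -1/12],
      3: [9/24, 19/24, -5/24, 1/24]}

def ab_step(y, Fs, delta, s):
    """Explicit s-step Adams-Bashforth update; Fs is ordered oldest to
    newest, so Fs[-1] is the most recent derivative."""
    return y + delta * sum(c * F for c, F in zip(AB[s], Fs[::-1]))

def am_step(y, Fs, delta, s):
    """Implicit s-step Adams-Moulton update; Fs[-1] is the derivative at
    the new node (supplied by a predictor in practice)."""
    return y + delta * sum(c * F for c, F in zip(AM[s], Fs[::-1]))
```

For instance, `ab_step(1.0, [2.0], 0.1, 1)` is the explicit Euler update \(1 + 0.1 \cdot 2 = 1.2\), and `am_step(1.0, [1.0, 2.0], 0.1, 1)` is the trapezoidal update \(1 + 0.05 \cdot (1 + 2) = 1.15\).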
When the number of steps is the same, implicit methods consistently achieve higher accuracy than explicit methods. Consequently, we strive to use implicit methods whenever possible. However, implicit methods require the derivative values at the current node, which are unknown at the current step. The predictor-corrector method provides a mathematical solution to this problem. The corresponding theorem is presented as follows:
Theorem 2. Predictor-Corrector Method [33], [34]. Consider the differential equation \(\dot{y}(t)=F(t, y(t)), y(t_0) = y_0\), and denote the step size by \(\delta\). First, starting from the current value \(y_i\), calculate an initial guess \(\overline{y_{i+1}}\) via an explicit method. Next, improve the initial guess using the corresponding implicit method. For example, using the 1-step methods: \[\label{eqn:1step} \begin{align} \overline{y_{n+1}} &= y_n+\delta\cdot F(t_n,y_n)\\ y_{n+1}&= y_n + \frac{\delta}{2}\cdot (F(t_n,y_n)+F(t_{n+1},\overline{y_{n+1}})), \end{align}\tag{2}\] where \(y_n \approx y(t_n)\), \(\overline{y_{n+1}}\) is the predicted intermediate value, and \(y_{n+1}\) is the final result obtained by correcting \(\overline{y_{n+1}}\).
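As a sketch of Eq. (2), the following applies the one-step predictor-corrector pair (explicit Euler predicts, the trapezoidal Adams-Moulton rule corrects) to the test equation \(\dot{y} = -y\); the step size and test problem are arbitrary choices for illustration:

```python
import math

def predictor_corrector_step(y, t, delta, F):
    """One step of Eq. (2): explicit Euler predictor, trapezoidal corrector."""
    f0 = F(t, y)
    y_pred = y + delta * f0                              # AB1 predictor
    return y + delta / 2 * (f0 + F(t + delta, y_pred))   # AM1 corrector

# Integrate y' = -y, y(0) = 1 up to t = 1; the exact solution is e^{-1}.
F = lambda t, y: -y
delta, y = 0.1, 1.0
for n in range(10):
    y = predictor_corrector_step(y, n * delta, delta, F)
print(abs(y - math.exp(-1)))   # ~7e-4: second-order accuracy
```

With the same step size, plain explicit Euler leaves an error roughly two orders of magnitude larger, which is the accuracy gap the predictor-corrector strategy closes.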
Thus, more accurate solutions can be obtained using implicit methods.
The linear multistep method is a mathematical tool for solving initial value problems (IVPs) in ordinary differential equations (ODEs). To relate the architecture of U-Nets to ODEs, existing methods typically introduce differential equations starting from the decoder. Following this idea, we define the differential relationship between the skip connections \(X\) and the corresponding stages \(Y\) as \(F\), but we focus on fusing multi-scale information starting from the skip connections. The function \(F\) is explained in detail in Section 3.3. In this section, we first describe the proposed discrete method and the overall framework applied in an \(L\)-stage U-Net, as shown in Fig. 2 (a).
First, we map the data from each stage of the U-Net to the elements of the linear multistep method: we treat the process where the decoder reconstructs the features extracted by the encoder into predicted labels as solving an IVP. \(X_i\) represents a series of discrete nodes, and \(Y_i\) represents the corresponding solution at each node, \(1 \leq i \leq L\). \(Y_{final}\) is the solution to the IVP. The bottom stage represents the start of the IVP, where we initialize \(Y_1 = 0\), representing an empty memory stream. As the IVP progresses, we gradually fill in information on the empty memory stream, first mapping the coarse high-level features to the feature map, and then filling in the finer low-level details.
Then, we apply the mathematical ideas of the linear multistep method to process features at multiple scales. To ensure effective interaction between features at different scales, it is crucial to incorporate as much information from previous stages as possible and to employ high-accuracy implicit methods when deriving the output of a given decoder stage. When the current stage number is greater than 4 and not the last stage, a 4-step implicit method can be straightforwardly chosen for prediction-correction. However, the linear multistep method first needs to compute the values \(Y_l, l<i\) of the previous steps using a single-step method and then substitute the calculated derivatives into the formula. This single-step initialization is still limited to the same scale, which does not align with our goal of multi-scale feature interaction. Therefore, we modify the initialization of the linear multistep method: we repeatedly apply lower-step linear multistep methods during the startup phase of the high-step method. Specifically, when calculating \(y_{i+1}\), we use the \(i\)-step implicit method; for \(i > 4\), the 4-step implicit method is applied until the \(L\)-th stage, where the final computation is performed using an explicit method. This algorithm is outlined in Algorithm 1, and the detailed computational process is provided in Appendix 6.
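The adaptive-order scheme described above can be sketched on scalar stand-ins as follows; the 4-step Adams-Moulton coefficients (beyond those listed in Table 1) are the standard ones, \(f = \tanh\) and \(g = \mathrm{id}\) are toy choices, and all names are illustrative rather than taken from the released implementation:

```python
import math

# Adams coefficients, newest derivative first; each list already divided
# by its denominator. AM[4] is the standard 4-step Adams-Moulton rule.
AB = {1: [1.0], 2: [3/2, -1/2], 3: [23/12, -16/12, 5/12],
      4: [55/24, -59/24, 37/24, -9/24]}
AM = {1: [1/2, 1/2], 2: [5/12, 8/12, -1/12], 3: [9/24, 19/24, -5/24, 1/24],
      4: [251/720, 646/720, -264/720, 106/720, -19/720]}

def adams(y, Fs_newest_first, delta, coeffs):
    return y + delta * sum(c * F for c, F in zip(coeffs, Fs_newest_first))

def decode(X, F, delta):
    """Adaptive-order decoding sketch: Y_1 = 0 is the empty memory stream;
    Y_{i+1} uses the min(i, 4)-step implicit method (an explicit step
    predicts, the implicit rule corrects), and the final stage finishes
    with a plain explicit step, as described in the text."""
    L = len(X)
    Y, Fs = [0.0], [F(X[0], 0.0)]           # Fs[0] = F(X_1, Y_1)
    for i in range(1, L):
        s = min(i, 4)
        recent = Fs[::-1][:s]               # newest derivative first
        if i < L - 1:
            y_bar = adams(Y[-1], recent, delta, AB[s])               # predict
            F_new = F(X[i], y_bar)
            Y.append(adams(Y[-1], [F_new] + recent, delta, AM[s]))   # correct
        else:
            Y.append(adams(Y[-1], recent, delta, AB[s]))             # explicit finish
        Fs.append(F(X[i], Y[-1]))
    return Y[-1]

# Toy run: scalar stand-in for the nmODE right-hand side F(X, Y).
F = lambda x, y: -y + math.tanh(y + x)
y_final = decode([0.8, 0.6, 0.5, 0.3, 0.2], F, 1 / 5)
```

The order therefore grows with the stage index, so each decoder stage consumes derivatives from all available earlier scales rather than only the adjacent one.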
Finally, we apply a convolution layer to map \(Y_{final}\) to the segmentation map.
| Dataset | Region | Data Type | Number of Samples | Target Classes |
|---|---|---|---|---|
| ACDC [35] | Heart | 3D MRI | 100 | 3 (1-MYO, 2-RV, 3-LV) |
| KiTS23 [36] | Kidney | 3D CT | 559 | 3 (1-Kidney, 2-Cyst, 3-Tumor) |
| MSD [37] | Brain | 3D MRI | 484 | 3 (1-WT, 2-ET, 3-TC) |
| ISIC2017 [38] | Skin | 2D Dermoscopy | 2000 | 1 (Lesion skin area) |
| ISIC2018 [39] | Skin | 2D Dermoscopy | 2594 | 1 (Lesion skin area) |
nmODEs divide the neuron into two parts: a learning neuron and a memory neuron. The input data is passed to the learning neuron to extract features, while the memory carrier records the extracted information in the memory neuron. This design has three advantages. nmODEs inherently prevent the learned features from being isomorphic to the input data space. Additionally, we have identified two other advantages that are particularly relevant to this work:
First, the separation of learning and memory neurons in nmODEs allows it to retain a continuous memory flow while still maintaining the ability to process external inputs. Therefore, features extracted by the encoders from different scales can be fed into the learning neuron of nmODEs as a sequence of nodes, and the memory carrier is updated with the numerical solutions corresponding to each node.
Second, the memory carrier ultimately outputs the target data but does not participate in the feature extraction process. As a result, it does not need to retain much information other than the final output. This reduces the required number of channels significantly compared to the decoder of traditional U-Nets, thus greatly reducing computational costs.
The equation for nmODEs is given as follows: \[\label{eqn:nmODE} \dot{Y}_t=-Y_t + f\left(Y_t+g(X_t)\right) ,\tag{3}\] where \(X_t\) represents the external input at time \(t\), \(Y_t\) represents the memory flow carried by the memory neuron at time \(t\), with the initial condition \(Y_0 = 0\) representing an empty memory flow, \(f(\cdot)\) is a non-linear mapping that combines the existing memory and the new external input to update the memory flow using a differential equation numerical solver, and \(g(\cdot)\) is the function that processes the external input \(X_t\) at each time step. When applying nmODEs to discrete U-Nets, the following equation holds: \[\label{diseqn:nmODE} \dot{Y}_i=-Y_i + f\left(Y_i+g(X_i)\right) .\tag{4}\] In U-Nets, the \(X_i\) and \(Y_i\) in Eq. (4) correspond to those in Section 3.2. The function \(g(\cdot)\) uses convolution and interpolation to align the channels and shape of \(X_i\) and \(Y_i\), while the choice of \(f(\cdot)\) depends on the network to which this equation is applied. By substituting Eq. (3) into the linear multistep method proposed in Algorithm 1, we can obtain the approximate solution for each stage. Following this process, we designed three modules: Predictor-Corrector, Calculator, and the nmODEs block, as shown in Fig. 2 (b), (c), and (d), respectively.
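A minimal discrete sketch of Eq. (4): here \(f\) is taken as tanh, and \(g\) is approximated by channel averaging plus nearest-neighbour interpolation, both hypothetical stand-ins for the learned convolution and interpolation described in the text:

```python
import numpy as np

def g(x, target_shape):
    """Stand-in for g(.): align the external input's channels and spatial
    shape with the memory flow Y. Here we collapse input channels by
    averaging, replicate to the target channel count, and resample
    spatially by nearest-neighbour indexing (toy substitute for a conv)."""
    c, h, w = target_shape
    x = np.resize(x.mean(axis=0), (c, x.shape[1], x.shape[2]))
    ri = np.arange(h) * x.shape[1] // h
    ci = np.arange(w) * x.shape[2] // w
    return x[:, ri][:, :, ci]

def F(x, y):
    """Right-hand side of the discrete nmODE, Eq. (4):
    dY/di = -Y + f(Y + g(X)), with f chosen as tanh for this sketch."""
    return -y + np.tanh(y + g(x, y.shape))
```

Because the memory flow `y` only needs to carry the output (Section 3.3), its channel count can be far smaller than the encoder feature channels that `x` carries, which is where the parameter savings come from.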
The currently popular U-Nets are primarily based on three architectures: convolution, Transformer, and Mamba. To evaluate the effectiveness and generalizability of the proposed method, we select a representative backbone network from each of these architectures and perform experiments on the datasets used in their respective original studies. All experiments were conducted on a single RTX 4090. Except for the learning rate, all experimental settings were consistent with those of the backbone networks used for comparison. Due to the significant reduction in the number of parameters, the learning rate was set to 2 or 3 times that of the backbone network’s setting. The detailed hyperparameter settings are provided in Appendix 7.
Table 2 presents the details of the datasets used in this paper. In the table, LV denotes the left ventricle, RV represents the right ventricle, MYO corresponds to the myocardium, ET indicates the enhancing tumor, TC refers to the tumor core, and WT denotes the whole tumor.
nn-UNet. For CNN-based U-Nets, nn-UNet [40], [41] remains the backbone due to its superior applicability, outperforming nearly all other U-Nets across reliable datasets.

UNETR. In Transformer-based U-Nets, UNETR [2], from the MONAI framework, is a key foundation for research and a suitable backbone for our study.

UltraLight VM-UNet. Mamba's efficiency and linear complexity reduce costs compared with Transformers. UltraLight VM-UNet [8] leverages this, achieving strong performance with a lightweight design. We use it to assess our method's potential for further downsizing.
3D Tasks. All 3D tasks report Dice scores (%) using five-fold cross-validation, following the protocol of the backbone network. The performance data for the compared networks are sourced from [2], [41], [42]. Table 3 presents the performance of the proposed method on 3D tasks. The gray-shaded rows in the table indicate FuseUNet and its corresponding backbone network. When using the convolutional nn-UNet as the backbone, FuseUNet achieves a 54.9% reduction in the number of parameters and a 34.3% reduction in GFLOPs, while maintaining the performance of nn-UNet. Although its performance on the ACDC dataset is slightly lower, it surpasses nn-UNet on the larger KiTS dataset. Detailed performance data on each fold is provided in Appendix 9.
| Dataset | Model | Params (M) | GFLOPs | Dice1 | Dice2 | Dice3 | Dice avg |
|---|---|---|---|---|---|---|---|
| ACDC | CoTr [43] | 41.9 | 668.1 | 89.06 | 88.56 | 94.06 | 90.56 |
| | Swin-UNETR [4] | 62.8 | 384.2 | 89.56 | 89.98 | 94.33 | 91.29 |
| | U-Mamba [44] | 173.5 | 1255 | 89.71 | 89.70 | 94.25 | 91.22 |
| | STU-Net-L [45] | 440.3 | 3810 | 90.02 | 89.59 | 94.32 | 91.31 |
| | nn-UNet [41] | 31.2 | 402.6 | 90.11 | 89.96 | 94.55 | 91.54 |
| | FuseUNet (Ours) | 14.0 | 264.9 | 90.18 | 90.05 | 94.49 | 91.57 |
| KiTS23 | nnFormer [46] | 150.1 | 425.8 | 92.27 | 69.78 | 65.53 | 75.86 |
| | Swin-UNETR [4] | 62.8 | 384.2 | 94.48 | 76.80 | 72.53 | 81.27 |
| | U-Mamba [44] | 173.5 | 1255 | 96.08 | 82.84 | 79.77 | 86.23 |
| | STU-Net-L [45] | 440.3 | 3810 | 96.11 | 82.35 | 79.09 | 85.85 |
| | nn-UNet [41] | 31.2 | 402.6 | 96.03 | 82.65 | 79.44 | 86.04 |
| | FuseUNet (Ours) | 14.0 | 264.9 | 96.47 | 83.06 | 79.04 | 86.19 |
| MSD | UNet [1] | 13.4 | 31.1 | 76.6 | 56.1 | 66.5 | 66.4 |
| | UNet3+ [14] | 12.0 | - | 62.2 | 41.4 | 47.8 | 50.5 |
| | TransUNet [25] | 116.5 | - | 70.6 | 54.2 | 68.4 | 64.4 |
| | Swin-UNETR [4] | 62.8 | 384.2 | 70.0 | 52.6 | 70.6 | 64.4 |
| | UNETR [2] | 103.7 | 40.3 | 78.9 | 58.5 | 76.1 | 71.1 |
| | FuseUNet (Ours) | 89.2 | 20.1 | 79.5 | 60.1 | 78.2 | 72.6 |
When using the Transformer-based UNETR as the backbone network, FuseUNet achieves a 13.6% reduction in the number of parameters and a 50% reduction in GFLOPs, while surpassing UNETR’s performance by 1.5%. The smaller reduction in parameters compared to nn-UNet is due to UNETR’s parameters being primarily concentrated in the encoder’s attention module, with the decoder contributing a smaller proportion. Since FuseUNet’s multi-scale feature interaction method is designed around skip connections and does not modify the encoder, the reduction in the number of parameters is relatively modest.
Fig. 4 illustrates the visual segmentation results of FuseUNet, where (a), (b), and (c) correspond to the ACDC, KiTS, and MSD brain tumor datasets, respectively. In Figure 4 (a), nn-UNet frequently makes errors in identifying the right ventricle (RV), sometimes misclassifying unrelated tissues as the RV or missing portions of it entirely. FuseUNet significantly mitigates these issues. Fig. 4 (b) shows that nn-UNet occasionally fails to fully recognize the kidneys and often confuses kidney tumors with cysts. This is particularly evident in the third row, where nn-UNet incorrectly classifies most cysts as tumors, even in the absence of tumor tissue. FuseUNet greatly improves upon this. In Fig. 4 (c), UNETR exhibits challenges such as incomplete tumor recognition and missing enhanced tumor regions, both of which are notably addressed by FuseUNet.
2D Tasks. For 2D tasks, we report Dice, sensitivity, specificity, and accuracy as percentages, following the protocol of the backbone network. The performance data for the compared networks are sourced from [8]. Table 4 shows the performance of the proposed method on 2D tasks, using the lightweight UltraLight VM-UNet with a Mamba architecture as the backbone network. While FuseUNet reduces the number of parameters by 29.8%, its GFLOPs increase by 0.075, with performance comparable to that of UltraLight VM-UNet. The increase in GFLOPs is attributed to the interpolations FuseUNet uses to align the shapes of the skip connection \(X_i\) and the corresponding \(Y_i\). This overhead is negligible when the backbone network has larger GFLOPs but becomes more noticeable in lightweight networks; overall, it amounts to only a fractional increase in GFLOPs.
Fig. 5 illustrates the visual segmentation results of FuseUNet, where (a) and (b) correspond to the ISIC2017 and ISIC2018 datasets, respectively. FuseUNet addresses the false negative and false positive issues frequently observed in UltraLight VM-UNet. Moreover, it captures lesion boundary details that are significantly closer to the ground truth.
| Model | Params (M) | GFLOPs | ISIC2017 Dice | SE | SP | ACC | ISIC2018 Dice | SE | SP | ACC |
|---|---|---|---|---|---|---|---|---|---|---|
| UNet [1] | 13.4 | 31.12 | 89.89 | 87.93 | 98.12 | 96.13 | 88.51 | 87.35 | 97.44 | 95.47 |
| TransFuse [47] | 26.16 | 11.5 | 79.21 | 87.14 | 97.98 | 96.17 | 89.27 | 91.28 | 95.74 | 94.66 |
| MALUNet [48] | 0.177 | 0.085 | 88.96 | 88.24 | 97.62 | 95.83 | 89.31 | 88.90 | 97.25 | 95.48 |
| EGE-UNet [49] | 0.053 | 0.072 | 90.73 | 89.31 | 98.16 | 96.42 | 88.19 | 90.09 | 96.38 | 95.10 |
| VM-UNet [7] | 27.427 | 4.112 | 90.70 | 88.37 | 98.42 | 96.45 | 88.91 | 88.09 | 97.43 | 95.54 |
| UltraLight VM-UNet [8] | 0.049 | 0.060 | 90.64 | 88.85 | 98.38 | 96.63 | 89.40 | 86.80 | 96.38 | 95.10 |
| FuseUNet (Ours) | 0.036 | 0.095 | 90.69 | 89.59 | 98.20 | 96.62 | 89.77 | 89.10 | 97.41 | 95.62 |
This subsection presents only a few visualization results, with additional results provided in Appendix 10.
This experiment was conducted on the first fold of all 3D datasets and the full set of 2D datasets. Figure 6 illustrates the performance (Dice) when different maximum orders of the linear multistep method are used. Since the baseline varies across datasets, we have normalized the data. As the order increases, more skip connections participate in feature information interaction. The results indicate a strong correlation between the network’s performance and the highest order of the applied feature interaction method.
This experiment was conducted on the first fold of the KiTS dataset. Table 5 shows the impact of channel count in \(Y\) on performance, where \(N\) is the number of target classes. Setting \(Y\) to \(2N\) significantly outperforms \(N\), suggesting that some redundancy in memory flow is beneficial. However, \(3N\) reduces performance, and \(4N\) offers only a slight gain, indicating excessive redundancy is unnecessary. As the linear multistep method requires storing multiple intermediate states, memory consumption increases with channel count. Using more than \(4N\) is inefficient, making \(2N\) the optimal balance of performance and cost. Thus, all subsequent experiments adopt \(2N\) for \(Y\).
| Channels | GFLOPs | VRAM (G) | Epoch (s) | Dice |
|---|---|---|---|---|
| N | 264 | 5.2 | 125 | 85.86 |
| 2N | 265 | 5.9 | 128 | 86.74 |
| 3N | 266 | 7.1 | 135 | 86.63 |
| 4N | 267 | 8.7 | 147 | 86.88 |
This paper reinterprets skip connections in U-Nets from a numerical computation perspective for the first time. We propose a multi-scale feature fusion method that integrates the linear multistep method, predictor-corrector method, and nmODEs, dynamically selecting the optimal discretization order for efficient cross-scale fusion. As a skip connection strategy, it is encoder-decoder agnostic and theoretically applicable to any U-Net. Experiments validate its effectiveness across convolution-, Transformer-, and Mamba-based U-Nets, reducing computational costs while preserving backbone performance. Visualizations further highlight improved edge recognition and feature distinction. Our work moves beyond empirical network design and reveals the underlying connection between skip connections and numerical integration methods. This demonstrates that classical numerical computation theory can provide an interpretable mathematical foundation for network structures, offering a new perspective on the cross-layer information propagation mechanism in U-Nets. However, the method's reliance on the linear multistep principle introduces a limitation: the need to store multiple historical solutions leads to high memory consumption, particularly in large-scale problems with numerous target categories. Addressing this challenge remains a key area for future research.
This work was supported by the National Natural Science Foundation of China under Grant 62206189, and the China Postdoctoral Science Foundation under Grant 2023M732427.
This paper proposes a novel perspective for viewing U-Nets and a new method for handling skip connections, which can be widely applied to U-Nets. We hope that this research can contribute to the advancement of medical image segmentation. We do not think this work has any harmful impact.
After publication, we discovered a unit conversion issue in the FLOPs measurement process. Specifically:
- FLOPs were originally measured using a 2D input example, with correct units.
- When testing with a 3D input example in the demo script, we overlooked the unit scaling difference.
- As a result, all reported FLOPs values in the paper are underestimated by a factor of 10.
Importantly, this issue does not affect any experimental conclusions, as the relative comparisons across models and configurations remain valid.
We have updated the GitHub repository accordingly to reflect this correction.
The first time we update the memory stream, i.e., derive \(Y_2\), we use a one-step implicit approach.
Theorem 3. One-step Adams–Moulton method [17]. Given the derivative \(\dot{Y}_t = F_t\), let \(t = t_n\), and choose a step size \(\delta\) along the \(t\)-axis such that \(t_{n+1} = t_n + \delta\). The approximation of \(Y_{t_{n+1}}\) at the next integration point is given by: \[\label{1ie} Y_{t_{n+1}}= Y_{t_n} + \frac{\delta}{2}\cdot (F_{t_n} + F_{t_{n+1}}).\tag{5}\]
For a U-Net with \(L\) stages, let \(X_l\) and \(Y_l\) denote \(X_{t_n}\) and \(Y_{t_n}\) at the corresponding times, \(1 \leq l \leq L\), and set \(\delta = 1/L\). Taking \(X_{t_n} = X_1\) and \(Y_{t_n} = Y_1\), based on Theorem 3, we have: \[\label{1ieu} Y_2= Y_1 + \frac{\delta}{2}\cdot (F_1 + F_2).\tag{6}\]
In the process of calculating \(Y_2\), the term \(F_2 = F(X_2, Y_2)\) is unknown. Therefore, we first need to predict the value of \(Y_2\), substitute it into \(F\) to compute \(F_2\), and then correct the value of \(Y_2\). According to the description in Section 3.3, second paragraph, the known values at this stage are \(X_2\), \(X_1\), \(Y_1\), and \(F_1\). When predicting \(Y_2\), we can only use a one-step explicit method.
Theorem 4. One-step Adams–Bashforth method [16]. Given the derivative \(\dot{Y}_t = F_t\), let \(t = t_n\), and choose a step size \(\delta\) along the \(t\)-axis such that \(t_{n+1} = t_n + \delta\). The approximation of \(Y_{t_{n+1}}\) at the next integration point is given by: \[\label{1ee} Y_{t_{n+1}}= Y_{t_n} + \delta\cdot F_{t_n}.\tag{7}\]
According to Theorem 2 and Theorem 4, we can obtain the predicted value of \(Y_2\), i.e., \(\overline{Y_2}\), and then calculate \(F_2\): \[\label{1eeu} \begin{align} \overline{Y_2} &= Y_1 + \delta\cdot F_1\\ F_2 &= F(X_2,\overline{Y_2}).\\ \end{align}\tag{8}\]
At this point, we have all the information needed to compute \(Y_2\) via Eq. 6; this process yields both \(Y_2\) and \(F_2\). The next step is to compute \(Y_3\) and \(F_3\) using the two-step implicit method.
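This first update can be sketched numerically. The snippet below applies the one-step Adams–Bashforth predictor (Eq. 7) followed by the one-step Adams–Moulton (trapezoidal) corrector (Eq. 5) to a toy scalar dynamic \(f(y) = -y\); the toy \(f\) merely stands in for the network's \(F(X, Y)\), so this is an illustration of the numerical scheme, not the paper's implementation.

```python
# Predict Y_2 with the one-step explicit rule (Eq. 7), then correct it with
# the one-step implicit trapezoidal rule (Eq. 5). f(y) = -y is a toy stand-in
# for the network's F(X, Y).
import math

def predict_correct_step(f, y1, delta):
    """One predictor-corrector step: returns (y2, f2)."""
    f1 = f(y1)
    y2_pred = y1 + delta * f1           # Eq. 7: explicit (Euler) predictor
    f2 = f(y2_pred)                     # evaluate the derivative at the prediction
    y2 = y1 + delta / 2 * (f1 + f2)     # Eq. 5: trapezoidal corrector
    return y2, f2

y2, _ = predict_correct_step(lambda y: -y, 1.0, 0.1)   # dy/dt = -y, y(0) = 1
err = abs(y2 - math.exp(-0.1))          # ~1.6e-4: a second-order-accurate step
```

Combining the explicit prediction with the implicit correction is what makes \(F_2\) computable despite Eq. 6 being implicit in \(Y_2\).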
Theorem 5. Two-step Adams–Moulton method [17]. Given the derivative \(\dot{Y}_t = F_t\), let \(t = t_n\), and choose a step size \(\delta\) along the \(t\)-axis such that \(t_{n+1} = t_n + \delta\), \(t_{n+2} = t_n + 2\delta\). The approximation of \(Y_{t_{n+2}}\) is given by: \[\label{2ie} Y_{t_{n+2}} = Y_{t_{n+1}} + \frac{\delta}{12}\cdot (5F_{t_{n+2}} + 8F_{t_{n+1}} - F_{t_n}).\tag{9}\]
In the U-Net specified before Eq. 6, according to Theorem 5, we have: \[\label{2ieu} Y_3 = Y_2 + \frac{\delta}{12}\cdot (5F_3 + 8F_2 - F_1).\tag{10}\]
Similarly, \(F_3\) is unknown; the known values at this stage are \(X_3\), \(X_2\), \(X_1\), \(Y_2\), \(Y_1\), \(F_2\), and \(F_1\). Now we can use a two-step explicit method to predict \(Y_3\).
Theorem 6. Two-step Adams–Bashforth method [16]. Given the derivative \(\dot{Y}_t = F_t\), let \(t = t_n\), and choose a step size \(\delta\) along the \(t\)-axis such that \(t_{n+1} = t_n + \delta\), \(t_{n+2} = t_n + 2\delta\). The approximation of \(Y_{t_{n+2}}\) is given by: \[\label{2ee} Y_{t_{n+2}} = Y_{t_{n+1}}+\frac{\delta}{2}\cdot (3F_{t_{n+1}} - F_{t_n}).\tag{11}\]
According to Theorem 2 and Theorem 6, we obtain \(\overline{Y_3}\) and \(F_3\):
\[\label{2eeu} \begin{align} \overline{Y_3} &= Y_2 + \frac{\delta}{2}\cdot (3F_2 - F_1)\\ F_3 &= F(X_3,\overline{Y_3}).\\ \end{align}\tag{12}\]
This completes the calculation of Eq. 10; the next step is to compute \(Y_4\) and \(F_4\) using the three-step implicit method.
Theorem 7. Three-step Adams–Moulton method [17]. Given the derivative \(\dot{Y}_t = F_t\), let \(t = t_n\), and choose a step size \(\delta\) along the \(t\)-axis such that \(t_{n+1} = t_n + \delta\), \(t_{n+2} = t_n + 2\delta\), \(t_{n+3} = t_n + 3\delta\). The approximation of \(Y_{t_{n+3}}\) is given by: \[\label{3ie} Y_{t_{n+3}} = Y_{t_{n+2}} + \frac{\delta}{24}\cdot (9F_{t_{n+3}} + 19F_{t_{n+2}} - 5F_{t_{n+1}} + F_{t_n}).\tag{13}\]
In the U-Net specified before Eq. 6, according to Theorem 7, we have: \[\label{3ieu} Y_4 = Y_3 + \frac{\delta}{24}\cdot (9F_4 + 19F_3 - 5F_2 + F_1).\tag{14}\]
\(F_4\) is unknown; the known values at this stage are \(X_1:X_4\), \(Y_1:Y_3\), and \(F_1:F_3\). Now we can use a three-step explicit method to predict \(Y_4\).
Theorem 8. Three-step Adams–Bashforth method [16]. Given the derivative \(\dot{Y}_t = F_t\), let \(t = t_n\), and choose a step size \(\delta\) along the \(t\)-axis such that \(t_{n+1} = t_n + \delta\), \(t_{n+2} = t_n + 2\delta\), \(t_{n+3} = t_n + 3\delta\). The approximation of \(Y_{t_{n+3}}\) is given by: \[\label{3ee} Y_{t_{n+3}} = Y_{t_{n+2}}+\frac{\delta}{12}\cdot (23F_{t_{n+2}} -16F_{t_{n+1}} + 5F_{t_n}).\tag{15}\]
According to Theorem 2 and Theorem 8, we can get \(\overline{Y_4}\) and \(F_4\) as follows: \[\label{3eeu} \begin{align} \overline{Y_4} &= Y_3 + \frac{\delta}{12}\cdot (23F_3 - 16F_2 + 5F_1)\\ F_4 &= F(X_4,\overline{Y_4}).\\ \end{align}\tag{16}\]
Finally, the computation of Eq. 14 is completed. When \(4 < t_{n+3} \leq L\), we still use Eq. 13 to calculate \(Y_{t_{n+3}}\). For example, \(Y_5 = Y_4 + \frac{\delta}{24}\cdot (9F_5 + 19F_4 - 5F_3 + F_2)\). However, at this stage, we have more known information. When predicting \(Y_5\), we can apply a four-step explicit method, unlike the prediction of \(Y_4\), where only a three-step explicit method can be used.
Theorem 9. Four-step Adams–Bashforth method [16]. Given the derivative \(\dot{Y}_t = F_t\), let \(t = t_n\), and choose a step size \(\delta\) along the \(t\)-axis such that \(t_{n+1} = t_n + \delta\), \(t_{n+2} = t_n + 2\delta\), \(t_{n+3} = t_n + 3\delta\), \(t_{n+4} = t_n + 4\delta\). The approximation of \(Y_{t_{n+4}}\) is given by: \[\label{4ee} Y_{t_{n+4}} = Y_{t_{n+3}}+\frac{\delta}{24}\cdot (55F_{t_{n+3}} - 59F_{t_{n+2}} + 37F_{t_{n+1}} - 9F_{t_n}).\tag{17}\]
According to Theorem 2 and Theorem 9, we can get \(\overline{Y_5}\) and \(F_5\) as follows: \[\label{4eeu} \begin{align} \overline{Y_5} &= Y_4 + \frac{\delta}{24}\cdot (55F_4 - 59F_3 + 37F_2 - 9F_1)\\ F_5 &= F(X_5,\overline{Y_5}).\\ \end{align}\tag{18}\]
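As a quick sanity check on the coefficients quoted in Theorems 3-9: after division by the common denominator, each rule's weights must sum to 1, otherwise the rule could not integrate even a constant derivative \(F = 1\) exactly. The short script below verifies this for every formula above.

```python
# Verify first-order consistency of all quoted Adams rules: for each
# (weights, denominator) pair, sum(weights) must equal the denominator.
schemes = {
    "AM1 (Eq. 5)":  ([1, 1], 2),
    "AB1 (Eq. 7)":  ([1], 1),
    "AM2 (Eq. 9)":  ([5, 8, -1], 12),
    "AB2 (Eq. 11)": ([3, -1], 2),
    "AM3 (Eq. 13)": ([9, 19, -5, 1], 24),
    "AB3 (Eq. 15)": ([23, -16, 5], 12),
    "AB4 (Eq. 17)": ([55, -59, 37, -9], 24),
}
checks = {name: sum(weights) == denom for name, (weights, denom) in schemes.items()}
assert all(checks.values())   # every quoted rule passes the consistency check
```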
In the subsequent computations, we perform calculations based on Theorems 7 and 9. When the computation reaches the topmost stage, the known information includes \(X_1:X_L\), \(Y_1:Y_L\), and \(F_1:F_L\). At this point, we cannot apply the implicit method to calculate \(Y_{final}\), because \(X_{L+1}\) does not exist for the corrector evaluation. Therefore, we switch to a four-step explicit method. Then we have: \[\label{final} Y_{final} = Y_L + \frac{\delta}{24}\cdot (55F_L - 59F_{L-1} + 37F_{L-2} - 9F_{L-3}).\tag{19}\]
Finally, we apply a convolution layer with a kernel size of 1 to map \(Y_{final}\) to the prediction map.
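The full adaptive-order recurrence described above (predict with the highest-order explicit rule available, correct with the highest-order implicit rule available, finish with the explicit step of Eq. 19) can be sketched in plain Python. Here \(F\) is a toy scalar stand-in for the network's nmODE dynamic \(F(X_l, Y)\), so this illustrates the numerical scheme only, not the paper's implementation; following Eq. 8, each stored \(F_l\) is evaluated at the predicted value.

```python
# (weights, denominator) for each Adams rule, ordered newest F first.
AB = {1: ([1], 1), 2: ([3, -1], 2), 3: ([23, -16, 5], 12),
      4: ([55, -59, 37, -9], 24)}           # explicit predictors
AM = {1: ([1, 1], 2), 2: ([5, 8, -1], 12),
      3: ([9, 19, -5, 1], 24)}              # implicit correctors

def decode(F, X, y1, delta):
    """Run the predictor-corrector recurrence over skip features X[0..L-1]."""
    ys, fs = [y1], [F(X[0], y1)]
    for l in range(1, len(X)):
        wp, dp = AB[min(len(fs), 4)]        # highest available predictor order
        y_pred = ys[-1] + delta / dp * sum(c * f for c, f in zip(wp, fs[::-1]))
        f_new = F(X[l], y_pred)             # as in Eq. 8: F kept at the prediction
        wc, dc = AM[min(len(fs), 3)]        # highest available corrector order
        y_corr = ys[-1] + delta / dc * sum(
            c * f for c, f in zip(wc, [f_new] + fs[::-1]))
        ys.append(y_corr)
        fs.append(f_new)
    wp, dp = AB[min(len(fs), 4)]            # final explicit four-step update (Eq. 19)
    return ys[-1] + delta / dp * sum(c * f for c, f in zip(wp, fs[::-1]))

# Toy check: with F(X, Y) = -Y (skip features ignored), five stages and
# delta = 1/5 integrate dY/dt = -Y over one unit of time, so the result
# loosely approximates exp(-1) ~ 0.368.
y_final = decode(lambda x, y: -y, [None] * 5, 1.0, 0.2)
```

Note how the list `fs` must retain all historical derivative values; this is exactly the memory cost discussed as the method's main limitation.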
| Source | Workflow | Result |
|---|---|---|
| \(X_1, Y_1\) | P: \(\overline{Y_2} = Y_1 + \delta \cdot F_1\) <br> C: \(Y_2 = Y_1 + \frac{\delta}{2} \cdot (F_1 + F_2)\) | \(Y_2\) |
| \(X_{1:2}, Y_{1:2}\) | P: \(\overline{Y_3} = Y_2 + \frac{\delta}{2} \cdot (3F_2 - F_1)\) <br> C: \(Y_3 = Y_2 + \frac{\delta}{12} \cdot (5F_3 + 8F_2 - F_1)\) | \(Y_3\) |
| \(X_{1:3}, Y_{1:3}\) | P: \(\overline{Y_4} = Y_3 + \frac{\delta}{12} \cdot (23F_3 - 16F_2 + 5F_1)\) <br> C: \(Y_4 = Y_3 + \frac{\delta}{24} \cdot (9F_4 + 19F_3 - 5F_2 + F_1)\) | \(Y_4\) |
| \(X_{1:4}, Y_{1:4}\) | P: \(\overline{Y_5} = Y_4 + \frac{\delta}{24} \cdot (55F_4 - 59F_3 + 37F_2 - 9F_1)\) <br> C: \(Y_5 = Y_4 + \frac{\delta}{24} \cdot (9F_5 + 19F_4 - 5F_3 + F_2)\) | \(Y_5\) |
| \(X_{2:5}, Y_{2:5}\) | Cal: \(Y_6 = Y_5 + \frac{\delta}{24} \cdot (55F_5 - 59F_4 + 37F_3 - 9F_2)\) | \(Y_6\) |
| | ACDC | KiTS | MSD | ISIC2017 | ISIC2018 |
|---|---|---|---|---|---|
| Backbone | 1e-2 | 3e-4 | 1e-4 | 1e-3 | 1e-3 |
| FuseUNet | 3e-2 | 1e-3 | 2e-4 | 3e-3 | 3e-3 |
| FuseUNet - Ori | Decoder | Skip Connection |
|---|---|---|
| Params | \(L \cdot 4N^2\cdot1^2 - \sum_{i=L}^{1}3C_i^2\cdot k^2\) | \(\sum_{i=L}^{1}2N\cdot C_i\cdot1^2 - 0\) |
| FLOPs | \(2L\cdot 4N^2\cdot1^2\cdot H_o\cdot W_o - \sum_{i=L}^{1}6C_i^2\cdot k^2\cdot H_i\cdot W_i\) | \(2\sum_{i=L}^{1}2N\cdot C_i\cdot1^2\cdot H_o\cdot W_o + 7H_o\cdot W_o \cdot 2N - 0\) |
Taking an \(L\)-stage convolutional architecture as an example: in the table, \(N\) denotes the number of target classes and the subscript 'o' denotes the original network. The reduction in parameters and computation is tied to the number of channels in each stage. FLOPs increase mainly due to interpolation in the skip connections, especially when the original network has few channels.
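The difference formulas can be made concrete with a small worked example. The configuration below (\(L\), \(N\), \(C_i\), \(k\)) is a made-up illustration, not a setting reported in the paper; it simply evaluates the table's parameter-difference expressions.

```python
# Illustrative instantiation of the parameter-difference formulas in the
# table above (FuseUNet minus original). L, N, C and k are hypothetical
# example values, not a configuration from the paper.
L, N, k = 4, 3, 3                       # stages, target classes, kernel size
C = [32, 64, 128, 256]                  # channels per stage, C_1..C_L

# Decoder parameter difference: L * 4N^2 * 1^2 - sum_i 3 * C_i^2 * k^2
decoder_diff = L * 4 * N**2 * 1**2 - sum(3 * c**2 * k**2 for c in C)

# Skip-connection parameter difference: sum_i 2N * C_i * 1^2 - 0
skip_diff = sum(2 * N * c * 1**2 for c in C) - 0

# The decoder sheds far more parameters than the skip connections add.
assert decoder_diff < 0 < skip_diff
```

Even in this small example the decoder saving (millions of weights) dwarfs the few thousand weights added to the skip connections, consistent with the table's message.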
| VRAM (GB) / Epoch (s) | nn-UNet | UNETR | UltraLight VM-UNet |
|---|---|---|---|
| Backbone | 7.33/144 | 9.4/115 | 0.87/21 |
| FuseUNet | 5.91/128 | 7.8/105 | 1.27/22 |
| Dataset | Fold | Dice1 | Dice2 | Dice3 | Dice avg |
|---|---|---|---|---|---|
| ACDC | 1 | 90.37 | 90.46 | 94.60 | 91.81 |
| 1-Myo | 2 | 91.21 | 89.40 | 94.49 | 91.70 |
| 2-RV | 3 | 89.32 | 90.11 | 94.56 | 90.33 |
| 3-LV | 4 | 91.58 | 90.28 | 94.32 | 92.06 |
| | 5 | 88.41 | 90.02 | 94.52 | 90.98 |
| KiTS23 | 1 | 97.28 | 83.94 | 79.51 | 86.91 |
| 1-Kidney | 2 | 95.49 | 80.86 | 79.67 | 85.34 |
| 2-Cyst | 3 | 97.15 | 82.83 | 77.69 | 85.89 |
| 3-Tumor | 4 | 96.89 | 87.11 | 82.82 | 88.94 |
| | 5 | 95.55 | 84.58 | 75.48 | 83.87 |
| MSD | 1 | 77.92 | 60.81 | 76.53 | 71.75 |
| 1-WT | 2 | 79.72 | 61.12 | 79.42 | 73.42 |
| 2-ET | 3 | 81.40 | 60.53 | 78.23 | 73.40 |
| 3-TC | 4 | 77.92 | 59.47 | 79.72 | 72.37 |
| | 5 | 80.56 | 58.68 | 76.79 | 72.01 |
| Order | KiTS | ACDC | MSD | ISIC2017 | ISIC2018 |
|---|---|---|---|---|---|
| 1 | 84.8 | 91.75 | 71.22 | 89.25 | 88.63 |
| 2 | 85.2 | 91.84 | 71.49 | 89.65 | 89.06 |
| 3 | 85.7 | 91.85 | 71.56 | 90.15 | 89.35 |
| 4 | 86.7 | 92.05 | 71.75 | 90.69 | 89.78 |
| FuseUNet - Backbone | Mean of Differences | Standard Deviation of Differences | Standard Error of the Mean | 95% Confidence Interval | t-statistic | Degrees of Freedom | p-value |
|---|---|---|---|---|---|---|---|
| ACDC | 0.03 | 3.39 | 0.14 | (-0.24, 0.31) | 0.25 | 603 | 0.80 |
| KiTS | 0.15 | 10.18 | 0.27 | (-0.38, 0.67) | 0.55 | 1470 | 0.58 |
The data in Table 12 indicate that the performance of FuseUNet on the ACDC and KiTS datasets shows no statistically significant difference compared to nn-UNet.
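The quantities in Table 12 are related by the standard paired t-test identities, which can be checked directly: \(\mathrm{SEM} = \mathrm{SD}/\sqrt{n}\) with \(n = \mathrm{df} + 1\), \(t = \text{mean}/\mathrm{SEM}\), and a 95% CI of roughly \(\text{mean} \pm 1.96 \cdot \mathrm{SEM}\) (a normal approximation, reasonable at these degrees of freedom). Since the inputs below are the table's rounded values, agreement is only up to rounding; for instance, the recovered \(t\) for ACDC (~0.22) differs slightly from the reported 0.25 because the mean 0.03 is itself rounded.

```python
# Recompute the Table 12 summary quantities from their definitions, using the
# table's (rounded) mean, SD, and degrees of freedom for the ACDC row.
import math

def paired_t_summary(mean_diff, sd_diff, df):
    n = df + 1                                   # number of paired samples
    sem = sd_diff / math.sqrt(n)                 # standard error of the mean
    t = mean_diff / sem                          # t-statistic
    ci = (mean_diff - 1.96 * sem, mean_diff + 1.96 * sem)  # approx. 95% CI
    return sem, t, ci

sem, t, ci = paired_t_summary(0.03, 3.39, 603)   # ACDC row of Table 12
```

The recovered SEM (~0.14) and CI (~(-0.24, 0.30)) match the table up to rounding, supporting the non-significance conclusion.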


Figure 10: Visualization on 2D tasks.