Contextual Embedding Learning to Enhance 2D Networks for
Volumetric Image Segmentation


Abstract

The segmentation of organs in volumetric medical images plays an important role in computer-aided diagnosis and treatment/surgery planning. Conventional 2D convolutional neural networks (CNNs) can hardly exploit the spatial correlation of volumetric data. Current 3D CNNs have the advantage of extracting more powerful volumetric representations, but they usually suffer from excessive memory and computation consumption. In this study, we aim to enhance 2D networks with contextual information for better volumetric image segmentation. Accordingly, we propose a contextual embedding learning approach to facilitate 2D CNNs in capturing spatial information properly. Our approach leverages the learned embedding and slice-wise neighboring matching as a soft cue to guide the network. In this way, the contextual information can be transferred slice-by-slice, thus boosting the volumetric representation of the network. Experiments on a challenging prostate MRI dataset (PROMISE12) and an abdominal CT dataset (CHAOS) show that our contextual embedding learning can effectively leverage the inter-slice context and improve segmentation performance. The proposed approach is a plug-and-play and memory-efficient solution to enhance 2D networks for volumetric segmentation. The code will be publicly available.

Medical image segmentation, convolutional neural networks, embedding learning, contextual information, attention mechanism

1 Introduction↩︎

The segmentation of organs in volumetric medical images (e.g., computed tomography (CT), magnetic resonance imaging (MRI), 3D ultrasound) enables quantitative analysis of anatomical parameters such as shape, boundary, and volume [1], [2]. Segmentation is also often the very first step in the workflow of other medical image computing tasks, such as computer-assisted diagnosis [3], [4] and surface-based registration [5], [6]. However, it is tedious and time-consuming for clinicians to manually delineate the organ’s contour. Therefore, automatic segmentation approaches are highly desirable to meet the constantly increasing clinical demand.

Figure 1: For volumetric medical images, conventional 2D networks can only segment each 2D slice individually and can hardly obtain the contextual information between slices, which leads to incomplete and discontinuous segmentations, as shown in (d). Our contextual embedding (CE) approach transfers contextual information via a slice-wise neighboring matching mechanism, thus boosting the volumetric representation of the 2D network. In (d), the blue, green, and orange surfaces indicate the ground-truth segmentation, the conventional 2D result, and the CE-enhanced result, respectively. As shown, our approach makes the segmentation smoother and more complete in 3D.

A large body of literature has presented a wide variety of methodologies for medical image segmentation. Early approaches mainly rely on thresholding, edge detection, atlas matching, deformable models, and machine learning techniques [7]. Although these approaches have shown success in certain circumstances, medical image segmentation remains challenging, mainly due to the difficulty of designing and extracting discriminative features.

With the promising capability of deep learning techniques in the fields of computer vision and pattern recognition [8], convolutional neural networks (CNNs) have also become a primary option for medical image segmentation. Various CNN architectures have achieved excellent segmentation results in many medical applications. One of the most well-known CNN architectures for medical image segmentation is the U-net [9], which adopts skip connections to effectively aggregate low-level and high-level features. The U-net has become the benchmark for many segmentation tasks, and based on it many 2D variants have been devised to target specific applications. Chen et al. [10] proposed a multi-task learning network for gland segmentation from histopathological images, which explores multi-level contextual features from an auxiliary classifier on top of the U-net structure. Ibtehaz et al. [11] took inspiration from Inception blocks [12] and proposed MultiRes blocks to enhance U-Net with the ability of multi-resolution analysis. Zhou et al. [13] re-designed the skip connections and proposed UNet++, a deeply-supervised encoder-decoder architecture with nested and dense skip connections. Jha et al. [14] proposed ResUNet++, which takes advantage of residual blocks, squeeze-and-excitation blocks, and atrous spatial pyramid pooling (ASPP) to segment colorectal polyps. However, directly employing 2D networks to process 3D images in a slice-by-slice manner may result in discontinuous or inaccurate segmentation in 3D space (see the conventional 2D results in Fig. 1), due to the limited volumetric representation capability of 2D learning [15]. To better extract spatial information, some 2.5D approaches have been proposed [16], [17]. In general, these approaches still employ 2D convolutional kernels, while extracting features from three orthogonal views, i.e., the transversal, coronal, and sagittal planes. Although 2.5D approaches can slightly improve the results of 2D networks, they still may not fully exploit the original volumetric information [18]. Moreover, the three branches of 2.5D networks cause a rapid increase in parameters and memory usage.

To better empower the networks with richer volumetric information, Çiçek et al. [19] proposed 3D U-net, which extends the U-net architecture to directly process 3D images. However, the high memory consumption of 3D U-net limits the depth of the network. Milletari et al. [20] proposed another 3D derivative of the U-net architecture, named V-net, which applies residual connections to enable a deeper network. Owing to the advantages of residual connections, several studies [21]–[23] further employed the residual design to boost 3D networks. Furthermore, to better aggregate multi-scale spatial features, various attention mechanisms have been added into 3D CNNs [24]–[27]. Although 3D CNNs are beneficial for extracting more powerful volumetric features, they still suffer from excessive memory and computation consumption.

It is worth noting that combining CNNs with Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) [28] and convolutional LSTM [29], can capture slice-wise contextual information. Chen et al. [30] combined a 2D U-net with bi-directional LSTM-RNNs to mine intra-slice and inter-slice contexts for 3D image segmentation. Poudel et al. [31] devised a recurrent fully-convolutional network to learn inter-slice spatial dependencies for cardiac MRI segmentation. Alom et al. [32] developed a recurrent residual U-net model (R2U-Net), which utilizes recurrent residual convolutional layers to ensure better feature representation for segmentation tasks. Novikov et al. [33] proposed Sensor3D, which integrates bi-directional C-LSTMs into a U-net-like architecture to extract the inter-slice context sequentially, achieving notable performance in CT segmentation. RNNs can extract contextual relationships, yet they bring extra computational cost and memory usage.

In this study, we aim to enhance 2D networks with contextual information for accurate and efficient volumetric image segmentation. Our proposed network shares the backbone features with the encoder of a conventional 2D segmentation network and fully utilizes the slice-wise spatial context through the proposed contextual embedding learning approach, thus improving segmentation performance (see Fig. 1). The contributions of our network are as follows:

  • We propose a contextual embedding learning approach, which leverages the learned embedding and slice-wise neighboring matching as a soft cue to guide the network. In this way, the contextual information can be transferred slice-by-slice, thus facilitating 2D networks in capturing volumetric representations properly.

  • The proposed contextual embedding block is a plug-and-play and memory-efficient solution to enhance 2D networks for volumetric segmentation.

  • Experiments on a challenging prostate MRI dataset and an abdominal CT dataset demonstrate that our approach consistently enhances the performance of 2D networks on volumetric images. In addition, our approach is more lightweight than 3D networks.

The rest of this article is organized as follows. Section 2 introduces our segmentation network and further elaborates the proposed contextual embedding block. Section 3 shows the segmentation performance of our network. Sections 4 and 5 present the discussion and conclusion of this study, respectively.

2 Methods↩︎

Fig. 2 illustrates how the proposed contextual embedding (CE) block works in an encoder-decoder segmentation architecture. The backbone features are shared by the decoder and the CE block. In addition to the backbone features, the original prediction set (\(P_{ori}\)) from the decoder is the other input branch of the CE block. The CE block utilizes the context between neighboring slices as a soft cue to track the variation of the three-dimensional shape along the axial direction and generates the final prediction set \(P_{CE}\).

Figure 2: The schematic overview of the proposed network. The yellow arrows indicate the workflow of the conventional 2D segmentation using the encoder-decoder architecture, whereas the pink flow shows the plug-and-play contextual embedding (CE) block to enhance the volumetric representation of the 2D network. Specifically, the CE block leverages the prediction of the neighboring slice (i.e., the \(P_{n-l}\) of the \(S_{n-l}\)) to calculate a distance map by matching the current slice embedding to the embedding of the neighboring slice (see details in Fig. 3). Then the CE block aggregates the neighboring matching distance map, the prediction of the neighboring slice (\(P_{n-l}\)), the original prediction of the current slice (\(P_{n}\)), and the backbone feature (\(B_{n}\)) to generate the refined prediction (\(P_{n-CE}\)) of the current slice (\(S_{n}\)).

2.1 Overview of the Contextual Embedding Block↩︎

Since conventional 2D CNNs can only perform convolutions in the axial plane, they can hardly obtain the slice-wise information that characterizes the implicit dependency between slices. In such a case, 2D segmentation results may appear discontinuous or incomplete in 3D space. In contrast, our designed CE block propagates the implicit dependency from the neighboring slice to the current slice in a memory-efficient manner to improve the segmentation results.

Fig. 3 illustrates the detailed design of the proposed contextual embedding block. In contrast to conventional networks that only leverage backbone features for the segmentation prediction, we also employ the original predictions as an additional cue to generate a segmentation with slice-wise context. To segment the current slice, the CE block leverages the prediction of the neighboring slice to calculate a distance map by matching the current slice embedding to the embedding of the neighboring slice (see details in Section 2.3). Then the CE block combines the neighboring matching distance map, the prediction of the neighboring slice, and the backbone features to generate a refined prediction of the current slice. Finally, the refined prediction and the original prediction of the same slice are aggregated through an attention merge module (AMM) to produce the final segmentation (see details in Section 2.4). Note that the embedding and the neighboring matching mechanism are used as a soft cue to guide the network, thus the whole network can be trained end-to-end without an extra loss on the embedding.

2.2 Embedding Space↩︎

For each pixel \(p\) from a slice \(S\), we obtain its corresponding embedding vector \(e_{p}\) in the learned embedding space. Specifically, as shown in Fig. 3, three consecutive convolutional blocks are employed to non-linearly map backbone features (\(B\)) to the embedding space (\(E\)). Each convolutional block is composed of a convolutional layer and a batch normalization layer followed by a rectified linear unit (ReLU) activation layer.
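As an illustration, a minimal PyTorch sketch of such an embedding head is given below; the kernel sizes and channel widths (e.g., `in_channels`, `embed_channels`) are assumptions for clarity rather than the exact configuration of our network.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One convolutional block: convolution -> batch normalization -> ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EmbeddingHead(nn.Module):
    """Maps backbone features B (N, C, H, W) to embedding vectors E (N, C_e, H, W)."""
    def __init__(self, in_channels=64, embed_channels=32):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(in_channels, embed_channels),
            conv_block(embed_channels, embed_channels),
            conv_block(embed_channels, embed_channels),
        )

    def forward(self, backbone_features):
        return self.blocks(backbone_features)
```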

Figure 3: The detailed design of the contextual embedding block. To segment the current slice (\(S_n\)), the backbone features (\(B\)), the embedding vectors (\(E\)), and the prediction of the neighboring slice (\(P_{n-l}\) of \(S_{n-l}\)) are employed. First, a distance map (\(D_{n}\)) is obtained by matching the current slice embedding (\(E_n\)) to the embedding of the neighboring slice (\(E_{n-l}\)), see the green flow. Then, \(D_{n}\), the prediction of the neighboring slice (\(P_{n-l}\)), and the backbone feature (\(B_{n}\)) are combined to generate the new prediction of the current slice (\(P_{n'}\)), see the yellow flow. Finally, \(P_{n'}\) and the original prediction of the current slice (\(P_{n}\)) are aggregated through an attention merge module (AMM) to produce the final segmentation. The details of the AMM are shown in Fig. 4.

In the embedding space, pixels from the same class will be close whereas pixels from different classes will be far away [34]. Thus we utilize the distance between pixels in the embedding space as a soft cue to guide the network. As described in [35], we define the distance between pixels \(p\) and \(q\) in the embedding space as \[\label{formula:distanceP2P} d(p, q)=1-\frac{2}{1 + \exp(\left \| e_{p}-e_{q} \right \|^{2})} ,\tag{1}\] where \(e_{p}\) and \(e_{q}\) are the corresponding embedding vectors of \(p\) and \(q\), respectively. The embedding distance \(d(p, q)\) ranges from 0 to 1, depending on the similarity between pixels \(p\) and \(q\): for similar pixels, \(d(p, q)\) approaches 0, and for pixels that are far apart in the embedding space, \(d(p, q)\) will be close to 1.
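For reference, Eq. (1) can be evaluated densely on embedding tensors as in the following sketch (the channel-last tensor layout is an illustrative assumption):

```python
import torch

def embedding_distance(e_p, e_q):
    """Eq. (1): d(p, q) = 1 - 2 / (1 + exp(||e_p - e_q||^2)).

    e_p, e_q: embedding tensors of shape (..., C), compared element-wise.
    Returns distances in [0, 1): close to 0 for similar pixels,
    approaching 1 for dissimilar ones.
    """
    sq_dist = ((e_p - e_q) ** 2).sum(dim=-1)
    return 1.0 - 2.0 / (1.0 + torch.exp(sq_dist))
```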

2.3 Slice-wise Neighboring Matching↩︎

To segment the current slice \(S_{n}\), we transfer the contextual information from the neighboring slice \(S_{n-l}\) to \(S_{n}\) through slice-wise neighboring matching in the learned embedding space1. Specifically, as shown in Fig. 3, a neighboring matching distance map \(D_{n}\) is calculated using the embedding of the current slice (\(E_{n}\)) together with the embedding (\(E_{n-l}\)) and prediction (\(P_{n-l}\)) of the neighboring slice: \[\label{formula:dismap} D_n(p)=\min_{q\in P_{n-l, o}}d(p, q),\tag{2}\] where \(p\) and \(q\) represent pixels from \(S_{n}\) and \(S_{n-l}\), respectively, \(D_n(p)\) denotes the matching distance at position \(p\), and \(P_{n-l, o}\) denotes the set of pixels that belong to the organ area according to the prediction \(P_{n-l}\). The embedding distance \(d(p, q)\) is computed using Eq. (1). From Eq. (2), it can be observed that \(D_n(p)\) provides, for each pixel of the current slice, a soft cue of how likely it is to belong to the organ area.

When computing Eq. (2), each pixel from the current slice would otherwise have to be compared with every pixel from the neighboring slice, which may result in false-positive matches and redundant computation [36]. Therefore, in practice we do not compute \(D_n\) in a global manner but instead apply a local matching as described in FlowNet [37]. For a pixel \(p\) from the current slice, we only compare it with pixels \(q\) from the neighboring slice within a local patch of size \(k\) (i.e., the set of pixels that are at most \(k\) pixels away from \(p\) in both the \(x\) and \(y\) directions). In this way, the distance map \(D_n\) can be computed with efficient and fast matrix operations, without a brute-force search.
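One possible realization of this local matching with dense tensor operations (e.g., via `torch.nn.functional.unfold`) is sketched below; the tensor shapes and the masking convention are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def neighboring_matching(E_n, E_prev, P_prev_organ, k=3):
    """Sketch of the local slice-wise matching distance map D_n (Eq. (2)).

    E_n, E_prev:   embeddings of the current / neighboring slice, (B, C, H, W).
    P_prev_organ:  binary organ mask predicted for the neighboring slice, (B, 1, H, W).
    k:             pixels in each direction defining the local search window.
    """
    B, C, H, W = E_n.shape
    win = 2 * k + 1
    # Gather the (2k+1)^2 local candidates of every position in the neighboring slice.
    E_prev_patches = F.unfold(E_prev, kernel_size=win, padding=k)        # (B, C*win*win, H*W)
    E_prev_patches = E_prev_patches.view(B, C, win * win, H, W)
    mask_patches = F.unfold(P_prev_organ.float(), kernel_size=win, padding=k)
    mask_patches = mask_patches.view(B, 1, win * win, H, W)

    # Squared embedding distance between each current pixel and its local candidates.
    sq_dist = ((E_n.unsqueeze(2) - E_prev_patches) ** 2).sum(dim=1, keepdim=True)
    d = 1.0 - 2.0 / (1.0 + torch.exp(sq_dist))                           # Eq. (1), in [0, 1)

    # Candidates outside the predicted organ area receive the maximum distance 1.
    d = d * mask_patches + (1.0 - mask_patches)
    D_n = d.min(dim=2).values                                            # (B, 1, H, W)
    return D_n
```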

We then concatenate the distance map \(D_{n}\) with the prediction of the neighboring slice (\(P_{n-l}\)) and the backbone features (\(B_{n}\)), and feed them into the segmentation convolutions to generate the new prediction of the current slice (\(P_{n'}\)), see the yellow flow in Fig. 3. \(P_{n'}\) aggregates the contextual information of the neighboring slice to predict the current slice.
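A minimal sketch of this fusion step is given below; the channel counts and the depth of the segmentation convolutions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Sketch of the yellow flow in Fig. 3: fuse D_n, P_{n-l} and B_n into P_n'."""
    def __init__(self, backbone_channels=64, num_classes=1):
        super().__init__()
        in_ch = backbone_channels + num_classes + 1   # B_n + P_{n-l} + D_n
        self.seg_convs = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1),
        )

    def forward(self, D_n, P_prev, B_n):
        # Concatenate the distance map, neighboring prediction, and backbone features.
        return self.seg_convs(torch.cat([D_n, P_prev, B_n], dim=1))   # P_n'
```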

Figure 4: The structure of the attention merge module (AMM). The new prediction \(P_{n'}\) and the original prediction \(P_{n}\) of the current slice are aggregated through the AMM. The AMM consists of three sequential functions, \(F_{sq}\), \(F_{score}\), and \(F_{mul}\), followed by the segmentation convolutions that fuse the information and generate the final prediction \(P_{n-CE}\).

2.4 Attention Merge Module↩︎

The aforementioned embedding distance utilizes the context between neighboring slices as a soft cue to track the variation of the volumetric shape. Here, we further selectively merge \(P_{n'}\) and the original prediction \(P_{n}\) to produce the final segmentation.

Inspired by SE-Net [38], we propose an attention merge module (AMM), which makes the network pay more attention to the prediction that is more beneficial to the final segmentation. As shown in Fig. 4, the AMM mainly consists of three sequential functions: \(F_{sq}\), \(F_{score}\), and \(F_{mul}\). Firstly, \(F_{sq}\) squeezes the global spatial information of the two predictions into a feature vector via global average pooling (GAP) [39]: \[\label{formula:gap} v = F_{sq}(u) = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}u(i,j),\tag{3}\] where \(u\) denotes a prediction (i.e., \(P_{n'}\) or \(P_{n}\)), and \(W\) and \(H\) represent its width and height, respectively. To fully capture the dependency between the predictions, we then employ a simple gating mechanism with a sigmoid activation (\(\sigma\)): \[\label{formula:4} s = F_{score}(v, W_1, W_2) = \sigma(W_2\delta(W_1v)),\tag{4}\] where \(\delta\) refers to ReLU, \(W_1\in \mathbb{R}^{2r \times 2}\) and \(W_2\in \mathbb{R}^{2 \times 2r}\). To capture more abundant information, the two fully-connected (FC) layers enlarge the channel dimension by a factor of \(r\). Finally, the learned \(s\) is used to re-weight \(P_{n'}\) and \(P_{n}\): \[\label{formula:5} F_{mul}(u, s) = s\cdot u,\tag{5}\] where \(u\) denotes \(P_{n'}\) and \(P_{n}\).
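The AMM can be sketched as follows for a pair of single-channel predictions; the depth of the final segmentation convolutions is reduced to a single layer here as an assumption for brevity.

```python
import torch
import torch.nn as nn

class AttentionMerge(nn.Module):
    """Sketch of the AMM (Fig. 4) for a single-channel prediction pair."""
    def __init__(self, r=64):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                  # F_sq, Eq. (3)
        self.score = nn.Sequential(                         # F_score, Eq. (4)
            nn.Linear(2, 2 * r), nn.ReLU(inplace=True),
            nn.Linear(2 * r, 2), nn.Sigmoid(),
        )
        self.seg_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, P_new, P_ori):
        u = torch.cat([P_new, P_ori], dim=1)                # (B, 2, H, W)
        v = self.gap(u).flatten(1)                          # (B, 2) squeezed vector
        s = self.score(v)                                   # (B, 2) attention scores
        u = u * s.view(-1, 2, 1, 1)                         # F_mul, Eq. (5)
        return self.seg_conv(u)                             # final prediction P_{n-CE}
```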

Our AMM assigns learned importance weights to \(P_{n'}\) and \(P_{n}\), providing more comprehensive information for the subsequent segmentation convolutions to generate the final prediction.

Figure 5: Four transversal slices from the PROMISE12 dataset. The T2-weighted MRI images were collected from different centers and with different protocols: (a) Haukeland University Hospital, Siemens, 1.5T, with endorectal coil (ERC); (b) Beth Israel Deaconess Medical Center University Hospital, GE, 3.0T, with ERC; (c) University College London, Siemens, 1.5/3.0T, without ERC; (d) Radboud University Nijmegen Medical Centre, Siemens, 3.0T, without ERC.

Figure 6: Example liver CT slices from the CHAOS dataset. The images show low tissue contrast and ambiguous boundaries of the liver regions.

Table 1: Quantitative comparison of different methods on the prostate MRI dataset (Mean\(\pm\)SD; best results are highlighted in bold). “*” indicates that the results are statistically different from ours (Wilcoxon tests, \(p\)-value\(<\)0.05).
Methods DSC (%) ASSD (mm) 95HD (mm)
V-Net [20] 85.12 \(\pm\) 2.28* 1.95 \(\pm\) 0.61* 6.13 \(\pm\) 4.71
U-Net [9] 85.67 \(\pm\) 6.89* 2.05 \(\pm\) 2.36* 6.92 \(\pm\) 10.88
3D U-Net [19] 86.03 \(\pm\) 6.79 2.07 \(\pm\) 2.08* 6.70 \(\pm\) 11.17
U-Net+CE (Ours) 87.38 \(\pm\) 4.42 1.55 \(\pm\) 0.76 5.14 \(\pm\) 4.01
DeepLabV3 [40] 83.37 \(\pm\) 6.21* 2.13 \(\pm\) 1.01* 7.34 \(\pm\) 6.48*
3D DeepLabV3 84.86 \(\pm\) 6.49 1.88 \(\pm\) 0.82 4.83 \(\pm\) 2.33
DeepLabV3+CE (Ours) 84.94 \(\pm\) 5.80 1.94 \(\pm\) 0.91 6.26 \(\pm\) 5.14

3 Experiments and Results↩︎

3.1 Experimental Data and Pre-processing↩︎

The proposed method was evaluated on two public medical image datasets: the prostate MRI dataset (PROMISE12) [41], and the liver CT dataset (CHAOS)2. Both datasets are popular yet challenging benchmarks to evaluate the efficacy of segmentation algorithms.

The prostate MRI images from the PROMISE12 dataset were collected from four different centers, thus possessing diverse appearances (see Fig. 5). We used 50 MRI volumes to conduct five-fold cross validation. The intra- and inter-slice resolutions of these volumes range from 0.25 mm to 0.75 mm, and from 2.2 mm to 4.0 mm, respectively. All MRI volumes were resampled to a fixed resolution of 1\(\times\)​1\(\times\)​1.5 mm\(^{3}\). The N4 bias field correction [42] was conducted, and then the intensity values were normalized to zero mean and unit variance.

The liver CT volumes were from the challenging CHAOS dataset. Fig. 6 illustrates some example CT images, which show low tissue contrast and ambiguous organ boundaries. We used 40 CT volumes to conduct five-fold cross validation. All CT slices were resampled to a size of 256\(\times\)​256. The intensity values were clipped to the interval of [-100, 200] HU, and then normalized to zero mean and unit variance. During training, contrast stretching, brightness stretching, and elastic deformation were applied for data augmentation.
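As a minimal sketch of the CT intensity preprocessing described above (slice resampling and the augmentations are omitted):

```python
import numpy as np

def preprocess_ct_volume(volume_hu):
    """Clip CT intensities to [-100, 200] HU and normalize to zero mean, unit variance.

    volume_hu: numpy array of Hounsfield units, e.g. of shape (slices, H, W).
    """
    clipped = np.clip(volume_hu, -100.0, 200.0)
    return (clipped - clipped.mean()) / (clipped.std() + 1e-8)
```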

3.2 Implementation Details↩︎

The segmentation model was implemented with the open-source platform PyTorch. The values of the neighboring interval \(l\), the local patch size \(k\), and the coefficient \(r\) were empirically set to 1, 3, and 64, respectively. The Dice loss was utilized to train the model. Stochastic gradient descent (SGD) was used with a batch size of 50, a learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.001 to control the learning progress of the network. The total number of training epochs was 300. All experiments were conducted on a single NVIDIA GeForce RTX 2080 GPU.
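For reference, a minimal sketch of the loss and optimizer configuration is given below; the soft Dice formulation and its smoothing term are common choices shown as assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DiceLoss(nn.Module):
    """Soft Dice loss for binary segmentation (a common formulation)."""
    def __init__(self, smooth=1e-5):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, target):
        prob = torch.sigmoid(logits)
        intersection = (prob * target).sum()
        dice = (2.0 * intersection + self.smooth) / (prob.sum() + target.sum() + self.smooth)
        return 1.0 - dice

# Optimizer configuration matching Section 3.2 (the model is assumed to be defined elsewhere):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.001)
```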

3.3 Evaluation Metrics↩︎

The metrics employed to quantitatively evaluate the segmentation accuracy included the Dice similarity coefficient (DSC), the average symmetric surface distance (ASSD), and the 95% Hausdorff distance (95HD) [43]. All the above metrics were calculated in 3D space. DSC measures the relative volumetric overlap between the predicted and ground-truth segmentations. ASSD determines the average distance between the surfaces of the predicted and ground-truth segmentations. 95HD is similar to ASSD but more sensitive to localized disagreement, as it takes the 95th percentile of all calculated Hausdorff distances. A better segmentation has smaller ASSD and 95HD values and a larger DSC value.
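For reference, the DSC can be computed on binary 3D masks as sketched below; ASSD and 95HD involve surface extraction and are typically computed with a dedicated evaluation library.

```python
import numpy as np

def dice_coefficient(pred, gt):
    """DSC between binary 3D masks: 2*|A∩B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)
```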

Figure 7: Qualitative illustration of prostate MRI segmentation results from different methods in transversal, coronal, and sagittal views, respectively. The proposed CE block boosts the conventional 2D networks to generate more continuous and accurate results in 3D space.

3.4 Segmentation Accuracy↩︎

To demonstrate the efficacy of the proposed CE block, we applied it to enhance two popular 2D segmentation networks, i.e., U-Net [9] and DeepLabV3 [40], in a plug-and-play manner. We compared the CE-enhanced 2D networks with their 3D versions, i.e., 3D U-Net [19] and 3D DeepLabV3. We also compared our method with two well-established models, V-Net [20] for prostate MRI segmentation and UNet++ [13] for liver CT segmentation.

Figure 8: Qualitative illustration of liver CT segmentation results from different methods in transversal, coronal, and sagittal views, respectively. Our CE-enhanced networks generate the most similar segmentation to the ground truth.

Table 1 reports the numerical results of all methods on the prostate MRI dataset. To investigate the statistical significance of the proposed method over the compared methods on each metric, the Wilcoxon signed-rank test was employed and the results are also reported in Table 1. It can be observed that our CE-enhanced 2D networks performed significantly better than the conventional 2D networks on almost all evaluation metrics (except for the pair U-Net \(vs\). U-Net\(+\)CE on 95HD). Furthermore, the \(p\)-values of CE-enhanced 2D networks \(vs\). 3D networks on almost all metrics are above the 0.05 level, which indicates that the CE-enhanced 2D networks achieved performance similar to that of the 3D networks. Fig. 7 visualizes the segmentation results in the transversal, coronal, and sagittal views, respectively. Our CE-enhanced networks obtained the segmented boundaries most similar to the ground truth. Compared with conventional 2D networks, our method generated more accurate and continuous results in the volumetric view (see the coronal and sagittal views in Fig. 7). In general, the results shown in Table 1 and Fig. 7 demonstrate the effectiveness of our contextual embedding strategy. Compared with conventional 2D networks, the proposed CE block leverages the learned embedding and slice-wise neighboring matching as an internal guidance to transfer contextual information, thus facilitating 2D networks in capturing volumetric representations properly.

The quantitative results on the liver CT dataset are listed in Table 2. It can be observed that the CE-enhanced networks outperformed the conventional 2D networks on all metrics. It is worth noting that the proposed CE block largely improved the 2D networks with respect to the ASSD and 95HD metrics, corroborating that the CE block has a strong capability to utilize the slice-wise context for accurate volumetric segmentation. The statistical results in Table 2 further show that our method can make the segmentation performance of the 2D networks close to, or even surpass, that of the 3D networks. Fig. 8 visualizes the liver segmentation results in the transversal, coronal, and sagittal views, respectively. Our method can successfully infer the ambiguous boundaries and attain an accurate and smooth volumetric segmentation.

Table 2: Quantitative comparison of different methods on the liver CT dataset (Mean\(\pm\)SD; best results are highlighted in bold). “*” indicates that the results are statistically different from ours (Wilcoxon tests, \(p\)-value\(<\)0.05).
Methods DSC (%) ASSD (mm) 95HD (mm)
UNet++ [13] 94.23 \(\pm\) 2.94* 0.95 \(\pm\) 0.58* 4.89 \(\pm\) 6.50*
U-Net [9] 94.13 \(\pm\) 2.32* 1.84 \(\pm\) 1.12* 8.62 \(\pm\) 7.62*
3D U-Net [19] 95.67 \(\pm\) 2.23* 2.31 \(\pm\) 1.72* 2.31 \(\pm\) 1.72
U-Net+CE (Ours) 96.49 \(\pm\) 1.01 0.50 \(\pm\) 0.17 1.77 \(\pm\) 0.84
DeepLabV3 [40] 95.05 \(\pm\) 4.85* 1.78 \(\pm\) 2.24* 7.62 \(\pm\) 14.83*
3D DeepLabV3 95.33 \(\pm\) 1.44 1.49 \(\pm\) 0.81* 5.83 \(\pm\) 5.56*
DeepLabV3+CE (Ours) 95.59 \(\pm\) 2.03 0.61 \(\pm\) 0.37 1.95 \(\pm\) 1.50
Table 3: Efficiency comparison between baseline 2D networks, corresponding 3D versions, and the CE-enhanced 2D networks in terms of parameter amount and floating point operations (FLOPs).
Methods \(\#\) Params FLOPs
U-Net [9] 13.39 M 1092.31
3D U-Net [19] 40.15 M 1611.50
U-Net+CE (Ours) 24.15 M 1466.12
DeepLabV3 [40] 25.42 M 941.41
3D DeepLabV3 74.72 M 1423.17
DeepLabV3+CE (Ours) 37.40 M 1373.74

3.5 Efficiency Comparison↩︎

We further compared the baseline 2D networks, corresponding 3D versions, and the CE-enhanced 2D networks in terms of parameter amount and floating point operations (FLOPs). As shown in Table 3, the CE block only increased the number of parameters by about 10M compared with the original 2D networks, which is about half the model size of the corresponding 3D versions. In addition, the CE-enhanced networks are more lightweight compared with the corresponding 3D networks in terms of FLOPs. The comparison results in Table 3 demonstrate the proposed CE block is a memory-efficient solution to enhance the 2D networks for volumetric segmentation.

4 Discussion↩︎

The automatic segmentation of organs in volumetric medical images, such as MRI and CT scans, plays an important role in computer-assisted diagnosis and interventions [44]–[46]. Conventional 2D convolutional networks treat volumetric images as a stack of 2D slices and thus cannot learn spatial information along the third dimension [47]. 3D CNNs are beneficial for extracting richer volumetric information, but often require high computational cost and heavy GPU memory consumption. To alleviate these issues, the proposed contextual embedding strategy converts a conventional 2D segmentation network with an encoder-decoder architecture into a pseudo-3D segmentation network that captures the slice-wise context, which addresses, in a memory-friendly way, the problem that the segmentation results of 2D networks are discontinuous and incomplete in 3D space. The results on prostate MRI images and liver CT images show that the proposed CE block can effectively enhance the performance of U-net and DeepLabV3 on volumetric segmentation. In some cases, the CE-enhanced 2D networks even achieved better segmentation metrics than the corresponding 3D networks did. From the visual comparisons in Fig. 7 and Fig. 8, it is obvious that, compared with the conventional 2D networks, the CE-enhanced methods provide more reliable results in the coronal and sagittal views. All the above results corroborate that the CE block has a strong capability to utilize the slice-wise context for accurate volumetric segmentation.

It is worth noting that the proposed contextual embedding mechanism is employed as an internal guidance of the conventional networks, not as a direct constraint on the final segmentation prediction. Thus the whole network can be trained without requiring extra supervision on the embedding. In addition, our CE block is a plug-and-play module that adds few additional convolutions. Compared with 3D convolutional networks, our CE-enhanced 2D networks are more computation- and memory-efficient for volumetric segmentation, as shown in Table 3.

In this study, we aim to design a novel module that enhances 2D convolutional networks with contextual information for more accurate and efficient volumetric segmentation. Thus we mainly focus on designing a lightweight and easy-to-use module rather than on pushing beyond state-of-the-art segmentation accuracy. The proposed module is expected to be integrated into current 2D encoder-decoder architectures to improve their performance. Therefore, in our experiments, we focused on comparing our method with popular 2D backbones rather than with the current cutting-edge networks on the two experimental datasets.

5 Conclusion↩︎

For volumetric segmentation, conventional 2D networks can hardly exploit spatial information in 3D, while most 3D networks suffer from excessive computational resource consumption. In this study, we present a novel plug-and-play contextual embedding block to enhance 2D networks for volumetric segmentation. Specifically, the proposed CE block leverages the learned embedding and slice-wise neighboring matching to aggregate contextual information. The CE block requires neither excessive convolutions nor complicated 3D operations, making it an efficient solution to boost the volumetric representation of 2D networks. Experimental results on a challenging prostate MRI dataset and an abdominal CT dataset show that the proposed method effectively leverages the inter-slice context and consistently improves the performance of 2D networks on volumetric images.

References↩︎

[1]
Shen, D., Wu, G., Suk, H.I., 2017. Deep learning in medical image analysis. Annual Review of Biomedical Engineering 19, 221–248.
[2]
Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88.
[3]
Hesse, L.S., Kuling, G., Veta, M., Martel, A.L., 2020. Intensity augmentation to improve generalizability of breast segmentation across different MRI scan protocols. IEEE Transactions on Biomedical Engineering 68, 759–770.
[4]
Wang, Y., Wang, N., Xu, M., Yu, J., Qin, C., Luo, X., Yang, X., Wang, T., Li, A., Ni, D., 2020b. Deeply-supervised networks with threshold loss for cancer detection in automated breast ultrasound. IEEE Transactions on Medical Imaging 39, 866–876.
[5]
Ghavami, N., Hu, Y., Gibson, E., Bonmati, E., Emberton, M., Moore, C.M., Barratt, D.C., 2019. Automatic segmentation of prostate MRI using convolutional neural networks: Investigating the impact of network architecture on the accuracy of volume measurement and MRI-ultrasound registration. Medical Image Analysis 58, 101558.
[6]
Wang, Y., Zheng, Q., Heng, P.A., 2018b. Online robust projective dictionary learning: shape modeling for MR-TRUS registration. IEEE Transactions on Medical Imaging 37, 1067–1078.
[7]
Pham, D.L., Xu, C., Prince, J.L., 2000. Current methods in medical image segmentation. Annual Review of Biomedical Engineering 2, 315–337.
[8]
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436.
[9]
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Interventions, Springer. pp. 234–241.
[10]
Chen, H., Qi, X., Yu, L., Heng, P.A., 2016a. DCAN: Deep contour-aware networks for accurate gland segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2487–2496.
[11]
Ibtehaz, N., Rahman, S., 2020. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks 121, 74–87.
[12]
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A., 2017. Inception-v4, inception-resnet and the impact of residual connections on learning, in: Thirty-first AAAI conference on artificial intelligence.
[13]
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2018. Unet++: A nested u-net architecture for medical image segmentation, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 3–11.
[14]
Jha, D., Smedsrud, P.H., Riegler, M.A., Johansen, D., Lange, T.D., Halvorsen, P., Johansen, H.D., 2019. ResUNet++: An advanced architecture for medical image segmentation, in: 2019 IEEE International Symposium on Multimedia (ISM), pp. 225–230.
[15]
Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.W., Heng, P.A., 2018. H-DenseUNet: Hybrid densely connected unet for liver and tumor segmentation from CT volumes. IEEE Transactions on Medical Imaging 37, 2663–2674.
[16]
Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., Nielsen, M., 2013. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network, in: Medical Image Computing and Computer-Assisted Interventions, Springer. pp. 246–253.
[17]
Wang, S., Zhou, M., Gevaert, O., Tang, Z., Dong, D., Liu, Z., Tian, J., 2017. A multi-view deep convolutional neural networks for lung nodule segmentation, in: International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE. pp. 1752–1755.
[18]
Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B., 2017. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis 36, 61–78.
[19]
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O., 2016. 3D U-Net: learning dense volumetric segmentation from sparse annotation, in: International Conference on Medical Image Computing and Computer-Assisted Interventions, pp. 424–432.
[20]
Milletari, F., Navab, N., Ahmadi, S.A., 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 2016 Fourth International Conference on 3D Vision (3DV), IEEE. pp. 565–571.
[21]
Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A., 2018a. VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage 170, 446–455.
[22]
Wang, Y., Qin, C., Lin, C., Lin, D., Xu, M., Luo, X., Wang, T., Li, A., Ni, D., 2020a. 3D inception u-net with asymmetric loss for cancer detection in automated breast ultrasound. Medical Physics 47, 5582–5591.
[23]
Wu, H., Huo, Y., Pan, Y., Xu, Z., Huang, R., Xie, Y., Han, C., Liu, Z., Wang, Y., 2022. Learning pre-and post-contrast representation for breast cancer segmentation in DCE-MRI, in: International Symposium on Computer-Based Medical Systems, IEEE. pp. 355–359.
[24]
Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al., 2018. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
[25]
Wang, Y., Deng, Z., Hu, X., Zhu, L., Yang, X., Xu, X., Heng, P.A., Ni, D., 2018a. Deep attentional features for prostate segmentation in ultrasound, in: Medical Image Computing and Computer Assisted Intervention, Springer. pp. 523–530.
[26]
Lin, H., Li, Z., Yang, Z., Wang, Y., 2021. Variance-aware attention U-net for multi-organ segmentation. Medical Physics 48, 7864–7876.
[27]
Huang, R., Xu, Z., Xie, Y., Wu, H., Li, Z., Cui, Y., Huo, Y., Han, C., Yang, X., Liu, Z., Wang, Y., 2023. Joint-phase attention network for breast cancer segmentation in DCE-MRI. Expert Systems with Applications 224, 119962.
[28]
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation 9, 1735–1780.
[29]
Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems, pp. 802–810.
[30]
Chen, J., Yang, L., Zhang, Y., Alber, M., Chen, D.Z., 2016b. Combining fully convolutional and recurrent neural networks for 3D biomedical image segmentation. Advances in Neural Information Processing Systems 29.
[31]
Poudel, R.P., Lamata, P., Montana, G., 2016. Recurrent fully convolutional neural networks for multi-slice MRI cardiac segmentation, in: Reconstruction, Segmentation, and Analysis of Medical Images. Springer, pp. 83–94.
[32]
Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T., Asari, V., 2018. Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation. arXiv preprint arXiv:1802.06955.
[33]
Novikov, A.A., Major, D., Wimmer, M., Lenis, D., Bühler, K., 2018. Deep sequential segmentation of organs in volumetric medical scans. IEEE Transactions on Medical Imaging 38, 1207–1215.
[34]
Chen, Y., Pont-Tuset, J., Montes, A., Van Gool, L., 2018b. Blazingly fast video object segmentation with pixel-wise metric learning, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1189–1198.
[35]
Fathi, A., Wojna, Z., Rathod, V., Wang, P., Song, H.O., Guadarrama, S., Murphy, K.P., 2017. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277.
[36]
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C., 2019. Feelvos: Fast end-to-end embedding learning for video object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9481–9490.
[37]
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T., 2015. FlowNet: Learning optical flow with convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766.
[38]
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[39]
Lin, M., Chen, Q., Yan, S., 2013. Network in network. arXiv preprint arXiv:1312.4400.
[40]
Chen, L.C., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
[41]
Litjens, G., Toth, R., Ven, W.V.D., et al., 2013. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical Image Analysis 18, 359–373.
[42]
Tustison, N.J., Avants, B.B., Cook, P.A., Zheng, Y., Egan, A., Yushkevich, P.A., Gee, J.C., 2010. N4ITK: improved N3 bias correction. IEEE Transactions on Medical Imaging 29, 1310–1320.
[43]
Wang, Y., Dou, H., Hu, X., Zhu, L., Yang, X., Xu, M., Qin, J., Heng, P.A., Wang, T., Ni, D., 2019. Deep attentive features for prostate segmentation in 3D transrectal ultrasound. IEEE Transactions on Medical Imaging 38, 2768–2778.
[44]
Zeng, X., Huang, R., Zhong, Y., Sun, D., Han, C., Lin, D., Ni, D., Wang, Y., 2021. Reciprocal learning for semi-supervised segmentation, in: Medical Image Computing and Computer Assisted Intervention, Springer. pp. 352–361.
[45]
Zhong, Y., Wang, Y., 2023. SimPLe: Similarity-aware propagation learning for weakly-supervised breast cancer segmentation in DCE-MRI, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 567–577.
[46]
Yang, Z., Lin, D., Ni, D., Wang, Y., 2024a. Non-iterative scribble-supervised learning with pacing pseudo-masks for medical image segmentation. Expert Systems with Applications 238, 122024.
[47]
Yang, Z., Lin, D., Ni, D., Wang, Y., 2024b. Recurrent feature propagation and edge skip-connections for automatic abdominal organ segmentation. Expert Systems with Applications, 123856.

  1. \(l\) denotes the interval between the current and neighboring slices. Note that for the first \(l\) slices, we transfer the information from \(S_{n+l}\) to \(S_{n}\).↩︎

  2. https://chaos.grand-challenge.org/↩︎