Data-Efficient Unsupervised Interpolation

Without Any Intermediate Frame for 4D Medical Images

April 01, 2024

4D medical images, which represent 3D images with temporal information, are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However, acquiring 4D medical images poses challenges due to factors
such as radiation exposure and imaging duration, necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Given these circumstances, not only is data acquisition challenging, but increasing the frame rate for each
dataset also proves difficult. To address this challenge, this paper proposes a simple yet effective **U**nsupervised **V**olumetric **I**nterpolation framework, UVI-Net. This framework facilitates temporal
interpolation without the need for any intermediate frames, distinguishing it from the majority of other existing unsupervised methods. Experiments on benchmark datasets demonstrate significant improvements across diverse evaluation metrics compared to
unsupervised and supervised baselines. Remarkably, our approach achieves this superior performance even when trained with a dataset as small as one, highlighting its exceptional robustness and efficiency in scenarios with sparse supervision. This positions
UVI-Net as a compelling alternative for 4D medical imaging, particularly in settings where data availability is limited. The code is available at UVI-Net.

Video Frame Interpolation (VFI) has been a cornerstone in the realm of video processing, enriching motion visualization by generating intermediate frames. This method primarily relies on intermediate frame supervision, where known frames are used as references to create new intermediate frames. However, applying these VFI methods to 4D medical imaging is not trivial. While the principles of frame interpolation hold the potential for enhancing medical diagnostics and treatments [1]–[7], the unique constraints and requirements of medical imaging present challenges.

One significant challenge lies in obtaining a sufficient dataset. Unlike general domain videos, 4D medical images are captured for specific clinical purposes from a relatively small pool of individuals. Similarly, acquiring intermediate frames per image is also hampered by limitations and risks associated with medical imaging modalities.

Computed tomography (CT) exposes patients to elevated radiation levels, potentially increasing the risk of secondary cancer [8]. Similarly, magnetic resonance imaging (MRI) faces the obstacle of lengthy scan times, lasting up to an hour [9], presenting both logistical challenges and issues related to patient comfort. Furthermore, the quality of ground truth intermediate frames in medical imaging is often compromised due to factors such as patient movement, unstable breathing, and the difficulty of maintaining a stable position during prolonged scans [10], [11], limiting data variety and accessibility for research.

In light of these challenges, we present the following question: “Can a VFI model be trained without depending on *any* ground truth intermediate frames?". Unlike other previous unsupervised approaches in the 2D general domain [12]–[14] that interpolate frames given the
multiple frame sequences, we address the task of freely interpolating between two given frames *without any intermediate frames.* To achieve this, we propose a straightforward yet effective framework to VFI in medical imaging. By interpolating the
flow between two frames with a two-stage process and cycle-consistency constraint, our framework can effectively operate even with a video composited with two frames (i.e., only images of the start and end points exist), entirely in an unsupervised manner.
In the initial stage, virtual samples are generated from the two real input images. Subsequently, the real images are reconstructed based on these virtual intermediate samples. This reconstruction process incorporates the candidate images and warped
contextual information in multiple scales from the input images. Through this cyclic interpolation approach, we successfully minimize discrepancies between the generated and the actual images by using the real images as a form of pseudo-supervision.

Our proposed method has achieved state-of-the-art results in unsupervised VFI for 4D medical imaging, outperforming the existing techniques with a substantial gap. Our approach also consistently outperforms even for existing supervised methods. Remarkably, our model shows competitive or even superior performance when trained with a minimal training dataset size of just one, contrasting with other baselines that require full datasets, typically exceeding 60 in size. Additionally, the unsupervised nature of our model allows for further performance enhancements through instance-specific optimization. This process involves briefly fine-tuning the model using each test sample during the inference stage, potentially yielding even more refined results.

In summary, our contributions are three-fold:

We introduce a simple yet effective unsupervised VFI approach for 4D medical imaging. Our methodology leverages cycle consistency constraints within the temporal dimension, thereby obviating the need for ground truth data typically required for interpolated images.

Our approach achieves state-of-the-art performance, surpassing other unsupervised and supervised interpolation methods. This is accomplished without the instance-specific optimization, which could be employed as a viable option to enhance performance.

The robustness of our model is particularly evident under conditions of limited data availability, as demonstrated by the increasing performance margin relative to other methods when the dataset size is reduced.

Many studies in the field of video interpolation have been conducted, with a significant emphasis on frame rate upsampling for natural scene videos [15]–[17]. These studies typically rely on ground truth intermediate frames for training [18]–[25]. While some studies have explored alternative approaches that do not rely on ground truth intermediate frames, they involve synthesizing frames between a given sequence of intermediate frames [12]–[14] or utilize the information from specialized devices, such as event camera [26]. Consequently, applying these methods in settings like our study presents a challenge, as there are no intermediate frames available for synthesis. Furthermore, validation of these methods is restricted to 2D frames, and they encounter challenges when directly applied to volume sequences. This is primarily attributable to the markedly lower availability of intermediate frames within such datasets, as elucidated in [tab:dataset32comparison].

width=.37

To address the above challenges, frame interpolation methods specifically focused on 4D medical images are driven. Several recent works [27], [28] have attempted to interpolate medical 4D images, but these methods rely on the availability of ground-truth intermediate images for training. Although [29] proposed an interpolation approach without using the authentic intermediate frames, they do not incorporate an unsupervised learning technique for the interpolated samples. Instead, their method involves a post-hoc multiplication of the flow calculation model, which is prone to spatial distortion. This weakness arises since the underlying network does not account for the structural smoothness between two samples during network training. As a result, the scaled calculated flow fails to capture the spatial continuity of intermediate samples beyond the samples provided by authentic frames. Furthermore, since the method focus solely on warping without incorporating image synthesis, they encounter specific issues if a voxel is displaced to a new location without replacement at the original site. Specifically, it results in the voxel appearing twice in the backward-warped frame [30], [31], or a hole at the original location in the forward-warped frame [32]. To overcome the limitation of nonexistent training for intermediate images and warping procedure, we propose a novel network incorporating pseudo-supervision, including an image synthesis network to ensure the integrity of intermediate images.

Optical flow learning is crucial in the video and medical domain. Various learning methods have been extensively investigated [33]–[35] aiming to estimate optical flow. However, they require a ground truth optical flow for training, which is limited in availability. To address this limitation, some methods [36]–[46] have been developed to compute the similarity between the warped image and a fixed reference to train networks, allowing training without ground truth optical flow.

We first briefly introduce the necessary background on the flow calculation model in 3.1 and the existing unsupervised interpolation approaches in 3.2.

Suppose we are given two input images \(I_0\) and \(I_1\) at time \(T=0\) and \(T=1\), respectively. Our main objective is to predict the intermediate image \(\hat{I}_t\) at time \(T=t\) within the range of 0 to 1, given \(I_0\) and \(I_1\), without explicit supervision. An intuitive approach is to train a neural network to directly generate voxel values of \(\hat{I}_t\) without explicitly computing coordinate transformation. However, the generation models such as generative adversarial networks (GAN) [47]–[50] typically require a large amount of training data, making them impractical for the medical domain where data is limited. In contrast, flow calculation models [12], [18], [19] can generate 3D images using only two real input images. Given these advantages, we employ flow-based methods for this task, as they are capable of generating 3D images using only two real input images.

Flow-based interpolation approaches employ a flow calculation model \(\mathcal{F}^\theta\) with model parameters \(\theta\) to obtain a coordinate transformation map between two target samples. Given \(I_0\) and \(I_1\), the flow calculation model \(\mathcal{F}^\theta\) takes \(I_0\) and \(I_1\) as sequential inputs and provides a coordinate transformation map \(\phi^\theta_{0 \rightarrow 1}\). The objective of the flow calculation model \(\mathcal{F}^\theta\) is to warp \(I_0\) into \(\hat{I}_{0 \rightarrow 1} := I_0 \circ \phi^\theta_{0 \rightarrow 1}\) such that it matches \(I_1\), where \(\circ\) indicates spatial transformation.

To train flow calculation models, a warping loss \(\mathcal{L}_{warp}\left(I_0, I_1\right)\) is used, which ensures the quality of computed optical flow. The warping loss is defined based on the warped images \(I_1 \circ \phi_{1 \rightarrow 0}^\theta\) and \(I_0 \circ \phi_{0 \rightarrow 1}^\theta\), which corresponds to \(I_0\) and \(I_1\), respectively. \(L_{warp}\) can be expressed as \[\begin{align} \mathcal{L}_{warp}^{\theta}(I_0, &I_1) = \; \mathcal{L}_{smth}(\phi_{0 \rightarrow 1}^\theta) + \mathcal{L}_{image}(I_1, I_0 \circ \phi_{0 \rightarrow 1}^\theta) \notag \\ + \; &\mathcal{L}_{smth}(\phi_{1 \rightarrow 0}^\theta) + \mathcal{L}_{image}(I_0, I_1 \circ \phi_{1 \rightarrow 0}^\theta), \end{align}\] where \(\mathcal{L}_{smth}\) is a smoothness term that promotes similar flow values among neighboring voxels, and \(\mathcal{L}_{image}\) ensures alignment between two images. Typically, we utilize the sum of normalized cross-correlation (NCC) [38] and Charbonnier [51] losses as the \(\mathcal{L}_{image}\), since NCC has extensively used is 3D medical flow calculation works [38], [52], [53], and Charbonnier loss is a common choice in previous VFI works [21], [22], [52]–[55]. The losses are defined as: \[\begin{align} &\mathcal{L}_{smth}(\phi) = \; \lVert \nabla \phi \rVert_2 \\ \mathcal{L}_{image}(I, \hat{I}) &= \; - NCC(I, \hat{I}) + \sqrt{(I - \hat{I})^2 + \epsilon^2}, \label{eq:l95image} \end{align}\tag{1}\] where \(\nabla \phi\) denotes the flow gradient, and \(\epsilon\) represents a small constant.

The fully learned flow calculation model, denoted as \(\phi^{\theta^*}_{0 \rightarrow 1}\), is earned by minimizing the warping loss \(\mathcal{L}_{warp}^{\theta}(I_0, I_1)\) with respect to the model parameter \(\theta\). The calculated flow can be formulated as: \[\begin{align} \label{posthoc95flow1} \phi^{\theta^*}_{0 \rightarrow 1} \quad \mathrm{s.t.} \quad \theta^* := \underset{\theta}{\arg\min}\sum_{(I_0, I_1) \in \mathcal{D}} \mathcal{L}_{warp}^{\theta}\left(I_0, I_1\right), \end{align}\tag{2}\] where \(\mathcal{D}\) indicates the training set containing the pairs of \(I_0\) and \(I_1\).

If the flow from \(I_0\) to the intermediate target sample \(I_t\) can be ideally acquired as \(\phi_{0 \rightarrow t}\), the corresponding \(\hat{I}_t\) can also be obtained (i.e., \(\hat{I_t} = I_0 \circ \phi_{0 \rightarrow t}\)). To obtain this flow, current approaches [29], [38] approximate \(\phi_{0 \rightarrow t}\) as the following linear interpolation of the flow or latent vector: \[\begin{align} \label{baseline95flow1} \phi^{\theta^*}_{0 \rightarrow t} := t \cdot \phi^{\theta^*}_{0 \rightarrow 1} \quad \mathrm{or} \quad \phi^{\theta^*}_{0 \rightarrow t} := \phi^{t \cdot \theta^*}_{0 \rightarrow 1}, \end{align}\tag{3}\] where \(t \cdot \theta\) indicates the linear multiplication of latent vector. Therefore, the target \(I_t\) can be obtained by approximating it as \(\hat{I}_t := I_0 \circ \phi^{\theta^*}_{0 \rightarrow t}\).

As detailed in the latter part of 2.1, existing post-hoc linear interpolation approaches encounter two major challenges: firstly, they are prone to spatial distortion since the underlying network \(\mathcal{F}^{\theta^*}\) that \(\hat{I}_t\) relies on does not account for the structural smoothness between two samples during network training; and secondly, they often suffer from artifacts resulting from the warping procedure. Moreover, the methods heavily rely on post-hoc linear multiplication, leading to potential overfitting to the linear assumption. In other words, these methods assume that the structures within a given 4D medical image move only in a linear direction, and the magnitude of this movement is linearly proportional to time.

We introduce our Unsupervised Volumetric Interpolation Network, referred to as UVI-Net. The network first generates intermediate images and then employs cycle consistency constraints to reconstruct authentic images from these synthesized ones. In 4.1, we provide an overview and a detailed presentation of our method. The training and inference procedures are outlined in 4.2 and 4.3, respectively. Additionally, in 4.4, we introduce an instance-specific optimization method to further enhance our model’s performance.

To achieve a result exhibiting improved smoothness for the intermediate sample derived from the network, it is imperative for the network to access the pertinent information to the intermediate sample during the learning process. In light of this, we propose the cyclic structure model, which first generates the intermediate images and reconstructs them back to the two given input images. To ensure consistency and coherence in the generated images, we impose constraints of cycle consistency \(\mathcal{L}_{cyc}\) between the reconstructed samples, denoted as \(\hat{I}_0^{cyc}, \hat{I}_1^{cyc}\), and the corresponding original samples \(I_0, I_1\). The flow of the intermediate frame is then estimated using our flow calculation model with the parameter \(\theta^*\) as follows: \[\begin{align} \phi^{\theta^*}_{0 \rightarrow t} &:= t \cdot \phi^{\theta^*}_{0 \rightarrow 1} \quad \mathrm{s.t.}\tag{4} \\ \theta^* := \underset{\theta}{\arg\min}\,&\underset{\omega, \psi}{\min}\sum_{(I_0, I_1) \in \mathcal{D}}\mathcal{L}_{warp}^{\theta}\left(I_0, I_1\right) \notag \\ + \mathcal{L}_{cyc}^{(\theta,\omega,\psi)}&\left(I_0, \hat{I}_{0}^{cyc}\right) + \mathcal{L}_{cyc}^{(\theta,\omega, \psi)}\left(I_1, \hat{I}_{1}^{cyc}\right), \tag{5} \end{align}\] where \(\theta\), \(\omega\), and \(\psi\) indicate the parameters for the flow calculation, feature extraction, and reconstruction models, which will be described in the below sections.

Unlike the current approach in 2 , we allow the network to access intermediate samples and update them in its training, as described in 5 , resulting in improved natural voxels. We first explain the process of obtaining \(\hat{I}^{cyc}\) and provide a detailed explanation of \(\mathcal{L}_{cyc}\) in the following sections.

The overall acquire procedure of \(\hat{I}_0^{cyc}\) and \(\hat{I}_1^{cyc}\) is illustrated in 1. First, we generate multiple virtual intermediate samples (see Step 2 in 1) by randomly sampling values of \(t_1, t_2\), and \(t_3\) as below. \[\begin{align} \hat{I}_{t_1}^{vir} &:= I_0 \circ \left ( t_1 \cdot \phi^\theta_{0 \rightarrow 1} \right ) \qquad \quad -0.5 \le t_1 \le 0 \makebox[0cm]{} \tag{6} \\ \hat{I}_{t_2}^{vir} &:= \begin{cases} I_0 \circ \left ( t_2 \cdot \phi^\theta_{0 \rightarrow 1} \right ) \qquad \quad \;\; 0 \le t_2 \le 0.5 \\ I_1 \circ \left ( (1-t_2) \cdot \phi^\theta_{1 \rightarrow 0} \right ) \quad 0.5 \le t_2 \le 1 \end{cases} \\ \hat{I}_{t_3}^{vir} &:= I_1 \circ \left ( (1-t_3) \cdot \phi^\theta_{1 \rightarrow 0} \right ) \qquad 1 \le t_3 \le 1.5 \tag{7} \end{align}\]

Since \(\hat{I}_{t_1}^{vir}\) and \(\hat{I}_{t_3}^{vir}\) are generated outside the time range between the two frames, we limit the maximum time offset to 0.5 to mitigate the occurrence of artifacts. When generating the \(\hat{I}_{t_2}^{vir}\), a synthesized image between the two input images, we adopt the result created from the image—either \(I_0\) or \(I_1\)—that is closer to the reference point \(t_2\), to preserve the properties of the real image maximally.

Next, we interpolate the generated intermediate samples (see Step 3 in 1) to acquire the \(I_0\) and \(I_1\)’s candidates as follows: \[\begin{align} \hat{I}_{t_1 \rightarrow 0}^{cand} := \hat{I}^{vir}_{t_1} \circ \left ( \frac{-t_1}{t_2 - t_1} \cdot \phi^{\theta}_{t_1 \rightarrow t_2} \right ), \tag{8} \\ \hat{I}_{t_2 \rightarrow 0}^{cand} := \hat{I}^{vir}_{t_2} \circ \left ( \frac{t_2}{t_2 - t_1} \cdot \phi^{\theta}_{t_2 \rightarrow t_1} \right ), \\ \hat{I}_{t_2 \rightarrow 1}^{cand} := \hat{I}^{vir}_{t_1} \circ \left ( \frac{1-t_2}{t_3 - t_2} \cdot \phi^{\theta}_{t_2 \rightarrow t_3} \right ), \\ \hat{I}_{t_3 \rightarrow 1}^{cand} := \hat{I}^{vir}_{t_3} \circ \left ( \frac{t_3-1}{t_3 - t_2} \cdot \phi^{\theta}_{t_3 \rightarrow t_2} \right ). \tag{9} \end{align}\]

While warping the virtual frames, we simultaneously warp the feature space of the frames across multiple resolutions, obtaining a set of warped feature maps: \(\mathcal{S}_{t_1 \rightarrow 0}\), \(\mathcal{S}_{t_2\rightarrow 0}\), \(\mathcal{S}_{t_2 \rightarrow 1}\), and \(\mathcal{S}_{t_3 \rightarrow 1}\). Specifically, following the architecture of our feature extractor as shown in 3, we extract feature maps resized to 1, 0.5, and 0.25 times their original size. Then, using the same optical flow as described in 8 to (9 ) (or downscaled as necessary), we obtain the final warped feature maps. This method enhances the reconstruction model’s ability to make more accurate predictions by providing access to both voxel and feature information. Furthermore, we extract image representations at various levels, which have proven effective in previous research on video-related tasks [22], [24], [56].

Using these warped images and features, we obtain the predictions \(\hat{I}_{0}^{cyc}\) and \(\hat{I}_{1}^{cyc}\) using the reconstruction model \(\mathcal{R}^\psi\) (see Step 4 in 1). The model takes the distance-based weighted sum images and warped feature map sets, and reconstructs the original frames through residual corrections. Each element of the input feature map sets is fed into individual encoder layers of the reconstruction model and concatenated channel-wise. The procedure of the reconstruction model can be written as: \[\begin{align} \hat{I}_{0}^{cyc} := \; &\mathcal{R}^\psi(\, \hat{I}_{t_1 \rightarrow 0}^{cand} \oplus \hat{I}_{t_2 \rightarrow 0}^{cand},\, \mathcal{S}_{t_1 \rightarrow 0},\, \mathcal{S}_{t_2\rightarrow 0} \, ), \\ \hat{I}_{1}^{cyc} := \; &\mathcal{R}^\psi(\, \hat{I}_{t_2 \rightarrow 1}^{cand} \oplus \hat{I}_{t_3 \rightarrow 1}^{cand},\, \mathcal{S}_{t_2 \rightarrow 1},\, \mathcal{S}_{t_3\rightarrow 1} \, ), \end{align}\] where \(\oplus\) indicates distance-based addition.

With the reconstructed images \(\hat{I}_0^{cyc}\) and \(\hat{I}_1^{cyc}\), we can introduce the cycle consistency loss. Our cycle-consistent framework reconstructs real images from the generated intermediate images, thereby enhancing the smoothness of the interpolated images. Without time notation for clarity, consider the reconstructed image (\(\hat{I}^{cyc}\)) and corresponding real image (\(I\)). The cycle consistency loss is defined as: \[\begin{align} \mathcal{L}_{cyc}^{(\theta,\omega,\psi)}(I, \hat{I}^{cyc}) &= \mathcal{L}_{image}(I, \hat{I}^{cyc}) + \mathcal{L}_{reg}(\mathcal{R}^\psi), \end{align}\] where \(\mathcal{L}_{image}\) follows 1 , and \(\mathcal{L}_{reg}\) acts as an L1 regularization term applied to the predicted residual of the reconstruction model. This term helps control excessive modification during the reconstruction process.

In essence, even without any intermediate frames, we utilize the given authentic frames as pseudo supervision for the intermediate frame, facilitated by the initially generated virtual intermediate samples \(\hat{I}^{vir}\). By incorporating a cycle consistency constraint between the reconstructed and original authentic images, our approach enhances spatial continuity between the two images and generates high-quality virtual intermediate samples.

width=.85

We illustrate the overall inference procedure of UVI-Net in 2. First, we obtain two optical flow \(\phi_{0 \rightarrow 1}^{\theta^*}\) and \(\phi_{1 \rightarrow 0}^{\theta^*}\), where \(\theta^*\) follows 5 . Next, we attain two \(I_t\)’s candidate as follows: \[\begin{align} \hat{I}_{0 \rightarrow t}^{cand} := \; &I_0 \circ \phi_{0 \rightarrow t} = I_0 \circ \left( t \cdot \phi_{0 \rightarrow 1}^{\theta^*} \right)\\ \hat{I}_{1 \rightarrow t}^{cand} := \; &I_1 \circ \phi_{1 \rightarrow t} = I_1 \circ \left( (1-t) \cdot \phi_{1 \rightarrow 0}^{\theta^*} \right). \end{align}\] Finally, by reconstructing the final image with the two candidates considering the temporal distance, we derive \(\hat{I}_{t}\) as \[\begin{align} \hat{I}_t := \; & \mathcal{R}^\psi(\, \hat{I}_{0 \rightarrow t}^{cand} \oplus \hat{I}_{1 \rightarrow t}^{cand},\, \mathcal{S}_{0 \rightarrow t},\, \mathcal{S}_{1 \rightarrow t} \, ) , \end{align}\] where \(\mathcal{S}_{0 \rightarrow t}\) and \(\mathcal{S}_{1 \rightarrow t}\) are warped feature map sets from \(I_0\) and \(I_1\), respectively. Remarkably, while baseline approaches can only use one of \(\hat{I}_{0 \rightarrow t}\) and \(\hat{I}_{1 \rightarrow t}\), we can engage both information and make to be symmetric even the order of \(I_0\) and \(I_1\) is switched.

Instance-specific optimization is a technique used to enhance the final performance by fine-tuning models for each test sample. This approach was introduced by [37] within the unsupervised medical image warping domain. Despite our work being in a different task, this strategy remains applicable. Utilizing a model weight pre-trained on the training data, we fine-tune the model for a relatively small number of epochs on each test data. Such an adaptive approach is particularly beneficial in medical imaging, allowing for more personalized and accurate frame interpolation tailored to individual scans.

This section describes the benchmark datasets for 4D medical imaging used in this study in 5.1. Next, 5.2 outlines some settings, including training details and metrics for performance evaluation. The results are comprehensively presented in 5.3, highlighting our method’s effectiveness and efficiency.

To evaluate the performance of image interpolation, two 4D image datasets are used, each for the heart and lung. The ACDC cardiac dataset [57] consists of 100 4D temporal cardiac MRI images. End-diastolic and end-systolic phase images are used as the start and end images, respectively. The initial 90 alphabetically sorted samples form the training set, with the remaining used for the test set. The 4D-Lung dataset [58] consists of 82 chest CT scans for radiotherapy planning from 20 lung cancer patients. In each 4D-CT study, the end-inspiratory (0% phase) and end-expiratory (50% phase) phase scans are set as the initial and final images, respectively. The first 68 CT scans from 18 patients in the dataset are included in the training set. For additional information for dataset, please refer to 7.

For comparison with our proposed methods, six models are included as the baselines. VoxelMorph (VM) [37], TransMorph (TM) [59], Fourier-Net+ [45] and R2Net [44] are first initially trained with the provided dataset to calculate optical flow. Interpolated images are then obtained by linear scaling the optical flow, i.e., \(t \cdot \phi^{\theta^*}_{0 \rightarrow t}\). Diffusion Deformable Model (DDM) [29] also uses dataset training but interpolates by scaling the latent vector, i.e., \(\phi^{t \cdot \theta^*}_{0 \rightarrow t}\). For IDIR [46], it is crucial to clarify that it requires individual training for each target registration pair, leading to limited generalization, whereas our method is trained using a distinct training set and subsequently applied for inference on the target pairs. We also compared the results of our model with two supervised methods proposed for video interpolation on 4D medical images: SVIN [27] and MPVF [55]. Detailed information about the baseline models is in 8.

To evaluate the similarity between the predicted and ground truth images, metrics including PSNR (Peak Signal-to-Noise Ratio) [60], NCC (Normalized Cross Correlation), SSIM (Structural Similarity Index Measure) [61], NMSE (Normalized Mean Squared Error) and LPIPS (Learned Perceptual Image Patch Similarity) [62] are used. Since LPIPS is available only for 2D, it was averaged across slices along the x, y, and z axes. Each metric represents the voxel-wise similarity, correlation, structural similarity, reconstruction error, and perceptual similarity between the synthesized and authentic images.

For the flow calculation model, we employed the network designed in VoxelMorph [37]. As for the reconstruction model, we used a small size of 3D-UNet. The detailed configuration of the network and more details are described in 9. The proposed method was implemented with PyTorch [63] using an NVIDIA Tesla V100 GPU. The training process takes approximately 4 hours for the cardiac dataset and 8 hours for the lung dataset, respectively. Instance-specific optimization took about 1.12 minutes per sample for ACDC and 3.13 minutes for 4D-Lung.

The performance of interpolation compared to unsupervised and supervised methods is shown in [tab:interpolation32result]. Our method consistently demonstrates superior performance among all the models, outperforming others with a significant margin in every evaluation metric. This trend is observed across both heart and lung datasets, even in the absence of instance-specific optimization.

It is important to note that our approach surpasses IDIR [46], serving as a rigorous comparison baseline for our method due to IDIR’s test set-specific optimization. The core methodology behind IDIR undergoes unique adaptation for each test set pair, which involves retraining for every new instance. While this strategy enables IDIR to tailor its performance to each dataset, it restricts its practical applicability. Nevertheless, our method demonstrates substantial superiority over IDIR in terms of performance.

Notably, our approach also surpasses supervised methods. An interesting observation is the varying performance of these supervised models across different datasets. As detailed in [tab:dataset32comparison], the ACDC dataset contains significantly more frames compared to the 4D-Lung dataset. This discrepancy implies that the 4D-Lung dataset experiences limitations in terms of supervision quality. Therefore, the performance gap is more pronounced in the lung dataset, underscoring a critical insight: supervised models tend to underperform with limited supervision from intermediate frames. This pattern reaffirms the importance of our method’s ability to achieve high accuracy in scenarios with constrained supervision, highlighting its robustness and effectiveness in 4D medical VFI tasks.

4 illustrates the interpolation performance based on the number of training samples. With the test sets remaining fixed, the sizes of the training sets are reduced from their full down to one. Compared to the other five
unsupervised baselines (VM, TM, Fourier-Net+, R2Net, and DDM), our method consistently exhibits superior performance across varying training set sizes, with performance gaps widening as the dataset size decreases. Remarkably, even with a minimal size
comprising *only one sample*, our approach frequently outperforms the baselines that utilize the maximum training set size. It should be noted that IDIR is not included in this comparison, as it does not follow a traditional training process on a
training set. For the two supervised baseline models (SVIN, MPVF), the performance also diminishes as the number of samples for supervision decreases, leading to an increasing performance gap between them and our model. Our consistent performance in
scenarios with small datasets underscores the strengths of our approach in mitigating the challenges posed by data scarcity in the medical field.

The comparison of qualitative results between interpolation methods is shown in 5. Our method consistently produces visually appealing and accurate intermediate frames, capturing fine details and preserving the structural integrity of the original images.

We also demonstrate that our interpolation method can be applied to downstream tasks. Specifically, we tested its effectiveness on segmentation data, which is relatively complex, demonstrating our approach’s potential for augmenting 3D medical datasets. Details of the experimental setup and performance can be found in 10.

Additional experiments, including ablation studies and further qualitative results, are detailed in 11. Moreover, we have analyzed the results of extrapolation to ensure that generated images during the training process do not exhibit any unnatural changes or issues.

Our framework, UVI-Net, effectively tackles the challenge of generating intermediate frames for 4D medical images through unsupervised volumetric interpolation. By leveraging pseudo supervision within a cyclic structure, our method ensures spatial continuity between the generated intermediate and real images. Experimental results on benchmark datasets validate the efficacy of our approach, revealing substantial improvements in intermediate frame quality across various evaluation metrics, surpassing both unsupervised and supervised baselines. Furthermore, our method has demonstrated robustness not only in situations of frame scarcity but also in data scarcity contexts. Ultimately, this study underscores the promise of unsupervised 3D flow-based interpolation and opens new avenues for research and development in the field of medical imaging.

This study was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075 Artificial Intelligence Graduate School Program (KAIST)) and Medical Scientist Training Program from the Ministry of Science & ICT of Korea.

The ACDC dataset features an average of 10.02\(\pm\)2.20 frames between the end-systolic and end-diastolic phases in the training set, with the test set presenting an average of 8.80\(\pm\)2.48 frames. All cardiac MRI scans have been uniformly resized. Following this resizing process, min-max scaling is applied to ensure consistent scaling across all scans.

In the case of the 4D-lung dataset, the models are trained to predict the four intermediate frames (10%, 20%, 30%, 40%) between the end-inspiratory (0%) and end-expiratory (50%) phases. Only CT images captured using kilovoltage energy are included in the study due to their superior image quality. Each lung CT scan is adjusted to the lung window range (-1400 to +200 Hounsfield unit) [64] and subjected to centering and min-max scaling. Subsequently, bed removal is performed using the following method: pixels exceeding a certain threshold (-500 HU in this study) are assigned a value of 1, while all other pixels are set to 0, creating a binarized map. The binarized map undergoes erosion/dilation [65] to identify the most prominent body contour mask. By getting the resulting body contour mask to the corresponding voxel region of the given images, a bed-removed CT image is obtained. All the lung CT images are resized to \(128 \times 128 \times 128\).

The following three unsupervised models and two supervised models are used as the baseline models for our main result: VoxelMorph [38], TransMorph [59], Fourier-Net+ [45], R2Net [44], IDIR [46], DDM [29] for unsupervised models, and SVIN [27], MPVF [55] for supervised models. To the best of our knowledge, this selection covers the most pertinent and all current baseline models in the field, providing a comprehensive benchmark for our study.

The VoxelMorph employs the exact same model architecture as our flow calculation model, as discussed in 9.1. For TransMorph, we follow the TransMorph-Large framework from the original paper. In the case of Fourier-Net+, R2Net, IDIR, and DDM, we utilize the default architecture outlined in the original paper.

In our study involving SVIN, we adhered to the official architecture as described in the foundational paper. For MPVF, we applied the architecture specified for the ACDC dataset, as outlined in the original publication. However, our experience with the 4D-lung dataset presented unique challenges. Despite the original study using a distinct lung preprocessing method, which resulted in larger data sizes, and reporting successful execution on a V100 GPU with 32GB of memory, our attempts to run their code on an A6000 GPU with 48GB of memory encountered memory issues. Upon contacting the authors, we learned that no official code was available for the 4D-lung dataset. Consequently, we were compelled to arbitrarily modify the model size to accommodate our 48GB memory constraint. This entailed reducing the encoder inplanes from [32, 64, 128] to [8, 16, 32], decreasing the number of ViT heads from 4 to 2, lowering the ViT num classes from 1000 to 300, and diminishing the hidden dimension from 256 to 64. Please note that although we reduced the model to fit a 48GB memory constraint, our measurements were conducted on a model size larger than the original model’s 32GB specification.

The flow calculation model follows the network architecture illustrated in 6 (a), which is based on VoxelMorph [38]. The model processes a single input by combining the images \(I_0\) and \(I_1\) into a 2-channel 3D image. Then, it outputs 3-channel 3D flows, where each channel represents the displacement along each dimension. The flow model incorporates 3D convolutions in both the encoder and decoder stages with a kernel size of 3. LeakyReLU layer with a negative slope of 0.2 follows each convolutional operation.

In the encoder, strided convolutions with stride size 2 are utilized to reduce the spatial dimensions by half at each layer. Conversely, the decoding involves a combination of upsampling, convolutions, and concatenation of skip connections. As a result, the model outputs the flows \(\phi_{0 \rightarrow 1}\) and \(\phi_{1 \rightarrow 0}\), each warping \(I_0\) to resemble \(I_1\) and \(I_1\) to resemble \(I_0\), respectively.

6 (b) describes the architecture of the reconstruction model, based on 3D-UNet [66]. We employ a single image \(\hat{I}^{cand}_{0 \rightarrow t} \oplus \hat{I}^{cand}_{1 \rightarrow t}\), which is a weighted sum of two candidate images, in conjunction with three levels of multi-resolution features, each possessing channel dimensions of 4, 8, and 16, respectively. The model’s first encoder layer receives an input composed of two channel-wise concatenated warped features and an image \(\hat{I}^{cand}_{0 \rightarrow t} \oplus \hat{I}^{cand}_{1 \rightarrow t}\). Advancing to the subsequent layers, the model concatenates features of half and quarter resolutions at the second and third encoder layers. Thereafter, the model returns the image difference \(\Delta \hat{I}\), which will be added to the input to acquire the final estimated image \(\hat{I}_{t}\). The architecture of the reconstruction model follows details similar to those of the flow calculation model.

In our training process, we employ the Adam optimizer [67] with a learning rate \(2 \times 10^{-4}\) for 200 epochs, configuring the batch size as 1. For instance-specific optimization, models are fine-tuned for 100 epochs on the given test sample while maintaining the same experimental settings as in the previous training. The results are presented in a straightforward setup, with all loss coefficients uniformly set to 1.

We propose an effective 3D data augmentation technique based on our interpolation framework. To extend the interpolation task to 3D data augmentation, we generate new data by inputting randomly selected pairs of 3D images from the training dataset that share common types of segmentation labels. Here, we utilize time \(t\) as an interpolation degree for augmentation. Furthermore, inspired by previous works [38], [59], we incorporate the segmentation labels as supplementary information to enrich the augmented dataset.

Let \(s_0\) and \(s_1\) represent the organ segmentation of \(I_0\) and \(I_1\). When calculating flow fields, we only use \(I_0\) and \(I_1\), excluding segmentation labels. Using the calculated flows, we calculate \(\hat{s}_{t_1 \rightarrow 0}^{cand}, \; \hat{s}_{t_2 \rightarrow 0}^{cand}, \; \hat{s}_{t_2 \rightarrow 1}^{cand}\) and \(\hat{s}_{t_3 \rightarrow 1}^{cand}\) similar to the procedure of image. Finally, we ensure that \(\hat{s}_{t_1 \rightarrow 0}^{cand}\) and \(s_{t_2 \rightarrow 0}^{cand}\) have cycle consistency between \(s_0\), while \(s_{t_2 \rightarrow 1}^{cand}\) and \(s_{t_2 \rightarrow 1}^{cand}\) have cycle consistency with \(s_1\).

When labels are used during training, we expand the segmentation map into \(K\) binary masks to enable backpropagation, where \(K\) represents the total number of labels in the segmentation maps. Since Dice score [68] is commonly used to quantify optical flow performance [38], [59], we directly minimized the Dice loss [69].

For the segmentation dataset for augmentation, three 3D medical datasets are used. OASIS [70] is a brain dataset comprising 414
T1-weighted MRI scans and the corresponding segmentation labels for 36 organs, including the background label released from VoxelMorph [38].
IXI ^{2} is another brain MRI dataset with segmentation labels for 31 organs, including the background [59] released from TransMorph [59]. All the brain MRI scans are
skull-stripped and resized to \(128 \times 128 \times 128\). In both datasets, the first 20 samples are used for training, while the rest are included in the test set. Lastly, MSD-Heart [71] is an MRI dataset with one label (excluding background) and resized to \(128 \times 128 \times 64\). Since MSD-Heart has
only 20 data, we use 10 data for training and 10 for testing with background loss.

width=.8

To perform 3D segmentation, we utilize three publicly available models from MONAI package^{3}: 3D-UNet [66], VNet [72], and UNETR [73]. The segmentation models are trained for 15,000 iteration steps the final Dice score at the last iteration is recorded. Adam optimizer [67] with an initial learning rate \(1 \times 10^{-4}\) is used, and batch size is set to 1. For loss function, the
weighted sum of Dice [69] and Cross Entropy [74] losses is used. For augmented data generation, which expands the original dataset size by a factor of ten, we employed alpha sampling ratios of \(t = 0.1, 0.2, \ldots, 1.0\).

We have successfully generated pairs of images and labels, as illustrated in 7. Detailed results presented in [tab:segmentation32result] reveal that our approach consistently outperforms competing methods, delivering superior performance across a diverse range of conditions. This includes variations in dataset types and the use of different segmentation models, underscoring the robustness and versatility of our methodology.

width=0.65

width=0.55

width=0.6

We further substantiate our methodology through a series of ablation studies designed to broaden the empirical results. All reported outcomes represent the values derived from three distinct experimental runs.

The ablation results of loss terms conducted on the ACDC dataset are summarized in [tab:loss95ablation] and [tab:loss95ablation95image]. As indicated in [tab:loss95ablation], integrating each component of cyclic loss, which are \(\mathcal{L}_{image}\) and \(\mathcal{L}_{reg}\), significantly improves the performance of intermediate image synthesis. Furthermore, [tab:loss95ablation95image] demonstrates that the combined application of NCC and Charbonnier losses leads to a performance improvement compared to the application of each loss term independently.

The [tab:feature95ablation] presents the results of ablation studies on the feature extraction model, conducted on the ACDC dataset. In our comparative analysis, we demonstrate that our feature extraction methodology exhibits superior performance compared to scenarios where no feature extraction model is implemented. Additionally, we explored alternative methods of feature extraction, including: (1) using the Canny edge detector, (2) employing a simple U-Net architecture, and (3) utilizing a CNN module with single-scale warped images. Our approach outperformed other feature extraction modules in overall metric aspects. Moreover, some metrics in those modules showed performance worse than cases where no feature extraction was applied.

We present a series of additional qualitative results in 8. Our approach demonstrate the superior results against various baseline methods. This not only underscores our method’s enhanced alignment and coordination but also showcases its ability to generate outcomes that are more accurate and realistic. The visual evidence presented here plays a crucial role in substantiating the quantitative metrics we have reported, offering a holistic view of our model’s capabilities in real-world scenarios.

The 9 visualizes the extrapolation results, particularly for \(\hat{I}_{-0.5}\) and \(\hat{I}_{1.5}\), along with the corresponding optical flow and source images \(I_0\) and \(I_1\). These images represent the most extreme cases of extrapolation in our study. To ensure the credibility and real-world applicability of the results, they have been rigorously examined by a board-certified radiation oncologist. The evaluation focused on determining whether the extrapolated images exhibit any excessive or unnatural changes that could undermine their practical utility. This ensures that using extrapolation in our method does not present significant complications.

10 visualizes the prediction results over time for the entire 4D sequence. As the baseline results, we introduce the interpolated images through the application of linear scaling to VoxelMorph, which serves as the backbone registration model within our framework. It can be observed that our approach more effectively captures fine-grained details and predicts the ground truth compared to the baseline.

[1]

vol. (itv) for hepatocellular carcinoma using four–dimensional ct. Radiotherapy and Oncology, 84 (3): 272–278, 2007.

[2]

.
[3]

.
[4]

.
[5]

.
[6]

[7]

[8]

.
[9]

.
[10]

[11]

.
[12]

pp. 892–900, 2019.

[13]

.
[14]

pp. 8794–8802, 2019.

[15]

pp. 498–507, 2018.

[16]

pp. 2398–2407, 2019.

[17]

pp. 261–270, 2017.

[18]

pp. 9000–9008, 2018.

[19]

pp. 3370–3379, 2020.

[20]

pp. 5682–5692, 2023.

[21]

pp. 1568–1577, 2023.

[22]

pp. 1578–1587, 2023.

[23]

pp. 22169–22179, 2023.

[24]

pp. 5437–5446, 2020.

[25]

pp. 2047–2057, 2022.

[26]

pp. 17804–17813, 2022.

[27]

pp. 4726–4735, 2020.

[28]

.
[29]

pp. 539–548. Springer, 2022.

[30]

pp. 2839–2847, 2022.

[31]

vol. network for learning optical flow. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2705–2713, 2020.

[32]

pp. 1701–1710, 2018.

[33]

.
[34]

pp. 232–239. Springer, 2017.

[35]

pp. 55–63. Springer, 2018.

[36]

[37]

pp. 9252–9260, 2018.

[38]

.
[39]

.
[40]

pp. 347–364. Springer, 2022.

[41]

pp. 120–129. Springer, 2019.

[42]

.
[43]

.
[44]

.
[45]

.
[46]

pp. 1349–1359. PMLR, 2022.

[47]

.
[48]

pp. 417–425. Springer, 2017.

[49]

.
[50]

pp. 174–182. Springer, 2018.

[51]

pp. 168–172. IEEE, 1994.

[52]

.
[53]

pp. 3–13, Cham, 2021. Springer International Publishing.

[54]

[55]

[56]

pp. 6323–6333, 2023.

[57]

.
[58]

.
[59]

.
[60]

pp. 1906–1913, 2005.

[61]

.
[62]

pp. 586–595, 2018.

[63]

pp. 8024–8035. Curran Associates, Inc., 2019.

[64]

.
[65]

pp. 63–103. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.

[66]

pp. 234–241. Springer, 2015.

[67]

.
[68]

.
[69]

pp. 565–571. Ieee, 2016.

[70]

.
[71]

.
[72]

.
[73]

pp. 574–584, 2022.

[74]

.