March 19, 2024
X-ray photon-counting computed tomography (PCCT) for extremity imaging enables multi-energy high-resolution (HR) imaging, but its radiation dose can be further reduced. Despite the great potential of deep learning techniques, their application to HR volumetric PCCT reconstruction has been challenged by the large memory burden, training data scarcity, and domain gap issues. In this paper, we propose a deep learning-based approach for PCCT image reconstruction at halved dose and doubled speed, validated in a New Zealand clinical trial. Specifically, we design a patch-based volumetric refinement network to alleviate the GPU memory limitation, train the network with synthetic data, and use model-based iterative refinement to bridge the gap between synthetic and clinical data. Our results in a reader study of 8 patients from the clinical trial demonstrate a great potential to cut the radiation dose to half of the clinical PCCT standard without compromising image quality or diagnostic value.
Photon-counting CT, few-view reconstruction, radiation dose reduction, deep learning, clinical trial.
Given the potential risk to patients from ionizing radiation, research is actively performed to reduce computed tomography (CT) radiation dose [1]. For example, scanning parameters can be optimized for individual patients and automatic exposure control can be applied [2]. Recent developments in photon-counting CT (PCCT) and deep learning-based reconstruction algorithms offer new promise in this regard [3], [4].
The great clinical potential of PCCT has been well demonstrated in atherosclerosis imaging, extremity scanning, and multi-contrast-enhanced studies [5]–[9]. Since 2019, the PCCT company MARS has been conducting human clinical trials for orthopaedic and cardiovascular applications, and has already expanded the trials into local acute care clinics. The orthopaedic trials have shown that high-resolution PCCT imaging is advantageous in the acute, follow-up, pre-surgical, and post-surgical stages. Efforts are being made to conduct clinical trials in Europe for rheumatology applications.
Despite the huge potential of extremity PCCT, a few challenges must be addressed for further improvement [10]. The current low-cost, light-weight specialty MARS CT scanner uses neither an expensive gantry nor a heavy X-ray source. Its customized gantry offers cost-effectiveness at the expense of scanning speed, and its micro-focus X-ray tube improves spatial resolution but limits the photon flux. These mechanical limitations and the low photon flux introduce significant challenges in temporal resolution and noise. First, the MARS PCCT scanner currently takes over 8 minutes to scan a patient, which cannot support dynamic contrast-enhanced studies. Second, channel-wise projections suffer from low signal-to-noise ratios. For instance, with our current protocol, fewer than 1,500 photons are split into five non-overlapping energy bins, resulting in only hundreds of photons per channel, as opposed to \(\sim 1\times10^5\) photons for conventional CT. This becomes more problematic with narrower energy bins. To mitigate these issues, a natural solution is to reduce the number of projection views per scan and to develop advanced reconstruction techniques.
Few-view and low-dose CT reconstruction is a main area of CT research. In the early stages, compressed sensing techniques were widely used, such as total variation (TV) and dictionary learning [11], [12]. More recently, deep learning methods have delivered exciting reconstructions [13], becoming the new frontier in this direction. However, several gaps remain for high-resolution (HR) PCCT. First, the existing methods are mainly developed for CT image reconstruction in single-channel mode and 2D imaging geometry; few target volumetric reconstruction in multi-channel mode at high resolution due to GPU memory constraints [14]–[19]. Second, it is well known that network performance can drop significantly if the data used during inference come from a distribution different from that of the training data [20]–[22]. This domain gap is more critical for diagnostic image reconstruction, which is less tolerant of artifacts and “hallucinations” (factually incorrect or misleading structures with realistic appearances) than some other fields. Third, high-quality datasets for network training are scarce since HR PCCT is an emerging technology. Although some unsupervised and self-supervised methods work without paired data, they often rely on specific assumptions about noise characteristics in the images, do not consider the inter-channel correlation in spectral images, and demonstrate sub-optimal performance [22]–[27].
In this paper, we present a deep learning-based approach to address the aforementioned challenges in a few-view mode at a halved dose (using half the number of projections from the original full-dose scan, with “half-view” and “half-dose” used interchangeably hereafter unless explicitly noted) relative to the commercial PCCT technology used in the New Zealand clinical trial. We summarize our primary contributions as follows:
We develop a deep learning-based reconstruction pipeline for volumetric HR PCCT. The pipeline is memory-efficient on a single workstation, aided by a shared low-noise prior for all channels, patch-based deep iterative refinement for channel-wise volume reconstruction, and texture tuning that leverages inter-channel correlations for full-channel slices;
We demonstrate the potential in addressing the domain gap issues using our patch-based volumetric denoising combined with a model-based iterative refinement framework. With the proposed network trained on synthetic data, we consistently achieve excellent results on both phantom data and patient scans acquired on different machines with different protocols;
Our half-view/half-dose PCCT reconstruction results are favored by radiologists over the proprietary reconstruction from the full-view dataset in terms of diagnostic image quality, suggesting a great potential for clinical translation to address data scarcity.
To the best of our knowledge, this is the first attempt at deep learning-based volumetric reconstruction for multi-channel HR PCCT, e.g., \(1,200^3\times5\). This also represents the first study reporting that a deep network trained on synthetic data delivers superior diagnostic quality at half dose compared with full-dose clinical PCCT reconstruction.
An overview of our approach is shown in Fig. 1, mainly consisting of three parts: structural prior reconstruction, deep iterative refinement, and textural appearance tuning. The details are elaborated in the following subsections.
The clinical trial was performed on the state-of-the-art MARS Extremity 5X120 scanner, which can simultaneously measure five effective energy bins at a spatial resolution of \(50\)–\(200\,\mu m\). The system includes CdTe-Medipix3RX photon-counting detectors (PCDs) with \(110\,\mu m\) pixel pitch (12 chips arranged in a non-flat arc), an X-ray source (up to \(120\,kVp\), \(350\,\mu A\)), and a rotating gantry for helical scanning. It provides an isotropic \(90^3\,\mu m^3\) voxel size. The bore size is \(125\,mm\) with a scanning length of \(35\,cm\) for extremity scans.
Each element of the MARS PCD counts with 5 effective energy thresholds simultaneously, resulting in quasi-monochromatic projections in 5 energy bins: \(7{-}40\,keV\), \(40{-}50\,keV\), \(50{-}60\,keV\), \(60{-}70\,keV\), and above \(70\,keV\). The patient is scanned with a total incoming photon count of around 1,500 per detector element for the open-beam measurement, resulting in only hundreds of photons per channel. Given such low counts, direct reconstruction in each energy bin inevitably suffers from major quantum noise. Instead, we employ an idea similar to prior image constrained compressed sensing [28] and its extension to PCCT [29], noticing that the structural information among different energy bins is closely correlated, with only slight differences in attenuation values. Hence, we propose the following steps for spectral reconstruction in 5 energy bins with minimized quantum noise: (1) we sum the counts from all channels to form a virtual ‘integrating’ bin with minimized quantum uncertainty; (2) we reconstruct from the virtual bin to obtain an image as a low-noise structural prior; (3) leveraging the inter-bin similarity, we initialize our iterative deep reconstruction method with the structural prior and feed in bin-wise real data to reconstruct the spectral image. A multi-scale iterative reconstruction strategy is used to significantly accelerate convergence for the large-volume reconstruction. Note that we enforce this structural similarity by using the prior only as an initialization for the best computational efficiency, in contrast to traditional approaches [29] that use the prior as a constraint and solve the optimization problem iteratively. As a cost, this step does not guarantee solid convergence. Hence, we rely on iterative reconstruction constrained by a deep neural network-based prior regularization for valid image reconstruction, as introduced in the following section, while this virtual-bin structural prior initialization serves as an acceleration step.
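A minimal sketch of this structural-prior step is given below, under simplifying assumptions (a monochromatic Beer-Lambert model and array layouts of our choosing); `sirt_reconstruct` is a hypothetical stand-in for an SIRT solver (e.g., one built on the ASTRA Toolbox), not part of the actual implementation:

```python
import numpy as np

# Hedged sketch of the virtual 'integrating' bin and structural prior.
# `sirt_reconstruct` is a hypothetical SIRT solver; `counts` and
# `open_beam` are assumed to be shaped (bins, views, rows, cols).
def structural_prior(counts, open_beam, geometry, n_iters=80):
    # (1) Sum counts over all energy bins to form a virtual bin with
    #     minimized quantum uncertainty.
    virtual_counts = counts.sum(axis=0)
    virtual_open = open_beam.sum(axis=0)
    # (2) Convert counts to line integrals (Beer-Lambert law); clipping
    #     guards against log(0) at very low counts.
    sino = -np.log(np.clip(virtual_counts, 1, None) / virtual_open)
    # (3) Reconstruct the low-noise structural prior, which initializes
    #     each bin-wise deep iterative reconstruction.
    return sirt_reconstruct(sino, geometry, n_iters)
```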
To address the challenge of data scarcity, we propose to use synthetic data for network training and address the domain gap issue with the following strategies. First, we limit the function of the network to low-level feature denoising, which is less sensitive to domain gaps. Second, a patch-based training strategy is employed to leverage low-level similarity and reduce the domain gap. Furthermore, model-based iterative refinement is used to suppress the remaining errors. These elements are integrated using the alternating direction method of multipliers (ADMM).
The solution space under a data constraint is often high-dimensional for a few-view or low-dose CT reconstruction problem. Mathematically, this formulates an optimization problem as follows: \[\label{eq:opt_problem} \boldsymbol{x}^* = \mathop{\mathrm{arg\,min}}_{\boldsymbol{x}}\frac{1}{2}{\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 + \lambda R(\boldsymbol{x})},\tag{1}\] where \(A\in \mathbb{R}^{M\times N}\) and \(\boldsymbol{y}\in \mathbb{R}^{M}\) are the system matrix and the measurement vector, respectively, \(\boldsymbol{x}\in \mathbb{R}^N\) denotes the image volume to be reconstructed, and \(R(\cdot)\) is a regularization term.
To solve Eq. (1) with a deep prior, an auxiliary variable \(\boldsymbol{z}\) is introduced to decouple the prior term from the loss function as follows: \[\boldsymbol{x}^* = \mathop{\mathrm{arg\,min}}_{\boldsymbol{x}}\frac{1}{2}{\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 + \lambda R(\boldsymbol{z})}, \quad \text{s.t.} \quad \boldsymbol{z}= \boldsymbol{x}. \label{eq:optconstrained}\tag{2}\] The augmented Lagrangian [30] of Eq. (2) is written as \[\mathcal{L}_\mu(\boldsymbol{x},\boldsymbol{z},\boldsymbol{v}) = \frac{1}{2}\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 + \lambda R(\boldsymbol{z}) + \boldsymbol{v}^T(\boldsymbol{x}-\boldsymbol{z}) +\frac{\mu}{2}\left\lVert\boldsymbol{x}-\boldsymbol{z}\right\rVert^2, \label{eq:ADMM_Lag}\tag{3}\] which becomes a saddle point problem and can be solved using the ADMM [31], [32] as follows:
\[\left\{\begin{array}{l} \boldsymbol{x}^{k+1} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{x}} \frac{1}{2}\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 +\frac{\mu}{2}\left\lVert\boldsymbol{x}-\boldsymbol{z}^k + \frac{\boldsymbol{v}^k}{\mu}\right\rVert^2 \\ \boldsymbol{z}^{k+1} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{z}} \lambda R(\boldsymbol{z}) +\frac{\mu}{2}\left\lVert\boldsymbol{x}^{k+1}-\boldsymbol{z}+ \frac{\boldsymbol{v}^k}{\mu}\right\rVert^2 \\ \boldsymbol{v}^{k+1} = \boldsymbol{v}^{k} + \mu (\boldsymbol{x}^{k+1} - \boldsymbol{z}^{k+1}) \\ \end{array} ,\label{eqn:ADMM} \right.\tag{4}\] where \(\mu\) is a hyper-parameter and \(\boldsymbol{v}\) is the augmented Lagrange multiplier.
The optimization of \(\boldsymbol{x}\) can be achieved using the gradient descent method for a number of steps with a step size \(\beta\): \[\begin{align} \nabla \mathcal{L}_\mu(\boldsymbol{x}) & = A^T(A\boldsymbol{x}- \boldsymbol{y}) + \mu(\boldsymbol{x}- \boldsymbol{z}^k + \boldsymbol{v}^k/\mu), \\ \boldsymbol{x}^{k, (t+1)} & = \boldsymbol{x}^{k, (t)} - \beta \nabla \mathcal{L}_\mu(\boldsymbol{x}^{k, (t)}), \end{align}\] where \(t\) represents the step index for the gradient descent search. The optimization of \(\boldsymbol{z}\) is a proximal operation: \[\boldsymbol{z}^{k+1} = \text{prox}_{\frac{\lambda}{\mu}R}(\boldsymbol{x}^{k+1} + \boldsymbol{v}^k/\mu).\label{eq:proximalOP}\tag{5}\] Clearly, a learned denoiser resembles a projection of the noisy input onto a clean image manifold [33], also shown by several recent studies using deep networks as learned proximal operators [34], [35]. Here we use our network to approximate the proximal operation as a deep prior.
Note that the noise characteristics change through iterations, whereas the network denoiser is trained to map a noisy CT reconstruction to its corresponding clean label. To reduce this noise mismatch and accelerate convergence, we initialize \(\boldsymbol{x}\) with the result obtained with the structural prior. As the noise magnitude in the reconstruction gradually decreases through iterations, we regulate the network contribution at later stages. Eq. (5) is reformulated with a network denoiser as follows: \[\boldsymbol{z}^{k+1} = \gamma f_{\theta}(\boldsymbol{x}^{k+1} + \boldsymbol{v}^k/\mu) + (1-\gamma)(\boldsymbol{x}^{k+1} + \boldsymbol{v}^k/\mu),\] where \(\gamma\) controls the network contribution and can be understood as a parameter controlling the amount of noise to be removed.
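The resulting deep iterative refinement (DIR) loop can be summarized in the following sketch, with the hyperparameter values used in our experiments (\(\mu=0.03\), \(\beta=0.5\), \(\gamma=0.8\), 3 gradient descent steps per iteration); `A`, `At` (projector and back-projector), and `denoiser` (the trained network applied patch-wise) are assumed callables rather than the actual implementation:

```python
import torch

# Hedged sketch of deep iterative refinement (DIR) following Eq. (4):
# x-update by gradient descent, z-update by the relaxed learned proximal
# step, and the multiplier update. A/At/denoiser are assumed callables.
def dir_reconstruct(x, y, A, At, denoiser, n_iters=30, n_gd=3,
                    mu=0.03, beta=0.5, gamma=0.8):
    z = x.clone()
    v = torch.zeros_like(x)
    for _ in range(n_iters):
        # x-update: a few gradient descent steps on the data fidelity
        # plus the quadratic coupling to z (first line of Eq. (4)).
        for _ in range(n_gd):
            grad = At(A(x) - y) + mu * (x - z + v / mu)
            x = x - beta * grad
        # z-update: relaxed denoising per the reformulated Eq. (5), with
        # gamma regulating the network contribution as noise decreases.
        u = x + v / mu
        z = gamma * denoiser(u) + (1.0 - gamma) * u
        # Multiplier update.
        v = v + mu * (x - z)
    return x
```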
Despite the great successes of deep 2D CT reconstruction methods, directly applying them to clinical HR PCCT volumetric reconstruction is infeasible on conventional GPUs. The GPU memory cost becomes huge for image volume and sinogram storage as well as for the corresponding forward/backward projection operations, invalidating direct methods like AUTOMAP [36] and other unrolling methods [14], [37]. Rather than training a network targeting the whole volume, we train a network that learns a patch-based representation to overcome the memory limit for volumetric reconstruction. The architecture of our proposed network, VSR-Net, is illustrated in Fig. 2. It is a light-weight 3D network that combines U-Net [38] and ResNet [39] with 3D grouped convolutions [40] and specialized 3D pixel shuffle operations to promote speed and performance.
In contrast to many generative networks that use large receptive fields for realistic high-level feature synthesis, we force the network to concentrate on low-level features by choosing a small kernel size of \(3\times 3 \times 3\) for all convolution layers, leveraging low-level structural similarities and gaining more tolerance to domain shift. Instead of the widely used 2D convolutions, we use 3D grouped convolutions in all convolution layers to fully utilize the information from neighboring voxels. The grouped convolution facilitates training by encouraging structural sparsity, promoting differentiation between feature maps, and constructing a 3D network with fewer trainable parameters. Different from conventional strided downscale/upscale convolutions, we use 3D pixel unshuffle/shuffle operations as shown in Fig. 2 (b). The unshuffle operation splits the input volume into 8 sub-volumes and concatenates them in the channel dimension, while the shuffle operation assembles the sub-volumes into a super volume. Following [41], we omit batch normalization throughout the network, and the first and last convolution layers involve no activation. For scaling invariance and generalizability, no bias is used in any layer, as suggested in [35]. The cube size for network training can be adjusted according to the available GPU memory. We set the cube size to 32 in our experiments, and the network can be easily deployed on a standard 1080 Ti GPU with 11 GB of memory.
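For illustration, the 3D pixel unshuffle/shuffle pair can be realized with reshape and permute operations, as in the sketch below (a 3D analogue of PyTorch's 2D `PixelUnshuffle`/`PixelShuffle`; the exact implementation in our network may differ):

```python
import torch

# Sketch of the 3D pixel unshuffle/shuffle operations in Fig. 2 (b).
def pixel_unshuffle_3d(x, r=2):
    """(N, C, D, H, W) -> (N, C*r^3, D/r, H/r, W/r): splits the volume
    into r^3 sub-volumes concatenated along the channel dimension."""
    n, c, d, h, w = x.shape
    x = x.view(n, c, d // r, r, h // r, r, w // r, r)
    x = x.permute(0, 1, 3, 5, 7, 2, 4, 6).contiguous()
    return x.view(n, c * r**3, d // r, h // r, w // r)

def pixel_shuffle_3d(x, r=2):
    """Inverse: (N, C*r^3, D, H, W) -> (N, C, D*r, H*r, W*r), assembling
    the sub-volumes back into a super volume."""
    n, c, d, h, w = x.shape
    x = x.view(n, c // r**3, r, r, r, d, h, w)
    x = x.permute(0, 1, 5, 2, 6, 3, 7, 4).contiguous()
    return x.view(n, c // r**3, d * r, h * r, w * r)
```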
Since it is rather challenging to obtain the ground truth for HR PCCT scans of patients, we use synthesized data for network training. Specifically, we construct our training dataset from the open dataset of the Low-dose CT Grand Challenge [42]. We first resize the images to isotropic \(1\,mm\) voxels and convert the voxel values from Hounsfield units to linear attenuation coefficients. Then, we treat the volumes as digital phantoms of \(0.2^3\,mm^3\) voxel size and generate noise-free projections in the MARS CT scanner geometry. Finally, projections with quantum noise are simulated assuming 16,000 incident photons per detector element. The Simultaneous Iterative Reconstruction Technique (SIRT) is used to reconstruct images.
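A minimal sketch of this quantum noise simulation is shown below, assuming a monochromatic Beer-Lambert model with the stated photon budget of 16,000 incident photons per detector element; `forward_project` is a hypothetical projector in the MARS geometry:

```python
import numpy as np

# Hedged sketch of the training-data noise simulation. `forward_project`
# is a hypothetical projector; `phantom` holds linear attenuation values.
def simulate_noisy_sinogram(phantom, geometry, n0=16000, rng=None):
    rng = rng or np.random.default_rng()
    # Noise-free line integrals through the digital phantom.
    sino = forward_project(phantom, geometry)
    # Beer-Lambert: expected transmitted counts per detector element.
    expected = n0 * np.exp(-sino)
    # Quantum (Poisson) noise on the transmitted counts.
    noisy_counts = rng.poisson(expected)
    # Back to noisy line integrals for SIRT reconstruction.
    return -np.log(np.clip(noisy_counts, 1, None) / n0)
```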
The isotropic volumes and corresponding noisy reconstructions serve as the labels and noisy inputs for network training. Ten patient volumes are partitioned into cubes of size \(32^3\) with a stride of 25 pixels along each direction. Then, the cubes are sieved to remove empty ones based on the standard deviation of voxel values. As a result, over 190,000 pairs of 3D patches are generated for training and around 38,000 pairs for validation. The loss function consists of an \(L_1\) norm for the value difference and a mean square error for the relative value difference: \[\sum_i \left[\left\lVert\boldsymbol{y}_i - f_{VSR}(\boldsymbol{x}_i;\theta)\right\rVert_1 +\beta_0 \left\lVert\frac{\boldsymbol{y}_i - f_{VSR}(\boldsymbol{x}_i;\theta)}{\boldsymbol{y}_i + c}\right\rVert^2_2\right],\] where \(\boldsymbol{y}_i\) and \(\boldsymbol{x}_i\) are the label patches and noisy inputs respectively, \(f_{VSR}(\boldsymbol{x}_i;\theta)\) denotes the network output with trainable parameters \(\theta\), and \(c\) is a constant to avoid a zero denominator. The \(L_1\) norm, instead of the \(L_2\) norm, is used in the first term to avoid blurring details, and the relative error is measured with the \(L_2\) norm in the second term to preserve tiny structures based on our experience [43]. During training, we set the balancing hyperparameter \(\beta_0\) to 1 and \(c\) to 0.1.
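In PyTorch, this loss can be sketched as follows (averaged per voxel rather than summed over patches, which only rescales the objective):

```python
import torch

# Sketch of the VSR-Net training loss: L1 on the absolute error plus an
# L2 penalty on the relative error, with beta0 = 1 and c = 0.1 as stated.
def vsr_loss(pred, label, beta0=1.0, c=0.1):
    l1 = torch.abs(label - pred).mean()
    rel_l2 = (((label - pred) / (label + c)) ** 2).mean()
    return l1 + beta0 * rel_l2
```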
During inference, a reconstruction volume is partitioned into overlapping patches and fed into the VSR-Net. Geometric self-ensemble based on flipping and rotation is used to boost performance and suppress checkerboard artifacts. To save computational time, we randomly apply 1 of 8 transforms to the reconstruction volume in each iteration, a modification of the periodic geometric self-ensemble idea [35]. For further acceleration, parallel processing techniques are used to distribute the workload across multiple GPUs.
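One plausible realization of the randomized self-ensemble is sketched below, assuming the 8 transforms are the \(2^3\) axis-flip combinations (the paper's exact set combines flips and rotations):

```python
import random
import torch

# Hedged sketch of randomized geometric self-ensemble: one of 8 geometric
# transforms is applied before denoising and inverted afterwards, so the
# ensemble effect is averaged over DIR iterations instead of per call.
def random_self_ensemble(volume, denoiser):
    # Choosing each axis flip independently picks 1 of 2^3 = 8 transforms.
    flips = [d for d in (0, 1, 2) if random.random() < 0.5]
    t = torch.flip(volume, dims=flips) if flips else volume
    out = denoiser(t)
    # Flipping is self-inverse, so the same flip undoes the transform.
    return torch.flip(out, dims=flips) if flips else out
```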
To match the image texture with that of the MARS commercial reconstruction, with which radiologists are already familiar, we adopt a two-step refinement process. First, we use a 2D convolutional network to exploit the inter-channel correlation for texture enhancement and value alignment. Multi-channel images extracted from the channel reconstructions of the same slice are fed to the network, in a slice-by-slice manner for memory efficiency. Then, we process the reconstruction through a few SIRT iterations to enhance image sharpness and alter the noise characteristics. To minimize potential perception bias, we further balance the sharpened but noisier result by mixing it with the original network-processed result at a ratio preferred by radiologists. Specifically, we ran 30 additional SIRT iterations for our reconstruction, and the mixing weights were set to 0.75 and 0.25 for the SIRT reconstruction and the network result, respectively. This ratio was determined by presenting a set of images with various mixing ratios to a radiologist and asking them to choose the one with the best perceptual image quality.
We design a value and texture alignment network, termed the residual Fourier channel attention network (RFCAN), by modifying the residual channel attention network [44] with a Fourier channel attention mechanism [45] to learn a multi-channel mapping, as illustrated in Fig. 3. For training the RFCAN, we use the full-view MARS reconstruction slices as labels for our half-view reconstruction results. However, due to patient motion, there can be occasional non-rigid misalignments across a few slices between our reconstruction and the MARS results, resulting from differences in volume and projection partitioning schemes and in the number of projections. Hence, we select the one patient scan least affected by motion as our training data and sieve out the misalignment-affected slices, resulting in 584 pairs of HR multi-channel images (\(1,200\times1,200\)). A total of 206,000 pairs of overlapping patches of size \(128\times128\) are randomly extracted for network training, and around 52,000 pairs for validation. An additional penalty on the Fourier spectrum (insensitive to misalignment) is introduced in the loss function to emphasize texture similarity besides the intensity fidelity imposed by the other terms: \[\begin{align} \sum_i \Bigl[\left\lVert\boldsymbol{y}_i - f(\boldsymbol{x}_i;\theta)\right\rVert_1 +\beta_1 \left\lVert\frac{\boldsymbol{y}_i - f(\boldsymbol{x}_i;\theta)}{\boldsymbol{y}_i + c}\right\rVert^2_2 \nonumber \\ + \beta_2\left\lVert\left| \mathcal{FFT}(\boldsymbol{y}_i)\right| - \left| \mathcal{FFT}(f(\boldsymbol{x}_i;\theta))\right|\right\rVert_1 \Bigr], \end{align}\] where \(\mathcal{FFT}(\cdot)\) denotes the Fourier transform, and \(\boldsymbol{y}_i\), \(\boldsymbol{x}_i\), and \(f(\boldsymbol{x}_i;\theta)\) are the label, input, and network output, respectively. The balancing hyperparameters \(\beta_1\) and \(\beta_2\) are both set to 1 with \(c\) set to 0.1 during training.
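This loss can be sketched in PyTorch as below; the magnitude spectrum discards phase, so small (circular) shifts between the half-view input and the full-view label barely affect the Fourier term (per-element means replace the sums, which only rescales the objective):

```python
import torch

# Sketch of the RFCAN loss: L1 + relative L2 intensity terms plus an L1
# penalty on the Fourier magnitude spectra (insensitive to misalignment).
def rfcan_loss(pred, label, beta1=1.0, beta2=1.0, c=0.1):
    l1 = torch.abs(label - pred).mean()
    rel_l2 = (((label - pred) / (label + c)) ** 2).mean()
    # fft2 acts on the last two (spatial) dims of (N, C, H, W) tensors.
    spec_diff = torch.abs(torch.fft.fft2(label)) - torch.abs(torch.fft.fft2(pred))
    return l1 + beta1 * rel_l2 + beta2 * torch.abs(spec_diff).mean()
```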
The projection data from a patient scan can be huge, e.g., \(1,536\text{~columns}\times128\text{~rows}\times3,392 \text{~views}\times5\text{~channels}\), overwhelming the GPU memory during direct reconstruction. We therefore use an interleaved updating technique to divide the large-volume reconstruction task into a batch of smaller mini-jobs by partitioning the projection data and reconstruction volume into segments, as illustrated in Fig. 4. The volume and the associated projection data are partitioned into \(N\) segments, with each volume segment and its corresponding projection data segment geometrically aligned. To ensure data completeness, sub-volumes at the seams are also extracted with their corresponding projection data (about 1.5 to 2 rotations of the helical scan). The volume segments and seams form \(2N-1\) mini-reconstruction tasks, which can be assigned to multiple GPUs for parallel computing or processed sequentially on a single GPU. The resulting sub-volumes are combined in an interleaved pattern, with a few slices at one or both ends trimmed off to ensure data completeness of the remaining volume, forming a complete large-volume reconstruction update as shown in Fig. 4 (b).
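The bookkeeping behind the \(2N-1\) mini-jobs can be sketched as follows; the mapping from slice ranges to helical view ranges is assumed to come from a geometry lookup not shown here:

```python
import numpy as np

# Hedged sketch of interleaved partitioning: N volume segments plus N-1
# seam sub-volumes straddling the boundaries give 2N-1 mini-jobs.
def partition_jobs(n_slices, n_segments, seam_half_width):
    bounds = np.linspace(0, n_slices, n_segments + 1, dtype=int)
    # Main volume segments as (start_slice, end_slice) pairs.
    jobs = [(int(bounds[i]), int(bounds[i + 1])) for i in range(n_segments)]
    # Seam jobs straddle each internal boundary to guarantee data
    # completeness before trimming and interleaved recombination.
    jobs += [(int(b) - seam_half_width, int(b) + seam_half_width)
             for b in bounds[1:-1]]
    return jobs
```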
Training Details. Our VSR-Net and RFCAN are implemented in PyTorch and trained with the Adam optimizer on a single NVIDIA V100 GPU. The learning rate for VSR-Net is initially \(2\times10^{-4}\) and decayed by 0.95 every epoch, with a total of 60 epochs and a batch size of 32. The learning rate for RFCAN is initially \(1\times10^{-4}\) and decayed by 0.6 every epoch, with a total of 10 epochs and a batch size of 32. The VSR-Net is trained on the synthetic dataset described in Sub-sec. 2.3.3, while the RFCAN is trained on real patient data as described in Sub-sec. 2.4.1.
Reconstruction Details. We use the ASTRA Toolbox [46] for GPU-based forward and backward projection operations. The patient data are reconstructed on a cluster node using four NVIDIA V100 GPUs for parallel computation (parallel sub-volume reconstruction and patch processing), and other data are reconstructed on a server with a single RTX A5000 GPU.
Experimental Setup. First, we demonstrate the in-domain capability of our DIR method on synthetic single-channel CT data. The testing volume is generated from the AAPM dataset following a similar simulation protocol but from different patients. Then, we demonstrate the enhanced generalization of our DIR on out-of-domain data compared to conventional post-processing with VSR-Net, using phantom data scanned on a micro-PCCT system. Finally, we validate our whole PCCT reconstruction workflow (DIR followed by textural appearance tuning) on new real patient data acquired on the MARS Extremity scanner. Ratings from radiologists on diagnostic value are used to assess the effectiveness of our method.
To demonstrate generalizability, we further test our DIR method on phantom data with totally different structures, as shown in Fig. 6. The single-channel helical scan data are acquired on our custom-built micro-CT system equipped with a PCD (ADVACAM WidePIX1x5, Prague, Czech Republic) at \(80\,kVp\). We collect data at 6 dose levels by adjusting the exposure time per projection (0.15, 0.5, 1.0, 1.5, 2.0, and 5.0 seconds). Volumes of \(979\times979\times610\) isotropic voxels (\(35^3\,\mu m^3\)) are reconstructed using 250 SIRT iterations, serving as the noisy inputs (0.15- to 2.0-second exposures) and the clean reference (5.0-second exposure). Post-processing with BM3D [12] and with VSR-Net (the same network used in our DIR) serves as the baselines for our DIR method. The standard deviation parameter for BM3D is determined by measuring the standard deviation of values in a water region after normalization by its mean. For the DIR method, we use the half-finished reconstruction as the structural prior (60 and 40 SIRT iterations at the scales of 0.637 and 1, respectively) and refine it with 36 DIR iterations (3 gradient descent steps per iteration, \(\mu=0.03\), \(\beta=0.5\), \(\gamma=0.8\)).
Figures 6 (a) to (d) compare the results with 0.5- and 0.15-second exposures from axial and sagittal views against the reference, respectively. Figure 6 (e) displays the frequency modulation curves of axial slices from Fig. 6 (a) relative to the reference. Figure 7 (a) illustrates a zoomed-in view of a surgical tape under various exposure times and reconstruction methods. For each combination, we computed the SSIM and PSNR of every slice (from either the axial or sagittal view) of the volume against the reference for quantitative comparison. The resulting SSIM and PSNR distributions are presented in Fig. 7 (b) as violin plots with boxplot overlays, where the thick gray center line represents the interquartile range and the large white dot indicates the median. Our method consistently demonstrates improved image quality across different acquisition conditions in both qualitative and quantitative measures, suggesting good generalizability to out-of-domain structures. In contrast, the adverse effects of domain shift are clearly visible in the VSR-Net results, which show artifacts and a different appearance and intensity from the reference despite enhanced structure visibility, e.g., the structural errors revealed in Figs. 6 (c) and (d), and the tape structure and dots indicated by the red and green arrows in Fig. 7 (a). The BM3D method is structurally agnostic and intrinsically offers better generalization than deep learning methods, as reflected by the high PSNR and SSIM scores in Fig. 7 (b). However, it suffers from resolution loss, as observed in the difference image in Fig. 6 (c) and the significantly dampened spectrum at middle and high frequencies in Fig. 6 (e), compromising subtle details. This suggests the importance of task-relevant metrics and underlines the need for radiologists’ evaluation in the medical imaging field.
Patients aged 21 years and above referred from the fracture clinic are recruited for the clinical trial (ethics approval: 18/STH/221/AM01, Health and Disability Ethics Committee, New Zealand). The spectral image volumes of the patient's wrist are acquired using the MARS Extremity PCCT scanner with the scanning settings shown in Table 1.
| Parameters | MARS Extremity 5X120 Settings |
|---|---|
| Tube Setting | \(118\,kVp\), \(28\,\mu A\) |
| Exposure | \(160\,ms\) per projection, helical scan |
| CTDI\(_{\text{vol}}\) | \(7.87\,mGy\) |
| Energy Bins | \(7{-}40\), \(40{-}50\), \(50{-}60\), \(60{-}70\), \(70{-}118\,keV\) |
| Recon. Voxel | Isotropic \(90\times90\times90\,\mu m^3\) |
| Recon. Method | Customized polychromatic iterative method [47] |
The images of 8 patients who provided written consent are reconstructed with the commercial algorithm from the full-view dataset and with our proposed deep learning method (illustrated in Fig. 1) from the half-view dataset, and then evaluated by three independent radiologists (SG, AB, AL) in a double-blinded manner using the rating scale defined in Table 2, regarding whether diagnostic image quality is achieved [48]. In our method, DIR is applied to the structural prior (Sub-sec. 2.2, obtained with 80 and 80 SIRT iterations at the scales of 0.5 and 1, respectively) for 30 iterations (3 gradient descent steps per iteration, \(\mu=0.03\), \(\beta=0.5\), \(\gamma=0.8\)) for the reconstruction from the data in each bin. The combined multi-channel volume (\(1,200^3\times5\)) is then processed with RFCAN in a slice-by-slice manner for value alignment and texture enhancement (Sub-sec. 2.4), with the number of SIRT iterations set to 30 and the mixing ratio to \(0.75\colon0.25\) to accommodate radiologists’ preference for image sharpness and noise characteristics.
| Grade | Description |
|---|---|
| \(-2\) | Confident that the diagnostic criteria are not fulfilled |
| \(-1\) | Somewhat confident that the criteria are not fulfilled |
| \(0\) | Indecisive whether the criteria are fulfilled or not |
| \(+1\) | Somewhat confident that the criteria are fulfilled |
| \(+2\) | Confident that the criteria are fulfilled |
The radiologists are randomly presented with 500 images from each patient (three energy bins: \(7{-}40\,keV\), \(50{-}60\,keV\), and \(70{-}118\,keV\)) in the axial, coronal, and sagittal formats. The sagittally reformatted images reconstructed using both methods, along with 3D rendered images, are shown in Fig. 8 (a). The image metrics are based on the “European guidelines on quality criteria for CT” for bones and joints [49], including the visibility and sharpness of the cortical and trabecular bone, the adequacy of soft tissue contrast for the visualization of tendons, muscles, and ligaments, as well as image noise (quantum noise) and artifacts.
Additionally, we compare our results with the state-of-the-art results obtained by applying the self-supervised denoising method Noise2Sim [27] to the multi-channel reconstructions after 320 SIRT iterations. Despite significant enhancement over the SIRT reconstruction from the half-view dataset, the Noise2Sim results demonstrate insufficient image quality (image blur and loss of fine structures), as shown in Fig. 8 (b). Hence, they are excluded from the reader study.
For quantitative assessment, regions of interest (ROIs), each with \(\sim250\) voxels, are drawn in the flexor carpi radialis tendon and adjacent subcutaneous fat regions in the patient images. The means and standard deviations of the linear attenuation coefficients in the ROIs are used to calculate the signal-to-noise ratio (SNR) in soft tissue and the contrast-to-noise ratio (CNR) between soft tissue and fat. SNR and CNR values of both methods are compared over all patients’ datasets. For subjective evaluation, the radiologists’ overall assessment grades for seven image quality measures from both methods are summarized in a frequency table. Then, all three radiologists’ overall and combined ratings are compared with descriptive statistics. The hypothesis of no significant difference between the two methods is tested with the Wilcoxon signed-rank test.
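For clarity, a sketch of these metrics and the paired test is given below; the exact SNR/CNR formulas are our assumptions (SNR as ROI mean over standard deviation, CNR as the attenuation contrast over the fat-region noise), as the text does not spell them out:

```python
import numpy as np
from scipy import stats

# Hedged sketch of the ROI-based metrics; tendon_roi and fat_roi are 1D
# arrays of linear attenuation coefficients (~250 voxels each).
def snr_cnr(tendon_roi, fat_roi):
    snr = tendon_roi.mean() / tendon_roi.std()  # assumed definition
    cnr = abs(tendon_roi.mean() - fat_roi.mean()) / fat_roi.std()  # assumed
    return snr, cnr

# Paired comparison of rating vectors from the two methods (Table 4).
def wilcoxon_test(ratings_full, ratings_half):
    return stats.wilcoxon(ratings_full, ratings_half)
```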
Image grades from both methods for each image quality metric are also converted into visual grading characteristics (VGC) points as described in [48]. With the current 5-grade rating scale, 4 VGC points are obtained, and 0 as the origin and 1 as the maximum are added as well [48]. The combined VGC points (using grades from all three radiologists) of the seven image quality measures are calculated, and the empirical area under the curve (AUC\(_{VGC}\)) is compared. The statistical significance of the mean AUC\(_{VGC}\) over the seven image quality measures is analyzed with a one-sample t-test against a hypothetical AUC\(_{VGC}\) value of 0.5 (corresponding to equal image quality for the two imaging methods in comparison). VGC points are also obtained by combining all seven image quality measures for both methods. Full-view vs. half-view empirical AUC\(_{VGC}\) values and their 95% confidence intervals are obtained for each radiologist and for the combined scores. The statistical significance of AUC\(_{VGC}\) and its 95% confidence interval for each radiologist (56 samples) and all radiologists (168 samples) are interpreted according to the method described in [51]. The statistical analysis is performed using GraphPad Prism 9.2.0 at a 95% confidence level. Finally, the inter-rater agreements between radiologists are evaluated with quadratically weighted kappa statistics [52].
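One plausible construction of the VGC points, the empirical AUC\(_{VGC}\), and the weighted kappa is sketched below following [48] and [52]; the exact procedures in those references may differ in detail:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hedged sketch of the VGC analysis: for each rating threshold g, the
# point (P(ref >= g), P(new >= g)) is a VGC point; (0,0) and (1,1) are
# appended and the empirical AUC_VGC is the trapezoidal area.
def auc_vgc(ratings_ref, ratings_new, scale=(-2, -1, 0, 1, 2)):
    ref, new = np.asarray(ratings_ref), np.asarray(ratings_new)
    pts = sorted(((ref >= g).mean(), (new >= g).mean()) for g in scale[1:])
    x, y = zip(*pts)
    x = np.concatenate(([0.0], x, [1.0]))
    y = np.concatenate(([0.0], y, [1.0]))
    return np.trapz(y, x)

# Inter-rater agreement with quadratically weighted kappa (Table 6).
def weighted_kappa(ratings_a, ratings_b):
    return cohen_kappa_score(ratings_a, ratings_b, weights="quadratic")
```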
Figure 8 (a) displays a 3D rendering of a patient’s wrist with a scaphoid fracture, along with oblique slices from our half-view reconstruction and the standard full-view reconstruction, highlighting the lesion. The fracture is clearly presented in both images, which show comparable contrast and noise textures. Note that the noticeable structural differences are due to slight mismatches in slice location or viewing angle introduced by manual alignment. A color visualization of the three-channel spectral images via linear blending [50] is shown in Fig. 8 (b). The consistent color tone and brightness compared to the standard result demonstrate the high fidelity of our result in spectral values. Fine structural details are also well preserved, as indicated by the red arrow. In contrast, the Noise2Sim result exhibits structure loss, while the conventional SIRT result suffers from significant noise and contrast issues. To further evaluate the noise characteristics, Fig. 8 (c) presents zoomed-in noise textures from a flat tissue region at different stages of our reconstruction, alongside the standard reference and the corresponding Fourier spectra. The corresponding NPS curves are plotted in Fig. 8 (d). As revealed in the figures, the DIR result contains major low-frequency noise components. These are progressively replaced by higher-frequency components through RFCAN post-processing, with the final reconstruction further enhancing high frequencies and closely resembling the texture of the standard reference, despite a lower overall noise amplitude.
SNR in soft tissue regions and CNR between soft tissue and fat are compared in Fig. 9. The bar charts illustrate that, for all patients, the SNR and CNR in the images obtained with the proposed half-view reconstruction method are higher than those in the clinical benchmark images reconstructed with the standard commercial method from the full dataset, except for the second patient, whose images show comparable CNRs. The distributions of image noise across patients shown in Fig. 10 further indicate that the proposed method consistently yields significantly lower noise levels than the standard method, despite using half the radiation dose.
| Methods | Raters | Median \(\uparrow\) | IQR | Mean \(\uparrow\) (Std.) |
|---|---|---|---|---|
| Full | RD1 | 1 | 0 | 0.875 (0.740) |
| Full | RD2 | 1 | 2 | 1.107 (0.966) |
| Full | RD3 | \(-\)1 | 3 | \(-\)0.589 (1.247) |
| Full | COM | 1 | 1 | 0.464 (1.252) |
| Half | RD1 | 1 | 2 | 1.054 (0.862) |
| Half | RD2 | 2 | 2 | 1.179 (1.011) |
| Half | RD3 | 0 | 3 | \(-\)0.357 (1.354) |
| Half | COM | 1 | 2 | 0.625 (1.293) |
| Overall | RD1 | 1 | 1.5 | 0.964 (0.804) |
| Overall | RD2 | 1 | 2 | 1.143 (0.985) |
| Overall | RD3 | \(-\)1 | 3 | \(-\)0.473 (1.301) |
| Overall | COM | 1 | 2 | 0.545 (1.273) |
| Raters | # of Pairs | # of Ties | \(p\)-Value |
|---|---|---|---|
| RD1 | 56 | 22 | \(0.2734\) |
| RD2 | 56 | 44 | \(0.3877\) |
| RD3 | 56 | 38 | \(0.0355\) |
| COM | 168 | 104 | \(0.0166\) |
More importantly, the overall confidence ratings of diagnostic image quality with the seven criteria are compared in Table 3. The table shows better mean and median image quality scores with the proposed half-view reconstruction method than with the current clinical reconstruction from the full-view dataset for all radiologists, despite their different scoring tendencies, indicating a preference for the proposed reconstructions. The median value for the half-view reconstructions is 2 for the second radiologist, suggesting higher confidence in image interpretation. Despite the discrepancy in ratings, the combined median value is clearly positive, indicating the favorable acceptability of our reconstructions. The hypothesis is further tested with the Wilcoxon signed-rank test for all three raters and the combined ratings in Table 4. The p-values are not statistically significant for the 1st and 2nd radiologists, suggesting no difference in image quality between the two methods. However, the p-values for the 3rd radiologist and the combined results are statistically significant, indicating that the image quality of the proposed method is perceived as significantly better than that of the standard commercial reconstruction from the full-view dataset.
The proposed method also performs better when the image quality measures are individually evaluated. The mean area under the visual grading characteristics curve (AUC\(_{VGC}\)) of the proposed method is consistently higher than 0.5 for five of the image quality measures evaluated, as shown in Fig. 11 (b). Similar trends are reflected in the violin plots in Fig. 11 (a), as indicated by better median scores and narrower tails at the low end (fewer low scores). The mean AUC\(_{VGC}\) of the standard method was only slightly better than that of the proposed method in soft tissue contrast differentiation, mainly related to the depiction of ligaments, tendons, and muscles. The VGC points obtained from the overall image quality scores of the two competing methods are plotted in Fig. 11 (c), which shows that the mean AUC\(_{VGC}\) of the proposed method is better than 0.5 for all radiologists and the combined ratings. However, to show that the AUC\(_{VGC}\) is significantly better than 0.5 at the \(95\%\) confidence level, more data would be needed. The statistical significance of the mean AUC\(_{VGC}\) over the seven image quality measures is established with the one-sample t-test in Table 5. As the clinical trial proceeds, more datasets may further strengthen the statistical significance of this comparative study.
Although all raters preferred the proposed half-view reconstructions, they provided different ratings for the same images, resulting in a lower inter-rater agreement. The agreements between raters are evaluated with weighted kappa statistics in Table 6. The table shows slight to fair agreement between the radiologists’ scores with high statistical significance. The kappa value is highest for radiologists 2 and 3, indicating a higher degree of agreement between these two radiologists.
| # of samples | Mean (Std.) | 95% CI of Mean \(-\) 0.5 | \(t\) | \(p\)-Value |
|---|---|---|---|---|
| 7 (df=6) | 0.5479 (0.0503) | [0.0014, 0.0944] | 2.52 | 0.0454 |
| Categories | Weighted Kappa | \(p\)-Value |
|---|---|---|
| RD1-RD2 | 0.247 | \(0.0042\) |
| RD1-RD3 | 0.171 | \(<0.0001\) |
| RD2-RD3 | 0.339 | \(<0.0001\) |
This study targets deep learning-based HR PCCT volumetric reconstruction given insufficient training data. Direct volumetric reconstruction is necessary and advantageous since rebinning to fan-beam geometry could compromise image quality [53], especially due to large gaps and bad pixels in PCDs and free-form scanning with robotic arms [54], [55]. However, volumetric PCCT reconstruction poses GPU memory and computational challenges, which we have tackled with interleaved updating, patch-based refinement, and low-noise prior sharing. Low-level structural similarity has been leveraged in combination with model-based iterative refinement to address the domain gap effectively. Additionally, the textural appearance has been fine-tuned to align with the standard reconstruction in the application domain. On the other hand, the patch-based representation mitigates the GPU memory limitation at the cost of extra computation during inference. For example, the reconstruction volume is partitioned into overlapping patches to minimize checkerboard artifacts; compared to directly processing the entire volume, the overhead comes from computing the overlapping portions and performing the partitioning, assembly, and geometric self-ensemble steps. Fortunately, different patch sizes may be used for training and inference by leveraging the shift-invariant property of convolutional networks, and bigger patches can be used to reduce the overhead during inference. With parallel computing on four V100 GPUs, the typical computation time to reconstruct a \(94\,mm\) patient wrist scan in five energy bins is about 7 hours using our method, whereas the commercial reconstruction takes around 9 to 10 hours on its dedicated hardware.

In comparison with the current commercial reconstruction from the full-view dataset, the major benefits of our approach are halved radiation dose and doubled imaging speed without compromised image quality. The evaluation methods are classic and double-blinded, and the patient datasets involved were randomly selected, covering a range of pathological and technical conditions. Therefore, our results strongly suggest a great potential of our approach for clinical PCCT image reconstruction. Further improvements are surely possible using more advanced network architectures such as the emerging diffusion/score-matching models [56]. The barriers to adapting the diffusion approach to PCCT include the scarcity of high-quality data, the memory limitation, and the sampling overhead, which are being actively explored.
Interestingly, for the out-of-domain single-channel real phantom data, significant improvement in both image quality and stability has been achieved with our DIR approach over the conventional single-pass post-processing method using the same network. For post-processing methods, the domain gap can lead to inappropriate frequency elevation in network-processed images and cause undesired artifacts. For example, in Fig. 6 (e), the spectrum elevation of the VSR-Net curves even extends beyond the cutoff frequency, which explains the network’s tendency to mistake noise for features and produce high-frequency artifacts. Through iterative feedback and correction, DIR significantly reduces these errors despite its elevation pattern being similar to the VSR-Net curves. One limitation is the low-frequency bone value shift observed in the extreme case (0.15-second exposure against the 5-second reference, i.e., \(3\%\) dose), as shown in Fig. 6 (d), which is also reflected by a small dent in the DIR frequency modulation profile. This dent causes a low-frequency image contrast change, which explains why the SSIM and PSNR scores of DIR are lower than those of BM3D. Fortunately, no structure or resolution loss is noticed, and this issue can be easily remedied with the texture appearance tuning network RFCAN. We further examined the resolution with point spread function (PSF) measurements using a \(10\,\mu m\)-diameter tungsten wire phantom. The measured PSF of the DIR reconstruction is \(81\,\mu m\) in full width at half maximum (FWHM), while the FWHM of the direct SIRT reconstruction reads \(88.8\,\mu m\), as shown in Fig. 12, confirming no resolution loss. Another interesting observation is that the BM3D method scores best in terms of SSIM and PSNR despite the loss of fine details, suggesting the necessity of task-relevant metrics in clinical applications. We also underline that our major aim in this experiment is to demonstrate the improved generalizability of DIR over VSR-Net, rather than to compete with BM3D on these scores. While we already tie with BM3D in the less noisy cases, superior results can be expected by adopting an RFCAN or retraining the VSR-Net to narrow the domain gap.
More remarkably, in the retrospective patient studies, our spectral reconstruction quality has surpassed that of the state-of-the-art unsupervised learning method Noise2Sim. In our reader study, the analysis of the combined grading results from all three radiologists has demonstrated that our proposed method is better than the standard commercial reconstruction. Encouragingly, the median values of all image quality scores are on the positive side, suggesting our reconstructions are diagnostically acceptable and preferred, despite being reconstructed from only half the radiation dose. Furthermore, our method has been evaluated in terms of VGC points from the radiologists’ scores against the existing method, producing better results on almost all individual image quality measures. For the total combined image quality scores, the proposed method has shown a competitive advantage over the existing method, with a mean \(\text{AUC}_{VGC}\) value exceeding 0.5, although additional patient data are needed to establish statistical significance. Notably, the individual scores from all three radiologists, as well as the combined rankings, exhibit consistent trends: all favor the proposed method for a majority of the VGC points, as shown in Fig. 11, but with the VGC fitting curve crossing the diagonal line at the rightmost part. This is a pattern typically associated with a method that receives higher mean scores but also exhibits greater variation in scores than the reference. This interpretation is supported by the summary statistics in Table 3, where the combined ratings for the proposed half-view method show both a higher mean and a larger interquartile range than those of the full-view reference. Finally, slight to fair agreement has been obtained among the radiologists, despite no formal training on visual grading.
Regarding the radiation dose of a patient’s wrist scan, we have performed comparison experiments with a GE Discovery CT750 HD scanner for standard musculoskeletal imaging. The Extremity 5X120 scanner delivers a radiation dose (volumetric CT dose index, CTDI\(_{\text{vol}}\)) of \(7.87\,mGy\) to a \(10\,cm\) polymethyl methacrylate (PMMA) phantom during a routine wrist scan. This radiation dose remains the same whether or not a metal implant is present in the patient’s wrist. The radiation dose of the conventional CT scanner, using wrist protocols derived by the School of Medicine and Public Health, University of Wisconsin-Madison, USA, in conjunction with GE engineers, was also measured using a \(10\,cm\) PMMA phantom. The radiation dose was \(43.02\,mGy\) in CTDI\(_{\text{vol}}\) for a protocol optimized for a wrist without metal implants, and \(197.3\,mGy\) for a protocol optimized for a wrist with a metal implant. These results suggest that the MARS scanner can produce diagnostically useful images at only a small fraction of the radiation dose required by traditional CT scanners. Moreover, our proposed method can further halve the dose for improved safety without sacrificing image quality.
This study also confirms that high-quality images suitable for musculoskeletal diagnosis can be acquired in a reasonable time with a low X-ray dose at the point of care. As demonstrated in Fig. 8 (a), a clear image of a scaphoid fracture was captured with quality comparable to that of full field-of-view CT scans at only a fraction of the radiation dose. Traditional point-of-care musculoskeletal scanners typically rely on cone-beam flat panel technology, whereas the system used in this research is a photon-counting scanner with AI-based reconstruction and a helical scanning geometry. Significantly higher spatial resolution is achieved with the MARS PCDs than with flat panel systems, along with other benefits of photon-counting technology, such as reduced metal artifacts, multi-contrast imaging, and lower radiation dose. The in-plane resolution at \(10\%\) of the modulation transfer function (MTF) peak was measured at \(1.8\,lp/mm\), and the longitudinal resolution at around \(5.0\,lp/mm\) [57]. These values are comparable to or better than those reported for the latest Naeotom Alpha PCCT scanner from Siemens (in-plane \(1.69\,lp/mm\) and axial \(3\,lp/mm\) at \(10\%\) MTF) [58]. Compared to the Siemens whole-body PCCT, the MARS scanner offers unique advantages as a point-of-care solution for musculoskeletal imaging. Full-field PCCT scanners require substantial floor space in a radiation-shielded environment, at least two trained operators, and a high capital cost. Consequently, they are usually installed only in tertiary imaging centers, which limits patient accessibility, delays diagnosis, and increases cost compared to point-of-care scanners. This is especially prominent for musculoskeletal patients, who are often evaluated in outpatient clinics away from the main hospital. In contrast, a MARS point-of-care scanner is more compact, requires fewer operating staff, and is well suited for deployment in outpatient facilities where musculoskeletal assessments commonly take place.
This study is not without limitations. First, the number of patients included is relatively small, since our clinical trial is still in progress. Second, the inter-reader agreement between radiologists is low. This can be attributed to the fact that the radiologists were not trained for inter-rater agreement on image quality evaluation prior to the review study; at that time, no protocol had been established in the context of PCCT, and pre-evaluation training could potentially introduce biases. In addition, the radiologists have different degrees of familiarity with and knowledge of PCCT images. Considering the significantly lower radiation dose (\(5\%\) of an equivalent traditional CT protocol), they could have very different opinions on whether the images permit a satisfactory diagnosis. It is also worth mentioning that motion correction was applied to five patients with noticeable movements using the method described in [54] prior to our reconstruction, which significantly improved the motion artifact assessment shown in Fig. 11 (b).
Finally, we would like to emphasize that this study aims to address the training data scarcity issue for HR volumetric PCCT reconstruction by bridging the gap between synthetic and clinical data through our proposed DIR framework. It is generally preferable to use in-domain clinical data for network training whenever possible, provided they are of sufficient quality and quantity. However, such data are often scarce in practice, which sometimes necessitates the use of synthetic datasets for training. Synthetic data offer two key benefits: (1) they are relatively easy to acquire compared to patient data, and (2) they provide greater flexibility in controlling acquisition conditions, such as generating noise-free labels or producing repeated scans under varying dose levels and motion states. On the other hand, the primary drawbacks of synthetic data are the inevitable unrealistic aspects, e.g., fake structures or features and the simplification or omission of certain physics effects during image formation, which can introduce domain gaps and degrade network performance on real clinical data. Our DIR framework bridges the domain gap through two strategies: (1) narrowing the domain gap by training a structure-agnostic, low-level denoiser (VSR-Net); and (2) decomposing the difficult “long jump” into easier “segmental walks”, with each cornerstone (intermediate result) kept quasi in-domain for the denoiser. Specifically, by designing VSR-Net as a low-level denoiser, the primary domain gap between synthetic and clinical data is reduced to differences in noise levels and image contrast modulation. To address this, by properly selecting the number of SIRT iterations when generating the structural prior, we can closely match the resulting noise magnitude to that of the synthetic data, minimizing the domain gap for VSR-Net at the first step of the DIR pipeline. In subsequent steps, we balance the noise introduced by gradient descent updates against the noise removed by VSR-Net (controlled by the parameter \(\mu\)). When tuned properly, these opposing effects largely cancel out, maintaining a quasi in-domain input for the denoiser (note that denoisers are tolerant of cleaner images). Additionally, image contrast modulation mismatches can be corrected through iterative feedback during the DIR iterations, further ensuring the robustness of the DIR framework.
In conclusion, we have developed a novel deep learning method for few-view HR PCCT volumetric reconstruction in the New Zealand clinical trial at halved radiation dose and doubled imaging speed. Compared to the standard commercial reconstruction method used in the clinical trial, the proposed method produces equivalent or superior image quality at halved radiation dose. We plan to translate the proposed method for few-view image reconstruction into the PCCT system and keep improving the method as the clinical trial proceeds.
A visual comparison example (a slice adjacent to that displayed in Fig. 8 (a)) is shown in Fig. 13 as a complement to Fig. 8. It presents the RFCAN output, the final half-view reconstruction result, and the full-view clinical reference. The bone edge pinpointed by the red arrow appears slightly over-smoothed in the RFCAN result compared to the clinical reference. The final few-view reconstruction result improves the sharpness, as also demonstrated by the intensity profiles. The noise texture improvement in the soft tissue region is also observed, cross-validating the results in Figs. 8 (c) and (d). These improvements confirm the benefit of the SIRT-based second step of our two-step textural appearance tuning procedure, beyond the RFCAN processing.
This work was supported in part by NIH under the grants R01CA237267, R01EB034737, R01CA233888, R01HL151561, and R01EB031102. M. Li, C. Niu, and G. Wang are with the Biomedical Imaging Center, Rensselaer Polytechnic Institute, Troy, New York, USA (e-mail: mzli@ieee.org; niuc@rpi.edu; wangg6@rpi.edu). M. R. Amma, N. d. Ruiter, and P. Butler are affiliated with MARS Bioimaging Limited, Christchurch, New Zealand (e-mail: maya.vivek1@gmail.com; niels.deruiter@marsbioimaging.com; phil.butler@marsbioimaging.com). K. M. Chapagain is with MARS Bioimaging Limited, Christchurch, New Zealand, and also with the University of Otago, New Zealand (e-mail: krish22chapagain@gmail.com). S. Gabrielson is with the Canterbury District Health Board, New Zealand (e-mail: STEFAN.GABRIELSON@cdhb.health.nz). A. Li is with Pacific Radiology, New Zealand (e-mail: andrew.li@pacificradiology.com). K. Jonker is with MARS Bioimaging Limited, Christchurch, New Zealand, and also with the University of Canterbury, New Zealand (e-mail: kevin.jonker@marsbioimaging.com). J. A. Clark is with MARS Bioimaging Limited, Christchurch, New Zealand, and also with the University of Otago, New Zealand (e-mail: jenn.clark@marsbioimaging.com). A. Butler is with MARS Bioimaging Limited, Christchurch, New Zealand, and also with the University of Otago, New Zealand, the University of Canterbury, New Zealand, and the Canterbury District Health Board, New Zealand (e-mail: anthony.butler@marsbioimaging.com). H. Yu is with the Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, Massachusetts, USA (e-mail: Hengyong-yu@ieee.org). The corresponding authors are G. Wang, A. Butler, and H. Yu.