Deep Few-view High-resolution Photon-counting CT at Halved Dose for Extremity Imaging


Abstract

X-ray photon-counting computed tomography (PCCT) for extremity imaging allows multi-energy high-resolution (HR) imaging, but its radiation dose can be further reduced. Despite the great potential of deep learning techniques, their application to HR volumetric PCCT reconstruction has been challenged by the large memory burden, training data scarcity, and domain gap issues. In this paper, we propose a deep learning-based approach for PCCT image reconstruction at halved dose and doubled speed, validated in a New Zealand clinical trial. Specifically, we design a patch-based volumetric refinement network to alleviate the GPU memory limitation, train the network with synthetic data, and use model-based iterative refinement to bridge the gap between synthetic and clinical data. Our results in a reader study of 8 patients from the clinical trial demonstrate a great potential to cut the radiation dose to half that of the clinical PCCT standard without compromising image quality and diagnostic value.

Photon-counting CT, few-view reconstruction, radiation dose reduction, deep learning, clinical trial.

1 Introduction↩︎

Given the potential patient risk from ionizing radiation, research is actively performed to reduce computed tomography (CT) radiation dose [1]. For example, we can optimize scanning parameters for different patients and use automatic exposure control [2]. The recent development of photon-counting CT (PCCT) and deep learning-based reconstruction algorithms offers new promise in this regard [3], [4].

The clinical potential of PCCT has been well demonstrated in atherosclerosis imaging, extremity scanning, and multi-contrast-enhanced studies [5][9]. Since 2019, the PCCT company MARS has been conducting human clinical trials for orthopaedic and cardiovascular applications, and has already expanded the trials into local acute care clinics. The orthopaedic trials have shown that high-resolution PCCT imaging is advantageous in the acute, follow-up, pre-surgical, and post-surgical stages. Efforts are being made to conduct clinical trials in Europe for rheumatology applications.

Despite the huge potential of extremity PCCT, a few challenges must be addressed for further improvements [10]. The current low-cost and light-weight specialty MARS CT scanner uses neither an expensive gantry nor a heavy X-ray source. Its customized gantry offers cost-effectiveness at the price of a slow scanning speed, and its micro-focus X-ray tube improves spatial resolution but, at the same time, limits the photon flux. These mechanical limitations and the low photon flux introduce significant challenges in temporal resolution and noise. First, the MARS PCCT scanner currently takes over 8 minutes to scan a patient, which cannot support dynamic contrast-enhanced studies. Second, channel-wise projections suffer from low signal-to-noise ratios. For instance, with our current protocol, fewer than 1,500 photons per detector element are split into five non-overlapping energy bins, resulting in only hundreds of photons in one channel as opposed to \(\sim 1\times10^5\) photons for conventional CT. This becomes more problematic with narrow energy bins. To mitigate these issues, a natural solution is to reduce the number of projection views per scan and to develop advanced reconstruction techniques.

Few-view and low-dose CT reconstruction is a main area of CT research. In the early stages, compressed sensing was widely used, such as total variation (TV) and dictionary learning techniques [11], [12]. More recently, deep learning methods have delivered exciting reconstructions [13], becoming the new frontier along this direction. However, several gaps remain for high-resolution (HR) PCCT. First, the existing methods are mainly developed for CT image reconstruction in single-channel mode and 2D imaging geometry, and few of them target volumetric reconstruction in multi-channel mode at high resolution due to GPU memory constraints [14][19]. Second, it is well known that network performance can drop significantly if the data seen during inference come from a distribution different from that of the training data [20][22]. This domain gap is more critical for diagnostic image reconstruction, which is less tolerant of artifacts and “hallucinations” (factually incorrect or misleading structures with realistic appearances) than some other fields. Third, high-quality datasets are scarce for network training since HR PCCT is an emerging technology. Although some unsupervised and self-supervised methods work without paired data, they often rely on specific assumptions about noise characteristics in the images, do not consider the inter-channel correlation in spectral images, and demonstrate sub-optimal performance [22][27].

In this paper, we present a deep learning-based approach to address the aforementioned challenges in a few-view mode at a halved dose (using half the number of projections of the original full-dose scan, with “half-view” and “half-dose” used interchangeably hereafter unless explicitly noted) relative to the commercial PCCT technology used in the New Zealand clinical trial. We summarize our primary contributions as follows:

  • We develop a deep learning-based reconstruction pipeline for volumetric HR PCCT reconstruction. The pipeline is memory-efficient on a single workstation, aided by a shared low-noise structural prior for all channels, patch-based deep iterative refinement for channel-wise volume reconstruction, and texture tuning that leverages inter-channel correlations for full-channel slices;

  • We demonstrate the potential in addressing the domain gap issues using our patch-based volumetric denoising combined with a model-based iterative refinement framework. With the proposed network trained on synthetic data, we consistently achieve excellent results on both phantom data and patient scans acquired on different machines with different protocols;

  • Our half-view/half-dose PCCT reconstruction results are favored by radiologists over the proprietary reconstruction from the full-view dataset in terms of diagnostic image quality, suggesting a great potential for clinical translation to address data scarcity.

To the best of our knowledge, this is the first attempt at deep learning-based volumetric reconstruction for multi-channel HR PCCT, e.g., \(1,200^3\times5\). This also represents the first study reporting superior diagnostic quality at half-dose with a deep network trained on synthetic data over full-dose clinical PCCT reconstruction.

2 Methods↩︎

An overview of our approach is shown in Fig. 1, mainly consisting of three parts: structural prior reconstruction, deep iterative refinement, and textural appearance tuning. The details are elaborated in the following subsections.

Figure 1: Deep few-view PCCT workflow. (a) A less noisy structural prior is reconstructed by summing counts from all channels and using a multi-scale iterative reconstruction (MS-IR) technique; (b) for image reconstruction in each channel, the structural prior is iteratively refined using a Volumetric Sparse Representation Network (VSR-Net) and model-based guidance with the projection measurements in an Alternating Direction Method of Multipliers (ADMM) framework; and (c) the multi-channel images are further refined using a Residual Fourier Channel Attention Network (RFCAN) for alignment with the MARS full-dose reconstruction, and followed by further polishing with the Simultaneous Iterative Reconstruction Technique (SIRT) to generate similar image sharpness and noise characteristics that radiologists prefer.

2.1 MARS Extremity PCCT↩︎

The clinical trial was performed on the state-of-the-art MARS Extremity 5X120 scanner, which can simultaneously measure five effective energy bins at spatial resolutions of \(50-200\,\mu m\). The system includes CdTe-Medipix3RX photon-counting detectors (PCDs) with a \(110\,\mu m\) pixel pitch (12 chips arranged in a non-flat arc shape), an X-ray source (up to \(120\,kVp\), \(350\,\mu A\)), and a rotating gantry for helical scanning. It provides an isotropic voxel size of \(90^3\,\mu m^3\). The bore size is \(125\,mm\) with a scanning length of \(35\,cm\) for extremity scans.

2.2 Reconstruction with Structural Prior↩︎

Each element of the MARS PCD counts photons against 5 effective energy thresholds simultaneously, resulting in quasi-monochromatic projections in 5 energy bins: \(7-40\,keV\), \(40-50\,keV\), \(50-60\,keV\), \(60-70\,keV\), and above \(70\,keV\). The patient is scanned with a total incoming photon count of around 1,500 per detector element for the open-beam measurement, resulting in only hundreds of photons in one channel. Given such low counts, direct reconstruction in each energy bin inevitably suffers from major quantum noise. Instead, we employ an idea similar to prior image constrained compressed sensing [28] and its extension to PCCT [29], noticing that the structural information among different energy bins is closely correlated, with only slight differences in attenuation values. Hence, we propose the following steps for spectral reconstruction in 5 energy bins with minimized quantum noise: (1) we sum the counts from all channels to form a virtual ‘integrating’ bin with minimized quantum uncertainty; (2) we reconstruct from the virtual bin to obtain an image as a low-noise structural prior; and (3) leveraging the inter-bin similarity, we initialize our iterative deep reconstruction method with the structural prior and feed in bin-wise real data to reconstruct the spectral image. A multi-scale iterative reconstruction strategy is used to significantly accelerate convergence for the large volume reconstruction. Note that we enforce this structural similarity by using the prior only as an initialization for the best computational efficiency, in contrast to traditional approaches [29], which use the prior as a constraint and solve the optimization problem iteratively. As a cost, this step alone does not guarantee convergence. Hence, we rely on iterative reconstruction constrained by a deep neural network-based prior regularization for valid image reconstruction, as introduced in the following section, while the virtual-bin structural prior initialization serves as an acceleration step.
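As a minimal illustration of steps (1) and (2), the following sketch forms the virtual ‘integrating’ bin from channel-wise count data and reconstructs the shared structural prior; `counts`, `flats`, and `sirt_reconstruct` are hypothetical names, and the SIRT solver itself is left abstract:

```python
import numpy as np

def structural_prior(counts, flats, sirt_reconstruct):
    """counts, flats: per-bin photon counts and open-beam measurements with
    shape (bins, ...); sirt_reconstruct is a placeholder for a (multi-scale)
    SIRT solver operating on log-normalized projections."""
    total_counts = counts.sum(axis=0)   # (1) virtual 'integrating' bin
    total_flats = flats.sum(axis=0)
    # log-normalize the summed counts into line integrals
    sino = -np.log(np.maximum(total_counts, 1) / total_flats)
    return sirt_reconstruct(sino)       # (2) low-noise structural prior
```

The resulting prior then initializes each bin-wise reconstruction in step (3).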

2.3 Deep Iterative Refinement (DIR)↩︎

To address the challenge of data scarcity, we propose to use synthetic data for network training and address the domain gap issue using the following strategies. First, we limit the function of network to low-level feature denoising, which is less sensitive to domain gaps. Second, a patch-based training strategy is employed to leverage low-level similarity and reduce the domain gap. Furthermore, model-based iterative refinement is used to suppress remaining errors. These elements are integrated using the alternating direction method of multipliers (ADMM).

2.3.1 ADMM Optimization↩︎

The solution space under a data constraint is often high-dimensional for a few-view or low-dose CT reconstruction problem. Mathematically, this formulates an optimization problem as follows: \[\label{eq:opt_problem} \boldsymbol{x}^* = \mathop{\mathrm{arg\,min}}_{\boldsymbol{x}}\frac{1}{2}{\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 + \lambda R(\boldsymbol{x})},\tag{1}\] where \(A\in \mathbb{R}^{M\times N}\) and \(\boldsymbol{y}\in \mathbb{R}^{M}\) are a system matrix and a measurement vector respectively, \(\boldsymbol{x}\in \mathbb{R}^N\) denotes an image volume to be reconstructed, and \(R(\cdot)\) is a regularization term.

To solve Eq. (1) with a deep prior, an auxiliary variable \(\boldsymbol{z}\) is introduced to decouple the prior term from the loss function as follows: \[\boldsymbol{x}^* = \mathop{\mathrm{arg\,min}}_{\boldsymbol{x}}\frac{1}{2}{\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 + \lambda R(\boldsymbol{z})}, \quad \text{s.t.} \quad \boldsymbol{z}= \boldsymbol{x}. \label{eq:optconstrained}\tag{2}\] The augmented Lagrangian [30] of Eq. (2) is written as \[\mathcal{L}_\mu(\boldsymbol{x},\boldsymbol{z},\boldsymbol{v}) = \frac{1}{2}\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 + \lambda R(\boldsymbol{z}) + \boldsymbol{v}^T(\boldsymbol{x}-\boldsymbol{z}) +\frac{\mu}{2}\left\lVert\boldsymbol{x}-\boldsymbol{z}\right\rVert^2, \label{eq:ADMM_Lag}\tag{3}\] which becomes a saddle point problem and can be solved using the ADMM [31], [32] as follows:

\[\left\{\begin{array}{l} \boldsymbol{x}^{k+1} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{x}} \frac{1}{2}\left\lVert A\boldsymbol{x}- \boldsymbol{y}\right\rVert^2 +\frac{\mu}{2}\left\lVert\boldsymbol{x}-\boldsymbol{z}^k + \frac{\boldsymbol{v}^k}{\mu}\right\rVert^2 \\ \boldsymbol{z}^{k+1} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{z}} \lambda R(\boldsymbol{z}) +\frac{\mu}{2}\left\lVert\boldsymbol{x}^{k+1}-\boldsymbol{z}+ \frac{\boldsymbol{v}^k}{\mu}\right\rVert^2 \\ \boldsymbol{v}^{k+1} = \boldsymbol{v}^{k} + \mu (\boldsymbol{x}^{k+1} - \boldsymbol{z}^{k+1}) \\ \end{array} ,\label{eqn:ADMM} \right.\tag{4}\] where \(\mu\) is a hyper-parameter and \(\boldsymbol{v}\) is the augmented Lagrange multiplier.

The optimization of \(\boldsymbol{x}\) can be achieved using the gradient descent method for a number of steps with a step size \(\beta\): \[\begin{align} \nabla \mathcal{L}_\mu(\boldsymbol{x}) & = A^T(A\boldsymbol{x}- \boldsymbol{y}) + \mu(\boldsymbol{x}- \boldsymbol{z}^k + \boldsymbol{v}^k/\mu), \\ \boldsymbol{x}^{k, (t+1)} & = \boldsymbol{x}^{k, (t)} - \beta \nabla \mathcal{L}_\mu(\boldsymbol{x}^{k, (t)}), \end{align}\] where \(t\) represents the step index for the gradient descent search. The optimization of \(\boldsymbol{z}\) is a proximal operation: \[\boldsymbol{z}^{k+1} = \text{prox}_{\frac{\lambda}{\mu}R}(\boldsymbol{x}^{k+1} + \boldsymbol{v}^k/\mu).\label{eq:proximalOP}\tag{5}\] Clearly, a learned denoiser resembles a projection of the noisy input onto a clean image manifold [33], also shown by several recent studies using deep networks as learned proximal operators [34], [35]. Here we use our network to approximate the proximal operation as a deep prior.

Note that the noise characteristics can change through iterations, whereas the network denoiser is trained to map a noisy CT reconstruction to its corresponding clean label. To reduce this noise mismatch and accelerate convergence, we initialize \(\boldsymbol{x}\) with the result obtained with the structural prior. As the magnitude of noise in the reconstruction gradually decreases through iterations, we regulate the network contribution at later stages. Eq. (5) is reformulated with a network denoiser as follows: \[\boldsymbol{z}^{k+1} = \gamma f_{\theta}(\boldsymbol{x}^{k+1} + \boldsymbol{v}^k/\mu) + (1-\gamma)(\boldsymbol{x}^{k+1} + \boldsymbol{v}^k/\mu),\] where \(\gamma\) controls the network contribution and can be understood as a parameter controlling the amount of noise to be removed.
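For concreteness, one outer iteration of Eq. (4) with the relaxed denoiser step can be sketched as follows; `forward`/`backward` stand for the projector pair and `denoiser` for the trained VSR-Net, all placeholders:

```python
import torch

def dir_iteration(x, z, v, y, forward, backward, denoiser,
                  mu=0.03, beta=0.5, gamma=0.8, n_gd=3):
    # x-update: a few gradient descent steps on the augmented Lagrangian
    for _ in range(n_gd):
        grad = backward(forward(x) - y) + mu * (x - z + v / mu)
        x = x - beta * grad
    # z-update: relaxed proximal step via the learned denoiser
    u = x + v / mu
    z = gamma * denoiser(u) + (1 - gamma) * u
    # dual update of the Lagrange multiplier
    v = v + mu * (x - z)
    return x, z, v
```

The default hyperparameters match those used in our experiments (\(\mu=0.03\), \(\beta=0.5\), \(\gamma=0.8\), 3 gradient descent steps per iteration).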

2.3.2 Volumetric Sparse Representation Network (VSR-Net)↩︎

Despite the great successes of deep 2D CT reconstruction methods, directly applying them to clinical HR PCCT for volumetric reconstruction is infeasible on conventional GPUs. The GPU memory cost becomes huge for image volume and sinogram storage as well as for the corresponding forward/backward projection operations, invalidating direct methods like AUTOMAP [36] and other unrolling methods [14], [37]. Rather than training a network targeting a whole volume, we train a network that learns a patch-based representation to overcome the memory limit for volumetric reconstruction. The architecture of our proposed network is illustrated in Fig. 2. It is a light-weight 3D network that combines U-Net [38] and ResNet [39] with 3D grouped convolutions [40] and specialized 3D pixel shuffle operations to improve speed and performance.

In contrast to many generative networks that use large receptive fields for realistic high-level feature synthesis, we force the network to concentrate on low-level features by choosing a small kernel size of \(3\times 3 \times 3\) for all convolution layers, leveraging low-level structural similarities and gaining more tolerance to domain shift. Instead of the widely used 2D convolutions, we use 3D grouped convolutions for all convolution layers to fully utilize the information from neighboring voxels. The grouped convolution encourages structural sparsity, promotes differentiation between feature maps, and constructs a 3D network with fewer trainable parameters, which facilitates training. Different from conventional strided downscale/upscale convolutions, we use 3D pixel unshuffle/shuffle operations as shown in Fig. 2 (b); the sketch below gives a minimal implementation. The unshuffle operation splits the input volume into 8 sub-volumes and concatenates them in the channel dimension, while the shuffle operation assembles the sub-volumes into a super volume. Following [41], we omit batch normalization throughout the network, and the first and last convolution layers involve no activation. For scaling invariance and generalizability, no bias is used in any layer, as suggested in [35]. The cube size for network training can be adjusted according to the available GPU memory. We set the cube size to 32 in our experiments, and the network can be easily deployed on a standard GTX 1080 Ti GPU with 11 GB of memory.
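A minimal PyTorch implementation of the 3D pixel unshuffle/shuffle pair, written here for illustration (the network applies these around grouped convolutions as in Fig. 2 (b)):

```python
import torch

def pixel_unshuffle_3d(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """(B, C, D, H, W) -> (B, C*r^3, D/r, H/r, W/r): split the volume into r^3
    sub-volumes and stack them along the channel dimension (lossless)."""
    b, c, d, h, w = x.shape
    x = x.view(b, c, d // r, r, h // r, r, w // r, r)
    x = x.permute(0, 1, 3, 5, 7, 2, 4, 6).contiguous()
    return x.view(b, c * r ** 3, d // r, h // r, w // r)

def pixel_shuffle_3d(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Inverse: reassemble r^3 channel groups into a super volume."""
    b, c, d, h, w = x.shape
    x = x.view(b, c // r ** 3, r, r, r, d, h, w)
    x = x.permute(0, 1, 5, 2, 6, 3, 7, 4).contiguous()
    return x.view(b, c // r ** 3, d * r, h * r, w * r)

# round-trip sanity check on a 32^3 training cube
v = torch.randn(1, 8, 32, 32, 32)
assert torch.equal(pixel_shuffle_3d(pixel_unshuffle_3d(v)), v)
```

Unlike strided convolutions, this rearrangement is exactly invertible, so no spatial information is discarded during downscaling.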

Figure 2: Architecture of our volumetric sparse representation network (VSR-Net). (a) This light-weight network takes a small cubic patch as input and outputs a denoised patch, with 3D pixel shuffle operations and grouped convolutions; and (b) the downscaling and upscaling of the feature maps are achieved through 3D pixel unshuffle and shuffle operations (illustrated in (c)) combined with two 3D grouped convolutional layers. Note that a color-coded number above each convolutional operation denotes the number of groups used, while the number underneath the feature map indicates the number of channels.

2.3.3 VSR-Net Training with Synthetic Data↩︎

Since it is rather challenging to obtain the ground truth for HR PCCT scans of patients, we use synthesized data for network training. Specifically, we construct our training dataset from the open dataset of the Low-dose CT Grand Challenge [42]. We first resize the images to isotropic \(1\,mm\) voxels along each dimension and convert the voxel values from Hounsfield units to linear attenuation coefficients. Then, we treat the volumes as digital phantoms of \(0.2^3\,mm^3\) voxel size and generate noise-free projections in the MARS CT scanner geometry. Finally, projections with quantum noise are simulated assuming 16,000 incident photons per detector element. The Simultaneous Iterative Reconstruction Technique (SIRT) is used to reconstruct images.
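A minimal sketch of this synthesis, assuming a generic projector (`project` is a placeholder returning unit-spaced line integrals) and the stated 16,000 incident photons; the water attenuation value is our assumption for illustration:

```python
import numpy as np

MU_WATER = 0.2  # assumed effective attenuation of water, cm^-1

def hu_to_mu(hu):
    # Hounsfield definition: HU = 1000 * (mu - mu_water) / mu_water
    return MU_WATER * (1.0 + hu / 1000.0)

def noisy_projection(volume_hu, project, voxel_size_cm=0.02, i0=16000):
    # noise-free line integrals in the scanner geometry (Beer-Lambert)
    line_integrals = project(hu_to_mu(volume_hu)) * voxel_size_cm
    # quantum (Poisson) noise at i0 incident photons per detector element
    counts = np.random.poisson(i0 * np.exp(-line_integrals))
    return -np.log(np.maximum(counts, 1) / i0)  # clamp to avoid log(0)
```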

The isotropic volumes and corresponding noisy reconstructions are the labels and noisy inputs for network training. Ten patient volumes are partitioned into cubes of size \(32^3\) with a stride of 25 pixels along each direction. Then, the cubes are sieved to remove empty ones based on the standard deviation of pixel values. As a result, over 190,000 pairs of 3D patches are generated for training, and around 38,000 pairs for validation. The loss function consists of an \(L_1\) norm for the value difference and a mean square error for the relative value difference: \[\sum_i \left[\left\lVert\boldsymbol{y}_i - f_{VSR}(\boldsymbol{x}_i;\theta)\right\rVert_1 +\beta_0 \left\lVert\frac{\boldsymbol{y}_i - f_{VSR}(\boldsymbol{x}_i;\theta)}{\boldsymbol{y}_i + c}\right\rVert^2_2\right],\] where \(\boldsymbol{y}_i\) and \(\boldsymbol{x}_i\) are respectively the label patches and noisy inputs, \(f_{VSR}(\boldsymbol{x}_i;\theta)\) corresponds to the network output with trainable parameters \(\theta\), and \(c\) is a constant to avoid a zero denominator. The \(L_1\) norm, instead of the \(L_2\) norm, is used in the first term to avoid blurring details, and the relative error is measured with the \(L_2\) norm in the second term to preserve tiny structures, based on our experience [43]. During training, we set the balancing hyperparameter \(\beta_0\) to 1 and \(c\) to 0.1.
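The loss translates directly into PyTorch; a sketch with the stated settings (\(\beta_0=1\), \(c=0.1\)):

```python
import torch

def vsr_loss(pred: torch.Tensor, label: torch.Tensor,
             beta0: float = 1.0, c: float = 0.1) -> torch.Tensor:
    # L1 term: preserves details better than L2
    l1 = torch.norm(label - pred, p=1)
    # squared relative error: emphasizes tiny, low-intensity structures
    rel = torch.norm((label - pred) / (label + c), p=2) ** 2
    return l1 + beta0 * rel
```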

2.3.4 Parallel Batch Processing & Geometric Self-ensemble↩︎

During inference, a reconstruction volume is partitioned into overlapping patches that are then fed into the VSR-Net. Geometric self-ensemble based on flipping and rotation is used to boost performance and suppress checkerboard artifacts. To save computational time, we randomly apply 1 of 8 transforms to the reconstruction volume in each iteration, a modification of the periodic geometric self-ensemble idea [35]. For acceleration, parallel processing techniques are used to distribute the workload across multiple GPUs.
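Our per-iteration randomized variant can be sketched as below; the concrete transform set (here, the 8 axis-flip combinations) is an assumption based on [35], and `denoiser` stands for the patch-wise VSR-Net application:

```python
import itertools
import random
import torch

# the 8 combinations of flips along the three spatial axes (D, H, W)
FLIP_SETS = [dims for n in range(4)
             for dims in itertools.combinations((-3, -2, -1), n)]

def randomized_self_ensemble(denoiser, volume: torch.Tensor) -> torch.Tensor:
    dims = random.choice(FLIP_SETS)        # 1 of 8 transforms per iteration
    t = torch.flip(volume, dims) if dims else volume
    out = denoiser(t)
    # flips are self-inverse, so the same flip undoes the transform
    return torch.flip(out, dims) if dims else out
```

Averaged over DIR iterations, the randomized transforms approximate the full self-ensemble at one-eighth of its cost.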

2.4 Textural Appearance Tuning↩︎

To match the image texture with that of the MARS commercial reconstruction, with which radiologists are already familiar, we adopt a two-step refinement process. First, we use a 2D convolutional network to exploit the inter-channel correlation for texture enhancement and value alignment. Multi-channel images extracted from the channel reconstructions of the same slice are fed to the network for the mapping, in a slice-by-slice manner for memory efficiency. Then, we process the reconstruction through a few SIRT iterations to enhance image sharpness and alter the noise characteristics. To minimize potential perception bias, we further balance the sharpened but noisier result by mixing it with the original network-processed result at a ratio preferred by radiologists. Specifically, we run 30 additional SIRT iterations for our reconstruction, and the mixing weights are set to 0.75 and 0.25 for the SIRT reconstruction and the network result, respectively. This ratio was determined by presenting a set of images with various mixing ratios to a radiologist and asking them to choose the one with the best perceptual image quality.

2.4.1 Residual Fourier Channel Attention Network (RFCAN)↩︎

We design a value and texture alignment network by modifying the residual channel attention network [44] with a Fourier channel attention mechanism [45] to learn a multi-channel mapping, as illustrated in Fig. 3. For training the RFCAN, we use the full-view MARS reconstruction slices as labels for our half-view reconstruction results. However, due to patient motion, there can be occasional non-rigid misalignments across a few slices between our reconstruction and the MARS results, resulting from differences in the volume and projection partitioning schemes and in the number of projections. Hence, we select the patient scan least affected by motion as our training data and sieve out the misalignment-affected slices, resulting in 584 pairs of HR multi-channel images (\(1,200\times1,200\)). A total of 206,000 pairs of overlapping patches of size \(128\times128\) are randomly extracted for network training, and around 52,000 pairs for validation. An additional penalty on the Fourier spectrum (insensitive to misalignment) is introduced in the loss function to emphasize texture similarity besides the intensity fidelity imposed by the other terms as follows: \[\begin{align} \sum_i \Bigl[\left\lVert\boldsymbol{y}_i - f(\boldsymbol{x}_i;\theta)\right\rVert_1 +\beta_1 \left\lVert\frac{\boldsymbol{y}_i - f(\boldsymbol{x}_i;\theta)}{\boldsymbol{y}_i + c}\right\rVert^2_2 \nonumber \\ + \beta_2\left\lVert\left| \mathcal{FFT}(\boldsymbol{y}_i)\right| - \left| \mathcal{FFT}(f(\boldsymbol{x}_i;\theta))\right|\right\rVert_1 \Bigr], \end{align}\] where \(\mathcal{FFT}(\cdot)\) denotes the Fourier transform, and \(\boldsymbol{y}_i\), \(\boldsymbol{x}_i\), and \(f(\boldsymbol{x}_i;\theta)\) are the label, input, and network output, respectively. The balancing hyperparameters \(\beta_1\) and \(\beta_2\) are both set to 1 with \(c\) set to 0.1 during training.
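Because Fourier magnitudes discard phase, the spectral term tolerates small translations between the half-view input and the full-view label; a PyTorch sketch with \(\beta_1=\beta_2=1\) and \(c=0.1\):

```python
import torch

def rfcan_loss(pred: torch.Tensor, label: torch.Tensor,
               beta1: float = 1.0, beta2: float = 1.0,
               c: float = 0.1) -> torch.Tensor:
    l1 = torch.norm(label - pred, p=1)
    rel = torch.norm((label - pred) / (label + c), p=2) ** 2
    # magnitude spectra are insensitive to misregistration
    spec = torch.norm(torch.fft.fft2(label).abs()
                      - torch.fft.fft2(pred).abs(), p=1)
    return l1 + beta1 * rel + beta2 * spec
```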

Figure 3: Motivation and architecture of the residual Fourier channel attention network (RFCAN). It mainly intends to (i) correct contrast shift, and (ii) adjust noise texture to match the clinical reference. (a) The difference images of the deep iterative refinement (DIR) and RFCAN results against the MARS full-view reconstruction reference reveal clear misregistration in the upper part of the volume, highlighting the need for a misalignment-insensitive loss function. A slight value shift for bones is also observed in the DIR result, indicated by the dark region pinpointed by arrows. (b) Zoomed-in views of a flat tissue region show a notable texture discrepancy between the DIR result and the reference, despite their similar noise levels (standard deviation) and mean values. (c) Noise power spectrum (NPS) curves, estimated from the flat tissue region, further confirm this texture difference, motivating the alignment of bone values and the adjustment of noise characteristics with RFCAN to reduce perception bias. (d) The proposed RFCAN consists of 15 Fourier channel attention residual blocks (FCA-ResBlocks) built upon attention layers with FCA (FCA-Layer). It functions as a post-processing procedure applied to the multi-channel DIR outputs.

2.5 Interleaved Updating for Large-volume Recon↩︎

The projection data from a patient scan can be huge, e.g., \(1,536\text{~columns}\times128\text{~rows}\times3,392 \text{~views}\times5\text{~channels}\), overwhelming the GPU memory during direct reconstruction. We therefore use an interleaved updating technique to divide the large-volume reconstruction task into a batch of smaller mini-jobs by partitioning the projection data and reconstruction volume into segments, as illustrated in Fig. 4. The volume along with the associated projection data is partitioned into \(N\) segments, with each volume segment and its corresponding projection data segment being geometrically aligned. To ensure data completeness, sub-volumes at the seams are also extracted with their corresponding projection data (about 1.5 to 2 rotations from the helical scan). The volume segments and seams form \(2N-1\) mini-reconstruction tasks, which can be assigned to multiple GPUs for parallel computing or processed sequentially on a single GPU. The resultant sub-volumes are combined in an interleaved pattern, with a few slices at one or both ends trimmed off to ensure data completeness of the remaining volume, forming a complete large-volume reconstruction update as shown in Fig. 4 (b).
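A sketch of the partitioning logic (slice indices only; the association of each job with its \(\sim\)1.5 to 2 helical rotations of views and the trimming margins are simplified away):

```python
def plan_mini_jobs(n_slices: int, n_segments: int, seam_half_width: int):
    """Split a volume of n_slices into n_segments main segments plus seam
    sub-volumes straddling each internal boundary (2N-1 mini-jobs total)."""
    bounds = [round(i * n_slices / n_segments) for i in range(n_segments + 1)]
    segments = [(bounds[i], bounds[i + 1]) for i in range(n_segments)]
    seams = [(max(0, b - seam_half_width), min(n_slices, b + seam_half_width))
             for b in bounds[1:-1]]
    return segments + seams
```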

Figure 4: Interleaved updating for large volume reconstruction. (a) Partitioning the projections and image volume to form a batch of tasks for sub-volume reconstruction, and (b) combining the results in an interleaved pattern with slices at one or both ends trimmed off to ensure data completeness.

3 Experiments and Results↩︎

3.1 Implementation and Experimental Setup↩︎

Training Details. Our VSR-Net and RFCAN are implemented in PyTorch and trained with the Adam optimizer on a single NVIDIA V100 GPU. The learning rate for VSR-Net starts at \(2\times10^{-4}\) and is decayed by a factor of 0.95 every epoch, with a total of 60 epochs and a batch size of 32. The learning rate for RFCAN starts at \(1\times10^{-4}\) and is decayed by a factor of 0.6 every epoch, with a total of 10 epochs and a batch size of 32. The VSR-Net is trained on the synthetic dataset described in Sub-sec. 2.3.3, while RFCAN is trained on real patient data as described in Sub-sec. 2.4.1.
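This setup maps onto standard PyTorch utilities; a sketch for VSR-Net training, where `VSRNet` and `train_loader` are hypothetical placeholders and `vsr_loss` is the loss sketched in Sub-sec. 2.3.3:

```python
import torch

model = VSRNet()  # hypothetical constructor for the network of Fig. 2
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# multiply the learning rate by 0.95 after every epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(60):
    for noisy, label in train_loader:  # batches of 32 paired 32^3 cubes
        optimizer.zero_grad()
        loss = vsr_loss(model(noisy), label)
        loss.backward()
        optimizer.step()
    scheduler.step()
```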

Reconstruction Details. We use the ASTRA Toolbox [46] for GPU-based forward and backward projection operations. The patient data are reconstructed on a cluster node using four NVIDIA V100 GPUs for parallel computation (parallel sub-volume reconstruction and patch processing), and other data are reconstructed on a server with a single RTX A5000 GPU.
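For reference, a minimal cone-beam projector pair in ASTRA's Python interface looks like the following; the geometry values are placeholders rather than the MARS calibration, and a helical trajectory would instead use the 'cone_vec' geometry with per-view vectors:

```python
import numpy as np
import astra

vol_geom = astra.create_vol_geom(1200, 1200, 128)        # (rows, cols, slices)
angles = np.linspace(0, 2 * np.pi, 424, endpoint=False)
proj_geom = astra.create_proj_geom(
    'cone', 0.11, 0.11,   # detector pixel spacing (x, y), mm
    128, 1536,            # detector rows and columns
    angles, 200.0, 50.0)  # source-origin and origin-detector distances, mm

volume = np.zeros((128, 1200, 1200), dtype=np.float32)   # (z, y, x) ordering
proj_id, sino = astra.create_sino3d_gpu(volume, proj_geom, vol_geom)
bp_id, backproj = astra.create_backprojection3d_gpu(sino, proj_geom, vol_geom)
astra.data3d.delete([proj_id, bp_id])  # release the GPU-linked data objects
```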

Experimental Setup. First, we demonstrate the in-domain capability of our DIR method on synthetic single channel CT data. The testing volume is generated from the AAPM dataset following a similar simulation protocol but from different patients. Then, we demonstrate the enhanced generalization on out-of-domain data with our DIR compared to the conventional post-processing with VSR-Net. We use phantom data scanned from a micro-PCCT system for out-of-domain testing. Finally, we validate our whole PCCT reconstruction workflow (DIR followed by texture appearance tuning) on new real patient data acquired on the MARS Extremity scanner. Ratings from radiologists on diagnostic value are used to assess the effectiveness of our method.

3.2 In-domain Simulation Study↩︎

Figure 5: Representative images reconstructed using the competing methods on simulated data. (a) The full-view reconstructions with FDK, SIRT-TV, and our method displayed against the ground truth, including exemplary axial, coronal, and sagittal views from top to bottom; (b) the reconstructions from halved views; (c) error maps of the half-view reconstructions against the ground truth for SIRT-TV (left half) and the proposed method (right half); and (d) magnified regions from the coronal and sagittal views as indicated by the green and orange boxes respectively, displayed in descending order of image sharpness and structural fidelity: ground truth, our full-view and half-view reconstructions, and full-view and half-view reconstructions with SIRT-TV from top to bottom. The display window settings are W/L:400/50 HU for images and W/L:200/0 HU for error maps. The red arrows highlight structural details that are recovered by our methods but challenging for SIRT-TV, which shows loss of resolution in (c) and a blotchy, cartoonish appearance in (d).

3.3 Out-of-domain Phantom Studies↩︎

To demonstrate the generalizability, we further test our DIR method on phantom data with totally different structures, as shown in Fig. 6. The single-channel helical scan data are acquired on our custom-built micro-CT system equipped with a PCD (ADVACAM WidePIX1x5, Prague, Czech Republic) at 80 \(kVp\). We collect the data at 6 different dose levels by adjusting the exposure time per projection (0.15, 0.5, 1.0, 1.5, 2.0, and 5.0 seconds). The volumes of \(979\times979\times610\) isotropic voxels (\(35^3\mu m^3\)) are reconstructed using 250 SIRT iterations, serving as noisy inputs (0.15- to 2.0-second exposures) and a clean reference (5.0-second exposure). Post-processing with the latest BM3D [12] and with VSR-Net (the same one used in our DIR) serve as the baselines for our DIR method. The standard deviation parameters for the BM3D method are determined by measuring the standard deviation of values in a water region after normalization by its mean. For the DIR method, we use the half-finished reconstruction as the structural prior (60 and 40 SIRT iterations at the scales of 0.637 and 1, respectively), and refine it with 36 DIR iterations (3 gradient descent steps per iteration, \(\mu=0.03\), \(\beta=0.5\), \(\gamma=0.8\)).

Figures 6 (a) to (d) compare the results with 0.5- and 0.15-second exposures from axial and sagittal views against the reference, respectively. Figure 6 (e) displays the frequency modulation curves of the axial slices in Fig. 6 (a) relative to the reference. Figure 7 (a) illustrates a zoomed-in view of a surgical tape under various exposure times and reconstruction methods. For each combination, we compute the SSIM and PSNR of every slice (from the axial or sagittal view) of the volume against the reference for quantitative comparison. The resulting SSIM and PSNR distributions are presented in Fig. 7 (b) as violin plots with boxplot overlays. In each violin, the thick gray center line represents the interquartile range, and the large white dot indicates the median. Our method consistently demonstrates improved image quality across different acquisition conditions in terms of both qualitative and quantitative measures, suggesting good generalizability to out-of-domain structures. In contrast, the adverse effects of domain shift are clearly visible in the VSR-Net results, which show artifacts and an appearance and intensity different from those of the reference despite the enhanced structure visibility, e.g., the structural errors revealed in Figs. 6 (c) and (d), and the tape structure and dots pointed to by the red and green arrows in Fig. 7 (a). The BM3D method is structurally agnostic and intrinsically offers better generalization than deep learning methods, as reflected by the high PSNR and SSIM scores in Fig. 7 (b). However, it suffers from resolution loss, as observed in the difference image in Fig. 6 (c) and in a significantly dampened spectrum at middle and high frequencies in Fig. 6 (e), compromising subtle details. This suggests the importance of task-relevant metrics and underlines the need for radiologists’ evaluation in the medical imaging field.
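The frequency modulation curves in Fig. 6 (e) amount to azimuthally averaging the ratio of Fourier magnitudes between a reconstruction and the reference; a sketch of this computation as we apply it to square axial slices:

```python
import numpy as np

def frequency_modulation(img: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Azimuthally averaged |FFT(img)| / |FFT(ref)| versus radial frequency."""
    ratio = np.abs(np.fft.fftshift(np.fft.fft2(img))) / \
            (np.abs(np.fft.fftshift(np.fft.fft2(ref))) + 1e-12)
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)  # radial bin index
    # average the ratio over annuli of equal radial frequency
    return np.bincount(r.ravel(), ratio.ravel()) / np.bincount(r.ravel())
```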

Figure 6: Generalization on the PCCT phantom scan that is out-of-domain. Comparisons between noisy input, DIR, BM3D, and VSR-Net against the long-exposure reference (5.0 seconds). (a) Axial slices from the dataset with 0.5-second exposure; (b) Sagittal slices from the dataset with 0.15-second exposure; (c) and (d) Difference images for (a) and (b) with respect to the reference, revealing resolution loss in BM3D (particularly in tape details) and falsely generated structures by VSR-Net. In contrast, DIR results exhibit no noticeable loss of resolution or structures, apart from a slight bone value shift in the extreme-case reconstruction shown in (d); and (e) Radial profiles of spatial frequency modulation functions, calculated by azimuthally averaging the normalized Fourier spectra of (a) with respect to the spectrum of the reference, further confirming the resolution loss in BM3D and demonstrating the improvement of DIR over VSR-Net with significantly dampened spectrum deviation. The display window is [0, 0.45] for axial view and [0, 1.05] for sagittal view while the corresponding windows for difference images are [-0.05, 0.05] and [-0.15, 0.15], respectively (unit: cm^{-1}). The mean and standard deviation values of a flat water region are listed along with the SSIM and PSNR values.
Figure 7: Zoomed comparison on the phantom scan under different noise levels accompanied by quantitative image quality evaluation. (a) Same magnified region as that in Fig. 6 (a) but from the datasets with exposures of 0.15, 0.5, 1.0, and 2.0 seconds, respectively; and (b) violin plots of the distributions of PSNR and SSIM values for the axial and sagittal slices.

3.4 Retrospective Patient Study↩︎

Patients aged 21 years and above referred from the fracture clinic are recruited for the clinical trial (Ethics approval: 18/STH/221/AM01, Health and Disability Ethics Committee, New Zealand). The spectral image volumes of the patients' wrists are acquired using the MARS Extremity PCCT scanner with the scanning settings shown in Table 1.

Table 1: Scanner Settings for Patient Study.
Parameters MARS Extremity 5X120 Settings
Tube Setting 118\(kVp\), 28\(\mu A\)
Exposure 160\(ms\) per projection, helical scan
CTDI\(_{\text{vol}}\) 7.87\(mGy\)
Energy Bins \(7-40\), \(40-50\), \(50-60\), \(60-70\), \(70-118\) \(keV\)
Recon. Voxel Isotropic \(90\times90\times90\,\mu m^3\)
Recon. Method A customized polychromatic iterative method [47]

3.4.1 Experiment Setup↩︎

The images of 8 patients who provided written consent are reconstructed using the commercial algorithm from the full-view dataset and using our proposed deep learning method (illustrated in Fig. 1) from the half-view dataset, and then evaluated by three independent double-blinded radiologists (SG, AB, AL) using the rating scale defined in Table 2 regarding whether diagnostic image quality is achieved [48]. In our method, DIR is applied to the structural prior (Sub-sec. 2.2, obtained with 80 and 80 SIRT iterations at the scales of 0.5 and 1, respectively) for 30 iterations (3 gradient descent steps per iteration, \(\mu=0.03,\beta=0.5,\gamma=0.8\)) to reconstruct the data in each bin. The combined multi-channel volume (\(1,200^3\times5\)) is then processed with RFCAN in a slice-by-slice manner for value alignment and texture enhancement (Sub-sec. 2.4), with the number of SIRT iterations set to 30 and the mixing ratio to \(0.75\colon0.25\) to accommodate radiologists' preference regarding image sharpness and noise characteristics.

Table 2: Grading scale for image quality assessment.
-2 Confident that the diagnostic criterion is not fulfilled;
-1 Somewhat confident that the criterion is not fulfilled;
0 Indecisive whether the criterion is fulfilled or not;
+1 Somewhat confident that the criterion is fulfilled;
+2 Confident that the criterion is fulfilled.

The radiologists are randomly presented with 500 images from each patient (three energy bins: \(7-40\,keV\), \(50-60\,keV\), and \(70-118\,keV\)) in the axial, coronal, and sagittal formats. The sagittally reformatted images reconstructed using both methods, along with 3D rendered images, are shown in Fig. 8 (a). The image quality criteria are based on the “European guidelines on quality criteria for CT” for bones and joints [49], including the visibility and sharpness of the cortical and trabecular bone, the adequacy of soft tissue contrast for the visualization of tendons, muscles, and ligaments, as well as image noise (quantum noise) and artifacts.

Additionally, we compare our results with the state-of-the-art results obtained by applying the self-supervised denoising method Noise2Sim [27] to the multi-channel reconstructions after 320 SIRT iterations. Despite significant enhancement over the SIRT reconstruction from the half-view dataset, the Noise2Sim results demonstrate insufficient image quality (image blur and loss of fine structures), as shown in Fig. 8 (b). Hence, they are excluded from the reader study.

Figure 8: Sagittal view of a wrist joint reconstructed using the standard and proposed methods respectively. (a) From left to right are 3D rendering of the standard reconstruction, half-view and full-view images in the channel 7-40keV (manually aligned to show the same region, displayed in W/L:0.72/0.5 cm^{-1}), where the arrow points to a scaphoid fracture; (b) a color visualization of our three-channel reconstruction via linear blending [50] in reference to the standard full-view result and Noise2Sim half-view result. Our result demonstrates high fidelity in both spectral values (same color tone and brightness as the full-view) and spatial structures (sharp and accurate fine details as pointed by the red arrow); (c) zoomed-in views of noise texture from a flat tissue region in our half-view reconstruction (7-40keV channel) illustrate the texture changes across different stages; and (d) the corresponding NPS curves demonstrate that, despite differences in noise amplitude, our final result closely matches the reference in spectral shape.

3.4.2 Data Analysis↩︎

For quantitative assessment, regions of interest (ROIs), each with \(\sim250\) voxels, are drawn in the flexor carpi radialis tendon and adjacent subcutaneous fat regions in the patient images. The mean and standard deviation of linear attenuation coefficients in the ROIs are used to calculate the signal-to-noise ratio (SNR) in soft tissue regions and contrast-to-noise ratio (CNR) between soft tissue and fat in the ROIs. SNR and CNR values associated with both methods are compared over all patients’ datasets. For subjective evaluation, overall radiologists’ assessment grades for seven image quality measures from both methods are summarized as a frequency table. Then, all three radiologists’ overall and combined ratings are compared with descriptive statistics. The hypothesis of no significant difference between the two methods is tested in the Wilcoxon signed-rank test.
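For concreteness, the ROI statistics reduce to a few lines; the CNR definition below, pooling the noise of both ROIs, is one common convention and an assumption on our part:

```python
import numpy as np

def snr(roi: np.ndarray) -> float:
    return roi.mean() / roi.std()

def cnr(roi_tissue: np.ndarray, roi_fat: np.ndarray) -> float:
    pooled_noise = np.sqrt((roi_tissue.var() + roi_fat.var()) / 2.0)
    return abs(roi_tissue.mean() - roi_fat.mean()) / pooled_noise
```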

Image grades from both methods in terms of each image quality metric are also converted into visual grading characteristics (VGC) points as described in [48]. Hence, with the current 5 image grading criteria, 4 VGC points are obtained, with 0 as the origin and 1 as the maximum value added as well [48]. The combined VGC points (using grades from all three radiologists) of the seven image quality measures are calculated, and the empirical area under the curve (AUC\(_{VGC}\)) is compared. The statistical significance of the mean AUC\(_{VGC}\) of the seven image quality measures is analyzed through one-sample t-testing against a hypothetical AUC\(_{VGC}\) value of 0.5 (the value corresponding to equal/comparable image quality for the two imaging methods in comparison). VGC points are also obtained by combining all seven image quality measures for both methods. Full-view vs half-view empirical AUC\(_{VGC}\) values and their 95% confidence intervals are obtained for each radiologist and for the combined scores. The statistical significance of AUC\(_{VGC}\) and its 95% confidence interval for each radiologist (56 samples) and for all radiologists (168 samples) are interpreted according to the method described in [51]. The statistical analysis is performed using GraphPad Prism 9.2.0 at a 95% confidence level. Finally, the inter-rater agreements between radiologists are evaluated with quadratically weighted kappa statistics [52].
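The hypothesis tests map onto standard statistical routines; a sketch using SciPy and scikit-learn with hypothetical example ratings on the \(-2\) to \(+2\) scale:

```python
import numpy as np
from scipy.stats import ttest_1samp, wilcoxon
from sklearn.metrics import cohen_kappa_score

# hypothetical paired ratings (full-view vs half-view) for illustration
full = np.array([1, 0, 2, -1, 1, 1, 0, 2])
half = np.array([1, 1, 2, 0, 2, 1, 0, 2])

stat, p = wilcoxon(half, full)  # paired non-parametric signed-rank test

# one-sample t-test of per-measure AUC_VGC values against the neutral 0.5
auc_vgc = np.array([0.55, 0.60, 0.52, 0.49, 0.58, 0.51, 0.56])  # illustrative
t, p_auc = ttest_1samp(auc_vgc, popmean=0.5)

# inter-rater agreement with quadratic weights
kappa = cohen_kappa_score(full, half, weights='quadratic')
```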

3.4.3 Image Comparison↩︎

Figure 8 (a) displays a 3D rendering of a patient’s wrist with a scaphoid fracture, along with oblique slices from our half-view reconstruction and the standard full-view reconstruction, highlighting the lesion. The fracture is clearly presented in both images, which show comparable contrast and noise textures. Note that the noticeable structural differences are due to slight mismatches in slice location or viewing angle introduced by manual alignment. A color visualization of the three-channel spectral images via linear blending [50] is shown in Fig. 8 (b). The consistent color tone and brightness compared to the standard result demonstrate the high fidelity of our result in spectral values. Fine structural details are also well preserved, as indicated by the red arrow. In contrast, the Noise2Sim result exhibits structure loss, while the conventional SIRT result suffers from significant noise and contrast issues. To further evaluate the noise characteristics, Fig. 8 (c) presents zoomed-in noise textures from a flat tissue region at different stages of our reconstruction, alongside the standard reference and their corresponding Fourier spectra. The corresponding NPS curves are plotted in Fig. 8 (d). As revealed in these figures, the DIR result contains major low-frequency noise components, which are progressively replaced by higher-frequency components through RFCAN post-processing; the final reconstruction further enhances high frequencies and closely resembles the texture of the standard reference, despite a lower overall noise amplitude.

3.4.4 Statistical Results↩︎

SNR in soft tissue regions and CNR between soft tissue and fat are compared in Fig. 9, where the bar charts illustrate that, for all patients, SNR and CNR in the images obtained with the proposed half-view reconstruction method are higher than those in the clinical benchmark images reconstructed using the standard commercial method from the full dataset, except for the second patient, whose images show comparable CNRs. The distributions of image noise across patients shown in Fig. 10 further indicate that the proposed method consistently yields significantly lower noise levels than the standard method, despite using half the radiation dose.

Table 3: Descriptive statistics of the radiologists’ ratings.
Methods  Raters Median \(\uparrow\) IQR Mean \(\uparrow\) (Std.)
Full RD1 1 0 0.875 (0.740)
Full RD2 1 2 1.107 (0.966)
Full RD3 \(-1\) 3 \(-0.589\) (1.247)
Full COM 1 1 0.464 (1.252)
Half RD1 1 2 1.054 (0.862)
Half RD2 2 2 1.179 (1.011)
Half RD3 0 3 \(-0.357\) (1.354)
Half COM 1 2 0.625 (1.293)
Overall RD1 1 1.5 0.964 (0.804)
Overall RD2 1 2 1.143 (0.985)
Overall RD3 \(-1\) 3 \(-0.473\) (1.301)
Overall COM 1 2 0.545 (1.273)
Table 4: Hypothesis testing (Half-view vs Full view) in terms of the Wilcoxon signed rank.
Raters  # of Pairs # of Ties \(p\)-Value
RD1 56 22 0.2734
RD2 56 44 \(0.3877\)
RD3 56 38 \(0.0355\)
COM 168 104 \(0.0166\)

More importantly, the overall confidence ratings of diagnostic image quality with seven criteria are compared in Table 3. The table shows higher mean and equal or higher median image quality scores with the proposed half-view reconstruction method than with the current clinically used reconstruction method from the full-view dataset for all radiologists, despite their different scores, indicating a preference for the proposed reconstructions. The median value for the half-view reconstructions is 2 for the second radiologist, which suggests higher confidence in image interpretation. Despite the discrepancy in ratings, the combined median value is clearly positive, indicating the favorable acceptability of our reconstructions. The hypothesis is further tested with the Wilcoxon signed rank test for all three raters and the combined ratings in Table 4. The p-value is not statistically significant for the 1st and 2nd radiologists, suggesting no difference in image quality between the two methods. However, the p-values for the 3rd radiologist and the combined results are statistically significant, indicating that the image quality from the proposed method is perceived as significantly better than that of the standard commercial reconstruction from the full-view dataset.

Figure 9: Quantitative evaluation of our half-view reconstruction against the standard commercial reconstruction from the full-view dataset.
Figure 10: Distribution of the noise across patients measured as standard deviations of pixel values within the regions for CNR/SNR calculation.
Figure 11: Subjective evaluation on seven image quality (IQ) metrics comparing our half-view reconstruction with the clinical full-view reconstruction. (a) Violin plots of the combined radiologists’ ratings on the half-view results against the full-view results; (b) AUC\(_{VGC}\) values slightly above 0.5 (the performance-neutral threshold) for the proposed method over the conventional method on most image quality metrics; and (c) the half-view versus full-view VGC plots generated by combining all the image quality metrics, with most VGC points above the diagonal (performance-neutral) line for all three radiologists and their combined ratings.

The proposed method also performs better when the image quality measures are evaluated individually. The mean AUC\(_{VGC}\) values (area under the visual grading characteristics curve) of the proposed method are consistently higher than 0.5 for five of the image quality measures evaluated, as shown in Fig. 11 (b). Similar trends are reflected in the violin plots in Fig. 11 (a), as indicated by the better median scores and narrower tails at the low end (fewer low scores). The mean AUC\(_{VGC}\) of the standard method was only slightly better than that of the proposed method in soft tissue contrast differentiation, mainly related to the depiction of ligaments, tendons, and muscles. The VGC points obtained for the overall image quality scores from the two competing methods are plotted in Fig. 11 (c), which shows that the mean AUC\(_{VGC}\) for the proposed method is better than 0.5 for all radiologists and the combined ratings. However, to show that the AUC\(_{VGC}\) is significantly better than 0.5 in the \(95\%\) confidence interval sense, more data would be needed. The statistical significance of the mean AUC\(_{VGC}\) of the seven image quality measures is established using the one-sample t-test in Table 5. As the clinical trial proceeds, more datasets may further strengthen the statistical significance of this comparative study.

Although all the raters preferred the proposed half-view reconstructions, they provided different ratings for the same images, resulting in a lower inter-rater agreement. The agreements between raters are evaluated with weighted kappa statistics in Table 6. The table shows a slight to fair, yet highly significant, agreement between the radiologists’ scores. The kappa value is higher for radiologists 2 and 3, indicating a higher degree of agreement between these two radiologists.

Table 5: Hypothesis testing of the mean of AUC\(_{VGC}\) of the image quality measures (one-sample t-test).
# of Samples Mean (Std.) 95% CI (Mean\(-0.5\)) \(t\) \(p\)-Value
7 (df=6) 0.5479 (0.0503) [0.0014, 0.0944] 2.52 0.0454
Table 6: Agreement in the combined subjective scores between radiologists (quadratically weighted kappa).
Categories Weighted Kappa \(p\)-Value
RD1-RD2 0.247 \(0.0042\)
RD1-RD3 0.171 \(<0.0001\)
RD2-RD3 0.339 \(<0.0001\)

4 Discussion↩︎

This study targets deep learning-based HR PCCT volumetric reconstruction given insufficient training data. Direct volumetric reconstruction is necessary and advantageous since rebinning to fan-beam geometry could compromise image quality [53], especially due to large gaps and bad pixels in PCDs and free-form scanning with robotic arms [54], [55]. However, volumetric PCCT reconstruction poses GPU memory and computational challenges. We have tackled them with interleaved updating, patch-based refinement, and low-noise prior sharing. Low-level structural similarity has been leveraged in combination with model-based iterative refinement to address the domain gap effectively. Additionally, the textural appearance has been fine-tuned to align with the standard reconstruction in the application domain. On the other hand, the patch-based representation mitigates the GPU memory limitation at the cost of extra computation during inference. For example, the reconstruction volume is partitioned into overlapping patches to minimize checkerboard artifacts; compared to directly processing the entire volume, the overhead includes computing the overlapping portions and performing the partitioning, assembly, and geometric self-ensemble steps. Fortunately, we may use different patch sizes for training and inference by leveraging the shift-invariant property of convolutional networks, and bigger patches can be used to reduce the overhead during inference. With parallel computing on four V100 GPUs, the typical computation time to reconstruct a \(94\,mm\) patient wrist scan in five energy bins is about 7 hours using our method, whereas the commercial reconstruction takes around 9 to 10 hours on its dedicated hardware. In comparison with the current commercial reconstruction from the full-view dataset, the major benefits of our approach include halved radiation dose and doubled imaging speed without compromising image quality. The evaluation methods are classic and double-blinded, and the involved patient datasets are randomly determined, covering a range of pathological and technical conditions. Therefore, our results strongly suggest a great potential of our approach for clinical PCCT image reconstruction. Further improvements are surely possible using more advanced network architectures, such as the emerging diffusion/score-matching models [56]. The barriers to adapting the diffusion approach for PCCT include the scarcity of high-quality data, the memory limitation, and the sampling overhead, which are being actively explored.

Figure 12: PSF measurements on a tungsten-wire phantom scan (0.5-second exposure) reconstructed by (a) DIR and (b) SIRT methods. All other settings remained consistent with the phantom study.

Interestingly, for out-of-domain single-channel real phantom data, significant improvement in both image quality and stability has been achieved with our DIR approach over the conventional single-pass post-processing method using the same network. For post-processing methods, the domain gap could lead to inappropriate frequency elevation in network-processed images and cause undesired artifacts. For example, in Fig. 6 (e), the spectrum elevation of the VSR-Net curves even extends beyond the cutoff frequency, which explains the network’s tendency to mistake noise for features and produce high-frequency artifacts. Through iterative feedback and correction, DIR significantly reduces these errors despite an elevation pattern similar to the VSR-Net curves. One limitation is the low-frequency bone value shift observed in the extreme-case reconstruction (0.15-second exposure against the 5-second reference, \(3\%\) dose), as shown in Fig. 6 (d), which is also reflected by a small dent in the DIR frequency modulation profile. This dent can cause low-frequency image contrast changes, which explains why the SSIM and PSNR scores of DIR are smaller than those of BM3D. Fortunately, no structure or resolution loss is noticed, and this issue can be easily remedied with the texture appearance tuning network RFCAN. We further assessed resolution with point spread function (PSF) measurements using a \(10\,\mu m\)-diameter tungsten wire phantom. The PSF measured from the DIR reconstruction is \(81\,\mu m\) in full width at half maximum (FWHM), while the FWHM from the direct SIRT reconstruction reads \(88.8\,\mu m\), as shown in Fig. 12, suggesting no resolution loss. One interesting observation is that the BM3D method scores the best in terms of SSIM and PSNR despite the loss of fine details, suggesting the necessity of task-relevant metrics in clinical applications. We also underline that our major aim in this experiment is to demonstrate the improved generalizability of DIR over VSR-Net rather than to compete with BM3D on these scores. While we already matched BM3D’s scores in the less noisy cases, superior results can be expected if we adopt an RFCAN or retrain the VSR-Net to narrow the domain gap.

More remarkably, in the retrospective patient study our spectral reconstruction quality has surpassed that of the state-of-the-art unsupervised learning method Noise2Sim. In our reader study, the analysis of the grading results combined from all three radiologists has demonstrated that our proposed method is better than the standard commercial reconstruction. Encouragingly, the median values of all image quality scores are on the positive side, suggesting our reconstructions are diagnostically acceptable and preferred, despite being reconstructed from only half the radiation dose. Furthermore, our method has been evaluated in terms of VGC points from the radiologists’ scores against the existing method, producing better results on almost all of the individual image quality measures. For the total combined image quality scores, the proposed method has shown a competitive advantage over the existing method, with a mean \(\text{AUC}_{VGC}\) value exceeding 0.5, although additional patient data are needed to establish statistical significance. Notably, the individual scores from all three radiologists, as well as the combined rankings, exhibit consistent trends: all favor the proposed method for a majority of the VGC points, as shown in Fig. 11, but with the VGC fitting curve crossing the diagonal line at the rightmost part. This is a pattern typically associated with a method that receives higher mean scores but also exhibits greater variation in scores than the reference. This interpretation is supported by the summary statistics in Table 3, where the combined ratings for the proposed half-view method show both a higher mean and a larger interquartile range than those of the full-view reference. Finally, a slight to fair agreement has been obtained among the radiologists, despite no formal training on visual grading.

Regarding the radiation dose of a patient’s wrist scan, we have performed comparison experiments with a GE Discovery CT750 HD scanner for standard musculoskeletal imaging. The Extremity 5x120 scanner delivers a radiation dose (volumetric CT dose index, CTDI\(_{\text{vol}}\)) of \(7.87mGy\) to a \(10cm\) polymethyl methacrylate (PMMA) phantom during a routine wrist scan. This radiation dose remains the same regardless of whether or not a metal implant is present in the wrist of a patient. The radiation dose of the conventional CT scanner, using wrist protocols derived by the School of Medicine and Public Health, University of Wisconsin-Madison, USA, in conjunction with GE engineers, was also measured using a \(10cm\) PMMA phantom. The radiation dose was measured at \(43.02mGy\) in CTDI\(_{\text{vol}}\) using a protocol optimized for a wrist without metal implants, and \(197.3mGy\) for a protocol optimized with a metal implant present. These results suggest that the MARS scanner can produce diagnostically useful images at only a small fraction of the radiation dose required by traditional CT scanners. Moreover, our proposed method can further reduce the dose by half for improved safety without sacrificing image quality.

This study also confirms that high-quality images suitable for musculoskeletal diagnosis can be acquired in a reasonable time with a low X-ray dose at the point of care. As demonstrated in Fig. 8 (a), a clear image of a scaphoid fracture was captured with quality comparable to that of full field-of-view CT scans at only a fraction of the radiation dose. Traditional point-of-care musculoskeletal scanners typically rely on cone-beam flat-panel technology, whereas the system used for this research was a photon-counting scanner with AI-based reconstruction and a helical scanning geometry. Significantly higher spatial resolution is achieved with the MARS PCDs than with flat-panel systems, along with other benefits of photon-counting technology, such as reduced metal artifacts, multi-contrast imaging, and lower radiation dose. The in-plane resolution at \(10\%\) of the modulation transfer function (MTF) peak was measured to be \(1.8\,lp/mm\), while the longitudinal resolution was around \(5.0\,lp/mm\) [57]. These values are comparable to or better than those reported for the latest Naeotom Alpha PCCT scanner from Siemens (in-plane \(1.69\,lp/mm\) and axial \(3\,lp/mm\) at \(10\%\) MTF) [58]. Compared to the Siemens whole-body PCCT, the MARS scanner offers unique advantages as a point-of-care solution for musculoskeletal imaging. For example, full-field PCCT scanners require substantial floor space in a radiation-shielded environment, at least two trained operators, and a high capital cost. With these characteristics, they are usually installed only in tertiary imaging centers, which limits patient accessibility, delays time to diagnosis, and increases cost compared to point-of-care scanners. This is especially prominent for musculoskeletal patients, who are often evaluated in outpatient clinics away from the main hospital. In contrast, a MARS point-of-care scanner is more compact, requires fewer operating staff, and is well-suited for deployment in outpatient facilities where musculoskeletal assessments commonly take place.

This study is not without limitations. First, the number of patients included is relatively small, since our clinical trial is still in progress. Second, the inter-reader agreement between radiologists is low. This can be attributed to the fact that the radiologists were not trained for inter-rater agreement in image quality evaluation prior to the review study. At that time, no grading protocol had been established in the context of PCCT, and pre-evaluation training could potentially have introduced biases. In addition, the radiologists have different degrees of familiarity with and knowledge of PCCT images. Considering the significantly lower radiation dose (\(5\%\) of that of an equivalent traditional CT scanner protocol), they may hold very different opinions on whether such images are adequate for a satisfactory diagnosis. It is also worth mentioning that motion correction was applied to five patients with noticeable movements using the method described in [54] prior to our reconstruction, which significantly improved the motion artifact assessment, as shown in Fig. 11 (b).

Finally, we would like to emphasize that this study aims to address the training data scarcity issue for HR volumetric PCCT reconstruction by bridging the gap between synthetic and clinical data through our proposed DIR framework. It is generally preferable to use in-domain clinical data for network training whenever possible, provided they are of sufficient quality and quantity. However, such data are often scarce in practice, which sometimes necessitates the use of synthetic datasets for training. Synthetic data offer two key benefits: (1) they are relatively easy to acquire compared to patient data; and (2) they provide greater flexibility in controlling acquisition conditions, such as generating noise-free labels or producing repeated scans under varying dose levels and motion states. On the other hand, the primary drawbacks of synthetic data are their inevitably unrealistic aspects, such as fake structures or features and the simplification or omission of certain physical effects during image formation, which can introduce domain gaps and degrade network performance on real clinical data. Our DIR framework bridges the domain gap through two strategies: (1) narrowing the domain gap by training a structure-agnostic, low-level denoiser (VSR-Net); and (2) decomposing the difficult “long jump” into easier “segmental walks”, with each stepping stone (intermediate result) kept quasi in-domain for the denoiser. Specifically, by designing VSR-Net as a low-level denoiser, the primary domain gap between synthetic and clinical data is reduced to differences in noise levels and image contrast modulation. By properly selecting the number of SIRT iterations when generating the structural prior, we can closely match the resulting noise magnitude to that of the synthetic data, minimizing the domain gap for VSR-Net at the first step of the DIR pipeline. In subsequent steps, we balance the noise introduced by the gradient descent updates against the noise removed by VSR-Net (controlled by the parameter \(\mu\)). When tuned properly, these opposing effects largely cancel each other out, maintaining a quasi in-domain input for the denoiser (note that denoisers are tolerant of cleaner-than-expected inputs). Additionally, image contrast modulation mismatches can be corrected through iterative feedback during the DIR iterations, further ensuring the robustness of the DIR framework.
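
To make the interplay among the SIRT-based structural prior, the gradient descent updates, and the VSR-Net denoiser explicit, below is a minimal sketch of one DIR-style refinement loop; the operator names and the exact blending rule are illustrative assumptions, not a verbatim transcription of our implementation.

```python
def dir_refine(y, A, At, x_sirt, denoiser, mu=0.5, step=1.0, n_iters=10):
    """Minimal sketch of a DIR-style loop: alternate a data-fidelity
    gradient step (which reintroduces noise) with the learned low-level
    denoiser (standing in for VSR-Net), blended by mu.

    y       : measured few-view projections
    A, At   : forward projector and its adjoint (backprojector)
    x_sirt  : SIRT reconstruction used as the structural prior / initializer
    """
    x = x_sirt
    for _ in range(n_iters):
        grad = At(A(x) - y)        # data-fidelity gradient
        x_data = x - step * grad   # gradient descent update adds noise back
        # mu balances the noise removed by the denoiser against the noise
        # introduced by the data update, keeping its input quasi in-domain.
        x = mu * denoiser(x_data) + (1.0 - mu) * x_data
    return x
```

In this sketch, a larger \(\mu\) removes more of the noise reintroduced by the data-fidelity step, so tuning \(\mu\) plays the balancing role discussed above.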

5 Conclusion↩︎

In conclusion, we have developed a novel deep learning method for few-view HR volumetric PCCT reconstruction, validated in the New Zealand clinical trial at halved radiation dose and doubled imaging speed. Compared to the standard commercial reconstruction method used in the clinical trial, the proposed method produces equivalent or superior image quality at half the radiation dose. We plan to translate the proposed few-view reconstruction method into the PCCT system and continue improving it as the clinical trial proceeds.

A visual comparison example (a slice adjacent to that displayed in Fig. 3 (a)) is presented in Fig. 13 as a complement to Fig. 8. It shows the RFCAN output, the final half-view reconstruction result, and the full-view clinical reference. The bone edge pinpointed by the red arrow appears slightly over-smoothed in the RFCAN result compared to the clinical reference. The final few-view reconstruction result improves the sharpness, as also demonstrated in the intensity profiles. An improvement in the noise texture of the soft tissue region is also observed, cross-validating the results in Figs. 8 (c) and (d). These improvements confirm the benefit of the second step of our two-step textural appearance tuning procedure, beyond the RFCAN processing.

Figure 13: Visual comparison between (a) the intermediate RFCAN result, (b) the final half-view reconstruction, and (c) the MARS full-view reference in the 7–40 keV channel, displayed at W/L: 0.72/0.5 cm\(^{-1}\), accompanied by (d) plots of their intensity profiles along the red line. Note that the bony structure differences in the upper parts of the images are due to misalignment, as explained in Fig. 3 (a).

References↩︎

[1]
S. Joyce, O. J. O’Connor, M. M. Maher, and M. F. McEntee, “Strategies for dose reduction with specific clinical indications during computed tomography,” Radiography, vol. 26, pp. S62–S68, 2020.
[2]
M. Söderberg and M. Gunnarsson, “Automatic exposure control in computed tomography–an evaluation of systems from different manufacturers,” Acta Radiol., vol. 51, no. 6, pp. 625–634, 2010.
[3]
N. Muhammad, M. Karim, H. Harun, M. Rahman, R. Azlan, and N. Sumardi, “The impact of tube current and iterative reconstruction algorithm on dose and image quality of infant ct head examination,” Radiat. Phys. Chem., p. 110272, 2022.
[4]
M. J. Willemink, M. Persson, A. Pourmorteza, N. J. Pelc, and D. Fleischmann, “Photon-counting ct: technical principles and clinical prospects,” Radiology, vol. 289, no. 2, pp. 293–312, 2018.
[5]
S. A. Si-Mohamed, S. Boccalini, H. Lacombe, A. Diaw, M. Varasteh, P.-A. Rodesch, R. Dessouky, M. Villien, V. Tatard-Leitman, T. Bochaton et al., “Coronary ct angiography with photon-counting ct: first-in-human results,” Radiology, vol. 303, no. 2, pp. 303–313, 2022.
[6]
V. Mergen, M. Eberhard, R. Manka, A. Euler, and H. Alkadhi, “First in-human quantitative plaque characterization with ultra-high resolution coronary photon-counting ct angiography,” Frontiers in cardiovascular medicine, vol. 9, p. 981012, 2022.
[7]
R. K. Panta, A. P. Butler, P. H. Butler, N. J. de Ruiter, S. T. Bell, M. F. Walsh et al., “First human imaging with mars photon-counting ct,” in IEEE Nucl. Sci. Symp. Med. Imag. Conf. IEEE, 2018, pp. 1–7.
[8]
L. K. Stamp, N. G. Anderson, F. Becce, M. Rajeswari, M. Polson, O. Guyen, A. Viry, C. Choi, T. E. Kirkbride, and A. Y. Raja, “Clinical utility of multi-energy spectral photon-counting computed tomography in crystal arthritis,” Arthritis & rheumatology, vol. 71, no. 7, pp. 1158–1162, 2019.
[9]
F. Ostadhossein, I. Tripathi, L. Benig, D. LoBato, M. Moghiseh, C. Lowe et al., “Multi-‘color’ delineation of bone microdamages using ligand-directed sub-5 nm hafnia nanodots and photon counting ct imaging,” Adv. Funct. Mater., vol. 30, no. 4, p. 1904936, 2020.
[10]
A. S. Wang and N. J. Pelc, “Spectral photon counting ct: Imaging algorithms and performance assessment,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 5, no. 4, pp. 453–464, 2020.
[11]
E. Y. Sidky, C.-M. Kao, and X. Pan, “Accurate image reconstruction from few-views and limited-angle data in divergent-beam ct,” J. X-ray Sci. Technol., vol. 14, no. 2, pp. 119–139, 2006.
[12]
Y. Mäkinen, L. Azzari, and A. Foi, “Collaborative filtering of correlated noise: Exact transform-domain variance for improved shrinkage and patch matching,” IEEE Trans. Imag. Proc., vol. 29, pp. 8339–8354, 2020.
[13]
G. Wang, J. C. Ye, and B. De Man, “Deep learning for tomographic image reconstruction,” Nat. Mach. Intell., vol. 2, no. 12, pp. 737–748, 2020.
[14]
J. He, Y. Yang, Y. Wang, D. Zeng, Z. Bian, H. Zhang, J. Sun, Z. Xu, and J. Ma, “Optimizing a parameterized plug-and-play admm for iterative low-dose ct reconstruction,” IEEE Trans. Med. Imag., vol. 38, no. 2, pp. 371–382, 2018.
[15]
H. Chen, Y. Zhang, Y. Chen, J. Zhang, W. Zhang, H. Sun et al., “Learn: Learned experts’ assessment-based reconstruction network for sparse-data ct,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1333–1347, 2018.
[16]
L. Shen, W. Zhao, and L. Xing, “Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning,” Nat. Biomed. Eng., vol. 3, no. 11, pp. 880–888, 2019.
[17]
W. Wu, D. Hu, C. Niu, H. Yu, V. Vardhanabhuti, and G. Wang, “Drone: Dual-domain residual-based optimization network for sparse-view ct reconstruction,” IEEE Trans. Med. Imag., vol. 40, no. 11, pp. 3002–3014, 2021.
[18]
M. Thies, F. Wagner, M. Gu, L. Folle, L. Felsner, and A. Maier, “Learned cone-beam ct reconstruction using neural ordinary differential equations,” arXiv preprint arXiv:2201.07562, 2022.
[19]
X. Li, K. Jing, Y. Yang, Y. Wang, J. Ma, H. Zheng, and Z. Xu, “Noise-generating and imaging mechanism inspired implicit regularization learning network for low dose ct reconstruction,” IEEE Trans. Med. Imag., 2023.
[20]
M. Li, P. Lorraine, J. Pack, G. Wang, and B. De Man, “Realistic ct noise modeling for deep learning training data generation and application to super-resolution,” in 17th Int. Meeting Fully 3D Imag. Recon. Radiol. Nucl. Med., Stony Brook, NY, USA, 7 2023.
[21]
M. Du, K. Liang, Y. Liu, and Y. Xing, “Investigation of domain gap problem in several deep-learning-based ct metal artefact reduction methods,” arXiv preprint arXiv:2111.12983, 2021.
[22]
J. Xu, Y. Huang, M.-M. Cheng, L. Liu, F. Zhu, Z. Xu, and L. Shao, “Noisy-as-clean: Learning self-supervised denoising from corrupted image,” IEEE Trans. Imag. Proc., vol. 29, pp. 9316–9329, 2020.
[23]
N. Moran, D. Schmidt, Y. Zhong, and P. Coady, “Noisier2noise: Learning to denoise from unpaired noisy data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12064–12072.
[24]
A. A. Hendriksen, D. M. Pelt, and K. J. Batenburg, “Noise2inverse: Self-supervised deep convolutional denoising for tomography,” IEEE Trans. Comput. Imag., vol. 6, pp. 1320–1335, 2020.
[25]
N. Yuan, J. Zhou, and J. Qi, “Half2half: deep neural network based ct image denoising without independent reference data,” Phys. Med. Biol., vol. 65, no. 21, p. 215020, 2020.
[26]
Z. Zhang, X. Liang, W. Zhao, and L. Xing, “Noise2context: Context-assisted learning 3d thin-layer for low-dose ct,” Med. Phys., vol. 48, no. 10, pp. 5794–5803, 2021.
[27]
C. Niu, M. Li, F. Fan, W. Wu, X. Guo, Q. Lyu, and G. Wang, “Noise suppression with similarity-based self-supervised deep learning,” IEEE Trans. Med. Imag., 2022.
[28]
G.-H. Chen, J. Tang, and S. Leng, “Prior image constrained compressed sensing (piccs): a method to accurately reconstruct dynamic ct images from highly undersampled projection data sets,” Medical physics, vol. 35, no. 2, pp. 660–663, 2008.
[29]
K. Rajendran, S. Tao, D. Abdurakhimova, S. Leng, and C. McCollough, “Ultra-high resolution photon-counting detector ct reconstruction using spectral prior image constrained compressed-sensing (uhr-spiccs),” in Medical Imaging 2018: Physics of Medical Imaging, vol. 10573. SPIE, 2018, pp. 310–315.
[30]
M. R. Hestenes, “Multiplier and gradient methods,” Journal of optimization theory and applications, vol. 4, no. 5, pp. 303–320, 1969.
[31]
S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[32]
Y. Wang, W. Yin, and J. Zeng, “Global convergence of admm in nonconvex nonsmooth optimization,” J. Sci. Comput., vol. 78, no. 1, pp. 29–63, 2019.
[33]
G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3563–3593, 2014.
[34]
W. Dong, P. Wang, W. Yin, G. Shi, F. Wu, and X. Lu, “Denoising prior driven deep neural network for image restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 10, pp. 2305–2318, 2018.
[35]
K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[36]
B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain-transform manifold learning,” Nature, vol. 555, no. 7697, pp. 487–492, 2018.
[37]
J. Xu and F. Noo, “Convex optimization algorithms in medical image reconstruction—in the age of ai,” Phys. Med. Biol., vol. 67, no. 7, p. 07TR01, 2022.
[38]
O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Int. Conf. Med. Imag. Comput. Comput. Assist. Interv. Springer, 2015, pp. 234–241.
[39]
S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1492–1500.
[40]
G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger, “Condensenet: An efficient densenet using learned group convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2752–2761.
[41]
B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 136–144.
[42]
C. H. McCollough, A. C. Bartley, R. E. Carter, B. Chen, T. A. Drees, P. Edwards et al., “Low-dose ct for the detection and classification of metastatic liver lesions: results of the 2016 low dose ct grand challenge,” Med. Phys., vol. 44, no. 10, pp. e339–e352, 2017.
[43]
M. Li, D. S. Rundle, and G. Wang, “X-ray photon-counting data correction through deep learning,” arXiv preprint arXiv:2007.03119, 2020.
[44]
Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 286–301.
[45]
C. Qiao, D. Li, Y. Guo, C. Liu, T. Jiang, Q. Dai, and D. Li, “Evaluation and development of deep neural networks for image super-resolution in optical microscopy,” Nat. Methods, vol. 18, no. 2, pp. 194–202, 2021.
[46]
W. Van Aarle, W. J. Palenstijn, J. Cant, E. Janssens, F. Bleichrodt, A. Dabravolski et al., “Fast and flexible x-ray tomography using the astra toolbox,” Opt. Exp., vol. 24, no. 22, pp. 25129–25147, 2016.
[47]
N. J. De Ruiter, P. H. Butler, A. P. Butler, S. T. Bell, A. I. Chernoglazov, and M. F. Walsh, “Mars imaging and reconstruction challenges,” in 14th Int. Meeting Fully 3D Imag. Recon. Radiol. Nucl. Med., Xi’an, China, 2017, pp. 18–23.
[48]
M. Bath and L. Mansson, “Visual grading characteristics (vgc) analysis: a non-parametric rank-invariant statistical method for image quality evaluation,” Brit. J. Radiol., vol. 80, no. 951, pp. 169–176, 2007.
[49]
J. Carmichael, European guidelines on quality criteria for diagnostic radiographic images. Office for Official Publications of the European Communities, 1996.
[50]
M. Li, X. Guo, A. Verma, A. Rudkouskaya, A. M. McKenna, X. Intes, G. Wang, and M. Barroso, “Contrast-enhanced photon-counting micro-ct of tumor xenograft models,” bioRxiv, pp. 2024–01, 2024.
[51]
J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (roc) curve.” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[52]
J. L. Fleiss, J. Cohen, and B. S. Everitt, “Large sample standard errors of kappa and weighted kappa.” Psychological bulletin, vol. 72, no. 5, p. 323, 1969.
[53]
F. Noo, M. Defrise, and R. Clackdoyle, “Single-slice rebinning method for helical cone-beam ct,” Phys. Med. Biol., vol. 44, no. 2, p. 561, 1999.
[54]
M. Li, C. Lowe, A. Butler, P. Butler, and G. Wang, “Motion correction via locally linear embedding for helical photon-counting CT,” in Proc. SPIE 7th Int. Conf. Imag. Form. X-Ray Comput. Tomo., vol. 12304, 6 2022.
[55]
M. Li, J. Bohacova, J. Uher, W. Cong, J. Rubinstein, and G. Wang, “Motion correction for robot-based x-ray photon-counting CT at ultrahigh resolution,” in Proc. SPIE Dev. X-Ray Tomo. XIV, vol. 12242. SPIE, 8 2022.
[56]
Q. Gao, Z. Li, J. Zhang, Y. Zhang, and H. Shan, “Corediff: Contextual error-modulated generalized diffusion model for low-dose ct denoising and generalization,” IEEE Trans. Med. Imag., vol. 43, no. 2, pp. 745–759, 2024.
[57]
L. Gallego Manzano, P. Monnin, Y. Sayous, F. Becce, J. Damet, and A. Viry, “Clinical commissioning of the first point-of-care spectral photon-counting ct for the upper extremities,” Medical physics, vol. 50, no. 5, pp. 2844–2859, 2023.
[58]
F. S. L. Thomsen, S. Horstmeier, J. H. Niehoff, J. A. Peña, and J. Borggrefe, “Effective spatial resolution of photon counting ct for imaging of trabecular structures is superior to conventional clinical ct and similar to high resolution peripheral ct,” Investigative Radiology, vol. 57, no. 9, pp. 620–626, 2022.

  1. This work was supported in part by NIH under the grants R01CA237267, R01EB034737, R01CA233888, R01HL151561, and R01EB031102. M. Li, C. Niu, and G. Wang are with the Biomedical Imaging Center, Rensselaer Polytechnic Institute, Troy, New York, USA (e-mail:mzli@ieee.org; niuc@rpi.edu; wangg6@rpi.edu). M. R. Amma, N. d. Ruiter, and P. Butler are affiliated with MARS Bioimaging Limited, Christchurch, New Zealand (e-mail: maya.vivek1@gmail.com; niels.deruiter@marsbioimaging.com; phil.butler@marsbioimaging.com). K. M. Chapagain is with MARS Bioimaging Limited, Christchurch, New Zealand, also with University of Otago, New Zealand (e-mail: krish22chapagain@gmail.com). S. Gabrielson is with Canterbury District Health Board, New Zealand (e-mail: STEFAN.GABRIELSON@cdhb.health.nz). A. Li is with Pacific Radiology, New Zealand (e-mail: andrew.li@pacificradiology.com). K. Jonker is with MARS Bioimaging Limited, Christchurch, New Zealand, also with University of Canterbury, New Zealand (e-mail: kevin.jonker@marsbioimaging.com). J. A. Clark is with MARS Bioimaging Limited, Christchurch, New Zealand, also with University of Otago, New Zealand (e-mail: jenn.clark@marsbioimaging.com). A. Butler is with MARS Bioimaging Limited, Christchurch, New Zealand, also with University of Otago, New Zealand, University of Canterbury, New Zealand, and Canterbury District Health Board, New Zealand (e-mail: anthony.butler@marsbioimaging.com). H. Yu is with Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, Massachusetts, USA (e-mail: Hengyong-yu@ieee.org). The corresponding authors are G. Wang, A. Butler, and H. Yu.↩︎