Reconstruct Anything Model:
a lightweight foundation model for
computational imaging
March 11, 2025
Most existing learning-based methods for solving imaging inverse problems can be roughly divided into two classes: iterative algorithms, such as plug-and-play and diffusion methods leveraging pretrained denoisers, and unrolled architectures that are trained end-to-end for specific imaging problems. Iterative methods in the first class are computationally costly and often yield suboptimal reconstruction performance, whereas unrolled architectures are generally problem-specific and require expensive training. In this work, we propose a novel non-iterative, lightweight architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling. Our model is trained to solve a wide range of inverse problems, such as deblurring, magnetic resonance imaging, computed tomography, inpainting, and super-resolution, and handles arbitrary image sizes and channels, such as grayscale, complex, and color data. The proposed model can be easily adapted to unseen inverse problems or datasets with a few fine-tuning steps, using as few as a handful of images, in a self-supervised way, without ground-truth references. Through a series of experiments, we demonstrate state-of-the-art performance from medical imaging to low-photon imaging and microscopy. Our code is available at https://github.com/matthieutrs/ram.
Computational imaging problems are ubiquitous in applications ranging from demosaicing in computational photography to Magnetic Resonance Imaging (MRI). In this work, we focus on linear inverse problems of the form \[\boldsymbol{y}\sim p(\boldsymbol{y}|\boldsymbol{A}\boldsymbol{x}), \label{eq:inv_pb}\tag{1}\] where \(\boldsymbol{x}\in \mathbb{R}^n\) is the image to recover, \(\boldsymbol{y}\in\mathbb{R}^m\) are the measurements, \(\boldsymbol{A}\colon \mathbb{R}^n\to\mathbb{R}^m\) is an operator describing the acquisition physics, usually assumed to be linear, and \(p\) is the noise distribution, which is typically Gaussian, Poisson, or a combination of both. We mostly focus on non-blind problems with known \(\boldsymbol{A}\).
The most common approach to solving the inverse problem 1 is to employ an iterative algorithm to minimize an objective function of the form \[\mathop{\mathrm{arg\,min}}_{\boldsymbol{x}}\, \frac{1}{2}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2^2 + g(\boldsymbol{x}), \label{eq:min_pb}\tag{2}\] where \(g\) encodes prior knowledge about the data. Classical approaches rely on handcrafted priors such as total variation [1] or wavelet sparsity [2]. However, recent advances have highlighted the effectiveness of learned priors, particularly deep denoising neural networks, which have been successfully integrated into iterative reconstruction methods. These include Plug-and-Play (PnP) algorithms [3]–[7], and more recent diffusion-model-based methods [8]–[11]. The advantage of these approaches is that a single model trained only for denoising can be used to solve a large set of image restoration tasks. However, these methods come with several major drawbacks: they are often slow, may introduce blur [12]–[14], and depend heavily on the training data, limiting their use in domains with scarce ground-truth references [15].
Another common strategy for solving 1 is to train an end-to-end deep neural network specifically for the given task. In low-level vision applications such as single-image super-resolution and image denoising, architectures are often derived empirically, with UNet-based models emerging as a standard choice [16], [17]. Notably, the UNet has also become a prevalent backbone for denoising tasks in generative modeling [18]–[20]. However, in computational imaging tasks (e.g., MRI, computed tomography, astronomical imaging), unrolled architectures, inspired by optimization algorithms, are typically preferred due to their ability to incorporate the measurement operator, \(\boldsymbol{A}\), and observed data [21]–[23]. Yet, the architecture of unrolled models strongly differs from state-of-the-art UNets used for low level tasks. In particular, empirical results [16], [17], [24], [25] suggest that effective architectures need not strictly adhere to optimization-derived operations. In both cases, however, even minor architectural modifications—such as adjusting the number of input and output channels—often require full model retraining. This reduces adaptability across datasets and inverse problems with similar statistical properties, such as transitioning from grayscale to color images or handling complex multi-channel representations [26].
In this work, we propose a novel neural network architecture for solving 1 , building on the DRUNet convolutional neural network [16]. Our key modification introduces a conditioning mechanism that integrates the acquisition physics \((\boldsymbol{A},p)\) through multigrid Krylov iterations. We train a single backbone model across multiple imaging tasks simultaneously ranging from image deblurring to MRI, handling datasets with grayscale, color, and complex images, corrupted by both Gaussian and Poisson noise. As a result, our architecture supports grayscale (1 channel), complex-valued (2 channels), and color images (3 channels), with only the output heads differing to accommodate specific channel numbers. Beyond its strong zero-shot performance, we demonstrate that the model can be efficiently fine-tuned in a self-supervised way on real measurement data.
Several architectures have established themselves as cornerstones of the low-level vision community. Early breakthroughs include the UNet [27], [28] and deep residual convolutional networks [29]–[31]. More recently, architectures incorporating attention mechanisms have surpassed previous state-of-the-art models [17], [25], [26]. Notable recent developments include transformer-based models, particularly in single-image super-resolution [25], [26], and, more recently, state-space models [32]. Despite these advances, the UNet remains a key backbone for many state-of-the-art methods, see e.g. [17], [25], which incorporate transformer blocks into UNet-like architectures.
To better condition on the acquisition physics, unrolled networks introduce learnable neural modules (e.g., convolutional layers) within the iterations of an optimization algorithm solving 2 . Several architectures have become standard in this framework, notably in natural image processing [31], [33]–[35], medical imaging [21], [23], and astronomical imaging [36]. However, these models have large memory footprints and are typically trained on a single task or a limited set of tasks with little variation. Moreover, there is a significant discrepancy between the overall architecture of these unrolled models and the highly effective state-of-the-art UNet-like architectures discussed above.
In many scientific and medical imaging applications, obtaining ground-truth data for supervised training is often expensive or even impossible [15]. Self-supervised methods enable training directly from measurements alone [37]–[39]. In this work, we show that self-supervision is highly effective for finetuning a foundation model without ground-truth references.
Recently, various papers have proposed image restoration models that can handle multiple tasks using the same underlying weights [40]–[42]. While these works share a similar goal of building foundation models, they differ from this paper in that they focus on (semi) blind image-to-image tasks, such as denoising, deraining, and dehazing. In contrast, the proposed model is designed to handle scientific and medical imaging tasks where the measurements are not necessarily in the image space and the observation model 1 is typically known, such as in MRI (k-space data) or CT (sinograms).
The proposed architecture, illustrated in 1, is derived from the DRUNet [16]. In this section, we detail the proposed architectural modifications to handle general inverse problems. First, we add a proximal estimation module casting the input into a Gaussian-noise-corrupted form. In subsequent layers, the conditioning is performed via Krylov subspace iteration on the feature maps. Finally, we adapt the architecture in order to handle a variety of noise models.
In order to solve 1 for a wide variety of inverse problems, a first step is to map \(\boldsymbol{y}\) onto the appropriate image domain. Several approaches are used in the literature, using either \(\boldsymbol{A}^\dagger \boldsymbol{y}\) or \(\boldsymbol{A}^\top \boldsymbol{y}\) as the input of the network. The first approach has the advantage of inverting the ill-conditioned \(\boldsymbol{A}\), turning the effective problem seen by the subsequent backbone layers into a Gaussian denoising problem. However, the pseudo-inverse operation is extremely sensitive to noise. In noisy settings, the second approach is often chosen, at the cost of a blurred input. Instead, letting \(f(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{y}\|_2^2\), our first estimation step consists of a proximal step \[\operatorname{prox}_{\lambda f}(\boldsymbol{A}^\top\! \boldsymbol{y}) = \mathop{\mathrm{arg\,min}}_{\boldsymbol{u}} \lambda\|\boldsymbol{A}\boldsymbol{u}-\boldsymbol{y}\|^2 + \|\boldsymbol{u}-\boldsymbol{A}^\top\!\boldsymbol{y}\|^2, \label{eq:prox_input}\tag{3}\] where \(\lambda>0\) is a regularization parameter. Equation 3 can be seen as a middle ground between \(\boldsymbol{A}^\top \boldsymbol{y}\) (when \(\lambda\) is small) and \(\boldsymbol{A}^\dagger \boldsymbol{y}\) (when \(\lambda\) is large). We set \(\lambda\) proportional to the input signal-to-noise ratio (SNR) as \({\lambda = \sigma\eta/\|\boldsymbol{y}\|_1}\), where \(\sigma\) is the Gaussian noise level and \(\eta\) is a learnable parameter.
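For illustration, the proximal step in 3 has the closed form \((\boldsymbol{I}+\lambda\boldsymbol{A}^\top\boldsymbol{A})^{-1}(1+\lambda)\boldsymbol{A}^\top\boldsymbol{y}\), which can be approximated with a few conjugate gradient iterations since the system matrix is symmetric positive definite. The sketch below assumes the operator and its adjoint are available as callables A and At; it is only illustrative and not the released implementation.

```python
import torch

def prox_input(y, A, At, lam, n_iter=20, tol=1e-6):
    """Approximate prox_{lam*f}(A^T y) with f(x) = 1/2 ||Ax - y||^2 by solving
    (I + lam * A^T A) u = (1 + lam) * A^T y with conjugate gradient.
    A and At are callables applying the forward operator and its adjoint."""
    b = (1.0 + lam) * At(y)                # right-hand side of the normal equations
    H = lambda u: u + lam * At(A(u))       # SPD system operator I + lam * A^T A
    u = torch.zeros_like(b)
    r = b - H(u)
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(n_iter):
        Hp = H(p)
        alpha = rs / (p * Hp).sum()
        u = u + alpha * p
        r = r - alpha * Hp
        rs_new = (r * r).sum()
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u
```

For denoising (\(\boldsymbol{A}=\boldsymbol{I}\)) this reduces to the identity on \(\boldsymbol{y}\), while for ill-conditioned operators a handful of iterations already yields a stable, partially inverted input.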
Given an intermediate estimate of the image \(\boldsymbol{x}^{\ell}\), unrolled network architectures condition on the acquisition physics \(\boldsymbol{A}\) via a gradient step on an \(L^2\) distance data-fidelity term, i.e. \[\boldsymbol{x}^{\ell+1} = \boldsymbol{x}^{\ell} - \gamma \boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{x}^{\ell}-\boldsymbol{y}),\] or via a proximal step, i.e. \[\boldsymbol{x}^{\ell+1} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{u}} \gamma \|\boldsymbol{A}\boldsymbol{u}-\boldsymbol{y}\|^2 + \|\boldsymbol{u}-\boldsymbol{x}^{\ell}\|^2,\] for some stepsize \(\gamma\). The proximal step is commonly solved using \(K\) iterations of conjugate gradient, yielding a solution in the Krylov subspace spanned by \(\{ ( \boldsymbol{I}+\gamma\boldsymbol{A}^{\top}\boldsymbol{A})^{k}(\gamma\boldsymbol{A}^{\top}\boldsymbol{y}+ \boldsymbol{x}^{\ell})\}_{k=0}^{K}\), and both steps can be written as \[\boldsymbol{x}^{\ell+1} = \sum_{k=0}^K \alpha_k (\boldsymbol{A}^{\top}\boldsymbol{A})^{k}\boldsymbol{x}^{\ell} + \beta_k (\boldsymbol{A}^{\top}\boldsymbol{A})^{k}\boldsymbol{A}^{\top}\boldsymbol{y},\] for some scalar coefficients \(\{(\alpha_k,\beta_k)\}_{k=0}^{K}\). Thus, we propose to condition on \(\boldsymbol{A}\) using a Krylov Subspace Module (KSM), which learns the linear combination coefficients \(\{(\alpha_k,\beta_k)\}_{k=0}^{K}\). Given an intermediate latent representation, a decoding module first maps the features onto the image domain using a convolution layer. Then, it stacks \(\{\Big((\boldsymbol{A}^{\top}\boldsymbol{A})^{k}\boldsymbol{x}^{\ell}, (\boldsymbol{A}^{\top}\boldsymbol{A})^{k}\boldsymbol{A}^{\top}\boldsymbol{y}\Big)\}_{k=0}^{K}\) along channels and combines them through a \(3\times 3\) convolution. The output is finally added to the latent space via a convolutional encoding module, see [fig:krylov,fig:ram_arch] for more details.
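The following is a minimal single-scale sketch of such a KSM block in PyTorch, assuming the normal operator \(\boldsymbol{A}^{\top}\boldsymbol{A}\) is given as a callable and \(\boldsymbol{A}^{\top}\boldsymbol{y}\) as a tensor; layer names and sizes are illustrative and do not reproduce the exact released architecture.

```python
import torch
import torch.nn as nn

class KSM(nn.Module):
    """Single-scale Krylov Subspace Module: mixes {(A^T A)^k x, (A^T A)^k A^T y}
    for k = 0..K with learned convolutions and adds the result back to the features."""

    def __init__(self, feat_channels, img_channels, K=3):
        super().__init__()
        self.K = K
        self.decode = nn.Conv2d(feat_channels, img_channels, 3, padding=1, bias=False)
        self.mix = nn.Conv2d(2 * (K + 1) * img_channels, img_channels, 3, padding=1, bias=False)
        self.encode = nn.Conv2d(img_channels, feat_channels, 3, padding=1, bias=False)

    def forward(self, feats, AtA, Aty):
        x = self.decode(feats)                  # map latent features to the image domain
        basis, u, v = [], x, Aty
        for _ in range(self.K + 1):             # build the Krylov images for k = 0..K
            basis += [u, v]
            u, v = AtA(u), AtA(v)
        z = self.mix(torch.cat(basis, dim=1))   # learned linear combination
        return feats + self.encode(z)           # residual update of the latent features
```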
For a fixed number of measurements \(m\), the finer the grid (i.e., larger \(n\)), the more ill-posed the inverse problem becomes, as the dimension of the nullspace of \(\boldsymbol{A}\) is lower bounded by \(n-m\). Thus, inspired by multigrid methods [43], we propose to condition our architecture on forward operators defined on coarse grids as: \[\boldsymbol{A}_s = \boldsymbol{A}\boldsymbol{U}_{s}\] where \(\boldsymbol{U}_s: \mathbb{R}^{\frac{n}{4^{s}}}\to \mathbb{R}^{n}\) is an upsampling operator (by a factor of \(2^s\) along each spatial dimension) with a Kaiser-windowed sinc antialias filter [44]. 2 illustrates that the linear pseudoinverse is unstable on fine grids but remains stable on coarse grids. We normalize all operators to have unit norm at each scale, i.e., \(\|\boldsymbol{A}_s\|_2=1\) for all \(s\). Moreover, for many inverse problems, we can develop efficient implementations of \(\boldsymbol{A}_s^{\top}\boldsymbol{A}_s\in \mathbb{R}^{\frac{n}{4^{s}}\times \frac{n}{4^{s}}}\) and \(\boldsymbol{A}_s^{\top}\boldsymbol{y}\in \mathbb{R}^{\frac{n}{4^{s}}}\) which can be computed fully on the coarse grid, avoiding any expensive fine-scale computations. For example, if \(\boldsymbol{A}\) represents a blur or an inpainting operation, we can simply downscale the kernel or the inpainting mask, respectively.
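As an illustration of the last point, one simple way to obtain coarse-grid surrogates for a blur or an inpainting operator is to downscale the kernel or the mask, as sketched below. The pooling-based downscaling is only an approximation of the exact \(\boldsymbol{A}\boldsymbol{U}_s\) composition, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def coarse_blur_kernel(kernel, s):
    """Approximate coarse-grid PSF: average-pool the kernel by 2^s per dimension
    and renormalize, so that A_s^T A_s can be applied entirely on the coarse grid."""
    k = F.avg_pool2d(kernel[None, None], kernel_size=2 ** s)
    return (k / k.sum())[0, 0]

def coarse_inpainting_mask(mask, s):
    """Coarse inpainting mask: a coarse pixel is observed if any fine pixel below it is."""
    m = mask[None, None].float()
    return (F.max_pool2d(m, kernel_size=2 ** s) > 0).float()[0, 0]
```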
As in the original DRUNet [16], [45], conditioning on the noise level is implemented by concatenating constant feature maps, filled with the noise level, along the channel dimension. While a single noise map was added for Gaussian denoising in [16], we extend this strategy to two maps, corresponding to the parameters \(\sigma\) and \(\gamma\) of the Poisson–Gaussian noise model \[\label{eq:PGmodel} \boldsymbol{y}= \gamma \boldsymbol{z}+ \sigma \boldsymbol{n},\tag{4}\] where \(\gamma \geq 0\) is the gain factor associated with the Poisson noise \(\boldsymbol{z}\sim \mathcal{P}(\boldsymbol{x}/\gamma)\), and \(\sigma \geq 0\) is the standard deviation of the Gaussian component of the noise, with \(\boldsymbol{n}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\). All biases are removed from our architecture to ensure scale equivariance with respect to both \(\sigma\) and \(\gamma\) components, enabling better generalization to unseen noise levels [16], [46].
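A minimal sketch of the Poisson-Gaussian measurement noise in 4 and of the corresponding two-map conditioning is given below; the helper names are illustrative.

```python
import torch

def poisson_gaussian(x, gain, sigma):
    """Sample y = gain * z + sigma * n with z ~ Poisson(x / gain) and n ~ N(0, I)."""
    z = torch.poisson(x.clamp(min=0) / gain)
    return gain * z + sigma * torch.randn_like(x)

def concat_noise_maps(x, sigma, gain):
    """Concatenate two constant feature maps (sigma and gain) along the channel dimension."""
    b, _, h, w = x.shape
    sigma_map = torch.full((b, 1, h, w), float(sigma), dtype=x.dtype, device=x.device)
    gain_map = torch.full((b, 1, h, w), float(gain), dtype=x.dtype, device=x.device)
    return torch.cat([x, sigma_map, gain_map], dim=1)
```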
The modifications needed to handle inputs with different channel numbers (color, grayscale, or complex images) involve only a small subset of the model parameters. Specifically, only the first (input) and last (output) convolutional layers of the DRUNet, as well as the encoding and decoding blocks within the KSM modules, are adjusted to account for varying channel numbers. All other weights of the model are shared across modalities.
We train our network simultaneously on \(G\) computational imaging tasks, spanning image restoration (deblurring, inpainting, Poisson-Gaussian denoising in both color and grayscale), single-coil MRI, CT, and others (see 6 for a complete list), leveraging the DeepInverse library [47]. Each task \(g\) is associated with a dataset \(\mathcal{D}_g\) = \(\{ \boldsymbol{x}_{i,g}\}_{i=1}^{N_g}\): LSDIR [48] for tasks involving natural images, LIDC-IDRI for CT images [49], and the fastMRI brain-multicoil dataset for MRI [24].
Furthermore, each task \(g\) can be framed as an inverse problem 1 for some \(\boldsymbol{A}\), and noise following the Poisson-Gaussian distribution in 4 with varying gain \(\gamma\geq 0\) and standard deviation \(\sigma\geq 0\). Our model is trained in a supervised fashion to minimize the following task-wise loss: \[\mathcal{L}_g(\boldsymbol{\theta}, \boldsymbol{x}_{i,g}) = \mathbb{E}_{(\sigma_g, \gamma_g)} \mathbb{E}_{\boldsymbol{y}|\boldsymbol{x}_{i,g}}\,\omega_g\|\operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{y}, \boldsymbol{A}_g, \sigma_g, \gamma_g)-\boldsymbol{x}_{i,g}\|_1\] where \(\operatorname{R}_{\boldsymbol{\theta}}\) is the model to be trained, \(\boldsymbol{A}_g\) is the measurement operator of task \(g\), and \((\sigma_g,\gamma_g)\) are the noise parameters sampled from a distribution \(p(\sigma,\gamma)\). We choose the \(\ell_1\) over the standard \(\ell_2\) loss as it was empirically observed to obtain better test performance [50]. We consider a weighting parameter \(\omega_g = \|\boldsymbol{A}_g^\top\boldsymbol{y}\|_2/\sigma_g\) to ensure the training loss is balanced across different noise levels and tasks. The final training loss is obtained by summing over all tasks: \[\mathcal{L}(\boldsymbol{\theta}) = \sum_{g=1}^G \sum_{i=1}^{N_g} \mathcal{L}_g(\boldsymbol{\theta}, \boldsymbol{x}_{i,g}).\] This formulation allows the network to generalize across multiple imaging modalities by learning from diverse measurement operators and noise distributions.
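A sketch of the per-task loss is given below, assuming a model R_theta(y, A, sigma, gamma) and a physics object exposing the adjoint, in the spirit of the DeepInverse interface; names and signatures are illustrative.

```python
import torch

def task_loss(model, x, y, physics, sigma, gain):
    """Per-task weighted L1 loss with omega_g = ||A_g^T y||_2 / sigma_g.
    `model` and `physics` follow an illustrative DeepInverse-style interface."""
    x_hat = model(y, physics, sigma, gain)                        # R_theta(y, A_g, sigma_g, gamma_g)
    omega = physics.A_adjoint(y).flatten(1).norm(dim=1) / sigma   # one weight per sample
    l1 = (x_hat - x).flatten(1).abs().mean(dim=1)                 # per-sample L1 error
    return (omega * l1).mean()
```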
For each of the training inverse problems, we extract a random image patch \(\boldsymbol{x}_{i,g}\) of size \((C, 128, 128)\) with \(C\in \{1,2,3\}\), to which we apply the measurement operator \(\boldsymbol{A}_g\). We use the pretrained weights of the DRUNet denoiser to initialize our model. The model is trained with a batch size of 16 per inverse problem considered and for a total of 200k steps. We use the Adam optimizer with a learning rate of \(10^{-4}\), which is divided by \(10\) after 180k steps.
Many real computational imaging applications have limited or no ground-truth reference data [15]. In such cases, the RAM model can be finetuned using measurement data alone, leveraging some of the recent advances in self-supervised learning for inverse problems [38], [51]–[54]. In particular, we can finetune our pretrained RAM model with a loss that only requires a dataset of noisy and/or incomplete observations \(\{\boldsymbol{y}_1,\dots,\boldsymbol{y}_N\}\): \[\label{eq:self_sup} \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \mathcal{L}_{\textrm{MC}}(\boldsymbol{\theta}, \boldsymbol{y}_i) + \omega \, \mathcal{L}_{\textrm{NULL}}(\boldsymbol{\theta}, \boldsymbol{y}_i),\tag{5}\] where \(\omega>0\) is a trade-off parameter. The first term, \(\mathcal{L}_{\textrm{MC}}\), enforces measurement consistency \(\boldsymbol{A}\hat{\boldsymbol{x}}\approx \boldsymbol{A}\boldsymbol{x}\) while accounting for the noise, and is chosen according to the knowledge about the noise distribution of the finetuning dataset, i.e., using SURE [55] for fully-specified noise models, and UNSURE [54] or splitting losses [37] in the case of unknown or misspecified noise models. The second term, \(\mathcal{L}_{\textrm{NULL}}\), is required if \(\boldsymbol{A}\) is non-invertible and handles the lack of information in the nullspace of \(\boldsymbol{A}\), by leveraging information from multiple forward operators [39] and/or enforcing equivariance to a group of transformations [38]. More details are provided in 11.1.
| Method | Params | CBSD68 Easy | CBSD68 Med. | CBSD68 Hard | Urban100 Easy | Urban100 Med. | Urban100 Hard | Div2K Easy | Div2K Med. | Div2K Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| DPIR | 32M | 33.45 | 26.53 | 24.42 | 34.39 | 28.12 | 24.38 | 36.32 | 29.97 | 27.16 |
| DiffPIR | 552M | 31.83 | 25.98 | 23.80 | 32.93 | 27.03 | 23.69 | - | - | - |
| DDRM\(^\text{*}\) | 552M | 33.10 | 27.97 | 25.41 | 34.19 | 29.65 | 26.05 | 35.19 | 30.57 | 27.70 |
| PDNet | 456K | 30.54 | 26.84 | 24.62 | 28.36 | 25.22 | 22.67 | 32.61 | 28.81 | 26.32 |
| Restormer | 26M | 30.96 | 26.99 | 25.00 | 28.21 | 24.75 | 23.13 | 28.56 | 25.75 | 25.00 |
| uDPIR/t | 32M | 33.78 | 28.20 | 25.61 | 33.07 | 28.73 | 25.42 | 35.79 | 30.76 | 27.91 |
| uDPIR/u | 256M | 33.64 | 27.90 | 25.54 | 32.72 | 28.14 | 25.28 | 35.38 | 30.17 | 27.77 |
| RAM | 36M | 34.04 | 28.22 | 25.64 | 33.61 | 28.63 | 25.30 | 36.18 | 30.72 | 27.89 |
| Method | CBSD68 Easy | CBSD68 Med. | CBSD68 Hard | Urban100 Easy | Urban100 Med. | Urban100 Hard | Div2K Easy | Div2K Med. | Div2K Hard |
|---|---|---|---|---|---|---|---|---|---|
| DPIR | 32.10 | 25.70 | 22.73 | 32.53 | 23.78 | 20.38 | 35.23 | 28.14 | 24.71 |
| DiffPIR | 30.93 | 24.96 | 22.51 | 31.17 | 23.63 | 20.32 | - | - | - |
| DDRM\(^\text{*}\) | 31.85 | 25.89 | 23.14 | 32.73 | 24.40 | 20.73 | 34.46 | 28.25 | 24.89 |
| PDNet | 29.88 | 25.26 | 22.55 | 27.85 | 22.81 | 19.90 | 32.01 | 27.29 | 24.03 |
| Restormer | 31.68 | 25.93 | 23.19 | 29.07 | 24.00 | 20.99 | 30.01 | 27.20 | 25.00 |
| uDPIR/t | 31.90 | 26.10 | 23.41 | 30.47 | 24.31 | 21.05 | 34.63 | 28.45 | 25.20 |
| uDPIR/u | 31.86 | 26.03 | 23.38 | 30.26 | 24.32 | 21.04 | 34.47 | 28.37 | 25.23 |
| RAM | 32.59 | 26.19 | 23.42 | 31.78 | 24.65 | 21.12 | 35.23 | 28.56 | 25.16 |
We evaluate our model’s performance alongside iterative reconstruction methods, unrolled models, and end-to-end architectures. First, we consider the Plug-and-Play model DPIR [33], which plugs a DRUNet, trained only for denoising, within an 8-step Half Quadratic Splitting (HQS) algorithm [56], thereby requiring approximately eight times as many FLOPs as our model (see 8 for details). We also compare with unrolled versions of DPIR (referred to as uDPIR [33]), also comprising eight HQS iterations, which we train on the same tasks, with the same loss function and dataset as our model. Specifically, we investigate both the weight-tied variant, where all eight iterations share the same DRUNet parameters, and the weight-untied variant, where each iteration has its own set of parameters, resulting in a model with eight times more parameters and FLOPs than RAM. Additionally, we compare our approach with the PDNet unrolled network [21], trained on the same tasks. Finally, we evaluate Restormer [57], a transformer-based model with a similar computational cost (in terms of FLOPs) to DRUNet, but trained separately on each restoration task. We also compare the proposed model with the diffusion-based DDRM algorithm [58], using the diffusion model backbone from [59]; see 10 for technical details on the adaptation of DDRM to the proposed setting.
We first evaluate the proposed method on tasks that are consistent with the training tasks.
Deblurring We evaluate two types of deblurring: Gaussian deblurring, using fixed Gaussian blur kernels, and motion deblurring, based on random motion blur kernels. For each, we define three difficulty levels (easy, medium, hard), determined by kernel characteristics and the Gaussian noise standard deviation. Details for each setup are provided in 9.2. 3 and 8 report visual and quantitative results. Among non-iterative methods, RAM and uDPIR-tied achieve the best performance, while uDPIR-untied underperforms despite its larger parameter count. Our method generally surpasses other end-to-end approaches, except for motion deblurring on Urban100. Compared to iterative methods, our approach is superior except at low noise levels (\(\sigma=0.01\)) on large images (Urban100, DIV2K).
MRI We evaluate on the fastMRI brain validation set using the single-coil accelerated acquisition procedure from [24] with acceleration factors \(4\) and \(8\). The problem is formulated as \(\boldsymbol{y}= \text{diag}(\boldsymbol{m})\boldsymbol{F}\boldsymbol{x}+ \sigma \boldsymbol{n}\), where \(\boldsymbol{m}\) is a binary mask, \(\boldsymbol{F}\) is the discrete Fourier transform, and \(\sigma=5\cdot 10^{-4}\). Quantitative results are reported in [tab:metrics_mri_ct] and visuals in 6. The proposed RAM model outperforms all baselines.
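For reference, a minimal simulation of this single-coil measurement model, with the complex image stored as a two-channel (real, imaginary) tensor, could read as follows; this is an illustrative sketch, not the fastMRI pipeline.

```python
import torch

def mri_single_coil(x, mask, sigma=5e-4):
    """Simulate y = diag(m) F x + sigma * n for a 2-channel (real, imaginary) image x
    of shape (B, 2, H, W); `mask` is a binary tensor broadcastable to the k-space shape."""
    x_c = torch.complex(x[:, 0], x[:, 1])               # (B, H, W) complex image
    kspace = torch.fft.fft2(x_c, norm="ortho")          # F x
    noise = sigma * (torch.randn_like(kspace.real) + 1j * torch.randn_like(kspace.real))
    return mask * kspace + noise
```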
CT We evaluate on computed tomography with Gaussian noise, modeled as \(\boldsymbol{y}= \boldsymbol{A}\boldsymbol{x}+ \sigma \boldsymbol{n}\), where \(\boldsymbol{A}\) is the Radon transform. The acquisition uses 51 angles, i.e., about \(10\%\) of the LIDC-IDRI image width, with \(\sigma = 10^{-4}\) as in training. Results are reported in the rightmost column of [tab:metrics_mri_ct] and visuals in 7. DPIR results are omitted due to instability on this task. The proposed RAM model produces reconstructions with finer details than uDPIR-tied.
Additional results for in-distribution tasks on Poisson-Gaussian denoising, motion blur, super-resolution, and inpainting are provided in 13.
We consider the Cartesian multi-coil MRI problem. In this setup, and assuming \(L\) coils, the signal measured by each coil \(\ell \in \{1,\cdots, L\}\) is given by \(\boldsymbol{y}_\ell = \text{diag}(\boldsymbol{m})\boldsymbol{F}\text{diag}(\boldsymbol{s}_\ell)\boldsymbol{x}+ \sigma \boldsymbol{n}_\ell\), where \((\boldsymbol{s}_\ell)_{1\leq \ell \leq L}\) are the sensitivity maps (or S-maps). We provide reconstruction metrics for the multi-coil setting in [tab:metrics_mri_ct] and associated visuals in 7. We provide comparisons with the baseline UNet from [24]. Both methods perform similarly, up to mild residual noise in the UNet reconstruction, which yields a lower PSNR despite visually similar results. We stress that since PDNet contains learnable layers acting in the measurement domain, the architecture would need to be retrained specifically for this setting; we therefore do not report its results.
While our model was trained on the CT problem with Gaussian noise, we propose to apply it to a CT problem degraded with Poisson noise only. As before, we consider a problem with 51 angles. Numerical results are provided in [tab:metrics_mri_ct] and visual results in 7. The proposed method is better at recovering textures.
Additional results for out-of-distribution tasks are provided in 13.
Real imaging problems may involve inverse problems or images that significantly differ from those seen at training time. In this case, finetuning with only a handful of measurements (i.e., without ground-truth data), sometimes a single one, substantially improves performance. We present results for three different tasks in 4 and 3. In all cases, finetuning with fewer than \(N=10\) measurements yields performance that significantly surpasses PnP and diffusion baselines. Moreover, finetuning takes a few minutes on a single consumer-grade GPU (see 7).
Compressed sensing (PSNR in dB):

| | DPIR+ | DRUNet rand (Sup.) | DRUNet rand (Self.) | DRUNet (Sup.) | DRUNet (Self.) | RAM (Sup.) | RAM (Self.) |
|---|---|---|---|---|---|---|---|
| zero-shot | 31.29 | 8.07 | | 10.23 | | 22.50 | |
| \(N=1\) | | 20.92 | 21.20 | 19.49 | 19.44 | 30.77 | 30.40 |
| \(N=10\) | | 22.59 | 22.00 | 25.99 | 24.74 | 32.52 | 32.29 |
| \(N=100\) | | 22.73 | 22.24 | 28.38 | 30.58 | 33.40 | 33.57 |

Demosaicing (PSNR in dB):

| | DPIR+ | DDRM | DRUNet rand (Sup.) | DRUNet rand (Self.) | DRUNet (Sup.) | DRUNet (Self.) | RAM (Sup.) | RAM (Self.) |
|---|---|---|---|---|---|---|---|---|
| zero-shot | 32.84 | 30.64 | 8.10 | | 10.73 | | 29.49 | |
| \(N=1\) | | | 23.14 | 21.28 | 25.82 | 25.57 | 34.46 | 33.89 |
| \(N=10\) | | | 24.85 | 21.85 | 33.14 | 31.38 | 35.03 | 34.73 |
| \(N=100\) | | | 25.84 | 23.22 | 34.06 | 34.56 | 35.26 | 35.10 |
Compressed sensing on Sentinel-2 data: on 100 Sentinel-2 RGB images (\(128\times128\), 10m resolution), we set \(\boldsymbol{A}=\boldsymbol{S} \text{diag}(\boldsymbol{m})\) with \(\boldsymbol{m}\) a random \({\pm1}\) mask and \(\boldsymbol{S}\) a subsampled sine transform (factor 4). With Gaussian noise \(\sigma=0.05\), RAM finetuning achieves strong results from a single image, clearly surpassing DRUNet.
Cryo-electron microscopy: we use 5 noisy Cryo-EM micrographs (\(7676 \times 7420\)) from Topaz-EM [60], which have very low SNR. Since the noise distribution is unknown, we finetune with a splitting loss.
Low-photon imaging: we use 9 LinoSPAD images [61] (\(256\times 256\)). While noise is generally assumed to be Poisson in the low-flux case, these images were acquired in high-flux conditions and follow an unknown discrete noise distribution. Due to missing pixel lines, recovery amounts to inpainting and denoising. We finetune with splitting and multi-operator losses (removing lines at random). Further implementation details are provided in 11.2.
Architectural ablations 5 shows training PSNRs obtained on all training tasks after a fixed budget of 10K iterations for various architectural variations. We first note that introducing a proximal input block leads to a substantial improvement (+0.8dB). Conditioning the inner feature maps on the observation \(\boldsymbol{y}\) provides a smaller but consistent improvement (+0.1dB). The largest performance boost (+0.9dB) comes from incorporating our Krylov blocks, which apply \(\boldsymbol{A}^\top \boldsymbol{A}\) operations to the feature maps. This highlights that mimicking the core operations of iterative optimization methods within the architecture yields significant benefits. Interestingly, our model surpasses unrolled architectures that are explicitly derived from optimization algorithms.
Single vs multitask training 5 shows that, given appropriate conditioning on the measurement operator, bias towards specific training tasks is negligible: the model trained on all tasks performs comparably to (though slightly below) models trained on a single task. The proposed reweighting using the SNR of the considered problem limits bias towards harder tasks.
| Configuration | PSNR (dB) | #Params |
|---|---|---|
| base (DRUNet backbone) | 25.83 | 32.6M |
| base + prox | 26.64 | 32.6M |
| base + prox + embed \(y\) | 26.71 | 33.4M |
| base + prox + embed \(y\) + Krylov (RAM) | 27.61 | 35.6M |
| Training tasks | Inpainting | Deblurring | SR (\(\times\)2) |
|---|---|---|---|
| Inpainting only | 30.84 | 23.03 | 26.17 |
| Deblurring only | 14.06 | 25.62 | 28.02 |
| SR\(\times\)2, \(\sigma=0\) only | n/a | 21.25 | 28.60 |
| All three tasks | 30.73 | 25.58 | 29.79 |
Uncertainty quantification We can use existing uncertainty quantification (UQ) algorithms that only require a general reconstruction function. 5 shows estimated errors using the recent equivariant bootstrap technique [62] (see 15) on a noisy image inpainting task (DIV2K validation images with \(\sigma=0.02\) and 50% observed pixels). More details about the bootstrapping algorithm and additional results are included in 14. As shown in 16, our algorithm obtains better-calibrated uncertainty estimates than DDRM, while requiring \(100\times\) fewer network evaluations.
Limitations Training the RAM model requires a fair amount of GPU resources, which might not be available to all practitioners. Nevertheless, our model can be finetuned on a single mid-sized GPU, obtaining competitive results with a few optimization steps on small finetuning datasets. We focus on low-distortion reconstructions, in contrast to the higher perceptual quality and higher distortion of diffusion methods [12]. However, reconstructors such as RAM can be used within sampling algorithms, e.g. [63], [64].
We present a new lightweight foundation model for computational imaging that achieves state-of-the-art performance on a wide range of problems, outperforming PnP methods while matching unrolled networks that require \(8\times\) more compute and parameters. Our results demonstrate competitive performance without following the common practice of unrolling an optimization algorithm, resulting in a faster and lighter reconstruction network.
Our model displays transfer capabilities, as it can be finetuned on new datasets or imaging problems with small datasets (as small as a single image) in a fully self-supervised way, i.e., without any ground-truth references. These results challenge the idea that a reconstruction network has to be specialized to a specific imaging task, showing good reconstructions across a wide variety of imaging modalities with a relatively lightweight network (36M parameters). Our results suggest that imaging tasks that might seem very different (e.g. cryo-EM denoising and demosaicing of satellite images) share significant common structure and can be tackled with the same model.
We believe this work paves the way for a new approach to imaging, where most of the effort can be put into developing strong base models and robust self-supervised finetuning techniques.
We use the LSDIR dataset [48] for training on natural imaging tasks, consisting of 84,991 high-quality images. We split the images into 512\(\times\)512 non-overlapping patches for faster data loading. Beyond this, no additional pre-processing is performed on the dataset.
We perform virtual coil combination on the raw k-space data from the brain-multicoil fastMRI dataset [24]. We use the resulting 70,748 complex images from the training set for training our model, and the 21,842 slices from the validation set are reserved for validation.
We use the LIDC-IDRI dataset [49] as a basis for our computed tomography experiments, containing 244,526 chest slices. We normalize the scans to Hounsfield units (HU) using the rescale slope and intercept provided in the DICOM files. Data is then clipped to the lung window [-1200, 800] HU and rescaled to [0, 1]. We perform random extraction of slices from the dataset to provide a (95%, 4%, 1%) split of (training, validation, test).
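A sketch of this normalization, assuming the slice, rescale slope, and intercept have been read from the DICOM file, is given below (function name is ours):

```python
import numpy as np

def normalize_ct_slice(pixel_array, slope, intercept, window=(-1200.0, 800.0)):
    """Rescale a CT slice to Hounsfield units, clip to the lung window, and map to [0, 1]."""
    hu = pixel_array.astype(np.float32) * slope + intercept
    hu = np.clip(hu, window[0], window[1])
    return (hu - window[0]) / (window[1] - window[0])
```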
We provide in 6 a detailed list of all datasets and inverse problems considered for the training of RAM.
| Channels | Task | Dataset | Configuration | Gaussian \(\sigma_{\min}\) | Gaussian \(\sigma_{\max}\) | Poisson \(\gamma_{\min}\) | Poisson \(\gamma_{\max}\) |
|---|---|---|---|---|---|---|---|
| 1 (grayscale) | Denoising | LSDIR | - | 0.001 | 0.2 | 0.01 | 1.0 |
| 1 (grayscale) | Deblurring | LSDIR | Random motion & Gaussian kernels (31\(\times\)31) | 0.001 | 0.2 | - | - |
| 1 (grayscale) | Inpainting | LSDIR | Bernoulli masks: \(p \sim \mathcal{U}(0.3, 0.9)\) | 0.01 | 0.2 | 0.01 | 1.0 |
| 1 (grayscale) | SR\(\times\)2,4 | LSDIR | Bicubic/bilinear downsampling (\(\times\)2, \(\times\)4) | 0.001 | 0.01 | - | - |
| 1 (grayscale) | Tomography | LIDC-IDRI | Radon transform, 10 projection angles | - | 0.01 | - | - |
| 3 (color) | Denoising | LSDIR | - | 0.001 | 0.2 | 0.01 | 1.0 |
| 3 (color) | Deblurring | LSDIR | Random motion & Gaussian kernels (31\(\times\)31) | 0.01 | 0.2 | - | - |
| 3 (color) | Inpainting | LSDIR | Bernoulli masks: \(p \sim \mathcal{U}(0.3, 0.9)\) | 0.01 | 0.2 | 0.01 | 1.0 |
| 3 (color) | SR\(\times\)2,4 | LSDIR | Bicubic/bilinear downsampling (\(\times\)2, \(\times\)4) | 0.001 | 0.01 | - | - |
| 3 (color) | Pan-sharpening | LSDIR | Flat spectral response | 0.01 | 0.1 | - | - |
| 2 (complex) | MRI | FastMRI | Acceleration masks (\(\times\)4, \(\times\)8 undersampling) | 0.001\(^*\) | 0.1\(^*\) | - | - |
| 2 (complex) | Denoising | FastMRI | - | 0.001\(^*\) | 0.1\(^*\) | - | - |
| 2 (complex) | Inpainting | FastMRI | Bernoulli masks: \(p \sim \mathcal{U}(0.1, 0.9)\) | 0.01\(^*\) | 0.1\(^*\) | - | - |
The deblurring inverse problem is \(\boldsymbol{y}= \boldsymbol{k}*\boldsymbol{x}+\sigma\boldsymbol{n}\), where \(\boldsymbol{k}\) is some blur kernel and \(\boldsymbol{n}\sim\mathcal{N}(\boldsymbol{0}, I)\). We use a “valid” padding strategy, i.e., the image is not padded before convolving with the blur kernel. In each case, we establish 3 types of problems (easy, medium and hard) corresponding to different kernel lengths and noise standard deviations.
We generate random motion blur kernels as random Gaussian processes following [47], [65], of \(31\times 31\) pixels. In this case, the difficulty of the deblurring problem is defined by the length scale of the blur trajectories \(\ell\), the standard deviation of the Gaussian processes \(s\) generating the PSF, and the standard deviation \(\sigma\) of the additive Gaussian noise. The tuple \(\{\ell, s, \sigma\}\) is set to \(\{0.1, 0.1, 0.01\}\), \(\{0.6, 0.5, 0.05\}\), \(\{1.2, 1.0, 0.1\}\) for the easy, medium and hard settings respectively.
We generate fixed blur kernels with a PSF of size \(31\times 31\); the difficulty of the problem is defined by the standard deviation of the (Gaussian) blur kernel \(\sigma_{\text{blur}}\) and the standard deviation \(\sigma\) of the additive Gaussian noise. The tuple \(\{\sigma_{\text{blur}}, \sigma\}\) is set to \(\{1.0, 0.01\}\), \(\{2.0, 0.05\}\), \(\{4.0, 0.1\}\) for the easy, medium and hard settings respectively.
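For completeness, such an isotropic Gaussian PSF can be generated as follows (an illustrative helper, not the training code):

```python
import torch

def gaussian_psf(size=31, std=1.0):
    """Isotropic Gaussian PSF of shape (size, size), normalized to sum to one."""
    ax = torch.arange(size) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * std ** 2))
    return k / k.sum()
```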
The DDRM algorithm builds on a denoising diffusion UNet backbone, following the architecture from [59], as in [58]. Due to its large receptive field, this convolutional UNet requires sufficiently large input sizes, but memory constraints make resolutions beyond 512\(\times\)512 impractical. To address this, we use circular padding to align inputs with the \(2^5\) minimum size constraint, and process larger images by dividing them into non-overlapping 512\(\times\)512 patches.
In this section, we review the possible choices for \(\mathcal{L}_{\textrm{MC}}\) and \(\mathcal{L}_{\textrm{NULL}}\) that can be used in 5 .
\(\mathcal{L}_{\textrm{MC}}\) can be selected based on prior knowledge of the noise distribution [54]. If the noise distribution is known exactly, Stein’s Unbiased Risk Estimate (SURE) [55] is used. For the case of Gaussian noise of level \(\sigma\), SURE is defined as \[\mathcal{L}_{\textrm{SURE}}(\boldsymbol{\theta}, \boldsymbol{y}_i) = \|\boldsymbol{A}_i\operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{y}_i,\boldsymbol{A}_i)-\boldsymbol{y}_i\|_2^2 + 2\sigma_i^2 \text{div} (\boldsymbol{A}_i\circ \operatorname{R}_{\boldsymbol{\theta}})(\boldsymbol{y}_i, \boldsymbol{A}_i),\] where the divergence is approximated using a Monte Carlo method [66]. Extensions of SURE to unknown \(\sigma\) [54] and other noise distributions also exist [67]. If the noise distribution is only assumed to be separable across measurements, we can use a splitting loss \[\mathcal{L}_{\textrm{SPLIT}}(\boldsymbol{\theta}, \boldsymbol{y}_i) = \mathbb{E}_{\boldsymbol{m}}\| (\boldsymbol{I}-\text{diag}(\boldsymbol{m}))\left(\boldsymbol{A}_i\operatorname{R}_{\boldsymbol{\theta}}(\text{diag}(\boldsymbol{m})\boldsymbol{y}_i,\text{diag}(\boldsymbol{m})\boldsymbol{A}_i) - \boldsymbol{y}_i\right) \|_2^2,\] where \(\boldsymbol{m}\) is a splitting mask sampled from a Bernoulli random variable or using problem-specific strategies (e.g., Neighbor2Neighbor [52], SSDU [68], etc.).
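A minimal sketch of the Gaussian SURE loss with a single-probe Monte Carlo estimate of the divergence term is given below; the model and physics interfaces are illustrative placeholders.

```python
import torch

def sure_gaussian(model, y, physics, sigma, eps=1e-3):
    """Gaussian SURE loss with a one-sample Monte Carlo estimate of the divergence term."""
    f = lambda u: physics.A(model(u, physics))                     # u -> A R_theta(u, A)
    fy = f(y)
    mc = ((fy - y) ** 2).flatten(1).sum(dim=1)                     # ||A R(y) - y||^2
    b = torch.randn_like(y)                                        # random probe vector
    div = (b * (f(y + eps * b) - fy) / eps).flatten(1).sum(dim=1)  # div(A o R)(y) estimate
    return (mc + 2 * sigma ** 2 * div).mean()
```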
If the finetuning dataset is observed via a single operator \(\boldsymbol{A}_i=\boldsymbol{A}\) for all \(i=1,\dots,N\), we choose \(\mathcal{L}_{\textrm{NULL}}\) as the Equivariant Imaging (EI) loss [38] which enforces equivariance to a group of transformations \(\{\boldsymbol{T}_r\}_{r\in\mathcal{R}}\) such as rotations or shifts with \[\mathcal{L}_{\textrm{EI}}(\boldsymbol{\theta}) = \mathbb{E}_{r\sim \mathcal{R}}\| \boldsymbol{T}_r\hat{\boldsymbol{x}}_{i,\boldsymbol{\theta}} - \operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{A}\boldsymbol{T}_r\hat{\boldsymbol{x}}_{i,\boldsymbol{\theta}}, \boldsymbol{A}) \|_2^2,\] where \(\hat{\boldsymbol{x}}_{i,\boldsymbol{\theta}}=\operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{y}_i,\boldsymbol{A}_i)\) is the reconstructed image. To learn in the nullspace of \(\boldsymbol{A}\), the group of transformations should be chosen such that \(\boldsymbol{A}\) is not equivariant. If the finetuning dataset is associated with multiple operators \(\{\boldsymbol{A}_r\}_{r\in\mathcal{R}}\), we choose \(\mathcal{L}_{\textrm{NULL}}\) as the Multi Operator Imaging (MOI) loss [53], i.e., \[\mathcal{L}_{\textrm{MOI}}(\boldsymbol{\theta}) = \mathbb{E}_{r\sim \mathcal{R}}\| \hat{\boldsymbol{x}}_{i,\boldsymbol{\theta}} - \operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{A}_r\hat{\boldsymbol{x}}_{i,\boldsymbol{\theta}},\boldsymbol{A}_r) \|_2^2,\] where, as before, \(\hat{\boldsymbol{x}}_{i,\boldsymbol{\theta}}=\operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{y}_i,\boldsymbol{A}_i)\); this enforces consistency across operators, \(\operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{A}_r \boldsymbol{x}, \boldsymbol{A}_r) \approx \operatorname{R}_{\boldsymbol{\theta}}(\boldsymbol{A}_s \boldsymbol{x}, \boldsymbol{A}_s)\) for all pairs of operators.
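A corresponding sketch of the EI loss is shown below, with the transform passed as a callable (e.g., a random shift); the MOI variant would instead re-measure the estimate with a different operator. Interfaces are illustrative.

```python
import torch

def ei_loss(model, y, physics, transform):
    """Equivariant Imaging loss: the reconstruction of the re-measured, transformed estimate
    should match the transformed estimate. The MOI variant re-measures the estimate with a
    different operator A_r and compares the result with the original estimate."""
    x_hat = model(y, physics)            # R_theta(y, A)
    x_t = transform(x_hat)               # T_r x_hat (random shift / rotation)
    y_t = physics.A(x_t)                 # A T_r x_hat
    return ((x_t - model(y_t, physics)) ** 2).mean()
```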
In all finetuning experiments, we use the Adam optimizer with the same parameters used at train time (see 4.1). In all experiments and baselines, we choose the network checkpoint that obtains the best performance, using ground-truth for the Sentinel experiments and visual inspection for the SPAD and Cryo-EM data. The trade-off parameter in 5 is set as \(\omega=0.1\) in all experiments where a \(\mathcal{L}_{\textrm{NULL}}\) loss is used. Shifts up to \(10\%\) of the image width and height are considered in the EI loss. 4 shows reconstructions with finetuning on a single measurement on a compressed sensing problem, and [fig:cryo-em-appendix,fig:spad-appendix] show more reconstructions for the LinoSPAD and Cryo-EM datasets, respectively.
| Dataset | Compressed Sensing | Demosaicing |
|---|---|---|
| \(1\) image | 43 sec | 60 sec |
| \(10\) images | 80 sec | 107 sec |
| \(100\) images | 228 sec | 267 sec |
We use a dataset of 200 images of \(128\times 128\) pixels of the coast of the United Kingdom taken by the Sentinel-2 L1C satellite with 10-meter resolution and minimal cloud coverage, keeping only the RGB bands. The dataset is split into 100 images for training and 100 for testing.
Compressed sensing: we set \(\boldsymbol{A}= \boldsymbol{S}\text{diag}(\boldsymbol{m})\) where \(\boldsymbol{m}\) is a random mask with values in {-1,+1}, and \(\boldsymbol{S}\in\mathbb{R}^{m\times n}\) is the discrete sine transform with output randomly subsampled by a factor of 4.
Demosaicing: measurements are generated by applying a Bayer pattern, keeping a single band per pixel. In both cases, we consider Gaussian noise with \(\sigma=0.05\), and finetune with the \(\ell_2\) loss for the supervised case, and with \(\mathcal{L}_{\textrm{SURE}}\) and \(\mathcal{L}_{\textrm{EI}}\) (using shifts) for the self-supervised setting. As shown in 3 and 4, finetuning RAM requires only a single image to obtain good results, significantly outperforming the DRUNet baselines. As shown in 7, self-supervised finetuning can be performed in a couple of minutes on a single mid-sized GPU.
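For illustration, the compressed sensing operator \(\boldsymbol{A}=\boldsymbol{S}\,\text{diag}(\boldsymbol{m})\) and a Bayer-type demosaicing mask can be sketched as follows; the transform is passed as a callable, and the RGGB layout of the Bayer pattern is an assumption on our part.

```python
import torch

def cs_forward(x, sign_mask, keep_idx, transform, sigma=0.05):
    """Compressed sensing A = S diag(m): apply the +/-1 mask, an orthogonal transform
    (passed as a callable), keep a random subset of coefficients, and add Gaussian noise."""
    z = transform(sign_mask * x).flatten(1)      # S diag(m) x, flattened per sample
    y = z[:, keep_idx]                           # random subsampling (factor 4)
    return y + sigma * torch.randn_like(y)

def bayer_mask(h, w):
    """(3, h, w) binary mask keeping a single color band per pixel (RGGB layout assumed)."""
    m = torch.zeros(3, h, w)
    m[0, 0::2, 0::2] = 1.0   # R
    m[1, 0::2, 1::2] = 1.0   # G
    m[1, 1::2, 0::2] = 1.0   # G
    m[2, 1::2, 1::2] = 1.0   # B
    return m
```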
We evaluate the model on 5 real Cryo-EM images of \(7676 \times 7420\) pixels provided by the Topaz-EM open-source library [60], whose noise distribution is unknown and which have very low SNR. We finetune our model with fixed \(\boldsymbol{A}=\boldsymbol{I}\), \(\sigma=0.98\), \(\gamma=0\) at the input, using \(256 \times 256\) crops normalized to have unitary standard deviation. Since the noise distribution is unknown and there is a trivial nullspace, we use \(\mathcal{L}_{\text{SPLIT}}\) for measurement consistency and \(\mathcal{L}_{\text{NULL}}=0\). Reconstructions are shown in 4.
We evaluate the RAM model on the LinoSPAD data provided by [61], which is corrupted by photon noise. While noise in SPADs is generally assumed to be Poisson in the very low-flux case, images acquired with higher flux can follow more complex discrete noise distributions. The LinoSPAD scans the image using epipolar scanning, with some lines of pixels removed due to faulty acquisition. Thus, the image recovery problem in this case is inpainting. The missing line pattern and noise model were not used during model training. We finetune the model using \(\mathcal{L}_{\textrm{SPLIT}}\) and \(\mathcal{L}_{\textrm{MOI}}\) (removing random subsets of lines). Reconstructions are shown in 4.
| PDNet | uDPIR-untied | uDPIR-tied | RAM | |
|---|---|---|---|---|
| GFLOPS | 60 | 2234 | 2234 | 360 |
| Test memory (BS=1) | 52 | 1213 | 298 | 354 |
| Train memory (BS=8) | 5815 | 37288 | 36374 | 9670 |
We provide in 6 visual results for MRI reconstructions on acceleration factors \(\times\)4 and \(\times\)8 respectively.
We provide in 8 results on the multi-coil MRI inverse problem where we simulate \(L=15\) coil maps. The UNet reconstruction appears less smooth, which penalizes its PSNR.
We provide in 7 visual results for computed tomography for the in-distribution setup on the top row (i.e., with a setup similar to that of training), and the out-of-distribution setup on the bottom row. In the latter case, measurements are degraded with additional Poisson noise, unlike the training setup.
Motion deblurring We provide additional visual results for the motion-blur (“hard”) task on DIV2K dataset samples in 9. We observe that the proposed method performs on par with uDPIR with tied weights and DDRM, although some mild visual differences may be noticed. See, for instance, the windows in the blue zoombox of the church image, which appear to show a repetitive pattern in DDRM that is not faithful to the ground-truth image.
Super-resolution We provide visual results for bicubic super-resolution with factor 4 and with additive Gaussian noise of standard deviation \(\sigma = 0.01\) on BSD68 dataset samples in Fig. 10. For this task, we also compare the proposed result with the state-of-the-art transformer-based SwinIR model [26] trained specifically for this degradation. We observe that the proposed method outperforms the DPIR algorithm, but underperforms compared to SwinIR.
Grayscale Poisson-Gaussian denoising We provide visual results for Poisson-Gaussian denoising on a grayscale BSD68 dataset sample in Fig. 11. We follow the formulation in 4. On both Poisson denoising and Gaussian denoising, we observe that the model is relatively robust to out-of-distribution parameters. In both cases, while the reconstruction quality degrades as the degradation level increases, low-frequency features are preserved in the reconstruction, and no substantial artefacts are introduced.
Image inpainting We provide visual results for inpainting with mild Poisson-Gaussian noise on a color BSD68 dataset sample in Fig. 12. We show results on inpainting masks applied either per-channel (top rows) or pixel-wise (bottom rows), for various masking levels. In both cases, Poisson-Gaussian noise is applied with \(\sigma = 0.01\) and \(\gamma = 0.01\). We observe that the model is more robust to out-of-distribution mask sparsity for channel-wise masks than for pixel-wise masks, where artifacts appear when fewer than \(20\%\) of pixels are observed.
We can obtain uncertainty estimates of the reconstructions obtained by RAM with the equivariant bootstrap algorithm [62] described in 15. This approach is similar to the standard parametric bootstrap [69], with the addition of a group of geometric transformations (set as random shifts, rotations, and flips in our experiments). The transforms are used to probe the confidence of the network on the reconstruction in the nullspace of the operator \(\boldsymbol{A}\).
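Our understanding of this procedure can be summarized by the following sketch: transformed versions of the reconstruction are re-measured, reconstructed, and mapped back, and the spread of the resulting samples serves as a pixelwise error estimate. The transforms interface and the RMS-style error summary are illustrative assumptions, not the exact algorithm of [62].

```python
import torch

def equivariant_bootstrap(model, y, physics, sample_transform, sigma, n_samples=100):
    """Bootstrap error estimate: re-measure transformed copies of the reconstruction,
    reconstruct them, undo the transform, and summarize the spread of the samples.
    `sample_transform()` returns a pair (T, T_inv) of a random geometric transform
    and its inverse (shift / rotation / flip)."""
    x_hat = model(y, physics)
    samples = []
    for _ in range(n_samples):
        T, T_inv = sample_transform()                   # random geometric transform
        y_boot = physics.A(T(x_hat))                    # simulate a new acquisition
        y_boot = y_boot + sigma * torch.randn_like(y_boot)
        samples.append(T_inv(model(y_boot, physics)))   # reconstruct and map back
    samples = torch.stack(samples)
    err = (samples - x_hat).pow(2).mean(dim=0).sqrt()   # pixelwise RMS deviation
    return x_hat, err
```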
We evaluate the uncertainty quantification algorithm on an inpainting task with \(50\%\) missing pixels and Gaussian noise with standard deviation \(\sigma=0.02\), using the DIV2K validation dataset. As a baseline, we use the DDRM diffusion method, running the diffusion multiple times to obtain a set of approximate posterior samples. We obtain 100 samples for each test image, which requires 101 network evaluations for equivariant bootstrap and 10,000 evaluations for DDRM (since each diffusion requires 100 denoiser evaluations). Estimated pixelwise errors, averaged over all color channels, are computed using \(\boldsymbol{x}_{\textrm{err}}\) of 15 and shown in [fig:UQ, fig:UQ_extra]. We can see in both figures that the estimated errors follow closely the true errors.
We also evaluate the accuracy of these confidence regions by calculating the empirical coverage probabilities, as measured by the proportion of ground-truth test images that lie within the confidence regions for a range of specified confidence levels between 0% and 100%. This method provides a quantitative metric of the estimated uncertainty intervals [70]. Results are shown in 16. The proposed algorithm obtains good coverage without any additional calibration step, while the diffusion method provides highly overconfident intervals (note this behavior of diffusion models has been observed in various previous works [62], [70]).