April 01, 2024
Phase recovery, calculating the phase of a light wave from its intensity measurements, is essential for various applications, such as coherent diffraction imaging, adaptive optics, and biomedical imaging. It enables the reconstruction of an object’s refractive index distribution or topography as well as the correction of imaging system aberrations. In recent years, deep learning has been proven to be highly effective in addressing phase recovery problems. Two main deep learning phase recovery strategies are data-driven (DD) with supervised learning mode and physics-driven (PD) with self-supervised learning mode. DD and PD achieve the same goal in different ways and lack the necessary study to reveal similarities and differences. Therefore, in this paper, we comprehensively compare these two deep learning phase recovery strategies in terms of time consumption, accuracy, generalization ability, ill-posedness adaptability, and prior capacity. What’s more, we propose a co-driven (CD) strategy of combining datasets and physics for the balance of high- and low-frequency information. The codes for DD, PD, and CD are publicly available at https://github.com/kqwang/DLPR.
*Kaiqiang Wang, ; Edmund Y. Lam,
2
Phase recovery refers to a class of methods that recover the phase of light waves from intensity measurements [1]. It is active in various fields of imaging and detection, such as in bioimaging for obtaining the refractive index or thickness distribution of tissues or cells [2], in adaptive optics for characterizing aberrant wavefront [3], in coherent diffraction imaging for detecting structural information of nanomolecules [4], and in material inspection for measuring surface profile [5].
Since optical detectors, such as charge-coupled device sensors, can only record the intensity/amplitude but lose the phase, one has to recover the phase from the recorded intensity indirectly. And precisely because of the loss of the phase, it is ill-posed to directly calculate the phase on the object plane from the only amplitude on the measurement plane through the forward physical model. On the one hand, the phase can be iteratively retrieved from intensity measurements with prior knowledge, i.e., phase retrieval [6]. On the other hand, by incorporating additional information, this problem can be transformed into a well-posed one and solved directly, such as holography or interferometry with reference light [7], [8], Shack-Hartmann wavefront sensing with micro-lens array [9], [10], and the transport of intensity equation with multiple through-focus amplitudes [11], [12].
In recent years, deep learning, with artificial neural networks as the carrier, has brought new solutions to phase recovery, where the core idea is to train neural networks to learn the mapping relationship from intensity measurements to the light wave phase [1], [13]. As illustrated in Fig. 1, the training of neural networks can be driven either supervised by paired datasets as an implicit prior (data-driven, DD) or self-supervised by physical models as an explicit prior (physics-driven, PD) [1].
Sinha et al. [14] first demonstrated DD phase recovery with paired diffraction-phase datasets, obtained by recording diffraction images of virtual phase objects loaded on a spatial light modulator. Subsequently, DD phase recovery was successively extended to in-line holography [15], coherent diffraction imaging [16], Fourier ptychography [17], off-axis holography [18], Shack-Hartmann wavefront sensing [19], transport of intensity equation [20], optical diffraction tomography [21], and electron diffractive imaging [22]. In addition, several studies focused on more efficient neural network structures for phase recovery, such as Bayesian neural network [23], generative adversarial network [24], Y-Net [25], [26], residual capsule network [27], recurrent neural network [28], Fourier imager network [29], [30], and neural architecture search [31]. Some studies also used data-driven methods for pre- or post-processing of phase recovery, such as defocus distance prediction [32], resolution enhancement [33], phase unwrapping [34], and classification [35], [36].
The idea of PD phase recovery was first introduced by Boominathan et al. [37] in their simulation work on Fourier ptychography. Wang et al. [38] first experimentally used PD to iteratively infer the phase of a phase-only object from its diffraction image directly on an untrained/initialized neural network. Afterward, it was subsequently extended to the cases of unknown defocus distances [39], dual wavelengths [40], and complex-valued amplitude objects [41], [42]. In the quest for faster inference times, PD and a large number of intensity measurements were used for neural network pre-training [41]–[45]. Further, refinement of pre-trained neural networks by PD achieved higher accuracy with lower inference time [46], [47].
DD and PD achieve the same goal in different ways and are being studied in different contexts to achieve efficient phase recovery. Therefore, it is necessary and meaningful to compare them under the same context. In this paper, we introduce the principles of DD and PD, and comparatively study them in terms of time consumption, accuracy, generalization ability, ill-posedness adaptability, and prior capacity. We also combine DD and PD as a co-driven (CD) strategy to train neural networks for high- and low-frequency information balance. What’s more, to facilitate readers to get started with deep learning phase recovery quickly, we release the demonstrations of DD, PD, and CD at https://github.com/kqwang/DLPR.
Here, we consider a classic phase recovery paradigm, recovering the phase or complex-valued amplitude of a light wave from its in-line hologram (diffraction pattern). For an object illuminated by a coherent plane wave, its hologram can be written as \[H = G(A, P) \label{eq:forward}\tag{1}\] where \(H\) is the hologram, \(A\) is the amplitude of light wave, \(P\) is the phase of light wave, and \(G(\cdot)\) is the forward propagation function, respectively. For a phase object, we assume \(A=1\). Then, the purpose of phase recovery is to formulate the inverse mapping of \(G(\cdot)\): \[P = G^{-1}(H) \label{eq:inverse}\tag{2}\]
With a supervised learning mode, DD trains neural networks with paired hologram-phase datasets \(S_{H-P}=\{(H_i, P_i),i=1,\ldots, N\}\) as an implicit prior to learn this inverse mapping [14]:
\[f_{\omega^{\ast}} = \mathop{\arg\min}\limits_{f_{\omega}} \sum_{i=1}^{N} \Vert f_{\omega}(H_{i})-P_{i} \Vert ^{2}_{2},\qquad \forall(H_i, P_i) \in S_{H-P}\] where \(\Vert \cdot \Vert ^{2}_{2}\) denotes the square of the \(\textit{l}_2\)-norm (or other distance functions) and \(f_w\) is a neural network with trainable parameters \(\omega\), like weights and biases. When the optimization is complete, the trained neural network \(f_{\omega^{\ast}}\) is used as an inverse mapper to infer the corresponding phase \(\hat{P}_x\) from its hologram \(H_x\) of an unseen object that is not in training dataset: \[\hat{P}_x=f_{\omega^{\ast}}(H_x)\]
A visual representation of DD can be seen in Fig. 2, in which holograms and phases are used as the input and ground truth (GT) of the neural network, respectively. The training dataset, collected through experiments or numerical simulations, typically contains paired data from thousands to hundreds of thousands. The training stage usually lasts for hours or even days but only takes one time. After that, the trained neural network quickly infers the phase of the unseen object after being fed its hologram.
For physical processes that can be well modeled, such as phase recovery, PD is another available strategy. With a self-supervised learning mode, PD uses a numerical propagation \(G(\cdot)\) as an explicit prior to drive the training or inference of neural networks (Fig. 3). Different from DD, which calculates the loss function in the phase domain, PD converts the network output from the phase domain to the hologram domain via numerical propagation and then calculates the loss function. This explicit prior can be utilized in three ways: untrained PD (uPD) [38], trained PD (tPD) [44], and tPD with refinement (tPDr) [46].
With driving of the explicit prior \(G(\cdot)\), uPD iteratively optimizes an untrained neural network \(f_{\omega}(\cdot)\) to infer the phase \(\hat{P}_x\) of an unseen object from its hologram \(H_x\) (Fig. 3a): \[\begin{align} &f_{\omega^{\ast}} = \mathop{\arg\min}\limits_{f_\omega} \Vert G(f_{\omega}(H_{x}))-H_x \Vert ^{2}_{2}\\ &\hat{P}_x = f_{\omega^{\ast}}(H_x) \end{align}\]
In tPD, the explicit prior \(G(\cdot)\) is employed to train an untrained neural network \(f_{\omega}(\cdot)\) with intensity-only training dataset \(S_H=\{(H_i),i=1,\ldots, N\}\) as input, and then the trained neural network \(f_{\omega^{\ast}}\) infers the phase \(\hat{P}_x\) of an unseen object from its hologram \(H_x\) (Fig. 3b): \[\begin{align} &f_{\omega^{\ast}} = \mathop{\arg\min}\limits_{f_\omega} \sum_{i=1}^{N} \Vert G(f_{\omega}(H_{i}))-H_{i} \Vert ^{2}_{2},\qquad \forall(H_i) \in S_{H} \\ &\hat{P}_x = f_{\omega^{\ast}}(H_x) \end{align}\]
In tPDr, the trained neural network \(f_{\omega^{\ast}}(\cdot)\) of tPD is iteratively fine-tuned on the hologram of the unseen object (Fig. 3c): \[\begin{align} &f_{\omega^{\ast\ast}} = \mathop{\arg\min}\limits_{f_\omega^{\ast}} \Vert G(f_{\omega^{\ast}}(H_{x}))-H_x \Vert ^{2}_{2} \\ &\hat{P}_x = f_{\omega^{\ast\ast}}(H_x) \end{align}\]
For the sake of clarity, we summarize DD, uPD, tPD, and tPDr according to their requirements for the physical model, the training dataset, the number of cycles needed for inference, and the learning mode in Table 1.
Strategy | Physics requirement | Dataset requirement | Inference cycles | Learning mode | |
---|---|---|---|---|---|
DD | No | Hologram-phase dataset | One time | Supervised | |
uPD | Numerical propagation | No | Multi times | self-supervised | |
tPD | Numerical propagation | Hologram-only dataset | One time | self-supervised | |
tPDr | Numerical propagation | Hologram-only dataset | Multi times | self-supervised |
To avoid unnecessary distraction factors, all datasets used for comparison are generated through numerical simulation based on ImageNet, LFW, and MNIST, see Appendix A. All methods use the same U-Net-based neural network, the specific structure of which is described in the Supplementary Material of Ref. . The implementation of the neural network is set uniformly, see Appendix B. The average peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) are used to quantify the inference accuracy.
In this section, ImageNet is used for dataset generation. We summarize the training settings and inference evaluation of DD, uPD, tPD, and tPDr in Table 2.
Strategy | training datasets | Inference cycles | Inference time | PSNR \(\uparrow\) | SSIM \(\uparrow\) |
---|---|---|---|---|---|
DD | 10,000 pairs | 1 | 0.02 seconds | 19.9 | 0.68 |
uPD | 0 | 10,000 | 800 seconds | 25.6 | 0.94 |
tPD | 10,000 pairs | 1 | 0.02 seconds | 18.5 | 0.69 |
tPDr | 10,000 inputs | 1,000 | 80 seconds | 25.1 | 0.93 |
In terms of time consumption, DD, tPD, and tPDr all require pre-training before inference, thus consuming hours or even more for neural network optimization, whereas uPD performs inference for the tested sample directly on an untrained neural network. During the inference stage of DD and tPD, the hologram of the tested sample passes through the trained neural network once in one second, while the inference process for uPD and tPDr takes several minutes for iteration.
As for the inference accuracy, the PSNR and SSIM of DD and tPD which do quick inference once after pre-training are basically the same, and both significantly lower than uPD and tPDr which do inference multiple times. Due to the prior knowledge introduced in the pre-training stage, the initial inference of tPDr is closer to the target solution, which makes it get the same accuracy with shorter inference cycles than uPD. Specifically, with comparable accuracy, the inference time of tPDr is one-tenth that of uPD.
Although having the same accuracy index (Table 2), the inference result of tPD shows better high-frequency detailed information while that of DD shows better low-frequency background information (Figure 4). According to the frequency principle, deep neural networks are more inclined to learn low-frequency information in the data [48]. DD learns the hologram-phase mapping relationship through the loss function in the phase domain, while PD uses numerical propagation to transfer it from the phase domain to the hologram domain. On the one hand, as shown in the white curve on the left side of Figure 4, the high-frequency phase information (steeper curve) is recorded in the diffraction fringes of the hologram which contains a more balanced high- and low-frequency information (smoother curve). This is more favorable for PD to learn those high-frequency phase information from the loss function in the hologram domain. On the other hand, the low-frequency phase causes only little contrast in the hologram, making it difficult for PD to learn low-frequency phase information, especially the plane background phase.
In order to balance the high- and low-frequency phase information learned by the neural network, we propose to use both dataset and physics for the neural network training, named CD. The loss function of CD is derived from the weighted sum of the data-driven term and physical-driven term: \[f_{\omega^{\ast}} = \mathop{\arg\min}\limits_{f_{\omega}} \sum_{i=1}^{N} \alpha \Vert f_{\omega}(H_{i})-P_{i} \Vert ^{2}_{2} + \Vert G(f_{\omega}(H_{i}))-H_{i} \Vert ^{2}_{2},\qquad \forall(H_i, P_i) \in S_{H-P}\] where \(\alpha\) is the weight used to control the contribution of the data-driven term and physical-driven term, which is set to 0.3. As illustrated in Figure 5, compared to the low-frequency-tendency DD and high-frequency-tendency tPD, CD takes into account both the high-frequency phase (see the blue box) and low-frequency phase (see the green box).
To compare the generalization ability of DD and tPD, ImageNet, LFW, and MNIST are used to generate datasets for neural network training and cross-inference, respectively. ImageNet represents dense samples, MNIST represents sparse samples, and LFW is somewhere in between. In Fig. 6, we show the cross-inference results and their absolute error maps of a sample from ImageNet, LFW, and MNIST, and attach the average SSIM on the testing dataset below each result.
Overall, the dataset is the main factor affecting the generalization ability of the trained neural network. Specifically, the neural networks trained by ImageNet and LFW generally perform better on all three testing datasets, while the neural networks trained by MNIST can only infer the overall distribution of ImageNet and LFW but lack detailed information. Admittedly, MNIST itself lacks detailed information, so it is reasonable that neural networks trained with it would not be able to fully infer detailed information about ImangeNet and LFW. In this extreme case, tPD is significantly better than DD, both in terms of inference results and SSIM. As can be seen in Fig. 6, tPD infers more detailed information than DD (marked by the green arrow). Another thing worth noting is that for the case of using neural networks trained by ImageNet and LFW to infer MNIST, although the inference results of both tPD and DD appear to be ideal, the SSIM of tPD is much lower than that of DD. As can be seen from the absolute error maps (marked by the yellow arrow), the error in the background part of tPD is relatively larger than that of DD, which confirms a conclusion in Sec. 3.1 that tPD is not good at low-frequency phase information, especially the plane background phase.
Let us consider a more ill-posed case of using a neural network to simultaneously infer phase and amplitude from a hologram. In dataset generation, ImageNet, LFW, and MNIST are used to get samples containing phase and amplitude respectively, and the corresponding holograms are calculated through numerical propagation. Given that the neural network needs to output both phase and amplitude, we modified the original U-Net by paralleling another up-sampling path to build a Y-Net [25]. The way tPD trains the neural network has not changed, except that there is an amplitude term in the loss function: \[\begin{align} &f^{P,A}_{\omega^{\ast}} = \mathop{\arg\min}\limits_{f^{P,A}_\omega} \Vert G(f^{P,A}_{\omega}(H_{x}))-H_x \Vert ^{2}_{2}\\ &\hat{P}_x, \hat{A}_x = f^{P,A}_{\omega^{\ast}}(H_x) \end{align}\] where \(f^{P,A}_\omega(\cdot)\) denotes the Y-Net that outputs phase and amplitude simultaneously. The loss function of DD is derived by weighted summation of the phase term and amplitude term:
\[\begin{align} &f^{P,A}_{\omega^{\ast}} = \mathop{\arg\min}\limits_{f^{P,A}_{\omega}} \sum_{i=1}^{N} \Vert f^{P}_{\omega}(H_{i})-P_{i} \Vert ^{2}_{2} + \beta \Vert f^{A}_{\omega}(H_{i})-A_{i} \Vert ^{2}_{2}\\ &\hat{P}_x, \hat{A}_x = f^{P,A}_{\omega^{\ast}}(H_x) \end{align}\] where \(f^{P}_\omega(\cdot)\) and \(f^{A}_\omega(\cdot)\) denote the phase path and amplitude path of Y-Net, respectively, \(\beta\) is the weight used to control the contribution of the phase term and amplitude term, which is set to 0.1.
The inference results of DD and tPD with single hologram input are shown in the blue part of Fig. 7. DD can infer the phase and amplitude at the same time, because the implicit mapping relationship from holograms to phase and amplitude is completely included in the paired dataset used for the network training. As for tPD, obvious artifacts appear in the inference results and its SSIM is reduced accordingly. This means that although there are many undesirable components in the inference result, the hologram corresponding to this non-ideal phase and amplitude matches the hologram of the sample. That is, the situation of using a hologram to infer both phase and amplitude simultaneously is severely ill-posed for tPD.
Here we show two solutions for this ill-posedness of tPD. For one thing, we introduce an aperture constraint in the sample plane to reduce the difficulty of tPD phase recovery [41]: \[\begin{align} &f^{P,A}_{\omega^{\ast}} = \mathop{\arg\min}\limits_{f^{P,A}_\omega} \Vert G(f^{P,A}_{\omega}(H_{x}))-H_x \Vert ^{2}_{2} + \Vert f^{A}_{\omega}(H_{x}) \cdot (1-C(r))-0_{N \times N} \Vert ^{2}_{2}\\ &\hat{P}_x, \hat{A}_x = f^{P,A}_{\omega^{\ast}}(H_x) \end{align}\] where \(C(r)\) is the aperture constraint with radius \(r\) which is set to 80 pixels, and \(0_{N \times N}\) denotes the zero matrix of size \(N \times N\) where \(N\) is set to 256. After introducing aperture constraints, the inference results of tPD for the three datasets are improved to varying degrees (see the red part of Fig. 7). MNIST has the largest improvement, followed by LFW, and ImageNet has such limited improvement. This means that the aperture constraint works well for simple cases with less information but can hardly deal with more difficult samples. For another thing to further reduce the ill-posedness of tPD, we introduce more prior knowledge by using multiple holograms with different defocus distances as network inputs [44]. In this case, the loss function contains three terms corresponding to different defocus distances: \[\begin{align} f^{P,A}_{\omega^{\ast}} = &\mathop{\arg\min}\limits_{f^{P,A}_\omega} \Vert G^{z_1}(f^{P,A}_{\omega}(H^{z_1}_{x}, H^{z_2}_{x}, H^{z_3}_{x}))-H^{z_1}_x \Vert ^{2}_{2}\\ & + \Vert G^{z_2}(f^{P,A}_{\omega}(H^{z_1}_{x}, H^{z_2}_{x}, H^{z_3}_{x}))-H^{z_2}_x \Vert ^{2}_{2} \\ & +\Vert G^{z_3}(f^{P,A}_{\omega}(H^{z_1}_{x}, H^{z_2}_{x}, H^{z_3}_{x}))-H^{z_3}_x \Vert ^{2}_{2} \\ \hat{P}_x, \hat{A}_x = &f^{P,A}_{\omega^{\ast}}(H^{z_1}_{x}, H^{z_2}_{x}, H^{z_3}_{x}) \end{align}\] where \(G^{z_1}(\cdot), G^{z_2}(\cdot), G^{z_3}(\cdot)\) donate the numerical propagation of different distances, and \(H^{z_1}_{x}, H^{z_2}_{x}, H^{z_3}_{x}\) donate holograms with different defocus distances, where \(z_1,z_2,z_3\) are set to 20mm, 40mm, and 60mm respectively. Compared to a single hologram input, two more holograms introduce sufficient prior knowledge for tPD, resulting in a significant improvement in the trained neural network, both for the simple MNIST and the complex LFW and ImageNet (see the yellow part of Fig. 7).
tPD uses numerical propagation as an explicit prior to train the neural network, so the neural network learns priors from numerical propagation. DD trains a neural network with paired datasets, which means that the neural network learns all implicit priors contained in the dataset even if it is outside the numerical propagation. For example, in the presence of imaging aberration, there will be both sample and aberration information in the hologram. Here, we use ImageNet as the sample phase and a random phase generated by the random matrix enlargement (RME) [34], [49] as the aberration phase to generate a dataset for the comparison of DD and tPD. The process of dataset generation and network training is shown in Fig. 8, where blue represents the dataset generation part, green represents the network training part of DD, and red represents the network training part of tPD.
We illuminate the inference results and absolute error maps of four samples in Figure 9. As expected, DD infers the sample phase while removing the imaging aberration phase, while the inference result of tPD includes both the sample phase and the aberration phase. Accordingly, the SSIM of DD is much higher than that of tPD. In DD, the hologram contains unwanted aberration information, but the ground truth only contains sample information, which means that the dataset implicitly contains both the prior for phase recovery and the prior for aberration removal. As for tPD, the prior for the network training is derived from numerical propagation, which allows both the sample information and the aberration information in the hologram to be recovered. It should be noted that the results of uPD also contain the unwanted aberration phase just like that of tPD.
We compare DD, tPD, CD, and uPD(tPDr) using experimental holograms of standard phase objects. This experimental hologram with a defocus distance of 8.78mm is from an open-source dataset of Ref . To match the defocus distance of the experimental hologram, we use ImageNet to generate corresponding datasets for the network training. Inference results for all methods are given in Fig. 10.
Overall, uPD and tPDr with multiple-times inferences have the best results, as seen from the neatly drawn peaks and valleys. It should be noted that due to the presence of redundant diffraction fringes at the edge of the hologram (see the green box in Fig. 10 (a)), unwanted fluctuations appear in the background of the uPD and tPDr inference results (see the green arrows in Fig. 10 (a)). Among the remaining one-time inference methods, the background fluctuations of the tPD results are larger (see the yellow arrows in Fig. 10 (a)), while the detailed information of the DD results is weaker (see the yellow arrows in Fig. 10 (b)). As a combination of DD and tPD, CD better considers detailed and background information. It should be noted that as the training dataset further expands, the neural network’s accuracy will increase accordingly.
We introduced the principles of DD and PD strategies for deep learning phase recovery in the same context. On this basis, we compared the time consumption and accuracy of DD, uPD, tPD, and tPDr, and found that uPD and tPDr achieve the highest accuracy with multiple inferences, tPD prefers the high-frequency detailed phase while DD favors the low-frequency background phase. Therefore, we proposed CD to balance high- and low-frequency information. Furthermore, we found that tPD generalizes better than DD for the case of inferring dense samples using neural networks trained on sparse samples. As for the case of inferring phase and amplitude simultaneously, we revealed the reason why DD is stronger than tPD, that is, the dataset for DD implicitly contains the mapping relationship from holograms to phase and amplitude while tPD may encounter situations where multiple network outputs phases and amplitudes correspond to a same hologram. To alleviate the ill-posedness of tPD, we proposed solutions by aperture constraints or multiple hologram inputs. In addition, we used the case of imaging aberration to demonstrate that DD can learn more about the prior implicit in the dataset whereas PD can only learn the prior in numerical propagation. Finally, we verified with experimental data that uPD and tPDr have the highest accuracy and that CD balances high- and low-frequency information better than DD and tPD.
Three publicly available image datasets (ImageNet, LFW, and MNIST) are used to generate phases and amplitudes, and then the corresponding holograms at a certain propagation distance are computed via numerical propagation. The training and testing datasets contain 10,000 and 100 data, respectively. The size of all data is set to \(256\times 256\). The propagation distance is set to 20 mm and 8.78 mm for the simulation comparisons and the experimental tests, respectively.
The Adam optimizer with an initial learning rate of 0.001 is adopted to update the weights and biases. The neural network training epoch of DD and PD is set to 100. The inference cycles of uPD and tPDr are set to 10,000 and 1000, respectively. All the neural networks are based on Pytorch (2.0.0) with Python (3.8.18). All operators run on a compute server equipped with AMD Ryzen Threadripper PRO 3955WX and NVIDIA GeForce RTX 3090.
The authors declare no competing interests.
The work was supported in part by the Research Grants Council of Hong Kong (GRF 17201620, GRF 17200321, RIF R7003-21).