Holo-VQVAE: VQ-VAE for phase-only holograms

Joohyun Park
KyungHee University
Korea, Republic of
james5450@khu.ac.kr


Hyeongyeop Kang
KyungHee University
Korea, Republic of
siamiz@khu.ac.kr


Abstract

Holography stands at the forefront of visual technology innovation, offering immersive, three-dimensional visualizations through the manipulation of light wave amplitude and phase. Contemporary research in hologram generation has predominantly focused on image-to-hologram conversion, producing holograms from existing images. These approaches, while effective, inherently limit the scope of innovation and creativity in hologram generation. In response to this limitation, we present Holo-VQVAE, a novel generative framework tailored for phase-only holograms (POHs). Holo-VQVAE leverages the architecture of Vector Quantized Variational AutoEncoders, enabling it to learn the complex distributions of POHs. Furthermore, it integrates the Angular Spectrum Method into the training process, facilitating learning in the image domain. This framework allows for the generation of unseen, diverse holographic content directly from its intricately learned latent space without requiring pre-existing images. This pioneering work paves the way for groundbreaking applications and methodologies in holographic content creation, opening a new era in the exploration of holographic content.

1 Introduction

Holography is an emerging technology that unveils a realm where the conventional boundaries of image capturing and representation are transcended. Traditional photography is confined to recording the amplitude of light waves, discarding three-dimensionality and presenting only a two-dimensional portrayal of the world. In contrast, holography encapsulates not just the amplitude but also the phase of light waves, thereby preserving the spatial information intrinsic to every scene.

Unfortunately, the pace of evolution in 3D holography is slower than that of 2D images due to technical complexity, expensive equipment, computational intensity, and material limitations [1]–[3]. To overcome this, many researchers begin with simpler forms of holography such as 2D phase-only holograms (POHs). They conduct experiments and test their hypotheses in the 2D domain to obtain a foundational understanding of light’s behavior and the principles of interference and diffraction [4], [5]. The gained understanding serves as a prerequisite for exploring the more complex domain of 3D holography. In a similar context, POHs are actively researched because they simplify data handling by keeping the amplitude of the wavefront constant and encoding the scene using only the phase information.

Recently, research in the image domain has expanded to include generative models for 2D images. Such models have transformative potential in synthesizing realistic and high-quality visuals, augmenting datasets, and enhancing various applications from art to healthcare. However, due to the complexity of holographic data, only a few attempts have been made to translate this success to the realm of holography. To learn phase data, generative models must process intricate patterns and dependencies between different phases, as well as phase wraps that cause abrupt changes and discontinuities in the phase data. Furthermore, developing appropriate metrics and loss functions that account for the unique characteristics of phase information is itself a challenge. These properties contrast with the simple, pixel-based intensity values commonly encountered in 2D images, making it challenging to train generative models for holography.

Previous research by Liu [5] demonstrates that a Variational AutoEncoder (VAE) trained in the phase domain managed to reconstruct the input POH but failed to generate novel samples. This suggests that conventional learning methods face challenges when the model is trained directly in the phase domain.

In this paper, we introduce Holo-VQVAE, a novel end-to-end generative model architecture for 2D POH generation. Inspired by HoloNet [4], we find that using images as input and computing the reconstruction loss in the image domain allows the model to learn the data distribution and sample novel POHs. Specifically, we apply the Angular Spectrum Method (ASM) to the generated POH to reconstruct an image, which is then compared with the input image to compute the reconstruction loss. Though the results generated using a standard VAE were promising, we find that adopting the discrete latent space of the Vector-Quantized Variational AutoEncoder (VQ-VAE) [6], [7] further improves generation quality. The discrete latent space mitigates the entanglement of the learned features, resulting in high-quality POH generation.

In summary, the contributions of this paper are as follows:

  • We introduce Holo-VQVAE, a pioneering end-to-end generative model architecture for POHs.

  • We facilitate the training process by integrating ASM propagation, enabling training in the image domain rather than the phase domain, and devise a loss function that balances noise and quality through an experimentally determined ratio of l2 loss to perceptual loss.

  • We employ the discrete latent space of the VQ-VAE to learn the complex distribution of POHs and confirm through experiments that it outperforms the continuous latent space of a standard VAE.

2 Related works

Holograms are fringe patterns that encapsulate the intricate interplay of light waves reflected from a scene. The hologram itself is a two-dimensional surface that contains information about the intensity and phase distribution of light waves, but the image it produces appears three-dimensional and can be viewed from different angles.

Holograms can be produced through two primary methods: optical generation and computational generation, with the latter referred to as Computer-Generated Holography (CGH). In recent years, CGH has gained increasing prominence over optical methods, primarily due to the stringent environmental conditions and precise control required for the generation of optical holograms.

In the idealized conception of CGH, both amplitude and phase play pivotal roles, collaboratively contributing to the multi-dimensional portrayal of scenes with enhanced depth and perspective. However, due to the technological complexities and the economic considerations of spatial light modulators (SLMs), a substantial corpus of holographic research often focuses on either amplitude-only holograms or POHs [8].

Among these, POHs have gained more research attention. This inclination is attributed to the inherent capacity of POHs to yield reconstructions marked by enhanced brightness and clarity [9], [10]. The process of generating a POH is intricately connected to phase retrieval, a technique that focuses on deducing the phase of input data from the magnitude of its Fourier transform [11]. Phase retrieval is a complex, non-linear, and ill-posed problem: inherent ambiguities and constraints make the retrieval of phase information challenging. For this reason, a plethora of algorithms exist in the realm of phase retrieval, each tailored to address specific aspects of these challenges.

Traditionally, algorithmic approaches have long been studied. They are inherently systematic, grounded in well-defined mathematical principles, and can be primarily classified into iterative and non-iterative algorithms. Iterative algorithms for phase retrieval involve a repetitive refinement process to achieve an accurate estimation of the phase [12]–[17]. On the other hand, non-iterative algorithms are characterized by their ability to provide rapid phase estimations, resulting in significantly reduced computation times [18]–[21]. However, algorithmic approaches are often hampered by issues related to computational intensity, convergence challenges, sensitivity to noise, and the expertise required for implementation.

Recently, the advent and proliferation of deep learning techniques have instigated notable advancements in the field of POH generation. Deep learning facilitates the rapid generation of high-quality POHs by learning intricate patterns of light wave interactions from vast datasets, leading to phase hologram generation that is not only accurate but also adaptable across diverse scenarios and conditions [4], [22]–[30].

The recent integration of deep learning in POH generation has seen significant strides, yet the incorporation of generative models has not been as pronounced. Generative models, characterized by their sophisticated latent space representations, offer advanced techniques for data manipulation [31]–[34]. The integration of these models into POH generation holds the potential to expand the horizon of possibilities in holography, paving the way for innovative applications and methodologies.

One relevant study is that of Liu [5], which introduces a generative approach for hologram generation named the channeled variational encoder. This approach assumes that the target hologram can be modeled as the product of the original hologram and a specific transformation function. While this method circumvents the complexities associated with mapping intricate phase data into a latent space, it is inherently limited to scenes that can be derived by applying such a transformation function. Furthermore, augmenting the model’s generative range is intrinsically tied to the volume of available data. Our model markedly deviates from Liu’s paradigm, embodying an enhanced capability to directly synthesize unseen POHs from a learned latent space.

3 Method

Figure 1: Overall architecture of Holo-VQVAE

The overall pipeline of our model, Holo-VQVAE, is illustrated in Figure 1. It is structured around two principal components: the encoder and the decoder. During the training phase, the encoder converts an input 2D image into a corresponding representation within the latent space. Subsequently, the decoder generates a POH from this latent representation. The reconstruction objective is computed in the image domain, minimizing the discrepancy between the original input image and the image reconstructed from the POH. The architecture of our model is based on the work by Esser et al. [35], which utilizes a discrete latent space [6] to effectively encapsulate the distinctive features of the data. This approach enables a condensed yet richly detailed encoding of the input, ensuring that the essential characteristics are efficiently represented within a compact latent space. Details of the model configuration are provided in the supplementary material.

3.1 Encoding

The input to the encoder is a 2D image \(I \in \mathbb{R}^{N \times N \times C}\), where \(N\) denotes the width and height of the image and \(C\) specifies its number of channels. The encoding process encompasses the convolution of input features and their subsequent downsampling to yield the latent vector \(z \in \mathbb{R}^{n \times n \times H}\), where \(n\) denotes its width and height and \(H\) specifies its number of channels.
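To make the tensor shapes concrete, below is a minimal PyTorch sketch of such a convolutional encoder; the channel widths, number of downsampling stages, and layer choices are illustrative assumptions, not the exact configuration reported in the supplementary material.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an image I of shape (B, C, N, N) to a continuous latent z of shape (B, H, n, n)."""
    def __init__(self, in_channels=3, latent_channels=64, num_downsamples=3):
        super().__init__()
        layers, ch = [nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU()], 32
        for _ in range(num_downsamples):                   # each stage halves the spatial size
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.ReLU()]
            ch *= 2
        layers += [nn.Conv2d(ch, latent_channels, 1)]      # project to H latent channels
        self.net = nn.Sequential(*layers)

    def forward(self, image):
        return self.net(image)

# Example: a 128x128 RGB image becomes a 16x16 latent map with H = 64 channels.
z = Encoder()(torch.randn(1, 3, 128, 128))
print(z.shape)  # torch.Size([1, 64, 16, 16])
```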

In the design of our Holo-VQVAE architecture, we have determined to use 2D images as the training input, with the output being a POH. This decision is underpinned by two primary considerations.

Firstly, the previous study by Liu [5] reported significant challenges associated with learning directly from phase data. The inherent complexity and high sensitivity to errors in phase information can severely impede the learning process, with minor inaccuracies often leading to substantially degraded reconstructions. By leveraging the image domain, we are able to exploit a wide range of well-established learning strategies, circumventing the complexities inherent to phase data and facilitating a more stable and effective training process.

Secondly, the scarcity of high-quality POH datasets poses a substantial barrier [36]. Of course, conversion models are available that can turn standard image datasets into POH datasets [4], [30]. However, the quality of the holographic data produced is not always comparable to that of the original images, and the conversion process can be time-consuming. Maintaining the quality of training data is crucial, as any reduction in its quality can negatively impact the performance of the trained model. Hence, we prioritize using datasets in their original form to guarantee the best possible training input for our Holo-VQVAE.

3.2 Vector quantization

The vector quantization process is a crucial step in VQ-VAEs, distinguishing them from standard VAEs by quantizing a continuous latent vector \(z\) into a discrete counterpart \(\hat{z}\). This is achieved using a codebook consisting of \(K\) representative vectors. For each pixel \(z_{uv}\) in the continuous latent space, where \(z_{uv} \in \mathbb{R}^{ 1 \times 1 \times H }\) and \(u,v\in n\), we measure its similarity to each codebook vector \(w_{k} \in \mathbb{R}^{ 1 \times 1 \times H }\), where \(k\in K\), by computing the squared Euclidean distance. The codebook vector \(w_{k}\) most similar to \(z_{uv}\) is then chosen as the quantized counterpart \(\hat{z}_{uv} \in \mathbb{R}^{ 1 \times 1 \times H }\). Mathematically, this selection process is defined as follows:

\[\hat{z}_{uv} = \underset{w_k}{\mathrm{argmin}} \;\| z_{uv} - w_k \|^2 \label{eq:quantization}\tag{1}\]

The quantization process is applied across all pixels, transforming the continuous latent vector \(z\) into its discrete equivalent \(\hat{z}\), thereby facilitating the generation of a discrete latent representation of the input data.
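A minimal sketch of this nearest-codebook lookup (Equation 1), assuming the latent is arranged channels-last as a \((B, n, n, H)\) tensor and the codebook as a \((K, H)\) matrix, might look as follows:

```python
import torch

def quantize(z, codebook):
    """z: (B, n, n, H) continuous latents; codebook: (K, H) representative vectors.
    Returns the quantized latents z_hat and the chosen codebook indices (Equation 1)."""
    flat = z.reshape(-1, z.shape[-1])                          # (B*n*n, H)
    # squared Euclidean distance between every latent pixel and every codebook vector
    dists = (flat.pow(2).sum(dim=1, keepdim=True)
             - 2.0 * flat @ codebook.t()
             + codebook.pow(2).sum(dim=1))                     # (B*n*n, K)
    indices = dists.argmin(dim=1)                              # argmin over the K codes
    z_hat = codebook[indices].reshape(z.shape)                 # look up the nearest w_k
    return z_hat, indices.reshape(z.shape[:-1])

codebook = torch.randn(512, 64)            # K = 512 codes of dimension H = 64 (assumed sizes)
z = torch.randn(1, 16, 16, 64)
z_hat, idx = quantize(z, codebook)
```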

3.3 Decoding and reconstruction

During the decoding stage, the quantized vector \(\hat{z}\) is processed through a series of convolutional layers and upsampling steps to construct a POH \(\phi \in \mathbb{R}^{ N \times N \times C }\).
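A corresponding decoder sketch is shown below; mirroring the encoder with transposed convolutions and squashing the raw output into \([-\pi, \pi]\) with a scaled tanh are illustrative assumptions of this sketch, not a specification taken from our implementation.

```python
import math
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps quantized latents z_hat of shape (B, H, n, n) to a POH phi of shape (B, C, N, N)."""
    def __init__(self, latent_channels=64, out_channels=3, num_upsamples=3):
        super().__init__()
        layers, ch = [nn.Conv2d(latent_channels, 256, 3, padding=1), nn.ReLU()], 256
        for _ in range(num_upsamples):                     # each stage doubles the spatial size
            layers += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1), nn.ReLU()]
            ch //= 2
        layers += [nn.Conv2d(ch, out_channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z_hat):
        # assumption: constrain the output to phase values in [-pi, pi]
        return math.pi * torch.tanh(self.net(z_hat))

phi = Decoder()(torch.randn(1, 64, 16, 16))   # -> phase-only hologram of shape (1, 3, 128, 128)
```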

The produced POHs encapsulate phase information in the form of intricate fringe patterns, which do not present a recognizable depiction of the original scene. To convert the patterns back into a visual representation of the scene, a numerical reconstruction process is essential. This typically involves simulating the wavefront propagation, effectively retracing the path of light as captured in the hologram.

For wavefront propagation, our model employs ASM [37], [38], a method recognized for its ability to emulate light wave propagation. The application of ASM to \(\phi\) is defined by the set of equations specified in Equation 2. These equations model the transformation of the wavefront as it propagates from the SLM plane through free space, experiencing diffraction, and finally arriving at the target plane, where the holographic image is formed:

\[\begin{aligned} \hat{\phi}_{c}(p_{d}) &= \iint F\left(e^{i\phi_{c}}\right) H(f_x, f_y, p_{d})\, e^{i2\pi(f_x x + f_y y)}\, df_x\, df_y, \\ H(f_x, f_y, p_{d}) &= \begin{cases} e^{i\frac{2\pi}{\lambda_{c}} \sqrt{1 - \lambda_{c}^2 (f_x^2 + f_y^2)}\, p_{d}}, & \text{if } \sqrt{f_x^2 + f_y^2} < \frac{1}{\lambda_{c}}, \\ 0, & \text{otherwise} \end{cases} \end{aligned} \label{eq:asm}\tag{2}\] where \(p_{d}\) denotes the propagation distance, \({\phi}_{c}\) denotes the \(c \in C\) channel of \({\phi}\), \(\hat{\phi}_{c}\) denotes the propagated wavefront of \({\phi}_{c}\) at distance \(p_{d}\), \(F\) denotes the Fourier Transform, \(\lambda_{c}\) denotes the wavelength corresponding to channel \(c\), \(x \in N\) and \(y \in N\) denote the spatial coordinates, and \(f_{x}\) and \(f_{y}\) denote the spatial frequency components of \({\phi}_{c}\) in the \(x\) and \(y\) directions. It is important to note that ASM is applied separately to each color channel, as the wavelength \(\lambda_{c}\) varies for different colors in the light spectrum.

The propagated wavefront \(\hat{\phi}\) contains both amplitude and phase information. Here, the phase information does not directly contribute to the perceived image but rather to the way light interferes to form that image. In contrast, the amplitude pattern is what needs to be visualized to represent the reconstructed image from a hologram. Therefore, it is essential to extract the amplitude pattern of \(\hat{\phi}\). The amplitude extraction is defined by the following equation:

\[\hat{I} = s \cdot |\hat{\phi}| \label{eq:intensity}\tag{3}\] where \(\hat{I}\) denotes the reconstructed image and \(s\) denotes a scaling hyperparameter that accounts for the gap between the output values of the wave propagation operator (ASM) and the desired target [4].
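For illustration, a single-channel FFT-based realization of Equations 2 and 3 could be sketched as follows; the discretization via `torch.fft`, the pixel pitch, and the example wavelength and distance values are assumptions made for this sketch rather than our exact propagation code.

```python
import math
import torch

def asm_propagate(phase, wavelength, distance, pixel_pitch):
    """Propagates the unit-amplitude wavefront e^{i*phase} over `distance` with the ASM.
    phase: (N, N) tensor holding one color channel of a POH."""
    N = phase.shape[-1]
    field = torch.exp(1j * phase)                              # phase-only wavefront at the SLM plane
    fx = torch.fft.fftfreq(N, d=pixel_pitch)                   # spatial frequencies
    fy = torch.fft.fftfreq(N, d=pixel_pitch)
    FX, FY = torch.meshgrid(fx, fy, indexing="xy")
    under_root = 1.0 - wavelength ** 2 * (FX ** 2 + FY ** 2)
    # transfer function H: evanescent components (the "otherwise" branch) are set to zero
    H = torch.where(
        under_root > 0,
        torch.exp(1j * (2 * math.pi / wavelength) * torch.sqrt(under_root.clamp(min=0.0)) * distance),
        torch.zeros((), dtype=torch.complex64),
    )
    return torch.fft.ifft2(torch.fft.fft2(field) * H)          # wavefront at the target plane

phase = torch.rand(64, 64) * 2 * math.pi                       # a toy single-channel POH
propagated = asm_propagate(phase, wavelength=520e-9, distance=15e-3, pixel_pitch=8e-6)
s = 0.95                                                       # scale hyperparameter of Equation 3
reconstructed = s * propagated.abs()                           # amplitude image I_hat
```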

3.4 Objective function

The Holo-VQVAE model’s objective function is articulated through a tripartite formulation, encompassing reconstruction loss (\(L_{recon}\)), codebook loss (\(L_{codebook}\)), and commitment loss (\(L_{commit}\)), each serving a distinct purpose in the training process.

The reconstruction loss, \(L_{recon}\), quantifies the fidelity of the reconstructed output \(\hat{I}\) relative to the original input \(I\). It comprises an \(l2\) loss and a perceptual loss \(L_{per}\) [39] between the images. The perceptual loss works by feeding both the input and the reconstructed image into a pretrained VGG16 network and comparing their feature maps extracted at designated layers. The layers for feature extraction are determined by the target features for comparison: early layers capture low-level features like edges and blobs, whereas deeper layers reveal high-level semantic features [40]. Our model employs features extracted from ReLU 2-2 and ReLU 4-3 for a comprehensive comparison that encompasses both low-level and high-level features.

Our reconstruction loss is computed as a weighted sum of the \(l2\) norm and the perceptual loss \(L_{per}\) [39], as delineated in Equation 4:

\[L_{recon} = \alpha \cdot \| I - \hat{I} \|_2^2 + \beta \cdot L_{per}(I, \hat{I}) \label{eq:recon95loss}\tag{4}\] where \(\alpha\) and \(\beta\) are the respective weights assigned to the \(l2\) norm and the perceptual loss components. We find that incorporating \(L_{per}\) mitigates noise artifacts within the generated POHs, a phenomenon further elaborated upon in Section 4.1.
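A sketch of this weighted loss is given below, assuming torchvision's pretrained VGG16 and its conventional layer indexing for ReLU 2-2 (feature index 8) and ReLU 4-3 (index 22); the input normalization details and the particular \(\alpha\), \(\beta\) values (here one way to realize the 9:1 ratio discussed in Section 4.1) are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    """Compares VGG16 feature maps at ReLU 2-2 (index 8) and ReLU 4-3 (index 22)."""
    def __init__(self):
        super().__init__()
        features = vgg16(pretrained=True).features.eval()
        for p in features.parameters():
            p.requires_grad_(False)
        # slice the network so each block ends right after the desired ReLU
        self.blocks = torch.nn.ModuleList([features[:9], features[9:23]])

    def forward(self, x, y):
        loss = 0.0
        for block in self.blocks:
            x, y = block(x), block(y)
            loss = loss + F.mse_loss(x, y)
        return loss

def reconstruction_loss(I, I_hat, perceptual, alpha=0.9, beta=0.1):
    """L_recon = alpha * l2 + beta * L_per (Equation 4)."""
    return alpha * F.mse_loss(I_hat, I) + beta * perceptual(I_hat, I)
```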

The codebook loss, \(L_{codebook}\), and the commitment loss, \(L_{commit}\), are integral to the VQ-VAE architecture, ensuring the encoder’s outputs maintain a coherent relationship with the codebook vectors. \(L_{codebook}\), as specified in Equation 5, updates the codebook vectors to better approximate the encoder’s outputs, thereby encapsulating the data distribution more effectively:

\[L_{codebook} = \| sg[z] - \hat{z} \|_2^2 \label{eq:codebook95loss}\tag{5}\] where \(sg\) denotes the stop-gradient operation that prevents gradient information from flowing back.

The \(L_{commit}\) optimizes the encoder to produce latent vectors that closely align with the codebook vectors, as expressed in Equation 6. This loss component prevents the latent space from becoming excessively expansive and keeps it close to the codebook embeddings:

\[L_{commit} = \| z - sg[\hat{z}] \|_2^2 \label{eq:commit95loss}\tag{6}\]

The aggregate objective function of Holo-VQVAE, therefore, is the summation of these individual losses, as formalized in Equation 7:

\[L_{total} = L_{recon} + L_{codebook} + L_{commit} \label{eq:total95loss}\tag{7}\]
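The stop-gradient operator maps directly onto `detach()` in PyTorch; a minimal sketch of Equations 5-7, together with the straight-through estimator that the original VQ-VAE formulation [6] uses to pass gradients back to the encoder, might look as follows:

```python
import torch
import torch.nn.functional as F

def vq_losses(z, z_hat):
    """z: encoder output; z_hat: its quantized counterpart (same shape).
    Implements Equations 5 and 6 via stop-gradient (detach)."""
    codebook_loss = F.mse_loss(z_hat, z.detach())    # Eq. 5: pulls codebook vectors toward sg[z]
    commitment_loss = F.mse_loss(z, z_hat.detach())  # Eq. 6: keeps the encoder close to the codebook
    # straight-through estimator: the decoder sees z_hat, gradients flow back through z
    z_hat_st = z + (z_hat - z).detach()
    return codebook_loss, commitment_loss, z_hat_st

# L_total = L_recon + L_codebook + L_commit (Equation 7), summed during training.
```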

3.5 Sampling

For a standard VAE, the latent space is designed to follow a continuous Gaussian distribution, which allows for easy sampling. The generation of novel data can be achieved by sampling a vector from the Gaussian distribution, which is then passed to the decoder for reconstruction.

In contrast, VQ-VAEs exploit a discrete latent space where the input to the decoder is represented by a combination of the codebook vectors. Since it is not straightforward to sample from this space, an additional model is trained to sample appropriate sequences of the codebook vectors for the generation of new samples.

We select PixelSnail [41] as the sampling model for Holo-VQVAE. Other models, including diffusion-based [42] and transformer-based [35] methods, may provide better sampling results, but we leave this to future work. Details of the PixelSnail configuration are provided in the supplementary material.
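Conceptually, sampling draws a grid of codebook indices from the learned autoregressive prior and decodes it into a POH; the sketch below assumes a generic autoregressive `prior` interface that returns per-position logits over the codes and is not the actual PixelSnail implementation.

```python
import torch

@torch.no_grad()
def sample_pohs(prior, codebook, decoder, batch=4, grid=16):
    """Draws grids of codebook indices from an autoregressive prior, then decodes them into POHs.
    `prior(indices)` is assumed to return logits of shape (batch, K, grid, grid)."""
    indices = torch.zeros(batch, grid, grid, dtype=torch.long)
    for i in range(grid):                                   # raster-scan sampling, one position at a time
        for j in range(grid):
            logits = prior(indices)[:, :, i, j]             # (batch, K) logits at position (i, j)
            probs = torch.softmax(logits, dim=-1)
            indices[:, i, j] = torch.multinomial(probs, 1).squeeze(-1)
    z_hat = codebook[indices]                               # (batch, grid, grid, H) quantized latents
    return decoder(z_hat.permute(0, 3, 1, 2))               # POHs of shape (batch, C, N, N)
```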

4 Experiments

We evaluate our model on the MNIST [43], Fashion-MNIST [44], and CelebA-HQ [45] datasets. The images are resized to \(128\times128\) for CelebA-HQ, and \(64\times64\) for MNIST and Fashion-MNIST. The propagation distance \(p_{d}\) is set to 15 mm and 21.5 mm for images with \(64\times64\) and \(128\times128\) resolutions, respectively. The hyperparameter \(s\) used in the amplitude extraction is set to \(0.95\), following the implementation of HoloNet [4]. The optimizer is Adam with a learning rate of \(0.0002\), and the model was trained on each dataset for \(100\) epochs on a single NVIDIA GeForce RTX 3090 GPU. Further details can be found in the supplementary materials.

The performance of our model was assessed using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Fréchet Inception Distance (FID) as metrics.

PSNR calculates the ratio between the maximum possible power of a signal (in this paper, an image) and the power of distorting noise that affects the quality of its representation. A higher PSNR generally indicates better reconstruction quality.

SSIM is used to measure the similarity between two images by considering changes in structural information, luminance, and contrast. SSIM values range between -1 and 1, where 1 indicates perfect similarity.

FID measures the distributional similarity between the generated images and real images. A lower FID score indicates greater similarity, with a score of 0 indicating an exact match.
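For reference, PSNR and SSIM can be computed with standard routines such as those in recent versions of scikit-image; the snippet below is a usage sketch on synthetic data, not our exact evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(gt, recon):
    """gt, recon: float arrays in [0, 1] with shape (N, N, C)."""
    psnr = peak_signal_noise_ratio(gt, recon, data_range=1.0)
    ssim = structural_similarity(gt, recon, data_range=1.0, channel_axis=-1)
    return psnr, ssim

gt = np.random.rand(128, 128, 3).astype(np.float32)
recon = np.clip(gt + 0.02 * np.random.randn(*gt.shape), 0.0, 1.0).astype(np.float32)
print(evaluate_pair(gt, recon))
```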

Figure 2: An illustration of example POH reconstructions utilizing diverse loss functions: (a) employing a model trained solely on \(l2\) loss; (b) employing a model trained on a loss function that integrates \(l2\) loss with perceptual loss in a 9:1 ratio; (c) employing a model trained solely on perceptual loss. The examples present only a single channel (green) of the input for better noise visualization.

Table 1: Average PSNR and SSIM scores of Holo-VQVAE trained with different l2 : perceptual loss ratios, evaluated on the MNIST, Fashion-MNIST, and CelebA-HQ test sets.

l2 : perceptual   Metric   MNIST   Fashion-MNIST   CelebA-HQ
100 : 0           PSNR     33.53   32.24           26.91
                  SSIM     0.76    0.82            0.78
90 : 10           PSNR     31.86   30.38           24.70
                  SSIM     0.71    0.79            0.73
0 : 100           PSNR     23.09   23.14           17.74
                  SSIM     0.41    0.66            0.62

4.1 Reconstruction loss

In image synthesis, generative models typically incorporate either \(l2\) loss or perceptual loss to enhance reconstruction quality. The \(l2\) loss aims at minimizing pixel-level discrepancies between the generated and target images, thereby achieving high fidelity in pixel-value correspondence. Conversely, the perceptual loss [39], [46] aims at enhancing the perceptual similarity between images, aligning more congruently with human visual processing mechanisms.

Upon applying the perceptual loss to POH synthesis, we observed that it can also suppress the generation of noise [47]–[49] in the reconstructed images. When the POH was generated by the model trained only on the \(l2\) loss, the reconstructed image shows noticeable noise, as shown in Figure 2 (a). Conversely, when the POH was generated by the model trained only on the perceptual loss, the reconstructed image shows noticeable pixel-value differences compared to the input image, as shown in Figure 2 (c). Here, we speculate that the perceptual loss may contribute to regulating and smoothing the noise in the generated POHs.

The sole use of the perceptual loss, however, results in a noticeable drop in the average PSNR and SSIM scores of the reconstructions, as shown in Table 1. In light of these insights, we combined the \(l2\) loss with the perceptual loss in a 9:1 ratio, as demonstrated in Figure 2 (b). This combination was empirically chosen to balance the PSNR and SSIM metrics, thereby optimizing the reconstruction quality.

4.2 Evaluation on reconstruction and sampling

Figure 3: POHs and their reconstructions, generated with Holo-VAE and Holo-VQVAE, using ground truth (GT) images from MNIST, Fashion-MNIST, and CelebA-HQ datasets.

Table 2: PSNR and SSIM scores on the reconstructed validation set images for Holo-VAE and Holo-VQVAE.

Model        Metric   MNIST   Fashion-MNIST   CelebA-HQ
Holo-VAE     PSNR     23.89   21.75           17.63
             SSIM     0.65    0.60            0.46
Holo-VQVAE   PSNR     31.86   30.38           24.70
             SSIM     0.71    0.79            0.73

We conducted comparative evaluations from both reconstruction and sampling perspectives to validate whether our architecture design successfully serves our purpose. Since there are no existing studies with networks that have successfully achieved our goal of generating POHs, the evaluation focused on two internally developed approaches: Holo-VAE and Holo-VQVAE.

Holo-VAE refers to our initial architecture, employing a standard VAE architecture instead of a VQ-VAE. The objective function of this architecture comprises a reconstruction loss and a KL-divergence loss, which collectively ensure that the latent space conforms to a Gaussian distribution. Our choice to begin with the VAE was influenced by its straightforward nature and the existing foundation of research demonstrating its potential in similar applications. We hypothesized that employing a VAE, with a carefully devised reconstruction loss \(L_{recon}\), would yield high-fidelity POHs, both in reconstruction and sampling tasks. Holo-VQVAE refers to our architecture described in this paper.

The reconstruction results are shown in Figure 3. While Holo-VAE appears to capture essential features of the ground truth images and generate plausible outputs, it fails to capture high-level details and produces blurred images. We speculate that this can be attributed to the VAE’s assumption that the latent space follows a Gaussian distribution. High-level details often correspond to less probable regions of a Gaussian distribution, and the VAE’s regularization (KL-divergence loss) may inadvertently smooth out these details to fit the Gaussian model. This problem can, of course, be mitigated by reducing the weight of the KL-divergence loss; however, this inevitably compromises the sampling ability [50].

Conversely, the Holo-VQVAE demonstrates a notable proficiency in generating results that are both clearer and more adept at restoring high-level details. We speculate that this enhanced performance can be attributed to VQ-VAE’s utilization of discrete latent space. The discrete representation allows VQ-VAE to handle complex distributions like POH data and preserve distinct features that might be averaged out or lost in the continuous latent space of a VAE [6].

The quantitative evaluation of the reconstruction task, as indicated by the PSNR and SSIM scores, is detailed in Table 2. These results demonstrate that Holo-VQVAE generally outperforms Holo-VAE in terms of reconstruction quality. In particular, the enhancement in scores becomes more pronounced as the dataset complexity escalates from MNIST to CelebA-HQ.

Subsequently, the results of the sampling task are shown in Figure 4. This task involves generating new, unseen POHs that are representative of the training dataset, thereby demonstrating the model’s generative capabilities.

Figure 4: Representative POH samples and their reconstructions from Holo-VAE and Holo-VQVAE trained on the MNIST, Fashion-MNIST, and CelebA-HQ datasets. The R, G, and B labels at the top left refer to the RGB channels of the sample POH, respectively.

Similar to the reconstruction task, Holo-VAE often produced images with a loss of high-level detail and a blurred appearance. In contrast, Holo-VQVAE produced images that were clear and rich in high-level details. The effectiveness of each model in this task is quantitatively assessed using FID scores, which measure the similarity between a set of 100 generated samples and a random selection of 100 images from the validation set. As indicated in Table 3, Holo-VQVAE demonstrates superior performance in terms of FID scores, with the improvement being more pronounced as the dataset transitions from the less complex MNIST to the more intricate CelebA-HQ. More samples from Holo-VQVAE can be found in the supplementary material.

Table 3: FID scores on the reconstructions of sample POHs for Holo-VAE and Holo-VQVAE.

Model        Metric   MNIST   Fashion-MNIST   CelebA-HQ
Holo-VAE     FID      134.7   209.1           173.9
Holo-VQVAE   FID      102.1   162.7           134.1

It is important to note that the FID scores of our models are relatively high compared to those of recent image generation models. This discrepancy can be attributed to the inherent challenges in generating high-fidelity POHs, a task that is fundamentally more complex and nuanced than typical image generation. The presence of noise, a factor we aimed to mitigate as discussed in Section 4.1, also contributes to these higher scores. We anticipate that ongoing advancements and refinements in our approach will lead to improved FID scores in the future.

5 Limitations and future work

Figure 5: Representative POH samples and their reconstructions from Holo-VQVAE trained on CelebA-HQ at a resolution of \(256\times256\). The R, G, and B labels at the top left refer to the RGB channels of the sample POH, respectively.

In addressing the limitations and future directions of our research, three key aspects warrant attention.

Firstly, the resolution of the generated POHs presents a significant challenge. Our model currently supports resolutions up to \(256\times256\) pixels, as demonstrated in Figure 5. However, for practical applications in holography, notably higher resolutions are required. This need stems from the specifications of holographic displays, which typically feature pixel sizes ranging from 3 to 60 \(\mu m\) [51]. Consequently, an image reconstructed at our model’s highest resolution of \(256\times256\) translates to a physical size of only about 1.536 \(cm\) in both width and height. This size is considerably smaller than what is needed for effective visualization in real-world scenarios, such as in holographic displays or augmented reality devices. Addressing this limitation involves not only scaling up the resolution but also ensuring that the increase in resolution does not compromise the quality of the holographic reconstruction.

Secondly, our model’s current design generates POHs optimized for a fixed propagation distance \(p_{d}\). While this approach ensures quality reconstructions at this specific distance, it limits the hologram’s effectiveness for viewers positioned at varying distances. Therefore, it is crucial to develop methods that allow for variable propagation distances in the training process. Such an advancement would enable the generation of POHs that maintain high-quality reconstructions across a range of distances, significantly enhancing the versatility and applicability of our holographic models in diverse real-world settings.

Lastly, it is noteworthy that our study did not incorporate experimental validation using real optical devices to assess reconstruction quality. This decision was grounded in the fact that our primary aim was to assess the algorithmic effectiveness of the learning strategies for high-fidelity POH generation. Advances in computational holography allow for accurate predictions and assessments of hologram quality through simulations, aligning with current research trends where computational validation often precedes resource-intensive optical testing. However, we acknowledge that the practical application of these models would benefit from experimental validation using optical setups to fully ascertain their effectiveness in real-world holographic display scenarios. Therefore, future work will include optical validation to bridge the gap between theoretical development and practical application.

6 Conclusion

Recent image generation methods leveraging generative models have unlocked a plethora of creative possibilities, ranging from seamless image interpolation to intricate text-driven modifications. The cornerstone of these advancements lies in the effective utilization of latent spaces, which distill the crucial features of the target data.

Despite these advancements in image generation, the field of holography, particularly generative modeling for hologram generation, has yet to fully embrace the potential of latent spaces. This omission has kept the holography domain from accessing the advanced capabilities inherent in generative models, thereby limiting the scope of innovation and creativity in hologram generation.

To bridge this gap, we present Holo-VQVAE, a novel framework designed to integrate generative models into the domain of hologram generation. As a pioneering effort, Holo-VQVAE marks a significant stride towards enriching the holography field with the advanced functionalities of generative models. We anticipate that this integration will catalyze a wave of novel methodologies, expanding the horizons of holographic technology. The potential of Holo-VQVAE therefore extends beyond current applications, promising to unlock new realms of exploration and innovation in the creation and manipulation of holographic content.

References

[1]
.
[2]
.
[3]
.
[4]
.
[5]
.
[6]
.
[7]
.
[8]
.
[9]
.
[10]
.
[11]
.
[12]
.
[13]
.
[14]
.
[15]
.
[16]
.
[17]
.
[18]
.
[19]
.
[20]
.
[21]
.
[22]
.
[23]
.
[24]
.
[25]
.
[26]
.
[27]
.
[28]
.
[29]
.
[30]
.
[31]
pp. 10630–10640, 2022.
[32]
pp. 3626–3636, 2022.
[33]
pp. 11379–11388, 2022.
[34]
pp. 5997–6006. IEEE, 2023.
[35]
pp. 12873–12883, 2021.
[36]
.
[37]
.
[38]
.
[39]
pp. 694–711. Springer, 2016.
[40]
pp. 2710–2719, 2019.
[41]
pp. 864–872. PMLR, 2018.
[42]
pp. 10684–10695, 2022.
[43]
.
[44]
.
[45]
.
[46]
pp. 1–6. IEEE, 2020.
[47]
.
[48]
.
[49]
.
[50]
.
[51]
.