September 07, 2025
Uncertainty Quantification (UQ) is essential in probabilistic machine learning models, particularly for assessing the reliability of predictions. In this paper, we present a systematic framework for estimating both epistemic and aleatoric uncertainty in probabilistic models. We focus on Gaussian Process Latent Variable Models and employ scalable Random Fourier Features-based Gaussian Processes to approximate predictive distributions efficiently. We derive a theoretical formulation for UQ, propose a Monte Carlo sampling-based estimation method, and conduct experiments to evaluate the impact of uncertainty estimation. Our results provide insights into the sources of predictive uncertainty and illustrate the effectiveness of our approach in quantifying the confidence in the predictions.
Uncertainty Quantification (UQ) plays a fundamental role in probabilistic machine learning (ML). ML models are increasingly used in applications that inform decisions and where reliable predictions are essential. This is especially important in fields that demand sound choices under uncertainty, such as medical diagnosis [1], autonomous systems [2], and scientific modeling in general [3]. However, most traditional ML models provide only point predictions without assessing their confidence, which makes it difficult to quantify the reliability of the results [4].
To address this limitation, probabilistic ML models provide a foundational framework for representing uncertainty [5]. By explicitly modeling uncertainty [6], these models not only improve reliability in high-risk applications but also promote better generalization by distinguishing between known and unknown regions of the input space. Moreover, uncertainty estimates increase model interpretability by clarifying the sources of uncertainty and revealing potential limitations and biases. This underscores the role of UQ in characterizing and managing the uncertainty inherent in models and their predictions, and in providing insight into reliability and confidence calibration. Given its broad impact, improving uncertainty estimation remains a pressing challenge.
In probabilistic ML models, uncertainty can be categorized into two main types:
Epistemic uncertainty (\(\mathcal{E}\)): Also known as model uncertainty, this type of uncertainty arises due to a lack of knowledge about the model, which, in turn, results from insufficient training data. It reflects the model’s confidence in its parameters and structure.
Aleatoric uncertainty (\(\mathcal{A}\)): This uncertainty, often called data uncertainty, originates from intrinsic noise in the data-generating process and represents variability that remains irreducible, even with unlimited data.
The total predictive uncertainty in a model is given by \[\begin{align} \mathcal{T} &= \mathcal{E} + \mathcal{A}, \end{align}\] where \(\mathcal{T}\) represents the total uncertainty in the model’s prediction.
In this paper, we focus on developing a principled framework for estimating both epistemic and aleatoric uncertainty in probabilistic ML models. Since probabilistic ML models comprise a broad range of approaches, we focus on making the problem more tractable by studying the UQ of predictions made by Gaussian process latent variable models (GPLVMs) [7]. Specifically, we explore how uncertainty can be efficiently quantified using scalable approximations, such as Random Fourier Features (RFF)-based Gaussian processes (GPs), which we use to construct GPLVMs.
Our contributions include:
A systematic formulation of epistemic and aleatoric uncertainty estimation.
A scalable approach to UQ using RFF-based GPs.
A methodology for computing epistemic and aleatoric uncertainties, along with insights gained from experimental results.
The paper is organized as follows. First, we provide background on probabilistic ML models in Section 2, followed by a brief overview of GPLVMs in Section 3. In Section 4, we present the theoretical foundation for evaluating epistemic and aleatoric uncertainties, while in Section 5 we provide details of the methods used for these evaluations. Section 6 discusses experimental results and insights gained from various experiments. Finally, Section 7 concludes with final remarks.
A probabilistic ML model learns a predictive distribution over outputs \(\boldsymbol{y}_n\) given an input \(\boldsymbol{x}_n\). The general formulation takes the following form: \[\begin{align} \label{eq:predd} p(\boldsymbol{y}_n | \boldsymbol{x}_n, \mathcal{D}) = \int p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{\theta}) p(\boldsymbol{\theta} | \mathcal{D}) {\rm d}\boldsymbol{\theta}, \end{align}\tag{1}\] where \(\mathcal{D} = \{\boldsymbol{x}_n, \boldsymbol{y}_n\}_{n=1}^{N}\) is the training dataset, \(p(\boldsymbol{\theta} | \mathcal{D})\) is the posterior distribution over the model parameters, and \(p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{\theta})\) represents the likelihood function.
In many cases, ML models approximate this likelihood function using a deterministic mapping from inputs to outputs. In such models, a function \(f(\boldsymbol{x}_n, \boldsymbol{\theta})\) produces a single output estimate, \[\begin{align} \widehat{\boldsymbol{y}}_n = f(\boldsymbol{x}_n, \widehat{\boldsymbol{\theta}}), \end{align}\] where \(\widehat{\boldsymbol{\theta}}\) is some estimate of \(\boldsymbol{\theta}\) obtained from the training data \({\cal D}\). However, probabilistic models extend this by modeling the entire predictive distribution rather than just a point estimate as given by 1 .
We now describe how this predictive distribution quantifies the uncertainty of the prediction through its total covariance. Let the predictive distribution be given as in 1 . Then, we can express the predictive mean as follows: \[\begin{align} \mathbb{E}_{ p(\boldsymbol{y}_n | \boldsymbol{x}_n, \mathcal{D})} [\boldsymbol{y}_n] = \mathbb{E}_{p(\boldsymbol{\theta} | \mathcal{D})} \big[ \mathbb{E}_{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{\theta})} [\boldsymbol{y}_n] \big]. \end{align}\] To find the total predictive covariance, we apply the law of total covariance and obtain \[\begin{align} \text{Cov}_{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \mathcal{D})} [\boldsymbol{y}_n] &= \text{Cov}_{p(\boldsymbol{\theta} | \mathcal{D})} \big[ \mathbb{E}_{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{\theta})} [\boldsymbol{y}_n] \big]\notag\\ & + \mathbb{E}_{p(\boldsymbol{\theta} | \mathcal{D})} \big[ \text{Cov}_{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{\theta})} [\boldsymbol{y}_n] \big]. \end{align}\] This decomposition separates epistemic uncertainty (first term) from aleatoric uncertainty (second term), i.e., \[\begin{align} \mathcal{E} &= \text{Cov}_{p(\boldsymbol{\theta} | \mathcal{D})} \big[ \mathbb{E}_{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{\theta})} [\boldsymbol{y}_n] \big],\\ \mathcal{A} &= \mathbb{E}_{p(\boldsymbol{\theta} | \mathcal{D})} \big[ \text{Cov}_{p(\boldsymbol{y}_n | \boldsymbol{x}_n, \boldsymbol{\theta})} [\boldsymbol{y}_n] \big]. \end{align}\] In words, the epistemic uncertainty measures how much the expected prediction varies due to uncertainty in the parameters \(\boldsymbol{\theta}\). It is obtained by first computing the expectation over the predictive distribution given the parameters and then computing the covariance of this expectation across different model parameters sampled from the posterior of the parameters. 
The aleatoric uncertainty is the expected predictive covariance due to data noise under the model’s posterior.
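As a concrete illustration, this decomposition can be checked numerically on a toy model. The Python sketch below assumes a one-dimensional linear model with a Gaussian posterior over a single parameter and a fixed noise variance (all numbers are illustrative, not taken from the paper); the epistemic term is estimated as the variance of the conditional means across posterior samples, and the aleatoric term as the average conditional variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y = theta * x + noise, with a Gaussian posterior over theta.
# The posterior mean/std and the noise variance below are illustrative assumptions.
x = 2.0
theta_samples = rng.normal(loc=1.5, scale=0.3, size=10_000)  # theta ~ p(theta | D)
noise_var = 0.25                                             # Var[y | x, theta]

# Conditional means E[y | x, theta], one per posterior sample of theta
cond_means = theta_samples * x

epistemic = np.var(cond_means)  # variance of conditional means over p(theta | D)
aleatoric = noise_var           # expectation over p(theta | D) of the noise variance
total = epistemic + aleatoric   # law of total variance
```

Here the epistemic term concentrates around \((0.3 \cdot 2)^2 = 0.36\) as the number of posterior samples grows, while the aleatoric term stays fixed at the noise level.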
We now briefly introduce GPLVMs with a focus on the uncertainties in their predictions. GPLVMs are powerful tools for modeling high-dimensional data by learning low-dimensional latent representations [7]. They establish a connection between a low-dimensional latent space and a high-dimensional observed space through GPs. Specifically, GPLVMs model data \(\boldsymbol{Y}\in\mathbb{R}^{N\times d_y}\) in a high-dimensional space \(\mathcal{Y}\), where \(\mathcal{Y} = \mathbb{R}^{d_y}\), by assuming that these data are generated from latent variables \(\boldsymbol{X}\) in a lower-dimensional space, \(\mathcal{X}\), with \(\boldsymbol{X}\in\mathbb{R}^{N\times d_x}\) and \(\mathcal{X} = \mathbb{R}^{d_x}\), where \(d_x \ll d_y\). Here, \(N\) represents the number of paired observations between \(\mathcal{X}\) and \(\mathcal{Y}\).
More specifically, each observed vector \(\boldsymbol{y}_n\in\mathbb{R}^{d_y}\), \(n=1, 2, \ldots, N\) is generated from a corresponding latent vector \(\boldsymbol{x}_n\in\mathbb{R}^{d_x}\), where the mapping \(\boldsymbol{x}_n\mapsto \boldsymbol{y}_n\) is modeled using \(d_y\) GPs [8]. In other words, for a given latent variable \(\boldsymbol{x}_n\), the observed variable \(\boldsymbol{y}_n\) is drawn according to \[\begin{align} \boldsymbol{y}_n&\sim {\cal N}(\boldsymbol{f},\boldsymbol{\Sigma}_y), \end{align}\] where \(\boldsymbol{\Sigma}_y={\rm diag}\{\sigma_1^2, \sigma_2^2,\ldots, \sigma_{d_y}^2\}\), and \(\boldsymbol{f}\) is a vector of \(d_y\) functions, i.e., \(\boldsymbol{f}(\boldsymbol{x}_n)=[f_1(\boldsymbol{x}_n), f_2(\boldsymbol{x}_n),\ldots, f_{d_y}(\boldsymbol{x}_n)]^\top\), each generated from a GP prior, \[\begin{align} f_d(\cdot)&\sim {\cal GP}(0,\kappa_d), \;\;\;d=1, 2, \ldots, d_y. \end{align}\]
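To make the generative model concrete, the following Python sketch draws synthetic data from a GPLVM prior. The RBF kernel, the sizes \(N\), \(d_x\), \(d_y\), and the noise variances are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_x, d_y = 50, 2, 4  # illustrative sizes

def rbf_kernel(X, lengthscale=1.0):
    # Squared-exponential kernel matrix over the rows of X
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

X = rng.standard_normal((N, d_x))       # latent variables x_n
K = rbf_kernel(X) + 1e-6 * np.eye(N)    # jitter for numerical stability
sigma2 = np.full(d_y, 0.1)              # per-output noise variances (assumed)

# One GP draw f_d per output dimension, plus Gaussian observation noise
F = rng.multivariate_normal(np.zeros(N), K, size=d_y).T  # shape (N, d_y)
Y = F + rng.standard_normal((N, d_y)) * np.sqrt(sigma2)
```

The jitter term added to the diagonal of \(K\) is a standard numerical safeguard when sampling from a GP prior on a finite set of inputs.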
Now we provide a brief overview of inference with GPLVMs. Given the observed (training) data \(\boldsymbol{Y}\in\mathbb{R}^{N\times d_y}\), the main objective during training is to learn how to draw samples from the posterior distribution, \[\begin{align} \label{eq:neweq1} p(\boldsymbol{X},\boldsymbol{\theta}, \boldsymbol{\Sigma}_y|\boldsymbol{Y}) &\propto p(\boldsymbol{Y}|\boldsymbol{X},\boldsymbol{\theta},\boldsymbol{\Sigma}_y)p(\boldsymbol{X})p(\boldsymbol{\theta})p(\boldsymbol{\Sigma}_y), \end{align}\tag{2}\] where \(\boldsymbol{\theta}\) represents the model parameters, \(p(\boldsymbol{Y}|\boldsymbol{X},\boldsymbol{\theta}, \boldsymbol{\Sigma}_y)\) is the likelihood function, in which we treat \(\boldsymbol{Y}\) as observed variables and view \(\boldsymbol{X}\), \(\boldsymbol{\theta}\), and \(\boldsymbol{\Sigma}_y\) as the unknowns to be inferred, and \(p(\boldsymbol{X})\), \(p(\boldsymbol{\theta})\), and \(p(\boldsymbol{\Sigma}_y)\) are the respective priors of \(\boldsymbol{X}\), \(\boldsymbol{\theta}\), and \(\boldsymbol{\Sigma}_y\). Samples from the posterior can be drawn using methods such as Hamiltonian Monte Carlo (HMC) or other Markov chain Monte Carlo (MCMC) techniques.
For testing, let \(\boldsymbol{y}_*=[\boldsymbol{y}_{o,*}^\top, \boldsymbol{y}_{u,*}^\top]^\top\), where \(\boldsymbol{y}_{o,*}\) and \(\boldsymbol{y}_{u,*}\) represent the observed and unobserved parts of \(\boldsymbol{y}_*\). The goal is now to determine the predictive distribution of \(\boldsymbol{y}_{u,*}\) given \(\boldsymbol{y}_{o,*}\), i.e., \(p(\boldsymbol{y}_{u,*}|\boldsymbol{y}_{o,*},\boldsymbol{Y})\), using the learned posterior of the GPLVM in 2 . We achieve this in three steps.
First we infer the posterior distribution of the latent variable \(\boldsymbol{x}_*\) that corresponds to the partially observed test point \(\boldsymbol{y}_{o,*}\). This posterior is given by \[\begin{align} \label{eq:normals} p(\boldsymbol{x}_*|\boldsymbol{y}_{o,*},\boldsymbol{Y})&\propto p(\boldsymbol{x}_*)\int p(\boldsymbol{y}_{o,*}|\boldsymbol{x}_*,\boldsymbol{\theta},\boldsymbol{\Sigma}_y) \notag\\ &\times p(\boldsymbol{X},\boldsymbol{\theta},\boldsymbol{\Sigma}_y|\boldsymbol{Y}) {\rm d}\boldsymbol{X}{\rm d}\boldsymbol{\theta}{\rm d}\boldsymbol{\Sigma}_y, \end{align}\tag{3}\] where \(p(\boldsymbol{y}_{o,*}|\boldsymbol{x}_*,\boldsymbol{\theta},\boldsymbol{\Sigma}_y)\) is a Gaussian distribution, and \(p(\boldsymbol{X},\boldsymbol{\theta}, \boldsymbol{\Sigma}_y|\boldsymbol{Y})\) is given by 2 .
Next we find the predictive distribution of the unobserved components \(y_{d,*}\), \(d\in{\cal S}_u\), where \({\cal S}_u\) represents the set of indices corresponding to the unobserved outputs. This is done using the GPs associated with the respective components of \(\boldsymbol{y}_{u,*}\), which leads to the predictive distribution \[\begin{align} \label{eq:preddis} p(y_{d,*}|\boldsymbol{y}_{o,*},\boldsymbol{Y})&\propto\int p(y_{d,*}|\boldsymbol{x}_*,\boldsymbol{\theta},\boldsymbol{\Sigma}_y)p(\boldsymbol{x}_*|\boldsymbol{y}_{o,*},\boldsymbol{\theta},\boldsymbol{\Sigma}_y)\notag\\ &\times p(\boldsymbol{X},\boldsymbol{\theta},\boldsymbol{\Sigma}_y|\boldsymbol{Y}) \, {\rm d}\boldsymbol{X} {\rm d}\boldsymbol{\theta}{\rm d}\boldsymbol{x}_* {\rm d}\boldsymbol{\Sigma}_y, \end{align}\tag{4}\] where \(p(\boldsymbol{x}_*|\boldsymbol{y}_{o,*},\boldsymbol{\theta},\boldsymbol{\Sigma}_y)\) was determined in step 1, and \(p(\boldsymbol{X},\boldsymbol{\theta},\boldsymbol{\Sigma}_y|\boldsymbol{Y})\) is defined by 2 .
From 4 , we can draw samples of each \(y_{d,*}\) and use them to approximate its predictive distribution. Often, this predictive distribution is approximated by a Gaussian factorized across dimensions, i.e., \[\begin{align} \label{eq:prediction} p(\boldsymbol{y}_{u,*}|\boldsymbol{y}_{o,*},\boldsymbol{Y}) &= \prod_{d\in {\cal S}_u} \mathcal{N}(y_{d,*} | \mu_{d,*}, \sigma_{d,*}^2), \end{align}\tag{5}\] where \(\mu_{d,*}\) is the posterior GP mean for dimension \(d\), and \(\sigma_{d,*}^2\) is the corresponding posterior variance. The variance \(\sigma_{d,*}^2\) quantifies the uncertainty in each predicted component of \(\boldsymbol{y}_{u,*}\) and it includes both aleatoric and epistemic uncertainty, where the epistemic uncertainty arises from the uncertainty of the latent variables and the model parameters.
To extend our work on UQ to as wide a range of problems as possible, we explore the use of Random Fourier Features (RFF)–based GPs [9]. These GPs approximate an adopted kernel function with a finite-dimensional RFF mapping, \[\begin{align} \kappa(\boldsymbol{x},\boldsymbol{x}^\prime)&\approx \boldsymbol{\phi}(\boldsymbol{x})^\top\boldsymbol{\phi}(\boldsymbol{x}^\prime), \end{align}\] where \(\boldsymbol{\phi}(\boldsymbol{x})\in\mathbb{R}^{J\times 1}\) is given by \[\begin{align} \boldsymbol{\phi}(\boldsymbol{x})&=\sqrt{\frac{2}{J}} \left[ \begin{matrix} \cos(\boldsymbol{\omega}_1^\top\boldsymbol{x}), \sin(\boldsymbol{\omega}_1^\top\boldsymbol{x}), \dots, \sin(\boldsymbol{\omega}_{J/2}^\top\boldsymbol{x}) \end{matrix} \right]^\top, \end{align}\] where \(\boldsymbol{\omega}_j\) are random frequencies, which, according to Bochner’s theorem [10], are drawn from the power spectral density of the chosen kernel. The functions of interest shown in Fig. [Fig:GPLVM] are then modeled by \[\begin{align} f_d(\boldsymbol{x}_n)&=\boldsymbol{\phi}(\boldsymbol{x}_n)^\top\boldsymbol{\theta}_d. \end{align}\]
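A minimal Python sketch of this approximation for the RBF kernel follows; the lengthscale and the number of features \(J\) are illustrative. For the RBF kernel, the power spectral density is Gaussian, so the frequencies are drawn from \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}/\ell^2)\):

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, J = 2, 2000        # J must be even: J/2 cosine/sine frequency pairs
lengthscale = 1.0       # illustrative RBF lengthscale

# Bochner's theorem: for the RBF kernel, the (normalized) spectral density is
# Gaussian, so the random frequencies come from N(0, I / lengthscale^2).
omega = rng.standard_normal((J // 2, d_x)) / lengthscale

def phi(x):
    # RFF feature map, phi(x) in R^J
    proj = omega @ x
    return np.sqrt(2.0 / J) * np.concatenate([np.cos(proj), np.sin(proj)])

x1 = np.array([0.3, -0.7])
x2 = np.array([1.0, 0.2])
approx = phi(x1) @ phi(x2)                                      # RFF estimate
exact = np.exp(-0.5 * np.sum((x1 - x2) ** 2) / lengthscale**2)  # true RBF kernel
# approx converges to exact as J grows
```

The inner product of the paired cosine/sine features reduces to an average of \(\cos(\boldsymbol{\omega}_j^\top(\boldsymbol{x}-\boldsymbol{x}^\prime))\) terms, whose expectation is exactly the RBF kernel value.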
Thus, after training the RFF-based GPs, the log-likelihood used to optimize \(\boldsymbol{x}_*\) becomes \[\begin{align} {\cal L} &=\sum_{d \in \mathcal{S}_o} \log \mathcal{N}(y_{o,*,d} | \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \widehat{\boldsymbol{\theta}}_d, \widehat{\sigma}_{o,*,d}^2) + \log p(\boldsymbol{x}_*), \end{align}\] where \(\widehat{\boldsymbol{\theta}}_d\) and \(\widehat{\sigma}_{o,*,d}^2\) represent estimated values from the training data.
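The optimization of \(\boldsymbol{x}_*\) under this log-likelihood can be sketched as follows. Here `theta_hat`, `sigma2_hat`, and `y_obs` are hypothetical stand-ins for quantities estimated during training, and a standard-normal prior on \(\boldsymbol{x}_*\) is assumed:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
d_x, d_o, J = 2, 3, 100  # latent dim, number of observed outputs, RFF dim
omega = rng.standard_normal((J // 2, d_x))

def phi(x):
    proj = omega @ x
    return np.sqrt(2.0 / J) * np.concatenate([np.cos(proj), np.sin(proj)])

# Hypothetical stand-ins for quantities estimated from the training data
theta_hat = 0.1 * rng.standard_normal((J, d_o))  # one weight vector per observed output
sigma2_hat = np.full(d_o, 0.1)                   # estimated noise variances
y_obs = np.array([0.5, -0.2, 0.1])               # observed part y_{o,*}

def neg_log_posterior(x_star):
    mu = phi(x_star) @ theta_hat  # predictive means of the observed outputs
    loglik = -0.5 * np.sum((y_obs - mu) ** 2 / sigma2_hat
                           + np.log(2.0 * np.pi * sigma2_hat))
    log_prior = -0.5 * x_star @ x_star  # standard-normal prior on x_*
    return -(loglik + log_prior)

res = minimize(neg_log_posterior, x0=np.zeros(d_x))
x_star_map = res.x  # point estimate of the test latent variable
```

A point estimate like this can serve as an initialization for the posterior sampling of \(\boldsymbol{x}_*\) used in the uncertainty computations that follow.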
Given the above description of GPLVMs, we first formulate the problem. We are given a training dataset \(\mathcal{D} = \{\boldsymbol{y}_n\}_{n=1}^{N}\), which is used to construct a GPLVM. The construction involves estimating the latent variables \(\boldsymbol{X}\), the parameters of the GP models, \(\boldsymbol{\Theta}=[\boldsymbol{\theta}_1,\dots, \boldsymbol{\theta}_{d_y}]\), and \(\boldsymbol{\Sigma}_y\). The estimates of \(\boldsymbol{X}\), \(\boldsymbol{\Theta}\), and \(\boldsymbol{\Sigma}_y\) can be samples from \(p(\boldsymbol{X},\boldsymbol{\theta}, \boldsymbol{\Sigma}_y|\boldsymbol{Y})\), which we denote by \(\boldsymbol{X}^{(m)}\), \(\boldsymbol{\Theta}^{(m)}\), and \(\boldsymbol{\Sigma}_y^{(m)}\), \(m=1, \dots, M\). After training, a test observation \(\boldsymbol{y}_*\) is provided, where some elements are missing. The missing variables are predicted using Gaussian predictive distributions constructed by the GPs that correspond to these variables. Our goal is to quantify the epistemic and aleatoric uncertainties associated with these predictions. As already pointed out, we use the law of total variance and decompose the total predictive variance of \(y_{d,*}\) by \[\begin{align} \text{Var}[y_{d,*}] &= \underbrace{\mathbb{E}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{Y})} \big[ \text{Var}_{p(y_{d,*} | \boldsymbol{x}_*, \boldsymbol{\theta}_d)} [y_{d,*}] \big]}_{\text{Aleatoric Uncertainty}} \notag \\ &\quad + \underbrace{\text{Var}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{Y})} \big[ \mathbb{E}_{p(y_{d,*} | \boldsymbol{x}_*, \boldsymbol{\theta}_d)} [y_{d,*}] \big]}_{\text{Epistemic Uncertainty}}. \end{align}\] We focus first on the epistemic uncertainty when making predictions of \(y_{d,*}\). This uncertainty comes from three main sources:
Uncertainty in training latent variables \(\boldsymbol{X}\): Recall that \(\boldsymbol{X}\) was inferred from data and that it follows a posterior distribution \(p(\boldsymbol{X} | \boldsymbol{Y})\), which affects both \(\boldsymbol{\theta}_d\) and \(\boldsymbol{x}_*\).
Uncertainty in test latent variable \(\boldsymbol{x}_*\): The posterior \(p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{Y})\) also introduces variability in predictions.
Uncertainty in function parameters \(\boldsymbol{\theta}_d\): Since \(\boldsymbol{\theta}_d\) is also inferred from data, its posterior \(p(\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y})\) also contributes to epistemic uncertainty.
The predicted value of \(y_{d,*}\) has a mean that is linear in \(\boldsymbol{\theta}_d\), i.e., \[\begin{align} \mathbb{E}_{p(y_{d,*} | \boldsymbol{x}_*, \boldsymbol{\theta}_d)} [y_{d,*}] = \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \mathbb{E}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}]. \end{align}\] Thus, the epistemic uncertainty simplifies to \[\begin{align} \sigma_{\text{epistemic}, d,*}^2 = \text{Var}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{Y})} \big[ \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \mathbb{E}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \big]. \end{align}\] Now we use the law of total variance and decompose the epistemic uncertainty into contributions from the training latent variables \(\boldsymbol{X}\) and the test latent variable \(\boldsymbol{x}_*\).
We incorporate the uncertainty in \(\boldsymbol{X}\) by applying the law of total variance again. We write \[\begin{align} &\sigma_{\text{epistemic}, d,*}^2\notag\\ &= \mathbb{E}_{p(\boldsymbol{X} | \boldsymbol{Y})} \Big[ \text{Var}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} \big[ \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \mathbb{E}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \big] \Big] \notag \\ &\quad + \text{Var}_{p(\boldsymbol{X} | \boldsymbol{Y})} \Big[ \mathbb{E}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} \big[ \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \mathbb{E}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \big] \Big], \end{align}\] where the first term accounts for uncertainty due to test input variability, reflecting how variations in \(\boldsymbol{x}_*\) influence the prediction, and the second term accounts for uncertainty from inferred latent variables and represents the impact of different inferred values of \(\boldsymbol{X}\).
We recall that \[\begin{align} \mathbb{E}_{p(y_{d,*} | \boldsymbol{x}_*, \boldsymbol{\theta}_d)} [y_{d,*}] = \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \boldsymbol{\theta}_d, \end{align}\] and write \[\begin{align} &\text{Var}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} \big[ \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \mathbb{E}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \big] \notag\\ &\quad = \mathbb{E}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} \big[ \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \text{Cov}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \boldsymbol{\phi}(\boldsymbol{x}_*) \big] \notag \\ &\quad + \text{Var}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} \big[ \boldsymbol{\phi}(\boldsymbol{x}_*)^\top \mathbb{E}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \big]. \end{align}\] Next, we take the expectation over \(p(\boldsymbol{X} | \boldsymbol{Y})\) and obtain \[\begin{align} &\sigma_{\text{epistemic}, d,*}^2 = \mathbb{E}_{p(\boldsymbol{X} | \boldsymbol{Y})} \Bigg[ \boldsymbol{\mu}_\phi^\top \text{Cov}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \boldsymbol{\mu}_\phi\notag\\ &\quad+ \text{Tr} \big( \text{Cov}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} [\boldsymbol{\phi}(\boldsymbol{x}_*)] \text{Cov}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}] \big) \Bigg] \notag \\ \label{eq:ep} &\quad + \text{Tr} \big( \text{Cov}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{Y})} [\mathbb{E}_{p(y_{d,*} | \boldsymbol{x}_*, \boldsymbol{\theta}_d)} [y_{d,*}]] \big), \end{align}\tag{6}\] where \[\begin{align} \boldsymbol{\mu}_\phi&=\mathbb{E}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} \big[\boldsymbol{\phi}(\boldsymbol{x}_*) \big]. \end{align}\] We observe that
the uncertainty of the function parameters is quantified by \(p(\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y})\), more specifically by \(\text{Cov}[\boldsymbol{\theta}_d | \boldsymbol{X}, \boldsymbol{Y}]\). We note that it affects the predictions through both the expectation and variance terms,
the uncertainty due to \(\boldsymbol{x}_*\) is quantified by \(p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})\) and is tracked via \(\text{Var}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} [\cdot]\), and
the effect of the uncertainty on the predictions due to \(\boldsymbol{X}\) and quantified by \(p(\boldsymbol{X} | \boldsymbol{Y})\) appears in \(\text{Var}_{p(\boldsymbol{X} | \boldsymbol{Y})} [\cdot]\).
Now we turn our attention to the aleatoric uncertainty. In our GPLVM framework, we assume that \[\begin{align} y_{d,*} \sim \mathcal{N}(\boldsymbol{\phi}(\boldsymbol{x}_*)^\top \boldsymbol{\theta}_d, \sigma_{d}^2), \end{align}\] and, thus, \[\begin{align} \text{Var}_{p(y_{d,*} | \boldsymbol{x}_*, \boldsymbol{\theta}_d)} [y_{d,*}] = \sigma_{d}^2. \end{align}\] We note that \(\sigma_d^2\) is itself random because it is inferred from the training data. Taking the expectation over \(p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{Y})\), we write \[\begin{align} \sigma_{\text{aleatoric}, d,*}^2 = \mathbb{E}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{Y})} [\sigma_{d}^2]. \end{align}\] From training, the noise variance \(\sigma_d^2\) is inferred from the posterior \[\begin{align} p(\sigma_d^2 | \boldsymbol{Y}) = \int p(\sigma_d^2 | \boldsymbol{X}, \boldsymbol{Y}) p(\boldsymbol{X} | \boldsymbol{Y}) {\rm d}\boldsymbol{X}. \end{align}\] Thus, the full expression for the aleatoric uncertainty is \[\begin{align} \label{eq:al} \sigma_{\text{aleatoric}, d,*}^2 = \mathbb{E}_{p(\boldsymbol{X} | \boldsymbol{Y})} \mathbb{E}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}, \boldsymbol{Y})} [\sigma_{d}^2]. \end{align}\tag{7}\]
In this section we discuss the actual computation of the epistemic and aleatoric uncertainties. We propose an approximation based on a Monte Carlo sampling method. We compute the epistemic uncertainty defined by 6 using the following approach:
Draw \(M\) samples \(\boldsymbol{X}^{(m)} \sim p(\boldsymbol{X} | \boldsymbol{Y})\).
For each \(\boldsymbol{X}^{(m)}\), draw \(L\) samples \[\boldsymbol{x}_*^{(m,\ell)} \sim p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}^{(m)}, \boldsymbol{Y}).\]
The Monte Carlo estimate is then \[\begin{align} \label{eq:eu} &\sigma_{\text{epistemic}, d,*}^2 \approx \frac{1}{M} \sum_{m=1}^{M} \Bigg[ \boldsymbol{\mu}_\phi^\top \text{Cov}[\boldsymbol{\theta}_d | \boldsymbol{X}^{(m)}, \boldsymbol{Y}] \boldsymbol{\mu}_\phi \notag\\ &\quad+ \text{Tr} \big( \text{Cov}_{p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}^{(m)}, \boldsymbol{Y})} [\boldsymbol{\phi}(\boldsymbol{x}_*)] \text{Cov}[\boldsymbol{\theta}_d | \boldsymbol{X}^{(m)}, \boldsymbol{Y}] \big) \Bigg] \notag \\ &\quad + \frac{1}{ML} \sum_{m=1}^{M} \sum_{\ell=1}^{L} \big( g_d^{(m,\ell)} - \bar{g}_d \big)^2, \end{align}\tag{8}\] where \(g_d^{(m,\ell)} = \boldsymbol{\phi}(\boldsymbol{x}_*^{(m,\ell)})^\top \mathbb{E}[\boldsymbol{\theta}_d | \boldsymbol{X}^{(m)}, \boldsymbol{Y}]\) is the mean prediction for sample \((m,\ell)\) and \(\bar{g}_d = \frac{1}{ML}\sum_{m=1}^{M}\sum_{\ell=1}^{L} g_d^{(m,\ell)}\) is its sample average, so that the last term is the sample variance of the mean predictions.
The aleatoric uncertainty from 7 is approximated by \[\begin{align} \label{eq:au} \sigma_{\text{aleatoric}, d,*}^2 \approx \frac{1}{M} \sum_{m=1}^{M} \sigma_d^{2(m)}, \end{align}\tag{9}\] where \(\sigma_d^{2(m)}\) denotes the noise-variance sample drawn jointly with \(\boldsymbol{X}^{(m)}\); because it does not depend on \(\boldsymbol{x}_*\), no inner average over \(\ell\) is needed.
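Putting the two estimators together, the steps above can be sketched in Python as follows. Since actual MCMC output is outside the scope of a snippet, random placeholder arrays stand in for the posterior quantities \(\mathbb{E}[\boldsymbol{\theta}_d|\boldsymbol{X}^{(m)},\boldsymbol{Y}]\), \(\text{Cov}[\boldsymbol{\theta}_d|\boldsymbol{X}^{(m)},\boldsymbol{Y}]\), \(\boldsymbol{x}_*^{(m,\ell)}\), and \(\sigma_d^{2(m)}\); the last epistemic term is computed as the sample variance of the mean predictions over all \((m,\ell)\) draws:

```python
import numpy as np

rng = np.random.default_rng(4)
M, L, J, d_x = 50, 50, 40, 2  # illustrative sample counts and dimensions
omega = rng.standard_normal((J // 2, d_x))

def phi(x):
    proj = omega @ x
    return np.sqrt(2.0 / J) * np.concatenate([np.cos(proj), np.sin(proj)])

# Random placeholders standing in for posterior samples obtained by MCMC
theta_mean = 0.1 * rng.standard_normal((M, J))           # E[theta_d | X^(m), Y]
theta_cov = np.tile(0.01 * np.eye(J), (M, 1, 1))         # Cov[theta_d | X^(m), Y]
x_star_samples = 0.5 * rng.standard_normal((M, L, d_x))  # x_*^(m,l)
sigma2_samples = 0.2 + 0.01 * rng.random(M)              # noise-variance samples

g = np.zeros((M, L))  # g[m, l] = phi(x_*^(m,l))^T E[theta_d | X^(m), Y]
epistemic = 0.0
for m in range(M):
    Phi = np.stack([phi(x) for x in x_star_samples[m]])  # (L, J) feature matrix
    mu_phi = Phi.mean(axis=0)                            # MC estimate of mu_phi
    cov_phi = np.cov(Phi.T, bias=True)                   # MC covariance of phi(x_*)
    epistemic += mu_phi @ theta_cov[m] @ mu_phi          # parameter-uncertainty term
    epistemic += np.trace(cov_phi @ theta_cov[m])        # interaction term
    g[m] = Phi @ theta_mean[m]                           # mean predictions
# Average over m, then add the sample variance of the mean predictions
epistemic = epistemic / M + g.var()

aleatoric = sigma2_samples.mean()  # average of the noise-variance samples
```

In a full implementation the placeholder arrays would come directly from the HMC/MCMC chains described earlier, but the arithmetic of the estimators is unchanged.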
In this section, we describe our experiments and present insights gained from them. We generated \(N = 1{,}000\) four-dimensional observations \(\boldsymbol{y}_n\), with 800 used for training and the remaining 200 reserved for testing. Each observed vector \(\boldsymbol{y}_n\) was obtained by first drawing a latent variable \(\boldsymbol{x}_n\in\mathbb{R}^2\) according to \(\boldsymbol{x}_n \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I})\), and then applying four different functions to obtain \(y_{d,n}\) for \(d = 1, 2, 3, 4\). More specifically, the observations were generated as follows:
linear function (baseline): \[y_{1,n} = \boldsymbol{w}_1^\top \boldsymbol{x}_n + \epsilon_{1,n}, \quad \epsilon_{1,n} \sim \mathcal{N}(0, \sigma_\epsilon^2),\]
nonlinear squared function: \[y_{2,n} = (\boldsymbol{w}_2^\top \boldsymbol{x}_n)^2 + \epsilon_{2,n}, \quad \epsilon_{2,n} \sim \mathcal{N}(0, \sigma_\epsilon^2),\]
periodic function: \[y_{3,n} = \sin(\boldsymbol{w}_3^\top \boldsymbol{x}_n) + \epsilon_{3,n}, \quad \epsilon_{3,n} \sim \mathcal{N}(0, \sigma_\epsilon^2),\]
discontinuous step function: \[y_{4,n} = \begin{cases} 1 + \epsilon_{4,n}, & \text{if } \boldsymbol{w}_4^\top \boldsymbol{x}_n > 0, \\ -1 + \epsilon_{4,n}, & \text{otherwise}, \end{cases} \quad \epsilon_{4,n} \sim \mathcal{N}(0, \sigma_\epsilon^2),\]
where the variance of the noise in all the outputs was \(\sigma_\epsilon^2=1\), and the weights \(\boldsymbol{w}_d\in\mathbb{R}^2\) were drawn independently from \({\cal N}(\boldsymbol{0},\sigma_w^2 \boldsymbol{I}_2)\), with \(\sigma_w^2=1.\)
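The data-generating process above can be reproduced with the following Python sketch; the latent-variable variance \(\sigma^2 = 1\) and the random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d_x = 1_000, 2
sigma_eps, sigma_w = 1.0, 1.0

X = rng.standard_normal((N, d_x))              # x_n ~ N(0, I), sigma^2 = 1 assumed
W = sigma_w * rng.standard_normal((4, d_x))    # w_d ~ N(0, sigma_w^2 I_2)
eps = sigma_eps * rng.standard_normal((N, 4))  # observation noise

Y = np.column_stack([
    X @ W[0] + eps[:, 0],                           # linear (baseline)
    (X @ W[1]) ** 2 + eps[:, 1],                    # nonlinear squared
    np.sin(X @ W[2]) + eps[:, 2],                   # periodic
    np.where(X @ W[3] > 0, 1.0, -1.0) + eps[:, 3],  # discontinuous step
])

Y_train, Y_test = Y[:800], Y[800:]  # 800 training / 200 test observations
```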
During training, we drew \(M=100\) samples from \(p(\boldsymbol{X}|\boldsymbol{Y})\) using the variational distribution of \(\boldsymbol{X}\). For each test sample, \(\boldsymbol{y}_{o,*}\) and \(\boldsymbol{X}^{(m)}\), we generated \(L=100\) samples of \(\boldsymbol{x}_*\) according to \(\boldsymbol{x}_*^{(m,\ell)} \sim p(\boldsymbol{x}_* | \boldsymbol{y}_{o,*}, \boldsymbol{X}^{(m)}, \boldsymbol{Y})\). We used the generated training samples \(\boldsymbol{X}^{(m)}\) to compute the epistemic and aleatoric uncertainties defined by 8 and 9 , respectively. We conducted experiments where the number of drawn frequencies took values from the set \(J\in\{10, 25, 50, 100, 200, 500\}\). All the frequencies were obtained by sampling from the power spectral density of the radial-basis-function (RBF) kernel.
During evaluation, we computed the aleatoric and epistemic uncertainties as described in the previous section. In each trial, we assumed that one of the output variables, \(y_{d,*}\), was missing and had to be predicted using the remaining variables. For example, to predict \(y_{1,*}\), we used the observed values \(y_{2,*}\), \(y_{3,*}\), and \(y_{4,*}\) to estimate the latent variable \(\boldsymbol{x}_*\), which was then used to infer \(y_{1,*}\). The boxplots in Figs. 1 and 2 summarize average aleatoric and epistemic variances computed on 200 test samples per trial, following training. Each boxplot is constructed from 50 such averages obtained across independent simulations. The circles denote outlier trials. The plots illustrate the consistency of uncertainty estimates and the variability across test instances.
We observe that the aleatoric uncertainty of \(y_{1,*}\), which corresponds to the linear function, was estimated accurately and did not vary across values of \(J\). In contrast, the aleatoric uncertainty for \(y_{4,*}\), which corresponds to the step function, was significantly overestimated due to the limited ability of GPs to model discontinuities. The epistemic uncertainty reached its highest value for \(y_{4,*}\), which reveals the model’s lack of confidence in regions with abrupt changes, and its lowest value for \(y_{3,*}\), the periodic function output, which is smooth and more suitable for GP models.
In this paper, we presented a systematic framework for UQ in probabilistic ML models with a focus on GPLVMs. We provided a theoretical formulation for decomposing predictive uncertainty into epistemic and aleatoric components and introduced a Monte Carlo sampling-based approach for their estimation. Using RFF-based GPs, we demonstrated a scalable and efficient method to approximate predictive distributions. Our experimental results provided insight into the impact of uncertainty estimation on model reliability. The computation of uncertainties relied on Monte Carlo simulations, which in turn introduced additional sources of uncertainty. This was not accounted for in this paper and is left for future work.
Work done while at Stony Brook University.