September 10, 2025
We study a missing-value imputation method, termed kNNSampler, that imputes a given unit’s missing response by randomly sampling from the observed responses of the \(k\) most similar units to the given unit in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike the popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments demonstrate its effectiveness in recovering the distribution of missing values. The code for kNNSampler is made publicly available.1
Keywords: missing values imputation, k nearest neighbours, conditional distribution, kernel mean embedding
Missing values occur in real-world datasets for various reasons, such as non-response in surveys and sensor failures. Imputation — filling in missing values with their estimates — is a common preprocessing step used to address missing data. Over the decades, various imputation methods have been proposed, ranging from simple statistical techniques to machine learning algorithms, e.g., [1], [2]–[6].
kNNImputer [7] is one of the most widely used imputation methods, owing to its simplicity and availability in popular software packages such as scikit-learn2 [8]. It imputes a missing response variable (e.g., customer satisfaction level) of a given unit (e.g., a customer) as the average of the observed responses of the \(k\) most similar units to the given unit in terms of observed covariates (e.g., age, gender, occupation). In effect, it predicts the missing response by \(k\) nearest neighbours (kNN) regression [9], so the imputation is an estimate of the conditional expectation of the missing response given a covariate. The method has been widely used in science and engineering, and many extensions have been proposed [10]–[14].
Figure 1: Comparison of imputations by kNNImputer (left) and kNNSampler (right). In each figure, \(x\) and \(y\) are the covariate and response, respectively. Blue points are observed covariate-response pairs, green points are true missing values and red points are imputed values. For details, see Section 4.
An issue with kNNImputer, shared by other regression-based imputers, is that the distribution of imputations can be significantly different from the distribution of true (hidden) missing values. This is because, as mentioned, an imputation of kNNImputer is an estimate of the conditional expectation of a missing response and thus tends to be a deterministic function of the covariate. As a result, the distribution of imputed responses is concentrated around the regression curve, even when the distribution of missing responses has large variability. This is illustrated in Figures 1 and 2, where the true conditional distribution of a missing response is bimodal when the covariate is small, but the distribution of imputations is unimodal and many imputations take values never realized by the true missing values. A substantial bias can occur in an analysis of such a distorted imputed dataset, for example, when estimating the variance, quantiles and modes in the population.
The above issue of kNNImputer may be addressed by estimating the conditional distribution of a missing response given a covariate, and randomly sampling imputations from it. This idea was investigated by [15], who proposed the “kNN\(\times\)KDE” approach that combines a soft version of kNN and kernel density estimation (KDE). For a given unit, the conditional density of a missing response is estimated as a weighted average of Gaussian densities centered at observed responses, where the weights are computed so that units more similar, in terms of covariates, to the given unit receive larger weights. kNN\(\times\)KDE was demonstrated to have good empirical performance in recovering the distribution of missing values, compared to established imputation methods, including kNNImputer, missForest [16], SoftImpute [17], and Gain [18]. However, no theoretical guarantee exists for kNN\(\times\)KDE, such as its statistical consistency, i.e., whether the estimated conditional density converges to the true one as the sample size increases. Consistency is not only important as a minimal theoretical guarantee but also in understanding how hyperparameters should be chosen. While kNN\(\times\)KDE has two main hyperparameters (the “inverse temperature” in the softmax function used for weight computations, and the variance of Gaussian densities), no systematic selection procedure was proposed.
This paper studies a simpler kNN-based stochastic imputation method named kNNSampler. For a given unit whose response is missing, it estimates the conditional distribution of the missing response given the unit’s observed covariate as the empirical distribution of the observed responses of the \(k\) most similar units to that unit in terms of covariates; an imputation is randomly sampled from this empirical distribution, which we call kNN conditional distribution. kNNSampler is as simple as kNNImputer: instead of taking the mean of the observed responses of \(k\) nearest neighbours, kNNSampler simply samples one of those \(k\) observed responses. It is thus simpler than kNN\(\times\)KDE as it does not involve an intermediate step of density estimation and is free of any hyperparameter for responses. The number \(k\) of nearest neighbours in kNNSampler can be efficiently chosen by leave-one-out cross validation using the fast computation method recently proposed by [19]. Figures 1 and 2 describe imputations by kNNSampler with \(k\) selected in this way, which align much better with the distribution of true missing values than imputations by kNNImputer. More systematic experiments are provided in Section 4.
kNNSampler can be interpreted as an instance of hot deck imputation, a family of classic methods widely used in practice for socio-economic and public health surveys, including the U.S. Census Bureau’s Current Population Survey and surveys of the National Center for Education Statistics [20]. In a hot deck method, a missing value of a given unit is imputed as one of the response values of the units belonging to the same “adjustment cell” as the given unit. The method is called random hot deck if the imputation is selected randomly from the adjustment cell; it is called nearest-neighbour hot deck if nearest neighbours define the adjustment cell [4]. kNNSampler is thus essentially a nearest-neighbour random hot deck method. However, while classic and widely used, hot deck methods have not been well established theoretically [20].
Our contribution is to establish kNNSampler, and thus the nearest-neighbour random hot deck, as a theoretically principled missing-value imputation method. To this end, we analyze the kNN conditional distribution, i.e., the empirical distribution of \(k\) nearest neighbour responses from which an imputation is sampled, as an estimator of the true conditional distribution of a missing response given a covariate (Section 3). Our theoretical contributions are summarized as follows.
We derive an error bound between the kNN and true conditional distributions for any given, fixed covariate, in terms of the number \(n\) of observed response-covariate pairs, the number \(k\) of nearest neighbours, and other problem-specific constants. The error is measured by the maximum mean discrepancy (MMD) [21], a distance metric on probability distributions that metrizes the weak convergence [22], between the kNN and true conditional distributions. The bound holds under a Lipschitz condition requiring that the response’s conditional distribution changes smoothly as the covariate changes. A consequence of the bound is the statistical consistency of the kNN conditional distribution, in that the error decreases to zero as the sample size \(n\) goes to infinity, if the number \(k\) of nearest neighbours increases to infinity at a rate slower than \(n\). This offers a theoretical foundation for kNNSampler and thus for the nearest-neighbour random hot deck.
To derive the bound, we analyze the mean embedding of the kNN conditional distribution in a reproducing kernel Hilbert space (RKHS) as a novel estimator of the mean embedding of the true conditional distribution, known as conditional mean embedding [23], which is the RKHS-valued regression function [24]. The RKHS distance between these two embeddings is equivalent to the MMD between the kNN and true conditional distributions. Our bound leads to the consistency and convergence rates for the novel kNN-based estimator of the conditional mean embedding.
Our analysis extends the error analysis by [25] on real-valued kNN regression to RKHS-valued regression in which the response variable is infinite-dimensional. As a byproduct, we prove that the required sample size to attain a given level of precision increases exponentially not with the covariate’s ambient dimension but with the intrinsic dimension of the covariate distribution. Therefore, the kNNSampler may not be severely affected by the curse of dimensionality if the covariate distribution has a low intrinsic dimension.
This paper is organised as follows. We describe the proposed approach in Section 2, its theory in Section 3, and experiments in Section 4.
This section describes the proposed approach. Section 2.1 introduces the setting. Section 2.2 explains the kNNImputer and its issue as a preliminary. We describe kNNSampler in Section 2.3, uncertainty quantification with the kNN conditional distribution in Section 2.4, and multiple imputation with kNNSampler in Section 2.5.
We first describe the problem setup. Let \(\mathcal{X}\) and \(\mathcal{Y}\) be measurable spaces representing the covariate space and the response space, respectively. For example, the covariate space may be the \(d\)-dimensional Euclidean space, \(\mathcal{X}= \mathbb{R}^d\), in which case a covariate \(x \in \mathcal{X}\) consists of \(d\) features (e.g., a person’s age, weight, height), and the response space may be the real line \(\mathcal{Y}= \mathbb{R}\), in which case a response \(y \in \mathcal{Y}\) is real-valued (e.g., the person’s blood pressure).
We assume that our dataset consists of \(n + m\) units (e.g., persons), where \(n\) units have both covariate \(x_i \in \mathcal{X}\) and response \(y_i \in \mathcal{Y}\) observed, while \(m\) units have only covariate \(\tilde{x}_j \in \mathcal{X}\) observed and response \(\tilde{y}_{j, {\rm miss}} \in \mathcal{Y}\) missing: \[\begin{align} \mathcal{D}_n := \{(x_1, y_1), \dots, (x_n, y_n)\}, \quad \mathcal{D}_{\rm miss}:= \{ ( \tilde{x}_1, \tilde{y}_{1, {\rm miss}}), \dots, (\tilde{x}_{m}, \tilde{y}_{m, {\rm miss}}) \}. \label{eq:missing-dataset} \end{align}\tag{1}\]
For each of the \(n\) units with observed responses, we assume that the covariate follows a marginal distribution \(P(x)\) and the response given the covariate follows the conditional distribution \(P(y|x)\) in an independently and identically distributed (i.i.d.) manner: \[\label{eq:observed-pairs-dist} (x_1, y_1), \dots, (x_n, y_n) \stackrel{i.i.d.}{\sim} P(y|x)P(x)\tag{2}\]
On the other hand, for the \(m\) units with missing responses, the covariate is assumed to follow a marginal distribution \(Q(\tilde{x})\), which can be different from \(P(x)\), while the conditional distribution of the missing response given the covariate remains the same: \[\label{eq:missing-pairs-joint-dist} ( \tilde{x}_1, \tilde{y}_{1, {\rm miss}}), \dots, (\tilde{x}_{m}, \tilde{y}_{m, {\rm miss}}) \stackrel{i.i.d.}{\sim} P(\tilde{y}_{\rm miss}|\tilde{x})Q(\tilde{x}).\tag{3}\] This assumption implies that the probability of a unit missing its response is determined by the unit’s covariate and is not affected by the response. Therefore, it is an instance of the Missing-At-Random (MAR) assumption [1]. In the special case where the covariate distributions for the two cases are the same, \(Q(\tilde{x}) = P(\tilde{x})\), the assumption can be interpreted as the Missing-Completely-At-Random (MCAR) assumption where missingness occurs completely randomly.
Under this setup, missing responses may be imputed by estimating the unknown conditional distribution \(P(y|x)\) of a response given a covariate, and sampling from the estimated conditional distribution. This is what the kNNSampler does.
Before describing the proposed kNNSampler, we discuss an issue with the widely used kNNImputer [7] and other regression-based imputation methods.
Suppose that the covariate space \(\mathcal{X}\) is equipped with a distance metric \(d_{\mathcal{X}}(x,x')\) that quantifies the distance between any two points \(x, x' \in \mathcal{X}\). For example, if \(\mathcal{X}\) is the Euclidean space, then \(d_{\mathcal{X}}(x,x')\) may be the Euclidean distance between two vectors \(x\) and \(x'\). Let \(X_n\) be the set of covariates for the \(n\) units with observed responses: \[X_n := \{ x_1, \dots, x_n \}\] For a given covariate \(\tilde{x}\) and a number \(k\) of nearest neighbours, let \({\rm NN}(\tilde{x}, k, X_n)\) be the indices of the \(k\) units whose covariates are the most similar to \(\tilde{x}\) in terms of the distance metric among the \(n\) units with observed responses:3 \[\begin{align} \label{eq:nearest-neighbours} {\rm NN}(\tilde{x}, k, X_n) := \{j_1, \ldots, j_k \in \{1, \ldots, n\} \mid & ~~ d_\mathcal{X}(\tilde{x}, x_{j_1}) \le \ldots \le d_\mathcal{X}(\tilde{x}, x_{j_k})\\ &\le d_\mathcal{X}(\tilde{x}, x_j) \text{ for all } j \in \{1, \ldots, n\} \setminus \{j_1, \ldots, j_k\} \}. \nonumber \end{align}\tag{4}\] That is, \({\rm NN}(\tilde{x}, k, X_n)\) is the indices of the \(k\) nearest neighbours of \(\tilde{x}\) in \(X_n\).
kNNImputer [7] imputes the missing response \(\tilde{y}_{i, {\rm miss}}\) of the unit with observed covariate \(\tilde{x}_i\) as the average of the observed responses \(y_{j_1}, \dots, y_{j_k}\) of its \(k\) nearest neighbours \(x_{j_1}, \dots, x_{j_k}\): \[\hat{y}_{i, {\rm imp}} = \frac{1}{k} \sum_{j \in {\rm NN}(\tilde{x}_i, k, X_n)} y_j.\] This is kNN regression [26] and thus estimates the conditional mean of the missing response \(\tilde{y}_{i, {\rm miss}}\) given the observed covariate \(\tilde{x}_i\): \[\hat{y}_{i, {\rm imp}} \approx f(\tilde{x}_i) := \int \tilde{y} ~dP(\tilde{y} | \tilde{x}_i),\] where \(f: \mathcal{X}\to \mathcal{Y}\) is the regression function. In this case, the observed covariate and the imputed response \((\tilde{x}_i, \hat{y}_{i, {\rm imp}} )\) approximately follow the degenerate joint distribution \[\delta(\tilde{y}-f(\tilde{x}))Q(\tilde{x}),\] where \(\delta(\tilde{y}-f(\tilde{x}))\) denotes the Dirac distribution at the conditional mean \(f(\tilde{x})\), i.e., the degenerate distribution whose mass concentrates at \(f(\tilde{x})\). This differs from the joint distribution of the observed covariate and the true missing response \((\tilde{x}_i, \tilde{y}_{i, {\rm miss}})\): \[\label{eq:true-dist-missing} P(\tilde{y} \mid \tilde{x})Q(\tilde{x})\tag{5}\] unless the conditional distribution \(P(\tilde{y} \mid \tilde{x})\) is the Dirac distribution \(\delta(\tilde{y}-f(\tilde{x}))\), i.e., unless the missing response is a deterministic function of the observed covariate. The same issue occurs with other single imputation methods based on regression, because they impute the missing response by estimating the conditional mean.
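As a concrete illustration, the following is a minimal sketch (not the authors’ code) of this kNN-regression view of kNNImputer for a single real-valued response with fully observed covariates; note that scikit-learn’s KNNImputer is designed for general missing entries in a data matrix rather than this single-response setup, so we use KNeighborsRegressor to match the description above. The function and variable names are ours.

```python
# Minimal sketch: impute missing responses by the average response of the k
# nearest observed units in covariate space (the kNN-regression view above).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_mean_impute(X_obs, y_obs, X_miss, k=5):
    """Return conditional-mean estimates of the missing responses."""
    reg = KNeighborsRegressor(n_neighbors=k)  # uniform weights = plain average
    reg.fit(np.asarray(X_obs), np.asarray(y_obs))
    return reg.predict(np.asarray(X_miss))
```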
To summarize, kNNImputer and other regression-based imputation methods do not generally recover the true distribution of the missing data. An analysis based on the imputed dataset may lead to a biased result. For instance, the variance of the imputed values may be much lower than the variance of the true missing values. kNNSampler alleviates this issue by imputing missing values by estimating the conditional distribution \(P(\tilde{y}\mid \tilde{x})\).
We now describe kNNSampler (Algorithm 3). Consider imputing the missing response \(\tilde{y}_{{\rm miss}}\) of a unit with observed covariate \(\tilde{x}\). kNNSampler estimates the conditional distribution \(P(\tilde{y}_{{\rm miss}} \mid \tilde{x})\) of \(\tilde{y}_{{\rm miss}}\) given \(\tilde{x}\) as the empirical distribution of the observed responses \(y_{j_1}, \dots, y_{j_k}\) of the \(k\) nearest neighbours \(x_{j_1}, \dots, x_{j_k}\) of \(\tilde{x}\): \[\label{eq:cond-dist-est} P(\tilde{y}_{{\rm miss}} \mid \tilde{x}) \approx \hat{P}(\tilde{y}_{{\rm miss}} \mid \tilde{x}) := \frac{1}{k} \sum_{j \in {\rm NN}(\tilde{x}, k, X_n)} \delta(\tilde{y}_{{\rm miss}} - y_j),\tag{6}\] which is the discrete distribution where each of \(y_{j_1}, \dots, y_{j_k}\) has probability mass \(1/k\). An imputation \(\hat{y}_{{\rm imp}}\) for the missing response is randomly sampled from this empirical distribution: \[\hat{y}_{{\rm imp}} \sim \hat{P}(\tilde{y}_{{\rm miss}} \mid \tilde{x}).\] Algorithmically, this amounts to randomly sampling one of the \(k\) observed responses \(y_{j_1}, \dots, y_{j_k}\). Algorithm 3 independently applies this procedure to the observed covariate \(\tilde{x}_i\) to generate an imputation \(\hat{y}_{i, {\rm imp}}\) of the missing value \(\tilde{y}_{i, {\rm miss}}\) for each unit \(i = 1, \dots, m\).
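A minimal sketch of this procedure, for Euclidean covariates and a single real-valued response, is given below. It is an illustration under the notation above, not the authors’ released code, and the function names are ours.

```python
# Minimal sketch of kNNSampler: for each unit with a missing response, find
# the k nearest observed covariates and sample one of their responses
# uniformly at random, i.e., draw from the kNN conditional distribution (6).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sampler_impute(X_obs, y_obs, X_miss, k, rng=None):
    rng = np.random.default_rng(rng)
    y_obs = np.asarray(y_obs)
    nn = NearestNeighbors(n_neighbors=k).fit(X_obs)
    _, idx = nn.kneighbors(X_miss)                 # idx[i] = NN(x_tilde_i, k, X_n)
    picks = rng.integers(0, k, size=len(X_miss))   # one neighbour per missing unit
    return y_obs[idx[np.arange(len(X_miss)), picks]]
```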
The number \(k\) of nearest neighbours is a hyperparameter of kNNSampler. The theoretical and empirical results below indicate that \(k\) should not be fixed to a prespecified value (e.g., \(k=5\)) but should be chosen depending on the available data. One way is to perform cross-validation for kNN regression on the data \((x_1, y_1), \dots, (x_n, y_n)\) and select, among candidate values, the \(k\) that minimizes the mean squared error on held-out observed responses, averaged over different training-validation splits. In particular, the present work uses leave-one-out cross-validation (LOOCV) with the fast computation method recently proposed by [19].
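The fast LOOCV computation of [19] is not reproduced here; as an illustration only, a naive LOOCV selection of \(k\) for kNN regression can be sketched as follows (assuming no duplicate covariates, so that each point’s nearest neighbour in the full sample is the point itself).

```python
# Naive LOOCV for choosing k in kNN regression: query k_max+1 neighbours of
# each observed point, drop the point itself, and evaluate the held-out
# mean squared error of the kNN mean prediction for each candidate k.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_k_loocv(X_obs, y_obs, candidate_ks):
    y_obs = np.asarray(y_obs)
    k_max = max(candidate_ks)
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X_obs)
    _, idx = nn.kneighbors(X_obs)      # idx[i, 0] == i if covariates are distinct
    neighbours = idx[:, 1:]            # leave-one-out neighbour indices
    loocv_mse = {
        k: np.mean((y_obs - y_obs[neighbours[:, :k]].mean(axis=1)) ** 2)
        for k in candidate_ks
    }
    return min(loocv_mse, key=loocv_mse.get)
```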
Quantifying the uncertainty in missing values is important for several reasons, including assessing the reliability of imputations and the adequacy of the covariates used, as well as determining how to perform imputations (e.g., single or multiple) and how to use the imputations in subsequent analyses. We describe here how to perform uncertainty quantification of missing values with the kNN conditional distribution.
kNNSampler can be used to estimate the conditional probability of a missing response \(\tilde{y}_{\rm miss}\) belonging to a specified (measurable) subset \(S\) of the response space \(\mathcal{Y}\), given observed covariate \(\tilde{x}\): \[{\rm Pr}( \tilde{y}_{\rm miss}\in S \mid \tilde{x}) = \int ~\mathbb{I}[\tilde{y}\in S]~ dP( \tilde{y}\mid \tilde{x}),\] where \(\mathbb{I}[\tilde{y}\in S]\) is the indicator function that outputs \(1\) if \(\tilde{y}\in S\) and \(0\) otherwise. By replacing the unknown conditional distribution \(P(\tilde{y}~|~ \tilde{x})\) by the kNN conditional distribution \(\hat{P}(\tilde{y}~|~ \tilde{x})\) in (6 ), this conditional probability is approximated as \[\widehat{{\rm Pr}}( \tilde{y}_{\rm miss}\in S \mid \tilde{x}) = \int ~\mathbb{I}[\tilde{y}\in S]~ d\hat{P}( \tilde{y}\mid \tilde{x}) = \frac{1}{k} \sum_{j \in {\rm NN}(\tilde{x}, k, X_n)} ~\mathbb{I} [y_j \in S].\] In other words, the conditional probability is estimated as the observed frequency of the kNN response values that fall in \(S\).
Let us focus on a real-valued missing response \(\tilde{y}_{{\rm miss}} \in \mathcal{Y} = \mathbb{R}\). The conditional probability of the missing response belonging to a given (finite or infinite) interval \(S = (\ell,u)\), where \(\ell< u\), is estimated as the observed frequency of the kNN responses belonging to that interval. This indicates that an interval containing a specified fraction \(1-\alpha\) of the kNN responses, where \(0< 1-\alpha < 1\) (e.g., \(\alpha = 0.05\), in which case 95% of the kNN responses belong to the interval), is an estimate of an interval to which the unknown missing response belongs with probability \(1 - \alpha\).
Such an interval \((\ell, u)\) is constructed by defining its lower bound \(\ell\) and upper bound \(u\) as, respectively, the lower and upper \(\alpha/2\) empirical quantiles of the kNN responses, i.e., the \(k \alpha/2\)-smallest and the \(k \alpha/2\)-largest kNN responses (e.g., if \(k = 200\) and \(\alpha = 0.05\), the 5th smallest and the 5th largest kNN responses): \[{\rm Pr}( \ell < \tilde{y}_{{\rm miss}} < u \mid \tilde{x}) \approx 1 - \alpha\]
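A minimal sketch of these estimates (conditional probabilities and prediction intervals from the kNN responses) is given below; the function name and interface are illustrative, not part of the released code.

```python
# For a query covariate, estimate Pr(y in S | x) by the frequency of the k
# nearest responses in S, and a level-(1 - alpha) prediction interval by the
# empirical alpha/2 and 1 - alpha/2 quantiles of those responses.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_uncertainty(X_obs, y_obs, x_query, k, alpha=0.05, S=None):
    y_obs = np.asarray(y_obs)
    nn = NearestNeighbors(n_neighbors=k).fit(X_obs)
    _, idx = nn.kneighbors(np.atleast_2d(x_query))
    y_nn = y_obs[idx[0]]                                # the k nearest responses
    lower, upper = np.quantile(y_nn, [alpha / 2, 1 - alpha / 2])
    prob_S = None if S is None else float(np.mean(S(y_nn)))
    return (lower, upper), prob_S

# Example: (lo, hi), p = knn_uncertainty(X_obs, y_obs, x_query, k=200, S=lambda y: y > 0)
```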
The conditional standard deviation of a missing response given observed covariate quantifies the variability of the missing response. This can be estimated by the empirical standard deviation of the kNN response values for the observed covariate.
kNNSampler can be used for multiple imputation by independently generating multiple imputed datasets. More precisely, let \(B\) be the number of multiple imputed datasets to be generated (e.g., \(B = 10\)). For each \(b = 1, \dots, B\), kNNSampler is independently applied to impute the missing responses in the dataset \(\mathcal{D}_{\rm miss}\) (1 ) to create an imputed dataset \[\mathcal{D}_{n+m}^{(b)} := \mathcal{D}_n \cup \mathcal{D}_{\rm imp}^{(b)} \quad \text{where} \quad \mathcal{D}_{\rm imp}^{(b)} := \{ ( \tilde{x}_1, \tilde{y}_{1, {\rm imp}}^{(b)}), \dots, (\tilde{x}_{m}, \tilde{y}_{m, {\rm imp}}^{(b)}) \},\] where \(\tilde{y}_{i, {\rm imp}}^{(b)}\) is an imputation for the \(i\)-th unit with missing response \(\tilde{y}_{i,{\rm miss}}\) and covariate \(\tilde{x}_i\). This results in \(B\) imputed datasets: \[\mathcal{D}_{n+m}^{(1)} , \dots, \mathcal{D}_{n+m}^{(B)}.\] An analysis can then be made based on the standard procedure of multiple imputation [27].
For example, suppose that we want to estimate a population quantity \(\theta^*\) (e.g., the mean customer satisfaction level of a population). Let \(S\) be a function of a dataset of size \(n+m\) that outputs an estimate \(\hat{\theta}_{n+m}\) of the unknown \(\theta^*\) (e.g., the empirical average of \(n+m\) values): \(\hat{\theta}_{n+m} = S(\mathcal{D}_{n+m})\). Applying this function to each of the \(B\) imputed datasets, one obtains \(B\) estimates of \(\theta^*\): \[\hat{\theta}^{(b)}_{n+m} = S(\mathcal{D}_{n+m}^{(b)}), \quad b = 1, \dots, B.\] The empirical average of these \(B\) estimates gives a multiple-imputation estimate of \(\theta^*\). The empirical standard deviation of the \(B\) estimates \(\hat{\theta}_{n+m}^{(1)}, \dots, \hat{\theta}_{n+m}^{(B)}\) quantifies the uncertainty due to the missingness in the original data. Combined with the standard error of each \(\hat{\theta}_{n+m}^{(b)}\), this standard deviation can be used to quantify the overall uncertainty of the estimate using Rubin’s rule.
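As an illustration, a sketch of this procedure is given below, assuming the `knn_sampler_impute` function sketched in Section 2.3 and a user-supplied analysis function `estimator` that returns a point estimate and its standard error for a completed dataset; both names are ours, and the combination step is the standard Rubin's rule.

```python
# Multiple imputation with kNNSampler, combined by Rubin's rule:
# point estimate = mean of the B estimates; total variance = within-imputation
# variance + (1 + 1/B) * between-imputation variance.
import numpy as np

def rubin_multiple_imputation(X_obs, y_obs, X_miss, k, estimator, B=10, seed=0):
    rng = np.random.default_rng(seed)
    thetas, ses = [], []
    for _ in range(B):
        y_imp = knn_sampler_impute(X_obs, y_obs, X_miss, k, rng=rng)
        X_full = np.vstack([X_obs, X_miss])
        y_full = np.concatenate([np.asarray(y_obs), y_imp])
        theta_b, se_b = estimator(X_full, y_full)
        thetas.append(theta_b)
        ses.append(se_b)
    thetas, ses = np.array(thetas), np.array(ses)
    within = np.mean(ses ** 2)                 # average squared standard error
    between = thetas.var(ddof=1)               # variance across imputed datasets
    total_var = within + (1.0 + 1.0 / B) * between
    return thetas.mean(), np.sqrt(total_var)   # combined estimate and std. error
```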
We describe a theory for kNNSampler’s conditional distribution (6 ) as an estimator of the true conditional distribution. We shall show that, as the number \(k\) of nearest neighbours increases at an appropriate rate with the number \(n\) of observed covariate-response pairs, the kNN conditional distribution converges to the true one in the maximum mean discrepancy (MMD) [21], which implies convergence in distribution [28]. We prove this by adapting the proof of the convergence rates of real-valued kNN regression by [25] to Hilbert space-valued kNN regression.4
We use the framework of kernel mean embedding [23] in which every probability distribution is represented as a distinct point in an infinite-dimensional feature space known as a reproducing kernel Hilbert space (RKHS). The true and kNN conditional distributions are represented as points in an RKHS, and the distance between them, which is the MMD, quantifies the estimation error. An upper bound on this distance is obtained in terms of the sample size, the number of nearest neighbours, and other relevant quantities.
Let us first define an RKHS on the response space \(\mathcal{Y}\). As before, \(\mathcal{Y}\) is a measurable space such as the \(p\)-dimensional Euclidean space, \(\mathcal{Y}= \mathbb{R}^p\). A Hilbert space5 \(\mathcal{H}\) consisting of functions \(f\) on \(\mathcal{Y}\) is called an RKHS if there exists a map \[\Phi: \mathcal{Y}\to \mathcal{H},\] called the feature map, such that the value \(f(y)\) of any function \(f\) in \(\mathcal{H}\) at any point \(y\) in \(\mathcal{Y}\) can be written as the inner product between \(f\) and the feature map \(\Phi(y)\) of \(y\): \[f(y) = \left< f, \Phi(y) \right>_{\mathcal{H}} \quad \text{for all }\;f \in \mathcal{H}\;\text{and }\;y \in \mathcal{Y},\] where \(\left< \cdot, \cdot \right>_{\mathcal{H}}\) denotes the inner product of \(\mathcal{H}\). The element \(\Phi(y)\) may be called the feature vector of \(y\), and \(\mathcal{H}\) the feature space, which can be infinite-dimensional.
The inner product between the feature maps \(\Phi(y), \Phi(y')\) of any two points \(y, y'\) defines the kernel function \[\label{eq:kernel-inner-prod} \ell(y, y') := \left< \Phi(y), \Phi(y') \right>_{\mathcal{H}} \quad \text{for all }\;y, y' \in \mathcal{Y}.\tag{7}\] This is called the reproducing kernel of the RKHS. The RKHS and the reproducing kernel are in one-to-one correspondence, so an RKHS can be induced by defining a kernel. For example, if \(\mathcal{Y}= \mathbb{R}^p\), the Gaussian kernel \(\ell(y, y') = \exp( - \alpha \| y - y'\|^2 )\) for \(\alpha > 0\) is the reproducing kernel of a certain RKHS \(\mathcal{H}\), and there exists an infinite-dimensional feature map \(\Phi\) that induces the Gaussian kernel as (7 ). See e.g. [30], [31] for details on RKHSs.
Every probability distribution \(P\) on \(\mathcal{Y}\) is represented as the expected feature map: \[\Phi(P) := \int \Phi(y)dP(y) \in \mathcal{H}.\] This is called the mean embedding of \(P\). If the RKHS \(\mathcal{H}\) is large enough, any two different probability distributions \(P\) and \(Q\) are mapped to two distinct mean embeddings: \[P \not= Q \quad \Longleftrightarrow \quad \Phi(P) \not= \Phi(Q).\] In this case, the RKHS is called characteristic [28]. For example, Gaussian, Matérn and Laplace kernels induce characteristic RKHSs.
The true and kNN conditional distributions in (2 ) and (6 ) are represented as their mean embeddings: \[\begin{align} \label{eq:mean-embeddings-cond-dists} \Phi(P(\cdot \mid x)) := \int \Phi(y) dP(y|x) \quad \text{and} \quad \Phi( \hat{P}( \cdot \mid x) ) := \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \Phi(y_j) \quad \text{for all }\;x \in \mathcal{X}. \end{align}\tag{8}\] Here, the dot “\(\;\cdot \;\)” is used in the notation of the conditional distributions to emphasize that they are probability distributions on \(\mathcal{Y}\) and do not depend on a specific value of \(y \in \mathcal{Y}\). The RKHS distance between the two conditional mean embeddings is the MMD between the true and kNN conditional distributions. It is used as an error metric of the kNN conditional distribution and theoretically analyzed in the following.
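For intuition, when two distributions on \(\mathcal{Y}\) are only available through samples, the squared MMD between their empirical versions can be computed in closed form from the kernel. The sketch below uses a Gaussian kernel with an illustrative bandwidth; function and parameter names are ours (and, in the analysis, the true conditional distribution is of course not available as a sample).

```python
# Squared MMD between the empirical distributions of two samples y1, y2 of
# real-valued responses, with kernel l(y, y') = exp(-gamma * (y - y')**2):
# MMD^2 = mean(K11) + mean(K22) - 2 * mean(K12).
import numpy as np

def squared_mmd(y1, y2, gamma=1.0):
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    k11 = np.exp(-gamma * (y1[:, None] - y1[None, :]) ** 2)
    k22 = np.exp(-gamma * (y2[:, None] - y2[None, :]) ** 2)
    k12 = np.exp(-gamma * (y1[:, None] - y2[None, :]) ** 2)
    return k11.mean() + k22.mean() - 2.0 * k12.mean()
```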
The mean embedding of the conditional distribution is known as conditional mean embedding [32], [33] and its estimator based on a regularized least-squares algorithm has been studied extensively [24], [34], [35]. The mean embedding of the kNN conditional distribution in (8 ) is a new estimator of the conditional mean embedding. Its analysis below is thus a new contribution to the RKHS literature and may be of independent interest.
We describe key assumptions for the analysis, which follow [25] with appropriate modifications.
The conditional mean embedding in (8 ) is the conditional expectation of the response feature vector \(\Phi(y)\) given a covariate \(x \in \mathcal{X}\); thus, it is the RKHS-valued regression function [24]. We assume that the map from a covariate \(x\) to the conditional mean embedding \(\Phi(P(\cdot \mid x))\) is smooth in the sense that it is Lipschitz continuous.
Assumption 1. There exists a constant \(\lambda > 0\) such that the RKHS distance between the conditional mean embeddings for any two inputs \(x, x' \in \mathcal{X}\) is bounded by the distance between \(x\) and \(x'\) times \(\lambda\): \[\left\| \Phi(P(\cdot \mid x)) - \Phi(P(\cdot \mid x')) \right\|_\mathcal{H}\leq \lambda~ d_\mathcal{X}(x,x') \quad \text{for all }\;x, x' \in \mathcal{X},\] where \(\| \cdot \|_\mathcal{H}\) is the norm of the RKHS \(\mathcal{H}\).
Our next assumption is that the reproducing kernel (7 ) is bounded on \(\mathcal{Y}\). This is a mild assumption satisfied by many commonly used kernels such as Gaussian, Matérn and Laplace kernels.
Assumption 2. There exists a constant \(C_{\rm ker} > 0\) that upper-bounds the value of the reproducing kernel (7 ): \[0 \leq \ell(y, y') \leq C_{\rm ker}^2 \quad \text{for all }\;y, y' \in \mathcal{Y}.\]
It can be easily shown that this assumption implies that the RKHS distance between the conditional mean embedding and any response’s feature vector is bounded: \[\label{eq:bound-output-noise} \left\| \Phi(P(\cdot \mid x)) - \Phi(y) \right\|_\mathcal{H}\leq \sqrt{2} C_{\rm ker} \quad \text{for all }\;x \in \mathcal{X}\;\text{and }\;y \in \mathcal{Y}.\tag{9}\] This implies that the “noise” in the RKHS-valued regression is bounded.
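To see the bound (9 ), one can expand the squared RKHS norm and use \(0 \leq \ell \leq C_{\rm ker}^2\) from Assumption 2: \[\left\| \Phi(P(\cdot \mid x)) - \Phi(y) \right\|_\mathcal{H}^2 = \left\| \Phi(P(\cdot \mid x)) \right\|_\mathcal{H}^2 - 2 \left< \Phi(P(\cdot \mid x)), \Phi(y) \right>_\mathcal{H}+ \ell(y,y) \leq C_{\rm ker}^2 + 0 + C_{\rm ker}^2 = 2 C_{\rm ker}^2,\] since \(\left\| \Phi(P(\cdot \mid x)) \right\|_\mathcal{H}^2 = \int \int \ell(y', y'') dP(y' \mid x) dP(y'' \mid x) \leq C_{\rm ker}^2\) and \(\left< \Phi(P(\cdot \mid x)), \Phi(y) \right>_\mathcal{H}= \int \ell(y', y) dP(y' \mid x) \geq 0\).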
The next assumption is about the intrinsic dimension of the marginal distribution \(P(x)\) on the covariate space, which can be much smaller than the covariate’s dimension \(p\) if \(x \in \mathbb{R}^p\). The error of the kNN conditional distribution shall be shown to decrease as the sample size increases at a rate depending on the intrinsic dimension, not the covariate’s dimension. Let \(B(x,r) \subset \mathcal{X}\) denote the ball of center \(x \in \mathcal{X}\) and radius \(r > 0\): \[B(x,r) := \left\{ x' \in \mathcal{X}\mid d_\mathcal{X}(x, x') \leq r \right\}.\]
Assumption 3. For the marginal distribution \(P(x)\) on the covariate space \(\mathcal{X}\), there are positive constants \(C_{\rm dist} > 0\), \(r_{\rm max} > 0\), and \(d > 0\) such that \[P( B(x,r) ) \leq C_{\rm dist} \epsilon^{-d} P( B(x, \epsilon r) ) \quad \text{for all }\;0 < r < r_{\rm max} \;\text{and all }\;0 < \epsilon < 1.\]
This assumption states that if the radius of a ball is increased by a factor of \(\epsilon^{-1}\), the probability mass of the ball increases by at most a factor of \((\epsilon^{-1})^d\). Therefore, the constant \(d\) is interpreted as the intrinsic dimension of the covariate distribution, and can be much lower than the ambient dimension \(p\) if \(\mathcal{X}= \mathbb{R}^p\). For example, if the distribution \(P(x)\) is supported on a line in a two-dimensional space, then \(d = 1\) while \(p = 2\). If \(P(x)\) is supported on a plane in a three-dimensional space, then \(d = 2\) and \(p = 3\) and so forth.
Lastly, we need the following technical condition.
Assumption 4. The covariate space \(\mathcal{X}\) is a metric space with distance metric \(d_{\mathcal{X}}\) such that the class of all balls \(\mathcal{B} := \left\{ B(x,r) \mid x \in \mathcal{X}, \;r > 0 \right\}\) has a finite Vapnik–Chervonenkis (VC) dimension \(\mathcal{V}_\mathcal{B}> 0\).
This assumption is satisfied, for example, if \(\mathcal{X}= \mathbb{R}^p\) with \(p \geq 1\), in which case \(\mathcal{V}_\mathcal{B}\leq p + 2\) [36].
Under the above assumptions, the distance between the true and kNN conditional distributions can be upper-bounded as follows. The proof, provided in Appendix 6, is an adaptation of the proof of [25], which is an upper error bound on real-valued kNN regression, to our setting of RKHS-valued regression.
Theorem 1. Let \((x_1, y_1), \dots, (x_n, y_n) \stackrel{i.i.d.}{\sim} P(y|x)P(x)\) and \(\hat{P}(y|x)\) be the kNN conditional distribution (6 ) with \(k\) nearest neighbours. Suppose that Assumptions 1, 2, 3 and 4 hold. Let \(0 < \delta < 1\). Then, with probability at least \(1 - 2 \delta\), the bound \[\begin{align} \label{eq:bound-main-theo} \left\| \Phi( P(\cdot \mid x)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2 \leq 4 C_{\rm ker}^2 \left( 1 + 4 \left( \mathcal{V}_\mathcal{B}\ln (n) - \ln(\delta) \right) \right) \cdot \frac{1}{k} + 2 \lambda^2 r^2 \left( \frac{3 C_{\rm dist}}{P( B(x, r) )} \cdot \frac{k}{n } \right)^{2/d} \end{align}\qquad{(1)}\] holds simultaneously for all \(x \in \mathcal{X}\), \(k \in \{1, \dots, n\}\) and \(0 < r < r_{\rm max}\) satisfying \[\begin{align} \label{eq:cond-n-k-rx} & k \geq \mathcal{V}_\mathcal{B}\ln (2n) + \ln(8/\delta) \quad \text{and} \quad \frac{ k}{n} < \frac{ P\left( B(x,r ) \right) }{ 3 C_{\rm dist} } . \end{align}\qquad{(2)}\]
From Theorem 1, the following observations can be made.
Focusing on the dependence on the sample size \(n\) and the number \(k\) of nearest neighbours, the bound of Theorem 1 can be written as \[\label{eq:bound-constants-simplified} \left\| \Phi( P(\cdot \mid x)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2 \leq C_1 \frac{\ln(n)}{k} + C_2 \left( \frac{k}{n} \right)^{2/d},\tag{10}\] where \(C_1\) and \(C_2\) are constants independent of \(n\) and \(k\). The first and second terms correspond to the variance and bias, respectively, of the kNN-based conditional mean embedding estimator \(\Phi( \hat{P}(\cdot \mid x) )\). The overall error decreases to zero as \(n\) increases if both the variance and bias decrease to zero; this requires that \(k\) increases as \(n\) increases so that the variance goes to zero, \(\ln(n) / k \to 0\), while \(k\) should not increase “too fast” relative to \(n\) so that the bias also goes to zero, \(k / n \to 0\): \[\label{eq:consitency-disp} \left\| \Phi( P(\cdot \mid x)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}} \longrightarrow 0 \quad \text{as } \;n \to \infty \;\;\left(\text{with} \;k/n \to 0 \;\text{and} \; \ln(n) / k \to 0 \right).\tag{11}\] On the other hand, if \(k\) is fixed to a constant value (e.g., \(k = 1\)), the variance term does not decrease even if the sample size increases. These observations are well known for real-valued kNN regression [26].
The above consistency (11 ) implies the convergence in distribution (or weak convergence) of the kNN conditional distribution \(\hat{P}(\cdot \mid x)\) to the true one \(P(\cdot \mid x)\) if the response space \(\mathcal{Y}\) is a compact metric space (e.g., \(\mathcal{Y}\) is a bounded closed subset of an Euclidean space) and \(\mathcal{H}\) is a universal RKHS6, such as the RKHSs of Gaussian, Matérn and Laplace kernels [28]; see [22] for more generic conditions. That is, under these conditions, the expectation of any continuous bounded function \(f: \mathcal{Y}\to \mathbb{R}\) under the kNN distribution \(\hat{P}(\cdot \mid x)\) converges to the expectation under the true distribution \(P(\cdot \mid x)\): \[\int f(y)d\hat{P}(y \mid x) \longrightarrow \int f(y)dP(y \mid x) \quad \text{as } \;n \to \infty \;\;\left(\text{with} \;k/n \to 0 \;\text{and} \; \ln(n) / k \to 0 \right).\] This supports using the approximate conditional distribution in multiple imputation of missing values.
An asymptotically optimal choice of \(k\) that minimizes the bound (10 ), up to the \(\ln(n)\) factor, can be obtained by balancing the variance and bias terms. If we set \(k \propto n^{\frac{2}{2+d}}\), we obtain the convergence rate \[\label{eq:convergence-rate} \left\| \Phi( P(\cdot \mid x)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2 \leq C_3 \ln (n) \cdot n^{ - \frac{2}{2+d} },\tag{12}\] where \(C_3\) is a constant independent of \(n\) and \(k\).
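Concretely, ignoring the \(\ln(n)\) factor and equating the two terms of (10 ) gives \[\frac{1}{k} \asymp \left( \frac{k}{n} \right)^{2/d} \quad \Longleftrightarrow \quad k^{\frac{2+d}{d}} \asymp n^{\frac{2}{d}} \quad \Longleftrightarrow \quad k \asymp n^{\frac{2}{2+d}},\] and substituting this choice into the variance term \(C_1 \ln(n)/k\) yields the rate in (12 ).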
The rate (12 ) shows that the required sample size \(n\) to attain a desired error level increases exponentially with respect to the intrinsic dimension \(d\) of the covariate distribution \(P(x)\), not the ambient dimension of the input space \(\mathcal{X}\), which is captured by the VC dimension \(\mathcal{V}_\mathcal{B}\) of all the balls in \(\mathcal{X}\). Therefore, even when the covariate’s dimension is large, the error can be small if the covariate features have strong correlations so that the intrinsic dimension \(d\) is small. This is the finding first made by [25] on real-valued kNN regression, and we extend it to RKHS-valued kNN regression.
The rate (12 ) is the same as the minimax optimal rate for estimating a Lipschitz-continuous real-valued regression function when the covariate distribution \(P(x)\) has the intrinsic dimension \(d\) [25]. An interesting point is that the same rate is attained with RKHS-valued kNN regression where the output space is an RKHS that can be infinite-dimensional. Similar observations have been made for RKHS-valued kernel ridge regression [34], [35].
The second inequality in the condition of Theorem 1 implies that, for successful recovery of the missing value distribution, the support of the covariate distribution \(Q(x)\) for units with missing responses (see (3 )) should be reasonably covered by the support of the covariate distribution \(P(x)\) for units with observed responses. To explain this, suppose that a missing-response unit has covariate \(x'\), i.e., \(x'\) is in the support of \(Q(x)\), but \(x'\) is not in the support of \(P(x)\) so that there exists some \(r' > 0\) with \(P(B(x',r')) = 0\); then the condition of Theorem 1 is not satisfied for any \(n\) and \(k\).
We describe experiments to assess the empirical performance of kNNSampler in recovering the distribution of missing values. Section 4.1 explains the settings, evaluation metrics and benchmark methods. Section 4.2 describes and discusses the results.
We consider the following two models for data generation. As before, let \(n\) be the number of units with observed responses, \(m\) be the number of units with missing responses, and \(N = n+m\) be the total number of units.
For each unit \(i = 1, \dots, N\), covariate \(x_i\) is uniformly randomly generated on the interval \([-2, 2]\). Response \(y_i\) is the sum of covariate \(x_i\) and noise \(\epsilon_i\) generated randomly from the chi-square distribution with two degrees of freedom: \[\label{eq:synthetic-data-linear} y_i = x_i + \epsilon_i, \quad \text{where}~~ x_i \sim {\rm unif}([-2,2]), \quad \epsilon_i \sim \chi^2(2).\tag{13}\] Since the chi-square noise is positive, this setup enables assessing the capability of imputation methods to recover non-Gaussian, asymmetric data distributions.
This model, considered by [15], randomly generates covariate \(x_i\) and response \(y_i\) for each unit \(i = 1, \dots, N\) from a noisy two-dimensional ring of unit radius perturbed with an additive Gaussian noise of variance \(0.1\): \[\label{eq:synthetic-data-ring-data} y_i = (1+\epsilon_i)\sin(\theta_i), \quad x_i = (1+\epsilon_i)\cos(\theta_i), \quad \text{where}~~ \theta_i \sim \mathrm{unif}[0, 2\pi], \quad \epsilon_i \sim \mathcal{N}(0, 0.1).\tag{14}\] The conditional distribution of response \(y_i\) given covariate \(x_i\) is bi-modal when \(x_i\) is between about \(-0.5\) and \(0.5\). Thus, this setup enables the assessment of imputation methods in recovering a multi-modal missing-value distribution.
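For reference, the two generators can be sketched as follows (a minimal NumPy sketch under the equations above; function names are ours, and \(\mathcal{N}(0, 0.1)\) is taken to mean variance \(0.1\), hence standard deviation \(\sqrt{0.1}\)).

```python
# Synthetic data generators for the linear chi-square model (13) and the
# noisy ring model (14).
import numpy as np

def linear_chi_square_data(N, rng):
    x = rng.uniform(-2.0, 2.0, size=N)
    y = x + rng.chisquare(df=2, size=N)            # positive, asymmetric noise
    return x, y

def noisy_ring_data(N, rng):
    theta = rng.uniform(0.0, 2.0 * np.pi, size=N)
    eps = rng.normal(0.0, np.sqrt(0.1), size=N)    # Gaussian noise with variance 0.1
    return (1.0 + eps) * np.cos(theta), (1.0 + eps) * np.sin(theta)

# Example: rng = np.random.default_rng(0); x, y = noisy_ring_data(10_000, rng)
```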
We consider the MAR (missing at random) setting.7 We select \(m\) units uniformly randomly from the subset of the \(N\) units whose covariates lie on the interval \([0.5, 1.5]\) and make their responses missing. We set \(m = 200\), and vary \(n\) to assess the effect of training size on imputation performance. Specifically, we set \(n \in \{2800, 4800, 6800, 8800, 10800\}\).
To quantify the performance of an imputation method in recovering the missing value distribution, we compute the energy distance [37] between the empirical distributions of the complete and imputed datasets. The energy distance is a well-established distance metric between probability distributions and is a parameter-free special case of the MMD [38].
Let \(\tilde{x}_1, \dots, \tilde{x}_m\) be the covariates of the \(m\) units whose responses \(\tilde{y}_{1}, \dots, \tilde{y}_{m}\) are missing, and \(\tilde{y}_{1}^*, \dots, \tilde{y}_{m}^*\) be their imputations. For each unit \(i\), let \(z_i = (\tilde{x}_i, \tilde{y}_i)\) be the pair of the covariate and the true (missing) response, and \(z_i^* = (\tilde{x}_i, \tilde{y}^*_i)\) be the pair of the covariate and the imputation. We compute the energy distance between the empirical distributions of \(D_m := \{ z_1, \dots, z_m \}\) and \(D_m^* := \{ z_1^*, \dots, z_m^* \}\) as \[\mathcal{E}(D_m, D_m^*) := \frac{2}{m^2} \sum_{i,j=1}^m \|z_i - z_j^*\| - \frac{1}{m(m-1)} \sum_{i \ne j} \|z_i - z_j\| - \frac{1}{m(m-1)} \sum_{i \ne j} \|z_i^* - z_j^*\|.\] This estimates the squared energy distance \[\mathcal{E}(Q, Q^*) := 2\mathbb{E}\|z - z^*\| - \mathbb{E}\|z - z'\| - \mathbb{E}\|z^* - z^{*'}\|, \quad \text{where}~~ z, z' \stackrel{i.i.d.}{\sim} Q, \quad z^*, z^{*'} \stackrel{i.i.d.}{\sim} Q^*,\] between the two joint distributions \(Q(x,y) = P(y|x)Q(x)\) and \(Q^*(x,y) = P^*(y|x)Q(x)\), where \(P(y|x)\) is the true conditional distribution of the true response \(y\) given covariate \(x\), \(P^*(y|x)\) is the conditional distribution of the imputed response \(y\) given covariate \(x\), and \(Q(x)\) is the covariate distribution of the missing units.
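A sketch of this estimate is given below, where Z stacks the (covariate, true response) pairs and Z_imp the (covariate, imputed response) pairs row-wise; the function name is illustrative.

```python
# Energy-distance estimate between the covariate-response pairs with true
# missing values (Z) and with imputed values (Z_imp), following the formula above.
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(Z, Z_imp):
    m = len(Z)
    cross = cdist(Z, Z_imp)            # ||z_i - z_j^*|| for all pairs
    within_true = cdist(Z, Z)          # diagonal is zero, so summing all entries
    within_imp = cdist(Z_imp, Z_imp)   # equals summing over i != j
    return ((2.0 / m**2) * cross.sum()
            - within_true.sum() / (m * (m - 1))
            - within_imp.sum() / (m * (m - 1)))

# Example: energy_distance(np.column_stack([x_miss, y_true]),
#                          np.column_stack([x_miss, y_imp]))
```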
A lower energy distance means that the two joint distributions are more similar, implying a better recovery of the missing value distribution. A higher energy distance implies that the imputed distribution is more different from the true data distribution.
Along with the energy distance itself, we report the p-value from a permutation two-sample test using the energy distance as a test statistic, where the null hypothesis is the identity of the two joint distributions \(Q(x,y)\) and \(Q^*(x,y)\). A smaller p-value provides stronger evidence that the null hypothesis is false, implying that the imputed distribution \(Q^*(x,y)\) is dissimilar to the true missing value distribution \(Q(x,y)\). A larger p-value indicates no evidence against the null hypothesis, i.e., the imputed dataset is not statistically distinguishable from the dataset with the true missing responses. Therefore, a larger p-value implies a better distribution recovery performance of the imputation method.
We compare kNNSampler with the following kNN-based and other imputation methods.
Linear Imputation: This method models the response-covariate relation as linear and imputes a missing response by the prediction of a linear regression model, fitted to the observed pairs, at the unit’s observed covariate. It should be regarded as a benchmark slightly more sophisticated than naive methods such as mean imputation.
Random Forest [16]: This method, widely used in practice, imputes a missing response by averaging multiple predictions made by bootstrap-sampled tree regressors. It can learn a nonlinear relation between the response and covariate and handle the interactions among covariate features [39], [40]. We use the default configuration in scikit-learn.
kNNImputer [7]: See Section 2.2 for the description of the method. We set the number \(k\) of nearest neighbours as \(k = 5\), which is the default setting in scikit-learn and widely used in practice.
kNN\(\times\)KDE [15]: As explained earlier, this method generates an imputation by sampling from an estimated conditional density of a missing response given a covariate. The conditional density is estimated by weighted Gaussian kernel density estimation over observed responses, with weights derived from a softmax function applied to covariate distances. We use the authors’ recommended settings: inverse temperature \(\tau = 50\) and kernel bandwidth \(h = 0.03\).
Figures 4 and 5 describe imputation results by the different methods on datasets generated from the linear chi-square model (13 ) and the noisy ring model (14 ), respectively, with sample size \(N = 10,000\) and \(30\%\) missing rate under the MAR mechanism. The results under the MCAR mechanism are similar and omitted.
The linear imputations ignore the variability in the missing responses and demonstrate the danger of naive imputation methods, such as mean and zero imputations. The imputations by Random Forest and kNNImputer appear to be better than the linear imputations, but are distributed more narrowly than the distribution of missing responses. This is evident for the noisy ring dataset (Figure 5), for which many imputed responses lie inside the ring, outside the support of the missing value distribution. This happens because these imputation methods estimate the conditional mean of the missing response given a covariate.
kNNSampler and kNN\(\times\)KDE recover the distribution of missing values much better than the above imputation methods. However, kNN\(\times\)KDE generated some imputations outside the support of the missing value distribution for the linear chi-square model (Figure 4). This is because the noise in this dataset is asymmetric and non-Gaussian, whereas kNN\(\times\)KDE uses Gaussian noise to generate imputations. In contrast, kNNSampler appears to recover the missing-value distributions accurately. We next compare these methods quantitatively.
Each experiment, consisting of data generation, imputations by each method, and the calculation of an evaluation metric, was independently repeated 10 times, and the mean and standard deviation of the evaluation metric are reported.
Figures 6 and 7 report the results on the energy distance between the empirical distributions of the imputed and true missing values; Figures 11 and 12 in Appendix 7 show the results for the best two methods for easier comparison. Figures 8 and 9 report the corresponding p-values. See Section 4.1.2 for details.
The energy distance for the linear imputer is the highest among the different methods, quantifying the large discrepancy between the distributions of the imputations and true missing values, as visually observed in Figures 4 and 5. In contrast, the root mean squared error (RMSE) for the linear imputer is the lowest (see Appendix [sec:exp-RMSE]), as the true regression function in either case is a linear function. This result demonstrates that the RMSE is not a good metric for evaluating the distributional similarity between imputations and missing values. See [41] for a further discussion.
The energy distances for kNNImputer and Random Forest are lower than those of the linear imputer, but they are still significantly higher than those of the two other methods. This is reasonable because they are estimating the conditional mean of the missing response given a covariate.
kNNSampler and kNN\(\times\)KDE resulted in small energy distances, implying that they produced imputations whose distribution is similar to that of true missing values. However, the standard deviations of the energy distance for kNN\(\times\)KDE tend to be higher than those of kNNSampler. Indeed, the p-values for kNN\(\times\)KDE are very small for some runs of the experiments (as implied by the error bars in Figures 8 and 9), in which case the distributions of imputations and true missing values are statistically significantly different. In contrast, the p-values of kNNSampler are stably higher, implying that kNNSampler stably generates imputations whose distribution is less distinguishable from that of true missing values.
This section evaluates kNNSampler’s ability to quantify uncertainty in missing values, using the approach described in Section 2.4. Figure 10 shows the mean and standard deviation of the coverage probabilities of kNN prediction intervals over 10 independent runs, for each sample size and missing rate (MR). As the sample size increases, the coverage probabilities converge to the designed probabilities (80%, 90%, 95%) irrespective of the missing rate, supporting the validity of the prediction intervals.
We studied kNNSampler, a stochastic missing-value imputation method that imputes a missing response of a given unit by searching for its \(k\) most similar units in terms of covariates and by randomly sampling one of the associated \(k\) observed responses. This method is interpreted as sampling from an approximate kNN-based conditional distribution of a missing response given a covariate. Assuming a Lipschitz condition that the true conditional distribution changes continuously with covariates, we proved that the kNN conditional distribution converges to the true conditional distribution as the number \(k\) of nearest neighbours increases at a rate slower than the sample size increases. This analysis offers a theoretical justification for kNNSampler, and may be of independent interest as it analyzes a novel kNN-based estimator of the Hilbert space embedding of a conditional distribution. Empirical results demonstrate the capability of kNNSampler in recovering the distributions of missing values.
Proof. We proceed as in the proof of [25] on real-valued kNN regression, with adaptations to our RKHS-valued kNN regression setting.
The RKHS distance between the mean embeddings of the true and kNN conditional distributions is decomposed into the “bias” and “variance” terms: \[\begin{align} & \left\| \Phi( P(\cdot \mid x)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2 \nonumber \\ & = \left\| \Phi( P(\cdot \mid x)) - \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] ~~ + ~~ \mathbb{E}[\Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2, \nonumber \\ & \leq 2 \underbrace{\left\| \Phi( P(\cdot \mid x)) - \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] \right\|_\mathcal{H}^2 }_{\rm Bias} ~~ + ~~ 2 \underbrace{ \left\| \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] - \Phi( \hat{P}(\cdot \mid x) ) \right\|_\mathcal{H}^2 }_{\rm Variance}, \label{eq:bias-variance-bound-proof} \end{align}\tag{15}\] where \(\mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ]\) is the conditional expectation of \(\Phi( \hat{P}(\cdot \mid x))\) given \(X_n = (x_1, \dots, x_n)\), the expectation being taken for the \(n\) output values \(y_1, \dots, y_n\): \[\begin{align} \label{eq:cond-exp-exres-proof} \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] = \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \mathbb{E}[ \Phi(y_j) \mid X_n ] = \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \Phi(P(\cdot \mid x_j)), \end{align}\tag{16}\] where the last identity follows from \(y_j \sim P(\cdot \mid x_j)\).
Lemma 2 in Section 6.1 and Lemma 3 in Section 6.2 respectively provide probabilistic upper bounds of the bias and variance terms in the upper bound (15 ), each holding simultaneously for all \(x \in \mathcal{X}\), \(k \in \{1,\dots, n\}\) and \(r > 0\) satisfying the condition of Theorem 1 with probability at least \(1 - \delta\). The claim follows from using these probabilistic bounds in (15 ). ◻
Lemma 1. Suppose that Assumption 4 holds. Let \(x_1, \dots,x_n \stackrel{i.i.d.}{\sim} P(x)\) be an i.i.d. sample of size \(n\) from a probability distribution \(P\) on \(\mathcal{X}\), and \(P_n := \frac{1}{n} \sum_{i=1}^n \delta_{x_i}\) be the empirical distribution. Let \(0 < \delta < 1\). Then, \[P_n(B) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}[x_i \in B] \geq a\] holds simultaneously for all balls \(B \in \mathcal{B}\) and for all constants \(a > 0\) satisfying \[P(B) \geq 3a \quad \text{and} \quad a \geq \frac{\mathcal{V}_\mathcal{B}\ln(2n) + \ln(8/\delta) }{n},\] with probability at least \(1-\delta\).
Lemma 2. Suppose that Assumptions 1, 3 and 4 hold. Let \((x_1, y_1), \dots, (x_n, y_n) \stackrel{i.i.d.}{\sim} P(y|x)P(x)\). Let \(0 < \delta < 1\). Then the following bound holds with probability at least \(1 - \delta\) simultaneously for all \(x \in \mathcal{X}\), \(k \in \{1, \dots, n \}\) and \(0 < r < r_{\rm max}\) satisfying the condition of Theorem 1: \[\begin{align} \left\| \Phi(P(\cdot \mid x)) - \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] \right\|_{\mathcal{H}} \leq \lambda r \left( \frac{3 C_{\rm dist} k}{n P( B(x, r) )} \right)^{1/d}. \end{align}\]
Proof. By using the triangle inequality and the Lipschitz continuity of the mapping \(x \mapsto \Phi( P(\cdot \mid x) )\) in Assumption 1, we obtain \[\begin{align} & \left\| \Phi(P(\cdot \mid x)) - \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] \right\|_{\mathcal{H}} \nonumber \\ = & \left\| \Phi(P(\cdot \mid x)) - \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \Phi(P(\cdot \mid x_j)) \right\|_{\mathcal{H}} = \left\| \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \left\{ \Phi(P(\cdot \mid x)) - \Phi(P(\cdot \mid x_j)) \right\} \right\|_{\mathcal{H}} \nonumber \\ \leq & \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \left\| \Phi(P(\cdot \mid x)) - \Phi(P(\cdot \mid x_j))\right\|_{\mathcal{H}} \leq \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \lambda d_\mathcal{X}(x, x_j) \leq \lambda r_{k,n}(x), \label{eq:lipshitz-bound-bias-proof} \end{align}\tag{17}\] where \(r_{k,n}(x)\) is the distance between \(x\) and its \(k\)-th nearest neighbour in \(X_n\). This distance is bounded as in the proof of [25], which leads to the claimed bound. For completeness, we prove it here.
The first inequality in the condition of Theorem 1 implies that \[a := \frac{k}{n} \geq \frac{\mathcal{V}_\mathcal{B}\ln(2n) + \ln(8/\delta) }{n}.\] Define a constant \(0 < \epsilon < 1\) as \[\epsilon := \left( \frac{3C_{\rm dist} k}{n P(B(x, r))} \right)^{1/d},\] where \(\epsilon < 1\) follows from the second inequality in the condition of Theorem 1. Then, Assumption 3 implies that \[P( B(x, \epsilon r) ) \geq C_{\rm dist}^{-1} \epsilon^{d} P( B(x,r) ) = 3 \cdot \frac{k}{n} = 3 a.\] Thus, Lemma 1 with this choice of \(a\) implies that the following holds simultaneously for all \(x \in \mathcal{X}\), \(k \in \{1, \dots, n\}\) and \(0 < r < r_{\rm max}\) satisfying the condition of Theorem 1 with probability at least \(1-\delta\): \[P_n \left( B(x, \epsilon r ) \right) \geq a = \frac{k}{n} = P_n \left(~ B(x, r_{k,n}(x)) ~ \right),\] where the second identity follows from the fact that \(r_{k,n}(x)\) is the distance between \(x\) and its \(k\)-th nearest neighbour, so the ball of center \(x\) and radius \(r_{k,n}(x)\) contains \(k\) points from \(x_1, \dots, x_n\). This implies that \[r_{k,n}(x) \leq \epsilon r \leq r \left( \frac{3 C_{\rm dist} k}{n P( B(x, r) )} \right)^{1/d}\] simultaneously holds for all \(x \in \mathcal{X}\), \(k \in \{1, \dots, n\}\) and \(0 < r < r_{\rm max}\) satisfying the condition of Theorem 1 with probability at least \(1 - \delta\). The claim is obtained by using this and the bound (17 ). ◻
Lemma 3. Suppose that Assumptions 2 and 4 hold. Let \((x_1, y_1), \dots, (x_n, y_n) \stackrel{i.i.d.}{\sim} P(y|x)P(x)\). Let \(0 < \delta < 1\). The following bound simultaneously holds for all \(x \in \mathcal{X}\) and \(k \in \{1, \dots, n\}\) with probability at least \(1 - \delta\): \[\begin{align} \label{eq:lemma-2215} & \left\| \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2 \leq 2 C_{\rm ker}^2 \cdot \frac{1 + 4 \left( \mathcal{V}_\mathcal{B}\ln (n) - \ln (\delta) \right)}{k}. \end{align}\qquad{(3)}\]
Proof. Denote by \(\psi({\rm NN}(x,k,X_n)) \geq 0\) the left-hand side of the claimed inequality without the square: \[\begin{align} \label{eq:rand-var-dist} \psi({\rm NN}(x,k,X_n)) &:= \left\| \mathbb{E}[ \Phi( \hat{P}(\cdot \mid x) ) \mid X_n ] - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}} \\ &= \left\| \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \Phi(P(\cdot \mid x_j)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}} \nonumber \\ & = \left\| \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \left\{ \Phi(P(\cdot \mid x_j)) - \Phi(y_j) \right\} \right\|_{\mathcal{H}} \leq \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \left\| \left\{ \Phi(P(\cdot \mid x_j)) - \Phi(y_j) \right\} \right\|_{\mathcal{H}}, \nonumber \end{align}\tag{18}\] where the third expression follows from the definition of \(\Phi( \hat{P}(\cdot \mid x) )\) in (8 ). The notation \(\psi({\rm NN}(x,k,X_n))\) emphasizes that it depends only on the subset of training data \((x_1, y_1), \dots, (x_n, y_n)\) associated with the indices \({\rm NN}(x,k,X_n)\) of the \(k\)-nearest neighbours of \(x\) in \(X_n = \{x_1, \dots, x_n\}\).
Because of the bound (9 ), changing \(y_i\) for any \(i \in {\rm NN}(x, k, X_n)\) to any different value \(y'_i \in \mathcal{Y}\) changes the value of \(\psi({\rm NN}(x,k,X_n))\) by at most \(2 \sqrt{2} C_{\rm ker} / k\). Indeed, \(\left. \psi({\rm NN}(x,k,X_n)) \right|_{y_i}\) and \(\left. \psi({\rm NN}(x,k,X_n)) \right|_{y'_i}\) are the RKHS norms of two elements of \(\mathcal{H}\) that differ only in the term \(\frac{1}{k} \Phi(y_i)\), so the reverse triangle inequality, the triangle inequality and the bound (9 ) give \[\begin{align} & \left| \left. \psi({\rm NN}(x,k,X_n)) \right|_{y_i} - \left. \psi({\rm NN}(x,k,X_n)) \right|_{y'_i} \right| \leq \left\| \frac{1}{k} \left\{ \Phi(y_i) - \Phi(y'_i) \right\} \right\|_{\mathcal{H}} \\ & \leq \frac{1}{k} \left( \left\| \Phi(P(\cdot \mid x_i)) - \Phi(y_i) \right\|_{\mathcal{H}} + \left\| \Phi(P(\cdot \mid x_i)) - \Phi(y_i') \right\|_{\mathcal{H}} \right) \leq \frac{2 \sqrt{2} C_{\rm ker} }{k}, \end{align}\] where \(\left. \psi({\rm NN}(x,k,X_n))\right|_{y'_i}\) denotes the value of \(\psi({\rm NN}(x,k,X_n))\) with \(y_i\) replaced by \(y'_i\). On the other hand, the output \(y_i\) associated with any non-nearest-neighbour index \(i \not\in {\rm NN}(x, k, X_n)\) does not appear in \(\psi({\rm NN}(x,k,X_n))\), so changing its value does not change \(\psi({\rm NN}(x,k,X_n))\).
Thus, for fixed \(X_n\), the probability that the random variable \(\psi({\rm NN}(x,k,X_n))\) exceeds its conditional expectation \(\mathbb{E}[ \psi({\rm NN}(x,k,X_n)) \mid X_n ]\) by more than any positive constant \(\epsilon > 0\) is upper bounded via McDiarmid’s inequality as \[\begin{align} \label{eq:prob-bound-for-fixed-x-and-k} & {\rm Pr}\left( \psi({\rm NN}(x,k,X_n)) > \mathbb{E}\left[ \psi({\rm NN}(x,k,X_n)) \mid X_n \right] + \epsilon \mid X_n \right) \leq \exp \left( - \frac{ \epsilon^2 k }{ 4 C_{\rm ker}^2 } \right). \end{align}\tag{19}\] This is a bound for fixed \(x\) and \(k\).
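For the reader’s convenience, the exponent in (19) can be verified as follows: McDiarmid’s inequality gives the bound \(\exp\!\left(-2\epsilon^2/\sum_i c_i^2\right)\), and only the \(k\) coordinates indexed by \({\rm NN}(x,k,X_n)\) have nonzero bounded differences \(c_i = 2\sqrt{2} C_{\rm ker}/k\), so \[\exp\left( - \frac{2\epsilon^2}{\sum_{i} c_i^2} \right) = \exp\left( - \frac{2\epsilon^2}{k \cdot 8 C_{\rm ker}^2 / k^2} \right) = \exp\left( - \frac{\epsilon^2 k}{4 C_{\rm ker}^2} \right).\]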
Next, for fixed \(X_n\), we consider the probability that the statement \[\label{eq:statement-dist} \psi({\rm NN}(x,k,X_n)) > \mathbb{E}\left[ \psi({\rm NN}(x,k,X_n)) \mid X_n \right] + \epsilon\tag{20}\] holds for some \(x \in \mathcal{X}\) and \(k \in \{1, \dots, n\}\). The number of distinct such statements is identical to the number of distinct index sets of nearest neighbours \({\rm NN}(x,k,X_n)\), since the random variable \(\psi({\rm NN}(x,k,X_n))\) depends only on the subset of \((x_1, y_1), \dots, (x_n,y_n)\) associated with \({\rm NN}(x,k,X_n)\), as mentioned previously. In other words, if there are other \(x' \in \mathcal{X}\) and \(k' \in \{1, \dots, n\}\) that give the identical index set of nearest neighbours as for \(x\) and \(k\), i.e., \[{\rm NN}(x', k', X_n) = {\rm NN}(x, k, X_n),\] then the random variable \(\psi({\rm NN}(x',k',X_n))\) for \(x'\) and \(k'\) is identical to that for \(x\) and \(k\): \[\psi({\rm NN}(x',k',X_n)) = \psi({\rm NN}(x,k,X_n)).\]
The number of distinct index sets of nearest neighbours equals the number of distinct ways the set \(X_n\) of \(n\) points is intersected by balls \(B(x, r_{k,n}(x))\) centered at \(x\) with radius \(r_{k,n}(x)\), the distance from \(x\) to its \(k\)-th nearest neighbour. This number is upper-bounded by the number of distinct ways \(X_n\) is intersected by the class \(\mathcal{B}= \{ B(x,r) \mid x \in \mathcal{X}, \;r > 0 \}\) of all balls, which is in turn upper-bounded by \(n^{\mathcal{V}_\mathcal{B}}\), where \(\mathcal{V}_\mathcal{B}\) is the VC dimension of \(\mathcal{B}\) [25]. Therefore, by the union bound, the probability that the statement (20) holds for some \(x\) and \(k\) is upper bounded by the bound (19) times \(n^{\mathcal{V}_\mathcal{B}}\): \[\begin{align} & {\rm Pr}\left( \psi({\rm NN}(x,k,X_n)) > \mathbb{E}[\psi({\rm NN}(x,k,X_n)) \mid X_n] + \epsilon ~ \text{ for some } x \in \mathcal{X}\text{ and } k \in \{1,\dots,n\} \mid X_n \right) \nonumber \\ & \leq n^{\mathcal{V}_\mathcal{B}} \exp \left( - \frac{ \epsilon^2 k }{ 4 C_{\rm ker}^2 } \right) \quad \text{for all} \quad \epsilon > 0. \label{eq:union-bound-variance} \end{align}\tag{21}\]
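The counting step can also be illustrated numerically. The following minimal sketch is illustrative only and is not part of the argument: it assumes Euclidean covariates, for which the class of balls has VC dimension \(d+1\), and simply counts the distinct nearest-neighbour index sets \({\rm NN}(x,k,X_n)\) arising from many query points \(x\) and all values of \(k\), comparing the count with \(n^{\mathcal{V}_\mathcal{B}}\).

```python
# Illustrative sketch (not part of the proof): count how many distinct k-NN
# index sets NN(x, k, X_n) arise as x ranges over many query points and k over
# 1..n, and compare with the crude bound n^{V_B} used in the union bound
# (for Euclidean balls in R^d, the VC dimension is d + 1).
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 50
X = rng.uniform(size=(n, d))               # the fixed sample X_n

index_sets = set()
for x in rng.uniform(size=(2000, d)):      # many query points
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    for k in range(1, n + 1):
        index_sets.add(frozenset(order[:k]))   # the index set NN(x, k, X_n)

print(f"distinct index sets observed: {len(index_sets)}")
print(f"bound n^(d+1) = {n ** (d + 1)}")
```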
Now, set \[\delta = n^{\mathcal{V}_\mathcal{B}} \exp \left( - \frac{ \epsilon^2 k }{ 4 C_{\rm ker}^2 } \right) \;\Longleftrightarrow \;\epsilon^2 = \frac{4C_{\rm ker}^2 \left( \mathcal{V}_\mathcal{B}\ln (n) - \ln(\delta) \right)}{k}.\] For any value of \(0 < \delta < 1\), there is a corresponding \(\epsilon > 0\). Then, the bound (21) implies that, for fixed \(X_n\), the following upper bound on the square of the random variable \(\psi({\rm NN}(x,k,X_n))\) holds for all \(x \in \mathcal{X}\) and \(k \in \{1, \dots,n\}\) with probability at least \(1 - \delta\): \[\begin{align} \psi({\rm NN}(x,k,X_n))^2 & \leq 2 \mathbb{E}[ \psi({\rm NN}(x,k,X_n)) \mid X_n ]^2 + 2 \epsilon^2 \\ &\leq 2 \mathbb{E}[ \psi({\rm NN}(x,k,X_n))^2 \mid X_n ] + 2 \epsilon^2, \end{align}\] where the first inequality follows from \(\psi({\rm NN}(x,k,X_n)) \leq \mathbb{E}[ \psi({\rm NN}(x,k,X_n)) \mid X_n ] + \epsilon\), which holds on the complement of the event (20), combined with \((a+b)^2 \leq 2a^2 + 2b^2\), and the second inequality follows from Jensen’s inequality. Replacing \(\psi( {\rm NN}(x, k, X_n) )\) by its definition (18) and \(\epsilon^2\) by the above expression in terms of \(\delta\), we obtain the following bound on the variance term that holds for all \(x\) and \(k\) with probability at least \(1 - \delta\): \[\begin{align} & \left\| \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \Phi(P(\cdot \mid x_j)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2 \nonumber \\ & \leq 2 \mathbb{E}\left[ \left\| \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \Phi(P(\cdot \mid x_j)) - \Phi( \hat{P}(\cdot \mid x) ) \right\|_{\mathcal{H}}^2 \mid X_n \right] + \frac{8 C_{\rm ker}^2 \left( \mathcal{V}_\mathcal{B}\ln (n) - \ln(\delta) \right)}{k}. \label{eq:bound-849} \end{align}\tag{22}\]
Define \(\mathcal{H}\)-valued random variables \[z_j := \Phi(P(\cdot \mid x_j)) - \Phi(y_j) \quad \text{for all} \;\;j \in {\rm NN}(x, k, X_n).\] These random variables are conditionally independent given \(X_n\). The conditional expectation of each \(z_j\) given \(X_n\) is zero, and the conditional variance is uniformly upper bounded due to the bound (9): \[\begin{align} & \mathbb{E}\left[ z_j \mid X_n \right] = \mathbb{E}\left[ \Phi(P(\cdot \mid x_j)) - \Phi(y_j) \mid X_n \right] = \Phi(P(\cdot \mid x_j)) - \mathbb{E}\left[ \Phi(y_j) \mid x_j \right] = 0, \\ & \mathbb{E}\left[ \left\| z_j \right\|_\mathcal{H}^2 \mid X_n \right] = \mathbb{E}[ \left\| \Phi(P(\cdot \mid x_j)) - \Phi(y_j) \right\|_{\mathcal{H}}^2 \mid x_j] \leq 2 C_{\rm ker}^2. \end{align}\] Therefore, the first term in the bound (22) can be bounded as (see also the definition of \(\Phi( \hat{P}(\cdot \mid x) )\) in (8)) \[\begin{align} & \mathbb{E} \left[ \left\| \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} \left\{ \Phi(P(\cdot \mid x_j)) - \Phi(y_j) \right\} \right\|_{\mathcal{H}}^2 \mid X_n \right] = \mathbb{E} \left[ \left\| \frac{1}{k} \sum_{j \in {\rm NN}(x, k, X_n)} z_j \right\|_{\mathcal{H}}^2 \mid X_n \right] \\ & = \mathbb{E} \left[ \frac{1}{k^2} \sum_{j \in {\rm NN}(x, k, X_n)} \left\| z_j \right\|_{\mathcal{H}}^2 + \frac{1}{k^2} \sum_{j \not= m \in {\rm NN}(x, k, X_n)} \left< z_j, z_m \right>_{\mathcal{H}} \mid X_n \right] \\ & = \frac{1}{k^2} \sum_{j \in {\rm NN}(x, k, X_n)} \mathbb{E} \left[ \left\| z_j \right\|_{\mathcal{H}}^2 \mid X_n \right] + \frac{1}{k^2} \sum_{j \not= m \in {\rm NN}(x, k, X_n)} \left< \mathbb{E} \left[ z_j \mid X_n \right] , \mathbb{E} \left[ z_m \mid X_n \right] \right>_{\mathcal{H}} \\ & \leq \frac{2C_{\rm ker}^2}{k}, \end{align}\] where the third equality uses the conditional independence of the \(z_j\) given \(X_n\), and the last inequality uses \(\mathbb{E}[z_j \mid X_n] = 0\) and \(\mathbb{E}[\| z_j \|_{\mathcal{H}}^2 \mid X_n] \leq 2 C_{\rm ker}^2\). The proof is completed by using this bound in (22) and noting that the resulting bound does not depend on \(X_n\). ◻
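As a numerical illustration of this \(O(1/k)\) conditional variance, the following minimal sketch (not the paper’s code; it assumes, purely for illustration, that all \(k\) neighbours share the same conditional distribution \(P = N(0,1)\) and that the kernel is Gaussian with unit bandwidth) estimates \(\mathbb{E}\big[ \| \frac{1}{k} \sum_j z_j \|_{\mathcal{H}}^2 \big]\) by expanding the squared RKHS norm into kernel evaluations, with \(\Phi(P)\) approximated by a large reference sample.

```python
# Illustrative Monte Carlo check (not part of the proof): for y_1, ..., y_k i.i.d.
# from a fixed conditional distribution P and a Gaussian kernel kappa, the squared
# RKHS norm || (1/k) sum_j {Phi(P) - Phi(y_j)} ||^2 has expectation
# (E[kappa(y, y)] - E[kappa(y, y')]) / k, i.e., it decays at the 1/k rate used above.
import numpy as np

rng = np.random.default_rng(0)
kappa = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)  # Gaussian kernel

y_ref = rng.standard_normal(4000)           # large sample standing in for Phi(P)
t_ref = kappa(y_ref, y_ref).mean()          # approximates E[kappa(y', y'')]

def mean_sq_norm(k, n_rep=500):
    """Monte Carlo estimate of E|| (1/k) sum_j {Phi(P) - Phi(y_j)} ||_H^2."""
    vals = []
    for _ in range(n_rep):
        y = rng.standard_normal(k)          # y_1, ..., y_k i.i.d. from P = N(0, 1)
        vals.append(kappa(y, y).mean()      # (1/k^2) sum_{j,l} kappa(y_j, y_l)
                    - 2 * kappa(y, y_ref).mean()
                    + t_ref)
    return float(np.mean(vals))

for k in [10, 40, 160]:
    est = mean_sq_norm(k)
    print(f"k={k:>4}: estimate = {est:.4f},  k * estimate = {k * est:.3f}")
```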
Figures 11 and 12 show the results on the energy distance for the three best methods in Figures 6 and 7, respectively. Figure 13 shows the corresponding results on the root mean squared errors (RMSEs) between the imputations and true missing values. As mentioned in the main body, having low RMSEs does not imply good imputations in terms of recovering the distribution of missing values. Indeed, imputations from the linear imputer have the lowest RMSEs, but their distribution significantly differs from the distribution of true missing values, as quantified in Figures 6 and 7 and visually observed in Figures 4 and 5.
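For concreteness, the following minimal sketch (an assumed implementation for illustration, not the paper’s evaluation code) computes the two metrics discussed here: the standard energy distance \(2\,\mathbb{E}\|A-B\| - \mathbb{E}\|A-A'\| - \mathbb{E}\|B-B'\|\) between imputed and true missing values, and the RMSE. On a toy bimodal example, an imputer that hugs the conditional mean attains a lower RMSE but a much larger energy distance than an imputer that matches the distribution, mirroring the point made above.

```python
# Minimal sketch (assumed implementation for illustration, not the paper's code):
# energy distance and RMSE between imputed values and true missing values.
import numpy as np

def energy_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Energy distance between two 1-D samples a and b (V-statistic form)."""
    cross = np.abs(a[:, None] - b[None, :]).mean()
    within_a = np.abs(a[:, None] - a[None, :]).mean()
    within_b = np.abs(b[:, None] - b[None, :]).mean()
    return 2 * cross - within_a - within_b

def rmse(imputed: np.ndarray, truth: np.ndarray) -> float:
    return float(np.sqrt(np.mean((imputed - truth) ** 2)))

# Toy example: bimodal true missing values; a conditional-mean-style imputer has
# low RMSE but large energy distance, while a distribution-matching imputer
# (idealized here as a random permutation of the truth) has the opposite profile.
rng = np.random.default_rng(0)
truth = np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])
mean_imputed = rng.normal(0.0, 0.3, 1000)       # concentrates around the regression curve
sampled_imputed = rng.permutation(truth)        # matches the distribution of the truth
print("energy distance:", energy_distance(mean_imputed, truth),
      energy_distance(sampled_imputed, truth))
print("RMSE:", rmse(mean_imputed, truth), rmse(sampled_imputed, truth))
```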
https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html↩︎
If there is a tie in the distances \(d_\mathcal{X}(\tilde{x}, x_i)\), break it randomly.↩︎
Hilbert space-valued kNN regression was also analyzed in [29], but those results are not directly applicable here, because [29] assumes that the Hilbert space-valued noise is independent of the input variables, an assumption that is too strong for our setting.↩︎
A Hilbert space is a vector space equipped with an inner product, whose norm is induced by the inner product, and which is complete in this norm, i.e., every Cauchy sequence in the space converges to a limit belonging to the space.↩︎
An RKHS \(\mathcal{H}\) consisting of functions on a metric set \(\mathcal{Y}\) is called universal if any continuous bounded function \(f: \mathcal{Y}\to \mathbb{R}\) can be approximated arbitrarily well in terms of the supremum norm by functions in \(\mathcal{H}\).↩︎
We also performed the experiments under the MCAR (missing completely at random) setting, but the results were similar and thus omitted.↩︎