June 17, 2024
Two key tasks in high-dimensional regularized regression are tuning the regularization strength for accurate predictions and estimating the out-of-sample risk. It is known that the standard approach — \(k\)-fold cross-validation — is inconsistent in modern high-dimensional settings. While leave-one-out and generalized cross-validation remain consistent in some high-dimensional cases, they become inconsistent when samples are dependent or contain heavy-tailed covariates. As a first step towards modeling structured sample dependence and heavy tails, we use right-rotationally invariant covariate distributions — a crucial concept from compressed sensing. In the proportional asymptotics regime where the number of features and samples grow comparably, which is known to better reflect the empirical behavior in moderately sized datasets, we introduce a new framework, ROTI-GCV, for reliably performing cross-validation under these challenging conditions. Along the way, we propose new estimators for the signal-to-noise ratio and noise variance. We conduct experiments that demonstrate the accuracy of our approach in a variety of synthetic and semi-synthetic settings.
Regularized estimators are fundamental for modern high-dimensional statistics. In this context, the prototypical problem of ridge regression has gained renewed attention, with multiple works characterizing its out-of-sample risk in high dimensions, cf. [1], [2]. Additionally, consistent estimates for this out-of-sample risk, crucial for optimizing the regularization parameter \(\lambda\), have been established in the form of leave-one-out cross-validation (LOOCV) and generalized cross-validation (GCV) [3]–[5].
Crucially, all aforementioned analyses assume that samples are independent and identically distributed (i.i.d.) draws from some distribution with sufficient regularity, such as light tails. These assumptions often fail in practice, such as in financial returns, neuroscience, and climate data [6]–[8]. In such cases, cross-validation techniques can become biased (Theorem 1, Remark 3), necessitating alternatives. Only recently, [9] considered independent but non-identically distributed samples and [10] explored Gaussian designs with both sample and feature dependence. Understanding cross-validation in high dimensions under more general sample dependencies and for designs with heavier tails than sub-Gaussians remains a significant challenge. This paper takes the first step in addressing this issue.
We assess the impact of sample dependence and heavy tails in ridge regression and cross-validation using an alternative random matrix ensemble for the covariate distribution. Instead of assuming a Gaussian matrix or i.i.d. rows from a fixed distribution, we require the singular value decomposition of the design \(\mathbf{X}\) to be right-rotationally invariant (see Definition 1). We characterize the ridge regression risk for these designs and introduce ROTI-GCV, a new GCV-inspired framework for consistently estimating the out-of-sample risk. Simulations show that ROTI-GCV outperforms traditional GCV and LOOCV in situations with correlated and heavy-tailed observations.
Right-rotationally invariant designs facilitate tractable analysis while capturing structured forms of sample dependencies and heavy-tailed covariates. These designs have gained significant attention as alternatives to i.i.d. designs for theoretical analysis in numerous studies [11]–[20]. Recent works [21], [22] showed that, under mild conditions, properties of convex estimators across various designs are well-approximated by those under a right-rotationally invariant design with the same spectrum, indicating a broad universality class for these distributions.
Formally, we operate in the proportional asymptotics regime, where the dimension \(p\) scales proportionally with the sample size \(n\). This regime, rooted in probability theory and statistical physics, has received significant attention due to its ability to accurately capture empirical phenomena [2], [23]–[25]. Furthermore, results derived under this regime require minimal assumptions on the signal structure, resulting in theory and methods with broad practical utility [26]. Regarding cross-validation for i.i.d. samples, [2], [27], [28] established that classical \(k\)-fold cross-validation inconsistently estimates the out-of-sample error in this high-dimensional regime, whereas LOOCV and GCV remain consistent, prompting our study of GCV. However, our work focuses on non-i.i.d. samples and designs with heavier-than-Gaussian tails, areas not sufficiently addressed in prior work.
We provide proofs for ridge regression, but emphasize that our core idea can extend to other penalties. Our proof hinges on characterizing the limit of GCV (Theorem 1). This reveals that the original GCV is inconsistent for the out-of-sample risk in our setting. We next observe that though GCV is inconsistent, its asymptotic limit can serve as an estimating equation for the unknown parameters parameterizing the risk. This estimating equation approach directly inspires a new cross-validation method that works for our challenging right-rotationally invariant designs. A similar strategy may be invoked for any other penalty where an analogue of Theorem 1 is available. Therefore, we anticipate our approach to be broadly generalizable beyond ridge regression.
Definition 1 (Right-rotationally invariant design). A random design matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) is right-rotationally invariant if for any \(\mathbf{O} \in \mathbb{O}(p)\), one has \(\mathbf{XO} \stackrel{d}{=} \mathbf{X}\), where \(\mathbb{O}(p)\) denotes \(p \times p\) orthogonal matrices. Equivalently, \(\mathbf{X}\) is right-rotationally invariant if and only if the singular value decomposition \(\mathbf{X} = \mathbf{Q}^\top \mathbf{D} \mathbf{O}\) satisfies that \(\mathbf{O}\) is independent of \((\mathbf{Q}, \mathbf{D})\) and drawn from the Haar measure on \(\mathbb{O}(p)\), i.e. “uniformly” distributed on \(\mathbb{O}(p)\).
Some examples of right-rotationally invariant designs are as follows (note that an i.i.d. entries standard Gaussian design is a member of this class, but the class includes many non-i.i.d. and non-Gaussian designs as well; see Appendix 8.3 for proofs of invariance):
(i) Autocorrelated data: The rows of \(\mathbf{X}\) can be drawn from an autocorrelated series, where \(\mathbf{x}_i = \rho \mathbf{x}_{i - 1} + \sqrt{1 - \rho^2} \mathbf{z}_i\), with \(\mathbf{z}_i\) being i.i.d. draws from \(\mathsf{N}(0, \mathbf{I}_p)\).
(ii) \(t\)-distributed data: The rows of \(\mathbf{X}\) can be drawn from a multivariate \(t\) distribution with as few as \(3\) degrees of freedom, capturing heavy-tailed data such as financial returns [6].
(iii) Products of Gaussian matrices: \(\mathbf{X} = \mathbf{X}_1 \mathbf{X}_2 \cdots \mathbf{X}_k\), where \(\mathbf{X}_1\) has \(n\) rows and \(\mathbf{X}_k\) has \(p\) columns, while the remaining dimensions are arbitrary. See [29], [30] for connections to linear neural networks.
(iv) Matrix-Normals: \(\mathbf{X} \sim \mathsf{MN}(\mathbf{0}, \boldsymbol{\Sigma}^{\mathrm{row}}, \boldsymbol{\Sigma}^{\mathrm{col}})\), where \(\boldsymbol{\Sigma}^{\mathrm{row}}\) can be arbitrary while \(\boldsymbol{\Sigma}^{\mathrm{col}} \sim \mathsf{InvWishart}(\mathbf{I}_p, (1 + \delta)p)\), for any \(\delta > 0\). See Appendix 8.5 for details on distribution notation.
(v) Equicorrelated data: \(\mathbf{X} \in \mathbb{R}^{n \times p}\) has independent columns, but each column follows a multivariate Gaussian distribution with covariance matrix \(\boldsymbol{\Sigma}\), where \(\Sigma_{ij} = \rho\) if \(i \neq j\), and \(\Sigma_{ii} = 1\).
(vi) Spiked matrices: \(\mathbf{X} = \lambda \mathbf{V}\mathbf{W}^\top + \mathbf{G}\), where \(\mathbf{V} \in \mathbb{R}^{n \times r}\) and \(\mathbf{W} \in \mathbb{R}^{p \times r}\) are the first \(r\) columns of two Haar matrices, and \(G_{ij} \sim \mathsf{N}(0, 1)\).
See [31] for more such examples. Setting (ii) has heavier tails than supported by existing GCV theory; settings (i, iii, v, vi) have sample dependence; setting (iv) has both.
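To make settings (i) and (ii) concrete, the following minimal numpy sketch generates such designs; the sizes, the seed, and the final \(1/\sqrt{n}\) rescaling (chosen so that the spectrum of \(\mathbf{X}^\top \mathbf{X}\) stays of order one, in line with Remark 2 below) are illustrative choices rather than part of the definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 500, 500, 0.8

# (i) Autocorrelated rows: x_i = rho * x_{i-1} + sqrt(1 - rho^2) * z_i
Z = rng.standard_normal((n, p))
X_auto = np.empty((n, p))
X_auto[0] = Z[0]
for i in range(1, n):
    X_auto[i] = rho * X_auto[i - 1] + np.sqrt(1 - rho**2) * Z[i]

# (ii) Multivariate t rows with 3 degrees of freedom (heavy tails):
# a Gaussian vector divided by an independent sqrt(chi^2_3 / 3) factor
chi = rng.chisquare(3, size=n) / 3.0
X_t = rng.standard_normal((n, p)) / np.sqrt(chi)[:, None]

# Rescale so that ||X^T X||_op = O(1), as in Remark 2
X_auto /= np.sqrt(n)
X_t /= np.sqrt(n)
```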
Right-rotationally invariant ensembles allow us enormous flexibility—they can simultaneously capture dependent rows and heavy tails by allowing a rather general class of singular value distributions. The spectrum of right-rotationally invariant designs can be essentially arbitrary, allowing for significant generalization beyond the i.i.d. designs previously used to study the behavior of LOOCV and GCV in high dimensions. Furthermore, the universality class of these designs is extremely broad, as established in [21], [22].
To simplify some statements, we introduce some notation. Let \(m_\mathbf{D}(z) = \frac{1}{p} \sum_{i = 1} ^p \frac{1}{D_{ii}^2 - z}\) and \(v_\mathbf{D}(z) = \frac{1}{n}\sum_{i = 1} ^n \frac{1}{D_{ii}^2 - z}\), with \(D_{ii} = 0\) if \(i > \min(n, p)\). Also of use will be the derivatives \(m'_{\mathbf{D}}\) and \(v'_{\mathbf{D}}\). Note \(m_\mathbf{D}\) is the Stieltjes transform of the empirical measure of \(((\mathbf{D}^\top \mathbf{D})_{ii})_{i = 1} ^p\); \(v_\mathbf{D}\) is the Stieltjes transform of the empirical measure of \(((\mathbf{D}\mathbf{D}^\top)_{ii})_{i = 1} ^n\).
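In code, these quantities are simple averages over the squared singular values of \(\mathbf{X}\) (padded with zeros beyond \(\min(n, p)\)); the following numpy helper is one way to compute them.

```python
import numpy as np

def stieltjes_terms(svals, lam, n, p):
    """Return v_D(-lam), v_D'(-lam), m_D(-lam), m_D'(-lam) given the
    singular values of X, with D_ii = 0 for i > min(n, p)."""
    d2 = np.zeros(max(n, p))
    d2[:len(svals)] = np.asarray(svals) ** 2
    v = np.mean(1.0 / (d2[:n] + lam))         # v_D(-lam)
    vp = np.mean(1.0 / (d2[:n] + lam) ** 2)   # v_D'(-lam)
    m = np.mean(1.0 / (d2[:p] + lam))         # m_D(-lam)
    mp = np.mean(1.0 / (d2[:p] + lam) ** 2)   # m_D'(-lam)
    return v, vp, m, mp
```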
We study high-dimensional ridge regression and GCV with right-rotationally invariant designs. Concretely, we observe \((\mathbf{X}, \mathbf{y})\), such that \[\label{eq:riri-gen} \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},\tag{1}\] with \(\mathbf{y} \in \mathbb{R}^n\), \(\mathbf{X} \in \mathbb{R}^{n \times p}\), \(\boldsymbol{\beta} \in \mathbb{R}^{p}\), and \(\boldsymbol{\epsilon} \in \mathbb{R}^{n}\). Hence \(n\) is the sample size, \(p\) is both the data dimension and the parameter count, \(\boldsymbol{\beta}\) is the signal, and \(\boldsymbol{\epsilon}\) is a component-wise independent noise vector, where each entry has mean \(0\) and variance \(\sigma^2\). We study the expected out-of-sample risk on an independent draw from a new, potentially different right-rotationally invariant design, \(\widetilde{\mathbf{X}} = \widetilde{\mathbf{Q}}^\top \widetilde{\mathbf{D}} \widetilde{\mathbf{O}}\), with \(\widetilde{\mathbf{X}} \in \mathbb{R}^{n' \times p}\). This corresponds to learning a model on one interdependent population, and using it on a separate interdependent population. We analyze the conditional risk \[\label{eq:risk-def} R_{\mathbf{X}, \mathbf{y}}\left(\hat{\boldsymbol{\beta}}(\mathbf{X}, \mathbf{y}), \boldsymbol{\beta}\right) = \frac{1}{n'} \mathbb{E}\left[\left.\left\|\widetilde{\mathbf{X}}\hat{\boldsymbol{\beta}} - \widetilde{\mathbf{X}}\boldsymbol{\beta}\right\|^2 \right| \mathbf{X}, \mathbf{y}\right].\tag{2}\] Such conditional risks have been examined in prior works [2], [4]. We focus on the risk of the ridge estimator \[\hat{\boldsymbol{\beta}}_\lambda = \mathop{\mathrm{arg\;min}}_{\mathbf{b} \in \mathbb{R}^p} \left\{\|\mathbf{y} - \mathbf{X}\mathbf{b}\|_2^2 + \lambda \|\mathbf{b}\|^2_2\right\},\] which has closed form \(\hat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}^\top \mathbf{X}+ \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}\). As such, sometimes we will write \(R_{\mathbf{X}, \mathbf{y}}(\lambda) := R_{\mathbf{X}, \mathbf{y}}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta})\) for convenience. Before stating our results, we note the running assumptions used in what follows.
We consider a sequence of right-rotationally invariant designs \(\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_{n}, \dots\) where each \(\mathbf{X}_i \sim \mathbf{Q}_i^\top \mathbf{D}_i \mathbf{O}_i\), and a sequence of signal vectors \(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \dots, \boldsymbol{\beta}_n, \dots\) such that for all \(n\), \(\boldsymbol{\beta}_n \in \mathbb{R}^p\) and \(\|\boldsymbol{\beta}_n\|= r\sqrt{n}\); \(\mathbf{y}_i\) is generated by Eq. 1 . In the sequel, we drop the dependence on \(n\) and write \(\mathbf{X}, \mathbf{y}\) and \(\boldsymbol{\beta}\) with the indexing implicit. We refer to \((\mathbf{X}, \mathbf{y})\) as the training data.
\(\limsup \lambda_{\max}(\mathbf{D}_n) < C\) a.s. for some constant \(C\).
\(\mathbf{X}_n \in \mathbb{R}^{n \times p(n)}\), \(n\to\infty\), \(p(n) / n \rightarrow \gamma \in (0, \infty)\).
\(\boldsymbol{\epsilon}\) is independent of \(\mathbf{X}\); \(\epsilon_i\)’s are independent; \(\mathbb{E}[\epsilon_i] = 0\), \(\mathbb{E}[\epsilon_i^2] = \sigma^2\), and \(\mathbb{E}[\epsilon_i^{4 + \eta}] <\infty\) for some \(\eta > 0\).
To fix scaling and simplify our results, we assume that \(\frac{n \mathbb{E}[\mathsf{Tr}(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}})]}{n' p} = 1\); this ensures the risk is defined and maintains the scale of the training set relative to the test set (See Remark 8 of Appendix).
\(\boldsymbol{\beta}_n\) is fixed or random independent of \((\mathbf{X}_n, \boldsymbol{\epsilon})\).
\(\widetilde{\mathbf{X}}\) is an independent draw from a potentially different right-rotationally invariant ensemble.
Of crucial importance will be the two parameters \(r^2\) and \(\sigma^2\), which dictate the behavior of the problem.
Remark 1. To clarify why the risk in Eq. 2 is relevant, if one samples \((\widetilde{\mathbf{X}}, \widetilde{\mathbf{y}})\) according to Eq. 1 , then \(R_{\mathbf{X}, \mathbf{y}}(\lambda) = \mathbb{E}[ \|\widetilde{\mathbf{y}} - \widetilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_\lambda \|^2 \mid \mathbf{X}, \mathbf{y}] - \sigma^2\), meaning the risk defined here corresponds, up to the additive noise level \(\sigma^2\), to the mean-squared error on the test set.
Remark 2. One should think of \(\mathbf{X}\) in our setting as akin to the normalized matrix \(\mathbf{X}/\sqrt{n}\) in the i.i.d. setting. This further ensures that \(\|\mathbf{X}^\top \mathbf{X}\|_{\mathsf{op}} = \|\mathbf{D}^\top \mathbf{D}\|_{\mathsf{op}} = \mathcal{O}(1)\). To maintain the signal-to-noise ratio under this scaling, \(\|\boldsymbol{\beta}\| = \mathcal{O}(\sqrt{n})\) (Assumption [item:scaling-matrix]); the test set is appropriately scaled through Assumption [item:scaling].
Prior work [3], [4] established that for many designs with i.i.d. samples, LOOCV and GCV estimate the out-of-sample error consistently in the proportional asymptotics regime, while \(k\)-fold CV does not. In our setting, LOOCV is intractable without additional assumptions on \(\mathbf{Q}\) (see Appendix 8.6). However, GCV is tractable, motivating our study here. Initially introduced as a technique for selecting the regularization parameter in smoothing splines [32], GCV’s optimality and consistency have been shown in various tasks [33]–[37]. The GCV cross-validation metric for ridge regression is as follows [38]: \[\label{eq:gcv} \mathsf{GCV}_n(\lambda) = \frac{1}{n} \sum_{i = 1} ^n \left(\frac{y_i - \mathbf{x}_i^\top \hat{\boldsymbol{\beta}}_{\lambda}}{1 - \mathsf{Tr}(\mathbf{S}_{\lambda})/n}\right)^2.\tag{3}\] Here \(\mathbf{S}_\lambda = \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda \mathbf{I})^{-1} \mathbf{X}^\top\) is the smoother matrix. Our first result characterizes the limit of this metric in our setting.
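For reference, here is a minimal numpy sketch of the closed-form ridge fit and the classical GCV metric in Eq. 3; the function names are ours, and \(\mathsf{Tr}(\mathbf{S}_\lambda)\) is computed from the singular values of \(\mathbf{X}\).

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimator (X^T X + lam I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def gcv(X, y, lam):
    # Classical GCV metric of Eq. (3); Tr(S_lam) = sum_i d_i^2 / (d_i^2 + lam)
    n = X.shape[0]
    d2 = np.linalg.svd(X, compute_uv=False) ** 2
    tr_S = np.sum(d2 / (d2 + lam))
    resid = y - X @ ridge_fit(X, y, lam)
    return np.mean(resid**2) / (1.0 - tr_S / n) ** 2
```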
Theorem 1. Under the stated assumptions, \[\mathsf{GCV}_n(\lambda) - \frac{r^2 (v_{\mathbf{D}}(-\lambda) - \lambda v'_{\mathbf{D}}(-\lambda)) + \sigma^2\gamma v'_{\mathbf{D}}(-\lambda)}{\gamma v_{\mathbf{D}}(-\lambda)^2} \xrightarrow{a.s.} 0.\]
Further, the risk takes the following form:
Theorem 2. Fix \(0 < \lambda_1 < \lambda_2 < \infty\). Define \[\begin{gather} \label{Rdef} \mathsf{R}_{\mathbf{D}}(r^2, \sigma^2) = r^2 \left(\frac{\lambda^2}{\gamma} v'_{\mathbf{D}}(-\lambda) + \frac{\gamma - 1}{\gamma}\right) \\ + \sigma^2 ( v_{\mathbf{D}}(-\lambda) - \lambda v_{\mathbf{D}}'(-\lambda)). \end{gather}\tag{4}\] Under the stated assumptions, \[\sup_{\lambda \in [\lambda_1, \lambda_2]} \left| R_{\mathbf{X}, \mathbf{y}} \left( \hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta} \right) - \mathsf{R}_{\mathbf{D}}(r^2, \sigma^2) \right| \xrightarrow{a.s.} 0,\] where we recall that \(r^2=\|\boldsymbol{\beta} _n\|^2/n\) and \(\sigma^2=\mathbb{E}[\epsilon_i^2]\).
In Theorem 2, one can interpret the term involving \(r^2\) as the bias of the estimator, and the term with \(\sigma^2\) as its variance. As such, one can then interpret Theorem 1 as showing how GCV reconstructs the bias and variance from the data.
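Concretely, the risk formula of Theorem 2 is a deterministic function of the singular values once \(r^2\) and \(\sigma^2\) are known; below is a minimal numpy sketch (the function name is ours).

```python
import numpy as np

def roti_risk(svals, lam, n, p, r2, sigma2):
    """R_D(r^2, sigma^2) from Theorem 2 (Eq. 4), given the singular values of X."""
    gamma = p / n
    d2 = np.zeros(n)
    k = min(n, p)
    d2[:k] = np.asarray(svals)[:k] ** 2
    v = np.mean(1.0 / (d2 + lam))        # v_D(-lam)
    vp = np.mean(1.0 / (d2 + lam) ** 2)  # v_D'(-lam)
    bias = r2 * (lam**2 / gamma * vp + (gamma - 1) / gamma)
    variance = sigma2 * (v - lam * vp)
    return bias + variance
```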
Remark 3. It is shown in [4] that for i.i.d. designs, \(\Delta_n(\lambda) := \mathsf{GCV}_n(\lambda) - R_{\mathbf{X}, \mathbf{y}}(\lambda) - \sigma^2\) approaches zero uniformly over compact intervals. Their proof of this relies explicitly on identities of the Stieltjes transform given by the Silverstein equation [39]. These identities do not exist for right-rotationally invariant designs, and thus the GCV becomes biased for these designs, meaning \(\Delta_n(\lambda) \not \to 0\). The more the spectrum departs from that of the i.i.d. setting, the more biased the estimates.
The original GCV (in Eq. 3 ) can be seen as starting with the training error and finding a rescaling that recovers the out-of-sample risk. To construct our alternative GCV metric, which is provably consistent for right-rotationally invariant designs, we likewise begin with the training error, but instead use it to produce estimating equations for \(r^2\) and \(\sigma^2\). We then use these to compute the risk formula given by Theorem 2.
Consistent estimation of \(\sigma^2\) has been studied in [31]. We propose a different approach since we observe it performs better in finite samples. In particular, we establish the following result:
Lemma 1. Under our assumptions, for any \(\lambda > 0\), \[\begin{gather} \label{eq:gcv-as} \frac{1}{n} \sum_{i = 1} ^n \left(y_i - \mathbf{x}_i^\top \hat{\boldsymbol{\beta}}_\lambda \right)^2 - \left[r^2 \frac{\lambda^2}{\gamma} \left(v_{\mathbf{D}}(-\lambda) - \lambda v'_{\mathbf{D}}(-\lambda)\right)\right. \\ \left. \phantom{\frac{\lambda^2}{\gamma}} + \sigma^2\lambda^2 v'_{\mathbf{D}}(-\lambda)\right] \xrightarrow{a.s.} 0. \end{gather}\tag{5}\]
Observe that Eq. 5 can be used as an estimating equation, meaning that, if one treats 5 as an equality to 0 (i.e. replacing “\(\xrightarrow{a.s.}\)” with “\(=\)”), then we can solve for the unknown quantities \(r^2\) and \(\sigma^2\). Recall that \(\gamma = p/n\), \(v_\mathbf{D}\), and \(v'_\mathbf{D}\), defined in Section 2.1, are all data dependent quantities. Now, Eq. 5 yields a distinct estimating equation for each value of \(\lambda\). While it takes only two equations to solve for \(r^2\) and \(\sigma^2\), we instead use a grid of \(\lambda\) and find OLS estimates for \(r^2\) and \(\sigma^2\) to improve robustness. Explicitly, the scheme proceeds as shown in steps 1-6 of Algorithm 1.
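A minimal numpy sketch of this estimation step (in the spirit of steps 1-6 of Algorithm 1), under the assumption that \(\mathbf{X}\) has already been scaled as in Remark 2; the grid below follows the logarithmic spacing suggested in the next paragraph, and the function name is ours.

```python
import numpy as np

def estimate_r2_sigma2(X, y, lams=None):
    """Solve the estimating equations of Eq. (5) by least squares over a
    grid of lambda values, returning (r2_hat, sigma2_hat)."""
    n, p = X.shape
    gamma = p / n
    if lams is None:
        lams = np.logspace(0, 2.5, 20)
    d2 = np.zeros(n)
    s = np.linalg.svd(X, compute_uv=False)
    d2[:len(s)] = s**2
    rows, targets = [], []
    for lam in lams:
        v = np.mean(1.0 / (d2 + lam))        # v_D(-lam)
        vp = np.mean(1.0 / (d2 + lam) ** 2)  # v_D'(-lam)
        beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        train_err = np.mean((y - X @ beta_hat) ** 2)
        # Eq. (5): train_err ~ r^2 (lam^2/gamma)(v - lam v') + sigma^2 lam^2 v'
        rows.append([lam**2 / gamma * (v - lam * vp), lam**2 * vp])
        targets.append(train_err)
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return float(coef[0]), float(coef[1])
```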
Using Lemma 1, we can establish the following:
Corollary 1. Under Assumptions [item:scaling-matrix]-[item:indep-test] and a non-degeneracy condition on \(\mathbf{D}\) (see Appendix 7.7.1 for details), \(\hat{r}^2\) and \(\hat{\sigma}^2\) from Algorithm 1 are strongly consistent for \(r^2\) and \(\sigma^2\), respectively.
Empirically, we find that normalizing the spectrum of \(\mathbf{X}^\top \mathbf{X}\) and taking \(\{\lambda_i\}_{i=1}^L\) logarithmically spaced between \(1\) and \(10^{2.5}\) works well.
We now define \(\mathsf{ROTI}\text{-}\mathsf{GCV}(\lambda) := \mathsf{R}_\mathbf{D}(\hat{r}^2, \hat{\sigma}^2)\) (recall \(\mathsf{R}_\mathbf{D}\) from Theorem 2) and tune \(\lambda\) using this metric (Algorithm 1); below we establish that this quantity is uniformly consistent for the out-of-sample risk over compact intervals.
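A short usage sketch of the resulting tuning rule, reusing the helpers sketched above and assuming training data \((\mathbf{X}, \mathbf{y})\) in scope (the grid of candidate \(\lambda\) values is an illustrative choice):

```python
import numpy as np

r2_hat, sigma2_hat = estimate_r2_sigma2(X, y)
svals = np.linalg.svd(X, compute_uv=False)
n, p = X.shape
lam_grid = np.logspace(-2, 2, 50)
# ROTI-GCV(lam) = R_D(r2_hat, sigma2_hat); pick the minimizer over the grid
risks = [roti_risk(svals, lam, n, p, r2_hat, sigma2_hat) for lam in lam_grid]
lam_star = lam_grid[int(np.argmin(risks))]
```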
Corollary 2. Fix any \(0 < \lambda_1 < \lambda_2 < \infty\). Under Assumptions [item:scaling-matrix]-[item:indep-test], one has \[\sup_{\lambda \in [\lambda_1, \lambda_2]}\left|\mathsf{ROTI}\text{-}\mathsf{GCV}(\lambda) - R_{\mathbf{X}, \mathbf{y}}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta})\right| \xrightarrow{a.s.} 0.\]
Remark 4. Uniform convergence, unlike pointwise convergence, ensures the risk attained by the minimizer (over a compact interval) of our cross-validation metric is close to optimal. See Appendix 8.4 for a short proof.
An issue that emerges when attempting to apply ROTI-GCV in practice is that the signal often tends to align with the top eigenvectors of the data, violating the independence assumption imposed on \(\boldsymbol{\beta}\) and \(\mathbf{O}\). The right-rotationally invariant assumption, for fixed \(\boldsymbol{\beta}\), inherently assumes that the signal is incoherent with respect to the eigenbasis. However, this assumption is generally not true, and in fact in the i.i.d. anisotropic case, the geometry of \(\boldsymbol{\beta}\) with respect to the top eigenvectors of \(\boldsymbol{\Sigma}\) is known to significantly influence the behavior of ridge regression (see e.g. [40]).
Adapting the approach given in [31], we model this scenario using two index sets \(\mathcal{J}_a\) and \(\mathcal{J}_c\) of distinguished eigenvectors, which are aligned and coupled eigenvectors, respectively. We replace Assumptions [item:indep-beta] and [item:indep-test] in Section 3 with the following:
The true signal \(\boldsymbol{\beta}\) is given as \(\boldsymbol{\beta} = \boldsymbol{\beta}' + \sum_{i \in \mathcal{J}_a}\sqrt{n}\alpha_i\mathbf{o}_i\), where \(\|\boldsymbol{\beta}'\|/\sqrt{n} = r\); \(\boldsymbol{\beta}'\) is independent of \((\mathbf{X}, \boldsymbol{\epsilon})\). Here \(\mathbf{o}_i\) refers to the \(i\)th row of \(\mathbf{O}\), which is an eigenvector of \(\mathbf{X}^\top \mathbf{X}\). An eigenvector \(\mathbf{o}_i\) is an aligned eigenvector if \(i \in \mathcal{J}_a\), as then \(\mathbf{o}_i\) aligns with the signal \(\boldsymbol{\beta}\);
The test data \(\widetilde{\mathbf{X}} \sim \widetilde{\mathbf{Q}}^\top \widetilde{\mathbf{D}} \widetilde{\mathbf{O}}\) is drawn from a separate right-rotationally invariant ensemble; \(\widetilde{\mathbf{O}}\) is distributed according to a Haar matrix with rows conditioned to satisfy \(\mathbf{o}_i = \widetilde{\mathbf{o}}_i\) for all \(i \in \mathcal{J}_c\). An eigenvector \(\mathbf{o}_i\) is then coupled if \(i \in \mathcal{J}_c\), as \(\mathbf{o}_i\) is shared between the test and train sets.
We require two more assumptions:
\(\mathcal{J}_a, \mathcal{J}_c\), \(\boldsymbol{\alpha} = (\alpha_i)_{i \in \mathcal{J}_a}\) are fixed, not changing with \(n, p\); this ensures consistent estimation of these distinguished directions.
\(\limsup \|\mathbb{E}[\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}}]\|_\mathsf{op}< C\) for some constant \(C\). This differs from Assumption [item:bound-op], as it is on the test data.
The main difficulty that arises in our setting, compared to [31], is that we must account for how the geometry of the eigenvectors of the test set relates to those of the training set. One should expect, for example, in some types of structured data that the top eigenvectors of the training data and test data are closely aligned. We account for this through the coupled eigenvectors condition ([item:coupled]). Furthermore, without this coupling, any alignment between the eigenvectors of the training set and the signal \(\boldsymbol{\beta}\) will not exist in the test set, since then the eigenvectors of the test set would be independent of this alignment. We are still interested in the same test risk \(R_{\mathbf{X}, \mathbf{y}}\) as in Eq. 2 , except now note that \(\widetilde{\mathbf{X}}\) and \(\mathbf{X}\) are dependent.
Theorem 3. Under Assumptions [item:scaling-matrix]-[item:scaling] and [item:aligned]-[item:j-bounded-spec], for any \(0 < \lambda_1 < \lambda_2 < \infty\), we have \[\sup_{\lambda \in [\lambda_1, \lambda_2]} \left| R_{\mathbf{X}, \mathbf{y}} \left( \hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta} \right) - \mathcal{R}_{\mathbf{X}, \mathbf{y}}(r^2, \sigma^2, \boldsymbol{\alpha}) \right| \xrightarrow{a.s.} 0,\] where \[\mathcal{R}_{\mathbf{X}, \mathbf{y}}(r^2, \sigma^2, \boldsymbol{\alpha}) = \mathcal{B}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta}) + \mathcal{V}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta}).\] Let \(\mathfrak{d}_i^2 = \mathbb{E}\left[\left(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}}\right)_{ii}\right]\) and \(\mathfrak{d}_\mathsf{bulk}^2 = \frac{1}{p - |\mathcal{J}_c|} \sum_{i \not \in \mathcal{J}_c} \mathfrak{d}_i^2\). \[\begin{align} \mathcal{B}_\mathbf{X}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta}) &= \frac{\lambda^2}{n} \sum_{i = 1} ^p \left[\frac{\left(r^2/\gamma + \mathbb{1}(i \in \mathcal{J}_a)n\alpha_i^2\right)}{(D_{ii}^2 + \lambda)^2} \right. \\ &\qquad \cdot \left. \phantom{\frac{a^2}{b}}\left(\mathfrak{d}_\mathsf{bulk}^2 + \mathbb{1}(i \in \mathcal{J}_c)(\mathfrak{d}_i^2 - \mathfrak{d}^2_\mathsf{bulk})\right)\right] \\ \mathcal{V}_\mathbf{X}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta}) &= \frac{\sigma^2}{n} \sum_{i = 1} ^n \frac{D_{ii}^2\left(\mathfrak{d}_\mathsf{bulk}^2 + \mathbb{1}(i \in \mathcal{J}_c)(\mathfrak{d}_i^2 - \mathfrak{d}^2_\mathsf{bulk})\right)}{(D_{ii}^2 + \lambda)^2} \end{align}\]
Remark 5. Comparing the two terms here with those in Theorem 2, observe that they are equal when \(\mathcal{J}_a\) and \(\mathcal{J}_c\) are empty. Furthermore, observe that each element of \(\mathcal{J}_a\) adds a large contribution to the bias, scaling with the size of the alignment \(\alpha_i\); similarly, elements of \(\mathcal{J}_c\) contribute additional bias and variance.
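For completeness, a numpy sketch of the risk formula of Theorem 3, with \(\boldsymbol{\alpha}\) passed as a dictionary over \(\mathcal{J}_a\) and the test-spectrum quantities \(\{\mathfrak{d}_i^2\}_{i \in \mathcal{J}_c}\), \(\mathfrak{d}^2_{\mathsf{bulk}}\) supplied as estimates; the function name, argument names, and zero-based indexing are our own conventions.

```python
import numpy as np

def aligned_roti_risk(svals, lam, n, p, r2, sigma2, alpha, Jc, d2_c, d2_bulk):
    """B_X + V_X from Theorem 3.  `alpha[i]` is alpha_i for i in J_a,
    `Jc` is the set of coupled indices, `d2_c[i]` estimates
    E[(D~^T D~)_{ii}] for i in J_c, and `d2_bulk` is the bulk average."""
    gamma = p / n
    d2 = np.zeros(max(n, p))
    d2[:len(svals)] = np.asarray(svals) ** 2

    def weight(i):
        return d2_bulk + (d2_c[i] - d2_bulk if i in Jc else 0.0)

    bias = sum((r2 / gamma + (n * alpha[i] ** 2 if i in alpha else 0.0))
               * weight(i) / (d2[i] + lam) ** 2 for i in range(p))
    bias *= lam**2 / n
    var = sum(d2[i] * weight(i) / (d2[i] + lam) ** 2 for i in range(n))
    var *= sigma2 / n
    return bias + var
```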
Remark 6. The assumption that the top eigenvectors of training and test sample covariance are exactly equal (Assumption [item:coupled]) is not expected to hold perfectly in practice. However, the idea is that the approximation can lead to more robust procedures. We thus expect this approximation to hold when the training and test data have “spiky” spectra, where the top few eigenvalues are heavily separated from the bulk, and these top eigenvectors of the test and training data are close.
Our approach is simply to estimate the \(\alpha_i\)’s using the classical principal components regression (PCR); see [41], [42] for details, also described in steps 1 and 2 of Algorithm 2. We then compute a transformed model which is right-rotationally invariant, from which Algorithm 1 steps 1-6 can be used to find \(r^2\) and \(\sigma^2\).
We require some notation; we focus on subsets \(S\) of the indices of all nonzero singular values \(\mathcal{N} = \{i \in [p] : (\mathbf{D}^\top \mathbf{D})_{ii} > 0\}\). Let \(\mathbf{O}_S \in \mathbb{R}^{|S| \times p}\) for any \(S\) denote the rows of \(\mathbf{O}\) indexed by \(S\). Let \(\mathbf{D}_S \in \mathbb{R}^{(n - |S|) \times p}\) denote the rows of \(\mathbf{D}\) indexed by \(S\). Finally, for a set \(S \subset \mathcal{N}\), denote \(\overline{S} = \mathcal{N} \setminus S\). The estimation for \(r^2, \sigma^2\), and \(\boldsymbol{\alpha}\) then proceeds as in steps 1-5 of Algorithm 2.
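To illustrate the idea behind steps 1 and 2 of Algorithm 2, here is a PCR-style sketch of estimating the \(\alpha_i\)'s: writing the SVD of the training design as \(\mathbf{X} = \mathbf{U}\,\mathrm{diag}(d)\,\mathbf{V}^\top\) (so the rows of \(\mathbf{V}^\top\) play the role of the \(\mathbf{o}_i\)), the model gives \((\mathbf{U}^\top \mathbf{y})_i \approx d_i \sqrt{n}\,\alpha_i\) plus noise for \(i \in \mathcal{J}_a\). This is only a sketch of the principle; the set \(\mathcal{J}_a\) below is hypothetical, and the exact estimator used in Algorithm 2 may differ.

```python
import numpy as np

# Assumes training data (X, y) in scope and a chosen aligned set J_a
U, d, Vt = np.linalg.svd(X, full_matrices=False)
n = X.shape[0]
J_a = [0, 1, 2]   # hypothetical aligned indices (0-based)
# (U^T y)_i = d_i * o_i^T beta + noise,  and  o_i^T beta ~= sqrt(n) * alpha_i
alpha_hat = {i: float(U[:, i] @ y) / (np.sqrt(n) * d[i]) for i in J_a}
```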
Lemma 2. Under the additional assumption that \(\boldsymbol{\epsilon}\) has Gaussian entries, \(\tilde{r}^2\), \(\tilde{\sigma}^2\), and \(\hat{\boldsymbol{\alpha}}\), from Algorithm 2, are all strongly consistent.
Our approach for cross-validation is to then again explicitly compute the risk formulas \(\mathcal{B}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta})\) and \(\mathcal{V}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta})\), where we estimate the unknown parameters from data. To apply our method, we therefore need to consistently estimate \(r^2\), \(\sigma^2\), \(\boldsymbol{\alpha}\), \(\{\mathfrak{d}_i^2\}_{i \in \mathcal{J}_c}\), and \(\mathfrak{d}_\mathsf{bulk}^2\). Lemma 2 gives us most of these parameters, but we still require consistent estimates of the \(\mathfrak{d}_i\)’s. This is a non-issue in cases where \(\mathbf{D}\) and \(\widetilde{\mathbf{D}}\) are modeled as deterministic. If they are random and the test data is drawn from the same right-rotationally invariant ensemble as the training data, we expect that, in practice, one can use the eigenvalues of the training data for this estimation. However, our framework further allows the test data to be drawn from a different right-rotationally invariant ensemble. In such cases, one can instead use the test data for this estimation, since it depends only on the test covariates \(\widetilde{\mathbf{X}}\) (which the practitioner may have access to) and not on the test labels; this is done for the experiments in Section 5. We therefore make one final assumption:
We then take our validation metric as \(\mathsf{aROTI}\text{-}\mathsf{GCV}(\lambda) = \mathcal{R}_{\mathbf{X}, \mathbf{y}}(\tilde{r}^2, \tilde{\sigma}^2, \hat{\boldsymbol{\alpha}}, \{\hat{\mathfrak{d}}_i\}_{i \in \mathcal{J}_c, \mathsf{bulk}})\), where the latter expression denotes substituting every parameter with its estimated version. Algorithm 2 summarizes our entire procedure.
Corollary 3. Under Assumptions [item:scaling-matrix]-[item:scaling], [item:aligned]-[item:consist-eigen], and the additional assumption that the entries of \(\boldsymbol{\epsilon}\) are Gaussian, for any \(0 < \lambda_1 < \lambda_2 < \infty\), \(\mathsf{aROTI}\text{-}\mathsf{GCV}\) satisfies \[\sup_{\lambda \in [\lambda_1, \lambda_2]} |\mathsf{aROTI}\text{-}\mathsf{GCV}(\lambda) - R_{\mathbf{X}, \mathbf{y}}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta})| \xrightarrow{a.s.} 0.\]
Remark 7. This states that aROTI-GCV is uniformly consistent on compact intervals, ensuring the minimizer of aROTI-GCV attains risk that is close to optimal.
In this section, we demonstrate the efficacy of our method via numerical experiments. In producing each plot below, we generate a training set according to a prescribed distribution, as well as a signal vector \(\boldsymbol{\beta}\). We then repeatedly resample the training noise \(\boldsymbol{\epsilon}\) and run each cross-validation method, in addition to computing the true risk on the test set.
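The sketch below illustrates one such run on synthetic autocorrelated data, reusing the helpers sketched earlier; under the scaling assumption, the conditional risk of Eq. 2 reduces (approximately, for this construction) to \(\|\hat{\boldsymbol{\beta}}_\lambda - \boldsymbol{\beta}\|^2/n\), which we use as the oracle. The sizes and the \(\lambda\) values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho, r2, sigma2 = 400, 400, 0.8, 1.0, 1.0

# Autocorrelated design, scaled as in Remark 2
Z = rng.standard_normal((n, p))
X = np.empty((n, p))
X[0] = Z[0]
for i in range(1, n):
    X[i] = rho * X[i - 1] + np.sqrt(1 - rho**2) * Z[i]
X /= np.sqrt(n)

beta = rng.standard_normal(p)
beta *= np.sqrt(r2 * n) / np.linalg.norm(beta)     # ||beta|| = r * sqrt(n)
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

svals = np.linalg.svd(X, compute_uv=False)
r2_hat, sigma2_hat = estimate_r2_sigma2(X, y)
for lam in [0.1, 1.0, 10.0]:
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    oracle = np.sum((beta_hat - beta) ** 2) / n
    est = roti_risk(svals, lam, n, p, r2_hat, sigma2_hat)
    print(f"lambda={lam:5.1f}  oracle={oracle:.3f}  ROTI-GCV={est:.3f}")
```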
Figure 3: Risk curves produced by the cross-validation methods in addition to the true risk curve. Each plot uses a different form of right-rotationally invariant data, as indicated in each caption. (b): Autocorrelated data: rows of \(\mathbf{X}\) are drawn according to \(\mathbf{x}_i = \rho \mathbf{x}_{i - 1} + \sqrt{1 - \rho^2} \mathbf{z}_i\), with \(\mathbf{z}_i\) being i.i.d. draws from \(\mathsf{N}(0, \mathbf{I}_p)\). We set \(\rho = 0.8\). (c): \(t\)-distributed data: each row is drawn from a multivariate \(t\) distribution with \(3\) degrees of freedom. (d): Equicorrelated data: \(\mathbf{X} \in \mathbb{R}^{n \times p}\) has independent columns, but each column follows a multivariate Gaussian distribution with covariance matrix \(\boldsymbol{\Sigma}\), where \(\Sigma_{ij} = \rho\) if \(i \neq j\), and \(\Sigma_{ii} = 1\). All simulations have \(n = p = 1000\) and \(r^2 = \sigma^2 = 1\). The \(x\)-axis for every plot is \(\lambda\), the regularization parameter. Each colored line (blue, green, red, purple) shows one of 10 iterations; in each iteration, we compute the cross-validation metric over a range of \(\lambda\) to produce the line, which reflects the estimated out-of-sample risk. The black line of each plot shows the average result for that method. The dashed blue line shows the average expected MSE curve as a benchmark. a — Gaussian data, b — Autocorrelated data, \(\rho = 0.8\), c — \(t\)-distributed data, d — Equicorrelated data, \(\rho = 0.8\)
Figure 3 illustrates the performance of ROTI-GCV and compares it to LOOCV and GCV. In these figures, the Tuned Risk (abbreviated TR) refers to the out-of-sample risk obtained when we use the value of \(\lambda\) that optimizes the given cross-validation method. Minimum risk (MR) denotes the minimum of the true expected out-of-sample risk curve. Standard errors are shown in parentheses for all tuned risks. Estimated Risk (ER) refers to the value of the out-of-sample risk that would be estimated using the given cross-validation method. As expected, in the Gaussian setting all three methods perform well. However, for dependent data, such as in the autocorrelated and equicorrelated cases, we observe that the risk curves produced by GCV and LOOCV are far from the truth, whereas ROTI-GCV accurately captures the true risk curve. Surprisingly, the tuned risk values from LOOCV and GCV are still decently close to that of ROTI-GCV. As discussed in Remark 3, GCV (and also LOOCV) estimate a quantity with an additional additive \(\sigma^2\) term; thus, to make the evaluation of estimated risks fair, these two curves are plotted with this term removed.
Figure 4: As in Figure 3, curves show the risk and estimated risk as a function of \(\lambda\). (a): \(\mathbf{X}\) has autocorrelated rows, with \(\rho = 0.8\). The top 10 eigenvectors (\(\mathcal{J}_c = [10]\)) are coupled, so that the top 10 eigenvectors of the test set are equal to those of \(\mathbf{X}\). \(\boldsymbol{\beta} = \boldsymbol{\beta}' + \sqrt{n}\sum_{i = 1} ^{10} \frac{i}{10} \mathbf{o}_i\), with \(\|\boldsymbol{\beta}'\|^2 = \sqrt{n}\), \(\sigma^2 = 1\), \(n = p = 1000\). (b): i.i.d. rows drawn from a Gaussian mixture, i.e. \(\mathbf{x}_i\) is drawn from \(\frac{1}{2} \mathsf{N}(\mathbf{3}, \mathbf{I}_p) + \frac{1}{2} \mathsf{N}(-\mathbf{3}, \mathbf{I}_p)\). When computing ROTI-GCV, we set \(\mathcal{J}_c = \{1\}\). (c): i.i.d. rows drawn from an equicorrelated Gaussian, i.e. \(\mathbf{x}_i \sim \mathsf{N}(0, \boldsymbol{\Sigma})\), where \(\Sigma_{ij} = \mathbb{1}(i = j) + \rho\mathbb{1}(i \neq j)\). We set \(\rho = 0.5\). When computing ROTI-GCV, we set \(\mathcal{J}_c = [10]\). (d): speech data with \(\boldsymbol{\beta}\) sampled uniformly from the sphere, with \(r^2 = \sigma^2 = 1\), \(n = p = 400\); we choose \(\mathcal{J}_c = [3]\). (e): speech data with \(\boldsymbol{\beta} = \boldsymbol{\beta}' + \tfrac{\sqrt{n}}{2}\sum_{i = 1} ^5 \mathbf{o}_i\); again \(r^2 = \sigma^2 = 1\) and \(n = p = 400\). We choose \(\mathcal{J}_c = [3]\) and \(\mathcal{J}_a = [5]\) (discussion in Section 5.3.1). (f): 30 minute residualized returns sampled every 1 minute; \(\boldsymbol{\beta}\) sampled uniformly from the sphere. a — Signal-PC alignment, b — Gaussian mixture, c — Equicorrelated rows, \(\rho = 0.5\), d — Speech Data without Signal-PC alignment, e — Speech Data with Signal-PC alignment, f — Residualized returns
We now consider settings where the signal \(\boldsymbol{\beta}\) aligns with the eigenvectors \(\mathbf{O}\). The relevant method here is aROTI-GCV (Algorithm 2). In Figure 4 (a), we show the performance of aROTI-GCV when we set the top eigenvectors of the test set equal to those of the training set, matching the coupled eigenvectors condition (Assumption [item:coupled]). We further construct the signal such that it aligns with the top 10 eigenvectors. As the conditions match our theory, we see aROTI-GCV performs well. The setting of Figure 4 (a) might look contrived, but we next provide examples of well-known distributions where such structures naturally arise (Figures 4 (b) and 4 (c)). In Figure 4 (b), the data is drawn from a mixture of two Gaussians, one centered at \(\mathbf{3} = [3, 3, \dots, 3]\) and the other centered at \(-\mathbf{3}\), each with identity covariance. In this setting, the top eigenvector is very close to \(\mathbf{3}\) and emerges from the bulk. Furthermore, projected into the subspace \(\mathrm{span}(\mathbf{3})^\perp\), the data is a standard Gaussian and thus right-rotationally invariant. As seen, our method works under this type of covariance structure. Shown in Figure 4 (c), we find similar results when each row of \(\mathbf{X}\) is drawn from an equicorrelated Gaussian, i.e., \(\mathbf{x}_i \sim \mathsf{N}(0, \boldsymbol{\Sigma})\) where \(\Sigma_{ij} = \rho^{1-\mathbb{1}(i=j)}\). Both examples mirror the setting of Figure 4 (a) in that they are not right-rotationally invariant, as the test and train sets vary in one direction more than the others. However, we see that the modifications in aROTI-GCV enable the method to still succeed. Note that in both of these examples, the top eigenvalue diverges at the rate \(O(n)\), which violates our Assumption [item:j-bounded-spec] as well as the bounded spectrum condition of [4], meaning neither cross-validation method is provably valid in this setting. However, we observe that ROTI-GCV still performs reasonably well.
In our semi-real experiments, we use real data for the designs, but generate the signals and responses ourselves. Knowing the true signal enables us to calculate the population risk of any estimator, allowing us to evaluate the performance of our GCV metrics accurately. For such experiments, we must verify whether the assumptions of our theory are met. To this end, we provide a series of diagnostics to check whether the most important assumptions hold, including how to select the index sets \(\mathcal{J}_a\) and \(\mathcal{J}_c\). We briefly list these here, and defer more details to Appendix 9, including examples of running these diagnostics for multiple datasets.
(i) regularity of singular value distributions (Assumptions [item:bound-op], [item:j-bounded-spec]). Plot histogram of singular values to check for outliers; see Figure 5.
(ii) signal-PC alignment: relationship between \(\boldsymbol{\beta}\) and \(\{\mathbf{o}_i\}_{i = 1} ^p\) (Assumptions [item:indep-beta], [item:aligned]). [31] provide a hypothesis testing framework for identifying \(\mathcal{J}_a\), wherein one computes a \(p\)-value per eigenvector to test whether or not it is aligned; the Benjamini-Hochberg procedure can then be used to control the false discovery rate. See Table 1.
(iii) coupling: relationship between \(\{\mathbf{o}_i\}_{i = 1} ^p\) and \(\{\widetilde{\mathbf{o}}_j\}_{i=1} ^p\) (Assumptions [item:indep-test], [item:coupled]). We compute overlaps \(\sqrt{p}\langle \mathbf{o}_i, \widetilde{\mathbf{o}}_j\rangle\) for the top singular vectors. Note if \(\mathbf{o}_i~\perp\!\!\!\perp~\widetilde{\mathbf{o}}_j\) (as under [item:indep-test]), then \(\sqrt{p}\langle \mathbf{o}_i, \widetilde{\mathbf{o}}_j \rangle \to \mathsf{N}(0, 1)\). If \(\sqrt{p}\langle \mathbf{o}_i, \widetilde{\mathbf{o}}_j \rangle\) is large relative to this null distribution, \(\mathbf{o}_i\), \(\widetilde{\mathbf{o}}_j\) should be coupled. See Table [table:speech-overlaps]. We leave development of a formal hypothesis test to future work.
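A minimal sketch of the coupling diagnostic in (iii), assuming the train design \(\mathbf{X}\) and the test covariates \(\mathbf{X}_{\mathrm{test}}\) are in scope (the variable names and the choice of the top \(k = 10\) directions are ours):

```python
import numpy as np

# Overlaps sqrt(p) * |<o_i, o~_j>| between the top right singular vectors of
# the train and test designs; under independence these are approximately
# N(0, 1), so entries far in the tail suggest coupled directions (candidates for J_c).
_, _, Vt_train = np.linalg.svd(X, full_matrices=False)
_, _, Vt_test = np.linalg.svd(X_test, full_matrices=False)
k, p = 10, X.shape[1]
overlaps = np.sqrt(p) * np.abs(Vt_train[:k] @ Vt_test[:k].T)
print(np.round(overlaps, 2))
```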
The designs we use consist of speech data sourced from OpenML [43]. Each row has dimension 400 and consists of an i-vector (a type of featurization for audio data, see e.g. [44]) of a speech segment. In this first example, we fully illustrate all the proposed diagnostics; as such, we consider both when \(\boldsymbol{\beta}\) is uniformly drawn on the sphere, so there is no signal-PC alignment, and when signal-PC alignment is present, to illustrate how to choose \(\mathcal{J}_a\).
We first examine the distribution of singular values in Figure 5, finding there are no outlier values. Next, we compute overlaps of test and train eigenvectors, shown in Table [table:speech-overlaps], finding that the top 3 eigenvectors of each have strong overlaps. This motivates setting \(\mathcal{J}_c = [3]\). The results for this setting when \(\boldsymbol{\beta}\) is drawn uniformly from the sphere are shown in Figure 4 (d), where we see that ROTI-GCV successfully captures the true loss curve and that its estimated risk values are superior to those given by GCV and LOOCV.
Next, in the setting with added signal-PC alignment, we instead take \(\boldsymbol{\beta} = \boldsymbol{\beta}' + \tfrac{\sqrt{n}}{2}\sum_{i = 1} ^5 \mathbf{o}_i\), meaning the aligned set is \(\mathcal{J}_a = [5]\). We use the hypothesis testing framework of [31] and compute Benjamini-Hochberg-adjusted \(p\)-values for alignment, shown in Table 1; here, we clearly identify the set of aligned eigenvectors. The resulting loss curves for this setting are then shown in Figure 4 (e).
\caption{Overlaps for speech data setting; cell $(i, j)$ contains $\sqrt{p}\,|\langle \mathbf{o}_i, \widetilde{\mathbf{o}}_j \rangle|$.} \label{table:speech-overlaps}
\centering
\begin{tabular}{l cccccccccc}
 & $\widetilde{\mathbf{o}}_{1}$ & $\widetilde{\mathbf{o}}_{2}$ & $\widetilde{\mathbf{o}}_{3}$ & $\widetilde{\mathbf{o}}_{4}$ & $\widetilde{\mathbf{o}}_{5}$ & $\widetilde{\mathbf{o}}_{6}$ & $\widetilde{\mathbf{o}}_{7}$ & $\widetilde{\mathbf{o}}_{8}$ & $\widetilde{\mathbf{o}}_{9}$ & $\widetilde{\mathbf{o}}_{10}$ \\
$\mathbf{o}_{1}$ & 17.30 & 2.72 & 0.63 & 2.01 & 1.55 & 1.02 & 0.30 & 0.96 & 1.70 & 0.27 \\
$\mathbf{o}_{2}$ & 0.83 & 15.24 & 0.68 & 1.63 & 1.34 & 1.22 & 0.89 & 1.42 & 0.27 & 2.19 \\
$\mathbf{o}_{3}$ & 1.08 & 1.93 & 10.93 & 1.60 & 0.64 & 2.02 & 4.63 & 0.35 & 4.14 & 2.75 \\
$\mathbf{o}_{4}$ & 0.29 & 2.78 & 2.92 & 0.22 & 3.98 & 6.63 & 2.98 & 1.65 & 0.54 & 1.56 \\
$\mathbf{o}_{5}$ & 1.27 & 2.70 & 4.02 & 2.14 & 5.11 & 1.29 & 4.34 & 4.72 & 2.95 & 2.24 \\
$\mathbf{o}_{6}$ & 1.19 & 0.56 & 4.82 & 5.95 & 0.37 & 3.83 & 1.37 & 0.87 & 4.98 & 1.04 \\
$\mathbf{o}_{7}$ & 2.21 & 0.70 & 2.14 & 2.65 & 4.84 & 1.77 & 0.54 & 1.00 & 1.65 & 2.53 \\
$\mathbf{o}_{8}$ & 0.46 & 0.15 & 2.96 & 0.16 & 1.66 & 4.49 & 0.22 & 0.40 & 1.82 & 0.10 \\
$\mathbf{o}_{9}$ & 0.30 & 0.40 & 0.72 & 0.01 & 1.63 & 1.39 & 3.86 & 4.53 & 1.12 & 2.57 \\
$\mathbf{o}_{10}$ & 1.14 & 2.51 & 0.90 & 0.77 & 0.37 & 0.79 & 2.44 & 1.83 & 1.81 & 2.34 \\
\end{tabular}
\caption{Benjamini--Hochberg-adjusted $p$-values for signal-PC alignment in the speech data setting.}
\centering
\begin{tabular}{l ccccc}
 & $\mathbf{o}_1$ & $\mathbf{o}_2$ & $\mathbf{o}_3$ & $\mathbf{o}_4$ & $\mathbf{o}_5$ \\
$p$ & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
 & $\mathbf{o}_6$ & $\mathbf{o}_7$ & $\mathbf{o}_8$ & $\mathbf{o}_9$ & $\mathbf{o}_{10}$ \\
$p$ & 0.934 & 0.588 & 0.651 & 0.913 & 0.395 \\
\end{tabular}
We next consider designs consisting of residualized returns from 493 stocks retrieved from the Polygon API [45]. There is one row per minute, and each contains the residualized return for each stock over the last 30 minutes. Since they share 29 minutes of returns, consecutive rows are heavily correlated. The regression corresponds to the known practice of replicating the residualized returns of an unknown asset \(\mathbf{y}\) using a portfolio of stocks \(\hat{\boldsymbol{\beta}}\) (with returns \(\mathbf{X}\hat{\boldsymbol{\beta}}\)). The diagnostic plots in this case are tame, and we require no coupling; these are deferred to Appendix 9. The loss curves produced by ROTI-GCV better capture the true loss curve, and the estimated risk values (ER) provide superior estimates of the true out-of-sample risk compared to LOOCV and GCV.
Takeaways: We give a precise analysis of ridge regression for right-rotationally invariant designs. These designs depart crucially from the existing framework of i.i.d. samples from distributions with light enough tails, instead allowing for structured dependence and heavy tails. We then characterize GCV in this setting, finding that it is inconsistent, and introduce ROTI-GCV, a consistent alternative; various synthetic and semi-real experiments illustrate the accuracy of our method. As a byproduct of our analysis, we produce new estimators for the signal-to-noise ratio (\(r^2\)) and noise variance (\(\sigma^2\)) that are robust to these deviations from the i.i.d. assumptions. At the core, our method relies on noticing that the relation given by Lemma 1 yields a different estimating equation for every choice of \(\lambda\); using this to generate a series of estimating equations, we obtain stable estimators of \(r^2, \sigma^2\). This core idea applies whenever an analogue of Lemma 1 can be derived, giving it reach beyond ridge regression. As future steps, extensions of our method to other settings such as generalized linear models, kernel ridge regression, and regularized random features regression would be important directions. Furthermore, recent advances in [21], [22] show that right-rotationally invariant distributions have a broad universality class – in fact, as long as the singular vectors of a design matrix satisfy a certain generic structure, properties of learning algorithms under such settings are well captured by results derived under the right-rotationally invariant assumption. In particular, the universality of the in-sample risk of ridge regression is already established in these works, and we believe such universality also holds for the out-of-sample risk; thus, our analysis should extend to the entire universality class of these designs, which is remarkably broad. This promises that our approach can have wide reach, allowing reliable hyperparameter tuning for a broad class of covariate distributions.
Limitations: The well-specification assumption ([item:scaling-matrix]) is crucial and seems more difficult to relax for our designs than in the i.i.d. case due to the global dependence structure. Right-rotationally invariant designs have difficulty modeling distributions with feature covariances that are much more complicated than those covered in Section 5, especially when the spectrum is not dominated by a few outlying eigenvalues. Finding ways to account for different types of feature covariance in data using these designs would naturally increase the broader applicability of our method, such as by modeling the design as \(\mathbf{X} = \mathbf{Q}^\top \mathbf{D} \mathbf{O} \boldsymbol{\Sigma}^{1/2}\) and extending our analysis. Finally, as mentioned in Section 1, our results focus on ridge regression, though we believe similar results for other penalties can be obtained.
Lemma 3. Let \(\mathbf{X} = \mathbf{Q}^\top \mathbf{DO}\) be right-rotationally invariant. Then \(\mathbb{E}[\mathbf{X}^\top \mathbf{X}] = \mathsf{Tr}(\mathbb{E}[\mathbf{D}^\top \mathbf{D}])/p \cdot \mathbf{I}_p\).
Proof. Let \(\mathbf{O}\) have rows \(\mathbf{o}_i\). Then \[\begin{align} \mathbb{E}[\mathbf{X}^\top \mathbf{X}] &= \mathbb{E}[\mathbf{O}^\top \mathbf{D}^\top \mathbf{D}\mathbf{O}] = \sum_{i = 1} ^{n \wedge p} \mathbb{E}[ D_{ii}^2 \mathbf{o}_i \mathbf{o}_i^\top] = \sum_{i = 1} ^{n \wedge p} \mathbb{E}[ D_{ii}^2 ]\mathbb{E}[\mathbf{o}_i \mathbf{o}_i^\top] \\ &\stackrel{(*)}{=} \frac{1}{p} \mathbf{I}_p \sum_{i = 1} ^{n \wedge p} \mathbb{E}[ D_{ii}^2 ] = \mathsf{Tr}(\mathbb{E}[\mathbf{D}^\top \mathbf{D}])/p \cdot \mathbf{I}_p \end{align}\] where \((*)\) follows from \(\mathbf{I}_p = \mathbb{E}[\mathbf{O}^\top \mathbf{O}] = \sum_{i = 1} ^p \mathbb{E}[\mathbf{o}_i\mathbf{o}_i^\top]\) and the exchangeability of the \(\mathbf{o}_i\) (which is derived from the fact that \(\mathbb{O}(p)\) contains the permutation matrices). ◻
Lemma 4 (Hanson-Wright for Spherical Vectors). Let \(\mathbf{v} \in \mathbb{R}^{n}\) be a random vector distributed uniformly on \(S^{n - 1}\) and let \(\mathbf{A} \in \mathbb{R}^{n \times n}\) be a fixed matrix. Then \[\begin{align} \mathbb{P}(\left|\mathbf{v}^\top \mathbf{A} \mathbf{v} - \mathbb{E}[\mathbf{v}^\top \mathbf{A} \mathbf{v}]\right| \geq t) \leq 2\exp\left(-C \min\left(\frac{n^2t^2}{2K^4\|\mathbf{A}\|_F^2}, \frac{nt}{K^2\|\mathbf{A}\|}\right)\right). \end{align}\] for some absolute constants \(C, K\).
Proof. Isoperimetric inequalities imply that vectors uniformly distributed on \(S^{n - 1}\) have sub-Gaussian tails around their median (see [46]). Explicitly, for \(\mathbf{v} \sim \mathrm{Unif}(S^{n - 1})\) and any \(1\)-Lipschitz function \(f: \mathbb{R}^n \to \mathbb{R}\), one has \[\mathbb{P}(|f(\mathbf{v}) - M| \geq t) \leq 2e^{- nt^2 / 2},\] where \(M\) denotes the median of \(f(\mathbf{v})\) (i.e. \(\min(\mathbb{P}(f(\mathbf{v}) \geq M), \mathbb{P}(f(\mathbf{v}) \leq M)) \geq \frac{1}{2}\)). We now use [47], which shows that having sub-Gaussian tails with respect to the median is equivalent to having sub-Gaussian tails with respect to the mean, with different constants. Thus, it holds that for any \(1\)-Lipschitz function \(f\), one has \[\mathbb{P}(|f(\mathbf{v}) - \mathbb{E}f(\mathbf{v})| \geq t) \leq 4e^{- nt^2 / 32}.\] Note that this bound is vacuous whenever \(4e^{-nt^2/32} > 1\), and hence we can rewrite this as \[\mathbb{P}(|f(\mathbf{v}) - \mathbb{E}f(\mathbf{v})| \geq t) \leq \min(4e^{- nt^2 / 32}, 1)\] where we now note that for sufficiently large \(K\), one has \(\min(4e^{- nt^2 / 32}, 1) \leq 2e^{-nt^2/K}\) for any \(n \geq 1\) and \(t \geq 0\). Hence, one has \[\label{eq:subg-mean} \mathbb{P}(|f(\mathbf{v}) - \mathbb{E}f(\mathbf{v})| \geq t) \leq 2e^{-nt^2/K}.\tag{6}\] Finally, [48] implies that random variables satisfying such types of concentration inequalities must in turn satisfy a Hanson-Wright type inequality. That is, Eq. 6 implies that, for a fixed matrix \(\mathbf{A} \in \mathbb{R}^{n \times n}\), one has \[\mathbb{P}(\left|\mathbf{v}^\top \mathbf{A} \mathbf{v} - \mathbb{E}[\mathbf{v}^\top \mathbf{A} \mathbf{v}]\right| \geq t) \leq 2\exp\left(-C \min\left(\frac{n^2t^2}{2K^4\|\mathbf{A}\|_F^2}, \frac{nt}{K^2\|\mathbf{A}\|}\right)\right)\] as desired. ◻
Lemma 5 (Expectations of Quadratic Forms). Let \(\mathbf{v}\) be uniformly distributed on \(S^{n - 1}\), and let \(\mathbf{A}\) be an \(n \times n\) matrix. Then \[\mathbb{E}[\mathbf{v}^\top \mathbf{A} \mathbf{v}] = \mathsf{Tr}(\mathbf{A} \mathbb{E}[\mathbf{v}\mathbf{v}^\top]) = \frac{\mathsf{Tr}(\mathbf{A})}{n}\]
Proof. The first equality follows from the cyclic property of trace. The second can be seen to follow from the observation used in the proof of Lemma 3 that \(\mathbb{E}[\mathbf{o}_i \mathbf{o}_i^\top] = \frac{1}{p} \mathbf{I}_p\) and the fact that columns of a Haar distributed matrix are uniform on the sphere (see e.g. [49]). Alternatively, recall that for \(\mathbf{g} \sim \mathsf{N}(0, \mathbf{I}_n)\), \(\frac{\mathbf{g}}{\|\mathbf{g}\|} \perp\!\!\!\perp\|\mathbf{g}\|\) from [50]. Hence \[\mathbb{E}[\mathbf{v}\mathbf{v}^\top] = \mathbb{E}[\mathbf{g}\mathbf{g}^\top / \|\mathbf{g}\|^2] = \mathbb{E}[\mathbf{g}\mathbf{g}^\top] / \mathbb{E}[\|\mathbf{g}\|^2] = \mathbf{I}_n/n.\] ◻
Lemma 6. Let \(\mathbf{v}_n\) be a sequence of random vectors, where \(\mathbf{v}_n\) is uniformly distributed on \(S^{n - 1}\). Let \(\mathbf{u}_n\) be another sequence of random vectors independent of \(\mathbf{v}_n\), where \(\limsup \|\mathbf{u}_n\| < C\) for some constant \(C\) almost surely. Then \(\mathbf{v}_n^\top \mathbf{u}_n \xrightarrow{a.s.} 0\).
Proof. We replace \(\mathbf{v}_n\) with \(\frac{\mathbf{g}}{\|\mathbf{g}\|}\) where \(\mathbf{g} \sim \mathsf{N}(0, \mathbf{I}_n/n)\). Then conditional on \(\mathbf{u}_n\), we have \(\mathbf{g}^\top \mathbf{u}_n \sim \mathsf N(0, \|\mathbf{u}_n\|^2/n)\). We then use Mill’s inequality to show \(\mathbf{g}^\top \mathbf{u}_n \xrightarrow{a.s.} 0\): \[\begin{align} \mathbb{P}\left(|\mathbf{g}^\top \mathbf{u}_n| > t \mid \mathbf{u}_n \right) &\leq \sqrt{\frac{2}{\pi}} \frac{\|\mathbf{u}_n\|}{t \cdot \sqrt{n}} \exp\left(-t^2/2 \cdot \frac{n}{\|\mathbf{u}_n\|^2}\right). \end{align}\] Since \(\limsup_{n \to \infty} \|\mathbf{u}_n\|\) is a.s. bounded, the bound above is summable in \(n\), so by Borel-Cantelli, the convergence is almost sure. An application of Slutsky finishes, since we know \(\|\mathbf{g}\| \xrightarrow{a.s.} 1\) by the strong Law of Large Numbers, and hence \(\mathbf{g}^\top \mathbf{u}_n / \|\mathbf{g}\| \to 0\). ◻
First recall the statement of Theorem 1: Under the stated assumptions, \[\mathsf{GCV}_n(\lambda) - \frac{r^2 (v_{\mathbf{D}}(-\lambda) - \lambda v'_{\mathbf{D}}(-\lambda)) + \sigma^2\gamma v'_{\mathbf{D}}(-\lambda)}{\gamma v_{\mathbf{D}}(-\lambda)^2} \xrightarrow{a.s.} 0.\]
We analyze the numerator and denominator separately. The latter is simple: \[\begin{align} (1 - \mathsf{Tr}(\mathbf{S}_{\lambda})/n)^2 &= \left(1 - \mathsf{Tr}((\mathbf{D}^\top \mathbf{D} + \lambda \mathbf{I})^{-1}\mathbf{D}^\top\mathbf{D})/n\right)^2 = \left(\frac{1}{n}\sum_{i = 1} ^n \frac{\lambda}{\lambda + D_{ii}^2}\right)^2 = \lambda^2 v_{\mathbf{D}}(-\lambda)^2 \end{align}\] The numerator only requires slightly more work. Recall \(\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\). Furthermore, note that the GCV numerator can be rewritten as below: \[\begin{align} \frac{1}{n}\mathbf{y}^\top (\mathbf{I}- \mathbf{S}_{\lambda})^2 \mathbf{y} = \underbrace{\frac{1}{n} \boldsymbol{\beta}^\top \mathbf{X}^\top (\mathbf{I} - \mathbf{S}_{\lambda})^2 \mathbf{X}\boldsymbol{\beta}}_{T_1} + 2 \underbrace{\frac{1}{n} \boldsymbol{\epsilon}^\top (\mathbf{I}- \mathbf{S}_{\lambda})^2 \mathbf{X}\boldsymbol{\beta}}_{T_2} + \underbrace{\frac{1}{n}\boldsymbol{\epsilon}^\top (\mathbf{I}- \mathbf{S}_{\lambda})^2 \boldsymbol{\epsilon}}_{T_3}. \end{align}\] We handle each term. First, \(T_1\) simplifies easily into a quadratic form involving the uniformly random vector \(\mathbf{O}\boldsymbol{\beta}\) which is independent of \(\mathbf{D}\). This will later allow us to apply Lemma 4: \[\begin{align} T_1 &= \frac{1}{n} (\mathbf{O}\boldsymbol{\beta})^\top \mathbf{D}^\top (\mathbf{I}- \mathbf{D}(\mathbf{D}^\top \mathbf{D} + \lambda\mathbf{I})^{-1} \mathbf{D}^\top)^2 \mathbf{D} (\mathbf{O}\boldsymbol{\beta}) \\ \intertext{The expectation of this quantity is as follows:} \mathbb{E}[T_1] &= \frac{\|\boldsymbol{\beta}\|^2}{n} \frac{\mathsf{Tr}(\mathbf{D}^\top (\mathbf{I} - \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda \mathbf{I})^{-1} \mathbf{D}^\top)^2\mathbf{D})}{p} \\ &= \frac{\|\boldsymbol{\beta}\|^2}{n} \frac{1}{p} \sum_{i = 1}^{p} D_{ii}^2\left(1 - \frac{D_{ii}^2}{D_{ii}^2 + \lambda}\right)^2 \\ &= \frac{\|\boldsymbol{\beta}\|^2}{n} \frac{\lambda^2}{p} \sum_{i = 1} ^p \frac{D_{ii}^2}{(D_{ii}^2 + \lambda)^2} \\ &= \frac{\|\boldsymbol{\beta}\|^2}{n} \lambda^2 \left(m_{\mathbf{D}}(-\lambda) - \lambda m'_{\mathbf{D}}(-\lambda)\right) = \frac{\|\boldsymbol{\beta}\|^2}{n} \frac{\lambda^2}{\gamma} \left(v_{\mathbf{D}}(-\lambda) - \lambda v'_{\mathbf{D}}(-\lambda)\right) \end{align}\] Now, letting \(\mathbf{P} = \mathbf{D}^\top (\mathbf{I} - \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda \mathbf{I})^{-1} \mathbf{D}^\top)^2\mathbf{D}\), Hanson-Wright (Lemma 4) implies \[\begin{align} \mathbb{P}(|T_1 - \mathbb{E}[T_1]| > t) \leq 2 \exp \left(- c \min\left(\frac{p^2 t^2}{2K^4 \|\mathbf{P}\|_F^2}, \frac{pt}{K^2 \| \mathbf{P} \|} \right)\right) = 2 \exp \left(-c\, p \min\left(\frac{t^2}{2K^4 \frac{1}{p} \|\mathbf{P}\|_F^2}, \frac{t}{K^2 \| \mathbf{P} \|} \right)\right) \end{align}\] Note that \[\begin{align} \mathbf{P} = \mathrm{diag} \left( \frac{\lambda^2 D_{ii}^2}{(D_{ii}^2 + \lambda)^2} \right)_{i = 1} ^p \end{align}\] Hence \[\begin{align} \frac{1}{p} \|\mathbf{P}\|_F^2 &= \frac{1}{p} \sum_{i = 1} ^p \left(\frac{\lambda^2 D_{ii}^2}{(D_{ii}^2 + \lambda)^2}\right)^2 = \frac{1}{p} \sum_{i = 1} ^p \left(\frac{\lambda^2}{(D_{ii}^2 + \lambda)^2}\right)^2 D_{ii}^4 \leq \|\mathbf{D}\|_{\mathsf{op}}^4 \\ \|\mathbf{P}\|_{\mathsf{op}} &= \max_{i} \frac{\lambda^2 D_{ii}^2}{(D_{ii}^2 + \lambda)^2} \leq \|\mathbf{D}\|_{\mathsf{op}}^2 \end{align}\] By assumption, both terms are almost surely bounded in the limit.
Recalling that \(p(n) / n \to \gamma\), the above bound is summable in \(n\) and hence Borel-Cantelli implies almost sure convergence.
To handle the second term, we use a slightly different method. \[\begin{align} T_2 &= \frac{1}{n} {\boldsymbol{\epsilon}}^\top (\mathbf{I} - \mathbf{S}_{\lambda})^2 \mathbf{X}\boldsymbol{\beta} \\ &= \frac{1}{n} (\mathbf{O} \boldsymbol{\beta})^\top \underbrace{\mathbf{D}^\top \mathbf{Q} (\mathbf{I} - \mathbf{S}_{\lambda})^2 {\boldsymbol{\epsilon}}}_{\mathbf{T}_4} \end{align}\] Note that \(\tilde{\boldsymbol{\beta}} = \mathbf{O} \boldsymbol{\beta}\) is a uniformly random vector of norm \(\|\boldsymbol{\beta}\|\). We replace this with \(\|\boldsymbol{\beta}\| \frac{\mathbf{g}}{\|\mathbf{g}\|}\) where \(\mathbf{g} \sim \mathsf{N}(0, \mathbf{I}_p/p)\). Then conditional on \(\mathbf{D},\boldsymbol{\epsilon}\), we have \(\frac{1}{\sqrt{n}} \mathbf{g}^\top \mathbf{T}_4 \sim \mathsf{N}(0, \|\mathbf{T}_4\|^2/(pn))\). Furthermore note \(\|\mathbf{T}_4\| \leq \|\mathbf{D}^\top\| \|\mathbf{Q}\| \|\mathbf{I}- \mathbf{S}_{\lambda}\|^2 \|\boldsymbol{\epsilon}\| \leq \|\mathbf{D}^\top\|\|\boldsymbol{\epsilon}\|\). Standard Gaussian tail bounds then complete the argument. \[\begin{align} \frac{1}{n} \|\boldsymbol{\beta}\| \frac{\mathbf{g}^\top}{\|\mathbf{g}\|} \mathbf{T}_4 &= \frac{\|\boldsymbol{\beta}\|}{\sqrt{n}} \frac{1}{\|\mathbf{g}\|} \left(\frac{1}{\sqrt{n}} \mathbf{g}^\top \mathbf{T}_4\right) = r \cdot \frac{1}{\|\mathbf{g}\|} \frac{1}{\sqrt{n}} \mathbf{g}^\top \mathbf{T}_4 \end{align}\] Hence \[\begin{align} \mathbb{P}\left(\left|\frac{1}{\sqrt{n}} \mathbf{g}^\top \mathbf{T}_4\right| > t \mid \mathbf{D}, \boldsymbol{\epsilon}\right) &\leq \sqrt{\frac{2}{\pi}} \frac{\|\mathbf{T}_4\|}{t \cdot \sqrt{pn}} \exp\left(-\frac{t^2}{2} \cdot \frac{pn}{\|\mathbf{T}_4\|^2}\right). \end{align}\] We claim that \(\limsup_{n \to \infty} \|\mathbf{T}_4\|/\sqrt{n}\) is bounded. This follows from the bound above, the law of large numbers applied to \(\boldsymbol{\epsilon}\), and the assumption on \(\mathbf{D}\). Hence the bound above is summable in \(n\) once more, so by Borel-Cantelli, the convergence is almost sure. An application of Slutsky's theorem finishes the argument, since \(\|\mathbf{g}\| \xrightarrow{a.s.} 1\) by the strong law of large numbers. The convergence for \(T_3\) follows directly from [1].
Proof of Theorem 2. In order to understand how one should tune the regularization parameter in this problem, we must first understand the out-of-sample risk for a given value of \(\lambda\). The risk admits a decomposition into three terms: \[\begin{align} \frac{1}{n'} \mathbb{E}\left[\left\|\widetilde{\mathbf{X}}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})\right\|_2^2 \mid \mathbf{X}, \mathbf{y} \right] &= \frac{1}{n'} \cdot \frac{\mathbb{E}[\mathsf{Tr}(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}})]}{p} \left\|\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\right\|_2^2 \\ &= \frac{\mathbb{E}[\mathsf{Tr}(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}})]}{n' p} \|(\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X})\boldsymbol{\beta} - (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \boldsymbol{\epsilon}\|_2^2 \\ &= \frac{n\mathbb{E}[\mathsf{Tr}(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}})]}{n' p}\left[\underbrace{ \frac{1}{n} \boldsymbol{\beta}^\top (\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X})^2 \boldsymbol{\beta}}_{T_1} \right.\\ &\qquad \left.+ \underbrace{\frac{1}{n} \boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-2} \mathbf{X}^\top \boldsymbol{\epsilon}}_{T_2} - \underbrace{\frac{2}{n} \boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-2} \mathbf{X}^\top \mathbf{X}\boldsymbol{\beta}}_{T_3} \right] \end{align}\]
Remark 8. The normalization assumption (Assumption [item:scaling]) ensures that the prefactor is finite. Without further loss of generality, we can assume it is 1. In fact it is sufficient for all proofs to simply assume that \(\limsup \frac{n\mathbb{E}[\mathsf{Tr}(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}})]}{n' p} \leq C\), but assuming it is constant simplifies notation.
For intuition as to why this prefactor is bounded, note that \(\mathsf{Tr}(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}}) = \mathsf{Tr}(\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}})\), the trace of the Gram matrix of \(n'\) samples, each normalized by \(\sqrt{n}\) (the reason for this normalization is given in Remark 2: since \(\boldsymbol{\beta}\) is scaled up by \(\sqrt{n}\), the data must be scaled down by \(\sqrt{n}\)). Hence we expect this trace to be of order \(n'p/n\), which exactly cancels the prefactor.
The first term corresponds to bias, the second to the variance, and the third is negligible. For the first term, \[\begin{align} T_1 &= n^{-1}\boldsymbol{\beta}^\top(\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X})^2\boldsymbol{\beta} \\ &= n^{-1}(\mathbf{O}\boldsymbol{\beta})^{\top} \left[\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top \mathbf{D}\right]^{2} (\mathbf{O}\boldsymbol{\beta}) \\ &= n^{-1}(\mathbf{O}\boldsymbol{\beta})^{\top} \left( \lambda (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \right)^{2} (\mathbf{O}\boldsymbol{\beta}) \end{align}\] Again let \(\mathbf{b} = (\mathbf{O}\boldsymbol{\beta})/\|\boldsymbol{\beta}\|\) and \(\mathbf{P} = \left( \lambda (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \right)^{2}\), and note \(\mathsf{Tr}(\mathbf{P}) = \sum_{i = 1} ^p \frac{\lambda^2}{(D_{ii}^2 + \lambda)^2}\) and \(\|\mathbf{P}\|_F^2 = \sum_{i = 1} ^p \frac{\lambda^4}{(D_{ii}^2 + \lambda)^4}\). Furthermore, one has \(\mathbb{E}[\mathbf{b}^\top \mathbf{P} \mathbf{b}] = \frac{1}{p} \mathsf{Tr}(\mathbf{P})\) using 5, and clearly \(\|\mathbf{P}\| \leq 1\). We now apply Lemma 4 to show concentration: \[\begin{align} \mathbb{P}\left(\left|\mathbf{b}^\top \mathbf{P} \mathbf{b} - \frac{1}{p} \mathsf{Tr}(\mathbf{P}) \right| \geq t\right) &\leq 2\exp\left(-C \min\left(\frac{p^2 t^2}{2K^4 \|\mathbf{P}\|^2_F}, \frac{pt}{K^2}\right)\right) \\ &= 2\exp\left(-C p\min\left(\frac{t^2}{2K^4 \tfrac{1}{p} \|\mathbf{P}\|^2_F}, \frac{t}{K^2}\right)\right) \end{align}\] Note that \(\frac{1}{p} \|\mathbf{P}\|^2_F\) is always bounded above by \(1\), so this bound cannot be made vacuous by any particular realization of \(\mathbf{D}\). This bound is summable in \(n\), and thus the difference converges almost surely to zero.
This yields pointwise convergence for the bias. To strengthen this to uniform convergence, we show that the derivative of the difference \(\mathbf{b}^\top \mathbf{P} \mathbf{b} - \frac{1}{p} \mathsf{Tr}(\mathbf{P})\) is almost surely bounded on \([\lambda_1, \lambda_2]\). Writing \(b_i = \mathbf{o}_i^\top\boldsymbol{\beta}/\|\boldsymbol{\beta}\|\), so that \(\sum_{i=1}^p b_i^2 = 1\), \[\begin{align} \left|\frac{\mathrm{d}}{\mathrm{d}\lambda}\mathbf{b}^\top \mathbf{Pb}\right| = \left|\frac{\mathrm{d}}{\mathrm{d}\lambda}\left[\sum_{i = 1} ^p b_i^2 \frac{\lambda^2}{(D_{ii}^2 + \lambda)^2} \right]\right| &\leq \sum_{i = 1} ^p b_i^2 \frac{2\lambda (D_{ii}^2 + \lambda)^2 + 2(D_{ii}^2 + \lambda)\lambda^2}{(D_{ii}^2 + \lambda)^4} \\ &\leq \frac{2\lambda (D_{11}^2 + \lambda)^2 + 2(D_{11}^2 + \lambda)\lambda^2}{\lambda^4} \sum_{i = 1} ^p b_i^2 \\ &= \frac{2\lambda (D_{11}^2 + \lambda)^2 + 2(D_{11}^2 + \lambda)\lambda^2}{\lambda^4} \end{align}\] which is indeed almost surely bounded on \([\lambda_1, \lambda_2]\) in the limit by the assumption on \(\|\mathbf{D}\|\). Similarly, \(\frac{\mathrm{d}}{\mathrm{d}\lambda} (\frac{1}{p} \mathsf{Tr}(\mathbf{P}))\) is also bounded on this interval by the same quantity, so the difference is Lipschitz. Hence one can discretize the interval \([\lambda_1, \lambda_2]\) into a sufficiently fine grid: pointwise convergence holds at every grid point, and the Lipschitz condition then ensures that the convergence is uniform.
For the third term \(T_3\), we rewrite it as \[\begin{align} 2(\boldsymbol{\epsilon}/\sqrt{n})^\top \mathbf{Q}^\top\mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-2} \mathbf{D}^\top \mathbf{D}\cdot (\mathbf{O}\boldsymbol{\beta}/\sqrt{n}) \end{align}\] and then apply Lemma 6 to see that it converges almost surely to zero. We can likewise control the derivative uniformly over the interval and thereby obtain uniform almost-sure convergence.
Lastly, for the second term \(T_2\), we first compute \[\mathbb{E}\left[\frac{1}{n} \boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-2} \mathbf{X}^\top \boldsymbol{\epsilon}\right] = \frac{\sigma^2}{n} \sum_{i = 1} ^{n \wedge p} \frac{D_{ii}^2}{(D_{ii}^2 + \lambda)^{2}}.\] Then by [1], this term satisfies \[\mathbb{E}\left[\frac{1}{n} \boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-2} \mathbf{X}^\top \boldsymbol{\epsilon}\right] - \frac{1}{n} \boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-2} \mathbf{X}^\top \boldsymbol{\epsilon} \xrightarrow{a.s.} 0.\] The derivative is again uniformly controlled, producing uniform convergence once more. ◻
Proof. Under the assumptions of Theorem 1, for any consistent estimators \((\hat{r}^2, \hat{\sigma}^2)\) of \((r^2, \sigma^2)\), respectively, one has, for any \(\lambda_0 > 0\), \[\label{eq:plugin-unif} \sup_{\lambda > \lambda_0}\left|(\hat{r}^2 - r^2) \left(\frac{\lambda^2}{\gamma} v'_{\mathbf{D}}(-\lambda) + \frac{\gamma - 1}{\gamma}\right) + (\hat{\sigma}^2 - \sigma^2) ( v_{\mathbf{D}}(-\lambda) - \lambda v_{\mathbf{D}}'(-\lambda))\right| \xrightarrow{p} 0.\tag{7}\] If \(\hat{r}^2, \hat{\sigma}^2\) are strongly consistent, then naturally the convergence is almost sure. This is immediate from the fact that \[\begin{align} \lambda^2 v'_{\mathbf{D}} (-\lambda) = \frac{1}{n} \sum_{i = 1} ^n \frac{\lambda^2}{(D_{ii}^2 + \lambda)^2} &\leq 1 \\ v_{\mathbf{D}}(-\lambda) - \lambda v'_{\mathbf{D}}(-\lambda) = \frac{1}{n} \sum_{i = 1} ^{n \wedge p} \frac{D_{ii}^2}{(D_{ii}^2 + \lambda)^2} &\leq \frac{1}{4\lambda}. \end{align}\] Combining Eq. 7 with Theorem 2 then completes the corollary. ◻
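For concreteness, in the simpler (uncoupled, unaligned) setting this plug-in risk estimate depends on the data only through the training singular values and \((\hat{r}^2, \hat{\sigma}^2)\); per Remark 8, the test-set prefactor is taken to be 1. Below is a minimal sketch of evaluating it, written directly from the two sums displayed above; the function name and NumPy implementation are ours, not the paper's code.

```python
import numpy as np

def roti_gcv_risk_estimate(svals, n, p, lam, r2_hat, sigma2_hat):
    """Plug-in out-of-sample risk estimate
        r2_hat * (lam^2 v'_D(-lam) / gamma + (gamma - 1) / gamma)
      + sigma2_hat * (v_D(-lam) - lam * v'_D(-lam)),
    using the explicit sums displayed above, with D_ii = 0 for the
    n - min(n, p) missing singular values."""
    d2 = np.asarray(svals, dtype=float) ** 2      # min(n, p) squared singular values
    gamma = p / n
    # lam^2 v'_D(-lam) = (1/n) sum_{i=1}^n lam^2 / (D_ii^2 + lam)^2
    lam2_vprime = (np.sum(lam**2 / (d2 + lam) ** 2) + max(n - d2.size, 0)) / n
    # v_D(-lam) - lam v'_D(-lam) = (1/n) sum_{i <= n ^ p} D_ii^2 / (D_ii^2 + lam)^2
    v_minus_lam_vprime = np.sum(d2 / (d2 + lam) ** 2) / n
    return r2_hat * (lam2_vprime / gamma + (gamma - 1) / gamma) \
        + sigma2_hat * v_minus_lam_vprime
```

Minimizing this curve over a grid of \(\lambda\) is what produces the tuned regularization parameter in the experiments.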
This was proven in the course of the proof of Theorem 1, since the training error is exactly the numerator of the GCV objective.
The exact nondegeneracy condition we require consists of two parts:
Define the events \(A_n = \bigcup_{i = 1} ^L \{a_i = 0\}\) and \(B_n = \bigcup_{i = 1} ^L \{b_i = 0\}\). Then \(\mathbb{1}(A_n \cup B_n) \xrightarrow{a.s.} 0\).
There exists \(i_* \in \{2, \dots, L\}\) such that \(\liminf |a_{i_*} b_1 - a_1 b_{i_*}| > C\) almost surely for some \(C > 0\).
The need for the first condition is clear: our estimator would not even be defined if the entire spectrum were zero; in fact, the problem would be impossible, as the entire data matrix would be zero. The second is more complex, and we give a detailed discussion on its necessity before giving an explicit proof.
Note furthermore that each of these quantities is observable when running the procedure, which gives the user a direct way to check whether this assumption is plausibly satisfied in practice.
As motivation, we first discuss the simpler case when \(L = 2\), as this has connections to the assumptions required for consistent estimation in [31]. In this setting, the estimator for \(\sigma^2\) becomes \[\begin{align} \frac{\left( \frac{t_2}{a_2} - \frac{t_1}{a_1} \right) \left( \frac{b_2}{a_2} - \frac{b_1}{a_1}\right)}{\left( \frac{b_2}{a_2} - \frac{b_1}{a_1} \right)^2} &= \frac{\left( \left(\frac{b_2}{a_2} - \frac{b_1}{a_1}\right)\sigma^2 + o(1) \cdot \left(\frac{1}{a_2} - \frac{1}{a_1} \right) \right) \left( \frac{b_2}{a_2} - \frac{b_1}{a_1}\right)}{\left( \frac{b_2}{a_2} - \frac{b_1}{a_1} \right)^2} \\ &= \sigma^2 + o(1)\frac{a_1 - a_2}{a_1b_2 - b_1a_2}. \end{align}\] where the first line comes from Lemma 1. Similarly, the estimator for \(r^2\) is then \(r^2 + o(1) \frac{b_2 - b_1}{a_1 b_2 - b_1 a_2}\). From this and the fact that the \(a_i\) and \(b_i\) are all bounded above, we see that this error term is asymptotically negligible provided \(a_1b_2 - b_1a_2\) is bounded away from zero in the limit. In the limiting case where \(\lambda_2 = \infty\), one has \(a_2 = \frac{1}{p} \sum_{i = 1} ^p D_{ii}^2\) and \(b_2 = 1\), meaning we require \[\label{eq:nondegen-1} \liminf \left| a_1 b_2 - a_2 b_1 \right| = \liminf \left| \frac{1}{p} \sum_{i = 1} ^p \frac{\lambda^2 D_{ii}^2}{(D_{ii}^2 + \lambda)^2} - \left(\frac{1}{p} \sum_{i = 1} ^p D_{ii}^2\right) \left(\frac{1}{n} \sum_{i = 1} ^n \frac{\lambda^2}{(D_{ii}^2 + \lambda)^2} \right) \right| > c > 0\tag{8}\] almost surely for some \(c\). This is exactly the second part of our nondegeneracy assumption. Furthermore, this is the exact analog of the corresponding condition in [31] in our setting. The above argument thus establishes that consistent estimation is possible if one assumes Eq. 8.
We do not believe this condition to be simply technical. Consider the setting where \(n \leq p\) and every \(D_{ii} = 1\) for \(i \leq n\) (meaning \(D_{ii} = 0\) for \(i > n\)). Let \(\tilde{\boldsymbol{\beta}} = \mathbf{O}{\boldsymbol{\beta}}\). Then one is given \(\mathbf{y} = \mathbf{Q}^\top\tilde{\boldsymbol{\beta}}_{[1:n]} + \boldsymbol{\epsilon}\) and the goal is to infer \(r^2 = \|\boldsymbol{\beta}\|_2^2/n = \|\tilde{\boldsymbol{\beta}}\|_2^2/n\) and \(\sigma^2\). Every entry \(y_i\) is now \(\mathbf{q}_i^\top \tilde{\boldsymbol{\beta}}_{[1:n]} + \epsilon_i\); the first term has mean zero and variance approximately \(r^2\), and the second has mean zero and variance \(\sigma^2\), and it essentially becomes impossible to decouple the effects of \(r^2\) and \(\sigma^2\). In this setting, the term above is exactly zero: indeed, here \(a(\lambda) = b(\lambda)/\gamma\) for every \(\lambda\), so the estimating equations are exactly proportional and \(a_i b_j - a_j b_i = 0\) identically. That said, we believe this issue should not affect practical use cases; the term we require to be bounded away from zero can be computed by the practitioner directly, and while it is impossible to check whether a quantity is bounded away from zero in the limit, one can observe its magnitude.
Having examined a specific case of our method that yields consistent estimation under an additional assumption already used in the literature, we now turn to the general setting. Note that in the general setting the condition is weaker: we only require that \(\liminf |a_1 b_i - a_i b_1| > C > 0\) holds for one of the \(i\), so one loses nothing by using multiple \(\lambda_i\); in practice we observe that this improves stability.
Here the estimators take on the following forms: \[\begin{align} \hat{\sigma}^2 &= \sigma^2 + o(1) \frac{\sum_{i = 1} ^L (a_i^{-1} - a_1^{-1})(b_ia_i^{-1} - b_1a_1^{-1})}{\sum_{i = 1} ^L (b_ia_i^{-1} - b_1a_1^{-1})^2} \\ \hat{r}^2 &= r^2 + o(1) \frac{\sum_{i = 1} ^L (b_i^{-1} - b_1^{-1})(a_ib_i^{-1} - a_1b_1^{-1})}{\sum_{i = 1} ^L (a_ib_i^{-1} - a_1b_1^{-1})^2}. \end{align}\] It thus suffices to show that the coefficients of these error terms are uniformly bounded, which yields that our estimators are consistent; that is, it suffices to show \[\label{eq:estimator-error-coef} \limsup \max \left(\left|\frac{\sum_{i = 1} ^L (a_i^{-1} - a_1^{-1})(b_ia_i^{-1} - b_1a_1^{-1})}{\sum_{i = 1} ^L (b_ia_i^{-1} - b_1a_1^{-1})^2}\right|, \left| \frac{\sum_{i = 1} ^L (b_i^{-1} - b_1^{-1})(a_ib_i^{-1} - a_1b_1^{-1})}{\sum_{i = 1} ^L (a_ib_i^{-1} - a_1b_1^{-1})^2} \right| \right) < C\tag{9}\] almost surely for some \(C\).
Now, note that \[\begin{align} a(\lambda) &= \frac{\lambda^2}{\gamma} \left(v_{\mathbf{D}}(-\lambda) - \lambda v'_{\mathbf{D}}(-\lambda)\right) = \frac{1}{p} \sum_{i = 1} ^p D_{ii}^2 \left(1 - \frac{D_{ii}^2}{D_{ii}^2 + \lambda} \right)^2 \\ b(\lambda) &= \lambda^2 v'_{\mathbf{D}}(-\lambda) = \frac{1}{n} \sum_{i = 1} ^n \frac{\lambda^2}{(D_{ii}^2 + \lambda)^2} \end{align}\] are both increasing in \(\lambda\), and thus \(a_1\) is the smallest of the \(a_i\); likewise for \(b_1\). Furthermore, note that the \(a_i\) are bounded above by \(\|\mathbf{D}\|_{\mathsf{op}}^2\), which is almost surely bounded by assumption; the \(b_i\) are bounded above by \(1\).
We then calculate as follows: \[\begin{align} \frac{\left| \sum_{i = 1} ^L (a_i^{-1} - a_1^{-1})(b_ia_i^{-1} - b_1a_1^{-1}) \right| }{\sum_{i = 1} ^L (b_ia_i^{-1} - b_1a_1^{-1})^2} &\leq \frac{\left| \sum_{i = 1} ^L (a_i^{-1} - a_1^{-1})(b_ia_i^{-1} - b_1a_1^{-1}) \right| }{(b_{i_*}a_{i_*}^{-1} - b_1a_1^{-1})^2} \cdot \frac{(a_1a_{i_*})^2}{(a_1a_{i_*})^2} \\ &= \frac{\left| \sum_{i = 1} ^L (a_{i_*} (a_1a_{i}^{-1}) - a_{i_*})(b_ia_{i_*} (a_1a_i^{-1}) - b_1a_{i_*}) \right| }{(b_{i_*}a_{1} - b_1a_{i_*})^2} \\ &\leq \frac{1}{(b_{i_*}a_{1} - b_1a_{i_*})^2} \sum_{i = 1} ^L \left| (a_{i_*} (a_1a_{i}^{-1}) - a_{i_*}) \right| \left| (b_ia_{i_*} (a_1a_i^{-1}) - b_1a_{i_*}) \right| \end{align}\] Note now that since \(a_1\) is the smallest of the \(a_i\), \(a_1a_i^{-1} \leq 1\) for all \(i\); hence, combined with the fact that the \(a_i\) and \(b_i\) are all bounded above in the limit, the final term is also controlled in the limit. \[\begin{align} \limsup \frac{\left| \sum_{i = 1} ^L (a_i^{-1} - a_1^{-1})(b_ia_i^{-1} - b_1a_1^{-1}) \right| }{\sum_{i = 1} ^L (b_ia_i^{-1} - b_1a_1^{-1})^2} &\stackrel{a.s.}{\leq} \frac{1}{(C')^2} L \cdot C'' \cdot 2 \cdot C'' \end{align}\] where \(C'\) is the constant from the non-degeneracy assumption and \(C''\) is the constant from Assumption [item:bound-op].
The same computation can be performed for the other term in 9, proceeding exactly as above but with the \(a_i\)'s and \(b_i\)'s exchanged. The result is \[\begin{align} \limsup \frac{\left| \sum_{i = 1} ^L (b_i^{-1} - b_1^{-1})(a_ib_i^{-1} - a_1b_1^{-1}) \right| }{\sum_{i = 1} ^L (a_ib_i^{-1} - a_1b_1^{-1})^2} &\stackrel{a.s.}{\leq} \frac{1}{(C')^2} L \cdot 2 \cdot C'', \end{align}\] as desired.
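To make the construction above concrete, the following is a minimal sketch of computing \((\hat{r}^2, \hat{\sigma}^2)\) from the training data, assuming (as in Lemma 1) that the observed training error \(t(\lambda_j)\) concentrates around \(a(\lambda_j) r^2 + b(\lambda_j)\sigma^2\). The helper names, the choice of \(\lambda\) grid, and the Gaussian design used for illustration are ours, not the paper's code.

```python
import numpy as np

def ab_curves(svals, n, p, lams):
    """a(lam) = (1/p) sum_i D_ii^2 (lam / (D_ii^2 + lam))^2 and
    b(lam) = (1/n) sum_i lam^2 / (D_ii^2 + lam)^2, with D_ii = 0 beyond min(n, p)."""
    d2 = np.zeros(max(n, p))
    d2[: len(svals)] = np.asarray(svals, dtype=float) ** 2
    a = np.array([np.mean(d2[:p] * (lam / (d2[:p] + lam)) ** 2) for lam in lams])
    b = np.array([np.mean((lam / (d2[:n] + lam)) ** 2) for lam in lams])
    return a, b

def estimate_r2_sigma2(train_errs, a, b):
    """Anchored least-squares fit of t_j ~= a_j r^2 + b_j sigma^2 (cf. the sums above)."""
    t, a, b = map(np.asarray, (train_errs, a, b))
    x_s, y_s = b / a - b[0] / a[0], t / a - t[0] / a[0]
    x_r, y_r = a / b - a[0] / b[0], t / b - t[0] / b[0]
    return (x_r @ y_r) / (x_r @ x_r), (x_s @ y_s) / (x_s @ x_s)   # (r2_hat, sigma2_hat)

# Illustration on a Gaussian design (one member of the right-rotationally
# invariant class); the scaling choices here are illustrative.
rng = np.random.default_rng(0)
n = p = 1000
r2, sigma2 = 1.0, 1.0
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = rng.standard_normal(p)
beta *= np.sqrt(r2 * n) / np.linalg.norm(beta)          # ||beta||^2 = r^2 * n
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

U, svals, Vt = np.linalg.svd(X, full_matrices=False)
lams = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
a, b = ab_curves(svals, n, p, lams)
z = U.T @ y
# training error t(lam) = (1/n) ||y - X beta_hat_lam||^2, computed from the SVD
t = [np.mean((y - U @ (svals**2 / (svals**2 + lam) * z)) ** 2) for lam in lams]
print(estimate_r2_sigma2(t, a, b))
```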
Proof. Most computations are analogous to the proof of Theorem 2. We first compute \(\mathbb{E}[\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}} \mid \mathbf{X}, \mathbf{y}]\) using [13]. We find that it is equal to \(\sum_{i \in \mathcal{J}_c} (\mathfrak{d}_{i}^2 - \mathfrak{d}_\mathsf{bulk}^2) \mathbf{o}_i \mathbf{o}_i^\top + \mathfrak{d}_\mathsf{bulk}^2 \mathbf{I}\). Call this quantity \(\mathbf{C}\) for convenience. We then move to computing the risk. \[\boldsymbol{\beta} - \hat{\boldsymbol{\beta}} = (\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X}) \boldsymbol{\beta} - (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \boldsymbol{\epsilon}.\] Meanwhile \[\mathbb{E}\left[ \|\widetilde{\mathbf{X}}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})\|_2^2 \right]= (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^\top \left( \sum_{i \in \mathcal{J}_c} (\mathfrak{d}_{i}^2 - \mathfrak{d}_\mathsf{bulk}^2) \mathbf{o}_i \mathbf{o}_i^\top + \mathfrak{d}_\mathsf{bulk}^2 \mathbf{I}\right) (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}).\] Thus as before, there are three main terms to handle.
Variance term: \[\begin{align} &\boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{C} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \boldsymbol{\epsilon} \\ &= \mathfrak{d}_\mathsf{bulk}^2 \boldsymbol{\epsilon}^\top \mathbf{Q}^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-2} \mathbf{D}^\top\mathbf{Q}\boldsymbol{\epsilon} + \\ &\qquad \boldsymbol{\epsilon}^\top \mathbf{Q}^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \left( \sum_{i \in \mathcal{J}_c} (\mathfrak{d}_{i}^2 - \mathfrak{d}^2_\mathsf{bulk}) \mathbf{e}_i\mathbf{e}_i^\top\right) (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{D}^\top \mathbf{Q}\boldsymbol{\epsilon}. \end{align}\] Note now that \[\frac{n}{n'}\mathfrak{d}_\mathsf{bulk}^2 = \frac{n}{n'} \frac{1}{p - |\mathcal{J}_c|} \sum_{i \not \in \mathcal{J}_c} \mathfrak{d}_i^2 = \frac{n \mathbb{E}\mathsf{Tr}(\widetilde{\mathbf{D}}^\top \widetilde{\mathbf{D}})}{n' p} + O(1/p) = 1 + O(1/p),\] where the penultimate equality is due to assumption [item:j-bounded-spec] and the last is due to the scaling assumption [item:scaling]. Note that the \(O(1/p)\) term converges almost surely to zero.
Then \[\begin{align} &\frac{n}{n'}\mathfrak{d}_\mathsf{bulk}^2 \left|\frac{1}{n}\boldsymbol{\epsilon}^\top \mathbf{Q}^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-2} \mathbf{D}^\top\mathbf{Q}\boldsymbol{\epsilon} - \frac{1}{n} \mathbb{E}[\boldsymbol{\epsilon}^\top \mathbf{Q}^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-2} \mathbf{D}^\top\mathbf{Q}\boldsymbol{\epsilon}]\right| \\ &\leq \frac{1}{n} \left|\boldsymbol{\epsilon}^\top \mathbf{Q}^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-2} \mathbf{D}^\top\mathbf{Q}\boldsymbol{\epsilon} - \mathbb{E}[\boldsymbol{\epsilon}^\top \mathbf{Q}^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-2} \mathbf{D}^\top\mathbf{Q}\boldsymbol{\epsilon}]\right| + O(1/p) \xrightarrow{a.s.} 0. \end{align}\] The same arguments as in Theorem 2 then imply uniform convergence to zero.
On the other hand, the second term is of order \(1\) and hence vanishes when divided by \(n\). The convergence is uniform because the function is monotonic in \(\lambda\), and it is almost sure by, for instance, an application of [1]. Note that the form of \(\mathcal{V}_{\mathbf{X}}\) contains an extra \[\begin{align} \frac{\sigma^2}{n} \sum_{i \in \mathcal{J}_c} \frac{D_{ii}^2}{D^2_{ii} + \lambda} \mathfrak{d}_{i}^2 \end{align}\] term, which is uniformly negligible by monotonicity and the assumption that the \(\mathfrak{d}_{i}^2\) are almost surely bounded in the limit.
Bias term: \[\begin{align} &\boldsymbol{\beta}^\top (\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X})^\top \mathbf{C} (\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X})\boldsymbol{\beta} \\ &= \mathfrak{d}_\mathsf{bulk}^2 (\mathbf{O}\boldsymbol{\beta})^\top (\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{D}^\top \mathbf{D})^2 (\mathbf{O} \boldsymbol{\beta}) \\ &\qquad + (\mathbf{O} \boldsymbol{\beta})^\top \left[(\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top \mathbf{D}) \left[ \sum_{i \in \mathcal{J}_c} (\mathfrak{d}_i^2 - \mathfrak{d}_\mathsf{bulk}^2) \mathbf{e}_i \mathbf{e}_i^\top \right] (\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top \mathbf{D}) \right] (\mathbf{O}\boldsymbol{\beta}). \end{align}\] We now need to handle this a bit more carefully than before, because \(\boldsymbol{\beta}\) is no longer independent of \(\mathbf{O}\). Note that \(\mathbf{O}\boldsymbol{\beta} = \mathbf{O} \boldsymbol{\beta}' + \sum_{i \in \mathcal{J}_a} \sqrt{n}\alpha_i \mathbf{e}_i\). There are a few different types of terms we need to handle.
Unaligned terms: \[\begin{align} &\mathfrak{d}_\mathsf{bulk}^2 (\mathbf{O} \boldsymbol{\beta}') ^\top (\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{D}^\top \mathbf{D})^2 (\mathbf{O} \boldsymbol{\beta}') \\ & \qquad + (\mathbf{O} \boldsymbol{\beta})^\top \left[(\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top \mathbf{D}) \left[ \sum_{i \in \mathcal{J}_c} (\mathfrak{d}_i^2 - \mathfrak{d}_\mathsf{bulk}^2) \mathbf{e}_i \mathbf{e}_i^\top \right] (\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top \mathbf{D}) \right] (\mathbf{O}\boldsymbol{\beta}) \end{align}\] The first term is treated exactly as the bias term in the ridge case of Theorem 2, where, as in the variance computation above, the \(\mathfrak{d}_\mathsf{bulk}^2\) prefactor is absorbed using the scaling assumption; this produces uniform convergence to its expectation. The second term is \(O(|\mathcal{J}_c|)\) and hence vanishes when divided by \(n\); it does so uniformly because it is monotone increasing in \(\lambda\).
Aligned but uncoupled terms: \[\begin{align} &\sum_{i \in \mathcal{J}_a \setminus \mathcal{J}_c} n\alpha_i^2 \mathfrak{d}_\mathsf{bulk}^2 \left(1 - \frac{D_{ii}^2}{D_{ii}^2 + \lambda}\right)^2 \end{align}\]
Aligned and coupled terms: \[\begin{align} &\sum_{i \in \mathcal{J}_a \cap \mathcal{J}_c} n\alpha_i^2 \mathfrak{d}_i^2 \left(1 - \frac{D_{ii}^2}{D_{ii}^2 + \lambda}\right)^2 \end{align}\]
Cross terms: \[\begin{align} \sum_{i \in \mathcal{J}_a} \mathbf{o}_i^\top \boldsymbol{\beta}' \sqrt{n}\alpha_i \left[(\mathfrak{d}_{\mathsf{bulk}}^2 + \mathbb{1}(i \in \mathcal{J}_c) (\mathfrak{d}_{i}^2 - \mathfrak{d}_\mathsf{bulk}^2) )\left(1 - \frac{D_{ii}^2}{D_{ii}^2 + \lambda}\right)^2\right] \end{align}\] These terms are only of order \(O(\sqrt{n})\) and hence vanish; they do so uniformly because, after taking the absolute value of the summand, the resulting term is monotone increasing in \(\lambda\), so it suffices to control the value at \(\lambda_2\).
Note that the cross terms between \(\alpha_i \mathbf{e}_i\) and \(\alpha_j \mathbf{e}_j\) are all identically zero and thus do not need to be considered.
Summing all of these together, one obtains uniform convergence to \(\mathcal{B}\). Note that the statement of \(\mathcal{B}\) contains an extra \[\frac{\lambda^2}{n} \sum_{i \in \mathcal{J}_c} \frac{r^2}{\gamma} \frac{\mathfrak{d}_i^2 - \mathfrak{d}_\mathsf{bulk}^2}{(D_{ii}^2 + \lambda)^2}\] term, which is uniformly negligible.
Cross term: \[\begin{align} &\boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{C} (\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X})\boldsymbol{\beta} \\ &= \boldsymbol{\epsilon}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{C} (\mathbf{I}- (\mathbf{X}^\top \mathbf{X}+ \lambda\mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X})\left(\boldsymbol{\beta}' + \sum_{i \in \mathcal{J}_a} \sqrt{n}\alpha_i \mathbf{o}_i\right) \\ &= \boldsymbol{\epsilon}^\top \mathbf{Q}^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{O} \mathbf{C} \mathbf{O}^\top (\mathbf{I}- (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{D}^\top \mathbf{D}) \mathbf{O} \boldsymbol{\beta}' + \\ &\qquad \sum_{i \in \mathcal{J}_a \cap \mathcal{J}_c} \mathbf{q}_i^\top \boldsymbol{\epsilon} \cdot \frac{D_{ii}}{\lambda + D_{ii}^2} \cdot \mathfrak{d}_i^2 \cdot \frac{\lambda}{D_{ii}^2 + \lambda} \cdot \sqrt{n} \alpha_i \end{align}\] For the first term, note that \(\mathbf{O} \mathbf{C} \mathbf{O}^\top\) is in fact independent of \(\mathbf{O}\). Moreover, the matrix in the middle has bounded operator norm, and thus this term vanishes after dividing by \(n\) due to Lemma 6, where we apply the strong law of large numbers to bound \(\|\boldsymbol{\epsilon}\|\). For the second, we note that it is almost surely, for sufficiently large \(n\), uniformly bounded above by \[\begin{align} C\sqrt{n} \cdot (\max_{i \in \mathcal{J}_a} \alpha_i) \cdot \max_{i \in \mathcal{J}_a \cap \mathcal{J}_c} |\mathbf{q}_i^\top \boldsymbol{\epsilon}| \end{align}\] Recall we are dividing everything by \(n\). Hence if we show that \(|\mathbf{q}_i^\top \boldsymbol{\epsilon}|/\sqrt{n} \xrightarrow{a.s.} 0\), then a union bound finishes the argument. We rewrite \(|\mathbf{q}_i^\top \boldsymbol{\epsilon}|^2 = \boldsymbol{\epsilon}^\top (\mathbf{q}_i \mathbf{q}_i^\top) \boldsymbol{\epsilon}\) and can now apply [1] to find \[\begin{align} \frac{1}{n} \boldsymbol{\epsilon}^\top (\mathbf{q}_i \mathbf{q}_i^\top) \boldsymbol{\epsilon} - \frac{\sigma^2}{n} \xrightarrow{a.s.} 0. \end{align}\] Thus \(|\mathbf{q}_i^\top \boldsymbol{\epsilon}|^2/n\) converges a.s. to \(0\), and by continuous mapping so does \(|\mathbf{q}_i^\top \boldsymbol{\epsilon}|/\sqrt{n}\). ◻
The following is taken from [31]. We note that in that work the errors are assumed to be Gaussian; we indicate which steps require this assumption and how we believe it can be relaxed. First, the consistency of the \(\alpha_i\) is immediate from the first result of [31], whose proof does not require Gaussianity of the errors. The one step of the proof where care must be taken is in controlling \[\frac{1}{p}\left\|\mathbf{O}_{\mathcal{J}}^{\top}\left(\mathbf{D}_{\mathcal{J}}^{\top} \mathbf{D}_{\mathcal{J}}\right)^{-1} \mathbf{D}_{\mathcal{J}}^{\top} \mathbf{Q} \boldsymbol{\epsilon}\right\|_2^2,\] but we note that this can be done through [1].
The reduced model is then right-rotationally invariant by [31]; establishing this is the step that requires Gaussianity, and it is what leads to the added assumption in Lemma 2. Without the Gaussian assumption, the errors in the reduced model will not be i.i.d. However, we only ever require that certain inner products concentrate, and we believe this will still hold in the reduced model once one explicitly tracks the form of the error.
We will prove uniform convergence term by term. We begin with the variance. Define the following: \[\begin{align} A &= \frac{1}{n}\left[\sum_{i = 1} ^n \frac{D_{ii}^2}{(D_{ii}^2 + \lambda)^2} \left(\mathfrak{d}_\mathsf{bulk}^2 + \mathbb{1}(i \in \mathcal{J}_c)(\mathfrak{d}_i^2 - \mathfrak{d}^2_\mathsf{bulk})\right)\right] \\ \hat{A} &= \frac{1}{n}\left[\sum_{i = 1} ^n \frac{D_{ii}^2}{(D_{ii}^2 + \lambda)^2} \left(\hat{\mathfrak{d}}_\mathsf{bulk}^2 + \mathbb{1}(i \in \mathcal{J}_c)(\hat{\mathfrak{d}}_i^2 - \hat{\mathfrak{d}}^2_\mathsf{bulk})\right)\right] \\ T_1 &= \sigma^2 A \qquad T_2 = \hat{\sigma}^2 \hat{A} \qquad T_3 = \hat{\sigma}^2 {A} \end{align}\] Note that \(T_1 = \mathcal{V}_\mathbf{X}\). First, note that \(\hat{A}\) and \(\hat{\sigma}^2\) are both almost surely bounded for sufficiently large \(n\). This follows from the fact that the estimators are strongly consistent for bounded quantities. We can thus uniformly control \(T_1 - T_3\) and then \(T_3 - T_2\), producing uniform convergence.
For the bias, we wish to uniformly control \[\begin{align} \mathcal{B}_\mathbf{X}(\hat{\boldsymbol{\beta}}_\lambda, \boldsymbol{\beta}) &= \frac{\lambda^2}{n} \sum_{i = 1} ^p \frac{\left(r^2/\gamma + \mathbb{1}(i \in \mathcal{J}_a)n\alpha_i^2\right) \left(\mathfrak{d}_\mathsf{bulk}^2 + \mathbb{1}(i \in \mathcal{J}_c)(\mathfrak{d}_i^2 - \mathfrak{d}^2_\mathsf{bulk})\right)}{(D_{ii}^2 + \lambda)^2}. \end{align}\] One can then expand the numerator into 4 constituent terms. Arguments similar to that of the variance then show uniform convergence for each of them.
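For reference, the plug-in risk estimate in this coupled/aligned setting is the sum of the variance and bias expressions above, evaluated at the estimated quantities. The sketch below simply evaluates those two sums; the argument names are ours, and the consistent estimates \((\hat{r}^2, \hat{\sigma}^2, \hat{\alpha}_i, \hat{\mathfrak{d}}_i^2, \hat{\mathfrak{d}}_\mathsf{bulk}^2)\) are assumed to be supplied.

```python
import numpy as np

def coupled_aligned_risk(lams, svals, gamma, r2_hat, sigma2_hat,
                         d2_bulk_hat, coupled=None, aligned=None):
    """Evaluate B_X + V_X from the two displayed formulas on a grid of lambdas.

    svals       : training singular values D_ii, padded with zeros to length p
    d2_bulk_hat : estimated bulk level of the test spectrum
    coupled     : dict {i: d2_i_hat} for i in J_c (0-based indices)
    aligned     : dict {i: alpha_i_hat**2} for i in J_a (0-based indices)"""
    coupled, aligned = coupled or {}, aligned or {}
    d2 = np.asarray(svals, dtype=float) ** 2
    p = d2.size
    n = int(round(p / gamma))
    # test-spectrum weight per direction: bulk level, replaced on coupled indices
    w = np.full(p, float(d2_bulk_hat))
    for i, d2_i in coupled.items():
        w[i] = d2_i
    # signal weight per direction: r^2 / gamma, plus n * alpha_i^2 on aligned indices
    s = np.full(p, r2_hat / gamma)
    for i, alpha2_i in aligned.items():
        s[i] += n * alpha2_i
    risks = []
    for lam in lams:
        bias = (lam**2 / n) * np.sum(s * w / (d2 + lam) ** 2)
        var = (sigma2_hat / n) * np.sum(d2 * w / (d2 + lam) ** 2)
        risks.append(bias + var)
    return np.array(risks)
```

With no coupled or aligned directions and \(\hat{\mathfrak{d}}_\mathsf{bulk}^2 \approx 1\), this reduces to the simpler plug-in estimate sketched earlier.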
All experiments were run on a personal laptop; each experiment took on the order of one minute. All experiments are carried out with \(n = p = 1000\), because this scenario is the most sensitive to the amount of regularization (with \(\lambda = 0\), the risk is often infinite). In Section 3, we first sample a training set and then repeatedly resample the training noise \(\boldsymbol{\epsilon}\), as this is more computationally economical. Each time, we run each cross-validation technique and plot the loss curve over \(\lambda\). The estimated risk is the risk reported by the loss curve at its minimum, and the tuned risk is the actual mean-squared error (which has an analytical expression) obtained when using the estimator with that value of \(\lambda\). For ROTI-GCV, Assumption [item:scaling], while useful theoretically, poses some practical challenges: finding the rescaling of the data that makes this assumption hold is not always simple, since the trace does not necessarily concentrate. In the experiments in Section 3, the data is drawn from distributions satisfying this assumption, and the data is not normalized after being sampled; that said, normalizing the data does not affect any results. It is important to note that this scaling never affects the tuned value of \(\lambda\) and instead only affects the risk estimate.
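As a concrete illustration of this protocol, the following is a sketch of one resampling loop using standard GCV as the loss curve; the Gaussian design, the \(\lambda\) grid, and the use of \(\|\hat{\boldsymbol{\beta}}_\lambda - \boldsymbol{\beta}\|^2/n\) as the oracle risk (valid under the scaling assumption, cf. Remark 8) are illustrative choices, not the paper's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = p = 1000
r2, sigma2 = 1.0, 1.0
lams = np.logspace(-2, 2, 40)

# fixed training design and signal (Gaussian design used purely for illustration)
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = rng.standard_normal(p)
beta *= np.sqrt(r2 * n) / np.linalg.norm(beta)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

for rep in range(3):                                  # resample only the noise
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    z = U.T @ y
    est, oracle = [], []
    for lam in lams:
        shrink = d**2 / (d**2 + lam)
        resid = y - U @ (shrink * z)                  # (I - S_lam) y
        est.append(np.mean(resid**2) / (1 - shrink.sum() / n) ** 2)   # GCV_n(lam)
        beta_hat = Vt.T @ (d / (d**2 + lam) * z)      # ridge solution at lam
        oracle.append(np.sum((beta_hat - beta) ** 2) / n)
    i_cv = int(np.argmin(est))
    print(rep, lams[i_cv], est[i_cv], oracle[i_cv])   # chosen lambda, estimated and tuned risk
```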
In Section 4, it becomes necessary to estimate the eigenvalues of the test set data. To do this, we now sample a test set (\(n = p = 1000\) again), and then compute the eigenvalues of the test set. A similar result should hold when using eigenvalues of the training set, since here both sets are equal in distribution.
Speech data was taken directly from [43] and then standardized (demeaned and rescaled to have unit variance), as is usual when using ridge regression.
There is one row in \(X\) for each minute. Each row is the residualized return over the last 30 minutes, so consecutive rows are heavily correlated, since they share 29 minutes of returns. \(X\) has 493 rows and 493 columns (\(n = p = 493\)).
We generate \(\beta\) uniformly on the sphere of radius \(r\sqrt{n}\) (so that \(\|\beta\|_2^2 = r^2 n\)), and the training labels \(y = X\beta + \epsilon\), where \(\epsilon_i \sim N(0, \sigma^2)\).
We plot the predicted loss curve given by each cross-validation metric.
We compute the actual loss curve over the test data.
Both the equicorrelated and autocorrelated cases can be written as \(\boldsymbol{\Sigma}^{1/2} \mathbf{G}\), where \(\mathbf{G}\) has i.i.d. standard Gaussian entries; the rotational invariance of the Gaussian then implies the right-rotational invariance of \(\boldsymbol{\Sigma}^{1/2}\mathbf{G}\) in these cases.
\(t\)-distributed data: A multivariate \(t\) distribution with \(\nu\) degrees of freedom can be generated by first sampling \(\mathbf{y} \sim \mathsf{N}(\mathbf{0}, \mathbf{I}_n)\) and \(u \sim \chi^2_\nu\) independently, and setting \(\mathbf{x} = \mathbf{y} / \sqrt{u / \nu}\); the rotational invariance of the Gaussian implies that this distribution is rotationally invariant. This proves rotational invariance for a single sample; stacking multiple independent samples into a matrix retains the right-rotational invariance.
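A minimal sketch of this construction (the function name and the absence of any per-sample rescaling are our illustrative choices):

```python
import numpy as np

def t_design(n, p, nu, rng):
    """Stack n independent multivariate-t rows: each row is y / sqrt(u / nu)
    with y a standard Gaussian vector and u ~ chi^2_nu drawn independently per row."""
    y = rng.standard_normal((n, p))
    u = rng.chisquare(nu, size=(n, 1))
    return y / np.sqrt(u / nu)

X = t_design(1000, 1000, nu=5, rng=np.random.default_rng(0))
```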
Products of Gaussian matrices: follows from rotational invariance of the Gaussian.
Matrix normals: Recall that \(\boldsymbol{\Sigma}^{\mathrm{col}}\) is drawn from an inverse Wishart distribution with identity scale and \((1 + \delta)p\) degrees of freedom; this can be generated by \((\mathbf{G}_2^\top \mathbf{G}_2)^{-1}\), where \(\mathbf{G}_2 \in \mathbb{R}^{(1 + \delta)p \times p}\) has i.i.d. standard Gaussian entries (since the inverse Wishart has identity scale) [51]. A matrix normal with row covariance \(\boldsymbol{\Sigma}^{\text{row}}\) and column covariance \(\boldsymbol{\Sigma}^{\text{col}} = (\mathbf{G}_2^\top \mathbf{G}_2)^{-1}\) can then be generated by \((\boldsymbol{\Sigma}^{\text{row}})^{1/2}\mathbf{G} (\mathbf{G}_2^\top \mathbf{G}_2)^{-1/2}\) (with \(\mathbf{G}\) again having i.i.d. Gaussian entries and being independent of \(\mathbf{G}_2\)). The claim then follows from the calculation below: for any fixed \(\mathbf{O} \in \mathbb{O}_p\), \[\begin{align} (\boldsymbol{\Sigma}^{\text{row}})^{1/2}\mathbf{G} (\mathbf{G}_2^\top \mathbf{G}_2)^{-1/2} \mathbf{O} &= (\boldsymbol{\Sigma}^{\text{row}})^{1/2}(\mathbf{G}\mathbf{O}) \left((\mathbf{G}_2\mathbf{O})^\top (\mathbf{G}_2\mathbf{O})\right)^{-1/2} \\ &\stackrel{d}{=} (\boldsymbol{\Sigma}^{\text{row}})^{1/2}\mathbf{G} (\mathbf{G}_2^\top \mathbf{G}_2 )^{-1/2}, \end{align}\] where the first step uses the identity \((\mathbf{G}_2^\top\mathbf{G}_2)^{-1/2}\mathbf{O} = \mathbf{O}\,(\mathbf{O}^\top\mathbf{G}_2^\top\mathbf{G}_2\mathbf{O})^{-1/2}\) and the second uses \((\mathbf{G}\mathbf{O}, \mathbf{G}_2\mathbf{O}) \stackrel{d}{=} (\mathbf{G}, \mathbf{G}_2)\), by the rotational invariance of the Gaussian, as required.
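A minimal sketch of this construction (the function name and the choice of \(\boldsymbol{\Sigma}^{\text{row}}\) passed in are ours):

```python
import numpy as np

def matrix_normal_design(n, p, delta, row_cov_sqrt, rng):
    """Sample (Sigma_row)^{1/2} G (G2^T G2)^{-1/2}, a matrix normal whose column
    covariance (G2^T G2)^{-1} has the InvWishart(I_p, (1 + delta) p) distribution."""
    G = rng.standard_normal((n, p))
    G2 = rng.standard_normal((int((1 + delta) * p), p))
    evals, evecs = np.linalg.eigh(G2.T @ G2)           # inverse square root of G2^T G2
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return row_cov_sqrt @ G @ inv_sqrt

X = matrix_normal_design(500, 500, 0.5, np.eye(500), np.random.default_rng(0))
```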
Spiked matrices: It suffices to show that, for any fixed \(\mathbf{O}\), \((\lambda \mathbf{V} \mathbf{W}^\top, \mathbf{G}) \stackrel{d}{=} (\lambda \mathbf{V} \mathbf{W}^\top \mathbf{O}, \mathbf{G} \mathbf{O})\). Since the components of the first pair are independent, it suffices to show that \(\mathbf{V}\mathbf{W}^\top \stackrel{d}{=} \mathbf{V} \mathbf{W}^\top \mathbf{O}\), \(\mathbf{G} \stackrel{d}{=} \mathbf{G}\mathbf{O}\), and that \(\lambda \mathbf{V}\mathbf{W}^\top \mathbf{O}\) is independent of \(\mathbf{G} \mathbf{O}\). The last two statements are automatic from the rotational invariance of the Gaussian and the independence of \(\mathbf{V}\mathbf{W}^\top\) and \(\mathbf{G}\). The first can be seen from the fact that \(\mathbf{W} = \mathbf{W'}\mathbf{P}\), where \(\mathbf{P} \in \mathbb{R}^{p \times r}\) has \(\mathbf{P}_{ij} = \mathbb{1}(i = j)\) and \(\mathbf{W'} \in \mathbb{R}^{p \times p}\) is Haar. Then \(\mathbf{V}\mathbf{W}^\top \mathbf{O} = \mathbf{V} \mathbf{P}^\top (\mathbf{W'})^\top \mathbf{O} \stackrel{d}{=} \mathbf{V} \mathbf{P}^\top (\mathbf{W'})^\top = \mathbf{V} \mathbf{W}^\top\) where the distributional equality follows from the rotational invariance of Haar matrices.
Suppose one has a sequence of (possibly random) functions \(f_n\) satisfying, for some deterministic \(f\), \[\sup_{x \in [x_1, x_2]} |f_n(x) - f(x)| \xrightarrow{a.s.} 0.\] Define \[\begin{align} x_{\mathsf{cv}, n} = \mathop{\mathrm{arg\;min}}_{x \in [x_1, x_2]} f_n(x) \qquad x_* = \mathop{\mathrm{arg\;min}}_{x \in [x_1, x_2]} f(x). \end{align}\] Then \[\left|f\left(x_{\mathsf{cv}, n}\right) - f(x_*)\right| \to 0.\] Here the random functions \(f_n\) represent cross-validation metrics which converge uniformly to the true asymptotic out-of-sample risk \(f\). This result then shows that the minimizer chosen by the cross-validation metric tends towards the asymptotically optimal value. Note that this result cannot be obtained with only pointwise convergence.
Proof. Note \(f(x_{\mathsf{cv}, n}) - f(x_*) \geq 0\) always. Hence it suffices to show an upper bound which tends to zero. \[\begin{align} f(x_{\mathsf{cv}, n}) - f(x_*) &= f(x_{\mathsf{cv}, n}) - f_n(x_{\mathsf{cv}, n}) + \underbrace{f_n(x_{\mathsf{cv}, n}) - f_n(x_*)}_{\leq 0} + f_n(x_*) - f(x_*) \\ &\leq |f(x_{\mathsf{cv}, n}) - f_n(x_{\mathsf{cv}, n})| + |f_n(x_*) - f(x_*)| \\ &\leq 2\sup_{x \in [x_1, x_2]} |f_n(x) - f(x)| \to 0. \end{align}\] as required. ◻
\(\mathsf{InvWishart}(\mathbf{\Psi}, \nu)\) denotes an inverse Wishart distribution [51] with scale matrix \(\boldsymbol{\Psi}\) and degrees of freedom \(\nu\).
\(\mathsf{MN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}^{\text{row}}, \boldsymbol{\Sigma}^{\text{col}})\) denotes a matrix normal with among-row covariance \(\boldsymbol{\Sigma}^{\text{row}}\) and among-column covariance \(\boldsymbol{\Sigma}^{\text{col}}\).
The LOOCV in our setting is difficult to analyze because the objective depends on \(\mathbf{Q}\), on which we do not place any assumptions. In particular, one can derive \[\begin{align} \mathsf{LOOCV}_n(\lambda) &= \mathbf{y}^\top (\mathbf{I} - \mathbf{S}_{\lambda}) \mathbf{D}_{\lambda}^{-2} (\mathbf{I} - \mathbf{S}_{\lambda}) \mathbf{y}/n \\ &= \mathbf{y}^\top (\mathbf{I}- \mathbf{Q}^\top (\mathbf{D}(\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{D}^\top) \mathbf{Q}) \mathbf{D}_{\lambda}^{-2} (\mathbf{I}- \mathbf{Q}^\top (\mathbf{D}(\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{D}^\top) \mathbf{Q}) \mathbf{y} / n \end{align}\] where \(\mathbf{D}_{\lambda} = \mathrm{diag}((1 - (\mathbf{S}_{\lambda})_{ii})_{i = 1} ^n)\). We try to simplify the diagonal term first. \[\begin{align} (D_{\lambda})_{ii} &= 1 - \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{x}_i \\ &= 1 - \mathbf{q}_i^\top \mathbf{D} \mathbf{O} (\mathbf{O}^\top \mathbf{D}^\top \mathbf{D} \mathbf{O} + \lambda\mathbf{I})^{-1} \mathbf{O}^\top \mathbf{D}^\top \mathbf{q}_i \\ &= 1 - \mathbf{q}_i^\top \mathbf{D} (\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1} \mathbf{D}^\top \mathbf{q}_i \end{align}\] where \(\mathbf{q}_i\) is the \(i\)-th row of \(\mathbf{Q}\). We analyze \[\begin{align} \mathbf{Q} \mathbf{D}^{-2}_{\lambda} \mathbf{Q}^\top &= \sum_{i = 1} ^n \mathbf{q}_i\mathbf{q}_i^\top (\mathbf{D}_{\lambda} ^{-2})_{ii}. \end{align}\] Then \(\mathsf{LOOCV}\) should contribute two terms, as the cross term should vanish. They are \[\begin{align} (\mathbf{O}\boldsymbol{\beta})^\top \mathbf{D}^\top \mathbf{Q} (\mathbf{I}- \mathbf{D}(\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top) (\mathbf{Q} \mathbf{D}_{\lambda} ^{-2} \mathbf{Q}^{\top})(\mathbf{I}- \mathbf{D}(\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top)\mathbf{O}\boldsymbol{\beta} \end{align}\] which should concentrate around its trace.
The second is \[\begin{align} \boldsymbol{\epsilon}^\top (\mathbf{I}- \mathbf{D}(\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top) (\mathbf{Q} \mathbf{D}_{\lambda} ^{-2} \mathbf{Q}^{\top})(\mathbf{I}- \mathbf{D}(\mathbf{D}^\top \mathbf{D}+ \lambda\mathbf{I})^{-1}\mathbf{D}^\top) \boldsymbol{\epsilon} \end{align}\] which also concentrates around some multiple of its trace.
Finding this trace now involves numerous terms with \(\mathbf{Q}\). Everything except the center term is diagonal, but immediately one arrives at the following difficulty: \[\begin{align} (\mathbf{Q} \mathbf{D}^{-2}_{\lambda} \mathbf{Q}^\top)_{ii} &= \sum_{j = 1} ^n (\mathbf{q}_j \mathbf{q}_j^\top)_{ii} (\mathbf{D}_{\lambda} ^{-2}) _{jj} \\ &= \sum_{j = 1} ^n \mathbf{Q}_{ji}^2 (\mathbf{D}_{\lambda} ^{-2}) _{jj} \end{align}\] Assuming that the rows of \(\mathbf{Q}\) are exchangeable conditional on \(\mathbf{D}\) is insufficient to make progress. If \(\mathbf{Q}\) is assumed to also be Haar, then closed forms can likely be derived, but we do not pursue this.
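For completeness, the objective in the first display above is cheap to evaluate numerically even though its asymptotics are hard to characterize under our assumptions; a minimal sketch (the function name is ours):

```python
import numpy as np

def loocv_objective(X, y, lam):
    """Evaluate LOOCV_n(lam) = (1/n) y^T (I - S) D_lam^{-2} (I - S) y,
    i.e. (1/n) * sum_i ((y_i - yhat_i) / (1 - (S_lam)_{ii}))^2."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # hat matrix S_lam
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)
```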
Assuming matrix multiplications and inversions cost \(O(p^3)\), ROTI-GCV takes \(O(p^3)\) time. The entire procedure therefore costs only a constant factor more than computing a single ridge solution, even though it implicitly fits multiple ridge solutions (one per value of \(\lambda\)).
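A sketch of the cost accounting (the helper below is illustrative): a single SVD of \(\mathbf{X}\) dominates the cost, after which every additional value of \(\lambda\), whether for a ridge fit or for evaluating the criterion, requires only matrix-vector work.

```python
import numpy as np

def ridge_path(X, y, lams):
    """Ridge solutions for many penalties from a single SVD of X: the SVD costs
    O(n p min(n, p)); each extra lambda afterwards costs only O(p min(n, p))."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    z = U.T @ y
    return {lam: Vt.T @ (d / (d**2 + lam) * z) for lam in lams}
```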
Here, we work through applications of our method, looking at five application scenarios. Our focus is on detecting \(\mathcal{J}_c\), since that is the new concept introduced in this work. For each of the first four examples, we generate the signal vector uniformly on the sphere, so there is no signal alignment; afterwards, we discuss a fifth example with signal-PC alignment. We refer to [31] for a more detailed discussion of alignment detection and of hypothesis testing for signal-PC alignment, which presents a method for selecting \(\mathcal{J}_a\) that controls the false discovery rate. For each example, we discuss how to check whether the most important assumptions hold. This includes how the index sets \(\mathcal{J}_a\) and \(\mathcal{J}_c\) (aligned/coupled eigenvectors; definitions restated below) can be selected, and how our estimation scheme performs.
The five examples we discuss are as follows:
residualized returns, where each row contains the residualized returns over a 30-minute interval; this was shown in Figure 4 (f) and is analyzed in more detail in Figure 6;
the Gaussian mixture, where \(\mathbf{x}_i \sim \frac{1}{2}\mathsf{N}(\vec{3}, I_p) + \frac{1}{2}\mathsf{N}(-\vec{3}, I_p)\); this was shown in Figure 4 (b) and is analyzed in more detail in Figure 7;
speech data, a real dataset retrieved from OpenML, shown in Figure 4 (d) and analyzed in more detail in Figure 8;
unresidualized returns, where each row contains the unresidualized return over a 30 minute interval, shown in Figure 9;
speech data with signal-PC alignment, shown in Figure 4 (e) and again in Figure 10.
As a refresher for notation:
We study ridge regression given data \((\mathbf{X}, \mathbf{y})\), where \(\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\epsilon}\). The training set \(\mathbf{X} = \mathbf{Q}^\top \mathbf{D} \mathbf{O} \in \mathbb{R}^{n \times p}\) is right-rotationally invariant, meaning \(\mathbf{O} \sim \mathrm{Haar}(\mathbb{O}_p)\). The error \(\boldsymbol{\epsilon}\) is centered with independent entries.
We are interested in the test loss on a test set \(\widetilde{\mathbf{X}} = \widetilde{\mathbf{Q}}^\top \widetilde{\mathbf{D}} \widetilde{\mathbf{O}} \in \mathbb{R}^{n' \times p}\), defined as \[R_{\mathbf{X}, \mathbf{y}}(\hat{{\boldsymbol{\beta}}}(\mathbf{X}, \mathbf{y}), {\boldsymbol{\beta}})=\frac{1}{n^{\prime}} \mathbb{E}\left[\|\widetilde{\mathbf{{X}}} \hat{{\boldsymbol{\beta}}}-\widetilde{\mathbf{{X}}} {\boldsymbol{\beta}}\|^2 \mid \boldsymbol{X}, \boldsymbol{y}\right].\]
The two key parameters are the signal strength \(r^2 = \|\boldsymbol{\beta}\|_2^2/n\) and the noise level \(\sigma^2 = \mathbb{E}[\boldsymbol{\epsilon}_i^2]\).
Let \(\mathbf{o}_i^\top\) denote the \(i\)-th row of \(\mathbf{O}\) and \(\widetilde{\mathbf{o}}_i\) denote the \(i\)-th row of \(\widetilde{\mathbf{O}}\).
Aligned eigenvectors refer to vectors \(\mathbf{o}_i\) which align with the signal \(\boldsymbol{\beta}\). They are indexed by \(\mathcal{J}_a\), and satisfy \(\boldsymbol{\beta} = \sum_{i \in \mathcal{J}_a} \sqrt{n}\alpha_i \mathbf{o}_i + \boldsymbol{\beta}'\).
Coupled eigenvectors refer to training-set eigenvectors \(\mathbf{o}_i\) which align with the corresponding test-set eigenvectors \(\widetilde{\mathbf{o}}_i\). They are indexed by \(\mathcal{J}_c\), and we assume that \(\mathbf{o}_i = \widetilde{\mathbf{o}}_i\) for each \(i \in \mathcal{J}_c\).
Aligned and coupled eigenvectors allow us to better model structures in real data that may not initially fall under the coverage of our method. We refer to scenarios covered without the need for coupled/aligned eigenvectors as the simpler setting; in the main text, these are shown in Figure 1. Scenarios that require coupled/aligned eigenvectors are referred to as the coupled/aligned setting; these are covered in Figure 2 of the main text.
The most important assumptions, reproduced informally from the main text, are as follows:
On the regularity of singular value distributions:
In the simpler setting: we require Assumption 2, that the maximum singular value \(\lambda_{\text{max}} (D)\) is almost surely bounded.
In the aligned/coupled setting, we further require Assumption 11 that the expected maximum singular value of the test covariates, \(\|\mathbb{E}[\widetilde{\mathbf{{D}}}^{\top} \widetilde{\mathbf{{D}}}]\|_{\textsf{op}}\), is also almost surely bounded.
How to check this: One can generally verify that this assumption is satisfied by plotting a histogram of the singular values \((D_{ii})_{i = 1} ^p\) and \((\widetilde{D}_{ii})_{i = 1} ^p\). In the examples we discuss, we only do this for the training set since we do not expect any distribution shift. As examples, see Figures [fig:resid-ret-spectrum], [fig:gaussian-mixture-spectrum] etc.
On the relation between the eigenvectors of the training set \(\{\mathbf{o}_i\}_{i = 1} ^p\) and test set \(\{\widetilde{\mathbf{o}}_i\}_{i = 1} ^p\):
In the simpler setting, we have Assumption 6, which states the test set is independent of the training set.
In the coupled setting, we have Assumption 7, which states the top eigenvectors of the test and training set are exactly equal, i.e. for \(i \in \mathcal{J}_c\), we have \(o_i = \tilde{o}_i\).
One should think of Assumption 7 as a weakening of Assumption 6, where we allow for additional forms of structure to handle trends we see in data.
How to check this: We compute the overlaps \(\sqrt{p}\langle \tilde{o}_i, o_j \rangle\) for the top few eigenvectors (we choose 10); a short sketch of this computation is given after this checklist. Note that if the two vectors were independent, then one has \(\sqrt{p} \langle \tilde{o}_i, o_j \rangle \xrightarrow{p \to \infty} \mathsf{N}(0, 1)\). Hence one can empirically observe whether two eigenvectors possess an overlap that is significantly larger than expected; if they do, they should be coupled. For examples, see Figures [fig:resid-ret-numericals], [fig:gaussian-mixture-numericals].
One limitation of this approach is that each eigenvector of the training set can only be coupled to one in the test set; what this means is that if we find eigenvectors in the training set that overlap heavily with multiple eigenvectors of the test set, then our approach is not expected to perform well. We will see this occur in setting 4.
On the position of \(\beta\) relative to \(\{o_i\}_{i = 1} ^p\), which are the rows of \(O\):
In the simpler setting, we have Assumption 5, which states that \(\{o_i\}_{i = 1} ^p\) are independent of \(\beta\).
In the aligned setting, we have Assumption 8, which states a finite collection of eigenvectors indexed by \(\mathcal{J}_a\) are aligned with \(\beta\), so that \(\beta = \sum_{i \in \mathcal{J}_a} \sqrt{n}\alpha_i o_i + \beta'\).
How to check this: Identification of alignment is discussed in [31], which proposes a hypothesis testing framework for identifying \(J_a\): it provides a method to compute a \(p\)-value per eigenvector that tests whether or not it is aligned, and one can then apply the Benjamini-Hochberg procedure to control the false discovery rate. In the fifth example, we use the same approach; the \(p\)-values are shown in Table 2.
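To make the coupling diagnostic referenced in the checklist concrete, here is a minimal sketch of the overlap computation (the function name and the informal outlier threshold in the comment are ours):

```python
import numpy as np

def top_overlaps(X_train, X_test, k=10):
    """Rescaled overlaps sqrt(p) * <o_tilde_i, o_j> between the top-k right singular
    vectors of the test and training designs.  Under independence each entry is
    approximately N(0, 1), so entries far outside roughly +/- 3 suggest coupling."""
    p = X_train.shape[1]
    O_train = np.linalg.svd(X_train, full_matrices=False)[2][:k]   # rows o_j
    O_test = np.linalg.svd(X_test, full_matrices=False)[2][:k]     # rows o_tilde_i
    return np.sqrt(p) * (O_test @ O_train.T)                       # entry (i, j)
```

Large entries concentrated on the diagonal indicate pairs that can be coupled (placed in \(\mathcal{J}_c\)); a row or column with several large entries signals the problematic situation discussed for setting 4 below.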
Overview of Setting: to recap from the main text (Figure 2d), we considered data where each row contained the 30 minute residualized returns of a collection of 493 stocks. Here, we set \(n = p = 493\) and \(r^2 = \sigma^2 = 1\).
Checking Assumptions:
Regularity of Singular Value Distribution: In Figure [fig:resid-ret-spectrum], we plot a histogram of the singular values and see that they are well behaved with no strong outliers.
On coupling: While the histogram of overlaps (shown in Figure [fig:resid-ret-overlaps], numerical values in [fig:resid-ret-numericals]) has somewhat heavier tails than expected for a standard normal, the overlaps are not extreme, and thus we find we do not require coupling either (i.e., Assumption 6 essentially holds). As a result, applying our method without coupling performs well (result shown in Figure [fig:resid-ret-result]).
To build intuition for the coupled eigenvectors condition (Assumption 7), we take a first look at how these diagnostic plots change when the data is instead drawn from the Gaussian mixture discussed in the main text (restated below):
Overview of setting: Each row \(x_i \sim \frac{1}{2}N(\vec{3}, I_p) + \frac{1}{2}N(-\vec{3}, I_p)\). We take \(n = p = 1000\), and \(r^2 = \sigma^2 = 1\).
Checking assumptions:
Regularity of Singular Value Distribution: As seen in Figure [fig:gaussian-mixture-spectrum], we have one extremely large singular value, showing Assumption 2 is violated. This is usually a bad sign for our method, but in this case it manages to succeed, showing some robustness.
On coupling: In Figures [fig:gaussian-mixture-overlaps] and [fig:gaussian-mixture-numericals], the huge outlier overlap indicates that the top eigenvector of the training set overlaps heavily with the top eigenvector of the test set, meaning Assumption 6 does not hold. On the other hand, none of the other eigenvectors align with each other. This is exactly the situation suited for our coupled eigenvectors condition (Assumption 7). As a result, coupling \(o_1\) with \(\tilde{o}_1\) produces solid results, shown in Figure [fig:gaussian-mixture-result].
We find a good use case for our coupled eigenvector condition on speech data taken from OpenML. This is a real dataset which is not included in the main manuscript.
Overview of setting: Dataset sourced is speech data from [43]. \(n = p = 400\), and \(r^2 = \sigma^2 = 1\).
Checking assumptions:
Regularity of Singular Value Distribution: Here, in Figure [fig:speech-data-spectrum-v2], the spectrum has no outlier values.
On coupling: Next, we observe that each of the top 3 eigenvectors of the training set aligns strongly with the corresponding eigenvector of the test set, and with none of the others. This is exactly the setting in which our coupled eigenvectors condition can be used. In light of Table [fig:speech-data-numericals], we choose to couple the first 3 eigenvectors with the respective test-set eigenvectors. This produces the result shown in Figure [fig:speech-data-result-v2], showing that the method successfully captures the required structure. This result is quite robust to the choice of \(J_c\); using \(6\) eigenvectors or even \(10\) does not change the outcome much.
We will now look at a case that behaves poorly for our estimator. Here, we look at unresidualized returns.
Overview of Setting: we consider data where each row contains 30 minute returns, but without residualization.
Checking Assumptions:
Regularity of singular values: Here, in Figure [fig:bad-spectrum], there are some larger singular values, but they are not extreme outliers.
On coupling: When checking the alignment of test and training eigenvectors in Figures [fig:bad-overlaps] and [fig:bad-numericals], we see that the top eigenvector of the test set aligns heavily with multiple eigenvectors of the training set. This is a scenario in which we cannot use the coupled eigenvector condition, since we can only couple one train eigenvector with one test eigenvector. As a result, we see in Figure [fig:bad-result] that our method does not do well in this scenario, which is expected. Note that this example illustrates that these diagnostics can be used ahead of time to understand whether or not our method should succeed.
| Eigenvector | Adjusted \(p\)-value |
| --- | --- |
| \(o_1\) | 0.000 |
| \(o_{2}\) | 0.000 |
| \(o_{3}\) | 0.000 |
| \(o_{4}\) | 0.000 |
| \(o_{5}\) | 0.000 |
| \(o_{6}\) | 0.934 |
| \(o_{7}\) | 0.588 |
| \(o_{8}\) | 0.651 |
| \(o_{9}\) | 0.913 |
| \(o_{10}\) | 0.395 |
As a final example that includes both alignment and coupling, we perform another experiment with speech data that additionally contains Signal-PC alignment.
Overview of setting: The dataset is again the speech data from [43]. Recall that an aligned signal takes the form \(\beta = \sum_{i \in J_a} \sqrt{n} \alpha_i o_i + \beta'\), where the \(\alpha_i\) are the weights of the aligned components and \(\beta'\) is the unaligned component of the signal. We take \(\alpha_i = 1/2\) for \(i \in J_a = \{1, 2, 3, 4, 5\}\).
Checking assumptions:
The singular value histogram and coupling detection are the same as in Section 12.
Alignment: Here we next identify alignment through the hypothesis testing framework of [31]. We include in Table 1 the Benjamini-Hochberg adjusted \(p\)-values of alignment of each component, finding that \(p\)-values for each of \(i = 1, 2, 3, 4, 5\) are extremely small (\(<10^{-2}\)), and those of the unaligned components are large \((>0.35)\), meaning we perfectly identify the aligned portion. Our method then performs well in this setting, as shown in Figure 10.
The assumption of Gaussian entries is technical, and we believe it can be removed with some work.↩︎
The algorithm requires a choice of \(\mathcal{J}_a, \mathcal{J}_c\); we give a brief overview of selecting \(\mathcal{J}_a\) and \(\mathcal{J}_c\) in Section 5.3, followed by a detailed discussion in Appendix 9.↩︎
One can further couple the next few eigenvectors (e.g. \(\mathbf{o}_6\) with \(\widetilde{\mathbf{o}}_4\) and so on), but we observe this has no impact on the results.↩︎
We remove the top 8 right singular vectors from the data matrix \(\mathbf{X}\). This is analogous to the residualization of returns, a common practice in financial modeling. We compute the factors to delete by taking the top PCs of the covariance matrix computed using both test and train data; since PCA is a crude factor model and we use only 8 factors, it is more reasonable to compute the PCs using both the train and test data together. See Appendix 8.2 for more details on residualization.↩︎