Asymptotics of resampling without replacement in robust and logistic regression


Abstract

This paper studies the asymptotics of resampling without replacement in the proportional regime where dimension \(p\) and sample size \(n\) are of the same order. For a given dataset \((\boldsymbol{X},\boldsymbol{y})\in\mathbb{R}^{n\times p}\times \mathbb{R}^n\) and fixed subsample ratio \(q\in(0,1)\), the practitioner samples independently of \((\boldsymbol{X},\boldsymbol{y})\) iid subsets \(I_1,...,I_M\) of \(\{1,...,n\}\) of size \(qn\) and trains estimators \(\hat{\boldsymbol{b}}(I_1),...,\hat{\boldsymbol{b}}(I_M)\) on the corresponding subsets of rows of \((\boldsymbol{X},\boldsymbol{y})\). Understanding the performance of the bagged estimate \(\bar\boldsymbol{b}= \frac{1}{M}\sum_{m=1}^M \hat{\boldsymbol{b}}(I_m)\), for instance its squared error, requires us to understand correlations between two distinct \(\hat{\boldsymbol{b}}(I_m)\) and \(\hat{\boldsymbol{b}}(I_{m'})\) trained on different subsets \(I_m\) and \(I_{m'}\).

In robust linear regression and logistic regression, we characterize the limit in probability of the correlation between two estimates trained on different subsets of the data. The limit is characterized as the unique solution of a simple nonlinear equation. We further provide data-driven estimators that are consistent for estimating this limit. These estimators of the limiting correlation allow us to estimate the squared error of the bagged estimate \(\bar\boldsymbol{b}\), and for instance perform parameter tuning to choose the optimal subsample ratio \(q\). As a by-product of the proof argument, we obtain the limiting distribution of the bivariate pair \((\boldsymbol{x}_i^T \hat{\boldsymbol{b}}(I_m), \boldsymbol{x}_i^T \hat{\boldsymbol{b}}(I_{m'}))\) for observations \(i\in I_m\cap I_{m'}\), i.e., for observations used to train both estimates.

1 Introduction↩︎

This paper studies the performance of bagging estimators trained on subsampled, overlapping datasets in the context of robust linear regression and logistic regression.

1.1 M-estimation in the proportional regime↩︎

We consider an M-estimation problem in the proportional regime where sample size \(n\) and dimension \(p\) are of the same order: Throughout the paper, \(\delta > 1\) is a fixed constant and the ratio \[n/p = \delta \label{regime}\tag{1}\] is held fixed as \(n,p\to+\infty\) simultaneously. The practitioner collects data \((y_i,\boldsymbol{x}_i)_{i\in[n]}\) with scalar-valued responses \(y_i\) and feature vectors \(\boldsymbol{x}_i \in \mathbb{R}^p\). For a given subset of observations \(I\subset [n]\), an estimator \(\hat{\boldsymbol{b}}(I)\) is trained on the subset of observations \((y_i,\boldsymbol{x}_i)_{i\in I}\) using an optimization problem of the form \[\hat{\boldsymbol{b}}(I) = \mathop{\mathrm{arg\,min}}_{\boldsymbol{b}\in \mathbb{R}^p} \sum_{i\in I} \ell_i(\boldsymbol{x}_i^\top \boldsymbol{b}) \label{hbbI95intro}\tag{2}\] where for each \(i\in[n]\), the loss \(\ell_i(\cdot)\) is convex and depends implicitly on the response \(y_i\). We will focus on two regression settings: robust linear regression and Generalized Linear Models (GLM), including logistic regression. In robust regression, the response is of the form \[\label{y95i95robust} y_i = \boldsymbol{x}_i^T \boldsymbol{\beta}^* + \varepsilon_i\tag{3}\] for some possibly heavy-tailed noise \(\varepsilon_i\) independent of \(\boldsymbol{x}_i\). In this case the loss \(\ell_i\) in 2 is given by \[\ell_i(u) = \rho(y_i-u) \label{ell95i95robust}\tag{4}\] where \(\rho\) is a deterministic function, for instance the Huber loss \(\rho (u) = \int_0^{|u|}\min(1,t)dt\) or its smooth variants, e.g., \(\rho(u)=\sqrt{1+u^2}\). The asymptotics of the performance of 2 with \(I=\{1,...,n\}\) and the loss 4 in robust regression in the proportional regime 1 are now well understood [1][4] as we will review in 2. A typical example of GLM to which our results apply is the case of binary logistic regression, where \(\ell_i\) in 2 is the negative log-likelihood \[\ell_i(u) = \log(1+e^u) - u y_i, \qquad y_i\in \{0,1\} \label{logi95loss}\tag{5}\] which is now also well understood for \(I=[n]\) in 2. Related results will be reviewed in 3. The goal of the present paper is to study the performance of bagging several estimators of the form 2 obtained from several subsampled datasets \(I_1,...,I_M\).
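
To fix ideas, the subsampled M-estimator 2 can be computed with a generic solver. The sketch below is our own illustration (not the authors' code) and uses the smooth robust loss \(\rho(u)=\sqrt{1+u^2}\) mentioned above; the function name `fit_m_estimator` and all parameter choices are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_m_estimator(X, y, I):
    """Solve (2): argmin_b sum_{i in I} rho(y_i - x_i^T b) with rho(u) = sqrt(1 + u^2)."""
    XI, yI = X[I], y[I]

    def objective(b):
        r = yI - XI @ b                                   # residuals on the subsample I
        return np.sum(np.sqrt(1.0 + r ** 2))

    def gradient(b):
        r = yI - XI @ b
        return -XI.T @ (r / np.sqrt(1.0 + r ** 2))        # -sum_i rho'(r_i) x_i

    b0 = np.zeros(X.shape[1])
    return minimize(objective, b0, jac=gradient, method="L-BFGS-B").x

# toy data in the proportional regime n/p = delta = 5, with heavy-tailed noise
rng = np.random.default_rng(0)
n, p = 1000, 200
beta_star = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta_star + rng.standard_t(df=2, size=n)
I = rng.choice(n, size=n // 2, replace=False)             # a subsample of size qn with q = 1/2
b_hat = fit_m_estimator(X, y, I)
print("squared error:", np.sum((b_hat - beta_star) ** 2))
```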

1.2 Bagging estimators trained on subsampled datasets without replacement↩︎

Let \(M>0\) be an integer, held fixed as \(n,p\to+\infty\). The practitioner then samples \(M\) subsets of \([n]\) according to the uniform distribution on all subsets of \([n]\) of size \(qn\) for some \(q\in (0,1]\), that is, \[I_1,...,I_M \sim^{\text{iid}} \text{Unif}\{I\subset [n] : |I|=qn \}. \label{I95195I95M}\tag{6}\] Each \(I_m\) is thus a subset of \([n]\) of size \(qn\) sampled without replacement, and the index sets \(I_1,...,I_M\) are mutually independent. While the index sets are independent, the corresponding subsampled datasets \[(\boldsymbol{x}_i, y_i)_{i \in I_m}, \quad \text{ and } \quad (\boldsymbol{x}_i, y_i)_{i \in I_{m'}} \label{two95datasets}\tag{7}\] are not independent as soon as there is some overlap in the sense that \(I_m\cap I_{m'} \ne \emptyset\).

Remark 1. If \(I_m\) and \(I_{m'}\) are independent according to 6 then \(|I_m \cap I_{m'}|\) follows a hypergeometric distribution with mean \(q^2 n\), and by Chebyshev’s inequality using the explicit formula for the variance of hypergeometric distributions, \(|I_m \cap I_{m'}|/n \to^P q^2\) as \(n\to+\infty\) while \(q\) is held fixed. Thus, not only is the intersection non-empty with high-probability, but it is of order \(n\).

The goal of the paper is to understand the performance of bagging the corresponding subsampled estimates: with the notation \(\hat{\boldsymbol{b}}(I)\) in 2 and \(I_1,...,I_M\) in 6 , the practitioner constructs the bagged estimate \[\label{bagging95intro} \bar\boldsymbol{b}= \frac{1}{M} \sum_{m=1}^M \hat{\boldsymbol{b}}(I_m).\tag{8}\]
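
As a quick illustration (our own sketch, not the authors' code), the resampling scheme 6 and the overlap property of Remark 1 above can be reproduced in a few lines; the bagged estimate 8 is then the average of the subsampled estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, M = 10_000, 0.3, 5
k = int(q * n)

# M iid subsets, each uniform over the subsets of [n] of size qn (sampling without replacement)
subsets = [rng.choice(n, size=k, replace=False) for _ in range(M)]

# Remark 1: the relative overlap |I_m ∩ I_m'|/n concentrates around q^2
overlap = len(set(subsets[0]) & set(subsets[1])) / n
print("empirical overlap:", overlap, " q^2:", q ** 2)

# The bagged estimate (8) is the average of the M estimators trained on each subset, e.g.
#   b_bar = np.mean([fit_m_estimator(X, y, I) for I in subsets], axis=0)
# with fit_m_estimator as in the earlier sketch (a hypothetical helper).
```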

Related works↩︎

Bagging as a generally applicable principle was introduced in [5], [6]. Early analyses of bagging in low-dimensional regimes were performed in [7], among others. In the proportional regime 1 , [8] demonstrated the role of bagging as an implicit regularization technique when the base learners \(\hat{\boldsymbol{b}}(I_m)\) are least-squares estimates. Bagging Ridge estimators was studied in [9], [10], which characterized the limit of the squared error of 8 using random matrix theory. The implicit regularization power of bagging in the proportional regime is again seen in [9], [10], where it is shown that the optimal risk among Ridge estimates can also be achieved by bagging Ridgeless estimates and optimally choosing the subsample size. Estimating the risk of a bagged estimate such as 8 for regularized least-squares estimates is done in [9][11]. The risk of bagging random-features estimators, trained on the full dataset but with each base learner having independent weights within the random feature activations, is characterized in [12]. Most recently, [13] studied the limiting equations of several resampling schemes including bootstrap and resampling without replacement, and characterized self-consistent equations for the limiting risk of estimators obtained by minimization of the negative log-likelihood and an additive Ridge penalty.

Organization↩︎

We will first study and state our main results for robust regression in 2. 3 extends the results to logistic regression. Numerical simulations are provided in 2.5 in robust regression and in 3.3 in logistic regression. The main results are proved in 4 simultaneously for robust linear regression and logistic regression. 5 contains several auxiliary lemmas used in the proof in 4.

Notation↩︎

For vectors, \(\|\cdot\|\) or \(\|\cdot\|_2\) denotes the Euclidean norm, while \(\|\cdot\|_{op}\) and \(\|\cdot\|_F\) denote the operator norm and Frobenius norm of matrices. The arrow \(\to^P\) denotes convergence in probability and \(o_P(1)\) denotes any sequence of random variables converging to 0 in probability. The stochastically bounded notation \(O_P(r_n)\) for \(r_n>0\) denotes a sequence of random variables such that for any \(\epsilon>0\), there exists \(K>0\) with \(\mathop{\mathrm{\mathbb{P}}}(O_P(r_n)>K r_n)\le \epsilon\).

2 Robust regression↩︎

This section focuses on robust regression in the linear model 3 , where the noise variables \(\varepsilon_i\) are possibly heavy-tailed. Throughout the paper, our working assumption for the robust linear regression setting is the following.

Assumption 1. Let \(q\in(0,1),\delta>0\) be constants such that \(q\delta > 1\) and \(n/p = \delta\) as \(n,p\to+\infty\). Let \(\boldsymbol{\beta}^*\in\mathbb{R}^p\). Assume that \((\boldsymbol{x}_i,y_i)_{i\in[n]}\) are iid with \(y_i = \boldsymbol{x}_i^T\boldsymbol{\beta}^* + \varepsilon_i\) and \(\varepsilon_i\) independent of \(\boldsymbol{x}_i\sim N(\boldsymbol{0}_p,\boldsymbol{I}_p)\) satisfying \(\mathop{\mathrm{\mathbb{P}}}(\varepsilon_i\ne 0)> 0\). Assume that the loss is \(\ell_i(u)= \rho(y_i - u)\) for a twice-continuously differentiable function \(\rho\) with \(\mathop{\mathrm{arg\,min}}_{x\in\mathbb{R}}\rho(x)=\{0\}\) as well as \(|\rho'(t)|\le 1\) and \(0<\rho''(t)\le 1\) for all \(t\in\mathbb{R}\).

2.1 A review of existing results in robust linear regression↩︎

The seminal works [1][3], [14] characterized the performance of robust M-estimation in the proportional regime 1 . For a convex loss \(\rho:\mathbb{R}\to\mathbb{R}\) and \(\ell_i\) as in 1, these works characterized the limiting squared risk \(\|\hat{\boldsymbol{b}}(\{1,...,n\})-\boldsymbol{\beta}^*\|^2\) of an estimator \(\hat{\boldsymbol{b}}(\{1,...,n\})\), trained on the full dataset, i.e., taking \(I=\{1,...,n\}\) in 2 . In particular, [1][4], [14] show that under the design of \((\boldsymbol{x}_i,y_i)\) given in 1, the squared risk of \(\hat{\boldsymbol{b}}(\{1,...,n\})\) converges in probability to a constant, and this constant is found by solving a system of two nonlinear equations with two unknowns. If a subset \(I\subset [n]\) of size \(|I|=qn\) is used to train 2 , simply changing \(\delta=n/p\) to \(\delta q=|I|/p\), these results imply the convergence in probability \(\|\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*\|^2\to^P \sigma^2\) where \((\sigma,\gamma)\) is the solution to the system \[\begin{align} \tag{9} \tfrac{\sigma^2}{\delta q}&= \mathop{\mathrm{\mathbb{E}}}[ (\sigma G_i - \mathop{\mathrm{prox}}[\gamma\ell_i](\sigma G_i))^2 ] \\ 1 - \tfrac{1}{\delta q} &= \sigma^{-1} \mathop{\mathrm{\mathbb{E}}}[G_i \mathop{\mathrm{prox}}[\gamma\ell_i](\sigma G_i) ] \tag{10} \end{align}\] where \(G_i\sim N(0,1)\) is independent of \(\varepsilon_i\). Above, \(\mathop{\mathrm{prox}}[f](x_0)=\mathop{\mathrm{arg\,min}}_{x\in\mathbb{R}}(x_0-x)^2/2 + f(x)\) denotes the proximal operator of a convex function \(f\) for any \(x_0\in \mathbb{R}\). The system 9 10 was predicted in [1] using a heuristic leave-one-out argument. Early rigorous results [2], [3], [14] assumed either \(\rho\) strongly convex [2] or added an additive strongly-convex Ridge penalty to the M-estimation problem [3], [14]; [4] generalized such results without strong convexity.
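
Once the noise distribution is specified, the system 9 10 can be solved numerically. Below is a sketch (our own code, with illustrative helper names) that approximates the two expectations by Monte Carlo for the Huber loss used in the simulations of 2.5, exploiting the closed-form Huber proximal map and the identity \(\mathop{\mathrm{prox}}[\gamma\rho(\varepsilon_i-\cdot)](x_0)=\varepsilon_i-\mathop{\mathrm{prox}}[\gamma\rho](\varepsilon_i-x_0)\); a fixed Monte Carlo sample is reused so that the equations are deterministic in \((\sigma,\gamma)\), and a generic root finder is then applied.

```python
import numpy as np
from scipy.optimize import fsolve

def prox_huber(w, gamma):
    """prox of gamma*rho at w, with rho the Huber loss of threshold 1."""
    return np.where(np.abs(w) <= 1.0 + gamma, w / (1.0 + gamma), w - gamma * np.sign(w))

def prox_loss(x0, eps, gamma):
    """prox of gamma*ell at x0 with ell(u) = rho(eps - u)."""
    return eps - prox_huber(eps - x0, gamma)

def solve_system(delta, q, noise_scale=3.0, n_mc=200_000, seed=0):
    rng = np.random.default_rng(seed)
    G = rng.normal(size=n_mc)
    eps = noise_scale * rng.standard_t(df=2, size=n_mc)    # heavy-tailed noise

    def residuals(x):
        sigma, gamma = np.abs(x)                           # keep the iterates positive
        P = prox_loss(sigma * G, eps, gamma)
        r1 = np.mean((sigma * G - P) ** 2) - sigma ** 2 / (delta * q)   # equation (9)
        r2 = np.mean(G * P) / sigma - (1.0 - 1.0 / (delta * q))         # equation (10)
        return [r1, r2]

    sigma, gamma = np.abs(fsolve(residuals, x0=[1.0, 1.0]))
    return sigma, gamma

sigma, gamma = solve_system(delta=5.0, q=0.7)
print(sigma, gamma)   # sigma^2 is the limit of the squared error of b_hat(I)
```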

We now subsample without replacement, obtaining iid subsets \(I_1,...,I_M\) as in 6 . For each \(m=1,...,M\) the theory above applies individually to \(\hat{\boldsymbol{b}}(I_m)\). In particular \(\|\hat{\boldsymbol{b}}(I_m)-\boldsymbol{\beta}^*\|^2\to^P \sigma^2\). By expanding the square, the squared L2 error of the average \(\bar\boldsymbol{b}\) in 8 is given by \[\|\bar\boldsymbol{b}-\boldsymbol{\beta}^*\|^2 = \frac{1}{M^2} \sum_{m=1}^M \|\hat{\boldsymbol{b}}(I_m)-\boldsymbol{\beta}^*\|^2 + \frac{1}{M^2} \sum_{m=1}^M \sum_{m'=1: m'\ne m}^M (\hat{\boldsymbol{b}}(I_m)-\boldsymbol{\beta}^*)^T (\hat{\boldsymbol{b}}(I_{m'})-\boldsymbol{\beta}^*). \label{risk95bagged}\tag{11}\] Since previous works established that \(\|\hat{\boldsymbol{b}}(I_m)-\boldsymbol{\beta}^*\|^2\to^P \sigma^2\), the first term above converges in probability to \(\sigma^2/M\). The crux of the problem is thus to characterize the limit in probability, if any, of each inner product \((\hat{\boldsymbol{b}}(I_m)-\boldsymbol{\beta}^*)^T (\hat{\boldsymbol{b}}(I_{m'})-\boldsymbol{\beta}^*)\) appearing in the double sum.

2.2 A glance at our results↩︎

Since \(\rho\) in 4 is Lipschitz and differentiable, the system 9 10 admits a unique solution [15]. Let \((\sigma,\gamma )\) be the solution to this system.

The key to understanding the performance of the aforementioned bagging procedure 8 and, for instance, characterizing the limits of \(\| \bar\boldsymbol{b}- \boldsymbol{\beta}^*\|^2\), is the following equation with unknown \(\eta\in[-1,1]\): \[\begin{align} \eta&=\frac{{q^2}\delta}{\sigma^2} \mathop{\mathrm{\mathbb{E}}}\Bigl[ \Bigl(\sigma G_i - \mathop{\mathrm{prox}}[\gamma \ell_i](\sigma G_i)\Bigr) \Bigl(\sigma G_i - \mathop{\mathrm{prox}}[\gamma \ell_i](\sigma \tilde{G}_i)\Bigr) \Bigr] \tag{12} \\&\qquad \text{ where } \begin{pmatrix} G_i \\ \tilde{G}_i \end{pmatrix} \sim N\Bigl(0_2, \begin{pmatrix} 1 & \eta \\ \eta & 1 \end{pmatrix} \Bigr) \tag{13} \end{align}\] with \((G_i,\tilde{G}_i)\) independent of \((\ell_i,\varepsilon_i)\), and \((\ell_i,\varepsilon_i)\) are the same as in 3 4 with the noise \(\varepsilon_i\) independent of \(\boldsymbol{x}_i\). Using 9 , the above equation can be equivalently rewritten as \[\eta = F(\eta) \quad \text{where } F(\eta) \mathrel{\vcenter{:}}= q\frac{ \mathop{\mathrm{\mathbb{E}}}\bigl[ \bigl(\sigma G_i - \mathop{\mathrm{prox}}[\gamma \ell_i](\sigma G_i)\bigr) \bigl(\sigma G_i - \mathop{\mathrm{prox}}[\gamma \ell_i](\sigma \tilde{G}_i)\bigr) \bigr] }{ \mathop{\mathrm{\mathbb{E}}}[(\sigma G_i- \mathop{\mathrm{prox}}[\gamma\ell_i](\sigma G_i))^2 ] } \label{def-F}\tag{14}\] since \(\mathop{\mathrm{\mathbb{E}}}[(\sigma G_i- \mathop{\mathrm{prox}}[\gamma\ell_i](\sigma G_i))^2 ] = \sigma^2/(\delta q)\) in the denominator by 9 . This shows that any solution \(\eta\) must satisfy \(|\eta|\le q\) by the Cauchy-Schwarz inequality.
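
The map \(F\) in 14 can be evaluated by the same Monte Carlo device, writing \(\tilde{G}_i=\eta G_i+\sqrt{1-\eta^2}Z_i\) for \(Z_i\) independent of \(G_i\); iterating \(\eta\mapsto F(\eta)\) then converges to the unique solution since \(F\) is a \(q\)-contraction (2 below). The sketch below is our own code (not the authors'), redefines the Huber prox helpers of the previous sketch so that the block is self-contained, and assumes \((\sigma,\gamma)\) and the noise distribution match those used when solving 9 10 .

```python
import numpy as np

def prox_huber(w, gamma):
    return np.where(np.abs(w) <= 1.0 + gamma, w / (1.0 + gamma), w - gamma * np.sign(w))

def prox_loss(x0, eps, gamma):
    return eps - prox_huber(eps - x0, gamma)

def solve_eta(sigma, gamma, q, noise_scale=3.0, n_mc=200_000, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    G = rng.normal(size=n_mc)
    Z = rng.normal(size=n_mc)                              # builds G~ = eta*G + sqrt(1-eta^2)*Z
    eps = noise_scale * rng.standard_t(df=2, size=n_mc)    # must match the noise used for (9)-(10)
    resid = lambda x0: x0 - prox_loss(x0, eps, gamma)      # x0 - prox[gamma ell](x0)

    eta = 0.0
    for _ in range(n_iter):
        G_tilde = eta * G + np.sqrt(1.0 - eta ** 2) * Z
        num = np.mean(resid(sigma * G) * resid(sigma * G_tilde))
        den = np.mean(resid(sigma * G) ** 2)
        eta = q * num / den                                # eta <- F(eta), see (14)
    return eta

# usage (illustrative): eta = solve_eta(sigma, gamma, q=0.7) with (sigma, gamma) from (9)-(10)
```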

We will show in the next section that this equation in \(\eta\) has a unique solution. Our main results imply a close relationship between the solution \(\eta\) of 12 and the bagged estimates; in particular the inner product between two subsampled estimates satisfies \[\bigl(\hat{\boldsymbol{b}}(I_m)-\boldsymbol{\beta}^*\bigr)^T \bigl(\hat{\boldsymbol{b}}(I_{m'}) - \boldsymbol{\beta}^*\bigr) \to^P \eta \sigma^2.\] For two distinct and fixed \(m < m'\), the solution \(\eta\) further characterizes the joint distribution of the two predicted values \(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_m)\) and \(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_{m'})\) with \(i\in I_m\cap I_{m'}\), by showing the existence of \((G_i,\tilde{G}_i)\) as in 13 , independent of \((\ell_i,\varepsilon_i)\) and such that \[\begin{align} \boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_m) &=\mathop{\mathrm{prox}}[\gamma\ell_i](\sigma G_i) + o_P(1) , \\ \boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_{m'}) &= \mathop{\mathrm{prox}}[\gamma\ell_i](\sigma\tilde{G}_i) + o_P(1). \end{align}\]

2.3 Existence and uniqueness of solutions to the fixed-point equation↩︎

Proposition 2. The function \(F\) in 14 is non-decreasing and \(q\)-Lipschitz with \(0\le F(0)\le q\le 1\). The equation \(\eta = F(\eta)\) has a unique solution \(\eta\in[0,q]\).

Proof. We may realize \(\tilde{G}_i\) as \(\tilde{G}_i = \eta G_i + \sqrt{1-\eta^2}Z_i\) where \(Z_i,G_i\) are iid \(N(0,1)\) independent of \(\ell_i\). For any Lipschitz continuous function \(f\) with \(\mathop{\mathrm{\mathbb{E}}}[f(G_i)^2]<+\infty\), the map \(\varphi: \eta \in [-1,1] \mapsto \mathop{\mathrm{\mathbb{E}}}[f(G_i)f(\tilde{G}_i)]=\mathop{\mathrm{\mathbb{E}}}[f(G_i)f(\eta G_i +\sqrt{1-\eta^2}Z_i)]\in\mathbb{R}\) has derivative \[\varphi'(\eta) = \mathop{\mathrm{\mathbb{E}}}[f'(G_i)f'(\tilde{G}_i)]. \label{varphi-d}\tag{15}\] See 1 for the proof. In our case, this implies that the function 14 has derivative \[F'(\eta) = q^2\delta \mathop{\mathrm{\mathbb{E}}}\Bigl[ \Bigl(1-\mathop{\mathrm{prox}}[\gamma\ell_i]'(\sigma G_i)\Bigr) \Bigl(1-\mathop{\mathrm{prox}}[\gamma\ell_i]'(\sigma \tilde{G}_i)\Bigr) \Bigr] . \label{F39}\tag{16}\] Since \(\mathop{\mathrm{prox}}[\gamma\ell_i]\) is nondecreasing and 1-Lipschitz for any convex function \(\ell_i:\mathbb{R}\to\mathbb{R}\), each factor inside the expectation belongs to \([0,1]\) and \(0\le F'(\eta)\) holds. By bounding from above the second factor, \[\begin{align} F'(\eta) \le q^2 \delta \mathop{\mathrm{\mathbb{E}}}\bigl[ 1-\mathop{\mathrm{prox}}[\gamma\ell_i]'(\sigma G_i) \bigr] = q^2 \delta (q\delta)^{-1} = q \end{align}\] thanks to 10 and Stein’s formula (or integration by parts) for the equality. This shows \(0\le F'(\eta) \le q< 1\) so that \(F\) is a contraction and admits a unique solution in \([-1,1]\).

We now show that the solution must be in \([0,q]\). The definition 14 gives \(F(1)=q\) as \(\mathop{\mathrm{\mathbb{P}}}(G_i=\tilde{G}_i)=1\) when \(\eta=1\). Now we verify \(F(0)\ge 0\). If \(\eta=0\) then \((G_i, \tilde{G}_i, \ell_i)\) are independent and \(G_i=^d\tilde{G}_i\) so by the tower property of conditional expectations, \[\begin{align} F(0) &= \frac{q^2\delta}{\sigma^2} \mathop{\mathrm{\mathbb{E}}}\Bigl[\mathop{\mathrm{\mathbb{E}}}\bigl[\bigl(\sigma G_i-\mathop{\mathrm{prox}}[\gamma \ell_i](\sigma G_i)\bigr) \mid \ell_i\bigr]^2\Bigr] \ge 0. \end{align}\] Since \(F\) is nondecreasing with \(0\le F(0)\) and \(F(1)= q< 1\), \(F\) maps \([0,q]\) into \([0,q]\), so the unique fixed-point must belong to \([0,q]\). ◻

2.4 Main results in robust regression↩︎

For any \(I\subset[n]\) with \(|I|=qn = q\delta p\), the M-estimator \(\hat{\boldsymbol{b}}(I)=\mathop{\mathrm{arg\,min}}_{\boldsymbol{b}\in\mathbb{R}^p}\sum_{i\in I} \ell_i(\boldsymbol{x}_i^T\boldsymbol{b})\) satisfies the convergence in probability \[\|\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*\|^2 \to^P \sigma^2, \qquad \frac{1}{|I|}\sum_{i\in I} \Bigl(\ell_i'(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I))\Bigr)^2 \to^P \frac{\sigma^2}{\gamma^2 q\delta}. \label{eq95convergence}\tag{17}\] The first convergence in probability was proved by many authors, e.g., [1][4]. The second can be obtained using the CGMT of [4], see for instance [16]. We will take the convergence in probability 17 for granted in our proof.

Theorem 3. Let 1 be fulfilled. Let \(I,\tilde{I}\) be independent and uniformly distributed over all subsets of \([n]\) of size \(qn\). Then \[\label{eq:thm95eta95robust} (\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*)^T(\hat{\boldsymbol{b}}(\tilde{I})-\boldsymbol{\beta}^*) \to^P \sigma^2 \eta, \qquad \frac{(\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*)^T(\hat{\boldsymbol{b}}(\tilde{I})-\boldsymbol{\beta}^*)}{\|(\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*)\|_2 \|(\hat{\boldsymbol{b}}(\tilde{I})-\boldsymbol{\beta}^*)\|_2} \to^P \eta\tag{18}\] where \(\eta\in[0,q]\) is the unique solution to 12 . Furthermore, \(\eta\) and \(\eta\sigma^2\) can be consistently estimated in the sense \[\label{eq:thm95estimation} \frac{\hat{\gamma}(I) \hat{\gamma}(\tilde{I})}{p} \sum_{i\in I \cap \tilde{I}}\ell_i'\Bigl(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I)\Bigr) \ell_i'\Bigl(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(\tilde{I})\Bigr) \to^P \eta\sigma^2, \quad \frac{\hat{\gamma}(I)^2}{p} \sum_{i\in I} \ell_i'\Bigl(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I)\Bigr)^2 \to^P \sigma^2\tag{19}\] where \(\hat{\gamma}(I) =p/ \bigl[ \sum_{i\in I} \ell_i''(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I)) - \ell_i''(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I))^2 \boldsymbol{x}_i^T(\sum_{l\in I}\boldsymbol{x}_l\ell_l''(\boldsymbol{x}_l^T\hat{\boldsymbol{b}}(I))\boldsymbol{x}_l^T)^{-1} \boldsymbol{x}_i \bigr]\). Finally, for any \(i\in I\cap \tilde{I}\), there exists \((G_i,\tilde{G}_i)\) jointly normal as in 13 with \(\mathop{\mathrm{\mathbb{E}}}[G_i\tilde{G}_i]=\eta\) such that \[\label{eq:thm95distribution95robust} \max_{i\in I\cap\tilde{I}} \mathop{\mathrm{\mathbb{E}}}\Bigl[ 1 \wedge \Big\| \begin{pmatrix} \boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I)\\ \boldsymbol{x}_i^T\hat{\boldsymbol{b}}(\tilde{I}) \end{pmatrix} - \begin{pmatrix} \mathop{\mathrm{prox}}[\gamma\ell_i](\sigma G_i)\\ \mathop{\mathrm{prox}}[\gamma\ell_i](\sigma \tilde{G}_i)\\ \end{pmatrix} \Big\|_2 \mid (I,\tilde{I}) \Bigr] \to^P 0.\tag{20}\]

3 is proved in 4. It provides three messages. First, 18 states that the inner product \((\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*)^T(\hat{\boldsymbol{b}}(\tilde{I})-\boldsymbol{\beta}^*)\) between two estimators trained on independent subsets \(I,\tilde{I}\), both of cardinality \(qn\), converges to \(\sigma^2\eta\) where \(\eta\) is the unique solution of 12 . A direct consequence is that the squared risk of the bagged estimate 11 satisfies \[\|\bar\boldsymbol{b}-\boldsymbol{\beta}^*\|^2 \to^P \sigma^2/M + (1-1/M)\sigma^2\eta. \label{risk95bagged95limit}\tag{21}\] Second, both terms in this risk decomposition of the bagged estimate \(\bar\boldsymbol{b}\) can be estimated using 19 averaged over all pairs \((I_m,I_{m'})_{m\ne m'}\), that is, \[\frac{1}{M^2} \sum_{m\ne m'} \frac{\hat{\gamma}(I_m) \hat{\gamma}(I_{m'})}{p} \sum_{i\in I_m \cap I_{m'}}\ell_i'\Bigl(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_m)\Bigr) \ell_i'\Bigl(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_{m'})\Bigr) \to^P \Bigl(1-\frac{1}{M}\Bigr)\eta\sigma^2,\] and \(\frac{1}{M^2}\sum_{m=1}^M\frac{\hat{\gamma}(I_m)^2}{p} \sum_{i\in I_m} \ell_i'(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_m))^2 \to^P \sigma^2/M\). These estimators let us estimate the risk of the bagged estimate 21 , for instance to choose an optimal subsample size \(q\in(0,1)\), or to choose a large enough constant \(M>0\) so that 21 is close to the large-\(M\) limit given by \(\sigma^2\eta\). These estimators are reminiscent of the Corrected Generalized Cross-Validation developed for regularized least-squares estimators in [11].
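
The estimators in 19 are straightforward to compute from the data. Below is a minimal sketch (our own helpers, not the authors' code), written for the robust loss \(\ell_i(u)=\rho(y_i-u)\) with the smooth choice \(\rho(u)=\sqrt{1+u^2}\) so that \(\rho''>0\) as in 1: `gamma_hat` implements the formula for \(\hat{\gamma}(I)\) stated in 3, and `estimate_eta_sigma2` returns the two left-hand sides of 19 given the two fitted estimators.

```python
import numpy as np

def rho_d1(u):  return u / np.sqrt(1.0 + u ** 2)           # rho'
def rho_d2(u):  return (1.0 + u ** 2) ** (-1.5)             # rho''

def gamma_hat(X, y, I, b_hat):
    """The estimate gamma_hat(I) of Theorem 3, for ell_i(u) = rho(y_i - u)."""
    XI = X[I]
    d2 = rho_d2(y[I] - XI @ b_hat)                           # ell_i''(x_i^T b_hat)
    H = XI.T @ (d2[:, None] * XI)                            # sum_l x_l ell_l'' x_l^T
    lev = np.einsum("ij,ji->i", XI, np.linalg.solve(H, XI.T))   # x_i^T H^{-1} x_i
    p = X.shape[1]
    return p / np.sum(d2 - d2 ** 2 * lev)

def estimate_eta_sigma2(X, y, I, I_tilde, b_I, b_It):
    """Left-hand sides of (19): estimates of eta*sigma^2 and sigma^2."""
    p = X.shape[1]
    gI, gIt = gamma_hat(X, y, I, b_I), gamma_hat(X, y, I_tilde, b_It)
    common = np.intersect1d(I, I_tilde)
    psi_I  = -rho_d1(y[common] - X[common] @ b_I)            # ell_i'(x_i^T b_hat(I))
    psi_It = -rho_d1(y[common] - X[common] @ b_It)
    eta_sigma2 = gI * gIt / p * np.sum(psi_I * psi_It)
    sigma2 = gI ** 2 / p * np.sum(rho_d1(y[I] - X[I] @ b_I) ** 2)
    return eta_sigma2, sigma2

# usage (illustrative): estimate_eta_sigma2(X, y, I, I_tilde, fit(X, y, I), fit(X, y, I_tilde))
# where fit() is any routine computing the M-estimator (2) on a subsample.
```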

As shown in 1, resampling and bagging are sometimes beneficial, but not always. Whether the curve \(q\mapsto \sigma^2\eta\) is U-shaped and minimized at some \(q^*<1\) (i.e., bagging is beneficial) depends on the interplay between the oversampling ratio \(\delta=n/p\), the distribution of the noise \(\varepsilon_i\) and the robust loss function \(\rho\) used in 2 . In 1, we observe that if \(\varepsilon_i/\tau\) has a t-distribution with 2 degrees of freedom and \(\delta=5\), subsampling is not beneficial for \(\tau=1\) but becomes beneficial for \(\tau\ge 1.5\). The generality of this phenomenon is unclear at this point.

The third message of 3 is the characterization of the limiting bivariate distribution of \((\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I),\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(\tilde{I}))\) for an observation \(i\in I \cap \tilde{I}\) used to train both \(\hat{\boldsymbol{b}}(I)\) and \(\hat{\boldsymbol{b}}(\tilde{I})\). The convergence 20 implies that \((\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I),\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(\tilde{I}))\) converges weakly to the distribution of \((\mathop{\mathrm{prox}}[\gamma\ell_i](\sigma G_i),\mathop{\mathrm{prox}}[\gamma\ell_i](\sigma \tilde{G}_i))\) where \((G_i,\tilde{G}_i)\) has distribution 13 , as in the fixed-point equation 12 satisfied by \(\eta\).

The setting of resampling without replacement in the proportional regime of the present paper is also studied in the recent paper [13]. There are some significant differences between our contributions and [13]. First, an additive Ridge penalty is imposed in [13] and multiple resampling schemes are studied, while our object of interest is the unregularized M-estimator 2 with a focus on resampling without replacement. The simple fixed-point equation 12 does not appear explicitly in [13], which instead focuses on self-consistent equations satisfied by bias and variance functionals [13] of the specific resampling scheme under study. Another distinctive contribution of the present paper is the proposed estimator 19 , which can be used to optimally tune the subsample size, and the proof that the equation 12 admits a unique solution. The use of an additive Ridge penalty brings strong convexity to the optimization problem and simplifies the analysis, as observed in [1]; in this case, this makes the analysis of [12], based on [17], readily applicable.

2.5 Numerical simulations in robust regression↩︎

Let us verify 3 with numerical simulations. Throughout this section, we focus on the Huber loss \[\rho(t)= \begin{cases} t^2/2 &\text{ if }|t|<1,\\ |t|-1/2& \text{ if }|t|\ge 1. \end{cases} \label{huber}\tag{22}\] The oversampling ratio \(\delta=n/p\) is fixed to \(5\). First, we plot \(\eta\) and \(\sigma^2\eta\) as functions of \(q\in [1/\delta, 1]\) for different noise scales: we change the noise distribution as \(\{\text{scale}\}\times \text{t-dist (df=2)}\), \(\text{scale}\in \{1, 1.5, 2, 5, 10\}\). The left figures in 1 show that the curve \(q\mapsto \eta\) is nonlinear. Note that the dashed line is the affine line \(q\mapsto (q-\delta^{-1})/(1-\delta^{-1})\). More interestingly, the larger the noise scale, the more pronounced the nonlinearity. In the right figures in 1, we observe that the curve \(q\mapsto\sigma^2\eta\) is U-shaped when the noise scale is sufficiently large. This simulation result suggests that as the scale of the noise distribution increases, sub-sampling eventually becomes beneficial in the sense that the limit of 21 as \(M\to+\infty\) is smaller than the squared error of a single estimate trained on the full dataset. Next, we compare in simulations the correlation \[\frac{(\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*)^T(\hat{\boldsymbol{b}}(\tilde{I})-\boldsymbol{\beta}^*)}{\|(\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*)\|_2 \|(\hat{\boldsymbol{b}}(\tilde{I})-\boldsymbol{\beta}^*)\|_2}\] and the inner product \((\hat{\boldsymbol{b}}(I)-\boldsymbol{\beta}^*)^T(\hat{\boldsymbol{b}}(\tilde{I})-\boldsymbol{\beta}^*)\) with their theoretical limits \((\eta, \eta\sigma^2)\) as in 18 , as well as the estimator in 19 . Here, the noise distribution is fixed to \(3\cdot \text{t-dist(df=2)}\) with \((n, p)=(5000, 1000)\) and \({100}\) repetitions. 2 shows that the correlation and the inner product are well approximated by the corresponding theoretical values and estimates.
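
For completeness, the simulation just described can be reproduced along the following lines. This is a self-contained sketch of our own (smaller \((n,p)\) than in the figures, illustrative parameter choices): two Huber M-estimators are trained on overlapping subsamples and their inner product and correlation are recorded, to be compared with \((\eta\sigma^2,\eta)\) obtained by solving 12 numerically, e.g., with the earlier fixed-point sketch.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r):       return np.where(np.abs(r) < 1, 0.5 * r ** 2, np.abs(r) - 0.5)
def huber_grad(r):  return np.clip(r, -1.0, 1.0)            # rho'(r)

def fit(X, y, I):
    """Huber M-estimator (2) on the subsample I."""
    XI, yI = X[I], y[I]
    obj  = lambda b: np.sum(huber(yI - XI @ b))
    grad = lambda b: -XI.T @ huber_grad(yI - XI @ b)
    return minimize(obj, np.zeros(X.shape[1]), jac=grad, method="L-BFGS-B").x

rng = np.random.default_rng(0)
n, p, q = 2000, 400, 0.7                                    # delta = n/p = 5
beta_star = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta_star + 3.0 * rng.standard_t(df=2, size=n)      # noise 3 x t-dist(df=2)

I  = rng.choice(n, size=int(q * n), replace=False)
It = rng.choice(n, size=int(q * n), replace=False)
dI, dIt = fit(X, y, I) - beta_star, fit(X, y, It) - beta_star
print("inner product :", dI @ dIt)                                              # ~ eta * sigma^2
print("correlation   :", dI @ dIt / (np.linalg.norm(dI) * np.linalg.norm(dIt)))  # ~ eta
```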

Figure 1: Plot of \({q} \mapsto \eta\) and \({q}\mapsto \sigma^2\eta\) obtained by solving 12 numerically. Different noise distributions are given by \((\text{scale})\times \text{t-dist (df=2)}\), for scale\(\in\{1, 1.5, 2, 5, 10\}\). The dashed line is the affine line \(q\mapsto (q-\delta^{-1})/(1-\delta^{-1})\). The bottom plots zoom in on a specific region of the top plots.

Figure 2: Comparison of simulation results, theoretical curves obtained by solving 12 numerically, and estimates constructed by 19 . Here, the noise distribution is fixed to \(3\times \text{t-dist(df=2)}\) and \((n, p)=(5000, 1000)\).

Remark 4. The second derivative of the Huber loss \(\rho''(x)\) is \(0\) when \(|x|\ge 1\), so the condition \(\rho''>0\) in 1 is not satisfied. However, the numerical simulation in 2.5 suggests that 18 19 still hold for the Huber loss and that the condition \(\rho''>0\) is an artifact of the proof. We expect that the condition \(\rho''>0\) can be relaxed, at least for a large class of differentiable loss functions including the Huber loss, by adding a vanishing Ridge penalty term to the optimization problem 2 as explained in [18].

3 Resampling without replacement in logistic regression↩︎

3.1 A review of existing results in logistic regression↩︎

Let \(\nu >0, q\in(0,1], \delta>1\) be fixed constants. If a single estimator \(\hat{\boldsymbol{b}}(I)\) is trained with 2 on a subset of observations \(I\subset [n]\) with \(|I|/n=q\) for some constant \(q\in(0,1]\) held fixed as \(n,p\to+\infty\), the behavior of \(\hat{\boldsymbol{b}}(I)\) is now well-understood when \((y_i,\boldsymbol{x}_i)_{i\in [n]}\) are iid with \(\boldsymbol{x}_i \sim N(\boldsymbol{0}_p,\boldsymbol{I}_p)\) normally distributed and the conditional distribution \(y_i \mid \boldsymbol{x}_i\) following a logistic model of the form \[\mathop{\mathrm{\mathbb{P}}}\Bigl(y_i= 1 \mid \boldsymbol{x}_i\Bigr) = \frac{1}{1+\exp(-\boldsymbol{x}_i^T\boldsymbol{\beta}^*)} = \frac{1}{1+\exp(- \nu \boldsymbol{x}_i^T\boldsymbol{w})}\] where \(\boldsymbol{\beta}^*\) is a ground truth with \(\|\boldsymbol{\beta}^*\|=\nu\), and \(\boldsymbol{w}= \boldsymbol{\beta}^*/\nu\) is the projection of \(\boldsymbol{\beta}^*\) on the unit sphere. In this logistic regression model, the limiting behavior of \(\hat{\boldsymbol{b}}(I)\) with the logistic loss 5 trained using \(|I|=(\delta q) p\) samples is characterized as follows: there exists a monotone continuous function \(h(\cdot)\) (with explicit expression given in [19]) such that:

\(\bullet\) If \(\delta q< h(\nu)\) then the logistic MLE 2 does not exist with high-probability.

\(\bullet\) If \(\delta q> h(\nu)\) then there exists a unique [20] solution \((a,\sigma,\gamma)\) to the following low-dimensional system of equations \[\begin{align} \tag{23} \tfrac{\sigma^2}{\delta q}&= \mathop{\mathrm{\mathbb{E}}}[ (a U_i + \sigma G_i - \mathop{\mathrm{prox}}[\gamma\ell_i](a U_i + \sigma G_i))^2 ], \\ 0 &= \mathop{\mathrm{\mathbb{E}}}[ (a U_i + \sigma G_i - \mathop{\mathrm{prox}}[\gamma\ell_i](a U_i + \sigma G_i)) ], \tag{24} \\ 1 - \tfrac{1}{\delta q} &= \sigma^{-1} \mathop{\mathrm{\mathbb{E}}}[G_i \mathop{\mathrm{prox}}[\gamma\ell_i](a U_i + \sigma G_i) ] \tag{25} \end{align}\] where \(U_i = \boldsymbol{x}_i^T\boldsymbol{w}\) and \(G_i\sim N(0,1)\) is independent of \((\ell_i,U_i)\). Above, \(\mathop{\mathrm{prox}}[f](x_0)=\mathop{\mathrm{arg\,min}}_{x\in\mathbb{R}}(x_0-x)^2/2 + f(x)\) denotes the proximal operator of any convex function \(f\) for any \(x_0\in \mathbb{R}\). In this region \(\{\delta q> h(\nu)\}\) where the above system admits a unique solution \((a,\sigma,\gamma)\), the logistic MLE 2 exists with high-probability and the following convergence in probability holds, \[\begin{align} \boldsymbol{w}^T \hat{\boldsymbol{b}}(I) &\to^P a , \tag{26} \\ \qquad \|(\boldsymbol{I}_p-\boldsymbol{w}\boldsymbol{w}^T)\hat{\boldsymbol{b}}(I)\|^2 &\to^P \sigma^2, \tag{27} \\ \frac{1}{|I|}\sum_{i\in I}\ell_i'\Bigl(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I)\Bigr)^2 &\to^P \frac{\sigma^2}{ \gamma^2q\delta}. \tag{28} \end{align}\] by [20], [21] for the first two lines and [16] for the third. Further results are obtained in [19], [20], [22], including asymptotic normality results for individual components \(\hat{b}_j\) of 2 . Note that the system 23 25 with three unknowns is stated in these existing works after integrating out the distribution of \(y_i\). We choose the equivalent formulation 23 25 without integrating the conditional distribution of \(y_i\), as the form 23 25 is closer to 9 10 from robust regression and closer to the quantities naturally appearing in our proofs. In 4, this common notation is useful to prove the main results simultaneously for robust linear regression and logistic regression. The expectations in 23 25 involve the proximal operator of the logistic loss 5 , which has no closed form; a short numerical sketch for evaluating it is given below.
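
A minimal sketch of such a computation (our own helper, not from the paper): the first-order condition of the proximal objective is \(x-x_0+\gamma(s(x)-y)=0\) where \(s(u)=1/(1+e^{-u})\) is the logistic function, its root always lies in \([x_0-\gamma,x_0+\gamma]\), and bisection then gives a vectorized, guaranteed evaluation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prox_logistic(x0, y, gamma, n_bisect=60):
    """prox[gamma*ell](x0) for ell(u) = log(1+e^u) - u*y, by bisection on the
    first-order condition x - x0 + gamma*(sigmoid(x) - y) = 0 (vectorized in x0, y)."""
    x0 = np.asarray(x0, dtype=float)
    lo, hi = x0 - gamma, x0 + gamma            # bracket always containing the root
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        f = mid - x0 + gamma * (sigmoid(mid) - y)
        lo = np.where(f < 0.0, mid, lo)
        hi = np.where(f < 0.0, hi, mid)
    return 0.5 * (lo + hi)

# e.g. the right-hand side of (23) can be approximated by a Monte Carlo average of
# (a*U + sigma*G - prox_logistic(a*U + sigma*G, y, gamma))**2 over draws of (U, G, y).
```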

While the limit in probability of the correlation \(\bar\boldsymbol{b}^T\boldsymbol{\beta}^*\) can be deduced directly from 26 , the case of the Mean Squared Error (MSE) \(\|\bar\boldsymbol{b}- \boldsymbol{\beta}^*\|^2\) or of the normalized correlation \(\|\bar\boldsymbol{b}\|^{-1}\bar\boldsymbol{b}^T\boldsymbol{\beta}^*\) is more subtle. To see the crux of the problem, recall \(\boldsymbol{w}= \boldsymbol{\beta}^*/\|\boldsymbol{\beta}^*\|\), define \(\boldsymbol{P}= (\boldsymbol{I}_p-\boldsymbol{w}\boldsymbol{w}^T)\) for brevity, and consider the decomposition \[\label{MSE} \|\bar\boldsymbol{b}- \boldsymbol{\beta}^*\|^2 = \bigl(\boldsymbol{w}^T(\bar\boldsymbol{b}- \boldsymbol{\beta}^*)\bigr)^2 + \|\boldsymbol{P}\bar \boldsymbol{b}\|^2\tag{29}\] where the second term equals \(\frac{1}{M^2}\sum_{m,m'=1}^M \hat{\boldsymbol{b}}(I_m)^T\boldsymbol{P}\hat{\boldsymbol{b}}(I_{m'})\), so that 29 can be rewritten as \[\label{intro95inner95product95P} \bigl(\boldsymbol{w}^T(\bar\boldsymbol{b}- \boldsymbol{\beta}^*)\bigr)^2 + \frac{1}{M^2}\sum_{m,m'=1}^M \hat{\boldsymbol{b}}(I_m)^T\boldsymbol{P}\hat{\boldsymbol{b}}(I_{m'}).\tag{30}\] In order to characterize the limit of the MSE of \(\bar\boldsymbol{b}\), or to characterize the limit of the normalized correlation \(\|\bar\boldsymbol{b}\|^{-1} \bar\boldsymbol{b}^T\boldsymbol{\beta}^*\), we need to first understand the limit of the inner product \[\hat{\boldsymbol{b}}(I_m)^T\boldsymbol{P}\hat{\boldsymbol{b}}(I_{m'}) \label{corel95logi}\tag{31}\] where \(\hat{\boldsymbol{b}}(I_m)\) and \(\hat{\boldsymbol{b}}(I_{m'})\) are trained on two subsamples \(I_m\) and \(I_{m'}\) with non-empty intersection. This problem happens to be almost equivalent to the corresponding one in robust regression, and we will prove the following result and 3 simultaneously.

3.2 Main results for logistic regression↩︎

Assumption 2. Let \(q\in(0,1),\nu>0,\delta>0\) be constants such that \(q\delta>h(\nu)\), with \(n/p = \delta\) held fixed as \(n,p\to+\infty\) and \(\boldsymbol{\beta}^*\in\mathbb{R}^p\) satisfying \(\|\boldsymbol{\beta}^*\|=\nu\). Assume that \((\boldsymbol{x}_i,y_i)_{i\in[n]}\) are iid with \(\boldsymbol{x}_i\sim N(\boldsymbol{0}_p,\boldsymbol{I}_p)\) and \(y_i\in \{0,1\}\) following the logistic model \(\mathop{\mathrm{\mathbb{P}}}(y_i=1\mid \boldsymbol{x}_i)=1/(1+\exp(-\boldsymbol{x}_i^T\boldsymbol{\beta}^*))\). Assume that the loss \(\ell_i\) is the usual binary logistic loss given by 5 .

In other words, we assume a logistic model with parameters on the side of the phase transition where the MLE exists with high-probability. In this regime, the system 23 25 admits a unique solution \((a,\sigma,\gamma)\) and the convergence in probability 26 28 holds.

Proposition 5. Under 2, the equation \[\begin{align} \eta=\frac{{q^2}\delta\gamma^2}{\sigma^2} \mathop{\mathrm{\mathbb{E}}}\Bigl[ \ell_i'\Bigl(\mathop{\mathrm{prox}}[\gamma \ell_i](a U_i + \sigma G_i)\Bigr) \ell_i'\Bigl(\mathop{\mathrm{prox}}[\gamma \ell_i](a U_i + \sigma \tilde{G}_i)\Bigr) \Bigr] \\ \text{ where } \begin{pmatrix} G_i \\ \tilde{G}_i \end{pmatrix} \sim N\Bigl(0_2, \begin{pmatrix} 1 & \eta \\ \eta & 1 \end{pmatrix} \Bigr) \end{align} \label{eta95equation95logi}\qquad{(1)}\] with unknown \(\eta\) admits a unique solution \(\eta\in[0,q]\). Above, \(U_i=\boldsymbol{x}_i^T\boldsymbol{\beta}^*/\|\boldsymbol{\beta}^*\|\) and \((G_i,\tilde{G}_i)\) are independent of \((y_i,\ell_i,U_i)\).

We omit the proof since it is exactly the same as the proof of 2. Similarly to robust regression in 3, the solution \(\eta\) to ?? characterizes the limit in probability of the correlation 31 , the estimator 19 is still valid for estimating \(\eta\sigma^2\), and finally we can characterize the joint distribution of the two predicted values \(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_m)\) and \(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I_{m'})\) for an observation \(i\in I_m\cap I_{m'}\) appearing in both datasets.
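
Given a solution \((a,\sigma,\gamma)\) of 23 25 and the signal strength \(\nu=\|\boldsymbol{\beta}^*\|\), the fixed-point equation of 5 can be solved by the same Monte Carlo iteration as in robust regression, the iteration converging since the map is again a contraction. The sketch below is our own code (solving 23 25 itself is not shown) and redefines the bisection-based proximal helper from the previous sketch for self-containment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prox_logistic(x0, y, gamma, n_bisect=60):
    x0 = np.asarray(x0, dtype=float)
    lo, hi = x0 - gamma, x0 + gamma
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        f = mid - x0 + gamma * (sigmoid(mid) - y)
        lo = np.where(f < 0.0, mid, lo)
        hi = np.where(f < 0.0, hi, mid)
    return 0.5 * (lo + hi)

def solve_eta_logistic(a, sigma, gamma, q, delta, nu, n_mc=100_000, n_iter=50, seed=0):
    """Fixed-point iteration for eta in Proposition 5, with the expectation replaced
    by a Monte Carlo average over (U_i, y_i, G_i, G~_i) drawn from the logistic model."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=n_mc)
    y = (rng.uniform(size=n_mc) < sigmoid(nu * U)).astype(float)   # y_i | U_i ~ logistic model
    G, Z = rng.normal(size=n_mc), rng.normal(size=n_mc)

    eta = 0.0
    for _ in range(n_iter):
        G_tilde = eta * G + np.sqrt(1.0 - eta ** 2) * Z
        lp  = sigmoid(prox_logistic(a * U + sigma * G,       y, gamma)) - y    # ell_i'(prox(...))
        lpt = sigmoid(prox_logistic(a * U + sigma * G_tilde, y, gamma)) - y
        eta = q ** 2 * delta * gamma ** 2 / sigma ** 2 * np.mean(lp * lpt)
    return eta
```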

Theorem 6. Let 2 be fulfilled and let \(\boldsymbol{P}=\boldsymbol{I}_p - \boldsymbol{\beta}^* \frac{1}{\|\boldsymbol{\beta}^*\|^2} \boldsymbol{\beta}^*{}^T\). Let \(I,\tilde{I}\) be independent and uniformly distributed over all subsets of \([n]\) of size \(qn\). Then \[\hat{\boldsymbol{b}}(I)^T\boldsymbol{P}\hat{\boldsymbol{b}}(\tilde{I}) \to^P \sigma^2 \eta, \qquad \frac{\hat{\boldsymbol{b}}(I)^T \boldsymbol{P}\hat{\boldsymbol{b}}(\tilde{I})}{\|\boldsymbol{P}\hat{\boldsymbol{b}}(I)\|_2 \|\boldsymbol{P}\hat{\boldsymbol{b}}(\tilde{I})\|_2} \to^P \eta \label{eq:thm95eta95logi}\tag{32}\] where \(\eta\in[0,q]\) is the unique solution to ?? . Furthermore, \(\eta\) and \(\eta\sigma^2\) can be consistently estimated in the sense that 19 holds. Finally, for any \(i\in I\cap \tilde{I}\), there exists \((G_i,\tilde{G}_i)\) as in 13 , independent of \((y_i,U_i)\) such that \[\label{eq:thm95eta95distribution95logi} \max_{i\in I\cap\tilde{I}} \mathop{\mathrm{\mathbb{E}}}\Bigl[ 1 \wedge \Big\| \begin{pmatrix} \boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I)\\ \boldsymbol{x}_i^T\hat{\boldsymbol{b}}(\tilde{I}) \end{pmatrix} - \begin{pmatrix} \mathop{\mathrm{prox}}[\gamma\ell_i](aU_i + \sigma G_i)\\ \mathop{\mathrm{prox}}[\gamma\ell_i](aU_i + \sigma \tilde{G}_i)\\ \end{pmatrix} \Big\|_2 \mid (I,\tilde{I}) \Bigr] \to^P 0.\tag{33}\]

3.3 Numerical simulations in logistic regression↩︎

Similarly to 2.5, we check the accuracy of 6 with numerical simulations. Here, \((n, p)\) is fixed to \((5000, 500)\) so that \(\delta=n/p=10\). For each signal strength \(\|\boldsymbol{\beta}^*\|\in\{1,2\}\), we compute the correlation \[\hat{\boldsymbol{b}}(I)^T \boldsymbol{P}\hat{\boldsymbol{b}}(\tilde{I})/\bigl(\|\boldsymbol{P}\hat{\boldsymbol{b}}(I)\|_2\|\boldsymbol{P}\hat{\boldsymbol{b}}(\tilde{I})\|_2\bigr)\] and the inner product \(\hat{\boldsymbol{b}}(I)^T \boldsymbol{P}\hat{\boldsymbol{b}}(\tilde{I})\) as we change the sub-sampling ratio \(q=|I|/n\in[0.4, 1]\), together with the estimate constructed by 19 . We perform \(100\) repetitions. The theoretical limits \((\eta, \sigma^2\eta)\) are obtained by solving ?? numerically. 3 shows that the theoretical curves (\(q\mapsto \eta\) and \(q\mapsto \sigma^2\eta\)) match the correlation and the inner product. The estimator 19 is accurate for medium to large subsample ratio \(q\), but appears slightly biased upwards for small values of \(q\). The source of this slight upward bias is unclear, although it is possibly due to the finite-sample nature of the simulations \((p=500)\).

In all simulations for logistic regression that we have performed, the curve \(q\mapsto \eta\) is affine, as in the left plot in 3. The reason for this is unclear to us at this point and appears to be specific to logistic regression; for instance, the curves \(q\mapsto\eta\) in 1 for robust regression are clearly non-affine.

Figure 3: Comparison of simulation results, theoretical curves obtained by solving ?? numerically, and estimates constructed by 19 , with \((n,p)\) fixed to \((2500, 250)\).

4 Proof of the main results↩︎

We prove here 3 and 6 simultaneously using the following notation:

  • In robust regression (3), set \(a=0\), let \((\sigma,\gamma)\) be the unique solution to 9 10 , let \(\boldsymbol{\beta}^*=0\) (without loss of generality thanks to translation invariance), and let \(\boldsymbol{P}= \boldsymbol{I}_p\). Furthermore, let \(U_i=0\).

  • In logistic regression (6), let \((a,\sigma,\gamma)\) be the unique solution to 23 25 , let \(\boldsymbol{P}=\boldsymbol{I}_p - \boldsymbol{w}\boldsymbol{w}^\top\) for \(\boldsymbol{w}=\boldsymbol{\beta}^*/\|\boldsymbol{\beta}^*\|\), and let \(U_i=\boldsymbol{x}_i^T\boldsymbol{w}\). Here, \(\boldsymbol{X}\boldsymbol{P}\) is independent of \((y_i,\ell_i,U_i)_{i\in[n]}\).

Thanks to \(\|\boldsymbol{X}/\sqrt n\|_{op}\to^P 1+\delta^{-1/2}\) and 17 or 27 26 , we have \(\|\boldsymbol{X}\hat{\boldsymbol{b}}(I)\|/\sqrt{|I|} \le K\) for \(K=2 q^{-1/2}(1+\delta^{-1/2})(a^2+\sigma^2)^{1/2}\) with probability approaching one. Thus \(\mathop{\mathrm{\mathbb{P}}}(\hat{\boldsymbol{b}}(I)=\boldsymbol{\hat{\beta}}(I))\to 1\) for \(\boldsymbol{\hat{\beta}}(I)\) in 45 , so we may argue with \(\boldsymbol{\hat{\beta}}= \boldsymbol{\hat{\beta}}(I)\). Similarly for \(\tilde{I}\) we have \(\mathop{\mathrm{\mathbb{P}}}(\hat{\boldsymbol{b}}(\tilde{I})=\boldsymbol{\hat{\beta}}(\tilde{I}))\to 1\) for \(\boldsymbol{\hat{\beta}}(\tilde{I})\) in 45 , and we may argue with \(\boldsymbol{\tilde{\beta}}= \boldsymbol{\hat{\beta}}(\tilde{I})\). Let also \(\boldsymbol{\psi},\boldsymbol{\tilde{\psi}}\) be defined in 3 (in particular, we have \(\psi_i=0\) if \(i\notin I\) and \(\psi_i = -\ell_i'(\boldsymbol{x}_i^T\hat{\boldsymbol{b}}(I))\) in the high-probability event \(\hat{\boldsymbol{b}}(I)=\boldsymbol{\hat{\beta}}\), and similarly for \(\boldsymbol{\tilde{\psi}},\hat{\boldsymbol{b}}(\tilde{I}),\boldsymbol{\tilde{\beta}}\).)

By 48 and 52 from the auxiliary lemmas, we have \[p \hat{\boldsymbol{\beta}}^\top\boldsymbol{P}\tilde{\boldsymbol{\beta}} = \gamma^2 \boldsymbol{\psi}^\top \tilde{\boldsymbol{\psi}} + O_P(\sqrt n)\] where \(\boldsymbol{\psi}^\top \tilde{\boldsymbol{\psi}}= \sum_{i\in I \cap \tilde{I}}\psi_i\tilde{\psi}_i\). With \(n/p=\delta\) and \(|I\cap \tilde{I}| = nq^2 + O_P(n^{1/2})\) thanks to the explicit formulae for the expectation and variance of the hypergeometric distribution, we have \[\tilde{\boldsymbol{\beta}}^\top \boldsymbol{P}\hat{\boldsymbol{\beta}} = \delta q^2\gamma^2 \boldsymbol{\psi}^T\boldsymbol{\tilde{\psi}}/|I\cap \tilde{I}| + O_P(n^{-1/2}). \label{eq:key}\tag{34}\] By the Cauchy–Schwarz inequality and the concentration of sampling without replacement (11) (or see 7 for details), the absolute value of \(\boldsymbol{\psi}^T\boldsymbol{\tilde{\psi}}/|I\cap \tilde{I}| = \frac{1}{|I\cap \tilde{I}|}\sum_{i\in I\cap\tilde{I}}\psi_i\tilde{\psi}_i\) is smaller than \[\begin{gather} \Bigl(\frac{1}{|I\cap \tilde{I}|}\sum_{i\in I\cap\tilde{I}}\tilde{\psi}_i^2 \Bigr)^{1/2} \Bigl(\frac{1}{|I\cap \tilde{I}|}\sum_{i\in I\cap\tilde{I}}\psi_i^2 \Bigr)^{1/2} \\\le \Bigl(\frac{1}{|\tilde{I}|}\sum_{i\in \tilde{I}}\tilde{\psi}_i^2 \Bigr)^{1/2} \Bigl(\frac{1}{|I|}\sum_{i\in I}\psi_i^2 \Bigr)^{1/2} +o_P(1) = \frac{\sigma^2}{q\delta\gamma^2} + o_P(1) \end{gather}\] thanks to 17 (in robust regression) or 28 (in logistic regression) for the last equality. Combined with 34 , we have proved \(|\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\hat{\boldsymbol{\beta}}| \le \delta q^2\gamma^2 \frac{\sigma^2}{q\delta\gamma^2} + o_P(1)= q\sigma^2 + o_P(1).\) Let \(\bar \mathop{\mathrm{\mathbb{E}}}\) be the conditional expectation given \(({I,\tilde{I}},\boldsymbol{X}\boldsymbol{\beta}^*,\boldsymbol{y})\) (in robust regression, \(\boldsymbol{\beta}^*=0\) so \(\bar\mathop{\mathrm{\mathbb{E}}}\) is the conditional expectation given \(\{{I,\tilde{I}}, (\varepsilon_i)_{i\in[n]}\}\)). Thanks to the Gaussian Poincaré inequality in 8, \[\bar{\eta}\mathrel{\vcenter{:}}= \sigma^{-2} \bar{\mathop{\mathrm{\mathbb{E}}}}[\tilde{\boldsymbol{\beta}}^\top \boldsymbol{P}\hat{\boldsymbol{\beta}}] \quad\text{ satisfies }\quad \begin{cases} \bar \eta = \tilde{\boldsymbol{\beta}}^\top \boldsymbol{P}\hat{\boldsymbol{\beta}}/\sigma^2 + O_P(n^{-1/2}), \\ |\bar \eta |\le q+ o_P(1). \end{cases} \label{concentration-eta-bar}\tag{35}\] Similarly, by 8 we have \(\bar \mathop{\mathrm{\mathbb{E}}}[ \boldsymbol{\psi}^T \boldsymbol{\tilde{\psi}}] /{|I\cap \tilde{I}|} = \boldsymbol{\psi}^T \boldsymbol{\tilde{\psi}}/{|I\cap \tilde{I}|} + O_P(n^{-1/2})\). Combining the previous displays gives \[\bar \eta = \frac{\delta q^2 \gamma^2}{\sigma^2} \frac{1}{|I\cap \tilde{I}|}\bar{\mathop{\mathrm{\mathbb{E}}}}\Bigl[ \sum_{i\in I\cap\tilde{I}}\psi_i\tilde{\psi}_i \Bigr] + o_P(1).
\label{ineq95eta95bar}\tag{36}\] For an overlapping observation \(i\in I\cap \tilde{I}\), using the derivative formula in 3 and the moment inequality in 7 conditionally on \((I,\tilde{I}, \boldsymbol{X}\boldsymbol{\beta}^*,\boldsymbol{y})\) and \((\boldsymbol{x}_l)_{l\ne i}\), applied to the standard normal \(\boldsymbol{P}\boldsymbol{x}_i + \boldsymbol{w}Z\) (for \(Z\sim N(0,1)\) independent of everything else) and \(\boldsymbol{W}=[\boldsymbol{P}\hat{\boldsymbol{\beta}}|\boldsymbol{P}\tilde{\boldsymbol{\beta}}]\in\mathbb{R}^{p\times 2}\), we find for the indicator function \(\mathbb{I}\{i\in I\cap \tilde{I}\}\) that \(\mathbb{I}\{i\in I\cap \tilde{I}\} \mathop{\mathrm{\mathbb{E}}}[\text{LHS}_i \mid I,\tilde{I}] \le C \mathop{\mathrm{\mathbb{E}}}[\sum_{j=1}^p \|\frac{\partial \hat{\boldsymbol{\beta}}}{\partial x_{ij}}\|^2 + \|\frac{\partial \tilde{\boldsymbol{\beta}}}{\partial x_{ij}}\|^2]\) where \[\text{LHS}_i \eqqcolon \Bigl\| \begin{pmatrix} \boldsymbol{x}_i^\top\boldsymbol{P}\hat{\boldsymbol{\beta}}- \mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}]\psi_i - (\hat{\boldsymbol{\beta}}^\top \boldsymbol{P}\boldsymbol{A}\boldsymbol{X}^\top \boldsymbol{D})\boldsymbol{e}_i \\ \boldsymbol{x}_i^\top\boldsymbol{P}\tilde{\boldsymbol{\beta}}- \mathop{\mathrm{tr}}[\boldsymbol{P}\tilde{\boldsymbol{A}}]\tilde{\psi}_i - (\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\tilde{\boldsymbol{A}}\boldsymbol{X}^\top \tilde{\boldsymbol{D}})\boldsymbol{e}_i \end{pmatrix} - (\boldsymbol{W}^\top\boldsymbol{W})^{1/2} \boldsymbol{g}_i \Bigr\|^2\] for all \(i\in I\cap \tilde{I}\), where \(\boldsymbol{g}_i\sim N(\boldsymbol{0}_2, \boldsymbol{I}_2)\). After summing over \(i\in I\cap\tilde{I}\) and using 50 , we get \(\sum_{i\in I\cap\tilde{I}}\mathop{\mathrm{\mathbb{E}}}[\text{LHS}_i \mid I,\tilde{I}] \le C\) and \[\frac{1}{n} \sum_{i\in I\cap\tilde{I}}\mathop{\mathrm{\mathbb{E}}}\Bigl[\text{LHS}_i \mid I,\tilde{I}\Bigr] \le \frac{C'}{n}, \qquad \frac{1}{n} \sum_{i\in I\cap\tilde{I}}\text{LHS}_i = O_{{p}}\Bigl(\frac{1}{n}\Bigr)\] for some deterministic constants \(C,C'\) independent of \(n,p\).

Using 27 in logistic regression or 17 in robust regression, we know \(\|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2\to^P \sigma^2\) and similarly \(\|\boldsymbol{P}\tilde{\boldsymbol{\beta}}\|^2\to^P\sigma^2\), as well as \(\mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}]\to^P\gamma\), and \(\mathop{\mathrm{tr}}[\boldsymbol{P}\tilde{\boldsymbol{A}}]\to^P\gamma\) by 52 . Using the Lipschitz inequality for the matrix square root \(\|\sqrt{\boldsymbol{M}}-\sqrt{\boldsymbol{N}}\|_{op} \le \|(\sqrt{\boldsymbol{M}}+\sqrt{\boldsymbol{N}})^{-1}\|_{op} \|\boldsymbol{M}- \boldsymbol{N}\|_{op}\) for positive definite matrices \(\boldsymbol{N},\boldsymbol{M}\) [23] [24], which follows from \(\boldsymbol{x}^T(\sqrt{\boldsymbol{M}}+\sqrt{\boldsymbol{N}})\boldsymbol{x}\lambda = \boldsymbol{x}^T(\boldsymbol{M}-\boldsymbol{N})\boldsymbol{x}\) for any unit eigenvector \(\boldsymbol{x}\) of \(\sqrt{\boldsymbol{M}}-\sqrt{\boldsymbol{N}}\) with eigenvalue \(\lambda\), we get here \[\label{39} \Big\| \begin{pmatrix} 1 & \bar\eta \\ \bar\eta & 1 \\ \end{pmatrix}^{-1} \Big\|_{op} = \frac{1}{1-|\bar\eta|} \le \frac{2}{1-q} \text{ and } \Big\| \sigma \begin{pmatrix} 1 & \bar\eta \\ \bar\eta & 1 \\ \end{pmatrix}^{1/2} - (\boldsymbol{W}^\top\boldsymbol{W})^{1/2} \Big\|_{op} = o_P(1)\tag{37}\] on the event \(|\bar\eta|\le (1+q)/2<1\), which has probability approaching one thanks to 35 . Using the moment bounds 47 to bound from above \(\sum_{i=1}^n((\hat{\boldsymbol{\beta}}^\top \boldsymbol{P}\boldsymbol{A}\boldsymbol{X}^\top \boldsymbol{D})\boldsymbol{e}_i)^2 = \|\boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{P}\boldsymbol{\hat{\beta}}\|^2\), we find \[\frac{1}{n} \sum_{i\in I\cap \tilde{I}} \| \begin{pmatrix} \boldsymbol{x}_i^\top\boldsymbol{P}\hat{\boldsymbol{\beta}}- \gamma\psi_i \\ \boldsymbol{x}_i^\top\boldsymbol{P}\tilde{\boldsymbol{\beta}}- \gamma\tilde{\psi}_i \end{pmatrix} - \sigma \begin{pmatrix} 1 & \bar\eta \\ \bar\eta & 1 \\ \end{pmatrix}^{1/2} \boldsymbol{g}_i \|^2 = o_P(1) + o_P(1)\frac{1}{n} \sum_{i=1}^n\bigl(\|\boldsymbol{g}_i\|^2+\psi_i^2+\tilde{\psi}_i^2\bigr),\] and thanks to \(\frac{1}{n} \sum_{i=1}^n(\|\boldsymbol{g}_i\|^2+\psi_i^2+\tilde{\psi}_i^2)= O_P(1)\), the previous display converges to 0 in probability. Since \(\boldsymbol{x}_i^T\boldsymbol{P}\boldsymbol{\hat{\beta}}= \boldsymbol{x}_i^T\boldsymbol{\hat{\beta}}- U_i \boldsymbol{w}^T\boldsymbol{\hat{\beta}}\) for \(U_i = \boldsymbol{x}_i^T \boldsymbol{w} =^d N(0,1)\) and given \(\hat{\boldsymbol{\beta}}^\top\boldsymbol{w}\to^P a\) by 26 , together with \(\frac{1}{n} \sum_{i=1}^n U_i^2 = O_P(1)\) since \(\sum_{i=1}^n U_i^2\sim \chi^2_n\), we find \[\frac{1}{n} \sum_{i\in I\cap \tilde{I}} \| \begin{pmatrix} \boldsymbol{x}_i^\top\hat{\boldsymbol{\beta}} - a U_i - \gamma\psi_i - \sigma G_i \\ \boldsymbol{x}_i^\top \tilde{\boldsymbol{\beta}} - aU_i - \gamma\tilde{\psi}_i - \sigma\tilde{G}_i \end{pmatrix} \|^2 = o_P(1) ~~ \text{ where} ~~ \begin{pmatrix} G_i \\ \tilde{G}_i \end{pmatrix} = \begin{pmatrix} 1 & \bar \eta \\ \bar\eta & 1 \\ \end{pmatrix}^{1/2} \boldsymbol{g}_i.\] With probability approaching one, the second term in 45 is 0 for the large enough \(K\) that we took at the beginning, and in this event the modified M-estimator \(\hat{\boldsymbol{\beta}}\) equals the original M-estimator \(\hat{\boldsymbol{b}}(I)\) so that \(\psi_i = - \ell_i'(\boldsymbol{x}_i^\top\hat{\boldsymbol{b}})\) (cf. 2), and similarly for \(\tilde{\boldsymbol{\psi}}\).
We have established \[\begin{align} \frac{1}{n}\sum_{i\in I\cap\tilde{I}} \| \underbrace{\boldsymbol{x}_i^\top\hat{\boldsymbol{b}}+ \gamma \ell_i'(\boldsymbol{x}_i^\top\hat{\boldsymbol{b}}) - a U_i - \sigma G_i }_{-\text{Rem}_i} \|^2 = o_P(1). \end{align}\] Define \(\text{Rem}_i\) by \(\boldsymbol{x}_i^\top\hat{\boldsymbol{b}}+ \gamma \ell_i'(\boldsymbol{x}_i^\top\hat{\boldsymbol{b}}) = a U_i + \sigma G_i + \text{Rem}_i\) so that \(\boldsymbol{x}_i^\top\hat{\boldsymbol{b}}= \mathop{\mathrm{prox}}[\gamma\ell_i](a U_i + \sigma G_i + \text{Rem}_i)\) by definition of the proximal operator. Now set \(\hat{p}_i = \mathop{\mathrm{prox}}[\gamma\ell_i](a U_i + \sigma G_i)\). Because \(\mathop{\mathrm{prox}}[\gamma\ell_i](\cdot)\) is 1-Lipschitz, \[\Bigl( \sum_{i\in I\cap \tilde{I}} \|\hat{p}_i - \boldsymbol{x}_i^\top \hat{\boldsymbol{b}}\|^2 \Bigr)^{1/2} \le \Bigl( \sum_{i\in I\cap \tilde{I}} \|\text{Rem}_i\|^2 \Bigr)^{1/2} = o_P({\sqrt n}).\] Similarly, a proximal approximation holds for \(\boldsymbol{x}_i^\top \tilde{\boldsymbol{\beta}}\) using \((U_i,\tilde{G}_i)\) instead. We have to be a little careful here because \(\bar \eta\) is independent of the \((G_i,\tilde{G}_i)\) but not of the \((U_i,y_i)\). Using that \(|\ell_i'|\le 1\), and that \(\bar{\mathop{\mathrm{\mathbb{E}}}}[|A-B|] = o_P(1)\) if \(A,B\) are bounded random variables such that \(|A-B|=o_P(1)\), 36 gives \[\begin{align} \bar \eta &= \frac{\delta q^2\gamma^2}{\sigma^2} \sum_{i\in I\cap \tilde{I}} \bar{\mathop{\mathrm{\mathbb{E}}}}\Bigl[ \frac{ \ell_i'(\mathop{\mathrm{prox}}[\gamma\ell_i](a U_i+\sigma G_i)) \ell_i'(\mathop{\mathrm{prox}}[\gamma\ell_i](a U_i+\sigma \tilde{G}_i) ) }{|I\cap \tilde{I}|} \Bigr] + o_P(1) \end{align}\] where inside the conditional expectation \(\bar\mathop{\mathrm{\mathbb{E}}}[\cdot]\), \((\bar\eta,U_i,\ell_i)\) are fixed and integration is performed with respect to the distribution of \((G_i,\tilde{G}_i)\), so that \(\bar\eta = \frac{\delta q^2\gamma^2}{\sigma^2} \varphi(\bar\eta) + o_P(1)\) where \[\begin{align} \varphi(t) &= \frac{1}{|I\cap\tilde{I}|} \sum_{i\in I\cap \tilde{I}} \bar\mathop{\mathrm{\mathbb{E}}}\Bigl[ \ell_i'\Bigl(\mathop{\mathrm{prox}}[\gamma\ell_i](a U_i+\sigma G_i^t)\Bigr) \ell_i'\Bigl(\mathop{\mathrm{prox}}[\gamma\ell_i](a U_i+\sigma \tilde{G}_i^t)\Bigr) \Bigr] \\&= \frac{1}{|I\cap\tilde{I}|} \sum_{i\in I\cap \tilde{I}} \iint \ell_i'\Bigl(\mathop{\mathrm{prox}}[\gamma\ell_i](a U_i+\sigma g)\Bigr) \ell_i'\Bigl(\mathop{\mathrm{prox}}[\gamma\ell_i](a U_i+\sigma \tilde{g})\Bigr) \phi_t(g,\tilde{g})\mathrm{d}g\mathrm{d}\tilde{g} \end{align}\] where \(\phi_t\) is the joint density of two jointly normal \((G^t,\tilde{G}^t)\) with \(\mathop{\mathrm{\mathbb{E}}}[(G^t)^2]=\mathop{\mathrm{\mathbb{E}}}[(\tilde{G}^t)^2]=1\) and \(\mathop{\mathrm{\mathbb{E}}}[G^t\tilde{G}^t]=t\), and in the first line \((G_i^t,\tilde{G}_i^t)\sim \phi_t\) is independent of \((\bar\eta,U_i,y_i,\ell_i)_{i\in[n]}\). Because of the law of large numbers for the deterministic solution \(\eta\) of 12 (with \(a=0\) in robust regression) or ?? (in logistic regression), we have \(\eta = \frac{\delta q^2\gamma^2}{\sigma^2}\varphi(\eta) {+o_P(1)}\). Taking the difference between the previous display and this fixed-point equation satisfied by \(\eta\), using the mean-value theorem, \[\bar \eta-\eta = \frac{\delta q^2\gamma^2}{\sigma^2} \Bigl(\varphi(\bar\eta) - \varphi(\eta)\Bigr) + o_P(1) = \frac{\delta q^2\gamma^2}{\sigma^2} \Bigl(\bar\eta-\eta\Bigr) \varphi'(t)+ o_P(1)\] for some \(t\) between \(\bar\eta\) and \(\eta\).
By calculation similar to 15 16 thanks to 1, if \((G_i^t,\tilde{G}_i^t)\) has density \(\phi_t\), \[\begin{align} \varphi'(t)&= \frac{ 1 }{|I\cap \tilde{I}|} \frac{\sigma^2}{\gamma^2} \sum_{i\in I\cap \tilde{I}} \bar \mathop{\mathrm{\mathbb{E}}}\Bigl[ \frac{ \gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma G_i^t)) }{ 1 + \gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma G_i^t)) } \frac{\gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma \tilde{G}_i^t))}{1 + \gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma \tilde{G}_i^t))} \Bigr]\\ &\le \frac{ 1 }{|I\cap \tilde{I}|} \frac{\sigma^2}{\gamma^2} \sum_{i\in I\cap \tilde{I}} \bar \mathop{\mathrm{\mathbb{E}}}\Bigl[ \frac{ \gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma G_i^t)) }{ 1 + \gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma G_i^t)) } \Bigr] \\ &= \frac{ 1 }{|I\cap \tilde{I}|} \frac{\sigma^2}{\gamma^2} \sum_{i\in I\cap \tilde{I}} \int\Bigl[ \frac{ \gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma g)) }{ 1 + \gamma \ell_i''(\mathop{\mathrm{prox}}[\gamma\ell_i](aU_i+\sigma g)) } \Bigr]\frac{e^{-g^2/2}}{\sqrt{2\pi}}\mathrm{d}g \quad\text{since } G_i^t\sim N(0,1) \\ &= \frac{\sigma^2}{\gamma^2 q\delta} + o_P(1), \end{align}\] where we have used the law of large numbers and the nonlinear system with equation 10 in robust regression and equation 25 in logistic regression. Combining the above displays, we are left with \[|\bar\eta - \eta| = |\bar \eta - \eta| \frac{\delta q^2\gamma^2}{\sigma^2} |\varphi'(t)| + o_P(1) \le |\bar \eta - \eta| \frac{\delta q^2\gamma^2}{\sigma^2} \frac{\sigma^2}{\gamma^2 q\delta} + o_P(1) = q|\bar{\eta}-\eta| + o_P(1)\] and \(\bar\eta - \eta = o_P(1)\) thanks to \(q\in (0,1)\). Since \(\bar\eta = \hat{\boldsymbol{\beta}}^\top\boldsymbol{P}\tilde{\boldsymbol{\beta}}/\sigma^2 + o_P(1)\) by 35 , the proof of 18 and 32 is complete. Next, 19 follows from 48 and 52 .

Finally for 20 and 33 , by symmetry \(\mathop{\mathrm{\mathbb{E}}}[\text{LHS}_i \mid I, \tilde{I}]\) is the same for all \(i\in I\cap \tilde{I}\). In particular, the maximum of the conditional expectation is the same as the average over \(I\cap \tilde{I}\), so that \(\sum_{i\in I\cap \tilde{I}}\mathop{\mathrm{\mathbb{E}}}[\text{LHS}_i \mid I, \tilde{I}] \le C\) proved above gives \(\max_{i\in I\cap \tilde{I}} \mathop{\mathrm{\mathbb{E}}}[\text{LHS}_i \mid I, \tilde{I}] = O_P(1/n)\) since \(I\cap \tilde{I}\) has cardinality of order \(n\). Finally, we have \[\boldsymbol{W}^\top\boldsymbol{W}\to^P \sigma^2 \begin{pmatrix} 1 & \eta \\ \eta & 1 \end{pmatrix}, \qquad \Bigl(\boldsymbol{W}^\top\boldsymbol{W}\Bigr)^{1/2}\to^P \sigma \begin{pmatrix} 1 & \eta \\ \eta & 1 \end{pmatrix}^{1/2}, \label{cont95mapping95matrix95sq}\tag{38}\] by continuity of the matrix square root and the continuous mapping theorem (or, alternatively, by reusing the argument in 37 ). Using again \(\mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}]\to^P\gamma\), \(\boldsymbol{\hat{\beta}}^T\boldsymbol{w}\to^Pa\), \((\hat{\boldsymbol{\beta}}^\top \boldsymbol{P}\boldsymbol{A}\boldsymbol{X}^\top \boldsymbol{D})\boldsymbol{e}_i\to^P 0\), and similarly for \(\boldsymbol{\tilde{\beta}}\), combined with 38 , we obtain 20 and 33 .

5 Auxiliary lemmas↩︎

5.1 Approximate multivariate normality↩︎

Proposition 7. Let \(\boldsymbol{z}\sim N(\boldsymbol{0}_p, \boldsymbol{I}_p)\) and let \(\boldsymbol{W}:\mathbb{R}^p\to\mathbb{R}^{p\times M}\) be a locally Lipschitz function with \(M\le p\). Then there exist \(\boldsymbol{g}\sim N(\boldsymbol{0}_M, \boldsymbol{I}_M)\) and an absolute constant \(C>0\) such that \[\mathop{\mathrm{\mathbb{E}}}\Bigl[\bigl\|\boldsymbol{W}(\boldsymbol{z})^\top \boldsymbol{z}- \sum_{j=1}^p \frac{\partial \boldsymbol{W}(\boldsymbol{z})^\top \boldsymbol{e}_j}{\partial z_j} - \Bigl\{\boldsymbol{W}(\boldsymbol{z})^\top \boldsymbol{W}(\boldsymbol{z})\Bigr\}^{1/2} \boldsymbol{g} \bigr\|^2\Bigr]\le C \sum_{j=1}^p \mathop{\mathrm{\mathbb{E}}}\Bigl[\bigl\| \frac{\partial \boldsymbol{W}(\boldsymbol{z})}{\partial z_j}\bigr\|_F^2\Bigr],\] where \(\{\cdot\}^{1/2}\) is the square root of the positive semi-definite matrix.

This moment inequality is a matrix-generalization of [25] and [26]. It is particularly useful to show that \[\boldsymbol{W}(\boldsymbol{z})^\top \boldsymbol{z}- \sum_{j=1}^p \frac{\partial \boldsymbol{W}(\boldsymbol{z})^\top \boldsymbol{e}_j}{\partial z_j}\] is approximately multivariate normal with covariance approximated by \(\boldsymbol{W}(\boldsymbol{z})^\top\boldsymbol{W}(\boldsymbol{z})\).

Proof. Let \(\tilde{\boldsymbol{z}}\) be an independent copy of \(\boldsymbol{z}\) and let \(\tilde{\boldsymbol{W}}= \boldsymbol{W}(\tilde{\boldsymbol{z}})\). Noting \(M\le p\), we denote the SVD of \(\tilde{\boldsymbol{W}}\in\mathbb{R}^{p\times M}\) by \(\tilde{\boldsymbol{W}}= \sum_{m=1}^M s_m \boldsymbol{u}_m \boldsymbol{v}_m^\top\) where \(s_1\ge s_2\ge ... \ge s_M\ge 0\) are the singular values. Here, we allow some \(s_m\) to be \(0\), adding extra terms if necessary, so that the sum has exactly \(M\) terms and \((\boldsymbol{v}_1,...,\boldsymbol{v}_M)\) is an orthonormal basis of \(\mathbb{R}^M\). Now we define \[\tilde{\boldsymbol{Q}} = \sum_{m=1}^M \boldsymbol{v}_{m} \boldsymbol{u}_{m}^\top\in \mathbb{R}^{M\times p}\] so that \(\tilde{\boldsymbol{W}} = \tilde{\boldsymbol{Q}}^\top (\tilde{\boldsymbol{W}}^\top \tilde{\boldsymbol{W}})^{1/2}\) thanks to \((\tilde{\boldsymbol{W}}^\top \tilde{\boldsymbol{W}})^{1/2} = \sum_{m=1}^M s_m \boldsymbol{v}_m\boldsymbol{v}_m^\top\). Define \(\boldsymbol{g}= \tilde{\boldsymbol{Q}} \boldsymbol{z}\) and note that \(\boldsymbol{g}\sim N(\boldsymbol{0}_M,\boldsymbol{I}_M)\) since \(\tilde{\boldsymbol{Q}}\tilde{\boldsymbol{Q}}^\top=\boldsymbol{I}_M\) and \(\tilde{\boldsymbol{W}}=\boldsymbol{W}(\tilde{\boldsymbol{z}})\) is independent of \(\boldsymbol{z}\). With \(\boldsymbol{W}=\boldsymbol{W}(\boldsymbol{z})\) (omitting the dependence on \(\boldsymbol{z}\)), using \(\boldsymbol{g}= \tilde{\boldsymbol{Q}}\boldsymbol{z}\) and \(\tilde{\boldsymbol{W}} = \tilde{\boldsymbol{Q}}^\top (\tilde{\boldsymbol{W}}^\top \tilde{\boldsymbol{W}})^{1/2}\), we have \[\tilde{\boldsymbol{W}}^\top \boldsymbol{z}- (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\boldsymbol{g} = \bigl[(\tilde{\boldsymbol{W}}^\top \tilde{\boldsymbol{W}})^{1/2} - (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\bigr]\boldsymbol{g}.\] Applying the second-order Stein formula [27] (see also 5.1.13 in [28]) to \(\boldsymbol{U}(\boldsymbol{z})=\boldsymbol{W}(\boldsymbol{z})^\top-\{\boldsymbol{W}(\boldsymbol{z})^\top \boldsymbol{W}(\boldsymbol{z})\}^{1/2}\tilde{\boldsymbol{Q}}\in\mathbb{R}^{M\times p}\) conditionally on \((\tilde{\boldsymbol{z}},\tilde{\boldsymbol{Q}})\), we find \[\begin{align} &\mathop{\mathrm{\mathbb{E}}}\Bigl[\|\boldsymbol{W}^\top\boldsymbol{z}- \sum_{j=1}^p \frac{\partial (\boldsymbol{W}^\top-\{\boldsymbol{W}^\top\boldsymbol{W}\}^{1/2}\tilde{\boldsymbol{Q}})\boldsymbol{e}_j}{\partial z_j} - \{\boldsymbol{W}^\top\boldsymbol{W}\}^{1/2}\boldsymbol{g}\|^2\Bigr] \nonumber \\ &= \mathop{\mathrm{\mathbb{E}}}\bigl[\|\boldsymbol{U} \boldsymbol{z} - \sum_{j=1}^p \frac{\partial\boldsymbol{U}\boldsymbol{e}_j}{\partial z_j} \|^2\bigr] \nonumber && \boldsymbol{g}= \tilde{\boldsymbol{Q}}\boldsymbol{z} \\&\le \mathop{\mathrm{\mathbb{E}}}\Bigl[\|\boldsymbol{U}(\boldsymbol{z})\|_F^2 +\sum_{j=1}^p \|\frac{\partial \boldsymbol{U}(\boldsymbol{z})}{\partial z_j}\|_F^2 \Bigr]. 
\label{SOS} \end{align}\tag{39}\] Since \(\tilde{\boldsymbol{W}}^\top = (\tilde{\boldsymbol{W}}^\top \tilde{\boldsymbol{W}})^{1/2}\tilde{\boldsymbol{Q}}\), by the triangle inequality, \[\begin{align} \nonumber \|\boldsymbol{U}\|_F &= \|\boldsymbol{W}^\top - \{\boldsymbol{W}^\top\boldsymbol{W}\}^{1/2}\tilde{\boldsymbol{Q}}\|_F \\&\le \|\boldsymbol{W}-\tilde{\boldsymbol{W}}\|_F + \|\tilde{\boldsymbol{W}}^\top - (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\tilde{\boldsymbol{Q}}\|_F \\&= \|\boldsymbol{W}-\tilde{\boldsymbol{W}}\|_F + \|[(\tilde{\boldsymbol{W}}^\top \tilde{\boldsymbol{W}})^{1/2} - (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}]\tilde{\boldsymbol{Q}}\|_F \nonumber \\&\le \|\boldsymbol{W}-\tilde{\boldsymbol{W}}\|_F + \sqrt 2\|\boldsymbol{W}-\tilde{\boldsymbol{W}}\|_F \label{piece1} \end{align}\tag{40}\] thanks to \(\|\tilde{\boldsymbol{Q}}\|_{op}\le 1\) and using, for the last line, the inequality \[\|(\tilde{\boldsymbol{W}}^\top\tilde{\boldsymbol{W}})^{1/2}-(\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\|_F\le \sqrt 2 \|\boldsymbol{W}-\tilde{\boldsymbol{W}}\|_F \label{lip}\tag{41}\] from [29], [30]. Now for the second term in 39 , \[\begin{align} \nonumber \sum_{j=1}^p \|\frac{\partial \boldsymbol{U}}{\partial z_j}\|_F^2 &= \sum_{j=1}^p \| \frac{\partial(\boldsymbol{W}^\top - (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\tilde{\boldsymbol{Q}})}{\partial z_j}\|_F^2 \\&\le 2\sum_{j=1}^p \Bigl(\| \frac{\partial \boldsymbol{W}}{\partial z_j}\|_F^2 + \| \frac{\partial (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}}{\partial z_j}\tilde{\boldsymbol{Q}}\|_F^2\Bigr) \nonumber \\&\le 6\sum_{j=1}^p \| \frac{\partial \boldsymbol{W}}{\partial z_j}\|_F^2 \label{piece2} \end{align}\tag{42}\] where for the last line we used \(\|\tilde{\boldsymbol{Q}}\|_{op}\le 1\) and again the inequality 41 , valid for any two matrices \(\tilde{\boldsymbol{W}},\boldsymbol{W}\), which grants \[\|\frac{\partial (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}}{\partial z_j}\|_F \le \sqrt{2} \|\frac{\partial \boldsymbol{W}}{\partial z_j}\|_F \label{eq:lip-derivative}\tag{43}\] by definition of the directional derivative and continuity of the Frobenius norm.

It remains to bound from above the divergence term appearing in the left-hand side of 39 . For each \(m\in[M]\), \[\boldsymbol{e}_m^\top \sum_{j=1}^p \frac{\partial(\boldsymbol{W}^\top\boldsymbol{W})^{1/2}}{\partial z_j} \tilde{\boldsymbol{Q}}\boldsymbol{e}_j\] is the divergence of the vector field \(\mathbb{R}^p \ni \boldsymbol{z}\mapsto \tilde{\boldsymbol{Q}}^\top (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\boldsymbol{e}_m \in\mathbb{R}^p\). Since \(\tilde{\boldsymbol{Q}}\in\mathbb{R}^{M\times p}\) is fixed and of rank at most \(M\), the Jacobian of this vector field has rank at most \(M\). Thus, the divergence (the trace of the Jacobian) is at most \(\sqrt M\) times the Frobenius norm of the Jacobian. This gives for every \(m\in [M]\) the following bound on the square of the divergence: \[|\boldsymbol{e}_m^\top \sum_{j=1}^p \frac{\partial(\boldsymbol{W}^\top\boldsymbol{W})^{1/2}}{\partial z_j} \tilde{\boldsymbol{Q}}\boldsymbol{e}_j|^2 \le M \sum_{j=1}^p \|\boldsymbol{e}_m^\top \frac{\partial (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\tilde{\boldsymbol{Q}}}{\partial z_j}\|^2.\] Summing over \(m\in[M]\) we find \[\label{piece3} \|\sum_{j=1}^p \frac{\partial(\boldsymbol{W}^\top\boldsymbol{W})^{1/2}}{\partial z_j} \tilde{\boldsymbol{Q}}\boldsymbol{e}_j \|^2 \le M\sum_{j=1}^p \| \frac{\partial (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\tilde{\boldsymbol{Q}}}{\partial z_j} \|_F^2.\tag{44}\] Since \(\|\tilde{\boldsymbol{Q}}\|_{op}\le 1\), we can further upper-bound the right-hand side by removing \(\tilde{\boldsymbol{Q}}\) inside the Frobenius norm, and use again 43 . Combining the pieces 39 , 40 , 42 , 44 , we find \[\begin{align} \mathop{\mathrm{\mathbb{E}}}\Bigl[\|\boldsymbol{W}^\top\boldsymbol{z}- \sum_{j=1}^p \frac{\partial \boldsymbol{W}^\top\boldsymbol{e}_j}{\partial z_j} - (\boldsymbol{W}^\top\boldsymbol{W})^{1/2}\boldsymbol{g}\|^2\Bigr] \le \C \mathop{\mathrm{\mathbb{E}}}\Bigl[\|\boldsymbol{W}-\tilde{\boldsymbol{W}}\|_F^2 +\sum_{j=1}^p \|\frac{\partial \boldsymbol{W}}{\partial z_j}\|_F^2 \Bigr]. \end{align}\] Since \(\boldsymbol{W},\tilde{\boldsymbol{W}}\) are iid, the triangle inequality for the Frobenius norm with \((a+b)^2 \le 2(a^2+b^2)\) and the Gaussian Poincaré inequality finally yield \[\mathop{\mathrm{\mathbb{E}}}\bigl[\|\boldsymbol{W}-\tilde{\boldsymbol{W}}\|_F^2\bigr]\le 4\mathop{\mathrm{\mathbb{E}}}\bigl[\bigl\|\boldsymbol{W}-\mathop{\mathrm{\mathbb{E}}}[{\boldsymbol{W}}]\bigr\|_F^2\bigr] \le C \mathop{\mathrm{\mathbb{E}}}\Bigl[ \sum_{j=1}^p \|\frac{\partial \boldsymbol{W}}{\partial z_j}\|_F^2\Bigr]\] and the proof is complete. ◻

5.2 Derivative of \(F(\eta)\)↩︎

Lemma 1. Let \(G\) and \(Z\) be independent \(N(0,1)\) random variables. Then for any Lipschitz continuous function \(f\) with \(\mathop{\mathrm{\mathbb{E}}}[f(G)^2]<+\infty\), the map \(\varphi: \eta \in [-1,1] \mapsto \mathop{\mathrm{\mathbb{E}}}[f(G)f(\eta G + \sqrt{1-\eta^2}Z)]\in\mathbb{R}\) has derivative \(\varphi'(\eta) = \mathop{\mathrm{\mathbb{E}}}[f'(G)f'(\eta G + \sqrt{1-\eta^2}Z)]\).

Proof. Since \(f\) is Lipschitz, it is differentiable almost everywhere, and since \(\eta G + \sqrt{1-\eta^2}Z\) has a density, \(f'(\eta G + \sqrt{1-\eta^2}Z)\) is well defined with probability \(1\), so that \[\begin{align} \varphi'(\eta) &= \mathop{\mathrm{\mathbb{E}}}\Bigl[ f(G) f'\Bigl(\eta G +\sqrt{1-\eta^2}Z\Bigr) \Bigl(G-\frac{\eta}{\sqrt{1-\eta^2}} Z\Bigr) \Bigr] \end{align}\] thanks to the dominated convergence theorem. If we define \(A=\eta G + \sqrt{1-\eta^2}Z\) and \(B=\sqrt{1-\eta^2}G-\eta Z\), then \(A\) and \(B\) are independent with \(A,B\sim N(0,1)\), and \(G-\frac{\eta}{\sqrt{1-\eta^2}}Z = (1-\eta^2)^{-1/2}B\). Using Stein’s formula for \(B\) conditionally on \(A\), we have \[\begin{align} \varphi'(\eta) &= (1-\eta^2)^{-1/2} \mathop{\mathrm{\mathbb{E}}}[f(\eta A + \sqrt{1-\eta^2}B) f'(A) B]\\ &= (1-\eta^2)^{-1/2}\mathop{\mathrm{\mathbb{E}}}\bigl[f'(A) \mathop{\mathrm{\mathbb{E}}}[{f(\eta A + \sqrt{1-\eta^2}B)} B \mid A] \bigr]\\ &= \mathop{\mathrm{\mathbb{E}}}\bigl[ f'(A) \mathop{\mathrm{\mathbb{E}}}[ f'(\eta A + \sqrt{1-\eta^2} B) \mid A] \bigr]\\ &= \mathop{\mathrm{\mathbb{E}}}[f'(A) f'(\eta A + \sqrt{1-\eta^2}B)]. \end{align}\] This finishes the proof. ◻
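The identity can be checked numerically; the sketch below (illustrative only, with the arbitrary choices \(f=\tanh\), \(\eta=0.4\) and a fixed Monte Carlo sample size) compares a centered finite difference of \(\varphi\) with the right-hand side \(\mathop{\mathrm{\mathbb{E}}}[f'(G)f'(\eta G+\sqrt{1-\eta^2}Z)]\), using common random numbers for variance reduction.

```python
# Finite-difference check of phi'(eta) = E[f'(G) f'(eta*G + sqrt(1-eta^2)*Z)]
# for the illustrative choice f = tanh (any Lipschitz f with E[f(G)^2] < oo works).
import numpy as np

rng = np.random.default_rng(1)
N = 2_000_000
G, Z = rng.standard_normal(N), rng.standard_normal(N)

f = np.tanh
fprime = lambda t: 1.0 - np.tanh(t) ** 2

def phi(eta):
    # Monte Carlo estimate of E[f(G) f(eta*G + sqrt(1-eta^2)*Z)]
    return np.mean(f(G) * f(eta * G + np.sqrt(1.0 - eta ** 2) * Z))

eta, h = 0.4, 1e-3
finite_diff = (phi(eta + h) - phi(eta - h)) / (2 * h)
stein_value = np.mean(fprime(G) * fprime(eta * G + np.sqrt(1.0 - eta ** 2) * Z))

print(f"centered finite difference : {finite_diff:.4f}")
print(f"E[f'(G) f'(eta G + ...)]   : {stein_value:.4f}")
```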

5.3 Modified loss and moment inequalities↩︎

This subsection provides useful approximations to study two estimators \(\hat{\boldsymbol{b}}(I),\hat{\boldsymbol{b}}(\tilde{I})\) trained on two subsampled datasets indexed by \(I\) and \(\tilde{I}\). These approximations are used in the proof of the main result in 4, with the key ingredient being 4. The approximations in this subsection are obtained as a consequence of the moment inequalities of 9 and 10, developed in [31] for estimating the out-of-sample error of a single estimator. Because the moment inequalities of 9 and 10 require us to bound from above expectations involving \(\hat{\boldsymbol{b}}(I),\hat{\boldsymbol{b}}(\tilde{I})\) and their derivatives, we resort to the following modification of the M-estimators (introduced in [32]) to guarantee that all finite moments of \(\hat{\boldsymbol{b}}(I),\hat{\boldsymbol{b}}(\tilde{I})\) and of their derivatives are suitably bounded.

Lemma 2. Let \(\hat{\boldsymbol{b}}(I) \in \mathop{\mathrm{arg\,min}}_{\boldsymbol{b}\in\mathbb{R}^p} \sum_{i\in I} \ell_i (\boldsymbol{x}_i^\top\boldsymbol{b})\) be the M-estimator fitted on the subsampled data \((\boldsymbol{x}_i, y_i)_{i\in I}\). For any constant \(K>0\) and any twice continuously differentiable function \(H:\mathbb{R}\to\mathbb{R}\) such that \(H'(u)=0\) for \(u\le 0\) and \(H'(u)=1\) for \(u\ge 1\), define the modified M-estimator \(\hat{\boldsymbol{\beta}}(I)\) as \[\label{reg} \hat{\boldsymbol{\beta}}(I) \in \mathop{\mathrm{arg\,min}}_{\boldsymbol{\beta}\in\mathbb{R}^p} \mathcal{L}(\boldsymbol{X}{\boldsymbol{\beta}}) \quad \text{where} \quad \mathcal{L}(\boldsymbol{u}) = \sum_{i\in I} \ell_i(u_i) + |I| H\Bigl(\frac{1}{2|I|}\sum_{i\in I} u_i^2 - \frac{K}{2}\Bigr)\tag{45}\] for \(\boldsymbol{u}\in\mathbb{R}^n\). If the vanilla M-estimator \(\hat{\boldsymbol{b}}(I)\) exists with high probability and \(\mathop{\mathrm{\mathbb{P}}}(\|\boldsymbol{X}\hat{\boldsymbol{b}}(I)\|^2/n \le K )\to 1\) for a sufficiently large \(K>0\), then on the event \(\{\|\boldsymbol{X}\hat{\boldsymbol{b}}(I)\|^2/n \le K \}\) the vanilla and modified M-estimators coincide, i.e., \(\hat{\boldsymbol{b}}(I)=\hat{\boldsymbol{\beta}}(I)\).
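For concreteness, a minimal numerical sketch of the modification 45 is given below for the logistic loss, with a \(C^2\) smoothstep-type choice of \(H\) (so that \(H'(u)=0\) for \(u\le 0\) and \(H'(u)=1\) for \(u\ge 1\)). The data-generating process, the value of \(K\), the subsample size and the use of a generic quasi-Newton solver are assumptions made only for this sketch; when the event of the lemma holds, the \(H\)-term is inactive at the optimum and the modified and vanilla fits coincide up to optimizer tolerance.

```python
# Minimal sketch of the modified M-estimator (45) for logistic loss, with a C^2
# smoothstep-type H: H'(u)=0 for u<=0, H'(u)=3u^2-2u^3 on (0,1), H'(u)=1 for u>=1.
# Data generation, K and the solver are illustrative assumptions only.
import numpy as np
from scipy.optimize import minimize

def H(u):
    # C^2 function with H'(u)=0 for u<=0 and H'(u)=1 for u>=1
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return u - 0.5                      # H(1) = 1 - 1/2 = 1/2, then slope 1
    return u ** 3 - 0.5 * u ** 4

def logistic_loss(u, y):
    return np.logaddexp(0.0, u) - u * y     # log(1 + e^u) - u*y, numerically stable

def modified_objective(beta, X_I, y_I, K):
    u = X_I @ beta
    return logistic_loss(u, y_I).sum() + len(y_I) * H(0.5 * np.mean(u ** 2) - K / 2.0)

def vanilla_objective(beta, X_I, y_I):
    return logistic_loss(X_I @ beta, y_I).sum()

rng = np.random.default_rng(2)
n, p, q, K = 400, 50, 0.7, 20.0
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_star))).astype(float)

I = rng.choice(n, size=int(q * n), replace=False)     # one subsample of size qn
X_I, y_I = X[I], y[I]
b0 = np.zeros(p)

fit_mod = minimize(modified_objective, b0, args=(X_I, y_I, K), method="L-BFGS-B")
fit_van = minimize(vanilla_objective, b0, args=(X_I, y_I), method="L-BFGS-B")

print("||X b_hat||^2 / n for the vanilla fit:", round(np.mean((X @ fit_van.x) ** 2), 3))
print("max |modified - vanilla| coordinate  :", np.abs(fit_mod.x - fit_van.x).max())
```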

Lemma 3. Assume that \(\ell_i\) is twice continuously differentiable with \(\ell_i''(u) \vee |\ell_i'(u)| \le 1\) and \(\ell_i''(u)>0\) for all \(u\in\mathbb{R}\). Fix any \(K>0\), let \(\hat{\boldsymbol{\beta}}\) be the M-estimator with the modified loss 45 and let \(\boldsymbol{\psi}= -\nabla\mathcal{L}(\boldsymbol{X}\hat{\boldsymbol{\beta}})\). Then, the maps \(\boldsymbol{X}\in\mathbb{R}^{n\times p}\mapsto \hat{\boldsymbol{\beta}}(\boldsymbol{y},\boldsymbol{X})\in\mathbb{R}^p\) and \(\boldsymbol{X}\in\mathbb{R}^{n\times p}\mapsto \boldsymbol{\psi}(\boldsymbol{y},\boldsymbol{X})\in \mathbb{R}^n\) are continuously differentiable, with derivatives given by \[\begin{align} \frac{\partial \hat{\boldsymbol{\beta}}}{\partial x_{ij}} = \boldsymbol{A}(\boldsymbol{e}_j \psi_i-\boldsymbol{X}^\top \boldsymbol{D}\boldsymbol{e}_i \hat{\beta}_j), \quad \frac{\partial \boldsymbol{\psi}}{\partial x_{ij}} = -\boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{e}_j \psi_i - \boldsymbol{V}\boldsymbol{e}_i \hat{\beta}_j \end{align}\] for all \(i\in[n], j\in [p]\), where \(\boldsymbol{D}= \nabla^2 \mathcal{L}(\boldsymbol{X}\hat{\boldsymbol{\beta}})\in\mathbb{R}^{n\times n}\), \(\boldsymbol{A}=(\boldsymbol{X}^\top \boldsymbol{D}\boldsymbol{X})^{-1}\in\mathbb{R}^{p\times p}\), \(\boldsymbol{V}=\boldsymbol{D}-\boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top \boldsymbol{D}\in \mathbb{R}^{n\times n}\). Furthermore, \(\sum_{i\in I}(\boldsymbol{x}_i^\top\hat{\boldsymbol{\beta}})^2\), \(\|\boldsymbol{\psi}\|^2\) and \(\|\boldsymbol{D}\|_{op}\) are bounded from above as \[\sum_{i\in I}(\boldsymbol{x}_i^\top\hat{\boldsymbol{\beta}})^2 \le |I| (K+2), \quad \|\boldsymbol{\psi}\|^2 \le |I| (1+\sqrt{K+2})^2, \quad \|\boldsymbol{D}\|_{op} \le C(K, q, \delta) \label{eq:as95bound}\tag{46}\] with probability \(1\), and \(\boldsymbol{0}_{n\times n} \preceq \boldsymbol{V}\preceq \boldsymbol{D}\).

Finally, we have for all integers \(m\ge 1\) \[\mathop{\mathrm{\mathbb{E}}}[\|\hat{\boldsymbol{\beta}}\|^m] \vee \mathop{\mathrm{\mathbb{E}}}[\|n\boldsymbol{A}\|_{op}^m] \le \begin{cases} C(m, K, q, \delta, \rho, \mathsf{Law}(\varepsilon_i)) & \text{under 1 (robust regression)}, \\ C(m, K, q, \delta) & \text{under 2 (logistic regression)}. \end{cases} \label{moments}\tag{47}\]

Proof. The proof of the first part of the lemma and 46 is given in Appendix D.4 in [32]. The moment bound 47 is proved in [32] under 2 when \(y_i\) is binary valued. We now prove 47 under 1. Let also \(\boldsymbol{V},\boldsymbol{A}\) be the matrices defined in 3 for \(\boldsymbol{\hat{\beta}}\), and let \(\tilde{\boldsymbol{V}},\tilde{\boldsymbol{A}}\) be corresponding matrices defined in 3 for \(\boldsymbol{\tilde{\beta}}\).

By 46 , \[\|\boldsymbol{\hat{\beta}}\|^2 \le \|(\frac{1}{|I|}\sum_{i\in I}\boldsymbol{x}_i\boldsymbol{x}_i^T)^{-1}\|_{op} (K+2)\] so that the bound on \(\mathop{\mathrm{\mathbb{E}}}[\|\boldsymbol{\hat{\beta}}\|^m]\) follows from the known bound \(\mathop{\mathrm{\mathbb{E}}}[\|(\frac{1}{|I|}\sum_{i\in I}\boldsymbol{x}_i\boldsymbol{x}_i^T)^{-1}\|_{op}^m] \le C(\delta q, m)\), a consequence of the integrability of the density of the smallest eigenvalue of a Wishart matrix [33], as explained for instance in [26].

Let \(\alpha>0\) be a constant such that \(1-\alpha > (\delta q)^{-1}\) and let \(Q_\alpha\in\mathbb{R}\) be the quantile such that \(\mathop{\mathrm{\mathbb{P}}}(|\varepsilon_i|\le Q_\alpha)=1-\alpha/2\). Since \(q\delta>1\) and \(|I|=(\delta q) p\), by the weak law of large numbers applied to the indicator functions \(\mathbb{I}\{|\varepsilon_i|\le Q_\alpha\}\), with probability approaching one there exists a random set \(\hat{I}\subset I\) with \(p(\delta q)(1-\alpha)=|I|(1-\alpha)\le |\hat{I}|\) and \(\sup_{i\in \hat{I}}|\varepsilon_i|\le Q_\alpha\). Next, by 46 , there exists a constant \(C(\delta,q,\alpha,K)\) such that \(\frac{1}{|\hat{I}|}\sum_{i\in \hat{I}} (\boldsymbol{x}_i^T\boldsymbol{\hat{\beta}})^2 \le C(\delta,q,\alpha, K)\). Now define \[\check I =\bigl\{ i\in \hat{I}: (\boldsymbol{x}_i^T\boldsymbol{\hat{\beta}})^2 \le C(\delta,q,\alpha,K)/\bigl(1-(\delta q(1-\alpha))^{-1/2}\bigr) \bigr\}\] and note that by Markov’s inequality, \(|\hat{I} \setminus \check I| / |\hat{I}| \le 1-(\delta q(1-\alpha))^{-1/2}\). This gives \(|\check I|\ge(\delta q(1-\alpha))^{-1/2}|\hat{I}| \ge p (\delta q(1-\alpha))^{1/2}\), and the constant \((\delta q(1-\alpha))^{1/2}\) is strictly larger than 1. Since every \(i\in \check I\) satisfies \(|\varepsilon_i|\le Q_\alpha\) and \((\boldsymbol{x}_i^T\boldsymbol{\hat{\beta}})^2\le C(\delta,q,\alpha,K)/(1-(\delta q(1-\alpha))^{-1/2})\), we have \(\varepsilon_i-\boldsymbol{x}_i^T\boldsymbol{\hat{\beta}}\in [-L,L]\) for all \(i\in \check I\), for some constant \(L=L(\delta,q,\alpha, K,Q_\alpha)\). Finally, \[\begin{align} \|n \boldsymbol{A}\|_{op} &\le \frac{n}{|\check I|} \|\Bigl(\frac{1}{|\check I|}\sum_{i\in \check I}\boldsymbol{x}_i \rho''(y_i-\boldsymbol{x}_i^T\boldsymbol{\hat{\beta}}) \boldsymbol{x}_i^T\Bigr)^{-1}\|_{op} \\&\le \frac{\delta \max_{u\in[-L,L]}(\rho''(u)^{-1}) }{(\delta q(1-\alpha))^{1/2}} \|\Bigl(\frac{1}{|\check I|}\sum_{i\in \check I}\boldsymbol{x}_i \boldsymbol{x}_i^T\Bigr)^{-1}\|_{op}. \end{align}\] Since \(\rho''\) is positive and continuous, the moment of order \(m\) of the previous display is bounded from above by some \(C(m,\delta, K,q,Q_\alpha,\rho)\) thanks to the explicit formula of [33] for the density of the smallest eigenvalue of a Wishart matrix, as explained in [32]. ◻
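The derivative formula of 3 can also be checked by finite differences. The sketch below does so for the plain (unmodified) M-estimator with the smooth loss \(\rho(u)=\sqrt{1+u^2}\) and \(I=[n]\); on the event of 2 the modification is inactive and the same formula applies with \(\boldsymbol{D}=\mathop{\mathrm{diag}}(\ell_i''(\boldsymbol{x}_i^\top\hat{\boldsymbol{\beta}}))\). The sizes, the noise distribution, the damped Newton solver and the perturbed entry \((i,j)\) are illustrative assumptions.

```python
# Finite-difference check of the derivative formula in 3 for the plain smooth
# M-estimator with rho(u) = sqrt(1+u^2) and I = [n] (modification ignored; it is
# inactive on the event of 2). Sizes and the perturbed entry are illustrative.
import numpy as np

rho  = lambda u: np.sqrt(1.0 + u ** 2)
rho1 = lambda u: u / np.sqrt(1.0 + u ** 2)           # rho'
rho2 = lambda u: (1.0 + u ** 2) ** (-1.5)            # rho''

rng = np.random.default_rng(3)
n, p = 300, 40
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + rng.standard_t(3, size=n)        # heavy-ish tailed noise

def fit(Xmat, n_iter=100):
    # damped Newton for b -> sum_i rho(y_i - x_i^T b), solved to high accuracy
    b = np.zeros(p)
    for _ in range(n_iter):
        r = y - Xmat @ b
        g = -Xmat.T @ rho1(r)                         # gradient
        Hes = Xmat.T @ (rho2(r)[:, None] * Xmat)      # Hessian = X^T D X
        step = np.linalg.solve(Hes, g)
        t = 1.0
        while rho(y - Xmat @ (b - t * step)).sum() > rho(r).sum() - 0.25 * t * g @ step:
            t *= 0.5
        b = b - t * step
        if np.linalg.norm(g) < 1e-11:
            break
    return b

beta_hat = fit(X)
r = y - X @ beta_hat
psi, D = rho1(r), np.diag(rho2(r))                    # psi_i = -l_i'(x_i^T beta_hat)
A = np.linalg.inv(X.T @ D @ X)

i, j, eps = 7, 3, 1e-6                                # perturb a single entry x_ij
formula = A @ (np.eye(p)[j] * psi[i] - X.T @ D[:, i] * beta_hat[j])

Xp, Xm = X.copy(), X.copy()
Xp[i, j] += eps
Xm[i, j] -= eps
finite_diff = (fit(Xp) - fit(Xm)) / (2 * eps)

print("max abs deviation formula vs finite difference:",
      np.abs(formula - finite_diff).max())
```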

Lemma 4. Let either 1 or 2 be fulfilled with \(I,\tilde{I}\) independent and uniformly distributed over all subsets of \([n]\) of size \(qn\). Let the notation of 4 be in force for \((\boldsymbol{\hat{\beta}},\boldsymbol{\psi},\boldsymbol{A},\boldsymbol{V})\) (as in 2 and 3 for \(I\)) and similarly for \((\boldsymbol{\tilde{\beta}},\boldsymbol{\tilde{\psi}},\tilde{\boldsymbol{A}},\tilde{\boldsymbol{V}})\).

Then \[\mathop{\mathrm{tr}}[\boldsymbol{V}]\cdot \hat{\boldsymbol{\beta}}^\top \boldsymbol{P}\tilde{\boldsymbol{\beta}} - \mathop{\mathrm{tr}}[\boldsymbol{P}\tilde{\boldsymbol{A}}] \cdot \boldsymbol{\psi}^\top \tilde{\boldsymbol{\psi}}= O_P(n^{1/2}). \label{eq:SOS95lemma}\tag{48}\]

Proof. We will apply 10 with \(\boldsymbol{\rho}={\boldsymbol{\psi}}\) and \(\boldsymbol{\eta}=\boldsymbol{P}\tilde{\boldsymbol{\beta}}\). Using the derivative formula in 3, \[\begin{align} \sum_{i=1}^n\sum_{j=1}^p \|\frac{\partial {\boldsymbol{\psi}}}{\partial x_{ij}}\|^2 &= \sum_{i=1}^n\sum_{j=1}^p \|- \boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{e}_j \psi_i - \boldsymbol{V}\boldsymbol{e}_i \hat{\beta}_j \|^2 \\ &\le 2 \| \boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\|_F^2 \|{\boldsymbol{\psi}}\|^2 + 2 \|\boldsymbol{V}\|_F^2 \|\hat{\boldsymbol{\beta}}\|^2\\ &\le 2 \| \boldsymbol{D}\|_F^2 \|\boldsymbol{X}\boldsymbol{A}\|_{op}^2 \|\boldsymbol{\psi}\|^2 + 2\|\boldsymbol{V}\|_F^2 \|\hat{\boldsymbol{\beta}}\|^2\\ &\le \C (n^2 \|\boldsymbol{X}\|_{op}^2 \|\boldsymbol{A}\|_{op}^2 + n \|\hat{\boldsymbol{\beta}}\|^2) \end{align}\] as well as \[\begin{align} \label{bound95derivatives95hbbeta} \sum_{i=1}^n\sum_{j=1}^p \|\frac{\partial \boldsymbol{P}\tilde{\boldsymbol{\beta}}}{\partial x_{ij} }\|^2 &= \sum_{i=1}^n\sum_{j=1}^p \| \boldsymbol{P}\tilde{\boldsymbol{A}}\boldsymbol{e}_j \tilde{\psi}_i - \boldsymbol{P}\tilde{\boldsymbol{A}}\boldsymbol{X}^\top \tilde{\boldsymbol{D}}\boldsymbol{e}_i \tilde{\beta}_j \|^2\\ &\le 2 \|\boldsymbol{P}\tilde{\boldsymbol{A}}\|_F^2 \|\tilde{\boldsymbol{\psi}}\|^2 + 2 \|\boldsymbol{P}\tilde{\boldsymbol{A}}\boldsymbol{X}^\top \tilde{\boldsymbol{D}}\|_F^2 \|\tilde{\boldsymbol{\beta}}\|^2\\ &\le 2 \|\tilde{\boldsymbol{A}}\|_F^2 \|\tilde{\boldsymbol{\psi}}\|^2 + 2 \|\boldsymbol{X}\tilde{\boldsymbol{A}}\|_{op}^2 \|\tilde{\boldsymbol{D}}\|_{F}^2 \|\tilde{\boldsymbol{\beta}}\|^2\\ &\le \C(pn\|\tilde{\boldsymbol{A}}\|_{op}^2 + n \|\boldsymbol{X}\|_{op}^2 \|\tilde{\boldsymbol{A}}\|_{op}^2 \|\tilde{\boldsymbol{\beta}}\|^2). \end{align}\tag{49}\] Since \(\mathop{\mathrm{\mathbb{E}}}[\|\boldsymbol{X}{n^{-1/2}}\|_{op}^k] \vee \mathop{\mathrm{\mathbb{E}}}[\|{n}\boldsymbol{A}\|_{op}^k] \vee \mathop{\mathrm{\mathbb{E}}}[\|\hat{\boldsymbol{\beta}}\|^k]\le C\) for a constant independent of \(n,p\), by 47 and integration of \(\mathop{\mathrm{\mathbb{P}}}(\|\boldsymbol{X}{n^{-1/2}}\|_{op}>1+\delta^{-1/2}+tn^{-1/2})\le e^{-t^2/2}\) (see, e.g., [34], [35] or [36]), we obtain, since \(\boldsymbol{\psi}=^d\boldsymbol{\tilde{\psi}}\) and \(\hat{\boldsymbol{\beta}}=^d\tilde{\boldsymbol{\beta}}\), \[\sum_{i=1}^n\sum_{j=1}^p \mathop{\mathrm{\mathbb{E}}}\Bigl[ \|\frac{\partial \boldsymbol{P}{\boldsymbol{\hat{\beta}}}}{\partial x_{ij} }\|^2 + \|\frac{\partial \boldsymbol{P}\tilde{\boldsymbol{\beta}}}{\partial x_{ij} }\|^2 + \frac{1}{n} \|\frac{\partial {\boldsymbol{\psi}}}{\partial x_{ij}}\|^2 + \frac{1}{n} \|\frac{\partial \tilde{\boldsymbol{\psi}}}{\partial x_{ij}}\|^2 \Bigr] \le C' \label{bound95frobenius95norm95derivatives}\tag{50}\] for another constant independent of \(n,p\). Thus the RHS of 10 is \(O(n)\). 
This gives \[\mathop{\mathrm{\mathbb{E}}}\Bigl[\Bigl({\boldsymbol{\psi}}^\top \boldsymbol{X}\boldsymbol{P}\tilde{\boldsymbol{\beta}} - \sum_{i=1}^n\sum_{j=1}^p \frac{\partial }{\partial x_{ij}} ({\boldsymbol{e}_i^\top \boldsymbol{\psi}\cdot\boldsymbol{e}_j^\top \boldsymbol{P}\tilde{\boldsymbol{\beta}}})\Bigr)^2 \Bigr] \le n \C.\] By the derivative formula in 3, \[\begin{align} &\sum_{i=1}^n\sum_{j=1}^p \frac{\partial }{\partial x_{ij}} ({\boldsymbol{e}_i^\top \boldsymbol{\psi}\boldsymbol{e}_j^\top \boldsymbol{P}\tilde{\boldsymbol{\beta}}}) \\ &= \sum_{ij} \boldsymbol{e}_i^\top \bigl(\frac{\partial \boldsymbol{\psi}}{\partial x_{ij}}\bigr) \boldsymbol{e}_j^\top \boldsymbol{P}\tilde{\boldsymbol{\beta}} + \boldsymbol{e}_i^\top \boldsymbol{\psi}\boldsymbol{e}_j^\top\boldsymbol{P}\bigl(\frac{\partial \tilde{\boldsymbol{\beta}}}{\partial x_{ij}}\bigr)\\ &= \sum_{ij} \boldsymbol{e}_i^\top \bigl(-\boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{e}_j \psi_i- \boldsymbol{V}\boldsymbol{e}_i \hat{\beta}_j \bigr)\boldsymbol{e}_j^\top \boldsymbol{P}\tilde{\boldsymbol{\beta}} + \sum_{ij} \psi_i \boldsymbol{e}_j^\top\boldsymbol{P}{\tilde{\boldsymbol{A}}} (\boldsymbol{e}_j \tilde{\psi}_i -\boldsymbol{X}^\top \tilde{\boldsymbol{D}}\boldsymbol{e}_i \tilde{\beta}_j)\\ &= -\boldsymbol{\psi}^\top \boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{P}\tilde{\boldsymbol{\beta}}-\mathop{\mathrm{tr}}[\boldsymbol{V}]\hat{\boldsymbol{\beta}}^\top \boldsymbol{P}\tilde{\boldsymbol{\beta}} + \mathop{\mathrm{tr}}[\boldsymbol{P}{\tilde{\boldsymbol{A}}}]\boldsymbol{\psi}^\top \tilde{\boldsymbol{\psi}} - \tilde{\boldsymbol{\beta}}^\top \boldsymbol{P}{\tilde{\boldsymbol{A}}} \boldsymbol{X}^\top \tilde{\boldsymbol{D}}\boldsymbol{\psi}, \end{align}\] where \[\begin{align} \mathop{\mathrm{\mathbb{E}}}[|\boldsymbol{\psi}^\top \boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{P}\tilde{\boldsymbol{\beta}}|^2] &\le \mathop{\mathrm{\mathbb{E}}}[\|\boldsymbol{\psi}\|^2\|\tilde{\boldsymbol{\beta}}\|^2 \|\boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{P}\|_{op}^2] \le \C \mathop{\mathrm{\mathbb{E}}}[n \|\tilde{\boldsymbol{\beta}}\|^2 \|\boldsymbol{X}\boldsymbol{A}\|_{op}^2] = O(1)\\ \mathop{\mathrm{\mathbb{E}}}[| \tilde{\boldsymbol{\beta}}^\top \boldsymbol{P}{\tilde{\boldsymbol{A}}} \boldsymbol{X}^\top \tilde{\boldsymbol{D}}\boldsymbol{\psi}|^2] &\le \mathop{\mathrm{\mathbb{E}}}[\|\boldsymbol{\psi}\|^2\|\tilde{\boldsymbol{\beta}}\|^2 \|\boldsymbol{P}{\tilde{\boldsymbol{A}}} \boldsymbol{X}^\top \tilde{\boldsymbol{D}}\|_{op}^2] \le \mathop{\mathrm{\mathbb{E}}}[n \|\tilde{\boldsymbol{\beta}}\|^2 \|{\tilde{\boldsymbol{A}}} \boldsymbol{X}^\top\|_{op}^2] = O(1). \end{align}\] This gives \[\mathop{\mathrm{\mathbb{E}}}\Bigl[\Bigl({\boldsymbol{\psi}}^\top \boldsymbol{X}\boldsymbol{P}\tilde{\boldsymbol{\beta}} + \mathop{\mathrm{tr}}[\boldsymbol{V}]\hat{\boldsymbol{\beta}}^\top \boldsymbol{P}\tilde{\boldsymbol{\beta}} - \mathop{\mathrm{tr}}[\boldsymbol{P}{\tilde{\boldsymbol{A}}}]\boldsymbol{\psi}^\top \tilde{\boldsymbol{\psi}}\Bigr)^2\Bigr] = O(n).\] Here, \({\boldsymbol{\psi}}^\top\boldsymbol{X}\boldsymbol{P}\tilde{\boldsymbol{\beta}}\) is \(0\) by the KKT condition \(\boldsymbol{X}^\top \boldsymbol{\psi}= \boldsymbol{0}_p\), and the proof is complete. ◻

Lemma 5. Let either 1 or 2 be fulfilled with \(I,\tilde{I}\) independent and uniformly distributed over all subsets of \([n]\) of size \(qn\). Let the notation of 4 be in force for \((\boldsymbol{\hat{\beta}},\boldsymbol{\psi},\boldsymbol{A},\boldsymbol{V})\) (as in 2 and 3 for \(I\)) and similarly for \((\boldsymbol{\tilde{\beta}},\boldsymbol{\tilde{\psi}},\tilde{\boldsymbol{A}},\tilde{\boldsymbol{V}})\).

Then \[\label{eq:relation95bpsi95P95beta} \|\boldsymbol{\psi}\|^2 - p^{-1} \mathop{\mathrm{tr}}[\boldsymbol{V}]^2 \|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2 = O_P(n^{1/2})\tag{51}\]

Proof. We will use 9 with \(\boldsymbol{\rho}=\boldsymbol{\psi}/(\sqrt{nq} (1+\sqrt{K+2}))\). In logistic regression, let us assume by rotational invariance that \(\boldsymbol{\beta}^*/\|\boldsymbol{\beta}^*\|=\boldsymbol{e}_1\) (first canonical basis vector), and we apply 9 conditionally on \((\boldsymbol{y},\boldsymbol{X}\boldsymbol{\beta}^*)\) to the Gaussian matrix \((x_{ij})_{i\in[n], j\ge 2}\). In robust regression, we apply 9 with respect to the full Gaussian matrix \(\boldsymbol{X}= (x_{ij})_{i\in[n], j\in[p]}\), conditionally on the independent noise \((\varepsilon_i)_{i\in [n]}\). To accommodate both settings simultaneously, let us define \[j_0 = \begin{cases} 1 & \text{ in robust linear regression}, \\ 2 & \text{ in logistic regression} \end{cases}\] so that \(\boldsymbol{P}=\sum_{j=j_0}^p \boldsymbol{e}_j\boldsymbol{e}_j^T\) holds.

Note \(\|\boldsymbol{\psi}\|^2 \le nq(1+\sqrt{K+2})^2\) with probability \(1\), so the assumption \(\|\boldsymbol{\rho}\|^2\le 1\) of 9 is satisfied. The derivative term \(\sum_{i=1}^n\sum_{j=j_0}^p \|\frac{\partial \boldsymbol{\rho}}{\partial x_{ij}}\|^2\) is bounded in expectation thanks to 50 and the normalization of \(\boldsymbol{\rho}\), so that the RHS of 53 is \(O(\sqrt{n})\). Thus, 9 gives \[(p+1-j_0) \frac{\|\boldsymbol{\psi}\|^2}{n} - \frac{1}{n}\sum_{j=j_0}^p \Bigl(\boldsymbol{\psi}^\top \boldsymbol{X}\boldsymbol{e}_j - \sum_{i=1}^n \frac{\partial \boldsymbol{e}_i^\top \boldsymbol{\psi}}{\partial x_{ij}}\Bigr)^2 = O_P(\sqrt{n}).\] Here, \(\boldsymbol{\psi}^\top \boldsymbol{X}= \boldsymbol{0}_p^\top\) by the KKT condition. For \(\sum_{i=1}^n \frac{\partial \boldsymbol{e}_i^\top \boldsymbol{\psi}}{\partial x_{ij}}\), the derivative formula in 3 gives \[\begin{align} \sum_{j=j_0}^p \Bigl(\sum_{i=1}^n \frac{\partial \boldsymbol{e}_i^\top \boldsymbol{\psi}}{\partial x_{ij}}\Bigr)^2 &= \sum_{j=j_0}^p \Bigl(\sum_{i=1}^n \boldsymbol{e}_i^\top (-\boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{e}_j \boldsymbol{e}_i^\top \boldsymbol{\psi}- \boldsymbol{V}\boldsymbol{e}_i \boldsymbol{e}_j^\top\hat{\boldsymbol{\beta}})\Bigr)^2\\ &= \sum_{j=j_0}^p (\boldsymbol{\psi}^\top \boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{e}_j + \mathop{\mathrm{tr}}[\boldsymbol{V}] \boldsymbol{e}_j^\top \hat{\boldsymbol{\beta}})^2\\ &= \|\boldsymbol{P}\boldsymbol{A}^\top \boldsymbol{X}^\top \boldsymbol{D}\boldsymbol{\psi}+ \mathop{\mathrm{tr}}[\boldsymbol{V}]\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2\\ &= \mathop{\mathrm{tr}}[\boldsymbol{V}]^2 \|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2 + \|\boldsymbol{P}\boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{D}\boldsymbol{\psi}\|^2 + 2\mathop{\mathrm{tr}}[\boldsymbol{V}] \hat{\boldsymbol{\beta}}^\top \boldsymbol{P}\boldsymbol{A}^\top\boldsymbol{X}^\top \boldsymbol{D}\boldsymbol{\psi}\\ &= \mathop{\mathrm{tr}}[\boldsymbol{V}]^2 \|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2 + O_P(1) + O_P(n). \end{align}\] Since \(\|\boldsymbol{\psi}\|^2 \le nq(1+\sqrt{K+2})^2\), we have \((p+1-j_0) \frac{\|\boldsymbol{\psi}\|^2}{n} = p\frac{\|\boldsymbol{\psi}\|^2}{n} + O(1)\). Combining the last two displays and multiplying by \(n/p=\delta\), the proof is complete. ◻

Lemma 6. Let either 1 or 2 be fulfilled with \(I,\tilde{I}\) independent and uniformly distributed over all subsets of \([n]\) of size \(qn\). Let the notation of 4 be in force for \((\boldsymbol{\hat{\beta}},\boldsymbol{\psi},\boldsymbol{A},\boldsymbol{V})\) (as in 2 and 3 for \(I\)) and similarly for \((\boldsymbol{\tilde{\beta}},\boldsymbol{\tilde{\psi}},\tilde{\boldsymbol{A}},\tilde{\boldsymbol{V}})\).

Then \[\label{eq:relation95trV95trA95gamma} \mathop{\mathrm{tr}}[\boldsymbol{V}] \mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}] = p + O_P(n^{1/2}), \quad \mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}] \to^p \gamma.\tag{52}\]

Proof. By 4 applied with \(\tilde{I}=I\) (its proof applies verbatim in this case, with \(\tilde{\boldsymbol{\beta}}=\hat{\boldsymbol{\beta}}\) and \(\tilde{\boldsymbol{\psi}}=\boldsymbol{\psi}\)) and by 5, we have \[\mathop{\mathrm{tr}}[\boldsymbol{V}] \|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2 - \mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}] \|\boldsymbol{\psi}\|^2 = O_P(n^{1/2}), \qquad \|\boldsymbol{\psi}\|^2 - p^{-1}\mathop{\mathrm{tr}}[\boldsymbol{V}]^2\|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2 = O_P(n^{1/2}).\] Here, since \(\|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2 \to^p \sigma^2>0\) and \(\|\boldsymbol{\psi}\|^2/(nq)\to^p \sigma^2/(q\delta\gamma^2)\), the second display implies \(\mathop{\mathrm{tr}}[\boldsymbol{V}]/(qn)\to^p 1/(q\delta \gamma)\). On the other hand, substituting the second display into the first display and using \(\mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}]\le p\|\boldsymbol{A}\|_{op}=O_P(1)\), we are left with \[\mathop{\mathrm{tr}}[\boldsymbol{V}]\|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2 (1 - p^{-1}\mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}] \mathop{\mathrm{tr}}[\boldsymbol{V}]) = O_P(n^{1/2}).\] Since \(\mathop{\mathrm{tr}}[\boldsymbol{V}]\|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2/n \to^p \sigma^2/(\delta\gamma) >0\), this gives \(1 - p^{-1}\mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}] \mathop{\mathrm{tr}}[\boldsymbol{V}]= O_P(n^{-1/2})\), and multiplying by \(p\) yields the first claim in 52 . Combined with \(\mathop{\mathrm{tr}}[\boldsymbol{V}]/(qn)\to^p 1/(q\delta \gamma)\), we also obtain \(\mathop{\mathrm{tr}}[\boldsymbol{P}\boldsymbol{A}] \to^p \gamma\). ◻

Lemma 7. Let either 1 or 2 be fulfilled with \(I,\tilde{I}\) independent and uniformly distributed over all subsets of \([n]\) of size \(qn\). Let the notation of 4 be in force for \((\boldsymbol{\hat{\beta}},\boldsymbol{\psi},\boldsymbol{A},\boldsymbol{V})\) (as in 2 and 3 for \(I\)) and similarly for \((\boldsymbol{\tilde{\beta}},\boldsymbol{\tilde{\psi}},\tilde{\boldsymbol{A}},\tilde{\boldsymbol{V}})\).

Then \[\frac{1}{|I\cap \tilde{I}|} \sum_{i\in I\cap \tilde{I}} \psi_i^2 = \frac{1}{|I|}\sum_{i\in I} \psi_i^2 + O_P(n^{-1/2})\]

Proof. Using 11 conditionally on \((|\tilde{I} \cap I|, I, \boldsymbol{\psi})\), we have \[\mathop{\mathrm{\mathbb{E}}}\bigl[\bigl| \sum_{i\in I\cap \tilde{I}} \psi_i^2- \frac{|I\cap\tilde{I}|}{|I|} \sum_{i\in I}\psi_i^2 \bigr|^2\bigr] \le \mathop{\mathrm{\mathbb{E}}}\bigl[|I\cap \tilde{I}|\frac{\sum_{i\in I} \psi_{i}^2}{|I|}\bigr] \le \mathop{\mathrm{\mathbb{E}}}[|I\cap \tilde{I}|\frac{n}{|I|}] \le n,\] so that \(\sum_{i\in I\cap \tilde{I}} \psi_i^2- \frac{|I\cap\tilde{I}|}{|I|} \sum_{i\in I}\psi_i^2 = O_P(n^{1/2})\). Dividing by \(|I\cap \tilde{I}| = nq^2 + O_P(n^{1/2})\) completes the proof. ◻

Lemma 8. Let either 1 or 2 be fulfilled with \(I,\tilde{I}\) independent and uniformly distributed over all subsets of \([n]\) of size \(qn\). Let the notation of 4 be in force for \((\boldsymbol{\hat{\beta}},\boldsymbol{\psi},\boldsymbol{A},\boldsymbol{V})\) (as in 2 and 3 for \(I\)) and similarly for \((\boldsymbol{\tilde{\beta}},\boldsymbol{\tilde{\psi}},\tilde{\boldsymbol{A}},\tilde{\boldsymbol{V}})\).

Let \(\bar{\mathop{\mathrm{\mathbb{E}}}}[\cdot]=\mathop{\mathrm{\mathbb{E}}}[\cdot|\boldsymbol{X}\boldsymbol{\beta}^*,\boldsymbol{y}]\) be the conditional expectation given \((\boldsymbol{X}\boldsymbol{\beta}^*, \boldsymbol{y})\). Then, we have \[\begin{align} \tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\hat{\boldsymbol{\beta}} &= \bar{\mathop{\mathrm{\mathbb{E}}}}\bigl[\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\hat{\boldsymbol{\beta}}\bigr] + O_P(n^{-1/2})\\ \frac{1}{|I\cap \tilde{I}|}\sum_{i\in I\cap \tilde{I}} \psi_i\tilde{\psi}_i &= \frac{1}{|I\cap \tilde{I}|} \bar{\mathop{\mathrm{\mathbb{E}}}}\Bigl[ \sum_{i\in I\cap \tilde{I}} \psi_i\tilde{\psi}_i \Bigr] + O_P(n^{-1/2}). \end{align}\]

Proof. First we show the concentration of \(\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\hat{\boldsymbol{\beta}}\). By the Gaussian Poincaré inequality with respect to the Gaussian entries \((x_{ij})_{i\in[n], j\ge 2}\) (conditionally on \((\boldsymbol{X}\boldsymbol{\beta}^*,\boldsymbol{y})\)), and since \(\partial\hat{\boldsymbol{\beta}}/\partial x_{ij}=\boldsymbol{0}\) for \(i\notin I\) and \(\partial\tilde{\boldsymbol{\beta}}/\partial x_{ij}=\boldsymbol{0}\) for \(i\notin \tilde{I}\), we have, using \((a+b)^2\le 2a^2+2b^2\), \[\begin{align} \bar{\mathop{\mathrm{\mathbb{E}}}}\bigl[ \bigl( \tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\hat{\boldsymbol{\beta}} - \bar{\mathop{\mathrm{\mathbb{E}}}}\bigl[\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\hat{\boldsymbol{\beta}}\bigr]\bigr)^2 \bigr] &\le \bar{\mathop{\mathrm{\mathbb{E}}}} \bigl[ \sum_{i=1}^n \sum_{j=2}^p (\frac{\partial \tilde{\boldsymbol{\beta}}^\top \boldsymbol{P}\hat{\boldsymbol{\beta}}}{\partial x_{ij}})^2 \bigr]\\ &\le 2\bar{\mathop{\mathrm{\mathbb{E}}}}\Bigl[ \sum_{j=2}^p \sum_{i\in I} (\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\frac{\partial \hat{\boldsymbol{\beta}}}{\partial x_{ij}})^2 + \sum_{j=2}^p \sum_{i\in \tilde{I}} (\hat{\boldsymbol{\beta}}^\top\boldsymbol{P}\frac{\partial \tilde{\boldsymbol{\beta}}}{\partial x_{ij}})^2 \Bigr]. \end{align}\] By the symmetry between \(\tilde{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\beta}}\), it suffices to bound \(\bar{\mathop{\mathrm{\mathbb{E}}}}\bigl[ \sum_{j=2}^p \sum_{i\in I} (\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\frac{\partial \hat{\boldsymbol{\beta}}}{\partial x_{ij}})^2\bigr]\). Using the derivative formula in 3, \[\begin{align} \sum_{j=2}^p \sum_{i\in I} (\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\frac{\partial \hat{\boldsymbol{\beta}}}{\partial x_{ij}})^2 &= \sum_{j=2}^p \sum_{i\in I} \bigl(\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\boldsymbol{A}(\boldsymbol{e}_j \boldsymbol{e}_i^\top \boldsymbol{\psi}-\boldsymbol{X}^\top \boldsymbol{D}\boldsymbol{e}_i \boldsymbol{e}_j^\top \hat{\boldsymbol{\beta}})\bigr)^2 \\ &\le 2 \|\boldsymbol{A}^\top\boldsymbol{P}\tilde{\boldsymbol{\beta}}\|^2 \|\boldsymbol{\psi}\|^2 + 2\|\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\boldsymbol{A}\boldsymbol{X}^\top \boldsymbol{D}\|^2 \|\boldsymbol{P}\hat{\boldsymbol{\beta}}\|^2, \end{align}\] and the moment of the RHS is \(O(n^{-1})\). This concludes the proof of concentration for \(\tilde{\boldsymbol{\beta}}^\top\boldsymbol{P}\hat{\boldsymbol{\beta}}\). For \(\sum_{i\in I\cap\tilde{I}}\psi_i\tilde{\psi}_i = \boldsymbol{\psi}^\top\tilde{\boldsymbol{\psi}}\), the same argument using the Gaussian Poincaré inequality (with \(\partial\boldsymbol{\psi}/\partial x_{ij}=\boldsymbol{0}\) for \(i\notin I\) and \(\partial\tilde{\boldsymbol{\psi}}/\partial x_{ij}=\boldsymbol{0}\) for \(i\notin \tilde{I}\)) gives \[\begin{align} &\bar{\mathop{\mathrm{\mathbb{E}}}}\Bigl[\Bigl( \sum_{i\in I\cap \tilde{I}} \psi_i\tilde{\psi}_i - \bar{\mathop{\mathrm{\mathbb{E}}}}\bigl[ \sum_{i\in I\cap \tilde{I}} \psi_i\tilde{\psi}_i \bigr] \Bigr)^2\Bigr] \\ &\le 2\bar{\mathop{\mathrm{\mathbb{E}}}} \Bigl[\sum_{j=2}^p \sum_{i\in I} (\tilde{\boldsymbol{\psi}}^\top \frac{\partial \boldsymbol{\psi}}{\partial x_{ij}})^2 + \sum_{j=2}^p \sum_{i\in \tilde{I}} (\boldsymbol{\psi}^\top \frac{\partial \tilde{\boldsymbol{\psi}}}{\partial x_{ij}})^2 \Bigr], \end{align}\] and \[\begin{align} \sum_{j=2}^p \sum_{i\in I} (\tilde{\boldsymbol{\psi}}^\top \frac{\partial \boldsymbol{\psi}}{\partial x_{ij}})^2 &= \sum_{j=2}^p \sum_{i\in I} \Bigl(\tilde{\boldsymbol{\psi}}^\top (-\boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\boldsymbol{e}_j \boldsymbol{e}_i^\top \boldsymbol{\psi}- \boldsymbol{V}\boldsymbol{e}_i \boldsymbol{e}_j^\top\hat{\boldsymbol{\beta}})\Bigr)^2 \\ &\le 2 \Bigl(\|\tilde{\boldsymbol{\psi}}^\top \boldsymbol{D}\boldsymbol{X}\boldsymbol{A}\|^2\|\boldsymbol{\psi}\|^2 + \|\tilde{\boldsymbol{\psi}}^\top \boldsymbol{V}\|^2 \|\hat{\boldsymbol{\beta}}\|^2\Bigr), \end{align}\] where the moment of the RHS is \(O(n)\). 
This gives \[\sum_{i\in I\cap \tilde{I}} \psi_i\tilde{\psi}_i - \bar{\mathop{\mathrm{\mathbb{E}}}}\bigl[ \sum_{i\in I\cap \tilde{I}} \psi_i\tilde{\psi}_i \bigr] = O_P(n^{1/2}).\] Finally, dividing by \(|I\cap \tilde{I}|=nq^2 + O_P(n^{1/2})\), we obtain the concentration of \(\frac{1}{|I\cap \tilde{I}|}\sum_{i\in I\cap \tilde{I}}\psi_i\tilde{\psi}_i\). ◻

Moment inequalities↩︎

Lemma 9 (Theorem 2.6 in [31]). Assume that \(\boldsymbol{X}=(x_{ij})\in\mathbb{R}^{n\times p}\) has iid \(N(0,1)\) entries, that \(\boldsymbol{\rho}:\mathbb{R}^{n\times p}\to \mathbb{R}^n\) is weakly differentiable and that \(\|\boldsymbol{\rho}\|^2 \le 1\) almost everywhere. Then \[\begin{gather} \mathop{\mathrm{\mathbb{E}}}\Bigl|p\|\boldsymbol{\rho}\|^2 -\sum_{j=1}^p \Bigl( \boldsymbol{\rho}^\top \boldsymbol{X}\boldsymbol{e}_j - \sum_{i=1}^n \frac{\partial \rho_i}{\partial x_{ij}} \Bigr)^2 \Bigr| \\ \le C \mathop{\mathrm{\mathbb{E}}}\Bigl[1+\sum_{i=1}^n\sum_{j=1}^p \|\frac{\partial \boldsymbol{\rho}}{\partial x_{ij}}\|^2\Bigr]^{1/2} \sqrt{p} + C \mathop{\mathrm{\mathbb{E}}}\Bigl[\sum_{i=1}^n\sum_{j=1}^p \|\frac{\partial \boldsymbol{\rho}}{\partial x_{ij}}\|^2\Bigr], \label{eq:chi295lemma} \end{gather}\tag{53}\] where \(C>0\) is an absolute constant.
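As an illustration of this approximate chi-square identity, the sketch below evaluates both sides for the arbitrary choice \(\boldsymbol{\rho}(\boldsymbol{X})=\tanh(\boldsymbol{X}\boldsymbol{v})/\sqrt{n}\) with a fixed unit vector \(\boldsymbol{v}\), for which \(\|\boldsymbol{\rho}\|^2\le 1\) and the derivative term is \(O(1)\); the dimensions and the number of repetitions are illustrative only.

```python
# Monte Carlo illustration of the lemma above for rho(X) = tanh(X v)/sqrt(n),
# with v a fixed unit vector, so that ||rho||^2 <= 1. The relative gap between
# p*||rho||^2 and the sum over j is expected to be of order p^{-1/2}.
import numpy as np

rng = np.random.default_rng(4)
n, p = 1200, 300
v = rng.standard_normal(p)
v /= np.linalg.norm(v)

rel_gaps = []
for _ in range(20):
    X = rng.standard_normal((n, p))
    t = X @ v
    rho = np.tanh(t) / np.sqrt(n)                     # ||rho||^2 <= 1
    drho_dt = (1.0 - np.tanh(t) ** 2) / np.sqrt(n)    # d rho_i / d t_i
    # for each j: rho^T X e_j - sum_i d rho_i / d x_ij  (chain rule: d t_i/d x_ij = v_j)
    col_stat = rho @ X - drho_dt.sum() * v
    lhs = p * (rho @ rho)
    rhs = np.sum(col_stat ** 2)
    rel_gaps.append(abs(lhs - rhs) / lhs)

print("average relative gap   :", round(float(np.mean(rel_gaps)), 3))
print("reference order p^-1/2 :", round(p ** -0.5, 3))
```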

Lemma 10 (Proposition 2.5 in [31]). Let \(\boldsymbol{X}= (x_{ij})\in\mathbb{R}^{n\times p}\) with iid \(N(0,1)\) entries and \(\boldsymbol{\rho}:\mathbb{R}^{n\times p}\to\mathbb{R}^n\), \(\boldsymbol{\eta}:\mathbb{R}^{n\times p}\to\mathbb{R}^p\) be two vector functions, with weakly differentiable components \(\rho_1, \dots, \rho_n\) and \(\eta_1, \dots, \eta_p\). Then \[\begin{gather} \mathop{\mathrm{\mathbb{E}}}\Bigl[ \Bigl(\boldsymbol{\rho}^\top \boldsymbol{X} \boldsymbol{\eta} - \sum_{i=1}^n\sum_{j=1}^p \frac{\partial (\rho_i\eta_j)}{\partial x_{ij}} \Bigr)^2 \Bigr] \\\le \mathop{\mathrm{\mathbb{E}}}\bigl[\|\boldsymbol{\rho}\|^2 \|\boldsymbol{\eta}\|^2\bigr] + 2\mathop{\mathrm{\mathbb{E}}}\Bigl[\sum_{i=1}^n\sum_{j=1}^p \|\boldsymbol{\eta}\|^2\|\frac{\partial \boldsymbol{\rho}}{\partial x_{ij}}\|^2 + \|\boldsymbol{\rho}\|^2\|\frac{\partial \boldsymbol{\eta}}{\partial x_{ij}}\|^2 \Bigr]. \end{gather}\]

5.4 Concentration of sampling without replacement↩︎

Lemma 11 (Simple random sampling properties; see, e.g., page 13 of [37]). Consider a deterministic array \((x_i)_{i=1}^M\) of length \(M\ge 1\), let \(\mu = M^{-1}\sum_{i\in [M]} x_i\) be its mean and \(\sigma^2 = M^{-1} \sum_{i\in [M]} x_i^2 - \mu^2\) its variance. Suppose \(J\) is uniformly distributed on \(\{J \subset [M]: |J|=m\}\) for a fixed integer \(m\le M\). Then, the mean and variance of the sample mean \(\hat{\mu}_J = \frac{1}{|J|}\sum_{i\in J} x_i\) are given by \[\mathop{\mathrm{\mathbb{E}}}[\hat{\mu}_J] = \mu, \quad \text{and} \quad \mathop{\mathrm{\mathbb{E}}}[(\hat{\mu}_J-\mu)^2] = \frac{\sigma^2}{m} \left(1-\frac{m-1}{M-1}\right) \le \frac{\sum_{i\in [M]} x_i^2}{m M}.\]
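A quick Monte Carlo check of the variance formula and of the final upper bound is given below; the array \(x\), the sizes \(M,m\) and the number of repetitions are arbitrary illustrative choices.

```python
# Monte Carlo check of the simple-random-sampling mean/variance formulas above.
import numpy as np

rng = np.random.default_rng(5)
M, m, n_rep = 200, 60, 50_000
x = rng.exponential(size=M)                       # any fixed array works
mu = x.mean()
sigma2 = np.mean(x ** 2) - mu ** 2

draws = np.array([x[rng.choice(M, size=m, replace=False)].mean()
                  for _ in range(n_rep)])

print("empirical mean     :", round(draws.mean(), 5), " vs mu =", round(mu, 5))
print("empirical variance :", round(draws.var(), 6))
print("exact formula      :", round(sigma2 / m * (1.0 - (m - 1) / (M - 1)), 6))
print("upper bound        :", round(np.sum(x ** 2) / (m * M), 6))
```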

References↩︎

[1]
Noureddine El Karoui, Derek Bean, Peter J Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110 (36): 14557–14562, 2013.
[2]
David Donoho and Andrea Montanari. High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166: 935–969, 2016.
[3]
Noureddine El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170: 95–175, 2018.
[4]
Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized \(m\)-estimators in high dimensions. IEEE Transactions on Information Theory, 64 (8): 5592–5628, 2018.
[5]
Leo Breiman. Bagging predictors. Machine learning, 24: 123–140, 1996.
[6]
Leo Breiman. Using iterated bagging to debias regressions. Machine Learning, 45: 261–277, 2001.
[7]
Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30 (4): 927–961, 2002.
[8]
Daniel LeJeune, Hamid Javadi, and Richard Baraniuk. The implicit regularization of ordinary least squares ensembles. In International Conference on Artificial Intelligence and Statistics, pages 3525–3535. PMLR, 2020.
[9]
Jin-Hong Du, Pratik Patil, and Arun Kumar Kuchibhotla. Subsample ridge ensembles: Equivalences and generalized cross-validation. arXiv preprint arXiv:2304.13016, 2023.
[10]
Pratik Patil, Jin-Hong Du, and Arun Kumar Kuchibhotla. Bagging in overparameterized learning: Risk characterization and risk monotonization. Journal of Machine Learning Research, 24 (319): 1–113, 2023.
[11]
Pierre C Bellec, Jin-Hong Du, Takuya Koriyama, Pratik Patil, and Kai Tan. Corrected generalized cross-validation for finite ensembles of penalized estimators. arXiv preprint arXiv:2310.01374, 2023.
[12]
Bruno Loureiro, Cédric Gerbelot, Maria Refinetti, Gabriele Sicuro, and Florent Krzakala. Fluctuations, bias, variance & ensemble of learners: Exact asymptotics for convex losses in high-dimension. In International Conference on Machine Learning, pages 14283–14314. PMLR, 2022.
[13]
Lucas Clarté, Adrien Vandenbroucque, Guillaume Dalle, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Analysis of bootstrap and subsampling in high-dimensional regularized regression. arXiv preprint arXiv:2402.13622, 2024.
[14]
Noureddine El Karoui. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445, 2013.
[15]
Pierre C. Bellec and Takuya Koriyama. Existence of solutions to the nonlinear equations characterizing the precise error of M-estimators. arXiv preprint arXiv:2312.13254, 2023.
[16]
Bruno Loureiro, Cedric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mezard, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. Advances in Neural Information Processing Systems, 34: 18137–18151, 2021.
[17]
Mohsen Bayati and Andrea Montanari. The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory, 58 (4): 1997–2017, 2011.
[18]
Pierre C Bellec and Takuya Koriyama. Error estimation and adaptive tuning for unregularized robust M-estimator. arXiv preprint arXiv:2312.13257, 2023.
[19]
Emmanuel J Candès and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48 (1): 27–42, 2020.
[20]
Pragya Sur and Emmanuel J Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116 (29): 14516–14525, 2019.
[21]
Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi. The impact of regularization on high-dimensional logistic regression. Advances in Neural Information Processing Systems, 32, 2019.
[22]
Qian Zhao, Pragya Sur, and Emmanuel J Candes. The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28 (3): 1835–1861, 2022.
[23]
J Leo van Hemmen and Tsuneya Ando. An inequality for trace ideals. Communications in Mathematical Physics, 76: 143–148, 1980.
[24]
Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.
[25]
Pierre C Bellec and Yiwei Shen. Derivatives and residual distribution of regularized M-estimators with application to adaptive tuning. In Conference on Learning Theory, pages 1912–1947. PMLR, 2022.
[26]
Pierre C Bellec and Cun-Hui Zhang. Debiasing convex regularized estimators and interval estimation in linear models. The Annals of Statistics, 51 (2): 391–436, 2023.
[27]
Pierre C Bellec and Cun-Hui Zhang. Second-order Stein: SURE for SURE and other applications in high-dimensional inference. The Annals of Statistics, 49 (4): 1864–1903, 2021.
[28]
Vladimir Igorevich Bogachev. Gaussian measures. Number 62. American Mathematical Soc., 1998.
[29]
Huzihiro Araki and Shigeru Yamagami. An inequality for Hilbert-Schmidt norm. Communications in Mathematical Physics, 81 (1): 89–96, 1981.
[30]
Chen Chun-hui and Sun Ji-guang. Perturbation bounds for the polar factors. Journal of Computational Mathematics, pages 397–401, 1989.
[31]
Pierre C Bellec. Out-of-sample error estimation for M-estimators with convex penalty. Information and Inference: A Journal of the IMA, 12 (4): 2782–2817, 2023.
[32]
Pierre C Bellec. Observable adjustments in single-index models for regularized M-estimators. arXiv preprint arXiv:2204.06990, 2022.
[33]
Alan Edelman. Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9 (4): 543–560, 1988.
[34]
Kenneth R Davidson and Stanislaw J Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the geometry of Banach spaces, 1 (317-366): 131, 2001.
[35]
Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
[36]
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
[37]
Arijit Chaudhuri. Modern Survey Sampling. CRC Press, 2014.

  1. Since only the solution of the system of equations 9 and 10 is of interest, we denote it by \((\sigma,\gamma)\) without extra subscripts for brevity.↩︎