Replicable Learning of Large-Margin Halfspaces


Abstract

We provide efficient replicable algorithms for the problem of learning large-margin halfspaces. Our results improve upon the algorithms provided by [1]. We design the first dimension-independent replicable algorithms for this task that run in polynomial time, are proper, and have strictly improved sample complexity compared to the one achieved by [1] with respect to all the relevant parameters. Moreover, our first algorithm has sample complexity that is optimal with respect to the accuracy parameter \(\epsilon\). We also design an SGD-based replicable algorithm that, in some parameter regimes, achieves better sample and time complexity than our first algorithm. Departing from the requirement of polynomial-time algorithms, using the DP-to-Replicability reduction of [2], we show how to obtain a replicable algorithm for large-margin halfspaces with improved sample complexity with respect to the margin parameter \(\tau\), but with running time doubly exponential in \(1/\tau^2\) and worse sample complexity dependence on \(\epsilon\) than one of our previous algorithms. We then design an improved algorithm with better sample complexity than all three of our previous algorithms and running time exponential in \(1/\tau^{2}\).

1 Introduction

The replicability crisis is omnipresent in many scientific disciplines including biology, medicine, chemistry, and, importantly, AI [3], [4]. A recent article that appeared in Nature [5] explains how the reproducibility crisis in AI has a cascading effect across many other scientific areas due to its widespread applications in other fields such as medicine. Thus, an urgent goal is to design a formal framework through which we can argue about the replicability of experiments in ML. Such a theoretical framework was proposed in a recent work by [1] and has been studied extensively in several learning settings [2], [6][13].

Definition 1 (Replicability [1]). Let \(\mathcal{R}\) be a distribution over random binary strings. A learning algorithm \(\mathcal{A}\) is \(n\)-sample \(\rho\)-replicable if for any distribution \(\mathcal{D}\) over inputs and two independent datasets \(S, S' \sim \mathcal{D}^n\), it holds that \(\mathop{\boldsymbol{P}r\/}_{S,S' \sim \mathcal{D}^n, r \sim \mathcal{R}} [\mathcal{A}(S,r) \neq \mathcal{A}(S',r)] \leq \rho\).

In words, this definition requires that when an algorithm \(\mathcal{A}\) is executed twice on different i.i.d. datasets \(S, S'\) but using shared internal randomness, then the output of the algorithm is exactly the same, with high probability. We note that sharing the randomness across the executions is crucial in achieving this replicability guarantee. Importantly, [10] showed that without sharing randomness, it is impossible to achieve such a strong notion of replicability even for simple tasks such as mean estimation. In practice, this can be achieved by simply publishing the random seed with which the ML algorithms are executed. As we extensively discuss in Section 1.2, Definition 1 turns out to be connected with other notions of stability such as differential privacy and perfect generalization [2], [14].
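To make the role of the shared random string concrete, the following minimal Python sketch (our own toy illustration, not one of the algorithms discussed in this paper) publishes a seed, derives a randomly offset rounding grid from it, and snaps a mean estimate to that grid; two executions on independent samples from the same distribution then return the exact same value with high probability.

```python
import numpy as np

def replicable_mean_estimate(sample: np.ndarray, seed: int, grid: float = 0.5) -> float:
    """Toy replicable statistic: estimate a mean, then snap it to a grid
    whose random offset is derived from the published seed."""
    rng = np.random.default_rng(seed)        # shared internal randomness
    offset = rng.uniform(0.0, grid)          # identical offset in both executions
    est = float(np.mean(sample))
    return offset + grid * round((est - offset) / grid)

# Two independent datasets from the same distribution, one shared seed.
data_rng = np.random.default_rng()
S = data_rng.normal(loc=0.3, scale=1.0, size=5_000)
S_prime = data_rng.normal(loc=0.3, scale=1.0, size=5_000)
seed = 12345                                 # the "published" random seed
print(replicable_mean_estimate(S, seed) == replicable_mean_estimate(S_prime, seed))  # True w.h.p.
```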

In this work, we study the fundamental problem of learning large-margin halfspaces, which means that no example is allowed to lie too close to the separating hyperplane. This task is related to foundational ML techniques such as the Perceptron algorithm [15], SVMs [16], and AdaBoost [17]. Let us recall the concept class of interest.

Definition 2 (Large-Margin Halfspaces). Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}\) whose support does not contain \(x=0\). We say that \(\mathcal{D}\) has linear margin \(\tau\) if there exists a unit vector \(w \in \mathbb{R}^d\) such that for any \((x,y) \in \mathrm{supp}(\mathcal{D})\) it holds that \(y (w^\top x / \|x\|) \geq \tau\).

Following the PAC learning definition of [18], we say that an algorithm learns with accuracy \(\epsilon\) and confidence \(\delta\) the class of \(\tau\)-margin halfspaces in \(d\) dimensions using \(n = n(\epsilon, \delta, d, \tau)\) samples and runtime \(T =T(\epsilon,\delta,d,\tau)\) if, given \(n\) i.i.d. samples from any distribution \(\mathcal{D}\) satisfying Definition 2, the algorithm outputs, in time \(T\), a classifier \(h : \mathbb{R}^d \to \{-1,1\}\) such that \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[h(x) \neq y] \leq \epsilon\), with probability at least \(1-\delta\).

We are interested in replicably learning large-margin halfspaces, i.e., designing algorithms for large-margin halfspaces that further satisfy Definition 1. We remark that when the feature domain is infinite, there is no replicable learning algorithm for halfspaces in general; thus, making an assumption such as the large-margin condition is necessary. In particular, [2], [8] show that finiteness of the Littlestone dimension is a necessary condition for learnability by replicable algorithms, and it is known that even one-dimensional halfspaces over \([0,1]\) have infinite Littlestone dimension. See Table 1 for a comparison of prior work and our contributions.

Table 1: Replicable algorithms for large-margin halfspaces. Sample complexities are stated up to logarithmic factors.

Algorithm | Sample Complexity | Running Time | Proper
Prior work [1], foams rounding | \((d \epsilon^{-3} \tau^{-8} \rho^{-2} )^{1.01}\) | \(2^d \cdot \textrm{poly}(1/\epsilon, 1/\rho, 1/\tau)\) | No
Prior work [1], box rounding | \((d^3 \epsilon^{-4} \tau^{-10} \rho^{-2} )^{1.01}\) | \(\textrm{poly}(d,1/\epsilon, 1/\rho, 1/\tau)\) | No
Algorithm 1 (Theorem 1) | \(\epsilon^{-1} \tau^{-7} \rho^{-2}\) | \(\textrm{poly}(d,1/\epsilon,1/\rho, 1/\tau)\) | Yes
Algorithm 2 (Theorem 2) | \(\epsilon^{-2} \tau^{-6} \rho^{-2}\) | \(\textrm{poly}(d,1/\epsilon,1/\rho, 1/\tau)\) | Yes
DP-to-Replicability reduction (Proposition 3) | \(\epsilon^{-2} \tau^{-4} \rho^{-2}\) | \(\textrm{poly}(d) \cdot \exp\left( (\frac{1}{\tau})^{\frac{\log( 1/(\epsilon\rho))}{\tau^2}} \right)\) | Yes
Algorithm 3 (Theorem 4) | \(\epsilon^{-1} \tau^{-4} \rho^{-2}\) | \(\textrm{poly}(d) \cdot \textrm{poly}(1/\epsilon,1/\rho, 1/\tau)^{1/\tau^2}\) | Yes

The work of [1] provided the first replicable algorithms for \(\tau\)-margin halfspaces over \(\mathbb{R}^d\). The first algorithm of [1], which uses the “foams” discretization scheme [19], is \(\rho\)-replicable and returns a hypothesis \(h\) that, with probability at least \(1-\rho\), satisfies \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[h(x) \neq y] \leq \epsilon\). The sample complexity of this algorithm is roughly \(\widetilde{O}( (d \epsilon^{-3} \tau^{-8} \rho^{-2} )^{1.01})\) and the runtime is exponential in \(d\) and polynomial in \(1/\epsilon, 1/\rho\) and \(1/\tau\). The second algorithm of [1], which uses the “box” discretization scheme, is \(\rho\)-replicable and returns a hypothesis \(h\) that, with probability at least \(1-\rho\), satisfies \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[h(x) \neq y] \leq \epsilon\) with sample complexity \(\widetilde{O}( ( d^{3} \epsilon^{-4} \tau^{-10} \rho^{-2} )^{1.01})\) and runtime which is polynomial in \(d, 1/\epsilon, 1/\rho\) and \(1/\tau\). These two algorithms appear in the first two rows of Table 1.

Some remarks are in order. In our setting, the sample complexity of learning large-margin halfspaces in the absence of the replicability requirement is \(\widetilde{O}(\frac{1}{\epsilon\tau^2}).\) Notice that the sample complexity of both algorithms of [1] depends on the dimension \(d\) of the problem. This is unexpected since the sample complexity of non-replicable algorithms for this task is dimension-independent. In the case of the replicable algorithms of [1], the dependence on the dimension appears due to a rounding/discretization step, which is crucial in establishing the replicability guarantees. Second, both algorithms of [1] are improper in the sense that the hypothesis \(h\) they output does not correspond to a halfspace. This is due to the use of a replicable boosting routine that outputs the majority vote over multiple halfspaces. As a general note, both of these algorithms are fairly complicated: they require multiple discretization/rounding steps, and then they output a weak learner, which finally needs to be boosted using multiple rounds of a replicable boosting scheme. As a result, the sample complexity of their algorithms incurs a significant blow-up in the parameters \(\epsilon,\tau\) compared to the non-replicable setting.

We work in the setting of finite (but not necessarily bounded) bit precision, with the goal of designing algorithms that are agnostic to the marginal distribution and have dimension-independent sample complexity. Indeed, assuming bounded bit precision \(b\), possibly some structure on the marginal distribution, and that the margin \(\tau > 0\), there are replicable algorithms for learning halfspaces with time and sample complexity \(O(\textrm{poly}(d, b, 1/\epsilon, \log(1/\tau), 1/\rho))\) via black-box transformations of existing (complicated) SQ algorithms [20], [21]. In the regime where \(\tau\) is constant and \(d\) is relatively large, our algorithms can outperform these black-box transformations.

1.1 Our Contribution

[1] leave as an open question whether the sample complexity bounds of their algorithms are tight. In this work, we show that these bounds are sub-optimal. We provide new replicable algorithms for learning large-margin halfspaces that improve upon the results of [1] in various aspects. Our algorithms have no sample dependence on \(d\), strictly improve on the dependence on \(1/\epsilon,1/\rho\), and \(1/\tau\), and are proper, meaning that they output linear models. Moreover, our Algorithms 1 and 2 are computationally efficient, while Algorithm 3 forsakes computational efficiency to achieve further improvements in sample complexity.

We now state our first algorithmic result, the proof of which can be found in Section 3.

Theorem 1 (Efficient Replicable Algorithm 1). Fix \(\epsilon, \tau, \rho, \delta\in (0, 1)\). Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}\) that has linear margin \(\tau\) as in Definition 2. There is an algorithm that is \(\rho\)-replicable and, given \(m = \widetilde{O}(\epsilon^{-1} \tau^{-7} \rho^{-2} \log(\frac{1}{\delta}))\) i.i.d. samples \((x,y) \sim \mathcal{D}\), computes in time \(\textrm{poly}(d,1/\epsilon,1/\tau,1/\rho,\log(\frac{1}{\delta}))\) a unit vector \(w \in \mathbb{R}^d\) such that \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[\textrm{sgn}(w^\top x) \neq y] \leq \epsilon\) with probability at least \(1-\delta\).

Algorithm 1 improves on the sample complexity of the two algorithms appearing in [1], runs in polynomial time, and is proper. Our techniques follow a different path from that of [1]. As we alluded to before, their approach is fairly complicated and is based on the design of a replicable weak halfspace learning algorithm and then a replicable boosting algorithm that combines multiple weak learners. Our approach is single-shot and significantly simpler: Consider \(B\) independent non-overlapping batches of training examples. From each batch \(i \in [B]\), we find a hyperplane with normal vector \(w_i \in \mathbb{R}^d\) that has \(\Omega(\tau)\) margin on the training data. This can be achieved by running the standard Support Vector Machine (SVM) algorithm [16], [22]. We then aggregate our vectors to a single average normal vector \(z = (1/B) \sum_{i \in [B]} w_i\). Finally, we project the vector \(z\) onto a lower-dimensional space, whose dimension does not depend on \(d,\) and we replicably round \(z\) using a rounding scheme due to [23] for which we perform a novel analysis in the shared randomness setting. We emphasize that our algorithm gives a halfspace with the desired accuracy guarantee without the need to use any boosting schemes.

At a technical level, we avoid the dependence on the dimension \(d\) thanks to data-oblivious dimensionality reduction techniques (cf. Section 7.3), a standard tool in the literature of large-margin halfspaces. Instead of rounding our estimates in the \(d\)-dimensional space, we first project them to a lower-dimensional space, whose dimension does not depend on \(d\), and we perform the rounding in that space. The crucial idea is that one can use the data-obliviousness of Johnson-Lindenstrauss (JL) matrices so that the projection matrices in two distinct executions are the same since the internal randomness is shared. Another technical aspect of our algorithm that differentiates it from prior works on the design of replicable algorithms is the use of a different rounding scheme known as the Alon-Klartag rounding scheme (cf. Section 2). While this rounding scheme was known, to the best of our knowledge, we are the first to utilize and analyze the scheme in the context of replicability. Consider the simple case of 1-dimensional data. In the same spirit as in [1], we consider a random grid. But rather than rounding the point to a fixed element of each cell of the grid (e.g., its center), we randomly round it to one of the two endpoints of the cell using shared internal randomness so that, in expectation, the rounded point is the same as the original one. This is helpful as it preserves inner products in expectation, and therefore gives better concentration properties across multiple roundings. We believe that this rounding scheme can find more applications in the replicable learning literature. The detailed proof of Theorem 1 can be found in Section 3.

Despite the simplicity of our algorithm, there are technical subtleties that complicate its analysis. For instance, the projection to the low-dimensional space introduces a subtle complication we need to handle. In particular, using ideas from [24] we can show that the aggregated vector in the high-dimensional space has the desired generalization properties. However, when we project it to the low-dimensional space there are vectors that are now misclassified, due to the error introduced by the JL mapping. Using the guarantees of the JL projection, we show that, uniformly over the data-generating distributions, this happens for only a small fraction of the population.

Algorithm 2 follows a similar framework to Algorithm 1: it runs a non-replicable algorithm on independent batches of data and then aggregates the outputs replicably, illustrating the flexibility of our approach. We now state the corresponding result; its proof can be found in Section 4.

Theorem 2 (Efficient Replicable Algorithm 2). Fix \(\epsilon, \tau, \rho, \delta\in (0, 1)\). Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}\) that has linear margin \(\tau\) as in Definition 2. There is an algorithm that is \(\rho\)-replicable and, given \(m = \widetilde{O}(\epsilon^{-2} \tau^{-6} \rho^{-2} \log(\frac{1}{\delta}))\) i.i.d. samples \((x,y) \sim \mathcal{D}\), computes in time \(\textrm{poly}(d,1/\epsilon,1/\tau,1/\rho,\log(\frac{1}{\delta}))\) a unit vector \(w \in \mathbb{R}^d\) such that \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[\textrm{sgn}(w^\top x) \neq y] \leq \epsilon\) with probability at least \(1-\delta\).

Compared to Algorithm 1, our Algorithm 2 achieves better dependence on \(\tau\) by incurring an additional \(1/\epsilon\) factor in the sample complexity. At a technical level, as in [25], we provide a convex surrogate that upper bounds the loss \(\mathbb{1}\{y(x^\top w) < \tau/2\}\). Running SGD on this convex surrogate provides a unit vector \(w\) that, in expectation over the data, achieves a margin of at least \(\tau/2\) for all but an \(O(\epsilon)\)-mass of the population. We then apply a standard boosting trick to turn this guarantee into a high probability bound. Next, we work as in Algorithm 1: we run the above procedure \(B\) times to get \(w_1,...,w_B\) and aggregate our vectors into a single vector \(z = (1/B)\sum_{i \in [B]} w_i\). Lastly, we perform a JL-projection on \(z\) and then round using the Alon-Klartag rounding scheme, as in Algorithm 1.

1.1.0.1 Computationally Inefficient Reductions from DP.

It is a corollary of the works of [2] and [8] that one can use existing differentially private (DP) algorithms in order to obtain replicable learners. In particular, following the reduction of [2], one can obtain a replicable algorithm for large-margin halfspaces with better sample complexity in terms of \(\tau\), but in a computationally inefficient way. The idea is to take an off-the-shelf DP algorithm (recall 3) for this problem (e.g., [25]) and transform it into a replicable one. We remark that this transformation applies only when the DP algorithm outputs finitely many different solutions, and the running time of the resulting replicable algorithm is exponential in the number of these solutions. Fortunately, the pure DP algorithm from [25] satisfies this finite co-domain property. The formal statement of the result we get by combining these two algorithms is presented below.

Proposition 3 (Inefficient Replicable Algorithm; follows from [2], [25]). Fix \(\epsilon, \tau, \rho, \delta\in (0, 1)\). Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}\) that has linear margin \(\tau\) as in Definition 2. There is an algorithm that is \(\rho\)-replicable and, given \(m = \widetilde{O}(\epsilon^{-2} \tau^{-4} \rho^{-2} \log(\frac{1}{\delta}))\) i.i.d. samples \((x,y) \sim \mathcal{D}\), computes, in time \({\exp\left( (\frac{1}{\tau})^{\frac{\log(1/(\epsilon \rho \delta))}{\tau^2}} \right)} \cdot \textrm{poly}(d)\), a unit vector \(w \in \mathbb{R}^d\) such that \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[\textrm{sgn}(w^\top x) \neq y] \leq \epsilon\), with probability at least \(1- \delta\).

As mentioned above, the proof of this result follows by combining the DP-to-Replicability transformation of [2] (cf. 8) with the pure DP algorithm for learning large-margin halfspaces due to [25] (cf. 7). We note that since the algorithm of [25] is proper and the reduction of [2] is based on sub-sampling, the output of 3 is also a linear classifier. The main issue with this approach is that, apart from not being a polynomial time algorithm, the reduction requires a quadratic blow-up in the sample complexity of the provided DP algorithm.

To be more specific, the DP algorithm of [25] has sample complexity \(\widetilde{O}(\epsilon^{-1} \tau^{-2})\) for accuracy \(\epsilon\) and margin \(\tau\). This means that the replicable algorithm of 3 incurs a quadratic blow-up in the sample complexity on the parameters \(\epsilon,\tau\). The cost of this transformation is tight under standard cryptographic hardness assumptions [2]. Thus, it is unlikely that we can reduce the dependence on the error parameter \(\epsilon\) using such a generic transformation. We emphasize that our efficient replicable algorithm (cf. 1) has linear sample complexity dependence on \(1/\epsilon.\) The blow-up on the running time of the algorithm is due to the use of correlated sampling in the transformation of [2] which requires exponential running time in the size of the output space. We remark that in the case of the algorithm of [25], the size of the output space is already exponential in \(1/\tau^2.\)

In our work, we also revisit this inefficient algorithm and improve on its sample complexity and runtime, as follows:

Theorem 4 (Improved Inefficient Replicable Algorithm 3). Fix \(\epsilon, \tau, \rho, \delta\in (0, 1)\). Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}\) that has linear margin \(\tau\) as in Definition 2. There is a \(\rho\)-replicable algorithm that, given \(m = \widetilde{O}(\epsilon^{-1} \tau^{-4} \rho^{-2} \log(\frac{1}{\delta}))\) i.i.d. samples \((x,y) \sim \mathcal{D}\), computes in time \(\textrm{poly}(d) \cdot \textrm{poly}(1/\epsilon, 1/\tau, 1/\rho, 1/\delta)^{1/\tau^2}\) a unit vector \(w \in \mathbb{R}^d\) satisfying \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[\textrm{sgn}(w^\top x) \neq y] \leq \epsilon\) with probability at least \(1- \delta\).

Compared to the DP-to-Replicability reduction (Proposition 3), Theorem 4 has better dependence on \(1/\epsilon\) and better running time. For the proof of this result, we refer to Section 5.

1.2 Related Work

In terms of technique, a related work is that of [26] for differentially private clustering. The authors also use the JL projection followed by a discretization step as part of their algorithm. Similar to our techniques, this serves to avoid a factor in the discretization scheme that scales with the dimension. One important difference is that [26] solves a task on the samples while we need to solve a task on a distribution using samples. Moreover, while [26] also use a discretization step (not based on the Alon-Klartag scheme) to find a “dense” part of the space to use as the center of the cluster, the discretization step in our work serves a different purpose: to ensure that our estimates, across two i.i.d. executions, will be the same.

1.2.0.1 Replicability.

Pioneered by [1], there has been a growing interest from the learning theory community in studying replicability as an algorithmic property in various learning tasks. Among other things, their work showed that the fundamental class of statistical queries, which appears in various settings (see e.g., [27][30] and the references therein) can be made replicable. Subsequently, [6], [7] studied replicable algorithms in the context of multi-armed bandits and clustering. Later, [12], [13] studied replicability in the context of Reinforcement Learning. Recently, [2], [8], [31] established equivalences between replicability and other notions of algorithmic stability such as differential privacy (DP), and [32] derived more fine-grained characterizations of these equivalences. It is worth mentioning that [33] had already established equivalences between various notions of algorithmic stability and finiteness of the Littlestone dimension of the underlying concept class. Inspired by [1], a related line of work [9][11] proposed and studied alternative notions of replicability such as list-replicability, where the requirement is that when the algorithm is executed multiple times on i.i.d. datasets, then the number of different solutions it will output across these executions is small.

1.2.0.2 Large-Margin Halfspaces.

The problem of learning large-margin halfspaces has been extensively studied and has inspired various fundamental algorithms [15], [17], [34], [35]. In the DP setting, [36] gave a dimension-dependent construction based on a private version of the perceptron algorithm. This was later improved by [25] who gave new DP algorithms for this task with dimension-independent guarantees based on the Johnson-Lindenstrauss transformation. Next, [37] constructed noise-tolerant and private PAC learners for large-margin halfspaces whose sample complexity also does not depend on the dimension. [38] and [39] designed private algorithms for learning halfspaces without margin guarantees when the domain is finite. [40] stated an open problem of finding optimal DP algorithms for learning large-margin halfspaces both with respect to their running time and their sample complexity. [41] studied DP algorithms for various learning tasks with margin, including halfspaces, kernels, and neural networks. In the area of robust statistics, [42] showed a statistical-computational tradeoff in the problem of PAC learning large-margin halfspaces with random classification noise. For further results on robustly learning large-margin halfspaces, we refer to [43] and the references therein.

2 The Main Tool: The Alon-Klartag Rounding Scheme

Inspired by [23], we introduce and use the following rounding scheme \(\mathtt{AKround}(z, \beta)\) for a point \(z\) with parameter \(\beta\): let \(o = (o_1, \dots, o_k) \sim_{i.i.d.} U[0, \beta]\) be uniformly random offsets and implicitly discretize \(\mathbb{R}^k\) using a grid of side length \(\beta\) centered at \(o\). Let \(o(z)\in \mathbb{R}^k\) denote the “bottom-left” corner of the cube in which \(z\) lies, i.e., the point obtained by rounding down all the coordinates of \(z\). For a vector \(z\), we let \(z[i]\) be its \(i\)-th coordinate. Define \(p(z)[i]\in [0, 1]\) to be such that \[p(z)[i] \cdot o(z)[i] + (1-p(z)[i])\cdot (o(z)[i] + \beta) = z[i]\,.\]

Given offsets \(o\) and thresholds \(u = (u_1,\dots,u_k)\) with \(u_i \sim U[0,1]\), round a vector \(z\) to \(f_{o,u}(z)\) where the \(i\)-th coordinate is equal to \(o(z)[i]\) if \(u_i \leq p(z)[i]\) and \(o(z)[i] + \beta\) otherwise. Crucially, in expectation, the rounded point \(f_{o,u}(z)\) coincides with \(z\).
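The following Python sketch implements the coordinate-wise rounding just described; the function name and signature are ours, and the generator `rng` stands in for the shared internal randomness that supplies the offsets and thresholds.

```python
import numpy as np

def ak_round(z: np.ndarray, beta: float, rng: np.random.Generator) -> np.ndarray:
    """Sketch of AKround(z, beta): round each coordinate of z to one of the
    two surrounding points of a randomly offset grid of side length beta,
    with probabilities chosen so that the rounding is unbiased."""
    k = z.shape[0]
    o = rng.uniform(0.0, beta, size=k)             # random grid offsets o_1, ..., o_k
    u = rng.uniform(0.0, 1.0, size=k)              # rounding thresholds u_1, ..., u_k
    lower = o + beta * np.floor((z - o) / beta)    # the "bottom-left" corner o(z)
    p = (lower + beta - z) / beta                  # p(z)[i]: probability of rounding down
    return np.where(u <= p, lower, lower + beta)   # f_{o,u}(z), with E[f_{o,u}(z)] = z
```

Since the same generator state is reused across two executions, two nearby inputs are rounded to the identical grid point with high probability, which is exactly the property quantified in the next lemma.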

The next lemma, whose proof can be found in Section 8, is useful for deriving the replicability guarantees of our rounding scheme.

Lemma (Replicability of AKround). Let \(z, z' \in \mathbb{R}^k\). Then for independent uniform offsets \(o_1,\dots,o_k \in [0,\beta]\) and thresholds \(u_1,\dots,u_k \in [0,1]\), we have \(\mathop{\boldsymbol{P}r\/}_{o,u}[f_{o,u}(z) \neq f_{o,u}(z')] \leq 2 \beta^{-1} \|z-z'\|_1.\)

Our novel analysis of the stability of the rounding scheme under shared randomness (cf. the lemma above) demonstrates its useful properties for designing replicable algorithms. We believe these properties may be of interest beyond the scope of this work and can find applications in designing replicable algorithms for different problems.

Next, we show that the Alon-Klartag rounding scheme additively preserves inner products with high probability. This is formalized below in a lemma whose proof can be found in Section 8.

Lemma (AKround Preserves Inner Products). Let \(z, x \in \mathbb{R}^k\) be such that \(\|x\|\leq 1\). For uniform offsets \(o_1,\dots,o_k \in [0,\beta]\) and thresholds \(u_1,\dots,u_k \in [0,1]\), we have \(\mathop{\boldsymbol{P}r\/}_{o,u}\left[ | f_{o,u}(z)^\top x - z^\top x| > \alpha \right] \leq 2 \exp(-2 \alpha^2 \beta^{-2}).\)

It is worth mentioning that the Alon-Klartag rounding scheme, along with dimensionality reduction techniques, was also used by [24] in order to prove generalization bounds for SVMs.

3 Replicably Learning Large-Margin Halfspaces: Algorithm 1

In this section, we describe our first algorithm and prove its guarantees as stated in Theorem 1. Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}\) with linear margin \(\tau\in (0, 1)\) (cf. Definition 2). Thus, there is a unit vector \(w^\star \in \mathbb{R}^d\) such that for all \((x,y) \in \mathop{\mathrm{supp}}(\mathcal{D})\) (where \(x \neq 0\)), we have \(y (x^\top w^\star /\|x\|) \geq \tau\). Given \(\epsilon, \rho, \delta\in (0, 1)\), our goal is to design a \(\rho\)-replicable learning algorithm that draws \(m = m(\epsilon,\tau,\rho,\delta)\) i.i.d. samples from \(\mathcal{D}\) and outputs \(\hat{w}\in \mathbb{R}^d\) such that, with probability at least \(1-\delta\) over the randomness of the samples and (potentially) the internal randomness of the algorithm, it holds that \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[y(\hat{w}^\top x) \leq 0] \leq \epsilon\).

3.0.0.1 Description of Algorithm 1.

We consider \(B\) batches of \(n\) samples each. Hence, in total, we draw \(n B\) i.i.d. samples from \(\mathcal{D}\). On each batch \(i \in [B]\), we run the standard SVM algorithm (cf. Lemma 1) to find a hyperplane with normal vector \(w_i \in \mathbb{R}^d\) that has margin at least \(\tau/2\) on all training data in the batch. We then compute the average normal vector \(z = (1/B) \sum_{i \in [B]} w_i\). Finally, we project \(z\) using a shared JL matrix and round the projection as described in Section 2.

Figure 1: Algorithm 1 (Replicable Large-Margin Halfspaces)
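For concreteness, the following schematic Python implementation mirrors the pipeline above (and the pseudocode of Figure 1). It is a sketch under simplifying assumptions: scikit-learn's `SVC` with a linear kernel and a large regularization constant stands in for the hard-margin SVM step, the parameters `B`, `k`, and `beta` are left as inputs instead of being set to the values dictated by the analysis, and the rounding step inlines the rounding sketch from Section 2.

```python
import numpy as np
from sklearn.svm import SVC  # stand-in for the hard-margin SVM subroutine

def algorithm1_sketch(X, y, B, k, beta, seed):
    """Schematic version of Algorithm 1: per-batch SVM, averaging,
    shared JL projection, and Alon-Klartag rounding."""
    shared = np.random.default_rng(seed)                    # shared internal randomness
    d = X.shape[1]
    A = shared.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))   # shared JL matrix, entries N(0, 1/k)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)       # normalize x -> x / ||x||

    ws = []
    for Xb, yb in zip(np.array_split(Xn, B), np.array_split(y, B)):
        w = SVC(kernel="linear", C=1e6).fit(Xb, yb).coef_[0]  # ~ max-margin separator per batch
        ws.append(w / np.linalg.norm(w))                      # unit normal vector w_i
    z = np.mean(ws, axis=0)                                   # aggregate normal vector

    Az = A @ z                                                # project to k dimensions
    o = shared.uniform(0.0, beta, size=k)                     # AKround offsets
    u = shared.uniform(0.0, 1.0, size=k)                      # AKround thresholds
    lower = o + beta * np.floor((Az - o) / beta)
    b = np.where(u <= (lower + beta - Az) / beta, lower, lower + beta)

    w_hat = A.T @ b                                           # lift the rounded vector back to R^d
    return w_hat / np.linalg.norm(w_hat)
```

Two executions on independent datasets that reuse the same seed share the matrix \(A\) and the rounding randomness, which is what the replicability analysis below exploits; each batch is assumed to contain examples of both labels so that the SVM step is well defined.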

3.0.0.2 Correctness of Algorithm 1.

A straightforward adaptation of the results of [24] (cf. Lemma 1) shows that, with probability \(1-\delta/(10 B)\) over the samples, the classifier \(w_i\) has margin at least \(\tau/4\) in a \(\left(1-O \left(\frac{\log n + \log(B/\delta)}{\tau^2n} \right) \right)\) fraction of the population, i.e., \[\mathop{\boldsymbol{P}r\/}_{(x,y)\sim \mathcal{D}}[y(w_i^\top x)/\|x\| < \tau/4] \leq O\left(\frac{\log n + \log(B/\delta)}{\tau^2n}\right) \,.\] We denote the complement of this event as \(E_i\) and condition on not observing \(\cup_{i \in [B]} E_i\).

Now under this event, for all the vectors \(w_i, i \in [B]\), and all points \((x,y) \in \mathbb{R}^d \times \{-1,1\}\), it holds that \(y(w_i^\top x)/\|x\| \geq -1,\) since \(w_i\) is a unit vector. Furthermore, for a \((1- \widetilde{O}(\frac{1}{\tau^2 n}))\)-fraction of the population, the margin is at least \(\tau/4\), i.e., \(y(w_i^\top x)/\|x\| \geq \tau/4.\) Intuitively, this means that the vector \(z = \frac{1}{B } \sum_{i \in [B]} w_i\) should have margin at least \(\tau/8\), except for an \(\widetilde{O}(1/(\tau^3 n))\) fraction of the population. Formally, for \((x,y)\sim \mathcal{D}\), let \(Z_i\) be the indicator of the event that \(y\cdot (w_i^\top x) / \|x\| < \tau/4\) and \(Z := \sum_{i\in [B]} Z_i\). Then \[\begin{align} &y(z^\top x)/\|x\| \\ &= y\left( \frac{1}{B} \sum_{i \in [B]} w_i\right)^\top (x/\|x\|) = \frac{1}{B} \sum_{i \in [B]} y\cdot w_i^\top(x/\|x\|) \\ &\geq \frac{1}{B} \left(-Z + (B-Z)\tau/4\right) = \tau/4 - \frac{Z}{B}(1+\tau/4) \,. \end{align}\] This means that if \(y(z^\top x)/\|x\| < \tau/8\), then \[\begin{align} \tau/4 - \frac{Z}{B}(1+\tau/4) < \tau/8 \implies Z > \frac{\tau}{16} \cdot B \,. \end{align}\] It suffices to bound the probability of the event that \(Z > \Omega(\tau B)\) to bound the population error of \(z\).

Notice that the summation of the fractions of the population where the \(w_i\) have margin less than \(\tau/4\) is at most \(O(B(\log n + \log(\frac{B}{\delta}))/(\tau^2n))\). As noted above, at least \(\Omega(\tau B)\) of the classifiers must simultaneously have margin less than \(\tau/4\) for \(z\) to misclassify \(x\). Thus the fraction of the population where \(z\) has margin smaller than \(\tau/8\) is at most \[\begin{align} O\left(\frac{B(\log n + \log(\frac{B}{\delta}))}{\tau \cdot B (\tau^2n)}\right) &= O\left( \frac{\log n + \log(\frac{B}{\delta})}{\tau^3 n} \right) \,. \end{align}\] Thus, choosing \(n \geq \widetilde{\Omega}(\log(\frac{B}{\delta})/(\tau^3 \epsilon))\) ensures that the intermediate normal vector \(z = (1/B) \sum_{i=1}^B w_i\) satisfies \(\mathop{\boldsymbol{P}r\/}_{(x, y)\sim \mathcal{D}} [y(z^\top x/\|x\|) < \tau/8] \leq \epsilon/10\) with probability at least \(1-\delta/10\).

The following lemma, whose proof is deferred to Section 9, ensures that projecting and rounding in the lower dimension approximately preserve the performance of \(z\) with respect to the 0-1 loss (as opposed to the \(\tau/8\)-loss).

Lemma (Projection and Rounding Preserve Accuracy). Fix \(\epsilon, \tau, \delta\in (0, 1)\) and let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d\times \{\pm 1\}\) that admits a linear classifier with \(\tau\)-margin. Suppose \(z\) is a random unit vector satisfying \(\mathop{\boldsymbol{P}r\/}_{(x, y)\sim \mathcal{D}} [y(z^\top x/\|x\|) < \tau/2] \leq \epsilon\) with probability at least \(1-\delta\) over the draw of \(z\). Define \(k = \Omega(\tau^{-2}\log(\frac{1}{\epsilon\delta}))\) and \(\beta = O(\tau/\log(\frac{1}{\epsilon\delta}))\). If \(A\in \mathbb{R}^{k\times d}\) is a JL-matrix (cf. Section 7.3) and \(b = \mathtt{AKround}(Az, \beta)\) (cf. Section 2), then \(\hat{w} = A^\top b/ \|A^\top b\|\) satisfies \(\mathop{\boldsymbol{P}r\/}_{(x, y)\sim \mathcal{D}} [y(\hat{w}^\top x/\|x\|) \leq 0] \leq 2\epsilon\) with probability at least \(1-2\delta\).

Applying this lemma with \(\tau' = \tau/4\), \(\epsilon' = \epsilon/10\), and \(\delta' = \delta/10\) yields that the final output \(\hat{w}\) of Algorithm 1 has 0-1 population error at most \(\epsilon/5\) with probability at least \(1-\delta/5\), as desired.

3.0.0.3 Replicability of Algorithm 1.

We now state the lemma from which the replicability guarantees of our algorithm follow. Its proof is deferred to Section 9.

Lemma (Projection and Rounding are Replicable). Fix \(\epsilon, \tau, \rho, \delta\in (0, 1)\). Suppose \(w_1, \dots, w_B\) and \(w_1', \dots, w_B'\) are i.i.d. random unit vectors for \(B = \Omega(\tau^{-4}\rho^{-2}\log(\frac{1}{\epsilon\tau\rho\delta}))\) and \(z = (1/B) \sum_{i\in [B]} w_i\), \(z' = (1/B) \sum_{i\in [B]} w_i'\) are their averages. Define \(k = \Theta(\tau^{-2}\log(\frac{1}{\epsilon\tau\rho\delta}))\) and \(\beta = \Theta(\tau/\log(\frac{1}{\epsilon\tau\rho\delta}))\). If \(A\in \mathbb{R}^{k\times d}\) is a JL-matrix (cf. Section 7.3) and \(b = \mathtt{AKround}(Az, \beta)\), \(b' = \mathtt{AKround}(Az', \beta)\) (cf. Section 2), then \(b = b'\) with probability at least \(1-\rho\) over the draw of the \(w_i, w_i'\)’s, \(A\), and the randomness of \(\mathtt{AKround}\).

Note that the \(w_i\)’s are i.i.d. unit vectors across all batches and both executions, since the samples in the \(2B\) batches of the two executions are drawn from the same distribution \(\mathcal{D}\) and the output of the SVM algorithm depends only on its input sample. Hence an application of the projection-and-rounding replicability lemma ensures that Algorithm 1 is indeed \(\rho\)-replicable.

3.0.0.4 Sample Complexity & Running Time of Algorithm 1.

The sample complexity is \(nB = \widetilde{O}(\epsilon^{-1} \tau^{-7} \rho^{-2} \log(\frac{1}{\delta}))\). By inspection, we see that the total running time of Algorithm 1 is \(\textrm{poly}(d, n) = \textrm{poly}(d, 1/\epsilon, 1/\tau, 1/\rho, \log(\frac{1}{\delta}))\).

4 Replicably Learning Large-Margin Halfspaces: Algorithm 2

Let \(B_1^d\) denote the unit \(\ell_2\)-ball in \(d\) dimensions. Our approach is inspired by the work of [25] that designed a similar SGD approach for learning large-margin halfspaces under differential privacy constraints. Consider the following surrogate loss \[\begin{align} h : B_1^d\times B_1^d\times \{-1, +1\} &\to \mathbb{R}_{\geq 0}, \\ h(w; x, y) &:= \max\left( 0, 2-\frac{2}{\tau} y(x^\top w) \right) \\ &\geq \mathbb{1}\{y(x^\top w) < \tau/2\}. \end{align}\] We remark that \(h(w; x, y) \geq 1\) when \(y(x^\top w) \leq \tau/2\) and \(h(w; x, y) = 0\) when \(y(x^\top w) \geq \tau\). Also, since \(x, w\in B_1^d\), an application of the Cauchy-Schwarz inequality reveals that \(h(w; x, y)\in [0, 2+2/\tau]\). Finally, \(h\) is piecewise linear with each piece being \(2/\tau\)-Lipschitz. Hence, \(h\) is \(O(1/\tau)\)-Lipschitz.
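In code, the surrogate and one of its subgradients look as follows (a minimal sketch; the function names are ours, and the regularizer added to the objective in the next paragraph is handled separately):

```python
import numpy as np

def surrogate_loss(w: np.ndarray, x: np.ndarray, y: float, tau: float) -> float:
    """h(w; x, y) = max(0, 2 - (2/tau) * y * <x, w>): zero once the margin
    reaches tau, and at least 1 whenever the margin is below tau/2."""
    return max(0.0, 2.0 - (2.0 / tau) * y * float(x @ w))

def surrogate_subgradient(w: np.ndarray, x: np.ndarray, y: float, tau: float) -> np.ndarray:
    """A subgradient of h at w (zero on the flat piece of the hinge)."""
    if y * float(x @ w) < tau:            # sloped piece: h is positive here
        return -(2.0 / tau) * y * x
    return np.zeros_like(w)
```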

4.0.0.1 Description of Algorithm 2.

Fix \(\epsilon\in (0, 1)\). We seek to minimize the following loss function over the ball \(B_1^d\): \[f_{\mathcal{D}}(w) := \mathop{\boldsymbol{E}\/}_{(x, y)\sim \mathcal{D}} [h(w; x, y)] + \frac{\epsilon}{10} \|w\|^2.\] By construction, the co-domain of \(f_\mathcal{D}\) lies within \([0, 2+2/\tau+\epsilon/10]\subseteq[0, O(1/\tau)]\). Note also that \(f_\mathcal{D}\) is an upper bound on the \(\tau/2\)-population loss, i.e., \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}} [y(x^\top w) < \tau/2] \leq f_\mathcal{D}(w)\,.\) First, regarding the minima of \(f_\mathcal{D}\), note that any vector \(w \in B_1^d\) achieving a margin of \(\tau\) satisfies \(f_{\mathcal{D}}(w)\leq \epsilon/10\). This is because \(h(w; x, y) = 0\) for all \((x, y)\in \mathop{\mathrm{supp}}(\mathcal{D})\). As a result, \(\min_{w \in B_1^d} f_\mathcal{D}(w) \leq \epsilon/10\). Second, let us consider an \(\epsilon/10\)-optimal solution \(w'\) with respect to \(f_{\mathcal{D}}\), i.e., \(f_\mathcal{D}(w') - \min_{w \in B_1^d} f_\mathcal{D}(w) \leq \epsilon/10\,.\) The above discussion implies that \(f_{\mathcal{D}}(w') \leq \epsilon/5\), so \(w'\) achieves margin \(\tau/2\) on all but an \(\epsilon/5\)-fraction of the population, i.e., \(w'\) satisfies \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}} [y(x^\top w') < \tau/2] \leq \epsilon/5\,.\)

Note that we may assume without loss of generality that the marginal of \(\mathcal{D}\) over features is supported over \(B_1^d\), since we normalize the input \(x\mapsto x/\|x\|\) before applying a classifier \(w\).

Since \(f_\mathcal{D}\) is \(O(\epsilon+ 1/\tau) = O(1/\tau)\)-Lipschitz and \(\Omega(\epsilon)\)-strongly convex, we can apply the following standard result:

Theorem 5 (Theorem 6.2 in [44]). Let \(f\) be \(\mu\)-strongly convex with minimizer \(w_*\) and assume that the (sub)gradient oracle \(g(w)\) satisfies \(\mathop{\boldsymbol{E}\/}[\|g(w)\|^2]\leq G^2\). Then after \(T\) iterations, projected stochastic gradient descent \(\mathtt{SGD}(\mathcal{D}, f, T)\) with step size \(\eta_t = \frac{2}{\mu(t+1)}\) satisfies \[\mathop{\boldsymbol{E}\/}\left[ f\left( \sum_{t=1}^T \frac{2t}{T(T+1)} w_t \right)\right] - f(w_*) \leq \frac{2G^2}{\mu(T+1)}.\]

Since \(G^2 = O(1/\tau^2)\), \(\mu = \Omega(\epsilon)\), and \(f_\mathcal{D}(w_*) \leq \epsilon/10\), choosing \(T\geq \Omega(\epsilon^{-2}\tau^{-2})\) yields an \(\epsilon/10\)-optimal solution in expectation. Repeating this process independently a small number of times and outputting the solution with the lowest estimated objective yields an \(\epsilon/5\)-optimal solution with high probability.
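A minimal sketch of the resulting projected SGD, using the step sizes and iterate weights of Theorem 5 on \(f_\mathcal{D}\) (the sampling oracle, names, and the choice \(\mu = \epsilon/5\) coming from the regularizer are our own illustrative conventions):

```python
import numpy as np

def projected_sgd(sample_oracle, d: int, tau: float, eps: float, T: int) -> np.ndarray:
    """Sketch of projected SGD on f_D(w) = E[h(w; x, y)] + (eps/10)||w||^2
    over the unit ball B_1^d, following the scheme of Theorem 5."""
    mu = eps / 5.0                                   # strong convexity of the regularizer
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, T + 1):
        x, y = sample_oracle()                       # fresh draw (x, y) ~ D with ||x|| <= 1
        g = (eps / 5.0) * w                          # gradient of (eps/10)||w||^2
        if y * float(x @ w) < tau:                   # sloped piece of the surrogate h
            g -= (2.0 / tau) * y * x
        w = w - (2.0 / (mu * (t + 1))) * g           # step size eta_t = 2 / (mu (t + 1))
        norm_w = np.linalg.norm(w)
        if norm_w > 1.0:                             # project back onto B_1^d
            w = w / norm_w
        w_avg += (2.0 * t / (T * (T + 1))) * w       # weighted average of the iterates
    return w_avg
```

The boosting step of the next lemma simply repeats this procedure \(O(\log(\frac{1}{\delta}))\) times and keeps the run with the smallest estimated objective.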

Lemma (Boosted SGD). Let \(B_1^d\) be the unit \(\ell_2\)-ball in \(d\) dimensions. Fix \(\epsilon, \tau, \delta\in (0, 1)\) and let \(\mathcal{D}\) be a distribution over \(B_1^d\times \{\pm 1\}\) that admits a linear \(\tau\)-margin classifier. There is an algorithm \(\mathtt{boostSGD}(\mathcal{D}, \epsilon, \tau, \delta)\) that outputs a unit vector \(\tilde{w}\in \mathbb{R}^d\) such that \(f_{\mathcal{D}}(\tilde{w})\leq \min_{w\in B_1^d} f_\mathcal{D}(w) + \epsilon\) with probability at least \(1-\delta\). Moreover, \(\mathtt{boostSGD}\) has sample complexity \(\widetilde{O}(\epsilon^{-2}\tau^{-2} \log(\frac{1}{\delta}))\) and running time \(\mathrm{poly}(\frac{1}{\epsilon}, \frac{1}{\tau}, \log(\frac{1}{\delta}), d)\).

The proof of the boosted-SGD lemma is deferred to Section 10. Next, we repeat \(\mathtt{boostSGD}\) and take an average to ensure concentration before proceeding as in Algorithm 1 with the random projection and rounding in the lower-dimensional space. Compared to Algorithm 1, we obtain an improved sample dependence on \(\tau\) for Algorithm 2 since taking an average of \(\epsilon\)-optimal solutions to a convex objective function yields an \(\epsilon\)-optimal solution. However, we pay an extra factor of \(1/\epsilon\) in order to run \(\mathtt{SGD}\) as a subroutine.

Figure 2: Algorithm 2 (Replicable Large-Margin Halfspaces)

The technical details of Algorithm 2 (the proof of Theorem 2) appear in Section 10.1.

5 Replicably Learning Large-Margin Halfspaces: Algorithm 3

In this section, we provide an algorithm whose sample complexity scales as \(\widetilde{O}(\epsilon^{-1}\tau^{-4}\rho^{-2}\log(\frac{1}{\delta}))\) and whose running time is \(\textrm{poly}(d) \cdot (\textrm{poly}(1/\epsilon, 1/\rho, 1/\tau, 1/\delta))^{1/\tau^2}\). We remark that this sample complexity is better than the one obtained from the DP transformation in Proposition 3. Moreover, the running time is exponentially better than that obtained through the DP transformation.

Before describing the algorithm, we state a very useful result due to [2] related to the sample complexity of replicably learning finite hypothesis classes, which is used in the proof of Theorem 4.

Proposition 6 (Theorem 5.13 in [2]). Consider a finite concept class \(\mathcal{H}\). There is a \(\rho\)-replicable agnostic PAC learner \(\mathtt{rLearnerFinite}\) with accuracy \(\epsilon\) and confidence \(\delta\) for \(\mathcal{H}\) with sample complexity \(n = O\left( \frac{\log^2|\mathcal{H}| + \log(\frac{1}{\rho \delta})}{\epsilon^2 \rho^2} \log^3(\frac{1}{\rho})\right)\). Moreover, if \(\mathcal{D}\) is realizable then the sample complexity drops to \(\widetilde{O}(\epsilon^{-1} \rho^{-2} \log^2|\mathcal{H}|)\). Finally, the algorithm terminates in time \(\textrm{poly}(|\mathcal{H}|,n).\)

5.0.0.1 Description of Algorithm 3.

Let us first provide the algorithm’s pseudo-code.

Figure 3: Algorithm 3 (Replicable Large-Margin Halfspaces)

Similar to our other algorithms, we start by using a JL matrix to project the training set to a \(k\)-dimensional space, for \(k = \widetilde{\Theta}(\tau^{-2}).\) Then, we use a \((\tau/20)\)-net to cover the unit vectors of this \(k\)-dimensional space, so the size of the net is \(\left(\frac{C''}{\tau}\right)^{\widetilde{O}(\tau^{-2})}\) for some absolute constant \(C'' > 0.\) We think of the points of the net as our hypothesis class \(\mathcal{H}_\tau\). By the properties of the JL transform, we can show that, with probability at least \(1-\delta/10\), there exists a vector in this class that classifies the entire training set correctly. Moreover, we can show that this classifier has small generalization error. This is formalized in the following result, which is essentially a high-probability version of Lemma 5 from [25].

Lemma (High-Probability JL Guarantee). Fix \(\epsilon', \delta_{JL}\in (0, 1)\). Let \(\mathcal{D}\) satisfy Definition 2 with margin \(\tau\) and suppose the unit vector \(w^\star\) satisfies \(y\,(w^\star)^\top x/\|x\| \geq \tau\) for every \((x, y)\in \mathop{\mathrm{supp}}(\mathcal{D})\). For a JL-matrix \(A \in \mathbb{R}^{k \times d}\), as in Lemma 3, with \(k = \Omega(\tau^{-2} \log(1/\delta_{JL}))\), let \(G_A \subseteq \mathbb{R}^d \times \{-1,1\}\) be the set of points \((x,y)\) of \(\mathop{\mathrm{supp}}(\mathcal{D})\) that satisfy

  • \(\left|\|Ax\|^2 - \|x\|^2\right| \leq \tau \|x\|^2/100,\) and,

  • \(y (A w^\star/\|A w^\star\|)^\top (A x/\|A x\|) \geq 96 \tau/100\).

Let \(E_1\) be the event (over \(A\)) that \(\mathop{\boldsymbol{P}r\/}_{(x,y)\sim \mathcal{D}}[(x,y)\in G_A] \geq 1 - \epsilon'\) and \(E_2\) be the event (over \(A\)) that \(\left|\|Aw^\star\|^2 - \|w^\star\|^2\right| \leq \tau \|w^\star\|^2/100.\) Then it holds that \(\mathop{\boldsymbol{P}r\/}_{A}[E_1 \cap E_2] \geq 1-\delta_{JL}/\epsilon'\).

The proof of the high-probability JL lemma appears in Section 7.3.1. One way to view this lemma is that, with probability \(1-\delta_{JL}/\epsilon'\) over the random choice of \(A\), the classifier \(Aw^\star/\|Aw^\star\|\) will have \(96 \tau/100\) margin on a \(1-\epsilon'\) fraction of \(\mathcal{D},\) where the choice of \(\epsilon'\) will be specified later. Let us condition on this event for the rest of the proof, and call it \(E_r.\) Let \(\tilde{w}^\star\) be the point of the net that \(Aw^\star/\|Aw^\star\|\) is rounded to. Recall that we round to the closest point on a \(\tau/20\)-net of the unit ball with respect to the \(\ell_2\)-norm, so that \(\tilde{w}^\star\) is \(\tau/20\)-close to the normalized version of \(A w^\star\). Notice that under the event \(E_r\), for all points \((x,y) \in G_A\) we have that \[\begin{align} y\,(\tilde{w}^\star)^\top \frac{A x}{\|A x\|} &= y\,\frac{(A w^\star)^\top}{\|A w^\star\|} \frac{A x}{\|A x\|} - y\left( \frac{Aw^\star}{\|A w^\star\|} - \tilde{w}^\star \right)^\top \frac{A x}{\|A x\|} \\ &\geq y\,\frac{(A w^\star)^\top}{\|A w^\star\|} \frac{A x}{\|A x\|} - \left\|\frac{Aw^\star}{\|A w^\star\|} - \tilde{w}^\star\right\| \cdot \frac{\|Ax\|}{\|Ax\|} \\ &\geq 96\tau/100 - \tau/20 > 9\tau/10\,, \end{align}\] where the first inequality follows from Cauchy-Schwarz and the second from the definition of \(G_A\) and of the net.

Since the hypothesis class \(\mathcal{H}_\tau\) has finite size, we can use Proposition 6 from [2], which states that \(\widetilde{O}(\epsilon^{-1} \rho^{-2} \log(\frac{1}{\delta}) \log^2|\mathcal{H}|)\) samples are sufficient to \(\rho\)-replicably learn a hypothesis class \(\mathcal{H}\) in the realizable setting with error at most \(\epsilon.\) One technical complication we need to handle is that \(\mathcal{D}\) is not necessarily realizable with respect to \(\mathcal{H}_\tau.\) Nevertheless, under \(E_r,\) we have shown that for \(\tilde{w}^\star \in \mathcal{H}_\tau\) it holds that \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[y((\tilde{w}^\star)^\top Ax/\|Ax\|) < 9\tau/10] \leq \epsilon'.\) Let us denote by \(\mathcal{D}_r\) the distribution \(\mathcal{D}\) conditioned on the event that \(y((\tilde{w}^\star)^\top Ax/\|Ax\|) \geq 9\tau/10\) and by \(\mathcal{D}_b\) its complement, i.e., \(\mathcal{D}\) conditioned on \(y((\tilde{w}^\star)^\top Ax/\|Ax\|) < 9\tau/10.\) Then, we can express \(\mathcal{D}= (1-p_b)\cdot \mathcal{D}_r + p_b\cdot \mathcal{D}_b,\) where \(p_b \leq \epsilon'.\) Hence, in a sample of size \(n\) from \(\mathcal{D}\), with probability at least \(1-n\cdot \epsilon'\) we only see samples drawn i.i.d. from \(\mathcal{D}_r\). Let us call this event \(E_r'\) and condition on it. Let us choose \(\epsilon' = \delta/(10n)\) and \(\delta_{JL} = \epsilon'^2/10 = \delta^2/(1000 n^2)\) so that \(\mathop{\boldsymbol{P}r\/}[E_r\cap E_r'] \geq 1-\delta\).

Under the events \(E_r, E_r',\) we can use the replicable learner from [2] to learn the realizable distribution \(\mathcal{D}_r\): we run the algorithm from Proposition 6 with parameters \(\delta/10,\epsilon/10,\rho/10\). Since \(|\mathcal{H}_\tau| = \left(\frac{C}{\tau}\right)^{\widetilde{O}(\tau^{-2})}\) for some absolute constant \(C\), we see that we need \(n = \widetilde{O}\left(\epsilon^{-1} \tau^{-4} \rho^{-2} \log(\frac{1}{\delta}) \right)\) samples.
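The following sketch shows the overall structure of Algorithm 3. Building the net by gridding the cube is exponential in \(k\) (mirroring the \(\exp(\widetilde{O}(1/\tau^2))\) running time, so it is only feasible for toy parameters), and the finite-class learner \(\mathtt{rLearnerFinite}\) of Proposition 6 is treated as a black box rather than implemented; the helper names are ours.

```python
import itertools
import numpy as np

def tau_net(tau: float, k: int) -> np.ndarray:
    """A (roughly) tau/20-net of the unit sphere in R^k, built by gridding
    [-1, 1]^k at resolution tau / (20 sqrt(k)) and normalizing each grid
    point.  Its size is exponential in k."""
    step = tau / (20.0 * np.sqrt(k))
    grid = np.arange(-1.0, 1.0 + step, step)
    pts = []
    for p in itertools.product(grid, repeat=k):
        v = np.asarray(p)
        n = np.linalg.norm(v)
        if n >= 0.5:                  # grid points near a unit vector have norm close to 1
            pts.append(v / n)
    return np.array(pts)

# Schematic flow of Algorithm 3 (rLearnerFinite is the replicable finite-class
# learner of Proposition 6 and is not implemented here):
#   A       = shared JL matrix in R^{k x d} with k = Theta~(1 / tau^2)
#   H_tau   = tau_net(tau, k)
#   S_proj  = {(A x / ||A x||, y) : (x, y) in the training sample}
#   b       = rLearnerFinite(H_tau, S_proj, eps / 10, rho / 10, delta / 10)
#   output the halfspace with normal vector A^T b (normalized)
```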

5.0.0.2 Replicability of Algorithm 3.

Let us condition on the events \(E_r, E_r'\) across two executions of the algorithm. Since the matrix \(A\) is the same in both executions, due to shared randomness, under these events the samples that the learner of [2] receives are i.i.d. from the same realizable distribution. The replicability guarantee then follows immediately from the guarantees of Proposition 6 and the fact that all the events we have conditioned on occur with probability at least \(1-\rho/2.\)

5.0.0.3 Correctness of Algorithm 3.

Here we examine the population error of the output of Proposition 6. Note that this error is analyzed with respect to the original distribution \(\mathcal{D}\), which is not necessarily realizable, rather than \(\mathcal{D}_r\), which is the distribution on which we run the learning algorithm.

Let us again condition on the events \(E_r, E_r',\) and the event that the output of Proposition 6 satisfies the generalization bound it states. We assume without loss of generality that \(\delta \leq \epsilon/2;\) otherwise we set \(\delta = \epsilon/2\) without affecting the overall sample complexity of our algorithm. Notice that these events occur with probability at least \(1-\delta.\) Let \(b\) be the output of the algorithm. Then we have \[\begin{align} \mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[y (b^\top Ax) < 0] &= (1-p_b) \mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}_r}[y (b^\top Ax) < 0] \\ &\qquad + p_b \mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}_b}[y (b^\top Ax) < 0] \\ &\leq (1-p_b) \epsilon/10 + p_b < \epsilon. \end{align}\] Since \(\textrm{sgn}(b^\top Ax) = \textrm{sgn}((A^\top b)^\top x)\), the same bound holds for the halfspace with normal vector \(A^\top b\). This concludes the proof.

5.0.0.4 Sample Complexity & Running Time of Algorithm 3.

As noted before, we need \(n = \widetilde{O}\left(\epsilon^{-1} \tau^{-4} \rho^{-2} \log(\frac{1}{\delta}) \right)\) samples. Since we apply Proposition 6, we incur a running time of \(\textrm{poly}(d) \cdot \textrm{poly}(1/\epsilon,1/\rho,1/\tau,1/\delta)^{1/\tau^2}.\)

6 Conclusion

In this work, we have developed new algorithms for replicably learning large-margin halfspaces. Our results vastly improve upon prior work on this problem. We believe that many immediate questions for future research arise from our work. First, it is natural to ask whether there are efficient algorithms that can achieve the \(\widetilde{O}(\epsilon^{-1} \tau^{-4} \rho^{-2}\log(\frac{1}{\delta}))\) sample complexity bound of Algorithm 3. Also, it would be interesting to see if there are any (not necessarily efficient) replicable algorithms whose sample complexity scales as \(\widetilde{O}(\epsilon^{-1} \tau^{-2} \rho^{-2})\) or if there is some inherent barrier to pushing the dependence on \(\tau\) below \(\tau^{-4}.\) Finally, our analysis of Algorithm 1 is pessimistic in the sense that it uses a pigeonhole principle argument to establish that the fraction of the population where the aggregate vector does not have margin \(\Omega(\tau)\) is \(\widetilde{O}(\frac{1}{\tau^3n}).\) It would be interesting to see whether this bound can be improved to \(\widetilde{O}(\frac{1}{\tau^2n})\) using a different analysis, which would reduce the overall dependence of the algorithm on \(\tau.\)

Acknowledgements

Kasper Green Larsen is supported by Independent Research Fund Denmark (DFF) Sapere Aude Research Leader grant No 9064-00068B. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N00014-19-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Grigoris Velegkas is supported by TILOS, the Onassis Foundation, and the Bodossaki Foundation. Felix Zhou is supported by TILOS.

7 Deferred Tools

7.1 SVM Guarantees

The following result is a restatement of Theorem 2 from [24].

Lemma 1 (SVM Generalization Guarantee [24]). Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}.\) Let \(n \in \mathbb{N}, S = (x_1,y_1),\ldots,(x_n,y_n)\sim \mathcal{D}^n\), and \(w \in \mathbb{R}^d\) be a unit vector such that \(y_i\left(w^\top \frac{x_i}{\|x_i\|}\right) \geq \tau\) for all \(i \in [n]\). Then, for every \(\delta > 0\), with probability at least \(1-\delta\) over the random draw of \(S\), it holds that \[\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[y\cdot(w^\top x/\|x\|) < \tau/2] \leq O\left(\frac{\log n + \log(\frac{1}{\delta})}{\tau^2 n}\right) \,.\]

We remark that even though the result of [24] is stated for the generalization error with respect to the misclassification probability, i.e., \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[y\cdot(w^\top x/\|x\|) \leq 0]\), their argument also applies to the \(\tau/2\)-margin loss, i.e., \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[y\cdot(w^\top x/\|x\|) < \tau/2]\), via a straightforward modification of some constants. In more detail, the only needed change is in the proof of part 1 of Claim 10 in [24]. Here the first condition says \(y\cdot(x^\top w) \leq 0\) and could be made e.g. \(y\cdot(x^\top w) \leq \tau/4\). Their result then holds with \(\tau/4\) instead of \(\tau/2\).

7.2 Vector-Valued Bernstein Concentration Inequality

We will use the following concentration inequality for the norm of random vectors.

Lemma 2 (Vector Bernstein; [45]). Let \(X_1, \dots, X_B\) be independent random vectors with common dimension \(d\) satisfying the following for all \(i\in [B]\):

(i) \(\mathop{\boldsymbol{E}\/}[X_i] = 0\)

(ii) \(\|X_i\|\leq \mu\)

(iii) \(\mathop{\boldsymbol{E}\/}[\|X_i\|^2]\leq \sigma^2\)

Let \(Z := \frac{1}{B} \sum_{i=1}^B X_i\). Then for any \(t\in (0, \frac{\sigma^2}{\mu})\), \[\mathop{\boldsymbol{P}r\/}[\|Z\| \geq t] \leq \exp\left( -\frac{t^2 B}{8\sigma^2} + \frac{1}{4} \right).\]

7.3 Johnson-Lindenstrauss Lemma

The remarkable result of [46] states that an appropriately scaled random orthogonal projection matrix preserves the norm of a unit vector with high probability. [47] showed that it suffices to independently sample each entry of the matrix from the standard normal distribution. [48] further simplified the construction to independently sample each entry from the Rademacher distribution. See [49] for a detailed survey of the development.

From here on, we say that a random matrix \(A\in \mathbb{R}^{k\times d}\) is a JL-matrix if either \(A_{ij} \sim_{i.i.d.} \mathcal{N}(0, 1/k)\) or \(A_{ij} \sim_{i.i.d.} U\{-1/\sqrt{k}, +1/\sqrt{k}\}\).
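As a quick numerical illustration of this definition (the dimensions below are arbitrary), both constructions preserve the squared norm of a fixed vector up to a small relative distortion:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 400
x = rng.normal(size=d)

A_gauss = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))          # entries N(0, 1/k)
A_rademacher = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)  # entries +-1/sqrt(k)

for A in (A_gauss, A_rademacher):
    distortion = abs(np.linalg.norm(A @ x) ** 2 - np.linalg.norm(x) ** 2)
    print(distortion / np.linalg.norm(x) ** 2)   # small relative error, roughly O(1/sqrt(k))
```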

We first state the standard distributional formulation of JL.

Lemma 3 (Distributional JL; [46]-[48]). Fix \(\epsilon, \delta_{JL}\in (0, 1)\). Let \(A \in \mathbb{R}^{k \times d}\) be a JL-matrix for \(k = \Omega(\epsilon^{-2} \log(\frac{1}{\delta_{JL}}))\). Then for any \(x \in \mathbb{R}^d\), \[{\mathop{\boldsymbol{P}r\/}_A[ \left|\|Ax\|^2 - \|x\|^2\right| > \epsilon\|x\|^2] \leq \delta_{JL}\,.}\]

Let \(T\) be a set of vectors. By applying 3 to the \(O(|T|^2)\) vectors \(u-v\) for \(u, v\in T\) and taking a union bound, we immediately deduce the following result.

Lemma 4 (JL Projection; [46]-[48]). Fix \(\epsilon, \delta_{JL}\in (0, 1)\). Consider a set \(T\) of \(d\)-dimensional vectors and a JL-matrix \(A \in \mathbb{R}^{k \times d}\) for \(k = \Omega(\epsilon^{-2}\log(\frac{|T|}{\delta_{JL}}))\). Then, \[{\mathop{\boldsymbol{P}r\/}_A \left[ \exists ~ {u, v\in T} : \left|\| A(u - v) \|^2 - \|u-v\|^2\right| > \epsilon\|u-v\|^2 \right] \leq {\delta_{JL}}} \,.\]

An application of Lemma 4 to the polarization identity for \(z, x\in \mathbb{R}^d\) \[4z^\top x = \|z+x\|^2 - \|z-x\|^2\] yields the following inner product preservation guarantee.

Corollary 1 (JL Inner Product Preservation). Fix \(\epsilon, {\delta_{JL}}\in (0, 1)\). Let \(A \in \mathbb{R}^{k \times d}\) be a JL-matrix for \(k = \Omega(\epsilon^{-2}\log(\frac{1}{\delta_{JL}}))\). Then, for any \(x, z\in \mathbb{R}^d\), \[\mathop{\boldsymbol{P}r\/}_A[|z^\top x - (Az)^\top Ax| > \epsilon\|z\|\cdot \|x\|] \leq {\delta_{JL}} \,.\]

The next lemma is another simple implication of the distributional JL.

Lemma 5 (Lemma 5 in [25]). Let \(\mathcal{D}\) satisfy Definition 2 with margin \(\tau\). For a JL-matrix \(A\) as in Lemma 3, let \(G_A \subseteq \mathbb{R}^d \times \{-1,1\}\) be the set of points \((x,y)\) of the population that satisfy

  • \(\left|\|Ax\|^2 - \|x\|^2\right| \leq \tau \|x\|^2/100\) and

  • \(y (A w^\star/\|A w^\star\|)^\top (A x/\|A x\|) \geq 96 \tau/100\).

Then it holds that \(\mathop{\boldsymbol{P}r\/}_{A, (x, y)\sim \cal D}[(x,y) \in G_A] \geq 1-\delta_{JL}\).

7.3.1 The Proof of the High-Probability JL Lemma

We finally prove the high-probability JL lemma stated in Section 5.

Proof of the high-probability JL lemma. We know from Lemma 5 that \[\begin{align} \mathop{\boldsymbol{E}\/}_{A}\left[\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[(x,y) \notin G_A]\right] &\leq \delta_{JL} \,, \end{align}\] so Markov’s inequality gives that \[\begin{align} \mathop{\boldsymbol{P}r\/}_{A} \left[\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[(x,y) \notin G_A] \geq \epsilon' \right] \leq \frac{\mathop{\boldsymbol{E}\/}_{A}\left[\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[(x,y) \notin G_A]\right]}{\epsilon'} \leq \frac{\delta_{JL}}{\epsilon'} \,. \end{align}\] Similarly, the guarantees of the JL projection immediately yield \[\mathop{\boldsymbol{P}r\/}_A \left[ \left|\|Aw^\star\|^2 - \|w^\star\|^2\right| > \tau \|w^\star\|^2/100\right] \leq \delta_{JL}.\] ◻

8 Details for Section 2

Here we fill in the missing proofs from Section 2.

First, we prove the AKround replicability lemma.

Proof of the AKround replicability lemma. Fix a coordinate \(i \in [k]\). The probability that \(o(z)[i] \neq o(z')[i]\) is at most \(|z[i]-z'[i]| \beta^{-1}\). Assume \(o(z)[i] = o(z')[i]\). Then, by the definition of \(p(z), p(z')\) we have that \[\begin{align} |z[i]-z'[i]| &= \left| (p(z)[i]-p(z')[i])o(z)[i] \right. \\ &\qquad + \left. (p(z')[i]-p(z)[i])(o(z)[i]+\beta) \right| \\ &= |(p(z')[i]-p(z)[i])\beta|. \end{align}\] Note then that, in this case, the probability of \(f_{o, u}(z)[i] \neq f_{o, u}(z')[i]\) is \(|p(z')[i]-p(z)[i]| = |z[i]-z'[i]| \beta^{-1}\).

By the uniform choice of \(u_i\), we thus conclude that \(\mathop{\boldsymbol{P}r\/}_{o,u}[f_{o,u}(z)[i] \neq f_{o,u}(z')[i]] \leq 2\beta^{-1} |z[i] - z'[i]|\). A union bound over all \(k\) coordinates implies \(\mathop{\boldsymbol{P}r\/}_{o,u}[f_{o,u}(z) \neq f_{o,u}(z')] \leq 2\beta^{-1} \|z-z'\|_1\). ◻

Next, we prove the inner-product preservation lemma.

Proof of the inner-product preservation lemma. The random variables \(f_{o,u}(z)[i]\) are independent, each lies in an interval of length \(\beta\), and each has expectation \(z[i]\). By linearity, \(f_{o,u}(z)^\top x - z^\top x = (f_{o,u}(z)-z)^\top x\). Let \(v = f_{o,u}(z)-z\). Then \(\mathop{\boldsymbol{E}\/}[x[i]\cdot v[i]]=0\) and \(x[i]\cdot v[i]\) lies in a range of length \(\beta |x[i]|\). By Hoeffding’s inequality and since \(\|x\|\leq 1\), we have \(\mathop{\boldsymbol{P}r\/}[| v^\top x | > \alpha] < 2 \exp(-2\alpha^2/(\sum_i \beta^2x[i]^2)) \leq 2 \exp(-2 \alpha^2 \beta^{-2})\). ◻

9 Details for Algorithm 1

Here we fill in the missing proofs from Section 3.

We begin with the projection-and-rounding accuracy lemma.

Proof of the projection-and-rounding accuracy lemma. Let \(\mathcal{D}_z := \mathcal{D}\mid \{y(z^\top x/\|x\|) \geq \tau/2\}\) be the distribution obtained from \(\mathcal{D}\) by conditioning on \(z\) having large margin and let \(\mathcal{D}_b\) denote the distribution \(\mathcal{D}\) conditioned on the complement. We can decompose \(\mathcal{D}= (1-p_b)\cdot \mathcal{D}_z + p_b\cdot \mathcal{D}_b\) for \(p_b\leq \epsilon\). To complete the correctness argument, we need to show that, with high probability, the projection of \(z\) and \(x\) onto the low-dimensional space (giving \(Az\) and \(Ax\)) and the rounding \(b = \mathtt{AKround}(Az, \beta)\) approximately preserve the inner product \(y(b^\top Ax/\|x\|)\) for a \(1-\epsilon\) fraction of \(\mathcal{D}_z\). Then the final classifier \(A^\top b\) still has low population error. In particular, it suffices to show that \(\mathop{\boldsymbol{P}r\/}_{(x, y)\sim \mathcal{D}_z}[y(b^\top Ax/\|x\|) < \tau/4]\leq \epsilon\) with probability at least \(1-\delta\) over the random choice of \(A, b\). Then, the total population error is at most \(\epsilon+ p_b\leq 2\epsilon\) with probability at least \(1-2\delta\).

Fix \((x, y)\in \mathop{\mathrm{supp}}(\mathcal{D}_z)\) and remark that \[\begin{align} &y\cdot b^\top \frac{Ax}{\|x\|} \\ &= y\cdot (Az)^\top \frac{Ax}{\|x\|} - y\cdot (Az - b)^\top \frac{Ax}{\|x\|}. \end{align}\] By the choices of \[\begin{align} k \geq {\Omega}(\tau^{-2}\log(\frac{1}{\epsilon\delta})), \qquad \beta \leq O(\tau/\log(\frac{1}{\epsilon\delta})), \end{align}\] the JL guarantee (cf. Corollary 1) and the Alon-Klartag rounding scheme (cf. the inner-product preservation lemma) ensure that \[\mathop{\boldsymbol{E}\/}_{A, b} \left[ \mathop{\boldsymbol{P}r\/}_{(x, y)\sim \mathcal{D}_z} [y(b^\top Ax/\|x\|) < \tau/4] \right] \leq \epsilon\delta.\] An application of Markov’s inequality yields \[\mathop{\boldsymbol{P}r\/}_{A, b} \left[ \mathop{\boldsymbol{P}r\/}_{(x, y)\sim \mathcal{D}_z} [y(b^\top Ax/\|x\|) < \tau/4] >\epsilon\right] \leq \delta.\] All in all, with probability at least \(1-\delta\) over \(A, b\), the population error is at most \[\begin{align} &\mathop{\boldsymbol{P}r\/}_{(x,y)\sim\mathcal{D}}[y(A^\top b)^\top x \leq 0] \\ &\leq \mathop{\boldsymbol{P}r\/}_{(x,y)\sim\mathcal{D}} [y(b^\top Ax/\|x\|) < \Theta(\tau)] < 2\epsilon, \end{align}\] concluding the correctness argument of our algorithm. ◻

Next, we prove [lem:projecting rounding is replicable], whose statement we again repeat.

Proof of [lem:projecting rounding is replicable]. An application of the vector Bernstein concentration inequality (cf. 2) yields the following: let \(S_1 = \left\{w_i^{(1)}\right\}_{i \in [B]}\) and \(S_2 = \left\{w_i^{(2)}\right\}_{i \in [B]}\) be two sets of independent and identically distributed random vectors in \(\mathbb{R}^d\) with \(\left\|w_i^{(j)}\right\| \leq 1\) for all \(i\in [B], j \in \{1,2\}\).
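For the reader’s convenience, one standard form of the vector Bernstein inequality that suffices for this step (cf. 2; such a bound appears, e.g., in [45]) is the following: if \(X_1, \dots, X_B \in \mathbb{R}^d\) are independent with \(\mathop{\boldsymbol{E}\/}[X_i] = 0\), \(\|X_i\| \leq \mu\), and \(\mathop{\boldsymbol{E}\/}[\|X_i\|^2] \leq \sigma^2\), then for any \(0 < t < \sigma^2/\mu\), \[\mathop{\boldsymbol{P}r\/}\left[ \left\| \frac{1}{B} \sum_{i=1}^B X_i \right\| \geq t \right] \leq \exp\left( -\frac{B t^2}{8\sigma^2} + \frac{1}{4} \right).\]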

Then we can set \(X_i = w_i^{(1)} - \mathop{\boldsymbol{E}\/}w\). Notice that \(\|X_i\|\leq 2\) and \(\|X_i\|^2\leq 4\). Hence, an application of 2 yields, for any \(t\in (0, 2)\), \[\begin{align} &\mathop{\boldsymbol{P}r\/}_{S_1} \left[ \left\| \frac{1}{B} \sum w_i^{(1)} - \mathop{\boldsymbol{E}\/}[w] \right\| \geq t \right] \\ &\leq \exp\left( -\frac{t^2 B}{32} + \frac{1}{4} \right) = \exp(-\Omega(t^2 B)). \end{align}\] Let \(z = \frac{1}{B} \sum_{i \in [B]} w_i^{(1)}, z' = \frac{1}{B} \sum_{i \in [B]} w_i^{(2)}.\) Choosing \(B\geq \Omega(t^{-2}\log(\frac{1}{\rho}))\) ensures that \(\mathop{\boldsymbol{P}r\/}_{S_1, S_2} \left[ \left\| z - z' \right\| \geq t \right] \leq \rho/10.\)

We condition on the event that \(\left\|z - z' \right\| \leq t\). Since we share the randomness across the two executions, we use the same JL projection matrix \(A\). Thus, our choice of \(k\geq \Omega(\tau^{-2} \log(\frac{1}{\rho}))\) gives that \(\left\| Az - Az'\right\| \leq (1+\tau)t\leq 2t\) with probability at least \(1-\rho/10\) (cf. 4). It remains to show that after the rounding step, with probability at least \(1-\rho/10\), the two rounded vectors coincide. The width of our rounding grid is \(\beta = \tilde{\Theta}(\tau)\) and the target dimension of our JL matrix is \(k = \tilde{\Theta}(\tau^{-2})\). Thus, by [lemma:round], the probability that the two points round to different vectors is at most \(O(\beta^{-1}) \|Az-Az'\|_1 \leq O(\beta^{-1}\sqrt{k}) \|Az-Az'\|_2 = O(t\sqrt{k} / \beta)\). Thus, picking \[\begin{align} t &\leq \rho\beta/\sqrt{k}, \nonumber \\ B &\geq \tilde{\Omega}(k/(\beta^2\rho^2)) = {\Omega}(\log(\frac{1}{\epsilon\tau\rho\delta}) / (\tau^{4} \rho^{2})) \end{align}\] completes the argument. ◻
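As an illustration of how the pieces fit together, here is a schematic Python sketch of the aggregate-project-round step analyzed in this section. The Gaussian JL construction and the helper name `replicable_compress` are illustrative assumptions rather than the paper's exact implementation, and all constants are ignored.

```python
import numpy as np

def replicable_compress(vectors, k, beta, shared_seed):
    """Average i.i.d. vectors, project with a JL matrix derived from the
    shared seed, and round on a randomly offset grid also derived from
    the shared seed.  Two executions on independent datasets reuse the
    same `shared_seed`, so they can only disagree if their averages are
    far apart."""
    d = vectors.shape[1]
    rng = np.random.default_rng(shared_seed)   # shared across executions

    z = vectors.mean(axis=0)                   # concentrates around E[w]
    A = rng.normal(size=(k, d)) / np.sqrt(k)   # JL projection matrix
    Az = A @ z

    offsets = rng.uniform(0.0, beta, size=k)   # shared grid offsets
    thresholds = rng.uniform(0.0, 1.0, size=k) # shared rounding thresholds
    lower = np.floor((Az - offsets) / beta) * beta + offsets
    b = np.where(thresholds < (Az - lower) / beta, lower + beta, lower)

    return A.T @ b                             # classifier in the original space
```

Replicability then follows exactly as argued above: the two averages are within \(t\) of each other with high probability, the shared matrix \(A\) keeps their projections within \(2t\), and the shared rounding grid maps both projections to the same \(b\) except with probability \(O(t\sqrt{k}/\beta)\).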

10 Details for 2↩︎

Here we fill in the missing proofs from 4.

We now prove [lem:boosted SGD], whose statement we repeat below for convenience.

Proof of [lem:boosted SGD]. Let \(T = \Omega(\tau^{-2}\epsilon^{-2})\). Run \(\mathtt{SGD}(\mathcal{D}, T)\) independently \(n = O(\log(\frac{1}{\delta}))\) times to obtain solutions \(w_1, \dots, w_n\). 5 ensures that each \(w_i\), \(i\in [n]\), is an \(\epsilon/10\)-optimal solution in expectation. Markov’s inequality then guarantees that each repetition yields an \(\epsilon/5\)-optimal solution with probability at least \(1/2\). But then, with probability at least \(1-\delta/10\), at least one of the \(n\) solutions is \(\epsilon/5\)-optimal.

By a Hoeffding bound, we can estimate each \(f_\mathcal{D}(w_i)\in [0, O(1/\tau)]\) up to an additive \(\epsilon/10\) error with probability at least \(1-\delta/10\) using \(O(n\epsilon^{-2}\tau^{-2}\log(\frac{n}{\delta}))\) samples. Outputting the classifier with the lowest estimated objective yields a \(3\epsilon/10\)-optimal solution with probability at least \(1-\delta/5\).

Normalizing the selected classifier to a unit vector can only decrease \(h\), since the feasible region is \(B_1^d\), and incurs an additional loss of at most \(\epsilon/10\) due to the regularizer. Thus we have a \(4\epsilon/10\)-optimal solution with probability at least \(1-\delta/5\).

The total sample complexity is \(O(nT) + O(n\epsilon^{-2}\tau^{-2}\log(\frac{n}{\delta})) = \tilde{O}(\epsilon^{-2}\tau^{-2}\log(\frac{1}{\delta})).\) ◻
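The boosting step in this proof is simple enough to state as code. The sketch below uses hypothetical interfaces `sample`, `sgd`, and `surrogate_loss` standing in for the sampling oracle, the SGD routine, and the convex surrogate \(f_\mathcal{D}\); it is only meant to illustrate the repeat-validate-select logic and omits constants.

```python
import numpy as np

def boosted_sgd(sample, sgd, surrogate_loss, T, eps, delta, loss_range):
    """Confidence boosting for a randomized learner: run SGD several
    times, estimate each candidate's surrogate loss on a fresh
    validation set, and return the normalized candidate with the
    smallest estimate."""
    n = int(np.ceil(np.log2(10.0 / delta)))    # number of repetitions
    candidates = [sgd(sample, T) for _ in range(n)]

    # Validation set sized (up to constants) so that, by Hoeffding,
    # each estimate is within an additive eps/10 of the true loss.
    m = int(np.ceil((loss_range / eps) ** 2 * np.log(2 * n / delta)))
    X_val, y_val = sample(m)
    estimates = [surrogate_loss(w, X_val, y_val) for w in candidates]

    best = candidates[int(np.argmin(estimates))]
    return best / np.linalg.norm(best)         # output a unit vector
```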

10.1 Proof of 2↩︎

10.1.0.1 Correctness of 2.

From the choice of \(n \geq \Omega(\epsilon^{-2} \tau^{-2} \log(\frac{B}{\delta})),\) [lem:boosted SGD] ensures that each unit vector \(w_i\) produced in 2 is \(\epsilon/10\)-optimal with probability at least \(1-\delta/(10B)\). Hence, with probability at least \(1-\delta/10\), every \(w_i\) is \(\epsilon/10\)-optimal with respect to \(f_\mathcal{D}\). By Jensen’s inequality, \(z = (1/B) \sum_{i=1}^B w_i\) is \(\epsilon/10\)-optimal with probability at least \(1-\delta/10\). But then, by the choice of \(f_\mathcal{D}\) as a convex surrogate loss for \(\mathbb{1}\{y(x^\top w) < \tau/2\}\), \(z\) satisfies \(\mathop{\boldsymbol{P}r\/}_{(x, y)\sim \mathcal{D}} [y(z^\top x/\|x\|) < \tau/2] \leq 2\epsilon/10\) with probability at least \(1-\delta/10\). In other words, the unit vector \(z\) has a population \(\tau/2\)-loss of at most \(2\epsilon/10\) with probability at least \(1-\delta/10\). But then, similarly to the correctness of 1, an application of [lem:projecting rounding preserves margin] concludes the proof of correctness.
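Spelled out, the Jensen step above reads \[f_\mathcal{D}(z) = f_\mathcal{D}\left( \frac{1}{B} \sum_{i=1}^B w_i \right) \leq \frac{1}{B} \sum_{i=1}^B f_\mathcal{D}(w_i) \leq \min_{w} f_\mathcal{D}(w) + \frac{\epsilon}{10},\] on the event that every \(w_i\) is \(\epsilon/10\)-optimal, where the minimum ranges over the feasible region.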

10.1.0.2 Replicability of 2.

Similarly to the case of 1, we note that the outputs \(w_i\) of the executions of \(\mathtt{boostSGD}\) are i.i.d. unit vectors. Thus, an application of [lem:projecting rounding is replicable] yields the replicability guarantee of 2.

10.1.0.3 Sample Complexity & Running Time of 2.

The sample complexity is \(nB = \tilde{O}(\epsilon^{-2}\tau^{-6}\rho^{-2}\log(\frac{1}{\delta}))\), as required. Once again, we see by inspection that 2 terminates in time \(\textrm{poly}(d, 1/\epsilon, 1/\tau, 1/\rho, \log(\frac{1}{\delta}))\).

11 Details for 3↩︎

11.1 Differential Privacy↩︎

For \(a,b, \alpha,\beta \in [0,1]\), let \(a \approx_{\alpha, \beta} b\) denote the statement \(a \leq e^\alpha b + \beta\) and \(b \leq e^\alpha a + \beta\). We say that two probability distributions \(P, Q\) are \((\alpha, \beta)\)-indistinguishable if \(P(E) \approx_{\alpha, \beta} Q(E)\) for any measurable event \(E\).10

Definition 3 (Approximate Differential Privacy). A learning rule \(A\) is \(n\)-sample \((\alpha,\beta)\)-differentially private if for any pair of datasets \(S, S' \in (\mathcal{X}\times \{0,1\})^n\) that differ on a single example, the induced posterior distributions \(A(S)\) and \(A(S')\) are \((\alpha, \beta)\)-indistinguishable.

11.2 The Results of [25] and [2]↩︎

3 is based on combining the following two results. The first is a differentially private learner for large-margin halfspaces from [25].

Proposition 7 (Theorem 6 in [25]). Let \(\alpha, \tau, \epsilon, \delta > 0\). Let \(\mathcal{D}\) be a distribution over \(\mathbb{R}^d \times \{-1,1\}\) that has linear margin \(\tau\) as in 2. There is an algorithm that is \((\alpha,0)\)-differentially private and, given \(m = \widetilde{O}(\alpha^{-1} \epsilon^{-1} \tau^{-2})\) i.i.d. samples \((x,y) \sim \mathcal{D}\), computes in time \(\exp(1/\tau^2)\textrm{poly}(d, 1/\alpha, 1/\epsilon, \log(\frac{1}{\delta}))\) a normal vector \(w \in \mathbb{R}^d\) such that \(\mathop{\boldsymbol{P}r\/}_{(x,y) \sim \mathcal{D}}[\textrm{sgn}(w^\top x) \neq y] \leq \epsilon\), with probability at least \(1- \delta\).

We also use the DP-to-Replicability reduction appearing in [2].

Proposition 8 (Corollary 3.18 in [2]). Fix \(n \in \mathbb{N}\), sufficiently small \(\rho \in (0,1)\), \(\epsilon, \delta\in (0, 1)\) and \(\alpha, \beta > 0\). Let \(\mathcal{A}: \mathcal{X}^n \to \mathcal{Y}\) be an \(n\)-sample \((\alpha,\beta)\)-differentially private algorithm with finite output space solving a statistical task with accuracy \(\epsilon\) and failure probability \(\delta\). Then there is an algorithm \(\mathcal{A}' : \mathcal{X}^{m} \to \mathcal{Y}\) that is \(\rho\)-replicable and solves the same statistical task with \(m = O(\rho^{-2} n^2)\) samples with accuracy \(\epsilon\) and failure probability \(\delta\).
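For intuition, and glossing over the discretization needed to meet the finite-output-space requirement of Proposition 8, instantiating Proposition 8 with the private learner of Proposition 7 gives a replicable learner for large-margin halfspaces with sample complexity \[m = O(\rho^{-2} n^2) = \widetilde{O}\left( \frac{1}{\rho^{2} \alpha^{2} \epsilon^{2} \tau^{4}} \right),\] which, for constant \(\alpha\), is \(\widetilde{O}(\rho^{-2}\epsilon^{-2}\tau^{-4})\): a quartic dependence on \(1/\tau\) at the price of a quadratic dependence on \(1/\epsilon\).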

References↩︎

[1]
Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 818–831. ACM, 2022. URL https://doi.org/10.1145/3519935.3519973.
[2]
Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 520–527, 2023.
[3]
Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533 (7604), 2016.
[4]
Joelle Pineau, Koustuv Sinha, Genevieve Fried, Rosemary Nan Ke, and Hugo Larochelle. ICLR reproducibility challenge 2019. ReScience C, 5 (2): 5, 2019.
[5]
Philip Ball. Is AI leading to a reproducibility crisis in science? Nature, 624 (7990): 22–25, 2023.
[6]
Hossein Esfandiari, Amin Karbasi, Vahab Mirrokni, Grigoris Velegkas, and Felix Zhou. Replicable clustering. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[7]
Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits. In The Eleventh International Conference on Learning Representations, 2023.
[8]
Alkis Kalavasis, Amin Karbasi, Shay Moran, and Grigoris Velegkas. Statistical indistinguishability of learning algorithms. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 15586–15622. PMLR, 2023. URL https://proceedings.mlr.press/v202/kalavasis23a.html.
[9]
Zachary Chase, Shay Moran, and Amir Yehudayoff. Stability and replicability in learning. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2430–2439. IEEE, 2023. URL https://doi.org/10.1109/FOCS57990.2023.00148.
[10]
Peter Dixon, Aduri Pavan, Jason Vander Woude, and N. V. Vinodchandran. List and certificate complexities in replicable learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[11]
Zachary Chase, Bogdan Chornomaz, Shay Moran, and Amir Yehudayoff. Local Borsuk-Ulam, stability, and replicability. arXiv preprint arXiv:2311.01599, 2023.
[12]
Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. Replicable reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[13]
Amin Karbasi, Grigoris Velegkas, Lin Yang, and Felix Zhou. Replicability in reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[14]
Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. User-level differentially private learning via correlated sampling. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 20172–20184, 2021.
[15]
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65 (6): 386, 1958.
[16]
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20: 273–297, 1995.
[17]
Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55 (1): 119–139, 1997.
[18]
Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27 (11): 1134–1142, 1984.
[19]
Guy Kindler, Anup Rao, Ryan O’Donnell, and Avi Wigderson. Spherical cubes: optimal foams from computational hardness amplification. Communications of the ACM, 55 (10): 90–97, 2012.
[20]
Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms for noise tolerance and differential privacy. Algorithmica, 72 (1): 282–315, 2015. URL https://doi.org/10.1007/s00453-014-9954-9.
[21]
John Dunagan and Santosh S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 315–320. ACM, 2004. URL https://doi.org/10.1145/1007352.1007404.
[22]
Vladimir Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006.
[23]
Noga Alon and Bo’az Klartag. Optimal compression of approximate inner products and dimension reduction. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 639–650. IEEE, 2017.
[24]
Allan Grønlund, Lior Kamma, and Kasper Green Larsen. Near-tight margin-based generalization bounds for support vector machines. In International Conference on Machine Learning, pages 3779–3788. PMLR, 2020.
[25]
Huy Lê Nguyen, Jonathan Ullman, and Lydia Zakynthinou. Efficient private algorithms for learning large-margin halfspaces. In Algorithmic Learning Theory, pages 704–724. PMLR, 2020.
[26]
Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Locating a small cluster privately. In Tova Milo and Wang-Chiew Tan, editors, Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 413–427. ACM, 2016. URL https://doi.org/10.1145/2902251.2902296.
[27]
Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50 (4): 506–519, 2003.
[28]
Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 803–812, 2011.
[29]
Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33: 2147–2158, 2020.
[30]
Dimitris Fotakis, Alkis Kalavasis, Vasilis Kontonis, and Christos Tzamos. Efficient algorithms for learning from coarse labels. In Conference on Learning Theory, pages 2060–2079. PMLR, 2021.
[31]
Alkis Kalavasis, Amin Karbasi, Grigoris Velegkas, and Felix Zhou. On the computational landscape of replicable learning. arXiv preprint arXiv:2405.15599, 2024.
[32]
Shay Moran, Hilla Schefler, and Jonathan Shafer. The bayesian stability zoo. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[33]
Maryanthe Malliaris and Shay Moran. The unstable formula theorem revisited. arXiv preprint arXiv:2212.05050, 2022.
[34]
Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 1999.
[35]
Yoav Freund and Robert E Schapire. Large margin classification using the perceptron algorithm. In Proceedings of the eleventh annual conference on Computational learning theory, pages 209–217, 1998.
[36]
Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138, 2005.
[37]
Mark Bun, Marco Leandro Carmosino, and Jessica Sorrell. Efficient, noise-tolerant, and private learning via boosting. In Conference on Learning Theory, pages 1031–1077. PMLR, 2020.
[38]
Amos Beimel, Shay Moran, Kobbi Nissim, and Uri Stemmer. Private center points and learning of halfspaces. In Conference on Learning Theory, pages 269–282. PMLR, 2019.
[39]
Haim Kaplan, Yishay Mansour, Uri Stemmer, and Eliad Tsfadia. Private learning of halfspaces: Simplifying the construction and reducing the sample complexity. Advances in Neural Information Processing Systems, 33: 13976–13985, 2020.
[40]
Raef Bassily, Mehryar Mohri, and Ananda Theertha Suresh. Open problem: Better differentially private learning algorithms with margin guarantees. In Conference on Learning Theory, pages 5638–5643. PMLR, 2022.
[41]
Raef Bassily, Mehryar Mohri, and Ananda Theertha Suresh. Differentially private learning with margin guarantees. Advances in Neural Information Processing Systems, 35: 32127–32141, 2022.
[42]
Ilias Diakonikolas, Jelena Diakonikolas, Daniel M Kane, Puqian Wang, and Nikos Zarifis. Information-computation tradeoffs for learning margin halfspaces with random classification noise. In The Thirty Sixth Annual Conference on Learning Theory, pages 2211–2239. PMLR, 2023.
[43]
Ilias Diakonikolas, Daniel Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. Advances in Neural Information Processing Systems, 32, 2019.
[44]
Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8 (3-4): 231–357, 2015.
[45]
Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1895–1904. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/kohler17a.html.
[46]
William B Johnson. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability, pages 189–206, 1984.
[47]
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.
[48]
Dimitris Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 274–281, 2001.
[49]
Casper Benjamin Freksen. An introduction to Johnson-Lindenstrauss transforms. arXiv preprint arXiv:2103.00564, 2021.

  1. alvertos.kalavasis@yale.edu↩︎

  2. amin.karbasi@yale.edu↩︎

  3. larsen@cs.au.dk↩︎

  4. grigoris.velegkas@yale.edu↩︎

  5. felix.zhou@yale.edu↩︎

  6. Authors listed alphabetically.↩︎

  7. When we do not specify the norm, we assume the \(\ell_2\)-norm.↩︎

  8. By iteratively halving a guess for \(\tau\), we may assume without loss of generality that \(\tau\) is known.↩︎

  9. We assume that \(\delta < \rho, \epsilon\), otherwise we normalize \(\delta\) to get the bound.↩︎

  10. We use the notation \((\alpha,\beta)\)-DP instead of the more common \((\epsilon,\delta)\)-DP to be consistent with the notation of the rest of the paper regarding the accuracy and probability of failure of the learning algorithms.↩︎