March 21, 2025
Contrastive learning—a modern approach to extract useful representations from unlabeled data by training models to distinguish similar samples from dissimilar ones—has driven significant progress in foundation models. In this work, we develop a new theoretical framework for analyzing data augmentation-based contrastive learning, with a focus on SimCLR as a representative example. Our approach is based on the concept of approximate sufficient statistics, which we extend beyond its original definition in [1] for contrastive language-image pretraining (CLIP) using KL-divergence. We generalize it to equivalent forms and general \({\mathrm{f}}\)-divergences, and show that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient. Furthermore, we demonstrate that these near-sufficient encoders can be effectively adapted to downstream regression and classification tasks, with performance depending on their sufficiency and the error induced by data augmentation in contrastive learning. Concrete examples in linear regression and topic classification are provided to illustrate the broad applicability of our results.
Leveraging massive unlabeled data to learn useful representations has played a central role in recent advances in foundation models. A prominent approach of this kind is contrastive learning, which has driven significant progress in visual representation learning [2], [3], large-scale speech processing [4], and multimodal AI [5], [6].
In short, contrastive learning finds useful representations of the data by maximizing similarity between paired samples while minimizing it for non-paired samples. Consider SimCLR [2] for visual representation learning as an illustrative example. Given a dataset of images \({\boldsymbol{x}}\in{\mathcal{X}}\), SimCLR generates two augmented views \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\in{\mathcal{X}}\times{\mathcal{X}}\) for each image \({\boldsymbol{x}}\) using random transformations (i.e., data augmentations) such as random cropping, random color distortions, and random Gaussian blur. It then trains an encoder \(f\) that aligns the paired views and separates the non-paired views by minimizing the loss in Eq. 1 . The learned representation \(f({\boldsymbol{x}})\) (or \(f({\boldsymbol{z}}^{(1)})\)) can then be adapted to downstream tasks with few labeled samples and minimal fine-tuning.
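To make the two-view generation concrete, here is a minimal numpy sketch of the augmentation step, with a toy random crop and additive Gaussian noise standing in for the image augmentations named above; the function names and parameter choices are ours and not part of SimCLR itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, out_size=24):
    """Take a random out_size x out_size patch, then pad back to the original
    size by repeating edge values (a toy stand-in for random resized cropping)."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    patch = img[top:top + out_size, left:left + out_size]
    return np.pad(patch, ((0, h - out_size), (0, w - out_size)), mode="edge")

def gaussian_noise(img, sigma=0.05):
    """Additive Gaussian noise, a crude proxy for color distortion / blur."""
    return img + sigma * rng.standard_normal(img.shape)

def two_views(x):
    """Apply two independently sampled transformations g^(1), g^(2) ~ P_G to x."""
    def g(img):
        return gaussian_noise(random_crop(img))
    return g(x), g(x)

x = rng.random((32, 32))       # a toy "image"
z1, z2 = two_views(x)          # the pair (z^(1), z^(2)) fed to the encoder f
```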
Despite its remarkable empirical performance, the theoretical aspects of contrastive learning remain an active area of study [1], [7]. In this work, we present a theoretical analysis of data augmentation-based contrastive learning, with a specific focus on the SimCLR framework [2] as a representative example. Notably, recent work by [1] has introduced new theoretical insights into contrastive language-image pretraining (CLIP). They first introduced the concept of approximate sufficient statistics, showing that the image and text encoders obtained from the empirical risk minimizer of CLIP are approximately sufficient. Additionally, under the joint graphical hierarchical model (JGHM) assumption for image and text data, they demonstrated that such encoders can be efficiently adapted to various downstream multimodal tasks.
Our work complements and extends the work by [1] in two key ways.
We extend the concept of approximate sufficient statistics, which was originally defined for CLIP in a specific form based on KL-divergence, to three equivalent forms and general \({\mathrm{f}}\)-divergences. Based on the equivalent forms of the definition, we establish that minimizing the contrastive loss (e.g., the InfoNCE loss [8]) is essentially finding approximate sufficient statistics that are adaptable to downstream tasks.
We focus on data augmentation-based contrastive learning following the SimCLR framework. In contrast to CLIP, the random transformations in SimCLR introduce additional challenges for theoretical analysis. We show that the downstream performance of the learned encoder depends on its sufficiency and the error induced by the random transformations. Furthermore, motivated by the generalized definition of approximate sufficient statistics, we theoretically demonstrate that encoders trained using alternative contrastive losses can achieve similar downstream performance to those trained using standard SimCLR.
The remainder of this work is organized as follows. Section 2 reviews related literature. In Section 3, we introduce the concept of approximate sufficient statistics. Sections 4.1–4.2 present the setup of data augmentation-based contrastive learning and analyze the downstream performance of the SimCLR-trained encoder. In Section 4.3, we extend our analysis to general \({\mathrm{f}}\)-contrastive losses. Examples in linear regression and topic classification are discussed in Section 5.
Self-supervised learning and contrastive learning. Self-supervised learning (SSL) dates back to the early work of [9], which leverages cross-modality information as a self-supervised substitute for labels to improve classification performance. In the past decade, SSL has been explored in image classification through various data augmentations, including rotation [10], colorization [11], and Jigsaw puzzles [12]. More recently, contrastive learning based on paired and non-paired samples has emerged as a prominent approach in SSL [2], [3], [5], [13], [14]. Notably, SimCLR [2] learns image representations by minimizing the InfoNCE loss [8] on randomly augmented views of images, while CLIP [5] does so on paired and non-paired image-text samples.
Choices of the loss function. Various loss functions have been used in contrastive learning, including NCE [15], InfoNCE [8], Multi-class N-pair loss [16], SigLIP [17], and f-MICL [18]. These losses utilize cross-entropy and its variants to distinguish paired from non-paired samples. Most relevant to our work is the InfoNCE loss [8], which is derived based on the InfoMax principle [19], [20].
Theoretical understanding of contrastive learning. Thus far, there is a rich body of literature on the theoretical understanding of self-supervised learning [1], [7], [18], [21]–[38]. Notably, early works [7], [23], [26] derived generalization error bounds for downstream classification tasks, using linear classifiers trained on representations learned by minimizing the InfoNCE loss. [23] explained contrastive learning through alignment (pulling paired samples together) and uniformity (separating non-paired samples). [25] showed that InfoNCE minimization can implicitly learn the inverse of the data-generating function. [27] demonstrated that contrastive learning recovers document representations that reveal topic posterior information in a document classification problem. More recently, [38] derived new PAC-Bayes bounds on the generalization error of SimCLR using bounded difference concentration and applied them to downstream linear classification. Compared with their results, our generalization error bound in Theorem 1 is independent of the batch size \(K\) and thus allows for large or full-batch learning. The most related work to ours is [1], which introduced the concept of approximate sufficiency to assess the quality of representations. They also demonstrated that the learned representation from CLIP [5] can be effectively adapted to several multimodal downstream tasks in a joint hierarchical graphical model.
Our work differs from existing theories of contrastive learning in several aspects: (1) Similar to [1], we derive more refined “excess risk bounds” instead of the “absolute risk bounds” established under structural conditions for downstream tasks in many prior works. (2) We derive novel unified risk bounds for downstream tasks that depend solely on the sufficiency of the encoder and the error induced by data augmentation. (3) We extend the concept of approximate sufficient statistics and theoretically analyze a broader class of contrastive losses.
Before diving into the analysis of contrastive learning, we first introduce the concept of approximate sufficient statistics, which provides a novel viewpoint for characterizing the quality of encoders \(f\) used in contrastive learning. Let \({\mathrm{f}}: {\mathbb{R}}_{+}\mapsto{\mathbb{R}}\) be a convex function such that \({\mathrm{f}}(1)=0\). For random variables \((X, Y)\) on \({\mathcal{X}}\times {\mathcal{Y}}\) with joint density \({\mathbb{P}}(x, y)\) with respect to some measure \({\boldsymbol{\mu}}\), we define the \({\mathrm{f}}\)-mutual information (\({\mathrm{f}}\)-MI) as \[\begin{align} I_{{\mathrm{f}}}(X, Y) = \int {\mathrm{f}}\Big( \frac{{\mathbb{P}}(x, y)}{{\mathbb{P}}(x) {\mathbb{P}}(y)} \Big) {\mathbb{P}}(x) {\mathbb{P}}(y) d{\boldsymbol{\mu}}. \end{align}\] Note that the \({\mathrm{f}}\)-MI is essentially the \({\mathrm{f}}\)-divergence between the joint distribution and the product of marginal distributions. It is non-negative and symmetric in \(X\) and \(Y\). Moreover, provided that \({\mathrm{f}}\) is strictly convex, \(I_{{\mathrm{f}}}(X, Y ) = 0\) if and only if \(X\) and \(Y\) are independent. Let \((X, Y)\) be random variables that have the joint density \({\mathbb{P}}({X, Y})\) (\(Y\) could be thought of as the parameter \(\theta\) in Bayesian statistics). For any statistic \(T:{\mathcal{X}}\mapsto T({\mathcal{X}})\), to characterize the information loss of using \(T(X)\) instead of \(X\) for predicting \(Y\), we introduce the following definition of the sufficiency of \(T(X)\).
Definition 1 (Approximate sufficiency). Let \(T: {\mathcal{X}}\to T({\mathcal{X}})\) be a mapping (i.e., a statistic). We define three forms of the sufficiency of \(T\), which will be shown to be equivalent:
Information Loss Sufficiency (ILS): The information loss sufficiency of \(T\) is defined as \[\begin{align} {\rm Suff}_{{\rm il}, {\mathrm{f}}}(T) = I_{{\mathrm{f}}}(X, Y) - I_{{\mathrm{f}}}(T(X), Y). \end{align}\]
Variational Form Sufficiency (VFS): The variational form sufficiency of \(T\) is given by \[\begin{align} {\rm Suff}_{{\rm vf}, {\mathrm{f}}}(T) = \inf_{{\sf S}: T({\mathcal{X}})\times{\mathcal{Y}}\mapsto{\mathbb{R}}} R_{{\mathrm{f}}}({\sf S}\circ T) - \inf_{{\sf S}: {\mathcal{X}}\times{\mathcal{Y}}\mapsto{\mathbb{R}}} R_{{\mathrm{f}}}({\sf S}), \end{align}\] where \({\sf S}\circ T(x,y)\mathrel{\vcenter{:}}={\sf S}(T(x),y)\), and the \({\mathrm{f}}\)-contrastive loss is defined as \[\begin{align} R_{{\mathrm{f}}}({\sf S}) \mathrel{\vcenter{:}}={\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + \inf_{{\sf S}_x:{\mathcal{X}}\mapsto{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x, y) - {\sf S}_x(x)) + {\sf S}_x(x)],\label{eq:f95contrastive95loss} \end{align}\qquad{(1)}\] where \({\mathrm{f}}^*\) is the Fenchel dual of \({\mathrm{f}}\).
Conditional Bregman Sufficiency (CBS): The conditional Bregman sufficiency of \(T\) is defined as \[{\rm Suff}_{{\rm cb}, {\mathrm{f}}}(T) = {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ B_{{\mathrm{f}}}\Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}, \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big],\] where \(B_{{\mathrm{f}}}(a, b) \mathrel{\vcenter{:}}={\mathrm{f}}(a) - {\mathrm{f}}(b) - (a-b){\mathrm{f}}'(b)\) is the Bregman divergence of \({\mathrm{f}}\).
Indeed, these definitions will be shown to be equivalent (Lemma 1), i.e., \[\begin{align} {\rm Suff}_{{\rm il}, {\mathrm{f}}}(T) = {\rm Suff}_{{\rm vf}, {\mathrm{f}}}(T)= {\rm Suff}_{{\rm cb}, {\mathrm{f}}}(T)\eqqcolon{\rm Suff}_{{\mathrm{f}}}(T). \end{align}\] We say \(T(X)\) is an \({\varepsilon}\)-approximate sufficient statistic if \({\rm Suff}_{{\mathrm{f}}}(T)\leq{\varepsilon}.\)
The Information Loss Sufficiency (ILS) is closely linked to the InfoMax principle [19], [20], which finds a statistic \(T\) that maximizes the mutual information \(I(T(X),Y)\) under certain constraints. The equivalence between ILS and CBS suggests that the loss in mutual information can be represented as a divergence between the conditional probabilities \({\mathbb{P}}(Y|X)\) and \({\mathbb{P}}(Y|T(X))\). This provides a concrete measure for interpreting the information loss.
In VFS, by definition, the excess risk \(R_{\mathrm{f}}({\sf S}\circ T) - \inf_{{\widetilde{\sf S}}} R_{\mathrm{f}}({\widetilde{\sf S}})\) serves as an upper bound on the sufficiency \({\rm Suff}_{\mathrm{f}}(T)\), and they are nearly equal when \({\sf S}\) is obtained by minimizing \(R_{\mathrm{f}}({\sf S}\circ T)\) over a sufficiently rich space \({\mathcal{S}}\). Consequently, VFS provides a loss minimization framework for finding \(T\) with low sufficiency by minimizing the \({\mathrm{f}}\)-contrastive loss \(R_{\mathrm{f}}({\sf S})\) over \({\sf S}\) in some space \({\mathcal{S}}\) and extracting \(T\) from \({\sf S}\). Moreover, an extension of approximate sufficiency to similarity scores \({\sf S}\) is introduced in Appendix 7.3.
The concept of approximate sufficient statistics was first proposed in [1], but only in the CBS form for KL divergence (i.e., \({\mathrm{f}}(x)=x\log x\)). In this work, we extend the definition to general \({\mathrm{f}}\)-divergences and establish the equivalence among three forms of sufficiency. Notably, for \({\mathrm{f}}\) that is strictly convex, we have \({\rm Suff}_{ {\mathrm{f}}}(T)=0\) if and only if \(Y\perp\!\!\!\perp X|T(X)\) from the CBS form, aligning with the classic definition of sufficient statistics (see e.g., [39]). We will mainly consider two special cases of \({\mathrm{f}}\): \({\mathrm{f}}(x) = x \log x\) (KL-divergence) and \({\mathrm{f}}(x) = (x-1)^2/2\) (\(\chi^2\)-divergence), with the corresponding sufficiency denoted by \({\rm Suff}_{\sf kl}\) and \({\rm Suff}_{\chi^2}\). For more examples and properties regarding approximate sufficient statistics, we refer the readers to Appendix 7.
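As a quick numerical illustration of the equivalence (between the ILS and CBS forms for the KL case \({\mathrm{f}}(x)=x\log x\); the VFS form requires an optimization over score functions and is omitted), the following numpy sketch evaluates both quantities on a small random discrete joint distribution and a coarsening statistic \(T\). All names and the toy distribution are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 4, 3
P = rng.random((nx, ny)) + 0.1; P /= P.sum()     # joint P(x, y), bounded away from 0
Px, Py = P.sum(1), P.sum(0)                      # marginals
T = np.array([0, 0, 1, 1])                       # statistic T(x) = x // 2 (a coarsening)

f = lambda t: t * np.log(t)                      # KL case: f(t) = t log t
fprime = lambda t: np.log(t) + 1.0

def f_mi(joint):
    """f-mutual information of a discrete joint distribution."""
    px, py = joint.sum(1), joint.sum(0)
    ratio = joint / np.outer(px, py)
    return np.sum(f(ratio) * np.outer(px, py))

# joint distribution of (T(X), Y): aggregate rows of P within each level of T
PT = np.zeros((T.max() + 1, ny))
for x in range(nx):
    PT[T[x]] += P[x]

ils = f_mi(P) - f_mi(PT)                         # Information Loss Sufficiency

# Conditional Bregman Sufficiency
Py_given_x = P / Px[:, None]                     # P(y | x)
Py_given_T = (PT / PT.sum(1, keepdims=True))[T]  # P(y | T(x)), indexed by x
a, b = Py_given_x / Py, Py_given_T / Py
breg = f(a) - f(b) - (a - b) * fprime(b)         # Bregman divergence B_f(a, b)
cbs = np.sum(np.outer(Px, Py) * breg)

print(ils, cbs)                                  # agree up to floating-point error
```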
In the context of data augmentation-based contrastive learning, we may choose \(X\) and \(Y\) as two augmented views of the sample, and \(T\) as the encoder \(f\). The sufficiency \({\rm Suff}_{{\mathrm{f}}}( f)\) then quantifies the information about one augmented view that is lost when the other view is replaced by its encoded representation. We will show that the downstream performance of \(f\) can be controlled by its sufficiency (in the CBS form) and the error induced by data augmentation. Specifically, for any downstream task, a small risk can be achieved using \(f\) if it is near-sufficient and the random transformations in contrastive learning do not significantly change the downstream outcomes. As a preview of the results, we have
Theorem (Informal). The risk on a downstream task using encoder \(f\) (denoted by \(\mathcal{R}( f)\)) satisfies \[\begin{align} \mathcal{R}( f)\leq c\cdot\Big(\sqrt{{\rm Suff}_{{\mathrm{f}}}( f)}+\epsilon_{\mathcal{G}}\Big) \end{align}\] for some constant \(c>0\), where \({\rm Suff}_{{\mathrm{f}}}( f)\) is the \({\mathrm{f}}\)-sufficiency of \(f\) and \(\epsilon_{\mathcal{G}}\) denotes the error on the downstream task induced by data augmentation.
Contrastive learning with general \({\mathrm{f}}\)-divergence was also studied in [18], [40], but the loss functions considered in these works differ from the variational form in (?? ). In particular, while [18] considered a variational form similar to (?? ), they set \({\sf S}_x = 0\) instead of taking the infimum over \({\sf S}_x\).
In this section, we demonstrate that data augmentation-based contrastive learning can find near-sufficient encoders that are effectively adaptable to downstream tasks. We focus on the SimCLR framework in Section 4.1–4.2, and extend the results to general \({\mathrm{f}}\)-contrastive losses in Section 4.3.
Let \({\boldsymbol{x}}\in {\mathcal{X}}\) be a random sample drawn from a distribution \({\mathbb{P}}_{{\mathcal{X}}}\) on \({\mathcal{X}}\). Consider a set of transformations \(\mathcal{G}\) in which each transformation \(g: {\mathcal{X}}\to {\mathcal{X}}\) maps \({\mathcal{X}}\) to itself. Let \({\mathbb{P}}_{\mathcal{G}}\) denote a distribution over the transformations in \(\mathcal{G}\). Given a sample \({\boldsymbol{x}}\) and two transformations \(g^{(1)}, g^{(2)}\sim_{iid}{\mathbb{P}}_{\mathcal{G}}\), we generate two augmented views of \({\boldsymbol{x}}\), denoted as \({\boldsymbol{z}}^{(1)}= g^{(1)}({\boldsymbol{x}})\) and \({\boldsymbol{z}}^{(2)}= g^{(2)}({\boldsymbol{x}})\). The marginal distribution of \({\boldsymbol{z}}^{(1)}\) (or equivalently \({\boldsymbol{z}}^{(2)}\)) is denoted by \({\mathbb{P}}_{{\boldsymbol{z}}}\). Often, we will omit the superscripts and let \({\boldsymbol{z}}= g({\boldsymbol{x}})\) denote a single augmented view generated by a transformation \(g\sim {\mathbb{P}}_{\mathcal{G}}\).
Throughout the remainder of this work, unless otherwise specified, we set \((X,Y) \overset{d}{=} ({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\) in Definition 1, i.e., we define the sufficiency \({\rm Suff}_{\mathrm{f}}(T) = I_{{\mathrm{f}}}({\boldsymbol{z}}^{(1)}, {\boldsymbol{z}}^{(2)}) - I_{{\mathrm{f}}}(T({\boldsymbol{z}}^{(1)}), {\boldsymbol{z}}^{(2)}).\) For simplicity, we assume the joint distribution of \(({\boldsymbol{x}},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\) is either discrete or has a continuous density w.r.t. some base measure on \({\mathcal{X}}^{\otimes3}\). We abuse the notation \({\mathbb{P}}(\cdot)\) to refer to either discrete distributions or the density of continuous distributions, with the intended meaning clear from the context. Also, we occasionally omit the subscript \({\sf kl}\) when referring to KL-sufficiency.
SimCLR [2] learns a representation of the sample \({\boldsymbol{x}}\) (i.e., \(f({\boldsymbol{x}})\) or \(f(g({\boldsymbol{x}}))\)) through performing contrastive learning on the augmented views \(({\boldsymbol{z}}^{(1)}, {\boldsymbol{z}}^{(2)})\). Specifically, given a batch of \(K\) i.i.d. samples \(\{{\boldsymbol{x}}_i\}_{i=1}^K\) from \({\mathbb{P}}_{{\mathcal{X}}}\), we generate \(K\) pairs of augmented views \(\{({\boldsymbol{z}}^{(1)}_i,{\boldsymbol{z}}^{(2)}_i)\}_{i=1}^K\) using \(2K\) i.i.d. transformations \(\{(g^{(1)}_i,g^{(2)}_i)\}_{i=1}^K\) from \({\mathbb{P}}_{\mathcal{G}}\). Let \(f:{\mathcal{X}}\mapsto{\mathbb{R}}^{ p}\) be an encoder function, potentially parametrized by neural networks. The SimCLR risk function is defined as the expected InfoNCE loss [8]: \[\begin{align} &~\overline{\sf R}_{{\sf simclr},K}({\sf S}) \mathrel{\vcenter{:}}=\frac{1}{2}{\mathbb{E}}\Big[ - \log \frac{\exp({\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_1) ) }{\sum_{j \in [K]} \exp({\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_j))} \Big] + \frac{1}{2}{\mathbb{E}}\Big[ - \log \frac{\exp({\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_1) ) }{\sum_{j \in [K]} \exp({\sf S}({\boldsymbol{z}}^{(1)}_j, {\boldsymbol{z}}^{(2)}_1))} \Big],~\text{and} \label{eq:simclr95risk}\\ &~{\sf R}_{{\sf simclr},K}( f) \mathrel{\vcenter{:}}=\overline{\sf R}_{{\sf simclr},K}( {\sf S}_{ f}),\text{ where } {\sf S}_{ f}\mathrel{\vcenter{:}}=\tau(\langle f({\boldsymbol{z}}^{(1)}) , \, f({\boldsymbol{z}}^{(2)}) \rangle),\text{ }\tau:{\mathbb{R}}\mapsto{\mathbb{R}}\text{ is some simple link function.}\notag \end{align}\tag{1}\]
Given a set of encoders denoted by \({\mathcal{F}}\) and \(n={n_1}K\) i.i.d. pairs of augmented views \(\{({\boldsymbol{z}}^{(1)}_i,{\boldsymbol{z}}^{(2)}_i)\}_{i=1}^n\), SimCLR learns an encoder function \(\widehat f\in {\mathcal{F}}\) through empirical risk minimization (ERM), namely,
\[\begin{align} \widehat f \mathrel{\vcenter{:}}= \underset{ f\in {\mathcal{F}}}{\mathrm{argmin}} \Big\{ {\widehat {\sf R}}_{{\sf simclr},K}( {\sf S}_{ f}) \mathrel{\vcenter{:}}=&~ \frac{1}{2n}\sum_{i = 1}^{n_1}\sum_{j=1}^K\Big[ - \log \frac{\exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+j}, {\boldsymbol{z}}^{(2)}_{(i-1)K+j}) ) }{\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+j}, {\boldsymbol{z}}^{(2)}_{(i-1)K+l}))} \notag \\ &~ - \log\frac{\exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+j}, {\boldsymbol{z}}^{(2)}_{(i-1)K+j}) ) }{\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+l}, {\boldsymbol{z}}^{(2)}_{(i-1)K+j}))} \Big]\Big\}. \label{eq:simclr-empirical-risk-minimization} \end{align}\tag{2}\] With the encoder \(\widehat f(\cdot)\) at hand, \(\widehat f({\boldsymbol{x}})\) (or \(\widehat f(g({\boldsymbol{x}}))\)) serves as a representation for each \({\boldsymbol{x}}\in{\mathcal{X}}\), which can be used for downstream tasks.
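For concreteness, here is a minimal numpy sketch of the per-batch empirical SimCLR objective in Eq. 2 , with a toy linear encoder and the identity link \(\tau\); the helper names and the toy data are ours.

```python
import numpy as np

def simclr_batch_loss(Z1, Z2, encoder, tau=lambda s: s):
    """Symmetrized empirical InfoNCE objective of Eq. 2 for one batch of K view pairs.

    Z1, Z2  : arrays of shape (K, d), the two augmented views of the same K samples.
    encoder : maps an (n, d) array to an (n, p) array of representations f(z).
    tau     : link function applied entrywise to the inner-product scores.
    """
    F1, F2 = encoder(Z1), encoder(Z2)                  # (K, p) each
    S = tau(F1 @ F2.T)                                 # S[j, l] = S_f(z1_j, z2_l)
    diag = np.arange(S.shape[0])
    # first term: negatives are the z2_l's; second term: negatives are the z1_l's
    term1 = np.log(np.exp(S).sum(axis=1)) - S[diag, diag]
    term2 = np.log(np.exp(S).sum(axis=0)) - S[diag, diag]
    return 0.5 * (term1.mean() + term2.mean())

# toy usage: a random linear encoder W z with p << d and noisy "views" of the same x
rng = np.random.default_rng(0)
d, p, K = 16, 4, 8
W = rng.standard_normal((p, d)) / np.sqrt(d)
X = rng.standard_normal((K, d))
Z1 = X + 0.1 * rng.standard_normal((K, d))
Z2 = X + 0.1 * rng.standard_normal((K, d))
print(simclr_batch_loss(Z1, Z2, lambda Z: Z @ W.T))
```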
We now show that the sufficiency of the ERM estimator \(\widehat f\) can be properly controlled. We will demonstrate in Section 4.2 that the downstream performance of \(\widehat f\) is closely tied to its sufficiency. First, we note that a global minimizer of the SimCLR risk is \({\sf S_\star}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\mathrel{\vcenter{:}}=\log \Big[ \frac{{\mathbb{P}}( {\boldsymbol{z}}^{(1)}, {\boldsymbol{z}}^{(2)}) }{ {\mathbb{P}}({\boldsymbol{z}}^{(1)}) \cdot {\mathbb{P}}({\boldsymbol{z}}^{(2)})}\Big]\) (see Lemma 2 for the proof). To analyze the properties of the ERM estimator, we introduce the following boundedness assumption on the score function \({\sf S}\) and regularity assumption on \(\tau\).
Assumption 1 (Bounded score). There exists a constant \(B_{\sf S}>0\) such that for all pairs \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\), we have \(\exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}))\in[1/B_{\sf S},B_{\sf S}]\) for all \(f\in {\mathcal{F}}\) and \(\frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}\in[1/B_{\sf S},B_{\sf S}]\).
Assumption 2 (Simple link function). The link function \(\tau:{\mathbb{R}}\mapsto{\mathbb{R}}\) is invertible and there exists some constant \(B_\tau>0\) such that \(|\tau(0)|\leq B_\tau\) and \(\tau,\tau^{-1}\) are \(B_\tau\)-Lipschitz.
Note that the first part of Assumption 1 is satisfied with \(B_{\sf S}=\exp( B_{ f}^2)\) when \(\| f({\boldsymbol{x}})\|_{2}\leq B_{ f}\) for all \(f\in {\mathcal{F}},{\boldsymbol{x}}\in{\mathcal{X}}\) and \(\tau\) is the identity function. Based on these assumptions, we have
Theorem 1 (Sufficiency bound for the ERM estimator). Suppose Assumptions 1 and 2 hold for some \(B_{\sf S}\geq1,B_\tau>0\). Let \(\widehat f\) be the empirical risk minimizer defined in Eq. 2 and let \({\sf S_\star}\) be as defined in Section 4.1. Let \({\sf supp}({\boldsymbol{z}}^{(1)})\) be the support of \({\boldsymbol{z}}^{(1)}\) and \({\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})\) be the \(u\)-covering number of \({\mathcal{F}}\) under the \((2,\infty)\)-norm \(\| f\|_{2,\infty}\mathrel{\vcenter{:}}=\sup_{x\in{\sf supp}({\boldsymbol{z}}^{(1)})}\| f(x)\|_{2}\). Then, with probability at least \(1-\delta\), we have \[\begin{align} \label{eq:excess95risk95bound95simclr} {\rm Suff}_{\sf kl}( \widehat f) \leq \Big(1 +\frac{{C}}{K}\Big) \cdot [\mathrm{generalization error}+\mathrm{approximation error}], \end{align}\qquad{(2)}\] where \[\begin{align} \mathrm{generalization error}&\mathrel{\vcenter{:}}= \frac{{C}}{\sqrt n}\Big[\sqrt{{\log(1/\delta)}} +B_\tau^2\int_{0}^{2(\log B_{\sf S}+B_\tau)}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})}du\Big] , \\ \mathrm{approximation error}&\mathrel{\vcenter{:}}=\inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{{\sf simclr},K}( {\sf S}_{ f})-\overline{\sf R}_{{\sf simclr},K}({\sf S_\star}) \end{align}\] for some constant \({C}>0\) depending polynomially on \(B_{\sf S}\).
See the proof in Appendix 8.2. In the decomposition on the R.H.S. of ?? , the approximation error term represents the error incurred when approximating the optimal score \({\sf S_\star}\) within the function class \({\mathcal{F}}\). It is a property of the function class \({\mathcal{F}}\), and a richer class tends to have a smaller approximation error. The generalization error bound is derived using concentration properties of functions with bounded differences. Interestingly, it depends only on the total sample size \(n={n_1}K\) rather than the batch size \(K\) or the number of batches \({n_1}\). This allows our results to account for large or full-batch training, as used in SimCLR [2] and CLIP [5]. When \(n\to\infty\), the generalization error vanishes while the approximation error remains constant.
Why does the SimCLR loss work? Intuitively, \(\overline{\sf R}_{{\sf simclr},K}({\sf S})\) can be viewed as an approximation of the KL-contrastive loss \(R_{{\sf kl}}({\sf S})\) in Eq. ?? using a finite batch size \(K.\) Namely, \[\begin{align} R_{{\sf kl}}({\sf S}) = -{\mathbb{E}}[{\sf S}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})] +{\mathbb{E}}_{{\boldsymbol{z}}^{(1)}_1}\big[\log{\mathbb{E}}_{{\boldsymbol{z}}^{(2)}_2}[\exp({\sf S}({\boldsymbol{z}}^{(1)}_1,{\boldsymbol{z}}^{(2)}_2))]\big] =\lim_{K\to\infty}\overline{\sf R}_{{\sf simclr},K}({\sf S})-\log K.\label{eq:intuitive95bound95suff} \end{align}\tag{3}\] See the proof in Appendix 8.1. As a result, by the definition of VFS in Definition 1 \[\begin{align} {\rm Suff}_{{\sf kl}}( f)\leq R_{\sf kl}( {\sf S}_{ f})-\inf_{{\sf S}} R_{\sf kl}({\sf S})\approx \underbrace{\overline{\sf R}_{{\sf simclr},K}({\sf S}_ f)-\inf_{\sf S}\overline{\sf R}_{{\sf simclr},K}({\sf S})}_{\mathrm{Excess risk}}, \end{align}\] and therefore minimizing the SimCLR loss \({\widehat {\sf R}}_{{\sf simclr},K}({\sf S}_ f)\) effectively controls the sufficiency \({\rm Suff}_{{\sf kl}}( f)\).
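The limit in Eq. 3 can be checked by simulation. The sketch below uses correlated scalar Gaussian views and the score \({\sf S}(z^{(1)},z^{(2)})=z^{(1)}z^{(2)}\), for which \(R_{\sf kl}({\sf S})=1/2-\rho\) in closed form; this toy setup and all names are ours and only illustrate the convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_batches = 0.6, 2000

def sample_pairs(K):
    """K correlated scalar view pairs: z2 = rho*z1 + sqrt(1-rho^2)*noise."""
    z1 = rng.standard_normal(K)
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(K)
    return z1, z2

def simclr_risk_mc(K):
    """Monte Carlo estimate of the population SimCLR risk (Eq. 1) for S(z1,z2)=z1*z2."""
    vals = []
    for _ in range(n_batches):
        z1, z2 = sample_pairs(K)
        S = np.outer(z1, z2)                      # S[j, l] = z1_j * z2_l
        lse_cols = np.log(np.exp(S).sum(axis=1))  # denominators over z2_l
        lse_rows = np.log(np.exp(S).sum(axis=0))  # denominators over z1_l
        vals.append(0.5 * ((lse_cols[0] - S[0, 0]) + (lse_rows[0] - S[0, 0])))
    return np.mean(vals)

R_kl = 0.5 - rho                                  # closed form for this Gaussian example
for K in (2, 8, 32, 128):
    print(K, simclr_risk_mc(K) - np.log(K), "vs", R_kl)
```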
Given an encoder function \(f:{\mathcal{X}}\to {\mathbb{R}}^{ p}\), we are interested in applying it to downstream tasks. Specifically, the goal is to leverage the learned representation \(f({\boldsymbol{x}})\) (or \(f(g({\boldsymbol{x}}))\)) to facilitate learning in downstream tasks, such as regression or classification. By mapping the raw sample \({\boldsymbol{x}}\) to the feature space \({\mathbb{R}}^{ p}\), the representation \(f({\boldsymbol{x}})\) (or \(f(g({\boldsymbol{x}}))\)) is expected to capture the most salient information of \({\boldsymbol{x}}\), simplifying the downstream task while maintaining high performance. In this section, we demonstrate that the downstream performance of the encoder depends on its sufficiency \({\rm Suff}_{\sf kl}( f)\) and the robustness of the downstream task to the random transformation \(g\sim{\mathbb{P}}_{\mathcal{G}}.\)
Adaptation to downstream regression task. We first study regression tasks. Consider the task of learning an unknown target function \(h_\star:{\mathcal{X}}\mapsto{\mathbb{R}}\). Given an encoder \(f\), our objective is to find a function \({\sf h}:{\mathbb{R}}^{ p}\mapsto{\mathbb{R}}\) such that \({\sf h}( f({\boldsymbol{x}}))\approx h_\star({\boldsymbol{x}})\) (or \({\sf h}( f(g({\boldsymbol{x}})))\approx h_\star({\boldsymbol{x}})\)). The estimation error of \({\sf h}\) is measured by the risk \[\begin{align} {\sf R_{\mathcal{G}}}({\sf h}\circ f)\mathrel{\vcenter{:}}={\mathbb{E}}_{{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}},g\sim{\mathbb{P}}_{\mathcal{G}}}[({\sf h}( f(g({\boldsymbol{x}})))-h_\star({\boldsymbol{x}}))^2] ,~~~\text{or ~~}{\sf R}({\sf h}\circ f)\mathrel{\vcenter{:}}={\mathbb{E}}_{{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}}}[({\sf h}( f({\boldsymbol{x}}))-h_\star({\boldsymbol{x}}))^2]. \end{align}\] For example, in regression tasks where the goal is to predict the outcome \({\boldsymbol{y}}\) based on the covariates \({\boldsymbol{x}}\), one can choose \(h_\star({\boldsymbol{x}})={\mathbb{E}}[{\boldsymbol{y}}|{\boldsymbol{x}}]\). The two risks \({\sf R_{\mathcal{G}}}(\cdot),{\sf R}(\cdot)\) correspond to the cases where a random transformation \(g\) is (or is not) applied before passing the input to the encoder \(f\), respectively. Theorem 4 illustrates how the downstream performance of the encoder \(f\) depends on its sufficiency.
Theorem 4 (Performance on downstream regression). Suppose \(h_\star\) satisfies \(\big|{\mathbb{E}}[h_\star({\boldsymbol{x}})|g({\boldsymbol{x}})]\big|\leq B_{h_\star}\) almost surely. Given an encoder \(f:{\mathcal{X}}\mapsto{\mathbb{R}}^ p\), there exists a measurable function \({\sf h}:{\mathbb{R}}^ p\mapsto{\mathbb{R}}\) such that \[\begin{align} {\sf R_{\mathcal{G}}}({\sf h}\circ f)\leq c(B_{h_\star}^2\sqrt{{\rm Suff}_{{\sf kl}}( f)}+ \epsilon_{\mathcal{G}}),\tag{5} \end{align}\] where \(c>0\) is some absolute constant and \(\epsilon_{\mathcal{G}}\mathrel{\vcenter{:}}={\mathbb{E}}_{{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}},g\sim{\mathbb{P}}_{\mathcal{G}}}[(h_\star(g({\boldsymbol{x}}))-h_\star({\boldsymbol{x}}))^2]\). Moreover, if the augmented view has the same marginal distribution as the original sample, i.e., \({\boldsymbol{z}}^{(1)}\overset{d}{=}{\boldsymbol{x}}\), then \[\begin{align} {\sf R}({\sf h}\circ f)\leq c(B_{h_\star}^2\sqrt{{\rm Suff}_{{\sf kl}}( f)}+ \epsilon_{\mathcal{G}})\tag{6} \end{align}\] for some absolute constant \(c>0\).

The proof of Theorem 4 is contained in Appendix 8.3. The term \(\epsilon_{\mathcal{G}}\) characterizes the impact of a random transformation \(g\) on the value of the target function \(h_\star\). In SimCLR, since the encoder \(f\) is trained only on the augmented views \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\), the random transformation \(g\) needs to preserve sufficient information about \(h_\star\) (e.g., \(\epsilon_{\mathcal{G}}\) is small) for \(f\) to be effective. This is often the case in practice: for example, random cropping (\(g\)) typically does not alter the class label (\(h_\star\)) of an image; similarly, rotations and scaling (\(g\)) should not affect the true age (\(h_\star\)) of a person in facial images. In addition, Eq. 5 still holds when \(\epsilon_{\mathcal{G}}\) is replaced by the minimum error \(\widetilde{\epsilon}_{\mathcal{G}}\mathrel{\vcenter{:}}=\inf_h{\mathbb{E}}_{{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}},g\sim{\mathbb{P}}_{\mathcal{G}}}[(h(g({\boldsymbol{x}}))-h_\star({\boldsymbol{x}}))^2]\leq\epsilon_{\mathcal{G}}\). We refer to the proof for more details.
Adaptation to downstream classification task. We next turn to classification tasks. Suppose in the downstream we are given samples \(({\boldsymbol{x}},{\boldsymbol{y}})\) from some joint distribution \({\mathbb{P}}\) on \({\mathcal{X}}\times[{\sf K}]\), where \({\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}}\) is the input and \({\boldsymbol{y}}\in[{\sf K}]\) is the corresponding label. Note that for any \({\boldsymbol{x}}\), the label \({\boldsymbol{y}}\) follows the conditional probability \({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})\). Given an encoder \(f\), for any function \({\sf h}:{\mathbb{R}}^{ p}\mapsto\Delta([{\sf K}])\), we measure its classification error by \[\begin{align} {\sf R^{\sf cls}_{\mathcal{G}}}({\sf h}\circ f) \mathrel{\vcenter{:}}={\mathbb{E}}_{({\boldsymbol{x}},{\boldsymbol{y}})\sim{\mathbb{P}},g}[{\mathsf D}_{\rm KL}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})||{\sf h}( f(g({\boldsymbol{x}}))))]. \end{align}\]
Theorem 2 (Performance on downstream classification). Suppose \(\inf_{y\in[{\sf K}]}{\mathbb{P}}(y|g({\boldsymbol{x}}))\geq\exp(- B)\) for some \(B>0\) on the support of \(g({\boldsymbol{x}})\). Given an encoder \(f:{\mathcal{X}}\mapsto{\mathbb{R}}^ p\), there exists a measurable function \({\sf h}:{\mathbb{R}}^ p\mapsto\Delta([{\sf K}])\) such that \[\begin{align} {\sf R^{\sf cls}_{\mathcal{G}}}({\sf h}\circ f)\leq c\Big( B\sqrt{{\rm Suff}_{{\sf kl}}( f)}+ \epsilon^{\sf cls}_{\mathcal{G}}\Big),\label{eq:thm95general95downstream95bound95cls} \end{align}\qquad{(3)}\] where \(\epsilon^{\sf cls}_{\mathcal{G}}\mathrel{\vcenter{:}}={\mathbb{E}}_{{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}},g\sim{\mathbb{P}}_{\mathcal{G}}}[{\mathsf D}_{\rm 2}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})||{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}))+{\mathsf D}_{\rm 2}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}})||{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}}))]\) and \(c>0\) is some absolute constant. Here, \({\mathsf D}_{\rm 2}\) denotes the \(2\)-Rényi divergence.
The proof of Theorem 2 is contained in Appendix 8.4. Similar to the regression case in Theorem 4 , the downstream classification error is bounded by the sum of a sufficiency term and an error term that characterizes the change in label probabilities induced by the transformation \(g\).
We generalize our theoretical framework to general \({\mathrm{f}}\)-sufficiency as defined in Definition 1, which can be controlled by minimizing the \({\mathrm{f}}\)-contrastive risk. We discuss (1) how to find encoders \(f\) with low \({\mathrm{f}}\)-sufficiency \({\rm Suff}_{\mathrm{f}}( f)\) via data augmentation-based contrastive learning and (2) the implications of low \({\mathrm{f}}\)-sufficiency for downstream performance. Note that \({\mathrm{f}}(x)=x\log x\) yields the standard SimCLR setup.
Recall the variational form sufficiency (VFS) in Definition 1. We see that for any \({\mathrm{f}}\) and encoder \(f\) \[\begin{align} {\rm Suff}_{{\mathrm{f}}}( f)\leq \inf_{{\sf S}: f({\mathcal{X}})\times{\mathcal{X}}\mapsto{\mathbb{R}}} R_{{\mathrm{f}}}({\sf S}\circ f) -\inf_{{\sf S}:{\mathcal{X}}\times{\mathcal{X}}\mapsto{\mathbb{R}}}R_{\mathrm{f}}({\sf S}) \leq \underbrace{R_{{\mathrm{f}}}( {\sf S}_{ f}) -\inf_{{\sf S}:{\mathcal{X}}\times{\mathcal{X}}\mapsto{\mathbb{R}}}R_{\mathrm{f}}({\sf S})}_{\mathrm{Excess risk}}. \end{align}\]Thus, for any \({\varepsilon}> 0\), if there exists an encoder \(\widehat f\in {\mathcal{F}}\) such that the excess risk of \({\sf S}_{ \widehat f}\) is less than \({\varepsilon}\), then the sufficiency \({\rm Suff}_{{\mathrm{f}}}( \widehat f) \leq {\varepsilon}\). Consequently, given i.i.d. pairs of augmented views, we can obtain an encoder \(\widehat f\) with low \({\mathrm{f}}\)-sufficiency by choosing \(\widehat f\) as the empirical risk minimizer (ERM) of a finite-sample estimate \(\widehat R_{\mathrm{f}}( {\sf S}_{ f})\) of \(R_{\mathrm{f}}( {\sf S}_{ f})\), provided that \(\widehat R_{\mathrm{f}}( {\sf S}_{ f}) \approx R_{\mathrm{f}}( {\sf S}_{ f})\), the function class \({\mathcal{F}}\) is sufficiently rich, and its \(\| \cdot\|_{2,\infty}\)-covering number is well-controlled.
We focus on \({\chi^2}\)-sufficiency (i.e., \({\mathrm{f}}(x) = (x - 1)^2 / 2\)) in the following. For general \({\mathrm{f}}\), the \({\sf S}_x(x)\) that attains the infimum in Eq. ?? may not have a closed-form solution, and estimating \(\widehat R_{\mathrm{f}}( {\sf S}_{ f})\) requires solving estimating equations, adding complexity to the analysis. Thus, we leave a detailed investigation of the general \({\mathrm{f}}\) case for future work.
When \({\mathrm{f}}(x) = (x - 1)^2 / 2\), basic algebra shows that the \({\chi^2}\)-contrastive loss in Eq. ?? takes the form \[\begin{align} R_{{\chi^2}}({\sf S}) = {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[({\sf S}(x, y) - {\mathbb{E}}_{{\mathbb{P}}(y)}[{\sf S}(x, y)])^2/2 + {\sf S}(x, y) ].\label{eq:csquare95contrastive95loss} \end{align}\tag{7}\] Given \(n={n_1}K\) i.i.d. pairs of augmented views \(\{({\boldsymbol{z}}^{(1)}_i, {\boldsymbol{z}}^{(2)}_i)\}_{i=1}^n\), an unbiased finite-sample estimate of \(R_{{\chi^2}}({\sf S})\) is given by \[\begin{align} {\widehat {\sf R}}_{{\sf chisq},K}( {\sf S}_{ f}) &\mathrel{\vcenter{:}}=\frac{1}{n} \sum_{i = 1}^{n_1}\sum_{j=1}^K\Big[ \frac{1}{4(K-1)(K-2)} \sum_{\substack{k, l \in [K] \\ j \neq k,~k \neq l,~l \neq j}} \bigl( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{ij}, {\boldsymbol{z}}^{(2)}_{ik}) - {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{ij}, {\boldsymbol{z}}^{(2)}_{il}) \bigr)^2 \notag \\ & + \frac{1}{K- 1} \sum_{k \neq j} {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{ij}, {\boldsymbol{z}}^{(2)}_{ik}) - {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{ij}, {\boldsymbol{z}}^{(2)}_{ij}) \Big], ~ {\sf S}_{ f}\mathrel{\vcenter{:}}=\tau(\langle f({\boldsymbol{z}}^{(1)}) , \, f({\boldsymbol{z}}^{(2)}) \rangle), \label{eq:chisq-empirical-risk-minimization} \end{align}\tag{8}\] where we adopt the shorthand \({\boldsymbol{z}}^{(i)}_{ab}={\boldsymbol{z}}^{(i)}_{(a-1)K+b}\) for \(i\in[2]\). Let \(\widehat f=\mathrm{argmin}_{ f\in {\mathcal{F}}}{\widehat {\sf R}}_{{\sf chisq},K}( {\sf S}_{ f})\) be the ERM estimator. Similar to Theorem 1, we have
Theorem 3 (\({\chi^2}\)-sufficiency bound for the ERM estimator). Suppose \({\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\in[-\widebar B_{\sf S},\widebar B_{\sf S}]\) for all \(f\in {\mathcal{F}}\) and pairs \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\), and that Assumption 2 holds for some \(B_\tau>0\). Let \({\sf S}^{}_{\star}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\mathrel{\vcenter{:}}=\frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}\). For any \(K\geq3\), with probability at least \(1-\delta\), we have \[\begin{align} \label{eq:excess95risk95bound95chisq} {\rm Suff}_{{\chi^2}}( \widehat f) \leq \mathrm{generalization error}+\mathrm{approximation error}, \end{align}\qquad{(4)}\] where \[\begin{align*} \mathrm{generalization error}&\mathrel{\vcenter{:}}= \frac{c\widebar B_{\sf S}^2}{\sqrt n}\Big[\sqrt{{\log(1/\delta)}} +B_\tau^2\int_{0}^{2(\widebar B_{\sf S}+B_\tau)}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})}du\Big] , \\ \mathrm{approximation error}&\mathrel{\vcenter{:}}=\inf_{ f\in {\mathcal{F}}}R_{{\chi^2}}( {\sf S}_{ f})-R_{{\chi^2}}( {\sf S}^{}_{\star}) \end{align*}\] for some absolute constant \(c>0\).
The proof of Theorem 3 is provided in Appendix 8.5. Note that we do not assume the boundedness of \({\sf S}^{}_{\star}\) as in Theorem 1.
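For concreteness, here is a minimal numpy sketch of the per-batch estimator in Eq. 8 (a single batch, i.e., \(n_1=1\)) with a toy linear encoder; the names and the toy data are ours.

```python
import numpy as np

def chisq_batch_loss(Z1, Z2, encoder, tau=lambda s: s):
    """Empirical chi-square contrastive loss (Eq. 8) for one batch of K >= 3 view pairs."""
    K = Z1.shape[0]
    S = tau(encoder(Z1) @ encoder(Z2).T)                 # S[j, k] = S_f(z1_j, z2_k)
    total = 0.0
    for j in range(K):
        others = [k for k in range(K) if k != j]
        # squared-difference term over ordered pairs (k, l) of distinct negatives
        quad = sum((S[j, k] - S[j, l]) ** 2
                   for k in others for l in others if k != l)
        quad /= 4 * (K - 1) * (K - 2)
        lin = sum(S[j, k] for k in others) / (K - 1)     # mean score on negatives
        total += quad + lin - S[j, j]                    # minus the positive-pair score
    return total / K                                     # 1/n with n = n_1 K and n_1 = 1

# toy usage with a random linear encoder
rng = np.random.default_rng(0)
d, p, K = 16, 4, 8
W = rng.standard_normal((p, d)) / np.sqrt(d)
X = rng.standard_normal((K, d))
Z1 = X + 0.1 * rng.standard_normal((K, d))
Z2 = X + 0.1 * rng.standard_normal((K, d))
print(chisq_batch_loss(Z1, Z2, lambda Z: Z @ W.T))
```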
Similar to the KL case in Section 4.2, the downstream performance of \(f\) can be controlled by its \({\mathrm{f}}\)-sufficiency for a broad class of \({\mathrm{f}}\) considered in Definition 1. Recall the CBS form in Definition 1.
Proposition 4 (\({\mathrm{f}}\)-sufficiency bound on downstream performance). The results in Theorem 4 and 2 hold with \({\rm Suff}_{\sf kl}( f)\) replaced by \(c^2_2\cdot{\rm Suff}_{\mathrm{f}}( f)\) for some value \(c_2>0\) if \[\begin{align} {\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm TV}({\mathbb{P}}(\cdot | {\boldsymbol{z}}^{(1)})||{\mathbb{P}}_{{\boldsymbol{z}}^{(2)}|{\boldsymbol{z}}^{(1)}}(\cdot | f({\boldsymbol{z}}^{(1)})))] \leq c_2\cdot \sqrt{{\rm Suff}_{{\mathrm{f}}}( f)} \label{eq:general95f95downstream95cond}. \end{align}\qquad{(5)}\]
Proposition 4 follows immediately by noting that, in the proofs of Theorems 4 and 2, \({\rm Suff}_{\sf kl}( f)\) is only used as an upper bound of the expected total variation distance (e.g., by Pinsker’s inequality). It can be verified that KL-divergence and \({\chi^2}\)-divergence satisfy Eq. ?? with \(c_2=1/\sqrt{2}\). Let \(r={\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})/[{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})]\) denote the density ratio. Moreover, for general \({\mathrm{f}}\), we can choose \(c_2=(2\inf_{({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{\mathrm{f}}^{''}(r))^{-1/2}\), which is bounded when \({\mathrm{f}}\) is strongly convex on the range of the density ratio \(r.\) For example, when \({\mathrm{f}}(x) = 1 - \sqrt{x}\) (squared Hellinger sufficiency), we can choose \(c_2=\sqrt{2}B^{3/4}\) provided the density ratio satisfies \(r \leq B\) for all pairs \(({\boldsymbol{z}}^{(1)}, {\boldsymbol{z}}^{(2)})\). We refer the readers to Lemma 3 in Appendix 7.2 for further details. Combining the results from Sections 4.3.1 and 4.3.2, we provide end-to-end theoretical guarantees for the downstream performance of encoders obtained by minimizing general \({\mathrm{f}}\)-contrastive losses.
In this section, we present concrete examples on linear regression and topic classification to illustrate the applicability of our general results in Section 4.
Let \({\boldsymbol{x}}\) follow a distribution \({\mathbb{P}}_{{\mathcal{X}}}\) on \({\mathcal{X}}\subseteq {\mathbb{R}}^{d}\). We consider a downstream linear regression task, where each observed sample takes the form \(({\boldsymbol{x}},{\boldsymbol{y}}) \in {\mathbb{R}}^{d} \times {\mathbb{R}}\), with the conditional expectation \({\mathbb{E}}[{\boldsymbol{y}}|{\boldsymbol{x}}]=\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle\) for some unknown parameter \({{\boldsymbol{\theta}}_\star}\in{\mathbb{R}}^{d}\). The goal is to predict \({\boldsymbol{y}}\) given \({\boldsymbol{x}}\). While fitting a linear model using only the downstream samples yields a risk of order \(\mathcal{O}(d/m)\), where \(m\) is the number of downstream samples, a smaller risk may be achieved by fitting a linear model on a low-dimensional representation \(f({\boldsymbol{z}}) \in {\mathbb{R}}^{ p}\), where \(p\ll d\), that captures sufficient information about \({\boldsymbol{x}}\) relevant to the downstream task.
Concretely, suppose we are given a linear encoder \(f({\boldsymbol{z}})={\boldsymbol{W}}{\boldsymbol{z}}\) for some \({\boldsymbol{W}}\in{\mathbb{R}}^{ p\times d}\) and \(m\) i.i.d. downstream samples \(\{({\boldsymbol{x}}_i,{\boldsymbol{y}}_i)\}_{i=1}^m\) from the linear model \({\boldsymbol{y}}=\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle+\varepsilon\), where \(\varepsilon\sim{\mathcal{N}}(0,{\widebar\sigma}^2)\perp\!\!\!\perp{\boldsymbol{x}}\). Suppose \(\sup_{{\boldsymbol{x}}\in{\mathcal{X}}}\| {\boldsymbol{x}}\|_{2}\leq B_{{\boldsymbol{x}}},\| {{\boldsymbol{\theta}}_\star}\|_{2}\leq B_{{\boldsymbol{\theta}}}\) for some \(B_{{\boldsymbol{x}}}, B_{{\boldsymbol{\theta}}}>0\) and let \(B= B_{{\boldsymbol{x}}} B_{{\boldsymbol{\theta}}}.\) Also assume that \({\mathbb{E}}[({\rm I}_d-{\boldsymbol{W}}^\dagger{\boldsymbol{W}}){\boldsymbol{z}}|{\boldsymbol{W}}{\boldsymbol{z}}]=0\) almost surely. Theorem 5 below gives a theoretical guarantee for learning the downstream task using a given linear encoder.
Theorem 5 (Linear regression with encoder representation). Let \(p\leq d.\) Under the setup and assumptions in Section 5.1, consider fitting a linear model \({\sf h}_{\widehat\boldsymbol{\eta}}({\boldsymbol{x}})=\langle f({\boldsymbol{z}}),\widehat\boldsymbol{\eta}\rangle\) by ordinary least squares, i.e., \[\begin{align} \widehat\boldsymbol{\eta}\mathrel{\vcenter{:}}=\mathrm{argmin}_{\boldsymbol{\eta}\in{\mathbb{R}}^{ p}}\Big\{{\widehat {\sf R}}_{{\sf lin}}({\sf h}_{\boldsymbol{\eta}})\mathrel{\vcenter{:}}=\frac{1}{m}\sum_{i=1}^m(\langle f({\boldsymbol{z}}_i) , \, \boldsymbol{\eta} \rangle-{\boldsymbol{y}}_i)^2\Big\}, \end{align}\] where \({\boldsymbol{z}}=g({\boldsymbol{x}}),{\boldsymbol{z}}_i=g_i({\boldsymbol{x}}_i)\), and \(g,\{g_i\}_{i=1}^m\) are i.i.d. transformations from \({\mathbb{P}}_{\mathcal{G}}\). Then the expected risk of the truncated linear model \(\widetilde{\sf h}_{\widehat\boldsymbol{\eta}}({\boldsymbol{x}})\mathrel{\vcenter{:}}={\rm proj}_{[-B,B]}({\sf h}_{\widehat\boldsymbol{\eta}}({\boldsymbol{x}}))\) satisfies \[\begin{align} {\mathbb{E}}[{\sf R}_{{\sf lin}}(\widetilde{\sf h}_{\widehat\boldsymbol{\eta}})]\mathrel{\vcenter{:}}={\mathbb{E}}\big[{\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},g}[({\boldsymbol{y}}-\widetilde{\sf h}_{\widehat\boldsymbol{\eta}}({\boldsymbol{x}}))^2]\big] \leq \underbrace{{\widebar\sigma}^2}_{\mathrm{irreducible risk}}+ c\Big( (B^2c_2\sqrt{{\rm Suff}_{{\mathrm{f}}}( f)}+\epsilon_{\mathcal{G}})+({\widebar\sigma}^2+B^2)\frac{ p\log m}{m}\Big), \end{align}\] where \(\epsilon_{\mathcal{G}}={\mathbb{E}}[\langle {\boldsymbol{x}}-{\boldsymbol{z}} , \, {{\boldsymbol{\theta}}_\star} \rangle^2]\) and the outer expectation is over \({\{({\boldsymbol{x}}_i,{\boldsymbol{y}}_i,g_i)\}_{i=1}^m}\) for some absolute constant \(c>0\). Here, \(c_2>0\) is any value that satisfies Eq. ?? .
The proof of Theorem 5 is contained in Appendix 9.1. Compared to fitting a linear model on the raw feature \({\boldsymbol{x}}\in{\mathbb{R}}^{d}\), which yields an excess risk of \(\mathcal{O}(d/m)\), Theorem 5 achieves a smaller excess risk of order \(\widetilde{\mathcal{ O}}( p/m)\) when \(p\ll d\) and \(f(g({\boldsymbol{x}}))\) is a “good" representation of \({\boldsymbol{x}}\), in the sense that \({\rm Suff}_{\mathrm{f}}( f)\) and \(\epsilon_{\mathcal{G}}\) are sufficiently small. A similar bound can be established for the risk \({\sf R}_{\sf lin}(\widetilde{\sf h}_{\widehat\boldsymbol{\eta}})\) with high probability under additional sub-Gaussian assumptions on the representation \(f({\boldsymbol{z}})={\boldsymbol{W}}g({\boldsymbol{x}})\) [41]. We provide the bound in expectation \({\mathbb{E}}[{\sf R}_{{\sf lin}}(\widetilde{\sf h}_{\widehat\boldsymbol{\eta}})]\) for simplicity of presentation.
The assumption \({\mathbb{E}}[({\rm I}_d-{\boldsymbol{W}}^\dagger{\boldsymbol{W}}){\boldsymbol{z}}|{\boldsymbol{W}}{\boldsymbol{z}}]=0\) essentially states that the information of the augmented view \({\boldsymbol{z}}\) discarded by the encoder \(f\) does not contain any signal with a non-zero mean. Without this assumption, there may not exist a linear function of \(f({\boldsymbol{z}})\) that achieves a small risk \({\sf R}_{{\sf lin}}(\cdot)\), even though Theorem 4 guarantees the existence of a general function of \(f({\boldsymbol{z}})\) with a small risk. Note that the assumption is satisfied when e.g., \({\boldsymbol{z}}\) follows the standard normal distribution on \({\mathbb{R}}^{d}\).
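The downstream procedure of Theorem 5 (OLS on the \(p\)-dimensional representation followed by truncation to \([-B,B]\)) can be sketched as follows; the toy encoder, augmentation, and the crude choice of \(B\) are our assumptions for illustration only.

```python
import numpy as np

def fit_downstream_ols(F, y):
    """Ordinary least squares of y on the p-dimensional representations F = f(z_i)."""
    eta, *_ = np.linalg.lstsq(F, y, rcond=None)
    return eta

def truncated_predict(F_new, eta, B):
    """Truncated linear predictor proj_[-B, B](<f(z), eta>) from Theorem 5."""
    return np.clip(F_new @ eta, -B, B)

# toy usage: z = g(x) is a noisy copy of x, f(z) = W z with p << d
rng = np.random.default_rng(0)
d, p, m = 20, 5, 200
theta = rng.standard_normal(d) / np.sqrt(d)      # unknown downstream parameter theta_star
W = rng.standard_normal((p, d)) / np.sqrt(d)     # a given (pretrained) linear encoder
X = rng.standard_normal((m, d))
Z = X + 0.1 * rng.standard_normal((m, d))        # augmented views z_i = g_i(x_i)
y = X @ theta + 0.1 * rng.standard_normal(m)     # downstream linear model

eta_hat = fit_downstream_ols(Z @ W.T, y)
B = np.linalg.norm(theta) * np.sqrt(d)           # a crude stand-in for B = B_x * B_theta
preds = truncated_predict(Z @ W.T, eta_hat, B)
print(np.mean((preds - X @ theta) ** 2))         # empirical proxy for R_G(h o f)
```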
We now present a scenario in which a linear encoder \(f\) with low KL-sufficiency \({\rm Suff}_{\sf kl}( f)\) can be obtained through SimCLR loss minimization in Eq. 2 . Let \({\sf U}= ({\sf U}_1,{\sf U}_2)\in{\mathbb{R}}^{d\times d}\), where \({\sf U}_1\in{\mathbb{R}}^{d\times p}\), be a fixed unitary matrix, and define \({\boldsymbol{A}}={\sf U}_1{\sf U}_1^\top\). For \(i\in[2]\), define the unit sphere in the column space of \({\sf U}_i\) as \(\mathbb{S}({\sf U}_i)\mathrel{\vcenter{:}}=\{{\boldsymbol{v}}\in{\mathbb{R}}^{d}:\| {\boldsymbol{v}}\|_{2}=1,({\rm I}_d-{\sf U}_i{\sf U}_i^\top){\boldsymbol{v}}={\boldsymbol{0}}\}\). Assume \({\boldsymbol{x}}\in{\mathbb{R}}^{d}\sim{\mathcal{N}}({\boldsymbol{0}},{\rm I}_d/{ p})\) and consider the random transformation \(g\) such that \(g({\boldsymbol{x}})|{\boldsymbol{x}}\overset{d}{=} ( {\boldsymbol{A}}{\boldsymbol{x}}+\eta) | \{{\boldsymbol{A}}{\boldsymbol{x}}+\eta\in \mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2)\}\), i.e., the conditional distribution \(g({\boldsymbol{x}})|{\boldsymbol{x}}\) follows the distribution of \({\boldsymbol{A}}{\boldsymbol{x}}+\eta\) conditioned on \(\mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2)\), where the noise \(\eta\sim{\mathcal{N}}({\boldsymbol{0}},{\sigma}^2{\rm I}_d/{ p})\). A concrete example of this transformation involves zeroing out the second half of the coordinates of the sample \({\boldsymbol{x}}\in{\mathbb{R}}^{d}\), adding some Gaussian noise to all coordinates, and then normalizing both halves of the noisy sample to have unit norm. In this case, \({\sf U}_1,{\sf U}_2\) correspond to the first and second halves of the coordinates, respectively.
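The concrete transformation just described (zero out the second half of the coordinates, add Gaussian noise, normalize both halves) can be sketched as follows; normalizing each half is how we realize the conditioning on \(\mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2)\) in this example, and the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, sigma, p):
    """Concrete instance of g from Section 5.1.1 with U_1 / U_2 the first / second
    coordinate halves: zero out the second half, add N(0, sigma^2 I_d / p) noise,
    then normalize each half to unit norm."""
    d = x.size
    half = d // 2
    v = x.copy()
    v[half:] = 0.0                                       # A x with A = U_1 U_1^T
    v = v + sigma / np.sqrt(p) * rng.standard_normal(d)  # add the noise eta
    v[:half] /= np.linalg.norm(v[:half])                 # land on S(U_1)
    v[half:] /= np.linalg.norm(v[half:])                 # land on S(U_2)
    return v

d, p, sigma = 16, 4, 1.0
x = rng.standard_normal(d) / np.sqrt(p)                  # x ~ N(0, I_d / p)
z1, z2 = augment(x, sigma, p), augment(x, sigma, p)      # two augmented views of x
```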
Under this setup, it is readily verified that the distribution of \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\) is supported on \(\mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2)\), and conditioned on \(\mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2)\), the densities satisfy \[\begin{align} {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})&\propto~ \exp \left(-\frac{ p}{2} \left\langle \begin{pmatrix}{\boldsymbol{z}}^{(1)}\\ {\boldsymbol{z}}^{(2)} \end{pmatrix},\begin{bmatrix} {\sf U}_1{\sf U}_1^\top + {\sigma}^2 {\rm I}_d& {\sf U}_1{\sf U}_1^\top\\ {\sf U}_1{\sf U}_1^\top& {\sf U}_1{\sf U}_1^\top + {\sigma}^2 {\rm I}_d \end{bmatrix}^{-1}\begin{pmatrix}{\boldsymbol{z}}^{(1)}\\ {\boldsymbol{z}}^{(2)} \end{pmatrix} \right\rangle \right), \\ {\mathbb{P}}({\boldsymbol{z}}^{(1)})&\propto~ 1,~~~~\text{and}\\ \frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})} &\propto~ \exp \big(\kappa \langle {\boldsymbol{z}}^{(1)}, {\sf U}_1{\sf U}_1^\top {\boldsymbol{z}}^{(2)}\rangle\big) , ~~~~~~~\kappa\mathrel{\vcenter{:}}=\frac{ p}{{\sigma}^2({\sigma}^2+2)}\leq\frac{ p}{{\sigma}^4} . \end{align}\] Note that \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\) restricted to \(\mathbb{S}({\sf U}_1) \times \mathbb{S}({\sf U}_1)\) follows the joint von Mises-Fisher distribution (vMF) [42]. In this case, the optimal score is given by \({\sf S_\star}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})=\tau(\langle f_\star({\boldsymbol{z}}^{(1)}) , \, f_\star({\boldsymbol{z}}^{(2)}) \rangle)\), where \(\tau(x)=\kappa x\) and \(f_\star({\boldsymbol{z}})={\sf U}_1{\boldsymbol{z}}.\) Moreover, we have the following guarantee on the sufficiency of the SimCLR estimator \(\widehat f\).
Corollary 1 (An upper bound on the sufficiency). Under the setup in Section 5.1.1, let \({\mathcal{F}}\mathrel{\vcenter{:}}=\{ f: f({\boldsymbol{z}})={\boldsymbol{W}}{\boldsymbol{z}},~~ {\boldsymbol{W}}\in{\mathbb{R}}^{ p\times d} \text{ and }|\!|\!| {\boldsymbol{W}} | \! | \!|_{{\tiny{op}}}\leq B_{\boldsymbol{W}}\}\) for some \(B_{\boldsymbol{W}}\geq1\), and set \(\tau(x)=\kappa x\). Define \(\widehat f\) as the SimCLR empirical risk minimizer obtained from Eq. 2 , using batch size \(K\) and \(n\) samples. Then, with probability at least \(1-\delta\), we have \[\begin{align} {\rm Suff}_{{\sf kl}}( \widehat f) \leq \Big(1 +\frac{{C}}{K}\Big) \cdot \sqrt\frac{d p\cdot \log B_{\boldsymbol{W}}+ \log(1/\delta)}{n} \end{align}\] for some constant \({C}>0\) that depends polynomially on \(\exp(\kappa)\).
See Appendix 9.2 for the proof. Note that the constant \(\exp(\kappa)\) depends on the noise level \({\sigma}\). When \({\sigma}\gtrsim p^{1/4}\), finding a near-sufficient encoder is relatively easy. Combining Theorem 5 and Corollary 1, we conclude that the learned encoder \(\widehat f\) can achieve a small risk in the downstream linear regression task, provided that there are sufficient pretraining and downstream samples, and that data augmentation does not significantly alter the output of the true linear model (i.e., \(\epsilon_{\mathcal{G}}\) is small). See Appendix 9.3 for an end-to-end statement and its proof.
Next, we provide theoretical guarantees for contrastive learning and its downstream performance in a classification setting. Let \({\mathcal{Y}}= \{1, 2, \ldots, M\}\) represent a set of classes. A sample \({\boldsymbol{x}}\) is generated by first selecting a class \(\boldsymbol{y}\in {\mathcal{Y}}\) from some distribution \({\mathbb{P}}_{\mathcal{Y}}\), and then drawing \({\boldsymbol{x}}=({\boldsymbol{x}}^{c_{1}},{\boldsymbol{x}}^{c_{2}})\in[S]\times[S]\) conditioned on \(\boldsymbol{y}\), with the joint distribution \[\begin{align} {\mathbb{P}}({\boldsymbol{x}}| \boldsymbol{y}) = {\mathbb{P}}_c({\boldsymbol{x}}^{c_{1}} | \boldsymbol{y}) \times {\mathbb{P}}_c({\boldsymbol{x}}^{c_{2}} | \boldsymbol{y}), \end{align}\] where \({\mathbb{P}}_c(\cdot|\boldsymbol{y})\) is some conditional distribution over \([S]\). For example, in a topic classification task, each sample consists of a two-part sentence (or a two-word phrase), with the class \(\boldsymbol{y}\) representing the topic (e.g., sports, technology, or health). The first and second parts (or words), \({\boldsymbol{x}}^{c_{1}}\) and \({\boldsymbol{x}}^{c_{2}}\), are independently sampled from a vocabulary of size \(S\), conditioned on the topic \(\boldsymbol{y}\).
Contrastive learning. We consider learning a near-sufficient encoder \(f\) via minimizing the \({\chi^2}\)-contrastive loss. Namely, we consider the random dropout transformation \(g: [S] \times [S] \to [S]\), which selects one component \({\boldsymbol{x}}^{c_{i}}\) from the pair \(({\boldsymbol{x}}^{c_{1}}, {\boldsymbol{x}}^{c_{2}})\) with equal probability as the augmented view \({\boldsymbol{z}}\) and drops the other. With slight abuse of notation, we also denote the augmented view \({\boldsymbol{z}}\) using one-hot encoding. We consider encoders \(f\) that are linear functions of \({\boldsymbol{z}}\) augmented with the one-hot encoding, namely, the encoder space \[\begin{align} {\mathcal{F}}=\{ f_ {\sf aug}:\cup_{i=1}^S\{e_i\}\mapsto{\mathbb{R}}^{M+S}~|~ f_ {\sf aug}({\boldsymbol{z}})=(({\boldsymbol{W}}{\boldsymbol{z}})^\top, {w}\cdot{\boldsymbol{z}}^\top)^\top,~{\boldsymbol{W}}\in{\mathbb{R}}^{M\times S},{w}\in{\mathbb{R}},\| {\boldsymbol{W}}\|_{2,\infty}\vee|{w}/\sqrt{S}|\leq B_{\boldsymbol{W}}\} \end{align}\] with \(B_{\boldsymbol{W}}={M}\). To learn an encoder \(\widehat f_ {\sf aug}\), we minimize the \({\chi^2}\)-contrastive loss computed using \(n\) i.i.d. pairs of augmented views via Eq. 8 . Importantly, class labels \(\{\boldsymbol{y}_i\}_{i=1}^{n}\) remain unobservable during contrastive learning. We note that a similar data distribution was studied in [27], where the augmented views correspond deterministically to the first and second components of the sample.
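A minimal sketch of the data-generating process and the dropout augmentation just described follows; the toy topic and word distributions are ours and do not enforce the uniform word marginal assumed later in Assumption (a).

```python
import numpy as np

rng = np.random.default_rng(0)
M, S = 3, 12                                     # number of topics and vocabulary size
P_y = np.full(M, 1.0 / M)                        # uniform topic distribution
P_c = rng.random((M, S)); P_c /= P_c.sum(1, keepdims=True)   # toy P_c(. | y) over [S]

def sample_pair():
    """Draw (x^{c1}, x^{c2}) conditionally i.i.d. given the topic y."""
    y = rng.choice(M, p=P_y)
    x1, x2 = rng.choice(S, p=P_c[y]), rng.choice(S, p=P_c[y])
    return (x1, x2), y

def dropout_view(x_pair):
    """Random dropout transformation g: keep one component with equal probability,
    returned as a one-hot vector in R^S."""
    kept = x_pair[rng.integers(2)]
    z = np.zeros(S)
    z[kept] = 1.0
    return z

def encoder_aug(z, W, w):
    """Augmented linear encoder f_aug(z) = (W z, w * z) in R^{M+S}."""
    return np.concatenate([W @ z, w * z])

x_pair, y = sample_pair()                        # y is NOT observed during pretraining
z1, z2 = dropout_view(x_pair), dropout_view(x_pair)
W, w = rng.standard_normal((M, S)), 1.0
print(encoder_aug(z1, W, w).shape)               # (M + S,)
```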
Downstream classification. We consider a downstream task in which we are given i.i.d. samples \(\{({\boldsymbol{x}}_i,\boldsymbol{y}_i)\}_{i=1}^{m}\) from the joint distribution of \(({\boldsymbol{x}}, {\boldsymbol{y}})\), and the goal is to learn the conditional topic distribution \(\big({\mathbb{P}}(\boldsymbol{y}= y | {\boldsymbol{x}})\big)_{y \in [M]} \in {\mathbb{R}}^{M}\) using an encoder. Let \(\widehat f_ {\sf aug}({\boldsymbol{z}}) = (({\widehat{\boldsymbol{W}}}{\boldsymbol{z}})^\top, {\widehat w}\cdot{\boldsymbol{z}}^\top)^\top\) be the representation learned from contrastive learning, and define the encoder as \(\widehat f({\boldsymbol{z}}) \mathrel{\vcenter{:}}={\widehat{\boldsymbol{W}}}{\boldsymbol{z}}\in {\mathbb{R}}^{M}\). We train a multi-class linear classifier on \(\widehat f\) to predict the topic distribution.
Define the gold representation \({{\boldsymbol{E}}_\star}\in{\mathbb{R}}^{M\times S}\) whose \(j\)-th column is \(({{\boldsymbol{E}}_\star})_{\cdot j}=\Big(\frac{{\mathbb{P}}_c({\boldsymbol{y}}=1|{\boldsymbol{x}}^{c_{1}}=j)}{\sqrt{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=1)}},\ldots,\frac{{\mathbb{P}}_c({\boldsymbol{y}}=M|{\boldsymbol{x}}^{c_{1}}=j)}{\sqrt{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=M)}}\Big)^\top\) for \(j\in[S]\). We also make the following regularity assumptions:
(a) The marginal distributions of \({\boldsymbol{y}}\) and \({\boldsymbol{x}}^{c_{1}}\) are uniform over \([M]\) and \([S]\), respectively.
(b) The minimum singular value of \({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top\) satisfies \(\sigma_{\min}({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)\geq \sigma_{{{\boldsymbol{E}}_\star}}^2\) for some \(\sigma_{{{\boldsymbol{E}}_\star}}>0.\)
(c) \(S\geq 4M\) and \(\inf_{y\in[M],s\in[S]}{\mathbb{P}}_c(y|s)\geq\exp(-B)\) for some \(B>0\).
Assumption (a) assumes uniform topic and word (or sentence) distributions, simplifying the analysis of \({\chi^2}\)-contrastive learning. Assumption (b) is a technical assumption that allows us to transform the learned embedding \(\widehat f({\boldsymbol{z}})\) to the gold representation \({{\boldsymbol{E}}_\star}({\boldsymbol{z}})\). Assumption (c) ensures the vocabulary size \(S\) is large compared with the number of topics \(M\) and all topics have non-vanishing conditional probability in \({\mathbb{P}}_c.\) With these assumptions at hand, we have
Theorem 6 (Classification using the \({\chi^2}\)-trained encoder). Under the setup and assumptions in Section 5.2, let \(\widehat f_ {\sf aug}\) be the ERM in Eq. 8 . Then, with probability at least \(1-\delta\) over \(\{({\boldsymbol{z}}^{(1)}_i,{\boldsymbol{z}}^{(2)}_i)\}_{i=1}^{n}\), \[\begin{align} {\rm Suff}_{{\chi^2}}( \widehat f_ {\sf aug}) \leq R_{{\mathrm{f}}}( {\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})\eqqcolon{\rm Suff}_{{\chi^2}}( {\sf S}_{ \widehat f_ {\sf aug}}) \leq \frac{cS^2M^4}{\sqrt{n}}\Big[\sqrt{\log(1/\delta)} + \sqrt{S}{M}^{1.5} \Big] \label{eq:thm95img95suff95bound} \end{align}\qquad{(6)}\] for some absolute constant \(c>0.\)
In downstream classification, given \(m\) i.i.d. samples \(\{({\boldsymbol{x}}_i,\boldsymbol{y}_i)\}_{i=1}^{m}\), consider fitting a multi-class classifier \({\sf h}_{\widehat{\boldsymbol{\Gamma}}}({\boldsymbol{x}})=\widebar{\sf h}_{\widehat{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}))\mathrel{\vcenter{:}}={\rm softmax}(\log{\sf trun}(\widehat{\boldsymbol{\Gamma}}_w \widehat f({\boldsymbol{z}})+\widehat{\boldsymbol{\Gamma}}_b))\) with \[\begin{align} \widehat{\boldsymbol{\Gamma}}\mathrel{\vcenter{:}}=\mathrm{argmin}_{{\boldsymbol{\Gamma}}_{w}\in{\mathbb{R}}^{M\times M},{\boldsymbol{\Gamma}}_{b}\in{\mathbb{R}}^{M},~~|\!|\!| {\boldsymbol{\Gamma}}_{w} | \! | \!|_{{\tiny{op}}}\vee\| {\boldsymbol{\Gamma}}_{b}\|_{2}\leq B_\Gamma}\Big\{{\widehat {\sf R}}_{{\rm cls}}({\sf h}_{{\boldsymbol{\Gamma}}})\mathrel{\vcenter{:}}= - \frac{1}{m}\sum_{i=1}^m{\log \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}_i) )_{\boldsymbol{y}_i}}\Big\},\label{eq:thm95img95dst95bound} \end{align}\qquad{(7)}\] where \({\boldsymbol{z}}=g({\boldsymbol{x}}),{\boldsymbol{z}}_i=g_i({\boldsymbol{x}}_i)\) and \(g,\{g_i\}_{i=1}^m\) are i.i.d. dropout transformations, \(B_\Gamma\geq 4\sqrt{S}M/\sigma_{{{\boldsymbol{E}}_\star}}\), and \({\sf trun}(x)\mathrel{\vcenter{:}}={\rm proj}_{[\exp(-B),1]}(x)\). Then there exist absolute constants \(c,c'>0\) such that, given the encoder \(\widehat f\), if \({\rm Suff}_{{\chi^2}}( {\sf S}_{ \widehat f_ {\sf aug}})\leq c'\frac{\sigma_{{{\boldsymbol{E}}_\star}}^2}{S^2M}\), then with probability at least \(1-\delta_1\) \[\begin{align} &\quad~{\sf R}_{{\rm cls}}(\widebar{\sf h}_{\widehat{\boldsymbol{\Gamma}}}) \mathrel{\vcenter{:}}= {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},g}[ {\mathsf D}_{\rm KL}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})|| {\sf h}_{\widehat{\boldsymbol{\Gamma}}}( \widehat f(g({\boldsymbol{x}})) ) )]\\ &\leq c\Big(\underbrace{\Big[\epsilon^{\sf cls}_{\mathcal{G}}+\frac{S\exp(B)}{\sigma_{{{\boldsymbol{E}}_\star}}^2}\cdot{\rm Suff}_{{\chi^2}}( {\sf S}_{ \widehat f_ {\sf aug}})\Big]}_{\mathrm{approximation error}} + \underbrace{ \frac{B}{\sqrt{m}}\Big[\sqrt{\log(1/\delta_1)} + {M(\sqrt{\log B_\Gamma}+\sqrt{B}) }\Big]\Big)}_{\mathrm{generalization error}}. \end{align}\]
See the proof in Appendix 9.4. Note that the bound on downstream classification depends on the sufficiency of the score function \({\rm Suff}_{{\chi^2}}( {\sf S}_{ \widehat f_ {\sf aug}})\), introduced in Appendix 7.3, rather than \({\rm Suff}_{{\chi^2}}( \widehat f)\). This distinction arises because we restrict ourselves to linear classifiers, whereas Theorem 2 considers arbitrary measurable functions, leading to an additional approximation error term.
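To read the rates in Theorem 6 concretely, ignore absolute constants and the \(\sqrt{\log(1/\delta)}\) term; the sufficiency bound in Eq. 6 then scales as \(S^{2.5}M^{5.5}/\sqrt{n}\), so the condition \({\rm Suff}_{{\chi^2}}( {\sf S}_{ \widehat f_ {\sf aug}})\leq c'\sigma_{{{\boldsymbol{E}}_\star}}^2/(S^2M)\) required for the downstream bound is met once \[\begin{align} \frac{S^{2.5}M^{5.5}}{\sqrt{n}}\lesssim \frac{\sigma_{{{\boldsymbol{E}}_\star}}^2}{S^2M} \quad\Longleftrightarrow\quad n\gtrsim \frac{S^{9}M^{13}}{\sigma_{{{\boldsymbol{E}}_\star}}^4}. \end{align}\] This back-of-the-envelope reading is only meant to illustrate the polynomial dependence on the vocabulary size and the number of topics; it is not a separate claim.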
In this work, we present a new theoretical framework for data augmentation-based contrastive learning, with SimCLR as a representative example. Based on the extended concept of approximate sufficient statistics, we establish a connection between minimizing the \({\mathrm{f}}\)-contrastive losses and minimizing the conditional Bregman sufficiency (CBS) of the encoder. Moreover, we show that the learned encoders can be effectively applied to downstream tasks, with performance depending on their sufficiency and the error on the downstream task induced by data augmentation.
Our work opens up many directions for future research. First, as seen in Definition 1, the concept of approximate sufficient statistics is not limited to contrastive learning; exploring its applicability to other self-supervised and supervised learning paradigms is a promising direction. Second, while approximate sufficiency quantifies the information preserved by the encoder, it does not reflect the redundancy in its representation. Thus, it would be interesting to generalize the concept of minimal sufficient statistics and develop practical algorithms for finding representations that are both approximately sufficient and minimal. Lastly, our work mainly focuses on the empirical risk minimizers in contrastive learning. Understanding what representations are learned and how training algorithms influence the learned representation remains another exciting avenue for future research.
This project was supported by NSF grants DMS-2210827, CCF-2315725, CAREER DMS-2339904, ONR grant N00014-24-S-B001, an Amazon Research Award, a Google Research Scholar Award, an Okawa Foundation Research Grant, and a Sloan Research Fellowship.
In this section, we discuss some properties of approximate sufficient statistics introduced in Definition 1 and provide some concrete examples.
Lemma 1 (Equivalence of the three forms of sufficiency). The ILS, VFS, and CBS definitions in Definition 1 are equivalent, i.e., for any statistic \(T\), \[\begin{align} {\rm Suff}_{{\rm il}, {\mathrm{f}}}(T) = {\rm Suff}_{{\rm vf}, {\mathrm{f}}}(T)= {\rm Suff}_{{\rm cb}, {\mathrm{f}}}(T)\eqqcolon{\rm Suff}_{{\mathrm{f}}}(T). \end{align}\]
Proof of Lemma 1.
\(({\boldsymbol{I}LS})\) \(\Leftrightarrow\) \(({\boldsymbol{V}FS})\). Note that by the variational form of \({\mathrm{f}}\)-divergence, we have \[\begin{align} &\quad~-I_{{\mathrm{f}}}(X, Y) \\ &= \inf_{{\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\to{\mathbb{R}}}{\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x, y) )] \\ &= \inf_{{{\sf S}_x:{\mathcal{X}}\to{\mathbb{R}}},{{\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\to{\mathbb{R}}}}{\mathbb{E}}_{{\mathbb{P}}(x, y)}[ {\sf S}_x(x)- {\sf S}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x, y) -{\sf S}_x)] \\ &= \inf_{{\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\to{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + \inf_{{\sf S}_x:{\mathcal{X}}\mapsto{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x, y) - {\sf S}_x(x)) + {\sf S}_x(x)] =\inf_{{\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\to{\mathbb{R}}}R_{\mathrm{f}}({\sf S}). \end{align}\] Similarly, \[\begin{align} &\quad-I_{{\mathrm{f}}}(T(X), Y) \\ &= \inf_{{\sf S}:T({\mathcal{X}})\times{\mathcal{Y}}\to{\mathbb{R}}}{\mathbb{E}}_{{\mathbb{P}}(T(x), y)}[ - {\sf S}(T(x), y)] + {\mathbb{E}}_{{\mathbb{P}}(T(x)){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(T(x), y) )] \\ &= \inf_{{\sf S}:T({\mathcal{X}})\times{\mathcal{Y}}\to{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(T(x), y)}[ - {\sf S}(T(x), y)] + \inf_{{\sf S}_x:T({\mathcal{X}})\mapsto{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(T(x)){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(T(x), y) - {\sf S}_x(T(x))) + {\sf S}_x(T(x))] \\ &= \inf_{{\sf S}:T({\mathcal{X}})\times{\mathcal{Y}}\to{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(T(x), y)] + \inf_{{\sf S}_x:T({\mathcal{X}})\mapsto{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(T(x), y) - {\sf S}_x(T(x))) + {\sf S}_x(T(x))] \\ &= \inf_{{\sf S}:T({\mathcal{X}})\times{\mathcal{Y}}\to{\mathbb{R}}}R_{\mathrm{f}}({\sf S}\circ T). \end{align}\] Combining the two results yields the equivalence between \(({\boldsymbol{I}LS})\) and \(({\boldsymbol{V}FS})\).
\(({\boldsymbol{I}LS})\) \(\Leftrightarrow\) \(({\boldsymbol{C}BS})\). By definition of the \(({\boldsymbol{I}LS})\) \[\begin{align} {\rm Suff}_{{\rm il},{\mathrm{f}}}(T) &=I_{{\mathrm{f}}}(X, Y) -I_{{\mathrm{f}}}(T(X), Y) \\ &= \int {\mathrm{f}}\Big( \frac{{\mathbb{P}}(x, y)}{{\mathbb{P}}(x) {\mathbb{P}}(y)} \Big) {\mathbb{P}}(x) {\mathbb{P}}(y) d{\boldsymbol{\mu}} - \int {\mathrm{f}}\Big( \frac{{\mathbb{P}}(T(x), y)}{{\mathbb{P}}(T(x)) {\mathbb{P}}(y)} \Big) {\mathbb{P}}(T(x)) {\mathbb{P}}(y) d{\boldsymbol{\mu}}\\ &= \int {\mathrm{f}}\Big( \frac{{\mathbb{P}}(y|x)}{ {\mathbb{P}}(y)} \Big) {\mathbb{P}}(x) {\mathbb{P}}(y) d{\boldsymbol{\mu}} - \int {\mathrm{f}}\Big( \frac{{\mathbb{P}}(y|T(x))}{{\mathbb{P}}(y) } \Big) {\mathbb{P}}(x) {\mathbb{P}}(y) d{\boldsymbol{\mu}} \\ &= {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)} \Big[ {\mathrm{f}}\Big( \frac{{\mathbb{P}}(y|x)}{ {\mathbb{P}}(y)} \Big) - {\mathrm{f}}\Big( \frac{{\mathbb{P}}(y|T(x))}{{\mathbb{P}}(y) } \Big) \Big]= {\rm Suff}_{{\rm cb},{\mathrm{f}}}(T), \end{align}\] where the last equality follows since \[\begin{align} &\quad {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)} \Big[ {\mathrm{f}}'\Big( \frac{{\mathbb{P}}(y|T(x))}{{\mathbb{P}}(y) } \Big)\Big( \frac{{\mathbb{P}}(y|x)}{ {\mathbb{P}}(y)}-\frac{{\mathbb{P}}(y|T(x))}{{\mathbb{P}}(y) } \Big) \Big] \notag\\ &= {\mathbb{E}}\Big[{\mathbb{E}}\Big[ {\mathrm{f}}'\Big( \frac{{\mathbb{P}}(y|T(x))}{{\mathbb{P}}(y) } \Big)\Big( \frac{{\mathbb{P}}(y|x)}{ {\mathbb{P}}(y)}-\frac{{\mathbb{P}}(y|T(x))}{{\mathbb{P}}(y) } \Big) \Big|T(x)\Big]\Big]\notag \\ &= {\mathbb{E}}\Big[ \frac{1}{{\mathbb{P}}(y)}{\mathrm{f}}'\Big( \frac{{\mathbb{P}}(y|T(x))}{{\mathbb{P}}(y) } \Big)\Big[ {\mathbb{E}}[{\mathbb{P}}(y|x)|T(x)]-{{\mathbb{P}}(y|T(x))} \Big] \Big]=0.\label{eq:cbs95pf1} \end{align}\tag{9}\]
An equivalent expression of \(({\boldsymbol{C}BS})\). We now show that \[\begin{align} {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ B_{{\mathrm{f}}}\Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}, \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big] = \inf_{{\mathbb{Q}}:T({\mathcal{X}})\mapsto\Delta({\mathcal{Y}})} {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ B_{{\mathrm{f}}}\Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}, \frac{{\mathbb{Q}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big]. \end{align}\] This follows immediately as for any \({\mathbb{Q}}:T({\mathcal{X}})\mapsto\Delta({\mathcal{Y}})\) \[\begin{align} &\quad{\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ B_{{\mathrm{f}}}\Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}, \frac{{\mathbb{Q}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big] - {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ B_{{\mathrm{f}}}\Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}, \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big] \\ &= {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ {\mathrm{f}}\Big(\frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)} \Big) - {\mathrm{f}}\Big(\frac{{\mathbb{Q}}(y | T(x))}{{\mathbb{P}}(y)} \Big) - {\mathrm{f}}'\Big(\frac{{\mathbb{Q}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}- \frac{{\mathbb{Q}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big]\\ &\geq {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ {\mathrm{f}}'\Big(\frac{{\mathbb{Q}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big( \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)} -\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)} \Big) \Big]=0, \end{align}\]where the first equality uses Eq. 9 . ◻
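For concreteness, specializing this identity to \({\mathrm{f}}(x)=x\log x\) (where \(B_{{\mathrm{f}}}(a,b)=a\log(a/b)-(a-b)\) and the linear term vanishes after integrating over \(y\), since both arguments are conditional densities divided by \({\mathbb{P}}(y)\)) gives \[\begin{align} {\mathbb{E}}_{{\mathbb{P}}(x)}[{\mathsf D}_{\rm KL}({\mathbb{P}}(y|x)||{\mathbb{P}}(y|T(x)))] = \inf_{{\mathbb{Q}}:T({\mathcal{X}})\mapsto\Delta({\mathcal{Y}})} {\mathbb{E}}_{{\mathbb{P}}(x)}[{\mathsf D}_{\rm KL}({\mathbb{P}}(y|x)||{\mathbb{Q}}(y|T(x)))], \end{align}\] that is, \({\mathbb{P}}(y|T(x))\) is the best prediction of \(y\) based on \(T(x)\) in expected KL-divergence.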
Lemma 2 (Global minimizers of \(R_{\mathrm{f}}({\sf S})\)). Recall \[\begin{align} R_{{\mathrm{f}}}({\sf S})= {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + \inf_{{\sf S}_x:{\mathcal{X}}\mapsto{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x, y) - {\sf S}_x(x)) + {\sf S}_x(x)]. \end{align}\] For \({\mathrm{f}}\) that is strictly convex and differentiable, the following results hold for \(R_{{\mathrm{f}}}(\cdot)\).
The infimum in the definition of \(R_{{\mathrm{f}}}(\cdot)\) is obtained by \({\sf S}_x(x)\) such that \({\mathbb{E}}_{{\mathbb{P}}(y)}[({\mathrm{f}}')^{-1}({\sf S}(x,y)-{\sf S}_x(x))]=1\) for all \(x\).
Let \({\sf S_\star}(x,y)\mathrel{\vcenter{:}}={\mathrm{f}}'(\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}).\) The global minimizers of \(R_{{\mathrm{f}}}(\cdot)\) form the set \[\begin{align} \mathcal{M}_{{\mathrm{f}}}\mathrel{\vcenter{:}}=\Big\{{\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\mapsto{\mathbb{R}}, {\sf S}(x,y) ={\sf S_\star}(x,y) +{\sf S}_x(x) \text{ for some }{\sf S}_x: {\mathcal{X}}\mapsto{\mathbb{R}}\Big\}. \end{align}\]
Proof of Lemma 2. For any fixed \(x\), we have \[\begin{align} \nabla_c{\mathbb{E}}_{{\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x,y)-c)+c] = {\mathbb{E}}_{{\mathbb{P}}(y)}[-\nabla{\mathrm{f}}^*({\sf S}(x,y)-c)+1]. \end{align}\] Claim (1) follows immediately from setting this derivative equal to zero and noting that \(\nabla{\mathrm{f}}^*=({\mathrm{f}}')^{-1}\).
To prove claim (2), we first note that adding any function \({\sf S}_x(x)\) to \({\sf S}(x,y)\) does not change the value of \(R_{\mathrm{f}}({\sf S})\) due to the infimum inside the definition of \(R_{\mathrm{f}}({\sf S})\). Therefore, it suffices to show that the unique minimizer of \[\begin{align} \widebar R_{{\mathrm{f}}}({\sf S})\mathrel{\vcenter{:}}={\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x, y))] \end{align}\] is \({\sf S_\star}={\mathrm{f}}'\Big(\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}\Big)\). Write \({\sf S}={\sf S_\star}+c{h}\). It can be verified that \(\widebar R_{{\mathrm{f}}}({\sf S_\star}+c{h})\) is strictly convex in \(c\). Thus \({\sf S_\star}\) is the unique minimizer of \(\widebar R_{{\mathrm{f}}}\) if \(\nabla_c\widebar R_{{\mathrm{f}}}({\sf S_\star}+c{h})|_{c=0} =0\) for all \({h}.\) This is true since \[\begin{align} \nabla_c\widebar R_{{\mathrm{f}}}({\sf S_\star}+c{h})|_{c=0} &= {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {h}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[\nabla{\mathrm{f}}^*({\sf S_\star}(x, y)){h}(x, y)] \\ &= {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {h}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}\Big[\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}{h}(x, y)\Big] =0, \end{align}\] where the second equality uses the property of convex conjugates that \(\nabla{\mathrm{f}}^*({\mathrm{f}}'(x))=x\). ◻
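As a quick illustration of Lemma 2, take \({\mathrm{f}}(x)=(x-1)^2/2\), so that \({\mathrm{f}}'(x)=x-1\), \(({\mathrm{f}}')^{-1}(u)=u+1\), and \({\mathrm{f}}^*(u)=u^2/2+u\). Claim (1) then reads \[\begin{align} {\mathbb{E}}_{{\mathbb{P}}(y)}[{\sf S}(x,y)-{\sf S}_x(x)+1]=1 \quad\Longleftrightarrow\quad {\sf S}_x(x)={\mathbb{E}}_{{\mathbb{P}}(y)}[{\sf S}(x,y)], \end{align}\] and substituting this \({\sf S}_x\) back into the definition of \(R_{{\mathrm{f}}}\) gives \[\begin{align} R_{{\mathrm{f}}}({\sf S})= {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[({\sf S}(x, y) - {\mathbb{E}}_{{\mathbb{P}}(y)}[{\sf S}(x, y)])^2/2 + {\sf S}(x, y) ], \end{align}\] matching the \({\chi^2}\)-contrastive loss in Example 2 below; claim (2) gives \({\sf S_\star}(x,y)=\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}-1\).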
Lemma 3 (A general bound on \({\mathsf D}_{\rm TV}({\mathbb{P}}(y|x)||{\mathbb{P}}(y|T(x)))\) based on sufficiency.). For \({\mathrm{f}}\) in Definition 1 that is twice continuously differentiable, and for any statistic \(T\), we have \[\begin{align} {\mathbb{E}}_{{\mathbb{P}}(x)}[{\mathsf D}_{\rm TV}({\mathbb{P}}(y|x)||{\mathbb{P}}(y|T(x)))] \leq c_2\cdot {\sqrt{{\rm Suff}_{{\rm cb},{\mathrm{f}}}(T)}},\label{eq:tv95upper95bound95gen} \end{align}\qquad{(8)}\] where \(c_2\mathrel{\vcenter{:}}=\Big(2\inf_{(x,y)\in{\sf supp}(x,y)}{\mathrm{f}}^{''}\Big(\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}\Big)\Big)^{-1/2}\), and \({\sf supp}(x,y)\) denotes the support of \({\mathbb{P}}(x)\times{\mathbb{P}}(y)\). Notably, when \({\mathrm{f}}(x)=(x-1)^2/2\) (\({\chi^2}\)-divergence), we have \(c_2=1/\sqrt{2}\).
Proof of Lemma 3.. Using the CBS form of sufficiency, we find that \[\begin{align} {\rm Suff}(T) &={\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ B_{{\mathrm{f}}}\Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}, \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)} \Big) \Big]\\ &\geq\frac{1}{2}{\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[{\mathrm{f}}^{''}\Big(\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}\Big)\cdot \Big[\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}- \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)}\Big]^2\Big]\\ &\geq\frac{1}{2}\inf_{(x,y)\in{\sf supp}(x,y)}{\mathrm{f}}^{''}\Big(\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}\Big)\cdot {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[\Big[\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}- \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)}\Big]^2\Big], \end{align}\] where the first inequality follows from the definition of Bregman divergence and the fact that the range of \({\mathbb{P}}(y|T(x))\) belongs to the range of \({\mathbb{P}}(y|x)\). Moreover, by Jensen’s inequality, we have \[\begin{align} \Big({\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[\Big[\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}- \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)}\Big]^2\Big]\Big)^{1/2} &\geq {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[\Big|\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}- \frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)}\Big|\Big]\\ &=2 {\mathbb{E}}_{{\mathbb{P}}(x)}[{\mathsf D}_{\rm TV}({\mathbb{P}}(y|x)||{\mathbb{P}}(y|T(x)))] . \end{align}\]Putting pieces together yields Lemma 3. ◻
Example 1 (KL-sufficiency). Take \({\mathrm{f}}(x) = x \log x\) (KL-divergence), then we have \[\begin{align} {\rm Suff}_{{\rm cb},{\mathrm{f}}}(T) &= {\mathbb{E}}_{{\mathbb{P}}(x)}\Big[ {\mathsf D}_{\rm KL}\Big( {\mathbb{P}}(y | x) ||{\mathbb{P}}(y|T(x)) \Big) \Big], ~~~\text{and}\\ R_{\mathrm{f}}({\sf S}) &= {\mathbb{E}}_{{\mathbb{P}}(x,y)}[-{\sf S}(x,y)]+{\mathbb{E}}_{{\mathbb{P}}(x)}[\log {\mathbb{E}}_{{\mathbb{P}}(y)}[\exp({\sf S}(x,y))]]. \end{align}\] It can be verified that the InfoNCE loss in Eq. 1 is an asymptotically unbiased estimate of \(R_{\mathrm{f}}({\sf S})\) as the batch size \(K\to\infty\) (see Eq. 3 ). Moreover, by Pinsker’s inequality \[\begin{align} {\mathbb{E}}_{{\mathbb{P}}(x)}[{\mathsf D}_{\rm TV}({\mathbb{P}}(y|x)||{\mathbb{P}}(y|T(x)))] \leq \frac{1}{\sqrt{2}}\cdot {\sqrt{{\rm Suff}_{{\rm cb},{\sf kl}}(T)}}. \end{align}\]
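The form of \(R_{\mathrm{f}}\) displayed above can be derived from Lemma 2: for \({\mathrm{f}}(x)=x\log x\) we have \({\mathrm{f}}'(x)=1+\log x\), \(({\mathrm{f}}')^{-1}(u)=e^{u-1}\), and \({\mathrm{f}}^*(u)=e^{u-1}\), so claim (1) yields \[\begin{align} {\mathbb{E}}_{{\mathbb{P}}(y)}[\exp({\sf S}(x,y)-{\sf S}_x(x)-1)]=1 \quad\Longleftrightarrow\quad {\sf S}_x(x)=\log{\mathbb{E}}_{{\mathbb{P}}(y)}[\exp({\sf S}(x,y))]-1, \end{align}\] and plugging this back in gives \({\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}-{\sf S}_x)+{\sf S}_x]=1+{\mathbb{E}}_{{\mathbb{P}}(x)}[\log{\mathbb{E}}_{{\mathbb{P}}(y)}[\exp({\sf S}(x,y))]]-1\), recovering the expression for \(R_{\mathrm{f}}({\sf S})\) above.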
Example 2 (Chi-sufficiency). Take \({\mathrm{f}}(x) = (x - 1)^2/2\) (\({\chi^2}\)-divergence), then we have \[\begin{align} {\rm Suff}_{{\rm cb}, {\mathrm{f}}}(T) &= {\mathbb{E}}_{{\mathbb{P}}(x) \times {\mathbb{P}}(y)}\Big[ \frac{1}{2}\Big( \frac{{\mathbb{P}}(y | x) }{ {\mathbb{P}}(y)} - \frac{{\mathbb{P}}(y | T(x)) }{ {\mathbb{P}}(y )}\Big)^2 \Big], \\ R_{{\mathrm{f}}}({\sf S}) &= {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[({\sf S}(x, y) - {\mathbb{E}}_{{\mathbb{P}}(y)}[{\sf S}(x, y)])^2/2 + {\sf S}(x, y) ]. \end{align}\] Lemma 3 gives \[{\mathbb{E}}_{{\mathbb{P}}(x)}[{\mathsf D}_{\rm TV}({\mathbb{P}}(y | x)|| {\mathbb{P}}(y | T(x)))] \le \frac{1}{\sqrt{2}}\sqrt{ {\rm Suff}_{{\rm cb}, {\chi^2}}(T)}.\] Also, we can bound the \({\chi^2}\)-divergence by the sufficiency: \[{\mathbb{E}}_{{\mathbb{P}}(x)}\chi^2({\mathbb{P}}(y | x)||{\mathbb{P}}(y | T(x))) \le {\rm Suff}_{{\rm cb}, f}(T) \cdot \Big[2\sup_{(x,y)\in{\sf supp}(x,y)} \frac{{\mathbb{P}}(T(x)){\mathbb{P}}(y)}{{\mathbb{P}}(T(x),y)} \Big].\]
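The last display can be verified by rewriting the \({\chi^2}\)-divergence relative to the base measure \({\mathbb{P}}(y)\): \[\begin{align} {\mathbb{E}}_{{\mathbb{P}}(x)}[\chi^2({\mathbb{P}}(y | x)||{\mathbb{P}}(y | T(x)))] &= {\mathbb{E}}_{{\mathbb{P}}(x)\times{\mathbb{P}}(y)}\Big[\Big(\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}-\frac{{\mathbb{P}}(y | T(x))}{{\mathbb{P}}(y)}\Big)^2\cdot\frac{{\mathbb{P}}(y)}{{\mathbb{P}}(y | T(x))}\Big]\\ &\leq 2\,{\rm Suff}_{{\rm cb},{\chi^2}}(T)\cdot\sup_{(x,y)\in{\sf supp}(x,y)}\frac{{\mathbb{P}}(T(x)){\mathbb{P}}(y)}{{\mathbb{P}}(T(x),y)}, \end{align}\] where we used \({\mathbb{P}}(y)/{\mathbb{P}}(y|T(x))={\mathbb{P}}(T(x)){\mathbb{P}}(y)/{\mathbb{P}}(T(x),y)\) and the expression for \({\rm Suff}_{{\rm cb},{\chi^2}}(T)\) above.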
Example 3 (Squared Hellinger-sufficiency). Take \({\mathrm{f}}(x) = 1 - \sqrt{x}\), then we have \({\mathrm{f}}^*(x) = -1 - \frac{1}{4x}\) for \(x < 0\), and \[\begin{align} {\rm Suff}_{{\rm cb},{\mathrm{f}}}(T) &= {\mathbb{E}}_{{\mathbb{P}}(x)}\big[H^2({\mathbb{P}}(y)|| {\mathbb{P}}(y|x)) - H^2({\mathbb{P}}(y)|| {\mathbb{P}}(y|T(x)))\big], \end{align}\] where \(H^2(p||q) \mathrel{\vcenter{:}}=\int (\sqrt{p(x)} - \sqrt{q(x)})^2 \, dx / 2\) is the squared Hellinger distance. Similarly, the squared Hellinger distance between \({\mathbb{P}}(y|x), {\mathbb{P}}(y|T(x)\) can be bounded by the sufficiency of \(T\): \[\begin{align} &\quad {\mathbb{E}}_{{\mathbb{P}}(x)}\big[H^2({\mathbb{P}}(y|x)||{\mathbb{P}}(y|T(x)))\big] = \frac{1}{2} \, {\mathbb{E}}_{{\mathbb{P}}(x)}\bigg[\sum_y \big(\sqrt{{\mathbb{P}}(y|x)} - \sqrt{{\mathbb{P}}(y|T(x))}\big)^2\bigg] \\ &\leq \bigg[\sup_{(x,y)\in{\sf supp}(x,y)} \sqrt{\frac{{\mathbb{P}}(T(x), y)}{{\mathbb{P}}(T(x)) {\mathbb{P}}(y)}} \bigg] \cdot {\mathbb{E}}_{{\mathbb{P}}(x)}\bigg[\sum_y \sqrt{{\mathbb{P}}(y)} \, \frac{\big(\sqrt{{\mathbb{P}}(y|T(x))} - \sqrt{{\mathbb{P}}(y|x)}\big)^2}{2\sqrt{{\mathbb{P}}(y|T(x))}}\bigg] \\ &= \bigg[\sup_{(x,y)\in{\sf supp}(x,y)} \sqrt{\frac{{\mathbb{P}}(T(x), y)}{{\mathbb{P}}(T(x)) {\mathbb{P}}(y)}} \bigg] \cdot {\rm Suff}_{{\rm cb},{\mathrm{f}}}(T), \end{align}\] where the last equality follows from \[\begin{align} &\quad {\mathbb{E}}\big[\sqrt{{\mathbb{P}}(y|T(x))} - \sqrt{{\mathbb{P}}(y|x)} \, \big| \, y, T(x)\big] \\ &= {\mathbb{E}}\bigg[\big(\sqrt{{\mathbb{P}}(y|T(x))} - \sqrt{{\mathbb{P}}(y|x)}\big) \cdot \frac{\sqrt{{\mathbb{P}}(y|T(x))} - \sqrt{{\mathbb{P}}(y|x)}}{2\sqrt{{\mathbb{P}}(y|T(x))}} \, \bigg| \, y, T(x)\bigg] + {\mathbb{E}}\bigg[\frac{{\mathbb{P}}(y|T(x)) - {\mathbb{P}}(y|x)}{2\sqrt{{\mathbb{P}}(y|T(x))}} \, \bigg| \, y, T(x)\bigg] \\ &= {\mathbb{E}}\bigg[\frac{\big(\sqrt{{\mathbb{P}}(y|T(x))} - \sqrt{{\mathbb{P}}(y|x)}\big)^2}{2\sqrt{{\mathbb{P}}(y|T(x))}} \, \bigg| \, y, T(x)\bigg]. \end{align}\]
The definition of approximate sufficiency can be extended to score functions \({\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\mapsto{\mathbb{R}}\) that measure the similarity between \(X\) and \(Y\).
Definition 2 (Approximate sufficient score functions). Let \({\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\mapsto{\mathbb{R}}\) be a similarity score function. It induces a conditional density \({\mathbb{P}}_{\sf S}\) on \({\mathcal{X}}\times{\mathcal{Y}}\) w.r.t. the base measure \({\boldsymbol{\mu}}\) via \[\begin{align} {\mathbb{P}}_{\sf S}(y|x)={\mathbb{P}}(y)({\mathrm{f}}')^{-1}({\widebar {\sf S}}(x,y)), \end{align}\] where \({\widebar {\sf S}}(x,y)={\sf S}(x,y)-{\sf S}_x(x)\) with \({\sf S}_x\) chosen such that \({\mathbb{E}}_{{\mathbb{P}}(y)}[({\mathrm{f}}')^{-1}({\widebar {\sf S}}(x,y))]=1\) for all \(x.\) We define the sufficiency of \({\sf S}\) in two equivalent forms:
Variational Form Sufficiency (VFS): The variational form sufficiency of \({\sf S}\) is given by \[\begin{align} {\rm Suff}_{{\rm vf}, {\mathrm{f}}}({\sf S}) = R_{{\mathrm{f}}}({\sf S}) - \inf_{{\widetilde{\sf S}}: {\mathcal{X}}\times{\mathcal{Y}}\mapsto{\mathbb{R}}} R_{{\mathrm{f}}}({\widetilde{\sf S}}), \end{align}\] and the \({\mathrm{f}}\)-contrastive loss \[\begin{align} R_{{\mathrm{f}}}({\sf S}) \mathrel{\vcenter{:}}={\mathbb{E}}_{{\mathbb{P}}(x, y)}[ - {\sf S}(x, y)] + \inf_{{\sf S}_x:{\mathcal{X}}\mapsto{\mathbb{R}}} {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\sf S}(x, y) - {\sf S}_x(x)) + {\sf S}_x(x)],\label{eq:f95contrastive95loss95score} \end{align}\qquad{(9)}\] where \({\mathrm{f}}^*\) is the Fenchel-dual of \({\mathrm{f}}\).
Conditional Bregman Sufficiency (CBS): The conditional Bregman sufficiency of \({\sf S}\) is defined as \[{\rm Suff}_{{\rm cb}, {\mathrm{f}}}({\sf S}) = {\mathbb{E}}_{{\mathbb{P}}(x)\times {\mathbb{P}}(y)}\Big[ B_{{\mathrm{f}}}\Big( \frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}, \frac{{\mathbb{P}}_{\sf S}(y | x)}{{\mathbb{P}}(y)} \Big) \Big],\] where \(B_{{\mathrm{f}}}(a, b) \mathrel{\vcenter{:}}={\mathrm{f}}(a) - {\mathrm{f}}(b) - (a-b){\mathrm{f}}'(b)\) is the Bregman divergence of \({\mathrm{f}}\).
Note that the excess risk of the contrastive loss equals the sufficiency of \({\sf S}\) under our definition. Similar to Definition 1, we have
Lemma 4 (Equivalence of two forms of score sufficiency). For any similarity score \({\sf S}:{\mathcal{X}}\times{\mathcal{Y}}\mapsto{\mathbb{R}}\), the two forms of sufficiency in Definition 2 are equivalent, i.e., \[\begin{align} {\rm Suff}_{{\rm vf}, {\mathrm{f}}}({\sf S})= {\rm Suff}_{{\rm cb}, {\mathrm{f}}}({\sf S})\eqqcolon{\rm Suff}_{{\mathrm{f}}}({\sf S}). \end{align}\]
Proof of Lemma 4. \(({\boldsymbol{V}FS})\) \(\Leftrightarrow\) \(({\boldsymbol{C}BS})\). Let \({\sf S_\star}(x,y)={\mathrm{f}}'(\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)})\). We have by Lemma 2 that \({\sf S_\star}\in\mathrm{argmin}_{{\widetilde{\sf S}}}R_{{\mathrm{f}}}({\widetilde{\sf S}}).\) By the definition of the \(({\boldsymbol{V}FS})\), we have \[\begin{align} {\rm Suff}_{{\rm vf},{\mathrm{f}}}({\sf S}) &=R_{{\mathrm{f}}}({\sf S}) -R_{{\mathrm{f}}}({\sf S_\star})\\ &= {\mathbb{E}}_{{\mathbb{P}}(x, y)}[{\sf S_\star}- {\widebar {\sf S}}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\widebar {\sf S}}(x, y)) - {\mathrm{f}}^*({\sf S_\star}(x, y)) ]\\ &\overset{(i)}{=} {\mathbb{E}}_{{\mathbb{P}}(x, y)}[{\sf S_\star}- {\widebar {\sf S}}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}\Big[{\mathrm{f}}\Bigl(\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}\Bigr) -\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}{\sf S_\star}(x, y)\Big]\\ &= - {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ {\widebar {\sf S}}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}[{\mathrm{f}}^*({\widebar {\sf S}}(x, y))] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}\Big[{\mathrm{f}}\Bigl(\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}\Bigr)\Big]\\ &\overset{(ii)}{=} - {\mathbb{E}}_{{\mathbb{P}}(x, y)}[ {\widebar {\sf S}}(x, y)] + {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}\Big[{\mathrm{f}}\Bigl(\frac{{\mathbb{P}}(x,y)}{{\mathbb{P}}(x){\mathbb{P}}(y)}\Bigr) +\frac{{\mathbb{P}}_{\sf S}(y|x)}{{\mathbb{P}}(y)}{\widebar {\sf S}}(x,y) -{\mathrm{f}}\Bigl(\frac{{\mathbb{P}}_{\sf S}(y|x)}{{\mathbb{P}}(y)}\Bigr)\Big] \\ &{=} {\mathbb{E}}_{{\mathbb{P}}(x){\mathbb{P}}(y)}\Big[{\mathrm{f}}\Bigl(\frac{{\mathbb{P}}(y|x)}{{\mathbb{P}}(y)}\Bigr) -{\mathrm{f}}\Bigl(\frac{{\mathbb{P}}_{\sf S}(y|x)}{{\mathbb{P}}(y)}\Bigr)\Big] -{\mathbb{E}}_{{\mathbb{P}}(x)\times{\mathbb{P}}(y)}\Big[{\widebar {\sf S}}(x,y)\Big(\frac{{\mathbb{P}}(y | x)}{{\mathbb{P}}(y)}-\frac{{\mathbb{P}}_{\sf S}(y | x)}{{\mathbb{P}}(y)}\Big)\Big], \end{align}\] where steps (i) and (ii) use \({\mathrm{f}}(({\mathrm{f}}')^{-1}(z))+{\mathrm{f}}^*(z)=z({\mathrm{f}}')^{-1}(z)\) with \(z={\sf S_\star}(x,y)\) and \({\widebar {\sf S}}(x,y)\), respectively. Since \({\widebar {\sf S}}(x,y)={\mathrm{f}}'(\frac{{\mathbb{P}}_{\sf S}(y|x)}{{\mathbb{P}}(y)})\), it follows immediately that \({\rm Suff}_{{\rm vf},{\mathrm{f}}}({\sf S})= {\rm Suff}_{{\rm cb},{\mathrm{f}}}({\sf S})\). ◻
Example 4. Take \({\mathrm{f}}(x) = x \log x\) (KL-divergence). Then \({\sf S_\star}(x, y) = \log\big({\mathbb{P}}(x, y) / [{\mathbb{P}}(x) {\mathbb{P}}(y)]\big)\), \(B_{{\mathrm{f}}}(a, b) = a \log (a/b) - (a - b)\), and \({\mathbb{P}}_{{\sf S}}(y|x) = {\mathbb{P}}(y) \exp({\sf S}(x, y)) / {\mathbb{E}}_{{\mathbb{P}}(y)}[\exp({\sf S}(x, y))]\). Also, we have \[\begin{align} {\rm Suff}_{\sf kl}({\sf S})= R_{{\mathrm{f}}}({\sf S}) - R_{{\mathrm{f}}}({\sf S_\star}) &= \int {\mathbb{P}}( y|x) \log\bigg(\frac{{\mathbb{P}}(y|x)}{{\mathbb{P}}_{{\sf S}}(y|x)}\bigg) -\big({\mathbb{P}}(y|x) - {\mathbb{P}}_{{\sf S}}(y|x)\big) \, {\mathbb{P}}(x) dy \, dx \\ &= {\mathbb{E}}_{x \sim {\mathbb{P}}(x)}\big[{\mathsf D}_{\rm KL}({\mathbb{P}}(y|x) \, \| \, {\mathbb{P}}_{{\sf S}}(y|x))\big]. \end{align}\]
Example 5. Take \({\mathrm{f}}(x) = (x - 1)^2/2\) (\(\chi^2\)-divergence). Then \({\sf S_\star}(x, y) = {\mathbb{P}}(x, y) / [{\mathbb{P}}(x) {\mathbb{P}}(y)] - 1\), \(B_{{\mathrm{f}}}(a, b) = (a - b)^2/2\), and \({\mathbb{P}}_{{\sf S}}(y|x) = {\mathbb{P}}(y) \big({\sf S}(x, y) - {\mathbb{E}}_{y}[{\sf S}(x, y)] + 1\big)\). Moreover, \[\begin{align} {\rm Suff}_{{\chi^2}}({\sf S})= R_{{\mathrm{f}}}({\sf S}) - R_{{\mathrm{f}}}({\sf S_\star}) &= \frac{1}{2}{\mathbb{E}}_{{\mathbb{P}}(x) \times {\mathbb{P}}(y)}\bigg[\frac{\big({\mathbb{P}}(y|x) - {\mathbb{P}}_{{\sf S}}(y|x)\big)^2}{{\mathbb{P}}(y)^2}\bigg] \\ &= \frac{1}{2}{\mathbb{E}}_{{\mathbb{P}}(x)} \sum_y \bigg[\frac{\big({\mathbb{P}}(y|x) - {\mathbb{P}}_{{\sf S}}(y|x)\big)^2}{{\mathbb{P}}(y|x)} \cdot \frac{{\mathbb{P}}(y|x)}{{\mathbb{P}}(y)}\bigg] \\ &\geq \inf_{(x, y)\in{\sf supp}(x,y)} \frac{{\mathbb{P}}(x, y)}{{\mathbb{P}}(x) {\mathbb{P}}(y)} \cdot {\mathbb{E}}_{{\mathbb{P}}(x)}\big[\chi^2({\mathbb{P}}(y|x)|| {\mathbb{P}}_{{\sf S}}(y|x))\big]. \end{align}\]
As given in Example 1 (which can be established using Lemma 2), the KL-contrastive loss has the form \[\begin{align} R_{\sf kl}({\sf S})= {\mathbb{E}}_{({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}[-{\sf S}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})]+{\mathbb{E}}_{{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}}[\log {\mathbb{E}}_{{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}}[\exp({\sf S}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}))]]. \end{align}\] Recall the SimCLR loss \(\overline{\sf R}_{{\sf simclr},K}({\sf S})\) in Eq. 1 . We then have \[\begin{align} &\quad \lim_{K\to\infty}\overline{\sf R}_{{\sf simclr},K}({\sf S})-\log K \\ &= \frac{1}{2} \lim_{K\to\infty}{\mathbb{E}}\Big[ - \log \frac{\exp({\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_1) ) }{\sum_{j \in [K]} \exp({\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_j))/K} \Big] + \frac{1}{2}\lim_{K\to\infty}{\mathbb{E}}\Big[ - \log \frac{\exp({\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_1) ) }{\sum_{j \in [K]} \exp({\sf S}({\boldsymbol{z}}^{(1)}_j, {\boldsymbol{z}}^{(2)}_1))/K} \Big]\\ &= \lim_{K\to\infty}{\mathbb{E}}\Big[ \log {\sum_{j \in [K]} \exp({\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_j))/K} \Big]-{\mathbb{E}}[{{\sf S}({\boldsymbol{z}}^{(1)}_1, {\boldsymbol{z}}^{(2)}_1) }]=R_{\sf kl}({\sf S}), \end{align}\] where the second equality follows from the symmetry of \({\sf S}\) in its arguments and the last equality uses the law of large numbers (note that \({\boldsymbol{z}}^{(1)}_1\) is independent of \({\boldsymbol{z}}^{(2)}_j\) for \(j\neq1\)) and the bounded convergence theorem.
We begin the proof by stating the following proposition that connects the excess risk with sufficiency.
Proposition 7 (Near-minimizers of SimCLR as near-sufficient statistics; Proposition 1 in [1]). Suppose Assumption 1 holds and \({\sf S_\star}\) is a global minimizer of \(\overline{\sf R}_{{\sf simclr}, K}({\sf S})\) as defined in Section 4.1. Then, there exists a constant \({C}>0\), which depends polynomially on \(B_{\sf S}\), such that for any function \(f\in {\mathcal{F}}\), its sufficiency can be bounded by its SimCLR excess risk. Namely, for any \(K\geq 2\), we have \[\begin{align} {\rm Suff}({ f})\leq\lim_{K' \to \infty}\Big[ \overline{\sf R}_{{\sf simclr}, K'}( {\sf S}_{ f}) - \overline{\sf R}_{{\sf simclr}, K'}({\sf S_\star}{})\Big] \leq \underbrace{\Big[ \overline{\sf R}_{{\sf simclr}, K}( {\sf S}_{ f}) - \overline{\sf R}_{{\sf simclr}, K}({\sf S_\star}{}) \Big]}_{\mathrm{SimCLR excess risk}}\cdot\Big(1 +\frac{{C}}{K}\Big). \end{align}\]
A similar version of this result has been established for contrastive language-image pretraining (CLIP) in Proposition 1 in [1]. The proof of Proposition 7 follows immediately from the proof of Proposition 1 in [1], as the SimCLR setup can be viewed as a special case of CLIP in which the text and image follow a symmetric distribution conditioned on their shared information.
Adopt the shorthand notation \(\overline{\sf R}_{ K}\) for \(\overline{\sf R}_{{\sf simclr}, K}\). With Proposition 7 at hand, we obtain the following decomposition for some \({C}>0\) polynomially dependent on \(B_{\sf S}\) \[\begin{align} {\rm Suff}( \widehat f) &\leq {\Big[ \overline{\sf R}_{ K}( {\sf S}_{ \widehat f}) - \overline{\sf R}_{ K}({\sf S_\star}) \Big]}\cdot\Big(1 +\frac{{C}}{K}\Big) \\ &= {\Big[ [\overline{\sf R}_{ K}( {\sf S}_{ \widehat f}) - \inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{ K}( {\sf S}_{ f}) ] + [\inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{ K}( {\sf S}_{ f}) - \overline{\sf R}_{ K}({\sf S_\star}) ] \Big]}\cdot\Big(1 +\frac{{C}}{K}\Big) \\ &\leq \underbrace{\Big[ \overline{\sf R}_{ K}( {\sf S}_{ \widehat f}) -\inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{ K}( {\sf S}_{ f})\Big]}_{\mathrm{generalization error}}\cdot\Big(1 +\frac{{C}}{K}\Big) + \underbrace{\Big[\inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{ K}( {\sf S}_{ f})- \overline{\sf R}_{ K}({\sf S_\star}) \Big]}_{\mathrm{approximation error}}\cdot\Big(1 +\frac{{C}}{K}\Big). \end{align}\] Therefore, it remains to prove the following bound.
Recall the definition of \({\widehat {\sf R}}_{{\sf simclr},K}\) in Eq. 2 and adopt the shorthand \({\widehat {\sf R}}_{K}\) for \({\widehat {\sf R}}_{{\sf simclr},K}\). Let \(B_{ f}\mathrel{\vcenter{:}}=\sqrt{B_\tau(\log B_{\sf S}+B_\tau)},~B\mathrel{\vcenter{:}}= c(B_{\sf S}^6+1) B_{ f}B_\tau\) for some absolute constant \(c>0\). It can be verified by Assumption 2 that \({\mathcal{F}}\) must satisfy \(\| f\|_{2,\infty}\leq B_{ f}\) for all \(f\in {\mathcal{F}}\) for Assumption 1 to hold. Define the zero-mean random process \(X_ f\mathrel{\vcenter{:}}={\widehat {\sf R}}_{K}( {\sf S}_{ f})-{\mathbb{E}}[{\widehat {\sf R}}_{K}( {\sf S}_{ f})],~~ f\in {\mathcal{F}}.\) We will show that \[\begin{align} {\mathbb{P}}\Big( \big|\sup_{ f\in {\mathcal{F}}}|X_{ f}| - {\mathbb{E}}[\sup_{ f\in {\mathcal{F}}}|X_{ f}|]\big|\geq t \Big) &\leq 2\exp\Big(-\frac{2nt^2}{9B_{\sf S}^4}\Big),~~\text{for all } t\geq0,~\text{and} \tag{11} \\ {\mathbb{E}}[\sup_{ f\in {\mathcal{F}}}|X_{ f}|] &\leq {\mathbb{E}}[|X_{ f_0}|] + {\mathbb{E}}[\sup_{ f, \widetilde{f}\in {\mathcal{F}}}|X_{ f}-X_{ \widetilde{f}}|]\notag\\ &\leq c\frac{B_{\sf S}^2}{\sqrt{n}} + 32 \frac{B}{\sqrt{n}}\cdot \int_{0}^{2 B_{ f}}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})}\, du \tag{12} \end{align}\] for any \(f_0\in {\mathcal{F}}\) and some absolute constant \(c>0\). Combining the two bounds and noting \[\begin{align} \overline{\sf R}_{ K}( {\sf S}_{ \widehat f}) -\inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{ K}( {\sf S}_{ f}) &\leq 2\sup_{ f\in {\mathcal{F}}}|{\widehat {\sf R}}_K( {\sf S}_{ f})-\overline{\sf R}_K( {\sf S}_{ f})|= 2\sup_{ f\in {\mathcal{F}}}|{\widehat {\sf R}}_K( {\sf S}_{ f})-{\mathbb{E}}[{\widehat {\sf R}}_K( {\sf S}_{ f})]| =2\sup_{ f\in {\mathcal{F}}}|X_ f| \end{align}\] yields claim (1).
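To see what Eq. 12 yields in a concrete (purely hypothetical) case, suppose \({\mathcal{F}}\) is parametrized by \(d\) real parameters and satisfies the covering bound \(\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})\leq d\log(1+6B_{ f}/u)\); such bounds are standard for bounded finite-dimensional classes, though the exact form here is only a placeholder. Then the Cauchy--Schwarz inequality gives \[\begin{align} \int_{0}^{2 B_{ f}}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})}\, du \leq \sqrt{2B_{ f}\int_0^{2B_{ f}} d\log(1+6B_{ f}/u)\, du} \leq 3B_{ f}\sqrt{d}, \end{align}\] so that Eq. 12 reads \({\mathbb{E}}[\sup_{ f\in {\mathcal{F}}}|X_{ f}|]\lesssim (B_{\sf S}^2+BB_{ f}\sqrt{d})/\sqrt{n}\).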
Proof of Eq. 11. Let \(\widebar{\boldsymbol{z}}_i=({\boldsymbol{z}}^{(1)}_i,{\boldsymbol{z}}^{(2)}_i)\). Then \(\{\widebar{\boldsymbol{z}}_i\}_{i=1}^n\) are i.i.d. pairs of augmented views. For any \(i\in[{n_1}],j\in[K]\), suppose \(\widebar{\boldsymbol{z}}_{(i-1)K+j}\) is
replaced by some alternative sample \(\widetilde{\boldsymbol{z}}_{(i-1)K+j}=(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+j},\widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+j})\) in the calculation of \({\widehat {\sf R}}_{K}( {\sf S}_{ f})\). Then we have \[\begin{align}
&\quad|X_ f(\widebar{\boldsymbol{z}}_1,\ldots,\widebar{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)
-X_ f(\widebar{\boldsymbol{z}}_1,\ldots,\widetilde{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)|\notag\\
&= | {\widehat {\sf R}}_{K}( {\sf S}_{ f})(\widebar{\boldsymbol{z}}_1,\ldots,\widebar{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n) - {\widehat {\sf R}}_{K}( {\sf S}_{
f})(\widebar{\boldsymbol{z}}_1,\ldots,\widetilde{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n) | \leq U_1+U_2,\label{eq:bound95term952295generalization}
\end{align}\tag{13}\] where (assuming \(\widetilde{\boldsymbol{z}}_s=\widebar{\boldsymbol{z}}_s\) for \(s\in[n]\setminus\{(i-1)K+j\}\)) \[\begin{align}
U_1&\mathrel{\vcenter{:}}= \frac{1}{n}\Big|{ {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+j}, {\boldsymbol{z}}^{(2)}_{(i-1)K+j}) } - { {\sf S}_{ f}(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+j}, \widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+j}) } \Big|\leq
\frac{2\log B_{\sf S}}{n},
\end{align}\] and \[\begin{align} U_2 &\mathrel{\vcenter{:}}= \frac{1}{2n}\sum_{k=1}^K\Bigg| \Big[ \log \Big(\frac{1}{K}{\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+k},
{\boldsymbol{z}}^{(2)}_{(i-1)K+l}))}\Big) + \log{\Big(\frac{1}{K}\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+l}, {\boldsymbol{z}}^{(2)}_{(i-1)K+k}))}\Big) \Big]\notag \\ &~~~~\qquad - \Big[ \log \Big(\frac{1}{K}{\sum_{l \in [K]}
\exp( {\sf S}_{ f}(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+k}, \widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+l}))}\Big) + \log{\Big(\frac{1}{K}\sum_{l \in [K]} \exp( {\sf S}_{ f}(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+l},
\widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+k}))}\Big) \Big] \Bigg|\\ & \overset{(i)}{\leq} \frac{B_{\sf S}}{2n} \sum_{k=1}^K\Bigg| \frac{1}{K}\Big|{\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+k},
{\boldsymbol{z}}^{(2)}_{(i-1)K+l}))} - {\sum_{l \in [K]} \exp( {\sf S}_{ f}(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+k}, \widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+l}))}\Big|\\ &\qquad \qquad\qquad + \frac{1}{K} \Big|\sum_{l \in [K]} \exp( {\sf S}_{
f}({\boldsymbol{z}}^{(1)}_{(i-1)K+l}, {\boldsymbol{z}}^{(2)}_{(i-1)K+k})) - \sum_{l \in [K]} \exp( {\sf S}_{ f}(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+l}, \widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+k}))\Big| \Bigg|\\ &\leq \frac{B_{\sf S}}{nK}
\sum_{k=1}^K\sum_{l=1}^K \Big|\exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+k}, {\boldsymbol{z}}^{(2)}_{(i-1)K+l})) -\exp( {\sf S}_{ f}(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+k}, \widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+l}))\Big|\\
&\overset{(ii)}{\leq}\frac{2(B_{\sf S}^2-1)}{n}.
\end{align}\] Here, step (i) follows from the triangle inequality, a Taylor expansion of \(\log(x)\), and Assumption 1;
step (ii) follows from Assumption 1 and noting that \(\Big|\exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+k},
{\boldsymbol{z}}^{(2)}_{(i-1)K+l})) -\exp( {\sf S}_{ f}(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+k}, \widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+l}))\Big|\neq 0\) for at most \(2K\) terms with indices \(k,l\in[K]\).
Putting pieces together, we find \[\begin{align} &\quad| {\widehat {\sf R}}_{K}( {\sf S}_{ f})(\widebar{\boldsymbol{z}}_1,\ldots,\widebar{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n) - {\widehat {\sf
R}}_{K} ( {\sf S}_{ f})(\widebar{\boldsymbol{z}}_1,\ldots,\widetilde{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n) |\\ &\leq \frac{2\log B_{\sf S}+2B_{\sf S}^2-2}{n} \leq \frac{3B_{\sf S}^2}{n}
\end{align}\]for any \(\widetilde{\boldsymbol{z}}_{(i-1)K+j}\) and any \(i\in[{n_1}],j\in[K]\) and all \(f\in {\mathcal{F}}\). Therefore, Eq. 11 follows from Corollary 2.21 in [43] for functions with bounded differences.
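For completeness, the bounded-differences step is as follows: since replacing any one of the \(n\) i.i.d. pairs \(\widebar{\boldsymbol{z}}_i\) changes \(\sup_{ f\in {\mathcal{F}}}|X_{ f}|\) by at most \(3B_{\sf S}^2/n\), the cited inequality gives \[\begin{align} {\mathbb{P}}\Big( \big|\sup_{ f\in {\mathcal{F}}}|X_{ f}| - {\mathbb{E}}[\sup_{ f\in {\mathcal{F}}}|X_{ f}|]\big|\geq t \Big) \leq 2\exp\Big(-\frac{2t^2}{\sum_{i=1}^n(3B_{\sf S}^2/n)^2}\Big) = 2\exp\Big(-\frac{2nt^2}{9B_{\sf S}^4}\Big), \end{align}\] which is exactly Eq. 11.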
Proof of Eq. 12. First, we have \({\mathbb{E}}[|X_{ f_0}|]\leq cB_{\sf S}^2/\sqrt{n}\) by properties of sub-Gaussian variables and the fact that, for any \(f_0\in {\mathcal{F}}\), \(X_{ f_0}\) is zero-mean with bounded differences \(cB_{\sf S}^2/n\), as implied by the proof of Eq. 11 . By Dudley’s entropy integral bound (see Theorem 5.22 in [43]), it suffices to show \(\{X_{ f}, f\in {\mathcal{F}}\}\) is a zero-mean sub-Gaussian process with respect to the metric \(\rho_{X}( f, \widetilde{f})\mathrel{\vcenter{:}}= B\| f- \widetilde{f}\|_{2,\infty}/\sqrt{n}.\)
Let \(\| {\boldsymbol{x}}\|_{\psi} \mathrel{\vcenter{:}}=\inf \{ t > 0 : \mathbb{E} [ \psi({\boldsymbol{x}}/ t) ] \leq 1 \}\) denote the Orlicz norm for random variables and let \(\psi_2(u)=\exp(u^2)-1\). We have \[\begin{align} \| X_{ f}-X_{ \widetilde{f}}\|_{\psi_2}
=\| {\widehat {\sf R}}_{K}( {\sf S}_{ f})-{\widehat {\sf R}}_{K}( {\sf S}_{ \widetilde{f}})-{\mathbb{E}}[{\widehat {\sf R}}_{K}( {\sf S}_{ f})-{\widehat {\sf R}}_{K}( {\sf S}_{ \widetilde{f}})]\|_{\psi_2}
\leq
c (\| U_3-{\mathbb{E}}[U_3]\|_{\psi_2}+\| U_4-{\mathbb{E}}[U_4]\|_{\psi_2}) \label{eq:bound95term951295generalization}
\end{align}\tag{14}\] for some absolute constant \(c>0\) (we allow the value of \(c\) to vary from place to place), where \[\begin{align}
U_3&\mathrel{\vcenter{:}}= \frac{1}{n}\sum_{i = 1}^{n_1}\sum_{j=1}^K\Big[{ {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+j}, {\boldsymbol{z}}^{(2)}_{(i-1)K+j}) } - { {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)}_{(i-1)K+j},
{\boldsymbol{z}}^{(2)}_{(i-1)K+j}) } \Big],\\ U_4&\mathrel{\vcenter{:}}= \frac{1}{2n}\sum_{i = 1}^{n_1}\sum_{j=1}^K\Bigg[ \Big[ \log \Big(\frac{1}{K}{\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+j},
{\boldsymbol{z}}^{(2)}_{(i-1)K+l}))}\Big) + \log{\Big(\frac{1}{K}\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+l}, {\boldsymbol{z}}^{(2)}_{(i-1)K+j}))}\Big) \Big]\notag \\ &\qquad\qquad - \Big[ \log \Big(\frac{1}{K}{\sum_{l \in
[K]} \exp( {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)}_{(i-1)K+j}, {\boldsymbol{z}}^{(2)}_{(i-1)K+l}))}\Big) + \log{\Big(\frac{1}{K}\sum_{l \in [K]} \exp( {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)}_{(i-1)K+l},
{\boldsymbol{z}}^{(2)}_{(i-1)K+j}))}\Big) \Big] \Bigg].
\end{align}\] Notice that for any \({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}\in{\mathcal{X}}, f, \widetilde{f}\in {\mathcal{F}}\), by Assumption 2, we have \[\begin{align} | {\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})- {\sf S}_{
\widetilde{f}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})| &\leq B_\tau\cdot |\langle f({\boldsymbol{z}}^{(1)}) , \, f({\boldsymbol{z}}^{(2)}) \rangle-\langle \widetilde{f}({\boldsymbol{z}}^{(1)}) , \, \widetilde{f}({\boldsymbol{z}}^{(2)})
\rangle|\notag\\ &\leq B_\tau( \| f({\boldsymbol{z}}^{(2)})\|_{2}\cdot\| f- \widetilde{f}\|_{2,\infty} + \| \widetilde{f}({\boldsymbol{z}}^{(1)})\|_{2}\cdot\| f- \widetilde{f}\|_{2,\infty})\notag\\ & \overset{(i)}{ \leq} 2 B_{ f}B_\tau\| f-
\widetilde{f}\|_{2,\infty},\label{eq:bound95term95146595generalization}
\end{align}\tag{15}\] where step (i) uses \({\sf S}_{ f}({\boldsymbol{z}},{\boldsymbol{z}})=\| f({\boldsymbol{z}})\|_{2}^2\leq B_{ f}^2\) for \({\boldsymbol{z}}\in{\mathcal{X}}\). Since \(\widebar{\boldsymbol{z}}_i=({\boldsymbol{z}}^{(1)}_i,{\boldsymbol{z}}^{(2)}_i),i\in[n]\) are i.i.d., it follows immediately that \(U_3-{\mathbb{E}}[U_3]\) is \(2 B_{ f}B_\tau\| f- \widetilde{f}\|_{2,\infty}/\sqrt{n}\)-sub-Gaussian, i.e., \[\begin{align} \| U_3-{\mathbb{E}}[U_3]\|_{\psi_2}&\leq \frac{c B_{ f}B_\tau}{\sqrt n}\| f- \widetilde{f}\|_{2,\infty}. \label{eq:bound95term95195generalization}
\end{align}\tag{16}\] Recall the definition of \(\{\widebar{\boldsymbol{z}}_{s},\widetilde{\boldsymbol{z}}_{s}\}_{s=1}^n\) in the proof of Eq. 11 . To bound \(\| U_4\|_{\psi_2}\), we start with introducing the shorthands for any fixed indices \(i\in[{n_1}],j\in[K]\) \[\begin{align}
{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})&\mathrel{\vcenter{:}}= \frac{1}{K} {\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+k}, {\boldsymbol{z}}^{(2)}_{(i-1)K+l}))},~~
{\mathcal{V}}_{k}(\widebar{\boldsymbol{z}})\mathrel{\vcenter{:}}= \frac{1}{K} {\sum_{l \in [K]} \exp( {\sf S}_{ f}({\boldsymbol{z}}^{(1)}_{(i-1)K+l}, {\boldsymbol{z}}^{(2)}_{(i-1)K+k}))}, \\
\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})&\mathrel{\vcenter{:}}= \frac{1}{K} {\sum_{l \in [K]} \exp( {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)}_{(i-1)K+k}, {\boldsymbol{z}}^{(2)}_{(i-1)K+l}))},~~
\widetilde{\mathcal{V}}_{k}(\widebar{\boldsymbol{z}})\mathrel{\vcenter{:}}= \frac{1}{K} {\sum_{l \in [K]} \exp( {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)}_{(i-1)K+l}, {\boldsymbol{z}}^{(2)}_{(i-1)K+k}))}
\end{align}\]for all \(k\in[K]\). Similar to the proof of Eq. 11 , for any given index \((i-1)K+j\), we have \[\begin{align} &\quad |U_4(\widebar{\boldsymbol{z}}_1,\ldots,\widebar{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)
-U_4(\widebar{\boldsymbol{z}}_1,\ldots,\widetilde{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)|
\\
&
=\Big|\frac{1}{2n}\sum_{k=1}^K
\Big[\log\Big(\frac{{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}\Big)
+
\log\Big(\frac{{\mathcal{V}}_{k}(\widebar{\boldsymbol{z}})}{\widetilde{\mathcal{V}}_{k}(\widebar{\boldsymbol{z}})}\Big)
-
\log\Big(\frac{{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}\Big)
-
\log\Big(\frac{{\mathcal{V}}_{k}(\widetilde{\boldsymbol{z}})}{\widetilde{\mathcal{V}}_{k}(\widetilde{\boldsymbol{z}})}\Big)
\Big]\Big|\\
&\leq
\frac{B_{\sf S}^2}{2n}\sum_{k=1}^K
\Bigg[
\Big|
\frac{{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}
-
\frac{{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}\Big|
+
\Big|\frac{{\mathcal{V}}_{k}(\widebar{\boldsymbol{z}})}{\widetilde{\mathcal{V}}_{k}(\widebar{\boldsymbol{z}})}-
\frac{{\mathcal{V}}_{k}(\widetilde{\boldsymbol{z}})}{\widetilde{\mathcal{V}}_{k}(\widetilde{\boldsymbol{z}})}
\Big|\Bigg],
\end{align}\] where the last line follows from Assumption 1 and a Taylor expansion of \(\log(x)\). Moreover, \[\begin{align} \sum_{k=1}^K\Big|
\frac{{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}
-
\frac{{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}\Big|
&= \sum_{k=1}^K\Big|
\frac{{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})-\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})}
-
\frac{{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})-\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}{\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})}\Big|\\
&
\overset{(ii)}{\leq}
B_{\sf S}^2 \sum_{k=1}^K|({\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})-\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}}))\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})
-
({\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})-\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}}))\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})
| \\
&\leq
B_{\sf S}^2 \sum_{k=1}^K\big[|(({\mathcal{U}}_{k}-\widetilde{\mathcal{U}}_{k})(\widebar{\boldsymbol{z}})-({\mathcal{U}}_{k}-\widetilde{\mathcal{U}}_{k})(\widetilde{\boldsymbol{z}}))\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})|
+
|
({\mathcal{U}}_{k}-\widetilde{\mathcal{U}}_{k} )(\widetilde{\boldsymbol{z}})(\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})-\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}}))
| \big]\\
&\overset{(iii)}{\leq}
B_{\sf S}^3 \sum_{k=1}^K\big[|({\mathcal{U}}_{k}-\widetilde{\mathcal{U}}_{k})(\widebar{\boldsymbol{z}})-({\mathcal{U}}_{k}-\widetilde{\mathcal{U}}_{k})(\widetilde{\boldsymbol{z}})|
+
2 B_{ f}B_\tau\| f- \widetilde{f}\|_{2,\infty}|
\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})-\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})
| \big],
\end{align}\] where step (ii) uses Assumption 1, step (iii) uses Assumption 1, Eq. 15 and a Taylor expansion of \(\exp(x)\). Similar to the proof of Eq. 11 , by
counting the number of terms in the summations that are different and using Assumption 1, we find \[\begin{align} \sum_{k=1}^K|
\widetilde{\mathcal{U}}_{k}(\widetilde{\boldsymbol{z}})-\widetilde{\mathcal{U}}_{k}(\widebar{\boldsymbol{z}})
| &\leq
2B_{\sf S},~~\text{and}\\ \sum_{k=1}^K|({\mathcal{U}}_{k}-\widetilde{\mathcal{U}}_{k})(\widebar{\boldsymbol{z}})-({\mathcal{U}}_{k}-\widetilde{\mathcal{U}}_{k})(\widetilde{\boldsymbol{z}})| &\leq 4 B_{\sf S} B_{ f}B_\tau\| f-
\widetilde{f}\|_{2,\infty}.
\end{align}\] Similar results hold for \({\mathcal{V}}\) by symmetry. Putting pieces together, we obtain \[\begin{align}
|U_4(\widebar{\boldsymbol{z}}_1,\ldots,\widebar{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)
-U_4(\widebar{\boldsymbol{z}}_1,\ldots,\widetilde{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)|
\leq \frac{4B_{\sf S}^6 B_{ f}B_\tau}{{n}}\| f- \widetilde{f}\|_{2,\infty}.
\end{align}\] Therefore, it follows from Corollary 2.21 in [43] for functions with bounded differences that
\[\begin{align} \| U_4-{\mathbb{E}}[U_4]\|_{\psi_2}\leq \frac{cB_{\sf S}^6 B_{ f}B_\tau}{\sqrt{n}}\| f- \widetilde{f}\|_{2,\infty}. \label{eq:bound95term95295generalization}
\end{align}\tag{17}\] Substituting Eq. 16 and 17 into Eq. 14 , we obtain that \(\{X_{ f}, f\in {\mathcal{F}}\}\) is a zero-mean sub-Gaussian process with respect to the metric \(\rho_{X}( f, \widetilde{f})\mathrel{\vcenter{:}}= B\| f- \widetilde{f}\|_{2,\infty}/\sqrt{n}.\)
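Indeed, substituting Eq. 16 and 17 into Eq. 14 gives (with the value of \(c\) changing from line to line) \[\begin{align} \| X_{ f}-X_{ \widetilde{f}}\|_{\psi_2} \leq \frac{c(B_{\sf S}^6+1) B_{ f}B_\tau}{\sqrt{n}}\| f- \widetilde{f}\|_{2,\infty} = \rho_{X}( f, \widetilde{f}), \end{align}\] recalling the definition \(B\mathrel{\vcenter{:}}= c(B_{\sf S}^6+1) B_{ f}B_\tau\) from the beginning of the proof.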
This concludes the proof of Eq. 12 .
Write \({\boldsymbol{z}}=g({\boldsymbol{x}})\) with \(g\sim{\mathbb{P}}_{\mathcal{G}}\perp\!\!\!\perp{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}}\). Define \(h_{\min}\mathrel{\vcenter{:}}=\mathrm{argmin}_h{\mathbb{E}}_{{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}},g\sim{\mathbb{P}}_{\mathcal{G}}}[(h(g({\boldsymbol{x}}))-h_\star({\boldsymbol{x}}))^2]\) and \({\sf h}({\boldsymbol{u}})\mathrel{\vcenter{:}}={\mathbb{E}}[ h_{\min}({\boldsymbol{z}}^{(1)})| f({\boldsymbol{z}}^{(1)})={\boldsymbol{u}}]\). Note that \(| h_{\min}({\boldsymbol{z}}^{(1)})|=|{\mathbb{E}}[h_\star({\boldsymbol{x}})|{\boldsymbol{z}}^{(1)}]|\) is bounded by \(B_{h_\star}\) almost surely by the assumption in Theorem 4 .
We first show that \({\sf R_{\mathcal{G}}}({\sf h}\circ f)\) satisfies bound \eqref{eq:thm95general95downstream95bound} with \(\epsilon_{\mathcal{G}}\) replaced by \(\widetilde{\epsilon}_{\mathcal{G}}=\inf_h{\mathbb{E}}_{{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}},g\sim{\mathbb{P}}_{\mathcal{G}}}[(h(g({\boldsymbol{x}}))-h_\star({\boldsymbol{x}}))^2]\). The original bound \eqref{eq:thm95general95downstream95bound} follows immediately since \(\widetilde{\epsilon}_{\mathcal{G}}\leq\epsilon_{\mathcal{G}}\). Since \((a+b)^2\leq2a^2+2b^2\), we have \[\begin{align} {\sf R_{\mathcal{G}}}({\sf h}\circ f) = {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[({\sf h}( f({\boldsymbol{z}}^{(1)}))-h_\star({\boldsymbol{x}}))^2] \overset{(i)}{\leq} 2{\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[({\sf h}( f({\boldsymbol{z}}^{(1)}))- h_{\min}({\boldsymbol{z}}^{(2)}))^2]+ 2\widetilde{\epsilon}_{\mathcal{G}}. \tag{18} \end{align}\] Introduce a random variable \(\widetilde{\boldsymbol{z}}^{(1)}\) which follows the distribution of \({\boldsymbol{z}}^{(1)}\) conditioned on \(f({\boldsymbol{z}}^{(1)})\) and is independent of \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\) when conditioned on \(f({\boldsymbol{z}}^{(1)})\), i.e., \([\widetilde{\boldsymbol{z}}^{(1)}\sim {\mathbb{P}}_{{\boldsymbol{z}}}({\boldsymbol{z}}^{(1)}| f({\boldsymbol{z}}^{(1)}))\perp\!\!\!\perp({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})]| f({\boldsymbol{z}}^{(1)})\). Consider the joint distribution of the tuple \((\widetilde{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\). By Bayes' formula, we have \(\widetilde{\boldsymbol{z}}^{(1)}\overset{d}{=}{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}\) and \({\boldsymbol{z}}^{(2)}|\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}| f({\boldsymbol{z}}^{(1)}) = f(\widetilde{\boldsymbol{z}}^{(1)}))\), and therefore \[\begin{align} {\mathbb{E}}[({\sf h}( f({\boldsymbol{z}}^{(1)}))- h_{\min}({\boldsymbol{z}}^{(2)}))^2] &\overset{(i)}{\leq} {\mathbb{E}}[( h_{\min}(\widetilde{\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2] \notag\\ &= {\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}| f({\boldsymbol{z}}^{(1)})= f(\widetilde{\boldsymbol{z}}^{(1)}))}[( h_{\min}(\widetilde{\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2], \tag{19} \end{align}\] where step (i) follows from \[\begin{align} {\mathbb{E}}[({\sf h}( f({\boldsymbol{z}}^{(1)}))- h_{\min}({\boldsymbol{z}}^{(2)}))^2| f({\boldsymbol{z}}^{(1)})] {\leq} {\mathbb{E}}[( h_{\min}(\widetilde{\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2| f({\boldsymbol{z}}^{(1)})], \end{align}\] which uses Jensen's inequality, independence of \(\widetilde{\boldsymbol{z}}^{(1)}\) and \({\boldsymbol{z}}^{(2)}\) conditioned on \(f({\boldsymbol{z}}^{(1)})\), and the fact that \({\mathbb{E}}[ h_{\min}(\widetilde{\boldsymbol{z}}^{(1)})| f({\boldsymbol{z}}^{(1)})]={\sf h}( f({\boldsymbol{z}}^{(1)}))\).
Moreover, \[\begin{align} &\quad{\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}| f({\boldsymbol{z}}^{(1)})= f(\widetilde{\boldsymbol{z}}^{(1)}))}[( h_{\min}(\widetilde{\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2]\notag \\ &\overset{(ii)}{\leq} {\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}|{\boldsymbol{z}}^{(1)}=\widetilde{\boldsymbol{z}}^{(1)})}[( h_{\min}(\widetilde{\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2] \notag \\ &\qquad\quad+ \sqrt{2}B_{h_\star}^2\cdot{\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}}\Bigg[\sqrt{{\mathsf D}_{\rm KL}\Big( {\mathbb{P}}_{{\boldsymbol{z}}^{(2)}| {\boldsymbol{z}}^{(1)}}(\cdot | \widetilde{\boldsymbol{z}}^{(1)}) \Big|\Big| {\mathbb{P}}_{{\boldsymbol{z}}^{(2)}| {\boldsymbol{z}}^{(1)}}(\cdot | f(\widetilde{\boldsymbol{z}}^{(1)})) \Big)} \Bigg]\notag\\ &\overset{(iii)}{\leq} {\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}|{\boldsymbol{z}}^{(1)}=\widetilde{\boldsymbol{z}}^{(1)})}[( h_{\min}(\widetilde{\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2] + \sqrt{2}B_{h_\star}^2\cdot\sqrt{{\rm Suff}_{{\rm cb},{\sf kl}}( f)}\notag \\ &~= {\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[( h_{\min}({\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2] + \sqrt{2}B_{h_\star}^2\cdot\sqrt{{\rm Suff}_{{\rm cb},{\sf kl}}( f)}, \tag{20} \end{align}\] where step (ii) follows from the variational form of total variation distance and Pinsker's inequality, while step (iii) uses the (CBS) definition of \({\rm Suff}_{\sf kl}( f)\) in Definition~\ref{def:general95app95suff} and Jensen's inequality. Lastly, we have from a triangle inequality that \[\begin{align} &\quad~{\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[( h_{\min}({\boldsymbol{z}}^{(1)})- h_{\min}({\boldsymbol{z}}^{(2)}))^2]\notag\\ &\leq 2( {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{z}}^{(1)}}[( h_{\min}({\boldsymbol{z}}^{(1)})-h_\star({\boldsymbol{x}}))^2] + {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{z}}^{(2)}}[( h_{\min}({\boldsymbol{z}}^{(2)})-h_\star({\boldsymbol{x}}))^2] ) =4\widetilde{\epsilon}_{\mathcal{G}}. \tag{21} \end{align}\] Combining Eq.~\eqref{eq:general95downstream95pf0}--\eqref{eq:general95downstream95pf3} yields Eq.~\eqref{eq:thm95general95downstream95bound} in Theorem~\ref{thm:general95downstream95bound}. Eq.~\eqref{eq:thm95general95downstream95bound1} in Theorem~\ref{thm:general95downstream95bound} follows immediately by noting \[\begin{align} &\qquad {\sf R}({\sf h}\circ f) = {\mathbb{E}}[({\sf h}( f({\boldsymbol{x}}))-h_\star({\boldsymbol{x}}))^2]= {\mathbb{E}}_{{\boldsymbol{z}}^{(1)}}[({\sf h}( f({\boldsymbol{z}}^{(1)}))-h_\star({\boldsymbol{z}}^{(1)}))^2]\\ &\leq 2 {\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[({\sf h}( f({\boldsymbol{z}}^{(1)}))-h_\star({\boldsymbol{x}}))^2] + 2{\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[(h_\star({\boldsymbol{z}}^{(1)})-h_\star({\boldsymbol{x}}))^2]\\ &=2 {\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[({\sf h}( f({\boldsymbol{z}}^{(1)}))-h_\star({\boldsymbol{x}}))^2] + 2\epsilon_{\mathcal{G}} \end{align}\] and using Eq.~\eqref{eq:thm95general95downstream95bound}.
Comments on Theorem~\ref{thm:general95downstream95bound}. Following the same proof strategy, it can be verified that Eq.~\eqref{eq:thm95general95downstream95bound} and \eqref{eq:thm95general95downstream95bound1} also hold when choosing \({\sf h}({\boldsymbol{u}}) \mathrel{\vcenter{:}}={\mathbb{E}}[h_\star({\boldsymbol{z}}^{(1)})| f({\boldsymbol{z}}^{(1)})={\boldsymbol{u}}]\). The main difference in the proof is to replace \(h_{\min}\) by \(h_\star\) in Eq.~\eqref{eq:general95downstream95pf0}--\eqref{eq:general95downstream95pf3}. Moreover, although we consider the expected squared loss (i.e., \(\ell(x,y)=(x-y)^2\)) for simplicity, it can be seen from the proof that a similar version of Eq.~\eqref{eq:thm95general95downstream95bound} and \eqref{eq:thm95general95downstream95bound1} holds for general (expected) losses \(\ell(x,y)\) that satisfy (1) \(\ell(x,y)\) is nonnegative; (2) \(\ell(x,y)\) is symmetric in \((x,y)\) and convex in \(x-y\); and (3) \(\ell(x,z)\leq c(\ell(x,y)+\ell(y,z))\) for some absolute constant \(c>0\) and all \(x,y,z\in{\mathbb{R}}\). This includes the absolute loss, the Huber loss, losses induced by norms, etc.
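As a quick check of condition (3) for the losses mentioned above: the absolute loss satisfies it with \(c=1\) by the triangle inequality, and for the squared loss \[\begin{align} \ell(x,z)=(x-z)^2=\big((x-y)+(y-z)\big)^2\leq 2(x-y)^2+2(y-z)^2=2\big(\ell(x,y)+\ell(y,z)\big), \end{align}\] so condition (3) holds with \(c=2\); conditions (1) and (2) are immediate in both cases.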
For any densities \({\mathbb{P}},{\mathbb{Q}}\), define the \(\alpha\)-Rényi divergence \[\begin{align} {\mathsf D}_{\rm \alpha}({\mathbb{P}}||{\mathbb{Q}}) \mathrel{\vcenter{:}}= \frac{1}{\alpha-1}\log\Big({\mathbb{E}}_{x\sim{\mathbb{P}}}\Big[\Big(\frac{{\mathbb{P}}(x)}{{\mathbb{Q}}(x)}\Big)^{\alpha-1} \Big]\Big)\notag \end{align}\] for any \(\alpha>0\). Note that the Rényi divergence recovers the KL divergence in the limit \(\alpha\to1\). For any densities \({\mathbb{P}},{\mathbb{Q}},{\mathbb{T}}\), we have the following triangle-like inequality, which we will use repeatedly in the proof.
Lemma 5 (Triangle-like inequality for Rényi divergence (Lemma 26 in [44])). Let \({\mathbb{P}}\), \({\mathbb{Q}}\), and \({\mathbb{T}}\) be probability densities w.r.t. the same measure. Then \[{\mathsf D}_{\rm \alpha} ({\mathbb{P}}|| {\mathbb{Q}}) \le {\frac{k\alpha}{k\alpha - 1}} {\mathsf D}_{\rm \frac{k\alpha - 1}{k - 1}} ({\mathbb{P}}|| {\mathbb{T}}) + \mathrm{D}_{k\alpha} ({\mathbb{T}}|| {\mathbb{Q}})\] for all \(k, \alpha \in (1, \infty)\).
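For instance, the first application of Lemma 5 below takes \(k=4/3\) and \(\alpha\to1\), in which case \(k\alpha\to4/3\), \(\frac{k\alpha}{k\alpha-1}\to4\), and \(\frac{k\alpha-1}{k-1}\to1\), so the lemma specializes to \[\begin{align} {\mathsf D}_{\rm KL}({\mathbb{P}}||{\mathbb{Q}}) \leq 4\,{\mathsf D}_{\rm KL}({\mathbb{P}}||{\mathbb{T}}) + {\mathsf D}_{\rm 4/3}({\mathbb{T}}||{\mathbb{Q}}), \end{align}\] which is the form used in Eq. 22 with \({\mathbb{P}}={\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})\), \({\mathbb{T}}={\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})\), and \({\mathbb{Q}}={\mathbb{P}}({\boldsymbol{y}}| f({\boldsymbol{z}}^{(1)}))\).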
Write \({\boldsymbol{z}}=g({\boldsymbol{x}})\) with \(g\sim{\mathbb{P}}_{\mathcal{G}}\perp\!\!\!\perp{\boldsymbol{x}}\sim{\mathbb{P}}_{{\mathcal{X}}}\) and define \({\sf h}( f({\boldsymbol{z}}))\mathrel{\vcenter{:}}={\mathbb{P}}({\boldsymbol{y}}| f({\boldsymbol{z}}))\in\Delta([{\sf K}])\) as the conditional distribution of \({\boldsymbol{y}}\) given \(f({\boldsymbol{z}})\), where \({\boldsymbol{z}}=g({\boldsymbol{x}})\) for some random transformation \(g\sim{\mathbb{P}}_{\mathcal{G}}\). It can be verified that \({\sf h}=\mathrm{argmin}_{{\mathbb{Q}}:{\mathbb{R}}^{ p}\mapsto\Delta([{\sf K}])}{\mathsf D}_{\rm KL}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})||{\mathbb{Q}}({\boldsymbol{y}}| f({\boldsymbol{z}})))\). Therefore, using Lemma~\ref{lm:triangle95like95ineq} with \(k=4/3,\alpha=1\) (by taking the limit \(\alpha\to 1\)), we obtain \[\begin{align} {\sf R^{\sf cls}_{\mathcal{G}}}({\sf h}\circ f) &= {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}}^{(1)}}[{\mathsf D}_{\rm KL}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})||{\mathbb{P}}({\boldsymbol{y}}| f({\boldsymbol{z}}^{(1)})))]\notag \\ &\leq 4{\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm KL}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})||{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)}))] + {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}| f({\boldsymbol{z}}^{(1)})))]\notag\\ &\leq 4\epsilon^{\sf cls}_{\mathcal{G}} + {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}| f({\boldsymbol{z}}^{(1)})))] , \tag{22} \end{align}\] where the last inequality uses the monotonicity of the \(\alpha\)-Rényi divergence w.r.t. \(\alpha\). Similar to the proof of Theorem~\ref{thm:general95downstream95bound}, introduce a random variable \(\widetilde{\boldsymbol{z}}^{(1)}\) which follows the distribution of \({\boldsymbol{z}}^{(1)}\) conditioned on \(f({\boldsymbol{z}}^{(1)})\) and is independent of \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\) when conditioned on \(f({\boldsymbol{z}}^{(1)})\), i.e., \([\widetilde{\boldsymbol{z}}^{(1)}\sim {\mathbb{P}}_{{\boldsymbol{z}}}({\boldsymbol{z}}^{(1)}| f({\boldsymbol{z}}^{(1)}))\perp\!\!\!\perp({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})]| f({\boldsymbol{z}}^{(1)})\). Consider the joint distribution of the tuple \((\widetilde{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\).
By Bayes' formula, we have \(\widetilde{\boldsymbol{z}}^{(1)}\overset{d}{=}{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}\) and \({\boldsymbol{z}}^{(2)}|\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}| f({\boldsymbol{z}}^{(1)}) = f(\widetilde{\boldsymbol{z}}^{(1)}))\), and thus \[\begin{align} {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}| f({\boldsymbol{z}}^{(1)})))] &\overset{(i)}{\leq} {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{y}},{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)}))] \notag\\ &= {\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}| f({\boldsymbol{z}}^{(1)})= f(\widetilde{\boldsymbol{z}}^{(1)}))}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)}))], \tag{23} \end{align}\] where step (i) uses Jensen's inequality, the convexity of the Rényi divergence w.r.t. its second argument, and the fact that \({\mathbb{E}}[{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)})| f(\widetilde{\boldsymbol{z}}^{(1)})]={\mathbb{P}}({\boldsymbol{y}}| f({\boldsymbol{z}}^{(1)})= f(\widetilde{\boldsymbol{z}}^{(1)}))\). Moreover, \[\begin{align} &\quad{\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}| f({\boldsymbol{z}}^{(1)})= f(\widetilde{\boldsymbol{z}}^{(1)}))}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)}))]\notag \\ &\overset{(ii)}{\leq} {\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}|{\boldsymbol{z}}^{(1)}=\widetilde{\boldsymbol{z}}^{(1)})}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)}))] \notag \\ &\qquad\quad+ \sqrt{2} B\cdot{\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}}\Bigg[\sqrt{{\mathsf D}_{\rm KL}\Big( {\mathbb{P}}_{{\boldsymbol{z}}^{(2)}| {\boldsymbol{z}}^{(1)}}(\cdot | \widetilde{\boldsymbol{z}}^{(1)}) \Big|\Big| {\mathbb{P}}_{{\boldsymbol{z}}^{(2)}| {\boldsymbol{z}}^{(1)}}(\cdot | f(\widetilde{\boldsymbol{z}}^{(1)})) \Big)} \Bigg]\notag\\ &\overset{(iii)}{\leq} {\mathbb{E}}_{\widetilde{\boldsymbol{z}}^{(1)}\sim{\mathbb{P}}_{{\boldsymbol{z}}},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}({\boldsymbol{z}}^{(2)}|{\boldsymbol{z}}^{(1)}=\widetilde{\boldsymbol{z}}^{(1)})}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)}))] + \sqrt{2} B\cdot\sqrt{{\rm Suff}_{{\rm cb},{\sf kl}}( f)}\notag \\ &~= {\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(1)}))] + \sqrt{2} B\cdot\sqrt{{\rm Suff}_{{\rm cb},{\sf kl}}( f)}, \tag{24} \end{align}\] where step (ii) follows from the variational form of total variation distance, Pinsker's inequality, and the fact that
\[\begin{align} {\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)}))\leq {\mathsf D}_{\rm 2}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)})) = \log{\mathbb{E}}_{{\boldsymbol{y}}\sim{\mathbb{P}}(\cdot|{\boldsymbol{z}}^{(2)})}\Big[\frac{{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{y}}|\widetilde{\boldsymbol{z}}^{(1)})}\Big]\leq B, \end{align}\] and step (iii) uses the CBS definition of \({\rm Suff}_{{\sf kl}}( f)\) and Jensen's inequality. Finally, applying Lemma~\ref{lm:triangle95like95ineq} another time with \(\alpha=4/3\) and \(k=1.5\) yields \[\begin{align} &\quad~{\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm 4/3}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(1)}))] \notag\\ &\leq {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{z}}^{(1)}}[{\mathsf D}_{\rm 2}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(2)})||{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}}))] + {\mathbb{E}}_{{\boldsymbol{x}},{\boldsymbol{z}}^{(2)}}[{\mathsf D}_{\rm 2}({\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{x}})||{\mathbb{P}}({\boldsymbol{y}}|{\boldsymbol{z}}^{(1)}))] \leq \epsilon^{\sf cls}_{\mathcal{G}}. \tag{25} \end{align}\] Combining Eq.~\eqref{eq:general95downstream95pf095cls}--\eqref{eq:general95downstream95pf395cls} yields Theorem~\ref{thm:general95downstream95bound95cls}.
Let \({\mathrm{f}}(x)=(x-1)^2/2\). The proof largely follows the same arguments as the proof of Theorem 1, so we only provide a sketch here. First, it can be readily verified that the set of minimizers of \(R_{{\mathrm{f}}}({\sf S})\) is \[\begin{align} \mathcal{M}_{{\sf S}}\mathrel{\vcenter{:}}=\Big\{{\sf S}:{\sf S}={\sf S_\star}+{\rm const}\text{ ~for some } {\rm const}\in{\mathbb{R}}, ~~~{\sf S_\star}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\mathrel{\vcenter{:}}=\frac{{\mathbb{P}}( {\boldsymbol{z}}^{(1)}, {\boldsymbol{z}}^{(2)}) }{ {\mathbb{P}}({\boldsymbol{z}}^{(1)}) \cdot {\mathbb{P}}({\boldsymbol{z}}^{(2)})} \Big\}. \end{align}\] Moreover, basic algebra shows that \({\widehat {\sf R}}_{{\sf chisq},K}( {\sf S}_{ f})\) is an unbiased estimate of \(R_{\mathrm{f}}( {\sf S}_{ f})\). Thus, by the VFS in Definition 1, we have the decomposition \[\begin{align} {\rm Suff}_{{\chi^2}}( \widehat f) &\leq R_{\mathrm{f}}( {\sf S}_{ \widehat f})-R_{\mathrm{f}}({ {\sf S}^{}_{\star}}) \leq \underbrace{\Big[ R_{\mathrm{f}}( {\sf S}_{ \widehat f}) -\inf_{ f\in {\mathcal{F}}}R_{\mathrm{f}}( {\sf S}_{ f})\Big]}_{\mathrm{generalization error}} + \underbrace{\Big[\inf_{ f\in {\mathcal{F}}}R_{\mathrm{f}}( {\sf S}_{ f})- R_{\mathrm{f}}( {\sf S}^{}_{\star}) \Big]}_{\mathrm{approximation error}}. \end{align}\] Therefore, it remains to bound these two terms.
Recall the definition of \({\widehat {\sf R}}_{{\sf chisq},K}\) in Eq. 8 and adopt the shorthand \({\widehat {\sf R}}_{K}\) for
\({\widehat {\sf R}}_{{\sf chisq},K}\). Let \(B_{ f}\mathrel{\vcenter{:}}=\sqrt{B_\tau(\widebar B_{\sf S}+B_\tau)},~B\mathrel{\vcenter{:}}= c(\widebar B_{\sf S}+1) B_{ f}B_\tau\) for some
absolute constant \(c>0\). It can be verified using Assumption 2 that \({\mathcal{F}}\) must satisfy \(\| f\|_{2,\infty}\leq B_{ f}\) for all \(f\in {\mathcal{F}}\) for Assumption 1 to
hold. Define the zero-mean random process \(X_ f\mathrel{\vcenter{:}}={\widehat {\sf R}}_{K}( {\sf S}_{ f})-{\mathbb{E}}[{\widehat {\sf R}}_{K}( {\sf S}_{ f})],~~ f\in {\mathcal{F}}.\) We will prove that for some absolute
constant \(c>0\),
\[\begin{align}
{\mathbb{P}}\Big( \big|\sup_{ f\in {\mathcal{F}}}|X_{ f}| - {\mathbb{E}}[\sup_{ f\in {\mathcal{F}}}|X_{ f}|]\big|\geq t \Big) &\leq 2\exp\Big(-\frac{cnt^2}{\widebar B_{\sf S}^4}\Big),~~\text{for all } t\geq0, \tag{27} \\
{\mathbb{E}}[\sup_{ f\in {\mathcal{F}}}|X_{ f}|] \leq {\mathbb{E}}[|X_{ f_0}|]+ {\mathbb{E}}[\sup_{ f, \widetilde{f}\in {\mathcal{F}}}|X_{ f}-X_{\widetilde{f}}|] &\leq c\frac{\widebar B_{\sf S}^2}{\sqrt{n}} + 32 \frac{B}{\sqrt{n}}\cdot \int_{0}^{2 B_{ f}}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})} du. \tag{28}
\end{align}\]
Combining the two bounds and noting
\[\begin{align}
\overline{\sf R}_{ K}( {\sf S}_{ \widehat f}) -\inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{ K}( {\sf S}_{ f})\leq 2\sup_{ f\in {\mathcal{F}}}|{\widehat {\sf R}}_K( {\sf S}_{ f})-\overline{\sf R}_K( {\sf S}_{ f})|= 2\sup_{ f\in {\mathcal{F}}}|{\widehat {\sf R}}_K( {\sf S}_{ f})-{\mathbb{E}}[{\widehat {\sf R}}_K( {\sf S}_{ f})]|=2\sup_{ f\in {\mathcal{F}}}|X_ f|\notag
\end{align}\]
yields claim (1).
Proof of Eq. 27. Similar to the proof of Eq. 11, we establish the bound using concentration properties for functions with bounded differences.
Following the notations in the proof of Theorem 1, we let \(\widebar{\boldsymbol{z}}_i=({\boldsymbol{z}}^{(1)}_i,{\boldsymbol{z}}^{(2)}_i)\). For any \(i\in[{n_1}],j\in[K]\), suppose \(\widebar{\boldsymbol{z}}_{(i-1)K+j}\) is replaced by \(\widetilde{\boldsymbol{z}}_{(i-1)K+j}=(\widetilde{\boldsymbol{z}}^{(1)}_{(i-1)K+j},\widetilde{\boldsymbol{z}}^{(2)}_{(i-1)K+j})\) in the
calculation of \({\widehat {\sf R}}_{K}( {\sf S}_{ f})\). It can be verified using Assumption 1 that
\[\begin{align}
&\quad|X_ f(\widebar{\boldsymbol{z}}_1,\ldots,\widebar{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)
-X_ f(\widebar{\boldsymbol{z}}_1,\ldots,\widetilde{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n)|\notag\\
&= | {\widehat {\sf R}}_{K}( {\sf S}_{ f})(\widebar{\boldsymbol{z}}_1,\ldots,\widebar{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n) - {\widehat {\sf R}}_{K}( {\sf S}_{
f})(\widebar{\boldsymbol{z}}_1,\ldots,\widetilde{\boldsymbol{z}}_{(i-1)K+j},\ldots,\widebar{\boldsymbol{z}}_n) | \leq \frac{c\widebar B_{\sf S}^2}{n} \label{eq:bound_term_22_generalization_csq}
\end{align}\tag{29}\] for some absolute constant \(c>0.\) As a result, Eq. 27 follows immediately from Corollary 2.21 in [43] for functions with bounded differences.
Proof of Eq. 28. Similar to the proof of Eq. 12, \({\mathbb{E}}[|X_{ f_0}|]\leq c\widebar B_{\sf S}^2/\sqrt{n}\)
by the properties of zero-mean sub-Gaussian variable \(X_{ f_0}\), and therefore, to establish Eq. 28 , it remains to show \(\{X_{ f}, f\in
{\mathcal{F}}\}\) is a zero-mean sub-Gaussian process with respect to the metric \(\rho_{X}( f, \widetilde{f})\mathrel{\vcenter{:}}= B\| f- \widetilde{f}\|_{2,\infty}/\sqrt{n}.\)
Let \(\| {\boldsymbol{x}}\|_{\psi} \mathrel{\vcenter{:}}=\inf \{ t > 0 : \mathbb{E} [ \psi({\boldsymbol{x}}/ t) ] \leq 1 \}\) denote the Orlicz norm for random variables and let \(\psi_2(u)=\exp(u^2)-1\). Note that for any \({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)},{\boldsymbol{z}}^{(2)}{'}\in{\mathcal{X}}\) and \(f, \widetilde{f}\in {\mathcal{F}}\), we have from Eq. 15 that
\[\begin{align}
| {\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})- {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})| \leq 2 B_{ f}B_\tau\| f- \widetilde{f}\|_{2,\infty},\tag{30}
\end{align}\]
and
\[\begin{align}
&\quad |( {\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})- {\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}{'}))^2 - ( {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})- {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}{'}))^2| \notag\\
&\overset{(i)}{\leq} 4\widebar B_{\sf S} (| {\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}) - {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})| + | {\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}{'}) - {\sf S}_{ \widetilde{f}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}{'})| )\notag\\
&\leq 16\widebar B_{\sf S} B_{ f}B_\tau\| f- \widetilde{f}\|_{2,\infty},\tag{31}
\end{align}\]
where step~(i) uses Assumption~\ref{ass:bounded_score}. Then, following the proof of Eq. 12, it can be verified that
\[\begin{align}
\| X_{ f}-X_{ \widetilde{f}}\|_{\psi_2} =\| {\widehat {\sf R}}_{K}( {\sf S}_{ f})-{\widehat {\sf R}}_{K}( {\sf S}_{ \widetilde{f}})-{\mathbb{E}}[{\widehat {\sf R}}_{K}( {\sf S}_{ f})-{\widehat {\sf R}}_{K}( {\sf S}_{ \widetilde{f}})]\|_{\psi_2} \leq \frac{c(\widebar B_{\sf S}+1) B_{ f}B_\tau}{\sqrt{n}}\| f- \widetilde{f}\|_{2,\infty} \notag.
\end{align}\]
Recall that \(B= B_{{\boldsymbol{x}}} B_{{\boldsymbol{\theta}}}\). For linear regression with a misspecified model, by Theorem 11.3 in [45] (see also, e.g., Theorem 1.1 in [46]), we have \[\begin{align} {\mathbb{E}}[{\sf R}_{\sf lin}(\widetilde{\sf h}_{\widehat\boldsymbol{\eta}})]-{\widebar\sigma}^2\leq 8(\inf_{\boldsymbol{\eta}\in{\mathbb{R}}^ p}{\sf R}_{\sf lin}({\sf h}_{\boldsymbol{\eta}})-{\widebar\sigma}^2) +c (B^2+{\widebar\sigma}^2)\frac{ p\log m}{m} \end{align}\] for some absolute constant \(c>0.\)
Thus it suffices to show
\[\begin{align}
\inf_{\boldsymbol{\eta}\in{\mathbb{R}}^ p}{\sf R}_{{\sf lin}}({\sf h}_{\boldsymbol{\eta}})-{\widebar\sigma}^2\leq c(B^2c_2\sqrt{{\rm Suff}_{\mathrm{f}}( f)}+\epsilon_{\mathcal{G}})\label{eq:linear_example_pf_claim1}
\end{align}\tag{32}\]
for some absolute constant \(c>0.\) Equivalently, we only need to find some \(\boldsymbol{\eta}\in{\mathbb{R}}^{ p}\) such that \({\sf R}_{\sf lin}({\sf h}_{\boldsymbol{\eta}})\) satisfies the bound in Eq. 32. On the other hand, from the proof of Theorem 4,
we see that if we choose \(h_\star({\boldsymbol{x}})=\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle\) and \(h({\boldsymbol{u}})\mathrel{\vcenter{:}}={\mathbb{E}}[h_\star({\boldsymbol{z}})| f({\boldsymbol{z}})={\boldsymbol{u}}]=\langle {{\boldsymbol{\theta}}_\star} , \, {\mathbb{E}}[{\boldsymbol{z}}| f({\boldsymbol{z}})={\boldsymbol{u}}]
\rangle\), then the excess risk \[\begin{align} {\sf R}_{{\sf lin}}(h)-{\widebar\sigma}^2 \leq c(B^2c_2\sqrt{{\rm Suff}_{\mathrm{f}}( f)}+\epsilon_{\mathcal{G}})
\end{align}\] for some absolute constant \(c>0\) by Theorem 4 and Proposition 4. Therefore, it remains to show that \(h\) is linear in \(f({\boldsymbol{z}})\). Note that \(f({\boldsymbol{z}})={\boldsymbol{W}}{\boldsymbol{z}}\). Let \({\boldsymbol{W}}^\dagger={\boldsymbol{W}}^\top({\boldsymbol{W}}{\boldsymbol{W}}^\top)^{-1}\in{\mathbb{R}}^{d\times p}\) be the generalized inverse of \({\boldsymbol{W}}\). Choosing \(\boldsymbol{\widetilde{\eta}}={\boldsymbol{W}}^{\dagger\top}{{\boldsymbol{\theta}}_\star}\in{\mathbb{R}}^{ p}\), we have \[\begin{align} h({\boldsymbol{u}})=\langle {{\boldsymbol{\theta}}_\star} , \,
{\mathbb{E}}[{\boldsymbol{z}}| f({\boldsymbol{z}})={\boldsymbol{u}}] \rangle = \langle {{\boldsymbol{\theta}}_\star} , \, {\mathbb{E}}[{\boldsymbol{W}}^\dagger{\boldsymbol{u}}+({\rm I}_d-{\boldsymbol{W}}^\dagger{\boldsymbol{W}}){\boldsymbol{z}}|
f({\boldsymbol{z}})={\boldsymbol{u}}] \rangle = \langle {{\boldsymbol{\theta}}_\star} , \, {\boldsymbol{W}}^\dagger{\boldsymbol{u}} \rangle= \langle \boldsymbol{\widetilde{\eta}} , \, {\boldsymbol{u}} \rangle,
\end{align}\] where the third equality uses the assumption that \({\mathbb{E}}[({\rm I}_d-{\boldsymbol{W}}^\dagger{\boldsymbol{W}}){\boldsymbol{z}}|{\boldsymbol{W}}{\boldsymbol{z}}]=0\) almost surely.
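To illustrate this last step, here is a minimal numerical sketch (not part of the proof) under the simplifying assumption \({\boldsymbol{z}}\sim{\mathcal{N}}(\boldsymbol{0},{\rm I}_d)\), for which \({\mathbb{E}}[{\boldsymbol{z}}|{\boldsymbol{W}}{\boldsymbol{z}}={\boldsymbol{u}}]={\boldsymbol{W}}^\dagger{\boldsymbol{u}}\) holds exactly and hence so does \({\mathbb{E}}[({\rm I}_d-{\boldsymbol{W}}^\dagger{\boldsymbol{W}}){\boldsymbol{z}}|{\boldsymbol{W}}{\boldsymbol{z}}]=0\); the encoder matrix, parameter vector, and sample sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 8, 3, 200_000

W = rng.standard_normal((p, d))            # hypothetical encoder weights, f(z) = W z
W_dag = W.T @ np.linalg.inv(W @ W.T)       # generalized inverse W^dagger = W^T (W W^T)^{-1}
theta = rng.standard_normal(d)             # hypothetical regression parameter theta_star

# Sketch assumption: z ~ N(0, I_d), so E[z | Wz = u] = W^dagger u exactly.
Z = rng.standard_normal((n, d))
U = Z @ W.T                                # encoder outputs u = W z

# Regress z on u = Wz; the fitted coefficient matrix should approach W^dagger.
coef, *_ = np.linalg.lstsq(U, Z, rcond=None)      # shape (p, d): maps u -> E[z | u]
print(np.max(np.abs(coef.T - W_dag)))             # ~0 up to Monte-Carlo error

# Hence h(u) = <theta_star, E[z | f(z) = u]> is linear in u with coefficient W^{dagger T} theta_star.
eta_tilde = W_dag.T @ theta
h_vals = Z @ theta                                 # h_star(z) = <theta_star, z>
coef_h, *_ = np.linalg.lstsq(U, h_vals, rcond=None)
print(np.max(np.abs(coef_h - eta_tilde)))          # ~0 up to Monte-Carlo error
```

In the paper's setting \({\boldsymbol{z}}\) is not Gaussian; the proof only uses the stated conditional-mean condition, which the restricted class \({\mathcal{F}}^{\sf o}\) later guarantees.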
It suffices to apply Theorem 1 to the setup in Corollary 1.
By the boundedness of \({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}\) and the property that \({\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}\times{\mathbb{P}}_{{\boldsymbol{z}}}}[ \frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}]=1\) (so that the infimum of the density ratio is at most \(1\) and its supremum is at least \(1\)), we have \[\begin{align} \sup_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}} \frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}\leq \frac{ \sup_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}} \frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}}{ \inf_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}} \frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}} \leq\exp(2\kappa). \end{align}\] Similarly, we have \(\inf_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}} \frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}\geq \exp(-2\kappa)\).
By properties of the von Mises-Fisher distribution (see e.g., [47]), it can be verified that
\[\begin{align}
\frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})} = {\mathcal{E}}_{ p}(\kappa)\cdot\exp \big(\kappa \langle {\boldsymbol{z}}^{(1)}, {\sf U}_1{\sf U}_1^\top {\boldsymbol{z}}^{(2)}\rangle\big) \cdot\mathbb{1}_{\{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}\in \mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2) \}}, ~~~~~~~\kappa\mathrel{\vcenter{:}}=\frac{ p}{(1+{\sigma}^2)^2-1},
\end{align}\]
where
\[\begin{align}
{\mathcal{E}}_{ p}(\kappa) &\mathrel{\vcenter{:}}= \frac{\Gamma(p/2)I_{p/2-1}(\kappa)}{(\frac{\kappa}{2})^{p/2-1}} =\Gamma(p/2) \cdot \sum_{m=0}^\infty \frac{1}{m!\Gamma(m+p/2)}\big(\frac{\kappa}{2}\big)^{2m} = \sum_{m=0}^\infty \frac{(p-2)!!}{(2m)!!(2m+p-2)!!}\kappa^{2m}\notag \\
&< \sum_{m=0}^\infty \frac{1}{(2m)!}\kappa^{2m} <e^{\kappa},~\text{ ~~and } {\mathcal{E}}_{ p}(\kappa) > \frac{\Gamma( p/2)}{0!\Gamma( p/2)}\cdot\big(\frac{\kappa}{2}\big)^{0}=1 \label{eq:def_normalizing_const_vmf}.
\end{align}\tag{33}\]
Thus, when \(\tau(x)=\kappa x\), Assumptions 1 and 2 are satisfied with \(B_{\sf S}=\exp(2\kappa),B_\tau=2\kappa\) (note that the condition \(\kappa^{-1}\leq B_\tau\) is unnecessary, as from the proof of Theorem 1 we only need \(|\tau(\langle f({\boldsymbol{z}}^{(1)}) , \, {\boldsymbol{z}}^{(2)} \rangle)|\leq \log B_{\sf S}\), which follows from the boundedness of \({\mathcal{F}}\)).
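As a quick sanity check on the bounds \(1<{\mathcal{E}}_{p}(\kappa)<e^{\kappa}\) in Eq. 33 (not needed for the proof), the following sketch evaluates the normalizing constant via the modified Bessel function; the particular values of \(p\) and \(\kappa\) are arbitrary.

```python
import numpy as np
from scipy.special import iv, gammaln

def E_p(kappa, p):
    # E_p(kappa) = Gamma(p/2) * I_{p/2-1}(kappa) / (kappa/2)^{p/2-1}, computed in log-space
    nu = p / 2 - 1
    log_val = gammaln(p / 2) + np.log(iv(nu, kappa)) - nu * np.log(kappa / 2)
    return np.exp(log_val)

for p in [4, 16, 64]:
    for kappa in [0.5, 2.0, 8.0]:
        val = E_p(kappa, p)
        # Expect 1 < E_p(kappa) < exp(kappa) for every pair below
        print(f"p={p:3d}, kappa={kappa:4.1f}: E_p(kappa)={val:12.6f}, exp(kappa)={np.exp(kappa):12.6f}")
```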
Approximation error. The approximation error \(\inf_{ f\in {\mathcal{F}}}\overline{\sf R}_{{\sf simclr},K}( {\sf S}_{ f})-\overline{\sf R}_{{\sf simclr},K}({\sf S_\star})=0\), since \({\sf S_\star}+c_1\) is realized by \(f_\star\) and the link function \(\tau(x)=\kappa x\) for some normalizing constant \(c_1\), and \(\overline{\sf R}_{{\sf simclr},K}({\sf S_\star}) = \overline{\sf R}_{{\sf simclr},K}({\sf S_\star}+c_1).\)
Generalization error. Let \({\mathcal{W}}\mathrel{\vcenter{:}}=\{{\boldsymbol{W}}\in{\mathbb{R}}^{ p\times d},|\!|\!| {\boldsymbol{W}} | \! | \!|_{{\tiny{op}}}\leq B_{\boldsymbol{W}}\}\). First, for \(f_i({\boldsymbol{z}})={\boldsymbol{W}}_i{\boldsymbol{z}}~(i=1,2)\), since \(\| f_1- f_2\|_{2,\infty}\leq |\!|\!| {\boldsymbol{W}}_1-{\boldsymbol{W}}_2 | \! | \!|_{{\tiny{op}}}\cdot\| {\boldsymbol{z}}\|_{2}\leq 2|\!|\!| {\boldsymbol{W}}_1-{\boldsymbol{W}}_2 | \! | \!|_{{\tiny{op}}}\), it follows that
\[\begin{align}
\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}}) &\leq \log{\mathcal{N}}\Big(\frac{u}{2},|\!|\!| \cdot | \! | \!|_{{\tiny{op}}},{\mathcal{W}}\Big) \leq c d p\cdot\log\Big(1+\frac{4B_{\boldsymbol{W}}}{u}\Big),
\end{align}\]
where the last inequality follows from the upper bound on the covering number of a unit ball (see e.g., Exercise 5.8 in [43]) and the assumption that \(p\leq d\). Therefore,
\[\begin{align}
B_\tau\int_{0}^{2(\log B_{\sf S}+B_\tau)}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})}du \leq c\kappa \int_0^{c\kappa}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})}du \leq c\sqrt{d p}\kappa^2\sqrt{\log B_{\boldsymbol{W}}}.
\end{align}\]
Combining the result on the approximation error and the generalization error and applying Theorem 1 yields the desired result.
Combining Theorem 5 and Corollary 1, we arrive at the following result on the downstream performance of the encoder learned by SimCLR.
Theorem 8 (Linear regression using the SimCLR-trained encoder). Under the setup described in Section 5.1, let \(\widehat f\) be the empirical risk minimizer obtained from Eq. 2 in Corollary 1 on a restricted function space \({\mathcal{F}}^{\sf o}\mathrel{\vcenter{:}}=\{ f({\boldsymbol{z}})={\boldsymbol{W}}{\boldsymbol{z}}\in {\mathcal{F}},~~{\sf span}( {\boldsymbol{W}}^\top) = ({\sf span}( {\boldsymbol{W}}^\top)\cap {\sf span}( {\sf U}_1))\oplus({\sf span}( {\boldsymbol{W}}^\top)\cap {\sf span}( {\sf U}_2)) \}\subseteq {\mathcal{F}}\). In the downstream task, given \(m\) i.i.d. samples \(\{({\boldsymbol{x}}_i,{\boldsymbol{y}}_i)\}_{i=1}^m\) from \({\boldsymbol{y}}={\rm proj}_{[-B,B]}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle)+\varepsilon\), where \({\boldsymbol{x}}\sim{\mathcal{N}}({\boldsymbol{0}},{\rm I}_d/ p)\) follows the same distribution as in contrastive learning, and \(\varepsilon\sim{\mathcal{N}}(0,{\widebar\sigma}^2)\perp\!\!\!\perp{\boldsymbol{x}}\).
We remark that the truncation in the data generation (i.e., \({\boldsymbol{y}}={\rm proj}_{[-B,B]}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle)+\varepsilon\)) is due to technical difficulties; however, we can choose the threshold \(B\) sufficiently large, for example, \(B=\mathcal{O}(\log m)\), so that the truncation rarely happens in the generated data. The restriction of the empirical risk minimization to \({\mathcal{F}}^{\sf o}\) ensures that the condition \({\mathbb{E}}[({\rm I}_d-{\boldsymbol{W}}^\dagger{\boldsymbol{W}}){\boldsymbol{z}}|{\boldsymbol{W}}{\boldsymbol{z}}]=0\) in Theorem 5 holds for any \(f({\boldsymbol{z}})={\boldsymbol{W}}{\boldsymbol{z}}\in {\mathcal{F}}^{\sf o}\). Without this restriction, when \({\rm Suff}( \widehat f)\) is sufficiently small, the ERM \(\widehat f({\boldsymbol{z}})=\widehat{\boldsymbol{W}}{\boldsymbol{z}}\) only satisfies \({\mathbb{E}}[({\rm I}_d-\widehat{\boldsymbol{W}}^\dagger\widehat{\boldsymbol{W}}){\boldsymbol{z}}|\widehat{\boldsymbol{W}}{\boldsymbol{z}}]\approx0\), and the downstream error bound would contain an additional term depending on \({\rm Suff}( \widehat f)\).
For the two-step estimator in (a), the first term in the SimCLR training error converges to zero as the pretraining sample size \(n\) increases, and the second term \(\epsilon_{\mathcal{G}}\) is negligible when either the ground truth \({\mathbb{E}}[{\boldsymbol{y}}|{\boldsymbol{x}}]\) does not vary significantly (i.e., \(\|
{{\boldsymbol{\theta}}_\star}\|_{2}\) is small) or the data augmentation introduces negligible error (i.e., \(\| {\boldsymbol{x}}-{\boldsymbol{z}}\|_{2}\) is small). Thus, compared with the OLS estimator, which has a risk of order \(\mathcal{O}(d/m)\), the two-step estimator achieves a smaller risk of order \(\mathcal{O}( p/m)\) whenever the error from SimCLR training is of higher order.
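The following minimal simulation (not from the paper) illustrates this \(d/m\) versus \(p/m\) comparison in an idealized setting: the encoder is taken to be the exact projection onto a \(p\)-dimensional subspace containing \({\boldsymbol{\theta}}_\star\), and truncation and augmentation error are absent; all dimensions and sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
d, p, m, sigma, trials = 60, 4, 100, 1.0, 300

U, _ = np.linalg.qr(rng.standard_normal((d, p)))    # hypothetical "ideal" encoder f(x) = U^T x
theta = U @ rng.standard_normal(p)                   # ground truth lies in span(U)

def excess_risk(theta_hat):
    # E_x[(x^T (theta_hat - theta))^2] with x ~ N(0, I_d / p)
    return np.sum((theta_hat - theta) ** 2) / p

ols, two_step = [], []
for _ in range(trials):
    X = rng.standard_normal((m, d)) / np.sqrt(p)     # rows x_i ~ N(0, I_d / p)
    y = X @ theta + sigma * rng.standard_normal(m)
    # (a) OLS on the raw d-dimensional input
    th_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    ols.append(excess_risk(th_ols))
    # (b) linear regression on the p-dimensional representation f(x) = U^T x
    F = X @ U
    eta, *_ = np.linalg.lstsq(F, y, rcond=None)
    two_step.append(excess_risk(U @ eta))

print(np.mean(ols), sigma**2 * d / (m - d - 1))      # roughly sigma^2 d/(m-d-1)
print(np.mean(two_step), sigma**2 * p / m)           # roughly sigma^2 p/m
```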
Proof of Theorem 8. First, we have from Corollary 1 that, with probability at least \(1-\delta\), the learned encoder satisfies \[\begin{align} {\rm Suff}( \widehat f) \leq \Big(1 +\frac{{C}}{K}\Big) \cdot \frac{\sqrt{d p\log B_{\boldsymbol{W}}}+\sqrt{{\log(1/\delta)}}}{\sqrt{n}}, \end{align}\] for some constant \({C}>0\) depending polynomially on \(\exp(\kappa)\). Note that the bound can be directly applied even though we consider the ERM on \({\mathcal{F}}^{\sf o}\subseteq {\mathcal{F}}\), since \(f_\star\in {\mathcal{F}}^{\sf o}\) and the proof of Corollary 1 follows from an upper bound on the supremum of an empirical process, which remains valid when restricting to a smaller function space \({\mathcal{F}}^{\sf o}\subseteq {\mathcal{F}}\).
Consider the problem of fitting a linear regression using data \(\{( \widehat f({\boldsymbol{z}}_i),{\boldsymbol{y}}_i)\}_{i=1}^m\). We have \[\begin{align} |{\mathbb{E}}[{\boldsymbol{y}}|
\widehat f({\boldsymbol{z}})]|\leq {\mathbb{E}}[|{\mathbb{E}}[{\boldsymbol{y}}|{\boldsymbol{z}}]|| f({\boldsymbol{z}})] = {\mathbb{E}}[|{\mathbb{E}}[{\rm proj}_{[-B,B]}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star}
\rangle)|{\boldsymbol{z}}]|| f({\boldsymbol{z}})] \leq B.
\end{align}\] Thus the conditions required by Theorem 1.1 in [46] are satisfied and we have \[\begin{align} {\mathbb{E}}[{\sf R}_{\sf lin}(\widetilde{\sf h}_{\widehat\boldsymbol{\eta}})]-{\widebar\sigma}^2\leq 8(\inf_{\boldsymbol{\eta}\in{\mathbb{R}}^ p}{\sf R}_{\sf lin}({\sf h}_{\boldsymbol{\eta}})-{\widebar\sigma}^2) +c
(B^2+{\widebar\sigma}^2)\frac{ p\log m}{m}.
\end{align}\] Following the proof of Theorem 5, it remains to verify the condition \({\mathbb{E}}[({\rm I}_d-\widehat{\boldsymbol{W}}^\dagger\widehat{\boldsymbol{W}}){\boldsymbol{z}}|\widehat{\boldsymbol{W}}{\boldsymbol{z}}]=0\), where \(\widehat{\boldsymbol{W}}\) is the linear map in \(\widehat f\) (i.e., \(\widehat f({\boldsymbol{z}})=\widehat{\boldsymbol{W}}{\boldsymbol{z}}\)). This follows immediately since \({\boldsymbol{z}}\) follows the uniform distribution on \(\mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2)\) and \(\widehat f\in {\mathcal{F}}^{\sf o}\) by assumption.
Adopt the shorthand \({\sf p}\) for \({\rm proj}_{[-B,B]}\). When applying \({\sf p}\) to a vector, we apply it to all the coordinates. Let \(\Sigma={\mathbb{E}}[{\boldsymbol{x}}{\boldsymbol{x}}^\top]={\rm I}_d/ p\) be the covariance matrix. For the ordinary least squares (OLS) estimator, let \({\boldsymbol{X}}=\begin{pmatrix} {\boldsymbol{x}}_1&\ldots&{\boldsymbol{x}}_m \end{pmatrix}^\top\in{\mathbb{R}}^{m\times d}\) denote the sample matrix, \({\boldsymbol{Y}}=\begin{pmatrix} {\boldsymbol{y}}_1&\ldots&{\boldsymbol{y}}_m \end{pmatrix}^\top\in{\mathbb{R}}^{m}\) denote the response vector, and \(\mathcal{E}=\begin{pmatrix} \varepsilon_1&\ldots&\varepsilon_m \end{pmatrix}^\top\in{\mathbb{R}}^{m}\) denote the noise vector. By the definition of OLS, we have \({\widehat{\boldsymbol{\theta}}}=({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top{\boldsymbol{Y}}\) and
\[\begin{align}
{\mathbb{E}}[{\sf R}_{{\sf lin}}(\widetilde{\sf h}_{\sf ols})]-{{\widebar\sigma}^2} &= {\mathbb{E}}[({\sf p}(\langle {\boldsymbol{x}} , \, ({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top{\boldsymbol{Y}} \rangle) - {\sf p}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle))^2].
\end{align}\]
We claim two results which we will use later; their proofs can be found at the end of this section:
\[\begin{align}
{\mathbb{E}}[{\rm trace}(({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}\Sigma)] =\frac{d}{m-d-1},~~ {\mathbb{E}}[{\rm trace}(({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}\Sigma)^2] &=\frac{(m-1)d}{(m-d)(m-d-1)(m-d-3)},\tag{34} \\
{\mathbb{E}}[\| [{\sf p}({\boldsymbol{X}}{{\boldsymbol{\theta}}_\star})-{\boldsymbol{X}}{{\boldsymbol{\theta}}_\star}]\|_{2}^4] &\leq c\frac{{m}^2 B_{{\boldsymbol{\theta}}}^4}{ p^2}\cdot \exp\Big(-\frac{B^2}{c B_{{\boldsymbol{\theta}}}^2/ p}\Big)\tag{35}
\end{align}\]
for some absolute constant \(c>0\).
Choose \(B\geq c({\widebar\sigma}^2+ B_{{\boldsymbol{\theta}}}^2){\log m}/ p\) for some sufficiently large absolute constant \(c>0\). We then have \({\mathbb{E}}[\| [{\sf p}({\boldsymbol{X}}{{\boldsymbol{\theta}}_\star})-{\boldsymbol{X}}{{\boldsymbol{\theta}}_\star}]\|_{2}^4]\leq m^{-4}\). On the one hand, to establish the upper bound, we have \[\begin{align} {\mathbb{E}}[{\sf R}_{{\sf lin}}(\widetilde{\sf h}_{\sf ols})]-{{\widebar\sigma}^2} &\leq {\mathbb{E}}[(\langle {\boldsymbol{x}} , \, ({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top{\boldsymbol{Y}} \rangle - \langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle)^2]\\ &\eqqcolon {T}_1+{T}_2, \end{align}\] where \[\begin{align} {T}_1 &\mathrel{\vcenter{:}}= {\mathbb{E}}[(\langle {\boldsymbol{x}} , \, ({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top{\sf p}({\boldsymbol{X}}{{\boldsymbol{\theta}}_\star}) \rangle - \langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle)^2] \\ &={\mathbb{E}}[\langle {\boldsymbol{x}} , \, ({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top[{\sf p}({\boldsymbol{X}}{{\boldsymbol{\theta}}_\star})-{\boldsymbol{X}}{{\boldsymbol{\theta}}_\star}] \rangle^2]\\ & \leq {\mathbb{E}}[|\!|\!| {\boldsymbol{X}}({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}\Sigma({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top | \! | \!|_{{\tiny{op}}}\cdot\| [{\sf p}({\boldsymbol{X}}{{\boldsymbol{\theta}}_\star})-{\boldsymbol{X}}{{\boldsymbol{\theta}}_\star}]\|_{2}^2]\\ &\overset{(i)}{\leq} \sqrt{{\mathbb{E}}[{\rm trace}(({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}\Sigma({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}\Sigma)]} \cdot \sqrt{{\mathbb{E}}[\| [{\sf p}({\boldsymbol{X}}{{\boldsymbol{\theta}}_\star})-{\boldsymbol{X}}{{\boldsymbol{\theta}}_\star}]\|_{2}^4]}\overset{(ii)}{\leq} \frac{1}{m^2}\leq \frac{{\widebar\sigma}^2}{m^2} \end{align}\] and \[\begin{align} {T}_2 &\mathrel{\vcenter{:}}= {\mathbb{E}}[\langle {\boldsymbol{x}} , \, ({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E} \rangle^2]\\ &= {\widebar\sigma}^2{\mathbb{E}}[{\rm trace}(({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}\Sigma)] \overset{(iii)}{=} {\widebar\sigma}^2\frac{d}{m-d-1}. \end{align}\] Here, step (i) uses the Cauchy–Schwarz inequality, and steps (ii) and (iii) follow from claims 34 and 35 and the choice of \(B\). Combining the bounds on \({T}_1,{T}_2\) yields the upper bound \({\mathbb{E}}[{\sf R}_{{\sf lin}}(\widetilde{\sf h}_{\sf ols})]-{{\widebar\sigma}^2} \leq c{\widebar\sigma}^2\frac{d}{m-d-1}\).
To establish the lower bound, since \({\mathbb{E}}[a^2]\geq{\mathbb{E}}[b^2]+{\mathbb{E}}[(a-b)^2]-2\sqrt{{\mathbb{E}}[(a-b)^2]}\cdot \sqrt{{\mathbb{E}}[b^2]}\), it follows that \[\begin{align} {\mathbb{E}}[{\sf R}_{{\sf lin}}(\widetilde{\sf h}_{\sf ols})]-{{\widebar\sigma}^2} &= {\mathbb{E}}[({\sf p}(\langle {\boldsymbol{x}} , \, ({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top{\boldsymbol{Y}} \rangle) - {\sf p}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle))^2]\\ &= {\mathbb{E}}[({\sf p}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E} \rangle) - {\sf p}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle))^2]\\ &\geq{T}_3-({T}_4+{T}_5), \end{align}\] where \[\begin{align} {T}_3 &= {\mathbb{E}}[(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E} \rangle-\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle)^2]={\widebar\sigma}^2{\mathbb{E}}[{\rm trace}(({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}\Sigma)]={\widebar\sigma}^2\frac{d}{m-d-1}, \\ {T}_4 &= 2 \sqrt{{T}_3}\sqrt{{T}_5},\\ {T}_5 &\mathrel{\vcenter{:}}= {\mathbb{E}}[[({\sf p}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E} \rangle) -\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E} \rangle) - ({\sf p}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle)-\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star} \rangle)]^2]\\ &\leq {\mathbb{E}}[[{\sf p}(\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E} \rangle) -\langle {\boldsymbol{x}} , \, {{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E} \rangle ]^2]+{\widebar\sigma}^2\frac{1}{m^2}, \end{align}\] where the inequality uses claim 35. To find a further upper bound on \({T}_4,{T}_5\), we first note that \(({{{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E}})\) is independent of \({\boldsymbol{x}}\), and \[\begin{align} \| {{{\boldsymbol{\theta}}_\star}+({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E}}\|_{2}^2\leq 2 B_{{\boldsymbol{\theta}}}^{2}+2\| ({\boldsymbol{X}}^\top{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^\top\mathcal{E}\|_{2}^2\leq c{\widebar\sigma}^2\frac{d}{m} +v, \end{align}\] where \(v\) is some zero-mean \(c{\widebar\sigma}^2\)-sub-Exponential variable by Theorem 1 in [41]. Under our choice of \(B\), following the proof of claim 35 and integrating over the sub-Exponential variable \(v\), it can be verified that (when choosing the absolute constant in \(B\) sufficiently large) \({T}_5\leq 2{\widebar\sigma}^2/{m^2}\). Putting the bounds on \({T}_3,{T}_5\) (and hence \({T}_4\)) together, we conclude that \({\mathbb{E}}[{\sf R}_{{\sf lin}}(\widetilde{\sf h}_{\sf ols})]-{{\widebar\sigma}^2} \geq c{\widebar\sigma}^2\frac{d}{m-d-1}\) for some absolute constant \(c>0.\)
Proof of claims 34 and 35. Claim 34 follows directly from properties of the inverse Wishart distribution [48]. For claim 35, since the coordinates of \({\boldsymbol{X}}{{\boldsymbol{\theta}}_\star}\) are i.i.d. \({\mathcal{N}}(0,\| {{\boldsymbol{\theta}}_\star}\|_{2}^2)\), w.l.o.g. it suffices to show \[\begin{align} {\mathbb{E}}[|{\sf p}(z)-z|^4]\leq c\exp(-B^2/c) \end{align}\] for \(z\sim{\mathcal{N}}(0,1)\). This follows immediately since \[\begin{align} {\mathbb{E}}[|{\sf p}(z)-z|^4]\leq c\int_{B}^\infty s^4\exp(-s^2/2)ds\leq cB^3\exp(-B^2/2)\leq c \exp(-B^2/c). \end{align}\] ◻
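Both claims can also be checked numerically. The following sketch (not part of the proof) verifies the first identity in claim 34 and the exponential decay in claim 35 by Monte Carlo; the Gaussian design with \(\Sigma={\rm I}_d/p\) matches the theorem setup, while the particular values of \(d\), \(p\), \(m\), and \(B\) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Claim 34 (first identity): E[trace((X^T X)^{-1} Sigma)] = d / (m - d - 1)
d, p, m, trials = 5, 5, 40, 20_000
Sigma = np.eye(d) / p
vals = []
for _ in range(trials):
    X = rng.standard_normal((m, d)) / np.sqrt(p)      # rows x_i ~ N(0, I_d / p)
    vals.append(np.trace(np.linalg.inv(X.T @ X) @ Sigma))
print(np.mean(vals), d / (m - d - 1))                  # the two numbers should roughly agree

# Claim 35 (after standardization): E[|proj_{[-B,B]}(z) - z|^4] decays like exp(-B^2/c) for z ~ N(0,1)
z = rng.standard_normal(2_000_000)
for B in [1.0, 2.0, 3.0]:
    clipped = np.clip(z, -B, B)
    print(B, np.mean((clipped - z) ** 4), np.exp(-B ** 2 / 2))
```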
We prove Eqs. ?? and ?? in Appendices 9.4.1 and 9.4.2, respectively.
It suffices to apply Theorem 3 to the setup in Theorem 6. With a slight abuse of notation, we use both one-hot vectors in \(\cup_{i=1}^S\{e_i\}\) and integers in \([S]\) to represent the augmented views \({\boldsymbol{z}}\) and do not distinguish them in the proof. We also occasionally omit the subscripts in \({\mathbb{P}}_{\mathcal{Y}},{\mathbb{P}}_c\) when the meaning is clear from the context.
We claim that
\[\begin{align}
\frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})} = \frac{1}{2}\cdot\sum_{y=1}^{M} \frac{{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(1)})\cdot {\mathbb{P}}_c(y|{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}_{\mathcal{Y}}(y)} + \frac{S}{2}\cdot\mathbb{1}_{\{{\boldsymbol{z}}^{(1)}={\boldsymbol{z}}^{(2)}\}}\label{eq:img_suff_bd_pf_1}.
\end{align}\tag{36}\]
We will prove this claim momentarily. With this claim at hand, we proceed as follows.
Approximation error. Let \[\begin{align}
f_\star({\boldsymbol{z}})\mathrel{\vcenter{:}}=\frac{1}{\sqrt2}\Big(\frac{{\mathbb{P}}_c({\boldsymbol{y}}=1|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})}{\sqrt{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=1)}},\ldots,\frac{{\mathbb{P}}_c({\boldsymbol{y}}=M|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})}{\sqrt{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=M)}},{\sqrt{S}}{\boldsymbol{z}}^\top\Big)^\top.
\end{align}\] It can be verified that the parameter \(({\boldsymbol{W}}_\star,{w}_\star)\) corresponding to \(f_\star\) lies in \({\sf\Gamma}.\)
Therefore, the approximation error \(\inf_{ f\in {\mathcal{F}}}R_{{\chi^2}}( {\sf S}_{ f})-R_{{\chi^2}}( {\sf S}^{}_{\star})=0\) since \({\sf S_\star}\) is realized by \(f_\star\) and the link function \(\tau(x)= x\).
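As an illustration (not part of the proof), the following sketch checks numerically that \(\langle f_\star({\boldsymbol{z}}^{(1)}), f_\star({\boldsymbol{z}}^{(2)})\rangle\) matches the right-hand side of claim 36, i.e., that the encoder \(f_\star\) with the link function \(\tau(x)=x\) realizes \({\sf S_\star}\); the conditional law \({\mathbb{P}}_c(y|{\boldsymbol{z}})\) used below is a randomly generated hypothetical example with a uniform marginal over \([S]\).

```python
import numpy as np

rng = np.random.default_rng(3)
S, M = 12, 4

# Hypothetical conditional law P_c(y | z) on M topics for each of the S tokens;
# P_Y(y) is the induced label marginal under a uniform marginal over z in [S].
P_y_given_z = rng.dirichlet(np.ones(M), size=S)        # rows sum to 1
P_Y = P_y_given_z.mean(axis=0)

def f_star(z):
    # f_star(z) = (1/sqrt(2)) * (P_c(y|z)/sqrt(P_Y(y)) for y in [M], sqrt(S) * e_z)
    one_hot = np.eye(S)[z]
    return np.concatenate([P_y_given_z[z] / np.sqrt(P_Y), np.sqrt(S) * one_hot]) / np.sqrt(2)

def claim_36_rhs(z1, z2):
    # Right-hand side of claim 36
    return 0.5 * np.sum(P_y_given_z[z1] * P_y_given_z[z2] / P_Y) + 0.5 * S * (z1 == z2)

err = max(abs(f_star(z1) @ f_star(z2) - claim_36_rhs(z1, z2))
          for z1 in range(S) for z2 in range(S))
print(err)   # ~1e-15: the inner product of f_star reproduces the claimed density ratio
```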
Generalization error. Let \({\mathcal{W}}\mathrel{\vcenter{:}}=\{{\boldsymbol{W}}\in{\mathbb{R}}^{M\times S},{w}\in{\mathbb{R}},\| {\boldsymbol{W}}\|_{2,\infty}\vee|{w}/\sqrt{S}|\leq B_{\boldsymbol{W}}\}\) and define the metric \({|\!|\!|{({\boldsymbol{W}}_1,{w}_1)-({\boldsymbol{W}}_2,{w}_2)}|\!|\!|}\mathrel{\vcenter{:}}=\| {\boldsymbol{W}}_1-{\boldsymbol{W}}_2\|_{2,\infty}\vee|({w}_1-{w}_2)/\sqrt{S}|\) on \({\mathcal{W}}.\) First, for \(f_i({\boldsymbol{z}})=(({\boldsymbol{W}}_i{\boldsymbol{z}})^\top,{w}_i\cdot{\boldsymbol{z}}^\top)^\top~(i=1,2)\), a simple calculation shows \(\| f_1- f_2\|_{2,\infty}\leq 2(|{w}_1-{w}_2|\vee |\!|\!| {\boldsymbol{W}}_1-{\boldsymbol{W}}_2 | \! | \!|_{{\tiny{op}}})\), and therefore
\[\begin{align}
\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}}) &\leq \log{\mathcal{N}}\Big(\frac{u}{2},{|\!|\!|{\cdot}|\!|\!|},{{\mathcal{W}}}\Big) \leq \log{\mathcal{N}}\Big(\frac{u}{2},\| \cdot\|_{2,\infty}, {{\mathcal{W}}_1}\Big) + \log{\mathcal{N}}\Big(\frac{u\sqrt{S}}{2},|\cdot|,{{\mathcal{W}}_2}\Big) \\
&\leq SM\cdot\log\Big(1+\frac{4B_{\boldsymbol{W}}}{u}\Big) + \log\Big(1+\frac{4 B_{\boldsymbol{W}}}{u}\Big),
\end{align}\]
where \({\mathcal{W}}_1\mathrel{\vcenter{:}}=\{{\boldsymbol{W}}\in{\mathbb{R}}^{M\times S},\| {\boldsymbol{W}}\|_{2,\infty}\leq B_{\boldsymbol{W}}\},{\mathcal{W}}_2\mathrel{\vcenter{:}}=\{{w}\in{\mathbb{R}},|{w}|\leq\sqrt{S}B_{\boldsymbol{W}}\}\), and the last inequality follows from the upper bound on the covering number of the unit ball (see e.g., Example 5.8 in [43]) and the assumption that \(M\leq S\). In addition, it is readily verified that \({\sf S}_{ f}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\in[-\widebar B_{\sf S},\widebar B_{\sf S}]\) with \(\widebar B_{\sf S}= 4 B_{\boldsymbol{W}}^2S=4M^2S\) for all \({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}.\) Consequently,
\[\begin{align}
&\quad B_\tau\int_{0}^{B_{\boldsymbol{W}}}\sqrt{\log{\mathcal{N}}(u,\| \cdot\|_{2,\infty}, {\mathcal{F}})}du\\
&\leq c\Bigg(\int_{0}^{B_{\boldsymbol{W}}}\sqrt{SM\cdot\log\Big(1+\frac{4B_{\boldsymbol{W}}}{u}\Big)}du + \int_{0}^{B_{\boldsymbol{W}}}\sqrt{ \log\Big(1+\frac{4 B_{\boldsymbol{W}}}{u}\Big)}du\Bigg)\\
&\leq c\sqrt{SM}B_{\boldsymbol{W}} =c\sqrt{SM^3}.
\end{align}\] Combining the result on the approximation error and the generalization error and applying Theorem 3 yields the
desired result.
For \({\boldsymbol{z}}^{(1)}\neq{\boldsymbol{z}}^{(2)}\), by Bayes’ formula, we have
\[\begin{align}
\frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})} &= \sum_{{\boldsymbol{x}}} \frac{{\mathbb{P}}({\boldsymbol{z}}^{(2)}|{\boldsymbol{x}})\cdot{\mathbb{P}}({\boldsymbol{x}}|{\boldsymbol{z}}^{(1)})}{{\mathbb{P}}({\boldsymbol{z}}^{(2)})}= \sum_{{\boldsymbol{x}}} \frac{{\mathbb{P}}({\boldsymbol{x}}|{\boldsymbol{z}}^{(2)})\cdot{\mathbb{P}}({\boldsymbol{x}}|{\boldsymbol{z}}^{(1)})}{{\mathbb{P}}({\boldsymbol{x}})}\notag\\
&\overset{(i)}{=} 2 \frac{{\mathbb{P}}({\boldsymbol{x}}=({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})|{\boldsymbol{z}}^{(2)})\cdot{\mathbb{P}}({\boldsymbol{x}}=({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})|{\boldsymbol{z}}^{(1)})}{{\mathbb{P}}({\boldsymbol{x}}=({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}))},\label{eq:img_suff_bd_pf_1_2}
\end{align}\tag{37}\]
where step (i) follows from symmetry between \({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}.\) Moreover,
\[\begin{align}
{\mathbb{P}}({\boldsymbol{x}}=({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})|{\boldsymbol{z}}^{(1)}) &= \frac{1}{2} {\mathbb{P}}({\boldsymbol{x}}^{c_{2}}={\boldsymbol{z}}^{(2)}|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}}^{(1)}) =\frac{1}{2}\sum_{y=1}^{M}{{\mathbb{P}}_c({\boldsymbol{x}}^{c_{2}}={\boldsymbol{z}}^{(2)}|y)\cdot{\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}}^{(1)})}\notag\\
&= \frac{1}{2}\sum_{y=1}^{M}\frac{{{\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{2}}={\boldsymbol{z}}^{(2)})\cdot{\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}}^{(1)})}}{{\mathbb{P}}_{\mathcal{Y}}(y)}\cdot {\mathbb{P}}_c({\boldsymbol{x}}^{c_{2}}={\boldsymbol{z}}^{(2)})\notag\\
&= \frac{1}{2} \sum_{y=1}^{M} \frac{{{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(2)})\cdot{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(1)})}}{{\mathbb{P}}_{\mathcal{Y}}(y)}\cdot {\mathbb{P}}_c({\boldsymbol{z}}^{(2)}),\tag{38}\\
{\mathbb{P}}_c({\boldsymbol{z}}) &\overset{(ii)}{=}{\mathbb{P}}({\boldsymbol{z}}),~~\text{and}\tag{39}\\
{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}) &\overset{(iii)}{=} 2{\mathbb{P}}(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}),{\boldsymbol{x}}=({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})) =\frac{1}{2}{\mathbb{P}}({\boldsymbol{x}}=({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}))\tag{40},
\end{align}\]
where step (ii) follows from the generation process of the augmented views \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\), and step (iii) follows from symmetry between \({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}.\)
Substituting Eq. 38 into Eq. 37, we find
\[\begin{align}
\frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})} &= \frac{1}{2}\Big(\sum_{y=1}^{M} \frac{{{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(2)})\cdot{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(1)})}}{{\mathbb{P}}_{\mathcal{Y}}(y)}\Big)^2\cdot \frac{ {\mathbb{P}}_c({\boldsymbol{z}}^{(1)}){\mathbb{P}}_c({\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{x}}=({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}))}\notag \\
&{=} \frac{1}{4}\Big(\sum_{y=1}^{M} \frac{{{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(2)})\cdot{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(1)})}}{{\mathbb{P}}_{\mathcal{Y}}(y)}\Big)^2\cdot \frac{ {\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}\label{eq:img_suff_bd_pf_1_3},
\end{align}\tag{41}\]
where the second equality uses Eq. 39 and 40. Reorganizing Eq. 41, we obtain
\[\begin{align}
\frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}({\boldsymbol{z}}^{(2)})} = \frac{1}{2}\Big(\sum_{y=1}^{M} \frac{{{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(2)})\cdot{\mathbb{P}}_c(y|{\boldsymbol{z}}^{(1)})}}{{\mathbb{P}}_{\mathcal{Y}}(y)}\Big) = \frac{1}{2} \frac{{\mathbb{P}}_c({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}_c({\boldsymbol{z}}^{(1)}){\mathbb{P}}_c({\boldsymbol{z}}^{(2)})} ,\label{eq:img_suff_bd_pf_1_6}
\end{align}\tag{42}\]
where we recall \({\mathbb{P}}_c(\cdot)\) is the marginal distribution of \({\boldsymbol{x}}^{c_{1}}\) (or \({\boldsymbol{x}}^{c_{2}}\)) and the second equality follows from Bayes’ formula and the fact that \({\boldsymbol{x}}^{c_{1}}\perp\!\!\!\perp{\boldsymbol{x}}^{c_{2}}|\boldsymbol{y}\).
For \({\boldsymbol{z}}^{(1)}={\boldsymbol{z}}^{(2)}=z\), using Eq. 39 and properties of conditional distribution, we have \[\begin{align} \sum_{z'\in[S]} \frac{{\mathbb{P}}_c({\boldsymbol{z}}^{(1)}=z,{\boldsymbol{z}}^{(2)}=z')}{{\mathbb{P}}_c({\boldsymbol{z}}^{(1)}=z){\mathbb{P}}_c({\boldsymbol{z}}^{(2)}=z')} &= \frac{1}{{\mathbb{P}}_c({\boldsymbol{z}}^{(2)}=z)} = \frac{1}{{\mathbb{P}}({\boldsymbol{z}}^{(2)}=z)}= \sum_{z'\in[S]} \frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)}=z,{\boldsymbol{z}}^{(2)}=z')}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}=z){\mathbb{P}}({\boldsymbol{z}}^{(2)}=z')} . \end{align}\] Combining this with Eq. 42 for all \({\boldsymbol{z}}^{(2)}\neq{\boldsymbol{z}}^{(1)}\) and noting that the marginal \({\mathbb{P}}_c(\cdot)\) is the uniform distribution on \([S]\), we obtain \[\begin{align} \frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)}=z,{\boldsymbol{z}}^{(2)}=z)}{{\mathbb{P}}({\boldsymbol{z}}^{(1)}=z){\mathbb{P}}({\boldsymbol{z}}^{(2)}=z)} &= \frac{1}{2}\cdot\frac{{\mathbb{P}}_c({\boldsymbol{x}}^{c_{1}}=z,{\boldsymbol{x}}^{c_{2}}=z)}{{\mathbb{P}}_c({\boldsymbol{x}}^{c_{1}}=z){\mathbb{P}}_c({\boldsymbol{x}}^{c_{2}}=z)} +\frac{1}{2}\sum_{z'\in[S]}\frac{{\mathbb{P}}_c({\boldsymbol{x}}^{c_{1}}=z,{\boldsymbol{x}}^{c_{2}}=z')}{{\mathbb{P}}_c({\boldsymbol{x}}^{c_{1}}=z){\mathbb{P}}_c({\boldsymbol{x}}^{c_{2}}=z')} \\ &= \frac{1}{2}\cdot\frac{{\mathbb{P}}_c({\boldsymbol{x}}^{c_{1}}=z,{\boldsymbol{x}}^{c_{2}}=z)}{{\mathbb{P}}_c({\boldsymbol{x}}^{c_{1}}=z){\mathbb{P}}_c({\boldsymbol{x}}^{c_{2}}=z)} +\frac{S}{2} \\ &= \frac{1}{2}\cdot\sum_{y=1}^{M} \frac{{{\mathbb{P}}_c(y|z)\cdot{\mathbb{P}}_c(y|z)}}{{\mathbb{P}}_{\mathcal{Y}}(y)}+\frac{S}{2}. \end{align}\]
Write \({\boldsymbol{z}}=g({\boldsymbol{x}})\). By a standard risk decomposition, we have
\[\begin{align}
{\sf R}_{\rm cls}({\sf h}_{\widehat{\boldsymbol{\Gamma}}}) &={\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h}_{\widehat{\boldsymbol{\Gamma}}})]-{\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{x}}}(\cdot|{\boldsymbol{x}}))]\\
&={\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h}_{\widehat{\boldsymbol{\Gamma}}})]-\inf_{{\sf h}}{\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h})] \\
&= \underbrace{\inf_{{\boldsymbol{\Gamma}}:|\!|\!| {\boldsymbol{\Gamma}}_{w} | \! | \!|_{{\tiny{op}}}\vee\| {\boldsymbol{\Gamma}}_{b}\|_{2}\leq B_\Gamma} {\mathbb{E}}_{{\boldsymbol{x}},g}[ {\mathsf D}_{\rm KL}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{x}}}(\cdot|{\boldsymbol{x}})|| \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( f({\boldsymbol{z}}) ) )]}_{\mathrm{approximation error}} \\
&\qquad\qquad\qquad + \underbrace{{\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h}_{\widehat{\boldsymbol{\Gamma}}})]-\inf_{{\boldsymbol{\Gamma}}:|\!|\!| {\boldsymbol{\Gamma}}_{w} | \! | \!|_{{\tiny{op}}}\vee\| {\boldsymbol{\Gamma}}_{b}\|_{2}\leq B_\Gamma}{\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h}_{\boldsymbol{\Gamma}})]}_{\mathrm{generalization error}}.
\end{align}\]
We will prove that, for some absolute constant \(c>0\),

1. \[\begin{align} \tag{43} \mathrm{approximation error}\leq c\Big(\epsilon^{\sf cls}_{\mathcal{G}}+\frac{S\exp(B)}{\sigma_{{{\boldsymbol{E}}_\star}}^2}\cdot(R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star}))\Big), \end{align}\]
2. with probability at least \(1-\delta\), \[\begin{align}\tag{44} \mathrm{generalization error}\leq \frac{cB}{\sqrt{m}}\Big[\sqrt{\log(1/\delta)} + {M(\sqrt{\log B_\Gamma}+\sqrt{B}) }\Big]. \end{align}\]
0.8ex 1ex .2ex-1em Approximation error. Let \({{\boldsymbol{E}}_\star}\in{\mathbb{R}}^{M\times S}\) be the representation where \[\begin{align} {{\boldsymbol{E}}_\star}_{,\cdot j}=\frac{1}{\sqrt2}\Big(\frac{{\mathbb{P}}_c({\boldsymbol{y}}=1|{\boldsymbol{x}}^{c_{1}}=j)}{\sqrt{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=1)}},\ldots,\frac{{\mathbb{P}}_c({\boldsymbol{y}}=M|{\boldsymbol{x}}^{c_{1}}=j)}{\sqrt{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=M)}}\Big)^\top \end{align}\] for \(j\in[S]\) and let \({{\boldsymbol{E}}_\star}({\boldsymbol{z}})\) denote the \({\boldsymbol{z}}\)-th column of \({{\boldsymbol{E}}_\star}.\) Let \({\widehat{\boldsymbol{E}}}\mathrel{\vcenter{:}}=\begin{pmatrix} \widehat f(1)&\cdots & \widehat f(S) \end{pmatrix}\in{\mathbb{R}}^{M\times S}\).
Given a representation \(\widehat f({\boldsymbol{z}})\), consider the classifier \[\begin{align} \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) ) &= {\rm softmax}(\log{\sf trun}({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})), ~~\text{where}\notag\\ {\boldsymbol{\Gamma}}_{w} &\mathrel{\vcenter{:}}= \sqrt{2}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2}({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)^{-1}{{\boldsymbol{E}}_\star}({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top,~~~ {\boldsymbol{\Gamma}}_{b}\mathrel{\vcenter{:}}=\frac{1}{\sqrt{2}}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2}({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)^{-1}{{\boldsymbol{E}}_\star}{\boldsymbol{1}}_{S},\label{eq:pf_thm_img_dst_bound_claim2_def} \end{align}\tag{45}\] and \({{\boldsymbol{P}}_{\mathcal{Y}}}\mathrel{\vcenter{:}}={\rm diag}\{{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=1)},\ldots,{{\mathbb{P}}_{\mathcal{Y}}({\boldsymbol{y}}=M)}\}\). It can be verified that \(|\!|\!| {\boldsymbol{\Gamma}}_{w} | \! | \!|_{{\tiny{op}}}\leq 2\sqrt{S}M/\sigma_{{{\boldsymbol{E}}_\star}}\leq B_\Gamma\) and \(\| {\boldsymbol{\Gamma}}_{b}\|_{2}\leq \sqrt{S}/\sigma_{{{\boldsymbol{E}}_\star}}\leq B_\Gamma\). Moreover, we have by Lemma 5 that \[\begin{align} {\mathbb{E}}_{{\boldsymbol{x}},g}[ {\mathsf D}_{\rm KL}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{x}}}(\cdot|{\boldsymbol{x}})|| \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) ) )] &\leq 2 {\mathbb{E}}_{{\boldsymbol{x}},g}[ {\mathsf D}_{\rm KL}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{x}}}(\cdot|{\boldsymbol{x}})|| {\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(\cdot|{\boldsymbol{z}}))] + {\mathbb{E}}_{{\boldsymbol{x}},g}[ {\mathsf D}_{\rm 2}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(\cdot|{\boldsymbol{z}})|| \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) ) )] \\ &\leq 2\epsilon^{\sf cls}_{\mathcal{G}} + {\mathbb{E}}_{{\boldsymbol{x}},g}[ {\mathsf D}_{\rm 2}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(\cdot|{\boldsymbol{z}})|| \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) ) )]. \end{align}\] Therefore, it remains to prove \[\begin{align} {\mathbb{E}}_{{\boldsymbol{x}},g}[ {\mathsf D}_{\rm 2}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(\cdot|{\boldsymbol{z}})|| \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) ) )] \leq \frac{cS\exp(B)}{\sigma_{{{\boldsymbol{E}}_\star}}^2}\cdot(R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})) \label{eq:pf_thm_img_dst_bound_claim1}.
\end{align}\tag{46}\] Since \[\begin{align} {\mathbb{E}}_{{\boldsymbol{x}},g}[ {\mathsf D}_{\rm 2}({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(\cdot|{\boldsymbol{z}})|| \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) ) )] &\leq {\mathbb{E}}_{{\boldsymbol{x}},g}\Big[{\mathbb{E}}_{y\sim{\mathbb{P}}_{y|{\boldsymbol{z}}}(\cdot|{\boldsymbol{z}})}\frac{{\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(y|{\boldsymbol{z}})- \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y}{ \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y} \Big]\notag\\ &= {\mathbb{E}}_{{\boldsymbol{x}},g}\Big[\sum_{y\in[M]}\frac{({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(y|{\boldsymbol{z}})- \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y)^2}{ \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y} \Big]\notag\\ &\leq {\exp(B)}\cdot {\mathbb{E}}_{{\boldsymbol{x}},g} \Big[\sum_{y\in[M]}{({\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}(y|{\boldsymbol{z}})- \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y)^2} \Big]\notag\\ &= {\exp(B)}\cdot {\mathbb{E}}_{{\boldsymbol{x}},g} \Big[\sum_{y\in[M]}{({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y)^2} \Big],\label{eq:pf_thm_img_dst_bound_claim2} \end{align}\tag{47}\] where the third line uses the definition of \({\sf trun}\) and claim 48 in the proof of Lemma 6, and the last line uses the fact that \({\mathbb{P}}_c({\boldsymbol{y}}=y|{\boldsymbol{x}}^{c_{1}}=j)={\mathbb{P}}_{{\boldsymbol{y}}|{\boldsymbol{z}}}({\boldsymbol{y}}=y|{\boldsymbol{z}}=j)\) for all \(y\in[M],j\in[S]\), Eq. 46 follows immediately from Lemma 6, which gives an upper bound on the term in Eq. 47.
Generalization error. The proof follows from a standard empirical process analysis similar to the proof of Eq. 26 in the proof of Theorem 3. Thus, we only provide a sketch here.
Let \({{\sf\Gamma}}\mathrel{\vcenter{:}}=\{{\boldsymbol{\Gamma}}: |\!|\!| {\boldsymbol{\Gamma}}_{w} | \! | \!|_{{\tiny{op}}} \vee \| {\boldsymbol{\Gamma}}_{b}\|_{2} \leq B_\Gamma\}\) and define the norm \({|\!|\!|{{\boldsymbol{\Gamma}}-\widetilde{\boldsymbol{\Gamma}}}|\!|\!|}\mathrel{\vcenter{:}}=|\!|\!| {\boldsymbol{\Gamma}}_{w}-\widetilde{\boldsymbol{\Gamma}}_{w} | \! | \!|_{{\tiny{op}}}\vee \| {\boldsymbol{\Gamma}}_{b}-\widetilde{\boldsymbol{\Gamma}}_{b}\|_{2}\). First, by a triangle inequality, the fact that \(\| \log{\sf h}_{{\boldsymbol{\Gamma}}}\|_{\infty}\leq 2B\) (which follows from the definition of \({\sf trun}\)), and Corollary 2.21 in [43] for functions with bounded differences, we have \[\begin{align} \mathrm{generalization error}\leq 2 {\mathbb{E}}\Big[ \sup_{{\boldsymbol{\Gamma}}\in{\sf\Gamma}} | {\widehat {\sf R}}_{\rm cls}({\sf h}_{{\boldsymbol{\Gamma}}}) - {\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h}_{{\boldsymbol{\Gamma}}})] |\Big]+2B\frac{\sqrt{\log(1/\delta)}}{\sqrt{m}} \end{align}\] with probability at least \(1-\delta.\) Let \(X_{\boldsymbol{\Gamma}}\mathrel{\vcenter{:}}={\widehat {\sf R}}_{\rm cls}({\sf h}_{{\boldsymbol{\Gamma}}}) - {\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h}_{{\boldsymbol{\Gamma}}})] .\) Then we have \[\begin{align} {\mathbb{E}}\Big[ \sup_{{\boldsymbol{\Gamma}}\in{\sf\Gamma}} | {\widehat {\sf R}}_{\rm cls}({\sf h}_{{\boldsymbol{\Gamma}}}) - {\mathbb{E}}[{\widehat {\sf R}}_{\rm cls}({\sf h}_{{\boldsymbol{\Gamma}}})] |\Big] \leq {\mathbb{E}}[|X_{{\boldsymbol{\Gamma}}_0}|]+{\mathbb{E}}[\sup_{{\boldsymbol{\Gamma}},\widetilde{\boldsymbol{\Gamma}}\in{\sf\Gamma}}|X_{{\boldsymbol{\Gamma}}}-X_{\widetilde{\boldsymbol{\Gamma}}}|]\leq \frac{2B}{\sqrt{m}}+ {\mathbb{E}}[\sup_{{\boldsymbol{\Gamma}},\widetilde{\boldsymbol{\Gamma}}\in{\sf\Gamma}}|X_{{\boldsymbol{\Gamma}}}-X_{\widetilde{\boldsymbol{\Gamma}}}|]. \end{align}\] Moreover, the process \(\{X_{{\boldsymbol{\Gamma}}}\}_{{\boldsymbol{\Gamma}}\in{\sf\Gamma}}\) is a zero-mean sub-Gaussian process with respect to the metric \(\rho_{X}({\boldsymbol{\Gamma}},\widetilde{\boldsymbol{\Gamma}})\mathrel{\vcenter{:}}= 2\| \log\widebar{\sf h}_{{\boldsymbol{\Gamma}}}-\log\widebar{\sf h}_{\widetilde{\boldsymbol{\Gamma}}}\|_{\infty}/\sqrt{m}\) since \(X_{{\boldsymbol{\Gamma}}}\) is the average of i.i.d. 
random variables bounded by \[\begin{align} &\quad2\sup_{i\in[m]}|\langle e_{\boldsymbol{y}_i} , \, \log \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}_i) ) \rangle -\langle e_{\boldsymbol{y}_i} , \, \log \widebar{\sf h}_{\widetilde{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}_i) ) \rangle|\\ & \leq 2\| \log \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}_i) )- \log \widebar{\sf h}_{\widetilde{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}_i) )\|_{\infty}\leq\rho_{X}({\boldsymbol{\Gamma}},\widetilde{\boldsymbol{\Gamma}})\cdot \sqrt{m},~~\text{and moreover} \\ \rho_{X}({\boldsymbol{\Gamma}},\widetilde{\boldsymbol{\Gamma}}) &\overset{(i)}{\leq} c\| \log{\sf trun}({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b}) - \log{\sf trun}(\widetilde{\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+\widetilde{\boldsymbol{\Gamma}}_{b})\|_{\infty}/\sqrt{m},\\ &\overset{(ii)}{\leq} c\exp(B)\cdot {|\!|\!|{{\boldsymbol{\Gamma}}-\widetilde{\boldsymbol{\Gamma}}}|\!|\!|}/\sqrt{m}\eqqcolon\widebar B{|\!|\!|{{\boldsymbol{\Gamma}}-\widetilde{\boldsymbol{\Gamma}}}|\!|\!|}/\sqrt{m}, \end{align}\] where step (i) uses \(\| \log{\rm softmax}({\boldsymbol{u}})-\log{\rm softmax}({\boldsymbol{v}})\|_{\infty}\leq 2 \| {\boldsymbol{u}}-{\boldsymbol{v}}\|_{\infty}\), and step (ii) follows from a Taylor expansion of \(s(x)=\log x\) and the assumption that \(\| \widehat f({\boldsymbol{z}})\|_{2}\leq B_{\boldsymbol{W}}={M}\). Therefore, we have by Dudley’s integral bound (see e.g., Theorem 5.22 in [43]) that \[\begin{align} &\quad{\mathbb{E}}[\sup_{{\boldsymbol{\Gamma}},\widetilde{\boldsymbol{\Gamma}}\in{\sf\Gamma}}|X_{{\boldsymbol{\Gamma}}}-X_{\widetilde{\boldsymbol{\Gamma}}}|] \leq c\int_{0}^{cB/\sqrt{m}} \sqrt{\log{\mathcal{N}}(u,\rho_{X},\{X_{{\boldsymbol{\Gamma}}},{\boldsymbol{\Gamma}}\in{\sf\Gamma}\})} du \leq c\int_{0}^{cB/\sqrt{m}} \sqrt{\log{\mathcal{N}}\Big(u,\frac{\widebar B}{\sqrt{m}}{|\!|\!|{\cdot}|\!|\!|},{\sf\Gamma}\Big)} du\\ &\leq c\int_{0}^{cB/\sqrt{m}} \sqrt{\log{\mathcal{N}}\Big(\frac{\sqrt{m}\cdot u}{{\widebar B}},{|\!|\!|{\cdot}|\!|\!|},{\sf\Gamma}\Big)} du \\ &\leq c\int^{cB/\sqrt{m}}_{0} \Bigg(\sqrt{\log{\mathcal{N}}\Big(\frac{\sqrt{m}\cdot u}{{\widebar B}},|\!|\!| \cdot | \! | \!|_{{\tiny{op}}},{\sf\Gamma}_w\Big)} + \sqrt{\log{\mathcal{N}}\Big(\frac{\sqrt{m}\cdot u}{{\widebar B}},\| \cdot\|_{2},{\sf\Gamma}_b\Big)} \Bigg) du \\ &\leq c\int^{cB/\sqrt{m}}_{0} \sqrt{M^2\cdot\log\Big(1+4\frac{B_\Gamma\widebar B}{\sqrt{m}u}\Big)} du\leq c\frac{BM\log^{1/2} (B_\Gamma\widebar B) }{\sqrt{m}} \leq c\frac{BM(\log^{1/2} B_\Gamma+\sqrt{B}) }{\sqrt{m}} , \end{align}\] where \({\sf\Gamma}_w\mathrel{\vcenter{:}}=\{{\boldsymbol{\Gamma}}_{w}\in{\mathbb{R}}^{M\times M}:|\!|\!| {\boldsymbol{\Gamma}}_{w} | \! | \!|_{{\tiny{op}}}\leq B_\Gamma\}\) and \({\sf\Gamma}_b\mathrel{\vcenter{:}}=\{{\boldsymbol{\Gamma}}_{b}\in{\mathbb{R}}^{M}:\| {\boldsymbol{\Gamma}}_{b}\|_{2}\leq B_\Gamma\}\), and the last line uses the covering number bound of unit balls. Putting pieces together yields the desired bound.
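As a quick numerical sanity check of the Lipschitz bound used in step (i) above (not part of the proof), the following sketch estimates the ratio \(\| \log{\rm softmax}({\boldsymbol{u}})-\log{\rm softmax}({\boldsymbol{v}})\|_{\infty}/\| {\boldsymbol{u}}-{\boldsymbol{v}}\|_{\infty}\) over random pairs; the dimension and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(u):
    # log softmax(u) = u - logsumexp(u), computed stably
    m = u.max()
    return u - (m + np.log(np.sum(np.exp(u - m))))

worst = 0.0
for _ in range(10_000):
    u, v = 3 * rng.standard_normal(6), 3 * rng.standard_normal(6)
    ratio = np.max(np.abs(log_softmax(u) - log_softmax(v))) / np.max(np.abs(u - v))
    worst = max(worst, ratio)
print(worst)   # stays below 2, consistent with ||log softmax(u) - log softmax(v)||_inf <= 2 ||u - v||_inf
```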
Lemma 6 (Upper bound on the term in Eq. 47). Let the assumptions in Theorem 2 and the notations in its proof in Appendix 9.4.2 hold. Assume \(R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})\leq c\sigma_{{{\boldsymbol{E}}_\star}}^2/(S^2M)\) for some absolute constant \(c>0\); then \[\begin{align} {\mathbb{E}}_{{\boldsymbol{x}},g} \Big[\sum_{y\in[M]}{({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y)^2} \Big]\leq \frac{c'S}{\sigma_{{{\boldsymbol{E}}_\star}}^2}\cdot(R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})) \end{align}\] for some absolute constant \(c'>0.\)
Proof of Lemma 6. The proof consists of two steps. First, we plug the definition of \(\widebar{\sf h}_{{\boldsymbol{\Gamma}}}\) into Eq.~\eqref{lm:eq:pf_thm_img_dst_bound_claim1} and simplify the expression. Then, we demonstrate that the simplified expression can be further bounded using the excess risk \(R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})\) of the learned encoder \(\widehat f_ {\sf aug}.\)
Since
\[\begin{align}
|\!|\!| \nabla_{\boldsymbol{u}}{\rm softmax}(\log {\boldsymbol{u}}) | \! | \!|_{{\tiny{op}}}= |\!|\!| \frac{1}{\| {\boldsymbol{u}}\|_{1}}{\rm I}_{M}-\frac{{\boldsymbol{u}}}{\| {\boldsymbol{u}}\|_{1}}{\boldsymbol{1}}^\top_{M} | \! | \!|_{{\tiny{op}}}\leq \frac{1}{\| {\boldsymbol{u}}\|_{1}}+1
\end{align}\]
for any \({\boldsymbol{u}}\in{\mathbb{R}}^{M}_{>0}\), we have
\[\begin{align}
{\mathbb{E}}_{{\boldsymbol{x}},g} \Big[\sum_{y\in[M]}{({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- \widebar{\sf h}_{{\boldsymbol{\Gamma}}}( \widehat f({\boldsymbol{z}}) )_y)^2} \Big] &\leq c {\mathbb{E}}_{{\boldsymbol{x}},g} \Big[\sum_{y\in[M]}{({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- {\sf trun}({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})_y)^2} \Big]\\
&\leq c {\mathbb{E}}_{{\boldsymbol{x}},g} \Big[\sum_{y\in[M]}{({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- ({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})_y)^2} \Big],
\end{align}\]
where in the first inequality we use the claim that
\[\begin{align}
\label{eq:lm_improve_claim}
|1- \| {\sf trun}({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})\|_{1}|\leq 1/2.
\end{align}\tag{48}\]
The proof of this claim is deferred to the end of the proof of the lemma. The second inequality follows from a Taylor expansion of \(s(x)=\log x\), the boundedness assumption that \({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})\in[\exp(-B),1]\), and noting that the truncation \({\sf trun}(\cdot)\) reduces the \(\ell_2\) error.
Moreover, for any \({\boldsymbol{z}}\in[S]\), by the definition of \(({\boldsymbol{\Gamma}}_{w},{\boldsymbol{\Gamma}}_{b})\) in Eq. 45,
\[\begin{align}
{\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b} &= \sqrt{2}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2}({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)^{-1}{{\boldsymbol{E}}_\star}[({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top \widehat f({\boldsymbol{z}})+ {\boldsymbol{1}}_S/2]\\
&= \sqrt{2}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2}({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)^{-1}{{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}({\boldsymbol{z}})\\
&\quad + \sqrt{2}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2} ({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)^{-1}{{\boldsymbol{E}}_\star} [({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top \widehat f({\boldsymbol{z}})+ {\boldsymbol{1}}_S/2 -{{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}({\boldsymbol{z}}) ]\\
&= \sqrt{2}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2}{{\boldsymbol{E}}_\star}({\boldsymbol{z}}) + \sqrt{2}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2}({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)^{-1}{{\boldsymbol{E}}_\star} [({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top \widehat f({\boldsymbol{z}})+ {\boldsymbol{1}}_S/2 -{{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}({\boldsymbol{z}}) ].
\end{align}\] Since \(\sqrt{2}{{\boldsymbol{P}}_{\mathcal{Y}}}^{1/2}{{\boldsymbol{E}}_\star}({\boldsymbol{z}})=({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}}))_{y\in[M]}\) and \({\boldsymbol{z}}\overset{d}{=}{\boldsymbol{x}}^{c_{1}}\) follows the uniform distribution on \([S]\) by assumption, it follows that \[\begin{align} &\quad{\mathbb{E}}_{{\boldsymbol{x}},g} \Big[\sum_{y\in[M]}{({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- ({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})_y)^2}
\Big]\notag\\ &\leq 2{\mathbb{E}}_{{\boldsymbol{z}}}[ \| ({{\boldsymbol{E}}_\star}{{\boldsymbol{E}}_\star}^\top)^{-1}{{\boldsymbol{E}}_\star}
[({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top \widehat f({\boldsymbol{z}})+ {\boldsymbol{1}}_S/2
-{{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}({\boldsymbol{z}})]\|_{2}^2
]\notag\\
&\leq
\frac{2}{\sigma_{{{\boldsymbol{E}}_\star}}^2}
{\mathbb{E}}_{{\boldsymbol{z}}}[ \|
[({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top \widehat f({\boldsymbol{z}})+ {\boldsymbol{1}}_S/2
-{{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}({\boldsymbol{z}})]\|_{2}^2
]\notag\\
&\leq
\frac{2}{S\sigma_{{{\boldsymbol{E}}_\star}}^2} |\!|\!|
({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}+ {\boldsymbol{1}}_S{\boldsymbol{1}}_S^\top/2
-{{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star} | \! | \!|_{{}}^2
\notag\\
&=
\frac{2}{S\sigma_{{{\boldsymbol{E}}_\star}}^2} |\!|\!|
({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}
-({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star} | \! | \!|_{{}}^2\label{eq:pf95img95exm95lm95c1}
,
\end{align}\tag{49}\] where the last equality follows since \({{\boldsymbol{E}}_\star}^\top({\boldsymbol{z}}^{(1)}){{\boldsymbol{E}}_\star}({\boldsymbol{z}}^{(2)})=\frac{{\mathbb{P}}_c({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{2{\mathbb{P}}_c({\boldsymbol{z}}^{(1)}){\mathbb{P}}_c({\boldsymbol{z}}^{(2)})}\)
for any \(({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\in[S]\), and \(\frac{1}{S}\sum_{{\boldsymbol{z}}^{(2)}\in[S]}\frac{{\mathbb{P}}_c({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}_c({\boldsymbol{z}}^{(1)}){\mathbb{P}}_c({\boldsymbol{z}}^{(2)})}=1\) for all \({\boldsymbol{z}}^{(1)}\in[S]\).
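The passage from the expectation over the uniform \({\boldsymbol{z}}\) to the \(1/S\)-scaled squared Frobenius norm in the third inequality of Eq. 49 can be read as an instance of the generic identity \({\mathbb{E}}_{{\boldsymbol{z}}\sim{\rm Unif}([S])}\|{\boldsymbol{A}}{\boldsymbol{e}}_{{\boldsymbol{z}}}\|_2^2=\frac{1}{S}|\!|\!| {\boldsymbol{A}} | \! | \!|_{{}}^2\), applied to the matrix whose \({\boldsymbol{z}}\)-th column collects the vector inside the Euclidean norm. The snippet below is a minimal numerical sanity check of this identity, not code from the paper; the matrix \(\texttt{A}\) is a hypothetical stand-in.

```python
# Minimal sanity check (not from the paper): for z uniform on [S],
#   E_z || A e_z ||_2^2 = (1/S) || A ||_F^2,
# where e_z is the z-th standard basis vector and A is any fixed matrix.
import numpy as np

rng = np.random.default_rng(0)
S, d = 7, 4
A = rng.normal(size=(d, S))  # hypothetical stand-in; column z = vector inside the norm

lhs = np.mean([np.linalg.norm(A[:, z]) ** 2 for z in range(S)])  # E_z ||A e_z||_2^2
rhs = np.linalg.norm(A, "fro") ** 2 / S                          # (1/S) ||A||_F^2
assert np.isclose(lhs, rhs)
```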
We claim that for some absolute constant \(c>0,\) \[\begin{align}
&\quad |\!|\!|
({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}
-({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S}){{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star} | \! | \!|_{{}}^2\notag\\
&\leq
c|\!|\!|
({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}+{\widehat w}{\rm I}_S)
-({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}+{S}\cdot{\rm I}_S/2) | \! | \!|_{{}}^2,~~\text{and}
\tag{50}\\
&\quad|\!|\!|
({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}+{\widehat w}{\rm I}_S)
-({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}+{S}\cdot{\rm I}_S/2) | \! | \!|_{{}}^2\notag\\
&\leq
S^2\cdot(R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})).
\tag{51}
\end{align}\]
Combining claims 50 and 51 with bound 49 yields the desired bound. It remains to prove these two claims.

Proof of claim 50. Adopt the shorthand notation \(\Delta=({\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}+{\widehat w}{\rm I}_S)-({{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}+{S}\cdot{\rm I}_S/2)\). First, by the triangle inequality, it suffices to show \[\begin{align}
|\!|\!| ({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({\widehat w}-S/2) | \! | \!|_{{}}^2\leq
c|\!|\!| ({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})\Delta | \! | \!|_{{}}^2
\end{align}\]
for some absolute constant \(c>0\). Note that \({\sf rank}({\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}-{{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star})\leq 2M\); therefore, at least \(S/2\) singular values of \(\Delta\) equal \(|{\widehat w}-S/2|.\) As a result, we have \[\begin{align}
|\!|\!| ({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})\Delta | \! | \!|_{{}}^2
&={\rm trace}(\Delta({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})\Delta)
=|\!|\!| \Delta | \! | \!|_{{}}^2-\frac{1}{S}{\boldsymbol{1}}^\top_S\Delta^2{\boldsymbol{1}}_S\\
&\geq
|\!|\!| \Delta | \! | \!|_{{}}^2-|\!|\!| \Delta | \! | \!|_{{\tiny{op}}}^2\geq\frac{1}{4}|\!|\!| ({\widehat w}-S/2){\rm I}_S | \! | \!|_{{}}^2\geq \frac{1}{4}|\!|\!|
({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({\widehat w}-S/2) | \! | \!|_{{}}^2.
\end{align}\]
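As a quick numerical plausibility check of the spectral argument just used (not a replacement for it), the snippet below draws random symmetric matrices of the form \(\Delta={\boldsymbol{L}}+w\,{\rm I}_S\) with \({\sf rank}({\boldsymbol{L}})\leq 2M\) and verifies the resulting lower bound \(|\!|\!| ({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})\Delta | \! | \!|_{{}}^2\geq \tfrac{S}{4}w^2\). The concrete values of \(S\) and \(M\) are hypothetical and chosen so that \(S\geq 4M\), which is what makes "at least \(S/2\)" of the singular values equal to \(|w|\).

```python
# Minimal numerical sketch (not from the paper) of the argument behind claim 50:
# if Delta = L + w*I_S with L symmetric, rank(L) <= 2M, and S >= 4M, then at
# least S - 2M >= S/2 eigenvalues of Delta equal w, and
#   ||(I_S - P_1) Delta||_F^2 >= (S/4) * w^2,
# matching the chain of inequalities above.
import numpy as np

rng = np.random.default_rng(1)
S, M = 40, 5                  # hypothetical sizes with S >= 4M
P1 = np.ones((S, S)) / S      # projection onto span{1_S}

for _ in range(100):
    U = rng.normal(size=(S, 2 * M))
    L = rng.choice([-1.0, 1.0]) * (U @ U.T)   # symmetric, rank <= 2M
    w = 3.0 * rng.normal()                    # plays the role of (w_hat - S/2)
    Delta = L + w * np.eye(S)
    lhs = np.linalg.norm((np.eye(S) - P1) @ Delta, "fro") ** 2
    assert lhs >= 0.25 * S * w ** 2 - 1e-8
```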
Proof of claim 51. Adopt the shorthands \({\sf S}^{\sf m}_{ \widehat f_ {\sf aug}}\mathrel{\vcenter{:}}=\Big({\sf S}_{ \widehat f_ {\sf aug}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\Big)_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}\in[S]}\in{\mathbb{R}}^{S\times S}\) and \({\sf S_\star}^{\sf m}\mathrel{\vcenter{:}}=\Big({\sf S_\star}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})\Big)_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}\in[S]}\in{\mathbb{R}}^{S\times S},\) where \({\sf S_\star}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})=\frac{{\mathbb{P}}({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})}{{\mathbb{P}}_{{\boldsymbol{z}}}({\boldsymbol{z}}^{(1)}){\mathbb{P}}_{{\boldsymbol{z}}}({\boldsymbol{z}}^{(2)})}\). Since we assume \({\boldsymbol{z}}\overset{d}{=}{\boldsymbol{x}}^{c_{1}}\) follows the uniform distribution on \([S]\), by the definition of \(\widehat f_ {\sf aug}\) and claim 36 in the proof of Eq. ??, \[\begin{align} &\quad |\!|\!| ({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({\widehat{\boldsymbol{E}}}^\top{\widehat{\boldsymbol{E}}}+{\widehat w}{\rm I}_S) -({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({{\boldsymbol{E}}_\star}^\top{{\boldsymbol{E}}_\star}+{S}\cdot{\rm I}_S/2) | \! | \!|_{{}}^2 \\ &= |\!|\!| ({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({\sf S}^{\sf m}_{ \widehat f_ {\sf aug}}-{\sf S_\star}^{\sf m}) | \! | \!|_{{}}^2\\ &=S^2\cdot {T}_1, \end{align}\] where \[\begin{align} {T}_1\mathrel{\vcenter{:}}= {\mathbb{E}}_{{\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}\times{\mathbb{P}}_{{\boldsymbol{z}}}}[(({\sf S}_{ \widehat f_ {\sf aug}}-{\sf S_\star})({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})-{\mathbb{E}}_{{\boldsymbol{z}}^{(2)}\sim{\mathbb{P}}_{{\boldsymbol{z}}}}[({\sf S}_{ \widehat f_ {\sf aug}}-{\sf S_\star})({\boldsymbol{z}}^{(1)},{\boldsymbol{z}}^{(2)})])^2]. \end{align}\] Finally, by a second-order Taylor expansion of \(R_{{\mathrm{f}}}({\sf S})\) at \({\sf S_\star}\), we have \[\begin{align} R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})= {T}_1. \end{align}\] Combining the two equalities yields the claim.
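The identity \(|\!|\!| ({\rm I}_S-{\mathcal{P}}_{{\boldsymbol{1}}_S})({\sf S}^{\sf m}_{ \widehat f_ {\sf aug}}-{\sf S_\star}^{\sf m}) | \! | \!|_{{}}^2=S^2\cdot{T}_1\) is an accounting identity for symmetric matrices under the uniform distribution. Below is a minimal numerical check, not code from the paper, with a random symmetric matrix \(\texttt{D}\) standing in for \({\sf S}^{\sf m}_{ \widehat f_ {\sf aug}}-{\sf S_\star}^{\sf m}\).

```python
# Minimal sanity check (not from the paper): for a symmetric S x S matrix D and
# z1, z2 i.i.d. uniform on [S],
#   ||(I_S - P_1) D||_F^2 = S^2 * E[(D[z1, z2] - E_{z2} D[z1, z2])^2],
# i.e. the squared Frobenius norm after centering equals S^2 times T_1.
import numpy as np

rng = np.random.default_rng(2)
S = 6
A = rng.normal(size=(S, S))
D = (A + A.T) / 2              # symmetric stand-in for S^m_fhat - S^m_star
P1 = np.ones((S, S)) / S       # projection onto span{1_S}

lhs = np.linalg.norm((np.eye(S) - P1) @ D, "fro") ** 2
centered = D - D.mean(axis=1, keepdims=True)   # subtract E_{z2} D[z1, .] for each z1
rhs = S ** 2 * np.mean(centered ** 2)          # S^2 * T_1 under the uniform law
assert np.isclose(lhs, rhs)
```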
Proof of claim 48. Note that for any \({\boldsymbol{z}}\in[S]\), \[\begin{align} |1- \| {\sf trun}({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})\|_{1}| &\leq \sum_{y\in[M]}|{\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- {\sf trun}({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})_y|\\ &\leq \sum_{y\in[M]}|{\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- ({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})_y|\\ &\leq \sqrt{MS}\cdot\sqrt{{\mathbb{E}}_{{\boldsymbol{x}},g}\Big[\sum_{y\in[M]}({\mathbb{P}}_c(y|{\boldsymbol{x}}^{c_{1}}={\boldsymbol{z}})- ({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})_y)^2\Big]}, \end{align}\] where the last line uses the Cauchy–Schwarz inequality over \(y\in[M]\) together with the assumption that \({\boldsymbol{x}}^{c_{1}}\) (and hence \({\boldsymbol{z}}\)) follows the uniform distribution on \([S],\) so that the value of the (nonnegative) inner sum at any fixed \({\boldsymbol{z}}\) is at most \(S\) times its average over \({\boldsymbol{z}}\). Thus, combining Eq. 49 with claims 50 and 51 yields \[\begin{align} |1- \| {\sf trun}({\boldsymbol{\Gamma}}_{w} \widehat f({\boldsymbol{z}})+{\boldsymbol{\Gamma}}_{b})\|_{1}|\leq c\frac{S\sqrt{M}}{\sigma_{{{\boldsymbol{E}}_\star}}}\cdot\sqrt{R_{\mathrm{f}}( {\sf S}_{ \widehat f_ {\sf aug}}) - R_{\mathrm{f}}({\sf S_\star})} \leq \frac{1}{2}, \end{align}\] where the last inequality uses the assumption \(R_{{\mathrm{f}}}({\sf S}_{ \widehat f_ {\sf aug}})-R_{{\mathrm{f}}}({\sf S_\star})\leq c\sigma_{{{\boldsymbol{E}}_\star}}^2/(S^2M)\) with a sufficiently small absolute constant \(c\). ◻
Department of Statistics, UC Berkeley. Email: liconglin@berkeley.edu.
Department of Statistics and Department of EECS, UC Berkeley. Email: songmei@berkeley.edu.
For example, \({\boldsymbol{\mu}}\) can be the Lebesgue measure on Euclidean spaces, or the counting measure on discrete spaces.
More generally, we only need each transformation \(g: {\mathcal{X}}\to \mathcal{Z}\) to map \({\mathcal{X}}\) into a space \(\mathcal{Z}\) that admits a natural injective map back to \({\mathcal{X}}\).
\(\mathbb{S}({\sf U}_1)\oplus\mathbb{S}({\sf U}_2)\mathrel{\vcenter{:}}=\{{\boldsymbol{v}}\in{\mathbb{R}}^{d}:{\boldsymbol{v}}={\boldsymbol{v}}_1+{\boldsymbol{v}}_2 \text{ for some }{\boldsymbol{v}}_1\in\mathbb{S}({\sf U}_1),{\boldsymbol{v}}_2\in\mathbb{S}({\sf U}_2)\}\).
All densities are with respect to the Lebesgue measure.