July 20, 2025
We develop a unified Data Processing Inequality PAC-Bayesian framework (abbreviated DPI-PAC-Bayesian) for deriving generalization error bounds in the supervised learning setting. By embedding the Data Processing Inequality (DPI) into the change-of-measure technique, we obtain explicit bounds on the binary Kullback–Leibler generalization gap for both the Rényi divergence and arbitrary \(f\)-divergences measured between a data-independent prior distribution and an algorithm-dependent posterior distribution. We present three bounds derived under our framework using the Rényi, Hellinger \(p\) and Chi-Squared divergences. Our framework also exhibits close connections with other well-known bounds. When the prior distribution is chosen to be uniform, our bounds recover the classical Occam's Razor bound and, crucially, eliminate the extraneous \(\log(2\sqrt{n})/n\) slack present in the PAC-Bayes bound, thereby achieving tighter bounds. The framework thus bridges the data-processing and PAC-Bayesian perspectives, providing a flexible, information-theoretic tool for constructing generalization guarantees.
DPI‑PAC‑Bayesian framework, generalization bounds, Data Processing Inequality, PAC-Bayes bound
Bounding techniques in the supervised learning setting provide theoretical guarantees on the performance of machine learning models on unseen data and thereby offer insight into the generalization capabilities of the models.
This work focuses on high-probability generalization bounds. A typical result states that, with probability at least \(1-\delta\) over a set of i.i.d. samples, the population risk of a model is upper-bounded by \(f\bigl(\delta,\;\text{empirical risk on the samples}\bigr)\). By contrast, information-theoretic bounds usually bound the expected gap between population and empirical risks. A thorough comparison of these two families of bounds is provided in [1].
As noted by Langford [2], high-probability generalization bounds for the supervised learning setting can be classified into two classes: test-set bounds and train-set bounds, each with its own advantages and disadvantages. Although test-set bounds can give a tight upper bound on the error rate on unseen data, their main drawback is that the data used to evaluate the bound cannot be used for learning. Specifically, some training examples must be removed and kept as a holdout set, which can degrade the learned hypothesis when training examples are scarce. Compared to test-set bounds, train-set bounds are the current focus of learning-theory work. Their biggest advantage is that the entire sample can be used both for learning and for bound construction, but many train-set bounds are quite loose. It is therefore crucial to develop techniques that improve the tightness of train-set bounds, so that these bounds provide better insight into the learning problem itself.
In this work, we propose a flexible DPI-PAC-Bayesian framework for deriving train-set generalization error bounds in the supervised learning setting by combining the Data Processing Inequality (DPI) with the spirit of the PAC-Bayesian perspective. The framework accommodates the Rényi divergence as well as arbitrary \(f\)-divergences.
In addition to its flexibility, the framework shows a close connection to other widely used train-set bounds and yields provably tight bounds. Our theoretical results demonstrate that, in special cases, the bounds derived from our framework recover the Occam's Razor bound and can be explicitly tighter than the PAC-Bayes bound.
Consider a standard supervised learning setting. We have \(n\) i.i.d. training samples \(S=\{Z_1,\dots,Z_n\}\), randomly drawn from an underlying data-generating distribution \(\mathcal{D}\). Let \(\mathcal{W}\) be a hypothesis space containing a set of hypotheses (or classifiers) \(w\). The learning algorithm is treated as a conditional probability distribution \(P_{W|S}\): for a given training set \(s\), the algorithm samples a hypothesis \(w\) according to \(P_{W|S=s}\). Coupled with a marginal distribution \(P_S\) over training samples, this defines a joint distribution \(P_{W,S} = P_S P_{W|S}\) on \(\mathcal{W} \times \mathcal{Z}^n\). The performance of a hypothesis \(w\in\mathcal{W}\) on a training sample is measured by a loss function \(\ell:\mathcal{W}\times\mathcal{Z}\rightarrow[0,1]\). The empirical loss of \(w\) is defined as \(\hat{L}(S,w)=\frac{1}{n}\sum_{i=1}^n\ell(Z_i,w)\), while the generalization loss of \(w\) on \(\mathcal{D}\) is \(L(w)=\mathbb{E}_{Z\sim\mathcal{D}}\{\ell(Z,w)\}\). For any \(w\in\mathcal{W}\), we consider bounds on the generalization gap \(L(w)-\hat{L}(S,w)\). For ease of exposition, throughout this work we consider the finite-hypothesis case \(|\mathcal{W}|<\infty\) and assume the prior distribution \(Q_W\) assigns strictly positive mass to every \(w\in\mathcal{W}\), i.e. \(\min_wQ_W(w)>0\); the same technique extends to the more general case where the hypothesis space \(\mathcal{W}\) is infinite.
Consider a fixed kernel \(W(y|x)\) and two probability distributions \(P_X\) and \(Q_X\) defined on the same space \(\mathcal{X}\). Define \(P_Y(y) = \sum_x W(y|x) P_X(x)\) and \(Q_Y(y) = \sum_x W(y|x) Q_X(x)\). Moreover, for a convex function \(f:(0,+\infty) \to \mathbb{R}\) satisfying \(f(1) = 0\), the \(f\)-divergence between two distributions on a probability space \(\mathcal{X}\) is defined as \[D_f(P_X\|Q_X) = \sum_{x \in \mathcal{X}} Q_X(x)\, f\!\left(\frac{P_X(x)}{Q_X(x)}\right).\]
We introduce the Rényi divergence and the two \(f\)-divergences that are central to our bound derivations: 1. Rényi divergence (\(\alpha>0\), \(\alpha\neq1\)) \[D_\alpha(P_X\|Q_X)=\frac{1}{\alpha-1}\ln\Bigl(\sum_{x}P_X(x)^{\alpha}Q_X(x)^{1-\alpha}\Bigr).\] 2. \(\chi^2\)-divergence \[\chi^2(P_X||Q_X)=\sum\limits_x\frac{P_X(x)^2}{Q_X(x)}-1.\] 3. Hellinger \(p\)-divergence (\(p>0\), \(p\neq1\)) \[\mathcal{H}^p(P_X||Q_X)=\frac{\sum\limits_xP_X(x)^pQ_X(x)^{1-p}-1}{p-1}.\]
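For concreteness, the following is a minimal numerical sketch (ours, not from the paper) that evaluates these three divergences for discrete distributions represented as probability vectors, assuming the two distributions share full support (all entries of \(Q\) strictly positive):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(P || Q), for alpha > 0, alpha != 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

def chi_squared_divergence(p, q):
    """Chi-squared divergence chi^2(P || Q) = sum_x P(x)^2 / Q(x) - 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p**2 / q) - 1.0

def hellinger_p_divergence(p, q, order):
    """Hellinger p-divergence H^p(P || Q) = (sum_x P(x)^p Q(x)^(1-p) - 1) / (p - 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (np.sum(p**order * q**(1.0 - order)) - 1.0) / (order - 1.0)

P = np.array([0.7, 0.2, 0.1])
Q = np.array([1 / 3, 1 / 3, 1 / 3])
print(renyi_divergence(P, Q, 2.0), chi_squared_divergence(P, Q), hellinger_p_divergence(P, Q, 2.0))
```

Note that, by the definitions above, the Hellinger \(p\)-divergence with \(p=2\) coincides with the \(\chi^2\)-divergence.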
Proposition 1 (Data Processing Inequality). With the distributions \(P_X,Q_X,P_Y,Q_Y\) and the kernel \(W(y|x)\) defined previously, we have \[\begin{align} &\text{(i) } D_f(P_Y\|Q_Y)\le D_f(P_X\|Q_X),\\ &\text{(ii) } D_\alpha(P_Y\|Q_Y)\le D_\alpha(P_X\|Q_X). \end{align}\]
That is, passing \(P_X\) and \(Q_X\) through the same kernel will make them “more similar”.
To measure the discrepancy between empirical and expected losses, we employ the binary Kullback-Leibler (KL) divergence \(\mathrm{KL}(\hat{L}(S,W)||L(W))\). For \(p, q\in[0,1]\), it is defined as \[\mathrm{KL}(p||q)=p\ln\frac{p}{q}+(1-p)\ln\frac{1-p}{1-q}.\] Bounds in this KL-loss form can be further relaxed using Pinsker's inequality \(\mathrm{KL}(p||q)\geq2(p-q)^2\) to obtain bounds of the classical difference-loss form.
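In practice, a bound of the form \(\mathrm{KL}(\hat{L}(S,w)||L(w))\le B\) is turned into an explicit upper bound on \(L(w)\) by numerically inverting the binary KL. Below is a minimal sketch (function names are ours, not from the paper) that performs this inversion by bisection on the branch \(L(w)\ge\hat{L}(S,w)\):

```python
import math

def binary_kl(p, q):
    """Binary KL divergence KL(p || q) for p, q in [0, 1]."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_inverse_upper(p_hat, budget, tol=1e-10):
    """Largest q >= p_hat with KL(p_hat || q) <= budget, found by bisection.
    Converts a KL-form bound into an upper bound on the population risk."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(p_hat, mid) <= budget:
            lo = mid          # mid still satisfies the budget; move right
        else:
            hi = mid          # mid violates the budget; move left
    return lo

# Example: empirical risk 0.1, budget log(1/delta)/n with n = 1000, delta = 0.05
print(kl_inverse_upper(0.1, math.log(1 / 0.05) / 1000))
```

The same inversion applies to every KL-form bound in this paper; only the budget on the right-hand side changes.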
In the supervised setting, generalization bounds fall into two categories: test-set bounds, which require a subset of the data to be held out solely for evaluation (these examples cannot be used during training), and train-set bounds, which use the entire dataset both to learn the hypothesis and to compute the bound.
To illustrate how a test-set bound is evaluated, consider the following two-party scenario [2]: (1) A learner trains a hypothesis \(w\) on a data set that the verifier will never see, and then transmits this fixed \(w\) to the verifier. (2) The verifier samples a data set \(S\) and, using \(w\) together with the empirical loss on \(S\), computes the right-hand side of the bound.
In a test-set bound, \(S\) is generated after \(w\) is fixed and is independent of the learner's training data.
Theorem 1. (KL test-set bound [2]).
With probability at least \(1-\delta\) over \(P_S\), it holds that \[\forall w,\;\mathrm{KL}( \hat{L}(S, w)||L(w)) \le \frac{\log \frac{1}{\delta}}{n} . \label{eq:KL-test-bound}\qquad{(1)}\]
The bound is very simple and can be seen as computing a confidence interval for the binomial distribution as in [3].
In a train-set bound, the same set of \(S\) is used twice—first to train the hypothesis and then to evaluate the bound. The evaluation protocol is as follows: (1) A learner chooses a prior \(Q_W(w)\) over the hypothesis space before seeing \(S\) and sends it to the verifier. (2) The verifier samples the data \(S\) and sends it to the learner. (3) The learner chooses \(w\) based on \(S\) and sends it to the verifier. (4) The verifier evaluates the bound.
The first and tightest train-set bound is the Occam’s Razor bound [4], and Langford [2] has proved that the bound cannot be improved without incorporating extra information.
Theorem 2. (Occam’s Razor bound [4]). Assume \(w \in \mathcal{W}\) with \(\mathcal{W}\) a countable set. Let \(Q_{W}(w)\) be a distribution over \(\mathcal{W}\). For any \(\delta\in(0,1]\), with probability at least \(1-\delta\) over \(P_S\), it holds that \[\begin{align} \forall w,\, \mathrm{KL}(\hat{L}(S,w)||L(w)) \le \frac{\log\frac{1}{Q_{ W}(w)}+\log\frac{1}{\delta}}{n}. \label{or_bound} \end{align}\tag{1}\]
PAC-Bayes bounds [5] are also train-set bounds. We present the PAC-Bayes-KL bound from the work of McAllester [6], which is one of the tightest known PAC-Bayes bounds in the literature and can be relaxed in various ways to obtain other PAC-Bayes bounds [7].
Theorem 3. (PAC-Bayes bound [8]). Let \(P_{W|S}\) be a fixed conditional distribution (given data \(S\)) on \(\mathcal{W}\). Define \(P[L] := \mathbb{E}_{P_{W|S}} \left\{ L(W) \right\}, \;P[\hat{L}] := \mathbb{E}_{P_{W|S}} \left\{ \hat{L}(S,W) \right\}.\) For any \(\delta\in(0,1]\), with probability at least \(1-\delta\) over \(P_S\), it holds that \[\begin{align} \mathrm{KL}(P[\hat{L}]||P[L]) \leq \frac{D(P_{W|S} \,\|\, Q_ W) + \log \frac{2\sqrt{n}}{\delta}}{n}, \label{eq:OR} \end{align}\tag{2}\]
where \(Q_{W}(w)\) is a prior distribution over the hypothesis space \(\mathcal{W}\), specified before seeing the training samples \(S\). In the PAC-Bayesian framework, \(W\sim P_{{W}|S}\) is the output of the posterior distribution (i.e. the learning algorithm) on \(S\).
To compare the two bounds, we specialize the PAC-Bayes bound (2) by choosing the posterior \(P_{W\mid S}(w)=\delta(w)\), i.e. a Dirac mass on a single hypothesis \(w\). Then \(D(P_{W|S}\,\|\,Q_ W)=\log(1/Q_ W(w))\), and the PAC-Bayes bound becomes \[\begin{align} \forall w,\; \mathrm{KL}(\hat{L}(S, w)||L(w)) \leq \frac{\log \frac{1}{Q_{W}(w)}+\log \frac{2\sqrt{n}}{\delta}}{n}. \label{pac-bayes-pointmass} \end{align}\tag{3}\]
Comparing this to the OR bound in Theorem 2, we see an extra term \(\log 2\sqrt{n}\). This raises an open question worth studying [9]: can the PAC-Bayes bounds be as tight as the OR bound? Specifically, when specializing \(P_{{W}|S}\) to a deterministic algorithm, can we remove the term \(\log2\sqrt{n}\) from the PAC-Bayes bounds?
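For a sense of scale (an illustrative calculation of ours, not from the paper), at \(n=1000\) the extra slack is \[\frac{\log 2\sqrt{n}}{n}=\frac{\ln\bigl(2\sqrt{1000}\bigr)}{1000}\approx\frac{4.15}{1000}\approx 4.1\times10^{-3},\] which is of the same order as the confidence term \(\log(1/\delta)/n\approx 3.0\times10^{-3}\) at \(\delta=0.05\), so the slack is non-negligible at moderate sample sizes.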
Few works address this problem. [10] considered more general PAC-Bayes bounds with other \(d\) functions: \[\begin{align} \mathbb{P}_{S}\Bigl\{ \forall P_{W},\; \lambda\,d\bigl(P[L],\,P[\hat{L}]\bigr) &\le D\bigl(P_{W}\,\Vert\,Q_{W}\bigr) + \log\Phi(\lambda)\\ &\quad\;+\;\log\frac{1}{\delta} \Bigr\} \;\ge\; 1 - \delta. \end{align}\]
For example, if we use Catoni’s function \(C_\beta\) [11] and optimize the parameter \(\beta\), the \(\log 2\sqrt{n}\) term (arising from \(\Phi(\lambda)\)) is removed; however, such data-dependent optimization of the parameter is not allowed for a valid PAC-Bayes bound [12]. Here we study this problem from a different perspective: our work aims to bridge the gap between the PAC-Bayesian framework and the OR bound via the DPI.
Following the technique introduced in [13], we derive three key lemmas by combining the DPI with the Rényi divergence and two \(f\)-divergences. A similar result for the KL divergence appeared in [14].
In this subsection, we define two probability spaces \(({\Omega}, \mathcal{F},P)\), \(({\Omega}, \mathcal{F},Q)\), where \({\Omega}=\mathcal{X\times Y}\). Let \(E\in\mathcal{F}\) be a (measurable) event.
Lemma 1. (Change of measure with Rényi Divergence). For any \(\alpha > 1\) and any event \(E \in \mathcal{F}\), we have the bound \[P(E)\le Q(E)^{\frac{\alpha-1}{\alpha}}e^{\frac{\alpha-1}{\alpha}D_\alpha(P||Q)}.\]
Proof. We define a fixed kernel \(W(y|x)\) to generate \(P_Y(y)\) and \(Q_Y(y)\): \[\begin{cases} W(y=1|x) = 1, & \text{if } x\in E, \\ W(y=0|x) = 1, & \text{otherwise.} \end{cases}\] Notice that for all \(x\) we have \(W(y=1|x)+W(y=0|x)=1\). By Proposition 1, the Rényi divergence satisfies \[D_\alpha(P_X \| Q_X) \ge D_\alpha(P_Y \| Q_Y), \quad \text{for any } \alpha \in (1, \infty).\] Since \(Y \in \{0,1\}\), the RHS becomes \[\begin{align} D_\alpha \big( P_Y \| Q_Y \big) = \frac{1}{\alpha -1} \log \sum_{y \in \{0,1\}}P_Y(y)^\alpha Q_Y(y)^{1-\alpha}, \end{align}\] where \[\begin{align} P_Y(y=1)&= \sum_{x \in E} W(y=1|x) P_X(x) + \sum_{x \notin E} W(y=1|x) P_X(x)\\ &= P(E), \end{align}\] and \(P_Y(y=0)= 1 - P(E)\). Similarly, \(Q_Y(y=1) = Q(E)\) and \(Q_Y(y=0) = 1 - Q(E)\). Thus, when \(\alpha\in(1,\infty)\), we have \[\begin{align} D_{\alpha}(P_X||Q_X) &\geq \frac{1}{\alpha-1}\log [P(E)^\alpha Q(E)^{1-\alpha}\\ &\phantom{\geq\frac{1}{\alpha-1}} +(1 - P(E))^\alpha(1 - Q(E))^{1-\alpha}]\\ &\geq \frac{1}{\alpha-1}\log [P(E)^\alpha Q(E)^{1-\alpha}], \end{align}\] where the last step drops the nonnegative term \((1 - P(E))^\alpha(1 - Q(E))^{1-\alpha}\) inside the logarithm. Rearranging this inequality proves Lemma 1. ◻
Lemma 2. (Change of measure with Hellinger \(p\)-Divergence). For any \(p>1\) and any event \(E \in \mathcal{F}\) such that \(P(E) < \frac{1}{2}\) and \(Q(E) < \frac{1}{2}\), we have the bound \[\begin{align} P(E)\le \left[1+Q(E)^{(1-p)}\right]^{-\frac{1}{p}}\left[(p-1)\mathcal{H}^{p}(P||Q)+1\right]^{\frac{1}{p}} \label{DPI95original95bound95hellinger95p95inequality}. \end{align}\qquad{(2)}\]
Proof. We define the same fixed kernel \(W(y|x)\) as in the proof of Lemma 1. Thus for Hellinger \(p\)-Divergence we have \[\begin{align} \mathcal{H}^{p}(P_X(x)\|Q_X(x)) &\geq \frac{1}{p-1} \left[ (1 - P(E))^p (1 - Q(E))^{1 - p} \right. \notag\\ &\quad\quad\quad\quad\quad\quad\quad\left. + (P(E))^p (Q(E))^{1-p} - 1 \right]. \end{align}\]
We can further relax the RHS as \[\frac{1}{p-1} \left[P(E)^p\left(1+Q(E)^{1-p}\right) - 1 \right],\] since \(P(E)<\tfrac{1}{2}\) implies \((1-P(E))^p \ge P(E)^p\) and \((1-Q(E))^{1-p}\ge 1\) for \(p>1\). The claimed result follows by rearranging the terms. ◻
The conditions \(P(E)<\frac{1}{2}\) and \(Q(E)<\frac{1}{2}\) are naturally satisfied when \(E\) is defined as the failure event in which the KL-based test-set bound does not hold.
Lemma 3. (Change of measure with Chi-Squared Divergence). For any event \(E \in \mathcal{F}\), we have the bound \[\begin{align} P(E)\le Q(E)^\frac{1}{2}(\chi^2(P||Q)+2)^\frac{1}{2}. \label{DPI95original95bound95Chi95Squared} \end{align}\qquad{(3)}\]
Proof. See Appendix 3.5 ◻
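As a quick sanity check (ours, not part of the paper), the three change-of-measure inequalities in Lemmas 1–3 can be verified numerically on randomly drawn discrete distributions and events:

```python
import numpy as np

rng = np.random.default_rng(0)

def renyi(p, q, a):
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1)

def hellinger(p, q, order):
    return (np.sum(p**order * q**(1 - order)) - 1) / (order - 1)

def chi2(p, q):
    return np.sum(p**2 / q) - 1

alpha = order = 2.0
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
    mask = rng.random(10) < 0.3                      # a random event E
    if not mask.any():
        continue
    P_E, Q_E = p[mask].sum(), q[mask].sum()
    # Lemma 1: P(E) <= Q(E)^((a-1)/a) * exp(((a-1)/a) * D_a(P || Q))
    assert P_E <= Q_E**((alpha - 1) / alpha) * np.exp((alpha - 1) / alpha * renyi(p, q, alpha)) + 1e-10
    # Lemma 2 (requires P(E) < 1/2 and Q(E) < 1/2)
    if P_E < 0.5 and Q_E < 0.5:
        rhs = (1 + Q_E**(1 - order))**(-1 / order) * ((order - 1) * hellinger(p, q, order) + 1)**(1 / order)
        assert P_E <= rhs + 1e-10
    # Lemma 3: P(E) <= sqrt(Q(E)) * sqrt(chi^2(P || Q) + 2)
    assert P_E <= np.sqrt(Q_E) * np.sqrt(chi2(p, q) + 2) + 1e-10
print("all change-of-measure inequalities verified")
```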
These three lemmas are inspired by the change-of-measure principle commonly employed in the PAC-Bayesian framework. In PAC-Bayes analysis, the Donsker-Varadhan inequality enables one to bound expectations under an intractable posterior by reweighting expectations under a tractable prior distribution, typically introducing a KL divergence term to quantify the complexity of the posterior.
Within our framework, we exploit the DPI to upper‑bound the posterior distribution \(P(E)\) through several \(f\)-divergences between \(P\) and \(Q\); this idea was first introduced in [13]. Each bound comprises a scaled prior term, such as \(Q(E)^\gamma\), multiplied by an exponential penalty term that depends on the chosen divergence. These multiplicative correction factors—e.g., \(e^{\frac{\alpha - 1}{\alpha} D_\alpha(P \| Q)}\) in the Rényi case—can be viewed as the cost of performing a change of measure from \(Q\) to \(P\) under the respective divergence.
This perspective highlights a unifying theme across our results: DPI provides an information-theoretic control analogous to that in PAC-Bayes, enabling generalization bounds through divergence-based reweighting of prior knowledge.
Building on the preceding lemmas that combine the DPI with the Rényi (\(D_\alpha\)), Chi-Squared (\(\chi^2\)), and Hellinger \(p\) (\(\mathcal{H}^p\)) divergences, we now establish the core results of our DPI-PAC-Bayesian framework. The next three theorems present train-set generalization bounds in terms of the \(D_\alpha\), \(\mathcal{H}^p\), and \(\chi^2\) divergences.
Theorem 4. (\(D_\alpha\)-PAC-Bayes bound). Let \(Q\) be a distribution over a finite hypothesis space \(\mathcal{W}\) such that \(Q_{\min} := \min_w Q_{W}(w)>0\). For any \(\alpha>1\), and for any \(\delta\in(0,1]\), with probability at least \(1-\delta\) over \(P_S\), it holds that \[\begin{align} \forall w,\; \mathrm{KL}(\hat{L}(S, w)||L(w)) \leq \frac{ \log \frac{1}{Q_{min}} + \frac{\alpha}{\alpha - 1} \log \frac{1}{\delta} }{ n}. \end{align}\]
Proof. We choose \(P_{S,{W}} = P_S P_{{W}|S}\) and \(Q_{S,{W}} = P_S Q_{W}\) for some \(Q_{W}\), and define the event \[E := \left\{ (S, W) : \mathrm{KL}( \hat{L}(S, W)||L(W)) \geq \frac{\log \frac{1}{\delta}}{n} \right\}.\] For any specific \(w\), we define the event \[E_w := \left\{ S: \mathrm{KL}( \hat{L}(S, w)||L(w)) \geq \frac{\log \frac{1}{\delta}}{n} \right\}.\] Then we apply Lemma 1 to get
\[P(E) \leq Q(E)^{\frac{\alpha -1}{\alpha}} \left(\sum_{w,s} P(s,w)^\alpha Q(s,w)^{1-\alpha} \right)^{\frac{1}{\alpha}}.\] When \(\alpha>1\), for \(Q(E)^{\frac{\alpha -1}{\alpha}}\) we have \[\begin{align} Q(E)^{\frac{\alpha -1}{\alpha}}&= \left( \sum_{w} Q_W(w) \, \mathbb{P}\{E_w\} \right)^{\frac{\alpha - 1}{\alpha}} \\ &\le \left( \delta \sum_{w} Q_W(w) \right)^{\frac{\alpha - 1}{\alpha}}= \delta^{\frac{\alpha - 1}{\alpha}}, \end{align}\] where we used \(\sum_{w} Q_W(w)=1\) and the test-set bound in Theorem 1, which states \(\mathbb{P}\{E_w\}\leq\delta\). Furthermore, we have \[\begin{align} \sum_{w,s} P(w,s)^\alpha Q(w,s)^{1-\alpha} &= \sum_{w,s} P_S(s) P_{W|S}(w|s)^\alpha Q_W(w)^{1-\alpha} \\ &= \sum_{s} Q_W(w^*(s))^{1-\alpha} P_S(s) \\ &\le Q_{\min}^{1-\alpha}, \end{align}\] where \(P_{W|S}(w|s)=\delta(w^*)\) is the distribution that concentrates its mass on the hypothesis \[w^*\in\operatorname{argmax}_{w\in \mathcal{W}} \mathrm{KL}(\hat{L}(S,w)||L(w)),\] i.e. \(w^*\) is any maximizer of the KL divergence between the empirical and population losses. In this case the bound becomes \[\begin{align} {P}\{E\} &= \mathbb{P}_S\left\{ \sup_{w} \mathrm{KL}(\hat{L}(S, w)||L(w)) \ge \frac{\log \frac{1}{\delta}}{n} \right\}\le \delta^{\frac{\alpha - 1}{\alpha}} Q_{\min}^{\frac{1 - \alpha}{\alpha}}. \end{align}\] Setting \(\delta' := \delta^{\frac{\alpha - 1}{\alpha}} Q_{\min}^{\frac{1 - \alpha}{\alpha}}\) and expressing \(\delta\) in terms of \(\delta'\), we obtain \[\begin{align} &\mathbb{P}_S \left\{ \exists w,\, \mathrm{KL}(\hat{L}(S, w)||L(w)) \geq \frac{ \log \frac{1}{Q_{min}} + \frac{\alpha}{\alpha - 1} \log \frac{1}{\delta'} }{n} \right\}\leq \delta', \end{align}\] or equivalently \[\begin{align} &\mathbb{P}_S \left\{ \forall w,\, \mathrm{KL}(\hat{L}(S, w)||L(w)) \leq \frac{ \log \frac{1}{Q_{min}} + \frac{\alpha}{\alpha - 1} \log \frac{1}{\delta'} }{n} \right\}\geq 1-\delta'. \end{align}\] ◻
The main novelty of our framework comes from specifying an "undesirable event" \(E\); the flexible choice of \(E\) not only gives the framework its flexibility but also yields tighter generalization bounds. Defining an optimal, measurable "undesirable event" \(E\) is therefore an interesting question for future study.
Theorem 5. (\(\mathcal{H}^p\)-PAC-Bayes bound). Let \(Q\) be a distribution over a finite hypothesis space \(\mathcal{W}\) such that \(Q_{\min} := \min_w Q_{W}(w)>0\). For any \(p>1\), and for any \(\delta\in(0,1]\), with probability at least \(1-\delta\) over \(P_S\), it holds that \[\begin{align} \forall w,\; \mathrm{KL}(\hat{L}(S, w)||L(w)) \leq \frac{ \log\left[(Q_{min})^{1-p}\delta^{-p}-1\right] }{(p-1)n}. \end{align}\]
Proof. See Appendix 3.6 ◻
Theorem 6. (\(\mathcal{\chi}^2\)-PAC-Bayes bound). Let \(Q\) be a distribution over a finite hypothesis space \(\mathcal{W}\) such that \(Q_{\min} := \min_w Q_{W}(w)>0\). For any \(\delta\in(0,1]\), with probability at least \(1-\delta\) over \(P_S\), it holds that \[\begin{align} \forall w,\; \mathrm{KL}(\hat{L}(S, w)||L(w)) &\leq \frac{\log\frac{1+Q_{min}}{Q_{min}}+2\log\frac{1}{\delta}}{n}. \end{align}\]
Proof. Following the proof procedures of Theorem 4 and Theorem 5, we apply Lemma 3 and bound \(Q(E)^{\frac{1}{2}}\) via the KL test-set bound: \[Q(E)^{\frac{1}{2}}= \Bigl(\sum\limits_w Q_W(w)\mathbb{P}\{E_w\}\Bigr)^{\frac{1}{2}}\le\delta^{\frac{1}{2}}.\] We also have \[\begin{align} \left(\chi^2\left(P(w,s)||Q(w,s)\right)+2\right)^\frac{1}{2}&=\left(\sum\limits_{w,s}\frac{P(w,s)^2}{Q(w,s)}-1+2\right)^\frac{1}{2}\\ &=\left(\sum\limits_{s}\frac{P_S(s)}{Q_W(w^*(s))}+1\right)^\frac{1}{2}\\ &\le\left(\frac{\sum\limits_{s}P_S(s)}{Q_{min}}+1\right)^\frac{1}{2}\\ &=\left(\frac{1+Q_{min}}{Q_{min}}\right)^\frac{1}{2}. \end{align}\] Then we obtain the bound in Theorem 6 by the reparameterization trick used in Theorems 4 and 5. ◻
Remark 1. The DPI-PAC-Bayesian framework can be applied to arbitrary \(f\)-divergences and yields generalization bounds whose relative tightness is governed by their divergence parameters. The \(\chi^2\)-PAC-Bayes bound is parameter-free, while the \(D_\alpha\)-PAC-Bayes and \(\mathcal{H}^p\)-PAC-Bayes bounds each possess a free parameter (\(\alpha>1\) and \(p>1\), respectively) that modulates the trade-off between the divergence penalty and the confidence term \(\log(\frac{1}{\delta})\).
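To make this trade-off concrete, the following small sketch (our own illustration, with arbitrary values \(n=1000\), \(|\mathcal{W}|=50\), \(\delta=0.05\) and a uniform prior) evaluates the right-hand sides of Theorems 4–6 and the OR bound; the \(\mathcal{H}^p\) right-hand side is computed in log-space so that large \(p\) does not overflow:

```python
import math

def renyi_bound(n, q_min, delta, alpha):
    """RHS of the D_alpha-PAC-Bayes bound (Theorem 4)."""
    return (math.log(1 / q_min) + alpha / (alpha - 1) * math.log(1 / delta)) / n

def hellinger_bound(n, q_min, delta, p):
    """RHS of the H^p-PAC-Bayes bound (Theorem 5), evaluated in log-space."""
    log_main = (1 - p) * math.log(q_min) - p * math.log(delta)   # log(q_min^(1-p) * delta^(-p))
    return (log_main + math.log1p(-math.exp(-log_main))) / ((p - 1) * n)

def chi2_bound(n, q_min, delta):
    """RHS of the chi^2-PAC-Bayes bound (Theorem 6)."""
    return (math.log((1 + q_min) / q_min) + 2 * math.log(1 / delta)) / n

def occam_bound(n, q_w, delta):
    """RHS of the Occam's Razor bound (Theorem 2)."""
    return (math.log(1 / q_w) + math.log(1 / delta)) / n

n, q_min, delta = 1000, 1 / 50, 0.05          # uniform prior over 50 hypotheses
for a in (10.0, 1e3, 1e7):
    print(a, renyi_bound(n, q_min, delta, a), hellinger_bound(n, q_min, delta, a))
print("chi^2:", chi2_bound(n, q_min, delta), "OR:", occam_bound(n, q_min, delta))
```

As the parameter grows, the \(D_\alpha\) and \(\mathcal{H}^p\) values approach the OR value, while the \(\chi^2\) value stays above both, consistent with the comparisons below.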
We apply our bounds to a logistic classification problem in a 2-dimensional space, where \(w\in\mathbb{R}^2\) and \(Z_i=(\boldsymbol{x}_i,y_i)\in\mathbb{R}^3\). Each \(\boldsymbol{x}_i=(x_{i1},x_{i2})\) is sampled from a multivariate Gaussian distribution \(\mathcal{N}(0,\boldsymbol{I}_2)\). The label \(y_i\in\{0,1\}\) is generated from a Bernoulli distribution with probability \(p(y_i=1|\boldsymbol{x}_i,w^*)=\frac{1}{1+e^{-\boldsymbol{x}^T_i w^*}}\), where \(w^*=(0.5,0.5)\). The generalization gap is measured by \(\mathrm{KL}(\hat{L}(S,W)||L(W))\), where the loss function is the 0-1 loss \(\ell(Z_i,w)=I(\frac{1}{2}(\operatorname{sign}(\boldsymbol{x}^T_i w)+1)\neq y_i)\). We work with a finite hypothesis space \(\mathcal{W}\) with \(|\mathcal{W}| = 50\). Each hypothesis is a weight vector \(w \in \mathbb{R}^2\) whose coordinates are sampled independently from the uniform distribution \(\mathrm{Unif}([-100,100])\). Because there is no prior information about the data, it is natural to assign the same importance (probability) to each hypothesis, so we adopt the uniform prior distribution on \(\mathcal{W}\) (i.e. \(Q_{min}=Q_{W}(w)=1/50\)).
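A minimal sketch of this experimental setup (our reconstruction from the description above; the random seed and the size of the held-out sample used to approximate the population risk are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)                      # arbitrary seed
n, num_hyp, d = 400, 50, 2
w_star = np.array([0.5, 0.5])

def sample_data(m):
    """x ~ N(0, I_2), y ~ Bernoulli(sigmoid(x^T w*))."""
    X = rng.standard_normal((m, d))
    y = (rng.random(m) < 1.0 / (1.0 + np.exp(-X @ w_star))).astype(int)
    return X, y

def zero_one_loss(w, X, y):
    """0-1 loss of the linear classifier: I((sign(x^T w) + 1)/2 != y)."""
    preds = (np.sign(X @ w) + 1) / 2
    return np.mean(preds != y)

def binary_kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Finite hypothesis space: 50 weight vectors with coordinates ~ Unif([-100, 100])
hypotheses = rng.uniform(-100, 100, size=(num_hyp, d))

X_train, y_train = sample_data(n)
X_test, y_test = sample_data(200_000)               # large sample to approximate L(w)

emp_risk = np.array([zero_one_loss(w, X_train, y_train) for w in hypotheses])
pop_risk = np.array([zero_one_loss(w, X_test, y_test) for w in hypotheses])
gap = binary_kl(emp_risk, pop_risk)                  # KL generalization gap per hypothesis
print(gap.max())
```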
In Figure 1, we compare the tightness of the bounds derived by our framework, varying the training sample size from 100 to 1600. For the \({D}_\alpha\)-PAC-Bayes bound and the \(\mathcal{H}^p\)-PAC-Bayes bound, we experiment with different parameters, where \(\alpha,p\in\{10,10^3,10^7\}\).
Across the entire range of \(n\), both the \(D_{\alpha}\)-PAC-Bayes and \(\mathcal{H}^{p}\)-PAC-Bayes bounds remain strictly below the \(\chi^{2}\)-PAC-Bayes bound, indicating tighter generalization guarantees. Moreover, compared with the \(\mathcal{H}^{p}\)-PAC-Bayes bound, the \(D_{\alpha}\)-PAC-Bayes curves show minimal dependence on the parameter.
In summary, under the present experimental setting, the \(D_{\alpha}\)-PAC-Bayes bound is the most parameter-robust and delivers the tightest guarantee, followed by the \(\mathcal{H}^{p}\)-PAC-Bayes bound; the \(\chi^{2}\)-PAC-Bayes bound is the loosest of the three. Note that under the current experimental setup, all of our derived bounds are tighter than the PAC-Bayes bound.
The bounds developed by the DPI-PAC-Bayesian framework exhibit a close connection to the OR bound and the PAC-Bayes bound. In particular, we show in the sequel that when the prior distribution \(Q_{W}(w)\) is chosen to be uniform, the \(D_\infty\)-PAC-Bayes and \(\mathcal{H}^\infty\)-PAC-Bayes bounds recover the OR bound. Additionally, our bounds are similar in spirit to the PAC-Bayes bound, but are provably tighter in this special case. Importantly, the bounds converge monotonically to their tightest forms as \(\alpha,\;p\rightarrow\infty\).
Corollary 1 (Limiting \(D_{\infty}\)/ \(\mathcal{H}^{\infty}\)-PAC‑Bayes bound). Let the prior \(Q_{{W}}\) be uniform over \(\mathcal{W}\). For any \(\delta\in(0,1]\), with probability at least \(1-\delta\) over \(P_S\), we have \[\forall\,w,\; \mathrm{KL}\!(\hat{L}(S,w)||L(w))\;\le\; \frac{ \log\frac{1}{Q_{{W}}(w)}\;+\;\log\frac{1}{\delta}}{n}.\] The same bound is obtained in either of the following limits: \(\alpha\to\infty\) in the \(D_\alpha\)‑PAC‑Bayes bound; \(p\to\infty\) in the \(\mathcal{H}^p\)‑PAC‑Bayes bound.
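A short check of the limits (our own calculation, assuming \(\delta<1\) or \(Q_{\min}<1\) so the logarithm's argument is positive and grows): for the \(D_\alpha\)-PAC-Bayes bound, \[\lim_{\alpha\to\infty}\frac{\log\frac{1}{Q_{\min}}+\frac{\alpha}{\alpha-1}\log\frac{1}{\delta}}{n}=\frac{\log\frac{1}{Q_{\min}}+\log\frac{1}{\delta}}{n},\] and, writing \(Q_{\min}^{1-p}\delta^{-p}-1=Q_{\min}^{1-p}\delta^{-p}\bigl(1-Q_{\min}^{p-1}\delta^{p}\bigr)\) in the \(\mathcal{H}^p\)-PAC-Bayes bound, \[\lim_{p\to\infty}\frac{\log\bigl[Q_{\min}^{1-p}\delta^{-p}-1\bigr]}{(p-1)n}=\lim_{p\to\infty}\frac{(p-1)\log\frac{1}{Q_{\min}}+p\log\frac{1}{\delta}+\log\bigl(1-Q_{\min}^{p-1}\delta^{p}\bigr)}{(p-1)n}=\frac{\log\frac{1}{Q_{\min}}+\log\frac{1}{\delta}}{n}.\] With a uniform prior, \(Q_{\min}=Q_W(w)\), so both limits coincide with the OR bound of Theorem 2.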
Corollary 2. (\(\chi^2\)-PAC-Bayes bound with uniform prior). Let the prior \(Q_{{W}}\) be uniform over \(\mathcal{W}\). For any \(\delta\in(0,1]\), with probability at least \(1-\delta\) over \(P_S\), it holds that \[\forall w, \; \mathrm{KL}(\hat{L}(S, w)||L(w)) \leq \frac{ \log \frac{1+Q_{W}(w)}{Q_{W}(w)}+2\log \frac{1}{\delta} }{n}.\]
Importantly, compared to the specialized PAC-Bayes bound (3), both the \(D_\infty\)-PAC-Bayes bound and the \(\mathcal{H}^\infty\)-PAC-Bayes bound remove the extra term \(\log2\sqrt{n}\), and both recover exactly the expression of the OR bound in Theorem 2.
Additionally, the \(D_\infty\)-PAC-Bayes bound and the \(\mathcal{H}^\infty\)-PAC-Bayes bound provide a tighter generalization guarantee than the \(\chi^{2}\)-PAC‑Bayes bound by a margin of \(\bigl[\log\bigl(1+Q_{ W}(w)\bigr)+\log(1/\delta)\bigr]\!/n\).
We define the same kernel as in the proofs of Lemmas 1 and 2. For the Chi-Squared divergence we have \[\begin{align} \chi^{2}(P_X(x)\|Q_X(x)) &\geq \frac{(P(E)-Q(E))^2}{Q(E)(1-Q(E))}. \end{align}\] By rearranging the inequality, we obtain \[\begin{align} {P(E)^2} &\leq Q(E)(1-Q(E))\chi^{2}(P_X\|Q_X)+2P(E)Q(E)-Q(E)^2 \\ &\leq Q(E)(1-Q(E))\chi^{2}(P_X \|Q_X)+2Q(E)-Q(E)^2\\ &\leq Q(E)\chi^{2}(P_X \|Q_X)+2Q(E). \end{align}\] Because both sides are nonnegative, we may take square roots, giving \[\begin{align} P(E) &\leq\left(Q(E)\chi^{2}(P_X \|Q_X)+2Q(E)\right)^{\frac{1}{2}}\\ &=Q(E)^{\frac{1}{2}}\left(\chi^{2}(P_X \|Q_X)+2\right)^{\frac{1}{2}}, \end{align}\] which proves Lemma 3. ◻
We adopt the same joint distributions \(Q_{S, W}\) and \(P_{S, W}\) and define the event \(E\) as in Theorem 4. The posterior \(P_{W|S}(w|s)\) is taken to be the point mass \(\delta(w^*)\), where \(w^*\in\operatorname{argmax}_{w\in \mathcal{W}} \mathrm{KL}(\hat{L}(S,w)||L(w))\).
We start from the inequality in Lemma 2, where two terms, \(\left[1+Q(E)^{1-p}\right]^{-\frac{1}{p}}\) and \(\mathcal{H}^{p}(P(W,S)||Q(W,S))\), need to be upper bounded.
When \(p>1\), the term \(\left[1+Q(E)^{1-p}\right]^{-\frac{1}{p}}\) is increasing in \(Q(E)\), so upper bounding it amounts to upper bounding \(Q(E)\). We again use the test-set bound: \(Q(E)=\sum\limits_wQ_W(w)\mathbb{P}(E_w)\le\delta\sum\limits_wQ_W(w)=\delta\). We also have \[\begin{align} \mathcal{H}^{p}(P(W,S)||Q(W,S)) &=\frac{\sum\limits_{w,s}P(w,s)^pQ(w,s)^{1-p}-1}{p-1}\\ &=\frac{\sum\limits_{s}P_S(s)Q_W(w^*(s))^{1-p}-1}{p-1}\\ &\leq\frac{\left[\left(Q^{1-p}_{min}\sum\limits_{s}P_S(s)\right)-1\right]}{p-1}\\ &=\frac{\left(Q^{1-p}_{min}-1\right)}{p-1}. \end{align}\] Applying the above inequalities, we obtain \[P(E)\leq \left(Q_{min}\right)^\frac{1-p}{p}(1+\delta^{1-p})^{-\frac{1}{p}}.\] Thus we get \[\begin{align} {P}\{E\} = &\mathbb{P}_S\left\{ \sup_{w} \mathrm{KL}(\hat{L}(S, w)||L(w)) \ge \frac{\log \frac{1}{\delta}}{n} \right\}\\ &\le \left(Q_{min}\right)^\frac{1-p}{p}(1+\delta^{1-p})^{-\frac{1}{p}}. \end{align}\] Setting \(\delta' := \left(Q_{min}\right)^\frac{1-p}{p}(1+\delta^{1-p})^{-\frac{1}{p}}\) and expressing \(\delta\) in terms of \(\delta'\), we obtain \[\begin{align} &\mathbb{P}_S \left\{\exists w,\, \mathrm{KL}(\hat{L}(S, w)||L(w)) \geq \frac{\log \left[\frac{1}{\delta'^{p}}(Q_{min})^{1-p}-1\right]}{n(p-1)}\right\}\leq \delta', \end{align}\] or equivalently \[\begin{align} &\mathbb{P}_S \left\{\forall w,\, \mathrm{KL}( \hat{L}(S, w)||L(w)) \leq \frac{\log \left[\frac{1}{\delta'^{p}}(Q_{min})^{1-p}-1\right]}{n(p-1)}\right\}\geq1- \delta'. \end{align}\]