April 14, 2025
Estimating the Kullback–Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao–Blackwellized estimator that is unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially. Additionally, we derive an analogous Rao–Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
https://github.com/rycolab/kl-rb
The Kullback–Leibler [1] divergence is a statistical divergence that quantifies how one probability distribution differs from another. Measuring the KL divergence between probability distributions is a well-established problem that has been studied extensively in the statistics literature [2], [3]. In some special cases, e.g., in the case that we wish to measure the KL divergence between two Gaussian measures, the KL divergence has an analytical solution. However, in the general case, exact computation of the KL divergence is not analytically tractable or approximable with an efficient algorithm [4]. This paper treats the case of computing the KL divergence between two language models (LMs), a fundamental task in natural language processing with numerous practical applications.
The KL divergence plays a central role across multiple applications. In reinforcement learning from human feedback [5]–[7], it is used as a regularization term to constrain the fine-tuned model from drifting too far from a reference model, preserving fluency and preventing reward over-optimization. In interpretability research, KL divergence quantifies how a specific prompt shifts the model distribution by comparing the model’s distributions before and after controlled interventions [8]–[10]. As an evaluation metric, KL divergence is used to assess how well language models approximate target distributions [11], [12]. In knowledge distillation, KL divergence is minimized to align a student model with a teacher model [13].
The above applications demonstrate that measuring the KL divergence between two language models is useful and widespread. However, in the case of neural language models, it is far from straightforward. It is easy to see why: Given an alphabet of symbols \(\alphabet\) and two language models \(\plm\) and \(\qlm\), distributions over \(\kleene{\alphabet}\), the KL divergence between them is given by the following expression:1\(^,\)2 \[\begin{align} \label{eq:kl} \kl(\plm \mid\mid \qlm) \defeq \sum_{\str \in \kleene{\alphabet}} \plm(\str) \log \frac{\plm(\str)}{\qlm(\str)}. \end{align}\tag{1}\] Recalling that \(\kleene{\alphabet}\) is a countably infinite set, we cannot expect, in general, to compute 1 exactly in finite time without additional assumptions.3 While in some very special cases, e.g., where \(\plm\) and \(\qlm\) are deterministic finite-state automata, there exist efficient algorithms [11], [15], [16], we should not expect such an algorithm to exist in the case where \(\plm\) and \(\qlm\) are neural language models, e.g., those based on the transformer [17]–[19]. Thus, most researchers turn to approximation, with Monte Carlo estimation being the most widely used method.
The Monte Carlo (MC) estimator for KL divergence (1 ) involves sampling \(M\) strings \(\yvar^{(1)}, \ldots, \yvar^{(M)} \simIID \plm\) and then averaging \(\log \frac{\plm(\yvar^{(m)})}{\qlm(\yvar^{(m)})}\). Even though this estimator is unbiased, it often exhibits high variance, which means the approximation can be noisy and unreliable.4 More pathologically, the naive MC estimator can result in negative estimates of KL, which may be undesirable in practice.5 To address these issues, practitioners adopt alternative techniques to ensure non-negativity. For example, [20] proposes an unbiased, non-negative KL estimator that is widely used in practice [21], [22]. However, the proposed method, in its original form, does not theoretically yield an estimator with lower variance, and, as we show empirically, can exhibit enormous variance.
In this paper, we derive an improved estimator of KL using Rao–Blackwellization [23]–[26], a well-established variance-reduction technique from statistics.6 This results in an estimator that is provably unbiased and has a variance that is always less than or equal to that of the standard Monte Carlo estimator, while requiring no additional computational overhead. As a point of comparison to our RB estimator, we also provide a comprehensive formal analysis of various existing methods for estimating KL divergence, examining their bias and variance.
We empirically validate our theoretical findings using the sentiment-controlled generation task [29] as a testbed. Specifically, we measure the KL divergence between a GPT-2 model [18] before and after fine-tuning, where the fine-tuning objective is to steer the model toward generating positive movie reviews. Our experimental results confirm that our proposed estimator significantly reduces the variance of the Monte Carlo estimator, yielding the most stable and reliable estimates among all methods studied. In contrast, alternative estimators from the literature fail to achieve meaningful variance reduction, and in some cases, lead to unbounded variance. We further examine how using our derived estimator in the fine-tuning loop of RLHF impacts the downstream performance. Our results suggest that using our Rao–Blackwellized estimator reduces the instability across different RLHF runs. We further look at the Pareto frontier of average rewards achieved by the model vs. its KL divergence with the reference model. We observe that models fine-tuned using the RB estimator appear significantly more often on the Pareto frontier of reward vs. KL compared to the models fine-tuned with the MC estimator.
Let \(\alphabet\) be an alphabet, a finite, non-empty set of symbols. A string is a finite sequence of symbols from \(\alphabet\). Let \(\kleene{\alphabet}\) denote the set of all such strings. A language model \(\plm\) is a distribution over \(\kleene{\alphabet}\). The prefix probability function \(\pprob\) of a prefix \(\prefix \in \alphabet^*\) is \[\begin{align} \pprob(\prefix) \defeq \sum_{\str \in \alphabet^*} \plm(\prefix \str), \end{align}\] which is the cumulative probability of all strings in the language that have \(\prefix\) as their prefix. We denote the conditional prefix probability as \(\pprob(\str \mid \prefix) = \frac{\pprob(\prefix \str) }{ \pprob(\prefix) }\), where we additionally define \(\pprob(\eos \mid \str) \defeq \frac{\plm(\str)}{\pprob(\str)}\). Language models can be factored into the product of distributions using the chain rule of probability, i.e., for any string \(\str = y_1 \cdots y_N \in \alphabet^*\) we can write \[\begin{align} \label{eq:lm} \plm(\str) = \pprob(\eos \mid \str) \prod_{n=1}^N \pprob(y_n \mid \str_{<n}), \end{align}\tag{2}\] where \(\str_{<n} \defeq y_1 \cdots y_{n-1}\) and \(\eos \notin \alphabet\) is a distinguished end-of-string symbol. Let \(\alphabar \defeq \alphabet \cup \{\eos\}\). In 2 , each \(\pprob(\cdot \mid \str_{<n})\) can fruitfully be viewed as a distribution over \(\alphabar\). Despite the overloading of the notation, whether \(\pprob(\cdot \mid \str_{<n})\) refers to a prefix probability, a function which takes an argument from \(\kleene{\alphabet}\), or a distribution over \(\alphabar\) will always be clear from context. Throughout this paper, we use \(\yvar\) to represent the string-valued random variable sampled from \(\plm\). When taking \(M\) i.i.d. samples from \(\plm\), we use \(\yvar^{(m)}\) to denote the \(m^{\text{th}}\) sample.
A simple way to estimate the KL divergence is with the Monte Carlo (MC) estimator, defined as \[\begin{align} \label{eq:klmc} \klmc &= \frac{1}{M} \sum_{m=1}^M \log \frac{\plm(\yvar^{(m)}) }{\qlm(\yvar^{(m)}) } = \frac{1}{M} \sum_{m=1}^M \efunc(\yvar^{(m)}), \end{align}\tag{3}\] where \(\yvar^{(1)}, \ldots, \yvar^{(M)} \simIID \plm\) and \(\efunc(\yvar) \defeq \log \frac{\plm(\yvar) }{\qlm(\yvar) }\). Throughout the paper, we assume that the KL divergence is finite, i.e., \(\kl(\plm \mid\mid \qlm) < \infty\). It is straightforward to show \(\klmc\) is unbiased, i.e., \(\E [\klmc] = \kl(\plm \mid\mid \qlm)\), and that the variance of this estimator is \(\var{\klmc} = \frac{1}{M} \var{ \efunc(\yvar) }\). In 7, we discuss the Horvitz–Thompson estimator, another unbiased KL estimator.
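To make 3 concrete, the following is a minimal PyTorch-style sketch (the function and argument names are illustrative assumptions, not part of our released code), assuming the total log-probabilities of the sampled strings under both models are available:

```python
import torch

def mc_kl_estimate(logp_p: torch.Tensor, logp_q: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(p || q) from M strings sampled from p.

    logp_p[m] = log p(y^(m)) and logp_q[m] = log q(y^(m)) for the m-th sample.
    """
    # f(y^(m)) = log p(y^(m)) - log q(y^(m)); average over the M samples.
    return (logp_p - logp_q).mean()
```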
Note that while the exact KL value is always non-negative, \(\efunc(\yvar)\) may be positive or negative. Consequently, the Monte Carlo estimate \(\klmc\) may also be negative. This happens because the estimate is based on a limited number of samples, and some sample draws can lead to negative values. This can be problematic during RLHF, which depends on the KL divergence being non-negative.
A general approach to reduce estimator variance is through control variates [30]. For KL divergence between language models, this technique was popularized by [20] and is widely used in RLHF libraries [22], [27], [28]. Formally, a control variate is any function \(g\colon \kleene{\alphabet} \to \R\) for which \(\G \defeq \E\Brackets{g(\yvar)}\) can be efficiently computed.7 We define the control variate MC estimator as
\[\begin{align} \label{eq:klcv} \klcv \defeq \frac{1}{M} \sum_{m=1}^M \Big[ \efunc(\yvar^{(m)}) + \alpha \cdot \big(g(\yvar^{(m)}) - \G\big) \Big], \end{align}\]
where \(\alpha \in \R\) is a calibration parameter that must be chosen a priori. The proposition below characterizes the variance of \(\klcv\) as a function of \(\alpha\), which will tell us how to choose \(\alpha\) optimally.
propositioncvbias Consider the control variate MC estimator \(\klcv\) defined in [eq:klcv], and assume that \(\E [g(\yvar)] < \infty\). Then \(\klcv\) is an unbiased estimator, and its variance is given by \[\begin{align} &\var{\klcv} = \frac{\var{ \efunc } + \alpha^2 \var{ g } + 2 \alpha \cov{ \efunc, g }}{M} \label{eq:cv-variance}. \end{align}\tag{4}\]
Proof. See 8. ◻
Assume \(0 < \var{g} < \infty\). It is straightforward to show that \(\alpha^* \defeq -\cov{\efunc,g}/\var{g}\) is the value that minimizes the variance. If we plug \(\alpha^*\) in 4 , and simplify, we see that \[\begin{align} \var{\klcv} = \frac{1}{M}\var{ \efunc } \Big(1 - \corr{ \efunc, g}^2 \Big) \label{eq:cvvariance}, \end{align}\tag{5}\] which directly translates to reducing the variance of the MC estimator. The magnitude of the correlation between \(\efunc\) and \(g\) determines the degree of variance reduction. Note that the value of \(\alpha^*\) may be estimated from a pilot sample when it cannot be computed analytically.8
A specific control variate for KL estimation was proposed by [20], who defined \(g(\yvar) = \frac{\qlm(\yvar)}{\plm(\yvar)}\). Substituting this into [eq:klcv], the MC estimator of the KL divergence with this control variate is \[\begin{align} \label{eq:cvestimator} &\!\!\klcv = \frac{1}{M} \sum_{m=1}^M \log \frac{\plm(\yvar^{(m)})}{\qlm(\yvar^{(m)})} + \alpha \cdot \left(\frac{\qlm(\yvar^{(m)})}{\plm(\yvar^{(m)})} - 1\right). \end{align}\tag{6}\]
This proposal has some notable properties. First, \(\G = \E \left[\frac{\qlm(\yvar)}{\plm(\yvar)} \right] = 1\), meaning \(\G\) is known in advance.9 Second, \(\cov{\efunc,g}\) is zero if and only if \(\plm\) is equal to \(\qlm\) ([proposition:cvcov]), meaning that when the two distributions are not equal, and \(\alpha\) is chosen suitably, we are guaranteed to strictly reduce variance. [20] proposes setting \(\alpha = 1\); the benefit of this suboptimal choice is that the resulting estimator is always non-negative ([proposition:nonnegative-CV]).10 However, setting \(\alpha=1\) will increase variance when \(\alpha^* < \frac{1}{2}\) ([cor:alpha-one-reduction]). Indeed, our experiments (5.1) confirm that \(\alpha=1\) is a poor choice in practice—it is better to estimate \(\alpha^*\)11 to ensure that the control variate is correctly calibrated.
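The following is a minimal sketch (names and the pilot-sample split are illustrative assumptions) of the control variate estimator with \(g(\yvar) = \qlm(\yvar)/\plm(\yvar)\), where \(\alpha^*\) is estimated from a pilot subset and the estimate is then computed on the held-out samples so that the plug-in \(\alpha\) does not bias the result:

```python
import torch

def cv_kl_estimate(logp_p, logp_q, alpha=None, pilot_fraction=0.25):
    """Control-variate KL estimate with g(y) = q(y)/p(y), so E[g] = 1.

    If alpha is None, alpha* = -Cov(f, g) / Var(g) is estimated from a pilot
    subset and the estimate is computed on the remaining samples.
    """
    f = logp_p - logp_q              # f(y) = log p(y)/q(y)
    g = torch.exp(logp_q - logp_p)   # g(y) = q(y)/p(y)
    if alpha is None:
        n_pilot = max(2, int(pilot_fraction * f.numel()))
        fp, gp = f[:n_pilot], g[:n_pilot]
        cov_fg = ((fp - fp.mean()) * (gp - gp.mean())).mean()
        var_g = ((gp - gp.mean()) ** 2).mean()
        alpha = (-cov_fg / var_g).item()
        f, g = f[n_pilot:], g[n_pilot:]
    return (f + alpha * (g - 1.0)).mean()
```

Passing \(\alpha=1\) recovers the non-negative variant of [20].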
In this article, we propose the application of another classical technique to reduce the variance of the Monte Carlo estimation of the KL divergence—Rao–Blackwellization [25], [26]. Despite its standing in the statistics literature, a Rao–Blackwellized Monte Carlo estimator has yet to gain traction in the context of RLHF [21], [22], [27], [28].
We define the Rao–Blackwellized (RB) estimator \(\rbestimator\) as follows: \[\begin{align} \label{eq:rb-estimator} \rbestimator &\defeq \frac{1}{M} \sum_{m=1}^M \sum_{n=1}^{|\yvar^{(m)}|} \kl(\pprob(\cdot \mid \yvar^{(m)}_{< n}) \| \qpprob(\cdot \mid \yvar^{(m)}_{< n})). \end{align}\tag{7}\] The key benefit of this estimator is that we analytically compute the expectation over the \(n\textsuperscript{th}\) symbol in each string rather than relying on the single sampled value at that position.
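The estimator in 7 admits a short implementation. Below is a minimal PyTorch-style sketch (the tensor layout, names, and masking convention are illustrative assumptions): for each sampled string we teacher-force both models, compute the exact next-symbol KL at every position, and sum.

```python
import torch
import torch.nn.functional as F

def rb_kl_estimate(logits_p: torch.Tensor, logits_q: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Rao-Blackwellized KL estimate from M sampled strings.

    logits_p, logits_q: [M, T, V] next-symbol logits of the two models,
        obtained by teacher-forcing each sampled string (V covers the
        vocabulary including EOS).
    mask: [M, T] equal to 1.0 for positions up to and including the EOS
        step of each sample, and 0.0 for padding.
    """
    logp = F.log_softmax(logits_p, dim=-1)
    logq = F.log_softmax(logits_q, dim=-1)
    # Exact per-position KL over the next-symbol distribution,
    # KL(p(. | y_<n) || q(. | y_<n)), rather than a one-sample log-ratio.
    step_kl = (logp.exp() * (logp - logq)).sum(dim=-1)   # [M, T]
    return (step_kl * mask).sum(dim=-1).mean()           # average over samples
```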
The starting point of Rao–Blackwellization is the following inequality involving the conditional variance: \(\var{\E[\estimator \mid \tvar]} \leq \var{\estimator}\), where \(\estimator\) is an unbiased estimator of \(\E[f]\) and \(\tvar\) is a statistic12 for which we can explicitly compute \(\E [\estimator \mid \tvar]\). This technique is often referred to as Rao–Blackwellization because the inequality is associated with the Rao–Blackwell theorem [32].13
Before moving forward, we introduce some convenient notation. Let \(\yvarb^{(m)}\) denote the \(\eos\)-padded version of \(\yvar^{(m)}\). Additionally, we extend the definition of the prefix probabilities \(\pprob\) and \(\qpprob\) to strings containing padding symbols, i.e., \(\str \notin \kleene{\alphabet}\), with the following additional case: \(\pprob(y \mid \str) \defeq \mathbb{1}\{y = \eos \}\), and \(\qpprob(y \mid \str) \defeq \mathbb{1}\{y = \eos \}\).
We now present a proof that our Rao–Blackwellized estimator \(\rbestimator\) is unbiased and indeed does result in a variance reduction. One might wonder why this requires more than a straightforward application of the Rao–Blackwell theorem. The reason is that \(\rbestimator\) does not arise directly from the standard formulation of Rao–Blackwellization. Instead, we apply Rao–Blackwellization separately to each summand, where each summand corresponds to an estimator over strings of a particular length. In general, Rao–Blackwellizing the summands pointwise does not guarantee a reduction in the variance of their sum, since the summands may be correlated. However, in the specific case of \(\rbestimator\), we can prove that the overall variance is reduced, despite Rao–Blackwellizing the summands independently, as we state in the following theorem.
theoremrbinf Suppose the MC estimator \(\klmc\) has finite variance, i.e., \(\var{\klmc} < \infty\). Then the following properties regarding \(\rbestimator\) hold: (i) it is unbiased, i.e., \(\E[\rbestimator] = \kl(\plm \mid\mid \qlm)\), and (ii) its variance is no larger than that of the MC estimator, i.e., \(\var{\rbestimator} \leq \var{\klmc}\).
Proof. See [proof:rbinf] for a detailed proof. Below, we provide a proof sketch so the reader can quickly grasp the structure of the argument, which is broken down into three steps.
Step-wise Estimation. We begin by Rao–Blackwellizing the step-wise MC estimator, defined for any \(n > 0\) as \[\begin{align} \klmcn &\defeq \frac{1}{M} \sum_{m=1}^M \log \frac{\pprob(\Yb_n^{(m)} \mid \yvarb^{(m)}_{<n})}{\qpprob(\Yb_n^{(m)} \mid \yvarb^{(m)}_{<n})}. \end{align}\] Intuitively, \(\klmcn\) measures the average KL of the \(n\textsuperscript{th}\) symbol. The next step is to define \(\tvarn(\yvarb) = \yvarb_{<n}\) and apply Rao–Blackwellization to each \(\klmcn\) as follows: \[\begin{align} \rbestimatorn &\defeq \E_{\yvarb^{(1)}, \ldots, \yvarb^{(M)}}\left[\klmcn \mid \tvarn\right] \\ &= \frac{1}{M} \sum_{m=1}^M \E_{\yvarb^{(m)}}\left[ \log \frac{\pprob(\Yb_n^{(m)} \mid \yvarb^{(m)}_{<n})}{\qpprob(\Yb_n^{(m)} \mid \yvarb^{(m)}_{<n})} \middle| \tvarn(\yvarb^{(m)}) \right] \\ &= \frac{1}{M} \sum_{m=1}^M \kl(\pprob(\cdot \mid \yvarb^{(m)}_{< n}) \| \qpprob(\cdot \mid \yvarb^{(m)}_{< n})).\label{eq:rb} \end{align}\tag{8}\] Now, it is clear from the Rao–Blackwell theorem that \(\rbestimatorn\) is unbiased and provides a variance reduction (i.e., \(\var{\rbestimatorn} \le \var{\klmcn}\)) for all \(n > 0\). Intuitively, the source of the variance reduction in the \(\rbestimatorn\) estimator is that we compute the expectation over the \(n\textsuperscript{th}\) symbol exactly rather than relying on the sampled symbol at that position.
Truncated Estimation. Next, we define \(\klmcN \defeq \sum_{n=1}^N \klmcn\) and \(\rbestimatorN \defeq \sum_{n=1}^N \rbestimatorn\). Intuitively, these estimators target the KL divergence for symbols up to a maximum length of \(N\). In [lemma:rb], we prove that \(\E[\rbestimatorN] = \E[\klmcN]\), and \(\var{\rbestimatorN} \le \var{\klmcN}\) for all \(N > 0\) using the law of total expectation and Jensen’s inequality, where the latter is used in a manner analogous to how it is used in the original Rao–Blackwell theorem.
Complete Estimation. Now, we consider the complete estimator. First, we observe that the truncated MC estimators converge to the MC estimator, i.e., \(\klmc = \lim_{N \rightarrow \infty} \klmcN\) ([lemma:ntoinf]), and analogously, \(\rbestimator = \lim_{N \rightarrow \infty} \rbestimatorN\). Thereby, we are able to show that \(\rbestimator\) is an unbiased estimator of the KL divergence with variance less than or equal to that of \(\klmc\).
◻
Notably, \(\rbestimator\) is guaranteed to be non-negative, a property desired by some when designing estimators for the KL divergence between two language models, as the KL divergence itself is always non-negative (cf. remarks in 2.3). In the case of our Rao–Blackwellized estimator, non-negativity follows from the fact that each step-wise estimator computes the exact KL divergence between the two next-symbol distributions, conditioned on the sampled context \(\yvar_{<n}^{(m)}\). Since all terms in 8 are non-negative, \(\rbestimator\) remains non-negative as well.
Although computing \(\rbestimator\) might seem more expensive than \(\klmc\), the overall runtime is dominated by the cost of forward passes. Because a forward pass already involves producing the full distribution over \(\alphabar\) at each position \(n\), the additional \(\bigo{M N |\alphabar|}\) work required by \(\rbestimator\) is negligible compared to the \(M\) forward passes needed for both \(\klcv\) and \(\klmc\).
KL estimation is essential in many applications, especially in fine-tuning large language models. In reinforcement learning from human feedback (RLHF), for example, the objective includes a KL regularization term to balance reward maximization with staying close to a reference model. Since the language model is a differentiable function of parameters \(\btheta\) optimized via gradient descent, this setup requires computing the gradient of the KL divergence with respect to \(\btheta\): \[\label{eq:klgrad} \klgrad \defeq \nablat \kl(\policy \mid\mid \qlm) = \E \left[\log \frac{\policy(\yvar)}{\qlm(\yvar)} \nablat \log \policy(\yvar)\right].\tag{9}\] Therefore, the Monte Carlo estimator of this gradient is \[\label{eq:mcgrad} \mcgrad = \frac{1}{M} \sum_{m=1}^M \log \frac{\policy(\yvar^{(m)})}{\qlm(\yvar^{(m)})} \nablat \log \policy(\yvar^{(m)}).\tag{10}\] Now, we derive the Rao–Blackwellized Monte Carlo estimator of the gradient. We start with restating Theorem 2.2 in [34], which will prove useful.
Theorem 1 ([34]; Theorem 2.2). Let \(\plm\) and \(\qlm\) be language models over \(\alphabet\). The following equality holds: \[\begin{align} \kl(\plm \mid\mid \qlm) = \sum_{\str \in \alphabet^*} \pprob(\str) \kl\mleft(\pprob \mleft(\cdot \mid \str \mright) \mid\mid \qpprob \mleft(\cdot \mid \str \mright)\mright), \end{align}\] where we treat \(\pprob(\cdot \mid \str)\) and \(\qpprob \mleft(\cdot \mid \str \mright)\) as probability distributions over \(\alphabar\).
We refer the reader to [34] for the proof. Next, to derive the Rao–Blackwellized estimator of the gradient, we take the gradient of the local KL as stated in the following theorem.
theoremrbgrad Let \(\policy\) and \(\qlm\) be two language models over \(\alphabet\) and \(\prefixprob\) the prefix probability function of \(\policy\). Then, the following equality holds \[\label{eq:localklgrad} \nablat \kl(\policy \mid\mid \qlm) = \sum_{\str \in \Sigma^*} \prefixprob(\str) \E_{Y} \Big[ \log \frac{\ppolicy(Y \mid \str)}{\qpprob(Y \mid \str)} \cdot \left( \nablat \log \prefixprob(Y \mid \str) + \nablat \log \prefixprob(\str) \right) \Big].\tag{11}\]
Proof. See 10. ◻
We then construct the Monte Carlo estimator of the gradient using the above theorem, which naturally results in the following unbiased, Rao–Blackwellized Monte Carlo estimator of the gradient: \[\begin{align} \label{eq:rbgrad} \gradrb &= \frac{1}{M}\sum_{m=1}^M \sum_{n=1}^{|\yvar^{(m)}|} \E_{Y} \Big[\log \frac{\ppolicy(Y \mid \yvar^{(m)}_{<n})}{\qpprob(Y \mid \yvar^{(m)}_{<n})} \cdot \left( \nablat \log \prefixprob(\yvar^{(m)}_{<n}) + \nablat \log \prefixprob(Y \mid \yvar^{(m)}_{<n}) \right) \Big]. \end{align}\tag{12}\] The above estimator is unbiased, as it is the MC estimator of 11 .
theoremgradvar Assuming \(\var{\gradrb} < \infty, \var{\mcgrad} < \infty\), we have \[\E \left[ \left\lVert \gradrb - \klgrad \right\rVert^2 \right] \leq \E \left[ \left\lVert \mcgrad - \klgrad \right\rVert^2 \right].\]
Proof. See 10.1. ◻
So far, we have discussed how to estimate the gradient of \(\kl(\policy \mid\mid \qlm)\) using samples drawn from the current policy \(\policy\). Crucially, we derive the gradient manually rather than relying on automatic differentiation because the samples depend on \(\btheta\) through \(\policy\). However, in practice and for efficiency reasons, we often collect large batches of samples in parallel with the optimization loop. As a result, these samples are generated from a slightly outdated version of the policy, denoted \(\oldpolicy\). To compute the KL divergence using samples from \(\oldpolicy\), we first write the KL as the expectation under \(\oldpolicy\) as \[\begin{align} \kl(\policy \mid \mid \qlm) = \E_{\yvar \sim \oldpolicy} \left[\frac{\policy(\yvar)}{\oldpolicy(\yvar)} \log \frac{\policy(\yvar)}{\qlm(\yvar)} \right]. \end{align}\] Therefore, the MC estimator using samples \(\yvar^{(1)}, \ldots, \yvar^{(M)} \simIID \oldpolicy\) is \[\begin{align} \klmcold = \frac{1}{M} \sum_{m=1}^M \frac{\policy(\yvar^{(m)})}{\oldpolicy(\yvar^{(m)})} \log \frac{\policy(\yvar^{(m)})}{\qlm(\yvar^{(m)})}. \end{align}\] Given the unbiasedness proof of the Rao–Blackwellization in [theorem:rbinf], we can similarly write \[\kl(\policy \mid \mid \qlm) = \E_{\yvar \sim \oldpolicy} \Bigg[\frac{\policy(\yvar)}{\oldpolicy(\yvar)} \sum_{n=1}^{|\yvar|} \E_{Y} \left[\log \frac{\prefixprob(Y \mid \yvar_{<n})}{\qpprob(Y \mid \yvar_{<n})}\right] \Bigg].\] Therefore, the Rao–Blackwellized MC estimator using samples \(\yvar^{(1)}, \ldots, \yvar^{(M)} \simIID \oldpolicy\) is \[\rbestimatorold = \frac{1}{M} \sum_{m=1}^M \frac{\policy(\yvar^{(m)})}{\oldpolicy(\yvar^{(m)})} \sum_{n=1}^{|\yvar^{(m)}|} \E_{Y} \left[\log \frac{\ppolicy(Y \mid \yvar_{<n}^{(m)})}{\qpprob(Y \mid \yvar_{<n}^{(m)})}\right].\] Since \(\klmcold\) and \(\rbestimatorold\) use samples from the old policy that does not depend on \(\btheta\), we can apply automatic differentiation to compute the estimate of the KL gradient by computing the gradient of \(\klmcold\) and \(\rbestimatorold\).
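As an illustration of this last point, here is a minimal PyTorch-style sketch (tensor names and layout are illustrative assumptions) of \(\rbestimatorold\): the per-position exact KLs are weighted by the sequence-level ratio \(\policy/\oldpolicy\), and since the samples come from the fixed \(\oldpolicy\), automatic differentiation of the resulting scalar yields an estimate of the KL gradient.

```python
import torch
import torch.nn.functional as F

def rb_kl_old_policy(logits_policy, logits_ref, logp_policy_seq, logp_old_seq, mask):
    """Differentiable RB estimate of KL(pi_theta || q) from old-policy samples.

    logits_policy: [M, T, V] next-symbol logits of pi_theta (gradients attached).
    logits_ref:    [M, T, V] next-symbol logits of the reference model q.
    logp_policy_seq: [M] log pi_theta(y^(m)) (gradients attached).
    logp_old_seq:    [M] log pi_old(y^(m)), treated as a constant.
    mask: [M, T] 1.0 up to and including the EOS step, 0.0 for padding.
    """
    logp = F.log_softmax(logits_policy, dim=-1)
    logq = F.log_softmax(logits_ref, dim=-1)
    step_kl = (logp.exp() * (logp - logq)).sum(dim=-1)    # [M, T]
    seq_kl = (step_kl * mask).sum(dim=-1)                 # [M]
    ratio = torch.exp(logp_policy_seq - logp_old_seq)     # pi_theta / pi_old
    # Both the ratio and the per-step KLs depend on theta, so calling
    # .backward() on this scalar yields the estimated KL gradient.
    return (ratio * seq_kl).mean()
```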
We use the sentiment control task as the testbed to empirically evaluate our theoretical findings on the KL estimators. Concretely, the reference model, denoted as \(\qlm\), is the GPT-IMDB14 model, i.e., a GPT-2 [18] model fine-tuned on the IMDB corpus [35]. The goal of the task is to fine-tune this language model such that the samples from it are movie reviews with a
positive sentiment. The fine-tuned language model is denoted with \(\policy\). In the following experiments, we estimate the KL divergence between \(\policy\) and \(\qlm\). We provide a code snippet for implementing the RB estimator in 12.1.
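For reference, the following self-contained sketch (distinct from the snippet in 12.1; the checkpoints, prompt, and generation settings are placeholders) shows how the prompt-conditional RB estimate can be computed for two Hugging Face causal language models by teacher-forcing the sampled continuation through both models:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder checkpoints:
policy = AutoModelForCausalLM.from_pretrained("gpt2")  # fine-tuned policy
ref = AutoModelForCausalLM.from_pretrained("gpt2")     # reference model

prompt = tok("The movie was", return_tensors="pt").input_ids
with torch.no_grad():
    # Sample a continuation from the policy, then teacher-force the full
    # sequence through both models to get next-token distributions.
    sample = policy.generate(prompt, do_sample=True, max_new_tokens=32,
                             pad_token_id=tok.eos_token_id)
    logp = F.log_softmax(policy(sample).logits[:, :-1], dim=-1)
    logq = F.log_softmax(ref(sample).logits[:, :-1], dim=-1)
    # Exact per-position KL, summed over the generated positions only.
    step_kl = (logp.exp() * (logp - logq)).sum(-1)     # [1, T-1]
    rb_estimate = step_kl[:, prompt.shape[1] - 1:].sum(-1).mean()
print(float(rb_estimate))
```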
In this experiment, we empirically evaluate the bias, variance, and consistency of various KL estimators. To obtain \(\policy\), we fine-tune \(\qlm\) with direct preference optimization [29] on a sample of 5,000 data points from the IMDB training set. To create the preference data required for DPO training, following [29], we sample \(4\) responses for each prompt and create \(6\) pairs per prompt.
| Estimator | \(M=1\) | \(M=5\) | \(M=10\) |
|---|---|---|---|
| \(\klmc\) | \(6.76 \pm 0.16\) | \(6.76 \pm 0.07\) | \(6.76 \pm 0.05\) |
| \(\klht\) | \(6.76 \pm 0.16\) | \(6.76 \pm 0.07\) | \(6.76 \pm 0.05\) |
| \(\klcvo\) | \(6.28 \pm 2.54\) | \(6.28 \pm 1.13\) | \(6.28 \pm 0.79\) |
| \(\klcv\) | \(6.76 \pm 0.16\) | \(6.76 \pm 0.07\) | \(6.76 \pm 0.05\) |
| \(\rbestimator\) | \(6.76 \pm \mathbf{0.11}\) | \(6.76 \pm \mathbf{0.05}\) | \(6.76 \pm \mathbf{0.03}\) |
To determine the preferred response in each pair, we employ a binary sentiment classifier,15 selecting the response with the higher probability of positive sentiment. Upon successful fine-tuning, \(\policy\) should assign a higher probability mass to movie reviews with positive sentiment while maintaining a low KL divergence with \(\qlm\). We then evaluate this KL divergence using our estimators to assess their reliability in measuring distributional shifts induced by fine-tuning.
We evaluate on 512 examples from the IMDB dataset. For each review, we randomly select a prefix length between 2 and 8 tokens and use it as the prompt. We then generate 4000 samples from \(\policy\) for each prompt. Using these samples, we compute the MC, control variate (CV), and Rao–Blackwellized (RB) estimators and estimate their standard deviations. We also implement the Horvitz–Thompson (HT) estimator; see 7 for details. The CV estimator, \(\klcv\), is computed twice: once using the optimal \(\alpha\) estimated from 1000 samples, and once with \(\alpha = 1\) to match the setup in [20].
In 1, we report the expected KL estimate along with the empirical standard deviation of different estimators evaluated at sample sizes \(M=1, 5, 10\). To obtain these estimates, we compute each estimator using \(M\) samples, repeating the process \(4000/M\) times to estimate both the expected value and the standard deviation of the estimates. Our findings confirm that all estimators except one (\(\klcv\) with \(\alpha=1\)) are unbiased and report an expected KL divergence of \(6.76\). We also observe that the CV estimators fail to significantly reduce the variance of the standard MC estimator. Importantly, the RB estimator achieves the lowest standard deviation and offers a more robust estimate compared to the MC estimator. Interestingly, we observe that the \(\klcv\) estimator exhibits a noticeable bias and high standard deviation when \(\alpha=1\), i.e., when it is not set to its optimal value. The bias arises from numerical instability during the computation of the ratio \(g(\yvar) = \frac{\qlm(\yvar)}{\plm(\yvar)}\). The high variance is due to large values of \(\var{g(\yvar)}\). Specifically, for certain prompts, \(\var{g(\yvar)}\) can be unbounded. We visualize the estimates for \(3\) example prompts in 11.
Since the robustness of the estimators depends on the choice of the prompt, we further analyze their estimated standard deviations across all prompts. 3 presents a box plot of standard deviations (in log scale) for each estimator. The \(\klcv\) estimator with \(\alpha=1\) shows significant instability for certain prompts, with numerous outliers indicating high variance. In contrast, the \(\klmc\) and the standard \(\klcv\) estimators exhibit comparable standard deviations. Notably, the Rao–Blackwellized estimator consistently achieves the lowest standard deviation, suggesting that it provides the most stable estimates.
A key application of KL estimation is in the RLHF training loop. From the previous experiment, we observed that the RB estimator significantly reduces the standard deviation of the MC estimator. Therefore, it is natural to ask how this affects RLHF performance when this estimator is used in the training loop. The RLHF objective consists of two terms: (i) the expected rewards for samples generated by the language model \(\policy\), which in this case is the samples’ score under a sentiment classifier16, and (ii) the negative KL divergence between the fine-tuned model \(\policy\) and the reference model \(\qlm\), which represents the language model before fine-tuning.
We compare the MC and RB estimators for computing the gradient of the KL divergence term in the RLHF training loop. We use the RL algorithm17 proposed by [36],18 which is an improved version of the REINFORCE algorithm [37].19
First, we empirically test [theorem:gradvar] by measuring the variance of the gradient norm. We sample \(40\) prompts and compute the gradient of the KL divergence with respect to the model parameters using both the MC and RB estimators. To estimate variance empirically, we repeat this process \(5\) times. Both estimators are evaluated on the same prompts and model initializations to ensure a fair comparison. We find that the variance of the gradient norm estimated with the MC estimator is \(59.90\), whereas with the RB estimator it is \(45.44\). This corresponds to a \(24.6\%\) reduction in variance, providing direct empirical evidence consistent with our theoretical motivation.
We then proceed with using both estimators in RL fine-tuning. We track two metrics throughout fine-tuning: (i) the average reward associated with samples drawn from \(\policy\), and (ii) the KL divergence between \(\policy\) and \(\qlm\). The results are visualized in 4. The purple trace represents the training run where the \(\rbestimator\) is used in the optimization loop to estimate the gradient of the KL divergence, while the blue trace represents the run using \(\klmc\). The x-axis denotes the fine-tuning step, with the left plot showing the evolution of the average reward and the right plot displaying the KL divergence between \(\policy\) and \(\qlm\) over the course of fine-tuning. We repeat each experiment \(5\) times and report the mean and standard deviation of each metric. Notably, the KL values in the right plot are estimated using the RB estimator. However, we observe the same overall trend when using the MC estimator for evaluation.
As illustrated in 4, the performance of the models trained using the MC estimator varies significantly across the \(5\) experiments, resulting in a large standard deviation in both average rewards and the KL divergence. In contrast, the RB estimator consistently achieves high rewards and reasonable KL values across all runs. This observation suggests that the RB estimator makes RLHF runs more stable.
Finally, we vary the KL coefficient, \(\beta\), in the \([0.01, 0.1]\) range and fine-tune \(18\) models with each estimator. For each estimator, we plot the Pareto frontier of average rewards versus KL divergence in 5, displaying models that do not appear on the Pareto front with reduced opacity. Overall, we find that fine-tuning with the RB estimator is more effective at achieving high rewards while maintaining a low KL divergence from the reference model. To quantify this effect, we compute the fraction of RB fine-tuned models that appear on the overall Pareto front—i.e., the frontier obtained when considering all models fine-tuned with either estimator. We then conduct a permutation test and report the results in 5. We find that \(78\%\) of the points on the overall Pareto front come from RB fine-tuned models. Restricting to models with KL values below \(5\), this fraction rises to \(95\%\), with both results being statistically significant.
In this paper, we study the problem of estimating the KL divergence between language models. We provide a comprehensive formal analysis of various KL estimators, with a focus on their bias and variance. We introduce the RB estimator, which is provably unbiased and has variance at most equal to that of the standard MC estimator. This estimator applies the well-known Rao–Blackwellization technique to reduce the variance of the standard MC method. Our empirical results show that the RB estimator significantly reduces the variance compared to the MC estimator, while other estimators fail to achieve meaningful variance reduction or, in some cases, suffer from unbounded variance. Additionally, we find that using our proposed RB estimator makes RLHF more stable and produces models that more frequently lie on the Pareto frontier of reward versus KL, compared to models fine-tuned with the MC estimator.
In this paper, we investigate the fundamental problem of estimating KL divergence between language models. One key application of KL estimation is in RLHF, which aims to enhance fluency while aligning language models with user preferences. However, RLHF can also be misused by bad actors to optimize models for generating misleading, biased, or harmful content. While our work provides a deeper understanding of KL estimation techniques, it is purely foundational research and does not introduce new risks or directly contribute to harmful applications.
In our RLHF experiments, evaluating the variance of our estimator and comparing it to existing methods requires training a large number of models. For instance, the significance test in 5.2 involves training 36 models. Due to limited computational resources, we used the controlled-generation task as a testbed. Given the strength of both our theoretical and empirical results, we hope future work will adopt the Rao–Blackwellized estimator and apply it to larger language models and a wider variety of RL-based approaches to LM alignment.
We thank Ahmad Beirami and Cristina Pinneri for the insightful discussions throughout the course of this project. We also thank Alexander K. Lew for the valuable feedback on a draft of this paper. Afra Amini is supported by the ETH AI Center doctoral fellowship.
When estimating the KL divergence between language models \(\plm\) and \(\qlm\), we have access not only to samples from \(\plm\) but also to the probability assigned by \(\plm\) to any string \(\str\). This enables the use of a more informed estimator, which leverages these probabilities during its construction. Notably, this estimator is a specific instance of the Horvitz–Thompson estimator [38], defined as
\[\begin{align} \label{eq:htestimator} \klht = \sum_{\str \in \sampleset} \frac{\plm(\str)}{\inclusion(\str)} \log \frac{\plm(\str)}{\qlm(\str)} = \sum_{\str \in \sampleset} \frac{\plm(\str)}{\inclusion(\str)} \efunc(\str), \end{align}\]
where \(\sampleset\) is the random variable representing the set of all sampled strings. Any sampling design can be specified to generate the elements of \(\sampleset\). The inclusion probability, denoted by \(\inclusion(\str)\), is the probability that a particular string \(\str\) is included in \(\sampleset\), or equivalently, \(\E \left[\mathbb{1} \{\str \in \sampleset\} \right]\).
propositionhtbias \(\klht\) is an unbiased estimator of the KL divergence, i.e., \(\E\big[\klht\big] = \kl(\plm \mid\mid \qlm)\).
Proof. The bias of the estimator is as follows: \[\begin{align} & \E_{\sampleset}\Big[\klht\Big] - \kl(\plm \mid\mid \qlm) \nonumber\\ &= \E_{\sampleset}\Big[\sum_{\str \in \sampleset} \frac{\plm(\str)}{\inclusion(\str)} \log \frac{\plm(\str) }{\qlm(\str)}\Big] - \kl(\plm \mid\mid \qlm) &\justification{definition of \klht}\\ &= \E_{\sampleset}\Big[\sum_{\str \in \kleene{\alphabet}} \frac{\plm(\str)}{\inclusion(\str)} \log \frac{\plm(\str) }{\qlm(\str)} \, \cdot \, \mathbb{1}\{\str \in \sampleset\}\Big] - \kl(\plm \mid\mid \qlm) \\ &= \sum_{\str \in \kleene{\alphabet}} \frac{\plm(\str)}{\inclusion(\str)} \log \frac{\plm(\str) }{\qlm(\str)} \, \cdot \, \E_{\sampleset}\Big[\mathbb{1}\{\str \in \sampleset\}\Big] - \kl(\plm \mid\mid \qlm) &\justification{linearity of expectation}\\ &= \sum_{\str \in \kleene{\alphabet}} \frac{\plm(\str)}{\cancel{\inclusion(\str)}} \log \frac{\plm(\str) }{\qlm(\str)} \, \cdot \, \cancel{\inclusion(\str)} - \kl(\plm \mid\mid \qlm) &\justification{definition of \inclusion(\str)}\\ &= 0. \end{align}\] ◻
Similar to the MC estimator, the HT estimator does not necessarily return a non-negative estimate of the KL. In principle, however, we should prefer [eq:htestimator] to 3 because it exploits more information—namely, the knowledge of \(\plm\). Whether the HT estimator yields lower variance than the MC estimator depends on the sampling design used to construct \(\sampleset\). In our experiments in 11, we used the sampling-with-replacement design, where \(\inclusion(\str) = 1 - \left(1 - \plm(\str) \right)^M\). Compared to the MC estimator, we observed no significant reduction in variance in our experiments.
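Below is a minimal sketch (names are illustrative assumptions) of the HT estimate under the sampling-with-replacement design, computed from the unique strings among \(M\) i.i.d. samples from \(\plm\):

```python
import torch

def ht_kl_estimate(logp_p: torch.Tensor, logp_q: torch.Tensor, M: int) -> torch.Tensor:
    """Horvitz-Thompson KL estimate under sampling with replacement.

    logp_p, logp_q: [K] log-probabilities of the K *unique* strings that
        appear among the M i.i.d. samples drawn from p.
    """
    p = logp_p.exp()
    # Inclusion probability pi(y) = 1 - (1 - p(y))^M; for very small p(y),
    # a numerically stabler form such as -expm1(M * log1p(-p)) may be needed.
    incl = 1.0 - (1.0 - p) ** M
    return (p / incl * (logp_p - logp_q)).sum()
```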
Proof. Recall that the control variate Monte Carlo estimator is defined as in [eq:klcv]. Note that \(\{\boldsymbol{Y}^{(m)}\}_{m=1}^M \simIID \plm\). First, we look at the expected value of the estimator: \[\begin{align} &\E\Brackets{ \klcv } = \frac{1}{M} \sum_{m=1}^M \E\Brackets{ f(\yvar^{(m)}) } + \alpha \cdot \underbrace{(\E\Brackets{g(\yvar^{(m)})} - \G)}_{=0} = \E\Brackets{ f }. \end{align}\] Therefore, it is unbiased. Next, we manipulate the variance: \[\begin{align} &\var{ \klcv } = \frac{1}{M} \var{ f + \alpha \cdot (g - \G) } = \frac{1}{M} \var{ f } + \frac{\alpha^2}{M} \var{ g } + \frac{2 \alpha}{M} \cov{ f, g }, \end{align}\] where the second equality above stems from well-known variance identities.20 ◻
Provided that the target KL divergence is finite, the estimator \(\klcv\) with \(\alpha=1\) is nonnegative with probability one. Furthermore, \(\alpha=1\) is the only value for which we can guarantee nonnegativity of \(\klcv\).
Proof. To prove that \(\klcv\) is nonnegative with probability one when \(\alpha=1\), we will prove that each term \(\log \frac{\plm(\yvar^{(m)})}{\qlm(\yvar^{(m)})} + \alpha \cdot \left(\frac{\qlm(\yvar^{(m)})}{\plm(\yvar^{(m)})} - 1\right)\) in the summation that defines \(\klcv\) is nonnegative. Let \(r = \frac{\qlm(\yvar^{(m)})}{\plm(\yvar^{(m)})}\). We identify \(\alpha\) for which \(\log \frac{1}{r} + \alpha \cdot (r - 1) \ge 0\) holds for all \(r > 0\) (note that \(r > 0\) whenever this term is finite). Equivalently, we want to find \(\alpha\) values for which: \[\begin{align} \inf_{r > 0} -\log r + \alpha \cdot (r - 1) \ge 0. \label{eq:min-over-r-step} \end{align}\] Next, we prove that \(\alpha=1\) is the only value that satisfies the inequality [eq:min-over-r-step]. To see why, consider the first-order optimality conditions for the minimization over \(r\). These conditions are necessary and sufficient as \(-\log r + \alpha \cdot (r - 1)\) is convex for all \(\alpha\) over the region \(r > 0\). Solving for \(r\) such that the derivative is zero gives us \[\begin{aligned} 0 &= \frac{\partial}{\partial r} \left[ -\log r + \alpha \cdot (r - 1) \right] \\ \iff 0 &= -1/r + \alpha \\ \iff r &= 1/\alpha. \end{aligned}\] Plugging in that value allows us to simplify away the minimization: \[\begin{align} 0 &\le \inf_{r > 0} -\log r + \alpha \cdot (r - 1) \\ &= -\log(1/\alpha) + \alpha \cdot (1/\alpha - 1) \\ &= \log(\alpha) + 1 - \alpha. \end{align}\] For each term in \(\klcv\) to be nonnegative, we need \(\log(\alpha) + 1 - \alpha \ge 0\). However, the function \(\log(\alpha) + 1 - \alpha\) is strictly concave and nonpositive for all \(\alpha > 0\), attaining zero only at \(\alpha = 1\). Thus, when \(\alpha=1\) every term in \(\klcv\) is nonnegative. When \(\alpha \ne 1\), some terms may be negative, leading to the possibility that \(\klcv\) may be negative. ◻
propositioncvcov Given the random variable \(\yvar \sim \plm\), let \(\efunc(\yvar) = \log \frac{\plm(\yvar)}{\qlm(\yvar)}\) and \(g(\yvar) = \frac{\qlm(\yvar)}{\plm(\yvar)}\). We prove \(\cov{\efunc, g}\) is zero if and only if \(\plm = \qlm\).
Proof. \[\begin{align} \cov{\efunc, g} &= \E\left[\left(\log \frac{\plm(\yvar)}{\qlm(\yvar)} - \E \left[\log \frac{\plm(\yvar)}{\qlm(\yvar)} \right] \right) \left(\frac{\qlm(\yvar)}{\plm(\yvar)} - \underbrace{\E \left[\frac{\qlm(\yvar)}{\plm(\yvar)} \right]}_{=1} \right) \right] \\ &= \E\left[\frac{\qlm(\yvar)}{\plm(\yvar)} \log \frac{\plm(\yvar)}{\qlm(\yvar)} - \cancel{\log \frac{\plm(\yvar)}{\qlm(\yvar)}} - \E \left[\log \frac{\plm(\yvar)}{\qlm(\yvar)} \right] \frac{\qlm(\yvar)}{\plm(\yvar)} + \cancel{\E \left[\log \frac{\plm(\yvar)}{\qlm(\yvar)} \right]} \right] \\ &= \E\left[\frac{\qlm(\yvar)}{\plm(\yvar)} \log \frac{\plm(\yvar)}{\qlm(\yvar)} \right] - \E\left[\log \frac{\plm(\yvar)}{\qlm(\yvar)} \right]\underbrace{\E\left[\frac{\qlm(\yvar)}{\plm(\yvar)} \right]}_{=1} \\ &= \E\left[-\frac{\qlm(\yvar)}{\plm(\yvar)} \log \frac{\qlm(\yvar)}{\plm(\yvar)} \right] - \E\left[\log \frac{\plm(\yvar)}{\qlm(\yvar)} \right] \\ &= -\kl(\qlm \mid\mid \plm) - \kl(\plm \mid\mid \qlm). \end{align}\] Since each KL divergence is non-negative and equals zero if and only if the two distributions coincide almost everywhere, we have \[\begin{align} \cov{\efunc, g} = 0 \iff -\kl(\qlm \mid\mid \plm) - \kl(\plm \mid\mid \qlm) = 0 \iff \plm = \qlm. \end{align}\] ◻
\[\begin{align} \var{\klcv^{(\alpha)}} \le \var{\klmc\vphantom{^{(\alpha)}}} \Longleftrightarrow \alpha \in [\min(0, 2 \alpha^*), \max(0, 2 \alpha^*)] \label{eq:safe-alpha-interval} \end{align}\tag{14}\]
Proof. We first establish conditions on \(\alpha\) under which \(\var{\klcv}\) does not exceed \(\var{\klmc}\). In other words, we seek conditions on \(\alpha\) such that the following inequality holds: \[\begin{align} \var{\klcv^{(\alpha)}} &\le \var{\klmc\vphantom{^{(\alpha)}}} \\ \frac{\var{ \efunc } + \alpha^2 \var{ g } + 2 \alpha \cov{ \efunc, g }}{M} &\le \frac{\var{ \efunc }}{M} \\ \var{ \efunc } + \alpha^2 \var{ g } + 2 \alpha \cov{ \efunc, g } &\le \var{ \efunc } \\ \alpha^2 \var{ g } + 2 \alpha \cov{ \efunc, g } &\le 0. \end{align}\] Observe that the function on the left-hand side is quadratic in \(\alpha\), and, moreover, it is convex because \(\var{ g } \ge 0\). The minimizer of this quadratic is \(\alpha^* = -\frac{\cov{ \efunc, g }}{\var{g}}\). We can use the quadratic formula to identify the two values of \(\alpha\) where it equals \(0\), i.e., \(\alpha \in \{ 0, 2 \alpha^* \}\). Now, because the quadratic is convex, we have that it is \(\le 0\) for values of \(\alpha \in [\min(0, 2 \alpha^*), \max(0, 2 \alpha^*)]\). Note that \(\alpha^*\) can be positive or negative; hence the \(\min\) and \(\max\). ◻
When does [20]’s suboptimal choice of \(\alpha=1\) not hurt estimator variance? To answer this question, we substitute \(\alpha=1\) into the variance of \(\klcv\) given in 4 , which yields \[\var{\klcv} = \frac{\var{ \efunc } + \var{ g } + 2 \cov{ \efunc, g }}{M}.\] Therefore, for \(\var{\klcv} \leq \var{\klmc}\), we must have \[\begin{align} \frac{\var{ \efunc } + \var{ g } + 2 \cov{ \efunc, g }}{M} &\le \frac{\var{ \efunc }}{M} \\ \var{ \efunc } + \var{ g } + 2 \cov{ \efunc, g } &\le \var{ \efunc } \\ \var{ g } + 2 \cov{ \efunc, g } &\le 0 \\ -\frac{\cov{ \efunc, g }}{\var{ g }} &\ge \frac{1}{2} \\ \alpha^* &\ge \frac{1}{2}. \end{align}\] Therefore, choosing \(\alpha=1\) does not hurt the variance if \(\alpha^* \ge \frac{1}{2}\), but does otherwise.
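For concreteness, consider a small worked example with illustrative numbers (not taken from our experiments): suppose \(\var{g} = 1\) and \(\cov{\efunc, g} = -0.3\), so \(\alpha^* = -\cov{\efunc,g}/\var{g} = 0.3\). The variance-safe interval from 14 is \([0, 2\alpha^*] = [0, 0.6]\), so \(\alpha = 1\) lies outside it: plugging \(\alpha=1\) into 4 adds \(\var{g} + 2\cov{\efunc, g} = 1 - 0.6 = 0.4\) to the numerator, strictly increasing the variance relative to \(\klmc\), whereas \(\alpha^* = 0.3\) subtracts \(\cov{\efunc,g}^2/\var{g} = 0.09\) from it.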
lemmarb With regard to the estimator \(\rbestimatorN\), the following two properties hold for all \(N > 0\)
\(\E[\rbestimatorN] = \E[\klmcN]\)
\(\var{\rbestimatorN} \leq \var{\klmcN}\)
Proof. In the proof below, we consider the case where \(M=1\). The proof easily generalizes to \(M > 1\) using the i.i.d. assumption. Additionally, we write \(\rbestimatorN(\yvarb)\), rather than suppressing the argument to the estimator as we do in the main text. We begin by proving the unbiasedness property of the estimator. \[\begin{align} \E[\rbestimatorN] &= \E_{\yvarb'} \Big[\sum_{n=1}^N \E_{\yvarb} \big[ \klmcn(\yvarb) \mid \yvarb_{<n} = \yvarb'_{<n} \big] \Big] &\justification{definition of \rbestimatorN} \\ &= \sum_{n=1}^N \E_{\yvarb'} \Big[ \E_{\yvarb} \big[ \klmcn(\yvarb) \mid \yvarb_{<n} = \yvarb'_{<n} \big] \Big] &\justification{linearity of expectation}\\ &= \sum_{n=1}^N \E_{\yvarb} \left[ \klmcn(\yvarb) \right] &\justification{law of total expectation}\\ &= \E_{\yvarb} \left[ \sum_{n=1}^N \klmcn(\yvarb) \right] &\justification{linearity of expectation}\\ &= \E[\klmcN] &\justification{definition of \klmcN} \end{align}\] Next, we prove the variance-reduction property: \[\begin{align} &\var{\rbestimatorN} \nonumber \\ &= \E_{\yvarb'} \left[\left(\sum_{n=1}^N \E_{\yvarb} \left[\klmcn(\yvarb) \,\middle|\, \yvarb_{<n} = \yvarb'_{<n}\right] \right)^2\right] - \E[\klmcN]^2 \\ &\substack{\justification{definition of \rbestimatorN and unbiasedness}} \nonumber\\ & = \E_{\yvarb'} \left[\left(\sum_{n=1}^N \E_{\yvarb^n} \left[\klmcn(\yvarb^n) \,\middle|\, \yvarb^n_{<n} = \yvarb'_{<n}\right] \right)^2\right] - \E[\rbestimatorN]^2 \\&\substack{\justification{each \yvarb^n is distributed independently and identically to \yvarb}}\nonumber \\ &= \E_{\yvarb'} \left[ \left(\E_{\yvarb^1} \left[\cdots \E_{\yvarb^N} \left[ \sum_{n=1}^N \klmcn(\yvarb^n) \,\middle|\, \yvarb^N_{<N} = \yvarb'_{<N}\right] \cdots \,\middle|\, \yvarb^1_{<1} = \yvarb'_{<1}\right] \right)^2\right] - \E[\klmcN]^2 \\ &\substack{\justification{linearity of expectation}} \nonumber\\ &\leq \E_{\yvarb'} \left[\E_{\yvarb^1} \left[\cdots \E_{\yvarb^N}\left[\left(\sum_{n=1}^N \klmcn(\yvarb^n) \right)^2 \,\middle|\, \yvarb^N_{<N} = \yvarb'_{<N}\right] \cdots \,\middle|\, \yvarb^1_{<1} = \yvarb'_{<1}\right]\right] - \E[\klmcN]^2 \\ &\substack{\justification{Jensen's inequality}} \nonumber\\ &= \E_{\yvarb^1} \left[\cdots \E_{\yvarb^N}\left[\left(\sum_{n=1}^N \klmcn(\yvarb^n) \right)^2 \right] \right] - \E[\klmcN]^2 \\&\substack{\justification{law of total expectation}} \nonumber\\ &= \E_{\yvarb} \left[\left(\sum_{n=1}^N \klmcn(\yvarb) \right)^2\right] - \E[\klmcN]^2 \\&\substack{\justification{\{\yvarb^n\}_{n=1}^N are i.i.d.}}\nonumber \\ &= \E_{\yvarb} \left[\left(\klmcN(\yvarb) \right)^2\right] - \E[\klmcN]^2 \\&\substack{\justification{definition of \klmcN}} \nonumber\\ &= \var{\klmcN} \\ &\substack{\justification{definition of variance}} \nonumber \end{align}\] ◻
lemmantoinf \[\begin{align} \sum_{n=1}^\infty \klmcn = \klmc \end{align}\]
Proof. \[\begin{align} \klmc &= \frac{1}{M}\sum_{m=1}^M \log \frac{\plm(\yvar^{(m)})}{\qlm(\yvar^{(m)})} \\ &= \frac{1}{M}\sum_{m=1}^M \log \left[ \frac{\pprob(\eos \mid \yvar^{(m)})}{\qpprob(\eos \mid \yvar^{(m)})} \prod_{n=1}^{|\yvar^{(m)}|} \frac{\pprob(Y^{(m)}_n \mid \yvar^{(m)}_{<n})}{\qpprob(Y^{(m)}_n \mid \yvar^{(m)}_{<n})} \right] \tag{15}\\ &= \frac{1}{M}\sum_{m=1}^M \log \prod_{n=1}^{\infty} \frac{\pprob(\Yb^{(m)}_n \mid \yvarb^{(m)}_{<n})}{\qpprob(\Yb^{(m)}_n \mid \yvarb^{(m)}_{<n})} \tag{16}\\ &= \frac{1}{M}\sum_{m=1}^M \log \lim_{N \rightarrow \infty}\prod_{n=1}^{N} \frac{\pprob(\Yb^{(m)}_n \mid \yvarb^{(m)}_{<n})}{\qpprob(\Yb^{(m)}_n \mid \yvarb^{(m)}_{<n})} \\ &= \lim_{N \rightarrow \infty} \frac{1}{M}\sum_{m=1}^M \log \prod_{n=1}^{N} \frac{\pprob(\Yb^{(m)}_n \mid \yvarb^{(m)}_{<n})}{\qpprob(\Yb^{(m)}_n \mid \yvarb^{(m)}_{<n})} \\ &= \lim_{N \rightarrow \infty} \frac{1}{M}\sum_{m=1}^M \log \frac{\pprob(\yvarb^{(m)}_{\leq N})}{\qpprob(\yvarb^{(m)}_{\leq N})} \\ &= \lim_{N \rightarrow \infty} \sum_{n=1}^N \klmcn \end{align}\] Note that going from 15 to 16 , we use the padding construction given in 3. ◻
Proof. In this proof, we consider the special case of \(M = 1\). The proof easily generalizes to \(M > 1\) using the i.i.d. assumption. Additionally, we write \(\rbestimator(\yvarb)\), rather than suppressing the argument to the estimator as we do in the main text. We begin with proving the unbiasedness of the estimator, using [lemma:ntoinf] and [lemma:rb]. \[\begin{align} \E \big[\rbestimator(\yvarb) \big] &= \E_{\yvarb'} \Big[\lim_{N\rightarrow \infty}\sum_{n=1}^N \E_{\yvarb} \big[ \klmcn(\yvarb) \mid \yvarb_{<n} = \yvarb'_{<n} \big] \Big] & \\ &=\lim_{N\rightarrow \infty} \sum_{n=1}^{N}\E_{\yvarb'} \Big[ \E_{\yvarb} \big[ \klmcn(\yvarb) \mid \yvarb_{<n} = \yvarb'_{<n} \big] \Big] \\&\substack{\justification{Tonelli's Theorem}} \nonumber \\ &= \lim_{N\rightarrow \infty} \sum_{n=1}^{N}\E_{\yvarb} \left[\klmcn(\yvarb) \right] \\ &\substack{\justification{\Cref{lemma:rb}}} \nonumber \\ &= \E_{\yvarb} \left[\lim_{N\rightarrow \infty} \sum_{n=1}^{N}\klmcn(\yvarb) \right] \\ &\substack{\justification{Fubini's Theorem, \E[ \sum_{n=1}^{\infty} |\klmcn(\yvarb)| ] < \infty}} \nonumber \\ &= \E_{\yvarb} \left[\klmc(\yvarb) \right] \\ &\substack{\justification{\Cref{lemma:ntoinf}}} \nonumber \\ &= \kl(\plm \mid\mid \qlm). \\ &\substack{\justification{unbiasedness of \klmc}} \nonumber \end{align}\] Finally, we prove the variance-reduction property. \[\begin{align} \var{\rbestimator} &= \E_{\yvarb'} \left[ \left(\lim_{N\rightarrow \infty}\sum_{n=1}^N \E_{\yvarb}\big[ \klmcn(\yvarb) \mid \yvarb_{<n} = \yvarb'_{< n} \big] - \kl(\plm \mid\mid \qlm) \right)^2 \right] \\ & \substack{\justification{Definition of \rbestimator(\strrv)}} \nonumber \\ &= \lim_{N\rightarrow \infty} \E_{\yvarb'} \left[ \left(\sum_{n=1}^N \E_{\yvarb}\big[ \klmcn(\yvarb) \mid \yvarb_{<n} = \yvarb'_{< n} \big] - \kl(\plm \mid\mid \qlm) \right)^2 \right] \\ & \substack{\justification{dominated convergence theorem, \var{\rbestimator} < \infty}} \nonumber \\ &\leq \lim_{N\rightarrow \infty} \E_{\yvarb}\left[\left(\sum_{n=1}^N \klmcn(\yvarb) - \kl(\plm, \qlm) \right)^2 \right] \\ &\substack{\justification{\Cref{lemma:rb}, variance reduction}} \nonumber \\ &= \E_{\yvarb}\left[\lim_{N\rightarrow \infty} \left(\sum_{n=1}^N \klmcn(\yvarb) - \kl(\plm, \qlm) \right)^2 \right] \\ &\substack{\justification{dominated convergence theorem, \var{\klmc} < \infty}} \nonumber \\ &=\E_{\yvarb}\left[\left( \klmc(\yvarb) - \kl(\plm \mid\mid \qlm) \right)^2 \right] \\ &\substack{\justification{\Cref{lemma:ntoinf}}} \nonumber \\ &=\var{\klmc}. \end{align}\] ◻
Proof. \[\begin{align} &\nablat \kl(\plm \mid\mid \qlm) \nonumber \\ &= \sum_{\str \in \Sigma^*} \kl\mleft(\prefixprob(\cdot \mid \str) \mid\mid \qpprob(\cdot \mid \str)\mright) \nablat \prefixprob(\str) + \prefixprob(\str) \nablat \kl\mleft(\prefixprob(\cdot \mid \str) \mid\mid \qpprob(\cdot \mid \str)\mright) \\ &=\sum_{\str \in \Sigma^*} \kl\mleft(\prefixprob(\cdot \mid \str) \mid\mid \qpprob(\cdot \mid \str)\mright) \nablat \prefixprob(\str) + \prefixprob(\str) \sum_{\yb \in \alphabar} \nablat \prefixprob(\yb \mid \str) \log \frac{\prefixprob(\yb \mid \str)}{\qpprob(\yb \mid \str)} \\ &=\sum_{\str \in \Sigma^*} \kl\mleft(\prefixprob(\cdot \mid \str) \mid\mid \qpprob(\cdot \mid \str)\mright) \nablat \prefixprob(\str) + \prefixprob(\str) \sum_{\yb \in \alphabar} \log \frac{\prefixprob(\yb \mid \str)}{\qpprob(\yb \mid \str)} \nablat \prefixprob(\yb \mid \str) + \nablat \prefixprob(\yb \mid \str) \\ &=\sum_{\str \in \Sigma^*} \kl\mleft(\prefixprob(\cdot \mid \str) \mid\mid \qpprob(\cdot \mid \str)\mright) \nablat \prefixprob(\str) + \prefixprob(\str) \E_{\Yb} \left[\left(\log \frac{\prefixprob(\Yb \mid \str)}{\qpprob(\Yb \mid \str)} + 1 \right) \nablat \log \prefixprob(\Yb \mid \str) \right] \\ &=\sum_{\str \in \Sigma^*} \kl\mleft(\prefixprob(\cdot \mid \str) \mid\mid \qpprob(\cdot \mid \str)\mright) \nablat \prefixprob(\str) + \prefixprob(\str) \E_{\Yb} \left[\log \frac{\prefixprob(\Yb \mid \str)}{\qpprob(\Yb \mid \str)} \nablat \log \prefixprob(\Yb \mid \str) \right] \\ &=\sum_{\str \in \Sigma^*} \E_{\Yb} \left[\log \frac{\prefixprob(\Yb \mid \str)}{\qpprob(\Yb \mid \str)} \right] \nablat \prefixprob(\str) + \prefixprob(\str) \E_{\Yb} \left[\log \frac{\prefixprob(\Yb \mid \str)}{\qpprob(\Yb \mid \str)} \nablat \log \prefixprob(\Yb \mid \str) \right] \\ &=\sum_{\str \in \Sigma^*} \prefixprob(\str)\E_{\Yb} \left[\log \frac{\prefixprob(\Yb \mid \str)}{\qpprob(\Yb \mid \str)} \right] \nablat \log \prefixprob(\str) + \prefixprob(\str) \E_{\Yb} \left[\log \frac{\prefixprob(\Yb \mid \str)}{\qpprob(\Yb \mid \str)} \nablat \log \prefixprob(\Yb \mid \str) \right] \\ &=\sum_{\str \in \Sigma^*} \prefixprob(\str)\E_{\Yb} \left[\log \frac{\prefixprob(\Yb \mid \str)}{\qpprob(\Yb \mid \str)} \left(\nablat \log \prefixprob(\Yb \mid \str) + \nablat \log \prefixprob(\str) \right)\right]. \end{align}\] ◻
The proof structure is as follows: We first prove that the inequality holds when we constrain \(\yvarb\) to have length less than or equal to \(N\). We then generalize to infinite-length sequences by analyzing the limit as \(N \rightarrow \infty\). We begin by defining the truncated MC and RB estimators. Let \(\mcgradN\) be the truncated MC estimator of the gradient: \[\begin{align} \label{eq:gradmcN} \mcgradN = \sum_{n=1}^N \frac{1}{M} \sum_{m=1}^M \log \frac{\prefixprob(\Yb_n^{(m)} \mid \yvarb_{<n}^{(m)})}{\qpprob(\Yb_n^{(m)} \mid \yvarb_{<n}^{(m)})} \nablat \log \prefixprob(\yvarb_{\leq N}^{(m)}). \end{align}\tag{17}\]
Let \(\gradrbN\) be the truncated RB estimator: \[\begin{align} \label{eq:gradrbN} \gradrbN = \frac{1}{M}\sum_{m=1}^M \sum_{n=1}^N \E_{\Yb_n} \left[\log \frac{\prefixprob(\Yb_n \mid \yvarb^{(m)}_{<n})}{\qpprob(\Yb_n \mid \yvarb^{(m)}_{<n})} \left( \nablat \log \prefixprob(\Yb_n \mid \yvarb^{(m)}_{<n}) + \nablat \log \prefixprob(\yvarb^{(m)}_{<n}) \right) \right]. \end{align}\tag{18}\]
Lemma 1. The truncated MC estimator of the gradient \(\mcgradN\) converges to \(\mcgrad\) as \(N\) goes to \(\infty\), i.e., \(\lim_{N \rightarrow \infty} \mcgradN = \mcgrad\).
Proof. \[\begin{align} \mcgrad &= \frac{1}{M} \sum_{m=1}^M \log \frac{\prefixprob(\yvar^{(m)})}{\qpprob(\yvar^{(m)})} \nablat \log \prefixprob(\yvar^{(m)}) \\ &= \frac{1}{M} \sum_{m=1}^M \log \frac{\prefixprob(\eos \mid \yvar^{(m)})}{\qpprob(\eos \mid \yvar^{(m)})} \prod_{n=1}^{|\yvar^{(m)}|}\frac{\prefixprob(Y_n \mid \yvar_{<n}^{(m)})}{\qpprob(Y_n \mid \yvar_{<n}^{(m)})} \nablat \log \prefixprob(\yvar^{(m)}) \\ &= \frac{1}{M} \sum_{m=1}^M \log \prod_{n=1}^{\infty}\frac{\prefixprob(\Yb_n \mid \yvarb_{<n}^{(m)})}{\qpprob(\Yb_n \mid \yvarb_{<n}^{(m)})} \nablat \log \prefixprob(\yvarb^{(m)}) \\ &= \frac{1}{M} \sum_{m=1}^M \log \lim_{N \rightarrow \infty}\prod_{n=1}^{N}\frac{\prefixprob(\Yb_n \mid \yvarb_{<n}^{(m)})}{\qpprob(\Yb_n \mid \yvarb_{<n}^{(m)})} \nablat \log \prefixprob(\yvarb_{\leq N}^{(m)}) \\ &= \frac{1}{M} \sum_{m=1}^M \lim_{N \rightarrow \infty} \log \prod_{n=1}^{N}\frac{\prefixprob(\Yb_n \mid \yvarb_{<n}^{(m)})}{\qpprob(\Yb_n \mid \yvarb_{<n}^{(m)})} \nablat \log \prefixprob(\yvarb_{\leq N}^{(m)}) \\ &=\lim_{N \rightarrow \infty} \frac{1}{M} \sum_{m=1}^M \log \frac{\prefixprob(\yvarb_{\leq N}^{(m)})}{\qpprob(\yvarb_{\leq N}^{(m)})} \nablat \log \prefixprob(\yvarb_{\leq N}^{(m)}) \\ &= \lim_{N \rightarrow \infty} \mcgradN. \end{align}\] ◻
Lemma 2. The following identity holds: \[\nablat \log \prefixprob(\yvarb_{\leq n}) = \E_{\yvarb'} \left[\nablat \log \prefixprob(\yvarb_{\leq N}') \,\middle|\, \yvarb'_{\leq n} = \yvarb_{\leq n} \right],\] for any \(N > n\).
Proof. Note that \(\prefixprob(\yvarb_{\leq n}) = \sum_{\strbar' \in \alphabar^{N-n}} \prefixprob(\yvarb_{\leq n}\strbar' )\) for any \(N > n\). Therefore, we have \[\begin{align} \nablat \log \prefixprob(\yvarb_{\leq n}) &= \frac{\nablat \prefixprob(\yvarb_{\leq n})}{\prefixprob(\yvarb_{\leq n})} \\ &= \sum_{\strbar' \in \alphabar^{N-n}} \frac{\nablat \prefixprob(\yvarb_{\leq n}\strbar' )}{\prefixprob(\yvarb_{\leq n})} \\ &= \sum_{\strbar' \in \alphabar^{N-n}}\frac{\prefixprob(\yvarb_{\leq n}\strbar' )}{\prefixprob(\yvarb_{\leq n})} \nablat \log \prefixprob(\yvarb_{\leq n}\strbar') \\ &= \sum_{\strbar' \in \alphabar^{N-n}}\frac{\prefixprob(\strbar' \mid \yvarb_{\leq n}) \prefixprob(\yvarb_{\leq n})}{\prefixprob(\yvarb_{\leq n})} \nablat \log \prefixprob(\yvarb_{\leq n}\strbar') \\ &= \sum_{\strbar' \in \alphabar^{N-n}} \prefixprob(\strbar' \mid \yvarb_{\leq n}) \nablat \log \prefixprob(\yvarb_{\leq n}\strbar') \\ &= \E_{\yvarb'} \left[\nablat \log \prefixprob(\yvarb_{\leq N}') \,\middle|\, \yvarb'_{\leq n} = \yvarb_{\leq n} \right]\label{eq:gradres} \end{align}\tag{19}\] ◻
Lemma 3. Let \(\mcgradN\) be the truncated MC estimator of the gradient defined in 17 and \(\gradrbN\) the truncated RB estimator of the gradient defined in 18 . Define \(\klgrad^N \defeq \E[\mcgradN] = \E[\gradrbN]\). We have \[\E\left[ \left\lVert \gradrbN - \klgrad^N \right\rVert^2 \right] \leq \E\left[ \left\lVert \mcgradN - \klgrad^N \right\rVert^2 \right].\]
Proof. Without loss of generality, we assume \(M=1\). The proof generalizes to \(M > 1\) with the i.i.d. assumption. \[\begin{align} &\E\left[ \left\lVert \gradrbN - \klgrad^N \right\rVert^2 \right] \\ &= \E_{\yvarb'} \left[ \left\lVert \sum_{n=1}^N \E_{\yvarb} \left[\klmcn(\yvarb_{\leq n}) \underbrace{\left(\nablat \log \prefixprob(\yvarb_{<n}) + \nablat \log \prefixprob(\Yb_n \mid \yvarb_{<n})\right)}_{\nablat \log \prefixprob(\yvarb_{\leq n})} - \klgrad^N \,\middle|\, \yvarb_{<n} = \yvarb'_{<n} \right] \right\rVert^2 \right] & \\ &\substack{\justification{definition of \gradrbN}} \nonumber \\ &= \E_{\yvarb'} \left[ \left\lVert \sum_{n=1}^N \E_{\yvarb^n} \left[\klmcn(\yvarb^n_{\leq n}) \nablat \log \prefixprob(\yvarb^n_{\leq n}) - \klgrad^N \,\middle|\, \yvarb^n_{<n} = \yvarb'_{<n} \right] \right\rVert^2 \right] \\&\substack{\justification{each \yvarb^n is distributed independently and identically to \yvarb}}\nonumber \\ &= \E_{\yvarb'} \left[ \left\lVert \E_{\yvarb^1} \left[\cdots \E_{\yvarb^N} \left[\sum_{n=1}^N \klmcn(\yvarb^n_{\leq n}) \nablat \log \prefixprob(\yvarb^n_{\leq n}) - \klgrad^N \,\middle|\, \yvarb^N_{<N} = \yvarb'_{<N}\right] \cdots \,\middle|\, \yvarb^1_{<1} = \yvarb'_{<1} \right] \right\rVert^2 \right] & \\ &\substack{\justification{linearity of expectation}} \nonumber \\ &\leq \E_{\yvarb'} \left[ \E_{\yvarb^1} \left[\cdots \E_{\yvarb^N} \left[\left\lVert \sum_{n=1}^N \klmcn(\yvarb^n_{\leq n}) \nablat \log \prefixprob(\yvarb^n_{\leq n}) - \klgrad^N \right\rVert^2 \,\middle|\, \yvarb^N_{<N} = \yvarb'_{<N}\right] \cdots \,\middle|\, \yvarb^1_{<1} = \yvarb'_{<1} \right] \right] & \\ &\substack{\justification{Jensen's inequality}} \nonumber \\ &= \E_{\yvarb^1} \left[\cdots \E_{\yvarb^N} \left[\left\lVert \sum_{n=1}^N \klmcn(\yvarb^n_{\leq n}) \nablat \log \prefixprob(\yvarb^n_{\leq n}) - \klgrad^N \right\rVert^2 \right] \right] & \\ &\substack{\justification{law of total expectation}} \nonumber \\ &= \E_{\yvarb} \left[\left\lVert \sum_{n=1}^N \klmcn(\yvarb_{\leq n}) \nablat \log \prefixprob(\yvarb_{\leq n}) - \klgrad^N \right\rVert^2 \right] \label{eq:gradmid}\\ &\substack{\justification{\yvarb^1, \ldots, \yvarb^N are i.i.d.}} \nonumber \\ &\leq \E_{\yvarb} \left[\left\lVert \sum_{n=1}^N \klmcn(\yvarb_{\leq n}) \E_{\yvarb'} \left[\nablat \log \prefixprob(\yvarb_{\leq N}') \mid \yvarb'_{\leq n} = \yvarb_{\leq n} \right] - \klgrad^N \right\rVert^2 \right] \\ &\substack{\justification{\Cref{lemm:prefixgrad}}} \nonumber \\ &= \E_{\yvarb} \left[ \left\lVert \sum_{n=1}^N \E_{\yvarb'} \left[ \klmcn(\yvarb'_{\leq n}) \nablat \log \prefixprob(\yvarb_{\leq N}') - \klgrad^N \,\middle|\, \yvarb'_{\leq n} = \yvarb_{\leq n} \right] \right\rVert^2 \right] \\ &\substack{\justification{linearity of expectation}} \nonumber \\ &= \E_{\yvarb} \left[ \left\lVert \E_{\yvarb'^1} \left [\cdots \E_{\yvarb'^N} \left[\sum_{n=1}^N \klmcn(\yvarb'^n_{\leq n}) \nablat \log \prefixprob(\yvarb_{\leq N}'^n) - \klgrad^N \,\middle|\, \yvar'^N_{\leq N} = \yvar_{\leq N} \right] \cdots \,\middle|\, \yvar'^1_{\leq 1} = \yvarb_{\leq 1}\right]\right\rVert^2 \right] \\ &\substack{\justification{linearity of expectation}} \nonumber \\ &\leq \E_{\yvarb} \left[ \E_{\yvarb'^1} \left [\cdots \E_{\yvarb'^N} \left[\left\lVert \sum_{n=1}^N \klmcn(\yvarb'^n_{\leq n}) \nablat \log \prefixprob(\yvarb_{\leq N}'^n) - \klgrad^N \right\rVert^2 \,\middle|\, \yvarb'^N_{\leq N} = \yvarb_{\leq N} \right] \cdots \,\middle|\, \yvarb'^1_{\leq 1} = \yvarb_{\leq 1}\right] \right] \\ &\substack{\justification{Jensen's inequality}} 
\nonumber \\ &= \E_{\yvarb'^1} \left [\cdots \E_{\yvarb'^N} \left[\left\lVert \sum_{n=1}^N \klmcn(\yvarb'^n_{\leq n}) \nablat \log \prefixprob(\yvarb_{\leq N}'^n) - \klgrad^N \right\rVert^2 \right] \right] \\ &\substack{\justification{law of total expectation}} \nonumber \\ &= \E_{\yvarb'} \left[ \left\lVert \sum_{n=1}^N \klmcn(\yvarb'_{\leq n}) \nablat \log \prefixprob(\yvarb_{\leq N}') - \klgrad^N \right\rVert^2 \right] \\ &\substack{\justification{\yvarb'^1, \cdots, \yvarb'^N are i.i.d.}} \nonumber \\ &= \E_{\yvarb} \left[ \left\lVert \mcgradN(\yvarb) - \klgrad^N \right\rVert^2 \right]. \\ &\substack{\justification{definition of \mcgradN}} \nonumber \end{align}\tag{20}\] ◻
Proof. Without loss of generality, we assume \(M=1\). The proof generalizes to \(M > 1\) with the i.i.d. assumption. \[\begin{align} \E\left[ \left\lVert \gradrb - \klgrad \right\rVert^2 \right] &= \E\left[ \left\lVert \lim_{N \rightarrow \infty} \left( \gradrbN - \klgrad^N \right) \right\rVert^2 \right] \\ &\substack{\justification{definition of \gradrb}} \nonumber \\ &= \lim_{N \rightarrow \infty} \E\left[ \left\lVert \gradrbN - \klgrad^N \right\rVert^2 \right] \\ &\substack{\justification{dominated convergence theorem, \var{\gradrb} < \infty}} \nonumber \\ &\leq \lim_{N \rightarrow \infty} \E\left[ \left\lVert \mcgradN - \klgrad^N \right\rVert^2 \right] \\ &\substack{\justification{\Cref{lemm:gradn}}} \nonumber \\ &= \E\left[ \left\lVert \lim_{N \rightarrow \infty} \left( \mcgradN - \klgrad^N \right) \right\rVert^2 \right] \\ &\substack{\justification{dominated convergence theorem, \var{\mcgrad} < \infty}} \nonumber \\ &= \E\left[ \left\lVert \mcgrad - \klgrad \right\rVert^2 \right]. \\ &\substack{\justification{\Cref{lemm:ntoinfgrad}}} \nonumber \end{align}\] ◻
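To build intuition for these variance comparisons, the following is a minimal, self-contained numerical sketch (illustrative only, not part of the paper's released code) contrasting the MC and RB estimators of the KL value on a toy fixed-length autoregressive model; the vocabulary size, sequence length, and sample count are arbitrary choices. Both estimators have the same mean, while the RB estimate exhibits a noticeably smaller standard deviation.

import torch

torch.manual_seed(0)

# Toy autoregressive models p and q over a vocabulary of size V with fixed length N.
# Each next-token distribution depends only on the previous token (a first-order chain).
V, N, num_samples = 5, 3, 20000
p_cond = torch.softmax(torch.randn(N, V, V), dim=-1)  # p_cond[n, prev, next]
q_cond = torch.softmax(torch.randn(N, V, V), dim=-1)  # q_cond[n, prev, next]

def estimate_once():
    """Sample one sequence y ~ p and return (MC estimate, RB estimate) of KL(p || q)."""
    prev, mc, rb = 0, 0.0, 0.0  # index 0 plays the role of a BOS symbol
    for n in range(N):
        p_n, q_n = p_cond[n, prev], q_cond[n, prev]
        y = torch.multinomial(p_n, 1).item()
        # MC: log ratio of the single sampled token.
        mc += (p_n[y].log() - q_n[y].log()).item()
        # RB: exact per-step KL over the whole vocabulary, given the sampled prefix.
        rb += (p_n * (p_n.log() - q_n.log())).sum().item()
        prev = y
    return mc, rb

estimates = torch.tensor([estimate_once() for _ in range(num_samples)])
print("MC mean/std:", estimates[:, 0].mean().item(), estimates[:, 0].std().item())
print("RB mean/std:", estimates[:, 1].mean().item(), estimates[:, 1].std().item())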
The conventional Monte Carlo estimator of \(\kl(\policy \mid\mid \qlm)\) used in the PPO algorithm in open-source RLHF libraries, e.g., [21], [22], is as follows: \[\begin{align} \label{eq:mcppo} \klmcppo = \frac{1}{M} \sum_{m=1}^M \frac{\policy(\yvar^{(m)})}{\oldpolicy(\yvar^{(m)})} \log \frac{\oldpolicy(\yvar^{(m)})}{\qlm(\yvar^{(m)})}, \end{align}\tag{21}\] where \(\yvar^{(1)}, \cdots, \yvar^{(M)} \simIID \oldpolicy\). Notably, the expected value of the above estimator is not equal to \(\kl(\policy \mid\mid \qlm)\); rather, it is \[\begin{align} \E[\klmcppo] = \E_{\yvar \sim \oldpolicy} \left[\frac{\policy(\yvar)}{\oldpolicy(\yvar)} \log \frac{\oldpolicy(\yvar)}{\qlm(\yvar)} \right] = \E_{\yvar \sim \policy} \left[\log \frac{\oldpolicy(\yvar)}{\qlm(\yvar)} \right]. \end{align}\] A natural question at this point is: what is the relationship between \(\klmcppo\) and \(\klmc\), and why is minimizing \(\klmcppo\) a valid proxy for minimizing \(\klmc\)? Crucially, the KL divergence between \(\policy\) and \(\qlm\) decomposes into the sum of \(\E[\klmcppo]\) and the KL divergence between \(\policy\) and \(\oldpolicy\), as shown in the following equation: \[\begin{align} \underbrace{\E_{\yvar \sim \policy} \left[\log \frac{\oldpolicy(\yvar)}{\qlm(\yvar)} \right]}_{\E[\klmcppo]} + \underbrace{\E_{\yvar \sim \policy} \left[\log \frac{\policy(\yvar)}{\oldpolicy(\yvar)} \right]}_{\text{trust region},\ \kl(\policy \mid\mid \oldpolicy)} = \E_{\yvar \sim \policy} \left[\log \frac{\policy(\yvar)}{\qlm(\yvar)} \right] = \kl(\policy \mid\mid \qlm). \end{align}\] Therefore, minimizing \(\kl(\policy \mid\mid \qlm)\) is equivalent to minimizing both \(\E[\klmcppo]\) and \(\kl(\policy \mid\mid \oldpolicy)\). Since the KL divergence between the current policy and the old policy, \(\kl(\policy \mid\mid \oldpolicy)\), is already constrained by PPO’s clipping mechanism, the algorithm effectively penalizes only the first term, using \(\klmcppo\).
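As a sanity check on this decomposition, the following minimal sketch (purely illustrative, not taken from the paper's codebase) verifies the identity numerically on a single-token example, with three randomly drawn categorical distributions standing in for \(\policy\), \(\oldpolicy\), and \(\qlm\); because the example has length one, every expectation can be computed exactly by enumerating the vocabulary.

import torch

torch.manual_seed(1)

# Three arbitrary categorical distributions over a toy vocabulary of size V,
# standing in for the current policy, the old (rollout) policy, and the reference q.
V = 8
policy = torch.softmax(torch.randn(V), dim=-1)
old_policy = torch.softmax(torch.randn(V), dim=-1)
q = torch.softmax(torch.randn(V), dim=-1)

# E[kl_mc_ppo] = E_{y ~ policy}[ log(old_policy(y) / q(y)) ]
ppo_term = (policy * (old_policy.log() - q.log())).sum()
# Trust-region term: KL(policy || old_policy)
trust_term = (policy * (policy.log() - old_policy.log())).sum()
# Exact KL(policy || q)
kl_policy_q = (policy * (policy.log() - q.log())).sum()

# The two sides agree up to floating-point error.
print((ppo_term + trust_term).item(), kl_policy_q.item())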
A naïve approach to Rao–Blackwellizing \(\klmcppo\), defined in 21, is as follows: \[\begin{align} \rbestimatorppo = \frac{1}{M} \sum_{m=1}^M \frac{\policy(\yvar^{(m)})}{\oldpolicy(\yvar^{(m)})} \lim_{N \rightarrow \infty}\sum_{n=1}^N \E_{Y_n \sim \oldpolicy}\left[\log \frac{\prefixprobold(Y_n \mid \yvar_{<n}^{(m)})}{\qpprob(Y_n \mid \yvar^{(m)}_{<n})}\right]. \end{align}\] Importantly, \(\rbestimatorppo\) does not give an unbiased estimate of \(\E_{\yvar \sim \policy} \left[\log \frac{\oldpolicy(\yvar)}{\qlm(\yvar)} \right]\), i.e., \[\begin{align} \E\left[\rbestimatorppo \right] = \E_{\yvar \sim \policy}\left[ \lim_{N \rightarrow \infty}\sum_{n=1}^N \E_{\Yb_n \sim \oldpolicy}\left[\log \frac{\prefixprobold(\Yb_n \mid \yvarb_{<n})}{\qpprob(\Yb_n \mid \yvarb_{<n})}\right] \right] \neq \E_{\yvar \sim \policy} \left[\log \frac{\oldpolicy(\yvar)}{\qlm(\yvar)} \right]. \end{align}\] Therefore, we caution the reader against using this estimator as a replacement for \(\klmcppo\) in practice.
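To see the bias concretely, consider a single-token example, where both expectations can be computed exactly by enumerating the vocabulary. The sketch below (with arbitrary random distributions standing in for \(\policy\), \(\oldpolicy\), and \(\qlm\)) shows that the inner expectation no longer depends on the sampled token, so the naïve estimator's mean collapses to \(\kl(\oldpolicy \mid\mid \qlm)\) rather than \(\E_{\yvar \sim \policy}\left[\log \frac{\oldpolicy(\yvar)}{\qlm(\yvar)}\right]\).

import torch

torch.manual_seed(2)

# Single-token toy example: exact expectations via enumeration of the vocabulary.
V = 8
policy = torch.softmax(torch.randn(V), dim=-1)
old_policy = torch.softmax(torch.randn(V), dim=-1)
q = torch.softmax(torch.randn(V), dim=-1)

log_ratio = old_policy.log() - q.log()

# Exact expectation of the PPO MC estimator under y ~ old_policy:
#   E[(policy(y)/old_policy(y)) * log(old_policy(y)/q(y))] = sum_y policy(y) * log_ratio(y).
e_mc_ppo = (policy * log_ratio).sum()

# Exact expectation of the naive RB variant: the inner sum over the vocabulary is a
# constant equal to KL(old_policy || q), so the importance weight integrates to one and
# the expectation collapses to KL(old_policy || q), independent of the current policy.
e_rb_naive = (old_policy * log_ratio).sum()

print(e_mc_ppo.item(), e_rb_naive.item())  # the two expectations differ in general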
In Figure 6, we visualize the KL estimates for three different prompts: (left) a positive prompt, (middle) a neutral prompt, and (right) a negative adversarial prompt. The traces show the average estimate across all repetitions, while the shaded regions indicate the standard deviation. With the exception of \(\klcv\) with \(\alpha=1\), all estimators are unbiased and consistent.
As the sample size increases, the chance of sampling from the tail of \(\policy\) also increases. These tail samples often correspond to negative movie reviews that had a high probability under the language model prior to fine-tuning, i.e., \(\qlm\), leading to extremely large values of \(g(\yvar)\) and, consequently, a high standard deviation. This effect indeed depends on the prompt and is particularly pronounced for neutral and adversarial prompts.
We conducted an additional experiment to evaluate the estimators on a random subset of prompts from the UltraFeedback dataset [39], computing the KL divergence between Zephyr-7B-Beta [40] and its reference model, Mistral-7B-v0.1 [41]. Zephyr is fine-tuned from Mistral with DPO on UltraFeedback, and, as part of this fine-tuning, it is desirable that the model not diverge significantly from its base model.
| Estimator | \(M=1\) | \(M=5\) | \(M=10\) |
|---|---|---|---|
| \(\klmc\) | \(18.05 \pm 3.19\) | \(18.05 \pm 1.63\) | \(18.05 \pm \mathbf{0.75}\) |
| \(\klht\) | \(18.05 \pm 12.64\) | \(18.56 \pm 5.34\) | \(19.24 \pm 3.62\) |
| \(\klcvo\) | \(17.17 \pm 3.17\) | \(17.17 \pm 1.62\) | \(17.17 \pm 0.75\) |
| \(\klcv\) | \(17.80 \pm 3.18\) | \(17.80 \pm 1.63\) | \(17.80 \pm 0.75\) |
| \(\rbestimator\) | \(18.05 \pm \mathbf{3.16}\) | \(18.05 \pm \mathbf{1.61}\) | \(18.05 \pm \mathbf{0.75}\) |
We randomly sampled 512 prompts and generated \(100\) responses per prompt. For estimation, we used subsets of \(1\), \(5\), and \(10\) samples, reserving the remaining samples to estimate each method’s standard deviation. Table 2 reports the KL estimate ± standard deviation for each estimator. Unlike in our GPT-2 experiments, we had to use half-precision to perform inference and forward passes on a single GPU, which introduces bias into the HT and CV estimators. Consistent with our findings on the IMDB dataset, our proposed RB estimator achieves the lowest standard deviation across all settings, reaffirming its stability and reliability.
import torch.nn.functional as F


def compute_kl(logprobs, ref_logprobs, logits, ref_logits):
    """
    Compute KL divergence estimates with two estimators.

    Args:
        logprobs: Log probabilities of the sampled tokens under the policy.
        ref_logprobs: Log probabilities of the same tokens under the reference.
        logits: Full next-token logits from the policy (all vocabulary entries).
        ref_logits: Full next-token logits from the reference (all vocabulary entries).

    Example:
        # Policy samples token 3 out of a 1000-token vocabulary
        logprobs = [-2.3]            # only token 3's log probability
        logits = [0.1, -0.5, ...]    # all 1000 raw logits
        # MC: uses only the sampled token
        # RB: uses all 1000 tokens to take the exact per-step expectation
    """
    # Monte Carlo: unbiased but higher variance; uses only the sampled tokens.
    kl_mc = (logprobs - ref_logprobs).mean()

    # Rao-Blackwell: lower variance; exact expectation over the full vocabulary at
    # each step, i.e., the per-step KL between the two next-token distributions.
    log_p = F.log_softmax(logits, dim=-1)      # normalize policy logits to log probs
    log_q = F.log_softmax(ref_logits, dim=-1)  # normalize reference logits to log probs
    kl_rb = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()

    return kl_mc, kl_rb

In Table 3, we include the hyperparameters used with the RLOO algorithm for the sentiment control experiment. Each experiment takes approximately 20 minutes on a single RTX 4090 GPU.
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (\(\epsilon = 1\mathrm{e-}5\), lr\(=3\mathrm{e-}6\)) |
| Scheduler | Linear |
| Batch Size | \(32\) |
| \(\beta\) | \(0.07\) |
| \(k\) | \(2\) |
| Number of RLOO Update Iterations per Epoch | \(4\) |
| Clip range | \(0.2\) |
| Sampling Temperature | \(1\) |
Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: Yes
Justification: The main contributions are that our proposed estimator is unbiased and has variance less than or equal to the variance of the MC estimator, and it improves the stability of RLHF training. All these claims are clearly stated in the abstract and introduction.
Guidelines:
The answer NA means that the abstract and introduction do not include the claims made in the paper.
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: Yes
Justification: We explain the limitations in the Limitations section.
Guidelines:
The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
The authors are encouraged to create a separate "Limitations" section in their paper.
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
Theory assumptions and proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: Yes
Justification: The main theoretical claim is that the RB estimator is unbiased and has variance less than or equal to that of the MC estimator; this is supported by Theorem 2, and the proof is given in Appendix E.3. The remaining theoretical claims regarding the other estimators are proved in Appendices A–F.
Guidelines:
The answer NA means that the paper does not include theoretical results.
All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
All assumptions should be clearly stated or referenced in the statement of any theorems.
The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
Theorems and Lemmas that the proof relies upon should be properly referenced.
Experimental result reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: Yes
Justification: We provide a simple code snippet implementing our RB estimator in 12.1, which also highlights the difference between the RB and MC estimators.
Guidelines:
The answer NA means that the paper does not include experiments.
If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: Yes
Justification: We have released the code but, to preserve anonymity, we have not included the link to the public repository in the paper. Instead, the code is attached as supplementary material to the submission.
Guidelines:
The answer NA means that the paper does not include experiments requiring code.
Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
Experimental setting/details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: Yes
Justification: The experimental details are explained in 12.
Guidelines:
The answer NA means that the paper does not include experiments.
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
The full details can be provided either with the code, in appendix, or as supplemental material.
Experiment statistical significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: Yes
Justification: When evaluating the estimators in 5.1, we report the standard deviation across runs. Furthermore, for the RLHF Pareto plot, we perform a permutation test in 5.2 to assess the significance of the results.
Guidelines:
The answer NA means that the paper does not include experiments.
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
The assumptions made should be given (e.g., Normally distributed errors).
It should be clear whether the error bar is the standard deviation or the standard error of the mean.
It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
Experiments compute resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: Yes
Justification: We provide this in 12.
Guidelines:
The answer NA means that the paper does not include experiments.
The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).
Code of ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: Yes
Justification: We have reviewed the NeurIPS code of ethics and we confirm that the paper conforms to the NeurIPS code of ethics.
Guidelines:
The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
Broader impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: Yes
Justification: We discuss this in the Broader Impacts section.
Guidelines:
The answer NA means that there is no societal impact of the work performed.
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: NA
Justification: We do not release any new models or datasets; therefore, the paper poses no such risks.
Guidelines:
The answer NA means that the paper poses no such risks.
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: Yes
Justification: We cite the dataset in 5.1. While the original paper doesn’t specify a particular license, it emphasizes that the data is intended for academic and non-commercial use.
Guidelines:
The answer NA means that the paper does not use existing assets.
The authors should cite the original paper that produced the code package or dataset.
The authors should state which version of the asset is used and, if possible, include a URL.
The name of the license (e.g., CC-BY 4.0) should be included for each asset.
For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
If this information is not available online, the authors are encouraged to reach out to the asset’s creators.
New assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: NA
Justification: We do not release new assets.
Guidelines:
The answer NA means that the paper does not release new assets.
Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
The paper should discuss whether and how consent was obtained from people whose asset is used.
At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
Crowdsourcing and research with human subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: NA
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
Institutional review board (IRB) approvals or equivalent for research with human subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: NA
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
Declaration of LLM usage
Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, declaration is not required.
Answer: NA
Justification: The core method developed in this paper does not involve LLMs as any important, original, or non-standard components.
Guidelines:
The answer NA means that the core method development in this research does not involve LLMs as any important, original, or non-standard components.
Please refer to our LLM policy (https://neurips.cc/Conferences/2025/LLM) for what should or should not be described.
Throughout this paper \(\log\) denotes the natural logarithm function; thus, KL divergence is measured in nats rather than bits. We also note that terms of the form \(\plm(\str) \log \frac{\plm(\str)}{\qlm(\str)}\) in 1 where \(\plm(\str) = 0\) can correctly be taken to equal zero because \(\lim_{p \to 0^+} p \log \frac{p}{q} = 0\) for any fixed \(q > 0\).↩︎
In conditional tasks like dialogue generation, language models are prompted with an input \(\prompt \in \alphabet^*\), inducing a conditional distribution \(\plm(\cdot \mid \prompt)\). KL divergences are typically averaged over a set of prompts. For simplicity, we omit \(\prompt\) in notation and write \(\plm(\str)\); all of our results extend straightforwardly to the conditional case.↩︎
In general, computing the KL divergence between two arbitrary LMs exactly is undecidable. To see why, assume that each of the two language models is a probabilistic context-free grammar. In this case, deciding whether their KL divergence is zero is undecidable; this follows directly from the undecidability of testing equivalence between two unweighted context-free grammars [14]. In the more restrictive case of probabilistic finite-state language models, the problem is PSPACE-hard. Importantly, however, the intractability of exact computation does not imply that approximation is intractable. In practice, one can often obtain good Monte Carlo estimates of the KL divergence, provided that the estimator's variance is well-controlled. This paper studies practical methods for reducing that variance.↩︎
Monte Carlo estimation formally requires that the underlying random variable have finite variance; if the variance is unbounded, the estimator no longer converges in the limit.↩︎
Since the KL divergence is non-negative by definition, a negative estimate can be problematic when KL is used as part of a loss function, as it may destabilize the learning dynamics.↩︎
Despite its simplicity, our proposed estimator is absent from existing literature and open-source RLHF libraries [21], [22], [27], [28], highlighting a gap we believe is worth addressing.↩︎
One could also consider control variates of the form \(g\colon \kleene{\alphabet} \to \R^d\) for \(d > 1\) [31].↩︎
Note that the control variate method can be straightforwardly extended to support multiple control variates.↩︎
Note that for this to hold, we need to have \(\qlm(\str) = 0\) whenever \(\plm(\str) = 0\), which is different from the support condition we assumed for \(\kl(\plm \mid\mid \qlm) < \infty\).↩︎
Note that \(\alpha=1\) is the only value of \(\alpha\) that guarantees nonnegativity (see proof of [proposition:nonnegative-CV]).↩︎
[prop:requirements-on-alpha-for-reduction] provides the exact conditions on \(\alpha\) required for a variance reduction.↩︎
Note that \(\tvar\) is a function of \(\{\yvar^{(m)}\}_{m=1}^M\). We have suppressed this function’s arguments in our notation for improved readability.↩︎
Note that in the Rao–Blackwell theorem, we get the stronger result that when \(\tvar\) is a sufficient statistic, \(\var{\E[\estimator \mid \tvar]}\) is optimal, i.e., \(\E[\estimator \mid \tvar]\) is the minimum-variance unbiased estimator. However, we can perform Rao–Blackwellization even when \(\tvar\) is not sufficient and are still guaranteed that the variance is no worse [33].↩︎
Specifically, we use https://huggingface.co/lvwerra/gpt2-imdb.↩︎
Specifically, we use https://huggingface.co/lvwerra/distilbert-imdb.↩︎
Specifically, we look at the logits of the positive class.↩︎
10.2 discusses a common mistake when Rao–Blackwellizing the KL estimator in trust-region algorithms.↩︎
Specifically, we use the available implementation of this algorithm in the trl library [21].↩︎
Let \(X\) and \(Y\) be real-valued random variables, and let \(a\) be a real scalar. Then, the following hold: \(\var{X + Y} = \var{X} + \var{Y} + 2 \cov{X,Y}\), \(\var{a X} = a^2 \var{X}\), and \(\var{X + a} = \var{X}\).↩︎