October 03, 2025
Diffusion models are widely used for generative tasks across domains. While pre-trained diffusion models effectively capture the training data distribution, it is often desirable to shape these distributions using reward functions to align with downstream applications. Policy gradient methods, such as Proximal Policy Optimization (PPO), are widely used in the context of autoregressive generation. However, the marginal likelihoods required for such methods are intractable for diffusion models, leading to alternative proposals and relaxations. In this context, we unify variants of Rejection sAmpling based Fine-Tuning (RAFT) as GRAFT, and show that this implicitly performs PPO with reshaped rewards. We then introduce P-GRAFT to shape distributions at intermediate noise levels and demonstrate empirically that this can lead to more effective fine-tuning. We mathematically explain this via a bias-variance tradeoff. Motivated by this, we propose inverse noise correction to improve flow models without leveraging explicit rewards. We empirically evaluate our methods on text-to-image (T2I) generation, layout generation, molecule generation and unconditional image generation. Notably, our framework, applied to Stable Diffusion 2, improves over policy gradient methods on popular T2I benchmarks in terms of VQAScore and shows an \(8.81\%\) relative improvement over the base model. For unconditional image generation, inverse noise correction improves FID of generated images at lower FLOPs/image.
Pre-trained generative models often require task-specific adaptations based on reward feedback - a standard strategy is to leverage RL algorithms, such as Proximal Policy Optimization (PPO) [1]. While such methods have found great success in the context of language modeling [2], [3], their adoption for diffusion models is not straightforward. In particular, unlike autoregressive models, the marginal likelihoods required for the implementation of KL regularization in PPO are intractable for diffusion models. Hence, in practice, KL regularization is ignored [4] or relaxations such as trajectory KL regularization [5] are considered. However, ignoring the KL term results in unstable training in large-scale settings [6], whereas using the trajectory KL constraint gives subpar results [4]. Further, fine-tuning with trajectory KL also results in the initial value function bias problem [7], [8].
Apart from policy gradient methods, recent research has also focused on fine-tuning methods based on rejection sampling, such as RSO [9], RAFT [10] and Reinforce-Rej [11]. Further, fine-tuning based on Best-of-N (BoN) sampling and its relation to policy gradient methods has also been explored, but in the context of autoregressive models [12], [13]. Given the intractability of (marginal) PPO, we explore conceptual connections between rejection sampling based fine-tuning methods and PPO, specifically in the context of diffusion models. In particular, we make the following contributions:
(a) We conceptualize a Generalized Rejection Sampling (GRS) framework which subsumes various rejection sampling strategies including classical rejection sampling from MCMC literature and Best-of-N. We show that GRS samples from the solution to PPO but with a reshaped reward - fine-tuning using GRS, which we term Generalized Rejection sAmpling Fine-Tuning (GRAFT) enables PPO with marginal KL constraint for diffusion models, despite the marginal likelihoods being intractable.
(b) Leveraging properties of diffusion models, we propose Partial-GRAFT (P-GRAFT), a framework which fine-tunes only up to an intermediate denoising step by assigning the rewards of final generations to partial, noisy generations. We show empirically that this leads to better fine-tuning and provide a mathematical justification via a bias-variance tradeoff, demonstrating significant quality gains across the tasks of text-to-image generation, layout generation and molecule generation.
(c) Motivated by P-GRAFT, we introduce Inverse Noise Correction - an adapter-based, parameter-efficient method to improve flow models even without explicit rewards. We empirically demonstrate improved quality as well as FLOPs for unconditional image generation.
(d) In particular, SDv2 fine-tuned using P-GRAFT demonstrates significant improvements in VQAScore over policy-gradient methods as well as SDXL-Base across datasets. The proposed Inverse Noise Correction strategy provides significant FID improvement at reduced FLOPs/image.
A more comprehensive list of related work can be found in Appendix 8.
PPO for Generative Modeling: Following [14], we introduce PPO in our setting: Consider a state space \(\mathcal{X}\), a reward function \(r: \mathcal{X} \to \mathbb{R}\) and a reference probability measure \(\bar{p}\) over \(\mathcal{X}\). Let \(\mathcal{P}(\mathcal{X})\) be the set of probability measures over \(\mathcal{X}\) and \(\alpha \in (0,\infty)\). Define \(R^{\mathsf{reg}}: \mathcal{P}(\mathcal{X}) \to \mathbb{R}\) by \(R^{\mathsf{reg}}(p) = \mathbb{E}_{X\sim p}[r(X)] - \alpha \mathsf{KL}(p||\bar{p})\), where \(\mathsf{KL}(\cdot \| \cdot)\) is the KL divergence. PPO aims to obtain \[\label{eq:ppo95def} p^\text{ppo} = {\mathop{\mathrm{arg\,sup}}}_{p \in \mathcal{P}(\mathcal{X})} R^{\mathsf{reg}}(p)\,.\tag{1}\] Using the method of Lagrangian Multipliers, we can show that \(p^\text{ppo}(x) \propto \exp(r(x)/\alpha)\bar{p}(x)\). In the generative modeling literature, \(\bar{p}\) is often the law of generated samples from a pre-trained model - fine-tuning is done on the model so as to sample from the tilted distribution \(p^{\text{ppo}}\).
PPO via Rejection Sampling: Classical rejection sampling from the Monte Carlo literature [15] can be used to sample from \(p^{\mathsf{ppo}}\). We note this folklore result in our setting:
Lemma 1. Let \(r (x) \leq r_{\max}\) for some \(r_{\max}\). Given a sample \(Y\sim\bar{p}\), we accept it with probability \(\mathbb{P}(\mathsf{Accept}|Y) = \exp(\tfrac{r(Y)-r_{\max}}{\alpha})\). Then, conditioned on \(\mathsf{Accept}\), \(Y\) is a sample from \(p^{\mathsf{ppo}}\).
Lemma 1 provides a way to obtain exact samples from \(p^{\mathsf{ppo}}\). A well known challenge with this method is sample inefficiency - in practice, \(\alpha\) is often small, leading to small acceptance probabilities. Thus, methods such as Best-of-N (BoN), which always accept a fixed fraction of samples, are used instead. We now introduce Generalized Rejection sAmpling Fine-Tuning (GRAFT), a framework that unifies such rejection sampling approaches. More specifically, Lemma 2 shows that this still performs PPO, but with reshaped rewards. We then discuss its utility in the context of diffusion models.
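To make Lemma 1 concrete, the following minimal NumPy sketch (ours, for illustration only, not the paper's code) draws exact samples from \(p^{\mathsf{ppo}}\) for a standard Gaussian reference \(\bar{p}\) and a toy bounded reward; shrinking \(\alpha\) visibly collapses the acceptance rate, which is exactly the sample-inefficiency issue noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):
    # Toy bounded reward with r_max = 1, maximized near x = 1.5.
    return np.exp(-(x - 1.5) ** 2)

def rejection_sample_ppo(n_target, alpha, r_max=1.0, batch=1024):
    """Exact samples from p^ppo(x) ∝ exp(r(x)/alpha) * pbar(x) via Lemma 1,
    with pbar = N(0, 1) as the reference distribution."""
    accepted, proposed = [], 0
    while len(accepted) < n_target:
        y = rng.standard_normal(batch)                 # Y ~ pbar
        keep = rng.random(batch) < np.exp((reward(y) - r_max) / alpha)
        accepted.extend(y[keep].tolist())
        proposed += batch
    return np.array(accepted[:n_target]), len(accepted) / proposed

for alpha in (1.0, 0.1):
    samples, rate = rejection_sample_ppo(2000, alpha)
    print(f"alpha={alpha}: sample mean={samples.mean():.2f}, acceptance rate~{rate:.3f}")
```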
Assume \((X^{(i)})_{i \in [M]}\) are \(M\) i.i.d. samples with law \(\bar{p}\) over a space \(\mathcal{X}\). Given reward function \(r: \mathcal{X} \to \mathbb{R}\), let the reward corresponding to \(X^{(i)}\) be \(R_i := r(X^{(i)})\), the empirical distribution of \((X^{(i)})_{i \in [M]}\) be \(\hat{P}_X (\cdot)\) and the empirical CDF of \((R_i)_{i\in[M]}\) be \(\hat{F}_R (\cdot)\). We introduce Generalized Rejection Sampling (GRS) to accept a subset of high reward samples, \(\mathcal{A}:=(Y^{(j)})_{j \in [M_s]} \subseteq (X^{(i)})_{i \in [M]}\), where \(Y^{(j)}\) denotes the \(j^{\text{th}}\) accepted sample.
Definition 1. Generalized Rejection Sampling (GRS): Let the acceptance function \(A: \mathbb{R} \times [0, 1] \times \mathcal{X} \times [0, 1] \to [0, 1]\) be such that \(A\) is co-ordinate wise increasing in the first two co-ordinates. The acceptance probability of sample \(i\) is \(p_i := A(R_i, \hat{F}_R(R_i), X^{(i)}, \hat{P}_X )\). Draw \(C_i \sim \text{Ber}(p_i) \; \forall \; i \in \{1, \dots, M\}\), not necessarily independent of each other. Then, \(X^{(i)} \in \mathcal{A}\) iff \(C_i = 1\).
Definition 1 subsumes popular rejection sampling approaches such as RAFT and BoN. We now show that GRS implicitly samples from the solution to PPO with the reshaped reward \(\hat{r}(\cdot)\):
Lemma 2. The law of accepted samples under GRS (Def 1) given by \(p(X^{(1)} = x | X^{(1)} \in \mathcal{A})\) is the solution to the following Proximal Policy Optimization problem: \[\begin{align} \mathop{\mathrm{arg\,max}}_{\hat{p}} \left[ \mathbb{E}_{{x} \sim \hat{p}} \hat{r}({x}) - \alpha \mathsf{KL} \left( \hat{p} \| \bar{p} \right) \right] ;\quad \tfrac{\hat{r}(x)}{\alpha} := \log \big( \mathbb{E} \big[ A(r(x), \hat{F}_R(r(x)), x, \hat{P}_X) | X^{(1)} = x\big] \big) \end{align}\] Here, the expectation is with respect to the randomness in the empirical distributions \(\hat{F}_R\) and \(\hat{P}_X\).
\(\hat{r}(\cdot)\) is monotonically increasing with respect to the actual reward since \(A\) is an increasing function of the reward and its empirical CDF. We now instantiate GRS with commonly used variants of \(A\):
\(\mathbf{\mathsf{Top-K}}\) Sampling: Let the reward distribution be continuous with CDF \(F(\cdot)\). We accept the top \(K\) samples out of the \(M\) samples based on their reward values. \[\text{Corresponding Acceptance Function:} \quad A(r,\hat{F}_R,x,\hat{P}_X) = \begin{cases} 0 \quad\text{ if } \hat{F}_R(r) \leq 1-\frac{K}{M} \\ 1 \quad\text{ if } \hat{F}_R(r) > 1-\frac{K}{M} \end{cases}\] Lemma 2 shows that this acceptance function performs PPO with the reshaped reward \(\hat{r}\) satisfying: \(\frac{\hat{r}(x)}{\alpha} = \log \big[ \sum_{k=0}^{K-1} \binom{M-1}{k} F(r(x))^{M-k-1} (1 - F(r(x)))^{k}\big]\).
Preference Rewards: Setting \(M=2\) and \(K=1\) in the above formulation gives preference rewards, i.e., \(X^{(1)}\) is accepted and \(X^{(2)}\) is rejected if \(r(X^{(1)}) > r(X^{(2)})\) (and vice versa). This strategy performs PPO with the reshaped reward \(\frac{\hat{r}(x)}{\alpha} = \log F(r(x))\). Since \(F\) is an increasing function, the PPO equivalent monotonically reshapes the reward \(r(x)\) to \(\log F(r(x))\).
Varying \(K\) from \(1\) to \(M\) varies the strength of the tilt in \(\mathbf{\mathsf{Top-K}}\) sampling. In particular, \(K=M\) corresponds to \(\frac{\hat{r}(x)}{\alpha} = 0\) (no tilt) and \(K=1\) corresponds to \(\frac{\hat{r}(x)}{\alpha} = (M-1) \log F(r(x))\).
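For illustration, the sketch below (ours, not the paper's implementation) instantiates the \(\mathsf{Top\text{-}K}\) acceptance rule on synthetic samples and evaluates the closed-form reshaped reward of Lemma 2 in a toy setting where the reward CDF \(F\) is the standard normal CDF.

```python
import numpy as np
from math import comb, erf, sqrt, log

rng = np.random.default_rng(1)

def top_k_accept(rewards, K):
    """GRS with Top-K acceptance: keep the K highest-reward samples out of M."""
    order = np.argsort(rewards)[::-1]
    mask = np.zeros(len(rewards), dtype=bool)
    mask[order[:K]] = True
    return mask

def reshaped_reward_over_alpha(F_r, M, K):
    """hat{r}(x)/alpha = log sum_{k=0}^{K-1} C(M-1, k) F^{M-k-1} (1-F)^k (Lemma 2, Top-K)."""
    return log(sum(comb(M - 1, k) * F_r ** (M - k - 1) * (1 - F_r) ** k for k in range(K)))

# Toy setting: samples X ~ N(0, 1) with reward r(x) = x, so F(r) = Phi(r).
M, K = 100, 10
r = rng.standard_normal(M)
print("accepted rewards:", np.round(np.sort(r[top_k_accept(r, K)]), 2))

Phi = lambda v: 0.5 * (1.0 + erf(v / sqrt(2.0)))
for v in (-1.0, 0.0, 1.0, 2.0):
    print(f"r={v:+.1f}  hat_r/alpha = {reshaped_reward_over_alpha(Phi(v), M, K):.2f}")
```

The printed values increase monotonically in \(r\), consistent with the discussion after Lemma 2.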
Binary Rewards with De-Duplication: Suppose \(r(X) \in \{0,1\}\) (e.g., corresponding to unstable/stable molecules in molecule generation). De-duplication of the generated samples might be necessary to maintain diversity. Given any structure function \(f\) (e.g., one that extracts the molecule structure from a configuration), let \(N_f(X,\hat{P}_X) = |\{i: f(X^{(i)}) = f(X)\}|\), i.e., the number of copies of \(X\) in the data. \[\text{Proposed Acceptance Function:} \quad A(r,\hat{F}_R,x,\hat{P}_X) = \begin{cases} 0 &\quad \text{ if } r = 0 \\ \frac{1}{N_f(x,\hat{P}_X)}&\quad \text{ if } r = 1 \end{cases}\] Draw \(C_i \sim \mathsf{Ber}(p_i)\) without replacement among the duplicate/similar samples (i.e., they are marginally Bernoulli but are not independent). Thus, exactly one out of the duplicate molecules is selected almost surely. Applying Lemma 2, we conclude that this performs PPO with \[\frac{\hat{r}(x)}{\alpha} = \begin{cases} -\infty &\quad \text{ if } r(x) = 0 \\ \log \mathbb{E}\bigl[\tfrac{1}{N_f(x,\hat{P}_X)}\bigr|X^{(1)} = x\bigr] &\quad \text{ if } r(x) = 1 \end{cases}\] We see that the shaped reward increases with diversity and with the value of the original reward. We use this in the molecule generation experiments to avoid mode collapse (Section 6.2).
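A minimal sketch of this de-duplication rule (illustrative only; the structure function below is a stand-in for an RDKit-style canonicalization): zero-reward samples are dropped and exactly one representative is kept per structure class among the reward-one samples.

```python
import numpy as np

rng = np.random.default_rng(2)

def dedup_accept(samples, rewards, structure_fn):
    """Binary-reward GRS with de-duplication: discard r = 0 samples and accept
    exactly one sample per structure class among the r = 1 samples."""
    groups = {}
    for i, (x, r) in enumerate(zip(samples, rewards)):
        if r == 1:
            groups.setdefault(structure_fn(x), []).append(i)
    # Marginally, each duplicate is accepted w.p. 1/N_f; jointly, one per group survives.
    return sorted(int(rng.choice(idx)) for idx in groups.values())

# Toy "molecules": integer vectors; the structure function ignores ordering.
samples = [rng.integers(0, 3, size=4) for _ in range(12)]
rewards = rng.integers(0, 2, size=12)   # stand-in stable (1) / unstable (0) labels
print("accepted indices:", dedup_accept(samples, rewards, lambda x: tuple(sorted(x.tolist()))))
```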
While specialized versions of Lemma 2 are known in the context of AR models [12], the result is particularly useful in the context of diffusion models. Note that given a sample \(x\) along with a prompt \(y\), the marginal likelihood \(\bar{p}(x|y)\) can be easily computed for AR models. For diffusion models, we only have access to conditional likelihoods along the denoising trajectory of the diffusion process whereas \(\mathsf{KL} ({p} || \bar{p})\) is intractable. That is, if the denoising process is run from \(t_N\) to \(t_0\), we have access to \(\bar{p}(x_{t_i} | x_{t_{i+1}})\). A commonly used relaxation is the trajectory KL, \(\mathsf{KL} ({p}(X_{0:T} ) || \bar{p}(X_{0:T}))\), which can be shown to upper bound the marginal KL. As discussed in [7], this constraint can lead to the initial value function bias problem since the KL regularization is with respect to the learned reverse process; in particular, it becomes necessary to learn an appropriate tilt even at time \(T\). In this context, Lemma 2 offers a simple yet effective alternative to implicitly achieve marginal KL regularization.
Based on GRS, we propose GRAFT: Generalized Rejection sAmpling Fine-Tuning (Algorithm 6) - given a reference model \(\bar{p}\), we generate samples and apply the GRS strategy of Definition 1. A dataset is formed from the accepted samples and standard training is done on this dataset.
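Schematically, one GRAFT round is a sample/filter/fine-tune loop; in the sketch below (ours), `generate`, `reward_fn`, `accept_fn` and `finetune_on` are placeholder callables, since the concrete sampler, reward model and trainer are application-specific.

```python
def graft_round(model, generate, reward_fn, accept_fn, finetune_on, M=100):
    """One GRAFT round: draw M samples from the current model, apply a GRS
    acceptance rule (e.g. Top-K of M) to their rewards, and run standard
    supervised/diffusion training on the accepted samples only."""
    samples = [generate(model) for _ in range(M)]
    rewards = [reward_fn(x) for x in samples]
    accept_mask = accept_fn(rewards)
    dataset = [x for x, keep in zip(samples, accept_mask) if keep]
    return finetune_on(model, dataset)      # returns the updated model
```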
Having established that GRAFT implicitly performs PPO, we now examine methods to further improve the framework. Continuous diffusion models typically start with Gaussian noise \(X_T\) at time \(T\) and denoise it to the output \(X_0\) via a discretized continuous time SDE. With \(N\) denoising steps, the model constructs a denoising trajectory \({X}_{t_N} \to \dots {X}_{t_i} \to \dots \to {X}_{t_0}\) (\(t_N = T\) and \(t_0 = 0\)), denoted by \({X}_{T:0}\). We now consider the effect of shaping the distribution of an intermediate state \(X_t\). For the rest of the discussion, we reserve \(n\) and \(N\) to refer to discrete timesteps, and \(t\) and \(T\) for continuous time. For any \(t \in [0,T]\) denote the marginal density of \({X}_t\) by \(\bar{p}_t({x})\).
We first extend GRS to Partial Generalized Rejection Sampling (P-GRS). Let \(X_t^{(1)}, \dots, X_t^{(M)}\) be partially denoised (denoised till time \(t\)) samples. Let their corresponding completely denoised samples be \(X_0^{(1)}, \dots, X_0^{(M)}\). Rewards are computed using the completely denoised samples (i.e., \(R_i = r(X_0^{(i)})\) for the \(i^{\text{th}}\) sample). We denote the empirical distribution of \(\{ X_{0}^{(1)}, \dots, X^{(M)}_{0} \}\) by \(\hat{P}_{X_{0}} (\cdot)\) and the empirical CDF of \(\{ R_{1}, \dots, R_{M} \}\) by \(\hat{F}_R (\cdot)\).
Definition 2. Partial Generalized Rejection Sampling (P-GRS): Consider an acceptance function \(A: \mathbb{R} \times [0, 1] \times \mathcal{X} \times [0, 1] \to [0, 1]\) such that \(A\) is co-ordinate wise increasing in the first two co-ordinates. The acceptance probability of sample \(i\) is \(p_i := A(R_i, \hat{F}_R(R_i), X^{(i)}_0, \hat{P}_{X_{0}} )\). Draw \(C_i \sim \text{Ber}(p_i) \; \forall \; i \in [M]\), not necessarily independent of each other. Then, \(X_t^{(i)} \in \mathcal{A}\) iff \(C_i = 1\).
Lemma 3. The law of the accepted samples under P-GRS (Def. 2) given by \(p_t(X^{(1)}_t = x | X^{(1)}_t \in \mathcal{A})\) is the solution to the following Proximal Policy Optimization problem: \[\begin{align} \mathop{\mathrm{arg\,max}}_{\hat{p}} \left[ \mathbb{E}_{X \sim \hat{p}} \hat{r}(X) - \alpha \mathsf{KL} \left( \hat{p} \| \bar{p}_t \right) \right] ;\quad \tfrac{\hat{r}(x)}{\alpha} := \log \big( \mathbb{E}\big[A(r(X_0^{(1)}), \hat{F}_R(r(X_0^{(1)})), X_0^{(1)}, \hat{P}_{X_0})\bigr| X_t^{(1)} = x\big] \big) \end{align}\]
The key difference is that the reshaped reward now depends on the expected value of the acceptance function given a partially denoised state \(X_t\). This tilts \(\bar{p}_t\) instead of \(\bar{p}_0\). It is straightforward to modify the PPO rewards corresponding to GRS to those of P-GRS. We illustrate this by instantiating Lemma 3 for preference rewards, as done with GRS (Lemma 2) above.
Preference rewards: With P-GRS, \(p_t\big(X_t^{(1)}=x|X_t^{(1)}\in\mathcal{A}\big)\propto \bar{p}_t(x)\exp(\frac{\hat{r}(x)}{\alpha})\) with \(\frac{\hat{r}(x)}{\alpha} = \log\mathbb{E}[F(r(X_0))|X_t = x]\).
Based on Lemma 3, we introduce P-GRAFT: Partial GRAFT (Algorithms 1 and [alg:p_graft_inf]). Here, fine-tuning is done on a (sampled) dataset of partially denoised vectors instead of fully denoised vectors. The fine-tuned model is only trained from times \(T\) to \(t\), and is used for denoising from noise only till time \(t\). We switch to the reference model for further denoising. We will now discuss the mathematical aspects of P-GRAFT and provide a justification for its improved performance.
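A schematic sketch of one P-GRAFT round and of inference-time model switching (ours, with placeholder callables; `denoise(model, n_from, n_to, x)` is assumed to run the reverse process from step `n_from` down to `n_to` starting at state `x`, with `n = N` being pure noise):

```python
def p_graft_round(model, ref_model, denoise, reward_fn, accept_fn,
                  finetune_on_intermediate, M=100, N=50, N_I=12):
    """One P-GRAFT round: rewards come from fully denoised samples, but the
    accepted dataset stores the intermediate states X_{t_{N_I}}, and only the
    (T -> t_{N_I}) portion of the model is fine-tuned on them."""
    intermediates, rewards = [], []
    for _ in range(M):
        x_t = denoise(model, N, N_I, None)       # x=None: start from fresh Gaussian noise
        x_0 = denoise(ref_model, N_I, 0, x_t)    # complete the trajectory for the reward
        intermediates.append(x_t)
        rewards.append(reward_fn(x_0))
    dataset = [x for x, keep in zip(intermediates, accept_fn(rewards)) if keep]
    return finetune_on_intermediate(model, dataset, N_I)

def p_graft_sample(model, ref_model, denoise, N=50, N_I=12):
    """Inference: fine-tuned model for steps N..N_I, then switch to the reference model."""
    x_t = denoise(model, N, N_I, None)
    return denoise(ref_model, N_I, 0, x_t)
```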
We analyze P-GRAFT from a bias-variance tradeoff viewpoint. Let us associate reward \(r(X_0)\) with \(X_t\). As argued in Lemma 4, variance of \(r(X_0)\) conditioned on \(X_t\) increases with \(t\). Consequently, P-GRAFT obtains noisy rewards, seemingly making it less effective than GRAFT. However, we subsequently show that the learning problem itself becomes easier when \(t\) is large since the score function becomes simpler (i.e, the bias reduces). Therefore, we can balance the trade-off between the two by choosing an “appropriate” intermediate time \(t\) for the distributional tilt.
Lemma 4. The expected conditional variance \(\mathbb{E}[\mathsf{Var}(r(X_0)|X_t)]\) is an increasing function of \(t\).
Example: Consider molecule generation, where molecules are generated by a pre-trained diffusion model. The generated molecule can be stable (\(r(X_0) = 1\)) or unstable (\(r(X_0)=0\)). Intuitively, \(X_t\), for \(t < T\), carries more information about \(r(X_0)\) than \(X_T\). We reinforce this claim empirically by giving the following illustrative statistical test. Consider the two hypotheses:
\(H_0: r(X_0)\) is independent of \(X_t \,; \qquad H_1: r(X_0)\) and \(X_t\) are dependent.
Given \(X_t\), we obtain \(100\) rollouts \(X_0^{(i)}|X_t\) for \(1\leq i\leq 100\) and the empirical average \(\hat{r}(X_t) = \sum_{i=1}^{100} r(X_0^{(i)})/100\). If \(r(X_0)\) is independent of \(X_t\) (under \(H_0\)), \(100\,\hat{r}(X_t)\) follows the binomial distribution \(\mathsf{Bin}(100,\theta)\) with \(\theta = \mathbb{P}(r(X_0)=1)\) being the marginal probability of observing a stable molecule. We perform \(1000\) repetitions of the experiment above for various values of \(t\) and plot the empirical distributions in Figure [fig:mol_round]. For \(t = t_{3N/4}\) (when \(X_t\) is close to \(\mathcal{N}(0,\mathbf{I})\)), the distribution is close to the Binomial distribution and for \(t = t_{N/4}\) (when \(X_t\) is close to the target) it is far. That is, \(X_{t_{N/4}}\) already carries a lot of information about \(r(X_0)\). This is further supported by the expected conditional variances reported in Table [tab:exp_cond_variance].
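The rollout diagnostic above is easy to reproduce in outline; the sketch below (ours; `rollout` and `reward_fn` are placeholders for the model's conditional sampler and the stability check) computes the per-state statistic \(\hat{r}(X_t)\) and the matching null distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

def rollout_stat(x_t, rollout, reward_fn, n_roll=100):
    """Empirical average reward over n_roll completions X_0 ~ p(. | X_t)."""
    return np.mean([reward_fn(rollout(x_t)) for _ in range(n_roll)])

def null_distribution(theta, n_roll=100, reps=1000):
    """Law of the same statistic under H0 (r(X_0) independent of X_t): Bin(n_roll, theta)/n_roll."""
    return rng.binomial(n_roll, theta, size=reps) / n_roll

# Usage sketch: collect rollout_stat over many intermediate states at a fixed noise
# level and compare the resulting histogram against null_distribution(theta).
```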
Bias reduces with increasing \(t\): We follow the Stochastic Differential Equation (SDE) framework from [16] for our analysis. Let the target distribution \(q_0\) be the law of accepted samples under P-GRS. Diffusion models consider the forward process to be the Ornstein-Uhlenbeck Process given by \(dX^{f}_t = -X^{f}_t dt + \sqrt{2}dB_t\) where \(X^f_0 \sim q_0\) is drawn from the target distribution over \(\mathbb{R}^d\) and \(B_t\) is the standard Brownian motion in \(\mathbb{R}^d\). It is well known that \(X^f_t \stackrel{d}{=} e^{-t}X^f_0 + \sqrt{1-e^{-2t}}Z\), where \(Z \sim \mathcal{N}(0,\mathbf{I})\) independent of \(X^f_0\).
Let \(q_t\) be the density of the law of \(X^f_t\). Diffusion models learn the score function \([0,T]\times \mathbb{R}^d \ni (t,X) \to \nabla \log q_t(X)\) via score matching (see Appendix 8 for a literature review on score matching). P-GRAFT, in contrast, attempts to learn \(\nabla \log q_s\) only for \(s \in [t,T]\). At time \(T\), \(\nabla \log q_T(X) \approx -X\), the score of the standard Gaussian distribution, which is easy to learn. When \(t=0\), the score \(\nabla \log q_0(X)\) corresponds to the data distribution, which can be very complicated. Diffusion models use Denoising Score Matching, based on Tweedie’s formula introduced by [17]. We show via Bakry-Emery theory [18] that the score function \(\nabla \log q_t(X)\) converges to \(\nabla \log q_{\infty}(X)\) exponentially in \(t\), potentially making the learning easier. Let \(s_{\theta}(X,t):\mathbb{R}^d\times\mathbb{R}^{+}\to \mathbb{R}^d\) be a neural network with parameters \(\theta\); the score matching objective is then given by: \[\mathcal{L} (\theta) = \mathbb{E}\int_0^{T} dt \|\tfrac{X^f_t-e^{-t}X^f_0}{1-e^{-2t} } + s_{\theta}(X^f_t,t)\|^2.\] In practice, \(\mathcal{L}(\theta)\) is approximated with samples. By Tweedie’s formula, we have: \(\mathbb{E}[\frac{X^f_t-e^{-t}X^f_0}{1-e^{-2t} }|X^f_t] = -\nabla \log q_t(X^f_t)\). Thus, for some constant \(\mathsf{C}\), independent of \(\theta\): \[\begin{align} \mathcal{L}(\theta) + \mathsf{C} &= \mathbb{E}\int_{0}^{T}dt \|\nabla \log q_t(X^f_t)-s_{\theta}(X^f_t,t)\|^2 = \int_{0}^{T}dt \int_{\mathbb{R}^d} dX \;q_t(X) \|\nabla \log q_t(X)-s_{\theta}(X,t)\|^2. \end{align}\] As shown by [19], \(\mathcal{L}(\theta)\) directly controls the quality of generations. Note that \(q_{\infty}\) is the density of \(\mathcal{N}(0,\mathbf{I})\) and \(\nabla \log q_{\infty}(X) = -X\). The theorem below is proved in Appendix 11.5.
Theorem 1. For \(s \leq t\), define \(H_s^t = \int_{s}^{t} du\int_{\mathbb{R}^d} dX\, q_u(X)\|\nabla \log q_u(X) -\nabla \log q_{\infty}(X)\|^2\). Then, \[H_t^{T} \leq \frac{e^{-2t}}{1-e^{-2t}}H_0^{t}\]
Therefore, the score functions between time \((t,T)\) are exponentially closer to the simple Gaussian score function compared to the score functions between times \((0,t)\) in the precise sense given in Theorem 1. This means that the score functions at later times should be easier to learn.
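For concreteness, the sample-approximated objective \(\mathcal{L}(\theta)\) above corresponds to a few lines of PyTorch; this is a generic, unweighted denoising score matching loss for the OU forward process (our sketch, not the paper's training code), where `score_net(x, t)` is any network with the stated signature.

```python
import torch

def dsm_loss(score_net, x0, T=5.0, eps=1e-3):
    """Unweighted denoising score matching loss for the OU forward process
    X_t = e^{-t} X_0 + sqrt(1 - e^{-2t}) Z, regressing s_theta(X_t, t) onto the
    conditional score -(X_t - e^{-t} X_0) / (1 - e^{-2t})."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device) * (T - eps) + eps       # avoid t = 0
    a = torch.exp(-t).view(b, *([1] * (x0.dim() - 1)))          # e^{-t}, broadcastable
    var = 1.0 - a ** 2                                           # 1 - e^{-2t}
    z = torch.randn_like(x0)
    xt = a * x0 + var.sqrt() * z
    target = -(xt - a * x0) / var                                # = -z / sqrt(1 - e^{-2t})
    err = score_net(xt, t) - target
    return err.pow(2).flatten(1).sum(dim=1).mean()
```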
In the analysis so far, we have established bias-variance tradeoffs for diffusion models - models which use SDEs to sample from a target distribution. We now extend this analysis to flow models, which use ODEs to sample. Flow models follow a deterministic ODE starting from an initial (random) noise. The bias-variance results from the previous section indicate that, conditioned on the initial noise vector, the variance of reward should be zero, making the learning process potentially easier. Another property of flow models is that because of the deterministic mapping, they admit reversal - this property has been utilized extensively in the literature to map images to ‘noise’ for image editing [20], [21] and as part of the 2-rectification [22] to achieve straighter flows. We will combine these two ideas to develop a framework for improving flow models even without explicit rewards. We now develop this idea from first principles.
We restrict our attention to flow models with optimal transport based interpolation [22], [23], which learn a velocity field \(v(x, t): \mathbb{R}^d \times [0,1] \to \mathbb{R}^d\) such that the following ODE’s solution at time \(t = 1\) has the target distribution \(p^{\text{data}}\): \[\begin{align} \label{eq:fm} \frac{dX_t}{dt}=v(X_t,t),~X_0\sim \mathcal{N}(0,\mathbf{I}). \end{align}\tag{2}\] Note that in the literature for flow models (unlike diffusion models), \(t=0\) corresponds to noise and \(t = 1\) corresponds to the target, a convention we follow in this section.
The errors in the learned model: Suppose we have a pre-trained vector field corresponding to parameters \(\theta\), and solve the ODE 2 with \(v(x,t) = v_{\theta}(x,t)\). Then, \(\mathsf{Law}(X_1) \neq p^{\text{data}}\) due to:
a) Discretization error of the ODE and b) Statistical error due to imperfect learning.
Despite these two errors, the trained ODE is still invertible. We will leverage reversibility to arrive at our algorithm. To this end, consider the time reversal of 2 : \[\begin{align} \label{eq:rev-fm} \frac{dx^{\mathsf{rev}}_t}{dt}=-v_{\theta}(x^{\mathsf{rev}}_t,1-t),~x^{\mathsf{rev}}(0)\sim p^{\text{data}}. \end{align}\tag{3}\]
The Inverse Noise: Consider the forward Euler discretization of 2 with step-size \(\eta\): \[\label{eq:fwd95euler} \hat{x}_{(i+1)\eta} \leftarrow \hat{x}_{i\eta} + \eta v_{\theta}(\hat{x}_{i\eta},i\eta)\,.\tag{4}\] Let \(T_{\theta,\eta}\) be the function which maps \(\hat{x}_0\) to \(\hat{x}_{1}\), i.e., \(\hat{x}_{1} = T_{\theta,\eta}(\hat{x}_0)\). The forward Euler approximation \(T_{\theta,\eta}^{-1}(\hat{x}_1) \approx \hat{y}_1\), where \(\hat{y}_{i\eta} \leftarrow \hat{y}_{(i-1)\eta} - \eta v_{\theta}(\hat{y}_{(i-1)\eta},1-(i-1)\eta)\) with \(\hat{y}_0 = \hat{x}_1\), is not good enough, as noted in the image inversion/editing literature [20], [21], [24]. This is mitigated via numerical and control theoretic techniques. We utilize the ‘backward Euler discretization’ (3 , as used in [21]) to exactly invert 4 . \[\label{eq:rev95euler} \hat{x}^{\mathsf{rev}}_{\eta i} \leftarrow \hat{x}^{\mathsf{rev}}_{\eta(i-1)} - \eta v_{\theta}(\hat{x}^{\mathsf{rev}}_{\eta i},1-\eta(i-1))\tag{5}\] This is an implicit equation since \(\hat{x}^{\mathsf{rev}}_{\eta i}\), which is being calculated on the LHS, also appears on the RHS. It is not a priori clear that this can be solved. Lemma 5 addresses this issue:
Lemma 5. Suppose \(v_{\theta}\) is \(L\) Lipschitz in \(x\) under \(\ell_2\)-norm and \(\eta L < 1\). Then,
(1) \(\hat{x}_{\eta i}^{\mathsf{rev}}\) in 5 has a unique solution which can be obtained by a fixed point method.
(2) \(T_{\theta,\eta}\) is invertible and \(T^{-1}_{\theta,\eta}(x_{0}^{\mathsf{rev}}) = x_1^{\mathsf{rev}}\).
That is, the mapping from noise to data given by the learned, discretized model is invertible. We show some important consequences of this in Lemma 6. Define the following probability distributions. Let \(p^{\text{data}} = \mathsf{Law}(\mathsf{Data})\) (i.e, target data distribution).
\(p_0 = \mathsf{Law}(\hat{x}_0) = \mathcal{N}(0,\mathbf{I})\); \(\quad p_1 = \mathsf{Law}(\hat{x}_1)\); \(\quad p_0^{\mathsf{rev}} = \mathsf{Law}(\hat{x}^{\mathsf{rev}}_0) = p^{\text{data}}\); \(\quad p_1^{\mathsf{rev}} = \mathsf{Law}(\hat{x}_1^{\mathsf{rev}})\).
We call \(p_1^{\mathsf{rev}}\) the inverse noise distribution. With perfect training and \(0\) discretization error, \(p^{\mathsf{rev}}_1 = \mathcal{N}(0,\mathbf{I})\). However, due to these errors \(p_1^{\mathsf{rev}} \neq \mathcal{N}(0,\mathbf{I})\).
Lemma 6. Under the assumption of Lemma 5, \(p_1^{\mathsf{rev}}\), \(p_1\), \(p^{\text{data}}\) and \(p_0 = \mathcal{N}(0,\mathbf{I})\) satisfy:
(1) \((T_{\theta,\eta})_{\#}p_1^{\mathsf{rev}} = p^{\text{data}}\,; \quad\)
(2) \(\mathsf{TV}(p_1^{\mathsf{rev}},p_0) = \mathsf{TV}(p_1,p^{\text{data}})\,;\quad\)
(3) \(\mathsf{KL}(p_0||p_1^{\mathsf{rev}}) = \mathsf{KL}(p_1||p^{\text{data}})\,.\)
That is, the distance between the inverse noise and the true noise is the same as the distance between the generated distribution and the true target distribution. Item (1) shows that if we can sample from the inverse noise distribution \(p_1^{\mathsf{rev}}\), then we can use the pre-trained model \(v_{\theta}(\cdot, \cdot)\) with discretization and obtain samples from the true target \(p^{\text{data}}\). In [25], the authors note that even 2-rectification suffers when the inverse noise \(p_1^{\mathsf{rev}}\) is far from \(\mathcal{N}(0,\mathbf{I})\). While 2-rectification aims to improve computational complexity while maintaining quality by obtaining straighter flows, we introduce inverse noise correction to improve the quality of generations in a sample-efficient way.
Inverse Noise Correction is given in Algorithms 3 and [alg:inv_corr_inf], and illustrated in Figure 2. Given samples from the target distribution, \(\mathcal{D}\), TRAIN_FLOW(\(\mathcal{N}(0,\mathbf{I}), \mathcal{D}\)) trains a rectified flow model between \(\mathcal{N}(0,\mathbf{I})\) and the target distribution [22]. Now, suppose we are given a dataset \(\{X^{(1)},\dots,X^{(M)}\} \sim p^{\text{data}}\) and a trained flow model \(v_{\theta}\) which generates \(\hat{x}_1 \sim p_1\) using 4 starting with \(\hat{x}_0 \sim p_0\). We obtain samples \(X_1^{\mathsf{rev},(i)} \sim p_1^{\mathsf{rev}}\) by backward Euler iteration in 5 . Thereafter, we train another flow model \(v_{\theta^{\mathsf{rev}}}\) which learns to sample from \(p_1^{\mathsf{rev}}\) starting from \(\mathcal{N}(0,\mathbf{I})\).
During inference, we sample \(X_0\sim \mathcal{N}(0,\mathbf{I})\) and obtain a sample \(X_1^{\mathsf{rev}} \sim p_1^{\mathsf{rev}}\) using \(v_{\theta^\mathsf{rev}}\). Once we have the corrected noise sample, we generate images using the original flow model \(v_{\theta}\), which now starts from \(X_1^{\mathsf{rev}}\) instead of \(X_0\). FWD_Euler\((v_{\theta}, \eta, \hat{x}_0)\) obtains \(\hat{x}_1\) via the Euler iteration (4 ). Similarly, BWD_Euler\((v_{\theta},\eta,\hat{x}_0^{\mathsf{rev}},N_b)\) obtains \(\hat{x}_1^{\mathsf{rev}}\) by approximately solving the backward Euler iteration (5 ). They are formally described as Algorithms 4 and 5 in Appendix 9. Theoretical justification along the lines of Section 4.1 is given in Appendix 11.8.
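The inference pipeline itself is only a few lines; the sketch below (ours) composes forward Euler for the Noise Corrector \(v_{\theta^{\mathsf{rev}}}\) and the pre-trained model \(v_{\theta}\), both treated as placeholder callables `v(x, t)` on \(t \in [0,1]\).

```python
import torch

def fwd_euler(v, x0, n_steps):
    """Forward Euler for dx/dt = v(x, t) on t in [0, 1] (Eq. 4)."""
    x, eta = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + eta * v(x, i * eta)
    return x

@torch.no_grad()
def sample_with_noise_correction(v_theta, v_theta_rev, shape, n_corr=100, n_gen=100):
    """Draw x0 ~ N(0, I), push it to the inverse-noise distribution p_1^rev with the
    Noise Corrector, then run the pre-trained flow starting from the corrected noise."""
    x0 = torch.randn(shape)
    x1_rev = fwd_euler(v_theta_rev, x0, n_corr)     # N(0, I)  ->  p_1^rev
    return fwd_euler(v_theta, x1_rev, n_gen)        # p_1^rev  ->  (approximately) p_data
```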
We use the notation P-GRAFT(\(N_I\)) to denote P-GRAFT with intermediate timestep \(N_I\) as described in Algorithms 1 and [alg:p_graft_inf]. For instance, P-GRAFT(\(0.75N\)) would denote instantiating P-GRAFT with \(N_I = 0.75N\), where \(N\) is the total number of denoising steps. Recall that \(t_N\) corresponds to pure noise and \(t_0\) corresponds to a completely denoised sample.
Setup: The objective is to fine-tune a pre-trained model so that generated images better align with prompts. We consider Stable Diffusion v2 [26] as the pre-trained model. The reward model used is VQAScore [27] - a prompt-image alignment score between 0 and 1, with higher scores denoting better prompt alignment. We fine-tune (separately) on GenAI-Bench [28] as well as the train split of T2ICompBench++ [29]. Evaluations are done on GenAI-Bench, the validation split of T2ICompBench++ and GenEval [30]. We use LoRA [31] for compute-efficient fine-tuning. \(\mathsf{Top-K}\) sampling (Section 3) is used for both GRAFT and P-GRAFT. Since LoRA fine-tuning is used, the model switching in [alg:p_graft_inf] can be done by simply turning off the LoRA adapter. More implementation details are given in Appendix 12.
Results: Results are reported in Table [tab:100_10_sampling] - for fine-tuning on GenAI-Bench, we use \(\mathsf{Top-10}\) of \(100\) samples, and on T2ICompBench++, we use \(\mathsf{Top-1}\) of \(4\) samples. First, note that both GRAFT and P-GRAFT outperform base SDv2, SDXL-Base and DDPO. The best performance is obtained for P-GRAFT with \(N_I = 0.25N\) across all evaluations - this clearly shows the bias-variance tradeoff in action. Further, both GRAFT and P-GRAFT also generalize to unseen prompts.
In particular, DDPO did not improve over the baseline even when trained with more samples and FLOPs as compared to GRAFT/P-GRAFT. Experiments with different sets of hyperparameters as well as adding other features such as KL regularization and a per-prompt advantage estimator on top of DDPO also did not show any significant improvements over SDv2 (see Appendix 12.3). We also conduct ablations to further verify the effectiveness of the proposed methods - these include experiments on different values of \((M, K)\) in \(\mathsf{Top-K}\) of \(M\) sampling, different LoRA ranks for fine-tuning as well as a reverse P-GRAFT strategy (where the fine-tuned model is used in the later denoising steps instead of initial steps). We find that P-GRAFT remains effective across different \((M, K)\) and that performance is insensitive to the LoRA rank. Further, P-GRAFT significantly outperforms reverse P-GRAFT. More details on ablations can be found in Appendix 12.1.
The first three result columns correspond to models fine-tuned on GenAI-Bench and the last three to models fine-tuned on the train split of T2ICompBench++; evaluation sets are GenAI-Bench (GenAI), the validation split of T2ICompBench++ (T2I - Val) and GenEval.

| Model | GenAI | T2I - Val | GenEval | GenAI | T2I - Val | GenEval |
|---|---|---|---|---|---|---|
| SD v2 | \(66.87 \scriptstyle{\pm 0.14}\) | \(69.20 \scriptstyle{\pm 0.17}\) | \(73.49 \scriptstyle{\pm 0.41}\) | \(66.87 \scriptstyle{\pm 0.14}\) | \(69.20 \scriptstyle{\pm 0.17}\) | \(73.49 \scriptstyle{\pm 0.41}\) |
| SDXL-Base | \(69.69 \scriptstyle{\pm 0.17}\) | \(72.98 \scriptstyle{\pm 0.16}\) | \(73.90\scriptstyle{\pm 0.40}\) | \(69.69 \scriptstyle{\pm 0.17}\) | \(72.98 \scriptstyle{\pm 0.16}\) | \(73.90\scriptstyle{\pm 0.40}\) |
| DDPO | \(65.70 \scriptstyle{\pm 0.17}\) | \(68.03 \scriptstyle{\pm 0.16}\) | \(72.13 \scriptstyle{\pm 0.37}\) | \(64.65 \scriptstyle{\pm 0.17}\) | \(69.05 \scriptstyle{\pm0.15}\) | \(69.60 \scriptstyle{\pm 0.37}\) |
| GRAFT | \(70.51 \scriptstyle{\pm 0.15}\) | \(75.69 \scriptstyle{\pm 0.13}\) | \(79.85 \scriptstyle{\pm 0.31}\) | \(70.97 \scriptstyle{\pm0.14}\) | \(75.88 \scriptstyle{\pm 0.13}\) | \(79.57 \scriptstyle{\pm 0.30}\) |
| P-GRAFT(\(0.75N\)) | \(69.46 \scriptstyle{\pm 0.15}\) | \(74.51 \scriptstyle{\pm 0.14}\) | \(79.44 \scriptstyle{\pm 0.33}\) | \(69.51 \scriptstyle{\pm 0.15}\) | \(74.30 \scriptstyle{\pm 0.13}\) | \(78.50 \scriptstyle{\pm 0.33}\) |
| P-GRAFT(\(0.5N\)) | \(71.00 \scriptstyle{\pm 0.14}\) | \(75.45 \scriptstyle{\pm 0.14}\) | \(80.60 \scriptstyle{\pm0.31}\) | \(70.73 \scriptstyle{\pm 0.14}\) | \(75.37 \scriptstyle{\pm 0.12}\) | \(79.25 \scriptstyle{\pm 0.30}\) |
| P-GRAFT(\(0.25N\)) | \(\mathbf{71.94} \scriptstyle{\pm 0.14}\) | \(\mathbf{76.12} \scriptstyle{ \pm 0.13}\) | \(\mathbf{80.96} \scriptstyle{\pm 0.29}\) | \(\mathbf{71.42} \scriptstyle{\pm 0.14}\) | \(\mathbf{76.15} \scriptstyle{\pm 0.13}\) | \(\mathbf{80.29} \scriptstyle{\pm 0.30}\) |
Layout generation results: the first Alignment/FID pair corresponds to unconditional generation and the second to class-conditional generation.

| Model | Alignment | FID | Alignment | FID |
|---|---|---|---|---|
| Baseline | \(0.094\) | \(8.32\) | \(0.088\) | \(4.08\) |
| GRAFT | \(0.064\) | \(10.68\) | \(0.068\) | \(5.04\) |
| P-GRAFT(\(0.5N\)) | \(0.071\) | \(9.24\) | \(0.072\) | \(4.55\) |
| P-GRAFT(\(0.25N\)) | \(\mathbf{0.053}\) | \(9.91\) | \(\mathbf{0.064}\) | \(4.67\) |
| Model | Mol: Stability | Sampling Rounds |
|---|---|---|
| Baseline | \(90.50 \scriptstyle{\pm 0.15}\) | - |
| GRAFT | \(90.76 \scriptstyle{\pm 0.20}\) | \(9 \times\) |
| P-GRAFT(\(0.5N\)) | \(90.46 \scriptstyle{\pm 0.27}\) | \(1 \times\) |
| P-GRAFT(\(0.25N\)) | \(\mathbf{92.61} \scriptstyle{\pm 0.13}\) | \(1 \times\) |
Setup: All experiments are done on pre-trained models trained using IGD [32], a discrete-continuous diffusion framework capable of handling both layout generation and molecule generation. For layouts, we experiment with improving the alignment of elements in the generated layout as measured by the alignment metric; the reward is taken as \(1 - \text{alignment}\) since lower values of the metric indicate better alignment. For molecules, the objective is to generate a larger fraction of stable molecules - molecules which are deemed stable are assigned a reward of \(1\) whereas unstable molecules are assigned a reward of \(0\). For molecule generation, we use the de-duplication instantiation of GRAFT/P-GRAFT (Section 3) to ensure diversity of generated molecules - we use RDKit to determine whether two molecules are identical or not. We use PubLayNet [33] for layout generation, and QM9 [34] for molecule generation. To the best of our knowledge, this is the first work which addresses fine-tuning in the context of discrete-continuous diffusion models. Ablations and experimental details are given in Appendix 13.
Results: Results for layout generation are given in Table [tab:lay_gen]. Both P-GRAFT and GRAFT uniformly improve performance across both unconditional and class-conditional generation, with P-GRAFT(\(0.25N\)) giving the best performance. We also report FID scores computed between the generated samples and the test set of PubLayNet - this is a measure of how close the generated samples are to the pre-training distribution. As expected, the baseline has the lowest FID. Note that the FID score for P-GRAFT is smaller than that of GRAFT, indicating that P-GRAFT aligns more closely to the pre-training distribution. For molecule generation, results are given in Table [tab:mol_gen]. Again, the best performance is with P-GRAFT at \(0.25N\). Note that the improvement with GRAFT is marginal, despite being trained on \(9 \times\) the number of samples used for P-GRAFT - this points to the learning difficulty in later denoising steps.
Inverse Noise Correction results: sampling steps are listed separately for the Noise Corrector (16M parameters) and the pre-trained flow model (65M parameters); the two FID columns correspond to the two evaluation datasets at \(256 \times 256\) resolution.

| Noise Corrector Steps | Pre-trained Model Steps | FID | FID | FLOPs/image (\(\times 10^{12}\)) |
|---|---|---|---|---|
| - | \(1000\) | \(11.93\) | \(8.40\) | \(6.869\) |
| - | \(200\) | \(13.39\) | \(8.63\) | \(1.374\) |
| \(100\) | \(100\) | \(8.94\) | \(7.90\) | \(0.903\) |
| \(200\) | \(200\) | \(\mathbf{8.02}\) | \(\mathbf{7.26}\) | \(1.806\) |
Setup: We consider unconditional image generation on CelebA-HQ [35] and LSUN-Church [36] at \(256 \times 256\) resolution. We first train pixel-space flow models from scratch. A training corpus of inverted noise is then generated by running the trained flow models in reverse, employing the backward Euler method, on all samples in the dataset. A second flow model, which we refer to as the Noise Corrector model, is then trained to generate this inverse noise. Once the Noise Corrector is trained, this model is first used to transform standard Gaussian noise to the inverse noise. The pre-trained model then generates samples starting from the inverse noise. FID computed with 50000 generated samples against the dataset is used to measure performance. We emphasize that our goal is not to compete with state-of-the-art (SOTA) models, but rather to demonstrate that our procedure can be used to improve the performance of a given flow model by simply learning the distributional shift of noise at \(t=0\). SOTA models are larger ([26] has \(\approx 300\)M parameters) and are more sophisticated - we do not seek to match their performance.
Results: Table 4 shows that the Noise Corrector significantly improves FID scores across both datasets. Apart from quality gains, the Noise Corrector also allows for faster generation - running the Noise Corrector for \(100\) steps and then running the pre-trained model for \(100\) steps outperforms the pre-trained model with \(1000\) steps. The Noise Corrector has only \(0.25 \times\) the number of parameters, leading to further latency gains as evidenced by the FLOPs counts.
We establish GRAFT, a framework for provably performing PPO with marginal KL for diffusion models through rejection sampling. We then introduce P-GRAFT, a principled framework for intermediate distribution shaping of diffusion models and provide a mathematical justification for this framework. Both GRAFT and P-GRAFT perform well empirically, outperforming policy gradient methods on the text-to-image generation task. Further, both frameworks also extend seamlessly to discrete-continuous diffusion models. Finally, we introduce Inverse Noise Correction, a strategy to improve flow models even without explicit rewards and demonstrate significant quality gains even with lower FLOPs/image.
Policy Gradient Methods: The majority of the existing literature on policy gradient methods in the context of generative modeling draws inspiration from Proximal Policy Optimization (PPO) [1] and REINFORCE [37]. PPO-based methods in the context of language modeling include [2], [3], [14], [38], whereas frameworks based on REINFORCE include [39]–[42]. Policy gradient methods have also been studied in the context of fine-tuning diffusion models [4], [5], [43].
Offline Fine-Tuning Methods: Algorithms which utilize offline preference datasets for fine-tuning generative models have also been widely studied. In the context of language modeling, these include methods like SLiC [44], DPO [45] and SimPO [46]. Such methods have also been explored in the context of diffusion models - these include Diffusion-DPO [47] and Diffusion-KTO [48].
Rejection Sampling Methods: Recently, many works have explored rejection sampling methods in the context of autoregressive models - these include RSO [9], RAFT [10] and Reinforce-Rej [11]. In particular, Reinforce-Rej demonstrated that rejection sampling methods can match or even outperform policy gradient methods.
Fine-Tuning Diffusion Models: Apart from the policy gradient methods discussed already, a host of other methods have also been proposed for fine-tuning diffusion models. Direct reward backpropagation methods include DRaFT [49] and VADER [50]. Note that these methods assume access to a differentiable reward. [8] approaches the problem from the lens of entropy-regularized control - however, the method is computationally heavy and requires gradient checkpointing as well as optimizing an additional neural SDE. [7] proposes a memoryless forward process to overcome the initial value function bias problem for the case of ODEs. PRDP [6] formulates a supervised learning objective whose optimum matches with the solution to PPO, but with trajectory KL constraint - the supervised objective, with clipping, was found to make the training stable as compared to DDPO.
Score Matching: Score matching for distribution estimation was first introduced in [51]. The algorithm used in this case is called Implicit Score Matching. Diffusion models primarily use Denoising Score Matching (DSM), which is based on Tweedie’s formula [17], [52]. The sample complexity of DSM has been extensively studied in the literature [53]–[56]. Many alternative procedures such as Sliced Score Matching [57] and Target Score Matching [58] have been proposed.
ODE Reversal in Flow Models: A prominent use case of ODE reversal in flow models is that of image editing [20], [21], [59]–[62]. The reverse ODE has also been used to achieve straighter flows, allowing for faster generation, through 2-rectification/reflow algorithm [22], [63]–[65]. Notably, concurrent work [66] also proposes a strategy for aligning distilled models by fine-tuning at the noise level.
In the backward Euler Algorithm 5, at each time instant \(j\) in the reverse procedure, we solve a fixed point equation to obtain a high-precision solution of Eq. 5 . The step-size \(\eta\) is tuned empirically so that the recursion does not blow up. Once the step-size is carefully tuned, the iteration converges to the solution at an exponential rate. In practice, we observed that \(N_b=10\) is sufficient to obtain satisfactory results.
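A minimal sketch of this inner fixed-point solve (ours; `v_theta` is a placeholder velocity field): each reverse step iterates \(x \leftarrow \hat{x}^{\mathsf{rev}}_{\eta(i-1)} - \eta\, v_{\theta}(x, 1-\eta(i-1))\) for \(N_b\) rounds, which converges geometrically when \(\eta L < 1\) (Lemma 5).

```python
import torch

@torch.no_grad()
def bwd_euler(v_theta, x_data, n_steps, n_b=10):
    """Invert forward Euler exactly via the implicit update of Eq. 5: at each reverse
    step, solve x_new = x_prev - eta * v(x_new, 1 - eta*(i-1)) by fixed-point iteration."""
    eta = 1.0 / n_steps
    x_prev = x_data
    for i in range(1, n_steps + 1):
        t = 1.0 - eta * (i - 1)
        x_new = x_prev.clone()              # initialize the inner iteration at x_prev
        for _ in range(n_b):
            x_new = x_prev - eta * v_theta(x_new, t)
        x_prev = x_new
    return x_prev                            # approximately x_1^rev ~ p_1^rev
```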
While instantiations of GRAFT are well-known in the literature and are straightforward to implement, we provide the exact algorithm here for the sake of completeness.
Proof. Let \(B\) be any measurable set. Consider the following probability measure: \[\begin{align} \mathbb{P}(X^{(1)} \in B | X^{(1)} \in \mathcal{A}). \end{align}\] Using Bayes’ rule, this measure can be rewritten as: \[\begin{align} \mathbb{P}(X^{(1)} \in B | X^{(1)} \in \mathcal{A}) &= \frac{\mathbb{P}(X^{(1)} \in B , X^{(1)} \in \mathcal{A})}{\mathbb{P}(X^{(1)} \in \mathcal{A})}. \end{align}\] Recall that \(X^{(1)}\) is drawn from the distribution \(\bar{p}\). Then, from the definition of \(\mathbb{P}(X^{(1)} \in B , X^{(1)} \in \mathcal{A})\), we have: \[\begin{align} \mathbb{P}(X^{(1)} \in B , X^{(1)} \in \mathcal{A}) &= \int_B \mathbb{P}(X^{(1)} \in \mathcal{A} | X^{(1)} = x ) d\bar{p}(x). \end{align}\] From Definition 1, we know that \(X^{(1)} \in \mathcal{A}\) iff \(C_1 = 1\). Therefore: \[\begin{align} \mathbb{P}(X^{(1)} \in B , X^{(1)} \in \mathcal{A}) &= \int_B \mathbb{P}(C_1 = 1 | X^{(1)} = x ) d\bar{p}(x) \\ &= \int_B \mathbb{E} \left[ \mathbb{1}( C_1 = 1) | X^{(1)} = x \right] d\bar{p}(x) \end{align}\] where \(\mathbb{1}(\cdot)\) denotes the indicator function. Using the tower property of expectations, this can be rewritten as: \[\begin{align} \mathbb{P}(X^{(1)} \in B , X^{(1)} \in \mathcal{A}) &= \int_B \mathbb{E} \left[ \mathbb{E} \left[ ( \mathbb{1}( C_1 = 1) | X^{(1)} = x, X^{(2)}, \dots, X^{(M)} ) \right] \bigr| X^{(1)} = x \right] d\bar{p}(x) \\ &= \int_B \mathbb{E} \left[ \mathbb{P} \left( C_1 = 1 | X^{(1)} = x, X^{(2)}, \dots, X^{(M)} \right) \bigr| X^{(1)} = x \right] d\bar{p}(x). \end{align}\] Note that in the conditional expectation here, \(X^{(1)}, \dots, X^{(M)}\), are distributed according to \(\bar{p}\) since \(\{X^{(j)}\}_{j=1}^{M}\) are i.i.d. samples. Again, from Definition 1, we know that \[\mathbb{P}(C_1 = 1 | X^{(1)} = x, \{X^{(j)}\}_{j=2}^{M}) = A(r(x), \hat{F}_R(r(x)), x, \hat{P}_X)\] where \(\hat{F}_R\) and \(\hat{P}_X\) are computed using the samples \(\{X^{(j)}\}_{j=1}^{M}\). From the definition of the Radon-Nikodym derivative, the distribution of the accepted samples can therefore be written as: \[\begin{align} \label{app:eqn:rn95derivative} \bar{p}^a(x) = Z_1 \mathbb{E} \left[ A(r(x), \hat{F}_R(r(x)), x, \hat{P}_X) | X^{(1)} = x \right] \bar{p}(x) \end{align}\tag{6}\] where \(Z_1 = 1/\mathbb{P}(X^{(1)} \in \mathcal{A})\) is a normalizing constant independent of \(x\). Now, from the method of Lagrangian Multipliers, as mentioned in Section 2, the solution to the PPO optimization objective with reward function \(\hat{r}(\cdot)\) is given by: \[\begin{align} \label{app:eqn:ppo95solution} {p}^{\text{ppo}}(x) = Z_2 \exp\left(\frac{\hat{r}(x)}{\alpha}\right) \bar{p}(x) \end{align}\tag{7}\] where \(Z_2\) is the normalization constant. Comparing 6 and 7 , \(\bar{p}^{a} = p^{\text{ppo}}\) whenever: \[\begin{align} \frac{\hat{r}(x)}{\alpha} &= \log \left( \mathbb{E} \left[ A(r(x), \hat{F}_R(r(x)), x, \hat{P}_X) | X^{(1)} = x \right] \right) \end{align}\] ◻
Substituting \(A(\cdot)\) in: \[\begin{align} \log \left( \mathbb{E}_{\{X^{(j)} \}_{j=2}^{M}} \left[A(r(x), \hat{F}_R(r(x)), x, \hat{P}_X) \,\big|\, X^{(1)} = x\right] \right) \end{align}\] we get: \[\begin{align} \log \left( \int_{ \{ X^{(j)} \}_{j = 2}^{M} } \mathbb{1} (r(x) \in \mathsf{Top-K}(r(x), r(x^{(2)}), \dots, r(x^{(M)})))d\bar{p}(x^{(2)})\dots d\bar{p}(x^{(M)}) \right) \end{align}\] where \(\mathsf{Top-K} (r(x), r(x^{(2)}), \dots, r(x^{(M)}))\) denotes the top-\(K\) samples in \(\{ r(x), r(x^{(2)}), \dots, r(x^{(M)}) \}\). Let \(U_K\) denote the event where \(X^{(1)} = x\) ranks in the top-\(K\) among the \(M\) samples, where the other \(M-1\) samples are i.i.d. from \(\bar{p}\). This event can be decomposed as: \[\begin{align} U_K = \cup_{k = 1}^K E_k \end{align}\] where \(E_k\) denotes the event where \(r(x)\) ranks \(k^\text{th}\) in the descending ordering of rewards. Further, note that \(\{ E_k \}\) are mutually exclusive events. Therefore: \[\begin{align} \mathbb{P} (U_K ) = \sum_{k = 1}^K \mathbb{P} (E_k ) \end{align}\]
If \(x\) ranks \(k ^\text{th}\) when ranked in terms of rewards, there are \(k-1\) samples which have higher rewards than \(x\) and \(M-k\) samples which have lower rewards than \(x\). Thus, the required probability can be computed by finding the probability of having \(k-1\) samples with higher rewards and the rest with lower rewards. Note that the ordering within the \(k-1\) group or the \(M-k\) group doesn’t matter. The probability of any one sample having a higher reward than \(r(x)\) is \(1 - F(r({x}))\) and having a lower reward is \(F(r({x}))\). Therefore, the required probability can be computed as: \[\begin{align} \mathbb{P} (E_k) = \binom{M-1}{k-1} (1 - F(r({x})))^{k-1} (F(r({x})))^{M-k} \end{align}\] And hence: \[\begin{align} \mathbb{P} (U_K) &= \sum_{k = 0} ^{K-1} \binom{M-1}{k} (1 - F(r({x})))^{k} (F(r({x})))^{M-k-1} \end{align}\] Therefore: \[\begin{align} \frac{\hat{r}(x)}{\alpha} &= \log \left( \sum_{k = 0} ^{K-1} \binom{M-1}{k} (1 - F(r({x})))^{k} (F(r({x})))^{M-k-1} \right) \end{align}\]
It is straightforward to check that this is an increasing function in \(r\).
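As an illustrative sanity check (ours, not part of the paper), the closed-form \(\mathbb{P}(U_K)\) can be compared against a direct Monte Carlo simulation with \(\mathrm{Uniform}(0,1)\) rewards, for which \(F(r) = r\):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)

def p_topk_closed_form(F_r, M, K):
    # P(U_K) = sum_{k=0}^{K-1} C(M-1, k) (1 - F)^k F^{M-k-1}
    return sum(comb(M - 1, k) * (1 - F_r) ** k * F_r ** (M - k - 1) for k in range(K))

def p_topk_monte_carlo(F_r, M, K, reps=200_000):
    # With Uniform(0,1) rewards, F(r) = r; count how often a reward equal to F_r
    # has fewer than K competitors above it among M - 1 i.i.d. rewards.
    n_above = (rng.random((reps, M - 1)) > F_r).sum(axis=1)
    return (n_above < K).mean()

M, K, F_r = 20, 3, 0.8
print("closed form :", round(p_topk_closed_form(F_r, M, K), 4))
print("monte carlo :", round(p_topk_monte_carlo(F_r, M, K), 4))
```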
Substituting \(A(\cdot)\) in: \[\begin{align} \log \left( \mathbb{E}_{X^{(2)}} \left[ A(r(x), \hat{F}_R(r(x)), x, \hat{P}_X) \,\big|\, X^{(1)} = x \right] \right) \end{align}\] we get: \[\begin{align} \log \left( \int_{X^{(2)}} \mathbb{1}(r(x^{(2)}) \leq r(x)) d \bar{p}(x^{(2)}) \right) = \log F(r(x)) \end{align}\]
Proof. Let \(B\) be any measurable set. Consider the following probability measure: \[\begin{align} \mathbb{P}(X^{(1)}_t \in B | X^{(1)}_t \in \mathcal{A}). \end{align}\] Using Bayes’ rule, this measure can be rewritten as: \[\begin{align} \mathbb{P}(X^{(1)}_t \in B | X^{(1)}_t \in \mathcal{A}) &= \frac{\mathbb{P}(X^{(1)}_t \in B , X^{(1)}_t \in \mathcal{A})}{\mathbb{P}(X^{(1)}_t \in \mathcal{A})}. \end{align}\]
Recall that \(X^{(1)}_t\) is drawn from the distribution \(\bar{p}_t\). Then, from the definition of \(\mathbb{P}(X^{(1)}_t \in B , X^{(1)}_t \in \mathcal{A})\), we have: \[\begin{align} \mathbb{P}(X^{(1)}_t \in B , X^{(1)}_t \in \mathcal{A}) &= \int_B \mathbb{P}(X^{(1)}_t \in \mathcal{A} | X^{(1)}_t = x ) d\bar{p}_t(x). \end{align}\] From Definition 2, we know that \(X^{(1)}_t \in \mathcal{A}\) iff \(C_1 = 1\). Therefore: \[\begin{align} \mathbb{P}(X^{(1)}_t \in B , X^{(1)}_t \in \mathcal{A}) &= \int_B \mathbb{P}(C_1 = 1 | X^{(1)}_t = x ) d\bar{p}_t(x) \\ &= \int_B \mathbb{E} \left[ \mathbb{1}( C_1 = 1) | X_t^{(1)} = x \right] d\bar{p}_t(x) \end{align}\] where \(\mathbb{1}(\cdot)\) denotes the indicator function. Using the tower property of expectations, this can be rewritten as: \[\begin{align} \mathbb{P}(X_t^{(1)} \in B , X_t^{(1)} \in \mathcal{A}) &= \int_B \mathbb{E} \left[ \mathbb{E} \left[ ( \mathbb{1}( C_1 = 1) | X_t^{(1)} = x, X_0^{(1)}, X_0^{(2)}, \dots, X_0^{(M)} ) \right] \bigr| X_t^{(1)} = x \right] d\bar{p}_t(x) \\ &= \int_B \mathbb{E} \left[ \mathbb{P} \left( C_1 = 1 | X_t^{(1)} = x, X_0^{(1)}, X_0^{(2)}, \dots, X_0^{(M)} \right) \bigr| X_t^{(1)} = x \right] d\bar{p}_t(x). \end{align}\] Note that in the conditional expectation here, \(X_0^{(2)}, \dots, X_0^{(M)}\), are distributed according to \(\bar{p}_0\) since \(\{X_0^{(j)}\}_{j=2}^{M}\) are i.i.d. samples. However, \(X_0^{(1)}\) is distributed according to \(\bar{p}_{0|t}\) because of the conditioning on \(X_t^{(1)}\). Again, from Definition 2, we know that \[\mathbb{P}(C_1 = 1 | X_t^{(1)} = x, \{X_0^{(j)}\}_{j=1}^{M}) = A(r(X_0^{(1)}), \hat{F}_R(r(X_0^{(1)})), X_0^{(1)}, \hat{P}_X )\] where \(\hat{F}_R\) and \(\hat{P}_X\) are computed using the samples \(\{X_0^{(j)}\}_{j=1}^{M}\). From the definition of the Radon-Nikodym derivative, the density of the accepted samples can therefore be written as: \[\begin{align} \label{app:eqn:rn95derivative2} \bar{p}^a_{t}(x) = Z_1 \mathbb{E} \left[ A(r(X_0^{(1)}), \hat{F}_R(r(X_0^{(1)})), X_0^{(1)}, \hat{P}_X) \bigr| X_t^{(1)} = x \right] \bar{p}_t(x) \end{align}\tag{8}\] where \(Z_1 = 1/\mathbb{P}(X_t^{(1)} \in \mathcal{A})\) is a normalizing constant independent of \(x\). Now, from the method of Lagrangian Multipliers, as mentioned in Section 2, the solution to the PPO optimization objective (with reward function \(\hat{r}(\cdot)\)) is (where \(Z_2\) is the normalization constant): \[\begin{align} \label{app:eqn:ppo95solution2} {p}^{\text{ppo}}(x) = Z_2 \exp\left(\frac{\hat{r}(x)}{\alpha}\right) \bar{p}_t(x). \end{align}\tag{9}\] Comparing 8 and 9 , \(\bar{p}^{a}_t = p^{\text{ppo}}\) whenever: \[\begin{align} \frac{\hat{r}(x)}{\alpha} &= \log \left( \mathbb{E} \left[ A(r(X_0^{(1)}), \hat{F}_R(r(X_0^{(1)})), X_0^{(1)}, \hat{P}_X) \bigr| X_t^{(1)} = x \right] \right) \end{align}\] ◻
Proof. Let \(s > t\). Note that \(X_s \to X_t \to X_0\) forms a Markov chain. By the law of total variance, we have for any random variables \(Y,Z\): \[\begin{align} \mathsf{Var}(Z) &= \mathbb{E}\mathsf{Var}(Z|Y) + \mathsf{Var}(\mathbb{E}[Z|Y]) \nonumber \\ &\geq \mathbb{E}\mathsf{Var}(Z|Y) \end{align}\]
Conditioned on \(X_s\), let \((Z,Y)\) be jointly distributed as \((r(X_0),X_t)\). Then, we have, \(X_s\)-almost surely:
\[\begin{align} \mathsf{Var}(r(X_0)|X_s) \geq \mathbb{E}[\mathsf{Var}(r(X_0)|X_s,X_t)|X_s] = \mathbb{E}[\mathsf{Var}(r(X_0)|X_t)|X_s] \end{align}\] In the last line, we have used the Markov property to show that the law of \(r(X_0)|X_s,X_t\) is the same as the law of \(r(X_0)|X_t\) almost surely. We conclude the result by taking expectation over both the sides. ◻
Proof. We will follow the exposition in [67] for our proofs. \(q_t\) converges to \(q_{\infty}\) as \(t \to \infty\). By [67] applied to the forward process, we conclude that:
\[\frac{d}{dt}\mathsf{KL}(q_t||q_{\infty}) = -\int_{\mathbb{R}^d} dX q_t(X)\|\nabla \log q_t(X) -\nabla \log q_{\infty}(X)\|^2\]
\[\label{eq:score95eq} \implies \int_{t}^{T} ds\int_{\mathbb{R}^d} dX\, q_s(X)\|\nabla \log q_s(X) -\nabla \log q_{\infty}(X)\|^2 = \mathsf{KL}(q_t||q_{\infty}) - \mathsf{KL}(q_T||q_{\infty})\tag{10}\] For brevity, we denote the LHS by \(H_t^{T}\). Clearly, \[H_t^{T} - e^{-2t}H_0^{T} = \mathsf{KL}(q_t||q_{\infty}) - e^{-2t}\mathsf{KL}(q_0||q_{\infty}) + \mathsf{KL}(q_T||q_{\infty})(e^{-2t}-1)\,.\] Notice that \(q_{\infty}\) is the density of the standard Gaussian random variable. Therefore, it satisfies the Gaussian Logarithmic Sobolev inequality [68]. Thus, we can apply [67] to conclude that for every \(s \geq 0\), \(\mathsf{KL}(q_s||q_{\infty}) \leq e^{-2s}\mathsf{KL}(q_0||q_{\infty})\). The right hand side above is therefore non-positive: \(\mathsf{KL}(q_t||q_{\infty}) \leq e^{-2t}\mathsf{KL}(q_0||q_{\infty})\) by the decay estimate, and \(\mathsf{KL}(q_T||q_{\infty})(e^{-2t}-1) \leq 0\) since \(e^{-2t} \leq 1\). Hence \(H_t^{T} \leq e^{-2t}H_0^{T} = e^{-2t}(H_0^{t} + H_t^{T})\), and rearranging gives \[H_t^{T} \leq \frac{e^{-2t}}{1-e^{-2t}}H_0^{t}\] ◻
The uniqueness and the convergence of fixed point iteration for implicit Euler methods have been established under great generality in [69]. However, we give a simpler proof for our specialized setting here.
Consider the update for the backward Euler iteration at each time step \(t=\eta i\) \[\begin{align} \hat{x}^{\mathsf{rev}}_{\eta i} \rightarrow \hat{x}^{\mathsf{rev}}_{\eta(i-1)} - \eta v_{\theta}(\hat{x}^{\mathsf{rev}}_{\eta i},1-\eta(i-1)) \end{align}\] Let us define an operator \(T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}: \mathbb{R}^d\to \mathbb{R}^d\) such that \[\begin{align} T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}(x)=\hat{x}^{\mathsf{rev}}_{\eta(i-1)} - \eta v_{\theta}(x,1-\eta(i-1)) \end{align}\] First, we will show that \(T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}\) as defined above is a contractive operator under the condition \(\eta L<1\). Then, one can use the Banach fixed point theorem to establish uniqueness of the solution and obtain the solution through fixed point iteration. To this end, consider two points \(x_1\) and \(x_2\) in \(\mathbb{R}^d\) and apply \(T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}\) to them \[\begin{align} \left\|T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}(x_1)-T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}(x_2)\right\|_2 &= \eta\|v_{\theta}(x_1,1-\eta(i-1))-v_{\theta}(x_2,1-\eta(i-1))\|_2\\ &\leq \eta L\|x_1-x_2\|_2. \end{align}\] Since \(\eta L<1\), we conclude that \(T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}\) is a contractive operator. Thus, by the Banach fixed point theorem, the fixed point equation \(T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}(x)=x\) has a unique solution for each step \(t=\eta i\). To obtain the solution to the backward Euler update, we use the Banach fixed point method, i.e., start with \(x_{(0)}=\hat{x}^{\mathsf{rev}}_{\eta(i-1)}\) (or any arbitrary point in \(\mathbb{R}^d\)) and run the iteration \(x_{(k+1)}=T_{\theta, \eta}^{\hat{x}^{\mathsf{rev}}_{\eta(i-1)}}(x_{(k)})\). Then, \(\lim_{k\to \infty}x_{(k)}=\hat{x}^{\mathsf{rev}}_{\eta i}\).
The invertibility of the operator \(T_{\theta,\eta}\) follows directly from the previous part. Since the solution for the backward Euler method is unique at each time step \(t=\eta i\), it implies that there exists a one-to-one mapping between sample points \(x_0^{\mathsf{rev}}\) and \(x_1^{\mathsf{rev}}\).
Before starting the proof of this lemma, we will state the following well-known theorem from information theory.
Theorem 2. [Data Processing Inequality] Let \(\mathcal{X}\) and \(\mathcal{Y}\) be two sample spaces. Denote \(\mathcal{P}(\mathcal{X})\) and \(\mathcal{P}(\mathcal{Y})\) as the set of all possible probability distributions on \(\mathcal{X}\) and \(\mathcal{Y}\), respectively. Let \(P_X, Q_X\in \mathcal{P}(\mathcal{X})\) and \(P_{Y|X}\) be a transition kernel. Denote \(P_Y\) and \(Q_Y\) to be the corresponding push-forwards, i.e., \(P_Y(B)=\int_{\mathcal{X}}P_{Y|X}(B|X=x)dP_X(x)\) and similarly for \(Q_Y\). Then, for any \(f\)-divergence we have \[\begin{align} \label{eq:DPI} D_f(P_X||Q_X)\geq D_f(P_Y||Q_Y) \end{align}\tag{11}\]
By part (2) of Lemma 5, we have that \(x_1^{\mathsf{rev}} = T_{\theta,\eta}^{-1}(x_0^{\mathsf{rev}}) \implies T_{\theta,\eta}(x_1^{\mathsf{rev}}) = x_0^{\mathsf{rev}}\). Suppose \(x_0^{\mathsf{rev}} \sim p^{\text{data}}\); then, by definition, \(x_1^{\mathsf{rev}} \sim p_1^{\mathsf{rev}}\). This concludes the result.
Recall that the \(\mathsf{TV}\) distance is an \(f\)-divergence. Furthermore, \(T_{\theta, \eta}\) is the push-forward map from \(p_0\) and \(p_1^{\mathsf{rev}}\) to \(p_1\) and \(p^{\text{data}}\), respectively. Thus, using the DPI (Theorem 2), we have \[\begin{align} \mathsf{TV}(p_1^{\mathsf{rev}},p_0)\geq \mathsf{TV}(p^{\text{data}},p_1). \end{align}\] Additionally, \(T_{\theta, \eta}\) is an invertible mapping. Hence, \(T_{\theta, \eta}^{-1}\) can also be viewed as the push-forward map from \(p_1\) and \(p^{\text{data}}\) to \(p_0\) and \(p_1^{\mathsf{rev}}\), respectively. Thus, again using the DPI (Theorem 2), we get \[\begin{align} \mathsf{TV}(p^{\text{data}},p_1)\geq \mathsf{TV}(p_1^{\mathsf{rev}},p_0). \end{align}\] Combining both the bounds, we get the desired claim.
KL divergence is also a valid \(f\)-divergence. Thus, repeating the arguments from the previous part, one gets the desired equality.
In this section, our goal is to provide a theoretical justification for inverse noise correction in the context of flow models. Specifically, we will argue that if \(\mathsf{KL}(p^X||\mathcal{N}(0, \boldsymbol{I}))\) is small, then it is less challenging to learn the score function corresponding to \(p_t\), and thereby the velocity field \(v^X_t\) governing the rectified flow. To this end, let \(X\) be a sample from a distribution \(p^X\), and let \(Z, Y\) be standard normal random variables, all independent of each other. Consider the following two linear interpolations: \[\label{eq:rect_flow} \begin{align} X_t=tX+(1-t)Z\\ Y_t=tY+(1-t)Z. \end{align}\tag{12}\] Denote \(p_t\) and \(q_t\) as the distributions of \(X_t\) and \(Y_t\), respectively. Then, it is easy to verify that they satisfy the following continuity equations: \[\begin{align}\label{eq:cont_eq} \dot{p_t}+\nabla\cdot(v_t^Xp_t)=0\\ \dot{q_t}+\nabla\cdot(v_t^Yq_t)=0 \end{align}\tag{13}\] where \(v_t^X(x)=\mathbb{E}[X-Z|X_t=x]\) and \(v_t^Y(x)=\mathbb{E}[Y-Z|Y_t=x]\). Then, we have the following theorem, which expresses the KL divergence between \(p_1\) and \(q_1\) in terms of the velocities \(v_t^X\) and \(v_t^Y\). The proof of the theorem is provided in Section 11.9.
Theorem 3. Let \(p_t\) and \(q_t\) be the distributions of \(X_t\) and \(Y_t\) defined in 12 . Then, the KL divergence between \(p_1\) and \(q_1\) satisfies the following relation \[\begin{align} \label{eq:kl-rectflow} \mathsf{KL}(p_1||q_1)=\mathsf{KL}(p^X||\mathcal{N}(0, \mathbf{I}))&=\int_0^1\frac{t}{1-t}\int_{\mathbb{R}^d}p_t(x)\left\|v_t^X(x)-v_t^Y(x)\right\|^2 dx dt. \end{align}\tag{14}\]
Now, consider the distribution of the inverse noise \(p_1^{\mathsf{rev}}\) obtained by iterating 5 and substitute it for \(p^X\) in the theorem above. Suppose that the flow model is trained such that \(\mathsf{KL}(p^{\text{data}}||p_1)\leq \epsilon\). Then, by Lemma 6 it follows that \(\mathsf{KL}(p_1^{\mathsf{rev}}||p_0)\leq \epsilon\). Combining this observation with 14 , it is easy to see that the velocities \(v_t^X(x)\) and \(v_t^Y(x)\) must be close to each other. Additionally, since \(q_t\) simply corresponds to learning a flow model from the standard Gaussian to itself, we can explicitly compute \(v_t^Y\) as follows: \[\begin{align} v_t^Y(x)&=\frac{x}{t}+\frac{1-t}{t}\frac{-x}{(1-t)^2+t^2}\\ &=\frac{x(2t-1)}{(1-t)^2+t^2}. \end{align}\] Thus, \(v_t^Y(x)\) is a linear function of \(x\) and a rational function of \(t\). Because \(\mathsf{KL}(p_1^{\mathsf{rev}}||p_0)\leq \epsilon\), Theorem 3 suggests that learning \(v_t^X\) from data should be relatively easy, as it is close to \(v_t^Y\).
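As a quick numerical sanity check (ours, not part of the original derivation), the following snippet verifies that \(x/t+\tfrac{1-t}{t}\nabla\log q_t(x)\) with \(q_t=\mathcal{N}(0,((1-t)^2+t^2)\boldsymbol{I})\) indeed reduces to the closed form above:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    t = rng.uniform(0.05, 0.95)
    x = rng.normal(size=3)
    s2 = (1 - t) ** 2 + t ** 2                  # variance of Y_t = tY + (1 - t)Z
    score = -x / s2                             # grad log q_t(x) for N(0, s2 * I)
    v_from_score = x / t + (1 - t) / t * score  # x/t + ((1 - t)/t) grad log q_t(x)
    v_closed_form = x * (2 * t - 1) / s2
    assert np.allclose(v_from_score, v_closed_form)
```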
Then, the time derivative of the KL-divergence between \(p_t\) and \(q_t\) is given by \[\begin{align} \frac{d\mathsf{KL}(p_t||q_t)}{dt}&=\int_{\mathbb{R}^d}\frac{d}{dt}\left(p_t(x)\log\left(\frac{p_t(x)}{q_t(x)}\right)\right)dx\\ &=\int_{\mathbb{R}^d}\left(\dot{p}_t(x)\log\left(\frac{p_t(x)}{q_t(x)}\right)+p_t(x)\frac{d}{dt}\log(p_t(x))-p_t(x)\frac{d}{dt}\log(q_t(x))\right)dx\\ &=\int_{\mathbb{R}^d}\left(\dot{p}_t(x)\log\left(\frac{p_t(x)}{q_t(x)}\right)+\dot{p}_t(x)-p_t(x)\frac{\dot{q}_t(x)}{q_t(x)}\right)dx \end{align}\] We denote the three terms by \(T_1, T_2\) and \(T_3\) and consider each separately. For \(T_1\), using the continuity equation, we have \[\begin{align} T_1&=\int_{\mathbb{R}^d}\dot{p}_t(x)\log\left(\frac{p_t(x)}{q_t(x)}\right)dx\\ &=-\int_{\mathbb{R}^d}\nabla\cdot(v_t^X(x)p_t(x))\log\left(\frac{p_t(x)}{q_t(x)}\right)dx\\ &=\int_{\mathbb{R}^d}p_t(x)\left\langle v_t^X(x), \nabla\log\left(\frac{p_t(x)}{q_t(x)}\right)\right\rangle dx\end{align}\] Note that \(\int_{\mathbb{R}^d}p_t(x)dx=1\) for all \(t\in [0, 1]\). Thus for \(T_2\), we obtain \[\begin{align} T_2=\int_{\mathbb{R}^d}\dot{p}_t(x)dx=\frac{d}{dt}\int_{\mathbb{R}^d}p_t(x)dx=\frac{d}{dt}1=0. \end{align}\] For the final term \(T_3\), we again use the continuity equation to get \[\begin{align} T_3&=-\int_{\mathbb{R}^d}p_t(x)\frac{\dot{q}_t(x)}{q_t(x)}dx\\ &=\int_{\mathbb{R}^d}p_t(x)\frac{\nabla\cdot(v_t^Yq_t)}{q_t(x)}dx\\ &=-\int_{\mathbb{R}^d}q_t(x)\left\langle\nabla\left(\frac{p_t(x)}{q_t(x)}\right), v_t^Y(x)\right\rangle dx\\ &=-\int_{\mathbb{R}^d}p_t(x)\left\langle\nabla\log\left(\frac{p_t(x)}{q_t(x)}\right), v_t^Y(x)\right\rangle dx. \end{align}\] Combining all the terms above, we get \[\begin{align} \label{eq:kl-derv1} \frac{d\mathsf{KL}(p_t||q_t)}{dt}&=\int_{\mathbb{R}^d}p_t(x)\left\langle\nabla\log\left(\frac{p_t(x)}{q_t(x)}\right), v_t^X(x)-v_t^Y(x)\right\rangle dx. \end{align}\tag{15}\] To obtain an expression for the score function in terms of the velocity field, we use Tweedie’s formula [70], which leads to \[\begin{align} \label{eq:vel_x} \mathbb{E}[X-Z|X_t=x]&=\frac{1}{1-t}\mathbb{E}[X-X_t|X_t=x]\nonumber\\ &=\frac{1}{1-t}\mathbb{E}[X|X_t=x]-\frac{x}{1-t}\nonumber\\ &=\frac{1}{t(1-t)}\left(x+(1-t)^2\nabla \log p_t(x)\right)-\frac{x}{1-t}\nonumber\\ &=\frac{x}{t}+\frac{1-t}{t}\nabla \log p_t(x). \end{align}\tag{16}\] Similarly, we obtain \[\begin{align} \label{eq:vel_y} v_t^Y(x)=\mathbb{E}[Y-Z|Y_t=x]=\frac{x}{t}+\frac{1-t}{t}\nabla \log q_t(x). \end{align}\tag{17}\] Plugging the expressions for the score functions into 15 , we obtain \[\begin{align} \frac{d\mathsf{KL}(p_t||q_t)}{dt}&=\int_{\mathbb{R}^d}p_t(x)\left\langle \frac{t}{1-t}\left(v_t^X(x)-v_t^Y(x)\right), v_t^X(x)-v_t^Y(x)\right\rangle dx\\ &=\frac{t}{1-t}\int_{\mathbb{R}^d}p_t(x)\left\|v_t^X(x)-v_t^Y(x)\right\|^2 dx\\ \implies \mathsf{KL}(p_1||q_1)-\mathsf{KL}(p_0||q_0)&=\int_0^1\frac{t}{1-t}\int_{\mathbb{R}^d}p_t(x)\left\|v_t^X(x)-v_t^Y(x)\right\|^2 dx dt. \end{align}\] Recall that \(p_0=q_0=q_1=\mathcal{N}(0, \boldsymbol{I})\) and \(p_1=p^X\). Thus, we get the desired claim: \[\begin{align} \mathsf{KL}(p^X||\mathcal{N}(0, I))&=\int_0^1\frac{t}{1-t}\int_{\mathbb{R}^d}p_t(x)\left\|v_t^X(x)-v_t^Y(x)\right\|^2 dx dt. \end{align}\]
We report results for various choices of \(K\) and \(M\) for \(\mathsf{Top-K}\) of \(M\) sampling on GenAI-Bench in Tables 5, 6 and 7. Note that these models are also trained on GenAI-Bench. We also report the mean score separately for the "Basic" and "Advanced" splits of the prompt set. Results for \(\mathsf{Top-10}\) of \(100\) sampling on T2I-CompBench++ are given in Table 8. Models for Table 8 were trained on the train split of T2I-CompBench++. All results are consistent with the developed theory: both GRAFT and P-GRAFT outperform the base SDv2, and P-GRAFT, for an appropriate choice of \(N_I\), always outperforms GRAFT.
Model | Basic | Advanced | Mean
---|---|---|---
SD v2 | 74.83 | 59.19 | 66.32
GRAFT | 77.33 | 62.76 | 69.41
P-GRAFT (\(0.8N\)) | 76.30 | 62.18 | 68.62
P-GRAFT (\(0.5N\)) | 78.57 | 63.38 | 70.32
Model | Basic | Advanced | Mean
---|---|---|---
SD v2 | 74.83 | 59.19 | 66.32
GRAFT | 79.61 | 64.26 | 71.2
P-GRAFT (\(0.75N\)) | 76.02 | 62.91 | 68.89
P-GRAFT (\(0.5N\)) | 78.68 | 64.5 | 70.97
P-GRAFT (\(0.25N\)) | 80.05 | 64.85 | 71.79
Model | Basic | Advanced | Mean
---|---|---|---
SD v2 | 74.83 | 59.19 | 66.32
GRAFT | 78.01 | 63.31 | 70.02
P-GRAFT (\(0.75N\)) | 77.36 | 63.33 | 69.73
P-GRAFT (\(0.5N\)) | 78.18 | 64.28 | 70.62
P-GRAFT (\(0.25N\)) | 78.77 | 65.29 | 71.44
Model | Mean
---|---
SD v2 | 69.76
GRAFT | 74.66
P-GRAFT (\(0.25N\)) | 75.16
While the experimental results in Table [tab:100_10_sampling] already demonstrate the bias-variance tradeoff, we provide further evidence for Lemma 4 in the context of text-to-image generation. We evaluate the conditional variance of the VQAScore rewards of the base SDv2 model on GenAI-Bench. We follow the methodology described in Section 4.1, except that we generate \(4\) images per prompt for a total of \(1600\) prompts. The results are given in Table 9. It can be seen that even at \(N_I = 0.75N\), the expected conditional variance of the reward is significantly smaller than at \(t_N\). This explains why even \(N_I = 0.75N\) gives a significant gain over the base model, as seen in Table [tab:100_10_sampling].
\(N_I\) | \(\mathbb{E}\left[\mathsf{Var}(r(X_0)\mid X_{t_{N_I}})\right]\)
---|---
\(N\) | \(0.0193\)
\(3N/4\) | \(0.0080\)
\(N/2\) | \(0.0039\)
\(N/4\) | \(0.0019\)
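A minimal sketch of the estimator reported in Table 9, assuming the rewards have already been grouped so that each row shares the same conditioning latent \(X_{t_{N_I}}\) (the exact generation protocol follows Section 4.1 and is not reproduced here):

```python
import torch

def expected_conditional_variance(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_conditions, samples_per_condition); each row contains rewards of
    generations that share the same intermediate latent X_{t_{N_I}} (and prompt).
    Returns a Monte Carlo estimate of E[Var(r(X_0) | X_{t_{N_I}})]."""
    per_condition_var = rewards.var(dim=1)  # unbiased sample variance within each condition
    return per_condition_var.mean()         # average over conditions
```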
We increase the LoRA rank used for fine-tuning and assess its impact on performance. Table 10 shows that increasing the LoRA rank does not noticeably affect performance, indicating that the default LoRA rank is sufficient. Ablations are done on GenAI-Bench with \(M=100, K=1\).
Model | Rank | Mean Reward
---|---|---
P-GRAFT (\(0.5N\)) | 4 | 70.97
P-GRAFT (\(0.5N\)) | 6 | 70.87
P-GRAFT (\(0.5N\)) | 8 | 70.57
P-GRAFT (\(0.5N\)) | 10 | 70.84
P-GRAFT (\(0.25N\)) | 4 | 71.79
P-GRAFT (\(0.25N\)) | 6 | 71.84
P-GRAFT (\(0.25N\)) | 8 | 71.49
P-GRAFT (\(0.25N\)) | 10 | 71.63
In P-GRAFT, we always use the fine-tuned model for the first \((N - N_I)\) steps and then switch to the reference model. We experiment with a reverse stitching strategy, where we use the reference model for the earlier denoising steps and the fine-tuned model for the later denoising steps. For switching timestep \(N_I\), we denote this strategy as RP-GRAFT (\(N_I\)) - i.e., RP-GRAFT (\(0.75N\)) indicates that the base model is used from \(t_N\) to \(t_{0.75N}\), after which the fine-tuned model is used. From Table 11, we observe that this strategy is significantly worse than P-GRAFT - this provides further evidence of the bias-variance tradeoff. Ablations are done with \(M=100, K=1\).
Model | Basic | Advanced | Mean |
---|---|---|---|
SDv2 | 74.83 | 59.19 | 66.32 |
GRAFT | 79.61 | 64.26 | 71.20 |
RP-GRAFT (\(0.75N\)) | 79.23 | 62.63 | 70.20 |
RP-GRAFT (\(0.5N\)) | 76.60 | 60.87 | 68.05 |
RP-GRAFT (\(0.25N\)) | 75.74 | 59.76 | 67.05 |
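The stitching used for P-GRAFT inference, and its reversal used in RP-GRAFT, can be sketched as below. This is a minimal sketch assuming Diffusers-style unet and scheduler objects (unet_ft and unet_ref denote the fine-tuned and reference UNets); it is not the exact implementation used for the experiments:

```python
import torch

@torch.no_grad()
def stitched_sampling(unet_ft, unet_ref, scheduler, latents, text_emb, N, N_I, reverse=False):
    """P-GRAFT: fine-tuned UNet for the first (N - N_I) denoising steps
    (t_N down to t_{N_I}), reference UNet afterwards. reverse=True gives RP-GRAFT."""
    scheduler.set_timesteps(N)
    for i, t in enumerate(scheduler.timesteps):            # i = 0 corresponds to t_N
        use_finetuned = (i < N - N_I) != reverse
        unet = unet_ft if use_finetuned else unet_ref
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```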
Since we require samples only from the pre-trained model, sampling and training can be done separately. Therefore, we first perform rejection sampling according to \(\mathsf{Top-K}\) of \(M\) for the chosen values of \(K\) and \(M\). The selected samples are then used as the dataset for training. Unless mentioned explicitly, hyperparameters can be assumed to be the default values for SD 2.0 in the Diffusers library [71].
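A minimal sketch of this data-collection step is given below; generate(prompt) and reward(image, prompt) are hypothetical stand-ins for the diffusion sampler (e.g. a Diffusers pipeline) and the reward model:

```python
def build_topk_of_m_dataset(prompts, generate, reward, K=10, M=100):
    """Top-K of M rejection sampling: generate M samples per prompt and keep
    the K highest-reward samples as the fine-tuning dataset."""
    dataset = []
    for prompt in prompts:
        samples = [generate(prompt) for _ in range(M)]
        ranked = sorted(samples, key=lambda img: reward(img, prompt), reverse=True)
        dataset.extend((prompt, img) for img in ranked[:K])
    return dataset
```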
Training on GenAI-Bench:
The hyperparameters for sampling and training are given in Table 12 and Table 13 respectively. Note that one training epoch is defined as one complete pass over the training dataset. The size of the training dataset depends on the chosen \(K\) and \(M\). For instance, \(K = 10\) and \(M = 100\) results in \(10\) images per prompt, for a total of \(16000\) images. One training epoch corresponds to a single pass over these \(16000\) images, which, with a batch size of \(8\), corresponds to \(2000\) iterations per epoch.
Hyperparameter | Value
---|---
Sampling Steps | 50
Scheduler | EulerDiscreteScheduler
Guidance Scale | 7.5
Hyperparameter | Value
---|---
Training Epochs | \(10\)
Image Resolution | \(768 \times 768\)
Batch Size | \(8\)
Learning Rate | \(10^{-4}\)
LR Schedule | Constant
LoRA Fine-Tuning | True
Training on T2I-CompBench++:
The hyperparameters for sampling and training are given in Table 14 and Table 15 respectively. We use different sampling schedulers for the two datasets to ensure that our results hold irrespective of the choice of the scheduler.
Hyperparameter | Value
---|---
Sampling Steps | 50
Scheduler | DDIMScheduler
\(\eta\) (DDIMScheduler-specific hyperparameter) | 1.0
Guidance Scale | 7.5
Hyperparameter | Value
---|---
Training Epochs | \(10\)
Image Resolution | \(768 \times 768\)
Batch Size | \(8\)
Learning Rate | \(10^{-4}\)
LR Schedule | Constant
LoRA Fine-Tuning | True
P-GRAFT Training:
Training and sampling using GRAFT is straightforward since standard training and inference scripts can be used out of the box: the only additional step needed is rejection sampling on the generated samples before training. For P-GRAFT, the following changes are needed:
While sampling the training data, the intermediate latents should also be saved along with the final denoised image/latent. Rejection sampling is to be done on these intermediate latents, but using the rewards corresponding to the final denoised images.
While training, note that training has to be done by noising the saved intermediate latents. This requires a re-calibration of the noise schedule, since by default, training assumes that we start from completely denoised samples. The easiest way to re-calibrate the noise schedule is to construct a new set of values for the betas parameter, new_betas, as follows (where \(N_I\) denotes the intermediate step of P-GRAFT): \[\begin{align} \texttt{new\_betas} [0, N_I] &\leftarrow 0 \\ \texttt{new\_betas} [N_I, N] &\leftarrow \texttt{betas} [N_I, N] \end{align}\] After re-calibrating the noise schedule, we use new_betas to compute the corresponding new_alphas and new_alphas_cumprod. It is also important to note that the denoiser has been trained to predict \(X_0\) given any noised state \(X_t\), and not the saved intermediate latent \(X_{t_{N_I}}\). Let the corresponding saved completely denoised latent be \(X_{0}\). To ensure that training is consistent, we train using the following strategy: \[\begin{align} &\text{Sample} \; \epsilon \sim \mathcal{N}(0, \mathbb{I}) \\ &\text{Get} \; X_{t} \leftarrow \left(\sqrt{\texttt{new\_alphas\_cumprod}[t]} \right)X_{t_{N_I}} + \left(\sqrt{1 - \texttt{new\_alphas\_cumprod}[t]}\right)\epsilon \\ & \text{Get} \; \epsilon' \leftarrow \frac{X_{t} - \sqrt{\texttt{alphas\_cumprod}[t]}\,X_0}{\sqrt{1 - \texttt{alphas\_cumprod}[t]}} \\ & \text{Compute Loss using } X_{t} \text{ and } \epsilon' \end{align}\]
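A minimal PyTorch sketch of this re-calibration and target computation (function and variable names are ours, not from an official codebase; indexing follows the equations above and may need adjustment for a particular scheduler implementation):

```python
import torch

def recalibrate_schedule(betas: torch.Tensor, N_I: int):
    """new_betas[0:N_I] = 0, new_betas[N_I:N] = betas[N_I:N]."""
    new_betas = betas.clone()
    new_betas[:N_I] = 0.0
    new_alphas = 1.0 - new_betas
    new_alphas_cumprod = torch.cumprod(new_alphas, dim=0)
    return new_betas, new_alphas, new_alphas_cumprod

def pgraft_targets(x_NI, x0, t, alphas_cumprod, new_alphas_cumprod):
    """Noise the saved intermediate latent x_NI with the re-calibrated schedule and
    compute the epsilon target that keeps the denoiser's X_0 prediction consistent."""
    eps = torch.randn_like(x_NI)
    x_t = new_alphas_cumprod[t].sqrt() * x_NI + (1 - new_alphas_cumprod[t]).sqrt() * eps
    eps_target = (x_t - alphas_cumprod[t].sqrt() * x0) / (1 - alphas_cumprod[t]).sqrt()
    return x_t, eps_target
```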
DDPO [4] is an on-policy policy gradient method for diffusion models that optimizes a clipped importance-weighted objective over the denoising trajectory. The original paper reports results on experiments using at most \(400\) prompts. Both prompt sets we consider are significantly larger (\(1600\) prompts for GenAI-Bench and \(5600\) (train) prompts for T2I-CompBench++). This difference is crucial, since it has been shown in [6] that scaling DDPO to large prompt sets results in unstable training and subpar performance. We also observe this phenomenon, as evidenced by the results in Table [tab:100_10_sampling]. As mentioned in the main text, we also augment DDPO with additional elements in an attempt to improve performance. In particular, we study the following variants:
DDPO: Clipped importance-weighted policy gradient.
DDPO+KL: DDPO augmented with a stepwise KL regularizer to the (frozen) reference model.
DDPO+KL+EMA: DDPO with KL regularization as well as a prompt-wise exponential-moving-average baseline for advantage estimation.
We use the official PyTorch implementation of DDPO and further adapt the codebase to implement the other variants. Fine-tuning is always done on SDv2 using LoRA on the UNet only, with a LoRA rank of 16. For the results reported in Table [tab:100_10_sampling], we retain the hyperparameters used in [4]. In particular, we use a PPO clip range of \(10^{-4}\), a gradient clipping norm of \(1.0\), and the Adam optimizer with \(\beta_1 = 0.9, \beta_2 = 0.999\) and weight decay of \(10^{-4}\). Following the original paper, we train with a relatively high learning rate of \(3 \times 10^{-4}\) since LoRA fine-tuning is used. We sample \(32\) prompts per epoch and train with a batch size of \(8\), leading to \(4\) training iterations per epoch. However, note that each training iteration requires gradients across the whole denoising trajectory - this means that within each training iteration, \(50\) gradient calls are needed, corresponding to the \(50\) sampling steps. For GenAI-Bench, training is done for \(500\) such epochs, whereas for T2I-CompBench++, training is done for \(800\) epochs. With this setup, in Tables 16 and 17, we compare the sampling/compute requirements of DDPO and GRAFT/P-GRAFT. In particular, note that GRAFT/P-GRAFT already outperforms DDPO with \(K = 1, M = 4\), despite DDPO being trained on \(10 \times\) more samples and \(50 \times\) more gradient calls.
Additional configurations with base hyperparameters: With the base hyperparameters described above, we also try augmenting DDPO with the KL and EMA variants described above. The training curves are given in Figure 7.
We also try additional hyperparameter settings apart from the ones we have reported so far. Sampling uses DDIM with \(T\in[40,50]\) steps and classifier-free guidance \(g=5\). Optimization uses AdamW with learning rates \(\{2{\times}10^{-5},10^{-5}\}\), batch sizes \(8/8\) (sampling/training), and PPO-style clipping \(\epsilon\in\{0.1,0.2\}\). Following DDPO, we replay the scheduler to compute per-step log-probabilities on the same trajectories: \(\ell_t=\log p_{\theta}(x_{t-1}\!\mid x_t,c)\) and \(\ell^{\mathrm{old}}_t=\log p_{\theta_0}(x_{t-1}\!\mid x_t,c)\). We use the clipped objective: \[\label{eq:ddpo} \mathcal{L}_{\mathrm{DDPO}} = -\,\mathbb{E}\!\left[ \min\!\big(r_t A,\; \mathrm{clip}(r_t,\,1{-}\epsilon,\,1{+}\epsilon)\,A\big) \right], \qquad r_t=\exp\!\big(\ell_t-\ell^{\mathrm{old}}_t\big),\tag{18}\] with a centered batchwise advantage \(A\). Specifically, we experiment with a higher clipping range \(\epsilon\!\in\!\{0.1,0.2\}\) and use a whitened batchwise advantage. \(\ell_t,\ell^{\mathrm{old}}_t\) are obtained by replaying the DDIM scheduler on the same trajectory.
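A minimal PyTorch sketch of the clipped objective in 18 (per-step log-probabilities and advantages are assumed to be precomputed as described above):

```python
import torch

def ddpo_clipped_loss(logp, logp_old, adv, clip_eps=1e-4):
    """Eq. 18: negative clipped importance-weighted objective.
    logp, logp_old: per-step log-probabilities; adv: (whitened) advantages."""
    ratio = torch.exp(logp - logp_old)                       # r_t
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.mean(torch.min(unclipped, clipped))        # minimize the negative objective
```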
Figure 7: Training curves for the policy-gradient baselines on GenAI-Bench (1,600 prompts) with a low value of clipping. a — DDPO+KL, b — DDPO+KL+EMA
Figure 8: Training curves for the three policy-gradient baselines on GenAI-Bench (1,600 prompts) with a high value of clipping. a — DDPO, b — DDPO+KL, c — DDPO+KL+EMA
Result. On 1600 prompts, the learning curve exhibits a short initial rise followed by a sharp collapse after \(\sim\)150 steps (Fig. 8). The setting of 1600 heterogeneous prompts induces high variance, and many ratios \(r_t\) saturate at the clipping boundary, producing low-magnitude effective gradients and the observed drop in reward.
We augment 18 with a per-step quadratic penalty to the frozen reference: \[\label{eq:ddpo95kl} \mathcal{L}_{\mathrm{DDPO+KL}} = \mathcal{L}_{\mathrm{DDPO}} + \beta\,\frac{1}{T}\sum_{t=1}^{T}\big(\ell_t-\ell^{\mathrm{old}}_t\big)^2, \qquad \beta\in\{0.02,\,0.005\}.\tag{19}\] Result. The KL term prevents divergence of the policy and eliminates the reward collapse after the first few steps. Even with this, average reward improvements remain limited. Larger \(\beta\) contracts the policy towards the reference, whereas smaller \(\beta\) provides insufficient variance control, yielding small net gains.
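The KL-regularized variant in 19 only adds a quadratic penalty on the per-step log-probability gap; a sketch, reusing ddpo_clipped_loss from the previous snippet:

```python
def ddpo_kl_loss(logp, logp_old, adv, clip_eps=1e-4, beta=0.02):
    """Eq. 19: clipped objective plus a per-step quadratic penalty to the frozen reference."""
    kl_penalty = beta * torch.mean((logp - logp_old) ** 2)
    return ddpo_clipped_loss(logp, logp_old, adv, clip_eps) + kl_penalty
```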
To mitigate cross-prompt bias, we maintain, for each prompt \(z\), an EMA of the reward and its variance, \[b(z)\leftarrow(1{-}\alpha)b(z)+\alpha\,r,\qquad v(z)\leftarrow(1{-}\alpha)v(z)+\alpha\big(r-b(z)\big)^2,\] and employ a whitened advantage inside 18 : \(\widehat{A}=\frac{r-b(z)}{\sqrt{v(z)+\varepsilon}}+\eta,\;\eta\sim\mathcal{N}(0,\sigma^2).\)
Result. Training is the most stable among the three variants and exhibits smooth reward trajectories without collapse, yet the absolute improvement in mean reward is modest relative to the base policy.
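A minimal sketch of the prompt-wise EMA baseline and whitened advantage described above (class and argument names are ours; sigma corresponds to the optional noise term \(\eta\)):

```python
import math
import random
from collections import defaultdict

class PromptEMABaseline:
    """Per-prompt EMA of reward mean b(z) and variance v(z) for advantage whitening."""
    def __init__(self, alpha=0.1, eps=1e-6, sigma=0.0):
        self.alpha, self.eps, self.sigma = alpha, eps, sigma
        self.b = defaultdict(float)
        self.v = defaultdict(lambda: 1.0)

    def advantage(self, prompt, reward):
        a = self.alpha
        self.b[prompt] = (1 - a) * self.b[prompt] + a * reward
        self.v[prompt] = (1 - a) * self.v[prompt] + a * (reward - self.b[prompt]) ** 2
        noise = random.gauss(0.0, self.sigma) if self.sigma > 0 else 0.0  # optional exploration noise
        return (reward - self.b[prompt]) / math.sqrt(self.v[prompt] + self.eps) + noise
```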
PRDP: We also tried implementing PRDP [6] using the PRDP loss function provided in the appendix of the paper, since no official code was released. However, we did not see any significant improvement over the baseline despite following the algorithm and hyperparameters closely. One potential reason could be that we use LoRA fine-tuning whereas the original paper uses full fine-tuning. Further, we rely on gradient checkpointing for the implementation, since backpropagation is through the entire sampling trajectory.
Algorithm | Samples generated | Samples Trained on | Gradient Calls |
---|---|---|---|
GRAFT(\(K=10, M=100\)) | \(160\)k | \(16\)k | \(20\)k |
GRAFT (\(K=1, M=4\)) | \(6.4\)k | \(1.6\)k | \(2\)k |
DDPO | \(16\)k | \(16\)k | \(100\)k |
Algorithm | Samples generated | Samples Trained on | Gradient Calls |
---|---|---|---|
GRAFT (\(K=1, M=4\)) | \(22.4\)k | \(5.6\)k | \(7\)k |
DDPO | \(25.6\)k | \(25.6\)k | \(160\)k |
We compare the compute cost of P-GRAFT and DDPO in terms of total UNet FLOPs. Let \(F_u\) denote the cost of one UNet forward pass at \(64{\times}64\) latent resolution. Following [72], we approximate the cost of a backward pass as twice that of a forward pass, so a combined forward + backward step incurs \(3F_u\). We assume a batch size of \(1\) for both algorithms for standardization.
Consider \(P\) prompts, \(M\) samples per prompt, top-\(K\) retained, \(T\) diffusion steps and \(E_{\mathrm{sft}}\) fine-tuning epochs. For the implementation, we use the standard Stable Diffusion training script, which samples only a single timestep \(t \in [0, T]\) per training example: \[F_{\mathrm{P\text{-}GRAFT}} = \underbrace{P\,M\,T}_{\text{sampling}}\,F_u \;+\; \underbrace{E_{\mathrm{sft}}\,P\,K}_{\text{ training}} 3F_u,\] \[F_{\mathrm{DDPO}} = \underbrace{E_{\mathrm{ddpo}}\,N_{\mathrm{gen}}\,T}_{\text{trajectories}} \cdot (1+3)F_u\]
GenAI-Bench configuration. We use \(P{=}1600\), \(M{=}100\), \(K{=}10\), \(T{=}40\), \(E_{\mathrm{sft}}{=}10\) for P-GRAFT; for DDPO, we use \(N_{\mathrm{gen}}{=}128\) trajectories generated per epoch and \(E_{\mathrm{ddpo}}{=}50\) epochs.
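A small script (ours) that reproduces the forward-pass counts in Table 18 from the formulas above, in units of \(F_u\):

```python
def pgraft_flops(P, M, K, T, E_sft):
    sampling = P * M * T           # one UNet forward pass per denoising step
    training = 3 * E_sft * P * K   # forward + backward ~ 3 F_u per retained sample per epoch
    return sampling, training, sampling + training

def ddpo_flops(E_ddpo, N_gen, T):
    sampling = E_ddpo * N_gen * T        # trajectory generation
    training = 3 * E_ddpo * N_gen * T    # backprop through every denoising step
    return sampling, training, sampling + training

print(pgraft_flops(1600, 100, 10, 40, 10))  # (6,400,000, 480,000, 6,880,000)
print(pgraft_flops(1600, 4, 1, 40, 10))     # (256,000, 48,000, 304,000)
print(ddpo_flops(50, 128, 40))              # (256,000, 768,000, 1,024,000)
```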
Algorithm | Sampling (\(F_u\)) | Training (\(F_u\)) | Total (\(F_u\))
---|---|---|---
P-GRAFT (\(K{=}10,\,M{=}100\)) | \(6.40\text{M}\) | \(0.48\text{M}\) | \(6.88\text{M}\)
P-GRAFT (\(K{=}1,\,M{=}4\)) | \(0.256\text{M}\) | \(0.048\text{M}\) | \(0.304\text{M}\)
DDPO (\(E{=}50\), \(N_{\mathrm{gen}}{=}128\)) | \(0.256\text{M}\) | \(0.768\text{M}\) | \(1.024\text{M}\)
Discussion. P-GRAFT’s total compute is dominated by sample generation, while backpropagation is confined to fine-tuning on the selected top-\(K\) samples. In contrast, DDPO backpropagates through all \(T\) denoising steps online for every sample, creating a sequential bottleneck. Consequently, despite DDPO’s nominal FLOPs appearing comparable or lower in our regime, its wall-clock time is substantially longer due to stepwise backward passes that are less parallelizable. Moreover, as shown in Table 2, P-GRAFT achieves higher rewards under the reported budgets; and in the compute-matched case (\(K{=}1\); Table 5), P-GRAFT still outperforms DDPO, indicating that gains come from improved optimization and not just additional training compute.
Fine-tuning for both layout generation and molecule generation is done using models pre-trained with the Interleaved Gibbs Diffusion (IGD) [32] framework. IGD performs well for discrete-continuous generation tasks with strong constraints between variables, and is hence particularly useful for tasks like layout generation and molecule generation. Further, IGD offers a generalizable framework which can be used for both tasks - while other discrete-continuous diffusion frameworks exist, they are specialized to a particular task, often using domain-specific adaptations.
At a high level, IGD interleaves diffusion for discrete and continuous elements using Gibbs-style noising and denoising. Discrete elements are noised by flipping and trained using a binary classification loss, while continuous elements use typical DDPM-style noising and training. While the exact forward and reverse processes differ from the DDPM-style processes considered in the main text, the key results carry over both empirically and theoretically.
A layout is defined as a set of \(N\) elements \(\{ e_i \}_{i=1}^N\). Each element \(e_i\) is represented by a discrete category \(t_i \in \mathbb{N}\) and a continuous bounding box vector \(\mathbf{p}_i \in \mathbb{R}^4\). Following [32], we use the parameterization \(\mathbf{p}_i = [x_i,\, y_i,\, l_i,\, w_i]^\top\), where \((x_i,\, y_i)\) represents the upper-left corner of the bounding box, and \((l_i,\, w_i)\) its length and width, respectively. Unconditional generation represents generation with no explicit conditioning for the elements, whereas Class-Conditional generation indicates generations conditioned on element categories.
For pre-training, we follow the exact strategy used in [32]. Fine-tuning is also done with the same hyperparameters used for pre-training. Since the data and model sizes are significantly smaller compared to images, each round of rejection sampling is done on \(32768\) samples, of which the top \(50\%\) are selected. For each sampling round, \(10000\) training iterations are performed with a training batch size of \(4096\). The results reported in Table [tab:lay_gen] are for \(20\) such sampling rounds. FID computation is done by comparing against the test split of PubLayNet.
The task of molecule generation involves synthesizing molecules given a dataset of molecules. A molecule consists of \(n\) atoms denoted by \(\{z_i, \mathbf{p}_i\}_{i=1}^{n}\), where \(z_{i} \in \mathbb{N}\) is the atom’s atomic number and \(\mathbf{p}_i \in \mathbb{R}^{3}\) is the position. A diffusion model is trained to generate such molecules. In this work, we take such a pre-trained model, and try to increase the fraction of stable molecules, as deemed by RDKit.
For pre-training, we follow the exact strategy used in [32]. Fine-tuning is also done with the same hyperparameters used for pre-training. Since the data and model sizes are significantly smaller compared to images, each round of rejection sampling is done on \(32768\) samples. We select all stable molecules, but with the de-duplication strategy described in Section 3 - we find that this is crucial to maintain the diversity of generated molecules. For each sampling round, \(10000\) training iterations are performed with a training batch size of \(4096\). The \(1 \times\) in Table [tab:mol_gen] corresponds to \(10\) such sampling rounds - \(9 \times\) therefore corresponds to \(90\) sampling rounds.
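A minimal sketch of the selection step, assuming de-duplication by canonical SMILES (the exact de-duplication strategy is described in Section 3) and a stability predicate is_stable standing in for the RDKit-based check:

```python
from rdkit import Chem

def select_stable_unique(mols, is_stable):
    """Keep every stable molecule, but at most one copy per canonical SMILES."""
    seen, selected = set(), []
    for mol in mols:
        if mol is None or not is_stable(mol):
            continue
        smiles = Chem.MolToSmiles(mol)  # canonical SMILES by default
        if smiles not in seen:
            seen.add(smiles)
            selected.append(mol)
    return selected
```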
To demonstrate that the fine-tuned models still generate diverse molecules and do not collapse to generating a few stable molecules, we report the uniqueness metric computed across the generated molecules below. From Table 19, it is clear that the fine-tuned models still generate diverse samples, since the uniqueness of the generated molecules remains close to that of the pre-trained model. Uniqueness is as determined by RDKit.
Model | Molecule Stability | Uniqueness
---|---|---
Baseline | \(90.50 \scriptstyle{\pm 0.15}\) | \(95.60 \scriptstyle{\pm 0.10}\)
GRAFT | \(90.76 \scriptstyle{\pm 0.20}\) | \(96.04 \scriptstyle{\pm 0.46}\)
P-GRAFT(\(0.5N\)) | \(90.46 \scriptstyle{\pm 0.27}\) | \(95.70 \scriptstyle{\pm 0.28}\)
P-GRAFT(\(0.25N\)) | \(\mathbf{92.61} \scriptstyle{\pm 0.13}\) | \(95.32 \scriptstyle{\pm 0.07}\)
We also try out an ablation where we use GRAFT, but without the de-duplication - i.e., we train on all stable molecules irrespective of whether they are unique or not. The results are shown in Figure 11 - without de-duplication, it can be seen that though stability is recovered, uniqueness is lost, indicating that the model produces only a small subset of molecules it was initially able to produce.
Figure 11: Molecule Stability and Uniqueness without De-duplication. a — Molecule Stability, b — Molecule Uniqueness
IGD makes use of a version of the predictor-corrector method [73]–[76], termed ReDeNoise, at inference time to further improve generations. The results reported so far make use of this predictor-corrector. While ReDeNoise improves performance significantly, it comes at the cost of higher inference-time compute. We report results of the baseline and fine-tuned versions without ReDeNoise in Table 20. Both GRAFT and P-GRAFT still show improvements over the baseline, even without ReDeNoise.
Model | Molecule Stability | Sampling Steps
---|---|---
Baseline | 84.00 | -
GRAFT | 87.13 | \(9 \times\)
P-GRAFT (\(0.5N\)) | 84.57 | \(1 \times\)
P-GRAFT (\(0.25N\)) | 88.36 | \(1 \times\)
The pre-trained models, corresponding to the TRAIN_FLOW function in Algorithm 3, are trained using the NCSN++ architecture and hyperparameters from the official codebase of [16], with minor changes which we describe below. The noise corrector model is also trained with the same architecture, except that the number of channels is reduced from the original 128 to 64 - this reduces the parameter count by \(\approx 4\times\). For the pre-trained model, we train with \(\texttt{num\_scales} = 2000\), positional embeddings and a batch size of \(128\). For the noise corrector model, we use the same hyperparameters except for \(\texttt{num\_scales} = 1000\). FID computed with \(50000\) samples is used to measure performance, as is standard in the literature. Note that a separate noise corrector model is trained for each choice of \(\eta\) in Algorithm 3, i.e., for the results reported in Table 4, separate noise corrector models are trained for pre-trained steps of \(100\) and \(200\).
CelebA-HQ: For the baseline pre-trained flow model, we use the checkpoint after \(330\)k iterations, since this gave the lowest FID. For noise corrector model training, we use this checkpoint to generate the inverted noise dataset and train on it for \(150\)k iterations.
LSUN-Church: For the baseline pre-trained flow model, we use the checkpoint after \(350\)k iterations, since this gave the lowest FID. For noise corrector model training, we use this checkpoint to generate the inverted noise dataset and train on it for \(55\)k iterations. Note that Backward Euler (Algorithm 5) suffered from numerical instability on LSUN-Church, which we hypothesize is due to plain backgrounds. To alleviate this issue, we perturb the images with a small Gaussian noise \(\mathcal{N}(0, \sigma^2 \mathbf{I})\), with \(\sigma = 10^{-3}\).
We present a comparison of the exact FLOPs used for inference: