September 10, 2025
Substantial progress has recently been made in the understanding of the cutoff phenomenon for Markov processes, using an information-theoretic statistic known as varentropy sal-2023?, sal-2024?, sal-2025?, ped-sal-2025?. In the present paper, we propose an alternative approach which bypasses the use of varentropy and exploits instead a new W-TV transport inequality, combined with a classical parabolic regularization estimate bob-gen-led-2001?, ott-vil-2001?. While currently restricted to non-negatively curved processes on smooth spaces, our argument no longer requires the chain rule, nor any approximate version thereof. As applications, we recover the main result of sal-2025? establishing cutoff for the log-concave Langevin dynamics, and extend the conclusion to a widely-used discrete-time sampling algorithm known as the Proximal Sampler.
The broad aim of the present work is to investigate the nature of the transition from out of equilibrium to equilibrium in MCMC algorithms, a popular class of methods for sampling from a target measure \(\pi\in\mathcal{P}\tond*{\mathbb{R}^d}\) by simulating a stochastic process that is ergodic with respect to \(\pi\). We will here focus on two particular implementations that are widely used in practice: the Langevin dynamics, and the Proximal Sampler. Throughout the paper, we will assume that the target measure \(\pi\) is log-concave, i.e. of the form \[\begin{align} \pi({\mathrm d}x) & = & e^{-V(x)}{\mathrm d}x, \end{align}\] for some convex potential \(V\in C^2(\mathbb{R}^d)\).
The Langevin dynamics with target distribution \(\pi\) and initialization \(\mu_0\in \mathcal{P}\tond*{\mathbb{R}^d}\) is simply the solution to the stochastic differential equation \[\begin{align} \label{eq:LD} X_0\sim \mu_0, \quad {\mathrm d}X_t & = & -\nabla V(X_t)\,{\mathrm d}t + \sqrt{2}\,{\mathrm d}B_t, \end{align}\tag{1}\] where \((B_t)_{t\ge 0}\) is a standard \(d-\)dimensional Brownian motion. Under our assumptions, it is well known that the marginal law \(\mu_t\mathrel{\vcenter{:}}=\mathop{\mathrm{law}}(X_t)\) approaches \(\pi\) as \(t\to\infty\). Moreover, a variety of quantitative convergence guarantees are available in different metrics; see, e.g., bak-gen-led-2014? for an extensive treatment. In particular, the parabolic regularization estimate \[\begin{align} \label{eq:KL-W2-reg} \forall t>0,\quad {H}\tond*{\mu_t\,\middle |\, \pi} & \leq & \frac{W^2(\mu_0,\pi)}{4t}, \end{align}\tag{2}\] was proved in bob-gen-led-2001?, ott-vil-2001?, and used to recover and generalize the celebrated HWI inequality from ott-vil-2000?. Here and throughout the paper, we use the classical notation \[\begin{align} W^2(\mu,\pi) & \mathrel{\vcenter{:}}= & \inf_{X\sim\mu, Y\sim \pi} \mathbb{E}\quadr*{\abs*{X-Y}^2}, \end{align}\] for the \(2\)-Wasserstein distance between \(\mu\) and \(\pi\), and \[\begin{align} {H}\tond*{\mu\,\middle |\, \pi}& \mathrel{\vcenter{:}}= & \int \log\tond*{\frac{\mathrm{d}\mu}{\mathrm{d}\pi}} \mathrm{d}\mu, \end{align}\] for the relative entropy (or KL-divergence) of \(\mu\) with respect to \(\pi\). As usual, it is understood that \({H}\tond*{\mu\,\middle |\, \pi}=\infty\) when \(\mu\) is not absolutely continuous with respect to \(\pi\). To translate 2 into the classical language of mixing times, we let \[\begin{align} \mathrm{TV}\tond*{\mu, \pi} & \mathrel{\vcenter{:}}= & \sup_{A\in \mathcal{B}(\mathbb{R}^d)}\abs*{\mu(A)-\pi(A)} \end{align}\] denote the total-variation distance between \(\mu\) and \(\pi\), and we recall that the mixing time of the process \((X_t)_{t\ge 0}\) with initialization \(\mu_0\) and precision \(\varepsilon\in(0,1)\) is defined as \[\begin{align} \label{def:mixing} \mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon) & \mathrel{\vcenter{:}}= & \min\{t\ge 0\colon \mathrm{TV}\tond*{\mu_t, \pi}\le\varepsilon \}. \end{align}\tag{3}\] It then readily follows from 2 and Pinsker’s inequality that \[\begin{align} \label{eq:tmixLS} \forall \varepsilon\in(0,1),\quad \mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon) & \le & \frac{W^2(\mu_0,\pi)}{8\varepsilon^2}. \end{align}\tag{4}\] In concrete words, running the Langevin dynamics for a time of order \(W^2(\mu_0,\pi)\) suffices to approximately sample from \(\pi\). Unfortunately, the continuous-time nature of the Langevin dynamics is not appropriate for practical implementation, and a suitable discretization is required. The simplest choice is the Euler–Maruyama method, in which a step size \(h>0\) is chosen and a discrete-time process \((X_k)_{k\in\mathbb{N}}\) is produced via the stochastic recursion \[\begin{align} \label{eq:LMC} X_0\sim\mu_0,\quad X_{k+1} & = & X_k -h\nabla V(X_k) + \sqrt{2}\tond*{B_{(k+1)h} - B_{kh}}. \end{align}\tag{5}\] This sampling scheme is known as the Langevin Monte Carlo (LMC) algorithm or Unadjusted Langevin algorithm. It is of fundamental importance, and has been extensively studied. We refer the unfamiliar reader to the book in progress che-book-2024+? and the references therein. 
One drawback of this approach, however, is that the LMC algorithm is biased: the stationary distribution of 5 is in general different from \(\pi\). As a consequence, theoretical performance guarantees require the step size to be very small, resulting in an increased number of iterations compared to the theoretical time-scale 4 .
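To make the recursion 5 concrete, here is a minimal Python sketch of the LMC iteration. Everything in it (the standard Gaussian target, so that \(\nabla V(x)=x\), the function name, the step size and the iteration counts) is an illustrative choice of ours, not a prescription from the text.

```python
import numpy as np

def lmc(grad_V, x0, h, n_steps, rng):
    """LMC recursion (5): X_{k+1} = X_k - h * grad_V(X_k) + sqrt(2) * (B_{(k+1)h} - B_{kh})."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        # The Brownian increment B_{(k+1)h} - B_{kh} has law N(0, h I_d).
        x = x - h * grad_V(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, h, n_steps, n_chains = 10, 0.25, 200, 2_000
    grad_V = lambda x: x  # standard Gaussian target: V(x) = |x|^2 / 2
    out = np.stack([lmc(grad_V, np.zeros(d), h, n_steps, rng) for _ in range(n_chains)])
    # For this linear example, the stationary variance of LMC is 1/(1 - h/2) ~ 1.14 instead of 1,
    # which makes the bias discussed above directly visible in the printed empirical variance.
    print(out.mean(), out.var())
```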
The Proximal Sampler is an unbiased discrete-time algorithm for sampling from \(\pi\) introduced in lee-she-tia-2021?. We refer the reader to che-book-2024+? or the recent papers che-eld-2022?, mit-wib-2025?, wib-2025? for more details. As usual, we denote by \(\gamma_{x,t}\) the \(d\)-dimensional Gaussian law with mean \(x\) and covariance matrix \(tI_d\), and we use the shorthand \(\gamma_t \mathrel{\vcenter{:}}= \gamma_{0,t}\). Given a step size \(h>0\), we consider a pair \((X,Y)\) of \(\mathbb{R}^d-\)valued random variables with joint law \[\begin{align} \boldsymbol{\pi}({\mathrm d}x\, {\mathrm d}y) & \propto & \exp\tond*{-V(x) - \frac{\abs*{x-y}^2}{2h}}{\mathrm d}x\, {\mathrm d}y. \end{align}\] The Proximal Sampler with target \(\pi\) consists in applying alternating Gibbs sampling to \(\boldsymbol{\pi}\). More precisely, a sequence \(X_0,Y_0,X_1,Y_1,\ldots\) of \(\mathbb{R}^d\)-valued random variables is produced by first sampling \(X_0\) according to some prescribed initialization \(\mu_0\in \mathcal{P}\tond*{\mathbb{R}^d}\) and then inductively, for each \(k\ge 0\), sampling \(Y_k\) and \(X_{k+1}\) according to the following laws:
Forward step: conditionally on \((X_0,Y_0,\ldots,X_{k-1},Y_{k-1},X_k)\), \(Y_k\) is distributed as \[\begin{align} \label{eq:forward-step-prox} Y_k & \sim & \mathop{\mathrm{law}}\tond*{ Y \mid X = X_k} \;= \;\gamma_{X_k, h}. \end{align}\tag{6}\]
Backward step: conditionally on \((X_0,Y_0,\ldots,X_k,Y_k)\), \(X_{k+1}\) is distributed as \[\begin{align} \label{eq:backward-step-prox} X_{k+1} & \sim & \mathop{\mathrm{law}}\tond*{ X \mid Y = Y_k} \;\propto \;\exp \tond*{-V(x)-\frac{\abs*{x-Y_k}^2}{2h}}\mathrm{d}x. \end{align}\tag{7}\]
Since the first marginal of \(\boldsymbol{\pi}\) is \(\pi\), the algorithm is unbiased, meaning that \(\pi\) is stationary for the Markov chain \((X_k)_{k\ge 0}\). Moreover, the forward step is trivial to implement, as it amounts to adding an independent Gaussian noise. When \(h\) is small enough, the backward step is also tractable, due to the regularizing effect of the quadratic potential in 7 . For example, if \(\pi\) is \(\alpha\)-log-concave and \(\beta\)-log-smooth (i.e. \(\alpha I_d\preccurlyeq \nabla^2V \preccurlyeq \beta I_d\)), then the conditional law 7 is \(\tond*{\alpha + \frac{1}{h}}\)-log-concave with condition number \(\kappa_h = \frac{1+\beta h}{1+\alpha h} < \frac{\beta}{\alpha}\). As a result, several methods are available to efficiently generate \(X_{k+1}\), such as rejection sampling che-che-sal-wib-2022?, approximate rejection sampling fan-yua-che-2023?, or high-accuracy samplers alt-che-2024-faster?. Following a standard convention in this setting che-che-sal-wib-2022?, we will here assume to have access to a restricted Gaussian oracle that implements the backward step 7 exactly, and focus on the number of iterations needed for \(\mu_k\mathrel{\vcenter{:}}=\mathop{\mathrm{law}}(X_k)\) to approach \(\pi\), in the sense of 3 .
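For concreteness, here is a minimal Python sketch of the two steps 6 and 7, in the illustrative special case of a standard Gaussian target \(\pi=\gamma_1\): the backward law 7 is then itself Gaussian, with mean \(\frac{Y_k}{1+h}\) and covariance \(\frac{h}{1+h}I_d\), so the restricted Gaussian oracle is available in closed form. The function name and all parameter values are arbitrary choices of ours.

```python
import numpy as np

def proximal_sampler_std_gaussian(x0, h, n_iters, rng):
    """Proximal Sampler for the (illustrative) standard Gaussian target pi = N(0, I_d).

    Forward step (6):  Y_k | X_k     ~ N(X_k, h I_d).
    Backward step (7): X_{k+1} | Y_k ~ N(Y_k / (1 + h), (h / (1 + h)) I_d),
    which is the exact restricted Gaussian oracle for V(x) = |x|^2 / 2.
    """
    x = np.asarray(x0, dtype=float).copy()
    d = x.shape[0]
    for _ in range(n_iters):
        y = x + np.sqrt(h) * rng.standard_normal(d)                          # forward step
        x = y / (1.0 + h) + np.sqrt(h / (1.0 + h)) * rng.standard_normal(d)  # backward step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, h, n_iters, n_chains = 10, 0.5, 100, 2_000
    out = np.stack([proximal_sampler_std_gaussian(5.0 * np.ones(d), h, n_iters, rng)
                    for _ in range(n_chains)])
    print(out.mean(), out.var())  # unbiased: both should be close to 0 and 1
```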
Although this is not obvious from the above description, the Proximal Sampler can be interpreted as a discretization of the Langevin dynamics 1 . This is particularly clear once we view those dynamics as minimizing schemes for the relative entropy functional \({H}\tond*{\cdot\,\middle |\, \pi}\) in the Wasserstein space \(\tond*{\mathcal{P}_2(\mathbb{R}^d), W}\) jor-kin-ott-1998?, che-book-2024+?. In light of this, it is not too surprising that many classical convergence guarantees for the Langevin dynamics translate to the Proximal Sampler. In particular, the following analogue of the parabolic regularization estimate 2 was recently established in che-che-sal-wib-2022?: \[\begin{align} \label{eq:KL-W2-reg-prox-sampl} \forall k\in\mathbb{N},\quad {H}\tond*{\mu_k\,\middle |\, \pi} & \leq & \frac{W^2\tond*{\mu_0,\pi}}{kh}. \end{align}\tag{8}\] By virtue of Pinsker’s inequality, this readily yields the (discrete-time) mixing-time estimate \[\begin{align} \forall \varepsilon\in(0,1),\quad \mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon) & \le & \left\lceil\frac{W^2(\mu_0,\pi)}{2h\varepsilon^2}\right\rceil. \end{align}\] In concrete words, running the Proximal Sampler with step size \(h\) for roughly \(W^2\tond*{\mu_0,\pi}/h\) iterations suffices to produce approximate samples from the target distribution \(\pi\).
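For a rough sense of scale (an illustrative back-of-the-envelope computation, not taken from the cited works), consider the standard Gaussian target \(\pi=\gamma_1\) with Dirac initialization \(\mu_0=\delta_0\): then \[\begin{align} W^2(\mu_0,\pi) & = & \int \abs*{y}^2\,\pi({\mathrm d}y) \;=\; d, \end{align}\] so the bound above guarantees that \(\left\lceil\frac{d}{2h\varepsilon^2}\right\rceil\) iterations of the Proximal Sampler suffice to reach precision \(\varepsilon\) in total variation.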
Rather than asking how long the Langevin dynamics or the Proximal Sampler should be run in order to be close to equilibrium, we would here like to understand how abrupt their transition from out of equilibrium to equilibrium is. In other words, we seek to estimate the width of the mixing window, defined for any precision \(\varepsilon\in\tond*{0,\frac{1}{2}}\) by \[\begin{align} \mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon) & \mathrel{\vcenter{:}}= & \mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon)-\mathrm{t}_{\mathrm{mix}}(\mu_0,1-\varepsilon). \end{align}\] The analysis of this fundamental quantity is a challenging task which, until recently, had only been carried out in a handful of models. Over the past couple of years, a systematic approach to this problem was developed in the series of works sal-2023?, sal-2024?, sal-2025?, ped-sal-2025?, using an information-theoretic notion called varentropy. In the present paper, we propose an alternative approach which completely bypasses the use of varentropy, and instead exploits the parabolic regularization estimates 2 and 8 directly, in conjunction with a new W-TV transport inequality. As a first application, we are able to recover exactly the main result of sal-2025?, which reads as follows. Recall that the Poincaré constant of \(\pi\in\mathcal{P}\tond*{\mathbb{R}^d}\), denoted \(C_{\rm{P}}\tond*{\pi}\), is the smallest number such that \[\begin{align} \label{def:PI} \mathrm{Var}_{\pi}\tond*{f} & \leq & C_{\rm{P}}\tond*{\pi} \int \abs*{\nabla f}^2 d\pi, \end{align}\tag{9}\] for all smooth functions \(f\colon \mathbb{R}^d\to\mathbb{R}\). In particular, \(C_{\rm{P}}\tond*{\delta_x}=0\), for any \(x\in\mathbb{R}^d\).
Theorem 1 (Mixing window of the Langevin dynamics). The Langevin dynamics with a log-concave target \(\pi\in\mathcal{P}\tond*{\mathbb{R}^d}\) and an arbitrary initialization \(\mu_0\in\mathcal{P}\tond*{\mathbb{R}^d}\) satisfies \[\begin{align} \mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon) & \leq & \frac{3}{\varepsilon}\tond*{C_{\rm{P}}\tond*{\pi} + \sqrt{C_{\rm{P}}\tond*{\pi}C_{\rm{P}}\tond*{\mu_0}}+ \sqrt{C_{\rm{P}}\tond*{\pi}\mathrm{t}_{\mathrm{mix}}(\mu_0,1-\varepsilon)}}, \end{align}\] for any precision \(\varepsilon\in \tond*{0,\frac{1}{2}}\).
The interest of this estimate lies in its relation to the cutoff phenomenon, a remarkably universal – but still largely unexplained – phase transition from out of equilibrium to equilibrium undergone by certain ergodic Markov processes in an appropriate limit. We refer the unfamiliar reader to the recent lecture notes salez2025modernaspectsmarkovchains? and the references therein for an up-to-date introduction to this fascinating question.
Corollary 1 (Cutoff for the Langevin dynamics). Consider the setup of Theorem 1, but assume that the ambient dimension \(d\), the target \(\pi\), and the initialization \(\mu_0\) now depend on an implicit parameter \(n\in\mathbb{N}\), in such a way that \[\begin{align} \label{eq:prod-cond-cpi} \frac{\mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon)}{C_{\rm{P}}\tond*{\pi}+\sqrt{C_{\rm{P}}\tond*{\pi}C_{\rm{P}}\tond*{\mu_0}}} & \xrightarrow[n\to\infty]{} & +\infty, \end{align}\tag{10}\] for some fixed \(\varepsilon\in (0,1)\). Then, a cutoff occurs, in the sense that for every \(\varepsilon\in(0,1)\), \[\begin{align} \label{def:cutoff} \frac{\mathrm{t}_{\mathrm{mix}}(\mu_0,1-\varepsilon)}{\mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon)} & \xrightarrow[n\to\infty]{} & 1. \end{align}\tag{11}\]
Note that, in the standard setup where the initialization is a Dirac mass, our cutoff criterion 10 reduces to the natural condition \(\frac{\mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon)}{C_{\rm{P}}\tond*{\pi}}\to\infty\), which is known as the product condition in the classical literature on Markov chains (see, e.g., salez2025modernaspectsmarkovchains?).
Remark 2 (Manifolds). We have here chosen to work in the Euclidean space \(\mathbb{R}^d\) in order to keep the presentation simple and accessible. However, a careful inspection will convince the interested reader that our proof of Theorem 1 carries over to the more general setting of non-negatively curved diffusions on smooth complete weighted Riemannian manifolds.
As a second – and genuinely new – application of our transport approach to cutoff, we extend the above results to the Proximal Sampler, thereby tightening its relation to the Langevin dynamics. To lighten the formulas, we introduce the quantity \[\begin{align} \widehat{C}_{\rm{P}}\tond*{\pi} & \mathrel{\vcenter{:}}= & 1+\frac{C_{\rm{P}}\tond*{\pi}}{h}, \end{align}\] which, as we will see, serves as the natural discrete-time analogue of \(C_{\rm{P}}\tond*{\pi}\).
Theorem 3 (Mixing window of the Proximal Sampler). The Proximal Sampler with log-concave target \(\pi\), arbitrary initialization \(\mu_0\in\mathcal{P}\tond*{\mathbb{R}^d}\) and step size \(h>0\) satisfies \[\begin{align} \mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon) & \leq & \frac{6}{\varepsilon}\tond*{\widehat{C}_{\rm{P}}\tond*{\pi} + \sqrt{\widehat{C}_{\rm{P}}\tond*{\pi}\widehat{C}_{\rm{P}}\tond*{\mu_0}}+ \sqrt{\widehat{C}_{\rm{P}}\tond*{\pi}\mathrm{t}_{\mathrm{mix}}(\mu_0,1-\varepsilon)}}, \end{align}\] for any precision \(\varepsilon\in \tond*{0,\frac{1}{2}}\).
Corollary 2 (Cutoff for the Proximal Sampler). Consider the setup of Theorem 3, but assume that the ambient dimension \(d\), the target \(\pi\), the initialization \(\mu_0\), and the step size \(h\) now depend on an implicit parameter \(n\in\mathbb{N}\), in such a way that \[\begin{align} \label{eq:modified-prod-cond-cpi} \frac{\mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon)}{\widehat{C}_{\rm{P}}\tond*{\pi}+\sqrt{\widehat{C}_{\rm{P}}\tond*{\pi}\widehat{C}_{\rm{P}}\tond*{\mu_0}}} & \xrightarrow[n\to\infty]{} & \infty, \end{align}\tag{12}\] for some fixed \(\varepsilon\in (0,1)\). Then a cutoff occurs, in the sense of 11 above.
Here again, the condition 12 reduces to \(\frac{\mathrm{t}_{\mathrm{mix}}(\mu_0, \varepsilon)}{\widehat{C}_{\rm{P}}\tond*{\pi}}\to\infty\) for Dirac initializations. To the best of our knowledge, this is the very first result establishing cutoff for the Proximal Sampler. We emphasize that the latter is a discrete-time Markov process on a continuous state space, an object to which the varentropy approach developed in sal-2023?, sal-2024?, sal-2025?, ped-sal-2025?, salez2025modernaspectsmarkovchains? does not currently apply. Indeed, in that series of works, varentropy is controlled using either the celebrated chain rule, which notoriously fails in discrete time, or an approximate version of it involving a certain sparsity parameter, which only makes sense on discrete spaces. To bypass this limitation, our main idea is to replace the reverse Pinsker inequality of sal-2023?, in which varentropy appears, with the following W-TV transport inequality, which seems new and of independent interest.
Theorem 4 (W-TV transport inequality). For any \(\mu,\nu\in\mathcal{P}\tond*{\mathbb{R}^d}\), \[\begin{align} W^2(\mu,\nu)& \le &\frac{ 4\left(C_{\rm{P}}\tond*{\mu}+C_{\rm{P}}\tond*{\nu}\right)\mathrm{TV}\tond*{\mu, \nu}}{1-\mathrm{TV}\tond*{\mu, \nu}}. \end{align}\]
F.P. thanks Yuansi Chen for helpful comments. J.S. is supported by the ERC consolidator grant CUTOFF (101123174). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible.
In this section, we prove Theorem 4. Given two probability measures \(\mu,\nu \in \mathcal{P}\tond*{\mathbb{R}^d}\), we recall that the chi-squared divergence of \(\nu\) w.r.t. \(\mu\) is defined by the formula \[\begin{align} \chi^2\tond*{\nu\,\middle |\, \mu} & \mathrel{\vcenter{:}}= & \int \tond*{\frac{\mathrm{d}\nu}{\mathrm{d}\mu}-1} \mathrm{d}\nu, \end{align}\] with \(\chi^2\tond*{\nu\,\middle |\, \mu}=\infty\) if \(\nu\) is not absolutely continuous w.r.t. \(\mu\). Our starting point is the following transport-variance inequality, whose proof can be found in liu-2020?.
Lemma 1 (Transport-variance inequality). For any \(\mu,\nu \in \mathcal{P}\tond*{\mathbb{R}^d}\), \[\begin{align} W^2(\mu,\nu) & \leq & 2C_{\rm{P}}\tond*{\mu} \chi^2\tond*{\nu\,\middle |\, \mu}. \end{align}\]
Unfortunately, the chi-squared divergence appearing here could be arbitrarily large compared to the total-variation term with which we seek to control \(W^2(\mu,\nu)\). To preclude such pathologies, we introduce a probability measure \(\lambda\) which interpolates nicely between \(\mu\) and \(\nu\), in the sense of having small Radon-Nikodym derivatives w.r.t. both.
Lemma 2 (Interpolation). Given two probability measures \(\mu,\nu\) on a measurable space, there exists a probability measure \(\lambda\) which is absolutely continuous w.r.t. \(\mu\) and \(\nu\), with \[\begin{align} \left\|\frac{\mathrm{d}\lambda}{\mathrm{d}\mu}\right\|_\infty\vee\;\;\left\|\frac{\mathrm{d}\lambda}{\mathrm{d}\nu}\right\|_\infty & \leq & \frac{1}{1-\mathrm{TV}\tond*{\mu, \nu}}. \end{align}\]
Proof. We may assume that \(\mu\ne\nu\), otherwise the claim is trivial. Now, fix an arbitrary measure \(\sigma\) which is absolutely continuous w.r.t. both \(\mu\) and \(\nu\) (for example, \(\sigma:=\mu+\nu\)), and let \(f:=\frac{\mathrm{d}\mu}{\mathrm{d}\sigma}\) and \(g:=\frac{\mathrm{d}\nu}{\mathrm{d}\sigma}\) denote the corresponding Radon-Nikodym derivatives. With this notation at hand, we classically have the integral representation \[\begin{align} \mathrm{TV}\tond*{\mu, \nu} & = & 1-\int \tond*{f \wedge g}\, {\mathrm d}\sigma. \end{align}\] Consequently, we can define a probability measure \(\lambda\) by the formula \[\begin{align} {\mathrm d}\lambda & := & \frac{f\wedge g}{1-\mathrm{TV}\tond*{\mu, \nu}}\,\mathrm{d}{\sigma}. \end{align}\] This measure satisfies the desired property. Indeed, it is clearly absolutely continuous w.r.t. both \(\mu\) and \(\nu\), with corresponding Radon-Nikodym derivatives \[\begin{align} \frac{\mathrm{d}\lambda}{\mathrm{d}\mu} \;= \;\frac{1\wedge \frac{g}{f}}{1-\mathrm{TV}\tond*{\mu, \nu}},& \textrm{ and } & \frac{\mathrm{d}\lambda}{\mathrm{d}\nu} \;= \;\frac{1\wedge \frac{f}{g}}{1-\mathrm{TV}\tond*{\mu, \nu}}, \end{align}\] those formulae being interpreted as zero outside \(\mathop{\mathrm{Supp}}(\lambda):=\{f\wedge g>0\}\). ◻
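As a minimal numerical illustration of this construction (on a discrete space, with randomly chosen \(\mu,\nu\); a sanity check only, not part of the argument), one can verify the bound of Lemma 2 directly:

```python
import numpy as np

# Discrete sanity check of Lemma 2: with sigma the counting measure, f = mu, g = nu,
# the measure lambda := (f ^ g) / (1 - TV(mu, nu)) has density at most 1/(1 - TV) w.r.t. both.
rng = np.random.default_rng(2)
mu = rng.dirichlet(np.ones(8))   # two arbitrary probability vectors on 8 points
nu = rng.dirichlet(np.ones(8))

tv = 1.0 - np.minimum(mu, nu).sum()      # TV(mu, nu) = 1 - sum_i min(mu_i, nu_i)
lam = np.minimum(mu, nu) / (1.0 - tv)    # the interpolating probability measure

print(lam.sum())                                             # = 1 (lambda is a probability measure)
print((lam / mu).max(), (lam / nu).max(), 1.0 / (1.0 - tv))  # both ratios are at most 1/(1 - TV)
```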
We now have everything we need to establish Theorem 4.
Proof of Theorem 4. Fix \(\mu,\nu\in\mathcal{P}\tond*{\mathbb{R}^d}\) and let \(\lambda\) be as in Lemma 2. Then, \[\begin{align} W^2(\mu,\nu) & \le & 2W^2(\mu,\lambda)+2W^2(\nu,\lambda)\\ & \le & 4C_{\rm{P}}\tond*{\mu}\chi^2\tond*{\lambda\,\middle |\, \mu}+4C_{\rm{P}}\tond*{\nu}\chi^2\tond*{\lambda\,\middle |\, \nu}, \end{align}\] by the triangle inequality and Lemma 1. On the other hand, we have the crude bound \[\begin{align} \chi^2\tond*{\lambda\,\middle |\, \mu} & \le & \left\|\frac{\mathrm{d}\lambda}{\mathrm{d}\mu}\right\|_\infty-1 \;\le \;\frac{\mathrm{TV}\tond*{\mu, \nu}}{1-\mathrm{TV}\tond*{\mu, \nu}}, \end{align}\] by Lemma 2, and similarly with \(\nu\) instead of \(\mu\). ◻
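Before moving on, let us record a quick numerical sanity check of Theorem 4 (purely illustrative): for the one-dimensional Gaussians \(\mu=\gamma_{0,1}\) and \(\nu=\gamma_{m,1}\), all quantities are explicit, namely \(W^2(\mu,\nu)=m^2\), \(\mathrm{TV}\tond*{\mu, \nu}=2\Phi(m/2)-1\) and \(C_{\rm{P}}\tond*{\mu}=C_{\rm{P}}\tond*{\nu}=1\). The following sketch compares both sides on a few values of \(m\).

```python
import math

def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Both sides of Theorem 4 for mu = N(0,1), nu = N(m,1):
# W^2 = m^2, TV = 2*Phi(m/2) - 1, C_P(mu) = C_P(nu) = 1.
for m in [0.1, 0.5, 1.0, 2.0, 4.0]:
    w2 = m ** 2
    tv = 2.0 * std_normal_cdf(m / 2.0) - 1.0
    bound = 4.0 * (1.0 + 1.0) * tv / (1.0 - tv)
    print(f"m = {m:3.1f}   W^2 = {w2:7.3f}   bound = {bound:9.3f}   holds: {w2 <= bound}")
```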
In this section, we prove Theorem 1 and Corollary 1. Consider the Langevin dynamics 1 with target \(\pi\) and initialization \(\mu_0\in\mathcal{P}\tond*{\mathbb{R}^d}\), and write \(\mu_t = \mathop{\mathrm{law}}(X_t)\) for the law at time \(t\ge 0\). As is well known from the Bakry–Émery theory bak-gen-led-2014? (see Remark 5 below for an alternative proof), the log-concavity of \(\pi\) ensures the local Poincaré inequality \[\begin{align} \label{eq:LPI} \forall t\ge 0,\quad C_{\rm{P}}\tond*{\mu_t} & \le & C_{\rm{P}}\tond*{\mu_0}+2t. \end{align}\tag{13}\] Another ingredient that we will need is the basic mixing-time estimate \[\begin{align} \label{eq:sal-fast-tv-mixing-small-kl} \mathrm{t}_{\mathrm{mix}}\tond*{\mu_0,\varepsilon} & \leq & \frac{C_{\rm{P}}\tond*{\pi}\tond*{1+{H}\tond*{\mu_0\,\middle |\, \pi}}}{\varepsilon}, \end{align}\tag{14}\] borrowed from sal-2023?, which relies on the classical fact that \[\begin{align} \forall t\ge 0,\quad \chi^2\tond*{\mu_t\,\middle |\, \pi} & \le & e^{-\frac{2t}{C_{\rm{P}}\tond*{\pi}}} \chi^2\tond*{\mu_0\,\middle |\, \pi}, \end{align}\] together with an easy interpolation argument between \(\chi^2\tond*{\mu_t\,\middle |\, \pi},{H}\tond*{\mu_t\,\middle |\, \pi}\) and \(\mathrm{TV}\tond*{\mu_t, \pi}\).
Proof of Theorem 1. Fix \(\varepsilon\in \tond*{0,\frac{1}{2}}\) and set \(t_0 := \mathrm{t}_{\mathrm{mix}}(\mu_0,1-\varepsilon)\). By the very definition of \(t_0\), our W-TV transport inequality (Theorem 4) gives \[\begin{align} W^2(\mu_{t_0},\pi) & \leq & \frac{4C_{\rm{P}}\tond*{\pi}+4C_{\rm{P}}\tond*{\mu_{t_0}}}{\varepsilon}. \end{align}\] Therefore, the parabolic regularization estimate 2 applied to \(\mu_{t_0}\) instead of \(\mu_0\) yields \[\begin{align} \forall s\ge 0,\quad {H}\tond*{\mu_{t_0+s}\,\middle |\, \pi} & \leq & \frac{C_{\rm{P}}\tond*{\pi}+C_{\rm{P}}\tond*{\mu_{t_0}}}{s \varepsilon}. \end{align}\] On the other hand, applying 14 to \(\mu_t\) instead of \(\mu_0\) ensures that \[\begin{align} \mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon) & \leq & t+ \frac{C_{\rm{P}}\tond*{\pi}\tond*{1+{H}\tond*{\mu_t\,\middle |\, \pi}}}{\varepsilon}, \end{align}\] for any \(t\ge 0\). Choosing \(t=t_0+s\) and combining this with the previous line, we obtain \[\begin{align} \mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon) & \le & s+\frac{C_{\rm{P}}\tond*{\pi}}{\varepsilon}+\frac{C_{\mathrm P}^2(\pi)+C_{\rm{P}}\tond*{\pi}C_{\rm{P}}\tond*{\mu_{t_0}}}{s \varepsilon^2}. \end{align}\] Since this bound is valid for any \(s\ge 0\), we may finally optimize on \(s\) to conclude that \[\begin{align} \mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon) & \le & \frac{C_{\rm{P}}\tond*{\pi}}{\varepsilon}+\frac{2}{\varepsilon}\sqrt{C_{\mathrm P}^2(\pi)+C_{\rm{P}}\tond*{\pi}C_{\rm{P}}\tond*{\mu_{t_0}}}. \end{align}\] This implies the desired estimate, thanks to 13 and the subadditivity of \(\sqrt{\cdot}\). ◻
Proof of Corollary 1. We now let the ambient dimension \(d\), the target \(\pi\) and the initialization \(\mu_0\) depend on an implicit parameter \(n\in \mathbb{N}\), in such a way that the condition (10) holds for some \(\varepsilon\in(0,1)\). Since \(\mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon)\) is a non-increasing function of \(\varepsilon\), (10) must in fact hold for every small enough \(\varepsilon>0\), and Theorem 1 then readily implies that \[\begin{align} \frac{\mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon)}{\mathrm{t}_{\mathrm{mix}}(\mu_0,\varepsilon)} & \xrightarrow[n\to\infty]{} & 0. \end{align}\] Since this holds for every small enough \(\varepsilon>0\), the cutoff phenomenon (11) follows. ◻
To prove Theorem 3, we fix a log-concave target \(\pi\in\mathcal{P}\tond*{\mathbb{R}^d}\), a step size \(h>0\), and an initialization \(\mu_0\in\mathcal{P}\tond*{\mathbb{R}^d}\), and we consider the random sequence \(X_0,Y_0,X_1,Y_1,\ldots\) generated by the Proximal Sampler steps 6 and 7 . We write \(\mu_k=\mathrm{law}(X_k)\). The main ingredient we need in order to mimic the proof of Theorem 1 is a version of the local Poincaré inequality 13 for the Proximal Sampler, provided in the following lemma.
Lemma 3 (Local Poincaré inequality for the Proximal Sampler). We have \[\begin{align} \forall k\in\mathbb{N},\quad C_{\rm{P}}\tond*{\mu_k} & \le & C_{\rm{P}}\tond*{\mu_0} + 2kh. \end{align}\]
Proof. Let us use the convenient shorthand \(C_{\rm{P}}\tond*{U}:=C_{\rm{P}}\tond*{\mathrm{law}(U)}\) when \(U\) is an \(\mathbb{R}^d\)-valued random variable. By induction, it is enough to prove the claim when \(k=1\), i.e. \[\begin{align} C_{\rm{P}}\tond*{X_1} & \le & C_{\rm{P}}\tond*{X_0}+2h. \end{align}\] First observe that, by construction, the random variable \(Y_0-X_0\) is \(\gamma_h\)-distributed and independent of \(X_0\). Using the sub-additivity of the Poincaré constant under convolutions and the Gaussian Poincaré inequality (see, e.g., bak-gen-led-2014?), we deduce that \[\begin{align} C_{\rm{P}}\tond*{Y_0} & \le & C_{\rm{P}}\tond*{X_0}+h, \end{align}\] which reduces our task to proving that \[\begin{align} \label{goal} C_{\rm{P}}\tond*{X_1} & \le &C_{\rm{P}}\tond*{Y_0}+h. \end{align}\tag{15}\] Let us first establish this under the additional assumption that \(\pi\) is log-smooth, i.e. \(\nabla^2V \preccurlyeq \beta I_d\) for some \(\beta<\infty\). To do so, we rely on a clever continuous-time stochastic interpolation between \(Y_0\) and \(X_1\) introduced in che-che-sal-wib-2022?. More precisely, it is shown therein that \(X_1\stackrel{d}{=}U_h\), where \((U_t)_{t\in[0,h]}\) solves the SDE \[\begin{align} \label{eq:backward-BM} U_0 = Y_0, \quad \mathrm{d}U_t & = & \nabla \log f_{h-t}(U_t)\mathrm{d}t + \mathrm{d}B_t, \end{align}\tag{16}\] with \(f_t\) denoting the density of \(\pi*\gamma_t\). Following a strategy used in vem-wib-2019?, we now track the evolution of the Poincaré constant along an appropriate time-discretization of this SDE. Specifically, given a resolution \(n\in\mathbb{N}\), we consider the Euler–Maruyama discretization \((\tilde{U}_0,\ldots,\tilde{U}_n)\) of 16 with step size \(\delta\mathrel{\vcenter{:}}= \frac{h}{n}\), defined inductively by \[\begin{align} \tilde{U}_0 = Y_0, \quad \tilde{U}_{j+1} & \mathrel{\vcenter{:}}= & \tilde{U}_j + \delta \nabla \log f_{h-\delta j}(\tilde{U}_j) + B_{\delta\tond{j+1}} -B_{\delta j}. \end{align}\] As above, the sub-additivity of \(\mu\mapsto C_{\rm{P}}\tond*{\mu}\) under convolutions yields \[\begin{align} \label{eq:hessft} C_{\rm{P}}\tond*{\tilde{U}_{j+1}} & \leq & C_{\rm{P}}\tond*{\tilde{U}_j + \delta \nabla \log f_{h-\delta j}(\tilde{U}_j)} + \delta. \end{align}\tag{17}\] To estimate the right-hand side, we recall that by assumption, \(0\preccurlyeq -\nabla^2 \log f_0\preccurlyeq \beta I_d\) for some \(\beta<\infty\), and that this property is preserved under the heat flow, i.e. \[\begin{align} \forall t\ge 0,\quad 0 \;\preccurlyeq\;-\nabla^2\log f_t & \preccurlyeq & \beta I_d, \end{align}\] see sau-wel-2014? for the lower bound and equation (6) in mik-she-2023? for the upper bound. Consequently, the gradient-descent map \(x \to x + \delta \nabla \log f_{t}(x)\) is \(1\)-Lipschitz as soon as \(\beta\delta\le 1\), which we can enforce by choosing \(n\ge h\beta\). Since the Poincaré constant cannot increase under \(1\)-Lipschitz pushforwards (see cor_era-2002?), we deduce that \[\begin{align} C_{\rm{P}}\tond*{\tilde{U}_j + \delta \nabla \log f_{h-\delta j}(\tilde{U}_j)} & \le & C_{\rm{P}}\tond*{\tilde{U}_j}. \end{align}\] Inserting this into 17 and solving the resulting recursion, we conclude that \[\begin{align} C_{\rm{P}}\tond*{\tilde{U}_n} & \le & C_{\rm{P}}\tond*{Y_0}+h. \end{align}\] Sending \(n\to\infty\) gives 15 , since the Euler–Maruyama approximation \(\tilde{U}_n\) converges in distribution to \(U_h\stackrel{d}{=}X_1\) as the resolution \(n\) tends to infinity.
Finally, to remove our log-smoothness assumption on \(\pi\), we fix a regularization parameter \(\varepsilon>0\) and consider the random sequence \(X_0^\varepsilon,Y_0^\varepsilon,X_1^\varepsilon,\ldots\) generated by the Proximal Sampler with initialization \(\mu_0\), step size \(h\), and regularized target \(\pi_\varepsilon:=\pi*\gamma_\varepsilon\). Since the latter is log-concave and log-smooth, the first step of the proof ensures that \[\begin{align} \label{goal:eps} C_{\rm{P}}\tond*{X^\varepsilon_1} & \le & C_{\rm{P}}\tond*{Y_0^\varepsilon}+h. \end{align}\tag{18}\] But by construction, we have \(\mathrm{law}(Y_0^\varepsilon)=\mu_0*\gamma_h=\mathrm{law}(Y_0)\), and for each \(y\in\mathbb{R}^d\), \[\begin{align} \mathrm{law}(X^\varepsilon_1|Y^\varepsilon_0=y) & = & \frac{e^{-\frac{|x-y|^2}{2h}}\pi_\varepsilon(\mathrm{d}x)}{\int e^{-\frac{|z-y|^2}{2h}}\pi_\varepsilon(\mathrm{d}z)} \;\xrightarrow[\varepsilon\to 0]{} \; \frac{e^{-\frac{|x-y|^2}{2h}}\pi(\mathrm{d}x)}{\int e^{-\frac{|z-y|^2}{2h}}\pi(\mathrm{d}z)} \;= \;\mathrm{law}(X_1|Y_0=y), \end{align}\] simply because \(\pi_\varepsilon\to\pi\) as \(\varepsilon\to 0\). Thus, \(\mathrm{law}(X_1^\varepsilon)\to \mathrm{law}(X_1)\) as \(\varepsilon\to 0\), and we may safely pass to the limit in 18 to obtain 15 . ◻
Remark 5 (Extensions). The above argument is rather robust. For example, replacing the drift \(\nabla \log f_{h-t}\) by \(-\frac{1}{2}\nabla V\) in 16 (and rescaling time) gives a simple alternative proof of the celebrated local Poincaré inequality 13 , and the same reasoning actually also yields local log-Sobolev inequalities. When the potential \(V\) is strongly convex, sharp improved estimates on those constants can be derived accordingly, using the strong contractivity of the gradient-descent map.
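To illustrate the stochastic interpolation 16 concretely, the following Monte Carlo sketch considers the illustrative one-dimensional Gaussian target \(\pi=\gamma_{0,\sigma^2}\), for which \(f_s\) is the density of \(\gamma_{0,\sigma^2+s}\) and \(\nabla \log f_s(u)=-\frac{u}{\sigma^2+s}\), and checks that the Euler–Maruyama discretization of 16 started from a fixed \(y\) reproduces the backward law 7, namely the Gaussian with mean \(\frac{\sigma^2 y}{\sigma^2+h}\) and variance \(\frac{\sigma^2 h}{\sigma^2+h}\). All parameter values are arbitrary choices of ours.

```python
import numpy as np

# Euler-Maruyama discretization of the SDE (16) for the 1-d Gaussian target pi = N(0, sigma2),
# started from a fixed point y. The empirical law of U_h should match the backward step (7),
# i.e. N(sigma2 * y / (sigma2 + h), sigma2 * h / (sigma2 + h)).  (Illustrative parameters.)
rng = np.random.default_rng(3)
sigma2, h, y = 1.0, 0.5, 2.0
n_steps, n_paths = 400, 20_000
delta = h / n_steps

u = np.full(n_paths, y)
for j in range(n_steps):
    s = h - delta * j                      # drift uses f_{h - t} with t = delta * j
    u = u - delta * u / (sigma2 + s) + np.sqrt(delta) * rng.standard_normal(n_paths)

print(u.mean(), sigma2 * y / (sigma2 + h))   # both ~ 1.333
print(u.var(),  sigma2 * h / (sigma2 + h))   # both ~ 0.333
```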
We will also need the following analogue of the mixing-time estimate 14 .
Lemma 4 (Mixing-time estimate for the Proximal Sampler). We have \[\begin{align} \mathrm{t}_{\mathrm{mix}}\tond*{\mu_0,\varepsilon} & \leq & \ceil*{\widehat{C}_{\rm{P}}\tond*{\pi}\frac{1+{H}\tond*{\mu_0\,\middle |\, \pi}}{\varepsilon}}. \end{align}\]
Proof. It was shown in che-che-sal-wib-2022? that for any \(k\in\mathbb{N}\), \[\begin{align} \chi^2\tond*{\mu_k\,\middle |\, \pi} & \leq & \tond*{1+\frac{h}{C_{\rm{P}}\tond*{\pi}}}^{-2k}\chi^2\tond*{\mu_0\,\middle |\, \pi}\\ & \leq & \exp\left(-\frac{2k}{\widehat{C}_{\rm{P}}\tond*{\pi}}\right)\chi^2\tond*{\mu_0\,\middle |\, \pi}, \end{align}\] where the second line follows from our definition of \(\widehat{C}_{\rm{P}}\tond*{\pi}\) and the elementary bound \(e^{\frac{1}{u}}\le \frac{u}{u-1}\), valid for any \(u>1\) since \(-\log\tond*{1-\frac{1}{u}}\ge \frac{1}{u}\). The remainder of the proof is then exactly as in sal-2023?. ◻
We now have everything we need to mimic the proof of Theorem 1.
Proof of Theorem 3. Fix \(\varepsilon\in \tond*{0,\frac{1}{2}}\) and set \(k_0:= \mathrm{t}_{\mathrm{mix}}\tond*{\mu_0,1-\varepsilon}\). Our W-TV transport inequality (Theorem 4) combined with the parabolic regularization estimate 8 gives \[\begin{align} {H}\tond*{\mu_{k_0+k}\,\middle |\, \pi} & \leq & \frac{4\widehat{C}_{\rm{P}}\tond*{\pi}+4\widehat{C}_{\rm{P}}\tond*{\mu_{k_0}}}{\varepsilon k}. \end{align}\] As above, we can then apply Lemma 4 to \(\mu_{k_0+k}\) instead of \(\mu_0\) to obtain \[\begin{align} \mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon) & \leq & k + 1+ \widehat{C}_{\rm{P}}\tond*{\pi}\tond*{\frac{1}{\varepsilon}+\frac{4\widehat{C}_{\rm{P}}\tond*{\pi}+4\widehat{C}_{\rm{P}}\tond*{\mu_{k_0}}}{\varepsilon^2 k}}. \end{align}\] But this holds for any \(k\in\mathbb{N}\), and choosing \(k=\ceil*{\frac{2}{\varepsilon}\sqrt{\widehat{C}_{\rm{P}}\tond*{\pi}\tond*{\widehat{C}_{\rm{P}}\tond*{\pi}+\widehat{C}_{\rm{P}}\tond*{\mu_{k_0}}}}}\) yields \[\begin{align} \mathop{\mathrm{w_{mix}}}(\mu_0,\varepsilon) & \leq & 2+\frac{\widehat{C}_{\rm{P}}\tond*{\pi}}{\varepsilon}+ \frac{4}{\varepsilon}\sqrt{\widehat{C}_{\rm{P}}\tond*{\pi}\tond*{\widehat{C}_{\rm{P}}\tond*{\pi}+\widehat{C}_{\rm{P}}\tond*{\mu_{k_0}}}}. \end{align}\] The result now readily follows from Lemma 3 and the sub-additivity of \(\sqrt{\cdot}\). ◻