September 10, 2025
The unadjusted Langevin algorithm is widely used for sampling from complex high-dimensional distributions. It is well known to be biased, with the bias typically scaling linearly with the dimension when measured in squared Wasserstein distance. However, the recent paper of Chen et al. [1] identifies an intriguing new delocalization effect: For a class of distributions with sparse interactions, the bias between low-dimensional marginals scales only with the lower dimension, not the full dimension. In this work, we strengthen the results of [1] in the sparse interaction regime by removing a logarithmic factor, measuring distance in relative entropy (a.k.a. KL-divergence), and relaxing the strong log-concavity assumption. In addition, we expand the scope of the delocalization phenomenon by showing that it holds for a class of distributions with weak interactions. Our proofs are based on a hierarchical analysis of the marginal relative entropies, inspired by the authors’ recent work on propagation of chaos.
Langevin Monte Carlo (LMC), also known as the unadjusted Langevin algorithm, is a simple but widely used sampling method, especially in high dimension. Given a probability density \(\pi(x) \propto e^{-V(x)}\) on \({\mathbb{R}}^n\) and a step size \(h > 0\), the LMC iteration is given by \[X_{(k+1)h} = X_{kh} - h\nabla V(X_{kh}) + \sqrt{2h} Z_k, \label{intro:LMCiterates}\tag{1}\] where \((Z_k)\) are iid standard Gaussians. There is a rich literature on theoretical convergence guarantees for LMC, especially in the benchmark strongly log-concave and smooth setting. For fixed \(h > 0\), \(X_{kh}\) converges to a distribution \(\pi_h\) as \(k\to\infty\), and the convergence rate is typically dimension-free, when measured in relative entropy (a.k.a. Kullback-Leibler divergence) or the (quadratic) Wasserstein metric.
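For concreteness, here is a minimal sketch of the iteration 1 ; the quadratic potential \(V(x)=|x|^2/2\), the step size, and the iteration count are illustrative placeholders rather than choices made in this paper.

```python
import numpy as np

def lmc(grad_V, x0, h, n_steps, rng):
    """Unadjusted Langevin: x <- x - h * grad_V(x) + sqrt(2h) * Z, with Z ~ N(0, I)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_V(x) + np.sqrt(2 * h) * rng.standard_normal(x.shape)
    return x

# Toy example: V(x) = |x|^2 / 2, so grad_V(x) = x and pi is the standard Gaussian.
rng = np.random.default_rng(0)
sample = lmc(lambda x: x, x0=np.zeros(10), h=0.01, n_steps=1_000, rng=rng)
```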
However, the distance between \(\pi_h\) and \(\pi\), known as the bias, inevitably depends on the ambient dimension \(n\). The best known bound on the squared Wasserstein distance \(W_2^2(\pi_h,\pi)\) or the relative entropy \(H(\pi_h\,|\,\pi)\) is \(O(nh)\) (or \(O(nh^2)\) under higher-order smoothness assumptions). This suggests an unfortunate tradeoff between accuracy and step size: To achieve a given level of accuracy, the step size must scale inversely with the dimension, \(h \sim 1/n\). This tradeoff is often ignored in practice, especially in extremely high-dimensional applications such as molecular dynamics, for which accordingly small step sizes would be impractical.
The recent paper [1] by Chen, Cheng, Niles-Weed, and Weare identifies a remarkable new phenomenon which can justify ignoring this tradeoff, which they term delocalization of bias: The bias contained in low-dimensional marginals often scales only with the lower dimension. In other words, the bias-per-coordinate is often dimension-free. To state this precisely, we use the following notation. For a probability measure \(\rho\) on \({\mathbb{R}}^n\) and a set \(u \subset [n]=\{1,\ldots,n\}\), let us write \(\rho^u\) for the marginal of the coordinates in \(u\). That is, if \((X_i)_{i\in [n]} \sim \rho\), then \((X_i)_{i \in u} \sim \rho^u\). The paper [1] shows in certain contexts that delocalization of bias holds in the sense that \[\max_{u \subset [n], \, |u|=k} W_2^2(\pi^u_h,\pi^u) = O( h k\log n). \label{intro:delocalization-CCNW}\tag{2}\] When this holds, it means roughly that a \(k\)-dimensional observable can be accurately estimated using a step size of order merely \(h \sim 1/(k\log n)\), rather than \(h \sim 1/n\). We refer to the introduction of [1] for a more thorough explanation of the practical implications of this phenomenon, as well as a justification for focusing on LMC as opposed to other sampling methods.
The authors of [1] rigorously prove 2 in two particular contexts, namely when \(\pi\) is Gaussian or when the potential \(V\) satisfies a certain structural condition of sparse interactions which will be described in more detail below. Note also that 2 holds trivially for product measures as well, without the log factor. However, delocalization is not a universal phenomenon. As illustrated in [1], it can fail even for simple examples, such as a rotation of a strongly log-concave product measure in which a single coordinate becomes, in a sense, too influential.
The goal of this paper is to deepen the study of delocalization of bias, in the following ways:
We improve the main result of [1] in the sparse setting. Specifically, we replace \(W_2^2\) with relative entropy (a.k.a. Kullback-Leibler divergence), which is stronger by Talagrand’s inequality. We also remove the logarithmic factor, improving 2 to the optimal \(O(hk)\).
We enlarge the scope of the delocalization phenomenon, by proving that it holds for a different class of potentials with weak interactions. This class neither contains nor is contained in the sparse setting. To summarize very roughly, the sparseness condition constrains the pattern of zeros in the Hessian \(\nabla^2 V\), whereas the weak interaction condition constrains the \(\ell^\infty\to\ell^\infty\) operator norm of \(\nabla^2 V\).
We do not require \(\pi\) to be log-concave, as in [1]. Instead, we make two assumptions which are both implied by strong log-concavity: that \(\pi\) satisfies a logarithmic-Sobolev inequality, and also that its conditional measures satisfy certain transport inequalities.
Our method of proof combines ideas from the Vempala-Wibisono approach for LMC analysis [2] with methods introduced in a prior paper of ours [3] in the study of propagation of chaos. The former paper [2] analyzes LMC by using a continuous-time interpolation of the LMC iterates and then analyzing the time-derivative of relative entropy (relative to the target \(\pi\)) along this interpolation. The latter paper [3] builds on prior work of the first author [4], [5] on mean field approximations, which on the surface has nothing to do with LMC. As the present work will illustrate, the methods of [3] are well-suited to the more general problem of quantitatively comparing low-dimensional marginals of high-dimensional Fokker-Planck equations. The core idea of the approach is to show that the set-indexed vector \((H_{kh}(u))_{u \subset [n]}\) satisfies a system of linear inequalities, recursive in \(k\), where \(H_{kh}(u)\) denotes the relative entropy between the law of \((X_{kh}^i)_{i \in u}\) and \(\pi^u\), for each coordinate set \(u \subset [n]\).
The sparse setting is built on a simple undirected graph \(G=([n],E)\) which encodes the interactions between coordinates. The edge set is defined by \[E := \{(i,j) \in [n]^2 : i \neq j, \;\partial_{ij} V \text{ is not identically zero} \}.\] For \(i \in [n]\), let \(N_0(i)=\{i\}\), and the \(k\)-neighborhood \(N_k(i)\) of vertex \(i\) is defined recursively as the union of \(N_{k-1}(i)\) with the set of vertices in \([n] \setminus N_{k-1}(i)\) having an edge to a vertex in \(N_{k-1}(i)\). Let us write also \(N_k(u)=\cup_{i \in u}N_k(i)\) for \(u \subset [n]\). Abbreviate \(N(i)=N_1(i)\) and \(N(u)=N_1(u)\). For \(x=(x_i)_{i \in [n]} \in {\mathbb{R}}^n\) and \(u \subset [n]\) we write \(x_u=(x_i)_{i \in u}\). In particular, note that \(\partial_iV(x)\) depends on \(x\) only through \(x_{N(i)}\).
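To illustrate the recursion defining \(N_k\), the following sketch computes \(N_k(u)\) from an adjacency-list representation of \(G\); the path graph at the end is a toy example (with polynomial growth, \(p=1\)), not a model from the paper.

```python
def neighborhood(adj, u, k):
    """Return N_k(u): the union of u with all vertices within graph distance k of u."""
    nbhd = set(u)
    for _ in range(k):
        nbhd |= {j for i in nbhd for j in adj[i]}
    return nbhd

# Toy example: the path graph 0 - 1 - ... - 7, where N(i) = {i-1, i, i+1}.
n = 8
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
assert neighborhood(adj, {3}, 2) == {1, 2, 3, 4, 5}
```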
The sparse setup introduced in [1] is slightly different. They define a graph first, and then assume the potential has the structure \(V(x) = \sum_{i=1}^n V_i(x_{N(i)})\) for some functions \(V_i : {\mathbb{R}}^{N(i)}\to{\mathbb{R}}\). Instead, we take the potential \(V\) as given, and we define the graph in terms of the potential. These two perspectives are essentially equivalent, up to changing constants in the neighborhood growth assumptions below.
The main assumptions in [1] are that \(V\) is strongly convex and smooth, i.e., \(0 < \alpha I \le \nabla^2 V \le \beta I < \infty\), and that the graph has polynomial growth, in the sense that there exist \(p \in {\mathbb{N}}\) and \(c > 0\) such that \[\max_{i \in [n]} |N_k(i)| \le c(1+k)^p , \;\text{ for all } k \in {\mathbb{N}}. \label{intro:def:sparse}\tag{3}\] Under these assumptions, [1] shows for \(h\le \alpha/\beta^2\) that \[W_{2,\ell^\infty}^2(\pi_h,\pi) \lesssim h\log(2n) \min\bigg\{\Big( \frac{\beta}{\alpha}\log(2n)\Big)^{p+2}, \;\frac{\beta^2}{\alpha^2}n \bigg\}. \label{intro:prevresult}\tag{4}\] Throughout the paper, we will use the following Wasserstein distances, defined for any \(p \ge 1\) and any probability measures \(\mu,\nu\) on a common Euclidean space with finite \(p\)th moments: \[\begin{align} W_p^p(\mu,\nu) &:= \inf_\pi \int |x-y|^p\,\pi(dx,dy), \qquad W_{p,\ell^\infty}^p(\mu,\nu) := \inf_\pi \int |x-y|_\infty^p\,\pi(dx,dy), \end{align}\] where the infima are over couplings of \(\mu\) and \(\nu\), and \(|\cdot|\) and \(|\cdot|_\infty\) denote the \(\ell^2\) (Euclidean) norm and \(\ell^\infty\) norm, respectively. The estimate 4 controls all low-dimensional marginals via \[W_2^2(\pi_h^u,\pi^u) \lesssim h|u| \log(2n) \min\bigg\{\Big( \frac{\beta}{\alpha}\log(2n)\Big)^{p+2}, \;\frac{\beta^2}{\alpha^2}n \bigg\}, \quad \forall u \subset [n]. \label{intro:prevresult2}\tag{5}\]
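As an aside, both distances are straightforward to compute between small empirical measures, which can be useful for numerically probing bounds like 4 and 5 . The following sketch computes them via an optimal matching with scipy; the Gaussian samples are placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_W2sq(X, Y, norm=None):
    """W_2^2 between two empirical measures with equal sample sizes, computed as an
    optimal matching; norm=None gives the l2 cost, norm=np.inf the l-infinity cost."""
    diff = X[:, None, :] - Y[None, :, :]
    cost = np.linalg.norm(diff, ord=norm, axis=2) ** 2
    row, col = linear_sum_assignment(cost)
    return cost[row, col].mean()

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((200, 5)), 0.5 + rng.standard_normal((200, 5))
print(empirical_W2sq(X, Y), empirical_W2sq(X, Y, norm=np.inf))
```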
Our first two theorems improve this bound in several ways. We will work extensively with the relative entropy \(H\) (a.k.a. Kullback-Leibler divergence) and relative Fisher information \(I\), defined for two probability measures on Euclidean space by \[H(\nu\,|\,\mu) := \int f\,\log f \,d\mu, \qquad I(\nu\,|\,\mu) := \int |\nabla \log f|^2\,d\nu, \qquad f := \frac{d\nu}{d\mu}. \label{def:H-I}\tag{6}\] We set \(H(\nu\,|\,\mu) :=\infty\) if \(\nu\not\ll\mu\), and similarly \(I(\nu\,|\,\mu) :=\infty\) if \(\nu\not\ll\mu\) or if \(\log f\) is not weakly differentiable. We make the following assumptions:
Logarithmic Sobolev inequality: There exists \(\alpha > 0\) such that \(\pi(dx)\propto e^{-V(x)}dx\) satisfies \[2\alpha H(\mu\,|\,\pi) \le I(\mu\,|\,\pi), \quad \forall \mu \in {\mathcal{P}}({\mathbb{R}}^n).\]
Lipschitz gradient: There exists \(\beta > 0\) such that \(\nabla V\) is \(\beta\)-Lipschitz.
Conditional Talagrand inequality: There exists \(\gamma > 0\) such that \[W_1^2(\mu,\pi^{N(u)\setminus u|u}_{x_u}) \le \frac{2\gamma}{\alpha}H(\mu\,|\,\pi^{N(u)\setminus u|u}_{x_u}), \quad \forall u \subset [n], \;x_u \in {\mathbb{R}}^u, \;\mu \in {\mathcal{P}}({\mathbb{R}}^{N(u)\setminus u}),\] where \(\pi^{N(u)\setminus u|u}_{x_u}\) denotes the conditional law of \(X_{N(u) \setminus u}\) given \(X_u=x_u\) under \(X \sim \pi\).
If \(\pi\) is strongly log-concave in the sense that \(\nabla^2 V \ge \alpha I > 0\), then (LSI) holds by Bakry-Émery [6]. Moreover, (A.2) holds with \(\gamma=1\), because strongly log-concave measures satisfy the Talagrand inequality, and because strong log-concavity (with the same parameter) is preserved by the operations of conditioning, which is easy to show, and marginalizing, which is a well known consequence of the Prékopa-Leindler inequality. These assumptions remain valid for bounded perturbations of strongly log-concave measures, by Holley-Stroock and the fact that LSI implies the Talagrand inequality [6].
Our first theorem assumes polynomial growth of the graph’s neighborhoods.
Theorem 1. Suppose assumptions (LSI) and (A.1,2) hold, and there exist \(c,p \ge 1\) such that \[\max_{i \in [n]}|N_{k+1}(i)| \le c(1+k^p), \quad \forall k \ge 0. \label{asmp:polygrowth}\tag{7}\] Then there exist constants \(C,h^*>0\), depending only on \((\alpha,\beta,\gamma,c,p)\), such that \[\frac{\alpha}{2}W_2^2(\pi^u_h,\pi^u) \le H(\pi^u_h\,|\,\pi^u) \le Ch|u|, \qquad \forall u \subset [n], \;\;0 \le h \le h^*. \label{thm:mainbound}\tag{8}\] We may explicitly take \[C = 40c^2p^p\frac{\beta^2}{\alpha}\Big( \frac{8\gamma\beta^2}{\alpha^2} + 1\Big)^p, \qquad h^* = \frac{\alpha}{4c\beta^2}.\]
Theorem 1 is generalized by Theorem 9, which includes estimates of the LMC dynamics, not just at stationarity. Ignoring dependence on other parameters, the bound on the bias in Theorem 1 is \(O(kh)\) in any coordinate set \(u\) of size \(k\). The ambient dimension \(n\) plays no direct role in this bound or in the restriction on the time step \(h\). In particular, Theorem 9 has no logarithmic factor like 5 . However, the dependence on the condition number \(\beta/\alpha\) is slightly better in 5 where it is of order \((\beta/\alpha)^{p+2}\), in contrast with \((\beta/\alpha)^{2p+2}\) in our Theorem 1. We have not attempted to optimize the dependence on \(c\) or \(p\) in 8 , other than in the exponent of \(\beta/\alpha\).
Our next theorem, still in the sparse regime, pushes the growth condition as far as our methods seem to allow, from polynomial to exponential, up to a certain critical exponent.
Theorem 2. Suppose assumptions (LSI) and (A.1,2) hold, and there exist \(c,r \ge 1\) such that \[r < 1 + \frac{\alpha^2}{\gamma\beta^2}, \qquad \max_{i\in [n]}|N_{k+1}(i)| \le cr^k , \quad \forall k \ge 0. \label{asmp:expgrowth}\tag{9}\] Then there exist constants \(C,h^*>0\), depending only on \((\alpha,\beta,\gamma,c,r)\) such that 8 holds. Setting \(\eta := 1-\frac{\gamma\beta^2}{\alpha^2}(r-1)\), and noting that \(0 < \eta < 1\), we may explicitly take \[C = \frac{10\beta^2c^2}{\alpha\eta^3 } e^{\gamma(r-1)/c}, \qquad h^* = \frac{\alpha \eta^{3/2}}{5\beta^2c } .\]
We do not know if the condition \(r < 1 + \frac{\alpha^2}{\gamma\beta^2}\) is sharp. Of course, the assumption 9 is more general than 7 , and if one is interested only in how the bias depends on \(|u|\) and \(h\) then Theorem 1 follows as a special case of Theorem 2. We state them separately because the dependence on the condition number \(\beta/\alpha\) is somewhat unclear in Theorem 2, due to its appearance in both \(C\) and the constraint on \(r\) in 9 . In contrast, Theorem 1 has a simple dependence on \(\beta/\alpha\), which we derive directly rather than as a consequence of Theorem 2. This dependence deteriorates as the growth rate \(p\) of the graph increases, and we might interpret the extreme case of Theorem 2 as a bound of the form \(f(\beta/\alpha)\), where \(f(x)\) jumps to \(+\infty\) as \(x\) increases past the critical value \(1/\sqrt{\gamma(r-1)}\).
We lastly record a similar delocalization phenomenon for the continuous-time Langevin dynamics, which we prove in Section 6 via a much simpler version of our proof of Theorem 1.
Theorem 3. Suppose assumptions (LSI) and (A.1,2) hold. Suppose \((Y_t)_{t \ge 0}\) solves \[dY_t = -\nabla V(Y_t)dt + \sqrt{2}dB_t, \qquad Y_0 \sim \rho_0,\] where \(\rho_0\) is some given initial distribution. Let \(\rho_t\) denote the law of \(Y_t\) for \(t > 0\). Let \(H_t(u) = H(\rho^u_t\,|\,\pi^u)\) for each \(t \ge 0\) and \(u \subset [n]\). Let \((\Lambda(t))_{t \ge 0}\) denote a Poisson process with rate \(\gamma\beta^2 /2\alpha\). Then \[H_t(u) \le e^{-2\alpha(1-\epsilon)t} {\mathbb{E}}H_0\big(N_{\Lambda(t/\epsilon)}(u)\big), \qquad \forall u \subset [n], \;t \ge 0, \;\epsilon \in (0,1). \label{ineq:continuoustime}\tag{10}\]
The Poisson process also appears behind the scenes in the proofs of Theorems 1 and 2. In 10 we may always bound \(H_0(N_k(u)) \le H_0([n])\) for all \(k\), by the data processing inequality, and send \(\epsilon \to 0\) to recover the usual “global” entropy estimates \(H_t([n]) \le e^{-2\alpha t}H_0([n])\). Note that one cannot expect a bound like \(H_t(u) \le e^{-ct}H_0(u)\) for all \(u\); for example, if \(\rho^u_0=\pi^u\) but \(\rho_0 \neq \pi\) then there is no reason to expect that \(\rho^u_t=\pi^u\) for \(t > 0\). But in situations where \(H_0(N_k(u))\) grows slowly enough with respect to \(k\), the bound 10 reveals a form of the delocalization phenomenon. For instance, if \(H_0(u) \le C_0|u|\) for all \(u \subset [n]\) and \(|N_k(u)| \le c(1+k^p)\) for all \(k \ge 0\) as in Theorem 1, then \[{\mathbb{E}}H_0(N_{\Lambda(s)}(u)) \le cC_0(1+{\mathbb{E}}\Lambda^p(s))|u|, \quad s \ge 0,\] and by taking \(\epsilon=1/\max(1,2\alpha t)\) we can extract from 10 a bound like \(H_t(u) = O(e^{-2\alpha t}|u|)\).
We next introduce a new class of potentials for which delocalization of bias can be shown, which we call weak interactions. We give two results. We start with the one that is simpler to state, based on the framework of [1], which was kindly suggested (along with its proof) by Yifan Chen after a discussion of our Theorem 5. In the following, we write \(\|M\|_{\infty\to\infty}\) for the \(\ell^\infty\)-operator norm of an \(n \times n\) matrix \(M\).
Theorem 4. Let \(\beta > \alpha > \alpha_0 > 0\). Suppose \(\alpha I \le \nabla^2 V \le \beta I\). Assume that for each \(x \in {\mathbb{R}}^n\) there is a decomposition \(\nabla^2V(x)=D(x)+M(x)\), where \(D(x)\) is a diagonal matrix, and \(M(x)\) is zero on the diagonal and satisfies \(\|M(x)\|_{\infty \to \infty} \le \alpha_0\). Then, for \(h \le 1/\beta\), \[W^2_{2,\ell^\infty}(\pi_h,\pi) \le h\log(2n)\Big(\frac{4\beta}{\alpha-\alpha_0} \Big)^2 .\] In particular, \[W^2_2(\pi_h^u,\pi^u) \le |u|h\log(2n)\Big(\frac{4\beta}{\alpha-\alpha_0} \Big)^2 , \qquad \forall u \subset [n].\]
The proof of Theorem 4, given in Section 5, is an application of the method of [1], and a particularly simple one because we in fact have a one-step contraction along the LMC iterates. Their more complex multi-step analysis, introduced for the sparse case, is not needed.
A natural class of examples arises from pairwise interaction potentials, of the form \[V(x) = \sum_{i=1}^n V_i(x_i) + \sum_{1 \le i < j \le n} V_{ij}(x_i-x_j), \label{intro:def:pairwise-nonexch}\tag{11}\] for some functions \(V_i\) and \(V_{ij}\) on \({\mathbb{R}}\), with \(V_{ij}\) assumed (for simplicity) to be even and satisfying \(V_{ij}=V_{ji}\). The assumptions of Theorem 4 hold for suitable \(\beta\) if \((V''_i,V''_{ij})\) are bounded from above uniformly over \((i,j)\), if \(V_i'' \ge \alpha\), if \(V_{ij}'' \ge 0\), and if \(\max_i\sum_{j \neq i} |V_{ij}''| \le \alpha_0\). As a special case, this includes symmetric mean field models, in which \(V_i=U\) and \(V_{ij}=W/n\) for some \((i,j)\)-independent functions \(U\) and \(W\). However, exchangeable models like this are not so interesting from a delocalization perspective, as will be explained in Section 1.4.
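For intuition, the decomposition hypothesis of Theorem 4 is easy to check numerically at a fixed point \(x\): split the Hessian into its diagonal part \(D\) and its hollow part \(M\), and compute \(\|M\|_{\infty\to\infty}\), which for a hollow matrix is the maximal absolute off-diagonal row sum. The \(3\times 3\) matrix below is a toy input, not taken from the paper.

```python
import numpy as np

def offdiag_inf_norm(hess):
    """Split hess = D + M (D diagonal, M zero on the diagonal) and return
    ||M||_{inf -> inf} = max_i sum_{j != i} |hess_ij|."""
    M = hess - np.diag(np.diag(hess))
    return np.abs(M).sum(axis=1).max()

# Toy Hessian: any alpha_0 >= 0.3 works at this point.
H = np.array([[2.0, 0.1, 0.0],
              [0.1, 2.0, 0.2],
              [0.0, 0.2, 2.0]])
print(offdiag_inf_norm(H))  # 0.3
```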
We next give a second result on weak interactions, based on our new entropic methods rather than the approach of [1]. The advantages are that it will remove the log factor, relax the strong log-concavity assumption, and hold in relative entropy instead of Wasserstein. The major disadvantage is that it will require a much more specific structural assumption on the potential \(V\), which is more restrictive than the decomposition in Theorem 4. Precisely, the potential takes the form \[V(x) = \sum_{ u \subset [n]} V_u(x_u). \label{def:Vstructure}\tag{12}\] Here the sum is over nonempty subsets of \([n]\), and \(V_u : {\mathbb{R}}^u \to {\mathbb{R}}\) are some given functions. The following parameters \((M_0,M_1,R_1)\) will play important roles: \[M_p := \max_{i \in [n]}\sum_{w \ni i} L_w|w|^p, \qquad R_p := \max_{i \in [n]}\sum_{w \ni i, \, |w| \ge 2}L_w(|w|-1)^p, \quad \text{for } p =0,1. \label{def:MpRp-constants}\tag{13}\] In addition to (LSI), we will make the following assumptions:
Lipschitz gradients: For each set \(u \subset [n]\) there is a constant \(L_u \ge 0\) such that \(\nabla V_u\) is \(L_u\)-Lipschitz.
Conditional Talagrand inequality: There exists \(\gamma > 0\) such that \[W_1^2(\mu,\pi^{w \setminus u|u}_{x_u}) \le \frac{2\gamma}{\alpha}H(\mu\,|\,\pi^{w\setminus u|u}_{x_u}), \quad \forall u,w \subset [n], \;L_w \neq 0, \;x_u \in {\mathbb{R}}^u, \;\mu \in {\mathcal{P}}({\mathbb{R}}^{w\setminus u}),\] where \(\pi^{w\setminus u|u}_{x_u}\) denotes the conditional law of \(X_{w \setminus u}\) given \(X_u=x_u\) under \(X \sim \pi\).
Weak interactions: \(\gamma M_0R_1 < \alpha^2\).
We think of \(R_p\) as a measure of the maximal interaction effect felt by any coordinate, because the sum excludes singletons \(w\). The assumption (B.3) thus requires that the interactions are sufficiently weak, a condition which is common in the literature on uniform-in-time mean field limits [5].
Theorem 5. Under assumptions (LSI) and (B.1–3), there exist constants \(C,h^*>0\), depending only on \((\alpha,M_0,M_1,R_1,\gamma)\), such that 8 holds. Setting \(\eta := 1 - \frac{\gamma M_0 R_1}{\alpha^2}\), and noting that \(0 < \eta < 1\), we may explicitly take \[C = \frac{10M_0 M_1 }{\alpha\eta^3 }e^{\gamma }, \qquad h^* = \frac{\alpha \eta^{3/2}}{5 M_0M_1 } .\]
We will see in 42 that the Lipschitz constant of \(\nabla V\) is bounded by \(M_0\), which is in turn bounded by \(M_1\). We thus think of \(M_0M_1/\alpha^2\) as a weaker form of the squared condition number. The constant \(R_1\) plays a different role, measuring only the strength of interactions between coordinates, and it notably shows up only in the smallness condition (B.3), which is equivalent to \(\eta > 0\).
Theorem 5 is generalized by Theorem 12, which includes dynamic estimates. The structural assumption 12 is meant as a generalization of the pairwise interaction setting of 11 , allowing interactions between arbitrarily many coordinates (three-way, four-way, etc.). However, assumption (B.3) constrains how strong the higher-order interactions are allowed to be. Essentially, in order to keep \(R_1\) small, the weights \(L_w\) must be smaller for larger sets \(w\). The pairwise case fits into the setting of Theorem 5 by taking \(V_u=0\) for all \(|u| \ge 3\), and we then have \[\begin{align} M_0 &= \max_{i \in [n]} \bigg( \|V''_i\|_\infty + \sum_{j \neq i} \|V_{ij}''\|_\infty \bigg), \qquad R_1 = \max_{i \in [n]} \sum_{j \neq i} \|V_{ij}''\|_\infty. \end{align}\] In the log-concave setting \(\nabla^2 V \ge \alpha I\), we can again take \(\gamma=1\) in (B.2), and then assumption (B.3) is stronger than the decomposition assumption of Theorem 4.
The weak interaction setting of Theorems 4 and 5 is quite different from the sparse setting of Theorems 1 and 2. For instance, in the case of pairwise interactions 11 , the weak interaction condition imposes constraints on the row sums \(\sum_j \|V''_{ij}\|_\infty\). Sparseness, on the other hand, constrains the zero pattern of the matrix \((V_{ij}'')_{i,j \in [n]}\). These regimes are not directly comparable. An analog of Theorem 3 is possible for the continuous-time Langevin dynamics under the weak interaction assumptions, where a more complicated \(2^{[n]}\)-valued “growth process” takes the place of the Poisson-driven process \((N_{\Lambda(t)}(u))_{t \ge 0}\). We omit the details, as it will become apparent from the proofs of Theorems 3 and 5 how to concoct such a result.
It is worth noting that none of the above results on sparse and weak interactions covers the general Gaussian case, which was studied directly in [1]. An interesting question posed in [1] is to identify a broader class of models for which delocalization holds which includes both the Gaussian and sparse cases. The case of weak interactions is a third class to add to the mix, and it raises the natural and more challenging question of finding an even more general delocalization result which includes all three classes.
We are not aware of any other work on the delocalization of bias phenomenon since its recent introduction in [1]. Perhaps the most closely related is the recent paper [7] which developed an interesting approach based on Stein’s method for comparing low-dimensional marginals, motivated by sampling problems for spatial models.
From the sampling literature, the main point of departure for our methodology is the seminal work of Vempala-Wibisono [2], and we refer also to [8] for a nice textbook treatment. Their approach was based on time-differentiating the relative entropy along a Fokker-Planck equation satisfied by the time-marginal laws of a continuous-time interpolation of the LMC iterates. Their approach also works with Rényi entropies, a generalization we do not pursue here. We deepen their method by extending it to each marginal \(u \subset [n]\). This leads to a recursive linear system of differential inequalities for the set-indexed vector of marginal entropies. This system of inequalities is hierarchical, in the sense that the \(u\)-coordinate depends on the \(w\)-coordinate only for \(w \supset u\).
This hierarchical entropy method draws its inspiration from the first author’s work [4] on mean field particle systems with pairwise interactions, where a similar approach related to the so-called BBGKY hierarchy was developed in order to derive the sharp rate of propagation of chaos. These mean field models are exchangeable, so the \(u\)-marginal for a coordinate set \(u\) depends on \(u\) only through its cardinality. As a result, instead of a vector of entropies indexed by \(2^{[n]}\), the paper [4] faced a vector indexed merely by \([n]\), which was substantially simpler to analyze. More recently, our paper [3] joint with L. C. Yeung adapted the methods of [4] far beyond the mean field setting, deriving non-asymptotic propagation of chaos bounds which essentially measure the distance between low-dimensional marginals of an SDE and suitably chosen product measures. To understand the intricate set-indexed hierarchy of differential inequalities, a key idea was introduced in [3] which we borrow in this paper: The linear operator applied to the vector of entropies happens to have the structure of a rate matrix (or infinitesimal generator) of a certain Markov process on state space \(2^{[n]}\). This opens the door to a relatively clean analysis in stochastic and/or operator-theoretic terms, as opposed to the more brute-force iteration performed in [4] which appeared to be impossible to adapt to the set-indexed setting.
It is worth noting that a weaker notion of delocalization holds automatically, in which the max is replaced by an average, thanks to a well known subadditivity inequality, whose simple proof is explained after inequality 16 below: \[{n \choose k}^{-1}\sum_{u \subset [n], \, |u|=k} W_2^2(\pi^u_h,\pi^u) \le \frac{k}{n} W_2^2(\pi_h,\pi). \label{intro:ineq:subadditive}\tag{14}\] (The same holds for relative entropy in place of \(W_2^2\).) That is, the typical \(k\)-marginal distance is no more than its uniform share of the full distance. Hence, using the well known bound \(W_2^2(\pi_h,\pi) = O(nh)\), one immediately deduces that the left-hand side of 14 is \(O(kh)\). This has implications for certain “averaged” observables. For instance, if \(f(x) = (1/n)\sum_i g_i(x_i)\) for 1-Lipschitz functions \(g_i\), then Kantorovich duality, the Cauchy-Schwarz inequality, and 14 yield \[\int_{{\mathbb{R}}^n} f\,d(\pi_h-\pi) = \frac{1}{n}\sum_{i=1}^n \int_{\mathbb{R}}g_i\,d(\pi^{\{i\}}_h -\pi^{\{i\}}) \le \frac{1}{n}\sum_{i=1}^n W_2(\pi^{\{i\}}_h,\pi^{\{i\}}) = O(\sqrt{h}),\] with no dimension dependence. This observation, which extends readily to U-statistics, appeared also in [9] in the context of a more general study about how to improve dimension-dependence of \(W_2^2(\pi_h,\pi)\) under finer information about \(\nabla^2V\). However, this subadditivity method does not say anything about asymmetric observables, e.g., those depending on just a few distinguished coordinates, which require a finer estimate like 2 .
One exceptional case worth pointing out is when the measure \(\pi\) is exchangeable, in which case the marginal \(\pi^u\) depends on the set \(u\) only through its size. The same is then true of \(\pi_h^u\), and there is no difference between taking a maximum or an average on the left-hand side of 14 . In other words, delocalization holds for exchangeable measures as a trivial consequence of subadditivity.
This short section discusses some general observations about the concept of delocalization but has nothing to do with LMC, and the terminology introduced here will not be used elsewhere in the paper.
The term delocalization originates in random matrix theory, though with a somewhat different meaning. A vector \(x\in{\mathbb{R}}^n\) is said to be delocalized if its mass is spread out across its coordinates in some sense. It has been established for many classical random matrix models that eigenvectors are delocalized in this sense, with high probability. In other words, the joint distribution of the entries of an eigenvector puts most of its mass on delocalized vectors.
Let us explain one precise way that these ideas are formulated in random matrix theory, following [10], [11]. In the following, let \(x_u=(x_i)_{i \in u}\) for any set \(u \subset [n]\). For \(k \in [n]\) and \(r \ge 0\), we say a vector \(x \in {\mathbb{R}}^n\) is \((k,r)\)-delocalized if \[\max_{u \subset [n], \, |u|=k}|x_u|^2 \le r|x|^2. \label{def:k-r-delocalized}\tag{15}\] Here \(|\cdot|\) denotes the usual Euclidean norm. In words, the maximal squared norm among \(k\)-element subsets is no more than a fraction \(r\) of the squared norm of the full vector. Of course, this statement is only meaningful if \(r < 1\). For example, the vector of all ones is \((k,r)\)-delocalized for \(r=k/n\). On the other extreme, a basis vector in \({\mathbb{R}}^n\) is not really delocalized at all, as it is \((k,r)\)-delocalized only for \(r=1\). By bounding the maximum from below by the average, and noting that \[{n \choose k}^{-1}\sum_{u \subset [n], \, |u|=k}|x_u|^2 = {n \choose k}^{-1} \sum_{i=1}^n \sum_{u \subset [n], \, |u|=k}x_i^2 \, 1_{i \in u} = \frac{k}{n}|x|^2, \label{ineq:subadditivity-Euclidean}\tag{16}\] we deduce a sort of speed limit for delocalization: If \(x \neq 0\) is \((k,r)\)-delocalized, we must have \(r \ge k/n\). Note also that 14 can be deduced from 16 , by replacing \(x\) with \(X-Y\), where \(X \sim \pi_h\) and \(Y \sim \pi\) are optimally coupled for \(W_2^2(\pi_h,\pi)\).
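To make 15 concrete, here is a brute-force sketch (feasible only for small \(n\)) computing the smallest \(r\) for which a given vector is \((k,r)\)-delocalized; the all-ones and basis vectors reproduce the two examples just mentioned.

```python
from itertools import combinations
import numpy as np

def min_delocalization_r(x, k):
    """Smallest r such that x is (k, r)-delocalized: max_{|u|=k} |x_u|^2 / |x|^2."""
    x = np.asarray(x, dtype=float)
    top = max(np.sum(x[list(u)] ** 2) for u in combinations(range(len(x)), k))
    return top / np.dot(x, x)

print(min_delocalization_r(np.ones(8), 2))    # 0.25 = k/n, the speed limit
print(min_delocalization_r(np.eye(8)[0], 2))  # 1.0, a basis vector is not delocalized
```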
The notion of delocalization studied in the present work and in [1] is measure-theoretic in nature and quite different from the Euclidean notion 15 . For a reference measure \(Q\) on \({\mathbb{R}}^n\), we might say that a measure \(P\) on \({\mathbb{R}}^n\) is \((Q,k,r)\)-delocalized if \[\max_{u \subset [n], \, |u|=k} W_2^2(P^u,Q^u) \le rW_2^2(P,Q). \label{def:k-r-delocalized-measure}\tag{17}\] Other choices of distance or divergence instead of \(W_2\) would also be reasonable here. This says that \(k\)-dimensional marginals of \(P\) are closer to those of \(Q\) by a factor of \(r\) compared to the full joint distributions. To appreciate how different the Euclidean notion 15 is from its measure-theoretic counterpart 17 , consider the following two simple examples:
Take \(Q=\delta_0\), and let \(P\) be the uniform measure on the set of \(n\) standard basis vectors. Then 17 holds for any \(k\) with \(r=k/n\). But for any \(r < 1\), the support of \(P\) does not contain any vector \(x\) satisfying 15 .
Let \(X_1,\ldots,X_n\) be iid Bernoulli(\(1/2\)). Let \(Q\) be their joint distribution, and let \(P\) be the joint law of \((X_1,\ldots,X_{n-1},Y_n)\) where \(Y_n=\sum_{i \in [n-1]}X_i \mod 2\). Then \(P\) and \(Q\) are distinct but have the same \((n-1)\)-dimensional marginals, so \(P\) is \((Q,k,0)\)-delocalized for any \(k < n\). There is no speed limit here as there is for the Euclidean notion (\(r \ge k/n\)).
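The parity example is simple enough to verify exhaustively for small \(n\); the following sketch checks, for \(n=3\), that \(P\) and \(Q\) share all two-dimensional marginals even though they differ as joint laws.

```python
from itertools import product
from collections import Counter

n = 3
Q = list(product([0, 1], repeat=n))                          # iid fair bits, each atom 1/8
P = [(b[0], b[1], (b[0] + b[1]) % 2) for b in product([0, 1], repeat=n - 1)]  # each atom 1/4

for u in [(0, 1), (0, 2), (1, 2)]:
    margP = Counter(tuple(x[i] for i in u) for x in P)
    margQ = Counter(tuple(x[i] for i in u) for x in Q)
    # Normalize by the number of equally weighted atoms in each law.
    assert {k: v / len(P) for k, v in margP.items()} == {k: v / len(Q) for k, v in margQ.items()}
```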
The rest of the paper is devoted to the proofs of Theorems 1, 2, 4, and 5. Section 2 begins with some common steps shared by Theorems 1, 2, and 5. Then, Sections 3 and 4 respectively carry out the proofs, the former for the sparse cases and the latter for the weak interaction case. Section 5 proves Theorem 4, and lastly Section 6 proves Theorem 3.
We are grateful to Jonathan Weare, Jonathan Niles-Weed, Xiaoou Cheng, and especially Yifan Chen for helpful discussions throughout the course of this work. We also thank Sinho Chewi and Matthew Zhang for comments and suggestions on an early draft of the paper.
In this section we provide the common first steps shared by the proofs of Theorems 1, 2, and 5. Let us first recall and summarize some basic notation and facts that we will use repeatedly. For a vector \(x=(x_i)_{i \in [n]} \in {\mathbb{R}}^n\) and a set \(u \subset [n]\), we write \(x_u=(x_i)_{i \in u}\) or \(x^u=(x_i)_{i \in u}\) for the subvector in \({\mathbb{R}}^u\). For a measure \(\rho\) on \({\mathbb{R}}^n\), we similarly write \(\rho^u\) for the corresponding marginal. We will sometimes use the bracket notation for integration, \(\langle\rho,f\rangle = \int f\,d\rho\).
The assumption (LSI) has several important consequences. First, it implies the Poincaré and Talagrand inequalities, \[\langle \pi,f^2\rangle-\langle\pi,f\rangle^2 \le \frac{1}{\alpha}\langle \pi,|\nabla f|^2\rangle, \qquad W_2^2(\mu,\pi) \le \frac{2}{\alpha}H(\mu\,|\,\pi), \;\;\forall \mu \in {\mathcal{P}}({\mathbb{R}}^n), \label{ineq:PoincareTalagrand}\tag{18}\] the former holding for Lipschitz functions \(f\). See [6] for both. Note that (LSI) holds also for every marginal \(\pi^u\), \(u \subset [n]\), and thus so do the inequalities 18 . It is worth noting that \(\alpha \le \beta\) when (LSI) holds and \(\nabla V\) is \(\beta\)-Lipschitz; indeed, the covariance matrix of \(\pi\) is bounded from above by \((1/\alpha)I\) in semidefinite order by the Poincaré inequality, and it is bounded from below by \((1/\beta)I\) by the Cramér-Rao lower bound.
Our proofs start from a standard continuous-time interpolation. Fix \(h>0\), and let \(t_-\) denote the largest element of \(\{0,h,2h,3h,\ldots\}\) less than or equal to \(t\). Define \[X_t = X_{t_-} - (t-t_-)\nabla V(X_{t_-}) + \sqrt{2}(B_t-B_{t_-}), \qquad t > 0,\] where \(B\) is a Brownian motion, and \(X_0\) is given. In other words, \(X\) solves the (path-dependent) SDE \[dX_t = -\nabla V(X_{t_-})dt + \sqrt{2}\,dB_t. \label{eq:XSDE}\tag{19}\] Then \((X_0,X_h,X_{2h},\ldots)\) has the same law as the LMC iterates. Let \(\rho_t\) denote the law of \(X_t\), for each \(t\ge 0\), and note that for \(t > 0\) it has a strictly positive density when, e.g., \(\nabla V\) is Lipschitz. It was observed in [2] that \(\rho\) solves (weakly) the Fokker-Planck equation \[\partial_t \rho = \nabla \cdot (b\rho) + \Delta \rho, \quad \text{where} \quad b_t(x) := {\mathbb{E}}[\nabla V(X_{t_-})\,|\,X_t=x]. \label{eq:VempalaWibisono1}\tag{20}\] We begin with a refinement of this for the marginals of \(\rho\). Throughout the paper, we write \(\nabla_uV\) and \(\Delta_uV\) for the gradient and Laplacian, respectively, with respect to just the coordinates in \(u\).
Proposition 6. Suppose \(V\) is \(C^1\) and \(\nabla V\) is Lipschitz. Let \(u \subset [n]\) be nonempty. There exists a jointly measurable function \(b^u : [0,\infty) \times {\mathbb{R}}^u \to {\mathbb{R}}^u\) such that \[b^u_t(X^u_t) = {\mathbb{E}}[\nabla_u V(X_{t_-}) \,|\,X^u_t], \;\;a.s., \;a.e. \;t > 0.\] Moreover, \((\rho^u_t)_{t \ge 0}\) is a solution in the sense of distributions of the Fokker-Planck equation \[\partial_t\rho^u = \nabla_u \cdot (b^u\rho^u) + \Delta_u \rho^u. \label{eq:FKPmarginal}\qquad{(1)}\]
Proof. The existence of a jointly measurable version of the conditional expectation is shown in [12]. To derive the Fokker-Planck equation, let \(\varphi\) be a smooth test function of compact support. Apply Itô’s formula to \(\varphi(X_t^u)\), using the SDE representation 19 : \[\begin{align} \frac{d}{dt}\int_{{\mathbb{R}}^u} \varphi\,d\rho^u_t = \frac{d}{dt}{\mathbb{E}}\varphi(X^u_t) &= {\mathbb{E}}\Big[-\nabla_u V(X_{t_-}) \cdot \nabla_u\varphi(X^u_t) + \Delta_u \varphi(X_t^u)\Big] \\ &={\mathbb{E}}\Big[-b^u_t(X^u_t) \cdot \nabla_u\varphi(X^u_t) + \Delta_u \varphi(X^u_t)\Big], \end{align}\] with the last step using the definition of conditional expectation. This rewrites as \[\frac{d}{dt}\int_{{\mathbb{R}}^u} \varphi\,d\rho^u_t = \int_{{\mathbb{R}}^u} \big(-b^u_t \cdot \nabla_u\varphi + \Delta_u \varphi\big)\,d\rho^u_t,\] which is exactly the distributional form of the announced Fokker-Planck equation. ◻
The main theorems (except for Theorem 4) can be understood as stability estimates for the coupled system of partial differential equations satisfied by \((\rho^u)_{u \subset [n]}\). This system is hierarchical in the sense that the drift \(b^u\) appearing in the equation for \(\rho^u\) depends on \((\rho^w)_{w \supset u}\). Recalling the definition of relative entropy in 6 , we abbreviate \[H_t(u) := H(\rho_t^u\,|\,\pi^u), \qquad u \subset [n].\] The Vempala-Wibisono strategy [2] was to use 20 in order to time-differentiate the full entropy \(H_t([n])\), ultimately obtaining a differential inequality for it. We will instead differentiate all of the marginal entropies \(H_t(u)\) and ultimately obtain a system of differential inequalities for them.
Proposition 7. Suppose \(V\) is \(C^1\) and \(\nabla V\) is Lipschitz, and suppose (LSI) holds. For each nonempty \(u \subset [n]\), \(\epsilon \in (0,1)\), and \(t > 0\), we have \[\frac{d}{dt}H_t(u) \le A^1_t(u) + A^2_t(u) - \alpha H_t(u),\] where \[\begin{align} A^1_t(u) &:= \frac{1}{2(1-\epsilon)}{\mathbb{E}}\big|\nabla_u V(X_t)-\nabla_u V(X_{t_-})\big|^2 , \\ A^2_t(u) &:= \frac{1}{2\epsilon}\int_{{\mathbb{R}}^u} \Big|{\mathbb{E}}[\nabla_u V(X_t)\,|\,X^u_t=x_u] - {\mathbb{E}}_{Y \sim \pi}[\nabla_uV(Y)\,|\,Y^u=x_u]\Big|^2 \,\rho^u_t(dx_u). \end{align}\] Here we adopt the convention that \(A^2_t([n])=0\).
Proof. We begin with a standard calculation, which we illustrate formally, ignoring questions of smoothness of \(\rho^u_t\) or the singularity of the logarithm. Writing the Fokker-Planck equation ?? as \(\partial_t\rho^u=\nabla_u \cdot (\rho^u(b^u+\nabla_u\log\rho^u))\), we compute \[\begin{align} \frac{d}{dt}H_t(u) = \frac{d}{dt}\int_{{\mathbb{R}}^u} \log\frac{\rho^u_t}{\pi^u}\,\rho^u_t &= \int_{{\mathbb{R}}^u} \partial_t\rho^u_t + \int_{{\mathbb{R}}^u} \log\frac{\rho^u_t}{\pi^u} \,\partial_t\rho^u_t \\ &= 0-\int_{{\mathbb{R}}^u} \nabla_u \log\frac{\rho^u_t}{\pi^u} \cdot (b^u_t + \nabla_u\log\rho^u_t)\, \rho^u_t \\ &= \int_{{\mathbb{R}}^u} \bigg( (-\nabla_u\log\pi^u - b^u_t ) \cdot\nabla_u\log\frac{\rho^u_t}{\pi^u} - \Big|\nabla_u\log\frac{\rho^u_t}{\pi^u}\Big|^2\bigg)\,\rho^u_t. \end{align}\] Recalling the definition of Fisher information from 6 , we complete the square to get \[\frac{d}{dt}H_t(u) \le \frac{1}{2}{\mathbb{E}}\big|\nabla_u\log\pi^u(X^u_t) + b^u_t(X^u_t)\big|^2 - \frac{1}{2}I(\rho^u_t\,|\,\pi^u). \label{ineq:completesquare}\tag{21}\] To make this rigorous, Lemma 3.1 of [5] covers our setting. Now, recalling the formula for \(b^u_t(X^u_t)\) as a conditional expectation, we add and subtract \({\mathbb{E}}[\nabla_uV(X_t)\,|\,X^u_t]\) inside the square and use the real variable inequality \((x+y)^2 \le \epsilon^{-1}x^2 + (1-\epsilon)^{-1}y^2\) to get \[\begin{align} \frac{1}{2}{\mathbb{E}}\big[|\nabla_u\log\pi^u(X^u_t) + b^u_t(X^u_t)|^2\big] &\le \frac{1}{2(1-\epsilon)}{\mathbb{E}}\big|{\mathbb{E}}[\nabla_uV(X_t)-\nabla_uV(X_{t_-})\,|\,X^u_t]\big|^2 \\ &\qquad + \frac{1}{2\epsilon}{\mathbb{E}}\big|\nabla_u\log\pi^u(X^u_t)+{\mathbb{E}}[\nabla_uV(X_t)\,|\,X^u_t]\big|^2. \end{align} \label{pf:entropy1-AB}\tag{22}\] The first term on the right-hand side is bounded by \(A^1_t(u)\) after applying Jensen to remove the conditional expectation. The marginal density \(\pi^u\) is proportional to \(\int e^{-V(x)}\,dx_{[n] \setminus u}\) and in particular \[\nabla_u\log\pi^u(x_u) = - {\mathbb{E}}_{Y \sim \pi}[\nabla_uV(Y)\,|\,Y^u=x_u].\] Hence, the second term on the right-hand side of 22 is exactly \(A^2_t(u)\). Finally, noting that the marginal \(\pi^u\) obeys the log-Sobolev inequality with the same constant as \(\pi\), we have \(-(1/2)I(\rho^u_t\,|\,\pi^u) \le -\alpha H_t(u)\), and this completes the proof. ◻
Remark 8. In the step 21 , one could instead introduce another parameter \(\delta > 0\) and replace the \(-\frac{1}{2}\) in front of the Fisher information by \(-(1-\delta)\); the factor \(\frac{1}{2}\) on the first term in 21 accordingly changes to \(\frac{1}{4\delta}\). This does not seem to lead to significant benefits, just improving somewhat the exponential decay rates in Theorems 9, 10, and 12, so we have opted to simplify the presentation by taking \(\delta=1/2\) instead of optimizing it. We have included the parameter \(\epsilon\) in order to allow us to optimize the smallness conditions in the assumptions 9 and (B.3) as much as possible.
The goal of this section is to prove the following dynamic generalization of Theorems 1 and 2, which naturally require an assumption on the initial distribution \(X_0 \sim \rho_0\) of the LMC iterates: \[\text{(Init)} \quad\qquad \text{There exists } C_0 \ge 0 \text{ such that } H_0(u):=H(\rho_0^u\,|\,\pi^u) \le C_0|u|, \;\forall u \subset [n]. \qquad\qquad\qquad\qquad\]
Theorem 9. Suppose (Init) and the assumptions of Theorem 1 hold. Define \(C\) and \(h^*\) as in Theorem 1. Then, for \(h \le h^*\), \(k \in {\mathbb{N}}\), and \(u \subset [n]\), \[H_{kh}(u) \le 2cC_0 e^{-\alpha kh/2} \Big(\frac{2\gamma\beta^2}{\alpha} kh + 1\Big)^p |u| + C h |u| .\]
Theorem 10. Suppose (Init) and the assumptions of Theorem 2 hold. Define \((\eta,C,h^*)\) as in Theorem 2. Then, for \(h \le h^*\), \(k \in {\mathbb{N}}\), and \(u \subset [n]\), \[H_{kh}(u) \le c C_0 e^{-\frac{\alpha\eta^2}{2(2-\eta)} k h } |u| + C h |u|.\]
Theorem 1 (resp. 2) follows immediately from Theorem 9 (resp. 10), by choosing any initial condition satisfying (Init) with \(C_0\) finite, such as \(\rho_0=\pi\), and then sending \(k\to\infty\). Indeed, lower semicontinuity of relative entropy and the Talagrand inequality 18 yield \[\frac{\alpha}{2}W_2^2(\pi_h^u, \pi^u) \le H(\pi_h^u\,|\,\pi^u) \le \liminf_{k\to\infty} H_{kh}(u) \le C h |u|.\] The proofs of Theorems 9 and 10 follow most of the same path, so we will proceed at first under just the assumptions (LSI) and (A.1,2).
We first estimate \(A^1_t(u)\) from Proposition 7. The first key observation is that \(\nabla_uV(x)\) is a \(\beta\)-Lipschitz function of just the coordinates in \(N(u)\), so that \[\begin{align} A^1_t(u) = \frac{1}{2(1-\epsilon)}{\mathbb{E}}\big|\nabla_u V(X_t)-\nabla_u V(X_{t_-})\big|^2 \le \frac{1}{2(1-\epsilon)}\beta^2{\mathbb{E}}\big|X^{N(u)}_t-X^{N(u)}_{t_-}\big|^2. \end{align}\] Abbreviate \(w:=N(u)\). Using the definition of \(X\), \[\begin{align} {\mathbb{E}}\big| X^w_t - X^w_{t_-} \big|^2 &= {\mathbb{E}}\big| -(t-t_-)\nabla_w V(X_{t_-}) + \sqrt{2}(B^w_t-B^w_{t_-})\big|^2 \\ &= (t-t_-)^2{\mathbb{E}}|\nabla_w V(X_{t_-})|^2 + 2(t-t_-)|w|. \end{align}\] We next claim that \[{\mathbb{E}}|\nabla_w V(X_{t_-})|^2 \le 2\langle\pi, |\nabla_w V|^2\rangle + 2\beta^2W_2^2(\rho^{N(w)}_{t_-},\pi^{N(w)}). \label{pf:continuity-claim1}\tag{23}\] To see this, use again the Lipschitz assumption to get, for all \(x,y \in {\mathbb{R}}^n\), \[\begin{align} |\nabla_wV(x)|^2 &\le 2|\nabla_wV(x)-\nabla_wV(y)|^2 + 2|\nabla_wV(y)|^2 \\ &\le 2\beta^2|x_{N(w)}-y_{N(w)}|^2 + 2|\nabla_wV(y)|^2. \end{align}\] Kantorovich duality [13] states \(\int f\,d\rho^{N(w)}_t - \int g\,d\pi^{N(w)} \le W_2^2(\rho^{N(w)}_t,\pi^{N(w)})\) for any continuous functions \(f\) and \(g\) on \({\mathbb{R}}^{N(w)}\) satisfying \(f(x_{N(w)})-g(y_{N(w)})\le|x_{N(w)}-y_{N(w)}|^2\) pointwise. From this we deduce 23 .
Next, using \(\nabla V = -\nabla\log \pi\), integration by parts yields \[\langle\pi, |\nabla_w V|^2\rangle = -\int_{{\mathbb{R}}^n} \nabla_w V \cdot \nabla_w \pi = \int_{{\mathbb{R}}^n} \Delta_w V\,\pi \le \beta |w|,\] where we used \(\partial_{ii}V \le \|\nabla^2V\|_{\mathrm{op}} \le \beta\) for each \(i \in w\). Putting it together, \[A^1_t(u) \le \frac{\beta^2}{2(1-\epsilon)}\Big( 2\beta (t-t_-)^2|w| + 2\beta^2 (t-t_-)^2W_2^2(\rho^{N(w)}_{t_-},\pi^{N(w)})+ 2(t-t_-)|w|\Big).\] Use the Talagrand inequality 18 , and plug in \(w=N(u)\) and \(N(w)=N_2(u)\) to get \[A^1_t(u) \le \frac{\beta^2h}{1-\epsilon} (\beta h + 1)|N(u)| + \frac{2\beta^4h^2}{\alpha(1-\epsilon)} H_{t_-}(N_2(u)). \label{pf:continuityterm1}\tag{24}\]
We next estimate \(A^2_t(u)\) from Proposition 7. We begin with the following simple lemma, which explains how the metric \(W_1\) controls vector-valued and not just scalar-valued Lipschitz test functions. Here and in the sequel we find it convenient to use the bracket notation \(\langle \mu,f\rangle = \int f\,d\mu\) for integration.
Lemma 1. Let \(m,k \in {\mathbb{N}}\). Suppose a probability measure \(\mu\) on \({\mathbb{R}}^m\) satisfies \[W_1^2(\mu,\nu) \le \gamma H(\nu\,|\,\mu), \qquad \forall \nu \in {\mathcal{P}}({\mathbb{R}}^m).\] Then, for any \(L\)-Lipschitz function \(f : {\mathbb{R}}^m \to {\mathbb{R}}^k\), we have \[\big| \langle \mu-\nu,f\rangle\big|^2 \le \gamma L^2 H(\nu\,|\,\mu), \qquad \forall \nu \in {\mathcal{P}}({\mathbb{R}}^m).\]
Proof. The case \(k=1\) is an immediate consequence of Kantorovich duality [13]. In general, let \(u \in {\mathbb{R}}^k\) be a unit vector and apply the \(k=1\) case to the scalar function \(u\cdot f\) to get \[\big( \langle \mu-\nu,f\rangle \cdot u\big)^2 \le \gamma L^2 H(\nu\,|\,\mu), \qquad \forall \nu \in {\mathcal{P}}({\mathbb{R}}^m).\] Take the supremum over unit vectors on the left-hand side to complete the proof. ◻
Let us write \(\rho^{w|u}_{t,x_u}\) for the conditional law of \(X^w_t\) given \(X^u_t=x_u\), for each \(t > 0\) and \(u,w \subset [n]\). Define \(\pi^{w|u}_{x_u}\) similarly. Recall by the sparsity assumption that \(\nabla_uV(x)\) depends only on the coordinates \(x_{N(u)}\). For \(x_u \in {\mathbb{R}}^u\) let us abuse notation slightly by writing \(\nabla_uV(x_u,\cdot)\) for the function of \(x_{N(u)\setminus u}\) obtained by fixing the coordinates indexed by \(u\). Then \(A^2_t(u)\) can be written as \[A^2_t(u) = \frac{1}{2\epsilon}\int_{{\mathbb{R}}^u} \big|\langle \rho^{N(u) \setminus u|u}_{t,x_u}-\pi^{N(u) \setminus u|u}_{x_u}, \,\nabla_u V(x_u,\cdot)\rangle\big|^2 \,\rho^u_t(dx_u).\] The assumed transport inequality (A.2) along with Lemma 1 imply \[A^2_t(u) \le \frac{\gamma\beta^2}{\alpha\epsilon}\int_{{\mathbb{R}}^u}H(\rho^{N(u) \setminus u|u}_{t,x_u}\,|\,\pi^{N(u) \setminus u|u}_{x_u}) \,\rho^u_t(dx_u).\] By the chain rule for relative entropy, the integral is exactly \(H_t(N(u))-H_t(u)\), and so \[A^2_t(u) \le \frac{\gamma\beta^2}{\alpha\epsilon}\big(H_t(N(u))-H_t(u)\big). \label{pf:hierarchyterm1}\tag{25}\] Note that if \(u\) has no neighbors (for instance, if \(u=[n]\)) then \(N(u)=u\) and \(A^2_t(u)=0\).
Remark 11. The above paragraph is the only place that assumption (A.2) is used. It could thus be weakened, by instead directly assuming \[\big|\langle \mu - \pi^{N(u) \setminus u|u}_{x_u}, \,\nabla_u V(x_u,\cdot)\rangle\big|^2 \le \frac{2\gamma\beta^2}{\alpha}H(\mu\,|\,\pi^{N(u) \setminus u|u}_{x_u}), \qquad \forall u \subset [n], \;x_u \in {\mathbb{R}}^u, \;\mu \in {\mathcal{P}}({\mathbb{R}}^{N(u)\setminus u}).\]
Combining Proposition 7 with the estimates 24 and 25 , we get \[\begin{align} \frac{d}{dt}H_t(u) \le & \frac{\beta^2h}{1-\epsilon}(\beta h + 1)|N(u)| + \frac{2\beta^4h^2 }{\alpha(1-\epsilon)} H_{t_-}(N_2(u)) \\ &+ \frac{\gamma\beta^2}{\alpha\epsilon}\big(H_t(N(u))-H_t(u)\big) - \alpha H_t(u). \end{align} \label{ineq:sparsehierarchy}\tag{26}\] This holds for all \(u \subset [n]\). To understand this system of inequalities, we follow the idea of [3], introducing an auxiliary stochastic process for which 26 becomes the associated Feynman-Kac formula (more precisely, an inequality version). This auxiliary process is a continuous-time Markov chain, taking values in the space of subsets of \([n]\), which serves as a convenient tool for analyzing 26 . View \(H_t\) for each \(t\) as a vector indexed by sets, i.e., \(H_t=(H_t(u))_{u \subset [n]} \in {\mathbb{R}}^{2^{[n]}}\). The right-hand side of 26 is an affine function of \(H_t\). In particular, define \[C_t(u) := \frac{\beta^2h}{1-\epsilon}(\beta h + 1)|N(u)| + \frac{2\beta^4 h^2}{\alpha(1-\epsilon)} H_{t_-}(N_2(u)),\] and define a linear operator \(\mathsf{A} : {\mathbb{R}}^{2^{[n]}} \to {\mathbb{R}}^{2^{[n]}}\) by \[\mathsf{A} F(u) := \frac{\gamma\beta^2}{\alpha\epsilon}\big(F(N(u))-F(u)\big), \qquad \forall F \in {\mathbb{R}}^{2^{[n]}}, \;u \subset [n]. \label{def:Aoperator-sparse}\tag{27}\] This operator has the structure of an infinitesimal generator; that is, the corresponding (rate) matrix has no negative entries outside the main diagonal, and each row sum is zero. The associated continuous-time Markov process \(({\mathcal{X}}_t)_{t \ge 0}\) takes values in \(2^{[n]}\) and transitions from \(u\) to \(N(u)\) at rate \(\gamma\beta^2/\alpha\epsilon\), for each \(u\), with no other transitions. The semigroup generated by \(\mathsf{A}\) is then related to \({\mathcal{X}}\) via \(e^{t\mathsf{A}}F(u)={\mathbb{E}}[F({\mathcal{X}}_t)\,|\,{\mathcal{X}}_0=u]\) for any \(F \in {\mathbb{R}}^{2^{[n]}}\) and \(t \ge 0\). For later use, we note that the behavior of the process \({\mathcal{X}}\) is extremely simple: When initialized from \({\mathcal{X}}_0=u\), it simply follows the path \(u \to N_1(u) \to N_2(u) \to \cdots\), with Exp(\(\lambda\)) holding times between jumps, where \(\lambda:=\gamma\beta^2/\alpha\epsilon\). In particular, we can construct \({\mathcal{X}}\) as \({\mathcal{X}}_t=N_{\Lambda(t)}(u)\), where \((\Lambda(t))_{t \ge 0}\) denotes a Poisson process with rate \(\lambda\), started from \(\Lambda(0)=0\).
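For intuition, the representation \(e^{t\mathsf{A}}F(u)={\mathbb{E}}[F(N_{\Lambda(t)}(u))]\) is straightforward to simulate. The sketch below estimates it by Monte Carlo over the Poisson jump count; the path graph, rate, and time horizon are illustrative placeholders.

```python
import numpy as np

def neighborhood(adj, u, k):
    """N_k(u), as in the sparse setup."""
    nbhd = set(u)
    for _ in range(k):
        nbhd |= {j for i in nbhd for j in adj[i]}
    return nbhd

def semigroup_mc(F, adj, u, t, lam, n_samples, rng):
    """Estimate e^{tA} F(u) = E[ F(N_{Lambda(t)}(u)) ] with Lambda(t) ~ Poisson(lam * t)."""
    jumps = rng.poisson(lam * t, size=n_samples)
    return np.mean([F(neighborhood(adj, u, int(k))) for k in jumps])

# With F(u) = |u| this estimates E|N_{Lambda(t)}(u)|, the quantity bounded below.
rng = np.random.default_rng(0)
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}
print(semigroup_mc(len, adj, {10}, t=1.0, lam=0.5, n_samples=10_000, rng=rng))
```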
Now, the inequality 26 can be written in vector form as \[\frac{d}{dt}H_t \le C_t + (\mathsf{A}-\alpha \mathsf{I})H_t,\] where \(\mathsf{I}\) denotes the identity operator, and where we understand the inequality to be coordinatewise. For \(t \ge 0\) the operator \(e^{t\mathsf{A}}\) is monotone with respect to coordinatewise order, in the sense that \(e^{t\mathsf{A}}F \le e^{t\mathsf{A}}G\) if \(F \le G\). From this it follows that, for any \(t > s > 0\), \[\frac{d}{ds}e^{-\alpha (t-s)}e^{(t-s)\mathsf{A}}H_s = e^{-\alpha (t-s)}e^{(t-s)\mathsf{A}}\bigg(\frac{d}{ds}H_s - (\mathsf{A} -\alpha \mathsf{I}) H_s\bigg) \le e^{-\alpha (t-s)}e^{(t-s)\mathsf{A}}C_s. \label{pf:Gronwall}\tag{28}\] Let \(k \in {\mathbb{N}}\) and integrate from \(s=(k-1)h\) to \(t=kh\) to get \[H_{kh} \le e^{-\alpha h}e^{h\mathsf{A}}H_{(k-1)h} + \int_{(k-1)h}^{kh} e^{-\alpha (kh-s)}e^{(kh-s)\mathsf{A}}C_s\,ds.\] For any function \(F \in {\mathbb{R}}^{2^{[n]}}\) satisfying \(F(u) \le F(N(u))\) for all \(u \subset [n]\), such as \(F=C_s\), it follows easily from the probabilistic representation of the semigroup that \(t \mapsto e^{t\mathsf{A}}F(u)\) is nondecreasing. Noting also that \(C_s=C_{(k-1)h}\) for \(s \in [(k-1)h,kh)\), we get \[H_{kh} \le e^{-\alpha h}e^{h\mathsf{A}}H_{(k-1)h} + \frac{1}{\alpha}(1-e^{-\alpha h})e^{h\mathsf{A}}C_{(k-1)h}. \label{pf:Hkh-bound1}\tag{29}\] Next, define a linear operator \(\mathsf{N} : {\mathbb{R}}^{2^{[n]}} \to {\mathbb{R}}^{2^{[n]}}\) and a function \(G \in {\mathbb{R}}^{2^{[n]}}\), \[\mathsf{N}F(u) := F(N(u)),\qquad G(u) := \frac{\beta^2h}{1-\epsilon}(\beta h + 1)|u|. \label{pf:Gdef}\tag{30}\] This way, we may rewrite \[C_{(k-1)h} = \mathsf{N}G + \frac{2\beta^4 h^2}{\alpha(1-\epsilon)} \mathsf{N}^2 H_{(k-1)h}.\] Plugging this into 29 and combining the \(H_{(k-1)h}\) terms, \[H_{kh} \le \Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big)H_{(k-1)h} + \frac{1-e^{-\alpha h}}{\alpha} e^{h\mathsf{A}}\mathsf{N} G.\] The operators \(\mathsf{N}\) and \(\mathsf{A}\) commute, because for all \(F \in {\mathbb{R}}^{2^{[n]}}\) we have \[\mathsf{A}\mathsf{N}F(u) = \frac{\gamma\beta^2}{\alpha\epsilon}\big(F(N(N(u))) - F(N(u))\big) = \mathsf{N}\mathsf{A}F(u).\] Hence, \[H_{kh} \le \Big(e^{-\alpha h}\,\mathsf{I} + \frac{2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \mathsf{N}^2 \Big) e^{h\mathsf{A}}H_{(k-1)h} + \frac{1-e^{-\alpha h}}{\alpha} e^{h\mathsf{A}}\mathsf{N} G.\] Iterate this inequality to find \[\begin{align} H_{kh} \le \;&\Big(e^{-\alpha h}\,\mathsf{I} + \frac{2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^k e^{kh\mathsf{A}} H_0 \\ &+ \frac{1-e^{-\alpha h}}{\alpha}\sum_{j=0}^{k-1} \Big(e^{-\alpha h}\,\mathsf{I} + \frac{2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^{j} e^{(j+1)h\mathsf{A}} \mathsf{N} G. \end{align} \label{pf:operator-iteration}\tag{31}\] Here we used the fact that the operators \(e^{s\mathsf{A}}\) and \(\mathsf{N}\) are monotone with respect to pointwise order. The rest of the analysis involves specific estimates of how the operators \(\mathsf{N}\) and \(e^{t\mathsf{A}}\) act on the “size function” \(S(u):=|u|\), and we will now treat the two Theorems 9 and 10 separately.
We now impose the polynomial growth assumption 7 , and we take \(\epsilon=1/2\) for simplicity. By a union bound, \[|N_{k+1}(u)| \le \sum_{i \in u} |N_{k+1}(i)| \le c|u|(1+k^p), \quad \forall u \subset [n], \;k \ge 0. \label{pf:combinatorialbound-poly}\tag{32}\] In particular, \(|N(u)| \le c|u|\), or \(\mathsf{N}S \le cS\) coordinatewise. Next, we exploit the stochastic representation of the semigroup, writing \({\mathcal{X}}_t=N_{\Lambda(t)}({\mathcal{X}}_0)\) as discussed above. This yields \[e^{t\mathsf{A}}S(u) = {\mathbb{E}}|N_{\Lambda(t)}(u)| \le c|u|(1+{\mathbb{E}}\Lambda^p(t)) \le c f(t)S(u),\] where \(f(t) := 1 + (\lambda t + p)^p\), and the last step uses a Poisson moment bound due to [14]. This implies \(\mathsf{N}^2 e^{(j+1)h\mathsf{A}}S \le c^2 f((j+1)h)S\), and so \[\Big(e^{-\alpha h}\,\mathsf{I} + \frac{4\beta^4h^2}{\alpha^2 }(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^{j} e^{(j+1)h\mathsf{A}}S \le c b^j f((j+1)h) S, \label{pf:bigopbound-cor}\tag{33}\] where the constant \(b\) is defined by \[b := e^{-\alpha h} + \frac{4c^2 \beta^4h^2}{\alpha^2 }(1-e^{-\alpha h}) .\] Since \(\alpha \le \beta\) and \(c \ge 1\), we have \(\alpha h \le \alpha^2/4c\beta^2 \le 1/4\) and thus \(1-e^{-\alpha h}\ge e^{-1/4}\alpha h\) and \[b \le e^{-\alpha h} + \frac{1}{4}(1-e^{-\alpha h}) = 1 - \frac{3}{4}(1-e^{-\alpha h}) \le 1 - \frac{3}{4} e^{-1/4}\alpha h \le 1-\frac{1}{2} \alpha h \le e^{-\alpha h/2}.\] Plugging this into 33 , and using again \(\mathsf{N}S \le cS\), we find \[\begin{align} \Big(e^{-\alpha h}\,\mathsf{I} + \frac{4\beta^4h^2}{\alpha^2 }(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^{k} e^{kh\mathsf{A}}S &\le c e^{-\alpha kh/2} f(kh)S, \\ \Big(e^{-\alpha h}\,\mathsf{I} + \frac{4\beta^4h^2}{\alpha^2 }(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^{j} e^{(j+1)h\mathsf{A}} \mathsf{N}S &\le c^2 e^{-\alpha j h/2} f((j+1)h)S. \end{align}\] From the definition of \(G\) in 30 with \(\epsilon=1/2\), and from the inequality \(\beta h \le 1/4\), we get \(G \le \frac{5}{2}\beta^2hS\). Thus, using 31 and the assumption (Init) in the form \(H_0 \le C_0S\), we have \[H_{kh}(u) \le C_0c e^{-\alpha kh/2} f(kh) |u| + \frac{5}{2} c^2\beta^2h \frac{1-e^{-\alpha h}}{\alpha} |u| \sum_{j=0}^\infty e^{-\alpha j h/2} f((j+1)h). \label{pf:poly2}\tag{34}\] We simplify the sum by plugging in \(f(t) \le 2(\lambda t + p)^p\) (which holds because \(p \ge 1\)). We then use the simple calculation \[\sup_{x \ge 0} (\lambda x + p) e^{-\alpha x/4p } \le \frac{4p\lambda}{\alpha} \vee p \le \frac{4p\lambda}{\alpha} + p\] to get \[\begin{align} \sum_{j=0}^\infty e^{-\alpha j h/2} f((j+1)h) &\le 2e^{\alpha h/2}\sum_{j=1}^\infty e^{-\alpha jh/2}(\lambda jh + p)^p \\ &\le 2e^{\alpha h/2}p^p\Big(\frac{4\lambda}{\alpha} + 1\Big)^p \sum_{j=1}^\infty e^{-\alpha jh/4} \\ &= \frac{2e^{\alpha h/2}p^p }{e^{\alpha h/4}-1} \Big(\frac{4\lambda}{\alpha} + 1\Big)^p. \end{align}\] Plug this into 34 , and use \(e^{\alpha h/2} \le e^{1/8} \le 2\) and \((1-e^{-\alpha h})/(e^{\alpha h/4}-1) \le 4\): \[H_{kh}(u) \le C_0c e^{-\alpha kh/2} f(kh) |u| + 40c^2 p^p\frac{\beta^2}{\alpha}\Big(\frac{4\lambda}{\alpha} + 1\Big)^ph|u| .\] Finally, plug in \(\lambda = 2\gamma\beta^2/\alpha\) to complete the proof of Theorem 9.
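As a quick numerical sanity check (not part of the proof), one can compare empirical Poisson moments against the bound \({\mathbb{E}}\Lambda^p(t) \le (\lambda t + p)^p\) from [14] that underlies the definition of \(f\):

```python
import numpy as np

rng = np.random.default_rng(1)
for mean in [0.5, 2.0, 10.0]:  # mean = lam * t
    for p in [1, 2, 3]:
        emp = np.mean(rng.poisson(mean, size=200_000) ** p)
        print(f"mean={mean}, p={p}: E[L^p] ~ {emp:.2f} <= {(mean + p) ** p:.2f}")
```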
We next prove Theorem 10, starting from 31 . We will first show that the size function \(S(u)=|u|\) satisfies \[e^{t\mathsf{A}} S \le ce^{\lambda t(r-1)} S , \quad \forall t > 0 . \label{pf:Gselfbound}\tag{35}\] We first use the growth assumption along with a union bound to get \(|N_{k+1}(u)| \le c|u|r^k\) for all \(k \ge 0\) and \(u \subset [n]\). In particular, \(|N(u)| \le c|u|\), or \(\mathsf{N}S \le cS\). Using the stochastic representation of the semigroup, we deduce \[e^{t\mathrm{A}}S(u) = {\mathbb{E}}|N_{\Lambda(t)}(u)| \le c|u|{\mathbb{E}}\, r^{\Lambda(t)} = c|u|e^{\lambda t(r-1)}= c e^{\lambda t(r-1)}S(u).\] This proves 35 , which we then combine with \(\mathsf{N}S \le cS\) to get \(\mathsf{N} e^{(j+1)h\mathsf{A}}S \le c^2 e^{\lambda (j+1)h(r-1)}S\), as well as \[\Big(e^{-\alpha h}\,\mathsf{I} + \frac{2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^{j} e^{(j+1)h\mathsf{A}}S \le c b^j e^{h\lambda (j+1)(r-1) } S, \label{pf:bigopbound}\tag{36}\] where we define the constant \[b := e^{-\alpha h} + \frac{2c^2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) .\] We estimate \(b\) in simpler terms as follows, with the key point being that \(h\) can be taken small enough to make \(b < 1\). We start from the elementary inequality \[e^{-x}+\bigg(1-\frac{2\epsilon(1-\epsilon)}{ 1-e^{-2(1-\epsilon)}}\bigg)(1-e^{-x}) \le 1 - \epsilon x, \quad \text{for } 0 \le x \le 2(1-\epsilon). \label{ineq:1-eps46x}\tag{37}\] Indeed, this is equivalent to the inequality \[\frac{2(1-\epsilon)}{ 1-e^{-2(1-\epsilon)}} \ge \frac{x}{ 1-e^{-x}}, \quad \text{for } 0 \le x \le 2(1-\epsilon),\] whose right-hand side is an increasing function of \(x\). Now, suppose \[\widehat{h} := \min\bigg(\frac{2(1-\epsilon)}{\alpha}, \;\frac{\alpha}{c\beta^2\sqrt{2}}\bigg((1-\epsilon)\Big(1 - \frac{2 \epsilon(1-\epsilon)}{1-e^{-2(1-\epsilon)}}\Big)\bigg)^{1/2}\bigg). \label{def:h-hat}\tag{38}\] Then, for \(h \le \widehat{h}\), the first term in the min ensures that \(\alpha h \le 2(1-\epsilon)\), and the second term lets us upper bound \(\frac{2c^2 \beta^4h^2}{\alpha^2 (1-\epsilon)} \le 1-\frac{2\epsilon(1-\epsilon)}{1-e^{-2(1-\epsilon)}}\) in the definition of \(b\). Applying 37 then yields \[b \le 1-\epsilon\alpha h \le e^{ - \epsilon \alpha h}.\] Plugging this into 36 , and using again \(\mathsf{N}S \le cS\), we find \[\begin{align} \Big(e^{-\alpha h}\,\mathsf{I} + \frac{2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^{k} e^{kh\mathsf{A}}S &\le c e^{\lambda (r-1)kh - \epsilon \alpha k h } S, \\ \Big(e^{-\alpha h}\,\mathsf{I} + \frac{2\beta^4h^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \mathsf{N}^2 \Big)^{j} e^{(j+1)h\mathsf{A}} \mathsf{N}S &\le c^2 e^{\lambda (r-1)(j+1)h - \epsilon \alpha j h } S. \end{align}\] Finally, recall from 30 that \(G(u)\) is just a constant multiple of \(S(u)=|u|\) by definition, and also \(H_0 \le C_0S\) by assumption. Hence, returning to 31 , we find \[\begin{align} H_{kh}(u) &\le C_0 c e^{\lambda (r-1)kh - \epsilon \alpha k h } |u| + \frac{ c^2 \beta^2h}{1-\epsilon}(\beta h + 1)\frac{1-e^{-\alpha h}}{\alpha}\sum_{j=0}^{k-1}e^{\lambda (r-1)(j+1)h - \epsilon \alpha j h } |u|. \end{align}\] Because of the assumption \(r -1 < \alpha^2/\gamma\beta^2\), we may choose \(\epsilon\) close to 1 so that \[\begin{align} \tau := \epsilon \alpha - \lambda(r-1) &= \epsilon \alpha - \frac{\gamma\beta^2}{\alpha\epsilon}(r-1) > 0. 
\end{align}\] Then, \[\begin{align} \frac{1-e^{-\alpha h}}{\alpha}\sum_{j=0}^{k-1}e^{\lambda (r-1)(j+1)h - \epsilon \alpha j h } &\le \frac{1-e^{-\alpha h}}{\alpha(1-e^{-\tau h})}e^{\lambda h(r-1)} \le \Big(\frac{1}{\tau} + h\Big)e^{\lambda h(r-1)}, \end{align}\] where the last step used \(1-e^{-\alpha h} \le \alpha h\) and \(\tau h /(1-e^{-\tau h}) \le 1+\tau h\). Thus, \[\begin{align} H_{kh}(u) &\le C_0 c e^{-\tau k h } |u| + \frac{ c^2\beta^2h}{1-\epsilon}(\beta h + 1) \Big(\frac{1}{\tau} + h\Big)e^{\lambda h(r-1)} |u|. \end{align}\] This is valid for \(h \le \widehat{h}\), where \(\widehat{h}\) was defined in 38 . We have thus proven Theorem 10, except with \(\widehat{h}\) instead of \(h^*\), and with \(C\) replaced by \[\widehat{C}_h =\frac{ c^2\beta^2 }{1-\epsilon}(\beta h + 1) \Big(\frac{1}{\tau} + h\Big)e^{h(r-1)\frac{\gamma\beta^2}{\alpha\epsilon}}.\]
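The inequality 37 is the only analytic input in this step, converting the definition of \(b\) into a clean geometric contraction factor. It is elementary but easy to misstate, so the following Python snippet (ours, purely illustrative) confirms it numerically, including the equality at the right endpoint.

```python
import numpy as np

# Numerical check of inequality (37):
#   exp(-x) + (1 - 2*eps*(1-eps)/(1 - exp(-2*(1-eps)))) * (1 - exp(-x)) <= 1 - eps*x
# for 0 <= x <= 2*(1-eps), with equality at the right endpoint.
for eps in np.linspace(0.01, 0.99, 99):
    x = np.linspace(0.0, 2 * (1 - eps), 5001)
    coef = 1 - 2 * eps * (1 - eps) / -np.expm1(-2 * (1 - eps))  # denominator is 1 - e^{-2(1-eps)}
    lhs = np.exp(-x) + coef * (1 - np.exp(-x))
    assert np.all(lhs <= 1 - eps * x + 1e-12)
print("inequality (37) holds on the test grid")
```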
The rest of the proof is just a somewhat tedious simplification of the constants, to show that \(\widehat{h} \ge h^*\) and that \(\widehat{C}_h \le C\) for \(h \le h^*\), where \(h^*\) and \(C\) were defined in Theorem 2. Take \(\epsilon= \frac{1}{2} + \frac{\gamma\beta^2}{2\alpha^2}(r-1)\), which is the midpoint between \(\frac{\gamma\beta^2}{\alpha^2}(r-1)\) and 1. Then the exponent \(\tau\) becomes \[\begin{align} \tau &= \frac{\alpha}{2} + \frac{\gamma\beta^2}{2\alpha}(r-1) - \frac{\gamma\beta^2}{\frac{\alpha}{2} + \frac{\gamma\beta^2}{2\alpha}(r-1)}(r-1) \\ &= \frac{\alpha}{2} \frac{\Big(1 - \frac{\gamma\beta^2}{\alpha^2}(r-1)\Big)^2 }{1 + \frac{\gamma\beta^2}{\alpha^2}(r-1)}. \end{align}\] Recalling the definition \(\eta = 1 - \frac{\gamma\beta^2}{\alpha^2}(r-1)\) from the theorem statement, we get \[\begin{align} \tau = \frac{\alpha\eta^2}{2(2-\eta)}, \quad 1-\epsilon &= \frac{\eta}{2}, \quad 2\epsilon(1-\epsilon) = \frac{1}{2}\eta(2-\eta). \end{align}\] The constants then become \[\begin{align} \widehat{C}_h &= \frac{2 c^2\beta^2}{\eta} (\beta h + 1) \bigg( \frac{4-2\eta}{\alpha\eta^2 } + h\bigg)e^{h(r-1)\frac{2\gamma\beta^2}{\alpha(2-\eta)}}, \\ \widehat{h} &= \min\bigg\{\frac{\eta}{\alpha}, \frac{\alpha\sqrt{\eta}}{2\beta^2c } \bigg(1-\frac{\frac{1}{2}\eta(2-\eta)}{1-e^{-\eta} }\bigg)^{1/2}\bigg\} . \end{align}\] To show that \(\widehat{h} \ge h^*\), we use the inequality \[\bigg(1-\frac{\frac{1}{2}\eta(2-\eta)}{1-e^{-\eta} }\bigg)^{1/2} \ge \frac{\eta}{\sqrt{6}}. \label{ineq:etabound-h}\tag{39}\] To see this, note that \(f(x):=1-e^{-x}-\frac{1}{2} x(2-x)\) satisfies \(f(0)=f'(0)=f''(0)=0\), with \(f''(x)=1-e^{-x}\) concave. This concavity implies \(1-e^{-t}\ge (t/\eta)(1-e^{-\eta})\) for \(0 \le t \le \eta\), so that \[1-e^{-\eta}-\frac{1}{2} \eta(2-\eta) = \int_0^\eta\int_0^x (1-e^{-t})\,dtdx \ge \frac{\eta^2}{6}(1-e^{-\eta}).\] Thus, using also \(2\sqrt{6} \le 5\), we have shown that \[\widehat{h} \ge \min\bigg\{\frac{\eta}{\alpha}, \frac{\alpha \eta^{3/2}}{5\beta^2c } \bigg\} \ge \frac{\alpha \eta^{3/2}}{5\beta^2c } = h^*,\] where the last step used \(0 < \eta < 1 \le c\) and \(\alpha \le \beta\) (which was noted after 18 ). We finally bound \(\widehat{C}_h \le C\) for \(h \le h^*\). Note that the formula \(h^* = \alpha\eta^{3/2}/5\beta^2c\), together with the facts that \(\eta \le 1 \le c\) and \(\alpha \le \beta\), implies \(h^* \le 1/5\beta\). Hence, \(\beta h +1 \le 6/5\) for \(h \le h^*\). Using also \(h^* \le \eta/\alpha\) and \(4-2\eta+\eta^3\le 4\), we may bound \(\widehat{C}_h\) for \(h \le h^*\) by \[\begin{align} \widehat{C}_h &\le \frac{2c^2\beta^2}{\eta} \cdot \frac{6}{5}\cdot \frac{4-2\eta+\eta^3}{\alpha\eta^2 }e^{h(r-1)\frac{2\gamma\beta^2}{\alpha }} \le \frac{10\beta^2c^2}{\alpha\eta^3 } e^{\gamma(r-1)/c} = C, \end{align}\] where we bounded the exponent using \(h \le \alpha /2\beta^2c\). This completes the proof of Theorem 10.
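As with 37 , the inequality 39 is elementary and can be confirmed numerically. The snippet below (ours, illustrative only) checks it on a grid of \(\eta\) values.

```python
import numpy as np

# Numerical check of inequality (39):
#   sqrt(1 - (eta*(2-eta)/2) / (1 - exp(-eta))) >= eta / sqrt(6),  0 < eta < 1.
eta = np.linspace(1e-3, 1.0, 100001)
ratio = (eta * (2 - eta) / 2) / -np.expm1(-eta)  # expm1 avoids cancellation near 0
assert np.all(np.sqrt(1 - ratio) >= eta / np.sqrt(6) - 1e-10)
print("inequality (39) holds on the test grid")
```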
We next turn to the weak interaction case. The following dynamic theorem implies Theorem 5, by the same logic detailed after the statement of Theorem 10.
Theorem 12. Grant assumptions (LSI), (B.1–3), and (Init). Let \((\eta,C,h^*)\) be as in Theorem 5. Then, for \(h \le h^*\), \(k \in {\mathbb{N}}\), and \(u \subset [n]\), \[H_{kh}(u) \le C_0 e^{-\frac{\alpha\eta^2}{2(2-\eta)} k h } |u| + C h |u|.\]
The proof will follow similar steps but is more involved due to the more complex structure of the potential 12 . We will need some simple combinatorial identities. For \(u \subset [n]\) and \(F \in {\mathbb{R}}^{2^{[n]}}\), \[\sum_{i \in u}\sum_{w \ni i}F(w) = \sum_{w \cap u \neq\emptyset} F(w)|u \cap w|, \label{eq:combinatorial1}\tag{40}\] where the second summation is over those \(w \subset [n]\) which contain \(i\), and the third is over \(w \subset [n]\) which intersect \(u\). Indeed, in the summation on the left-hand side, each set \(w\) is counted \(|u \cap w|\) times. More generally, if \(G(w)=(G_i(w))_{i \in w}\) is in \({\mathbb{R}}^w\) for each \(w \subset [n]\), \[\sum_{i \in u}\sum_{w \ni i}G^2_i(w) = \sum_{w \cap u \neq\emptyset}\sum_{i \in u \cap w} G^2_i(w) \le \sum_{w \cap u \neq\emptyset} |G(w)|^2.\] We will use this in combination with Cauchy-Schwarz, in the form \[\sum_{i \in u} \bigg(\sum_{w \ni i} G_i(w)\bigg)^2 \le \sum_{i \in u} \bigg(\sum_{w \ni i} L_w \bigg)\bigg(\sum_{w \ni i} \frac{G_i^2(w)}{L_w }\bigg) \le M_0 \sum_{w \cap u \neq\emptyset} \frac{|G(w)|^2}{L_w},\label{ineq:CScombinatorial}\tag{41}\] where \(M_0\) was defined in 13 . To avoid division by zero in 41 , we will be careful only to use it in the case that \(G(w)=0\) for every \(w \subset [n]\) such that \(L_w=0\), and in this case we adopt the convention that \(|G(w)|^2/L_w=0\).
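Since 40 and 41 are purely combinatorial, they are easy to test in isolation. The following Python snippet (ours, illustrative only) checks both on random data over a small ground set; it assumes \(M_0=\max_{i\in[n]}\sum_{w\ni i}L_w\), which is our reading of 13 .

```python
import itertools
import numpy as np

# Toy check (illustrative only) of identity (40) and bound (41) on random data.
# We assume M0 = max_i sum_{w : i in w} L_w, our reading of (13).
rng = np.random.default_rng(1)
n = 6
subsets = [frozenset(s) for r in range(1, n + 1)
           for s in itertools.combinations(range(n), r)]
F = {w: rng.uniform(0.0, 1.0) for w in subsets}        # arbitrary F(w)
L = {w: rng.uniform(0.1, 2.0) for w in subsets}        # weights L_w > 0
G = {w: rng.normal(size=len(w)) for w in subsets}      # G(w) in R^w
u = frozenset({0, 2, 3})
M0 = max(sum(L[w] for w in subsets if i in w) for i in range(n))

# identity (40): double-counting over i in u and w containing i
lhs40 = sum(F[w] for i in u for w in subsets if i in w)
rhs40 = sum(F[w] * len(u & w) for w in subsets if u & w)
assert abs(lhs40 - rhs40) < 1e-9

# bound (41), with G_i(w) the coordinate of G(w) at index i
def Gi(w, i):
    return G[w][sorted(w).index(i)]

lhs41 = sum(sum(Gi(w, i) for w in subsets if i in w) ** 2 for i in u)
rhs41 = M0 * sum(np.dot(G[w], G[w]) / L[w] for w in subsets if u & w)
assert lhs41 <= rhs41 + 1e-9
print("identity (40) and bound (41) hold on this example")
```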
Let us note for later use that 41 implies the Lipschitz constant of \(\nabla V\) is bounded by \(M_0\). Indeed, use the specific form of the potential \(V\) to write \[\partial_i V(x) = \sum_{ w \ni i}\partial_i V_w(x_w), \qquad i \in [n].\] Applying 41 with \(u=[n]\) then yields \[\begin{align} |\nabla V(x)-\nabla V(y)|^2 &= \sum_{i\in [n]}\bigg(\sum_{w \ni i}(\partial_iV_w(x_w) - \partial_iV_w(y_w))\bigg)^2 \nonumber \\ &\le M_0 \sum_{ w \subset [n]} \frac{|\nabla_wV_w(x_w) - \nabla_wV_w(y_w)|^2}{L_w} \nonumber \\ &\le M_0 \sum_{ w \subset [n]} L_w|x_w-y_w|^2 \nonumber \\ &\le M_0^2|x-y|^2. \label{ineq:Lipschitz-weak} \end{align}\tag{42}\]
Using 41 and the fact that \(\nabla_wV_w\) is \(L_w\)-Lipschitz by (B.1), we get \[\begin{align} A^1_t(u) &= \frac{1}{2(1-\epsilon)} \sum_{i \in u}{\mathbb{E}}\bigg(\sum_{ w \ni i}\big(\partial_iV_w(X^w_t) - \partial_i V_w(X^w_{t_-})\big)\bigg)^2 \nonumber \\ &\le \frac{M_0}{2(1-\epsilon)}\sum_{w\cap u \neq \emptyset} \frac{1}{L_w} {\mathbb{E}}\big|\nabla_w V_w(X^w_t) - \nabla_wV_w(X^w_{t_-})\big|^2 \nonumber \\ &\le \frac{M_0}{2(1-\epsilon)}\sum_{w\cap u \neq \emptyset} L_w {\mathbb{E}}\big| X^w_t - X^w_{t_-} \big|^2 \nonumber \\ &= \frac{M_0}{2(1-\epsilon)}\sum_{w\cap u \neq \emptyset} L_w \Big( (t-t_-)^2{\mathbb{E}}|\nabla_w V(X_{t_-})|^2 + 2(t-t_-)|w|\Big). \label{pf:weak-continuity1} \end{align}\tag{43}\] We would like to control the expectation in terms of its analog under \(\pi\), but this is trickier to achieve than in Section 3.1. The issue is that \(\nabla_wV\) depends on all coordinates, but we do not want a distance between the full distributions \(\rho_t\) and \(\pi\) to appear, as this would lose the delocalization effect. Instead, we note that \(\nabla V\) has mean zero under \(\pi\) to write \[\begin{align} {\mathbb{E}}|\nabla_w V(X_{t_-})|^2 &= \sum_{i \in w}{\mathbb{E}}\bigg(\sum_{v \ni i}\big(\partial_i V_v(X_{t_-})-\langle\pi^v,\partial_iV_v\rangle)\bigg)^2 \nonumber \\ &\le M_0 \sum_{v \cap w \neq \emptyset} \frac{1}{L_v} {\mathbb{E}}|\nabla_v V_v(X_{t_-}) - \langle\pi^v,\nabla_vV_v\rangle|^2. \label{pf:weak-continuity2} \end{align}\tag{44}\] Now, each term in the sum can be estimated using Kantorovich duality as in 23 : \[{\mathbb{E}}|\nabla_v V_v(X_{t_-})- \langle\pi^v,\nabla_vV_v\rangle|^2 \le 2\int_{{\mathbb{R}}^v}|\nabla_v V_v- \langle\pi^v,\nabla_vV_v\rangle|^2\,d\pi^v + 2L_v^2W_2^2(\rho^v_{t_-},\pi^v).\] Using the Poincaré inequality 18 and the fact that \(\nabla_vV_v\) is \(L_v\)-Lipschitz, \[\int_{{\mathbb{R}}^v}|\nabla_v V_v- \langle\pi^v,\nabla_vV_v\rangle|^2\,d\pi^v = \sum_{i \in v}\big(\langle \pi^v, (\partial_iV_v)^2\rangle - \langle \pi^v, \partial_iV_v\rangle^2\big) \le \frac{L_v^2}{\alpha}|v| .\] The Talagrand inequality 18 yields \(W_2^2(\rho^v_{t_-},\pi^v) \le (2/\alpha)H_{t_-}(v)\), and we return to 44 to get \[{\mathbb{E}}|\nabla_w V(X_{t_-})|^2 \le \frac{2M_0}{\alpha} \sum_{v \cap w \neq \emptyset} L_v \big(2H(\rho^v_{t_-}\,|\,\pi^v) + |v|\big). \label{pf:weak-continuity3}\tag{45}\] To streamline notation, we define an operator \(\mathsf{N} : {\mathbb{R}}^{2^{[n]}}\to {\mathbb{R}}^{2^{[n]}}\) by \[\mathsf{N}F(u) := \sum_{w \cap u \neq \emptyset}L_wF(w). \label{pf:Ndef-weak}\tag{46}\] Using also the “size function” \(S(u)=|u|\), we may thus write 45 more compactly as \[\begin{align} {\mathbb{E}}|\nabla_w V(X_{t_-})|^2 \le \frac{2M_0}{\alpha}\big(2\mathsf{N}H_{t_-}(w) + \mathsf{N}S(w)\big). \end{align}\] Plug this back into 43 to get \[A^1_t(u) \le C_t(u) := \frac{M_0}{1-\epsilon}\Big( \frac{2M_0}{\alpha}h^2\mathsf{N}^2H_{t_-}(u) + \frac{M_0}{\alpha}h^2\mathsf{N}^2S(u) + h\mathsf{N}S(u)\Big). \label{pf:continuityterm1-weak}\tag{47}\]
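The only probabilistic input in 43 is the elementary one-step moment computation \({\mathbb{E}}|X^w_t - X^w_{t_-}|^2 = (t-t_-)^2{\mathbb{E}}|\nabla_w V(X_{t_-})|^2 + 2(t-t_-)|w|\). The following Monte Carlo snippet (ours, toy setting) illustrates it for the quadratic potential \(V(x)=|x|^2/2\), where \(\nabla V(x)=x\).

```python
import numpy as np

# Monte Carlo illustration of the one-step second moment used in (43), with
# dt playing the role of t - t_-, conditionally on X_{t_-} = x, for V(x) = |x|^2/2:
#   E|X_t - x|^2 = dt^2 |grad V(x)|^2 + 2 * dt * n.
rng = np.random.default_rng(2)
n, dt = 5, 0.01
x = rng.normal(size=n)                                    # a fixed starting point
noise = np.sqrt(2 * dt) * rng.normal(size=(200000, n))    # sqrt(2)(B_t - B_{t_-})
incr = -dt * x + noise                                    # X_t - x, since grad V(x) = x
empirical = np.mean(np.sum(incr ** 2, axis=1))
exact = dt ** 2 * np.dot(x, x) + 2 * dt * n
print(f"empirical {empirical:.5f}  vs  exact {exact:.5f}")  # agree up to MC error
```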
Using the specific form of the potential and 41 , \[\begin{align} A^2_t(u) &= \frac{1}{2\epsilon}\sum_{i \in u}\int_{{\mathbb{R}}^u} \bigg(\sum_{w\ni i}{\mathbb{E}}[\partial_i V_w(X^w_t)\,|\,X^u_t=x_u] - {\mathbb{E}}_{Y \sim \pi}[\partial_i V_w(Y^w)\,|\,Y^u=x_u]\bigg)^2 \,\rho^u_t(dx_u) \\ &\le \frac{M_0}{2\epsilon}\sum_{w \cap u \neq \emptyset}\frac{1}{L_w}\int_{{\mathbb{R}}^u}\Big|{\mathbb{E}}[\nabla_w V_w(X^w_t)\,|\,X^u_t=x_u] - {\mathbb{E}}_{Y \sim \pi}[\nabla_w V_w(Y^w)\,|\,Y^u=x_u]\Big|^2 \,\rho^u_t(dx_u). \end{align}\] We argue as in Section 3.2, writing in terms of conditional measures as \[\begin{align} A^2_t(u) &\le \frac{M_0}{2\epsilon}\sum_{w \cap u \neq \emptyset} \frac{1}{L_w}\int_{{\mathbb{R}}^u} \big|\langle \rho^{w \setminus u|u}_{t,x_u}-\pi^{w \setminus u|u}_{x_u}, \,\nabla_w V_w(x_u,\cdot)\rangle\big|^2 \,\rho^u_t(dx_u) \end{align}\] Since \(\nabla_w V_w\) is \(L_w\)-Lipschitz, the assumed transport inequality (B.2) followed by the chain rule for relative entropy yield \[\begin{align} A^2_t(u) &\le \frac{ \gamma M_0}{\alpha\epsilon}\sum_{w \cap u \neq \emptyset} L_w \int_{{\mathbb{R}}^u} H\big(\rho^{w \setminus u|u}_{t,x_u}\,|\,\pi^{w \setminus u|u}_{x_u}\big)\,\rho^u_t(dx_u) \nonumber \\ &= \frac{\gamma M_0}{\alpha\epsilon}\sum_{w \cap u \neq \emptyset} L_w\big(H_t(w \cup u) - H_t(u)\big). \label{pf:hierarchyterm1-weak} \end{align}\tag{48}\]
Remark 13. The above paragraph is the only place that assumption (B.2) is used in the proof of Theorem 5. It could thus be weakened, by instead directly assuming \[\big|\langle \mu - \pi^{w \setminus u|u}_{x_u}, \,\nabla_w V_w(x_u,\cdot)\rangle\big|^2 \le \frac{2\gamma L_w^2}{\alpha}H(\mu\,|\,\pi^{w \setminus u|u}_{x_u}), \qquad \forall w,u \subset [n], \;x_u \in {\mathbb{R}}^u, \;\mu \in {\mathcal{P}}({\mathbb{R}}^{w \setminus u}).\]
To understand the resulting system of inequalities, we proceed as in Section 3.3, but now with a more complicated Markov process. View \(H_t\) again as a vector indexed by sets, i.e., \(H_t \in {\mathbb{R}}^{2^{[n]}}\). Define a linear operator \(\mathsf{A} : {\mathbb{R}}^{2^{[n]}} \to {\mathbb{R}}^{2^{[n]}}\) by \[\mathsf{A} F(u) := \frac{\gamma M_0}{\alpha\epsilon}\sum_{w \cap u \neq \emptyset} L_w\big(F(w \cup u) - F(u)\big).\] Recognizing this operator in 48 , we may thus combine Proposition 7 with 48 and 47 to get the coordinatewise inequality \[\frac{d}{dt}H_t \le C_t + (\mathsf{A}-\alpha I)H_t.\] The operator \(\mathsf{A}\) again has the structure of an infinitesimal generator (rate matrix) for a continuous-time Markov process. We will not use the stochastic representation here, preferring to work directly with operators. But it is important to note that the semigroup \(e^{t\mathsf{A}}\) for \(t > 0\) is monotone with respect to coordinatewise order: \(e^{t\mathsf{A}}F_1 \le e^{t\mathsf{A}}F_2\) whenever \(F_1 \le F_2\). Note also that \(\mathsf{A}C_{(k-1)h} \ge 0\) coordinatewise because \(C_{(k-1)h}\) is nondecreasing with respect to set inclusion, and this implies that \(t\mapsto e^{t\mathsf{A}}C_{(k-1)h}(u)\) is nondecreasing. We may now copy the argument leading to 29 to get \[H_{kh} \le e^{-\alpha h}e^{h\mathsf{A}}H_{(k-1)h} + \frac{1}{\alpha}(1-e^{-\alpha h}) e^{h\mathsf{A}} C_{(k-1)h}. \label{pf:Hkh-bound1-weak}\tag{49}\] Next, define a function \(G \in {\mathbb{R}}^{2^{[n]}}\) by \[G(u) := \frac{h M_0}{1-\epsilon}\Big( \frac{M_0}{\alpha}h\mathsf{N}^2S(u) + \mathsf{N}S(u)\Big). \label{pf:Gdef-weak}\tag{50}\] This way, we may rewrite \[C_{(k-1)h} = G + \frac{ 2h^2 M_0^2}{\alpha(1-\epsilon)} \mathsf{N}^2 H_{(k-1)h}.\] Plugging this into 49 and combining the \(H_{(k-1)h}\) terms, \[\begin{align} H_{kh} &\le \Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big)H_{(k-1)h} + \frac{1-e^{-\alpha h}}{\alpha} e^{h\mathsf{A}} G. \end{align}\] Iterate this inequality to find \[\begin{align} H_{kh} \le \;& \Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big)^k H_0 \\ &\qquad + \frac{1-e^{-\alpha h}}{\alpha} \sum_{j=0}^{k-1} \Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big)^je^{h\mathsf{A}} G . \end{align} \label{pf:operator-iteration-weak}\tag{51}\] Here we must depart from the method of Section 3, because the operators \(\mathsf{N}\) and \(\mathsf{A}\) do not necessarily commute. Instead, the key ingredients will be the following estimates for how these operators act on the size function \(S(u)=|u|\). First, using 40 , \[\begin{align} \mathsf{A}S(u) &= \frac{\gamma M_0}{\alpha\epsilon} \sum_{w \cap u \neq \emptyset} L_w |w \setminus u| \\ &= \frac{\gamma M_0}{\alpha\epsilon} \sum_{i \in u}\sum_{w \ni i} L_w \frac{|w \setminus u|}{|w \cap u|} \\ &\le \frac{\gamma M_0}{\alpha\epsilon} \sum_{i \in u}\sum_{w \ni i } L_w(|w|-1) \\ &\le \frac{\gamma M_0 R_1}{\alpha\epsilon}S(u). \end{align}\] This quickly implies \[\begin{align} e^{t\mathsf{A}}S(u) \le e^{\lambda t} S(u), \qquad t \ge 0, \qquad \lambda := \frac{\gamma M_0 R_1}{\alpha\epsilon}. \end{align}\] A similar calculation shows \[\begin{align} \mathsf{N}S(u) &= \sum_{w \cap u \neq \emptyset}L_w|w| \le M_1S(u).
\end{align}\] Combining these last two bounds, we find \[\Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big) S \le \Big(e^{-\alpha h} + \frac{2h^2 M_0^2M_1^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \Big) e^{\lambda h} S. \label{pf:weak11}\tag{52}\] We simplify this using the real variable inequality 37 . Define \[\widehat{h} := \min\bigg(\frac{2(1-\epsilon)}{\alpha}, \;\frac{\alpha}{\sqrt{2}M_0M_1}\bigg((1-\epsilon)\Big(1 - \frac{2 \epsilon(1-\epsilon)}{1-e^{-2(1-\epsilon)}}\Big)\bigg)^{1/2}\bigg). \label{def:h-hat-weak}\tag{53}\] For \(h \le \widehat{h}\) we have \[e^{-\alpha h} + \frac{2h^2 M_0^2 M_1^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) \le 1-\alpha\epsilon h \le e^{-\alpha\epsilon h}.\] Hence, we may simplify 52 further: \[\Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big) S \le e^{(\lambda-\alpha\epsilon ) h} S.\label{pf:weak-iter11}\tag{54}\] Note that the weak interaction assumption (B.3) allows us to choose \(\epsilon\) close to 1 to ensure that \[\tau := \alpha\epsilon - \lambda = \alpha\epsilon - \frac{\gamma M_0 R_1}{\alpha\epsilon} > 0.\] Noting that \(H_0 \le C_0S\) by the assumption (Init), the bound 54 lets us estimate the first term of 51 by \[\Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big)^k H_0 \le C_0e^{-\tau kh } S. \label{pf:weak-iter-term1}\tag{55}\] To handle the second term of 51 , we combine \(e^{t\mathsf{A}}S \le e^{\lambda t}S\) with 54 to obtain \[\sum_{j=0}^{\infty}\Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big)^je^{h\mathsf{A}} S \le \frac{e^{\lambda h}}{1 - e^{-\tau h}} S.\] Recalling the definition of \(G\) from 50 and using \(\mathsf{N}S \le M_1S\) lets us finally bound the second term of 51 by \[\begin{align} \frac{1-e^{-\alpha h}}{\alpha} &\sum_{j=0}^{k-1} \Big(e^{-\alpha h}e^{h\mathsf{A}} + \frac{2h^2 M_0^2}{\alpha^2 (1-\epsilon)}(1-e^{-\alpha h}) e^{h\mathsf{A}}\mathsf{N}^2 \Big)^je^{h\mathsf{A}} G \\ &\le h \frac{e^{\lambda h}}{\alpha(1-\epsilon)}\frac{1-e^{-\alpha h}}{1 - e^{-\tau h}} \bigg(\frac{hM_0^2M_1^2}{\alpha} + M_0M_1 \bigg) S. \end{align}\] Plugging this and 55 into 51 , and using also \(\alpha^{-1}(1-e^{-\alpha h})/(1-e^{-\tau h}) \le \tau^{-1}+ h\), we get \[H_{kh}(u) \le C_0e^{-\tau kh}|u| + h \frac{e^{\lambda h}}{1-\epsilon} \bigg(\frac{1}{\tau}+h\bigg) \bigg(\frac{ hM_0^2M_1^2}{\alpha } + M_0M_1 \bigg) |u|. \label{pf:operator-iteration-weak2}\tag{56}\] This is valid for \(h \le \widehat{h}\), where \(\widehat{h}\) was defined in 53 . We have thus proven Theorem 12, except with \(\widehat{h}\) instead of \(h^*\), and with \(C\) replaced by \[\widehat{C}_h = \frac{e^{\gamma M_0 R_1 h/\alpha\epsilon}M_0M_1}{1-\epsilon} \bigg(\frac{ M_0 M_1 h }{\alpha } + 1 \bigg)\bigg(\frac{1}{\tau} + h\bigg).\]
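To make the operator estimates concrete, here is a small Python sketch (ours, illustrative only) that builds \(\mathsf{N}\) from 46 and the operator \(\mathsf{A}\) defined above as matrices indexed by the nonempty subsets of \([n]\), and verifies \(\mathsf{N}S \le M_1S\), \(\mathsf{A}S \le \lambda S\), and \(e^{t\mathsf{A}}S \le e^{\lambda t}S\). The formulas \(M_0=\max_i\sum_{w\ni i}L_w\), \(M_1=\max_i\sum_{w\ni i}L_w|w|\), and \(R_1=\max_i\sum_{w\ni i}L_w(|w|-1)\) are our reading of 13 .

```python
import itertools
import numpy as np
from scipy.linalg import expm

# Toy sketch of the operators N (46) and A on all nonempty subsets of [n].
# gamma, alpha, eps and the weights L_w are arbitrary test values.
rng = np.random.default_rng(3)
n, gamma, alpha, eps = 4, 0.5, 1.0, 0.8
subsets = [frozenset(s) for r in range(1, n + 1)
           for s in itertools.combinations(range(n), r)]
idx = {u: k for k, u in enumerate(subsets)}
L = {w: rng.uniform(0.0, 0.3) for w in subsets}
M0 = max(sum(L[w] for w in subsets if i in w) for i in range(n))
M1 = max(sum(L[w] * len(w) for w in subsets if i in w) for i in range(n))
R1 = max(sum(L[w] * (len(w) - 1) for w in subsets if i in w) for i in range(n))

c = gamma * M0 / (alpha * eps)
N = np.zeros((len(subsets),) * 2)   # N F(u) = sum_{w cap u nonempty} L_w F(w)
A = np.zeros_like(N)                # A F(u) = c sum_{w cap u nonempty} L_w (F(w cup u) - F(u))
for u in subsets:
    for w in subsets:
        if u & w:
            N[idx[u], idx[w]] += L[w]
            A[idx[u], idx[w | u]] += c * L[w]
            A[idx[u], idx[u]] -= c * L[w]

S = np.array([len(u) for u in subsets], dtype=float)
lam = c * R1
t = 0.3
assert np.all(N @ S <= M1 * S + 1e-9)
assert np.all(A @ S <= lam * S + 1e-9)
assert np.all(expm(t * A) @ S <= np.exp(lam * t) * S + 1e-6)
print("N S <= M1 S,  A S <= lam S,  e^{tA} S <= e^{lam t} S  all hold")
```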
We finally show that \(\widehat{h} \ge h^*\) and that \(\widehat{C}_h \le C\) for \(h \le h^*\). Take \(\epsilon= \frac{1}{2} + \frac{\gamma M_0 R_1}{2\alpha^2}\), which is the midpoint between \(\frac{\gamma M_0 R_1}{\alpha^2}\) and 1. Then the exponent \(\tau\) becomes \[\begin{align} \tau &= \frac{\alpha}{2} + \frac{\gamma M_0 R_1}{2\alpha} - \frac{\gamma M_0 R_1}{\frac{\alpha}{2} + \frac{\gamma M_0 R_1}{2\alpha} } = \frac{\alpha}{2} \frac{\Big(1 - \frac{\gamma M_0 R_1}{\alpha^2} \Big)^2 }{1 + \frac{\gamma M_0 R_1}{\alpha^2} }. \end{align}\] Recalling the definition \(\eta = 1 - \frac{\gamma M_0 R_1}{\alpha^2}\) from the theorem statement, we get \[\begin{align} \tau = \frac{\alpha\eta^2}{2(2-\eta)}, \quad 1-\epsilon &= \frac{\eta}{2}, \quad 2\epsilon(1-\epsilon) = \frac{1}{2}\eta(2-\eta). \end{align}\] Using \(\epsilon \ge 1/2\) in the exponent, the constants then become \[\begin{align} \widehat{C}_h &= \frac{2M_0 M_1}{\eta} e^{2\gamma M_0 R_1 h/\alpha} \Big(\frac{ M_0 M_1 h}{\alpha } + 1 \Big)\bigg(\frac{4-2\eta}{\alpha\eta^2} + h\bigg), \\ \widehat{h} &= \min\bigg\{\frac{\eta}{\alpha}, \frac{\alpha\sqrt{\eta}}{2 M_0M_1 } \bigg(1-\frac{\frac{1}{2}\eta(2-\eta)}{1-e^{-\eta} }\bigg)^{1/2}\bigg\} . \end{align}\] To bound \(\widehat{h}\) from below, we again use the inequality 39 to deduce \[\widehat{h} \ge \min\bigg\{\frac{\eta}{\alpha}, \frac{\alpha \eta^{3/2}}{5 M_0 M_1 } \bigg\} \ge \frac{\alpha \eta^{3/2}}{5 M_0 M_1 } = h^*,\] where the last step used \(0 < \eta < 1\) along with the fact that \(\alpha\) is bounded by the Lipschitz constant of \(\nabla V\), which is itself bounded by \(M_0 \le M_1\). The exponential in \(\widehat{C}_h\) is bounded by \(e^{\gamma}\) for \(h \le h^*\), because \(h^* \le \alpha/5M_0M_1\) and \(R_1 \le M_1\). Using \(h^* \le \eta/\alpha\) and also \(4-2\eta+\eta^3\le 4\) since \(0 \le \eta \le 1\), we get \[\begin{align} \widehat{C}_h &\le \frac{2M_0 M_1}{\eta} e^{\gamma } \cdot \frac{6}{5} \cdot \frac{4-2\eta+\eta^3}{\alpha\eta^2 } \le \frac{10M_0 M_1e^{\gamma } }{\alpha\eta^3 } = C. \end{align}\]
Here we give the proof of Theorem 4. As a preliminary step, we claim that \[\langle \pi,|\nabla V|_\infty^2\rangle \le 4\beta\log(2n). \label{pf:subgaussianestimate}\tag{57}\] This improves on [1] by a factor of \(\beta/\alpha\), and it allows us to require merely \(h\le 1/\beta\) instead of \(h\le\alpha/\beta^2\). We thank Sinho Chewi for pointing out this improvement. Indeed, the claim comes from the fact that \(\nabla V=-\nabla\log\pi\) is \(\sqrt{\beta}\)-subgaussian under \(\pi\), in the sense that \[\log \int e^{\theta \cdot \nabla V(x)}\,\pi(dx) \le \frac{\beta}{2}|\theta|^2, \qquad \forall \theta \in {\mathbb{R}}^n.\] See [15] or [16]. Then, applying [1] yields 57 .
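For intuition, 57 can be seen directly in the Gaussian case \(\pi = N(0,\beta^{-1}I)\), where \(\nabla V(X) \sim N(0,\beta I)\) under \(X \sim \pi\) and the classical maximal inequality for subgaussian variables applies. A quick Monte Carlo illustration (ours, not part of the proof):

```python
import numpy as np

# Monte Carlo illustration of (57) for pi = N(0, I/beta), where
# grad V(X) = beta*X ~ N(0, beta*I); E max_i (d_i V(X))^2 should lie
# below 4*beta*log(2n).
rng = np.random.default_rng(4)
beta, n, samples = 2.0, 500, 5000
grad = np.sqrt(beta) * rng.normal(size=(samples, n))   # grad V(X) for X ~ pi
lhs = np.mean(np.max(grad ** 2, axis=1))
print(f"E|grad V|_inf^2 = {lhs:.2f}  <=  4*beta*log(2n) = {4 * beta * np.log(2 * n):.2f}")
```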
Turning now to the main line of the proof, we consider the usual coupling of the LMC iterates with the continuous-time Langevin dynamics. That is, let \((B_t)_{t \ge 0}\) be an \(n\)-dimensional Brownian motion, and let \(Y_0 \sim \pi\). Let \[\begin{align} X_{(k+1)h} &= X_{kh} - h\nabla V(X_{kh}) + \sqrt{2}(B_{(k+1)h}-B_{kh}), \qquad k \ge 0, \\ dY_t &= -\nabla V(Y_t)dt + \sqrt{2}dB_t. \end{align}\] We introduce an auxiliary process as in [1]: \[\begin{align} \overline{Y}_{(k+1)h} = Y_{kh} - h\nabla V(Y_{kh}) + \sqrt{2}(B_{(k+1)h}-B_{kh}), \qquad k \ge 0. \end{align}\] By the triangle inequality, \[\big({\mathbb{E}}|X_{(k+1)h}-Y_{(k+1)h}|_\infty^2\big)^{1/2} \le \big({\mathbb{E}}|X_{(k+1)h}-\overline{Y}_{(k+1)h}|_\infty^2\big)^{1/2} + \big({\mathbb{E}}|\overline{Y}_{(k+1)h}-Y_{(k+1)h}|_\infty^2\big)^{1/2}. \label{pf:contraction1}\tag{58}\] Using \(\nabla^2 V \le \beta I\), we may exactly follow the calculation of [1] to control the second term by \[{\mathbb{E}}|\overline{Y}_{(k+1)h}-Y_{(k+1)h}|_\infty^2 \le \frac{2\beta^2 h^4}{3}\int_{{\mathbb{R}}^n}|\nabla V|_\infty^2 \,d\pi + 8\beta^2 h^3\log(2n).\] Applying 57 , using \(h\le 1/\beta\), and combining terms, we get \[{\mathbb{E}}|\overline{Y}_{(k+1)h}-Y_{(k+1)h}|_\infty^2 \le 16 \beta^2h^3 \log(2n). \label{pf:contraction2}\tag{59}\] To deal with the first term of 58 , interpolate in the usual way and use the assumed structure of \(\nabla^2V\): \[X_{(k+1)h}-\overline{Y}_{(k+1)h} = X_{kh} - Y_{kh} - h\big(\nabla V(X_{kh}) - \nabla V(Y_{kh})\big) = (R_D-hR_M)(X_{kh} - Y_{kh}),\] where we define the random matrices \[R_D := I - h\int_0^1D(uX_{kh} + (1-u)Y_{kh})\,du, \qquad R_M := \int_0^1M(uX_{kh} + (1-u)Y_{kh})\,du.\] Note that the entries of \(D\) lie in the interval \([\alpha,\beta]\), and so the entries of the diagonal matrix \(R_D\) lie in \([1-\beta h,1-\alpha h] \subset [0,1-\alpha h]\), since \(h \le 1/\beta\). Since \(R_D\) is diagonal, we deduce \(\|R_D\|_{\infty\to\infty} \le 1-\alpha h\). The norm bound assumed for \(M\) thus yields \[\begin{align} \|R_D-hR_M\|_{\infty\to\infty} &\le 1-\alpha h + h\|R_M\|_{\infty\to\infty} \le 1-(\alpha-\alpha_0) h. \end{align}\] It follows that \[\big({\mathbb{E}}|X_{(k+1)h}-\overline{Y}_{(k+1)h}|_\infty^2\big)^{1/2} \le \big(1-(\alpha-\alpha_0) h\big)\big({\mathbb{E}}|X_{kh} - Y_{kh}|_\infty^2\big)^{1/2}.\] Now, assuming \((X_{kh},Y_{kh})\) was coupled optimally for \(W_{2,\ell^\infty}(\rho_{kh},\pi)\), we combine this estimate with 59 to deduce \[\begin{align} W_{2,\ell^\infty}(\rho_{(k+1)h},\pi) &\le e^{-(\alpha-\alpha_0)h}W_{2,\ell^\infty}(\rho_{kh},\pi) + 4\beta h^{3/2}\sqrt{\log(2n)}. \end{align}\] Iterating this over \(k\) yields the claimed bound on the bias.
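The coupling above is straightforward to simulate. The sketch below (ours, toy setting) couples the LMC chain \(X\), a fine-grid Euler proxy for the diffusion \(Y\), and the auxiliary chain \(\overline{Y}\) through shared Brownian increments, for the standard Gaussian target \(V(x)=|x|^2/2\). It is a qualitative illustration only, since \(Y\) is itself discretized.

```python
import numpy as np

# Coupled simulation sketch: X (LMC with step h), Y (fine-grid Euler proxy for
# the Langevin diffusion), and the one-step auxiliary chain Ybar, all driven by
# the same Brownian increments.  Target: V(x) = |x|^2/2, so grad V(x) = x.
rng = np.random.default_rng(5)
n, h, substeps, steps = 10, 0.01, 100, 200
X = rng.normal(size=n)
Y = rng.normal(size=n)                              # Y_0 ~ pi = N(0, I)
for _ in range(steps):
    dB = np.sqrt(h / substeps) * rng.normal(size=(substeps, n))
    Ybar = Y - h * Y + np.sqrt(2) * dB.sum(axis=0)  # one Euler step from Y_{kh}
    for b in dB:                                    # fine grid approximates Y
        Y = Y - (h / substeps) * Y + np.sqrt(2) * b
    X = X - h * X + np.sqrt(2) * dB.sum(axis=0)
print("|X - Y|_inf =", np.max(np.abs(X - Y)),
      "  |Ybar - Y|_inf =", np.max(np.abs(Ybar - Y)))
```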
We start with a straightforward adaptation of Proposition 6, noting that \((\rho^u_t)_{t \ge 0}\) solves the Fokker-Planck equation \(\partial_t \rho^u =\nabla_u\cdot(b^u\rho^u)+\Delta_u\rho^u\) for \(b^u_t(x_u) := {\mathbb{E}}[\nabla_uV(Y_t)\,|\,Y^u_t=x_u]\). Then, arguing as in the proof of Proposition 7 up to 21 , and applying Remark 8 to introduce the free parameter \(\epsilon \in (0,1)\), we have \[\begin{align} \frac{d}{dt}H_t(u) &\le \frac{1}{4\epsilon}{\mathbb{E}}\big|\nabla\log\pi^u(Y^u_t) + {\mathbb{E}}[\nabla_u V(Y_t)\,|\,Y^u_t]\big|^2 - (1-\epsilon)I(\rho^u_t\,|\,\pi^u) \\ &= \frac{1}{4\epsilon}{\mathbb{E}}\big|\langle \rho^{N(u)\setminus u}_{t,Y^u_t} - \pi^{N(u)\setminus u}_{Y^u_t},\,\nabla_uV(Y^u_t,\cdot)\rangle\big|^2 - (1-\epsilon)I(\rho^u_t\,|\,\pi^u). \end{align}\] Here we write \(\rho^{N(u)\setminus u}_{t,Y^u_t}\) as usual for the conditional law of \(Y^{N(u)\setminus u}_t\) given \(Y^u_t\). Apply the transport inequality (A.2) along with the chain rule for relative entropy, exactly as in Section 3.2, to get \[\frac{d}{dt}H_t(u) \le\frac{\gamma\beta^2}{2\alpha\epsilon}\big(H_t(N(u))-H_t(u)\big) - 2\alpha (1-\epsilon)H_t(u).\] Define the operator \(\mathsf{A} : {\mathbb{R}}^{2^{[n]}} \to {\mathbb{R}}^{2^{[n]}}\) by \(\mathsf{A}F(u)=\frac{\gamma\beta^2}{2\alpha}(F(N(u))-F(u))\). Then the above rewrites as a coordinatewise inequality between vectors, \[\frac{d}{dt}H_t \le \bigg(\frac{1}{ \epsilon}\mathsf{A}-2\alpha(1-\epsilon)\mathsf{I}\bigg)H_t.\] Integrate this coordinatewise inequality as in 28 to find \[H_t \le e^{-2\alpha(1-\epsilon)t} e^{ t\mathsf{A}/\epsilon}H_0.\] As explained in the paragraph after 27 , the semigroup admits the stochastic representation \(e^{s\mathsf{A}}F(u)={\mathbb{E}}F(N_{\Lambda(s)}(u))\) for any \(F \in {\mathbb{R}}^{2^{[n]}}\), where \(\Lambda\) is a Poisson process with rate \(\gamma\beta^2/2\alpha\). This completes the proof.
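The stochastic representation is easy to check numerically on a toy neighborhood map. The snippet below (ours, illustrative only) takes \(N(u)\) to be \(u\) together with its neighbors on a path graph, builds \(\mathsf{A} = r(\mathsf{P}-\mathsf{I})\) with \(\mathsf{P}F(u)=F(N(u))\), and compares the matrix exponential against the Poisson mixture \(\sum_k {\mathbb{P}}(\Lambda(s)=k)F(N_k(u))\); the rate \(r\) stands in for \(\gamma\beta^2/2\alpha\).

```python
import itertools
import numpy as np
from scipy.linalg import expm
from scipy.stats import poisson

# Check of e^{sA} F(u) = E F(N_{Lam(s)}(u)) with Lam(s) ~ Poisson(r*s), for the
# deterministic jump generator A F(u) = r (F(N(u)) - F(u)).  The neighborhood
# map N is a toy choice: u plus its neighbors on a path graph over [n].
n, r, s = 4, 0.7, 1.3
nbrs = lambda u: frozenset(j for i in u for j in (i - 1, i, i + 1) if 0 <= j < n)
subsets = [frozenset(cmb) for k in range(1, n + 1)
           for cmb in itertools.combinations(range(n), k)]
idx = {u: k for k, u in enumerate(subsets)}
A = np.zeros((len(subsets),) * 2)
for u in subsets:
    A[idx[u], idx[nbrs(u)]] += r
    A[idx[u], idx[u]] -= r
F = np.random.default_rng(6).uniform(size=len(subsets))
lhs = expm(s * A) @ F
rhs = np.zeros_like(lhs)
for u in subsets:
    Nk = u
    for k in range(60):                 # Poisson tail beyond k = 60 is negligible
        rhs[idx[u]] += poisson.pmf(k, r * s) * F[idx[Nk]]
        Nk = nbrs(Nk)
print("max deviation:", np.max(np.abs(lhs - rhs)))   # ~ 1e-15
```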