Contraction and entropy production
in continuous-time Sinkhorn dynamics
October 14, 2025
Recently, the vanishing-step-size limit of the Sinkhorn algorithm at finite regularization parameter \(\varepsilon\) was shown to be a mirror descent in the space of probability measures. We give \(L^2\) contraction criteria in two time-dependent metrics induced by the mirror Hessian, which reduce to the coercivity of certain conditional expectation operators. We then give an exact identity for the entropy production rate of the Sinkhorn flow, which was previously known only to be nonpositive. Examining this rate shows that the standard semigroup analysis of diffusion processes extends systematically to the Sinkhorn flow. We show that the flow induces a reversible Markov dynamics on the target marginal as an Onsager gradient flow. We define the Dirichlet form associated to its (nonlocal) infinitesimal generator, prove a Poincaré inequality for it, and show that the spectral gap is strictly positive along the Sinkhorn flow whenever \(\varepsilon > 0\). Lastly, we show that the entropy decay is exponential if and only if a logarithmic Sobolev inequality (LSI) holds. We give for illustration two immediate practical use-cases for the Sinkhorn LSI: as a design principle for the latent space in which generative models are trained, and as a stopping heuristic for discrete-time algorithms.
Entropy-regularized optimal transport (OT\(_\varepsilon\)) [1] and the closely related Schrödinger bridge problem [2] have found widespread practical applications in areas such as finite-time-horizon generative modeling [3] and fast estimation of optimal transport maps [4] and distances. At the same time, OT\(_\varepsilon\) possesses theoretically attractive properties: while the unregularized OT coupling typically concentrates on a low-dimensional set, OT\(_\varepsilon\) yields a strictly positive coupling whose barycentric projections are entropic estimates of the optimal transport map [4]; moreover, entropic regularization makes the coupling objective strictly convex and displacement smooth, which yields the existence and well-posedness of gradient flows [5]. From an algorithmic perspective, OT\(_{\varepsilon}\) exhibits improved computational complexity: unregularized OT distances incur \(O(n^3\log n)\) time for \(n\) bins using linear programming, while the celebrated Sinkhorn algorithm [1] incurs only \(O(n^2)\) cost per iteration. This fixed-point algorithm converges in relative entropy, for general unbounded domains and costs [6], at the sublinear rate \(O(t^{-1})\).
Recent work [7] has shown that the vanishing-step-size limit of the Sinkhorn algorithm is a mirror descent [8] in the space of probability measures, which we will refer to as the Sinkhorn flow. In particular, [7] shows that convergence in relative entropy of the right marginal along the continuous-time flow also occurs at the sublinear rate \(O(t^{-1})\) (now in continuous time \(t\)). In this work, we provide conditions under which the Sinkhorn flow converges exponentially. We establish this through: (i) exponential decay of perturbations in time-dependent Riemannian metrics on \(L^2\) [9], lifting contraction analysis for gradient flows in \(\mathbb{R}^n\) [10] to the space of probability measures, and (ii) new identities and estimates for entropy production rates which mirror the Bakry-Émery semigroup theory for diffusion processes [11].
We recall the entropy-regularized optimal transport problem. Let \(X, Y = \mathbb{R}^d\) and \(\mathcal{P}(X), \mathcal{P}(Y)\) the corresponding sets of probability measures. Let \(\mu \in \mathcal{P}(X), \nu \in \mathcal{P}(Y)\) be given marginals and \(\Pi(\mu, \nu) \subset \mathcal{P}(X\times Y)\) denote the set of couplings between these marginals, i.e. those joint probability measures \(\pi \in \mathcal{P}(X\times Y)\) satisfying \[\begin{align} \tag{1} \int_{X\times Y} f(x)\pi(dx, dy) &=: \int_X f(x)\pi^X(dx) = \int_X f(x)\mu(dx)\\ \tag{2} \int_{X\times Y}g(y)\pi(dx, dy) &=: \int_Y g(y) \pi^Y(dy) = \int_Yg(y)\nu(dy), \end{align}\] for all bounded measurable \(f, g\), where we have defined the marginalizations \((\cdot)^X, (\cdot)^Y\), used throughout. Equations 1 , 2 also fix the sense in which we denote equality of measures, e.g. \(\pi^X = \mu\). Given a cost function \(c : X\times Y \to \mathbb{R}_+\), the entropic optimal transport problem (denoted \(\rm{OT}_\varepsilon\)) is \[\begin{align} \label{eq:eot} {\rm OT}_\varepsilon(\mu, \nu) &= \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_\pi[c] + \varepsilon H(\pi | \mu \otimes \nu), \end{align}\tag{3}\] where \(H\) is the relative entropy, defined as \[\begin{align} \label{eq:H} H(\pi | \tilde{\pi}) &= \begin{cases} \int_{X\times Y}d\pi\log\left(\frac{d\pi}{d\tilde{\pi}}\right) & {\rm if}\;\pi \ll \tilde{\pi}\\ +\infty & {\rm otherwise} \end{cases} \end{align}\tag{4}\] and where \(\pi \ll \tilde{\pi}\) denotes that \(\pi\) is absolutely continuous with respect to \(\tilde{\pi}\). The existence of the minimizer (denoted throughout as \(\pi_*\)) in 3 is a standard result due to the lower semicontinuity of \(H\) (see e.g. Theorem 1.10, [12]). The Fenchel-Rockafellar dual form of \({\rm OT}_\varepsilon\) 3 (4.4, [13]) is \[\begin{align} \label{eq:eot95dual} {\rm OT}_\varepsilon(\mu, \nu) &= \sup_{f\in C(X), g\in C(Y)}\mathbb{E}_\mu[\frac{f}{\varepsilon}] + \mathbb{E}_\nu[\frac{g}{\varepsilon}] - \mathbb{E}_{(\mu\otimes \nu)}\left[\exp\left(\frac{-c + (f \oplus g)}{\varepsilon}\right)\right] \end{align}\tag{5}\] where \((f \oplus g)(x, y) = f(x) + g(y)\) and \(f \in L^1(\mu), g \in L^1(\nu)\) are called the Schrödinger potentials.
Denoting the sets of joint probability measures satisfying the one-sided marginal constraints by \[\begin{align} \Pi(\mu, \cdot) = \{\pi \in \mathcal{P}(X\times Y)\;|\;\pi^X = \mu\},\;\Pi(\cdot, \nu) = \{\pi \in \mathcal{P}(X\times Y)\;|\;\pi^Y=\nu\}, \end{align}\] the Sinkhorn algorithm [1], [13] is the fixed-point iteration \[\begin{align} d\pi_0 &\propto \exp(-c/\varepsilon)d(\mu\otimes \nu)\\ \pi_{t+1} &= \mathop{\mathrm{arg\,min}}_{\pi \in \Pi(\mu, \cdot)} H(\pi | \pi_t)\\ \pi_{t+2} &= \mathop{\mathrm{arg\,min}}_{\pi \in \Pi(\cdot, \nu)}H(\pi | \pi_{t+1}), \end{align} \label{eq:sinkhorn}\tag{6}\] whose stationary state is the minimizer \(\pi_*\) of 3 . Also called the iterative proportional fitting procedure (IPFP), it is an alternating Bregman projection (with relative entropy \(H\) the Bregman divergence) whose basic properties are \[\begin{align} &\pi_{\rm odd}^X = \mu,\quad \pi_{\rm even}^Y = \nu,\\ &H(\pi_*|\pi_t) \le H(\pi_*|\pi_{t-1}),\quad H(\pi_t^X|\mu)+H(\pi_t^Y|\nu) \le H(\pi_t|\pi_{t-1}) \end{align}\] (see e.g. §6.1, [12]). Recently [7], the continuous-time (vanishing step-size) limit of 6 was shown to be a mirror descent in \((L^1, \mathcal{P})\): \[\begin{align} \label{eq:sinkhorn95dual} \frac{\partial h_t}{\partial t} &= -\frac{\delta F}{\delta \pi}(\pi_t)=-\log\frac{d\pi_t^Y}{d\nu},\quad F(\pi) = H(\pi^Y | \nu),\;h_t \in L^1(X\times Y) \end{align}\tag{7}\] is the flow in the dual space \(L^1(X\times Y)\), where the dual variable \(h_t\) is exactly the right Schrödinger potential \(g\) in 5 , and \[\begin{align} \label{eq:sinkhorn95primal} \pi_t &= \frac{\delta \varphi^*}{\delta h}(h_t) = \hat{\pi},\quad \hat{\pi}(x, y) = \frac{\pi_0(x,y)e^{h_t(x,y)}}{\int_{Y}\pi_0(x, y')e^{h_t(x,y')}dy'}\mu(x) \end{align}\tag{8}\] is the flow in the primal space \(\Pi(\mu, \cdot) \subset \mathcal{P}\); we have \(\hat{\pi}^X = \mu\) by construction. In 7 , \(F\) (the relative entropy of the right marginal) is the objective and \(\varphi^*\) is the Fenchel conjugate of the mirror map \(\varphi\) \[\begin{align} \varphi(\pi) &= H(\pi | \pi_0) + \iota_{\Pi(\mu, \cdot)}(\pi),\quad \iota_{\Pi(\mu, \cdot)}(\pi) = \begin{cases} 0 & \pi^X = \mu\\ +\infty & \text{otherwise,} \end{cases}\\ \varphi^*(h) &= \sup_{\pi \in \Pi(\mu, \cdot)} \langle\pi, h\rangle - \varphi(\pi) = \langle \hat{\pi}, h\rangle - H(\hat{\pi} | \pi_0), \end{align}\] where \(\hat{\pi}\) is defined as in 8 ; the last equality is due to Lemma 3, [7].
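For concreteness, the following minimal sketch implements the alternating projections 6 on a small grid, together with an explicit Euler discretization of the dual flow 7 mapped back to the primal through 8 (a relaxed Sinkhorn step of size \(\gamma\); \(\gamma = 1\) recovers a full Sinkhorn sweep). The grid, marginals, cost, regularization and step size are our own illustrative choices, not prescriptions of [7].

```python
import numpy as np

# Toy discrete surrogate of the setting above: grid, marginals, cost, and the
# initialization pi_0 ∝ exp(-c/eps) d(mu ⊗ nu).  All choices are illustrative.
n = 25
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2);               mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
eps = 0.5
C = (x[:, None] - y[None, :])**2
pi = np.exp(-C / eps) * mu[:, None] * nu[None, :]; pi /= pi.sum()

def H(p, q):
    """Discrete relative entropy H(p|q), assuming p << q."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Alternating Bregman (I-)projections of 6: each half-step rescales one marginal.
for it in range(1, 52):
    if it % 2 == 1:                                   # project onto Pi(mu, .)
        pi = pi * (mu / pi.sum(axis=1))[:, None]
        assert np.allclose(pi.sum(axis=1), mu)        # pi_odd^X = mu
    else:                                             # project onto Pi(., nu)
        pi = pi * (nu / pi.sum(axis=0))[None, :]
        assert np.allclose(pi.sum(axis=0), nu)        # pi_even^Y = nu
print("after 51 alternating projections, H(pi^Y | nu) =", H(pi.sum(axis=0), nu))

# Explicit Euler step of the mirror flow: h <- h - gamma*log(d pi^Y/d nu) (eq. 7),
# mapped back to the primal by 8, i.e. a column reweighting followed by a row
# renormalization enforcing pi^X = mu.  gamma = 1 recovers a full Sinkhorn sweep.
def flow_step(pi, mu, nu, gamma):
    q = pi * (nu / pi.sum(axis=0))[None, :] ** gamma
    return q * (mu / q.sum(axis=1))[:, None]

pi = np.exp(-C / eps) * mu[:, None] * nu[None, :]; pi /= pi.sum()
for _ in range(500):
    pi = flow_step(pi, mu, nu, gamma=0.1)
print("after 500 Euler steps (gamma=0.1), H(pi^Y | nu) =", H(pi.sum(axis=0), nu))
```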
In addition to the notation already defined thus far, we will use the following abbreviations for \(L^2\) inner products: \[\begin{align} \langle f, g\rangle_{\pi} := \langle f, g\rangle_{L^2(\pi)} = \int_{X\times Y} f(x, y)g(x, y)\pi(dx, dy),\quad \langle \cdot, \cdot\rangle := \langle \cdot, \cdot\rangle_{L^2} \end{align}\] with the domains of integration implied by the measure \(\pi\). We will also denote the subspace of mass-zero functions by \[\begin{align} \label{eq:mean95zero} L_0^p(\pi) &:= \{f \in L^p(\pi)\;|\;\langle f, \boldsymbol{1}\rangle_{\pi} = 0\}, \end{align}\tag{9}\] where \(\boldsymbol{1}\) is the constant function. Finally, we shall denote the disintegrations (conditional measures) by \[\begin{align} \pi(dx, dy) = \pi(dx|y)\pi^Y(dy) = \pi(dy|x)\pi^X(dx), \end{align}\] with equality of measures in the sense of 1 .
Definition 1 (Conditional expectation operator). Let \((X\times Y, \pi)\) be the probability space. Define \[\begin{align} \label{eq:P95pi} (P_\pi f)(y) &:= \mathbb{E}_\pi[f | Y=y] = \int_{X}f(x, y)\pi(dx|y) \in L^2(\pi^Y), \end{align}\tag{10}\] which is an orthogonal projection (after canonically embedding \(L^2(\pi^Y)\) back in \(L^2(\pi)\) via \(h(x, y) := h(y)\), which we will assume throughout), since for all \(f, g \in L^2(\pi)\):
(1) \(P_\pi\) is a projection by the tower property \[\begin{align} \label{eq:p95idemp} P_\pi P_\pi f &= \mathbb{E}_\pi[\mathbb{E}_\pi[f|Y]|Y] = \mathbb{E}_\pi[f|Y] = P_\pi f, \end{align}\tag{11}\]
(2) \(P_\pi\) is self-adjoint by \[\begin{align} \label{eq:p95adj} \langle P_\pi f, g\rangle_{L^2(\pi)} &= \int_Y\left(\int_X f(x', y)\pi(dx'|y)\right)\left(\int_X g(x, y)\pi(dx|y)\right)\pi^Y(dy) = \langle f, P_\pi g\rangle_{L^2(\pi)}. \end{align}\tag{12}\]
(3) \(P_\pi\) is bounded and a contraction; by Jensen’s inequality, \[\begin{align} \left\lVert P_\pi f\right\rVert_{L^2(\pi)}^2 &= \mathbb{E}_\pi[(\mathbb{E}_\pi[f|Y])^2] \le \mathbb{E}_\pi[\mathbb{E}_\pi[f^2|Y]] = \mathbb{E}_\pi[f^2] = \left\lVert f\right\rVert_{L^2(\pi)}^2. \end{align}\]
(4) \(P_\pi\) is the orthogonal projection onto the closed subspace \[\begin{align} \mathop{\mathrm{im}}P_\pi &= \{g \in L^2(\pi)\;|\;\exists h \in L^2(\pi^Y)\;{\rm s.t.}\;g(x, y) = h(y)\;{\rm for}\;\pi-{\rm a.e.} (x, y)\}. \end{align}\]
Similarly, define the projection \(Q_\pi : L^2(\pi) \to L^2(\pi^X)\) \[\begin{align} \label{eq:Q95pi} (Q_\pi f)(x) &:= \mathbb{E}_{\pi}[f|X=x] = \int_Yf(x, y)\pi(dy|x) \end{align}\tag{13}\] which also has properties (1), (2), and (3) above, with (4) being \[\begin{align} \mathop{\mathrm{im}}Q_\pi &= \{g \in L^2(\pi)\;|\;\exists f \in L^2(\pi^X)\;{\rm s.t.}\;g(x, y) = f(x)\;{\rm for}\;\pi-{\rm a.e.} (x, y)\}. \end{align}\]
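On a finite grid, \(P_\pi\) and \(Q_\pi\) are weighted averages over the columns and rows of the coupling, and properties (1)-(3) can be checked directly; the coupling below is an illustrative surrogate (the continuous theory above assumes Lebesgue densities).

```python
import numpy as np

# Illustrative discrete coupling pi on an n x n grid.
rng = np.random.default_rng(0)
n = 25
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()
piX, piY = pi.sum(axis=1), pi.sum(axis=0)

def P(f):       # (P_pi f)(y) = E_pi[f | Y = y], embedded back as a function on X x Y
    return np.tile((f * pi).sum(axis=0) / piY, (n, 1))

def Q(f):       # (Q_pi f)(x) = E_pi[f | X = x], embedded back as a function on X x Y
    return np.tile(((f * pi).sum(axis=1) / piX)[:, None], (1, n))

def ip(f, g):   # <f, g>_{L^2(pi)}
    return float((f * g * pi).sum())

f = rng.standard_normal((n, n)); g = rng.standard_normal((n, n))
print(np.allclose(P(P(f)), P(f)))                 # (1) idempotent (tower property)
print(np.isclose(ip(P(f), g), ip(f, P(g))))       # (2) self-adjoint in L^2(pi)
print(ip(P(f), P(f)) <= ip(f, f))                 # (3) contraction (Jensen)
print(np.allclose(Q(Q(f)), Q(f)), np.isclose(ip(Q(f), g), ip(f, Q(g))))  # same for Q_pi
```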
Definition 2 (Numerical range). Let \(T \in \mathcal{B}(H)\) be a bounded linear operator on a Hilbert space \(H\). Its numerical range \(W_H(T)\) is the subset of the complex plane \[\begin{align} W_H(T) &= \{\frac{\langle v, Tv\rangle_H}{\langle v, v\rangle_H}\;|\;v \in H\setminus\{0\}\}, \end{align}\] which is equivalently the image of the unit sphere \(\left\lVert v\right\rVert_H=1\) under \(v \mapsto \langle v, Tv\rangle_H\).
Definition 3 (Coercivity). We call an operator \(T\) as in Definition 2 \(\lambda\)-coercive in a real Hilbert space \(H\) if \(\inf W_H(T) = \lambda > 0\).
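In a real, finite-dimensional Hilbert space, \(\inf W_H(T)\) equals the smallest eigenvalue of the symmetric part of \(T\) (restricted, if desired, to a subspace), so \(\lambda\)-coercivity reduces to a symmetric eigenvalue computation. A minimal sketch, with arbitrary illustrative matrices:

```python
import numpy as np

def coercivity_constant(T, basis=None):
    """inf W_H(T) for a real matrix T, i.e. the smallest eigenvalue of its
    symmetric part, optionally restricted to span(basis) (orthonormal columns)."""
    S = 0.5 * (T + T.T)
    if basis is not None:
        S = basis.T @ S @ basis
    return float(np.linalg.eigvalsh(S)[0])

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
print(coercivity_constant(A @ A.T))   # PSD example: coercive iff this is > 0
print(coercivity_constant(A))         # generic example: may be negative
```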
For simplicity, we shall assume \(\mu, \nu\) are absolutely continuous with respect to the Lebesgue measure in the following results, and will use \(\mu, \nu, \pi\) to interchangeably represent measures and densities. This precludes, for example, \(\mu\) or \(\nu\) being empirical distributions, but we believe the arguments presented here can be adapted without undue difficulty. We shall also assume \(\mu, \nu > 0\) Lebesgue-a.e., so that \(\pi_0 > 0\) and \(\pi_t > 0\) Lebesgue-a.e., which follows from 6 and 8 .
Theorem 1 (Contraction of Sinkhorn flow in \(\langle \cdot, \cdot\rangle_{1/\pi_t^2}\)). The Sinkhorn flow 7 , 8 is contracting (or expanding) with rate \(\lambda \in \mathbb{R}\) in the time-dependent metric \[\begin{align} \langle \cdot, \cdot\rangle_{1/\pi_t^2} \end{align}\] for all states \(\pi_t\) and tangent directions \(\xi \in \ker Q_{\pi_t}\) at which the conditional expectation operator \(P_{\pi_t}\) defined in 10 satisfies the coercivity property \[\begin{align} \label{eq:P95coercivity} \langle \xi, P_{\pi_t} \xi\rangle &\ge \lambda \langle \xi, \xi\rangle. \end{align}\tag{14}\]
Proof. Let \(\eta \in L^1(X \times Y)\). The Gateaux derivative (first variation) of \(F\) at \(\pi\) in the direction \(\eta\) is \[\begin{align} \delta F(\pi)[\eta] &= \iint_{X\times Y}\left(\log\frac{d\pi^Y}{d\nu}(y) + 1\right)\eta(x, y)dxdy = \int_Y\eta^Y(y)\left(\log\frac{d\pi^Y}{d\nu}(y) + 1\right)dy, \end{align}\] where \(\eta^Y(y) = \int_X\eta(x, y)dx\) denotes the “marginal.” Similarly, the second variation is, given also \(\eta' \in L^1(X\times Y)\), \[\begin{align} \delta^2 F(\pi)[\eta, \eta'] &= \int_Y\frac{\eta^Y(y)\eta'^Y(y)}{\pi^Y(y)}dy \end{align}\] and in particular we have the positive-semidefiniteness \[\begin{align} \delta^2F(\pi)[\eta, \eta] &= \int_Y \frac{(\eta^Y(y))^2}{\pi^Y(y)}dy \ge 0,\;\text{with equality iff}\;\eta^Y = 0\;\pi^Y-\text{a.e}. \end{align}\] Next, we consider the mirror map \(\varphi\). Let us first define the tangent space \[\begin{align} \label{eq:tan95pi} T_\pi \Pi(\mu, \cdot) &= \{a \in L^2(X\times Y)\;|\;a^X(x)=0\;\text{for}\;\mu-\text{a.e.}\;x\}. \end{align}\tag{15}\] Then \[\begin{align} \delta \varphi(\pi)[a] &= \iint_{X\times Y}\left(\log\frac{d\pi}{d\pi_0}+1\right)a dxdy,\quad a \in T_\pi \Pi(\mu, \cdot) \end{align}\] and \[\begin{align} \label{eq:hess95varphi} \delta^2\varphi(\pi)[a, b] &= \iint_{X\times Y}\frac{ab}{\pi}dxdy,\quad a, b\in T_\pi \Pi(\mu, \cdot) \end{align}\tag{16}\] which is positive definite since \(\pi > 0\) a.e. Now, let \(\delta \pi_t \in T_\pi\Pi(\mu, \cdot)\) be a perturbation. Since \[\begin{align} h_t &= \delta \varphi(\pi_t)\quad \text{and}\quad \frac{d}{dt}h_t = -\delta F(\pi_t) \end{align}\] then \(\delta\pi_t\) induces a corresponding \(\delta h_t \in L^1\) by 16 as \[\begin{align} \label{eq:delh} \delta h_t &= \delta^2\varphi(\pi_t)[\delta \pi_t, \cdot] = \frac{\delta \pi_t}{\pi_t}\quad \text{and}\quad \frac{d}{dt}\delta h_t = -\delta^2 F(\pi_t)[\delta \pi_t, \cdot] = - \frac{\delta\pi_t^Y}{\pi_t^Y} \end{align}\tag{17}\] which is well-defined since \(\pi_t^Y > 0\) a.e. Note that from 15 , \[\begin{align} \label{eq:coercivity95domain} \delta \pi_t^X &= 0 \iff \frac{\delta \pi_t^X}{\pi_t^X} = \mathbb{E}_{\pi_t}[\delta h_t|X] = 0\iff \delta h_t \in \ker Q_{\pi_t}, \end{align}\tag{18}\] which gives the domain of coercivity as stated in the hypothesis. Note that the Hessian operators of \(\varphi, F\) are expressible as \[\begin{align} \label{eq:hessians} H^\varphi_t\delta \pi_t &= \frac{\delta\pi_t}{\pi_t},\quad H^F_t \delta \pi_t = P_{\pi_t}H^\varphi_t\delta \pi_t. \end{align}\tag{19}\] It follows from 17 and the definition 10 of \(P_\pi\) that \[\begin{align} P_{\pi_t} \delta h_t &= P_{\pi_t}\frac{\delta \pi_t}{\pi_t} = \frac{1}{\pi_t^Y}\int_X\delta\pi_t(x, y)dx = \frac{\delta \pi_t^Y}{\pi_t^Y} = -\frac{d}{dt}\delta h_t. \end{align}\] The metric in the hypothesis is \[\begin{align} \langle\cdot, \cdot\rangle_{1/\pi_t^2} &= \langle H_t^\varphi \cdot, H_t^\varphi \cdot\rangle, \end{align}\] which is valid since \(H^\varphi = (H^\varphi)^*\). 
The norm of the perturbation in this metric evolves as \[\begin{align} \frac{d}{dt}\frac{1}{2}\left\lVert\delta \pi_t\right\rVert_{L^2(1/\pi_t^2)}^2 &= \frac{d}{dt}\frac{1}{2}\left\lVert\delta h_t\right\rVert_{L^2}^2 = \langle \delta h_t,\frac{d}{dt}\delta h_t\rangle = -\delta^2 F(\pi_t)[\delta \pi_t, \delta h_t]\\ &= -\langle \frac{\delta \pi_t^Y}{\pi_t^Y}, \left(\frac{\delta \pi_t}{\pi_t}\right)^Y\rangle_{L^2(Y)} = -\langle P_{\pi_t}\frac{\delta \pi_t}{\pi_t}, \left(\frac{\delta \pi_t}{\pi_t}\right)^Y\rangle = -\langle \delta h_t, P_{\pi_t}\delta h_t\rangle, \end{align}\] thus the instantaneous contraction/expansion rate is governed by the numerical range \(W_{\ker Q_{\pi_t}}(P_{\pi_t})\). Then, under the coercivity hypothesis 14 , \[\begin{align} \frac{d}{dt}\frac{1}{2}\left\lVert\delta \pi_t\right\rVert_{L^2(1/\pi_t^2)}^2 &\le -\lambda \left\lVert\delta \pi_t\right\rVert_{1/\pi_t^2}^2 \end{align}\] for all \(\xi = \delta h_t = \delta \pi_t / \pi_t \in \ker Q_{\pi_t}\) at which 14 holds. ◻
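The best constant \(\lambda\) in 14 can be estimated on a discrete surrogate of the state \(\pi_t\): assemble the matrix of \(P_{\pi_t}\), restrict its symmetric part to a basis of \(\ker Q_{\pi_t}\) as characterized in 18 , and take the smallest eigenvalue. The sketch below uses an illustrative coupling; a negative value simply means the contraction hypothesis fails at that state.

```python
import numpy as np
from scipy.linalg import null_space

# Illustrative discrete surrogate for the state pi_t.
n = 20
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()
piY = pi.sum(axis=0)
N = n * n                                    # functions on X x Y, flat index a = i*n + j

# Matrix of P_pi:  (P f)(i, j) = sum_{i'} f(i', j) pi(i', j) / pi^Y(j).
P = np.zeros((N, N))
for j in range(n):
    idx = j + n * np.arange(n)               # all indices (i, j) sharing this j
    P[np.ix_(idx, idx)] = np.tile(pi[:, j] / piY[j], (n, 1))

# ker Q_pi = { xi : sum_j xi(i, j) pi(i, j) = 0 for every i }   (cf. 18).
Cons = np.zeros((n, N))
for i in range(n):
    Cons[i, i * n + np.arange(n)] = pi[i, :]
Cons /= np.linalg.norm(Cons, axis=1, keepdims=True)   # row scaling; same null space
V = null_space(Cons)                                   # orthonormal basis of ker Q_pi

# inf of <xi, P xi> / <xi, xi> over ker Q_pi (Euclidean surrogate of plain L^2).
lam = np.linalg.eigvalsh(V.T @ (0.5 * (P + P.T)) @ V)[0]
print("estimated coercivity constant in 14:", lam)
```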
Remark 1. Coercivity 14 on \(L^2\) with the Lebesgue measure is not an immediate consequence of the projection property 11 , 12 . While \(P_{\pi}\) is self-adjoint in the weighted space \(L^2(\pi)\) by 12 , its (unweighted) \(L^2\) adjoint is in fact given by \[\begin{align} \langle P_\pi f, g\rangle &= \iint_{X\times Y}\frac{g(x, y)}{\pi^Y(y)}dx\int_Xf(x', y)\pi(x', y)dx'dy = \langle f, \pi P_\pi \left[\frac{g}{\pi}\right]\rangle = \langle f, \pi H^F g\rangle \end{align}\] with \(H^F\) the Hessian of the objective \(F\) as defined in 19 . To handle this issue, we will analyze the flow in a second metric in which the self-adjointness is preserved.
Theorem 2 (Contraction of Sinkhorn flow in Fisher-Rao). The Sinkhorn flow 7 , 8 is contracting (or expanding) with rate \(\lambda \in \mathbb{R}\) in the (time-dependent) Fisher-Rao metric [14] \[\begin{align} \langle \cdot, \cdot\rangle_{1/\pi_t} \end{align}\] for all states \(\pi_t\) for which one has the coercivity \[\begin{align} \label{eq:contr952} \langle \xi, \left[2P_{\pi_t} + (I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\right]\xi\rangle_{\pi_t} &\ge \lambda \langle \xi, \xi\rangle_{\pi_t} \end{align}\tag{20}\] for all \(\xi \in \ker Q_{\pi_t}\), where the second term \((I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\) acts on \(\xi\) by pointwise multiplication.
Proof. We have, using the relations 17 , \[\begin{align} \frac{d}{dt} \langle\delta \pi_t, \delta \pi_t\rangle_{\frac{1}{\pi_t}} &= \frac{d}{dt} \langle \delta \pi_t, \delta h_t\rangle = -\langle \delta h_t, P_{\pi_t}\delta h_t\rangle_{\pi_t} + \langle \delta h_t, \frac{\partial}{\partial t}(\pi_t\delta h_t)\rangle\\ &= -2\langle \delta h_t, P_{\pi_t}\delta h_t\rangle_{\pi_t} + \langle \delta h_t, \delta h_t \frac{\partial}{\partial t}\log\pi_t\rangle_{\pi_t}. \end{align}\] Noting that from the primal variable 8 , \[\begin{align} \label{eq:dlogpi95dt} \frac{\partial}{\partial t} \log \pi_t &= \frac{\partial h_t}{\partial t}(x, y) - \frac{1}{Z_t(x)}\frac{\partial Z_t}{\partial t}(x),\quad Z_t(x) = \int_Y\pi_0(x, y')e^{h_t(x, y')}dy'\\ &= \frac{\partial h_t}{\partial t}(x, y) - \frac{1}{\mu(x)}\int_Y \frac{\partial h_t}{\partial t}(x, y')\pi_t(x, y')dy'\\ &= (I - Q_{\pi_t}) \frac{\partial h_t}{\partial t}(x, y), \end{align}\tag{21}\] with \(Q_\pi\) the conditional expectation operator defined in 13 (using the fact that \(\pi_t^X = \mu\) for all \(t\)). Substituting the dual flow \(\frac{\partial h_t}{\partial t}\) from 7 , we then have \[\begin{align} \label{eq:44} \frac{d}{dt} \langle\delta \pi_t, \delta \pi_t\rangle_{\frac{1}{\pi_t}} &= -2\langle \delta h_t, P_{\pi_t} \delta h_t\rangle_{\pi_t} -\langle \delta h_t^2, (I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\rangle_{\pi_t}. \end{align}\tag{22}\] Letting \(f_t := (I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\), \[\begin{align} &= -\langle \delta h_t, (2P_{\pi_t} + f_t)\delta h_t\rangle_{\pi_t}, \end{align}\] so that the contraction (expansion) rate is determined by the numerical range \(W_{\ker Q_{\pi_t}}(2P_{\pi_t} + f_t)\). Under the coercivity hypothesis (with domain of coercivity \(\delta h_t \in \ker Q_{\pi_t}\) exactly as in 18 ), we then have \[\begin{align} &\le -\lambda \langle \delta h_t, \delta h_t\rangle_{\pi_t} = -\lambda \langle \delta \pi_t, \delta \pi_t\rangle_{\frac{1}{\pi_t}}, \end{align}\] giving the result. ◻
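The constant in 20 can be estimated in the same way, now in the \(\pi_t\)-weighted inner product: conjugating the \(L^2(\pi_t)\)-self-adjoint operator \(2P_{\pi_t} + f_t\) by \(\mathrm{diag}(\pi_t)^{1/2}\) produces an ordinary symmetric matrix, and \(\ker Q_{\pi_t}\) becomes an ordinary null-space constraint. The coupling below is again an illustrative surrogate.

```python
import numpy as np
from scipy.linalg import null_space

# Illustrative discrete surrogate for the state pi_t.
n = 20
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()
piX, piY = pi.sum(axis=1), pi.sum(axis=0)
N = n * n                                              # flat index a = i*n + j

# f_t = (I - Q_pi) log(d pi^Y / d nu), acting by pointwise multiplication.
G = np.tile(np.log(piY / nu), (n, 1))                  # log(d pi^Y/d nu) lifted to X x Y
f_t = G - np.tile(((G * pi).sum(axis=1) / piX)[:, None], (1, n))

# Conjugate A = 2 P_pi + f_t by diag(pi)^(1/2): A~ is an ordinary symmetric matrix,
# and <xi, A xi>_pi / <xi, xi>_pi is the Rayleigh quotient of A~ in zeta = pi^(1/2) xi.
A_tilde = np.diag(f_t.ravel())
for j in range(n):
    idx = j + n * np.arange(n)
    A_tilde[np.ix_(idx, idx)] += 2.0 * np.outer(np.sqrt(pi[:, j]), np.sqrt(pi[:, j])) / piY[j]

# ker Q_pi in the variable zeta:  sum_j sqrt(pi(i, j)) zeta(i, j) = 0 for every i.
Cons = np.zeros((n, N))
for i in range(n):
    Cons[i, i * n + np.arange(n)] = np.sqrt(pi[i, :])
Cons /= np.linalg.norm(Cons, axis=1, keepdims=True)
V = null_space(Cons)

lam = np.linalg.eigvalsh(V.T @ (0.5 * (A_tilde + A_tilde.T)) @ V)[0]
print("estimated coercivity constant in 20:", lam)
```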
Remark 2. In contrast with the non-self-adjointness of \(P_\pi\) in \(L^2\) noted in Remark 1, condition 20 is a coercivity property in \(L^2(\pi)\), in which \(P_\pi\) (and \(Q_\pi\)) are self-adjoint; what remains is to compare the spectral gap of \(P_\pi\) with the size of the second term, which deserves attention in its own right, as we show below.
Remark 3. One way to bound the second term in 22 is via the Cauchy-Schwarz inequality, which introduces the factor \(\left\lVert(I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\right\rVert_{L^2(\pi_t)}\). The square of this factor, \[\begin{align} \left\lVert(I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\right\rVert_{L^2(\pi_t)}^2, \end{align}\] is in fact exactly the entropy production rate of the Sinkhorn flow, as we show in Theorem 3.
Theorem 3 (Entropy production rate of Sinkhorn flow). The Sinkhorn flow 7 , 8 satisfies the entropy production identity \[\begin{align} \label{eq:sinkhorn95epr} \frac{d}{dt}H(\pi_t^Y|\nu) &= -\left\lVert(I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\right\rVert_{L^2(\pi_t)}^2. \end{align}\tag{23}\]
Proof. Let \(g_t(y) := \log\frac{d\pi_t^Y}{d\nu}(y)\). Then \[\begin{align} \frac{d}{dt} H(\pi_t^Y|\nu) &= \frac{d}{dt}\int_Y\pi_t^Yg_tdy = \int_Y \frac{\partial \pi_t^Y}{\partial t}g_tdy + \cancelto{0}{\int_Y\frac{\partial \pi_t^Y}{\partial t}dy}. \end{align}\] Moreover, using the identity 21 , \[\begin{align} \label{eq:dpi95ty95dt} \frac{\partial \pi_t^Y}{\partial t}(y) &= \int_X \frac{\partial \pi_t}{\partial t}(x, y)dx = -\int_X[(I - Q_{\pi_t})g_t](x, y)\pi_t(x, y)dx. \end{align}\tag{24}\] Letting \(Q_{\pi_t}^\perp := (I - Q_{\pi_t})\), \[\begin{align} \frac{d}{dt}H(\pi_t^Y|\nu) &= -\int_{X\times Y}g_t(y)(Q_{\pi_t}^\perp g_t)(x, y)\pi_t(x, y)dxdy \\ \label{eq:dhdt95punchline} &= -\langle g_t, Q_{\pi_t}^\perp g_t\rangle_{L^2(\pi_t)} = -\left\lVert Q_{\pi_t}^\perp g_t\right\rVert_{L^2(\pi_t)}^2 \end{align}\tag{25}\] since \(Q_{\pi_t}\) (and therefore \(Q_{\pi_t}^\perp\)) is an orthogonal projection on \(L^2(\pi_t)\) by Definition 1, giving the result. ◻
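Identity 23 is easy to check numerically: along an Euler discretization of 7 , 8 with a small step \(\gamma\), the finite-difference rate of \(H(\pi_t^Y|\nu)\) should match the right-hand side of 23 up to \(O(\gamma)\). A sketch with illustrative choices of grid, marginals and step:

```python
import numpy as np

# Illustrative grid, marginals, and Sinkhorn initialization.
n = 30
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()

def flow_step(pi, gamma):                   # explicit Euler step of the flow 7, 8
    q = pi * (nu / pi.sum(axis=0))[None, :] ** gamma
    return q * (mu / q.sum(axis=1))[:, None]

def H_Y(pi):                                # H(pi^Y | nu)
    piY = pi.sum(axis=0)
    return float(np.sum(piY * np.log(piY / nu)))

def entropy_production(pi):                 # right-hand side of 23
    piX, piY = pi.sum(axis=1), pi.sum(axis=0)
    g = np.tile(np.log(piY / nu), (n, 1))   # log(d pi^Y / d nu) lifted to X x Y
    Qg = np.tile(((g * pi).sum(axis=1) / piX)[:, None], (1, n))
    return float((((g - Qg) ** 2) * pi).sum())

gamma = 1e-3
for k in range(5):
    dH = (H_Y(flow_step(pi, gamma)) - H_Y(pi)) / gamma     # finite-difference rate
    print(f"dH/dt ~ {dH:+.6e}   -EPR = {-entropy_production(pi):+.6e}")
    for _ in range(200):                                   # advance the flow a bit
        pi = flow_step(pi, gamma)
```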
Remark 4. Notice that the expression in 25 can be written as, for any function \(g \in L^2(\pi^Y)\), \[\begin{align} \langle g, (I - Q_\pi)g\rangle_{L^2(\pi)} = \langle g, P_\pi(I-Q_\pi)g\rangle_{L^2(\pi^Y)} = \langle g, (I - P_\pi Q_\pi)g\rangle_{L^2(\pi^Y)} \end{align}\] since \(P_\pi g = g\) for \(g\) which is just a function of \(y\). This motivates the definition of the operator \[\begin{align} (T_\pi g)(y) &:= \mathbb{E}_\pi[\mathbb{E}_\pi[g(Y)|X]|Y=y] = (P_\pi Q_\pi g)(y), \end{align}\] which maps \(L^2(\pi^Y) \to L^2(\pi^Y)\) (compare with the domains of \(P_\pi, Q_\pi\) in Definition 1, which are the whole of \(L^2(\pi)\)). \(T_\pi\) is self-adjoint on \(L^2(\pi^Y)\) since, for \(f\) also in \(L^2(\pi^Y)\), \[\begin{align} \langle f, T_\pi g\rangle_{\pi^Y} &= \langle f, Q_\pi g\rangle_{\pi} = \langle Q_\pi f, g\rangle_{\pi} = \langle T_\pi f, g\rangle_{\pi^Y}. \end{align}\] Moreover, \(T_\pi \boldsymbol{1} = \boldsymbol{1}\), so \(T_\pi\) satisfies the necessary conditions to be a symmetric (reversible) Markov operator (§1.6.1, [11]). Its stationary measure is \(\pi^Y\), since for every \(f \in L^2(\pi^Y)\), \[\begin{align} \int_Y \pi^Y T_\pi f dy &= \int_X \pi^XQ_\pi fdx = \int_Y f \pi^Ydy \end{align}\] hence \(T_\pi^* \pi^Y = \pi^Y\). Note furthermore that by a calculation following from 24 , \[\begin{align} \label{eq:dpi95ty952} \frac{\partial \pi_t^Y}{\partial t} = -\pi_t^YP_{\pi_t}(I - Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu} = -\pi_t^Y(I - T_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}. \end{align}\tag{26}\] Hence, defining \(L_{\pi_t} := (I - T_{\pi_t})\), we see that the marginal dynamics 26 can be written in the Onsager gradient-flow form \[\begin{align} \frac{\partial \pi_t^Y}{\partial t}= -\pi_t^Y L_{\pi_t}\frac{\delta F}{\delta \pi^Y}(\pi_t^Y), \end{align}\] with \(K_{\pi} := \pi^Y L_{\pi}\) the (nonlocal, \(\pi\)-dependent) Onsager operator.
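In the discrete surrogate, \(T_\pi = P_\pi Q_\pi\) is an \(n\times n\) stochastic matrix, and the properties claimed above (\(T_\pi\boldsymbol{1} = \boldsymbol{1}\), stationarity of \(\pi^Y\), reversibility, and mass conservation of the marginal dynamics 26 ) can be verified directly; the coupling below is illustrative.

```python
import numpy as np

# Illustrative discrete coupling.
n = 25
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()
piX, piY = pi.sum(axis=1), pi.sum(axis=0)

# (T_pi g)(y) = E[E[g(Y)|X] | Y = y]  =>  T[y, y'] = sum_x pi(x|y) pi(y'|x).
T = (pi / piY[None, :]).T @ (pi / piX[:, None])

ones = np.ones(n)
print(np.allclose(T @ ones, ones))                        # T_pi 1 = 1       (Markov)
print(np.allclose(piY @ T, piY))                          # pi^Y is stationary
print(np.allclose(np.diag(piY) @ T, T.T @ np.diag(piY)))  # reversible w.r.t. pi^Y

# Marginal dynamics 26: d/dt pi^Y = -pi^Y (I - T_pi) log(d pi^Y / d nu); mass is conserved.
rate = -piY * ((np.eye(n) - T) @ np.log(piY / nu))
print("total mass rate (should be ~0):", rate.sum())
```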
Definition 4 (Sinkhorn Dirichlet form). In analogy with diffusion processes, let us define the Dirichlet form (§1.7.1, [11]) associated to \(K_\pi\) using an “integration by parts” formula \[\begin{align} \label{eq:dirichlet} \mathcal{E}_\pi(f, g) &:= \langle f, L_\pi g\rangle_{L^2(\pi^Y)}, \end{align}\tag{27}\] which for equal arguments is exactly \[\begin{align} \label{eq:dirichlet952} \mathcal{E}_\pi(g, g) &= \langle g, (I - Q_\pi)g\rangle_{L^2(\pi)} = \left\lVert(I - Q_\pi)g\right\rVert_{L^2(\pi)}^2, \end{align}\tag{28}\] the entropy production rate of the Sinkhorn flow (Theorem 3), taking \(g = \log\frac{d\pi^Y}{d\nu}\).
As in the theory of diffusion processes, an explicit bound for the entropy production rate 23 is given by a “Poincaré constant,” or spectral gap, for the Dirichlet form \(\mathcal{E}_\pi\).
Lemma 1 (Poincaré inequality for \(\mathcal{E}\)). For all \(g \in L^2(\pi^Y)\), we have \[\begin{align} \mathcal{E}_\pi(g, g) &\ge (1 - C(\pi))\left\lVert g - \langle g, \boldsymbol{1}\rangle_{\pi^Y}\boldsymbol{1}\right\rVert_{L^2(\pi^Y)}^2 = (1 - C(\pi))\mathrm{Var}_{\pi^Y}(g) \end{align}\] for some constant \(C(\pi) \in [0, 1]\) depending only on \(\pi\).
Proof. Let \(g \in L^2(\pi^Y)\) and \(g = \tilde{g} + c\boldsymbol{1}\) with \(\tilde{g} \in L^2_0(\pi^Y)\) where \(L^2_0(\pi^Y)\) is the mean-zero subspace as defined in 9 . Then, \[\begin{align} \mathcal{E}_\pi(g, g) &= \langle g, L_\pi g\rangle_{\pi^Y} = \langle \tilde{g}, L_\pi \tilde{g}\rangle_{\pi^Y} = \left\lVert\tilde{g}\right\rVert_{L^2(\pi^Y)}^2 - \langle \tilde{g}, T_\pi \tilde{g}\rangle_{\pi^Y} \end{align}\] since \(T_\pi \boldsymbol{1} = \boldsymbol{1}\), \(L_\pi\boldsymbol{1} = 0\) and \(L_\pi\) is self-adjoint on \(L^2(\pi^Y)\). Moreover, \(L_0^2(\pi^Y)\) is an invariant subspace of \(T_\pi\): \[\begin{align} \langle T_\pi\tilde{g}, \boldsymbol{1}\rangle_{\pi^Y} &= \langle \tilde{g}, \boldsymbol{1}\rangle_{\pi^Y} = 0. \end{align}\] Hence, denote the restriction to \(L_0^2(\pi^Y)\) by \(T_\pi^0\), which remains self-adjoint. Thus, letting \(f \perp \boldsymbol{1}\) denote that \(f \in L_0^2(\pi^Y)\setminus\{0\}\), \[\begin{align} \label{eq:T095norm} \left\lVert T_\pi^0\right\rVert &= \sup W_{L^2(\pi^Y)}(T_\pi^0) = \sup_{f\perp \boldsymbol{1}}\frac{\langle f, T_\pi^0 f\rangle_{\pi^Y}}{\left\lVert f\right\rVert_{L^2(\pi^Y)}^2} = \sup_{f\perp \boldsymbol{1}}\frac{\left\lVert Q_\pi f\right\rVert_{L^2(\pi)}^2}{\left\lVert f\right\rVert_{L^2(\pi^Y)}^2} =: C(\pi). \end{align}\tag{29}\] Since \(Q_\pi\) is an orthogonal projection, \(C(\pi) \le 1\). Hence, \[\begin{align} \mathcal{E}_\pi(g, g) &\ge (1 - C(\pi)) \left\lVert\tilde{g}\right\rVert_{L^2(\pi^Y)}^2, \end{align}\] which gives the result. ◻
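Numerically, in the discrete surrogate, \(C(\pi)\) is the second-largest eigenvalue of \(T_\pi\) (the largest is \(1\), attained by constants), so the spectral gap \(1 - C(\pi)\) and the Poincaré inequality can be checked directly; the coupling and test functions below are illustrative.

```python
import numpy as np

# Illustrative discrete coupling and the induced Markov operator T_pi.
rng = np.random.default_rng(0)
n = 25
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()
piX, piY = pi.sum(axis=1), pi.sum(axis=0)
T = (pi / piY[None, :]).T @ (pi / piX[:, None])

# Reversibility makes diag(piY)^(1/2) T diag(piY)^(-1/2) symmetric, so its spectrum
# is real; the top eigenvalue is 1 (constants) and C(pi) is the second largest.
d = np.sqrt(piY)
evals = np.linalg.eigvalsh((T * d[:, None]) / d[None, :])   # ascending
C_pi = evals[-2]
print("C(pi) =", C_pi, "  spectral gap 1 - C(pi) =", 1.0 - C_pi)

def dirichlet(g):          # E_pi(g, g) = <g, (I - T_pi) g>_{L^2(pi^Y)}
    return float(np.sum(piY * g * (g - T @ g)))

for _ in range(3):         # Poincare inequality on random test functions
    g = rng.standard_normal(n)
    var = float(np.sum(piY * (g - np.sum(piY * g)) ** 2))
    print(dirichlet(g) >= (1.0 - C_pi) * var - 1e-12)
```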
Corollary 1. Along the Sinkhorn dynamics 7 , 8 , \[\begin{align} \label{eq:sinkhorn95epr95poincare} \frac{d}{dt} H(\pi_t^Y|\nu) & \le -(1 - C(\pi_t))\mathrm{Var}_{\pi_t^Y}(\log\frac{d\pi_t^Y}{d\nu}) \end{align}\tag{30}\] with \(C(\pi_t)\) the Poincaré constant in Lemma 1. Furthermore, \(C(\pi_t) < 1\) whenever the regularization parameter \(\varepsilon > 0\).
Proof. Inequality 30 , as well as \(C(\pi) \in [0, 1]\), is an immediate consequence of Theorem 3 and Lemma 1. Now, suppose \(C(\pi) = 1\); then from 29 , there exists some \(f_* \perp \boldsymbol{1}\) (using the same notation as defined there) such that \[\begin{align} \left\lVert Q_\pi f_*\right\rVert_{L^2(\pi)}^2 = \left\lVert f_*\right\rVert_{L^2(\pi^Y)}^2 = \left\lVert f_*\right\rVert_{L^2(\pi)}^2 \iff \left\lVert(I - Q_\pi)f_*\right\rVert_{L^2(\pi)}^2 = 0, \end{align}\] since \(Q_\pi\) is an orthogonal projection in \(L^2(\pi)\). This holds iff \(f_* = Q_\pi f_*\) \(\pi\)-a.e. In other words, the \(X\)-measurable function \(h_*(x) := (Q_\pi f_*)(x)\) is such that \(f_*(y) = h_*(x)\) for \(\pi\)-a.e. \((x, y)\). But, since \(\varepsilon > 0\), the Sinkhorn initial condition 6 satisfies \(\pi_0 > 0\) Lebesgue-a.e.; hence, along the Sinkhorn flow 8 , \(\pi_t > 0\) a.e. Thus, \(f_*\) must be a.e. constant on \(Y\), yet \(f_* \perp \boldsymbol{1}\), so \(f_* = 0\), which is a contradiction. Hence, \(C(\pi) < 1\) if \(\varepsilon > 0\). ◻
Corollary 1 shows the role that positive entropy regularization \(\varepsilon > 0\) plays in convergence of the Sinkhorn flow. Lastly, we now give a sharp condition for exponential entropy decay. Notice that \(\mathcal{E}_\pi(g, g)\) in 28 with argument \(g = \log\frac{d\pi^Y}{d\nu}\) is exactly the “Fisher information” \[\begin{align} \label{eq:fisher} I_\pi(\omega|\nu) &:= \mathcal{E}_\pi(\log\frac{d\omega}{d\nu},\log\frac{d\omega}{d\nu}), \end{align}\tag{31}\] so that \(\frac{d}{dt}H(\pi_t^Y|\nu) = -I_{\pi_t}(\pi_t^Y|\nu)\), which mirrors precisely the formula from the theory of diffusion processes [11]. It follows that the entropy production identity 23 yields exponential entropy decay if and only if a “log-Sobolev” inequality (which is a Polyak-Lojasiewicz inequality for the Lyapunov function \(H\)) holds:
Corollary 2 (Exponential entropy decay in the Sinkhorn flow). If for given \(\mu \in \mathcal{P}(X), \nu \in \mathcal{P}(Y), \lambda > 0\) and all \(\pi \in \Pi(\mu, \cdot)\), the pair \((\pi, \nu)\) satisfies the log-Sobolev inequality (Definition 5) uniformly with rate \(\lambda\), then \[\begin{align} H(\pi_t^Y|\nu) &\le e^{-2\lambda t}H(\pi_0^Y|\nu) \end{align}\] along the continuous-time Sinkhorn flow 7 , 8 .
Proof. By Theorem 3, definitions 27 , 31 , and by hypothesis, \[\begin{align} \frac{d}{dt}H(\pi_t^Y|\nu) &= -\langle \log\frac{d\pi_t^Y}{d\nu},(I-Q_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\rangle_{L^2(\pi_t)}\\ &= -\langle \log\frac{d\pi_t^Y}{d\nu},(I-T_{\pi_t})\log\frac{d\pi_t^Y}{d\nu}\rangle_{L^2(\pi^Y_t)}\\ &= -I_{\pi_t}(\pi_t^Y|\nu)\\ &\le -2\lambda H(\pi_t^Y|\nu) \end{align}\] and the result follows by application of the Grönwall inequality. ◻
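In practice one may track \(H(\pi_t^Y|\nu)\) and the Fisher information 31 along a discretized flow and take the infimum of the ratio \(I_{\pi_t}/(2H)\) over the trajectory as an empirical surrogate for the LSI rate \(\lambda\); the observed decay can then be compared against the corresponding exponential bound (up to \(O(\gamma)\) discretization error). A sketch with illustrative choices:

```python
import numpy as np

# Illustrative grid, marginals, initialization, and Euler-discretized flow.
n = 30
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()

def flow_step(pi, gamma):
    q = pi * (nu / pi.sum(axis=0))[None, :] ** gamma
    return q * (mu / q.sum(axis=1))[:, None]

def H_and_I(pi):                       # H(pi^Y|nu) and the Fisher information 31
    piX, piY = pi.sum(axis=1), pi.sum(axis=0)
    H = float(np.sum(piY * np.log(piY / nu)))
    g = np.tile(np.log(piY / nu), (n, 1))
    Qg = np.tile(((g * pi).sum(axis=1) / piX)[:, None], (1, n))
    return H, float((((g - Qg) ** 2) * pi).sum())

gamma, steps = 1e-2, 2000
H0, _ = H_and_I(pi)
lam_hat, traj = np.inf, []
for _ in range(steps):
    H, I = H_and_I(pi)
    if H > 1e-12:
        lam_hat = min(lam_hat, I / (2.0 * H))   # empirical LSI rate along the path
    traj.append(H)
    pi = flow_step(pi, gamma)

t_final = gamma * (steps - 1)
print("empirical LSI rate lambda_hat =", lam_hat)
print("H at final time               =", traj[-1])
print("exp(-2*lambda_hat*t) * H(0)   =", float(np.exp(-2.0 * lam_hat * t_final) * H0))
```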
Besides the theoretical significance of the above results, let us give for illustration two simple computational use-cases of the Sinkhorn LSI.
Example 1 (Latent space design for generative models). In generative models trained using \({\rm OT}_\varepsilon\)-type losses (e.g. the Sinkhorn divergence, [15]), the choice of the latent space \(\phi(Y)\) for training may be guided by the LSI constant 32 of the pushforward data marginal \(\phi_\#\nu\); larger (uniform) LSI constants yield faster exponential convergence of inner Sinkhorn solves. Similarly, in generative models based upon the Schrödinger bridge (e.g. [3]), a larger LSI constant in the latent space \(\phi(Y)\) can improve training stability and convergence rates.
Example 2 (Adaptive stopping heuristic for discrete Sinkhorn). Practical uses of the Sinkhorn algorithm often run a fixed number \(L\) of iterations; we illustrate how a priori bounds for the entropy production rate can be used to set \(L\). While the entropy production identity 23 holds for the continuous-time Sinkhorn flow, the per-iterate entropy drop is first-order consistent with 23 : \[\begin{align} \frac{H(\pi_{n_{\rm odd}+2}^Y|\nu) - H(\pi_{n_{\rm odd}}^Y|\nu)}{\gamma} &= \frac{d}{dt}H(\pi_t^Y|\nu) + O(\gamma) \end{align}\] for step size \(\gamma > 0\) (where \(n_{\rm odd}, n_{\rm even}\) correspond to alternating steps of the original Sinkhorn algorithm 6 ). If one has an LSI 5 of rate \(\lambda\) for the marginal \(\nu\), then \[\begin{align} H(\pi_{n_{\rm odd}+2k}^Y|\nu) &\le e^{-2\lambda \gamma k}H(\pi_{n_{\rm odd}}^Y|\nu) + O(k\gamma^2) \end{align}\] (which can also be adapted for variable step-sizes). Hence for given tolerance \(\tau > 0\) and error \(H_0\) measured at some \(n_{\rm odd}\), one can plan for \[\begin{align} \label{eq:n95iterates} n &\ge \left\lceil\frac{1}{2\lambda\gamma}\log\frac{H_0}{\tau}\right\rceil \end{align}\tag{33}\] iterates, at which point \(H_0\) is re-measured and checked against the tolerance; if the tolerance is not met, the iteration is restarted with a new estimate of \(n\) from 33 . We note that since the classical Sinkhorn algorithm corresponds to \(\gamma =1\) [7], 33 is merely a heuristic to avoid computing \(H\) on every step. In a variable-step-size Sinkhorn algorithm with \(\gamma \ll 1\) (e.g. Definition 1, [7]), 33 provides a valid estimate.
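A minimal sketch of the resulting planning loop follows; the rate \(\lambda\), tolerance, step size, and all function names are illustrative assumptions rather than components of the algorithms in [7].

```python
import numpy as np
from math import ceil, log

def relative_entropy(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def plan_iterations(H0, tol, lam, gamma):
    """Planning rule 33: iterate count predicted to reach H <= tol under an LSI of rate lam."""
    return max(1, ceil(log(H0 / tol) / (2.0 * lam * gamma)))

def sinkhorn_with_planned_stopping(pi, mu, nu, gamma, lam, tol, max_rounds=20):
    """Run the step-size-gamma iteration, re-measuring H(pi^Y|nu) only at planned checkpoints."""
    for rounds in range(max_rounds):
        H0 = relative_entropy(pi.sum(axis=0), nu)
        if H0 <= tol:
            return pi, rounds
        for _ in range(plan_iterations(H0, tol, lam, gamma)):
            q = pi * (nu / pi.sum(axis=0))[None, :] ** gamma    # relaxed Sinkhorn step
            pi = q * (mu / q.sum(axis=1))[:, None]
    return pi, max_rounds

# Illustrative use on a toy grid; lam is a hypothetical LSI-rate estimate, not a computed one.
n = 30
x = np.linspace(-1.5, 1.5, n); y = np.linspace(-1.5, 1.5, n)
mu = np.exp(-x**2); mu /= mu.sum()
nu = np.exp(-(y - 0.3)**2 / 0.8); nu /= nu.sum()
pi = np.exp(-(x[:, None] - y[None, :])**2 / 0.5) * mu[:, None] * nu[None, :]
pi /= pi.sum()

pi, rounds = sinkhorn_with_planned_stopping(pi, mu, nu, gamma=0.2, lam=0.5, tol=1e-6)
print("H(pi^Y|nu) =", relative_entropy(pi.sum(axis=0), nu), "after", rounds, "planning rounds")
```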