Bregman Douglas-Rachford Splitting Method


Abstract

In this paper, we propose the Bregman Douglas-Rachford splitting (BDRS) method and its variant, the Bregman Peaceman-Rachford splitting method, for solving the maximal monotone inclusion problem. We show that BDRS is equivalent to a Bregman alternating direction method of multipliers (ADMM) when applied to the dual of the problem. A special case of the Bregman ADMM is an alternating direction version of the exponential multiplier method. To the best of our knowledge, the algorithms proposed in this paper are new to the literature. We also discuss how to use our algorithms to solve the discrete optimal transport (OT) problem. We prove the convergence of the algorithms under certain assumptions, though we point out that one assumption does not apply to the OT problem.

1 Introduction↩︎

This paper studies the celebrated Douglas-Rachford splitting method (DRS) that has a long history dating back to 1956 for solving variational problems arising from numerical PDEs [1]. The DRS later became a very popular method for finding zeros of the sum of maximal monotone operators: \[\label{sum-2-operator}\text{Find } x, \;\mathrm{s.t. }, \;0 \in A(x) + B(x),\tag{1}\] where \(A: \mathbb{R}^n \to \mathbb{R}^n\) and \(B: \mathbb{R}^n \to \mathbb{R}^n\) are two maximal monotone operators. The problem 1 is also known as the monotone inclusion problem. The solution set of 1 is denoted as \((A+B)^{-1}(0)\). The DRS has been studied by many researchers under various settings [2][20]. When \(A\) and \(B\) are normal cone operators, the problem 1 reduces to a feasibility problem, which seeks a point in the intersection of two sets. We refer to the recent survey paper [21] for more details on DRS for feasibility problems. The DRS for solving 1 can be written as: \[\tag{2} \begin{align} x^{k} & := J_{\gamma_k B}(z^k) \tag{3}\\ y^k & := J_{\gamma_k A}(2x^k-z^k) \tag{4}\\ z^{k+1} & := z^k - x^k + y^k, \tag{5} \end{align}\] where \(J_{T} := (I+T)^{-1}\) is called the resolvent of the operator \(T\), and \(\gamma_k > 0\) is a parameter.

A well-known application of the DRS is the so-called alternating direction method of multipliers (ADMM), which is usually applied to solving the following convex minimization problem: \[\label{admm-sum-2} \min_{u,v} \;f(u) + g(v), \;\mathrm{s.t. }, \;Mu + Nv = b, \;u\in\mathbb{R}^p, v\in\mathbb{R}^q,\tag{6}\] where \(b\in\mathbb{R}^m\), \(M\in\mathbb{R}^{m\times p}\) and \(N\in\mathbb{R}^{m\times q}\), and \(f\) and \(g\) are both proper, closed and convex functions. The ADMM for solving 6 updates the iterates as follows: \[\tag{7} \begin{align} u^{k+1} & := \mathop{\rm argmin}_u \;\mathcal{L}_\beta(u,v^k;w^k) \tag{8}\\ v^{k+1} & := \mathop{\rm argmin}_v \;\mathcal{L}_\beta(u^{k+1},v;w^k) \tag{9} \\ w^{k+1} & := w^k + \beta(Mu^{k+1} + Nv^{k+1} - b). \tag{10} \end{align}\] Here \[\label{admm-sum-2-AL-Fn} \mathcal{L}_\beta(u,v;w) := f(u) + g(v) + \langle w, Mu + Nv - b\rangle + \frac{\beta}{2}\|Mu + Nv - b\|_2^2,\tag{11}\] is the augmented Lagrangian function for 6 , where \(w\) denotes the Lagrange multiplier, and \(\beta> 0\) is a penalty parameter. As proved by Gabay [4], ADMM 7 for solving 6 is a special case of DRS 2 applied to solving the dual problem of 6 , whose optimality condition is in the form of 1 . More specifically, the dual problem of 6 is given by \[\label{admm-sum-2-dual} \min_x \;f^*(-M^\top x) + g^*(-N^\top x) + b^\top x,\tag{12}\] where \(f^*\) and \(g^*\) are the conjugate functions of \(f\) and \(g\), respectively. The optimality condition of 12 is: \[\label{admm-sum-2-dual-OptCond-0} 0\in -M\partial f^*(-M^\top x) - N \partial g^*(-N^\top x) + b,\tag{13}\] which is in the form of 1 by defining \(A(x) = -M\partial f^*(-M^\top x)\) and \(B(x) = -N \partial g^*(-N^\top x) + b\). Applying DRS 2 to 13 gives the ADMM 7 .

ADMM has received significant attention due to its applications in signal processing, image processing, semidefinite programming and statistics [11], [22][25]. It is not possible to exhaust the vast literature on ADMM, and we thus refer the reader to the following survey papers for more details on the theory and applications of ADMM and its variants [26][29].

The efficiency of the DRS 2 relies on the assumption that both \(J_{\gamma_k A}(z)\) and \(J_{\gamma_k B}(z)\) can be computed easily, and similarly, the efficiency of the ADMM 7 relies on the assumption that the two minimization subproblems are easy to solve. As we will discuss later, in certain applications, these computations are not easy (even when \(M=N=I\)) and more general Bregman distances need to be considered when designing these algorithms. However, to the best of our knowledge, the Bregman DRS (BDRS) has not been considered in the literature before. There exists one work on a Bregman ADMM [30], but that algorithm only applies to a special class of problems and is different from the algorithms that we consider in this paper.

Our contributions. In this paper, we aim to design the BDRS algorithm, and our main contributions are as follows.

(i) We design the first BDRS algorithm in the literature, and analyze its connections with several existing methods. We also propose a Bregman Peaceman-Rachford splitting (BPRS) method, which is a close variant of BDRS. We show that BDRS is equivalent to a Bregman ADMM algorithm when applied to its dual.

(ii) We show that if the Bregman distance is generated by the Boltzmann-Shannon entropy, then when applied to the dual problem of linear inequality constrained convex programming problem, our BDRS gives an alternating direction version of the exponential multiplier method. We name this algorithm ADEMM, and this is also a new algorithm, to the best of our knowledge.

(iii) We discuss how to use our BDRS and ADEMM to solve the discrete optimal transport (OT) problem, and discuss how they relate to, and why they can be better than, Sinkhorn’s algorithm.

(iv) We prove the convergence of BDRS under certain assumptions, though we want to point out that one of the assumptions does not apply to the OT problem.

Organization. The rest of this paper is organized as follows. In Section 2 we provide some preliminaries on Bregman distance and algorithms based on it. In Section 3, we propose our BDRS and ADEMM algorithms and discuss their connections with existing algorithms. In Section 4, we propose the BPRS algorithm. In Section 5, we discuss how to apply our proposed algorithms to solving the discrete optimal transport problem. We draw some conclusions in Section 6. In the appendix, we provide the convergence analysis for BPRS and BDRS, both under certain assumptions.

2 Preliminaries and Existing Bregman Algorithms↩︎

In this section, we briefly review the basics of Bregman distance and existing algorithms that use Bregman distance. The Bregman distance was first proposed by Bregman [31] in a primal-dual method for solving linearly constrained convex programming problems that involves non-orthogonal projections onto hyperplanes. This method was further studied by Censor and Lent [32], and De Pierro and Iusem [33].

We now introduce the notions of Legendre function and Bregman distance.

Definition 1 ([34]). A function \(h\) is called a Legendre function, if it is proper, lower semicontinuous, strictly convex and essentially smooth.

A Legendre function enjoys the following two useful properties:

(i) \(h\) is Legendre if and only if its conjugate \(h^*\) is Legendre.

(ii) The gradient of a Legendre function \(h\) is a bijection from \(\mathrm{int\ dom}h\) to \(\mathrm{int\ dom}h^*\), and its inverse is the gradient of the conjugate, that is, we have \((\nabla h)^{-1} = \nabla h^*\).

Definition 2. For a Legendre function \(h\), the Bregman distance corresponding to \(h\) is defined as \[\label{def:Bregman-dist} D_h(x,y) := h(x) - h(y) - \langle \nabla h(y), x-y\rangle.\qquad{(1)}\]

The following Bregman distances are commonly seen in practice (for more examples, see [35]).

Example 1.

(i) Energy: If \(h(x) = \frac{1}{2}\|x\|_2^2\), then \(D_h(x,y) = \frac{1}{2}\|x-y\|_2^2\) is half the squared Euclidean distance.

(ii) Quadratic: If \(h(x) = \frac{1}{2}x^\top Lx\) with matrix \(L \succ 0\), then \(D_h(x,y) = \frac{1}{2}\|x-y\|_L^2 = \frac{1}{2}(x-y)^\top L (x-y)\).

(iii) Boltzmann-Shannon entropy: If \(h(x)\) is the Boltzmann-Shannon entropy function defined as \(h(x) = \sum_i x_i (\log x_i-1)\), then \(D_h(x,y) = \sum_i \left(x_i\log \frac{x_i}{y_i}-x_i+y_i\right)\) is the Kullback-Leibler (KL) divergence. Note that the domain of \(h\) is \(\textrm{dom } h = \mathbb{R}_{++}^n = \{x\mid x>0\}\). Moreover, \(h^*(x) = \sum_i e^{x_i}\).

(iv) Burg’s entropy: If \(h(x) = -\sum_i \log x_i\), then \(D_h(x,y) = \sum_i \left(-\log\frac{x_i}{y_i}+\frac{x_i}{y_i}-1\right)\). Note that the domain of \(h\) is \(\textrm{dom } h = \mathbb{R}_{++}^n\).
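As a quick numerical illustration of Definition 2 and Example 1, the following minimal NumPy sketch (the helper names are our own) evaluates \(D_h\) directly from its definition for the energy and Boltzmann-Shannon choices of \(h\) and compares the results against the closed forms listed above.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Energy: h(x) = 0.5 ||x||_2^2  ->  D_h(x, y) = 0.5 ||x - y||_2^2.
h_energy = lambda x: 0.5 * np.dot(x, x)
grad_energy = lambda x: x

# Boltzmann-Shannon entropy: h(x) = sum_i x_i (log x_i - 1)  ->  KL divergence.
h_entropy = lambda x: np.sum(x * (np.log(x) - 1.0))
grad_entropy = lambda x: np.log(x)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])
print(bregman(h_energy, grad_energy, x, y), 0.5 * np.sum((x - y) ** 2))
print(bregman(h_entropy, grad_entropy, x, y),
      np.sum(x * np.log(x / y) - x + y))  # Kullback-Leibler divergence
```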

We now define a few useful Bregman operators.

Definition 3 (Bregman forward operator, Bregman resolvent operator, Bregman reflection operator, Bregman Mann’s operator). Let \(h\) denote a Legendre function. The Bregman forward operator for a single-valued operator \(T\) is defined as \[\label{def-Bregman-forward-operator} F_T^h := \nabla h^*\circ (\nabla h - T).\qquad{(2)}\] The Bregman resolvent operator of a maximal monotone operator \(T\) is defined as [36] \[\label{def-Bregman-resolvent} J_T^h := (\nabla h + T)^{-1}\circ \nabla h.\qquad{(3)}\] The Bregman reflection operator of a maximal monotone operator \(T\) is defined as \[\label{def-reflection-op} R_T^h := \nabla h^*\circ (2\nabla h\circ J_T^h - \nabla h),\qquad{(4)}\] and the Bregman Mann’s operator of a maximal monotone operator \(T\) is defined as \[\label{def-Mann} M_\alpha^h(T) := \nabla h^*\circ(\alpha \nabla h + (1-\alpha)\nabla h\circ T).\qquad{(5)}\]

The notion of Bregman resolvent operator ?? was first proposed by Eckstein in [36]. When \(h(\cdot) = \frac{1}{2}\|\cdot\|_2^2\), \(J_T^h\) reduces to \(J_T := (I+T)^{-1}\), because \(\nabla h = I\). The definitions of the Bregman reflection operator ?? and the Bregman Mann’s operator ?? are, to the best of our knowledge, new to the literature.

We now review several classes of Bregman algorithms for both monotone inclusion and convex optimization problems.

2.1 Bregman Gradient Method and Mirror Descent Method↩︎

For the convex minimization problem \[\label{prob-uncon}\min_{x\in \mathcal{X}} \;f(x)\tag{14}\] where \(f\) is convex and smooth and \(\mathcal{X}\subseteq \mathbb{R}^n\) is a convex set, Nemirovski and Yudin [37] proposed the mirror descent algorithm, which iterates as \[\label{BGM-alg-explicit} x^{k+1} := \nabla h^*(\nabla h(x^k) - \gamma_k \nabla f(x^k)) \equiv F_{\gamma_k\nabla f}^h(x^k),\tag{15}\] where \(\gamma_k>0\) is the step size, and \(h\) is a Legendre function. It was later shown by Beck and Teboulle [38] that the mirror descent algorithm 15 can be interpreted as a projected gradient method with a Bregman distance, which can be described as follows: \[\label{BGM-alg} x^{k+1} := \mathop{\rm argmin}_{x\in \mathcal{X}} \;\langle \nabla f(x^k), x-x^k\rangle + \frac{1}{\gamma_k}D_h(x,x^k).\tag{16}\] Here we discuss an important setting where \(\mathcal{X}\) is the probability simplex \(\mathcal{X}:=\{x\in\mathbb{R}^n\mid \sum_i x_i = 1, x\geq 0\}\). In this case, if one applies the projected gradient method using the Euclidean distance to solve 14 , then each iteration requires a projection onto \(\mathcal{X}\). If one uses the Bregman distance generated by the Boltzmann-Shannon entropy, i.e., Example 1 (iii), then the solution of 16 is given by a simple normalization \[x^{k+1} := \frac{y^k}{\|y^k\|_1}, \quad \text{with} \quad y^k := x^k \circ e^{-\gamma_k\nabla f(x^k)},\] which is easier to compute than the Euclidean projection onto \(\mathcal{X}\). Here for vectors \(a\) and \(b\), \(a\circ b\) is the componentwise multiplication, and \(e^{a}\) is the componentwise exponential function.
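To illustrate the entropic update above, here is a minimal NumPy sketch of the mirror descent iteration 16 over the probability simplex; the toy quadratic objective, the step size, and all names are our own choices.

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, step, iters):
    """Mirror descent (15)/(16) with h = Boltzmann-Shannon entropy:
    y = x * exp(-step * grad f(x)), then normalize back onto the simplex."""
    x = x0.copy()
    for _ in range(iters):
        y = x * np.exp(-step * grad_f(x))
        x = y / y.sum()
    return x

# Toy example: minimize f(x) = 0.5 * ||x - t||_2^2 over the simplex; since t
# already lies on the simplex, the iterates should approach t.
t = np.array([0.7, 0.2, 0.1])
grad_f = lambda x: x - t
x0 = np.ones(3) / 3
print(mirror_descent_simplex(grad_f, x0, step=0.5, iters=200))
```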

2.2 Bregman Proximal Gradient Method and Bregman Forward-Backward Splitting↩︎

For the composite convex minimization problem \[\label{prob-composite} \min_{x} \; f(x) + g(x)\tag{17}\] where \(f\) is convex and smooth and \(g\) is convex and possibly nonsmooth, the Bregman proximal gradient method was studied in [39][41], which iterates as: \[\label{BPGM-alg} \begin{align} x^{k+1} & := \mathop{\rm argmin}_x \;\langle \nabla f(x^k), x-x^k\rangle + g(x) + \frac{1}{\gamma_k}D_h(x,x^k) \\ & \equiv \mathrm{prox}_{\gamma_kg}^h(\nabla h^*(\nabla h(x^k) - \gamma_k\nabla f(x^k))) \equiv J_{\gamma_k\partial g}^h \circ F_{\gamma_k\nabla f}^h (x^k), \end{align}\tag{18}\] where \[\label{def-Bregman-Prox-map} \mathrm{prox}_g^h(z) := \mathop{\rm argmin}_x \;g(x) + D_h(x,z) \equiv J_{\partial g}^h(z),\tag{19}\] is called the Bregman proximal map of \(g\) with respect to \(h\). An interesting question is how to accelerate the Bregman proximal gradient method using Nesterov’s acceleration techniques [42][45]. When \(f\) has a globally Lipschitz gradient and \(h\) is strongly convex, faster algorithms have been given by Auslender and Teboulle [46], and Tseng [41]. However, when these assumptions are weakened, this problem has not been fully addressed in the literature. We refer to [47], [48] for some recent progress on this topic and to the recent paper by Teboulle [49] for a more detailed discussion of the Bregman proximal gradient method.

When it comes to the monotone inclusion problem 1 with \(B\) being single-valued, the Bregman forward-backward splitting method has been studied in the literature; it iterates as \[\label{BFBS} x^{k+1} := (\nabla h + \gamma_k A)^{-1}(\nabla h(x^k) - \gamma_k B(x^k)) \equiv J_{\gamma_k A}^h\circ F_{\gamma_k B}^h(x^k).\tag{20}\] We refer to the recent paper by Bùi and Combettes [50] and references therein for more discussions on this method.

2.3 Bregman Proximal Point Method and Bregman Augmented Lagrangian method↩︎

Another widely used Bregman algorithm is the Bregman proximal point method (PPM). The idea of Bregman PPM can be traced back to [51][53]. The Bregman PPM in its current form was proposed by Censor and Zenios [54] for convex minimization problems and by Eckstein [36] for monotone inclusion problems. This method was further studied in [55][62]. For the maximal monotone inclusion problem \(0\in T(x)\), the Bregman PPM iterates as \[\label{Bregman-PPM-monotone-inclusion} x^{k+1} := J_{\gamma_kT}^h(x^k) = (\nabla h + \gamma_k T)^{-1}\circ \nabla h(x^k).\tag{21}\] For the convex minimization problem 14 with \(f\) being nonsmooth, the Bregman PPM reduces to \[\label{Bregman-PPM-convex} x^{k+1} := J_{\gamma_k\partial f}^h(x^k) = \mathop{\rm argmin}_x \;f(x) + \frac{1}{\gamma_k} D_h(x,x^k).\tag{22}\] This algorithm generalizes the PPM in Euclidean space [63][66] to non-quadratic distances. Since it is usually difficult to solve the subproblems in 21 and 22 exactly, inexact Bregman PPMs have been studied in the literature [67][69]. Moreover, the exponential multiplier method proposed by Kort and Bertsekas [70], [71] is known to be a special case of Bregman PPM with a specifically chosen \(h\). We will discuss this method in more detail in Section 3.3. Furthermore, the nonlinear rescaling method developed by Polyak [72][75] is also known to be equivalent to a Bregman PPM with suitably chosen distance and nonlinear penalty functions, as proved by Polyak and Teboulle [76].

Similar to the connection between the PPM and the augmented Lagrangian method (ALM) in the Euclidean case, there is a connection between the Bregman PPM and the Bregman ALM, as illustrated by Eckstein [36]. Here we use the following convex optimization problem with linear equality constraints to illustrate the idea: \[\label{prob-convex-linear-equality} \min_x \;f(x), \;\mathrm{s.t. }, \;Mx = b, x\in \mathcal{X}.\tag{23}\] It can be shown that the Bregman PPM for solving the dual problem of 23 is equivalent to the following Bregman ALM for solving 23 : \[\label{Bregman-ALM} \begin{align} x^{k+1} & \in \mathop{\rm argmin}_{x\in\mathcal{X}} \;f(x) + \frac{1}{\gamma_k}h^*(\nabla h(\lambda^k) + \gamma_k(Mx-b)) \\ \lambda^{k+1} & := \nabla h^*(\nabla h(\lambda^k)+\gamma_k(Mx^{k+1}-b)). \end{align}\tag{24}\] One can see that unlike the classical ALM with a quadratic penalty function, the Bregman ALM 24 adopts a non-quadratic penalty function \(h^*\). A classical reference on the Bregman ALM is due to Bertsekas [77], and a more recent survey is due to Iusem [66]. Some very recent works on this topic include [78], [79].
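To make 24 concrete, the following sketch instantiates the Bregman ALM with the energy function \(h = \frac{1}{2}\|\cdot\|_2^2\) of Example 1 (i), so that \(\nabla h = \nabla h^* = I\) and \(h^* = \frac{1}{2}\|\cdot\|_2^2\) and 24 recovers the classical ALM; the toy equality-constrained problem and all names are our own choices.

```python
import numpy as np

def bregman_alm_quadratic(t, M, b, gamma=1.0, iters=100):
    """Bregman ALM (24) with h = 0.5*||.||_2^2, applied to
    min 0.5*||x - t||^2  s.t.  Mx = b.
    The x-subproblem is the linear system (I + gamma*M'M) x = t - M'lam + gamma*M'b,
    and the multiplier update reduces to lam <- lam + gamma*(Mx - b)."""
    n = t.size
    lam = np.zeros(M.shape[0])
    x = t.copy()
    for _ in range(iters):
        x = np.linalg.solve(np.eye(n) + gamma * M.T @ M,
                            t - M.T @ lam + gamma * M.T @ b)
        lam = lam + gamma * (M @ x - b)
    return x, lam

t = np.array([1.0, -2.0, 0.5])
M = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
x, lam = bregman_alm_quadratic(t, M, b)
print(x, M @ x - b)  # x ~ projection of t onto {x : Mx = b}; residual ~ 0
```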

2.4 Bregman ADMM↩︎

A natural extension of the ADMM in Euclidean space is the Bregman ADMM. A Bregman ADMM was studied by Wang and Banerjee [30], in which the authors aimed at solving 6 using the following Bregman ADMM: \[\tag{25} \begin{align} u^{k+1} & := \mathop{\rm argmin}_u \;f(u) + \langle w^k, Mu+Nv^k-b\rangle + \gamma D_h(b-Mu, Nv^k) \tag{26}\\ v^{k+1} & := \mathop{\rm argmin}_v \;g(v) + \langle w^k, Mu^{k+1}+Nv-b\rangle + \gamma D_h(Nv,b-Mu^{k+1}) \tag{27} \\ w^{k+1} & := w^k + \gamma(Mu^{k+1} + Nv^{k+1} - b). \tag{28} \end{align}\] Note that this algorithm requires \(Nv^k\) and \(b-Mu^{k+1}\) to lie in the domain of \(h\), which is very difficult to guarantee for many useful Bregman distances such as the ones generated by the Boltzmann-Shannon entropy and Burg’s entropy. Thus the applicability of 25 is limited.

3 Bregman Douglas-Rachford Splitting Method↩︎

In this section, we introduce our BDRS for solving 1 . A typical iteration of our BDRS for solving 1 is as follows: \[\label{BDRS-alg} \begin{align} x^k & := J_{\gamma_k B}^h (z^k) \\ y^k & := J_{\gamma_k A}^h\circ \nabla h^*(2\nabla h(x^k)-\nabla h(z^k)) \\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - \nabla h(x^k) + \nabla h(y^k)), \end{align}\tag{29}\] where \(\gamma_k>0\) is a parameter. We now provide some explanations of our BDRS 29 . Note that the Bregman resolvent operators are always applied to points in the primal space. For points in the mirror space, we always use \(\nabla h^*\) to transform them back to the primal space. Note that 29 can be written equivalently as \[\label{BDRS-alg-compact} \nabla h(z^{k+1}) = \nabla h(z^k) + \nabla h\circ J_{\gamma_k A}^h\circ \nabla h^*(2\nabla h\circ J_{\gamma_k B}^h(z^k)-\nabla h(z^k)) - \nabla h\circ J_{\gamma_k B}^h(z^k).\tag{30}\] Using the Bregman reflection operator ?? and the Bregman Mann’s operator ?? , the BDRS 30 can be written more compactly as \[\label{BDRS-alg-Mann} z^{k+1} := M_{\frac{1}{2}}^h(R_{\gamma_k A}^hR_{\gamma_k B}^h) (z^k).\tag{31}\] We notice that when \(h(\cdot) = \frac{1}{2}\|\cdot\|_2^2\), all three forms of BDRS, i.e., 29 , 30 , and 31 , reduce exactly to their Euclidean counterparts.
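The following minimal sketch implements the loop 29 against abstract oracles: the Bregman resolvents \(J_{\gamma A}^h\), \(J_{\gamma B}^h\) and the mirror maps \(\nabla h\), \(\nabla h^*\) are passed in as callables. For a runnable instance we take the Euclidean \(h\), in which case 29 reduces to the classical DRS 2 ; the toy problem \(\min_x \|x\|_1 + \frac{1}{2}\|x-t\|_2^2\) and all names are our own choices.

```python
import numpy as np

def bdrs(J_gA, J_gB, grad_h, grad_h_star, z0, iters=200):
    """Bregman Douglas-Rachford splitting (29), written against oracle callables."""
    z = z0.copy()
    for _ in range(iters):
        x = J_gB(z)
        y = J_gA(grad_h_star(2.0 * grad_h(x) - grad_h(z)))
        z = grad_h_star(grad_h(z) - grad_h(x) + grad_h(y))
    return J_gB(z)

# Euclidean instance (h = 0.5*||.||_2^2, so grad_h = grad_h* = identity):
# A = subdifferential of ||.||_1 (resolvent = soft-thresholding) and
# B = gradient of 0.5*||. - t||_2^2 (affine resolvent).
gamma = 1.0
t = np.array([2.0, 0.3, -1.5])
J_gA = lambda z: np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)  # prox of gamma*||.||_1
J_gB = lambda z: (z + gamma * t) / (1.0 + gamma)                  # prox of gamma*0.5*||.-t||^2
identity = lambda z: z

print(bdrs(J_gA, J_gB, identity, identity, np.zeros(3)))  # soft-threshold of t: [1, 0, -0.5]
```

Swapping the two identity maps for componentwise \(\log\) and \(\exp\), and computing the resolvents with respect to the entropy instead, yields the entropic instances that appear in Sections 3.3 and 5.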

3.1 Application to Convex Minimization: Bregman ADMM↩︎

First, we note that if one applies the Bregman ALM to solve the convex optimization problem with linear equality constraints 6 , then it should iterate as follows: \[\label{Bregman-ALM-2-block} \begin{align} (u^k,v^k) & := \mathop{\rm argmin}_{u,v} \;f(u) + g(v)+ \frac{1}{\gamma_k}h^*(\nabla h(w^k) + \gamma_k(Mu+Nv-b)) \\ w^{k+1} & := \nabla h^*(\nabla h(w^k)+\gamma_k(Mu^k+Nv^k-b)). \end{align}\tag{32}\] As a result, it is easy to see that the Bregman ADMM for solving 6 is given by the following updates: \[\label{Bregman-ADMM} \begin{align} u^k & := \mathop{\rm argmin}_{u} \;f(u) + \frac{1}{\gamma_k}h^*(\nabla h(w^k) + \gamma_k(Mu+Nv^{k-1}-b)) \\ v^k & := \mathop{\rm argmin}_{v} \;g(v) + \frac{1}{\gamma_k}h^*(\nabla h(w^k) + \gamma_k(Mu^k+Nv-b)) \\ w^{k+1} & := \nabla h^*(\nabla h(w^k)+\gamma_k(Mu^k+Nv^k-b)), \end{align}\tag{33}\] where we alternatingly update \(u^k\) and \(v^k\) in the Bregman ALM, with the other variable fixed. To the best of our knowledge, both BDRS 29 and Bregman ADMM 33 are new to the literature. In the following, we show that the Bregman ADMM 33 is actually a direct application of BDRS 29 to the dual of 6 , which is given by \[\label{pfnyhbdm} \begin{align} \max_w\min_{u,v} f(u) + g(v) + \langle w, Mu+Nv-b\rangle & \equiv \max_w -f^*(-M^\top w) - g^*(-N^\top w) - \langle b,w\rangle \\ & \equiv \min_w f^*(-M^\top w) + g^*(-N^\top w) + \langle b,w\rangle, \end{align}\tag{34}\] where \(w\) is the dual variable (Lagrange multiplier), and \(f^*\) and \(g^*\) are the conjugate functions of \(f\) and \(g\), respectively. The optimality condition of 12 is given by \[\label{admm-sum-2-dual-OptCond} 0 \in -M \partial f^*(-M^\top w) - N\partial g^*(-N^\top w) + b,\tag{35}\] which is in the form of 1 with \(A(w) = -M \partial f^*(-M^\top w)\) and \(B(w) = -N\partial g^*(-N^\top w) + b\).

Theorem 1. The BDRS 29 for solving the dual problem 35 is equivalent to the Bregman ADMM 33 for solving the primal problem 6 .

Proof. With \(A(w) = - M \partial f^*(-M^\top w)\) and \(B(w) = - N\partial g^*(-N^\top w) + b\), the BDRS for solving 35 can be written as \[\tag{36} \begin{align} x^k & := \mathop{\rm argmin}_x \;g^*(-N^\top x) + b^\top x + \frac{1}{\gamma_k} D_h(x,z^k) \tag{37}\\ y^k & := \mathop{\rm argmin}_y \;f^*(-M^\top y) +\frac{1}{\gamma_k} D_h(y, \nabla h^*(2\nabla h(x^k)-\nabla h(z^k))) \tag{38}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - \nabla h(x^k) + \nabla h(y^k)). \tag{39} \end{align}\] By using \(v\) to denote the dual variable for 37 , we obtain that 37 is equivalent to \[\tag{40} \begin{align} & \min_x \max_v \;\langle -N^\top x, v\rangle - g(v) + b^\top x + \frac{1}{\gamma_k}D_h(x,z^k) \\ = & \max_v \min_x \;\langle -N^\top x, v\rangle - g(v) + b^\top x + \frac{1}{\gamma_k}D_h(x,z^k) \\ \equiv & \max_v \min_x \;h(x) - \langle \nabla h(z^k)+\gamma_k(Nv-b), x\rangle -\gamma_k g(v) - h(z^k) + \langle \nabla h(z^k), z^k \rangle \\ = & \max_v -\{\max_x -h(x) + \langle \nabla h(z^k)+\gamma_k(Nv-b), x\rangle +\gamma_k g(v) + h(z^k) - \langle \nabla h(z^k), z^k \rangle\} \tag{41}\\ = & \max_v -\{h^*(\nabla h(z^k)+\gamma_k(Nv-b)) +\gamma_k g(v) + h(z^k) - \langle \nabla h(z^k), z^k \rangle\} \\ \equiv & \min_v h^*(\nabla h(z^k)+\gamma_k(Nv-b)) +\gamma_k g(v). \end{align}\] Moreover, note that the optimal \(x\) in 41 is given by: \[\label{BDRS-admm-sum-2-dual-OptCond-1-equiv-4-OptCond} x := \nabla h^*(\gamma_k(Nv-b)+\nabla h(z^k)).\tag{42}\] Similarly, by using \(u\) to denote the dual variable for 38 , we obtain that 38 is equivalent to (for ease of presentation, denote \(\tilde{y}^k:=\nabla h^*(2\nabla h(x^k)-\nabla h(z^k))\)): \[\tag{43} \begin{align} & \min_y \max_u \;\langle -M^\top y, u\rangle - f(u) + \frac{1}{\gamma_k}D_h(y,\tilde{y}^k) \\ = & \max_u \min_y \;\langle -M^\top y, u\rangle - f(u) + \frac{1}{\gamma_k} D_h(y,\tilde{y}^k) \\ \equiv & \max_u \min_y \;h(y) - \langle \nabla h(\tilde{y}^k)+\gamma_k Mu, y\rangle -\gamma_k f(u) - h(\tilde{y}^k) + \langle \nabla h(\tilde{y}^k), \tilde{y}^k \rangle \\ = & \max_u -\{\max_y -h(y) + \langle \nabla h(\tilde{y}^k)+\gamma_k Mu, y\rangle +\gamma_k f(u) + h(\tilde{y}^k) - \langle \nabla h(\tilde{y}^k), \tilde{y}^k \rangle\} \tag{44}\\ = & \max_u -\{h^*(\nabla h(\tilde{y}^k)+\gamma_k Mu) +\gamma_k f(u) + h(\tilde{y}^k) - \langle \nabla h(\tilde{y}^k), \tilde{y}^k \rangle\} \\ \equiv & \min_u h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu) +\gamma_k f(u). 
\end{align}\] Moreover, the optimal \(y\) in 44 is given by \[\label{BDRS-admm-sum-2-dual-OptCond-2-equiv-4-OptCond} y := \nabla h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu).\tag{45}\] By combining 40 , 42 , 43 and 45 , we have that the BDRS 36 is equivalent to \[\tag{46} \begin{align} v^k & := \mathop{\rm argmin}_v h^*(\nabla h(z^k)+\gamma_k(Nv-b)) +\gamma_k g(v) \\ x^k & := \nabla h^*(\gamma_k(Nv^k-b)+\nabla h(z^k)) \tag{47}\\ u^k & := \mathop{\rm argmin}_u h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu) +\gamma_k f(u) \\ y^k & := \nabla h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu^k) \tag{48}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - \nabla h(x^k) + \nabla h(y^k)).\tag{49} \end{align}\] Reordering the updates in 46 , we know that 46 is equivalent to \[\tag{50} \begin{align} u^k & := \mathop{\rm argmin}_u h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu) +\gamma_k f(u) \tag{51}\\ y^k & := \nabla h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu^k) \tag{52}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - \nabla h(x^k) + \nabla h(y^k)) \tag{53}\\ v^{k+1} & := \mathop{\rm argmin}_v h^*(\nabla h(z^{k+1})+\gamma_k(Nv-b)) +\gamma_k g(v) \tag{54}\\ x^{k+1} & := \nabla h^*(\gamma_k(Nv^{k+1}-b)+\nabla h(z^{k+1})). \tag{55} \end{align}\] Combining 52 and 53 yields \[\label{reorder-relation-1} \nabla h(z^{k+1}) = \nabla h(z^k) - \nabla h(x^k) + 2\nabla h(x^k) - \nabla h(z^k) + \gamma_k Mu^k = \nabla h(x^k) +\gamma_k Mu^k.\tag{56}\] From 55 we have \[\label{reorder-relation-2} \nabla h(z^{k+1}) = \nabla h(x^{k+1}) - \gamma_k (Nv^{k+1}-b),\tag{57}\] which implies \[\label{reorder-relation-3} 2\nabla h(x^{k+1}) - \nabla h(z^{k+1}) = \nabla h(x^{k+1}) +\gamma_k (Nv^{k+1}-b),\tag{58}\] and \[\label{reorder-relation-4} \nabla h(x^{k+1}) = \nabla h(x^k) +\gamma_k (Mu^k+Nv^{k+1}-b).\tag{59}\] By substituting 58 into 51 , 56 into 54 , and combining with 59 , we know that 50 is equivalent to \[\label{BDRS-admm-sum-2-dual-OptCond-rewrite-uv-reorder-simplified} \begin{align} u^k & := \mathop{\rm argmin}_u h^*(\nabla h(x^k) + \gamma_k(Nv^k-b+Mu)) +\gamma_k f(u) \\ v^{k+1} & := \mathop{\rm argmin}_v h^*(\nabla h(x^k) + \gamma_k(Mu^k+Nv-b)) +\gamma_k g(v) \\ x^{k+1} & := \nabla h^*(\nabla h(x^k) + \gamma_k(Mu^k+Nv^{k+1}-b)). \end{align}\tag{60}\] This is exactly the same as the Bregman ADMM 33 . ◻

3.2 Connection to Variable Metric ADMM↩︎

A variable metric ADMM has been studied in [80], [81], which solves the convex minimization problem 6 using the following updates: \[\label{BDRS-alg-L-equiv-dual-switch-uv} \begin{align} u^k & := \mathop{\rm argmin}_u \;f(u) + \frac{1}{2\gamma_k}\|\gamma_k(Mu + Nv^{k-1}-b) + Lw^k\|_{L^{-1}}^2 \\ v^k & := \mathop{\rm argmin}_v \;g(v) + \frac{1}{2\gamma_k}\|\gamma_k(Mu^k+Nv-b) +Lw^k\|_{L^{-1}}^2 \\ w^{k+1} & := w^k + \gamma_kL^{-1}(Mu^k + Nv^k-b), \end{align}\tag{61}\] where \(L \succ 0\) is a positive definite matrix, and \(\|x\|_L^2 := x^\top Lx\). This algorithm is also known as applying ADMM to 6 with preconditioned constraints [82]. We now show that this variable metric ADMM is a special case of our Bregman ADMM 33 with a special choice of \(h(x) = \frac{1}{2}\|x\|_L^2 := \frac{1}{2}x^\top L x\), i.e., the quadratic function in Example 1 (ii). In this case, we have \[\label{VM-ADMM-h} \nabla h(x) = Lx, \quad h^*(x) = \frac{1}{2}\|x\|_{L^{-1}}^2 = \frac{1}{2}x^\top L^{-1}x, \quad \nabla h^*(x) = L^{-1}x.\tag{62}\]

Theorem 2. The variable metric ADMM 61 is a special case of the Bregman ADMM 33 with \(h\) given in Example 1 (ii).

Proof. From 62 , we know that the Bregman ADMM 33 can be written as \[\label{Bregman-ADMM-VM} \begin{align} u^k & := \mathop{\rm argmin}_{u} \;f(u) + \frac{1}{2\gamma_k}\|Lw^k + \gamma_k(Mu+Nv^{k-1}-b)\|_{L^{-1}}^2 \\ v^k & := \mathop{\rm argmin}_{v} \;g(v) + \frac{1}{2\gamma_k}\|Lw^k + \gamma_k(Mu^k+Nv-b)\|_{L^{-1}}^2 \\ w^{k+1} & := L^{-1}(Lw^k+\gamma_k(Mu^k+Nv^k-b)), \end{align}\tag{63}\] which is the same as the variable metric ADMM 61 . ◻

3.3 Alternating Direction Exponential Multiplier Method↩︎

In this section, we propose a new algorithm, which is a special case of BDRS, for solving the following linear inequality constrained convex minimization problem: \[\label{admm-sum-2-inequality-constraint} \min_{u,v} \;f(u) + g(v), \;\mathrm{s.t. }, \;M u+ N v-b \leq 0, \;u\in\mathbb{R}^p, v\in\mathbb{R}^q,\tag{64}\] where \(f\) and \(g\) are both proper, closed and convex functions, and \(M\in\mathbb{R}^{m\times p}\), \(N\in\mathbb{R}^{m\times q}\), \(b\in \mathbb{R}^m\).

One important approach for solving 64 is the exponential multiplier method (EMM) that was proposed and studied by Bertsekas, Kort and Tseng [70], [71], [77]. Unlike the usual augmented Lagrangian method that uses a quadratic penalty term, the EMM uses a non-quadratic penalty term. More specifically, the EMM uses the exponential penalty function given by: \[\label{def-psi} \psi(t) = e^t - 1.\tag{65}\] By associating a Lagrange multiplier \(w_j\) to the \(j\)-th constraint in 64 , the EMM for solving 64 iterates as follows:

\[\tag{66} \begin{align} (u^k,v^k) & := \mathop{\rm argmin}_{u,v} f(u) + g(v) + \frac{1}{\gamma_k}\sum_{j=1}^m w_j^k\psi(\gamma_k(M_j^\top u+N_j^\top v-b_j)) \tag{67} \\ w_j^{k+1} & := w_j^k\nabla\psi(\gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)) = w_j^k e^{\gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)}, \;j=1,\ldots,m,\tag{68} \end{align}\] where \(\gamma_k > 0\) is a parameter, and \(M_j^\top\) denotes the \(j\)-th row of \(M\) and \(N_j^\top\) denotes the \(j\)-th row of \(N\). It can be shown that the EMM 66 is equivalent to a Bregman proximal point algorithm for solving the dual of 64 . Note that the dual of 64 is given by \[\label{admm-sum-2-inequality-constraint-dual} \max \;d(w), \;\mathrm{s.t. }, \;w\geq 0,\tag{69}\] where the dual function \(d(w)\) is defined as \[d(w) := \min_{u,v} \left\{f(u)+g(v) + \sum_{j=1}^m w_j (M_j^\top u+N_j^\top v-b_j)\right\}.\] It can be shown that the EMM 66 for solving the primal problem 64 is equivalent to the following algorithm for solving the dual problem 69 : \[\label{EMM-alg-dual-BPPA} w^{k+1} := \mathop{\rm argmax}_{w\geq 0} \;\left\{d(w) - \frac{1}{\gamma_k}\sum_{j=1}^m w_j^k\psi^*\left(\frac{w_j}{w_j^k}\right)\right\},\tag{70}\] where \(\psi^*\) is the conjugate function of \(\psi\), which is the entropy function \[\label{def-psi-conj} \psi^*(s) = s\log s - s + 1.\tag{71}\] Note that 70 is indeed a Bregman proximal point algorithm, because \(\sum_{j=1}^m{w_j^k}\psi^*\left(\frac{w_j}{w_j^k}\right) = D_h(w,w^k)\) with \(h\) being the Boltzmann-Shannon entropy defined in Example 1 (iii).

Note that for some applications, the subproblem 67 in EMM can be difficult to solve. To overcome this difficulty, we propose an alternating direction exponential multiplier method (ADEMM) for solving 64 . The ADEMM iterates as follows: \[\tag{72} \begin{align} u^k & := \mathop{\rm argmin}_{u} f(u) + \frac{1}{\gamma_k}\sum_{j=1}^m w_j^k\psi(\gamma_k(M_j^\top u+N_j^\top v^{k-1}-b_j)) \tag{73} \\ v^k & := \mathop{\rm argmin}_{v} g(v) + \frac{1}{\gamma_k}\sum_{j=1}^m w_j^k\psi(\gamma_k(M_j^\top u^k+N_j^\top v-b_j)) \tag{74} \\ w_j^{k+1} & := w_j^k\nabla\psi(\gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)) = w_j^k e^{\gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)}, \;j=1,\ldots,m.\tag{75} \end{align}\] That is, the ADEMM alternatingly minimizes the Lagrangian function with respect to \(u\) and \(v\) in each iteration. This is exactly in the same spirit of the usual ADMM in Euclidean space with a quadratic penalty. To the best of our knowledge, the ADEMM 72 is new to the literature: it is the first alternating direction version of EMM. For some applications, both subproblems 73 and 74 are easier to solve than 67 . This is indeed the case as we will see later in the discrete optimal transport problem in Section 5.
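As a sanity check of 72 , the following sketch runs ADEMM on a toy instance of 64 with \(f(u)=\frac{1}{2}\|u-a\|_2^2\), \(g(v)=\frac{1}{2}\|v-c\|_2^2\) and \(M=N=I\); the subproblems 73 and 74 are then smooth and unconstrained, so a generic solver suffices. The data, the initialization, and all names are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

# ADEMM (72) on:  min 0.5*||u - a||^2 + 0.5*||v - c||^2   s.t.   u + v <= b.
a, c, b = np.array([2.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])
gamma = 1.0
psi = lambda t: np.exp(t) - 1.0   # exponential penalty (65)

u, v, w = a.copy(), c.copy(), np.ones(2)
for _ in range(100):
    u = minimize(lambda u_: 0.5 * np.sum((u_ - a) ** 2)
                 + np.sum(w * psi(gamma * (u_ + v - b))) / gamma, u).x   # (73)
    v = minimize(lambda v_: 0.5 * np.sum((v_ - c) ** 2)
                 + np.sum(w * psi(gamma * (u + v_ - b))) / gamma, v).x   # (74)
    w = w * np.exp(gamma * (u + v - b))                                  # (75)

print(u, v, w)  # the iterates should approach u ~ [1, 0], v ~ [0, 0]
```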

Very interestingly, the ADEMM 72 for solving the primal problem 64 is equivalent to BDRS 29 for solving the dual problem 69 , with \(h\) being the Boltzmann-Shannon entropy.

Theorem 3. The ADEMM 72 for solving the primal problem 64 is equivalent to BDRS 29 for solving the dual problem 69 , with \(h\) being the Boltzmann-Shannon entropy.

Proof. We first note that the dual problem 69 is equivalent to \[\label{admm-sum-2-inequality-constraint-dual-equivalent} \min_{w\geq 0} \;f^*(-M^\top w) + g^*(-N^\top w) + b^\top w,\tag{76}\] whose optimality condition is given by: \[\label{admm-sum-2-inequality-constraint-dual-equivalent-OptCond} 0 \in -M \partial f^*(-M^\top w) - N \partial g^*(-N^\top w) + b + \partial \mathbb{I}(w\geq 0),\tag{77}\] where \(\mathbb{I}(C)\) denotes the indicator function of set \(C\). By letting \(A(w) = - M \partial f^*(-M^\top w) + \partial \mathbb{I}(w\geq 0)\) and \(B(w) = - N\partial g^*(-N^\top w) + b + \partial \mathbb{I}(w\geq 0)\), we note that 77 is in the same form as 1 and thus can be solved by BDRS 29 . The BDRS 29 with \(h\) being the Boltzmann-Shannon entropy can be written as \[\tag{78} \begin{align} x^k & := \mathop{\rm argmin}_{x\geq 0} \;g^*(-N^\top x) + b^\top x + \frac{1}{\gamma_k} D_h(x,z^k) \tag{79} \\ y^k & := \mathop{\rm argmin}_{y\geq 0} \;f^*(-M^\top y) +\frac{1}{\gamma_k} D_h(y, (x^k\circ x^k)/z^k) \tag{80}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - \nabla h(x^k) + \nabla h(y^k)),\tag{81} \end{align}\] where \(a\circ b\) denotes the elementwise multiplication of vectors \(a\) and \(b\), and \(a/b\) denotes the elementwise division of vectors \(a\) and \(b\).

By introducing \(v\) as the dual variable of 79 , we obtain that 79 is equivalent to: \[\begin{align} & \min_x \max_v \langle -N^\top x, v\rangle - g(v) + b^\top x +\frac{1}{\gamma_k} D_h(x,z^k) \\ = & \max_v \min_x \langle -N^\top x, v\rangle - g(v) + b^\top x + \frac{1}{\gamma_k} D_h(x,z^k) \\ \equiv & \max_v -\gamma_k g(v) - h^*(\nabla h(z^k) + \gamma_k (Nv-b)) \\ \equiv & \min_v g(v) + \frac{1}{\gamma_k}\sum_j z^k_j e^{\gamma_k (N_j^\top v-b_j)}, \end{align}\] with \(\gamma_k(-Nv + b) + \nabla h(x) - \nabla h(z^k)=0\). Similarly, by introducing \(u\) as the dual variable of 80 , we obtain that 80 is equivalent to: \[\begin{align} & \min_y \max_u \langle -M^\top y, u\rangle - f(u) + \frac{1}{\gamma_k}D_h(y, (x^k\circ x^k)/z^k) \\ = & \max_u \min_y \langle -M^\top y, u\rangle - f(u) + \frac{1}{\gamma_k} D_h(y, (x^k\circ x^k)/z^k) \\ = & \max_u -\gamma_k f(u) - \sum_j (x^k_j \cdot x^k_j / z^k_j) e^{\gamma_k M_j^\top u} \\ \equiv & \min_u f(u) + \frac{1}{\gamma_k} \sum_j (x^k_j \cdot x^k_j / z^k_j) e^{\gamma_k M_j^\top u}, \end{align}\] with \(-\gamma_k Mu+\nabla h(y) - \nabla h((x^k\circ x^k)/z^k)=0\). Therefore, 78 can be equivalently rewritten as \[\tag{82} \begin{align} v^k & := \mathop{\rm argmin}_v g(v) + \frac{1}{\gamma_k}\sum_j z^k_j e^{\gamma_k(N_j^\top v-b_j)} \tag{83} \\ x^k_j & := z^k_j e^{\gamma_k(N_j^\top v^k-b_j)} \\ u^k & := \mathop{\rm argmin}_u f(u) + \frac{1}{\gamma_k}\sum_j (x^k_j \cdot x^k_j / z^k_j) e^{\gamma_k M_j^\top u} \tag{84}\\ y^k_j & := (x^k_j \cdot x^k_j / z^k_j) e^{\gamma_k M_j^\top u^k}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - \nabla h(x^k) + \nabla h(y^k)).\tag{85} \end{align}\] Thus, 82 can be equivalently rewritten as \[\begin{align} v^k & := \mathop{\rm argmin}_v g(v) + \frac{1}{\gamma_k}\sum_j z_j^k e^{\gamma_k(N_j^\top v-b_j)}\\ x^k_j & := z^k_j e^{\gamma_k(N_j^\top v^k-b_j)} \\ u^k & := \mathop{\rm argmin}_u f(u) + \frac{1}{\gamma_k}\sum_j x^k_j e^{\gamma_k(M_j^\top u+N_j^\top v^k-b_j)}\\ z^{k+1}_j & := \nabla h^*(\nabla h(z^k_j) + \gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)). \end{align}\] The last equation implies that \(\nabla h(z_j^{k+1}) =\nabla h(x_j^k) + \gamma_k M_j^\top u^k\), and therefore, \(z_j^{k+1} = x_j^k e^{\gamma_k M_j^\top u^k}\). Thus, the above is equivalent to \[\begin{align} v^k & := \mathop{\rm argmin}_v g(v) + \frac{1}{\gamma_k} \sum_j x_j^{k-1} e^{\gamma_k(M_j^\top u^{k-1}+N_j^\top v-b_j)}\\ x^k_j & := x^{k-1}_j e^{\gamma_k(M_j^\top u^{k-1}+N_j^\top v^k-b_j)} \\ u^k & := \mathop{\rm argmin}_u f(u) + \frac{1}{\gamma_k} \sum_j x^k_j e^{\gamma_k(M_j^\top u+N_j^\top v^k-b_j)} . \end{align}\] This is exactly the ADEMM 72 . ◻

We end this section by remarking that the EMM 66 , being equivalent to a Bregman PPM, is also equivalent to the nonlinear rescaling method developed by Polyak [72][75] with a suitably chosen nonlinear rescaling function, as proved by Polyak and Teboulle [76]. More specifically, the linear inequality constrained problem 64 is equivalent to: \[\label{admm-sum-2-inequality-constraint-NR} \min_{u,v} \;f(u) + g(v), \;\mathrm{s.t. }, \;\frac{1}{\gamma_k}\psi(\gamma_k(M_j^\top u+N_j^\top v-b_j)) \leq 0, j=1,\ldots,m,\tag{86}\] where \(\psi\) is defined in 65 . By associating a Lagrange multiplier \(w_j\) to the \(j\)-th constraint in 86 , the Lagrangian function of 86 is given by: \[\label{admm-sum-2-inequality-constraint-NR-Lag-Fn} \mathcal{L}_{NR}(u,v;w) := f(u) + g(v) + \frac{1}{\gamma_k}\sum_j w_j\psi(\gamma_k(M_j^\top u+N_j^\top v-b_j)).\tag{87}\] The nonlinear rescaling method is essentially a Lagrange multiplier method based on 87 and is given by: \[\tag{88} \begin{align} (u^k,v^k) & := \mathop{\rm argmin}_{u,v} \;\mathcal{L}_{NR}(u,v;w^k) \tag{89}\\ w_j^{k+1} & := w_j^k \nabla \psi(\gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)), j=1,\ldots,m.\tag{90} \end{align}\] Polyak and Griva [74], [75] proposed to use Newton’s method to solve a primal-dual system that consists of 90 and the KKT system of 89 . Note that Newton’s method can be employed because the function \(\psi\) and its derivative \(\psi'\) are both twice continuously differentiable. This is one of the main motivations for designing the nonlinear rescaling method. Our ADEMM 72 leads to the following alternating direction version of the nonlinear rescaling method: \[\begin{align} u^k & := \mathop{\rm argmin}_{u} \;\mathcal{L}_{NR}(u,v^{k-1};w^k) \\ v^k & := \mathop{\rm argmin}_{v} \;\mathcal{L}_{NR}(u^k,v;w^k) \\ w_j^{k+1} & := w_j^k \nabla \psi(\gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)), j=1,\ldots,m. \end{align}\]

4 Bregman Peaceman-Rachford Splitting and Bregman Double-Backward Method↩︎

In this section, we discuss two algorithms that are related to BDRS, namely the Bregman Peaceman-Rachford splitting method (BPRS) and the Bregman double-backward method (BDBM). The Peaceman-Rachford splitting (PRS) method is another well-studied operator splitting method that was also proposed to solve variational problems arising from numerical PDEs [83]. The PRS can also be applied to solving the monotone inclusion problem 1 . Using our notions defined in Definition 3, the BPRS for solving 1 can be written as \[\label{BPRS-alg} z^{k+1} := R_{\gamma_k A}^hR_{\gamma_k B}^h(z^k).\tag{91}\] When \(h(x)=\frac{1}{2}\|x\|_2^2\), 91 reduces to the original PRS in the Euclidean space. For the convex minimization problem 6 , it is known that a symmetric ADMM is equivalent to the PRS applied to solving the dual of 6 . The symmetric ADMM for solving 6 iterates as follows: \[\label{sym-admm-sum-2-alg} \begin{align} u^{k+1} & := \mathop{\rm argmin}_u \;\mathcal{L}_\beta(u,v^k;w^k) \\ w^{k+\frac{1}{2}} & := w^k + \beta(Mu^{k+1} + Nv^{k} - b)\\ v^{k+1} & := \mathop{\rm argmin}_v \;\mathcal{L}_\beta(u^{k+1},v;w^{k+\frac{1}{2}})\\ w^{k+1} & := w^{k+\frac{1}{2}} + \beta(Mu^{k+1} + Nv^{k+1} - b), \end{align}\tag{92}\] where the augmented Lagrangian function \(\mathcal{L}_\beta\) is defined in 11 . This symmetric ADMM is equivalent to an alternating linearization method when \(f\) and \(g\) are both smooth, as studied by Goldfarb, Ma and Scheinberg [84]. For BPRS 91 , we can prove a similar result.
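For comparison with the BDRS sketch in Section 3, the following sketch implements the BPRS pass 91 through the same kind of oracles, using the equivalent three-step form 93 that appears in the proof of Theorem 4 below; only the \(z\)-update differs from BDRS 29 . The Euclidean toy instance is the same one used for the BDRS sketch and is our own choice.

```python
import numpy as np

def bprs(J_gA, J_gB, grad_h, grad_h_star, z0, iters=200):
    """Bregman Peaceman-Rachford splitting (91), in the three-step form (93);
    compared with BDRS (29), only the z-update changes."""
    z = z0.copy()
    for _ in range(iters):
        x = J_gB(z)
        y = J_gA(grad_h_star(2.0 * grad_h(x) - grad_h(z)))
        z = grad_h_star(grad_h(z) - 2.0 * grad_h(x) + 2.0 * grad_h(y))
    return J_gB(z)

# Same Euclidean toy instance as before: min ||x||_1 + 0.5*||x - t||_2^2.
gamma, t = 1.0, np.array([2.0, 0.3, -1.5])
J_gA = lambda z: np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)
J_gB = lambda z: (z + gamma * t) / (1.0 + gamma)
identity = lambda z: z
print(bprs(J_gA, J_gB, identity, identity, np.zeros(3)))  # ~ [1, 0, -0.5]
```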

Theorem 4. For \(\gamma_k>0\), the BPRS 91 for solving the dual of 6 with \(A(x) = -M\partial f^*(-M^\top x)\) and \(B(x) = -N \partial g^*(-N^\top x) + b\), is equivalent to the following Bregman symmetric ADMM for solving 6 : \[\label{Bregman-sym-ADMM} \begin{align} u^k & := \mathop{\rm argmin}_{u} \;f(u) + \frac{1}{\gamma_k}h^*(\nabla h(w^k) + \gamma_k(Mu+Nv^{k-1}-b)) \\ w^{k+\frac{1}{2}} & := \nabla h^*(\nabla h(w^k)+\gamma_k(Mu^k+Nv^{k-1}-b)) \\ v^k & := \mathop{\rm argmin}_{v} \;g(v) + \frac{1}{\gamma_k}h^*(\nabla h(w^{k+\frac{1}{2}}) + \gamma_k(Mu^k+Nv-b)) \\ w^{k+1} & := \nabla h^*(\nabla h(w^{k+\frac{1}{2}})+\gamma_k(Mu^k+Nv^k-b)). \end{align}\qquad{(6)}\]

Proof. The proof is similar to the proof of Theorem 1. The BPRS 91 can be equivalently written as \[\label{BPRS-alg-equiv} \begin{align} x^k & := J_{\gamma_k B}^h(z^k) \\ y^k & := J_{\gamma_k A}^h\circ \nabla h^*(2\nabla h(x^k) - \nabla h(z^k)) \\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - 2\nabla h(x^k) + 2\nabla h(y^k)), \end{align}\tag{93}\] which is further equivalent to \[\tag{94} \begin{align} x^k & := \mathop{\rm argmin}_x \;g^*(-N^\top x) + b^\top x + \frac{1}{\gamma_k}D_h(x,z^k) \tag{95} \\ y^k & := \mathop{\rm argmin}_y \;f^*(-M^\top y) + \frac{1}{\gamma_k}D_h(y,\nabla h^*(2\nabla h(x^k) - \nabla h(z^k))) \tag{96}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - 2\nabla h(x^k) + 2\nabla h(y^k)). \end{align}\] By associating \(v\) as the dual variable of 95 , and \(u\) as the dual variable of 96 , similar to the proof of Theorem 1, it can be shown that 94 is equivalent to \[\tag{97} \begin{align} v^k & := \mathop{\rm argmin}_v h^*(\nabla h(z^k)+\gamma_k(Nv-b)) +\gamma_k g(v) \\ x^k & := \nabla h^*(\gamma_k(Nv^k-b)+\nabla h(z^k)) \tag{98}\\ u^k & := \mathop{\rm argmin}_u h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu) +\gamma_k f(u) \\ y^k & := \nabla h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu^k) \tag{99}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - 2\nabla h(x^k) + 2\nabla h(y^k)).\tag{100} \end{align}\] Reordering the updates in 97 , we know that 97 is equivalent to \[\tag{101} \begin{align} u^k & := \mathop{\rm argmin}_u h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu) +\gamma_k f(u) \tag{102}\\ y^k & := \nabla h^*(2\nabla h(x^k)-\nabla h(z^k)+\gamma_k Mu^k) \tag{103}\\ z^{k+1} & := \nabla h^*(\nabla h(z^k) - 2\nabla h(x^k) + 2\nabla h(y^k)) \tag{104}\\ v^{k+1} & := \mathop{\rm argmin}_v h^*(\nabla h(z^{k+1})+\gamma_k(Nv-b)) +\gamma_k g(v) \tag{105}\\ x^{k+1} & := \nabla h^*(\gamma_k(Nv^{k+1}-b)+\nabla h(z^{k+1})). \tag{106} \end{align}\] Note that 106 implies \[\label{BPRS-relation-1} \nabla h(z^{k+1}) = \nabla h(x^{k+1}) - \gamma_k (Nv^{k+1}-b),\tag{107}\] which immediately gives \[\label{BPRS-relation-2} \nabla h(z^{k}) = \nabla h(x^{k}) - \gamma_k (Nv^{k}-b).\tag{108}\] Combining 103 and 104 yields \[\label{BPRS-relation-3} \nabla h(z^{k+1}) = \nabla h(y^k) + \gamma_k Mu^k,\tag{109}\] which together with 107 gives \[\label{BPRS-relation-4} \nabla h(x^{k+1}) = \nabla h(y^k) + \gamma_k (Mu^k+Nv^{k+1}-b).\tag{110}\] Moreover, 103 implies \[\label{BPRS-relation-5} \nabla h(y^{k}) = 2\nabla h(x^k) - \nabla h(z^k) + \gamma_k Mu^k = \nabla h(x^k) + \gamma_k (Mu^k+Nv^{k}-b),\tag{111}\] where the second equality is due to 108 . Substituting 109 into 105 , 108 into 102 , and combining with 111 and 110 , we know that 101 is equivalent to

\[\begin{align} u^k & := \mathop{\rm argmin}_u h^*(\nabla h(x^k)+\gamma_k (Mu+Nv^k-b)) +\gamma_k f(u) \\ y^k & := \nabla h^*(\nabla h(x^k)+\gamma_k (Mu^k+Nv^k-b)) \\ v^{k+1} & := \mathop{\rm argmin}_v h^*(\nabla h(y^{k})+\gamma_k(Mu^k+Nv-b)) +\gamma_k g(v) \\ x^{k+1} & := \nabla h^*(\nabla h(y^k) + \gamma_k(Mu^k+Nv^{k+1}-b)), \end{align}\]

which is exactly the same as the Bregman symmetric ADMM ?? . ◻

For the linear inequality constrained problem 64 , the BPRS with \(h\) being the Boltzmann-Shannon entropy leads to a symmetric version of ADEMM.

Theorem 5. For \(\gamma_k>0\), the BPRS 91 for solving the dual of 64 with \(h\) being the Boltzmann-Shannon entropy and \(A(x) = -M\partial f^*(-M^\top x)\) and \(B(x) = -N \partial g^*(-N^\top x) + b\), is equivalent to the following Bregman symmetric ADEMM for solving 64 : \[\label{sym-ADEMM-alg} \begin{align} u^k & := \mathop{\rm argmin}_{u} f(u) + \frac{1}{\gamma_k}\sum_{j=1}^m w_j^k \psi(\gamma_k(M_j^\top u+N_j^\top v^{k-1}-b_j)) \label{sym-ADEMM-alg-1} \\ w_j^{k+\frac{1}{2}} & := w_j^k e^{\gamma_k(M_j^\top u^k+N_j^\top v^{k-1}-b_j)}, \;j=1,\ldots,m,\label{sym-ADEMM-alg-2}\\ v^k & := \mathop{\rm argmin}_{v} g(v) + \frac{1}{\gamma_k}\sum_{j=1}^m w_j^{k+\frac{1}{2}}\psi(\gamma_k(M_j^\top u^k+N_j^\top v-b_j)) \label{sym-ADEMM-alg-3} \\ w_j^{k+1} & := w_j^{k+\frac{1}{2}} e^{\gamma_k(M_j^\top u^k+N_j^\top v^k-b_j)}, \;j=1,\ldots,m,\label{sym-ADEMM-alg-4} \end{align}\] where \(\psi\) is defined in 65 .

Proof. The proof is very similar to the proof of Theorem 4. We thus omit it for brevity. ◻

Another closely related splitting method is the double-backward (also called backward-backward) method. The double-backward method was proposed by Passty [85] and later studied by many others including Combettes [86]. The Bregman double-backward method for solving 1 is given by \[\label{BDB-alg} z^{k+1} := J_{\gamma_kA}^hJ_{\gamma_kB}^h(z^k).\tag{112}\] The convergence of the BDBM has mainly been studied for the feasibility problem, i.e., when \(A\) and \(B\) are both normal cone operators. In this case, the monotone inclusion problem reduces to \[\label{prob-feasibility}\text{Find } x, \;\mathrm{s.t. }, \;x \in \mathcal{X}_1 \bigcap \mathcal{X}_2,\tag{113}\] where \(\mathcal{X}_1\) and \(\mathcal{X}_2\) are convex sets, and \(A\) and \(B\) are normal cone operators of \(\mathcal{X}_1\) and \(\mathcal{X}_2\), respectively. Under the assumption that \(A^{-1}(0)\cap B^{-1}(0)\) is not empty, the weak convergence of BDBM 112 was proved by Reich [87]. In fact, note that both \(J_{\gamma_kB}^h\) and \(J_{\gamma_kA}^h\) are quasi-Bregman nonexpansive (QBNE, see [88]), and thus their product is also QBNE [88]. Therefore, we have \(\forall u \in A^{-1}(0) \bigcap B^{-1}(0) = \mathrm{Fix}(J_{\gamma_kA}^h) \bigcap \mathrm{Fix}(J_{\gamma_kB}^h)\), \[\label{BDBM-decrease} D_h(u,z^{k+1}) = D_h(u,J_{\gamma_kA}^h J_{\gamma_kB}^h(z^k)) \leq D_h(u,z^k) \leq D_h(u,z^0),\tag{114}\] where \(\mathrm{Fix}(A)\) denotes the fixed point set of \(A\). Therefore, 114 implies that \(D_h(u,z^k)\) is convergent.

For general \(A\) and \(B\), however, it may not be reasonable to assume that \(A^{-1}(0)\cap B^{-1}(0)\) is not empty, and thus the convergence result in [87] does not apply. This is indeed the case as we will see later for the optimal transport problem in Section 5.

5 Applications to Discrete Optimal Transport↩︎

Optimal transport has found many important applications in machine learning and data science recently [89], [90]. In the case of discrete probability measures, one is given two sets of finitely many atoms, \(\{y_1, y_2, \ldots, y_n\} \subset \mathbb{R}^d\) and \(\{z_1, z_2, \ldots, z_n\} \subset \mathbb{R}^d\), and two probability distributions \(\mu_n = \sum_{i=1}^n r_i\delta_{y_i}\) and \(\nu_n = \sum_{j=1}^n c_j\delta_{z_j}\). Here \(r = (r_1, r_2, \ldots, r_n)^\top \in \Delta^n\) and \(c = (c_1, c_2, \ldots, c_n)^\top \in \Delta^n\), where \(\Delta^n\) denotes the probability simplex in \(\mathbb{R}^n\) and \(\delta_y\) denotes the Dirac delta function at \(y\). The optimal transport between \(\mu_n\) and \(\nu_n\) is obtained by solving the following problem: \[\label{OT} \min_{X} \langle C, X \rangle, \;\mathrm{s.t. }, \;X\mathbf{1}= r, X^\top\mathbf{1}= c, X\geq 0,\tag{115}\] where \(\mathbf{1}\) denotes the \(n\)-dimensional all-one vector, \(C \in\mathbb{R}^{n\times n}\) is the cost matrix whose \((i,j)\)-th component is \(C_{ij} = \|y_i - z_j\|^2\). Note that 115 is a linear program (LP) and can be solved by off-the-shelf LP solvers. However, instances of 115 arising in real applications can be very large, and classical LP solvers may suffer from scalability issues. In [91], Cuturi suggested adopting the algorithm proposed by Sinkhorn and Knopp [92] to solve the following approximation of 115 : \[\label{OT-reg} \min_{X} \langle C, X \rangle + \eta h(X), \;\mathrm{s.t. }, \;X\mathbf{1}= r, X^\top\mathbf{1}= c,\tag{116}\] where \(\eta>0\) is a penalty parameter and \(h(X)\) is the Boltzmann-Shannon entropy, which for a matrix \(X\) is defined as \(h(X) = \sum_{ij} X_{ij}(\log X_{ij}-1)\). That is, an entropy penalty term is added to the objective function with a penalty parameter \(\eta\). The advantage of the penalized problem is that the nonnegativity constraint \(X\geq 0\) is no longer needed because it is implicitly enforced by the entropy function \(h(X)\). The algorithm proposed in [92] (from now on, we call it Sinkhorn’s algorithm) solves the dual problem of 116 using a block minimization algorithm. More specifically, using \(\alpha\) and \(\beta\) to denote the Lagrange multipliers associated with the two linear equality constraints of 116 , the dual problem of 116 can be written as: \[\tag{117} \begin{align} & \max_{\alpha,\beta}\min_X \;\langle C, X \rangle + \eta h(X) - \langle \alpha, X\mathbf{1}- r\rangle - \langle \beta, X^\top\mathbf{1}- c \rangle \tag{118}\\ = &\max_{\alpha,\beta} \;\langle \alpha,r\rangle + \langle \beta,c \rangle -\eta\sum_{ij}e^{\frac{1}{\eta}(\alpha_i+\beta_j-C_{ij})}.\tag{119} \end{align}\] Note that the \(X\)-minimization problem in 118 has a closed-form optimal solution given by \[X_{ij} = e^{\frac{1}{\eta}(\alpha_i+\beta_j-C_{ij})}, \quad i,j = 1,\ldots,n.\] By letting \(K_{ij} = e^{-C_{ij}/\eta}\), \(u_i = e^{\alpha_i/\eta}\), and \(v_j = e^{\beta_j/\eta}\), 119 can be equivalently written as: \[\label{sinkhorn-uv-problem} \min_{u,v} \sum_{ij} u_i K_{ij} v_j - \sum_i r_i \log u_i - \sum_j c_j \log v_j.\tag{120}\] Sinkhorn’s algorithm [91], [92] solves 120 by alternatingly minimizing over \(u\) and \(v\) with the other variable fixed, i.e., \[\label{sinkhorn-alg} \begin{align} u^k & := \mathop{\rm argmin}_u \;\sum_{ij} u_i K_{ij} v_j^{k-1} - \sum_i r_i \log u_i \\ v^k & := \mathop{\rm argmin}_v \;\sum_{ij} u_i^k K_{ij} v_j - \sum_j c_j \log v_j.
\end{align}\tag{121}\] It turns out that both subproblems in 121 admit closed-form solutions, and Sinkhorn’s algorithm can be written more compactly as: \[\label{sinkhorn-alg-simpler} \begin{align} u^k & := r./(Kv^{k-1}) \\ v^k & := c./(K^\top u^k) . \end{align}\tag{122}\] Sinkhorn’s algorithm can be implemented very efficiently because the computations in 122 are simple. One drawback of Sinkhorn’s algorithm is that it is very difficult to tune the parameter \(\eta\). Ideally, one wants a small \(\eta\) so that the penalized problem 116 is close to the original problem 115 . However, a small \(\eta\) will cause numerical instability of Sinkhorn’s algorithm. It is usually not clear how small \(\eta\) can be without causing numerical issues. Moreover, Sinkhorn’s algorithm only solves the regularized problem 116 , not the original OT problem 115 .
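For reference, here is a compact NumPy sketch of the Sinkhorn iteration 122 on a small synthetic instance; the random data and the function name are our own. (In practice, for very small \(\eta\) the updates are usually carried out in the log domain to mitigate the instability discussed above.)

```python
import numpy as np

def sinkhorn(C, r, c, eta, iters=1000):
    """Sinkhorn's algorithm (122) for the entropy-regularized problem (116).
    Returns the scaled plan X = diag(u) K diag(v) with K = exp(-C / eta)."""
    K = np.exp(-C / eta)
    v = np.ones_like(c)
    for _ in range(iters):
        u = r / (K @ v)       # first update in (122)
        v = c / (K.T @ u)     # second update in (122)
    return u[:, None] * K * v[None, :]

n = 4
rng = np.random.default_rng(0)
y, z = rng.random((n, 2)), rng.random((n, 2))
C = ((y[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # C_ij = ||y_i - z_j||^2
r = np.ones(n) / n
c = np.ones(n) / n
X = sinkhorn(C, r, c, eta=0.05)
print(np.abs(X.sum(1) - r).max(), np.abs(X.sum(0) - c).max())  # marginal residuals
```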

Now we discuss how to use our BDRS (or ADEMM) to solve the original OT 115 . Note that 115 can be written in the form of 1 by defining \[\label{def-OT-AB} A(X) = C + \partial \mathbb{I}(X\mathbf{1}=r) + \partial\mathbb{I}(X\geq 0), \quad B(X) = \partial \mathbb{I}(X^\top \mathbf{1}=c) + \partial\mathbb{I}(X\geq 0).\tag{123}\] Now the BDRS 29 can be written as (for the ease of comparison with Sinkhorn’s algorithm, we choose \(\gamma_k\) to be a constant \(1/\eta\)): \[\tag{124} \begin{align} X^k & := \mathop{\rm argmin}_X \;D_h(X,Z^k), \;\mathrm{s.t. }, \;X^\top \mathbf{1}=c \tag{125}\\ Y^k & := \mathop{\rm argmin}_Y \;\langle C,Y\rangle + \eta D_h(Y, X^k\circ X^k./Z^k), \;\mathrm{s.t. }, \;Y\mathbf{1}=r \tag{126}\\ Z^{k+1} & := Z^k \circ Y^k./ X^k.\tag{127} \end{align}\] The dual problem of the OT 115 is given by: \[\label{OT-dual} \min_{\alpha, \beta} \;-r^\top \alpha - c^\top \beta, \;\mathrm{s.t. }, \;\alpha_i + \beta_j \leq C_{ij}, i, j = 1,\ldots,n.\tag{128}\] The ADEMM 72 for solving 128 is given by: \[\label{ADEMM-alg-OT} \begin{align} \alpha^k & := \mathop{\rm argmin}_\alpha \;-r^\top \alpha + \eta\sum_{ij}X_{ij}^k e^{\frac{1}{\eta}(\alpha_i+\beta_j^{k-1}-C_{ij})} \\ \beta^k & := \mathop{\rm argmin}_\beta \;- c^\top \beta + \eta\sum_{ij}X_{ij}^k e^{\frac{1}{\eta}(\alpha_i^k+\beta_j-C_{ij})} \\ X_{ij}^{k+1} & := X_{ij}^ke^{\frac{1}{\eta}(\alpha_i^k+\beta_j^k-C_{ij})}, \;i,j=1,\ldots,n. \end{align}\tag{129}\] By denoting \(u = e^{\alpha/\eta}\), \(v= e^{\beta/\eta}\), and \(K_{ij} = e^{-C_{ij}/\eta}\), 129 can be further rewritten as \[\tag{130} \begin{align} u^k & := \mathop{\rm argmin}_u \;\sum_{ij}u_i X_{ij}^kK_{ij}v_j^{k-1} -\sum_i r_i\log u_i \tag{131}\\ v^k & := \mathop{\rm argmin}_v \;\sum_{ij}u_i^k X_{ij}^kK_{ij}v_j - \sum_j c_j\log v_j \tag{132}\\ X_{ij}^{k+1} & := u_i^kX_{ij}^kK_{ij}v_j^k, \;i,j=1,\ldots,n. \end{align}\] Similar to 121 , the two subproblems 131 and 132 admit closed-form solutions, and 130 can be equivalently written as: \[\tag{133} \begin{align} u^k & := r./((X^k\circ K)v^{k-1}) \tag{134}\\ v^k & := c./((X^k\circ K)^\top u^k) \tag{135}\\ X_{ij}^{k+1} & := u_i^kX_{ij}^kK_{ij}v_j^k, \;i,j=1,\ldots,n.\tag{136} \end{align}\] Now comparing ADEMM 133 with the Sinkhorn’s algorithm 122 , we see when updating \(u^k\) and \(v^k\), 133 replaces matrix \(K\) in 122 by matrix \(X^k\circ K\), which is no longer a constant matrix. The matrix \(X^{k+1}\) is then updated by 136 .
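The following sketch implements the updates 134 , 135 , and 136 on the same kind of synthetic instance as above. The initialization of \(X\) and \(v\) (product measure and all-ones) is our own choice, since 133 leaves it open; and since, as noted in the paper, one convergence assumption does not apply to the OT problem, the marginal residuals and the transport cost are printed only as diagnostics.

```python
import numpy as np

def bdrs_ot(C, r, c, eta=1.0, iters=100):
    """ADEMM/BDRS updates (134)-(136) for the unregularized OT problem (115).
    X and v are initialized with the product measure and all-ones (our choice)."""
    K = np.exp(-C / eta)
    X = np.outer(r, c)
    v = np.ones_like(c)
    for _ in range(iters):
        u = r / ((X * K) @ v)                  # (134)
        v = c / ((X * K).T @ u)                # (135)
        X = u[:, None] * X * K * v[None, :]    # (136)
    return X

n = 4
rng = np.random.default_rng(1)
y, z = rng.random((n, 2)), rng.random((n, 2))
C = ((y[:, None, :] - z[None, :, :]) ** 2).sum(-1)
r = np.ones(n) / n
c = np.ones(n) / n
X = bdrs_ot(C, r, c)
print(np.abs(X.sum(1) - r).max(), np.abs(X.sum(0) - c).max(), (C * X).sum())
```

Note that each pass above has essentially the same cost as a Sinkhorn pass: the only difference is that the matrix \(X^k\circ K\) changes between iterations.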

Remark 6. Our BDRS 133 for solving the OT problem can be a much better algorithm than Sinkhorn’s algorithm, because we do not require \(\eta\) to be close to 0, and thus we resolve the issue of numerical instability of Sinkhorn’s algorithm. We also see that the computations in 133 are very simple and can be done in parallel, as illustrated in [91] for Sinkhorn’s algorithm. Thus our BDRS can be a much better algorithm than the classical ADMM for solving 115 , which requires projections onto the probability simplex in each iteration.

We now discuss the BDBM 112 for solving the OT problem 115 . We note that \(\mathrm{Fix}(J_{\gamma A}^h J_{\gamma B}^h) \not\subset (A+B)^{-1}(0)\), although \(\mathrm{Fix}(R_{\gamma A}^h R_{\gamma B}^h) \subset (A+B)^{-1}(0)\). Therefore, even though we can find a fixed point of \(J_{\gamma A}^h J_{\gamma B}^h\) using the BDBM with a constant \(\gamma_k=\gamma\), the solution we find may not solve the OT problem 115 . However, we have the following result regarding the fixed point of \(J_{\gamma_k A}^h J_{\gamma_k B}^h\).

Theorem 7. For the OT problem 115 with \(A\) and \(B\) defined in 123 and \(h\) being the Boltzmann-Shannon entropy, it holds that \[\lim_{\gamma_k\to 0} \mathrm{Fix}(J_{\gamma_k A}^h J_{\gamma_k B}^h) \subset (A+B)^{-1}(0).\]

Proof. Assume \(x\in \mathrm{Fix}(J_{\gamma_k A}^h J_{\gamma_k B}^h)\), that is, \[x = J_{\gamma_k A}^h J_{\gamma_k B}^h x,\] which can be equivalently written as \[x = (\nabla h + \gamma_k A)^{-1}\circ \nabla h \circ (\nabla h + \gamma_k B)^{-1}\circ \nabla h(x).\] This is further equivalent to (note that \(\nabla h(x) = \log x\) and \(\nabla h^*(x) = e^x\)): \[\begin{align} & \quad (\nabla h + \gamma_k B)\nabla h^*(\nabla h + \gamma_k A)x = \nabla h(x) \\ \Longleftrightarrow & \quad (I + \gamma_k B\nabla h^*) (\nabla h (x) + \gamma_k A(x)) = \nabla h(x) \\ \Longleftrightarrow & \quad (I + \gamma_k B\nabla h^*) \log (x\cdot e^{\gamma_k A(x)}) = \nabla h(x) \\ \Longleftrightarrow & \quad \log (x\cdot e^{\gamma_k A(x)}) + \gamma_k B(x\cdot e^{\gamma_k A(x)}) = \nabla h(x) \\ \Longleftrightarrow & \quad \log (x\cdot e^{\gamma_k A(x)} \cdot e^{\gamma_k B(x\cdot e^{\gamma_k A(x)})}) = \log(x) \\ \Longleftrightarrow & \quad \gamma_k A(x) + \gamma_k B(x\cdot e^{\gamma_k A(x)}) = 0 \\ \Longleftrightarrow & \quad A(x) + B(x\cdot e^{\gamma_k A(x)}) = 0. \end{align}\] By letting \(\gamma_k \rightarrow 0\), we have \(A(x)+ B(x) = 0\), i.e., \(x\in (A+B)^{-1}(0)\). ◻

Theorem 7 shows that if we let \(\gamma_k\rightarrow 0\), then the BDBM 112 solves the OT problem 115 .

6 Concluding Remarks↩︎

In this paper, we proposed several new algorithms for solving monotone inclusion problems and convex minimization problems. The algorithms include BDRS, BPRS, Bregman ADMM, and ADEMM. We discussed their connections with existing algorithms in the literature. We also discussed how to apply our algorithms to solve the discrete optimal transport problem. We proved the convergence of the algorithms under certain assumptions, though we point out that one assumption does not apply to the OT problem. We leave it as future work to prove the convergence of the proposed algorithms in the most general setting.

7 The Convergence Analysis of BPRS↩︎

An interesting observation about the operators defined in Definition 3 is that \[R_T^h = F_T^h\circ J_T^h.\]

Therefore the BPRS 91 is equivalent to \[z^{k+1} = F_{\gamma_kA}^h\circ J_{\gamma_kA}^h \circ F_{\gamma_kB}^h\circ J_{\gamma_kB}^h (z^k).\] Letting \(x^k = J_{\gamma_kA}^h \circ F_{\gamma_kB}^h\circ J_{\gamma_kB}^h (z^k)\), the above equation becomes \[\label{BPRS-x-special} x^{k+1} = J_{\gamma_kA}^h\circ F_{\gamma_kB}^h \circ J_{\gamma_kB}^h\circ F_{\gamma_kA}^h (x^k).\tag{137}\] Therefore, the BPRS is the composition of two Bregman forward-backward operators. Moreover, it can be shown that \[\label{BPRS-special-fixed-set} \mathrm{Fix}(J_{\gamma_kA}^h\circ F_{\gamma_kB}^h \circ J_{\gamma_kB}^h\circ F_{\gamma_kA}^h) = (A+B)^{-1}(0).\tag{138}\]

7.1 The convergence of BPRS when both \(f\) and \(g\) are relatively smooth.↩︎

In this section, we provide a convergence analysis of BPRS when both \(f\) and \(g\) are relatively smooth functions with respect to \(h\), i.e., \[\begin{align} f(x) \leq f(y) + \langle \nabla f(y), x-y \rangle + LD_h(x,y) \\ g(x) \leq g(y) + \langle \nabla g(y), x-y \rangle + LD_h(x,y), \end{align}\] where \(L>0\) is the relative smoothness parameter. We consider the smooth problem \[\label{proof-problem-special} \min_x \;F(x) = f(x) + g(x) \Longleftrightarrow 0 \in A(x) + B(x),\tag{139}\] where \(A = \nabla f\) and \(B = \nabla g\). Our convergence result is summarized in Theorem 8.

Theorem 8. Assume \(f\) and \(g\) are relatively smooth functions with respect to \(D_h\) with parameter \(L\). BPRS with \(\gamma_k=\gamma\leq 1/L\) for solving 139 globally converges to \((A+B)^{-1}(0)\).

Proof. BPRS with \(\gamma_k=\gamma\leq 1/L\) for solving 139 can be rewritten as \[\label{BPRS-alg-special} \begin{align} y^k & := \mathop{\rm argmin}_y \;f(x^k) + \langle \nabla f(x^k), y-x^k\rangle + g(y) + \frac{1}{\gamma}D_h(y,x^k) \\ x^{k+1} & := \mathop{\rm argmin}_x \;g(y^k) + \langle \nabla g(y^k), x-y^k\rangle + f(x) + \frac{1}{\gamma}D_h(x,y^k). \end{align}\tag{140}\] It is easy to obtain that: \[\begin{align} F(y^k) \leq & F(x^k) - \frac{1}{\gamma} D_h(x^k,y^k) \\ F(x^{k+1}) \leq & F(y^k) - \frac{1}{\gamma} D_h(y^k,x^{k+1}), \end{align}\] which further yields \[\begin{align} F(x^{k+1}) \leq F(x^k) - \frac{1}{\gamma} D_h(x^k,y^k) - \frac{1}{\gamma} D_h(y^k,x^{k+1}). \end{align}\] This inequality shows that \(F(x^k)\) monotonically decreases. Moreover, by a telescoping sum, we obtain: \[\frac{1}{\gamma}\sum_{k=0}^N \left(D_h(x^k,y^k) + D_h(y^k,x^{k+1}) \right)\leq F(x^0) -F(x^*).\] This indicates \[\lim_{k\to \infty} D_h(x^k,y^k) = 0 \quad \text{and} \quad \lim_{k\to \infty} D_h(y^k,x^{k+1}) = 0.\] This further implies \[\lim_{k\to \infty} \|x^k-y^k\|=0, \quad \lim_{k\to \infty} \|x^{k+1}-y^k\|=0, \quad \lim_{k\to \infty} \|x^k-x^{k+1}\|=0.\] This shows that \(\{x^k\}\) is a Cauchy sequence. Therefore, \(\{x^k\}\) is bounded and convergent. From 137 and 138 we know that \(\{x^k\}\) converges to \((A+B)^{-1}(0)\). ◻
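To illustrate the scheme 140 (this is only a toy instance, not a claim about its general behavior), the sketch below takes \(h\) to be the Boltzmann-Shannon entropy on the positive orthant and makes the illustrative choices \(f = D_h(\cdot,p)\) and \(g = D_h(\cdot,q)\) for fixed positive vectors \(p,q\). Both functions have Hessian equal to \(\nabla^2 h(x) = \mathrm{diag}(1/x_i)\), hence are relatively \(1\)-smooth with respect to \(h\); both subproblems in 140 then have closed-form multiplicative solutions, and the minimizer of \(f+g\) is \(\sqrt{p\odot q}\) (elementwise), so the iterates can be checked directly.

```python
import numpy as np

# Toy instance of scheme (140): h = Boltzmann-Shannon entropy on the positive orthant,
# f(x) = D_h(x, p), g(x) = D_h(x, q) (generalized KL divergences).
# Both are relatively 1-smooth w.r.t. h, and argmin f + g = sqrt(p * q).
rng = np.random.default_rng(0)
n = 6
p = 0.5 + rng.random(n)
q = 0.5 + rng.random(n)
x_star = np.sqrt(p * q)

gamma = 0.5                    # gamma <= 1/L with L = 1, as in Theorem 8
def bprs_step(x):
    # y  = argmin_y <grad f(x), y> + g(y) + (1/gamma) D_h(y, x)   (closed form)
    y = np.exp((gamma * np.log(p * q) + (1 - gamma) * np.log(x)) / (1 + gamma))
    # x+ = argmin_x <grad g(y), x> + f(x) + (1/gamma) D_h(x, y)   (closed form)
    return np.exp((gamma * np.log(p * q) + (1 - gamma) * np.log(y)) / (1 + gamma))

x = np.ones(n)
for _ in range(50):
    x = bprs_step(x)
print("||x - sqrt(p*q)|| after 50 iterations:", np.linalg.norm(x - x_star))
```

In this toy instance the error in the mirror (logarithmic) coordinates contracts by the factor \((1-\gamma)/(1+\gamma)\) per half-step, and the minimizer is reached in a single step when \(\gamma=1\), consistent with the monotone decrease of \(F\) established in the proof.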

7.2 The convergence of BPRS when both \(f\) and \(g\) are non-smooth functions.↩︎

Note that even when both \(f\) and \(g\) are nonsmooth, we can still apply 137 with \(A = \partial f\) and \(B = \partial g\). In this case, 137 reduces to (with \(x^0:=J_{\gamma_kA}^h \circ F_{\gamma_kB}^h\circ J_{\gamma_kB}^h (z^0)\)) \[\label{BPRS-alternating-prox-subgrad} x^{k+1} = (\nabla h+\gamma_k \partial f)^{-1} (\nabla h - \gamma_k \partial g) (\nabla h+\gamma_k \partial g)^{-1} (\nabla h - \gamma_k \partial f) (x^k).\tag{141}\]

In this section we prove the convergence of the BPRS for solving \[\label{prove-BPRS-prob-sum-1}\min_x \;f(x) + g(x)\tag{142}\] when both \(f\) and \(g\) are nonsmooth functions, under the assumption that \(\textrm{im}(\nabla h^*) \subset \textrm{dom } f \cap \textrm{dom } g\). This assumption was used in [93]. Note that according to 141 , the BPRS for solving 142 can be written as \[\tag{143} \begin{align} \bar{x}^k & := \nabla h^*(\nabla h(w^k) - \gamma_k\partial f(w^k)) \tag{144}\\ x^{k+1} & := \mathop{\rm argmin}_x \;g(x) + \frac{1}{\gamma_k}D_h(x,\bar{x}^k) \tag{145}\\ \bar{w}^{k+1} &:= \nabla h^*(\nabla h(x^{k+1}) - \gamma_k\partial g(x^{k+1})) \tag{146}\\ w^{k+1} & := \mathop{\rm argmin}_w \;f(w) + \frac{1}{\gamma_k}D_h(w,\bar{w}^{k+1}).\tag{147} \end{align}\] Note that this is an alternating Bregman proximal subgradient method. Bot and Bohm [93] analyzed the convergence of the Bregman proximal subgradient method, which consists of only one step of 143 .
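For orientation, when \(h\) is the Boltzmann-Shannon entropy (so \(\nabla h = \log\) and \(\nabla h^* = \exp\)), the two mirror steps 144 and 146 become elementwise multiplicative updates, \[\bar{x}^k = w^k\cdot e^{-\gamma_k f'(w^k)}, \qquad \bar{w}^{k+1} = x^{k+1}\cdot e^{-\gamma_k g'(x^{k+1})},\] where \(f'(w^k)\in\partial f(w^k)\) and \(g'(x^{k+1})\in\partial g(x^{k+1})\) denote the chosen subgradients, while 145 and 147 are entropic proximal steps on \(g\) and \(f\), respectively.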

We will use the three-point identity: \[\label{three-point} D_h(x,y) + D_h(y,z) = D_h(x,z) - \langle\nabla h(y)-\nabla h(z),x-y \rangle.\tag{148}\]
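This identity follows directly from the definition \(D_h(x,y) = h(x) - h(y) - \langle\nabla h(y), x-y\rangle\): expanding, \[D_h(x,y) + D_h(y,z) - D_h(x,z) = -\langle\nabla h(y), x-y\rangle - \langle\nabla h(z), y-z\rangle + \langle\nabla h(z), x-z\rangle = \langle\nabla h(z) - \nabla h(y), x-y\rangle,\] which is exactly 148 .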

Our main convergence result for BPRS is summarized in Theorem 9.

Theorem 9. Assume \(\textrm{im}(\nabla h^*) \subset \textrm{dom } f \cap \textrm{dom } g\), \(\|\partial f\|_\infty\leq G\), \(\|\partial g\|_\infty\leq G\), \(h\) is \(\sigma\)-strongly convex over the simplex, and \(\nabla h^*\) is Lipschitz continuous. Then BPRS 143 with \(\gamma_k=1/\sqrt{k}\) for solving 142 converges sublinearly.

Proof. First, we have \[\begin{align} D_h(y,\bar{x}^k) & \overset{\eqref{three-point}}{=\joinrel=} D_h(y,w^k) - D_h(\bar{x}^k,w^k) - \langle \nabla h(\bar{x}^k) - \nabla h(w^k), y-\bar{x}^k\rangle \nonumber\\ & = D_h(y,w^k) - D_h(\bar{x}^k,w^k) + \gamma_k \langle\partial f(w^k), y-\bar{x}^k\rangle \nonumber\\ & = D_h(y,w^k) - D_h(\bar{x}^k,w^k) + \gamma_k \langle \partial f(w^k), y-w^k\rangle - \gamma_k \langle \partial f(w^k), \bar{x}^k-w^k\rangle \nonumber\\ & \leq D_h(y,w^k) - D_h(\bar{x}^k,w^k) + \gamma_k (f(y) - f(w^k)) + \gamma_k \|\partial f(w^k)\|_\infty\|\bar{x}^k-w^k\|_1\nonumber \\ & \leq D_h(y,w^k) - D_h(\bar{x}^k,w^k) + \gamma_k (f(y) - f(w^k)) + \frac{\gamma_k^2}{\sigma}\|\partial f(w^k)\|_\infty^2 + \frac{\sigma}{4}\|\bar{x}^k-w^k\|_1^2 \nonumber\\ & \overset{(*)}{\leq} D_h(y,w^k) - D_h(\bar{x}^k,w^k) + \gamma_k (f(y) - f(w^k)) + \frac{\gamma_k^2}{\sigma}\|\partial f(w^k)\|_\infty^2 + \frac{1}{2}D_h(\bar{x}^k,w^k) \nonumber\\ & \overset{(**)}{\leq} D_h(y,w^k) - \frac{1}{2}D_h(\bar{x}^k,w^k) + \gamma_k (f(y) - f(w^k)) + \frac{\gamma_k^2}{\sigma}G^2 \label{proof-bound-1}, \end{align}\tag{149}\] where (*) is due to the \(\sigma\)-strong convexity of \(h\) over the simplex (also note that both \(\bar{x}^k\) and \(w^k\) are on the simplex), and in (**) we used \(\|\partial f\|_\infty\leq G\).

Now, the optimality condition of 145 is: \[0\in\gamma_k\partial g(x^{k+1}) + \nabla h(x^{k+1}) - \nabla h(\bar{x}^k),\] which implies \[\gamma_k(g(y) - g(x^{k+1})) \geq \langle \nabla h(\bar{x}^k)-\nabla h(x^{k+1}), y - x^{k+1}\rangle.\] Using the three-point identity we have \[\gamma_k(g(y) - g(x^{k+1})) \geq D_h(y,x^{k+1}) + D_h(x^{k+1},\bar{x}^k) - D_h(y,\bar{x}^k),\] or, equivalently, \[\label{proof-bound-2}\gamma_k(g(x^{k+1})-g(y)) + D_h(y,x^{k+1}) \leq D_h(y,\bar{x}^k)- D_h(x^{k+1},\bar{x}^k).\tag{150}\] Combining 149 and 150 , we have \[\gamma_k(g(x^{k+1})-g(y)) + \gamma_k (f(w^k) - f(y)) + D_h(y,x^{k+1}) \leq D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 - D_h(x^{k+1},\bar{x}^k) - \frac{1}{2}D_h(\bar{x}^k,w^k).\] By adding and subtracting \(f(x^{k+1})\), we obtain

\[\begin{align} &\gamma_k(f(x^{k+1})+g(x^{k+1})-f(y)- g(y)) + \gamma_k (f(w^k) - f(x^{k+1})) + D_h(y,x^{k+1}) \\ \leq & D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 - D_h(x^{k+1},\bar{x}^k) - \frac{1}{2}D_h(\bar{x}^k,w^k), \end{align}\] which immediately implies \[\begin{align} &\gamma_k(f(x^{k+1})+g(x^{k+1})-f(y)- g(y)) + D_h(y,x^{k+1}) \\ \leq & D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 - D_h(x^{k+1},\bar{x}^k) - \frac{1}{2}D_h(\bar{x}^k,w^k) + \gamma_k G \|w^k - x^{k+1}\|_1 \\ \overset{(*)} \leq &D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 - D_h(x^{k+1},\bar{x}^k) - \frac{1}{2}D_h(\bar{x}^k,w^k) + \gamma_k G \|w^k - \bar{x}^k\|_1 + \gamma_kG\|\bar{x}^k-x^{k+1}\|_1 \\ \overset{(**)}\leq &D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 - D_h(x^{k+1},\bar{x}^k) - \frac{1}{2}D_h(\bar{x}^k,w^k) + \gamma_k G \|w^k - \bar{x}^k\|_1 + \frac{\gamma_k^2G^2}{2\sigma} + D_h(x^{k+1},\bar{x}^k) \\ = & D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 - \frac{1}{2}D_h(\bar{x}^k,w^k) + \gamma_k G \|w^k - \bar{x}^k\|_1 + \frac{\gamma_k^2G^2}{2\sigma} \\ \overset{(***)}\leq & D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 - \frac{1}{2}D_h(\bar{x}^k,w^k) + \frac{\gamma_k^2G^2}{\sigma} + \frac{\sigma}{4} \|w^k - \bar{x}^k\|_1^2 + \frac{\gamma_k^2G^2}{2\sigma} \\ \overset{(****)}{\leq} & D_h(y,w^k) + \frac{\gamma_k^2}{\sigma}G^2 + \frac{\gamma_k^2G^2}{\sigma} + \frac{\gamma_k^2G^2}{2\sigma} \\ = & D_h(y,w^k) + \frac{5\gamma_k^2}{2\sigma}G^2, \end{align}\] where (*) is the triangle inequality, (**) is due to Young’s inequality and the \(\sigma\)-strong convexity of \(h\), (***) is due to Young’s inequality, and (****) is due to the \(\sigma\)-strong convexity of \(h\). Note that so far we have only dealt with 144 and 145 , and we obtained \[\label{key-inequality-1} \gamma_k(f(x^{k+1})+g(x^{k+1})-f(y)- g(y)) + D_h(y,x^{k+1}) \leq D_h(y,w^k) + \frac{5\gamma_k^2}{2\sigma}G^2.\tag{151}\] Similarly, for 146 and 147 , we have \[\label{key-inequality-2} \gamma_k(f(w^{k+1})+g(w^{k+1})-f(y)- g(y)) + D_h(y,w^{k+1}) \leq D_h(y,x^{k+1}) + \frac{5\gamma_k^2}{2\sigma}G^2.\tag{152}\] Now, summing up 151 and 152 , we have (denoting \(L = \frac{5G^2}{2\sigma}\)) \[\label{important-inequality-3} \gamma_k (f(x^{k+1}) + g(x^{k+1}) + f(w^{k+1}) + g(w^{k+1}) - 2(f(y) + g(y))) + D_h(y,w^{k+1}) \leq D_h(y,w^{k}) + 2L \gamma_k^2.\tag{153}\] Define \(z^{k+1} := (x^{k+1}+w^{k+1})/2\). From the convexity of \(f\) and \(g\), we have \[\label{important-inequality-4} \gamma_k (f(z^{k+1}) + g(z^{k+1}) - (f(y) + g(y))) + \frac{1}{2}D_h(y,w^{k+1}) \leq \frac{1}{2}D_h(y,w^{k}) + L \gamma_k^2.\tag{154}\] Now, summing up 154 for \(k = 0,1,\ldots,N-1\), we have \[\label{important-inequality-5} \sum_{k=0}^{N-1}\gamma_k (f(z^{k+1}) + g(z^{k+1}) - (f(y) + g(y))) \leq \frac{1}{2}D_h(y,w^{0}) + L \sum_{k=0}^{N-1}\gamma_k^2.\tag{155}\] Therefore, we have \[\label{important-inequality-6} \min_{0\leq k\leq N-1}\left(f(z^{k+1}) + g(z^{k+1})\right) - (f(y) + g(y)) \leq \frac{D_h(y,w^{0})}{2 \sum_{k=0}^{N-1}\gamma_k } + \frac{L \sum_{k=0}^{N-1}\gamma_k^2}{\sum_{k=0}^{N-1}\gamma_k}.\tag{156}\] By choosing \(y = x^*\) and using \(\gamma_k = \frac{1}{\sqrt{k}}\), we obtain \[\lim_{N\rightarrow +\infty} \min_{0\leq k\leq N-1}\left(f(z^{k+1}) + g(z^{k+1})\right) = f(x^*) + g(x^*).\] Moreover, the convergence rate is sublinear. ◻
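To make the sublinear rate explicit (with the convention that \(\gamma_0\) is a fixed positive constant, or equivalently that the stepsize sums start at \(k=1\)), note that \(\sum_{k=1}^{N-1} k^{-1/2} \geq 2(\sqrt{N}-1)\) while \(\sum_{k=1}^{N-1} k^{-1} \leq 1 + \log N\). Hence, up to the constant contributed by \(\gamma_0\), 156 with \(y = x^*\) gives, for \(N\geq 2\), \[\min_{0\leq k\leq N-1}\left(f(z^{k+1}) + g(z^{k+1})\right) - \left(f(x^*) + g(x^*)\right) = O\!\left(\frac{D_h(x^*,w^{0}) + L\log N}{\sqrt{N}}\right),\] i.e., an \(O(\log N/\sqrt{N})\) rate.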

Remark 10. Note that we required that \(\nabla h^*\) is Lipschitz continuous. This is satisfied if \(h\) is the entropy function: \(h(x) = \sum_i (x_i \log x_i - x_i)\). In this case, \(\nabla h(x) = \log x\), and the conjugate function on the probability simplex is given by: \[h^*(x) = \sup_{e^\top y =1} x^\top y - h(y).\] It is easy to verify that the optimal \(y\) is given by \(y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\), and \[\label{def-h42} h^*(x) = \log \sum_i e^{x_i} + 1.\qquad{(7)}\] Note that this is the log-sum-exp function, which is known to have a Lipschitz continuous gradient. Moreover, we have \([\nabla h^*(x)]_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\).
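The identities in Remark 10 are easy to verify numerically. The short sketch below (illustrative only) evaluates \(\nabla h^*\), i.e. the softmax map, in a numerically stable way and checks that \(\nabla h^*(\nabla h(x)) = x\) for \(x\) on the probability simplex.

```python
import numpy as np

def grad_h(x):
    """grad h(x) = log x for the entropy h(x) = sum_i (x_i log x_i - x_i)."""
    return np.log(x)

def grad_h_star(y):
    """grad h*(y) = softmax(y); the shift by max(y) avoids overflow in exp."""
    z = y - np.max(y)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(7)
x /= x.sum()                     # a point on the probability simplex
print("||softmax(log x) - x|| =", np.linalg.norm(grad_h_star(grad_h(x)) - x))

# h*(y) = log(sum_i exp(y_i)) + 1 is the log-sum-exp function (up to the constant),
# and its gradient -- the softmax map above -- is Lipschitz continuous, as used in Theorem 9.
```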

8 Convergence of BDRS↩︎

In this section we consider the same problem as in Section 7.2, under the same assumptions as in Theorem 9. Note that BDRS can be equivalently written as \[\tag{157} \begin{align} w^k & := \mathop{\rm argmin}_w \;f(w) + \frac{1}{\gamma_k}D_h(w,x^k)\tag{158} \\ \bar{x}^k & := \nabla h^*(2\nabla h(w^k) - \nabla h(x^k))\tag{159} \\ z^k & := \mathop{\rm argmin}_z \;g(z) + \frac{1}{\gamma_k}D_h(z,\bar{x}^k)\tag{160} \\ y^k & := \nabla h^*(2\nabla h(z^k) - \nabla h(\bar{x}^k))\tag{161} \\ x^{k+1} & := \nabla h^*\left(\frac{1}{2}\nabla h(x^k)+\frac{1}{2}\nabla h(y^k)\right).\tag{162} \end{align}\]

This is equivalent to the following when considering problem 142 :

\[\tag{163} \begin{align} w^k & := \mathop{\rm argmin}_w \;f(w) + \frac{1}{\gamma_k}D_h(w,x^k)\tag{164} \\ \bar{x}^k & := \nabla h^*(\nabla h(w^k) - \gamma_k \partial f(w^k))\tag{165} \\ z^k & := \mathop{\rm argmin}_z \;g(z) + \frac{1}{\gamma_k}D_h(z,\bar{x}^k)\tag{166} \\ y^k & := \nabla h^*(\nabla h(z^k) - \gamma_k\partial g(z^k))\tag{167} \\ x^{k+1} & := \nabla h^*\left(\frac{1}{2}\nabla h(x^k)+\frac{1}{2}\nabla h(y^k)\right).\tag{168} \end{align}\]
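As in the BPRS sketch above, scheme 163 can be instantiated in closed form under the Boltzmann-Shannon entropy with the illustrative (and purely expository) choices \(f = D_h(\cdot,p)\) and \(g = D_h(\cdot,q)\): the Bregman proximal steps 164 and 166 become weighted geometric means, the reflection steps 159 and 161 are elementwise ratios, and the averaging step 168 is an elementwise geometric mean. The sketch below runs the recursion and reports the distance of the shadow iterate \(w^k\) from the minimizer \(\sqrt{p\odot q}\); no convergence claim beyond the setting of Theorem 11 is intended.

```python
import numpy as np

# Toy instance of BDRS (157)/(163): h = Boltzmann-Shannon entropy,
# f = D_h(., p), g = D_h(., q); every step has a closed form and
# argmin f + g = sqrt(p * q) (elementwise).
rng = np.random.default_rng(0)
n = 6
p = 0.5 + rng.random(n)
q = 0.5 + rng.random(n)
x_star = np.sqrt(p * q)

gamma = 0.5
def breg_prox_kl(r, x):
    """argmin_w D_h(w, r) + (1/gamma) D_h(w, x) = r^(gamma/(1+gamma)) * x^(1/(1+gamma))."""
    return np.exp((gamma * np.log(r) + np.log(x)) / (1 + gamma))

x = np.ones(n)                      # governing sequence of the splitting
for _ in range(100):
    w = breg_prox_kl(p, x)          # step (158)/(164)
    x_bar = w ** 2 / x              # step (159): grad h*(2 grad h(w) - grad h(x))
    z = breg_prox_kl(q, x_bar)      # step (160)/(166)
    y = z ** 2 / x_bar              # step (161)
    x = np.sqrt(x * y)              # step (162)/(168): average in the mirror domain
print("||w - sqrt(p*q)|| after 100 iterations:", np.linalg.norm(w - x_star))
```

As in the Euclidean DRS, it is the shadow sequence \(w^k\) (the output of the first proximal step), rather than the governing sequence \(x^k\), that approaches the minimizer in this example.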

Theorem 11. Assume \(\textrm{im}(\nabla h^*) \subset \textrm{dom } f \cap \textrm{dom } g\), \(\|\partial f\|_\infty\leq G\), \(\|\partial g\|_\infty\leq G\), \(h\) is \(\sigma\)-strongly convex over the simplex, and \(\nabla h^*\) is Lipschitz continuous. Then BDRS 163 with \(\gamma_k=1/\sqrt{k}\) for solving 142 converges sublinearly.

Proof. Note that 164 is one step of the Bregman PPA, which yields \[\label{BPPA-inequality} \gamma_k (f(y) - f(x^{k+1})) \geq D_h(y,x^{k+1}) - D_h(y,x^k) + D_h(x^{k+1},x^k), \quad \forall y.\tag{169}\] Moreover, 165 is one step of the mirror subgradient method, which yields \[\label{mirror-subgrad-inequality} \gamma_k (f(x^k)-f(y)) \leq D_h(y,x^k) - D_h(y,x^{k+1}) - \frac{1}{2}D_h(x^{k+1},x^k)+ \frac{\gamma_k^2\|f'(x^k)\|_*^2}{2\sigma}, \quad \forall y.\tag{170}\]
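For completeness, 169 follows from the optimality condition of the proximal step (stated here for a generic step from \(x^k\) to \(x^{k+1}\)): since \(\nabla h(x^k) - \nabla h(x^{k+1}) \in \gamma_k\partial f(x^{k+1})\), convexity of \(f\) gives \[\gamma_k\big(f(y) - f(x^{k+1})\big) \geq \langle \nabla h(x^k) - \nabla h(x^{k+1}),\, y - x^{k+1}\rangle = D_h(y,x^{k+1}) + D_h(x^{k+1},x^k) - D_h(y,x^k),\] where the equality is the three-point identity 148 with \(x,y,z\) taken to be \(y, x^{k+1}, x^k\), respectively. The bound 170 can be obtained along the same lines as 149 , using in addition the \(\sigma\)-strong convexity of \(h\).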

Applying 170 to 165 and 167 , and ignoring the constant coefficient of \(\gamma_k^2\), we have \[\begin{align} \gamma_k (f(w^k)-f(y)) & \leq D_h(y,w^k) - D_h(y,\bar{x}^{k}) - \frac{1}{2}D_h(\bar{x}^{k},w^k)+ \gamma_k^2 \\ \gamma_k (g(z^k)-g(y)) & \leq D_h(y,z^k) - D_h(y,y^{k}) - \frac{1}{2}D_h(y^{k},z^k)+ \gamma_k^2. \end{align}\] Applying 169 to 164 and 166 , we have \[\begin{align} \gamma_k(f(w^k)-f(y)) &\leq D_h(y,x^k) - D_h(y,w^k) - D_h(w^k,x^k) \\ \gamma_k(g(z^k)-g(y)) & \leq D_h(y,\bar{x}^k)-D_h(y,z^k) - D_h(z^k,\bar{x}^k). \end{align}\] Combining these four inequalities, we have \[\begin{align} & 2\gamma_k(f(w^k)+g(z^k)-f(y)-g(y)) \label{combine-mirror-subgrad-BPPA} \\ \leq & D_h(y,x^k) - D_h(y,y^{k}) - \frac{1}{2}D_h(y^{k},z^k) - \frac{1}{2}D_h(\bar{x}^{k},w^k)- D_h(w^k,x^k) - D_h(z^k,\bar{x}^k)+ \gamma_k^2.\nonumber \end{align}\tag{171}\]

Note that 168 is equivalent to \[\label{BDRS-rewrite-5-equiv} \nabla h(x^{k+1}) - \nabla h(y^k) = \nabla h(x^k) - \nabla h(x^{k+1}).\tag{172}\] Using the three-point identity 148 twice, we have \[\begin{align} D_h(y,x^{k+1}) & = D_h(y,y^k) - D_h(x^{k+1},y^k)-\langle \nabla h(x^{k+1})-\nabla h(y^k),y-x^{k+1}\rangle \\ \langle \nabla h(x^k)-\nabla h(x^{k+1}),y-x^{k+1}\rangle & = D_h(y,x^{k+1}) + D_h(x^{k+1},x^k) - D_h(y,x^k). \end{align}\] Combining these two identities with 172 , we obtain \[D_h(y,x^{k+1}) = D_h(y,y^k) - D_h(x^{k+1},y^k) -D_h(y,x^{k+1}) - D_h(x^{k+1},x^k) + D_h(y,x^k),\] which is equivalent to \[\label{BDRS-last-step-3-pts-identity} 2D_h(y,x^{k+1}) = D_h(y,x^k) + D_h(y,y^k) - D_h(x^{k+1},y^k)-D_h(x^{k+1},x^k).\tag{173}\] Combining 171 and 173 , we have \[\begin{align} \label{combine-all-5} & 2\gamma_k(f(w^k)+g(z^k)-f(y)-g(y)) +2D_h(y,x^{k+1}) \\ \leq & 2D_h(y,x^k) - \frac{1}{2}D_h(y^{k},z^k) - \frac{1}{2}D_h(\bar{x}^{k},w^k)- D_h(w^k,x^k) - D_h(z^k,\bar{x}^k) - D_h(x^{k+1},y^k)-D_h(x^{k+1},x^k) + \gamma_k^2.\nonumber \end{align}\tag{174}\] Under the assumption that \(\textrm{im}(\nabla h^*)\subset \textrm{dom } f\cap \textrm{dom } g\), we know \(f(x^{k+1})\) and \(g(x^{k+1})\) are finite values. By adding and subtracting \(f(x^{k+1}) + g(x^{k+1})\) to 174 , we get \[\begin{align} & 2\gamma_k(f(x^{k+1})+g(x^{k+1})-f(y)-g(y)) +2D_h(y,x^{k+1}) \label{final-bound-inequality}\\ \leq & 2D_h(y,x^k) - \frac{1}{2}D_h(y^{k},z^k) - \frac{1}{2}D_h(\bar{x}^{k},w^k)- D_h(w^k,x^k) - D_h(z^k,\bar{x}^k) - D_h(x^{k+1},y^k)-D_h(x^{k+1},x^k) \nonumber\\ & + 2\gamma_k(f(x^{k+1})- f(w^k)+g(x^{k+1})-g(z^k))+ \gamma_k^2. \nonumber \end{align}\tag{175}\] Now we only need to show that \(2\gamma_k(f(x^{k+1})- f(w^k)+g(x^{k+1})-g(z^k))\) can be bounded by those negative terms on the right-hand side. We have \[\begin{align} & 2\gamma_k(f(x^{k+1})- f(w^k)+g(x^{k+1})-g(z^k)) \\ \leq & 2\gamma_k G (\|x^{k+1}-w^k\|_1 + \|x^{k+1}-z^k\|_1) \\ \leq & 2\gamma_k G(\|x^{k+1}-w^k\|_1 +\|z^k-w^k\|_1+\|x^{k+1}-w^k\|_1) \\ = & 2\gamma_k G(2\|x^{k+1}-w^k\|_1 +\|z^k-w^k\|_1) \\ \leq & 2\gamma_k G(2\|x^k-x^{k+1}\|_1+2\|w^k-x^k\|_1+\|z^k-w^k\|_1)\\ \leq & 2\gamma_k G(2\|x^k-x^{k+1}\|_1+2\|w^k-x^k\|_1+\|z^k-\bar{x}^k\|_1+\|\bar{x}^k-w^k\|_1). \end{align}\] Now, applying Young’s inequality to the four terms above, we obtain \[\begin{align} & 2\gamma_k(f(x^{k+1})- f(w^k)+g(x^{k+1})-g(z^k)) \\ \leq & 2\gamma_k G(2\|x^k-x^{k+1}\|_1+2\|w^k-x^k\|_1+\|z^k-\bar{x}^k\|_1+\|\bar{x}^k-w^k\|_1) \\ \leq & \left(\frac{8\gamma_k^2G^2}{\sigma} + \frac{\sigma}{2}\|x^{k+1}-x^k\|_1^2\right)+\left(\frac{8\gamma_k^2G^2}{\sigma} + \frac{\sigma}{2}\|w^k-x^k\|_1^2\right) \\ & + \left(\frac{2\gamma_k^2G^2}{\sigma} + \frac{\sigma}{2}\|z^k-\bar{x}^k\|_1^2\right) + \left(\frac{4\gamma_k^2G^2}{\sigma} + \frac{\sigma}{4}\|\bar{x}^k-w^k\|_1^2\right) \\ \leq & \gamma_k^2 + D_h(x^{k+1},x^k) + D_h(w^k,x^k) + D_h(z^k,\bar{x}^k) + \frac{1}{2}D_h(\bar{x}^k,w^k), \end{align}\] where in the last inequality we omitted the coefficient of \(\gamma_k^2\), and we used the \(\sigma\)-strong convexity of \(h\). Combining this inequality with 175 , we get \[2\gamma_k(f(x^{k+1})+g(x^{k+1})-f(y)-g(y)) +2D_h(y,x^{k+1}) \leq 2D_h(y,x^k) + \gamma_k^2.\] Now summing this over \(k=0,1,\ldots,N-1\), we get \[\begin{align} \sum_{k=0}^{N-1}\gamma_k (f(x^{k+1})+g(x^{k+1})-f(y)-g(y)) \leq D_h(y,x^0) + \sum_{k=0}^{N-1}\gamma_k^2. \end{align}\] That is, \[\min_{0\leq k\leq N-1} \left(f(x^{k+1})+g(x^{k+1})\right) - (f(y)+g(y)) \leq \frac{D_h(y,x^0)}{\sum_{k=0}^{N-1}\gamma_k} + \frac{\sum_{k=0}^{N-1}\gamma_k^2}{\sum_{k=0}^{N-1}\gamma_k}.\] By choosing \(y=x^*\) and using \(\gamma_k = \frac{1}{\sqrt{k}}\), we obtain the convergence and the sublinear rate. ◻

References↩︎

[1]
J. Douglas and H. H. Rachford. On the numerical solution of the heat conduction problem in 2 and 3 space variables. Transactions of the American Mathematical Society, 82:421–439, 1956.
[2]
P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16:964–979, 1979.
[3]
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Comp. Math. Appl., 2:17–40, 1976.
[4]
D. Gabay. Applications of the method of multipliers to variational inequalities. In M. Fortin and R. Glowinski, editors, Augmented Lagrangian Methods: Applications to the Solution of Boundary Value Problems. North-Holland, Amsterdam, 1983.
[5]
R. Glowinski and P. Le Tallec. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. SIAM, Philadelphia, Pennsylvania, 1989.
[6]
M. Fortin and R. Glowinski. Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems. North-Holland Pub. Co., 1983.
[7]
J. Eckstein. Splitting methods for monotone operators with applications to parallel optimization. PhD thesis, Massachusetts Institute of Technology, 1989.
[8]
J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55:293–318, 1992.
[9]
H. H. Bauschke and W. M. Moursi. On the Douglas-Rachford algorithm. Mathematical Programming Series A, 164:263–284, 2017.
[10]
J. M. Borwein and B. Sims. The Douglas-Rachford algorithm in the absence of convexity. In Fixed-point algorithms for inverse problems in science and engineering, pages 93–109. Springer, New York, NY, 2011.
[11]
P. L. Combettes and Jean-Christophe Pesquet. A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4):564–574, 2007.
[12]
B. F. Svaiter. Weak convergence on Douglas-Rachford method. SIAM Journal on Control and Optimization, 49(1):280–287, 2011.
[13]
G. Li and T. K. Pong. Douglas-Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Math. Program., 159:371–401, 2016.
[14]
B. He and X. Yuan. On the convergence rate of Douglas-Rachford operator splitting method. Mathematical Programming, 153(2):715–722, 2015.
[15]
E. K. Ryu. Uniqueness of DRS as the 2 operator resolvent-splitting and impossibility of 3 operator resolvent-splitting. Mathematical Programming Series A, 2020.
[16]
E. K. Ryu, Y. Liu, and W. Yin. Douglas-Rachford splitting and ADMM for pathological convex optimization. Computational Optimization and Applications, 2019.
[17]
J. Liang, J. Fadili, and G. Peyré. Local convergence properties of Douglas-Rachford and alternating direction method of multipliers. Journal of Optimization Theory and Applications, 172(3):874–913, 2017.
[18]
E. K. Ryu and W. Yin. Large-Scale Convex Optimization via Monotone Operators. Cambridge University Press (to be published), 2022.
[19]
M. N. Bùi and P. L. Combettes. The Douglas-Rachford algorithm converges only weakly. SIAM Journal on Control and Optimization, 58(2):1118–1120, 2020.
[20]
F. J. Aragón Artacho, Y. Censor, and A. Gibali. The cyclic Douglas-Rachford algorithm with \(r\)-sets-Douglas-Rachford operators. Optimization Methods & Software, 34(4):875–889, 2019.
[21]
S. B. Lindstrom and B. Sims. Survey: Sixty years of Douglas-Rachford. Journal of the Australian Mathematical Society, 110(3):333–370, 2021.
[22]
T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM J. Imaging Sci., 2:323–343, 2009.
[23]
Y. Wang, J. Yang, W. Yin, and Y. Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.
[24]
Z. Wen, D. Goldfarb, and W. Yin. Alternating direction augmented Lagrangian methods for semidefinite programming. Mathematical Programming Computation, 2:203–230, 2010.
[25]
S. Ma and N. S. Aybat. Efficient optimization algorithms for robust principal component analysis and its variants. Proceedings of the IEEE, 106(8):1411–1426, 2018.
[26]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[27]
J. Eckstein and W. Yao. Understanding the convergence of the alternating direction method of multipliers: Theoretical and computational perspectives. Pacific Journal on Optimization, 11(4):619–644, 2015.
[28]
R. Glowinski. On alternating direction methods of multipliers: A historical perspective. In W. Fitzgibbon, Y. A. Kuznetsov, P. Neittaanmäki, and O. Pironneau, editors, Modeling, Simulation and Optimization for Science and Technology, volume 34, pages 59–82. Springer Netherlands, Dordrecht, 2014.
[29]
S. Ma and M. Hong. A gentle introduction to ADMM for statistical problems. In Handbook of Computational Statistics and Data Science. John Wiley & Sons, 2020.
[30]
H. Wang and A. Banerjee. Bregman alternating direction method of multipliers. In NIPS, 2014.
[31]
L. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
[32]
Y. Censor and A. Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34:321–353, 1981.
[33]
A. R. De Pierro and A. N. Iusem. A relaxed version of Bregman’s method for convex programming. J. Optim. Theory Appl., 51(3):421–440, 1986.
[34]
R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.
[35]
K. C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35(4):1142–1168, 1997.
[36]
J. Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202–226, 1993.
[37]
A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics, John Wiley, 1983.
[38]
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.
[39]
H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42:330–348, 2017.
[40]
H. Lu, R. M. Freund, and Y. Nesterov. Relatively-smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
[41]
P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. Ser. B, 125:263–295, 2010.
[42]
Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence \(\mathcal{O}(1/k^2)\). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
[43]
Y. E. Nesterov. Introductory lectures on convex optimization: A basic course. Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.
[44]
Y. E. Nesterov. Smooth minimization for non-smooth functions. Math. Program. Ser. A, 103:127–152, 2005.
[45]
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
[46]
A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM J. on Optimization, 16(3):697–725, 2006.
[47]
F. Hanzely, P. Richtárik, and L. Xiao. Accelerated Bregman proximal gradient methods for relatively smooth convex minimization. Computational Optimization and Applications, 79(2):405–440, 2021.
[48]
D. H. Gutman and J. F. Pena. A unified framework for Bregman proximal methods: subgradient, gradient, and accelerated gradient schemes. http://www.optimization-online.org/DB_FILE/2018/12/6996.pdf, 2018.
[49]
M. Teboulle. A simplified view of first order methods for optimization. Math. Program., Ser. B, 170:67–96, 2018.
[50]
M. N. Bùi and P. L. Combettes. Bregman forward-backward operator splitting. Set-Valued and Variational Analysis, 2020.
[51]
S. Erlander. Entropy in linear programs. Mathematical Programming, 21:137–151, 1981.
[52]
J. Eriksson. An iterative primal-dual algorithm for linear programming. Technical report, Technical Report 85-10, Department of Mathematics, Linkoping University, 1985.
[53]
P. P. B. Eggermont. Multiplicative iterative algorithms for convex programming. Linear Algebra and Its Applications, 130:25–42, 1990.
[54]
Y. Censor and S. A. Zenios. The proximal minimization algorithm with D-functions. J. Optim. Theory Appl., 73:451–464, 1992.
[55]
G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3:538–543, 1993.
[56]
A. N. Iusem. On some properties of generalized proximal point methods for quadratic and linear programming. J. Optim. Theory Appl., 96:337–362, 1998.
[57]
A. Ben-Tal and M. Zibulevsky. Penalty/barrier multiplier methods for convex programming problems. SIAM J. Optimization, 7(2):347–366, 1997.
[58]
M. Teboulle. Entropic proximal mappings with applications to nonlinear programming. Mathematics of Operations Research, 17(3):670–690, 1992.
[59]
A. Auslender, M. Teboulle, and S. Ben-Tiba. Interior proximal and multiplier methods based on second order homogeneous kernels. Mathematics of Operations Research, 24(3):645–668, 1999.
[60]
A. N. Iusem, B. F. Svaiter, and M. Teboulle. Entropy-like proximal methods in convex programming. Mathematics of Operations Research, 19(4):790–814, 1994.
[61]
A. N. Iusem and M. Teboulle. Convergence rate analysis of nonquadratic proximal methods for convex and linear programming. Mathematics of Operations Research, 20(3):657–677, 1995.
[62]
H. H. Bauschke, J. M. Borwein, and P. L. Combettes. Bregman monotone optimization algorithms. SIAM J. Control. Optim., 42(2):596–636, 2003.
[63]
B. Martinet. Régularisation d’inéquations variationelles par approximations successives. Rev. Francise Informat. Recherche Opérationelle, 4:154–159, 1970.
[64]
R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optim., 14(5):877–898, 1976.
[65]
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res., 1(2):97–116, 1976.
[66]
A. N. Iusem. Augmented Lagrangian methods and proximal point methods for convex optimization. Investigación Operativa, 8:11–49, 1999.
[67]
J. Eckstein. Approximate iterations in Bregman-function-based proximal algorithms. Mathematical programming, 83(1):113–123, 1998.
[68]
M. V. Solodov and B. F. Svaiter. An inexact hybrid generalized proximal point algorithm and some new results on the theory of Bregman functions. Mathematics of Operations Research, 25:214–230, 2000.
[69]
L. Yang and K.-C. Toh. Bregman proximal point algorithm revisited: A new inexact version and its variant. https://arxiv.org/abs/2105.10370, 2021.
[70]
B. W. Kort and D. P. Bertsekas. A new penalty function method for constrained minimization. In Proceedings of the IEEE conference on decision and control, pages 162–166, 1972.
[71]
P. Tseng and D. P. Bertsekas. On the convergence of the exponential multiplier method for convex programming. Mathematical Programming, 60:1–19, 1993.
[72]
R. A. Polyak. Controlled processes in extremal and equilibrium problems. VINITI, deposited manuscript, Moscow (in Russian), 1986.
[73]
R. A. Polyak. The nonlinear rescaling principle in linear programming. Technical Report IBM Research Report RC15030, IBM T. J. Watson Research Center (Yorktown Heights, NY), 1989.
[74]
R. A. Polyak and I. Griva. Primal-dual nonlinear rescaling method for convex optimization. Journal of Optimization Theory and Applications, 122(1):111–156, 2004.
[75]
I. Griva and R. A. Polyak. Primal-dual nonlinear rescaling method with dynamic scaling parameter update. Math. Program. Ser. A, 106:237–259, 2006.
[76]
R. A. Polyak and M. Teboulle. Nonlinear rescaling and proximal-like methods in convex optimization. Math. Program., 76:265–284, 1997.
[77]
D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, 1996.
[78]
J. Eckstein. A practical general approximation criterion for methods ofmultipliers based on bregman distances. Mathematical Programming Series A, 96:61–86, 2003.
[79]
S. Yan and N. He. Bregman augmented Lagrangian and its acceleration. https://arxiv.org/abs/2002.06315, 2020.
[80]
P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Transantions on Automatic Control, 62(2):532–544, 2017.
[81]
R. I. Bot, E. R. Csetnek, and D. Meier. Variable metric ADMM for solving variational inequalities with monotone operators over affine sets. In H. H. Bauschke, R. S. Burachik, and D. R. Luke, editors, Splitting Algorithms, Modern Operator Theory, and Applications, pages 91–112. Springer, 2019.
[82]
E. Ghadimi, A. Teixeira, I. Shames, and M. Johansson. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Trans. Automatic Control, 60(3):644–658, 2015.
[83]
D. H. Peaceman and H. H. Rachford. The numerical solution of parabolic elliptic differential equations. SIAM Journal on Applied Mathematics, 3:28–41, 1955.
[84]
D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349–382, 2013.
[85]
G. B. Passty. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 72:383–390, 1979.
[86]
P. L. Combettes. Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization, 53:475–504, 2004.
[87]
S. Reich. A weak convergence theorem for the alternating method with Bregman distances. In Theory and Applications of Nonlinear Operators of Accretive and Monotone Type, pages 313–318. Marcel Dekker, New York, 1996.
[88]
S. Sabach. Products of finitely many resolvents of maximal monotone mappings in reflexive Banach spaces. SIAM Journal on Optimization, 21:1289–1308, 2011.
[89]
Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[90]
Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
[91]
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
[92]
R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific J. Math., 21:343–348, 1967.
[93]
R. I. Bot and A. Bohm. An incremental mirror descent subgradient algorithm with random sweeping and proximal step. Optimization, 68(1):33–50, 2019.

  1. The results in Sections 1-5 were written by Shiqian Ma in 2020, and the results in the Appendix were written by Shiqian Ma in Nov 2021. These results formed the first version of the paper which was ready on Nov 5, 2021. Shiqian Ma polished the writing in Sep 2025, which gives the current version.↩︎

  2. The three authors started to work on this problem back in 2020, and this draft was ready in Nov 2021. We spent the last four years trying to prove the convergence of the algorithms under the most general setting, but did not succeed. Despite that, we believe that the current contributions are significant, so we decided to share the results with the community now. Since this draft is from 2021, we apologize that the references are not up to date. We will include more recent works in a later version of the paper.↩︎

  3. Note that the EMM proposed in [70], [71], [77] allows the scalar \(\gamma_k\) to be replaced by a vector whose \(j\)-th entry is \([\gamma_k]_j\). In this case, the EMM becomes \[\begin{align} (u^k,v^k) & := \mathop{\rm argmin}_{u,v} f(u) + g(v) + \sum_{j=1}^m \frac{w_j^k}{[\gamma_k]_j}\psi([\gamma_k]_j(M_j^\top u+N_j^\top v-b_j)) \\ w_j^{k+1} & := w_j^k\nabla\psi([\gamma_k]_j(M_j^\top u^k+N_j^\top v^k-b_j)) = w_j^k e^{[\gamma_k]_j(M_j^\top u^k+N_j^\top v^k-b_j)}, \;j=1,\ldots,m. \end{align}\] Here we use a scalar \(\gamma_k\) for simplicity.↩︎