September 28, 2024
Solving optimization problems in a distributed manner is critical for systems involving multiple agents with private data. Despite substantial interest, a unified method for analyzing the convergence rates of distributed optimization algorithms is lacking. This paper introduces an energy conservation approach for analyzing continuous-time dynamical systems in dilated coordinates. Instead of directly analyzing dynamics in the original coordinate system, we establish a conserved quantity, akin to physical energy, in the dilated coordinate system. Consequently, convergence rates can be explicitly expressed in terms of the inverse time-dilation factor. Leveraging this generalized approach, we formulate a novel second-order distributed accelerated gradient flow with a convergence rate of \(\mathcal{O}\left(1/t^{2-\epsilon}\right)\) in time \(t\) for \(\epsilon > 0\). We then employ a semi-second-order symplectic Euler discretization to derive a rate-matching algorithm with a convergence rate of \(\mathcal{O}\left(1/k^{2-\epsilon}\right)\) in \(k\) iterations. To the best of our knowledge, this represents the most favorable convergence rate for any distributed optimization algorithm designed for smooth convex optimization. Its accelerated convergence behavior is benchmarked against various state-of-the-art distributed optimization algorithms on practical, large-scale problems.
Accelerated distributed optimization, Continuous-time optimization, Energy conservation, Distributed Euler-Lagrange method
Over the past few years, there has been significant progress in digital systems, communication, and sensing technologies, leading to the emergence of networked systems [1]. These systems encompass a multitude of interconnected subsystems (agents) that necessitate collaboration to attain specific global objectives. The applications of such networked systems span diverse domains, including solving economic dispatch problems in power systems [2], distributed estimation in sensor networks [3], coverage control [4], decentralized machine learning [5], among others.
The above applications often involve solving a distributed convex optimization problem, wherein each node performs local computation on local data and cooperatively reaches consensus on a global optimal solution by iteratively exchanging local decisions with its neighbors. Formally, the problem can be expressed as follows: \[\begin{align} \label{eq:DOP} \min\limits_{\{{\mathbf{x}}_i\in{\mathbb{R}}^d,\forall i\}} &\sum\limits_{i=1}^mf_i({\mathbf{x}}_i), \nonumber \\ \text{s.t. } &{\mathbf{x}}_i = {\mathbf{x}}_j, \quad \forall i,j\in\{1,2,\dots,m\}. \end{align}\tag{1}\] Here the convex function \(f_i:{\mathbb{R}}^d\to{\mathbb{R}}\) represents the local objective function of the \(i^{\text{th}}\)-agent, where \(i\in\{1,2,\dots,m\}\) and \(m\) is the number of agents in the network. Distributed optimization problems enable agents to coordinate in a distributed manner while simultaneously minimizing the overall team objective function. As a result, these problems tend to be inherently more complex than their centralized counterparts.
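As a small worked instance of 1 (our own illustration, included only for exposition), take \(m=2\) agents with scalar local costs \(f_1(x)=(x-1)^2\) and \(f_2(x)=(x+1)^2\): \[\min_{x_1,x_2\in{\mathbb{R}}}\; (x_1-1)^2+(x_2+1)^2 \quad \text{s.t. } x_1=x_2.\] The consensus constraint forces a common decision \(x\), so the problem reduces to minimizing \(f_1(x)+f_2(x)=2x^2+2\), whose minimizer is \(x_*=0\). Neither agent's local minimizer (\(1\) or \(-1\)) solves the team problem, which is precisely why coordination among the agents is required.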
Most prior research on distributed optimization has focused on creating discretized algorithms designed to ensure that each agent within the network converges to the optimal solution of 1 or to a neighborhood around it [6]–[8]. These algorithms are easy to implement and often come with theoretical guarantees on the rate and order of convergence. Additionally, their theoretical analysis provides insights into the theoretical bounds on various hyperparameters, such as the learning rate. However, a primary disadvantage of these approaches is that they are difficult to analyze and design. In recent years, continuous-time analysis of optimization algorithms has gained significant interest, primarily due to the ease of analysis and the ability to design novel algorithms [9]–[12]. The continuous-time limiting behavior allows for the application of tools from Lyapunov theory and differential equations, facilitating stability and convergence-rate analysis. However, while the analysis is relatively easier, these algorithms ultimately need to be discretized for iterative implementation. Most continuous-time algorithms either lose their accelerated convergence guarantees upon discretization, making rate-matching discretization challenging, or the resulting discretized algorithms diverge. Another common issue that plagues both discretized and continuous-time analyses is the lack of a generalized framework that can be used to analyze a broader class of distributed optimization algorithms.
In light of the above limitations, the proposed work makes the following contributions:
1. Distributed gradient flow with \(\mathcal{O}(1/t^{2-\beta})\) convergence rate: In this study, we introduce an innovative distributed continuous-time gradient flow method aimed at minimizing the sum of smooth convex functions, achieving an unprecedented convergence rate of \(\mathcal{O}\left(1/t^{2-\beta}\right)\), where \(\beta>0\) can be arbitrarily small. Notably, our proposed algorithm surpasses the previously best-known convergence rate of \(\mathcal{O}\left(1/t^{1.4-\beta}\right)\) achieved by the distributed version of Nesterov accelerated gradient descent [13]. It must be emphasized that the optimal convergence rate for centralized smooth convex function optimization is \(\mathcal{O}\left(1/t^2\right)\). Remarkably, we have (nearly) attained this optimal rate in a distributed setting.
2. Generalized framework for analysis of distributed gradient flows: One of the primary contributions of our proposed methodology is the establishment of an energy conservation perspective on optimization algorithms, akin to Lyapunov
analysis. Notably, we demonstrate that the associated energy functional remains conserved within a dilated coordinate system, where the time-dilation factor is directly linked to the rate of convergence. Consequently, this framework can be extended to
analyze the convergence rates of a wide range of distributed optimization algorithms.
3. Consistent rate-matching discretization: Employing symplectic Euler discretization, we offer a rate-matching consistent discretization, ensuring that the discretized algorithm achieves a convergence rate of \(\mathcal{O}\left(1/k^{2-\beta}\right)\), where \(k\) represents the number of iterations.
4. Experimental validation: We verify the accelerated convergence behavior of the proposed distributed optimization algorithm through experimentation on various examples, specifically focusing on those characterized by poor condition
numbers, thereby posing significant numerical challenges.
The remainder of the paper is structured as follows. Section 2 provides an overview of the key notations, definitions, and generalized conservation laws relevant to continuous-time dynamical systems. In Section 3, we introduce the rate-optimal distributed accelerated gradient flow, which is then consistently discretized in Section 4 using the Symplectic Euler method for iterative implementation. Finally, numerical experiments are discussed in Section 5.
Notation: Let \(N\) be any natural number. Throughout this paper, we use \(\nabla g\) to represent the gradient of a function \(g:\mathbb{R}^N\to\mathbb{R}\). Unless specified otherwise, we use \(\|\mathbf{v}\|\) to represent the Euclidean norm of a vector \(\mathbf{v}\in\mathbb{R}^N\), and \(\|M\|\) to denote the induced 2-norm of the matrix \(M\). The Kronecker product of two matrices is represented by \(\otimes\). We let \(\mathbf{0}_N\) denote the \(N\)-dimensional zero vector and \(O\) denote the zero matrix of appropriate dimension. We let \(\mathbf{1}_N\) denote the \(N\)-dimensional vector of ones. We denote by \(Diag(\cdot)\) a block-diagonal matrix of appropriate dimensions. We use the abbreviation SPD for symmetric positive definite. We let \({\cal L}\) denote the Laplacian matrix of the graph \(\mathbb{G}\). We let \(\bar{\lambda}_{\cal L}\) and \(\underline{\lambda}_{\cal L}\) respectively denote the largest and the smallest non-zero eigenvalues of the matrix \({\cal L}\). We let \({\mathit{X_*}}=[(\mathbf{x}_*)^\intercal,\dots,(\mathbf{x}_*)^\intercal]^\intercal\). We let \(\tilde{{\cal L}}={\cal L}\otimes I\), where \(I\) denotes the \((d\times d)\)-dimensional identity matrix. Finally, we let \(f:{\mathbb{R}}^{d}\to{\mathbb{R}}\) and \(F: {\mathbb{R}}^{md} \to {\mathbb{R}}\) denote the aggregate and the cumulative cost functions, respectively, i.e., \(f({\mathbf{x}}) = \sum_{i=1}^m f_i({\mathbf{x}})\) for all \({\mathbf{x}}\in {\mathbb{R}}^d\) and \(F({\mathit{X}}) = \sum_{i=1}^m f_i({\mathbf{x}}_i)\) for all \({\mathit{X}}\in {\mathbb{R}}^{md}\).
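The stacked notation can be made concrete with a short script. The sketch below is our own illustration: the ring graph, the dimensions, and the quadratic local costs are assumed purely for exposition. It builds the Laplacian \({\cal L}\) of a ring, forms \(\tilde{{\cal L}}={\cal L}\otimes I\), and evaluates the cumulative cost \(F\) on a stacked vector \({\mathit{X}}\in{\mathbb{R}}^{md}\).

```python
import numpy as np

# Illustrative sizes (assumed, not from the paper's experiments): m agents, d-dimensional decisions.
m, d = 5, 3

# Laplacian of an undirected ring graph on m nodes: L = D - A.
A = np.zeros((m, m))
for i in range(m):
    A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Kronecker lift acting on the stacked state X in R^{md}.
L_tilde = np.kron(L, np.eye(d))

# Assumed quadratic local costs f_i(x) = 0.5*||x - c_i||^2, only to make F concrete.
rng = np.random.default_rng(0)
C = rng.normal(size=(m, d))                    # row i is the private data c_i held by agent i
f_i = lambda i, x: 0.5 * np.sum((x - C[i]) ** 2)

def F(X):
    """Cumulative cost F(X) = sum_i f_i(x_i) evaluated on the stacked vector X = [x_1; ...; x_m]."""
    return sum(f_i(i, X[i * d:(i + 1) * d]) for i in range(m))

X = rng.normal(size=m * d)                     # stacked state
# L_tilde @ X vanishes exactly when all x_i agree, i.e., on the consensus subspace.
print(F(X), np.linalg.norm(L_tilde @ X))
```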
We will use the following definitions from convex optimization theory.
Definition 1. A differentiable function \(g:{\mathbb{R}}^N\to{\mathbb{R}}\) is convex if \(\nabla g({\mathbf{x}})^\intercal({\mathbf{y}}-{\mathbf{x}})\leq g({\mathbf{y}})-g({\mathbf{x}})\) for any \({\mathbf{x}}, {\mathbf{y}}\in {\mathbb{R}}^N\).
Definition 2. A differentiable function \(g:{\mathbb{R}}^N\to{\mathbb{R}}\) is said to be \(L_g\)-smooth if it has Lipschitz gradient with \(L_g>0\), i.e., \(\|\nabla g({\mathbf{x}})-\nabla g({\mathbf{y}})\|\leq L_g\|{\mathbf{x}}-{\mathbf{y}}\|\) for any \({\mathbf{x}}, {\mathbf{y}}\in {\mathbb{R}}^N\).
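As a standard illustration of these definitions (our own example, not part of the paper's development), the quadratic \(g({\mathbf{x}})=\frac{1}{2}{\mathbf{x}}^\intercal Q{\mathbf{x}}\) with \(Q\) SPD satisfies both: \[g({\mathbf{y}})-g({\mathbf{x}})-\nabla g({\mathbf{x}})^\intercal({\mathbf{y}}-{\mathbf{x}}) = \tfrac{1}{2}({\mathbf{y}}-{\mathbf{x}})^\intercal Q({\mathbf{y}}-{\mathbf{x}}) \geq 0, \qquad \|\nabla g({\mathbf{x}})-\nabla g({\mathbf{y}})\| = \|Q({\mathbf{x}}-{\mathbf{y}})\| \leq \|Q\|\,\|{\mathbf{x}}-{\mathbf{y}}\|,\] so \(g\) is convex and \(L_g\)-smooth with \(L_g=\|Q\|\).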
General form of conservation laws: Consider a second-order ODE given by: \[\begin{align} \label{eq:ODE-gen} a(t)\ddot{{\mathbf{w}}} + b(t)\dot{{\mathbf{w}}} + \nabla_{{\mathbf{w}}}U({\mathbf{w}},t) = 0. \end{align}\tag{2}\] Taking the inner-product with \(\dot{{\mathbf{w}}}\), one obtains: \[\begin{align} a(t)\langle\ddot{{\mathbf{w}}},\dot{{\mathbf{w}}}\rangle + b(t)\langle\dot{{\mathbf{w}}},\dot{{\mathbf{w}}}\rangle + \langle\nabla_{{\mathbf{w}}}U({\mathbf{w}},t),\dot{{\mathbf{w}}}\rangle = 0, \end{align}\] which can be rewritten as: \[\begin{align} \frac{d}{dt}\left(\frac{a(t)}{2}\|\dot{{\mathbf{w}}}\|^2\right)+\left(b(t)-\frac{\dot{a}(t)}{2}\right)\|\dot{{\mathbf{w}}}\|^2 + \frac{d}{dt}U({\mathbf{w}},t)=\frac{\partial}{\partial t}U({\mathbf{w}},t). \end{align}\] Integrating the above equation yields the conservation law: \[\begin{align} \label{eq:con-law} E(t) &\equiv \frac{a(t_0)}{2}\|\dot{{\mathbf{w}}}(t_0)\|^2 + U({\mathbf{w}}(t_0),t_0) \nonumber\\ &= \frac{a(t)}{2}\|\dot{{\mathbf{w}}}(t)\|^2 + \int\limits_{t_0}^t\left(b(s)-\frac{\dot{a}(s)}{2}\right)\|\dot{{\mathbf{w}}}(s)\|^2ds \nonumber\\ &\;+ U({\mathbf{w}}(t),t) - \int\limits_{t_0}^t\frac{\partial}{\partial s}U({\mathbf{w}}(s),s)ds. \end{align}\tag{3}\] Our approach to the design of distributed optimization algorithms revolves around choosing a dilated coordinate system \({\mathbf{w}}(t)\) such that the total energy \(E(t)\) is conserved in that coordinate system.
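As a sanity check on 3 (a standard special case, included here only for intuition), take \(a(t)\equiv 1\), \(b(t)\equiv 0\), and a time-invariant potential \(U({\mathbf{w}},t)=U({\mathbf{w}})\), i.e., the Newtonian dynamics \(\ddot{{\mathbf{w}}}+\nabla U({\mathbf{w}})=0\). Then the dissipation integral and the explicit-time integral in 3 vanish, and the conservation law reduces to the familiar conservation of mechanical energy: \[E(t) = \frac{1}{2}\|\dot{{\mathbf{w}}}(t)\|^2 + U({\mathbf{w}}(t)) = \frac{1}{2}\|\dot{{\mathbf{w}}}(t_0)\|^2 + U({\mathbf{w}}(t_0)).\] The dilated-coordinate analysis that follows uses the same template, except that \(a\), \(b\), and \(U\) carry an explicit time dependence induced by the dilation.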
Our theoretical analysis hinges on the following key assumptions:
Assumption 1. \(\left\lvert\min_{{\mathbf{x}}\in {\mathbb{R}}^d} \sum_{i=1}^m f_i({\mathbf{x}})\right\rvert < \infty\) and the solution set \(\{{\mathbf{x}}_* \in {\mathbb{R}}^d | \sum_{i=1}^m \nabla f_i({\mathbf{x}}_*) = 0_d\}\) of problem 1 is non-empty.
Assumption 2. The communication graph \({\mathbb{G}}\) is undirected and connected.
The primary objective in distributed optimization is to reduce the function value efficiently. In proving stability of equilibria or convergence of dynamical systems, Lyapunov functions are often employed as proxies for system energy. By ensuring that the rate of change of a Lyapunov function is non-positive, stability of equilibria can be deduced. Obtaining bounds on convergence rates, however, is nontrivial and requires formulating suitable explicit time-varying energy functionals that are parameterized directly in terms of the convergence rates.
One of the fundamental limitations of Lyapunov-based analysis of dynamical systems is its inability to elucidate bounds on the convergence rate, particularly when the convergence to equilibrium is asymptotic. We overcome this limitation by establishing a dissipation law for the dynamical system associated with the distributed optimization problem. Methods such as the Accelerated Gradient Method (AGM)-ODE and the Optimized Gradient Method (OGM)-ODE are shown to exhibit faster rates of convergence for single-agent (i.e., non-distributed) convex optimization problems [14], [15]. We now propose a distributed variant of the AGM-ODE and establish an \(\mathcal{O}(1/t^{2-\beta})\) convergence rate, where \(\beta > 0\) is arbitrarily small. In particular, we consider the following distributed dynamical system: \[\begin{align} \label{eq:HBDist} \ddot{{\mathit{X}}} + \frac{3}{t}\dot{{\mathit{X}}} + \frac{1}{t^\beta}\nabla F({\mathit{X}}) + k\tilde{\mathcal{L}}{\mathit{X}}= 0,\end{align}\tag{4}\] with parameters \(k,\beta>0\). Here \({\mathit{X}}\mathrel{\vcenter{:}}= [{\mathbf{x}}_1^\intercal, \dots, {\mathbf{x}}_m^\intercal]^\intercal\) represents the concatenated state of all the agents. Note that from the perspective of the \(i^{\text{th}}\)-agent, the agent runs the following distributed protocol: \[\begin{align} \ddot{\mathbf{x}}_i + \frac{3}{t}\dot{\mathbf{x}}_i + \frac{1}{t^\beta}\nabla f_i(\mathbf{x}_i) + k\sum\limits_{j\in\mathcal{N}_i}(\mathbf{x}_i - \mathbf{x}_j) = 0. \end{align}\] The exponent \(\beta\) can be arbitrarily small; however, it enforces a diminishing coefficient for the gradient term in 4 in order to ensure that the trajectories of the individual agents converge to the optimal solution \({\mathbf{x_*}}\). Although the constant-coefficient strategy (i.e., \(\beta=0\)) converges faster than the diminishing-coefficient scheme, it only reaches a small neighborhood of the optimal solution \({\mathbf{x_*}}\); see [16], [17] for example. Below we show that the distributed optimization algorithm 4 achieves a near-optimal convergence rate.
Theorem 1. Consider the distributed optimization problem given by 1 , and let the system consist of an undirected, connected network of \(m\) agents, each following the distributed protocol outlined in 4 . Under these conditions and subject to Assumptions 1-2, the states of the agents will converge to the optimal solution \({\mathbf{x_*}}\) of 1 with a convergence rate of \(\mathcal{O}(1/t^{2-\beta})\), where \(\beta>0\) is an arbitrarily small positive constant defined in 4 .
Proof. Let us consider the dilated coordinate system \({\mathit{W}}(t)\mathrel{\vcenter{:}}= t^2({\mathit{X}}(t)\!-\!{\mathit{X_*}})\); for brevity, we set the coupling gain \(k=1\) in 4 throughout the analysis. After some algebraic manipulation, 4 can be expressed in the new coordinate system as: \[\begin{align} \label{eq:ODE-in-W} \frac{\ddot{{\mathit{W}}}}{t^2} - \frac{\dot{{\mathit{W}}}}{t^3} + \underbrace{t^{2-\beta}\nabla_{{\mathit{W}}}F\left(\frac{{\mathit{W}}}{t^2}+{\mathit{X_*}}\right) + \frac{\tilde{{\cal L}}{\mathit{W}}}{t^2}}_{\nabla_{{\mathit{W}}}U({\mathit{W}},t)} = 0. \end{align}\tag{5}\] From 5 , one obtains: \[\begin{align} \label{eq:U} U({\mathit{W}},t) = \frac{1}{2t^2}{\mathit{W}}^\intercal{\tilde{{\cal L}}}{\mathit{W}}+ t^{2-\beta}\left(\!F\left(\!\frac{{\mathit{W}}}{t^2}+{\mathit{X_*}}\!\right)-F_*\!\right)\!. \end{align}\tag{6}\] Upon taking the partial derivative of \(U\) with respect to the time component \(s\), one obtains: \[\begin{align} \label{eq:Us} \frac{\partial U}{\partial s} &= -\frac{1}{s^3}{\mathit{W}}^\intercal{\tilde{{\cal L}}}{\mathit{W}}+ (2-\beta)s^{1-\beta}\left(F\left(\frac{W}{s^2}+{\mathit{X_*}}\right)-F_*\right) \nonumber \\ &\;\;\;- 2s^{1-\beta}\left\langle\nabla_{{\mathit{W}}}F\left(\frac{W}{s^2}+{\mathit{X_*}}\right),\frac{{\mathit{W}}}{s^2}\right\rangle \nonumber \\ &= -s({\mathit{X}}-{\mathit{X_*}})^\intercal{\tilde{{\cal L}}}({\mathit{X}}-{\mathit{X_*}}) - \beta s^{1-\beta}(F({\mathit{X}})-F_*) \nonumber \\ &\;\;\;-2s^{1-\beta}(F_*-F({\mathit{X}})-\nabla_{{\mathit{X}}}F({\mathit{X}})^\intercal({\mathit{X_*}}-{\mathit{X}})). \end{align}\tag{7}\] Comparing 5 with the general form 2 , one obtains: \[\begin{align} a(t) &= \frac{1}{t^2}, \qquad b(t) = -\frac{1}{t^3}, \\ U({\mathit{X}},t)&=\frac{t^2}{2}({\mathit{X}}-{\mathit{X_*}})^\intercal{\tilde{{\cal L}}}({\mathit{X}}-{\mathit{X_*}})+t^{2-\beta}(F({\mathit{X}})-F_*). \end{align}\]
Thus, the conservation law for 4 is given by: \[\begin{align} E &\equiv \frac{1}{2}\|t_0\dot{{\mathit{X}}}_0+2({\mathit{X}}_0-{\mathit{X_*}})\|^2 + \frac{t_0^2}{2}({\mathit{X}}_0-{\mathit{X_*}})^\intercal{\tilde{{\cal L}}}({\mathit{X}}_0-{\mathit{X_*}}) \nonumber \\ &\;\;\;+ t_0^{2-\beta}(F({\mathit{X}}_0)-F_*), \nonumber \end{align}\] which, by the conservation law 3 , can be expressed at any time \(t\) as: \[\begin{align} \label{eq:E-Dist-AGD} E &= \frac{1}{2}\|t\dot{{\mathit{X}}}+2({\mathit{X}}-{\mathit{X_*}})\|^2 + \frac{t^2}{2}({\mathit{X}}-{\mathit{X_*}})^\intercal{\tilde{{\cal L}}}({\mathit{X}}-{\mathit{X_*}}) \nonumber \\ &\;\;\;+t^{2-\beta}(F({\mathit{X}})-F_*) + \int\limits_{t_0}^{t}s({\mathit{X}}-{\mathit{X_*}})^\intercal{\tilde{{\cal L}}}({\mathit{X}}-{\mathit{X_*}})ds \nonumber \\ &\;\;\;+ 2\int\limits_{t_0}^{t}s^{1-\beta}(F_*-F({\mathit{X}})-\langle\nabla F({\mathit{X}}),{\mathit{X_*}}-{\mathit{X}}\rangle)ds \nonumber \\ &\;\;\;+ \beta\int\limits_{t_0}^{t}s^{1-\beta}(F({\mathit{X}})-F_*)ds\\ &\equiv 2\|{\mathit{X}}_0-{\mathit{X_*}}\|^2 \qquad \text{for } t_0=0. \end{align}\tag{8}\] Note that each term in the above energy functional is non-negative. This follows from convexity of \(F(\cdot)\), positive semi-definiteness of \({\tilde{{\cal L}}}\), and the fact that \(F_*\) is the minimum of \(F(\cdot)\). Thus, it follows that: \[\begin{align} t^{2-\beta}(F({\mathit{X}})-F_*) &\leq 2\|{\mathit{X}}_0-{\mathit{X_*}}\|^2 \\ \implies F({\mathit{X}})-F_* &\leq \frac{2\|{\mathit{X}}_0-{\mathit{X_*}}\|^2}{t^{2-\beta}} = \mathcal{O}\left(\frac{1}{t^{2-\beta}}\right), \end{align}\] which completes the proof. ◻
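As a numerical illustration of Theorem 1 (our own toy instance with quadratic local costs on a ring graph, chosen only for exposition and unrelated to the benchmark used in the experiments), the sketch below integrates 4 with scipy and monitors \(t^{2-\beta}\left(F({\mathit{X}}(t))-F_*\right)\), which the conservation argument bounds by approximately \(2\|{\mathit{X}}_0-{\mathit{X_*}}\|^2\) when the integration starts near \(t_0=0\).

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy instance (assumed): m agents on a ring, f_i(x) = 0.5*||x - c_i||^2, so x_* = mean_i c_i.
m, d, k_gain, beta = 5, 2, 5.0, 0.1
rng = np.random.default_rng(1)
C = rng.normal(size=(m, d))
A = np.zeros((m, m))
for i in range(m):
    A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1.0
L_tilde = np.kron(np.diag(A.sum(axis=1)) - A, np.eye(d))

x_star = C.mean(axis=0)                                   # optimum of sum_i f_i for this toy cost
X_star = np.tile(x_star, m)                               # stacked optimum [x_*; ...; x_*]
F = lambda X: 0.5 * np.sum((X - C.reshape(-1)) ** 2)      # cumulative cost F(X) = sum_i f_i(x_i)
F_star = F(X_star)
grad_F = lambda X: X - C.reshape(-1)                      # stacked gradient of F

def rhs(t, y):
    # y = [X; Xdot]; dynamics (4): Xddot = -(3/t) Xdot - t^{-beta} grad F(X) - k * L_tilde X.
    X, V = y[:m * d], y[m * d:]
    return np.concatenate([V, -(3.0 / t) * V - t ** (-beta) * grad_F(X) - k_gain * (L_tilde @ X)])

# Start at a small t0 > 0 to avoid the 3/t singularity; this perturbs the constant only slightly.
t0, X0 = 1e-2, rng.normal(size=m * d)
sol = solve_ivp(rhs, (t0, 100.0), np.concatenate([X0, np.zeros(m * d)]),
                t_eval=np.geomspace(1.0, 100.0, 5), rtol=1e-8, atol=1e-10)

bound = 2.0 * np.linalg.norm(X0 - X_star) ** 2            # 2*||X_0 - X_*||^2 from Theorem 1
for t, y in zip(sol.t, sol.y.T):
    # The certified quantity t^{2-beta}(F(X(t)) - F_*) should stay (roughly) below the bound.
    print(t, t ** (2 - beta) * (F(y[:m * d]) - F_star), bound)
```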
In this section, we show that discretizing 4 via a semi-second-order symplectic Euler discretization in the dilated coordinate system leads to a consistent discretized algorithm with a matching \(\mathcal{O}(1/k^{2-\beta})\) convergence rate. Even though there has been substantial prior work on continuous-time analyses of optimization algorithms, achieving a rate-matching discretized algorithm via a straightforward and “natural” discretization has proven to be surprisingly challenging.
We begin with formulating the generalized momentum via the Lagrangian formulation: \[\begin{align} \label{eq:Lagrangian} L({\mathit{W}},\dot{{\mathit{W}}},t) = \frac{1}{2t}\|\dot{{\mathit{W}}}\|^2 - tU({\mathit{W}},t). \end{align}\tag{9}\] The Euler-Lagrange equation \(\dfrac{d}{dt}\nabla_{\dot{{\mathit{W}}}}L=\nabla_{{\mathit{W}}}L\) yields ODE 5 and the conjugate momentum \({\mathit{P}}=\frac{\dot{{\mathit{W}}}}{t}=t\dot{{\mathit{X}}}+2({\mathit{X}}-{\mathit{X_*}})\). Expressing 5 in \({\mathit{W}}\) and \({\mathit{P}}\) yields: \[\begin{align} \label{eq:W-P} \dot{{\mathit{P}}} &= -t^{1-\beta}\nabla F({\mathit{X}}) - t{\tilde{{\cal L}}}{\mathit{X}}\nonumber\\ \dot{{\mathit{W}}} &= t{\mathit{P}}. \end{align}\tag{10}\] Notice that \(\ddot{{\mathit{W}}}={\mathit{P}}-t^{2-\beta}\nabla F({\mathit{X}})-t^2{\tilde{{\cal L}}}{\mathit{X}}\). Inspired by the symplectic Euler method [18], we consider alternating updates of \({\mathit{W}}\) and \({\mathit{P}}\) but use a second-order update for \({\mathit{W}}\): \[\begin{align} \label{eq:dist-P-W} {\mathit{P}}(t+h) &\approx {\mathit{P}}(t)-h(t^{1-\beta}\nabla F({\mathit{X}})+t{\tilde{{\cal L}}}{\mathit{X}}), \nonumber \\ {\mathit{W}}(t+h) &\approx {\mathit{W}}(t) + h\dot{{\mathit{W}}}(t) + \frac{h^2}{2}\ddot{{\mathit{W}}}(t) \nonumber \\ &= {\mathit{W}}(t) + ht{\mathit{P}}(t) \nonumber \\ &\;\;\;+ \frac{h^2}{2}({\mathit{P}}(t)-t^{2-\beta}\nabla F({\mathit{X}})-t^2{\tilde{{\cal L}}}{\mathit{X}}), \end{align}\tag{11}\] where \(h\) is the discretization step. Identifying \(t\) with \(kh\), and \({\mathit{W}}(kh)\), \({\mathit{P}}(kh)\) and \({\mathit{X}}(kh)\) with \({\mathit{W}}_{k}\), \({\mathit{P}}_k\) and \({\mathit{X}}_k\), respectively, we get the method: \[\begin{align} \label{eq:dist-P} {\mathit{P}}_{k+1} = {\mathit{P}}_{k} - h[(kh)^{1-\beta}\nabla F({\mathit{X}}_k) + (kh){\tilde{{\cal L}}}{\mathit{X}}_k]. \end{align}\tag{12}\] Similarly, using the definition of \({\mathit{W}}_{k}=(kh)^2({\mathit{X}}_{k}-{\mathit{X_*}})\), 11 can be expressed as: \[\begin{align} (k+1)^2h^2({\mathit{X}}_{k+1}-{\mathit{X_*}}) = k^2h^2({\mathit{X}}_{k}-{\mathit{X_*}}) + kh^2{\mathit{P}}_k \nonumber \\ +\frac{h^2{\mathit{P}}_k}{2} - \frac{h^2}{2}[(kh)^{2-\beta}\nabla F({\mathit{X}}_k)+(kh)^2{\tilde{{\cal L}}}{\mathit{X}}_k], \end{align}\] which upon rearranging leads to: \[\begin{align} \label{eq:dist-X} {\mathit{X}}_{k+1}&=\frac{k^2}{(k+1)^2}\underbrace{\left[{\mathit{X}}_k-\frac{h^2}{2}\left\{(kh)^{-\beta}\nabla F({\mathit{X}}_k)+{\tilde{{\cal L}}}{\mathit{X}}_k\right\}\right]}_{\mathrel{\vcenter{:}}={\mathit{X}}_k^{+}} \nonumber \\ &\;\;\;+\frac{2k+1}{(k+1)^2}\underbrace{\left(\frac{P_k}{2}+{\mathit{X_*}}\right)}_{\mathrel{\vcenter{:}}={\mathit{Z}}_k}. \end{align}\tag{13}\] Finally, letting \(s=h^2\), \(\theta_k=\frac{k}{2}\), we get the following algorithm from 13 : \[\begin{align} {\mathit{X}}_k^{+} &= {\mathit{X}}_k-\frac{s}{2}\left((2\theta_k h)^{-\beta}\nabla F({\mathit{X}}_k)+{\tilde{{\cal L}}}{\mathit{X}}_k\right) \nonumber \\ {\mathit{Z}}_{k+1} &= {\mathit{Z}}_{k} - s\theta_k\left((2\theta_k h)^{-\beta}\nabla F({\mathit{X}}_k)+{\tilde{{\cal L}}}{\mathit{X}}_k\right)\nonumber \\ {\mathit{X}}_{k+1} &= \frac{\theta_k^2}{\theta_{k+1}^2}{\mathit{X}}_{k}^{+} + \left(1-\frac{\theta_k^2}{\theta_{k+1}^2}\right){\mathit{Z}}_k. \label{eqn:dist95prelim} \end{align}\tag{14}\] The algorithm above in 14 employs a fixed step-size \(s\).
However, adaptive step-sizes are common in practical implementations of optimization algorithms, including learning-rate scheduling in deep neural network training [19], gradient methods with adaptive step-size [20], and adaptive filtering [21]. Motivated by their impact on stability and performance, we use an adaptive step-size \(s_k\), leading to the following algorithm.
For iteration \(k=0\), \[\begin{align} {\mathit{X}}_0^+ & = {\mathit{X}}_0, \tag{15} \\ {\mathit{Z}}_1 & = {\mathit{Z}}_0, \tag{16} \\ {\mathit{X}}_1 & = {\mathit{X}}_0, \tag{17} \end{align}\] where the step-size \(s_0 = 0\). For iteration \(k \geq 1\), \[\begin{align} {\mathit{X}}_k^+ & = {\mathit{X}}_k - \frac{s_k}{2} \left((2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k \right), \tag{18} \\ {\mathit{Z}}_{k+1} & = {\mathit{Z}}_k - s_k \theta_k \left((2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k \right), \tag{19} \\ {\mathit{X}}_{k+1} & = \frac{\theta_k^2}{\theta_{k+1}^2}{\mathit{X}}_k^+ + \left(1 - \frac{\theta_k^2}{\theta_{k+1}^2} \right) {\mathit{Z}}_{k+1}, \tag{20} \end{align}\] where \(s_k > 0\). Note that the update rule at the first iteration \(k=0\) is chosen to be different from the updates at iterations \(k \geq 1\) in order to avoid division by zero in the term \((2\theta_{k}h)^{-\beta}\) at \(k=0\).
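A minimal, self-contained sketch of the updates 15 – 20 is given below, assuming a toy problem of our own (quadratic local costs on a ring graph) rather than the paper's benchmark. For the step-size we use only the conservative bound \(s_k \leq 1/\max\{\bar{\lambda}_{\cal L}, (kh)^{-\beta}L_f\}\) appearing in Theorem 2 as a simple heuristic; the adaptive conditions involving \(a_k, b_k, w_k, r_k\) are not verified here, so this sketch does not by itself inherit the theorem's guarantee.

```python
import numpy as np

# Toy instance (assumed): m agents on a ring, f_i(x) = 0.5*||x - c_i||^2, hence L_f = 1
# for the cumulative cost and the optimum is x_* = mean_i c_i.
m, d, h, beta = 5, 2, 1.0, 0.1
rng = np.random.default_rng(2)
C = rng.normal(size=(m, d))
A = np.zeros((m, m))
for i in range(m):
    A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
L_tilde = np.kron(L, np.eye(d))
lam_bar = np.max(np.linalg.eigvalsh(L))                    # largest eigenvalue of the Laplacian
L_f = 1.0                                                  # smoothness constant of F for this toy cost
grad_F = lambda X: X - C.reshape(-1)                       # stacked gradient of F

# Initialization (15)-(17) with s_0 = 0 collapses to X_1 = X_0 and Z_1 = Z_0 = X_0.
X = rng.normal(size=m * d)
Z = X.copy()

# Iterations (18)-(20) for k >= 1, with theta_k = k/2 so that 2*theta_k*h = k*h.
for k in range(1, 2001):
    s_k = 1.0 / max(lam_bar, (k * h) ** (-beta) * L_f)     # heuristic step-size (see lead-in)
    g_k = (k * h) ** (-beta) * grad_F(X) + L_tilde @ X     # gradient-plus-consensus direction
    X_plus = X - 0.5 * s_k * g_k                           # update (18)
    Z = Z - s_k * (k / 2.0) * g_k                          # update (19)
    ratio = k ** 2 / (k + 1) ** 2                          # theta_k^2 / theta_{k+1}^2
    X = ratio * X_plus + (1.0 - ratio) * Z                 # update (20)

x_star = C.mean(axis=0)
print(np.linalg.norm(X.reshape(m, d) - x_star, axis=1))    # per-agent distance to x_* after the run
```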
To present the convergence guarantee of our algorithm 15 – 20 , we define the following notation. We let \(c_k = \frac{\theta_{k+1}}{\theta_{k+1}^2 - \theta_k^2}\), \(A_k = c_k \theta_k^2\), and the estimation error \({\overline{{\mathit{X}}}}_k = {\mathit{X}}_k - {\mathit{X}}_*\). For iteration \(k = 0\), we let \[\begin{align} & a_1 = (2\theta_1h)^{-\beta}(F({\mathit{X}}_0)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_0^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_0 \nonumber \\ & - (2\theta_1h)^{-\beta}(F({\mathit{X}}_1)-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_1^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_1 \nonumber \\ & - \left\langle(2\theta_1h)^{-\beta} \nabla F({\mathit{X}}_1) + {\tilde{{\cal L}}}{\mathit{X}}_1, {\mathit{X}}_0 - {\mathit{X}}_1\right\rangle, \tag{21} \\ & b_1 = \left\lVert(2\theta_1h)^{-\beta} \nabla F({\mathit{X}}_0) + {\tilde{{\cal L}}}{\mathit{X}}_0\right\rVert^2 \nonumber \\ & + \left\lVert(2\theta_1h)^{-\beta} \nabla F({\mathit{X}}_1) + {\tilde{{\cal L}}}{\mathit{X}}_1\right\rVert^2. \tag{22} \end{align}\] For iteration \(k \geq 1\), we let \[\begin{align} & a_{k+1} = (2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \nonumber \\ & - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \nonumber \\ & - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_k - {\mathit{X}}_{k+1}\right\rangle, \tag{23} \\ & b_{k+1} = ||(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k \nonumber \\ & - (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) - {\tilde{{\cal L}}}{\mathit{X}}_{k+1}||^2, \tag{24} \\ & \tilde{a}_{k+1} = a_{k+1} + \frac{s_k}{2} \langle (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) \nonumber \\ & + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, (2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k \rangle, \tag{25} \\ & \tilde{b}_{k+1} = \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 \nonumber \\ & + \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2, \tag{26} \\ & w_{k+1} = \langle (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) \nonumber \\ & + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, (2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k \rangle, \tag{27} \\ & r_{k+1} = -\left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*)\right\rVert^2 \nonumber \\ & + 2 \langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*),(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*) \nonumber \\ & - (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) - {\tilde{{\cal L}}}{\mathit{X}}_{k+1} \rangle. \tag{28} \end{align}\] From Theorem 1, recall that the continuous-time algorithm 4 achieves an \(\mathcal{O}(1/t^{2-\beta})\) convergence rate. Next, we prove that its discretized counterpart 15 – 20 achieves the same convergence rate. To present this result, we make the following additional assumption on the class of objective functions in 1 .
Assumption 3. The cumulative cost function \(F\) is continuously differentiable, and its gradient \(\nabla F\) is Lipschitz continuous with a constant \(L_f > 0\) (i.e., \(F\) is \(L_f\)-smooth).
Theorem 2. Consider algorithm 15 – 20 with initialization \({\mathit{X}}_0 = {\mathit{Z}}_0 \in {\mathbb{R}}^{md}\) and parameters \(h > 0\) and \(\beta \in (0,2)\). Suppose that Assumptions 1-3 hold. Let the step-size be chosen as follows.
\(s_0 = 0\).
\(s_1 \begin{cases} > 0, & \text{if } r_1 \geq 0 \\ = 0, & \text{if } r_1 < 0. \end{cases}\)
for \(k \geq 1\), \[\begin{align} s_{k+1} & \leq \begin{cases} \frac{4a_{k+1}}{b_{k+1}}, & \text{if } w_{k+1} \leq 0, r_{k+1} \geq 0 \\ \frac{4A_k a_{k+1}}{A_k b_{k+1} + \theta_{k+1} (-r_{k+1})}, & \text{if } w_{k+1} \leq 0, r_{k+1} < 0 \\ \frac{4\tilde{a}_{k+1}}{\tilde{b}_{k+1}}, & \text{if } w_{k+1} > 0, r_{k+1} \geq 0 \\ \frac{4A_k \tilde{a}_{k+1}}{A_k \tilde{b}_{k+1} + \theta_{k+1} (-r_{k+1})}, & \text{if } w_{k+1} > 0, r_{k+1} < 0. \end{cases} \end{align}\]
Additionally, for \(k \geq 2\), \[\begin{align} s_k & \leq s_{k+1}, \\ s_{k} & \leq \frac{1}{\max \left\{\overline{\lambda}_L, (k h)^{-\beta} L_f \right\}}. \end{align}\]
Then, for each iteration \(k \geq 0\), \[\begin{align} F({\mathit{X}}_k^+)-F_* & \leq \frac{2h^{\beta} \left\lVert{\mathit{X}}_0- {\mathit{X}}_*\right\rVert^2}{s_0} \frac{1}{k^{2 - \beta}}. \label{eqn:rate95dt} \end{align}\qquad{(1)}\]
Proof. We split the proof into three parts. In the first part, we define a Lyapunov-like function \(V_k\) and prove that it is non-increasing for \(k \geq 1\). In the second part, we define a slightly different Lyapunov-like function \(V'_0\) for \(k = 0\) and prove that \(V_1 \leq V'_0\). The choice of \(V'_0\) for \(k=0\) differs from \(V_k\) for \(k \geq 1\) because \(V_k\) contains \((2\theta_{k}h)^{-\beta} = (kh)^{-\beta}\), which leads to division by zero at \(k=0\). In the final part, we combine these results and obtain the convergence rate.
Part-I: Consider an arbitrary iteration \(k \geq 1\). We define the function \[V_k = 2c_k \theta_k^2 \left((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2} {\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k - \frac{s_k}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 \right) + \frac{1}{s_k} \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rVert^2. \label{eqn:Vk95def}\tag{29}\] From \(\theta_k = \frac{k}{2}\), we have \(c_k = \frac{2(k+1)}{2k+1} \geq 1\). So, \(A_k \geq \theta_k^2\). Using 19 , \[\frac{1}{s_k} \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rVert^2 - \frac{1}{s_{k+1}} \left\lVert{\mathit{Z}}_{k+2} - {\mathit{X}}_*\right\rVert^2 \\ = \frac{1}{s_k} \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rVert^2 - \frac{1}{s_{k+1}} \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_* - s_{k+1} \theta_{k+1} \left((2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1} \right)\right\rVert^2 \\ = (\frac{1}{s_k} - \frac{1}{s_{k+1}}) \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rVert^2 - s_{k+1} \theta_{k+1}^2 \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2 + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rangle.\] Using \(A_k \geq \theta_k^2\) and \(s_k \leq s_{k+1}\), from above we have \[\begin{align} & \frac{1}{s_k} \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rVert^2 - \frac{1}{s_{k+1}} \left\lVert{\mathit{Z}}_{k+2} - {\mathit{X}}_*\right\rVert^2 \\ & \geq 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rangle \\ & - s_{k+1} A_{k+1} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2. \end{align}\] Using the above we have \[\begin{align} & V_k - V_{k+1} \\ & \geq 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \\ & - \frac{s_k}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 ) \\ & - 2A_{k+1} ((2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \\ & + \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \\ & + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rangle. 
\end{align}\] Since \(s_k \leq s_{k+1}\), \[\begin{align} & V_k - V_{k+1} \\ & \geq 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 ) \\ & - 2A_{k+1} ((2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \\ & + \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \\ & + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rangle \\ & = 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 ) \\ & - 2A_k ((2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \\ & + \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \\ & + 2(A_k - A_{k+1} + \theta_{k+1}) ((2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) \\ & + \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \\ & + \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \\ & - 2\theta_{k+1} ((2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \\ & + \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \\ & + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rangle. \end{align}\] By definition of \(F_*\), we have \(F({\mathit{X}}_{k+1}) \geq F_*\). Under Assumption 2, \({\tilde{{\cal L}}}\) is symmetric positive semi-definite. By definition, \(A_k - A_{k+1} + \theta_{k+1} = \frac{(k+1)^2}{8k^2+16k+6} \geq 0\). So, the third term on the R.H.S. 
above is non-negative, and we have \[\begin{align} & V_k - V_{k+1} \nonumber \\ & \geq 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \nonumber \\ & - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \nonumber \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 \nonumber \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \nonumber \\ & - 2\theta_{k+1} ((2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \nonumber \\ & + \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \nonumber \\ & + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rangle \nonumber \\ & = 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \nonumber \\ & - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \nonumber \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 \nonumber \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \nonumber \\ & + 2\theta_{k+1} ((2\theta_{k+1}h)^{-\beta}(F_*-F({\mathit{X}}_{k+1})) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \nonumber \\ & - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_* - {\mathit{X}}_{k+1}\right\rangle \nonumber \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \nonumber \\ & + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_{k+1}\right\rangle. \label{eqn:diff951} \end{align}\tag{30}\] Under Assumption 2, \({\tilde{{\cal L}}}{\mathit{X}}_* = 0_{md}\). So, we rewrite the second term on the R.H.S. 
above as \[2\theta_{k+1} ((2\theta_{k+1}h)^{-\beta}(F_*-F({\mathit{X}}_{k+1})) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_* - {\mathit{X}}_{k+1}\right\rangle - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*) - (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) - {\tilde{{\cal L}}}{\mathit{X}}_{k+1} - (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*)\right\rVert^2) = 2\theta_{k+1} ((2\theta_{k+1}h)^{-\beta}(F_*-F({\mathit{X}}_{k+1})) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_* - {\mathit{X}}_{k+1}\right\rangle - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*) + {\tilde{{\cal L}}}{\mathit{X}}_* - (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) - {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2 - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*)\right\rVert^2 + \frac{s_{k+1}}{4} 2 \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*),(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*) - (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) - {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rangle).\] Under Assumption 3, for any \(k\), \((2\theta_{k+1}h)^{-\beta} F({\mathit{X}}) + \frac{1}{2}{\overline{{\mathit{X}}}}^{\top} {\tilde{{\cal L}}}{\overline{{\mathit{X}}}}\) is \(\tilde{L}_{k+1}\)-smooth, where \(\tilde{L}_{k+1} = 2 \max \left\{\overline{\lambda}_L, (k+1)^{-\beta} h^{-\beta} L_f \right\}\). Moreover, \({\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k = {\tilde{{\cal L}}}{\mathit{X}}_k - {\tilde{{\cal L}}}{\mathit{X}}_* = {\tilde{{\cal L}}}{\mathit{X}}_k\). So, for \(s_{k+1} \leq \frac{2}{\tilde{L}_{k+1}}\), \[(2\theta_{k+1}h)^{-\beta}(F_*-F({\mathit{X}}_{k+1})) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_* - {\mathit{X}}_{k+1}\right\rangle - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_*) + {\tilde{{\cal L}}}{\mathit{X}}_* - (2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) - {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2 \geq 0.\] From above, the second term on the R.H.S. in 30 is lower bounded by \(\frac{s_{k+1}}{4} 2\theta_{k+1} r_{k+1}\), where \(r_{k+1}\) is defined by 28 . 
From 30 , then we have \[\begin{align} & V_k - V_{k+1} \nonumber \\ & \geq 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \nonumber \\ & - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \nonumber \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 \nonumber \\ & - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) \nonumber \\ & + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{Z}}_{k+1} - {\mathit{X}}_{k+1}\right\rangle \nonumber \\ & + \frac{s_{k+1}}{4} 2\theta_{k+1} r_{k+1}. \nonumber \end{align}\] Upon substituting above from 20 , \[V_k - V_{k+1} \geq 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2) + 2\theta_{k+1} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, ({\mathit{X}}_{k+1} - {\mathit{X}}_k^+) \frac{\theta_k^2}{\theta_{k+1}^2 - \theta_k^2}\right\rangle + \frac{s_{k+1}}{4} 2\theta_{k+1} r_{k+1}. \label{eqn:diff952}\tag{31}\] From definitions of \(A_k\), \(c_k\), and upon substituting from 18 , we rewrite the second term on the R.H.S. above as \[2A_k \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_{k+1} - {\mathit{X}}_k\right\rangle + 2A_k \frac{s_k}{2} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, (2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rangle.\] Upon substituting from above in 31 , \[\begin{align} V_k - V_{k+1} \geq 2A_k q_{k+1} + \frac{s_{k+1}}{4} 2\theta_{k+1} r_{k+1}, \label{eqn:diff953} \end{align}\tag{32}\] where we denote \[q_{k+1} = (2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_k - {\mathit{X}}_{k+1}\right\rangle - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rVert^2 - \frac{s_{k+1}}{4} \left\lVert(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}\right\rVert^2 + \frac{s_k}{2} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, (2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rangle. 
\label{eqn:qk}\tag{33}\] Under Assumptions 2-3, \((2\theta_{k+1}h)^{-\beta}(F({\mathit{X}})-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}\) is convex. So, \[\begin{align} & (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \\ & - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \\ & - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_k - {\mathit{X}}_{k+1}\right\rangle \geq 0. \end{align}\] Since \((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) > (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_k)-F_*)\), from above we have \[\begin{align} & (2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k \\ & - (2\theta_{k+1}h)^{-\beta}(F({\mathit{X}}_{k+1})-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_{k+1}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_{k+1} \\ & - \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, {\mathit{X}}_k - {\mathit{X}}_{k+1}\right\rangle > 0. \end{align}\] We consider two possible cases. Recall the definition of \(w_{k+1}\) from 27 .
First, we consider the case \(w_{k+1} \leq 0\). Since \(s_k \leq s_{k+1}\), we get \[\frac{s_k}{2} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, (2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rangle \geq \frac{s_{k+1}}{2} \left\langle(2\theta_{k+1}h)^{-\beta} \nabla F({\mathit{X}}_{k+1}) + {\tilde{{\cal L}}}{\mathit{X}}_{k+1}, (2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\mathit{X}}_k\right\rangle.\] Upon substituting from above in 33 , \[\begin{align} q_{k+1} \geq a_{k+1} - \frac{s_{k+1}}{4} b_{k+1}, \end{align}\] where \(a_{k+1}, b_{k+1}\) are defined by 23 24 .
Upon substituting from above in 32 , \[\begin{align} V_k - V_{k+1} \geq 2A_k (a_{k+1} - \frac{s_{k+1}}{4} b_{k+1}) + \frac{s_{k+1}}{4} 2\theta_{k+1} r_{k+1}. \end{align}\] If \(r_{k+1} \geq 0\), then \(s_{k+1} \leq \frac{4a_{k+1}}{b_{k+1}}\) implies that \(V_k - V_{k+1} \geq 0\). If \(r_{k+1} < 0\), then \(s_{k+1} \leq \frac{4A_k a_{k+1}}{A_k b_{k+1} + \theta_{k+1} (-r_{k+1})}\) implies that \(V_k - V_{k+1} \geq 0\).
Next, we consider the other case \[w_{k+1} > 0.\] Recall the definition of \(\tilde{a}_{k+1}, \tilde{b}_{k+1}\) from 25 26 . We note \(\tilde{a}_{k+1} > 0\). Upon substituting from above in 32 , \[\begin{align} V_k - V_{k+1} \geq 2A_k (\tilde{a}_{k+1} - \frac{s_{k+1}}{4} \tilde{b}_{k+1}) + \frac{s_{k+1}}{4} 2\theta_{k+1} r_{k+1}. \end{align}\] If \(r_{k+1} \geq 0\), then \(s_{k+1} \leq \frac{4\tilde{a}_{k+1}}{\tilde{b}_{k+1}}\) implies that \(V_k - V_{k+1} \geq 0\). If \(r_{k+1} < 0\), then \(s_{k+1} \leq \frac{4A_k \tilde{a}_{k+1}}{A_k \tilde{b}_{k+1} + \theta_{k+1} (-r_{k+1})}\) implies that \(V_k - V_{k+1} \geq 0\).
Part-II: Consider the first iteration \(k = 0\). We define the function \[V'_0 = 2c_0 \theta_0^2 \left((2\theta_{1}h)^{-\beta}(F({\mathit{X}}_0)-F_*) + \frac{1}{2} {\overline{{\mathit{X}}}}_0^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_0 - \frac{s_0}{4} \left\lVert(2\theta_{1}h)^{-\beta} \nabla F({\mathit{X}}_0) + {\tilde{{\cal L}}}{\mathit{X}}_0\right\rVert^2 \right) + \frac{1}{s_0} \left\lVert{\mathit{Z}}_{1} - {\mathit{X}}_*\right\rVert^2. \label{eqn:V095def}\tag{34}\] The choice of a different \(V'_0\), rather than proceeding with the previously defined \(V_k\), is motivated by the division by zero appearing in the term \((2\theta_{k}h)^{-\beta}\) at \(k=0\). In contrast, \(V'_0\) is well-defined, since \(\theta_1 > 0\). We proceed in the same way as in Part-I above, with 15 – 17 instead of 18 – 20 . In this way, instead of 32 and 33 , we obtain \[\begin{align} V'_0 - V_1 \geq 2A_0 q_1 + \frac{s_1}{4} 2\theta_1 r_1, \label{eqn:diff950} \end{align}\tag{35}\] where we denote \[q_1 = (2\theta_1h)^{-\beta}(F({\mathit{X}}_0)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_0^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_0 - (2\theta_1h)^{-\beta}(F({\mathit{X}}_1)-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_1^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_1 - \left\langle(2\theta_1h)^{-\beta} \nabla F({\mathit{X}}_1) + {\tilde{{\cal L}}}{\mathit{X}}_1, {\mathit{X}}_0 - {\mathit{X}}_1\right\rangle - \frac{s_1}{4} \left\lVert(2\theta_1h)^{-\beta} \nabla F({\mathit{X}}_0) + {\tilde{{\cal L}}}{\mathit{X}}_0\right\rVert^2 - \frac{s_1}{4} \left\lVert(2\theta_1h)^{-\beta} \nabla F({\mathit{X}}_1) + {\tilde{{\cal L}}}{\mathit{X}}_1\right\rVert^2. \label{eqn:q1}\tag{36}\] Under Assumptions 2-3, \((2\theta_1 h)^{-\beta}(F({\mathit{X}})-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}\) is convex. So, \[\begin{align} & (2\theta_1h)^{-\beta}(F({\mathit{X}}_0)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_0^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_0 \\ & - (2\theta_1h)^{-\beta}(F({\mathit{X}}_1)-F_*) - \frac{1}{2}{\overline{{\mathit{X}}}}_1^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_1 \\ & - \left\langle(2\theta_1h)^{-\beta} \nabla F({\mathit{X}}_1) + {\tilde{{\cal L}}}{\mathit{X}}_1, {\mathit{X}}_0 - {\mathit{X}}_1\right\rangle \geq 0. \end{align}\] Upon substituting from above in 36 , \[\begin{align} q_1 \geq a_1 - \frac{s_1}{4} b_1, \end{align}\] where \(a_1, b_1\) are defined by 21 and 22 . Upon substituting from above in 35 , \[\begin{align} V'_0 - V_1 \geq 2A_0 (a_1 - \frac{s_1}{4} b_1) + \frac{s_1}{4} 2\theta_1 r_1 = \frac{s_1}{4} 2\theta_1 r_1, \end{align}\] since \(A_0 = c_0 \theta_0^2 = 0\). If \(r_1 \geq 0\), then \(s_1 \geq 0\) implies that \(V'_0 - V_1 \geq 0\). If \(r_1 < 0\), then \(s_1 = 0\) implies that \(V'_0 - V_1 = 0\).
Part-III: Finally, we will obtain the rate of convergence. From Part-I, we have \(V_k \leq V_1\) for all \(k \geq 1\). From Part-II, we have \(V_1 \leq V'_0\). So, \(V_k \leq V'_0\) for \(k \geq 1\). Since \(\theta_0 = 0\), from the definition 34 , \[\begin{align} V_k \leq V'_0 = \frac{1}{s_0} \left\lVert{\mathit{Z}}_1 - {\mathit{X}}_*\right\rVert^2. \label{eqn:V0} \end{align}\tag{37}\] From 16 at \(k=0\), \({\mathit{Z}}_1 = {\mathit{Z}}_0 = {\mathit{X}}_0\). Upon substituting from above in 37 , \[\begin{align} V_k \leq \frac{1}{s_0} \left\lVert{\mathit{X}}_0- {\mathit{X}}_*\right\rVert^2. \end{align}\] It was shown earlier that, under Assumption 3, for any \(k\), \((2\theta_{k+1}h)^{-\beta} F({\mathit{X}}) + \frac{1}{2}{\overline{{\mathit{X}}}}^{\top} {\tilde{{\cal L}}}{\overline{{\mathit{X}}}}\) is \(\tilde{L}_{k+1}\)-smooth. Moreover, \({\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k = {\tilde{{\cal L}}}{\mathit{X}}_k\). So, for \(s_{k+1} \leq \frac{2}{\tilde{L}_{k+1}}\), \[(2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k^+)-F_*) + \frac{1}{2}({\overline{{\mathit{X}}}}_k^+)^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k^+ \leq (2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k)-F_*) + \frac{1}{2}{\overline{{\mathit{X}}}}_k^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k - \frac{s_k}{4} \left\lVert(2\theta_{k}h)^{-\beta} \nabla F({\mathit{X}}_k) + {\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k\right\rVert^2.\] Upon multiplying both sides above by \(2A_k\) and adding \(\frac{1}{s_k} \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rVert^2\), from the definition 29 we get \[\begin{align} & 2A_k ((2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k^+)-F_*) + \frac{1}{2}({\overline{{\mathit{X}}}}_k^+)^{\top}{\tilde{{\cal L}}}{\overline{{\mathit{X}}}}_k^+) \\ & + \frac{1}{s_k} \left\lVert{\mathit{Z}}_{k+1} - {\mathit{X}}_*\right\rVert^2 \leq V_k \leq \frac{1}{s_0} \left\lVert{\mathit{X}}_0- {\mathit{X}}_*\right\rVert^2. \end{align}\] The above implies that \[\begin{align} & 2A_k (2\theta_{k}h)^{-\beta}(F({\mathit{X}}_k^+)-F_*) \leq \frac{1}{s_0} \left\lVert{\mathit{X}}_0- {\mathit{X}}_*\right\rVert^2 \\ & \implies F({\mathit{X}}_k^+)-F_* \leq \frac{(2\theta_{k}h)^{\beta}}{2 A_k s_0} \left\lVert{\mathit{X}}_0- {\mathit{X}}_*\right\rVert^2. \end{align}\] From the definition of \(A_k\) and \(\theta_k\), we get \[\begin{align} F({\mathit{X}}_k^+)-F_* & \leq \frac{(k h)^{\beta}}{s_0 k^2} \frac{2(k + 0.5)}{k+1} \left\lVert{\mathit{X}}_0- {\mathit{X}}_*\right\rVert^2. \end{align}\] Since \(\frac{k+0.5}{k+1} \leq 1\), we have \[\begin{align} F({\mathit{X}}_k^+)-F_* & \leq \frac{2h^{\beta} \left\lVert{\mathit{X}}_0- {\mathit{X}}_*\right\rVert^2}{s_0} \frac{k^{\beta}}{k^2}. \end{align}\] The proof is complete. ◻
Theorem 2 signifies that the discretization of 4 in the dilated coordinates, described earlier in this section, is rate-matching with its continuous-time counterpart from Theorem 1. We complete the presentation of our theoretical results by highlighting that the quantities \(a_k, b_k, \tilde{a}_k, \tilde{b}_k, w_k, r_{k}\) appearing in the sufficient conditions on the step-size \(s_k\) at iteration \(k\) depend on the current state \(X_k\) and the previous state \(X_{k-1}\); thus, these conditions can be verified at each iteration.



Figure 1: (a) \(\left\lVert{\tilde{{\cal L}}}X_k\right\rVert\), (b) \(\left\lVert\nabla F(X_k)\right\rVert\), and (c) \(F(X_k)\) of the discretized distributed AGM algorithm 15 – 20 with different values of \(\beta\) for solving the binary logistic regression problem on the MNIST dataset.



Figure 2: Comparison of (a) \(\left\lVert{\tilde{{\cal L}}}X_k\right\rVert\), (b) \(\left\lVert\nabla F(X_k)\right\rVert\), and (c) \(F(X_k)\) of different distributed optimization algorithms for solving the binary logistic regression problem on the MNIST dataset.
In this section, we conduct numerical simulations to show the efficiency of the proposed algorithm 15 – 20 and to validate the convergence guarantee obtained in Theorem 2.
Simulation set-up and algorithm parameters: We consider solving the logistic regression problem on a subset of the MNIST dataset. Specifically, we consider the binary classification problem between the digits one and five using the MNIST [22] dataset. Our data consists of all the samples labeled as either digit one or digit five from the MNIST training set. We consider the logistic regression model and conduct experiments to minimize the cross-entropy error on the training data. The data points are distributed equally among \(m=5\) agents. The communication topology is a ring. In this case, there are multiple solutions of 1 . For a fair comparison, the parameters in each algorithm are tuned such that the respective algorithm converges in as few iterations as possible. Their specific values are as follows. DGD [6]: \(\alpha = 0.001\) and \(W\) is the matrix of Metropolis weights; PI [23]: \(\alpha = \beta = 100, h = 0.001\); PI consensus [24]: \(\alpha = 0.01, \beta = 0.1\); DIGing [25]: \(\alpha = 0.001\) and \(W\) is the matrix of Metropolis weights; pre-conditioned PI consensus [26]: \(\alpha = 2, h = 0.09, \beta = 20, \gamma = 20\); proposed algorithm: \(h = 10, \beta = 0.1\). The initial states \({\mathbf{x}}_i(0)\) are the same in all the algorithms, with each entry drawn from the normal distribution with zero mean and standard deviation \(0.1\). \({\mathbf{v}}_i(0)\) in the PI consensus and pre-conditioned PI consensus algorithms is chosen similarly, and \({\mathbf{v}}_i(0)\) for the PI algorithm is set according to [23]. Note that the implementation of a few recently proposed distributed optimization algorithms, such as APM-C [27], Mudag [28], DAccGD [29], and ACC-SONATA [30], requires the value of the strong-convexity coefficient of the cumulative or the aggregate cost function. Since neither the aggregate cost nor the cumulative cost is strongly convex in this logistic regression problem, these algorithms are not applicable here.
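A sketch of the data preparation described above is given below. It is our own reconstruction for illustration: the exact preprocessing, label encoding, bias handling, and regularization used in the paper's experiments are not specified here, the paper uses only the MNIST training split (which we do not separate out), and `fetch_openml` is merely one convenient way to obtain MNIST.

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Binary classification between digits 1 and 5 from MNIST, with data split equally among m = 5 agents.
m = 5
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mask = np.isin(mnist.target, ["1", "5"])
X_data = mnist.data[mask] / 255.0                     # pixel features scaled to [0, 1] (assumed)
y_data = (mnist.target[mask] == "5").astype(float)    # label 1 for digit 5, label 0 for digit 1

idx = np.array_split(np.random.permutation(len(y_data)), m)   # equal split of samples across agents

def local_loss_and_grad(w, i):
    """Cross-entropy loss and gradient of agent i's local logistic-regression cost (no bias term)."""
    Xi, yi = X_data[idx[i]], y_data[idx[i]]
    p = 1.0 / (1.0 + np.exp(-(Xi @ w)))               # sigmoid predictions
    loss = -np.mean(yi * np.log(p + 1e-12) + (1 - yi) * np.log(1 - p + 1e-12))
    grad = Xi.T @ (p - yi) / len(yi)
    return loss, grad

# Initial model drawn entry-wise from N(0, 0.1^2), matching the initialization described in the text.
w0 = 0.1 * np.random.randn(X_data.shape[1])
print(local_loss_and_grad(w0, 0)[0])
```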
Results: From Figure 2, the proposed algorithm converges much faster than the other algorithms for solving the logistic regression problem. We note that the ratio between the largest and the smallest non-zero singular values of the Hessian of \(F\) is of order \(10^{11}\). Thus, even for ill-conditioned problems, the proposed algorithm can converge significantly faster than the other algorithms. The convergence guarantee in Theorem 2 holds for \(\beta \in (0,2)\). So, in Figure 1, we show the effect of tuning the parameter \(\beta\) in the proposed algorithm. We simulate the algorithm with four choices of \(\beta \in \{1,0.5,0.1,0.01\}\). The values of \(h, {\mathbf{x}}_i(0), {\mathbf{v}}_i(0)\) are chosen as described earlier and are kept fixed while only \(\beta\) is varied. We observe convergence even for a small value of \(\beta = 0.01\), as expected from Theorem 2. Also, from Theorem 2, the convergence rate is \(\mathcal{O}\left(\frac{1}{k^{2-\beta}}\right)\). Accordingly, from Figure 1, faster convergence is observed for smaller values of \(\beta\).
Mayank Baranwal and Kushal Chakrabarti are with Tata Consultancy Services Research, Mumbai, Maharashtra 400607, India. M. Baranwal is also a faculty member with the Systems and Control group at the Indian Institute of Technology Bombay, Maharashtra 400076, India. (e-mails: {baranwal.mayank, chakrabarti.k}@tcs.com)