Lyapunov Analysis For Monotonically Forward-Backward Accelerated Algorithms


Abstract

In the realm of gradient-based optimization, Nesterov’s accelerated gradient method (NAG) is a landmark advancement, achieving an accelerated convergence rate that outperforms the vanilla gradient descent method for convex functions. However, for strongly convex functions, whether NAG converges linearly remains an open question, as noted in the comprehensive review by [1]. This issue, aside from the critical step size, was addressed by [2] using a high-resolution differential equation framework. Furthermore, [3] introduced a monotonically convergent variant of NAG, referred to as M-NAG. Despite these developments, the Lyapunov analysis presented in [2] cannot be directly extended to M-NAG. In this paper, we propose a modification to the iterative relation by introducing a gradient term, leading to a new gradient-based iterative relation. This adjustment allows for the construction of a novel Lyapunov function that excludes kinetic energy. The linear convergence derived from this Lyapunov function is independent of both the parameters of the strongly convex functions and the step size, yielding a more general and robust result. Notably, we observe that the gradient iterative relation derived from M-NAG is equivalent to that from NAG when the position-velocity relation is applied. However, the Lyapunov analysis does not rely on the position-velocity relation, allowing us to extend the linear convergence to M-NAG. Finally, by utilizing two proximal inequalities, which serve as the proximal counterparts of strongly convex inequalities, we extend the linear convergence to both the fast iterative shrinkage-thresholding algorithm (FISTA) and its monotonic counterpart (M-FISTA).

1 Introduction↩︎

Over the past few decades, machine learning has emerged as a major application area for optimization algorithms. A central challenge in this field is unconstrained optimization, which involves minimizing a convex objective function without constraints. Mathematically, this problem is formulated as: \[\min_{x \in \mathbb{R}^d} f(x).\] At the heart of addressing this challenge are gradient-based methods, which have driven significant advancements due to their computational efficiency and low memory requirements. These advantages make them particularly well-suited for large-scale machine learning tasks. As a result, gradient-based methods have not only propelled theoretical progress but also facilitated practical breakthroughs, establishing them as indispensable tools in both machine learning research and real-world applications.

One of the simplest and most foundational gradient-based methods is the vanilla gradient descent, which dates back to [4]. Given an initial point \(x_0 \in \mathbb{R}^d\), the vanilla gradient descent method is implemented recursively as: \[x_{k+1} = x_{k} - s \nabla f(x_{k}),\] where \(s\) denotes the step size. While this method is effective, it can be slow to converge for convex functions. In the mid-1980s, [5] introduced a two-step accelerated algorithm, now known as Nesterov’s accelerated gradient method (NAG). Given any initial \(x_0 = y_0 \in \mathbb{R}^d\)NAG employs the following update rules: \[\tag{1} \begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} = y_{k} - s\nabla f(y_{k}), \tag{2} \\ & y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}), \tag{3} \end{empheq}\] where \(s\) is the step size, and \(r \geq 2\) is the momentum parameter. Compared to the vanilla gradient descent method, NAG achieves accelerated convergence rates. However, it does not guarantee monotone convergence, that is, the function values \(f(x_k)\) do not necessarily decrease at every iteration. As a result, oscillations or overshoots may occur, especially as iterates approach the minimizer, as illustrated in .


Figure 1: Numerical comparison of the iterative progression of function values for NAG and M-NAG. The momentum parameter is set to \(r=2\), and the step size is \(s = 0.4\). Both optimization algorithms are applied to the quadratic function \(f(x_1, x_2) = 5 \times 10^{-3}x_1^2 + x_2^2\).

This lack of stability and predictability makes it difficult to monitor progress and ensure reliable convergence. To address this issue, [3] proposed a variant that achieves both acceleration and monotonicity, referred to as the Monotone Nesterov’s accelerated gradient method (M-NAG). This method incorporates a comparison step to stabilize convergence while maintaining the acceleration properties of NAG. Starting with any initial \(x_0 = y_0 \in \mathbb{R}^d\), the recursive update rules of M-NAG are as follows: \[\tag{4} \begin{empheq}[left=\empheqlbrace]{align} & z_{k} = y_{k} - s\nabla f(y_{k}), \tag{5} \\ & x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; f(z_{k}) \leq f(x_{k}), \\ & x_{k}, && \text{if}\; f(z_{k}) > f(x_{k}), \end{align} \right. \tag{6} \\ & y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}) + \frac{k+r}{k+r+1} (z_{k} - x_{k+1}), \tag{7} \end{empheq}\] where \(s\) is the step size and \(r \geq 2\) is the momentum parameter. Unlike NAG, M-NAG ensures that the function values are non-increasing at each iteration, thereby providing a more stable convergence path. As illustrated in Figure 1, M-NAG effectively mitigates oscillations while preserving the acceleration characteristic of NAG, leading to more stable and predictable convergence behavior.
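For concreteness, the following Python sketch implements the NAG updates 2 — 3 and the M-NAG updates 5 — 7 on the quadratic from Figure 1. The momentum parameter \(r = 2\) and step size \(s = 0.4\) are taken from the caption of Figure 1, while the initial point and iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

# Quadratic test function from Figure 1: f(x1, x2) = 5e-3 * x1^2 + x2^2.
f = lambda x: 5e-3 * x[0] ** 2 + x[1] ** 2
grad_f = lambda x: np.array([1e-2 * x[0], 2.0 * x[1]])

def nag(x0, s=0.4, r=2, iters=200):
    """NAG, eqs. (2)-(3)."""
    x, y = x0.copy(), x0.copy()
    values = [f(x)]
    for k in range(iters):
        x_next = y - s * grad_f(y)                       # gradient step (2)
        y = x_next + k / (k + r + 1) * (x_next - x)      # momentum step (3)
        x = x_next
        values.append(f(x))
    return values

def m_nag(x0, s=0.4, r=2, iters=200):
    """M-NAG, eqs. (5)-(7): the candidate z_k is accepted only if it does not increase f."""
    x, y = x0.copy(), x0.copy()
    values = [f(x)]
    for k in range(iters):
        z = y - s * grad_f(y)                            # gradient step (5)
        x_next = z if f(z) <= f(x) else x                # comparison step (6)
        y = x_next + k / (k + r + 1) * (x_next - x) + (k + r) / (k + r + 1) * (z - x_next)  # (7)
        x = x_next
        values.append(f(x))
    return values

x0 = np.array([10.0, 1.0])   # arbitrary initial point for illustration
print(nag(x0)[-1], m_nag(x0)[-1])
```

By construction, the comparison step 6 makes the recorded values of \(f(x_k)\) for M-NAG non-increasing, whereas the NAG trace may exhibit the oscillations visible in Figure 1.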

A notable advantage of NAG is that it does not require prior knowledge of the modulus of strong convexity. For strongly convex functions, [2] utilized the high-resolution differential equation framework [6] to establish linear convergence, thereby addressing an open problem posed by [1]. The natural next step is to generalize the Lyapunov analysis proposed in [2] from NAG to M-NAG. However, due to the introduction of the comparison step, the phase-space representation that works for NAG cannot be directly applied to M-NAG, making such a generalization quite challenging. This leads to the following question:

  • Does M-NAG converge linearly for strongly convex functions?

1.1 Two key observations↩︎

In this paper, we make two key observations for NAG, which allow us to answer the above question through Lyapunov analysis.

1.1.1 Gradient iterative relation↩︎

Recall the high-resolution differential equation framework presented in [7], [8], where Lyapunov analysis based on the implicit-velocity scheme has been shown to outperform the gradient-correction scheme. Accordingly, we adopt the phase-space representation derived from the implicit-velocity scheme. To facilitate this approach, we introduce a new iterative velocity sequence \(\{v_{k}\}_{k=0}^{\infty}\) such that \(x_{k} - x_{k-1} = \sqrt{s}v_{k}\). Using this formulation, NAG can be reformulated as follows: \[\tag{8} \begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} - x_{k} = \sqrt{s} v_{k+1}, \tag{9} \\ & v_{k+1} - v_{k} = -\frac{r+1}{k+r} v_{k} - \sqrt{s} \nabla f\left( y_{k} \right), \tag{10} \end{empheq}\] where the sequence \(\{y_{k}\}_{k=0}^{\infty}\) satisfies the following relationship: \[\label{eqn:32nag-phase-x-y-v} y_k = x_{k} + \frac{k-1}{k+r} \cdot \sqrt{s}v_k.\tag{11}\] This phase-space representation allows for a more refined analysis of NAG’s convergence properties. The key idea behind constructing a Lyapunov function, based on the high-resolution differential equation framework [6], is the mixed energy, which is derived from the velocity iterative relation. By substituting 9 into 10 , we obtain the following iterative relation as \[\underbrace{k\sqrt{s}v_{k+1} + r x_{k+1} }_{ \mathbf{R}_{k+1} } = \underbrace{ (k-1)\sqrt{s}v_{k} + rx_k }_{ \mathbf{R}_{k} } - \left( k+r\right)s \nabla f(y_{k}), \label{eqn:32nag-phase-iteration}\tag{12}\] where we denote a new iterative sequence \(\mathbf{R}_k = (k-1)\sqrt{s}v_{k} + rx_k\), which generates the mixed energy as shown in [2]. As demonstrated in [2], a Lyapunov function is used to derive the linear convergence for NAG. However, due to the existence of kinetic energy, we cannot directly generalize the Lyapunov analysis from NAG to M-NAG.
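As a quick numerical sanity check (not part of the analysis), the phase-space representation 9 — 11 can be verified to generate exactly the same iterates as the original two-step scheme 2 — 3 when initialized with \(v_0 = 0\), which is consistent with \(y_0 = x_0\). The quadratic test function below is the illustrative one from Figure 1.

```python
import numpy as np

grad_f = lambda x: np.array([1e-2 * x[0], 2.0 * x[1]])   # gradient of f(x1, x2) = 5e-3 x1^2 + x2^2
s, r, iters = 0.4, 2, 50
x0 = np.array([10.0, 1.0])

# Original two-step NAG, eqs. (2)-(3).
x, y = x0.copy(), x0.copy()
xs_nag = [x.copy()]
for k in range(iters):
    x_next = y - s * grad_f(y)
    y = x_next + k / (k + r + 1) * (x_next - x)
    x = x_next
    xs_nag.append(x.copy())

# Implicit-velocity phase-space form, eqs. (9)-(11), with v_0 = 0.
x, v = x0.copy(), np.zeros(2)
xs_phase = [x.copy()]
for k in range(iters):
    y = x + (k - 1) / (k + r) * np.sqrt(s) * v                 # relation (11)
    v = v - (r + 1) / (k + r) * v - np.sqrt(s) * grad_f(y)     # eq. (10)
    x = x + np.sqrt(s) * v                                     # eq. (9)
    xs_phase.append(x.copy())

print(max(np.linalg.norm(a - b) for a, b in zip(xs_nag, xs_phase)))   # ~ 1e-15
```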

For convex functions, as shown in [7], [8], the iterative sequence \(\mathbf{R}_{k}\) in 12 is used to construct a Lyapunov function, given by: \[\label{eqn:32lyapunov-cvx} \mathcal{E}(k) = sk(k+r)\left( f(x_k) - f(x^{\star}) \right) + \frac{1}{2} \left\| (k-1)\sqrt{s}v_{k} + r(x_k - x^{\star}) \right\|^2.\tag{13}\] By utilizing the fundamental inequality for strongly convex functions, we can show that the iterative difference satisfies the following inequality: \[\label{eqn:32lyapunov-cvx-iter} \mathcal{E}(k+1) - \mathcal{E}(k) \leq - \mu s \left[ C_1(k+1)\left( f(x_{k+1}) - f(x^{\star}) \right) + C_2(k) s\left\| v_{k} \right\|^2 + C_{3}(k)\left\| x_k - x^{\star} \right\|^2 \right],\tag{14}\] where \(C_1(k) = \Theta(k^2)\), \(C_2(k) = \Theta(k^2)\), and \(C_3(k) = \Theta(k)\).2 However, comparing 13 and 14 , the right-hand side of the inequality 14 cannot be made proportional to the Lyapunov function 13 , owing to the mismatch between the potential energy, which involves \(x_{k+1}\), and the mixed energy, which involves \(v_k\) and \(x_k\).

To eliminate the kinetic energy, we introduce a new gradient term, transforming the iteration into a gradient-based iterative form: \[\begin{gather} \underbrace{k\sqrt{s}v_{k+1} + rx_{k+1} - \left( k + r + 1 \right)s\nabla f(y_{k+1}) }_{ \mathbf{S}_{k+1} } \\ = \underbrace{(k-1)\sqrt{s}v_{k} + r x_k - \left( k+r\right)s \nabla f(y_{k})}_{ \mathbf{S}_{k} } - \left( k + r + 1 \right)s\nabla f(y_{k+1}), \label{eqn:32nag-phase-iteration-new} \end{gather}\tag{15}\] where the new iterative sequence is given by \(\mathbf{S}_k = \mathbf{R}_k - \left( k+r\right)s \nabla f(y_{k})\). The first key observation is this gradient iterative relation 15 , which enables us to construct a novel Lyapunov function and derive linear convergence. Notably, the convergence rate does not depend on any parameter other than the momentum parameter \(r\), in contrast to [2], as detailed in .

1.1.2 Monotonicity implicitly embedded in NAG↩︎

The second key observation pertains to the connection between NAG and M-NAG. By substituting 5 into 7 and reformulating, we derive the following iterative relation: \[(k + r + 1) y_{k+1} - (k + 1) x_{k+1} = (k + r)y_{k} - kx_k - (k + r)s\nabla f(y_k). \label{eqn:32m-nag-iteration}\tag{16}\] Next, by incorporating the gradient term, we can transform the iterative relation 16 into a gradient-based form, yielding: \[\begin{gather} \underbrace{(k + r + 1) y_{k+1} - (k + 1) x_{k+1} - \left( k + r + 1 \right)s\nabla f(y_{k+1})}_{\mathbf{T}_{k+1}} \\ = \underbrace{ (k + r)y_{k} - kx_k - (k + r)s\nabla f(y_k) }_{\mathbf{T}_k} - \left( k + r + 1 \right)s\nabla f(y_{k+1}). \label{eqn:32m-nag-iteration-new} \end{gather}\tag{17}\] Substituting the position-velocity relation 11 into the sequence \(\mathbf{S}_{k}\), we find that the two iterative sequences are identical, i.e., \(\mathbf{S}_k = \mathbf{T}_k\). The detailed calculation is as follows: \[\mathbf{S}_k = (k-1)\sqrt{s}v_{k} + r x_k - \left( k+r\right)s \nabla f(y_{k}) = (k + r)y_{k} - kx_k - (k + r)s\nabla f(y_k) = \mathbf{T}_k.\] Thus, the relationship between the two gradient iterative relations, 15 and 17 , relies solely on the position-velocity relation 11 . Since the position-velocity relation 11 merely introduces a new iterative sequence \(v_k\), which is free to choose, the two gradient iterative relations, 15 and 17 , are essentially identical and can be used interchangeably.
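Both identities can also be checked numerically. The sketch below (under the illustrative quadratic and parameters from Figure 1) runs NAG, forms \(v_k = (x_k - x_{k-1})/\sqrt{s}\), and verifies that \(\mathbf{S}_k = \mathbf{T}_k\) and that the recursion 15 holds up to rounding error; the iteration count is arbitrary.

```python
import numpy as np

grad_f = lambda x: np.array([1e-2 * x[0], 2.0 * x[1]])
s, r, K = 0.4, 2, 30
rs = np.sqrt(s)

# Run NAG (2)-(3) and record the sequences {x_k} and {y_k}.
xs, ys = [np.array([10.0, 1.0])], [np.array([10.0, 1.0])]
for k in range(K):
    x_next = ys[k] - s * grad_f(ys[k])
    xs.append(x_next)
    ys.append(x_next + k / (k + r + 1) * (x_next - xs[k]))

def S(k):
    # S_k = (k-1) sqrt(s) v_k + r x_k - (k+r) s grad f(y_k), with v_k = (x_k - x_{k-1}) / sqrt(s)
    v = (xs[k] - xs[k - 1]) / rs
    return (k - 1) * rs * v + r * xs[k] - (k + r) * s * grad_f(ys[k])

def T(k):
    # T_k = (k+r) y_k - k x_k - (k+r) s grad f(y_k)
    return (k + r) * ys[k] - k * xs[k] - (k + r) * s * grad_f(ys[k])

# S_k = T_k (via the position-velocity relation (11)) and the recursion (15)/(17).
print(max(np.linalg.norm(S(k) - T(k)) for k in range(1, K)))
print(max(np.linalg.norm(S(k + 1) - S(k) + (k + r + 1) * s * grad_f(ys[k + 1]))
          for k in range(1, K - 1)))
```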

With this key observation, we can extend the Lyapunov analysis from NAG to M-NAG. Therefore, it is sufficient to employ Lyapunov analysis to establish the linear convergence for NAG involving the gradient iterative relation 17 and the position-velocity relation 11 , without needing the position iteration 9 , as detailed in . This insight reveals that the underlying philosophy of the M-NAG iteration, 5 — 7 , is to focus on the iterative sequence \(\{x_k\}_{k=0}^{\infty}\) to track the monotonicity of function values, while allowing the sequence \(\{y_{k}\}_{k=0}^{\infty}\) to adjust correspondingly. In fact, in addition to the freely chosen position-velocity relation 11 , NAG still consists of two relations: the position iteration 9 and the gradient iterative relation 17 . With the position-velocity relation 11 alone, we cannot directly transform the gradient iterative relation 17 into the velocity iteration 10 , which still requires the position iteration 9 . M-NAG, by contrast, involves only the freely chosen position-velocity relation 11 and the gradient iterative relation 17 .

1.2 Overview of contributions↩︎

In this study, building upon the two key observations outlined above, we make the following contributions to forward-backward accelerated algorithms.

  • In contrast to the previous work [2], we introduce the gradient term to the iterative relation 12 , resulting in a new gradient-based iterative relation 15 . This adjustment facilitates the construction of a novel Lyapunov function. Unlike the Lyapunov function developed in [2], our new Lyapunov function not only eliminates the kinetic energy, but also guarantees linear convergence for NAG. Moreover, the convergence rate derived from this Lyapunov function is independent of both the parameters of the strongly convex functions and the step size, offering a more general and robust result.

  • Through some simple substitution, we derive the gradient iterative relation 17 for M-NAG. We further observe that this gradient iterative relation 17 is equivalent to the one in 15 when the position-velocity relation 11 is applied. Additionally, we find that the derivation of linear convergence requires only the gradient iterative relation 17 , and is independent of the position iteration 9 . As a result, we extend the linear convergence from NAG to M-NAG.

  • Finally, building on the two proximal inequalities for composite functions derived in [2], [9], which are the proximal counterparts of the two strongly convex inequalities, we extend the linear convergence to the two proximal algorithms, the fast iterative shrinkage-thresholding algorithm (FISTA) and its monotonic counterpart (M-FISTA).

1.3 Related works and organization↩︎

In the history of gradient-based optimization algorithms, a significant milestone was the introduction of NAG, a two-step forward-backward algorithm originally proposed by [5]. This pioneering work revolutionized optimization techniques by incorporating acceleration into the vanilla gradient descent method, leading to substantial improvements in convergence rates. Building upon this breakthrough, [10] introduced a proximal version of the fundamental inequality, which paved the way for the development of FISTA. This new algorithm extended the power of NAG to handle composite objective functions, making it a more efficient tool for a wide range of applications, including signal processing and image reconstruction. Further advancing the field, [3] proposed a monotonically convergent version of the forward-backward accelerated gradient algorithms, ensuring that the algorithm’s performance would steadily improve over time. Simultaneously, adaptive restart strategies were explored to maintain monotonicity and enhance the robustness of forward-backward accelerated algorithms, with some contributions from [11] and [12].

Over the past decade, there has been increasing interest in understanding the acceleration phenomenon in optimization algorithms through the lens of differential equations. This exploration began with works [13], [14], which sparked a surge of research into the dynamics underlying accelerated methods. The topic gained further prominence with the development of a low-resolution differential equation for modeling NAG [15], a variational perspective on acceleration [16], and studies focusing on the faster convergence rate of function values [17]. The acceleration mechanism was ultimately clarified through comparisons between NAG and Polyak’s heavy-ball method, with pivotal insights emerging from the high-resolution differential equation framework introduced by [6]. This framework revealed that gradient correction is effectively achieved through an implicit velocity update, providing a more comprehensive understanding of how acceleration arises. Subsequent works [7], [18] further demonstrated that this framework is particularly well-suited for NAG. Building on this, progress was made by extending the framework to address composite optimization problems, encompassing both convex and strongly convex functions [8], [19]. This extension was achieved by refining the proximal inequality and generalizing the framework to accommodate a wider range of optimization scenarios, including the overdamped case, as shown in [20]. These advancements have significantly deepened our understanding of acceleration mechanisms and paved the way for more efficient, robust, and versatile optimization algorithms.

The remainder of this paper is organized as follows. Section 2 introduces the foundational definitions and inequalities for strongly convex functions and composite functions, which serve as preliminaries for the analysis. Section 3 outlines the construction of the Lyapunov function and establishes the linear convergence for NAG and FISTA. Section 4 extends the linear convergence to the monotonically accelerated forward-backward algorithms, M-NAG and M-FISTA, leveraging the novel Lyapunov function that excludes the kinetic energy. Finally, Section 5 concludes this paper and proposes potential avenues for future research.

2 Preliminaries↩︎

In this paper, we adopt the notations that are consistent with those used in [6], [9], [21], with slight adjustments tailored to the specific context of our analysis. Let \(\mathcal{F}^0(\mathbb{R}^d)\) denote the class of continuous convex functions defined on \(\mathbb{R}^d\). Specifically, \(g \in \mathcal{F}^{0}(\mathbb{R}^d)\) if it satisfies the convex condition: \[g\left( \alpha x + (1 - \alpha)y \right) \leq \alpha g(x) + (1 - \alpha)g(y),\] for any \(x,y \in \mathbb{R}^d\) and \(\alpha \in [0, 1]\). A subclass of \(\mathcal{F}^0(\mathbb{R}^d)\), denoted \(\mathcal{F}^1_L(\mathbb{R}^d)\), consists of functions that are continuously differentiable and have Lipschitz continuous gradients. Specifically, \(f \in \mathcal{F}^{1}_{L}(\mathbb{R}^d)\) if \(f \in \mathcal{F}^{0}(\mathbb{R}^d)\) and its gradient satisfies: \[\label{eqn:32grad-lip} \| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \|,\tag{18}\] for any \(x, y \in \mathbb{R}^d\). Next, we introduce the function class \(\mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\), a subclass of \(\mathcal{F}^1_L(\mathbb{R}^d)\), where each function is \(\mu\)-strongly convex for some \(0 < \mu \leq L\). Specifically, \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) if \(f \in \mathcal{F}^1_L(\mathbb{R}^d)\) and satisfies the \(\mu\)-strongly convex condition: \[\label{eqn:32defn-scvx} f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2,\tag{19}\] for any \(x, y \in \mathbb{R}^d\). For any \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\), the following fundamental inequality can be derived: \[\label{eqn:32fund-inq-smooth} f(y - s\nabla f(y)) - f(x) \leq \left\langle \nabla f(y), y - x \right\rangle - \frac{\mu}{2} \|y - x\|^2- \left( s - \frac{Ls^2}{2} \right) \|\nabla f(y)\|^2,\tag{20}\] for any \(x,y \in \mathbb{R}^d\). Additionally, we denote \(x^\star\) as the unique minimizer of the objective function. Furthermore, for any \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\), we have the following inequality \[\label{eqn:32key-inq-smooth} \left\| \nabla f(y) \right\|^2 \geq 2\mu \left( f(y - s\nabla f(y)) - f(x^{\star}) \right),\tag{21}\] for any \(y \in \mathbb{R}^d\).
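As a sanity check, the two inequalities 20 and 21 can be verified numerically on a strongly convex quadratic; the Hessian, sample size, and tolerance below are illustrative choices, with \(\mu\) and \(L\) read off from the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([0.5, 1.0, 4.0])          # Hessian of f; hence mu = 0.5 and L = 4.0
mu, L = 0.5, 4.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_star = np.zeros(3)                  # unique minimizer
s = 0.9 / L                           # any step size in (0, 1/L)

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # Fundamental inequality (20).
    lhs = f(y - s * grad(y)) - f(x)
    rhs = grad(y) @ (y - x) - 0.5 * mu * np.linalg.norm(y - x) ** 2 \
          - (s - 0.5 * L * s ** 2) * np.linalg.norm(grad(y)) ** 2
    assert lhs <= rhs + 1e-12
    # Gradient lower bound (21).
    assert np.linalg.norm(grad(y)) ** 2 >= 2 * mu * (f(y - s * grad(y)) - f(x_star)) - 1e-12
print("inequalities (20) and (21) hold on all samples")
```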

We now shift focus to a composite function \(\Phi= f + g\), where \(f \in \mathcal{S}_{\mu,L}^1(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Building upon the methodologies established in [10], [15], we introduce the concept of the \(s\)-proximal value, which plays a pivotal role in our analysis.

Definition 1 (\(s\)-proximal value). Let the step size satisfy \(s \in (0, 1/L)\). For any \(f\in \mathcal{S}_{\mu,L}^1(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\), the \(s\)-proximal value is defined as \[\label{eqn:32proximal-value} P_s(x) := \mathop{\arg\min}_{y\in\mathbb{R}^d}\left\{ \frac{1}{2s}\left\| y - \left(x - s\nabla f(x)\right) \right\|^2 + g(y) \right\},\tag{22}\] for any \(x \in \mathbb{R}^d\). The \(s\)-proximal value \(P_s(x)\) in 22 minimizes a weighted sum of the squared Euclidean distance between \(y\) and the gradient-based update \(x - s\nabla f(x)\), along with the regularization term \(g(y)\). The introduction of the \(s\)-proximal subgradient offers the flexibility to handle composite objective functions, enabling efficient minimization of the form \(f(x)+g(x)\), where \(g(x)\) may be non-smooth but proximally simple. When \(g\) is specified as the \(\ell_1\)-norm, i.e., \(g(x) = \lambda \|x\|_1\), a closed-form expression for the \(s\)-proximal value can be derived. Specifically, for any \(x \in \mathbb{R}^d\), the \(i\)-th component of \(s\)-proximal value is given by: \[P_s(x)_i = \big(\left|\left(x - s\nabla f(x)\right)_i\right| - \lambda s\big)_+ \text{sgn}\big(\left(x - s\nabla f(x)\right)_i\big),\] where \(i=1,\ldots,d\) and \(\text{sgn}(\cdot)\) denotes the sign function. This closed-form expression is particularly useful, as it provides an efficient way to compute the \(s\)-proximal value for the \(\ell_1\)-regularization term, which is widely used in sparse optimization problems.

Definition 2 (\(s\)-proximal subgradient). For any \(x \in \mathbb{R}^d\), the \(s\)-proximal subgradient is defined as: \[\label{eqn:32subgradient} G_s(x): = \frac{x - P_s(x)}{s},\tag{23}\] where the \(s\)-proximal value \(P_s(x)\) is given by 22 . The subgradient \(G_s(x)\), as defined in 23 , captures the direction of improvement for the composite function \(\Phi\), based on the \(s\)-proximal update. It can be viewed as a generalization of the traditional gradient, incorporating the proximal step with respect to the regularizer \(g\).
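As an illustration of Definitions 1 and 2, the following sketch implements the \(s\)-proximal value and the \(s\)-proximal subgradient for \(g(x) = \lambda \|x\|_1\) via the closed-form soft-thresholding expression above; the smooth part \(f\), the value of \(\lambda\), and the step size are illustrative inputs rather than choices made in the paper.

```python
import numpy as np

def prox_value(x, grad_f, s, lam):
    """s-proximal value P_s(x) in (22) for g = lam * ||.||_1: soft-thresholding of x - s * grad_f(x)."""
    u = x - s * grad_f(x)
    return np.sign(u) * np.maximum(np.abs(u) - lam * s, 0.0)

def prox_subgradient(x, grad_f, s, lam):
    """s-proximal subgradient G_s(x) = (x - P_s(x)) / s, as in (23)."""
    return (x - prox_value(x, grad_f, s, lam)) / s

# Example with the quadratic f(x1, x2) = 5e-3 x1^2 + x2^2 and lam = 0.1 (illustrative values).
grad_f = lambda x: np.array([1e-2 * x[0], 2.0 * x[1]])
x = np.array([10.0, 1.0])
print(prox_value(x, grad_f, s=0.4, lam=0.1), prox_subgradient(x, grad_f, s=0.4, lam=0.1))
```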

With the definition of the \(s\)-proximal subgradient (Definition 2), we extend the two key inequalities, 20 and 21 , to the proximal cases. We now present a proximal version of the fundamental inequality 20 for a composite function \(\Phi\). This result, rigorously established in [9], is stated as follows:

Lemma 1 (Lemma 4 in [9]). Let \(\Phi = f + g\) be a composite function with \(f \in \mathcal{S}_{\mu, L}^1(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Then, the following inequality \[\label{eqn:32fund-inq-composite} \Phi(y - sG_s(y)) - \Phi(x) \leq \left\langle G_s(y), y - x \right\rangle - \frac{\mu}{2}\|y - x\|^2 - \left( s - \frac{Ls^2}{2} \right) \|G_s(y)\|^2,\tag{24}\] holds for any \(x, y \in \mathbb{R}^d\).

Finally, we present a key inequality for the \(s\)-proximal subgradient, which will be used in the subsequent sections. This result, previously established in [2], is stated as follows:

Lemma 2 (Lemma 4.3 in [2]). Let \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Then, the \(s\)-proximal subgradient, as defined in 23 , satisfies the following inequality: \[\label{eqn:32key-inq-composite} \|G_s(y)\|^2 \geq 2\mu \left( \Phi(y - sG_s(y)) - \Phi(x^{\star}) \right),\tag{25}\] for any \(y \in \mathbb{R}^d\).

3 Lyapunov analysis for NAG and FISTA↩︎

In this section, we first outline the process of constructing a novel Lyapunov function, following the principled way presented in [18]. Specifically, we introduce the iterative sequence \(\mathbf{S}_{k}\) in 15 , which generates the mixed term that plays a pivotal role in the construction of this Lyapunov function. Building on this, we then derive the linear convergence rate for NAG applied to strongly convex functions and formulate a theorem to characterize the result. Finally, we apply Lemmas 1 and 2 to extend the linear convergence rate to the proximal setting, focusing on FISTA.

3.1 The construction of a novel Lyapunov function↩︎

Following the principled way outlined in [18], we construct a novel Lyapunov function in two main steps: first, by introducing the iterative sequence \(\mathbf{S}_{k}\) for the mixed energy, and then by adjusting the coefficient of the potential energy to align with this mixed energy.

  • Construction of mixed energy Recall the iterative relation 15 . In the context of convergence analysis, it is not the iterative point \(x_k\) itself that matters, but rather its difference from the optimal solution, \(x_k - x^{\star}\). To account for this, we introduce a term \(-rx^{\star}\) to both sides of the gradient iterative relation 15 , yielding: \[\begin{gather} k\sqrt{s}v_{k+1} + r ( x_{k+1} - x^{\star} ) - \left( k + r + 1 \right)s\nabla f(y_{k+1}) \\ = (k-1)\sqrt{s}v_{k} + r ( x_k - x^{\star} ) - \left( k+r\right)s \nabla f(y_{k}) - \left( k + r + 1 \right)s\nabla f(y_{k+1}). \label{eqn:32nag-phase-iteration-new-xstar} \end{gather}\tag{26}\] This relation naturally leads to the mixed energy, which is given by: \[\label{eqn:32mix} \mathcal{E}_{mix}(k) = \frac{1}{2} \left\| (k-1)\sqrt{s}v_{k} + r (x_{k} - x^{\star}) - s (k + r) \nabla f(y_k) \right\|^2.\tag{27}\] Next, we examine the iterative difference of this mixed energy using the iterative relation 26 : \[\begin{align} \mathcal{E}&_{mix}(k + 1) - \mathcal{E}_{mix}(k) \nonumber \\ & = \left\langle - \left( k + r + 1 \right)s\nabla f(y_{k+1}), k\sqrt{s}v_{k+1} + r ( x_{k+1} - x^{\star} ) - \frac{\left( k + r + 1 \right)s}{2} \cdot \nabla f(y_{k+1}) \right\rangle. \label{eqn:32mix-iter1} \end{align}\tag{28}\] Substituting 11 into 28 , we derive the following result for the iterative difference: \[\begin{align} \mathcal{E}_{mix}(k + 1) - \mathcal{E}_{mix}(k) = & \underbrace{- k(k+1)s\sqrt{s} \left\langle \nabla f(y_{k+1}), v_{k+1}\right\rangle}_{\mathbf{I}} \nonumber \\ & - r (k + r + 1) s \left\langle \nabla f(y_{k+1}), y_{k + 1} - x^{\star} \right\rangle \nonumber \\ & + \frac{(k+r+1)^2s^2}{2}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32mix-iter2} \end{align}\tag{29}\]

  • Construction of potential energy In contrast to the Lyapunov function 13 , previously used for convex functions in [7], [8], we now modify the potential energy by replacing \(f(x_{k})\) with \(f(x_{k+1})\). This adjustment aligns the formulation of the potential energy with the mixed energy given in 27 . Additionally, we introduce a dynamic coefficient to the potential energy, which is expressed as: \[\label{eqn:32pot} \mathcal{E}_{pot}(k) := s \tau(k) \left( f(x_{k+1}) - f(x^{\star}) \right).\tag{30}\] The iterative difference of this potential energy is then given by: \[\begin{align} \mathcal{E}_{pot}(k+1) &- \mathcal{E}_{pot}(k) \nonumber \\ & = s \tau(k) \left( f(x_{k+2}) - f(x_{k+1}) \right) + s\left( \tau(k+1) - \tau(k) \right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter1} \end{align}\tag{31}\] Next, by substituting \(x_{k+2}\) and \(x_{k+1}\) into the fundamental inequality 20 , we obtain the following inequality \[\begin{align} f(x_{k+2}) - f(x_{k+1}) \leq & \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1}\right\rangle \nonumber \\ & - \frac{\mu}{2} \left\| y_{k+1} - x_{k+1} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2 \label{eqn:32fund-inq-xkyk-nag} \end{align}\tag{32}\] Substituting 32 into the iterative difference 31 , we derive the following bound on the iterative difference of the potential energy: \[\begin{align} \mathcal{E}&_{pot}(k+1) - \mathcal{E}_{pot}(k) \nonumber \\ & \leq s \tau(k) \left( \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle - \frac{\mu}{2} \left\| y_{k+1} - x_{k+1} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2 \right) \nonumber \\ & \mathrel{\phantom{\leq}} + s \left( \tau(k+1) - \tau(k) \right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter2} \end{align}\tag{33}\] Substituting 11 into 33 , we refine the above inequality 33 as: \[\begin{align} \mathcal{E}&_{pot}(k+1) - \mathcal{E}_{pot}(k) \nonumber \\ & \leq s \tau(k) \left( \underbrace{\frac{k\sqrt{s}\left\langle \nabla f(y_{k+1}), v_{k+1} \right\rangle }{k+r+1} }_{\mathbf{II}} - \frac{\mu}{2} \cdot \frac{k^2 s\left\| v_{k+1} \right\|^2}{(k+r+1)^2} - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2 \right) \nonumber \\ & \mathrel{\phantom{\leq}} + s \left( \tau(k+1) - \tau(k) \right) \left( f(x_{k+2}) - f(x^{\star}) \right) \label{eqn:32pot-iter3} \end{align}\tag{34}\] To ensure that the term \(\mathbf{I} + (k+1)(k+r+1)s \cdot \mathbf{II} = 0\), we choose a dynamic coefficient as \(\tau(k) = (k+1)(k+r+1)\).

With the mixed energy defined in 27 and the potential energy defined in 30 , we construct the Lyapunov function as follows: \[\begin{align} \mathcal{E}(k) = &\; s(k+1)(k+r+1)\left( f(x_{k+1}) - f(x^{\star}) \right) \nonumber \\ &\; + \frac{1}{2}\left\| (k-1)\sqrt{s}v_{k} + r (x_{k} - x^{\star}) - s (k + r) \nabla f(y_k) \right\|^2. \label{eqn:32lyapunov-nag} \end{align}\tag{35}\] This Lyapunov function encapsulates both the mixed and potential energies and is a key tool for analyzing the linear convergence of the iterative process.
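The resulting Lyapunov function can be evaluated directly along NAG iterates. The sketch below (on the illustrative quadratic from Figure 1, for which \(\mu = 0.01\) and \(L = 2\)) computes \(\mathcal{E}(k)\) from 35 and compares the observed per-step ratio \(\mathcal{E}(k+1)/\mathcal{E}(k)\) with the contraction factor \(\left[1 + \mu s (1 - Ls)/4\right]^{-1}\) established in the next subsection; the iteration count is arbitrary.

```python
import numpy as np

f = lambda x: 5e-3 * x[0] ** 2 + x[1] ** 2                 # mu = 0.01, L = 2 (illustrative quadratic)
grad_f = lambda x: np.array([1e-2 * x[0], 2.0 * x[1]])
mu, L, s, r = 0.01, 2.0, 0.4, 2
rs = np.sqrt(s)
x_star = np.zeros(2)

# Generate NAG iterates via (2)-(3).
xs, ys = [np.array([10.0, 1.0])], [np.array([10.0, 1.0])]
for k in range(200):
    x_next = ys[k] - s * grad_f(ys[k])
    xs.append(x_next)
    ys.append(x_next + k / (k + r + 1) * (x_next - xs[k]))

def lyapunov(k):
    # Lyapunov function (35), with v_k = (x_k - x_{k-1}) / sqrt(s).
    v = (xs[k] - xs[k - 1]) / rs
    mix = (k - 1) * rs * v + r * (xs[k] - x_star) - s * (k + r) * grad_f(ys[k])
    return s * (k + 1) * (k + r + 1) * (f(xs[k + 1]) - f(x_star)) + 0.5 * mix @ mix

ratios = [lyapunov(k + 1) / lyapunov(k) for k in range(1, 150)]
print(max(ratios), 1 / (1 + mu * s * (1 - L * s) / 4))     # observed vs. guaranteed contraction
```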

3.2 Linear convergence of NAG↩︎

Using the Lyapunov function 35 , we can rigorously establish the linear convergence of NAG. The result is formally presented in the following theorem:

Theorem 3. Let \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\). Given any step size \(0 < s < 1/L\), there exists a positive integer \(K: = \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by NAG2 and 3 , with any initial point \(x_0 = y_0 \in \mathbb{R}^d\), satisfies the following inequality \[\label{eqn:32nag-rate} f(x_k) - f(x^{\star}) \leq \frac{(r +1)\left( f(x_1) - f(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{36}\] for any \(k \geq \max\left\{1,K \right\}\).

Proof of Theorem 3. We begin by deriving an expression for the change in the Lyapunov function \(\mathcal{E}(k)\) across iterations. By combining the inequalities for the iterative differences, 29 and 34 , we derive the iterative difference for the Lyapunov function 35 as follows: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) = & - r (k + r + 1) s \left\langle \nabla f(y_{k+1}), y_{k + 1} - x^{\star} \right\rangle + \frac{(k+r+1)^2s^2}{2}\left\| \nabla f(y_{k+1}) \right\|^2 \nonumber \\ & - \frac{\mu}{2} \cdot \frac{s^2k^2(k+1)}{k+r+1} \cdot \|v_{k+1}\|^2 - s(k+1)(k+r+1)\left( s - \frac{Ls^2}{2} \right) \| \nabla f(y_{k+1}) \|^2 \nonumber \\ & + s (2k + r + 3) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32lyapunov-iter1} \end{align}\tag{37}\] Next, by substituting \(x_{k+2}\) and \(x^{\star}\) into the fundamental inequality 20 and rearranging, we obtain: \[\begin{align} - \left\langle \nabla f(y_{k+1}), y_{k+1} - x^{\star}\right\rangle \leq & - \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ &- \frac{\mu}{2} \left\| y_{k+1} - x^{\star} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32fund-inq-xstar} \end{align}\tag{38}\] Substituting the inequality 38 into 37 , we arrive at the following estimate for the iterative difference as: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - s \left[(r -2)k + (r^2 - 3) \right] \left( f(x_{k+2}) - f(x^{\star}) \right) - \frac{\mu}{2} \cdot \frac{s^2k^2(k+1)}{k+r+1} \cdot \|v_{k+1}\|^2 \nonumber \\ & - \frac{ s (2k + r + 3) \mu}{2} \|y_{k+1} - x^{\star}\|^2 - \frac{(k+r+1)^2s^2(1-Ls)}{2}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32lyapunov-iter2} \end{align}\tag{39}\] To further simplify this expression, we use the strongly convex inequality 21 to derive the lower bound on \(\| \nabla f(y_{k+1}) \|^2\) as: \[\left\| \nabla f(y_{k+1}) \right\|^2 \geq 2\mu \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32fund-inq-smooth1-new}\tag{40}\] Substituting this inequality 40 into 39 , we further estimate the iterative difference as \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - \frac{\mu s(1-Ls)}{4} \cdot s(k+r+1)^2\left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s^2}{2} \cdot \frac{k^2(k+1)}{k+r+1} \cdot \|v_{k+1}\|^2 \nonumber \\ & - \frac{ s (2k + r + 3) \mu}{2} \|y_{k+1} - x^{\star}\|^2 - \frac{3(k+r+1)^2s^2(1-Ls)}{8}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32lyapunov-iter3} \end{align}\tag{41}\] Finally, we estimate the Lyapunov function 35 using the Cauchy-Schwarz inequality: \[\begin{align} \mathcal{E}(k + 1) \leq & s(k+2)(k+r+2)\left( f(x_{k+2}) - f(x^{\star}) \right) + \frac{3s}{2} \cdot \frac{k^2(k+1)^2}{(k+r+1)^2} \cdot \left\| v_{k+1} \right\|^2 \nonumber \\ & + \frac{3r^2}{2} \left\| y_{k+1} - x^{\star} \right\|^2 + \frac{3s^2}{2} \cdot (k+r+1)^2 \cdot \|\nabla f(y_{k+1}) \|^2. \label{eqn:32lyapunov-estimate} \end{align}\tag{42}\] By comparing the coefficients of the right-hand sides of the two inequalities, 41 and 42 , we deduce the following estimate as: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) & \leq - \mu s \cdot \min\left\{ \frac{1-Ls}{4}, \frac{1}{3}, \frac{2k+r+3}{3r^2}, \frac{1-Ls}{4 \mu s} \right\} \mathcal{E}(k+1) \nonumber \\ & \leq - \mu s \cdot \left(\frac{1 - Ls}{4} \right) \cdot \mathcal{E}(k+1), \label{eqn:32iter-diff-lyapunov} \end{align}\tag{43}\] where the last inequality follows from the definition of the integer \(K\), and the fact that \(0 < \mu s < \mu/L \leq 1\).
Thus, by applying standard inequalities for the gradient Lipschitz condition 18 , we complete the proof. ◻

3.3 Linear convergence of FISTA↩︎

We now extend the linear convergence of NAG, as established in , to include its proximal variant, FISTA. As specified in , FISTA utilizes the \(s\)-proximal value 22 and follows the iterative scheme below, starting from any initial point \(y_0 = x_0 \in \mathbb{R}^d\): \[\tag{44} \begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} = P_s\left( y_{k} \right), \tag{45} \\ & y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}), \tag{46} \end{empheq}\] where \(s>0\) denotes the step size. As formulated in , we can rewrite FISTA using the \(s\)-proximal subgradient 23 , resulting in a new iterative scheme similar to NAG as: \[\tag{47} \begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} = y_{k} - sG_s(y_{k}), \tag{48} \\ & y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}), \tag{49} \end{empheq}\] where the \(s\)-proximal subgradient \(G_s(y_{k})\) serves as a natural generalization of the classical gradient \(\nabla f(y_k)\) used in NAG. This formulation highlights the structural resemblance between FISTA and NAG, while accounting for the non-smooth nature of \(\Phi\). Furthermore, by defining the velocity iterative sequence as \(v_{k} = (x_{k}- x_{k-1})/\sqrt{s}\), we can express the FISTA updates in a manner akin to NAG, 9 and 10 . This leads to the implicit-velocity phase-space representation: \[\tag{50} \begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} - x_{k} = \sqrt{s} v_{k+1}, \tag{51} \\ & v_{k+1} - v_{k} = -\frac{r+1}{k+r} v_{k} - \sqrt{s} G_s \left( y_{k} \right), \tag{52} \end{empheq}\] where the sequence \(\{y_{k}\}_{k=0}^{\infty}\) satisfies the following relationship: \[\label{eqn:32fista-phase-x-y-v} y_k = x_{k} + \frac{k-1}{k+r} \cdot \sqrt{s}v_k.\tag{53}\]
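A minimal FISTA sketch in the form 45 — 46 is given below, with the proximal map of an \(\ell_1\) regularizer supplied explicitly; the lasso-type composite objective, its dimensions, and all parameter values are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def fista(grad_f, prox_g, x0, s, r=2, iters=500):
    """FISTA in the form (45)-(46): x_{k+1} = P_s(y_k), y_{k+1} = x_{k+1} + k/(k+r+1) (x_{k+1} - x_k)."""
    x, y = x0.copy(), x0.copy()
    for k in range(iters):
        x_next = prox_g(y - s * grad_f(y), s)          # P_s(y_k) via the proximal map of g
        y = x_next + k / (k + r + 1) * (x_next - x)
        x = x_next
    return x

# Illustrative composite objective: f(x) = 0.5 ||Ax - b||^2, g(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
A, b, lam = rng.normal(size=(20, 5)), rng.normal(size=20), 0.5
grad_f = lambda x: A.T @ (A @ x - b)                    # A has full column rank a.s., so f is strongly convex
prox_g = lambda u, s: np.sign(u) * np.maximum(np.abs(u) - lam * s, 0.0)   # soft-thresholding
L = np.linalg.norm(A, 2) ** 2                           # Lipschitz constant of grad f
print(fista(grad_f, prox_g, np.zeros(5), s=0.9 / L))
```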

To derive the linear convergence for FISTA, we still need to establish a suitable Lyapunov function. In this context, we generalize the Lyapunov function 35 , previously used for smooth functions, by replacing the smooth function \(f\) with the composite function \(\Phi = f+g\). This modification leads to the following Lyapunov function: \[\begin{align} \mathcal{E}(k) = &\; s(k+1)(k+r+1)\left( \Phi(x_{k+1}) - \Phi(x^{\star}) \right) \nonumber \\ &\; + \frac{1}{2}\left\| (k-1)\sqrt{s}v_{k} + r (x_{k} - x^{\star}) - s (k + r) G_s(y_k) \right\|^2. \label{eqn:32lyapunov-fista} \end{align}\tag{54}\] As detailed in , the linear convergence for smooth functions predominantly hinges on three key inequalities, 32 , 38 , and 40 . These inequalities serve as the foundation for analyzing the linear convergence of NAG, encapsulating how it approaches the optimal solution. Additionally, Lemmas 1 and 2 extend the two strongly convex inequalities, 20 and 21 , to the proximal setting. Consequently, we can generalize these three key inequalities, 32 , 38 , and 40 , to account for the proximal components, effectively addressing the challenges posed by the non-smooth components of the objective function. Thus, we can establish the linear convergence for FISTA, as formally stated in the following theorem:

Theorem 4. Let \(\Phi = f + g\) with \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Given any step size \(0 < s < 1/L\), there exists a positive integer \(K: = \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by FISTA, 45 and 46 , with any initial \(x_0 = y_0 \in \mathbb{R}^d\), satisfies the following inequality: \[\label{eqn:32fista-rate} \Phi(x_k) - \Phi(x^{\star}) \leq \frac{(r +1)\left( \Phi(x_1) - \Phi(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{55}\] for any \(k \geq \max\left\{ 1, K \right\}\).

4 Monotonicity and linear convergence↩︎

In this section, we extend the linear convergence for NAG and FISTA, as established in , to their corresponding monotone variants, M-NAG and M-FISTA. Specifically, we use the iterative sequence \(\mathbf{S}_k\) in the gradient iterative relation 17 , which only depends on the variables \(x_{k}\) and \(y_{k}\), rather than the full NAG iterative relations, 9 and 10 , to construct the Lyapunov function. In deriving the linear convergence, we show that it fundamentally relies on the gradient iterative relation 26 , specifically, the information from M-NAG, as described in 5 — 7 , rather than the full NAG information from 2 and 3 . Finally, we generalize the linear convergence to include the proximal variant, M-FISTA.

4.1 Smooth optimization via M-NAG↩︎

Using the iterative sequence \(\mathbf{S}_k\) from the gradient iterative relation 17 , we write down a new form of the Lyapunov function 35 , which depends only on the variables \(x_k\) and \(y_{k}\), as follows: \[\begin{align} \mathcal{E}(k) = &\; \underbrace{s(k+1)(k+r+1)\left( f(x_{k+1}) - f(x^{\star}) \right)}_{\mathcal{E}_{pot}(k)} \nonumber \\ &\; + \underbrace{\frac{1}{2}\left\| k(y_{k} -x_k) + r(y_k - x^{\star}) - (k + r)s\nabla f(y_k) \right\|^2}_{\mathcal{E}_{mix}(k)}. \label{eqn:32lyapunov-m-nag} \end{align}\tag{56}\] This form of the Lyapunov function plays a critical role in deriving the linear convergence for M-NAG. The result is rigorously stated in the subsequent theorem.
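This x–y form can be monitored numerically. The sketch below runs M-NAG on the illustrative quadratic from Figure 1 and checks that the Lyapunov function 56 , computed from \((x_k, y_k)\) alone, is non-increasing along the iterates; the iteration count and tolerance are arbitrary.

```python
import numpy as np

f = lambda x: 5e-3 * x[0] ** 2 + x[1] ** 2
grad_f = lambda x: np.array([1e-2 * x[0], 2.0 * x[1]])
s, r = 0.4, 2
x_star = np.zeros(2)

# Run M-NAG (5)-(7), storing the sequences {x_k} and {y_k}.
xs, ys = [np.array([10.0, 1.0])], [np.array([10.0, 1.0])]
for k in range(200):
    z = ys[k] - s * grad_f(ys[k])
    x_next = z if f(z) <= f(xs[k]) else xs[k]
    ys.append(x_next + k / (k + r + 1) * (x_next - xs[k]) + (k + r) / (k + r + 1) * (z - x_next))
    xs.append(x_next)

def lyapunov(k):
    # Lyapunov function (56), written in terms of x_k and y_k only.
    mix = k * (ys[k] - xs[k]) + r * (ys[k] - x_star) - (k + r) * s * grad_f(ys[k])
    return s * (k + 1) * (k + r + 1) * (f(xs[k + 1]) - f(x_star)) + 0.5 * mix @ mix

print(all(lyapunov(k + 1) <= lyapunov(k) + 1e-12 for k in range(1, 150)))
```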

Theorem 5. Let \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\). Given any step size \(0 < s < 1/L\), there exists a positive integer \(K: = \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by M-NAG5 — 7 , with any initial point \(x_0 = y_0 \in \mathbb{R}^d\), satisfies the following inequality \[\label{eqn:32mnag-rate} f(x_k) - f(x^{\star}) \leq \frac{(r +1)\left( f(x_1) - f(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{57}\] for any \(k \geq \max\left\{1,K \right\}\).

Proof of Theorem 5. To calculate the iterative difference of the Lyapunov function, we separate the process into two parts, the potential energy and the mixed energy, as shown in 56 .

  • For the potential energy, we calculate its iterative difference as \[\begin{align} \mathcal{E}_{pot}(k+1) - \mathcal{E}_{pot}(k) = & s (k+1)(k+r+1) \left( f(x_{k+2}) - f(x_{k+1}) \right) \nonumber \\ & + s \left( 2k + r + 3 \right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter-mnag-1} \end{align}\tag{58}\] Next, substituting \(x_{k+2}\) and \(x_{k+1}\) into the fundamental inequality 20 and noting from 6 that \(f(x_{k+2}) \leq f(z_{k+1}) = f(y_{k+1} - s \nabla f(y_{k+1}))\), we obtain the following inequality \[\begin{align} f(x_{k+2}) - f(x_{k+1}) \leq & \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1}\right\rangle \nonumber \\ & - \frac{\mu}{2} \left\| y_{k+1} - x_{k+1} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32fund-inq-xkyk-mnag} \end{align}\tag{59}\] Then, we substitute the inequality 59 into the iterative difference 58 to derive the following bound as \[\begin{align} \mathcal{E}_{pot}(k+1) - \mathcal{E}_{pot}(k) \leq & \underbrace{s (k+1)(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle}_{\mathbf{I}} \nonumber \\ & - \frac{\mu s}{2} \cdot (k+1)(k+r+1) \| y_{k+1} - x_{k+1} \|^2 \nonumber \\ & - s \left( s - \frac{Ls^2}{2} \right) (k+1)(k+r+1) \| \nabla f(y_{k+1}) \|^2 \nonumber \\ & + s \left( 2k + r + 3\right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter-mnag-2} \end{align}\tag{60}\]

  • To account for the difference between the iterative point \(x_k\) and the optimal solution \(x^{\star}\), we introduce the term \(-rx^{\star}\) to both sides of the gradient iterative relation 17 . This results in the following iterative relation: \[\begin{gather} (k+1)(y_{k+1} - x_{k+1}) + r ( y_{k+1} - x^{\star} ) - \left( k + r + 1 \right)s\nabla f(y_{k+1}) \\ = k(y_{k} - x_{k}) + r ( y_k - x^{\star} ) - \left( k+r\right)s \nabla f(y_{k}) - \left( k + r + 1 \right)s\nabla f(y_{k+1}). \label{eqn:32nag-iteration-new-xstar} \end{gather}\tag{61}\] Using this modified gradient iterative relation, we can now derive the following expression for the iterative difference of the mixed energy between iterations as: \[\begin{align} \mathcal{E}_{mix}(k+1) - \mathcal{E}_{mix}(k) = & -s(k+1)(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle \nonumber \\ & - sr(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x^{\star} \right\rangle \nonumber \\ & + \frac{s^2(k+r+1)^2}{2} \| \nabla f(y_{k+1}) \|^2. \label{eqn:32mix-iter-mnag-1} \end{align}\tag{62}\] Next, we substitute \(x_{k+2}\) and \(x^{\star}\) into the fundamental inequality 20 and rearrange to obtain: \[\begin{align} - \left\langle \nabla f(y_{k+1}), y_{k+1} - x^{\star}\right\rangle \leq & - \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu}{2} \left\| y_{k+1} - x^{\star} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32fund-inq-mnag} \end{align}\tag{63}\] Substituting the inequality 63 into the iterative difference 62 of the mixed energy, we arrive at the following bound: \[\begin{align} \mathcal{E}_{mix}(k+1) - \mathcal{E}_{mix}(k) \leq & - \underbrace{s(k+1)(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle}_{\mathbf{II}} \nonumber \\ & - sr(k+r+1) \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s}{2} \cdot r(k+r+1)\left\| y_{k+1} - x^{\star} \right\|^2 \nonumber \\ & - s\left( s - \frac{Ls^2}{2} \right) \cdot r(k+r+1) \left\| \nabla f(y_{k+1}) \right\|^2 \nonumber \\ & + \frac{s^2(k+r+1)^2}{2} \| \nabla f(y_{k+1}) \|^2. \label{eqn:32mix-iter-mnag-2} \end{align}\tag{64}\]

From the iterative differences of the potential energy and the mixed energy, as described in 60 and 64 , we observe that \(\mathbf{I} = \mathbf{II}\). This allows us to derive the following bound on the iterative difference: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - s \left[(r -2)k + (r^2 - 3) \right] \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s}{2} \cdot (k+1)(k+r+1)\cdot \|y_{k+1} - x_{k+1}\|^2 \nonumber \\ & - \frac{ \mu s}{2} \cdot (2k + r + 3) \cdot \|y_{k+1} - x^{\star}\|^2 \nonumber \\ & - \frac{s^2(1-Ls) (k+r+1)^2}{2}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32iter-mnag1} \end{align}\tag{65}\] To further simplify this expression, we apply the strongly convex inequality 21 , which provides a lower bound on \(\| \nabla f(y_{k+1}) \|^2\) as follows: \[\left\| \nabla f(y_{k+1}) \right\|^2 \geq 2\mu \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32key-inq-mnag}\tag{66}\] Substituting this bound 40 into 65 , we arrive at the following estimate for the iterative difference: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - \frac{\mu s(1-Ls)}{4} \cdot s(k+r+1)^2\left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s}{2} \cdot (k+1)(k+r+1)\cdot \|y_{k+1} - x_{k+1}\|^2 \nonumber \\ & - \frac{ \mu s}{2} \cdot (2k + r + 3) \cdot \|y_{k+1} - x^{\star}\|^2 \nonumber \\ & - \frac{3s^2(1-Ls) (k+r+1)^2}{8}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32iter-mnag2} \end{align}\tag{67}\] Next, we estimate the Lyapunov function 56 using the Cauchy-Schwarz inequality: \[\begin{align} \mathcal{E}(k+1) \leq &\; s(k+2)(k+r+2)\left( f(x_{k+2}) - f(x^{\star}) \right) + \frac{3(k+1)^2}{2} \|y_{k+1} - x_{k+1}\|^2 \nonumber \\ &\;+ \frac{3r^2}{2} \| y_{k+1} - x^{\star} \|^2 + \frac{3s^2(k+r+1)^2}{2} \| \nabla f(y_{k+1}) \|^2. \label{eqn:32lyapunov-estimate-mnag} \end{align}\tag{68}\] By comparing the coefficients of the right-hand sides of the two inequalities, 67 and 68 , we deduce the following estimate for the iterative difference: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) & \leq - \mu s \cdot \min\left\{ \frac{1-Ls}{4}, \frac{1}{3}, \frac{2k+r+3}{3r^2}, \frac{1-Ls}{4 \mu s} \right\} \mathcal{E}(k+1) \nonumber \\ & \leq - \mu s \cdot \left(\frac{1 - Ls}{4} \right) \cdot \mathcal{E}(k+1), \label{eqn:32iter-diff-lyapunov-mnag} \end{align}\tag{69}\] where the last inequality holds due to the definition of the integer \(K\) and the fact that \(0 < \mu s < \mu/L \leq 1\). Finally, by applying standard inequalities for the gradient Lipschitz condition 18 , we complete the proof. ◻

4.2 Composite optimization via M-FISTA↩︎

We now extend the linear convergence of M-NAG, as established in , to include its proximal variant, M-FISTA. As outlined in , M-FISTA employs the \(s\)-proximal value defined in 22 and follows the iterative scheme below, starting from any initial point \(y_0 = x_0 \in \mathbb{R}^d\): \[\tag{70} \begin{empheq}[left=\empheqlbrace]{align} & z_{k} = P_s(y_{k}), \tag{71} \\ & x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; \Phi(z_{k}) \leq \Phi(x_{k}), \\ & x_{k}, && \text{if}\; \Phi(z_{k}) > \Phi(x_{k}), \end{align} \right. \tag{72} \\ & y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}) + \frac{k+r}{k+r+1} (z_{k} - x_{k+1}), \tag{73} \end{empheq}\] where \(s>0\) denotes the step size. To provide a unified perspective that mirrors the structure of M-NAG, we rewrite M-FISTA by replacing the \(s\)-proximal value with the \(s\)-proximal subgradient, as defined in 23 . This yields a new iterative scheme that more closely resembles M-NAG, as follows: \[\tag{74} \begin{empheq}[left=\empheqlbrace]{align} & z_{k} = y_{k} - sG_s(y_{k}), \tag{75} \\ & x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; \Phi(z_{k}) \leq \Phi(x_{k}), \\ & x_{k}, && \text{if}\; \Phi(z_{k}) > \Phi(x_{k}), \end{align} \right. \tag{76} \\ & y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}) + \frac{k+r}{k+r+1} (z_{k} - x_{k+1}), \tag{77} \end{empheq}\] where the \(s\)-proximal subgradient \(G_s(y_{k})\) replaces the gradient \(\nabla f(y_k)\) in M-NAG.
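A corresponding M-FISTA sketch in the form 75 — 77 is given below, reusing the soft-thresholding proximal map; the composite objective and all parameter values are the same illustrative assumptions as in the FISTA sketch of the previous section, and the monotone comparison is performed on the composite objective \(\Phi\).

```python
import numpy as np

def m_fista(Phi, grad_f, prox_g, x0, s, r=2, iters=500):
    """M-FISTA sketch following (75)-(77): the candidate z_k is accepted only if it does not increase Phi."""
    x, y = x0.copy(), x0.copy()
    for k in range(iters):
        z = prox_g(y - s * grad_f(y), s)               # z_k = P_s(y_k)
        x_next = z if Phi(z) <= Phi(x) else x          # monotone comparison step
        y = x_next + k / (k + r + 1) * (x_next - x) + (k + r) / (k + r + 1) * (z - x_next)
        x = x_next
    return x

# Same illustrative lasso-type objective as before.
rng = np.random.default_rng(0)
A, b, lam = rng.normal(size=(20, 5)), rng.normal(size=20), 0.5
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
g = lambda x: lam * np.linalg.norm(x, 1)
Phi = lambda x: f(x) + g(x)
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda u, s: np.sign(u) * np.maximum(np.abs(u) - lam * s, 0.0)
L = np.linalg.norm(A, 2) ** 2
print(m_fista(Phi, grad_f, prox_g, np.zeros(5), s=0.9 / L))
```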

Similarly, to establish the linear convergence for M-FISTA, we generalize the Lyapunov function 56 to its proximal counterpart as follows: \[\begin{align} \mathcal{E}(k) = &\; s(k+1)(k+r+1)\left( \Phi(x_{k+1}) - \Phi(x^{\star}) \right) \nonumber \\ &\; + \frac{1}{2}\left\| k(y_{k} -x_k) + r(y_k - x^{\star}) - (k + r)s G_s(y_k) \right\|^2. \label{eqn:32lyapunov-m-fista} \end{align}\tag{78}\] In this formulation, we incorporate the proximal structure into the Lyapunov function to account for the non-smoothness of the objective function. Lemmas 1 and 2 extend the two strongly convex inequalities, 20 and 21 , to the proximal setting. Consequently, we can generalize these three key inequalities, 59 , 63 , and 66 , to account for the proximal adjustments. With these extensions, we can generalize the linear convergence from M-NAG to M-FISTA, as formally stated in the following theorem:

Theorem 6. Let \(\Phi = f + g\) with \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Given any step size \(0 < s < 1/L\), there exists a positive integer \(K: = \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by M-FISTA, 71 — 73 , with any initial \(x_0 = y_0 \in \mathbb{R}^d\), satisfies the following inequality: \[\label{eqn:32m-fista-rate} \Phi(x_k) - \Phi(x^{\star}) \leq \frac{(r +1)\left( \Phi(x_1) - \Phi(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{79}\] for any \(k \geq \max\left\{ 1, K \right\}\).

5 Conclusion and future work↩︎

In this paper, we introduce an additional gradient term to the iterative relation 12 , resulting in a refined gradient-based iterative relation 15 . This modification paves the way for the development of a novel Lyapunov function. Unlike the Lyapunov function presented in [2], our new formulation not only eliminates the kinetic energy but also ensures linear convergence for NAG. Furthermore, the linear convergence derived from this Lyapunov function is independent of both the parameters of the strongly convex functions and the step size, thus providing a more universal and robust result. By performing a series of straightforward substitutions, we derive the gradient-based iterative relation 17 for M-NAG. Interestingly, we observe that this gradient iterative relation 17 coincides with the one in 15 when the position-velocity relation 11 is applied. Additionally, we further demonstrate that the derivation of linear convergence relies solely on the gradient iterative relation 17 and does not require the position iteration 9 . This key observation allows us to extend the linear convergence from NAG to M-NAG. Finally, leveraging the two proximal inequalities for composite functions developed in [2], [9], which serve as the proximal analogs of the two classical strongly convex inequalities, we extend the linear convergence to both FISTA and M-FISTA. This extension broadens the applicability of our results, providing insights into the convergence behavior of these widely used proximal algorithms.

For \(\mu\)-strongly convex functions, another Nesterov’s accelerated gradient method, referred to as NAG-SC, was originally proposed by [21]. In NAG-SC, the momentum coefficient is defined as \(\frac{1-\sqrt{\mu s}}{1 + \sqrt{\mu s}}\), in contrast to \(\frac{k}{k+r+1}\) used in the standard NAG. The iterative scheme for NAG-SC is given by: \[\tag{80} \begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} = y_{k} - s\nabla f(y_{k}), \tag{81} \\ & y_{k+1} = x_{k+1} + \frac{1-\sqrt{\mu s}}{1 + \sqrt{\mu s}} (x_{k+1} - x_{k}), \tag{82} \end{empheq}\] where \(s\) is the step size. Building on the same idea introduced by [3], we can propose a monotonic variant of NAG-SC, which we denote as M-NAG-SC. The iterative scheme for M-NAG-SC is as follows: \[\tag{83} \begin{empheq}[left=\empheqlbrace]{align} & z_{k} = y_{k} - s\nabla f(y_{k}), \tag{84} \\ & x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; f(z_{k}) \leq f(x_{k}), \\ & x_{k}, && \text{if}\; f(z_{k}) > f(x_{k}), \end{align} \right. \tag{85} \\ & y_{k+1} = x_{k+1} +\frac{1-\sqrt{\mu s}}{1 + \sqrt{\mu s}}(x_{k+1} - x_{k}) + (z_{k} - x_{k+1}), \tag{86} \end{empheq}\] where \(s\) is the step size. We demonstrate the numerical performance of both NAG-SC and M-NAG-SC in comparison with the vanilla gradient descent method in Figure 2.
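A minimal sketch of NAG-SC 81 — 82 and M-NAG-SC 84 — 86 on the quadratic from Figure 2 is given below; the step size \(s = 0.01\) is taken from the caption, \(\mu = 0.01\) is the strong convexity modulus of that quadratic, and the initial point and iteration count are illustrative choices.

```python
import numpy as np

f = lambda x: 5e-3 * x[0] ** 2 + x[1] ** 2              # mu = 0.01 (quadratic from Figure 2)
grad_f = lambda x: np.array([1e-2 * x[0], 2.0 * x[1]])
mu, s = 0.01, 0.01
beta = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))    # momentum coefficient of NAG-SC

def nag_sc(x0, iters=2000):
    x, y = x0.copy(), x0.copy()
    for _ in range(iters):
        x_next = y - s * grad_f(y)                      # (81)
        y = x_next + beta * (x_next - x)                # (82)
        x = x_next
    return f(x)

def m_nag_sc(x0, iters=2000):
    x, y = x0.copy(), x0.copy()
    for _ in range(iters):
        z = y - s * grad_f(y)                           # (84)
        x_next = z if f(z) <= f(x) else x               # (85)
        y = x_next + beta * (x_next - x) + (z - x_next) # (86)
        x = x_next
    return f(x)

x0 = np.array([10.0, 1.0])
print(nag_sc(x0), m_nag_sc(x0))
```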


Figure 2: Numerical comparison of the iterative progression of function values for the vanilla gradient descent method, NAG-SC and M-NAG-SC. The step size is \(s = 0.01\). All optimization algorithms are applied to the quadratic function \(f(x_1, x_2) = 5 \times 10^{-3}x_1^2 + x_2^2\).

An intriguing avenue for future research is to explore whether M-NAG-SC, as described by 84 — 86 , can retain the accelerated convergence rate of NAG-SC. However, it is important to note that the Lyapunov function presented in [6], which incorporates the kinetic energy as an essential part of the analysis, complicates the direct generalization from NAG-SC to M-NAG-SC. Thus, extending the Lyapunov analysis to the monotonic variant requires further exploration.

Acknowledgements↩︎

We thank Bowen Li for his helpful and insightful discussions. Mingwei Fu acknowledges partial support from the Loo-Keng Hua Scholarship of CAS. Bin Shi was partially supported by the startup fund from SIMIS and Grant No.YSBR-034 of CAS.

References↩︎

[1]
A. Chambolle and T. Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25: 161–319, 2016.
[2]
B. Li, B. Shi, and Y.-x. Yuan. Linear convergence of forward-backward accelerated algorithms without knowledge of the modulus of strong convexity. SIAM Journal on Optimization, 34 (2): 2150–2168, 2024.
[3]
A. Beck. First-order methods in optimization. SIAM, 2017.
[4]
A. Cauchy. Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris, 25 (1847): 536–538, 1847.
[5]
Y. Nesterov. A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Mathematics Doklady, 27 (2): 372–376, 1983.
[6]
B. Shi, S. S. Du, M. I. Jordan, and W. J. Su. Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, pages 79–148, 2022.
[7]
S. Chen, B. Shi, and Y.-x. Yuan. Gradient norm minimization of Nesterov acceleration: \(o(1/k^{3})\). arXiv preprint arXiv:2209.08862, 2022.
[8]
B. Li, B. Shi, and Y.-x. Yuan. Proximal subgradient norm minimization of ISTA and FISTA. arXiv preprint arXiv:2211.01610, 2022.
[9]
B. Li, B. Shi, and Y.-X. Yuan. Linear convergence of ISTA and FISTA. Journal of the Operations Research Society of China, pages 1–19, 2024.
[10]
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2 (1): 183–202, 2009.
[11]
P. Giselsson and S. Boyd. Monotonicity and restart in fast gradient methods. In 53rd IEEE Conference on Decision and Control, pages 5058–5063. IEEE, 2014.
[12]
B. O’Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15: 715–732, 2015.
[13]
H. Attouch, P.-E. Maingé, and P. Redont. A second-order differential system with hessian-driven damping; application to non-elastic shock laws. Differential Equations and Applications, 4 (1): 27–65, 2012.
[14]
H. Attouch, J. Peypouquet, and P. Redont. A dynamical approach to an inertial forward-backward algorithm for convex minimization. SIAM Journal on Optimization, 24 (1): 232–256, 2014.
[15]
W. Su, S. Boyd, and E. J. Candes. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17 (153): 1–43, 2016.
[16]
A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113 (47): E7351–E7358, 2016.
[17]
H. Attouch and J. Peypouquet. The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than \(1/k^2\). SIAM Journal on Optimization, 26 (3): 1824–1834, 2016.
[18]
S. Chen, B. Shi, and Y.-x. Yuan. Revisiting the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:2212.05700, 2022.
[19]
B. Li, B. Shi, and Y.-X. Yuan. Linear convergence of ISTA and FISTA. arXiv preprint arXiv:2212.06319, 2022.
[20]
S. Chen, B. Shi, and Y.-x. Yuan. On underdamped Nesterov’s acceleration. arXiv preprint arXiv:2304.14642, 2023.
[21]
Y. Nesterov. Lectures on Convex Optimization, volume 137. Springer, second edition, 2018.

  1. Corresponding author: binshi@simis.cn↩︎

  2. In detail, \(x^{\star}\) is the unique global minimum, and the fundamental inequality for strongly convex functions is rigorously shown in 20 .↩︎