December 18, 2024
In the realm of gradient-based optimization, Nesterov’s accelerated gradient method (NAG
) is a landmark advancement, achieving an accelerated convergence rate that outperforms the vanilla gradient descent method for convex functions.
However, for strongly convex functions, whether NAG
converges linearly remains an open question, as noted in the comprehensive review by [1]. This issue, aside from the critical step size, was addressed by [2] using a high-resolution differential
equation framework. Furthermore, [3] introduced a monotonically convergent variant of NAG
, referred to as M-NAG
.
Despite these developments, the Lyapunov analysis presented in [2] cannot be directly extended to M-NAG
. In this paper, we propose a
modification to the iterative relation by introducing a gradient term, leading to a new gradient-based iterative relation. This adjustment allows for the construction of a novel Lyapunov function that excludes kinetic energy. The linear convergence derived
from this Lyapunov function is independent of both the parameters of the strongly convex functions and the step size, yielding a more general and robust result. Notably, we observe that the gradient iterative relation derived from M-NAG
is
equivalent to that from NAG
when the position-velocity relation is applied. However, the Lyapunov analysis does not rely on the position-velocity relation, allowing us to extend the linear convergence to M-NAG
. Finally, by
utilizing two proximal inequalities, which serve as the proximal counterparts of strongly convex inequalities, we extend the linear convergence to both the fast iterative shrinkage-thresholding algorithm (FISTA
) and its monotonic counterpart
(M-FISTA
).
Over the past few decades, machine learning has emerged as a major application area for optimization algorithms. A central challenge in this field is unconstrained optimization, which involves minimizing a convex objective function without constraints. Mathematically, this problem is formulated as: \[\min_{x \in \mathbb{R}^d} f(x).\] At the heart of addressing this challenge are gradient-based methods, which have driven significant advancements due to their computational efficiency and low memory requirements. These advantages make them particularly well-suited for large-scale machine learning tasks. As a result, gradient-based methods have not only propelled theoretical progress but also facilitated practical breakthroughs, establishing them as indispensable tools in both machine learning research and real-world applications.
One of the simplest and most foundational gradient-based methods is the vanilla gradient descent, which dates back to [4]. Given an
initial point \(x_0 \in \mathbb{R}^d\), the vanilla gradient descent method is implemented recursively as: \[x_{k+1} = x_{k} - s \nabla f(x_{k}),\] where \(s\) denotes the step size. While this method is effective, it can be slow to converge for convex functions. In the mid-1980s, [5] introduced a two-step accelerated algorithm, now known as Nesterov’s accelerated gradient method (NAG
). Given any initial \(x_0 = y_0 \in \mathbb{R}^d\), NAG
employs the following update rules: \[\tag{1}
\begin{empheq}[left=\empheqlbrace]{align}
& x_{k+1} = y_{k} - s\nabla f(y_{k}), \tag{2} \\
& y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}), \tag{3}
\end{empheq}\] where \(s\) is the step size,
and \(r \geq 2\) is the momentum parameter. Compared to the vanilla gradient descent method, NAG
achieves accelerated convergence rates. However, it does not guarantee monotone convergence, that is, the
function values \(f(x_k)\) do not necessarily decrease at every iteration. As a result, oscillations or overshoots may occur, especially as the iterates approach the minimizer, as illustrated in Figure 1.
Figure 1: Numerical comparison of the iterative progression of function values for NAG
and M-NAG
. The momentum parameter is set to \(r=2\), and the step size is \(s = 0.4\). Both optimization algorithms are applied to the quadratic function \(f(x_1, x_2) = 5 \times 10^{-3}x_1^2 + x_2^2\).
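For concreteness, the following NumPy sketch implements the NAG updates 2 and 3 on the quadratic objective of Figure 1; the test function and parameters follow the figure, while the remaining implementation details (function names, iteration count) are ours and purely illustrative.

```python
import numpy as np

def grad_f(x):
    # Gradient of the quadratic f(x1, x2) = 5e-3 * x1^2 + x2^2 from Figure 1.
    return np.array([1e-2 * x[0], 2.0 * x[1]])

def nag(x0, s=0.4, r=2, num_iters=100):
    """Nesterov's accelerated gradient method, updates (2)-(3)."""
    x, y = x0.copy(), x0.copy()
    f_vals = []
    for k in range(num_iters):
        x_next = y - s * grad_f(y)                    # gradient step (2)
        y = x_next + k / (k + r + 1) * (x_next - x)   # momentum step (3)
        x = x_next
        f_vals.append(5e-3 * x[0] ** 2 + x[1] ** 2)
    return x, f_vals

x_final, f_vals = nag(np.array([1.0, 1.0]))
print(f_vals[:10])   # the function values need not decrease monotonically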
This lack of stability and predictability makes it difficult to monitor progress and ensure reliable convergence. To address this issue, [3]
proposed a variant that achieves both acceleration and monotonicity, referred to as the Monotone Nesterov’s accelerated gradient method (M-NAG
). This method incorporates a comparison step to stabilize convergence while maintaining the
acceleration properties of NAG
. Starting with any initial \(x_0 = y_0 \in \mathbb{R}^d\), the recursive update rules of M-NAG
are as follows: \[\tag{4}
\begin{empheq}[left=\empheqlbrace]{align}
& z_{k} = y_{k} - s\nabla f(y_{k}), \tag{5} \\
& x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; f(z_{k}) \leq f(x_{k}), \\ & x_{k}, && \text{if}\; f(z_{k}) > f(x_{k}), \end{align} \right. \tag{6} \\
& y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}) + \frac{k+r}{k+r+1} (z_{k} - x_{k+1}), \tag{7}
\end{empheq}\]
where \(s\) is the step size and \(r \geq 2\) is the momentum parameter. Unlike NAG
, M-NAG
ensures that the function values are non-increasing at each iteration,
thereby providing a more stable convergence path. As illustrated in Figure 1, M-NAG
effectively mitigates oscillations while preserving the acceleration characteristic of NAG
, leading to more stable and predictable convergence
behavior.
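The comparison step 6 is straightforward to implement; the sketch below mirrors the M-NAG updates 5 — 7 under the same illustrative setting as above (our choices of test function and parameters, following Figure 1).

```python
import numpy as np

def f(x):
    return 5e-3 * x[0] ** 2 + x[1] ** 2

def grad_f(x):
    return np.array([1e-2 * x[0], 2.0 * x[1]])

def m_nag(x0, s=0.4, r=2, num_iters=100):
    """Monotone NAG, updates (5)-(7): the comparison step keeps f(x_k) non-increasing."""
    x, y = x0.copy(), x0.copy()
    f_vals = []
    for k in range(num_iters):
        z = y - s * grad_f(y)                             # gradient step (5)
        x_next = z if f(z) <= f(x) else x                 # comparison step (6)
        y = (x_next
             + k / (k + r + 1) * (x_next - x)             # momentum on accepted iterates
             + (k + r) / (k + r + 1) * (z - x_next))      # correction toward z, step (7)
        x = x_next
        f_vals.append(f(x))
    return x, f_vals

_, f_vals = m_nag(np.array([1.0, 1.0]))
assert all(f_vals[i + 1] <= f_vals[i] for i in range(len(f_vals) - 1))  # monotone by construction
```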
A notable advantage of NAG
is that it does not require prior knowledge of the modulus of strong convexity. For strongly convex functions, [2] utilized the high-resolution differential equation framework [6] to establish linear
convergence, thereby addressing an open problem posed by [1]. The natural next step is to generalize the Lyapunov
analysis proposed in [2] from NAG
to M-NAG
. However, due to the introduction of the comparison step, the phase-space
representation that works for NAG
cannot be directly applied to M-NAG
, making such a generalization quite challenging. This leads to the following question:
Does M-NAG converge linearly for strongly convex functions? In this paper, we make two key observations for NAG
, which allow us to answer the above question through Lyapunov analysis.
Recall the high-resolution differential equation framework presented in [7], [8], where Lyapunov analysis based on the implicit-velocity scheme has been shown to outperform the gradient-correction scheme. Accordingly, we adopt the phase-space representation derived from the implicit-velocity scheme.
To facilitate this approach, we introduce a new iterative velocity sequence \(\{v_{k}\}_{k=0}^{\infty}\) such that \(x_{k} - x_{k-1} = \sqrt{s}v_{k}\). Using this
formulation, NAG
can be reformulated as follows: \[\tag{8}
\begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} - x_{k} = \sqrt{s} v_{k+1}, \tag{9} \\ & v_{k+1} - v_{k} = -\frac{r+1}{k+r} v_{k} - \sqrt{s} \nabla f\left( y_{k} \right), \tag{10}
\end{empheq}\] where the sequence \(\{y_{k}\}_{k=0}^{\infty}\) satisfies the following relationship: \[\label{eqn:32nag-phase-x-y-v}
y_k = x_{k} + \frac{k-1}{k+r} \cdot \sqrt{s}v_k.\tag{11}\] This phase-space representation allows for a more refined analysis of NAG
’s convergence properties. The key idea behind constructing a Lyapunov function, based on the
high-resolution differential equation framework [6], is the mixed energy, which is derived from the velocity iterative relation. By
substituting 9 into 10 , we obtain the following iterative relation: \[\underbrace{k\sqrt{s}v_{k+1} +
r x_{k+1} }_{ \mathbf{R}_{k+1} } = \underbrace{ (k-1)\sqrt{s}v_{k} + rx_k }_{ \mathbf{R}_{k} } - \left( k+r\right)s \nabla f(y_{k}), \label{eqn:32nag-phase-iteration}\tag{12}\] where we denote the new iterative sequence \(\mathbf{R}_k = (k-1)\sqrt{s}v_{k} + rx_k\), which generates the mixed energy as shown in [2]. As demonstrated in [2], a Lyapunov function is used to derive the linear convergence for NAG
. However, due to the existence of kinetic energy, we cannot
directly generalize the Lyapunov analysis from NAG
to M-NAG
.
For convex functions, as shown in [7], [8], the iterative sequence \(\mathbf{R}_{k}\) in 12 is used to construct a Lyapunov function, given by: \[\label{eqn:32lyapunov-cvx} \mathcal{E}(k) = sk(k+r)\left( f(x_k) - f(x^{\star}) \right) + \frac{1}{2} \left\| (k-1)\sqrt{s}v_{k} + r(x_k - x^{\star}) \right\|^2.\tag{13}\] By utilizing the fundamental inequality for strongly convex functions, we can show that the iterative difference satisfies the following inequality: \[\label{eqn:32lyapunov-cvx-iter} \mathcal{E}(k+1) - \mathcal{E}(k) \leq - \mu s \left[ C_1(k+1)\left( f(x_{k+1}) - f(x^{\star}) \right) + C_2(k) s\left\| v_{k} \right\|^2 + C_{3}(k)\left\| x_k - x^{\star} \right\|^2 \right],\tag{14}\] where \(C_1(k) = \Theta(k^2)\), \(C_2(k) = \Theta(k^2)\), and \(C_3(k) = \Theta(k)\).2 However, comparing 13 and 14 , the right-hand side of the inequality 14 cannot be made proportional to the Lyapunov function 13 , owing to the mismatch between the potential energy, which involves \(x_{k+1}\), and the mixed energy, which involves \(v_k\) and \(x_k\).
To eliminate the kinetic energy, we introduce a new gradient term, transforming the iteration into a gradient-based iterative form: \[\begin{gather} \underbrace{k\sqrt{s}v_{k+1} + rx_{k+1} - \left( k + r + 1 \right)s\nabla f(y_{k+1}) }_{ \mathbf{S}_{k+1} } \\ = \underbrace{(k-1)\sqrt{s}v_{k} + r x_k - \left( k+r\right)s \nabla f(y_{k})}_{ \mathbf{S}_{k} } - \left( k + r + 1 \right)s\nabla f(y_{k+1}), \label{eqn:32nag-phase-iteration-new} \end{gather}\tag{15}\] where the new iterative sequence is given by \(\mathbf{S}_k = \mathbf{R}_k - \left( k+r\right)s \nabla f(y_{k})\). The first key observation is this gradient iterative relation 15 , which enables us to construct a novel Lyapunov function and derive linear convergence. Notably, the convergence rate does not depend on the parameters, except for the momentum parameter \(r\), in contrast to [2], as detailed in .
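The relations 12 and 15 can be checked numerically along the NAG trajectory. The short script below (our verification, using the quadratic test function of Figure 1) confirms both recursions to machine precision.

```python
import numpy as np

def grad_f(x):
    return np.array([1e-2 * x[0], 2.0 * x[1]])   # quadratic test function of Figure 1

s, r, n = 0.4, 2, 50
rs = np.sqrt(s)

# Run NAG (2)-(3) and record the iterates x_k and y_k.
xs, ys = [np.array([1.0, 1.0])], [np.array([1.0, 1.0])]
for k in range(n):
    x_next = ys[-1] - s * grad_f(ys[-1])
    ys.append(x_next + k / (k + r + 1) * (x_next - xs[-1]))
    xs.append(x_next)

def R(k):  # R_k = (k-1)*sqrt(s)*v_k + r*x_k, with v_k = (x_k - x_{k-1})/sqrt(s)
    v_k = (xs[k] - xs[k - 1]) / rs
    return (k - 1) * rs * v_k + r * xs[k]

def S(k):  # S_k = R_k - (k+r)*s*grad f(y_k)
    return R(k) - (k + r) * s * grad_f(ys[k])

for k in range(1, n - 1):
    assert np.allclose(R(k + 1), R(k) - (k + r) * s * grad_f(ys[k]))          # relation (12)
    assert np.allclose(S(k + 1), S(k) - (k + r + 1) * s * grad_f(ys[k + 1]))  # relation (15)
```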
The second key observation pertains to the connection between NAG
and M-NAG
. By substituting 5 into 7 and reformulating, we derive the following
iterative relation: \[(k + r + 1) y_{k+1} - (k + 1) x_{k+1} = (k + r)y_{k} - kx_k - (k + r)s\nabla f(y_k). \label{eqn:32m-nag-iteration}\tag{16}\] Next, by
incorporating the gradient term, we can transform the iterative relation 16 into a gradient-based form, yielding: \[\begin{gather}
\underbrace{(k + r + 1) y_{k+1} - (k + 1) x_{k+1} - \left( k + r + 1 \right)s\nabla f(y_{k+1})}_{\mathbf{T}_{k+1}} \\
= \underbrace{ (k + r)y_{k} - kx_k - (k + r)s\nabla f(y_k) }_{\mathbf{T}_k} - \left( k + r + 1 \right)s\nabla f(y_{k+1}). \label{eqn:32m-nag-iteration-new}
\end{gather}\tag{17}\] Substituting the position-velocity relation 11 into the sequence \(\mathbf{S}_{k}\), we find that the two iterative sequences are identical, i.e., \(\mathbf{S}_k = \mathbf{T}_k\). The detailed calculation is as follows: \[\mathbf{S}_k = (k-1)\sqrt{s}v_{k} + r x_k - \left( k+r\right)s \nabla f(y_{k}) = (k + r)y_{k} - kx_k - (k + r)s\nabla f(y_k) =
\mathbf{T}_k.\] Thus, the relationship between the two gradient iterative relations, 15 and 17 , relies solely on the position-velocity relation 11 . Moreover, the position-velocity relation 11 merely introduces a new iterative sequence \(v_k\), which is free to choose. Hence, the two gradient
iterative relations, 15 and 17 , are essentially identical and can be used interchangeably.
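The identity \(\mathbf{S}_k = \mathbf{T}_k\) along NAG iterates, as well as the gradient iterative relation 17 along M-NAG iterates, can likewise be verified numerically. The sketch below is our illustration and again assumes the quadratic test function of Figure 1.

```python
import numpy as np

def f(x): return 5e-3 * x[0] ** 2 + x[1] ** 2
def grad_f(x): return np.array([1e-2 * x[0], 2.0 * x[1]])

s, r, n = 0.4, 2, 50
rs = np.sqrt(s)

def T(x_k, y_k, k):  # T_k = (k+r)*y_k - k*x_k - (k+r)*s*grad f(y_k)
    return (k + r) * y_k - k * x_k - (k + r) * s * grad_f(y_k)

# 1) Along NAG, S_k (velocity form) coincides with T_k (position form).
xs, ys = [np.array([1.0, 1.0])], [np.array([1.0, 1.0])]
for k in range(n):
    x_next = ys[-1] - s * grad_f(ys[-1])
    ys.append(x_next + k / (k + r + 1) * (x_next - xs[-1]))
    xs.append(x_next)
for k in range(1, n):
    v_k = (xs[k] - xs[k - 1]) / rs
    S_k = (k - 1) * rs * v_k + r * xs[k] - (k + r) * s * grad_f(ys[k])
    assert np.allclose(S_k, T(xs[k], ys[k], k))

# 2) Along M-NAG (5)-(7), T_k obeys the gradient iterative relation (17).
x, y = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for k in range(n):
    z = y - s * grad_f(y)
    x_next = z if f(z) <= f(x) else x
    y_next = x_next + k / (k + r + 1) * (x_next - x) + (k + r) / (k + r + 1) * (z - x_next)
    assert np.allclose(T(x_next, y_next, k + 1),
                       T(x, y, k) - (k + r + 1) * s * grad_f(y_next))
    x, y = x_next, y_next
```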
With this key observation, we can extend the Lyapunov analysis from NAG
to M-NAG
. Therefore, it is sufficient to employ Lyapunov analysis to establish the linear convergence for NAG
involving the gradient iterative
relation 17 and the position-velocity relation 11 , without needing the position iteration 9 , as detailed in . This insight reveals
that the underlying philosophy of the M-NAG
iteration, 5 — 7 , is to focus on the iterative sequence \(\{x_k\}_{k=0}^{\infty}\) to
track the monotonicity of function values, while allowing the sequence \(\{y_{k}\}_{k=0}^{\infty}\) to adjust correspondingly. In fact, in addition to the freely chosen position-velocity relation 11 , NAG
still consists of two relations: the position iteration 9 and the gradient iterative relation 17 . With the
position-velocity relation 11 alone, we cannot directly transform the gradient iterative relation 17 into the velocity iteration 10 ,
which still requires the position iteration 9 . For M-NAG
, we know that it only involves the freely chosen position-velocity relation 11 and the gradient iterative
relation 17 .
In this study, building upon the two key observations outlined above, we make the following contributions for the forward-backward accelerated algorithms.
In contrast to the previous work [2], we introduce the gradient term to the iterative relation 12 , resulting in a new gradient-based iterative relation 15 . This adjustment facilitates the construction of a novel Lyapunov function. Unlike the Lyapunov function
developed in [2], our new Lyapunov function not only eliminates the kinetic energy, but also guarantees linear convergence for NAG
.
Moreover, the convergence rate derived from this Lyapunov function is independent of both the parameters of the strongly convex functions and the step size, offering a more general and robust result.
Through some simple substitution, we derive the gradient iterative relation 17 for M-NAG
. We further observe that this gradient iterative relation 17 is equivalent to the one in 15 when the position-velocity relation 11 is applied. Additionally, we find that the derivation of
linear convergence requires only the gradient iterative relation 17 , and is independent of the position iteration 9 . As a result, we extend the linear convergence
from NAG
to M-NAG
.
Finally, building on the two proximal inequalities for composite functions derived in [2], [9], which are the proximal counterparts of the two strongly convex inequalities, we extend the linear convergence to the two proximal algorithms, the fast iterative shrinkage-thresholding algorithm (FISTA
) and
its monotonic counterpart (M-FISTA
).
In the history of gradient-based optimization algorithms, a significant milestone was the introduction of NAG
, a two-step forward-backward algorithm originally proposed by [5]. This pioneering work revolutionized optimization techniques by incorporating acceleration into the vanilla gradient descent method, leading to substantial improvements in convergence
rates. Building upon this breakthrough, [10] introduced a proximal version of the fundamental inequality, which paved the way for the development
of FISTA
. This new algorithm extended the power of NAG
to handle composite objective functions, making it a more efficient tool for a wide range of applications, including signal processing and image reconstruction. Further
advancing the field, [3] proposed a monotonically convergent version of the forward-backward accelerated gradient algorithms, ensuring that the
algorithm’s performance would steadily improve over time. Simultaneously, adaptive restart strategies were explored to maintain monotonicity and enhance the robustness of forward-backward accelerated algorithms, with some contributions from [11] and [12]
.
Over the past decade, there has been increasing interest in understanding the acceleration phenomenon in optimization algorithms through the lens of differential equations. This exploration began with works [13], [14], which sparked a surge of research into the dynamics underlying
accelerated methods. The topic gained further prominence with the development of a low-resolution differential equation for modeling NAG
[15], a variational perspective on acceleration [16], and studies focusing on the faster
convergence rate of function values [17]. The acceleration mechanism was ultimately clarified through comparisons of the NAG
method with Polyak’s heavy-ball method, with pivotal insights emerging from the high-resolution differential equation framework introduced by [6]. This framework revealed that gradient correction is effectively achieved through an implicit velocity update, providing a more comprehensive understanding of how acceleration arises. Subsequent works [7], [18] further demonstrated that this framework is particularly
well-suited for NAG
. Building on this, progress was made by extending the framework to address composite optimization problems, encompassing both convex and strongly convex functions [8], [19]. This extension was achieved by refining the proximal inequality and generalizing the
framework to accommodate a wider range of optimization scenarios, including the overdamped case, as shown in [20]. These advancements
have significantly deepened our understanding of acceleration mechanisms and paved the way for more efficient, robust, and versatile optimization algorithms.
The remainder of this paper is organized as follows. introduces the foundational definitions and inequalities for strongly convex functions and composite functions, which serve as preliminaries for the analysis. outlines the construction of the
Lyapunov function and establishes the linear convergence for NAG
and FISTA
. extends the linear convergence to the monotonically accelerated forward-backward algorithms, M-NAG
and M-FISTA
, leveraging the
novel Lyapunov function that excludes the kinetic energy. Finally, concludes this paper and proposes potential avenues for future research.
In this paper, we adopt the notations that are consistent with those used in [6], [9], [21], with slight adjustments tailored to the specific context of our analysis. Let \(\mathcal{F}^0(\mathbb{R}^d)\) denote the class of continuous convex functions defined on \(\mathbb{R}^d\). Specifically, \(g \in \mathcal{F}^{0}(\mathbb{R}^d)\) if it satisfies the convexity condition: \[g\left( \alpha x + (1 - \alpha)y \right) \leq \alpha g(x) + (1 - \alpha)g(y),\] for any \(x,y \in \mathbb{R}^d\) and \(\alpha \in [0, 1]\). A subclass of \(\mathcal{F}^0(\mathbb{R}^d)\), denoted \(\mathcal{F}^1_L(\mathbb{R}^d)\), consists of functions that are continuously differentiable and have Lipschitz continuous gradients. Specifically, \(f \in \mathcal{F}^{1}_{L}(\mathbb{R}^d)\) if \(f \in \mathcal{F}^{0}(\mathbb{R}^d)\) and its gradient satisfies: \[\label{eqn:32grad-lip} \| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \|,\tag{18}\] for any \(x, y \in \mathbb{R}^d\). Next, we introduce the function class \(\mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\), a subclass of \(\mathcal{F}^1_L(\mathbb{R}^d)\), where each function is \(\mu\)-strongly convex for some \(0 < \mu \leq L\). Specifically, \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) if \(f \in \mathcal{F}^1_L(\mathbb{R}^d)\) and satisfies the \(\mu\)-strongly convex condition: \[\label{eqn:32defn-scvx} f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2,\tag{19}\] for any \(x, y \in \mathbb{R}^d\). For any \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\), the following fundamental inequality can be derived: \[\label{eqn:32fund-inq-smooth} f(y - s\nabla f(y)) - f(x) \leq \left\langle \nabla f(y), y - x \right\rangle - \frac{\mu}{2} \|y - x\|^2- \left( s - \frac{Ls^2}{2} \right) \|\nabla f(y)\|^2,\tag{20}\] for any \(x,y \in \mathbb{R}^d\). Additionally, we denote \(x^\star\) as the unique minimizer of the objective function. Furthermore, for any \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\), we have the following inequality \[\label{eqn:32key-inq-smooth} \left\| \nabla f(y) \right\|^2 \geq 2\mu \left( f(y - s\nabla f(y)) - f(x^{\star}) \right),\tag{21}\] for any \(y \in \mathbb{R}^d\).
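As a quick sanity check (ours, not part of the original analysis), the two inequalities 20 and 21 can be verified numerically for the quadratic test function used in Figure 1, whose Hessian has eigenvalues \(\mu = 10^{-2}\) and \(L = 2\).

```python
import numpy as np

# Quadratic test function: mu = 1e-2 and L = 2 are the Hessian eigenvalues.
mu, L, s = 1e-2, 2.0, 0.4
def f(x): return 5e-3 * x[0] ** 2 + x[1] ** 2
def grad_f(x): return np.array([1e-2 * x[0], 2.0 * x[1]])
x_star = np.zeros(2)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    g = grad_f(y)
    y_plus = y - s * g                     # one gradient step from y
    # Fundamental inequality (20):
    lhs = f(y_plus) - f(x)
    rhs = g @ (y - x) - mu / 2 * np.sum((y - x) ** 2) - (s - L * s ** 2 / 2) * (g @ g)
    assert lhs <= rhs + 1e-12
    # Gradient-norm inequality (21):
    assert g @ g >= 2 * mu * (f(y_plus) - f(x_star)) - 1e-12
```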
We now shift focus to a composite function \(\Phi= f + g\), where \(f \in \mathcal{S}_{\mu,L}^1(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Building upon the methodologies established in [10], [15], we introduce the concept of the \(s\)-proximal value, which plays a pivotal role in our analysis.
Definition 1 (\(s\)-proximal value). Let the step size satisfy \(s \in (0, 1/L)\). For any \(f\in \mathcal{S}_{\mu,L}^1(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\), the \(s\)-proximal value is defined as \[\label{eqn:32proximal-value} P_s(x) := \mathop{\arg\min}_{y\in\mathbb{R}^d}\left\{ \frac{1}{2s}\left\| y - \left(x - s\nabla f(x)\right) \right\|^2 + g(y) \right\},\tag{22}\] for any \(x \in \mathbb{R}^d\). The \(s\)-proximal value \(P_s(x)\) in 22 minimizes a weighted sum of the squared Euclidean distance between \(y\) and the gradient-based update \(x - s\nabla f(x)\), along with the regularization term \(g(y)\). The introduction of the \(s\)-proximal subgradient offers the flexibility to handle composite objective functions, enabling efficient minimization of the form \(f(x)+g(x)\), where \(g(x)\) may be non-smooth but proximally simple. When \(g\) is specified as the \(\ell_1\)-norm, i.e., \(g(x) = \lambda \|x\|_1\), a closed-form expression for the \(s\)-proximal value can be derived. Specifically, for any \(x \in \mathbb{R}^d\), the \(i\)-th component of \(s\)-proximal value is given by: \[P_s(x)_i = \big(\left|\left(x - s\nabla f(x)\right)_i\right| - \lambda s\big)_+ \text{sgn}\big(\left(x - s\nabla f(x)\right)_i\big),\] where \(i=1,\ldots,d\) and \(\text{sgn}(\cdot)\) denotes the sign function. This closed-form expression is particularly useful, as it provides an efficient way to compute the \(s\)-proximal value for the \(\ell_1\)-regularization term, which is widely used in sparse optimization problems.
Definition 2 (\(s\)-proximal subgradient). For any \(x \in \mathbb{R}^d\), the \(s\)-proximal subgradient is defined as: \[\label{eqn:32subgradient} G_s(x) := \frac{x - P_s(x)}{s},\tag{23}\] where the \(s\)-proximal value \(P_s(x)\) is given by 22 . The subgradient \(G_s(x)\), as defined in 23 , captures the direction of improvement for the composite function \(\Phi\), based on the \(s\)-proximal update. It can be viewed as a generalization of the traditional gradient, incorporating the proximal step with respect to the regularizer \(g\).
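For the \(\ell_1\)-regularized case discussed above, the \(s\)-proximal value 22 and the \(s\)-proximal subgradient 23 admit a few-line implementation. The sketch below assumes the quadratic smooth part used in our figures; the function names and parameter values are illustrative.

```python
import numpy as np

def grad_f(x):
    # Smooth part: f(x1, x2) = 5e-3 * x1^2 + x2^2 (any f in S^1_{mu,L} would do).
    return np.array([1e-2 * x[0], 2.0 * x[1]])

def prox_value(x, s, lam):
    """s-proximal value P_s(x) of (22) for g = lam * ||.||_1 (soft-thresholding)."""
    u = x - s * grad_f(x)                                    # forward gradient step
    return np.sign(u) * np.maximum(np.abs(u) - lam * s, 0.0)

def prox_subgradient(x, s, lam):
    """s-proximal subgradient G_s(x) = (x - P_s(x)) / s of (23)."""
    return (x - prox_value(x, s, lam)) / s

x = np.array([1.0, -0.3])
s, lam = 0.4, 0.1
print(prox_value(x, s, lam), prox_subgradient(x, s, lam))
```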
With the definition of \(s\)-proximal subgradient (), we extend the two key inequalities, 20 and 21 , to the proximal cases. We now present a proximal version of the fundamental inequality 20 for a composite function \(\Phi\). This result, rigorously established in [9], is stated as follows:
Lemma 1 (Lemma 4 in [9]). Let \(\Phi = f + g\) be a composite function with \(f \in \mathcal{S}_{\mu, L}^1(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Then, the following inequality \[\label{eqn:32fund-inq-composite} \Phi(y - sG_s(y)) - \Phi(x) \leq \left\langle G_s(y), y - x \right\rangle - \frac{\mu}{2}\|y - x\|^2 - \left( s - \frac{Ls^2}{2} \right) \|G_s(y)\|^2,\tag{24}\] holds for any \(x, y \in \mathbb{R}^d\).
Finally, we present a key inequality for the \(s\)-proximal subgradient, which will be used in the subsequent sections. This result, previously established in [2], is stated as follows:
Lemma 2 (Lemma 4.3 in [2]). Let \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Then, the \(s\)-proximal subgradient, as defined in 23 , satisfies the following inequality: \[\label{eqn:32key-inq-composite} \|G_s(y)\|^2 \geq 2\mu \left( \Phi(y - sG_s(y)) - \Phi(x^{\star}) \right),\tag{25}\] for any \(y \in \mathbb{R}^d\).
In this section, we first outline the process of constructing a novel Lyapunov function, following the principled way presented in [18]. Specifically, we introduce the iterative sequence \(\mathbf{S}_{k}\) in 15 , which generates the mixed term that plays a pivotal role in
the construction of this Lyapunov function. Building on this, we then derive the linear convergence rate for NAG
applied to strongly convex functions and formulate a theorem to characterize the result. Finally, we apply and to extend the
linear convergence rate to the proximal setting, focusing on FISTA
.
Following the principled way outlined in [18], we construct a novel Lyapunov function in two main steps: first, by introducing the iterative sequence \(\mathbf{S}_{k}\) for the mixed energy, and then by adjusting the coefficient of the potential energy to align with this mixed energy.
Construction of mixed energy Recall the iterative relation 15 . In the context of convergence analysis, it is not the iterative point \(x_k\) itself that matters, but rather its difference from the optimal solution, \(x_k - x^{\star}\). To account for this, we introduce a term \(-rx^{\star}\) to both sides of the gradient iterative relation 15 , yielding: \[\begin{gather} k\sqrt{s}v_{k+1} + r ( x_{k+1} - x^{\star} ) - \left( k + r + 1 \right)s\nabla f(y_{k+1}) \\ = (k-1)\sqrt{s}v_{k} + r ( x_k - x^{\star} ) - \left( k+r\right)s \nabla f(y_{k}) - \left( k + r + 1 \right)s\nabla f(y_{k+1}). \label{eqn:32nag-phase-iteration-new-xstar} \end{gather}\tag{26}\] This relation naturally leads to the mixed energy, which is given by: \[\label{eqn:32mix} \mathcal{E}_{mix}(k) = \frac{1}{2} \left\| (k-1)\sqrt{s}v_{k} + r (x_{k} - x^{\star}) - s (k + r) \nabla f(y_k) \right\|^2.\tag{27}\] Next, we examine the iterative difference of this mixed energy using the iterative relation 26 : \[\begin{align} \mathcal{E}&_{mix}(k + 1) - \mathcal{E}_{mix}(k) \nonumber \\ & = \left\langle - \left( k + r + 1 \right)s\nabla f(y_{k+1}), k\sqrt{s}v_{k+1} + r ( x_{k+1} - x^{\star} ) - \frac{\left( k + r + 1 \right)s}{2} \cdot \nabla f(y_{k+1}) \right\rangle. \label{eqn:32mix-iter1} \end{align}\tag{28}\] Substituting 11 into 28 , we derive the following result for the iterative difference: \[\begin{align} \mathcal{E}_{mix}(k + 1) - \mathcal{E}_{mix}(k) = & \underbrace{- k(k+1)s\sqrt{s} \left\langle \nabla f(y_{k+1}), v_{k+1}\right\rangle}_{\mathbf{I}} \nonumber \\ & - r (k + r + 1) s \left\langle \nabla f(y_{k+1}), y_{k + 1} - x^{\star} \right\rangle \nonumber \\ & + \frac{(k+r+1)^2s^2}{2}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32mix-iter2} \end{align}\tag{29}\]
Construction of potential energy In contrast to the Lyapunov function 13 , previously used for convex functions in [7], [8], we now modify the potential energy by replacing \(f(x_{k})\) with \(f(x_{k+1})\). This adjustment aligns the formulation of the potential energy with the mixed energy given in 27 . Additionally, we introduce a dynamic coefficient to the potential energy, which is expressed as: \[\label{eqn:32pot} \mathcal{E}_{pot}(k) := s \tau(k) \left( f(x_{k+1}) - f(x^{\star}) \right).\tag{30}\] The iterative difference of this potential energy is then given by: \[\begin{align} \mathcal{E}_{pot}(k+1) &- \mathcal{E}_{pot}(k) \nonumber \\ & = s \tau(k) \left( f(x_{k+2}) - f(x_{k+1}) \right) + s\left( \tau(k+1) - \tau(k) \right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter1} \end{align}\tag{31}\] Next, by substituting \(x_{k+2}\) and \(x_{k+1}\) into the fundamental inequality 20 , we obtain the following inequality \[\begin{align} f(x_{k+2}) - f(x_{k+1}) \leq & \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1}\right\rangle \nonumber \\ & - \frac{\mu}{2} \left\| y_{k+1} - x_{k+1} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2 \label{eqn:32fund-inq-xkyk-nag} \end{align}\tag{32}\] Substituting 32 into the iterative difference 31 , we derive the following bound on the iterative difference of the potential energy: \[\begin{align} \mathcal{E}&_{pot}(k+1) - \mathcal{E}_{pot}(k) \nonumber \\ & \leq s \tau(k) \left( \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle - \frac{\mu}{2} \left\| y_{k+1} - x_{k+1} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2 \right) \nonumber \\ & \mathrel{\phantom{\leq}} + s \left( \tau(k+1) - \tau(k) \right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter2} \end{align}\tag{33}\] Substituting 9 into 33 , we refine the above inequality 33 as: \[\begin{align} \mathcal{E}&_{pot}(k+1) - \mathcal{E}_{pot}(k) \nonumber \\ & \leq s \tau(k) \left( \underbrace{\frac{k\sqrt{s}\left\langle \nabla f(y_{k+1}), v_{k+1} \right\rangle }{k+r+1} }_{\mathbf{II}} - \frac{\mu}{2} \cdot \frac{k^2 s\left\| v_{k+1} \right\|^2}{(k+r+1)^2} - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2 \right) \nonumber \\ & \mathrel{\phantom{\leq}} + s \left( \tau(k+1) - \tau(k) \right) \left( f(x_{k+2}) - f(x^{\star}) \right) \label{eqn:32pot-iter3} \end{align}\tag{34}\] To ensure that the term \(\mathbf{I} + (k+1)(k+r+1)s \cdot \mathbf{II} = 0\), we choose a dynamic coefficient as \(\tau(k) = (k+1)(k+r+1)\).
With the mixed energy defined in 27 and the potential energy defined in 30 , we construct the Lyapunov function as follows: \[\begin{align} \mathcal{E}(k) = &\; s(k+1)(k+r+1)\left( f(x_{k+1}) - f(x^{\star}) \right) \nonumber \\ &\; + \frac{1}{2}\left\| (k-1)\sqrt{s}v_{k} + r (x_{k} - x^{\star}) - s (k + r) \nabla f(y_k) \right\|^2. \label{eqn:32lyapunov-nag} \end{align}\tag{35}\] This Lyapunov function encapsulates both the mixed and potential energies and is a key tool for analyzing the linear convergence of the iterative process.
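As an illustration (ours, not part of the formal analysis), one can track the Lyapunov function 35 along the NAG trajectory and compare the observed contraction factor with the factor \(1/\big[1 + \mu s(1-Ls)/4\big]\) appearing in 36 . The script below does so for the quadratic test function of Figure 1 with \(r=2\) and \(s=0.4\).

```python
import numpy as np

def f(x): return 5e-3 * x[0] ** 2 + x[1] ** 2
def grad_f(x): return np.array([1e-2 * x[0], 2.0 * x[1]])

mu, L, s, r, n = 1e-2, 2.0, 0.4, 2, 200
rs, x_star = np.sqrt(s), np.zeros(2)

# Run NAG (2)-(3) and record the iterates.
xs, ys = [np.array([1.0, 1.0])], [np.array([1.0, 1.0])]
for k in range(n):
    x_next = ys[-1] - s * grad_f(ys[-1])
    ys.append(x_next + k / (k + r + 1) * (x_next - xs[-1]))
    xs.append(x_next)

def lyapunov(k):
    """Lyapunov function (35): potential energy plus mixed energy."""
    v_k = (xs[k] - xs[k - 1]) / rs
    pot = s * (k + 1) * (k + r + 1) * (f(xs[k + 1]) - f(x_star))
    mix = 0.5 * np.sum(((k - 1) * rs * v_k + r * (xs[k] - x_star)
                        - s * (k + r) * grad_f(ys[k])) ** 2)
    return pot + mix

# Per Theorem 3, E(k+1) <= E(k) / (1 + mu*s*(1 - L*s)/4) for all k >= 1 (here K = 0).
bound = 1.0 / (1.0 + mu * s * (1 - L * s) / 4)
ratios = [lyapunov(k + 1) / lyapunov(k) for k in range(1, n - 1)]
print(max(ratios), bound)   # the observed contraction factor should not exceed the bound
```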
Using the Lyapunov function 35 , we can rigorously establish the linear convergence of NAG
. The result is formally presented in the following theorem:
Theorem 3. Let \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\). Given any step size \(0 < s < 1/L\), there exists a positive integer \(K:
= \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by NAG
, 2 and 3 , with any initial point \(x_0 = y_0 \in \mathbb{R}^d\), satisfies the following inequality \[\label{eqn:32nag-rate}
f(x_k) - f(x^{\star}) \leq \frac{(r +1)\left( f(x_1) - f(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{36}\] for any \(k \geq \max\left\{1,K
\right\}\).
Proof of . We begin by deriving an expression for the change in the Lyapunov function \(\mathcal{E}(k)\) across iterations. By combining the inequalities for the iterative differences, 29 and 34 , we derive the iterative difference for the Lyapunov function 35 as follows: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) = & - r (k + r + 1) s \left\langle \nabla f(y_{k+1}), y_{k + 1} - x^{\star} \right\rangle + \frac{(k+r+1)^2s^2}{2}\left\| \nabla f(y_{k+1}) \right\|^2 \nonumber \\ & - \frac{\mu}{2} \cdot \frac{s^2k^2(k+1)}{k+r+1} \cdot \|v_{k+1}\|^2 - s(k+1)(k+r+1)\left( s - \frac{Ls^2}{2} \right) \| \nabla f(y_{k+1}) \|^2 \nonumber \\ & + s (2k + r + 3) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32lyapunov-iter1} \end{align}\tag{37}\] Next, by substituting \(x_{k+2}\) and \(x^{\star}\) into the fundamental inequality 20 and rearranging, we obtain: \[\begin{align} - \left\langle \nabla f(y_{k+1}), y_{k+1} - x^{\star}\right\rangle \leq & - \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ &- \frac{\mu}{2} \left\| y_{k+1} - x^{\star} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32fund-inq-xstar} \end{align}\tag{38}\] Substituting the inequality 38 into 37 , we arrive at the following estimate for the iterative difference as: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - s \left[(r -2)k + (r^2 - 3) \right] \left( f(x_{k+2}) - f(x^{\star}) \right) - \frac{\mu}{2} \cdot \frac{s^2k^2(k+1)}{k+r+1} \cdot \|v_{k+1}\|^2 \nonumber \\ & - \frac{ s (2k + r + 3) \mu}{2} \|y_{k+1} - x^{\star}\|^2 - \frac{(k+r+1)^2s^2(1-Ls)}{2}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32lyapunov-iter2} \end{align}\tag{39}\] To further simplify this expression, we use the strongly convex inequality 21 to derive the lower bound on \(\| \nabla f(y_{k+1}) \|^2\) as: \[\left\| \nabla f(y_{k+1}) \right\|^2 \geq 2\mu \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32fund-inq-smooth1-new}\tag{40}\] Substituting this inequality 40 into 39 , we further estimate the iterative difference as \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - \frac{\mu s(1-Ls)}{4} \cdot s(k+r+1)^2\left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s^2}{2} \cdot \frac{k^2(k+1)}{k+r+1} \cdot \|v_{k+1}\|^2 \nonumber \\ & - \frac{ s (2k + r + 3) \mu}{2} \|y_{k+1} - x^{\star}\|^2 - \frac{3(k+r+1)^2s^2(1-Ls)}{8}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32lyapunov-iter3} \end{align}\tag{41}\] Finally, we estimate the Lyapunov function 35 using the Cauchy-Schwarz inequality: \[\begin{align} \mathcal{E}(k + 1) \leq & s(k+2)(k+r+2)\left( f(x_{k+2}) - f(x^{\star}) \right) + \frac{3s}{2} \cdot \frac{k^2(k+1)^2}{(k+r+1)^2} \cdot \left\| v_{k+1} \right\|^2 \nonumber \\ & + \frac{3r^2}{2} \left\| y_{k+1} - x^{\star} \right\|^2 + \frac{3s^2}{2} \cdot (k+r+1)^2 \cdot \|\nabla f(y_{k+1}) \|^2. \label{eqn:32lyapunov-estimate} \end{align}\tag{42}\] By comparing the coefficients of the right-hand sides of the two inequalities, 37 and 42 , we deduce the following estimate as: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) & \leq - \mu s \cdot \min\left\{ \frac{1-Ls}{4}, \frac{1}{3}, \frac{2k+r+3}{3r^2}, \frac{1-Ls}{4 \mu s} \right\} \mathcal{E}(k+1) \nonumber \\ & \leq - \mu s \cdot \left(\frac{1 - Ls}{4} \right) \cdot \mathcal{E}(k+1), \label{eqn:32iter-diff-lyapunov} \end{align}\tag{43}\] where the last inequality follows from the definition of the integer \(K\), and the fact that \(0 < \mu s < \mu/L \leq 1\). 
Thus, by applying standard inequalities for the gradient Lipschitz condition 18 , we complete the proof. ◻
We now extend the linear convergence of NAG
, as established in , to include its proximal variant, FISTA
. As specified in , FISTA
utilizes the \(s\)-proximal value 22 and follows the iterative scheme below, starting from any initial point \(y_0 = x_0 \in \mathbb{R}^d\): \[\tag{44}
\begin{empheq}[left=\empheqlbrace]{align}
& x_{k+1} = P_s\left( y_{k} \right), \tag{45} \\
& y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}), \tag{46}
\end{empheq}\] where \(s>0\) denotes
the step size. As formulated in , we can rewrite FISTA
using the \(s\)-proximal subgradient 23 , resulting in a new iterative scheme similar to NAG
as:
\[\tag{47}
\begin{empheq}[left=\empheqlbrace]{align}
& x_{k+1} = y_{k} - sG_s(y_{k}), \tag{48} \\
& y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}), \tag{49}
\end{empheq}\] where the \(s\)-proximal subgradient \(G_s(y_{k})\) serves as a natural generalization of the classical gradient \(\nabla f(y_k)\) used in NAG
. This formulation
highlights the structural resemblance between FISTA and NAG, while accounting for the non-smooth nature of \(\Phi\). Furthermore, by defining the velocity iterative sequence as \(v_{k} =
(x_{k}- x_{k-1})/\sqrt{s}\), we can express the FISTA
updates in a manner akin to NAG
, 9 and 10 . This leads to the implicit-velocity phase-space
representation: \[\tag{50}
\begin{empheq}[left=\empheqlbrace]{align} & x_{k+1} - x_{k} = \sqrt{s} v_{k+1}, \tag{51} \\ & v_{k+1} - v_{k} = -\frac{r+1}{k+r} v_{k} - \sqrt{s} G_s \left( y_{k} \right), \tag{52}
\end{empheq}\] where the sequence \(\{y_{k}\}_{k=0}^{\infty}\) satisfies the following relationship: \[\label{eqn:32fista-phase-x-y-v}
y_k = x_{k} + \frac{k-1}{k+r} \cdot \sqrt{s}v_k.\tag{53}\]
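For reference, the sketch below implements FISTA in the proximal-subgradient form 48 and 49 , applied to an illustrative \(\ell_1\)-regularized composite objective; the smooth part, regularization weight, and parameter values are our choices rather than prescriptions from the text.

```python
import numpy as np

def grad_f(x):
    return np.array([1e-2 * x[0], 2.0 * x[1]])   # smooth part f

def G_s(y, s, lam):
    """s-proximal subgradient (23) for g = lam * ||.||_1."""
    u = y - s * grad_f(y)
    P = np.sign(u) * np.maximum(np.abs(u) - lam * s, 0.0)   # P_s(y), eq. (22)
    return (y - P) / s

def fista(x0, s=0.4, r=2, lam=0.1, num_iters=200):
    """FISTA written as (48)-(49): the NAG two-step structure with G_s in place of grad f."""
    x, y = x0.copy(), x0.copy()
    for k in range(num_iters):
        x_next = y - s * G_s(y, s, lam)                   # proximal gradient step (48)
        y = x_next + k / (k + r + 1) * (x_next - x)       # momentum step (49)
        x = x_next
    return x

print(fista(np.array([1.0, -1.0])))
```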
To derive the linear convergence for FISTA
, we still need to establish a suitable Lyapunov function. In this context, we generalize the Lyapunov function 35 , previously used for smooth functions, as shown in , by
replacing the smooth function \(f\) with the composite function \(\Phi = f+g\). This modification leads to the following Lyapunov function:
\[\begin{align}
\mathcal{E}(k) = &\; s(k+1)(k+r+1)\left( \Phi(x_{k+1}) - \Phi(x^{\star}) \right) \nonumber \\ &\; + \frac{1}{2}\left\| (k-1)\sqrt{s}v_{k} + r (x_{k} - x^{\star}) - s (k + r) G_s(y_k) \right\|^2. \label{eqn:32lyapunov-fista}
\end{align}\tag{54}\] As detailed in , the linear convergence for smooth functions predominantly hinges on three key inequalities, 32 , 38 , and 40 . These inequalities serve as the foundation for analyzing the linear convergence of NAG
, encapsulating how it approaches the optimal solution. Additionally, Lemmas 1 and 2 extend the two strongly convex
inequalities, 20 and 21 , to the proximal setting. Consequently, we can generalize these three key inequalities, 32 , 38 , and 40 , to account for the proximal components, effectively addressing the challenges posed by the non-smooth components of the objective function. Thus, we can
establish the linear convergence for FISTA
, as formally stated in the following theorem:
Theorem 4. Let \(\Phi = f + g\) with \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Given any
step size \(0 < s < 1/L\), there exists a positive integer \(K: = \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by FISTA,
45 and 46 , with any initial \(x_0 = y_0 \in \mathbb{R}^d\),
satisfies the following inequality: \[\label{eqn:32fista-rate}
\Phi(x_k) - \Phi(x^{\star}) \leq \frac{(r +1)\left( \Phi(x_1) - \Phi(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{55}\] for any \(k \geq
\max\left\{ 1, K \right\}\).
In this section, we extend the linear convergence for NAG
and FISTA
, as established in , to their corresponding monotone variants, M-NAG
and M-FISTA
. Specifically, we use the iterative sequence \(\mathbf{S}_k\) in the gradient iterative relation 17 , which only depends on the variables \(x_{k}\) and \(y_{k}\),
rather than the full NAG
iterative relations, 9 and 17 , to construct the Lyapunov function. In deriving the linear convergence, we show that it fundamentally
relies on the gradient iterative relation 26 , specifically, the information from M-NAG
, as described in 5 — 7
, rather than the full NAG
information from 2 and 3 . Finally, we generalize the linear convergence to include the proximal variant, M-FISTA
.
Using the iterative sequence \(\mathbf{S}_k\) from the gradient iterative relation 17 , we write down a new form of the Lyapunov function 35 , which depends only on the variables \(x_k\) and \(y_{k}\), as follows: \[\begin{align}
\mathcal{E}(k) = &\; \underbrace{s(k+1)(k+r+1)\left( f(x_{k+1}) - f(x^{\star}) \right)}_{\mathcal{E}_{pot}(k)} \nonumber \\ &\; + \underbrace{\frac{1}{2}\left\| k(y_{k} -x_k) + r(y_k - x^{\star}) - (k + r)s\nabla f(y_k)
\right\|^2}_{\mathcal{E}_{mix}(k)}. \label{eqn:32lyapunov-m-nag}
\end{align}\tag{56}\] This form of the Lyapunov function plays a critical role in deriving the linear convergence for M-NAG
. The result is rigorously stated in the subsequent theorem.
Theorem 5. Let \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\). Given any step size \(0 < s < 1/L\), there exists a positive integer \(K:
= \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by M-NAG
, 5 — 7 , with any initial point \(x_0 = y_0 \in \mathbb{R}^d\), satisfies the following inequality \[\label{eqn:32mnag-rate}
f(x_k) - f(x^{\star}) \leq \frac{(r +1)\left( f(x_1) - f(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{57}\] for any \(k \geq \max\left\{1,K
\right\}\).
Proof of . To calculate the iterative difference of the Lyapunov function, we separate the process into two parts, the potential energy and the mixed energy, as shown in 56 .
For the potential energy, we calculate its iterative difference as \[\begin{align} \mathcal{E}_{pot}(k+1) - \mathcal{E}_{pot}(k) = & s (k+1)(k+r+1) \left( f(x_{k+2}) - f(x_{k+1}) \right) \nonumber \\ & + s \left( 2k + r + 3 \right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter-mnag-1} \end{align}\tag{58}\] Next, substituting \(x_{k+2}\) and \(x_{k+1}\) into the fundamental inequality 20 and noting from 6 that \(f(x_{k+2}) \leq f(z_{k+1}) = f(y_{k+1} - s \nabla f(y_{k+1}))\), we obtain the following inequality \[\begin{align} f(x_{k+2}) - f(x_{k+1}) \leq & \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1}\right\rangle \nonumber \\ & - \frac{\mu}{2} \left\| y_{k+1} - x_{k+1} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32fund-inq-xkyk-mnag} \end{align}\tag{59}\] Then, we substitute the inequality 59 into the iterative difference 58 to derive the following bound as \[\begin{align} \mathcal{E}_{pot}(k+1) - \mathcal{E}_{pot}(k) \leq & \underbrace{s (k+1)(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle}_{\mathbf{I}} \nonumber \\ & - \frac{\mu s}{2} \cdot (k+1)(k+r+1) \| y_{k+1} - x_{k+1} \|^2 \nonumber \\ & - s \left( s - \frac{Ls^2}{2} \right) (k+1)(k+r+1) \| \nabla f(y_{k+1}) \|^2 \nonumber \\ & + s \left( 2k + r + 3\right) \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32pot-iter-mnag-2} \end{align}\tag{60}\]
To account for the difference between the iterative point \(x_k\) and the optimal solution \(x^{\star}\), we introduce the term \(-rx^{\star}\) to both sides of the gradient iterative relation 17 . This results in the following iterative relation: \[\begin{gather} (k+1)(y_{k+1} - x_{k+1}) + r ( y_{k+1} - x^{\star} ) - \left( k + r + 1 \right)s\nabla f(y_{k+1}) \\ = k(y_{k} - x_{k}) + r ( y_k - x^{\star} ) - \left( k+r\right)s \nabla f(y_{k}) - \left( k + r + 1 \right)s\nabla f(y_{k+1}). \label{eqn:32nag-iteration-new-xstar} \end{gather}\tag{61}\] Using this modified gradient iterative relation, we can now derive the following expression for the iterative difference of the mixed energy between iterations as: \[\begin{align} \mathcal{E}_{mix}(k+1) - \mathcal{E}_{mix}(k) = & -s(k+1)(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle \nonumber \\ & - sr(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x^{\star} \right\rangle \nonumber \\ & + \frac{s^2(k+r+1)^2}{2} \| \nabla f(y_{k+1}) \|^2. \label{eqn:32mix-iter-mnag-1} \end{align}\tag{62}\] Next, we substitute \(x_{k+2}\) and \(x^{\star}\) into the fundamental inequality 20 and rearrange to obtain: \[\begin{align} - \left\langle \nabla f(y_{k+1}), y_{k+1} - x^{\star}\right\rangle \leq & - \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu}{2} \left\| y_{k+1} - x^{\star} \right\|^2 - \left( s - \frac{Ls^2}{2} \right) \left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32fund-inq-mnag} \end{align}\tag{63}\] Substituting the inequality 63 into the iterative difference 62 of the mixed energy, we arrive at the following bound: \[\begin{align} \mathcal{E}_{mix}(k+1) - \mathcal{E}_{mix}(k) \leq & - \underbrace{s(k+1)(k+r+1) \left\langle \nabla f(y_{k+1}), y_{k+1} - x_{k+1} \right\rangle}_{\mathbf{II}} \nonumber \\ & - sr(k+r+1) \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s}{2} \cdot r(k+r+1)\left\| y_{k+1} - x^{\star} \right\|^2 \nonumber \\ & - s\left( s - \frac{Ls^2}{2} \right) \cdot r(k+r+1) \left\| \nabla f(y_{k+1}) \right\|^2 \nonumber \\ & + \frac{s^2(k+r+1)^2}{2} \| \nabla f(y_{k+1}) \|^2. \label{eqn:32mix-iter-mnag-2} \end{align}\tag{64}\]
From the iterative differences of the potential energy and the mixed energy, as described in 60 and 64 , we observe that \(\mathbf{I} = \mathbf{II}\). This allows us to derive the following bound on the iterative difference: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - s \left[(r -2)k + (r^2 - 3) \right] \left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s}{2} \cdot (k+1)(k+r+1)\cdot \|y_{k+1} - x_{k+1}\|^2 \nonumber \\ & - \frac{ \mu s}{2} \cdot (2k + r + 3) \cdot \|y_{k+1} - x^{\star}\|^2 \nonumber \\ & - \frac{s^2(1-Ls) (k+r+1)^2}{2}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32iter-mnag1} \end{align}\tag{65}\] To further simplify this expression, we apply the strongly convex inequality 21 , which provides a lower bound on \(\| \nabla f(y_{k+1}) \|^2\) as follows: \[\left\| \nabla f(y_{k+1}) \right\|^2 \geq 2\mu \left( f(x_{k+2}) - f(x^{\star}) \right). \label{eqn:32key-inq-mnag}\tag{66}\] Substituting this bound 40 into 65 , we arrive at the following estimate for the iterative difference: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) \leq & - \frac{\mu s(1-Ls)}{4} \cdot s(k+r+1)^2\left( f(x_{k+2}) - f(x^{\star}) \right) \nonumber \\ & - \frac{\mu s}{2} \cdot (k+1)(k+r+1)\cdot \|y_{k+1} - x_{k+1}\|^2 \nonumber \\ & - \frac{ \mu s}{2} \cdot (2k + r + 3) \cdot \|y_{k+1} - x^{\star}\|^2 \nonumber \\ & - \frac{3s^2(1-Ls) (k+r+1)^2}{8}\left\| \nabla f(y_{k+1}) \right\|^2. \label{eqn:32iter-mnag2} \end{align}\tag{67}\] Next, we estimate the Lyapunov function 56 using the Cauchy-Schwarz inequality: \[\begin{align} \mathcal{E}(k+1) \leq &\; s(k+2)(k+r+2)\left( f(x_{k+2}) - f(x^{\star}) \right) + \frac{3(k+1)^2}{2} \|y_{k+1} - x_{k+1}\|^2 \nonumber \\ &\;+ \frac{3r^2}{2} \| y_{k+1} - x^{\star} \|^2 + \frac{3s^2(k+r+1)^2}{2} \| \nabla f(y_{k+1}) \|^2. \label{eqn:32lyapunov-estimate-mnag} \end{align}\tag{68}\] By comparing the coefficients of the right-hand sides of the two inequalities, 67 and 68 , we deduce the following estimate for the iterative difference: \[\begin{align} \mathcal{E}(k+1) - \mathcal{E}(k) & \leq - \mu s \cdot \min\left\{ \frac{1-Ls}{4}, \frac{1}{3}, \frac{2k+r+3}{3r^2}, \frac{1-Ls}{4 \mu s} \right\} \mathcal{E}(k+1) \nonumber \\ & \leq - \mu s \cdot \left(\frac{1 - Ls}{4} \right) \cdot \mathcal{E}(k+1), \label{eqn:32iter-diff-lyapunov-mnag} \end{align}\tag{69}\] where the last inequality holds due to the definition of the integer \(K\) and the fact that \(0 < \mu s < \mu/L \leq 1\). Finally, by applying standard inequalities for the gradient Lipschitz condition 18 , we complete the proof. ◻
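As with NAG, the Lyapunov function 56 can be tracked numerically along the M-NAG trajectory. The sketch below (our illustration, using the same quadratic test function) compares the observed contraction factor with the factor appearing in 57 .

```python
import numpy as np

def f(x): return 5e-3 * x[0] ** 2 + x[1] ** 2
def grad_f(x): return np.array([1e-2 * x[0], 2.0 * x[1]])

mu, L, s, r, n = 1e-2, 2.0, 0.4, 2, 200
x_star = np.zeros(2)

# Run M-NAG (5)-(7) and record x_k, y_k.
xs, ys = [np.array([1.0, 1.0])], [np.array([1.0, 1.0])]
for k in range(n):
    z = ys[-1] - s * grad_f(ys[-1])
    x_next = z if f(z) <= f(xs[-1]) else xs[-1]
    ys.append(x_next + k / (k + r + 1) * (x_next - xs[-1])
              + (k + r) / (k + r + 1) * (z - x_next))
    xs.append(x_next)

def lyapunov(k):
    """Lyapunov function (56): depends only on x_k and y_k (no velocity v_k)."""
    pot = s * (k + 1) * (k + r + 1) * (f(xs[k + 1]) - f(x_star))
    mix = 0.5 * np.sum((k * (ys[k] - xs[k]) + r * (ys[k] - x_star)
                        - (k + r) * s * grad_f(ys[k])) ** 2)
    return pot + mix

# Per Theorem 5, the same contraction factor as for NAG should be observed (K = 0 for r = 2).
bound = 1.0 / (1.0 + mu * s * (1 - L * s) / 4)
ratios = [lyapunov(k + 1) / lyapunov(k) for k in range(1, n - 1)]
print(max(ratios), bound)
```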
We now extend the linear convergence of M-NAG
, as established in , to include its proximal variant, M-FISTA
. As outlined in , M-FISTA
employs the \(s\)-proximal value defined in 22 and follows the iterative scheme below, starting from any initial point \(y_0 = x_0 \in \mathbb{R}^d\): \[\tag{70}
\begin{empheq}[left=\empheqlbrace]{align}
& z_{k} = P_s(y_{k}), \tag{71} \\
& x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; f(z_{k}) \leq f(x_{k}), \\ & x_{k}, && \text{if}\; f(z_{k}) > f(x_{k}), \end{align} \right. \tag{72} \\
& y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}) + \frac{k+r}{k+r+1} (z_{k} - x_{k+1}), \tag{73}
\end{empheq}\]
where \(s>0\) denotes the step size. To provide a unified perspective, we rewrite M-FISTA
in a form that mirrors the structure of M-NAG
. According to , M-FISTA
replaces the \(s\)-proximal value with the \(s\)-proximal subgradient, as defined in 23 . This yields a new iterative scheme that more closely resembles M-NAG
,
as follows: \[\tag{74}
\begin{empheq}[left=\empheqlbrace]{align}
& z_{k} = y_{k} - sG_s(y_{k}), \tag{75} \\
& x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; f(z_{k}) \leq f(x_{k}), \\ & x_{k}, && \text{if}\; f(z_{k}) > f(x_{k}), \end{align} \right. \tag{76} \\
& y_{k+1} = x_{k+1} + \frac{k}{k+r+1} (x_{k+1} - x_{k}) + \frac{k+r}{k+r+1} (z_{k} - x_{k+1}), \tag{77}
\end{empheq}\]
where the \(s\)-proximal subgradient \(G_s(y_{k})\) replaces the gradient \(\nabla f(y_k)\) in M-NAG
.
Similarly, to establish the linear convergence for M-FISTA
, we generalize the Lyapunov function 56 to its proximal counterpart as follows: \[\begin{align}
\mathcal{E}(k) = &\; s(k+1)(k+r+1)\left( \Phi(x_{k+1}) - \Phi(x^{\star}) \right) \nonumber \\ &\; + \frac{1}{2}\left\| k(y_{k} -x_k) + r(y_k - x^{\star}) - (k + r)s G_s(y_k) \right\|^2. \label{eqn:32lyapunov-m-fista}
\end{align}\tag{78}\] In this formulation, we incorporate the proximal structure into the Lyapunov function to account for the non-smoothness of the objective function. Lemmas 1 and 2 extend the two strongly convex inequalities, 20 and 21 , to the proximal setting. Consequently, we can generalize the three key inequalities, 59 , 63 , and 66 , to account for the proximal adjustments. With these extensions, we can generalize the linear convergence from M-NAG
to M-FISTA
, as formally
stated in the following theorem:
Theorem 6. Let \(\Phi = f + g\) with \(f \in \mathcal{S}_{\mu,L}^{1}(\mathbb{R}^d)\) and \(g \in \mathcal{F}^0(\mathbb{R}^d)\). Given any
step size \(0 < s < 1/L\), there exists a positive integer \(K: = \max\left\{0, \frac{3r^2 - 4r - 12}{8}\right\}\) such that the iterative sequence \(\{x_{k}\}_{k=0}^{\infty}\) generated by M-FISTA,
71 — 73 , with any initial \(x_0 = y_0 \in
\mathbb{R}^d\), satisfies the following inequality: \[\label{eqn:32m-fista-rate}
\Phi(x_k) - \Phi(x^{\star}) \leq \frac{(r +1)\left( \Phi(x_1) - \Phi(x^{\star})\right) + r^2 L\| x_1 - x^{\star} \|^2}{k(k+r) \left[ 1 + (1 - Ls) \cdot \frac{\mu s}{4} \right]^k},\tag{79}\] for any \(k \geq
\max\left\{ 1, K \right\}\).
In this paper, we introduce an additional gradient term to the iterative relation 12 , resulting in a refined gradient-based iterative relation 15 . This
modification paves the way for the development of a novel Lyapunov function. Unlike the Lyapunov function presented in [2], our new formulation not
only eliminates the kinetic energy but also ensures linear convergence for NAG
. Furthermore, the linear convergence derived from this Lyapunov function is independent of both the parameters of the strongly convex functions and the step size,
thus providing a more universal and robust result. By performing a series of straightforward substitutions, we derive the gradient-based iterative relation 17 for M-NAG
. Interestingly, we observe
that this gradient iterative relation 17 coincides with the one in 15 when the position-velocity relation 11 is applied.
Additionally, we further demonstrate that the derivation of linear convergence relies solely on the gradient iterative relation 17 and does not require the position iteration 9 . This key observation allows us to extend the linear convergence from NAG
to M-NAG
. Finally, leveraging the two proximal inequalities for composite functions developed in [2], [9], which serve as the proximal analogs of the two classical strongly convex inequalities, we
extend the linear convergence to both FISTA
and M-FISTA
. This extension broadens the applicability of our results, providing insights into the convergence behavior of these widely used proximal algorithms.
For \(\mu\)-strongly convex functions, another Nesterov’s accelerated gradient method, referred to as NAG-SC
, was originally proposed by [21]. In NAG-SC
, the momentum coefficient is defined as \(\frac{1-\sqrt{\mu s}}{1 + \sqrt{\mu s}}\), in contrast to \(\frac{k}{k+r+1}\) used in the standard NAG
. The iterative scheme for NAG-SC
is given by: \[\tag{80}
\begin{empheq}[left=\empheqlbrace]{align}
& x_{k+1} = y_{k} - s\nabla f(y_{k}), \tag{81} \\
& y_{k+1} = x_{k+1} + \frac{1-\sqrt{\mu s}}{1 + \sqrt{\mu s}} (x_{k+1} - x_{k}), \tag{82}
\end{empheq}\] where \(s\) is the step
size. Building on the same idea introduced by [3], we can propose a monotonic variant of NAG-SC
, which we denote
as M-NAG-SC
. The iterative scheme for M-NAG-SC
is as follows: \[\tag{83}
\begin{empheq}[left=\empheqlbrace]{align}
& z_{k} = y_{k} - s\nabla f(y_{k}), \tag{84} \\
& x_{k+1} = \left\{ \begin{align} & z_{k}, && \text{if}\; f(z_{k}) \leq f(x_{k}), \\ & x_{k}, && \text{if}\; f(z_{k}) > f(x_{k}), \end{align} \right. \tag{85} \\
& y_{k+1} = x_{k+1} +\frac{1-\sqrt{\mu s}}{1 + \sqrt{\mu s}}(x_{k+1} - x_{k}) + (z_{k} - x_{k+1}), \tag{86}
\end{empheq}\]
where \(s\) is the step size. We demonstrate the numerical performance of both NAG-SC
and M-NAG-SC
in comparison with the vanilla gradient descent method in Figure 2.
Figure 2: Numerical comparison of the iterative progression of function values for the vanilla gradient descent method, NAG-SC
and M-NAG-SC
. The step size is \(s = 0.01\). All
optimization algorithms are applied to the quadratic function \(f(x_1, x_2) = 5 \times 10^{-3}x_1^2 + x_2^2\).
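For completeness, a brief sketch of the NAG-SC updates 81 and 82 and the M-NAG-SC updates 84 — 86 is given below, under the setting of Figure 2; the implementation details beyond the updates themselves are illustrative.

```python
import numpy as np

def f(x): return 5e-3 * x[0] ** 2 + x[1] ** 2
def grad_f(x): return np.array([1e-2 * x[0], 2.0 * x[1]])

mu, s = 1e-2, 0.01                                       # strong convexity modulus and step size (Figure 2)
beta = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))     # constant momentum coefficient

def nag_sc(x0, num_iters=500):
    """NAG-SC, updates (81)-(82)."""
    x, y = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = y - s * grad_f(y)
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

def m_nag_sc(x0, num_iters=500):
    """M-NAG-SC, updates (84)-(86): the comparison step enforces monotone function values."""
    x, y = x0.copy(), x0.copy()
    for _ in range(num_iters):
        z = y - s * grad_f(y)
        x_next = z if f(z) <= f(x) else x
        y = x_next + beta * (x_next - x) + (z - x_next)
        x = x_next
    return x

print(f(nag_sc(np.array([1.0, 1.0]))), f(m_nag_sc(np.array([1.0, 1.0]))))
```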
An intriguing avenue for future research is to explore whether M-NAG-SC
, as described by 84 — 86 , can retain the accelerated convergence rate
of NAG-SC
. However, it is important to note that the Lyapunov function presented in [6], which incorporates the
kinetic energy as an essential part of the analysis, complicates the direct generalization from NAG-SC
to M-NAG-SC
. Thus, extending the Lyapunov analysis to the monotonic variant requires further exploration.
We thank Bowen Li for his helpful and insightful discussions. Mingwei Fu acknowledges partial support from the Loo-Keng Hua Scholarship of CAS. Bin Shi was partially supported by the startup fund from SIMIS and Grant No.YSBR-034 of CAS.
Corresponding author: binshi@simis.cn↩︎
In detail, \(x^{\star}\) is the unique global minimum, and the fundamental inequality for strongly convex functions is rigorously shown in 20 .↩︎