Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance


Abstract

This paper provides the first tight convergence analyses for RMSProp and Adam in non-convex optimization under the most relaxed assumptions of coordinate-wise generalized smoothness and affine noise variance. We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum. Specifically, to address the challenges caused by the dependence among the adaptive update, the unbounded gradient estimate, and the Lipschitz constant, we demonstrate that the first-order term in the descent lemma converges, and that its denominator is upper bounded by a function of the gradient norm. Based on this result, we show that RMSProp with proper hyperparameters converges to an \(\epsilon\)-stationary point with an iteration complexity of \(\mathcal{O}(\epsilon^{-4})\). We then generalize our analysis to Adam, where the additional challenge is the mismatch between the gradient and the first-order momentum. We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm. We show that Adam with proper hyperparameters converges to an \(\epsilon\)-stationary point with an iteration complexity of \(\mathcal{O}(\epsilon^{-4})\). Our complexity results for both RMSProp and Adam match the lower bound established in [1].

1 Introduction↩︎

RMSProp [2] and Adam [3] are among the most popular and powerful adaptive optimizers for training state-of-the-art machine learning models [4]–[7]. RMSProp and Adam only require first-order gradients with little memory overhead, and thus are efficient to use in practice. RMSProp is based on the idea of adaptive learning rates for each individual parameter, and Adam combines the benefits of RMSProp [2] and AdaGrad [8] through its two key components: adaptive learning rates and momentum. Despite their empirical success, the theoretical understanding of their convergence and complexity, especially when optimizing non-convex loss functions, e.g., neural networks, remained underdeveloped until very recently.

Recently, there has been a series of works examining the convergence and complexity of RMSProp and Adam for non-convex loss functions (see Section 1.2 for a detailed review). However, these works do not completely explain the performance of RMSProp and Adam in training neural networks, as they rely on assumptions that may not necessarily hold. For example, [9] pointed out that neural networks are not \(L\)-smooth, and instead satisfy the generalized \((L_0,L_1)\)-smoothness, where the gradient Lipschitz constant increases linearly with the gradient norm. Furthermore, many of these works assumed that the stochastic gradient has a bounded norm/variance, which does not even hold for linear regression [10]; a relaxed affine noise condition should be used instead.

In this paper, we derive convergence guarantees and iteration complexities for RMSProp and Adam with coordinate-wise generalized \((L_0,L_1)\)-smooth loss functions and affine noise variance. To the best of our knowledge, these are among the most relaxed assumptions in the convergence analyses of RMSProp and Adam and best describe the training of neural networks. We prove that RMSProp and Adam with proper hyperparameters converge to an \(\epsilon\)-stationary point with a complexity of \(\mathcal{O}(\epsilon^{-4})\), which matches the lower bound for first-order optimization in [1].

1.1 Challenges and Insights↩︎

Our theoretical analyses address several major challenges: (i) dependence between stepsize and gradient, (ii) potential unbounded gradients, (iii) mismatch between gradient and first-order momentum, and (iv) additional bias terms due to affine variance and coordinate-wise \((L_0,L_1)\)-smoothness. Prior research circumvented most of these challenges by introducing extra assumptions, whereas we provide several new insights and show that these assumptions may not be needed.

Algorithm 1: Adam

In this paper, we have access to an unbiased estimate \(\boldsymbol{g}\) of \(\nabla f(\boldsymbol{x})\) such that \(\mathbb{E}[\boldsymbol{g}| \boldsymbol{x}]=\nabla f(\boldsymbol{x})\) where \(\boldsymbol{x}\in \mathbb{R}^d\) is a \(d\)-dimensional parameter and \(f: \mathbb{R}^d\to \mathbb{R}\) is the objective function. Define \(\mathcal{F}_t:=\sigma(\boldsymbol{g}_1,...,\boldsymbol{g}_{t-1})\) as the sigma field of the stochastic gradients up to \(t-1\). Consider the Adam algorithm in Algorithm 1, which reduces to RMSProp if \(\beta_1=0\). For the coordinate-wise \((L_0,L_1)\)-smooth objective functions, we have the following descent inequality (Lemma 1 in [11]): \[\begin{align} &\underbrace{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{x}_t-\boldsymbol{x}_{t+1} \right\rangle|\mathcal{F}_t\right]}_{\text{First Order}} \le f(\boldsymbol{x}_t)-\mathbb{E}[f(\boldsymbol{x}_{t+1})|\mathcal{F}_t]\nonumber\\ &+\underbrace{\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}_{\text{Second Order}} \label{eq:generalsmooth} \nonumber\\ &+\underbrace{\sum_{i=1}^d\frac{L_1|\partial_i f(\boldsymbol{x}_t)|}{2}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}_{\text{Additional Term}}, \end{align}\tag{1}\] where the last term is the additional term due to the \((L_0,L_1)\)-smooth assumption.
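To see how (1 ) relates to the standard descent lemma for \(L\)-smooth functions, note that when \(L_1=0\) the additional term vanishes, and by the Cauchy–Schwarz inequality \(\sum_{i=1}^d|\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|\le \sqrt{d}\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\|\), so the second-order term collapses to the familiar quadratic penalty: \[\begin{align} \sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]\le \frac{L_0}{2}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\|^2|\mathcal{F}_t]. \end{align}\]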

Challenge 1: dependence between stepsize and gradient. We use the RMSProp optimizer to explain our technical novelty; the challenge is the same for Adam. For RMSProp, the parameter \(\boldsymbol{x}\) is updated as \(\boldsymbol{x}_{t+1}=\boldsymbol{x}_t-\frac{\eta \boldsymbol{g}_t}{\sqrt{\boldsymbol{v}_t}+\zeta}\), where the adaptive stepsize \(\boldsymbol{\eta}_t=\frac{\eta}{\sqrt{\boldsymbol{v}_t}+\zeta}\) depends on the current gradient estimate \(\boldsymbol{g}_t\), which makes it hard to bound the conditional expectation of the first-order term in 1 . To address this challenge, studies on AdaGrad [12]–[14] and on RMSProp [15] proposed a surrogate \(\tilde{\boldsymbol{v}}_t\) of \(\boldsymbol{v}_t\) which is independent of \(\boldsymbol{g}_t\), and then the first-order term is divided into two parts: the first-order.a term \(\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{-\eta \boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t}+\zeta} \right\rangle\Big |\mathcal{F}_t\right]\) and the first-order.b term \(\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{-\eta \boldsymbol{g}_t}{\sqrt{\boldsymbol{v}_t}+\zeta} +\frac{\eta \boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t}+\zeta}\right\rangle\Big |\mathcal{F}_t\right]\). The main challenge lies in the first-order.b term (the surrogate error). In [15], this term is bounded based on the assumption of a bounded gradient norm, which does not hold in this paper.

Insight 1: reduce surrogate error. Recently, [10] showed that in the \(t\)-th iteration, the error term can be written as \(\mathbb{E} [\phi(\boldsymbol{v}_{t-1})-\phi(\boldsymbol{v}_{t})|\mathcal{F}_t]\) for some function \(\phi\). Thus, these terms cancel out with each other when summing from \(t=1\) to \(T\). However, this method only works for AdaGrad due to its non-increasing stepsize. We develop a novel approach, which involves a minor change to the adaptive stepsize from \(\frac{\eta}{\sqrt{\boldsymbol{v}_t}+\zeta}\) to \(\frac{\eta}{\sqrt{\boldsymbol{v}_t+\zeta }}\) but does not change the adaptive nature of the stepsize. In this way, we can cancel out terms by summing over \(t\) (details can be found in Lemma 1).

Challenge 2: unbounded gradient. Previous works, e.g., [13], [15], [16], assumed that the gradient norm is bounded, based on which they proved convergence. However, under the refined gradient noise, the gradient norm may be unbounded, and thus those approaches do not apply.

Insight 2: recursive bounding technique. We first show that \(\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\frac{\|\nabla f(\boldsymbol{x}_t)\|^2}{{\sqrt{\|\tilde{\boldsymbol{v}}_t\|+\zeta}}}\right]\) is bounded. If the gradient norm were upper bounded, then \(\tilde{\boldsymbol{v}}_t\) would be bounded, and the convergence result would directly follow. However, under the refined gradient noise, the gradient norm may not be bounded. We develop a novel approach to bound \(\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T \sqrt{\|\tilde{\boldsymbol{v}}_t\|+\zeta}\right]\) using \(\mathbb{E}\left[{\sum_{t=1}^T\|\nabla f(\boldsymbol{x}_t)\|}\right]\) instead of a constant (see Lemma 3 for details). Applying Hölder’s inequality [17], we then obtain the convergence result, and the resulting complexity matches the lower bound in [1].

Challenge 3: mismatch between gradient and first-order momentum. Compared with RMSProp, Adam employs the first momentum \(\boldsymbol{m}_t\). The momentum \(\boldsymbol{m}_t\) is dependent on the surrogate stepsize \(\frac{\eta}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}}\). Moreover, the momentum \(\boldsymbol{m}_t\) is a biased estimate of the current true gradient. Both the above challenges make it hard to theoretically characterize the convergence rate of Adam.

Insight 3: bounding first-order term using \(\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^T\|\nabla f(\boldsymbol{x}_t)\|\big]\). Inspired by the analysis for SGDM [18] and Adam [19], we consider a potential function of \(f(\boldsymbol{u}_t)\) with \(\boldsymbol{u}_t=\frac{\boldsymbol{x}_t-({\beta_1}/{\sqrt{\beta_2}})\boldsymbol{x}_{t-1}}{1-{\beta_1}/{\sqrt{\beta_2}}}\). We can show that \(\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\) is close to a function of \(\frac{\boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}}\), which is much easier to analyze than \(\frac{\boldsymbol{m}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}}\). Using the descent lemma of \(f(\boldsymbol{u}_t)\), we show that the first-order term is also bounded by a constant plus a function of \(\mathbb{E}\left[\frac{\sum_{t=1}^T\|\nabla f(\boldsymbol{x}_t)\|}{T}\right]\). Compared to RMSProp, this additional function is introduced by the bias of \(\boldsymbol{m}_t\). Then, via Hölder’s inequality, we show that Adam converges as fast as RMSProp.
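For intuition, here is a sanity-check computation in the simpler setting of SGD with momentum (constant stepsize \(\eta\), no adaptive denominator), where \(\boldsymbol{m}_t=\beta_1\boldsymbol{m}_{t-1}+(1-\beta_1)\boldsymbol{g}_t\), \(\boldsymbol{x}_{t+1}=\boldsymbol{x}_t-\eta\boldsymbol{m}_t\), and the potential sequence simplifies to \(\boldsymbol{u}_t=\frac{\boldsymbol{x}_t-\beta_1\boldsymbol{x}_{t-1}}{1-\beta_1}\): \[\begin{align} \boldsymbol{u}_{t+1}-\boldsymbol{u}_t=\frac{(\boldsymbol{x}_{t+1}-\boldsymbol{x}_t)-\beta_1(\boldsymbol{x}_t-\boldsymbol{x}_{t-1})}{1-\beta_1}=\frac{-\eta\boldsymbol{m}_t+\beta_1\eta\boldsymbol{m}_{t-1}}{1-\beta_1}=-\eta\boldsymbol{g}_t, \end{align}\] so the momentum disappears entirely. For Adam, the same computation leaves only small corrections caused by the time-varying denominator \(\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}\), which is exactly what our analysis controls.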

Challenge 4: additional error terms due to affine variance and \((L_0,L_1)\)-smoothness. Compared with \(L\)-smooth objectives, the analysis for RMSProp with \((L_0,L_1)\)-smooth objectives involves an additional second-order error term, \(\sum_{i=1}^d\frac{L_1|\partial_i f(\boldsymbol{x}_t)|}{2}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|]\), which is hard to bound since \(|\partial_i f(\boldsymbol{x}_t)|\) may be unbounded. Moreover, for RMSProp, since \(\mathbb{E}[|\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]\le\mathbb{E}\left[\frac{\eta |\boldsymbol{g}_{t,i}|}{\sqrt{\tilde{\boldsymbol{v}}_{t,i}+\zeta}}\Big |\mathcal{F}_t\right]\) and \(\boldsymbol{g}_{t,i}\) is independent of \(\tilde{\boldsymbol{v}}_{t,i}\) given \(\mathcal{F}_t\), the affine noise assumption can be leveraged to bound the second-order error term directly. Nevertheless, for Adam, due to the dependence between \(\boldsymbol{m}_{t,i}\) and \(\tilde{\boldsymbol{v}}_{t,i}\), the above approach cannot be applied directly.

Insight 4: bounding additional term by function of first-order term. For RMSProp, we can show that \(\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\|\le \frac{\eta\sqrt{d}}{\sqrt{1-\beta_2}}\) and \(|\partial_i f(\boldsymbol{x}_t)||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|\le \frac{|\partial_i f(\boldsymbol{x}_{t})|^2+\eta^2\boldsymbol{g}_{t,i}^2}{2\sqrt{\tilde{\boldsymbol{v}}_{t,i}+\zeta}}\). According to the affine noise assumption, we have that \(\mathbb{E}[\boldsymbol{g}_{t,i}^2|\mathcal{F}_t]\) is upper bounded by a linear function of \((\partial_i f(\boldsymbol{x}_t))^2\), thus we can bound the additional error term. However, we cannot directly apply this method to Adam since \(\mathbb{E}[\boldsymbol{m}_{t,i}^2|\mathcal{F}_t]\) is hard to bound. Instead, we provide a new upper bound on \(\sum_{t=1}^T\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}}}\) using the gradient norm.
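Both auxiliary inequalities in this insight follow from elementary steps. Since \(\boldsymbol{v}_{t,i}\ge (1-\beta_2)\boldsymbol{g}_{t,i}^2\), we have \(\frac{|\boldsymbol{g}_{t,i}|}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\le \frac{1}{\sqrt{1-\beta_2}}\), which gives \(\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\|\le \frac{\eta\sqrt{d}}{\sqrt{1-\beta_2}}\); and since \(\boldsymbol{v}_{t,i}\ge \beta_2\boldsymbol{v}_{t-1,i}=\tilde{\boldsymbol{v}}_{t,i}\), applying \(ab\le \frac{a^2+b^2}{2}\) with \(a=|\partial_i f(\boldsymbol{x}_t)|\) and \(b=\eta|\boldsymbol{g}_{t,i}|\) yields \[\begin{align} |\partial_i f(\boldsymbol{x}_t)||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|\le \frac{\eta|\partial_i f(\boldsymbol{x}_t)||\boldsymbol{g}_{t,i}|}{\sqrt{\tilde{\boldsymbol{v}}_{t,i}+\zeta}}\le \frac{(\partial_i f(\boldsymbol{x}_{t}))^2+\eta^2\boldsymbol{g}_{t,i}^2}{2\sqrt{\tilde{\boldsymbol{v}}_{t,i}+\zeta}}. \end{align}\]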

1.2 Related work↩︎

1.2.1 Relaxed Assumptions↩︎

Affine Noise Variance: In most studies on stochastic optimization, access to an unbiased estimate of the gradient with uniformly bounded variance is assumed [20]–[23]. [21] first showed that for \(L\)-smooth objectives, the SGD algorithm converges to a first-order \(\epsilon\)-stationary point with an iteration complexity of \(\mathcal{O}(\epsilon^{-4})\) if the stochastic gradient has uniformly bounded variance. Furthermore, [1] proved that for any first-order algorithm with uniformly bounded gradient variance, the iteration complexity of \(\mathcal{O}(\epsilon^{-4})\) is optimal. For overparameterized neural networks, [24] considered another gradient noise assumption, the strong growth condition, where the upper bound on the second-order moment of the oracle norm scales with the squared gradient norm. Both the uniformly bounded variance and the strong growth condition are special cases of affine noise variance. It was demonstrated in [25] that for non-adaptive algorithms with affine noise variance, the optimal iteration complexity of \(\mathcal{O}(\epsilon^{-4})\) can still be achieved. The extension of the affine noise assumption to adaptive algorithms is not straightforward and was studied in [10], [14], [26]–[31]. In this paper, we study two adaptive optimizers, RMSProp and Adam, under affine noise variance.

Generalized Smoothness: In stochastic optimization, \(L\)-smooth objectives are widely assumed [21], [32]. However, it was demonstrated in [9] that \(L\)-smoothness does not hold for neural networks or polynomial functions of degree larger than \(2\). Extensive experiments were then conducted to verify that such functions satisfy the generalized (\(L_0,L_1\))-smoothness condition, where the gradient Lipschitz constant increases linearly with the gradient norm. Compared with \(L\)-smoothness, (\(L_0,L_1\))-smoothness introduces extra second-order error terms, thus making the optimization problem harder to solve. Clipping algorithms for generalized smooth functions were studied in [9], [33]. However, they require the gradient norm to be bounded. A relaxed assumption of bounded gradient variance was studied in [34], where the SPIDER algorithm was applied. Furthermore, [27] showed that for generalized smooth objectives with affine noise variance, the SPIDER algorithm still finds a stationary point. Under the same assumptions, [26] provided the convergence rate for a normalized momentum algorithm. With extensive experiments, [11] showed that in the training of Transformers, the \((L_0,L_1)\)-smoothness holds coordinate-wise. In this paper, we consider functions that are coordinate-wise \((L_0,L_1)\)-smooth.

1.2.2 Adaptive Optimizers↩︎

Adaptive optimizers are widely used in deep learning due to their ability to adjust to changing data and conditions. AdaGrad [8] was the first adaptive algorithm; it accumulates the sum of past squared gradients and uses the reciprocal of its square root to scale the current gradient. Recently, [10], [30] studied AdaGrad under generalized smoothness and affine noise variance conditions. However, training with AdaGrad may stall prematurely since the accumulated sum never shrinks, and thus the learning rate can become extremely close to \(0\). To overcome this problem, RMSProp [2] was proposed, where an exponentially weighted moving average replaces the accumulated sum. Thus, the adaptive learning rate can either increase or decrease. There is a rich literature on the convergence analyses of RMSProp [15], [16], [29]. However, all of them focus on \(L\)-smooth objectives, and only [29] considered the affine noise variance. RMSProp is a special case of Adam that keeps only the second-order momentum; Adam itself is widely studied, e.g., in [13], [19], [28], [31], [35]–[39]. In [39], Adam for \((L_0,L_1)\)-smooth objectives with sub-Gaussian norm was studied, where the gradient estimation error follows a sub-Gaussian distribution. Under this assumption, based on the gradient estimate, the real gradient belongs to a bounded set with high probability, which converts the unbounded Lipschitz constant into a bounded one. The only results for Adam on \((L_0,L_1)\)-smoothness with affine noise (for the special case of finite sum problems) were given in [28]. However, they only showed that Adam converges to a neighborhood of a stationary point with a constant learning rate. More details can be found in Table 1.

Table 1: Comparison of existing RMSProp and Adam analyses. Only our paper focuses on the \((L_0,L_1)\)-smooth objectives with affine noise variance, and we show convergence to a stationary point. For \(\nabla f(\boldsymbol{x})\) with its estimate \(\boldsymbol{g}\), the bounded norm assumption is \(\|\boldsymbol{g}\|\le G\) (almost surely), where \(G\) is some positive constant. The bounded second-order moment assumption is \(\mathbb{E}[\|\boldsymbol{g}\|^2]\le G^2\). The sub-Gaussian norm assumption is that \(\|\boldsymbol{g}-\nabla f(\boldsymbol{x})\|\) follows a sub-Gaussian distribution, which is weaker than the bounded norm assumption but stronger than the bounded variance assumption.
Explanation of the superscripts: \(a\): the algorithm only converges to a neighborhood of a stationary point, whose radius cannot be made small. \(b\): [16] requires the signs of the gradients to be the same across batches. \(c\): [38] requires an upper bound on the gradient norm. \(d\): a variance-reduced method is also investigated in [39], with complexity \(\mathcal{O}(\epsilon^{-3})\).
| Method | Smoothness | Algorithm | Stationary point | Assumption | Batch size | Complexity |
|---|---|---|---|---|---|---|
| [16] | \(L\)-smooth | RMSProp | ✓\(^b\) | Bounded Norm | \(\mathcal{O}(1)\) | \(\mathcal{O}({\epsilon}^{-4})\) |
| [15] | \(L\)-smooth | RMSProp | ✓ | Bounded Norm | \(\mathcal{O}({\epsilon}^{-2})\) | \(\mathcal{O}({\epsilon}^{-4})\) |
| [29] | \(L\)-smooth | RMSProp | ✗\(^a\) | Finite Sum Affine Noise | - | - |
| [13] | \(L\)-smooth | Adam | ✓ | Bounded Norm | \(\mathcal{O}(1)\) | \(\mathcal{O}({\epsilon}^{-4})\) |
| [35] | \(L\)-smooth | Adam | ✓ | Bounded Second-order Moment | \(\mathcal{O}(1)\) | \(\mathcal{O}({\epsilon}^{-4})\) |
| [36] | \(L\)-smooth | Adam | - | Bounded Norm | - | - |
| [37] | \(L\)-smooth | Adam | - | Finite Sum Affine Noise | - | - |
| [28] | Finite Sum \((L_0,L_1)\)-smooth | Adam | ✗\(^a\) | Finite Sum Affine Noise | - | - |
| [38] | \(L\)-smooth | Adam | ✓\(^c\) | Affine Noise | \(\mathcal{O}(1)\) | \(\mathcal{O}({\epsilon}^{-4})\) |
| [31] | \(L\)-smooth | Adam | ✓ | Coordinate-wise Affine Noise | \(\mathcal{O}(1)\) | \(\mathcal{O}(\mathrm{poly}(\log(\epsilon^{-1})){\epsilon}^{-4})\) |
| [19] | \(L\)-smooth | Adam | ✓ | Coordinate-wise Affine Noise | \(\mathcal{O}(1)\) | \(\mathcal{O}({\epsilon}^{-4})\) |
| [39] | \((L_0,L_1)\)-smooth | Adam | ✓ | Sub-Gaussian Norm | \(\mathcal{O}(1)\) | \(\mathcal{O}({\epsilon}^{-4})\)\(^d\) |
| Our method | Coordinate-wise \((L_0,L_1)\)-smooth | Adam | ✓ | Coordinate-wise Affine Noise | \(\mathcal{O}(1)\) | \(\mathcal{O}(\epsilon^{-4})\) |

There are also two recent works [19], [31] that studied Adam for \(L\)-smooth objectives with affine noise variance. However, their methods cannot be generalized to \((L_0,L_1)\)-smooth objectives, either because the additional terms invalidate a key inequality or because their key lemmas require a bounded Lipschitz constant. In this paper, we study the more practical and challenging coordinate-wise \((L_0,L_1)\)-smooth objectives. Moreover, we show that RMSProp and Adam converge to a stationary point with a complexity of \(\mathcal{O}(\epsilon^{-4})\), which matches the lower bound in [1]. Our results improve those in [31] by logarithmic factors.

2 Preliminaries↩︎

Let \(f:\mathbb{R}^d \to \mathbb{R}\) be a differentiable non-convex loss function. For a positive integer \(d\), let \([d]\) denote the set \(\{1,2,..., d\}\). Let \(\boldsymbol{x}\in \mathbb{R}^d\) be the optimization variable. Our goal is to minimize the objective function \(f(\boldsymbol{x})\): \[\min_{\boldsymbol{x}} f(\boldsymbol{x}).\] For a differentiable function \(f\), a point \(\boldsymbol{x}\in \mathbb{R}^d\) is called a first-order \(\epsilon\)-stationary point if \(\|\nabla f(\boldsymbol{x})\|\le \epsilon\). Denote by \(\boldsymbol{x}_t\in \mathbb{R}^d\) the optimization variable at the \(t\)-th iteration, where we have access to an estimate \(\boldsymbol{g}_t\) of \(\nabla f(\boldsymbol{x}_t)\). Define \(\mathcal{F}_t:=\sigma(\boldsymbol{g}_1,...,\boldsymbol{g}_{t-1})\) as the sigma field of the stochastic gradients up to \(t-1\). We focus on the Adam algorithm shown in Algorithm 1, where \(\odot\) denotes the Hadamard product. For any \(i\in [d]\), \(\partial_i f(\boldsymbol{x}_t), \boldsymbol{g}_{t,i}, \boldsymbol{m}_{t,i}\) and \(\boldsymbol{v}_{t,i}\) are the \(i\)-th elements of \(\nabla f(\boldsymbol{x}_t), \boldsymbol{g}_{t}, \boldsymbol{m}_{t}\) and \(\boldsymbol{v}_{t}\), respectively.

Compared with the original Adam, we make a minor change in the adaptive stepsize from \(\boldsymbol{\eta}_t= \frac{\eta}{\sqrt{{\boldsymbol{v}_t} }+ \zeta }\) to \(\frac{\eta}{\sqrt{{\boldsymbol{v}_t} + \zeta}}\). This minor change does not influence the adaptivity of the algorithm but makes the analysis much easier.
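To make the update concrete, below is a minimal NumPy sketch of Algorithm 1 with the modified stepsize; the hyperparameter values and the toy problem are illustrative placeholders, and no bias-correction step is included since none appears in our analysis. Setting \(\beta_1=0\) recovers RMSProp.

```python
import numpy as np

def adam_step(x, m, v, g, eta, beta1=0.9, beta2=0.999, zeta=1e-8):
    """One iteration of Algorithm 1 (sketch; no bias correction assumed)."""
    m = beta1 * m + (1 - beta1) * g        # first-order momentum
    v = beta2 * v + (1 - beta2) * g * g    # second-order momentum
    # Modified stepsize: eta / sqrt(v + zeta), not eta / (sqrt(v) + zeta).
    x = x - eta * m / np.sqrt(v + zeta)
    return x, m, v

# Usage on a toy problem f(x) = ||x||^2 / 2 with noisy gradients:
rng = np.random.default_rng(0)
x, m, v = np.ones(4), np.zeros(4), np.zeros(4)
for t in range(2000):
    g = x + 0.1 * rng.standard_normal(4)   # unbiased gradient estimate
    x, m, v = adam_step(x, m, v, g, eta=1e-2)
print(np.linalg.norm(x))                   # close to the noise floor
```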

2.1 Technical Assumptions↩︎

Throughout this paper, we make the following assumptions.

Assumption 1. \(f(\boldsymbol{x})\) is bounded from below such that \(\inf_{\boldsymbol{x}} f(\boldsymbol{x})>-\infty\).

Assumption 2 (Coordinate-wise affine noise variance [19], [31]). We have access to an unbiased gradient estimate \(\boldsymbol{g}_t\) such that \(\mathbb{E}[\boldsymbol{g}_t|\mathcal{F}_t]=\nabla f(\boldsymbol{x}_t)\) and, for any \(i\in [d]\), \(\mathbb{E}[\boldsymbol{g}_{t,i}^2|\mathcal{F}_t]\le D_0+D_1 (\partial_i f(\boldsymbol{x}_{t}))^2\), where \(D_0,D_1\ge 0\) are some constants.

As discussed in [31], this assumption allows the magnitude of the noise to scale with the corresponding gradient coordinate. Many widely used noise assumptions are special cases of this affine noise variance assumption. For example, when \(D_1=0\), it reduces to the bounded second-order moment assumption in [35], and for the finite sum formulation \(f(\boldsymbol{x})=\frac{\sum_{j=1}^N f_j(\boldsymbol{x})}{N}\), when \(D_1=\frac{1}{N}\), it is equivalent to bounded gradient variance. However, as pointed out in [10], these two assumptions of bounded second-order moment and bounded gradient variance do not even hold for linear regression problems. The case \(D_0=0\) is called the “strong growth condition” [24], which can be seen as a simple condition for overparameterized neural networks that interpolate all data points. Under Assumption 2, the magnitude of the gradient noise can scale with the norm of the true gradient. This is important for model parameters that are multiplicatively perturbed by noise, e.g., multilayer networks [14]. In this paper, we study the coordinate-wise affine noise variance assumption, which was also used in [19], [31].

Though the \(L\)-smoothness assumption is widely used in optimization studies, it has recently been observed that this assumption does not hold in the training of neural networks. Instead, it is numerically verified that the following generalized smoothness assumption better models the training of neural networks [9]: \(\|\nabla f(\boldsymbol{x})-\nabla f(\boldsymbol{y})\|\le (L_0+L_1\|\nabla f(\boldsymbol{x})\|)\|\boldsymbol{x}-\boldsymbol{y}\|\) for some positive \(L_0\) and \(L_1\). This assumption is widely studied in the literature, e.g., [10], [26]–[28], [30], [39]. Compared with \(L\)-smooth functions, for generalized smooth functions the Lipschitz constant scales with the true gradient norm and thus may not be bounded. For the training of Transformer models, [11] found that the following coordinate-wise \((L_0,L_1)\)-smoothness provides a more accurate characterization of the objective. It generalizes the coordinate-wise \(L\)-smoothness of [40]–[42]. In this paper, we focus on coordinate-wise \((L_0,L_1)\)-smooth functions as defined below:

Assumption 3 (Coordinate-wise \((L_0,L_1)\)-smoothness). A function \(f\) is coordinate-wise \((L_0,L_1)\)-smooth if for any \(\boldsymbol{x},\boldsymbol{y}\in \mathbb{R}^d\) and \(i\in[d]\), \[\begin{align} |\partial_i f(\boldsymbol{x})-\partial_i f(\boldsymbol{y})|\le (L_0+L_1|\partial_i f(\boldsymbol{x})|)\|\boldsymbol{x}-\boldsymbol{y}\|. \end{align}\]

Adam enjoys adaptive learning rates for each individual parameter due to its coordinate-wise update process. Moreover, extensive experiments in [11] show that \(L_0\) and \(L_1\) vary significantly across coordinates. Therefore, it is more accurate to leverage the coordinate-wise \((L_0,L_1)\)-smoothness. In this paper, for the sake of simplicity and coherence with the coordinate-wise affine noise assumption, we assume that \(L_0\) and \(L_1\) are identical across coordinates. Our results can be easily adapted to the case with distinct \(L_0\) and \(L_1\) for different coordinates.
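As an illustration of how Assumptions 2 and 3 can be checked numerically, the following sketch verifies both conditions for the toy objective \(f(\boldsymbol{x})=\frac{1}{4}\sum_{i=1}^d x_i^4\), a degree-4 polynomial that is not \(L\)-smooth. The noise model and the constants \(D_0, D_1, L_0, L_1\) are illustrative choices for this example only, not quantities from our analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
D0, D1 = 0.5, 1.5   # illustrative Assumption 2 constants (exact for this noise model)
L0, L1 = 3.0, 3.0   # illustrative Assumption 3 constants (checked for nearby x, y)

def grad(x):
    return x ** 3   # coordinate-wise gradient of f(x) = sum(x**4) / 4

for _ in range(10_000):
    x = 2.0 * rng.standard_normal(d)
    df = grad(x)

    # Assumption 2: take g = df + sigma * xi with sigma_i = 0.5 + 0.5 * |df_i|
    # and xi ~ N(0, I), so E[g_i^2 | x] = df_i^2 + sigma_i^2 in closed form.
    second_moment = df ** 2 + (0.5 + 0.5 * np.abs(df)) ** 2
    assert np.all(second_moment <= D0 + D1 * df ** 2 + 1e-12)

    # Assumption 3: coordinate-wise (L0, L1)-smoothness at a nearby point y.
    y = x + 1e-2 * rng.standard_normal(d)
    lhs = np.abs(grad(x) - grad(y))
    rhs = (L0 + L1 * np.abs(df)) * np.linalg.norm(x - y)
    assert np.all(lhs <= rhs)

print("Assumptions 2 and 3 hold on all sampled points")
```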

3 Convergence Analysis of RMSProp↩︎

To provide a fundamental understanding of Adam, in this section we focus on RMSProp, which retains Adam's adaptive learning rates for each individual parameter but omits the first-order momentum.

For RMSProp, the main challenges come from the dependence between stepsize and gradient, potentially unbounded gradients due to the affine noise, and the additional error terms due to \((L_0,L_1)\)-smoothness. The analysis can be extended to general Adam, which requires additional effort to handle the first-order momentum. Define \(c=\sqrt{\zeta}+d\sqrt{D_0+\|\boldsymbol{v}_0\|}\). We then present our analysis of RMSProp in the following theorem.

Theorem 1 (Informal). Let Assumptions 1, 2 and 3 hold. Let \(1-\beta_2\sim \mathcal{O}(\epsilon^{2})\), \(\eta \sim \mathcal{O}(\epsilon^{2})\), and \(T\sim \mathcal{O}(\epsilon^{-4})\). For small \(\epsilon\) such that \(\epsilon\le \frac{\sqrt{5dD_0}}{\sqrt{D_1}\sqrt[4]{\zeta}}\), we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\le \left(\frac{2d\sqrt{35D_0D_1}}{\sqrt[4]{\zeta}}+\sqrt{c}\right)\epsilon. \end{align}\]

To the best of our knowledge, this paper provides the first convergence analysis of RMSProp for \((L_0,L_1)\)-smooth functions under the most relaxed assumption of affine noise variance. Existing studies mostly assume bounded gradient norm or variance [15], [16] or only show the algorithm converges to the neighborhood of a stationary point [29]. More importantly, our result matches the lower bound in [1], and thus is optimal.
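To make the hyperparameter scalings in Theorem 1 concrete, the following toy helper instantiates them with all proportionality constants set to one; the actual constants depend on \(d, D_0, D_1, L_0, L_1\) and \(\zeta\) (see the formal statement in Appendix 9).

```python
import math

def rmsprop_hyperparams(eps):
    """Illustrative instantiation of Theorem 1's scalings (unit constants)."""
    beta2 = 1.0 - eps ** 2    # 1 - beta2 ~ O(eps^2)
    eta = eps ** 2            # eta ~ O(eps^2)
    T = math.ceil(eps ** -4)  # T ~ O(eps^-4)
    return beta2, eta, T

print(rmsprop_hyperparams(0.1))  # beta2 = 0.99, eta = 0.01, T = 10**4 (up to rounding)
```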

The formal version of the theorem and detailed proof can be found in Appendix 9. Below, we provide a proof sketch to highlight our major technical novelties.

Proof sketch. Our proof can be divided into three stages. Stage I: develop an upper bound on \(\mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\|+ \zeta}}\right]\); Stage II: develop an upper bound on \(\mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\); and Stage III: show that \(\mathbb{E}\left[{ \|\nabla f(\boldsymbol{x}_t)\|}\right]\) converges using the results from Stages I and II together with Hölder’s inequality.

Stage I: upper bound of \(\mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\|+ \zeta}}\right]\). As discussed in Section 1, for coordinate-wise \((L_0,L_1)\)-smooth functions, the descent inequality (Lemma 1 in [11]) gives (1 ). We first note that \(\frac{\eta \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\|+ \zeta}}\le \sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\). Therefore, in the following we bound \(\mathbb{E}\left[\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\right]\). Towards this goal, in Step 1.1, we show that the LHS of 1 is lower bounded by a function of \(\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\); and in Step 1.2, we show that the RHS of 1 is upper bounded by a function of the same quantity. Combining the two steps, we obtain an upper bound on \(\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\).

Step 1.1: lower bound on the first-order term in 1 . Since the adaptive stepsize and the gradient estimate are dependent, we design a surrogate \(\tilde{\boldsymbol{v}}_t=\beta_2 \boldsymbol{v}_{t-1}\) to decompose the first-order term in 1 into two parts: \[\begin{align} \label{eq:descentproof} &\underbrace{\mathbb{E}[\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{x}_{t}-\boldsymbol{x}_{t+1} \rangle|\mathcal{F}_t]}_{\text{First Order}}\nonumber\\ &=\underbrace{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{\eta \boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}} \right\rangle\Big |\mathcal{F}_t\right]}_{\text{First Order.a}}\nonumber\\ &+ \underbrace{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{\eta \boldsymbol{g}_t}{\sqrt{\boldsymbol{v}_t+\zeta}} -\frac{\eta \boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}}\right\rangle\Big |\mathcal{F}_t\right]}_{\text{First Order.b}}. \end{align}\tag{2}\] For the first-order.a term in 2 , we can show that \[\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{\eta \boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}} \right\rangle|\mathcal{F}_t\right]=\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}.\]
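This identity holds because the surrogate \(\tilde{\boldsymbol{v}}_t=\beta_2\boldsymbol{v}_{t-1}\) is \(\mathcal{F}_t\)-measurable and can therefore be pulled out of the conditional expectation, after which the unbiasedness \(\mathbb{E}[\boldsymbol{g}_{t,i}|\mathcal{F}_t]=\partial_i f(\boldsymbol{x}_t)\) applies coordinate-wise: \[\begin{align} \mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{\eta \boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}} \right\rangle\Big|\mathcal{F}_t\right]=\sum_{i=1}^d\frac{\eta\, \partial_i f(\boldsymbol{x}_t)\,\mathbb{E}[\boldsymbol{g}_{t,i}|\mathcal{F}_t]}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}=\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}. \end{align}\]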

We then bound the first-order.b term in 2 .

Remark 2 (Importance of the modified adaptive stepsize \(\boldsymbol{\eta}_t\)). For the original RMSProp, [15] chose the same surrogate \(\tilde{\boldsymbol{v}}_t=\beta_2 \boldsymbol{v}_{ t-1}\) as in this paper. The first-order.b term was then lower bounded by a function of \(\sum_{i=1}^d \mathbb{E}\left[\frac{-\boldsymbol{g}_{t,i}^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}}+\zeta}\Big |\mathcal{F}_t\right]\times\mathbb{E}\left[\frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}}}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}})^2}\Big |\mathcal{F}_t\right]\). They then developed an upper bound on the second factor, \(\mathbb{E}\left[\frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}}}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}})^2}\Big |\mathcal{F}_t\right]\le 1\), which is quite loose; thus, they introduced an additional assumption that \(|\boldsymbol{g}_{t,i}|\) is upper bounded [15].

In contrast, using our modified update \(\frac{\eta \boldsymbol{g}_t}{\sqrt{{\boldsymbol{v}_t} + \zeta}}\), we can show that the first-order.b term can be lower bounded by a function of \(\sum_{i=1}^d \mathbb{E}\left[\frac{-\boldsymbol{g}_{t,i}^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}+\zeta}}\Big |\mathcal{F}_t\right]\times\mathbb{E}\left[\frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}+\zeta}}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}+\zeta})^2}\Big |\mathcal{F}_t\right]\), which can be further bounded by \(\sum_{i=1}^d \mathbb{E}\big[{-\boldsymbol{g}_{t,i}^2}|\mathcal{F}_t\big]\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big |\mathcal{F}_t\right]\). Applying the affine noise variance condition in Assumption 2, we obtain a lower bound on the first-order.b term in the following lemma.

Lemma 1 (Informal). Under Assumptions 2 and 3, we have that: \[\begin{align} &\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{\eta \boldsymbol{g}_t}{\sqrt{\boldsymbol{v}_t+\zeta}} -\frac{\eta \boldsymbol{g}_t}{\sqrt{\beta_2 \boldsymbol{v}_{t-1}+\zeta}}\right\rangle\Big |\mathcal{F}_t\right] \nonumber\\ \ge &-\sum_{i=1}^d\Bigg[\mathcal{O}\Bigg( \frac{\eta(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\Bigg)\nonumber\\ &+\mathcal{O}\Bigg( \frac{\eta^2}{\sqrt{1-\beta_2} }\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\Bigg)\nonumber\\ &+\mathcal{O}\Bigg(\frac{\eta}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}-\mathbb{E}\Bigg[{\frac{\eta}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}}\Big |\mathcal{F}_t\Bigg]\Bigg)\nonumber\\ &+\mathcal{O}\left(\frac{\eta(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\mathbb{E}\left[\frac{\eta(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big |\mathcal{F}_t\right]\right)\Bigg]\nonumber\\ &-\text{small error}.\label{eq:lemma32} \end{align}\tag{3}\]

The formal version and detailed proof of Lemma 1 can be found in Appendix 6; the proof is based on our modified update process for \(\boldsymbol{v}_t\), Hölder’s inequality, and Assumption 3.

On the RHS of 3 , we have \[\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big |\mathcal{F}_t\right].\] Taking the sum from \(t=1\) to \(T\), the terms \(\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\) and \(\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{ \boldsymbol{v}_{t-1,i}} + \zeta}}\) cancel with each other as \(\beta_2 \to 1\). Similarly, the sum of \(\frac{\eta}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}-\mathbb{E}\left[{\frac{\eta}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}}\Big |\mathcal{F}_t\right]\) can also be bounded.
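To make the last claim explicit: since \(\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}\ge\sqrt{\beta_2}\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}\) and \(\frac{1}{\sqrt{\beta_2}}-1\le\frac{1-\beta_2}{\sqrt{\beta_2}}\), we have \[\begin{align} \sum_{t=1}^T\left(\frac{\eta}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}-\frac{\eta}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\right) &\le \frac{\eta}{\sqrt{\beta_2}}\sum_{t=1}^T\left(\frac{1}{\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}-\frac{1}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\right)+\left(\frac{1}{\sqrt{\beta_2}}-1\right)\sum_{t=1}^T\frac{\eta}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\nonumber\\ &\le \frac{\eta}{\sqrt{\beta_2\zeta}}+\frac{(1-\beta_2)\eta T}{\sqrt{\beta_2\zeta}}, \end{align}\] where the first sum telescopes. After dividing by \(\eta T\) (as in the proof of Lemma 2), this contributes only \(\mathcal{O}(1/T+(1-\beta_2))=\mathcal{O}(\epsilon^2)\) under the parameter choices of Theorem 1.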

Step 1.2: upper bound on second-order and additional terms in 1 . We first focus on the second-order term. Based on the update process of \(\boldsymbol{v}_t\) and Assumption 2, we get that \[\begin{align} \label{eq:second} &{\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}\nonumber\\ \le &\sum_{i=1}^d \frac{L_0\eta^2}{2\sqrt{\zeta}} \frac{D_0+D_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}. \end{align}\tag{4}\] We then focus on the additional term in 1 , for which we provide a new upper bound using \(\sum_{i=1}^d\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\) with some small errors: for any \(\alpha_2>0\), we have that \[\begin{align} \label{eq:addtional} &\frac{L_1|\partial_i f (\boldsymbol{x}_t)|}{2}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]\nonumber\\ & \le \frac{\eta^2\sqrt{d}}{2\sqrt{1-\beta_2}}\left(L_1\sqrt{D_1}+\frac{1}{\alpha_2\sqrt{1-\beta_2}}\right)\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\nonumber\\ &+ \frac{\eta^2\sqrt{d}\alpha_2L_1^2D_0}{2\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}. \end{align}\tag{5}\] Plugging Lemma 1, (4 ) and (5 ) into (1 ), we get the following lemma.

Lemma 2. Let Assumptions 1,2 and 3 hold. With the same parameters in Theorem 1, we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{{\beta_2 \|\boldsymbol{v}_{t-1}}\| + \zeta}} \right]\le \epsilon^2. \end{align}\]

The details of the proof can be found in Appendix 7. The major idea in the proof of Lemma 2 is to bound the first-order.b term, the second-order term, and the additional term using \(\sum_{i=1}^d\mathbb{E}\left[\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} \right]\).

Stage II: upper bound of \(\mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\). With the bounded gradient norm assumption, \(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[{ \|\nabla f(\boldsymbol{x}_t)\|^2}] \sim \mathcal{O}(\epsilon^2)\) would follow directly from Lemma 2. However, under the refined gradient noise assumption in this paper, the gradient norm may not be bounded. Here, we establish a key observation that \(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\) can be bounded using \(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\). By Hölder’s inequality, we have \[\begin{align} \label{eq:stage2} &\left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\right)^2 \le \left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\right)\nonumber\\ &\times\left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2\| \boldsymbol{v}_{t-1}\| + \zeta}} \right]\right) . \end{align}\tag{6}\] It is worth noting that on the RHS of (6 ), the second term is bounded in Lemma 2. If the first term can be upper bounded using \(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\), then we can derive an upper bound on \(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\) itself and show that the algorithm converges to a stationary point. In the following lemma, we show a novel bound on \(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\).

Lemma 3. Let Assumption 2 hold. Then, we have \[\begin{align} &\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{{\beta_2 \|\boldsymbol{v}_{t-1}}\| + \zeta}]\nonumber\\ &\le c +\frac{2\sqrt{dD_1}}{\sqrt{(1-\beta_2)}} \frac{\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}. \end{align}\]

The detailed proof can be found in Appendix 8, which recursively applies Jensen’s inequality [43]. The proof only depends on the affine noise variance and the update process of \(\boldsymbol{v}_t\). Thus, it works for both RMSProp and Adam with \((L_0,L_1)\)-smooth objectives.

Stage III: upper bound of \(\mathbb{E}\left[{ \|\nabla f(\boldsymbol{x}_t)\|}\right]\). Now we show that the algorithm converges to a stationary point. Define \(e=\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\). By (6 ), Lemma 2 and Lemma 3, we have that \[\begin{align} e^2\le \epsilon^2\left(c+\frac{2\sqrt{dD_1}}{\sqrt{1-\beta_2}}e\right). \end{align}\] Thus we have \[e\le\left(\frac{2d\sqrt{35D_0D_1}}{\sqrt[4]{\zeta}}+\sqrt{c}\right)\epsilon\] if \(\epsilon\le \frac{\sqrt{5dD_0}}{\sqrt{D_1}\sqrt[4]{\zeta}}\), which indicates that the algorithm converges to a stationary point. ◻
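To spell out the final step of the sketch: writing \(B=\frac{2\sqrt{dD_1}}{\sqrt{1-\beta_2}}\), the inequality \(e^2\le \epsilon^2(c+Be)\) is quadratic in \(e\), and the quadratic formula together with \(\sqrt{a+b}\le\sqrt{a}+\sqrt{b}\) gives \[\begin{align} e\le \frac{B\epsilon^2+\sqrt{B^2\epsilon^4+4c\epsilon^2}}{2}\le B\epsilon^2+\sqrt{c}\,\epsilon. \end{align}\] Since \(1-\beta_2\sim\mathcal{O}(\epsilon^2)\), we have \(B\epsilon^2\sim\mathcal{O}(\epsilon)\); under the specific constants chosen in the formal proof in Appendix 9, \(B\epsilon^2\le \frac{2d\sqrt{35D_0D_1}}{\sqrt[4]{\zeta}}\epsilon\), which yields the stated bound.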

4 Convergence Analysis of Adam↩︎

In this section, we extend our convergence analysis of RMSProp to Adam. Such a result is attractive since empirical results on complicated tasks show Adam may perform better, e.g., the mean reward is improved by \(88\%\) via RMSProp and \(110\%\) via Adam for Atari games [44].

We present the convergence result for Adam as follows.

Theorem 3 (Informal). Let Assumptions 1, 2 and 3 hold. Let \(1-\beta_2\sim \mathcal{O}(\epsilon^{2})\), \(0<\beta_1\le \sqrt{\beta_2}<1\), \(\eta \sim \mathcal{O}(\epsilon^{2})\), and \(T\sim \mathcal{O}(\epsilon^{-4})\). For small \(\epsilon\) such that \(\epsilon\le \frac{\sqrt{2C_2}}{\sqrt{7\alpha_0C_6D_1}}\), we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\le \left(2c+\sqrt{2c}+\frac{4\sqrt{dD_1}}{\sqrt{C_6}}\right)\epsilon, \end{align}\] where \(\alpha_0\), \(C_2\) and \(C_6\) are positive constants defined in Appendix 13.

To the best of our knowledge, Theorem 3 is the first result showing convergence of Adam to a stationary point under the most relaxed assumptions of \((L_0,L_1)\)-smoothness and refined gradient noise. In [19], [39], the authors showed that Adam converges to a stationary point with affine noise. However, their methods only work for \(L\)-smooth objectives, while in this paper we focus on \((L_0,L_1)\)-smooth functions. Moreover, we show that for Adam, our total computational complexity matches the optimal result [1], while there is an additional logarithmic term in [39]. The normalized momentum algorithm [26] can also be viewed as a special case of the Adam family, which applies momentum to the first-order gradient. However, in their algorithm, a mini-batch of samples is required in each training iteration, while we do not require such a mini-batch. Thus, in the distributed setting with heterogeneous data, where the data distributions at different computational nodes differ, Algorithm 1 can be used directly. However, the normalized momentum in [26] may require gradient information from many computational nodes, effectively reducing the problem to the centralized setting.

The formal version of the theorem and detailed proof can be found in Appendix 13. Below, we provide the proof sketch to underscore our key technical novelties.

Proof Sketch. Similar to the proof for RMSProp, we divide our proof into three stages. The key difference lies in Stage I, due to the dependence between \(\boldsymbol{m}_t\) and \(\tilde{\boldsymbol{v}}_t\).

Stage I: upper bound of \(\mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{{\beta_2 \|\boldsymbol{v}_{t-1}} \|+ \zeta}}\right]\). For the Adam optimizer, the first-order.a term \(\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{-\eta \boldsymbol{m}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}} \right\rangle\Big |\mathcal{F}_t\right]\) is challenging to bound due to the dependence between \(\boldsymbol{m}_t\) and \(\tilde{\boldsymbol{v}}_t\). Following the recent analyses of SGDM [18] and Adam [19], we study a potential function \(f(\boldsymbol{u}_t)\) with \(\boldsymbol{u}_t=\frac{\boldsymbol{x}_t-\frac{\beta_1}{\sqrt{\beta_2}}\boldsymbol{x}_{t-1}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\). The benefit of analyzing \(\boldsymbol{u}_t\) is that we have \(\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{u}_t-\boldsymbol{u}_{t+1} \right\rangle|\mathcal{F}_t\right]\approx \frac{(1-\beta_1)}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t),\frac{\eta \boldsymbol{g}_t}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}}\right\rangle\Big |\mathcal{F}_t\right]\), where the numerator and denominator in the last term are independent.

In Step 1.1, we will show the LHS of the descent lemma for \(f(\boldsymbol{u}_t)\) (shown in 43 ) is lower bounded by a function of \(\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\); and in Step 1.2 we will show the RHS of the descent lemma of \(f(\boldsymbol{u}_t)\) is upper bounded by a function of \(\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\). Combining the two steps, we will obtain an upper bound on \(\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\).

Step 1.1: lower bound of the first-order term \({\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}\). We first divide the first-order term into two parts: \[\begin{align} &{\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}\nonumber\\ &={\mathbb{E}[\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}\nonumber\\ &+{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{u}_t)-\nabla f(\boldsymbol{x}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \right\rangle|\mathcal{F}_t\right]}.\nonumber \end{align}\] The first part mimics the first-order term in the proof of RMSProp, and the second is due to the mismatch between \(\boldsymbol{u}_t\) and \(\boldsymbol{x}_t\). By bounding these two parts separately, we obtain a lemma similar to Lemma 1 but with two additional terms, \(\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{{\boldsymbol{v}_{t,i}+\zeta}}\Big |\mathcal{F}_t\right]\) and \(\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\Big |\mathcal{F}_t\right]\), due to the employment of the first-order momentum (see details in Lemma 9). For the additional terms, [19] showed that \(\sum_{t=1}^T\frac{\boldsymbol{m}_{t,i}^2}{{\boldsymbol{v}_{t,i}}}\) can be bounded by a function of \(\boldsymbol{v}_{T,i}\) (shown in Lemma 7). It is worth noting that, due to the \((L_0,L_1)\)-smoothness, our objective is harder to bound, and Lemma 7 alone is not enough. Here, we bound the \(\sum_{t=1}^T\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}}}\) term by a function of \(\frac{\sqrt{\boldsymbol{v}_{T,i}}-\sqrt{\boldsymbol{v}_{0,i}}}{1-\beta_2}+\sum_{t=1}^T \sqrt{\boldsymbol{v}_{t-1,i}}\) (the details can be found in Lemma 8).

Step 1.2: upper bound on second-order and additional terms. Based on the update processes of \(\boldsymbol{u}_t\) and \(\boldsymbol{x}_t\), we bound the second-order and additional terms similarly to those in 4 and 5 , but with \(\boldsymbol{g}_t\) replaced by \(\boldsymbol{m}_t\) (see details in 44 and 45 ). Unlike in the proof of RMSProp, where \(\mathbb{E}\left[\frac{\boldsymbol{g}_{t,i}^2}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}\Big|\mathcal{F}_t\right]\) can be bounded by \(\frac{D_0}{\sqrt{\zeta}}+\frac{D_1 (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}\), Assumption 2 does not hold for \(\boldsymbol{m}_t\); this is the reason why the two terms \(\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{{\boldsymbol{v}_{t,i}+\zeta}}\Big |\mathcal{F}_t\right]\) and \(\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\Big |\mathcal{F}_t\right]\) are kept in the upper bounds of the second-order and additional terms. Then, based on the descent lemma of \(f(\boldsymbol{u}_t)\), we can show that \(\mathbb{E}\left[\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\right]\) can be upper bounded by a function of \(\sum_{t=1}^T\frac{\boldsymbol{m}_{t,i}^2}{{\boldsymbol{v}_{t,i}}}\) and \(\sum_{t=1}^T\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}}}\), which can further be bounded by a function of \(\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\). We then get the following lemma:

Lemma 4. Let Assumptions 1, 2 and 3 hold. With the same parameters as in Theorem 3, we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{{\beta_2 \|\boldsymbol{v}_{t-1}} \|+ \zeta}} ]\le \epsilon^2+\frac{\epsilon}{T}\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|].\nonumber \end{align}\]

The details of the proof can be found in the Appendix 12.

Stage II is the same as in the proof of RMSProp and is thus omitted here.

Stage III: upper bound of \(\mathbb{E}\left[{ \|\nabla f(\boldsymbol{x}_t)\|}\right]\). As mentioned before, Lemma 3 and (6 ) still hold for Adam. It is worth noting that on the RHS of (6 ), the second term is now bounded in Lemma 4, which is more complicated than Lemma 2 and has an additional term. Let \(e=\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\). By (6 ), Lemma 4 and Lemma 3, we have that \[\begin{align} e^2\le c\epsilon^2+\frac{2\sqrt{dD_1}}{\sqrt{C_6}\epsilon}e\epsilon^2+ce\epsilon +\frac{e^2}{2}. \end{align}\] Thus, \(e\sim \mathcal{O}(\epsilon)\), which shows that the algorithm converges to a stationary point. ◻
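Solving this quadratic inequality makes the constant explicit: with \(A=c+\frac{2\sqrt{dD_1}}{\sqrt{C_6}}\), the inequality rearranges to \(e^2-2A\epsilon e-2c\epsilon^2\le 0\), so \[\begin{align} e\le A\epsilon+\sqrt{A^2\epsilon^2+2c\epsilon^2}\le \left(2A+\sqrt{2c}\right)\epsilon=\left(2c+\sqrt{2c}+\frac{4\sqrt{dD_1}}{\sqrt{C_6}}\right)\epsilon, \end{align}\] matching the bound in Theorem 3.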

5 Conclusion↩︎

In this paper, we provide the tightest convergence analyses for RMSProp and Adam for non-convex objectives under the most relaxed assumptions of generalized smoothness and refined gradient noise. The complexity to achieve an \(\epsilon\)-stationary point for both algorithms is shown to be \(\mathcal{O}(\epsilon^{-4})\), which matches the lower bound for first-order algorithms established in [1].

6 Formal Version of Lemma 1 and Its Proof↩︎

Lemma 5. Under Assumptions 2 and 3, for any \(\alpha_0,\alpha_1>0, \boldsymbol{x}_0=\boldsymbol{x}_1\) and \(t>0\), we have that: \[\begin{align} \label{eq:lemma33} &\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \left[\frac{\eta }{\sqrt{\boldsymbol{v}_t+\zeta}} -\frac{\eta }{\sqrt{\beta_2 \boldsymbol{v}_{t-1}+\zeta}}\right]\odot \boldsymbol{g}_t\right\rangle\Big |\mathcal{F}_t\right] \nonumber\\ \ge &-\Bigg[ \sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} +\sum_{i=1}^d\frac{\eta \alpha_0D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big |\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d\frac{\eta \alpha_0D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big |\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d \frac{\eta \alpha_0D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+\frac{\alpha_1d D_1L_0^2\eta^2}{1-\beta_2}+\frac{2\sqrt{d}\eta L_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{1-\beta_2}}\right)\Bigg] \end{align}\tag{7}\]

Proof. Since \(\boldsymbol{v}_t=\beta_2 \boldsymbol{v}_{t-1}+(1-\beta_2)\boldsymbol{g}_t \odot \boldsymbol{g}_t\), for any \(i\in [d]\), we have that \[\begin{align} \label{eq:lemma1eq1} &-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}+\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ =& \frac{\boldsymbol{v}_{t,i}-\beta_2 \boldsymbol{v}_{t-1,i}}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta})(\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})(\sqrt{{ \boldsymbol{v}_{t,i}}+\zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}+\zeta})}\nonumber\\ =& \frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta})(\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})(\sqrt{{ \boldsymbol{v}_{t,i}}+\zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}+\zeta})}. \end{align}\tag{8}\] Thus, the LHS of 7 can be bounded as \[\begin{align} \label{proof:eq1} &\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \left[\frac{-\eta }{\sqrt{\boldsymbol{v}_t+\zeta }} +\frac{\eta}{\sqrt{\beta_2 \boldsymbol{v}_{t-1}+\zeta }}\right]\odot \boldsymbol{g}_t\right\rangle|\mathcal{F}_t\right] \nonumber\\ \le &\sum_{i=1}^d \mathbb{E}\left[\eta |\partial_i f(\boldsymbol{x}_t)||\boldsymbol{g}_{t,i}|\left| \frac{-1}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}} +\frac{1}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\right||\mathcal{F}_t\right]\nonumber\\ \le &\sum_{i=1}^d \mathbb{E}\left[\eta |\partial_i f(\boldsymbol{x}_t)||\boldsymbol{g}_{t,i}|\left| \frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2 }{\sqrt{{\boldsymbol{v}_{t,i}} + \zeta}\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})}\right||\mathcal{F}_t\right]\nonumber\\ \le & \sum_{i=1}^d\eta \frac{|\partial_i f(\boldsymbol{x}_t)|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}\left[\left| \frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\right||\mathcal{F}_t\right], \end{align}\tag{9}\] where the second inequality is due to 8 and the last inequality is because \(\frac{|\boldsymbol{g}_{t,i}|}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\le \frac{1}{\sqrt{1-\beta_2}}\). For any \(a,b \in \mathbb{R}\), we have \(ab\le \frac{a^2+b^2}{2}\). Hence, for any \(\alpha_0>0\), the RHS of 9 can be further bounded as \[\begin{align} \label{proof:eq2} & \eta \frac{|\partial_i f(\boldsymbol{x}_{t})|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}\left[\left| \frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\right||\mathcal{F}_t\right]\nonumber\\ \le &\frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}+\frac{\eta\alpha_0 }{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\mathbb{E}\left[\frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\Big|\mathcal{F}_t\right]\right)^2. \end{align}\tag{10}\] For the last term of 10 , due to Hölder’s inequality, we have that \[\begin{align} \label{proof:eq3} &\frac{\eta }{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\mathbb{E}\left[\frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\Big|\mathcal{F}_t\right]\right)^2\nonumber\\ \le& \frac{(1-\beta_2)\eta}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}[\boldsymbol{g}_{t,i}^2|\mathcal{F}_t]\mathbb{E}\left[\frac{\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})^2}\Big|\mathcal{F}_t\right]. \end{align}\tag{11}\] Based on Assumption 2, the RHS of 11 can be further bounded as \[\begin{align} \label{proof:eq119} & \frac{(1-\beta_2)\eta}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}[\boldsymbol{g}_{t,i}^2|\mathcal{F}_t]\mathbb{E}\left[\frac{\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})^2}\Big|\mathcal{F}_t\right]\nonumber\\ \le & \frac{\eta}{2}(D_0+D_1 (\partial_i f(\boldsymbol{x}_t))^2)\times \mathbb{E}\left[\frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})}\Big|\mathcal{F}_t\right]\nonumber\\ \le & \frac{\eta}{2}(D_0+D_1 (\partial_i f(\boldsymbol{x}_t))^2)\times \mathbb{E}\left[\frac{\boldsymbol{v}_{t,i}+\zeta-(\beta_2\boldsymbol{v}_{t-1,i}+\zeta)}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})}\Big|\mathcal{F}_t\right]\nonumber\\ \le &\frac{\eta}{2}(D_0+D_1 (\partial_i f(\boldsymbol{x}_t))^2)\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big|\mathcal{F}_t\right]. \end{align}\tag{12}\] Thus, for \(t=1\), combining 9 , 10 , 11 and 12 , we have that \[\begin{align} \label{proof:t1} &\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \left[\frac{-\eta }{\sqrt{\boldsymbol{v}_t+\zeta }} +\frac{\eta}{\sqrt{\beta_2 \boldsymbol{v}_{t-1}+\zeta }}\right]\odot \boldsymbol{g}_t\right\rangle|\mathcal{F}_t\right] \nonumber\\ \le & \sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}+\sum_{i=1}^d\frac{\eta\alpha_0}{2}(D_0+D_1 (\partial_i f(\boldsymbol{x}_t))^2)\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big|\mathcal{F}_t\right]. \end{align}\tag{13}\] For \(t>1\), 12 can be further bounded as \[\begin{align} \label{proof:eq4} & \frac{(1-\beta_2)\eta}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}[\boldsymbol{g}_{t,i}^2|\mathcal{F}_t]\mathbb{E}\left[\frac{\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})^2}\Big|\mathcal{F}_t\right]\nonumber\\ \le &\frac{\eta D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big|\mathcal{F}_t\right]+\frac{\eta D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big|\mathcal{F}_t\right]\nonumber\\ &+ \frac{\eta D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_t))^2-(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\Big|\mathcal{F}_t\right]. \end{align}\tag{14}\] For the last term in (14 ), with \((L_0,L_1)\)-smooth objectives and any \(\alpha_1>0\), we have that \[\begin{align} \label{proof:eq5} &\frac{\eta D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_t))^2-(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\Big|\mathcal{F}_t\right]\nonumber\\ \le& \frac{\eta D_1}{2}\mathbb{E}\left[\frac{2|\partial_i f(\boldsymbol{x}_t)|\big||\partial_i f(\boldsymbol{x}_t)|-|\partial_i f(\boldsymbol{x}_{t-1})|\big|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\Big|\mathcal{F}_t\right] \nonumber\\ \le &\eta D_1\frac{|\partial_i f(\boldsymbol{x}_t)|(L_0+L_1|\partial_i f(\boldsymbol{x}_t)|)\eta\left\|\frac{1}{\sqrt{\boldsymbol{v}_{t-1}+\zeta}}\odot \boldsymbol{g}_{t-1}\right\|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le & \frac{\eta D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+\alpha_1 D_1L_0^2\eta^2\sum_{j=1}^d\frac{\boldsymbol{g}_{t-1,j}^2}{\boldsymbol{v}_{t-1,j}+\zeta}\right)+ \frac{\eta^2L_1 D_1\sqrt{d}(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{1-\beta_2}\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le & \frac{\eta D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+\frac{\alpha_1 dD_1L_0^2\eta^2}{1-\beta_2}+\frac{2\sqrt{d}\eta L_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{1-\beta_2}}\right), \end{align}\tag{15}\] where the first inequality is due to the fact that for any \(a,b \ge 0\), we have \(a^2-b^2\le 2a|a-b|\), the second inequality is by the \((L_0,L_1)\)-smoothness, and the third one is because \(\frac{|\boldsymbol{g}_{t,i}|}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\le \frac{1}{\sqrt{1-\beta_2}}\). Based on (9 ), (10 ), (11 ), (14 ) and (15 ), for \(t>1\) we can get that \[\begin{align} \label{eq:lemma33result} &\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \left[\frac{-\eta }{\sqrt{\boldsymbol{v}_t+\zeta}} +\frac{\eta }{\sqrt{\beta_2 \boldsymbol{v}_{t-1}+\zeta}}\right]\odot \boldsymbol{g}_t\right\rangle|\mathcal{F}_t\right] \nonumber\\ \le & \sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} +\sum_{i=1}^d\frac{\eta \alpha_0D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big|\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d\frac{\eta \alpha_0D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\Big|\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d \frac{\eta \alpha_0D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+\frac{\alpha_1d D_1L_0^2\eta^2}{1-\beta_2}+\frac{2\sqrt{d}\eta L_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{1-\beta_2}}\right). \end{align}\tag{16}\] Since we set \(\boldsymbol{x}_0=\boldsymbol{x}_1\), 16 also holds for \(t=1\). This completes the proof. ◻

7 Proof of Lemma 2↩︎

For \((L_0,L_1)\)-smooth objective functions, we have the following descent inequality (Lemma 1 in [11]): \[\begin{align} \label{eq:l0l1descent} &\underbrace{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{\eta}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}} \odot \boldsymbol{g}_t \right\rangle|\mathcal{F}_t\right]}_{\text{First Order.a}}\nonumber\\ \le & f(\boldsymbol{x}_t)-\mathbb{E}[f(\boldsymbol{x}_{t+1})|\mathcal{F}_t]+\underbrace{\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}_{\text{Second Order}}\nonumber\\ &+\underbrace{\sum_{i=1}^d\frac{L_1|\partial_i f(\boldsymbol{x}_t)|}{2}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}_{\text{Additional Term}}\nonumber\\ &+ \underbrace{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \left[\frac{-\eta}{\sqrt{\boldsymbol{v}_t+\zeta}} +\frac{\eta }{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}}\right]\odot \boldsymbol{g}_t\right\rangle|\mathcal{F}_t\right]}_{\text{First Order.b}}. \end{align}\tag{17}\] For the first-order main term: given \(\mathcal{F}_t\), \(\boldsymbol{g}_t\) is independent of \(\tilde{\boldsymbol{v}}_t\), and it then follows that \(\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{\eta}{\sqrt{\tilde{\boldsymbol{v}}_t+\zeta}} \odot \boldsymbol{g}_t\right\rangle|\mathcal{F}_t\right]=\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\). Based on Lemma 1, we can bound the first-order error term. For any \(\alpha_0,\alpha_1>0\) and \(t>1\), we then have: \[\begin{align} \label{proof:2eq1} &\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le & f(\boldsymbol{x}_t)-\mathbb{E}[f(\boldsymbol{x}_{t+1})|\mathcal{F}_t]+\underbrace{\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}_{\text{Second Order}}\nonumber\\ &+\underbrace{\sum_{i=1}^d\frac{L_1|\partial_i f(\boldsymbol{x}_t)|}{2}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}_{\text{Additional Term}}\nonumber\\ &+\sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} +\sum_{i=1}^d\frac{\eta \alpha_0D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d\frac{\eta \alpha_0D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d \frac{\eta \alpha_0D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+\frac{\alpha_1d D_1L_0^2\eta^2}{1-\beta_2}+\frac{2\sqrt{d}\eta L_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{1-\beta_2}}\right).
\end{align}\tag{18}\] Now we focus on the second-order term, which can be bounded as follows \[\begin{align} \label{proof:2eq2} &{\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]}\nonumber\\ \le& \sum_{i=1}^d \frac{L_0}{2\sqrt{d}}\mathbb{E}\left[\frac{\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\|^2}{2\sqrt{d}}+\frac{\sqrt{d}}{2}|\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|^2|\mathcal{F}_t\right]\nonumber\\ \le& \sum_{i=1}^d \frac{L_0}{2}\mathbb{E}\left[\eta^2 \frac{\boldsymbol{g}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}|\mathcal{F}_t\right]\nonumber\\ \le& \sum_{i=1}^d \frac{L_0}{2} \mathbb{E}\left[\eta^2 \frac{\boldsymbol{g}_{t,i}^2}{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}|\mathcal{F}_t\right]\nonumber\\ \le &\sum_{i=1}^d \frac{L_0\eta^2}{2} \frac{D_0+D_1(\partial_i f(\boldsymbol{x}_t))^2}{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}\nonumber\\ \le &\sum_{i=1}^d \frac{L_0\eta^2}{2\sqrt{\zeta}} \frac{D_0+D_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}, \end{align}\tag{19}\] where the fourth inequality is due to Assumption 2. For the additional term due to \((L_0,L_1)\)-smooth objectives, based on Assumption 2, for any \(\alpha_2>0\) we have that \[\begin{align} \label{proof:2eq3} &\frac{L_1|\partial_i f (\boldsymbol{x}_t)|}{2}\mathbb{E}[\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}||\mathcal{F}_t]\nonumber\\ \le& \frac{\sqrt{d}L_1|\partial_i f(\boldsymbol{x}_t)|}{2\sqrt{1-\beta_2}}\mathbb{E}\left[\eta^2 \frac{|\boldsymbol{g}_{t,i}|}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}|\mathcal{F}_t\right]\nonumber\\ \le & \frac{\eta^2\sqrt{d}L_1|\partial_i f(\boldsymbol{x}_t)|}{2\sqrt{1-\beta_2}\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\sqrt{\mathbb{E}\left[ {\boldsymbol{g}_{t,i}^2}|\mathcal{F}_t\right]}\nonumber\\ \le &\frac{\eta^2\sqrt{d}L_1|\partial_i f(\boldsymbol{x}_t)|(\sqrt{D_0}+\sqrt{D_1}|\partial_i f(\boldsymbol{x}_t)|)}{2\sqrt{1-\beta_2}\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\nonumber\\ \le & \frac{\eta^2\sqrt{d}L_1\sqrt{D_1}(\partial_i f(\boldsymbol{x}_t))^2}{2\sqrt{1-\beta_2}\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+ \frac{\eta^2\sqrt{d}(\partial_i f(\boldsymbol{x}_t))^2}{2({1-\beta_2})\alpha_2\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+ \frac{\eta^2\sqrt{d}\alpha_2L_1^2D_0}{2\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}, \end{align}\tag{20}\] where the first inequality is due to \(\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_t\|\le \frac{\sqrt{d}\eta}{\sqrt{1-\beta_2}}\), the second inequality is because \(\mathbb{E}\left[ |\boldsymbol{g}_{t,i}||\mathcal{F}_t\right]\le \sqrt{\mathbb{E}\left[ {\boldsymbol{g}_{t,i}^2}|\mathcal{F}_t\right]}\), and the third inequality is due to Assumption 2 and the fact that for any \(a,b\ge 0\), we have \(\sqrt{a+b}\le \sqrt{a}+\sqrt{b}\).
Plugging (19) and (20) into (18), it follows that \[\begin{align} &\sum_{i=1}^d\frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le & f(\boldsymbol{x}_t)-\mathbb{E}[f(\boldsymbol{x}_{t+1})|\mathcal{F}_t]+\sum_{i=1}^d \frac{L_0\eta^2}{2\sqrt{\zeta}} \frac{D_0+D_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\nonumber\\ &+\sum_{i=1}^d\left(\frac{\eta^2\sqrt{d}L_1\sqrt{D_1}(\partial_i f(\boldsymbol{x}_t))^2}{2\sqrt{1-\beta_2}\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+ \frac{\eta^2\sqrt{d}(\partial_i f(\boldsymbol{x}_t))^2}{2({1-\beta_2})\alpha_2\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+ \frac{\eta^2\sqrt{d}\alpha_2L_1^2D_0}{2\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\right)\nonumber\\ &+\sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} +\sum_{i=1}^d\frac{\eta \alpha_0D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d\frac{\eta \alpha_0D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d \frac{\eta \alpha_0D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+\frac{\alpha_1d D_1L_0^2\eta^2}{1-\beta_2}+\frac{2\sqrt{d}\eta L_1(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{1-\beta_2}}\right), \end{align}\] which can be written as \[\begin{align} \label{proof:4eq3} &\sum_{i=1}^d\left(\eta-\frac{\eta}{2\alpha_0}-\frac{\eta\alpha_0}{2\alpha_1}-\frac{L_0\eta^2D_1}{2\sqrt{\zeta}}-\frac{\eta^2L_1\sqrt{dD_1}}{2\sqrt{1-\beta_2}}-\frac{\eta^2\sqrt{d}}{2(1-\beta_2)\alpha_2}-\frac{\eta^2\sqrt{d}\alpha_0L_1D_1}{\sqrt{1-\beta_2}}\right) \frac{ (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} \nonumber\\ \le & f(\boldsymbol{x}_t)-\mathbb{E}[f(\boldsymbol{x}_{t+1})|\mathcal{F}_t]+\frac{dL_0\eta^2D_0}{2{\zeta}}+\frac{\alpha_0\alpha_1d^2 L_0^2\eta^3 D_1^2}{2(1-\beta_2)} \frac{1}{\sqrt{ \zeta}}+ \frac{\eta^2\alpha_2d^{1.5}L_1^2D_0}{2\sqrt{\zeta}} \nonumber\\ &+\sum_{i=1}^d\frac{\eta\alpha_0 D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\sum_{i=1}^d\frac{\eta\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]. \end{align}\tag{21}\] It is worth noting that the sum of the last two terms from \(t=1\) to \(T\) can be further bounded.
Specifically, for any \(i\in[d]\) we have that \[\begin{align} \label{proof:coeq1} &\sum_{t=1}^T \mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\right]\nonumber\\ =& \mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{0,i}} + \zeta}}\right]+\sum_{t=1}^{T-1} \mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\right]-\mathbb{E}\left[\frac{1}{\sqrt{{ \boldsymbol{v}_{T,i}} + \zeta}}\right]\nonumber\\ \le & \frac{1}{\sqrt{ \zeta}}+\sum_{t=1}^{T-1} \mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t,i}} + \zeta}}-\frac{\sqrt{\beta_2}}{\sqrt{\beta_2 { \boldsymbol{v}_{t,i}} + \zeta}}\right]\nonumber\\ = & \frac{1}{\sqrt{ \zeta}}+\sum_{t=1}^{T-1} \mathbb{E}\left[\frac{1-\sqrt{\beta_2}}{\sqrt{\beta_2{ \boldsymbol{v}_{t,i}} + \zeta}}\right]\nonumber\\ \le & \frac{1}{\sqrt{ \zeta}}+T \frac{1-\sqrt{\beta_2}}{\sqrt{\zeta}}. \end{align}\tag{22}\] Similarly, for the last term, we have that \[\begin{align} \label{proof:coeq2} &\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{0}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{0,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{{ \boldsymbol{v}_{1,i}} + \zeta}}\right]+\sum_{t=2}^T \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\right]\nonumber\\ =& \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{0,i}} + \zeta}}\right]+\sum_{t=1}^{T-1} \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\right]-\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{T}))^2}{\sqrt{{\boldsymbol{v}_{T,i}} + \zeta}}\right]\nonumber\\ \le & \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{ \zeta}}\right]+\sum_{t=1}^{T-1} \mathbb{E}\left[\left(\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t,i}} + \zeta}}-\frac{\sqrt{\beta_2}}{\sqrt{\beta_2 { \boldsymbol{v}_{t,i}} + \zeta}}\right)(\partial_i f(\boldsymbol{x}_{t}))^2\right]\nonumber\\ = & \frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{ \zeta}}+\sum_{t=1}^{T-1}(1-\sqrt{\beta_2}) \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{\beta_2{ \boldsymbol{v}_{t,i}} + \zeta}}\right]\nonumber\\ \le & \frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{ \zeta}}+\sum_{t=1}^{T-1}({1-\beta_2}) \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{\beta_2{ \boldsymbol{v}_{t-1,i}} + \zeta}}\right], \end{align}\tag{23}\] where we used \(\boldsymbol{x}_0=\boldsymbol{x}_1\), \(\boldsymbol{g}_0=0\) and \(\frac{1}{\beta_2}\le 1+\sqrt{\beta_2}\).
By taking expectations and sums of (21) for \(t=1\) to \(T\), based on (22) and (23), we have that \[\begin{align} &\sum_{t=1}^T\sum_{i=1}^d\left(\eta-\frac{\eta}{2\alpha_0}-\frac{\eta\alpha_0}{2\alpha_1}-\frac{L_0\eta^2D_1}{2\sqrt{\zeta}}-\frac{\eta^2L_1\sqrt{dD_1}}{2\sqrt{1-\beta_2}}-\frac{\eta^2\sqrt{d}}{2(1-\beta_2)\alpha_2}-\frac{\eta^2\sqrt{d}\alpha_0L_1D_1}{\sqrt{1-\beta_2}}\right) \frac{ (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} \nonumber\\ \le & f(\boldsymbol{x}_1)-\mathbb{E}[f(\boldsymbol{x}_{T+1})]+\frac{\eta \alpha_0 D_1\|\nabla f(\boldsymbol{x}_1)\|^2}{2\sqrt{\zeta}}+\frac{\eta \alpha_0 dD_0 }{2\sqrt{\zeta}}+T\frac{dL_0\eta^2D_0}{2{\zeta}}+T\frac{\alpha_0\alpha_1d^2 L_0^2\eta^3 D_1^2}{2(1-\beta_2)} \frac{1}{\sqrt{ \zeta}}+ T\frac{\eta^2\alpha_2d^{1.5}L_1^2D_0}{2\sqrt{\zeta}} \nonumber\\ &+T\frac{\eta \alpha_0dD_0(1-\sqrt{\beta_2})}{2\sqrt{ \zeta}}+\frac{\eta \alpha_0D_1(1-{\beta_2})}{2}\sum_{i=1}^d\sum_{t=1}^T \mathbb{E}\left[\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} \right]. \end{align}\] Define \(\Delta=f(\boldsymbol{x}_1)-f(\boldsymbol{x}^*)+\frac{\eta\alpha_0 dD_0 }{2\sqrt{\zeta}}+\frac{\eta \alpha_0D_1\|\nabla f(\boldsymbol{x}_1)\|^2}{2\sqrt{\zeta}}\). If we set \(\alpha_0=1\), \(\alpha_1=7\), \(\alpha_2=1\), we can choose \(1-\beta_2=\min(\frac{1}{7D_1},\frac{\sqrt{\zeta}\epsilon^2}{35dD_0})\sim \mathcal{O}(\epsilon^{2})\) and \(\eta=\min(\frac{\sqrt{\zeta}}{7L_0D_1},\frac{\sqrt{1-\beta_2}}{\max{(14L_1\sqrt{d}D_1,7L_1\sqrt{dD_1})}},\frac{1-\beta_2}{7\sqrt{d}},\frac{\zeta \epsilon^2}{35L_0dD_0},\frac{\epsilon \sqrt{1-\beta_2}\sqrt[4]{\zeta}}{7D_1L_0d\sqrt{5}},\frac{\sqrt{\zeta} \epsilon^2}{35L_1^2d^{1.5}D_0})\sim \mathcal{O}(\epsilon^{2})\) such that \[\begin{align} \frac{\eta}{14}\sum_{i=1}^d\sum_{t=1}^T \mathbb{E}\left[\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} \right]\le \Delta+T\frac{\eta}{70}\epsilon^2+T\frac{\eta}{70}\epsilon^2+T\frac{\eta}{70}\epsilon^2+T\frac{\eta}{70}\epsilon^2. \end{align}\] Set \(T=\frac{70\Delta}{\eta\epsilon^2} \sim \mathcal{O}(\epsilon^{-4})\), and we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}} \right]\le\sum_{i=1}^d\frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} \right]\le \epsilon^2. \end{align}\] This completes the proof.
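The telescoping bound (22) is in fact a pathwise statement before expectations are taken. As a hedged numerical illustration (ours, not part of the paper; the simulated gradients and constants are illustrative), the following sketch runs the second-moment recursion \(\boldsymbol{v}_t=\beta_2\boldsymbol{v}_{t-1}+(1-\beta_2)\boldsymbol{g}_t^2\) for a single coordinate and confirms the bound.

```python
# Pathwise check of (22): for v_t = beta2*v_{t-1} + (1-beta2)*g_t^2 with v_0 = 0,
#   sum_{t=1}^T [1/sqrt(beta2 v_{t-1}+zeta) - 1/sqrt(v_t+zeta)]
#   <= 1/sqrt(zeta) + T*(1 - sqrt(beta2))/sqrt(zeta).
import math
import random

random.seed(1)
beta2, zeta, T = 0.99, 1e-4, 10_000
v, total = 0.0, 0.0
for _ in range(T):
    g = random.gauss(0.0, 2.0)            # a stand-in stochastic gradient
    v_new = beta2 * v + (1.0 - beta2) * g * g
    total += 1.0 / math.sqrt(beta2 * v + zeta) - 1.0 / math.sqrt(v_new + zeta)
    v = v_new
bound = (1.0 + T * (1.0 - math.sqrt(beta2))) / math.sqrt(zeta)
print(f"telescoped sum = {total:.4f} <= bound in (22) = {bound:.4f}")
assert total <= bound
```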

8 Proof of Lemma 3↩︎

For any \(a,b>0\), we have that \(\sqrt{a+b}\le \sqrt{a}+\sqrt{b}\). It then follows that \[\begin{align} \mathbb{E}\Big[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}\Big]\le \sum_{i=1}^d\mathbb{E}\Big[\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}}\Big]+\sqrt{\zeta}. \end{align}\] For the first term, for \(t>1\) we have that \[\begin{align} \mathbb{E}[\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}}]=\mathbb{E}_{\mathcal{F}_{t-1}}\left[ \mathbb{E} \left[\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}}|\mathcal{F}_{t-1}\right]\right]=\mathbb{E}_{\mathcal{F}_{t-1}}\left[ \mathbb{E} \left[\sqrt{{\beta_2^2 \boldsymbol{v}_{t-2,i}}+\beta_2(1-\beta_2)(\boldsymbol{g}_{t-1,i})^2}|\mathcal{F}_{t-1}\right]\right]. \end{align}\] Given \(\mathcal{F}_{t-1}\), \(\boldsymbol{v}_{t-2,i}\) is deterministic. By Jensen’s inequality, we have that \[\begin{align} \label{proof:3eq1} &\mathbb{E}_{\mathcal{F}_{t-1}}\left[ \mathbb{E} \left[\sqrt{{\beta_2^2 \boldsymbol{v}_{t-2,i}}+\beta_2(1-\beta_2)(\boldsymbol{g}_{t-1,i})^2}|\mathcal{F}_{t-1}\right]\right]\nonumber\\ \le & \mathbb{E}_{\mathcal{F}_{t-1}}\left[ \sqrt{\mathbb{E} \left[ {\beta_2^2 \boldsymbol{v}_{t-2,i}}+\beta_2(1-\beta_2)(\boldsymbol{g}_{t-1,i})^2|\mathcal{F}_{t-1}\right]}\right]\nonumber\\ \le & \mathbb{E}_{\mathcal{F}_{t-1}}\left[ \sqrt{ {\beta_2^2 \boldsymbol{v}_{t-2,i}}+\beta_2(1-\beta_2)(D_0+D_1(\partial_i f(\boldsymbol{x}_{t-1}))^2)}\right]\nonumber\\ \le & \mathbb{E}_{\mathcal{F}_{t-1}}\left[ \sqrt{ {\beta_2^2 \boldsymbol{v}_{t-2,i}}+\beta_2(1-\beta_2)D_0}\right]+\mathbb{E}\Big[\sqrt{\beta_2(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-1})|\Big], \end{align}\tag{24}\] where the second inequality is according to Assumption 2. By recursively applying (24) we have that \[\begin{align} &\mathbb{E}_{\mathcal{F}_{t-1}}\left[ \sqrt{ {\beta_2^2 \boldsymbol{v}_{t-2,i}}+\beta_2(1-\beta_2)D_0}\right]\nonumber\\ \le & \mathbb{E}_{\mathcal{F}_{t-2}}\left[ \sqrt{ {\beta_2^3 \boldsymbol{v}_{t-3,i}}+(\beta_2+\beta_2^2)(1-\beta_2)D_0}\right]+\mathbb{E}\Big[\sqrt{\beta_2^2(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-2})|\Big]. \end{align}\] Specifically, we can get that \[\begin{align} \label{eq:3recursive} \mathbb{E}\Big[\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}}\Big]\le& \mathbb{E}_{\mathcal{F}_{t-1}}\left[ \sqrt{ {\beta_2^2 \boldsymbol{v}_{t-2,i}}+\beta_2(1-\beta_2)D_0}\right]+\mathbb{E}\Big[\sqrt{\beta_2(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-1})|\Big]\nonumber\\ \le&\mathbb{E}_{\mathcal{F}_{t-2}}\left[ \sqrt{ {\beta_2^3 \boldsymbol{v}_{t-3,i}}+(\beta_2+\beta_2^2)(1-\beta_2)D_0}\right]+\mathbb{E}\Big[\sqrt{\beta_2^2(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-2})|\Big]\nonumber\\ &+\mathbb{E}\Big[\sqrt{\beta_2(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-1})|\Big]\nonumber\\ \le&...\nonumber\\ \le&\mathbb{E}\left[ \sqrt{ (\beta_2+\beta_2^2+\beta_2^3+...+\beta_2^t)(1-\beta_2)D_0+\boldsymbol{v}_{0,i}}\right]+\sum_{j=1}^{t-1} \mathbb{E}\Big[ \sqrt{\beta_2^{j}(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-j})|\Big]\nonumber\\ \le& \sqrt{D_0+\|\boldsymbol{v}_0\|}+\sum_{j=1}^{t-1} \mathbb{E}\Big[ \sqrt{\beta_2^{j}(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-j})|\Big].
\end{align}\tag{25}\] As a result, we have that \[\begin{align} \label{eq:3recursive2} &\frac{1}{T} \sum_{t=1}^T \mathbb{E}\Big[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}\Big]\nonumber\\ &\le \sqrt{\zeta}+d\sqrt{D_0+\|\boldsymbol{v}_0\|}+\frac{1}{T}\sum_{i=1}^d\sum_{t=1}^T \sum_{j=1}^{t-1} \mathbb{E}\Big[ \sqrt{\beta_2^{j}(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{t-j})|\Big]\nonumber\\ &\le c+\sum_{i=1}^d\frac{\sum_{t=1}^T\mathbb{E}[|\partial_i f(\boldsymbol{x}_t)|]}{T} \sqrt{(1-\beta_2)D_1}\Bigg(\sum_{j=1}^T \sqrt{\beta_2^j}\Bigg) \nonumber\\ &\le c+\frac{2\sqrt{D_1}}{\sqrt{(1-\beta_2)}} \sum_{i=1}^d\frac{\sum_{t=1}^T\mathbb{E}[|\partial_i f(\boldsymbol{x}_t)|]}{T}\nonumber\\ &\le c+\frac{2\sqrt{dD_1}}{\sqrt{(1-\beta_2)}} \frac{\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}, \end{align}\tag{26}\] where \(c=\sqrt{\zeta}+d\sqrt{D_0+\|\boldsymbol{v}_0\|}\) and the last inequality is due to \(\sum_{i=1}^d |\partial_i f(\boldsymbol{x})|\le \sqrt{d}\|\nabla f(\boldsymbol{x})\|\). This completes the proof.
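The third inequality in (26) uses the geometric-series bound \(\sqrt{1-\beta_2}\sum_{j\ge1}\beta_2^{j/2}\le \frac{2}{\sqrt{1-\beta_2}}\), which follows from \(\sum_j \beta_2^{j/2}\le \frac{1}{1-\sqrt{\beta_2}}\le\frac{2}{1-\beta_2}\). A quick numerical confirmation (ours, not part of the paper) over a few parameter settings:

```python
# Check of the geometric-series step behind (26):
#   sqrt(1 - beta2) * sum_{j=1}^{T} beta2^(j/2) <= 2 / sqrt(1 - beta2).
import math

for beta2 in (0.9, 0.99, 0.999):
    for T in (10, 1000, 100_000):
        s = sum(beta2 ** (j / 2.0) for j in range(1, T + 1))
        assert math.sqrt(1.0 - beta2) * s <= 2.0 / math.sqrt(1.0 - beta2), (beta2, T)
print("geometric-series bound holds for all tested (beta2, T)")
```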

9 Formal Version of Theorem 1 and Its Proof↩︎

Define \(c=\sqrt{\zeta}+d\sqrt{D_0+\|\boldsymbol{v}_0\|}\), \(\Delta=f(\boldsymbol{x}_1)-f(\boldsymbol{x}^*)+\frac{\eta\alpha_0 dD_0 }{2\sqrt{\zeta}}+\frac{\eta \alpha_0D_1\|\nabla f(\boldsymbol{x}_1)\|^2}{2\sqrt{\zeta}}\), \(\Lambda_1=\frac{1}{\max{(14L_1\sqrt{d}D_1,7L_1\sqrt{dD_1})}}\), \(\Lambda_2=\min\left(\frac{\zeta}{35L_0dD_0},\frac{\sqrt{\zeta}}{35L_1^2d^{1.5}D_0}\right)\) and \(\Lambda_3=\frac{\sqrt[4]{\zeta}}{7D_1L_0d\sqrt{5}}\).

Theorem 4. Let Assumptions 1, 2 and 3 hold. Let \(1-\beta_2=\min\left(\frac{1}{7D_1},\frac{\sqrt{\zeta}\epsilon^2}{35dD_0}\right)\sim \mathcal{O}(\epsilon^{2})\), \(\eta=\min\left(\frac{\sqrt{\zeta}}{7L_0D_1},\Lambda_1{\sqrt{1-\beta_2}},\frac{1-\beta_2}{7\sqrt{d}},\Lambda_2{ \epsilon^2},\Lambda_3{\epsilon \sqrt{1-\beta_2}}\right)\sim \mathcal{O}(\epsilon^{2})\), and \(T=\frac{70\Delta}{\eta\epsilon^2} \sim \mathcal{O}(\epsilon^{-4})\). For small \(\epsilon\) such that \(\epsilon\le \frac{\sqrt{5dD_0}}{\sqrt{D_1}\sqrt[4]{\zeta}}\), we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\le \left(\frac{2d\sqrt{35D_0D_1}}{\sqrt[4]{\zeta}}+\sqrt{c}\right)\epsilon. \end{align}\]

Proof. According to Lemma 2, we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}} \right]\le \epsilon^2. \end{align}\] Under a bounded-gradient-norm assumption, it would be easy to show that \(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[{ \|\nabla f(\boldsymbol{x}_t)\|^2}] \sim \mathcal{O}(\epsilon^2)\) via Corollary 2. However, in this paper we avoid such boundedness assumptions, so \(\sqrt{\beta_2\|\boldsymbol{v}_{t-1}\| + \zeta}\) is potentially unbounded, which makes it hard to directly show that the algorithm converges to a stationary point.

According to Lemma 3, we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\le c +\frac{2\sqrt{dD_1}}{\sqrt{(1-\beta_2)}} \frac{\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}. \end{align}\]

Define \(e=\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\). By Hölder’s inequality, we have that \[\begin{align} \left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\right)^2\le \left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{ \|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}} \right]\right)\left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\right), \end{align}\] which, based on Lemma 2 and Lemma 3, can be written as \[\begin{align} e^2\le \epsilon^2\left(c+\frac{2\sqrt{dD_1}}{\sqrt{1-\beta_2}}e\right). \end{align}\] Thus we have that \[\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]=e\le\frac{\sqrt{dD_1}\epsilon^2}{\sqrt{1-\beta_2}}+\frac{1}{2}\sqrt{\left(\frac{2\sqrt{dD_1}\epsilon^2}{\sqrt{1-\beta_2}}\right)^2+4c\epsilon^2}.\] Since \(1-\beta_2=\min(\frac{1}{7D_1},\frac{\sqrt{\zeta}\epsilon^2}{35dD_0})\sim \mathcal{O}(\epsilon^{2})\), if \(\epsilon\le \frac{\sqrt{5dD_0}}{\sqrt{D_1}\sqrt[4]{\zeta}}\) we have \(\frac{\epsilon^2}{\sqrt{1-\beta_2}} \le \frac{\sqrt{35dD_0}}{\sqrt[4]{\zeta}}\epsilon\). This demonstrates that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\le \left(\frac{2d\sqrt{35D_0D_1}}{\sqrt[4]{\zeta}}+\sqrt{c}\right)\epsilon, \end{align}\] which completes the proof. ◻
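The step from \(e^2\le\epsilon^2(c+Ke)\) to the explicit bound on \(e\) simply upper-bounds \(e\) by the larger root of the quadratic \(x^2-K\epsilon^2x-c\epsilon^2=0\), where \(K=\frac{2\sqrt{dD_1}}{\sqrt{1-\beta_2}}\). A small sanity check of this root bound (ours, not part of the paper; \(K\) and \(c\) are sampled arbitrarily):

```python
# Check: every e in [0, root] satisfies e^2 <= eps^2 * (c + K*e), where
# root = K*eps^2/2 + 0.5*sqrt((K*eps^2)^2 + 4*c*eps^2) is the larger root.
import math
import random

random.seed(2)
for _ in range(1000):
    eps = random.uniform(1e-4, 1.0)
    c = random.uniform(0.0, 10.0)
    K = random.uniform(0.0, 10.0)
    root = K * eps**2 / 2.0 + 0.5 * math.sqrt((K * eps**2) ** 2 + 4.0 * c * eps**2)
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        e = f * root
        assert e**2 <= eps**2 * (c + K * e) + 1e-9
print("quadratic root bound used at the end of the proof verified")
```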

10 Lemmas for Theorem 3↩︎

Here are some lemmas we need for the proof of Theorem 3.

Lemma 6. (Lemma 6 in [19]) For any nonnegative sequence \(\{c_t\}_{t=0}^\infty\), with \(a_t=\beta_2 a_{t-1}+(1-\beta_2)c_t^2\) and \(b_t=\beta_1b_{t-1}+(1-\beta_1)c_t\), if \(0<\beta_1^2<\beta_2<1\), we have that \[\begin{align} \frac{b_t}{\sqrt{a_t+\zeta}}\le \frac{1-\beta_1}{\sqrt{1-\beta_2}\sqrt{1-\frac{\beta_1^2}{\beta_2}}}. \end{align}\]

Proof. We have that \[\begin{align} \frac{b_t}{\sqrt{a_t+\zeta}}\le\frac{b_t}{\sqrt{a_t}}&\le \frac{\sum_{i=0}^{t-1}(1-\beta_1)\beta_1^ic_{t-i}}{\sqrt{\sum_{i=0}^{t-1}(1-\beta_2)\beta_2^ic_{t-i}^2+\beta_2^ta_0}}\nonumber\\ &\le \frac{1-\beta_1}{\sqrt{1-\beta_2}}\frac{\sqrt{\sum_{i=0}^{t-1}\beta_2^ic_{t-i}^2}\sqrt{\sum_{i=0}^{t-1}\frac{\beta_1^{2i}}{\beta_2^i}}}{\sqrt{\sum_{i=0}^{t-1}\beta_2^ic_{t-i}^2}} \le \frac{1-\beta_1}{\sqrt{1-\beta_2}\sqrt{1-\frac{\beta_1^2}{\beta_2}}}, \end{align}\] where the third inequality is due to the Cauchy–Schwarz inequality. ◻
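As a hedged numerical illustration of Lemma 6 (ours, not part of the paper; the random sequence and constants are illustrative), the sketch below tracks the two recursions along a long nonnegative sequence and confirms that the ratio never exceeds the stated constant.

```python
# Numerical check of Lemma 6: b_t / sqrt(a_t + zeta) stays below
# (1-beta1) / (sqrt(1-beta2) * sqrt(1 - beta1^2/beta2)).
import math
import random

random.seed(3)
beta1, beta2, zeta = 0.9, 0.99, 1e-8
assert 0.0 < beta1**2 < beta2 < 1.0
bound = (1.0 - beta1) / (math.sqrt(1.0 - beta2) * math.sqrt(1.0 - beta1**2 / beta2))
a, b, worst = 0.0, 0.0, 0.0               # a_0 = b_0 = 0
for _ in range(100_000):
    c = abs(random.gauss(0.0, 3.0))       # c_t >= 0 as required by the lemma
    a = beta2 * a + (1.0 - beta2) * c * c
    b = beta1 * b + (1.0 - beta1) * c
    worst = max(worst, b / math.sqrt(a + zeta))
print(f"max ratio = {worst:.4f} <= bound = {bound:.4f}")
assert worst <= bound
```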

Lemma 7. (Lemma 5 in [19]) For any nonnegative sequence \(\{c_t\}_{t=0}^\infty\), with \(a_t=\beta_2 a_{t-1}+(1-\beta_2)c_t^2\) and \(b_t=\beta_1b_{t-1}+(1-\beta_1)c_t\), if \(0<\beta_1^2<\beta_2<1\) and \(a_0>0\), we have that \[\begin{align} \sum_{t=1}^T\frac{b_t^2}{{a_t}}\le \frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt{\beta_2}})^2(1-\beta_2)}\left(\ln\left(\frac{a_T}{a_0}\right)-T\ln(\beta_2)\right). \end{align}\]

Proof. Due to the monotonicity of the function \(a\mapsto\frac{1}{a}\) and Lemma 5.2 in [13], we have that \[\begin{align} \frac{(1-\beta_2)c_t^2}{a_t}\le \int_{a=a_t-(1-\beta_2)c_t^2}^{a_t} \frac{1}{a}da=\ln\left(\frac{a_t}{a_t-(1-\beta_2)c_t^2}\right)=\ln\left(\frac{a_t}{a_{t-1}}\right)-\ln(\beta_2). \end{align}\] By telescoping, we have \(\sum_{t=1}^T \frac{c_t^2}{a_t}\le \frac{1}{1-\beta_2}\ln\left(\frac{a_T}{a_{0}}\right)-T\frac{\ln(\beta_2)}{(1-\beta_2)}\). Moreover, for \(b_t\) we have that

\[\begin{align} \frac{b_t}{\sqrt{a_t}}\le (1-\beta_1)\sum_{i=1}^t\frac{\beta_1^{t-i}c_i}{\sqrt{a_t}} \le (1-\beta_1)\sum_{i=1}^t\left(\frac{\beta_1}{\sqrt{\beta_2}}\right)^{t-i}\frac{c_i}{\sqrt{a_i}}. \end{align}\]

Moreover, due to the Cauchy–Schwarz inequality, we get that \[\begin{align} \frac{b_t^2}{{a_t}} \le& (1-\beta_1)^2\left(\sum_{i=1}^t\left(\frac{\beta_1}{\sqrt{\beta_2}}\right)^{t-i}\frac{c_i}{\sqrt{a_i}}\right)^2\nonumber\\ \le &(1-\beta_1)^2\left(\sum_{i=1}^t\left(\frac{\beta_1}{\sqrt{\beta_2}}\right)^{t-i}\right)\left(\sum_{i=1}^t\left({\frac{\beta_1}{\sqrt{\beta_2}}}\right)^{t-i} \frac{c_i^2}{{a_i}}\right)\nonumber\\ \le & \frac{(1-\beta_1)^2}{\left(1-\frac{\beta_1}{\sqrt{\beta_2}}\right)}\left(\sum_{i=1}^t\left(\frac{\beta_1}{\sqrt{\beta_2}}\right)^{t-i} \frac{c_i^2}{{a_i}}\right). \end{align}\] Summing over \(t\) from \(1\) to \(T\) and exchanging the order of summation, we have that \[\begin{align} \sum_{t=1}^T\frac{b_t^2}{{a_t}}\le \frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt{\beta_2}})^2(1-\beta_2)}\left(\ln\left(\frac{a_T}{a_0}\right)-T\ln(\beta_2)\right). \end{align}\] ◻
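The logarithmic bound of Lemma 7 can also be confirmed numerically. Below is a minimal sketch (ours, not part of the paper; \(a_0\) and the random sequence are illustrative) that accumulates \(\sum_t b_t^2/a_t\) along a simulated trajectory and compares it with the stated bound.

```python
# Numerical check of Lemma 7 on a random nonnegative sequence.
import math
import random

random.seed(4)
beta1, beta2, a0 = 0.9, 0.99, 1e-6        # a_0 > 0 as required
assert 0.0 < beta1**2 < beta2 < 1.0
a, b, lhs, T = a0, 0.0, 0.0, 10_000
for _ in range(T):
    c = abs(random.gauss(0.0, 1.0))
    a = beta2 * a + (1.0 - beta2) * c * c
    b = beta1 * b + (1.0 - beta1) * c
    lhs += b * b / a
coef = (1.0 - beta1) ** 2 / ((1.0 - beta1 / math.sqrt(beta2)) ** 2 * (1.0 - beta2))
rhs = coef * (math.log(a / a0) - T * math.log(beta2))
print(f"sum b_t^2/a_t = {lhs:.2f} <= bound = {rhs:.2f}")
assert lhs <= rhs
```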

Lemma 8. For any nonnegative sequence \(\{c_t\}_{t=0}^\infty\), with \(a_t=\beta_2 a_{t-1}+(1-\beta_2)c_t^2\) and \(b_t=\beta_1b_{t-1}+(1-\beta_1)c_t\), if \(0<\beta_1^4<\beta_2<1\) and \(a_0>0\), we have that \[\begin{align} \sum_{t=1}^T\frac{b_t^2}{\sqrt{a_t}}\le \frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\left(\frac{2}{1-\beta_2}(\sqrt{a_T}-\sqrt{a_{0}})+\sum_{t=1}^T2\sqrt{a_{t-1}}\right). \end{align}\]

Proof. Due to the monotonicity of the function \(a\mapsto\frac{1}{\sqrt{a}}\), we have that \[\begin{align} \frac{(1-\beta_2)c_t^2}{\sqrt{a_t}}&\le \int_{a=a_t-(1-\beta_2)c_t^2}^{a_t} \frac{1}{\sqrt{a}}da=2\sqrt{a_t}-2\sqrt{a_t-(1-\beta_2)c_t^2}\nonumber\\ &=2\sqrt{a_t}-2\sqrt{\beta_2a_{t-1}}\le 2\sqrt{a_t}-2\sqrt{a_{t-1}}+2(1-\beta_2)\sqrt{a_{t-1}}. \end{align}\] Summing from \(t=1\) to \(T\), we have \[\begin{align} \label{lemma:e3} \sum_{t=1}^T \frac{(1-\beta_2)c_t^2}{\sqrt{a_t}}\le 2\sqrt{a_T}-2\sqrt{a_{0}}+2(1-\beta_2)\sum_{t=1}^T\sqrt{a_{t-1}}. \end{align}\tag{27}\] Moreover, for \(b_t\) we have that \[\begin{align} \frac{b_t}{\sqrt[4]{a_t}}\le (1-\beta_1)\sum_{i=1}^t\frac{\beta_1^{t-i}c_i}{\sqrt[4]{a_t}} \le (1-\beta_1)\sum_{i=1}^t\left(\frac{\beta_1}{\sqrt[4]{\beta_2}}\right)^{t-i}\frac{c_i}{\sqrt[4]{a_i}}. \end{align}\] Thus we have that \[\begin{align} \frac{b_t^2}{\sqrt{a_t}} \le& (1-\beta_1)^2\left(\sum_{i=1}^t\left(\frac{\beta_1}{\sqrt[4]{\beta_2}}\right)^{t-i}\frac{c_i}{\sqrt[4]{a_i}}\right)^2\nonumber\\ \le &(1-\beta_1)^2\left(\sum_{i=1}^t\left(\frac{\beta_1}{\sqrt[4]{\beta_2}}\right)^{t-i}\right)\left(\sum_{i=1}^t\left({\frac{\beta_1}{\sqrt[4]{\beta_2}}}\right)^{t-i} \frac{c_i^2}{\sqrt{a_i}}\right)\nonumber\\ \le & \frac{(1-\beta_1)^2}{1-\frac{\beta_1}{\sqrt[4]{\beta_2}}}\left(\sum_{i=1}^t\left({\frac{\beta_1}{\sqrt[4]{\beta_2}}}\right)^{t-i} \frac{c_i^2}{\sqrt{a_i}}\right). \end{align}\] Summing over \(t\) and exchanging the order of summation, we have that \[\begin{align} \label{eq:proof59} \sum_{t=1}^T\frac{b_t^2}{\sqrt{a_t}}\le \frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\sum_{t=1}^T\frac{c_t^2}{\sqrt{a_t}}. \end{align}\tag{28}\] Plugging (27) into (28) completes the proof. ◻
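As with Lemmas 6 and 7, Lemma 8 admits a quick numerical confirmation. The sketch below (ours, not part of the paper; all constants are illustrative) accumulates both sides of the bound along one simulated trajectory.

```python
# Numerical check of Lemma 8 on a random nonnegative sequence.
import math
import random

random.seed(5)
beta1, beta2, a0 = 0.9, 0.99, 1e-6        # a_0 > 0 as required
assert 0.0 < beta1**4 < beta2 < 1.0
a, b, lhs, tail, T = a0, 0.0, 0.0, 0.0, 10_000
for _ in range(T):
    tail += 2.0 * math.sqrt(a)            # accumulates 2*sqrt(a_{t-1})
    c = abs(random.gauss(0.0, 1.0))
    a = beta2 * a + (1.0 - beta2) * c * c
    b = beta1 * b + (1.0 - beta1) * c
    lhs += b * b / math.sqrt(a)
coef = (1.0 - beta1) ** 2 / (1.0 - beta1 / beta2 ** 0.25) ** 2
rhs = coef * (2.0 / (1.0 - beta2) * (math.sqrt(a) - math.sqrt(a0)) + tail)
print(f"sum b_t^2/sqrt(a_t) = {lhs:.2f} <= bound = {rhs:.2f}")
assert lhs <= rhs
```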

11 Lemma 9 and Its Proof↩︎

Define \(C_1=1-\frac{\beta_1}{\sqrt{\beta_2}}\) and \(C_2=\sqrt{1-\frac{\beta_1^2}{\beta_2}}\); we then have the following lemma:

Lemma 9. For any \(\alpha_0,\alpha_1,\alpha_3, \alpha_4>0\), \(\beta_2>0.5,\) and \(0<\beta_1^2<\beta_2\), we have that \[\begin{align} &{\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}\nonumber\\ \ge& \Bigg[\frac{\eta (1-\beta_1)}{C_1}-\frac{\eta(1-\beta_1)}{C_1C_2}\left(\frac{1}{2\alpha_0}+\frac{\alpha_0}{2\alpha_1}+\frac{\eta \alpha_0\sqrt{d}D_1L_1(1-\beta_1)}{\sqrt{1-\beta_2}C_2}\right)-\frac{\eta \beta_1(1-\beta_1)\sqrt{\zeta}}{2\alpha_3C_1C_2}-\frac{\eta L_1(1-C_1)^2}{2\alpha_4C_1^2}-\frac{\eta L_1(1-C_1)}{2\alpha_4C_1^2}\Bigg]\nonumber\\ &\times \sum_{i=1}^d\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2\boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ &-\sum_{i=1}^d\frac{\eta (1-\beta_1)}{C_1C_2}\Bigg[\frac{\alpha_0D_0}{2}\Bigg(\frac{1}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}-\mathbb{E}\Bigg[{\frac{1}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}}|\mathcal{F}_t\Bigg]\Bigg)+\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\Bigg]\nonumber\\&-\frac{\alpha_0\alpha_1D_1^2L_0^2\eta^3d^2(1-\beta_1)^3}{2(1-\beta_2)C_1C_2^3\sqrt{\zeta}}-\frac{\alpha_3\eta d\beta_1(1-\beta_1)(1-\beta_2)}{2C_1C_2}\nonumber\\ &-\sum_{i=1}^d\frac{\eta^2((1-C_1)^2+0.5(1-C_1))\sqrt{d}L_0}{C_1^2} \frac{\boldsymbol{m}_{t-1,i}^2}{\boldsymbol{v}_{t-1,i}+\zeta}-\sum_{i=1}^d\frac{0.5\eta^2(1-C_1)\sqrt{d}L_0}{C_1^2}\mathbb{E}\Bigg[ \frac{\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}|\mathcal{F}_t\Bigg]\nonumber\\ &-\sum_{i=1}^d\frac{\alpha_4\eta^3(1-\beta_1)^2(1-C_1)^2dL_1}{2(1-\beta_2)C_1^2C_2^2}\frac{\boldsymbol{m}_{t-1,i}^2}{\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}-\sum_{i=1}^d\frac{\alpha_4\eta^3(1-\beta_1)^2(1-C_1)dL_1}{2(1-\beta_2)C_1^2C_2^2}\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}|\mathcal{F}_t\right]. \end{align}\]

Proof. Since \(\boldsymbol{u}_t=\frac{\boldsymbol{x}_t-\frac{\beta_1}{\sqrt{\beta_2}}\boldsymbol{x}_{t-1}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\), we then have that \[\begin{align} \label{eq:u} \boldsymbol{u}_{t+1}-\boldsymbol{u}_t=&\frac{\boldsymbol{x}_{t+1}-\boldsymbol{x}_t}{1-\frac{\beta_1}{\sqrt{\beta_2}}}-\frac{\beta_1}{\sqrt{\beta_2}}\frac{\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\nonumber\\ =&\frac{ - \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\boldsymbol{v}_t} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}+\frac{\beta_1}{\sqrt{\beta_2}}\frac{ \eta \odot \frac{\boldsymbol{m}_{t-1}}{\sqrt{{\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\nonumber\\ =&\frac{ - \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}+\frac{ - \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\boldsymbol{v}_t} + \zeta}}+ \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\nonumber\\ &+\frac{ \eta \odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}+\frac{ \eta \odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \beta_2\zeta}}- \eta \odot\frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\nonumber\\ =&\frac{ - \eta \odot \frac{(1-\beta_1)\boldsymbol{g}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}+\frac{ - \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\boldsymbol{v}_t} + \zeta}}+ \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}+\frac{ \eta\odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \beta_2\zeta}}- \eta\odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}, \end{align}\tag{29}\] where the last equality is due to \(\boldsymbol{m}_t=\beta_1 \boldsymbol{m}_{t-1}+(1-\beta_1)\boldsymbol{g}_t\). For \((L_0,L_1)\)-smooth objectives, from Lemma 1 in [11] we can get that \[\begin{align} \underbrace{\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}_{\text{First Order}}&\le f(\boldsymbol{u}_t)-\mathbb{E}[f(\boldsymbol{u}_{t+1})|\mathcal{F}_t] +\underbrace{\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]}_{\text{Second Order}}\nonumber\\ &+\underbrace{\sum_{i=1}^d\frac{L_1|\partial_i f(\boldsymbol{u}_t)|}{2}\mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]}_{\text{Additional Term}}. \label{eq:generalsmoothforu} \end{align}\tag{30}\] We then focus on the first-order term. 
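Before bounding the individual terms, we record a quick numerical sanity check of the decomposition (29) (ours, not part of the paper; the step size and noise scale are illustrative). The sketch tracks scalar Adam iterates with \(\kappa=\frac{\beta_1}{\sqrt{\beta_2}}\) and \(\boldsymbol{u}_t=\frac{\boldsymbol{x}_t-\kappa\boldsymbol{x}_{t-1}}{1-\kappa}\), and verifies that the three terms on the last line of (29) sum exactly to \(\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\).

```python
# Verify the decomposition (29) of u_{t+1} - u_t along a scalar Adam trajectory.
import math
import random

random.seed(6)
beta1, beta2, eta, zeta = 0.9, 0.99, 1e-3, 1e-8
kappa = beta1 / math.sqrt(beta2)
x_prev, x, m_prev, v_prev = 1.0, 1.0, 0.0, 0.0   # x_0 = x_1, m_0 = v_0 = 0
for _ in range(100):
    g = random.gauss(0.0, 1.0)
    m = beta1 * m_prev + (1.0 - beta1) * g
    v = beta2 * v_prev + (1.0 - beta2) * g * g
    x_next = x - eta * m / math.sqrt(v + zeta)
    u = (x - kappa * x_prev) / (1.0 - kappa)
    u_next = (x_next - kappa * x) / (1.0 - kappa)
    # the three terms on the last line of (29)
    t1 = -eta * (1.0 - beta1) * g / math.sqrt(beta2 * v_prev + zeta) / (1.0 - kappa)
    t2 = eta * m * (1.0 / math.sqrt(beta2 * v_prev + zeta)
                    - 1.0 / math.sqrt(v + zeta)) / (1.0 - kappa)
    t3 = eta * beta1 * m_prev * (1.0 / math.sqrt(beta2 * v_prev + beta2 * zeta)
                                 - 1.0 / math.sqrt(beta2 * v_prev + zeta)) / (1.0 - kappa)
    assert math.isclose(u_next - u, t1 + t2 + t3, rel_tol=1e-9, abs_tol=1e-15)
    x_prev, x, m_prev, v_prev = x, x_next, m, v
print("decomposition (29) verified along a random Adam trajectory")
```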
Based on (29) we divide the first-order term into four parts \[\begin{align} \label{eq:6eq8} {\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}=& {\mathbb{E}[\langle \nabla f(\boldsymbol{x}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}+ {\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{u}_t)-\nabla f(\boldsymbol{x}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \right\rangle|\mathcal{F}_t\right]}\nonumber\\ =& \underbrace{{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t),\frac{ \eta\odot \frac{(1-\beta_1)\boldsymbol{g}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \right\rangle|\mathcal{F}_t\right]}}_{\text{First Order Main}}-\underbrace{{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t),\frac{ - \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\boldsymbol{v}_t} + \zeta}}+ \eta\odot \frac{\boldsymbol{m}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\right\rangle|\mathcal{F}_t\right]}}_{\text{Error 1}}\nonumber\\ &-\underbrace{{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t),\frac{ \eta \odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \beta_2\zeta}}- \eta \odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \right\rangle|\mathcal{F}_t\right]}}_{\text{Error 2}}\nonumber\\ &- \underbrace{{\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t)-\nabla f(\boldsymbol{x}_t), \boldsymbol{u}_{t+1}-\boldsymbol{u}_{t} \rangle|\mathcal{F}_t]}}_{\text{Error 3}}. \end{align}\tag{31}\] The first-order main term is easy to bound: given \(\mathcal{F}_t\), \(\boldsymbol{g}_t\) is independent of \(\boldsymbol{v}_{t-1}\), and we have that \[\begin{align} \label{eq:6eq9} \underbrace{{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t),\frac{ \eta \odot \frac{(1-\beta_1)\boldsymbol{g}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \right\rangle|\mathcal{F}_t\right]}}_{\text{First Order Main}}=\sum_{i=1}^d\frac{\eta(1-\beta_1)}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2\boldsymbol{v}_{t-1,i}} + \zeta}}. \end{align}\tag{32}\] We then focus on the Error 1 term. Since \(\boldsymbol{v}_t=\beta_2 \boldsymbol{v}_{t-1}+(1-\beta_2)\boldsymbol{g}_t\odot \boldsymbol{g}_t\), we can apply the same decomposition as in Appendix 6, and it follows that \[\begin{align} &\underbrace{{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t),\frac{ - \eta \odot \frac{\boldsymbol{m}_t}{\sqrt{{\boldsymbol{v}_t} + \zeta}}+ \eta\odot \frac{\boldsymbol{m}_t}{\sqrt{{\beta_2\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\right\rangle|\mathcal{F}_t\right]}}_{\text{Error 1}}\nonumber\\ \le& \sum_{i=1}^d \frac{\eta}{1-\frac{\beta_1}{\sqrt{\beta_2}}}{\mathbb{E}\left[ |\partial_i f(\boldsymbol{x}_t)| \frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2|\boldsymbol{m}_{t,i}|}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta})(\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})(\sqrt{{ \boldsymbol{v}_{t,i}}+\zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}}+\zeta})}|\mathcal{F}_t\right]}.
\end{align}\] According to Lemma 6, we have \(\frac{|\boldsymbol{m}_{t,i}|}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\le \frac{1-\beta_1}{\sqrt{1-\beta_2}\sqrt{1-\frac{\beta_1^2}{{\beta_2}}}}\), which demonstrates that \[\begin{align} \text{Error 1} \le &\sum_{i=1}^d \frac{\eta}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \frac{1-\beta_1}{\sqrt{1-\frac{\beta_1^2}{\beta_2}}} \frac{|\partial_i f(\boldsymbol{x}_t)|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}\left[\frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}|\mathcal{F}_t\right]. \end{align}\] Similar to Appendix 6, we have that for any \(\alpha_0>0\) and \(i\in[d]\) \[\begin{align} \label{proof:6eq2} & \frac{|\partial_i f(\boldsymbol{x}_t)|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}\left[\frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ \le &\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}+\frac{\alpha_0 }{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\mathbb{E}\left[\frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})}|\mathcal{F}_t\right]\right)^2. \end{align}\tag{33}\] For the last term, due to Hölder’s inequality, we have that \[\begin{align} \label{proof:6eq3} &\frac{\alpha_0 }{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\mathbb{E}\left[\frac{\sqrt{1-\beta_2}\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})}|\mathcal{F}_t\right]\right)^2\nonumber\\ \le& \frac{(1-\beta_2)\alpha_0}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\mathbb{E}[\boldsymbol{g}_{t,i}^2|\mathcal{F}_t]\mathbb{E}\left[\frac{\boldsymbol{g}_{t,i}^2}{(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})^2}|\mathcal{F}_t\right]\nonumber\\ \le & \frac{\alpha_0}{2}(D_0+D_1 (\partial_i f(\boldsymbol{x}_t))^2)\times \mathbb{E}\left[\frac{(1-\beta_2)\boldsymbol{g}_{t,i}^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}(\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}+\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta})}|\mathcal{F}_t\right]\nonumber\\ \le &\frac{\alpha_0}{2}(D_0+D_1 (\partial_i f(\boldsymbol{x}_t))^2)\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]. \end{align}\tag{34}\] Similar to (14), we will show that the sum of (34) from \(t=1\) to \(T\) can be bounded.
For \(t>1\) we have that \[\begin{align} \label{proof:6eq4} & \frac{\alpha_0}{2}(D_0+D_1 (\partial_i f(\boldsymbol{x}_t))^2)\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ = &\frac{\alpha_0 D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]+\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+ \frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_t))^2-(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}|\mathcal{F}_t\right]. \end{align}\tag{35}\] Note that for the last term on the RHS of (35), due to the different update of \(\boldsymbol{x}_t\), there is a slight difference compared with (15). For the last term, any \(\alpha_1 >0\) and \(i\in [d]\), we have that \[\begin{align} &\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_t))^2-(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ \le& \frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{2|\partial_i f(\boldsymbol{x}_t)|\big||\partial_i f(\boldsymbol{x}_t)|-|\partial_i f(\boldsymbol{x}_{t-1})|\big|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}|\mathcal{F}_t\right] \nonumber\\ \le &\alpha_0 D_1\frac{|\partial_i f(\boldsymbol{x}_t)|(L_0+L_1|\partial_i f(\boldsymbol{x}_t)|)\eta\left\|\frac{1}{\sqrt{\boldsymbol{v}_{t-1}+\zeta}}\odot \boldsymbol{m}_{t-1}\right\|}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le & \frac{\alpha_0 D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}} \left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+\alpha_1 D_1L_0^2\eta^2\left\|\frac{1}{\sqrt{\boldsymbol{v}_{t-1}+\zeta}}\odot \boldsymbol{m}_{t-1}\right\|^2+2\eta L_1(\partial_i f(\boldsymbol{x}_t))^2\left\|\frac{1}{\sqrt{\boldsymbol{v}_{t-1}+\zeta}}\odot \boldsymbol{m}_{t-1}\right\|\right)\nonumber\\ \le & \frac{\alpha_0 D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+{\alpha_1 D_1L_0^2\eta^2}{}\frac{d(1-\beta_1)^2}{(1-\beta_2)(1-\frac{\beta_1^2}{\beta_2})}+2\eta L_1(\partial_i f(\boldsymbol{x}_t))^2\frac{\sqrt{d}(1-\beta_1)}{\sqrt{1-\beta_2}\sqrt{1-\frac{\beta_1^2}{\beta_2}}}\right), \end{align}\] where the first inequality is due to the fact that for any \(a,b \ge 0\), we have \(a^2-b^2\le 2a|a-b|\), the second inequality is by the \((L_0,L_1)\)-smoothness and the update process of \(\boldsymbol{x}_t\), and the last inequality is due to Lemma 6.
As a result, for \(\alpha_0,\alpha_1>0\), if \(t=1\) we have that \[\begin{align} \text{Error 1} \le & \sum_{i=1}^d\frac{\eta}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \frac{1-\beta_1}{\sqrt{1-\frac{\beta_1^2}{\beta_2}}} \Bigg(\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}+\frac{\alpha_0 D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\Bigg), \end{align}\] and if \(t>1\) we have that \[\begin{align} \label{proof:6eq5} \text{Error 1} \le &\sum_{i=1}^d \frac{\eta}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \frac{1-\beta_1}{\sqrt{1-\frac{\beta_1^2}{\beta_2}}} \Bigg[\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}+\frac{\alpha_0 D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\frac{\alpha_0 D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+{\alpha_1 D_1L_0^2\eta^2}{}\frac{d(1-\beta_1)^2}{(1-\beta_2)(1-\frac{\beta_1^2}{\beta_2})}+2\eta L_1(\partial_i f(\boldsymbol{x}_t))^2\frac{\sqrt{d}(1-\beta_1)}{\sqrt{1-\beta_2}\sqrt{1-\frac{\beta_1^2}{\beta_2}}}\right)\Bigg]. \end{align}\tag{36}\] Since we define \(\boldsymbol{m}_0=0\) and set \(\boldsymbol{x}_{0}=\boldsymbol{x}_1\), this inequality holds for \(t=1\) as well. Note that many terms on the RHS of (36) can be reduced by telescoping, which will be demonstrated later. Now we move to the Error 2 term. For any \(\alpha_3>0,0<\beta_1^2<\beta_2\) and \(\beta_2>0.5\), we have that

\[\begin{align} \label{eq:6eq10} &\underbrace{{\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t),\frac{ \eta \odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \beta_2\zeta}}- \eta \odot \frac{\beta_1\boldsymbol{m}_{t-1}}{\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \zeta}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \right\rangle|\mathcal{F}_t\right]}}_{\text{Error 2}}\nonumber\\ = &\frac{\eta \beta_1}{1-\frac{\beta_1}{\sqrt{\beta_2}}} {\mathbb{E}\left[\left\langle \nabla f(\boldsymbol{x}_t), \frac{(1-\beta_2)\zeta }{(\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \zeta})\odot(\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \beta_2\zeta})\odot(\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \zeta}+\sqrt{\beta_2{\boldsymbol{v}_{t-1}} + \beta_2\zeta})}\odot\boldsymbol{m}_{t-1}\right\rangle|\mathcal{F}_t\right]}\nonumber\\ \le &\sum_{i=1}^d\frac{\eta \beta_1}{1-\frac{\beta_1}{\sqrt{\beta_2}}} |\partial_i f(\boldsymbol{x}_t)| \frac{(1-\beta_2)\zeta |\boldsymbol{m}_{t-1,i}|}{(\sqrt{\beta_2{\boldsymbol{v}_{t-1,i}} + \zeta})(\sqrt{\beta_2{\boldsymbol{v}_{t-1,i}} + \beta_2\zeta})(\sqrt{\beta_2{\boldsymbol{v}_{t-1,i}} + \zeta}+\sqrt{\beta_2{\boldsymbol{v}_{t-1,i}} + \beta_2\zeta})}\nonumber\\ \le &\sum_{i=1}^d\frac{\eta \beta_1}{1-\frac{\beta_1}{\sqrt{\beta_2}}} |\partial_i f(\boldsymbol{x}_t)| \frac{(1-\beta_2)\zeta }{(\sqrt{\beta_2{\boldsymbol{v}_{t-1,i}} + \zeta})\sqrt{\zeta}}\frac{1-\beta_1}{\sqrt{1-\beta_2}\sqrt{1-\frac{\beta_1^2}{\beta_2}}}\nonumber\\ \le &\sum_{i=1}^d\frac{\eta \beta_1 (1-\beta_1)\sqrt{\zeta}}{(1-\frac{\beta_1}{\sqrt{\beta_2}})\sqrt{1-\frac{\beta_1^2}{\beta_2}}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_3\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}+\frac{\alpha_3(1-\beta_2)}{2\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}\right), \end{align}\tag{37}\] where the second inequality is due to Lemma 6 and \(2\beta_2\ge 1\).
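The equality step in (37) rests on another difference-of-reciprocal-square-roots identity: with \(A=\beta_2\boldsymbol{v}_{t-1,i}+\beta_2\zeta\) and \(B=\beta_2\boldsymbol{v}_{t-1,i}+\zeta\), we have \(\frac{1}{\sqrt{A}}-\frac{1}{\sqrt{B}}=\frac{(1-\beta_2)\zeta}{\sqrt{A}\sqrt{B}(\sqrt{A}+\sqrt{B})}\). A minimal numerical check (ours, not part of the paper):

```python
# Check of the identity behind Error 2 in (37): with A = beta2*(v + zeta)
# and B = beta2*v + zeta, so that B - A = (1 - beta2)*zeta,
#   1/sqrt(A) - 1/sqrt(B) = (1-beta2)*zeta / (sqrt(A)*sqrt(B)*(sqrt(A)+sqrt(B))).
import math
import random

random.seed(7)
beta2, zeta = 0.99, 1e-4
for _ in range(1000):
    v = random.uniform(0.0, 10.0)
    A = beta2 * v + beta2 * zeta
    B = beta2 * v + zeta
    lhs = 1.0 / math.sqrt(A) - 1.0 / math.sqrt(B)
    rhs = (1.0 - beta2) * zeta / (math.sqrt(A) * math.sqrt(B)
                                  * (math.sqrt(A) + math.sqrt(B)))
    assert math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-15)
print("difference-of-reciprocal-square-roots identity in (37) verified")
```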

For Error 3, we have that \[\begin{align} \label{eq:6eq11} &{\langle \nabla f(\boldsymbol{u}_t)-\nabla f(\boldsymbol{x}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle}\nonumber\\ \le &\sum_{i=1}^d |\partial_i f(\boldsymbol{u}_t)-\partial_i f(\boldsymbol{x}_t)|| \boldsymbol{u}_{t,i}-\boldsymbol{u}_{t+1,i}|\nonumber\\ \le &\sum_{i=1}^d(L_0+L_1|\partial_i f(\boldsymbol{x}_t)|)\|\boldsymbol{u}_t-\boldsymbol{x}_t\|| \boldsymbol{u}_{t,i}-\boldsymbol{u}_{t+1,i}|\nonumber\\ \le &\sum_{i=1}^d(L_0+L_1|\partial_i f(\boldsymbol{x}_t)|)\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\|\left| \frac{\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}-\frac{\beta_1}{\sqrt{\beta_2}}\frac{\boldsymbol{x}_{t,i}-\boldsymbol{x}_{t-1,i}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\right|, \end{align}\tag{38}\] where the second inequality is due to Assumption 3. According to the update process of \(\boldsymbol{x}_t\) and Lemma 7, it is easy to get that \[\begin{align} \label{eq:6eq12} &\sum_{i=1}^dL_0\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\|\left| \frac{\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}-\frac{\beta_1}{\sqrt{\beta_2}}\frac{\boldsymbol{x}_{t,i}-\boldsymbol{x}_{t-1,i}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\right|\nonumber\\ &\le \sum_{i=1}^dL_0\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\left[\frac{1}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\left({\|\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}\|}|\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}| \right)+\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\left({\|\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}\|}|\boldsymbol{x}_{t,i}-\boldsymbol{x}_{t-1,i}| \right)\right]\nonumber\\ &\le \sum_{i=1}^dL_0\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\left[\frac{1}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\left({\frac{\|\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}\|^2}{2\sqrt{d}} }+\frac{\sqrt{d}|\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|^2}{2} \right)+\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\left({\frac{\|\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}\|^2}{2\sqrt{d}} }+\frac{\sqrt{d}|\boldsymbol{x}_{t,i}-\boldsymbol{x}_{t-1,i}|^2}{2} \right)\right]\nonumber\\ &\le L_0\eta^2\sum_{i=1}^d\left[\frac{\frac{\beta_1}{2\sqrt{\beta_2}}\sqrt{d}+\frac{\beta_1^2}{\beta_2}\sqrt{d}}{\left(1-\frac{\beta_1}{\sqrt{\beta_2}}\right)^2}\frac{|\boldsymbol{m}_{t-1,i}|^2}{\boldsymbol{v}_{t-1,i}+\zeta}+\frac{\frac{\beta_1}{2\sqrt{\beta_2}}\sqrt{d}}{\left(1-\frac{\beta_1}{\sqrt{\beta_2}}\right)^2}\frac{\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}\right]. \end{align}\tag{39}\] However, it is hard to bound the remaining \(|\partial_i f(\boldsymbol{x}_t)|\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\||\boldsymbol{u}_{t,i}-\boldsymbol{u}_{t+1,i}|\) term by directly applying Lemma 6, since doing so would induce an \(\mathcal{O}(|\partial_i f(\boldsymbol{x}_t)| \frac{\eta^2}{1-\beta_2})\) term, whereas in our affine-variance setting we can only bound terms of the form \(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}\). It is worth noting that this challenging additional term is due to the \((L_0,L_1)\)-smoothness, and the methods in [19], [39] do not generalize to our case.
To solve this challenge, we first bound the additional terms as follows: \[\begin{align} \label{eq:6eq13} &\sum_{i=1}^d|\partial_i f(\boldsymbol{x}_t)|\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\||\boldsymbol{x}_{t,i}-\boldsymbol{x}_{t-1,i}|\nonumber\\ \le & \sum_{i=1}^d|\partial_i f(\boldsymbol{x}_t)|\,\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\|\,\frac{\eta |\boldsymbol{m}_{t-1,i}|}{\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\nonumber\\ \le &\sum_{i=1}^d \left[ \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2 }{2\alpha_4 \sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}+ \frac{\alpha_4 \eta \boldsymbol{m}_{t-1,i}^2}{2\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\|^2\right]\nonumber\\ \le &\sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2 }{2\alpha_4 \sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+\sum_{i=1}^d\frac{\alpha_4 \eta^3 \boldsymbol{m}_{t-1,i}^2}{2\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\frac{d(1-\beta_1)^2}{(1-\beta_2)(1-\frac{\beta_1^2}{\beta_2})}, \end{align}\tag{40}\] for all \(\alpha_4>0\), where the last inequality is due to Lemma 6. The motivation is that we can get a \(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}\) term, and the \(\frac{\eta^3 \boldsymbol{m}_{t-1,i}^2}{2\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\) term can be bounded by Lemma 8. Similarly, we have that \[\begin{align} \label{eq:6eq14} &\sum_{i=1}^d|\partial_i f(\boldsymbol{x}_t)|\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\||\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|\nonumber\\ \le &\sum_{i=1}^d \left[ \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2 }{2\alpha_4 \sqrt{\boldsymbol{v}_{t,i}+\zeta}}+ \frac{\alpha_4 \eta \boldsymbol{m}_{t,i}^2}{2\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\|^2\right]\nonumber\\ \le &\sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2 }{2\alpha_4 \sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+\sum_{i=1}^d\frac{\alpha_4 \eta^3 \boldsymbol{m}_{t,i}^2}{2\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\frac{d(1-\beta_1)^2}{(1-\beta_2)(1-\frac{\beta_1^2}{\beta_2})}.
\end{align}\tag{41}\] Combining (31), (32), (36), (37), (38), (39), (40) and (41), we then have that \[\begin{align} \label{eq:6eq15} &{\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t+1}-\boldsymbol{u}_{t} \rangle|\mathcal{F}_t]}\nonumber\\ \le & -\sum_{i=1}^d\frac{\eta(1-\beta_1)}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2\boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ &+\sum_{i=1}^d \frac{\eta}{1-\frac{\beta_1}{\sqrt{\beta_2}}} \frac{1-\beta_1}{\sqrt{1-\frac{\beta_1^2}{\beta_2}}} \Bigg[\frac{ (\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_0\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}+\frac{\alpha_0 D_0}{2}\mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\nonumber\\ &+\frac{\alpha_0 D_1}{2\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\alpha_1 D_1}+{\alpha_1 D_1L_0^2\eta^2}{}\frac{d(1-\beta_1)^2}{(1-\beta_2)(1-\frac{\beta_1^2}{\beta_2})}+2\eta L_1(\partial_i f(\boldsymbol{x}_t))^2\frac{\sqrt{d}(1-\beta_1)}{\sqrt{1-\beta_2}\sqrt{1-\frac{\beta_1^2}{\beta_2}}}\right)\Bigg]\nonumber\\ &+\sum_{i=1}^d\frac{\eta \beta_1 (1-\beta_1)\sqrt{\zeta}}{(1-\frac{\beta_1}{\sqrt{\beta_2}})\sqrt{1-\frac{\beta_1^2}{\beta_2}}}\left(\frac{(\partial_i f(\boldsymbol{x}_t))^2}{2\alpha_3\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}+\frac{\alpha_3(1-\beta_2)}{2\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}\right)\nonumber\\ &+L_0\eta^2\sum_{i=1}^d\left[\frac{\frac{\beta_1}{2\sqrt{\beta_2}}\sqrt{d}+\frac{\beta_1^2}{\beta_2}\sqrt{d}}{\left(1-\frac{\beta_1}{\sqrt{\beta_2}}\right)^2}\frac{\boldsymbol{m}_{t-1,i}^2}{\boldsymbol{v}_{t-1,i}+\zeta}+\frac{\frac{\beta_1}{2\sqrt{\beta_2}}\sqrt{d}}{\left(1-\frac{\beta_1}{\sqrt{\beta_2}}\right)^2}\frac{\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}\right]\nonumber\\ &+L_1\frac{\frac{\beta_1^2}{{\beta_2}}}{\left(1-\frac{\beta_1}{\sqrt{\beta_2}}\right)^2}\left[\sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2 }{2\alpha_4 \sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+\sum_{i=1}^d\frac{\alpha_4 \eta^3 \boldsymbol{m}_{t-1,i}^2}{2\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\frac{d(1-\beta_1)^2}{(1-\beta_2)(1-\frac{\beta_1^2}{\beta_2})}\right]\nonumber\\ &+{L_1}\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{\left(1-\frac{\beta_1}{\sqrt{\beta_2}}\right)^2}\mathbb{E}\left[\sum_{i=1}^d \frac{\eta (\partial_i f(\boldsymbol{x}_t))^2 }{2\alpha_4 \sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}+\sum_{i=1}^d\frac{\alpha_4 \eta^3 \boldsymbol{m}_{t,i}^2}{2\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\frac{d(1-\beta_1)^2}{(1-\beta_2)(1-\frac{\beta_1^2}{\beta_2})}\right].
\end{align}\tag{42}\] Since \(C_1=1-\frac{\beta_1}{\sqrt{\beta_2}},C_2=\sqrt{1-\frac{\beta_1^2}{\beta_2}}\), (42) can be further bounded as follows \[\begin{align} &{\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t+1}-\boldsymbol{u}_{t} \rangle|\mathcal{F}_t]}\nonumber\\ \le& \Bigg[-\frac{\eta (1-\beta_1)}{C_1}+\frac{\eta(1-\beta_1)}{C_1C_2}\left(\frac{1}{2\alpha_0}+\frac{\alpha_0}{2\alpha_1}+\frac{\eta \alpha_0\sqrt{d}D_1L_1(1-\beta_1)}{\sqrt{1-\beta_2}C_2}\right)+\frac{\eta \beta_1(1-\beta_1)\sqrt{\zeta}}{2\alpha_3C_1C_2}+\frac{\eta L_1(1-C_1)^2}{2\alpha_4C_1^2}+\frac{\eta L_1(1-C_1)}{2\alpha_4C_1^2}\Bigg]\nonumber\\ &\times \sum_{i=1}^d\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2\boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ &+\sum_{i=1}^d\frac{\eta (1-\beta_1)}{C_1C_2}\Bigg[\frac{\alpha_0D_0}{2}\Bigg(\frac{1}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}-\mathbb{E}\Bigg[{\frac{1}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}}|\mathcal{F}_t\Bigg]\Bigg)+\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\Bigg]\nonumber\\ &+\frac{\alpha_0\alpha_1D_1^2L_0^2\eta^3d^2(1-\beta_1)^3}{2(1-\beta_2)C_1C_2^3\sqrt{\zeta}}+\frac{\alpha_3\eta d\beta_1(1-\beta_1)(1-\beta_2)}{2C_1C_2}\nonumber\\ &+\sum_{i=1}^d\frac{\eta^2((1-C_1)^2+0.5(1-C_1))\sqrt{d}L_0}{C_1^2} \frac{\boldsymbol{m}_{t-1,i}^2}{\boldsymbol{v}_{t-1,i}+\zeta}+\sum_{i=1}^d\frac{0.5\eta^2(1-C_1)\sqrt{d}L_0}{C_1^2}\mathbb{E}\Bigg[ \frac{\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}|\mathcal{F}_t\Bigg]\nonumber\\ &+\sum_{i=1}^d\frac{\alpha_4\eta^3(1-\beta_1)^2(1-C_1)^2dL_1}{2(1-\beta_2)C_1^2C_2^2}\frac{\boldsymbol{m}_{t-1,i}^2}{\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}+\sum_{i=1}^d\frac{\alpha_4\eta^3(1-\beta_1)^2(1-C_1)dL_1}{2(1-\beta_2)C_1^2C_2^2}\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}|\mathcal{F}_t\right]. \end{align}\] This completes the proof. ◻

12 Proof of Lemma 4↩︎

For \((L_0,L_1)\)-smooth objective functions, we have the following descent inequality (Lemma 1 in [11]): \[\begin{align} \label{eq:7eq1} \underbrace{\mathbb{E}[\langle \nabla f(\boldsymbol{u}_t), \boldsymbol{u}_{t}-\boldsymbol{u}_{t+1} \rangle|\mathcal{F}_t]}_{\text{First Order}}&\le f(\boldsymbol{u}_t)-\mathbb{E}[f(\boldsymbol{u}_{t+1})|\mathcal{F}_t] +\underbrace{\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]}_{\text{Second Order}}\nonumber\\ &+\underbrace{\sum_{i=1}^d\frac{L_1|\partial_i f(\boldsymbol{u}_t)|}{2}\mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]}_{\text{Additional Term}}. \end{align}\tag{43}\]

The first-order term is bounded by Lemma 9, we then only need to bound the remaining two terms. For the second-order term, based on the definition of \(\boldsymbol{u}_t\) and update process of \(\boldsymbol{x}_t\), we have that \[\begin{align} \label{eq:7eq2} &\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]\nonumber\\ \le&\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}\left[ \frac{1}{2\sqrt{d}}\left\|\frac{\boldsymbol{x}_{t+1}-\boldsymbol{x}_t}{1-\frac{\beta_1}{\sqrt{\beta_2}}}-\frac{\beta_1}{\sqrt{\beta_2}}\frac{\boldsymbol{x}_{t}-\boldsymbol{x}_{t-1}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\right\|^2+\frac{\sqrt{d}}{2}\left|\frac{\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}-\frac{\beta_1}{\sqrt{\beta_2}}\frac{\boldsymbol{x}_{t,i}-\boldsymbol{x}_{t-1,i}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\right|^2|\mathcal{F}_t\right]\nonumber\\ \le &\sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\mathbb{E}\left[ \frac{2\sqrt{d}}{(1-\frac{\beta_1}{\sqrt{\beta_2}})^2}|\boldsymbol{x}_{t+1,i}-\boldsymbol{x}_{t,i}|^2+\frac{2\sqrt{d}\frac{\beta_1^2}{{\beta_2}}}{(1-\frac{\beta_1}{\sqrt{\beta_2}})^2}|\boldsymbol{x}_{t,i}-\boldsymbol{x}_{t-1,i}|^2|\mathcal{F}_t\right]\nonumber\\ =& \sum_{i=1}^d\frac{L_0}{2\sqrt{d}}\left(\frac{2\sqrt{d}}{C_1^2}\mathbb{E}\left[\frac{\eta^2\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}|\mathcal{F}_t\right]+\frac{2\sqrt{d}(1-C_1)^2}{C_1^2}\frac{\eta^2 \boldsymbol{m}_{t-1,i}^2}{\boldsymbol{v}_{t-1,i}+\zeta}\right). \end{align}\tag{44}\]

Now we focus on the additional term. According to the definition of \(\boldsymbol{u}_t\) and the update rule of \(\boldsymbol{x}_t\), for \(\alpha_4>0\) we have that \[\begin{align} \label{eq:7eq3} &\sum_{i=1}^d\frac{L_1\|\partial_i f(\boldsymbol{u}_t)\|}{2}\mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]\nonumber\\ \le &\sum_{i=1}^d\frac{L_1}{2}(|\partial_i f(\boldsymbol{x}_t)|+(L_0+L_1|\partial_i f(\boldsymbol{x}_t)|)\|\boldsymbol{u}_t-\boldsymbol{x}_t\|)\mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]\nonumber\\ \le &\sum_{i=1}^d\frac{L_1}{2}\left(|\partial_i f(\boldsymbol{x}_t)|+L_0\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\|+L_1|\partial_i f(\boldsymbol{x}_t)|\frac{\frac{\beta_1}{\sqrt{\beta_2}}}{1-\frac{\beta_1}{\sqrt{\beta_2}}}\|\boldsymbol{x}_t-\boldsymbol{x}_{t-1}\|\right)\nonumber\\ &\times \mathbb{E}[\|\boldsymbol{u}_{t+1}-\boldsymbol{u}_t\||\boldsymbol{u}_{t+1,i}-\boldsymbol{u}_{t,i}||\mathcal{F}_t]\nonumber\\ \le &\sum_{i=1}^d\frac{L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\nonumber\\ &\times \left(\sqrt{d}\frac{2-C_1}{C_1}\frac{\eta (1-\beta_1)}{\sqrt{1-\beta_2}C_2}|\partial_i f(\boldsymbol{x}_t)|\left(\frac{1}{C_1}\mathbb{E}\left[\frac{\eta |\boldsymbol{m}_{t,i}|}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}|\mathcal{F}_t\right]+\frac{1-C_1}{C_1}\frac{\eta |\boldsymbol{m}_{t-1,i}|}{\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\right)\right)\nonumber\\ &+\sum_{i=1}^d\frac{L_1L_0}{2}\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\times \left(\frac{2\sqrt{d}}{C_1^2}\mathbb{E}\left[\frac{\eta^2\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}|\mathcal{F}_t\right]+\frac{2\sqrt{d}(1-C_1)^2}{C_1^2}\frac{\eta^2 \boldsymbol{m}_{t-1,i}^2}{\boldsymbol{v}_{t-1,i}+\zeta}\right)\nonumber\\ \le&\sum_{i=1}^d\frac{L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\Bigg[\frac{2\sqrt{d}}{C_1^2}\left(\frac{\eta}{2\alpha_4}\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}+\frac{\alpha_4\eta^3(1-\beta_1)^2}{2(1-\beta_2)C_2^2}\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}|\mathcal{F}_t\right]\right)\nonumber\\ &+\frac{2\sqrt{d}(2-C_1)^2}{C_1^2}\left(\frac{\eta }{2\alpha_4}\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}+\frac{\alpha_4\eta^3(1-\beta_1)^2}{2(1-\beta_2)C_2^2}\frac{\boldsymbol{m}_{t-1,i}^2}{\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\right)\Bigg]\nonumber\\ &+\sum_{i=1}^d\frac{L_1L_0}{2}\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\left(\frac{2\sqrt{d}}{C_1^2}\mathbb{E}\left[\frac{\eta^2\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}|\mathcal{F}_t\right]+\frac{2\sqrt{d}(1-C_1)^2}{C_1^2}\frac{\eta^2 \boldsymbol{m}_{t-1,i}^2}{\boldsymbol{v}_{t-1,i}+\zeta}\right), \end{align}\tag{45}\] where the first inequality is due to the \((L_0,L_1)\)-smoothness, the third inequality is due to Lemma 6 and (44), and the last inequality is due to (40) and (41).
Combining Lemma 9 with (43), (44) and (45), we have that \[\begin{align} \label{eq:7eq4} &\Bigg[\frac{\eta (1-\beta_1)}{C_1}-\frac{\eta(1-\beta_1)}{C_1C_2}\left(\frac{1}{2\alpha_0}+\frac{\alpha_0}{2\alpha_1}+\frac{\eta \alpha_0\sqrt{d}D_1L_1(1-\beta_1)}{\sqrt{1-\beta_2}C_2}\right)-\frac{\eta \beta_1(1-\beta_1)\sqrt{\zeta}}{2\alpha_3C_1C_2}-\frac{\eta L_1(1-C_1)^2}{2\alpha_4C_1^2}-\frac{\eta L_1(1-C_1)}{2\alpha_4C_1^2}\nonumber\\ -&\frac{\sqrt{d}L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\left(\frac{\eta}{\alpha_4C_1^2}+\frac{\eta(2-C_1)^2}{\alpha_4C_1^2}\right) \Bigg]\times \sum_{i=1}^d\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2\boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le&f(\boldsymbol{u}_t)-\mathbb{E}[f(\boldsymbol{u}_{t+1})|\mathcal{F}_t] \nonumber\\ &+\sum_{i=1}^d\frac{\eta (1-\beta_1)}{C_1C_2}\Bigg[\frac{\alpha_0D_0}{2}\Bigg(\frac{1}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}-\mathbb{E}\Bigg[{\frac{1}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}}|\mathcal{F}_t\Bigg]\Bigg)+\frac{\alpha_0 D_1}{2}\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}|\mathcal{F}_t\right]\Bigg]\nonumber\\ &+\frac{\alpha_0\alpha_1D_1^2L_0^2\eta^3d^2(1-\beta_1)^3}{2(1-\beta_2)C_1C_2^3\sqrt{\zeta}}+\frac{\alpha_3\eta d\beta_1(1-\beta_1)(1-\beta_2)}{2C_1C_2}\nonumber\\ &+\sum_{i=1}^d\left[\frac{\eta^2(2(1-C_1)^2+0.5(1-C_1))\sqrt{d}L_0}{C_1^2}+ \frac{dL_1L_0}{2}\frac{1-C_1}{C_1}\eta \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\frac{2(1-C_1)^2}{C_1^2}\eta^2\right] \frac{\boldsymbol{m}_{t-1,i}^2}{\boldsymbol{v}_{t-1,i}+\zeta}\nonumber\\ &+\sum_{i=1}^d\left[\frac{\eta^2(0.5(1-C_1)+1)\sqrt{d}L_0}{C_1^2}+ \frac{dL_1L_0}{2}\frac{1-C_1}{C_1}\eta \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\frac{2}{C_1^2}\eta^2\right] \mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{\boldsymbol{v}_{t,i}+\zeta}|\mathcal{F}_t\right]\nonumber\\ & +\sum_{i=1}^d\left[\frac{\alpha_4\eta^3(1-\beta_1)^2(1-C_1)^2dL_1}{2(1-\beta_2)C_1^2C_2^2}+\frac{L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\frac{2\sqrt{d}(2-C_1)^2}{C_1^2}\frac{\alpha_4\eta^3(1-\beta_1)^2}{2(1-\beta_2)C_2^2}\right]\nonumber\\ &\times \frac{\boldsymbol{m}_{t-1,i}^2}{\sqrt{\boldsymbol{v}_{t-1,i}+\zeta}}\nonumber\\ & +\sum_{i=1}^d\left[\frac{\alpha_4\eta^3(1-\beta_1)^2(1-C_1)dL_1}{2(1-\beta_2)C_1^2C_2^2}+\frac{L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\frac{2\sqrt{d}}{C_1^2}\frac{\alpha_4\eta^3(1-\beta_1)^2}{2(1-\beta_2)C_2^2}\right]\mathbb{E}\left[\frac{\boldsymbol{m}_{t,i}^2}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}|\mathcal{F}_t\right]. \end{align}\tag{46}\] It is worth noting that (22) and (23) still hold for Adam since the update of \(\boldsymbol{v}_t\) does not change.
Specifically, for any \(i\in [d]\) we have that \[\begin{align} \label{eq:7eq5} &\sum_{t=1}^T \mathbb{E}\left[\frac{1}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{1}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\right] \le \frac{1}{\sqrt{ \zeta}}+T \frac{1-\sqrt{\beta_2}}{\sqrt{\zeta}}, \end{align}\tag{47}\] and, since \(\boldsymbol{x}_0=\boldsymbol{x}_1\), we have that \[\begin{align} \label{eq:7eq6} &\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{0}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{0,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{{ \boldsymbol{v}_{1,i}} + \zeta}}\right]+\sum_{t=2}^T \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t-1}))^2}{\sqrt{{\beta_2 \boldsymbol{v}_{t-1,i}} + \zeta}}-\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{{ \boldsymbol{v}_{t,i}} + \zeta}}\right]\nonumber\\ \le & \frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{ \zeta}}+\sum_{t=1}^{T-1}(1-{\beta_2}) \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{\beta_2{ \boldsymbol{v}_{t-1,i}} + \zeta}}\right]. \end{align}\tag{48}\]
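The mechanism behind (47) and (48) can be made explicit: writing \(A_t\) for the numerator (\(A_t\equiv 1\) in (47) and \(A_t=(\partial_i f(\boldsymbol{x}_t))^2\) in (48)), the sums regroup exactly as \[\begin{align} \sum_{t=1}^{T}\left[\frac{A_{t-1}}{\sqrt{\beta_2\boldsymbol{v}_{t-1,i}+\zeta}}-\frac{A_t}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\right] =\frac{A_0}{\sqrt{\beta_2\boldsymbol{v}_{0,i}+\zeta}}-\frac{A_T}{\sqrt{\boldsymbol{v}_{T,i}+\zeta}} +\sum_{t=1}^{T-1}A_t\left[\frac{1}{\sqrt{\beta_2\boldsymbol{v}_{t,i}+\zeta}}-\frac{1}{\sqrt{\boldsymbol{v}_{t,i}+\zeta}}\right], \end{align}\] where each bracket on the right-hand side is nonnegative and vanishes at rate \(\mathcal{O}(1-\beta_2)\) as \(\beta_2\to 1\); this is what produces the \(T(1-\sqrt{\beta_2})/\sqrt{\zeta}\) and \((1-\beta_2)\)-weighted drift terms above.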

By taking expectations and telescoping (46) for \(t=1\) to \(T\), based on (47), (48), Lemma 7 and Lemma 8, we have that \[\begin{align} \label{eq:7eq7} &\Bigg[\frac{\eta (1-\beta_1)}{C_1}-\frac{\eta(1-\beta_1)}{C_1C_2}\left(\frac{1}{2\alpha_0}+\frac{\alpha_0}{2\alpha_1}+\frac{\eta \alpha_0\sqrt{d}D_1L_1(1-\beta_1)}{\sqrt{1-\beta_2}C_2}\right)-\frac{\eta \beta_1(1-\beta_1)\sqrt{\zeta}}{2\alpha_3C_1C_2}-\frac{\eta L_1(1-C_1)^2}{2\alpha_4C_1^2}-\frac{\eta L_1(1-C_1)}{2\alpha_4C_1^2}\nonumber\\ -&\frac{\sqrt{d}L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\left(\frac{\eta}{\alpha_4C_1^2}+\frac{\eta(2-C_1)^2}{\alpha_4C_1^2}\right) \Bigg]\times \sum_{t=1}^T\sum_{i=1}^d\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2\boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le&f(\boldsymbol{u}_1)-\mathbb{E}[f(\boldsymbol{u}_{T+1})] \nonumber\\ &+\sum_{i=1}^d\frac{\eta (1-\beta_1)}{C_1C_2}\Bigg[\frac{\alpha_0D_0}{2}\Bigg(\frac{1}{\sqrt{ \zeta}}+T \frac{1-\sqrt{\beta_2}}{\sqrt{\zeta}}\Bigg)+\frac{\alpha_0 D_1}{2}\left(\frac{(\partial_i f(\boldsymbol{x}_{1}))^2}{\sqrt{ \zeta}}+\sum_{t=1}^{T-1}(1-{\beta_2}) \mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_{t}))^2}{\sqrt{\beta_2{ \boldsymbol{v}_{t-1,i}} + \zeta}}\right]\right)\Bigg]\nonumber\\ &+T\frac{\alpha_0\alpha_1D_1^2L_0^2\eta^3d^2(1-\beta_1)^3}{2(1-\beta_2)C_1C_2^3\sqrt{\zeta}}+T\frac{\alpha_3\eta d\beta_1(1-\beta_1)(1-\beta_2)}{2C_1C_2}\nonumber\\ &+\Bigg[\frac{2\eta^2(1-C_1)^2\sqrt{d}L_0}{C_1^2} + \frac{\eta^2(2-C_1)\sqrt{d}L_0}{C_1^2} +\frac{dL_1L_0}{2}\frac{1-C_1}{C_1}\eta \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\frac{2(1-C_1)^2+2}{C_1^2}\eta^2\Bigg]\nonumber\\ &\times\sum_{i=1}^d\mathbb{E}\Bigg[ \frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt{\beta_2}})^2(1-\beta_2)}\left(\ln\left(\frac{\boldsymbol{v}_{T,i}}{\boldsymbol{v}_{0,i}}\right)-T\ln(\beta_2)\right)\Bigg]\nonumber\\ &+\left[\frac{\alpha_4\eta^3(1-\beta_1)^2dL_1((1-C_1)+(1-C_1)^2)}{2(1-\beta_2)C_1^2C_2^2}+\frac{\sqrt{d}L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\frac{2+2(2-C_1)^2}{C_1^2}\frac{\alpha_4\eta^3(1-\beta_1)^2}{2(1-\beta_2)C_2^2}\right]\nonumber\\ &\times \sum_{i=1}^d\mathbb{E}\left[\frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\left(\frac{2}{1-\beta_2}(\sqrt{\boldsymbol{v}_{T,i}}-\sqrt{\boldsymbol{v}_{0,i}})+\sum_{t=1}^T2\sqrt{\boldsymbol{v}_{t-1,i}}\right)\right]. \end{align}\tag{49}\] Moreover, for any \(a>0\) we have \(\ln(a)\le a-1\). Then, for any \(i\in[d]\), \[\begin{align} \label{eq:7eq8} \frac{1}{T}\mathbb{E}[\ln(\boldsymbol{v}_{T,i})]=\frac{2}{T}\mathbb{E}[\ln(\sqrt{\boldsymbol{v}_{T,i}})]\le \frac{2}{T}\mathbb{E}[\sqrt{\boldsymbol{v}_{T,i}}]&\le \frac{2\sqrt{D_0+\boldsymbol{v}_{0,i}}}{T}+\frac{2}{T}\sum_{k=0}^{T-1} \mathbb{E}\left[ \sqrt{\beta_2^{k}(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{T-k})|\right]\nonumber\\ &\le \frac{2\sqrt{D_0+\boldsymbol{v}_{0,i}}}{T}+\frac{2}{T}\sum_{k=0}^{T-1} \mathbb{E}\left[ \sqrt{(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_{T-k})|\right], \end{align}\tag{50}\] where the second inequality is due to (25). In addition, for \(\beta_2\ge 0.5\) we have that \[\begin{align} \label{eq:7eq9} -\ln(\beta_2)=\ln\left(\frac{1}{\beta_2}\right)\le \frac{1}{\beta_2}-1\le 2(1-\beta_2).
\end{align}\tag{51}\] Combining (26), (49), (50) and (51), we can show that \(\sum_{i=1}^d\sum_{t=1}^T\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\right]\) is upper bounded by a function of \(\sum_{t=1}^T\mathbb{E}\left[{\|\nabla f(\boldsymbol{x}_t)\|}\right]\). Specifically, we obtain \[\begin{align} \label{eq:7eq10} &\Bigg[\frac{\eta (1-\beta_1)}{C_1}-\frac{\eta(1-\beta_1)}{C_1C_2}\left(\frac{1}{2\alpha_0}+\frac{\alpha_0}{2\alpha_1}+\frac{\eta \alpha_0\sqrt{d}D_1L_1(1-\beta_1)}{\sqrt{1-\beta_2}C_2}\right)-\frac{\eta \beta_1(1-\beta_1)\sqrt{\zeta}}{2\alpha_3C_1C_2}-\frac{\eta L_1(1-C_1)^2}{2\alpha_4C_1^2}-\frac{\eta L_1(1-C_1)}{2\alpha_4C_1^2}\nonumber\\ -&\frac{\sqrt{d}L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\left(\frac{\eta}{\alpha_4C_1^2}+\frac{\eta(2-C_1)^2}{\alpha_4C_1^2}\right)-\frac{\eta (1-\beta_1)}{C_1C_2}\frac{\alpha_0 D_1(1-\beta_2)}{2} \Bigg]\times \sum_{t=1}^T\sum_{i=1}^d\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{{\beta_2\boldsymbol{v}_{t-1,i}} + \zeta}}\nonumber\\ \le&f(\boldsymbol{u}_1)-\mathbb{E}[f(\boldsymbol{u}_{T+1})]+ \frac{\eta \alpha_0dD_0(1-\beta_1)}{2C_1C_2\sqrt{\zeta}}+ \frac{\eta \alpha_0D_1(1-\beta_1)\|\nabla f(\boldsymbol{x}_1)\|^2}{2C_1C_2\sqrt{\zeta}}\nonumber\\ &+T\frac{\eta d\alpha_0D_0 (1-\beta_1)(1-\beta_2)}{2C_1C_2\sqrt{\zeta}} +T\frac{\alpha_0\alpha_1D_1^2L_0^2\eta^3d^2(1-\beta_1)^3}{2(1-\beta_2)C_1C_2^3\sqrt{\zeta}}+T\frac{\alpha_3\eta d\beta_1(1-\beta_1)(1-\beta_2)}{2C_1C_2}\nonumber\\ &+\Bigg[\frac{2\eta^2(1-C_1)^2\sqrt{d}L_0}{C_1^2} + \frac{\eta^2(2-C_1)\sqrt{d}L_0}{C_1^2} +\frac{dL_1L_0}{2}\frac{1-C_1}{C_1}\eta \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\frac{2(1-C_1)^2+2}{C_1^2}\eta^2\Bigg]\nonumber\\ &\times\sum_{i=1}^d\mathbb{E}\Bigg[ \frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt{\beta_2}})^2(1-\beta_2)}\left(2\sqrt{D_0+\boldsymbol{v}_{0,i}}+\sum_{t=1}^T\mathbb{E}[2\sqrt{(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_t)|]+2T(1-\beta_2)-\ln(\boldsymbol{v}_{0,i})\right)\Bigg]\nonumber\\ &+\left[\frac{\alpha_4\eta^3(1-\beta_1)^2dL_1((1-C_1)+(1-C_1)^2)}{2(1-\beta_2)C_1^2C_2^2}+\frac{\sqrt{d}L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\frac{2+2(2-C_1)^2}{C_1^2}\frac{\alpha_4\eta^3(1-\beta_1)^2}{2(1-\beta_2)C_2^2}\right]\nonumber\\ &\times\sum_{i=1}^d \mathbb{E}\Bigg[\frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\Bigg(\frac{2}{1-\beta_2}\left(\sqrt{D_0+\boldsymbol{v}_{0,i}}+\sum_{t=1}^T\mathbb{E}[\sqrt{(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_t)|]\right)\nonumber\\ &+2T\sqrt{D_0+\boldsymbol{v}_{0,i}}+\frac{4\sqrt{D_1}}{\sqrt{1-\beta_2}}\sum_{t=1}^T\mathbb{E}[|\partial_i f(\boldsymbol{x}_t)|]\Bigg)\Bigg].
\end{align}\tag{52}\] If we have \(\alpha_0\ge \frac{21}{2C_2},\alpha_1\ge \frac{21\alpha_0}{2C_2},\alpha_3\ge \frac{7\beta_1\sqrt{\zeta}}{2C_2}, \alpha_4\ge \frac{14L_1\sqrt{d}(2-C_1)^2}{C_1(1-\beta_1)}\), then for \(\eta\le \min\left(\frac{C_2^2\sqrt{1-\beta_2}}{21\alpha_0\sqrt{d}D_1L_1(1-\beta_1)}, \frac{C_1C_2\sqrt{1-\beta_2}}{\sqrt{d}L_1(1-C_1)(1-\beta_1)}\right)\) and \(1-\beta_2\le \frac{2C_2}{7\alpha_0D_1}\), we have that \[\begin{align} &\Bigg[\frac{\eta (1-\beta_1)}{C_1}-\frac{\eta(1-\beta_1)}{C_1C_2}\left(\frac{1}{2\alpha_0}+\frac{\alpha_0}{2\alpha_1}+\frac{\eta \alpha_0\sqrt{d}D_1L_1(1-\beta_1)}{\sqrt{1-\beta_2}C_2}\right)-\frac{\eta \beta_1(1-\beta_1)\sqrt{\zeta}}{2\alpha_3C_1C_2}-\frac{\eta L_1(1-C_1)^2}{2\alpha_4C_1^2}-\frac{\eta L_1(1-C_1)}{2\alpha_4C_1^2}\nonumber\\ -&\frac{\sqrt{d}L_1}{2}\left(1+L_1\frac{1-C_1}{C_1}\eta\sqrt{d} \frac{1-\beta_1}{\sqrt{1-\beta_2}C_2}\right)\left(\frac{\eta}{\alpha_4C_1^2}+\frac{\eta(2-C_1)^2}{\alpha_4C_1^2}\right)-\frac{\eta (1-\beta_1)}{C_1C_2}\frac{\alpha_0 D_1(1-\beta_2)}{2} \Bigg]\ge \frac{\eta (1-\beta_1)}{7C_1}. \end{align}\] Defining \(\Delta'=f(\boldsymbol{u}_1)-f^*+ \frac{\eta \alpha_0dD_0(1-\beta_1)}{2C_1C_2\sqrt{\zeta}}+ \frac{\eta \alpha_0D_1(1-\beta_1)\|\nabla f(\boldsymbol{x}_1)\|^2}{2C_1C_2\sqrt{\zeta}}\), \(C_3=\Bigg[\frac{2(1-C_1)^2\sqrt{d}L_0}{C_1^2} + \frac{(2-C_1)\sqrt{d}L_0}{C_1^2} +{\sqrt{d}L_0}{}\frac{(1-C_1)^2+1}{C_1^2}\Bigg]\) and \(C_4=\left[\frac{\alpha_4(1-\beta_1)^2dL_1((1-C_1)+(1-C_1)^2)}{2C_1^2C_2^2}+{\sqrt{d}L_1}\frac{2+2(2-C_1)^2}{C_1^2}\frac{\alpha_4(1-\beta_1)^2}{2C_2^2}\right]\), we then have that \[\begin{align} &\frac{\eta(1-\beta_1)}{7C_1}\sum_{i=1}^d\sum_{t=1}^T\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\right]\nonumber\\ \le& \Delta'+T\frac{\eta d\alpha_0D_0 (1-\beta_1)(1-\beta_2)}{2C_1C_2\sqrt{\zeta}} +T\frac{\alpha_0\alpha_1D_1^2L_0^2\eta^3d^2(1-\beta_1)^3}{2(1-\beta_2)C_1C_2^3\sqrt{\zeta}}+T\frac{\alpha_3\eta d\beta_1(1-\beta_1)(1-\beta_2)}{2C_1C_2}\nonumber\\ &+\frac{\eta^2C_3(1-\beta_1)^2}{C_1^2(1-\beta_2)}\times\sum_{i=1}^d\mathbb{E}\Bigg[ \left(2\sqrt{D_0+\boldsymbol{v}_{0,i}}+\sum_{t=1}^T\mathbb{E}[2\sqrt{(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_t)|]+2T(1-\beta_2)-\ln(\boldsymbol{v}_{0,i})\right)\Bigg]\nonumber\\ &+\frac{\eta^3C_4}{(1-\beta_2)}\frac{(1-\beta_1)^2}{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\nonumber\\ &\times\sum_{i=1}^d \mathbb{E}\Bigg[\Bigg(\frac{2}{1-\beta_2}\left(\sqrt{D_0+\boldsymbol{v}_{0,i}}+\sum_{t=1}^T\mathbb{E}[\sqrt{(1-\beta_2)D_1}|\partial_i f(\boldsymbol{x}_t)|]\right)+2T\sqrt{D_0+\boldsymbol{v}_{0,i}}+\frac{4\sqrt{D_1}}{\sqrt{1-\beta_2}}\sum_{t=1}^T\mathbb{E}[|\partial_i f(\boldsymbol{x}_t)|]\Bigg)\Bigg].
\end{align}\] It follows that \[\begin{align} &\frac{1}{T}\sum_{t=1}^T\mathbb{E}\left[\frac{\|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\|+\zeta}}\right]\nonumber\\ \le &\frac{1}{T}\sum_{i=1}^d\sum_{t=1}^T\mathbb{E}\left[\frac{(\partial_i f(\boldsymbol{x}_t))^2}{\sqrt{\beta_2 \boldsymbol{v}_{t-1,i}+\zeta}}\right]\nonumber\\ \le & \frac{1}{T}\left(\frac{7C_1\Delta'}{\eta(1-\beta_1)}+\frac{7\eta C_3(1-\beta_1)}{C_1(1-\beta_2)}(2d\sqrt{D_0+\|\boldsymbol{v}_{0}\|}-\sum_{i=1}^d\ln(\boldsymbol{v}_{0,i}))+\frac{14\eta^2C_1C_4(1-\beta_1)d}{(1-\beta_2)^2(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\sqrt{D_0+\|\boldsymbol{v}_{0}\|} \right)\nonumber\\ &+\frac{ 7\alpha_0dD_0(1-\beta_2)}{2C_2\sqrt{\zeta}}+\frac{7\alpha_0\alpha_1D_1^2L_0^2\eta^2d^2(1-\beta_1)^2}{2(1-\beta_2)C_2^3\sqrt{\zeta}}+\frac{7\alpha_3 \beta_1(1-\beta_2)d}{2C_2}+\frac{14\eta dC_3(1-\beta_1)}{C_1}\nonumber\\ &+\frac{14\eta^2C_1C_4(1-\beta_1)d\sqrt{D_0+\|\boldsymbol{v}_0\|}}{(1-\beta_2)(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\nonumber\\ &+\left(\frac{14\eta C_3(1-\beta_1)\sqrt{D_1}}{C_1\sqrt{1-\beta_2}} +\frac{42\eta^2C_1C_4(1-\beta_1)\sqrt{D_1}}{(1-\beta_2)^{1.5}(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\right)\left(\frac{\sum_{t=1}^T\sqrt{d}\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}\right). \end{align}\] For \(\eta\le C_5(1-\beta_2)\) (where \(C_5>0\) is specified in Appendix 13), \(C_6=\min\Bigg(\frac{C_2\sqrt{\zeta}}{21\alpha_0dD_0},\frac{C_2^3\sqrt{\zeta}}{21\alpha_0\alpha_1D_1^2L_0^2(1-\beta_1)^2d^2C_5^2},\frac{C_2}{21\alpha_3d\beta_1}, \frac{C_1}{84C_3C_5d(1-\beta_1)},\frac{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}{84C_1C_4C_5^2(1-\beta_1)d\sqrt{D_0+\|\boldsymbol{v}_0\|}},\frac{C_1^2}{784C_3^2C_5^2(1-\beta_ 1)^2dD_1},\\ \frac{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^4}{7056C_1^2C_4^2C_5^4dD_1}\Bigg)\), \(1-\beta_2\le C_6 \epsilon^2\), and \(T\ge\max\Bigg(\frac{126C_1\Delta'}{\eta(1-\beta_1)\epsilon^2}, \frac{126 C_3C_5(1-\beta_1)}{C_1}(2d\sqrt{D_0+ \|\boldsymbol{v}_0\|}-\sum_{i=1}^d\ln(\boldsymbol{v}_{0,i}))\epsilon^{-2},\\ \frac{252C_5^2C_1C_4(1-\beta_1)d}{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\sqrt{D_0+\|\boldsymbol{v}_0\|}\epsilon^{-2} \Bigg)\), we then have that \[\begin{align} \label{eq:7bound} \frac{1}{T}\sum_{t=1}^T\mathbb{E}\left[\frac{\|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\|+\zeta}}\right]\le \epsilon^2+\left(\frac{14\eta C_3(1-\beta_1)\sqrt{dD_1}}{C_1\sqrt{1-\beta_2}} +\frac{42\eta^2C_1C_4(1-\beta_1)\sqrt{dD_1}}{(1-\beta_2)^{1.5}(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\right)\left(\frac{\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}\right). \end{align}\tag{53}\] Since \(\eta\le C_5(1-\beta_2)\) and \(1-\beta_2\le C_6\epsilon^2\), we have that \[\begin{align} \frac{1}{T}\sum_{t=1}^T\mathbb{E}\left[\frac{\|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\|+\zeta}}\right]\le \epsilon^2+\epsilon\left(\frac{\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}\right). \end{align}\]

This completes the proof.

13 Formal Version of Theorem 3 and Its Proof↩︎

For \(\alpha_0\ge \frac{21}{2C_2},\alpha_1\ge \frac{21\alpha_0}{2C_2},\alpha_3\ge \frac{7\beta_1\sqrt{\zeta}}{2C_2}, \alpha_4\ge \frac{14L_1\sqrt{d}(2-C_1)^2}{C_1(1-\beta_1)}\), define \(\Delta'=f(\boldsymbol{u}_1)-f^*+ \frac{\eta \alpha_0dD_0(1-\beta_1)}{2C_1C_2\sqrt{\zeta}}+ \frac{\eta \alpha_0D_1(1-\beta_1)\|\nabla f(\boldsymbol{x}_1)\|^2}{2C_1C_2\sqrt{\zeta}}\), \(C_3=\Bigg[\frac{2(1-C_1)^2\sqrt{d}L_0}{C_1^2} + \frac{(2-C_1)\sqrt{d}L_0}{C_1^2} +{\sqrt{d}L_0}{}\frac{(1-C_1)^2+1}{C_1^2}\Bigg]\), \(C_4=\left[\frac{\alpha_4(1-\beta_1)^2dL_1((1-C_1)+(1-C_1)^2)}{2C_1^2C_2^2}+{\sqrt{d}L_1}\frac{2+2(2-C_1)^2}{C_1^2}\frac{\alpha_4(1-\beta_1)^2}{2C_2^2}\right]\), \(C_5=\min\left(\frac{C_1}{112C_3(1-\beta_1)dD_1},\frac{1-\frac{\beta_1}{\sqrt[4]{\beta_2}}}{168D_1C_1C_4(1-\beta_1)d}\right)\), \(C_6=\min\Bigg(\frac{C_2\sqrt{\zeta}}{21\alpha_0dD_0},\frac{C_2^3\sqrt{\zeta}}{21\alpha_0\alpha_1D_1^2L_0^2(1-\beta_1)^2d^2C_5^2},\frac{C_2}{21\alpha_3d\beta_1}, \frac{C_1}{84C_3C_5d(1-\beta_1)},\frac{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}{84C_1C_4C_5^2(1-\beta_1)d\sqrt{D_0+\|\boldsymbol{v}_0\|}},\frac{C_1^2}{784C_3^2C_5^2(1-\beta_ 1)^2dD_1},\\ \frac{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^4}{7056C_1^2C_4^2C_5^4dD_1}\Bigg)\), \(\Lambda_4=\min\left(\frac{C_2^2}{21\alpha_0\sqrt{d}D_1L_1(1-\beta_1)}, \frac{C_1C_2}{ \sqrt{d}L_1(1-C_1)(1-\beta_1)}\right)\), \(\Lambda_5= \left(\frac{126 C_3C_5(1-\beta_1)}{C_1}(2d\sqrt{D_0+\|\boldsymbol{v}_0\|}-\sum_{i=1}^d\ln(\boldsymbol{v}_{0,i}))\right)\) and \(\Lambda_6= \left( \frac{252C_5^2C_1C_4(1-\beta_1)d}{(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\sqrt{D_0+\|\boldsymbol{v}_0\|} \right)\).

Theorem 5. Let Assumptions 1, 2 and 3 hold. Let \(1-\beta_2= \min\left(\frac{2C_2}{7\alpha_0D_1},C_6 \epsilon^2\right)\sim \mathcal{O}(\epsilon^2)\), \(0<\beta_1\le \sqrt{\beta_2}<1\), \(\eta= \min\left(\Lambda_4\sqrt{1-\beta_2},C_5(1-\beta_2)\right)\sim \mathcal{O}(\epsilon^2)\), and \(T\ge\max\left(\frac{126C_1\Delta'}{\eta(1-\beta_1)\epsilon^2},\Lambda_5\epsilon^{-2},\Lambda_6 \epsilon^{-2}\right)\sim \mathcal{O}(\epsilon^{-4})\). For small \(\epsilon\) such that \(\epsilon\le \frac{\sqrt{2C_2}}{\sqrt{7\alpha_0C_6D_1}}\), we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\le \left(2c+\sqrt{2c}+\frac{4\sqrt{dD_1}}{\sqrt{C_6}}\right)\epsilon. \end{align}\]
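Before turning to the proof, the hyperparameter scalings in Theorem 5 can be made concrete with a short sketch that maps a target accuracy \(\epsilon\) to \((\beta_2,\eta,T)\) exactly as the theorem prescribes; all problem-dependent constants (\(\beta_1\), \(\Delta'\), \(C_1\), \(C_2\), \(C_5\), \(C_6\), \(\Lambda_4\), \(\Lambda_5\), \(\Lambda_6\), \(\alpha_0\), \(D_1\)) are placeholder inputs here, so the printed numbers are illustrative only.

```python
import math

def adam_hyperparameters(eps, beta1, Delta_p, C1, C2, C5, C6,
                         Lambda4, Lambda5, Lambda6, alpha0, D1):
    """(beta2, eta, T) as prescribed by Theorem 5; all constants are placeholders."""
    one_minus_beta2 = min(2 * C2 / (7 * alpha0 * D1), C6 * eps ** 2)       # O(eps^2)
    beta2 = 1.0 - one_minus_beta2
    assert 0.0 < beta1 <= math.sqrt(beta2) < 1.0                           # theorem's condition
    eta = min(Lambda4 * math.sqrt(one_minus_beta2), C5 * one_minus_beta2)  # O(eps^2)
    T = math.ceil(max(126 * C1 * Delta_p / (eta * (1 - beta1) * eps ** 2),
                      Lambda5 / eps ** 2, Lambda6 / eps ** 2))             # O(eps^-4)
    return beta2, eta, T

# Illustrative placeholder constants (not derived from a real problem instance).
print(adam_hyperparameters(eps=0.1, beta1=0.9, Delta_p=10.0, C1=0.1, C2=0.4,
                           C5=0.05, C6=0.2, Lambda4=0.3, Lambda5=50.0,
                           Lambda6=50.0, alpha0=30.0, D1=1.0))
```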

Proof. The proof follows Appendix 9. According to (53) in the proof of Corollary 4, we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{\|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}} \right]\le \epsilon^2+\left(\frac{14\eta C_3(1-\beta_1)\sqrt{dD_1}}{C_1\sqrt{1-\beta_2}} +\frac{42\eta^2C_1C_4(1-\beta_1)\sqrt{dD_1}}{(1-\beta_2)^{1.5}(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\right)\left(\frac{\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}\right). \end{align}\]

According to Lemma 3, we have that \[\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\le c +\frac{2\sqrt{dD_1}}{\sqrt{1-\beta_2}} \frac{\sum_{t=1}^T\mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]}{T}. \end{align}\]

Define \(e=\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\). By Hölder’s inequality, we have \[\begin{align} \left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]\right)^2\le \left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\frac{\|\nabla f(\boldsymbol{x}_t)\|^2}{\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}} \right]\right)\left(\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\sqrt{\beta_2 \|\boldsymbol{v}_{t-1}\| + \zeta}]\right), \end{align}\] which, based on Lemma 3 and Corollary 4, can be bounded as

\[\begin{align} e^2&\le \left(\epsilon^2+\left(\frac{14\eta C_3(1-\beta_1)\sqrt{dD_1}}{C_1\sqrt{1-\beta_2}} +\frac{42\eta^2C_1C_4(1-\beta_1)\sqrt{dD_1}}{(1-\beta_2)^{1.5}(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\right)e\right)\left(c+\frac{2\sqrt{dD_1}}{\sqrt{1-\beta_2}}e \right)\nonumber\\ &\le c\epsilon^2+ce\epsilon+\frac{2\sqrt{dD_1}}{\sqrt{C_6}\epsilon}e\epsilon^2 +\frac{e^2}{2}, \end{align}\] where the second inequality is due to the fact that \(\left(\frac{14\eta C_3(1-\beta_1)\sqrt{dD_1}}{C_1\sqrt{1-\beta_2}} +\frac{42\eta^2C_1C_4(1-\beta_1)\sqrt{dD_1}}{(1-\beta_2)^{1.5}(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\right)\le \epsilon\) and \(\left(\frac{14\eta C_3(1-\beta_1)\sqrt{dD_1}}{C_1\sqrt{1-\beta_2}} +\frac{42\eta^2C_1C_4(1-\beta_1)\sqrt{dD_1}}{(1-\beta_2)^{1.5}(1-\frac{\beta_1}{\sqrt[4]{\beta_2}})^2}\right)\frac{2\sqrt{dD_1}}{\sqrt{1-\beta_2}}\le \frac{1}{2}\) if \(\eta\le C_5(1-\beta_2)\), \(1-\beta_2=\min\left(\frac{2C_2}{7\alpha_0D_1},C_6 \epsilon^2\right)= C_6\epsilon^2\) and \(C_5=\min\left(\frac{C_1}{112C_3(1-\beta_1)dD_1},\frac{1-\frac{\beta_1}{\sqrt[4]{\beta_2}}}{168D_1C_1C_4(1-\beta_1)d}\right)\).
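For completeness, the final step solves this quadratic inequality in \(e\): rearranging gives \(e^2\le 2c\epsilon^2+Be\) with \(B=\left(2c+\frac{4\sqrt{dD_1}}{\sqrt{C_6}}\right)\epsilon\), and hence \[\begin{align} e\le \frac{B+\sqrt{B^2+8c\epsilon^2}}{2}\le \frac{2B+2\sqrt{2c}\,\epsilon}{2}=\left(2c+\sqrt{2c}+\frac{4\sqrt{dD_1}}{\sqrt{C_6}}\right)\epsilon, \end{align}\] where the second inequality uses \(\sqrt{u+v}\le \sqrt{u}+\sqrt{v}\).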

Thus \[\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla f(\boldsymbol{x}_t)\|]=e\le\left(2c+\sqrt{2c}+\frac{4\sqrt{dD_1}}{\sqrt{C_6}}\right)\epsilon,\] which completes the proof. ◻

References↩︎

[1]
Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199 (1-2): 165–214, 2023.
[2]
Hinton, G., Srivastava, N., and Swersky, K. Lecture 6d: A separate, adaptive learning rate for each connection. Slides of Lecture Neural Networks for Machine Learning, 1: 1–31, 2012.
[3]
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[4]
Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901, 2020.
[6]
Cutkosky, A. and Mehta, H. Momentum improves normalized SGD. In International Conference on Machine Learning, pp. 2260–2268. PMLR, 2020.
[7]
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[8]
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12 (7), 2011.
[9]
Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
[10]
Wang, B., Zhang, H., Ma, Z., and Chen, W. Convergence of Adagrad for non-convex objectives: Simple proofs and relaxed assumptions. In Conference on Learning Theory, pp. 161–190. PMLR, 2023.
[11]
Crawshaw, M., Liu, M., Orabona, F., Zhang, W., and Zhuang, Z. Robustness to unbounded smoothness of generalized signsgd. Advances in Neural Information Processing Systems, 35: 9955–9968, 2022.
[12]
Ward, R., Wu, X., and Bottou, L. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. The Journal of Machine Learning Research, 21 (1): 9047–9076, 2020.
[13]
Défossez, A., Bottou, L., Bach, F., and Usunier, N. A simple convergence proof of Adam and Adagrad. arXiv preprint arXiv:2003.02395, 2020.
[14]
Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. In Conference on Learning Theory, pp. 313–355. PMLR, 2022.
[15]
Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. Advances in Neural Information Processing Systems, 31, 2018.
[16]
De, S., Mukherjee, A., and Ullah, E. Convergence guarantees for RMSProp and Adam in non-convex optimization and an empirical comparison to Nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018.
[17]
Hardy, G. H., Littlewood, J. E., and Pólya, G. Inequalities. Cambridge university press, 1952.
[18]
Liu, Y., Gao, Y., and Yin, W. An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems, 33: 18261–18271, 2020.
[19]
Wang, B., Fu, J., Zhang, H., Zheng, N., and Chen, W. Closing the gap between the upper bound and the lower bound of Adam’s iteration complexity. arXiv preprint arXiv:2310.17998, 2023.
[20]
Nemirovskij, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley, 1983.
[21]
Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23 (4): 2341–2368, 2013.
[22]
Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8 (3-4): 231–357, 2015.
[23]
Foster, D. J., Sekhari, A., Shamir, O., Srebro, N., Sridharan, K., and Woodworth, B. The complexity of making the gradient small in stochastic convex optimization. In Conference on Learning Theory, pp. 1319–1345. PMLR, 2019.
[24]
Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In International Conference on Artificial Intelligence and Statistics, pp. 1195–1204. PMLR, 2019.
[25]
Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM review, 60 (2): 223–311, 2018.
[26]
Jin, J., Zhang, B., Wang, H., and Wang, L. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34: 2771–2782, 2021.
[27]
Chen, Z., Zhou, Y., Liang, Y., and Lu, Z. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. arXiv preprint arXiv:2303.02854, 2023.
[28]
Wang, B., Zhang, Y., Zhang, H., Meng, Q., Ma, Z.-M., Liu, T.-Y., and Chen, W. Provable adaptivity in Adam. arXiv preprint arXiv:2208.09900, 2022.
[29]
Shi, N., Li, D., Hong, M., and Sun, R. RMSProp converges with proper hyper-parameter. In International Conference on Learning Representations, 2020.
[30]
Faw, M., Rout, L., Caramanis, C., and Shakkottai, S. Beyond uniform smoothness: A stopped analysis of adaptive SGD. arXiv preprint arXiv:2302.06570, 2023.
[31]
Hong, Y. and Lin, J. High probability convergence of Adam under unbounded gradients and affine variance noise. arXiv preprint arXiv:2311.02000, 2023.
[32]
Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155 (1-2): 267–305, 2016.
[33]
Zhang, B., Jin, J., Fang, C., and Wang, L. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33: 15511–15521, 2020.
[34]
Reisizadeh, A., Li, H., Das, S., and Jadbabaie, A. Variance-reduced clipping for non-convex optimization. arXiv preprint arXiv:2303.00883, 2023.
[35]
Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11127–11135, 2019.
[36]
Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.
[37]
Zhang, Y., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam can converge without any modification on update rules. Advances in Neural Information Processing Systems, 35: 28386–28399, 2022.
[38]
Guo, Z., Xu, Y., Yin, W., Jin, R., and Yang, T. A novel convergence analysis for algorithms of the Adam family and beyond. arXiv preprint arXiv:2104.14840, 2021.
[39]
Li, H., Jadbabaie, A., and Rakhlin, A. Convergence of Adam under relaxed assumptions. arXiv preprint arXiv:2304.13972, 2023.
[40]
Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 (1-2): 1–38, 2014.
[41]
Khaled, A. and Richtárik, P. Better theory for sgd in the nonconvex world. arXiv preprint arXiv:2002.03329, 2020.
[42]
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. SignSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018.
[43]
Jensen, J. L. W. V. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30: 175–193, 1906. URL https://doi.org/10.1007/BF02418571.
[44]
Agarwal, R., Schuurmans, D., and Norouzi, M. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR, 2020.