In decentralized optimization, the choice of stepsize plays a critical role in algorithm performance. A common approach is to use a shared stepsize across all agents to ensure convergence. However, selecting an optimal stepsize often requires careful tuning, which can be time-consuming and may lead to slow convergence, especially when there is significant variation in the smoothness (\(L\)-smoothness) of local objective functions across agents. Individually tuning stepsizes per agent is also impractical, particularly in large-scale networks. To address these limitations, we propose AdGT, an adaptive gradient tracking method that enables each agent to adjust its stepsize based on the smoothness of its local objective. We prove that AdGT achieves linear convergence to the global optimal solution. Through numerical experiments, we compare AdGT with fixed-stepsize gradient tracking methods and demonstrate its superior performance. Additionally, we compare AdGT with adaptive gradient descent (AdGD) in a centralized setting and observe that fully adaptive stepsizes offer greater benefits in decentralized networks than in centralized ones.
Recent advances in artificial intelligence and communication technologies have accelerated the adoption of distributed optimization techniques to solve large-scale problems. These methods allow individual agents to perform local computations and share information with their neighbors, which is particularly advantageous when data is stored across distributed locations or partitioned across multiple servers to improve training efficiency. Key challenges in decentralized optimization include maintaining data privacy, ensuring algorithmic scalability, and achieving robustness to network and system-level disturbances. Distributed optimization has found widespread applications in areas such as machine learning [1], [2] and control systems [3]. A common objective in distributed optimization is to solve the consensus problem, which can be formulated as: \[\begin{align} \label{eqz1} {{\boldsymbol{x}^{*}}\in}\mathop{\mathrm{argmin}}\limits_{{\boldsymbol{x}}{\in\mathbb{R}^p}}{{f}}({\boldsymbol{x}}){\triangleq} \frac{1}{n} \sum\limits_{i=1}^{n} f_{i}({\boldsymbol{x}}), \end{align}\tag{1}\] where the objective function \({f}\) represents the summation of all individual cost functions \(\{f_i\}_{i=1}^n\), where \(f_{i} : \mathbb{R}^{p}\rightarrow \mathbb{R}\) is the private function of agent \(i\).
Building on the foundational work of [4], the authors of [5] introduced a distributed (sub)gradient method for solving the consensus problem in 1 . When each local function \(f_i\) is convex, the method achieves sublinear convergence. This relatively slow rate is a known limitation of subgradient methods and is largely due to the use of diminishing stepsizes. To improve convergence, [6] proposed an accelerated method that assumes bounded and Lipschitz continuous gradients. By incorporating Nesterov’s acceleration technique, they obtained a convergence rate of \(\mathcal{O}(\log(k)/k^2)\), which significantly outperforms standard subgradient methods. Under similar assumptions, the EXTRA [7] was developed for smooth convex functions with Lipschitz gradients. This method uses a fixed stepsize and introduces a correction term involving the difference between consecutive gradients. It achieves a convergence rate of \(\mathcal{O}(1/k)\), which improves to linear convergence when the objective functions are strongly convex. In [8], the authors proposed the Exact Diffusion method that was shown to achieve linear convergence under standard assumptions.
In addition to gradient-based methods, several distributed algorithms have been developed based on the alternating direction method of multipliers (ADMM). These approaches are known for their strong convergence guarantees and ability to handle more complex and structured optimization problems. Notable examples include [9]–[12], which address general consensus optimization settings. Furthermore, several ADMM-based methods have been proposed for solving composite convex objectives [13], [14], as well as for problems involving complicated constraints [15], [16].
Gradient tracking methods (GT) [17]–[21] form a widely studied class of algorithms in decentralized optimization. These methods introduce an auxiliary variable to estimate the global gradient by locally tracking the average gradient across the network. This mechanism enables their performance to closely match that of centralized algorithms, which often exhibit linear convergence rates. One of the key strengths of gradient tracking methods lies in their flexibility and broad applicability across a wide range of distributed optimization problems. These techniques have been effectively adapted to a wide range of challenging settings, including directed and time-varying communication graphs, asynchronous computation, composite optimization problems, and even nonconvex objectives; see, for instance, [19], [22]–[34]. More recently, [35] proposed a new approach that incorporates gradient tracking in an implicit way by introducing a novel momentum term, specifically designed for use over directed networks. To improve convergence rates and computational complexity, a variety of accelerated decentralized gradient-based methods have been proposed; see, for example, [36]–[45].
However, tuning a single fixed stepsize that works well for all agents is a difficult task, particularly in heterogeneous networks. This challenge becomes more severe when some agents hold more complex or ill-conditioned data, often associated with larger smoothness constants. In such cases, the stepsize must be chosen conservatively to ensure stability for all agents. However, this conservative choice typically results in a much smaller stepsize, as the global smoothness constant is dominated by the worst-case agents. As a result, the overall convergence rate is significantly slowed down. Moreover, several works have proposed using uncoordinated, agent-dependent stepsizes in gradient tracking methods over both undirected and directed graphs [21], [46], [47]. However, manually tuning individual stepsizes for each agent is often impractical, especially in large-scale decentralized systems. Therefore, these challenges motivate the following question:
Is it possible to develop a decentralized optimization method where each agent automatically adjusts its local stepsize at every iteration in an adaptive manner, achieving faster convergence in practice?
This question has been addressed in the centralized setting when function \(f\) is convex or strongly convex by the work of [48], which introduced an adaptive stepsize scheme for gradient descent based on the local smoothness of the objective function \(f\), which is often substantially smaller than the global smoothness constant \(L\). This approach updates the stepsize at each iteration based on the local smoothness of the objective, which often leads to faster and more reliable convergence compared to using a fixed, globally chosen stepsize. Several extensions have been proposed, including approaches that modify the stepsize rule or incorporate adaptive proximal gradient methods. See, for example, [49]–[51].
Furthermore, adaptive stepsize techniques have received increasing interest in federated learning. These include client adaptivity, adaptive optimizers at the server aimed at improving convergence rates, see, for example, [52]–[54]. In this context, a fully adaptive stepsize scheme was proposed in a recent study by [55], which—building on the ideas of [48]—enables each client to independently select its own stepsize based on the local smoothness of its objective function.
If the answer to the above question is yes, and we are indeed able to develop a fully adaptive decentralized optimization framework inspired by the ideas presented in [48], then a natural and significant follow-up question is:
Which setting benefits more from fully adaptive stepsizes: decentralized or centralized?
In this paper, we address both questions above by introducing an adaptive gradient tracking method (AdGT). While we were in the process of establishing the theoretical guarantees of our algorithm, the work of [56] appeared, proposing a decentralized optimization approach that also employs an adaptive stepsize based on local agent information, following the idea introduced in [48]. Although their method shares similarities with ours, it is developed under a slightly different setting, and their analysis only guarantees convergence to a neighborhood of the optimal solution. In contrast, our method is designed to achieve linear convergence to the optimal solution.
Contributions: This paper makes two main contributions.
We propose AdGT, a decentralized gradient–tracking method with fully adaptive per–agent stepsizes computed from local information, featuring the same per-iteration computation and communication as standard GT. For undirected graphs with doubly stochastic mixing and where each local objective \(f_i\) is \(L_i\)-smooth and \(\mu_i\)-strongly convex, we prove exact convergence to the consensus optimum with a linear (geometric) rate.
AdGT requires little or no stepsize tuning and consistently outperforms tuned GT across diverse convex benchmarks and network topologies. Gains are largest under heterogeneous smoothness (\(L_i\) vary across agents), where fixed-stepsize GT is bottlenecked by \(\max_i L_i\), while AdGT allows each agent to use a stepsize matched to its own local smoothness.
In summary, AdGT avoids the fixed-stepsize GT bottleneck, where the network stepsize must be set by the worst local smoothness, by enabling per-agent, locally computed stepsizes, while retaining provable convergence.
The rest of the paper is organized as follows. Section II presents the proposed fully adaptive decentralized gradient tracking (AdGT) algorithm, which enables agents to automatically adjust their local stepsizes based solely on local gradient information. Section III establishes the convergence guarantees of AdGT under strongly convex settings, characterizing the stability conditions introduced by stepsize adaptivity. Section IV provides comprehensive experimental validation on real and synthetic datasets, demonstrating the benefits of local adaptivity in decentralized systems across various problem types, network topologies, and heterogeneity levels. Finally, Section V concludes the paper and discusses potential extensions toward more general network models and stochastic settings
Notation: Throughout the paper, we use the notation \(\|\cdot\|\) to denote the Euclidean \(2\)-norm for vectors and the Frobenius norm for matrices. In particular, for any matrices \(A\) and \(B\), \[\|AB\| \leq {\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert A \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\,\|B\|,\] where \(\|\cdot\|\) denotes the Frobenius norm and \({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert \cdot \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\) the spectral norm3.
Consider the consensus optimization problem described by 1 over a communication network represented by an undirected connected graph \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\). The graph consists of a set of nodes (agents) \(\mathcal{V} = \{1, 2, \ldots, n\}\) and a set of edges \(\mathcal{E}\). Each node \(i \in \mathcal{V}\) has an individual cost function \(f_i : \mathbb{R}^p \rightarrow \mathbb{R}\), which is only accessible to that node.
For each node, \(i \in \mathcal{V}\), \(\mathcal{N}_i = \{j \in \mathcal{V} : (j, i) \in \mathcal{E}\} \cup \{i\}\) is the set of neighbors of agent \(i\). i.e., all nodes that can communicate directly with node \(i\). Additionally, \(W = [w_{ij}] \in \mathbb{R}^{n \times n}\) is a doubly-stochastic matrix satisfying: \[\begin{align} \label{eqz5} w_{ij}=\left\{ \begin{array}{ll} > 0, &\quad j \in {\mathcal{N}_{i},}\\ 0,& \quad \text{otherwise{;}} \end{array} \right. \quad \sum\limits_{{j\in \mathcal{V}}} w_{ij}=1, \quad \forall~i{\in\mathcal{V}}. \end{align}\tag{2}\] which ensures that the weights sum to one across both rows and columns.
Throughout the paper we make the following assumptions.
Assumption 1. \(\mathcal{G}\) is undirected and connected.
Remark 1. Let \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n\) denote the eigenvalues of \(W\), then \(\lambda_1 = 1\), and for \(i = 2, \ldots, n\), the eigenvalues satisfy \(|\lambda_i| < 1\). Additionally, let us define \(\tilde{W}= W-\frac{\boldsymbol{11}^\top}{n}\), then \(\tilde{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}} \leq \lambda < 1\).
Assumption 2. For every \(i\in\mathcal{V}\), the local function \(f_{i}\) is \(L_i\)-smooth, i.e., it is differentiable with a Lipschitz gradient: \[\label{eqz2} \Vert \nabla f_{i}({\boldsymbol{x}})-\nabla f_{i}({\boldsymbol{x}'})\Vert \leq L_i \Vert {\boldsymbol{x}} -{\boldsymbol{x}'} \Vert,\quad{\forall~{\boldsymbol{x}},\boldsymbol{x}' \in \mathbb{R}^{p}.}\qquad{(1)}\]
Remark 2. The Lipschitz constant of \(\nabla f\) is given by \(L = \max_i L_i\), where \(L_i\) is the Lipschitz constant of the local function \(f_i\).
Assumption 3. For all \(i\in\mathcal{V}\), \(f_{i}\) is \(\mu_i\)-strongly convex, i.e., \[\label{eqz3} f_{i}({\boldsymbol{y}}) \geq f_{i}({\boldsymbol{x}}) + \nabla f_{i}({\boldsymbol{x}})^\top ({\boldsymbol{y}}-{\boldsymbol{x}}) + \dfrac{\mu_i}{2} \parallel {\boldsymbol{y}} -{\boldsymbol{x}} \Vert^2\qquad{(2)}\] for some \(\mu_i>0\) and all \({\boldsymbol{x}},\boldsymbol{y}\in\mathbb{R}^{p}\).
1. Define \(\boldsymbol{\mathrm{x}} \triangleq \left[ \boldsymbol{x}_1,\ldots,\boldsymbol{x}_n\right]^\top \in \mathbb{R}^{{n\times p}}\) and \(\boldsymbol{\mathrm{y}} \triangleq \left[ \boldsymbol{y}_1,\ldots,\boldsymbol{y}_n\right]^\top \in \mathbb{R}^{{n\times p}}\), where \(\boldsymbol{x}_i\), \(\boldsymbol{y}_i \in \mathbb{R}^{p}\) are the local variables of agent \(i\in\mathcal{V}\triangleq\{1,\ldots,n\}\), and in an algorithmic framework, their values at iteration \(k\geq 0\) are \(\boldsymbol{x}_i^{k}\) and \(\boldsymbol{y}_i^{k}\) for \(i\in\mathcal{V}\). Let \(f:\mathbb{R}^{{n \times p}}\to\mathbb{R}\) be a function of local variables \(\{\boldsymbol{x}_i\}_{i\in\mathcal{V}}\) such that \(f(\boldsymbol{\mathrm{x}})\triangleq \sum_{i\in\mathcal{V}} f_i(\boldsymbol{x}_i)\) for \(\boldsymbol{\mathrm{x}}\in\mathbb{R}^{{n\times p} }\) and \(\nabla f(\boldsymbol{\mathrm{x}}) \triangleq \left[ \nabla f_1(\boldsymbol{x}_1),...,\nabla f_n(\boldsymbol{x}_n)\right]^\top \in \mathbb{R}^{{n \times p}}\), where \(\nabla f_{i}(\boldsymbol{x}_i) \in \mathbb{R}^{p}\) denotes the gradient of \(f_{i}\) at \(\boldsymbol{x}_i\in\mathbb{R}^p\). Define \(\boldsymbol{\bar{x}}\triangleq \tfrac{1}{n}\boldsymbol{1}^\top\boldsymbol{\mathrm{x}}\in\mathbb{R}^{1\times p}\), \(\boldsymbol{\bar{y}}\triangleq \tfrac{1}{n}\boldsymbol{1}^\top\boldsymbol{\mathrm{y}}\in\mathbb{R}^{1\times p}\), \(\boldsymbol{\bar{\mathrm{x}}}\triangleq \boldsymbol{1}\boldsymbol{\bar{x}}\in\mathbb{R}^{n\times p}\), \(\boldsymbol{\bar{\mathrm{y}}}\triangleq \boldsymbol{1}\boldsymbol{\bar{y}}\in\mathbb{R}^{n\times p}\), \(\nabla f(\boldsymbol{\bar{x}})\triangleq \left[\nabla f_1(\boldsymbol{\bar{x}}),\ldots,\nabla f_n(\boldsymbol{\bar{x}})\right]^\top\in\mathbb{R}^{n\times p}\), \(\nabla f(\boldsymbol{x}^\ast)\triangleq \left[\nabla f_1(\boldsymbol{x}^\ast),\ldots,\nabla f_n(\boldsymbol{x}^\ast)\right]^\top\in\mathbb{R}^{n\times p}\).
We next propose a decentralized optimization algorithm, AdGT, to solve the consensus optimization problem in 1 .
Consider the AdGT algorithm presented in Algorithm 1. At iteration \(k \geq 0\), each agent \(i \in \mathcal{V}\) updates four variables: \(\boldsymbol{x}_{i}^{k}\), \(\boldsymbol{y}_{i}^{k} \in \mathbb{R}^{p}\), \(\theta_{i}^{k} \geq 0\), and \(\alpha_{i}^{k} > 0\). The stepsize sequence \(\{\alpha_i^k\}\) is the key adaptive component and is detailed below.
Our stepsize rules, defined Algorithm 1, is inspired by the adaptive gradient descent method of [48]. In that setting, the centralized stepsize \(\alpha^{k}\) is updated according to the following rule: \[\begin{align} \label{Adap95Cent95Step} \alpha^{k+1} = \min \left\{ \frac{\|\boldsymbol{x}^{k+1} - \boldsymbol{x}^{k}\|}{2\|\nabla f(\boldsymbol{x}^{k+1}) - \nabla f(\boldsymbol{x}^{k})\|}, \sqrt{1 + \theta^{k}}\,\alpha^{k} \right\}, \end{align}\tag{3}\] where \(\theta^{k} = \alpha^{k} / \alpha^{k-1}\). This approach is an automated gradient descent method by adjusting the stepsize based on the local smoothness of the objective function. However, a direct application of 3 in a decentralized scheme can lead to algorithmic divergence. Recently, in [56], the authors introduced an adaptive decentralized method by combining the GT method with the following adaptive stepsize \[\begin{align} \label{Adap95Dist95DM} \alpha_{i}^{k+1} = \min \left\{ \frac{1}{2 L_{f_i}^k},\frac{1}{\|y_i^{k+1}\|}, \sqrt{2}\alpha_{i}^{k} \right\} \end{align}\tag{4}\] where \[\begin{align} \label{L95smooth95L95f95i} L_{f_i}^k=\frac{\|\nabla f_i(\boldsymbol{x}_{i}^{k+1}) - \nabla f_i(\boldsymbol{x}_{i}^{k})\|}{\|\boldsymbol{x}_{i}^{k+1} - \boldsymbol{x}_{i}^{k}\|} \end{align}\tag{5}\] The method is theoretically proven to converge only to a neighborhood of the optimal solution. However, our numerical results show that the algorithm proposed in [56], when used with the adaptive distributed stepsize 4 , often diverges—particularly in settings with a small number of samples or sparse network connectivity.
To ensure both theoretical guarantees and practical convergence of the decentralized optimization algorithm using the adaptive stepsize in 3 , the first stepsize rule is specified as follows: \[\begin{align} \label{alpha95nound95L95F} \alpha_{i}^{k+1} = \min \left\{ \frac{1}{2 \gamma L_{f_i}^k}, \sqrt{1 + \theta_{i}^{k}}\,\alpha_{i}^{k} \right\}, \end{align}\tag{6}\] where \(L_{f_i}^k\) is given in 5 and \(\gamma > 0\). Here, the first term of 6 defines a locally safe stepsize using agent \(i\)’s estimate \(L_{f_i}^k\) of the local gradient Lipschitz constant of \(f_i\); taking the inverse of \(L_{f_i}^k\) and scaling by \(2\gamma\) keeps updates consistent with local smoothness. The second term \(\sqrt{1+\theta_i^k}\,\alpha_i^k\) allows the stepsize to grow gradually when recent progress suggests the previous value was too small; the square root limits the growth and prevents abrupt increases. The parameter \(\gamma>0\) is a global scaling factor that harmonizes these per-agent updates so that, although each agent adapts independently, the overall decentralized iteration remains stable.
While the adaptive rule 6 performs well in many cases, its effectiveness can depend on the choice of the global parameter \(\gamma\), which may require tuning when the network topology or data distribution is highly heterogeneous. To improve numerical stability and reduce sensitivity to \(\gamma\), we replace the local smoothness estimate \(L_{f_i}^k\) with the empirical quantity \[\begin{align} \label{L95smooth95L95y95i} L_{y_i}^k=\frac{\|\boldsymbol{y}_{i}^{k+1} - \boldsymbol{y}_{i}^{k}\|}{\|\boldsymbol{x}_{i}^{k+1} - \boldsymbol{x}_{i}^{k}\|} \end{align}\tag{7}\] leading to the second adaptive stepsize \[\begin{align} \label{Stepsize95Adgt95y95i} \alpha_{i}^{k+1} = \min \left\{ \frac{1}{2 \gamma L_{y_i}^k}, \sqrt{1 + \theta_{i}^{k}}\,\alpha_{i}^{k} \right\}. \end{align}\tag{8}\] In practice, this variant is typically stable across different network conditions and heterogeneous data, and it reduces the need for tuning of \(\gamma\) compared with the original rule 6 . The algorithm, instead of using local gradients \(\nabla f_i(\boldsymbol{x}_i^k)\), approximates the global gradient via the local variable \(\boldsymbol{y}_i^k\). It is straightforward to show that average of \(\boldsymbol{y}_i^k\) preserves the average of local gradients, see [20]. \[\bar{y}^k = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\boldsymbol{x}_i^k)\]
To facilitate analysis, we introduce a third stepsize rule by augmenting the stepsize rule 8 with the additional term \(1/(2\gamma L_{f_i}^k)\), leading to \[\begin{align} \label{Stepsize95Adgt95y95i95f95i} \alpha_{i}^{k+1} = \min \left\{ \frac{1}{2 \gamma L_{f_i}^k}, \frac{1}{2 \gamma L_{y_i}^k}, \sqrt{1 + \theta_{i}^{k}}\,\alpha_{i}^{k} \right\}. \end{align}\tag{9}\] With this form, the uniform upper bounds on the stepsizes established for stepsize rule 6 (see Lemma 1) remain directly applicable. As shown later in Remark 6, adding the extra \(1/(2\gamma L_{f_i}^k)\) term does not affect the convergence proof. In practice, rules 8 and 9 yield essentially identical numerical performance.
Remark 4. The parameter \(\gamma\) in the AdGT method requires tuning, but its role is different from that in gradient tracking with a fixed stepsize. In theory, \(\gamma\) may depend on problem parameters such as \(L\), \(\mu\), and network connectivity. In practice, however, setting \(\gamma = 1\) is often sufficient. Unlike fixed-stepsize methods, AdGT is not highly sensitive to tuning. Since the stepsize is automatically adjusted at each iteration using only local information, AdGT achieves much faster convergence in numerical experiments compared to the fixed-stepsize version.
We present AdGT stated in Algorithm 1 in a compact form as follows: \[\begin{align} \boldsymbol{\mathrm{x}}^{k+1} &= W \Big( \boldsymbol{\mathrm{x}}^k - {D^k} \boldsymbol{\mathrm{y}}^k \Big) \tag{10} \\ \boldsymbol{\mathrm{y}}^{k+1} &= W \boldsymbol{\mathrm{y}}^k + \nabla f(\boldsymbol{\mathrm{x}}^{k+1})-\nabla f(\boldsymbol{\mathrm{x}}^k)\tag{11} \end{align}\] where \(D^k= diag \left( \alpha_1^k, \dots, \alpha_n^k \right)\) is a diagonal matrix with its diagonal equal to \(\left( \alpha_1^k, \dots, \alpha_n^k \right)\).
Remark 5. Our analysis and experiments focus on undirected graphs. Preliminary tests indicate that the adaptive rule remains effective when combined with gradient tracking over directed graphs, but a formal analysis is left for future work.
In this section, we show that the sequence of iterates generated by the AdGT algorithm, defined in equations 10 and 11 with the stepsize rule 6 , converges to the optimal solution \(\boldsymbol{\mathrm{x}^\ast} = \boldsymbol{1} \boldsymbol{x}^\ast\). Moreover, in Remark 6, we explain that the convergence proof also extends to the alternative stepsize rule 9 . The structure of our proof is inspired by the approach in [20] and [35].
Lemma 1. Suppose Assumptions 2 and 3 hold. Let the stepsizes \(\{\alpha_i^k\}_{k\geq 0}\) be defined according to 6 . Then, for every agent \(i\) and all \(k \geq 0\), the stepsizes are uniformly bounded as \[\label{bound95alpha} \frac{1}{2 \gamma L} \;\leq\; \alpha_i^{k+1} \;\leq\; \frac{1}{2 \gamma \mu}.\qquad{(3)}\]
Proof. We first establish the lower bound. Starting from an arbitrary initial stepsize \(\alpha_i^{0}\), the update rule yields \[\alpha_i^{1} \;=\; \frac{1}{2 \gamma L_{f_i}^1}
= \frac{\|\boldsymbol{x}_{i}^{1} - \boldsymbol{x}_{i}^{0}\|}{2 \gamma \|\nabla f_i(\boldsymbol{x}_{i}^{1}) - \nabla f_i(\boldsymbol{x}_{i}^{0})\|}
\;\geq\; \frac{1}{2 \gamma L_i},\] where the inequality follows from Assumption 2. By induction, assume that \(\alpha_i^j \geq
\tfrac{1}{2 \gamma L_i}\) for all \(j=1,\dots,k\). We now verify that the same inequality holds for \(\alpha_i^{k+1}\).
From the definition in 6 , there are two possibilities:
If \(\alpha_i^{k+1} = \tfrac{1}{2 \gamma L_{f_i}^k}\), then by Assumption 2, \[\alpha_i^{k+1} = \frac{\|\boldsymbol{x}_{i}^{k+1} - \boldsymbol{x}_{i}^{k}\|}{2 \gamma \|\nabla f_i(\boldsymbol{x}_{i}^{k+1}) - \nabla f_i(\boldsymbol{x}_{i}^{k})\|} \;\geq\; \frac{1}{2 \gamma L_i}.\]
If \(\alpha_i^{k+1} = \sqrt{1+\theta_i^k}\,\alpha_i^k\), then by the induction hypothesis, \[\alpha_i^{k+1} \;\geq\; \alpha_i^k \;\geq\; \frac{1}{2 \gamma L_i}.\]
Thus, in both cases, \(\alpha_i^{k+1} \geq \tfrac{1}{2 \gamma L_i}\). Since \(L_i \leq L\), it follows that \[\alpha_i^{k+1} \;\geq\; \frac{1}{2 \gamma L}, \quad \forall k \geq 0.\] For the upper bound, observe that by Assumption 3 and the definition of \(\alpha_i^k\), it always holds that \[\alpha_i^{k+1} \leq \frac{\|\boldsymbol{x}_{i}^{k+1} - \boldsymbol{x}_{i}^{k}\|}{2 \gamma \|\nabla f_i(\boldsymbol{x}_{i}^{k+1}) - \nabla f_i(\boldsymbol{x}_{i}^{k})\|} \;\leq\; \frac{1}{2 \gamma \mu_i}\leq\;\frac{1}{2 \gamma \mu}.\] Combining the two estimates yields ?? . ◻
We define \(\alpha_{\max} \triangleq \max_{k > 0} \{\alpha_i^k \}\), then we have \[\label{define95bound95alpha95max} \frac{1}{2 \gamma L} \;\leq\; \alpha_{\max} \;\leq\; \frac{1}{2 \gamma \mu}.\tag{12}\]
Remark 6. From Lemma 1, it follows that the stepsize rule 9 has the same upper bound as setup 6 , namely \(\tfrac{1}{2\gamma\mu}\). For the lower bound, if \(\tfrac{1}{2\gamma L_{y_i}^k}\) is smaller than \(\tfrac{1}{2\gamma L}\), we may use \(\tfrac{1}{2b\gamma L}\) as a valid bound, where \(b \geq 1\). This does not affect the proof, since only the lower bound of \(\alpha_i\) appears on the right-hand side of 17 , and the same reasoning applies to 23 . Therefore, the proof holds for any positive value of \(b\).
Lemma 2. For all \(k\), the following inequality holds: \[\begin{align} \|\boldsymbol{\mathrm{y}}^k\|\leq \|\boldsymbol{\mathrm{y}}^k- \bar{\boldsymbol{\mathrm{y}}}^k \| + L\|\boldsymbol{\mathrm{x}}^k- \bar{\boldsymbol{\mathrm{x}}}^k\| +\sqrt{n} L\| \bar{\boldsymbol{{x}}}^k - \boldsymbol{{x}}^* \|. \end{align}\]
**Proof.* We add and subtract \(\tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top \nabla f(\boldsymbol{\mathrm{x}}^k)\), use the fact that \(\boldsymbol{1}^\top \nabla f(\boldsymbol{x}^*)=0\), and apply the triangle inequality: \[\begin{align} \|\boldsymbol{\mathrm{y}}^k\| \leq \|\boldsymbol{\mathrm{y}}^k-\tfrac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top \nabla f(\boldsymbol{\mathrm{x}}^k)\| + \|\tfrac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top \nabla f(\boldsymbol{\mathrm{x}}^k)-\tfrac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top \nabla f(\bar{\boldsymbol{x}}^{k})\| +\|\tfrac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top \nabla f(\bar{\boldsymbol{x}}^{k})-\tfrac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top \nabla f(\boldsymbol{x}^*)\|. \end{align}\] Since \(\bar{\boldsymbol{\mathrm{y}}}^k=\tfrac{1}{n}\boldsymbol{1}^\top \nabla f(\boldsymbol{\mathrm{x}}^k)\) and \(\|\boldsymbol{1}\boldsymbol{b}\|=\sqrt{n}\|\boldsymbol{b}\|\), this becomes \[\begin{align} \|\boldsymbol{\mathrm{y}}^k\| \leq \|\boldsymbol{\mathrm{y}}^k- \bar{\boldsymbol{\mathrm{y}}}^k \| + \sqrt{n}\|\tfrac{1}{n}\boldsymbol{1}^\top \nabla f(\boldsymbol{\mathrm{x}}^k)-\tfrac{1}{n}\boldsymbol{1}^\top \nabla f(\bar{\boldsymbol{x}}^{k})\| + \sqrt{n}\|\tfrac{1}{n}\boldsymbol{1}^\top \nabla f(\bar{\boldsymbol{x}}^{k})-\tfrac{1}{n}\boldsymbol{1}^\top \nabla f(\boldsymbol{x}^*)\|. \end{align}\] By \(L\)-smoothness, \[\begin{align} \|\boldsymbol{\mathrm{y}}^k\| \leq \|\boldsymbol{\mathrm{y}}^k- \bar{\boldsymbol{\mathrm{y}}}^k \| + \sqrt{n} L \sum_{i=1}^n \tfrac{\|\boldsymbol{{x}}_i^k-\bar{\boldsymbol{{x}}}^k\|}{n} + \sqrt{n} L \|\bar{\boldsymbol{{x}}}^k-\boldsymbol{{x}}^*\|. \end{align}\] Finally, Jensen’s inequality gives \[\sum_{i=1}^n \tfrac{\|\boldsymbol{{x}}_i^k-\bar{\boldsymbol{{x}}}^k\|}{n} \leq \sqrt{\sum_{i=1}^n \tfrac{\|\boldsymbol{{x}}_i^k-\bar{\boldsymbol{{x}}}^k\|^2}{n}} = \tfrac{1}{\sqrt{n}}\|\boldsymbol{\mathrm{x}}^k-\bar{\boldsymbol{\mathrm{x}}}^k\|,\] which completes the proof. ◻*
For completeness, we restate another technical result; see [20] for a proof.
Lemma 3. Let \(\alpha \in \bigl(0,\tfrac{2}{L}\bigr)\) and define \[\eta \triangleq \max \left\{ |1-L\alpha|,\; |1-\mu\alpha| \right\}.\] If Assumptions 2 and 3 hold, then for all \(\boldsymbol{x}\in \mathbb{R}^p\), \[\bigl\| \boldsymbol{x}-\alpha \frac{1}{n}\!\!\sum_{i=1}^n \nabla f_i(\boldsymbol{x}) - \boldsymbol{x}^{*}\bigr\| \;\leq\; \eta \,\|\boldsymbol{x}-\boldsymbol{x}^{*}\|.\]
In what follows, we derive bounds on \(\|\boldsymbol{\mathrm{x}}^{k+1}- \bar{\boldsymbol{\mathrm{x}}}^{k+1}\|\), \(\|\boldsymbol{\mathrm{y}}^{k+1}- \bar{\boldsymbol{\mathrm{y}}}^{k+1}\|\), and \(\| \bar{\boldsymbol{x}}^{k+1} - \boldsymbol{x}^*\|\), which together will form the foundation for establishing the linear convergence rate of AdGT.
Lemma 4. For all \(k\), the following inequality holds: \[\begin{align} \|\boldsymbol{\mathrm{x}}^{k+1}- \bar{\boldsymbol{\mathrm{x}}}^{k+1}\| \leq (\lambda+\lambda \alpha_{\max} L)\|\boldsymbol{\mathrm{x}}^k- \bar{\boldsymbol{\mathrm{x}}}^k\| + \lambda \alpha_{\max} \|\boldsymbol{\mathrm{y}}^k- \bar{\boldsymbol{\mathrm{y}}}^k \| + \sqrt{n}\,\lambda \alpha_{\max} L\| \bar{\boldsymbol{{x}}}^k - \boldsymbol{{x}}^* \|. \end{align}\]
Proof. From the update rule we have \[\begin{align} \boldsymbol{\mathrm{x}}^{k+1}- \bar{\boldsymbol{\mathrm{x}}}^{k+1} &= \left(W- \tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top\right)\big(\boldsymbol{\mathrm{x}}^k - D^k \boldsymbol{\mathrm{y}}^k\big). \end{align}\] Taking norms and applying the triangle inequality yields \[\begin{align} \|\boldsymbol{\mathrm{x}}^{k+1}- \bar{\boldsymbol{\mathrm{x}}}^{k+1}\| &\leq \|\tilde{W}(\boldsymbol{\mathrm{x}}^k - \bar{\boldsymbol{\mathrm{x}}}^k)\| + \|\tilde{W}(D^k\boldsymbol{\mathrm{y}}^k)\|, \end{align}\] where \(\tilde{W} := W-\tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top\). Since \(\tilde{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}\leq \lambda\) and \(D^k\) is diagonal with entries \(\alpha_i^k \leq \alpha_{\max}\) (hence \({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert D^k \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\leq \alpha_{\max}\)), we obtain \[\begin{align} \|\boldsymbol{\mathrm{x}}^{k+1}- \bar{\boldsymbol{\mathrm{x}}}^{k+1}\| &\leq \lambda \|\boldsymbol{\mathrm{x}}^k - \bar{\boldsymbol{\mathrm{x}}}^k\| + \lambda \alpha_{\max}\|\boldsymbol{\mathrm{y}}^k\|. \end{align}\] Finally, applying Lemma 2 to bound \(\|\boldsymbol{\mathrm{y}}^k\|\) gives the desired result. ◻
Lemma 5. For all \(k\), the following inequality holds: \[\begin{align} \| \boldsymbol{\mathrm{y}}^{k+1}- \bar{\boldsymbol{\mathrm{y}}}^{k+1} \| \leq (\lambda+L\alpha_{\max}) \|\boldsymbol{\mathrm{y}}^k- \bar{\boldsymbol{\mathrm{y}}}^k \| + L \big({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}+ L\alpha_{\max}\big)\|\boldsymbol{\mathrm{x}}^k- \bar{\boldsymbol{\mathrm{x}}}^k\| +\sqrt{n}\, L^2 \alpha_{\max} \| \bar{\boldsymbol{x}}^k - \boldsymbol{x}^* \|. \end{align}\]
Proof. From the update rule we can write \[\begin{align} \boldsymbol{\mathrm{y}}^{k+1}- \bar{\boldsymbol{\mathrm{y}}}^{k+1} = W \boldsymbol{\mathrm{y}}^k + \nabla f(\boldsymbol{\mathrm{x}}^{k+1})-\nabla f(\boldsymbol{\mathrm{x}}^k) - \bar{\boldsymbol{\mathrm{y}}}^k - \tfrac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top \big(\nabla f(\boldsymbol{\mathrm{x}}^{k+1})-\nabla f(\boldsymbol{\mathrm{x}}^k)\big). \end{align}\] Taking norms and applying the triangle inequality gives \[\begin{align} \|\boldsymbol{\mathrm{y}}^{k+1}- \bar{\boldsymbol{\mathrm{y}}}^{k+1} \| &\leq \|\tilde{W} (\boldsymbol{\mathrm{y}}^k - \bar{\boldsymbol{\mathrm{y}}}^{k}) \| + \big\|(I-\tfrac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top)(\nabla f(\boldsymbol{\mathrm{x}}^{k+1})-\nabla f(\boldsymbol{\mathrm{x}}^k))\big\| \nonumber\\ &\leq \lambda \|\boldsymbol{\mathrm{y}}^k - \bar{\boldsymbol{\mathrm{y}}}^{k}\| + L \|\boldsymbol{\mathrm{x}}^{k+1}-\boldsymbol{\mathrm{x}}^k\|, \label{eq:32bound95diff95x95y} \end{align}\tag{13}\] where we used that the spectral norm of \(\tilde{W}\) satisfies \(\tilde{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}\leq \lambda\), the spectral norm of \({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert I-\tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\) equals \(1\), and \(L\)-smoothness. Next, we bound the term \(\|\boldsymbol{\mathrm{x}}^{k+1}-\boldsymbol{\mathrm{x}}^k\|\): \[\begin{align} \|\boldsymbol{\mathrm{x}}^{k+1}-\boldsymbol{\mathrm{x}}^k\| &= \|(W-I)\boldsymbol{\mathrm{x}}^{k} - W D^k \boldsymbol{\mathrm{y}}^k\| \nonumber\\ &\leq \|(W-I)(\boldsymbol{\mathrm{x}}^{k}-\bar{\boldsymbol{\mathrm{x}}}^{k})\| + \|W D^k \boldsymbol{\mathrm{y}}^k\| \nonumber\\ &\leq {\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\,\|\boldsymbol{\mathrm{x}}^{k}-\bar{\boldsymbol{\mathrm{x}}}^{k}\| + \alpha_{\max}\|\boldsymbol{\mathrm{y}}^k\|, \label{eq:diff95bound95x95k} \end{align}\tag{14}\] where in the second equation we used \((W-I)\bar{\boldsymbol{\mathrm{x}}}^k=0\), and in the last inequality we used that the spectral norm of \(W\) equals \(1\), together with the fact that \(D^k\) is diagonal with entries \(\alpha_i^k \leq \alpha_{\max}\), hence \({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert D^k \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\leq \alpha_{\max}\)4. Finally, substituting 14 into 13 , and applying Lemma 2 to bound \(\|\boldsymbol{\mathrm{y}}^k\|\), yields the stated inequality. ◻
Lemma 6. Let \(\gamma \geq \tfrac{L}{\mu}\). Then, for all \(k\), \[\begin{align} \| \bar{\boldsymbol{x}}^{k+1} - \boldsymbol{x}^*\| &\leq (1- \frac{\mu}{2 \gamma L}) \,\|\bar{\boldsymbol{x}}^{k} -\boldsymbol{x}^{*}\| + L \frac{\alpha_{max}}{\sqrt{n}} \|\bar{\boldsymbol{\mathrm{x}}}^{k} -\boldsymbol{\mathrm{x}}^{k} \| +\frac{\alpha_{max}}{\sqrt{n}} \|\boldsymbol{\mathrm{y}}^{k} -\bar{\boldsymbol{\mathrm{y}}}^{k}\| \end{align}\]
Proof. From taking averaged of 10 , we obtain \[\bar{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^*
= \bar{\boldsymbol{x}}^{k}-\tfrac{1}{n}\boldsymbol{1}^\top D^k \boldsymbol{\mathrm{y}}^{k}-\boldsymbol{x}^*.\] Adding and subtracting \(\tfrac{1}{n}\boldsymbol{1}^\top D^k\bar{\boldsymbol{\mathrm{y}}}^{k}\) and
applying the triangle inequality yield \[\begin{align} \|\bar{\boldsymbol{x}}^{k+1} - \boldsymbol{x}^*\| &\leq \|\bar{\boldsymbol{x}}^{k} - \tfrac{1}{n} \boldsymbol{1}^\top D^k
\bar{\boldsymbol{\mathrm{y}}}^{k} - \boldsymbol{x}^* \| +\| \tfrac{1}{n} \boldsymbol{1}^\top D^k (\boldsymbol{\mathrm{y}}^{k} -\bar{\boldsymbol{\mathrm{y}}}^{k}) \| \nonumber\\ &\leq \|\bar{\boldsymbol{x}}^{k} - \tfrac{1}{n} \boldsymbol{1}^\top D^k
\bar{\boldsymbol{\mathrm{y}}}^{k} - \boldsymbol{x}^* \| +\| \tfrac{1}{n} \boldsymbol{1}^\top D^k\| \| \boldsymbol{\mathrm{y}}^{k} -\bar{\boldsymbol{\mathrm{y}}}^{k} \| \nonumber\\ &\leq \|\bar{\boldsymbol{x}}^{k} - \tfrac{1}{n} \boldsymbol{1}^\top D^k
\bar{\boldsymbol{\mathrm{y}}}^{k} - \boldsymbol{x}^* \| +\frac{\alpha_{max}}{\sqrt{n}} \|\boldsymbol{\mathrm{y}}^{k} -\bar{\boldsymbol{\mathrm{y}}}^{k}\| \label{eq:barx95step95split}
\end{align}\tag{15}\] In the last inequality we used the spectral spectral norm \(\|\tfrac{1}{n}\boldsymbol{1}^\top D^k\|\le \alpha_{\max}/\sqrt{n}\).
Next, add and subtract \(\tfrac{1}{n}\boldsymbol{1}^\top D^k\,\tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top\nabla f(\bar{\boldsymbol{x}}^{k})\) and use \(\bar{\boldsymbol{\mathrm{y}}}^{k}=\tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top\nabla f(\bar{\boldsymbol{x}}^{k})\): \[\begin{align}
\|\bar{\boldsymbol{x}}^{k} - \tfrac{1}{n} \boldsymbol{1}^\top D^k& \bar{\boldsymbol{\mathrm{y}}}^{k} - \boldsymbol{x}^* \| \nonumber\\ &\leq \|\bar{\boldsymbol{x}}^{k} - \tfrac{1}{n} \boldsymbol{1}^\top D^k
\tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top \nabla f(\bar{\boldsymbol{x}}^{k}) - \boldsymbol{x}^* \| +\| \tfrac{1}{n} \boldsymbol{1}^\top D^k \tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top \nabla f(\bar{\boldsymbol{x}}^{k}) -\tfrac{1}{n}
\boldsymbol{1}^\top D^k \bar{\boldsymbol{\mathrm{y}}}^{k} \| \nonumber\\ &\leq \|\bar{\boldsymbol{x}}^{k} - \big(\tfrac{1}{n} \sum_{i=1}^n \alpha_i^k \big) \big(\tfrac{1}{n} \sum_{i=1}^n \nabla f_i(\bar{\boldsymbol{x}}^{k}) \big) - \boldsymbol{x}^* \|
+\| \tfrac{1}{n} \boldsymbol{1}^\top D^k \tfrac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top(\nabla f(\bar{\boldsymbol{x}}^{k}) - \nabla f(\boldsymbol{\mathrm{x}}^{k}) ) \| \nonumber\\ &\leq \|\bar{\boldsymbol{x}}^{k} - \big(\tfrac{1}{n} \sum_{i=1}^n
\alpha_i^k \big)\big(\tfrac{1}{n} \sum_{i=1}^n \nabla f_i(\bar{\boldsymbol{x}}^{k}) \big) - \boldsymbol{x}^* \| +L \frac{\alpha_{max}}{\sqrt{n}} \|\bar{\boldsymbol{\mathrm{x}}}^{k} -\boldsymbol{\mathrm{x}}^{k} \|\label{eq:centralized95plus95consensus}
\end{align}\tag{16}\] The last line follows from \(L\)-smoothness and the same bound on \(\tfrac{1}{n}\boldsymbol{1}^\top D^k\). When \(\tfrac{1}{n}\sum_{i=1}^n \alpha_i^k<\tfrac{2}{L}\), Lemma 3 gives \[\Bigl\|\bar{\boldsymbol{x}}^{k}-\Bigl(\tfrac{1}{n}\sum_{i=1}^n \alpha_i^k\Bigr)\Bigl(\tfrac{1}{n}\sum_{i=1}^n \nabla f_i(\bar{\boldsymbol{x}}^{k})\Bigr)-\boldsymbol{x}^*\Bigr\|
\le \eta\,\|\bar{\boldsymbol{x}}^{k}-\boldsymbol{x}^*\|,\] where \[\eta \triangleq \max\!\left\{\,\bigl|1-L\,\tfrac{1}{n}\!\sum_{i=1}^n \alpha_i^k\bigr|,\;
\bigl|1-\mu\,\tfrac{1}{n}\!\sum_{i=1}^n \alpha_i^k\bigr|\,\right\}.\] By Lemma 1, this stepsize condition holds whenever \(\gamma>\tfrac{L}{4 \mu}\). Moreover, under \(\gamma\ge \tfrac{L}{\mu}\), both terms inside the maximum are positive and strictly less than \(1\), so the
maximum reduces to \[\eta=1-\mu\Bigl(\tfrac{1}{n}\sum_{i=1}^n \alpha_i^k\Bigr).\] Finally, since Lemma 1 guarantees
\(\alpha_i^k\ge \tfrac{1}{2\gamma L}\), we obtain \[\label{eq:effect951} \eta \le 1-\frac{\mu}{2\gamma L}.\tag{17}\] Combining 16 with the bound on \(\eta\) yields \[\begin{align}
\Bigl\|\bar{\boldsymbol{x}}^{k}-\tfrac{1}{n}\boldsymbol{1}^\top D^k \bar{\boldsymbol{\mathrm{y}}}^{k}-\boldsymbol{x}^*\Bigr\|
&\le \Bigl(1-\frac{\mu}{2\gamma L}\Bigr)\|\bar{\boldsymbol{x}}^{k}-\boldsymbol{x}^*\| \nonumber\\ & \quad + L\,\frac{\alpha_{\max}}{\sqrt{n}}\;\|\bar{\boldsymbol{\mathrm{x}}}^{k}-\boldsymbol{\mathrm{x}}^{k}\|.
\label{eq:barx95first95term95final}
\end{align}\tag{18}\] Finally, substituting 18 into 15 gives \[\begin{align}
\|\bar{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^*\|
& \leq \Bigl(1-\frac{\mu}{2\gamma L}\Bigr)\|\bar{\boldsymbol{x}}^{k}-\boldsymbol{x}^*\| + L\,\frac{\alpha_{\max}}{\sqrt{n}}\;\|\bar{\boldsymbol{\mathrm{x}}}^{k}-\boldsymbol{\mathrm{x}}^{k}\| +
\frac{\alpha_{\max}}{\sqrt{n}}\;\|\boldsymbol{\mathrm{y}}^{k}-\bar{\boldsymbol{\mathrm{y}}}^{k}\|,
\end{align}\] which proves the claim. ◻
Combining the results of Lemmas 4, 5 and 6, we will construct a linear dynamical system prove the linear convergence of the proposed algorithm. For \(\gamma \geq \tfrac{L}{\mu}\), AdGT sequence \(\{\boldsymbol{\mathrm{x}}^k\}_{k\geq 0}\) satisfies the following system: \[\begin{align} \label{eqz19} \theta_{k+1}\leq \Upsilon~\theta_{k},\quad \forall~k\geq 0, \end{align}\tag{19}\] where \(\theta_k\), and \(\Upsilon\) are defined as \[\begin{align} &\theta_{k}= \begin{bmatrix} \|\boldsymbol{\mathrm{x}}^k- \bar{\boldsymbol{\mathrm{x}}}^k\| \\ \|\boldsymbol{\mathrm{y}}^k- \bar{\boldsymbol{\mathrm{y}}}^k \| \\ \| \bar{\boldsymbol{{x}}}^k - \boldsymbol{{x}}^* \| \end{bmatrix},\\ &\Upsilon= \begin{bmatrix} \lambda+\lambda \alpha_{\max} L& \lambda \alpha_{\max} & \sqrt{n}\,\lambda \alpha_{\max}\\ L \big({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}+ L\alpha_{\max}\big) & \lambda+L\alpha_{\max} & \sqrt{n}\, L^2 \alpha_{\max} \\ L \frac{\alpha_{max}}{\sqrt{n}} & \frac{\alpha_{max}}{\sqrt{n}} & (1- \frac{\mu}{2 \gamma L}) \end{bmatrix}. \end{align}\]
Theorem 1. Suppose Assumptions 2–3 hold. Let \[\gamma \;>\; \max\!\left\{\frac{L}{\mu},\; \frac{1}{2\mu \,\bar{\alpha}} \right\},\] where \(\bar{\alpha}>0\) is given by \[\label{eq:alpha95bar95def} \bar{\alpha} = \min\!\left\{ \frac{(1-\lambda)\,\beta_1}{\lambda\big(L\beta_{1}+\beta_{2}+\sqrt{n}\,\beta_{3}\big)},\; \frac{\beta_{2}(1-\lambda)-L{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\, \beta_1}{L\big(\beta_{1}+ \beta_{2}+\sqrt{n}\,L \beta_{3}\big)} \right\},\qquad{(4)}\] for some \(\beta_1,\beta_2,\beta_3>0\). Then \(\rho(\Upsilon)<1\).
Proof. Given \(\gamma \geq \frac{L}{\mu}\), it follows from Lemmas 4-5 that 19 holds for \(k\geq 0\). Next, we show \(\rho(\Upsilon)<1\). Since \(\Upsilon\) has all non-negative entries, it is sufficient to show that \(\Upsilon~ \beta< \beta\) for some positive \(\beta=\left[\beta_{1},\beta_{2},\beta_{3} \right]^{\top} \in \mathbb{R}^{3}_+\) –see [57]. Hence, \(\Upsilon~\beta< \beta\) is equivalent to \[\tag{20} \begin{align} (\lambda L\beta_{1}+\lambda\beta_{2}+\sqrt{n}\,\lambda \beta_{3})\, \alpha_{\max} &< \beta_{1}(1-\lambda), \tag{21}\\ (L\beta_{1}+ L\beta_{2}+ \sqrt{n}\, L^2 \beta_{3})\,\alpha_{\max} &< \beta_{2}(1-\lambda)-L{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\, \beta_1 , \tag{22}\\ \left(\frac{L}{\sqrt{n}}\beta_{1}+\frac{1}{\sqrt{n}}\beta_{2} \right)\alpha_{\max} &< \frac{\mu \beta_{3}}{2 \gamma L}. \tag{23} \end{align}\] For 22 to be feasible, the right-hand side must be positive, which enforces \[\label{eq:beta95195bound} 0 \;<\; \beta_1 \;<\; \frac{(1-\lambda)}{L{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}\,\beta_2 .\tag{24}\] Moreover, by 12 we have the upper bound \(\alpha_{\max} \le \frac{1}{2\gamma\mu}\). Substituting this value into 23 yields \[\label{eq:beta95395bound} \beta_3 \;>\; \left(\frac{L}{\sqrt{n}}\beta_{1}+\frac{1}{\sqrt{n}}\beta_{2}\right)\frac{L}{\mu^{2}}\tag{25}\] is sufficient. Under 24 and 25 (for any fixed \(\beta_2>0\)), the inequalities 21 –22 are guaranteed whenever \[\label{eq:alpha95max95bar95alpha95b} \alpha_{\max} \;<\; \bar{\alpha},\tag{26}\] where \(\bar\alpha\) is defined in ?? . Because the upper bound \(\alpha_{\max} \le \frac{1}{2\gamma\mu}\), condition 26 follows from \(\gamma>1/(2\mu\bar\alpha)\). Combining this with the standing requirement \(\gamma>L/\mu\) ensures the existence of \(\beta\in\mathbb{R}^3_{++}\) with \(\Upsilon\beta<\beta\), hence \(\rho(\Upsilon)<1\). ◻
Remark 7. A constructive way to guarantee \(\rho(\Upsilon)<1\) is as follows: fix any \(\beta_2>0\), select \(\beta_1\) so that 24 holds, and then choose \(\beta_3\) to satisfy 25 . These choices determine \(\bar\alpha\) via ?? . If, in addition, \[\gamma \;>\; \max\!\left\{\frac{L}{\mu},\,\frac{1}{2\mu\,\bar\alpha}\right\},\] then the spectral radius satisfies \(\rho(\Upsilon)<1\).
Remark 8. For convenience and to express the required lower bound on \(\gamma\) explicitly, one can select \[\beta_1=\frac{1-\lambda}{2L{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}\,\beta_2, \quad \beta_{3} \;=\; \frac{2L}{\sqrt{n}\,\mu^{2}}\!\left(\frac{1-\lambda}{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}+1\right)\beta_{2},\] which satisfy 24 and 25 . By Remark 7 and ?? , it follows that \[\;\gamma \;>\; \max\!\left\{\frac{L}{\mu},\,A_1,\,A_2\right\}. \;\] Here, \[\begin{align} A_{1} \;&=\; \frac{L\lambda\Big(4L{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert} +2\mu^{2}{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}+\mu^{2}(1-\lambda)+4L(1-\lambda)\Big)}{2\mu^{3}(1-\lambda)^{2}},\\[2mm] A_{2} \;&=\; \frac{4L^{3}{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}+2L\mu^{2}{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}+\mu^{2}(1-\lambda)+4L^{3}(1-\lambda)}{2\mu^{3}{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}(1-\lambda)}. \end{align}\] Thus, when \(\gamma>\max\!\{L/\mu,\,A_1,\,A_2\}\), Lemma 1 guarantees that the local stepsizes generated by stepsize rule 6 automatically satisfy \[\frac{1}{2\gamma L}\;\le\;\alpha_i^{k}\;\le\;\frac{1}{2\gamma\mu},\qquad \forall i,k,\] The theoretical lower bound for \(\gamma\) involves global quantities such as \(\mu\), \(L\) which are typically unknown in practice. Moreover, these bounds are often conservative and larger than the values needed in real implementations, reflecting the technical complexity of the convergence analysis in decentralized optimization. Consequently, \(\gamma\) is best selected empirically. A robust default is \(\gamma=1\), which performs well on most of the graphs. For sparsely connected networks (e.g., rings or lines, where \(1-\lambda\) is small), choosing a larger \(\gamma>1\) often improves convergence. Empirically, the \(\gamma\) delivering the best performance is frequently smaller than the conservative theoretical threshold, as consistently observed in our numerical results.
Finally, in the next theorem, we prove that AdGT iterate sequence converges with a linear rate.
Theorem 2. Let Assumptions 2–3 hold. Suppose \[\gamma \;>\; \max\!\left\{\frac{L}{\mu},\; \frac{1}{2\mu \,\bar{\alpha}} \right\},\] where \(\bar{\alpha}>0\) is defined in ?? . Then the sequence \(\{\boldsymbol{\mathrm{x}}^{k}\}\) converges linearly to \(\boldsymbol{\mathrm{x}}^{*}=\boldsymbol{1}\,\boldsymbol{x}^*\), with a rate arbitrarily close to \(\rho(\Upsilon)\).
Proof. By Theorem 1, \(\rho(\Upsilon)<1\). Hence, by [57], for any \(\zeta\in(0,\,1-\rho(\Upsilon))\) there exists an induced matrix norm \(\|\cdot\|_{(\zeta)}\) such that \[\|\Upsilon\|_{(\zeta)} \le \rho(\Upsilon)+\tfrac{\zeta}{2}.\] Since norms are equivalent in finite dimensions, there exists \(\Gamma>0\) such that for all \(k\ge
0\), \[\begin{align}
\label{eqz26}
\| \Upsilon^{k} \| \leq \Gamma \tilde{\lambda}^{k}
\end{align}\tag{27}\] where \(\tilde{\lambda} \triangleq \rho(\Upsilon)+\tfrac{\zeta}{2}<1\). By writing 19 recursively, we get, for all \(k\geq 0\),
\[\theta_{k+1} \;\le\; \Upsilon^{k}\,\theta_{0},\] Combining this with 27 yields \[\label{eq:theta-kplus1}
\|\theta_{k+1}\| \;\le\; \Gamma\,\tilde{\lambda}^{k}\,\|\theta_{0}\|, \qquad k\ge 0.\tag{28}\] Define \(a_k \triangleq\sum\limits_{j=0}^{k-1} \| \theta_{j} \|\) for \(k\ge 0\).
Then \[\label{eq:ak-diff}
\|\theta_k\| \;=\; a_{k+1}-a_k \;\le\; \Gamma\,\tilde{\lambda}^{k}\,\|\theta_0\|\tag{29}\] which implies \[a_{k+1} \;\le\; a_k + \Gamma\,\tilde{\lambda}^{k}\,\|\theta_0\|.\] Because \(\tilde{\lambda}\in(0,1)\), the series \(\sum_{k\ge 0}\tilde{\lambda}^{k}\) converges, hence \(\{a_k\}\) is convergent and therefore bounded.
Now pick the slightly larger factor \(\tilde{\lambda}+\tfrac{\zeta}{2}\in(0,1)\). Using 29 and the boundedness of \(\{a_k\}\), \[\lim_{k\to\infty}\frac{\|\theta_k\|}{\big(\tilde{\lambda}+\tfrac{\zeta}{2}\big)^k}
\;\le\;
\lim_{k\to\infty}
\frac{\Gamma\,\tilde{\lambda}^{k-1}\,\|\theta_0\|}{\big(\tilde{\lambda}+\tfrac{\zeta}{2}\big)^k}
\;=\; 0.\] Therefore, there exists \(c>0\) such that for all \(k\ge 0\), \[\label{eq:theta-decay}
\|\theta_k\| \;\le\; c\,\big(\tilde{\lambda}+\tfrac{\zeta}{2}\big)^k.\tag{30}\] Finally, we get the desired result since for all \(k\geq 0\), \[\begin{align}
\| \boldsymbol{\mathrm{x}}^k-\boldsymbol{\mathrm{x}}^{*} \| \quad &\leq \quad \| \boldsymbol{\mathrm{x}}^k-\bar{\boldsymbol{\mathrm{x}}}^k\|+ \| \bar{\boldsymbol{\mathrm{x}}}^k-\boldsymbol{\mathrm{x}}^{*} \| \nonumber \\
&\leq \quad \| \boldsymbol{\mathrm{x}}^k-\bar{\boldsymbol{\mathrm{x}}}^k\|+ \sqrt{n} \| \bar{\boldsymbol{x}}^k-\boldsymbol{{x}}^{*} \| \nonumber
\end{align}\] Combining with 30 gives \[\|\boldsymbol{\mathrm{x}}^k-\boldsymbol{1}\,\boldsymbol{x}^*\|
\;\le\; (1+\sqrt{n})\,c\,\big(\tilde{\lambda}+\tfrac{\zeta}{2}\big)^k \le\; (1+\sqrt{n})\,c\,\big(\rho(\Upsilon)+\zeta\big)^k.\] Thus, the sequence \(\{ \boldsymbol{\mathrm{x}}^k\}\) generated by AdGT converges
exactly to the unique optimal solution \(\boldsymbol{\mathrm{x}}^*\) at a geometric rate. ◻
In this section, we present numerical results to evaluate the performance of AdGT, with a particular focus on our contributions. To keep the comparison clear and meaningful, we compare AdGT with two methods: the fixed stepsize Gradient Tracking (GT) method and an adaptive decentralized method, referred to as Method-DM, proposed in [56]. All experiments are carried out under undirected communication graphs. We chose these two methods because GT has already been widely studied and compared with other advanced decentralized optimization methods in previous works.
Our experiments are divided into three parts. First, in Section 4.1, we show that AdGT can change stepsizes automatically, removing the need for manual tuning and making the algorithm easier to use. We evaluate AdGT using both stepsize rules 6 and 9 , and compare its performance with other state-of-the-art methods. Second, in Section 4.2, we point out the advantage of letting each agent use a different stepsize based on the smoothness of its local function. In this case, where agents have different smoothness levels, we show that the gap between AdGT and GT is quite noticeable compared to the gap between Adaptive Gradient Descent (AdGD) [48] and regular Gradient Descent (GD) in a centralized setting. Third, in Section 4.3, we study how the number of nodes affects the performance of AdGT and GT in two types of networks: random graphs and cycle graphs.
In this experiment, our objective is to estimate \(\tilde{x}\) using the optimal solution \(x^\ast\) of the decentralized logistic regression problem, which is formulated as: \[\begin{align} \label{eq95logestic95reg} x^\ast \in \mathop{\mathrm{argmin}}_{\boldsymbol{x}\in\mathbb{R}^{p}}{f}(x) = \sum_{i=1}^n f_i(x), \end{align}\tag{31}\] where each local function \(f_i(x)\) is defined as \[\begin{align} f_i(x) = \sum_{j=1}^m \log\left(1 + \exp(-y_{i}^j \cdot M_{i}^j x)\right) + \frac{\rho}{2} \|x\|^2. \end{align}\] We solve the optimization problem in 31 across a variety of known network topologies—specifically, Star, Cycle, Line, Ladder—as well as two randomly generated graphs with connectivity ratios of \(0.2\) and \(0.35\). The dataset used is the w8a dataset [58], which contains 49,749 examples, each with a 300-dimensional feature vector and a corresponding binary label. To include an intercept term in the model, we set \(p=301\).
The network consists of \(n = 16\) nodes (or agents). Each agent \(i \in \mathcal{V}\) randomly selects \(m_i = 25\) data points from the training set
without replacement. For each agent, the local data matrix \(M_i^j \in \mathbb{R}^{ p}\), \(j=1,..,m_i\) is formed by standardizing the original \(p - 1\)
feature vectors using Z-score normalization, and then adding a column of ones to include the intercept term, and the regularizer is \(\rho = 0.01\). To generate the random graphs, we used Ggen
code5.
In all simulations presented in Sections 4.1 to 4.3, we fix the random seed to 42, and set the initial point as \(x^0=0\), and the convergence behavior of different methods is illustrated through the residual sequence \({r(k)}_{k\geq 1}\), which is defined as \[\begin{align} r(k)\triangleq\dfrac{\Vert\boldsymbol{\mathrm{x}}^k-\boldsymbol{\mathrm{x}}^{*}\Vert}{\Vert\boldsymbol{\mathrm{x}}^0-\boldsymbol{\mathrm{x}}^{*}\Vert} \end{align}\]
Figure 2: Decentralized logistic regression with \(n=16\). (a) \(\{r(k)\}_k\) for Star, (b) \(\{r(k)\}_k\) for Cycle, (c) \(\{r(k)\}_k\) for Line, (d) \(\{r(k)\}_k\) for Ladder, (e) \(\{r(k)\}_k\) for Random graph with connectivity ratio 0.2, (f) \(\{r(k)\}_k\) for Random graph with connectivity ratio 0.35..
In Fig. 2, we plot the residual sequence \(\{r(k)\}_{k \geq 0}\) for all methods. To improve the practical convergence rate, the stepsize in the GT algorithm was tuned via a grid search. For the AdGT algorithm, we evaluate both stepsize rules 6 and 9 , denoted as “AdGT(8)” and “AdGT(12)”, respectively. In our experiments, we set \(\gamma = 1\) for AdGT(12) and \(\gamma = 4\) for AdGT(8) across all graph topologies, which generally leads to better performance compared with GT. For line, cycle, and ladder graphs, setting \(\gamma = 8\) results in slightly faster convergence. As discussed earlier, the AdGT(12) is typically more stable and requires little or no tuning. However, with minimal tuning of \(\gamma\), the AdGT(8) can also achieve a high convergence rate. Moreover, we should emphasize that the performance of AdGT method is not sensitive to slight changes in \(\gamma\). We also include results for the GT algorithm using two additional stepsize values: one larger and one smaller than the empirically optimal value. This is to illustrate the sensitivity of GT to stepsize selection and highlight the significant effort required to properly tune it. In contrast, using an adaptive stepsize, as in AdGT, substantially reduces this burden. As shown in the results, AdGT not only requires little to no tuning in most cases, but it also often achieves better convergence rates compared to the constant-stepsize version. In some scenarios, such as random or star graph topologies, the improvement is significant. As noted previously, Method-DM fails to converge in all graph settings except for the star graph. In the following sections, we use only the stepsize rule 9 for the AdGT method.
In this section, we study the benefits of using different stepsizes, especially in situations where some agents have larger smoothness constants (higher \(L-\)smoothness). Such differences in smoothness can have a strong impact on the convergence speed. We show that AdGT, which adjusts stepsizes automatically, achieves much faster convergence than methods with fixed stepsizes. To explore this, we consider a decentralized quadratic optimization problem in which each agent \(i \in \{1, \ldots, n\}\) has a local quadratic cost function of the form \[f_i(x) := \frac{1}{2} x^\top A_i x + b_i^\top x,\] where \(A_i \in \mathbb{S}_{++}^p\) is a positive definite diagonal matrix, and \(b_i \in \mathbb{R}^p\) is a vector. The overall objective is to minimize the sum of all local functions: \[\begin{align} \label{Objective95quadratic} f(x) := \sum_{i=1}^{n} \frac{1}{2} x^\top A_i x + b_i^\top x . \end{align}\tag{32}\]
To model differences in smoothness across the local objectives, we generate matrices \(A_i\) with varying condition numbers, which directly affect the Lipschitz constants \(L_i \approx \|A_i\| = \lambda_{\max}(A_i)\). Following the setup in [59], we construct each diagonal matrix \(A_i\) by sampling its entries as follows: the first half of the diagonal (i.e., the first \(p/2\) entries) is drawn uniformly at random from the set \(\{1, 10^{-1}, \ldots, 10^{-\tau}\}\), and the second half from \(\{1, 10^{1}, \ldots, 10^{\tau}\}\). This ensures that the eigenvalues of each matrix \(A_i\) lie within the interval \([10^{-\tau}, 10^{\tau}]\). The linear terms \(b_i^\top x\) are added to ensure that the local objective functions have different minimizers. The vectors \(b_i\) are generated uniformly at random from the box \([0, 1]^p\).
We consider a network of \(n = 100\) nodes, where the dimension of the local variable \(x\) is \(p = 20\). We evaluate the performance under four different scenarios based on the choice of the condition number parameter \(\tau\):
All agents are assigned a condition number parameter of \(\tau = 3\);
Half of the agents (i.e., 50 nodes) have \(\tau = 3\), while the remaining half have \(\tau = 1\);
Ten percent of the agents (i.e., 10 nodes) are assigned \(\tau = 3\), and the remaining 90 nodes have \(\tau = 1\);
Only 3 agents are assigned \(\tau = 3\), with all others using \(\tau = 1\).
Figure 3: Decentralized quadratic problem with \(n=100\)..
We included AdGD and GD, along with Method-DM and GT, to compare against AdGT. The corresponding numerical results for these four methods are illustrated in Fig. 3. In all simulations, we set \(\gamma = 1\). For GD and GT, the stepsizes are tuned to achieve the best possible convergence rate. As observed, when the number of nodes with high \(L\)-smoothness decreases, the convergence rate of AdGT improves significantly compared to GT. In contrast, the gap between the convergence rates of AdGD and GD remains relatively unchanged. This highlights the effectiveness of using adaptive stepsizes, particularly in scenarios where the data distributed across nodes exhibit varying levels of smoothness.
We consider ridge regression as a specific instance of problem 1 , where the local objective function for each agent \(i\) is defined as \[\begin{align} f_i(x) = \|A_i
x_i - b_i\|^2 + \rho \|x_i\|^2,
\end{align}\] with \(A_i \in \mathbb{R}^{m \times p}\), \(b_i \in \mathbb{R}^{m}\), and the regularization parameter set to \(\rho = 0.1\). The data
used in this experiment corresponds to the w8a
dataset, as described in Section 4.1.
To evaluate performance, we conducted numerical experiments on both random and cycle networks. For the random networks, we generated graphs with a connectivity ratio of \(0.35\), considering two cases with \(n = 25\) and \(n = 100\) nodes. For the cycle graph, we tested it with \(n = 10\) and \(n = 25\) nodes.
Figure 4: Performance of ridge regression over cycle and random graphs..
In all four experimental settings, the full dataset was evenly partitioned among the \(n\) agents. Each agent was randomly assigned a local dataset of size \(m\), drawn from the global dataset without replacement. We fix \(\gamma = 1\), and the stepsizes for GD and GT are carefully tuned to obtain their optimal convergence performance.
The residual sequences \({r(k)}_{k \geq 1}\) for all methods are presented in Fig. 4. In the case of the random graph with a moderate connectivity ratio of 0.35, we observe that increasing the number of nodes generally leads to faster convergence for AdGT compared to GT. In contrast, for the cycle graph, AdGT performs better when the number of agents is small. Moreover, Method-DM fails to converge in all scenarios, except for the cycle graph with \(n = 10\) agents. Similar to the results presented in Section 4.2, the difference in convergence rate between AdGT and GT is noticeably larger than that between AdGD and GD, particularly in random graphs with \(n = 100\) nodes.
In this work, we propose AdGT, a decentralized optimization algorithm that adaptively selects stepsizes based on local information available at each agent. By eliminating the need for manual stepsize tuning, AdGT simplifies practical implementation and improves convergence efficiency, especially in heterogeneous networks with varying local smoothness. Our theoretical analysis establishes that AdGT achieve linear convergence. Furthermore, empirical results demonstrate that adaptive stepsizes significantly outperform fixed-stepsize methods in both convergence speed and robustness. These findings underscore the value of local adaptivity in decentralized optimization and suggest promising directions for future research on scalable, communication-efficient algorithms over general network topologies.
This work was partially supported by the Research Council of Finland (Grant 354523) and the Research Council of Norway.↩︎
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.↩︎
See [57], Problem 5.6.P20 and the proof of Theorem 7.4.10.1.↩︎
In particular, if \(W\) is doubly stochastic, then \({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert W-I \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\leq 2\).↩︎
The Ggen
code is written by Dr. W. Shi; see https://sites.google.com/view/wilburshi/home/research/software-a-z-order/graph-tools/ggen.↩︎