Delay-Tolerant Augmented-Consensus-based Distributed Directed Optimization


Abstract

Distributed optimization finds applications in large-scale machine learning, data processing and classification over multi-agent networks. In real-world scenarios, the communication network of agents may encounter latency that may affect the convergence of the optimization protocol. This paper addresses the case where the information exchange among the agents (computing nodes) over data-transmission channels (links) might be subject to communication time-delays, which is not well addressed in the existing literature. Our proposed algorithm improves the state-of-the-art by handling heterogeneous and arbitrary but bounded and fixed (time-invariant) delays over general strongly-connected directed networks. Arguments from matrix theory, algebraic graph theory, and augmented consensus formulation are applied to prove the convergence to the optimal value. Simulations are provided to verify the results and compare the performance with some existing delay-free algorithms.

Introducing a new distributed optimization technique based on algebraic graph theory and augmented consensus protocols.

Handling heterogeneous, arbitrary, time-invariant but bounded delays over general strongly-connected directed networks

time-delay ,distributed optimization ,graph theory ,machine learning ,augmented consensus

1 Introduction↩︎

In recent years, the study of distributed (or decentralized) algorithms for optimization, learning, and classification over a network of computing nodes/agents has gained significant attention due to adavances in cloud-computing and parallel data processing [1]. These networks consist of multiple agents, each with limited computational and communication capabilities, working collaboratively to solve optimization problems or learn from data in a distributed manner. However, a critical challenge in these networks is the presence of time-delays [2], [3], which can arise from communication latencies, processing times, or network congestion. Time-delays can severely impact the performance and convergence of distributed algorithms, making it essential to develop robust methods that can handle such delays. This paper explores the theoretical foundations and practical implementations of distributed optimization and learning algorithms that are resilient to time-delays, providing mathematical proofs, analysis, and potential applications.

1.1 Problem and Contributions↩︎

In distributed optimization, the idea is to optimize a cost function (or loss function) over a network of computing nodes. The objective function is the sum of some local cost functions at the nodes, and the goal is to optimize this objective using locally defined gradient-based algorithms. The common form of the optimization problem is, \[\begin{align} \min_{\mathbf{z}} F(\mathbf{z}) = \sum_{i=1}^{N} f_i(\mathbf{z}) \label{eq95prob} \end{align}\tag{1}\] with state parameter \(\mathbf{z} \in \mathbb{R}^{m}\). Functions \(f_i:\mathbb{R}^m \mapsto \mathbb{R}\) are strongly convex, differentiable with Lipschitz gradients, and denote the objective function (cost, loss, etc.) at computing node \(i\). It is assumed that the optimal point \(\underline{\mathbf{z}}^* = \min_{\mathbf{z}} F(\mathbf{z})\) for this problem exists. The primary work [4] introduces subgradient algorithms to solve this problem. ADD-OPT algorithm [5] and its recent stochastic version S-ADD-OPT [6] are popular algorithms to solve problem 1 . These algorithms work over strongly-connected directed networks with irreducible column stochastic adjacency matrices, and are granted with (i) constant step-size in contrast to existing diminishing step-size algorithms, (ii) providing accelerated convergence by tuning the step-size over a wide range, and (iii) linear convergence rate for strongly convex cost functions. Other existing distributed algorithms include: event-triggered-based second-order multi-agent systems [7], [8], double step-size solutions for nonsmooth optimization [9], reduced-complexity and flexible algorithms [10], primal-dual subgradient-based solutions [11], EXTRA algorithm for first-order consensus-based optimization [12], push-pull gradient-based methods [13], and the solutions based on alternating direction method of multipliers (ADMM) [14][17]. The literature also includes distributed constrained optimization with application to resource allocation under time-delay. For, example, DTAC-ADMM discusses ADMM-based distributed resource allocation under time-delay [18]. Similarly, asynchronous ADMM-based resource allocation algorithms are proposed in [19]. These works consider distributed optimization subject to a coupling resource-demand balance constraint, where the objective functions are decoupled and local. Asynchronous distributed optimization is also discussed in [20], [21], where agents perform local computations and communications without requiring global synchronization. In such methods, each node updates its local model using the most recently available information from neighbors (which may be received at irregular times). Recall that scalability is a key advantage of existing distributed optimization techniques, which follows the polynomial-order complexity of the algorithms. Polynomial-order complexity ensures computationally-efficient solutions as the number of agents or decision variables increases, making it feasible to deploy these algorithms on large-scale networks.

In this work, as the main contribution, we extend such distributed optimization algorithms to further address arbitrary and bounded time-delays over multi-agent networks. Latency is primarily addressed in consensus literature including: resilient consensus with \(l\)-hop communication [22], multi-agent consensus subject to uncertainties and time-varying delays [23], group consensus over digraphs subject to noise and latency [24], continuous-time linear average consensus with constant delays at all nodes [25], discrete-time consensus algorithms with constant communication delays [26], discrete-time consensus over digraphs under heterogeneous time-delays [27]. These works are advantageous as they provide rigorous stability/convergence analysis applicable to other distributed setups; however, they mostly assume constant homogeneous delays. For a review of consensus algorithms under time-delays and their advantages/disadvantages, refer to [28]. The concept of time-delay is not sufficiently addressed in distributed optimization literature. The inherent time-delay of information exchange among communicating nodes may lead the distributed optimization algorithm to lose convergence. The delays are typically assumed to be bounded, implying that the information sent over every link eventually reaches the destination node, i.e., no packet loss over the network. In this paper, we propose augmented consensus-based algorithms to analyze the effect of time-delays while keeping the consensus matrix on the link weights column stochastic. Our solution can tolerate heterogeneous communication delays at different links. In this regard, similar to [29], this work improves the existing algorithms over non-delayed networks [5], [6], [12], [13], [30][32] to more advanced delay-tolerant solutions which are not well-addressed in the literature (to our best knowledge). This work also advances the existing ADMM-based solutions [14][16] to withstand latency and network time-delays. Our proposed delay-tolerant augmented consensus-based DTAC-ADDOPT algorithm is in single time-scale, i.e., it performs only one step of (augmented) consensus on received information per iteration/epoch. This is computationally more efficient in contrast to the double time-scale methods [33], [34] with many steps of inner-loop consensus per iteration/epoch. Heterogeneous time-delays are considered primarily for ADMM-based [18] and gradient-descent-based [3] equality-constraint distributed optimization and resource allocation, but this current paper is our first paper addressing it over unconstrained ADD-OPT.

1.2 Applications↩︎

Distributed Training for Binary Classification: Consider a group of agents to classify \(N\) data points \({\boldsymbol{\chi}_i \in \mathbb{R}^{m-1}}\), \({i=1,\ldots,N}\), labeled by \({l_i \in \{-1,1\}}\). The problem is to find the partitioning hyperplane \({\boldsymbol{\omega}^\top \boldsymbol{\chi} - \nu =0}\), for \({\boldsymbol{\chi}\in\mathbb{R}^{m-1}}\). In the linearly non-separable case, a proper nonlinear mapping \(\phi(\cdot)\) with kernel \(K(\boldsymbol{\chi}_i,\boldsymbol{\chi}_j)=\phi(\boldsymbol{\chi}_i)^\top \phi(\boldsymbol{\chi}_j)\) can be found such that \({g(\widehat{\boldsymbol{\chi}})= \text{sgn}(\boldsymbol{\omega}^\top \phi(\widehat{\boldsymbol{\chi}}) - \nu)}\) determines the class of \(\widehat{\boldsymbol{\chi}}\). Agents collaboratively solve the problem by finding the optimal \(\boldsymbol{\omega}~\) and \(\nu\) and optimize the following convex loss [35]: \[\label{eq95svm95cent} \begin{align} \displaystyle f_i(\boldsymbol{\omega},\nu) = \boldsymbol{\omega}^\top \boldsymbol{\omega} + C \sum_{j=1}^{N} \max\{x,0\}^p \end{align}\tag{2}\] with \(p \in \mathbb{N}\) as the smoothness factor (which is typically a finite number), \(C>0\) as the margin size parameter, and \({x=1-l_j( \boldsymbol{\omega}^\top \phi(\boldsymbol{\chi}^i_j)-\nu)}\). The differentiable smooth equivalent of \(f_i\) in Eq. 2 is in the following form (assuming large enough \(\mu>0\)): \[\label{eq95svm95smooth} f_i(\boldsymbol{\omega},\nu)=\boldsymbol{\omega}^\top \boldsymbol{\omega} + C \sum_{j=1}^{N_i} \tfrac{1}{\mu}\log (1+\exp(\mu x)).\tag{3}\] This problem is also known as distributed support-vector-machine (D-SVM) [31], [32].

Distributed Least Squares: In this problem, the idea is to solve the least square problem \(H\mathbf{z}=\mathbf{b}\) in a distributed manner. Every agent/node \(i\) takes measurement \(\mathbf{b}_i \in \mathbb{R}^p\) and has a \(p\)-by-\(n\) measurement matrix \(H_i\) and collaboratively optimizes the private loss function in the following form [32]: \[\begin{align} f_i(\mathbf{x}) = \frac{1}{2} \|H_i \mathbf{z}-\mathbf{b}_i\|_2^2 \label{eq95prob95ls} \end{align}\tag{4}\] This can be addressed further in the context of distributed filtering [36][38].

Distributed Logistic Regression: In this problem each agent \(i\) with access to \(m_i\) training data points defined by \((c_{ij},y_{ij}) \in \mathbb{R}^p \times \{-1,1\}\), where the parameter \(c_{ij}\) has \(p\) features of the \(j\)th training data and \(y_{ij}\) denotes the binary label \(\{-1,+1\}\). Each agent, collaborating with others, solves and optimizes the private loss function in the following form [30]: \[\begin{align} f_i(\mathbf{w},b) = \sum_{j=1}^{m_i} \log(1+\exp(-(\mathbf{w}^\top c_{ij}+b)y_{ij})) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 \label{eq95prob95lr} \end{align}\tag{5}\] where the last term is for regularization to avoid overfitting.

1.3 Paper Organization↩︎

Section 2 provides the preliminary notions. Section 3 gives the main DTAC-ADDOPT algorithm with proof of convergence in Section 4. Section 5 provides the simulation results on both the academic setup and the real dataset. Section 6 provides the concluding remarks.

1.4 Notations↩︎

Table 1 summarizes the notations in this paper.

0.7

Table 1: Description of notations and symbols
Symbol Description
\(\mathcal{G}\) multi-agent network
\(C\) adjacency matrix of the network
\(\mathcal{N}_i\) neighbors of agent \(i\)
\(\mathbf{z}\) global state variable
\(\underline{\mathbf{z}}^*\) optimal state
\(F\) global objective function
\(f_i\) local objective function at node \(i\)
\(m\) dimention of state variable
\(n\) number of nodes/agents
\(\tau_{ij}\) time-delay at link \((i,j)\)
\(\overline{\tau}\) bound on the time-delays
\(\overline{C}\) augmented adjacency matrix
\(\mathbf{y},\mathbf{x},\mathbf{g}\) auxiliary optimization variables
\(\nabla f_k\) gradient of function \(f\) at time \(k\)
\(\overline{\nabla f}_k\) gradient vector over last \(\overline{\tau}\) steps at time \(k\)
\(\alpha\) gradient-tracking step rate
\(\widehat{\mathbf{z}}\) augmented state variable
\(\widehat{\mathbf{y}},\widehat{\mathbf{x}},\widehat{\mathbf{g}}\) augmented auxiliary variables
\(\rho\) spectral radius
\(\mathbf{1}_n\) all ones column vector of size \(n\)
\(I_n,0_n\) identity and zero matrix of size \(n\)
\(k\) time index
\(\|\cdot\|\) 2-norm operator
\(\otimes\) Kronecker product operator

2 Preliminaries↩︎

2.1 Algebraic Graph Theory↩︎

We consider the network of agents as a digraph (directed graph) of nodes denoted by \(\mathcal{G}=\{\mathcal{V},\mathcal{E}\}\) with \(\mathcal{V}\) and \(\mathcal{E}\) respectively as the node set and link set. A link \((i,j) \in \mathcal{E}\) from node \(i\) to node \(j\) implies a communication link for message passing from agent \(i\) to agent \(j\). The adjacency matrix of \(\mathcal{G}\) is denoted by \(C\) where \(C_{ij}\) is the weight on the link \((j,i)\) (or \(j \rightarrow i\)). Define the in-neighborhood of every node \(i\) as \(\mathcal{N}_i = \{j|(j,i) \in \mathcal{E}\}\) or \(\mathcal{N}_i = \{j|C_{ij} \neq 0 \}\).

Assumption 1. The digraph (or network) \(\mathcal{G}\) is strongly connected and its adjacency matrix \(C\) is irreducible [39]. Moreover, the matrix \(C\) is column stochastic, i.e., \(\sum_{i=1}^n C_{ij} =1\).

Note that, in most directed network implementations, agents already know their outgoing neighbor set for column-stochastic design of matrices, and the out-degree is locally available.

2.2 Augmented Formulation↩︎

The delay model is similar to the consensus literature [27] and is clearly defined in the following assumption.

Assumption 2. The time-delays are considered heterogeneous (at different links), bounded, arbitrary, and time-invariant. An integer value \(0\leq \tau_{ij}\leq \overline{\tau}\) represents the delay at link \((i,j)\). The bound \(\overline{\tau}\) ensures no information loss over the network.

We justify the above assumption. Note that, in practice, delays may change on a time-scale much slower than the algorithm step-size (or are upper-bounded by the same constant); therefore, the derived bounds using the maximum delay remain valid in practical cases. Also, in many networks, communication paths and routing remain stable for long periods. Communication latencies in these settings are dominated by propagation and queuing delays that are (on the algorithm time-scale) nearly constant and hence well modeled as time-invariant. Moreover, treating delays as fixed (but heterogeneous) provides a conservative worst-case analysis useful for algorithm design and safety guarantees.

For every set of connected nodes \((i,j)\) and \((i,k)\), the communication delay implies the heterogeneous scenario. Define the augmented state vectors \(\widehat{\mathbf{z}}_k = (\mathbf{z}_k; \mathbf{z}_{k-1}; \dots; \mathbf{z}_{k-\overline{\tau}})\) as the column-concatenation of delayed state vectors (";" denotes column concatenation). Given the column stochastic consensus matrix \(C\) and maximum delay \(\overline{\tau}\), its augmented matrix is defined as, \[\begin{align} \label{eq95aug95C} \small \overline{C} = \left( \begin{array}{cccccc} C_0 & I_n & 0_n & \ldots & 0_n & 0_n \\ C_1 & 0_n & I_n &\ldots & 0_n& 0_n\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ C_{\overline{\tau}-1} & 0_n & 0_n & \ldots & I_n & 0_n \\ C_{\overline{\tau}} & 0_n & 0_n & \ldots & 0_n & I_n \end{array} \right), \normalsize \end{align}\tag{6}\] with \(I_{n}\) and \(0_{n}\) respectively as \(n\)-by-\(n\) identity and zero matrices. The non-negative matrices \(C_r\) with \(r \in \{0,\dots,\overline{\tau}\}\) are defined based on delay \(0 \leq r \leq \overline{\tau}\) as, \[\begin{align} C_{r,ij} = \left\{ \begin{array}{ll} C_{ij}, & \text{If}~ \tau_{ij}=r \\ 0, & \text{Otherwise}. \end{array}\right. \end{align}\] Assuming time-invariant delays, for every \((i,j) \in \mathcal{E}\), only one of the entries \(C_{0,ij},C_{1,ij}, \ldots, C_{\overline{\tau},ij}\) is equal to \(C_{ij}\) and the rest are zero. This implies that the column-sum of the first \(n\) columns of \(\overline{C}\) and \(C\) are equal. Note that, given a column-stochastic \(C\) matrix, the augmented matrix \(\overline{C}\) is also column stochastic from the definition. It should be noted that this large augmented matrix is only used in proof analysis of the proposed algorithm, and it is not practically used in the agents’ dynamics (see the iterative dynamics 11 14 in the next section).

3 The Main Algorithm↩︎

The main algorithm in compact matrix form (subject to no time-delay) is given as follows: \[\begin{align} \tag{7} \mathbf{x}_{k+1} &= {C}_k \mathbf{x}_{k} - \alpha \mathbf{g}_k \\ \tag{8} \mathbf{y}_{k+1} &= {C}_k \mathbf{y}_{k} \\ \tag{9} \mathbf{z}_{k+1} &= \frac{\mathbf{x}_{k+1}}{\mathbf{y}_{k+1}} \\ \tag{10} \mathbf{g}_{k+1} &= {C}_k \mathbf{g}_{k} + \nabla \mathbf{f}_{k+1} - \nabla \mathbf{f}_{k} \end{align}\] For delayed case, define the augmented vectors \(\widehat{\mathbf{x}}_k,\widehat{\mathbf{y}}_k,\widehat{\mathbf{g}}_k\) of size \(n(\overline{\tau}+1)\). Let, \(\mathbf{Y}_k = diag(\widehat{\mathbf{y}}_k).\) Further, define the auxiliary matrix \(\Xi^{n}_{i,\overline{\tau}}\) is an \(n \times (\overline{\tau}+1)n\) matrix defined as \(\Xi^{n}_{i,\overline{\tau}}= (\mathbf{b}^{\overline{\tau}+1}_i \otimes I_{n})^\top\) with \(\mathbf{b}^{\overline{\tau}+1}_i\) as the unit column-vector of the \(i\)’th coordinate (\(1\leq i \leq {\overline{\tau}+1}\)), i.e., \(\mathbf{b}^{\overline{\tau}+1}_i =( \underbrace{\overbrace{0;\dots;0}^{i-1};1;0;\dots;0}_{\overline{\tau}+1})\) In case \(\mathbf{x} \in \mathbb{R}^{np}\) then \(\Xi^{np}_{i,\overline{\tau}} = \Xi^{n}_{i,\overline{\tau}} \otimes I_p\). Then, putting \(i=1\), we have \(\mathbf{x}_k = \Xi^{np}_{1,\overline{\tau}} \widehat{\mathbf{x}}_k, \mathbf{y}_k = \Xi^{np}_{1,\overline{\tau}} \widehat{\mathbf{y}}_k, \mathbf{g}_k = \Xi^{np}_{1,\overline{\tau}} \widehat{\mathbf{g}}_k\). In fact, \(\Xi^{np}_{1,\overline{\tau}}\) returns the first \(np\) rows of the augmented vector of size \(np(\overline{\tau}+1)\).

The main distributed optimization dynamics under communication time-delays are in the following vector form, \[\begin{align} \tag{11} \mathbf{x}_{k+1,i} &= \sum_{j \in \mathcal{N}_i} \sum_{r=0}^{\overline{\tau} } C_{k,ij} \mathcal{I}_{k-r,ij}(r) \mathbf{x}_{k-r,j} - \alpha \mathbf{g}_{k,i} \\ \tag{12} \mathbf{y}_{k+1,i} &= \sum_{j \in \mathcal{N}_i} \sum_{r=0}^{\overline{\tau} } C_{k,ij} \mathcal{I}_{k-r,ij}(r) \mathbf{y}_{k-r,j}\\ \tag{13} \mathbf{z}_{k+1,i} &= \frac{\mathbf{x}_{k+1,i}}{\mathbf{y}_{k+1,j}} \\ \tag{14} \mathbf{g}_{k+1,i} &= \sum_{j \in \mathcal{N}_i} \sum_{r=0}^{\overline{\tau} } C_{k,ij} \mathcal{I}_{k-r,ij}(r) \mathbf{g}_{k-r,j} + (\nabla \mathbf{f}_{k+1,i} - \nabla \mathbf{f}_{k,i}) \end{align}\] where \(\nabla \mathbf{f}_{k+1,i}\) denotes \(\nabla \mathbf{f}_{i}(\mathbf{z}_{k+1,i})\) and \(\mathcal{I}\) is the indicator function, \[\begin{align} \label{eq95mcI} \mathcal{I}_{k,ij}(\tau) = \left\{ \begin{array}{ll} 1, & \text{if}~ \tau_{ij}(k) = \tau,\\ 0, & \text{otherwise}. \end{array}\right. \end{align}\tag{15}\] In practice, agents use Eqs. 11 14 to update their states, i.e., the recipient agents use the neighboring data as they arrive with some possible delays. The proposed solution is summarized in Algorithm 1.

Figure 1: DTAC-ADDOPT

Equivalently in the matrix form, \[\begin{align} \tag{16} \widehat{\mathbf{x}}_{k+1} &= \overline{C}_k \widehat{\mathbf{x}}_{k} - \alpha \widehat{\mathbf{g}}_k \\ \tag{17} \widehat{\mathbf{y}}_{k+1} &= \overline{C}_k \widehat{\mathbf{y}}_{k} \\ \tag{18} \mathbf{z}_{k+1} &= \Xi^{np}_{i,\overline{\tau}} \widehat{\mathbf{z}}_{k+1},~ \widehat{\mathbf{z}}_{k+1} = \frac{\widehat{\mathbf{x}}_{k+1}}{\widehat{\mathbf{y}}_{k+1}} \\ \tag{19} \widehat{\mathbf{g}}_{k+1} &= \overline{C}_k \widehat{\mathbf{g}}_{k} + (\overline{\nabla \mathbf{f}}_{k+1} - \overline{\nabla \mathbf{f}}_{k}) \end{align}\] where, \(\overline{\nabla \mathbf{f}}_{k} := (\nabla \mathbf{f}_{k}; \nabla \mathbf{f}_{k-1}; \dots ; \nabla \mathbf{f}_{k-\overline{\tau}})\). Augmented matrix \(\overline{C}_k\) (defined in Section 2) is column-stochastic [27]. For the proposed solution, one can substitute the strong-connectivity of \(\mathcal{G}\) (irreducibility of \(C_k\)) in [5], [13] by the irreducibility of \(\overline{C}_k\). Note that given a column-stochastic consensus matrix \(C\), its augmented consensus version \(\overline{C}_k\) is also column-stochastic.

Lemma 1. There exists \(0<\gamma_1<1\) and \(0<T<\infty\) such that \[\begin{align} \|\mathbf{Y}_k - \mathbf{Y}_\infty \|_2 \leq T \gamma_1^k \end{align}\]

Proof. The proof follows from [5], [13], [27], [40].

The proof of \(\overline{C}_{l+k}\dots \overline{C}_{k+1}\overline{C}_{k}\) being SIA is given in [27]. The SIA property used in [5], [13], [40] to prove the lemma in the absence of time-delays. Recall that from the definition of the spectral radius [40], \[\begin{align} \rho(\overline{C}) = \lim_{k\rightarrow \infty} \| \overline{C}_1 \overline{C}_2 \dots \overline{C}_k\|^k \end{align}\] Then from [40], \(\gamma_1 > \rho(\overline{C})\). This value can be compared with \(\gamma > \rho({C})\) (for the non-delayed case) given in [5]. It can be shown from [41] that \(\rho(\overline{C})\leq \rho(C)^{\frac{1}{1+\overline{\tau}}}\) in Lemma 2. This implies that one can choose \(\gamma_1 = \gamma^{\overline{\tau}+1}\) for example. This implies that the convergence rate may be reduced by a power \({\overline{\tau}+1}\). ◻

Corollary 1. The proof can be extended to uniformly strongly connected graphs over \(B\) time-steps (or B-connected networks) as described in [13]. In this scenario, the multi-agent network is not necessarily connected at all times, but its union is connected over \(B\) time-steps, i.e., \(\cup_{t_k}^{t_k+B} \mathcal{G}_k\) is connected for all steps \(k \geq 0\).

The following lemma from our previous work [41] relates the spectral property of the delayed and deay-free system matrices.

Lemma 2. [41] Given a matrix \(A\) with \(\rho(A) < 1\) and its augmented form \(\overline{A}\) from 6 , we have \[\rho(\overline{A}) \leq \rho(A)^{\frac{1}{1+\overline{\tau}}} < 1\] Similarly, if \(\rho(A) = 1\), then \(\rho(\overline{A})=1\).

Define, \[\begin{align} y &:= \sup_k \| \mathbf{Y}_k\| \\ y_- &:= \sup_k \| \mathbf{Y}^{-1}_k\| \end{align}\] and recall the column-stochasticity of \(\overline{C}_k\). Then, the following lemma holds.

Lemma 3. For \(\mathbf{a} \in \mathbb{R}^{np(\overline{\tau}+1)}\) and \(\mathbf{Y}_\infty = \lim_{k \rightarrow \infty} \mathbf{Y}_k\) from \(\mathbf{Y}_k = diag(\widehat{\mathbf{y}}_k)\), there exist \(0<\sigma <1\) for some \(\overline{\tau}\), \[\begin{align} \|\overline{C}_k \mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}} \| \leq \sigma \|\mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}} \| \end{align}\]

Proof. The proof follows from the column-stochasticity of \(\overline{C}\) and Lemma 2. For any \(\mathbf{a} \in \mathbb{R}^{np}\) and \(\overline{\mathbf{a}} = \frac{1}{n} (\mathbf{1}_n \otimes I_{p})(\mathbf{1}^\top_n \otimes I_{p})\mathbf{a}\), there exist \(0<\sigma_1<1\) [5], \[\begin{align} \|{C}_k \mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}} \| \leq \sigma_1 \|\mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}} \| \end{align}\] Irreducible column-stochastic \(C\) with positive diagonals implies \(\rho(C) =1\). Let \(\pi\) satisfy \(C\pi = \pi\) and \(\mathbf{1}_n^\top \pi = 1\). \(C_\infty = \lim_{k \rightarrow \infty} C^k = \pi \mathbf{1}_n^\top \otimes I_p\). In the presence of time delays, if \(\tau_{ij} = \overline{\tau}\),\(\forall i,j\), then the proof for \(\overline{\pi}\) (the augmented version of \({\pi}\)) similarly follows. In this case, \(\overline{C}\) is irreducible, column-stochastic, and with proper column/row permutations, it can be transformed into a form with positive diagonals. Then, Perron-Frobenius theorem follows and \(\rho(\overline{C}) =1\) with other eigenvalue than \(1\) strictly less than \(\rho(\overline{C})\). Then, there exist (strictly positive) right-eigenvector \(\overline{\pi}\) corresponding to the eigenvalue \(1\) of \(\overline{C}\) such that \(\overline{C}_\infty = \lim_{k \rightarrow \infty} \overline{C}^k = \overline{\pi} \mathbf{1}_{n(\overline{\tau}+1)}^\top \otimes I_p\) (for example, \(\overline{\pi} = \mathbf{1}_{n(\overline{\tau}+1}\)) and the proof exactly follows. In case, \(\tau_{ij} \leq \overline{\tau}\), then \(\overline{\pi}\) is not strictly positive but it is non-negative. Following from [41], \(\overline{C}\) has some more zero eigenvalues. With \(\overline{C}_\infty = \overline{\pi} \mathbf{1}_{n(\overline{\tau}+1)}^\top \otimes I_p\), it follows that, \[\begin{align} \overline{C} \overline{C}_\infty &= \overline{C}_\infty. \\ \overline{C}_\infty \overline{C}_\infty &= \overline{C}_\infty. \end{align}\] and \(\frac{1}{n(\overline{\tau}+1)} \mathbf{Y}_\infty (\mathbf{1}_{n(\overline{\tau}+1)} \otimes I_p)(\mathbf{1}_{n(\overline{\tau}+1)}^\top \otimes I_p) = \overline{C}_\infty\), \[\begin{align} \overline{C}\mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}} = (\overline{C}-\overline{C}_\infty)(\mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}}). \end{align}\] Next, \[\begin{align} \rho(\overline{C} - \overline{C}_\infty) = \rho(\overline{C} - \overline{\pi} \mathbf{1}_{n(\overline{\tau}+1)}^\top \otimes I_p)<1 \end{align}\] Then, \[\begin{align} \|\overline{C}\mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}}\| \leq \|\overline{C}-\overline{C}_\infty\| \|\mathbf{a} - \mathbf{Y}_\infty \overline{\mathbf{a}}\|. \end{align}\] where \(\sigma = \|\overline{C}-\overline{C}_\infty\|\). ◻

Finding an exact relation between \(\sigma\) and \(\sigma_1\) as a function of \(\overline{\tau}\) could be one direction of future research. Here, the relation between \(\sigma=\left\| \overline{C} - \overline{C}_{\infty} \right\|\) (the augmented case) and \(\sigma_1=\left\| C - C_{\infty} \right\|\) (the delay-free case) is approximated in the following lemma.

Lemma 4. [41] Given a matrix \(C\) with \(\rho(C) < 1\) and its column-augmented form \(\overline{C}\), we have \[\rho(\overline{C}) \leq \rho(C)^{\frac{1}{1+\overline{\tau}}} < 1\] If \(\rho(C) = 1\), then \(\rho(\overline{C})=1\).

This lemma implies that \(\sigma \leq \sigma_1^{\frac{1}{1+\overline{\tau}}} < 1\), which says that by increasing \(\overline{\tau}\), \(\sigma\) gets closer to \(1\).

In the absence of delays, from [5] we have, \[\begin{align} \underline{\overline{\mathbf{x}}}_k &:= \frac{1}{n} (\mathbf{1}_n \otimes I_{p})(\mathbf{1}^\top_n \otimes I_{p})\mathbf{x}_k \\ \underline{\overline{\mathbf{g}}}_k &:= \frac{1}{n} (\mathbf{1}_n \otimes I_{p})(\mathbf{1}^\top_n \otimes I_{p})\mathbf{g}_k \\ \mathbf{z}^* &:= \mathbf{1}_n \otimes \underline{\mathbf{z}}^* \\ \underline{\mathbf{h}}_k &:= \frac{1}{n} (\mathbf{1}_n \otimes I_{p})(\mathbf{1}^\top_n \otimes I_{p}) \nabla \mathbf{f}_{k} \\ \underline{\mathbf{q}}_k &:= \frac{1}{n} (\mathbf{1}_n \otimes I_{p})(\mathbf{1}^\top_n \otimes I_{p}) \nabla \mathbf{f}(\underline{\overline{\mathbf{x}}}_k). \end{align}\] and, in the presence of communication time-delays (with max delay \(\overline{\tau}\)), the variables change to the augmented version as \[\begin{align} \overline{\mathbf{x}}_k &:= (\underline{\overline{\mathbf{x}}}_k;\underline{\overline{\mathbf{x}}}_{k-1}; \dots; \underline{\overline{\mathbf{x}}}_{k-\overline{\tau}}) \\ \overline{\mathbf{g}}_k &:= (\underline{\overline{\mathbf{g}}}_k;\underline{\overline{\mathbf{g}}}_{k-1}; \dots; \underline{\overline{\mathbf{g}}}_{k-\overline{\tau}}) \\ \mathbf{z}^* &:= \mathbf{1}_{n(\overline{\tau}+1)} \otimes \underline{\mathbf{z}}^* \\ \mathbf{h}_k &:= (\underline{\mathbf{h}}_k;\underline{\mathbf{h}}_{k-1}; \dots; \underline{\mathbf{h}}_{k-\overline{\tau}}) \\ \mathbf{q}_k &:= (\underline{\mathbf{q}}_k;\underline{\mathbf{q}}_{k-1}; \dots; \underline{\mathbf{q}}_{k-\overline{\tau}}). \end{align}\] Let’s define the following variables for the proof analysis: \[\begin{align} \mathbf{t}_k &:= \left( \begin{array}{c} \|\widehat{\mathbf{x}}_k - \mathbf{Y}_\infty \overline{\mathbf{x}}_k\| \\ \| \overline{\mathbf{x}}_k - \mathbf{z}^*\|_2 \\ \|\widehat{\mathbf{g}}_k - \mathbf{Y}_\infty \mathbf{h}_k\| \end{array} \right), \\ \mathbf{s}_k &:= \left( \begin{array}{c} \|\mathbf{x}_k \|_2 \\ 0 \\ 0 \end{array} \right),\\ \tag{20} G &:= \left( \begin{array}{ccc} \sigma & 0 & \alpha\\ \alpha c l y_- & \eta & 0 \\ c d \epsilon l y_-(\kappa + \alpha l y y_-) & \alpha d \epsilon l^2 y y_- & \sigma + \alpha c d \epsilon l y \end{array} \right),\\ \tag{21} H_k &:= \left( \begin{array}{ccc} 0 & 0 & 0\\ \alpha l y_- T \gamma_1^{k-1} & 0 & 0 \\ (\alpha l y +2)d\epsilon l y_-^2 T \gamma_1^{k-1} & 0 & 0 \end{array} \right), \end{align}\] where \(\kappa := \|\overline{C}_k - I_{np\overline{\tau}}\|_2 = \|C - I_{np}\|_2\), \(\epsilon := \|I_{np\overline{\tau}} - \overline{C}_\infty \|_2 := \|I_{np} - C_\infty \|_2\), \(\eta := \max\{1-n(\overline{\tau}+1)l,1-n(\overline{\tau}+1)s\}\). \(y=\sup_k\|\mathbf{Y}_k\|_2\), \(y_-=\sup_k\|\mathbf{Y}^{-1}_k\|_2\), and \(c,d\) are positive constant from the equivalence of \(\|\cdot\|\) and \(\|\cdot\|_2\). These variables are used in the following lemmas,

Lemma 5. Given the dynamics 11 14 , the following relation holds, \[\begin{align} \label{eq95lem5} \mathbf{t}_k \leq G \mathbf{t}_{k-1}+H_{k-1} \mathbf{s}_{k-1} \end{align}\qquad{(1)}\]

Proof. The proof is given later in Section 4. ◻

It should be clarified that Eq. ?? provides a linear iterative relation between \(t_k\) and \(t_{k+1}\) via matrices, \(G\) and \(H_{k-1}\). Therefore, the convergence of \(t_k\) follows from spectral analysis of matrices \(G\) and \(H\). In other words, to prove linear convergence of \(\|t_k\|_2\) toward zero, the sufficient condition is to prove \(\rho(G)<1\), as well as the linear decay of matrix \(H_{k-1}\), which is straightforward from Eq. 21 since \(0<\gamma_1<1\).

So, we need to prove that \(\rho(G)<1\) as a sufficient condition to bound \(\alpha\) (the spectral radius of \(G\) defined in Eq. 20 being less than \(1\)). This is discussed in the following lemma and proved by matrix perturbation theory.

Lemma 6. Given the matrix \(G(\alpha)\) defined in Eq. 20 , then \(\rho(G(\alpha))<1\) if, \[\begin{align} \label{eq95alpharange} 0 < \alpha < \min \{\alpha_3 , \frac{1}{n(\overline{\tau}+1)l}\} \end{align}\qquad{(2)}\] with \(\alpha_3:= \frac{\sqrt{\delta^{2}+4 n(\overline{\tau}+1) \mu(1-\sigma)^{2} \theta}-\delta}{2 \theta}\) and \[\begin{align} \nonumber \delta:=& n(\overline{\tau}+1)s c d \epsilon l y_-(1-\sigma+\kappa) \\ \nonumber \theta :=& c d \epsilon l^{2} y y_-^2(l+n(\overline{\tau}+1)s) \end{align}\]

Proof. If \(\alpha < \frac{1}{n(\overline{\tau}+1)l}\) then \(\eta=1-\alpha n(\overline{\tau}+1)s\), since \(l \geq s\) (see details in [5] and [42]). Following matrix perturbation analysis in [31] we set \(G = G_0 + \alpha \overline{G}\) with matrix \(\alpha \overline{G}\) collecting the \(\alpha\)-dependent terms in \(G\) and other independent terms in \(G_0\) as, \[\begin{align} G_0 &=\left(\begin{array}{ccc} \sigma & 0 & 0 \\ 0 & \eta & 0 \\ c d \epsilon l y_-\kappa & 0 & \sigma \end{array}\right) \\ \overline{G} &=\left(\begin{array}{ccc} 0 & 0 & 1 \\ c l y_- & 0 & 0 \\ c d \epsilon l y_-\left(l y y_-\right) & d \epsilon l^{2} y y_- & c d \epsilon ly_- \end{array}\right) \end{align}\] It is clear that for \(\alpha=0\), we have \(\rho(G)=\rho(G_0)=1\). This is because we know that \(0<\sigma<1\). From matrix perturbation theory [43] and following the characteristic polynomial of \(G_\alpha\) defined as

\[\begin{align} \label{eq:characteristic95polynomial} &\left((\lambda-\sigma)^{2}-\alpha c d \epsilon l y_-(\lambda-\sigma)\right)(\lambda-1+n(\overline{\tau}+1) \alpha s)-\alpha^{3} c d \epsilon l^{3} y y_-^{2}\nonumber\\ & \; -\alpha(\lambda-1+n(\overline{\tau}+1) \alpha s)\left(c d \epsilon l \kappa y_-+\alpha\left(c d \epsilon l^{2} y y_-^{2}\right)\right)=0.&& \end{align}\tag{22}\] One can conclude that \[\begin{align} \frac{d \lambda}{d \alpha}|_{\alpha=0, \lambda=1}=-n(\overline{\tau}+1) s<0, \end{align}\] This implies that if we slightly increase \(\alpha\) from \(0\) (i.e., going from \(G_0\) to \(G=G_0 + \alpha \overline{G}\)), the change in the eigenvalue \(\lambda=1\) is towards inside the unit circle and \(\rho(G_{\alpha})<1\). Next to find the range of admissible values for \(\alpha\), by setting \(\lambda =1\) and solving the characteristic equation 22 we get three answers: \[\begin{align} \alpha_{1}&=0, \quad \alpha_{2}<0, \quad \nonumber \text{and} \\ \alpha_{3}&=\frac{\sqrt{\delta^{2}+4 n(\overline{\tau}+1) s(1-\sigma)^{2} \theta}-\delta}{2 \theta}>0.\nonumber \end{align}\] which implies that for \(\alpha\) in the range of Eq. ?? we have \(\rho(G_{\alpha})<1\). ◻

It should be noted that the bound on \(\alpha\) holds for both heterogeneous delays \(\tau \leq \overline{\tau}\) and homogeneous maximum delays \(\tau = \overline{\tau}\). For both cases, the gradient-tracking rate \(\alpha\) satisfying Eq. ?? ensures the convergence.

Remark 1. As a follow-up to Eq. ?? in Lemma 6, when the bound on time-delay \(\overline{\tau}\) becomes very large, the gradient tracking rate \(\alpha\) needs to become very small. This results in low convergence rate for large delays. For \(\overline{\tau} \rightarrow \infty\) we have \(\alpha \rightarrow 0\), which implies that the algorithm converges so slowly that it becomes difficult to implement it. Therefore, in this paper, practically we assume reasonable bounded delays and no packet-loss.

Lemma 7. The following holds for all \(k>0\), \[\begin{align} \label{eq95ghk} \overline{\mathbf{g}}_k &= \mathbf{h}_k, \\ \label{eq95xhk} \overline{\mathbf{x}}_{k+1} &= \overline{\mathbf{x}}_{k} - \alpha \mathbf{h}_k. \end{align}\] {#eq: sublabel=eq:eq95ghk,eq:eq95xhk}

Proof. From Eq. 19 \[\begin{align} \nonumber \overline{\mathbf{g}}_k= \frac{1}{n(\overline{\tau}+1)} &(\mathbf{1}_{n(\overline{\tau}+1)} \otimes I_{p})(\mathbf{1}^\top_{n(\overline{\tau}+1)} \otimes I_{p})\\ &(\overline{C}_k \widehat{\mathbf{g}}_{k-1} + \nabla \mathbf{f}_{k} - \nabla \mathbf{f}_{k-1}). \end{align}\] Then, from column stochasticity of \(\overline{C}_k\), \[\begin{align} \nonumber \overline{\mathbf{g}}_k &= \overline{\mathbf{g}}_{k-1} + \mathbf{h}_k - \mathbf{h}_{k-1} \\ &= \overline{\mathbf{g}}_{0} + \mathbf{h}_k - \mathbf{h}_{0} = \mathbf{h}_k \end{align}\] where the last equality follows from \(\overline{\mathbf{g}}_{0} = \mathbf{h}_{0}\). Then, similar to [5], Eq. ?? follows by replacing Eq. ?? in Eq. 16 . ◻

Lemma 8. For the proposed dynamics 11 14 , from lemma 1 we have, \[\begin{align} \|\mathbf{Y}_{k-1}^{-1} \mathbf{Y}_\infty - I_{np}\|_2 \leq y_- T \gamma_1^{k-1} \\ \|\mathbf{Y}_{k}^{-1} - \mathbf{Y}_{k-1}^{-1}\|_2 \leq 2y_-^2 T \gamma_1^{k-1} \end{align}\]

Proof. The proof follows similar to [5]. ◻

The following lemma provides the main results on the linear convergence of Algorithm 1.

Lemma 9. Let \(s\) and \(l\) be the strong-convexity and Lipschitz-continuity constants and \(\mathbf{z}_+ = \mathbf{z} - \alpha \nabla \mathbf{f}(\mathbf{z})\) for given \(\mathbf{z}\) and \(0 < \alpha < \min \{\alpha_3 , \frac{1}{n(\overline{\tau}+1)l}\}\) with \(\alpha_3\) defined as in Lemma 6. Then, we have \[\begin{align} \label{eq95lem9} \|\mathbf{z}_+ - \underline{\mathbf{z}}^*\|_2 \leq \eta_1 \| \mathbf{z} - \underline{\mathbf{z}}^* \|_2 \end{align}\qquad{(3)}\] where \(\eta_1 = \max (|1 - \alpha n l| , |1 - \alpha n s|)\).

Proof. The proof follows similar to [5] and from [42]. ◻

It should be noted that large delays may cause considerable computational overhead as the dimension of the augmented matrices scales with the time-delay bound \(\overline{\tau}\). However, this trade-off is inherent to worst-case delay handling in this paper; handling delayed messages explicitly enables delay-tolerant convergence and explicit stability margins for gradient-tracking rate \(\alpha\) (as shown in Eq. ?? ) while, on the other hand, results in higher computational costs for large delays. In this paper, considering reasonable and sufficiently small delay bounds (to avoid packet loss), the convergence analysis and computational complexity are justified.

4 Convergence and Proof of Lemma 5↩︎

This section presents the proof of Lemma 5 and the convergence analysis in three separate steps.

Step-I:

First, from Eq. 16 , Lemma 3 and Lemma 7 we bound \(\|\widehat{\mathbf{x}}_k - \mathbf{Y}_\infty \overline{\mathbf{x}}_k\|\) as, \[\begin{align} \nonumber \|\widehat{\mathbf{x}}_k - \mathbf{Y}_\infty \overline{\mathbf{x}}_k\| \leq& \|\overline{C}_k \widehat{\mathbf{x}}_{k-1} - \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1}\| \\&+ \alpha \| \widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \mathbf{h}_{k-1}\| \label{eq95step195final} \\ \nonumber \leq& \sigma \|\widehat{\mathbf{x}}_{k-1} - \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1}\| \\&+ \alpha \| \widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \mathbf{h}_{k-1}\| \end{align}\tag{23}\]

Step-II:

Next we bound \(\|\overline{\mathbf{x}}_{k} - \mathbf{z}^*\|_2\). From Lemma 7, \[\begin{align} \label{eq95qh} \overline{\mathbf{x}}_{k} = (\overline{\mathbf{x}}_{k-1} - \alpha \mathbf{q}_{k-1}) - \alpha (\mathbf{h}_{k-1} - \mathbf{q}_{k-1}) \end{align}\tag{24}\] Let’s define \(\mathbf{x}_+ = \overline{\mathbf{x}}_{k-1} - \alpha \mathbf{q}_{k-1}\) as the augmented version of centralized GD step. Redefining Lemma 9 and Eq. ?? for the augmented variables, we get \[\begin{align} \|\mathbf{x}_+ - {\mathbf{z}}^*\|_2 \leq \eta \| \widehat{\mathbf{x}}_{k-1} - {\mathbf{z}}^* \|_2 \end{align}\] For the second term in 24 , from Lipschitz condition we obtain, \[\begin{align} \|\mathbf{h}_{k-1} - \mathbf{q}_{k-1}\|_2 \leq \|\frac{1}{n(\overline{\tau}+1)} (\mathbf{1}_{n(\overline{\tau}+1)} \mathbf{1}^\top_{n(\overline{\tau}+1)}) \otimes I_{p})\|_2 l \|\widehat{\mathbf{z}}_{k-1} - \overline{\mathbf{x}}_{k-1}\|_2 \end{align}\] Then, \[\begin{align} \nonumber \|\overline{\mathbf{x}}_{k} - \mathbf{z}^*\|_2 &\leq \|\mathbf{x}_+ - {\mathbf{z}}^*\|_2 + \alpha l \|\mathbf{h}_{k-1} - {\mathbf{q}}_{k-1}\|_2 \\ &\leq \eta \| \widehat{\mathbf{x}}_{k-1} - {\mathbf{z}}^* \|_2 + \alpha l \| \widehat{\mathbf{z}}_{k-1} - \overline{\mathbf{x}}_{k-1} \|_2 \label{eq:proofxz} \end{align}\tag{25}\] From Eq. 13 (or Eq. 18 ) and recalling Lemma 8 we get, \[\begin{align} \nonumber \|\widehat{\mathbf{z}}_{k-1} - \overline{\mathbf{x}}_{k-1}\|_2 \leq& \|\mathbf{Y}^{-1}_{k-1}(\widehat{\mathbf{x}}_{k-1} - \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1})\|_2 \\\nonumber &+ \|\mathbf{Y}^{-1}_{k-1}\mathbf{Y}_\infty - I_{np(\overline{\tau}+1)})\overline{\mathbf{x}}_{k-1}\|_2 \\ \nonumber \leq& y_-\|\widehat{\mathbf{x}}_{k-1} - \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1}\|_2 \\&+ y_-T \gamma_1^{k-1} \|\widehat{\mathbf{x}}_{k-1}\|_2 \label{eq:step2semifinal} \end{align}\tag{26}\] where we also used the fact that \(\|\overline{x}_{k-1}\|_2 \leq \|\widehat{x}_{k-1}\|_2\). Then, by substituting the above in Eq. 25 we get, \[\begin{align} \nonumber \|\overline{\mathbf{x}}_{k} - \mathbf{z}^*\|_2 \leq& \alpha c l \mathbf{y}_- \|\widehat{\mathbf{x}}_{k-1} \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1} \|_2 \\ &+ \eta \|\overline{\mathbf{x}}_{k-1} - \mathbf{z}^*\|_2 + \alpha l \mathbf{y}_- T \gamma_1^{k-1} \| \widehat{\mathbf{x}}_{k-1} \|_2 \label{eq95step295final} \end{align}\tag{27}\]

Step-III:

Next, we bound \(\|\widehat{\mathbf{g}}_{k} - \mathbf{Y}_\infty \mathbf{h}_k\|_2\). From Eq. 19 \[\begin{align} \nonumber \|\widehat{\mathbf{g}}_{k} - \mathbf{Y}_\infty \mathbf{h}_k\|_2 &\leq \|\overline{C}_k \widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \mathbf{h}_{k-1}\|_2 \\ \label{eq95gh} &+ \|(\overline{\nabla \mathbf{f}}_k - \overline{\nabla \mathbf{f}}_{k-1} - \mathbf{Y}_\infty( \mathbf{h}_{k} - \mathbf{h}_{k-1})\|_2 \end{align}\tag{28}\] From Lemma 3 and Lemma 7, \[\begin{align} \|\overline{C}_k \widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \mathbf{h}_{k-1}\|_2 \leq \sigma \|( \widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \overline{\mathbf{g}}_{k-1})\|_2 \end{align}\] Further, the second term in 28 can be recalculated as, \[\begin{align} \nonumber &\|(\overline{\nabla \mathbf{f}}_k - \overline{\nabla \mathbf{f}}_{k-1}) - \mathbf{Y}_\infty( \mathbf{h}_{k} - \mathbf{h}_{k-1})\|_2 =\\ \nonumber &\|(I_{np(\overline{\tau}+1)} - \frac{\mathbf{Y}_\infty}{n}(\mathbf{1}_{n(\overline{\tau}+1)} \otimes I_{p})(\mathbf{1}^\top_{n(\overline{\tau}+1)} \otimes I_{p})) (\overline{\nabla \mathbf{f}}_k - \overline{\nabla \mathbf{f}}_{k-1}) \|_2 \\ &\leq \epsilon l \|\widehat{\mathbf{z}}_k - \widehat{\mathbf{z}}_{k-1}\|_2 \end{align}\] which follows from the Lipschitz condition. Therefore, \[\begin{align} \label{eq95gh2} \|\widehat{\mathbf{g}}_{k} - \mathbf{Y}_\infty \mathbf{h}_k\|_2 &\leq \sigma \|\widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \mathbf{h}_{k-1}\|_2 + d \epsilon l \|\widehat{\mathbf{z}}_k - \widehat{\mathbf{z}}_{k-1}\|_2 \end{align}\tag{29}\] To bound \(\|\widehat{\mathbf{z}}_k - \widehat{\mathbf{z}}_{k-1}\|_2\) we have, \[\begin{align} \nonumber \|{\mathbf{h}}_{k-1} \|_2 &= \|\frac{1}{n(\overline{\tau}+1)}(\mathbf{1}_{n(\overline{\tau}+1)} \otimes I_{p})(\mathbf{1}^\top_{n(\overline{\tau}+1)} \otimes I_{p}\nabla \mathbf{f}(\overline{\mathbf{x}}_{k-1}) \|_2 \\ &\leq l \|\overline{\mathbf{x}}_{k-1} - \mathbf{z}^*\|_2 \end{align}\] Therefore, using Eq. 26 , we obtain, \[\begin{align} \nonumber \|\mathbf{Y}_k^{-1} \widehat{\mathbf{g}}_{k-1}\|_2 \leq& y_-\| \widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \mathbf{h}_{k-1} \|_2 \\\nonumber &+ y_- y l \| \overline{\mathbf{x}}_{k-1} - \mathbf{z}^* \|_2 \\\nonumber &+ y_-^2 y l \| \widehat{\mathbf{x}}_{k-1} - \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1} \|_2 \\ &+ y_-^2 y l T \gamma_1^{k-1}\| \widehat{\mathbf{x}}_{k-1} \|_2 \end{align}\] Recall that \((\overline{C}-I_{n(\overline{\tau}+1)p})\mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1} = \mathbf{0}\). Then, \[\begin{align} \nonumber \|\widehat{\mathbf{z}}_k - \widehat{\mathbf{z}}_{k-1}\|_2 \leq& (y_- \kappa + \alpha y_-^2 y l) \|\widehat{\mathbf{x}}_{k-1} - \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1}\|_2 \\\nonumber &+ \alpha y_- \|\widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty {\mathbf{h}}_{k-1}\|_2 \\\nonumber &+ \alpha y_- y l \|\overline{\mathbf{x}}_{k-1} - {\mathbf{z}}^*\|_2 \\ &+ (\alpha y l +2) y_-^2 T \gamma_1^{k-1} \|\widehat{\mathbf{x}}_{k-1}\|_2 \end{align}\] Substitute the above in Eq. 29 , \[\begin{align} \nonumber \|{\mathbf{g}}_{k} - \mathbf{Y}_\infty \mathbf{h}_k\|_2 \leq& (c d \epsilon l \kappa y_- + \alpha c d \epsilon l^2 y y_-^2) \|\widehat{\mathbf{x}}_{k-1} \\ \nonumber &- \mathbf{Y}_\infty \overline{\mathbf{x}}_{k-1}\|_2 + \alpha d \epsilon l^2 y y_- \|\overline{\mathbf{x}}_k - \mathbf{z}^*\|_2 \\ \nonumber &+ (\sigma + \alpha c d \epsilon l y_-) \|\widehat{\mathbf{g}}_{k-1} - \mathbf{Y}_\infty \mathbf{h}_{k-1}\|_2 \\ &+ (\alpha y l +2)d \epsilon l y_-^2 T \gamma_1^{k-1} \|\widehat{\mathbf{x}}_{k-1}\|_2 \label{eq95step395final} \end{align}\tag{30}\] Finally, combining Eqs. 23 , 27 , and 30 results in Lemma 5 and proves the convergence.

5 Simulations↩︎

5.1 Academic Example↩︎

For the experimental simulation, we consider a quadratic cost function as Eq. 5 similar to [44] with randomly set parameters. The number of agents is set as \(n=10\) nodes. The bound on the time-delay is set \(\overline{\tau}=5\) and \(\alpha=0.005\). We consider convergence over two cases of random Erdos-Renyi (ER) networks: (i) static networks where the structure of the multi-agent network is time-invariant, and (ii) dynamic (switching) networks where the network structure randomly changes every \(2\) iterations. The simulations are shown in Fig. 2. For the switching case, there exist some oscillations in the residual decay due to changes in the network topology.

a

b

Figure 2: This figure shows the decay of the optimization residual (average error) under time-delays over (left) a static ER network and (right) a dynamic ER network. As it is clear from the figures, the algorithm converges under time-delays. There are some oscillation in the decay of the right figure, which is due to change in the network topology..

Next, we redo the simulations over an ER network to check the convergence for different values of bound on the time-delays, i.e., \(\overline{\tau}= 5,10,15,20\). We set gradient-tracking rate \(\alpha = 0.001\) and \(\alpha = 0.005\) for this simulation. The mean-square-error (MSE) residuals at agents for different bounds on the time-delay are shown in Fig. 3. As it can be seen from the figure, for large value of \(\overline{\tau}\), the residual decay becomes unstable and the optimization diverges (see the residual for \(\overline{\tau}= 15,20\) in the right figure for \(\alpha = 0.005\)).

a

b

Figure 3: This figure shows the decay of the optimization residual (mean-square-error) subject to different values of time-delays over an ER network. The left figure shows the residual decay for \(\alpha=0.001\) and the right figure for \(\alpha=0.005\). As it is clear from the figure, for large value of \(\overline{\tau}\) the residual decay becomes unstable and loses convergence..

5.2 Real Data-Set Example↩︎

We use the MNIST dataset for distributed optimization, which is a well-known dataset in the field of machine learning and image classification. It consists of handwritten digits from \(0\) to \(9\) and is commonly used to train and test various classification algorithms. The dataset includes \(70000\) images of handwritten digits. Each image is a \(28 \times 28\) grayscale image, resulting in \(784\) pixels per image, and is associated with a label from \(0\) to \(9\), indicating the digit it represents. A set of sampled images is shown in Fig. 4.

Figure 4: This figure shows a sample set of images of hand-written numbers from 0 to 9 taken from the MNIST data set. This data set is used for image classification via the optimization objective 31 and 32 .

The data set and image processing algorithms are taken from [45]. We randomly select \(N = 12000\) labelled images from the MNIST data set to be classified using logistic regression with a convex regularizer. The data are distributed among \(n=16\) agents to be cooperatively classified. In our cost optimization setup, define \[\begin{align} \label{eq95F95mnist} \min_{\mathbf{b},c} & F(\mathbf{b},c) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x}) \end{align}\tag{31}\] with every node \(i\) taking a batch of \(m_i=750\) sample images. Each node \(i\), then, locally minimizes the following classification cost: \[\begin{align} \label{eq95fij95regression} f_i(\mathbf{x}) = \frac{1}{m_i}\sum_{j=1}^{m_i} \ln(1+\exp(-(\mathbf{b}^\top x_{i,j}+c)y_{i,j}))+\frac{\lambda}{2}\|\mathbf{b}\|_2^2. \end{align}\tag{32}\] with \(\mathbf{b},c\) as the classifier parameters. The residual is defined as \(F(\overline{\mathbf{x}}^k)-F(\mathbf{x}^*)\) with \(\overline{\mathbf{x}}^k = \frac{1}{n} \sum_{i=1}^{n}\mathbf{x}^k_i\). We run and compare the residual of distributed training for different existing distributed optimization techniques in the literature over an exponential network. The following optimization algorithms are used for comparison: GP [13], SGP [46], [47], S-ADDOPT [6], and PushSAGA [48]. The simulation results are given in Fig. 5 for an exponential graph of \(n=16\) nodes (the graph is shown in the figure). It should be mentioned that GP, SGP, S-ADDOPT, and Push-SAGA are not delay-tolerant and, thus, are simulated for delay-free case. Therefore, as expected, they show better performance in the absence of time-delays, while practically they do not converge in the presence of time-delays. On the other hand, our DTAC-ADDOPT algorithm converges in the presence of heterogeneous time-delays. For this simulation, we set \(\overline{\tau}=3\). The slower rate of convergence for DTAC-ADDOPT is due to time-delays in the data sharing as compared to the other delay-free optimization techniques.

a

b

Figure 5: This simulation presents different distributed techniques over the exponential graph (given in the right figure) to optimize the objective function 31 and 32 . Note that only the proposed DTAC-ADDOPT is simulated subject to information-exchange delays, and the other techniques are simulated in the absence of delays..

6 Conclusions and Future Works↩︎

Delay-tolerant distributed optimization over digraphs is proposed in this work. We present a distributed algorithm over a multi-agent network that is robust to time-delayed information-exchange among the agents. The delay-tolerance is shown both by mathematical proofs and experimental simulations. Future research direction includes finding a tighter bound between \(\sigma_1\) and \(\sigma\) based on \(\overline{\tau}\) in Lemma 3. One can extend the convergence analysis to find maximum delay \(\tau_{\max}\) for which the algorithm fails to converge when \(\tau > \tau_{\max}\). Our analysis is based on bounded delays, considering no packet loss over the network, where the extension to certain classes of packet losses via standard buffering/retransmission or stochastic-analysis variants are left for future research. Moreover, distributed optimization subject to asynchronous and event-triggered operation, privacy-preserving distributed optimization [49], and adding nonlinearities and momentum terms to reach faster convergence (similar to coupling-constrained optimization and resource allocation in [50]) are other open problems and directions of future research. Different applications in machine learning setups can also be considered for future research.

References↩︎

[1]
Z. S. Ageed, S. R. M. Zeebaree, Distributed Systems Meet Cloud Computing: A Review of Convergence and Integration, International Journal of Intelligent Systems and Applications in Engineering12 (11s) (2024) 469–490.
[2]
X. Zhang, Q. Han, Time-delay systems and their applications, International Journal of Systems Science53 (12) (2022) 2477–2479.
[3]
M. Doostmohammadian, Z. R. Gabidullina, H. R. Rabiee, Momentum-based distributed resource scheduling optimization subject to sector-bound nonlinearity and latency, Systems & Control Letters199(2025) 106062.
[4]
A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control54 (1) (2009) 48–61.
[5]
C. Xi, R. Xin, U. A. Khan, ADD-OPT: Accelerated Distributed Directed Optimization, IEEE Transactions on Automatic Control63 (5) (2018) 1329–1339, .
[6]
M. I. Qureshi, R. Xin, S. Kar, U. A. Khan, S-ADDOPT: Decentralized stochastic first-order optimization over directed graphs, IEEE Control Systems Letters5 (3) (2020) 953–958.
[7]
M. Xu, M. Li, F. Hao, Fully distributed optimization of second-order systems with disturbances based on event-triggered control, Asian Journal of Control25 (5) (2023) 3715–3728.
[8]
X. Cai, H. Zhong, Y. Li, J. Liao, X. Chen, X. Nan, B. Gao, An event-triggered quantization communication strategy for distributed optimal resource allocation, Systems & Control Letters180(2023) 105619.
[9]
P. Yi, L. Li, Distributed nonsmooth optimization over Markovian switching random networks with two step-sizes, Journal of Systems Science and Complexity34 (4) (2021) 1324–1344.
[10]
S. Liang, L. Zhang, Y. Wei, Y. Liu, Hierarchically Distributed Optimization with a Flexible and Complexity-Reducing Algorithm, Journal of Systems Science and Complexity(2024) 1–26.
[11]
K. Zhu, Y. Tang, Primal-dual \(\varepsilon\)-subgradient method for distributed optimization, Journal of Systems Science and Complexity36 (2) (2023) 577–590.
[12]
W. Shi, Q. Ling, G. Wu, W. Yin, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization25 (2) (2015) 944–966.
[13]
A. Nedić, A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Transactions on Automatic Control60 (3) (2014) 601–615.
[14]
M. Zarepisheh, L. Xing, Y. Ye, A computation study on an integrated alternating direction method of multipliers for large scale optimization, Optimization Letters12(2018) 3–15.
[15]
T. Chang, M. Hong, X. Wang, Multi-agent distributed optimization via inexact consensus ADMM, IEEE Transactions on Signal Processing63 (2) (2014) 482–497.
[16]
C. Song, S. Yoon, V. Pavlovic, Fast ADMM algorithm for distributed optimization with adaptive penalty, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, 2016.
[17]
Z. Lu, S. Mou, Distributed optimization under edge agreements: A continuous-time algorithm, Systems & Control Letters183(2024) 105698.
[18]
M. Doostmohammadian, W. Jiang, T. Charalambous, DTAC-ADMM: Delay-Tolerant Augmented Consensus ADMM-based Algorithm for Distributed Resource Allocation, in: IEEE 61st Conference on Decision and Control (CDC), IEEE, 308–315, 2022.
[19]
W. Jiang, M. Doostmohammadian, T. Charalambous, Distributed resource allocation via ADMM over digraphs, in: IEEE 61st Conference on Decision and Control (CDC), IEEE, 5645–5651, 2022.
[20]
M. Assran, A. Aytekin, H. Feyzmahdavian, M. Johansson, M. G. Rabbat, Advances in asynchronous parallel and distributed optimization, Proceedings of the IEEE108 (11) (2020) 2013–2031.
[21]
X. Wu, C. Liu, S. Magnússon, M. Johansson, Asynchronous distributed optimization with delay-free parameters, IEEE Transactions on Automatic Control .
[22]
Y. Shang, Resilient consensus in continuous-time networks with l-hop communication and time delay, Systems & Control Letters175(2023) 105509.
[23]
Y. Shang, Average consensus in multi-agent systems with uncertain topologies and multiple time-varying delays, Linear Algebra and its Applications459(2014) 411–429.
[24]
Y. Shang, Group consensus of multi-agent systems in directed networks with noises and time delays, International Journal of Systems Science46 (14) (2015) 2481–2492.
[25]
R. Olfati-Saber, R. M. Murray, Consensus problems in networks of agents with switching topology and time-delays, IEEE Transactions on automatic control49 (9) (2004) 1520–1533.
[26]
A. Seuret, D. V. Dimarogonas, K. H. Johansson, Consensus under communication delays, in: 47th IEEE Conference on Decision and Control, IEEE, 4922–4927, 2008.
[27]
C. N. Hadjicostis, T. Charalambous, Average consensus in the presence of delays in directed graph topologies, IEEE Transactions on Automatic Control59 (3) (2013) 763–768.
[28]
S. Behjat, M. Salehizadeh, G. Lorenzini, Modeling time-delay in consensus control: A review, International Journal of Research and Technology in Electrical Industry3 (1) (2024) 287–298.
[29]
K. I. Tsianos, S. Lawlor, M. G. Rabbat, Push-Sum Distributed Dual Averaging for convex optimization, in: 51st IEEE Conference on Decision and Control (CDC), 5453–5458, 2012.
[30]
R. Xin, S. Kar, U. A. Khan, Decentralized Stochastic Optimization and Machine Learning: A Unified Variance-Reduction Framework for Robust Performance and Fast Convergence, IEEE Signal Processing Magazine37 (3) (2020) 102–113.
[31]
M. Doostmohammadian, A. Aghasi, T. Charalambous, U. A. Khan, Distributed support vector machines over dynamic balanced directed networks, IEEE Control Systems Letters6(2021) 758–763.
[32]
F. Saadatniaki, R. Xin, U. A. Khan, Decentralized optimization over time-varying directed graphs with row and column-stochastic matrices, IEEE Transactions on Automatic Control65 (11) (2020) 4769–4780.
[33]
W. Jiang, T. Charalambous, Distributed alternating direction method of multipliers using finite-time exact ratio consensus in digraphs, in: European Control Conference (ECC), IEEE, 2205–2212, 2021.
[34]
K. Rokade, R. K. Kalaimani, Distributed ADMM With Linear Updates Over Directed Networks, IEEE Transactions on Network Science and Engineering12 (2) (2025) 1396–1407.
[35]
O. Chapelle, Training a support vector machine in the primal, Neural computation19 (5) (2007) 1155–1178.
[36]
M. Doostmohammadian, M. Pirani, On the Design of Resilient Distributed Single Time-Scale Estimators: A Graph-Theoretic Approach, IEEE Transactions on Network Science and Engineering(2025) 1–10.
[37]
A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, A. Scaglione, Gossip algorithms for distributed signal processing, Proceedings of the IEEE98 (11) (2010) 1847–1864.
[38]
S. Kar, J. M. F. Moura, Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise, IEEE Transactions on Signal Processing57 (1) (2008) 355–369.
[39]
C. Godsil, G. Royle, Algebraic graph theory, New York: Springer, 2001.
[40]
V. D. Blondel, J. M. Hendrickx, A. Olshevsky, J. N. Tsitsiklis, Convergence in multiagent coordination, consensus, and flocking, in: 44th IEEE Conference on Decision and Control, 2996–3000, 2005.
[41]
M. Doostmohammadian, M. Pirani, U. A. Khan, T. Charalambous, Consensus-Based Distributed Estimation in the presence of Heterogeneous, Time-Invariant Delays, IEEE Control Systems Letters6(2021) 1598 – 1603.
[42]
S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends® in Machine Learning8 (3-4) (2015) 231–357.
[43]
R. Bhatia, Perturbation bounds for matrix eigenvalues, SIAM, 2007.
[44]
N. K. Ramesh, Accelerated Distributed Directed Optimization With Time Delays, master Thesis, Aalto University, 2021.
[45]
M. I. Qureshi, U. A. Khan, Stochastic First-Order Methods Over Distributed Data, in: IEEE 12th Sensor Array and Multichannel Signal Processing Workshop (SAM), 405–409, 2022.
[46]
A. Spiridonoff, A. Olshevsky, I. C. Paschalidis, Robust asynchronous stochastic gradient-push: Asymptotically optimal and network-independent performance for strongly convex functions, Journal of Machine Learning Research21 (58) (2020) 1–47.
[47]
A. Nedić, A. Olshevsky, Stochastic gradient-push for strongly convex functions on time-varying directed graphs, IEEE Transactions on Automatic Control61 (12) (2016) 3936–3947.
[48]
M. I. Qureshi, R. Xin, S. Kar, U. A. Khan, Push-SAGA: A decentralized stochastic algorithm with variance reduction over directed graphs, IEEE Control Systems Letters6(2022) 1202–1207.
[49]
O. Mangasarian, Privacy-preserving linear programming, Optimization Letters5(2011) 165–172.
[50]
M. Doostmohammadian, A. Aghasi, M. Pirani, E. Nekouei, U. A. Khan, T. Charalambous, Fast-convergent anytime-feasible dynamics for distributed allocation of resources over switching sparse networks with quantized communication links, in: European Control Conference, IEEE, 84–89, 2022.