October 01, 2025
In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton–Schulz iterations—a technique inherited from its centralized predecessor, Muon—and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate that DeMuon outperforms other popular decentralized algorithms by a clear margin across different network topologies.
Muon, decentralized optimization, transformers, heavy-tailed noise, iteration complexity
A recent matrix-variate optimizer, called Muon [1], has garnered widespread attention among deep learning researchers. Muon departs from traditional optimizers, such as Adam [2], which are based on vectorizing matrix variables, and demonstrates clear advantages through strong empirical results on large-scale, ill-conditioned matrix optimization problems arising in the training of large models. The empirical success of Muon has inspired a large number of research efforts focused on analyzing its convergence [3]–[10] and developing its variants, including SWAN [11], Scion [12], Gluon [7], PolarGrad [13], Dion [14], and LR-Muon [15], among others.
Given the success of Muon in centralized settings, it is natural to ask whether it can be extended to decentralized settings. In this paper, we take a step toward extending Muon to decentralized problems and consider the following finite-sum matrix optimization problem: \[\begin{align} \label{ucpb} \min_{X\in\mathbb{R}^{m\times n}} f(X) := \frac{1}{N} \sum_{i=1}^N f_i(X), \end{align}\tag{1}\] where the local objective \(f_i:\mathbb{R}^{m\times n}\to\mathbb{R}\) at each node \(i\), \(i\in[N]\), is continuously differentiable and possibly nonconvex, and the \(N\) nodes are connected by a given topology \(\mathcal{G}=(\mathcal{V},\mathcal{E})\). Here, \(\mathcal{V}=[N]\) denotes all node indices and \(\mathcal{E}\) collects all directed pairs \((i,j)\in[N]\times[N]\) such that node \(i\) can send information to node \(j\). In addition, at each node \(i\), only stochastic gradients of \(f_i\) corrupted by heavy-tailed noise (see Assumption 1(c) for details) are accessible. The overarching goal is to solve 1 using local computation and peer-to-peer communication, without relying on a central coordinator [16], [17]. Problems in the form of 1 encompass a broad range of applications in machine learning and related fields, such as decentralized optimization for neural network training [18], [19], matrix factorization [20], PCA [21], and low-rank matrix completion [22]–[24].
Despite extensive research on algorithms for decentralized optimization, mainstream approaches—such as DSGD [25], [26], EXTRA [27], and gradient tracking methods [28]—require vectorizing matrix variables when applied to solve problem 1 . This dominance of vector-variate algorithms in decentralized settings mirrors the situation in centralized settings before the recent success of matrix-variate algorithms. Moreover, research on decentralized stochastic optimization under heavy-tailed noise remains limited, with [29]–[34] being among the few recent works that address this topic. These limitations of existing approaches to solving problem 1 motivate the following research questions:
Can we design a decentralized algorithm that solves problem 1 while leveraging the matrix orthogonalization in Muon?
Can we further establish its complexity guarantees under heavy-tailed noise conditions?
In this paper, we address these questions by proposing a novel decentralized variant of Muon and establishing its iteration complexity for finding an approximate stochastic stationary point of 1 under the heavy-tailed noise regime. Our main contributions are highlighted below.
We propose a decentralized variant of Muon, named DeMuon, which is the first decentralized optimization method leveraging matrix orthogonalization to solve 1 . We establish its iteration complexity for finding an \(\epsilon\)-nuclear norm stationary solution of 1 under heavy-tailed noise conditions.
We conduct preliminary numerical experiments on decentralized transformer pretraining, and our results demonstrate the superior performance of DeMuon compared to other popular decentralized algorithms across various network topologies.
Throughout this paper, we use \(\mathbb{R}^{m\times n}\) to represent the Euclidean space of \(m\times n\) real matrices, and \(\mathbb{Z}_+\) to denote the set of all nonnegative integers. We use \(\|\cdot\|\) to denote the Euclidean norm of a vector or the spectral norm of a matrix, and \(\|\cdot\|_*\) and \(\|\cdot\|_F\) to denote the nuclear norm and the Frobenius norm of a matrix, respectively. We use \(\langle\cdot,\cdot\rangle\) to denote the trace inner product for matrices. We define the matrix sign of any nonzero \(M \in \mathbb{R}^{m\times n}\) as \(\mathrm{msgn}(M) = UV^T\), where \(U \in \mathbb{R}^{m\times r}\) and \(V \in \mathbb{R}^{n\times r}\) are the column-orthogonal matrices obtained from the reduced SVD of \(M\) and \(r\) is the rank of \(M\). For any integer \(N\ge1\), we denote \([N]:=\{1,\ldots,N\}\). For matrices \(X_i\in\mathbb{R}^{m\times n}\), \(i\in[N]\), we denote the stacked and average matrices as \(X_{[N]}:=[X_1^T,\cdots,X_N^T]^T\) and \(\overline{X}:=\frac{1}{N}\sum_{i=1}^N X_i\), respectively. We define the stacked local objectives and gradients as \[\begin{align} F(X_{[N]}):=[f_1(X_1), \ldots, f_N(X_N)]^T,\quad \nabla F(X_{[N]})&:=[\nabla f_1(X_1)^T, \cdots, \nabla f_N(X_N)^T]^T, \end{align}\] and the average local gradients as \(\overline{\nabla} F(X_{[N]}):=\frac{1}{N}\sum_{i=1}^N\nabla f_i(X_i)\). For any \(A\in\mathbb{R}^{m\times n}\) and \(B\in\mathbb{R}^{p\times q}\), let \(A\otimes B\) denote the Kronecker product of \(A\) and \(B\). We will use the following property of the Kronecker product: \[\begin{align} \label{eq:ppt-pd} (A\otimes B)(C\otimes D)=(AC)\otimes(BD), \end{align}\tag{2}\] which holds whenever the products \(AC\) and \(BD\) are well defined. For any positive integer \(d\), let \(I_d\) denote the \(d\times d\) identity matrix, and \(\mathbf{1}_d\) denote the \(d\)-dimensional all-ones vector. In addition, we use \(\mathcal{O}(\cdot)\) to denote the standard big-O notation.
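To make \(\mathrm{msgn}\) concrete, the following minimal NumPy sketch (our illustration, not code from the paper) computes it exactly via the reduced SVD and approximately via a cubic Newton–Schulz iteration of the kind Muon uses for orthogonalization; the classical coefficients \((1.5,-0.5)\) are an assumption here, as practical Muon implementations often use tuned higher-order polynomials.

```python
import numpy as np

def msgn_svd(M: np.ndarray) -> np.ndarray:
    """Matrix sign via reduced SVD: msgn(M) = U V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def msgn_newton_schulz(M: np.ndarray, iters: int = 25) -> np.ndarray:
    """Approximate msgn(M) by the cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, initialized at X = M / ||M||_F so that
    every singular value lies in (0, 1] and is driven toward 1."""
    X = M / np.linalg.norm(M, "fro")
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 4))
print(np.linalg.norm(msgn_svd(M) - msgn_newton_schulz(M)))  # close to 0
```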
We now make the following assumptions throughout this paper.
Assumption 1.
There exists a finite \(f_{\mathrm{low}}\) such that \(f(X)\ge f_{\mathrm{low}}\) holds for all \(X\in\mathbb{R}^{m\times n}\).
There exists some \(L_*>0\) such that \[\begin{align} \|\nabla f_i(X) - \nabla f_i(Y)\|_* \le L_*\|X - Y\|\quad \forall X,Y\in\mathbb{R}^{m\times n},i\in[N]. \end{align}\]
There exist some \(\sigma>0\) and \(\alpha\in(1,2]\) such that the stochastic estimators \(G_i:\mathbb{R}^{m\times n}\times\Xi\to\mathbb{R}^{m\times n}\) satisfy \[\begin{align} \mathbb{E}[G_i(X;\xi)]=\nabla f_i(X),\quad \mathbb{E}[\|G_i(X;\xi) - \nabla f_i(X)\|_*^\alpha] \le \sigma^\alpha\quad \forall X\in\mathbb{R}^{m\times n},i\in[N]. \end{align}\]
The mixing matrix \(W\in\mathbb{R}^{N\times N}\) associated with the graph \(\mathcal{G}\) satisfies the following properties:
Primitivity: \(W\ge0\), and \(W^j>0\) for some positive integer \(j\);
Double stochasticity: \(\mathbf{1}_N^TW=\mathbf{1}_N^T\), and \(W\mathbf{1}_N = \mathbf{1}_N\).
Remark 1. (i) It follows from Assumption 1(b) that \[\begin{align} f(Y) \le f(X) + \langle\nabla f(X), Y-X\rangle + \frac{L_*}{2}\|Y-X\|^2\quad \forall X,Y\in\mathbb{R}^{m\times n}.\label{ineq:desc-} \end{align}\tag{3}\] In addition, for convenience, we let \(L_F:=NL_*\). Then, using Lemma 3(i) below, we obtain \[\begin{align} \|\nabla F(U_{[N]})-\nabla F(V_{[N]})\|_* & \le \sum_{i=1}^N \|\nabla f_i(U_i) - \nabla f_i(V_i)\|_* \le L_*\sum_{i=1}^N \|U_i - V_i\| \nonumber\\ & \le L_F\|U_{[N]}-V_{[N]}\|\quad \forall U_{[N]},V_{[N]}\in\mathbb{R}^{(Nm)\times n}.\nonumber \end{align}\]
(ii) Under Assumption 1(d), we define the mixing rate of the mixing matrix \(W\) as \[\begin{align} \label{upbd:eig-W} \lambda := \|W - \mathbf{1}_N \mathbf{1}_N^T/N \| < 1, \end{align}\tag{4}\] which characterizes the consensus performance in decentralized optimization (see, e.g., [35]).
The remainder of this paper is organized as follows. In Section 2, we develop a decentralized variant of Muon, referred to as DeMuon. Sections 3 and 4 present the simulation results and the proofs of the consensus error bound and iteration complexity of DeMuon, respectively.
In this section, we propose a decentralized variant of Muon, which we call DeMuon for brevity, and establish its consensus error and iteration complexity for computing an approximate stationary solution to 1 under heavy-tailed noise.
Our proposed DeMuon generates three sequences, \(\{M^k_i\}\), \(\{V^k_i\}\), and \(\{X^k_i\}\). Specifically, at each iteration \(k\ge0\), each local gradient estimator \(M^k_i\) is updated as an exponentially weighted moving average of the stochastic gradients of \(f_i\) evaluated at \(X^0_i,\ldots,X^k_i\). Next, DeMuon applies a gradient tracking technique (see, e.g., [36]) to update \(V^k_i\), which ensures that the \(V^k_i\), \(i\in[N]\), achieve consensus and approximate the global gradient. Then, the local iterate \(X^{k+1}_i\) is updated by aggregating Muon-type updates, in which orthogonalization is applied to the \(V^k_j\) received from neighboring nodes. Details of DeMuon are described in Algorithm 1, and one round of these updates is sketched below.
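To make the above description concrete, the following sketch (our illustration with hypothetical names such as `demuon_round`; a real deployment would only exchange messages along edges with \(w_{ij}\neq0\)) simulates one synchronous round of the three updates:

```python
import numpy as np

def msgn(A: np.ndarray) -> np.ndarray:
    """msgn(A) = U V^T from the reduced SVD (Newton-Schulz in practice)."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def demuon_round(W, X, M, V, grads, eta, theta):
    """One synchronous DeMuon round over N nodes.

    W       : (N, N) doubly stochastic mixing matrix
    X, M, V : length-N lists of local iterates, momenta, and trackers
    grads   : length-N list of stochastic gradients G_i(X_i^k; xi_i^k)
    """
    N = len(X)
    # Momentum: exponentially weighted moving average of stochastic gradients.
    M_new = [(1 - theta) * M[i] + theta * grads[i] for i in range(N)]
    # Gradient tracking: mix neighbors' trackers plus momentum increments.
    V_new = [sum(W[i, j] * (V[j] + M_new[j] - M[j]) for j in range(N))
             for i in range(N)]
    # Muon-type step: aggregate neighbors' orthogonalized directions.
    X_new = [sum(W[i, j] * (X[j] - eta * msgn(V_new[j])) for j in range(N))
             for i in range(N)]
    return X_new, M_new, V_new
```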
The next lemma gives an upper bound on the consensus error of \(\{X_i^k\}\) generated by DeMuon. Its proof is deferred to Section 4.
Lemma 1 (consensus error). Suppose that Assumption 1 holds. Let \(\{X_i^k\}\) be generated by Algorithm 1 with step size \(\eta>0\), and let \(\lambda\) be given in 4 . Then, it holds that \(\|X_{[N]}^k - \mathbf{1}_N\otimes\overline{X}^k\| \le \sqrt{N}\lambda\eta/(1-\lambda)\) for all \(k\ge0\).
Remark 2. We observe that the consensus error bound in Lemma 1 is similar to those for decentralized normalized vector-variate algorithms; e.g., see [32]. However, upon closer inspection, the consensus error in DeMuon is measured in the spectral norm, which is always bounded above by the Frobenius norm. Therefore, DeMuon yields a tighter consensus error bound than decentralized normalized SGD.
The following theorem establishes the iteration complexity of DeMuon. Its proof is deferred to Section 4.
Theorem 1 (iteration complexity). Suppose that Assumption 1 holds. Let \(f_{\mathrm{low}}\), \(L_*\), \(\sigma\), and \(\alpha\) be given in Assumption 1, and \(L_F\) and \(\lambda\) be defined in Remark 1. Define \[\begin{align} U_{\mathrm{dm}} &:= f(\overline{X}^0) - f_{\mathrm{low}} + 3(N\sigma)^\alpha + \frac{2(N+1)\lambda\sigma}{1-\lambda} + \frac{4N\sigma\lambda}{1-\lambda} \nonumber \\ &\qquad + 3L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big)^\alpha + \Big(\frac{2\lambda L_F}{1-\lambda} + \frac{L_*}{2}\Big)\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big) + (\alpha-1)\Big(\frac{2\sqrt{\min\{m,n\}}}{\alpha(1-\lambda)}\Big)^{\frac{\alpha}{\alpha-1}}.\label{def:Udm} \end{align}\tag{5}\] Let \(\{X_i^k\}\) be generated by Algorithm 1 with inputs \(\eta\) and \(\theta\) given by \[\eta=K^{-\frac{2\alpha-1}{3\alpha-2}},\quad \theta = K^{-\frac{\alpha}{3\alpha-2}}\] for some positive integer \(K\). Then, for any \(\epsilon\in(0,1)\), it holds that \(\mathbb{E}[\|\overline{\nabla} F(X^{\iota_K}_{[N]})\|_*]\le \epsilon\) for all \(K\) satisfying \[\begin{align} K\ge\max\Big\{\Big(\frac{U_{\mathrm{dm}}}{\epsilon}\Big)^{\frac{3\alpha-2}{\alpha-1}},4\Big\}, \end{align}\] where \(\iota_K\) is uniformly drawn from \(\{0,\ldots,K-1\}\).
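To illustrate the parameter choices in Theorem 1, the small helper below (our naming, not from the paper) returns the smallest admissible \(K\) for a given tolerance \(\epsilon\), together with the corresponding \((\eta,\theta)\), assuming the constant \(U_{\mathrm{dm}}\) is available:

```python
import math

def demuon_theory_params(alpha: float, eps: float, U_dm: float):
    """(K, eta, theta) prescribed by Theorem 1 for tolerance eps."""
    assert 1.0 < alpha <= 2.0 and 0.0 < eps < 1.0
    K = max(math.ceil((U_dm / eps) ** ((3 * alpha - 2) / (alpha - 1))), 4)
    eta = K ** (-(2 * alpha - 1) / (3 * alpha - 2))
    theta = K ** (-alpha / (3 * alpha - 2))
    return K, eta, theta

# alpha = 2 (bounded variance) gives the exponent (3*2-2)/(2-1) = 4,
# i.e., the familiar K = O(eps^{-4}) iteration complexity.
print(demuon_theory_params(alpha=2.0, eps=0.1, U_dm=5.0))
```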
Remark 3. From Theorem 1, we see that DeMuon finds an \(\epsilon\)-nuclear norm stochastic stationary solution of 1 within at most \(\mathcal{O}(\epsilon^{-(3\alpha-2)/(\alpha-1)})\) iterations. Approximate stationarity measured by the nuclear norm is a desirable characteristic of Muon-type algorithms; see, e.g., [7], [15], [37]. Moreover, this iteration complexity matches the best-known dependence on \(\epsilon\) in the literature on centralized stochastic optimization under heavy-tailed noise (see, e.g., [38]). It is possible to further improve the dependence on problem constants (e.g., Lipschitz constants and noise bounds) in the iteration complexity. However, doing so may require choosing \((\eta,\theta)\) optimally based on these parameters, which we leave as a direction for future research.
In this section, we present our preliminary simulation results. We evaluate DeMuon on a real-world language modeling task. Specifically, we consider a 3M-parameter GPT model [39] and perform auto-regressive language modeling on the Multi30k dataset, following the setup in [32]. Training is distributed across \(N=8\) nodes connected via three representative network topologies: the complete graph, the directed exponential graph, and the ring graph. We compare DeMuon with three baselines: DSGD [40], DSGD_Clip [41], and GT_NSGDm [32]. The training and validation losses are reported in Fig. 2 and Fig. 3, respectively. The hyperparameters used for the simulations are listed in Table [table1], and a construction of the corresponding mixing matrices is sketched after the table.
| Algorithm | Parallel Update Schemes | Hyper-Parameters |
|---|---|---|
| DSGD | \(X_i^{k+1}=\sum_{j=1}^N w_{ij}\big(X_j^k-\eta G(X_j^k;\xi_j^k)\big),\quad i\in[N]\) | \(\eta=0.01\) |
| DSGD_Clip | \(X_i^{k+1}=\sum_{j=1}^N w_{ij}\big(X_j^k-\eta_t\,\mathrm{clip}(G(X_j^k;\xi_j^k),\tau_t)\big),\quad i\in[N]\) | \((\eta,\tau)=(10,0.1)\) |
| GT_NSGDm | \(V_i^k = \sum_{j=1}^N w_{ij}\big(V_j^{k-1} + M_j^k - M_j^{k-1}\big),\quad X_i^{k+1}=\sum_{j=1}^N w_{ij}\big(X_j^k-\eta V_j^k/\|V_j^k\|_F\big),\quad i\in[N]\) | \((\eta,\theta)=(0.1,0.2)\) |
| DeMuon | See steps [update-mk]-[update-xk] | \((\eta,\theta)=(0.1,0.2)\) |
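The sketch below constructs doubly stochastic mixing matrices for the three topologies used above and evaluates the mixing rate \(\lambda\) from 4 . The specific weightings (lazy ring, uniform circulant exponential graph) are our assumptions, since the exact \(W\) used in the simulations is not listed:

```python
import numpy as np

def ring_W(N: int) -> np.ndarray:
    """Ring: each node averages itself and its two neighbors."""
    W = np.zeros((N, N))
    for i in range(N):
        W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1.0 / 3.0
    return W

def exp_W(N: int) -> np.ndarray:
    """Directed exponential graph: node i receives from nodes i - 2^j (mod N)
    with uniform weights; circulant, hence doubly stochastic."""
    hops = [0] + [2 ** j for j in range(int(np.log2(N)))]
    W = np.zeros((N, N))
    for i in range(N):
        for h in hops:
            W[i, (i - h) % N] = 1.0 / len(hops)
    return W

def complete_W(N: int) -> np.ndarray:
    """Complete graph: uniform averaging, W = 11^T / N."""
    return np.full((N, N), 1.0 / N)

def mixing_rate(W: np.ndarray) -> float:
    """lambda = ||W - 11^T/N|| from 4; smaller means faster consensus."""
    N = W.shape[0]
    return float(np.linalg.norm(W - np.ones((N, N)) / N, 2))

for name, W in [("complete", complete_W(8)), ("exp", exp_W(8)), ("ring", ring_W(8))]:
    print(f"{name:8s} lambda = {mixing_rate(W):.3f}")
```

With these constructions and \(N=8\), one gets \(\lambda=0\) for the complete graph, \(\lambda=0.5\) for the directed exponential graph, and \(\lambda\approx0.805\) for the ring, consistent with the intuition that sparser topologies mix more slowly.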
Across all network structures, DeMuon demonstrates rapid initial convergence and consistently reduces the validation loss to a low level within approximately 500 rounds. Specifically, we observe that
On the directed exponential and ring graphs, DeMuon achieves a validation loss close to that of GT_NSGDm, while significantly outperforming both DSGD and DSGD_Clip, which remain trapped at higher loss values. This highlights the robustness of DeMuon to limited connectivity and directed communication constraints.
On the complete graph, DeMuon initially converges rapidly to a low validation loss, but exhibits some instability in later rounds compared to GT_NSGDm. Nevertheless, it still provides a substantial improvement over both DSGD and DSGD_Clip, suggesting that DeMuon maintains competitive performance in dense networks.
Overall, these results confirm that DeMuon offers a strong balance between fast convergence and robustness across network topologies in decentralized Transformer training, making it a promising approach for collaborative language modeling tasks.
In this section, we provide the proofs of our main results presented in Section 2, particularly, Lemma 1 and Theorem 1.
For notational convenience, we define a sequence of potentials: \[\begin{align} \mathcal{P}_k := f(\overline{X}^k) + p\|\nabla F(X^k_{[N]}) - M^k_{[N]}\|_F^\alpha + q\|V^k_{[N]} - \mathbf{1}_N\otimes\overline{V}^k\|_*\quad \forall k\ge0,\label{def:pot} \end{align}\tag{6}\] where the sequence \(\{(X_i^k,M_i^k,V_i^k)\}\) is generated by DeMuon, and \(p,q>0\) are scalars to be specified later. Also, we define \(\xi_{[N]}:=\{\xi_i\}_{i=1}^N\), and the stacked notation: \[\begin{align} G(X_{[N]};\xi_{[N]}) :=[G_1(X_1;\xi_1)^T,\cdots, G_N(X_N;\xi_N)^T]^T,\quad V_{O,[N]} :=[\mathrm{msgn}(V_1)^T,\cdots,\mathrm{msgn}(V_N)^T]^T, \end{align}\] and the average notation \(\overline{V}_O:=\frac{1}{N}\sum_{i=1}^N\mathrm{msgn}(V_i)\). Using the stacked notation defined above, the updates of DeMuon in [update-mk]-[update-xk] can be rewritten in the following compact form: \[\begin{align} M_{[N]}^k & = (1-\theta)M_{[N]}^{k-1} + \theta G(X_{[N]}^k;\xi_{[N]}^k),\tag{7}\\ V_{[N]}^k & = (W\otimes I_{m})(V_{[N]}^{k-1}+M_{[N]}^k-M_{[N]}^{k-1}),\tag{8}\\ X_{[N]}^{k+1} & = (W\otimes I_{m})(X_{[N]}^k - \eta V^k_{O,[N]})\quad\forall k=0,1,\ldots.\tag{9} \end{align}\] Moreover, using the double stochasticity of \(W\) imposed in Assumption 1(d), we obtain from 9 the updates of \(\{\overline{X}^k\}\) as \[\begin{align} \label{update-ave-x} \overline{X}^{k+1} = \overline{X}^k - \eta \overline{V}^k_O\quad\forall k=0,1,2,\ldots. \end{align}\tag{10}\]
We next provide a technical lemma regarding the expansion of \(\|U+V\|^\alpha_F\), whose proof can be found in [38].
Lemma 2. For any \(\alpha\in(1,2]\), we have \[\begin{align} &\|U+V\|_F^\alpha \le \|U\|_F^\alpha + \alpha\|U\|_F^{\alpha-2} \langle U,V\rangle + 2\|V\|_F^\alpha\quad \forall U,V\in\mathbb{R}^{m\times n},\\ &\|U+V\|_F^\alpha\le (1+c)\|U\|_F^\alpha + (2 + (\alpha-1)^{\alpha-1}c^{1-\alpha})\|V\|_F^\alpha\quad \forall U,V\in\mathbb{R}^{m\times n},c>0. \end{align}\]
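Since Lemma 2 is imported from [38] without proof, a quick Monte Carlo spot-check (ours; the shapes and the choices \(\alpha=1.5\), \(c=0.7\) are arbitrary) can be used to build confidence in both inequalities:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, c = 1.5, 0.7
coeff2 = 2 + (alpha - 1) ** (alpha - 1) * c ** (1 - alpha)
for _ in range(10_000):
    U = rng.standard_normal((5, 3))
    V = rng.standard_normal((5, 3))
    nU, nV, nUV = (np.linalg.norm(A) for A in (U, V, U + V))
    inner = float(np.tensordot(U, V))  # trace inner product <U, V>
    assert nUV**alpha <= nU**alpha + alpha * nU**(alpha - 2) * inner \
                         + 2 * nV**alpha + 1e-9
    assert nUV**alpha <= (1 + c) * nU**alpha + coeff2 * nV**alpha + 1e-9
print("both inequalities held on all samples")
```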
We also establish the following technical lemma for the analysis of decentralized algorithms involving spectral and nuclear norms.
Lemma 3. Suppose that Assumption 1 holds. Let \(\{Y_i\}_{i=1}^N\subset\mathbb{R}^{m\times n}\) be arbitrarily given, and let \(W\in\mathbb{R}^{N\times N}\) be the mixing matrix. Then, the following statements hold.
\(\frac{1}{N}\sum_{i=1}^N\|Y_i\|\le \|Y_{[N]}\|\le (\sum_{i=1}^N\|Y_i\|^2)^{1/2}\) and \(\frac{1}{N}\sum_{i=1}^N\|Y_i\|_*\le \|Y_{[N]}\|_*\le\sum_{i=1}^N\|Y_i\|_*\);
\(\frac{1}{N}(\mathbf{1}_N\mathbf{1}_N^T\otimes I_m)X_{[N]} = \mathbf{1}_N\otimes\overline{X}\);
\((W-\frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T)(I_N - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T)=W-\frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T=(I_N - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T)(W-\frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T)\).
Proof. To prove statement (i), it suffices to show \[\begin{align} \|A_1\| \le \|[A_1^T,A_2^T]^T\|\le (\|A_1\|^2 + \|A_2\|^2)^{1/2},\quad \|A_1\|_*\le\|[A_1^T,A_2^T]^T\|_* \le \|A_1\|_* + \|A_2\|_* \end{align}\] hold for all \(A_1\in\mathbb{R}^{m_1\times n},A_2 \in\mathbb{R}^{m_2\times n}\). To this end, we let \(P:=[I_{m_1}, \mathbf{0}_{m_1\times m_2}]\). One has \[\begin{align} &\|A_1\| = \left\|P\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|\le \|P\|\cdot\left\|\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|=\left\|\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|,\tag{11}\\ &\|A_1\|_* = \left\|P\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|_*\le \|P\|\cdot\left\|\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|_*=\left\|\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|_*,\tag{12} \end{align}\] where the inequalities are due to the (mixed) submultiplicativity of the spectral norm and the nuclear norm, respectively, and the last equalities use \(\|P\|=1\). In addition, notice that \[\begin{align} \left\|\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|^2 \le \|A_1^TA_1\| + \|A_2^TA_2\| = \|A_1\|^2 + \|A_2\|^2,\quad \left\|\left[\begin{smallmatrix} A_1 \\ A_2\end{smallmatrix}\right]\right\|_* \le \left\|\left[\begin{smallmatrix} A_1 \\ \mathbf{0}_{m_2\times n}\end{smallmatrix}\right]\right\|_* + \left\|\left[\begin{smallmatrix} \mathbf{0}_{m_1\times n} \\ A_2\end{smallmatrix}\right]\right\|_* = \|A_1\|_* + \|A_2\|_*, \end{align}\] which along with 11 and 12 implies that statement (i) holds.
Statement (ii) holds because \[\begin{align} \frac{(\mathbf{1}_N\mathbf{1}_N^T\otimes I_m)X_{[N]}}{N}=\frac{1}{N}\begin{bmatrix} I_m&\cdots&I_m\\ \vdots&\ddots&\vdots\\ I_m&\cdots&I_m \end{bmatrix}\cdot\begin{bmatrix} X_1\\ \vdots\\ X_N \end{bmatrix} = \mathbf{1}_N\otimes\overline{X}. \end{align}\] In addition, statement (iii) holds true because \(W\mathbf{1}_N\mathbf{1}_N^T=\mathbf{1}_N\mathbf{1}_N^T=\mathbf{1}_N\mathbf{1}_N^TW\) due to the double stochasticity of \(W\). ◻
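Statements (ii) and (iii) are also easy to confirm numerically; the following check (ours, with a lazy ring standing in for \(W\)) verifies both identities:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, n = 4, 3, 2
X = rng.standard_normal((N * m, n))        # stacked X_[N]
Xbar = X.reshape(N, m, n).mean(axis=0)     # average matrix

# Statement (ii): (1/N) (11^T kron I_m) X_[N] = 1_N kron Xbar.
J = np.ones((N, N)) / N
assert np.allclose(np.kron(J, np.eye(m)) @ X, np.kron(np.ones((N, 1)), Xbar))

# Statement (iii) for a doubly stochastic W (lazy ring as an example).
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1.0 / 3.0
D = W - np.ones((N, N)) / N                # W - 11^T/N
P = np.eye(N) - np.ones((N, N)) / N        # I_N - 11^T/N
assert np.allclose(D @ P, D) and np.allclose(P @ D, D)
print("Lemma 3(ii) and (iii) verified")
```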
We next prove Lemma 1.
Proof of Lemma 1. When \(k=0\), this lemma holds due to \(X^0_i=\overline{X}^0\) for all \(1\le i\le N\). We next prove this lemma for any \(k\ge1\). Fix any \(k\ge1\). It follows from 9 , and Lemma 3(ii) and (iii) that \[\begin{align} &X^k_{[N]} - \mathbf{1}_N\otimes \overline{X}^k = \Big(\Big(I_N- \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big) X^k_{[N]}\\ &\overset{\eqref{def:ave-2}}{=}\Big(\Big(I_N- \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big)(W\otimes I_m)\big(X_{[N]}^{k-1} - \eta V^{k-1}_{O,[N]}\big) = \Big(\Big(W- \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big)\big(X_{[N]}^{k-1} - \eta V^{k-1}_{O,[N]}\big)\\ & = \Big(\Big(W- \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big)\Big(\Big(\Big(I_{N} - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big)X^{k-1}_{[N]} - \eta V^{k-1}_{O,[N]}\Big)\\ &=\Big(\Big(W- \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big)\big(X_{[N]}^{k-1} - \mathbf{1}_N\otimes\overline{X}^{k-1} - \eta V^{k-1}_{O,[N]}\big), \end{align}\] where the first equality is due to Lemma 3(ii), the third equality follows from 2 and the double stochasticity of \(W\), the fourth equality is due to 2 and Lemma 3(iii), and the last equality follows from Lemma 3(ii). We recall from Lemma 3(i) that \(\|V^{k}_{O,[N]}\|\le(\sum_{i=1}^N\|\mathrm{msgn}(V^k_i)\|^2)^{1/2}=\sqrt{N}\). Using this, 4 , and the above identity, we have \[\begin{align} &\|X^k_{[N]} - \mathbf{1}_N\otimes \overline{X}^k\| \\ &\le \Big\|\Big(W - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big\| \big(\|X^{k-1}_{[N]} - \mathbf{1}_N\otimes \overline{X}^{k-1}\| + \sqrt{N}\eta\big)\\ &\le\lambda\big(\|X^{k-1}_{[N]} - \mathbf{1}_N\otimes \overline{X}^{k-1}\| + \sqrt{N}\eta\big)\le\cdots \le \lambda^k\|X^0_{[N]} - \mathbf{1}_N\otimes \overline{X}^0\| + \sqrt{N}\eta \sum_{t=1}^k\lambda^t \le \frac{\sqrt{N}\lambda\eta}{1-\lambda}, \end{align}\] where the last inequality follows from \(X_i^0=\overline{X}^0\) for each \(i\) and \(\sum_{t=1}^k\lambda^t\le\lambda/(1-\lambda)\). Hence, the conclusion of this lemma holds as desired. ◻
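The recursion underlying this proof is easy to observe numerically: iterating only the \(X\)-update 9 with arbitrary spectral-norm-one blocks in place of \(\mathrm{msgn}(V_i^k)\) keeps the consensus error below \(\sqrt{N}\lambda\eta/(1-\lambda)\) at every step. A small simulation (ours, with a lazy ring \(W\)):

```python
import numpy as np

def msgn(A):
    # msgn(A) = U V^T from the reduced SVD; spectral norm is 1.
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(3)
N, m, n, eta, T = 6, 5, 3, 0.1, 300
W = np.zeros((N, N))                        # lazy ring mixing matrix
for i in range(N):
    W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1.0 / 3.0
lam = np.linalg.norm(W - np.ones((N, N)) / N, 2)
bound = np.sqrt(N) * lam * eta / (1 - lam)  # Lemma 1's bound

X = np.zeros((N * m, n))                    # common initialization X_i^0
Wm = np.kron(W, np.eye(m))
worst = 0.0
for _ in range(T):
    # Each stacked block mimics an orthogonalized direction in Algorithm 1.
    D = np.vstack([msgn(rng.standard_normal((m, n))) for _ in range(N)])
    X = Wm @ (X - eta * D)
    dev = X - np.kron(np.ones((N, 1)), X.reshape(N, m, n).mean(axis=0))
    worst = max(worst, np.linalg.norm(dev, 2))
print(worst <= bound, round(worst, 4), round(bound, 4))
```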
The following lemma provides a descent inequality on the network average \(\{\overline{X}^k\}\).
Lemma 4. Suppose that Assumption 1 holds. Let \(\{(X^k_i,M^k_i,V^k_i)\}\) be generated by Algorithm 1 with input parameters \(\eta\) and \(\theta\). Then, it holds that for all \(k\ge0\), \[\begin{align} &f(\overline{X}^{k+1}) \nonumber \\ &\le f(\overline{X}^k) - \eta\|\overline{\nabla} F({X}_{[N]}^k)\|_* + 2\eta\|\nabla F({X}_{[N]}^k) - M_{[N]}^k\|_* + 2\eta\| V_{[N]}^k - \mathbf{1}_N\otimes\overline{V}^k\|_* + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2,\label{ineq:desc-1} \end{align}\tag{13}\] where \(L_*\) is given in Assumption 1(b), and \(\lambda\) is defined in 4 .
Proof. Fix any \(k\ge0\). Using 8 and the double stochasticity of \(W\), one can show by induction that \(\overline{M}^k=\overline{V}^k\). In addition, by Assumption 1(b) and Lemma 3(i), one has \[\begin{align} \|\nabla f(\overline{X}^k) - \overline{\nabla} F({X}_{[N]}^k)\|_* &\le \frac{L_*}{N}\sum_{i=1}^N\|\overline{X}^k - X_i^k\|\le L_*\|X^k_{[N]} - \mathbf{1}_N\otimes\overline{X}^k\|.\label{ineq:ave-grad-1} \end{align}\tag{14}\] Note that \(\|\overline{V}^k_O\|\le\sum_{i=1}^N\|\mathrm{msgn}(V_i^k)\|/N\le 1\). Using these, 3 with \((X,Y)=(\overline{X}^k,\overline{X}^{k+1})\), and 10 , we have \[\begin{align} &f(\overline{X}^{k+1}) {\le} f(\overline{X}^k) + \langle\nabla f(\overline{X}^k), \overline{X}^{k+1}-\overline{X}^k\rangle + \frac{L_*}{2}\|\overline{X}^{k+1}-\overline{X}^k\|^2\nonumber\\ &\overset{\eqref{update-ave-x}}{=} f(\overline{X}^k) - \eta\langle \overline{V}^k, \overline{V}^k_O\rangle+ \eta\langle \overline{V}^k - \nabla f(\overline{X}^k), \overline{V}^k_O\rangle + \frac{L_*\eta^2}{2}\|\overline{V}^k_O\|^2\nonumber\\ & \le f(\overline{X}^k) - \eta\langle \overline{V}^k, \overline{V}^k_O\rangle + \eta\|\nabla f(\overline{X}^k) - \overline{V}^k\|_* + \frac{L_*\eta^2}{2}\nonumber\\ &\le f(\overline{X}^k) - \eta\langle \overline{V}^k, \overline{V}^k_O\rangle + \eta\|\nabla f(\overline{X}^k) - \overline{\nabla} F({X}_{[N]}^k)\|_* + \eta\|\overline{\nabla} F({X}_{[N]}^k) - \overline{V}^k\|_* + \frac{L_*\eta^2}{2}\nonumber\\ &\overset{\eqref{ineq:ave-grad-1}}{\le} f(\overline{X}^k) - \eta\langle \overline{V}^k, \overline{V}^k_O\rangle + L_*\eta\|X^k_{[N]} - \mathbf{1}_N\otimes\overline{X}^k\| + \eta\|\overline{\nabla} F({X}_{[N]}^k) - \overline{M}^k\|_* + \frac{L_*\eta^2}{2},\label{ineq:upbd-desc-1} \end{align}\tag{15}\] where the second inequality follows from \(\|\overline{V}^k_O\|\le1\) and the trace Hölder inequality, the third inequality is due to the triangle inequality, and the last inequality follows from 14 and \(\overline{M}^k=\overline{V}^k\). Also, notice that \[\begin{align} -\langle\overline{V}^k, \overline{V}^k_O\rangle & = -\frac{1}{N}\sum_{i=1}^N\langle \overline{V}^k-V_i^k, \mathrm{msgn}(V_i^k)\rangle - \frac{1}{N}\sum_{i=1}^N\|V_i^k\|_*\nonumber\\ &\le \frac{1}{N}\sum_{i=1}^N\|\overline{V}^k - V_i^k\|_* - \frac{1}{N}\sum_{i=1}^N\|V_i^k\|_* \le \frac{2}{N}\sum_{i=1}^N\|\overline{V}^k - V_i^k\|_* - \|\overline{V}^k\|_*\nonumber\\ & \le -\|\overline{\nabla} F({X}_{[N]}^k)\|_* + \|\overline{\nabla} F({X}_{[N]}^k) - \overline{V}^k\|_* + \frac{2}{N}\sum_{i=1}^N\|\overline{V}^k - V_i^k\|_*, \end{align}\] where the first inequality follows from the trace Hölder inequality and \(\|\mathrm{msgn}(V_i^k)\|\le1\) for each \(i\), and the last two inequalities follow from the triangle inequality. Using this inequality, Lemma 1, 15 , Lemma 3(i), and \(\overline{M}^k=\overline{V}^k\), we obtain that \[\begin{align} &f(\overline{X}^{k+1})\le f(\overline{X}^k) - \eta\|\overline{\nabla} F({X}_{[N]}^k)\|_* + 2\eta\|\overline{\nabla} F({X}_{[N]}^k) - \overline{M}^k\|_* + \frac{2\eta}{N}\sum_{i=1}^N\|\overline{V}^k - V_i^k\|_*+ \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2,\nonumber\\ &\le f(\overline{X}^k) - \eta\|\overline{\nabla} F({X}_{[N]}^k)\|_* + 2\eta\|\nabla F(X_{[N]}^k) - M^k_{[N]}\|_* +2\eta\|V^k_{[N]} - \mathbf{1}_N\otimes\overline{V}^k\|_* + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2.\nonumber \end{align}\] Hence, this lemma holds as desired. ◻
The following lemma provides a recurrence relation for the consensus error of \(\{V_i^k\}\).
Lemma 5. Suppose that Assumption 1 holds. Let \(\{(X^k_i,M_i^k,V^k_i)\}\) be generated by Algorithm 1 with input parameters \(\eta\) and \(\theta\). Then it holds that for all \(k\ge-1\), \[\begin{align} \mathbb{E}_{\xi_{[N]}^{k+1}}[\|V_{[N]}^{k+1} - \mathbf{1}_N\otimes\overline{V}^{k+1}\|_*]\le \lambda \|V^k_{[N]} - \mathbf{1}_{N}\otimes\overline{V}^k\|_* + \frac{\lambda\theta }{1-\theta}\Big(\mathbb{E}_{\xi_{[N]}^{k+1}}[\|\nabla F(X_{[N]}^{k+1}) - M^{k+1}_{[N]}\|_*]+\sigma\Big),\label{ineq:vr-v} \end{align}\tag{16}\] where \(\lambda\) is given in 4 , and \(\sigma\) is given in Assumption 1(c).
Proof. Fix any \(k\ge-1\). It follows that \[\begin{align} &V^{k+1}_{[N]} - \mathbf{1}_N\otimes\overline{V}^{k+1} = \Big(\Big(I_N - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big) V_{[N]}^{k+1}\nonumber\\ &\overset{\eqref{def:ave-1}}{=}\Big(\Big(W - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big)(V^k_{[N]} + M^{k+1}_{[N]} - M^{k}_{[N]} )\nonumber\\ & = \Big(\Big(W - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Big)\otimes I_m\Big)\big(V_{[N]}^k - \mathbf{1}_N\otimes\overline{V}^k + M^{k+1}_{[N]} - M^{k}_{[N]}\big),\label{ineq:V-cs-1} \end{align}\tag{17}\] where the first equality is due to Lemma 3(ii), the second equality follows from 2 and 8 , and the last equality is due to 2 and Lemma 3(iii). We also recall from 7 that \[\begin{align} M^{k+1}_{[N]}& - M^{k}_{[N]} = \theta (G(X_{[N]}^{k+1};\xi^{k+1}_{[N]}) - \nabla F(X_{[N]}^{k+1}))\\ &+ \theta (M^{k+1}_{[N]} - M^{k}_{[N]}) + \theta (\nabla F(X_{[N]}^{k+1}) - M^{k+1}_{[N]}), \end{align}\] which implies that \[\begin{align} M^{k+1}_{[N]} - M^{k}_{[N]} = \frac{\theta }{1-\theta }(G(X_{[N]}^{k+1};\xi^{k+1}_{[N]}) - \nabla F(X_{[N]}^{k+1})) + \frac{\theta }{1-\theta }(\nabla F(X_{[N]}^{k+1}) - M^{k+1}_{[N]}). \end{align}\] By this, 4 , 17 , Lemma 3(i), and Assumption 1(c), we see that \[\begin{align} \mathbb{E}_{\xi_{[N]}^{k+1}}&[\|V_{[N]}^{k+1} - \mathbf{1}_N\otimes\overline{V}^{k+1}\|_*] \le \lambda\|V_{[N]}^k - \mathbf{1}_N\otimes\overline{V}^k\|_* + \frac{N\sigma\lambda\theta }{1-\theta} + \frac{\lambda\theta }{1-\theta }\mathbb{E}_{\xi_{[N]}^{k+1}}[\|\nabla F(X_{[N]}^{k+1}) - M^{k+1}_{[N]}\|_*]. \end{align}\] Hence, this lemma holds as desired. ◻
The following lemma provides two recurrence relations for the estimation errors of the stacked momentum sequence \(\{M_{[N]}^k\}\).
Lemma 6. Suppose that Assumption 1 holds. Let \(\{(X_i^k,M_i^k)\}\) be generated by Algorithm 1 with input parameters \(\eta\) and \(\theta\). Then it holds that for all \(k\ge0\), \[\begin{align} &\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\nabla F({X}^{k+1}_{[N]}) - M_{[N]}^{k+1}\|_*]\le (1-\theta )\|\nabla F({X}^k_{[N]}) - M_{[N]}^k\|_*+ (1-\theta)L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} +1\Big)\eta + N\sigma \theta,\tag{18}\\ &\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\nabla F({X}^{k+1}_{[N]}) - M_{[N]}^{k+1}\|_F^\alpha]\le (1-\theta )\|\nabla F({X}^k_{[N]}) - M_{[N]}^k\|_F^\alpha + 3L_F^\alpha \Big(\frac{2\sqrt{N}\lambda}{1-\lambda} +1\Big)^\alpha\theta^{1-\alpha}\eta^\alpha + 2(N\sigma\theta)^\alpha,\tag{19} \end{align}\] where \(L_*\), \(\sigma\), and \(\alpha\) are given in Assumption 1, and \(\lambda\) is defined in 4 .
Proof. Fix any \(k\ge0\). It follows from 7 that \[\begin{align} &M_{[N]}^{k+1} - \nabla F(X_{[N]}^{k+1})\nonumber\\ &= (1-\theta ) M^k_{[N]} + \theta G(X_{[N]}^{k+1};\xi^{k+1}_{[N]}) - \nabla F(X_{[N]}^{k+1})\nonumber\\ &=(1-\theta ) ((M^k_{[N]} - \nabla F(X_{[N]}^k))+(\nabla F(X_{[N]}^k) - \nabla F(X_{[N]}^{k+1}))) + \theta (G(X_{[N]}^{k+1};\xi^{k+1}_{[N]}) - \nabla F(X_{[N]}^{k+1})).\label{rela:MF-id-1} \end{align}\tag{20}\] Notice from 10 that \(\|\overline{X}^{k+1}-\overline{X}^k\|=\eta\|\overline{V}^k_O\|\le\eta\). In addition, by Lemma 1 and the definition of \(L_F\) in Section 1.2, one has that \[\begin{align} &\|\nabla F(X_{[N]}^k) - \nabla F(X_{[N]}^{k+1})\|_*\le L_F\|X_{[N]}^k - X_{[N]}^{k+1}\|\nonumber\\ &\le L_F(\|X_{[N]}^k - \mathbf{1}_N\otimes\overline{X}^k\| + \|X_{[N]}^{k+1} - \mathbf{1}_N\otimes\overline{X}^{k+1}\| + \|\overline{X}^k - \overline{X}^{k+1}\|) {\le} L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} +1\Big)\eta.\label{ineq:useful-vr} \end{align}\tag{21}\] Using this, Assumption 1(c), 20 , and 21 , we obtain that \[\begin{align} &\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\nabla F({X}^{k+1}_{[N]}) - M_{[N]}^{k+1}\|_*]\\ &\le (1-\theta)(\|M^k_{[N]} - \nabla F(X_{[N]}^k)\|_* +\|\nabla F(X_{[N]}^k) - \nabla F(X_{[N]}^{k+1})\|_*) + \theta \mathbb{E}_{\xi^{k+1}_{[N]}}[\|G(X_{[N]}^{k+1};\xi^{k+1}_{[N]}) - \nabla F(X_{[N]}^{k+1})\|_*] \\ &\le (1-\theta )\Big(\|\nabla F({X}^k_{[N]}) - M_{[N]}^k\|_* + L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} +1\Big)\eta\Big) + N\sigma\theta, \end{align}\] which implies that 18 holds.
We next prove 19 . By Lemma 2, 20 , and 21 , one has that for any \(c>0\), \[\begin{align} &\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\nabla F({X}^{k+1}_{[N]}) - M_{[N]}^{k+1}\|_F^\alpha]\nonumber\\ &\le \|(1-\theta ) (M^k_{[N]} - \nabla F(X_{[N]}^k)+\nabla F(X_{[N]}^k) - \nabla F(X_{[N]}^{k+1}))\|_F^\alpha + 2\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\theta (G(X_{[N]}^{k+1};\xi^{k+1}_{[N]}) - \nabla F(X_{[N]}^{k+1}))\|_F^\alpha]\nonumber\\ &\le (2+(\alpha-1)^{\alpha-1}c^{1-\alpha})(1-\theta)^\alpha\|\nabla F(X_{[N]}^k) - \nabla F(X_{[N]}^{k+1})\|_F^\alpha + (1+c)(1-\theta)^\alpha\|M^k_{[N]} - \nabla F(X_{[N]}^k)\|_F^\alpha + 2(N\sigma\theta)^\alpha\nonumber\\ &\overset{\eqref{ineq:useful-vr}}{\le} (2+(\alpha-1)^{\alpha-1}c^{1-\alpha})(1-\theta)^\alpha L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} +1\Big)^\alpha\eta^\alpha + (1+c)(1-\theta)^\alpha\|M^k_{[N]} - \nabla F(X_{[N]}^k)\|_F^\alpha + 2(N\sigma\theta)^\alpha,\nonumber \end{align}\] where the first inequality is due to Lemma 2 and the unbiasedness of the stochastic gradients in Assumption 1(c), which makes the cross term vanish in expectation, and the second inequality is due to Lemma 2 and Assumption 1(c). Letting \(c = (1 - \theta)^{1 - \alpha} - 1\), and using the fact that \(\alpha \in (1, 2]\), we have \[\begin{align} c^{1-\alpha} = ((1-\theta)^{1-\alpha} - 1)^{1-\alpha} \le \Big(\frac{1}{1-(\alpha-1)\theta} - 1\Big)^{1-\alpha} \le ((\alpha-1)\theta)^{1-\alpha}, \end{align}\] where the first inequality follows from \((1-\tau)^\beta\le 1-\beta\tau\) for all \(\tau\in(-\infty,1)\) and \(\beta\in[0,1]\). Combining the above two inequalities, we obtain that \[\begin{align} &\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\nabla F({X}^{k+1}_{[N]}) - M_{[N]}^{k+1}\|_F^\alpha]\\ & \le L_F^\alpha \Big(\frac{2\sqrt{N}\lambda}{1-\lambda} +1\Big)^\alpha(2+\theta^{1-\alpha})(1-\theta)^\alpha \eta^\alpha + (1-\theta )\|\nabla F({X}^k_{[N]}) - M_{[N]}^k\|_F^\alpha +2(N\sigma\theta)^\alpha, \end{align}\] which along with \(\theta\in(0,1)\) and \(\alpha\in(1,2]\) implies that 19 holds. ◻
The following lemma establishes a descent property for the potential sequence \(\{\mathcal{P}_k\}\) defined in 6 .
Lemma 7. Suppose that Assumption 1 holds. Let \(\{(X_i^k,M_i^k,V_i^k)\}\) be the sequence generated by Algorithm 1 with input parameters \(\eta\) and \(\theta\), and \(L_*\), \(\sigma\), and \(\alpha\) be defined in Assumption 1, and let \(\{\mathcal{P}_k\}\) be defined in 6 for \(\{(X_i^k,M_i^k,V_i^k)\}\) and \(q=2\eta/(1-\lambda)\) and any positive scalar \(p\). Then, it holds that for all \(k\ge0\), \[\begin{align} \mathbb{E}_{\xi^{k+1}_{[N]}}[\mathcal{P}_{k+1}] & \le \mathcal{P}_k - \eta \|\overline{\nabla} F({X}_{[N]}^k)\|_* + \frac{(\alpha-1)((2\eta + \lambda\theta q)\sqrt{\min\{m,n\}})^{\alpha/(\alpha-1)}}{\alpha^{\alpha/(\alpha-1)}(\theta p)^{1/(\alpha-1)}} + 3p L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big)^\alpha \theta ^{1-\alpha}\eta^\alpha\nonumber \\ & \qquad + 2 p(N\sigma\theta)^\alpha + \lambda\theta q L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda}+1\Big)\eta + \frac{N\sigma \lambda q\theta^2}{1-\theta} + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2.\label{ineq:pot-desc} \end{align}\tag{22}\]
Proof. Fix any \(k\ge0\). By 6 , 13 , 16 , 18 , 19 , and the fact that \(q=2\eta/(1-\lambda)\), one has \[\begin{align} \mathbb{E}_{\xi^{k+1}_{[N]}}[\mathcal{P}_{k+1}] \overset{\eqref{def:pot}}{=} &\;\mathbb{E}_{\xi^{k+1}_{[N]}}\big[f(\overline{X}^{k+1}) + p\|\nabla F(X^{k+1}_{[N]}) - M^{k+1}_{[N]}\|_F^\alpha + q\|V^{k+1}_{[N]} - \mathbf{1}_N\otimes \overline{V}^{k+1}\|_*\big]\nonumber\\ \overset{\eqref{ineq:desc-1}\eqref{ineq:vr-v}}{\le} &\;f(\overline{X}^k) - \eta\|\overline{\nabla} F({X}_{[N]}^k)\|_* + p\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\nabla F(X^{k+1}_{[N]}) - M^{k+1}_{[N]}\|_F^\alpha]\nonumber\\ & + 2\eta\|\nabla F({X}_{[N]}^k) - M_{[N]}^k\|_* + \frac{\lambda\theta q}{1-\theta}\mathbb{E}_{\xi^{k+1}_{[N]}}[\|\nabla F(X_{[N]}^{k+1}) - M^{k+1}_{[N]}\|_*]\nonumber\\ & + (2\eta + \lambda q)\|V^k_{[N]} - \mathbf{1}_{N}\otimes\overline{V}^k\|_* + \frac{\lambda\sigma\theta q}{1-\theta } + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2\nonumber\\ \overset{\eqref{ineq:vr1}\eqref{ineq:vr2}}{\le} &\;f(\overline{X}^k) - \eta \|\overline{\nabla} F({X}_{[N]}^k)\|_* + (2\eta + \lambda\theta q)\|\nabla F({X}_{[N]}^k) - M_{[N]}^k\|_* \nonumber\\ & + (1-\theta )p\|\nabla F({X}^k_{[N]}) - M_{[N]}^k\|_F^\alpha + (2\eta + \lambda q)\|V^k_{[N]} - \mathbf{1}_{N}\otimes\overline{V}^k\|_*\nonumber\\ & + 3p L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big)^\alpha \theta ^{1-\alpha}\eta^\alpha + 2 p(N\sigma\theta)^\alpha\nonumber\\ & + \lambda\theta q L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda}+1\Big)\eta + \frac{N\sigma \lambda q\theta^2}{1-\theta} + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2.\label{ineq:inter-pot-desc} \end{align}\tag{23}\] In addition, letting \(\alpha^\prime=\alpha/(\alpha-1)\) and using Young's inequality, we have \[\begin{align} (2\eta + \lambda\theta q)\|\nabla F({X}_{[N]}^k) - M_{[N]}^k\|_* & \le (2\eta + \lambda\theta q) \sqrt{\min\{m,n\}} \|\nabla F({X}_{[N]}^k) - M_{[N]}^k\|_F \nonumber\\ &\le \frac{((\alpha\theta p)^{1/\alpha}\|\nabla F({X}_{[N]}^k) - M_{[N]}^k\|_F)^\alpha}{\alpha} + \frac{\Big(\frac{(2\eta + \lambda\theta q)\sqrt{\min\{m,n\}}}{(\alpha\theta p)^{1/\alpha}}\Big)^{\alpha^\prime}}{\alpha^\prime}\\ &=\theta p\|\nabla F({X}_{[N]}^k) - M_{[N]}^k\|_F^\alpha + \frac{(\alpha-1)((2\eta + \lambda\theta q)\sqrt{\min\{m,n\}})^{\alpha/(\alpha-1)}}{\alpha^{\alpha/(\alpha-1)}(\theta p)^{1/(\alpha-1)}}. \end{align}\] This along with 23 and \(q=2\eta/(1-\lambda)\) implies that \[\begin{align} \mathbb{E}_{\xi^{k+1}_{[N]}}[\mathcal{P}_{k+1}] &\le f(\overline{X}^k) + p\|\nabla F(X^k_{[N]}) - M^k_{[N]}\|_F^\alpha + q\|V^k_{[N]} - \mathbf{1}_{N}\otimes\overline{V}^k\|_*\nonumber\\ &\qquad - \eta\|\overline{\nabla} F({X}_{[N]}^k)\|_* + \frac{(\alpha-1)((2\eta + \lambda\theta q)\sqrt{\min\{m,n\}})^{\alpha/(\alpha-1)}}{\alpha^{\alpha/(\alpha-1)}(\theta p)^{1/(\alpha-1)}} + 3p L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big)^\alpha \theta ^{1-\alpha}\eta^\alpha \nonumber\\ &\qquad + 2 p(N\sigma\theta)^\alpha + \lambda\theta q L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda}+1\Big)\eta + \frac{N\sigma \lambda q\theta^2}{1-\theta} + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2. \end{align}\] The conclusion 22 then follows from this and 6 . ◻
We are now ready to prove Theorem 1.
Proof of Theorem 1. Let \(\{(X^k_i,M^k_i,V^k_i)\}\) be generated by Algorithm 1 with \((\eta,\theta)\) given in Theorem 1, and let \(\{\mathcal{P}_k\}\) be defined in 6 for this sequence with the following \((p,q)\): \[\begin{align} \label{def:p-q} p=K^{(\alpha^2-3\alpha+2)/(3\alpha-2)},\quad q= 2\eta/(1-\lambda). \end{align}\tag{24}\] Notice that \(\eta>0\) and \(\theta\in(0,1)\). Hence, \((\eta,\theta,p,q)\) satisfies the assumptions in Lemma 7 and Algorithm 1. Recall from the initialization of Algorithm 1 that \(M^0_{[N]}=G(X^0_{[N]};\xi^0_{[N]})\), which together with \(\|\cdot\|_F\le\|\cdot\|_*\), Lemma 3(i), and Assumption 1(c) yields \[\begin{align} \mathbb{E}[\|M^0_{[N]} - \nabla F(X^0_{[N]})\|_F^\alpha] &\le\mathbb{E}[\|G(X^0_{[N]};\xi^0_{[N]}) - \nabla F(X^0_{[N]})\|_*^\alpha]\nonumber\\ &\le\mathbb{E}\Big[\Big(\sum_{i=1}^N\|G_i(X^0_i;\xi^0_i) - \nabla f_i(X^0_i)\|_*\Big)^\alpha\Big] \le (N\sigma)^\alpha.\nonumber \end{align}\] In addition, it follows from the fact that \(V^{-1}_i=\mathbf{0}_{m\times n}\) for all \(i\in[N]\), Lemma 3(i), and Lemma 5 with \(k=-1\) that \[\begin{align} \mathbb{E}[\|V_{[N]}^0 - \mathbf{1}_N\otimes\overline{V}^0\|_*] &\le \frac{\lambda\theta}{1-\theta}\mathbb{E}[\|\nabla F(X^0_{[N]}) - M^0_{[N]}\|_*] + \frac{\lambda\sigma\theta}{1-\theta}\nonumber\\ &\le \frac{\lambda\theta}{1-\theta}\mathbb{E}[\|\nabla F(X^0_{[N]}) - G(X^0_{[N]};\xi^0_{[N]})\|_*] + \frac{\lambda\sigma\theta}{1-\theta}\nonumber\\ &\le \frac{\lambda\theta}{1-\theta}\sum_{i=1}^N\mathbb{E}[\|\nabla f_i(X^0_i) - G_i(X^0_i;\xi^0_i)\|_*] + \frac{\lambda\sigma\theta}{1-\theta}\le \frac{(N+1)\lambda\sigma\theta}{1-\theta}.\nonumber \end{align}\] Then, by these, 6 , 24 , and the fact that \(p<1\) and \(q<2/(1-\lambda)\), one has that \[\begin{align} \mathbb{E}[\mathcal{P}_0] & = f(\overline{X}^0) + p\mathbb{E}\big[\|M^0_{[N]} - \nabla F(X^0_{[N]})\|_F^\alpha\big] + q\mathbb{E}\big[\|V^0_{[N]} - \mathbf{1}_N\otimes \overline{V}^0\|_*\big] \le f(\overline{X}^0) + (N\sigma)^\alpha + \frac{2(N+1)\lambda\sigma\theta}{(1-\theta)(1-\lambda)},\tag{25}\\ \mathbb{E}[\mathcal{P}_K] &= \mathbb{E}\big[f(\overline{X}^K) + p\|M^K_{[N]} - \nabla F(X^K_{[N]})\|_F^\alpha + q\|V^K_{[N]} - \mathbf{1}_N\otimes \overline{V}^K\|_*\big] \ge \mathbb{E}\big[f(\overline{X}^K)\big] \ge f_{\mathrm{low}}.\tag{26} \end{align}\] Taking expectation on both sides of 22 with respect to \(\{\xi^t_{[N]}\}_{t=0}^{k}\), we have \[\begin{align} \mathbb{E}[\mathcal{P}_{k+1}] & \le \mathbb{E}[\mathcal{P}_k] - \eta\mathbb{E}[\|\overline{\nabla}F({X}_{[N]}^k)\|_*] + \frac{(\alpha-1)((2\eta + \lambda\theta q)\sqrt{\min\{m,n\}})^{\alpha/(\alpha-1)}}{\alpha^{\alpha/(\alpha-1)}(\theta p)^{1/(\alpha-1)}} + 3p L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big)^\alpha \theta ^{1-\alpha}\eta^\alpha \nonumber\\ &\qquad + 2 p(N\sigma\theta)^\alpha + \lambda\theta q L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda}+1\Big)\eta + \frac{N\sigma \lambda q\theta^2}{1-\theta} + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2.
\end{align}\] Summing up this inequality over \(k=0,\ldots,K-1\), and using 25 and 26 , we can obtain that for all \(K\ge1\), \[\begin{align} &f_{\mathrm{low}} \le \mathbb{E}[\mathcal{P}_K]\le \mathbb{E}[\mathcal{P}_0] -\eta\sum_{k=0}^{K-1}\mathbb{E}[\|\overline{\nabla}F({X}_{[N]}^k)\|_*]\nonumber\\ &\qquad + \sum_{k=0}^{K-1}\Big(\frac{(\alpha-1)((2\eta + \lambda\theta q)\sqrt{\min\{m,n\}})^{\alpha/(\alpha-1)}}{\alpha^{\alpha/(\alpha-1)}(\theta p)^{1/(\alpha-1)}} + 3p L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big)^\alpha \theta ^{1-\alpha}\eta^\alpha \nonumber\\ &\qquad + 2 p(N\sigma\theta)^\alpha + \lambda\theta q L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda}+1\Big)\eta + \frac{N\sigma \lambda q\theta^2}{1-\theta} + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2\Big)\nonumber\\ &\le f(\overline{X}^0) + (N\sigma)^\alpha + \frac{2(N+1)\lambda\sigma\theta}{(1-\theta)(1-\lambda)} - \eta\sum_{k=0}^{K-1}\mathbb{E}[\|\overline{\nabla}F({X}_{[N]}^k)\|_*]\nonumber\\ &\qquad + \sum_{k=0}^{K-1}\Big(\frac{(\alpha-1)((2\eta + \lambda\theta q)\sqrt{\min\{m,n\}})^{\alpha/(\alpha-1)}}{\alpha^{\alpha/(\alpha-1)}(\theta p)^{1/(\alpha-1)}} + 3p L_F^\alpha\Big(\frac{2\sqrt{N}\lambda}{1-\lambda} + 1\Big)^\alpha \theta ^{1-\alpha}\eta^\alpha + 2 p(N\sigma\theta)^\alpha \nonumber\\ &\qquad + \lambda\theta q L_F\Big(\frac{2\sqrt{N}\lambda}{1-\lambda}+1\Big)\eta + \frac{N\sigma \lambda q\theta^2}{1-\theta} + \Big(\frac{\sqrt{N}\lambda}{1-\lambda} + \frac{1}{2}\Big)L_*\eta^2\Big).\nonumber \end{align}\] Rearranging the terms in this inequality, and using the definition of \((\eta,\theta,p,q)\), we obtain that for all \(K\ge4\), \[\begin{align} \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\overline{\nabla}F({X}_{[N]}^k)\|_*]& \le \frac{f(\overline{X}^0) - f_{\mathrm{low}} + 3(N\sigma)^\alpha + 2(N+1)\lambda\sigma/(1-\lambda)}{K\eta}\\ &\qquad + \frac{(\alpha-1)(2\sqrt{\min\{m,n\}}/(\alpha(1-\lambda)))^{\alpha/(\alpha-1)} + 3L_F^\alpha(2\sqrt{N}\lambda/(1-\lambda) + 1)^\alpha}{K\eta}\\ &\qquad+\frac{2\lambda L_F(2\sqrt{N}\lambda/(1-\lambda) + 1)/(1-\lambda)}{K\eta\cdot K^{2\alpha/(3\alpha-2)}} + \frac{4N\sigma\lambda/(1-\lambda)}{K\eta\cdot K^{(\alpha+1)/(3\alpha-2)}}\\ &\qquad+ \frac{(\sqrt{N}\lambda/(1-\lambda) + 1/2)L_*}{K\eta\cdot K^{\alpha/(3\alpha-2)}} \le U_{\mathrm{dm}}K^{-(\alpha-1)/(3\alpha-2)}, \end{align}\] where the last inequality is due to 5 and \(K\ge4\). Thus, we obtain that for all \(K\ge4\), \[\begin{align} \mathbb{E}[\|\overline{\nabla}F({X}_{[N]}^{\iota_K})\|_*]=\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\overline{\nabla}F({X}_{[N]}^k)\|_*] \le \frac{U_{\mathrm{dm}}}{K^{(\alpha-1)/(3\alpha-2)}}. \end{align}\] Hence, the conclusion of this theorem holds as desired. ◻