Drop-Muon: Update Less, Converge Faster

Kaja Gruntkowska
KAUST
Center of Excellence for Generative AI
Thuwal, Saudi Arabia
Yassine Maziane
KAUST\(^*\)
Center of Excellence for Generative AI
Thuwal, Saudi Arabia
Zheng Qu
Shenzhen University
School of Mathematical Sciences
Shenzhen, China
Peter Richtárik
KAUST\(^*\)
Center of Excellence for Generative AI
Thuwal, Saudi Arabia


Abstract

Conventional wisdom in deep learning optimization dictates updating all layers at every step–a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce a non-Euclidean Randomized Progressive Training method—Drop-Muon—a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise \((L^0, L^1)\)-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to \(1.4\times\) faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.

1 Introduction

Since their debut, Adam and related methods [1], [2] have dominated deep learning optimization. Yet, the field is now at an inflection point. Recent advances highlight a new generation of algorithms designed to better capture the geometry of modern models, with Muon [3] and its successors—Scion [4] and Gluon [5]—emerging as promising alternatives. Fueled by state-of-the-art performance in large language model (LLM) training [6]–[10] and emerging theoretical developments [4], [5], [11], these methods are on track to disrupt entrenched practices.

Central to their design are the layer-specific linear minimization oracles (LMOs) over non-Euclidean norm balls, enabling better alignment with the highly anisotropic loss landscapes of neural networks. Concretely, let \(X = [X_1, \ldots, X_b]\) denote the parameters of a \(b\)-layer model, with \(X_i\) indexing the parameters of layer \(i \in[b] :=\{1, \ldots, b\}\). Each of the aforementioned optimizers can be viewed as an instance of the general update rule \[\begin{align} \label{eq:muon} X_i^{k+1} = X_i^k + {\rm LMO}_{\mathcal{B}_i(0,t_i^k)}(M_i^k), \qquad i\in\{1, \ldots, b\} \end{align}\tag{1}\] (see 7.1). Here, \(\mathcal{B}_i(X_i,t_i) :=\{Z_i \in \mathcal{X}_i: \left\| X_i - Z_i \right\|_{(i)} \leq t_i\}\) is a ball of radius \(t_i\) centered at \(X_i\) in the vector space \(\mathcal{X}_i\), where \(\left\| \cdot \right\|_{(i)}\) is a norm chosen for the \(i\)th layer, \(M_i^k\) is a momentum term, and the linear minimization oracle is defined as \({\rm LMO}_{\mathcal{B}_i(X_i,t_i)}(M_i) :=\mathop{\mathrm{arg\,min}}_{Z_i \in \mathcal{B}_i(X_i,t_i)} \left\langle M_i,Z_i\right\rangle\). Different choices of \(\left\| \cdot \right\|_{(i)}\) produce different algorithms (for example, Muon uses the spectral norm for hidden layers). Crucially, however, all these updates share a common characteristic: all layers are updated at every iteration. In this work, we question this design choice. Our central hypothesis is simple yet fundamental:

Updating the entire network at every step may not be optimal.

In the remainder of this paper, we demonstrate that the default practice of full-network updates is indeed not universally the best choice, from both theoretical and practical points of view, calling into question a core principle of standard training protocols.

1.1 Background

Let us begin by formalizing the setup. We consider the optimization problem \[\begin{align} \label{eq:problem} \min_{X\in\mathcal{X}} \left\{ f(X) :={\mathbb{E}}_{\xi\sim\mathcal{P}}\left[f(X; \xi)\right] \right\}, \end{align}\tag{2}\] where \(X \in \mathcal{X}\) represents the collection of trainable parameters of a neural network. Specifically, \(X\) is composed of block variables \(X_i \in \mathcal{X}_i :=\mathbb{R}^{m_i \times n_i}\) corresponding to layer \(i \in[b]\); we write \(X = [X_1, \ldots, X_b]\). In this context, \(\mathcal{X}\) is the \(d\)-dimensional product space \[\begin{align} \mathcal{X}:=\bigotimes_{i = 1}^b \mathcal{X}_i \equiv \mathcal{X}_1 \otimes \ldots \otimes \mathcal{X}_b, \end{align}\] where \(d :=\sum_{i=1}^b m_i n_i\). Each function \(f(\cdot; \xi): \mathcal{X}\to \mathbb{R}\) is continuously differentiable, potentially nonconvex and non-smooth, and represents the loss of the model evaluated at a data point \(\xi\) sampled from the probability distribution \(\mathcal{P}\). We denote by \(\nabla_i f(X) \in \mathcal{X}_i\) the gradient component corresponding to the \(i\)th layer, so that \(\nabla f(X) = [\nabla_1 f(X), \dots, \nabla_b f(X)] \in \mathcal{X}\). Each space \(\mathcal{X}_i\) is equipped with the trace inner product, defined as \(\langle X_i, Y_i \rangle_{(i)} :=\mathop{\mathrm{tr}}(X_i^{\top} Y_i)\) for \(X_i,Y_i \in \mathcal{X}_i\), which induces the standard Euclidean norm, denoted by \(\left\| \cdot \right\|_2\). In addition, each space is endowed with an arbitrary norm \(\left\| \cdot \right\|_{(i)}\) (which need not be induced by this inner product). We let \(\left\| \cdot \right\|_{(i) \star}\) be the dual norm associated with \(\left\| \cdot \right\|_{(i)}\) (i.e., \(\|X_i\|_{(i) \star} :=\sup_{\left\| Z_i \right\|_{(i)} \leq 1} \left\langle X_i,Z_i\right\rangle_{(i)}\) for any \(X_i\in \mathcal{X}_i\)).

Drop-Muon ([alg:rt95arbitrary95stoch])

Input: initial iterate \(X^0 = [X_1^0, \dots, X_b^0] \in \mathcal{X}\); momentum \(M^0=[M_1^0, \dots, M_b^0]\in \mathcal{X}\); stepsizes \(\gamma_i^k > 0\); momentum parameters \(\beta_i \in [0,1)\)

For each iteration \(k = 0, 1, 2, \ldots\):

  1. Sample \(\xi^k \sim \mathcal{P}\) and the set of active layers \(S^k\sim\mathcal{D}\).

  2. Freeze layers not selected as active: for each \(i \notin S^k\), set \(M_i^k = M^{k-1}_i\) and \(X_i^{k+1} = X_i^k\).

  3. Update active layers: for each \(i \in S^k\), update the momentum \(M_i^k = (1-\beta_i) M^{k-1}_i + \beta_i \nabla_i f(X^k; \xi^k)\) and update the parameters via \[\begin{align} \label{eq:sharp95upd95stoch} X_i^{k+1} = {\rm LMO}_{\mathcal{B}(X_i^k,t_i^k)}(M_i^k) = X_i^k - \gamma_i^k \left( M_i^k \right)^{\sharp} \end{align}\tag{3}\]
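To make the listing above concrete, the following is a minimal PyTorch sketch of the loop (ours, not the authors' implementation): the toy model, the uniform choice of the smallest active index, and all function names are assumptions made here for illustration, and the sharp operator \((\cdot)^{\sharp}\) appearing in 3 is kept in its Euclidean form, where it reduces to the identity (see the next section).

```python
# Minimal sketch (ours) of the Drop-Muon loop above. The toy model and the
# uniform sampling of the smallest active index are illustrative choices only.
import torch

def sharp(M):
    # Euclidean case: the sharp operator is the identity. For the spectral
    # norm one would instead (approximately) orthogonalize M, as in Muon.
    return M

def drop_muon_step(layers, momenta, loss_fn, active, gamma, beta):
    """Update only the layers whose (0-based) index is in `active`."""
    loss = loss_fn(layers)
    grads = torch.autograd.grad(loss, [layers[i] for i in active])
    for g, i in zip(grads, active):
        momenta[i] = (1 - beta) * momenta[i] + beta * g      # momentum update
        new_layer = layers[i] - gamma * sharp(momenta[i])    # LMO / sharp step
        layers[i] = new_layer.detach().requires_grad_(True)
    return loss.item()

# toy usage: a product of three matrices fitted to random data
b = 3
layers = [torch.randn(8, 8, requires_grad=True) for _ in range(b)]
momenta = [torch.zeros(8, 8) for _ in range(b)]
x, y = torch.randn(16, 8), torch.randn(16, 8)
loss_fn = lambda Ws: ((x @ Ws[0] @ Ws[1] @ Ws[2] - y) ** 2).mean()

for k in range(20):
    s = int(torch.randint(1, b + 1, (1,)))    # smallest active index (uniform here)
    active = list(range(s - 1, b))            # layers s, ..., b (0-based indices)
    drop_muon_step(layers, momenta, loss_fn, active, gamma=0.1, beta=0.1)
```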

2 The Algorithm

Before summarizing our main contributions (see the end of 3), we dive directly into presenting our method. Motivated by the considerations in 1, we propose Drop-Muon ([alg:rt95arbitrary95stoch])–a non-Euclidean layer-wise optimizer for deep learning based on the idea of sub-network training. At each iteration \(k\), instead of updating the entire network as in standard Muon, Drop-Muon samples a random subset \(S^k \subseteq [b]\) of layers according to a user-defined distribution \(\mathcal{D}\) and updates only the parameters of layers in \(S^k\), keeping all other layers frozen. As the reader may have noticed, the main LMO update step 3 admits two equivalent formulations. The alternative representation of 1 uses the sharp operator [12], [13], defined for any \(M \in \mathcal{X}\) by \(M^{\sharp} :=\mathop{\mathrm{arg\,max}}_{X \in \mathcal{X}} \{\left\langle M,X\right\rangle - \frac{1}{2} \left\| X \right\|^2\}\). It is well known that \(M^{\sharp}\) relates to the LMO via \(M^{\sharp} = - \left\| M \right\|_{\star} {\rm LMO}_{\mathcal{B}(0,1)}(M)\), and hence \[\begin{align} \label{eq:lmo95sharp} X_i^{k+1} = X_i^k + t_i^k {\rm LMO}_{\mathcal{B}_i(0,1)}(M_i^k) = X_i^k - \frac{t_i^k}{\left\| M_i^k \right\|_{(i) \star}} \left( M_i^k \right)^{\sharp}, \end{align}\tag{4}\] which corresponds to a layer-wise normalized steepest descent step with stepsize \(\gamma_i^k :=\frac{t_i^k}{\left\| M_i^k \right\|_{\star}}\). When \(\left\| \cdot \right\|_{(i)}=\left\| \cdot \right\|_2\) is the standard Euclidean norm, the sharp operator reduces to the identity mapping, so that \(M^{\sharp} = M\), and the update coincides with Stochastic Gradient Descent with momentum (SGDM) [14], though here performed layer-wise. These equivalent formulations will be repeatedly invoked in the proofs of the results from Sections 4.1 and 4.2.
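As a quick numerical sanity check (ours), the identity \(M^{\sharp} = - \left\| M \right\|_{\star} {\rm LMO}_{\mathcal{B}(0,1)}(M)\) can be verified for the spectral norm, whose dual is the nuclear norm; the variable names below are ours and the exact SVD is used purely for illustration.

```python
# Numerical check (ours) of M^# = -||M||_* LMO_{B(0,1)}(M) for the spectral
# norm: LMO_{B(0,1)}(M) = -U V^T and ||M||_* is the nuclear norm, hence
# M^# = (sum of singular values) * U V^T.
import numpy as np

M = np.random.default_rng(1).standard_normal((5, 3))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

lmo = -U @ Vt               # LMO over the unit spectral-norm ball
dual_norm = S.sum()         # nuclear norm (dual of the spectral norm)
sharp = dual_norm * U @ Vt  # argmax_X { <M, X> - 0.5 * ||X||_{2->2}^2 }

assert np.allclose(sharp, -dual_norm * lmo)
```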

3 Cost Model

Our key theoretical contribution is that strategically skipping some layer updates may lead to performance gains. To isolate the core phenomenon, we first study a simplified variant of the method without stochasticity in the gradients and momentum, described in 1. This deterministic version follows the same fundamental principles as [alg:rt95arbitrary95stoch], with the difference that momentum terms \(M_i^k\) are replaced by the components of the full gradient \(\nabla_i f(X^k)\). When \(\left\| \cdot \right\|_{(i)}=\left\| \cdot \right\|_2\) is the Euclidean norm, the update [eq:sharp95upd] coincides with that of layer-wise Gradient Descent (GD).

Figure 1: Drop-Muon (deterministic gradient variant)

How expensive is one step of 1? The answer is governed by the sampling distribution \(\mathcal{D}\). Consider iteration \(k\) and denote \(s^k :=\min S^k\) (the smallest index of an active layer). The operations performed by the algorithm can be summarized as follows:

  (i) Backward pass: Backpropagate gradients through layers \([X^k_{s^k}, \ldots, X^k_b]\). Since layers \(1, \ldots, s^k - 1\) are frozen, no gradients are computed for them, effectively truncating the gradient flow at the first active layer.

  (ii) Forward pass: To evaluate the loss, activations must in principle be propagated through all layers \(1, \ldots, b\). However, since only layers \([X^k_{s^k}, \ldots, X^k_b]\) are updated at iteration \(k\), the activations up to layer \(s^k - 1\) may be cached and reused in the next step.

  (iii) Gradient transformation: Given the gradients \(\{\nabla_i f(X^k)\}_{i\in S^k}\), compute the corresponding sharp operators \(\{(\nabla_i f(X^k))^{\sharp}\}_{i\in S^k}\) (or, equivalently, the LMOs; see 4).

  (iv) Parameter updates: Update the parameters of layers \(\{X_i^k\}_{i\in S^k}\) using their computed (transformed) gradients, while keeping the frozen layers unchanged.

To model the total computational effort of the optimization procedure, we associate a cost with each step (measured, for example, in FLOPs or wall-clock time). Let \(c_{\mathrm{ov}} \geq 0\) denote the fixed per-iteration overhead (e.g., data loading). As noted above, backpropagation must be performed from the last layer \(b\) down to layer \(s^k\), while forward-pass activations up to layer \(s^k - 1\) can be cached and reused in subsequent iterations. Hence, the costs of steps (i) and (ii) can be aggregated into a single per-layer constant \(c_i > 0\) for each \(i \in [b]\). In the non-Euclidean setting (where \(M^{\sharp} \neq M\)), an additional cost arises from computing sharp operators. We denote by \(c_i^{\sharp} \geq 0\) the combined cost of evaluating this operator and performing the corresponding parameter update for layer \(i\) (steps (iii) and (iv)). Under this model, the total compute cost of iteration \(k\) is \[\begin{align} \label{eq:cost95k} \mathrm{cost}(S^k) :=c_{\mathrm{ov}} + \sum\limits_{i=s^k}^b c_i + \sum\limits_{i \in S^k} c_i^{\sharp}, \end{align}\tag{5}\] and, consequently, for a fixed target accuracy \(\varepsilon > 0\), the expected cost of the entire optimization procedure can be expressed as \[\begin{align} \label{eq:exp95cost95gen} \mathrm{cost}_{\varepsilon}(\mathcal{D}) :=K \times {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right], \end{align}\tag{6}\] where we write \(\hat{S} \sim \mathcal{D}\) to denote a random variable with the same distribution as that of the samplings (since \(\{S^k\}_{k\geq0}\) are i.i.d.), and \(K\) is the number of iterations to reach convergence (interpreted in the nonconvex case as reaching an \(\varepsilon\)-stationary point in expectation). In the remainder of this paper, we compute \(K\) under two smoothness regimes: layer-wise smoothness (2) and layer-wise \((L^0, L^1)\)–smoothness (3). We then evaluate various layer-update strategies and compute their expected costs as in 6 . These results converge to a single conclusion:

Full-network updates are not optimal unless a very specific relationship
between layer smoothness constants holds.

We provide a rigorous statement and justification for this claim in the sections that follow. The main takeaway, however, is simple: as this condition is highly unlikely to be realized in practice, updating only a subset of layers at each iteration should be seen as more efficient than the default strategy of updating all parameters.

Our contributions can be summarized as follows:

1. Challenging full network updates. We provide, to our knowledge, the first systematic investigation of the practice of updating all layers of a network at every iteration. We argue—and rigorously demonstrate both in theory and in practice—that this design choice is generally suboptimal.

2. General framework for sub-network optimization. We introduce Drop-Muon ([alg:rt95arbitrary95stoch] and its deterministic gradient counterpart 1), a principled layer-wise optimization framework with randomized layer subsampling. Drop-Muon strictly generalizes LMO-type methods (including Muon [3], Scion [4], and Gluon [5]) by allowing random subsets of layers to be updated per step, with full-network training as a special case. Drop-Muon supports virtually any layer sampling scheme (9). In the main part of this paper, we focus on Randomized Progressive Training (RPT) [15], a natural strategy aligned with backpropagation mechanics that avoids redundant gradient computations and reduces compute cost while maintaining strong convergence guarantees.

3. Tight iteration complexity guarantees under novel smoothness regimes. We establish convergence guarantees for Drop-Muon under two regimes: layer-wise smoothness (1) and layer-wise \((L^0,L^1)\)–smoothness (Theorems 2 and 4). Our rates recover the state of the art for SGD- and Muon-type methods, and, to our knowledge, provide the first convergence guarantees for progressive training-style methods in the non-smooth setting.

4. Theoretical compute-optimality results. To isolate the key phenomena, we first consider a deterministic gradient variant of Drop-Muon (1). Using a simple yet expressive cost model (5 ) accounting for per-layer forward/backward passes, gradient transformations, and parameter updates, we prove that full-network updates are not optimal unless a very specific condition on layer smoothness constants holds (Theorems 3 and 15), which is unlikely in practice. This formally justifies selective layer updates as the compute-optimal default.

5. Empirical validation. Controlled CNN experiments on MNIST, Fashion-MNIST, and CIFAR-10 show that Drop-Muon consistently outperforms standard full-network Muon, achieving the same accuracy up to \(1.4\times\) faster in wall-clock time.

4 Randomized Progressive Training

The general framework in 1 allows virtually any sampling strategy. However, due to the mechanics of backpropagation, it is most natural to update all layers from the last one down to some sampled minimal index. Specifically, if the smallest sampled index at iteration \(k\) is \(s^k\), then computing the gradient \(\nabla_{s^k} f(X^k)\) requires backpropagating from the last layer \(b\) up to layer \(s^k\), which automatically produces all gradient components \([\nabla_{s^k} f(X^k), \ldots, \nabla_b f(X^k)]\).

Formally, we can define the sampling distribution \(\mathcal{D}\) as follows: at each iteration \(k\), sample \(s^k \in [b]\) with probabilities \(p_i :={\mathbb{P}}\left(s^k=i\right)\), where \(\sum_{i=1}^b p_i = 1\) and \(p_1>0\), and set \(S^k = \{s^k, \ldots, b\}\). Algorithms [alg:rt95arbitrary95stoch] and 1 then update the layers \([X^k_{s^k}, \ldots, X^k_b]\), while \([X_1^k, \ldots, X_{s^k-1}^k]\) remain frozen at their previous values (with the convention that \([X_1^k, \ldots, X_{s^k-1}^k]\) is empty when \(s^k = 1\)). We refer to this sampling scheme as Randomized Progressive Training (RPT), or Drop-training for short (see 5).
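A minimal sketch (ours) of this sampling rule is given below; note that under RPT, layer \(i\) is active with probability \(\sum_{s=1}^i p_s\), a quantity that appears throughout the guarantees that follow. The function name and the example probabilities are illustrative.

```python
# RPT sampling sketch (ours): draw the smallest active index s^k with
# probabilities p (p[0] > 0 so that layer 1 is reachable) and activate
# layers s^k, ..., b; the remaining layers stay frozen.
import numpy as np

def sample_active_layers(p, rng):
    """Return S^k = {s^k, ..., b} as a list of 1-based layer indices."""
    b = len(p)
    s = rng.choice(np.arange(1, b + 1), p=p)   # smallest active index
    return list(range(s, b + 1))

rng = np.random.default_rng(0)
p = [0.4, 0.3, 0.3]                            # probabilities of the smallest index
print(sample_active_layers(p, rng))            # e.g. [2, 3] when s^k = 2
```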

4.1 Iteration Complexity – Deterministic Gradient Setting

We now analyze the iteration complexity of 1 under two smoothness regimes; the proofs are deferred to the Appendix. Throughout, we make the standard assumption that the objective function is lower-bounded.

Assumption 1. There exists \(f^{\star} \in \mathbb{R}\) such that \(f(X) \geq f^{\star}\) for all \(X \in \mathcal{X}\).

This ensures the existence of an approximately stationary point for any desired level of accuracy.

4.1.0.1 Smooth case.

We first establish convergence under the layer-wise smoothness assumption.

Assumption 2 (\(\textrm{supp}(\mathcal{D})\)–layer-wise smoothness). The function \(f: \mathcal{X}\mapsto \mathbb{R}\) is \(\textrm{supp}(\mathcal{D})\)–layer-wise \(L^0\)–smooth with constants \(L^0 :=\{(L_{1,S}^0, \ldots, L_{b,S}^0)\}_{S\in\textrm{supp}(\mathcal{D})}\), \((L_{1,S}^0, \ldots, L_{b,S}^0) \in \mathbb{R}^b_+\), i.e., for any \(S\in\textrm{supp}(\mathcal{D})\), \[\begin{align} f(X + \Gamma) - f(X) - \left\langle\nabla f(X),\Gamma\right\rangle \leq \sum\limits_{i\in S} \frac{L_{i,S}^0}{2}\left\| \Gamma_i \right\|_{(i)}^2, \end{align}\] for all \(X = [X_1, \ldots, X_b]\in \mathcal{X}\) and \(\Gamma = [\Gamma_1, \ldots, \Gamma_b] \in \mathcal{X}\) such that \(\Gamma_i = 0\) for all \(i\not\in S\).

We take \(L^0\) to be the smallest collection of constants satisfying the above. Throughout, we use \(\textrm{supp}(\mathcal{D})\) to denote the subsets of \([b]\) assigned positive probability mass by \(\mathcal{D}\). In the progressive training setting, these supported sets take the form \(\textrm{supp}(\mathcal{D}) = \{\{j,\ldots,b\}, j\in[b]\}\). By definition, \(L_{i,S}^0 = 0\) whenever \(i \notin S\). Moreover, if \(S_1, S_2 \in \textrm{supp}(\mathcal{D})\) are such that \(S_1 \subseteq S_2\), then \(L_{i,S_1}^0 \le L_{i,S_2}^0\) (see 1).

The assumption is inspired by the coordinate descent (CD) literature [16], reducing to the standard block-wise Lipschitz continuity of the gradient [17]–[19] in the special case when \(\textrm{supp}(\mathcal{D}) = \{\{1\}, \{2\}, \ldots, \{b\}\}\). 2 captures the intuition that each subset of layers can have its own effective smoothness constant, allowing tighter bounds on the local curvature of \(f\) and better reflecting the structure of the model. Importantly, it is not more restrictive than standard smoothness–rather, it offers a richer parametric description by assigning separate constants to different layer subsets, allowing a more precise analysis without shrinking the function class.

With the assumptions in place, we are now ready to state the first formal convergence result.

Theorem 1. Let Assumptions 1 and 2 hold, and let \(\{X^k\}_{k=0}^{K-1}\) be the iterates of 1 run with stepsizes \(\gamma_i^k = \frac{1}{L_{i,S^k}^0}\). Then \[\begin{align} \frac{1}{K} \sum\limits_{k=0}^{K-1} \sum\limits_{i=1}^b \frac{w_i}{\frac{1}{b} \sum_{j=1}^b w_j} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\right] \leq \frac{f(X^0) - f^{\star}}{K \left( \frac{1}{b} \sum_{j=1}^b w_j \right)}, \end{align}\] where \(w_i :=\sum_{s=1}^i \frac{p_s}{2 L_{i, \{s,\dots,b\}}^0}\).

1 establishes an \(\mathcal{O}(K^{-1})\) convergence rate for a weighted sum of squared gradient component norms, matching the theoretical rates previously established for Muon-type methods under classical smoothness [4], [11], [20]. For the result to be meaningful, every gradient component must contribute to the weighted average. In other words, the sampling distribution must satisfy \(w_i>0\) for all \(i\in[b]\). This is a natural requirement, equivalent to ensuring that \(p_1>0\), i.e., that all layers are updated with nonzero probability. Obviously, if this were not the case, the first layer would be completely ignored, making convergence impossible.

4.1.0.2 Generalized smooth case.

Layer-wise optimizers considered here are designed for deep learning, where the classical smoothness assumption is often violated [5], [21]. Consequently, the layer-wise smoothness model in 2 may not accurately capture the local geometry of the loss. To address this, we adopt a more expressive framework building upon \((L^0, L^1)\)–smoothness [21], [22]. 3 below generalizes 2 by letting the local curvature of each layer depend not only on fixed constants \(L_{i,S}^0\), but also on the magnitude of the layer’s gradient via additional terms \(L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}\).

Assumption 3 (\(\textrm{supp}(\mathcal{D})\)–layer-wise \((L^0, L^1)\)–smoothness). The function \(f: \mathcal{X}\mapsto \mathbb{R}\) is \(\textrm{supp}(\mathcal{D})\)–layer-wise \((L^0, L^1)\)–smooth with constants \(L^\alpha :=\{(L_{1,S}^\alpha, \ldots, L_{b,S}^\alpha)\}_{S\in\textrm{supp}(\mathcal{D})}\), \((L_{1,S}^\alpha, \ldots, L_{b,S}^\alpha) \in \mathbb{R}^b_+\), \(\alpha\in\{0,1\}\), i.e., for any \(S\in\textrm{supp}(\mathcal{D})\), \[\begin{align} f(X + \Gamma) - f(X) - \left\langle\nabla f(X),\Gamma\right\rangle \leq \sum\limits_{i\in S} \frac{L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}}{2}\left\| \Gamma_i \right\|_{(i)}^2, \end{align}\] for all \(X = [X_1, \ldots, X_b]\in \mathcal{X}\) and \(\Gamma = [\Gamma_1, \ldots, \Gamma_b] \in \mathcal{X}\) such that \(\Gamma_i = 0\) for all \(i\not\in S\).

As with 2, assigning separate constants to each subset of layers \(S\) allows for tighter, subset-specific bounds, reflecting the interactions among layers.

Theorem 2. Let Assumptions 1 and 3 hold, fix \(\varepsilon>0\), and let \(\{X^k\}_{k=0}^{K-1}\) be the iterates of 1 run with stepsizes \(\gamma_i^k = \big(L_{i,S^k}^0 + L_{i,S^k}^1 \left\| \nabla_i f(X^k) \right\|_{(i) \star}\big)^{-1}\). Then, to guarantee that \[\begin{align} \min_{k=0,\ldots,K-1} \sum\limits_{i=1}^b \left[\frac{w_i}{\frac{1}{b} \sum_{l=1}^b w_l} {\mathbb{E}}\left[\left\| \nabla _i f(X^k) \right\|_{(i) \star}\right]\right] \leq \varepsilon, \end{align}\] it suffices to run the algorithm for \[\begin{align} K = \left\lceil \frac{2 \delta^0 \sum\limits_{i=1}^b \frac{\left( \sum_{s=1}^i p_s \right)^2 \left( \sum_{s=1}^i p_s L^0_{i, \{s,\dots,b\}} \right)}{\left( \sum_{s=1}^i p_s L^1_{i, \{s,\dots,b\}} \right)^2}}{\varepsilon^2 \left( \frac{1}{b} \sum_{l=1}^b w_l \right)^2} + \frac{2 \delta^0}{\varepsilon \left( \frac{1}{b} \sum_{l=1}^b w_l \right)} \right\rceil \end{align}\] iterations, where \(\delta^0 :=f(X^0) - f^{\star}\) and \(w_i :=\frac{\left( \sum_{s=1}^i p_s \right)^2}{\sum_{s=1}^i p_s L^1_{i, \{s,\dots,b\}}}\).

Similar to 1, 2 guarantees an \(\mathcal{O}(K^{-1/2})\) convergence rate for a weighted sum of gradient component norms. Again, the sampling distribution must ensure that \(w_i > 0\) for all \(i\in[b]\), which amounts to requiring that \(p_1>0\). In the extreme case when \((p_1,p_2,\ldots,p_b) = (1,0,\ldots,0)\), corresponding to full-network training, the weights simplify to \(w_i = \frac{1}{L^1_{i,[b]}}\), exactly recovering the convergence rate of deterministic Gluon [5], demonstrating the tightness of our guarantees. Importantly, the stepsizes naturally scale inversely with the layer-specific smoothness constants and gradient magnitudes. This automatic adaptation to local geometry prevents overshooting, ensures stable convergence, and allows more aggressive updates when the \((L^0, L^1)\) constants and gradient norms are small.

4.1.1 Cost Optimization

Let us now make clear why performing full-network updates in all iterations is not in general an optimal strategy. We first consider the layer-wise smooth case. According to 1, under 2, 1 guarantees that \(\frac{1}{K} \sum_{k=0}^{K-1} \sum_{i=1}^b {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\right] \leq \varepsilon\) after \[\begin{align} K = \left\lceil \frac{f(X^0) - f^{\star}}{\varepsilon \times \min_{i\in[b]} \left[\sum\limits_{s=1}^i \frac{p_s}{2 L_{i, \{s,\dots,b\}}^0}\right]} \right\rceil \end{align}\] iterations. Since in this case \({\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{j=1}^b (c_j + c_j^{\sharp}) \sum_{i=1}^j p_i\), the expected total cost of the optimization procedure satisfies \[\begin{align} \label{eq:smooth95cost95full} \mathrm{cost}_{\varepsilon}(\mathcal{D}) \,\overset{\eqref{eq:exp95cost95gen}}{=}\, K \times {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] \,\propto\, \frac{c_{\mathrm{ov}} + \sum_{i=1}^b (c_i + c_i^{\sharp}) \sum_{s=1}^i p_s}{\min_{i\in[b]} \left[\sum\limits_{s=1}^i \frac{p_s}{2 L_{i, \{s,\dots,b\}}^0}\right]} \end{align}\tag{7}\] (see 11.1.1 for the details of the derivation). Finding the sampling distribution that minimizes \(\mathrm{cost}_{\varepsilon}(\mathcal{D})\) is equivalent to optimizing over \(\{p_i\}_{i\in[b]}\). Letting \((p_1,p_2,\ldots,p_b) = (1,0,\ldots,0)\) recovers full-network training, which serves as a natural baseline. Yet, can we do better? The following theorem shows that this configuration is optimal under very specific conditions only.

Theorem 3. The cost 7 is minimized by \((p_1, p_2, \ldots, p_b)=(1, 0, \ldots, 0)\) if and only if \[L_{1,\{1,\ldots,b\}}=\max_{i\in[b]} L_{i,\{1,\ldots,b\}}.\]

Note that the condition in 3 is entirely independent of the cost parameters! In fact, one can derive a recursive construction of the optimal probabilities that depends solely on the smoothness constants, from which 3 follows as a simple corollary (see 14).
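To illustrate, the sketch below (ours, with entirely hypothetical cost and smoothness constants) evaluates the right-hand side of 7 on a coarse grid over the probability simplex and compares it with full-network training. Because the first-layer constant is not the largest in this toy example, 3 implies that \((1,0,\ldots,0)\) cannot be the minimizer, and the grid search indeed finds a cheaper distribution.

```python
# Illustrative evaluation (ours) of the cost (7) under RPT. The costs c_i,
# c_i^sharp, c_ov and the constants L[i][s] ~ L^0_{i+1, {s+1,...,b}} below are
# hypothetical; L is chosen so that the first layer does NOT have the largest
# smoothness constant, in which case Theorem 3 rules out full-network updates.
import itertools
import numpy as np

b = 3
c_ov = 0.5
c = np.array([1.0, 1.0, 1.0])              # forward/backward cost per layer
c_sharp = np.array([0.2, 0.2, 0.2])        # sharp operator + parameter update
L = np.array([[0.5, np.nan, np.nan],       # rows: layer i, columns: smallest index s
              [0.9, 0.8,    np.nan],       # only entries with s <= i are used
              [1.2, 1.0,    0.9]])

def cost_eps(p):
    exp_cost = c_ov + sum((c[j] + c_sharp[j]) * p[: j + 1].sum() for j in range(b))
    w = [sum(p[s] / (2 * L[i, s]) for s in range(i + 1)) for i in range(b)]
    return exp_cost / min(w)               # proportional to (7)

best_p, best_val = None, np.inf
grid = np.linspace(0.05, 1.0, 20)          # keep p_1 > 0
for p1, p2 in itertools.product(grid, grid):
    if p1 + p2 <= 1.0:
        p = np.array([p1, p2, 1.0 - p1 - p2])
        if cost_eps(p) < best_val:
            best_p, best_val = p, cost_eps(p)

print("full-network cost:", cost_eps(np.array([1.0, 0.0, 0.0])))
print("best sampled cost:", best_val, "at p =", best_p)
```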

The layer-wise \((L^0, L^1)\)–smooth case (15) leads to the same conclusion: full-network updates are optimal only if the first layer is associated with the largest smoothness constant–a restrictive and rarely observed condition, as confirmed by experiments on NanoGPT [5]. While broader validation is needed, there is no reason to expect this condition to hold in general. Overall, from the theoretical standpoint, the prevalent practice of always updating all layers is fundamentally flawed.

4.2 Iteration Complexity – Stochastic Gradient Setting

We now turn to the convergence analysis of the practical variant of Drop-Muon with stochastic gradients and momentum ([alg:rt95arbitrary95stoch]), within the general layer-wise \((L^0, L^1)\)–smoothness framework.

Assumption 4 (\(\textrm{supp}(\mathcal{D})\)–layer-wise \((L^0, L^1)\)–smoothness II). The function \(f: \mathcal{X}\mapsto \mathbb{R}\) is \(\textrm{supp}(\mathcal{D})\)–layer-wise \((L^0, L^1)\)–smooth with constants \(L^\alpha :=\{(L_{1,S}^\alpha, \ldots, L_{b,S}^\alpha)\}_{S\in\textrm{supp}(\mathcal{D})}\), \((L_{1,S}^\alpha, \ldots, L_{b,S}^\alpha) \in \mathbb{R}^b_+\), \(\alpha\in\{0,1\}\), i.e., for any \(S\in\textrm{supp}(\mathcal{D})\), \[\begin{align} \left\| \nabla_i f(X + \Gamma) - \nabla_i f(X) \right\|_{(i) \star} \leq \left( L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star} \right) \left\| \Gamma_i \right\|_{(i)} \end{align}\] for all \(X = [X_1, \ldots, X_b]\in \mathcal{X}\) and \(\Gamma = [\Gamma_1, \ldots, \Gamma_b] \in \mathcal{X}\) such that \(\Gamma_i = 0\) for all \(i\not\in S\).

4 is a slightly stronger variant of 3 used in the analysis of 1 (see 10). Consequently, all results proven under 3 carry over to this setting. This stronger form is necessary to rigorously establish convergence in the stochastic case.

Furthermore, we assume access to a standard stochastic gradient oracle with bounded variance.

Assumption 5. The stochastic gradient estimator \(\nabla f(\cdot; \xi): \mathcal{X}\mapsto \mathcal{X}\) is unbiased and has bounded variance, i.e., \({\mathbb{E}}_{\xi \sim \mathcal{P}}\left[\nabla f(X; \xi)\right] = \nabla f(X)\) for all \(X \in \mathcal{X}\) and there exist \(\sigma_i \geq 0\) such that \({\mathbb{E}}_{\xi \sim \mathcal{P}}\left[\left\| \nabla_i f(X; \xi) - \nabla_i f(X) \right\|^2_2\right] \leq \sigma_i^2\) for all \(X \in \mathcal{X}\) and \(i = 1, \dots, b\).

Note that, to facilitate the proofs, the variance bound in 5 is defined with respect to the Euclidean norm, consistent with prior analyses [4], [5], [11]. This is not restrictive: since \(\mathcal{X}\) is finite-dimensional, there exist \(\underline{\rho}_i, \bar{\rho}_i > 0\) such that \(\underline{\rho}_i \left\| X_i \right\|_{(i)} \leq \left\| X_i \right\|_2 \leq \bar{\rho}_i \left\| X_i \right\|_{(i)}\) for all \(X_i\in\mathcal{X}_i\) (equivalently, \(\underline{\rho}_i \left\| X_i \right\|_2 \leq \left\| X_i \right\|_{(i) \star} \leq \bar{\rho}_i \left\| X_i \right\|_2\)).
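For instance (our illustration), when \(\left\| \cdot \right\|_{(i)}\) is the spectral norm one may take \(\underline{\rho}_i = 1\) and \(\bar{\rho}_i = \sqrt{\min(m_i, n_i)}\), which the following sketch checks numerically; the example dimensions are arbitrary.

```python
# Numerical illustration (ours) of the norm-equivalence constants when
# ||.||_(i) is the spectral norm: ||X||_{2->2} <= ||X||_F <= sqrt(min(m, n)) ||X||_{2->2},
# and for the dual (nuclear) norm ||X||_F <= ||X||_* <= sqrt(min(m, n)) ||X||_F.
import numpy as np

m, n = 64, 32
X = np.random.default_rng(0).standard_normal((m, n))
spec = np.linalg.norm(X, 2)        # spectral norm ||X||_{2->2}
fro = np.linalg.norm(X)            # Euclidean (Frobenius) norm, ||.||_2 in the paper
nuc = np.linalg.norm(X, 'nuc')     # nuclear norm, dual of the spectral norm

assert spec <= fro <= np.sqrt(min(m, n)) * spec
assert fro <= nuc <= np.sqrt(min(m, n)) * fro
```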

Theorem 4. Let Assumptions 1, 4, and 5 hold. Let \(\{X^k\}_{k=0}^K\), \(K \geq 0\), be the iterates of [alg:rt95arbitrary95stoch] initialized with \(M_i^0 = \nabla_i f(X^0; \xi^0)\) and run with \(t_i^k \equiv t_i = \frac{\eta_i}{(K+1)^{3/4}}\), where \(0 < \eta_i^2 \leq \min\bigg\{\frac{1}{4} (K+1)^{1/2} \Big(\sum_{s=1}^i p_s L^1_{i,\{s,\dots,b\}}\Big)^{-1} \Big(\sum_{s=1}^b p_s \max_{i\in[b]} L^1_{i,\{s,\dots,b\}}\Big)^{-1}\), \(\frac{\underline{\rho}_i p_1}{16 \bar{\rho}_i (1-\beta_i)} \Big(\sum_{s=1}^i p_s\Big)^{-1} \Big(\sum_{s=1}^i p_s L^1_{i,\{s,\dots,b\}}\Big)^{-1} \Big(\sum_{s=1}^b p_s \max_{i\in[b]} L^1_{i,\{s,\dots,b\}}\Big)^{-1}, 1\bigg\}\), and \(\beta_i \equiv \beta = (K+1)^{-1/2}\). Then \[\begin{align} &\min\limits_{k=0,\ldots,K} \sum\limits_{i=1}^b \frac{\left( \sum_{s=1}^i p_s \right) \eta_i}{\frac{1}{b} \sum_{l=1}^b \sum_{s=1}^l p_s \eta_l} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &\leq \frac{3 \delta^0}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b \sum_{s=1}^l p_s \eta_l} + \frac{6}{(K+1)^{1/2}} \sum\limits_{i=1}^b \frac{\eta_i \bar{\rho}_i \sigma_i}{\frac{1}{b} \sum_{l=1}^b \sum_{s=1}^l p_s \eta_l} + \sum\limits_{i=1}^b \frac{2 \bar{\rho}_i \sigma_i \left( \sum_{s=1}^i p_s \right) \eta_i}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b \sum_{s=1}^l p_s \eta_l} \\ &\quad+ \sum\limits_{i=1}^b \frac{2 \bar{\rho}_i \eta_i^2}{\underline{\rho}_i (K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b \sum_{s=1}^l p_s \eta_l} \left( \sum\limits_{s=1}^i p_s L^0_{i,\{s,\dots,b\}} + \left( \sum\limits_{s=1}^i p_s L^1_{i,\{s,\dots,b\}} \right) \left( \sum\limits_{s=1}^i p_s \frac{L^0_{i,\{s,\dots,b\}}}{L^1_{i,\{s,\dots,b\}}} \right) \right) \\ &\quad+ \sum\limits_{i=1}^b \frac{\eta_i^2}{2 (K+1)^{3/4} \frac{1}{b} \sum_{l=1}^b \sum_{s=1}^l p_s \eta_l} \left( \sum\limits_{s=1}^i p_s L^0_{i, \{s,\dots,b\}} + \left( \sum\limits_{s=1}^i p_s L^1_{i, \{s,\dots,b\}} \right) \left( \sum\limits_{s=1}^i p_s \frac{L^0_{i,\{s,\dots,b\}}}{L^1_{i,\{s,\dots,b\}}} \right) \right). \end{align}\]

4 establishes an \(\mathcal{O}(K^{-1/4})\) convergence rate for a weighted sum of gradient component norms, in line with the state-of-the-art results for SGD- and Muon-type methods [5], [11], [14], [23]. As in the deterministic setting, one could attempt a cost analysis; however, the stochastic bound’s complexity necessitates a case-by-case treatment depending on which term dominates. While this prevents us from deriving results as clean as those in 4.1.1, our experiments (6) clearly confirm that the approach remains effective in the stochastic setting, too.

5 Prior Work on Progressive Training

The concept of partial network updates has been explored in several prior works. Progressive Training (PT) [24] was first introduced in the context of Generative Adversarial Networks (GANs) [25]. The central idea is to begin with a shallow network trained on low-resolution inputs and gradually increase both the input resolution and network depth by adding layers over time. This progressive growing strategy has been shown to improve computational efficiency and stabilize the training process. Despite its empirical success, PT has remained a largely heuristic method without formal convergence guarantees. An early attempt to provide such guarantees was made by [26] with ProgFed, which also extended progressive training to distributed settings. However, the accompanying theoretical analysis has been criticized as vacuous [15]. A more rigorous treatment was proposed by [15], who introduced Randomized Progressive Training (RPT). This method can be viewed as a randomized proxy for PT and was the first to establish theoretical convergence rates for progressive training on general smooth objectives, building on the framework of Randomized Coordinate Descent (RCD) [12], [27]. Conceptually, RPT is closely related to our approach: like Drop-Muon, it updates a random subnetwork at each iteration. However, the method of [15] is simply a form of sketched GD with a particular choice of sketch operator, and has several limitations that restrict its practical utility. First, it treats network parameters as a flat vector, ignoring the layer-wise structure. Consequently, it cannot exploit layer-specific, non-Euclidean update rules and, in particular, does not recover methods such as Muon as special cases. Second, RPT requires exact (non-stochastic) gradients, making it computationally infeasible for large-scale architectures.

6 Experiments

We evaluate Drop-Muon on three benchmarks–MNIST, Fashion-MNIST, and CIFAR-10–using 3-layer convolutional neural networks (CNNs) of varying capacity. For all \(i \in [b]\), we set \(\left\| \cdot \right\|_{(i)} = \left\| \cdot \right\|_{2\to2}\), i.e., the spectral norm, consistent with Muon. We study the RPT sampling strategy with several smallest-index sampling rules, including a uniform distribution and an epoch-shift distribution that gradually shifts probability mass from shallow to deep layers as training progresses (see 13). Our baseline is standard Muon (i.e., Drop-Muon with spectral norms and full-network training) run under identical hyperparameter configurations. This design allows us to isolate the effect of partial network updates while controlling for all other factors. For each configuration, we record accuracy as a function of both epochs and wall-clock time. Further experiments and details of the hyperparameter selection procedure are provided in 13.

6.0.0.1 Results on MNIST.

2 illustrates a representative run with uniform layer index sampling: although Drop-Muon converges slightly slower per epoch (left), its lower per-epoch cost (\(27\)s vs. \(39\)s for Muon) yields faster overall training in wall-clock time (right). This advantage is even clearer with epoch-shift sampling: 3 (left) aggregates time-to-target results over multiple seeds, showing that Drop-Muon consistently outperforms Muon and achieves up to \(1.4\times\) speed-up.

Figure 2: Evolution of the training accuracy for Muon and Drop-Muon with uniform index sampling on MNIST. Batch size = 8192, learning rate = 0.1, channels = [64,128,256].

6.0.0.2 Results on Fashion-MNIST.

We repeat the experiment on Fashion-MNIST. Drop-Muon again delivers meaningful acceleration: as before, its per-epoch convergence is slightly slower, but it overtakes Muon in wall-clock time. 3 (right) shows aggregated time-to-target results over multiple seeds, with Drop-Muon achieving roughly \(1.2\times\) faster convergence on average across accuracy thresholds (we omit the \(99\%\) threshold since neither method reaches it).

Figure 3: Averaged time-to-target speed-up over multiple runs comparing Muon and Drop-Muon with epoch-shift index sampling. Left: MNIST with batch size = 8192, learning rate = 0.1, and channels = [64,128,256]. Right: Fashion-MNIST with batch size = 32768, learning rate = 0.1, and channels = [64,128,256].

6.0.0.3 Results on CIFAR-10.

4 presents a representative run. Although the absolute gain is smaller than on MNIST or Fashion-MNIST, Drop-Muon still reaches \(90\%\) training accuracy earlier in wall-clock time (see also 11).

Figure 4: Evolution of the training accuracy for Muon and Drop-Muon with epoch-shift index sampling on CIFAR-10. Batch size = 8192, learning rate = 0.1, channels = [128,256,512].

In summary, even though Drop-Muon trains on partial gradients, it reaches high accuracy levels earlier than Muon in wall-clock time. Importantly, Drop-Muon is simple to implement, requiring only a few lines of code. The observed speed-up over Muon could be even greater with dedicated tuning: we kept the learning rates identical for both optimizers to isolate the wall-clock benefit of partial layer updates, whereas theory (Theorems 1, 2, and 4) and the practice of coordinate descent (7.3) suggest that Drop-Muon can safely employ larger stepsizes than the full-network Muon baseline, so additional gains are likely achievable with method-specific tuning. Overall, our results demonstrate that Drop-Muon is an effective, practical, and easily implementable strategy that, with a well-chosen sampling scheme, consistently accelerates training while retaining high accuracy, making it a compelling choice for modern neural network optimization.

Acknowledgments

The research reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST): i) KAUST Baseline Research Scheme, ii) CRG Grant ORFS-CRG12-2024-6460, and iii) Center of Excellence for Generative AI, under award number 5940.

Appendix

7 Additional Literature Review

7.1 Muon (and Friends)

Recent work has revisited how neural network parameters are updated, moving beyond simple element-wise gradient steps toward more structured update rules. Notable examples include the Muon optimizer [3], as well as the closely related Scion [4] and Gluon [5]. All of these methods can be interpreted as instances of algorithms driven by a linear minimization oracle (LMO) over norm balls.

Muon, introduced by [3], is an optimizer for the hidden layers of neural networks (with first and last layers typically trained using AdamW [2]). Given a layer \(X_i\) and the associated (stochastic) gradient \(G_i\), the update is obtained by solving the constrained optimization problem \[\begin{align} \label{eq:muon95lmo} \min_{\Delta X_i}\;\left\langle G_i,\Delta X_i\right\rangle \quad \text{subject to} \quad \|\Delta X_i\|_{2\to 2} \le t_i, \end{align}\tag{8}\] where \(t_i > 0\) plays the role of a trust-region radius or stepsize. The solution is characterized by the singular vectors of \(G_i\): if \(G_i = U_i \Sigma_i V_i^\top\) is its singular value decomposition, then \[\begin{align} \label{eq:muon95sol} \Delta X_i = -t_i U_i V_i^\top, \qquad X_i^{k+1} = X_i^k + \Delta X_i. \end{align}\tag{9}\] In other words, Muon moves in the direction of steepest descent measured in the spectral norm.

Since computing an exact SVD can be expensive, Muon uses the Newton-Schulz method [28], [29] to approximate the orthogonalization. In practice, the algorithm also incorporates momentum, yielding the update \[\begin{align} M_i^k &= (1-\beta_i) M_i^{k-1} + \beta_i G_i^k, \\ X_i^{k+1} &= X_i^k - t_i^k \mathrm{NewtonSchulz}(M_i^k), \end{align}\] where \(\beta_i \in (0,1]\) is the momentum parameter and \(M_i^k\) is the running average of past gradients.
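For reference, below is a minimal sketch (ours) of the Newton-Schulz orthogonalization: an odd matrix polynomial applied repeatedly to a normalized input pushes all singular values toward 1, so the result approximates \(U_i V_i^\top\) without an SVD. The quintic coefficients are those commonly used in public Muon implementations and are assumptions here; practical versions differ in precision handling (e.g., bfloat16).

```python
# Newton-Schulz sketch (ours): approximately maps M = U S V^T to U V^T without
# an SVD. Coefficients follow common public Muon implementations (assumed here).
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                   # now all singular values are <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                             # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X    # odd polynomial pushes singular values toward 1
    return X.T if transposed else X

M = torch.randn(64, 32)
O = newton_schulz(M)
print(torch.linalg.svdvals(O)[:4])             # singular values lie in a band around 1
```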

The update rule in 9 can be interpreted as a special case of the more general LMO-based update described in 1. By selecting different norms for the LMO ball constraint, one obtains different algorithmic variants. Since our primary focus is on deep learning applications, we are especially interested in matrix norms. Within this setting, a particularly important family is that of operator norms, defined for any \(A \in \mathbb{R}^{m \times n}\) as \[\left\| A \right\|_{\alpha \to \beta} :=\sup _{\left\| Z \right\|_\alpha=1} \left\| A Z \right\|_\beta,\] where \(\left\| \cdot \right\|_{\alpha}\) and \(\left\| \cdot \right\|_{\beta}\) denote norms on \(\mathbb{R}^{n}\) and \(\mathbb{R}^{m}\), respectively.

In particular, Muon’s update in 8 corresponds to an LMO over the spectral norm ball \(\mathcal{B}_i^{2\to2}(0,t_i) :=\{Z_i \in \mathcal{X}_i : \left\| Z_i \right\|_{2\to 2} \leq t_i\}\), leading to the equivalent update \[\begin{align} X_i^{k+1} = X_i^k + {\rm LMO}_{\mathcal{B}_i^{2\to 2}(0,t_i)}(G_i). \end{align}\] Hence, Muon is just one member of a broader family of methods parameterized by the geometry of the update set.

Building on this perspective, [4] introduced Scion, which extends LMO-based updates to all network layers, rather than being limited to the hidden matrix-shaped layers as in Muon. To better capture layer-specific behavior, Scion employs different norms depending on layer type (scaled spectral norms for the weight matrices of transformer blocks and the \(\left\| \cdot \right\|_{1\to \infty}\) norm for embedding and output layers).

Finally, Gluon [5] generalizes the theory behind both Muon and Scion. It provides a convergence framework for updates over arbitrary norm balls. The analysis relies on a novel layer-wise smoothness model (closely related to our Assumptions 2, 3, and 4), capturing heterogeneity through \((L^0,L^1)\) parameters, that more accurately reflects the non-uniform characteristics of deep learning models (see more details in 7.2). As such, Gluon can be seen as the theoretical foundation unifying the entire class of LMO-based optimizers for training deep networks.

7.2 Generalized Smoothness

Gradient-based methods are traditionally analyzed under the assumption of Lipschitz smoothness of the gradient. Yet, many modern deep learning tasks violate this assumption [21], [30], prompting the development of alternative smoothness models. One such model is \((L^0, L^1)\)–smoothness, initially introduced by [21] for twice continuously differentiable functions and later generalized beyond this setting [22], [31].

This framework has been further tailored to deep learning through the non-Euclidean layer-wise \((L^0, L^1)\)–smoothness assumption [5], which closely relates to our Assumptions 2, 3, and 4. Concretely, [5] assume that \[\begin{align} \label{eq:oagsbv} \left\| \nabla _i f(X) - \nabla _i f(Y) \right\|_{(i) \star} \leq \left( L^0_i + L^1_i \| \nabla _i f(X) \|_{(i) \star} \right) \left\| X_i - Y_i \right\|_{(i)} \end{align}\tag{10}\] for all \(i\in[b]\) and all \(X = [X_1, \ldots, X_b]\in \mathcal{X}\), \(Y = [Y_1, \ldots, Y_b] \in \mathcal{X}\). Condition 10 can be interpreted as a global version of our 4. However, it is less precise because it does not allow each subset of layers to have its own effective smoothness constants. By adopting 4 instead of 10, we can derive tighter bounds without restricting the function class under consideration.

Accounting for heterogeneous parameter structures is not novel and has been studied in the context of coordinate descent [12], [27] and in the analysis of algorithms such as signSGD [30], [32], AdaGrad [33], [34], and Adam [35].

7.3 Randomized Block Coordinate Descent

Figure 5: Randomized Block Coordinate Descent

Coordinate Descent (CD) algorithms have a long history [12], [16], [36], [37] and are among the most widely studied methods for large-scale optimization. Their randomized variants, known as Randomized Block Coordinate Descent (RBCD), have emerged as a powerful class of first-order methods, particularly in high-dimensional problems. Instead of updating the entire parameter vector, RBCD updates only a subset (“block”) of variables at each iteration, reducing per-iteration computational cost and improving scalability.

In the context of CD, the parameters are viewed as a flat vector in \(\mathbb{R}^d\). The general procedure of RBCD is summarized in 5. At each iteration, one block is selected at random and updated, while the other blocks are left unchanged. Clearly, 5 is a special case of 1 with the serial sampling strategy (i.e., sampling a single block per iteration) and standard Euclidean updates.

We define a family of linear maps \(\{\mathbf{U}_i: \mathcal{X}_i\rightarrow \mathcal{X}: i\in [b]\}\) (where \(\mathcal{X}=\mathbb{R}^d\) and \(\mathcal{X}_i=\mathbb{R}^{d_i}\), with \(\sum_{i=1}^b d_i = d\)), forming a partition of the identity map, i.e., \(X = \sum_{i\in [b]} \mathbf{U}_i \left(X_i\right)\). The standard analysis of RBCD relies on the assumption of blockwise Lipschitz continuity of the gradient: \[\label{eq:sdfewr} \left\| \nabla_i f(X+\mathbf{U}_i (\Gamma_i)) - \nabla_i f(X) \right\|_{(i)\star}\leq L_i \|\Gamma_i\|_{(i)},\quad \forall i\in [b], \, X \in \mathcal{X}, \, \Gamma_i \in \mathcal{X}_i.\tag{11}\] From 11 , one obtains the standard quadratic upper bound \[\label{eq:fnf} f(X+ \mathbf{U}_i (\Gamma_i))-f(X)- \left\langle \nabla_i f(X), \Gamma_i \right\rangle \leq \frac{L_i}{2} \|\Gamma_i\|^2_{(i)}.\tag{12}\] This inequality plays a pivotal role in establishing the iteration complexity of RBCD. Moreover, it explains why CD-type methods can be advantageous compared to full gradient descent: they allow for larger stepsizes, which in turn leads to faster convergence. Indeed, the standard analysis of GD relies on global \(L\)-smoothness, enforcing stepsizes on the order of \(\frac{1}{L}\). By contrast, the blockwise smoothness assumption 11 bounds the function’s variation when only a single block of coordinates is updated. Consequently, in CD, the safe stepsize for block \(i\) scales with \(\frac{1}{L_i}\). Since \(L\) must capture worst-case variations across all directions (including cross-block interactions), it is often much larger than the individual \(L_i\)’s.
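A toy sketch (ours) of the RBCD scheme in 5 with blockwise stepsizes \(\frac{1}{L_i}\) is given below, on a block-separable quadratic where each \(L_i\) is explicit; all names and constants are illustrative.

```python
# Toy RBCD sketch (ours): f(x) = sum_i (L_i / 2) ||x_i||^2, serial sampling,
# blockwise stepsizes 1/L_i. For this quadratic, plain GD would be limited to
# stepsize 1/max_i L_i, whereas each sampled block here is minimized in one step.
import numpy as np

rng = np.random.default_rng(0)
L = np.array([1.0, 5.0, 25.0])                         # blockwise smoothness constants
blocks = [rng.standard_normal(4) for _ in range(len(L))]

def grad_block(x_i, i):
    return L[i] * x_i                                  # gradient of the i-th block

for k in range(50):
    i = rng.integers(len(blocks))                      # serial sampling: one block per step
    blocks[i] = blocks[i] - (1.0 / L[i]) * grad_block(blocks[i], i)

print([round(float(np.linalg.norm(x)), 6) for x in blocks])   # all blocks driven to zero
```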

Complexity analyses of RBCD in the convex setting can be found in [12], [16]. Generalizations to parallel and arbitrary sampling variants were developed by [19], [38]. For the nonconvex case considered in this paper, iteration complexity results of 1 are only known for serial sampling [39], [40]. Otherwise, some work has focused on the global convergence properties of randomized CD methods [41], rather than the iteration complexity analysis we develop here. It is worth noting that serial sampling is not suitable in the deep neural network context due to the special entanglement structure and the requirements of backpropagation for gradient computation.

8 More on Smoothness Assumptions

Recall that a function \(f: \mathcal{X}\mapsto \mathbb{R}\) is \(\textrm{supp}(\mathcal{D})\)–layer-wise \(L^0\)–smooth with constants \(L^0 :=\{(L_{1,S}^0, \ldots, L_{b,S}^0)\}_{S\in\textrm{supp}(\mathcal{D})}\), \((L_{1,S}^0, \ldots, L_{b,S}^0) \in \mathbb{R}^b_+\) (2) if for any \(S\in\textrm{supp}(\mathcal{D})\), \[\begin{align} f(X + \Gamma) - f(X) - \left\langle\nabla f(X),\Gamma\right\rangle \leq \sum_{i\in S} \frac{L_{i,S}^0}{2}\left\| \Gamma_i \right\|_{(i)}^2 \end{align}\] for all \(X = [X_1, \ldots, X_b]\in \mathcal{X}\) and \(\Gamma = [\Gamma_1, \ldots, \Gamma_b] \in \mathcal{X}\) such that \(\Gamma_i = 0\) for all \(i\not\in S\).

Lemma 1. Let \(S_1, S_2 \in \textrm{supp}(\mathcal{D})\) satisfy \(S_1 \subseteq S_2\), and suppose 2 holds. Then one can choose the constants \(\{L_{i,S_1}^0\}_{i\in S_1}\) so that \[\begin{align} L_{i,S_1}^0 \leq L_{i,S_2}^0. \end{align}\]

Proof. Let \(X = [X_1, \ldots, X_b]\in \mathcal{X}\) be arbitrary. By 2, for any \(\Gamma = [\Gamma_1, \ldots, \Gamma_b] \in \mathcal{X}\) such that \(\Gamma_i = 0\) for all \(i\not\in S_1\), we have \[\begin{align} f(X + \Gamma) - f(X) - \left\langle\nabla f(X),\Gamma\right\rangle \leq \sum_{i\in S_1} \frac{L_{i,S_1}^0}{2}\left\| \Gamma_i \right\|_{(i)}^2. \end{align}\] Now, since \(S_1 \subseteq S_2\) and \(\Gamma_i = 0\) for all \(i \in S_2\setminus S_1\), 2 also gives \[\begin{align} f(X + \Gamma) - f(X) - \left\langle\nabla f(X),\Gamma\right\rangle \leq \sum_{i\in S_2} \frac{L_{i,S_2}^0}{2}\left\| \Gamma_i \right\|_{(i)}^2 = \sum_{i\in S_1} \frac{L_{i,S_2}^0}{2}\left\| \Gamma_i \right\|_{(i)}^2. \end{align}\] Therefore, one can always choose \(L_{i,S_1}^0 \leq L_{i,S_2}^0\). ◻

Remark 5. We can show that, without further assumptions, there is no finite function \(C: [b] \to (0, \infty)\) such that \[\begin{align} \label{eq:oaivn} L_{i,S}^0 \leq C(|S|) L_{i,\{i\}}^0 \end{align}\tag{13}\] holds for all \(f\). Consider the case \(b=2\) with scalar blocks and the Euclidean norm, suppose that \(\textrm{supp}(\mathcal{D}) = \left\{ \{1\}, \{2\}, \{1,2\} \right\}\), and define \[\begin{align} f(x,y) = \alpha x y \end{align}\] for some \(\alpha > 0\). For a single-block perturbation in the \(x\)-coordinate only, we have \[\begin{align} f(x+\gamma_1,y) - f(x,y) - \partial_x f(x,y) \gamma_1 = \alpha (x+\gamma_1) y - \alpha x y - \alpha \gamma_1 y = 0, \end{align}\] which shows that \(L_{1,\{1\}}^0 = 0\). By symmetry, \(L_{2,\{2\}}^0 = 0\) as well. However, for a joint perturbation \(\Gamma = (\gamma_1,\gamma_2)\), we obtain \[\begin{align} f(x+\gamma_1, y+\gamma_2) - f(x,y) - \left\langle\nabla f(x,y),\Gamma\right\rangle &=& \alpha (x+\gamma_1)(y+\gamma_2) - \alpha x y - \alpha \gamma_1 y - \alpha \gamma_2 x \\ &=& \alpha \gamma_1 \gamma_2. \end{align}\] Now, suppose for a contradiction that 13 holds for some function \(C:[b]\to\mathbb{R}\). Then, since \(L_{1,\{1\}}^0 = 0\), it must be that \(L_{1,\{1,2\}}^0 = 0\). By 2, this requires \[\begin{align} \alpha \gamma_1 \gamma_2 \leq \frac{L_{1,\{1,2\}}^0}{2} \gamma_1^2 + \frac{L_{2,\{1,2\}}^0}{2} \gamma_2^2 = \frac{L_{2,\{1,2\}}^0}{2} \gamma_2^2. \end{align}\] But the right-hand side is independent of \(\gamma_1\), whereas the left-hand side grows without bound as \(\gamma_1\to\infty\) whenever \(\gamma_2\neq 0\). Hence no such constant \(L_{2,\{1,2\}}^0\) can exist. This proves that no uniform bound of the form 13 can hold without additional assumptions.
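A short numerical check (ours) of this counterexample, with an arbitrary choice of \(\alpha\) and base point:

```python
# Numerical check (ours) of the counterexample f(x, y) = a*x*y: the single-block
# curvature error is exactly zero, while the joint error a*g1*g2 is unbounded in g1.
a, x, y = 2.0, 1.0, 3.0
f = lambda u, v: a * u * v
dfx, dfy = a * y, a * x                 # gradient of f at (x, y)

for g1 in (1.0, 10.0, 100.0):
    g2 = 1.0
    err_single = f(x + g1, y) - f(x, y) - dfx * g1                 # always 0
    err_joint = f(x + g1, y + g2) - f(x, y) - dfx * g1 - dfy * g2  # equals a*g1*g2
    print(g1, err_single, err_joint)
```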

Remark 6. The above argument extends to any two sets with one contained in the other (not just singletons). Let \(S_1, S_2 \in \textrm{supp}(\mathcal{D})\) satisfy \(S_1 \subseteq S_2\). We show that, without further assumptions, there is no finite function \(C:[b]\times[b]\to(0,\infty)\) such that \[\begin{align} \label{eq:qwioavnf} L_{i,S_2}^0 \leq C(|S_1|,|S_2|) L_{i,S_1}^0. \end{align}\tag{14}\] Without loss of generality we can restrict attention to the coordinates indexed by \(S_2\) (so that \(S_2=[b]\)). Let \(S_1 = \{1,\dots,s\}\) and \(S_2 \setminus S_1 = \{s+1,\dots,s+u\}\), so \(|S_1|=s\), \(|S_2|=s+u\), and \(|S_2 \setminus S_1|=u\). Again, consider scalar blocks and define a function \(f: \mathbb{R}^b \to \mathbb{R}\) via \[\begin{align} f(x_1,\dots,x_s,y_{1},\dots,y_{u}) = \alpha \left( \sum_{i=1}^s x_i \right) \left( \sum_{j=1}^u y_j \right) \end{align}\] for some \(\alpha > 0\). For any increment \(\Gamma_{S_1}\) supported on \(S_1\) only (i.e., \(\gamma_j=0\) for all \(j \not\in S_1\)) the function \(f\) is affine in the \(x\)-variables with coefficients given by \(\alpha \sum_j y_j\). Therefore, for any \(X\in\mathbb{R}^b\), \[\begin{align} &&f(X+\Gamma_{S_1})-f(X)-\left\langle\nabla f(X),\Gamma_{S_1}\right\rangle \\ &=& \alpha \left( \sum_{i=1}^s (x_i + \gamma_i) \right) \left( \sum_{j=1}^u y_j \right) - \alpha \left( \sum_{i=1}^s x_i \right) \left( \sum_{j=1}^u y_j \right) - \alpha \sum_{i=1}^s \left( \sum_{j=1}^u y_j \right) \gamma_i = 0, \end{align}\] and hence \(L_{i,S_1}^0 = 0\) for every \(i\in S_1\). For a joint perturbation \(\Gamma_{S_2}=(\gamma_1,\dots,\gamma_s,\eta_1,\dots,\eta_u)\) supported on \(S_2\), we have \[\begin{align} &&f(X+\Gamma_{S_2})-f(X)-\left\langle\nabla f(X),\Gamma_{S_2}\right\rangle \\ &=& \alpha \left( \sum_{i=1}^s (x_i + \gamma_i) \right) \left( \sum_{j=1}^u (y_j + \eta_j) \right) - \alpha \left( \sum_{i=1}^s x_i \right) \left( \sum_{j=1}^u y_j \right) \\ &&- \alpha \sum_{i=1}^s \left( \sum_{j=1}^u y_j \right) \gamma_i - \alpha \sum_{i=1}^u \left( \sum_{j=1}^s x_j \right) \eta_i \\ &=& \alpha \left( \sum_{i=1}^s \gamma_i \right) \left( \sum_{j=1}^u \eta_j \right). \end{align}\] In particular, fix any nonzero vectors \(\tilde{\gamma} =(\tilde{\gamma}_1,\dots,\tilde{\gamma}_s)\) and \(\eta = (\eta_1,\dots,\eta_u)\) and let the \(x\)-perturbation scale by a factor \(\lambda\), i.e., \(\gamma_i=\lambda \tilde{\gamma}_i\). Then \[\begin{align} f(X+\Gamma)-f(X)-\left\langle\nabla f(X),\Gamma\right\rangle = \alpha \lambda \left( \sum_{i=1}^s \tilde{\gamma}_i \right) \left( \sum_{j=1}^u \eta_j \right), \end{align}\] which grows linearly in \(\lambda\). Now, suppose, for contradiction, that 14 holds with some finite \(C(|S_1|,|S_2|)\). Since \(L_{i,S_1}^0=0\) for every \(i\in S_1\), the inequality forces \(L_{i,S_2}^0=0\) for all \(i\in S_1\). But then 2 would yield \[\begin{align} \alpha \lambda \left( \sum_{i=1}^s \tilde{\gamma}_i \right) \left( \sum_{j=1}^u \eta_j \right) &=& \alpha \left( \sum_{i=1}^s \gamma_i \right) \left( \sum_{j=1}^u \eta_j \right) \\ &\leq& \sum_{i\in S_1} \frac{L_{i,S_2}^0}{2} \gamma_i^2 + \sum_{i\in S_2 \setminus S_1} \frac{L_{i,S_2}^0}{2} \eta_i^2 \\ &=& \sum_{i\in S_2 \setminus S_1} \frac{L_{i,S_2}^0}{2} \eta_i^2, \end{align}\] where the right-hand side is independent of the scaling factor \(\lambda\). Taking \(\lambda\to\infty\) makes the left-hand side arbitrarily large, resulting in a contradiction. Hence no such \(C\) can exist.

9 Arbitrary Sampling

One might wonder why the cost model in 3 distinguishes between two sets of constants \(\{c_i\}_{i\in[b]}\) and \(\{c_i^\sharp\}_{i\in[b]}\), given that in the RPT case, \(c_i\) and \(c_i^\sharp\) could be combined into a single constant. This is because Algorithms [alg:rt95arbitrary95stoch] and 1 do not restrict us to the layer sampling scheme studied in 4. One could argue that deviating from this progressive training framework is inefficient, as it discards gradients obtained “for free” during backpropagation: by design, at iteration \(k\), the gradients \([\nabla_{s^k} f(X^k), \ldots, \nabla_b f(X^k)]\), where \(s^k :=\min S^k\), are necessarily computed, and thus available for the update (if, for some block \(i \in [b]\), the cost \(c_i^{\sharp}\) of applying the sharp operator is large, one can simply bypass this step and instead use the standard gradient update; this corresponds to setting \(\left\| \cdot \right\|_{(i)}=\left\| \cdot \right\|_2\) in the algorithm). Nevertheless, since the general iteration complexity results in Theorems 7, 10 and 16 hold for any distribution \(\mathcal{D}\), the framework naturally allows experimentation with alternative sampling schemes, even if only out of theoretical curiosity.

With this in mind, we generalize the results from 4 to arbitrary layer samplings. Formally, at each iteration \(k\), Algorithms [alg:rt95arbitrary95stoch] and 1 sample a random subset of layers \(S^k \subseteq [b]\) from a distribution \[\begin{align} \mathcal{D}: \mathfrak{P}([b]) \to [0,1], \qquad \sum_{S\subseteq[b]} \mathcal{D}(S) = 1, \end{align}\] with support \(\textrm{supp}(\mathcal{D}) :=\{S\in\mathfrak{P}([b]): \mathcal{D}(S)>0\}\), where \(\mathfrak{P}([b])\) denotes the power set of \([b]\). Only the layers in \(S^k\) are updated, while the rest remain fixed. We write \(\hat{S} \sim \mathcal{D}\) for a set-valued random variable with the same distribution as \(S^k\). This formulation defines a broad family of algorithms, parameterized jointly by the sampling distribution \(\mathcal{D}\) and the choice of layer-wise norms \(\left\| \cdot \right\|_{(i)}\).

Mirroring the earlier discussion, we organize the analysis around two smoothness regimes–layer-wise smooth (2) and generalized layer-wise smooth (Assumptions 3 and 4).

In what follows, we present convergence results for Algorithms [alg:rt95arbitrary95stoch] (16) and 1 (Theorems 7 and 10). When specialized to the RPT setting, these general results recover the guarantees stated in Theorems 1, 2, and 4, as detailed in Remarks 9, 12, and 18.

10 Convergence Results – Deterministic Gradient Case

10.1 Layer-Wise Smooth Case

Theorem 7. Let Assumptions 1 and 2 hold, and let \(\{X^k\}_{k=0}^{K-1}\) be the iterates of 1 run with stepsizes \(\gamma_i^k = \frac{1}{L_{i,S^k}^0}\). Then \[\begin{align} \frac{1}{K} \sum\limits_{k=0}^{K-1} \sum\limits_{i=1}^b \frac{w_i}{\frac{1}{b} \sum_{j=1}^b w_j} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\right] \leq \frac{f(X^0) - f^{\star}}{K \left( \frac{1}{b} \sum_{j=1}^b w_j \right)}, \end{align}\] where \(w_i :={\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right]\).

Remark 8. For 7 to be meaningful, the sampling must ensure that \(w_i > 0\) for every \(i\in[b]\), i.e., \[\begin{align} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right] > 0. \end{align}\] Otherwise, some layers would receive zero weight and the bound would provide no control over them. This is a natural condition, requiring that every layer is sampled with positive probability. Indeed, if \({\mathbb{P}}\left(i\in \hat{S}\right) = 0\), then the expectation vanishes, so \(w_i>0\) necessarily implies \({\mathbb{P}}\left(i\in \hat{S}\right) > 0\) for all \(i\in[b]\).

Remark 9. In the case of RPT (see 4), the weights from 7 become \[\begin{align} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right] = \sum_{s=1}^b \frac{\mathbb{I}\left(i \in \{s,\dots,b\}\right)}{2 L_{i, \{s,\dots,b\}}^0} p_s = \sum_{s=1}^i \frac{p_s}{2 L_{i, \{s,\dots,b\}}^0}. \end{align}\] Substituting this expression into the rate yields the result in 1.
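For concreteness, the weights of Remark 9 can be evaluated with a few lines of Python; the sketch below assumes hypothetical inputs `p` (starting-index probabilities) and `L0[i][s]` standing for \(L^0_{i, \{s,\dots,b\}}\), with 0-based indices.

```python
def rpt_weights(p, L0):
    """
    Weights w_i = sum_{s <= i} p_s / (2 * L^0_{i,{s,...,b}}) from Remark 9 (0-based indices).
    p[s]    : probability of starting index s, with sum(p) == 1
    L0[i][s]: smoothness constant L^0_{i,{s,...,b}} (entries with s > i are unused)
    """
    b = len(p)
    return [sum(p[s] / (2.0 * L0[i][s]) for s in range(i + 1)) for i in range(b)]

# Hypothetical example with b = 3 layers.
p = [0.2, 0.3, 0.5]
L0 = [[1.0, None, None],
      [2.0, 1.5, None],
      [4.0, 3.0, 2.0]]
w = rpt_weights(p, L0)    # one weight per layer
avg_w = sum(w) / len(w)   # the (1/b) * sum_j w_j factor appearing in Theorem 7
```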

Proof of 7. Using 2, we get \[\begin{align} &&f(X^{k+1}) \\ &\leq& f(X^k) + \left\langle\nabla f(X^k),X^{k+1} - X^k\right\rangle + \sum_{i\in S^k} \frac{L^0_{i,S^k}}{2} \left\| X_i^k - X_i^{k+1} \right\|_{(i)}^2 \\ &=& f(X^k) + \sum_{i\in S^k} \left( \left\langle\nabla_i f(X^k),X_i^{k+1} - X_i^k\right\rangle_{(i)} + \frac{L^0_{i,S^k}}{2} \left\| X_i^k - X_i^{k+1} \right\|_{(i)}^2 \right) \\ &=& f(X^k) + \sum_{i\in S^k} \left( - \gamma_i^k \left\langle\nabla_i f(X^k),\left( \nabla_i f(X^k) \right)^{\sharp}\right\rangle_{(i)} + \frac{L^0_{i,S^k} (\gamma_i^k)^2}{2} \left\| \left( \nabla_i f(X^k) \right)^{\sharp} \right\|_{(i)}^2 \right) \\ &\overset{\eqref{eq:inpsharp}, \eqref{eq:normsharp}}{=}& f(X^k) + \sum_{i\in S^k} \left( - \gamma_i^k \left\| \nabla_i f(X^k) \right\|^2_{(i) \star} + \frac{L^0_{i,S^k} (\gamma_i^k)^2}{2} \left\| \nabla_i f(X^k) \right\|_{(i) \star}^2 \right). \end{align}\] Choosing \(\gamma_i^k = \frac{1}{L^0_{i,S^k}}\) and rearranging, we get \[\begin{align} \sum_{i\in S^k} \frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 L^0_{i,S^k}} \leq f(X^k) - f(X^{k+1}). \end{align}\] Taking expectation conditional on \(X^k\) (denoted as \({\mathbb{E}}_{k}\left[\cdot\right]\)) gives \[\begin{align} f(X^k) - {\mathbb{E}}_{k}\left[f(X^{k+1})\right] &\geq& {\mathbb{E}}_{k}\left[\sum_{i\in S^k} \frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 L^0_{i,S^k}}\right] \\ &=& {\mathbb{E}}_{k}\left[\sum_{i=1}^b \mathbb{I}\left(i\in S^k\right) \frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 L^0_{i,S^k}}\right] \\ &=& \sum_{i=1}^b {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right] \left\| \nabla_i f(X^k) \right\|^2_{(i) \star} \\ &=& \sum_{i=1}^b w_i \left\| \nabla_i f(X^k) \right\|^2_{(i) \star}, \end{align}\] where we denoted \(w_i :={\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right]\) and \(\mathbb{I}\left(\cdot\right)\) is the indicator function (i.e., for any event \(A\), \(\mathbb{I}\left(A\right) = 1\) if \(A\) and \(\mathbb{I}\left(A\right) = 0\) otherwise). Taking full expectation, summing over the first \(K\) iterations and dividing by \(\frac{K}{b} \sum_{j=1}^b w_j\), we get \[\begin{align} \frac{1}{K} \sum_{k=0}^{K-1} \sum_{i=1}^b \frac{w_i}{\frac{1}{b} \sum_{j=1}^b w_j} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\right] &\leq \frac{1}{\frac{K}{b} \sum_{j=1}^b w_j} \sum_{k=0}^{K-1} \left( {\mathbb{E}}\left[f(X^k)\right] - {\mathbb{E}}\left[f(X^{k+1})\right] \right) \\ &\leq \frac{f(X^0) - f^{\star}}{K \left( \frac{1}{b} \sum_{j=1}^b w_j \right)}. \end{align}\] ◻

10.2 Layer-Wise Generalized Smooth Case↩︎

Theorem 10. Let Assumptions 1 and 3 hold, fix \(\varepsilon>0\), and let \(\{X^k\}_{k=0}^{K-1}\) be the iterates of 1 run with stepsizes \(\gamma_i^k = \left( L_{i,S^k}^0 + L_{i,S^k}^1 \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)^{-1}\). Then, to guarantee that \[\begin{align} \min_{k=0,\ldots,K-1} \sum\limits_{i=1}^b \left[\frac{w_i}{\frac{1}{b} \sum_{l=1}^b w_l} {\mathbb{E}}\left[\left\| \nabla _i f(X^k) \right\|_{(i) \star}\right]\right] \leq \varepsilon, \end{align}\] it suffices to run the algorithm for \[\begin{align} K = \left\lceil \frac{2 \delta^0 \sum\limits_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right) {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right]}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2}}{\varepsilon^2 \left( \frac{1}{b} \sum_{l=1}^b w_l \right)^2} + \frac{2 \delta^0}{\varepsilon \left( \frac{1}{b} \sum_{l=1}^b w_l \right)} \right\rceil \end{align}\] iterations, where \(\delta^0 :=f(X^0) - f^{\star}\) and \(w_i :=\frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]}\).

Remark 11. The guarantee in 10 is meaningful only if \(w_i > 0\) for all \(i\in[b]\), which is equivalent to requiring that \({\mathbb{P}}\left(i\in \hat{S}\right) > 0\) for all \(i\in[b]\).

Remark 12. For the RPT case (see 4), we have \[\begin{align} {\mathbb{P}}\left(i\in \hat{S}\right) = {\mathbb{P}}\left(s \leq i\right) = \sum_{s=1}^i p_s \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\left.L^{\alpha}_{i,\hat{S}}\right\vert i\in \hat{S}\right] = \frac{{\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}{{\mathbb{P}}\left(i\in \hat{S}\right)} = \frac{\sum_{s=1}^i p_s L^{\alpha}_{i, \{s,\dots,b\}}}{\sum_{s=1}^i p_s} \end{align}\] for \(\alpha \in \{0,1\}\). Hence the weights from 10 become \[\begin{align} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} = \frac{\left( \sum_{s=1}^i p_s \right)^2}{\sum_{s=1}^i p_s L^{1}_{i, \{s,\dots,b\}}}. \end{align}\] Substituting this expression into the rate yields the result in 2.

Remark 13. When \(\hat{S} = [b]\) with probability \(1\), the weights become \(w_i = \frac{1}{L^1_{i,[b]}}\), and hence 10 guarantees that \[\begin{align} \min_{k=0,\ldots,K-1} \sum_{i=1}^b \left[\frac{\frac{1}{L^1_{i,[b]}}}{\frac{1}{b} \sum_{l=1}^b \frac{1}{L^1_{l,[b]}}} \left\| \nabla _i f(X^k) \right\|_{(i) \star}\right] \leq \varepsilon, \end{align}\] after \[\begin{align} K &=& \left\lceil \frac{2 \delta^0 \sum_{i=1}^b \frac{L^0_{i,[b]}}{\left( L^1_{i,[b]} \right)^2}}{\varepsilon^2 \left( \frac{1}{b} \sum_{l=1}^b \frac{1}{L^1_{l,[b]}} \right)^2} + \frac{2 \delta^0}{\varepsilon \left( \frac{1}{b} \sum_{l=1}^b \frac{1}{L^1_{l,[b]}} \right)} \right\rceil \end{align}\] iterations, recovering the rate of Gluon [5].

Proof of 10. Starting with 3, we have \[\begin{align} &&f(X^{k+1}) \\ &\leq& f(X^k) + \sum_{i\in S^k} \Bigg(\left\langle\nabla_i f(X^k),X_i^{k+1} - X_i^k\right\rangle_{(i)} \\ &&\qquad\qquad\qquad\quad+ \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} \left\| X_i^{k+1} - X_i^k \right\|_{(i)}^2\Bigg) \\ &=& f(X^k) + \sum_{i\in S^k} \Bigg(- \gamma_i^k \left\langle\nabla_i f(X^k),\left( \nabla_i f(X^k) \right)^{\sharp}\right\rangle_{(i)} \\ &&\qquad\qquad\qquad\quad+ \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (\gamma_i^k)^2 \left\| \left( \nabla_i f(X^k) \right)^{\sharp} \right\|_{(i)}^2\Bigg) \\ &\overset{\eqref{eq:inpsharp}, \eqref{eq:normsharp}}{=}& f(X^k) + \sum_{i\in S^k} \Bigg(- \gamma_i^k \left\| \nabla_i f(X^k) \right\|^2_{(i) \star} \\ &&\qquad\qquad\qquad\quad+ \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (\gamma_i^k)^2 \left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\Bigg). \end{align}\] Taking \(\gamma_i^k = \frac{1}{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}\) gives \[\begin{align} f(X^{k+1}) \leq f(X^k) - \sum_{i\in S^k} \frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 \left( L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)}, \end{align}\] and hence \[\begin{align} \sum_{k=0}^{K-1} \sum_{i\in S^k} \frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 \left( L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)} &\leq& \sum_{k=0}^{K-1} \left( f(X^k) - f(X^{k+1}) \right) \\ &\leq& f(X^0) - f^\star :=\delta^0. \end{align}\] Taking expectation \[\begin{align} \label{eq:naovaoqwf} \delta^0 &\geq \sum_{k=0}^{K-1} {\mathbb{E}}\left[\sum_{i\in S^k} \frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 \left( L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)}\right] \nonumber \\ &= \sum_{k=0}^{K-1} {\mathbb{E}}\left[{\mathbb{E}}\left[\left.\sum_{i=1}^b \mathbb{I}\left(i\in S^k\right) \frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 \left( L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)}\right\vert X^k\right]\right] \nonumber \\ &= \sum_{k=0}^{K-1} \sum_{i=1}^b {\mathbb{E}}\left[{\mathbb{P}}\left(i\in S^k\right) {\mathbb{E}}\left[\left.\frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}}{2 \left( L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)}\right\vert X^k, \left\{ i\in S^k \right\}\right]\right] \nonumber \\ &\overset{(i)}{\geq} \sum_{k=0}^{K-1} \sum_{i=1}^b {\mathbb{E}}\left[\frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star} {\mathbb{P}}\left(i\in S^k\right)}{2 \left( {\mathbb{E}}\left[\left.L^0_{i,S^k}\right\vert X^k, \left\{ i\in S^k \right\}\right] + {\mathbb{E}}\left[\left.L^1_{i,S^k}\right\vert X^k, \left\{ i\in S^k \right\}\right] \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)}\right] \nonumber \\ &\overset{(ii)}{=} \frac{1}{2} \sum_{k=0}^{K-1} \sum_{i=1}^b {\mathbb{E}}\left[\frac{\left\| \nabla_i f(X^k) \right\|^2_{(i) \star} {\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \left\| \nabla_i f(X^k) \right\|_{(i) \star}}\right] \nonumber \\ &\overset{(iii)}{\geq} \frac{1}{2} \sum_{k=0}^{K-1} \sum_{i=1}^b \frac{{\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]^2 {\mathbb{P}}\left(i\in 
\hat{S}\right)}{{\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]}, \end{align}\tag{15}\] where in \((i)\) we used Jensen’s inequality and convexity of the function \(t \mapsto \frac{1}{t}\) for \(t>0\), \((ii)\) follows from independence of \(\hat{S}\) and \(X^k\), and \((iii)\) is a consequence of convexity of the function \(t \mapsto \frac{t^2 {\mathbb{P}}\left(i\in S^k\right)}{{\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] t}\) and Jensen’s inequality.

Now, using 9 with \(x_i = \frac{\sqrt{{\mathbb{P}}\left(i\in \hat{S}\right)}}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]}\), \(y_i = \sqrt{{\mathbb{P}}\left(i\in \hat{S}\right)} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]\) and \(z_i = {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]\), we obtain \[\begin{align} &\sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right) {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]^2}{{\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]} \\ &\geq \frac{\left( \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \right)^2}{\sum_{i=1}^b \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2} {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \right)}. \end{align}\] Applying this in 15 , we get \[\begin{align} \delta^0 &\geq& \frac{1}{2} \sum_{k=0}^{K-1} \sum_{i=1}^b \frac{{\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]^2 {\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]} \\ &\geq& \frac{1}{2} \sum_{k=0}^{K-1} \frac{\left( \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \right)^2}{\sum_{i=1}^b \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2} {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \right)} \\ &=& \frac{1}{2} \sum_{k=0}^{K-1} \psi \left( \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \right), \end{align}\] where \(\psi(t) :=\frac{t^2}{\sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2} {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right] + t}\). 
Since \(\psi\) is increasing for \(t>0\), we have \[\begin{align} \delta^0 &\geq& \frac{1}{2} \sum_{k=0}^{K-1} \psi \left( \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \right) \\ &\geq& \frac{K}{2} \psi \left( \min_{k=0,\ldots,K-1} \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \right). \end{align}\] Moreover, since \(\psi\) is monotonic, it has an inverse \(\psi^{-1}\). Thus \[\begin{align} \psi^{-1}\left( \frac{2 \delta^0}{K} \right) &\geq& \min_{k=0,\ldots,K-1} \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &=& \min_{k=0,\ldots,K-1} \sum_{i=1}^b w_i {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right], \end{align}\] where \(w_i :=\frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]}\). This in turn means that to reach the precision \[\begin{align} \min_{k=0,\ldots,K-1} \sum_{i=1}^b \left[\frac{w_i}{\frac{1}{b} \sum_{l=1}^b w_l} {\mathbb{E}}\left[\left\| \nabla _i f(X^k) \right\|_{(i) \star}\right]\right] \leq \varepsilon, \end{align}\] it suffices to run the algorithm for \[\begin{align} K &=& \left\lceil \frac{2 \delta^0}{\psi\left( \varepsilon \left( \frac{1}{b} \sum_{l=1}^b w_l \right) \right)} \right\rceil = \left\lceil \frac{2 \delta^0 \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right) {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right]}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2} + 2 \delta^0 \left( \varepsilon \left( \frac{1}{b} \sum_{l=1}^b w_l \right) \right)}{\left( \varepsilon \left( \frac{1}{b} \sum_{l=1}^b w_l \right) \right)^2} \right\rceil \\ &=& \left\lceil \frac{2 \delta^0 \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right) {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right]}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2}}{\varepsilon^2 \left( \frac{1}{b} \sum_{l=1}^b \frac{{\mathbb{P}}\left(l\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{l,\hat{S}}\right\vert l\in \hat{S}\right]} \right)^2} + \frac{2 \delta^0}{\varepsilon \left( \frac{1}{b} \sum_{l=1}^b \frac{{\mathbb{P}}\left(l \in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{l,\hat{S}}\right\vert l\in \hat{S}\right]} \right)} \right\rceil \end{align}\] iterations. ◻

11 Optimizing the Cost – Deterministic Gradient Case↩︎

Let \(S^k\subseteq\{1,\dots,b\}\) be the random subset sampled at iteration \(k\), and \(s^k :=\min S^k\) its smallest index. Recall the per-iteration cost \[\begin{align} \mathrm{cost}(S^k) = c_{\mathrm{ov}} + \sum_{i=s^k}^b c_i + \sum_{i\in S^k} c_i^{\sharp}. \end{align}\] Define the two marginal probabilities \[\begin{align} F_i :={\mathbb{P}}\left(\hat{s} \leq i\right),\qquad Q_i :={\mathbb{P}}\left(i \in \hat{S}\right), \end{align}\] where \(\hat{s}\) is a random variable following the same distribution as \(s^k\) (since \(S^k \sim \mathcal{D}\), \(k \geq 0\), are i.i.d., the same holds for \(s^k\)). Since \[\begin{align} {\mathbb{E}}\left[\sum_{i=s^k}^b c_i\right] = \sum_{j=1}^b {\mathbb{P}}\left(s^k = j\right) \sum_{i=j}^b c_i = \sum_{i=1}^b c_i \sum_{j=1}^i {\mathbb{P}}\left(s^k = j\right) = \sum_{i=1}^b c_i {\mathbb{P}}\left(s^k \leq i\right) \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\sum_{i\in S^k} c_i^{\sharp}\right] = {\mathbb{E}}\left[\sum_{i=1}^b \mathbb{I}\left(i\in S^k\right) c_i^{\sharp}\right] = \sum_{i=1}^b c_i^{\sharp} {\mathbb{P}}\left(i\in S^k\right), \end{align}\] the expected cost of one iteration is \[\begin{align} \label{eq:cost95gen} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{i=1}^b c_i F_i + \sum_{i=1}^b c_i^{\sharp} Q_i. \end{align}\tag{16}\] Hence, to evaluate the expected cost, it suffices to compute the two marginals \(F_i\) and \(Q_i\) under the chosen sampling scheme.
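As a sanity check of 16 , the expected per-iteration cost requires only the marginals \(F_i\) and \(Q_i\). The short Python sketch below (hypothetical costs, 0-based indices) evaluates it, using RPT, for which \(F_i = Q_i\), as the example sampling scheme.

```python
def expected_cost(c_ov, c, c_sharp, F, Q):
    """Expected per-iteration cost from (16): c_ov + sum_i c_i F_i + sum_i c_i^sharp Q_i."""
    return c_ov + sum(ci * Fi for ci, Fi in zip(c, F)) + sum(cs * Qi for cs, Qi in zip(c_sharp, Q))

def rpt_marginals(p):
    """For RPT (S = {s,...,b}): F_i = Q_i = P(start <= i) = sum_{j <= i} p_j."""
    F, acc = [], 0.0
    for pj in p:
        acc += pj
        F.append(acc)
    return F, list(F)

# Hypothetical per-layer costs for a 4-layer model (0-based indices).
c, c_sharp, c_ov = [3.0, 2.0, 2.0, 1.0], [0.5, 0.5, 0.5, 0.5], 1.0
F, Q = rpt_marginals([0.1, 0.2, 0.3, 0.4])
cost_per_iter = expected_cost(c_ov, c, c_sharp, F, Q)
```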

In the sequel, we describe a few example sampling strategies considered in this work and analyze their costs under the layer-wise smooth setting (see 11.1) and the generalized layer-wise smooth setting (see 11.2).

  1. RPT. Sample \(\hat{s}\in\{1,\dots,b\}\), where \(p_i = {\mathbb{P}}\left(\hat{s}=i\right)\), and set \(\hat{S}=\{\hat{s},\dots,b\}\). Then \[\begin{align} F_i = {\mathbb{P}}\left(\hat{s}\leq i\right) = \sum_{j=1}^i p_j, \qquad Q_i = {\mathbb{P}}\left(i\in \hat{S}\right) = {\mathbb{P}}\left(\hat{s}\leq i\right) = F_i. \end{align}\] Therefore, \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{i=1}^b (c_i + c_i^{\sharp}) F_i = c_{\mathrm{ov}} + \sum_{i=1}^b (c_i + c_i^{\sharp}) \sum_{j=1}^i p_j. \end{align}\]

  2. \(\tau\)-nice sampling. Choose \(\hat{S}\) uniformly from all subsets of \([b]\) of size \(\tau\). Then \[\begin{align} F_i &=& 1 - {\mathbb{P}}\left(\hat{s} > i\right) = 1 - \frac{\binom{b-i}{\tau}}{\binom{b}{\tau}}, \\ Q_i &=& \frac{\binom{b-1}{\tau-1}}{\binom{b}{\tau}} = \frac{\tau}{b}, \end{align}\] where we use the convention \(\binom{n}{k}=0\) for \(n<k\) (so the formula for \(F_i\) automatically gives \(F_i=1\) when \(b-i<\tau\)). Hence \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{i=1}^b c_i \left( 1 - \frac{\binom{b-i}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{i=1}^b c_i^{\sharp}. \end{align}\]

  3. \(\tau\)-submodel sampling. Sample a starting index \(\hat{s}\in\{1,\dots,b-\tau+1\}\) with probability \(p_i = {\mathbb{P}}\left(\hat{s}=i\right)\) (where \(\tau\in[b]\) is fixed), and set \(\hat{S}=\{\hat{s},\dots,\hat{s}+\tau-1\}\) (i.e., a block of \(\tau\) consecutive layers). Then the marginals are \[\begin{align} F_i &=& {\mathbb{P}}\left(\hat{s} \leq i\right) = \sum_{j=1}^{\min\{i,b-\tau+1\}} p_j, \\ Q_i &=& {\mathbb{P}}\left(i\in \hat{S}\right) = \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} p_j. \end{align}\] The expected per-iteration cost is therefore \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{i=1}^b c_i \left( \sum_{j=1}^{\min\{i,b-\tau+1\}} p_j \right) + \sum_{i=1}^b c_i^{\sharp} \left( \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} p_j \right). \end{align}\]

  4. Arbitrary submodel sampling. Let \(\{B_1,\dots,B_m\}\) be a partition of \([b]\) into disjoint blocks of arbitrary indices, i.e., \[\begin{align} B_1 \cup \cdots \cup B_m = [b], \quad B_k \cap B_l = \emptyset \quad\textrm{for } k\neq l. \end{align}\] At each iteration, pick block \(B_i\) with probability \(p_i\) (where \(\sum_{i=1}^m p_i = 1\)) and set \(\hat{S} = B_i\).

    For \(i\in[b]\), let \(k(i)\) denote the unique block with \(i\in B_{k(i)}\), and let \(\underline{b}_k :=\min B_k\). Then \[\begin{align} F_i &= {\mathbb{P}}\left(\hat{s} \leq i\right) = \sum_{j: \underline{b}_j \leq i} p_j, \\ Q_i &= {\mathbb{P}}\left(i\in \hat{S}\right) = p_{k(i)}. \end{align}\] The expected cost per iteration is \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{i=1}^b c_i \left( \sum_{k: \underline{b}_k \leq i} p_k \right) + \sum_{i=1}^b c_i^{\sharp} p_{k(i)}. \end{align}\]

We now consider the algorithm’s performance under the two smoothness regimes.

11.1 Smooth Case↩︎

According to 7, under 2, 1 run with stepsizes \(\gamma_i^k = \frac{1}{L_{i,S^k}^0}\) guarantees that \[\begin{align} \left( \min_{i\in[b]} w_i \right) \frac{1}{K} \sum_{k=0}^{K-1} \sum_{i=1}^b {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\right] &\leq& \frac{1}{K} \sum_{k=0}^{K-1} \sum_{i=1}^b w_i {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\right] \\ &\leq& \frac{f(X^0) - f^{\star}}{K}, \end{align}\] where \(w_i :={\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right]\). Thus, to ensure that \(\frac{1}{K} \sum_{k=0}^{K-1} \sum_{i=1}^b {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|^2_{(i) \star}\right] \leq \varepsilon\), it suffices to run it for \[K = \left\lceil \frac{f(X^0) - f^{\star}}{\varepsilon \left( \min_{i\in[b]} w_i \right)} \right\rceil\] iterations. Now, recall from 16 that the expected cost of a single iteration is \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(S^k)\right] = c_{\mathrm{ov}} + \sum_{i=1}^b c_i {\mathbb{P}}\left(\min S^k \leq i\right) + \sum_{i=1}^b c_i^{\sharp} {\mathbb{P}}\left(i\in S^k\right). \end{align}\] Hence, the expected cost of the entire optimization procedure can be written as \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &=& K \times {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] \\ &=& K \times \left( c_{\mathrm{ov}} + \sum_{i=1}^b c_i {\mathbb{P}}\left(\min \hat{S} \leq i\right) + \sum_{i=1}^b c_i^{\sharp} {\mathbb{P}}\left(i\in \hat{S}\right) \right) \\ &\propto& \frac{c_{\mathrm{ov}} + \sum_{i=1}^b c_i {\mathbb{P}}\left(\min \hat{S} \leq i\right) + \sum_{i=1}^b c_i^{\sharp} {\mathbb{P}}\left(i\in \hat{S}\right)}{\min_{i\in[b]} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right]}, \end{align}\] and the cost minimization problem to be solved is \[\begin{align} \label{eq:cost95opt95problem} \min_{\mathcal{D}: \mathfrak{P}([b]) \to [0,1], \sum_{S\subseteq[b]} \mathcal{D}(S) = 1} \frac{c_{\mathrm{ov}} + \sum_{i=1}^b c_i {\mathbb{P}}\left(\min \hat{S} \leq i\right) + \sum_{i=1}^b c_i^{\sharp} {\mathbb{P}}\left(i\in \hat{S}\right)}{\min_{i\in[b]} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right]}. \end{align}\tag{17}\]

The task above is an optimization over probability distributions on the power set \(\mathfrak{P}([b])\), which has \(2^b\) elements, making a direct solution intractable for large \(b\). Instead of tackling it in full generality, we can restrict \(\mathcal{D}\) to some parametric family. For certain such families, the ratio objective simplifies to a linear–fractional program in the cumulative marginals, which has a closed-form solution or can be solved efficiently (e.g., via the Dinkelbach algorithm [42]).

Let us now consider some specific examples, starting with the procedure considered in the main part of this paper.

11.1.1 Randomized Progressive Training↩︎

Sample \(\hat{s}\in\{1,\dots,b\}\), where \(p_i = {\mathbb{P}}\left(\hat{s}=i\right)\), and set \(\hat{S}=\{\hat{s},\dots,b\}\).

We first consider the randomized progressive training setting introduced in 4. Under this sampling strategy, we have \[\begin{align} F_i = {\mathbb{P}}\left(\hat{s}\leq i\right) = \sum_{j=1}^i p_j, \qquad Q_i = {\mathbb{P}}\left(i\in \hat{S}\right) = {\mathbb{P}}\left(\hat{s}\leq i\right) = F_i, \end{align}\] and hence \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{i=1}^b (c_i + c_i^{\sharp}) F_i = c_{\mathrm{ov}} + \sum_{i=1}^b (c_i + c_i^{\sharp}) \sum_{j=1}^i p_j. \end{align}\] Combining it with the fact that \[\begin{align} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right] = \sum_{s=1}^b \frac{\mathbb{I}\left(i \in \{s,\dots,b\}\right)}{2 L_{i, \{s,\dots,b\}}^0} p_s = \sum_{s=1}^i \frac{p_s}{2 L_{i, \{s,\dots,b\}}^0}, \end{align}\] we get \[\begin{align} \label{eq:costK} \mathrm{cost}_{\varepsilon}(\mathcal{D}) \propto \frac{{\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right]}{\min_{i\in[b]} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right]} = \frac{c_{\mathrm{ov}} + \sum_{i=1}^b (c_i + c_i^{\sharp}) \sum_{j=1}^i p_j}{\min_{i\in[b]} \left\{ \sum_{s=1}^i \frac{p_s}{2 L_{i, \{s,\dots,b\}}^0} \right\}}. \end{align}\tag{18}\]

11.1.1.1 Optimal probabilities.

First, note that the numerator can be rewritten as \[\begin{align} c_{\mathrm{ov}} + \sum_{i=1}^b (c_i + c_i^{\sharp}) \sum_{j=1}^i p_j =c_{\mathrm{ov}}+\sum_{j=1}^b \left[\sum_{i\geq j} (c_i + c_i^{\sharp})\right] p_j =\sum_{j=1}^b \left[c_{\mathrm{ov}}+\sum_{i\geq j} (c_i + c_i^{\sharp})\right] p_j, \end{align}\] where the second equality follows from \(\sum_{j=1}^b p_j=1\). Denote \[d_j :=c_{\mathrm{ov}}+\sum_{i\geq j} (c_i + c_i^{\sharp})\] for \(j \in [b]\), and let \[\delta_{i,s}:=\frac{1}{2L_{i, \{s,\dots,b\}}}\] for \(i\in [b]\) and \(s\leq i\). Clearly \[\begin{align} \label{eq:orderofd} d_1 > d_2 > \ldots > d_b \end{align}\tag{19}\] and \[\begin{align} \label{eq:orderofdelta} \delta_{i,1} \leq \delta_{i,2} \leq \ldots \leq \delta_{i,i} \qquad \forall i\in [b]. \end{align}\tag{20}\] Based on 18 , the search for optimal probabilities reduces to solving the following linear fractional program: \[\label{eq:fl1} \begin{align} \min_{p, t} \quad & \frac{d_1p_1+\ldots+d_bp_b}{t} \\ \text{s.t.} \quad & p_1,\ldots,p_b\geq 0\\ & p_1+\ldots+p_b=1\\ & t\geq 0 \\ & t\leq \delta_{1,1}p_1 \\ & \vdots\\ & t\leq \delta_{i,1}p_1+\ldots+\delta_{i,i}p_i\\ & \vdots\\ & t\leq \delta_{b,1}p_1+\ldots+\delta_{b,b}p_b. \end{align}\tag{21}\] This program can be written equivalently as \[\label{eq:fl2} \begin{align} \min_{q} \quad & {d_1 q_1+\ldots+d_b q_b} \\ \text{s.t.} \quad & q_1,\ldots,q_b\geq 0\\ & \delta_{1,1}q_1\geq 1 \\ & \vdots\\ & \delta_{i,1}q_1+\ldots+\delta_{i,i}q_i\geq 1\\ & \vdots\\ & \delta_{b,1}q_1+\ldots+\delta_{b,b}q_b \geq 1. \end{align}\tag{22}\]

Based on that, we can derive a recursive prescription for the optimal probabilities.

First, we show that if \(\sum_{s=1}^i \delta_{i,s} q_s>1\) and \(q_i>0\) for some \(i\), then one can shift a small amount of mass from \(q_i\) to \(q_{i+1}\) and strictly reduce the objective, without violating any constraints.

Lemma 2. Let \(q\) be an optimal point of 22 and fix \(i \in [b-1]\). If \(\sum_{s=1}^i \delta_{i,s} q_s>1\), then \(q_i = 0\). Equivalently, \[\begin{align} \label{eq:exch} q_i > 0 \qquad\implies\qquad \sum_{s=1}^i \delta_{i,s} q_s = 1. \end{align}\tag{23}\]

Proof. Suppose that \(i\) is such that \(\sum_{s=1}^i \delta_{i,s} q_s>1\), but \(q_i > 0\). Then there exists \(\varepsilon>0\) such that \(q_i \geq \varepsilon\) and \[\begin{align} 0<\varepsilon \leq \frac{\sum_{s=1}^i \delta_{i,s} q_s - 1}{\delta_{i,i}}. \end{align}\] Define \[\begin{align} \tilde{q_i} = q_i-\varepsilon,\quad \tilde{q}_{i+1} = q_{i+1} + \varepsilon, \quad \tilde{q}_s = q_s \text{ for } s\notin\{i,i+1\}. \end{align}\] Then \(\{\tilde{q}_i\}_{i\in[b]}\) satisfy all constraints:

  • Constraint \(k<i\): \(\sum_{s=1}^k \delta_{k,s} \tilde{q}_s = \sum_{s=1}^k \delta_{k,s} q_s \geq 1\),

  • Constraint \(i\): \(\sum_{s=1}^i \delta_{i,s} \tilde{q}_s = \sum_{s=1}^i \delta_{i,s} q_s - \delta_{i,i} \varepsilon \geq 1\) by the choice of \(\varepsilon\),

  • Constraint \(k>i\): \(\sum_{s=1}^k \delta_{k,s} \tilde{q}_s = \sum_{s=1}^k \delta_{k,s} q_s + \varepsilon(\delta_{k,i+1}-\delta_{k,i}) \geq \sum_{s=1}^k \delta_{k,s} q_s \geq 1\) using the monotonicity \(\delta_{k,i} \leq \delta_{k,i+1}\).

At the same time, the value of the objective decreases, as \(\sum_j d_j \tilde{q}_j - \sum_j d_j q_j = (d_{i+1}-d_i)\varepsilon < 0\), contradicting optimality of \(q\). Therefore, we must have \(q_i = 0\). ◻

Let us now use 2 to derive a recursive construction of the optimal probabilities. Let \(q^\star = (q_1^\star, \ldots, q_b^\star)\) be a solution of 22 . As established in 23 , any positive coordinate forces its constraint to be tight. Constraint \(1\) is \(\delta_{1,1} q_1^\star \geq 1\), meaning that \(q_1^\star > 0\), and so \(\delta_{1,1} q_1^\star = 1\). Thus \[\begin{align} q_1^\star = \frac{1}{\delta_{1,1}} = 2 L_{1, \{1,\dots,b\}}. \end{align}\] Now, suppose that \(q_1^\star,\dots,q_{i-1}^\star\) have already been determined. Then, constraint \(i\) reads \[\begin{align} \sum_{s=1}^{i-1} \delta_{i,s} q_s^\star + \delta_{i,i} q_i^\star \geq 1. \end{align}\] Define the residual \[\begin{align} r_i^\star :=1 - \sum_{s=1}^{i-1} \delta_{i,s} q_s^\star = 1 - \sum_{s=1}^{i-1} \frac{q_s^\star}{2L_{i, \{s,\dots,b\}}}. \end{align}\] If \(r_i^\star \le 0\), then \((q_1^\star,\dots,q_{i-1}^\star)\) already satisfy the constraint, so by 23 , we must have \(q_i^\star=0\). If \(r_i^\star >0\), then for \((q_1^\star,\dots,q_i^\star)\) to satisfy the constraint, we need \(q_i^\star>0\), and hence by 2, we have \(\sum_{s=1}^i \delta_{i,s} q_s^\star = 1\), meaning that \[\begin{align} q_i^\star = \frac{r_i^\star}{\delta_{i,i}} = 2 r_i^\star L_{i, \{i,\dots,b\}}. \end{align}\] Combining the above yields the recursion \[\begin{align} q_1^\star &= 2 L_{1, \{1,\dots,b\}}, \\ q_i^\star &= 2 [r_i^\star]_+ L_{i, \{i,\dots,b\}},\qquad r_i^\star = 1-\sum_{s=1}^{i-1} \frac{q_s^\star}{2L_{i, \{s,\dots,b\}}},\quad i\in\{2,\ldots,b\}, \end{align}\] where \([x]_+ :=\max\{x,0\}\). The optimal probabilities are finally recovered by normalization \[\begin{align} p_i^\star = \frac{q_i^\star}{\sum_{j=1}^b q_j^\star}. \end{align}\]
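The recursion above is straightforward to implement. The following Python sketch (with a hypothetical array `L0[i][s]` standing for \(L^0_{i, \{s,\dots,b\}}\), 0-based indices) computes \(q^\star\) via the residuals and returns the normalized optimal probabilities \(p^\star\).

```python
def optimal_rpt_probabilities(L0):
    """
    Recursive construction of the optimal RPT probabilities (Section 11.1.1).
    L0[i][s] stands for L^0_{i,{s,...,b}} (0-based indices; entries with s > i are unused).
    Returns the normalized probabilities p^star.
    """
    b = len(L0)
    q = [0.0] * b
    q[0] = 2.0 * L0[0][0]                                        # q_1^star = 2 L_{1,{1,...,b}}
    for i in range(1, b):
        r = 1.0 - sum(q[s] / (2.0 * L0[i][s]) for s in range(i))  # residual r_i^star
        q[i] = 2.0 * max(r, 0.0) * L0[i][i]                       # q_i^star = 2 [r_i^star]_+ L_{i,{i,...,b}}
    total = sum(q)
    return [qi / total for qi in q]

# Hypothetical smoothness constants for b = 3 layers.
L0 = [[1.0, None, None],
      [2.0, 1.5, None],
      [0.5, 0.4, 0.3]]
p_star = optimal_rpt_probabilities(L0)
```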

Remark 14. Note that \((p_1^\star, p_2^\star, \ldots, p_b^\star)=(1, 0, \ldots, 0)\) if and only if \(q_1^\star = \sum_{j=1}^b q_j^\star\), i.e., \(q_2^\star = \ldots = q_b^\star = 0\). But for that to be the case, we need \(r_i^\star \leq 0\) for all \(i\in\{2,\ldots,b\}\), so \[\begin{align} 1 \leq \sum_{s=1}^{i-1} \delta_{i,s} q_s^\star = \delta_{i,1} q_1^\star = \frac{\delta_{i,1}}{\delta_{1,1}} = \frac{L_{1, \{1,\dots,b\}}}{L_{i, \{1,\dots,b\}}} \qquad \forall i\in\{2,\ldots,b\}. \end{align}\] Therefore, choosing \(p_1=1\) is optimal if and only if \[L_{1,\{1,\ldots,b\}}=\max_{i\in[b]} L_{i,\{1,\ldots,b\}},\] which proves 3.

11.1.2 \(\tau\)-nice Sampling↩︎

Choose \(\hat{S}\) uniformly from all subsets of \([b]\) of size \(\tau\).

Every \(\tau\)-subset has probability \(\binom{b}{\tau}^{-1}\). Thus \[\begin{align} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in\hat{S}\right)}{2L^0_{i,\hat{S}}}\right] = \frac{1}{\binom{b}{\tau}} \sum_{\substack{S\subseteq[b]\\ |S|=\tau}} \frac{\mathbb{I}\left(i\in S\right)}{2L^0_{i,S}}, \end{align}\] and hence \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& \frac{{\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right]}{\min_{i\in[b]} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L_{i,\hat{S}}^0}\right]} = \frac{c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp}}{\min_{i\in[b]} \frac{1}{\binom{b}{\tau}} \sum_{\substack{S\subseteq[b]\\ |S|=\tau}} \frac{\mathbb{I}\left(i\in S\right)}{2L^0_{i,S}}} \\ &=& \frac{c_{\mathrm{ov}} \binom{b}{\tau} + \sum_{j=1}^b c_j \left( \binom{b}{\tau} - \binom{b-j}{\tau} \right) + \binom{b}{\tau} \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp}}{\min_{i\in[b]} \left\{ \sum_{\substack{S\subseteq[b]\\ |S|=\tau}} \frac{\mathbb{I}\left(i\in S\right)}{2L^0_{i,S}} \right\}}. \end{align}\] In general there is no closed-form expression for \(\tau\) minimizing this cost because the denominator depends on \(\{L^0_{i,S}\}_{S\subseteq[b]}\) in a highly problem-specific way. That said, it can be shown that \(\tau=b\) need not be optimal (indeed, it can be very sub-optimal). For \(\tau=b\), we always have \(S=[b]\), and the cost becomes \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& \frac{c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \binom{b-j}{b} \right) + \sum_{j=1}^b c_j^{\sharp}}{\min_{i\in[b]} \frac{1}{2L^0_{i,[b]}}} \\ &=& 2 \max_{i\in[b]} L^0_{i,[b]} \left( c_{\mathrm{ov}} + \sum_{j=1}^b \left( c_j + c_j^{\sharp} \right) \right) \end{align}\] (recall that we use the convention \(\binom{n}{k}=0\) for \(n<k\)).

Now, let us make a simplifying assumption that \(L^0_{i,S}\) depends only on \(i\) and \(|S|\) and define \(L^0_{i,\tau} :=L^0_{i,S}\) for \(|S|=\tau\). Then \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& \frac{c_{\mathrm{ov}} \binom{b}{\tau} + \sum_{j=1}^b c_j \left( \binom{b}{\tau} - \binom{b-j}{\tau} \right) + \binom{b}{\tau} \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp}}{\min_{i\in[b]} \left\{ \sum_{\substack{S\subseteq[b]\\ |S|=\tau}} \frac{\mathbb{I}\left(i\in S\right)}{2L^0_{i,\tau}} \right\}} \\ &=& \frac{c_{\mathrm{ov}} \binom{b}{\tau} + \sum_{j=1}^b c_j \left( \binom{b}{\tau} - \binom{b-j}{\tau} \right) + \binom{b}{\tau} \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp}}{\min_{i\in[b]} \left\{ \frac{1}{2L^0_{i,\tau}} \binom{b-1}{\tau-1} \right\}} \\ &=& 2 \max_{i\in[b]} L^0_{i,\tau} \left( \frac{b}{\tau} c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \frac{b}{\tau} - \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right) + \sum_{j=1}^b c_j^{\sharp} \right). \end{align}\] Define \[\begin{align} A(\tau) :=\max_{i\in[b]} L^0_{i,\tau}, \qquad B(\tau) :=\frac{b}{\tau} c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \frac{b}{\tau} - \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right) + \sum_{j=1}^b c_j^{\sharp}. \end{align}\] Then, the objective function to be minimized is \[\begin{align} f(\tau) :=A(\tau) B(\tau). \end{align}\] By definition, \(\tau=b\) is optimal if and only if \(f(b) \leq f(\tau)\) for all \(\tau \in \{1,\dots,b-1\}\).

First, we show that \(B\) is decreasing in \(\tau\). To this end, let \(1 \leq \tau_1 < \tau_2 \leq b\) and consider the difference \[\begin{align} B(\tau_1) - B(\tau_2) &=& \left( \frac{b}{\tau_1} - \frac{b}{\tau_2} \right) c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \frac{b}{\tau_1} - \frac{\binom{b-j}{\tau_1}}{\binom{b-1}{\tau_1-1}} - \frac{b}{\tau_2} + \frac{\binom{b-j}{\tau_2}}{\binom{b-1}{\tau_2-1}} \right) \\ &=& \left( \frac{b}{\tau_1} - \frac{b}{\tau_2} \right) c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( h_j(\tau_1) - h_j(\tau_2) \right), \end{align}\] where \(h_j(\tau) :=\frac{b}{\tau} - \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}}\). Clearly, the first term is positive. Let us focus on the second term. Using Pascal’s identity, we have \[\begin{align} h_{j+1}(\tau)-h_j(\tau) &=\frac{\binom{b-j}{\tau}-\binom{b-j-1}{\tau}}{\binom{b-1}{\tau-1}} =\frac{\binom{b-j-1}{\tau-1}}{\binom{b-1}{\tau-1}}. \end{align}\] Therefore, \[\begin{align} h_j(\tau) = h_j(\tau) - h_0(\tau) = \sum_{m=0}^{j-1} \left( h_{m+1}(\tau)-h_m(\tau) \right) = \sum_{m=0}^{j-1} \frac{\binom{b-m-1}{\tau-1}}{\binom{b-1}{\tau-1}}, \end{align}\] where \[\begin{align} \frac{\binom{b-m-1}{\tau-1}}{\binom{b-1}{\tau-1}} = \frac{b - \tau}{b - m - \tau} \frac{\binom{b-m-1}{\tau}}{\binom{b-1}{\tau}} \geq \frac{\binom{b-m-1}{\tau}}{\binom{b-1}{\tau}}, \end{align}\] and the inequality is strict for \(m>0\). It follows that for any \(j \in [b]\) \[\begin{align} h_j(\tau) = \sum_{m=0}^{j-1} \frac{\binom{b-m-1}{\tau-1}}{\binom{b-1}{\tau-1}} \geq \sum_{m=0}^{j-1} \frac{\binom{b-m-1}{\tau}}{\binom{b-1}{\tau}} = h_j(\tau+1). \end{align}\] Thus, except for the trivial case when \(j=1\) (where \(h_1(\tau)\equiv1\)), the function \(h_j(\tau)\) is strictly decreasing in \(\tau\), implying that \(h_j(\tau_1) - h_j(\tau_2) > 0\), which proves that \(B(\tau_1) > B(\tau_2)\). Thus \[\begin{align} B(\tau) \geq B(b) \quad \forall \tau \leq b, \end{align}\] with strict inequality if \(c_{\mathrm{ov}} > 0\) and \(\tau < b\).

Now, note that larger \(\tau\) generally leads to larger smoothness constants, and hence one may expect \(A\) to be an increasing function of \(\tau\). If \(A(b)\) is much larger than \(A(\tau)\) for some \(\tau < b\), this growth can outweigh the decrease in \(B\), resulting in \(f(\tau) = A(\tau) B(\tau) < A(b) B(b) = f(b)\), and so \(\tau=b\) need not be optimal.
As an example, suppose that \(L^0_{i,\tau}\) scales linearly with \(\tau\), i.e, \(A(\tau) = \max_{i\in[b]} \tau L^0_{i}\) for some \(L^0_{i} \geq 0\). Then \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& \max_{i\in[b]} L^0_{i} \left( b c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( b - \tau \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right) + \tau \sum_{j=1}^b c_j^{\sharp} \right) \\ &=& \max_{i\in[b]} L^0_{i} \left( b c_{\mathrm{ov}} + b \sum_{j=1}^b c_j + \tau \sum_{j=1}^b \left( c_j^{\sharp} - c_j \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right) \right). \end{align}\] Now, consider the function \[\begin{align} \Phi(\tau) :=\tau \sum_{j=1}^b \left( c_j^{\sharp} - c_j \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right). \end{align}\] Note that \[\begin{align} \Phi(\tau+1) - \Phi(\tau) &=& (\tau+1) \sum_{j=1}^b \left( c_j^{\sharp} - c_j \frac{\binom{b-j}{\tau+1}}{\binom{b-1}{\tau}} \right) - \tau \sum_{j=1}^b \left( c_j^{\sharp} - c_j \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right) \\ &=& \sum_{j=1}^b c_j \left( \tau \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} - (\tau+1) \frac{\binom{b-j}{\tau+1}}{\binom{b-1}{\tau}} \right) + \sum_{j=1}^b c_j^{\sharp} \\ &=& \sum_{j=1}^b c_j \frac{j (b-j)! (b-\tau-1)!}{(b-1)! (b-j-\tau)!} + \sum_{j=1}^b c_j^{\sharp} \geq 0 \end{align}\] (with the convention that the right-hand side is \(0\) when \(b-j-\tau<0\)). Moreover, the increment is strictly positive if either \(\sum_{j=1}^b c_j^{\sharp}>0\) or there exists \(j\) with \(c_j>0\) and \(b-j-\tau\geq 0\). Hence \(\Phi(\tau)\) is non-decreasing in \(\tau\), and strictly increasing whenever one of these conditions holds, and the optimal choice is \(\tau^{\star}=1\).
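The linear-scaling example can be checked numerically. The Python sketch below (with hypothetical constant costs) evaluates the proportional cost expression above for each \(\tau\); consistently with the argument, the values are non-decreasing in \(\tau\), so \(\tau^\star = 1\) in this regime.

```python
from math import comb

def tau_nice_cost(tau, b, c_ov, c, c_sharp, Lmax):
    """Proportional cost under linear scaling A(tau) = tau * Lmax:
    Lmax * ( b*c_ov + sum_j c_j*(b - tau*C(b-j,tau)/C(b-1,tau-1)) + tau*sum_j c_sharp_j )."""
    ratio = lambda j: comb(b - j, tau) / comb(b - 1, tau - 1)  # comb returns 0 when b - j < tau
    return Lmax * (b * c_ov
                   + sum(c[j - 1] * (b - tau * ratio(j)) for j in range(1, b + 1))
                   + tau * sum(c_sharp))

# Hypothetical per-layer costs; b = 5 layers.
b, c_ov = 5, 1.0
c, c_sharp, Lmax = [2.0] * b, [0.5] * b, 1.0
costs = [tau_nice_cost(tau, b, c_ov, c, c_sharp, Lmax) for tau in range(1, b + 1)]
# costs is non-decreasing in tau, so tau = 1 minimizes the cost in this example.
```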

11.1.3 \(\tau\)-Submodel Sampling↩︎

Sample a starting index \(\hat{s}\in\{1,\dots,b-\tau+1\}\) with probability \(p_i = {\mathbb{P}}\left(\hat{s}=i\right)\) and set \(\hat{S}=\{\hat{s},\dots,\hat{s}+\tau-1\}\).

The denominator is \[\begin{align} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in\hat{S}\right)}{2L^0_{i,\hat{S}}}\right] = \sum_{j=1}^{b-\tau+1} p_j \frac{\mathbb{I}\left(i\in\{j,\dots,j+\tau-1\}\right)}{2L^0_{i,\{j,\dots,j+\tau-1\}}} = \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} \frac{p_j}{2L^0_{i,\{j,\dots,j+\tau-1\}}}. \end{align}\] Hence the total expected cost is proportional to \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& \frac{{\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right]}{\min_{i\in[b]} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in\hat{S}\right)}{2L^0_{i,\hat{S}}}\right]} \\ &=& \frac{c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left(\sum_{i=1}^{\min\{j,b-\tau+1\}} p_i\right) + \sum_{j=1}^b c_j^{\sharp} \left(\sum_{i=\max\{1,j-\tau+1\}}^{\min\{j,b-\tau+1\}} p_i\right)}{\min_{i\in[b]} \left\{ \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} \frac{p_j}{2L^0_{i,\{j,\dots,j+\tau-1\}}} \right\}}. \end{align}\]

11.1.3.1 Partitioned \(\tau\)-submodel sampling.

Let us consider a special case of the above sampling scheme where the submodels assigned non-zero probability partition \([b]\). For simplicity, suppose that \(b\) is divisible by \(\tau\), let \(m=\frac{b}{\tau}\), and define the block start indices via \[\begin{align} s_k=(k-1)\tau+1,\qquad k=1,\dots,m. \end{align}\] The algorithm then picks block \(B_k :=\{s_k, \dots, s_k+\tau-1\}\) with probability \(p_{s_k}> 0\) (where \(\sum_{k=1}^m p_{s_k} = 1\)). This is equivalent to the submodel sampling with starting-index distribution satisfying \[\begin{align} p_{s_k}=p_{(k-1)\tau+1} > 0 \quad (k=1,\dots,m), \qquad p_j=0 \text{ otherwise}. \end{align}\] Plugging this choice into the general submodel expressions immediately gives \[\begin{align} F_i &=& \sum_{j=1}^{\min\{i, b-\tau+1\}} p_j = \sum_{k=1}^{\min\{m, \left\lfloor \frac{(i-1)}{\tau} \right\rfloor + 1\}} p_{s_k}, \\ Q_i &=& \sum_{j=\max\{1, i-\tau+1\}}^{\min\{i, b-\tau+1\}} p_j = p_{s_{\left\lceil \frac{i}{\tau} \right\rceil}}, \\ {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in\hat{S}\right)}{2L^0_{i,\hat{S}}}\right] &=& \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} \frac{p_j}{2L^0_{i,\{j,\dots,j+\tau-1\}}} = \frac{p_{s_{\left\lceil \frac{i}{\tau} \right\rceil}}}{2L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}}. \end{align}\] Therefore the full cost reduces to \[\begin{align} \label{eq:part95tau95submodel95cost} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& \frac{c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k=1}^{\min\{m, \left\lfloor \frac{(j-1)}{\tau} \right\rfloor + 1\}} p_{s_k} \right) + \sum_{j=1}^b c_j^{\sharp} p_{s_{\left\lceil \frac{j}{\tau} \right\rceil}}}{\min_{i\in[b]} \left\{ \frac{p_{s_{\left\lceil \frac{i}{\tau} \right\rceil}}}{2L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}} \right\}} \nonumber \\ &=& \frac{c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k=1}^{\left\lceil \frac{j}{\tau} \right\rceil} p_{s_k} \right) + \sum_{j=1}^b c_j^{\sharp} p_{s_{\left\lceil \frac{j}{\tau} \right\rceil}}}{\min_{i\in[b]} \left\{ \frac{p_{s_{\left\lceil \frac{i}{\tau} \right\rceil}}}{2L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}} \right\}}. \end{align}\tag{24}\] Now, note that \[\begin{align} \sum_{j=1}^b c_j \left( \sum_{k=1}^{\left\lceil \frac{j}{\tau} \right\rceil} p_{s_k} \right) &= \sum_{j=1}^\tau c_j p_{s_1} + \sum_{j=\tau+1}^{2\tau} c_j \left( p_{s_1} + p_{s_2} \right) + \ldots + \sum_{j=(m-1)\tau+1}^{m\tau} c_j \left( p_{s_1} + \ldots + p_{s_m} \right) \\ &= p_{s_1} \sum_{j=1}^\tau c_j + \left( p_{s_1} + p_{s_2} \right) \sum_{j=\tau+1}^{2\tau} c_j + \ldots + \left( p_{s_1} + \ldots + p_{s_m} \right) \sum_{j=(m-1)\tau+1}^{m\tau} c_j \\ &= p_{s_1} \sum_{j=1}^{m \tau} c_j + p_{s_2} \sum_{j=\tau+1}^{m\tau} c_j + \ldots + p_{s_m} \sum_{j=(m-1)\tau+1}^{m\tau} c_j \end{align}\] and \[\begin{align} \sum_{j=1}^b c_j^{\sharp} p_{s_{\left\lceil \frac{j}{\tau} \right\rceil}} = \sum_{j=1}^\tau c_j^{\sharp} p_{s_1} + \sum_{j=\tau+1}^{2\tau} c_j^{\sharp} p_{s_2} + \ldots + \sum_{j=(m-1)\tau+1}^{m\tau} c_j^{\sharp} p_{s_m}. \end{align}\] Hence \[\begin{align} &\mathrm{cost}_{\varepsilon}(\mathcal{D}) \\ &\propto \frac{1}{\min_{i\in[b]} \left\{ \frac{p_{s_{\left\lceil \frac{i}{\tau} \right\rceil}}}{2L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}} \right\}} \left[c_{\mathrm{ov}} + p_{s_1} \left( \sum_{j=1}^{m \tau} c_j + \sum_{j=1}^\tau c_j^{\sharp} \right) + p_{s_2} \left( \sum_{j=\tau+1}^{m\tau} c_j + \sum_{j=\tau+1}^{2\tau} c_j^{\sharp} \right) + \ldots \right. 
\\ &\qquad\qquad\qquad\qquad\qquad\qquad\left.+ p_{s_m} \left( \sum_{j=(m-1)\tau+1}^{m\tau} c_j + \sum_{j=(m-1)\tau+1}^{m\tau} c_j^{\sharp} \right)\right] \\ &= \frac{p_{s_1} d_1 + p_{s_2} d_2 + \ldots + p_{s_m} d_m}{\min_{i\in[b]} \left\{ \frac{p_{s_{\left\lceil \frac{i}{\tau} \right\rceil}}}{2L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}} \right\}}, \end{align}\] where we used the fact that \(\sum_{j=1}^{m} p_{s_j} = 1\) and denoted \(d_i :=c_{\mathrm{ov}} + \sum_{j=(i-1)\tau+1}^{m \tau} c_j + \sum_{j=(i-1)\tau+1}^{i \tau} c_j^{\sharp}\).

11.1.3.2 Optimal probabilities for fixed \(\tau\).

We follow an approach similar to that in 11.1.1 to find the optimal probabilities for a fixed choice of \(\tau\). For \(i\in B_k\), define \[\begin{align} \delta_{i,k}:=\frac{1}{2L^0_{i,B_k}} \end{align}\] and set \[\begin{align} \delta_k^{\min}:=\min_{i\in B_k}\delta_{i,k} = \frac{1}{2\max_{i\in B_k} L^0_{i,B_k}}. \end{align}\] Then the expected cost can be represented as \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) \propto \frac{d_1 p_{s_1} + \cdots + d_m p_{s_m}}{\min_{i\in[b]} \left\{ \frac{p_{s_{\left\lceil \frac{i}{\tau} \right\rceil}}}{2L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}} \right\}} = \frac{d_1 p_{s_1} + \cdots + d_m p_{s_m}}{\min_{k\in[m]} \min_{i\in B_k} \{\delta_{i,k} p_{s_k}\}}. \end{align}\] This corresponds to the linear fractional program \[\label{eq:partitioned95submod95lfp} \begin{align} \min_{p,t}\quad & \frac{d_1 p_{s_1} + \cdots + d_m p_{s_m}}{t} \\ \text{s.t.}\quad & p_{s_1}, \ldots, p_{s_m} \geq 0, \\ & p_{s_1}+\cdots+p_{s_m} = 1, \\ & t \geq 0, \\ & t \leq \delta_{i,k} p_{s_k}, \qquad k\in[m], i\in B_k. \end{align}\tag{25}\] The standard change of variables \(q_k = p_{s_k} / t\) yields the equivalent linear program \[\label{eq:fl295partitioned} \begin{align} \min_{q} \quad & d_1 q_1 + \cdots + d_m q_m \\ \text{s.t.} \quad & q_1, \ldots, q_m \geq 0, \\ & \delta_{i,k} q_k \geq 1 \qquad k\in[m], i\in B_k. \end{align}\tag{26}\] Because each block \(k\) only appears in constraints of the form \(\delta_{i,k} q_k \geq 1\) (for \(i\in B_k\)), the constraints for block \(k\) reduce to the single constraint \[\begin{align} \delta_k^{\min} q_k \geq 1 \quad\Longleftrightarrow\quad q_k \geq \frac{1}{\delta_k^{\min}}. \end{align}\] Hence 26 is separable and its optimal solution is \[\begin{align} q_k^\star = \frac{1}{\delta_k^{\min}}, \qquad k\in[m], \end{align}\] with optimal objective value \[\begin{align} \sum_{k=1}^m d_k q_k^\star = \sum_{k=1}^m \frac{d_k}{\delta_k^{\min}}. \end{align}\]

Now, let us introduce dual variables \(\lambda_i\ge0\) for the constraints \(\delta_{i,k} q_k \geq 1\), \(i\in B_k\). The dual of 26 is \[\begin{align} \max_{\lambda}\quad & \lambda_1 + \ldots + \lambda_b \\ \text{s.t.}\quad & \lambda_1, \ldots, \lambda_b \geq 0, \\ & d_k \geq \sum_{i\in B_k} \lambda_i \delta_{i,k}, \qquad k\in[m]. \end{align}\] To certify optimality of \(q^\star\), for each block \(k\), choose an index \(i_k\in B_k\) attaining the minimum \(\delta_k^{\min}\) and set \[\begin{align} \lambda_{i_k} = \frac{d_k}{\delta_{i_k,k}}, \qquad \lambda_i=0 \text{ for } i\not\in\{i_1,\dots,i_m\}. \end{align}\] Then for each block \(k\), \[\begin{align} \sum_{i\in B_k} \delta_{i,k} \lambda_i = \delta_{i_k,k} \lambda_{i_k} = d_k, \end{align}\] so the dual constraints hold with equality and the dual objective equals \[\begin{align} \sum_{i=1}^b \lambda_i = \sum_{k=1}^m \frac{d_k}{\delta_k^{\min}}, \end{align}\] matching the primal objective. By strong duality \(q^\star\) is optimal.

From \(q_k^\star=\frac{1}{\delta_k^{\min}}\) and \(p_{s_k} = t q_k = q_k / \sum_{l=1}^m q_l\) we obtain the optimal block probabilities \[\begin{align} p_{s_k}^\star = \frac{\frac{1}{\delta_k^{\min}}}{\sum_{l=1}^m \frac{1}{\delta_l^{\min}}} = \frac{\max_{i\in B_k} L^0_{i,B_k}}{\sum_{l=1}^m \max_{i\in B_l} L^0_{i,B_l}}, \end{align}\] so that each block’s probability is proportional to the worst-case local smoothness constant inside that block. For this choice, the minimal expected cost is \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) \propto \sum_{k=1}^m \frac{d_k}{\delta_k^{\min}} = 2\sum_{k=1}^m d_k \max_{i\in B_k} L^0_{i,B_k}. \end{align}\]
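In code, the optimal block probabilities and the resulting cost are one line each. The sketch below uses hypothetical block-wise smoothness constants and block cost constants \(d_k\) (which would be computed from \(c_{\mathrm{ov}}\), \(c_j\), and \(c_j^{\sharp}\) as in the text).

```python
def optimal_block_probabilities(L0_blocks):
    """
    L0_blocks[k] lists L^0_{i,B_k} for the layers i in block B_k.
    Returns p_{s_k}^star proportional to max_{i in B_k} L^0_{i,B_k}.
    """
    maxima = [max(block) for block in L0_blocks]
    total = sum(maxima)
    return [m / total for m in maxima]

def partitioned_cost(d, L0_blocks):
    """Minimal expected cost (up to the proportionality constant): 2 * sum_k d_k * max_{i in B_k} L^0_{i,B_k}."""
    return 2.0 * sum(dk * max(block) for dk, block in zip(d, L0_blocks))

# Hypothetical example: b = 6 layers split into m = 3 consecutive blocks of size tau = 2.
L0_blocks = [[1.0, 0.8], [0.6, 0.9], [0.4, 0.3]]
d = [10.0, 7.0, 4.0]
p_star = optimal_block_probabilities(L0_blocks)
cost = partitioned_cost(d, L0_blocks)
```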

11.1.3.3 Choosing \(\tau\).

We now show that the cost is not necessarily minimized by choosing \(\tau = b\). To this end, we want to minimize the function \[\begin{align} \Phi(\tau) :=2\sum_{k=1}^{m} d_k(\tau) \max_{i\in B_k} L^0_{i,B_k} \end{align}\] (where we explicitly emphasize the dependence of \(d_k\) on \(\tau\)).

To gain intuition about which extreme (\(\tau = 1\) or \(\tau = b\)) may be preferable, assume that the costs are constant, i.e., \[\begin{align} c_j \equiv c, \qquad c_j^\sharp \equiv c^\sharp, \end{align}\] in which case \[\begin{align} \sum_{k=1}^m d_k(\tau) &= \sum_{k=1}^m \left( c_{\mathrm{ov}} + \sum_{j=(k-1)\tau+1}^{m \tau} c + \sum_{j=(k-1)\tau+1}^{k\tau} c^{\sharp} \right) \\ &= \sum_{k=1}^m \left( c_{\mathrm{ov}} + c \left( m \tau - (k-1) \tau \right) + c^{\sharp} \left( k\tau - (k-1)\tau \right) \right) \\ &= m c_{\mathrm{ov}} + c \tau \frac{m(m+1)}{2} + m \tau c^\sharp. \end{align}\] Suppose in addition that the worst-case local smoothness per block does not depend on the block index, i.e., \[\begin{align} \max_{i\in B_k} L^0_{i,B_k} \equiv L^0(\tau) \quad \forall k\in[m] \end{align}\] for some non-decreasing function \(L^0\). Then \[\begin{align} \Phi(\tau) = 2 L^0(\tau)\sum_{k=1}^{m} d_k(\tau) = 2 L^0(\tau) \left( \frac{b}{\tau} \left( c_{\mathrm{ov}}+\frac{bc}{2} \right) + \frac{bc}{2} + b c^\sharp \right). \end{align}\] Thus, the \(\tau\)-dependence of \(\Phi\) is the product of a non-decreasing factor \(L^0(\tau)\) and a factor that decreases like \(1/\tau\) plus additive constants. Consequently:

  • If \(L^0(\tau)\) is constant in \(\tau\) (no worsening with larger blocks), then \(\Phi(\tau)\) is strictly decreasing in \(\tau\) and the minimizer is \(\tau^\star = b\).

  • If \(L^0(\tau)\) grows sublinearly, then \(L^0(\tau) \frac{b}{\tau} \left( c_{\mathrm{ov}}+\frac{bc}{2} \right)\) is decreasing in \(\tau\), while \(L^0(\tau) \left( \frac{c b}{2} + b c^\sharp \right)\) is increasing. Hence, the optimal \(\tau\) may lie strictly between \(1\) and \(b\), depending on the relative magnitudes of the costs.

  • If \(L^0(\tau)\) increases at least linearly in \(\tau\), then \(\Phi(\tau)\) is increasing in \(\tau\), and hence \(\tau^\star = 1\).

11.1.4 Arbitrary Submodel Sampling↩︎

Let \(\{B_1,\dots,B_m\}\) be a partition of \([b]\). Set \(\hat{S} = B_i\) with probability \(p_i\) (where \(\sum_{i=1}^m p_i = 1\)). For \(j\in[b]\), \(k(j)\) denotes the unique block with \(j\in B_{k(j)}\) and \(\underline{b}_k :=\min B_k\).

First, note that the cost can be expressed block-wise as \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] &= c_{\mathrm{ov}} + \sum_{i=1}^b c_i \left( \sum_{k:\underline{b}_k \leq i} p_k \right) + \sum_{i=1}^b c_i^{\sharp} p_{k(i)} \\ &= c_{\mathrm{ov}} + \sum_{k=1}^m \sum_{i \geq \underline{b}_k} c_i p_k + \sum_{k=1}^m \sum_{i\in B_k} c_i^{\sharp} p_{k} \\ &= \sum_{k=1}^m d_k p_k, \end{align}\] where \(d_k :=c_{\mathrm{ov}} + \sum_{j\geq \underline{b}_k} c_j + \sum_{j\in B_k} c_j^{\sharp}\). We also have \[\begin{align} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in \hat{S}\right)}{2 L^0_{i,\hat{S}}}\right] = \frac{p_{k(i)}}{2 L^0_{i,B_{k(i)}}}, \end{align}\] and hence \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) \propto \frac{{\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right]}{\min_{i\in[b]} {\mathbb{E}}\left[\frac{\mathbb{I}\left(i\in\hat{S}\right)}{2L^0_{i,\hat{S}}}\right]} = \frac{\sum_{k=1}^m d_k p_k}{\min_{k\in[m]} \min_{i\in B_k} \left\{ \frac{p_k}{2 L^0_{i,B_k}} \right\}}. \end{align}\] Now, define \[\begin{align} \delta_{i,k} :=\tfrac{1}{2 L^0_{i,B_k}}, \qquad \delta_k^{\min} :=\min_{i\in B_k} \delta_{i,k}. \end{align}\] By the same linear fractional reduction as in 25 , the unique optimal solution is \[\begin{align} p_k^\star = \frac{\frac{1}{\delta_k^{\min}}}{\sum_{l=1}^m \frac{1}{\delta_l^{\min}}} = \frac{\max_{i\in B_k} L^0_{i,B_k}}{\sum_{l=1}^m \max_{i\in B_l} L^0_{i,B_l}}. \end{align}\] With this choice, the objective is \[\begin{align} \label{eq:cost95arbitrary95submod} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& 2 \sum_{k=1}^m d_k \max_{i\in B_k} L^0_{i,B_k} = 2\sum_{k=1}^m \left( c_{\mathrm{ov}} + \sum_{j\geq \underline{b}_k} c_j + \sum_{j\in B_k} c_j^{\sharp} \right) \max_{i\in B_k} L^0_{i,B_k}. \end{align}\tag{27}\]

Let us compare it with the cost for full model training \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\textrm{full}}(K)\right] \propto 2 \max_{i\in[b]} L^0_{i,[b]} \left( c_{\mathrm{ov}} + \sum_{j=1}^b \left( c_j + c_j^{\sharp} \right) \right). \end{align}\]

Example 1 (Grouping layers by cost similarity). Suppose that the layers are partitioned into groups according to their similarity, so that within each block, all per-layer costs are (approximately) the same. Concretely, for every \(k\in[m]\) and every \(j\in B_k\) assume that \[\begin{align} c_j \equiv \underline{c}_k, \qquad c_j^{\sharp}\equiv \underline{c}_k^{\sharp}. \end{align}\] Then \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) &\propto& 2\sum_{k=1}^m \left( c_{\mathrm{ov}} + \sum_{j\geq \underline{b}_k} c_j + \sum_{j\in B_k} c_j^{\sharp} \right) \max_{i\in B_k} L^0_{i,B_k} \\ &=& 2 \sum_{k=1}^m \left( c_{\mathrm{ov}} + \sum_{j\geq \underline{b}_k} c_j + |B_k| \underline{c}_k^{\sharp} \right) \max_{i\in B_k} L^0_{i,B_k} \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\textrm{full}}(K)\right] \propto 2 \max_{i\in[b]} L^0_{i,[b]} \left( c_{\mathrm{ov}} + \sum_{l=1}^m |B_l| \left( \underline{c}_l + \underline{c}_l^{\sharp} \right) \right). \end{align}\] Therefore, partitioned arbitrary submodel sampling with the optimal probabilities is better than full model training if and only if \[\begin{align} \label{eq:partition95better95explicit} \sum_{k=1}^m \max_{i\in B_k} L^0_{i,B_k} \left( c_{\mathrm{ov}} + \sum_{j\geq \underline{b}_k} c_j + |B_k| \underline{c}_k^{\sharp} \right) \leq \max_{i\in[b]} L^0_{i,[b]} \left( c_{\mathrm{ov}} + \sum_{k=1}^m |B_k| \left( \underline{c}_k + \underline{c}_k^{\sharp} \right) \right). \end{align}\tag{28}\] If each block is much better conditioned than the full model, i.e. \(\max_{i\in B_k} L^0_{i,B_k} \ll \max_{i\in[b]} L^0_{i,[b]}\) for all \(k\), then the left side of 28 can be much smaller than the right side and partitioned submodel sampling will be advantageous even if the block-wise tail sums are moderately large. Indeed, if \(\max_{i\in B_k} L^0_{i,B_k} \approx \frac{1}{m} \max_{i\in[b]} L^0_{i,[b]}\), then \[\begin{align} &&\sum_{k=1}^m \max_{i\in B_k} L^0_{i,B_k} \left( c_{\mathrm{ov}} + \sum_{j\geq \underline{b}_k} c_j + |B_k| \underline{c}_k^{\sharp} \right) \\ &\approx& \frac{1}{m} \max_{i\in[b]} L^0_{i,[b]} \sum_{k=1}^m \left( c_{\mathrm{ov}} + \sum_{j\geq \underline{b}_k} c_j + |B_k| \underline{c}_k^{\sharp} \right) \\ &=& \max_{i\in[b]} L^0_{i,[b]} \left( c_{\mathrm{ov}} + \frac{1}{m} \sum_{k=1}^m \sum_{j\geq \underline{b}_k} c_j + \frac{1}{m} \sum_{k=1}^m |B_k| \underline{c}_k^{\sharp} \right), \end{align}\] which improves upon full model training if \(\frac{1}{m} \sum_{k=1}^m \sum_{j\geq \underline{b}_k} c_j + \frac{1}{m} \sum_{k=1}^m |B_k| \underline{c}_k^{\sharp} \leq \sum_{k=1}^m |B_k| \left( \underline{c}_k + \underline{c}_k^{\sharp} \right)\).
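Condition 28 is easy to test numerically for a given partition. The Python sketch below (with hypothetical block-wise constants) evaluates both sides under the cost-similarity assumption of Example 1, using 0-based layer indices.

```python
def partition_beats_full(blocks, c_block, c_sharp_block, L0_block_max, L0_full_max, c_ov):
    """
    Check condition (28): partitioned submodel sampling with the optimal probabilities
    is cheaper than full-model training.
    blocks           : list of lists of 0-based layer indices B_k (a partition of [b])
    c_block[k]       : common per-layer gradient cost inside B_k
    c_sharp_block[k] : common per-layer sharp-operator cost inside B_k
    L0_block_max[k]  : max_{i in B_k} L^0_{i, B_k}
    L0_full_max      : max_{i in [b]} L^0_{i, [b]}
    """
    b = sum(len(B) for B in blocks)
    c = [None] * b                       # per-layer gradient costs reconstructed block-wise
    for k, B in enumerate(blocks):
        for j in B:
            c[j] = c_block[k]
    lhs = sum(L0_block_max[k] * (c_ov + sum(c[j] for j in range(min(B), b)) + len(B) * c_sharp_block[k])
              for k, B in enumerate(blocks))
    rhs = L0_full_max * (c_ov + sum(len(B) * (c_block[k] + c_sharp_block[k]) for k, B in enumerate(blocks)))
    return lhs <= rhs, lhs, rhs

# Hypothetical example: 6 layers, 3 blocks of 2 consecutive layers each.
blocks = [[0, 1], [2, 3], [4, 5]]
ok, lhs, rhs = partition_beats_full(blocks, c_block=[2.0, 2.0, 1.0], c_sharp_block=[0.5, 0.5, 0.5],
                                    L0_block_max=[0.5, 0.6, 0.4], L0_full_max=2.0, c_ov=1.0)
```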

Example 2 (Grouping layers across transformer blocks). In their general forms, 27 and 28 are hard to interpret. Let us consider a simplified language-model example that motivates this sampling strategy. Suppose the network is composed of \(T\) identical transformer blocks, each containing \(L\) layers, so that \(b = T L\), with the natural block structure: layer index \(j\in[b]\) corresponds to position \(l\in\{1,\dots,L\}\) inside transformer block \(t\in[T]\) via \(j = l + (t-1) L\).

Consider the partition that groups together same-position layers across transformer blocks: \[\begin{align} B_l = \{l, l+L, l+2L, \dots, l+(T-1)L\}, \qquad \forall l\in[L], \end{align}\] so \(m=L\) and \(|B_l| = T\) for every \(l\). For each block \(B_l\) we have \(\underline{b}_{l}=\min B_l = l\).

In this setting, for block \(B_l\) the tail sum that appears in \(d_l\) simplifies to \[\begin{align} \sum_{j\geq \underline b_l} c_j = \sum_{j=l}^{b} c_j. \end{align}\] Note that we grouped the layers in such a way that each block contains layers of the same type, and hence we may expect the per-layer costs within a block to be roughly the same. That is, we assume that \[\begin{align} c_{l + tL} \equiv \underline{c}_l, \qquad c_{l + tL}^{\sharp}\equiv \underline{c}_l^{\sharp}, \qquad \forall l\in[L],\;t\in\{0,\ldots,T-1\}, \end{align}\] in which case the tail sum further simplifies to \[\begin{align} \sum_{j=l}^{b} c_j = \sum_{j=l}^{L} \underline{c}_j + (T-1) \sum_{j=1}^{L} \underline{c}_j. \end{align}\] Hence, under the periodic-cost assumption, the block-wise cost constants become \[\begin{align} d_l = c_{\mathrm{ov}} + \sum_{j\geq \underline b_l} c_j + \sum_{j\in B_l} c_j^{\sharp} = c_{\mathrm{ov}} + \sum_{j=l}^{L} \underline{c}_j + (T-1) \sum_{j=1}^{L} \underline{c}_j + T \underline{c}_l^{\sharp}. \end{align}\]

If the local smoothness constants within a block are also approximately homogeneous, i.e., \[\begin{align} L^0_{i,B_l} \equiv L^0_l \qquad \forall i\in B_l, \end{align}\] then \[\begin{align} \max_{i\in B_l} L^0_{i,B_l} = L^0_l, \end{align}\] and the previously derived expected cost (for the optimal block probabilities) reduces to \[\begin{align} \mathrm{cost}_{\varepsilon}(\mathcal{D}) \propto 2 \sum_{l=1}^L L^0_l \left( c_{\mathrm{ov}} + \sum_{j=l}^{L} \underline{c}_j + (T-1) \sum_{j=1}^{L} \underline{c}_j + T \underline{c}_l^{\sharp} \right). \end{align}\]

Under the same periodicity assumptions, the full-model expected cost is \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\textrm{full}}(K)\right] \propto 2 \max_{i\in[b]} L^0_{i,[b]} \left( c_{\mathrm{ov}} + T \sum_{j=1}^L \underline c_j + T \sum_{j=1}^L \underline c_j^\sharp \right). \end{align}\] Comparing the two, partitioned sampling is better if \[\begin{align} \label{eq:compare95exact} \sum_{l=1}^L L^0_l \left( c_{\mathrm{ov}} + \sum_{j=l}^{L} \underline{c}_j + (T-1) \sum_{j=1}^{L} \underline{c}_j + T \underline{c}_l^{\sharp} \right) \leq \max_{i\in[b]} L^0_{i,[b]} \left( c_{\mathrm{ov}} + T \sum_{j=1}^L \underline c_j + T \sum_{j=1}^L \underline c_j^\sharp \right). \end{align}\tag{29}\] Now, let us denote \[\begin{align} \underline{C} :=\sum_{j=1}^L \underline{c}_j,\qquad \underline{C}^{\sharp} :=\sum_{j=1}^L \underline c_j^\sharp. \end{align}\] Then 29 is equivalent to \[\begin{align} \label{eq:compare95rewrite} \sum_{l=1}^L L^0_l \left( c_{\mathrm{ov}} + \sum_{j=l}^L \underline{c}_j \right) + T \sum_{l=1}^L L^0_l \left( \frac{T-1}{T} \underline{C} + \underline{c}_l^{\sharp} \right) \leq T \left( \underline{C} + \underline{C}^{\sharp} \right) \max_{i\in[b]} L^0_{i,[b]} + c_{\mathrm{ov}} \max_{i\in[b]} L^0_{i,[b]}. \end{align}\tag{30}\] For large \(T\), the dominant terms are those multiplied by \(T\). Then, the leading-order criterion becomes \[\begin{align} \label{eq:compare95asymp} \sum_{l=1}^L L^0_l \left( \underline{C} + \underline{c}_l^\sharp \right) \leq \max_{i\in[b]} L^0_{i,[b]}(\underline{C} + \underline{C}^{\sharp}), \end{align}\tag{31}\] which can hold when \(L^0_l \ll \max_{i\in[b]} L^0_{i,[b]}\).
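
As a quick numerical illustration of the leading-order criterion 31, the sketch below (with entirely hypothetical per-position costs and smoothness constants) compares its two sides for the grouping-by-position partition.

```python
import numpy as np

# Hypothetical per-position (within a transformer block) quantities; illustration only.
L = 4                                          # layers per transformer block
c_pos = np.array([3.0, 3.0, 1.0, 1.0])         # underline{c}_l
c_sharp_pos = np.array([0.5, 0.5, 0.2, 0.2])   # underline{c}_l^sharp
L0_pos = np.array([2.0, 1.5, 1.0, 0.5])        # L^0_l for each position group B_l
L0_full = 8.0                                  # max_{i in [b]} L^0_{i,[b]}

C = c_pos.sum()                # underline{C}
C_sharp = c_sharp_pos.sum()    # underline{C}^sharp

# Leading-order criterion (31): grouping across blocks wins (for large T) iff LHS <= RHS.
lhs = np.sum(L0_pos * (C + c_sharp_pos))
rhs = L0_full * (C + C_sharp)
print(f"LHS = {lhs:.2f}, RHS = {rhs:.2f}, grouping favorable: {lhs <= rhs}")
```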

11.2 Generalized Smooth Case↩︎

According to 10, under 3, 1 run with stepsizes \(\gamma_i^k = \big(L_{i,S^k}^0 + L_{i,S^k}^1 \left\| \nabla_i f(X^k) \right\|_{(i) \star}\big)^{-1}\) guarantees that \[\begin{align} \label{eq:avndion} \min_{k=0,\ldots,K-1} \sum_{i=1}^b \left[\frac{w_i}{\frac{1}{b} \sum_{l=1}^b w_l} {\mathbb{E}}\left[\left\| \nabla _i f(X^k) \right\|_{(i) \star}\right]\right] \leq \underline{\varepsilon}, \end{align}\tag{32}\] after \[\begin{align} \label{eq:oinqvr} K &=& \left\lceil \frac{2 \delta^0 \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right) {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right]}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2}}{\underline{\varepsilon}^2 \left( \frac{1}{b} \sum_{l=1}^b \frac{{\mathbb{P}}\left(l\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{l,\hat{S}}\right\vert l\in \hat{S}\right]} \right)^2} + \frac{2 \delta^0}{\underline{\varepsilon} \left( \frac{1}{b} \sum_{l=1}^b \frac{{\mathbb{P}}\left(l \in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{l,\hat{S}}\right\vert l\in \hat{S}\right]} \right)} \right\rceil \end{align}\tag{33}\] iterations, where \(\delta^0 :=f(X^0) - f^{\star}\) and \(w_i :=\frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]}\). To obtain a guarantee on an unweighted gradient sum, note that \[\begin{align} \underline{\varepsilon} &\geq& \min_{k=0,\ldots,K-1} \sum_{i=1}^b \left[\frac{w_i}{\frac{1}{b} \sum_{l=1}^b w_l} {\mathbb{E}}\left[\left\| \nabla _i f(X^k) \right\|_{(i) \star}\right]\right] \\ &\geq& \frac{\min_{i\in[b]} w_i}{\frac{1}{b} \sum_{l=1}^b w_l} \min_{k=0,\ldots,K-1} \sum_{i=1}^b {\mathbb{E}}\left[\left\| \nabla _i f(X^k) \right\|_{(i) \star}\right], \end{align}\] and hence, substituting in 32 and 33 , we have \[\begin{align} \min_{k=0,\ldots,K-1} \sum_{i=1}^b {\mathbb{E}}\left[\left\| \nabla _i f(X^k) \right\|_{(i) \star}\right] \leq \frac{\underline{\varepsilon} \frac{1}{b} \sum_{l=1}^b w_l}{\min_{i\in[b]} w_i} :=\varepsilon \end{align}\] after \[\begin{align} K &=& \left\lceil \frac{2 \delta^0 \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right) {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right]}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2}}{\varepsilon^2 \min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} \right)^2} + \frac{2 \delta^0}{\varepsilon \min_{i\in[b]} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]}} \right\rceil \end{align}\] iterations. Moreover, recall from 16 that the expected cost of a single iteration is \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{i=1}^b c_i F_i + \sum_{i=1}^b c_i^{\sharp} Q_i. 
\end{align}\] Hence, using the fact that \[\begin{align} {\mathbb{E}}\left[\left.L^{\alpha}_{i,\hat{S}}\right\vert i\in \hat{S}\right] = \frac{{\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}{{\mathbb{P}}\left(i\in \hat{S}\right)} \end{align}\] for \(\alpha \in \{0,1\}\), the expected cost of the entire optimization procedure is \[\begin{align} &\mathrm{cost}_{\varepsilon}(\mathcal{D}) = K \times {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] \\ &= \left\lceil \frac{2 \delta^0 \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right) {\mathbb{E}}\left[\left.L^0_{i,\hat{S}}\right\vert i\in \hat{S}\right]}{\left( {\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right] \right)^2}}{\varepsilon^2 \min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]} \right)^2} + \frac{2 \delta^0}{\varepsilon \min_{i\in[b]} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)}{{\mathbb{E}}\left[\left.L^1_{i,\hat{S}}\right\vert i\in \hat{S}\right]}} \right\rceil \left( c_{\mathrm{ov}} + \sum_{i=1}^b c_i F_i + \sum_{i=1}^b c_i^{\sharp} Q_i \right) \\ &= \left\lceil \frac{2 \delta^0 \sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2 {\mathbb{E}}\left[L^0_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}{\left( {\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] \right)^2}}{\varepsilon^2 \min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]} \right)^2} + \frac{2 \delta^0}{\varepsilon \min_{i\in[b]} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}} \right\rceil \left( c_{\mathrm{ov}} + \sum_{i=1}^b c_i F_i + \sum_{i=1}^b c_i^{\sharp} Q_i \right). \end{align}\] We will consider two regimes.

(1) The \(\mathcal{O}\left( \frac{1}{\varepsilon^2} \right)\) term dominates. Then the problem to be solved is \[\begin{align} \label{eq:cost95opt95prob95e2} \min_{\mathcal{D}: \mathfrak{P}([b]) \to [0,1], \sum_{S\subseteq[b]} \mathcal{D}(S) = 1} \underbrace{\frac{\sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2 {\mathbb{E}}\left[L^0_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}{\left( {\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] \right)^2}}{\min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]} \right)^2} \left( c_{\mathrm{ov}} + \sum_{i=1}^b c_i F_i + \sum_{i=1}^b c_i^{\sharp} Q_i \right)}_{\propto \, {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right]}. \end{align}\tag{34}\]

(2) The \(\mathcal{O}\left( \frac{1}{\varepsilon} \right)\) term dominates. Then the problem to be solved is \[\begin{align} \label{eq:cost95opt95prob95e} \min_{\mathcal{D}: \mathfrak{P}([b]) \to [0,1], \sum_{S\subseteq[b]} \mathcal{D}(S) = 1} \underbrace{\frac{1}{\min_{i\in[b]} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}} \left( c_{\mathrm{ov}} + \sum_{i=1}^b c_i F_i + \sum_{i=1}^b c_i^{\sharp} Q_i \right)}_{\propto \, {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right]}. \end{align}\tag{35}\] As in 11.1, both tasks above involve optimization over probability distributions, which is intractable in general. Again, for certain parametric families, the objective simplifies to a linear-fractional program, which can be solved efficiently. Let us consider some specific examples.

11.2.1 Randomized Progressive Training↩︎

Sample \(\hat{s}\in\{1,\dots,b\}\), where \(p_i = {\mathbb{P}}\left(\hat{s}=i\right)\), and set \(\hat{S}=\{\hat{s},\dots,b\}\).

In this case, \(\hat{S} = \{s,\ldots,b\}\) with probability \(p_s\) for all \(s\in[b]\), and hence \[\begin{align} {\mathbb{P}}\left(i\in \hat{S}\right) = {\mathbb{P}}\left(s \leq i\right) = \sum_{s=1}^i p_s, \end{align}\] and \[\begin{align} {\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] = \sum_{s=1}^i p_s L^{\alpha}_{i, \{s,\dots,b\}} \end{align}\] for \(\alpha\in\{0,1\}\). Moreover, following the same steps as in 11.1.1, \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{j=1}^b (c_j + c_j^{\sharp}) \sum_{i=1}^j p_i = \sum_{i=1}^b \left[c_{\mathrm{ov}}+\sum_{j\geq i} (c_j + c_j^{\sharp})\right] p_i. \end{align}\] Hence, substituting into 34 and 35 , the respective optimization problems reduce to minimizing \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right] &\propto& \frac{\sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2 {\mathbb{E}}\left[L^0_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}{\left( {\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] \right)^2}}{\min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]} \right)^2} \left( \sum_{i=1}^b \left[c_{\mathrm{ov}}+\sum_{j\geq i} (c_j + c_j^{\sharp})\right] p_i \right) \\ &=& \frac{\sum_{i=1}^b \frac{\left( \sum_{s=1}^i p_s \right)^2 \sum_{s=1}^i p_s L^0_{i, \{s,\dots,b\}}}{\left( \sum_{s=1}^i p_s L^1_{i, \{s,\dots,b\}} \right)^2}}{\min_{i\in[b]} \left( \frac{\left( \sum_{s=1}^i p_s \right)^2}{\sum_{s=1}^i p_s L^1_{i, \{s,\dots,b\}}} \right)^2} \left( \sum_{i=1}^b \left[c_{\mathrm{ov}}+\sum_{j\geq i} (c_j + c_j^{\sharp})\right] p_i \right) \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right] &\propto& \frac{1}{\min_{i\in[b]} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}} \left( \sum_{i=1}^b \left[c_{\mathrm{ov}}+\sum_{j\geq i} (c_j + c_j^{\sharp})\right] p_i \right) \\ &=& \frac{1}{\min_{i\in[b]} \frac{\left( \sum_{s=1}^i p_s \right)^2}{\sum_{s=1}^i p_s L^1_{i, \{s,\dots,b\}}}} \left( \sum_{i=1}^b \left[c_{\mathrm{ov}}+\sum_{j\geq i} (c_j + c_j^{\sharp})\right] p_i \right). \end{align}\]
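
For a sanity check of these RPT expressions, here is a minimal Python sketch (hypothetical costs and \(L^1\) constants; helper names ours) that evaluates the inclusion probabilities, the weighted expectations, and the expected per-iteration cost for a given starting-index distribution \(p\).

```python
import numpy as np

b = 5
p = np.array([0.4, 0.1, 0.1, 0.2, 0.2])        # starting-index probabilities, sum to 1
c = np.array([2.0, 2.0, 1.0, 1.0, 0.5])        # hypothetical forward/backward costs c_j
c_sharp = np.array([0.3] * b)                  # hypothetical update costs c_j^sharp
c_ov = 1.0
# Hypothetical L^1_{i, {s,...,b}} constants, indexed [i, s] (only s <= i is used).
L1 = np.abs(np.random.default_rng(0).normal(1.0, 0.3, size=(b, b)))

inclusion = np.cumsum(p)                       # P(i in S_hat) = sum_{s <= i} p_s
# E[L^1_{i,S_hat} I(i in S_hat)] = sum_{s <= i} p_s L^1_{i,{s,...,b}}
E_L1_ind = np.array([np.dot(p[: i + 1], L1[i, : i + 1]) for i in range(b)])
# Expected per-iteration cost: sum_i [c_ov + sum_{j >= i}(c_j + c_j^sharp)] p_i
d = c_ov + np.array([(c + c_sharp)[i:].sum() for i in range(b)])
expected_cost = np.dot(d, p)

print("P(i in S):", np.round(inclusion, 3))
print("E[L1 * I]:", np.round(E_L1_ind, 3))
print("E[cost]  :", round(expected_cost, 3))
```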

Let us first focus on \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right]\), following reasoning similar to that in 11.1.1. Denote \(d_i :=c_{\mathrm{ov}}+\sum_{j\geq i} (c_j + c_j^{\sharp})\) for \(i\in [b]\) and consider the linear-fractional program \[\label{eq:fl1l0l1} \begin{align} \min_{p, t} \quad & \frac{d_1 p_1+\cdots+d_b p_b}{t} \\ \text{s.t.} \quad & p_1,\ldots,p_b\geq 0\\ & p_1+\cdots+p_b=1\\ & t\geq 0 \\ & t\leq \frac{\left( \sum_{s=1}^i p_s \right)^2}{\sum_{s=1}^i p_s L^1_{i, \{s,\dots,b\}}}, \quad i \in [b]. \end{align}\tag{36}\] This program can be written equivalently as \[\label{eq:fl2l0l1} \begin{align} \min_{q} \quad & {d_1 q_1+\cdots+d_b q_b} \\ \text{s.t.} \quad & q_1,\ldots,q_b\geq 0\\ & g_i(q) :=\left( \sum_{s=1}^i q_s \right)^2 - \sum_{s=1}^i q_s L^1_{i, \{s,\dots,b\}} \geq 0, \quad i \in [b]. \end{align}\tag{37}\] We now write the KKT conditions for 37 . Introduce multipliers \(\lambda_i\ge0\) for the constraints \(g_i(q)\ge0\) and multipliers \(\eta_k\ge0\) for the non-negativity constraints \(q_k\ge0\). The Lagrangian is \[\begin{align} \mathcal{L}(q,\lambda,\eta) = \sum_{i=1}^b d_i q_i - \sum_{i=1}^b \lambda_i g_i(q) - \sum_{k=1}^b \eta_k q_k. \end{align}\] Fix \(k\in[b]\). Differentiating \(g_i\) with respect to \(q_k\) yields \[\label{eq:dgi95dqk} \frac{\partial g_i(q)}{\partial q_k} = \begin{cases} 2 \sum_{s=1}^i q_s - L^1_{i,\{k,\dots,b\}}, & k\leq i,\\[4pt] 0, & k>i. \end{cases}\tag{38}\] Thus, stationarity \(\nabla_q \mathcal{L}(q,\lambda,\eta)=0\) yields, component-wise for \(k\in[b]\), \[\begin{align} \label{eq:stationarity} 0 = d_k - \sum_{i=k}^b \lambda_i \left( 2 \sum_{s=1}^i q_s - L^1_{i,\{k,\dots,b\}} \right) - \eta_k, \end{align}\tag{39}\] and complementarity and sign conditions are \[\begin{align} \lambda_i g_i(q) = 0, \qquad \eta_i q_i = 0, \qquad \lambda_i, \eta_i \geq 0, \qquad i\in[b]. \end{align}\]
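
While the analysis below proceeds via the KKT conditions, program 37 can also be handed to a generic nonlinear solver. The sketch below is our illustration (not part of the method), using scipy.optimize.minimize on hypothetical \(d_i\) and \(L^1_{i,\{s,\dots,b\}}\) constants; a solution of 36 is recovered via \(p = q/\sum_s q_s\) and \(t = 1/\sum_s q_s\).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
b = 4
d = np.array([6.0, 4.0, 2.5, 1.0])                      # d_i = c_ov + sum_{j>=i}(c_j + c_j^sharp)
L1 = np.abs(rng.normal(1.0, 0.4, size=(b, b)))          # hypothetical L^1_{i,{s,...,b}}, indexed [i, s]

def objective(q):
    return d @ q

def g(q, i):
    # g_i(q) = (sum_{s<=i} q_s)^2 - sum_{s<=i} q_s L^1_{i,{s,...,b}} >= 0
    s = q[: i + 1].sum()
    return s * s - np.dot(q[: i + 1], L1[i, : i + 1])

constraints = [{"type": "ineq", "fun": g, "args": (i,)} for i in range(b)]
bounds = [(0.0, None)] * b
# Note: q = 0 is a degenerate feasible point of (37) (roughly, t -> infinity in (36));
# starting from a strictly positive feasible q keeps the solver on the nontrivial branch.
q0 = np.full(b, L1.max())

res = minimize(objective, q0, bounds=bounds, constraints=constraints, method="SLSQP")
q = res.x
t = 1.0 / q.sum()
p = q * t                                                # probabilities of program (36)
print("q* =", np.round(q, 4))
print("p* =", np.round(p, 4), " t* =", round(t, 4))
```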

Lemma 3. Let \(q^\star = (q_1^\star, \ldots, q_b^\star)\) be a local minimizer of 37 and let \(\textrm{supp}(q^\star):=\{k:q^\star_k>0\}\). Then, for every index \(k\in\textrm{supp}(q^\star)\), there exists at least one index \(i\) with \(i\geq k\) such that \(\lambda_i>0\). In particular, not all \(\lambda_i\) can be zero.

Proof. Fix \(k\in\textrm{supp}(q^\star)\). If \(\lambda_i=0\) for all \(i\geq k\), then 39 at \(k\) reduces to \[\begin{align} 0 = d_k - \eta_k. \end{align}\] By complementary slackness \(\eta_k q_k^\star = 0\), and since \(q_k^\star>0\), we must have \(\eta_k=0\), so \(d_k=0\) as well. But by definition \(d_k = c_{\mathrm{ov}}+\sum_{j\geq k}(c_j+c_j^\sharp)\), which is (under the model assumptions on the costs) strictly positive. Hence there must exist \(i\geq k\) with \(\lambda_i>0\). In particular not all \(\lambda_i\) are zero. ◻

Lemma 4. If \(\textrm{supp}(q)=\{k\}\), then \[\begin{align} q_k \geq \max_{i\in\{k,\dots,b\}} L^1_{i,\{k,\dots,b\}}. \end{align}\]

Proof. When \(q_j=0\) for \(j \neq k\), we have \[\begin{align} \sum_{s=1}^i q_s = \begin{cases} 0 & i<k,\\ q_k & i\geq k. \end{cases} \end{align}\] For \(i<k\), trivially \(g_i(q)=0\). For \(i \geq k\), we have \[\begin{align} g_i(q) = q_k^2 - q_k\,L^1_{i,\{k,\dots,b\}} = q_k \left( q_k - L^1_{i,\{k,\dots,b\}} \right). \end{align}\] Since \(q_k>0\), the constraint \(g_i(q) \geq 0\) is equivalent to \(q_k \geq L^1_{i,\{k,\dots,b\}}\). This must hold for every \(i\geq k\), and hence \(q_k\) is at least the stated maximum. ◻

Lemma 5. Let \(q^\star\) be a KKT point of 37 such that \(\textrm{supp}(q^\star)=\{k\}\). Then \[\begin{align} q_k^\star = \max_{i\geq k} L^1_{i,\{k,\dots,b\}}. \end{align}\]

Proof. By Lemma 4 we already have \(q_k^\star \geq \max_{i\geq k} L^1_{i,\{k,\dots,b\}}\). Assume that the inequality is strict, that is, \[\begin{align} q_k^\star > \max_{i\geq k} L^1_{i,\{k,\dots,b\}}. \end{align}\] Then for every \(i\geq k\) we have \[\begin{align} g_i(q^\star) = q^\star_k \left( q^\star_k - L^1_{i,\{k,\dots,b\}} \right) > 0. \end{align}\] Since by complementary slackness \(\lambda_i g_i(q^\star)=0\), we get \(\lambda_i = 0\) for all \(i\geq k\). Plugging this into the stationarity condition 39 yields \[\begin{align} 0 = d_k - \eta_k. \end{align}\] But \(q_k^\star>0\) forces \(\eta_k=0\), and thus \(d_k=0\), yielding a contradiction. Therefore, the strict inequality cannot hold, and we must have \[\begin{align} q_k^\star = \max_{i\geq k} L^1_{i,\{k,\dots,b\}}. \end{align}\] ◻

Theorem 15. Unless \[L^1_{1, [b]} = \max_{i\in[b]} L^1_{i, [b]},\] \((p_1, p_2, \ldots, p_b)=(1,0,\ldots,0)\) is not an optimal solution of 36 .

Proof. According to 11, the sampling must be such that \[\begin{align} {\mathbb{P}}\left(i\in \hat{S}\right) = \sum_{s=1}^i p_s > 0 \qquad \forall i\in[b], \end{align}\] and hence we must have \(p_1>0\). Thus, the support of the solution \(p^\star\) of 36 (which coincides with the support of the solution \(q^\star\) of 37 ) can only be a singleton if \(\textrm{supp}(p^\star) = \textrm{supp}(q^\star) = \{1\}\). But then, by 5, we must have \[\begin{align} q_1^\star = \max_{i\in[b]} L^1_{i,[b]}, \quad q_2^\star = \ldots = q_b^\star = 0, \end{align}\] which in turn implies that \[\begin{align} 0 \leq g_i(q^\star) = \left( \sum_{s=1}^i q_s^\star \right)^2 - \sum_{s=1}^i q_s^\star L^1_{i, \{s,\dots,b\}} = q_1^\star \left( q_1^\star - L^1_{i, [b]} \right) \end{align}\] for all \(i \in [b]\), so in particular \(q_1^\star \left( q_1^\star - L^1_{1, [b]} \right) \geq 0\). Now, suppose \(L^1_{1, [b]} \neq \max_{i\in[b]} L^1_{i, [b]}\). Then this inequality is strict and, by complementary slackness, \(\lambda_1=\eta_1=0\), so \[\begin{align} d_1 = \sum_{i=2}^b \lambda_i \left( 2 q_1^\star - L^1_{i, \{1,\dots,b\}} \right) \leq \sum_{i=2}^b \lambda_i \left( 2 q_1^\star - L^1_{i, \{2,\dots,b\}} \right) = d_2 - \eta_2 \leq d_2, \end{align}\] which is a contradiction, since \(d_1 = d_2 + c_1 + c_1^{\sharp} > d_2\). Thus, similarly to the result in 14, we must have \(L^1_{1, [b]} = \max_{i\in[b]} L^1_{i, [b]}\). ◻

11.2.2 \(\tau\)-Nice Sampling↩︎

Choose \(\hat{S}\) uniformly at random from all subsets of \([b]\) of size \(\tau\).

For every \(i\in[b]\) we have \[\begin{align} {\mathbb{P}}\left(i \in \hat{S}\right)=\frac{\binom{b-1}{\tau-1}}{\binom{b}{\tau}}=\frac{\tau}{b} \end{align}\] and \[\begin{align} {\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}} \mathbb{I}\left(i\in\hat{S}\right)\right] = \frac{1}{\binom{b}{\tau}} \sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^{\alpha}_{i,S} \mathbb{I}\left(i\in S\right) \end{align}\] for \(\alpha\in\{0,1\}\). Recall that \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp}. \end{align}\] Hence, substituting into 34 and 35 , the objective functions to minimize are \[\begin{align} &&{\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right] \\ &\propto& \frac{\sum_{i=1}^b \frac{{\mathbb{P}}\left(i \in \hat{S}\right)^2 {\mathbb{E}}\left[L^0_{i,\hat{S}} \mathbb{I}\left(i \in \hat{S}\right)\right]}{\left( {\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i \in \hat{S}\right)\right] \right)^2}}{\min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i \in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i \in \hat{S}\right)\right]} \right)^2} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right) \\ &=& \frac{\max_{i\in[b]} \left( \sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^1_{i,S} \mathbb{I}\left(i\in S\right) \right)^2}{\frac{\tau}{b} \binom{b-1}{\tau-1}} \\ &&\qquad\times\sum_{i=1}^b \frac{\sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^0_{i,S} \mathbb{I}\left(i\in S\right)}{\left( \sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^1_{i,S} \mathbb{I}\left(i\in S\right) \right)^2} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right) \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right] &\propto& \frac{1}{\min_{i\in[b]} \frac{{\mathbb{P}}\left(i \in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i \in \hat{S}\right)\right]}} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right) \\ &=& \frac{\max_{i\in[b]} \sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^1_{i,S} \mathbb{I}\left(i\in S\right)}{\left( \frac{\tau}{b} \right)^2 \binom{b}{\tau}} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right). \end{align}\] As in 11.1.2, in general there is no closed-form expression for \(\tau\) minimizing these costs, but it can be shown that \(\tau=b\) need not be optimal. Let us make the simplifying assumption that \(L^{\alpha}_{i,S}\) depends only on \(i\) and \(|S|\), and define \(L^{\alpha}_{i,\tau} :=L^{\alpha}_{i,S}\) for \(|S|=\tau\) and \(\alpha\in\{0,1\}\).
Then \[\begin{align} &&{\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right] \\ &\propto& \frac{\max_{i\in[b]} \left( \sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^1_{i,\tau} \mathbb{I}\left(i\in S\right) \right)^2}{\frac{\tau}{b} \binom{b-1}{\tau-1}} \\ &&\qquad\times\sum_{i=1}^b \frac{\sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^0_{i,\tau} \mathbb{I}\left(i\in S\right)}{\left( \sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^1_{i,\tau} \mathbb{I}\left(i\in S\right) \right)^2} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right) \\ &=& \frac{\max_{i\in[b]} \left( L^1_{i,\tau} \binom{b-1}{\tau-1} \right)^2}{\frac{\tau}{b} \binom{b-1}{\tau-1}} \sum_{i=1}^b \frac{L^0_{i,\tau} \binom{b-1}{\tau-1}}{\left( L^1_{i,\tau} \binom{b-1}{\tau-1} \right)^2} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right) \\ &=& \underbrace{\max_{i\in[b]} \left( L^1_{i,\tau} \right)^2 \sum_{i=1}^b \frac{L^0_{i,\tau}}{\left( L^1_{i,\tau} \right)^2}}_{:=A_{\varepsilon^2}(\tau)} \underbrace{\left( \frac{b}{\tau} c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \frac{b}{\tau} - \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right) + \sum_{j=1}^b c_j^{\sharp} \right)}_{:=B(\tau)} \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right] &\propto& \frac{\max_{i\in[b]} \sum_{\substack{S\subseteq[b]\\|S|=\tau}} L^1_{i,\tau} \mathbb{I}\left(i\in S\right)}{\left( \frac{\tau}{b} \right)^2 \binom{b}{\tau}} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right) \\ &=& \frac{\max_{i\in[b]} L^1_{i,\tau} \binom{b-1}{\tau-1}}{\left( \frac{\tau}{b} \right)^2 \binom{b}{\tau}} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( 1 - \frac{\binom{b-j}{\tau}}{\binom{b}{\tau}} \right) + \frac{\tau}{b} \sum_{j=1}^b c_j^{\sharp} \right) \\ &=& \underbrace{\max_{i\in[b]} L^1_{i,\tau}}_{:=A_{\varepsilon}(\tau)} \underbrace{\left( \frac{b}{\tau} c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \frac{b}{\tau} - \frac{\binom{b-j}{\tau}}{\binom{b-1}{\tau-1}} \right) + \sum_{j=1}^b c_j^{\sharp} \right)}_{:=B(\tau)}. \end{align}\] We have already established in 11.1.2 that \(B\) is decreasing in \(\tau\). In fact, the expression to be minimized there is structurally very similar to those obtained above. Specifically, in the smooth case we had \(\mathrm{cost}_{\varepsilon}(\mathcal{D}) \propto A(\tau) B(\tau)\), where \(A(\tau) :=\max_{i\in[b]} L^0_{i,\tau}\). Since \(L^1_{i,\tau}\), like \(L^0_{i,\tau}\), is non-decreasing in \(\tau\), the same reasoning as in 11.1.2 applies to \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right]\). On the other hand, the dependence of \(A_{\varepsilon^2}(\tau)\) on \(\tau\) can be arbitrary, depending on the scaling between \(L^0_{i,\tau}\) and \(L^1_{i,\tau}\).

Consequently, both objectives \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right]\) and \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right]\) exhibit a trade-off between the decreasing factor \(B(\tau)\) and the (typically) increasing factor \(A_{\varepsilon}(\tau)\) or \(A_{\varepsilon^2}(\tau)\). If \(A_{\varepsilon}(\tau)\) or \(A_{\varepsilon^2}(\tau)\) grows sufficiently fast in \(\tau\), it can compensate for the decrease of \(B(\tau)\) and make a smaller \(\tau\) (or even \(\tau=1\)) optimal.
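
This trade-off is easy to illustrate numerically. The sketch below (hypothetical costs and smoothness constants whose growth in \(\tau\) we simply posit) evaluates \(A_{\varepsilon}(\tau) B(\tau)\) and \(A_{\varepsilon^2}(\tau) B(\tau)\) over \(\tau\in[b]\); depending on the assumed growth, the minimizer need not be \(\tau=b\).

```python
import numpy as np
from math import comb

b, c_ov = 6, 1.0
c = np.array([3.0, 3.0, 2.0, 2.0, 1.0, 1.0])   # hypothetical costs c_j
c_sharp = np.full(b, 0.4)                       # hypothetical costs c_j^sharp
# Hypothetical smoothness constants growing with the subset size tau (illustration only).
L0 = lambda i, tau: (1.0 + 0.5 * i) * tau ** 0.5
L1 = lambda i, tau: (0.2 + 0.1 * i) * tau

def B(tau):
    # B(tau) = (b/tau) c_ov + sum_j c_j (b/tau - C(b-j,tau)/C(b-1,tau-1)) + sum_j c_j^sharp
    tail = sum(c[j] * (b / tau - comb(b - j - 1, tau) / comb(b - 1, tau - 1)) for j in range(b))
    return (b / tau) * c_ov + tail + c_sharp.sum()

def A_eps(tau):
    return max(L1(i, tau) for i in range(b))

def A_eps2(tau):
    return max(L1(i, tau) for i in range(b)) ** 2 * sum(L0(i, tau) / L1(i, tau) ** 2 for i in range(b))

for tau in range(1, b + 1):
    print(f"tau={tau}: A_eps*B={A_eps(tau) * B(tau):8.2f}  A_eps2*B={A_eps2(tau) * B(tau):8.2f}")
```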

11.2.3 \(\tau\)-Submodel Sampling↩︎

Sample a starting index \(\hat{s}\in\{1,\dots,b-\tau+1\}\) with probability \(p_i = {\mathbb{P}}\left(\hat{s}=i\right)\) and set \(\hat{S}=\{\hat{s},\dots,\hat{s}+\tau-1\}\).

For any fixed \(i\in[b]\), we have \[\begin{align} {\mathbb{P}}\left(i \in \hat{S}\right) &=& \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} p_j \end{align}\] and \[\begin{align} {\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i \in \hat{S}\right)\right] &=& \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} p_j L^1_{i,\{j,\dots,j+\tau-1\}}. \end{align}\] We have also already shown that \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{i=1}^{\min\{j,b-\tau+1\}} p_i \right) + \sum_{j=1}^b c_j^{\sharp} \left( \sum_{i=\max\{1,j-\tau+1\}}^{\min\{j,b-\tau+1\}} p_i \right). \end{align}\]

11.2.3.1 Partitioned \(\tau\)-submodel sampling.

Mimicking 11.1.3.1, let us consider a special case of the above sampling scheme where the submodels partition \([b]\). For simplicity, we assume that \(b\) is divisible by \(\tau\) and let \(m=\frac{b}{\tau}\). The algorithm is then equivalent to \(\tau\)-submodel sampling with starting-index distribution satisfying \[\begin{align} p_{s_k} > 0 \quad (k=1,\dots,m), \qquad p_j=0 \text{ otherwise}, \end{align}\] where \(s_k=(k-1)\tau+1\), \(k\in[m]\). For any fixed \(i\in[b]\), the derivations above simplify to \[\begin{align} {\mathbb{P}}\left(i \in \hat{S}\right) = \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} p_j = p_{s_{\lceil i/\tau \rceil}} \end{align}\] and \[\begin{align} {\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}} \mathbb{I}\left(i \in \hat{S}\right)\right] = \sum_{j=\max\{1,i-\tau+1\}}^{\min\{i,b-\tau+1\}} p_j L^{\alpha}_{i,\{j,\dots,j+\tau-1\}} = p_{s_{\lceil i/\tau \rceil}} L^{\alpha}_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}} \end{align}\] for \(\alpha\in\{0,1\}\). Plugging this choice into 34 and 35 gives \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right] &\propto& \frac{\sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2 {\mathbb{E}}\left[L^0_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}{\left( {\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] \right)^2}}{\min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]} \right)^2} \\ &&\qquad\times\left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{i=1}^{\min\{j,b-\tau+1\}} p_i \right) + \sum_{j=1}^b c_j^{\sharp} \left( \sum_{i=\max\{1,j-\tau+1\}}^{\min\{j,b-\tau+1\}} p_i \right) \right) \\ &=& \frac{\sum_{i=1}^b \frac{p_{s_{\lceil i/\tau \rceil}} L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}}{\left( L^1_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}} \right)^2}}{\min_{i\in[b]} \left( \frac{p_{s_{\lceil i/\tau \rceil}}}{L^1_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}} \right)^2} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k=1}^{\left\lceil \frac{j}{\tau} \right\rceil} p_{s_k} \right) + \sum_{j=1}^b c_j^{\sharp} p_{s_{\left\lceil \frac{j}{\tau} \right\rceil}} \right) \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right] &\propto& \frac{1}{\min_{i\in[b]} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}} \\ &&\qquad\times\left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{i=1}^{\min\{j,b-\tau+1\}} p_i \right) + \sum_{j=1}^b c_j^{\sharp} \left( \sum_{i=\max\{1,j-\tau+1\}}^{\min\{j,b-\tau+1\}} p_i \right) \right) \\ &=& \frac{1}{\min_{i\in[b]} \frac{p_{s_{\lceil i/\tau \rceil}}}{L^1_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}}} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k=1}^{\left\lceil \frac{j}{\tau} \right\rceil} p_{s_k} \right) + \sum_{j=1}^b c_j^{\sharp} p_{s_{\left\lceil \frac{j}{\tau} \right\rceil}} \right). \end{align}\] The expression for \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right]\) is entirely analogous to 24 derived in the smooth case, with \(L^0_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}\) replaced by \(L^1_{i,B_{\left\lceil \frac{i}{\tau} \right\rceil}}\). 
Consequently, for a fixed \(\tau\), the optimal block probabilities are \[\begin{align} p_{s_k}^\star = \frac{\max_{i\in B_k} L^1_{i,B_k}}{\sum_{l=1}^m \max_{i\in B_l} L^1_{i,B_l}}. \end{align}\] Again, the optimal number of blocks \(m\) depends on the growth rate of the constants \(L^1_{i,\tau}\) with respect to \(\tau\) and on their interaction with the costs \(c_j\) and \(c_j^{\sharp}\). In general, the best choice of \(m\) is not necessarily \(b\).
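
In code, these optimal block probabilities amount to a normalization of the block-wise maxima of the \(L^1\) constants; a minimal sketch with hypothetical constants:

```python
import numpy as np

# Hypothetical per-layer L^1_{i, B_k} constants within each block of the partition (illustration only).
tau, m = 3, 4                                      # block size and number of blocks, b = tau * m
rng = np.random.default_rng(2)
L1_within = rng.uniform(0.5, 2.0, size=(m, tau))   # L1_within[k, :] = {L^1_{i, B_k} : i in B_k}

block_max = L1_within.max(axis=1)                  # max_{i in B_k} L^1_{i, B_k}
p_star = block_max / block_max.sum()               # optimal block probabilities p*_{s_k}
print("p* =", np.round(p_star, 3), " sum =", p_star.sum())
```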

The dependence of \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right]\) on \(\tau\) is significantly more complex and governed by the relative scaling between \(L^0_{i,\tau}\) and \(L^1_{i,\tau}\).

11.2.4 Arbitrary Submodel Sampling↩︎

Let \(\{B_1,\dots,B_m\}\) be a partition of \([b]\). Set \(\hat{S} = B_i\) with probability \(p_i\) (where \(\sum_{i=1}^m p_i = 1\)). For \(i\in[b]\), \(k(i)\) denotes the unique block with \(i\in B_{k(i)}\) and \(\underline{b}_k :=\min B_k\).

For any \(i\in[b]\), we have \[\begin{align} {\mathbb{P}}\left(i \in \hat{S}\right) = p_{k(i)} \end{align}\] and \[\begin{align} {\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] = p_{k(i)} L^{\alpha}_{i,B_{k(i)}} \end{align}\] for \(\alpha\in\{0,1\}\). Since \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}(\hat{S})\right] = c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k:\underline{b}_k \leq j} p_k \right) + \sum_{j=1}^b c_j^{\sharp} p_{k(j)}, \end{align}\] the total expected costs become \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon^2}(K)\right] &\propto& \frac{\sum_{i=1}^b \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2 {\mathbb{E}}\left[L^0_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}{\left( {\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] \right)^2}}{\min_{i\in[b]} \left( \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]} \right)^2} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k:\underline{b}_k \leq j} p_k \right) + \sum_{j=1}^b c_j^{\sharp} p_{k(j)} \right) \\ &=& \frac{\sum_{i=1}^b \frac{p_{k(i)} L^0_{i,B_{k(i)}}}{\left( L^1_{i,B_{k(i)}} \right)^2}}{\min_{i\in[b]} \left( \frac{p_{k(i)}}{L^1_{i,B_{k(i)}}} \right)^2} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k:\underline{b}_k \leq j} p_k \right) + \sum_{j=1}^b c_j^{\sharp} p_{k(j)} \right) \end{align}\] and \[\begin{align} {\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right] &\propto& \frac{1}{\min_{i\in[b]} \frac{{\mathbb{P}}\left(i\in \hat{S}\right)^2}{{\mathbb{E}}\left[L^1_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right]}} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k:\underline{b}_k \leq j} p_k \right) + \sum_{j=1}^b c_j^{\sharp} p_{k(j)} \right) \\ &=& \frac{1}{\min_{i\in[b]} \frac{p_{k(i)}}{L^1_{i,B_{k(i)}}}} \left( c_{\mathrm{ov}} + \sum_{j=1}^b c_j \left( \sum_{k:\underline{b}_k \leq j} p_k \right) + \sum_{j=1}^b c_j^{\sharp} p_{k(j)} \right). \end{align}\] Following the reasoning in 11.1.4, with \(L^0_{i,B_k}\) replaced by \(L^1_{i,B_k}\), we find that for a fixed partition, the choice of probabilities minimizing \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right]\) is \[\begin{align} p_k^\star = \frac{\max_{i\in B_k} L^1_{i,B_k}}{\sum_{l=1}^m \max_{i\in B_l} L^1_{i,B_l}}, \end{align}\] and that submodel training can improve upon full model training in certain regimes.
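
To compare candidate distributions under this sampling scheme, the sketch below (hypothetical partition, costs, and \(L^1\) constants; helper names ours) evaluates the expression proportional to \({\mathbb{E}}\left[\mathrm{cost}_{\varepsilon}(K)\right]\) for a given probability vector over blocks.

```python
import numpy as np

rng = np.random.default_rng(3)
b = 6
blocks = [[0, 1], [2, 3], [4, 5]]                   # partition B_1, ..., B_m (0-indexed)
c = rng.uniform(0.5, 2.0, size=b)                   # hypothetical costs c_j
c_sharp = rng.uniform(0.1, 0.5, size=b)             # hypothetical costs c_j^sharp
c_ov = 1.0
L1 = rng.uniform(0.5, 3.0, size=b)                  # hypothetical L^1_{i, B_{k(i)}}

def expected_cost_eps(p):
    k_of = np.empty(b, dtype=int)                   # k(i): block containing layer i
    for k, B in enumerate(blocks):
        k_of[B] = k
    b_min = np.array([min(B) for B in blocks])      # underline{b}_k
    per_iter = (c_ov
                + sum(c[j] * p[b_min <= j].sum() for j in range(b))
                + sum(c_sharp[j] * p[k_of[j]] for j in range(b)))
    leading = 1.0 / min(p[k_of[i]] / L1[i] for i in range(b))
    return leading * per_iter

p_uniform = np.full(len(blocks), 1 / len(blocks))
p_prop_L1 = np.array([L1[B].max() for B in blocks])
p_prop_L1 /= p_prop_L1.sum()
print("uniform p   :", round(expected_cost_eps(p_uniform), 3))
print("p ~ max L^1 :", round(expected_cost_eps(p_prop_L1), 3))
```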

12 Convergence Results – Stochastic Gradient Case↩︎

The proof of 16 relies on two preliminary lemmas, which we state and prove first.

Lemma 6 (Descent Lemma). Let 4 hold and consider the update rule \(X_i^{k+1} = {\rm LMO}_{\mathcal{B}_i(X_i^k,t_i^k)}(M_i^k)\) for \(i\in S^k\), with \(X_i^{k+1} = X_i^k\) for \(i\notin S^k\), where \(X^{k+1} = [X_1^{k+1}, \ldots, X_b^{k+1}] \in \mathcal{X}\), \(X^k = [X_1^k, \ldots, X_b^k] \in \mathcal{X}\), \(M^k = [M_1^k, \ldots, M_b^k] \in \mathcal{X}\) and \(t_i^k > 0\). Then \[\begin{align} f(X^{k+1}) &\leq& f(X^k) + \sum_{i\in S^k} 2 t_i^k \left\| \nabla_i f(X^k) - M_i^k \right\|_{(i) \star} - \sum_{i\in S^k} t_i^k \left\| \nabla_i f(X^k) \right\|_{(i) \star} \\ &&+ \sum_{i\in S^k} \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (t_i^k)^2. \end{align}\]

Proof. By 10 \[\begin{align} &&f(X^{k+1}) \\ &\leq& f(X^k) + \left\langle\nabla f(X^k),X^{k+1} - X^k\right\rangle + \sum_{i\in S^k} \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} \left\| X_i^k - X_i^{k+1} \right\|_{(i)}^2 \\ &\overset{\eqref{eq:normlmo}}{=}& f(X^k) + \sum_{i\in S^k} \left( \left\langle\nabla_i f(X^k) - M_i^k,X_i^{k+1} - X_i^k\right\rangle_{(i)} + \left\langle M_i^k,X_i^{k+1} - X_i^k\right\rangle_{(i)} \right) \\ &&+ \sum_{i\in S^k} \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (t_i^k)^2 \\ &\overset{\eqref{eq:inplmo}}{=}& f(X^k) + \sum_{i\in S^k} \left( \left\langle\nabla_i f(X^k) - M_i^k,X_i^{k+1} - X_i^k\right\rangle_{(i)} - t_i^k \left\| M_i^k \right\|_{(i) \star} \right) \\ &&+ \sum_{i\in S^k} \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (t_i^k)^2 \\ &\leq& f(X^k) + \sum_{i\in S^k} \Bigg(t_i^k \left\| \nabla_i f(X^k) - M_i^k \right\|_{(i) \star} - t_i^k \left\| M_i^k \right\|_{(i) \star} \\ &&\qquad\qquad\qquad\quad+ \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (t_i^k)^2\Bigg), \end{align}\] where in the last line we used the Cauchy-Schwarz inequality. Therefore, using triangle inequality, we get \[\begin{align} &&f(X^{k+1}) \\ &\leq& f(X^k) + \sum_{i\in S^k} \left( t_i^k \left\| \nabla_i f(X^k) - M_i^k \right\|_{(i) \star} + t_i^k \left\| \nabla_i f(X^k) - M_i^k \right\|_{(i) \star} - t_i^k \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right) \\ &&+ \sum_{i\in S^k} \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (t_i^k)^2 \\ &=& f(X^k) + \sum_{i\in S^k} \left( 2 t_i^k \left\| \nabla_i f(X^k) - M_i^k \right\|_{(i) \star} - t_i^k \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right) \\ &&+ \sum_{i\in S^k} \frac{L^0_{i,S^k} + L^1_{i,S^k} \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} (t_i^k)^2 \end{align}\] as required. ◻

Lemma 7. Let Assumptions 4 and 5 hold. Then, the iterates of [alg:rt95arbitrary95stoch] run with \(t_i^k \equiv t_i\) satisfy \[\begin{align} &&{\mathbb{E}}\left[\left\| M_i^{k+1} - \nabla_i f(X^{k+1}) \right\|_2\right] \\ &\leq& \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k+1} {\mathbb{E}}\left[\left\| M_i^0 - \nabla_i f(X^0) \right\|_2\right] + \frac{t_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right]}{\underline{\rho}_i \beta_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right]} \nonumber \\ &&+ \frac{t_i}{\underline{\rho}_i} {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] \sum_{l=0}^{k} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right] \\ &&+ \sigma_i \sqrt{\beta_i}. \end{align}\]

Proof. The proof is inspired by [14]. First, using the momentum update rule, we have \[\begin{align} M_i^{k+1} &=& \left( 1 - \mathbb{I}\left(i\in S^k\right) \beta_i \right) M_i^k + \mathbb{I}\left(i\in S^k\right) \beta_i \nabla_i f(X^{k+1}; \xi^{k+1}) \\ &=& \left( 1 - \beta_i^k \right) \left( M_i^k - \nabla_i f(X^k) \right) + \left( 1 - \beta_i^k \right) \left( \nabla_i f(X^k) - \nabla_i f(X^{k+1}) \right) \\ &&+ \beta_i^k \left( \nabla_i f(X^{k+1}; \xi^{k+1}) - \nabla_i f(X^{k+1}) \right) + \nabla_i f(X^{k+1}). \end{align}\] where \(\beta_i^k :=\mathbb{I}\left(i\in S^k\right) \beta_i\). To simplify the notation, define \(U_{1,i}^k :=M_i^k - \nabla_i f(X^k)\), \(U_{2,i}^k :=\nabla_i f(X^k) - \nabla_i f(X^{k+1})\) and \(U_{3,i}^k :=\nabla_i f(X^k; \xi^k) - \nabla_i f(X^k)\). Then the above can be written as \[\begin{align} \label{eq:rec95st} U_{1,i}^{k+1} = \left( 1 - \beta_i^k \right) U_{1,i}^k + \left( 1 - \beta_i^k \right) U_{2,i}^k + \beta_i^k U_{3,i}^{k+1}. \end{align}\tag{40}\] Unrolling the recursion in 40 , we get \[\begin{align} U_{1,i}^{k+1} &= (1-\beta_i^k) U_{1,i}^k + (1-\beta_i^k) U_{2,i}^k + \beta_i^k U_{3,i}^{k+1} \\ &= \left( \prod_{m=0}^{k}(1-\beta_i^m) \right) U_{1,i}^0 + \sum_{l=0}^{k} \left( \prod_{m=l}^{k}(1-\beta_i^m) \right) U_{2,i}^l + \sum_{l=0}^{k} \left( \beta_i^l\prod_{m=l+1}^{k}(1-\beta_i^m) \right) U_{3,i}^{l+1}, \end{align}\] where by convention \(\prod_{m=a}^{b}(\cdot)=1\) if \(a>b\). Hence, taking norms and using the triangle inequality, \[\begin{align} \label{eq:apieghrn} {\mathbb{E}}\left[\left\| U_{1,i}^{k+1} \right\|_2\right] &\leq& {\mathbb{E}}\left[\left\| \left( \prod_{m=0}^{k}(1-\beta_i^m) \right) U_{1,i}^0 \right\|_2\right] + {\mathbb{E}}\left[\left\| \sum_{l=0}^{k} \left( \prod_{m=l}^{k}(1-\beta_i^m) \right) U_{2,i}^l \right\|_2\right] \nonumber \\ &&+ {\mathbb{E}}\left[\left\| \sum_{l=0}^{k} \left( \beta_i^l\prod_{m=l+1}^{k}(1-\beta_i^m) \right) U_{3,i}^{l+1} \right\|_2\right]. \end{align}\tag{41}\] Let us consider each of the terms separately. First, using the independence of \(\{S^k\}_{k\geq0}\), we have \[\begin{align} \label{eq:niuytfvrths} {\mathbb{E}}\left[\left\| \left( \prod_{m=0}^{k}(1-\beta_i^m) \right) U_{1,i}^0 \right\|_2\right] &\leq& {\mathbb{E}}\left[\left( \prod_{m=0}^{k}(1-\beta_i^m) \right) \left\| U_{1,i}^0 \right\|_2\right] \nonumber \\ &=& {\mathbb{E}}\left[{\mathbb{E}}\left[\left.\left( \prod_{m=0}^{k}(1-\beta_i^m) \right)\right\vert X^k, M_i^k\right] \left\| U_{1,i}^0 \right\|_2\right] \nonumber \\ &=& \prod_{m=0}^{k} {\mathbb{E}}\left[1-\beta_i^m\right] {\mathbb{E}}\left[\left\| U_{1,i}^0 \right\|_2\right]. \end{align}\tag{42}\] Next, by 4 \[\begin{align} &&{\mathbb{E}}\left[\left\| \sum_{l=0}^{k} \left( \prod_{m=l}^{k}(1-\beta_i^m) \right) U_{2,i}^l \right\|_2\right] \\ &\leq& {\mathbb{E}}\left[\sum_{l=0}^{k} \left( \prod_{m=l}^{k}(1-\beta_i^m) \right) \left\| U_{2,i}^l \right\|_2\right] \\ &\leq& \frac{1}{\underline{\rho}_i} \sum_{l=0}^{k} {\mathbb{E}}\left[\left( \prod_{m=l}^{k}(1-\beta_i^m) \right) \left\| \nabla_i f(X^l) - \nabla_i f(X^{l+1}) \right\|_{(i) \star}\right] \\ &\overset{\eqref{as:arbitrary95layer95gen95smoothness}}{\leq}& \frac{1}{\underline{\rho}_i} \sum_{l=0}^{k} {\mathbb{E}}\left[\left( \prod_{m=l}^{k}(1-\beta_i^m) \right) \left( L_{i,S^l}^0 + L_{i,S^l}^1 \left\| \nabla_i f(X^l) \right\|_{(i) \star} \right) \left\| X_i^l - X_i^{l+1} \right\|_{(i)}\right]. 
\end{align}\] Since the product \(\prod_{m=l+1}^{k}(1-\beta_i^m)\) depends only on samplings at iterations \(>l\), these factors are independent of the \(\sigma\)-algebra generated by \((X^l,S^l)\). Therefore, \[\begin{align} &&{\mathbb{E}}\left[\left\| \sum_{l=0}^{k} \left( \prod_{m=l}^{k}(1-\beta_i^m) \right) U_{2,i}^l \right\|_2\right] \\ &\leq& \frac{1}{\underline{\rho}_i} \sum_{l=0}^{k} \left( \left( \prod_{m=l+1}^{k} {\mathbb{E}}\left[1-\beta_i^m\right] \right) {\mathbb{E}}\left[(1-\beta_i^l) \left( L_{i,S^l}^0 + L_{i,S^l}^1 \left\| \nabla_i f(X^l) \right\|_{(i) \star} \right)\right] t_i \right), \end{align}\] where \[\begin{align} &&{\mathbb{E}}\left[(1-\beta_i^l) \left( L_{i,S^l}^0 + L_{i,S^l}^1 \left\| \nabla_i f(X^l) \right\|_{(i) \star} \right)\right] \\ &=& {\mathbb{E}}\left[(1-\beta_i^l) L_{i,S^l}^0\right] + {\mathbb{E}}\left[{\mathbb{E}}\left[\left.(1-\beta_i^l) L_{i,S^l}^1 \left\| \nabla_i f(X^l) \right\|_{(i) \star}\right\vert X^l\right]\right] \\ &=& {\mathbb{E}}\left[(1-\beta_i^l) L_{i,S^l}^0\right] + {\mathbb{E}}\left[(1-\beta_i^l) L_{i,S^l}^1\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right]. \end{align}\] Thus \[\begin{align} \label{eq:dukise} &&{\mathbb{E}}\left[\left\| \sum_{l=0}^{k} \left( \prod_{m=l}^{k}(1-\beta_i^m) \right) U_{2,i}^l \right\|_2\right] \\ &\leq& \frac{t_i}{\underline{\rho}_i} \sum_{l=0}^{k} \left( \prod_{m=l+1}^{k} {\mathbb{E}}\left[1-\beta_i^m\right] \right) \left( {\mathbb{E}}\left[(1-\beta_i^l) L_{i,S^l}^0\right] + {\mathbb{E}}\left[(1-\beta_i^l) L_{i,S^l}^1\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right] \right). \nonumber \end{align}\tag{43}\] Lastly, by Jensen’s inequality, the last term can be bounded via \[\begin{align} \label{eq:aefawcrgt} &&{\mathbb{E}}\left[\left\| \sum_{l=0}^{k} \left( \beta_i^l\prod_{m=l+1}^{k}(1-\beta_i^m) \right) U_{3,i}^{l+1} \right\|_2\right] \nonumber \\ &\leq& \sqrt{{\mathbb{E}}\left[\left\| \sum_{l=0}^{k} \underbrace{\left( \beta_i^l\prod_{m=l+1}^{k}(1-\beta_i^m) \right)}_{:=a_l} U_{3,i}^{l+1} \right\|^2_2\right]} = \sqrt{{\mathbb{E}}\left[\sum_{l, r = 0}^{k} a_l a_r \left\langle U_{3,i}^{l+1},U_{3,i}^{r+1}\right\rangle\right]} \nonumber \\ &\overset{\eqref{as:bounded95var}}{=}& \sqrt{{\mathbb{E}}\left[\sum_{l = 0}^{k} a_l^2 \left\| U_{3,i}^{l+1} \right\|_2^2\right]} = \sqrt{\sum_{l = 0}^{k} {\mathbb{E}}\left[{\mathbb{E}}\left[\left.a_l^2 \left\| U_{3,i}^{l+1} \right\|_2^2\right\vert\{S^r\}_{r=l}^{k}, X^{l+1}\right]\right]} \nonumber \\ &=& \sqrt{\sum_{l = 0}^{k} {\mathbb{E}}\left[a_l^2 {\mathbb{E}}\left[\left.\left\| U_{3,i}^{l+1} \right\|_2^2\right\vert\{S^r\}_{r=l}^{k}, X^{l+1}\right]\right]} \overset{\eqref{as:bounded95var}}{\leq} \sigma_i \sqrt{\sum_{l = 0}^{k} {\mathbb{E}}\left[a_l^2\right]} \nonumber \\ &=& \sigma_i \sqrt{\sum_{l = 0}^{k} \left( {\mathbb{E}}\left[(\beta_i^l)^2\right] \prod_{m=l+1}^{k} {\mathbb{E}}\left[(1-\beta_i^m)^2\right] \right)} \nonumber \\ &=& \sigma_i \sqrt{\sum_{l = 0}^{k} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in S^l\right) \beta_i^2\right] \prod_{m=l+1}^{k} {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in S^m\right) \beta_i \right)^2\right] \right)} \nonumber \\ &=& \sigma_i \sqrt{\sum_{l = 0}^{k} \left( \beta_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right)^2\right]^{k-l} \right)} \nonumber \\ &\leq& \sigma_i \sqrt{\frac{\beta_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right]}{1 - {\mathbb{E}}\left[\left( 
1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right)^2\right]}} \nonumber \\ &=& \sigma_i \beta_i \sqrt{\frac{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right]}{{\mathbb{E}}\left[\left( 2 - \beta_i \right) \mathbb{I}\left(i\in \hat{S}\right) \beta_i\right]}} = \frac{\sigma_i \beta_i}{\sqrt{\left( 2 - \beta_i \right) \beta_i}} \leq \sigma_i \sqrt{\beta_i}. \end{align}\tag{44}\] Applying 42 , 43 and 44 in 41 and noting that \({\mathbb{E}}\left[\beta_i^k\right] = {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i\) yields \[\begin{align} &&{\mathbb{E}}\left[\left\| U_{1,i}^{k+1} \right\|_2\right] \\ &\leq& \prod_{m=0}^{k} {\mathbb{E}}\left[1-\beta_i^m\right] {\mathbb{E}}\left[\left\| U_{1,i}^0 \right\|_2\right] + \frac{t_i}{\underline{\rho}_i} \sum_{l=0}^{k} \left( \prod_{m=l+1}^{k} {\mathbb{E}}\left[1-\beta_i^m\right] \right) {\mathbb{E}}\left[(1-\beta_i^l) L_{i,S^l}^0\right] \nonumber \\ &&+ \frac{t_i}{\underline{\rho}_i} \sum_{l=0}^{k} \left( \prod_{m=l+1}^{k} {\mathbb{E}}\left[1-\beta_i^m\right] \right) {\mathbb{E}}\left[(1-\beta_i^l) L_{i,S^l}^1\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right] + \sigma_i \sqrt{\beta_i} \\ &=& \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k+1} {\mathbb{E}}\left[\left\| U_{1,i}^0 \right\|_2\right] \nonumber \\ &&+ \frac{t_i}{\underline{\rho}_i} {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right] \sum_{l=0}^{k} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-l} \nonumber \\ &&+ \frac{t_i}{\underline{\rho}_i} {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] \sum_{l=0}^{k} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right] \nonumber \\ &&+ \sigma_i \sqrt{\beta_i} \\ &\leq& \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k+1} {\mathbb{E}}\left[\left\| U_{1,i}^0 \right\|_2\right] + \frac{t_i}{\underline{\rho}_i \beta_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right]} {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right] \nonumber \\ &&+ \frac{t_i}{\underline{\rho}_i} {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] \sum_{l=0}^{k} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right] \nonumber \\ &&+ \sigma_i \sqrt{\beta_i}. \end{align}\] ◻
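
For readers who prefer code to recursions, the following sketch (ours, and deliberately simplified) shows the single iteration analyzed in Lemmas 6 and 7: an LMO step on the sampled layer subset followed by the masked momentum update, assuming spectral-norm balls for the matrix-shaped layers (the Muon choice for hidden layers). The stochastic gradients at \(X^{k+1}\) are passed in as precomputed stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)

def lmo_spectral(X, M, t):
    # LMO over the spectral-norm ball of radius t centred at X (assumed norm choice for
    # matrix-shaped layers): argmin_{||Z - X||_2 <= t} <M, Z> = X - t * U V^T, with M = U S V^T.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return X - t * (U @ Vt)

def drop_step(X, M, stoch_grads, S, t, beta):
    """One (hypothetical) iteration: LMO step and masked momentum update on the sampled
    layer subset S only; layers outside S are left untouched."""
    X_new = list(X)
    for i in S:                                   # X_i^{k+1} = LMO_{B_i(X_i^k, t_i)}(M_i^k)
        X_new[i] = lmo_spectral(X[i], M[i], t[i])
    for i in S:                                   # M_i^{k+1} = (1 - beta_i) M_i^k + beta_i grad_i f(X^{k+1}; xi^{k+1})
        M[i] = (1 - beta[i]) * M[i] + beta[i] * stoch_grads[i]
    return X_new, M

# Toy usage with random matrices standing in for layer parameters and stochastic gradients
# (in the actual method the gradients would be evaluated at the freshly updated X^{k+1}).
shapes = [(8, 4), (4, 4), (4, 2)]
X = [rng.normal(size=s) for s in shapes]
M = [rng.normal(size=s) for s in shapes]
grads = [rng.normal(size=s) for s in shapes]
X, M = drop_step(X, M, grads, S=[1, 2], t=[0.1, 0.1, 0.1], beta=[0.3, 0.3, 0.3])
print([x.shape for x in X])
```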

Theorem 16. Let Assumptions 1, 4, and 5 hold. Let \(\{X^k\}_{k=0}^{K-1}\), \(K \geq 1\), be the iterates of [alg:rt95arbitrary95stoch] initialized with \(M_i^0 = \nabla_i f(X^0; \xi^0)\) and run with \(\beta_i \equiv \beta = \frac{1}{(K+1)^{1/2}}\) and \[\begin{align} 0 < t_i^k \equiv t_i = \frac{\eta_i}{(K+1)^{3/4}}, \qquad i=1,\ldots,b, \end{align}\] where \(\eta_i^2 \leq \min\left\{ \frac{(K+1)^{1/2}}{4 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\max\limits_{i\in [b]} L_{i,\hat{S}}^1\right]}, \frac{\underline{\rho}_i \min\limits_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \right)}{16 \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\max\limits_{i\in [b]} L_{i,\hat{S}}^1\right]}, 1 \right\}\). Then \[\begin{align} &&\min_{k=0,\ldots,K} \sum_{i=1}^b \frac{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \eta_i}{\frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &\leq& \frac{3 \delta^0}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} + \frac{6}{(K+1)^{1/2}} \sum_{i=1}^b \frac{\eta_i \bar{\rho}_i \sigma_i}{\frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \\ &&+ \sum_{i=1}^b \frac{\eta_i^2}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \frac{2 \bar{\rho}_i}{\underline{\rho}_i} \left( {\mathbb{E}}\left[L_{i,\hat{S}}^0\right] + {\mathbb{E}}\left[L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &&+ \sum_{i=1}^b \frac{\eta_i^2}{2 (K+1)^{3/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \\ &&\qquad\quad\times \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &&+ \sum_{i=1}^b \frac{2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \eta_i}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \bar{\rho}_i \sigma_i. \end{align}\]

Remark 17. For 16 to be meaningful, the sampling must be such that \({\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \eta_i > 0\) for every \(i\in[b]\). Thus, every layer has to be sampled with positive probability.

Remark 18. Note that under 4, without loss of generality we can set \(L^0_{i,S} = L^1_{i,S} = 0\) whenever \(i\not\in S\). Hence, in the case of RPT (see 4), we have \[\begin{align} {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] &=& \sum_{s=1}^i p_s, \\ {\mathbb{E}}\left[\max_{i\in[b]} L^1_{i,\hat{S}}\right] &=& \sum_{s=1}^b p_s \max_{i\in[b]} L^1_{i,\{s,\dots,b\}}, \\ {\mathbb{E}}\left[\frac{L^0_{i,\hat{S}}}{L^1_{i,\hat{S}}}\right] &=& \sum_{s=1}^b p_s \frac{L^0_{i,\{s,\dots,b\}}}{L^1_{i,\{s,\dots,b\}}} = \sum_{s=1}^i p_s \frac{L^0_{i,\{s,\dots,b\}}}{L^1_{i,\{s,\dots,b\}}}, \\ {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in\hat{S}\right) \beta_i \right) L^1_{i,\hat{S}}\right] &=& \sum_{s=1}^b p_s (1-\mathbb{I}\left(i\in\{s,\dots,b\}\right) \beta_i) L^1_{i,\{s,\dots,b\}} \\ &=& (1-\beta_i)\sum_{s=1}^i p_s L^1_{i,\{s,\dots,b\}}, \\ {\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}}\right] &=& \sum_{s=1}^b p_s L^{\alpha}_{i,\{s,\dots,b\}} = \sum_{s=1}^i p_s L^{\alpha}_{i,\{s,\dots,b\}}, \\ {\mathbb{E}}\left[L^{\alpha}_{i,\hat{S}} \mathbb{I}\left(i\in \hat{S}\right)\right] &=& \sum_{s=1}^i p_s L^{\alpha}_{i, \{s,\dots,b\}} \end{align}\] for \(\alpha\in\{0,1\}\). Substituting these expressions into the rate yields the result in 4.
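
The closed-form RPT expressions in this remark are easy to verify by Monte Carlo; below is a small sketch (hypothetical \(L^1\) values) comparing two of them against sample averages.

```python
import numpy as np

rng = np.random.default_rng(5)
b = 5
p = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
# Hypothetical L^1_{i,{s,...,b}} values, indexed [i, s]; zero for s > i (layer i not in the submodel).
L1 = np.triu(rng.uniform(0.5, 2.0, size=(b, b))).T   # lower-triangular: nonzero only for s <= i

i = 3                                                # layer index to check (0-based)
closed_incl = p[: i + 1].sum()                       # sum_{s <= i} p_s
closed_L1 = np.dot(p[: i + 1], L1[i, : i + 1])       # sum_{s <= i} p_s L^1_{i,{s,...,b}}

draws = rng.choice(b, size=200_000, p=p)             # sampled starting indices s_hat
incl = draws <= i                                    # indicator that i is in S_hat = {s_hat, ..., b}
mc_incl = incl.mean()
mc_L1 = np.mean(np.where(incl, L1[i, draws], 0.0))

print(f"P(i in S): closed {closed_incl:.4f}  MC {mc_incl:.4f}")
print(f"E[L1 * I]: closed {closed_L1:.4f}  MC {mc_L1:.4f}")
```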

Proof of 16. Taking expectation conditional on \([X^k, \{M_i^k\}_{i\in[b]}]\) in 6 gives \[\begin{align} &{\mathbb{E}}\left[\left.f(X^{k+1})\right\vert X^k, \{M_i^k\}_{i\in[b]}\right] \\ &\leq f(X^k) + {\mathbb{E}}\left[\left.\sum_{i\in S^k} 2 t_i \left\| \nabla_i f(X^k) - M_i^k \right\|_{(i) \star} - \sum_{i\in S^k} t_i \left\| \nabla_i f(X^k) \right\|_{(i) \star}\right\vert X^k, \{M_i^k\}_{i\in[b]}\right] \\ &\quad+ {\mathbb{E}}\left[\left.\sum_{i\in S^k} \frac{L^0_{i,S} + L_{i,S}^1 \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} t_i^2\right\vert X^k, \{M_i^k\}_{i\in[b]}\right] \\ &= f(X^k) + \sum_{i=1}^b {\mathbb{E}}\left[\left.\mathbb{I}\left(i\in S^k\right) \left( 2 t_i \left\| \nabla_i f(X^k) - M_i^k \right\|_{(i) \star} - t_i \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right)\right\vert X^k, \{M_i^k\}_{i\in[b]}\right] \\ &\quad+ \sum_{i=1}^b {\mathbb{E}}\left[\left.\mathbb{I}\left(i\in S^k\right) \frac{L^0_{i,S} + L_{i,S}^1 \left\| \nabla_i f(X^k) \right\|_{(i) \star}}{2} t_i^2\right\vert X^k, \{M_i^k\}_{i\in[b]}\right] \\ &\leq f(X^k) + \sum_{i=1}^b {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \left( 2 t_i \bar{\rho}_i \left\| \nabla_i f(X^k) - M_i^k \right\|_2 - t_i \left\| \nabla_i f(X^k) \right\|_{(i) \star} \right) \\ &\quad+ \frac{1}{2} \sum_{i=1}^b {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] t_i^2 + \frac{1}{2} \sum_{i=1}^b {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] t_i^2 \left\| \nabla_i f(X^k) \right\|_{(i) \star}. \end{align}\] Hence, taking full expectation \[\begin{align} &&{\mathbb{E}}\left[f(X^{k+1})\right] \\ &\leq& {\mathbb{E}}\left[f(X^k)\right] + \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) - M_i^k \right\|_2\right] \\ &&- \sum_{i=1}^b {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] t_i {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \frac{1}{2} \sum_{i=1}^b {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] t_i^2 + \frac{1}{2} \sum_{i=1}^b {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] t_i^2 {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right]. \end{align}\] To simplify the notation, let \(\delta^k :={\mathbb{E}}\left[f(X^k) - f^{\star}\right]\) and \(P_i^k :={\mathbb{E}}\left[\left\| \nabla_i f(X^k) - M_i^k \right\|_2\right]\). 
Then, according to the descent inequality above and 7 \[\begin{align} \delta^{k+1} &\leq& \delta^k - \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] + \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] P_i^k \nonumber \\ &&+ \frac{1}{2} \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + \frac{1}{2} \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right], \tag{45} \\ P_i^k &\leq& \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k P_i^0 + \frac{t_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right]}{\underline{\rho}_i \beta_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right]} \nonumber \\ &&+ \frac{t_i}{\underline{\rho}_i} {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] \sum_{l=0}^{k-1} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right] \nonumber \\ &&+ \sigma_i \sqrt{\beta_i}. \tag{46} \end{align}\] Applying 46 in 45 , \[\begin{align} &&\delta^{k+1} \\ &\leq& \delta^k - \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \left( \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k P_i^0 + \frac{t_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right]}{\underline{\rho}_i \beta_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right]} \right) \\ &&+ \sum_{i=1}^b \Bigg(\frac{2 t_i^2 \bar{\rho}_i}{\underline{\rho}_i} {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] \\ &&\qquad\qquad\times \sum_{l=0}^{k-1} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right]\Bigg) \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \sigma_i \sqrt{\beta_i} \\ &&+ \frac{1}{2} \sum_{i=1}^b t_i^2{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + \frac{1}{2} \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &=& \delta^k - \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k P_i^0 \\ &&+ \sum_{i=1}^b \Bigg(\frac{2 t_i^2 \bar{\rho}_i}{\underline{\rho}_i} {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] \\ &&\qquad\qquad\times \sum_{l=0}^{k-1} \left( 1 - 
{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right]\Bigg) \\ &&+ \frac{1}{2} \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \frac{1}{2} \sum_{i=1}^b t_i^2{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + \sum_{i=1}^b \frac{2 t_i^2 \bar{\rho}_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right]}{\beta_i \underline{\rho}_i} \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \sigma_i \sqrt{\beta_i}. \end{align}\] Let us now look at the terms involving the gradient norms. Using 12, we get \[\begin{align} &&\sum_{i=1}^b \Bigg(t_i^2 \underbrace{\frac{2 \bar{\rho}_i}{\underline{\rho}_i} {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right]}_{:=b_i} \\ &&\qquad\qquad\times \sum_{l=0}^{k-1} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right]\Bigg) \\ &=& \sum_{l=0}^{k-1} \sum_{i=1}^b t_i^2 b_i \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} {\mathbb{E}}\left[\left\| \nabla_i f(X^l) \right\|_{(i) \star}\right] \\ &\leq& \sum_{l=0}^{k-1} {\mathbb{E}}\left[4 \max_{i\in [b]} \left( t_i^2 b_i \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} L_{i,S^l}^1 \right) \left( f(X^l) - f^{\star} \right)\right] \\ &&+ {\mathbb{E}}\left[\sum_{i=1}^b \frac{t_i^2 b_i \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} L_{i,S^l}^0}{L_{i,S^l}^1}\right] \\ &\leq& 4 \sum_{l=0}^{k-1} {\mathbb{E}}\left[{\mathbb{E}}\left[\left.\max_{i\in [b]} \left( t_i^2 b_i \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} \right) \max_{i\in [b]} \left( L_{i,S^l}^1 \right) \left( f(X^l) - f^{\star} \right)\right\vert X^l\right]\right] \\ &&+ \sum_{l=0}^{k-1} \sum_{i=1}^b t_i^2 b_i \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \\ &=& 4 \sum_{l=0}^{k-1} \max_{i\in [b]} \left( t_i^2 b_i \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} \right) {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right] \delta^l \\ &&+ \sum_{i=1}^b t_i^2 b_i {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \sum_{l=0}^{k-1} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} \\ &\leq& 4 \max_{i\in [b]} \left( t_i^2 b_i \right) {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right] \sum_{l=0}^{k-1} \max_{i\in [b]} \left( \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^{k-1-l} \right) \delta^l \\ &&+ \sum_{i=1}^b \frac{t_i^2 b_i}{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i} {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \\ &=& 4 \max_{i\in [b]} \left( t_i^2 b_i \right) {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right] \sum_{l=0}^{k-1} \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in 
\hat{S}\right)\right] \beta_i \right) \right)^{k-1-l} \delta^l \\ &&+ \sum_{i=1}^b \frac{t_i^2 b_i}{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i} {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right], \end{align}\] and \[\begin{align} &&\sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &\leq& {\mathbb{E}}\left[4 \max_{i\in [b]} \left( t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] L_{i,S^k}^1 \right) \left( f(X^k) - f^{\star} \right) + \sum_{i=1}^b \frac{t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] L_{i,S^k}^0}{L_{i,S^k}^1}\right] \\ &\leq& 4 \max_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] \right) {\mathbb{E}}\left[\max_{i\in [b]} \left( t_i^2 L_{i,S^k}^1 \right) \left( f(X^k) - f^{\star} \right)\right] \\ &&+ \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,S^k}^0}{L_{i,S^k}^1}\right] \\ &=& 4 \max_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] \right) {\mathbb{E}}\left[{\mathbb{E}}\left[\left.\max_{i\in [b]} \left( t_i^2 L_{i,S^k}^1 \right) \left( f(X^k) - f^{\star} \right)\right\vert X^k\right]\right] \\ &&+ \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \\ &=& 4 \max_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] \right) {\mathbb{E}}\left[\max_{i\in [b]} \left( t_i^2 L_{i,\hat{S}}^1 \right)\right] \delta^k \\ &&+ \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right]. 
\end{align}\] Thus \[\begin{align} &&\delta^{k+1} \\ &\leq& \delta^k - \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k P_i^0 \\ &&+ 4 \max_{i\in [b]} \left( t_i^2 b_i \right) {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right] \sum_{l=0}^{k-1} \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) \right)^{k-1-l} \delta^l \\ &&+ \sum_{i=1}^b \frac{t_i^2 b_i}{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i} {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] + 2 \max_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] \right) {\mathbb{E}}\left[\max_{i\in [b]} \left( t_i^2 L_{i,\hat{S}}^1 \right)\right] \delta^k \\ &&+ \frac{1}{2} \sum_{i=1}^b t_i^2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \\ &&+ \frac{1}{2} \sum_{i=1}^b t_i^2{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + \sum_{i=1}^b \frac{2 t_i^2 \bar{\rho}_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right]}{\beta_i \underline{\rho}_i} \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \sigma_i \sqrt{\beta_i} \\ &\leq& \left( 1 + 2 \max_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] \right) {\mathbb{E}}\left[\max_{i\in [b]} \left( t_i^2 L_{i,\hat{S}}^1 \right)\right] \right) \delta^k \\ &&- \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ 4 \max_{i\in [b]} \left( t_i^2 b_i \right) {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right] \sum_{l=0}^{k-1} \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) \right)^{k-1-l} \delta^l \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k P_i^0 \\ &&+ \sum_{i=1}^b t_i^2 \left( \frac{2 \bar{\rho}_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right]}{\beta_i \underline{\rho}_i} + \frac{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right]}{2} \right) {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \\ &&+ \sum_{i=1}^b t_i^2 \left( \frac{2 \bar{\rho}_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right]}{\beta_i \underline{\rho}_i} + \frac{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right]}{2} \right) \\ &&+ 2 \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \bar{\rho}_i \sigma_i \sqrt{\beta_i} \\ &=& \left( 1 + c_1 \right) \delta^k - \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ c_2 \sum_{l=0}^{k-1} \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) 
\right)^{k-1-l} \delta^l \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k P_i^0 + \sum_{i=1}^b t_i^2 c_{3,i} + \sum_{i=1}^b t_i c_{4,i}, \end{align}\] where \[\begin{align} c_1 &:=& 2 \max_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] \right) {\mathbb{E}}\left[\max_{i\in [b]} \left( t_i^2 L_{i,\hat{S}}^1 \right)\right], \\ c_2 &:=& 4 \max_{i\in [b]} \left( t_i^2 b_i \right) {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right], \\ c_{3,i} &:=& \left( \frac{2 \bar{\rho}_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right]}{\beta_i \underline{\rho}_i} + \frac{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right]}{2} \right) {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \\ &&+ \left( \frac{2 \bar{\rho}_i {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right]}{\beta_i \underline{\rho}_i} + \frac{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right]}{2} \right), \\ c_{4,i} &:=& 2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \bar{\rho}_i \sigma_i \sqrt{\beta_i}. \end{align}\] Now, let us introduce a weighting sequence \(w^k :=w^{k-1} \left( 1 + c_1 + \frac{c_2}{\min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)} \right)^{-1}\), where \(w^{-1} = 1\) and \(W^K :=\sum_{k=0}^{K} w^k\). Then, multiplying the inequality above by \(w^k\) and summing over the first \(K+1\) iterations gives \[\begin{align} \sum_{k=0}^{K} w^k \delta^{k+1} &\leq& \sum_{k=0}^{K} w^k \left( 1 + c_1 \right) \delta^k - \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \sum_{k=0}^{K} w^k c_2 \sum_{l=0}^{k-1} \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) \right)^{k-1-l} \delta^l \\ &&+ \sum_{k=0}^{K} w^k \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k P_i^0 \\ &&+ \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i^2 c_{3,i} + \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i c_{4,i} \\ &\leq& \left( 1 + c_1 \right) \sum_{k=0}^{K} w^k \delta^k - \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ c_2 \sum_{l=0}^{K-1} \sum_{k=l+1}^{K} w^k \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) \right)^{k-1-l} \delta^l \\ &&+ \sum_{i=1}^b 2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] P_i^0 \sum_{k=0}^{K} \left( 1 - {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)^k \\ &&+ W^K \sum_{i=1}^b t_i^2 c_{3,i} + W^K \sum_{i=1}^b t_i c_{4,i}, \end{align}\] where in the last line we used the fact that \(w^k \leq w^{k-1} \leq w^{-1} = 1\). 
Therefore, \[\begin{align} \sum_{k=0}^{K} w^k \delta^{k+1} &\leq& \left( 1 + c_1 \right) \sum_{k=0}^{K} w^k \delta^k - \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ c_2 \sum_{l=0}^{K-1} \sum_{k=l+1}^{K} w^l \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) \right)^{k-1-l} \delta^l \\ &&+ \sum_{i=1}^b \frac{2 t_i \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right]}{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i} P_i^0 + W^K \sum_{i=1}^b t_i^2 c_{3,i} + W^K \sum_{i=1}^b t_i c_{4,i} \\ &\leq& \left( 1 + c_1 \right) \sum_{k=0}^{K} w^k \delta^k - \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ c_2 \sum_{l=0}^{K-1} w^l \delta^l \sum_{k=l+1}^{K} \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) \right)^{k-1-l} \\ &&+ \sum_{i=1}^b \frac{2 t_i \bar{\rho}_i}{\beta_i} P_i^0 + W^K \sum_{i=1}^b t_i^2 c_{3,i} + W^K \sum_{i=1}^b t_i c_{4,i} \\ &\leq& \left( 1 + c_1 \right) \sum_{k=0}^{K} w^k \delta^k - \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ c_2 \sum_{l=0}^{K-1} w^l \delta^l \sum_{k=0}^{\infty} \left( 1 - \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right) \right)^k \\ &&+ \sum_{i=1}^b \frac{2 t_i \bar{\rho}_i}{\beta_i} P_i^0 + W^K \sum_{i=1}^b t_i^2 c_{3,i} + W^K \sum_{i=1}^b t_i c_{4,i} \\ &\leq& \left( 1 + c_1 + \frac{c_2}{\min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)} \right) \sum_{k=0}^{K} w^k \delta^k \\ &&- \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \sum_{i=1}^b \frac{2 t_i \bar{\rho}_i}{\beta_i} P_i^0 + W^K \sum_{i=1}^b t_i^2 c_{3,i} + W^K \sum_{i=1}^b t_i c_{4,i} \\ &=& \sum_{k=0}^{K} w^{k-1} \delta^k - \sum_{k=0}^{K} w^k \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &&+ \sum_{i=1}^b \frac{2 t_i \bar{\rho}_i}{\beta_i} P_i^0 + W^K \sum_{i=1}^b t_i^2 c_{3,i} + W^K \sum_{i=1}^b t_i c_{4,i}. \end{align}\] Rearranging the terms and dividing by \(W^K\), we get \[\begin{align} &&\min_{k=0,\ldots,K} \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &\leq& \sum_{k=0}^{K} \sum_{i=1}^b \frac{w^k}{W^K} t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &\leq& \frac{1}{W^K} \sum_{k=0}^{K} \left( w^{k-1} \delta^k - w^k \delta^{k+1} \right) + \frac{1}{W^K} \sum_{i=1}^b \frac{2 t_i \bar{\rho}_i}{\beta_i} P_i^0 + \sum_{i=1}^b t_i^2 c_{3,i} + \sum_{i=1}^b t_i c_{4,i} \\ &\leq& \frac{\delta^0}{W^K} + \frac{1}{W^K} \sum_{i=1}^b \frac{2 t_i \bar{\rho}_i}{\beta_i} P_i^0 + \sum_{i=1}^b t_i^2 c_{3,i} + \sum_{i=1}^b t_i c_{4,i}. 
\end{align}\] Now, note that taking \(t_i = \frac{\eta_i}{(K+1)^{3/4}}\), where \[\textstyle\eta_i^2 \leq \min\left\{ \frac{(K+1)^{1/2}}{4 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right]}, \frac{(K+1)^{1/2} \underline{\rho}_i \min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)}{16 \bar{\rho}_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right]}, 1 \right\},\] we have \[\begin{align} 2 (K+1) \max_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] \right) {\mathbb{E}}\left[\max_{i\in [b]} \left( t_i^2 L_{i,\hat{S}}^1 \right)\right] &\leq \frac{1}{2}, \\ (K+1) \frac{4 \max_{i\in [b]} \left( t_i^2 b_i \right) {\mathbb{E}}\left[\max_{i\in [b]} L_{i,\hat{S}}^1\right]}{\min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)} &\leq \frac{1}{2}, \end{align}\] and hence \((K+1) \left( c_1 + \frac{c_2}{\min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)} \right) \leq 1\). Thus, the weights satisfy \[\begin{align} W^K &= \sum_{k=0}^{K} w^k \geq (K+1) w^K = \frac{(K+1) w^{-1}}{\left( 1 + c_1 + \frac{c_2}{\min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)} \right)^{K+1}} \\ &\geq \frac{K+1}{\exp\left( (K+1) \left( c_1 + \frac{c_2}{\min_{i\in [b]} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \beta_i \right)} \right) \right)} \geq \frac{K+1}{\exp(1)} \geq \frac{K+1}{3}, \end{align}\] meaning that \[\begin{align} &\min_{k=0,\ldots,K} \sum_{i=1}^b t_i {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &\leq \frac{3 \delta^0}{K+1} + \frac{6}{K+1} \sum_{i=1}^b \frac{t_i \bar{\rho}_i}{\beta_i} P_i^0 + \sum_{i=1}^b t_i^2 c_{3,i} + \sum_{i=1}^b t_i c_{4,i} \\ &= \frac{3 \delta^0}{K+1} + \frac{6}{K+1} \sum_{i=1}^b \frac{\eta_i \bar{\rho}_i}{\beta_i (K+1)^{3/4}} P_i^0 \\ &\quad+ \sum_{i=1}^b \frac{\eta_i^2}{(K+1)^{3/2}} \frac{2 \bar{\rho}_i}{\beta_i \underline{\rho}_i} \left( {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right] + {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &\quad+ \sum_{i=1}^b \frac{\eta_i^2}{2 (K+1)^{3/2}} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &\quad+ \sum_{i=1}^b \frac{2 \eta_i}{(K+1)^{3/4}} {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \bar{\rho}_i \sigma_i \sqrt{\beta_i}. 
\end{align}\] Dividing by \(\frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] t_l = \frac{1}{(K+1)^{3/4}} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l\) gives \[\begin{align} &\min_{k=0,\ldots,K} \sum_{i=1}^b \frac{{\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \eta_i}{\frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} {\mathbb{E}}\left[\left\| \nabla_i f(X^k) \right\|_{(i) \star}\right] \\ &\leq \frac{3 \delta^0}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} + \frac{6}{K+1} \sum_{i=1}^b \frac{\eta_i \bar{\rho}_i}{\beta_i \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} P_i^0 \\ &\quad+ \sum_{i=1}^b \frac{\eta_i^2}{(K+1)^{3/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \frac{2 \bar{\rho}_i}{\beta_i \underline{\rho}_i} \\ &\qquad \times\left( {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^0\right] + {\mathbb{E}}\left[\left( 1-\mathbb{I}\left(i\in \hat{S}\right) \beta_i \right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &\quad+ \sum_{i=1}^b \frac{\eta_i^2}{2 (K+1)^{3/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &\quad+ \sum_{i=1}^b \frac{2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \eta_i}{\frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \bar{\rho}_i \sigma_i \sqrt{\beta_i} \\ &\leq \frac{3 \delta^0}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} + \frac{6}{(K+1)^{1/2}} \sum_{i=1}^b \frac{\eta_i \bar{\rho}_i}{\frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} P_i^0 \\ &\quad+ \sum_{i=1}^b \frac{\eta_i^2}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \frac{2 \bar{\rho}_i}{\underline{\rho}_i} \left( {\mathbb{E}}\left[L_{i,\hat{S}}^0\right] + {\mathbb{E}}\left[L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &\quad+ \sum_{i=1}^b \frac{\eta_i^2}{2 (K+1)^{3/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \left( {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L^0_{i,\hat{S}}\right] + {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right) L_{i,\hat{S}}^1\right] {\mathbb{E}}\left[\frac{L_{i,\hat{S}}^0}{L_{i,\hat{S}}^1}\right] \right) \\ &\quad+ \sum_{i=1}^b \frac{2 {\mathbb{E}}\left[\mathbb{I}\left(i\in \hat{S}\right)\right] \eta_i}{(K+1)^{1/4} \frac{1}{b} \sum_{l=1}^b {\mathbb{E}}\left[\mathbb{I}\left(l \in \hat{S}\right)\right] \eta_l} \bar{\rho}_i \sigma_i, \end{align}\] where in the last step we set \(\beta_i = \frac{1}{(K+1)^{1/2}}\).
Finally, the initialization \(M_i^0 = \nabla_i f(X^0; \xi^0)\) guarantees (via Jensen's inequality) that \[\begin{align} P_i^0 :={\mathbb{E}}\left[\left\| \nabla_i f(X^0) - M_i^0 \right\|_2\right] \leq \sqrt{{\mathbb{E}}\left[\left\| \nabla_i f(X^0) - \nabla_i f(X^0; \xi^0) \right\|^2_2\right]} \overset{\eqref{as:bounded_var}}{\leq} \sigma_i. \end{align}\] ◻

13 Experiments↩︎

We evaluate Drop-Muon on three standard benchmarks (MNIST, Fashion-MNIST, and CIFAR-10) using 3-layer convolutional neural networks (CNNs) of varying capacity. Our goal is to study how the partial layer updates of Drop-Muon accelerate training relative to standard Muon (recovered as the special case of Drop-Muon with full-network updates).

Experiments were run on various NVIDIA GPUs, depending on availability: MNIST and Fashion-MNIST experiments used a Tesla V100-SXM2-32GB, while CIFAR-10 experiments used a mix of A100-SXM4-80GB, Tesla P100-PCIE-16GB, and GeForce GTX 1080 Ti GPUs.

13.1 Key Training Parameters↩︎

We identify several key training parameters and conduct a grid search over them to systematically explore their effects.

13.1.0.1 Layer sampling distribution.

The choice of layer sampling distribution determines which layers are updated at each iteration and is central to the behavior of Drop-Muon. We considered the following strategies (a short code sketch of these distributions follows the list):

  • Uniform Distribution: Each layer is sampled with equal probability (\(p_i = \frac{1}{b}\) for all \(i\in[b]\) in the notation from 4).

  • Linear Distribution: Sampling probability increases linearly with the layer index, favoring deeper layers. Empirically, this distribution performed poorly.

  • Quadratic Distribution: Probability increases quadratically with depth, strongly biasing toward deeper layers. Similarly, this performed poorly.

  • Exponential Distribution: Layer probability grows exponentially with depth, emphasizing deeper layers. This also proved suboptimal.

  • Epoch-Shift Distribution: The distribution biases sampling towards shallow layers at the beginning of training and gradually shifts it towards deeper layers as training progresses (see 13.2 for details). This dynamic adjustment may help balance early feature extraction with later complex feature learning during optimization (see the example in 6).
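
As a minimal sketch (assuming NumPy and illustrative function names; whether a single layer, a subset, or a cutoff index is drawn is specified by the sampling scheme in 4), the static distributions above can be generated and sampled from as follows:

```python
import numpy as np

def layer_probs(b: int, kind: str = "uniform") -> np.ndarray:
    """Return sampling probabilities over the b layers (index 0 = shallowest)."""
    depth = np.arange(1, b + 1, dtype=float)   # 1, ..., b; larger = deeper
    if kind == "uniform":
        w = np.ones(b)                         # p_i = 1/b
    elif kind == "linear":
        w = depth                              # grows linearly with depth
    elif kind == "quadratic":
        w = depth ** 2                         # grows quadratically with depth
    elif kind == "exponential":
        w = np.exp(depth)                      # grows exponentially with depth
    else:
        raise ValueError(f"unknown distribution: {kind}")
    return w / w.sum()

rng = np.random.default_rng(0)
p = layer_probs(b=3, kind="uniform")
sampled = rng.choice(len(p), size=1, replace=False, p=p)   # layer indices to update this step
```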

13.1.0.2 Batch size and learning rate.

We test batch sizes \(\{64, 512, 8192, 16384, 32768\}\) and learning rates \(\{0.1, 0.01, 0.001\}\).

13.1.0.3 Model depth and capacity.

To assess the impact of network complexity, we train multiple 3-layer CNNs with varying channel configurations: \([8,16,32]\), \([16,32,64]\), \([64,128,256]\), \([128,256,512]\), \([64,16,8]\), and \([256,128,64]\).

13.1.0.4 Fixed parameters.

We fix the number of epochs to \(20\) or \(50\), momentum to \(0.5\), and the number of Newton-Schulz iterations to \(5\) (see 7.1).

This grid search allows us to systematically evaluate Drop-Muon. While many combinations performed poorly (and are therefore omitted from the results that follow), the experiments highlighted configurations, particularly those using uniform or epoch-shift distributions, that consistently reduce training time while maintaining final accuracy.
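
Since the number of Newton-Schulz iterations is among the fixed parameters, we also include a minimal sketch of the classical (cubic) Newton-Schulz iteration for approximately orthogonalizing a momentum matrix, i.e. mapping \(G = U \Sigma V^\top\) to roughly \(U V^\top\); the routine referenced in 7.1 relies on a tuned polynomial, so this is only an illustrative variant and the function name is illustrative as well:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map G = U S V^T to its orthogonal factor U V^T.

    Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, applied after normalizing
    G so that its singular values lie in (0, 1], where the iteration converges to 1.
    """
    X = G / (G.norm() + eps)        # Frobenius normalization
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```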

13.2 Epoch-Shift Distribution↩︎

Figure 6: Evolution of the layer sampling distribution as a function of the epoch. Shallow layers are sampled more often in the first epochs, but their sampling probabilities decrease as training progresses. This effect can be amplified or attenuated by varying the value of \(\alpha\); here we chose \(\alpha=0.5\).

Drop-Muon can dynamically adjust which layers are updated at each iteration. The strategy we consider is the epoch-shift distribution, which biases training towards shallow layers in the early epochs and gradually shifts focus to deeper layers as training progresses (see 6). Such a schedule balances early feature extraction with later complex feature learning, which can improve convergence in practice.

Formally, let \(b\) denote the total number of layers (indexed here as \(i = 0, \ldots, b-1\), from shallowest to deepest), let \(\alpha\) be a constant controlling the sharpness of the bias, and define the training progress as \[\text{progress} = \frac{\text{epoch}}{\text{max\_epochs}} \in [0,1].\] The weight of layer \(i\) is given by \[w_i = \exp \left( \alpha \left[ (1 - \text{progress})(b - 1 - i) + \text{progress} \cdot i \right] \right).\] We then normalize the weights to obtain a valid probability distribution: \[W = \sum_{i=0}^{b-1} w_i, \qquad p_i = \frac{w_i}{W}.\]

By adjusting \(\alpha\), one can control the rate at which training emphasis shifts from shallow to deeper layers. In 6, we set \(\alpha = 0.5\).
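
A minimal sketch of this schedule, directly implementing the formulas above (the function name and the use of NumPy are illustrative):

```python
import numpy as np

def epoch_shift_probs(b: int, epoch: int, max_epochs: int, alpha: float = 0.5) -> np.ndarray:
    """Layer sampling probabilities p_0, ..., p_{b-1} for the epoch-shift schedule."""
    progress = epoch / max_epochs                     # training progress in [0, 1]
    i = np.arange(b)                                  # layers indexed 0, ..., b-1
    w = np.exp(alpha * ((1.0 - progress) * (b - 1 - i) + progress * i))
    return w / w.sum()

# Early in training the mass sits on shallow layers; late in training, on deep ones.
print(epoch_shift_probs(b=3, epoch=0, max_epochs=20))    # favors layer 0
print(epoch_shift_probs(b=3, epoch=20, max_epochs=20))   # favors layer b-1
```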

13.3 Evaluation Metrics↩︎

We repeat every experiment over multiple random seeds. When reporting aggregated results, we use two complementary procedures:

  • Normalized curve averaging: We normalize wall-clock time for each run so that Muon always ends at \(t=1\), interpolate both the Muon and Drop-Muon accuracy curves onto this common time grid, and average over seeds.

  • Time-to-target evaluation: We measure the wall-clock time needed to reach fixed accuracy thresholds (e.g. \(60\%\), \(70\%\), \(80\%\), ...), reporting the ratio of Muon's time to Drop-Muon's time. This metric is interpolation-free and directly quantifies the practical speedup; a sketch of the computation is given after this list.
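
A minimal sketch of the time-to-target computation (assuming NumPy arrays of logged wall-clock times and accuracies; the exact interpolation and aggregation details may differ):

```python
import numpy as np

def time_to_target(times: np.ndarray, accs: np.ndarray, target: float) -> float:
    """Wall-clock time of the first measurement reaching `target` accuracy (NaN if never reached)."""
    hit = np.nonzero(accs >= target)[0]
    return float(times[hit[0]]) if hit.size else float("nan")

def speedup(times_muon, accs_muon, times_drop, accs_drop, target: float) -> float:
    """Ratio of Muon's time-to-target to Drop-Muon's; values above 1 favor Drop-Muon."""
    return time_to_target(times_muon, accs_muon, target) / time_to_target(times_drop, accs_drop, target)
```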

13.4 Results on MNIST↩︎

We first evaluate Drop-Muon on the MNIST dataset. Figure 2 shows a representative run comparing standard full-network Muon training with Drop-Muon using a uniform layer sampling distribution. Although the per-epoch train accuracy of Drop-Muon is initially lower than that of full-network Muon, the picture in wall-clock time is different: for training times up to approximately 150 seconds, Drop-Muon consistently achieves higher accuracy. In practice, this means that any train accuracy up to \(95\%\) is reached faster with Drop-Muon, highlighting its efficiency advantage.

To account for variability across independent runs, we evaluate two aggregation strategies. 7 shows normalized curve averaging, where each run is rescaled to a common time grid. Figure 3 (left) presents the averaged time-to-target evaluation across multiple seeds, reporting the ratio of Muon time to Drop-Muon time for fixed accuracy thresholds.

The time-to-target results clearly demonstrate a speedup of up to \(1.4\times\) across the \(60\%\), \(70\%\), \(80\%\), and \(99\%\) thresholds, with a slightly lower speedup at the \(90\%\) threshold. We note variability across seeds, reflecting the stochastic nature of layer sampling. Despite this, Drop-Muon consistently provides practical acceleration throughout training.

Figure 7: Normalized curve averaging of several runs of Muon and Drop-Muon with uniform index sampling on MNIST. Batch size = 8192, learning rate = 0.1, channels = [64,128,256].

13.5 Results on Fashion-MNIST↩︎

Figure 8 shows a typical run comparing standard Muon training with Drop-Muon using the epoch-shift layer sampling distribution. Early in training, Drop-Muon achieves faster progress in wall-clock time, even though its per-epoch accuracy is slightly lower.

Figure 8: Evolution of the training accuracy for Muon and Drop-Muon with epoch-shift index sampling on Fashion-MNIST. Batch size = 32768, learning rate = 0.1, channels = [64,128,256].

Normalized curve averaging across multiple seeds is presented in 9. Although the curves appear smoother at higher accuracy, this should not be interpreted as a reduction in variance; early-stage training is inherently noisier due to layer sampling, whereas later training involves fewer updates per unit of wall-clock time.

Figure 9: Normalized curve averaging of several runs of Muon and Drop-Muon with epoch-shift index sampling on Fashion-MNIST. Batch size = 32768, learning rate = 0.1, channels = [64,128,256].

Figure 3 (right) reports the averaged time-to-target evaluation. Drop-Muon consistently achieves speedups of approximately \(1.2\times\) across all considered accuracy thresholds. The \(99\%\) threshold is omitted, as neither Muon nor Drop-Muon reaches this level within the plotted runs.

Overall, the Fashion-MNIST experiments reinforce that, as on MNIST, Drop-Muon provides consistent practical speedups while maintaining comparable final accuracy.

13.6 Results on CIFAR-10↩︎

Figure 10: Normalized curve averaging of several runs of Muon and Drop-Muon with epoch-shift index sampling on CIFAR-10. Batch size = 8192, learning rate = 0.1, channels = [128,256,512].

CIFAR-10 requires a more sophisticated network architecture. We use a CNN with channels \([128,256,512]\) and insert batch normalization layers between each convolution and its ReLU activation to improve training stability.
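
A sketch of the architecture just described (kernel sizes, pooling, and the classifier head are not specified above and are chosen here only for illustration):

```python
import torch.nn as nn

def make_cifar10_cnn(channels=(128, 256, 512), num_classes=10) -> nn.Sequential:
    """3-block CNN with batch normalization inserted between each convolution and its ReLU."""
    layers, in_ch = [], 3
    for out_ch in channels:
        layers += [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),   # between convolution and ReLU, as described above
            nn.ReLU(),
            nn.MaxPool2d(2),          # 32 -> 16 -> 8 -> 4 spatial resolution
        ]
        in_ch = out_ch
    layers += [nn.Flatten(), nn.Linear(channels[-1] * 4 * 4, num_classes)]
    return nn.Sequential(*layers)
```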

4 shows an example run comparing standard full-network Muon training with Drop-Muon using the epoch-shift layer sampling distribution. While Drop-Muon has slightly lower per-epoch training accuracy, it achieves faster progress in terms of wall-clock time. 10 shows normalized curve averaging across multiple seeds (the flat end of the curve reflects that Drop-Muon finishes training earlier).

Time-to-target results in 11 indicate a notable speedup at the \(90\%\) train accuracy threshold. However, for lower thresholds (\(60\%, 70\%, 80\%\)), speedups are minimal or absent for this configuration. We attribute this limitation primarily to suboptimal hyperparameter choices (see 6).

Figure 11: Averaged time-to-target speed-up over multiple runs comparing Muon and Drop-Muon with epoch-shift index sampling on CIFAR-10 with batch size 8192, learning rate 0.1, and channels [128,256,512].

13.7 Discussion and Practical Remarks↩︎

We summarize several key observations and practical considerations arising from our experiments:

  • Simplicity of implementation: Drop-Muon can be implemented in just a few lines of code (see the code sketch after this list), making it easy to integrate into existing training pipelines.

  • Further benefits from tuning: Performance can be further improved through dedicated tuning of hyperparameters (see 6).

  • Potential for implementation improvements:

    • Toggling which layers require gradient computations at every iteration can be costly for large models. Sampling the cutoff layer only once every few iterations can reduce this overhead and improve efficiency.

    • When the batch size or the model is small, the per-iteration cost may be dominated by the Newton-Schulz routine rather than by the gradient computation. In such cases, computing all gradients but orthogonalizing only a subset of them can be advantageous.

  • Variance across seeds: The method exhibits non-negligible variance, particularly on smaller datasets. Stabilization mechanisms could mitigate this.

  • Adaptive learning of sampling distributions: A promising future direction is to learn the layer sampling distribution online, so that it automatically adapts to the training dynamics of a given dataset and architecture.
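
To complement the first remark, the following is a minimal sketch of a single Drop-Muon step (assuming PyTorch tensors, a simple momentum convention, and the cubic Newton-Schulz variant sketched in 13.1; the exact update rule is given in 4, and actual implementations may differ):

```python
import torch

@torch.no_grad()
def drop_muon_step(weights, momenta, grads, sampled, t=0.1, beta=0.5, ns_steps=5):
    """One Drop-Muon step: only layers whose indices appear in `sampled` are updated;
    the remaining layers keep their weights and momentum buffers untouched."""
    for i in sampled:
        # Momentum update for the sampled layer (convention assumed here).
        momenta[i].mul_(1 - beta).add_(grads[i], alpha=beta)
        # Approximate the spectral-norm LMO direction via Newton-Schulz orthogonalization.
        X = momenta[i] / (momenta[i].norm() + 1e-7)
        for _ in range(ns_steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X
        # X_i <- X_i + LMO_{B(0, t)}(M_i), i.e. subtract t times the orthogonalized momentum.
        weights[i].add_(X, alpha=-t)
```

Combined with one of the sampling distributions from 13.1 (e.g. uniform or epoch-shift), this step yields the full training loop in only a few additional lines.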

14 Useful Facts↩︎

For all \(X, G \in \mathcal{X}\) and \(t>0\), we have: \[\begin{align} \tag{47} \left\| {\rm LMO}_{\mathcal{B}(0,t)}(G) \right\| &= t, \\ \tag{48} \left\langle G,{\rm LMO}_{\mathcal{B}(0,t)}(G)\right\rangle &= -t \left\| G \right\|_\star, \\ \tag{49} \left\langle X,X^\sharp\right\rangle &= \left\| X^\sharp \right\|^2, \\ \tag{50} \left\| X \right\|_{\star} &= \left\| X^\sharp \right\|, \\ \tag{51} \left\langle H,X\right\rangle &= \left\| X \right\|_{\star}, \quad \left\| H \right\| = 1, \qquad\forall X\neq0, \end{align}\] where \(H \in \partial\left\| \cdot \right\|_{\star}(X)\) is any subgradient of the dual norm at \(X\).
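
For instance, for the spectral norm used by Muon on hidden layers (with the trace inner product), writing a reduced SVD \(G = U \Sigma V^\top\) of \(G \neq 0\) gives \[{\rm LMO}_{\mathcal{B}(0,t)}(G) = -t \, U V^\top, \qquad \left\langle G, -t \, U V^\top\right\rangle = -t \, \mathrm{tr}(\Sigma) = -t \left\| G \right\|_{\star},\] since the dual of the spectral norm is the nuclear norm; this instantiates 47 and 48.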

Lemma 8 (Variance decomposition). For any random vector \(X\in\mathcal{X}\) and any non-random \(c\in\mathcal{X}\), we have \[\begin{align} {\mathbb{E}}\left[\left\| X-c \right\|_2^2\right] = {\mathbb{E}}\left[\left\| X - {\mathbb{E}}\left[X\right] \right\|_2^2\right] + \left\| {\mathbb{E}}\left[X\right]-c \right\|_2^2. \end{align}\]

Lemma 9 ([5], Lemma 3). Suppose that \(x_1, \ldots, x_b, y_1, \ldots, y_b \in \mathbb{R}\), \(\max_{i\in[b]} |x_i| > 0\), and \(z_1, \ldots, z_b > 0\). Then \[\begin{align} \sum_{i=1}^b \frac{y_i^2}{z_i} \geq \frac{\left( \sum_{i=1}^b x_i y_i \right)^2}{\sum_{i=1}^b z_i x_i^2}. \end{align}\]

Lemma 10. Let 4 hold and let \(S\subseteq[b]\). Then, for any vectors \(X = [X_1, \ldots, X_b]\in \mathcal{X}\) and \(\Gamma = [\Gamma_1, \ldots, \Gamma_b] \in \mathcal{X}\) such that \(\Gamma_i = 0\) for all \(i\not\in S\), \[\begin{align} \left|f(X + \Gamma) - f(X) - \left\langle\nabla f(X),\Gamma\right\rangle\right| \leq \sum_{i\in S} \frac{L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}}{2}\left\| \Gamma_i \right\|_{(i)}^2. \end{align}\]

Proof. For all \(X \in \mathcal{X}\) we have \[\begin{align} f(X + \Gamma) &= f(X) + \int_0^1 \left\langle\nabla f(X + \tau \Gamma),\Gamma\right\rangle d \tau \\ &= f(X) + \int_0^1 \left\langle\nabla f(X + \tau \Gamma) - \nabla f(X),\Gamma\right\rangle d \tau + \left\langle\nabla f(X),\Gamma\right\rangle. \end{align}\] Therefore, using the Cauchy-Schwarz inequality, \[\begin{align} \left|f(X + \Gamma) - f(X) - \left\langle\nabla f(X),\Gamma\right\rangle\right| &=&\left|\int_0^1 \sum_{i=1}^b \left\langle\nabla_i f(X + \tau \Gamma)-\nabla_i f(X),\Gamma_i\right\rangle_{(i)} d \tau \right| \\ &=&\left|\int_0^1 \sum_{i\in S} \left\langle\nabla_i f(X + \tau \Gamma)-\nabla_i f(X),\Gamma_i\right\rangle_{(i)} d \tau \right| \\ &\leq&\int_0^1 \sum_{i\in S} \left| \left\langle\nabla_i f(X + \tau \Gamma)-\nabla_i f(X),\Gamma_i\right\rangle_{(i)} \right| d \tau \\ &\leq& \int_0^1 \sum_{i\in S} \left\| \nabla_i f(X + \tau \Gamma)-\nabla_i f(X) \right\|_{(i) \star} \left\| \Gamma_i \right\|_{(i)} d \tau \\ &\overset{\eqref{as:arbitrary_layer_gen_smoothness2}}{\leq}& \int_0^1 \sum_{i\in S} \tau \left( L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star} \right) \left\| \Gamma_i \right\|_{(i)}^2 d \tau\\ &=& \sum_{i\in S} \frac{L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}}{2}\left\| \Gamma_i \right\|_{(i)}^2. \end{align}\] ◻

Lemma 11. Let Assumptions 1 and 3 hold and let \(S\subseteq[b]\). Then \[\begin{align} \sum_{i\in S} \frac{\left\| \nabla_i f(X) \right\|^2_{(i) \star}}{2 \left( L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star} \right)} \leq f(X) - f^{\star} \end{align}\] for all \(X \in \mathcal{X}\).

Proof. Let \(Y = [Y_1, \ldots, Y_b] \in \mathcal{X}\), where \(Y_i = X_i - \frac{\left\| \nabla_i f(X) \right\|_{(i) \star}}{L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}} H_i\) for some \(H_i \in \partial \left\| \cdot \right\|_{(i) \star} (\nabla_i f(X))\) for \(i\in S\) and \(Y_i=X_i\) otherwise. By 10 \[\begin{align} f(Y) &\leq& f(X) + \left\langle\nabla f(X),Y-X\right\rangle + \sum_{i\in S} \frac{L^0_{i,S} + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}}{2} \left\| X_i - Y_i \right\|_{(i)}^2 \\ &=& f(X) + \sum_{i\in S} \left\langle\nabla_i f(X),Y_i - X_i\right\rangle_{(i)} + \sum_{i\in S} \frac{L^0_{i,S} + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}}{2} \left\| X_i - Y_i \right\|_{(i)}^2 \\ &=& f(X) - \sum_{i\in S} \frac{\left\| \nabla_i f(X) \right\|_{(i) \star}}{L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}} \left\langle\nabla_i f(X),H_i\right\rangle_{(i)} \\ &&+ \sum_{i\in S} \left( \frac{L^0_{i,S} + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}}{2} \frac{\left\| \nabla_i f(X) \right\|^2_{(i) \star}}{\left( L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star} \right)^2} \left\| H_i \right\|_{(i)}^2 \right) \\ &\overset{\eqref{eq:subdiff}}{=}& f(X) + \sum_{i\in S} \left( - \frac{\left\| \nabla_i f(X) \right\|^2_{(i) \star}}{L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}} + \frac{\left\| \nabla_i f(X) \right\|^2_{(i) \star}}{2 \left( L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star} \right)} \right) \\ &=& f(X) - \sum_{i\in S} \frac{\left\| \nabla_i f(X) \right\|^2_{(i) \star}}{2 \left( L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star} \right)}, \end{align}\] and hence \[\begin{align} \sum_{i\in S} \frac{\left\| \nabla_i f(X) \right\|^2_{(i) \star}}{2 \left( L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star} \right)} \leq f(X) - f(Y) \leq f(X) - f^{\star}. \end{align}\] ◻

Lemma 12. Let Assumptions 1 and 3 hold and let \(S\subseteq[b]\). Then, for any \(x_i >0\), \(i\in[b]\), we have \[\begin{align} \sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star} \leq 4 \max_{i\in S} (x_i L_{i,S}^1) \left( f(X) - f^{\star} \right) + \sum_{i\in S} \frac{x_i L_{i,S}^0}{L_{i,S}^1} \end{align}\] for all \(X \in \mathcal{X}\).

Proof. Applying 11 and 9 with \(y_i = \left\| \nabla_i f(X) \right\|_{(i) \star}\), \(z_i = L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}\) and any \(x_i>0\), we have \[\begin{align} 2 \left( f(X) - f^{\star} \right) &\geq& \sum_{i\in S} \frac{\left\| \nabla_i f(X) \right\|^2_{(i) \star}}{L_{i,S}^0 + L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}} \\ &\geq& \frac{\left( \sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star} \right)^2}{\sum_{i\in S} x_i^2 L_{i,S}^0 + \sum_{i\in S} x_i^2 L_{i,S}^1 \left\| \nabla_i f(X) \right\|_{(i) \star}} \\ &\geq& \frac{\left( \sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star} \right)^2}{\sum_{i\in S} x_i^2 L_{i,S}^0 + \max_{i\in S} (x_i L_{i,S}^1) \sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star}} \\ &\geq& \begin{cases} \frac{\left( \sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star} \right)^2}{2 \sum_{i\in S} x_i^2 L_{i,S}^0} & \textrm{if } \frac{\sum_{i\in S} x_i^2 L_{i,S}^0}{\max_{i\in S} (x_i L_{i,S}^1)} \geq \sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star}, \\ \frac{\sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star}}{2 \max_{i\in S} (x_i L_{i,S}^1)} & \textrm{otherwise}. \end{cases} \end{align}\] Therefore, \[\begin{align} \sum_{i\in S} x_i \left\| \nabla_i f(X) \right\|_{(i) \star} &\leq& \max\left\{ 4 \max_{i\in S} (x_i L_{i,S}^1) \left( f(X) - f^{\star} \right), \frac{\sum_{i\in S} x_i^2 L_{i,S}^0}{\max_{i\in S} (x_i L_{i,S}^1)} \right\} \\ &\leq& 4 \max_{i\in S} (x_i L_{i,S}^1) \left( f(X) - f^{\star} \right) + \frac{\sum_{i\in S} x_i^2 L_{i,S}^0}{\max_{i\in S} (x_i L_{i,S}^1)} \\ &\leq& 4 \max_{i\in S} (x_i L_{i,S}^1) \left( f(X) - f^{\star} \right) + \sum_{i\in S} \frac{x_i L_{i,S}^0}{L_{i,S}^1}. \end{align}\] ◻

References↩︎

[1]
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412.6980.
[2]
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1711.05101.
[3]
Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.
[4]
Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. arXiv preprint arXiv:2502.07529, 2025. URL https://arxiv.org/abs/2502.07529.
[5]
Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion great again! (Bridging theory and practice of LMO-based optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025. URL https://arxiv.org/abs/2505.13416.
[6]
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025. URL https://arxiv.org/abs/2502.16982.
[7]
Ishaan Shah, Anthony M Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, et al. Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222, 2025. URL https://arxiv.org/abs/2505.02222.
[8]
Benjamin Thérien, Xiaolong Huang, Irina Rish, and Eugene Belilovsky. MuLoCo: Muon is a practical inner optimizer for DiLoCo. arXiv preprint arXiv:2505.23725, 2025. URL https://arxiv.org/abs/2505.23725.
[9]
Moonshot AI. Kimi K2: Open agentic intelligence, 2025. URL https://moonshotai.github.io/Kimi-K2/.
[10]
Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025. URL https://arxiv.org/abs/2509.02046.
[11]
Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization, 2025. URL https://arxiv.org/abs/2503.12645.
[12]
Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22 (2): 341–362, 2012. URL https://epubs.siam.org/doi/10.1137/100802001.
[13]
Jonathan A Kelner, Yin Tat Lee, Lorenzo Orecchia, and Aaron Sidford. An almost-linear-time algorithm for approximate max flow in undirected graphs, and its multicommodity generalizations. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 217–226. SIAM, 2014. URL https://arxiv.org/abs/1304.2338.
[14]
Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In International conference on machine learning, pages 2260–2268. PMLR, 2020. URL https://arxiv.org/abs/2002.03305.
[15]
Rafał Szlendak, Elnur Gasanov, and Peter Richtarik. Understanding progressive training through the framework of randomized coordinate descent. In International Conference on Artificial Intelligence and Statistics, pages 2161–2169. PMLR, 2024. URL https://arxiv.org/abs/2306.03626.
[16]
Stephen J. Wright. Coordinate descent algorithms. Mathematical programming, 151 (1): 3–34, 2015. URL https://arxiv.org/abs/1502.04759.
[17]
Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 (1-2): 1–38, 2014. URL https://arxiv.org/abs/1107.2848.
[18]
Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25 (4): 1997–2023, 2015. URL https://arxiv.org/abs/1312.5799.
[19]
Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31 (5): 829–857, 2016. URL https://arxiv.org/abs/1412.8060.
[20]
Jiaxiang Li and Mingyi Hong. A note on the convergence of Muon and further, 2025. URL https://arxiv.org/abs/2502.02900.
[21]
Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020. URL https://arxiv.org/abs/1905.11881.
[22]
Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In International Conference on Machine Learning, pages 5396–5427. PMLR, 2023. URL https://arxiv.org/abs/2303.02854.
[23]
Tao Sun, Qingsong Wang, Dongsheng Li, and Bao Wang. Momentum ensures convergence of SignSGD under weaker assumptions. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 33077–33099. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/sun23l.html.
[24]
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017. URL https://arxiv.org/abs/1710.10196.
[25]
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. URL https://arxiv.org/abs/1406.2661.
[26]
Hui-Po Wang, Sebastian Stich, Yang He, and Mario Fritz. ProgFed: Effective, communication, and computation efficient federated learning by progressive training. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23034–23054. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/wang22y.html.
[27]
Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 (1): 1–38, 2014. URL https://arxiv.org/abs/1107.2848.
[28]
Zdislav Kovarik. Some iterative methods for improving orthonormality. SIAM Journal on Numerical Analysis, 7 (3): 386–389, 1970. URL https://doi.org/10.1137/0707031.
[29]
Å. Björck and C. Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8 (2): 358–364, 1971. URL https://doi.org/10.1137/0708036.
[30]
Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Robustness to unbounded smoothness of generalized signSGD. Advances in neural information processing systems, 35: 9955–9968, 2022. URL https://arxiv.org/abs/2208.11195.
[31]
Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems, 36: 40238–40271, 2023. URL https://arxiv.org/abs/2306.01264.
[32]
Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018. URL https://arxiv.org/abs/1802.04434.
[33]
Ruichen Jiang, Devyani Maladkar, and Aryan Mokhtari. Convergence analysis of adaptive gradient methods under refined smoothness and noise assumptions. arXiv preprint arXiv:2406.04592, 2024. URL https://arxiv.org/abs/2406.04592.
[34]
Yuxing Liu, Rui Pan, and Tong Zhang. AdaGrad under anisotropic smoothness, 2024. URL https://arxiv.org/abs/2406.15244.
[35]
Shuo Xie, Mohamad Amin Mohamadi, and Zhiyuan Li. Adam exploits \(\ell_{\infty}\)-geometry of loss landscape via coordinate-wise adaptivity. arXiv preprint arXiv:2410.08198, 2024. URL https://arxiv.org/abs/2410.08198.
[36]
Richard V. Southwell. Relaxation Methods in Engineering Science: A Treatise on Approximate Computation. Read Books, 1940.
[37]
Michael J. D. Powell. On search directions for minimization algorithms. Mathematical Programming, 4: 193–201, 1973. URL https://link.springer.com/article/10.1007/BF01584660.
[38]
Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156 (1-2): 433–484, 2016. URL https://arxiv.org/abs/1212.0873.
[39]
Andrei Patrascu and Ion Necoara. Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. Journal of Global Optimization, 61 (1): 19–46, 2015. URL https://doi.org/10.1007/s10898-014-0151-9.
[40]
Cong D. Dang and Guanghui Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM Journal on Optimization, 25 (2): 856–881, 2015. URL https://doi.org/10.1137/130936361.
[41]
Ziang Chen, Yingzhou Li, and Jianfeng Lu. On the global convergence of randomized coordinate gradient descent for nonconvex optimization. SIAM Journal on Optimization, 33 (2): 713–738, 2023. URL https://doi.org/10.1137/21M1460375.
[42]
Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13: 492–498, 1967. URL https://api.semanticscholar.org/CorpusID:119939254.

  1. King Abdullah University of Science and Technology↩︎

  2. 2 bounds the gradient norms directly, while 1 bounds their squares; this difference naturally arises from the distinct smoothness models used in each analysis.↩︎